Wednesday, April 10, 2013

Synchronisation of nodes with cloud services

I have read a couple of articles recently about the complexity of synchronisation and reconciliation across cloud services (Apple's, for example) and the horror that ensues.

The problems are:

Multi-point synch - The so-called triangle synch or three node synch.
Inconsistent connectivity - Partial synch and partial updates.
Clashes - Multiple nodes over-writing or under-writing each other.
Rollback semantics - How does "undo" function in this kind of environment?

This problem exists in all synchronisation exercises, be it database sharding, file-based backups, database synch across cloud services, multi-point sharing and synch, etc.

I have been thinking about various strategies for either getting this to work or mitigating some of the issues.

Topology Issues


Peer to Slave - This is where there is essentially one master "node" and a backup slave device. In this case the master always wins. This is basically a backup service: the master is the author of everything, while the backup device has reader rights only.

Peer to Peer - This is the classic "synchronisation" case, where two peer nodes are trying to stay in synch. Both nodes can be authors, and both are readers of each other's product.

Peer to Slave to Peer - This is the dreaded three-way. Each peer has author rights, while all three have reader rights. So for any change, the system needs to propagate it to the slave and then to the other peer. Easy with assured connectivity; hard with asynchronous and intermittent connectivity. Almost a guarantee of a reconciliation event at some point.

In most cases a higher-order topology can be reduced to one of these models; however, with asynchronous connectivity it becomes exponentially complex.
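
As a rough sketch, the author/reader rights above could be modelled like this (the names and structure are my own illustration, not any particular system):

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        is_author: bool   # node may create or modify data
        is_reader: bool   # node receives synched copies

    # Peer to Slave: one author, the backup only reads
    backup_pair = [Node("master", True, True), Node("backup", False, True)]

    # Peer to Slave to Peer: two authors, all three read, so every
    # change must hop peer -> slave -> other peer before it settles
    triangle = [
        Node("peer_a", True, True),
        Node("slave", False, True),
        Node("peer_b", True, True),
    ]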

Data Issues


File based synch - There are lots of these systems and it's pretty much a solved problem. In the higher-complexity topologies there is simply an increased probability of a clash. It is easy to have a deterministic resolve rule for clashes (such as adding a digit to the end of the file name) to prevent data loss, but merging the two versions of the same file is still logically impossible without more information or a preconceived rule set for reconciliation and merging. This can only be done by the actual application and cannot be done at the transport layer.
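
A minimal sketch of a deterministic resolve rule of that kind, assuming the policy is simply "keep both copies under distinct names":

    import os

    def resolve_clash(path: str) -> str:
        """Find a non-clashing name by appending a counter: file.txt -> file (1).txt."""
        base, ext = os.path.splitext(path)
        counter = 1
        candidate = f"{base} ({counter}){ext}"
        while os.path.exists(candidate):
            counter += 1
            candidate = f"{base} ({counter}){ext}"
        return candidate

No data is lost, but the user still ends up with two files and no automatic way to merge them.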

Fragment based synch - Such as databases, large binaries, complex data sets, real-time collaborative virtual environments, etc. These get physically ugly and logically ugly really quickly, any time the node count goes over two.

So is there a solution? 

Possible strategies

Authoritative Model
In this model, one node is made the "authoritative" node and logically "creates" the data, which is then synched to the second (and other) nodes. This gives a kind of master-slave model. When synching back to the master, there needs to be some kind of idiot check so that when the nodes are out of synch, the master's copy always wins.

This works for fairly simple use cases with low complexity. However, when you have multiple authors for fine-grained parts of a system... trouble.
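
A sketch of the idiot check, assuming the synched data is a simple record (the shape and names are illustrative only):

    def reconcile(master_copy: dict, slave_copy: dict) -> dict:
        """Authoritative model: on any divergence the master's copy wins outright."""
        if master_copy != slave_copy:
            # No merging, no questions; the slave's changes are simply discarded.
            return dict(master_copy)
        return slave_copy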

Logical Merge Model
Take the time to create a robust set of merge rules, so that any time there is a conflict there is a prescribed solution. This may be useful for a set of shared settings, where the most conservative option overwrites less conservative options (I am thinking of security settings), but in the case of a database whose data has no obvious natural precedence rules, we need some other strategy.
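
For the security-settings case, a merge rule might look like this sketch (the "lower rank = more conservative" ordering is an assumption for illustration):

    # Precedence rule: the most conservative value wins any conflict.
    CONSERVATIVE_RANK = {"deny": 0, "prompt": 1, "allow": 2}

    def merge_settings(ours: dict, theirs: dict) -> dict:
        merged = {}
        for key in ours.keys() | theirs.keys():
            values = [v for v in (ours.get(key), theirs.get(key)) if v is not None]
            # min() by rank picks the most conservative option on a clash
            merged[key] = min(values, key=lambda v: CONSERVATIVE_RANK[v])
        return merged

    print(merge_settings({"camera": "allow"}, {"camera": "deny", "mic": "prompt"}))
    # -> {'camera': 'deny', 'mic': 'prompt'}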

Time based Transaction Log
With a database it would be possible to create a set of activity logs, merge the logs, and generate the current data set from the result. This works for applications that are read-only or write-only, but falls to bits when you have read, write, modify and delete. In an asynchronous environment the different node copies can quickly move so far out of synch that they are completely different.
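
A sketch of the log merge for the write-only case (the tuple layout is an assumption for illustration):

    import heapq

    def replay(*logs):
        """Merge time-ordered logs of (timestamp, key, value) writes and replay them."""
        state = {}
        # heapq.merge assumes each individual log is already sorted by timestamp
        for ts, key, value in heapq.merge(*logs):
            state[key] = value   # last write (by timestamp) wins
        return state

    node_a = [(1, "title", "draft"), (4, "title", "final")]
    node_b = [(2, "body", "hello"), (3, "title", "revised")]
    print(replay(node_a, node_b))  # -> {'title': 'final', 'body': 'hello'}

Once modify and delete enter the picture, a bare last-write-wins replay like this starts silently destroying intent.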

Point by Point Merger
How about avoiding the problem and making the user choose what to over-write? Show them an option for each conflict, tell them the source of each of the values, and let them choose. (Having seen users who do not want to reconcile automatic backups in Word, I can imagine how this would easily fail.)
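
A console-flavoured sketch of the per-conflict prompt (purely illustrative; a real UI would differ):

    def merge_interactively(local: dict, remote: dict) -> dict:
        """Ask the user to pick a side for every conflicting key."""
        merged = dict(local)
        for key in local.keys() & remote.keys():
            if local[key] != remote[key]:
                print(f"Conflict on '{key}':")
                print(f"  1) this device : {local[key]}")
                print(f"  2) cloud copy  : {remote[key]}")
                choice = input("Keep which? [1/2] ")
                merged[key] = remote[key] if choice == "2" else local[key]
        # carry over anything that only exists on the remote side
        merged.update({k: v for k, v in remote.items() if k not in local})
        return merged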

Merge at the time of synch
Theoretically, if the software is in communication with another node, then now is the perfect time to reconcile any conflicts. Give the user the choice to over-write or keep the data (assuming it's not too complex); this determines which way the resolution goes. The problem is that it's a one-shot deal that will always result in some data being irrevocably lost.
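
The one-shot nature shows in how little a resolver of this kind keeps (a sketch; names are my own):

    def resolve_at_sync(local: dict, remote: dict, prefer_remote: bool) -> dict:
        """One-shot resolution at synch time: the losing values are simply discarded."""
        winner, loser = (remote, local) if prefer_remote else (local, remote)
        merged = dict(loser)
        merged.update(winner)  # winner overwrites on every clash; no history is kept
        return merged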

Fail loudly
Simply inform the user that there is a conflict, throw an exception, and make it sound like it's their fault.
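
In code, the entire strategy is one raise statement (sketch):

    class SyncConflictError(Exception):
        """You broke it. Somehow."""

    def synch(local, remote):
        if local != remote:
            raise SyncConflictError("Conflict detected. Please review YOUR data.")
        return local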

Keep two copies
Keep the previous copy on the node and create a "difference" file with only the conflicts from the synch. This prevents data loss and gives the user or software time to gracefully merge the values. There are still logical issues with how the data is merged, and large sets of conflicting transactions may still create some sort of failure state that cannot be automatically fixed.
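
A sketch of building the "difference" file, assuming the data is a flat key-value record:

    import json

    def write_difference_file(local: dict, remote: dict, path: str) -> None:
        """Leave the local copy untouched; record only the conflicting values for later merging."""
        conflicts = {
            key: {"local": local[key], "remote": remote[key]}
            for key in local.keys() & remote.keys()
            if local[key] != remote[key]
        }
        with open(path, "w") as fh:
            json.dump(conflicts, fh, indent=2)

Nothing is destroyed, so the merge can happen later, at leisure, with the user or the application supplying the missing judgement.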

Thinking, thinking...
