Sync two ordered lists

We have two autonomous systems that usually cannot communicate with each other. Both systems maintain the same ordered list of items, and only rarely are they able to connect and synchronize the lists.

Elements are marked with a modification timestamp to detect changes. Elements are identified by UUIDs to avoid conflicts when inserting new elements (as opposed to using auto-incrementing integers). During synchronization, newly created UUIDs are detected and the corresponding elements are copied to the other system; deletions are handled similarly.

The above data structure is suitable for an unordered list, but how can we handle the order? If we added an integer "rank", the numbering would have to be shifted when inserting a new element (so a single insertion could force synchronization of all successor elements). As an alternative, we could use fractional ranks (taking the average of the ranks of the predecessor and the successor), but this does not seem reliable: it quickly runs into precision problems when many new elements are inserted.

We also considered implementing this as a doubly linked list, with each element holding the UUIDs of its predecessor and successor. However, this still requires synchronizing 3 elements when 1 new element is added (or the 2 remaining neighbors when an element is deleted).

Preferably, we would like to use a data structure or algorithm in which only the newly inserted element needs to be synchronized. Is there such a data structure?

Edit: we should be able to handle moving an existing item to another position!

+7
7 answers

There is no problem with the interpolated rank approach. Simply define your own numbering system based on variable-length bit vectors representing binary fractions between 0 and 1 with no trailing zeros. The binary point lies to the left of the first digit.

The only inconvenience of this system is that the minimum possible key is 0, given by the empty bit vector, so you would use it only if you are sure the corresponding item will forever be the first in the list. Normally, just give the first element the key 1. This corresponds to 1/2, so random inserts in the range (0..1) will tend to minimize bit usage. To interpolate before and after it:

 01   < newly interpolated = 1/4
 1
 11   < newly interpolated = 3/4

To interpolate again:

 001  < newly interpolated = 1/8
 01
 011  < newly interpolated = 3/8
 1
 101  < newly interpolated = 5/8
 11
 111  < newly interpolated = 7/8

Note that, if you want, you can omit storing the final 1! All keys (except 0, which you normally never use) end in 1, so storing it is redundant.

Comparing binary fractions is just lexicographic comparison: 0 < 1, and the first differing bit in a left-to-right scan decides which key is smaller. If there are no differences, i.e. one vector is a strict prefix of the other, the shorter one is smaller.

Using these rules, it is quite simple to come up with an algorithm that takes two bit vectors and computes a result that is roughly (or in some cases exactly) halfway between them: add the bit strings, shift right by 1, and discard unnecessary trailing bits, i.e. take the midpoint of the two to split the range between them.

In the above example, suppose deletions leave us with:

 01
 111

and we need to interpolate between them. Add 01(0) and 111 to get 1.001, then shift right to get 1001. This works fine as an interpolant. But note that the final 1 makes it unnecessarily longer than either operand. An easy optimization is to drop the final 1 bit together with any trailing zeros, leaving just 1. Sure enough, 1 is roughly halfway between the two, as we hoped.
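A minimal Python sketch of this interpolation (the function name and the representation of keys as strings of '0'/'1' digits are my own, not from the answer; note that plain string comparison already gives the lexicographic order described above):

```python
def midpoint(a: str, b: str) -> str:
    """Return a bit-fraction key strictly between keys a and b (a < b).

    Keys are '0'/'1' strings after an implicit binary point, with no
    trailing zeros; the empty string represents 0.
    """
    n = max(len(a), len(b)) + 1            # one extra bit of precision
    ia = int(a.ljust(n, "0"), 2)           # pad with trailing zeros
    ib = int(b.ljust(n, "0"), 2)
    mid = (ia + ib) // 2                   # both ends are even, so ia < mid < ib
    s = format(mid, f"0{n}b").rstrip("0")  # drop trailing zeros (value unchanged)
    # Optimization from the answer: also drop a final 1 (plus any zeros
    # before it) when the shortened key is still strictly between a and b.
    short = s[:-1].rstrip("0")
    if a < short < b:
        return short
    return s
```

For instance, `midpoint("01", "111")` yields `"1"`, matching the worked example, and `midpoint("01", "1")` yields `"011"` (= 3/8) as in the interpolation table.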

Of course, if you do many inserts at one position (for example, repeated inserts at the head of the list), the bit vectors get long. This is the same phenomenon as inserting at the same point in a binary tree: it grows long and stringy. To fix this, you must "rebalance" during a synchronization by renumbering with the shortest possible bit vectors, e.g. for a list of up to 7 elements you would use the 3-bit sequence shown above.

Addendum

Although I have not tried it, the Postgres bit string column type seems sufficient for the keys I have described. You would have to check that its sort order is correct.

Also, the same reasoning works fine with base-k digits for any k >= 2. The first item gets the key k/2. There is also a simple optimization that prevents the very common cases of appending at the end and prepending at the front from producing keys of length O(n); it keeps them at O(log n) for these cases (although repeated inserts at the same interior position can still create keys of length O(p) after p inserts). I will leave that to you. With k = 256, you can use byte strings of unbounded length. In SQL, I believe you would need varbinary(max). SQL provides the correct lexicographic sort order. Implementing the interpolation operation is easy if you have a BigInteger package similar to Java's. If you prefer human-readable data, you can convert the byte strings to e.g. hexadecimal strings (0-9a-f) and store those. Then UTF8 string sort order will be correct.
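A hedged sketch of the base-16 variant of the same idea (the helper name is hypothetical, not from the answer); ASCII ordering of the digits 0-9a-f makes plain string comparison agree with the numeric order:

```python
def hex_midpoint(a: str, b: str) -> str:
    """Key strictly between two base-16 fraction keys a < b, written as
    lowercase hex digit strings after an implicit radix point, with no
    trailing zeros."""
    n = max(len(a), len(b)) + 1        # one extra hex digit of precision
    ia = int(a.ljust(n, "0"), 16)      # pad with trailing zero digits
    ib = int(b.ljust(n, "0"), 16)
    mid = (ia + ib) // 2               # both ends are multiples of 16, so ia < mid < ib
    return format(mid, f"0{n}x").rstrip("0")
```

With k = 16 the first item would get the key "8" (= 8/16 = 1/2), and interpolating between "8" and "9" gives "88".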

+3

You can add two fields to each element: a "creation timestamp" and an "inserted after" field (containing the identifier of the element after which the new element was inserted). When synchronizing the lists, send all new items; this information is enough to reconstruct the list on the other side.

With the list of newly added elements, do this on the receiving end: sort them by creation timestamp, then go through them one by one and use the "inserted after" field to place each new element in the desired location.

You may run into a problem if element A is added, then B is added after A, then A is deleted. If this can happen, you will also need to synchronize A (essentially by synchronizing the operations that have occurred on the list since the last synchronization, not just the contents of the current list). This is basically a form of log shipping.

+2

You could take a look at "lenses", a concept from bidirectional programming. For example, your problem seems to be solved by the "matching lenses" described in this article.

+1

I think an order statistics tree is a suitable data structure here. In an order statistics tree you maintain subtree sizes along with the other data; the size field lets you find an element by rank easily, which is what you need. All operations, such as find-by-rank, delete, change position, and insert, are O(log n).
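A minimal, unbalanced sketch of such a size-augmented tree (my own naming; a production version would use a balanced variant such as a red-black tree or treap to actually guarantee O(log n)):

```python
class Node:
    __slots__ = ("value", "left", "right", "size")
    def __init__(self, value):
        self.value = value
        self.left = self.right = None
        self.size = 1                      # size of the subtree rooted here

def _size(n):
    return n.size if n else 0

def insert_at(root, rank, value):
    """Insert value so it ends up at position `rank` (0-based) in order."""
    if root is None:
        return Node(value)
    if rank <= _size(root.left):
        root.left = insert_at(root.left, rank, value)
    else:
        root.right = insert_at(root.right, rank - _size(root.left) - 1, value)
    root.size = 1 + _size(root.left) + _size(root.right)
    return root

def get_at(root, rank):
    """Return the value at position `rank` (0-based), using subtree sizes."""
    left = _size(root.left)
    if rank < left:
        return get_at(root.left, rank)
    if rank == left:
        return root.value
    return get_at(root.right, rank - left - 1)
```

Inserting "X" at rank 2 into the sequence A, B, C, D gives A, B, X, C, D, and `get_at` recovers each element by rank without storing explicit indices.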

+1

I think you could try some kind of transactional approach here. For example, you do not physically delete items but only mark them for deletion, and commit the changes only during synchronization. I am not quite sure which data structure you should choose; it depends on which operations you want to be faster (insert, delete, search, or iteration).

Suppose we have the following initial state in both systems:

 |1|   |2|
 ---   ---
 |A|   |A|
 |B|   |B|
 |C|   |C|
 |D|   |D|

After that, the first system marks element B for deletion, and the second system inserts element BC between B and C:

 |1         |   |2           |
 -----------    -------------
 |A         |   |A           |
 |B[deleted]|   |B           |
 |C         |   |BC[inserted]|
 |D         |   |C           |
                |D           |

Both systems continue processing with their local changes in effect: System 1 ignores element B, and System 2 treats element BC as a regular element.

When syncing:

As I understand it, each system receives a snapshot of the list from the other system, and both systems freeze processing until synchronization is complete.

Thus, each system walks the received snapshot and the local list in parallel and writes the changes into the local list (resolving possible conflicts according to the modification timestamps). Once the "transaction" is complete, all local changes are finally applied and the bookkeeping about them is erased. For example, for System 1:

 |1 pre-sync|                 |2-SNAPSHOT  |   |1 result|
 -----------                  ------------     --------
 |A         |   <the same>    |A           |   |A      |
 |B[deleted]|   <delete B>    |B           |
                <insert BC>   |BC[inserted]|   |BC     |
 |C         |   <same>        |C           |   |C      |
 |D         |   <same>        |D           |   |D      |

Systems wake up and continue to process.

Elements are sorted by insertion order; moving an element can be implemented as a simultaneous deletion and insertion. I also think it should be possible to transfer not the entire snapshot of the list but only the elements that have actually changed.
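A hedged sketch of the sync step from System 1's point of view (all names are my own; it replays the local pending marks on top of the received snapshot, matching the table above):

```python
def merge(snapshot: list[str], deleted: set[str],
          inserted: list[tuple[str, str]]) -> list[str]:
    """Apply local pending changes on top of the remote snapshot.

    snapshot - ordered element ids received from the other system
    deleted  - ids locally marked [deleted]
    inserted - locally inserted (id, predecessor_id) pairs
    """
    # Drop everything this side marked for deletion.
    result = [uid for uid in snapshot if uid not in deleted]
    # Re-insert local additions after their recorded predecessor.
    for uid, after in inserted:
        if uid in result:                 # already present in the snapshot
            continue
        pos = result.index(after) + 1 if after in result else 0
        result.insert(pos, uid)
    return result
```

For the example above, merging System 2's snapshot [A, B, BC, C, D] with System 1's pending deletion of B yields [A, BC, C, D], the "1 result" column.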

+1

I think that, in general, Operational Transformation may be relevant to the problem you describe. For example, consider the problem of real-time collaborative text editing.

We essentially have an ordered list of elements (words) that need to be synchronized and that can be added, changed, or deleted at arbitrary positions in the list. The only significant difference I see is the frequency with which changes are made to the list (you say this happens rarely).

Operational transformation is a well-studied field. I found this blog article with pointers and an introduction. Also, for all the challenges Google Wave has encountered, it did make significant progress on operational transformation; check it out. There is quite a lot of literature on this subject. Take a look at https://stackoverflow.com/a/166269/ and at Differential Synchronization.

Another parallel that struck me is a data structure used in text editors: Ropes. If you have an operation log, say "index 5 deleted", "index 6 changed to ABC", "index 8 inserted", you could transfer the change log from system A to system B and then replay the operations on the other side.

Another option, for the "pragmatic engineer", is simply to rebuild the entire list in system B whenever system A changes. Depending on the actual frequency and size of the changes, this may not be as bad as it sounds.

+1

I previously solved a similar problem by giving each item a PrecedingItemID (which can be null if the item is the top/root of the ordered list), plus a local cache that stores all items in sorted order. The cache is purely for efficiency, so you do not need to recursively query or rebuild the list from PrecedingItemID every time something changes on the local client. When the time comes to synchronize, I do a slightly more expensive pass to find cases where two elements claim the same PrecedingItemID. In those cases I simply order them by creation time (or however you want to agree on which one wins), put the second (and any others) behind the first, and continue sorting the list. Then I save the new ordering in the local order cache and keep using it until the next synchronization (making sure each PrecedingItemID is updated as items move).

I have not fully tested this method yet, so I am not 100% sure there is no problematic conflict scenario, but it at least conceptually fits my needs, which sound similar to the OP's.

+1
