Combining data with an unknown ordering function

I have a number of string arrays. The strings in each array are ordered identically, according to the same criteria. However, some strings may be missing from some arrays, and there may be no array that has the complete set of strings. Moreover, the criteria used to compare the strings are not available to me: outside the context of an array, I cannot say which string should precede another.

I need a way to build the complete set of strings, properly ordered, or to fail if the arrays do not contain enough information to do so.

Is anyone familiar with this problem? What is the correct algorithm?

Examples:

ABD ACD 

Cannot be ordered correctly: the relative order of B and C cannot be resolved.

 ABD ABC ACD 

This has enough information to order ABCD correctly.

+4
10 answers

This sounds like a special case of the topological sorting problem.
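A minimal sketch of that idea in Python (the variable names and the choice of the standard-library `graphlib` module are mine; note that a plain topological sort returns *a* valid order but does not, by itself, detect the ambiguous case):

```python
from graphlib import TopologicalSorter

# Build precedence edges from each adjacent pair in each input array.
arrays = [["A", "B", "D"], ["A", "B", "C"], ["A", "C", "D"]]

graph = {}  # node -> set of predecessors, as TopologicalSorter expects
for arr in arrays:
    for earlier, later in zip(arr, arr[1:]):
        graph.setdefault(later, set()).add(earlier)
        graph.setdefault(earlier, set())

# Raises CycleError if the partial orders are inconsistent.
order = list(TopologicalSorter(graph).static_order())
# order is one valid total order consistent with all inputs
```

Detecting the "not enough information" case needs extra work on top of this, as the later answers discuss.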

+4

One possible approach I can think of, although probably not the most efficient:

I am going to use your example to explain:

 ABD ABC ACD 

create an array of unique characters so you get (for example):

 ABDC 

You should also have an enumeration to describe possible relationships

 enum Type { Unknown = 0, Greater = 1, Equal = 2, Less = 3, } 

Now create a square matrix whose dimensions match the array above, with every entry defaulting to Type.Unknown:

 0 0 0 0
 0 0 0 0
 0 0 0 0
 0 0 0 0

Your array serves as the index into the matrix when you determine the order. To see what I mean, look here:

   A B D C
 A 0 0 0 0
 B 0 0 0 0
 D 0 0 0 0
 C 0 0 0 0

Go through and mark the diagonal as Type.Equal:

 2 0 0 0
 0 2 0 0
 0 0 2 0
 0 0 0 2

Now you need to fill in the matrix: pass through each input array, taking each character and the one after it. For these two characters, find the index of each (using your unique-character array) and update the matrix:

 for (int i = 0; i < len(arr) - 1; i++) {
     char c1 = arr[i], c2 = arr[i+1];
     int i1, i2; // get indices of c1 and c2 in the unique-character array
     matrix[i1][i2] = Type.Less;
     matrix[i2][i1] = Type.Greater;
 }

You assign two cells in the grid each time because when one character is less than another, the other is necessarily greater than the first.

Now you just insert the characters into a result list one at a time (a linked list will be easiest; you will see why in a second).

Each time you insert a character, start at the beginning of your result list and iterate, looking up each pair's indices in the matrix and checking for Type.Greater or Type.Less (depending on whether you compare the current character against the list element or the list element against the current character), and insert it when you encounter a value other than the expected one.

If you hit Type.Unknown in the matrix during an insertion, you know that you do not have enough information to sort these arrays; if you hit Type.Equal, you can skip inserting that character (assuming you do not want duplicates in the result).

Then output your result list.
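A rough, runnable sketch of the steps above (Python rather than the C#-style snippets; all names are mine). One addition not described in the text: a transitive-closure pass over the matrix, without which indirectly related pairs such as A and D would still read as Unknown:

```python
UNKNOWN, GREATER, EQUAL, LESS = 0, 1, 2, 3

def merge(arrays):
    # Unique characters in first-seen order (the "ABDC" step).
    chars = []
    for arr in arrays:
        for c in arr:
            if c not in chars:
                chars.append(c)
    idx = {c: i for i, c in enumerate(chars)}
    n = len(chars)

    # Square matrix, defaulting to Unknown, with an Equal diagonal.
    m = [[UNKNOWN] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = EQUAL

    # Record a Less/Greater pair for each adjacent pair of characters.
    for arr in arrays:
        for a, b in zip(arr, arr[1:]):
            m[idx[a]][idx[b]] = LESS
            m[idx[b]][idx[a]] = GREATER

    # Transitive closure (my addition): a < b and b < c implies a < c.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if m[i][k] == LESS and m[k][j] == LESS:
                    m[i][j] = LESS
                    m[j][i] = GREATER

    # Insert characters one at a time into the result list.
    result = []
    for c in chars:
        pos = 0
        while pos < len(result):
            rel = m[idx[c]][idx[result[pos]]]
            if rel == UNKNOWN:
                raise ValueError("not enough information to order "
                                 f"{c!r} and {result[pos]!r}")
            if rel == LESS:
                break
            pos += 1
        result.insert(pos, c)
    return "".join(result)
```

With the question's examples, `merge(["ABD", "ABC", "ACD"])` yields `"ABCD"`, while `merge(["ABD", "ACD"])` raises because B and C cannot be ordered.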

+5

1) Topologically sort the elements. If the partial orderings are inconsistent, there will be a cycle and the sort will fail. Otherwise, you have a sequence s(0) ... s(n-1).

2) For each pair s(i), s(i+1), search the partial sequences for the same pair adjacent to each other. If you find it in one of them, continue to the next i. If you do not find it, fail, because the partial sequences do not order s(i) and s(i+1).

Why not? Because if they do order them but the pair never appears adjacent, there must be something "between" them in some partial sequence. But that in-between element is not between them in the complete sequence, so it must appear in the complete sequence either before s(i) or after s(i+1). That contradicts the consistency of the complete sequence with the partial sequences, which was already established in step 1.

Step 2 is O(n*m), where n is the number of elements to be ordered and m is the total length of all partial sequences. You might be able to do better, but this is pretty simple, because you can grab a routine off the shelf for step 1, and step 2 is an obvious bunch of nested loops. Note that you can stop searching a given partial sequence once you find s(i) (or s(i+1), whichever comes first), because neither can appear in it again.
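Step 2 might look like this in Python (assuming step 1 has already produced the candidate sequence `s` via an off-the-shelf topological sort; the function name is mine):

```python
def confirms_unique(s, partials):
    """Check that every adjacent pair of s appears adjacent in some
    partial sequence; if not, the partials do not fully order s."""
    for a, b in zip(s, s[1:]):
        found = any(
            p[i] == a and p[i + 1] == b
            for p in partials
            for i in range(len(p) - 1)
        )
        if not found:
            return False  # s(i) and s(i+1) are not ordered by the partials
    return True
```

For the question's examples, `confirms_unique("ABCD", ["ABD", "ABC", "ACD"])` is true, while `confirms_unique("ABCD", ["ABD", "ACD"])` is false because B, C never appear adjacent.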

+2

How many sets do you have? Something like this might work if they are not too sparse:

  1. Iterate over the first element of all arrays, counting how common each string is.
  2. Select whichever element is most common.
  3. Check every element of all arrays for the selected string.
  4. If it exists anywhere other than as the first element of an array, select the first element of that array instead and go back to step 3.
  5. Remove the selected element from the front of all arrays/trees and append it to the output list.
  6. Go to step 1 until all items have been removed.

If you dump the arrays into b-trees , finding elements can be pretty quick. You could also speed up step 3 by playing with the order of the arrays. I am still thinking about it.

If the arrays are very sparse, the only thing I can think of is to use some kind of supervised machine learning to try to determine the right ordering.

0

How does this differ from a classic diff / merge algorithm ? Both arrays are "sorted", whatever the mechanism by which they were sorted. They were sorted by comparison, so you should be able to use their similarities to merge in the differences. Beyond that, it is mostly arbitrary.

If you want to merge [A] with [B, C, D], you don't know where A goes (front? end? between C and D?). In such cases, you just have to adopt a convention (always at the front, always at the end, something like that).

If you want to merge [Q, X, Y, A, B] with [Q, X, W, B, D], you will need to decide whether "W" goes before or after "Y, A", i.e. whether [Q, X, Y, A, W, B, D] or [Q, X, W, Y, A, B, D] is correct. Honestly, you just have to make a call and roll with it. There simply is not enough information.

0

The main problem is that, given the data, you need to define an ordering function. After that, it is just sorting a list.

Create a list of the first elements of all your arrays. Remove duplicates. So:

 AB AC BC CC CD 

becomes (A, B, C). Then we want a set of pairs representing each value that we know is less than another: ((A, B), (A, C), (B, C)) for this example.

Now, for each value that appears at the beginning of an array, take all arrays that start with that value as a group and remove their first items. This gives us:

 B C 

 C 

and

 C D 

Recurse on each of these lists of arrays.

Once we finish recursing over all of our subsets, we merge the pairs produced into the set we originally built. The result is ((A, B), (A, C), (B, C), (C, D)). Depending on the properties of your values, you may be able to expand this transitively into ((A, B), (A, C), (A, D), (B, C), (B, D), (C, D)).

Now your less-than function just checks whether the pair of values it compares appears in this set. If so, true; otherwise, false.
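A small Python sketch of the recursion above (names are mine; like the example, it assumes the list of arrays is itself given in sorted order, so the deduplicated first elements come out in the right order):

```python
def less_than_pairs(arrays):
    """Collect (a, b) pairs meaning a < b, by taking the ordered unique
    first elements, then recursing on the tails of each group."""
    arrays = [a for a in arrays if a]
    if not arrays:
        return set()

    # Deduplicated first elements, in order of appearance.
    firsts = []
    for a in arrays:
        if a[0] not in firsts:
            firsts.append(a[0])

    # Every first element is less than every later first element.
    pairs = {(x, y) for i, x in enumerate(firsts) for y in firsts[i + 1:]}

    # Group arrays by first element and recurse on their tails.
    for f in firsts:
        tails = [a[1:] for a in arrays if a[0] == f]
        pairs |= less_than_pairs(tails)
    return pairs
```

On the answer's example this reproduces its result: `less_than_pairs(["AB", "AC", "BC", "CC", "CD"])` gives `{("A","B"), ("A","C"), ("B","C"), ("C","D")}`. The optional transitive expansion is left out of the sketch.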

0

Doug McClean wins. The phrase that jumped to my mind was "stratification", and it led to dead ends in web searches. (For example, in logic programming it is useful to talk about stratified programs: programs you can visualize as a stack of layers, where each layer contains one or more predicates, and each predicate is either extensional (the bottom layer) or defined only in terms of predicates in the layers below it. Here, ideally, you want a form in which each layer contains at most one string.)

Your variation on the toposort algorithm has you build a DAG (you imply that acyclicity is guaranteed), possibly tracking nodes with zero incoming edges along the way (a simple optimization). If, when this is done, you have more than one "root" node, you are done: you know you do not have a unique first node. Otherwise, you have exactly one; remove it from the DAG, checking whether any of its child nodes becomes a "root"; and keep going until you either get more than one root (failure) or run out of nodes (success).

The algorithm I sketched above should be linear in the number of strings (dominated by the iterations). It is probably closer to quadratic in the number of strings if each node has many children, which would imply that each string appears many, many times in your list of lists, each occurrence followed by some other string.
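The procedure above could be sketched as a Kahn-style sort that fails unless exactly one node is a root at every step (function and variable names are mine):

```python
from collections import defaultdict

def unique_topo_order(arrays):
    """Return the unique total order implied by the input arrays,
    or raise if no single order is determined."""
    succ = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for arr in arrays:
        nodes.update(arr)
        for a, b in zip(arr, arr[1:]):
            if b not in succ[a]:
                succ[a].add(b)
                indegree[b] += 1

    order = []
    while nodes:
        roots = [n for n in nodes if indegree[n] == 0]
        if len(roots) != 1:
            # More than one root: ambiguous. Zero roots: a cycle.
            raise ValueError("no unique first node")
        root = roots[0]
        nodes.remove(root)
        order.append(root)
        for child in succ[root]:
            indegree[child] -= 1
    return "".join(order)
```

`unique_topo_order(["ABD", "ABC", "ACD"])` gives `"ABCD"`; with only `["ABD", "ACD"]`, removing A leaves both B and C as roots, so it fails as desired.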

0

I am tempted to say that, in order to have enough information to complete the merge, every pair x, y that ends up adjacent in the final merged array must appear adjacent somewhere in the input arrays. In other words, transitivity may not enter the picture at all. Can someone construct a counterexample?

You could do it right here, in an answer, if you like.

0

What about a simple recursive procedure?

Using the second example:

 ABD ABC ACD 

Create a hash table of all unique orders:

 table = { A -> B, C
           B -> C, D
           C -> D }

Count all unique values

 countUnique = 4 

Using a stack, compute all possible paths. You know you have a solution when the stack length matches the number of unique strings. If you find more than one solution, the ordering is ambiguous. Please excuse my dubious pseudocode.

 main() {
     foreach (s in table.keys) {
         stack.push(s)
         searchForPath()
         stack.pop()
     }
 }

 searchForPath() {
     if (table.hasKey(stack.peek)) {
         // we're not at the end yet
         foreach (s in table[stack.peek]) {
             // avoid infinite recursion
             if (stack.contains(s)) continue
             stack.push(s)
             searchForPath()
             stack.pop()
         }
     } else {
         if (stack.size == countUnique) {
             // solution found
             captureSolution()
         }
     }
 }
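The pseudocode above could be made concrete in Python roughly as follows (names are mine; it returns the list of full-length paths, so exactly one result means the order is fully determined):

```python
def find_orders(arrays):
    """Enumerate every full-length path through the successor table."""
    table = {}
    unique = set()
    for arr in arrays:
        unique.update(arr)
        for a, b in zip(arr, arr[1:]):
            table.setdefault(a, set()).add(b)

    solutions = []
    stack = []

    def search():
        top = stack[-1]
        if top in table:
            # not at the end yet: try each successor
            for s in table[top]:
                if s in stack:  # avoid infinite recursion
                    continue
                stack.append(s)
                search()
                stack.pop()
        elif len(stack) == len(unique):
            solutions.append("".join(stack))  # solution found

    for s in unique:
        stack.append(s)
        search()
        stack.pop()
    return solutions
```

For `["ABD", "ABC", "ACD"]` this finds the single solution `"ABCD"`; for `["ABD", "ACD"]` it finds none, since no path visits all four characters.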
0

Partial Solution:

You need to determine an ordering. You can do this by having the data "vote" on how many times one character precedes another. You need a square matrix with size equal to the number of characters. Initialize it to all zeros, then scan all your input strings, adding 1 to M(a, b) for each consecutive pair (a, b). Then you need to clean this data so you get an ordering that is transitive (if A comes before B and B comes before C, A should also come before C). To do this, iterate over all triplets of distinct symbols and "reconcile" the corresponding triplets of inequalities so that they respect transitivity.

If your data is not too noisy, a single pass should suffice. If it is too noisy, you may need several passes before convergence, or a greedy approach: instead of visiting the triplets in arbitrary order, visit the inequality triplets in decreasing order of votes.
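The voting step might be sketched like this in Python (names are mine; the transitivity-reconciliation pass is left out):

```python
from collections import Counter

def vote_matrix(strings):
    """Count, for each ordered pair of characters, how often the first
    immediately precedes the second across all input strings."""
    votes = Counter()
    for s in strings:
        for a, b in zip(s, s[1:]):
            votes[(a, b)] += 1
    return votes

votes = vote_matrix(["ABD", "ABC", "ACD"])
# ('A','B') gets 2 votes; ('B','D'), ('B','C'), ('A','C'), ('C','D') get 1 each
```

A Counter keyed by character pairs stands in for the square matrix; missing pairs implicitly have zero votes.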

-1
