Combining some sorted lists with an unknown order sequence

I have several lists with a variable number of elements. Each list is sorted, but the sort order is unknown. I would like to combine the lists into one big list that contains all the elements in the same order, without duplicates.

Input Example:

  • XS, M, L, XL
  • S, M, XXL
  • XXS, XS, S, L

Expected Result:

  • XXS, XS, S, M, L, XL, XXL

The expected result is obtained by aligning the input sequences so that the combined result contains the elements of each input sequence in the correct order, for example:

        XS   M L XL
           S M      XXL
    XXS XS S   L
    -------------------
    XXS XS S M L XL XXL

The function should report any elements whose positions are ambiguous. Here that is XXL (it could come after M, L, or XL), and I need to specify its position manually, after XL (because in this case I know the sort order and can help). I thought about defining pairs from every two adjacent elements, with each pair in the same order as in the original list. From these pairs, a complete list could be built.
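The pair idea above can be sketched in a few lines. This is a minimal Python sketch, not code from the question; the helper name `adjacent_pairs` is hypothetical and chosen only for illustration.

```python
def adjacent_pairs(sequences):
    """Return the set of (before, after) pairs of neighbouring elements."""
    pairs = set()
    for seq in sequences:
        # zip(seq, seq[1:]) yields each element together with its successor
        for before, after in zip(seq, seq[1:]):
            pairs.add((before, after))
    return pairs


sequences = [
    ["XS", "M", "L", "XL"],
    ["S", "M", "XXL"],
    ["XXS", "XS", "S", "L"],
]
pairs = adjacent_pairs(sequences)
print(sorted(pairs))
```

For the example input this yields eight pairs. None of them relates XXL to L or XL, which is exactly the ambiguity described above.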

+7
4 answers

This can be solved using the Topological Sort algorithm.

If you consider each of your input sequences as a path through a directed graph, topological sorting will arrange your set of nodes from left to right so that each directed edge points to the right.

[Figure: a directed graph after topological sorting]

The Wikipedia page on topological sorting includes this algorithm, first described by Arthur Kahn in 1962:

    L ← Empty list that will contain the sorted elements
    S ← Set of all nodes with no incoming edges
    while S is non-empty do
        remove a node n from S
        insert n into L
        for each node m with an edge e from n to m do
            remove edge e from the graph
            if m has no other incoming edges then
                insert m into S
    if graph has edges then
        return error (graph has at least one cycle)
    else
        return L (a topologically sorted order)

This algorithm, as written, does not report an error when the input sequences are ambiguous, but that is easy to add by inserting a check at the beginning of the loop, for example:

    ...
    while S is non-empty do
        if S contains more than 1 item
            return error (inputs are ambiguous)
        remove a node n from S
        ...

I don’t know what language you work in, but I put together this PHP implementation as a proof of concept:

    function mergeSequences($sequences, $detectAmbiguity = false) {

        // build a list of nodes, with each node recording a list of all incoming edges
        $nodes = array();
        foreach ($sequences as $seq) {
            foreach ($seq as $i => $item) {
                if (!isset($nodes[$item])) $nodes[$item] = array();
                if ($i !== 0) {
                    $nodes[$item][] = $seq[$i-1];
                }
            }
        }

        // build a list of all nodes with no incoming edges
        $avail = array();
        foreach ($nodes as $item => $edges) {
            if (count($edges) == 0) {
                $avail[] = $item;
                unset($nodes[$item]);
            }
        }

        $sorted = array();
        $curr = '(start)';
        while (count($avail) > 0) {

            // optional: check for ambiguous sequence
            if ($detectAmbiguity && count($avail) > 1) {
                throw new Exception("Ambiguous sequence: {$curr} can be followed by " . join(' or ', $avail));
            }

            // get the next item and add it to the sorted list
            $curr = array_pop($avail);
            $sorted[] = $curr;

            // remove all edges from the currently selected items to all others
            foreach ($nodes as $item => $edges) {
                $nodes[$item] = array_diff($edges, array($curr));
                if (count($nodes[$item]) == 0) {
                    $avail[] = $item;
                    unset($nodes[$item]);
                }
            }
        }

        if (count($nodes) > 0) {
            throw new Exception('Sequences contain conflicting information. Cannot continue after: ' . join(', ', $sorted));
        }

        return $sorted;
    }

You can call the function as follows:

    $input = array(
        array('XS', 'M', 'L', 'XL'),
        array('S', 'M', 'XXL'),
        array('XXS', 'XS', 'S', 'L'),
    );

    echo(join(', ', mergeSequences($input)));
    echo(join(', ', mergeSequences($input, true)));

To get the following output:

    XXS, XS, S, M, XXL, L, XL
    Uncaught exception 'Exception' with message 'Ambiguous sequence: M can be followed by L or XXL'
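
Since the question asks to be notified about ambiguous elements rather than to abort at the first one, a variant of Kahn's algorithm can record every step at which more than one element was available. The following is a minimal Python sketch of that variant, not a port of the PHP above; the function name `merge_sequences` and its return shape are assumptions made for illustration.

```python
from collections import defaultdict


def merge_sequences(sequences):
    """Kahn's algorithm; returns (order, ambiguities).

    ambiguities holds one list of candidates for every step at which
    more than one element could have come next.
    """
    successors = defaultdict(set)  # node -> nodes that must come later
    indegree = {}                  # node -> number of unplaced predecessors
    for seq in sequences:
        for item in seq:
            indegree.setdefault(item, 0)
        for before, after in zip(seq, seq[1:]):
            if after not in successors[before]:  # count each edge once
                successors[before].add(after)
                indegree[after] += 1

    available = sorted(n for n, d in indegree.items() if d == 0)
    order, ambiguities = [], []
    while available:
        if len(available) > 1:
            ambiguities.append(list(available))
        node = available.pop(0)
        order.append(node)
        for nxt in sorted(successors[node]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                available.append(nxt)
    if len(order) != len(indegree):
        raise ValueError("sequences contain conflicting information (cycle)")
    return order, ambiguities


order, ambiguities = merge_sequences([
    ["XS", "M", "L", "XL"],
    ["S", "M", "XXL"],
    ["XXS", "XS", "S", "L"],
])
print(order)        # ['XXS', 'XS', 'S', 'M', 'L', 'XXL', 'XL']
print(ambiguities)  # [['L', 'XXL'], ['XXL', 'XL']]
```

The order differs from the PHP output because ties are broken differently; both are valid topological orders. XXL appears in every recorded candidate list, which points to it as the element that needs manual placement.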
+14

You are trying to combine partially ordered sets, or posets. The ambiguous parts of the merge are called antichains. So you need an algorithm that merges posets and tells you what the antichains are.

Here is a paper describing an algorithm for merging posets and detecting antichains, as well as a link to the first author’s home page in case you want to contact him to find out whether any source code is available.
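
To make the poset view concrete, the pairs of elements that the inputs leave unordered can be found with a plain reachability check. This is a minimal Python sketch under that assumption; it is not the algorithm from the paper mentioned above, and the name `incomparable_pairs` is hypothetical.

```python
from itertools import combinations


def incomparable_pairs(sequences):
    """Return pairs of elements whose relative order the inputs do not fix."""
    successors = {}
    for seq in sequences:
        for item in seq:
            successors.setdefault(item, set())
        for before, after in zip(seq, seq[1:]):
            successors[before].add(after)

    def reachable(start):
        # Depth-first search collecting everything that must come after start.
        seen, stack = set(), [start]
        while stack:
            for nxt in successors[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    later = {node: reachable(node) for node in successors}
    return [
        (a, b)
        for a, b in combinations(sorted(successors), 2)
        if b not in later[a] and a not in later[b]
    ]


unordered = incomparable_pairs([
    ["XS", "M", "L", "XL"],
    ["S", "M", "XXL"],
    ["XXS", "XS", "S", "L"],
])
print(unordered)  # [('L', 'XXL'), ('XL', 'XXL')]
```

For the question's input, the only unordered pairs involve XXL, so {L, XXL} and {XL, XXL} are the antichains that need a manual decision.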

+6

Here is what I would do:

  1. Preprocess the lists: finding out that XXS is less than XS, which is less than S, which is less than ... XXL is a constraint satisfaction problem (http://en.wikipedia.org/wiki/Constraint_satisfaction_problem). The problem is to find a correct ordering of all the elements, given the constraints defined by the source lists.
  2. Create a bidirectional mapping from the set {XXS, ..., XXL} to the set {1, ..., 7} after performing step 1.
  3. For each list, create a new list using the mapping defined in 2.
  4. Use a modified merge sort (http://en.wikipedia.org/wiki/Merge_sort) to combine two lists. Modify the merge step so that it detects when the two elements being compared are identical and keeps only one of them.
  5. Repeat step 4 for each pair of lists until only one list remains.
  6. Using the mapping defined in 2, convert the merged list back to its text form.
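
Steps 2 to 6 can be sketched as follows, assuming step 1 has already produced a total order. This is a minimal Python sketch; the hard-coded `ORDER` list stands in for the output of step 1, the names `to_rank`, `to_size` and `merge_by_rank` are hypothetical, and `heapq.merge` performs a k-way merge of all lists at once instead of the pairwise merging described in steps 4 and 5.

```python
import heapq

# Step 2: assumed total order (in practice, the output of step 1).
ORDER = ["XXS", "XS", "S", "M", "L", "XL", "XXL"]
to_rank = {size: i + 1 for i, size in enumerate(ORDER)}
to_size = {rank: size for size, rank in to_rank.items()}


def merge_by_rank(sequences):
    # Step 3: translate each list into its numeric ranks.
    ranked = [[to_rank[item] for item in seq] for seq in sequences]
    merged = []
    # Steps 4-5: merge the sorted rank lists; equal neighbours are
    # duplicates, so only the first occurrence is kept.
    for rank in heapq.merge(*ranked):
        if not merged or merged[-1] != rank:
            merged.append(rank)
    # Step 6: translate the ranks back to text.
    return [to_size[rank] for rank in merged]


result = merge_by_rank([
    ["XS", "M", "L", "XL"],
    ["S", "M", "XXL"],
    ["XXS", "XS", "S", "L"],
])
print(result)  # ['XXS', 'XS', 'S', 'M', 'L', 'XL', 'XXL']
```

The sketch only works once the mapping is known, so the hard part of the problem remains step 1.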
+3

To combine the lists, I think merge sort is enough, going by your description. The one thing that needs to change is the merge step: if the first element of an input array is already in the result array, we must skip it.

If I understand correctly, you want to build a total order over all possible input elements. Part of that order is already defined by the input arrays (since they are already sorted), while the rest must be specified by the user. For example, in the question, the orderings

'S' < 'M' < 'XXL'

'XS' < 'M' < 'L' < 'XL'

'XXS' < 'XS' < 'S' < 'L'

are well defined. But the algorithm still does not know whether 'XXL' is greater or less than 'XL' and 'L'.
Since the three input arrays are sorted, a total order of the input elements must exist. Therefore, I would suggest asking your data provider for an ordered list of all possible data elements. That sounds silly, but it is the easy way.

If this list is not available, an easy alternative is to ask the user to order a pair whenever the algorithm encounters an ambiguous one, check that the answer does not conflict with the existing input sequences, and remember it. I believe topological sorting is more powerful than this application needs: since we are dealing with single data elements, a total order must exist, whereas topological sorting is meant for partial orders.
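
The "ask the user, then check for conflicts" step can be sketched as a reachability test over the known pairs. This is a minimal Python sketch under the assumption that the orderings are stored as (before, after) pairs; the function name `add_user_pair` is hypothetical.

```python
def add_user_pair(pairs, before, after):
    """Return pairs plus (before, after), or raise if that creates a conflict."""
    successors = {}
    for a, b in pairs:
        successors.setdefault(a, set()).add(b)

    # If `before` can already be reached from `after`, the inputs say that
    # `after` comes first, so the user's answer would contradict them.
    seen, stack = set(), [after]
    while stack:
        node = stack.pop()
        if node == before:
            raise ValueError(f"{before!r} < {after!r} conflicts with the inputs")
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return set(pairs) | {(before, after)}


# Adjacent pairs taken from the three lists in the question.
pairs = {
    ("XS", "M"), ("M", "L"), ("L", "XL"),
    ("S", "M"), ("M", "XXL"),
    ("XXS", "XS"), ("XS", "S"), ("S", "L"),
}
resolved = add_user_pair(pairs, "XL", "XXL")  # accepted: XXL goes after XL
```

A conflicting answer such as `add_user_pair(pairs, "XXL", "M")` raises `ValueError`, because the inputs already place M before XXL.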

-1