How to compare 2 string arrays and find all consecutive matches and save indexes?

For example, if I have the following 2 arrays:

string[] userSelect = new string[] {"the", "quick", "brown", "dog", "jumps", "over"}; string[] original = new string[] {"the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"}; 

I am trying to compare the userSelect array with the original array and get all consecutive matches based on the index. The userSelect array will always consist of strings from the original array. Thus, the output will look like this:

 int[] match0 = new int[] {0, 1, 2}; // indices for "the quick brown" int[] match2 = new int[] {4, 5}; // indices for "jumps over" int[] match1 = new int[] {3}; // index for "dog" 

The length of the UserSelect array will never exceed the original length of the array, however it can be shorter and the words can be in any order. How can i do this?

+7
source share
5 answers

Here is what I came up with

 var matches = (from l in userSelect.Select((s, i) => new { s, i }) join r in original.Select((s, i) => new { s, i }) on ls equals rs group l by ri - li into g from m in g.Select((l, j) => new { li, j = li - j, k = g.Key }) group m by new { mj, mk } into h select h.Select(t => ti).ToArray()) .ToArray(); 

This will lead to the conclusion

 matches[0] // { 0, 1, 2 } the quick brown matches[1] // { 4, 5 } jumps over matches[2] // { 0 } the matches[3] // { 3 } dog 

Using the input {"the", "quick", "brown", "the", "lazy", "dog"} gives:

 matches[0] // { 0, 1, 2 } the quick brown matches[1] // { 0 } the matches[2] // { 3 } the matches[3] // { 3, 4, 5 } the lazy dog 

Note that calls to ToArray are optional. If you really don't need the results in an array, you can leave this and save a little processing time.

To filter out any sequences that are completely contained in other large sequences, you can run this code (note orderby in the modified query):

 var matches = (from l in userSelect.Select((s, i) => new { s, i }) join r in original.Select((s, i) => new { s, i }) on ls equals rs group l by ri - li into g from m in g.Select((l, j) => new { li, j = li - j, k = g.Key }) group m by new { mj, mk } into h orderby h.Count() descending select h.Select(t => ti).ToArray()); int take = 0; var filtered = matches.Where(m => !matches.Take(take++) .Any(n => m.All(i => n.Contains(i)))) .ToArray(); 
+2
source

It would be easier if the words could not be repeated.,.

The general idea is to create a Dictionary<string, List<int>> from the original list of words. This will tell you which words are used in which positions. The dictionary for your sample will be:

 key="the", value={0, 6} key="quick", value={1} key="brown", value={2} ... etc 

Now, when you enter user input, you sequentially look at it, looking at the words in the dictionary to get a list of positions.

So you look at the word and that is in the dictionary. You save the position (s) returned from the dictionary. Look at the next word. There are three conditions:

  • The word is not in the dictionary. Save the previous sequential group and continue to the next word, where you potentially start a new group.
  • The word is in the dictionary, but none of the positions found matches the expected positions (the expected positions are more than the saved positions from the last word). Save the previous group in a row and continue to the next word, where you potentially start a new group.
  • The word is in the dictionary, and one of the returned positions corresponds to the expected position. Save these positions and go to the next word.

Hope you have an idea.

+2
source

This does not do what you would like, but it is a really clean and easy way to get a new array with all common lines (i.e. take the intersection of two arrays).

 var results = array1.Intersect(array2, StringComparer.OrdinalIgnoreCase); 

After executing the resutls array, each line (ignoring the case) that occurs in both array1 and resutls will be contained.

If you want a little theory, the intersection method is based on the intersection operation that you perform on a set in lambda calculus. Collections in C # implement all common operations with a lot, so you should have some familiarity with them. Here is a link to a wiki article; http://en.wikipedia.org/wiki/Intersection_(set_theory)

+1
source

It is not very elegant, but effective. When it comes to indexes, Linq makes it often more complex and less efficient than simple loops.

 string[] userSelect = new string[] { "the", "quick", "brown", "dog", "jumps", "over" }; string[] original = new string[] { "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" }; var consecutiveGroups = new Dictionary<int, IList<string>>(); IList<Tuple<int, string>> uniques = new List<Tuple<int, string>>(); int maxIndex = Math.Min(userSelect.Length, original.Length); if (maxIndex > 0) { int minIndex = 0; int lastMatch = int.MinValue; for (int i = 0; i < maxIndex; i++) { var us = userSelect[i]; var o = original[i]; if (us == o) { if (lastMatch == i - 1) consecutiveGroups[minIndex].Add(us); else { minIndex = i; consecutiveGroups.Add(minIndex, new List<string>() { us }); } lastMatch = i; } else uniques.Add(Tuple.Create(i, us)); } } 

displays sequential group indices + uniques indices:

 var consecutiveGroupsIndices = consecutiveGroups .OrderByDescending(kv => kv.Value.Count) .Select(kv => Enumerable.Range(kv.Key, kv.Value.Count).ToArray() .ToArray()); foreach(var consIndexGroup in consecutiveGroupsIndices) Console.WriteLine(string.Join(",", consIndexGroup)); Console.WriteLine(string.Join(",", uniques.Select(u => u.Item1))); 
+1
source

Use LINQ for more fun.

After several attempts, I came up with a clean LINQ solution that could theoretically be single-line. I tried to make it effective, but, of course, functional solutions will lead to duplication of calculations, since you cannot save state.

We'll start with a little preprocessing to save on recalculations later. Yes, I know what I am doing with the index, this is a controversial practice, but if you are careful, it works and it develops rapidly:

 var index = 0; var lookup = original.ToLookup(s => s, s => index++); 

Monster

 var occurrences = userSelect .Where(lookup.Contains) .SelectMany((s, i) => lookup[s] .Select(j => new { User = userSelect.Skip(i), Original = original.Skip(j), Skipped = i }) .Select(t => t.User.Zip(t.Original, (u, v) => Tuple.Create(u, v, t.Skipped)) .TakeWhile(tuple => tuple.Item1 == tuple.Item2) ) .Select(u => new { Word = s, Start = u.Select(v => v.Item3).Min(), Length = u.Count() }) ) .GroupBy(v => v.Start + v.Length) .Select(g => g.OrderBy(u => u.Start).First()) .GroupBy(v => v.Word) .Select(g => g.OrderByDescending(u => u.Length).First()) .Select(w => Enumerable.Range(w.Start, w.Length).ToArray()) .ToList(); 

Print this with

 foreach (var occurrence in occurrences) { Console.WriteLine( "Maximal match starting with '{0}': [{1}]", userSelect[occurrence[0]], string.Join(", ", occurrence) ); } 

gives

 Maximal match starting with 'the': [0, 1, 2] Maximal match starting with 'dog': [3] Maximal match starting with 'jumps': [4, 5] 

It is immediately clear that you would not want to use this code in production , another (procedural) solution would be preferable today. However, this solution is different in that it is purely functional, with the exception of lookup . Of course, this can also be written functionally:

 var lookup = original.Select((s, i) => Tuple.Create) .ToLookup(t => t.Item1, t => t.Item2); 

How it works

When diluted, it creates a dictionary-like structure that associates each word in original with indexes, where it appears in one collection. This will be used later to create as many matching sequences as possible with each word in userSelect (for example, "the" will result in two matched sequences since it appears twice in the original ).

Then:

 .Where(lookup.Contains) 

This is easy, it removes from consideration all words in userSelect that are not displayed in original .

  // For each place where the word s appears in original... .SelectMany((s, i) => lookup[s] // Define the two subsequences of userSelect and original to work on. // We are trying to find the number of identical elements until first mismatch. .Select(j => new { User = userSelect.Skip(i), Original = original.Skip(j), Skipped = j }) // Use .Zip to find this subsequence .Select(t => t.User.Zip(t.Original, (u, v) => Tuple.Create(u, v, t.Skipped)).TakeWhile(tuple => tuple.Item1 == tuple.Item2)) // Note the index in original where the subsequence started and its length .Select(u => new { Word = s, Start = u.Select(v => v.Item3).Min(), Length = u.Count() }) ) 

At this point, we project each matching word in userSelect onto an anonymous object with the Start and Length properties. However, sequences in which the matching length is N also lead to smaller matching sequences of length N-1, N-2, ... 1.

The key point here is the realization that for all subsequences in such sets Start + Length will be the same; in addition, subsequences from different sets will have different Start + Length amounts. So let's use to crop the results:

 // Obvious from the above .GroupBy(v => v.Start + v.Length) // We want to keep the longest subsequence. Since Start + Length is constant for // all, it follows the one with the largest Length has the smallest Start: .Select(g => g.OrderBy(u => u.Start).First()) 

This will leave us so many matches for every word in userSelect , just that word will appear in original . So, let it also shorten it to the longest match:

 .GroupBy(v => v.Word) .Select(g => g.OrderByDescending(u => u.Length).First()) 

Now we have such an object as { Word = "the", Start = 0, Length = 3 } . Let convert this to an array of indices in userSelect :

 .Select(w => Enumerable.Range(w.Start, w.Length).ToArray()) 

And finally, put all these arrays in the same collection and mission!

0
source

All Articles