Find sequence in IEnumerable <T> with Linq

Question

Find sequence in IEnumerable <T> with Linq

What is the most efficient way to find sequence in IEnumerable<T> using LINQ

I want to be able to create an extension method that allows the following call:

 int startIndex = largeSequence.FindSequence(subSequence)

Compliance must be contiguous and orderly.

+6

c # .net linq

trampster Aug 24 '10 at 23:23

source share

5 answers

Thomas levesque · Answer 1 · 2010-08-25T02:06:14+0000

Here is an implementation of an algorithm that finds a subsequence in a sequence. I called the IndexOfSequence method because it makes the intent more explicit and similar to the existing IndexOf method:

 public static class ExtensionMethods { public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence) { return source.IndexOfSequence(sequence, EqualityComparer<T>.Default); } public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence, IEqualityComparer<T> comparer) { var seq = sequence.ToArray(); int p = 0; // current position in source sequence int i = 0; // current position in searched sequence var prospects = new List<int>(); // list of prospective matches foreach (var item in source) { // Remove bad prospective matches prospects.RemoveAll(k => !comparer.Equals(item, seq[p - k])); // Is it the start of a prospective match ? if (comparer.Equals(item, seq[0])) { prospects.Add(p); } // Does current character continues partial match ? if (comparer.Equals(item, seq[i])) { i++; // Do we have a complete match ? if (i == seq.Length) { // Bingo ! return p - seq.Length + 1; } } else // Mismatch { // Do we have prospective matches to fall back to ? if (prospects.Count > 0) { // Yes, use the first one int k = prospects[0]; i = p - k + 1; } else { // No, start from beginning of searched sequence i = 0; } } p++; } // No match return -1; } }

I have not fully tested it, so it may contain errors. I just did some tests on known corner cases to make sure that I don't fall into obvious traps. Everything seems to work well ...

I think the complexity is close to O (n), but I am not an expert on Big O notation, so I could be wrong ... at least it only lists the original sequence once without even returning, so it should be quite efficient.

Jon hanna · Answer 2 · 2010-08-24T23:42:29+0000

The code you say that you want to use is not LINQ, so I don’t understand why it should be used with LINQ.

This is essentially the same problem as finding a substring (indeed, listing where order is significant is a generalization of "string").

Since computer science has often addressed this issue for a long time, so you become standing on the shoulders of giants.

Some reasonable starting points:

http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

http://en.wikipedia.org/wiki/Rabin-karp

Even just the pseudocode in the wikipedia articles is enough to port to C # quite easily. Look at descriptions of performance in different cases and determine which cases are most often encountered by your code.

Ani · Answer 3 · 2015-11-16T08:50:01+0000

I understand this is an old question, but I need this exact method, and I wrote it like this:

 public static int ContainsSubsequence<T>(this IEnumerable<T> elements, IEnumerable<T> subSequence) where T: IEquatable<T> { return ContainsSubsequence(elements, 0, subSequence); } private static int ContainsSubsequence<T>(IEnumerable<T> elements, int index, IEnumerable<T> subSequence) where T: IEquatable<T> { // Do we have any elements left? bool elementsLeft = elements.Any(); // Do we have any of the sub-sequence left? bool sequenceLeft = subSequence.Any(); // No elements but sub-sequence not fully matched if (!elementsLeft && sequenceLeft) return -1; // Nope, didn't match // No elements of sub-sequence, which means even if there are // more elements, we matched the sub-sequence fully if (!sequenceLeft) return index - subSequence.Count(); // Matched! // If we didn't reach a terminal condition, // check the first element of the sub-sequence against the first element if (subSequence.First().Equals(e.First())) // Yes, it matched - move onto the next. Consume (skip) one element in each return ContainsSubsequence(elements.Skip(1), index + 1 subSequence.Skip(1)); else // No, it didn't match. Try the next element, without consuming an element // from the sub-sequence return ContainsSubsequence(elements.Skip(1), index + 1, subSequence); }

Updated to not just return if the subsequence was agreed upon, but where did it start in the original sequence.

This is the extension method in IEnumerable, completely lazy, ends earlier and is much more aligned with the current answer. However, despite the fact that (as @ wai-ha-lee points out), it recursively creates a lot of counters. Use it where necessary (performance / memory). It was good for my needs, but YMMV.

Brad cunningham · Answer 4 · 2010-08-24T23:38:49+0000

UPDATE: Given the clarification of the question, my answer below is not applicable. Leaving it for historical purposes.

You probably want to use mySequence.Where (). Then the key is to optimize the predicate in order to work well in your environment. This may vary depending on your requirements and typical usage patterns.

It is possible that what works well for small collections does not scale well for much larger collections depending on type T.

Of course, if you use 90% for small collections, then optimizing for a large collection with large outliers seems a bit YAGNI.

dcastro · Answer 5 · 2015-11-16T09:22:53+0000

You can use this library called Sequences to do this (disclaimer: I am the author).

It has an IndexOfSlice method that does exactly what you need - it's an implementation of the Knut-Morris-Pratt algorithm .

 int startIndex = largeSequence.AsSequence().IndexOfSlice(subSequence);

Find sequence in IEnumerable <T> with Linq

More articles: