The fastest way to search a list of strings in C#

I have a list of maybe 100,000 lines in memory in my application. I need to find the top 20 lines that contain a specific keyword (case insensitive). This is easy to do; I just run the following LINQ query.

from s in stringList where s.ToLower().Contains(searchWord.ToLower()) select s 

However, I have a strong feeling that this can be done much faster, and I need to find a way, because I need to search this list several times per second.

4 answers

Another option, although it would require a fair amount of memory, would be to precompute something like a suffix array (a list of positions within the strings, sorted by the suffixes they point to).

http://en.wikipedia.org/wiki/Suffix_array

This would be most feasible if the list of strings you are searching is relatively static. The entire list of string indexes can be stored in a single array of tuples (indexOfString, positionInString), over which you perform a binary search using String.Compare(keyword, 0, target, targetPos, keyword.Length).

So, if you had 100,000 strings of average length 20, you would need 100,000 * 20 * 2 * sizeof(int) of memory for the structure. You can cut that in half by packing both indexOfString and positionInString into a single 32-bit int, for example with positionInString in the lower 12 bits and indexOfString in the remaining upper bits. You just have to do a little bit-twiddling to get the two values back out. It is important to note that the structure contains no strings or substrings itself; the strings you are searching exist only once.
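The packing scheme can be sketched like this (the names Pack/IndexOfString/PositionInString are illustrative, not from the answer; it assumes strings of at most 4,096 characters so the position fits in 12 bits):

```csharp
// Pack an (indexOfString, positionInString) pair into one 32-bit int:
// positionInString in the low 12 bits, indexOfString in the upper bits.
static int Pack(int indexOfString, int positionInString)
{
    return (indexOfString << 12) | positionInString;
}

static int IndexOfString(int packed)
{
    return packed >> 12;
}

static int PositionInString(int packed)
{
    return packed & 0xFFF;
}
```

With 100,000 strings, indexOfString fits comfortably in the remaining upper bits.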

This basically gives you a full index and lets you find any substring quickly (a binary search over the index represented by the suffix array) with a minimal number of actual string comparisons.

If memory is expensive, a simpler optimization of the initial brute-force algorithm would be to precompute a dictionary of unique characters and assign ordinal numbers to represent each. Then precompute a bit array for each line, with a bit set for each unique character contained in that line. Since your lines are relatively short, there should be significant variability among the resulting BitArrays (this would not work well if your lines were very long). You then simply compute the BitArray of the search keyword and search for the keyword only in those lines where keywordBits & targetBits == keywordBits. If your strings are pre-converted to lowercase and use only the English alphabet, the BitArray will probably fit in a single int. So this requires a minimum of additional memory, is simple to implement, and lets you quickly filter out lines in which you will definitely not find the keyword. This can be a useful optimization, since the string searches themselves are fast but you have so many of them that brute-force search is slow.
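A minimal sketch of that prefilter, assuming the lines are already lowercased and use only a-z so each line's character set fits in one 32-bit mask (one bit per letter):

```csharp
using System.Collections.Generic;

// Build a mask with one bit set per distinct letter a-z in the string.
static uint CharMask(string s)
{
    uint mask = 0;
    foreach (char c in s)
        if (c >= 'a' && c <= 'z')
            mask |= 1u << (c - 'a');
    return mask;
}

// Skip lines whose mask lacks some letter of the keyword, then run a
// normal substring test only on the lines that survive the filter.
static List<string> Search(string[] lines, uint[] masks, string keyword)
{
    uint kw = CharMask(keyword);
    var hits = new List<string>();
    for (int i = 0; i < lines.Length; i++)
        if ((masks[i] & kw) == kw && lines[i].Contains(keyword))
            hits.Add(lines[i]);
    return hits;
}
```

The masks array is computed once up front; each search is then a cheap AND plus a comparison per line before any string scan happens.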

EDIT For those interested, here is a basic implementation of the original solution I proposed. I ran tests using 100,000 randomly generated strings of the lengths described by the OP. Although it took about 30 seconds to create and sort the index, keyword searches took 49,805 milliseconds for brute force versus 18 milliseconds using the indexed search, so several thousand times faster. If you rarely rebuild the list, my simple but relatively slow method of initially building the suffix array should be sufficient. There are smarter ways to build it faster, but that would require more coding than my basic implementation below.

 // little test console app
 static void Main(string[] args)
 {
     var list = new SearchStringList(true);
     list.Add("Now is the time");
     list.Add("for all good men");
     list.Add("Time now for something");
     list.Add("something completely different");
     while (true)
     {
         string keyword = Console.ReadLine();
         if (keyword.Length == 0) break;
         foreach (var pos in list.FindAll(keyword))
         {
             Console.WriteLine(pos.ToString() + " => " + list[pos.ListIndex]);
         }
     }
 }

 ~~~~~~~~~~~~~~~~~~

 // file for the class that implements a simple suffix array
 using System;
 using System.Collections.Generic;

 namespace ConsoleApplication1
 {
     public class SearchStringList
     {
         private List<string> strings = new List<string>();
         private List<StringPosition> positions = new List<StringPosition>();
         private bool dirty = false;
         private readonly bool ignoreCase = true;

         public SearchStringList(bool ignoreCase)
         {
             this.ignoreCase = ignoreCase;
         }

         public void Add(string s)
         {
             if (s.Length > 255) throw new ArgumentOutOfRangeException("string too big.");
             this.strings.Add(s);
             this.dirty = true;
             for (byte i = 0; i < s.Length; i++)
                 this.positions.Add(new StringPosition(strings.Count - 1, i));
         }

         public string this[int index]
         {
             get { return this.strings[index]; }
         }

         public void EnsureSorted()
         {
             if (dirty)
             {
                 this.positions.Sort(Compare);
                 this.dirty = false;
             }
         }

         public IEnumerable<StringPosition> FindAll(string keyword)
         {
             var idx = IndexOf(keyword);
             while ((idx >= 0) && (idx < this.positions.Count)
                    && (Compare(keyword, this.positions[idx]) == 0))
             {
                 yield return this.positions[idx];
                 idx++;
             }
         }

         private int IndexOf(string keyword)
         {
             EnsureSorted();

             // Binary search. When the keyword appears multiple times, this
             // points to the first match in positions; the following positions
             // can be examined for additional matches.
             int minP = 0;
             int maxP = this.positions.Count - 1;
             while (maxP > minP)
             {
                 int midP = minP + ((maxP - minP) / 2);
                 if (Compare(keyword, this.positions[midP]) > 0)
                     minP = midP + 1;
                 else
                     maxP = midP;
             }
             if ((maxP == minP) && (Compare(keyword, this.positions[minP]) == 0))
                 return minP;
             else
                 return -1;
         }

         private int Compare(StringPosition pos1, StringPosition pos2)
         {
             int len = Math.Max(
                 this.strings[pos1.ListIndex].Length - pos1.StringIndex,
                 this.strings[pos2.ListIndex].Length - pos2.StringIndex);
             return String.Compare(
                 strings[pos1.ListIndex], pos1.StringIndex,
                 this.strings[pos2.ListIndex], pos2.StringIndex,
                 len, ignoreCase);
         }

         private int Compare(string keyword, StringPosition pos2)
         {
             return String.Compare(
                 keyword, 0,
                 this.strings[pos2.ListIndex], pos2.StringIndex,
                 keyword.Length, this.ignoreCase);
         }

         // Packs the index of a string and a position within that string into
         // a single int. This is set up for strings no longer than 255 chars.
         // If longer strings are desired, the constructor and the ListIndex /
         // StringIndex accessors would need to be modified accordingly, taking
         // bits from ListIndex and using them for StringIndex.
         public struct StringPosition
         {
             public static StringPosition NotFound = new StringPosition(-1, 0);
             private readonly int position;

             public StringPosition(int listIndex, byte stringIndex)
             {
                 this.position = (listIndex < 0) ? -1 : (listIndex << 8) | stringIndex;
             }

             public int ListIndex
             {
                 get { return (this.position >= 0) ? (this.position >> 8) : -1; }
             }

             public byte StringIndex
             {
                 get { return (byte)(this.position & 0xFF); }
             }

             public override string ToString()
             {
                 return ListIndex.ToString() + ":" + StringIndex;
             }
         }
     }
 }

Finding substrings (partial matches) is surprisingly difficult. There is nothing built in to help you with this. I suggest you look into the suffix tree data structure, which you can use to search for substrings efficiently.

You can pull searchWord.ToLower() out into a local variable to save a ton of string operations, by the way. You can also precompute a lowercased version of stringList. If you can't precompute, at least use s.IndexOf(searchWord, StringComparison.InvariantCultureIgnoreCase) != -1; that saves the costly ToLower calls.

You can also tack .AsParallel() onto the query.
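Put together, the suggestions look roughly like this (a sketch, using OrdinalIgnoreCase rather than the invariant-culture comparison mentioned above; Take(20) reflects the "top 20" requirement from the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Case-insensitive IndexOf avoids the per-item ToLower allocations, and
// PLINQ spreads the scan across cores.
static List<string> Find(List<string> stringList, string searchWord)
{
    return stringList
        .AsParallel()
        .Where(s => s.IndexOf(searchWord, StringComparison.OrdinalIgnoreCase) >= 0)
        .Take(20)
        .ToList();
}
```

Note that an unordered parallel query returns whichever 20 matches are found first, not the first 20 in list order; add .AsOrdered() after .AsParallel() if order matters.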


In this case, you need an inverted index.

If you are willing to pay for it, you can use a database-specific full-text search index and tune the indexing to index every subset of the words.

Alternatively, you can use a very successful open-source project that achieves the same thing.

You pre-index the strings using a tokenizer and build an inverted index file. We had a similar use case in Java, where we needed very fast autocomplete over a large dataset.

You can look at Lucene.NET, which is a port of Apache Lucene (from Java).

If you want to keep using LINQ, you can use NHibernate Search. (Wink.)

Another option is to implement pre-indexing in memory, with preprocessing that lets the scan skip over unnecessary comparisons; take a look at the Knuth-Morris-Pratt algorithm.
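For reference, a textbook Knuth-Morris-Pratt search looks like this (a sketch; the answer only names the algorithm, it does not supply this code):

```csharp
// KMP substring search: the failure table records, for each prefix of the
// pattern, the length of its longest proper prefix that is also a suffix,
// so the scan never re-examines text characters. Returns the index of the
// first match, or -1.
static int KmpIndexOf(string text, string pattern)
{
    if (pattern.Length == 0) return 0;

    // Build the failure table.
    var fail = new int[pattern.Length];
    for (int i = 1, k = 0; i < pattern.Length; i++)
    {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) k++;
        fail[i] = k;
    }

    // Scan the text, falling back via the table on mismatches.
    for (int i = 0, k = 0; i < text.Length; i++)
    {
        while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
        if (text[i] == pattern[k]) k++;
        if (k == pattern.Length) return i - pattern.Length + 1;
    }
    return -1;
}
```

Each line is scanned in O(n) regardless of how often the pattern almost matches, which is the advantage over the naive scan.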


There is one approach that would be much faster. But it would mean looking for exact word matches rather than using the Contains functionality.

Basically, if you have the memory for it, you can create a Dictionary of words which also references some identifier (or identifiers) for the strings in which the word is found.

Thus, the dictionary could be of type Dictionary<string, List<int>>. The advantage here, of course, is that you consolidate many words into a smaller collection. And a dictionary is very fast at lookups, since it is built on a hash table.
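A minimal sketch of building such an index, assuming words are separated by spaces (BuildIndex is an illustrative name; real tokenization would also handle punctuation):

```csharp
using System;
using System.Collections.Generic;

// Map each word to the list of line indexes that contain it. Lookups are
// case-insensitive and whole-word only, as noted above.
static Dictionary<string, List<int>> BuildIndex(IList<string> lines)
{
    var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
    for (int i = 0; i < lines.Count; i++)
    {
        foreach (var word in lines[i].Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!index.TryGetValue(word, out var ids))
                index[word] = ids = new List<int>();
            // Avoid duplicate ids when a word repeats within one line.
            if (ids.Count == 0 || ids[ids.Count - 1] != i)
                ids.Add(i);
        }
    }
    return index;
}
```

A search is then a single hash lookup: index.TryGetValue(keyword, out var lineIds).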

Now, if that is not quite what you are looking for, you could look for in-memory full-text search libraries. SQL Server supports full-text search, using indexing to speed things up beyond traditional lookups, but a pure in-memory solution will undoubtedly be faster. It may not give you exact wildcard/substring functionality, however.

