The fastest way to search a string collection

Problem:

I have a text file around 120,000 users (strings) that I would like to save in the collection, and then search in this collection.

The search method will occur every time the user changes the text of the TextBox , and the result should be the lines containing the text in the TextBox .

I do not need to change the list, just pull the results and put them in the ListBox .

What I have tried so far:

I tried with two different collections / containers, which I discard string entries from an external text file (times, of course):

  • List<string> allUsers;
  • HashSet<string> allUsers;

With the following LINQ query:

allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();

Search event (triggered when the user changes the search text):

 private void textBox_search_TextChanged(object sender, EventArgs e) { if (textBox_search.Text.Length > 2) { listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList(); } else { listBox_choices.DataSource = null; } } 

Results:

Both gave me a poor response time (about 1-3 seconds between each keystroke).

Question:

Where do you think my bottleneck is? The collection I used? Search Method? Both?

How can I get better performance and freer functionality?

+75
collections c # linq winforms string-search
Jan 27 '14 at 11:07
source share
17 answers

You might consider doing a filtering task in the background thread, which will call the callback method when it is done, or just restart the filtering when the input changes.

The general idea is to be able to use it as follows:

 public partial class YourForm : Form { private readonly BackgroundWordFilter _filter; public YourForm() { InitializeComponent(); // setup the background worker to return no more than 10 items, // and to set ListBox.DataSource when results are ready _filter = new BackgroundWordFilter ( items: GetDictionaryItems(), maxItemsToMatch: 10, callback: results => this.Invoke(new Action(() => listBox_choices.DataSource = results)) ); } private void textBox_search_TextChanged(object sender, EventArgs e) { // this will update the background worker "current entry" _filter.SetCurrentEntry(textBox_search.Text); } } 

A rough sketch will look something like this:

 public class BackgroundWordFilter : IDisposable { private readonly List<string> _items; private readonly AutoResetEvent _signal = new AutoResetEvent(false); private readonly Thread _workerThread; private readonly int _maxItemsToMatch; private readonly Action<List<string>> _callback; private volatile bool _shouldRun = true; private volatile string _currentEntry = null; public BackgroundWordFilter( List<string> items, int maxItemsToMatch, Action<List<string>> callback) { _items = items; _callback = callback; _maxItemsToMatch = maxItemsToMatch; // start the long-lived backgroud thread _workerThread = new Thread(WorkerLoop) { IsBackground = true, Priority = ThreadPriority.BelowNormal }; _workerThread.Start(); } public void SetCurrentEntry(string currentEntry) { // set the current entry and signal the worker thread _currentEntry = currentEntry; _signal.Set(); } void WorkerLoop() { while (_shouldRun) { // wait here until there is a new entry _signal.WaitOne(); if (!_shouldRun) return; var entry = _currentEntry; var results = new List<string>(); // if there is nothing to process, // return an empty list if (string.IsNullOrEmpty(entry)) { _callback(results); continue; } // do the search in a for-loop to // allow early termination when current entry // is changed on a different thread foreach (var i in _items) { // if matched, add to the list of results if (i.Contains(entry)) results.Add(i); // check if the current entry was updated in the meantime, // or we found enough items if (entry != _currentEntry || results.Count >= _maxItemsToMatch) break; } if (entry == _currentEntry) _callback(results); } } public void Dispose() { // we are using AutoResetEvent and a background thread // and therefore must dispose it explicitly Dispose(true); } private void Dispose(bool disposing) { if (!disposing) return; // shutdown the thread if (_workerThread.IsAlive) { _shouldRun = false; _currentEntry = null; _signal.Set(); _workerThread.Join(); } // if targetting .NET 3.5 or older, we have to // use the explicit IDisposable implementation (_signal as IDisposable).Dispose(); } } 

In addition, you must dispose of the _filter instance when the parent Form hosted. This means that you must open and modify your Form Dispose method (inside the YourForm.Designer.cs file) to look something like this:

 // inside "xxxxxx.Designer.cs" protected override void Dispose(bool disposing) { if (disposing) { if (_filter != null) _filter.Dispose(); // this part is added by Visual Studio designer if (components != null) components.Dispose(); } base.Dispose(disposing); } 

On my machine, it works pretty fast, so you should check and profile it before moving on to a more complex solution.

Saying this, a “more difficult solution” may be to save the last two results in a dictionary, and then only filter them if it turns out that the new record differs only in the first of the last character.

+48
Jan 27 '14 at 12:28
source share

I did some testing and performed a search in a list of 120,000 items, and putting a new list into the record takes a small amount of time (about 1/50 second, even if all the lines match).

So the problem you see should come from populating the data source, here:

 listBox_choices.DataSource = ... 

I suspect you are simply putting too many items in a list.

Perhaps you should try to limit it to the first 20 entries, for example:

 listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)) .Take(20).ToList(); 

Also note (as others have pointed out) that you are accessing the TextBox.Text property for each element in allUsers . This can be easily fixed as follows:

 string target = textBox_search.Text; listBox_choices.DataSource = allUsers.Where(item => item.Contains(target)) .Take(20).ToList(); 

However, I calculated how long it would take to access TextBox.Text 500,000 times, and it only took 0.7 seconds, which is much less than the 1-3 seconds mentioned in the OP. However, it is worth optimizing.

+36
Jan 27 '14 at 11:28
source share

Use the suffix tree as an index. Rather, just create a sorted dictionary that associates each suffix of each name with a list of matching names.

To enter:

 Abraham Barbara Abram 

The structure will look like this:

 a -> Barbara ab -> Abram abraham -> Abraham abram -> Abram am -> Abraham, Abram aham -> Abraham ara -> Barbara arbara -> Barbara bara -> Barbara barbara -> Barbara bram -> Abram braham -> Abraham ham -> Abraham m -> Abraham, Abram raham -> Abraham ram -> Abram rbara -> Barbara 

Search algorithm

Suppose user input is "bra."

  • Bisect dictionary on user input to find user input or position in which he could go. So we find "barbara" - the last key is lower than "bra". It is called the lower border for the bra. The search will take logarithmic time.
  • Scroll from the found key until the user input no longer matches. This will give “bram” → Abram and “braham” → Abraham.
  • Combine the result of the iteration (Abram, Abraham) and output it.

Such trees are designed to quickly find substrings. Its performance is close to O (log n). I believe this approach will work fast enough to be used directly by the GUI thread. Moreover, it will work faster than a threaded solution due to the lack of synchronization overhead.

+28
Jan 27 '14 at
source share

You need a text search engine (e.g. Lucene.Net ) or a database (you can consider a built-in, e.g. SQL CE , SQLite , etc.). In other words, you need an indexed search. A hash search is not applicable here because you are looking for a substring, while a hash based search is suitable for finding the exact value.

Otherwise, it will be an iterative search with a loop in the collection.

+14
Jan 27 '14 at 11:25
source share

It may also be useful to have the "debounce" event type. This differs from throttling in that it waits a period of time (for example, 200 μs) for changes to complete before the event fires.

For more on debouncing, see Debounce and Throttle: A Visual Explain . I appreciate that this article focuses on JavaScript instead of C #, but the principle applies.

The advantage of this is that it does not perform a search when you are still entering your query. Then he must stop trying to execute two requests at once.

+12
Jan 27 '14 at 11:13
source share

Update:

I did some profiling.

(Update 3)

  • List Content: Digits Created from 0 to 2.499.999
  • Filter text: 123 (20,477 results)
  • Core i5-2500, Win7 64bit, 8 GB RAM.
  • VS2012 + JetBrains dotTrace

The initial test run for 2.500.000 records took me 20,000 ms.

The first culprit is calling textBox_search.Text inside Contains . This causes a call to each element of the expensive get_WindowText text field method. Just changing the code to:

  var text = textBox_search.Text; listBox_choices.DataSource = allUsers.Where(item => item.Contains(text)).ToList(); 

reduced runtime to 1.858 ms .

Update 2:

The other two significant bottle necks are now calling string.Contains (about 45% of the runtime) and updating list items in set_Datasource (30%).

We could make a compromise between speed and memory usage by creating a suffix tree, because Basilevs suggested reducing the number of comparisons needed and removing some processing time from the search after pressing the key to load names from a file, which may be preferable for the user.

To improve the performance of loading items into the list, I would suggest downloading only the first few items and telling the user that there are additional items. Thus, you inform the user that there are results so that they can refine their search by entering more letters or by downloading a complete list with the click of a button.

Using BeginUpdate and EndUpdate did not change the set_Datasource .

As others have noted, the LINQ query itself is pretty fast. I believe that your bottle is an update of the list itself. You can try something like:

 if (textBox_search.Text.Length > 2) { listBox_choices.BeginUpdate(); listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList(); listBox_choices.EndUpdate(); } 

Hope this helps.C>

+11
Jan 27 '14 at 13:52
source share

Run a search in another thread and show the loading animation or progress bar while this thread is running.

You can also try to parallelize the LINQ query.

 var queryResults = strings.AsParallel().Where(item => item.Contains("1")).ToList(); 

Here is an example demonstrating the benefits of AsParallel ():

 { IEnumerable<string> queryResults; bool useParallel = true; var strings = new List<string>(); for (int i = 0; i < 2500000; i++) strings.Add(i.ToString()); var stp = new Stopwatch(); stp.Start(); if (useParallel) queryResults = strings.AsParallel().Where(item => item.Contains("1")).ToList(); else queryResults = strings.Where(item => item.Contains("1")).ToList(); stp.Stop(); Console.WriteLine("useParallel: {0}\r\nTime Elapsed: {1}", useParallel, stp.ElapsedMilliseconds); } 
+10
Jan 27 '14 at 11:09
source share

Assuming that you are only matching prefixes, the data structure you are looking for is called trie , also known as the tree prefix. ”The IEnumerable.Where method that you are using now will have to iterate over all the elements in your dictionary with every access.

This thread shows how to create trie in C #.

+9
Jan 27 '14 at 11:24
source share

First, I would change how ListControl sees your data source, you convert the result of IEnumerable<string> to List<string> . Especially when you have typed several characters, this can be inefficient (and unnecessary). Do not make expansive copies of your data .

  • I would .Where() result of .Where() in a collection that implements only what is required from IList (search). This will allow you to create a new large list for each character.
  • Alternatively, I would avoid LINQ, and I would write something more specific (and optimized). Keep your list in memory and create an array of matching indexes, reuse the array so you don't have to redistribute it for each search.

The second step is to not search the large list if that is enough. When the user starts typing “ab” and he adds “c”, you don’t need to search in a large list; searching in the filtered list is enough (and faster). Refine your search every time. Maybe don't do a full search every time.

The third step may be more difficult: save ordered data for quick retrieval . Now you need to change the structure that you use to store your data. imagine a tree:

 Abc
  Add better ceil
  Above bone contour

This can be simply implemented using an array (if you work with ANSI names, otherwise the dictionary would be better). Create a list like this (goal illustration, it matches the beginning of the line):

 var dictionary = new Dictionary<char, List<string>>(); foreach (var user in users) { char letter = user[0]; if (dictionary.Contains(letter)) dictionary[letter].Add(user); else { var newList = new List<string>(); newList.Add(user); dictionary.Add(letter, newList); } } 

The search will be performed using the first character:

 char letter = textBox_search.Text[0]; if (dictionary.Contains(letter)) { listBox_choices.DataSource = new MyListWrapper(dictionary[letter].Where(x => x.Contains(textBox_search.Text))); } 

Please note that I used MyListWrapper() as suggested in the first step (but I briefly mentioned the second sentence for brevity, if you choose the right size for the dictionary key, you can keep each list short and fast so that you can possibly avoid or else). Also, note that you can try using the first two characters for your dictionary (more lists and shorter ones). If you continue this, you will have a tree (but I don’t think you have so many items).

there are many different algorithms for finding strings (with related data structures), just noting a few:

  • Finite-based search based : in this approach, we avoid backtracking by creating a deterministic finite-state machine (DFA) that recognizes a stored search string. They are expensive to build — they are usually created using the poweret construct, but are used very quickly.
  • Stubs : Knuth-Morris-Pratt calculates a DFA that recognizes entries with a search string as a suffix, Boyer-Moore starts a search from the end of the needle, so it can usually jump ahead the entire length of the needle at each step. Baeza-Yates keeps track of whether the previous j-characters were a search string prefix and, therefore, adapted to the search for fuzzy strings. The bit algorithm is an application of the Baeza-Yates approach.
  • Index Methods : Faster search algorithms based on text preprocessing. After building a subscript index, such as a suffix tree or suffix array, pattern occurrences can be found quickly.
  • Other options . Some search methods, such as trigram searches, are designed to look for "proximity" between the search string and text, rather than "match / inconsistency." They are sometimes called "fuzzy" searches.

A few words about parallel search. It is possible, but it is rarely trivial, because the overhead to make it parallel can be much higher than the search itself. I would not do the search on my own (separation and synchronization will soon become too expansive and possibly complicated), but I would move the search to a separate stream . If the main thread is not busy, your users will not experience any delays during their input (they will not notice if the list appears after 200 ms, but they will feel uncomfortable if they have to wait 50 ms after entering them), Of course, the search By itself, it should be fast enough, in this case you do not use streams to speed up the search, but save your user interface . Please note that a separate thread will not speed up your request , it will not hang the UI, but if your request was slow, it will still be slow in a separate thread (moreover, you will have to process several consecutive requests too).

+7
Jan 27 '14 at 11:22
source share

The WinForms ListBox control is indeed your enemy. He will slowly load records, and ScrollBar will fight with you to show all 120,000 records.

Try using the old-fashioned DataGridView data obtained in a DataTable with a single [UserName] column to store your data:

 private DataTable dt; public Form1() { InitializeComponent(); dt = new DataTable(); dt.Columns.Add("UserName"); for (int i = 0; i < 120000; ++i){ DataRow dr = dt.NewRow(); dr[0] = "user" + i.ToString(); dt.Rows.Add(dr); } dgv.AutoSizeColumnsMode = DataGridViewAutoSizeColumnsMode.Fill; dgv.AllowUserToAddRows = false; dgv.AllowUserToDeleteRows = false; dgv.RowHeadersVisible = false; dgv.DataSource = dt; } 

Then use the DataView in the TextChanged event of your TextBox to filter the data:

 private void textBox1_TextChanged(object sender, EventArgs e) { DataView dv = new DataView(dt); dv.RowFilter = string.Format("[UserName] LIKE '%{0}%'", textBox1.Text); dgv.DataSource = dv; } 
+7
Jan 27 '14 at 21:26
source share

You can try using PLINQ (Parallel LINQ). Although this does not guarantee speedup, you need to know the trial version and error.

+4
Jan 27 '14 at 11:18
source share

I doubt that you can do it faster, but for sure you need to:

a) Use the LINQ extension method

a) Use some kind of timer to delay filtering

b) Put the filter method in another thread

Hold on somewhere string previousTextBoxValue . Make a timer with a delay of 1000 ms, which starts the search by tick, if previousTextBoxValue matches your textbox.Text value. If not, reassign previousTextBoxValue current value and reset the timer. Set a timer for an event with a modified text box, and this will make your application smoother. Filtering 120,000 records in 1-3 seconds is fine, but your user interface should remain responsive.

+4
Jan 27 '14 at 11:20
source share

You can also try using the BindingSource.Filter function. I used it, and it works like a charm for filtering from many records, each time updating this property with text search. Another option is to use AutoCompleteSource for the TextBox control.

Hope this helps!

+3
Jan 28 '14 at 7:54
source share

I would try to sort the collection, search to combine only the initial part and limit the search to a certain number.

so ininialization

 allUsers.Sort(); 

and search

 allUsers.Where(item => item.StartWith(textBox_search.Text)) 

Perhaps you can add a cache.

+2
Jan 27 '14 at 16:03
source share

Use Parallel LINQ . PLINQ is a parallel implementation of LINQ to Objects. PLINQ implements a complete set of standard LINQ query operators as extension methods for the T: System.Linq namespace and has additional operators for parallel operations. PLINQ combines the simplicity and readability of LINQ syntax with the possibility of parallel programming. Like code intended for a parallel task library, PLINQ asks for a scale of concurrency based on the capabilities of the host computer.

Introduction to PLINQ

Understanding Acceleration in PLINQ

Also you can use Lucene.Net

Lucene.Net is a port of the Lucene search engine library written in C # and is intended for users of the .NET runtime. Lucene Search Library is an inverted index based library. Lucene.Net has three main goals:

+1
Jan 27 '14 at 16:56
source share

According to what I saw, I agree to sort the list.

However, sorting when the list constructor is very slow, sorting when building, you will have better lead time.

, , -.

. , .

+1
28 . '14 4:14
source share

BinarySearch, , .

O (n) BinarySearch - O (lg (n))

, , , , .

+1
29 . '14 8:44
source share



All Articles