A quick way to retrieve a list based on another list

List1: {"123456", "432978", "321675", …} // containing 100,000 members List2: {"7674543897", "1234568897", "8899776644",…} // containing 500,000 members 

I want to extract all the elements in List2 so that their first 6 digits are from List1 elements, so the string "1234568897" is valid since the first 6 digits are from the first List1s element. What is the fastest way to do this?

  foreach(string id in List1) { string result = List2.FirstOrDefault(x => x.Contains(id)); if(result!=null) { //some works here } } 

this works for a group of less than 1000, but when List2 items grow, it takes too much time.

+4
source share
3 answers

You can use Enumerable.Join , which is quite efficient :

 var match = from str1 in List1 join str2 in List2 on str1 equals (str2.Length < 6 ? str2 : str2.Substring(0, 6)) select str2; 

Demo

Edit

Since @Oleksandr Pshenychnyy suggested that it will be very slow with such large collections, here is a demo with 100,000 random lines in lines list1 and 500000 in list2 with the same range as in the question. It executes in 600 milliseconds on ideone:

http://ideone.com/rB6LU4

+3
source

It depends on the equipment you work on. However, you may be working on premature optimization. It can be fast enough, just rudely forcing him. 500,000 * 100,000 is your maximum number of comparisons (i.e. if nothing matches). It really doesn’t take much time on a smart desktop PC.

If you find that this is too slow for your purposes, there are several methods that you could use to improve performance:

  • Parallel. This will only demonstrate great benefits for multi-core hardware. Essentially, you need a dispatcher class that passes numbers from List2 to streams that perform a match lookup on List1. See the Parellel Task Library .

  • Reduce the number of comparisons smarter. Do a preliminary analysis of your collections to improve their performance for the comparison phase. for example, put items from list1 on the buckets list, where each bucket contains all sequences with the same first two numbers. for example, bucket 1 will contain those starting with "00", bucket 2 will be "01", etc. When you do the actual comparison, you will only need to compare a small number of lines. In your example for β€œ1234568897” we checked the bucket for β€œ12”, and then we know that we only need to perform a complete comparison of the lines with the lines in this bucket.

Such preprocessing can actually slow things down, so make sure you know your code exactly.

0
source

The most efficient way to implement what you need is the Aho-Corasick algorithm - if you need to dynamically process the new List 2 items. It is based on the prefix-tree, a commonly used technique in string search algorithms. This algorithm will give you O complexity (sum of all lines)

Another option is the Knuth-Morris-Pratt algorithm , which will give you the same complexity, but initially works with a single search string.

But if you are fine with O((list2.Count*log(list2.Count) + list1.Count*log(list1.Count))*AverageStrLength) , I can offer my simple implementation: Sort both lists. Then follow them at the same time and select matches.

UPDATE: This implementation is great if you have already sorted the lists. Then it runs after about 100 ms. But sorting takes 3.5 seconds, so the implementation is not as good as I expected at the beginning. Regarding a simple implementation, Tim's solution is faster.

 list1.Sort(); list2.Sort(); var result = new List<string>(); for(int i=0, j=0; i<list1.Count; i++) { var l1Val = list1[i]; for (; j < list2.Count && string.CompareOrdinal(l1Val, list2[j]) > 0; j++) ; for (; j < list2.Count && list2[j].StartsWith(l1Val); j++) { result.Add(list2[j]); } } 

This is the easiest effective implementation I can offer.

0
source

All Articles