Fast Count() of the intersection of two string arrays

I need to count the number of elements in the intersection of two large arrays of strings, and do it very quickly.

I am using the following code:

arr1[i].Intersect(arr2[j]).Count() 

For CPU Time, VS Profiler indicates

  • 85.1% in System.Linq.Enumerable.Count()
  • 0.3% in System.Linq.Enumerable.Intersect()

Unfortunately, it may take several hours to complete all the work.

How can I do this faster?

+6
5 answers

You can build a HashSet<string> from arr2:

 HashSet<string> arr2Set = new HashSet<string>(arr2);
 int count = arr1.Where(x => arr2Set.Contains(x)).Count(); // HashSet.Contains is a fast hash-based lookup

Not counting the conversion of arr2 to arr2Set, this is O(n).

+3

I suspect the profiler attributes the time to Count because that is where the collection is actually enumerated (Intersect is lazily evaluated and does no work until you need the result).

I believe Intersect has internal optimizations that make it fast enough, but you can try using a HashSet<string> to make sure the intersection is done without scanning the whole array for each element:

 HashSet<string> set = new HashSet<string>(arr1);
 set.IntersectWith(arr2);
 int count = set.Count;
+1

Hmmm, Intersect is probably O(n^2).

To speed it up: quicksort both arrays, then walk through both arrays in step, counting the intersections.

Too lazy to check how fast it would be, but it should be O(n log n + n).
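A sketch of that sort-and-walk idea (the benchmark below uses SortedSet instead; the method name and duplicate handling here are my own choices):

```csharp
using System;

class SortedIntersect
{
    // Counts the distinct common elements by sorting both arrays
    // and advancing two indices in step: O(n log n + m log m) total.
    static int CountSortedIntersection(string[] a, string[] b)
    {
        Array.Sort(a, StringComparer.Ordinal);
        Array.Sort(b, StringComparer.Ordinal);

        int i = 0, j = 0, count = 0;
        while (i < a.Length && j < b.Length)
        {
            int cmp = string.CompareOrdinal(a[i], b[j]);
            if (cmp < 0) i++;
            else if (cmp > 0) j++;
            else
            {
                count++;
                // Skip duplicates so each distinct match counts once.
                string match = a[i];
                while (i < a.Length && a[i] == match) i++;
                while (j < b.Length && b[j] == match) j++;
            }
        }
        return count;
    }

    static void Main()
    {
        string[] x = { "pear", "apple", "fig", "apple" };
        string[] y = { "fig", "apple", "plum" };
        Console.WriteLine(CountSortedIntersection(x, y)); // prints 2
    }
}
```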

 using System;
 using System.Collections.Generic;
 using System.Linq;

 namespace Test
 {
     class Program
     {
         static void Main(string[] args)
         {
             const int arrsize = 1000000;
             Random rnd = new Random(42);
             string[] arr1 = new string[arrsize];
             string[] arr2 = new string[arrsize];
             for (int i = 0; i < arrsize; i++)
             {
                 arr1[i] = rnd.Next().ToString();
                 arr2[i] = rnd.Next().ToString();
             }

             {
                 var stamp = System.Diagnostics.Stopwatch.GetTimestamp();
                 arr1.Intersect(arr2).Count();
                 Console.WriteLine("array " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
             }
             {
                 HashSet<string> set = new HashSet<string>(arr1);
                 var stamp = System.Diagnostics.Stopwatch.GetTimestamp();
                 set.IntersectWith(arr2);
                 int count = set.Count;
                 Console.WriteLine("HashSet " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
             }
             {
                 var stamp = System.Diagnostics.Stopwatch.GetTimestamp();
                 HashSet<string> set = new HashSet<string>(arr1);
                 set.IntersectWith(arr2);
                 int count = set.Count;
                 Console.WriteLine("HashSet + new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
             }
             {
                 var stamp = System.Diagnostics.Stopwatch.GetTimestamp();
                 SortedSet<string> set = new SortedSet<string>(arr1);
                 set.IntersectWith(arr2);
                 int count = set.Count;
                 Console.WriteLine("SortedSet + new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
             }
             {
                 SortedSet<string> set = new SortedSet<string>(arr1);
                 var stamp = System.Diagnostics.Stopwatch.GetTimestamp();
                 set.IntersectWith(arr2);
                 int count = set.Count;
                 Console.WriteLine("SortedSet without new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
             }
         }
     }
 }

Results (Stopwatch ticks):

 array                    914,637
 HashSet                  816,119
 HashSet + new          1,150,978
 SortedSet + new       16,173,836
 SortedSet without new  7,946,709

It seems the best approach is to keep the HashSet built and ready.

+1

When you work with sets, your complexity will be O((n log n) * (m log m)) or so.

I think this should be faster, though I'm not sure; it is now O((n log n) + (m log m)).

One possibility would be:

 // Only needed if arr1 or arr2 might contain duplicates:
 var Set1 = arr1[i].Distinct().ToArray();
 var Set2 = arr2[j].Distinct().ToArray();
 // Concat, not Append: Append would add Set2 as a single element
 nCount = Set1.Count() + Set2.Count() - Set1.Concat(Set2).Distinct().Count();
0

Create a HashSet from the smaller array, then loop over the larger one, incrementing a counter for each element that exists in the hash set.
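A minimal sketch of that approach (the method name is illustrative; removing on first match counts each distinct common element once):

```csharp
using System;
using System.Collections.Generic;

class IntersectCount
{
    // Counts distinct elements common to both arrays by hashing the smaller one.
    static int CountIntersection(string[] a, string[] b)
    {
        // Build the set from the smaller array to minimize memory use.
        string[] small = a.Length <= b.Length ? a : b;
        string[] large = a.Length <= b.Length ? b : a;

        var set = new HashSet<string>(small);
        int count = 0;
        foreach (string s in large)
        {
            // Remove on first match so duplicates in the larger array
            // are not counted twice.
            if (set.Remove(s))
                count++;
        }
        return count;
    }

    static void Main()
    {
        string[] x = { "a", "b", "c", "d" };
        string[] y = { "b", "d", "e" };
        Console.WriteLine(CountIntersection(x, y)); // prints 2
    }
}
```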

0
