A faster way than List<T>.Contains()
I am trying to do what I would call a "de-intersection" (I'm not sure of its proper name, but that's what Tim Sweeney of Epic Games called it in the old UnrealEd):
```csharp
// foo and bar have some identical elements (given a case-insensitive match)
List<string> foo = GetFoo();
List<string> bar = GetBar();

// remove non matches
foo = foo.Where(x => bar.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
bar = bar.Where(x => foo.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
```

Later on I take another step where I subtract the result from the original lists to see which items I removed. That step uses .Except(), which is very fast, so it is not a problem.
There should be a faster way to do this, because this approach performs rather poorly with ~30,000 items (rows) in either list. Ideally, a way to combine this step and the later one in one fell swoop would be welcome. I tried using .Exists() instead of .Contains(), but it was slightly slower. I feel a bit dense, but I suspect this should be possible with some combination of .Except(), .Intersect(), and/or .Union().
The matches can be computed with an intersection:

```csharp
var matches = (from f in foo select f)
    .Intersect(from b in bar select b,
               StringComparer.InvariantCultureIgnoreCase);
```

The operation you are describing (the items that appear in one list but not the other) is called a symmetric difference.
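For completeness, the symmetric difference the question is after can also be expressed directly with .Except() in both directions. A minimal sketch, using made-up sample lists and the same case-insensitive comparer as the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SymmetricDifferenceSketch
{
    static void Main()
    {
        // Illustrative data (not from the original post)
        var foo = new List<string> { "Apple", "Banana", "Cherry" };
        var bar = new List<string> { "BANANA", "cherry", "Date" };
        var cmp = StringComparer.InvariantCultureIgnoreCase;

        // Items in exactly one of the two lists (the "deleted" items):
        var symDiff = foo.Except(bar, cmp)
                         .Union(bar.Except(foo, cmp), cmp)
                         .ToList();

        Console.WriteLine(string.Join(", ", symDiff)); // Apple, Date
    }
}
```

Note that Except and Union both hash their inputs internally, so this already avoids the O(n²) scan of nested .Contains() calls.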
You need a different data structure, such as a hash table. Compute the intersection of the two sets, then subtract that intersection from each set.
UPDATE:
I had some time to try this in code. I used a HashSet<T> with sets of 50,000 strings, each 2 to 10 characters long, with the following results:
Original: 79499 ms
HashSet: 33 ms
By the way, HashSet<T> has a method called SymmetricExceptWith that I thought would do the job, but it actually replaces the contents of the set it is called on with the elements that are in either set but not both. If you don't need to leave the original two sets intact, that may be exactly what you want, and the code would be more elegant.
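A minimal sketch of SymmetricExceptWith on made-up data, showing that it mutates the receiver while leaving the argument set untouched:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SymmetricExceptSketch
{
    static void Main()
    {
        // Illustrative data (not from the original post)
        var cmp = StringComparer.InvariantCultureIgnoreCase;
        var fooSet = new HashSet<string>(new[] { "Apple", "Banana" }, cmp);
        var barSet = new HashSet<string>(new[] { "BANANA", "Date" }, cmp);

        // fooSet now holds the symmetric difference; barSet is unchanged.
        fooSet.SymmetricExceptWith(barSet);

        // Sort before printing because HashSet enumeration order is unspecified.
        Console.WriteLine(string.Join(", ", fooSet.OrderBy(s => s))); // Apple, Date
    }
}
```

Because fooSet was constructed with the case-insensitive comparer, "Banana" and "BANANA" count as the same element and are removed.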
Here is the code:
```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // foo and bar have some identical elements (given a case-insensitive match)
        var foo = getRandomStrings();
        var bar = getRandomStrings();

        var timer = new Stopwatch();
        timer.Start();
        // remove non matches
        var f = foo.Where(x => !bar.Contains(x)).ToList();
        var b = bar.Where(x => !foo.Contains(x)).ToList();
        timer.Stop();
        Debug.WriteLine(String.Format("Original: {0} ms", timer.ElapsedMilliseconds));

        timer.Reset();
        timer.Start();
        var intersect = new HashSet<String>(foo);
        intersect.IntersectWith(bar);
        var fSet = new HashSet<String>(foo);
        var bSet = new HashSet<String>(bar);
        fSet.ExceptWith(intersect);
        bSet.ExceptWith(intersect);
        timer.Stop();

        var fCheck = new HashSet<String>(f);
        var bCheck = new HashSet<String>(b);
        Debug.WriteLine(String.Format("HashSet: {0} ms", timer.ElapsedMilliseconds));
        Console.WriteLine("Sets equal? {0} {1}", fSet.SetEquals(fCheck), bSet.SetEquals(bCheck));
        Console.ReadKey();
    }

    static Random _rnd = new Random();
    private const int Count = 50000;

    private static List<string> getRandomStrings()
    {
        var strings = new List<String>(Count);
        var chars = new Char[10];
        for (var i = 0; i < Count; i++)
        {
            var len = _rnd.Next(2, 10);
            for (var j = 0; j < len; j++)
            {
                chars[j] = (Char)_rnd.Next('a', 'z');
            }
            strings.Add(new String(chars, 0, len));
        }
        return strings;
    }
}
```

If the items are unique within each list, you should consider using a HashSet.
The HashSet<T> class provides high-performance set operations. A set is a collection that contains no duplicate elements and whose elements are in no particular order.
With a sorted list, you can use binary search.
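A sketch of the sorted-list approach on made-up data: sort once up front, then each lookup is O(log n) via List<T>.BinarySearch with the same case-insensitive comparer:

```csharp
using System;
using System.Collections.Generic;

class BinarySearchSketch
{
    static void Main()
    {
        // Illustrative data (not from the original post)
        var bar = new List<string> { "cherry", "Apple", "BANANA" };
        var cmp = StringComparer.InvariantCultureIgnoreCase;

        bar.Sort(cmp); // O(n log n), paid once

        // BinarySearch returns a non-negative index when the item is found.
        bool hasBanana = bar.BinarySearch("banana", cmp) >= 0;
        bool hasDate = bar.BinarySearch("date", cmp) >= 0;

        Console.WriteLine("{0} {1}", hasBanana, hasDate); // True False
    }
}
```

The list must be sorted with the same comparer used for searching, otherwise BinarySearch's results are undefined.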
Contains is an O(n) operation on a List<T>. If you used a different data structure, such as a sorted list or a dictionary, you would cut your time significantly. Looking up a key in a sorted list is typically O(log n), and in a hash table typically O(1).
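The difference shows up directly in the lookup call: List<T>.Contains scans every element, while HashSet<T>.Contains hashes the key. A minimal sketch on made-up data, using the question's case-insensitive comparer:

```csharp
using System;
using System.Collections.Generic;

class HashLookupSketch
{
    static void Main()
    {
        // Illustrative data (not from the original post)
        var bar = new List<string> { "BANANA", "cherry", "Date" };

        // Build the set once (O(n)); each Contains afterwards is O(1) on average.
        var barSet = new HashSet<string>(bar, StringComparer.InvariantCultureIgnoreCase);

        Console.WriteLine(barSet.Contains("banana")); // True
        Console.WriteLine(barSet.Contains("fig"));    // False
    }
}
```

Filtering 30,000 items against such a set is then roughly 30,000 constant-time lookups instead of 30,000 linear scans.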