The best way to take the intersection of more than two hash scenes in C #, when we do not know in advance how many hash tables there are

I am doing a logical search system for some big no. of documents in which I made a hash dictionary, and entries in the dictionary are terms, and hashes contain documents in which this term was found. Now that I want to find a single word, I will just type in the word and I will index the dictionary using the entered word in the query and print the corresponding hashset. But I also want to look for sentences, in this case I would separate the request into separate words and index the dictionary with these words, now depending on the number of words in the request that a lot of hash sets will be returned, now I want to take the intersection of these sets of hashes, so that I can return the identifiers of the documents in which I recognized the words in the request. My question is the best way to intersect these hash sets?

Currently I put hash sets in a list, and then I take the intersection of these n no. hashes two at a time, and then the intersection of the result of the first two, and then the third, etc ....

This is the code

Dictionary<string, HashSet<string>> dt = new Dictionary<string, HashSet<string>>();//assume it is filled with data... while (true) { Console.WriteLine("\n\n\nEnter the query you want to search"); string inp = Console.ReadLine(); string[] words = inp.Split(new Char[] { ' ', ',', '.', ':', '?', '!', '\t' }); List<HashSet<string>> outparr = new List<HashSet<string>>(); foreach(string w in words) { HashSet<string> outp = new HashSet<string>(); if (dt.TryGetValue(w, out outp)) { outparr.Add(outp); Console.WriteLine("Found {0} documents.", outp.Count); foreach (string s in outp) { Console.WriteLine(s); } } } HashSet<string> temp = outparr.First(); foreach(HashSet<string> hs in outparr) { temp = new HashSet<string>(temp.Intersect(hs)); } Console.WriteLine("Output After Intersection:"); Console.WriteLine("Found {0} documents: ", temp.Count); foreach(string s in temp) { Console.WriteLine(s); } } 
0
dictionary c # hashset
source share
3 answers

The principle you use sounds good, but you can tweak it a bit.

By sorting hash sets by size, you can start with the smallest one, so you can minimize the number of comparisons.

Instead of using the IEnumerable<>.Intersect method, you can do the same in a loop, but using the fact that you already have a hash set. Checking if a value exists in a hash set is very fast, so you can simply scroll through the elements in the smallest set and look for the corresponding values ​​in the next set and put them in a new set.

In a loop, you can skip the first element when you start with this. You do not need to cross it with you.

 outparr = outparr.OrderBy(o => o.Count).ToList(); HashSet<string> combined = outparr[0]; foreach(HashSet<string> hs in outparr.Skip(1)) { HashSet<string> temp = new HashSet<string>(); foreach (string s in combined) { if (hs.Contains(s)) { temp.Add(s); } } combined = temp; } 
+1
source share

IntersectWith is a good approach. Like this:

  HashSet<string> res = null; HashSet<string> outdictinary = null; foreach(string w in words) { if (dt.TryGetValue(w, out outdictinary)) { if( res==null) res =new HashSet( outdictinary,outdictinary.Comparer); else { if (res.Count==0) break; res.IntersectWith(outdictinary); } } } if (res==null) res = new HashSet(); Console.WriteLine("Output After Intersection:"); Console.WriteLine("Found {0} documents: ", res.Count); foreach(string s in res) { Console.WriteLine(s); } 
0
source share

To answer your question, it is possible that at some point you will find a set of documents that contains the words a, b and c and another set containing only other words in your query so that the intersection can become empty after several iterations. You can check this and break from foreach .

Now, IMHO, it makes no sense to make this intersection, because usually the search result should contain several files sorted by relevance. It will also be much simpler because you already have a list of files containing one word. From the hashes obtained for each word, you will need to count the number of file identifiers and return a limited number of identifiers sorted in descending order of occurrences.

0
source share

All Articles