The best way to take the intersection of more than two hash scenes in C #, when we do not know in advance how many hash tables there are

Question

The best way to take the intersection of more than two hash scenes in C #, when we do not know in advance how many hash tables there are

I am doing a logical search system for some big no. of documents in which I made a hash dictionary, and entries in the dictionary are terms, and hashes contain documents in which this term was found. Now that I want to find a single word, I will just type in the word and I will index the dictionary using the entered word in the query and print the corresponding hashset. But I also want to look for sentences, in this case I would separate the request into separate words and index the dictionary with these words, now depending on the number of words in the request that a lot of hash sets will be returned, now I want to take the intersection of these sets of hashes, so that I can return the identifiers of the documents in which I recognized the words in the request. My question is the best way to intersect these hash sets?

Currently I put hash sets in a list, and then I take the intersection of these n no. hashes two at a time, and then the intersection of the result of the first two, and then the third, etc ....

This is the code

Dictionary<string, HashSet<string>> dt = new Dictionary<string, HashSet<string>>();//assume it is filled with data... while (true) { Console.WriteLine("\n\n\nEnter the query you want to search"); string inp = Console.ReadLine(); string[] words = inp.Split(new Char[] { ' ', ',', '.', ':', '?', '!', '\t' }); List<HashSet<string>> outparr = new List<HashSet<string>>(); foreach(string w in words) { HashSet<string> outp = new HashSet<string>(); if (dt.TryGetValue(w, out outp)) { outparr.Add(outp); Console.WriteLine("Found {0} documents.", outp.Count); foreach (string s in outp) { Console.WriteLine(s); } } } HashSet<string> temp = outparr.First(); foreach(HashSet<string> hs in outparr) { temp = new HashSet<string>(temp.Intersect(hs)); } Console.WriteLine("Output After Intersection:"); Console.WriteLine("Found {0} documents: ", temp.Count); foreach(string s in temp) { Console.WriteLine(s); } }

0

dictionary c # hashset c # -4.0

user2603796 Feb 27 '15 at 13:35

source share

3 answers

Guffa · Answer 1 · 2015-02-27T13:57:06+0000

The principle you use sounds good, but you can tweak it a bit.

By sorting hash sets by size, you can start with the smallest one, so you can minimize the number of comparisons.

Instead of using the IEnumerable<>.Intersect method, you can do the same in a loop, but using the fact that you already have a hash set. Checking if a value exists in a hash set is very fast, so you can simply scroll through the elements in the smallest set and look for the corresponding values in the next set and put them in a new set.

In a loop, you can skip the first element when you start with this. You do not need to cross it with you.

 outparr = outparr.OrderBy(o => o.Count).ToList(); HashSet<string> combined = outparr[0]; foreach(HashSet<string> hs in outparr.Skip(1)) { HashSet<string> temp = new HashSet<string>(); foreach (string s in combined) { if (hs.Contains(s)) { temp.Add(s); } } combined = temp; }

Perdom · Answer 2 · 2015-02-27T13:59:28+0000

IntersectWith is a good approach. Like this:

  HashSet<string> res = null; HashSet<string> outdictinary = null; foreach(string w in words) { if (dt.TryGetValue(w, out outdictinary)) { if( res==null) res =new HashSet( outdictinary,outdictinary.Comparer); else { if (res.Count==0) break; res.IntersectWith(outdictinary); } } } if (res==null) res = new HashSet(); Console.WriteLine("Output After Intersection:"); Console.WriteLine("Found {0} documents: ", res.Count); foreach(string s in res) { Console.WriteLine(s); }

B0Andrew · Answer 3 · 2015-02-27T14:23:28+0000

To answer your question, it is possible that at some point you will find a set of documents that contains the words a, b and c and another set containing only other words in your query so that the intersection can become empty after several iterations. You can check this and break from foreach .

Now, IMHO, it makes no sense to make this intersection, because usually the search result should contain several files sorted by relevance. It will also be much simpler because you already have a list of files containing one word. From the hashes obtained for each word, you will need to count the number of file identifiers and return a limited number of identifiers sorted in descending order of occurrences.

The best way to take the intersection of more than two hash scenes in C #, when we do not know in advance how many hash tables there are

More articles: