Autocomplete-style prefix search

To give a concrete example:

  • You have a list of every name in the USA.
  • You want to auto-complete the name endings in the GUI.

Obviously, you could use a radix tree or trie to get the list of names for a prefix. However, that does not take frequency information into account. So, instead of the first 5 results simply being the first 5 in lexical order, I would like the 5 most common names:

e.g. for the prefix dan:

  (5913, 'Daniel') (889, 'Danny') (820, 'Dana') (272, 'Dan') (60, 'Dane') 

Is there a trie algorithm that I have missed? Ideally, the implementation (if one exists) would be in Python.

UPDATE: Overall I am pleased with what Paddy3113 suggested, although I will say that it blows up completely when I give it a 2.6 GB file, which is one of the files I am reducing. Studying the detail, the output gives some idea of the format:

    samz;Samzetta|Samzara|Samzie
    samza;Samzara
    samzar;Samzara
    samzara;Samzara
    samze;Samzetta
    samzet;Samzetta
    samzett;Samzetta
    samzetta;Samzetta
    samzi;Samzie
    samzie;Samzie

    # Format - PREFIX;"|".join(CHOICES)
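
For reference, a minimal sketch of writing and reading this PREFIX;"|".join(CHOICES) format (the function and file names are placeholders of mine, and the tiny dict stands in for the real reduced data):

    # Hypothetical helpers for the PREFIX;"|".join(CHOICES) reduction format.
    def dump_prefixes(pre2names, path):
        # pre2names maps a prefix string to a list of candidate names
        with open(path, 'w') as f:
            for prefix in sorted(pre2names):
                f.write('%s;%s\n' % (prefix, '|'.join(pre2names[prefix])))

    def load_prefixes(path):
        pre2names = {}
        with open(path) as f:
            for line in f:
                prefix, _, choices = line.rstrip('\n').partition(';')
                pre2names[prefix] = choices.split('|')
        return pre2names

    dump_prefixes({'samz': ['Samzetta', 'Samzara', 'Samzie'], 'samza': ['Samzara']},
                  'reduced.txt')
    print(load_prefixes('reduced.txt')['samz'])   # ['Samzetta', 'Samzara', 'Samzie']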

We have a few more days left on the bounty, so I am still looking for the killer solution, since this is not only about the reduction but also about the lookup side of things.

+4
5 answers

Yes, we can use a trie. The most common names below a trie node are either (1) the name stored at that node itself, or (2) one of the most common names below one of that node's children. Here is some Python code to play with:

    from collections import defaultdict

    class trie:
        # Each node stores its children, the frequency and name that end exactly
        # here (if any), and a cached list of the five most frequent (freq, name)
        # pairs found at or below this node.
        __slots__ = ('children', 'freq', 'name', 'top5')

        def __init__(self):
            self.children = defaultdict(trie)
            self.freq = 0
            self.name = None
            self.top5 = []

        def __getitem__(self, suffix):
            # walk down the trie one letter at a time, creating nodes as needed
            node = self
            for letter in suffix:
                node = node.children[letter]
            return node

        def computetop5(self):
            # bottom-up: merge the children's top-5 lists with this node's own name
            candidates = []
            for letter, child in self.children.items():
                child.computetop5()
                candidates.extend(child.top5)
            if self.name is not None:
                candidates.append((self.freq, self.name))
            candidates.sort(reverse=True)
            self.top5 = candidates[:5]

        def insert(self, freq, name):
            node = self[name]
            node.freq += freq
            node.name = name

    # build the trie from a "<freq> <name>" per-line file, then query a prefix
    root = trie()
    with open('letter_s.txt') as f:
        for line in f:
            freq, name = line.split(None, 1)
            root.insert(int(freq.strip()), name.strip())
    root.computetop5()
    print(root['St'].top5)
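
As a quick usage sketch, the trie can also be fed frequencies directly instead of reading a file; the numbers below are the ones quoted in the question for the dan prefix:

    # Build a small trie from the dan-prefix frequencies given in the question.
    root = trie()
    for freq, name in [(5913, 'Daniel'), (889, 'Danny'), (820, 'Dana'),
                       (272, 'Dan'), (60, 'Dane')]:
        root.insert(freq, name)
    root.computetop5()

    print(root['Dan'].top5)
    # [(5913, 'Daniel'), (889, 'Danny'), (820, 'Dana'), (272, 'Dan'), (60, 'Dane')]
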
+4

Without getting into tries at all, I would start by assuming that I have a list of names and their frequencies, then build a dictionary mapping each prefix to the set of names with that prefix, and then trim each set to a list of at most 5 names ordered by frequency.

Using a list of boys' names obtained from here, I massaged it into a text file where each line consists of an integer frequency of occurrence, some spaces, and then the name:

    8427 OLIVER
    7031 JACK
    6862 HARRY
    5478 ALFIE
    5410 CHARLIE
    5307 THOMAS
    5256 WILLIAM
    5217 JOSHUA
    4542 GEORGE
    4351 JAMES
    4330 DANIEL
    4308 JACOB
    ...

The following code builds a dictionary:

    from collections import defaultdict

    MAX_SUGGEST = 5

    def gen_autosuggest(name_freq_file_name):
        with open(name_freq_file_name) as f:
            name2freq = {}
            for nf in f:
                freq, name = nf.split()
                if name not in name2freq:
                    name2freq[name] = int(freq)
        pre2suggest = defaultdict(list)
        # in decreasing order of popularity
        for name, freq in sorted(name2freq.items(), key=lambda x: -x[1]):
            for i, _ in enumerate(name, 1):
                prefix = name[:i]
                pre2suggest[prefix].append((name, name2freq[name]))
        # set max suggestions
        return {pre: namefs[:MAX_SUGGEST] for pre, namefs in pre2suggest.items()}

    if __name__ == '__main__':
        pre2suggest = gen_autosuggest('2010boysnames_popularity_engwales2.txt')

If you index the dict with your prefix, it returns your suggestions (along with their frequencies in this case, though these can be discarded if necessary):

    >>> len(pre2suggest)
    15303
    >>> pre2suggest['OL']
    [('OLIVER', 8427), ('OLLIE', 1130), ('OLLY', 556), ('OLIVIER', 175), ('OLIWIER', 103)]
    >>> pre2suggest['OLI']
    [('OLIVER', 8427), ('OLIVIER', 175), ('OLIWIER', 103), ('OLI', 23), ('OLIVER-JAMES', 16)]
    >>>

Look, no tries :-)

Time saver

If it takes too long to run, you can pre-compute the dict, save it to a file, and then load the pre-computed values when needed using the pickle module:

    >>> import pickle
    >>>
    >>> savename = 'pre2suggest.pcl'
    >>> with open(savename, 'wb') as f:
    ...     pickle.dump(pre2suggest, f)
    ...
    >>> # restore it
    >>> with open(savename, 'rb') as f:
    ...     p2s = pickle.load(f)
    ...
    >>> p2s == pre2suggest
    True
    >>>
+2

You could substantially augment the trie implementation to keep each node's children in order of popularity rather than alphabetical order; you would also need to maintain a popularity count in each trie node.
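
A minimal sketch of that augmentation (the class and method names are my own, not code from this answer): each node carries the summed popularity of the names beneath it, and its children can then be walked in decreasing-popularity rather than alphabetical order. Producing exact top-5 answers would still need something like the cached per-node lists of the first answer.

    from collections import defaultdict

    class PopularityTrie:
        def __init__(self):
            self.children = defaultdict(PopularityTrie)
            self.popularity = 0   # summed frequency of every name below this node
            self.freq = 0         # frequency of the name ending exactly here
            self.name = None

        def insert(self, freq, name):
            node = self
            node.popularity += freq
            for letter in name:
                node = node.children[letter]
                node.popularity += freq   # popularity is maintained in each node
            node.freq += freq
            node.name = name

        def ordered_children(self):
            # children in order of popularity, rather than alphabetically
            return sorted(self.children.items(), key=lambda kv: -kv[1].popularity)

    t = PopularityTrie()
    for freq, name in [(5913, 'Daniel'), (889, 'Danny'), (820, 'Dana'),
                       (272, 'Dan'), (60, 'Dane')]:
        t.insert(freq, name)

    node = t
    for letter in 'Dan':
        node = node.children[letter]
    print([(letter, child.popularity) for letter, child in node.ordered_children()])
    # [('i', 5913), ('n', 889), ('a', 820), ('e', 60)]
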

0

Here is an idea on how to do this:

Build a trie of the strings and store an integer with each node of the tree. This integer indicates the number of names that pass through that node, so you must increment every node along a name's path when that name is inserted into the trie.

Then you can determine the top names by greedily choosing the nodes with the highest values.

Formally, this is the same as any trie-construction algorithm for strings, but with the added step of incrementing the integers.
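
A sketch of one exact way to realise the greedy selection (class and variable names are my own; note I increment each node's counter by the name's frequency rather than by one, so the counter is an upper bound on the frequency of any single name below that node):

    import heapq
    from collections import defaultdict
    from itertools import count

    class CountTrie:
        def __init__(self):
            self.children = defaultdict(CountTrie)
            self.total = 0   # summed frequency of all names passing through this node
            self.freq = 0    # frequency of the name ending exactly here
            self.name = None

        def insert(self, freq, name):
            node = self
            node.total += freq
            for letter in name:          # increment every node along the name's path
                node = node.children[letter]
                node.total += freq
            node.freq += freq
            node.name = name

        def top(self, prefix, k=5):
            node = self
            for letter in prefix:
                if letter not in node.children:
                    return []
                node = node.children[letter]
            # Best-first search: always expand the highest-valued entry. Since a
            # node's total bounds every name below it, names pop out in true order.
            tie = count()                # tie-breaker so the heap never compares nodes
            heap = [(-node.total, next(tie), node)]
            results = []
            while heap and len(results) < k:
                prio, _, item = heapq.heappop(heap)
                if isinstance(item, CountTrie):
                    if item.name is not None:
                        heapq.heappush(heap, (-item.freq, next(tie), item.name))
                    for child in item.children.values():
                        heapq.heappush(heap, (-child.total, next(tie), child))
                else:                    # an actual name: nothing left can beat it
                    results.append((-prio, item))
            return results

    t = CountTrie()
    for freq, name in [(5913, 'Daniel'), (889, 'Danny'), (820, 'Dana'),
                       (272, 'Dan'), (60, 'Dane')]:
        t.insert(freq, name)
    print(t.top('Dan'))
    # [(5913, 'Daniel'), (889, 'Danny'), (820, 'Dana'), (272, 'Dan'), (60, 'Dane')]
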

0

If you need fast lookups, the only real solution is to pre-compute the answers for every prefix. That is fine if the data does not change, but you need a way to keep the load time down.

I would suggest using DBM to store the pre-computed dictionary. It is basically a dictionary whose contents are stored on disk and looked up only when you access an item. See http://docs.python.org/library/anydbm.html for details. The only drawback is that the values must be strings, so you would need to store, say, the top five entries joined by a separator and split them apart on lookup.
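
A minimal sketch of that approach (the file name and the toy dict are placeholders; on Python 3 the module is dbm rather than anydbm):

    import dbm

    # Stand-in for the real prefix -> top-5 dictionary computed earlier.
    pre2suggest = {'OL': [('OLIVER', 8427), ('OLLIE', 1130), ('OLLY', 556)]}

    # One-off build step: DBM values must be strings/bytes, so join the names.
    with dbm.open('suggest.db', 'c') as db:
        for prefix, namefs in pre2suggest.items():
            db[prefix] = '|'.join(name for name, freq in namefs)

    # At lookup time nothing is loaded up front; split the stored string back apart.
    with dbm.open('suggest.db', 'r') as db:
        print(db['OL'].decode().split('|'))   # ['OLIVER', 'OLLIE', 'OLLY']
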

This will have a much faster startup time than pickle, since the database does not need to be loaded up front. It is also much simpler than using sqlite.

0

Source: https://habr.com/ru/post/1414191/

