Dictionary search, high performance

I need to create some kind of dictionary that also stores, for each word, the frequency with which it occurs in the language. Typically this would be implemented with a std::unordered_map, right? Now here's the catch: I want to find all the words, and their frequencies, that match some regular expression, and performance is my biggest concern.
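For context, a minimal sketch of the plain hash-map approach (the frequency values are made up):

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main()
    {
        // Word -> relative frequency in the language.
        std::unordered_map<std::string, double> freq = {
            {"the", 0.0714}, {"deer", 0.00001}
        };

        // Exact lookups are cheap, but there is no way to ask the map
        // for "all words matching d.*r" without visiting every entry.
        auto it = freq.find("deer");
        if (it != freq.end()) {
            std::cout << it->first << " " << it->second << "\n";
        }
    }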

I don't think I can avoid iterating over a range of elements and simply checking whether each one matches the pattern. So I thought it would be wise to use a pair of vectors instead of a map:

    using namespace std;
    typedef vector<pair<string, double>> Dictionary;

    vector<Dictionary::const_iterator> index;   // index[c] = first entry whose word starts with c
    Dictionary dict;                            // sorted by word
    // ...
    for_each(index['d'], index['e'], DoSomething);

This would let me iterate efficiently over all the words starting with, in this case, "d". Of course, this only helps if I already know the first letter of my regular expression, which will often not be the case. In addition, if I already know the whole word, with no uncertainty, and just want to look up its frequency, I have to walk through the whole range until I find it; a map would let me find it faster. For instance, when searching for the word "deer":

    Dictionary::const_iterator it =
        find_if(index['d'], index['e'],
                [](pair<string, double> const &pr)   // lambda
                { return pr.first == "deer"; });

Not optimal! One solution might be to use different dictionary implementations for different situations, and although memory is not much of a concern, that feels like a clumsy workaround.
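(For what it is worth, here is a minimal sketch of what keeping two implementations side by side might look like, assuming memory really is no object; the DualDictionary name and layout are my own illustration, not an existing library.)

    #include <algorithm>
    #include <array>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using Dictionary = std::vector<std::pair<std::string, double>>;

    struct DualDictionary {
        Dictionary sorted;                                // sorted by word, for range scans
        std::unordered_map<std::string, double> exact;    // for O(1) exact lookups
        std::array<Dictionary::const_iterator, 27> index; // index[c - 'a'] = first word >= letter c

        explicit DualDictionary(Dictionary words)
            : sorted(std::move(words))
        {
            std::sort(sorted.begin(), sorted.end());
            for (auto const& entry : sorted) {
                exact.emplace(entry.first, entry.second);
            }
            for (int c = 0; c < 26; ++c) {
                std::string letter(1, char('a' + c));
                index[c] = std::lower_bound(
                    sorted.begin(), sorted.end(), letter,
                    [](auto const& entry, std::string const& key) { return entry.first < key; });
            }
            index[26] = sorted.end();
        }
    };

Exact lookups would go through exact, and range scans such as the 'd'-to-'e' example would use index, at the cost of storing every word twice.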

Any suggestions?

2 answers

Along the lines you are already thinking, a std::vector<std::pair<boost::regex, int>> is likely to be the most effective approach: you iterate over it, trying to find a match.
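One way to read that suggestion: keep a pattern-to-identifier table and test each dictionary word against every pattern in turn. The sketch below uses std::regex rather than boost::regex purely so it compiles without Boost; the patterns and frequency values are made up.

    #include <iostream>
    #include <regex>
    #include <string>
    #include <utility>
    #include <vector>

    int main()
    {
        // One (pattern, id) pair per expression we care about.
        std::vector<std::pair<std::regex, int>> patterns;
        patterns.emplace_back(std::regex("d.*r"), 0);
        patterns.emplace_back(std::regex("ca+t"), 1);

        std::vector<std::pair<std::string, double>> dict = {
            {"cat", 0.0009}, {"deer", 0.00001}, {"door", 0.0004}
        };

        // For every word, try each pattern in turn. The cost grows with the
        // number of patterns, which is what the DFA approach described next avoids.
        for (auto const& entry : dict) {
            for (auto const& pat : patterns) {
                if (std::regex_match(entry.first, pat.first)) {
                    std::cout << entry.first << " matched pattern " << pat.second << "\n";
                    break;
                }
            }
        }
    }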

The best solution, if you are ready to do the work, is to implement your own regular expression class without capture (the (...) operator in most regular expression syntaxes). Without capture, it is fairly easy to convert an expression into a pure DFA, and you can "or" regular expressions together, with each regular expression returning a different accept code. (I have my own regex class that works this way. For most applications it is not as flexible as Boost, because it does not support capture, but it does allow things like:

    RegularExpression t1( "expr1", 0 );
    RegularExpression t2( "expr2", 1 );
    // ...
    RegularExpression t = t1 | t2 /* | t3 | t4 | ... */ ;

When it matches, t will return 0 if expr1 matched, 1 if expr2 matched, and so on; you can use the match identifier as an index into a vector of ints. (It returns -1 if there is no match.)

Thus, the search time is linear in the length of the input, regardless of the number of expressions you are trying to match. (My RegularExpression class was developed over 20 years ago for generating compiler front ends; about 15 years ago I reworked it to handle UTF-8 input.)
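Assuming an interface roughly along those lines, usage over the dictionary might look like the fragment below; the match() member, its signature, and its return convention are my guesses based on the description, since the actual class is not publicly available.

    // Hypothetical usage sketch: RegularExpression and match() are assumptions
    // based on the description above, not a real, available interface.
    RegularExpression t = t1 | t2;        // single combined DFA, accept codes 0 and 1
    std::vector<int> counts(2, 0);

    for (auto const& entry : dict) {      // dict: vector<pair<string, double>>
        int code = t.match(entry.first);  // one pass over the word; -1 means no match
        if (code >= 0) {
            ++counts[code];               // the accept code doubles as an index
        }
    }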

For many years the code was available on the Internet, but I am not maintaining a web page at the moment, so unless someone has kept an old copy, it is no longer up. I would be happy to send it to you, but be warned that the library has not been maintained for a while, so it may not be trivial to get it to compile with a modern compiler. (It was originally written in pre-standard C++, and still contains a number of workarounds to get it to compile with things like Sun CC 4.x.)


Your best bet is to use a finite-state transducer, for example http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html .

The map, as you said, will use a lot of space.

You can think of the transducer as one big compiled regular expression (a regex is essentially a compiled automaton).
