The fastest C ++ algorithm for testing strings on a list of predefined seeds (case insensitive)

Question

The fastest C ++ algorithm for testing strings on a list of predefined seeds (case insensitive)

I have a list of seed lines, about 100 predefined lines. All lines contain only ASCII characters.

std::list<std::wstring> seeds{ L"google", L"yahoo", L"stackoverflow"};

My application constantly gets a lot of lines that can contain any characters. I need to check each row received and decide if it contains any seeds or not. The comparison must be case insensitive.

I need the fastest algorithm to check the received string.

Now my application uses this algorithm:

 std::wstring testedStr; for (auto & seed : seeds) { if (boost::icontains(testedStr, seed)) { return true; } } return false;

This works well, but I'm not sure if this is the most efficient way.

How can I implement an algorithm to achieve better performance?

This is a Windows application. The application receives valid std::wstring strings.

Update

For this task, I implemented Aho-Korasik Algo. If someone could view my code, that would be great - I don't have much experience working with such algorithms. Link to implementation: gist.github.com

+53

c ++ string algorithm windows

Victor Mezrin May 29 '16 at 11:39

source share

7 answers

If there are a finite number of matching lines, this means that you can build a tree that, after reading from root to leaves, similar lines will occupy similar branches.

This is also called trie or Radix Tree .

For example, we can have the lines cat, coach, con, conch , as well as dark, dad, dank, do . Their trick may look like this:

Searching for one of the words in the tree will search for the tree, starting from the root. Bringing it to a leaf corresponds to a match with a seed. Regardless, each character in the string must correspond to one of its children. If this is not the case, you can stop the search (for example, you will not consider words starting with "g" or any words starting with "cu").

There are various algorithms for building a tree, as well as for finding it, as well as for changing it on the fly, but I thought that I would give a conceptual overview of the solution instead of the concrete one, since I do not know the best algorithm for it.

Conceptually, an algorithm that you could use to search in a tree would be associated with the idea of creating a radix type of a fixed number of categories or values that a character in a string could take at a given time.

This allows you to check a single word on a word-list . Since you are looking for this list of words as substrings of the input string, there will be more than that.

Edit:. As mentioned in other answers, the Aho-Corasick string matching algorithm is a complex algorithm for performing string matching, consisting of a trie with additional links for accepting “shortcuts” through the tree and have a different search pattern to accompany this. (It is interesting to note that Alfred Aho is also the author of the popular compiler textbook Compilers: Principles, Methods and Tools, as well as the textbook on algorithms Design and Analysis of Computer Algorithms. He is also a former Bell member. Margaret J. Korasik too much public information about yourself.)

+56

VermillionAzure May 29 '16 at 12:06

source share

You should try the pre-existing regex utility, it may be slower than your manual algorithm, but regex is a comparison of several features, so most likely it will be several times faster than a hash map or a simple comparison with all strings. I believe that the regex implementation may already use the Aho-Corasick algorithm mentioned by RiaD, so basically you'll have at your disposal a well-tested and fast implementation.

If you have C ++ 11, you already have a standard regex library

 #include <string> #include <regex> int main(){ std::regex self_regex("google|yahoo|stackoverflow"); regex_match(input_string ,self_regex); }

I expect this to create the best minimal match tree, so I expect it to be very fast (and reliable!)

+16

GameDeveloper May 29 '16 at 13:26

source share

One of the fastest ways is to use the suffix tree https://en.wikipedia.org/wiki/Suffix_tree , but this approach has a huge drawback - a complex data structure with difficult construction. This algorithm allows you to build a tree from a string in linear complexity https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm

+10

LmTinyToon May 29 '16 at 11:46

source share

Edit: As Mattiu M. pointed out, the OP asked if the string contains a keyword. My answer only works when the string is equal to the keyword, or if you can split the string, for example. space character.

Especially with a lot of possible candidates and knowing them at compile time using an ideal hash function with a tool like gperf, it's worth a try. The basic principle is that you seed the generator with your seed and generate a function containing a hash function, which has no collisions for all initial values. At run time, you give the function a string and computes the hash, and then checks to see if this is the only possible candidate matching the hashvalue value.

The execution cost is hashing the string and then comparing it with the only possible candidate (O (1) for sample size and O (1) for string length).

To make the comparison case insensitive, you simply use a tolower for the seed and your string.

+6

H. Idden May 29 '16 at 19:43

source share

Since the number of rows is small (~ 100), you can use the following algorithm:

Calculate the maximum word length you have. Let it be N.
Create an checksum array int checks[N]; .
Let the checksum be the sum of all the characters in the search phrase. Thus, you can calculate such a checksum for each word from your list (this is known at compile time) and create std::map<int, std::vector<std::wstring>> , where int is the checksum strings, and the vector should contain all of your strings with this checksum. Create an array of such maps for each length (up to N), this can also be done at compile time.
Now follow the large line with the pointer. When the pointer points to the X character, you must add the X char value to all checks integers, and for each of them (numbers from 1 to N) remove the character value (XK), where K is the number of integers in checks . This way you will always have the correct checksum for the entire length stored in the checks array. After this search on the map there are lines with such a pair (length and checksum), and if it exists, compare it.

It should give a false positive result (when the checksum and length are equal, but the phrase is not very).

So let R be the length of a large string. Then, passing through it, we take O (R). At each step, you perform N operations with a small number of "+" (char), N operations with a small number of "-" (char), it is very fast. At each step, you will need to look for the counter in the checks array, and this is O (1), because it is one block of memory.

Also, each step you will need to find a map in the map array, which will also be O (1), since this is also one memory block. And inside the card you will have to look for a line with the correct checksum for the log (F), where F is the size of the card, and usually it will contain no more than 2-3 lines, so we can generally pretend that it is also O (1).

You can also check, and if there are no lines with the same checksum (which should happen with a high probability in just 100 words), you can generally discard the card, saving pairs instead of the card.

So, finally, this should give O (R), with a fairly small O. This method of calculating checksum can be changed, but it is quite simple and completely fast, with very few false-positive reactions.

+3

Arkady May 29 '16 at 3:10

source share

As a DarioOOs answer option, you can get a faster implementation of regular expression matching by encoding lex parser for your strings.Although it is usually used with yacc , this is the case when lex does the job itself and lex parsers are usually very efficient.

This approach may fall if all your lines are long, and then an algorithm like Aho-Corasick , Commentz-Walter or Rabin-Karp will probably offer significant improvements, and I doubt that the lex implementation uses any such algorithm.

This approach is more complicated if you need to configure lines without reconfiguration, but since flex is open source, you can cannibalize its code.

+3

PJTraill Jun 02 '16 at 12:02 on

source share

RiaD · Accepted Answer · 2016-05-29 12:03

You can use the Aho-Corasick algorithm

He builds a trie / automaton, where some vertices are marked as terminals, which means the string has seeds.

It is built in O(sum of dictionary word lengths) and gives an answer in O(test string length)

Benefits:

It works with several dictionaries, and the check time does not depend on the number of words (if we do not consider cases when it is not suitable for memory, etc.)
The algorithm is not difficult to implement (at least compared to suffix structures)

You can make case insensitive by omitting each character if ASCII (non-ASCII characters do not match)

The fastest C ++ algorithm for testing strings on a list of predefined seeds (case insensitive)

More articles: