Efficient algorithm for finding one of several strings in a text?

I need to search incoming, fairly short fragments of text for occurrences of given strings. The strings are constant for the entire session, and there are not many of them (~10). An additional simplification is that none of the strings is contained in any other.

I am currently using a regex alternation of the form str1|str2|... This task is performance-critical, so I wonder if I can improve on it. Not that I expect to program better than the regex authors, but a special-purpose implementation may well be more effective than a general one.

Since the strings remain constant for a long time, I can afford to build a data structure, such as a state-transition table, in advance.

For example, if the strings are abcx, bcy, and cz, and I have read abc so far, I should be in a combined state meaning that I am either 3 chars into string 1, 2 chars into string 2, or 1 char into string 3. Then reading x next moves me to the "string 1 matched" state, and so on; any char other than x, y, or z moves me back to an earlier state, and I never need to rewind back to the b.
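What the question sketches is essentially the Aho-Corasick automaton: one trie-shaped transition table over all strings, plus failure links so the text cursor never rewinds. A rough sketch in Python (names and structure are my own, not a reference implementation):

```python
from collections import deque

def build_automaton(needles):
    """Aho-Corasick-style combined automaton: state 0 is the start,
    goto[s] maps a character to the next state, output[s] lists the
    needles that end when state s is entered."""
    goto, output = [{}], [set()]
    for needle in needles:
        s = 0
        for ch in needle:
            if ch not in goto[s]:
                goto.append({})
                output.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].add(needle)
    # failure links: where to continue when the next char does not match;
    # they are what lets us avoid rewinding the text
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            output[t] |= output[fail[t]]
    return goto, fail, output

def search(text, needles):
    """Return (start_offset, needle) for every occurrence."""
    goto, fail, output = build_automaton(needles)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]          # follow failure links; never re-read text
        s = goto[s].get(ch, 0)
        for needle in output[s]:
            hits.append((i - len(needle) + 1, needle))
    return hits
```

For the example strings, feeding abcz reports cz at offset 2 without ever re-reading the b.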

Any ideas or links appreciated.

+4
8 answers

Have a look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html

The existence of a recursive / non-recursive distinction suggests rather strongly that Boost is not necessarily a plain linear state machine. So there is a good chance you can do better for your specific problem.

The best answer depends on how many haystacks you have and on the minimum needle size. If the smallest needle is longer than a few characters, you may be able to do a little better than a generalized regex library.

Basically, all string searches work by testing for a match at the current position (the cursor) and, if none is found, trying again with the cursor slid further to the right.
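As a baseline, that test-then-slide loop can be written down directly (a deliberately naive sketch; names are mine):

```python
def naive_search(haystack, needles):
    """Baseline: at each cursor position, test every needle;
    on failure, slide the cursor one character to the right."""
    hits = []
    for cursor in range(len(haystack)):
        for needle in needles:
            if haystack.startswith(needle, cursor):
                hits.append((cursor, needle))
    return hits
```

Every refinement below either speeds up the per-position test or slides the cursor by more than one.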

Knuth-Morris-Pratt builds a DFSM from the string (or strings) you are looking for, so that the test and the cursor movement are combined into a single operation. It was originally designed for a single needle, though, so you would need to support fallback when one match could also be a proper prefix of another. (Keep that in mind when you want to reuse your code.)
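The combine-test-and-advance idea is the essence of Knuth-Morris-Pratt: its failure table plays the role of the state machine, so the text cursor only ever moves forward. A minimal single-needle sketch (function name is mine):

```python
def kmp_search(haystack, needle):
    """Knuth-Morris-Pratt: the failure table merges 'test' and
    'advance', so the text cursor never moves backwards."""
    # fail[i] = length of the longest proper prefix of needle[:i+1]
    # that is also a suffix of it
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    hits, k = [], 0
    for i, ch in enumerate(haystack):
        while k and ch != needle[k]:
            k = fail[k - 1]      # fall back in the needle, not the text
        if ch == needle[k]:
            k += 1
        if k == len(needle):     # full match ends at position i
            hits.append(i - len(needle) + 1)
            k = fail[k - 1]      # keep going, allowing overlapping hits
    return hits
```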

Another tactic is to slide the cursor more than one character to the right, whenever possible. Boyer-Moore does this. It is usually designed for a single needle. Build a table of every character and the rightmost position at which it appears in the needle (if at all). Now set the cursor to len(needle)-1. The table entry tells you either (a) at what offset to the left of the cursor the needle might be found, or (b) that you can move the cursor len(needle) further to the right.
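The bad-character table described above is easiest to see in the Horspool simplification of Boyer-Moore (a sketch under the usual single-needle assumption; names mine):

```python
def horspool_search(haystack, needle):
    """Boyer-Moore-Horspool: the bad-character table lets the cursor
    jump several positions at a time."""
    m = len(needle)
    # skip[ch] = distance from ch's rightmost occurrence (excluding the
    # final position) to the end of the needle
    skip = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    hits, cursor = [], 0
    while cursor + m <= len(haystack):
        if haystack[cursor:cursor + m] == needle:
            hits.append(cursor)
        last = haystack[cursor + m - 1]
        # shift by the table entry, or by the whole needle length if the
        # character does not occur in the needle at all
        cursor += skip.get(last, m)
    return hits
```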

When you have more than one needle, building and using the table becomes more complex, but it can still save you an order of magnitude in probes. You can still build a DFSM, but instead of calling the general search method you call do_this_DFSM_match_at_this_offset().
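One way the multi-needle table can work, sticking with the Horspool idea: take the shortest needle length m, make each character's shift the minimum over all needles, and verify every needle at each stop. A sketch (my own simplification, not the Wu-Manber algorithm proper):

```python
def multi_horspool(haystack, needles):
    """Horspool extended to several needles: the shift for a character
    is the minimum shift over all needles, which keeps the skip safe
    at the cost of smaller jumps."""
    m = min(len(n) for n in needles)   # align on the shortest length
    skip = {}
    for needle in needles:
        prefix = needle[:m]            # only the first m chars matter
        for i, ch in enumerate(prefix[:-1]):
            skip[ch] = min(skip.get(ch, m), m - 1 - i)
    hits, cursor = [], 0
    while cursor + m <= len(haystack):
        for needle in needles:         # verify every needle here
            if haystack.startswith(needle, cursor):
                hits.append((cursor, needle))
        cursor += skip.get(haystack[cursor + m - 1], m)
    return hits
```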

Another tactic is to test more than 8 bits at a time. There is a spam killer that looks at 32-bit machine words at a time. It computes a simple hash of each word down to a 12-bit value and then looks in a table to see whether there is a hit. You store four entries per pattern (for offsets 0, 1, 2, and 3 from the beginning of the pattern), and with this method, despite thousands of patterns in the table, only one or two probes per 32-bit word of the subject string are needed.
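The word-at-a-time scheme might look roughly like this: hash each aligned 32-bit word down to 12 bits, and keep one table entry per pattern per alignment so a pattern is caught wherever it falls relative to word boundaries. The hash function and table layout here are my own guesses at such a scheme, and patterns must be at least 7 bytes long in this sketch:

```python
def hash12(word):
    """Fold 4 bytes into 12 bits (an arbitrary cheap hash; my choice)."""
    w = int.from_bytes(word, "little")
    return (w ^ (w >> 12) ^ (w >> 22)) & 0xFFF

def build_table(patterns):
    """One entry per pattern per alignment 0..3, keyed by the 12-bit
    hash of the 4 pattern bytes that land on a word boundary."""
    table = {}
    for pat in patterns:
        for k in range(4):             # requires len(pat) >= 7
            table.setdefault(hash12(pat[k:k + 4]), []).append((pat, k))
    return table

def word_scan(haystack, patterns):
    """Probe the table once per aligned 32-bit word of the haystack;
    only hash hits are verified byte-for-byte."""
    table = build_table(patterns)
    hits = []
    for w in range(0, len(haystack) - 3, 4):
        for pat, k in table.get(hash12(haystack[w:w + 4]), []):
            start = w - k
            if start >= 0 and haystack[start:start + len(pat)] == pat:
                hits.append((start, pat))
    return hits
```

Haystack and patterns are bytes here; a hash collision only costs one failed verification, never a missed match.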

So, in general, yes, you can go faster than regular expressions WHEN THE NEEDLES ARE CONSTANT.

+1

Look at the suffix tree.

+4

I looked through the answers, but none of them was explicit enough... and most basically came down to a few links.

What intrigues me here is the uniqueness of your problem: the solutions presented so far do not capitalize at all on the fact that we are looking for several needles in the haystack at once.

I would definitely look at KMP / Boyer-Moore, but I would not use them blindly (at least if you have time on your hands), because they are designed for a single needle, and I am pretty convinced we could exploit the fact that we have several strings and check them all at once, using a custom finite state machine (or custom tables for BM).

Of course, this is unlikely to improve the big-O (Boyer-Moore runs in 3n for each string, so it will be linear anyway), but you could probably gain a constant factor.

+1

Regex engine initialization is expected to have some overhead, so if there are no real regular expressions involved, plain C memcmp() should do well.

If you can specify input sizes and give some concrete use cases, we could build a benchmark (I find this very interesting).
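A minimal benchmark harness along those lines might compare the regex alternation against a direct byte-comparison scan (Python's startswith standing in for memcmp; all names are mine, and the sanity check assumes matches do not overlap):

```python
import re
import timeit

def regex_find(haystack, needles):
    """The questioner's approach: one alternation, compiled once."""
    pattern = re.compile("|".join(map(re.escape, needles)))
    return [(m.start(), m.group()) for m in pattern.finditer(haystack)]

def plain_find(haystack, needles):
    # memcmp-like baseline: compare every needle at every position
    return [(i, n) for i in range(len(haystack))
            for n in needles if haystack.startswith(n, i)]

def bench(haystack, needles, number=1000):
    """Return (regex_seconds, plain_seconds).  The sanity check assumes
    matches do not overlap, since finditer skips overlapping matches."""
    assert regex_find(haystack, needles) == plain_find(haystack, needles)
    t_re = timeit.timeit(lambda: regex_find(haystack, needles), number=number)
    t_pl = timeit.timeit(lambda: plain_find(haystack, needles), number=number)
    return t_re, t_pl
```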

Interesting: memcmp studies and timing differences

Regards,

STB

0

There is always Boyer-Moore.

0

In addition to the Rabin-Karp algorithm, my algorithms book suggests the Knuth-Morris-Pratt algorithm for string matching. For each search string you would build a finite state machine like the one you describe.

0

You can do this with the very popular Lex and Yacc tools, or their successors Flex and Bison. Use Lex to tokenize the input, compare your predefined strings against the tokens returned by the lexer, and perform the required action when a match is found. Many sites describe Lex and Yacc; one such site is http://epaperpress.com/lexandyacc/

0
