Finding a faster way to search for strings

I need to classify a large list of URLs (several million lines) as belonging to a certain category or not. I have another list of substrings which, if present in a URL, put that URL into the category. Call it category A.

The list of substrings to check against has about 10 thousand entries. What I did was simply go line by line through the substring file and look for a match; if one is found, the URL belongs to category A. In testing I found that this takes a lot of time.

I am not a computer science student, so I do not know much about optimizing algorithms. Is there a way to do this faster? Simple ideas are welcome. The programming language is not a big issue, but Java or Perl is preferable.

The list of substrings to match will not change much. However, I receive different lists of URLs, so the check has to be run each time a new list arrives. The bottleneck seems to be the URLs, as they can be very long.

+8
java optimization search perl
9 answers

Yes, I have implemented the Aho-Corasick algorithm in Java for exactly the problem you describe, and it showed a consistent improvement of about 180x over a naive implementation (what you are doing now). There are several implementations available online, although I would tweak them for better performance. Note that the complexity of the solution is bounded by the length of the word (in your case the URL), not by the number of substrings; moreover, only a single pass over each URL is needed on average.
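To make the idea concrete, here is a minimal, self-contained Java sketch (this is not our production implementation, and the example substrings are invented): build a trie over the substrings, wire up failure links with a breadth-first pass, then classify each URL in a single scan.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Minimal Aho-Corasick sketch (illustrative only, not production code).
    public class AhoCorasick {
        static final int ALPHABET = 128; // assumes ASCII substrings and URLs

        static class Node {
            Node[] next = new Node[ALPHABET];
            Node fail;
            boolean terminal; // true if some substring ends at this node (or via its failure chain)
        }

        private final Node root = new Node();

        public void addPattern(String p) {
            Node cur = root;
            for (char c : p.toCharArray()) {
                if (cur.next[c] == null) cur.next[c] = new Node();
                cur = cur.next[c];
            }
            cur.terminal = true;
        }

        // Build failure links breadth-first and turn the trie into a goto automaton.
        public void build() {
            Deque<Node> queue = new ArrayDeque<>();
            root.fail = root;
            for (int c = 0; c < ALPHABET; c++) {
                if (root.next[c] == null) {
                    root.next[c] = root;              // missing edge at the root loops back to the root
                } else {
                    root.next[c].fail = root;
                    queue.add(root.next[c]);
                }
            }
            while (!queue.isEmpty()) {
                Node node = queue.poll();
                node.terminal |= node.fail.terminal;  // inherit matches reachable via the failure link
                for (int c = 0; c < ALPHABET; c++) {
                    if (node.next[c] == null) {
                        node.next[c] = node.fail.next[c];      // missing edge follows the failure transition
                    } else {
                        node.next[c].fail = node.fail.next[c];
                        queue.add(node.next[c]);
                    }
                }
            }
        }

        // Single pass over the URL; returns true as soon as any substring matches.
        public boolean matches(String url) {
            Node cur = root;
            for (char c : url.toCharArray()) {
                if (c >= ALPHABET) { cur = root; continue; }   // crude handling of non-ASCII characters
                cur = cur.next[c];
                if (cur.terminal) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            AhoCorasick ac = new AhoCorasick();
            // Hypothetical category-A substrings; replace with the real 10k-entry list.
            for (String s : List.of("shop", "login", "cart")) ac.addPattern(s);
            ac.build();
            System.out.println(ac.matches("https://example.com/cart/view"));  // true
            System.out.println(ac.matches("https://example.com/about"));      // false
        }
    }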

PS: we have used this question in interviews, so there are many ways to solve it. I am suggesting the one we use in production code, which (for now) outperforms all the other solutions.

Edit: I previously wrote the wrong algorithm name; fixed.

+8

Perl is very good at optimizing searches over long lists of alternative strings in a regular expression (up to a certain overall compiled length of the regex, beyond which it falls back to a less efficient mechanism). You should be able to build a regex that matches a particular category, for example:

    $catAre = join( '|', map quotemeta, @catAstrings );
    $catAre = qr/$catAre/;
+4

Of course, various optimization approaches are possible. Given your background, I will sketch a simple one. Assume the list of substrings does not change very often.

  • Generate one huge regular expression from all substrings.
  • Compile this regular expression once; see, for example, the Pattern class in Java. Keep a reference to the compiled regex.
  • Use the same compiled regular expression to match each URL.
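A minimal sketch of those three steps might look like this (the substring and URL lists are placeholders):

    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    // Sketch: one big alternation over all substrings, compiled once and reused.
    public class CategoryMatcher {
        public static void main(String[] args) {
            List<String> substrings = List.of("shop", "login", "cart");  // hypothetical list
            String alternation = substrings.stream()
                    .map(Pattern::quote)                 // escape regex metacharacters in each substring
                    .collect(Collectors.joining("|"));
            Pattern categoryA = Pattern.compile(alternation);  // compile once, keep the reference

            List<String> urls = List.of("https://example.com/cart/view",
                                        "https://example.com/about");
            for (String url : urls) {
                boolean inCategoryA = categoryA.matcher(url).find();  // find() does a substring match
                System.out.println(url + " -> " + inCategoryA);
            }
        }
    }

Note that java.util.regex generally tries the alternatives one by one, so with 10,000 of them this may still be noticeably slower than an automaton-based matcher such as Aho-Corasick; it is, however, very easy to set up.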
+3

I would suggest using the venerable grep instead of a programming language for this task. It uses the fast Boyer-Moore string search algorithm, which should be sufficient for a few million lines.

+2

I have done this before in Perl, comparing a list of ~13k keywords against an incoming Twitter data stream to find all tweets matching any of the keywords (and which keywords each one matches). In rough terms, the code looks like this:

    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new;
    $ra->add(@keywords);
    my $regex = $ra->re;

    for my $tweet (@tweets) {
        my @matches = $tweet =~ /$regex/g;
        # do whatever with @matches...
    }

Note that this uses Regexp::Assemble to build the regular expression; it is not part of the core Perl distribution, so you will need to install it from CPAN if you want to adapt this code.

If you are using Perl 5.10 or later, there is also the smart match operator (~~), which can do something similar without requiring additional modules.

+2

You can group the substrings into classes that share the same prefix. This should cut the time by a significant margin.

If you are currently searching for matches by shifting through the string one character at a time, you can also gain some speed by using a better algorithm (as regular expression engines do).
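A very rough Java sketch of the grouping idea (the pattern strings are made up): bucket the substrings by their first character, so that at each position of the URL only candidates that could possibly start there are tested.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of prefix grouping by first character; illustrative only.
    public class PrefixBuckets {
        public static void main(String[] args) {
            List<String> patterns = List.of("shop", "shopping", "login", "cart"); // hypothetical
            Map<Character, List<String>> byFirstChar = new HashMap<>();
            for (String p : patterns) {
                byFirstChar.computeIfAbsent(p.charAt(0), k -> new ArrayList<>()).add(p);
            }

            String url = "https://example.com/shop/item";
            boolean match = false;
            outer:
            for (int i = 0; i < url.length(); i++) {
                // Only test patterns whose first character matches the current position.
                for (String p : byFirstChar.getOrDefault(url.charAt(i), List.of())) {
                    if (url.startsWith(p, i)) { match = true; break outer; }
                }
            }
            System.out.println(match); // true
        }
    }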

+1

For Java libraries that implement common string search algorithms, see the answers at https://stackoverflow.com/questions/5564610/fast-alernative-for-stringindexofstring-str . Combined with parallelization, you should be able to scan millions of URLs quickly. This is easy enough to do; you should probably try it and see whether the time is acceptable before looking much further into optimization.
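For illustration, here is a hedged sketch of the simplest version: a plain substring test per URL, spread across cores with a parallel stream. The substring and URL lists are placeholders, and isCategoryA() is a helper introduced just for this sketch; a faster matcher (compiled regex, Aho-Corasick, or one of the linked libraries) could be swapped in there.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Sketch: naive per-URL substring test, parallelised across available cores.
    public class ParallelScan {
        // Hypothetical category-A substrings; in practice this would be the ~10k-entry list.
        static final List<String> SUBSTRINGS = List.of("shop", "login", "cart");

        // Helper introduced for this sketch; replace the body with a faster matcher if needed.
        static boolean isCategoryA(String url) {
            for (String s : SUBSTRINGS) {
                if (url.contains(s)) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            List<String> urls = List.of("https://example.com/cart/view",
                                        "https://example.com/about");  // placeholder URLs
            Map<String, Boolean> results = urls.parallelStream()
                    .collect(Collectors.toMap(u -> u, ParallelScan::isCategoryA));
            results.forEach((url, match) -> System.out.println(url + " -> " + match));
        }
    }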

+1

I originally wrote this as a comment, but then realized it is more suitable as an answer.
You could use an information retrieval engine (e.g. Apache Lucene in Java) and index the URLs as documents. Then, after indexing, you can iterate over the substrings, run a search for each of them, and the results will be the matching URLs.
Pros:
* The search does not require iterating over all URLs for each query.
* If you later need to intersect or combine substrings/queries, the library gives you this functionality.
Cons:
* Indexing will take some time...
* You may need additional RAM / disk space for the index.

I think this approach is worth exploring; the time spent on indexing may well pay for itself in faster searching.

+1

I am currently working on this issue. I came to this conclusion:

Aho-Corasick consumes more memory while building its tree. If memory is not a problem, then use it. But also take a look at the HAT-trie. It is a combination of a hash table and a trie: it builds a trie down to some level, and the remaining characters form a key that is stored in a hash table.

Sorry for the technical language, but I think the HAT-trie is the best option if you are looking up specific URLs in a list of URLs. (I built a HAT-trie that consumes 12 MB to store 6 lakh (600,000) URLs.)

0
