I need to classify a large list of URLs (several million lines) as belonging to a certain category or not. I have a second list of substrings; if any of them is present in a URL, that URL falls into the category. Call it category A.
The list contains about 10 thousand substrings. What I did was go line by line through the substring file, looking for a match in each URL; if one is found, the URL belongs to category A. In my tests this takes a lot of time.
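For reference, the line-by-line check described above can be sketched roughly as follows. This is a simplified illustration, not the exact code used: the class name `NaiveUrlMatcher`, the method `belongsToCategoryA`, and the sample URLs and substrings are all hypothetical, and file reading is left out.

```java
import java.util.Arrays;
import java.util.List;

public class NaiveUrlMatcher {

    // Returns true if the URL contains at least one of the category A substrings.
    // (Hypothetical helper; it tries every substring in turn.)
    static boolean belongsToCategoryA(String url, List<String> substrings) {
        for (String s : substrings) {
            if (url.contains(s)) {
                return true; // stop at the first match
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample data; the real lists would be read from files.
        List<String> substrings = Arrays.asList("/sports/", "football", "nba.com");
        List<String> urls = Arrays.asList(
                "http://example.com/sports/today",
                "http://example.org/news/politics");

        for (String url : urls) {
            System.out.println(url + " -> " + belongsToCategoryA(url, substrings));
        }
    }
}
```

In the worst case (a URL that matches nothing) every substring is tried, so the work grows with the number of URLs times the number of substrings, which here is several million times about 10 thousand `contains` calls.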
I am not a computer science student, so I do not know much about algorithm optimization. Is there a way to do this faster? Simple ideas are fine. The programming language is not a big concern, but Java or Perl would be preferable.
The list of substrings to match will not change much. The lists of URLs, however, are different each time, so the check has to be run again whenever I get a new one. The bottleneck seems to be the URLs, as they can be very long.
java optimization search perl
sfactor