Most efficient way to find partial string matches in a large string file (python)

I have a file containing the title of every Wikipedia article. I need to find all the article titles that might be a match for a given word. For example, I might have the word hockey, but the Wikipedia article I actually want is Ice_hockey. The search should also be case-insensitive.

I am using Python. Is there a more efficient way than just doing a plain string search? Ideally I will perform this search 500 or 1000 times per minute. If going line by line is my only option, are there some optimizations I can apply?

I think there are several million lines in the file.

Any ideas?

Thanks.

+6
python string search large-files
3 answers

Greg's answer is good if you want to match individual words. If you want to match substrings, you will need something more complex, such as a suffix tree (http://en.wikipedia.org/wiki/Suffix_tree). Once built, a suffix tree can efficiently answer queries for arbitrary substrings, so in your example it could match "Ice_hockey" when someone searches for "hock".
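
A full suffix tree is involved to implement in Python; as a rough sketch of the same substring-query idea, here is a naive suffix-array approach using the standard-library bisect module. The helper names and sample titles are invented for illustration, and storing every suffix of several million titles would cost a lot of memory, so treat this as a demonstration of the concept rather than a production implementation:

    import bisect

    def build_suffix_index(titles):
        # Record every suffix of every lowercased title, then sort, so
        # any substring query becomes a prefix search over the suffixes.
        index = []
        for i, title in enumerate(titles):
            low = title.lower()
            for start in range(len(low)):
                index.append((low[start:], i))
        index.sort()
        return index

    def substring_search(index, titles, query):
        # Case-insensitive: return every title containing `query`.
        q = query.lower()
        pos = bisect.bisect_left(index, (q,))
        hits = set()
        while pos < len(index) and index[pos][0].startswith(q):
            hits.add(titles[index[pos][1]])
            pos += 1
        return hits

    titles = ["Ice_hockey", "Field_hockey", "Hockey_stick", "Baseball"]
    idx = build_suffix_index(titles)
    print(substring_search(idx, titles, "hock"))
    # -> {'Ice_hockey', 'Field_hockey', 'Hockey_stick'}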

+3

If you have a fixed dataset and variable queries, the usual technique is to reorganize the dataset into something that can be searched easily. At an abstract level, you could break each article title into individual lowercase words and add each one to a Python dictionary. Then, whenever you receive a query, convert the query word to lowercase and look it up in the dictionary. If each dictionary value is a list of titles, you can easily find all the titles that match a given query word.
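
As a minimal sketch of that dictionary approach (the function name and sample titles here are made up for the example):

    from collections import defaultdict

    def build_word_index(titles):
        # Map each lowercase word to the list of titles containing it.
        index = defaultdict(list)
        for title in titles:
            # Wikipedia titles use underscores as word separators.
            for word in title.lower().split("_"):
                index[word].append(title)
        return index

    index = build_word_index(["Ice_hockey", "Field_hockey", "Hockey_stick"])
    print(index["hockey"])
    # -> ['Ice_hockey', 'Field_hockey', 'Hockey_stick']

Building the index is a one-time cost; each of your 500 or 1000 queries per minute then becomes a single dictionary lookup, which is effectively constant time.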

This works for exact words, but you will need to consider whether you also want to match related forms of a word, for example matching "smoking" when the query is "smoke".
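
If you do want that, one common option is to stem both the indexed words and the query, for instance with NLTK's Porter stemmer (NLTK is a third-party package, not something this answer specifies):

    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()
    # Index and look up stems instead of raw words,
    # so inflected forms map to the same dictionary key.
    print(stemmer.stem("smoking"))  # 'smoke'
    print(stemmer.stem("smokes"))   # 'smoke'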

+3

I would suggest you put your data into an SQLite database and use SQL queries to search it.
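
For example, with the standard-library sqlite3 module (the table and file names here are made up), a LIKE pattern gives case-insensitive substring matching, though note that a leading-wildcard LIKE cannot use an index and will scan the whole table:

    import sqlite3

    conn = sqlite3.connect("titles.db")
    conn.execute("CREATE TABLE IF NOT EXISTS titles (name TEXT)")
    conn.executemany(
        "INSERT INTO titles (name) VALUES (?)",
        [("Ice_hockey",), ("Field_hockey",), ("Baseball",)],
    )
    conn.commit()

    # SQLite's LIKE is case-insensitive for ASCII text by default.
    rows = conn.execute(
        "SELECT name FROM titles WHERE name LIKE ?", ("%hockey%",)
    ).fetchall()
    print([r[0] for r in rows])  # -> ['Ice_hockey', 'Field_hockey']
    conn.close()

For whole-word matching at higher query volumes, SQLite's FTS5 full-text search extension would be a better fit than LIKE.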

+1
