Best way to search in millions of wildcard-compatible file names (GLOBs)

I am working on a small search engine to display matching file names with a full path. and the important thing is that I need to provide a search for wildcard characters (GLOB), for example *.doc or *list*.xlx or *timesheet* or ???.doc , or something like that.

I found some related solution

Search strings matching pattern "abc: *: xyz" less than O (n)

but I'm looking for efficient algorithms that can find matches from millions of file names in less than a second, therefore better than O (n).

I am thinking of a two-phase algorithm with a substring array (Suffix array + prefix array) in the first phase and a regular RegEx search based on the results of the second phase of the first phase.

any help would be greatly appreciated ...

+6
string algorithm wildcard search glob
source share
2 answers

As far as I know, there is no way to do better than O (n) for a generic glob search.

However, for special cases of prefix and suffix search, you can sort indexes yourself to perform binary searches, as a result of which O (log n) is used to search for the prefix and suffix. The prefix index will be sorted based on the first character, then the second, etc. The suffix index will be sorted based on the last character, then the second, etc. (Reverse lines).

I would do as you suggest in your post, and do a two-step search, find the prefix or suffix indices, and then do a brute-force search using the abbreviated list provided from the first phase using a regular expression made from glob.

Since string length comparisons are faster than regular expressions, I would also pre-filter for the minimum length of a matching string or fixed-length string for example. doc.

From the sound of your original message, the indexes will also have to reference the full path of each record so that you can display it after the final results are found.

+2
source share

Check self-indexing: This is a stack overflow question and this DrDobbs article is on it.

+2
source share

All Articles