Full-text substring Search in iOS

I need an iPhone / iPad app to be able to quickly search around 10,000 entries (about every paragraph of text) for any substring contained in the entry. Therefore, if the entry contains the word "Flame", the request for "lame" must match.

I am currently using SQLite, but the search for "LIKE% term%" is too slow for many records. Enabling full-text search does not seem to fully satisfy my needs, since SQLite only supports wildcard prefix characters (for example, "Flam *", not "* lame").

I experimented using giant blob text (~ 350K) and did [NSString rangeOfString: ...], which I think uses the Boyer-Moore algorithm. This is faster than searching for "LIKE% term%", but still not the speed I hope for.

Any suggestions for approaches or libraries that would achieve such a scalable subscript search and that would work on the iPhone?

+6
substring ios search iphone full-text-search
source share
3 answers

Here are a few different options. I do not know about the benchmarks for each, so you will need to do some testing.

First up is the FTS3 extension for SQLite. This should give you a quick, indexed full-text search: http://regularrateandrhythm.com/regular-rate-rhythm-blog/sqlite3-fts-in-IOS4.html

Then, what about the regular expressions that were introduced in iOS 4:
http://developer.apple.com/library/ios/#documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html

For pre-iOS 4, you can use RegexKitLite:
http://regexkit.sourceforge.net/RegexKitLite/index.html

If you decide to use regular expressions, check out this post on how to optimize them:
How to speed up iPhone regular expressions using NSRegularExpression?

+2
source share

Perhaps consider combining your second approach with an asynchronous approach. Divide your large block of text by 5.10, regardless of size, and find them separately with the same number of threads. Then combine the results using a coordinate system that knows how to correctly position the matches (for example, thread 5 looked for area 5 and found a match at position 337, which correlates with document x, position y). You will find that there is a limit when adding more threads is not appropriate, so be the first thing to find out.

0
source share

If you cannot tokenize the text (divide it into words), you cannot index it. This is why LIKE is a sequential search. If your substring cannot be limited in any way (always delete the first letter or the fixed length of the substring, for example), your text cannot be saved as a list of all possible tokens, and these tokens cannot be indexed. The key (intended for pun intended) is the search for an algorithm that creates a fairly small list of tokens that the cost of indexing them is less than the cost of a linear search.

0
source share

All Articles