Understanding Lucene Wildcard Performance

Question

Lucene does not allow leading wildcards in search terms by default, but they can be enabled with:

QueryParser#setAllowLeadingWildcard(true)

I understand that using a leading wildcard prevents Lucene from using the index efficiently. Searches with a leading wildcard must scan the entire index.

How can I demonstrate the performance cost of a leading wildcard query? When is it OK to use setAllowLeadingWildcard(true)?

I built a test index with 10 million documents in the form:

 { name: random_3_word_phrase }

The index is 360 MB on disk.

My test queries perform well, and I have not been able to demonstrate a performance problem. For example, the query name:*ing matches more than 1.1 million documents in less than 1 second. The query name:*ing* matches more than 1.5 million documents in about the same time.

What's going on here? Why isn't it slow? Are 10,000,000 documents not enough? Should the documents contain more than one field?

full-text-search lucene solr

Landon kuhn Aug 1 '12 at 19:39

2 answers

Galacticjello · Answer 1 · 2012-08-01T19:51:58+0000

It depends on how much memory you have and how much of the token index is in memory.

A 360 MB index can be searched quickly on any old computer. A 360 GB index will take a little longer ... ;)

As an example, I took an old 2 GB index and searched for "*e".

On a box with 8 GB of memory, it returned 500 thousand hits in less than 5 seconds. I tried the same index on a box with 1 GB of memory, and it took about 20 seconds.

To illustrate further, here is some generic C# code that basically does a search like "*E*" over 10 million random 3-word phrases.

    using System;
    using System.Collections.Generic;
    using System.Text;

    static class WildcardDemo
    {
        static string substring = "E";
        private static Random random = new Random((int)DateTime.Now.Ticks); // thanks to McAden

        // Builds a random string of upper-case letters A-Z.
        private static string RandomString(int size)
        {
            StringBuilder builder = new StringBuilder();
            char ch;
            for (int i = 0; i < size; i++)
            {
                ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
                builder.Append(ch);
            }
            return builder.ToString();
        }

        // Loads 10 million random 3-word phrases, then scans all of them for the substring.
        static void FindSubStringInPhrases()
        {
            List<string> index = new List<string>();
            for (int i = 0; i < 10000000; i++)
            {
                index.Add(RandomString(5) + " " + RandomString(5) + " " + RandomString(5));
            }
            var matches = index.FindAll(SubstringPredicate);
        }

        static bool SubstringPredicate(string item)
        {
            return item.Contains(substring);
        }
    }

After the 10 million phrases have been loaded into the list, it takes only about a second for "var matches = index.FindAll(SubstringPredicate);" to return more than 4 million matches.
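As a rough sanity check on that count, and assuming RandomString draws each letter uniformly from A to Z (which is what the code above appears to do), each phrase contains 15 letters, so the chance that at least one of them is an E works out to:

```latex
P(\text{match}) = 1 - \left(\tfrac{25}{26}\right)^{15} \approx 0.445
```

That puts the expected number of matches at roughly 4.4 million out of 10 million phrases, which is in line with the "more than 4 million" observed.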

The point is that memory is fast. Once things no longer fit into memory and you have to start swapping to disk, that is when you will see a performance hit.

Kai chan · Answer 2 · 2012-08-01T19:55:21+0000

If I understand correctly, part of the index is the term dictionary, which is a sorted list of all indexed terms. When searching without wildcards or with trailing wildcards, Lucene can take advantage of the fact that the terms are sorted and that many of them share common prefixes. A search with a leading wildcard, on the other hand, has to check the entire term dictionary. This is not optimal, but the term dictionary tends to be tiny compared to other parts of the index, such as the frequency and position data, so a full scan of the dictionary is usually not a big problem.
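The difference between the two lookups can be sketched with a plain sorted set standing in for the term dictionary. This is only an illustration under that assumption: the class TermDictionarySketch and its methods are hypothetical names, and Lucene's real term dictionary is a different, more compact structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy stand-in for a term dictionary: a sorted set of terms.
// Hypothetical class for illustration; this is NOT Lucene's data structure.
final class TermDictionarySketch {
    private final NavigableSet<String> terms = new TreeSet<>();

    void add(String term) {
        terms.add(term);
    }

    // Trailing wildcard ("ing*"): seek to the first term >= prefix and
    // stop at the first term that no longer starts with it.
    List<String> prefixMatches(String prefix) {
        List<String> out = new ArrayList<>();
        for (String t : terms.tailSet(prefix, true)) {
            if (!t.startsWith(prefix)) {
                break; // sorted order: no later term can match
            }
            out.add(t);
        }
        return out;
    }

    // Leading wildcard ("*ing"): the sort order does not help,
    // so every term in the dictionary has to be checked.
    List<String> suffixMatches(String suffix) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (t.endsWith(suffix)) {
                out.add(t);
            }
        }
        return out;
    }
}
```

In both cases the work is proportional to the number of distinct terms touched, not to the number of documents. That is why a full dictionary scan over a small vocabulary can still come back quickly even with 10 million documents in the index.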
