A faster way to get distinct values from Lucene Query

Question

I currently do it like this:

IndexSearcher searcher = new IndexSearcher(lucenePath);
Hits hits = searcher.Search(query);
Document doc;
List<string> companyNames = new List<string>();
for (int i = 0; i < hits.Length(); i++)
{
    doc = hits.Doc(i);
    companyNames.Add(doc.Get("companyName"));
}
searcher.Close();
companyNames = companyNames.Distinct<string>().Skip(offSet ?? 0).ToList();
return companyNames.Take(count ?? companyNames.Count()).ToList();

As you can see, I first collect ALL the field values (several thousand), then reduce them to the distinct ones, possibly skip some, and take only a subset.

I feel that there must be a better way to do this.

+4

c# lucene

Boris Callens Mar 6 '09 at 9:40

4 answers

Linking this question to an earlier question of yours (re: "Too many clauses"), I think you should definitely look at enumerating the terms from the index reader. Cache the results (I used a sorted dictionary keyed by field name, with a list of terms as the data, up to a maximum of 100 terms per field) until the index reader becomes invalid, and away you go.

Or perhaps I should say that, when faced with a problem similar to yours, that is what I did.
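
The caching idea described above could be sketched roughly as follows. This is only an illustration, not the code actually used: the class name FieldTermCache and the loadTerms delegate are hypothetical, and the delegate stands in for whatever enumerates the terms of a field from the index reader. The sketch is not thread-safe.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: caches up to maxTermsPerField terms per field name.
public sealed class FieldTermCache
{
    // Sorted dictionary keyed by field name, with the list of terms as the data.
    private readonly SortedDictionary<string, List<string>> cache =
        new SortedDictionary<string, List<string>>(StringComparer.Ordinal);

    // Assumption: the caller supplies a function that enumerates a field's terms.
    private readonly Func<string, IEnumerable<string>> loadTerms;
    private readonly int maxTermsPerField;

    public FieldTermCache(Func<string, IEnumerable<string>> loadTerms, int maxTermsPerField = 100)
    {
        this.loadTerms = loadTerms ?? throw new ArgumentNullException(nameof(loadTerms));
        this.maxTermsPerField = maxTermsPerField;
    }

    // Returns the cached terms for a field, loading them on first use only.
    public IReadOnlyList<string> GetTerms(string fieldName)
    {
        List<string> terms;
        if (!cache.TryGetValue(fieldName, out terms))
        {
            terms = loadTerms(fieldName).Take(maxTermsPerField).ToList();
            cache[fieldName] = terms;
        }
        return terms;
    }

    // Call this when the index reader is reopened or is no longer current.
    public void Invalidate()
    {
        cache.Clear();
    }
}
```

The delegate passed to the constructor would wrap the term enumeration, and Invalidate would be called whenever the reader is replaced.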

Hope this helps,

+3

Adrian Conlon Mar 6 '09 at 10:50

I suggest you look for a way to avoid such an iteration altogether, but if there is no alternative in your context, you can get a performance gain with the following code.
1) At index time, it is best to add the field you want to iterate over as the first field of the document:

 Document doc = new Document();
 Field companyField = new Field(...);
 doc.Add(companyField);
 ...

2) Then you need to define a FieldSelector like this:

 class CompanyNameFieldSelector : FieldSelector
 {
     public FieldSelectorResult Accept(string fieldName)
     {
         return (fieldName == "companyName"
             ? FieldSelectorResult.LOAD_AND_BREAK
             : FieldSelectorResult.NO_LOAD);
     }
 }

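3) Then, when you want to iterate and select this field, you should do something like this:

 FieldSelector companySelector = new CompanyNameFieldSelector();
 // When you iterate through your hits, pass the selector when loading
 // the document, so that only "companyName" is read from the index.
 doc = searcher.Doc(hits.Id(i), companySelector);
 string companyName = doc.Get("companyName");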

The code above performs much better than the code you posted, because it skips reading the unnecessary document fields and so saves time.

+1

Ehsan Apr 20 '12 at 10:30

 public List<string> GetDistinctTermList(string fieldName)
 {
     List<string> list = new List<string>();
     using (IndexReader reader = idxWriter.GetReader())
     {
         // Position the enumerator at the first term of the requested field.
         TermEnum te = reader.Terms(new Term(fieldName));
         if (te != null && te.Term != null && te.Term.Field == fieldName)
         {
             list.Add(te.Term.Text);
             while (te.Next())
             {
                 // Terms are ordered by field, so a different field means we are done.
                 if (te.Term.Field != fieldName) break;
                 list.Add(te.Term.Text);
             }
         }
     }
     return list;
 }

+1

user2615425 Jul 24 '13 at 16:03

Razzie · Accepted Answer · 2009-03-06T10:18:24+0000

I'm not sure there is, frankly; Lucene does not provide "distinct" functionality. I believe that with SOLR you can use faceted search to achieve this, but if you want it in Lucene, you will have to write some kind of facet functionality yourself. So as long as you don't run into performance issues, you should be fine the way you are doing it.
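
As a rough idea of what such hand-written facet functionality might look like, here is a minimal sketch that works on any sequence of field values, however they were read from the index. The class and method names (DistinctValues, CountByValue, Page) are made up for illustration and are not part of Lucene.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, independent of Lucene: it only sees the field values.
public static class DistinctValues
{
    // Counts how often each value occurs, i.e. a very small facet over one field.
    public static SortedDictionary<string, int> CountByValue(IEnumerable<string> values)
    {
        var counts = new SortedDictionary<string, int>(StringComparer.Ordinal);
        foreach (string value in values)
        {
            if (value == null) continue; // document without the field
            int current;
            counts.TryGetValue(value, out current); // current stays 0 if absent
            counts[value] = current + 1;
        }
        return counts;
    }

    // Returns one page of distinct values. The LINQ chain is evaluated lazily,
    // so when a count is given the source is not read past the end of the page.
    public static List<string> Page(IEnumerable<string> values, int? offset, int? count)
    {
        IEnumerable<string> distinct = values
            .Where(v => v != null)
            .Distinct()
            .Skip(offset ?? 0);
        if (count.HasValue) distinct = distinct.Take(count.Value);
        return distinct.ToList();
    }
}
```

If the values are produced by an iterator that loads one document at a time, Page avoids materialising the full list before skipping and taking, which is the part of the original code that does the most unnecessary work.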
