Lucene Fields vs. DocValues

I use Lucene to index our data, and I came across some weird behavior regarding DocValues fields.

So, can anyone explain the difference between a regular document field (e.g. StringField, TextField, IntField, etc.) and DocValues fields (e.g. IntDocValuesField, SortedDocValuesField, etc.; the types seem to have changed in Lucene 5.0)?

First, why can't I access DocValues using document.get(fieldName)? And if I can't, how can I access them?

Secondly, I saw that some functionality changed in Lucene 5.0; for example, sorting can only be performed on DocValues. Why?

Thirdly, DocValues can be updated, but regular fields cannot (you need to delete and re-add the whole document)...

Also, and perhaps most importantly, when should DocValues be used and when regular fields?

Joseph

lucene solr
1 answer

Most of these questions can be answered quickly by referring to the Solr Wiki or by a web search, but to get the gist of DocValues: they are useful for everything else a modern search service does, apart from the actual searching. From the Solr Community Wiki:

DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.

...

DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
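To make "column-oriented" concrete, here is a minimal sketch in plain Java. It is not Lucene's API or on-disk format; the class and method names (ColumnSketch, buildColumn, sortDocIds) are hypothetical and only illustrate the idea of building a document-to-value mapping once and then sorting from it without reading the documents again.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustration only: a toy model of a column-oriented field, not Lucene code.
class ColumnSketch {
    // Hypothetical "index time" step: copy one field out of every document
    // into a dense array indexed by document ID (missing values become 0).
    static long[] buildColumn(List<Map<String, Long>> docs, String field) {
        long[] column = new long[docs.size()];
        for (int docId = 0; docId < docs.size(); docId++) {
            column[docId] = docs.get(docId).getOrDefault(field, 0L);
        }
        return column;
    }

    // Sort document IDs by the column alone; the documents are not read again.
    static List<Integer> sortDocIds(long[] column) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 0; docId < column.length; docId++) {
            ids.add(docId);
        }
        ids.sort(Comparator.comparingLong((Integer id) -> column[id]));
        return ids;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> docs = List.of(
            Map.of("price", 30L),
            Map.of("price", 10L),
            Map.of("price", 20L));
        long[] prices = buildColumn(docs, "price");
        System.out.println(sortDocIds(prices)); // prints [1, 2, 0]
    }
}
```

The point of the sketch is that the sort touches one compact array per field instead of loading every document to pull a single value out of it.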

This should also answer why Lucene 5 requires DocValues for sorting: it is much more efficient than the previous approach.

The reason is that the storage format is turned around relative to the standard format when gathering data for these operations: where the application previously had to go through each document to find the values, it can now look up the values and find the corresponding documents. This is very useful when you already have a list of documents on which you need to perform an intersection.
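As a rough illustration of the "you already have a list of documents" case, the following plain-Java sketch counts facet values for a set of matching document IDs by reading only a per-field column. It is a toy model under stated assumptions, not Lucene code: FacetSketch, countFacets, and the sample data are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: facet counting over a toy per-field column.
class FacetSketch {
    // Count how often each value occurs among the matching documents,
    // reading only the column (docId -> value), never the full documents.
    static Map<String, Integer> countFacets(String[] column, List<Integer> matchingDocIds) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int docId : matchingDocIds) {
            counts.merge(column[docId], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical column for a "category" field, indexed by document ID.
        String[] category = {"book", "film", "book", "game", "film"};
        // Hypothetical result of a search: the IDs of the matching documents.
        List<Integer> hits = List.of(0, 2, 4);
        System.out.println(countFacets(category, hits)); // prints {book=2, film=1}
    }
}
```

The cost of the loop depends on the number of hits, not on the size of each document, which is the property that makes this layout attractive for faceting, sorting, and grouping.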

If I remember correctly, updating a DocValue-based field involves pulling the document out of the previous token list and then re-inserting it at the new location, whereas with the previous approach an update would change loads of dependencies (so reindexing was the only viable strategy).

Use DocValues for fields that need any of the properties mentioned above, such as sorting, faceting, etc.
