It seems your question is more about merging indexes than indexing.
Indexing is fairly simple if you ignore the low-level details. Lucene builds a so-called "inverted index" from documents. So if a document with the text "To be or not to be" and id = 1 comes in, the inverted index will look like this:
[to] → 1
[be] → 1
[or] → 1
[not] → 1
This is basically a mapping from a word to the list of documents containing that word. Each row of this index (one per word) is called a posting list. The index is persisted on long-term storage.
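To make the structure concrete, here is a minimal sketch of an inverted index in plain Python. It is a toy model, not Lucene's API: the function name `build_inverted_index` and the second document are made up for illustration, and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        for term in text.lower().split():
            index[term].add(doc_id)
    # Each value is one posting list.
    return {term: sorted(ids) for term, ids in index.items()}


# Document 1 is the example from the text; document 2 is hypothetical.
docs = {1: "To be or not to be", 2: "To do or not to do"}
index = build_inverted_index(docs)

for term, postings in sorted(index.items()):
    print(f"[{term}] -> {postings}")
```

With a second document added, the posting lists for shared words such as "to" grow, while "be" still points only to document 1.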
In reality, of course, things are more complicated:
- Lucene may skip some words, depending on the Analyzer in use;
- words can be preprocessed with a stemming algorithm to reduce inflected forms of a word to a common root;
- a posting list can contain not only document identifiers but also the offsets of the word inside the document (a word may occur several times) and some other additional information.
There are many more complications that are not so important for a basic understanding.
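The three refinements listed above can be sketched in the same toy style. Everything here is an assumption made for illustration: the stop-word list, the crude `toy_stem` suffix stripper (a stand-in for a real stemmer) and the use of token positions instead of character offsets. Real Lucene analyzers are far more elaborate.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of"}  # hypothetical stop list


def toy_stem(term):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def analyze(text):
    """Yield (position, term) pairs, skipping stop words and stemming the rest."""
    for position, token in enumerate(text.lower().split()):
        if token in STOP_WORDS:
            continue
        yield position, toy_stem(token)


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} instead of term -> [doc_id]."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in analyze(text):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {1: "To be or not to be", 2: "the merging of segments"}
index = build_positional_index(docs)

print(index["to"])       # positions of both occurrences in document 1
print(index["segment"])  # "segments" was stemmed to "segment"
```

Note that "the" and "of" never reach the index, and that a word occurring twice in one document gets two positions in a single posting.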
It is important to understand that a Lucene index is append-only. At some point the application decides to commit (publish) all changes to the index. Lucene then finishes all housekeeping on the index and closes it, so that it becomes available for searching. After a commit the index is basically immutable. This index (or index part) is called a segment. When Lucene executes a query, it searches all available segments.
So the question arises: how can we change an already indexed document?
New documents, or new versions of already indexed documents, are indexed into new segments, and the old versions are invalidated in the previous segments using a so-called kill list. The kill list is the only part of a committed index that can change. As you might guess, index efficiency drops over time, because old segments may end up containing mostly deleted documents.
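A toy model of segments and kill lists might look like the sketch below. The class names `Segment` and `ToyIndex` and their methods are hypothetical and only mirror the description above; they are not Lucene classes, and real Lucene handles buffering and deletes in a more involved way.

```python
class Segment:
    """A committed index part: postings are frozen, only `deleted` may change."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> text, never modified after creation
        self.postings = {}       # term -> sorted doc ids, never modified
        for doc_id, text in sorted(self.docs.items()):
            for term in sorted(set(text.lower().split())):
                self.postings.setdefault(term, []).append(doc_id)
        self.deleted = set()     # the kill list

    def search(self, term):
        return [d for d in self.postings.get(term, []) if d not in self.deleted]


class ToyIndex:
    def __init__(self):
        self.segments = []
        self.pending = {}        # documents added since the last commit

    def add_or_update(self, doc_id, text):
        self.pending[doc_id] = text

    def commit(self):
        if not self.pending:
            return
        # Invalidate older versions through the kill lists of existing segments.
        for segment in self.segments:
            segment.deleted.update(self.pending.keys() & segment.docs.keys())
        # Publish the pending documents as a new immutable segment.
        self.segments.append(Segment(self.pending))
        self.pending = {}

    def search(self, term):
        hits = []
        for segment in self.segments:   # a query visits every segment
            hits.extend(segment.search(term))
        return hits


index = ToyIndex()
index.add_or_update(1, "to be or not to be")
index.commit()
index.add_or_update(1, "to do or not to do")  # new version of document 1
index.commit()

print(index.search("be"))          # old version is filtered out by the kill list
print(index.search("do"))          # new version is found in the second segment
print(index.segments[0].postings)  # old postings are still physically there
```

The first segment still stores the postings for the old version; only its kill list changed. That leftover data is the dead weight the paragraph above describes.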
This is where merging comes in. Merging is the process of combining several segments into one to make the index more efficient. What basically happens during a merge is that live documents are copied to a new segment, and the old segments are removed entirely.
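As a rough sketch of that copy step, assume each segment is represented by a plain dict with a `docs` mapping and a `deleted` set (a made-up representation for illustration, not Lucene's on-disk format). The hypothetical `merge_segments` function below keeps only live documents:

```python
def merge_segments(segments):
    """Copy live documents from several segments into one new segment."""
    live_docs = {}
    for segment in segments:
        for doc_id, text in segment["docs"].items():
            if doc_id not in segment["deleted"]:
                live_docs[doc_id] = text
    # The merged segment starts with an empty kill list.
    return {"docs": live_docs, "deleted": set()}


# Two hypothetical segments; document 1 was updated, document 3 was deleted.
old_segments = [
    {"docs": {1: "to be or not to be", 2: "lucene in action"}, "deleted": {1}},
    {"docs": {1: "to do or not to do", 3: "search engines"}, "deleted": {3}},
]

merged = merge_segments(old_segments)
segments = [merged]  # the old segments are dropped as a whole

stored_before = sum(len(s["docs"]) for s in old_segments)
stored_after = len(merged["docs"])
print(stored_before, stored_after)
```

Four stored documents shrink to the two that are still live, so later searches no longer have to skip over deleted entries.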
Using this simple process, Lucene can keep the index in good shape in terms of search performance.
Hope this helps.
Denis Bazhenov, 01 Oct '15 at 1:58