How does Lucene index documents?

I read several documents about Lucene; I also read the document in this link ( http://lucene.sourceforge.net/talks/pisa ).

I really don't understand how Lucene indexes documents, or what algorithms Lucene uses for indexing.

The link above says Lucene uses this algorithm for indexing:

  • incremental algorithm:
    • maintain a stack of segment indexes
    • create an index for each incoming document
    • push new indexes onto the stack
    • let b = 10 be the merge factor; M = ∞



for (size = 1; size < M; size *= b) {
    if (there are b indexes with size docs on top of the stack) {
        pop them off the stack;
        merge them into a single index;
        push the merged index onto the stack;
    } else {
        break;
    }
}
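A rough Python simulation of this policy (illustrative only, not Lucene's code; `b` is the merge factor from the talk, and the bound `M` is taken as unbounded here, so merges always cascade):

```python
def index_documents(num_docs, b=10):
    """Simulate the incremental merge policy: each incoming document
    becomes a one-document index pushed on a stack; whenever b indexes
    of the same size sit on top, they are popped and merged into one
    index holding b times as many documents."""
    stack = []  # sizes (doc counts) of the stacked index segments
    for _ in range(num_docs):
        stack.append(1)          # create an index for the new document
        size = 1
        while True:              # the "size *= b" loop, with M unbounded
            top = stack[-b:]
            if len(top) == b and all(s == size for s in top):
                del stack[-b:]               # pop b equal-sized indexes
                stack.append(size * b)       # merge them into one
                size *= b
            else:
                break
    return stack

# The stack behaves like a base-b counter: after 1234 documents it holds
# one index per "digit" of 1234: [1000, 100, 100, 10, 10, 10, 1, 1, 1, 1].
```

This also hints at the answer to the question below: each document takes part in at most log_b(n) merges, so total merge work stays around O(n log n), and at any moment only O(log n) segments are live.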

How does this algorithm provide optimized indexing?

Does Lucene use a B-tree or some other similar indexing algorithm - or does it have one of its own?

+63
algorithm indexing lucene
Apr 08 '10
4 answers

Here's a nice article: https://web.archive.org/web/20130904073403/http://www.ibm.com/developerworks/library/wa-lucene/

Edit 12/2014: updated to an archived version because the original has been deleted. Probably the best recent alternative is http://lucene.apache.org/core/3_6_2/fileformats.html

There's an even later version at http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene410/package-summary.html#package_description , but it seems to contain less information than the old one.

In short, when Lucene indexes a document, it breaks it down into a number of terms. It then stores the terms in an index file, where each term is associated with the documents that contain it. You can think of it as something like a hash table.

Terms are generated by an analyzer, which stems each word down to its root form. The most popular stemming algorithm for English is the Porter stemmer: http://tartarus.org/~martin/PorterStemmer/

When a query is issued, it is processed through the same analyzer that was used to build the index, and is then used to look up the matching term(s) in the index. This yields the list of documents that match the query.
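A minimal sketch of that symmetry (illustrative only, not Lucene's API; the crude suffix-stripping here is just a stand-in for a real stemmer like Porter's):

```python
def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, strip a few suffixes."""
    terms = []
    for word in text.lower().split():
        for suffix in ("ing", "es", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return terms

def build_index(docs):
    """Hash-table-like inverted index: term -> set of doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    # The query goes through the SAME analyzer, so "Indexing" matches "indexed".
    results = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*results) if results else set()

docs = {1: "Lucene indexes documents", 2: "The documents were indexed"}
index = build_index(docs)
# search(index, "Indexing") -> {1, 2}: both forms were stemmed to "index"
```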

+45
Apr 09 '10 at 13:22

It seems your question is more about merging indexes than indexing.

The indexing process is quite simple if you ignore the low-level details. Lucene builds what is called an "inverted index" from the documents. So if a document with id = 1 contains the text "To be or not to be", the inverted index will look like this:

 [to] → 1 [be] → 1 [or] → 1 [not] → 1 

This is basically an index from each word to the list of documents containing that word. Each row of this index (one per word) is called a posting list. The index is persisted to long-term storage.
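The example above can be reproduced in a few lines (conceptual only; Lucene's on-disk posting lists are far more compact and sophisticated):

```python
from collections import defaultdict

def add_document(index, doc_id, text):
    """Append doc_id to the posting list of every distinct term in the text."""
    for term in sorted(set(text.lower().split())):
        index[term].append(doc_id)

index = defaultdict(list)
add_document(index, 1, "To be or not to be")
# dict(index) == {"to": [1], "be": [1], "or": [1], "not": [1]}
```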

In reality, of course, things are more complicated:

  • Lucene may skip some words (stop words), depending on the analyzer used;
  • words can be pre-processed with a stemming algorithm to reduce the inflectional variety of the language;
  • a posting list can contain not only document identifiers, but also the offsets of the word within the document (possibly several occurrences) and other additional information.
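The last bullet can be sketched as posting lists that carry word positions (a simplified model; real Lucene postings also store frequencies, character offsets, and payloads). Positions are what make phrase queries possible:

```python
from collections import defaultdict

def index_with_positions(docs):
    """Map term -> {doc_id: [positions]}, so queries can check word adjacency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

index = index_with_positions({1: "to be or not to be"})
# "to" occurs at positions 0 and 4 of document 1: index["to"] == {1: [0, 4]}
```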

There are many more complications that are not so important for a basic understanding.

It is important to understand that the Lucene index is append-only. At some point the application decides to commit (publish) all changes to the index. Lucene finishes all maintenance operations on the index and closes it, at which point it becomes searchable. After a commit, the index is basically immutable. This index (or a part of it) is called a segment. When Lucene executes a search, it searches all available segments.
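The segment model can be sketched like this (a toy model, names are mine, not Lucene's API; a real segment is a set of immutable on-disk files):

```python
class Segment:
    """One committed, immutable piece of the index."""
    def __init__(self, postings):
        self._postings = postings          # term -> sorted list of doc ids

    def search(self, term):
        return self._postings.get(term, [])

def search_all(segments, term):
    """A query runs against every committed segment; results are combined."""
    hits = []
    for seg in segments:
        hits.extend(seg.search(term))
    return sorted(hits)

segments = [Segment({"lucene": [1, 3]}), Segment({"lucene": [7]})]
# search_all(segments, "lucene") -> [1, 3, 7]
```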

So the question is - how can we change an already indexed document?

New documents, or new versions of already-indexed documents, are indexed into new segments, and the old versions are invalidated in the previous segments using a so-called kill list. The kill list is the only part of a committed index that can change. As you might guess, index performance degrades over time, because old segments may consist mostly of deleted documents.
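A minimal sketch of the kill-list idea (names are mine, not Lucene's API): the postings never change, deletions are only marked, and every search filters them out:

```python
def search_segment(postings, kill_list, term):
    """The postings of a committed segment are immutable; the kill list is
    the only mutable part. 'Deleted' docs are filtered out at search time."""
    return [doc for doc in postings.get(term, []) if doc not in kill_list]

postings = {"be": [1, 2, 3]}   # committed and never modified
kill_list = {2}                # doc 2 was deleted or superseded
# search_segment(postings, kill_list, "be") -> [1, 3]
```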

This is where merging comes in. Merging is the process of combining multiple indexes into one to improve their performance. What basically happens during a merge is that live documents are copied into a new segment, and the old segments are deleted entirely.
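What happens during a merge can be sketched as follows (a conceptual sketch, not Lucene's actual merge code; each segment here is just a postings dict plus its kill set):

```python
def merge_segments(segments):
    """Copy only live documents into one fresh segment. Deleted documents
    vanish for good, so the merged segment starts with an empty kill set."""
    merged = {}
    for postings, killed in segments:
        for term, doc_ids in postings.items():
            merged.setdefault(term, []).extend(
                d for d in doc_ids if d not in killed)
    for term in merged:
        merged[term].sort()
    return merged, set()

old = [({"be": [1, 2]}, {1}),      # doc 1 was superseded in this segment
       ({"be": [3]}, set())]
new_postings, new_kill = merge_segments(old)
# new_postings == {"be": [2, 3]}; the stale doc 1 is gone for good
```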

Using this simple process, Lucene can keep the index in good shape in terms of search performance.

Hope this helps.

+17
01 Oct '15 at 1:58

It is an inverted index, but that does not specify which data structure it uses. The Lucene index file format documentation contains the complete details.
Start with a Summary of File Extensions.

The first thing you'll notice is that it talks about several different indexes. As far as I can tell, none of them uses a B-tree strictly speaking, but there are similarities - the structures above resemble trees.

+11
Apr 09 '10

In a nutshell, Lucene builds the inverted index on disk using skip lists, and loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms into RAM, as described by Michael McCandless, the author of Lucene's indexing system himself. Also note that with skip lists, the index can jump from one hit to another, which makes things like set operations and, in particular, range queries possible (much like B-trees do). And the Wikipedia entry on skip lists also explains why Lucene's skip-list implementation is called a multi-level skip list - essentially, to make O(log n) lookups possible (again, much like B-trees).
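A toy leapfrog-intersection sketch of jumping "from one hit to another" (illustrative only: binary search over a sorted list stands in for the skip pointers here; Lucene's multi-level skip lists are more elaborate, but the access pattern is the same):

```python
import bisect

def advance(postings, target):
    """Return the first doc id >= target, or None. The binary search plays
    the role of skip pointers: we jump over runs of ids instead of scanning."""
    i = bisect.bisect_left(postings, target)
    return postings[i] if i < len(postings) else None

def intersect(a, b):
    """Leapfrog intersection of two sorted posting lists (an AND query)."""
    result = []
    da, db = advance(a, 0), advance(b, 0)
    while da is not None and db is not None:
        if da == db:
            result.append(da)
            da, db = advance(a, da + 1), advance(b, db + 1)
        elif da < db:
            da = advance(a, db)   # skip forward in a, straight to db
        else:
            db = advance(b, da)   # skip forward in b, straight to da
    return result

# intersect([1, 5, 9, 20, 35], [5, 20, 21, 35, 40]) -> [5, 20, 35]
```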

So, once the inverted (term) index - based on the skip-list data structure - has been created from the documents, it is stored on disk. Lucene then loads (as already mentioned: possibly only some of) those terms into a Finite State Transducer, in an FST implementation loosely inspired by Morfologik.

Michael McCandless (again) does a pretty good and concise job of explaining how and why Lucene uses a (minimal acyclic) FST to index the terms that Lucene stores in memory, essentially as a SortedMap<ByteSequence,SomeOutput>, and gives a basic idea of how FSTs work (i.e., how the FST compresses the byte sequences [that is, the indexed terms] to minimize the memory use of this mapping). He also points to the paper describing the particular FST construction algorithm that Lucene uses.
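A very loose illustration of the compression idea (using a prefix trie, which only shares prefixes; Lucene's minimal FST also shares suffixes, so it compresses even better):

```python
def build_trie(terms):
    """Insert terms into a trie of nested dicts; shared prefixes are stored once."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = True   # end-of-term marker
    return root

def count_nodes(trie):
    """Count character nodes, i.e. how many characters are actually stored."""
    return sum(1 + count_nodes(child)
               for key, child in trie.items() if key != "$")

terms = ["index", "indexes", "indexing", "inverted"]
trie = build_trie(terms)
# The four terms total 28 characters, but the trie stores only 16 nodes,
# because the prefixes "index" and "in" are shared.
```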

For those curious why Lucene uses skip lists while most databases use (B+)- and/or (B)-trees, have a look at the relevant SO answer on skip lists vs. B-trees. It gives a pretty good, in-depth explanation - essentially, it is not so much that skip lists make concurrent updates to the index "more amenable" (since you can decide not to rebalance a B-tree immediately and thereby get roughly the same concurrent performance as a skip list), but rather that skip lists avoid the (deferred or not) balancing work that B-trees ultimately require (in fact, as that answer and its references show, there is probably very little performance difference between B-trees and [multi-level] skip lists, if either is "done right").

+9
Apr 04 '17 at 9:30


