Solr: Number of posted files does not equal maxDoc

I apologize in advance if this question has already been answered - I could not find it.

I am relatively new to Solr and am following the tutorial's instructions to use the standard SimplePostTool to index my data from the command line. I am currently using Solr 4.0 in my testing.

First, I delete everything in my index by query. Then I point SimplePostTool at several directories and index tens of thousands of files. In my case, each XML file is a separate document. Some documents may have the same unique key. If it matters, XML document sizes range from 460KB.

When SimplePostTool finishes, it reports that 26,541 files were indexed. Then I look at the Admin page for collection1 and see Num Docs = 20,985 and Max Doc = 22,921.

I have seen other posts discussing the mismatch between Num Docs and Max Doc (I feel I understand that overwrite behavior well enough). My question is: why does the number of files SimplePostTool reports as indexed not match the Max Doc shown on the Solr admin page?

+4
1 answer

The reason numDocs and maxDoc differ:

numDocs represents the number of searchable documents in the index (and will be larger than the number of XML files, since some files contain more than one document). maxDoc may be larger because the maxDoc count includes logically deleted documents that have not yet been removed from the index. You can re-post the sample XML files over and over as much as you want, and numDocs will never increase, because the new documents will constantly replace the old ones. From: The Official Solr Guide. This applies to older versions.
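To make the counting concrete, here is a toy model of the overwrite behavior in Python. It is only a sketch, not Solr's actual implementation: `simulate_posts` and `merge_all` are hypothetical helpers, and the model assumes no segment merge happens while posting. In a real index, merges during indexing may already purge some deleted documents, which could be why maxDoc ends up between numDocs and the number of files posted.

```python
def simulate_posts(unique_keys):
    """Toy model of overwrite-by-unique-key (hypothetical, not Solr code).

    Returns (num_docs, max_doc) after posting one document per key,
    assuming no segment merge runs in between.
    """
    live = {}      # unique key -> slot holding the current live document
    max_doc = 0    # every slot ever written, including logically deleted ones
    for key in unique_keys:
        # Re-posting an existing key logically deletes the old copy; the old
        # slot still counts toward max_doc until a merge removes it.
        live[key] = max_doc
        max_doc += 1
    return len(live), max_doc


def merge_all(num_docs, max_doc):
    """Model a full merge (optimize): deleted slots are dropped."""
    return num_docs, num_docs


keys = ["a", "b", "a", "c", "b", "a"]   # six posts, three distinct keys
num_docs, max_doc = simulate_posts(keys)
print(num_docs, max_doc)                # 3 6
print(merge_all(num_docs, max_doc))     # (3, 3)
```

Six posts with three distinct keys leave three searchable documents, while maxDoc keeps counting the three superseded copies until they are merged away.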

You can remove the logically deleted documents by optimizing your index.
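An optimize can be requested over HTTP by passing `optimize=true` to the update handler. The sketch below assumes a default local install at `http://localhost:8983/solr` and the `collection1` core named in the question; adjust both for your setup. `build_optimize_url` and `optimize` are illustrative helper names, not part of Solr.

```python
from urllib.parse import quote
from urllib.request import urlopen


def build_optimize_url(base_url="http://localhost:8983/solr", core="collection1"):
    """Build the update-handler URL that asks Solr to optimize a core.

    The default host, port, and core name are assumptions.
    """
    return f"{base_url.rstrip('/')}/{quote(core)}/update?optimize=true"


def optimize(base_url="http://localhost:8983/solr", core="collection1", timeout=600):
    """Send the optimize request and return the HTTP status code.

    Requires a running Solr instance; optimizing a large index can be slow.
    """
    with urlopen(build_optimize_url(base_url, core), timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the URL; call optimize() yourself against a live server.
    print(build_optimize_url())
```

After the optimize completes, Num Docs and Max Doc on the admin page should show the same value.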

+5