That works out to roughly a 400K page, if I did my math correctly. That is a large page size.
What do you need the index for?
If you want proximity and phrase search, you need to index every word, and you need a product such as Solr. Via Tika, I believe Solr can index PDFs.
Another option is SQL full-text search, but then you will need to build an application for the front end, whereas Solr is both an application and an engine.
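A minimal sketch of the SQL full-text option, assuming on-premises SQL Server (as noted below, SQL Azure does not have full-text search); the table, catalog, and column names here are hypothetical:

```sql
-- Hypothetical documents table; names are illustrative only.
CREATE TABLE dbo.Documents (
    DocId INT IDENTITY NOT NULL CONSTRAINT PK_Documents PRIMARY KEY,
    Title NVARCHAR(400),
    Body  NVARCHAR(MAX)
);

-- Full-text search needs a catalog and a full-text index keyed on a unique index.
CREATE FULLTEXT CATALOG DocCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.Documents (Body)
    KEY INDEX PK_Documents
    ON DocCatalog;

-- Phrase search comes for free with CONTAINS.
SELECT DocId, Title
FROM dbo.Documents
WHERE CONTAINS(Body, '"full text search"');

-- Proximity search.
SELECT DocId, Title
FROM dbo.Documents
WHERE CONTAINS(Body, 'index NEAR solr');
```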
Do you need to index every word, or just the unique words? If you only need a basic search, English has only about 200,000 unique words. If you run them through a stemmer such as Porter's, that number goes down. Then throw away stop words like "the". Then you have to handle proper names, e-mail addresses, and other words that are not in the dictionary.

I index documents manually, and even a very large collection ends up around 300,000 unique words (if they are real words; OCR output will wreck that number). If a document has 2,000 unique words, the cross-reference index is only 20,000,000,000 rows. You can pull the words out with a regex. I know this seems ugly, but I do it manually in SQL and .NET; a sketch of the schema follows below. There is no proximity or phrase search, but it has a small footprint and is fast. (SQL Azure does not have full-text search.)
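As a rough sketch of that manual cross-reference index (all names are hypothetical; the word extraction and stemming would happen in your .NET code before loading):

```sql
-- Hand-rolled word index; one row per unique word across the collection.
CREATE TABLE dbo.Words (
    WordId INT IDENTITY NOT NULL CONSTRAINT PK_Words PRIMARY KEY,
    Word   NVARCHAR(100) NOT NULL UNIQUE   -- stemmed, stop words removed
);

-- Cross-reference: one row per (document, unique word) pair,
-- roughly 2,000 rows per document.
CREATE TABLE dbo.DocWords (
    DocId  INT NOT NULL,
    WordId INT NOT NULL,
    CONSTRAINT PK_DocWords PRIMARY KEY (WordId, DocId)  -- keyed by word for lookups
);

-- Basic search: documents containing both "solr" and "index"
-- (no phrase or proximity support).
SELECT dw.DocId
FROM dbo.DocWords dw
JOIN dbo.Words w ON w.WordId = dw.WordId
WHERE w.Word IN (N'solr', N'index')
GROUP BY dw.DocId
HAVING COUNT(DISTINCT dw.WordId) = 2;
```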