Solr for Arabic

I use Solr to index documents in three languages (Arabic, French and English), with this fieldType:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Everything works fine except in Arabic. When I search for a word such as حقل, Solr doesn't find it, but when I enter the same word with its letters reversed (لقح, i.e. typed in left-to-right order), Solr finds it and returns results.

How can I get correct results for Arabic words?

1 answer

I'm going to turn Daniel's smart analysis into an answer here, for the record. Don't upvote this; go find something of his to upvote instead :-)

There are two ways to get a directional mismatch with RTL text: you can index it backwards, or you can query it backwards. A simple HTML form querying Solr will never get the direction wrong. In this case, khaled was extracting the text from PDFs with a library that falls prey to the tendency of PDF files to store text in "visual order" rather than "logical order". As a result, the index was populated with backwards Arabic. To fix this, he will have to find a library that extracts text from PDF files correctly.
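One rough way to confirm this diagnosis before re-indexing is to look for a word you know is in the document and see whether the extracted text contains it as typed or only in reversed form. The sketch below is an illustration only: `detect_text_order` is a hypothetical helper, the sample strings are made up, and a plain substring test will not cope with palindromes, diacritics or mixed-direction lines.

```python
def detect_text_order(extracted_text: str, known_word: str) -> str:
    """Classify how `known_word` appears in `extracted_text`.

    Returns "logical" if the word appears as typed, "visual" if only its
    character-reversed form appears, and "absent" if neither is found.
    """
    if known_word in extracted_text:
        return "logical"
    if known_word[::-1] in extracted_text:
        return "visual"
    return "absent"


# Hypothetical sample: a three-letter Arabic word in logical order
# (U+062D, U+0642, U+0644), written with escapes to avoid bidi display issues.
word = "\u062d\u0642\u0644"

# Simulated extractor output: one with the Arabic run stored reversed
# (visual order), one with it stored as typed (logical order).
bad_extraction = "Solr " + word[::-1] + " index"
good_extraction = "Solr " + word + " index"

print(detect_text_order(bad_extraction, word))
print(detect_text_order(good_extraction, word))
```

If text pulled from the PDFs is classified as "visual" for several known words, the problem lies in extraction rather than in the Solr analyzer chain, and changing the fieldType will not help.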

Forcing Apache Tika to use the latest Apache PDFBox might help. Alternatively, his PDF may be so dodgy that even the latest PDFBox cannot handle it, in which case he has a serious problem.
