Solr for Arabic PDF

Question

I am trying to index and search Arabic PDF files in Apache Solr. The problem is that Tika extracts the text in reversed order (left to right) instead of right to left.

I found links to this problem here:

However, I do not know how to make my Apache Solr installation use the latest version of PDFBox or ICU4J. My Apache Solr contrib/extraction/lib directory contains pdfbox-1.6.0.jar and icu4j-4.8.1.1.jar. Should I delete these files and replace them with the latest libraries from their project pages to force Tika to use them?

Please explain, since I have no previous experience with Java servlets. Thanks!

+6

drupal right-to-left solr apache-tika arabic

perpetual_dream Nov 27 '12 at 17:27


1 answer

Josep Valls · Answer 1 · 2013-02-28T18:57:37+0000

From the tags on your question, I assume you are using Drupal as the front end to Apache Solr. Tika can run inside Solr, when you send it binary documents, or you can run it yourself before sending the documents to Solr. The Drupal Solr Attachments module has a setting for the latter: "Tika (local Java application)". In the second link you posted, they patched the Solr Attachments module to use PDFBox instead of Tika to parse binary files before submitting them to Solr. If you are not using Drupal, you should try a similar approach.
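
Whichever extractor you end up running before Solr, it may help to check whether its output is still reversed. Below is a minimal sketch of such a check using only the Java standard library. The class name `ExtractionOrderCheck` and its methods are hypothetical, not part of Tika, PDFBox, or Solr. It assumes you know one Arabic word that appears in the PDF and that the extractor emits the same code points as that word (not shaped presentation forms).

```java
// Hypothetical helper, not part of Tika, PDFBox, or Solr.
final class ExtractionOrderCheck {
    private ExtractionOrderCheck() {}

    // StringBuilder.reverse() keeps surrogate pairs together.
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // True if the extracted text contains the known word only in
    // reversed form, which suggests the extractor wrote the characters
    // in visual (left-to-right) order instead of logical order.
    static boolean looksReversed(String extracted, String knownWord) {
        String reversed = reverse(knownWord);
        if (reversed.equals(knownWord)) {
            return false; // a palindrome tells us nothing
        }
        return extracted.contains(reversed) && !extracted.contains(knownWord);
    }

    public static void main(String[] args) {
        // The Arabic word for "book" in logical order: kaf, ta, alef, ba.
        String word = "\u0643\u062A\u0627\u0628";
        String goodExtraction = "... " + word + " ...";
        String badExtraction = "... " + reverse(word) + " ...";

        System.out.println(looksReversed(goodExtraction, word)); // false
        System.out.println(looksReversed(badExtraction, word));  // true
    }
}
```

In practice you would pass in the text produced by your extractor instead of the two sample strings. If the check still reports a reversed word after you change the extraction setup, that setup has not fixed the ordering.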
