How can I detect Farsi web pages with Tika?

I need some sample code to help me detect Farsi web pages using the Apache Tika toolkit.

    import org.apache.tika.language.LanguageIdentifier;

    LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
    String language = identifier.getLanguage();

I downloaded the Apache Tika jar files and added them to the classpath. This code gives an error for Farsi text, although it works for English. How can I add Farsi to Tika's LanguageIdentifier?

1 answer

Tika does not yet ship with a language profile for Farsi. As of version 1.0, 27 languages are supported:

languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk

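In your example the text is matched to lt (Lithuanian) with a distance of 0.41, which is well above the certainty threshold of 0.022. The LanguageIdentifier source shows how this check works.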
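The 0.41 and 0.022 figures can be read as a distance and a cutoff: the closest profile wins, but the result is only worth trusting when its distance is under the cutoff. The sketch below illustrates that idea in plain Java. It is not Tika's code; the class, method, and constant names are made up for illustration, and only the two numbers come from the example above. In Tika itself, `LanguageIdentifier.isReasonablyCertain()` is the method to call for this check.

```java
import java.util.Map;

// Illustration only: mirrors the idea of the check, not Tika's actual code.
final class CertaintySketch {
    // Cutoff quoted in the answer; the constant name here is hypothetical.
    static final double CERTAINTY_LIMIT = 0.022;

    // Returns the language code whose profile has the smallest distance,
    // or null when the map is empty.
    static String closest(Map<String, Double> distances) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : distances.entrySet()) {
            if (e.getValue() < bestDistance) {
                best = e.getKey();
                bestDistance = e.getValue();
            }
        }
        return best;
    }

    // A match only counts as reliable when its distance is under the cutoff.
    static boolean isReliable(double distance) {
        return distance < CERTAINTY_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println(isReliable(0.41));  // prints false
        System.out.println(isReliable(0.01));  // prints true
    }
}
```

So a distance of 0.41 still yields a "closest" language, but not one that should be trusted.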

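Farsi (ISO 639-1 two-letter code fa) is not among the supported languages. To detect it, Tika needs a language profile for it, which you can create yourself.

The steps are: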

  • Find a Farsi text corpus, for example Hamshahri. It comes as XML, so the plain text has to be extracted first.

  • Create the ngram profile. This can be done with TikaCLI:

    java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt

    This produces a file named fa.ngp, which contains the n-grams.

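  • Make Tika aware of the new profile. Either do this programmatically with LanguageIdentifier.initProfiles(), or put a tika.language.override.properties file on the classpath. In either case, make sure the ngram file is on the classpath as well.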
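For the override-file route, here is a minimal sketch that generates the file contents with plain `java.util.Properties`. It assumes the override file uses the same `languages` key as the listing above, with `fa` appended; check the properties file bundled with your Tika version for any other keys it expects. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch: builds the text of tika.language.override.properties with "fa" added.
final class OverrideFileSketch {
    // Language list quoted in the answer for Tika 1.0.
    static final String STOCK_LANGUAGES =
        "be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk";

    // Appends a code to the comma-separated list unless it is already there.
    static String withLanguage(String languages, String code) {
        for (String existing : languages.split(",")) {
            if (existing.trim().equals(code)) {
                return languages;
            }
        }
        return languages + "," + code;
    }

    // Renders the file body; the "languages" key is an assumption taken
    // from the listing in the answer.
    static String render(String languages) {
        Properties props = new Properties();
        props.setProperty("languages", languages);
        StringWriter out = new StringWriter();
        try {
            props.store(out, null);  // also writes a timestamp comment line
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Save this output as tika.language.override.properties on the classpath.
        System.out.print(render(withLanguage(STOCK_LANGUAGES, "fa")));
    }
}
```

Generating the file from the stock list keeps the existing 27 languages enabled while adding the new one.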

After these steps, Tika should be able to detect Farsi.
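Going back to the corpus step: the corpus is XML, while the profile command takes a plain text file such as fa-corpus.txt. A rough sketch of pulling the text out with the JDK's StAX parser follows. The element name passed in (`TEXT` in the sample) is purely hypothetical, since the actual layout of the corpus files is not described here; inspect the files and adjust it.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Sketch: collects the character data found inside one kind of element.
final class CorpusTextSketch {
    // Returns the text of every <element>...</element>, one per line.
    static String extract(Reader xml, String element) {
        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            int depth = 0;  // greater than 0 while inside the wanted element
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if (depth > 0 || r.getLocalName().equals(element)) {
                            depth++;
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (depth > 0 && --depth == 0) {
                            out.append('\n');  // end of one document's text
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (depth > 0) {
                            out.append(r.getText());
                        }
                        break;
                    default:
                        break;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample; real corpus files will use other element names.
        String sample = "<DOCS><DOC><DATE>2001</DATE><TEXT>first text</TEXT></DOC></DOCS>";
        System.out.print(extract(new StringReader(sample), "TEXT"));
    }
}
```

Write the result to a UTF-8 file so that it matches the -eUTF-8 option used when creating the profile.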

