I am developing a text analysis project using Apache Lucene. I need to lemmatize some text (convert words to their canonical forms). I have already written code that produces stems. Using it, I can convert the following sentence
A stem is a part of a word that never changes even when morphologically inflected; a lemma is the base form of a word. For example, from "produced", the lemma is "produce", but the stem is "produc-". This is because there are words such as production
to
stem part word never chang even when morpholog inflect lemma base form word exampl from produc lemma produc stem produc becaus word product
However, I need the base forms of the words: "example" instead of "exampl", "produce" instead of "produc", and so on.
I use Lucene because it has analyzers for many languages (I need at least English and Russian). I know about the Stanford NLP library, but it does not support Russian.
So, is there a way to do lemmatization for multiple languages, for example using Lucene?
A simplified version of my code responsible for stemming:
// Using Apache Tika to identify the language
LanguageIdentifier identifier = new LanguageIdentifier(text);

// Getting the analyzer according to the language (e.g. EnglishAnalyzer for 'en')
Analyzer analyzer = getAnalyzer(identifier.getLanguage());

TokenStream stream = analyzer.tokenStream("filed", text);
stream.reset();
while (stream.incrementToken()) {
    String stem = stream.getAttribute(CharTermAttribute.class).toString();
    // doing something with the stem
    System.out.print(stem + " ");
}
stream.end();
stream.close();
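getAnalyzer is my own helper, not part of Lucene. A minimal sketch of how it could look, assuming a Lucene version where the analyzers have no-argument constructors (EnglishAnalyzer, RussianAnalyzer and StandardAnalyzer are real Lucene classes; the mapping itself is just an illustration):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Maps a Tika language code to a language-specific Lucene analyzer.
// Falls back to StandardAnalyzer (tokenization only, no stemming)
// for languages that are not handled explicitly.
private Analyzer getAnalyzer(String language) {
    switch (language) {
        case "en":
            return new EnglishAnalyzer();
        case "ru":
            return new RussianAnalyzer();
        default:
            return new StandardAnalyzer();
    }
}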
UPDATE: I found a library
Tags: java, nlp, lucene, stemming, lemmatization
Kirill Simonov