I am developing a text analysis project using Apache Lucene. I need to lemmatize some text (convert words to their canonical forms). I have already written code that produces stems. Using it, I can convert the following sentence
A stem is a part of a word that never changes even when morphologically inflected; a lemma is the base form of a word. For example, from "produced", the lemma is "produce", but the stem is "produc-". This is because there are words such as production
to
stem part word never chang even when morpholog inflect lemma base form word exampl from produc lemma produc stem produc becaus word product
However, I need the base forms of the words: "example" instead of "exampl", "produce" instead of "produc", and so on.
I use Lucene because it has analyzers for many languages (I need at least English and Russian). I know about the Stanford NLP library, but it does not support Russian.
So, is there a way to do lemmatization for multiple languages, for example using Lucene?
A simplified version of my code responsible for stemming:
// Using Apache Tika to identify the language
LanguageIdentifier identifier = new LanguageIdentifier(text);

// Getting the analyzer according to the language (e.g. EnglishAnalyzer for 'en')
Analyzer analyzer = getAnalyzer(identifier.getLanguage());

TokenStream stream = analyzer.tokenStream("filed", text);
stream.reset();
while (stream.incrementToken()) {
    String stem = stream.getAttribute(CharTermAttribute.class).toString();
    // doing something with the stem
    System.out.print(stem + " ");
}
stream.end();
stream.close();
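getAnalyzer is my own helper, not part of Lucene. A minimal sketch of how it could look, assuming a Lucene version where the analyzers have no-argument constructors (EnglishAnalyzer, RussianAnalyzer and StandardAnalyzer are real Lucene classes; the mapping itself is just an illustration):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Maps a Tika language code to a language-specific Lucene analyzer.
// Falls back to StandardAnalyzer (tokenization only, no stemming)
// for languages that are not handled explicitly.
private Analyzer getAnalyzer(String language) {
    switch (language) {
        case "en":
            return new EnglishAnalyzer();
        case "ru":
            return new RussianAnalyzer();
        default:
            return new StandardAnalyzer();
    }
}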
UPDATE: I found a library
Tags: java, nlp, lucene, stemming, lemmatization
Kirill Simonov