Segmenting words, and grouping hyphenated and apostrophe words from text

I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and apostrophe words are written without the apostrophe. There are similar issues, such as different spellings of the same word (for example: color, colour), or a single word written with a space inside it (for example: up to, upto, blank space, blankspace). I need to group these variants under one single representation and insert it into a set/hashmap or some other structure. There may also be problems with accented words written without their accent characters (although I have not encountered them yet). Currently I am cutting the words at any whitespace character and at every non-alphanumeric character, then stemming them and omitting stop words.

These indexes will later be used for document similarity checking, searching, and so on. Any suggestions on how I can deal with these problems? I thought about matching each scanned word against a word list, but the problem is that proper nouns and non-dictionary words would then be omitted.

Info: My code is in Java

1 answer

I think you should apply a combination of methods.

1) For common spelling variants, I would go with a dictionary-based method. Since they are common, I would not worry about missing non-dictionary words. This should solve the color/colour problem.

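2) For typos and other non-standard spellings, you can apply the Metaphone algorithm (http://en.wikipedia.org/wiki/Metaphone), which encodes tokens by their approximate English pronunciation, so that variants that sound alike can be matched to each other. Spellings that differ by only a character or two (for example, Huseyin and Housein) can be matched by edit distance.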

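3) For apostrophe words and hyphenated compounds, you can store both variants. For example, "John's" can be indexed as both "John s" and "Johns", and "blank-space" can be converted to (or stored alongside) "blank space" and "blankspace".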

4) For compound words written without any hyphen, you could use an existing component such as the HyphenationCompoundWordTokenFilterFactory class in Solr (http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html).
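Points 1 and 3 can be sketched in plain Java roughly as follows. This is only an illustration under assumptions: the class name `TokenVariants`, its method names, and the two dictionary entries are hypothetical, and a real variant table would be loaded from an external word list rather than hard-coded. Tokens that are not in the table pass through unchanged, so proper nouns are not lost.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or Solr.
final class TokenVariants {
    // Illustrative entries only; a real table would come from a variant dictionary.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("colour", "color");
        CANONICAL.put("coloured", "colored");
    }

    /** Point 1: map a known spelling variant to its canonical form.
     *  Unknown tokens (e.g. proper nouns) are returned lower-cased, not dropped. */
    static String canonical(String token) {
        String key = token.toLowerCase(Locale.ROOT);
        return CANONICAL.getOrDefault(key, key);
    }

    /** Point 3: for a token containing ' or -, return the spaced and the joined form.
     *  Only ASCII apostrophes and hyphens are handled in this sketch. */
    static Set<String> punctuationVariants(String token) {
        Set<String> out = new LinkedHashSet<>();
        if (!token.matches(".*['-].*")) {
            out.add(token); // nothing to split or join
            return out;
        }
        out.add(token.replaceAll("['-]+", " ").trim()); // "John's" -> "John s"
        out.add(token.replaceAll("['-]+", ""));         // "John's" -> "Johns"
        return out;
    }

    public static void main(String[] args) {
        System.out.println(canonical("Colour"));                // color
        System.out.println(punctuationVariants("John's"));      // [John s, Johns]
        System.out.println(punctuationVariants("blank-space")); // [blank space, blankspace]
    }
}
```

Every returned variant can then be added to the same set/hashmap entry, so whichever form appears in a document or a query lands on the same key.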

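The last point raises a broader question: rather than building your own search library, why not use Lucene (or Solr, which is built on Lucene), a Java search library that already has ways of dealing with these problems? For example, there are filters for phonetic indexing (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html). There is also a FuzzyQuery component, which matches similar terms within a certain edit distance (http://lucene.apache.org/core/old_versioned_docs/versions/3_2_0/api/all/org/apache/lucene//FuzzyQuery.html)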
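To give a feel for the measure behind fuzzy matching, here is a minimal plain-Java sketch of Levenshtein edit distance. It does not use Lucene and is not how FuzzyQuery is implemented internally. The class name `EditDistance`, its methods, and the threshold of 2 in the example are assumptions made for illustration.

```java
// Hypothetical helper for illustration; independent of Lucene.
final class EditDistance {
    /** Levenshtein distance (insertions, deletions, substitutions) using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True if the two tokens are at most max edits apart. */
    static boolean withinDistance(String a, String b, int max) {
        return levenshtein(a, b) <= max;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("color", "colour"));       // 1
        System.out.println(levenshtein("Huseyin", "Housein"));    // 2
        System.out.println(withinDistance("John", "Lucene", 2));  // false
    }
}
```

Comparing every pair of tokens this way is quadratic in the vocabulary size, which is one reason to rely on an indexed implementation for anything beyond small experiments.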
