Suggestions for finding alternative word forms with Lucene

I have a site that can be searched with Lucene. I noticed from the logs that users sometimes don't find what they are looking for because they enter the singular form of a term while only the plural form of that term appears on the site. I would like the search to also match other forms of the word. This is a problem that I'm sure has been solved many times, so what is the best solution?

Please note: this site has English content only.

Some approaches I was thinking about:

  • Look up the word in some thesaurus file to find alternative forms of the word.
    • Some examples:
      • Searches for "car" also add "auto" to the query.
      • Searches for "hyphenate" also add "hyphenation" and "hyphenated" to the query.
      • Searches for "small" also add "smaller" and "smallest" to the query.
      • Searches for "can" also add "cannot", "can't", "cans" and "canned" to the query.
      • And it should work the other way around (that is, a search for "hyphenated" should also add "hyphenate" and "hyphenation").
    • Disadvantages:
      • Doesn't cover many new technical words unless the dictionary/thesaurus is updated frequently.
      • I'm not sure about the performance of looking words up in the thesaurus file.
  • Generate alternative forms algorithmically, based on some heuristics.
    • Some examples:
      • If the word ends in "s", "es", "ed", "er" or "est", strip the suffix.
      • If the word ends in "ies", "ied", "ier" or "iest", convert it to "y".
      • If the word ends in "y", convert it to "ies", "ied", "ier" and "iest".
      • Try appending "s", "es", "er" and "est" to the word.
    • Disadvantages:
      • Creates many non-words for most entries.
      • Feels like a hack.
      • Looks like what you find on TheDailyWTF.com. :)
  • Or something much more sophisticated?
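For concreteness, the suffix heuristics above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not a real stemmer, and as noted it will happily generate non-words:

```python
def alternative_forms(word):
    """Generate candidate alternative forms using naive suffix heuristics."""
    forms = set()
    # Rule: if the word ends in "ies"/"ied"/"ier"/"iest", convert to "y".
    for suffix in ("ies", "ied", "ier", "iest"):
        if word.endswith(suffix):
            forms.add(word[: -len(suffix)] + "y")
    # Rule: if the word ends in "es"/"ed"/"er"/"est"/"s", strip the suffix.
    for suffix in ("es", "ed", "er", "est", "s"):
        if word.endswith(suffix):
            forms.add(word[: -len(suffix)])
    # Rule: if the word ends in "y", convert to "ies"/"ied"/"ier"/"iest".
    if word.endswith("y"):
        for suffix in ("ies", "ied", "ier", "iest"):
            forms.add(word[:-1] + suffix)
    # Rule: try appending common suffixes.
    for suffix in ("s", "es", "er", "est"):
        forms.add(word + suffix)
    forms.discard(word)
    return forms
```

For example, `alternative_forms("pony")` includes "ponies" but also non-words like "ponyes", which is exactly the disadvantage described above.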

I'm leaning toward some combination of the first two approaches, but I'm not sure where to find the thesaurus file (or whatever it's called, since "thesaurus" isn't quite right, but it isn't exactly a dictionary either).

+4
language-agnostic search lucene linguistics
5 answers

Consider including PorterStemFilter in your analysis pipeline. Be sure to apply the same analysis to queries that was used to build the index.

I also used the Lancaster stemming algorithm with good results. Using PorterStemFilter as a guide, it is easy to integrate with Lucene.
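The crucial point here, that the same analysis must run at index time and at query time, can be illustrated with a toy pipeline. `crude_stem` below is only a hypothetical stand-in for a real stemmer such as Lucene's PorterStemFilter; the names and the index structure are invented for this sketch:

```python
def crude_stem(word):
    # Stand-in for a real stemmer (Porter, Lancaster, ...).
    for suffix in ("ies", "es", "s", "ed", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def analyze(text):
    # The same pipeline must be applied to documents and to queries,
    # so that both sides agree on the indexed terms.
    return [crude_stem(word) for word in text.lower().split()]

# Index time: stemmed terms go into the inverted index.
index = {}
for doc_id, doc in enumerate(["Cars for sale", "Bicycle parts"]):
    for term in analyze(doc):
        index.setdefault(term, set()).add(doc_id)

# Query time: the query goes through the identical analyzer,
# so "part" and "parts" hit the same posting list.
def search(query):
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits
```

If the query-time analysis differed from the index-time analysis, "cars" would be stemmed to a term that was never indexed and would match nothing.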

+4

Stemming works fine for English, but for languages where stemming is nearly impossible (like mine), option #1 is viable. I know of at least one such implementation for my language (Icelandic) for Lucene, and it seems to work very well.

+4

Some of these look like pretty neat ideas. Personally, I would just apply a query transformation to make the terms fuzzy, or you could use the built-in FuzzyQuery, which uses Levenshtein edit distance and would also help with misspellings.

Fuzzy queries use the tilde character and likewise rely on Levenshtein distance. Consider a search for "car": if you change the query to "car~", it will find "car", "cars" and so on. There are other query transformations that should handle almost everything you need.
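The Levenshtein edit distance behind FuzzyQuery is easy to sketch; a distance of 1 between "car" and "cars" is why the fuzzy query matches both. This is an illustration only, assuming the classic dynamic-programming formulation — Lucene's own implementation is far more optimized:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

So `levenshtein("car", "cars")` is 1, within the edit budget a fuzzy query allows.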

+3

If you work in a specialized field (I did this with gardening) or with a language that doesn't respond well to conventional stemming methods, you can use query logging to build a manual stemming table.

Just create a word → stem mapping for all the mismatches you can think of / that people search for, and then at index or query time replace any word that appears in the table with its stem. Thanks to query caching, this is a pretty cheap solution.
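A minimal sketch of such a table, with hypothetical gardening entries invented for the example — a real table would be built from your query logs:

```python
# Hypothetical manual stemming table built from query logs.
STEM_TABLE = {
    "roses": "rose",
    "rose": "rose",
    "pruning": "prune",
    "pruned": "prune",
    "prune": "prune",
}

def normalize(words):
    # Replace any word found in the table with its stem; words not in
    # the table pass through unchanged. Apply this both when indexing
    # documents and when parsing queries.
    return [STEM_TABLE.get(w.lower(), w.lower()) for w in words]
```

A query for "pruned roses" and a document containing "pruning rose" then both reduce to the same terms.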

+1

Stemming is a fairly standard way to solve this problem. I found that the Porter stemmer is too aggressive for standard keyword searches: it ends up conflating words that have different meanings. Try the KStemmer algorithm instead.

0
