I have a site that is searchable using Lucene. I noticed from the logs that users sometimes don't find what they are looking for because they enter a singular term, but only the plural form of that term is used on the site. I would like the search to match other forms of a word as well. I am sure this problem has been solved many times, so what is the best practice?
Please note: this site has English content only.
Some approaches I was thinking about:
- Look up the word in some kind of thesaurus file to find its alternative forms.
  - Some examples:
    - A search for "car" also adds "cars" to the query.
    - A search for "carry" also adds "carries" and "carried" to the query.
    - A search for "small" also adds "smaller" and "smallest" to the query.
    - A search for "can" also adds "can't", "cannot", "cans" and "canned" to the query.
    - It should also work in reverse (that is, a search for "carries" should add "carry" and "carried").
  - Disadvantages:
    - Does not work for many new technical words unless the dictionary / thesaurus is updated frequently.
    - I am not sure about the performance of searching the thesaurus file.
- Generate alternative forms algorithmically, based on some heuristics.
  - Some examples:
    - If the word ends with "s", "es", "ed", "er" or "est", drop the suffix.
    - If the word ends with "ies", "ied", "ier" or "iest", replace the suffix with "y".
    - If the word ends with "y", replace it with "ies", "ied", "ier" and "iest".
    - Try appending "s", "es", "er" and "est" to the word.
  - Disadvantages:
    - Generates many non-words for most inputs.
    - Feels like a hack.
    - Looks like something you would find on TheDailyWTF.com. :)
- Something much more sophisticated?
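To make the heuristic approach concrete, here is a rough Python sketch of the suffix rules listed above. The function name `candidate_forms` is something I made up for illustration, and the rules are deliberately naive, so it produces plenty of non-words.

```python
def candidate_forms(word):
    """Guess alternate forms of an English word with naive suffix rules.

    Hypothetical helper for illustration only; many results are non-words.
    """
    word = word.lower()
    forms = set()

    # "ies" / "ied" / "ier" / "iest" -> "y"  (e.g. "carries" -> "carry")
    for suffix in ("iest", "ies", "ied", "ier"):
        if word.endswith(suffix) and len(word) > len(suffix):
            forms.add(word[:-len(suffix)] + "y")

    # Drop "s" / "es" / "ed" / "er" / "est"  (e.g. "cars" -> "car")
    for suffix in ("est", "es", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix):
            forms.add(word[:-len(suffix)])

    # Trailing "y" -> "ies" / "ied" / "ier" / "iest"  (e.g. "carry" -> "carries")
    if word.endswith("y") and len(word) > 1:
        stem = word[:-1]
        forms.update(stem + s for s in ("ies", "ied", "ier", "iest"))

    # Blindly append "s" / "es" / "er" / "est"  (e.g. "small" -> "smaller")
    forms.update(word + s for s in ("s", "es", "er", "est"))

    forms.discard(word)  # the original word is not an "alternate" form
    return forms
```

For example, `candidate_forms("carry")` includes "carries" and "carried", but also junk such as "carryes", which is why I would not want to use this on its own.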
I'm thinking of using some combination of the first two approaches, but I'm not sure where to find such a thesaurus file (or what it is even called, since "thesaurus" isn't quite right, but neither is "dictionary").
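For the lookup side, this is roughly what I imagine doing with such a file once I have it. The sketch below assumes the data arrives as groups of related forms. That format, the sample groups, and the names `build_form_index` and `expand_term` are all my own assumptions, not something any particular file provides. Loading everything into an in-memory dict should make lookups cheap, and indexing every member of a group makes the expansion work in both directions.

```python
from collections import defaultdict


def build_form_index(groups):
    """Map each word to the other words in its group (lookup works both ways)."""
    index = defaultdict(set)
    for group in groups:
        words = {w.lower() for w in group}
        for w in words:
            index[w] |= words - {w}
    return dict(index)


def expand_term(term, index):
    """Return the search term followed by its known alternate forms, if any."""
    term = term.lower()
    return [term] + sorted(index.get(term, ()))


# Hypothetical sample data standing in for the real word-forms file.
groups = [
    ["carry", "carries", "carried"],
    ["small", "smaller", "smallest"],
    ["car", "cars"],
]
index = build_form_index(groups)
print(expand_term("carries", index))  # ['carries', 'carried', 'carry']
```

Words that are missing from the file come back unchanged, and that is where the heuristic guesses could serve as a fallback.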
language-agnostic search lucene linguistics
Kip