Is it possible to search for words within a Lucene index by part of speech?

Question

I have a large set of documents stored in a Lucene index, and I use a custom analyzer that basically tokenizes the document contents.

Now, if I search the documents for the word “love”, I get results where love is used either as a noun or as a verb, whereas I want only the documents that use love as a verb.

How can I implement a search where I specify the part of speech along with the word, so that the results contain only love used as a verb and not as a noun?

One way I can think of is to POS-tag every word of each document up front and store the tag by appending it to the word with "_" or some other separator, and then search accordingly, but I wanted to know whether there is a smarter way to do this in Lucene.

java tokenize nlp lucene solr

London guy Apr 13 '13 at 13:53

1 answer

phani · Answer 1 · 2013-04-13T17:26:11+0000

I can think of the following approaches.

Approach 1

Just as you mentioned: determine the part-of-speech tag and append it to the actual term during indexing. Do the same at query time.
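A minimal sketch of this idea in plain Java, with no Lucene dependency: a toy inverted index maps each word_TAG term to document ids. The class name PosTermIndex, the "_" separator, and the tags (VB, NN, and so on) are illustrative assumptions, and the tokens are assumed to have been tagged already by some external POS tagger.

```java
import java.util.*;

// Toy inverted index keyed on "word_TAG" terms (hypothetical names; not Lucene API).
class PosTermIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Combine a word and its POS tag into a single index term, e.g. "love_VB".
    static String term(String word, String tag) {
        return word.toLowerCase(Locale.ROOT) + "_" + tag;
    }

    // Index one document given as already-tagged tokens: {word, tag} pairs.
    void add(int docId, String[][] taggedTokens) {
        for (String[] t : taggedTokens) {
            postings.computeIfAbsent(term(t[0], t[1]), k -> new TreeSet<>()).add(docId);
        }
    }

    // Apply the same word_TAG transformation at query time.
    Set<Integer> search(String word, String tag) {
        return postings.getOrDefault(term(word, tag), Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        PosTermIndex index = new PosTermIndex();
        index.add(1, new String[][] {{"I", "PRP"}, {"love", "VB"}, {"tea", "NN"}});
        index.add(2, new String[][] {{"Love", "NN"}, {"hurts", "VBZ"}});
        // Only the document that uses "love" as a verb is returned.
        System.out.println(index.search("love", "VB"));
    }
}
```

In a real Lucene setup the same transformation would presumably live in the analyzer chain, so that indexing and querying stay in sync.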

I would like to discuss the disadvantages of this approach.

Cons:

1) Future requirements may call for results regardless of the part of speech. An index that contains only the modified terms will not support that.

2) You might want to run a BooleanQuery such as "term: noun or adjective". For that you would have to write your own query expander.
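To illustrate the second point, here is a rough sketch of what such a query expander could look like in plain Java. It assumes the word_TAG term format from Approach 1; the class PosQueryExpander and its methods are hypothetical helpers, not part of Lucene. A search regardless of part of speech would need the same expansion over every tag in the tag set.

```java
import java.util.*;

// Hypothetical helper that expands one word into several word_TAG terms (not Lucene API).
class PosQueryExpander {
    // expand("love", ["NN", "JJ"]) yields the terms love_NN and love_JJ.
    static List<String> expand(String word, List<String> tags) {
        List<String> terms = new ArrayList<>();
        for (String tag : tags) {
            terms.add(word.toLowerCase(Locale.ROOT) + "_" + tag);
        }
        return terms;
    }

    // OR semantics: union of the posting lists of every expanded term.
    static Set<Integer> searchAny(Map<String, Set<Integer>> postings,
                                  String word, List<String> tags) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : expand(word, tags)) {
            hits.addAll(postings.getOrDefault(term, Collections.<Integer>emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        postings.put("love_VB", new TreeSet<>(Arrays.asList(1)));
        postings.put("love_NN", new TreeSet<>(Arrays.asList(2, 3)));
        // "love" as a noun or an adjective: only the love_NN postings match here.
        System.out.println(searchAny(postings, "love", Arrays.asList("NN", "JJ")));
    }
}
```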

Approach 2

Try using Lucene's Payload feature.

Here is a quick guide to Lucene Payloads.

Steps to address your use case:

1) Store the part-of-speech tag as a payload.

2) Have a custom Similarity class for each part-of-speech tag.

3) Based on the query, assign the appropriate custom Similarity to the IndexSearcher. For example, assign NounBoostingSimilarity for a noun query.

4) Boost or lower the document score based on the payload. An example is given in the above tutorial.

5) Write a custom Collector to filter out documents whose scores do not match the score boosting logic above.

The advantage of this approach is that the index remains compatible with any other regular search.
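The data-layout idea behind this can be sketched in plain Java, without the Similarity and Collector machinery: the word itself stays the index term, and the POS tag is kept as side data on each posting, loosely mimicking a payload. Everything here (PayloadStyleIndex, the tag strings) is an illustrative assumption and not Lucene API; Lucene attaches payloads to term positions as raw bytes, whereas this toy version keeps a set of tag strings per document.

```java
import java.util.*;

// Toy index that keeps the word as the term and the POS tag as side data,
// loosely mimicking a payload (hypothetical names; not Lucene API).
class PayloadStyleIndex {
    // term -> docId -> POS tags seen for that term in that document
    private final Map<String, Map<Integer, Set<String>>> postings = new HashMap<>();

    void add(int docId, String word, String tag) {
        postings.computeIfAbsent(word.toLowerCase(Locale.ROOT), k -> new TreeMap<>())
                .computeIfAbsent(docId, k -> new HashSet<>())
                .add(tag);
    }

    // Regular search: the tag data is simply ignored.
    Set<Integer> search(String word) {
        Map<Integer, Set<String>> docs = postings.get(word.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptySet() : docs.keySet();
    }

    // POS-restricted search: keep only documents whose tag data contains the tag.
    Set<Integer> search(String word, String tag) {
        Set<Integer> hits = new TreeSet<>();
        Map<Integer, Set<String>> docs = postings.get(word.toLowerCase(Locale.ROOT));
        if (docs == null) return hits;
        for (Map.Entry<Integer, Set<String>> e : docs.entrySet()) {
            if (e.getValue().contains(tag)) hits.add(e.getKey());
        }
        return hits;
    }

    public static void main(String[] args) {
        PayloadStyleIndex index = new PayloadStyleIndex();
        index.add(1, "love", "VB");
        index.add(2, "love", "NN");
        System.out.println(index.search("love"));       // plain search still sees both documents
        System.out.println(index.search("love", "VB")); // restricted to the verb usage
    }
}
```

Because the term is unmodified, a plain search keeps working, which is the compatibility benefit described above.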

Cons:

1) Maintenance overhead: you need to maintain multiple IndexSearchers, one for each Similarity.

2) A somewhat complex solution.

To be honest, I am not satisfied with my own solution, but I wanted to let you know that there is another way. It all depends on your scenario: whether this is a one-off academic project, a commercial one, and so on.
