Searching for query match positions with Lucene

With Lucene, what would be the recommended approach for finding the positions of query matches within the documents returned by a search?

More specifically, suppose there is a “fullText” field in the index documents that stores the text content of each document. Also suppose that for one of these documents the content is “Fast brown fox jumping over a lazy dog.” A search is then made for “fox dog.” Obviously, the document will be a hit.

In this case, is it possible to use Lucene to report the matched regions within the found document? For this scenario, I would like to get something like:

[{match: "fox", startIndex: 10, length: 3}, {match: "dog", startIndex: 34, length: 3}] 

I suspect that this can be implemented with what the org.apache.lucene.search.highlight package provides, but I'm not sure about the general approach ...
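For reference, here is a rough sketch of the closest thing I have found in that package so far, the Highlighter, which returns marked-up fragments rather than explicit start/length pairs (the field name "fullText" and the Query/Analyzer arguments are just my assumptions, not a confirmed approach):

 import java.io.IOException;

 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.highlight.Highlighter;
 import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
 import org.apache.lucene.search.highlight.QueryScorer;
 import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

 public class HighlightSketch {

     // Rough sketch only: the highlight package wraps matches in markers,
     // but it does not directly return start/length pairs.
     // The Analyzer passed in should be the same one used at index time.
     static String markMatches(Query query, Analyzer analyzer, String fullText)
             throws IOException, InvalidTokenOffsetsException {
         QueryScorer scorer = new QueryScorer(query);
         Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<em>", "</em>"), scorer);
         // Returns the best-scoring fragment with matches wrapped in <em>...</em>.
         return highlighter.getBestFragment(analyzer, "fullText", fullText);
     }
 }

That gives me highlighted snippets, but not the offsets I am actually after.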

2 answers

TermFreqVector is what I used. Here is a working demo that prints both the positions of a term and its start and end offsets:

 import java.io.File;
 import java.io.IOException;

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.TermFreqVector;
 import org.apache.lucene.index.TermPositionVector;
 import org.apache.lucene.index.TermVectorOffsetInfo;
 import org.apache.lucene.queryParser.ParseException;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.ScoreDoc;
 import org.apache.lucene.search.TopScoreDocCollector;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.Version;

 public class Search {

     public static void main(String[] args) throws IOException, ParseException {
         Search s = new Search();
         s.doSearch(args[0], args[1]);
     }

     Search() {
     }

     public void doSearch(String db, String querystr) throws IOException, ParseException {
         // 1. Specify the analyzer for tokenizing text.
         //    The same analyzer should be used as was used for indexing.
         StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
         Directory index = FSDirectory.open(new File(db));

         // 2. query
         Query q = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse(querystr);

         // 3. search
         int hitsPerPage = 10;
         IndexSearcher searcher = new IndexSearcher(index, true);
         IndexReader reader = IndexReader.open(index, true);
         searcher.setDefaultFieldSortScoring(true, false);
         TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
         searcher.search(q, collector);
         ScoreDoc[] hits = collector.topDocs().scoreDocs;

         // 4. display term positions and term offsets
         System.out.println("Found " + hits.length + " hits.");
         for (int i = 0; i < hits.length; ++i) {
             int docId = hits[i].doc;
             TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
             TermPositionVector tpvector = (TermPositionVector) tfvector;
             // This part works only if there is one term in the query string;
             // otherwise you will have to iterate this section over the query terms.
             int termidx = tfvector.indexOf(querystr);
             int[] termposx = tpvector.getTermPositions(termidx);
             TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);

             for (int j = 0; j < termposx.length; j++) {
                 System.out.println("termpos : " + termposx[j]);
             }
             for (int j = 0; j < tvoffsetinfo.length; j++) {
                 int offsetStart = tvoffsetinfo[j].getStartOffset();
                 int offsetEnd = tvoffsetinfo[j].getEndOffset();
                 System.out.println("offsets : " + offsetStart + " " + offsetEnd);
             }

             // print some info about where the hit was found...
             Document d = searcher.doc(docId);
             System.out.println((i + 1) + ". " + d.get("path"));
         }

         // The searcher can only be closed when there
         // is no need to access the documents any more.
         searcher.close();
     }
 }
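Note that getTermFreqVector() and getOffsets() only return data if the field was indexed with term vectors that include positions and offsets. Here is a minimal indexing sketch for the same Lucene 3.x API; the index path, document path, and sample text are placeholders, not part of the demo above:

 import java.io.File;
 import java.io.IOException;

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.Version;

 public class IndexWithTermVectors {

     public static void main(String[] args) throws IOException {
         // args[0] is the index directory; field names mirror the demo above.
         Directory indexDir = FSDirectory.open(new File(args[0]));
         IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT,
                 new StandardAnalyzer(Version.LUCENE_CURRENT));
         IndexWriter writer = new IndexWriter(indexDir, config);

         Document doc = new Document();
         doc.add(new Field("path", "docs/example.txt", Field.Store.YES, Field.Index.NOT_ANALYZED));
         // Term vectors with positions and offsets are required for
         // getTermFreqVector()/getOffsets() to return anything.
         doc.add(new Field("contents", "Fast brown fox jumping over a lazy dog.",
                 Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
         writer.addDocument(doc);
         writer.close();
     }
 }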

Here is a solution for Lucene 5.2.1. It only works for single-word queries, but it should demonstrate the basic principles.

Main idea:

  • Get a TokenStream for every document that matches your query.
  • Create a QueryScorer and initialize it with the extracted TokenStream.
  • Loop over each token of the stream (tokenStream.incrementToken()) and check whether the token matches the search criteria (queryScorer.getTokenScore()).

Here is the code:

 import java.io.IOException;
 import java.util.List;
 import java.util.Vector;

 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.de.GermanAnalyzer;
 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.index.DirectoryReader;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.ScoreDoc;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
 import org.apache.lucene.search.highlight.QueryScorer;
 import org.apache.lucene.search.highlight.TokenSources;

 public class OffsetSearcher {

     private IndexReader reader;

     public OffsetSearcher(IndexWriter indexWriter) throws IOException {
         reader = DirectoryReader.open(indexWriter, true);
     }

     public OffsetData[] getTermOffsets(Query query) throws IOException, InvalidTokenOffsetsException {
         List<OffsetData> result = new Vector<>();

         IndexSearcher searcher = new IndexSearcher(reader);
         TopDocs topDocs = searcher.search(query, 1000);
         ScoreDoc[] scoreDocs = topDocs.scoreDocs;

         Document doc;
         TokenStream tokenStream;
         CharTermAttribute termAtt;
         OffsetAttribute offsetAtt;
         QueryScorer queryScorer;
         OffsetData offsetData;
         String txt, tokenText;

         for (int i = 0; i < scoreDocs.length; i++) {
             int docId = scoreDocs[i].doc;
             doc = reader.document(docId);

             txt = doc.get(RunSearch.CONTENT);
             tokenStream = TokenSources.getTokenStream(RunSearch.CONTENT, reader.getTermVectors(docId),
                     txt, new GermanAnalyzer(), -1);

             termAtt = (CharTermAttribute) tokenStream.addAttribute(CharTermAttribute.class);
             offsetAtt = (OffsetAttribute) tokenStream.addAttribute(OffsetAttribute.class);

             queryScorer = new QueryScorer(query);
             queryScorer.setMaxDocCharsToAnalyze(RunSearch.MAX_DOC_CHARS);

             TokenStream newStream = queryScorer.init(tokenStream);
             if (newStream != null) {
                 tokenStream = newStream;
             }
             queryScorer.startFragment(null);

             tokenStream.reset();

             int startOffset, endOffset;
             for (boolean next = tokenStream.incrementToken();
                  next && (offsetAtt.startOffset() < RunSearch.MAX_DOC_CHARS);
                  next = tokenStream.incrementToken()) {
                 startOffset = offsetAtt.startOffset();
                 endOffset = offsetAtt.endOffset();

                 if ((endOffset > txt.length()) || (startOffset > txt.length())) {
                     throw new InvalidTokenOffsetsException("Token " + termAtt.toString()
                             + " exceeds length of provided text sized " + txt.length());
                 }

                 float res = queryScorer.getTokenScore();
                 if (res > 0.0F && startOffset <= endOffset) {
                     tokenText = txt.substring(startOffset, endOffset);
                     offsetData = new OffsetData(tokenText, startOffset, endOffset, docId);
                     result.add(offsetData);
                 }
             }
         }

         return result.toArray(new OffsetData[result.size()]);
     }

     public void close() throws IOException {
         reader.close();
     }

     public static class OffsetData {

         public String phrase;
         public int startOffset;
         public int endOffset;
         public int docId;

         public OffsetData(String phrase, int startOffset, int endOffset, int docId) {
             super();
             this.phrase = phrase;
             this.startOffset = startOffset;
             this.endOffset = endOffset;
             this.docId = docId;
         }
     }
 }
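To make the class above compile and run in isolation, here is a small usage sketch. The RunSearch class with its CONTENT field name and MAX_DOC_CHARS constant belongs to the original project, so it is stubbed out here with assumed values, and the index directory is assumed to already contain documents in the CONTENT field:

 import java.nio.file.Paths;

 import org.apache.lucene.analysis.de.GermanAnalyzer;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.queryparser.classic.QueryParser;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;

 public class RunSearch {

     // Stubs for the constants referenced above; the values are assumptions.
     public static final String CONTENT = "content";
     public static final int MAX_DOC_CHARS = 100000;

     public static void main(String[] args) throws Exception {
         // args[0] is an existing index directory, args[1] is the query string.
         Directory dir = FSDirectory.open(Paths.get(args[0]));
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new GermanAnalyzer()));

         Query query = new QueryParser(CONTENT, new GermanAnalyzer()).parse(args[1]);

         OffsetSearcher searcher = new OffsetSearcher(writer);
         for (OffsetSearcher.OffsetData d : searcher.getTermOffsets(query)) {
             System.out.println(d.phrase + " @ [" + d.startOffset + ", " + d.endOffset + ") in doc " + d.docId);
         }
         searcher.close();
         writer.close();
     }
 }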
