How to increase position offsets in a Lucene index to match <p> tags?

Question


I am using Lucene 3.0.3. In preparation for using SpanQuery and PhraseQuery, I would like to mark paragraph boundaries in my index so that these queries cannot match across a paragraph boundary. I understand that, when processing the text, I need to increase the position by a sufficiently large value via PositionIncrementAttribute to mark each boundary. Assume that in the source document the paragraphs are delimited by <p>...</p> pairs.

How do I set up a token stream that detects the tags? Also, I really don't want to index the tags themselves. For indexing purposes, I would rather increase the position increment of the next legitimate token than emit a token for the tag, since I do not want the tag to affect search.

+7

html indexing tags position lucene

Gene Golovchinsky Apr 21 '11 at 20:46


1 answer

Christian Kohlschütter · Accepted Answer · 2011-04-22T08:36:54+0000

The easiest way to add gaps (PositionIncrement > 1) is to supply a custom TokenStream. You do not need to change your analyzer for this. However, the HTML parsing must be done upstream (i.e. you must segment and clean your input text before passing it to Lucene).
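As an aside on the upstream step: the following is a minimal sketch of segmenting the input into paragraphs before anything reaches Lucene. It is not part of the original answer. The class name `ParagraphSplitter` and the regex-based approach are assumptions for illustration; it only handles simple, non-nested <p>...</p> pairs, and real-world HTML would likely call for a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits simple HTML into plain-text paragraphs.
public class ParagraphSplitter {
    // Matches one <p ...>...</p> pair; DOTALL lets a paragraph span several lines.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches any remaining inline tag such as <b> or </i>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static List<String> split(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            // Drop inline tags, then collapse runs of whitespace.
            String text = TAG.matcher(m.group(1)).replaceAll(" ")
                    .replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // prints [A B C, D E F]
        System.out.println(split("<p>A B C</p>\n<p>D <b>E</b> F</p>"));
    }
}
```

Each returned string can then be added to the document as its own field value, with a gap between consecutive paragraphs.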

Here is a complete working example (imports omitted):

    // Written against the Lucene 4.10.1 API.
    public class GapTest {

        public static void main(String[] args) throws Exception {
            final Directory dir = new RAMDirectory();

            final IndexWriterConfig iwConfig = new IndexWriterConfig(
                    Version.LUCENE_4_10_1, new SimpleAnalyzer());
            final IndexWriter iw = new IndexWriter(dir, iwConfig);

            // Two "paragraphs" in the same field, separated by a gaps-only stream.
            Document doc = new Document();
            doc.add(new TextField("body", "A B C", Store.YES));
            doc.add(new TextField("body", new PositionIncrementTokenStream(10)));
            doc.add(new TextField("body", "D E F", Store.YES));

            System.out.println(doc);

            iw.addDocument(doc);
            iw.close();

            final IndexReader ir = DirectoryReader.open(dir);
            IndexSearcher is = new IndexSearcher(ir);

            QueryParser qp = new QueryParser("body", new SimpleAnalyzer());

            // Print the hit count for phrase queries with and without slop.
            for (String q : new String[] { "\"A B C\"", "\"A B C D\"",
                    "\"A B C D\"", "\"A B C D\"~10", "\"A B C D E F\"~10",
                    "\"A B C D F E\"~10", "\"A B C D F E\"~11" }) {
                Query query = qp.parse(q);
                TopDocs docs = is.search(query, 10);
                System.out.println(docs.totalHits + "\t" + q);
            }
            ir.close();
        }

        /**
         * A gaps-only TokenStream (uses {@link PositionIncrementAttribute}).
         *
         * @author Christian Kohlschuetter
         */
        private static final class PositionIncrementTokenStream extends TokenStream {
            private boolean first = true;
            private PositionIncrementAttribute attribute;
            private final int positionIncrement;

            public PositionIncrementTokenStream(final int positionIncrement) {
                super();
                this.positionIncrement = positionIncrement;
                attribute = addAttribute(PositionIncrementAttribute.class);
            }

            @Override
            public boolean incrementToken() throws IOException {
                if (first) {
                    // Emit a single token that only carries the position increment.
                    first = false;
                    attribute.setPositionIncrement(positionIncrement);
                    return true;
                } else {
                    return false;
                }
            }

            @Override
            public void reset() throws IOException {
                super.reset();
                first = true;
            }
        }
    }
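To make the position arithmetic concrete, here is a small plain-Java model of how positions accumulate when the first token of each later paragraph receives an enlarged increment. It does not use Lucene and is not part of the original answer. The class name `PositionGapModel` is hypothetical, and the model assumes only that the first token lands at position 0 and every token advances the position by its increment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of position bookkeeping; no Lucene classes involved.
public class PositionGapModel {

    /**
     * Returns the position of every token, in order. Each token advances
     * the position by 1; the first token of every paragraph after the
     * first advances it by 1 + gap instead.
     */
    public static List<Integer> positions(List<String[]> paragraphs, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;               // so that the first token lands on 0
        boolean seenTokens = false; // true once any paragraph produced a token
        for (String[] tokens : paragraphs) {
            for (int i = 0; i < tokens.length; i++) {
                int increment = (i == 0 && seenTokens) ? 1 + gap : 1;
                pos += increment;
                out.add(pos);
            }
            if (tokens.length > 0) {
                seenTokens = true;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> paragraphs = Arrays.asList(
                new String[] { "A", "B", "C" },
                new String[] { "D", "E", "F" });
        // prints [0, 1, 2, 13, 14, 15]
        System.out.println(positions(paragraphs, 10));
    }
}
```

In this model, C and D are no longer adjacent, so an exact phrase that needs consecutive positions cannot span the paragraph boundary; the gap should be chosen larger than any slop you intend to allow.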
