Can Solr highlighting indicate the position or offset of returned fragments in the source field?

Background

I am using Solr 4.0.0. I have indexed the text of a set of sample documents and enabled term vectors so that I can use the FastVectorHighlighter:

<field name="raw_text" type="text_en" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" /> 

For highlighting, I use the BreakIterator boundary scanner with SENTENCE boundaries:

 <boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
   <lst name="defaults">
     <!-- type should be one of CHARACTER, WORD(default), LINE and SENTENCE -->
     <str name="hl.bs.type">SENTENCE</str>
   </lst>
 </boundaryScanner>

I issue a simple query:

 http://localhost:8983/solr/documents/select?q=raw_text%3AArtibonite&wt=xml&hl=true&hl.fl=raw_text&hl.useFastVectorHighlighter=true&hl.snippets=100&hl.boundaryScanner=breakIterator 

The highlighting works quite well:

 <response>
   ...
   <result name="response" numFound="5" start="0">
     <doc>
       <str name="id">-1071691270</str>
       <str name="raw_text"> Final Report of the Independent Panel of Experts on the Cholera Outbreak in Haiti Dr. Alejando Cravioto (Chair) International Center for Diarrhoeal Disease Research, Dhaka, Bangladesh Dr. Claudio F. Lanata Instituto de Investigación Nutricional, and The US Navy Medical Research Unit 6, Lima, Peru Engr. Daniele S. Lantagne Harvard University... ~SNIP~ </str>
     </doc>
   <lst name="highlighting">
     <lst name="-1071691270">
       <arr name="raw_text">
         ...
         <str> The timeline suggests that the outbreak spread along the <em>Artibonite</em> River. After establishing that the cases began in the upper reaches of the Artibonite River, potential sources of contamination that could have initiated the outbreak were investigated. </str>
         ...
       </arr>
     </lst>
   </lst>

Problem

I want to send the returned sentences on for further processing (entity extraction, etc.), but I also need to track the start and end offsets of each highlighted sentence within the original (long) text field. Is there an easy way to do this?

Would it be better to set hl.fragsize so that the entire field is returned, and then find and extract the sentences of interest myself?

1 answer

Solr cannot return offset information for fragments along with the highlighting results unless you do some customization.

You have several options:

1) You can extend the Solr highlighter by creating a custom Formatter that encodes the offset information into the returned string. The TokenGroup that is passed to the Formatter for each term contains the offset and position information. If your formatter returned <span data-offset=X>text</span> or something similar, that would be one way to do it. This is not the easiest route.

2) As you suggested, return the entire field by setting hl.fragsize=0.

3) Use the TermVectorComponent in an additional query and match the offset / position information it returns against the highlighted fragments.
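
Not one of the three options above, but in the same spirit as matching fragments against the source: a purely client-side sketch. It assumes the stored raw_text is returned with the document (as in the response shown in the question), that the highlight tags are the plain <em> / </em> pair seen there, and that the fragment text is otherwise an unmodified slice of the field (no HTML encoder changing characters). The class and method names are made up for illustration and are not part of Solr.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Solr API): maps a highlighted fragment back to
// character offsets in the stored field value.
class FragmentLocator {
    // Assumes the highlighter wraps hits in <em>...</em>, as in the question's output.
    private static final Pattern HL_TAGS = Pattern.compile("</?em>");

    /**
     * Returns {start, end} (end exclusive) of the fragment inside source,
     * searching from fromIndex, or null if the fragment cannot be found.
     */
    static int[] locate(String source, String fragment, int fromIndex) {
        // Remove the highlight markup and the padding whitespace of the response.
        String plain = HL_TAGS.matcher(fragment).replaceAll("").trim();
        if (plain.isEmpty()) {
            return null;
        }
        int start = source.indexOf(plain, fromIndex);
        return start < 0 ? null : new int[] {start, start + plain.length()};
    }

    public static void main(String[] args) {
        String rawText = "Cases rose in October. The outbreak spread along the "
                + "Artibonite River. Sources were investigated.";
        String fragment = " The outbreak spread along the <em>Artibonite</em> River. ";
        int[] span = locate(rawText, fragment, 0);
        if (span != null) {
            System.out.println(span[0] + ".." + span[1] + ": "
                    + rawText.substring(span[0], span[1]));
        }
    }
}
```

If the same sentence can occur more than once in a document, pass the end offset of the previous match as fromIndex so that repeated fragments are resolved in order.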

If you are going to do your own fragmenting anyway, the best solution is probably either to do all of the fragmenting in Solr or to handle it all yourself. In addition, you could write your own BoundaryScanner implementation in Java, so that your domain knowledge about entity extraction is applied when the fragments are broken up.
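
To give an idea of what "handle it all yourself" could look like, here is a minimal sketch that takes the full field text (for example, as returned with hl.fragsize=0 or from the stored field) and splits it into sentences with java.text.BreakIterator, keeping the offsets of each sentence that contains the search term. The class name, the Span type and the plain case-sensitive contains() check are assumptions for illustration; the check does not reproduce Solr's analysis chain (lowercasing, stemming), and the sentence breaks are not guaranteed to match the ones Solr's boundary scanner produces.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical client-side helper (not a Solr API).
class SentenceOffsets {
    /** A sentence together with its [start, end) character offsets in the source. */
    static final class Span {
        final int start;
        final int end;
        final String text;

        Span(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Returns every sentence of source that contains term, with its offsets. */
    static List<Span> sentencesContaining(String source, String term) {
        List<Span> hits = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(source);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = source.substring(start, end);
            // Simplification: exact substring match instead of analyzed-term match.
            if (sentence.contains(term)) {
                hits.add(new Span(start, end, sentence));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String rawText = "Cases rose in October. The outbreak spread along the "
                + "Artibonite River. Sources were investigated.";
        for (Span s : sentencesContaining(rawText, "Artibonite")) {
            System.out.println(s.start + ".." + s.end + ": " + s.text.trim());
        }
    }
}
```

Note that a sentence span produced this way includes any trailing whitespace up to the start of the next sentence, so trim the text (and adjust the end offset) if the downstream entity extractor is sensitive to that.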
