Since all the other answers are completely irrelevant, I'll post my own:
I ran into the same problem as the OP (Solr 4.3.0, custom config, custom schema, etc.; I'm not a newbie or anything).
This was my ERH (ExtractingRequestHandler) config:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">false</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
Basically, it was set up to ignore everything except the content (which I believe is reasonable for many people).
After a thorough investigation, I found out that
<str name="captureAttr">false</str>
was the cause of the OP's problem. By default it is turned on, but I turned it off because I did not need the attributes anyway. And that was my mistake. I have no idea why, but this causes Solr to put the extracted attributes into the fmap.content
field together with the extracted text.
So the solution is to turn it back on. The final ERH config:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>
    <str name="fmap.content">text</str>
    <str name="captureAttr">true</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
Now only the extracted text is placed in the fmap.content
field.
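If you want to compare both settings without editing solrconfig.xml each time, request parameters should override the handler's defaults, so captureAttr can be flipped per request. Below is a rough sketch that only builds the two URLs; the base URL, the document id and the helper name build_extract_url are my own assumptions for illustration, not something from the setup above.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, capture_attr):
    """Build an /update/extract URL that sets captureAttr for a single request.

    base_url and doc_id are placeholders; adjust them to your own Solr instance.
    """
    params = {
        "literal.id": doc_id,  # unique key for the test document
        "captureAttr": "true" if capture_attr else "false",
        "commit": "true",  # make the document searchable right away
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # assumed default local install
    for flag in (True, False):
        print(build_extract_url(base, "test-doc", flag))
```

POST the same HTML file to each URL and then look at the stored text field for test-doc; with the config above, I would expect the attribute values to show up in it only in the captureAttr=false run.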
Unfortunately, I did not find a single piece of documentation that explains this. Either it's a bug or just plain silly behavior.