What version of Solr are you using? If you use Solr 4.0 or higher, Tika is already built in. Tika is hooked into Solr through Solr Cell's 'ExtractingRequestHandler' class, which is configured in solrconfig.xml as follows:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
By default, as the configuration above shows, any field extracted from an HTML document that is not declared in schema.xml gets the prefix 'ignored_', that is, it is mapped to the 'ignored_*' dynamic field in schema.xml. The default schema.xml declares the following dynamic fields:
<dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
<dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>
<dynamicField name="*_tf" type="tfloat" indexed="true" stored="true"/>
<dynamicField name="*_td" type="tdouble" indexed="true" stored="true"/>
<dynamicField name="*_tdt" type="tdate" indexed="true" stored="true"/>
<dynamicField name="*_pi" type="pint" indexed="true" stored="true"/>
<dynamicField name="*_c" type="currency" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="random_*" type="random" />
</fields>
And this is how the "ignored" field type is defined:
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
So, by default, Solr Cell places the metadata extracted by Tika in 'ignored_*' fields, which means it is neither indexed nor stored. To index and store the metadata, either change uprefix to attr_, or create specific fields or a dynamic field for the metadata you know about and handle it as you wish.
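As a rough sketch of the second option, schema.xml could declare explicit fields for the metadata you care about. The field names below are assumptions for illustration only (the names Tika actually emits depend on the file type), and the field types are the ones found in the default example schema:

```xml
<!-- schema.xml: explicit fields for metadata you want to keep.
     The names are hypothetical examples. With lowernames=true,
     Tika's "Content-Type" arrives as "content_type" and "Author"
     as "author". -->
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
```

Since uprefix only applies to fields that are not defined in the schema, fields declared like this should be indexed and stored under their own names even if uprefix is left at ignored_.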
So, here is the corrected request handler entry in solrconfig.xml, using the first option:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>