How to extract meta tags from HTML files and index them in SOLR and TIKA

Question

I am trying to extract the meta tags of HTML files and index them in Solr using its Tika integration. Tika does not extract these meta tags, so they never show up in Solr.

My HTML file is as follows.

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta name="product_id" content="11"/>
    <meta name="assetid" content="10001"/>
    <meta name="title" content="title of the article"/>
    <meta name="type" content="0xyzb"/>
    <meta name="category" content="article category"/>
    <meta name="first" content="details of the article"/>
    <h4>title of the article</h4>
    <p class="link"><a href="#link">How cite the Article</a></p>
    <p class="list">
    <span class="listterm">Length: </span>13 to 15 feet<br>
    <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
    <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
    <span class="listterm">Diet: </span>leaves and branches of trees<br>
    <span class="listterm">Number of Young: </span>1<br>
    <span class="listterm">Home: </span>Sahara<br>
    </p>
    </p>

My data-config.xml file is as follows

    <dataConfig>
      <dataSource name="bin" type="BinFileDataSource" />
      <document>
        <entity name="f" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="/path/to/html/files/" fileName=".*html|xml"
                onError="skip" recursive="false">
          <field column="fileAbsolutePath" name="path" />
          <field column="fileSize" name="size"/>
          <field column="file" name="filename"/>
          <entity name="tika-test" dataSource="bin"
                  processor="TikaEntityProcessor"
                  url="${f.fileAbsolutePath}" format="text" onError="skip">
            <field column="product_id" name="product_id" meta="true"/>
            <field column="assetid" name="assetid" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="type" name="type" meta="true"/>
            <field column="first" name="first" meta="true"/>
            <field column="category" name="category" meta="true"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

In my schema.xml file, I added the following fields.

    <field name="product_id" type="string" indexed="true" stored="true"/>
    <field name="assetid" type="string" indexed="true" stored="true" />
    <field name="title" type="string" indexed="true" stored="true"/>
    <field name="type" type="string" indexed="true" stored="true"/>
    <field name="category" type="string" indexed="true" stored="true"/>
    <field name="first" type="text_general" indexed="true" stored="true"/>

In my solrconfig.xml file, I added the following code.

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" />
    <lst name="defaults">
      <str name="config">/path/to/data-config.xml</str>
    </lst>

Does anyone know how to extract these meta tags from HTML files with Tika and index them in Solr? Any help would be appreciated.

solr solr4 apache-tika data-import

Anand khatri Feb 21 '13 at 15:25

2 answers

Alexandre Rafalovitch · Answer 1 · 2013-02-21T16:16:23+0000

I do not think meta="true" means what you think it means. It usually refers to things about the file rather than its content: content type and so on. Possibly http-equiv gets mapped as well.

Other than that, you need to extract the actual content. You can do this by using format="xml", nesting an inner entity that uses XPathEntityProcessor, and mapping the paths there. Even then you are limited because, AFAIK, DIH uses DefaultHtmlMapper, which is extremely restrictive in what it lets through: it skips most "class" and "id" attributes and even elements like "div". You can read the list of allowed elements and attributes yourself in the source code.
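The nested-entity approach described above could be sketched roughly as follows. This is an untested sketch, not a verified configuration: the `fld` data source and the entity names `tika` and `meta` are made up for illustration, it assumes Tika's XHTML output carries the document metadata as `meta` elements under `/html/head`, and it assumes that XPathEntityProcessor's limited XPath subset accepts an attribute predicate followed by an attribute selector. Check both assumptions against your Solr version before relying on it.

```xml
<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <!-- Assumption: a second data source that reads XML out of a parent entity's field -->
  <dataSource name="fld" type="FieldReaderDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="/path/to/html/files/" fileName=".*html|xml"
            onError="skip" recursive="false">
      <field column="fileAbsolutePath" name="path" />
      <!-- format="xml" asks Tika for XHTML instead of plain text; it lands in the "text" column -->
      <entity name="tika" dataSource="bin" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="xml" onError="skip">
        <!-- Inner entity parses that XHTML and maps paths to Solr fields -->
        <entity name="meta" dataSource="fld" processor="XPathEntityProcessor"
                dataField="tika.text" forEach="/html" onError="skip">
          <field column="product_id" xpath="/html/head/meta[@name='product_id']/@content" />
          <field column="assetid" xpath="/html/head/meta[@name='assetid']/@content" />
          <field column="category" xpath="/html/head/meta[@name='category']/@content" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The remaining meta tags from the question would be mapped the same way, one `field` element per tag.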

Frankly, your easier path is to write a SolrJ client and manage Tika yourself. Then you can set it to use IdentityHtmlMapper, which passes the HTML through without altering it.

Div tiwari · Answer 2 · 2013-03-19T08:44:27+0000

What version of Solr are you using? If you use Solr 4.0 or higher, then Tika is built into it. Tika is integrated with Solr through Solr Cell's ExtractingRequestHandler class, which is configured in the solrconfig.xml file as follows:

    <!-- Solr Cell Update Request Handler
         http://wiki.apache.org/solr/ExtractingRequestHandler -->
    <requestHandler name="/update/extract" startup="lazy"
                    class="solr.extraction.ExtractingRequestHandler" >
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
        <!-- capture link hrefs but ignore div attributes -->
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
      </lst>
    </requestHandler>

As you can see in the above configuration, by default Solr gives any field extracted from an HTML document that is not declared in schema.xml the prefix 'ignored_', that is, such fields are mapped to the 'ignored_*' dynamic field in schema.xml. The default schema.xml reads as follows:

    <!-- some trie-coded dynamic fields for faster range queries -->
    <dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
    <dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>
    <dynamicField name="*_tf" type="tfloat" indexed="true" stored="true"/>
    <dynamicField name="*_td" type="tdouble" indexed="true" stored="true"/>
    <dynamicField name="*_tdt" type="tdate" indexed="true" stored="true"/>
    <dynamicField name="*_pi" type="pint" indexed="true" stored="true"/>
    <dynamicField name="*_c" type="currency" indexed="true" stored="true"/>
    <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
    <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="random_*" type="random" />
    <!-- uncomment the following to ignore any fields that don't already match
         an existing field name or dynamic field, rather than reporting them as
         an error. alternately, change the type="ignored" to some other type
         eg "text" if you want unknown fields indexed and/or stored by default -->
    <!--dynamicField name="*" type="ignored" multiValued="true" /-->
    </fields>

The "ignored" field type is defined as follows:

    <!-- since fields of this type are by default not stored or indexed,
         any data added to them will be ignored outright. -->
    <fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

So the metadata retrieved by Tika is, by default, placed in "ignored" fields by Solr Cell, and is therefore neither indexed nor stored. To index and store the metadata, either change uprefix to attr_, or create specific fields or a dynamic field for your known metadata and process it as you wish.

So, here is the corrected solrconfig.xml file:

    <!-- Solr Cell Update Request Handler
         http://wiki.apache.org/solr/ExtractingRequestHandler -->
    <requestHandler name="/update/extract" startup="lazy"
                    class="solr.extraction.ExtractingRequestHandler" >
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="uprefix">attr_</str>
        <!-- capture link hrefs but ignore div attributes -->
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
      </lst>
    </requestHandler>
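The other option mentioned above, a dedicated dynamic field for metadata, might look like the following. This is only a minimal sketch: the `meta_` prefix and the `meta_*` dynamic field are names made up for illustration, and it assumes the `string` field type from the stock example schema is available.

```xml
<!-- schema.xml: hypothetical catch-all for extracted metadata not declared explicitly -->
<dynamicField name="meta_*" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- solrconfig.xml: route undeclared extracted fields to the meta_* dynamic field -->
<requestHandler name="/update/extract" startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">meta_</str>
  </lst>
</requestHandler>
```

Since the prefix applies only to fields that are not declared in schema.xml, explicitly declared fields such as product_id or assetid should keep their own names, and only the remaining metadata would end up under meta_*.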
