How can I index .html files in SOLR

Question

How can I index .html files in SOLR

The files I want to index are already stored on the server (I do not need to crawl them).

/path/to/files/

Sample HTML file:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta name="product_id" content="11"/>
    <meta name="assetid" content="10001"/>
    <meta name="title" content="title of the article"/>
    <meta name="type" content="0xyzb"/>
    <meta name="category" content="article category"/>
    <meta name="first" content="details of the article"/>
    <h4>title of the article</h4>
    <p class="link"><a href="#link">How cite the Article</a></p>
    <p class="list">
      <span class="listterm">Length: </span>13 to 15 feet<br>
      <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
      <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
      <span class="listterm">Diet: </span>leaves and branches of trees<br>
      <span class="listterm">Number of Young: </span>1<br>
      <span class="listterm">Home: </span>Sahara<br>
    </p>
    </p>

I added a request handler to the solrconfig.xml file.

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">/path/to/data-config.xml</str>
      </lst>
    </requestHandler>

My data-config.xml looks like this:

    <dataConfig>
      <dataSource type="FileDataSource" />
      <document>
        <entity name="f" processor="FileListEntityProcessor"
                baseDir="/path/to html/files/" fileName=".*html"
                recursive="true" rootEntity="false" dataSource="null">
          <field column="plainText" name="text"/>
        </entity>
      </document>
    </dataConfig>

I kept the default schema.xml file and added the following snippet to it.

    <field name="product_id" type="string" indexed="true" stored="true"/>
    <field name="assetid" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="string" indexed="true" stored="true"/>
    <field name="type" type="string" indexed="true" stored="true"/>
    <field name="category" type="string" indexed="true" stored="true"/>
    <field name="first" type="text_general" indexed="true" stored="true"/>
    <uniqueKey>assetid</uniqueKey>

When I ran a full import after setting this up, the status showed that all the HTML files had been fetched. But when I searched in Solr, no results came back. Does anyone have an idea what could be causing this?

As I understand it, all the files are being read correctly but are not being indexed in Solr. Does anyone know how I can index these meta tags and the contents of the HTML files in Solr?

Your answer will be appreciated.

+5

solr solr4 full-text-indexing data-import dataimporthandler

Anand khatri Feb 05 '13 at 15:50


4 answers

Jayendra · Answer 1 · 2013-02-06T04:28:56+0000

You can use Solr's ExtractingRequestHandler: post the HTML file to Solr and let it extract the contents from the file.

Solr uses Apache Tika to extract the content from the uploaded HTML file.
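
As a rough illustration of what such a request could look like, here is a minimal Java sketch. It assumes the handler is registered at /update/extract (the path used in the stock example configuration), that Solr is reachable at a base URL such as http://localhost:8983/solr, and that Java 11 or later is available. The class and method names are made up for this example. The `literal.<field>` parameters set field values directly, which is one way to supply the required assetid.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr or SolrJ.
class ExtractUrlBuilder {

    /**
     * Builds the request URL for the ExtractingRequestHandler.
     * solrBase is an assumed base URL, e.g. "http://localhost:8983/solr".
     */
    static String buildExtractUrl(String solrBase, Map<String, String> literals, boolean commit) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : literals.entrySet()) {
            // literal.<field>=<value> sets a field value on the indexed document
            query.add("literal." + encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        if (commit) {
            query.add("commit=true");
        }
        return solrBase + "/update/extract?" + query;
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /** Sends the file as the POST body and returns the HTTP status code. */
    static int postFile(String url, Path file) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "text/html")
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) {
        Map<String, String> literals = new LinkedHashMap<>();
        literals.put("assetid", "10001");
        // Only prints the URL; call postFile(url, path) against a running Solr to index.
        System.out.println(buildExtractUrl("http://localhost:8983/solr", literals, true));
    }
}
```

Whether the meta tags end up in the matching schema fields depends on how the handler is configured, so check a posted document before relying on it.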

Nutch with Solr is a broader solution if you want to crawl websites and index them.
The Nutch with Solr tutorial will help you get started.

Chris warner · Answer 2 · 2013-02-05T18:58:06+0000

Did you mean to have fileName="*.html" in your config.xml file? You currently have fileName=".*html".
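
One point worth checking here: the DataImportHandler documentation describes the fileName attribute of FileListEntityProcessor as a regular expression, not a shell glob. Assuming Java regex semantics apply, the small sketch below compares the two spellings. The class name is made up for illustration, and it uses a full match, which may differ from how Solr applies the pattern internally.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper for experimenting with fileName patterns.
class FileNamePatternCheck {

    /** Returns true if the whole file name matches the regular expression. */
    static boolean matches(String regex, String fileName) {
        return Pattern.matches(regex, fileName);
    }

    /** Returns true if the string compiles as a Java regular expression. */
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The pattern from the question, read as a regex
        System.out.println(matches(".*html", "giraffe.html"));
        // A stricter variant that requires a literal dot before "html"
        System.out.println(matches(".*\\.html", "giraffe.html"));
        // The glob-style spelling is not a valid regex (leading '*')
        System.out.println(isValidRegex("*.html"));
    }
}
```

Under that assumption, ".*html" does match names ending in "html", while "*.html" is rejected as a pattern.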

I doubt Solr will know how to translate the meta fields in your HTML into index fields, though I have not tried it.

I have written programs that read (X)HTML (using XPath) and produce a formatted XML file to send to /update. At that point, you could also use the DataImportHandler to import the formatted XML file(s).
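
A minimal sketch of that idea in Java follows. It is not the program described above: it swaps XPath for a simple regular expression, because the sample file in the question is not well-formed XML, and it only handles meta tags written as name="..." followed by content="...". The class and method names are hypothetical. It does not decode HTML entities in the values. The output uses the standard Solr XML update format (add, doc, field).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical converter from HTML meta tags to a Solr XML update message.
class MetaToSolrXml {

    // Matches <meta name="..." content="..."> in that attribute order only.
    private static final Pattern META =
            Pattern.compile("<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"\\s*/?>");

    /** Collects name/content pairs from the meta tags, in document order. */
    static Map<String, String> extractMeta(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    /** Renders the fields as an update message containing a single document. */
    static String toSolrXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    // Escapes the characters that are special in XML text and attributes.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String html = "<meta name=\"assetid\" content=\"10001\"/>"
                + "<meta name=\"title\" content=\"title of the article\"/>";
        System.out.println(toSolrXml(extractMeta(html)));
    }
}
```

The resulting string can be posted to /update with a text/xml content type, or written to a file for a later import.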

user1050755 · Answer 3 · 2017-10-15T07:13:50+0000

Here is a complete example of converting HTML to text and extracting the corresponding metadata:

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertNull;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.junit.Test;

    import java.io.ByteArrayInputStream;

    public class ConversionTest {

        @Test
        public void testHtmlToTextConversion() throws Exception {
            ByteArrayInputStream bais = new ByteArrayInputStream(("<html>\n" +
                    "<head>\n" +
                    "<title> \n" +
                    " A Simple HTML Document\n" +
                    "</title>\n" +
                    "</head>\n" +
                    "<body></div>\n" +
                    "<p>This is a very simple HTML document</p>\n" +
                    "<p>It only has two paragraphs</p>\n" +
                    "</body>\n" +
                    "</html>").getBytes());
            BodyContentHandler contenthandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            AutoDetectParser parser = new AutoDetectParser();
            parser.parse(bais, contenthandler, metadata, new ParseContext());
            assertEquals("\nThis is a very simple HTML document\n" +
                    "\n" +
                    "It only has two paragraphs\n" +
                    "\n", contenthandler.toString().replace("\r", ""));
            assertEquals("A Simple HTML Document", metadata.get("title"));
            assertEquals("A Simple HTML Document", metadata.get("dc:title"));
            assertNull(metadata.get("title2"));
            assertEquals("org.apache.tika.parser.DefaultParser", metadata.getValues("X-Parsed-By")[0]);
            assertEquals("org.apache.tika.parser.html.HtmlParser", metadata.getValues("X-Parsed-By")[1]);
            assertEquals("ISO-8859-1", metadata.get("Content-Encoding"));
            assertEquals("text/html; charset=ISO-8859-1", metadata.get("Content-Type"));
        }
    }

l0pan · Answer 4 · 2015-12-06T18:57:37+0000

The easiest way is to use the post tool from the bin directory. It does all the work automatically. Here is an example:

./post -c conf1 /path/to/files/*

