Text content without metadata from Tika via SolrCell

Question

Using Solr 3.6 and the ExtractingRequestHandler (aka Tika), is it possible to map only the text content of a PDF to a field, without the metadata? The "content" field produced by Tika unfortunately contains all the metadata mixed in with the text content of the document.

I would like to show highlighted snippets of the content, and the metadata inside the content field skews the highlighting results.

UPDATE: Screenshot of the Tika output as indexed by Solr. The highlighted part is the metadata block that gets added, as a chunk of text, to the content of the PDF.

[Image: Solr screenshot of Tika output]

ExtractingRequestHandler in solrconfig.xml:

  <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>

Fields in schema.xml. Note: "content" receives Tika's content output directly. The "page" and "collection" fields are set with literal values when the document is sent to the handler.

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="subject" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="collection" type="text_general" indexed="true" stored="true"/>
  <field name="page" type="tint" indexed="true" stored="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

+4

solr apache-tika solr-cell

Peaeater Jun 04 '12 at 21:43

4 answers

Solr with Tika creates separate fields for the content and the metadata.

If you use the standard ExtractingRequestHandler configuration:

  <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <!-- All the main content goes into "text"... if you need to return the extracted text or do highlighting, use a stored field. -->
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

The fmap.content parameter maps the extracted content to the text field, which should then hold only the contents of your PDF.

The other metadata fields can easily be inspected by modifying schema.xml.

Set stored to true for the ignored field type:

 <fieldtype name="ignored" stored="true" indexed="false" multiValued="true" class="solr.StrField" />

Capture all fields:

  <dynamicField name="*" type="ignored" multiValued="true" />

Tika adds many fields for the metadata, and the content is set separately. For example, this is the response after sending a PPT file to the extraction handler:

  <doc>
    <arr name="application_name"><str>Microsoft PowerPoint</str></arr>
    <str name="category">POT - US</str>
    <str name="comments">version 1.1</str>
    <arr name="company"><str> </str></arr>
    <arr name="content_type"><str>application/vnd.ms-powerpoint</str></arr>
    <arr name="creation_date"><str>2000-03-15T16:57:27Z</str></arr>
    <arr name="custom_delivery_date"><str> </str></arr>
    <arr name="custom_docid"><str> </str></arr>
    <arr name="custom_docidinslide"><str>true</str></arr>
    <arr name="custom_docidintitle"><str>true</str></arr>
    <arr name="custom_docidposition"><str>0</str></arr>
    <arr name="custom_event"><str> </str></arr>
    <arr name="custom_final"><str>false</str></arr>
    <arr name="custom_mckpapersize"><str>US</str></arr>
    <arr name="custom_notespagelayout"><str>Lower</str></arr>
    <arr name="custom_title"><str>Lower Universal Template US</str></arr>
    <arr name="custom_universal_objects"><str>true</str></arr>
    <arr name="edit_time"><str>284587970000</str></arr>
    <str name="id">101</str>
    <arr name="ignored_">
      <str>slideShow</str>
      <str>slide</str>
      <str>slide</str>
      <str>slideNotes</str>
    </arr>
    <str name="keywords">test</str>
    <arr name="last_author"><str>Corporate</str></arr>
    <arr name="last_printed"><str>2000-03-17T20:28:57Z</str></arr>
    <arr name="last_save_date"><str>2009-03-24T16:52:26Z</str></arr>
    <arr name="manager"><str> </str></arr>
    <arr name="meta">
      <str>stream_source_info</str> <str>file:/C:/temp/nuggets/100000.ppt</str>
      <str>Last-Author</str> <str>Corporate</str>
      <str>Slide-Count</str> <str>2</str>
      <str>custom:DocIDPosition</str> <str>0</str>
      <str>Application-Name</str> <str>Microsoft PowerPoint</str>
      <str>custom:Delivery Date</str> <str> </str>
      <str>custom:Event</str> <str> </str>
      <str>Edit-Time</str> <str>284587970000</str>
      <str>Word-Count</str> <str>120</str>
      <str>Creation-Date</str> <str>2000-03-15T16:57:27Z</str>
      <str>stream_size</str> <str>181248</str>
      <str>Manager</str> <str> </str>
      <str>stream_name</str> <str>100000.ppt</str>
      <str>Company</str> <str> </str>
      <str>Keywords</str> <str>test</str>
      <str>Last-Save-Date</str> <str>2009-03-24T16:52:26Z</str>
      <str>Revision-Number</str> <str>91</str>
      <str>Last-Printed</str> <str>2000-03-17T20:28:57Z</str>
      <str>Comments</str> <str>version 1.1</str>
      <str>Template</str> <str> </str>
      <str>custom:PaperSize</str> <str>US</str>
      <str>custom:DocID</str> <str> </str>
      <str>xmpTPg:NPages</str> <str>2</str>
      <str>custom:NotesPageLayout</str> <str>Lower</str>
      <str>custom:DocIDinSlide</str> <str>true</str>
      <str>Category</str> <str>POT - US</str>
      <str>custom:Universal Objects</str> <str>true</str>
      <str>custom:Final</str> <str>false</str>
      <str>custom:DocIDinTitle</str> <str>true</str>
      <str>Content-Type</str> <str>application/vnd.ms-powerpoint</str>
      <str>custom:Title</str> <str>test</str>
    </arr>
    <arr name="p">
      <str>slide-content</str>
      <str>slide-content</str>
    </arr>
    <arr name="revision_number"><str>91</str></arr>
    <arr name="slide_count"><str>2</str></arr>
    <arr name="stream_name"><str>100000.ppt</str></arr>
    <arr name="stream_size"><str>181248</str></arr>
    <arr name="stream_source_info"><str>file:/C:/temp/test/100000.ppt</str></arr>
    <arr name="template"><str> </str></arr>
    <!-- Content field -->
    <arr name="text"><str>test Test test test test tes t</str></arr>
    <arr name="title"><str>test</str></arr>
    <arr name="word_count"><str>120</str></arr>
    <arr name="xmptpg_npages"><str>2</str></arr>
  </doc>

+1

Jayendra Jun 06 '12 at 7:09

I no longer have the problem described above. Since asking the question, I have upgraded to Solr 4.0 alpha and recreated schema.xml from the Solr Cell example that ships with 4.0a. I suspect that my original schema was copying the contents of the metadata fields into a text field, so it was most likely my own mistake.
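
For illustration only, the kind of schema.xml rule that could produce this symptom is a copyField directive that funnels metadata fields into the field holding the body text. The sketch below is hypothetical and is not the original schema; it reuses the "title", "subject" and "content" field names from the question and assumes such rules were present:

```xml
<!-- Hypothetical rules, not taken from the original schema. -->
<!-- Each rule adds the value of a metadata field to "content" at index time, -->
<!-- so stored "content" would hold metadata alongside the extracted text. -->
<copyField source="title" dest="content"/>
<copyField source="subject" dest="content"/>
```

If rules like these exist, removing them (or pointing them at a separate catch-all search field) should keep the content field limited to the extracted text.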

0

Peaeater Aug 1 '12 at 16:50

In the solrconfig.xml file, where the request handler is defined, add the following line:

 <str name="fmap.title">ignored_</str>

This tells Tika to simply ignore the title attribute (or whichever attributes you want to ignore) that it finds embedded in the PDF.

0

Craig Oct 9 '12 at 17:43

illegal-immigrant · Accepted Answer · 2014-02-20T14:02:08+0000

Since all the other answers are completely irrelevant, I'll post mine:

I ran into exactly the same problem as the OP (Solr 4.3.0, custom configuration, custom schema, etc.; I'm not a newbie or anything).

This was my ERH (ExtractingRequestHandler) config:

  <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="uprefix">ignored_</str>
      <str name="fmap.a">ignored_</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.content">text</str>
      <str name="captureAttr">false</str>
      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
    </lst>
  </requestHandler>

Basically, it was set up to ignore everything except the content (which I believe is reasonable for many people).
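
For the ignored_ mappings to actually discard data, schema.xml needs a matching catch-all for that prefix. A minimal sketch, assuming definitions along the lines of the Solr example schema (my own schema is not shown here, so treat the exact attributes as an assumption):

```xml
<!-- Assumed definitions, modelled on the Solr example schema. -->
<!-- A field type that neither indexes nor stores its values. -->
<fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField"/>

<!-- Anything prefixed with "ignored_" (via uprefix or fmap.*) lands here and is dropped. -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```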

After a thorough investigation, I found out that

 <str name="captureAttr">false</str>

was what caused the OP's problem. It is enabled in the example configuration, but I turned it off because I did not need it. And that was my mistake. I have no idea why, but turning it off causes Solr to put the extracted attributes into the fmap.content field together with the extracted text.

So the solution is to turn it back on. The final ERH config:

  <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="uprefix">ignored_</str>
      <str name="fmap.a">ignored_</str>
      <str name="fmap.div">ignored_</str>
      <str name="fmap.content">text</str>
      <str name="captureAttr">true</str>
      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
    </lst>
  </requestHandler>

Now only the extracted text is placed in the fmap.content field.

Unfortunately, I did not find a single piece of documentation that explains this. It is either a bug or just plain silly behavior.
