Input arbitrary xml to solr

Question

I have a question about Apache Solr. If I have an arbitrary XML file and an XSD that it conforms to, how can I index it in Solr? Can I get some sample code? I know that you need to parse the XML and put the relevant data into a Solr document, but I don't understand how to do it.

xml solr

John doe Jun 03 '12 at 3:41

3 answers

Peaeater · Answer 1 · 2012-06-07T17:05:14+0000

DataImportHandler (DIH) lets you pass incoming XML through an XSL stylesheet, as well as parse and transform the XML with DIH transformers. You can translate your arbitrary XML into Solr's standard update XML format via XSL, or map and transform the arbitrary XML to Solr schema fields directly in the DIH configuration file, or combine the two. DIH is flexible.
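To make the XSL route concrete, here is a minimal sketch of a stylesheet that turns custom XML into Solr's standard update XML. The input shape (records/record with id, title and author children) and the target field names are assumptions made up for illustration, so adjust them to your own XSD and Solr schema.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!--
    Hypothetical input, not taken from the question:
    <records>
      <record><id>1</id><title>Some title</title><author>Some author</author></record>
    </records>
  -->
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Wrap all documents in Solr's <add> element. -->
  <xsl:template match="/records">
    <add>
      <xsl:apply-templates select="record"/>
    </add>
  </xsl:template>

  <!-- Emit one <doc> per <record>. -->
  <xsl:template match="record">
    <doc>
      <field name="id"><xsl:value-of select="id"/></field>
      <!-- The predicate skips elements with no text, so no empty fields are emitted. -->
      <xsl:for-each select="title[normalize-space()]">
        <field name="title_t"><xsl:value-of select="."/></field>
      </xsl:for-each>
      <xsl:for-each select="author[normalize-space()]">
        <field name="creator_t"><xsl:value-of select="."/></field>
      </xsl:for-each>
    </doc>
  </xsl:template>
</xsl:stylesheet>
```

A stylesheet like this would be referenced from the "xsl" attribute of the XPathEntityProcessor entity, together with useSolrAddSchema="true", since its output is already standard Solr update XML.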

dih-config.xml example

Here is a sample dih-config.xml from an actual working site (no pseudo-samples here, my friend). Note that it picks up XML files from a local directory on a LAMP server. If you prefer to post XML files directly via HTTP, you would need to configure a ContentStreamDataSource instead.

It so happens that the incoming XML in this example is already in Solr's standard update XML format, and all the XSL does is remove empty field nodes, while the real transformations, such as building the content of "ispartof_t" from "ignored_seriestitle", "ignored_seriesvolume" and "ignored_seriesissue", are done with DIH regex transformers. (The XSLT is performed first, and its output is then handed to the DIH transformers.) The useSolrAddSchema attribute tells DIH that the XML is already in standard Solr XML format. If that were not the case, "xpath" attributes on the fields of the XPathEntityProcessor entity would be required to select content from the incoming XML document.

<dataConfig>
  <dataSource encoding="UTF-8" type="FileDataSource" />
  <document>
    <!--
      Pickupdir fetches all files matching the filename regex in the supplied directory
      and passes them to other entities which parse the file contents.
    -->
    <entity
      name="pickupdir"
      processor="FileListEntityProcessor"
      rootEntity="false"
      dataSource="null"
      fileName="^[\w\d-]+\.xml$"
      baseDir="/var/lib/tomcat6/solr/cci/import/"
      recursive="true"
      newerThan="${dataimporter.last_index_time}"
    >
      <!--
        Pickupxmlfile parses standard Solr update XML.
        Incoming values are split into multiple tokens when given a splitBy attribute.
        Dates are transformed into valid Solr dates when given a dateTimeFormat to parse.
      -->
      <entity
        name="xml"
        processor="XPathEntityProcessor"
        transformer="RegexTransformer,TemplateTransformer"
        datasource="pickupdir"
        stream="true"
        useSolrAddSchema="true"
        url="${pickupdir.fileAbsolutePath}"
        xsl="xslt/dih.xsl"
      >
        <field column="abstract_t" splitBy="\|" />
        <field column="coverage_t" splitBy="\|" />
        <field column="creator_t" splitBy="\|" />
        <field column="creator_facet" template="${xml.creator_t}" />
        <field column="description_t" splitBy="\|" />
        <field column="format_t" splitBy="\|" />
        <field column="identifier_t" splitBy="\|" />
        <field column="ispartof_t" sourceColName="ignored_seriestitle" regex="(.+)" replaceWith="$1" />
        <field column="ispartof_t" sourceColName="ignored_seriesvolume" regex="(.+)" replaceWith="${xml.ispartof_t}; vol. $1" />
        <field column="ispartof_t" sourceColName="ignored_seriesissue" regex="(.+)" replaceWith="${xml.ispartof_t}; no. $1" />
        <field column="ispartof_t" regex="\|" replaceWith=" " />
        <field column="language_t" splitBy="\|" />
        <field column="language_facet" template="${xml.language_t}" />
        <field column="location_display" sourceColName="ignored_class" regex="(.+)" replaceWith="$1" />
        <field column="location_display" sourceColName="ignored_location" regex="(.+)" replaceWith="${xml.location_display} $1" />
        <field column="location_display" regex="\|" replaceWith=" " />
        <field column="othertitles_display" splitBy="\|" />
        <field column="publisher_t" splitBy="\|" />
        <field column="responsibility_display" splitBy="\|" />
        <field column="source_t" splitBy="\|" />
        <field column="sourceissue_display" sourceColName="ignored_volume" regex="(.+)" replaceWith="vol. $1" />
        <field column="sourceissue_display" sourceColName="ignored_issue" regex="(.+)" replaceWith="${xml.sourceissue_display}, no. $1" />
        <field column="sourceissue_display" sourceColName="ignored_year" regex="(.+)" replaceWith="${xml.sourceissue_display} ($1)" />
        <field column="src_facet" template="${xml.src}" />
        <field column="subject_t" splitBy="\|" />
        <field column="subject_facet" template="${xml.subject_t}" />
        <field column="title_t" sourceColName="ignored_title" regex="(.+)" replaceWith="$1" />
        <field column="title_t" sourceColName="ignored_subtitle" regex="(.+)" replaceWith="${xml.title_t} : $1" />
        <field column="title_sort" template="${xml.title_t}" />
        <field column="toc_t" splitBy="\|" />
        <field column="type_t" splitBy="\|" />
        <field column="type_facet" template="${xml.type_t}" />
      </entity>
    </entity>
  </document>
</dataConfig>
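For the other case, where the incoming XML is not in Solr update format and no XSL is applied, a rough sketch of an entity that selects content with xpath might look like the following. The file path, the records/record input structure and the field names are hypothetical. The forEach attribute marks which element becomes one Solr document, and each field's xpath selects its value.

```xml
<dataConfig>
  <dataSource encoding="UTF-8" type="FileDataSource" />
  <document>
    <!--
      Hypothetical input file:
      <records>
        <record><id>1</id><title>Some title</title><author>Some author</author></record>
      </records>
    -->
    <entity
      name="xml"
      processor="XPathEntityProcessor"
      url="/path/to/records.xml"
      forEach="/records/record"
    >
      <!-- Each xpath picks the value for one Solr schema field. -->
      <field column="id" xpath="/records/record/id" />
      <field column="title_t" xpath="/records/record/title" />
      <field column="creator_t" xpath="/records/record/author" />
    </entity>
  </document>
</dataConfig>
```

Transformers such as RegexTransformer can still be added to an entity like this through the transformer attribute, in the same way as in the sample above.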

To install DIH:

Make sure the DIH jars are referenced from solrconfig.xml, as they are not included by default in the Solr WAR file. One easy way is to create a lib folder in the Solr instance directory that contains the DIH jars, because solrconfig.xml looks in the lib folder for references by default. You will find the DIH jars in the apache-solr-x.x.x/dist folder of the downloaded Solr package.

Create your dih-config.xml (as described above) in the Solr "conf" directory.
Add a DIH request handler to solrconfig.xml if it is not there already.

request handler:

 <requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
     <str name="config">dih-config.xml</str>
   </lst>
 </requestHandler>
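As an alternative to copying the DIH jars into a lib folder, solrconfig.xml can point at them with a lib directive. This is only a sketch: the relative dir value assumes the directory layout of the stock Solr example and will probably need adjusting for your install, and the jar name pattern assumes the apache-solr-dataimporthandler naming used in the dist folder of Solr releases from that period.

```xml
<!-- In solrconfig.xml, inside the <config> element. -->
<!-- dir is resolved relative to the Solr instance directory (assumed layout). -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
```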

To start DIH:

The wiki page on Data Import Handler commands has much more information about full-import and delta-import, and about whether to commit, optimize, etc., but the following will trigger the DIH operation without deleting the existing index first, and will commit the changes after all the files have been processed. With the sample given above, it would collect all the files found in the pickup directory, transform them, index them, and finally commit the updates to the index (making them searchable the instant the commit finishes).

 http://localhost:8983/solr/update/dih?command=full-import&clean=false&commit=true

Persimmonium · Answer 2 · 2012-06-03T06:09:45+0000

The easiest way is to use DataImportHandler. It allows you to apply an XSL first to convert your XML to Solr input XML.

Nicholas DiPiazza · Answer 3 · 2016-01-29T15:43:10+0000

After some research, it seems nothing fully automated does what you ask ... but I think I found something.

Lux may be what we are looking for: http://luxdb.org/SETUP.html

Apparently it takes Solr and makes it Lux-enabled, which lets it index arbitrary XML.
