DataImportHandler (DIH) lets you pass incoming XML through an XSL stylesheet, as well as parse and transform the XML with DIH transformers. You can translate your custom XML to Solr's standard update XML via XSL, map and transform the custom XML to Solr schema fields directly in the DIH configuration file, or combine both. DIH is flexible.
dih-config.xml example
Here is a sample dih-config.xml from an actual working site (no pseudo-samples here, my friend). Note that it picks up XML files from a local directory on the LAMP server. If you prefer to post XML files directly via HTTP, you will need to configure a ContentStreamDataSource instead.
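For the HTTP variant mentioned above, a minimal sketch might look like the following. It is not taken from the working site: the data source name "stream" is made up for illustration, and it assumes the posted XML is in the same Solr update format and goes through the same stylesheet as the file-based sample.

```xml
<dataConfig>
  <!-- Assumption: "stream" is an arbitrary name chosen for this sketch -->
  <dataSource name="stream" type="ContentStreamDataSource" />
  <document>
    <!-- No FileListEntityProcessor and no url attribute here:
         the XML is read from the body of the request sent to the DIH handler -->
    <entity name="xml"
            processor="XPathEntityProcessor"
            dataSource="stream"
            stream="true"
            useSolrAddSchema="true"
            xsl="xslt/dih.xsl">
      <!-- field mappings would go here, as in the file-based sample -->
    </entity>
  </document>
</dataConfig>
```

With this kind of setup the XML document is posted as the body of the import request rather than being picked up from disk; check the DIH documentation for your Solr version before relying on it.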
It so happens that the incoming XML in this sample is already in Solr's standard update XML format, and all the XSL does is remove empty field nodes, while the real transformations, such as building the content of "ispartof_t" from "ignored_seriestitle", "ignored_seriesvolume" and "ignored_seriesissue", are done with DIH Regex transformers. (The XSLT is performed first, and its output is then passed to the DIH transformers.) The useSolrAddSchema attribute tells DIH that the XML is already in standard Solr XML format. If that were not the case, an additional "xpath" attribute would be required on the XPathEntityProcessor fields to select content from the incoming XML document.
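The actual xslt/dih.xsl is not shown here, but a stylesheet that only removes empty field nodes could be sketched like this. It assumes XSLT 1.0 and that "empty" means no text content other than whitespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" />

  <!-- Identity template: copy every node and attribute through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Override for <field> elements with empty or whitespace-only content:
       an empty template body means nothing is written to the output -->
  <xsl:template match="field[not(normalize-space())]" />
</xsl:stylesheet>
```

The second template wins over the identity template for matching fields because a pattern with a predicate has a higher default priority than node().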
<dataConfig>
  <dataSource encoding="UTF-8" type="FileDataSource" />
  <document>
    <entity name="pickupdir"
            processor="FileListEntityProcessor"
            rootEntity="false"
            dataSource="null"
            fileName="^[\w\d-]+\.xml$"
            baseDir="/var/lib/tomcat6/solr/cci/import/"
            recursive="true"
            newerThan="${dataimporter.last_index_time}" >
      <entity name="xml"
              processor="XPathEntityProcessor"
              transformer="RegexTransformer,TemplateTransformer"
              datasource="pickupdir"
              stream="true"
              useSolrAddSchema="true"
              url="${pickupdir.fileAbsolutePath}"
              xsl="xslt/dih.xsl" >
        <field column="abstract_t" splitBy="\|" />
        <field column="coverage_t" splitBy="\|" />
        <field column="creator_t" splitBy="\|" />
        <field column="creator_facet" template="${xml.creator_t}" />
        <field column="description_t" splitBy="\|" />
        <field column="format_t" splitBy="\|" />
        <field column="identifier_t" splitBy="\|" />
        <field column="ispartof_t" sourceColName="ignored_seriestitle" regex="(.+)" replaceWith="$1" />
        <field column="ispartof_t" sourceColName="ignored_seriesvolume" regex="(.+)" replaceWith="${xml.ispartof_t}; vol. $1" />
        <field column="ispartof_t" sourceColName="ignored_seriesissue" regex="(.+)" replaceWith="${xml.ispartof_t}; no. $1" />
        <field column="ispartof_t" regex="\|" replaceWith=" " />
        <field column="language_t" splitBy="\|" />
        <field column="language_facet" template="${xml.language_t}" />
        <field column="location_display" sourceColName="ignored_class" regex="(.+)" replaceWith="$1" />
        <field column="location_display" sourceColName="ignored_location" regex="(.+)" replaceWith="${xml.location_display} $1" />
        <field column="location_display" regex="\|" replaceWith=" " />
        <field column="othertitles_display" splitBy="\|" />
        <field column="publisher_t" splitBy="\|" />
        <field column="responsibility_display" splitBy="\|" />
        <field column="source_t" splitBy="\|" />
        <field column="sourceissue_display" sourceColName="ignored_volume" regex="(.+)" replaceWith="vol. $1" />
        <field column="sourceissue_display" sourceColName="ignored_issue" regex="(.+)" replaceWith="${xml.sourceissue_display}, no. $1" />
        <field column="sourceissue_display" sourceColName="ignored_year" regex="(.+)" replaceWith="${xml.sourceissue_display} ($1)" />
        <field column="src_facet" template="${xml.src}" />
        <field column="subject_t" splitBy="\|" />
        <field column="subject_facet" template="${xml.subject_t}" />
        <field column="title_t" sourceColName="ignored_title" regex="(.+)" replaceWith="$1" />
        <field column="title_t" sourceColName="ignored_subtitle" regex="(.+)" replaceWith="${xml.title_t} : $1" />
        <field column="title_sort" template="${xml.title_t}" />
        <field column="toc_t" splitBy="\|" />
        <field column="type_t" splitBy="\|" />
        <field column="type_facet" template="${xml.type_t}" />
      </entity>
    </entity>
  </document>
</dataConfig>
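For comparison, here is a rough sketch of how the inner entity might look if the incoming XML were not already in Solr's update format. The element names (/records/record, title, creator) are made up for illustration and do not come from the working site; the point is only that useSolrAddSchema is dropped, forEach selects the element that becomes one Solr document, and each field gets its own xpath.

```xml
<dataConfig>
  <dataSource encoding="UTF-8" type="FileDataSource" />
  <document>
    <entity name="pickupdir"
            processor="FileListEntityProcessor"
            rootEntity="false"
            dataSource="null"
            fileName="^[\w\d-]+\.xml$"
            baseDir="/var/lib/tomcat6/solr/cci/import/" >
      <!-- Hypothetical input layout: <records><record>...</record></records> -->
      <entity name="xml"
              processor="XPathEntityProcessor"
              stream="true"
              url="${pickupdir.fileAbsolutePath}"
              forEach="/records/record" >
        <!-- Each field names the path its value is read from -->
        <field column="title_t" xpath="/records/record/title" />
        <field column="creator_t" xpath="/records/record/creator" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Transformers such as RegexTransformer can still be added to the entity in this form, exactly as in the sample above.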
To install DIH:
- Make sure the DIH jars are referenced from solrconfig.xml, as they are not included by default in the Solr WAR file. One easy way is to create a lib folder in the Solr instance directory and put the DIH jars in it, because solrconfig.xml looks in the lib folder for references by default. You will find the DIH jars in the apache-solr-xxx/dist folder of the downloaded Solr package.
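As an alternative to copying the jars into a lib folder, solrconfig.xml can point at them with a lib directive. The sketch below assumes a relative path to the dist folder and the apache-solr-dataimporthandler jar naming used by older Solr releases; both need adjusting to your own layout and version.

```xml
<config>
  <!-- Assumption: the dist folder sits two levels above the instance directory -->
  <lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />

  <!-- ... rest of solrconfig.xml omitted ... -->
</config>
```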
Add the DIH request handler to solrconfig.xml, if it is not already there:
<requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
  </lst>
</requestHandler>
To start DIH:
The wiki description of the Data Import Handler commands has much more information about full-import versus delta-import, and about whether to commit, optimize, and so on, but the following request triggers the DIH operation without deleting the existing index first and commits the changes after all the files have been processed. The sample above collects all the files found in the pickup directory, transforms them, indexes them, and finally commits the update(s) to the index (making them searchable the instant the commit finishes).
http://localhost:8983/solr/update/dih?command=full-import&clean=false&commit=true