How do I exclude everything except text/html from a Heritrix crawl?

On the Heritrix Usecases page there is a use case for "Only successfully save HTML pages".

My problem: I don't know how to implement it in my cxml file. Specifically: "Adding a ContentTypeRegExpFilter to the ARCWriterProcessor => set its regexp parameter to text/html.*". ContentTypeRegExpFilter does not appear in any of the sample cxml files.

2 answers

The use cases you are quoting are somewhat outdated and apply to Heritrix 1.x (filters have since been replaced by decide rules, and the configuration is very different). Nevertheless, the basic concept is the same.

The cxml file is basically a Spring configuration file. You need to set the shouldProcessRule property on the ARCWriter bean to a ContentTypeMatchesRegexDecideRule.

Possible configuration of ARCWriter:

  <bean id="warcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
    <property name="shouldProcessRule">
      <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
        <property name="decision" value="ACCEPT" />
        <property name="regex" value="^text/html.*" />
      </bean>
    </property>
    <!-- Other properties that need to be set ... -->
  </bean>

This makes the processor handle only those items that pass the DecideRule, which in turn passes only those whose content type (MIME type) matches the given regular expression.

Be careful with the "decision" setting. Are you ruling things in or out? (My example rules things in; anything that doesn't match is ruled out.)

Since shouldProcessRule is inherited from Processor, this can be applied to any processor.

More information on configuring Heritrix 3 can be found on the Heritrix 3 wiki (the user guide on crawler.archive.org covers Heritrix 1).


Chris's answer is only half the truth (at least with Heritrix 3.1.x, which I use). A DecideRule returns ACCEPT, REJECT, or NONE. If a rule returns NONE, it has no opinion on the item (comparable to ACCESS_ABSTAIN in Spring Security). ContentTypeMatchesRegexDecideRule (like every other MatchesRegexDecideRule) can be configured to return a decision when a regex matches, via the two properties "decision" and "regex". With decision set to ACCEPT, the rule returns ACCEPT if the regex matches but NONE if it does not. And as we saw, NONE is not an opinion, so shouldProcessRule still evaluates to ACCEPT, because no decision has been made.

So, to archive only responses with a text/html* Content-Type, configure a DecideRuleSequence in which everything is REJECTed by default and only the selected entries are ACCEPTed.

It looks like this:

  <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
    <property name="shouldProcessRule">
      <bean class="org.archive.modules.deciderules.DecideRuleSequence">
        <property name="rules">
          <list>
            <!-- Begin by REJECTing all... -->
            <bean class="org.archive.modules.deciderules.RejectDecideRule" />
            <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
              <property name="decision" value="ACCEPT" />
              <property name="regex" value="^text/html.*" />
            </bean>
          </list>
        </property>
      </bean>
    </property>
    <!-- other properties... -->
  </bean>
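
A possible alternative, as a sketch only: the deciderules package also appears to ship a negated counterpart, ContentTypeNotMatchesRegexDecideRule. Assuming that class is available in your Heritrix 3 version (check before relying on it), a single rule with decision REJECT would return REJECT for every non-matching content type and NONE for text/html, and NONE still lets the processor run:

```xml
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
  <property name="shouldProcessRule">
    <!-- Assumption: ContentTypeNotMatchesRegexDecideRule exists in your Heritrix 3 build -->
    <bean class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
      <property name="decision" value="REJECT" />
      <property name="regex" value="^text/html.*" />
    </bean>
  </property>
  <!-- other properties... -->
</bean>
```

The two-rule sequence is the more explicit form; this variant only saves a level of nesting.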

To avoid downloading images, movies, etc. at all, configure the "scope" bean with a MatchesListRegexDecideRule that REJECTs URLs with well-known file extensions, for example:

  <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
  <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
    <property name="decision" value="REJECT" />
    <property name="listLogicalOr" value="true" />
    <property name="regexList">
      <list>
        <value>.*(?i)(\.(avi|wmv|mpe?g|mp3))$</value>
        <value>.*(?i)(\.(rar|zip|tar|gz))$</value>
        <value>.*(?i)(\.(pdf|doc|xls|odt))$</value>
        <value>.*(?i)(\.(xml))$</value>
        <value>.*(?i)(\.(txt|conf|pdf))$</value>
        <value>.*(?i)(\.(swf))$</value>
        <value>.*(?i)(\.(js|css))$</value>
        <value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
      </list>
    </property>
  </bean>
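
For orientation, here is a sketch of where such a rule sits. It assumes the scope bean is a DecideRuleSequence with the id "scope", as in the sample profile shipped with Heritrix 3. The surrounding rules are elided and will differ in your file, and the regex list is shortened to one entry:

```xml
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- ...the rules already in your profile stay here, in their original order... -->
      <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <property name="decision" value="REJECT" />
        <property name="listLogicalOr" value="true" />
        <property name="regexList">
          <list>
            <value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
          </list>
        </property>
      </bean>
      <!-- ...rules placed after this one can still override its REJECT... -->
    </list>
  </property>
</bean>
```

Because a later rule in the sequence can override an earlier decision, the position of the rule in the list matters.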