Chris's answer is only half the truth (at least with Heritrix 3.1.x, which I use). A DecideRule returns ACCEPT, REJECT, or NONE. If a rule returns NONE, it means the rule has no opinion about the URI (comparable to ACCESS_ABSTAIN in Spring Security). ContentTypeMatchesRegexDecideRule (like every other MatchesRegexDecideRule) can be configured to return a decision when a regular expression matches, via the two properties "decision" and "regex". With "decision" set to ACCEPT, the rule returns ACCEPT if the regex matches but NONE if it does not. And as we saw, NONE is not an opinion, so shouldProcessRule evaluates to ACCEPT because no decision has been made.
So, to archive only responses with a text/html* Content-Type, configure a DecideRuleSequence in which everything is REJECTed by default and only the selected entries are ACCEPTed.
It looks like this:
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
  <property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.DecideRuleSequence">
      <property name="rules">
        <list>
          <!-- Start by rejecting everything -->
          <bean class="org.archive.modules.deciderules.RejectDecideRule" />
          <!-- Then accept only responses whose Content-Type starts with text/html -->
          <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
            <property name="decision" value="ACCEPT" />
            <property name="regex" value="^text/html.*" />
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>
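To see why the leading RejectDecideRule matters, here is a minimal Java sketch of the evaluation logic as I understand it. It is not Heritrix code: the class `DecideSequenceSketch`, the `Decision` enum and the helper methods are hypothetical stand-ins, and it assumes that a sequence keeps the last non-NONE decision and that only an explicit REJECT stops a processor from running.

```java
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical model of the decision logic; none of these names come from Heritrix.
public class DecideSequenceSketch {
    enum Decision { ACCEPT, REJECT, NONE }

    // Stand-in for RejectDecideRule: always has an opinion, and it is REJECT.
    static Function<String, Decision> rejectAll() {
        return contentType -> Decision.REJECT;
    }

    // Stand-in for a MatchesRegexDecideRule-style rule: returns the configured
    // decision on a match and NONE (no opinion) otherwise.
    // Assumption: the whole string must match the regex.
    static Function<String, Decision> matchesRegex(String regex, Decision decision) {
        Pattern pattern = Pattern.compile(regex);
        return contentType ->
                pattern.matcher(contentType).matches() ? decision : Decision.NONE;
    }

    // Assumption: the last rule that returns something other than NONE wins.
    static Decision evaluate(List<Function<String, Decision>> rules, String contentType) {
        Decision result = Decision.NONE;
        for (Function<String, Decision> rule : rules) {
            Decision d = rule.apply(contentType);
            if (d != Decision.NONE) {
                result = d;
            }
        }
        return result;
    }

    // Assumption: a processor is skipped only on an explicit REJECT.
    static boolean shouldProcess(Decision decision) {
        return decision != Decision.REJECT;
    }

    public static void main(String[] args) {
        List<Function<String, Decision>> loneRule =
                List.of(matchesRegex("^text/html.*", Decision.ACCEPT));
        List<Function<String, Decision>> sequence =
                List.of(rejectAll(), matchesRegex("^text/html.*", Decision.ACCEPT));

        for (String contentType : List.of("text/html; charset=UTF-8", "image/png")) {
            Decision lone = evaluate(loneRule, contentType);
            Decision seq = evaluate(sequence, contentType);
            System.out.println(contentType
                    + " | lone rule: " + lone + ", written: " + shouldProcess(lone)
                    + " | sequence: " + seq + ", written: " + shouldProcess(seq));
        }
    }
}
```

With the lone rule, an image/png response yields NONE and is still written. With the sequence, the same response keeps the initial REJECT and is skipped, while text/html responses are ACCEPTed in both cases.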
To avoid downloading images, movies, etc. at all, configure the "scope" bean with a MatchesListRegexDecideRule that REJECTs URLs with well-known file extensions, for example:
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <property name="decision" value="REJECT"/>
  <property name="listLogicalOr" value="true" />
  <property name="regexList">
    <list>
      <value>.*(?i)(\.(avi|wmv|mpe?g|mp3))$</value>
      <value>.*(?i)(\.(rar|zip|tar|gz))$</value>
      <value>.*(?i)(\.(pdf|doc|xls|odt))$</value>
      <value>.*(?i)(\.(xml))$</value>
      <value>.*(?i)(\.(txt|conf|pdf))$</value>
      <value>.*(?i)(\.(swf))$</value>
      <value>.*(?i)(\.(js|css))$</value>
      <value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
    </list>
  </property>
</bean>
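If you want to check which URLs such a list would catch before starting a crawl, the patterns can be tried out with plain java.util.regex. The following is only a sketch: the class `ExtensionRejectSketch` and its method are hypothetical, it copies three of the patterns above, and it assumes the rule requires the whole URL string to match, with listLogicalOr meaning that a single matching pattern is enough.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for trying out the regexes; not part of Heritrix.
public class ExtensionRejectSketch {
    // Three of the patterns from the regexList above, with backslashes
    // escaped for Java string literals.
    static final List<Pattern> REJECT_PATTERNS = List.of(
            Pattern.compile(".*(?i)(\\.(avi|wmv|mpe?g|mp3))$"),
            Pattern.compile(".*(?i)(\\.(rar|zip|tar|gz))$"),
            Pattern.compile(".*(?i)(\\.(bmp|gif|jpe?g|png|svg|tiff?))$"));

    // Logical OR over the list: one full match is enough to reject the URL.
    static boolean isRejected(String url) {
        for (Pattern pattern : REJECT_PATTERNS) {
            if (pattern.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://example.com/index.html",
                "http://example.com/logo.PNG",
                "http://example.com/backup.tar.gz",
                "http://example.com/photo.jpg?size=large");
        for (String url : urls) {
            System.out.println(url + " rejected: " + isRejected(url));
        }
    }
}
```

Note that the patterns are anchored at the end of the URL, so a URL with a query string after the extension (the last example) is not rejected by them.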