Once the crawl is complete, you can use the bin/nutch dump command to dump all the fetched URLs in plain HTML format.
Usage is as follows:
$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>] [-segment <segment>]
  -h,--help                 show this help message
  -mimetype <mimetype>      an optional list of mimetypes to dump, excluding all others. Defaults to all.
  -outputDir <outputDir>    output directory (which will be created) to host the raw data
  -segment <segment>        the segment(s) to use
So for example, you could do something like
$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/
This will create a new directory at the -outputDir location and dump all the crawled pages there in HTML format.
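To check what the dump produced, a small shell helper along these lines may be handy. This is only a sketch: `summarize_dump` is a hypothetical function, not part of Nutch, and it assumes that dumped HTML pages end in `.html` or `.htm`, which may not match your Nutch version's file naming.

```shell
#!/bin/sh
# Hypothetical helper (not a Nutch command): summarize the files under a dump directory.
# Assumption: HTML pages in the dump carry a .html or .htm extension.
summarize_dump() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # Count all regular files at any depth below the directory.
    total=$(find "$dir" -type f | wc -l | tr -d ' ')
    # Count only the files whose names look like HTML pages.
    html=$(find "$dir" -type f \( -name '*.html' -o -name '*.htm' \) | wc -l | tr -d ' ')
    echo "total=$total html=$html"
}
```

After the dump command above finishes, calling `summarize_dump crawl/dump` would print the two counts for that directory.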
There are many other ways to dump specific data from Nutch; see https://wiki.apache.org/nutch/CommandLineOptions
Sujen shah