Where is the crawled data stored after the crawl finishes?

I am new to Nutch. I need to crawl the Internet (say, a few hundred web pages), read the crawled content, and do some analysis.

I followed the tutorial at https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search the text later) and started a crawl using several URLs as seeds, roughly as sketched below.
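
Roughly what I ran, following the tutorial (the seed URL, directory names, and Solr URL below are my local choices, not anything Nutch requires; check bin/crawl's usage message for the exact arguments in your version):

 $ mkdir -p urls
 $ echo "http://example.org/" > urls/seed.txt              # one seed URL per line
 $ bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2    # seedDir crawlDir solrURL numberOfRounds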

Now I cannot find the text/html data on my local machine. Where can I find the data, and what is the best way to read it in text format?

Versions

  • Apache-Nutch-1.9
  • Solr-4.10.4
1 answer

After the crawl is complete, you can use the bin/nutch dump command to dump all of the fetched URLs as plain HTML.

Usage is as follows:

 $ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>] [-segment <segment>]
   -h,--help                show this help message
   -mimetype <mimetype>     an optional list of mimetypes to dump, excluding all others. Defaults to all.
   -outputDir <outputDir>   output directory (which will be created) to host the raw data
   -segment <segment>       the segment(s) to use

So for example, you could do something like

 $ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/ 

This will create the directory given by -outputDir and dump all of the crawled pages there in HTML format.
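
Once the dump finishes, the pages are ordinary files on disk, so standard Unix tools are enough to inspect them (crawl/dump is the output directory from the example above):

 $ find crawl/dump -type f | head                # list some of the dumped files
 $ less "$(find crawl/dump -type f | head -1)"   # view the raw HTML of one page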

There are many other ways to dump specific data from Nutch; see https://wiki.apache.org/nutch/CommandLineOptions for the full list of commands.
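
For example, if you want the parsed plain text rather than raw HTML, the readseg command can dump a single segment's contents into a readable text file. A sketch (the segment timestamp is a placeholder for one of the directories under crawl/segments, and the flags shown suppress everything except the parsed text):

 $ bin/nutch readseg -dump crawl/segments/20150101000000 crawl/readseg-out \
     -nocontent -nofetch -nogenerate -noparse -noparsedata
 $ less crawl/readseg-out/dump    # readseg writes a single text file named "dump"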
