Well, Nutch writes the crawled content in binary form, so if you want it saved in HTML format, you will have to change the code (which will be painful if you are new to Nutch).
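For background: the fetched pages end up as Content records inside Hadoop SequenceFiles under each segment's content directory. Below is a rough sketch of reading them back, assuming a Nutch 1.x segment on the local filesystem and a single part file named part-00000 (adjust the paths for your layout):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentDumper {
    public static void main(String[] args) throws Exception {
        // args[0]: path to one segment, e.g. crawl/segments/20130101123456
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // The raw fetched bytes live under <segment>/content/part-*/data
        Path data = new Path(args[0], Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        try {
            Text url = new Text();
            Content content = new Content();
            while (reader.next(url, content)) {
                // content.getContent() is the raw page (HTML for text/html records)
                System.out.println(url + " : " + content.getContent().length + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}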
If you need a quick and easy solution for getting html pages:
- If the list of pages/URLs you intend to crawl is fairly small, you are better off doing this with a script that calls wget for each URL (a rough sketch follows this list).
- OR use the HTTrack tool.
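Here is a minimal sketch of the wget approach in Java, assuming a file named urls.txt with one URL per line and wget available on the PATH; the file name and the pages/ output directory are placeholders, not anything Nutch provides:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class WgetFetcher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // urls.txt: one URL per line (hypothetical input file)
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));
        for (String url : urls) {
            if (url.trim().isEmpty()) continue;
            // -P pages/ tells wget to save each download under the pages/ directory
            Process p = new ProcessBuilder("wget", "-P", "pages/", url.trim())
                    .inheritIO()
                    .start();
            p.waitFor();
        }
    }
}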
EDIT:
Writing your own Nutch plugin would be great: your problem gets solved, plus you can contribute your work back to the community. If you are new to Nutch (in terms of code and design), you will have to spend a fair amount of time creating a new plugin, but it is still easy enough to do.
A few pointers to get you started:
Here is a page that describes how to write your own Nutch plugin.
Start with Fetcher.java. See lines 647-648. This is where you can get the downloaded content for each URL (for pages that were fetched successfully):
pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
updateStatus(content.getContent().length);
Immediately after that, add code to call your plugin and pass the content object to it. At that point, content.getContent() holds the fetched content for the URL you want. Inside the plugin code, write it out to a file. Base the file name on the URL, otherwise it will be difficult to match files to pages; the URL can be obtained from fit.url. A rough sketch of such a writer follows.
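As a sketch only (HtmlDumpWriter, its save method, and the /tmp/html-dump directory are made-up names for illustration; fit.url is assumed to hold the page URL as a Text), the call added after the output(...) line and the helper it delegates to could look like this:

// In Fetcher.java, right after the output(...) / updateStatus(...) lines shown above:
//   HtmlDumpWriter.save(fit.url.toString(), content.getContent(), "/tmp/html-dump");

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that dumps the raw fetched bytes to disk, one file per URL.
public class HtmlDumpWriter {

    public static void save(String url, byte[] raw, String outDir) throws IOException {
        Path dir = Paths.get(outDir);
        Files.createDirectories(dir);
        // Turn the URL into a safe file name,
        // e.g. http://example.com/a -> http___example.com_a.html
        String fileName = url.replaceAll("[^A-Za-z0-9._-]", "_") + ".html";
        Files.write(dir.resolve(fileName), raw);
    }
}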