Advice on using the Nutch API

I am working on a project where I need a mature crawler, and I am evaluating Nutch for this purpose. My current needs are relatively simple: I need a crawler that can save data to disk, and I need it to be able to recrawl only the updated resources of a site and skip the parts that have already been crawled. Does anyone have experience using the Nutch code directly from Java, rather than through the command line? I would like to start simple: create a crawler (or similar), configure it minimally, and run it, nothing fancy. Is there an example of this, or some resource I should look at? I have gone through the Nutch documentation, but most of it covers the command line, search, and other things. How usable is the Nutch crawler without indexing and search? Any help is appreciated. Thank you.

+6
java web-crawler nutch
2 answers

Nutch is probably quite different from anything you have worked with before. It is more of a framework: it has a front end for query and search (although Solr seems more powerful than Nutch's own search interface), and it also has a crawler and an indexer (which writes to a Lucene index).

If you want to use the crawler for purposes other than search, you will need to develop your own programs and be familiar with Hadoop and MapReduce programming.

I am not sure what you want to do with the crawler, but it does not look like Nutch is the solution.
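For what it is worth, the "skip what has not changed" part of the question can be sketched in plain Java without Nutch at all. The class below is a hypothetical helper (the name `ChangeTracker` and its method are my own, not part of Nutch or any library): it remembers a SHA-256 digest per URL and reports whether freshly fetched content differs from the last version seen. It keeps everything in memory, so a real crawler would also need to persist the map to disk between runs.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not a Nutch class: tracks one content digest per URL. */
final class ChangeTracker {
    // URL -> hex digest of the content seen on the previous visit (in memory only).
    private final Map<String, String> digests = new HashMap<>();

    /**
     * Records the digest of the given content and returns true if the URL
     * has not been seen before or its content differs from the last visit.
     */
    boolean isNewOrChanged(String url, byte[] content) {
        String digest = sha256Hex(content);
        String previous = digests.put(url, digest);
        return !digest.equals(previous);
    }

    // Computes the SHA-256 digest of the data as a lowercase hex string.
    private static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to support SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

A crawl loop would call `isNewOrChanged(url, body)` after each fetch and write the page to disk only when it returns true. Note that this still downloads every page; avoiding the download itself would need something like HTTP conditional requests, which this sketch does not cover.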

+1

You may find this GitHub repository useful: https://github.com/yegor256/nutch-in-java. It demonstrates how Nutch can be used from Java programmatically, without a command line.

0
