Nutch: Call from Java, not from the command line?

Am I really thick, or is there no way to call Apache Nutch programmatically from Java code? Where is the documentation (or a manual or tutorial) on how to do this? Google failed me, so I even tried Bing. (Yes, I know, pathetic.) Ideas? Thanks in advance.

(Also, if Nutch is a crapshoot, are there any other crawlers written in Java that have proven reliable at Internet scale and have actual documentation?)

+8
java web-crawler nutch
2 answers

If you look inside the bin/nutch script, you will see that it calls the Java class corresponding to your command:

    # figure out which class to run
    if [ "$COMMAND" = "crawl" ] ; then
      CLASS=org.apache.nutch.crawl.Crawl
    elif [ "$COMMAND" = "inject" ] ; then
      CLASS=org.apache.nutch.crawl.Injector
    elif [ "$COMMAND" = "generate" ] ; then
      CLASS=org.apache.nutch.crawl.Generator
    elif [ "$COMMAND" = "freegen" ] ; then
      CLASS=org.apache.nutch.tools.FreeGenerator
    elif [ "$COMMAND" = "fetch" ] ; then
      CLASS=org.apache.nutch.fetcher.Fetcher
    elif [ "$COMMAND" = "fetch2" ] ; then
      CLASS=org.apache.nutch.fetcher.Fetcher2
    elif [ "$COMMAND" = "parse" ] ; then
      CLASS=org.apache.nutch.parse.ParseSegment
    elif [ "$COMMAND" = "readdb" ] ; then
      CLASS=org.apache.nutch.crawl.CrawlDbReader
    elif [ "$COMMAND" = "convdb" ] ; then
      CLASS=org.apache.nutch.tools.compat.CrawlDbConverter
    elif [ "$COMMAND" = "mergedb" ] ; then
      CLASS=org.apache.nutch.crawl.CrawlDbMerger
    elif [ "$COMMAND" = "readlinkdb" ] ; then
      CLASS=org.apache.nutch.crawl.LinkDbReader
    elif [ "$COMMAND" = "readseg" ] ; then
      CLASS=org.apache.nutch.segment.SegmentReader
    elif [ "$COMMAND" = "segread" ] ; then
      echo "[DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead."
      CLASS=org.apache.nutch.segment.SegmentReader
    elif [ "$COMMAND" = "mergesegs" ] ; then
      CLASS=org.apache.nutch.segment.SegmentMerger
    elif [ "$COMMAND" = "updatedb" ] ; then
      CLASS=org.apache.nutch.crawl.CrawlDb
    elif [ "$COMMAND" = "invertlinks" ] ; then
      CLASS=org.apache.nutch.crawl.LinkDb
    elif [ "$COMMAND" = "mergelinkdb" ] ; then
      CLASS=org.apache.nutch.crawl.LinkDbMerger
    elif [ "$COMMAND" = "index" ] ; then
      CLASS=org.apache.nutch.indexer.Indexer
    elif [ "$COMMAND" = "solrindex" ] ; then
      CLASS=org.apache.nutch.indexer.solr.SolrIndexer
    elif [ "$COMMAND" = "dedup" ] ; then
      CLASS=org.apache.nutch.indexer.DeleteDuplicates
    elif [ "$COMMAND" = "solrdedup" ] ; then
      CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
    elif [ "$COMMAND" = "merge" ] ; then
      CLASS=org.apache.nutch.indexer.IndexMerger
    elif [ "$COMMAND" = "plugin" ] ; then
      CLASS=org.apache.nutch.plugin.PluginRepository
    elif [ "$COMMAND" = "server" ] ; then
      CLASS='org.apache.nutch.searcher.DistributedSearch$Server'
    else
      CLASS=$COMMAND
    fi

    # run it
    exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS -classpath "$CLASSPATH" $CLASS "$@"

From there, it is just a matter of reading the API docs and, if necessary, the source code for these classes.
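Since the script simply runs the `main` method of whichever class it picks, the same dispatch can be sketched in Java. What follows is a minimal, untested-against-Nutch sketch, not the official way to embed Nutch: the class name `NutchCommandRunner` and its methods are hypothetical, the command-to-class table is copied from the script above (so it only matches that Nutch version), and it assumes the Nutch jars, plugins and configuration directory are already on the classpath. Be aware that a tool's `main` may call `System.exit`, which would also terminate your own JVM; if that matters, look at the class's source for a method you can call directly.

    import java.lang.reflect.Method;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical helper: mirrors the command-to-class table from bin/nutch.
    class NutchCommandRunner {
        private static final Map<String, String> COMMANDS = new LinkedHashMap<>();
        static {
            COMMANDS.put("crawl", "org.apache.nutch.crawl.Crawl");
            COMMANDS.put("inject", "org.apache.nutch.crawl.Injector");
            COMMANDS.put("generate", "org.apache.nutch.crawl.Generator");
            COMMANDS.put("freegen", "org.apache.nutch.tools.FreeGenerator");
            COMMANDS.put("fetch", "org.apache.nutch.fetcher.Fetcher");
            COMMANDS.put("fetch2", "org.apache.nutch.fetcher.Fetcher2");
            COMMANDS.put("parse", "org.apache.nutch.parse.ParseSegment");
            COMMANDS.put("readdb", "org.apache.nutch.crawl.CrawlDbReader");
            COMMANDS.put("convdb", "org.apache.nutch.tools.compat.CrawlDbConverter");
            COMMANDS.put("mergedb", "org.apache.nutch.crawl.CrawlDbMerger");
            COMMANDS.put("readlinkdb", "org.apache.nutch.crawl.LinkDbReader");
            COMMANDS.put("readseg", "org.apache.nutch.segment.SegmentReader");
            COMMANDS.put("segread", "org.apache.nutch.segment.SegmentReader"); // deprecated alias
            COMMANDS.put("mergesegs", "org.apache.nutch.segment.SegmentMerger");
            COMMANDS.put("updatedb", "org.apache.nutch.crawl.CrawlDb");
            COMMANDS.put("invertlinks", "org.apache.nutch.crawl.LinkDb");
            COMMANDS.put("mergelinkdb", "org.apache.nutch.crawl.LinkDbMerger");
            COMMANDS.put("index", "org.apache.nutch.indexer.Indexer");
            COMMANDS.put("solrindex", "org.apache.nutch.indexer.solr.SolrIndexer");
            COMMANDS.put("dedup", "org.apache.nutch.indexer.DeleteDuplicates");
            COMMANDS.put("solrdedup", "org.apache.nutch.indexer.solr.SolrDeleteDuplicates");
            COMMANDS.put("merge", "org.apache.nutch.indexer.IndexMerger");
            COMMANDS.put("plugin", "org.apache.nutch.plugin.PluginRepository");
            COMMANDS.put("server", "org.apache.nutch.searcher.DistributedSearch$Server");
        }

        // Like the script's final "else" branch: an unknown command is
        // treated as a fully qualified class name.
        static String resolveClassName(String command) {
            return COMMANDS.getOrDefault(command, command);
        }

        // Loads the class by name and invokes its static main(String[]),
        // which is what "exec java ... $CLASS "$@"" does in a new process.
        static void run(String command, String... args) throws Exception {
            Class<?> cls = Class.forName(resolveClassName(command));
            Method main = cls.getMethod("main", String[].class);
            main.invoke(null, (Object) args);
        }

        public static void main(String[] argv) throws Exception {
            if (argv.length == 0) {
                System.err.println("Usage: NutchCommandRunner COMMAND [args...]");
                return;
            }
            String[] rest = new String[argv.length - 1];
            System.arraycopy(argv, 1, rest, 0, rest.length);
            run(argv[0], rest);
        }
    }

Reflection is used here only so that the sketch compiles without Nutch on the compile-time classpath. With the Nutch jars as a normal dependency you would more likely call the class's `main` directly, passing the same arguments you would give on the command line.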

+8

You can see how this works in my GitHub repository: https://github.com/yegor256/nutch-in-java. I ran into the same problem and, after several hours of research, managed to put together a fully working piece of Java code. Enjoy!

0
