Nutch 1.10: Input path does not exist: linkdb/current

When I run Nutch 1.10 with the following command (TestCrawl2 does not exist beforehand and should be created by the run):

 sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TestCrawlCore2 urls/ TestCrawl2/ 20 

I get an error when indexing:

 Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/TestCrawl2/linkdb/current 

The linkdb directory exists, but it does not contain the "current" directory. The directory is owned by root, so there should not be any permission issues. Because the process exited with an error, the linkdb directory still contains the .locked and ..locked.crc files. If I run the command again, these lock files cause it to exit in the same place. Remove the TestCrawl2 directory, rinse, repeat.
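
A narrower cleanup than deleting the whole crawl directory would be to remove only the two lock files. This is a minimal sketch, assuming no other Nutch job is using the directory; `clear_linkdb_locks` is a hypothetical helper name, not part of Nutch. It only gets a rerun past the lock check and does nothing about the missing "current" directory.

```shell
# Remove the stale lock files left in a crawl directory's linkdb after a failed run.
# Assumption: no Nutch job is currently running against this crawl directory.
# clear_linkdb_locks is a hypothetical helper, not a Nutch command.
clear_linkdb_locks() {
    crawl_dir=$1
    # File names as reported above: .locked and its checksum file ..locked.crc
    rm -f -- "$crawl_dir/linkdb/.locked" "$crawl_dir/linkdb/..locked.crc"
}
```

Since the crawl directory here is owned by root, the `rm` would need to run with the same privileges as the crawl, for example from a root shell, and the call would be `clear_linkdb_locks TestCrawl2`.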

Note that the Nutch and Solr installations themselves worked without problems for an earlier TestCrawl instance. Only now, when I try to create a new crawl, do I run into this. Any suggestions for fixing it?

1 answer

Well, it looks like I ran into this problem:

https://issues.apache.org/jira/browse/NUTCH-2041

This is a result of the crawl script not being aware of the change I made to ignore_external_links in my nutch-site.xml file.

I am trying to crawl several sites and was hoping to keep my life simple by ignoring external links and leaving regex-urlfilter.txt untouched (just using +.)

Now it looks like I will have to change ignore_external_links back to false and add a regular expression filter for each of my URLs. Hopefully I can pick up the 1.11 release soon, where this appears to be fixed.
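
The per-URL filters do not have to be written by hand. Here is a rough sketch that derives one accept rule per seed host and ends with a reject-everything rule in place of the catch-all `+.`. It assumes the seeds sit one URL per line in a file such as `urls/seed.txt` (the file name is an assumption; only the `urls/` directory appears in the command above), and `seed_rules` is a hypothetical helper, not something shipped with Nutch. As far as I know, the setting to switch back to false in nutch-site.xml is the `db.ignore.external.links` property.

```shell
# Emit one regex-urlfilter.txt accept rule per seed host, then reject the rest.
# Assumption: the seed file holds one URL per line; "#" lines and blanks are skipped.
# seed_rules is a hypothetical helper, not part of Nutch.
seed_rules() {
    # Drop comments/blank lines, strip the scheme, cut at the first "/", ":" or "?"
    # to leave the host, then escape dots so they match literally in the regex.
    sed -E -e '/^[[:space:]]*(#|$)/d' \
           -e 's#^[a-zA-Z]+://##' \
           -e 's#[/:?].*$##' \
           -e 's#\.#\\.#g' "$1" |
    sort -u |
    while IFS= read -r host; do
        printf '+^https?://%s/\n' "$host"
    done
    # Final rule: reject anything not accepted above.
    printf '%s\n' '-.'
}
```

Running `seed_rules urls/seed.txt` prints the rules to standard output. They would replace the final `+.` line of regex-urlfilter.txt; the default rules earlier in that file (skipping images, mailto: links and so on) can stay as they are.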
