I have a site hosted on my local machine that I am trying to crawl with Nutch and an index in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 according to the instructions on the Nutch website ( http://wiki.apache.org/nutch/NutchTutorial ), and I have Solr working in my browser without any problems.
I run the following command:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1 -topN 2
Scanning works fine, but when it tries to put data in Solr, it crashes with the following output:
Indexer: starting at 2014-02-06 16:29:28 Indexer: deleting gone documents: false Indexer: URL filtering: false Indexer: URL normalizing: false Active IndexWriters : SOLRIndexWriter solr.server.url : URL of the SOLR instance (mandatory) solr.commit.size : buffer size when sending to SOLR (default 1000) solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml) solr.auth : use authentication (default false) solr.auth.username : use authentication (default false) solr.auth : username for authentication solr.auth.password : password for authentication Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357) at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123) at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81) at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65) at org.apache.nutch.crawl.Crawl.run(Crawl.java:155) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
I went to the Nutch log directory and linked the hadoop.log file, it shows this:
2014-02-06 16:29:28,920 INFO solr.SolrIndexWriter - Indexing 1 documents 2014-02-06 16:29:28,921 INFO httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server localhost failed to respond 2014-02-06 16:29:28,921 INFO httpclient.HttpMethodDirector - Retrying request 2014-02-06 16:29:28,924 WARN mapred.LocalJobRunner - job_local331896790_0009 java.io.IOException at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173) at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:159) at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118) at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44) at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398) Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155) ... 6 more Caused by: java.net.SocketException: Connection reset at java.net.SocketInputStream.read(SocketInputStream.java:168) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116) at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973) at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735) at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)
However, I can still access Solr in my browser. This is my first attempt at Solr / Nutch - any help from those with more knowledge will be greatly appreciated. Thank you
solr nutch
rldrummer
source share