Apache HttpClient throws java.net.SocketException: Connection reset for many domains

I've created a (well-behaved) web spider and noticed that some servers cause Apache HttpClient to give me a SocketException - specifically:

java.net.SocketException: Connection reset 

Code calling this:

    // Execute the request
    HttpResponse response;
    try {
        response = httpclient.execute(httpget); // httpclient is of type HttpClient
    } catch (NullPointerException e) {
        return; // deep down, Apache HttpClient sometimes throws a null pointer...
    }

For most servers, this is just fine. But for others, it immediately throws a SocketException.

An example of a site that raises an immediate SocketException: http://www.bhphotovideo.com/

Works great (like most websites): http://www.google.com/

Now, as you can see, www.bhphotovideo.com loads fine in a web browser, and it also loads fine when I don't use Apache HttpClient. (Code like this:)

    HttpURLConnection c = (HttpURLConnection) url.openConnection();
    BufferedInputStream in = new BufferedInputStream(c.getInputStream());
    Reader r = new InputStreamReader(in);
    int i;
    while ((i = r.read()) != -1) {
        source.append((char) i);
    }

So why don't I just use this code? Well, there are some key features in Apache HttpClient that I need to use.

Does anyone know why some servers raise this exception?

Research so far:

  • The problem occurs on both my local Mac dev machines and an AWS EC2 instance, so it's not a local firewall.

  • It seems the error isn't caused by the remote machine, because the exception doesn't say "by peer".

  • This Stack Overflow question seems relevant: java.net.SocketException: Connection reset , but the answers don't show why this would happen only with Apache HttpClient and not with other approaches.

Bonus question: I do quite a lot of crawling with this system. Is there a better Java class for this than Apache HttpClient? I've run into a number of problems (such as the NullPointerException I have to catch in the code above). It seems that HttpClient is very picky about server communication - pickier than you'd want for a crawler, which can't just fall over whenever a server misbehaves.

Thanks everyone!

Solution

Honestly, I don't have the perfect solution, but it works, so it's good enough for me.

As alluded to in the answers below, Bixo has created a crawler that configures HttpClient to be more forgiving of servers. To "work around" the problem rather than fix it, I just used the SimpleHttpFetcher provided by Bixo, found here: (link removed - it thinks I'm a spammer, so you'll have to google it yourself)

    SimpleHttpFetcher fetch = new SimpleHttpFetcher(new UserAgent("botname", "contact@yourcompany.com", "ENTER URL"));
    try {
        FetchedResult result = fetch.fetch("ENTER URL");
        System.out.println(new String(result.getContent()));
    } catch (BaseFetchException e) {
        e.printStackTrace();
    }

The downside of this solution is that Bixo pulls in a lot of dependencies - so it may not be a good fit for everyone. However, you can always dig through how they create their DefaultHttpClient and replicate that to make it work. I decided to use the whole class because it handles a few things for me, such as automatic redirects (and reporting the final destination URL), which are useful.
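As an aside, if you only need the final-destination-URL part, stock HttpClient 4.x can report it as well. A sketch, assuming the same httpclient and httpget as in the question:

    import java.net.URI;

    import org.apache.http.HttpHost;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.HttpUriRequest;
    import org.apache.http.protocol.BasicHttpContext;
    import org.apache.http.protocol.ExecutionContext;
    import org.apache.http.protocol.HttpContext;

    // Execute with a context so we can see where any redirects ended up
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);

    // The last request actually executed, and the host it was sent to
    HttpUriRequest finalRequest =
            (HttpUriRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);
    HttpHost target = (HttpHost) context.getAttribute(ExecutionContext.HTTP_TARGET_HOST);

    URI finalUri = finalRequest.getURI();
    String finalUrl = finalUri.isAbsolute() ? finalUri.toString() : target.toURI() + finalUri;
    System.out.println("Final URL: " + finalUrl);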

Thanks for the help.

Edit: TinyBixo

Hello everybody. I liked the way Bixo worked, but didn't like that it had so many dependencies (including all of Hadoop). So, I created a greatly simplified version of Bixo without any dependencies. If you run into the problems above, I'd recommend using it (and feel free to send pull requests if you'd like to improve it!)

It is available here: https://github.com/juliuss/TinyBixo

java apache web-crawler sockets
3 answers

First, to answer your question:

The connection reset was caused by the server side. Most likely the server was either unable to parse the request or unable to handle it, and as a result it dropped the connection without returning a valid response. There is probably something about the HTTP requests generated by HttpClient that trips up the server-side logic, possibly due to a server-side bug. Just because the error message does not say "by peer" does not mean the connection reset occurred on the client side.
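In practice, that means a crawler should treat a reset like any other per-host failure and keep going rather than crash. A minimal sketch (the class and method names are made up for illustration):

    import java.io.IOException;
    import java.net.SocketException;

    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.util.EntityUtils;

    // Hypothetical helper: fetch a page, returning null instead of dying on connection failures
    public class LenientFetcher {
        public static String fetchOrSkip(HttpClient httpclient, String url) {
            HttpGet httpget = new HttpGet(url);
            try {
                HttpResponse response = httpclient.execute(httpget);
                return EntityUtils.toString(response.getEntity());
            } catch (SocketException e) {
                // "Connection reset": the server dropped the connection; skip this page
                httpget.abort();
                return null;
            } catch (IOException e) {
                // Any other transport failure is also non-fatal for a crawler
                httpget.abort();
                return null;
            }
        }
    }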

A few notes:

(1) Several popular web crawlers, such as bixo http://openbixo.org/ , use HttpClient without major problems, but in most cases HttpClient had to be tuned to make it more lenient about common HTTP protocol violations (a sketch of such tweaks follows after these notes). By default, HttpClient is quite strict about HTTP protocol compliance.

(2) Why didn't you report the NPE problem, or whatever other problems you encountered, to the HttpClient project?
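Regarding (1), here is a sketch of the kind of leniency tweaks meant above. This is a guess at typical settings using HttpClient 4.x's params API, not Bixo's actual configuration; the user-agent string is a placeholder:

    import org.apache.http.client.params.ClientPNames;
    import org.apache.http.client.params.CookiePolicy;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.params.CoreProtocolPNames;
    import org.apache.http.params.HttpParams;

    DefaultHttpClient client = new DefaultHttpClient();
    HttpParams params = client.getParams();

    // Identify the bot; some servers reset connections from empty or default user agents
    params.setParameter(CoreProtocolPNames.USER_AGENT,
            "Mozilla/5.0 (compatible; mybot/1.0; +http://example.com/bot)");

    // Accept sloppy cookie headers instead of rejecting the response
    params.setParameter(ClientPNames.COOKIE_POLICY, CookiePolicy.BROWSER_COMPATIBILITY);

    // Skip the Expect: 100-continue handshake, which some servers mishandle
    params.setBooleanParameter(CoreProtocolPNames.USE_EXPECT_CONTINUE, false);

    // Tolerate redirect loops that technically violate the spec
    params.setBooleanParameter(ClientPNames.ALLOW_CIRCULAR_REDIRECTS, true);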


These two options will help:

    client.getParams().setParameter("http.socket.timeout", new Integer(0));
    client.getParams().setParameter("http.connection.stalecheck", new Boolean(true));

The first sets the socket timeout to infinite. The second makes HttpClient check whether a pooled connection is stale (already closed by the server) before reusing it, which avoids writing to half-closed connections.
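Equivalently, with the typed constants from CoreConnectionPNames (the same two parameters, with no string keys to mistype):

    import org.apache.http.params.CoreConnectionPNames;

    client.getParams().setIntParameter(CoreConnectionPNames.SO_TIMEOUT, 0);
    client.getParams().setBooleanParameter(CoreConnectionPNames.STALE_CONNECTION_CHECK, true);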


Try to get a network trace using Wireshark, and augment it with HttpClient's wire logging (for instance via log4j). That should show why the connection is being reset.
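For example, HttpClient's wire logging can be switched on through commons-logging's SimpleLog with plain system properties (these property names come from the HttpClient 4.x logging documentation; set them before the first request):

    // Route commons-logging to SimpleLog and enable wire + header logging
    System.setProperty("org.apache.commons.logging.Log",
            "org.apache.commons.logging.impl.SimpleLog");
    System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
    System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.wire", "DEBUG");
    System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.headers", "DEBUG");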

