I've created a (well-behaved) web spider and notice that on some servers Apache HttpClient gives me a SocketException - specifically:
java.net.SocketException: Connection reset
Code calling this:
// Execute the request
HttpResponse response;
try {
    response = httpclient.execute(httpget); // httpclient is of type HttpClient
} catch (NullPointerException e) {
    return; // deep down, Apache HTTP sometimes throws a null pointer...
}
For most servers, this is just fine. But for others, it immediately throws a SocketException.
An example of a site that raises an immediate SocketException: http://www.bhphotovideo.com/
Works great (like most websites): http://www.google.com/
Now, as you can see, www.bhphotovideo.com loads fine in a web browser. It also loads fine when I don't use the Apache HTTP client, with code like this:
HttpURLConnection c = (HttpURLConnection) url.openConnection();
BufferedInputStream in = new BufferedInputStream(c.getInputStream());
Reader r = new InputStreamReader(in);
int i;
while ((i = r.read()) != -1) {
    source.append((char) i);
}
So why don't I just use this code? Well, there are some key features in the Apache HTTP client that I need to use.
Does anyone know why some servers raise this exception?
Research so far:
The problem occurs on my local Mac dev machines and the AWS EC2 instance, so this is not a local firewall.
It seems the error is not coming from the remote machine closing the connection, because the exception does not say "Connection reset by peer".
This Stack Overflow question seems relevant: java.net.SocketException: Connection reset , but the answers don't show why this would happen only with the Apache HTTP Client and not other approaches.
Bonus question: I do quite a lot of crawling with this system. Is there a better Java class for this than Apache HTTP Client? I've run into a number of problems (such as the NullPointerException I have to catch in the code above). It seems that HttpClient is very picky about server communication - pickier than I'd like for a crawler that can't just break whenever a server misbehaves.
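(For now, the simplest stopgap I can think of is to catch IOException - which SocketException extends - alongside the NullPointerException and just skip the page. This is only a sketch of a workaround, not a fix, and it assumes the same httpclient/httpget variables as above:)

// Stopgap: treat I/O failures as "skip this page" instead of letting them propagate.
// SocketException is a subclass of IOException, so the connection reset lands here too.
HttpResponse response;
try {
    response = httpclient.execute(httpget); // httpclient is of type HttpClient
} catch (NullPointerException e) {
    return; // deep down, Apache HTTP sometimes throws a null pointer...
} catch (IOException e) {
    return; // "Connection reset" and similar errors - log it and move on to the next URL
}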
Thanks everyone!
Solution
Honestly, I don't have the perfect solution, but it works, so it's good enough for me.
As pointed out below, Bixo has created a crawler that configures HttpClient to be more forgiving of servers. To "work around" the problem rather than fix it, I just used the SimpleHttpFetcher provided by Bixo here: (link removed - it thinks I'm a spammer, so you'll have to find it yourself)
SimpleHttpFetcher fetch = new SimpleHttpFetcher(new UserAgent("botname", "contact@yourcompany.com", "ENTER URL"));
try {
    FetchedResult result = fetch.fetch("ENTER URL");
    System.out.println(new String(result.getContent()));
} catch (BaseFetchException e) {
    e.printStackTrace();
}
The downside of this solution is that Bixo has a lot of dependencies, so it may not be a good fit for everyone. However, you can always just look at how they construct their DefaultHttpClient and replicate that in your own code. I decided to use the whole class because it handles a few things for me, such as automatic redirects (and reporting the final destination URL), which are useful.
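(For reference, tuning a DefaultHttpClient along those lines looks roughly like this. This is only a sketch based on the HttpClient 4.x API, not the exact parameters Bixo sets - check their SimpleHttpFetcher source for the real configuration. The user agent string, timeouts, and headers below are placeholder values:)

DefaultHttpClient client = new DefaultHttpClient();

// Speak plain HTTP/1.1 and send a browser-like user agent (placeholder value)
HttpProtocolParams.setVersion(client.getParams(), HttpVersion.HTTP_1_1);
HttpProtocolParams.setUserAgent(client.getParams(), "Mozilla/5.0 (compatible; botname)");

// Be lenient about cookies and slow servers (placeholder timeouts)
client.getParams().setParameter(ClientPNames.COOKIE_POLICY, CookiePolicy.BROWSER_COMPATIBILITY);
HttpConnectionParams.setConnectionTimeout(client.getParams(), 20 * 1000);
HttpConnectionParams.setSoTimeout(client.getParams(), 20 * 1000);

// Retry a few times instead of giving up on the first hiccup
client.setHttpRequestRetryHandler(new DefaultHttpRequestRetryHandler(3, false));

// Some servers appear to reset the connection when they dislike the default request headers
HttpGet httpget = new HttpGet("http://www.bhphotovideo.com/");
httpget.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
httpget.setHeader("Accept-Language", "en-us,en;q=0.5");

HttpResponse response = client.execute(httpget);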
Thanks for the help.
Edit: TinyBixo
Hello everybody. So, I liked how Bixo worked, but I didn't like that it had so many dependencies (including all of Hadoop). So I created a greatly simplified version of Bixo without any dependencies. If you run into the problems above, I'd recommend using it (and feel free to submit pull requests if you want to improve it!)
It is available here: https://github.com/juliuss/TinyBixo