Connection Error Handling and JSoup

I am trying to create an application to scrape the contents of several pages on a site. I am using JSoup to connect. This is my code:

    for (String locale : langList) {
        sitemapPath = sitemapDomain + "/" + locale + "/" + sitemapName;
        try {
            Document doc = Jsoup.connect(sitemapPath)
                    .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                    .timeout(10000)
                    .get();
            Elements element = doc.select("loc");
            for (Element urls : element) {
                System.out.println(urls.text());
            }
        } catch (IOException e) {
            System.out.println(e);
        }
    }

Everything works fine in most cases. However, there are a few things I want to do.

First, a 404 or 500 status is sometimes returned, or perhaps a 301. With my code above, it just prints an error and moves on to the next URL. What I would like to do is report the status for every URL: if the page connects, print 200; if not, print the corresponding status code.

Second, I sometimes catch this error: "java.net.SocketTimeoutException: Read timed out". I could increase the timeout, but I would rather try to connect 3 times, and after the third failed attempt add the URL to a "failed" list, so I can retry the failed connections later.

Can someone with more knowledge than me help me?

java jsoup connection
2 answers

For your first question, you can do the connect and the read in two steps, stopping to check the status code in between, like this:

    Connection.Response response = Jsoup.connect(sitemapPath)
            .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .timeout(10000)
            .execute();

    int statusCode = response.statusCode();
    if (statusCode == 200) {
        // Parse the body that execute() already fetched; no second request is made.
        Document doc = response.parse();
        Elements element = doc.select("loc");
        for (Element urls : element) {
            System.out.println(urls.text());
        }
    } else {
        System.out.println("received error code : " + statusCode);
    }

Please note that the execute() method throws an IOException if it cannot connect to the server, if the response is malformed HTTP, and so on, so you will need to handle that. However, as long as the server returns a well-formed response, you can read the status code and continue. Also, if you have told Jsoup to follow redirects, you won't see 30x response codes, because the status code will be that of the last page loaded.

As for your second question, all you need is a loop around the sample code I just gave you, wrapped in a try/catch block that catches SocketTimeoutException. When you catch the exception, the loop should continue. If you get the data, return or break. Shout if you need more help!
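To make the retry idea concrete, here is a minimal sketch of such a loop. It is not taken from the question's code: the names RetrySketch, StatusFetcher, fetchAll and MAX_ATTEMPTS are made up for illustration, and the actual Jsoup call is hidden behind a small interface so the loop can be run without a network. It also assumes that only timeouts should be retried and that any other IOException sends the URL straight to the "failed" list.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RetrySketch {

    /** Hypothetical stand-in for the Jsoup call; returns the HTTP status code. */
    @FunctionalInterface
    public interface StatusFetcher {
        int fetch(String url) throws IOException;
    }

    static final int MAX_ATTEMPTS = 3;

    /**
     * Tries each URL up to MAX_ATTEMPTS times, retrying only on timeouts.
     * URLs that never succeed are returned so they can be retried later.
     */
    public static List<String> fetchAll(List<String> urls, StatusFetcher fetcher) {
        List<String> failed = new ArrayList<>();
        for (String url : urls) {
            boolean done = false;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS && !done; attempt++) {
                try {
                    int status = fetcher.fetch(url);
                    System.out.println(url + " -> " + status);
                    done = true; // got a response, stop retrying
                } catch (SocketTimeoutException e) {
                    // Timed out: fall through to the next attempt.
                    System.out.println(url + " timed out (attempt " + attempt + ")");
                } catch (IOException e) {
                    // Assumption: other I/O errors are not worth retrying.
                    System.out.println(url + " failed: " + e);
                    break;
                }
            }
            if (!done) {
                failed.add(url);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Fake fetcher for demonstration: URLs containing "slow" always time out.
        // With Jsoup, the lambda body could instead be something like:
        //   Jsoup.connect(url).timeout(10000).ignoreHttpErrors(true).execute().statusCode()
        StatusFetcher fake = url -> {
            if (url.contains("slow")) {
                throw new SocketTimeoutException("Read timed out");
            }
            return 200;
        };
        List<String> failed = fetchAll(
                Arrays.asList("http://example.com/ok", "http://example.com/slow"), fake);
        System.out.println("failed = " + failed);
    }
}
```

With the fake fetcher in main, the first URL prints a 200, the second is attempted three times and ends up in the failed list. Passing that list back into fetchAll later is one simple way to retry the failed connections.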


The above throws an IOException for me, rather than execute() returning the status code.

Using JSoup-1.6.1, I had to modify the above code to use ignoreHttpErrors(true).

Now the code returns a response instead of throwing an exception, so you can check the status code and message.

    Connection.Response response = null;
    try {
        response = Jsoup.connect(bad_url)
                .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5")
                .timeout(100000)
                .ignoreHttpErrors(true) // return 4xx/5xx responses instead of throwing
                .execute();
    } catch (IOException e) {
        System.out.println("io - " + e);
    }
    // response is still null if execute() threw (e.g. on a timeout), so guard against it
    if (response != null) {
        System.out.println("Status code = " + response.statusCode());
        System.out.println("Status msg = " + response.statusMessage());
    }

Output:

    Status code = 404
    Status msg = Not Found