Late answer in case you or someone else is still looking for a way to do this. I am using crawler-commons (https://code.google.com/p/crawler-commons/) in version 0.2 and it seems to work well. Here is a simplified version of the code I use:
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.entity.BufferedHttpEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// Cache key: protocol + host (+ port, if one is set)
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
        + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// Cache so robots.txt is fetched and parsed only once per host
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
HttpClient httpclient = new DefaultHttpClient();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL); // no robots.txt: everything is allowed
        EntityUtils.consumeQuietly(response.getEntity()); // consume the entity so the connection can be reused
    } else {
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);
Obviously, this has nothing to do with Jsoup; it simply checks whether a given URL is allowed to be crawled for a certain USER_AGENT. To fetch the robots.txt file I use Apache HttpClient in version 4.2.1, but this could also be replaced with plain java.net stuff.
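For completeness, here is a minimal sketch of what the fetch could look like with java.net only (the 404 handling and the parsing stay the same as above); the helper name fetchRobotsTxt is just illustrative and not part of crawler-commons:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.io.IOUtils;
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

// Hypothetical helper: fetch and parse robots.txt without Apache HttpClient
static BaseRobotRules fetchRobotsTxt(String hostId, String userAgent) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(hostId + "/robots.txt").openConnection();
    conn.setRequestProperty("User-Agent", userAgent);
    if (conn.getResponseCode() == 404) {
        return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL); // no robots.txt: allow everything
    }
    InputStream in = conn.getInputStream();
    try {
        byte[] content = IOUtils.toByteArray(in);
        return new SimpleRobotRulesParser().parseContent(hostId, content, "text/plain", userAgent);
    } finally {
        in.close();
    }
}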
Besides the allow/disallow rules, the parsed robots.txt also gives you the "Crawl-delay" directive, which crawler-commons exposes via BaseRobotRules.getCrawlDelay().
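If you want to honour that value, a minimal sketch (assuming the rules object from the snippet above, and that crawler-commons reports the delay in milliseconds) could look like this:

// Wait between requests to the same host if the site declares a Crawl-delay
long crawlDelay = rules.getCrawlDelay();
if (crawlDelay != BaseRobotRules.UNSET_CRAWL_DELAY && crawlDelay > 0) {
    Thread.sleep(crawlDelay);
}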