Parse robots.txt using Java and determine URL validity

I am currently using jsoup in an application to fetch and parse web pages. But I want to make sure I adhere to the rules in robots.txt and only visit pages that are allowed.

I am pretty sure jsoup was not made for this; its job is web scraping and parsing. Therefore, I plan to have a function/module that reads the robots.txt of the domain/site and determines whether the URL I am going to visit is allowed or not.
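To make the intent concrete, the module I have in mind would look roughly like this (just a sketch; the interface and method names are made up, not from any existing library):

// Sketch of the planned module; RobotsTxtChecker / isAllowed are placeholder names.
public interface RobotsTxtChecker {

    /**
     * @param url the absolute URL the crawler is about to visit
     * @return true if the robots.txt of that URL's host allows fetching it
     */
    boolean isAllowed(String url);
}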

I did some research and found the libraries below, but I am not sure about them. If someone has done a project that involved parsing robots.txt, please share your thoughts and ideas.

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12

+4
2 answers

A late answer, in case you or someone else is still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified version of the code I use:

String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// key for the cache: protocol + host (+ explicit port, if any)
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
                + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// cache of parsed rules per host; in real code this map and the HttpClient
// below would be fields reused across calls rather than local variables
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
HttpClient httpclient = new DefaultHttpClient();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);

Obviously this has nothing to do with jsoup; it just checks whether a given URL may be crawled by a specific USER_AGENT. To fetch the robots.txt file I use Apache HttpClient in version 4.2.1, but this could just as well be replaced with plain java.net classes, as in the sketch below.
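For example, a minimal fetch using only java.net could look roughly like this (a sketch, not tested; the class and method names are made up, the crawler-commons classes are the same ones used above, and error handling is left out):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.commons.io.IOUtils;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsTxtFetcher {

    // Fetches and parses robots.txt for a host id like "http://example.com" using plain java.net.
    static BaseRobotRules fetchRules(String hostId, String userAgent) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(hostId + "/robots.txt").openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        if (conn.getResponseCode() == 404) {
            // no robots.txt at all -> everything is allowed
            return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        }
        InputStream in = conn.getInputStream();
        try {
            return new SimpleRobotRulesParser().parseContent(
                    hostId, IOUtils.toByteArray(in), "text/plain", userAgent);
        } finally {
            in.close();
        }
    }
}

The returned rules can then be cached per host and queried with rules.isAllowed(url) exactly as above.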

Besides allow/disallow, the parsed robots.txt rules also give you the "Crawl-delay" value, which is worth respecting as well.
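A minimal sketch for honouring it (assuming the rules object from above; as far as I know BaseRobotRules exposes the parsed value via getCrawlDelay(), but check in which unit your crawler-commons version reports it):

long crawlDelay = rules.getCrawlDelay();   // unset/negative when no Crawl-delay was given
if (crawlDelay > 0) {
    try {
        Thread.sleep(crawlDelay);          // pause between requests to the same host
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}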

+6

A late answer as well, but here is a simple way to do it in plain Java without any extra libraries: download robots.txt yourself and check its Disallow rules against the URL you want to visit. The following method does exactly that.

// DISALLOW is assumed to be defined as a class constant, e.g.
// private static final String DISALLOW = "Disallow:";
public static boolean robotSafe(URL url) 
{
    String strHost = url.getHost();

    String strRobot = "http://" + strHost + "/robots.txt";   // note: always uses http and ignores https/ports
    URL urlRobot;
    try { urlRobot = new URL(strRobot);
    } catch (MalformedURLException e) {
        // something weird is happening, so don't trust it
        return false;
    }

    String strCommands;
    try 
    {
        // read the whole robots.txt file into a single String
        InputStream urlRobotStream = urlRobot.openStream();
        StringBuilder sb = new StringBuilder();
        byte[] b = new byte[1000];
        int numRead;
        while ((numRead = urlRobotStream.read(b)) != -1) {
            sb.append(new String(b, 0, numRead));
        }
        urlRobotStream.close();
        strCommands = sb.toString();
    } 
    } 
    catch (IOException e) 
    {
        return true; // if there is no robots.txt file, it is OK to search
    }

    if (strCommands.contains(DISALLOW)) // if there is no "Disallow:" line at all, nothing is blocked
    {
        String[] split = strCommands.split("\n");
        ArrayList<RobotRule> robotRules = new ArrayList<>();
        String mostRecentUserAgent = null;
        for (int i = 0; i < split.length; i++) 
        {
            String line = split[i].trim();
            if (line.toLowerCase().startsWith("user-agent")) 
            {
                int start = line.indexOf(":") + 1;
                int end   = line.length();
                mostRecentUserAgent = line.substring(start, end).trim();
            }
            else if (line.startsWith(DISALLOW)) {
                if (mostRecentUserAgent != null) {
                    RobotRule r = new RobotRule();
                    r.userAgent = mostRecentUserAgent;
                    int start = line.indexOf(":") + 1;
                    int end   = line.length();
                    r.rule = line.substring(start, end).trim();
                    robotRules.add(r);
                }
            }
        }

        // NOTE: this simple check ignores which User-agent each rule belongs to
        // and applies every Disallow rule found in the file
        for (RobotRule robotRule : robotRules)
        {
            String path = url.getPath();
            if (robotRule.rule.length() == 0) return true;   // an empty "Disallow:" allows everything
            if (robotRule.rule.equals("/")) return false;    // "Disallow: /" blocks everything

            if (robotRule.rule.length() <= path.length())
            { 
                String pathCompare = path.substring(0, robotRule.rule.length());
                if (pathCompare.equals(robotRule.rule)) return false;
            }
        }
    }
    return true;
}

And the helper class used above:

/**
 *
 * @author Namhost.com
 */
public class RobotRule 
{
    public String userAgent;
    public String rule;

    RobotRule() {

    }

    @Override public String toString() 
    {
        StringBuilder result = new StringBuilder();
        String NEW_LINE = System.getProperty("line.separator");
        result.append(this.getClass().getName() + " Object {" + NEW_LINE);
        result.append("   userAgent: " + this.userAgent + NEW_LINE);
        result.append("   rule: " + this.rule + NEW_LINE);
        result.append("}");
        return result.toString();
    }    
}
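A quick usage sketch combining this with jsoup (the target URL is only an example; robotSafe is the method above):

URL target = new URL("http://example.com/some/page.html");
if (robotSafe(target)) {
    Document doc = Jsoup.connect(target.toString()).get();
    // ... work with the parsed document ...
} else {
    System.out.println("Skipping " + target + " - disallowed by robots.txt");
}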
+1
