I am trying to use the basic crawler example in crawler4j. I took the code from crawler4j here.
package edu.crawler;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

import java.util.List;
import java.util.regex.Pattern;

import org.apache.http.Header;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
    }

    @Override
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String domain = page.getWebURL().getDomain();
        String path = page.getWebURL().getPath();
        String subDomain = page.getWebURL().getSubDomain();
        String parentUrl = page.getWebURL().getParentUrl();
        String anchor = page.getWebURL().getAnchor();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Domain: '" + domain + "'");
        System.out.println("Sub-domain: '" + subDomain + "'");
        System.out.println("Path: '" + path + "'");
        System.out.println("Parent page: " + parentUrl);
        System.out.println("Anchor text: " + anchor);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }

        Header[] responseHeaders = page.getFetchResponseHeaders();
        if (responseHeaders != null) {
            System.out.println("Response headers:");
            for (Header header : responseHeaders) {
                System.out.println("\t" + header.getName() + ": " + header.getValue());
            }
        }

        System.out.println("=============");
    }
}
Above is the code for the crawler class from the example.
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "../data/";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/");

        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
Above is the controller class for the web crawler. When I try to run the Controller class from my IDE (IntelliJ), I get the following error:
Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/uci/ics/crawler4j/crawler/CrawlConfig : Unsupported major.minor version 51.0
Is there something about the Maven configuration found here that I should know? Should I use a different version, or something else?
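In case it helps narrow this down, here is a minimal sketch of how I could check which JRE the IntelliJ run configuration actually launches (the class name VersionCheck is just a made-up example; it deliberately has no crawler4j imports, so it should run even when CrawlConfig fails to load):

public class VersionCheck {
    public static void main(String[] args) {
        // Print the version and home directory of the JRE running this class.
        System.out.println("java.version: " + System.getProperty("java.version"));
        System.out.println("java.home:    " + System.getProperty("java.home"));
    }
}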