Nutch vs Heritrix vs Stormcrawler vs MegaIndex vs Mixnode

We need to crawl a large number (~1.5 billion) of web pages every two weeks. Speed, and therefore cost, is a huge factor for us, since our initial attempts ended up costing us more than $20,000.

Is there any data on which of these crawlers performs best in a distributed environment?

+4
web-crawler nutch heritrix stormcrawler
2 answers

We have tried only Nutch, StormCrawler, and Mixnode. We ended up using Mixnode to crawl ~300 million pages across 5k domains.

My $0.02: Mixnode is the best choice for crawling at large scale (over 1 million URLs). For smaller crawls it is overkill, since you have to parse the resulting WARC files, and if you are only fetching a few thousand pages it is easier to run your own script or use an open source alternative such as Nutch or StormCrawler (or even Scrapy).
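To make the WARC point concrete, here is a minimal sketch in Python, assuming the crawl output is standard (possibly gzipped) WARC files and that the open source warcio library is installed; the file name and the HTML-only filter are illustrative assumptions, not anything specific to Mixnode. It iterates over the response records and yields each page's URL and raw body.

```python
# Minimal sketch: pull URLs and HTML bodies out of a WARC file with warcio.
# Assumes `pip install warcio`; 'crawl-output.warc.gz' is a hypothetical file name.
from warcio.archiveiterator import ArchiveIterator


def extract_html_responses(warc_path):
    """Yield (url, body_bytes) for every HTML response record in the WARC file."""
    with open(warc_path, 'rb') as stream:
        # ArchiveIterator transparently handles gzip-compressed WARC files.
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue  # skip request/metadata/warcinfo records
            headers = record.http_headers
            if headers is None:
                continue  # e.g. DNS records carry no HTTP headers
            if 'text/html' not in headers.get_header('Content-Type', ''):
                continue  # keep only HTML pages for this example
            url = record.rec_headers.get_header('WARC-Target-URI')
            yield url, record.content_stream().read()


if __name__ == '__main__':
    for url, body in extract_html_responses('crawl-output.warc.gz'):
        print(url, len(body))
```

For a crawl of a few thousand pages, a script like this plus your own fetcher (or a small Scrapy project) is usually simpler than standing up a distributed crawler.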

Update: Mixnode has since repositioned itself as an “alternative to web crawling”, so the product is now quite different from what it was when I wrote my original answer.

+6

For a comparison between Nutch and StormCrawler, see my DZone article.

Heritrix can be used in distributed mode, but the documentation is not very clear on how to do this. The other two rely on well-established distributed computing platforms (Apache Hadoop for Nutch and Apache Storm for StormCrawler), which is not the case for Heritrix.

Heritrix is also used mainly by the web archiving community, whereas Nutch and StormCrawler serve a wider range of use cases (e.g. indexing, scraping) and have more resources for data extraction.

I am not familiar with the two hosted services you mentioned, as I only use open source software.

+3
