What is a good web crawler for extracting and analyzing data from thousands of websites?

I'm trying to crawl about a thousand websites, and I'm only interested in their HTML content.

I then convert the HTML to XML so that I can use XPath to extract the content that interests me.
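For readers unfamiliar with this HTML-to-XML-to-XPath step, here is a minimal sketch using only the Python standard library. It is not what the asker ran; the class `HtmlToTree`, the helper `extract`, and the sample markup are all hypothetical names made up for illustration. It assumes fairly tidy HTML: `xml.etree.ElementTree` supports only a subset of XPath, and real-world tag soup is usually better handled by a dedicated parser such as lxml.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlToTree(HTMLParser):
    """Hypothetical helper: turns HTML into an ElementTree (XML-like) tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements that are still open

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # void elements are closed immediately
        else:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any unclosed children first, then the tag itself.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def close_tree(self):
        self.close()
        while self.open_tags:  # close whatever the page left open
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def extract(html, xpath):
    """Return the text of every element matching an ElementTree XPath."""
    parser = HtmlToTree()
    parser.feed(html)
    root = parser.close_tree()
    return ["".join(el.itertext()).strip() for el in root.findall(xpath)]


# Made-up sample page with an unclosed <p> and a void <br>.
SAMPLE = ("<html><body><div class='article'><h1>Title</h1>"
          "<p>First<br>line</p><p>Unclosed</div></body></html>")

titles = extract(SAMPLE, ".//div[@class='article']/h1")
paragraphs = extract(SAMPLE, ".//div[@class='article']/p")
print(titles)      # ['Title']
print(paragraphs)  # ['Firstline', 'Unclosed']
```

The sketch does not apply HTML's implicit-closing rules (for example, a new `<p>` does not close the previous one), which is one reason a full HTML parser is the safer choice at the scale of a thousand sites.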

I have been using the Heritrix 2.0 crawler for several months, but I have run into huge performance, memory, and stability problems (Heritrix crashes every day, and none of my attempts to limit memory usage with JVM parameters has worked).

From your experience, which crawler would you use to extract and analyze content from thousands of sources?

+4
3 answers

I would not use the 2.x branch (which has been discontinued) or 3.x (the current development branch) for any "serious" crawling unless you want to help improve Heritrix or you simply like living on the bleeding edge.

Heritrix 1.14.3 is the latest stable release, and it really is stable: many institutions use it for both small and large-scale crawls. I use it to run crawls against tens of thousands of domains, collecting tens of millions of URLs over the course of a week.

The 3.x branch is approaching a stable release, but even then I would wait a bit, until general use at the Internet Archive and elsewhere has improved its performance and stability.

Update: Since someone upvoted this recently, I feel it is worth noting that Heritrix 3.x is now stable and is the recommended version for anyone starting out with Heritrix.

+3

I would suggest writing your own using Python with Scrapy and lxml or BeautifulSoup. You should find some good tutorials for them on Google. I use Scrapy + lxml at work to check about 600 websites for broken links.
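To give a feel for what a broken-link check involves, here is a rough standard-library sketch. It is not Scrapy and not the setup described above; it only shows the shape of the task for a single page. The names `LinkCollector`, `http_status`, and `find_broken_links` are hypothetical, and treating any status of 400 or above (or a failed request) as "broken" is an assumption.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urldefrag, urljoin
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def http_status(url, timeout=10):
    """Return the HTTP status for url, or None if the request fails outright."""
    request = Request(url, method="HEAD",
                      headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:  # 4xx/5xx responses arrive as exceptions
        return err.code
    except (URLError, OSError):  # DNS failure, refused connection, timeout
        return None


def find_broken_links(page_url, html, status_of=http_status):
    """Return (url, status) pairs for links that look broken.

    status_of is injectable so the logic can be tried without network access.
    """
    collector = LinkCollector()
    collector.feed(html)
    broken, seen = [], set()
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments.
        absolute = urldefrag(urljoin(page_url, href)).url
        if not absolute.startswith(("http://", "https://")) or absolute in seen:
            continue  # skip mailto:, javascript:, and duplicates
        seen.add(absolute)
        status = status_of(absolute)
        if status is None or status >= 400:  # assumption: these count as broken
            broken.append((absolute, status))
    return broken
```

Calling `find_broken_links(url, html)` with the default `status_of` issues one HEAD request per distinct link. Some servers reject HEAD, so a real checker would fall back to GET, and it would also need politeness delays and robots.txt handling, which a framework like Scrapy provides.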

+3

Wow. Modern crawlers, such as those used by search engines, crawl and index 1 million URLs per day on a single box. Of course, the HTML-to-XML conversion step takes a bit of time, but I agree with you about the performance. I have only used private crawlers, so I cannot recommend one for you to use, but I hope these performance figures will help in your evaluation.
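To put the 1 million URLs per day figure in perspective, a quick back-of-the-envelope calculation converts it into a sustained fetch rate. The function names are made up, and the 2-second average fetch time in the usage note below is purely an assumption for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_fetch_rate(urls_per_day):
    """Sustained pages per second needed to finish urls_per_day in 24 hours."""
    return urls_per_day / SECONDS_PER_DAY


def concurrent_fetches_needed(urls_per_day, seconds_per_fetch):
    """Average number of in-flight requests (rate x latency, Little's law)."""
    return required_fetch_rate(urls_per_day) * seconds_per_fetch


rate = required_fetch_rate(1_000_000)
print(f"{rate:.1f} pages/second")  # 11.6 pages/second
```

With an assumed average of 2 seconds per fetch, `concurrent_fetches_needed(1_000_000, 2.0)` comes to roughly 23 requests in flight at any moment, which is well within reach of a single machine.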

0
