What is the right technology for large-scale scraping / parsing?

We are building a large-scale web scraping / parsing project. Basically, a script needs to go through a list of web pages, extract the contents of a specific tag, and store it in a database. What language would you recommend for doing this at large scale (tens of millions of pages)?

We use MongoDB for the database, so anything with solid MongoDB drivers is a plus.

So far we have used (don't laugh) PHP, cURL, and Simple HTML DOM Parser, but I don't think that scales to millions of pages, especially since PHP has no proper multithreading.

We need something that is easy to develop in, runs on a Linux server, has a robust HTML / DOM parser for easily extracting the tag, and can load millions of web pages in a reasonable amount of time. We are not looking for a web crawler, because we don't need to follow links and index all the content; we just need to extract one tag from each page in the list.

+7
Tags: parsing, screen-scraping
4 answers

If you are really talking about large scale, then you probably want something that lets you scale horizontally, such as a Map-Reduce framework like Hadoop. You can write Hadoop jobs in several languages, so you are not tied to Java. Here, for example, is an article about writing Hadoop jobs in Python. Incidentally, Python is probably the language I would use anyway, thanks to libraries such as httplib2 for making the requests and lxml for parsing the results.
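A minimal sketch of that httplib2 + lxml combination, where the URL and the target tag (an h1 here) are placeholders for illustration, not details from the question:

```python
import httplib2
from lxml import html

http = httplib2.Http(timeout=30)
response, content = http.request("http://example.com/page", "GET")

if response.status == 200:
    tree = html.fromstring(content)
    # Adjust the XPath to whatever tag you actually need.
    matches = tree.xpath("//h1/text()")
    if matches:
        print(matches[0].strip())
```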

If a Map-Reduce framework is overkill, you can keep everything in Python and use multiprocessing.
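A sketch of that approach, fanning a hypothetical fetch_tag worker out over the URL list with a process pool (the URLs, pool size, and tag are placeholders):

```python
from multiprocessing import Pool

import httplib2
from lxml import html

def fetch_tag(url):
    # Fetch one page and return the text of its first <h1>, or None.
    try:
        _, content = httplib2.Http(timeout=30).request(url, "GET")
        matches = html.fromstring(content).xpath("//h1/text()")
        return url, (matches[0].strip() if matches else None)
    except Exception:
        return url, None

if __name__ == "__main__":
    urls = ["http://example.com/a", "http://example.com/b"]  # your page list
    with Pool(processes=8) as pool:
        for url, text in pool.imap_unordered(fetch_tag, urls):
            print(url, text)
```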

UPDATE: If you do not need a MapReduce framework and prefer a different language, take a look at ThreadPoolExecutor in Java. However, I would definitely use the Apache Commons HTTP client rather than the HTTP machinery in the JDK itself, which is far less pleasant to program against.
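The answer names Java's ThreadPoolExecutor; for anyone staying in Python, concurrent.futures offers the same thread-pool pattern, and fetching is I/O-bound enough that threads help despite the GIL. A minimal sketch of that Python analogue (URLs, worker count, and tag are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import httplib2
from lxml import html

def fetch_tag(url):
    # I/O-bound work, so threads are fine despite the GIL.
    _, content = httplib2.Http(timeout=30).request(url, "GET")
    matches = html.fromstring(content).xpath("//h1/text()")
    return url, (matches[0].strip() if matches else None)

urls = ["http://example.com/a", "http://example.com/b"]  # your page list
with ThreadPoolExecutor(max_workers=32) as executor:
    futures = [executor.submit(fetch_tag, u) for u in urls]
    for future in as_completed(futures):
        print(*future.result())
```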

+7

You should probably use the tools used to test web applications (WatiN or Selenium).

You can then compose your workflow, keeping it separate from the data, using a tool I wrote:

https://github.com/leblancmeneses/RobustHaven.IntegrationTests

You do not need to parse manually when using WatiN or Selenium. Instead, you write a CSS selector (querySelector) for the tag you want.
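For example, with Selenium's current Python bindings (the answer itself is .NET-flavored; the URL and selector here are placeholders), the extraction is a single CSS query:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # any WebDriver-backed browser works
try:
    driver.get("http://example.com/page")
    # The CSS selector plays the querySelector role mentioned above.
    element = driver.find_element(By.CSS_SELECTOR, "h1.title")
    print(element.text)
finally:
    driver.quit()
```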

Using TopShelf and NServiceBus, you can scale the number of workers horizontally.

FYI: with Mono, the tools I mention can run on Linux (though your mileage may vary).

If JavaScript does not need to be evaluated to load the data dynamically: anything that requires loading the whole document into memory is wasted time. If you know where your tag sits, a SAX parser is all you need.
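A minimal streaming sketch of that idea: Python's stdlib html.parser plays the SAX role for HTML, emitting events without ever building a document tree (the tag name and input are placeholders):

```python
from html.parser import HTMLParser

class TagGrabber(HTMLParser):
    """Capture the text inside one tag, event by event, with no DOM."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.captured = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.captured.append(data)

grabber = TagGrabber("h1")
grabber.feed("<html><body><h1>Hello, world</h1></body></html>")
print("".join(grabber.captured))  # -> Hello, world
```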

+3

I am doing something similar in Java with the Apache Commons HttpClient library. I avoid a DOM parser, though, because I am looking for a specific tag that can easily be found with a regular expression.
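In Python for brevity (the answer itself uses Java), the regex approach for one known tag is just a compiled pattern; it is fragile against arbitrary HTML, which is the trade-off this answer accepts:

```python
import re

# Hypothetical target tag: a case-insensitive, non-greedy match across lines.
TAG_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)

def extract_tag(page_html):
    match = TAG_RE.search(page_html)
    return match.group(1).strip() if match else None

print(extract_tag("<html><h1 class='t'>Hello</h1></html>"))  # -> Hello
```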

The slowest part of the operation is making HTTP requests.

+1

How about C++? There are many great libraries that can help you.

Boost.Asio can help you with the networking.

TinyXML can parse XML files.

I have no idea about the database side, but almost every database has a C++ interface, so that is not a problem.

0
