We are developing a large-scale web search / parsing project. Basically, a script should go through a list of web pages, extract the contents of a specific tag, and store it in a database. What language would you recommend for doing this at scale (tens of millions of pages)?
We use MongoDB for the database, so anything with solid MongoDB drivers is a plus.
So far we have used (don't laugh) PHP, curl, and Simple HTML DOM Parser, but I don't think it scales to millions of pages, especially since PHP does not have proper multithreading.
We need something that is easy to develop with, runs on a Linux server, has a robust HTML / DOM parser to easily extract this tag, and can load millions of web pages in a reasonable amount of time. We are not looking for a web crawler, because we don't need to follow links and index all content; we just need to extract one tag from each page in the list.
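To illustrate the kind of extraction meant here, below is a minimal sketch in Python using only the standard library's `html.parser` (the tag name `title` and the sample HTML are just placeholders; a real job would feed in each fetched page and combine this with a concurrent fetcher and a MongoDB insert):

```python
from html.parser import HTMLParser


class TagExtractor(HTMLParser):
    """Collects the text content of every occurrence of one target tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self._inside = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._inside = True
            self.matches.append("")  # start a new match

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.matches[-1] += data  # accumulate text inside the tag


def extract_tag(html, tag):
    """Return the text content of every <tag>…</tag> in the document."""
    parser = TagExtractor(tag)
    parser.feed(html)
    return parser.matches


# Hypothetical fetched page, standing in for one URL from the list
page = "<html><head><title>Example Page</title></head><body>...</body></html>"
print(extract_tag(page, "title"))  # ['Example Page']
```

Since only one tag per page is needed, even a strict, lightweight parser like this is enough; a tolerant library (e.g. lxml or BeautifulSoup) would be the safer choice for messy real-world HTML.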
parsing screen-scraping
Jonathan Knight