When scraping a lot of statistics from a web page, how often should I insert the collected results into my database?

I am scraping a website (responsibly: rate-limiting my requests, and with the site's permission), and I'm going to collect statistics on 300,000 users.

I plan to store this data in an SQL database, and I plan to scrape it once a week. My question is: how often should I do inserts into the database as the results come in from the scraper?

Is it best to wait until all the results are in (keeping them all in memory) and insert them all once the scrape is finished? Or is it better to insert each result as it arrives (they arrive at a decent rate)? Or something in between?

If someone could point me in the right direction on how often / when I should do this, I would appreciate it.

Also, would the answer change if I saved these results to a flat file first rather than straight into the database?

Thank you for your time!

2 answers

You will get better performance by batching several hundred rows per insert, if your database supports inserting multiple rows in a single query (both MySQL and PostgreSQL do). You will probably also get better performance by batching several inserts into a single transaction (except on non-transactional engines, such as MySQL with MyISAM).
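
For concreteness, here is a minimal sketch of both ideas using Python's built-in sqlite3 module (a MySQL or PostgreSQL driver would look much the same); the table, columns, and sample rows are invented for the example:

```python
import sqlite3

conn = sqlite3.connect("stats.db")
conn.execute("CREATE TABLE IF NOT EXISTS user_stats (user_id INTEGER, score INTEGER)")

rows = [(1, 42), (2, 17), (3, 99)]  # pretend these came from the scraper

# Option 1: one multi-row INSERT, i.e. a single statement carrying many rows.
placeholders = ", ".join(["(?, ?)"] * len(rows))
flat_params = [value for row in rows for value in row]
conn.execute(f"INSERT INTO user_stats (user_id, score) VALUES {placeholders}", flat_params)
conn.commit()

# Option 2: many single-row inserts, but all inside one transaction.
with conn:  # the with-block commits once at the end (or rolls back on error)
    conn.executemany("INSERT INTO user_stats (user_id, score) VALUES (?, ?)", rows)
```

Either way you pay the per-statement / per-commit overhead far fewer times than with one autocommitted INSERT per row.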

The benefits of batching fall off quickly as the batch size grows; you have already cut the per-query / per-commit overhead by 99% by the time you are doing 100 rows at a time. Push it much further and you start hitting various limits (for example, the maximum allowed query size).

You will also run into the other big trade-off: if your program dies, you lose any work you have not yet saved to the database. Losing 100 rows is not so bad; you can probably redo that work in a minute or two. Losing 300,000 would take quite a while to redo.

Summary: Personally, I would start with one value per query, since that is the easiest to implement. If I found that insert time was a bottleneck (I doubt it very much; the scraping will be far slower), I would move to 100 values per query.

PS: Since the site's admin has given you permission, have you asked whether you could just get a database dump of the relevant data? That would save a lot of work!


My preference, when I have to do it the way you describe, is to write the data to the database in batches of 1000 rows. That seems like a good amount: not too much to redo if there is a crash and some data has to be regenerated (re-scraped, in your case), but a healthy enough bite to cut down the overhead.

As @derobert points out, wrapping a bunch of inserts in a transaction also helps reduce overhead. But don't put everything into one transaction: some DBMS brands, such as Oracle, maintain a "redo log" for the duration of the transaction, so doing too much work in a single transaction can make it grow excessively. Splitting the work into large, but not too large, chunks is better. Hence the 1000 rows.
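
Here is a rough sketch of that pattern in Python, again using the stdlib sqlite3 module; scrape_results() is a hypothetical generator standing in for your scraper, and the table name is invented:

```python
import sqlite3

BATCH_SIZE = 1000  # the "good healthy bite" discussed above

conn = sqlite3.connect("stats.db")
conn.execute("CREATE TABLE IF NOT EXISTS user_stats (user_id INTEGER, score INTEGER)")

buffer = []
for row in scrape_results():  # hypothetical generator yielding (user_id, score) tuples
    buffer.append(row)
    if len(buffer) >= BATCH_SIZE:
        with conn:  # one transaction per 1000-row chunk
            conn.executemany("INSERT INTO user_stats (user_id, score) VALUES (?, ?)", buffer)
        buffer.clear()

# Flush whatever is left over when the scrape finishes.
if buffer:
    with conn:
        conn.executemany("INSERT INTO user_stats (user_id, score) VALUES (?, ?)", buffer)
```

A crash in the middle costs you at most the last (unflushed) 1000 rows of scraping.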

Some SQL implementations support multi-row INSERT (@derobert also mentions this), but some do not.

You are right that scraping the raw data to a flat file and then loading it afterwards is probably worthwhile. Each SQL vendor supports this kind of bulk load in its own way, such as LOAD DATA INFILE in MySQL or ".import" in SQLite, etc. You would need to tell us which brand of SQL database you are using to get more specific advice, but in my experience such a mechanism can give 10-20x the performance of plain INSERTs, even after improvements like using transactions and multi-row inserts.
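
For example, the flat-file route could look something like this (Python again; scrape_results() and the file/table names are placeholders for the sake of the sketch):

```python
import csv

# Append every scraped result to a CSV as it arrives...
with open("user_stats.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for user_id, score in scrape_results():  # hypothetical generator of scraped rows
        writer.writerow([user_id, score])

# ...then bulk-load the whole file in one shot, e.g.:
#   MySQL:  LOAD DATA LOCAL INFILE 'user_stats.csv'
#           INTO TABLE user_stats FIELDS TERMINATED BY ',';
#   SQLite: sqlite3 stats.db ".mode csv" ".import user_stats.csv user_stats"
```

As a bonus, the CSV doubles as a cheap archive of each week's scrape.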


Re your comment: you could take a look at BULK INSERT in Microsoft SQL Server. I don't normally use Microsoft products, so I don't have first-hand experience with it, but I imagine it would be a useful tool in your scenario.
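
Roughly, and purely as an untested sketch (the connection details, file path, and table name are placeholders; the pyodbc driver is just one common way to talk to SQL Server from Python):

```python
import pyodbc  # assumes an ODBC driver for SQL Server is installed

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()

# BULK INSERT reads a file that is visible to the SQL Server machine itself,
# not to the client running this script.
cursor.execute("""
    BULK INSERT dbo.user_stats
    FROM 'C:\\data\\user_stats.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')
""")
conn.commit()
```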

