It depends on the scale of the spidering you are going to do, and on the kind of machine you are doing it on. Suppose a typical URL is a string of 60 bytes or so: an in-memory set will then take a bit more than 100 bytes per URL (sets and dicts in Python are never allowed to grow beyond about 60% full, for speed reasons). If you have a 64-bit machine (and a 64-bit Python distribution) with about 16 GB of RAM available, you could surely devote more than 10 GB to the crucial set, letting you easily spider about 100 million URLs or so. At the other extreme, if you have a 32-bit machine with 3 GB of RAM, you clearly cannot devote much more than a GB to the crucial set, limiting you to about 10 million URLs. SQLite would help in roughly the size range where a 32-bit machine could not quite cope but a generously equipped 64-bit one could: say, 100 or 200 million URLs.
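As a rough illustration of the SQLite option, here is a minimal sketch of a "seen URLs" store using Python's standard `sqlite3` module. The table name `seen` and the helper names `open_seen_db` and `mark_seen` are hypothetical, chosen only for this example; the primary key on the URL column is what enforces uniqueness.

```python
import sqlite3


def open_seen_db(path=":memory:"):
    """Open (or create) the store; pass a file path to keep it on disk."""
    conn = sqlite3.connect(path)
    # The PRIMARY KEY gives an index and rejects duplicate URLs.
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
    return conn


def mark_seen(conn, url):
    """Record url; return True if it was new, False if already present."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO seen (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:
        # Raised when the URL violates the PRIMARY KEY constraint.
        return False


conn = open_seen_db()
print(mark_seen(conn, "http://example.com/"))  # first time: new
print(mark_seen(conn, "http://example.com/"))  # second time: duplicate
```

This sketch commits once per URL, which is simple but slow for an on-disk database; a real crawler would probably batch many inserts into one transaction.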
Beyond that, I would recommend PostgreSQL, which also has the advantage of being able to run on a different machine (on a fast LAN) with basically no problems, letting you devote your main machine to spidering. I guess MySQL and the like would be fine for that too, but I love PostgreSQL's standards compliance and robustness ;-). This would allow, say, a few billion URLs without problems (you just need a fast disk or, better, a RAID arrangement, and as much RAM as you can afford to speed things up, of course).
Trying to save memory/storage by using a fixed-length hash in place of URLs that might be quite long is fine if you are OK with an occasional false positive that will stop you from crawling what is actually a new URL. Such "collisions" need not be at all likely: even if you use only 8 bytes for the hash, you should only run a substantial risk of a collision when you are looking at billions of URLs (the "square root heuristic" for this well-known problem).
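To put numbers on the square-root heuristic, the sketch below uses the standard birthday-problem approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^bits)). The function name `collision_probability` is hypothetical, and the estimate assumes the hash spreads URLs uniformly over all 2^bits values.

```python
import math


def collision_probability(n_urls, hash_bits=64):
    """Approximate chance that at least two of n_urls share a hash value.

    Birthday-problem approximation; assumes a uniformly distributed hash.
    """
    pairs = n_urls * (n_urls - 1) / 2   # number of URL pairs
    space = 2.0 ** hash_bits            # number of possible hash values
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-pairs / space)


for n in (10**6, 10**8, 10**9, 2**32):
    print(f"{n:>13,d} URLs: {collision_probability(n):.3g}")
```

Under this approximation, an 8-byte hash gives roughly a 3% chance of at least one collision at a billion URLs and close to 40% at 2**32 (about 4.3 billion), which matches the "billions of URLs" threshold.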
With 8-byte strings representing the URLs, the in-memory set architecture should easily support a billion URLs or more on a well-equipped machine like the one outlined above.
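A minimal sketch of that in-memory set of 8-byte keys might look like the following. The class name `SeenUrls` and the helper `url_key` are hypothetical, and `hashlib.blake2b` with `digest_size=8` is just one convenient way to get an 8-byte digest; any well-distributed hash truncated to 8 bytes would serve.

```python
import hashlib


def url_key(url):
    """Return an 8-byte digest standing in for the (possibly long) URL."""
    return hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()


class SeenUrls:
    """Set of 8-byte URL digests; a hash collision reads as 'already seen'."""

    def __init__(self):
        self._keys = set()

    def add(self, url):
        """Record url; return True if it looked new, False otherwise."""
        key = url_key(url)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True

    def __len__(self):
        return len(self._keys)


seen = SeenUrls()
for url in ("http://example.com/", "http://example.com/a", "http://example.com/"):
    print(url, "new" if seen.add(url) else "already seen")
```

Note that the set never stores the URL itself, so it cannot tell a genuine repeat from a collision; that is the false-positive trade-off described above.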
So, about how many URLs do you want to spider, and how much RAM can you spare? ;-)
Alex Martelli