Saving URLs During Spidering

I created a small web spider in Python that I use to collect URLs. I'm not interested in the content. Right now I keep all visited URLs in a set in memory, because I don't want my spider to visit any URL twice. Of course, this is a very limited way of achieving that.
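For reference, here is a minimal sketch of what that in-memory approach looks like (the function name is made up for illustration, it is not code from my spider):

```python
# Minimal sketch of the current approach: an in-memory set of visited URLs.
visited = set()

def should_visit(url):
    """Return True the first time a URL is seen, False afterwards."""
    if url in visited:
        return False
    visited.add(url)
    return True

print(should_visit("http://example.com/"))  # True  -> crawl it
print(should_visit("http://example.com/"))  # False -> already visited
```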

So what is the best way to track my visited URLs?

Should I use a database?

  • Which one? MySQL, SQLite, PostgreSQL?
  • How should I store the URLs? As a primary key, trying to insert every URL before visiting it?

Or should they be written to a file?

  • A single file?
  • Multiple files? How should I structure the files?

I'm sure there are books and papers on this or similar topics. Can you give me some hints on what I should read?

+7
python url database web-crawler storage
6 answers

These seem to be the important aspects to me:

  • You cannot keep all the URLs in memory, as the RAM usage would be too high.
  • You need fast lookups, at least O(log n).
  • You need fast inserts.

There are many ways to do this, and it depends on how big your database will get. I think an SQL database could be a good model for your problem.

A SQLite database is probably all you need. Searching strings to check for existence is normally a slow operation. To speed this up you can compute a CRC hash of the URL and store both the CRC and the URL in your database, with an index on the CRC column.

  • When you insert: you insert the URL and its hash.
  • When you want to check for an existing URL: you take the CRC of the potentially new URL and check whether it is already in your database.

Of course, there is a chance of collisions between URL hashes, but if 100% coverage is not important to you, you can accept the hit of missing a URL from your crawl when a collision occurs.

You can also reduce collisions in several ways, e.g. increase the size of your CRC (CRC8 instead of CRC4) or use a larger hash algorithm, or combine the CRC with the URL length.
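A minimal sketch of that scheme with Python's built-in sqlite3 and zlib modules; the table and column names are made up for illustration:

```python
import sqlite3
import zlib

# Store each URL together with its CRC32 and index the CRC column,
# as described above. Schema names are illustrative only.
conn = sqlite3.connect("urls.db")
conn.execute("CREATE TABLE IF NOT EXISTS urls (crc INTEGER, url TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_crc ON urls (crc)")

def seen(url):
    """Return True if the URL's CRC is already in the database."""
    crc = zlib.crc32(url.encode("utf-8"))
    row = conn.execute("SELECT 1 FROM urls WHERE crc = ? LIMIT 1", (crc,)).fetchone()
    return row is not None

def record(url):
    """Insert the URL and its CRC."""
    crc = zlib.crc32(url.encode("utf-8"))
    conn.execute("INSERT INTO urls (crc, url) VALUES (?, ?)", (crc, url))
    conn.commit()

if not seen("http://example.com/"):
    record("http://example.com/")
```

Note that, as described, a collision makes `seen` report a new URL as already visited; comparing the stored URL as well would remove that false positive at the cost of a slower check.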

+7

I've written a lot of spiders. For me, a bigger problem than running out of memory is the potential of losing all the URLs you have already spidered if the code or the machine crashes, or you decide you need to tweak the code. If you run out of RAM, most machines and OSes these days will page, so you will slow down but still function. Having to rebuild a set of URLs gathered over hours and hours of runtime because it is no longer available can be a real hit to productivity.

Keeping information in RAM that you do NOT want to lose is bad. Obviously a database is the way to go at that point, because you need fast random access to see whether you have already found a URL. Of course in-memory lookups are faster, but the trade-off of figuring out which URLs to keep in memory adds overhead. Rather than trying to write code to determine which URLs I do and don't need, I keep them in the database and concentrate on making my code clean and maintainable and my SQL queries and schemas sensible. Make the URL field a unique index and the DBM will be able to find them quickly while automatically avoiding redundant links.
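A sketch of the "unique index on the URL field" idea; the answer's preferred database is PostgreSQL (where `INSERT ... ON CONFLICT DO NOTHING` plays the same role), but SQLite is used here only to keep the example self-contained:

```python
import sqlite3

# A UNIQUE/PRIMARY KEY constraint on the URL lets the database itself
# reject duplicates, so the spider never revisits a URL.
conn = sqlite3.connect("visited.db")
conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def mark_visited(url):
    """Return True if the URL was new, False if it was already recorded."""
    cur = conn.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1   # 0 rows inserted means the URL was a duplicate

print(mark_visited("http://example.com/"))  # True  -> new URL, crawl it
print(mark_visited("http://example.com/"))  # False -> already visited
```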

Your connection to the internet and the sites you access will probably be much slower than your connection to a database on a machine on your internal network. A SQLite database on the same machine might be the fastest, though the DBM itself is not as sophisticated as Postgres, which is my favorite. I have found that putting the database on another machine on the same switch as my spidering machine is extremely fast; having a single machine handle the spidering, parsing, and then the database reads/writes is pretty intense, so if you have an old box, throw Linux on it, install Postgres, and go to town. Throw extra RAM in the box if you need more speed. Having that separate box for the database can be very nice.

+9

It depends on the scale of the spidering you are going to do, and on the machine you are doing it on. Suppose a typical URL is a string of 60 bytes or so; an in-memory set will take a little over 100 bytes per URL (sets and dicts in Python never let themselves grow beyond about 60% full, for speed reasons). If you have a 64-bit machine (and Python distribution) with about 16 GB of RAM available, you could devote well over 10 GB to the crucial set, letting you easily spider about 100 million URLs or so; but on the other extreme, if you have a 32-bit machine with 3 GB of RAM, you clearly cannot devote much more than a GB to the set, limiting you to about 10 million URLs. SQLite helps in roughly the size range where a 32-bit machine could not make it but a generously endowed 64-bit one could, say 100 or 200 million URLs.
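The arithmetic behind those limits, spelled out:

```python
# Back-of-envelope estimate: ~100 bytes per URL in a Python set
# (the answer's figure for a ~60-byte URL including set overhead).
bytes_per_url = 100
print(10 * 2**30 // bytes_per_url)   # ~10 GB of set -> roughly 100 million URLs
print(1 * 2**30 // bytes_per_url)    #  ~1 GB of set -> roughly 10 million URLs
```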

Beyond that, I would recommend PostgreSQL, which also has the advantage of running well on another machine (on a fast LAN) with basically no problems, letting you devote your main machine to the spidering itself. I imagine MySQL and the like would be fine too, but I love PostgreSQL's standards compliance and robustness ;-). This would allow, say, a few billion URLs without problems (just a fast disk, or better a RAID arrangement, and as much RAM as you can afford for speedup, of course).

Trying to save memory/storage by using fixed-length hashes instead of URLs, which can be quite long, is fine if you are OK with an occasional false positive that will stop you from crawling what is actually a new URL. Such "collisions" need not be at all likely: even if you use only 8 bytes for the hash, you should only have a substantial risk of a collision when you are looking at billions of URLs (the "square root heuristic" for this well-known problem).
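A quick check of that claim with the birthday bound (the expected number of collisions is about n²/2m for n URLs hashed into m possible values):

```python
# "Square root heuristic": with a 64-bit hash, collisions only become likely
# around sqrt(2**64) = 2**32, i.e. roughly 4 billion URLs.
n = 1_000_000_000                       # one billion URLs
m = 2**64                               # distinct 8-byte hash values
print(n * (n - 1) / (2 * m))            # ~0.027 expected collisions
```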

With 8-byte strings representing the URLs, an in-memory set architecture should easily support a billion URLs or more on a well-endowed machine as described above.
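A minimal sketch of that idea, keeping only a truncated digest of each URL in the set (the choice of MD5 here is just illustrative; any reasonably uniform hash would do):

```python
import hashlib

# Keep an 8-byte digest per URL instead of the full URL string.
seen = set()

def visit_once(url):
    """Return True the first time a URL (by hash) is seen."""
    digest = hashlib.md5(url.encode("utf-8")).digest()[:8]  # 8-byte key
    if digest in seen:
        return False        # already visited (or a rare hash collision)
    seen.add(digest)
    return True

print(visit_once("http://example.com/a"))  # True
print(visit_once("http://example.com/a"))  # False
```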

So, roughly how many URLs do you want to spider, and how much RAM can you spare? ;-)

+4

Are you just storing the URLs? You should take a look at MongoDB. It is a NoSQL database that is pretty easy to get going with.

http://try.mongodb.org/

It also has Python bindings:

http://api.mongodb.org/python/1.5.2%2B/index.html
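A sketch of how this could look with pymongo; note that the linked docs are for an older pymongo release, while the calls below use the current API, and the database/collection names are made up for illustration:

```python
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

client = MongoClient()                 # assumes a local mongod instance
coll = client.spider.visited           # illustrative database/collection names
coll.create_index("url", unique=True)  # let MongoDB reject duplicate URLs

def mark_visited(url):
    """Return True if the URL was new, False if it was already stored."""
    try:
        coll.insert_one({"url": url})
        return True
    except DuplicateKeyError:
        return False
```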

+2

Since you are likely to see similar URLs close together in time (e.g. while spidering a website you will see many links to that website's main page), I would advise you to keep the URLs in a dictionary until your memory becomes limited (just hard-code a reasonable number, like 10M URLs or so), and then flush the dictionary to a CDB database file when it grows too big.

That way most of your URL checks are in memory (which is fast), while the ones that are not in memory still require only 1-2 reads from disk to verify whether you have already visited them.
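A sketch of that hybrid approach; the answer suggests CDB, but the standard-library dbm module is used here purely as a readily available stand-in for the on-disk file:

```python
import dbm

MEMORY_LIMIT = 10_000_000            # e.g. ~10M URLs in memory before flushing
recent = set()                       # in-memory batch of recently seen URLs
disk = dbm.open("visited_urls", "c") # on-disk key-value store (CDB in the answer)

def seen(url):
    """Return True if the URL was seen before, in memory or on disk."""
    key = url.encode("utf-8")
    if url in recent or key in disk:
        return True
    recent.add(url)
    if len(recent) >= MEMORY_LIMIT:  # memory batch full: flush it to disk
        for u in recent:
            disk[u.encode("utf-8")] = b"1"
        recent.clear()
    return False
```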

+1

Consider pickling for now: simple structured storage.

Your mileage will vary of course, because, as the other responders said, you will quickly run out of RAM.
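A minimal sketch of pickling the visited set so it survives restarts (the file name is illustrative):

```python
import pickle

# Load the previously saved set of visited URLs, if any.
try:
    with open("visited.pkl", "rb") as f:
        visited = pickle.load(f)
except FileNotFoundError:
    visited = set()

visited.add("http://example.com/")

# Save the set back to disk before exiting.
with open("visited.pkl", "wb") as f:
    pickle.dump(visited, f)
```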

0
