If you want a detailed answer, take a look at section 3.8 of this document, which describes the URL-seen test of a modern crawler:
In the course of extracting links, any web crawler will encounter multiple links to the same document. To avoid downloading and processing a document several times, a URL-seen test must be performed on each extracted link before it is added to the URL frontier. (An alternative design would be to perform the URL-seen test when the URL is removed from the frontier instead, but this approach would result in a much larger frontier.)
To perform the URL-seen test, we store all of the URLs seen by Mercator in canonical form in a large table called the URL set. Again, there are too many entries for them all to fit in memory, so, like the document fingerprint set, the URL set is stored mostly on disk.
To save space, we do not store the textual representation of each URL in the URL set, but rather a fixed-size checksum. Unlike the fingerprints presented to the content-seen test's document fingerprint set, the stream of URLs tested against the URL set has a non-trivial amount of locality. To reduce the number of operations on the backing disk file, we therefore keep an in-memory cache of popular URLs. The intuition for this cache is that links to some URLs are quite common, so caching the popular ones in memory will lead to a high in-memory hit rate.
In fact, using an in-memory cache of 2^18 entries and an LRU-like clock replacement policy, we achieve an overall hit rate on the in-memory cache of 66.2%, plus a hit rate of 9.5% on the table of recently added URLs, for a net hit rate of 75.7%. Moreover, of the 24.3% of requests that miss both the cache of popular URLs and the table of recently added URLs, about 1/3 produce hits on the buffer in our random access file implementation, which also resides in user space. The net result of all this buffering is that each membership test we perform on the URL set results in an average of 0.16 seek and 0.17 read kernel calls (some fraction of which are served out of the kernel's file system buffers). So each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set. These savings are due purely to the amount of URL locality (i.e., repetition of popular URLs) inherent in the stream of URLs encountered during a crawl.
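As a rough illustration of this layered lookup (a minimal sketch, not Mercator's actual implementation), here is one way to put an in-memory cache of popular URL checksums in front of a small "recently added" buffer and a disk-backed set. The class name, the sizes, and the `dbm` file are stand-ins I made up for the example, and plain LRU is used instead of Mercator's clock policy for brevity:

```python
import dbm
from collections import OrderedDict

class UrlSeenTest:
    """Layered URL-seen test: LRU cache of popular checksums in front of
    a 'recently added' buffer and a disk-backed set (here a dbm file)."""

    def __init__(self, db_path="urlseen.db", cache_size=2**18, buffer_size=10_000):
        self.cache = OrderedDict()          # popular-URL cache (LRU eviction; Mercator uses clock)
        self.cache_size = cache_size
        self.recent = set()                 # recently added URLs, not yet flushed to disk
        self.buffer_size = buffer_size
        self.disk = dbm.open(db_path, "c")  # stand-in for the backing disk file

    def seen(self, checksum: bytes) -> bool:
        """Return True if the checksum was seen before; record it either way."""
        if checksum in self.cache:
            self.cache.move_to_end(checksum)        # cache hit
            return True
        if checksum in self.recent or checksum in self.disk:
            self._cache(checksum)                   # promote to the in-memory cache
            return True
        self.recent.add(checksum)                   # first sighting
        if len(self.recent) >= self.buffer_size:
            self._flush()
        self._cache(checksum)
        return False

    def _cache(self, checksum: bytes) -> None:
        self.cache[checksum] = True
        self.cache.move_to_end(checksum)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)          # evict least recently used

    def _flush(self) -> None:
        for c in self.recent:                       # merge the buffer into the disk set
            self.disk[c] = b"1"
        self.recent.clear()
```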
Essentially, they hash all URLs with a hash function that gives an effectively unique value for each URL, and thanks to the locality of URLs it becomes very fast to look a URL up. Google even open-sourced its hash function: CityHash.
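For example, a canonicalized URL can be reduced to a fixed-size checksum before it is stored. The snippet below is only illustrative: it uses Python's standard hashlib so it runs as-is (CityHash itself would come from a third-party binding), and the canonicalization rules are assumptions, not the exact ones Mercator or Google use:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Very rough canonical form: lowercase scheme/host, strip the default
    port and the fragment, use '/' for an empty path."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower().removesuffix(":80")
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))

def url_checksum(url: str) -> bytes:
    """8-byte fixed-size checksum of the canonical URL."""
    return hashlib.blake2b(canonicalize(url).encode("utf-8"), digest_size=8).digest()

# Both spellings map to the same checksum, so the page is fetched only once.
assert url_checksum("HTTP://Example.com:80/a?x=1#top") == url_checksum("http://example.com/a?x=1")
```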
WARNING!
They might also be talking about bot traps!!! A bot trap is a section of a page that keeps generating new links with unique URLs, so by following the links served by that page you essentially end up in an "endless loop". It is not exactly a loop, because a loop would mean visiting the same URL again, but it is an endless chain of URLs that the crawler should avoid.
Update 12/13/2012 - the day after tomorrow the world was supposed to end :)
Comment by Fr0zenFyr: if you use the AOPIC algorithm to select pages, it is fairly easy to avoid bot traps of the infinite-loop kind. Here is a summary of how AOPIC works (a code sketch follows the list below):
1. Get a set of N seed pages.
2. Allocate X amount of credit to each page, so that each page has X/N credit (i.e., an equal amount of credit) before crawling begins.
3. Select the page P with the highest amount of credit (or, if all pages have the same amount of credit, crawl a random page).
4. Crawl page P (let's say P had 100 credits when it was crawled).
5. Extract all the links from page P (let's say there are 10 of them).
6. Set P's credits to 0.
7. Take a 10% "tax" and allocate it to a Lambda page.
8. Allocate an equal amount of credit to each link found on page P out of P's original credit minus the tax: so (100 (P's credits) - 10 (10% tax)) / 10 (links) = 9 credits per link.
9. Repeat from step 3.
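Here is the promised sketch of this credit loop in Python. The `fetch_links` callback, the seed URLs, and the credit numbers are assumptions made for the example, not part of AOPIC itself; the Lambda-page redistribution described in the next paragraph is included so the loop is self-contained:

```python
import random

LAMBDA = "__lambda__"   # virtual page that collects the tax
TAX_RATE = 0.10         # 10% tax per crawled page

def aopic_crawl(seed_urls, fetch_links, total_credit=1000.0, steps=100):
    """Sketch of the AOPIC credit loop summarized in the list above.
    `fetch_links(url)` stands in for "crawl the page and extract its links"."""
    # Steps 1-2: each seed page starts with an equal share X = total/N.
    credits = {url: total_credit / len(seed_urls) for url in seed_urls}
    credits[LAMBDA] = 0.0

    for _ in range(steps):
        # Step 3: pick the page with the most credit (random among ties).
        top = max(credits.values())
        page = random.choice([u for u, c in credits.items() if c == top])

        if page == LAMBDA:
            # The Lambda page is never fetched: its credit is simply
            # redistributed equally over every real page in the database.
            share = credits[LAMBDA] / (len(credits) - 1)
            for url in credits:
                if url != LAMBDA:
                    credits[url] += share
            credits[LAMBDA] = 0.0
            continue

        # Steps 4-5: crawl the page and extract its links.
        links = fetch_links(page)

        # Steps 6-7: zero the page's credit and pay the tax to Lambda.
        credit = credits[page]
        credits[page] = 0.0
        tax = credit * TAX_RATE
        credits[LAMBDA] += tax

        # Step 8: split what is left equally over the extracted links
        # (if a page has no links, the remainder is simply dropped here).
        if links:
            share = (credit - tax) / len(links)
            for link in links:
                credits[link] = credits.get(link, 0.0) + share

    return credits
```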
Since the Lambda page continuously collects the tax, it will eventually be the page with the largest amount of credit and we will have to "crawl" it. I say "crawl" in quotes because we do not actually make an HTTP request for the Lambda page; we just take its credits and distribute them equally across all the pages in our database.
Since bot traps only give out internal links and rarely get credit from the outside, they continually leak credit (via the tax) to the Lambda page. The Lambda page distributes that credit evenly across all pages in the database, and with each cycle the bot-trap page loses more and more credit, until it has so little that it almost never gets crawled again. This does not happen with good pages, because they frequently receive credit from backlinks found on other pages. This also results in a dynamic page rank: whenever you take a snapshot of your database and order the pages by the amount of credit they hold, they will most likely be ordered roughly according to their true page rank.
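Continuing the sketch above (its `credits` dict), the "snapshot" ordering described here is just a sort by current credit; the empty `fetch_links` stub is only there so the lines run as-is:

```python
credits = aopic_crawl(["http://example.com/"], fetch_links=lambda url: [])

# Snapshot: order the pages by their current credit; this ordering roughly
# tracks the pages' true rank as the crawl progresses.
ranking = sorted((u for u in credits if u != LAMBDA), key=credits.get, reverse=True)
```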
This only lets you avoid bot traps of the infinite-loop kind, but there are many other bot traps you should watch out for, and there are ways to get around them too.
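One common, simple guard (my example, not something the answer above prescribes) is to cap crawl depth and the number of pages fetched per host, so an endless chain of unique URLs on one site eventually stops being enqueued. The limits below are arbitrary assumptions to tune for your crawl:

```python
from collections import defaultdict
from urllib.parse import urlsplit

MAX_DEPTH = 20             # assumed limit on link depth from a seed
MAX_PAGES_PER_HOST = 5000  # assumed per-host page budget

pages_per_host = defaultdict(int)

def should_enqueue(url: str, depth: int) -> bool:
    """Cheap guard against 'endless chain of URLs' traps: refuse links that
    are too deep or that belong to a host we have already hit too often."""
    host = urlsplit(url).netloc.lower()
    if depth > MAX_DEPTH or pages_per_host[host] >= MAX_PAGES_PER_HOST:
        return False
    pages_per_host[host] += 1
    return True
```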