Web crawler design

I came across an interview question: "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.

How does it all start in the first place? Say Google started with some hub pages, hundreds of them (how those hub pages were found in the first place is a different question). As Google follows the links from a page and so on, does it keep building a hash table to make sure it doesn't revisit pages it has already crawled?

What if the same page has two names (URLs), say in these days when we have URL shorteners, etc.?

I have taken Google as an example. Although Google doesn't leak how its web crawler and page-ranking algorithms work, any guesses?

+65
data-structures search-engine google-search web-crawler
Apr 29 '11 at 16:37
10 answers

If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern crawler:

In the course of extracting links, any web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link before adding it to the URL frontier. (An alternative design would be to instead perform the URL-seen test when the URL is removed from the frontier, but this approach would result in a much larger frontier.)

To perform the URL-seen test, we store all of the URLs seen by Mercator in canonical form in a large table called the URL set. Again, there are too many entries for them all to fit in memory, so like the document fingerprint set, the URL set is stored mostly on disk.

To save space, we do not store the textual representation of each URL in the URL set, but rather a fixed-sized checksum. Unlike the fingerprints presented to the content-seen test's document fingerprint set, the stream of URLs tested against the URL set has a non-trivial amount of locality. To reduce the number of operations on the backing disk file, we therefore keep an in-memory cache of popular URLs. The intuition behind this cache is that links to some URLs are quite common, so caching the popular ones in memory will lead to a high in-memory hit rate.

In fact, using an in-memory cache of 2^18 entries and the LRU-like clock replacement policy, we achieve an overall hit rate on the in-memory cache of 66.2% and a hit rate of 9.5% on the table of recently-added URLs, for a net hit rate of 75.7%. Moreover, of the 24.3% of requests that miss in both the cache of popular URLs and the table of recently-added URLs, about 1/3 produce hits on the buffer in our random-access file implementation, which also resides in user space. The net result of all this buffering is that each membership test we perform on the URL set results in an average of 0.16 seek and 0.17 read kernel calls (some fraction of which are served out of the kernel's file system buffers). So each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set. These savings are purely due to the amount of URL locality (i.e., repetition of popular URLs) inherent in the stream of URLs encountered during a crawl.

Essentially, they hash all of the URLs with a hash function that guarantees unique hash values for each URL, and thanks to the locality of URLs it becomes very fast to find a URL. Google even open-sourced its hash function: CityHash.
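
To make the idea concrete, here is a minimal Python sketch of such a URL-seen test. It is not Mercator's or Google's code; a truncated SHA-1 digest stands in for the fixed-size checksum (CityHash would work just as well), and the class and method names are illustrative only:

    import hashlib
    from collections import OrderedDict

    class UrlSeenTest:
        """Sketch of a Mercator-style URL-seen test: fixed-size checksums in a
        big set (on disk in a real crawler) plus a small LRU cache in memory."""

        def __init__(self, cache_size=2 ** 18):
            self.seen = set()            # stand-in for the on-disk URL set
            self.cache = OrderedDict()   # in-memory cache of popular checksums
            self.cache_size = cache_size

        @staticmethod
        def checksum(url):
            # 8-byte checksum of the canonical URL (stand-in for CityHash)
            return hashlib.sha1(url.encode("utf-8")).digest()[:8]

        def seen_before(self, url):
            key = self.checksum(url)
            if key in self.cache:        # popular URL: answered from memory
                self.cache.move_to_end(key)
                return True
            was_seen = key in self.seen  # would be a disk lookup at scale
            self.seen.add(key)
            self.cache[key] = True
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)   # evict roughly-LRU entry
            return was_seen

    urls = UrlSeenTest()
    print(urls.seen_before("http://example.com/a"))   # False: first sighting
    print(urls.seen_before("http://example.com/a"))   # True: already seen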

WARNING!
They may also be getting at bot traps! A bot trap is a section of a page that keeps generating new links with unique URLs; by following the links served by that page you essentially end up in an "infinite loop". It is not exactly a loop, because a loop would be the result of visiting the same URL, but it is an infinite chain of URLs that you should avoid crawling.

Update 12/13/2012 - the day after the world was supposed to end :)

Per Fr0zenFyr's comment: if you use the AOPIC algorithm to select pages, it is fairly easy to avoid bot traps of the infinite-loop kind. Here is a summary of how AOPIC works (a minimal code sketch follows the list):

  1. Get a set of N seed pages.
  2. Allocate X amount of credit to each page, so that each page has X/N credit (i.e. an equal amount of credit) before crawling has started.
  3. Select the page P with the highest amount of credit (or, if all pages have the same amount of credit, crawl a random page).
  4. Crawl page P (let's say that P had 100 credits when it was crawled).
  5. Extract all the links from page P (let's say there are 10 of them).
  6. Set the credits of P to 0.
  7. Take a 10% "tax" and allocate it to a Lambda page.
  8. Allocate an equal amount of credit to each link found on page P out of P's original credit after the tax: so (100 (P credits) - 10 (10% tax)) / 10 (links) = 9 credits per link.
  9. Repeat from step 3.
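
A minimal Python sketch of that credit bookkeeping, under the assumption that the caller supplies a fetch_links(url) helper that downloads a page and returns its outgoing links (fetch_links, the step count, and the credit amounts are all illustrative):

    import random

    def aopic_crawl(seed_urls, fetch_links, steps=100, total_credit=10000.0, tax=0.10):
        """Sketch of the AOPIC credit loop described above."""
        credits = {url: total_credit / len(seed_urls) for url in seed_urls}
        lambda_credit = 0.0                       # the "Lambda" page collects tax

        for _ in range(steps):
            # If the Lambda page has the most credit, "crawl" it: no HTTP request,
            # just spread its credit evenly over every known page.
            if lambda_credit >= max(credits.values()):
                share = lambda_credit / len(credits)
                for url in credits:
                    credits[url] += share
                lambda_credit = 0.0
                continue

            # Otherwise crawl the page with the highest credit (random among ties).
            top = max(credits.values())
            page = random.choice([u for u, c in credits.items() if c == top])
            links = fetch_links(page)             # crawl P and extract its links

            page_credit = credits[page]
            credits[page] = 0.0
            lambda_credit += page_credit * tax    # 10% tax goes to the Lambda page
            if links:
                share = page_credit * (1 - tax) / len(links)
                for link in links:                # split the rest among P's links
                    credits[link] = credits.get(link, 0.0) + share

        return credits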

Since the Lambda page continuously collects tax, it will eventually be the page with the largest amount of credit, and we will have to "crawl" it. I say "crawl" in quotes because we don't actually make an HTTP request for the Lambda page; we just take its credits and distribute them equally to all of the pages in our database.

Since bot traps only give internal links credit and rarely get credit from the outside, they continually leak credit (via the tax) to the Lambda page. The Lambda page then distributes that credit evenly across all pages in the database, and with each cycle the bot-trap pages lose more and more credit, until they have so little that they almost never get crawled again. This does not happen to good pages, because they often receive credit from backlinks found on other pages. This also results in a dynamic page rank: any time you take a snapshot of your database and order the pages by the amount of credit they have, they will most likely be ordered roughly according to their true page rank.

This only avoids bot traps of the infinite-loop kind, but there are many other bot traps you should watch out for, and there are ways to get around them too.

+78
Apr 29 '11

While everybody here has already suggested how to create your web crawler, here is how Google ranks pages.

Google gives each page a rank based on the number of backlinks (how many links on other sites point to a specific site/page). This is called a relevance score. It is based on the observation that if a page is linked to by many other pages, it is probably an important page.

Each site/page is viewed as a node in a graph. Links to other pages are directed edges. The in-degree of a vertex is the number of incoming edges. Nodes with many incoming edges are ranked higher.

This is how PageRank is determined. Suppose page Pj has Lj outgoing links. If one of those links points to page Pi, then Pj passes 1/Lj of its importance to Pi. The importance score of Pi is then the sum of all the contributions made by the pages linking to it. So, if we denote the set of pages linking to Pi by Bi, we get the following formula:

 Importance(Pi) = sum( Importance(Pj)/Lj ) for all pages Pj in Bi 

The scores are placed in a matrix called the hyperlink matrix: H[i,j]

An entry H[i,j] of this matrix is either 0, or 1/Lj if there is a link from page Pj to page Pi. Another property of this matrix is that if we sum all the rows in a column, we get 1.

Now we need to find a vector I (the eigenvector of H with eigenvalue 1) such that:

 I = H*I 

Now we start the power iteration: I, H·I, H²·I, H³·I, ... H^k·I until the solution converges, i.e. the values in the vector are almost the same at step k and step k+1.

Now all that remains in vector I is the importance of each page.

For a simple example of class homework, see http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
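
Here is a small Python sketch of that power iteration on a made-up three-page link graph (the graph itself is purely illustrative):

    import numpy as np

    # Toy link graph: page j -> list of pages it links to. Column j of H holds
    # 1/Lj for every page i that page j links to, so every column sums to 1.
    links = {0: [1, 2], 1: [2], 2: [0]}
    n = len(links)

    H = np.zeros((n, n))
    for j, outgoing in links.items():
        for i in outgoing:
            H[i, j] = 1.0 / len(outgoing)

    # Power iteration: I, H*I, H^2*I, ... until the vector stops changing.
    I = np.full(n, 1.0 / n)
    for _ in range(100):
        next_I = H @ I
        if np.allclose(next_I, I, atol=1e-10):
            break
        I = next_I

    print(I)   # the importance (PageRank-style score) of each page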

As for solving the duplicate problem from your interview question, compute a checksum over the whole page and use either that or a checksum of the URL as your key in a map to keep track of the pages you have visited.
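
A minimal sketch of that checksum bookkeeping (SHA-256 is an arbitrary choice here; any stable checksum would do):

    import hashlib

    visited = {}   # checksum of page content -> URL where it was first seen

    def already_seen(url, page_html):
        key = hashlib.sha256(page_html.encode("utf-8")).hexdigest()
        if key in visited:
            return True          # identical content already crawled, maybe under another URL
        visited[key] = url
        return False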

+7
Dec 20

It depends on how deep their question was intended to go. If they were just trying to avoid following the same links back and forth, then hashing the URLs would be enough.

What about content where literally thousands of URLs lead to the same page? Like a query-string parameter that doesn't affect anything but can have an infinite number of values. I suppose you could also use the contents of the page and compare URLs to see whether they are similar enough to catch content that is identified by multiple URLs. See, for example, the bot traps mentioned in @Lirik's post.
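
One hedged way to tame the query-string problem is to canonicalize URLs before the seen-check, for example by dropping parameters believed to be irrelevant; the parameter names below are only examples and would be site-specific in practice:

    from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

    # Example parameters assumed not to change the content; in practice the
    # list is site-specific and has to be configured or discovered.
    IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}

    def canonicalize(url):
        parts = urlparse(url)
        query = [(k, v) for k, v in parse_qsl(parts.query)
                 if k.lower() not in IGNORED_PARAMS]
        return urlunparse(parts._replace(query=urlencode(sorted(query)),
                                         fragment=""))    # drop #fragments too

    print(canonicalize("http://example.com/page?id=3&utm_source=feed#top"))
    # http://example.com/page?id=3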

+1
Apr 29 '11 at 16:43

You would need some kind of hash table to store the results in; you would just have to check it before each page load.

0
Apr 29 '11 at 16:43

The problem here is not crawling duplicated URLs, which is solved with an index using a hash obtained from the URL. The problem is crawling DUPLICATED CONTENT. Every URL of a "crawler trap" is different (year, day, SessionID...).

There is no “perfect” solution ... but you can use some of these strategies:

• Keep a field with the level at which the URL sits inside the website. For each cycle of getting URLs from a page, increase the level. It will be like a tree. You can stop crawling at a certain level, like 10 (I think Google uses this); see the sketch after this list.

• You can try to create a kind of HASH that can be compared to find similar documents, since you cannot compare against every document in your database. There is SimHash from Google, but I could not find any implementation to use, so I built my own. My hash counts low-frequency and high-frequency characters inside the HTML code and generates a 20-byte hash, which is compared against a small cache of the last crawled pages inside an AVL tree with a nearest-neighbour search with some tolerance (about 2). You cannot use any reference to character positions in this hash. After "recognizing" the trap, you can record the URL pattern of the duplicate content and start ignoring pages that match it.

• Like Google, you can build a ranking for each website and "trust" some more than others.
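
Here is a minimal sketch of the first strategy (the level field), assuming a hypothetical fetch_links(url) helper that downloads a page and returns its links:

    from collections import deque

    def crawl_with_depth_limit(seed_url, fetch_links, max_level=10):
        """BFS crawl that tracks the level of every URL and stops expanding
        pages deeper than max_level, which bounds crawler-trap chains."""
        seen = {seed_url}
        queue = deque([(seed_url, 0)])       # (url, level inside the website)
        while queue:
            url, level = queue.popleft()
            if level >= max_level:
                continue                     # too deep: do not follow further
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
        return seen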

0
Dec 19 '11 at 2:05

The crawler stores a pool of URLs containing all the URLs to be crawled. To avoid an “infinite loop”, the basic idea is to check for the presence of each URL before adding to the pool.

However, this is not easy to implement when the system scales to a certain level. The naive approach is to keep all URLs in a hash set and check the existence of each new URL against it. This will not work when there are too many URLs to fit in memory.

There are a couple of solutions here. For instance, instead of storing all URLs in memory, we should keep them on disk. To save space, a URL hash should be used instead of the raw URL. It is also worth noting that we should keep the canonical form of a URL rather than the original. So if a URL has been shortened by a service like bit.ly, it is better to record the final URL. To speed up the check, a cache layer can be built. Or you can treat this as a distributed cache system, which is a separate topic.
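
A sketch of those strategies put together: canonical URL, fixed-size hash, an on-disk store (SQLite here, chosen only for illustration), and a small in-memory cache layer in front of it:

    import hashlib
    import sqlite3

    conn = sqlite3.connect("urls_seen.db")      # keep the URL set on disk, not in memory
    conn.execute("CREATE TABLE IF NOT EXISTS seen (h BLOB PRIMARY KEY)")

    hot_cache = set()                           # small cache layer of recently seen hashes
    HOT_CACHE_LIMIT = 100000

    def url_hash(canonical_url):
        # store a fixed-size hash instead of the raw URL to save space
        return hashlib.sha1(canonical_url.encode("utf-8")).digest()

    def add_if_new(canonical_url):
        """Return True if the canonical URL has not been seen before, and record it."""
        h = url_hash(canonical_url)
        if h in hot_cache:
            return False                        # answered from memory, no disk access
        row = conn.execute("SELECT 1 FROM seen WHERE h = ?", (h,)).fetchone()
        if row is None:
            conn.execute("INSERT INTO seen (h) VALUES (?)", (h,))
            conn.commit()
        if len(hot_cache) < HOT_CACHE_LIMIT:
            hot_cache.add(h)
        return row is None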

The Build the Web Crawler post provides a detailed analysis of this issue.

0
Jul 07 '16 at 3:55

I also needed a crawler and could not find one suitable for my requirements, so I developed a basic crawler library to cover the simple cases. It still makes it possible to implement almost all the principles of how a crawler operates. You can check out the DotnetCrawler GitHub repo, which implements Downloader-Processor-Pipeline modules on its own, with a default implementation that uses Entity Framework Core to save data to SQL Server.

https://github.com/mehmetozkaya/DotnetCrawler

0
Feb 21 '19 at 18:39

A web crawler is a computer program used to collect/crawl key values (HREF links, image links, metadata, etc.) from a given website URL. It is designed to intelligently follow the HREF links already obtained from the previous URL, so the crawler can jump from one website to another. It is usually called a web spider or web bot. This mechanism always acts as the backbone of an Internet search engine.

Please find the source code from my technical blog - http://www.algonuts.info/how-to-built-a-simple-web-crawler-in-php.html

 <?php class webCrawler { public $siteURL; public $error; function __construct() { $this->siteURL = ""; $this->error = ""; } function parser() { global $hrefTag,$hrefTagCountStart,$hrefTagCountFinal,$hrefTagLengthStart,$hrefTagLengthFinal,$hrefTagPointer; global $imgTag,$imgTagCountStart,$imgTagCountFinal,$imgTagLengthStart,$imgTagLengthFinal,$imgTagPointer; global $Url_Extensions,$Document_Extensions,$Image_Extensions,$crawlOptions; $dotCount = 0; $slashCount = 0; $singleSlashCount = 0; $doubleSlashCount = 0; $parentDirectoryCount = 0; $linkBuffer = array(); if(($url = trim($this->siteURL)) != "") { $crawlURL = rtrim($url,"/"); if(($directoryURL = dirname($crawlURL)) == "http:") { $directoryURL = $crawlURL; } $urlParser = preg_split("/\//",$crawlURL); //-- Curl Start -- $curlObject = curl_init($crawlURL); curl_setopt_array($curlObject,$crawlOptions); $webPageContent = curl_exec($curlObject); $errorNumber = curl_errno($curlObject); curl_close($curlObject); //-- Curl End -- if($errorNumber == 0) { $webPageCounter = 0; $webPageLength = strlen($webPageContent); while($webPageCounter < $webPageLength) { $character = $webPageContent[$webPageCounter]; if($character == "") { $webPageCounter++; continue; } $character = strtolower($character); //-- Href Filter Start -- if($hrefTagPointer[$hrefTagLengthStart] == $character) { $hrefTagLengthStart++; if($hrefTagLengthStart == $hrefTagLengthFinal) { $hrefTagCountStart++; if($hrefTagCountStart == $hrefTagCountFinal) { if($hrefURL != "") { if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1) { if($doubleSlashCount >= 1) { $hrefURL = "http://".$hrefURL; } else if($parentDirectoryCount >= 1) { $tempData = 0; $tempString = ""; $tempTotal = count($urlParser) - $parentDirectoryCount; while($tempData < $tempTotal) { $tempString .= $urlParser[$tempData]."/"; $tempData++; } $hrefURL = $tempString."".$hrefURL; } else if($singleSlashCount >= 1) { $hrefURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$hrefURL; } } $host = ""; $hrefURL = urldecode($hrefURL); $hrefURL = rtrim($hrefURL,"/"); if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true) { $dump = parse_url($hrefURL); if(isset($dump["host"])) { $host = trim(strtolower($dump["host"])); } } else { $hrefURL = $directoryURL."/".$hrefURL; if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true) { $dump = parse_url($hrefURL); if(isset($dump["host"])) { $host = trim(strtolower($dump["host"])); } } } if($host != "") { $extension = pathinfo($hrefURL,PATHINFO_EXTENSION); if($extension != "") { $tempBuffer =""; $extensionlength = strlen($extension); for($tempData = 0; $tempData < $extensionlength; $tempData++) { if($extension[$tempData] != "?") { $tempBuffer = $tempBuffer.$extension[$tempData]; continue; } else { $extension = trim($tempBuffer); break; } } if(in_array($extension,$Url_Extensions)) { $type = "domain"; } else if(in_array($extension,$Image_Extensions)) { $type = "image"; } else if(in_array($extension,$Document_Extensions)) { $type = "document"; } else { $type = "unknown"; } } else { $type = "domain"; } if($hrefURL != "") { if($type == "domain" && !in_array($hrefURL,$this->linkBuffer["domain"])) { $this->linkBuffer["domain"][] = $hrefURL; } if($type == "image" && !in_array($hrefURL,$this->linkBuffer["image"])) { $this->linkBuffer["image"][] = $hrefURL; } if($type == "document" && !in_array($hrefURL,$this->linkBuffer["document"])) { $this->linkBuffer["document"][] = $hrefURL; } if($type == "unknown" && !in_array($hrefURL,$this->linkBuffer["unknown"])) { 
$this->linkBuffer["unknown"][] = $hrefURL; } } } } $hrefTagCountStart = 0; } if($hrefTagCountStart == 3) { $hrefURL = ""; $dotCount = 0; $slashCount = 0; $singleSlashCount = 0; $doubleSlashCount = 0; $parentDirectoryCount = 0; $webPageCounter++; while($webPageCounter < $webPageLength) { $character = $webPageContent[$webPageCounter]; if($character == "") { $webPageCounter++; continue; } if($character == "\"" || $character == "'") { $webPageCounter++; while($webPageCounter < $webPageLength) { $character = $webPageContent[$webPageCounter]; if($character == "") { $webPageCounter++; continue; } if($character == "\"" || $character == "'" || $character == "#") { $webPageCounter--; break; } else if($hrefURL != "") { $hrefURL .= $character; } else if($character == "." || $character == "/") { if($character == ".") { $dotCount++; $slashCount = 0; } else if($character == "/") { $slashCount++; if($dotCount == 2 && $slashCount == 1) $parentDirectoryCount++; else if($dotCount == 0 && $slashCount == 1) $singleSlashCount++; else if($dotCount == 0 && $slashCount == 2) $doubleSlashCount++; $dotCount = 0; } } else { $hrefURL .= $character; } $webPageCounter++; } break; } $webPageCounter++; } } $hrefTagLengthStart = 0; $hrefTagLengthFinal = strlen($hrefTag[$hrefTagCountStart]); $hrefTagPointer =& $hrefTag[$hrefTagCountStart]; } } else { $hrefTagLengthStart = 0; } //-- Href Filter End -- //-- Image Filter Start -- if($imgTagPointer[$imgTagLengthStart] == $character) { $imgTagLengthStart++; if($imgTagLengthStart == $imgTagLengthFinal) { $imgTagCountStart++; if($imgTagCountStart == $imgTagCountFinal) { if($imgURL != "") { if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1) { if($doubleSlashCount >= 1) { $imgURL = "http://".$imgURL; } else if($parentDirectoryCount >= 1) { $tempData = 0; $tempString = ""; $tempTotal = count($urlParser) - $parentDirectoryCount; while($tempData < $tempTotal) { $tempString .= $urlParser[$tempData]."/"; $tempData++; } $imgURL = $tempString."".$imgURL; } else if($singleSlashCount >= 1) { $imgURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$imgURL; } } $host = ""; $imgURL = urldecode($imgURL); $imgURL = rtrim($imgURL,"/"); if(filter_var($imgURL,FILTER_VALIDATE_URL) == true) { $dump = parse_url($imgURL); $host = trim(strtolower($dump["host"])); } else { $imgURL = $directoryURL."/".$imgURL; if(filter_var($imgURL,FILTER_VALIDATE_URL) == true) { $dump = parse_url($imgURL); $host = trim(strtolower($dump["host"])); } } if($host != "") { $extension = pathinfo($imgURL,PATHINFO_EXTENSION); if($extension != "") { $tempBuffer =""; $extensionlength = strlen($extension); for($tempData = 0; $tempData < $extensionlength; $tempData++) { if($extension[$tempData] != "?") { $tempBuffer = $tempBuffer.$extension[$tempData]; continue; } else { $extension = trim($tempBuffer); break; } } if(in_array($extension,$Url_Extensions)) { $type = "domain"; } else if(in_array($extension,$Image_Extensions)) { $type = "image"; } else if(in_array($extension,$Document_Extensions)) { $type = "document"; } else { $type = "unknown"; } } else { $type = "domain"; } if($imgURL != "") { if($type == "domain" && !in_array($imgURL,$this->linkBuffer["domain"])) { $this->linkBuffer["domain"][] = $imgURL; } if($type == "image" && !in_array($imgURL,$this->linkBuffer["image"])) { $this->linkBuffer["image"][] = $imgURL; } if($type == "document" && !in_array($imgURL,$this->linkBuffer["document"])) { $this->linkBuffer["document"][] = $imgURL; } if($type == "unknown" && 
!in_array($imgURL,$this->linkBuffer["unknown"])) { $this->linkBuffer["unknown"][] = $imgURL; } } } } $imgTagCountStart = 0; } if($imgTagCountStart == 3) { $imgURL = ""; $dotCount = 0; $slashCount = 0; $singleSlashCount = 0; $doubleSlashCount = 0; $parentDirectoryCount = 0; $webPageCounter++; while($webPageCounter < $webPageLength) { $character = $webPageContent[$webPageCounter]; if($character == "") { $webPageCounter++; continue; } if($character == "\"" || $character == "'") { $webPageCounter++; while($webPageCounter < $webPageLength) { $character = $webPageContent[$webPageCounter]; if($character == "") { $webPageCounter++; continue; } if($character == "\"" || $character == "'" || $character == "#") { $webPageCounter--; break; } else if($imgURL != "") { $imgURL .= $character; } else if($character == "." || $character == "/") { if($character == ".") { $dotCount++; $slashCount = 0; } else if($character == "/") { $slashCount++; if($dotCount == 2 && $slashCount == 1) $parentDirectoryCount++; else if($dotCount == 0 && $slashCount == 1) $singleSlashCount++; else if($dotCount == 0 && $slashCount == 2) $doubleSlashCount++; $dotCount = 0; } } else { $imgURL .= $character; } $webPageCounter++; } break; } $webPageCounter++; } } $imgTagLengthStart = 0; $imgTagLengthFinal = strlen($imgTag[$imgTagCountStart]); $imgTagPointer =& $imgTag[$imgTagCountStart]; } } else { $imgTagLengthStart = 0; } //-- Image Filter End -- $webPageCounter++; } } else { $this->error = "Unable to proceed, permission denied"; } } else { $this->error = "Please enter url"; } if($this->error != "") { $this->linkBuffer["error"] = $this->error; } return $this->linkBuffer; } } ?> 
0
May 6 '19 at 7:54

Well, the web is basically a directed graph, so you can build the graph out of the URLs and then do a BFS or DFS traversal while marking visited nodes so you don't visit the same page twice.
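
For example, a depth-first version with a visited set might look like this (fetch_links is again a hypothetical helper that returns the outgoing links of a page):

    def dfs_crawl(url, fetch_links, visited=None):
        """Depth-first traversal of the link graph, marking visited pages so the
        same URL is never fetched twice."""
        if visited is None:
            visited = set()
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                dfs_crawl(link, fetch_links, visited)
        return visited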

-1
Apr 29 '11 at 16:45

This is an example of a web crawler that you could use to collect MAC addresses for MAC spoofing.

    #!/usr/bin/env python
    # Python 2 script: crawls pages and collects MAC-address-looking strings.
    import os
    import urlparse
    import urllib
    from bs4 import BeautifulSoup

    def mac_addr_str(f_data):
        """Scan the page text for words that look like MAC addresses and record them."""
        global fptr
        global mac_list
        word_array = f_data.split(" ")
        for word in word_array:
            if len(word) == 17 and ':' in word[2] and ':' in word[5] and ':' in word[8] \
                    and ':' in word[11] and ':' in word[14]:
                if word not in mac_list:
                    mac_list.append(word)
                    fptr.writelines(word + "\n")
                    print word

    url = "http://stackoverflow.com/questions/tagged/mac-address"
    url_list = [url]
    visited = [url]
    pwd = os.getcwd()
    pwd = pwd + "/internet_mac.txt"
    fptr = open(pwd, "a")
    mac_list = []

    while len(url_list) > 0:
        try:
            htmltext = urllib.urlopen(url_list[0]).read()
        except:
            url_list.pop(0)        # skip URLs that fail to load
            continue
        mac_addr_str(htmltext)
        soup = BeautifulSoup(htmltext)
        url_list.pop(0)
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            # stay on the same site and never queue a URL twice
            if url in tag['href'] and tag['href'] not in visited:
                url_list.append(tag['href'])
                visited.append(tag['href'])

Change the URL to crawl more sites ...... good luck

-2
Jun 04 '15 at 1:18


