How does hashing the entire contents of a web page work?

I have sometimes heard, especially in the context of information retrieval, search engines, crawlers, etc., that we can detect duplicate pages by hashing the contents of each page. What hash functions can hash an entire web page (which may be at least two pages long), so that two copies have the same hash value? And what is the size of a typical hash output?

Are such hash functions capable of placing two nearly identical web pages (differing only by small typos, etc.) in the same bucket?

Thanks!


If you only want to detect exact duplicates, i.e. two pages x and y such that x = y, then any deterministic hash function works: equal inputs always produce equal outputs. For example, you could use:

  • a cryptographic hash function such as MD5, SHA-1, or SHA-512, which maps input of any length to a fixed-size digest (128, 160, and 512 bits, respectively);
  • a non-cryptographic hash function, which is typically faster at the cost of weaker collision guarantees (see the sketch after this list).
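
A minimal sketch in Python of the exact-duplicate case, using the standard library's hashlib. SHA-256 is an arbitrary choice here; MD5 or SHA-1 from the list above would work the same way, and the page strings are made-up examples. The point is that the digest has a fixed size no matter how long the page is, and byte-identical pages always produce the same digest:

    import hashlib

    def page_fingerprint(html: str) -> str:
        # SHA-256 always yields a 256-bit (32-byte) digest, i.e. 64 hex
        # characters, regardless of the length of the input page.
        return hashlib.sha256(html.encode("utf-8")).hexdigest()

    page_a = "<html><body>Hello, world!</body></html>"
    page_b = "<html><body>Hello, world!</body></html>"  # exact copy
    page_c = "<html><body>Hello, world?</body></html>"  # one-character change

    print(page_fingerprint(page_a) == page_fingerprint(page_b))  # True
    print(page_fingerprint(page_a) == page_fingerprint(page_c))  # False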

Near-duplicates are a different problem: with an ordinary hash, changing even a single byte produces a completely different digest, so two pages that differ only by a small typo will not land in the same bucket. For that you need a similarity (locality-sensitive) hash such as simhash, which maps similar documents to fingerprints that differ in only a few bits.
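
To make that concrete, here is a minimal simhash sketch in Python. The whitespace tokenizer, the uniform token weights, and the MD5-derived 64-bit token hash are illustrative simplifications, not any particular crawler's implementation:

    import hashlib

    def simhash(text: str, bits: int = 64) -> int:
        # One signed counter per output bit.
        v = [0] * bits
        for token in text.split():
            # Hash each token to a 64-bit integer (first 8 bytes of MD5).
            h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
            for i in range(bits):
                v[i] += 1 if (h >> i) & 1 else -1
        # Each output bit is the sign of the accumulated vote for that bit.
        return sum(1 << i for i in range(bits) if v[i] > 0)

    def hamming_distance(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumps over the lazy cat"   # small edit
    doc3 = "an entirely different page about web crawlers"

    print(hamming_distance(simhash(doc1), simhash(doc2)))  # small: near-duplicates
    print(hamming_distance(simhash(doc1), simhash(doc3)))  # large: unrelated pages

In practice a crawler would bucket together pages whose fingerprints are within a few bits of each other, rather than requiring exact equality.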


