How does hashing the entire contents of a web page work?

I have sometimes heard, especially in the context of information retrieval, search engines, crawlers, etc., that we can detect duplicate pages by hashing the contents of each page. What hash functions can hash an entire web page (which may be at least two pages long), so that two copies have the same hash value? And what is the size of a typical hash output?

Are such hash functions capable of placing two nearly identical web pages (differing only by small typos, etc.) in the same bucket?

Thanks!


If you only want to detect exact duplicates, i.e. two pages x and y such that x = y, then any deterministic hash function works: equal inputs always produce equal outputs. For example, you could use:

  • a cryptographic hash function such as MD5, SHA-1, or SHA-512, which maps input of any length to a fixed-size digest (128, 160, and 512 bits, respectively);
  • a non-cryptographic hash function, which is typically faster at the cost of weaker collision guarantees (see the sketch after this list).
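
A minimal sketch in Python of the exact-duplicate case, using the standard library's hashlib. SHA-256 is an arbitrary choice here; MD5 or SHA-1 from the list above would work the same way, and the page strings are made-up examples. The point is that the digest has a fixed size no matter how long the page is, and byte-identical pages always produce the same digest:

    import hashlib

    def page_fingerprint(html: str) -> str:
        # SHA-256 always yields a 256-bit (32-byte) digest, i.e. 64 hex
        # characters, regardless of the length of the input page.
        return hashlib.sha256(html.encode("utf-8")).hexdigest()

    page_a = "<html><body>Hello, world!</body></html>"
    page_b = "<html><body>Hello, world!</body></html>"  # exact copy
    page_c = "<html><body>Hello, world?</body></html>"  # one-character change

    print(page_fingerprint(page_a) == page_fingerprint(page_b))  # True
    print(page_fingerprint(page_a) == page_fingerprint(page_c))  # False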

Near-duplicates are a different problem: with an ordinary hash, changing even a single byte produces a completely different digest, so two pages that differ only by a small typo will not land in the same bucket. For that you need a similarity (locality-sensitive) hash such as simhash, which maps similar documents to fingerprints that differ in only a few bits.
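
To make that concrete, here is a minimal simhash sketch in Python. The whitespace tokenizer, the uniform token weights, and the MD5-derived 64-bit token hash are illustrative simplifications, not any particular crawler's implementation:

    import hashlib

    def simhash(text: str, bits: int = 64) -> int:
        # One signed counter per output bit.
        v = [0] * bits
        for token in text.split():
            # Hash each token to a 64-bit integer (first 8 bytes of MD5).
            h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
            for i in range(bits):
                v[i] += 1 if (h >> i) & 1 else -1
        # Each output bit is the sign of the accumulated vote for that bit.
        return sum(1 << i for i in range(bits) if v[i] > 0)

    def hamming_distance(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumps over the lazy cat"   # small edit
    doc3 = "an entirely different page about web crawlers"

    print(hamming_distance(simhash(doc1), simhash(doc2)))  # small: near-duplicates
    print(hamming_distance(simhash(doc1), simhash(doc3)))  # large: unrelated pages

In practice a crawler would bucket together pages whose fingerprints are within a few bits of each other, rather than requiring exact equality.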


