What is a web crawler's seed list?

I have been reading about how to implement a crawler. I understand that we start with a list of URLs to visit (the seed list), visit all of those URLs, and add every link found on the visited pages to the list of URLs still to visit (the frontier). So how much should I put in this seed list? Should I just add as many URLs as I can and hope they lead me to the rest of the URLs on the web, and does that really guarantee that I reach all the other URLs? Or is there some convention for this? I mean ... what does a search engine like Google do?

1 answer

Basically, search engines build a large list of websites by following the links between them. The more sites your search engine knows about, the better. The only problem is making this list useful: a large list of websites does not by itself mean good search results, so you also need to be able to tell what is important on each web page.

Depending on the processing power you have available, though, there is no need to stop at any particular point.

This does not guarantee that you will reach every URL, since a page that no crawled page links to is never discovered, but it is basically the only practical way to crawl the Internet.
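
To make the reachability point concrete, here is a minimal sketch of the seed-and-frontier loop described in the question. It runs over a small in-memory link graph rather than the real web: the `LINKS` table, the `.example` host names, the `crawl` function and its `max_pages` parameter are all made up for illustration and are not taken from any particular crawler.

```python
from collections import deque

# Hypothetical toy link graph standing in for the web: page -> pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
    "island.example": ["a.example"],  # links out, but nothing links to it
}

def crawl(seeds, links, max_pages=None):
    """Breadth-first crawl from the seed list; returns pages in visit order."""
    frontier = deque(dict.fromkeys(seeds))  # URLs waiting to be visited (deduplicated)
    seen = set(frontier)                    # URLs already queued, so none is queued twice
    visited = []
    while frontier and (max_pages is None or len(visited) < max_pages):
        url = frontier.popleft()
        visited.append(url)
        # A real crawler would fetch and parse the page here; this is a table lookup.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited

reached = crawl(["a.example"], LINKS)
print(reached)                    # pages reachable from the single seed
print(set(LINKS) - set(reached))  # pages the crawl never discovered
```

With `a.example` as the only seed, the crawl never finds `island.example`, because no visited page links to it. Adding it to the seed list is the only way to reach it, which is why a broader seed list improves coverage without ever guaranteeing it. The optional `max_pages` cap stands in for whatever processing budget you have.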

