Avoid crawling the same content twice

I am building a small application that will crawl sites whose content keeps growing (for example, Stack Overflow), with the twist that the content rarely changes once it has been created.

On the first pass, I crawl all of the site's pages.

On later passes, though, I do not want to re-crawl all of that content, only the newest additions.

So if the site had 500 pages, and on the second pass it has 501, I would only scan the first and second pages. Would this be a good way to handle the situation?

In the end, everything that is crawled will be indexed in Lucene to build a custom search engine.

So I would like to avoid scanning the same content several times. Any better ideas?

EDIT:

Let's say the site has a Results page that is paginated as follows:

Results?page=1, Results?page=2, ... etc.

I assume that keeping track of how many pages there were at the last crawl and scanning only the difference would be enough (perhaps also hashing each result on a page: if I start running into hashes I have already seen, I should stop).
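A rough sketch of that idea in C#; the fetchResultsForPage delegate, the Results?page=N fetching, and persisting the seen hashes between runs are all assumptions on my part, not existing code:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class IncrementalCrawler
{
    // Hashes of results seen on earlier crawls; persist this set between runs.
    private readonly HashSet<string> seenHashes = new HashSet<string>();

    private static string Fingerprint(string resultText)
    {
        using (var sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(Encoding.UTF8.GetBytes(resultText)));
        }
    }

    // fetchResultsForPage is a placeholder: it would GET Results?page=N and
    // return the extracted result snippets for that page.
    public void CrawlNewResults(Func<int, IEnumerable<string>> fetchResultsForPage)
    {
        for (int page = 1; ; page++)
        {
            bool sawSomethingNew = false;
            foreach (string result in fetchResultsForPage(page))
            {
                if (seenHashes.Add(Fingerprint(result)))   // false if already known
                {
                    sawSomethingNew = true;
                    // index the new result in Lucene here
                }
            }
            if (!sawSomethingNew)
                break;   // this page held only content from previous crawls
        }
    }
}
```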

+4
5 answers

If each piece of content lives at a unique location, just run those locations (probably URLs) through a hash set and check it before crawling the content. The URL should probably be part of the data you store in Lucene anyway, so this should be easy to do by searching the index before adding to it.
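A minimal sketch of that lookup, assuming Lucene.NET 3.x and that each document was indexed with its URL stored in an unanalyzed field named "url" (both assumptions on my part):

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

static class IndexLookup
{
    // Returns true if a document with this exact URL is already in the index.
    public static bool AlreadyIndexed(Directory indexDirectory, string url)
    {
        using (var searcher = new IndexSearcher(indexDirectory, true)) // read-only
        {
            var query = new TermQuery(new Term("url", url));
            return searcher.Search(query, 1).TotalHits > 0;
        }
    }
}
```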

+5

My approach would be to store a hash / fingerprint of the contents of each page visited. That way, when you revisit a page, you check the fingerprint: if it matches, nothing has changed and no parsing is needed, since you have already processed the page and all the links on it.
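As a rough illustration of the fingerprint check (persisting the url-to-fingerprint map between crawls is assumed and not shown):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class PageFingerprints
{
    private readonly Dictionary<string, string> knownFingerprints =
        new Dictionary<string, string>();

    // Returns true if the page is new or its content changed since the last crawl.
    public bool HasChanged(string url, string pageContent)
    {
        string fingerprint;
        using (var md5 = MD5.Create())
        {
            fingerprint = Convert.ToBase64String(
                md5.ComputeHash(Encoding.UTF8.GetBytes(pageContent)));
        }

        string previous;
        if (knownFingerprints.TryGetValue(url, out previous) && previous == fingerprint)
            return false;   // identical content: skip re-parsing and link extraction

        knownFingerprints[url] = fingerprint;
        return true;
    }
}
```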

+2

Does the site emit ETags for each resource? If so, you can issue conditional GETs for the resources you already know about, and whenever the server actually sends the resource back (i.e., it has changed), you can look for new links to crawl, update the content, and so on.

Of course, this only works if the site emits ETags and honors conditional requests...
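A hedged sketch of what such a conditional GET could look like with HttpWebRequest (note that this class surfaces a 304 Not Modified as a WebException):

```csharp
using System;
using System.Net;

static class ConditionalGet
{
    // Returns true if the server sent a fresh copy (i.e. the resource changed).
    public static bool ResourceChanged(string url, string lastKnownETag, out string newETag)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        if (!string.IsNullOrEmpty(lastKnownETag))
            request.Headers[HttpRequestHeader.IfNoneMatch] = lastKnownETag;

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                newETag = response.Headers[HttpResponseHeader.ETag];
                return true;    // 200 OK: content changed, re-crawl it
            }
        }
        catch (WebException ex)
        {
            var response = ex.Response as HttpWebResponse;
            if (response != null && response.StatusCode == HttpStatusCode.NotModified)
            {
                newETag = lastKnownETag;
                return false;   // 304 Not Modified: nothing new here
            }
            throw;
        }
    }
}
```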

+1
  • Do a standard crawl of the entire site once to get all the historical content.
  • Monitor the site's RSS feed to pick up new content (a rough sketch follows this list).
  • Periodically repeat the full site crawl to pick up updated content.
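For the RSS part, something along these lines might do, assuming the site actually exposes a feed; the feed URL and the last-crawl timestamp handling are placeholders:

```csharp
using System;
using System.ServiceModel.Syndication;
using System.Xml;

static class FeedPoller
{
    // Prints the links of items published since the last crawl; in a real
    // crawler these would be queued for fetching and indexing in Lucene.
    public static void PollFeed(string feedUrl, DateTimeOffset lastCrawl)
    {
        using (XmlReader reader = XmlReader.Create(feedUrl))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                if (item.PublishDate <= lastCrawl)
                    continue;   // already picked up on a previous pass

                foreach (SyndicationLink link in item.Links)
                    Console.WriteLine(link.Uri);
            }
        }
    }
}
```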
+1

Take a look at the Last-Modified HTTP header. In C# you can read it from the HttpWebResponse; if it differs from the value you stored on your previous crawl, the content has changed. Otherwise you can serve your locally stored copy (provided that you saved it) to your crawler.

So, while crawling, save the page content together with its Last-Modified header.

In addition, you can either store every unique AbsoluteUri, which works well until looking up an AbsoluteUri takes longer than retrieving the page, or you can use a Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter.
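A sketch of the Last-Modified comparison, using a HEAD request so only headers are fetched (storing the lastModifiedByUrl map between runs is assumed and not shown):

```csharp
using System;
using System.Collections.Generic;
using System.Net;

static class LastModifiedCheck
{
    // Last-Modified values remembered from the previous crawl.
    private static readonly Dictionary<string, DateTime> lastModifiedByUrl =
        new Dictionary<string, DateTime>();

    // Returns true if the page changed (or was never seen) since the last crawl.
    public static bool PageChanged(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";   // headers are enough for this check
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            DateTime lastModified = response.LastModified;
            DateTime previous;
            if (lastModifiedByUrl.TryGetValue(url, out previous) && previous == lastModified)
                return false;   // unchanged: serve the locally stored copy instead

            lastModifiedByUrl[url] = lastModified;
            return true;
        }
    }
}
```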

Short of finding the site's Google sitemap (or RSS feed), you won't know in advance where new content will be added. Knowing that automatically would be like a burglar knowing where your new purchases are without asking you first. :)

+1
