I am building a small application that will crawl sites whose content grows over time (for example, Stack Overflow); the difference is that content, once created, rarely changes.
On the first pass, I crawl all the pages of the site.
On subsequent passes, I don't want to re-crawl all of that content, only the latest additions.
So, if the site had 500 pages on the last crawl and has 501 pages now, on the second pass I would only scan the first page or two (where the newest content appears). Would this be a good way to handle the situation?
In the end, the crawled content will be indexed with Lucene to build a custom search engine.
So I would like to avoid crawling the same content multiple times. Any better ideas?
EDIT:
Let's say the site has a Results page that is paginated like this:
Results?page=1, Results?page=2, etc.
I assume that keeping track of how many pages there were at the last crawl and just scanning the difference would be enough. (Perhaps also hashing each result on a page: if I start seeing hashes I have already stored, I should stop.)
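The stop-on-known-hash idea could be sketched roughly like this (a minimal sketch, assuming newest results appear on page 1; `fetch_page` is a hypothetical function that returns the parsed result texts for Results?page=n):

```python
import hashlib

def content_hash(text):
    """Stable hash of one result's text, used to detect already-seen items."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def crawl_new_results(fetch_page, seen_hashes):
    """Walk pages newest-first; stop at the first result seen on a prior crawl.

    fetch_page(n) -> list of result strings for Results?page=n
    (hypothetical fetcher), returning an empty list past the last page.
    seen_hashes: set of hashes from previous crawls (updated in place).
    """
    new_results = []
    page = 1
    while True:
        results = fetch_page(page)
        if not results:           # ran past the last page
            return new_results
        for text in results:
            h = content_hash(text)
            if h in seen_hashes:  # reached content from the previous crawl
                return new_results
            seen_hashes.add(h)
            new_results.append(text)
        page += 1
```

On the first run `seen_hashes` is empty, so every page gets crawled; on later runs the crawl stops as soon as it hits an item whose hash was stored before, so only the new additions are fetched and handed to the indexer.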