I am writing a web crawler with the ultimate goal of creating a map of the path the crawler has traveled. While I don't have any idea of the rate at which other, and most certainly better, crawlers pull down pages, mine clocks in at about 2,000 pages per minute.
The crawler works on a recursive backtracking algorithm, which I have limited to a depth of 15. Furthermore, to prevent the crawler from endlessly revisiting pages, it stores the URL of each page it has visited in a list and checks that list for the next candidate URL:
    for href in tempUrl:
        ...
        if href not in urls:
            collect(href, parent, depth+1)
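For context, the relevant part of the crawler looks roughly like the simplified sketch below (get_links is just a placeholder for my fetch-and-parse step, and the structure of collect is simplified):

    urls = []          # every URL visited so far, stored in a plain list

    def collect(url, parent, depth):
        if depth > 15:                 # recursion depth limit
            return
        urls.append(url)
        for href in get_links(url):    # get_links: placeholder for fetching and parsing the page
            if href not in urls:       # membership check scans the whole list each time
                collect(href, url, depth + 1)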
This method seems to become a problem by the time the crawler has pulled down around 300,000 pages; at that point it is averaging only about 500 pages per minute.
So my question is: what is another way of achieving the same functionality while improving its efficiency?
I thought reducing the size of each entry might help, so instead of appending the whole URL, I appended just the first 2 and last characters of each URL as a string. This, however, didn't help.
Is there a way I could do this with sets or something similar?
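Something along these lines is what I have in mind, assuming I can simply swap the list for a set without changing anything else (get_links is again just a placeholder for my fetch-and-parse step). I'm not sure whether the set membership check would actually make a real difference here:

    urls = set()       # visited URLs kept in a set instead of a list

    def collect(url, parent, depth):
        if depth > 15:
            return
        urls.add(url)
        for href in get_links(url):    # get_links: same placeholder as above
            if href not in urls:       # set membership check instead of scanning a list
                collect(href, url, depth + 1)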
Thanks for the help.
Edit: As a side note, my program is not yet multithreaded. I figured I should resolve this bottleneck before getting into threading.