Effective alternative to checking a list of visited URLs

I am writing a web crawler with the ultimate goal of creating a map of the path the crawler has traveled. While I don't have an idea of the rate at which other, and most likely better, crawlers pull down pages, mine clocks about 2000 pages per minute.

The crawler works on a recursive backtracking algorithm, which I have limited to a depth of 15. Furthermore, to prevent my crawler from endlessly revisiting pages, it stores the URL of each page it has visited in a list and checks that list for each candidate URL before following it.

    for href in tempUrl:
        ...
        if href not in urls:          # linear scan over the visited list
            collect(href, parent, depth + 1)

This method seems to become a problem by the time the crawler has indexed about 300,000 pages; at that point it averages only about 500 pages per minute.

So my question is: what is another way to achieve the same functionality while improving efficiency?

I thought reducing the size of each entry might help, so instead of appending the whole URL I appended only the first 2 and the last few characters of each URL as a string. This, however, did not help.

Is there a way to do this with sets or something similar?

thanks for the help

Edit: As a side note, my program is not yet multithreaded. I figured I should resolve this bottleneck before getting into threading.

+4
4 answers

Perhaps you could use a set instead of a list for the URLs you've seen so far.

+14

Simply replace your "list of visited URLs" with a "set of visited URLs". Sets are optimized for random access (they use the same hashing algorithm that dictionaries use), and they are much faster. A lookup on a list is a linear search, so it is not particularly fast. You won't need to change the actual code that performs the lookup.

Check this.

    In [3]: timeit.timeit("500 in t", "t = list(range(1000))")
    Out[3]: 10.020853042602539

    In [4]: timeit.timeit("500 in t", "t = set(range(1000))")
    Out[4]: 0.1159818172454834
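As a minimal sketch of what that could look like in the question's loop, assuming the collect(), parent, depth and tempUrl names from the original snippet (extract_links() is a hypothetical stand-in for however the page's links are gathered):

    urls = set()                           # visited URLs, stored in a set

    def collect(url, parent, depth):
        if depth > 15:                     # the depth limit from the question
            return
        urls.add(url)                      # O(1) insertion
        tempUrl = extract_links(url)       # hypothetical link-extraction step
        for href in tempUrl:
            if href not in urls:           # O(1) average-case membership test
                collect(href, parent, depth + 1)

The only real change is building urls with set() and adding to it with urls.add(); the "href not in urls" check itself stays exactly as written.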
+7

I had a similar problem. I profiled various methods (list / file / set / sqlite) for memory and time; see these two posts. In the end, sqlite was the best choice. You can also hash each URL to reduce its size (a rough sketch follows the links below).

Finding a string in a large text file - profiling various methods in python

sqlite database design with millions of url strings - slow bulk import from csv
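As a rough illustration of the hashing idea (not the answerer's exact code), each URL can be reduced to a fixed-size digest before being stored in the visited set or database:

    import hashlib

    visited = set()

    def url_key(url):
        # Reduce an arbitrarily long URL to a fixed 16-byte MD5 digest.
        return hashlib.md5(url.encode("utf-8")).digest()

    def already_seen(url):
        key = url_key(url)
        if key in visited:
            return True
        visited.add(key)
        return False

This trades a tiny collision risk for a constant, small memory footprint per entry regardless of URL length.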

+3

Use a dict with the URLs as keys instead (access time is O(1)).

But a set will also work. See

http://wiki.python.org/moin/TimeComplexity
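A minimal sketch of the dict variant, assuming you want to remember something alongside each URL (the value stored here, the parent page, is an arbitrary choice for illustration):

    visited = {}                      # maps URL -> parent page; any value would do

    def seen(url):
        return url in visited         # O(1) average-case hash lookup

    def mark(url, parent):
        visited[url] = parent

If you don't need to store anything per URL, a set gives you the same O(1) lookup with less ceremony.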

+2
