Python multithreaded crawler

Hello! I am trying to write a web crawler in Python, and I want to use Python's multithreading. Even after reading the previously suggested articles and textbooks, I still have a problem. My code is below (the full source code is here):

class Crawler(threading.Thread):
    global g_URLsDict
    varLock = threading.Lock()
    count = 0

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.url = self.queue.get()

    def run(self):
        while 1:
            print self.getName()+" started"
            self.page = getPage(self.url)
            self.parsedPage = getParsedPage(self.page, fix=True)
            self.urls = getLinksFromParsedPage(self.parsedPage)

            for url in self.urls:
                self.fp = hashlib.sha1(url).hexdigest() #url-seen check
                Crawler.varLock.acquire() #lock for global variable g_URLs
                if self.fp in g_URLsDict:
                    Crawler.varLock.release() #releasing lock
                else:
                    #print url+" does not exist"
                    Crawler.count +=1
                    print "total links: %d"%len(g_URLsDict)
                    print self.fp
                    g_URLsDict[self.fp] = url
                    Crawler.varLock.release() #releasing lock
                    self.queue.put(url)
                    print self.getName()+ " %d"%self.queue.qsize()

            self.queue.task_done()
            #self.queue.task_done()
            #self.queue.task_done()

print g_URLsDict

queue = Queue.Queue()
queue.put("http://www.ertir.com")

for i in range(5):
    t = Crawler(queue)
    t.setDaemon(True)
    t.start()

queue.join()

It does not work as needed: it gives no results after thread 1, it behaves differently from run to run, and after some time it sometimes raises this error:

 Exception in thread Thread-2 (most likely raised during interpreter shutdown): 

How can I fix this? Also, I don't think this is any more efficient than just using a for loop.

I tried to fix run():

 def run(self):
     while 1:
         print self.getName()+" started"
         self.page = getPage(self.url)
         self.parsedPage = getParsedPage(self.page, fix=True)
         self.urls = getLinksFromParsedPage(self.parsedPage)

         for url in self.urls:
             self.fp = hashlib.sha1(url).hexdigest() #url-seen check
             Crawler.varLock.acquire() #lock for global variable g_URLs
             if self.fp in g_URLsDict:
                 Crawler.varLock.release() #releasing lock
             else:
                 #print url+" does not exist"
                 print self.fp
                 g_URLsDict[self.fp] = url
                 Crawler.varLock.release() #releasing lock
                 self.queue.put(url)
                 print self.getName()+ " %d"%self.queue.qsize()

         #self.queue.task_done()
         #self.queue.task_done()
         self.queue.task_done()

I experimented with placing the task_done() call in different places; can anyone explain the difference?

1 answer

You only call self.url = self.queue.get() once, when the threads are initialized. If you want to pick up new URLs for further processing down the line, you need to re-acquire URLs from the queue inside the while loop.

Try replacing self.page = getPage(self.url) with self.page = getPage(self.queue.get()). Keep in mind that the get function will block indefinitely. You will probably want to add a timeout after a while, plus some way for your background threads to exit gracefully on request (which would eliminate the exception you saw).
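For illustration only, here is a minimal sketch of that idea in the same Python 2 style as the question. It is not code from the question, and the process() helper is just a stand-in for the real fetch/parse work; it shows a worker loop that uses a timeout on get() and exits cleanly when no new URLs arrive:

import Queue
import threading

def process(url):
    print "processing " + url   # stand-in for the real fetch/parse work

def worker(queue):
    while True:
        try:
            # Wait at most 5 seconds for a URL instead of blocking forever.
            url = queue.get(timeout=5)
        except Queue.Empty:
            # Nothing new arrived in time: assume the crawl is finished
            # and let the thread end instead of hanging on get().
            return
        try:
            process(url)
        finally:
            queue.task_done()   # mark this item done for queue.join()

q = Queue.Queue()
q.put("http://www.ertir.com")
t = threading.Thread(target=worker, args=(q,))
t.setDaemon(True)
t.start()
q.join()

The timeout is only one way to let the threads finish; another common option is to put a special "stop" marker on the queue for each worker.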

There are some good examples on effbot.org that use get() in the way I described above.

Edit: Responses to your initial comments:

See the docs for task_done(): for each get() call that does not time out, you should call task_done(), which is what tells any blocking join() calls that everything put on the queue has now been processed. Each call to get() blocks (sleeps) while it waits for a new URL to be put on the queue.
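To make that get()/task_done()/join() contract concrete, here is a small self-contained sketch (illustrative only, not part of the crawler code) showing that join() returns only after task_done() has been called once for every item that was put():

import Queue
import threading
import time

q = Queue.Queue()

def worker():
    while True:
        item = q.get()      # blocks (sleeps) until an item is available
        time.sleep(0.1)     # pretend to do some work on the item
        q.task_done()       # exactly one task_done() per successful get()

t = threading.Thread(target=worker)
t.setDaemon(True)
t.start()

for i in range(10):
    q.put(i)

# join() blocks until every item that was put() has had a matching
# task_done(); forgetting task_done() would make this hang forever.
q.join()
print "all 10 items processed"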

Edit 2: Try this alternative run() function:

 def run(self):
     while 1:
         print self.getName()+" started"
         url = self.queue.get() # <-- note that we're blocking here to wait for a url from the queue
         self.page = getPage(url)
         self.parsedPage = getParsedPage(self.page, fix=True)
         self.urls = getLinksFromParsedPage(self.parsedPage)

         for url in self.urls:
             self.fp = hashlib.sha1(url).hexdigest() #url-seen check
             Crawler.varLock.acquire() #lock for global variable g_URLs
             if self.fp in g_URLsDict:
                 Crawler.varLock.release() #releasing lock
             else:
                 #print url+" does not exist"
                 Crawler.count +=1
                 print "total links: %d"%len(g_URLsDict)
                 print self.fp
                 g_URLsDict[self.fp] = url
                 Crawler.varLock.release() #releasing lock
                 self.queue.put(url)
                 print self.getName()+ " %d"%self.queue.qsize()

         self.queue.task_done() # <-- We've processed the url this thread pulled off the queue so indicate we're done with it.


