Hello! I am trying to write a web crawler in Python using multithreading. Even after reading the previously suggested articles and textbooks, I still have a problem. My code is below (the full source code is here):
import hashlib
import threading
import Queue

# getPage, getParsedPage and getLinksFromParsedPage are defined in the full source linked above
g_URLsDict = {}   # fingerprint -> url, shared by all crawler threads

class Crawler(threading.Thread):
    varLock = threading.Lock()
    count = 0

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.url = self.queue.get()

    def run(self):
        while 1:
            print self.getName() + " started"
            self.page = getPage(self.url)
            self.parsedPage = getParsedPage(self.page, fix=True)
            self.urls = getLinksFromParsedPage(self.parsedPage)

            for url in self.urls:
                self.fp = hashlib.sha1(url).hexdigest()
                # url-seen check
                Crawler.varLock.acquire()  # lock for the global variable g_URLsDict
                if self.fp in g_URLsDict:
                    Crawler.varLock.release()  # releasing lock
                else:
                    # print url + " does not exist"
                    Crawler.count += 1
                    print "total links: %d" % len(g_URLsDict)
                    print self.fp
                    g_URLsDict[self.fp] = url
                    Crawler.varLock.release()  # releasing lock
                    self.queue.put(url)
                    print self.getName() + " %d" % self.queue.qsize()

            self.queue.task_done()
            # self.queue.task_done()
            # self.queue.task_done()

print g_URLsDict

queue = Queue.Queue()
queue.put("http://www.ertir.com")

for i in range(5):
    t = Crawler(queue)
    t.setDaemon(True)
    t.start()

queue.join()
It does not work as expected: it prints no output after Thread-1, it behaves differently from run to run, and sometimes I get this error:
Exception in thread Thread-2 (most likely raised during interpreter shutdown):
How can I fix this? Also, I do not see how this is any more efficient than a plain for loop.
I tried to fix run():
def run(self):
    while 1:
        print self.getName() + " started"
        self.page = getPage(self.url)
        self.parsedPage = getParsedPage(self.page, fix=True)
        self.urls = getLinksFromParsedPage(self.parsedPage)

        for url in self.urls:
            self.fp = hashlib.sha1(url).hexdigest()
I have experimented with calling task_done() in different places; can anyone explain what difference the placement makes?
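For reference, this is the worker pattern I understand the Queue module documentation to recommend, where get() is called inside the loop and each successful get() is paired with exactly one task_done(). This is only a minimal standalone sketch: worker() and the sample items are placeholders, not my crawler code.

import threading
import Queue

queue = Queue.Queue()

def worker():
    while True:
        item = queue.get()            # fetch a new task on every iteration
        print "processing %s" % item
        queue.task_done()             # exactly one task_done() per successful get()

for i in range(5):
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

for item in ["a", "b", "c"]:
    queue.put(item)

queue.join()  # returns once every put() item has been matched by a task_done()

In my Crawler class, by contrast, get() is called only once in __init__ and task_done() runs once per iteration of the while loop, which seems not to match this pattern.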