I use BeautifulSoup to parse HTML files, and all the scripts I have written work fine, but slowly. That is why I am experimenting with a multiprocessing pool of workers together with BeautifulSoup, so my program can run faster (I have 100,000 to 1,000,000 HTML files to open). My real script is more complex, but I have reduced it to a small example here. When I try to run it, I keep getting the error:
RuntimeError: maximum recursion depth exceeded while pickling an object
Code

from bs4 import BeautifulSoup
from multiprocessing import Pool

def extraction(path):
    # parse one HTML file and return its <title> tag
    soup = BeautifulSoup(open(path), "lxml")
    return soup.title

pool = Pool(processes=4)
path = ['/Volume3/2316/http/www.metro.co.uk/news/852300-haiti-quake-victim-footballers-stage-special-tournament/crawlerdefault.html',
        '/Volume3/2316/http/presszoom.com/story_164020.html']
print pool.map(extraction, path)
pool.close()
pool.join()
After searching around and reading some posts, I found that the error occurs because the BeautifulSoup object exceeds the recursion limit of the Python interpreter. I tried raising the limit and running the same program (I went up as high as 3000), but the error stayed the same. I stopped raising the limit because the real problem seems to lie in how BeautifulSoup handles the HTML files, not in the limit itself.
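For reference, this is roughly how I raised the limit (assuming sys.setrecursionlimit is what the posts meant; 3000 was the highest value I tried):

import sys

# raise the interpreter's recursion limit before creating the pool
# (the default is 1000; even 3000 did not make the error go away)
sys.setrecursionlimit(3000)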
Using multiprocessing with BeautifulSoup would speed up my runtime considerably, but I cannot figure out how to apply it to opening and parsing the files.
Does anyone have a different approach to using BeautifulSoup with multiprocessing, or know how to handle errors like this?
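For comparison, a plain sequential version of the same extraction runs without errors for me (just far too slowly for my number of files), so the problem seems to appear only once the Pool has to pickle the returned objects:

from bs4 import BeautifulSoup

def extraction(path):
    # same function as above: parse one file and return its <title> tag
    soup = BeautifulSoup(open(path), "lxml")
    return soup.title

# sequential loop: works fine, but is too slow for 100,000+ files
path = ['/Volume3/2316/http/www.metro.co.uk/news/852300-haiti-quake-victim-footballers-stage-special-tournament/crawlerdefault.html',
        '/Volume3/2316/http/presszoom.com/story_164020.html']
for p in path:
    print extraction(p)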
Any help would be appreciated; I have sat for hours trying to fix this and to understand why I get the error.
Edit
I tested the above code with the files that I specified in the paths and got the same RuntimeError as above.
The files can be accessed here: http://ec2-23-20-166-224.compute-1.amazonaws.com/sites/html_files/