Recursion depth error when using BeautifulSoup with multiprocessing pool map

I use BeautifulSoup to parse HTML files, and all the scripts I wrote work fine, but slowly. Therefore I am experimenting with a multiprocessing pool of workers together with BeautifulSoup so my program can run faster (I have 100,000 - 1,000,000 HTML files to open). I wrote a more complex script, but here is a small example. When I try to do something like this, I keep getting the error:

 RuntimeError: maximum recursion depth exceeded while pickling an object

Edited Code

    from bs4 import BeautifulSoup
    from multiprocessing import Pool

    def extraction(path):
        soup = BeautifulSoup(open(path), "lxml")
        return soup.title

    pool = Pool(processes=4)
    path = ['/Volume3/2316/http/www.metro.co.uk/news/852300-haiti-quake-victim-footballers-stage-special-tournament/crawlerdefault.html',
            '/Volume3/2316/http/presszoom.com/story_164020.html']
    print pool.map(extraction, path)
    pool.close()
    pool.join()

After searching and reading some posts, I found out that the error occurs because BeautifulSoup exceeds the stack depth of the Python interpreter. I tried raising the limit and running the same program (I went up to 3000), but the error remains the same. I stopped raising the limit because the problem is in how BeautifulSoup is used on the HTML files, not in the limit itself.
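For reference, raising the interpreter's limit looks like this (3000 is the value I tried; as noted above, it does not fix the underlying problem, because the deep recursion happens while the returned soup object is being processed):

```python
import sys

# CPython's default recursion limit is usually 1000.
print(sys.getrecursionlimit())

# Raise it; this only changes when RuntimeError is raised,
# it does not make pickling a whole parse tree any cheaper.
sys.setrecursionlimit(3000)
```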

Using multiprocessing with BeautifulSoup would speed up my runtime, but I cannot figure out how to combine them when opening the files.

Does anyone have a different approach to using BeautifulSoup with multiprocessing, or a way to handle errors like this?

Any help would be appreciated; I have spent hours trying to fix this and to understand why I get the error message.

Edit

I tested the above code with the files that I specified in the paths and got the same RuntimeError as above.

Files can be accessed here ( http://ec2-23-20-166-224.compute-1.amazonaws.com/sites/html_files/ )

python multiprocessing beautifulsoup
1 answer

I think the reason is that you return the whole soup.title object. Every element holds references to its children and parents, their children and parents, and so on, so the returned tag drags the entire parse tree along with it. When multiprocessing pickles the return value to send it back from the worker process, walking that graph exceeds the recursion limit.

If the string content of the object is all you need, you can simply call its str method:

 return soup.title.__str__() 

Unfortunately, this means that you no longer have access to all other information provided by the bs4 library.
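A minimal sketch of the idea, applied to an HTML string for brevity (the "html.parser" backend is the stdlib parser; the question's code passes an open file and "lxml" instead, which works the same way):

```python
from bs4 import BeautifulSoup

def extraction(html_text):
    # Parse, then return a plain str: strings pickle cheaply,
    # whereas a bs4 Tag references the whole parse tree.
    soup = BeautifulSoup(html_text, "html.parser")
    return str(soup.title)

# Usage: the worker now hands back only the serialized tag.
print(extraction("<html><head><title>Demo</title></head></html>"))
# → <title>Demo</title>
```

Because the mapped function now returns a picklable string, pool.map can collect the results from the workers without the recursion error.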
