I'm having trouble typing this right now because, according to top, my processor is at 100% and my memory is at 85.7%, and all of it is occupied by python.
Why? Because I had it run through a 250-megabyte file to remove the markup. 250 megabytes, that's all! I've manipulated these files in python with a variety of other modules and tools; BeautifulSoup is the first code to give me any problems with something so small. How do nearly 4 gigabytes of RAM get used to process 250 megs of HTML?
The one-liner that I found (on stackoverflow) and have been using is this:
''.join(BeautifulSoup(corpus).findAll(text=True))
Also, this seems to remove everything except the markup, which is sort of the opposite of what I want to do. I'm sure BeautifulSoup can do that too, but the speed issue remains.
Is there anything that will do something like this (remove the markup, reliably keep the text) and NOT require a Cray to run?
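One memory-light option (a sketch I'm adding here, not from the original post) is the standard library's streaming `html.parser.HTMLParser`: it fires callbacks as it reads, never builds a full document tree, so memory stays roughly proportional to the text you keep rather than to the parsed tree BeautifulSoup constructs. The class and function names below are my own, purely illustrative:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML stream, skipping tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of character data; tags never reach here.
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

def strip_markup(html):
    # Could also call parser.feed() chunk by chunk while reading the
    # file, so the whole 250 MB never has to sit in memory at once.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `strip_markup('<p>Hello <b>world</b></p>')` yields `'Hello world'`. Note this naive version also passes through the contents of `<script>` and `<style>` elements as data; a real pass would track those tags in `handle_starttag`/`handle_endtag` and drop their contents.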
performance python html parsing beautifulsoup
Waxprolix