A faster / less resource-intensive way to strip HTML from large files than BeautifulSoup? Or, the best way to use BeautifulSoup?

I'm having trouble typing this right now because, according to top, my processor is at 100% and my memory at 85.7%, all of it taken up by Python.

Why? Because I set it loose on a 250-megabyte file to strip the markup. 250 megabytes, that's all! I've manipulated these files in Python with plenty of other modules and tools; BeautifulSoup is the first code to give me any problems with something so small. How does it take nearly 4 gigabytes of RAM to process 250 megs of HTML?

The one-liner that I found (on Stack Overflow) and used was the following:

 ''.join(BeautifulSoup(corpus).findAll(text=True)) 

Also, this seems to remove everything except the markup, which is rather the opposite of what I want to do. I'm sure BeautifulSoup can do that too, but the speed issue remains.

Is there anything that will do something like this (remove the markup, reliably leave the text) and not require a Cray to run?
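For comparison, Python's standard-library html.parser is a streaming (event-driven) parser, so it never holds the whole tree in memory; this is a minimal markup-stripping sketch (the TextExtractor class and the sample string are my own illustration, not from any answer below):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes, skipping <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

p = TextExtractor()
# feed() can be called repeatedly on chunks read from a large file
p.feed("<html><body><script>var x;</script><p>Hello <b>world</b></p></body></html>")
text = "".join(p.chunks)
```

Because feed() accepts incremental chunks, a 250 MB file can be processed in constant memory by reading and feeding it piece by piece.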

+6
performance python html parsing beautifulsoup
2 answers

lxml.html is more efficient.

http://lxml.de/lxmlhtml.html


http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

It looks like it will do what you want.

 import lxml.html
 t = lxml.html.fromstring("...")
 t.text_content()

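A self-contained version of that call, with sample markup invented for illustration:

```python
import lxml.html

fragment = "<div><h1>Title</h1><p>Some <em>body</em> text.</p></div>"
t = lxml.html.fromstring(fragment)
# text_content() concatenates every text node under the element,
# with the markup removed
text = t.text_content()
```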
A few other similar questions:

python [lxml] - html tag cleanup

lxml.etree, element.text does not return all text from an element

Filter HTML tags and resolve entities in python

UPDATE:

You probably want to clean the HTML to remove all scripts and CSS, and then extract the text using .text_content()

 from lxml import html
 from lxml.html.clean import clean_html

 tree = html.parse('http://www.example.com')
 tree = clean_html(tree)
 text = tree.getroot().text_content()

(From: Delete all html in python? )
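The same cleaning works on an in-memory string, which avoids the network fetch; a sketch with invented sample HTML (note: in recent lxml releases the clean module has moved to the separate lxml_html_clean package, so this import may need adjusting):

```python
from lxml import html
from lxml.html.clean import clean_html

raw = "<html><body><script>alert('hi');</script><p>Visible text</p></body></html>"

# clean_html drops <script> tags, comments, embedded objects, etc. by default
cleaned = clean_html(html.fromstring(raw))
text = cleaned.text_content()
```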

+14

Use the Cleaner from lxml.html:

 >>> import lxml.html
 >>> from lxml.html.clean import Cleaner
 >>> cleaner = Cleaner(style=True)  # also remove <style>; scripts, comments, etc. are removed by default
 >>> body = lxml.html.fromstring(content).xpath('//body')[0]
 >>> print(lxml.html.tostring(cleaner.clean_html(body)))
0
