Rendered HTML to plain text using Python

I am trying to convert a piece of HTML to plain text using BeautifulSoup. Here is an example:

    <div>
      <p>
        Some text
        <span>more text</span>
        even more text
      </p>
      <ul>
        <li>list item</li>
        <li>yet another list item</li>
      </ul>
    </div>
    <p>Some other text</p>
    <ul>
      <li>list item</li>
      <li>yet another list item</li>
    </ul>

I tried to do something like:

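    import re
    import BeautifulSoup  # BeautifulSoup 3 module, matching the calls below

    def parse_text(contents_string):
        # Collapse a line break plus any following whitespace into one newline
        Newlines = re.compile(r'[\r\n]\s+')
        bs = BeautifulSoup.BeautifulSoup(
            contents_string,
            convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
        txt = bs.getText('\n')
        return Newlines.sub('\n', txt)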

... but this way my span always ends up on a new line. This is, of course, a simple example. Is there a way in Python to get the text of an HTML page the way it would be rendered in a browser (no CSS rules required, just the regular way div, span, li, etc. are rendered)?

+25
python beautifulsoup
Nov 12
2 answers

BeautifulSoup is a scraping library, so it's probably not the best choice for rendering HTML. If using BeautifulSoup is not essential, you should take a look at html2text. For example:

    import html2text

    # Read the HTML file and print its plain-text rendering
    html = open("foobar.html").read()
    print(html2text.html2text(html))

This outputs:

 Some text more text even more text

   * list item
   * yet another list item

 Some other text

   * list item
   * yet another list item
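
As far as I can tell, the span ends up on its own line in the question's code because getText('\n') puts the separator between every text node, whether the surrounding tag is inline or block-level. Below is a minimal sketch of the alternative idea, assuming Python 3 and only the standard library: start a new line only at block-level tags. The names BLOCK_TAGS, BlockAwareTextExtractor and html_to_text are made up for this illustration, the tag list is a hand-picked assumption, and this is not how html2text works internally (no list bullets, no Markdown).

```python
from html.parser import HTMLParser

# Assumption: a small hand-picked set of block-level tags. Real browsers
# follow the full HTML/CSS rendering rules, which this sketch ignores.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "br",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockAwareTextExtractor(HTMLParser):
    """Hypothetical helper: collects text, breaking lines only at block tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []    # finished lines
        self.current = []  # text fragments of the line being built

    def flush(self):
        # Join fragments as-is, then collapse runs of whitespace.
        text = " ".join("".join(self.current).split())
        if text:
            self.lines.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Inline tags such as <span> never flush, so their text stays
        # on the same line as the surrounding text.
        self.current.append(data)


def html_to_text(html):
    parser = BlockAwareTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return "\n".join(parser.lines)


print(html_to_text("<p> Some text <span>more text</span> even more text </p>"))
```

For anything beyond simple markup, html2text is still likely the safer choice, since it handles lists, links and entities that this sketch does not format.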
+60
Nov 12

I ran into the same problem trying to parse rendered HTML. BS doesn't seem to be the ideal package for this. @del's html2text solution above works well.

In a different SO question, "BeautifulSoup get_text does not strip all tags and JavaScript", @Helge mentioned using nltk. Unfortunately, nltk seems to be discontinuing this method.

I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, speeds depend heavily on the contents of the data ...

Answer from @Helge (nltk):

    import nltk
    %timeit nltk.clean_html(html)
    # was returning 153 us per loop

It worked really well, returning a string with the rendered HTML. This nltk function was faster than html2text, although html2text is perhaps more robust.

Answer above from @del:

    betterHTML = html.decode(errors='ignore')
    %timeit html2text.html2text(betterHTML)
    # 3.09 ms per loop
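
The %timeit lines above only work inside IPython. As a rough sketch, assuming plain Python 3, a similar measurement can be taken with the standard timeit module. The helper time_per_call and the placeholder_converter function are made up for this illustration; the placeholder only exists so the snippet runs without third-party packages, and you would pass html2text.html2text or another converter in its place.

```python
import timeit


def time_per_call(func, arg, number=200, repeat=3):
    """Return the best average time per call, in seconds."""
    timings = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(timings) / number


# Hypothetical stand-in converter, NOT a real HTML-to-text conversion.
# Replace it with html2text.html2text to time the real library.
def placeholder_converter(html):
    return html.replace("<p>", "").replace("</p>", "\n")


sample = "<p>Some other text</p>" * 50
seconds = time_per_call(placeholder_converter, sample)
print("%.2f us per call" % (seconds * 1e6))
```

Taking the minimum over several repeats tends to reduce noise from other processes, but the numbers still depend heavily on the input document.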
+2
Nov 05 '13 at 17:53


