Rendered HTML to plain text using Python

Question

I am trying to convert a piece of HTML to plain text using BeautifulSoup. Here is an example:

    <div>
      <p>
        Some text <span>more text</span> even more text
      </p>
      <ul>
        <li>list item</li>
        <li>yet another list item</li>
      </ul>
    </div>
    <p>Some other text</p>
    <ul>
      <li>list item</li>
      <li>yet another list item</li>
    </ul>

I tried to do something like:

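    import re
    import BeautifulSoup  # BeautifulSoup 3 API

    def parse_text(contents_string):
        # Collapse a line break plus any following whitespace into one newline
        Newlines = re.compile(r'[\r\n]\s+')
        bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
        txt = bs.getText('\n')
        return Newlines.sub('\n', txt)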

... but this way my span is always on a new line. This, of course, is a simple example. Is there a way to get the text on the HTML page the way it will be displayed in the browser (no CSS rules required, just regular div, span, li, etc.) in Python?

python beautifulsoup

btatarov Nov 12

2 answers

I ran into the same problem trying to parse rendered HTML. BeautifulSoup doesn't seem to be the ideal package for this. @del gives a great html2text solution.

In a different SO question, "BeautifulSoup get_text does not strip all tags and JavaScript", @Helge mentioned using nltk. Unfortunately, nltk seems to be discontinuing this method.

I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, speeds depend heavily on the contents of the data ...

Answer from @Helge (nltk):

    import nltk
    %timeit nltk.clean_html(html)
    # 153 us per loop

It worked really well, returning a string with the rendered HTML. This nltk function was faster than html2text, although html2text is perhaps more reliable.

Answer from @del (html2text):

    betterHTML = html.decode(errors='ignore')
    %timeit html2text.html2text(betterHTML)
    # 3.09 ms per loop
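The %timeit lines above are IPython magic. Outside IPython, the standard timeit module can give a comparable per-call figure. The following is only a minimal sketch: time_per_loop and strip_whitespace are hypothetical names introduced here, and strip_whitespace is a stand-in converter so the snippet runs without extra packages. Swap in html2text.html2text (or another converter) to repeat the comparison on your own data.

```python
import timeit


def time_per_loop(func, arg, number=1000):
    """Return the average seconds per call of func(arg) over `number` runs."""
    total = timeit.timeit(lambda: func(arg), number=number)
    return total / number


# Stand-in converter (an assumption for illustration only); replace it with
# html2text.html2text or similar to time a real HTML-to-text conversion.
def strip_whitespace(html):
    return " ".join(html.split())


html = "<p>Some text <span>more text</span> even more text</p>"
seconds = time_per_loop(strip_whitespace, html)
print("%.2f us per loop" % (seconds * 1e6))
```

As noted above, the numbers depend heavily on the input, so it is worth timing with a document that resembles your real data.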

Paul Nov 05 '13 at 17:53

del · Accepted Answer · 2012-11-12 03:09

BeautifulSoup is a scraping library, so it's probably not the best choice for rendering HTML. If using BeautifulSoup is not essential, you should take a look at html2text. For example:

    import html2text

    # Read the HTML document and print its plain-text rendering
    html = open("foobar.html").read()
    print(html2text.html2text(html))

This outputs:

 Some text more text even more text

   * list item
   * yet another list item

 Some other text

   * list item
   * yet another list item
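If installing html2text is not an option, the underlying idea (break lines only at block-level elements and keep inline elements such as span on the same line) can be roughly sketched with the standard library's html.parser. This is not how html2text works internally. BlockTextExtractor, html_to_text and the BLOCK_TAGS set are hypothetical names chosen for this sketch, and the tag list is an assumption that covers only the simple elements from the question.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, hand-picked set of block-level tags. Real browsers
# apply many more rules (CSS, tables, <pre>, and so on).
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "br",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextExtractor(HTMLParser):
    """Collect text, starting a new line only at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        if tag == "li":
            self.parts.append("* ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse whitespace runs, as a browser does for normal flow text.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Trim each line and drop the empty ones left by adjacent block tags.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


sample = ("<div> <p> Some text <span>more text</span> even more text </p> "
          "<ul> <li>list item</li> <li>yet another list item</li> </ul> </div> "
          "<p>Some other text</p>")
print(html_to_text(sample))
```

With the sample from the question, the span text stays on the same line as the surrounding paragraph text, and each list item gets its own line. For anything beyond simple markup, a dedicated library such as html2text is likely the safer choice.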
