BeautifulSoup get_text does not strip all tags and JavaScript

I am trying to use BeautifulSoup to get text from web pages.

Below is the script I wrote for this. It takes two arguments: the first is the input HTML or XML file, the second is the output file.

    import sys
    from bs4 import BeautifulSoup

    def stripTags(s):
        return BeautifulSoup(s).get_text()

    def stripTagsFromFile(inFile, outFile):
        open(outFile, 'w').write(stripTags(open(inFile).read()).encode("utf-8"))

    def main(argv):
        if len(sys.argv) != 3:
            print 'Usage:\t\t', sys.argv[0], 'input.html output.txt'
            return 1
        stripTagsFromFile(sys.argv[1], sys.argv[2])
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv))

Unfortunately, for many web pages, for example: http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location I get something like this (I only show the first few lines):

 html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" Education Manager Job In London With Caleeda | Great Jobs In Teaching var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-15255540-21']); _gaq.push(['_trackPageview']); _gaq.push(['_trackPageLoadTime']); 

Is there something wrong with my script? I tried to pass "xml" as the second argument to the BeautifulSoup constructor, as well as "html5lib" and "lxml", but that does not help. Is there an alternative to BeautifulSoup that will work better for this task? All I want to do is extract the text that will be displayed in the browser for this web page.

Any help would be greatly appreciated.

+7
python html xml screen-scraping beautifulsoup
May 09 '12 at 21:31
3 answers

nltk's clean_html() is pretty good at that!

Assuming you already have the html stored in an html variable, like

    import urllib

    html = urllib.urlopen(address).read()

then just use

    import nltk
    clean_text = nltk.clean_html(html)

UPDATE

Support for clean_html and clean_url will be removed in future versions of nltk. Please use BeautifulSoup for now... this is very unfortunate.

An example of how to do this is provided on this page:

BeautifulSoup4 get_text still has javascript
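Since the deprecation notice only says "use BeautifulSoup" without showing how, here is a minimal sketch of that route (assumes bs4 is installed; the sample markup below is invented for illustration): remove the script/style elements first, then call get_text().

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for a real downloaded page.
html = """<html><head><title>Sample</title>
<script>var _gaq = _gaq || [];</script></head>
<body><p>Visible text.</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() alone would include the script body; decompose() the
# non-displayed elements before extracting the text.
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.body.get_text(" ", strip=True)
print(text)  # -> Visible text.
```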

+14
Nov 14 '12 at 19:48

That was the problem I encountered too: no solution seemed able to return the text that would actually be displayed in a web browser. Other answers mentioned that BS is not ideal for rendering and that html2text is a good approach. I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speed delta may depend heavily on the contents of the data...

One answer here from @Helge was to use nltk of all things.

    import nltk
    %timeit nltk.clean_html(html)
    # 153 µs per loop

It worked very well to return a string with the html stripped out. This nltk module was faster than html2text, though perhaps html2text is more reliable.

    betterHTML = html.decode(errors='ignore')
    %timeit html2text.html2text(betterHTML)
    # 3.09 ms per loop
+1
Nov 05 '13 at 17:48

Here's an approach based on the answer here: BeautifulSoup Grab Visible Webpage Text by jbochi. This approach filters out comments and elements whose text is not shown on the page, and does a bit of output clean-up by removing newlines, consolidating whitespace, etc.

    import re
    import urllib
    import BeautifulSoup

    html = urllib.urlopen(address).read()
    soup = BeautifulSoup.BeautifulSoup(html)
    texts = soup.findAll(text=True)

    def visible_text(element):
        # Skip text that lives inside elements the browser never displays.
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return ''
        # Strip HTML comments, carriage returns, and newlines.
        result = re.sub('<!--.*-->|\r|\n', '', str(element), flags=re.DOTALL)
        # Collapse runs of whitespace and non-breaking spaces.
        result = re.sub('\s{2,}|&nbsp;', ' ', result)
        return result

    visible_elements = [visible_text(elem) for elem in texts]
    page_text = ''.join(visible_elements)
    print(page_text)
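For completeness, the same visibility-filtering idea can be sketched with only the standard library (Python 3 html.parser; the class name and sample markup here are illustrative assumptions, not from the original answer):

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text while skipping anything inside script/style/head/title."""
    SKIP = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep data that is outside the skipped elements.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

parser = VisibleTextParser()
parser.feed("<html><head><script>var x = 1;</script></head>"
            "<body><p>Hello</p><p>world</p></body></html>")
print(parser.text())  # -> Hello world
```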
0
May 10 '12 at 22:58
