I am trying to use BeautifulSoup to get text from web pages.
Below is the script I wrote for this. It takes two arguments: the first is the input HTML or XML file, the second is the output file.

    import sys
    from bs4 import BeautifulSoup

    def stripTags(s):
        return BeautifulSoup(s).get_text()

    def stripTagsFromFile(inFile, outFile):
        open(outFile, 'w').write(stripTags(open(inFile).read()).encode("utf-8"))

    def main(argv):
        if len(sys.argv) <> 3:
            print 'Usage:\t\t', sys.argv[0], 'input.html output.txt'
            return 1
        stripTagsFromFile(sys.argv[1], sys.argv[2])
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv))

Unfortunately, for many web pages, for example http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location, I get something like this (only the first few lines are shown):

    html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    Education Manager Job In London With Caleeda | Great Jobs In Teaching
    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-15255540-21']);
    _gaq.push(['_trackPageview']);
    _gaq.push(['_trackPageLoadTime']);

Is there something wrong with my script? I tried to pass "xml" as the second argument to the BeautifulSoup constructor, as well as "html5lib" and "lxml", but that does not help. Is there an alternative to BeautifulSoup that will work better for this task? All I want to do is extract the text that will be displayed in the browser for this web page.
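To make the goal concrete, here is a rough sketch of the behaviour I am after, written with only the standard library's `html.parser` rather than BeautifulSoup. It assumes Python 3, and the names `VisibleTextExtractor`, `SKIPPED` and `visible_text` are made up for illustration. It simply ignores the doctype and anything inside `script`, `style` and `title`; it knows nothing about CSS, so text hidden by a stylesheet would still come through.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping elements whose content is not rendered in the page."""

    # Assumed, non-exhaustive list of elements whose text is not shown in the page body.
    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # The doctype never arrives here: HTMLParser routes it to handle_decl.
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the text fragments of `html`, one per line, minus skipped elements."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `visible_text(open(inFile).read())` in place of `stripTags` is roughly what I would like my script to do; whether BeautifulSoup can be made to behave this way is what I am unsure about.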
Any help would be greatly appreciated.
python html xml screen-scraping beautifulsoup
piokuc, May 09 '12 at 21:31