Python HTML File Processing

Question


I don't know much about HTML. How can I extract only the text from a page? For example, if the HTML page reads like:

<meta name="title" content="How can I make money at home online? No gimmacks please? - Yahoo! Answers">
<title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title>

I just want to extract this:

How can I make money at home online? No gimmicks please? - Yahoo! Answers

I am using the re module:

import re

def striphtml(data):
  # Replace anything that looks like a tag with a single space
  p = re.compile(r'<.*?>')
  return p.sub(' ', data)

but it still does not do what I intend.

The function above is called like this:

for lines in filehandle.readlines():

        #k = str(section[6].strip())
        myFile.write(lines)

        lines = striphtml(lines)
        content.append(lines)

0

python html html-parsing

Fraz Jan 9 '12 at 2:43


3 answers

Use an HTML parser for this, for example BeautifulSoup.

Get the text content of the page:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(your_html)
# Collect every text node in the document
text_nodes = soup.findAll(text=True)
result = ' '.join(text_nodes)
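If installing BeautifulSoup is not an option, the same idea (collect the text nodes and join them) can be sketched with the standard library's html.parser. This is only a rough equivalent, not what BeautifulSoup does internally; the names TextCollector and extract_text are made up for this example, and skipping script/style content is an assumption about what "only text" should mean.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty text nodes, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return all visible text in `html`, joined by single spaces."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return " ".join(parser.chunks)
```

Note that this should be fed the whole document at once (for example filehandle.read()) rather than line by line, since a tag may span several lines.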

+2

soulcheck Jan 9 '12 at 2:58


I usually use http://lxml.de/ for parsing HTML. It is very easy to use, and you can use XPath with it to select tags, which simplifies and speeds up the work.

I have an example of its use in a script that I wrote to read the XML root and count words:

https://gist.github.com/1425228

You can also find more examples in the documentation: http://lxml.de/lxmlhtml.html

+1

Arthur neves Jan 9 '12 at 2:56


Fabián Heredia Montiel · Accepted Answer · 2012-01-09T02:47:46+0000

Do not use regular expressions to parse HTML/XML. Instead, try BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('Your resource<title>hi</title>')
soup.title.string # Your title string.
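For the specific case in the question (only the <title> text is wanted), a minimal standard-library sketch is shown here for comparison. It does not use BeautifulSoup; TitleParser and get_title are hypothetical names chosen for this example, and it assumes the first <title> element is the one you want.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Remember the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.title = None       # stays None if no <title> is found
        self._in_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)


def get_title(html):
    """Return the page title as a string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Called on the two lines from the question, get_title returns the title text and ignores the <meta> tag, because the meta tag's content lives in an attribute rather than in a text node.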
