Python parser

In the Python documentation I found an HTML parser, but I have no idea which library to import in order to use it. How do I find that out (bearing in mind that the page doesn't say)?

+7
python import
Sep 16 '08 at 10:49
8 answers

Try:

import HTMLParser 

In Python 3.0, the HTMLParser module was renamed to html.parser; the Python 2 documentation for HTMLParser mentions the rename.

Python 3.0

 import html.parser 

Python 2.2 and later

 import HTMLParser 
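
If the same script has to run under both Python 2 and Python 3, one common pattern is to try the new module name first and fall back to the old one. This is only a minimal sketch; the variable name `parser` and the sample markup are made up for illustration.

```python
# Try the Python 3 module name first, then fall back to the Python 2 name.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

# The base class parses its input but ignores everything it finds;
# subclass it and override the handle_* methods to do real work.
parser = HTMLParser()
parser.feed('<p>hello</p>')
parser.close()
```

After this, the rest of the code can refer to `HTMLParser` without caring which Python version it is running on.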
+13
September 16 '08 at 10:51

You probably really want BeautifulSoup; check out the link for an example.

But anyway:

 >>> import HTMLParser
 >>> h = HTMLParser.HTMLParser()
 >>> h.feed('<html></html>')
 >>> h.get_starttag_text()
 '<html>'
 >>> h.close()
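
In practice the parser is usually subclassed, with the `handle_*` methods overridden to react to what it finds. Below is a minimal sketch that collects link targets; the class name `LinkCollector` and the sample markup are hypothetical, and the import is written so that it should work on either Python 2 or Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="http://example.com/">one</a> <a href="/two">two</a></p>')
collector.close()
print(collector.links)  # ['http://example.com/', '/two']
```

The same approach works for text and closing tags by overriding `handle_data` and `handle_endtag`.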
+23
Sep 16 '08 at 10:54

I would recommend using Beautiful Soup instead; it has good documentation.

+4
16 Sept. '08 at 10:54

You should also look at html5lib for Python, as it tries to parse HTML in a way that is very similar to what web browsers do, especially when dealing with invalid HTML (which makes up more than 90% of today's web).

+4
Sep 16 '08 at 12:14

You may be interested in lxml. It is a separate package and has C components, but it is the fastest option. It also has a very nice API that makes it easy to list the links in an HTML document, list forms, sanitize HTML, and more. It can also parse malformed HTML (this is configurable).

+4
Sep 17 '08 at 11:19

I do not recommend BeautifulSoup if you want speed. lxml is much faster, and you can fall back to lxml's BeautifulSoup-based soupparser if the default parser doesn't work.

+3
September 16 '08 at 13:21

There is a link to an example at the bottom of the page (http://docs.python.org/2/library/htmlparser.html); it just doesn't work with the default python or with python3. It has to be python2, as it says at the top of the page.

+1
Sep 16 '08 at 10:52

For real-world HTML processing, I would recommend BeautifulSoup. It is great and takes away most of the pain. Installation is simple.

+1
Sep 16 '08 at 10:55