Getting position information when parsing HTML in Python

I am trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document along with its position (line, column). The position information is what is tripping me up. To be clear, I do not need to build an object tree. I just want to find specific pieces of data and their position in the original document (think of a spell checker, for example: 'the word "foo" at line x, column y is misspelled').

As an example, I want something like this (using ElementTree's Target API):

import xml.etree.ElementTree as ET

class EchoTarget:
    # somecondition() is a placeholder, and getpos() is the call I wish
    # existed on the target; neither is defined here.
    def start(self, tag, attrib):
        if somecondition():
            print("start", tag, attrib, self.getpos())
    def end(self, tag):
        if somecondition():
            print("end", tag, self.getpos())
    def data(self, data):
        if somecondition():
            print("data", repr(data), self.getpos())

target = EchoTarget()
parser = ET.XMLParser(target=target)
parser.feed("<p>some text</p>")
parser.close() 

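However, as far as I can tell, a getpos() call (or anything like it) does not exist there. And, of course, that is an XML parser. I want to parse HTML.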

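Now, I know that HTMLParser in the Python Standard Lib does provide position information (via getpos()), but it handles malformed HTML poorly. I need to parse real-world HTML, valid or not, without the parser breaking.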
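For reference, here is a minimal sketch of what getpos() reports in the standard library parser. It assumes Python 3, where the module is html.parser; the EchoParser name and the events list are my own illustration, not part of any library.

```python
from html.parser import HTMLParser


class EchoParser(HTMLParser):
    """Records (event, payload, (line, column)) for each parse event."""

    def __init__(self):
        super().__init__()
        self.events = []  # illustrative container, not a library attribute

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, self.getpos()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, self.getpos()))

    def handle_data(self, data):
        self.events.append(("data", data, self.getpos()))


parser = EchoParser()
parser.feed("<p>some text</p>\n<p>more</p>")
parser.close()
for event in parser.events:
    print(event)
```

Line numbers start at 1 and column offsets at 0, and each position marks where the tag or text run begins in the source.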

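For parsing real-world HTML, including broken HTML, I know of lxml and html5lib. In fact, I would prefer one of them over anything else available in Python.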

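However, as far as I can tell, html5lib offers no event API and requires the document to be parsed into a tree. I would then have to walk that tree. By that point, of course, the link to the source document is gone, and so is the position information. So html5lib is out, which is a shame, because it seems to be the best at handling malformed HTML.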

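lxml offers a Target API that mirrors ElementTree's, but, again, I am not aware of any way to get position information for each event. A look at the source code gave no hints either.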

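lxml also offers a SAX API. Interestingly, the Python standard lib mentions that SAX supports Locator Objects, but there is little documentation on how to use them. An SO question provides some information (for a SAX Parser), but I do not see how it applies to the limited SAX support that lxml provides.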

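Finally, before anyone suggests Beautiful Soup, I will point out that, as its home page states, "Beautiful Soup sits on top of popular Python parsers like lxml and html5lib". All it gives me is an object to pull data out of, with no link back to the original source document. As with html5lib, all position information is lost by the time I get access to the data. I want/need direct access to the parser.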

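To expand on the spell checker example from the beginning, I would want to check the spelling only of words in the document text (not tag names or attributes), and may want to skip the content of certain tags (such as script). So I need a real HTML parser. However, I only care about the results in relation to the original source document, and I have no need for a tree. An event-style API like HTMLParser's would be enough, if only it coped with malformed markup. Unfortunately, lxml and html5lib do not seem to offer that.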
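To make the spell checker use case concrete, here is a rough sketch built on the standard library's html.parser (Python 3). Everything named here is an assumption for illustration: the WordLocator class, the SKIP_TAGS set, and the tiny KNOWN_WORDS set that stands in for a real dictionary. Character references are left unconverted so that offsets inside each text run still match the source.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "code"}         # assumed list of tags to ignore
KNOWN_WORDS = {"some", "text", "the", "word"}   # stand-in for a real dictionary


class WordLocator(HTMLParser):
    """Collects (word, line, column) for unknown words in text nodes."""

    def __init__(self):
        # Keep "&amp;" and friends out of handle_data so that offsets
        # within each data chunk line up with the source text.
        super().__init__(convert_charrefs=False)
        self.skip_depth = 0
        self.misspelled = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        line, col = self.getpos()  # position of the first character of data
        for match in re.finditer(r"[A-Za-z']+", data):
            before = data[:match.start()]
            newlines = before.count("\n")
            if newlines:
                # The word sits on a later line of the same text run.
                word_line = line + newlines
                word_col = len(before) - before.rfind("\n") - 1
            else:
                word_line, word_col = line, col + match.start()
            if match.group().lower() not in KNOWN_WORDS:
                self.misspelled.append((match.group(), word_line, word_col))


locator = WordLocator()
locator.feed("<p>some text</p>\n<p>the wrod <script>var foo;</script>foo</p>")
locator.close()
print(locator.misspelled)  # [('wrod', 2, 7), ('foo', 2, 37)]
```

Tag names, attributes, and the script body are never checked. Only the text runs are, and each hit carries its line and column in the original document.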

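So, given all of this, is there an existing option I have missed? Otherwise, I am left with building my own (say, starting from HTMLParser).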

+4
2

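After a closer look at the html5lib source, I found that html5lib.tokenizer.HTMLTokenizer does keep partial position information. By "partial", I mean that it knows the line and column of the last character of a given token. Unfortunately, it does not keep the position where the token starts (I suppose that could be worked out, but it feels like reimplementing much of the tokenizer in reverse).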

I was able to wrap HTMLTokenizer in a class that largely replicates HTMLParser's API. The code is here: https://gist.github.com/waylan/7d5b7552078f1abc6fac.

, , html5lib, html5lib. , , ( ) , . , , , .

Interestingly, it turns out that HTMLParser, which ships with Python, was improved as of Python 3.3 and now accepts invalid markup. html5lib, on the other hand, works on both Python 2 and Python 3.
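A quick way to check how the standard library parser copes with broken markup on a current Python 3 is a sketch like the following. The TolerantEcho class and the sample string are illustrative only.

```python
from html.parser import HTMLParser


class TolerantEcho(HTMLParser):
    """Logs tag events with their positions; text is ignored for brevity."""

    def __init__(self):
        super().__init__()
        self.log = []

    def handle_starttag(self, tag, attrs):
        self.log.append(("start", tag, self.getpos()))

    def handle_endtag(self, tag):
        self.log.append(("end", tag, self.getpos()))


# Unclosed <b> and <i>, plus a stray </span> that was never opened.
broken = "<p><b>unclosed <i>tags</p> stray </span>"
echo = TolerantEcho()
echo.feed(broken)  # no exception is raised on this input
echo.close()
for entry in echo.log:
    print(entry)
```

The parser reports tags in source order and makes no attempt to repair the nesting, so any balancing or sanity checking is left to the caller.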

, HTMLParser html5lib. , , , .


According to the Beautiful Soup docs, HTMLParser's handling of bad markup was already improved in Python 2.7.3 and 3.2.2, that is, before 3.3.

+2

- html5lib API , API HTML (, <table>xxx). API html5lib, , . , .

, html5lib ( , , !), , , lxml.

, html5lib, - . ( , , ), . , , , API .

+1
