I am trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document along with its position (line, column). The position information is what is tripping me up. To be clear, I do not need to build an object tree. I just want to find specific pieces of data and their position in the original document (think of a spell checker, for example: 'the word "foo" at line x, column y, is misspelled').
As an example, I want something like this (using the ElementTree Target API):
    import xml.etree.ElementTree as ET

    class EchoTarget:
        # getpos() is the call I wish existed; somecondition() stands in
        # for whatever test decides that an event is worth reporting.
        def start(self, tag, attrib):
            if somecondition():
                print("start", tag, attrib, self.getpos())

        def end(self, tag):
            if somecondition():
                print("end", tag, self.getpos())

        def data(self, data):
            if somecondition():
                print("data", repr(data), self.getpos())

    target = EchoTarget()
    parser = ET.XMLParser(target=target)
    parser.feed("<p>some text</p>")
    parser.close()
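However, as far as I can tell, the getpos() call (or anything like it) does not exist. And, of course, this uses an XML parser. I want to parse HTML.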
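To make the wish concrete: for well-formed XML, the standard library's expat bindings do expose the current line and column inside each handler. The sketch below only illustrates the behaviour I am after. It is not a solution, because expat rejects malformed HTML, and `echo_with_positions` is a name made up for this example.

```python
import xml.parsers.expat


def echo_with_positions(xml_text):
    """Collect (event, payload, line, column) tuples for well-formed XML."""
    events = []
    parser = xml.parsers.expat.ParserCreate()

    def pos():
        # expat counts lines from 1 and columns from 0.
        return parser.CurrentLineNumber, parser.CurrentColumnNumber

    def start(tag, attrib):
        events.append(("start", tag, *pos()))

    def end(tag):
        events.append(("end", tag, *pos()))

    def data(text):
        # Note: expat may deliver one run of text in several chunks.
        events.append(("data", text, *pos()))

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = data
    parser.Parse(xml_text, True)  # True marks the final chunk of input
    return events


for event in echo_with_positions("<a>\n<b>x</b>\n</a>"):
    print(event)
```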
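Interestingly, the HTMLParser class in the Python Standard Lib does offer position information (via getpos()), but it copes badly with malformed HTML. I need to parse HTML as it exists in the real world, errors and all, without the parser breaking.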
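For reference, this is roughly what the getpos() support in HTMLParser looks like. It is a minimal sketch written against Python 3's `html.parser`; the `PositionEcho` class and the `echo` helper are names invented for the example.

```python
from html.parser import HTMLParser


class PositionEcho(HTMLParser):
    """Record every event together with the (line, offset) it started at."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, self.getpos()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, self.getpos()))

    def handle_data(self, data):
        self.events.append(("data", data, self.getpos()))


def echo(html_text):
    parser = PositionEcho()
    parser.feed(html_text)
    parser.close()
    return parser.events


for event in echo("<p>some\ntext</p>"):
    print(event)
```

getpos() returns a 1-based line number and a 0-based offset within that line, both referring to the start of the tag or text chunk being handled.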
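I know of two HTML parsers that handle malformed HTML well: lxml and html5lib. In fact, I would prefer to use one of them over the other options available in Python.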
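However, as far as I can tell, html5lib offers no event API, and the document has to be parsed into a tree. I would then have to walk that tree. By that point, of course, the position information is lost. So html5lib is out, which is a pity, because it seems well suited to malformed HTML.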
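lxml offers a Target API that mirrors ElementTree's but, again, as far as I can tell, it gives no access to position information for each event. A look through the source code offered no hints either.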
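lxml also offers an API for SAX events. Interestingly, the Python standard lib mentions that SAX supports Locator Objects, but it says little about how to use them. This SO question gives some information (when using a SAX Parser), but I do not see how it applies to the limited SAX support that lxml provides.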
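As far as I can piece together, Locator Objects are used roughly as follows with the standard library's own SAX parser. This is a sketch for well-formed XML only and says nothing about lxml; `LocatingHandler` and `sax_events` are names made up for the example, and only element events are recorded to keep it short.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LocatingHandler(ContentHandler):
    """Record element events with the position reported by the locator."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.events = []

    def setDocumentLocator(self, locator):
        # The parser hands over a locator before any other event fires.
        self.locator = locator

    def _pos(self):
        return self.locator.getLineNumber(), self.locator.getColumnNumber()

    def startElement(self, name, attrs):
        self.events.append(("start", name, self._pos()))

    def endElement(self, name):
        self.events.append(("end", name, self._pos()))


def sax_events(xml_text):
    handler = LocatingHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


for event in sax_events("<a>\n<b>x</b>\n</a>"):
    print(event)
```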
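Finally, before anyone suggests Beautiful Soup, I will point out that, as its home page says, "Beautiful Soup sits on top of popular Python parsers like lxml and html5lib". As far as I can tell, all it gives me is an object tree with no connection to the source document. As with html5lib, the position information is lost by then. I want/need direct access to the parser.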
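To expand on the spell-checking example from the beginning: I want to check only the words in the document text (not tag names or attributes), and I may want to skip the content of certain tags (such as script). That is why I need a real HTML parser. I only want to report on the document, not modify it, and a useful report has to give the position of each word. HTMLParser would let me do that, but it does not cope with malformed HTML. I would much rather use lxml or html5lib.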
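To show the kind of report I have in mind, here is a rough sketch built on HTMLParser, since it is the only parser I know of that exposes positions. Everything specific in it is an assumption made for the example: the `SpellReporter` and `misspellings` names, the set of skipped tags, the crude word pattern, and a plain set of known words standing in for a real dictionary.

```python
import re
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}  # assumption: tags whose content is not prose
WORD = re.compile(r"[A-Za-z']+")    # assumption: a crude notion of "word"


class SpellReporter(HTMLParser):
    def __init__(self, known_words):
        # convert_charrefs=False keeps text chunks raw, so an offset into a
        # chunk is also an offset into the source document.
        super().__init__(convert_charrefs=False)
        self.known_words = known_words
        self.skip_depth = 0
        self.report = []  # (word, line, column) tuples

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        line, col = self.getpos()  # where this text chunk starts
        for match in WORD.finditer(data):
            before = data[:match.start()]
            newlines = before.count("\n")
            if newlines:
                word_line = line + newlines
                word_col = match.start() - (before.rfind("\n") + 1)
            else:
                word_line = line
                word_col = col + match.start()
            if match.group().lower() not in self.known_words:
                self.report.append((match.group(), word_line, word_col))


def misspellings(html_text, known_words):
    parser = SpellReporter(known_words)
    parser.feed(html_text)
    parser.close()
    return parser.report


doc = "<p>some txet</p>\n<script>var zz = 1;</script>\n<p>more\n  wrods</p>"
for word, line, col in misspellings(doc, {"some", "more"}):
    print('word "%s" at line %d, column %d, is misspelled' % (word, line, col))
```

Tag names, attributes, and the script content never reach the word check, and each reported position points back into the original source text.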
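So, am I missing something? Perhaps some other parser (one that, like HTMLParser, reports positions) handles malformed HTML well. If so, I would like to hear about it.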