I use pyparsing to parse HTML. I grab all the embed tags, but in some cases an <a> tag immediately follows one, and I want to grab that too when it is present.
Example:
    import pyparsing

    target = pyparsing.makeHTMLTags("embed")[0]
    target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
    target.ignore(pyparsing.htmlComment)
    result = target.searchString("""..... <object....><embed>.....</embed></object><br /><a href="blah">blah</a> """)
I could not find the character offsets in the result objects. If I had them, I could just slice a fragment out of the original input string and work from there.
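For illustration, here is a minimal sketch of the "slice and work from there" idea. It assumes that scanString, unlike searchString, yields a (tokens, start, end) tuple for each match. The sample HTML, the variable names, and the rule "any link before the next embed counts as following" are all made up for this sketch and are not from my real input.

```python
import pyparsing as pp

# makeHTMLTags returns (open_tag, close_tag); only the opening tags are used.
embed_open = pp.makeHTMLTags("embed")[0]
embed_open.setParseAction(pp.withAttribute(src=pp.withAttribute.ANY_VALUE))
embed_open.ignore(pp.htmlComment)
anchor_open = pp.makeHTMLTags("a")[0]

# Hypothetical sample input, standing in for the elided HTML above.
html = (
    '<object><embed src="a.swf"></embed></object><br />'
    '<a href="blah">blah</a>'
    '<p>text</p><embed src="b.swf"></embed>'
)

# scanString yields (tokens, start, end) for every match in the string.
hits = list(embed_open.scanString(html))

results = []
for i, (tokens, start, end) in enumerate(hits):
    # The fragment runs from the end of this embed tag to the start of
    # the next one, or to the end of the input for the last match.
    next_start = hits[i + 1][1] if i + 1 < len(hits) else len(html)
    fragment = html[end:next_start]

    # Look for a link inside the fragment; keep the first one, if any.
    links = anchor_open.searchString(fragment)
    href = links[0].href if len(links) > 0 else None
    results.append((tokens.src, start, end, href))

for src, start, end, href in results:
    print(src, start, end, href)
```

This is a loose reading of "immediately following": a link anywhere between two embed tags is attached to the first one. A stricter check on what sits between the embed and the link would have to be added on top of the fragment.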
EDIT:
Someone asked why I am not using BeautifulSoup. Good question. Here is some sample code to show why I decided against it:
    import BeautifulSoup
    import urllib
    import re
    import socket

    socket.setdefaulttimeout(3)
When I try to do this, BeautifulSoup fails with parsing errors 20-30% of the time. These are not rare edge cases. pyparsing is slow and bulky, but it has not blown up no matter what I throw at it. If someone can show me a better way to use BeautifulSoup, I would be very interested to hear it.