Find the next tag with pyparsing

I use pyparsing to parse HTML. I grab all the embed tags, but in some cases an <a> tag immediately follows the embed, and I want to grab that too when it is present.

Example:

    import pyparsing

    target = pyparsing.makeHTMLTags("embed")[0]
    target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
    target.ignore(pyparsing.htmlComment)
    result = target.searchString(""".....
    <object....><embed>.....</embed></object><br /><a href="blah">blah</a>
    """)

I could not find the character offsets of the matches in the result objects; otherwise I could just slice the original input string and work from there.

EDIT:

Someone asked why I am not using BeautifulSoup. Good question; here is sample code that shows why I decided against it:

    import BeautifulSoup
    import urllib
    import re
    import socket

    socket.setdefaulttimeout(3)

    # get some random blogs
    xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()

    success, failure = 0.0, 0.0

    for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
        print url
        try:
            BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
        except IOError:
            pass
        except Exception, e:
            print e
            failure += 1
        else:
            success += 1

    print failure / (failure + success)

When I run this, BeautifulSoup fails with parse errors 20-30% of the time. These are not rare edge cases. pyparsing is slow and cumbersome, but it has not blown up no matter what I throw at it. If someone can show me a better way to use BeautifulSoup, I would be very interested to hear it.

+4
4 answers

If there is an optional <a> tag that is of interest when it follows the <embed>, add it to your search pattern:

    embedTag = pyparsing.makeHTMLTags("embed")[0]
    aTag = pyparsing.makeHTMLTags("a")[0]
    target = embedTag + pyparsing.Optional(aTag)
    result = target.searchString(""".....
    <object....><embed>.....</embed></object><br /><a href="blah">blah</a>
    """)
    print result.dump()

If you want to capture the character location of an expression within the input, insert an empty location marker like this one and give each use a results name:

    loc = pyparsing.Empty().setParseAction(lambda s,locn,toks: locn)
    target = loc("beforeEmbed") + embedTag + loc("afterEmbed") + pyparsing.Optional(aTag)
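
As an alternative way to get at the offsets asked about in the question, pyparsing's scanString reports where each match starts and ends. This is only a minimal sketch, not part of the answer above: the sample input and the `src` value are made up for illustration, and it is written for Python 3.

```python
import pyparsing

# Made-up sample input, modelled loosely on the fragment in the question.
html = (
    '<object><embed src="movie.swf"></embed></object>'
    '<br /><a href="blah">blah</a>'
)

embedTag = pyparsing.makeHTMLTags("embed")[0]

# scanString yields a (tokens, start, end) tuple for every match; start and
# end are character offsets into the string that was scanned.
tokens, start, end = next(embedTag.scanString(html))

matched_text = html[start:end]  # text of the opening <embed ...> tag
rest = html[end:]               # everything after it, for further processing

print(start, end, matched_text)
```

With the end offset in hand, the remainder of the input can be sliced off and examined separately, which is the approach the question describes.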
+5

Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job that HTMLParser cannot.
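
For what it is worth, here is a rough sketch of how the standard-library parser could pair each embed with a following link. It uses Python 3, where the module is html.parser. The class name EmbedLinkFinder, the sample input, and the rule that only a <br> may sit between the embed and the link are all assumptions made for illustration, not something taken from the question.

```python
from html.parser import HTMLParser


class EmbedLinkFinder(HTMLParser):
    """Hypothetical helper: pair each <embed src=...> with the <a> after it."""

    def __init__(self):
        super().__init__()
        self.pairs = []              # [embed_src, link_href or None] entries
        self._awaiting_link = False  # True while a following <a> still counts

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "embed" and "src" in attrs:
            self.pairs.append([attrs["src"], None])
            self._awaiting_link = True
        elif tag == "a" and self._awaiting_link:
            self.pairs[-1][1] = attrs.get("href")
            self._awaiting_link = False
        elif tag != "br":
            # Assumed rule: any start tag other than <br> breaks the adjacency.
            self._awaiting_link = False


# Made-up sample input: one embed with a link right after it, one without.
html = (
    '<object><embed src="movie.swf"></embed></object><br /><a href="blah">blah</a>'
    '<p><embed src="other.swf"></embed></p><p><a href="far">far</a></p>'
)

finder = EmbedLinkFinder()
finder.feed(html)
print(finder.pairs)
```

End tags are ignored here, so the closing </embed> and </object> do not break the pairing; whether that is the right definition of "immediately following" depends on the pages being scraped.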

+1

Why not just use a regular expression? Or are you avoiding that because parsing HTML with regex is a bad habit? :D

 re.findall("<object.*?</object>(?:<br /><a.*?</a>)?",a) 
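
To show what that pattern returns, here is a small runnable sketch on made-up input (the `src` values and the second object are invented for the demonstration). Because the group is non-capturing, findall returns the full matched text, with the link included only when it directly follows the object.

```python
import re

# Made-up sample input: the first object is followed by a link, the second is not.
a = (
    '..... <object><embed src="movie.swf"></embed></object>'
    '<br /><a href="blah">blah</a> ..... '
    '<object><embed src="other.swf"></embed></object>'
)

# Same pattern as above. Note that "." does not match newlines unless
# re.DOTALL is passed, so multi-line objects would need that flag.
pattern = "<object.*?</object>(?:<br /><a.*?</a>)?"

matches = re.findall(pattern, a)
for m in matches:
    print(m)
```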
+1

I ran your BeautifulSoup code and got no errors. I am running BeautifulSoup 3.0.7a.

Use BeautifulSoup 3.0.7a; 3.1.0.1 has bugs that prevent it from working in some cases (yours, for example).

+1
