Find the next tag with pyparsing

Question


I use pyparsing to parse HTML. I grab all the embed tags, but in some cases an a tag immediately follows the embed, and I want to grab that as well when it is present.

Example:

    import pyparsing

    target = pyparsing.makeHTMLTags("embed")[0]
    target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
    target.ignore(pyparsing.htmlComment)

    result = target.searchString("""..... <object....><embed>.....</embed></object><br /><a href="blah">blah</a> """)

I could not find any character offsets in the result objects; otherwise I could just slice a fragment out of the original input string and work from there.

EDIT:

Someone asked why I am not using BeautifulSoup. Good question; here is sample code that shows why I decided against it:

    import BeautifulSoup
    import urllib
    import re
    import socket

    socket.setdefaulttimeout(3)

    # get some random blogs
    xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()

    success, failure = 0.0, 0.0

    for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
        print url
        try:
            BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
        except IOError:
            pass
        except Exception, e:
            print e
            failure += 1
        else:
            success += 1

    print failure / (failure + success)

When I run this, BeautifulSoup fails with parsing errors 20-30% of the time. These are not rare edge cases. pyparsing is slow and cumbersome, but it has not blown up on anything I have thrown at it. If someone can show me a better way to use BeautifulSoup, I would be very interested.

+4

python html parsing pyparsing

ʞɔıu Nov 20 '09 at 12:56


4 answers

Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any jobs that HTMLParser cannot.

+1

Ned Batchelder Nov 20 '09 at 1:02


Don't you want to use a plain regex? Or are you avoiding it because parsing HTML with regex is a bad habit? :D

 re.findall("<object.*?</object>(?:<br /><a.*?</a>)?",a)
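A minimal sketch of how that pattern could also report offsets, since the question asks for them. The function name `find_objects` and the sample markup are made up for illustration, and `re.DOTALL` is an added assumption so that tags spanning several lines still match. As the answer itself hints, a regex like this is brittle on real-world HTML.

```python
import re

# Same pattern as in the answer: an <object> block, optionally followed
# by "<br /><a ...>...</a>". re.DOTALL lets ".*?" cross line breaks.
OBJECT_RE = re.compile(r"<object.*?</object>(?:<br /><a.*?</a>)?", re.DOTALL)


def find_objects(html):
    """Return a list of (start, end, text) for every match in html."""
    return [(m.start(), m.end(), m.group(0)) for m in OBJECT_RE.finditer(html)]


if __name__ == "__main__":
    sample = (
        '<p>x</p><object><embed src="m.swf"></embed></object>'
        '<br /><a href="blah">blah</a><p>y</p><object></object>'
    )
    for start, end, text in find_objects(sample):
        print(start, end, text)
```

Using `finditer` instead of `findall` keeps the match objects, so the start and end positions can be used to slice the original string.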

+1

YOU Nov 20 '09 at 15:21


I ran your BeautifulSoup code and got no errors. I am running BeautifulSoup 3.0.7a.

Use BeautifulSoup 3.0.7a; 3.1.0.1 has bugs that prevent it from working in some cases (such as yours).

+1

gibson Nov 20 '09 at 19:48


Paulmcg · Accepted Answer · 2009-11-20T04:02:28+0000

If there is an optional <a> tag that is of interest when it follows the <embed> tag, then add it to your search pattern:

    embedTag = pyparsing.makeHTMLTags("embed")[0]
    aTag = pyparsing.makeHTMLTags("a")[0]
    target = embedTag + pyparsing.Optional(aTag)

    result = target.searchString("""..... <object....><embed>.....</embed></object><br /><a href="blah">blah</a> """)
    print result.dump()

If you want to capture the character location of an expression within your parser, insert one of these locator expressions with a results name:

    loc = pyparsing.Empty().setParseAction(lambda s,locn,toks: locn)
    target = loc("beforeEmbed") + embedTag + loc("afterEmbed") + pyparsing.Optional(aTag)
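As a related sketch, and not the approach shown above, `scanString` yields the start and end offsets of each match directly, which allows slicing the original input as the question wanted. The function name `embeds_with_following_link`, the `max_gap` parameter, and the sample markup are hypothetical and exist only for this illustration; the "within `max_gap` characters" rule is a crude assumed stand-in for "immediately following".

```python
import pyparsing as pp

# Opening-tag expressions only; makeHTMLTags returns (open, close).
embed_tag = pp.makeHTMLTags("embed")[0]
a_tag = pp.makeHTMLTags("a")[0]


def embeds_with_following_link(html, max_gap=40):
    """Return (start, end, src, href) for each <embed> opening tag.

    href is None unless an <a> opening tag starts within max_gap
    characters after the end of the <embed> tag (hypothetical rule).
    """
    found = []
    # scanString yields (tokens, start, end) for every match.
    for tokens, start, end in embed_tag.scanString(html):
        href = None
        # Look at the first <a> tag after the embed, if there is one.
        for a_tokens, a_start, _ in a_tag.scanString(html[end:], 1):
            if a_start <= max_gap:
                href = a_tokens.href
        found.append((start, end, tokens.src, href))
    return found


if __name__ == "__main__":
    sample = ('<object><embed src="movie.swf"></embed></object>'
              '<br /><a href="blah">blah</a>')
    for start, end, src, href in embeds_with_following_link(sample):
        print(sample[start:end], src, href)
```

The offsets refer to the string passed to `scanString`, so `html[start:end]` recovers the matched tag text, provided the input contains no tabs (pyparsing expands tabs by default).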
