Python parsing

I am trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree, I have already parsed the RSS so that I can print each title (minus the trailing ")") with the code below:

    feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
    for item in feed:
        print(repr(item.title[0:-1]))

I include this because, as you can see, item.title is a repr() data type, which I don't know much about.

A particular repr(item.title[0:-1]) printed in the interactive window looks like this:

    'randy travis (Billy Bobs 3/21'
    'Michael Schenker Group (House of Blues Dallas 3/26'

The user selects a band, and I hope, after parsing each item.title into 3 variables (one each for band, venue, and date... or maybe an array, or I don't know...), to select only the entries related to the selected band. Those are then sent to Google for geocoding, but that's another story.

I have seen some regex examples and have been reading up on them, but they seem very complicated. Are they? I thought maybe someone here would have some insight into how to do this in a sensible way. Should I use the re module? Does it matter that the output is currently passed through repr()? Is there a better way? I was thinking of using a loop like this (this is my pseudo-Python, just notes I'm writing):

    list = bandRaw, venue, date, latLong
    for item in feed:
        parse item.title for bandRaw, venue, date
        if bandRaw == str(band)
            send venue name + ", Dallas, TX" to google for geocoding
            return lat, long
            list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
        else

In the end, I need the selected entries in the .csv file (comma-delimited) to look like this:

    band,venue,date,lat,long
    randy travis,Billy Bobs,3/21,1234.5678,1234.5678
    Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765

I hope this is not too much to ask. I will keep looking into this on my own, but I thought I should post here to make sure it gets answered.

So the question is, what is the best way to parse each repr(item.title[0:-1]) in feed into 3 separate values, which I can then merge into a CSV file?

+4
3 answers

Don't let regex scare you off ... it's worth exploring.

Given the examples above, you might try putting the trailing parenthesis back in and then using this pattern:

    import re
    s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # title with the ")" put back
    pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
    info = pat.match(s)
    print(info.groups())
    ('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')

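To get at each group individually, just call group() on the info object:

    print(info.group(1))  # or info.groups()[0]
    Michael Schenker Group 
    print('"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3)))
    "Michael Schenker Group ","House of Blues Dallas ","3/26"

Note that the first two groups keep the trailing space that [\w\s]+ captured; call .strip() on them if you don't want it.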

The tricky part of using a regex here is making sure you know all the characters that can possibly appear in the title. If there are non-alphanumeric characters in the "Michael Schenker Group" part, you will need to adjust the regex for that part to allow them.

The pattern above breaks down as follows, read left to right:

([\w\s]+) : matches any word characters or whitespace (the plus sign means there must be one or more of these characters). The parentheses mean the match is captured as a group. This is the "Michael Schenker Group " part. \w already covers letters, digits, and underscores; if there can be dashes or other punctuation, add them between the square brackets, which list the possible characters for the set.

\( : a literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command (it opens a group). This is the "(" part of the string.

([\w\s]+) : same as above, but this time it matches the "House of Blues Dallas " part. It is in parentheses so that it is captured as the second group.

(\d+/\d+) : matches the digits 3 and 26 with a slash between them. It is in parentheses so that it is captured as the third group.

\) : the closing parenthesis that pairs with the one above.
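The breakdown above can also be written into the pattern itself with re.VERBOSE, which lets each piece carry its own comment. This is only a sketch: the helper name split_title, the WIDER variant, and the "Blink-182" title are assumptions made for illustration, not part of the original answer.

```python
import re

# The pattern from the answer, one commented piece per line.
BASIC = re.compile(r"""
    ([\w\s]+)    # group 1: band name (word characters and whitespace)
    \(           # literal opening parenthesis
    ([\w\s]+)    # group 2: venue
    (\d+/\d+)    # group 3: date such as 3/26
    \)           # literal closing parenthesis
""", re.VERBOSE)

# Hypothetical variant: same shape, with a few punctuation marks allowed.
WIDER = re.compile(r"""
    ([\w\s&'.-]+)   # band name, also allowing & ' . and -
    \(
    ([\w\s&'.-]+)   # venue, with the same extra characters
    (\d+/\d+)
    \)
""", re.VERBOSE)


def split_title(title, pattern=BASIC):
    """Return (band, venue, date) with surrounding spaces stripped, or None."""
    m = pattern.match(title)
    if m is None:
        return None
    return tuple(part.strip() for part in m.groups())


print(split_title("Michael Schenker Group (House of Blues Dallas 3/26)"))
print(split_title("Blink-182 (House of Blues Dallas 3/26)"))         # None
print(split_title("Blink-182 (House of Blues Dallas 3/26)", WIDER))
```

One thing to watch with either variant: because \w also matches digits, the venue group can swallow the first digit of a two-digit month, so a title ending in "12/26)" would come back with the date "2/26". Requiring a space before the date, as in the pattern from the answer below, avoids that.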

The Python introduction to regex is quite good, and it is worth spending an evening going over it: http://docs.python.org/library/re.html#module-re . Also check out Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html .

EDIT: see zacherates' answer below, which has some nice edits. Two heads are better than one!

+17

Regular expressions are a great solution to this problem:

    >>> import re
    >>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
    >>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
    ('Michael Schenker Group', 'House of Blues Dallas', '3/26')
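If named fields are easier to work with than positions, the same pattern could be given named groups. A small sketch, assuming the group names band, venue, and date and a helper called parse_title (both my own choices):

```python
import re

# Same pattern as above, with each group given a name.
TITLE = re.compile(r'(?P<band>.*) \((?P<venue>.*) (?P<date>\d+/\d+)')


def parse_title(title):
    """Return a dict with band, venue and date, or None if there is no match."""
    m = TITLE.match(title)
    return m.groupdict() if m else None


print(parse_title('Michael Schenker Group (House of Blues Dallas 3/26'))
print(parse_title('no venue here'))  # None
```

Returning None for titles that do not fit lets the caller skip them instead of hitting an AttributeError on a failed match.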

As a side note, you might look at the Universal Feed Parser for handling the RSS parsing, as feeds have a bad habit of being malformed.

Edit

Regarding your comment... The strings sometimes being wrapped in "s rather than 's comes from the fact that you are using repr. The repr of a string is usually delimited with 's, unless the string contains one or more 's, in which case "s are used instead so that the 's don't need to be escaped:

    >>> "Hello there"
    'Hello there'
    >>> "it's not its"
    "it's not its"

Pay attention to the different styles of quotes.

+7

As for the repr(item.title[0:-1]) part, I'm not sure where you got that from, but I'm pretty sure you can simply use item.title. All you are doing is removing the last character from the string and then calling repr() on it, which only changes how the string is displayed (it adds the quotes), not the data itself.
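To make the point concrete, here is a short sketch of what the slice and the repr() call each do; the variable names trimmed and shown are only for illustration:

```python
title = "randy travis (Billy Bobs 3/21)"

trimmed = title[0:-1]    # drops the final ")"
shown = repr(trimmed)    # a new string that includes the quote characters

print(trimmed)  # randy travis (Billy Bobs 3/21
print(shown)    # 'randy travis (Billy Bobs 3/21'
```

The title itself is an ordinary string either way, so it can be handed to a regex directly without going through repr().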

Your code should look something like this:

    import re

    import feedparser  # from www.feedparser.org
    from geopy import geocoders  # from GeoPy

    us = geocoders.GeocoderDotUS()  # may not exist in newer GeoPy releases

    feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
    feed = feedparser.parse(feedurl)

    lines = []
    for entry in feed.entries:
        m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
        if m:
            bandRaw, venue, date = m.groups()
            if band == bandRaw:  # band is the name the user selected
                place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
                # join() needs strings, so convert the coordinates
                lines.append(",".join([band, venue, date, str(lat), str(lng)]))

    result = "\n".join(lines)

EDIT: replaced list with lines as the variable name. list is a builtin and should not be used as a variable name. Sorry.
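For the CSV output itself, the standard csv module is one alternative to joining fields by hand, since it quotes any venue name that happens to contain a comma. The following is a sketch under assumptions: rows_for_band, to_csv, and fake_geocode are hypothetical names, and fake_geocode is a placeholder that returns dummy coordinates instead of calling a real geocoding service.

```python
import csv
import io
import re

TITLE_RE = re.compile(r'(.*) \((.*) (\d+/\d+)\)')


def rows_for_band(titles, band, geocode):
    """Yield (band, venue, date, lat, lng) for every title matching band."""
    for title in titles:
        m = TITLE_RE.search(title)
        if not m:
            continue  # skip titles that do not fit the pattern
        band_raw, venue, date = m.groups()
        if band_raw == band:
            lat, lng = geocode(venue + ", Dallas, TX")
            yield band_raw, venue, date, lat, lng


def to_csv(rows):
    """Return the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["band", "venue", "date", "lat", "long"])
    writer.writerows(rows)
    return buf.getvalue()


def fake_geocode(address):
    # Placeholder only: a real geocoder would look the address up.
    return (0.0, 0.0)


titles = [
    "randy travis (Billy Bobs 3/21)",
    "Michael Schenker Group (House of Blues Dallas 3/26)",
]
print(to_csv(rows_for_band(titles, "randy travis", fake_geocode)))
```

Passing the geocoder in as a function keeps the parsing and CSV steps testable without network access; a real geocoding call can be swapped in later.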

0