Regular expression to match a comma-separated list of key = value in which the value may contain commas

I have a naive "parser" that just does something like:
[x.split('=') for x in mystring.split(',')]

However, the string may be something like 'foo=bar,breakfast=spam,eggs'

Obviously, a naive splitter just won't do it. I'm also limited to the Python 2.6 standard library,
so, for example, pyparsing cannot be used.

Expected Result - [('foo', 'bar'), ('breakfast', 'spam,eggs')]

I am trying to do this with a regex, but I am facing the following problems:

My first attempt
r'([a-z_]+)=(.+),?'
Gave me
[('foo', 'bar,breakfast=spam,eggs')]

Obviously, making .+ non-greedy does not solve the problem.

So I guess I need to somehow make the last comma (or $) mandatory.
But doing just that doesn't work either:
r'([a-z_]+)=(.+?)(?:,|$)'
because the part after the comma, in a value that contains one, is dropped,
for example [('foo', 'bar'), ('breakfast', 'spam')]

I think I need to use some kind of look-behind (?) operation.
Question(s):
1. Which one do I use? or
2. How do I do this?

Edit

Based on daramarak's answer below, I ended up doing almost the same thing that abarnert later suggested, in a slightly more verbose form:

 vals = [x.rsplit(',', 1) for x in (data.split('='))]
 ret = list()
 while vals:
     value = vals.pop()[0]
     key = vals[-1].pop()
     ret.append((key, value))
     if len(vals[-1]) == 0:
         break
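
One thing to watch with this approach: every piece is split at its rightmost comma, including the last one, so a final value that itself contains commas (as in 'breakfast=spam,eggs') would lose its tail. Here is a minimal sketch that leaves the last piece alone. The helper name parse_pairs and the function form are my own, not from the post, and it assumes well-formed input in which neither keys nor values contain an = sign:

```python
def parse_pairs(data):
    """Parse 'k=v,k=v' text where values may contain commas (keys may not)."""
    pieces = data.split('=')
    if len(pieces) < 2:
        raise ValueError("expected at least one key=value pair")
    # The first piece is the first key and the last piece is the last value.
    # Every piece in between has the shape '<value>,<next key>'.
    key = pieces[0]
    pairs = []
    for piece in pieces[1:-1]:
        value, next_key = piece.rsplit(',', 1)
        pairs.append((key, value))
        key = next_key
    # The last piece is taken whole, so commas inside it are preserved.
    pairs.append((key, pieces[-1]))
    return pairs


print(parse_pairs('foo=bar,breakfast=spam,eggs'))
```

Unlike the loop above, this version returns the pairs in their original order.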

EDIT 2:

To satisfy my curiosity, is this really possible with pure regular expressions? I.e., so that re.findall() returns a list of 2-tuples?

+8
python regex parsing
5 answers

Just for comparison, there is a regular expression here that seems to also solve the problem:

 ([^=]+)  # key
 =        # equals is how we tokenise the original string
 ([^=]+)  # value
 (?:,|$)  # value terminator, either comma or end of string

The trick here is to limit what you capture in the second group. .+ swallows the = sign, which is the one character we can use to distinguish keys from values. The full regex uses no lookarounds or backreferences (so it should be compatible with something like re2, if desired) and works on abarnert's examples.

Use as follows:

 re.findall(r'([^=]+)=([^=]+)(?:,|$)', 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam') 

Which returns:

 [('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')] 
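
The commented layout of the pattern can also be compiled as written by passing re.VERBOSE, which makes the regex engine ignore the whitespace and the # comments. A small sketch; the names PAIR_RE, s and pairs are mine, chosen for illustration:

```python
import re

# Same pattern as above, kept in its commented form via re.VERBOSE.
PAIR_RE = re.compile(r"""
    ([^=]+)    # key
    =          # equals sign separating key from value
    ([^=]+)    # value (may contain commas, but no equals sign)
    (?:,|$)    # value terminator: a comma or the end of the string
""", re.VERBOSE)

s = 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam'
pairs = PAIR_RE.findall(s)
print(pairs)
```

Because the value group excludes =, the pattern assumes values never contain an equals sign.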
+8

daramarak's answer either very nearly works, or works as is; it's hard to tell from the way the sample output is formatted and from the vague descriptions of the steps. But if it is the very-nearly-works version, it is easy to fix.

Putting it into code:

 >>> bits=[x.rsplit(',', 1) for x in s.split('=')]
 >>> kv = [(bits[i][-1], bits[i+1][0]) for i in range(len(bits)-1)]

The first line is (I think) daramarak's answer. On its own, it gives you pairs of (value_i, key_i+1) instead of (key_i, value_i). The second line is the most obvious fix for that. With more intermediate steps and a bit of output to see how it works:

 >>> s = 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam'
 >>> bits0 = s.split('=')
 >>> bits0
 ['foo', 'bar,breakfast', 'spam,eggs,blt', 'bacon,lettuce,tomato,spam', 'spam']
 >>> bits = [x.rsplit(',', 1) for x in bits0]
 >>> bits
 [['foo'], ['bar', 'breakfast'], ['spam,eggs', 'blt'], ['bacon,lettuce,tomato', 'spam'], ['spam']]
 >>> kv = [(bits[i][-1], bits[i+1][0]) for i in range(len(bits)-1)]
 >>> kv
 [('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')]
+4

May I suggest using split operations as before, but split at the equals signs first, and then split each piece at its rightmost comma, to make a single list of left and right strings.

 input = "bob=whatever,king=kong,banana=herb,good,yellow,thorn=hurts" 

The first split gives:

 first_split = input.split("=")
 #first_split = ['bob', 'whatever,king', 'kong,banana', 'herb,good,yellow,thorn', 'hurts']

Then splitting each piece at its rightmost comma gives you:

 second_split = [item for sublist in first_split for item in sublist.rsplit(",",1)]
 #second_split = ['bob', 'whatever', 'king', 'kong', 'banana', 'herb,good,yellow', 'thorn', 'hurts']

Then you just collect the pairs as follows:

 pairs = dict(zip(second_split[::2],second_split[1::2])) 
+1

Can you try this, it worked for me:

 mystring = "foo=bar,breakfast=spam,eggs,e=a"
 n = []
 i = 0
 for x in mystring.split(','):
     if '=' not in x:
         n[i-1] = "{0},{1}".format(n[i-1], x)
     else:
         n.append(x)
         i += 1
 print n

You get the result as:

  ['foo=bar', 'breakfast=spam,eggs', 'e=a'] 

Then you can simply loop over the list and do what you want.
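
For instance, that last step could split each item at its first = only, which yields the 2-tuples the question asks for. A short sketch; the names items and pairs are mine, with items standing in for the list printed above:

```python
# The list produced by the loop above.
items = ['foo=bar', 'breakfast=spam,eggs', 'e=a']

# maxsplit=1 splits at the first '=' only, so the value is kept whole.
pairs = [tuple(item.split('=', 1)) for item in items]
print(pairs)
```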

0

Assuming the key name never contains a comma, you can split at every , that is followed by a sequence of characters containing neither , nor = and then by an = .

 re.split(r',(?=[^,=]+=)', inputString) 

(This is the same as my original solution. I expect re.split to be used, rather than re.findall or str.split.)

The complete solution can be written as a one-liner:

 [re.findall('(.*?)=(.*)', token)[0] for token in re.split(r',(?=[^,=]+=)', inputString)] 
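
To see what the lookahead does, the two steps can be run separately on the example string used in the other answers. This is only a sketch: input_string, tokens and pairs are illustrative names, and it assumes, as stated above, that keys never contain a comma:

```python
import re

input_string = 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam'

# Split only at commas that are followed by "<key>=", where <key> contains
# neither ',' nor '='. The lookahead (?=...) matches without consuming
# anything, so each key stays attached to its own token.
tokens = re.split(r',(?=[^,=]+=)', input_string)
print(tokens)

# Within each token, the lazy (.*?) stops at the first '=' and (.*) takes
# the rest, so findall returns a single (key, value) tuple per token.
pairs = [re.findall('(.*?)=(.*)', token)[0] for token in tokens]
print(pairs)
```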
0