Conditional Matching in Regular Expressions

I am trying to extract some information from the string below:

    >>> st = '''
    ... <!-- info mp3 here -->
    ... 192 kbps<br />2:41<br />3.71 mb </div>
    ... <!-- info mp3 here -->
    ... 3.49 mb </div>
    ... <!-- info mp3 here -->
    ... 128 kbps<br />3:31<br />3.3 mb </div>
    ... '''
    >>>

Now, when I use the following regular expression, my output is:

    >>> import re
    >>> p = re.findall(r'<!-- info mp3 here -->\s+(.*?)<br />(.*?)<br />(.*?)\s+</div>', st)
    >>> p
    [('192 kbps', '2:41', '3.71 mb'), ('128 kbps', '3:31', '3.3 mb')]

but my desired result is

    [('192 kbps', '2:41', '3.71 mb'), (None, None, '3.49 mb'), ('128 kbps', '3:31', '3.3 mb')]

So my question is how to modify the regex above so that it handles all of these cases. I believe the current regex depends strictly on the <br /> tags, so how can I make them optional?

I know that I should not use regex to parse HTML, but this is currently the most suitable way for me.

2 answers

The following will work, although I wonder if there is a more elegant solution. You could certainly combine the list comprehensions into one line, but I think the code becomes less clear overall. At least this way you will still be able to follow what you did three months from now...

    import re

    st = '''
    <!-- info mp3 here -->
    192 kbps<br />2:41<br />3.71 mb </div>
    <!-- info mp3 here -->
    3.49 mb </div>
    <!-- info mp3 here -->
    128 kbps<br />3:31<br />3.3 mb </div>
    '''
    # Capture everything between the comment and </div> as one chunk
    p = re.findall(r'<!-- info mp3 here -->\s+(.*?)\s+</div>', st)
    # Split each chunk into its fields
    p2 = [row.split('<br />') for row in p]
    # Left-pad short rows with None so every row has three items
    p3 = [[None]*(3 - len(row)) + row for row in p2]
    >>> p3
    [['192 kbps', '2:41', '3.71 mb'], [None, None, '3.49 mb'], ['128 kbps', '3:31', '3.3 mb']]

And, depending on how variable your string is, you can write a more general cleanup function that strips, normalizes, or does whatever else is needed, and map it over each item you pull out.
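As a rough sketch of that idea, the split-and-pad steps can be folded into one helper and applied to every match. The names `clean_row` and `width` are hypothetical (they are not from the answer above), and stripping whitespace is only one possible cleanup:

```python
import re

st = '''
<!-- info mp3 here -->
192 kbps<br />2:41<br />3.71 mb </div>
<!-- info mp3 here -->
3.49 mb </div>
<!-- info mp3 here -->
128 kbps<br />3:31<br />3.3 mb </div>
'''

def clean_row(raw, width=3):
    """Split one captured chunk on <br />, strip each field, left-pad with None.

    Hypothetical helper: swap in whatever cleanup your data needs.
    """
    fields = [field.strip() for field in raw.split('<br />')]
    return tuple([None] * (width - len(fields)) + fields)

# Grab everything between the marker comment and the closing </div>
chunks = re.findall(r'<!-- info mp3 here -->\s+(.*?)\s+</div>', st)
rows = [clean_row(chunk) for chunk in chunks]
print(rows)
```

Keeping the cleanup in one function means a new quirk in the markup only needs a change in one place.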


Here is a regular expression that works by being more specific. I'm not sure it is preferable to Karmel's answer, but I figured I would answer the question as asked. Instead of returning None, the first two optional groups return an empty string '', which, in my opinion, is close enough.

Note the structure of the nested groups. The first two outer groups are optional, but each one requires a <br /> tag. Thus, if there are fewer than two <br /> tags, the remaining text is not matched until the final group:

    rx = r'''<!--\ info\ mp3\ here\ -->\s+   # verbose mode; escape literal spaces
             (?:                             # outer non-capturing group
               ([^<>]*)                      # inner capturing group without <>
               (?:<br\ />)                   # inner non-capturing group matching br
             )?                              # whole outer group is optional
             (?:
               ([^<>]*)                      # all same as above
               (?:<br\ />)
             )?
             (?:                             # outer non-capturing group
               (.*?)                         # non-greedy wildcard match
               (?:\s+</div>)                 # inner non-capturing group matching div
             )'''                            # final group is not optional

Tested:

    >>> re.findall(rx, st, re.VERBOSE)
    [('192 kbps', '2:41', '3.71 mb'), ('', '', '3.49 mb'), ('128 kbps', '3:31', '3.3 mb')]

Note the re.VERBOSE flag, which is required unless you remove all the whitespace and comments from the pattern above.
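If None is preferred over '' for the missing fields, one option (a sketch, not part of the answer as originally given) is to iterate with re.finditer and read each match's groups(). Unlike findall, groups() reports a group that did not take part in the match as None by default. The pattern below is the same one written compactly, so re.VERBOSE is not needed; the name `rows` is just illustrative:

```python
import re

st = '''
<!-- info mp3 here -->
192 kbps<br />2:41<br />3.71 mb </div>
<!-- info mp3 here -->
3.49 mb </div>
<!-- info mp3 here -->
128 kbps<br />3:31<br />3.3 mb </div>
'''

# Same pattern as the verbose version, without whitespace or comments.
# Each of the first two fields is optional but must end in <br />.
rx = r'<!-- info mp3 here -->\s+(?:([^<>]*)<br />)?(?:([^<>]*)<br />)?(.*?)\s+</div>'

# groups() yields None for optional groups that did not participate
rows = [m.groups() for m in re.finditer(rx, st)]
print(rows)
```

For the sample string this yields three tuples, with None in the first two positions of the middle one.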


Source: https://habr.com/ru/post/1414244/

