Strip a verbose Python regex

Question

I have a verbose Python regex string (with lots of whitespace and comments) that I would like to convert to "normal" style (for export to JavaScript). In particular, I need the conversion to be quite reliable. If there is a demonstrably correct way to do this, that is what I want. For example, a naive implementation would destroy a regex such as r'\# # A literal hash character', which is not OK.

The best way to do this would be to coerce the Python re module into giving me back a non-verbose representation of my regex, but I see no way to do that.

python regex

bukzor Feb 14 '13 at 10:54

1 answer

dpkp · Answer 1 · 2013-02-17T08:13:24+0000

I believe you only need to solve these two problems to strip a verbose regex:

delete comments at the end of each line
remove unescaped whitespace

Try this, which chains the two steps as separate regex substitutions:

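    import re

    def unverbosify_regex_simple(verbose):
        # Whitespace preceded by an even number of backslashes (i.e. unescaped).
        WS_RX = r'(?<!\\)((\\{2})*)\s+'
        # A '#' preceded by an even number of backslashes starts a comment.
        # (?m) must come first in the pattern; Python 3.11+ rejects it elsewhere.
        CM_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
        # Strip comments first, then whitespace; group 1 keeps the backslash pairs.
        return re.sub(WS_RX, "\\1", re.sub(CM_RX, "\\1", verbose))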

The version above is simplified: it leaves escaped spaces (and escaped hashes) as they are. The result is a little harder to read, but it should work across regex platforms.
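As a quick illustration of what the simplified version produces, here is a minimal sketch. The sample pattern `VERBOSE` and the test string are made up for this example, and the function is repeated (with the `(?m)` flag at the start of the pattern) so that the snippet runs on its own.

```python
import re

def unverbosify_regex_simple(verbose):
    # Same logic as the simplified version: strip comments, then
    # unescaped whitespace, keeping any pairs of backslashes.
    WS_RX = r'(?<!\\)((\\{2})*)\s+'
    CM_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
    return re.sub(WS_RX, r'\1', re.sub(CM_RX, r'\1', verbose))

# Hypothetical verbose pattern, invented for this example.
VERBOSE = r'''
    (\d+)    # one or more digits
    \#       # a literal hash character
    \ +      # one or more (escaped) spaces
    (\w+)    # a word
'''

stripped = unverbosify_regex_simple(VERBOSE)
print(stripped)  # (\d+)\#\ +(\w+)

# The stripped pattern should match the same input as the verbose one.
sample = '42# abc'
verbose_match = re.compile(VERBOSE, re.VERBOSE).fullmatch(sample)
stripped_match = re.fullmatch(stripped, sample)
print(verbose_match.groups() == stripped_match.groups())  # True
```

Note that the escapes `\#` and `\ ` survive in the output, which is why the result is valid but slightly less readable.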

Alternatively, here is a slightly more complex version that unescapes the escaped spaces (i.e. '\ ' => ' ') and returns what I think most people would expect:

    import re

    def unverbosify_regex(verbose):
        # (?m) must come first in the pattern; Python 3.11+ rejects it elsewhere.
        CM1_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
        CM2_RX = r'(\\)?((\\{2})*)(#)'
        WS_RX = r'(\\)?((\\{2})*)(\s)\s*'

        def strip_escapes(match):
            # Even number of slashes: delete the space and keep the slashes.
            if match.group(1) is None:
                return match.group(2)
            # Odd number of slashes: delete one slash and keep the space (or '#').
            elif match.group(1) == '\\':
                return match.group(2) + match.group(4)
            # Should be unreachable.
            else:
                raise Exception

        # 1. strip comments, 2. unescape escaped hashes, 3. handle whitespace
        not_verbose_regex = re.sub(WS_RX, strip_escapes,
                                   re.sub(CM2_RX, strip_escapes,
                                          re.sub(CM1_RX, "\\1", verbose)))
        return not_verbose_regex
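One way to sanity-check the fuller version is to run it on inputs that mix escaped hashes, escaped spaces and comments. The sketch below repeats the function in a slightly condensed form (the unreachable error branch is dropped and `(?m)` is placed at the start of the pattern) so that it runs on its own. The input strings are chosen for illustration only.

```python
import re

def unverbosify_regex(verbose):
    CM1_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'   # real comments
    CM2_RX = r'(\\)?((\\{2})*)(#)'          # remaining (escaped) hashes
    WS_RX = r'(\\)?((\\{2})*)(\s)\s*'       # whitespace, escaped or not

    def strip_escapes(match):
        if match.group(1) is None:
            # Even number of slashes: drop the whitespace, keep the slashes.
            return match.group(2)
        # Odd number of slashes: drop one slash, keep the space or hash.
        return match.group(2) + match.group(4)

    no_comments = re.sub(CM1_RX, r'\1', verbose)
    unescaped_hashes = re.sub(CM2_RX, strip_escapes, no_comments)
    return re.sub(WS_RX, strip_escapes, unescaped_hashes)

# Illustrative inputs (not an exhaustive test suite).
cases = [
    r'\# # escaped hash',               # escaped hash, then a comment
    r'# comment with \# escaped hash',  # the whole line is a comment
    r'\\# comment',                     # escaped backslash, then a comment
    r'a\ b   c  # escaped space',       # escaped space is kept as a plain space
]
results = [unverbosify_regex(case) for case in cases]
for case, result in zip(cases, results):
    print(repr(case), '->', repr(result))
```

This sketch, like the functions it mirrors, does not treat character classes specially, so a space or `#` inside `[...]` (which verbose mode leaves alone) would still be stripped.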

UPDATE: Added comments to explain the even/odd slash counts. Fixed the first group in CM_RX so that it preserves the full "comment" if the number of slashes is odd.

UPDATE 2: Fixed the comment regex, which did not handle escaped hashes properly. It must handle "\# #escaped hash" as well as "# comment with \# escaped hash" and "\\# comment".

UPDATE 3: Added a simplified version that does not unescape escaped spaces.

UPDATE 4: Simplified further to eliminate the variable-length negative lookbehind (and the reverse/reverse trick).
