Proper analysis of string literals with python re module

Question

I am trying to add some light Markdown support to a JavaScript preprocessor that I am writing in Python.

It works for the most part, but sometimes the regex that I use acts a little weird, and I think it has something to do with the source strings and escape sequences.

Regular expression: (?<!\\)\"[^\"]+\"

Yes, I know that it only matches strings that start with a " character. However, this project is born out of curiosity more than anything else, so I can live with it for now.

To break it down:

    (?<!\\)\"   # The group should begin with a quotation mark that is not escaped
    [^\"]+      # and match one or more characters that are not a quotation mark (this is the biggest problem, I know)
    \"          # and end at the first quotation mark it finds

Having said that, I (obviously) start to encounter such problems:

"This is a string with an \"escaped quote\" inside it"

I'm not sure how to say "anything but a quotation mark, unless that quotation mark is escaped." I tried:

 ([^\"]|\\\")+ # a group of anything but a quote or an escaped quote

but this leads to very strange results.
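A likely source of the strange results, sketched under the assumption that the group was used with `re.findall`: `[^"]` also matches a backslash, so the escaped-quote alternative never gets a chance, and `findall` returns only the last repetition of a capturing group. The pattern below is the attempted group wrapped in an opening and closing quote (`\"` in a regex is the same as `"`); the variable names are made up for illustration.

```python
import re

text = r'"This is a string with an \"escaped quote\" inside it"'

# The attempted group, wrapped in the opening and closing quotes.
attempt = r'"([^"]|\\")+"'

# findall returns the capturing group, and a repeated group only keeps
# its last repetition, so single characters come back.
groups = re.findall(attempt, text)

# The full matches show that the string is still cut at the escaped quotes,
# because [^"] happily consumes the backslash on its own.
full_matches = [m.group(0) for m in re.finditer(attempt, text)]

print(groups)
print(full_matches)
```

Looking at `m.group(0)` rather than the `findall` result makes it easier to see where each match actually starts and stops.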

I am fully prepared to hear that I am doing all of this wrong. For simplicity's sake, assume that a string always starts and ends with double quotation marks ( " ), to avoid adding another element to the mix. I really want to understand what I have.

Thanks for any help.

EDIT

As a test of the regular expression, I try to find all string literals in a minified jQuery script with the following code (using unutbu's pattern below):

    import re

    STRLIT = r'''(?x)   # verbose mode
        (?<!\\)    # not preceded by a backslash
        "          # a literal double-quote
        .*?        # non-greedy: zero or more characters
        (?<!\\)    # not preceded by a backslash
        "          # a literal double-quote
        '''

    with open("jquery.min.js", "r") as f:
        jq = f.read()

    literals = re.findall(STRLIT, jq)

The answer below fixes almost all of the problems. Those that remain arise inside jQuery's own regular expressions, which is very important. The solution no longer mistakenly identifies valid JavaScript as Markdown links, which was the goal.

+4

python regex

Tom thorogood Jan 16 '13 at 19:38

3 answers

I think I first saw this idea in ... the Jinja2 source code? It was later transplanted to Mako.

 r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1'''

Which does the following:

(\"\"\"|\'\'\'|\"|\') matches an opening Python quote, because this comes from code for parsing Python. You probably don't need all of these quote types.
((?<!\\)\\\1|.) matches EITHER a matching quote that was escaped by ONLY ONE backslash, or any other character. Thus, \\" will still be recognized as the end of the string.
*? non-greedily matches as many of those as possible.
And \1 is just the closing quote.

Alas, \\\" will still be incorrectly detected as the end of a string. (The template engines only use this to check whether there is a string, not to extract it.) This is a problem very poorly suited to regular expressions; short of the insane things in Perl where you can embed real code inside a regular expression, I'm not sure that this is possible even with PCRE. Although I would love to be proved wrong. :) The killer is that (?<!...) must be of constant length, but you want to check whether there is an even number of backslashes before your closing quote.
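A quick way to see both behaviours is to run the pattern against a few inputs. This is only a sketch: the helper `first_match` and the sample inputs are made up for illustration and are not taken from Jinja2 or Mako.

```python
import re

# The pattern quoted above, unchanged.
STRING_RE = re.compile(r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1''')


def first_match(src):
    """Return the text of the first string literal the pattern finds, or None."""
    m = STRING_RE.search(src)
    return m.group(0) if m else None


# An escaped quote inside the string is skipped as intended.
escaped_quote = first_match(r'"say \"hi\" now" + 1')

# Two backslashes followed by a quote: the quote really does end the string.
escaped_backslash = first_match(r'"a\\" + "b"')

# Three backslashes followed by a quote: the quote is escaped, but the
# lookbehind sees a preceding backslash and the match stops too early.
odd_backslashes = first_match(r'"a\\\" b"')

print(escaped_quote)
print(escaped_backslash)
print(odd_backslashes)
```

The last case is the one the fixed-width lookbehind cannot handle, since it would have to count how many backslashes precede the quote.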

If you want to get it right, and not just mostly right, you may have to use a real parser. See parsley, pyparsing, or any of these tools.
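Short of a full parser, a small hand-written scanner already avoids the backslash-counting problem, because it consumes each escape as a two-character unit. The following is a minimal sketch, not a JavaScript tokenizer: the function name `find_double_quoted` is hypothetical, it only handles double-quoted literals, and it knows nothing about comments, single quotes, or regex literals.

```python
def find_double_quoted(src):
    """Collect double-quoted literals by walking the text character by character."""
    literals = []
    i, n = 0, len(src)
    while i < n:
        if src[i] != '"':
            i += 1
            continue
        start = i
        i += 1
        while i < n and src[i] != '"':
            # A backslash escapes the next character, whatever it is,
            # so both characters are skipped together.
            i += 2 if src[i] == '\\' else 1
        if i >= n:
            break  # unterminated literal: stop without reporting it
        literals.append(src[start:i + 1])
        i += 1
    return literals


sample = r'x = "a\\\" b" + "\\\\" + "plain";'
found = find_double_quoted(sample)
print(found)
```

Because the scanner never looks backwards, an odd or even run of backslashes before a quote is handled the same way as any other escape.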

edit: By the way, there is no need to check that there is no backslash before the opening quote. A backslash outside of a string is not valid syntax in JS (or Python).

+6

Eevee Jan 16 '13 at 20:08

In Python, the correct regular expression for matching a double-quoted string is:

    pattern = r'"(\\.|[^"])*"'

It describes strings that begin and end with the " character. Each item between the two double quotes is either an escaped character or any character except ".

unutbu's answer is incorrect because its pattern cannot match the valid string "\\\\".
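The difference can be checked directly. In this sketch the escape-aware pattern is written as `"(?:\\.|[^"])*"`; the non-capturing group is my own change so that `re.findall` returns whole matches, and the `STRICT` variant, which also excludes the backslash from the character class, is an optional tightening that is not part of the answer.

```python
import re

# Escape-aware pattern: an escaped character, or anything that is not a quote.
ESCAPE_AWARE = re.compile(r'"(?:\\.|[^"])*"')

# The two-lookbehind pattern from the accepted answer.
LOOKBEHIND = re.compile(r'(?<!\\)".*?(?<!\\)"')

# Optional stricter variant: a lone backslash can no longer be matched by
# the character class, so backtracking cannot "un-escape" a quote.
STRICT = re.compile(r'"(?:\\.|[^"\\])*"')

sample = r'"\\\\"'          # a valid literal holding two escaped backslashes
unterminated = r'"abc\" def'  # opening quote, escaped quote, no closing quote

print(ESCAPE_AWARE.findall(sample))
print(LOOKBEHIND.findall(sample))
print(ESCAPE_AWARE.findall(unterminated))
print(STRICT.findall(unterminated))
```

On well-formed input the first and third patterns agree; they differ only when a literal is never closed, where the looser class lets the engine backtrack and treat the backslash as an ordinary character.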

0

zcb Jun 21 '16 at 3:58

unutbu · Accepted Answer · 2013-01-16T19:46:29+0000

Perhaps use two negative lookbehinds:

    import re

    text = r'''"This is a string with an \"escaped quote\" inside it". While
    ""===r?+r:wt.test(r)?st.parseJSON(r)
    :r}catch(o){}st.data(e,n,r)}else r=t}return r}function s(e){var t;for(t in e)if(("data"
    '''

    for match in (re.findall(r'''(?x)   # verbose mode
        (?<!\\)    # not preceded by a backslash
        "          # a literal double-quote
        .*?        # zero or more characters, non-greedy
        (?<!\\)    # not preceded by a backslash
        "          # a literal double-quote
        ''', text)):
        print(match)

gives

    "This is a string with an \"escaped quote\" inside it"
    ""
    "data"

The question mark in .*? makes the pattern non-greedy. Non-greediness causes the pattern to stop matching as soon as it encounters the first unescaped double quote.
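To illustrate the effect of non-greediness, here is the same pattern with and without the question mark, run on a made-up line that contains three literals:

```python
import re

text = r'a = "one"; b = "two \"2\""; c = "three"'

# Non-greedy: stop at the first closing quote that is not preceded by a backslash.
lazy = re.findall(r'(?<!\\)".*?(?<!\\)"', text)

# Greedy: run on to the LAST such quote on the line, swallowing everything between.
greedy = re.findall(r'(?<!\\)".*(?<!\\)"', text)

print(lazy)
print(greedy)
```

The non-greedy version yields the three literals separately, while the greedy one returns a single match spanning from the first quote to the last.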
