I am trying to add some light Markdown support to a JavaScript preprocessor that I am writing in Python.
It works for the most part, but sometimes the regex that I use acts a little weird, and I think it has something to do with the source strings and escape sequences.
Regular expression: (?<!\\)\"[^\"]+\"
Yes, I know that it only matches strings starting with a " character. However, this project is born out of curiosity more than anything else, so I can live with it for now.
To break it down:
(?<!\\)\"   # The group should begin with a quotation mark that is not escaped
[^\"]+      # and match one or more characters that are not a quotation mark (this is the biggest problem, I know)
\"          # and end at the first quotation mark it finds
Having said that, I (obviously) start to run into problems with strings like this:
"This is a string with an \"escaped quote\" inside it"
I'm not sure how to say "everything except a quotation mark, unless that quotation mark is escaped." I tried:
([^\"]|\\\")+
but this leads to very strange results.
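To make the behaviour concrete, here is a minimal sketch that runs both patterns against the sample string above. It assumes the attempt is wrapped in the same opening and closing quotes as the original pattern. The third pattern, `ESCAPE_AWARE`, is a hypothetical alternative (not something from the original code) that tries the escape branch first and keeps backslashes out of the character class.

```python
import re

# Raw string, so the backslashes reach the regex engine untouched.
sample = r'"This is a string with an \"escaped quote\" inside it"'

# Original pattern: [^"]+ happily consumes the backslash, so the match
# ends at the first quote it sees, even though that quote is escaped.
ORIGINAL = re.compile(r'(?<!\\)\"[^\"]+\"')
original_result = ORIGINAL.findall(sample)

# The attempted fix. [^"] still matches the backslash before the \\" branch
# is tried, and findall() returns only the last repetition of a capturing
# group, which is why the output looks so odd.
ATTEMPT = re.compile(r'(?<!\\)"([^"]|\\")+"')
attempt_result = ATTEMPT.findall(sample)

# Hypothetical alternative: escape branch first, backslash excluded from
# the character class, and a non-capturing group.
ESCAPE_AWARE = re.compile(r'(?<!\\)"(?:\\.|[^"\\])*"')
escape_aware_result = ESCAPE_AWARE.findall(sample)

print(original_result)      # ['"This is a string with an \\"']
print(attempt_result)       # ['\\']
print(escape_aware_result)  # the whole sample as a single match
```

The single backslash returned by the second pattern is probably the kind of "very strange result" meant here: the overall match is the same as with the original pattern, but `findall()` reports the group rather than the match.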
I am fully prepared to hear that I am doing all this wrong. For simplicity's sake, assume that the regular expression will always start and end with a double quotation mark (") to avoid adding another element to the mix. I really want to understand what I have so far.
Thanks for any help.
EDIT
As a test of the regular expression, I try to find all string literals in a minified jQuery script with the following code (using unutbu's pattern below):

import re

STRLIT = r'''(?x)   # verbose mode
    (?<!\\)         # not preceded by a backslash
    "               # a literal double-quote
    .*?             # non-greedy: zero or more characters
    (?<!\\)         # not preceded by a backslash
    "               # a literal double-quote
'''

# Read the whole script; the file is closed automatically.
with open("jquery.min.js", "r") as f:
    jq = f.read()

literals = re.findall(STRLIT, jq)
The answer below fixes almost all of the problems. The ones that still arise are in jQuery's own regular expressions, which is very important. The solution no longer mistakenly identifies valid JavaScript as Markdown links, which was the goal.
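For anyone who wants to try this without a copy of jquery.min.js, here is a small sketch using two made-up JavaScript snippets (they are illustrative assumptions, not lines from jQuery). The first contains only ordinary string literals; the second contains a regular expression literal with a bare quote in it, which is my guess at the kind of input that still trips the pattern up.

```python
import re

STRLIT = r'''(?x)   # verbose mode
    (?<!\\)         # not preceded by a backslash
    "               # a literal double-quote
    .*?             # as few characters as possible (possibly none)
    (?<!\\)         # not preceded by a backslash
    "               # a literal double-quote
'''

# Hypothetical snippets, written as raw strings to keep the backslashes.
plain_js = r'var a = "x\"y"; $("div").text("hi");'
regex_js = r'var re = /"/; var s = "a";'

plain_literals = re.findall(STRLIT, plain_js)
regex_literals = re.findall(STRLIT, regex_js)

print(plain_literals)   # ['"x\\"y"', '"div"', '"hi"']
print(regex_literals)   # ['"/; var s = "']
```

In the second snippet the quote inside `/"/` is taken as the start of a string, so the match runs up to the opening quote of the real literal and the literal itself is lost. A regex alone cannot tell those two contexts apart, so handling this properly would likely need some tokenizing of the JavaScript first.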