Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)

Given a text file where the text I want to match is delimited by single quotes, but may contain zero or one escaped single quote, as well as zero or more tabs and newlines (not escaped), I want to match only the text. Example:

    menu_item = 'casserole';
    menu_item = 'meat
                loaf';
    menu_item = 'Tony\'s magic pizza';
    menu_item = 'hamburger';
    menu_item = 'Dave\'s famous pizza';
    menu_item = 'Dave\'s lesser-known gyro';

I want to capture only the text (and spaces), ignoring tabs and newlines. I don't really mind whether the escaped quote appears in the results, as long as it doesn't affect the match:

    casserole
    meat loaf
    Tonys magic pizza
    hamburger
    Daves famous pizza
    Dave\'s lesser-known gyro  # quote is okay if necessary.

I've managed to write a regex that almost does this: it handles the escaped quotes, but not the newlines:

    menuPat = r"menu_item = \'(.*)(\\\')?(\t|\n)*(.*)\'"
    for line in inFP.readlines():
        m = re.search(menuPat, line)
        if m is not None:
            print(m.group())

There are plenty of regex questions out there, but most of them use Perl, and if one of them does what I want, I couldn't figure it out. Also, since I'm using Python, I don't care if the match is spread across several groups; it's easy to recombine them.

Some answers have suggested just writing code to parse the text. I'm sure I could do that, but I'm so close to having a working regular expression :) and it seems like this should be doable.

Update: I just realized that I'm using Python's readlines() to get each line, which obviously splits up the text passed to the regex. I'm looking at rewriting it, but any suggestions on that part would also be very helpful.
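On the readlines() point, one option is to read the whole file into a single string, so that a quoted value spanning several lines reaches the regex intact. The following is only a minimal sketch: `io.StringIO` stands in for the real file, the sample contents are made up, and the pattern is deliberately simplified (it does not handle escaped quotes) so that only the effect of line-by-line matching is visible.

```python
import io
import re

# Stand-in for the real input file; the contents are made up for illustration.
in_fp = io.StringIO("menu_item = 'casserole';\nmenu_item = 'meat\n\t\tloaf';\n")

# Deliberately simple pattern (no escape handling), just to show the effect.
quoted = re.compile(r"menu_item = '([^']*)'")

whole_text = in_fp.read()

# Line-by-line matching: a value split across lines is never seen whole.
per_line = []
for line in whole_text.splitlines(keepends=True):
    m = quoted.search(line)
    if m:
        per_line.append(m.group(1))

# Matching against the whole text finds the multi-line value too.
whole = quoted.findall(whole_text)

print(per_line)
print(whole)
```

Matching per line only finds the first item here, while matching against the whole text finds both.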

3 answers

This should do it:

 menu_item = '((?:[^'\\]|\\')*)' 

Here the (?:[^'\\]|\\')* part matches any sequence made up of either any character except ' and \, or a literal \'. The former expression, [^'\\], also allows line breaks and tabs, which you then need to replace with a single space.
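As a rough sketch of how this pattern might be applied, the snippet below runs it with `re.findall` and then collapses each run of whitespace into a single space. The sample text and the variable names are assumptions made for illustration, not part of the answer.

```python
import re

# The pattern from this answer, compiled as a raw string.
pattern = re.compile(r"menu_item = '((?:[^'\\]|\\')*)'")

# Made-up sample: one value spans two lines, one contains an escaped quote.
text = "menu_item = 'meat\n\t\tloaf';\nmenu_item = 'Tony\\'s magic pizza';"

# Collapse tabs/newlines (any whitespace run) into a single space.
items = [re.sub(r"\s+", " ", m) for m in pattern.findall(text)]

print(items)
```

The escaped quote stays in the result as \' here; whether to strip the backslash afterwards is a separate choice.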


This tested script should do the trick:

    import re

    re_sq_long = r"""
        # Match single quoted string with escaped stuff.
        '            # Opening literal quote
        (            # $1: Capture string contents
          [^'\\]*    # Zero or more non-', non-backslash
          (?:        # "unroll-the-loop"!
            \\.      # Allow escaped anything.
            [^'\\]*  # Zero or more non-', non-backslash
          )*         # Finish {(special normal*)*} construct.
        )            # End $1: String contents.
        '            # Closing literal quote
        """
    re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"

    data = r'''
            menu_item = 'casserole';
            menu_item = 'meat
                        loaf';
            menu_item = 'Tony\'s magic pizza';
            menu_item = 'hamburger';
            menu_item = 'Dave\'s famous pizza';
            menu_item = 'Dave\'s lesser-known gyro';'''

    matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
    menu_items = []
    for match in matches:
        match = re.sub(r'\s+', ' ', match)  # Clean whitespace
        match = re.sub(r'\\', '', match)    # Remove escapes
        menu_items.append(match)            # Add to menu list

    print(menu_items)

Here is a short version of the regular expression:

'([^'\\]*(?:\\.[^'\\]*)*)'
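To make the shape of the short pattern easier to see, here is a small sketch comparing it with the plain alternation form, where each loop iteration picks either one normal character or one escape pair. The sample string and the variable names are assumptions made for illustration; both patterns should capture the same contents on it.

```python
import re

# Plain alternation: each iteration consumes one normal char or one escape pair.
alternation = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

# Unrolled form from this answer: normal* (special normal*)*
unrolled = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)

# Made-up sample: a plain value, an escaped quote, and a newline plus tab.
sample = "a = 'plain'; b = 'Tony\\'s'; c = 'meat\n\tloaf';"

print(alternation.findall(sample))
print(unrolled.findall(sample))
```

The unrolled version consumes whole runs of normal characters in one step instead of choosing between alternatives for every character.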

This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. (See:


You could try it like this:

 pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL) 

It starts matching at the first single quote it finds and ends at the first single quote that is not preceded by a backslash. It also captures any newlines and tabs found between the two single quotes.
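A short sketch of this pattern in use, plus one edge case worth knowing about. The sample strings are made up, and the edge case (a value ending in an escaped backslash) is hypothetical: nothing in the question's data suggests it occurs, so it may not matter here.

```python
import re

lookbehind = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

# Typical input: an escaped quote and a value spanning two lines.
ok_text = "menu_item = 'Dave\\'s famous pizza'; menu_item = 'meat\n\tloaf';"
ok_items = lookbehind.findall(ok_text)
print(ok_items)

# Hypothetical edge case: the value ends with an escaped backslash (\\),
# so the real closing quote is preceded by a backslash and gets skipped.
edge_text = "menu_item = 'C:\\\\'; menu_item = 'next';"
edge_items = lookbehind.findall(edge_text)
print(edge_items)
```

In the edge case the match runs on to the next quote, because the lookbehind only inspects the single character before the quote and cannot tell an escaped quote from a quote that follows an escaped backslash.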

