Python regex matches only once

I'm trying to write a simple converter to LaTeX, just to learn Python and basic regex, but I'm stuck trying to figure out why the code below does not work:

re.sub (r'\[\*\](.*?)\[\*\]: ?(.*?)$', r'\\footnote{\2}\1', s, flags=re.MULTILINE|re.DOTALL) 

I want to convert something like:

    s = """This is a note[*] and this is another[*]
    [*]: some text
    [*]: other text"""

into

 This is a note\footnote{some text} and this is another\footnote{other text} 

This is what I got (using my regex above):

    This is a note\footnote{some text} and this is another[*]
    [*]: other text

Why is the pattern matched only once?

EDIT:

I tried the following statement:

    # (?!:) prevents "[*]:" from being matched as a placeholder
    re.sub(r'\[\*\](?!:)(?=.+?\[\*\]: ?(.+?)$)', r'\\footnote{\1}', s, flags=re.DOTALL|re.MULTILINE)

Now it matches all the placeholders, but they are not paired with the correct notes.

    s = """This is a note[*] and this is another[*]
    [*]: some text
    [*]: other text"""

gives me

    This is a note\footnote{some text} and this is another\footnote{some text}
    [*]: some text
    [*]: other text

Any thoughts on this?

2 answers

The reason is that you cannot match the same characters more than once. Once a character has been matched, it is consumed by the regex engine and cannot be reused for another match.

A common workaround is to capture the overlapping parts with a capture group inside a lookahead assertion. But this cannot be done in your case, because there is no way to tell which note belongs to which placeholder.
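To make the two points above concrete, here is a minimal sketch on a toy string (the string `'aaaa'` and the variable names are illustrative, not taken from the question). It shows that ordinary matches never overlap, and that a capture group inside a lookahead can report overlapping occurrences because the lookahead itself consumes nothing:

```python
import re

# Ordinary matching: each character can belong to at most one match,
# so only two non-overlapping 'aa' are found in 'aaaa'.
plain = re.findall(r'aa', 'aaaa')

# Lookahead workaround: the assertion consumes no characters, so the
# engine advances one position at a time and the group captures every
# overlapping occurrence.
overlapping = re.findall(r'(?=(aa))', 'aaaa')

print(plain)        # ['aa', 'aa']
print(overlapping)  # ['aa', 'aa', 'aa']
```

The lookahead trick works here because every occurrence is self-describing. In the footnote case, a placeholder carries no information about which note it should look ahead to.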

An easier way is to first extract all the notes into a list, and then replace each placeholder using a callback. Example:

    import re

    s = '''This is a note[*] and this is another[*]
    [*]: note 1
    [*]: note 2'''

    # split the body text from the trailing block of notes
    [text, notes] = re.split(r'((?:\r?\n\[\*\]:[^\r\n]*)+$)', s)[:-1]

    # this generator yields the next replacement string
    def getnote(notes):
        for note in re.split(r'\r?\n\[\*\]: ', notes)[1:]:
            yield r'\footnote{{{}}}'.format(note)

    note = getnote(notes)

    # each placeholder pulls the next note from the generator
    res = re.sub(r'\[\*\]', lambda m: next(note), text)

    print(res)

The problem is that once your regular expression has consumed part of the string, it does not go back and re-evaluate that part for another match. So, after the first replacement, it will not return to match the second [*], because that text has already been consumed.
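As a rough illustration of this, the sketch below runs the pattern from the question through `re.finditer` and checks where the single match starts and ends. The input string is the one from the question, written with explicit `\n` line breaks, which is an assumption about how the original text was laid out:

```python
import re

# Input from the question, assuming one note per line.
s = "This is a note[*] and this is another[*]\n[*]: some text\n[*]: other text"
pattern = r'\[\*\](.*?)\[\*\]: ?(.*?)$'

matches = list(re.finditer(pattern, s, flags=re.MULTILINE | re.DOTALL))
for m in matches:
    print(m.span(), repr(m.group(0)))

# Position of the second placeholder in the input.
second_placeholder = s.index('[*]', s.index('[*]') + 1)

# The only match starts at the first placeholder and runs to the end of
# the first note line, so the second placeholder lies inside it.
first = matches[0]
print(first.start() <= second_placeholder < first.end())
```

Only one match is reported, and the second placeholder falls inside its span, which is why `re.sub` never gets a chance to replace it.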

What you need here is a loop that keeps replacing for as long as the pattern still matches. Something like this:

    >>> import re
    >>> text = 'This is a note[*] and this is another[*]\n\
    ... [*]: note 1\n\
    ... [*]: note 2'
    >>> reg = r'(.*?)\[\*\](.*?)\[\*\]: (note \d)(.*)'
    >>>
    >>> while re.search(reg, text, flags=re.MULTILINE|re.DOTALL):
    ...     text = re.sub(reg, r'\1\\footnote{\3}\2\4', text, flags=re.MULTILINE|re.DOTALL)
    ...
    >>>
    >>> text
    'This is a note\\footnote{note 1} and this is another\\footnote{note 2}\n\n'

You can tweak the regex a bit to get rid of the trailing newlines in the resulting string. You can also precompile the regular expression using re.compile.
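One possible way to apply both suggestions is sketched below. It keeps the same pattern, compiles it once with `re.compile`, and, instead of changing the regex, simply strips the leftover newlines afterwards with `str.rstrip` (that choice is an assumption of this sketch, not the only option):

```python
import re

text = 'This is a note[*] and this is another[*]\n[*]: note 1\n[*]: note 2'

# Compile once; the flags are stored with the compiled pattern.
reg = re.compile(r'(.*?)\[\*\](.*?)\[\*\]: (note \d)(.*)',
                 flags=re.MULTILINE | re.DOTALL)

# Each pass pairs the first remaining placeholder with the first
# remaining note; stop when nothing matches any more.
while reg.search(text):
    text = reg.sub(r'\1\\footnote{\3}\2\4', text)

# Drop the line breaks left behind where the note lines used to be.
text = text.rstrip('\r\n')

print(text)  # This is a note\footnote{note 1} and this is another\footnote{note 2}
```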
