Match last occurrence with regex

Question

I would like to match the last occurrence of a pattern using regular expressions.

I have text structured like this:

Pellentesque habitant morbi tristique senectus et netus et lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>

I want to match the last text between two <br> tags, in my case <br>Tizi Ouzou<br>, ideally just the string Tizi Ouzou.

Please note that there are some spaces after the last <br>.

I tried this:

 <br>.*<br>\s*$

but it matches everything from the first <br> to the last.

NB: I'm using Python, and I test my regexes with pythex.

+8

python regex

Ghilas belhadj Aug 24 '13 at 19:38

6 answers

For me, the clearest way:

 >>> re.findall('<br>(.*?)<br>', text)[-1]
 'Tizi Ouzou'
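One caveat worth sketching: because each match consumes its closing <br>, findall skips every other segment, so the one-liner only lands on the last segment when the number of segments is odd (as it happens to be in the question's text). A possible variant, assuming the same <br>-delimited input, keeps the closing <br> in a lookahead so it can also open the next match. The helper name all_segments is hypothetical.

```python
import re

def all_segments(text):
    """Return every piece of text that sits between two consecutive <br> tags."""
    # The lookahead checks for the closing <br> without consuming it,
    # so the same tag can open the next match.
    return re.findall(r'<br>(.*?)(?=<br>)', text)

sample = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "
even = "<br>a<br>b<br>"

print(all_segments(sample)[-1])             # Tizi Ouzou
print(re.findall(r'<br>(.*?)<br>', even))   # ['a'], the segment 'b' is skipped
print(all_segments(even))                   # ['a', 'b']
```

Without re.DOTALL the dot does not cross newlines, so segments that span several lines would need that flag.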

+15

moliware Aug 24 '13 at 19:56

Look at the related questions: you should not parse HTML with regular expressions. Use an HTML parser instead. For Python, I hear Beautiful Soup is the usual choice.

In any case, if you want to do this with a regular expression, you need to make sure that .* cannot run past another <br>. To do this, before consuming each character, we can use a negative lookahead to make sure it does not start another <br>:
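As a rough illustration of the parser route, here is a minimal sketch that uses the standard library's html.parser instead of Beautiful Soup (a deliberate swap, to keep the example dependency-free). The class name BrSegments and its attributes are hypothetical, and the sketch assumes the segments of interest are plain text delimited by <br> tags.

```python
from html.parser import HTMLParser

class BrSegments(HTMLParser):
    """Collect the text found between consecutive <br> tags."""

    def __init__(self):
        super().__init__()
        self.segments = []     # completed chunks, each closed by a <br>
        self._buffer = []      # text seen since the most recent <br>
        self._seen_br = False  # text before the first <br> is not a segment

    def handle_starttag(self, tag, attrs):
        if tag == 'br':
            if self._seen_br:
                self.segments.append(''.join(self._buffer))
            self._seen_br = True
            self._buffer = []

    def handle_data(self, data):
        self._buffer.append(data)

text = "quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "
parser = BrSegments()
parser.feed(text)
parser.close()
print(parser.segments[-1])  # Tizi Ouzou
```

Text after the final <br> (here, the trailing spaces) never becomes a segment, because no further <br> closes it.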

 <br>(?:(?!<br>).)*<br>\s*$
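A minimal Python sketch of this pattern follows, assuming a shortened version of the question's text. The capture group around the repeated part is an addition for convenience, so that the inner text can be read without the surrounding <br> tags.

```python
import re

text = "quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

# (?:(?!<br>).)* consumes one character at a time, but only while the
# upcoming characters do not start another <br>.
pattern = re.compile(r'<br>((?:(?!<br>).)*)<br>\s*$')

m = pattern.search(text)
last = m.group(1) if m else None
print(last)  # Tizi Ouzou
```

Earlier <br> positions are tried first and rejected, because the <br> that closes them is not followed by optional whitespace and the end of the string.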

+7

Martin ender Aug 24 '13 at 19:46

You can use a greedy quantifier with a negated character class (if you have no tags between your <br> tags):

 <br>([^<]*)<br>\s*$

or

 <br>((?:[^<]+|<(?!br>))*)<br>\s*$

to allow other tags inside.

Since the string you are looking for is Tizi Ouzou without the <br> tags, you can extract the first capture group.
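To make the difference between the two patterns concrete, here is a small sketch. The sample strings are made up for illustration (the second one adds a <b> tag that is not in the question's text).

```python
import re

simple = re.compile(r'<br>([^<]*)<br>\s*$')
with_tags = re.compile(r'<br>((?:[^<]+|<(?!br>))*)<br>\s*$')

plain = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "
nested = "egestas <br>semper<br>Tizi <b>Ouzou</b><br>   "

# Both patterns agree when the last segment contains no tags.
print(simple.search(plain).group(1))      # Tizi Ouzou
print(with_tags.search(plain).group(1))   # Tizi Ouzou

# Only the second pattern tolerates a tag inside the last segment.
print(simple.search(nested))              # None
print(with_tags.search(nested).group(1))  # Tizi <b>Ouzou</b>
```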

+6

Casimir et Hippolyte Aug 24 '13 at 19:44

How about [^<>]* instead of .*:

 import re

 text = """Pellentesque habitant morbi tristique senectus et netus et lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """

 # [^<>]* cannot cross a tag, so only the last <br>...<br> pair can reach the end of the text.
 # The raw string keeps \s from being treated as a string escape.
 print(re.search(r'<br>([^<>]*)<br>\s*$', text).group(1))

prints

 Tizi Ouzou

+4

alecxe Aug 24 '13 at 19:46

Try:

 re.match(r'(?s).*<br>(?=.*<br>)(.*)<br>', s).group(1)

First, it consumes all the data up to the last <br>, then backtracks until the lookahead confirms that there is another <br> after it, and then it extracts the content between them.

This gives:

 Tizi Ouzou

EDIT: no need for the lookahead. An alternative (with the same result), based on m.buettner's comment:

 re.match(r'(?s).*<br>(.*)<br>', s).group(1)
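A short sketch of why the (?s) flag matters here: re.match only tries position 0, so the leading .* has to be able to cross newlines to reach the last pair of <br> tags. The multi-line sample string is an assumption made for the demonstration.

```python
import re

s = "line one\nline two <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   \n"

# With (?s), the greedy .* swallows everything, then backtracks to the
# right-most <br> that is still followed by another <br>.
m = re.match(r'(?s).*<br>(.*)<br>', s)
last = m.group(1)
print(last)  # Tizi Ouzou

# Without (?s), .* stops at the first newline and the match fails.
no_dotall = re.match(r'.*<br>(.*)<br>', s)
print(no_dotall)  # None
```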

+3

Birei Aug 24 '13 at 19:44

Jon clements · Accepted Answer · 2013-08-24T19:45:21+0000

A regex-free approach using built-in str methods:

 text = """ Pellentesque habitant morbi tristique senectus et netus et lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """
 # Split at most twice from the right; the middle piece sits between the last two <br> tags.
 res = text.rsplit('<br>', 2)[-2]
 # Tizi Ouzou
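The one-liner raises an IndexError or returns the wrong piece when the text holds fewer than two <br> tags. A guarded variant might look like the sketch below; the function name last_between and its sep parameter are hypothetical.

```python
def last_between(text, sep='<br>'):
    """Return the text between the last two separators, or None if there are fewer than two."""
    parts = text.rsplit(sep, 2)  # at most two splits, counted from the right
    # Three parts means two separators were found; the middle part is the target.
    return parts[1] if len(parts) == 3 else None

print(last_between("egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "))  # Tizi Ouzou
print(last_between("only one <br> here"))                                     # None
```

Unlike the anchored regexes, this does not check that only whitespace follows the last <br>.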
