Match last occurrence with regex

I would like to match the last occurrence of a pattern using regular expressions.

I have text structured like this:

Pellentesque habitant morbi tristique senectus et netus et lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> 

I want to match the last text between two <br> tags, in my case <br>Tizi Ouzou<br>, or ideally just the string Tizi Ouzou.

Please note that there are some spaces after the last <br>.

I tried this:

 <br>.*<br>\s*$ 

but it matches everything from the first <br> to the last.
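A minimal sketch that reproduces the symptom, assuming a shortened copy of the sample text (the variable names are illustrative only):

```python
import re

# Shortened version of the sample text, with trailing spaces after the last <br>.
text = "quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

match = re.search(r"<br>.*<br>\s*$", text)

# search() reports the leftmost match, so it starts at the first <br>,
# and the greedy .* stretches across the inner <br> tags.
print(repr(match.group(0)))
# '<br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   '
```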

NB: I'm using Python, and I test my regexes with pythex.

+8
python regex
6 answers

A non-regex approach using built-in str methods:

    text = """ Pellentesque habitant morbi tristique senectus et netus et lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """
    res = text.rsplit('<br>', 2)[-2]  # 'Tizi Ouzou'
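If the input might contain fewer than two <br> tags, the indexing above can return the wrong piece or raise an error. A hedged sketch of a guarded variant follows; the helper name last_between is hypothetical and not part of the original answer:

```python
def last_between(text, sep="<br>"):
    """Return the text between the last two occurrences of sep, or None."""
    # Split at most twice, starting from the right-hand side.
    parts = text.rsplit(sep, 2)
    # Fewer than two separators means there is no enclosed segment.
    if len(parts) < 3:
        return None
    return parts[-2]


print(last_between("egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "))  # Tizi Ouzou
print(last_between("only one <br> here"))  # None
```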
+14

For me, the clearest way:

    >>> re.findall('<br>(.*?)<br>', text)[-1]
    'Tizi Ouzou'
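One caveat worth checking: each match consumes its closing <br>, so findall skips every other segment. The one-liner gives the right answer for the sample text, but it may not in general. The sketch below uses a made-up input to show the effect and a lookahead variant (my own adjustment, not the answerer's) that avoids it:

```python
import re

text = "<br>a<br>b<br>c<br>d<br>  "

# Each match consumes its closing <br>, so every other segment is skipped.
consuming = re.findall(r"<br>(.*?)<br>", text)
print(consuming)      # ['a', 'c']

# A lookahead leaves the closing <br> available as the next opening tag.
overlapping = re.findall(r"<br>(.*?)(?=<br>)", text)
print(overlapping)    # ['a', 'b', 'c', 'd']
print(overlapping[-1])  # d
```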
+15

Look at the related questions: you should not parse HTML with regular expressions. Use an HTML parser instead. For Python, I hear Beautiful Soup is good.

In any case, if you want to do this with a regular expression, you need to make sure that .* cannot run past another <br>. To do this, before consuming each character, we can use a negative lookahead to make sure it does not start another <br>:
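As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser instead of Beautiful Soup (a deliberate swap so that nothing needs installing). The class name BrSegments is hypothetical:

```python
from html.parser import HTMLParser


class BrSegments(HTMLParser):
    """Collect the text chunks that are delimited by <br> tags."""

    def __init__(self):
        super().__init__()
        self.segments = [""]

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.segments.append("")  # a <br> starts a new segment

    def handle_data(self, data):
        self.segments[-1] += data


parser = BrSegments()
parser.feed("quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   ")
parser.close()

# segments[0] precedes the first <br>; the final entry follows the last <br>.
print(parser.segments[-2])  # Tizi Ouzou
```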

 <br>(?:(?!<br>).)*<br>\s*$ 
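A short sketch of how this pattern might be used from Python. The capture group around the repeated part is my addition, so that the text can be read without the surrounding tags, and the sample string is shortened:

```python
import re

text = "quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "

# (?:(?!<br>).)* consumes one character at a time, but only while the
# current position is not the start of another <br>.
pattern = re.compile(r"<br>((?:(?!<br>).)*)<br>\s*$")

match = pattern.search(text)
print(match.group(1) if match else None)  # Tizi Ouzou
```

Note that . does not match newlines by default, so a segment spanning several lines would need re.DOTALL.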
+7

You can use a greedy quantifier with a restricted character class (if you have no tags between your <br> tags):

 <br>([^<]*)<br>\s*$ 

or

 <br>((?:[^<]+|<(?!br>))*)<br>\s*$ 

to allow other tags inside.

Since the string you are looking for is Tizi Ouzou without the <br> tags, you can extract the first capture group.
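To make the difference between the two patterns concrete, here is a small sketch. The second sample string, with a nested <b> tag, is invented for illustration:

```python
import re

plain = "egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br>   "
nested = "egestas <br>semper<br>Tizi <b>Ouzou</b><br>   "

simple = re.compile(r"<br>([^<]*)<br>\s*$")
with_tags = re.compile(r"<br>((?:[^<]+|<(?!br>))*)<br>\s*$")

print(simple.search(plain).group(1))      # Tizi Ouzou
print(simple.search(nested))              # None: [^<]* cannot cross <b>
print(with_tags.search(nested).group(1))  # Tizi <b>Ouzou</b>
```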

+6

How about [^<>]* instead of .* :

    import re

    text = """Pellentesque habitant morbi tristique senectus et netus et lesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas <br>semper<br>tizi ouzou<br>Tizi Ouzou<br> """
    print(re.search(r'<br>([^<>]*)<br>\s*$', text).group(1))

prints

 Tizi Ouzou 
+4

Try:

 re.match(r'(?s).*<br>(?=.*<br>)(.*)<br>', s).group(1) 

First, the greedy .* consumes everything up to the last <br>. The lookahead fails there, so the regex backtracks to an earlier <br> that is followed by another <br>, and the group then captures the content between them.

This gives:

 Tizi Ouzou 

EDIT: the lookahead is not needed. An alternative (with the same result), based on m.buettner's comment:

 re.match(r'(?s).*<br>(.*)<br>', s).group(1) 
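A quick sketch comparing the two patterns on an invented input whose last segment spans a line break, which is where the (?s) flag (re.DOTALL) matters:

```python
import re

s = "intro\n<br>semper<br>tizi ouzou<br>Tizi\nOuzou<br>   "

# (?s) lets . match newlines, so the leading .* can skip earlier lines
# and the group can capture a segment that spans a line break.
with_lookahead = re.match(r"(?s).*<br>(?=.*<br>)(.*)<br>", s)
without_lookahead = re.match(r"(?s).*<br>(.*)<br>", s)

print(repr(with_lookahead.group(1)))     # 'Tizi\nOuzou'
print(repr(without_lookahead.group(1)))  # 'Tizi\nOuzou'
```

Both calls raise AttributeError on .group(1) if the input has fewer than two <br> tags, because re.match then returns None.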
+3
