Combining multiple regular expressions

Question


I am trying to remove a few things from a block of text using regex. I have all my patterns ready, but I can't remove two (or more) matches that overlap.

For example:

    import re

    r1 = r'I am'
    r2 = r'am foo'
    text = 'I am foo'

    re.sub(r1, '', text)  # Returns ' foo'
    re.sub(r2, '', text)  # Returns 'I '

How can I replace both matches at the same time and end up with an empty string?

I ended up using a slightly modified version of Ned Batchelder's answer:

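    def clean(self, text):
        # `patterns` is assumed to be defined elsewhere (a list of regexes).
        mask = bytearray(len(text))

        for pattern in patterns:
            for match in re.finditer(pattern, text):
                # Use a slice: a bytearray cannot be indexed with a range.
                r = slice(match.start(), match.end())
                mask[r] = b'x' * (r.stop - r.start)

        return ''.join(character for character, bit in zip(text, mask) if not bit)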

+7

python regex

Blender Jul 11 '12 at 22:28


4 answers

Not to be a downer, but the short answer is that I'm pretty sure you can't. Can you change your regexes so that they don't need to overlap?

If you still want to do this, I would keep track of the start and stop indices of each match made against the original string. Then go through the string and keep only the characters that are not in any deletion range.
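The index-tracking idea can be sketched roughly as follows. This is only an illustration, assuming Python 3; the helper name `remove_matches` is hypothetical and not from the answer. It collects the span of every match, sorts the spans, and copies only the stretches of text that fall between them.

```python
import re


def remove_matches(text, patterns):
    """Delete every character covered by a match of any pattern."""
    # Collect the (start, end) span of every match in the original string.
    spans = sorted(
        m.span() for pattern in patterns for m in re.finditer(pattern, text)
    )

    kept = []
    position = 0  # first index that no earlier span has covered yet
    for start, end in spans:
        if start > position:
            # Gap between the previous deletion range and this one: keep it.
            kept.append(text[position:start])
        position = max(position, end)
    kept.append(text[position:])
    return "".join(kept)


print(repr(remove_matches("I am foo", [r"I am", r"am foo"])))  # prints ''
```

Because the spans are sorted and `position` only moves forward, overlapping ranges such as (0, 4) and (2, 8) are handled without merging them explicitly.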

+2

Carl Walsh Jul 11 '12 at 10:39


Another quite effective solution comes from ... Perl: combine the regular expressions into one:

    # aptitude install regexp-assemble
    $ regexp-assemble
    I am
    I am foo
    Ctrl + D
    I am(?: foo)?

regexp-assemble takes all the regular expressions or string variants that you want to match and combines them into one. And yes, this turns the original problem into a different one: it is no longer about matching overlapping regular expressions, but about combining the regular expressions into a single pattern to match.

And then you can use it in your code:

    $ python
    Python 2.7.3 (default, Aug 1 2012, 05:14:39)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> re.sub("I am(?: foo)?", "", "I am foo")
    ''

A port of Regexp::Assemble to Python would be nice :)
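In plain Python, a minimal sketch of the simplest case (literal strings only) is to escape each alternative and join them into one alternation, longest first. This is not what regexp-assemble does internally, since it produces a more compact pattern, and `combine_literals` is a hypothetical helper name. It also does not solve the overlapping-pattern case from the question; it only shows the "one combined pattern" idea.

```python
import re


def combine_literals(strings):
    """Build one regex that matches any of the given literal strings."""
    # Longest first, so that 'I am foo' is tried before its prefix 'I am'.
    ordered = sorted(set(strings), key=len, reverse=True)
    return "|".join(re.escape(s) for s in ordered)


pattern = combine_literals(["I am", "I am foo"])
print(repr(re.sub(pattern, "", "I am foo")))  # prints ''
```

Sorting by length matters because alternation tries the branches left to right and stops at the first one that matches at a given position.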

+1

user1458574 Sep 09 '12 at 9:54


Here is an alternative that filters the string on the fly by applying itertools.compress to the text with a selector iterator. The selector yields True if the character should be kept. selector_for_pattern creates one selector for each pattern. The selectors are combined with the all function (a character ends up in the resulting string only when every pattern wants to keep it).

    # Python 2 code: in Python 3 use range, map and zip instead of
    # xrange, itertools.imap and itertools.izip.
    import itertools
    import re

    def selector_for_pattern(text, pattern):
        i = 0
        for m in re.finditer(pattern, text):
            # Characters before the match are kept ...
            for _ in xrange(i, m.start()):
                yield True
            # ... and characters inside the match are dropped.
            for _ in xrange(m.start(), m.end()):
                yield False
            i = m.end()
        for _ in xrange(i, len(text)):
            yield True

    def clean(text, patterns):
        gen = [selector_for_pattern(text, pattern) for pattern in patterns]
        selector = itertools.imap(all, itertools.izip(*gen))
        return "".join(itertools.compress(text, selector))
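For readers on Python 3, the same selector idea might look like the sketch below. It is an adaptation, not the answer's own code: the built-in `map` and `zip` are already lazy there, and `itertools.repeat` stands in for the explicit counting loops.

```python
import itertools
import re


def selector_for_pattern(text, pattern):
    """Yield one flag per character: False inside a match, True outside."""
    i = 0
    for m in re.finditer(pattern, text):
        yield from itertools.repeat(True, m.start() - i)
        yield from itertools.repeat(False, m.end() - m.start())
        i = m.end()
    yield from itertools.repeat(True, len(text) - i)


def clean(text, patterns):
    selectors = [selector_for_pattern(text, p) for p in patterns]
    # A character is kept only if every pattern's selector says True.
    keep = map(all, zip(*selectors))
    return "".join(itertools.compress(text, keep))


print(repr(clean("I am foo", [r"I am", r"am foo"])))  # prints ''
```

One caveat of this sketch: with an empty list of patterns, `zip()` yields nothing, so `clean` returns an empty string rather than the original text.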

+1

Thomas Jung Sep 27 '12 at 13:58


Ned Batchelder · Accepted Answer · 2012-07-11T22:39:55+0000

You cannot do this with consecutive re.sub calls, as you showed. You can use re.finditer to find all the matches instead. Each match gives you a match object with .start() and .end() methods that indicate its position in the string. You can collect all of these positions and then remove the matched characters at the end.

Here I use a bytearray as a mutable string, serving as a mask. It is initialized to zero bytes, and I mark with "x" every byte that is matched by any of the regular expressions. Then I use the mask to select the characters that I want to keep from the original string, and build a new string with only the unmatched characters:

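    bits = bytearray(len(text))
    for pat in patterns:
        for m in re.finditer(pat, text):
            # Mark every byte covered by this match.
            bits[m.start():m.end()] = b'x' * (m.end() - m.start())
    # Keep only the characters whose mask byte is still zero.
    new_string = ''.join(c for c, bit in zip(text, bits) if not bit)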
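Wrapped in a function and run on the example from the question, the mask approach might look like this. It is a sketch assuming Python 3, and the name `remove_all` is hypothetical; the mask is filled with `b"x"` because a bytearray slice cannot be assigned a `str` there.

```python
import re


def remove_all(text, patterns):
    """Remove every character that any of the patterns matches."""
    mask = bytearray(len(text))  # one zero byte per character
    for pat in patterns:
        for m in re.finditer(pat, text):
            # Mark the matched range; overlapping matches just re-mark bytes.
            mask[m.start():m.end()] = b"x" * (m.end() - m.start())
    # Iterating a bytearray yields ints, so unmarked positions are falsy.
    return "".join(c for c, bit in zip(text, mask) if not bit)


print(repr(remove_all("I am foo", [r"I am", r"am foo"])))  # prints ''
```

Every pattern is matched against the original, unmodified text, which is why overlapping matches such as 'I am' and 'am foo' can both be removed.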
