Python regex expression within a regex

Given:

    ABC
    content 1
    123
    content 2
    ABC
    content 3
    XYZ

Is it possible to create a regular expression that matches the shortest version of "ABC[\W\w]+?XYZ"?

Essentially, I'm looking for "ABC followed by any characters, ending in XYZ, but don't match if ABC turns up anywhere between them" (but think of ABC as potentially being a regex itself, because it won't always be a fixed length... so ABC or ABcC could both match, for example).

So, in a more general sense: REGEX1, followed by any characters, ending with REGEX2, but with no match if REGEX1 occurs between them.

In this example, I do not need the first 4 lines.
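
For reference, a minimal sketch of what the naive non-greedy pattern above actually grabs on this input (the match still starts at the first ABC):

    import re

    s = "ABC\ncontent 1\n123\ncontent 2\nABC\ncontent 3\nXYZ"

    # the non-greedy quantifier still anchors at the *first* ABC,
    # so the match runs all the way from there to XYZ
    print(re.findall(r'ABC[\W\w]+?XYZ', s))
    # ['ABC\ncontent 1\n123\ncontent 2\nABC\ncontent 3\nXYZ']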

(I'm sure this explanation could itself potentially need... further explanation, haha)

EDIT:

OK, now I see the need for further explanation! Thanks for the suggestions so far. I'll give you some more background, along with what I'll be thinking about as I look at how each of the proposed solutions can be applied to my problem.

Proposal 1: Split the string apart and use the regexes on the pieces

This is definitely a clever hack that solves the problem as I originally described it. In simplifying the issue, though, I failed to mention that the same thing can happen in the reverse direction, because the ending signature may also appear again later on (and it does, in my specific situation). That presents the problem shown below:

    ABC
    content 1
    123
    content 2
    ABC
    content 3
    XYZ
    content 4
    MNO
    content 5
    XYZ

In this case, I would be searching for something like "ABC through XYZ", meaning I want to catch [ABC, content 3, XYZ]... but a forward search accidentally catches [ABC, content 1, 123, content 2, ABC, content 3, XYZ], while a reverse search catches [ABC, content 3, XYZ, content 4, MNO, content 5, XYZ] instead of the [ABC, content 3, XYZ] we want, again. The point is that I'm trying to make this as general as possible, because I will also be looking for things that could share the same starting signature (in this case the regex "ABC") but have different ending signatures.

If there is a way to build this restriction into the regular expressions themselves, it would be much easier to rely on that whenever I create a regex to search this type of string, instead of writing a custom search algorithm to deal with it.

Proposal 2: A+B+C+[^A]+[^B]+[^C]+XYZ with the IGNORECASE flag

This seems workable when ABC is a fixed, literal string. But remember to think of ABC as a regular expression. For instance:

 Hello!GoodBye!Hello.Later. 

This is a VERY simplified version of what I'm trying to do. I would want to match "Hello.Later." given the starting regex Hello[!.] and the ending regex Later[!.]. Running something as simple as Hello[!.].*Later[!.] captures the entire line, but what I want to say is: if the starting regex Hello[!.] occurs between the first found instance of the starting regex and the first found instance of the ending regex, ignore that match.
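
To make that concrete, a quick sketch of the over-matching behaviour just described (nothing more than the naive pattern run through re.findall):

    import re

    s = "Hello!GoodBye!Hello.Later."

    # the naive pattern starts at the first Hello! and swallows everything up to Later.
    print(re.findall(r'Hello[!.].*Later[!.]', s))
    # ['Hello!GoodBye!Hello.Later.']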

The discussion below this post suggests that I may be running into the limits of regular languages, similar to the parentheses-matching problem (Google it, it's interesting to think about). The purpose of this post is to check whether I really have to resort to writing a basic algorithm by hand that handles the problem I'm facing. I would really like to avoid that if possible (using the simple example I gave above, it's quite easy to build a finite state machine for it... I'm hoping that holds up, because the real case gets a little more complicated).

Proposal 3: ABC(?:(?!ABC).)*?XYZ with the DOTALL flag

I like the idea of this if it actually allows ABC to be a regular expression. I'll have to investigate it when I get to the office tomorrow. At first glance nothing looks too far-fetched, but I'm completely new to Python regex (and new to using regex in code, rather than just in theoretical homework).

3 answers

The regex solution will be ABC(?:(?!ABC).)*?XYZ with the DOTALL flag.
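
For illustration, a short sketch of that pattern in use on the examples from the question; embedding the prefix regex itself inside the negative lookahead (second call) is my own extrapolation of the same idea:

    import re

    s1 = "ABC\ncontent 1\n123\ncontent 2\nABC\ncontent 3\nXYZ"
    print(re.findall(r'ABC(?:(?!ABC).)*?XYZ', s1, re.DOTALL))
    # ['ABC\ncontent 3\nXYZ']

    # the prefix may itself be a regex, embedded in the lookahead
    s2 = "Hello!GoodBye!Hello.Later."
    print(re.findall(r'Hello[!.](?:(?!Hello[!.]).)*?Later[!.]', s2, re.DOTALL))
    # ['Hello.Later.']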


Edit

So, after reading your further explanation, I would say that my previous proposal, as well as MRAB's (which is somewhat similar), will be of no help. Your problem is, in fact, about spans of nested structures.

Think of your "prefixes" and "suffixes" as characters: you could just as easily replace them with an opening and a closing parenthesis or whatever, and what you want is to match only the smallest (that is, the deepest) pair...

For example, if your prefix is "ABC" and your suffix is "XYZ":

 ABChello worldABCfooABCbarXYZ 

You want to get only ABCbarXYZ.

Same thing if the prefix is ( and the suffix is ), with the string:

 (hello world(foo(bar) 

you would ideally want to match only (bar)...

You really need a context-free grammar (as used for programming languages, e.g. the C grammar or the Python grammar) and a parser, or you can roll your own solution using regular expressions together with your programming language's facilities for iteration and for storing state.

But this is not possible using regular expressions alone. They will most likely help inside your algorithm, but they simply are not designed to solve this problem by themselves; they're not the right tool for the job... you can't inflate a tire with a screwdriver. So you will have to bring in some machinery outside the regex engine, but it isn't difficult to keep track of the context, i.e. your position in the nesting stack, and you can still apply your regular expressions within each individual context.
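
As a rough sketch of that idea (the helper below and its name are hypothetical, not from any library: it just scans for either pattern and keeps a stack of open prefixes):

    import re

    def innermost_span(text, prefix, suffix):
        # prefix and suffix are regex pattern strings; scan for either one,
        # keeping a stack of start offsets of the prefixes seen so far
        token = re.compile('(?P<open>%s)|(?P<close>%s)' % (prefix, suffix), re.DOTALL)
        stack = []
        best = None
        for m in token.finditer(text):
            if m.lastgroup == 'open':
                stack.append(m.start())
            elif stack:
                start = stack.pop()              # the deepest open prefix closes first
                span = text[start:m.end()]
                if best is None or len(span) < len(best):
                    best = span
        return best

    print(innermost_span('(hello world(foo(bar)', r'\(', r'\)'))          # '(bar)'
    print(innermost_span('ABChello worldABCfooABCbarXYZ', 'ABC', 'XYZ'))  # 'ABCbarXYZ'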

Finite state machines are, by definition, finite, while nested structures can be arbitrarily deep, which would require your machine to grow without bound; that is why such languages are not regular languages.

Because recursion in a grammar allows you to define nested syntactic structures, any language (including any programming language) that allows nested structures is a context-free language, not a regular one. For example, the set of strings consisting of balanced parentheses (roughly, a LISP program with the alphanumeric characters removed) is a context-free language; see here.

Previous proposal (no longer relevant)

If I do this:

    >>> import re
    >>> s = """ABC
    ... content 1
    ... 123
    ... content 2
    ... ABC
    ... content 3
    ... XYZ"""
    >>> r = re.compile(r'A+B+C+[^A]+[^B]+[^C]+XYZ', re.I)
    >>> re.findall(r, s)

I get

 ['ABC\ncontent 3\nXYZ'] 

Is this what you want?


There is another way to approach this problem: don't try to do it all in one regular expression. You can split the string on the first regular expression, and then use the second one on the last piece.

Code is the best explanation:

    import re

    s = """ABC
    content 1
    123
    content 2
    ABC
    content 3
    XYZ
    content 4
    XYZ"""

    # capturing groups to preserve the matched sections
    prefix = re.compile('(ABC)')
    suffix = re.compile('(XYZ)')

    # prefix.split(s) == ['', 'ABC', [..], 'ABC', '\ncontent 3\nXYZ\ncontent 4\nXYZ']
    #                                prefixmatch   ^^^^^^^^^^^^^ rest ^^^^^^^^^^^^^
    prefixmatch, rest = prefix.split(s)[-2:]

    # suffix.split(rest, 1) == ['\ncontent 3\n', 'XYZ', '\ncontent 4\nXYZ']
    #                            ^^ interior ^^  suffixmatch
    interior, suffixmatch = suffix.split(rest, 1)[:2]

    # join the parts up
    result = '%s%s%s' % (prefixmatch, interior, suffixmatch)

    # result == 'ABC\ncontent 3\nXYZ'

A few points:

  • there should be appropriate error handling (even just a try: ... except ValueError: around the whole thing) to handle the case where either regular expression does not match at all and the list unpacking therefore fails.
  • this assumes that the desired segment occurs after the last occurrence of prefix; if not, you can walk the results of prefix.split(s) two items at a time (starting at index 1) and do the same splitting trick with suffix to find all matches (see the sketch after this list).
  • this can be somewhat inefficient, since it creates quite a lot of intermediate data structures.
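
As referenced in the second note above, a rough sketch of walking prefix.split(s) two items at a time to find all matches (the helper name and the zip-based walk are my own; it just repeats the split trick once per prefix occurrence):

    import re

    def all_segments(s, prefix, suffix):
        # walk prefix.split(s) two items at a time: (matched prefix, text following it),
        # and apply the suffix split to each piece
        results = []
        parts = prefix.split(s)
        for prefixmatch, rest in zip(parts[1::2], parts[2::2]):
            pieces = suffix.split(rest, 1)
            if len(pieces) == 3:                      # the suffix actually matched
                interior, suffixmatch = pieces[:2]
                results.append('%s%s%s' % (prefixmatch, interior, suffixmatch))
        return results

    s = "ABC\ncontent 1\n123\ncontent 2\nABC\ncontent 3\nXYZ\ncontent 4\nXYZ"
    prefix = re.compile('(ABC)')
    suffix = re.compile('(XYZ)')
    print(all_segments(s, prefix, suffix))
    # ['ABC\ncontent 3\nXYZ']  -- the first ABC has no XYZ before the next ABC, so it is skipped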
