Multiple negative lookbehind assertions in Python regex?

I'm new to programming, so sorry if this seems trivial: I have a text that I am trying to split into separate sentences using regular expressions. Using the .split method, I look for a period followed by a capital letter, for example

 "\. A-Z" 

However, I need to refine this rule as follows: the . (period) must not be preceded by either Abs or S. And if it is followed by an uppercase letter (A-Z), it should not match if that letter starts the name of a month, for example January | February | March.

I tried to implement the first half, but even that did not work. My code is:

 "( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) " 
+8
4 answers

Firstly, I think you can replace the literal space with \s+, or with \s if it really is a single space (double spaces are common in English text).

Secondly, to match an uppercase letter you must use the character class [A-Z]; a bare A-Z will not work (and remember that there may be uppercase letters other than A-Z ...).

Also, I think I know why this is not working. The regex engine tries to match \. [A-Z] if it is not preceded by Abs, or if it is not preceded by S. The catch is that if the period is preceded by S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way, one of the alternatives always matches, since Abs and S are mutually exclusive.
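The effect described above can be checked with a small sketch. The sample sentence is made up for illustration, and the pattern is only an approximation of the asker's attempt (with [A-Z] restored and the stray parentheses and spaces removed):

```python
import re

# Approximation of the asker's idea as an alternation:
# "period not preceded by Abs" OR "period not preceded by S".
alternation = re.compile(r"(?<!Abs)\. [A-Z]|(?<!S)\. [A-Z]")

# Hypothetical sample text, not taken from the question.
text = "See Abs. Forth and S. Fifth. Done"

# Whichever lookbehind fails, the other alternative still succeeds,
# so every ". X" in the text is reported as a match.
alternation_hits = [m.group() for m in alternation.finditer(text)]
print(alternation_hits)  # ['. F', '. F', '. D']
```

Neither Abs nor S is actually excluded here, which is why the alternation cannot express the rule.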

A pattern for the first part of your question could be

 (?<!Abs)(?<!S)(\. [A-Z]) 

or

 (?<!Abs)(?<!S)(\.\s+[A-Z]) 

(with my suggestion)

This works because it avoids the | : the expression now says that the period is not preceded by Abs and is not preceded by S. Only if both are true does the engine go on to match the rest of the pattern and find your match.
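As a rough check of the chained form, here is a minimal sketch that runs it over a made-up sentence (the sample text is an assumption for illustration, including the double space before "Done"):

```python
import re

# Two lookbehinds in a row: both must hold at the position of the period.
chained = re.compile(r"(?<!Abs)(?<!S)\.\s+[A-Z]")

# Hypothetical sample text with a double space before the last sentence.
text = "See Abs. Forth and S. Fifth.  Done"

# Only the period after "Fifth" qualifies; \s+ absorbs both spaces.
chained_hits = [m.group() for m in chained.finditer(text)]
print(chained_hits)  # ['.  D']
```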

To exclude month names, I came up with this regex:

 (?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z] 

The same reasoning holds for negative lookahead patterns.
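One caveat if this pattern is passed straight to re.split: the capturing group makes re.split return the separators too, and the trailing [A-Z] is consumed as part of the separator. The sketch below shows that behaviour on a made-up sample and one possible adjustment, turning the capital letter into a lookahead and dropping the group; the variable names are only illustrative.

```python
import re

# Hypothetical sample text.
text = "First. Second. January. Third"

# Pattern as written above: capturing group plus a consumed capital letter.
with_group = r"(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]"
split_consuming = re.split(with_group, text)
print(split_consuming)
# ['First', '. ', 'econd. January', '. ', 'hird']

# Adjusted variant: no capturing group, capital letter only looked at.
lookahead_only = r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])"
split_clean = re.split(lookahead_only, text)
print(split_clean)
# ['First', 'Second. January', 'Third']
```

The first pattern is fine for re.search or re.finditer; the second is the safer shape when the goal is splitting.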

+13

Use the nltk punkt tokenizer. This is probably more reliable than using a regular expression.

 >>> import nltk.data
 >>> text = """
 ... Punkt knows that the periods in Mr. Smith and Johann S. Bach
 ... do not mark sentence boundaries.  And sometimes sentences
 ... can start with non-capitalized words.  i is a good variable
 ... name.
 ... """
 >>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
 >>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
 Punkt knows that the periods in Mr. Smith and Johann S. Bach
 do not mark sentence boundaries.
 -----
 And sometimes sentences
 can start with non-capitalized words.
 -----
 i is a good variable
 name.
+1

Use nltk or similar tools as suggested by @root.

To answer the regular expression question:

 import re
 import sys

 # Split on ". " unless preceded by Abs or S, or followed by a month name.
 print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])", sys.stdin.read()))

Input

 First. Second. January. Third. Abs. Forth. S. Fifth. S. Sixth. ABs. Eighth 

Output

 ['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth', 'S. Sixth', 'ABs', 'Eighth'] 
+1

I am adding a short answer to the question in the title, since this page is at the top of the Google search results:

To use several negative lookbehinds of different lengths, chain them one after another, like this:

"(?<!1)(?<!12)(?<!123)example"

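The reason chaining is needed in Python's built-in re module is that a single lookbehind must have a fixed width, so alternatives of different lengths inside one (?<!...) are rejected at compile time. A minimal sketch, with made-up sample strings:

```python
import re

# One lookbehind with alternatives of different lengths: the built-in
# re module refuses to compile it.
try:
    re.compile(r"(?<!1|12|123)example")
    single_lookbehind_ok = True
except re.error as exc:
    single_lookbehind_ok = False
    print("re.error:", exc)

# Chained lookbehinds, each of fixed width, compile and behave as intended.
chained = re.compile(r"(?<!1)(?<!12)(?<!123)example")

# Hypothetical sample strings.
samples = ["example", "1example", "12example", "123example", "4example"]
allowed = [s for s in samples if chained.search(s)]
print(allowed)  # ['example', '4example']
```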
0

Source: https://habr.com/ru/post/926745/

