Multiple negative lookbehind assertions in Python regex?

Question

I'm new to programming, so sorry if this seems trivial. I have a text that I am trying to split into separate sentences using regular expressions. With the .split method I am looking for a period followed by a capital letter, for example

 "\. A-Z"

However, the period must not be preceded by Abs or S, and the capital letter must not be the start of a month name such as January, February or March. I tried to implement the first half, but even that did not work. My code is:

 "( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "

+8

python regex

Elip Oct 2 '12 at 11:00

4 answers

Use the NLTK Punkt tokenizer. This is ~~possibly~~ more reliable than using a regular expression.

 >>> import nltk.data
 >>> text = """
 ... Punkt knows that the periods in Mr. Smith and Johann S. Bach
 ... do not mark sentence boundaries. And sometimes sentences
 ... can start with non-capitalized words. i is a good variable
 ... name.
 ... """
 >>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
 >>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
 Punkt knows that the periods in Mr. Smith and Johann S. Bach
 do not mark sentence boundaries.
 -----
 And sometimes sentences
 can start with non-capitalized words.
 -----
 i is a good variable
 name.

+1

root Oct 2 '12 at 11:13

Use nltk or similar tools as suggested by @root.

To answer the regular expression question:

 import re
 import sys
 # Split on ". " unless preceded by Abs or S, or followed by a month name
 print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])", sys.stdin.read()))

Input

 First. Second. January. Third. Abs. Forth. S. Fifth. S. Sixth. ABs. Eighth

Output

 ['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth', 'S. Sixth', 'ABs', 'Eighth']

+1

jfs Oct 2 '12 at 11:41

I am adding a short answer to the question in the title, as it is at the top of the Google search results:

To apply several negative lookbehinds of different lengths, chain them one after another, like this:

"(?<!1)(?<!12)(?<!123)example"

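As a minimal sketch of why chaining is needed: Python's built-in `re` module only accepts fixed-width lookbehinds, so alternatives of different lengths cannot share one `(?<!...)`. The helper name `variable_width_is_rejected` below is made up for illustration.

```python
import re

# Chained lookbehinds are all checked at the same position, so "example"
# matches only if none of "1", "12", "123" ends right before it.
chained = re.compile(r"(?<!1)(?<!12)(?<!123)example")

def variable_width_is_rejected():
    """Return True if `re` refuses alternatives of different widths
    inside a single lookbehind (hypothetical helper for this demo)."""
    try:
        re.compile(r"(?<!1|12|123)example")
    except re.error:
        return True
    return False

print(chained.search("an example") is not None)   # no forbidden prefix
print(chained.search("12example") is None)        # blocked by (?<!12)
print(variable_width_is_rejected())
```

Each lookbehind in the chain has a fixed width of its own, which is why the chained form compiles while the single alternation does not.
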
0

Nathan wailes Jul 12 '19 at 8:13

hochl · Accepted Answer · 2012-10-02T11:16:14+0000

First, I think you should replace the space with \s+, or with \s if it really is exactly one space (you often find double spaces in English text).

Second, to match an uppercase letter you must use [A-Z]; A-Z on its own will not work (but remember that there may be uppercase letters other than A-Z ...).

Also, I think I know why this is not working. The regex engine will try to match \. [A-Z] if it is not preceded by Abs, or if it is not preceded by S. The problem is that if the period is preceded by S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way, one of the alternatives will match, since Abs and S are mutually exclusive.
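The effect can be checked with a small sketch. The pattern below is a cleaned-up version of the asker's alternation (balanced, with `[A-Z]` instead of `A-Z`), and the sample text is made up for illustration.

```python
import re

# Alternation of two negative lookbehinds: a period is accepted if EITHER
# branch succeeds. It would only be rejected if the text before it ended
# in both "Abs" and "S", which is impossible.
either = re.compile(r"(?<!Abs)\. [A-Z]|(?<!S)\. [A-Z]")

text = "See Abs. Forth and S. Fifth"
matches = [m.group() for m in either.finditer(text)]

# Both periods still match: the one after "Abs" via the second branch,
# the one after "S" via the first branch.
print(matches)  # ['. F', '. F']
```
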

The pattern for the first part of your question could be

 (?<!Abs)(?<!S)(\. [A-Z])

or

 (?<!Abs)(?<!S)(\.\s+[A-Z])

(with my suggestion)

This works because it avoids the | . Without it, the expression says "not preceded by Abs and not preceded by S". Only if both are true does the matcher go on to scan the rest of the pattern and find your match.

To exclude month names, I came up with this regex:

 (?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]

The same reasoning applies to negative lookahead patterns.
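One thing to watch when a pattern like this is passed to `re.split`, as the question intends: the trailing `[A-Z]` is consumed by the match, and the capture group around the separator is returned in the result. A rough sketch of the difference, using a made-up sample string and a variant that moves the capital letter into a lookahead:

```python
import re

text = "First. Second. January. Third"

# Pattern as written above: consumes the capital letter and captures the separator.
consuming = r"(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]"
# Variant: no capture group, and the capital letter is only looked at, not consumed.
lookahead = r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])"

split_consuming = re.split(consuming, text)
split_lookahead = re.split(lookahead, text)

print(split_consuming)  # ['First', '. ', 'econd. January', '. ', 'hird']
print(split_lookahead)  # ['First', 'Second. January', 'Third']
```

The first form is fine for finding sentence boundaries with `re.search` or `re.finditer`; for splitting, the lookahead variant keeps each sentence's first letter intact.
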
