Python regex: split on any \W+ with some exceptions

Question

It is easy to split text at non-alphanumeric characters using a regular expression:

tokens=re.split(r'(?u)\W+',text) # split at any non-alphanumeric Unicode character

and this answer shows how to split on specific characters. However, I need to:

split on any Unicode non-alphanumeric character
make the regular expression treat the following as exceptions:
- underscores "_"
- forward slashes "/"
- ampersands "&" and at signs "@"
- full stops surrounded by digits (\d+)
- full stops preceded by certain arbitrary strings such as "Mr.", "Dr.", etc.

I can easily detect any of these with a regex, but the question is how to make the regex treat them as exceptions to the non-alphanumeric splitting.
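
To illustrate the problem, here is a minimal Python 3 sketch of what the plain split does. The sample string is an assumption chosen for illustration, and in Python 3 the (?u) flag is redundant for str patterns but harmless:

```python
import re

# Illustrative sample (an assumption, not the full text from the question).
sample = "Mr. Jones jones@gmail.com 12.455 more/fun"

# Plain \W+ splits at every run of non-word characters, including
# '@', '/', and the full stops inside numbers and after "Mr".
tokens = re.split(r'(?u)\W+', sample)
print(tokens)
# ['Mr', 'Jones', 'jones', 'gmail', 'com', '12', '455', 'more', 'fun']
```

The email address, the decimal number, and "more/fun" are all broken apart, which is what the exceptions above are meant to prevent.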

EDIT: Here is an example of the text I'm trying to match:

 text="Mr. Jones email jones@gmail.com 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. وبعدها رجعنا إلى المنزل، وقابلنا أصدقاءنا؛ وشربنا الشاي."

and here is its Unicode version (note the non-alphanumeric Arabic characters u'\u060c' and u'\u061b'):

 unicode_text=u'Mr. Jones email jones@gmail.com 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. \u0648\u0628\u0639\u062f\u0647\u0627 \u0631\u062c\u0639\u0646\u0627 \u0625\u0644\u0649 \u0627\u0644\u0645\u0646\u0632\u0644\u060c \u0648\u0642\u0627\u0628\u0644\u0646\u0627 \u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627\u061b \u0648\u0634\u0631\u0628\u0646\u0627 \u0627\u0644\u0634\u0627\u064a.'

Here is the result of the regular expression in the provided answer:

 re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+',unicode_text)

[u'Mr.', u'Jones', u'email', u'jones@gmail.com', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man+right', u'more/fun', u'43.35.', u'And', u'so', u'we', u'stopped.', u'And', u'then', u'we', u'started', u'again.', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a.']

Please note that the regular expression did not split at the full stops at the end of words, so it would be nice to have something that solves this problem as well.

python string split regex unicode

hmghaly Oct 18 '13 at 20:34

2 answers

Kyle hannon · Answer 1 · 2013-10-18T22:03:20+0000

The key is to use a negative lookahead. I think this covers all the examples from your list, but let me know if I missed something.

 In [549]: re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+', "Mr.Jones says This is@a&test example_cool man+right more/fun 43.35")
 Out[549]: ['Mr.Jones', 'says', 'This', 'is@a&test', 'example_cool', 'man+right', 'more/fun', '43.35']

A split will not start at anything that matches the group inside (?!...). Let me know if I understood the question correctly.
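
One caveat worth spelling out: the square brackets inside the lookahead form a character class, so the pattern blocks a split from starting at any single listed character ('+', '&', '/', '@', a digit, '.', 'M', 'r'). It does not match sequences such as "Mr." or \d+\.\d+ as units. A small sketch of the consequence follows; the sample strings are assumptions for illustration:

```python
import re

pattern = re.compile(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+')

# The class behaves like [+&/@\d.Mr]: each character is checked on its own.
# A full stop therefore never starts a split, whatever precedes it.
sentence_tokens = pattern.split("Dr. Who stopped. Then left")
print(sentence_tokens)
# ['Dr.', 'Who', 'stopped.', 'Then', 'left']

# A comma is not in the class, so it still splits, even between digits.
number_tokens = pattern.split("12,254.25")
print(number_tokens)
# ['12', '254.25']
```

This is why "Dr." survives even though only "Mr" appears in the pattern, and also why sentence-final full stops stay attached to the preceding word.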

Armali · Answer 2 · 2014-09-30T09:24:03+0000

I don't think you want to split email addresses such as jones@gmail.com into jones@gmail and com, so I changed your exception for full stops surrounded by digits into an exception for full stops followed by an alphanumeric character.

 re.split(r'(?u)(?![_/&@.])\W+|(?<!Mr|Dr)\.(?!\w)\W*', unicode_text)

[u'Mr.', u'Jones', u'email', u'jones@gmail.com', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man', u'right', u'more/fun', u'43.35', u'And', u'so', u'we', u'stopped', u'And', u'then', u'we', u'started', u'again', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a', u'']
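
If more abbreviations are needed, the same idea can be wrapped in a helper. The following is only a sketch for Python 3: the names build_splitter and tokenize, and the extra "Prof" abbreviation, are hypothetical and not part of the pattern above. Python's re module requires fixed-width lookbehinds, so abbreviations of different lengths cannot share one (?<!Mr|Prof) group; chaining one lookbehind per abbreviation avoids that limit.

```python
import re

def build_splitter(abbreviations=("Mr", "Dr", "Prof")):
    # Hypothetical helper: one negative lookbehind per abbreviation,
    # because re rejects alternations of different widths in (?<!...).
    lookbehinds = "".join("(?<!%s)" % re.escape(a) for a in abbreviations)
    return re.compile(r"(?![_/&@.])\W+|" + lookbehinds + r"\.(?!\w)\W*")

def tokenize(text, splitter=None):
    splitter = splitter or build_splitter()
    # Drop the empty string left behind when the text ends with a separator.
    return [token for token in splitter.split(text) if token]

result = tokenize("Prof. Smith paid 43.35. Then he left.")
print(result)
# ['Prof.', 'Smith', 'paid', '43.35', 'Then', 'he', 'left']
```

Filtering out empty tokens also removes the trailing u'' seen in the output above.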
