Regex add character to string

I have a long line, which is a paragraph, however after periods there is no space. For example:

para = "I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women.It is in black and white but saves the colour for one shocking shot.At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene.Avoid." 

I am trying to use re.sub to solve this problem, but the result is not the one I expected.

This is what I did:

 re.sub("(?<=\.).", " \1", para) 

I match the first char of each sentence, and I want to put a space before it. My matching pattern (?<=\.). , which (presumably) checks for any character that appears after the period. I found out from other stackoverflow questions that \ 1 matches the last matching pattern, so I wrote a replacement pattern as \1 , a space followed by a previously matched string.

Here is the result:

 "I saw this film about 20 years ago and remember it as being particularly nasty. \x01I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women. \x01t is in black and white but saves the colour for one shocking shot. \x01t the end the film seems to be trying to make some political statement but it just comes across as confused and obscene. \x01void. \x01 

Instead of matching any character preceding the period and adding a space before it, re.sub replaced the matched character with \x01 . What for? How to add a character before a matching string?

+8
python regex nlp
source share
5 answers

(?<=a)b is a positive lookbehind . It matches b after a . a not recorded. So in your expression, I'm not sure what the value \1 represents in this case, but it's not what is inside (?<=...) .

Your current approach has another drawback: it will add a space after . even if it already exists.

To add the missing space after . , I propose another strategy: replace . -next-non-space-non-dot . and space:

 re.sub(r'\.(?=[^ .])', '. ', para) 
+8
source share

Perhaps you can use the following regex (with a positive outlook and a negative forecast ahead ):

 (?<=\.)(?!\s) 

python

 re.sub(r"(?<=\.)(?!\s)", " ", para) 

see demo

+2
source share

A slightly modified version of your regex will also work:

 print re.sub(r"([\.])([^\s])", r"\1 \2", para) # I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses' home and rapes, tortures and kills various women. It is in black and white but saves the colour for one shocking shot. At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene. Avoid. 
+2
source share

I think this is what you want to do. You can pass a function to perform a replacement.

 import re def my_replace(match): return " " + match.group() my_string = "dhd.hd hd hs fjs.hello" print(re.sub(r'(?<=\.).', my_replace, my_string)) 

Print

 dhd. hd hd hs fjs. hello 

As @ Seanny123 noted, this will add a space, even if there has already been a space after the period.

+1
source share

The simplest regular expression replacement you can use is the following:

 re.sub(r'\.(?=\w)', '. ', para) 

It simply matches each period and uses lookahead, (?=\w) to make sure that there is the next character of the word, not a space after the period, and replaces it .

0
source share

All Articles