Regular expression for anonymizing email addresses

I use the following regular expression in R

    regexp <- "(^|[^([:alnum:]|.|_)])abc@abc.de($|[^[:alnum:]])"

to find the email address abc@abc.de in a given text and replace it with an anonymous email address.

    tmp <- c("aaaaabc@abc.debbbb",      ## <- should not be matched
             "aaaa abc@abc.de bbbb",    ## <- should be matched
             "abc@abc.de",              ## <- should be matched
             "aaa.abc@abc.de",          ## <- should not be matched
             "aaaa_abc@abc.de",         ## <- should not be matched
             "(abc@abc.de)",            ## <- should be matched
             "aaaa (abc@abc.de) bbbb")  ## <- should be matched
    replacement <- paste("\\1", "anonym@anonym.de", "\\2", sep="")
    gsub(regexp, replacement, tmp, ignore.case=TRUE)

As a result I get

    > gsub(regexp, replacement, tmp, ignore.case=TRUE)
    [1] "aaaaabc@abc.debbbb"         "aaaa anonym@anonym.de bbbb"
    [3] "anonym@anonym.de"           "aaa.abc@abc.de"
    [5] "aaaa_abc@abc.de"            "(abc@abc.de)"
    [7] "aaaa (abc.abc.de) bbbb"

I do not know why the last two elements of the vector are not matched.

Thanks and best regards.

1 answer

How about this?

    gsub("^(abc@abc)|(?<=[ (])(abc@abc)", "anonym@anonym", tmp, perl=TRUE)

The alternative before the | , ^(abc@abc) , matches abc@abc at the very beginning of the string.

The alternative after the | uses a positive lookbehind: it matches abc@abc only when it is preceded by a space or a ( , and replaces it with anonym@anonym . The preceding character is only checked, not consumed, so it stays in the result.

This is what I get (note: I replaced abc.abc with abc@abc in the last element):

    [1] "aaaaabc@abc.debbbb"           "aaaa anonym@anonym.de bbbb"
    [3] "anonym@anonym.de"             "aaa.abc@abc.de"
    [5] "aaaa_abc@abc.de"              "(anonym@anonym.de)"
    [7] "aaaa (anonym@anonym.de) bbbb"

Edit: To explain the problem with your regex, the culprit seems to be this part:

    [^([:alnum:]|.|_)]

Inside a bracket expression, ( , | and ) are ordinary characters, not grouping or alternation. So this class excludes not only alphanumerics, . and _ , but also ( , | and ) , which is why an address preceded by ( is never matched. In addition, outside a bracket expression you should write [.] instead of . (for example abc@abc[.]de ), since a bare . matches any character. We can condense the class by removing all the unnecessary ( , | and ) characters:

    [^.[:alpha:]_]   # not a . or _ or any letter (use [:alnum:] to exclude digits too)

    # using gsub on it:
    gsub("(^|[^.[:alpha:]_])abc@abc", "anonym@anonym", tmp)
    # [1] "aaaaabc@abc.debbbb"          "aaaaanonym@anonym.de bbbb"
    # [3] "anonym@anonym.de"            "aaa.abc@abc.de"
    # [5] "aaaa_abc@abc.de"             "anonym@anonym.de)"
    # [7] "aaaa anonym@anonym.de) bbbb"
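As a quick check of how the two character classes differ, the sketch below tests each class against a handful of single characters. The probe vector and the variable names are only an illustration; the point is that ( , | and ) inside a bracket expression are literal members of the set.

```r
# Single characters to test against each negated class (illustrative choice).
probe <- c(" ", "(", ")", "|", ".", "_", "a", "1")

# Class from the question: '(', '|' and ')' are literal members of the set,
# so they are excluded together with alphanumerics, '.' and '_'.
original_class <- "[^([:alnum:]|.|_)]"

# Condensed class: only '.', '_' and letters are excluded.
condensed_class <- "[^.[:alpha:]_]"

in_original  <- grepl(original_class, probe)
in_condensed <- grepl(condensed_class, probe)

print(data.frame(probe, in_original, in_condensed))
```

Only the space passes the original class, whereas the condensed class also accepts ( , ) , | and, because it uses [:alpha:] rather than [:alnum:], the digit.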

The gsub call with the condensed class replaces every abc@abc , but the character before abc@abc is lost each time, because it is part of the match. To keep it, you have to use a capture group: if you wrap part of the regular expression in () , you can refer to this "capture" in the replacement using special variables such as \\1 , \\2 , etc. Here we wrote (^|[^.[:alpha:]_]) , i.e. the part before abc@abc . Since this is the first capture group, we refer to it as \\1 and use it to restore the missing character from the previous result:

    gsub("(^|[^.[:alpha:]_])abc@abc", "\\1anonym@anonym", tmp)
    # [1] "aaaaabc@abc.debbbb"           "aaaa anonym@anonym.de bbbb"
    # [3] "anonym@anonym.de"             "aaa.abc@abc.de"
    # [5] "aaaa_abc@abc.de"              "(anonym@anonym.de)"
    # [7] "aaaa (anonym@anonym.de) bbbb"

This is the result you need, and it is the same as my original answer using a positive lookbehind. In that version, since the lookbehind only checks what precedes the match, you do not need to capture anything; only abc@abc is replaced. Hope this helps.
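Putting the pieces together, here is a minimal sketch of a reusable helper. The function name anonymize_email and its arguments are hypothetical, not part of base R. It assumes the rule from the question: the address is replaced only when it is not preceded by a letter, digit, . or _ and not followed by a letter or digit. It uses lookarounds instead of capture groups, and PCRE's \Q...\E quoting so that the . in the address is matched literally (this assumes the address itself does not contain \E ).

```r
# Hypothetical helper (not part of base R): replace `address` in `x` only when
# it stands alone. Lookarounds check the neighbours without consuming them.
anonymize_email <- function(x, address, replacement = "anonym@anonym.de") {
  pattern <- paste0(
    "(?<![[:alnum:]._])",   # not preceded by a letter, digit, '.' or '_'
    "\\Q", address, "\\E",  # the address, quoted so '.' is taken literally
    "(?![[:alnum:]])"       # not followed by a letter or digit
  )
  gsub(pattern, replacement, x, perl = TRUE, ignore.case = TRUE)
}

tmp <- c("aaaaabc@abc.debbbb", "aaaa abc@abc.de bbbb", "abc@abc.de",
         "aaa.abc@abc.de", "aaaa_abc@abc.de", "(abc@abc.de)",
         "aaaa (abc@abc.de) bbbb")

result <- anonymize_email(tmp, "abc@abc.de")
print(result)
```

With this test vector, elements 2, 3, 6 and 7 are anonymized and the others are left unchanged, with the surrounding spaces and parentheses preserved.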
