Removing Hebrew "niqqud" using R

I have been trying to remove niqqud (diacritical marks used to indicate vowels or to distinguish between alternative pronunciations of Hebrew letters). I have, for example, this variable: sample1 <- "ֻסְֻסְֻסְֻסְ"

and I cannot find an effective way to remove the marks under the letters.

I tried stringr, with str_replace_all(sample1, "[^[:alnum:]]", ""), and I tried gsub('[:punct:]','',sample1)

no success ... :-( any ideas?

1 answer

You can match diacritics with the \p{M} Unicode category in a Perl-compatible regular expression (perl = TRUE) and remove them all with a single gsub call:

 sample1 <- "הֻסְמַק"
 gsub("\\p{M}", "", sample1, perl = TRUE)

Result: [1] "הסמק"
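If the call is needed in more than one place, it can be wrapped in a small function. The following is only a sketch: the name remove_niqqud is made up for illustration, and the answer's sample word is spelled with \u escapes so that the script does not depend on the encoding of the source file.

```r
# Hypothetical helper (the name is an assumption, not part of any package).
remove_niqqud <- function(x) {
  # \p{M} matches any Unicode combining mark; perl = TRUE selects the PCRE
  # engine, which understands Unicode property classes.
  gsub("\\p{M}", "", x, perl = TRUE)
}

# The answer's sample word: he, qubuts, samekh, sheva, mem, patah, qof.
pointed <- "\u05d4\u05bb\u05e1\u05b0\u05de\u05b7\u05e7"
plain <- remove_niqqud(pointed)

nchar(pointed)  # counts 4 letters plus 3 combining marks
nchar(plain)    # counts the 4 letters only
identical(plain, "\u05d4\u05e1\u05de\u05e7")
```

Because gsub is vectorised, the same function also works on a character vector or a data frame column.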


\p{M} or \p{Mark}: a character intended to be combined with another character (for example, accents, umlauts, enclosing boxes, etc.).
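Since \p{M} covers every combining mark, it also strips Hebrew cantillation marks (U+0591 to U+05AF), not just the vowel points. If only the points should go, an explicit code point range can be used instead. The sketch below assumes that U+05B0 to U+05BC (sheva through dagesh) is the range of interest; other points such as the shin and sin dots (U+05C1, U+05C2) are left out and would need to be added if they matter.

```r
# Bet + sheva (vowel point) + etnahta (cantillation mark), as \u escapes.
x <- "\u05d1\u05b0\u0591"

# \p{M} removes both marks and leaves the bare letter.
all_marks_removed <- gsub("\\p{M}", "", x, perl = TRUE)

# An explicit range (an assumption about which points to target) removes
# only the sheva and keeps the cantillation mark.
points_only_removed <- gsub("[\\x{05B0}-\\x{05BC}]", "", x, perl = TRUE)

identical(all_marks_removed, "\u05d1")
identical(points_only_removed, "\u05d1\u0591")
```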

More details at Regular-Expressions.info, "Unicode Categories".
