Removing Hebrew "niqqud" Using R

Question


I have been trying to remove niqqud (diacritics used to indicate vowels or to distinguish between alternative pronunciations of Hebrew letters). I have, for example, this variable: sample1 <- "ֻסְֻסְֻסְֻסְ"

I cannot find an effective way to remove the marks under the letters.

I tried stringr: str_replace_all(sample1, "[^[:alnum:]]", ""). I also tried gsub('[:punct:]','',sample1).

no success ... :-( any ideas?

+5

regex text r unicode hebrew

Dmitry Leykin Sep 17 '15 at 18:35


1 answer

Wiktor Stribiżew · Accepted Answer · 2015-09-17T19:50:57+0000

You can use the \p{M} Unicode category to match diacritics with a Perl-like regular expression, and remove them all in one go with gsub:

 sample1 <- "הֻסְמַק"
 gsub("\\p{M}", "", sample1, perl = TRUE)

Result: [1] "הסמק"
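For reuse, the same call can be wrapped in a small helper. This is only a sketch: the function name strip_niqqud is made up here, and the sample word is written with \u escapes (a he, samekh, mem, qof sequence carrying three vowel points) so that the snippet does not depend on the encoding of the script file.

```r
# Hypothetical helper: drop every Unicode combining mark from a character vector.
strip_niqqud <- function(x) {
  # perl = TRUE is needed because \p{M} is PCRE syntax, not POSIX.
  gsub("\\p{M}", "", x, perl = TRUE)
}

# Letters U+05D4, U+05E1, U+05DE, U+05E7 with points U+05BB, U+05B0, U+05B7.
pointed <- "\u05d4\u05bb\u05e1\u05b0\u05de\u05b7\u05e7"
plain   <- strip_niqqud(pointed)

nchar(pointed)  # 7 code points: 4 letters + 3 marks
nchar(plain)    # 4 code points: letters only
```

gsub is vectorised, so the helper also works on a whole column of strings.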


\p{M} or \p{Mark}: a character intended to be combined with another character (for example, accents, umlauts, enclosing boxes, etc.).
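To see what \p{M} actually matches, you can pull the matched characters out instead of deleting them and print their code points. A rough sketch in base R, using an escaped sample word chosen for illustration (the variable names are mine):

```r
# Sample word written with escapes: 4 Hebrew letters and 3 vowel points.
word <- "\u05d4\u05bb\u05e1\u05b0\u05de\u05b7\u05e7"

# Extract every character in the Mark category rather than removing it.
marks <- regmatches(word, gregexpr("\\p{M}", word, perl = TRUE))[[1]]

# Convert each mark to its code point and format it as U+XXXX.
mark_codes <- sprintf("U+%04X",
                      vapply(marks, utf8ToInt, integer(1), USE.NAMES = FALSE))
print(mark_codes)  # "U+05BB" "U+05B0" "U+05B7"
```

Keep in mind that \p{M} is not specific to Hebrew: it also removes cantillation marks and combining accents from any other script in the same string.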

See "Unicode Categories" on Regular-Expressions.info for more details.
