Character handling with diacritics in R

Question

I am trying to count the characters in strings that contain diacritics, but I cannot get the correct result.

> x <- "n̥ala"
> nchar(x)
[1] 5

What I want to get is 4, because n̥ should be considered a single symbol (i.e. diacritics should not be counted as symbols on their own, even when more than one diacritic is stacked on the base symbol).

How can I get such a result?

+4

r unicode character-encoding nlp linguistics

Stefano May 30 '15 at 19:14

3 answers

Using the qdap package:

x <- "n̥ala"

library(qdap)
character_count(x)
## [1] 4

+1

Tyler Rinker 31 '15 3:18

One approach is a small function:

dia.count <- function(string) {
  y <- unlist(strsplit(string, ''))
  length(grep('[A-Za-z0-9]', y, value = TRUE))
}
dia.count(x)
[1] 4

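This splits the string into individual code points and counts only those that are ASCII letters or digits, so combining diacritics are not counted.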
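To see why dia.count returns 4 where nchar returns 5, it can help to look at the code points in the string. The following is a minimal base R sketch, assuming the diacritic in the example is encoded as the separate combining character U+0325 (COMBINING RING BELOW); the variable names are illustrative only.

```r
x <- "n\u0325ala"                 # "n" + U+0325 COMBINING RING BELOW + "ala"

chars <- unlist(strsplit(x, ""))  # one element per code point
code_points <- sprintf("U+%04X", utf8ToInt(x))
code_points
# "U+006E" "U+0325" "U+0061" "U+006C" "U+0061"

keep <- grepl("[A-Za-z0-9]", chars)  # TRUE only for ASCII letters and digits
sum(keep)
# 4
```

Note that the same filter would also discard a non-ASCII base letter such as ʊ, so this works best when the base letters are plain ASCII.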

Update

As a one-liner:

nchar(gsub('[^A-Za-z]+', '', x))
[1] 4

dia.count is the general function; the one-line script is faster. Credit to @akrun.

With the stringi package, stri_enc_toascii gives:

library(stringi)
stri_enc_toascii(x)
[1] "n\032ala"

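Each non-ASCII character is replaced by the substitute character \032, which can then be removed before counting: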

nchar(gsub('[^A-Za-z]', '', stri_enc_toascii(x)))
[1] 4

A good balance between the general answer and the fast script is in the comments:

nchar(iconv("n̥ala", to="ASCII", sub=""))
[1] 4

This uses the base R function iconv, which converts the string for you. Credit to @Molx.
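The ASCII-based patterns above drop every non-ASCII character, which also removes base letters such as ʊ. A possible variant, not taken from the original answers, is to strip only Unicode combining marks with the \p{M} property class, which base R supports when perl = TRUE. The helper name count_base_chars is hypothetical, and the sketch assumes UTF-8 input.

```r
# Hypothetical helper: count characters after removing combining marks.
count_base_chars <- function(s) {
  # \p{M} matches any Unicode mark (combining diacritics);
  # perl = TRUE is needed for Unicode property classes
  nchar(gsub("\\p{M}", "", s, perl = TRUE))
}

count_base_chars("n\u0325ala")            # "n" + combining ring below + "ala"
# 4
count_base_chars("e\u032f \u028a\u032f")  # two base letters plus a space
# 3
```

The second call returns 3 rather than 2 because the space is still counted; remove whitespace first if only letters should be counted.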

0

Pierre Lafortune May 30 '15 at 19:41

SabDeM · Accepted Answer · 2015-05-30T20:09:54+0000

Here is my solution. The idea is that phonetic alphabets have a Unicode representation, so:

Use the Unicode package; it provides a function, Unicode_alphabetic_tokenizer, whose documentation says:

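Tokenization first replaces the elements of x by their Unicode character sequences. Then, non-alphabetic characters (i.e., ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.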

Then I apply nchar; because the tokenizer splits the string into several tokens, I wrap the result in sum.

library(Unicode)
sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 4

Another example:

> x <- "e̯ ʊ̯"
> x
[1] "e̯ ʊ̯"
> nchar(x)
[1] 5
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 2
