Character handling with diacritics in R

I am trying to count the characters in strings that contain diacritics, but I cannot get the correct result.

> x <- "n̥ala"
> nchar(x)
[1] 5

What I want to get is 4, because n̥ should be considered a single symbol (i.e. diacritics should not be counted as symbols on their own, even when more than one diacritic is laid on the base symbol).

How can I get such a result?

+4
3 answers

Here is my solution. The idea is that phonetic alphabets can have a unicode representation, and then:

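Use the Unicode package. It provides a function, Unicode_alphabetic_tokenizer, that:

Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.

After that I apply nchar, and because the tokenizer splits the string into several pieces, I wrap the result in sum.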

sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 4
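
The tokenizer call above can be approximated in base R, which may help to see what is being counted: replace every non-letter with a blank, split on blanks, and add up the lengths of the pieces. This is only a sketch. The helper name count_letters is hypothetical, and it uses the PCRE letter class \p{L} as a stand-in for the Unicode Alphabetic property, so results may differ from the package for some characters.

```r
# Hypothetical helper (not part of the Unicode package): approximate the
# tokenizer by treating every non-letter code point as a separator.
count_letters <- function(string) {
  # replace anything outside Unicode category L (letters) with a blank
  blanked <- gsub("[^\\p{L}]", " ", string, perl = TRUE)
  # split on runs of blanks and add up the lengths of the pieces
  pieces <- unlist(strsplit(blanked, " +"))
  sum(nchar(pieces))
}

count_letters("n\u0325ala")             # 4
count_letters("e\u032f \u028a\u032f")   # 2
```

The strings are written with \u escapes so that the example does not depend on the encoding of the script file.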

Another example:

> x <- "e̯ ʊ̯"
> x
[1] "e̯ ʊ̯"
> nchar(x)
[1] 5
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 2


+2

Alternatively, with the qdap package:

x <- "n̥ala"

library(qdap)
character_count(x)
## [1] 4
+1

Here is a small function that does it:

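dia.count <- function(string) {
  # split the string into individual characters
  y <- unlist(strsplit(string, ''))
  # keep only ASCII letters and digits, then count them
  length(grep('[A-Za-z0-9]', y, value = TRUE))
}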
dia.count(x)
[1] 4
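
Because dia.count keeps only ASCII letters and digits, a non-ASCII base symbol such as ʊ would not be counted. A possible variant, sketched here under the assumption that every diacritic is encoded as a separate combining mark, drops the marks instead of whitelisting letters. The name count_base_chars is hypothetical.

```r
# Hypothetical variant of dia.count: count every code point that is not a
# combining mark (Unicode category M), instead of whitelisting ASCII.
count_base_chars <- function(string) {
  # split the string into individual code points
  chars <- unlist(strsplit(string, ""))
  # TRUE for combining marks; count everything else
  sum(!grepl("\\p{M}", chars, perl = TRUE))
}

count_base_chars("n\u0325ala")   # 4
```

Unlike dia.count, this also counts spaces and punctuation, so it answers a slightly different question.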


Update

Alternatively, as a one-line script:

nchar(sub('[^A-Za-z]+', '', x))
[1] 4

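dia.count is the more general of the two. The script is faster, but sub() removes only the first run of non-letters, so it suits strings like this one. Credit: @akrun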

There is also the stringi package, whose stri_enc_toascii function gives:

stri_enc_toascii(x)
[1] "n\032ala"

The substitute character can then be stripped with the same kind of regular expression:

nchar(sub('[^A-Za-z]', '', stri_enc_toascii(x)))
[1] 4

A good balance between the general answer and the fast script is in the comments:

nchar(iconv("n̥ala", to="ASCII", sub=""))
[1] 4

This uses the base R function iconv, which converts the string for you. Credit: @Molx

0