When does a locale affect R regular expressions?

R has several special locale-related character classes for regular expressions.

From ?regex:

'[[:alnum:]]' means '[0-9A-Za-z]', except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set.

I would like to know when locale-related problems can occur.

I tried two examples based on the information on the help page ?Comparison, which describes the sort order of strings:

in Estonian, 'Z' comes between 'S' and 'T'

and

in Danish, 'aa' sorts as a single letter, after 'z'

In the first example, I expected T, U, V, W, X, and Y not to match. In the second example, I expected "aa" not to match.

Sys.setlocale("LC_ALL", "Estonian")
grepl("[A-Z]", LETTERS)

Sys.setlocale("LC_ALL", "Danish")
grepl("[a-z]", "aa")  

Both calls return TRUE in every case, which is not what I expected.
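A rough way to probe this result without switching locales is sketched below. It rests on an assumption that the help pages quoted here do not state: that R's default regex engine compares the endpoints of a range such as [A-Z] by code point rather than by collation order. The variable names (endpoints, letters_tu, in_range, range_hits, aa_hit) are purely illustrative.

```r
# Code points of the range endpoints and of the letters from the Estonian example
endpoints  <- utf8ToInt("AZ")      # code points of "A" and "Z"
letters_tu <- utf8ToInt("TUVWXY")  # the letters expected NOT to match

# Assumption under test: a range matches by code point, not by collation order.
# If so, T..Y sit between A and Z whatever the locale's sort order says.
in_range <- letters_tu >= endpoints[1] & letters_tu <= endpoints[2]
print(in_range)

# The same check through the regex engine, under whatever locale is active
print(Sys.getlocale("LC_COLLATE"))
range_hits <- grepl("[A-Z]", LETTERS)
print(all(range_hits))

# grepl() looks for a match anywhere in the string, so "aa" matches [a-z]
# as soon as a single "a" does, whether or not "aa" collates as one letter.
aa_hit <- grepl("[a-z]", "aa")
print(aa_hit)
```

If the assumption holds, neither example can fail because of collation: the first depends only on code points, and the second only needs one "a" to match.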

So when does the locale affect [a-z]?

More generally: when should I use a hyphenated range such as [a-zA-Z] vs. [[:alpha:]]?


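The hyphenated range matches only unaccented Latin letters, while the character class matches accented letters as well.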

grepl("[a-zA-Z]", c("å", "é"))
## [1] FALSE FALSE
grepl("[[:alpha:]]", c("å", "é"))
## [1]  TRUE  TRUE

The same comparison with letters from non-Latin alphabets (Greek, Cyrillic, Arabic):

mu <- "\U03BC"
ya <- "\U044F"
jeem <- "\U062C"
grepl("[a-zA-Z]+", c(mu, ya, jeem))
## [1] FALSE FALSE FALSE
grepl("[[:alpha:]]", c(mu, ya, jeem))
## [1] FALSE FALSE FALSE
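Because the result for [[:alpha:]] may vary with the platform and the active locale, it can help to tabulate both patterns side by side on your own machine. The helper below, compare_patterns, is a hypothetical convenience function written for this illustration, not part of R; it only wraps grepl(). No expected output is shown for the [[:alpha:]] column, since that is exactly the part that may differ between setups.

```r
# Hypothetical helper: build a logical matrix with one row per string and
# one column per pattern, TRUE where grepl() finds a match.
compare_patterns <- function(x, patterns, ...) {
  hits <- vapply(patterns, function(p) grepl(p, x, ...), logical(length(x)))
  # vapply() returns a plain vector when length(x) == 1, so force a matrix
  matrix(hits, nrow = length(x), dimnames = list(x, patterns))
}

# Unaccented, accented, Greek, Cyrillic and Arabic letters
chars <- c("a", "Z", "\u00e5", "\u00e9", "\u03bc", "\u044f", "\u062c")

res <- compare_patterns(chars, c("[a-zA-Z]", "[[:alpha:]]"))
print(Sys.getlocale("LC_CTYPE"))  # record the locale the results apply to
print(res)
```

Extra arguments are passed through to grepl(), so the same table can be rebuilt with, for example, perl = TRUE to compare engines.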