Why does as.factor() for Unicode strings return different results on each operating system?

Question

Why does this code: as.factor(c("\U201C", '"3', "1", "2", "\U00B5")) return a different order of factor levels on each operating system?

On Linux:

 > as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
 [1] “ "3 1 2 µ
 Levels: µ “ 1 2 "3

On Windows:

 > as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
 [1] “ "3 1 2 µ
 Levels: "3 “ µ 1 2

On Mac OS:

 > as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
 [1] “ "3 1 2 µ
 Levels: "3 “ 1 2 µ

Some students had an R Markdown assignment containing as.numeric(as.factor(dat$var)). Granted, that is not a very good coding practice, but the inconsistent outputs led to a lot of confusion and wasted time.
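To see why the level order matters for as.numeric(as.factor(...)), here is a minimal sketch. The vector x and the two explicit level orders are made up for illustration; they stand in for the orders that different locales might produce.

```r
# Illustrative data only: the same values with two different level orders.
x <- c("b", "a", "c")

f1 <- factor(x, levels = c("a", "b", "c"))
f2 <- factor(x, levels = c("c", "b", "a"))

# as.numeric() on a factor returns the position of each value in levels(),
# so the same data yields different integer codes.
codes1 <- as.numeric(f1)  # 2 1 3
codes2 <- as.numeric(f2)  # 2 3 1
```

The values are identical in both factors; only the codes differ, which is what made the students' results depend on the operating system.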

r unicode

MilesMcBain Sep 06 '16 at 1:42

3 answers

Jonathan Carroll · Answer 1 · 2016-09-06T02:07:53+0000

This is not specific to Unicode, nor to R: sorting in general (as with the *nix sort command) can be locale-specific. Removing the difference requires setting LC_COLLATE (presumably to "C") via Sys.setlocale (as per @alistaire's comment) on all machines.

For me on Windows (7):

 sort(c("Abc", "abc", "_abc", "ABC"))
 [1] "_abc" "abc" "Abc" "ABC"

whereas on Linux (Ubuntu 12.04 ... wow, I need to update this machine) I get

 sort(c("Abc", "abc", "_abc", "ABC"))
 [1] "abc" "_abc" "Abc" "ABC"

Setting the locale as described above with

 Sys.setlocale("LC_COLLATE", "C")

gives

 sort(c("Abc", "abc", "_abc", "ABC"))
 [1] "ABC" "Abc" "_abc" "abc"

identically on both machines.
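A sketch of two ways to get that C-locale order without leaving the session's locale changed. The first restores LC_COLLATE afterwards; the second relies on sort(method = "radix"), which is documented as always using C-locale collation for character vectors and assumes R 3.3.0 or later. Variable names are illustrative.

```r
x <- c("Abc", "abc", "_abc", "ABC")

# Option 1: switch collation temporarily, then restore the previous setting.
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
sorted_c <- sort(x)
invisible(Sys.setlocale("LC_COLLATE", old_collate))

# Option 2 (assumes R >= 3.3.0): radix sorting ignores the session locale.
sorted_radix <- sort(x, method = "radix")
```

Both results should be "ABC" "Abc" "_abc" "abc" for this ASCII-only vector, regardless of platform.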

The *nix man page for sort gives a bold warning:

  *** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

Update: It looks like I can reproduce the problem when I include Unicode characters. The problem traces back to sort; try sorting the vector from your example. I can't seem to change the locale (LC_COLLATE or LC_CTYPE) to "en_AU.UTF-8", which would be a potential solution.
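Whether a given locale name can be set depends on what the operating system has installed. Sys.setlocale returns an empty string (with a warning) when the request cannot be honoured, so a small check along these lines may help when testing on several machines. try_collate is a hypothetical helper written for this sketch, not part of base R.

```r
# Hypothetical helper: report whether LC_COLLATE can be set to `locale`,
# restoring the previous collation setting on exit.
try_collate <- function(locale) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old))
  res <- suppressWarnings(Sys.setlocale("LC_COLLATE", locale))
  !identical(res, "")
}

c_available  <- try_collate("C")            # "C" should always be available
au_available <- try_collate("en_AU.UTF-8")  # depends on the platform
```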

42- · Answer 2 · 2016-09-06T02:28:26+0000

Constructing a factor involves conversion to character values, and those must be encoded in some encoding or another. The default is OS-specific. The lexical sort order follows the locale.
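As a rough illustration of the encoding side, the sketch below converts the question's strings to UTF-8 and looks at the first Unicode code point of each. The code points are the same on every platform; only the collation rules applied on top of them differ. Ordering by first code point is used here purely for illustration and is not how any locale actually collates.

```r
x <- enc2utf8(c("\U201C", '"3', "1", "2", "\U00B5"))

# First Unicode code point of each string (platform-independent).
first_cp <- vapply(x, function(s) utf8ToInt(s)[1], integer(1),
                   USE.NAMES = FALSE)

# Ordering by code point puts the ASCII characters first,
# then the micro sign (U+00B5), then the curly quote (U+201C).
by_code_point <- x[order(first_cp)]
```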

To some extent, @Roland's earlier answer to a related question addresses the locale issue, but not the encoding issue: Is the default ("automatic") ordering of factors part of the R specification? Alphabetical? The same on all platforms?

jav · Answer 3 · 2016-09-06T05:39:04+0000

I tried changing the locale settings but could not solve this problem. However, given that we can trace the problem to the sort function, one possible alternative is to redefine the factor and as.factor functions without the call to sort.

 as.factor2 <- function(x) {
   if (is.factor(x)) {
     x
   } else if (!is.object(x) && is.integer(x)) {
     levels <- unique.default(x)  # Removed sort()
     f <- match(x, levels)
     levels(f) <- as.character(levels)
     class(f) <- "factor"
     f
   } else {
     factor2(x)
   }
 }

 factor2 <- function(x = character(), levels, labels = levels, exclude = NA,
                     ordered = is.ordered(x), nmax = NA) {
   if (is.null(x)) x <- character()
   nx <- names(x)
   if (missing(levels)) {
     y <- unique(x, nmax = nmax)
     ind <- seq_along(y)  # Changed from sort.list(y)
     y <- as.character(y)
     levels <- unique(y[ind])
   }
   force(ordered)
   exclude <- as.vector(exclude, typeof(x))
   x <- as.character(x)
   # Drop excluded values (NA by default) from the levels
   levels <- levels[is.na(match(levels, exclude))]
   f <- match(x, levels)
   if (!is.null(nx)) names(f) <- nx
   nl <- length(labels)
   nL <- length(levels)
   if (!any(nl == c(1L, nL)))
     stop(gettextf("invalid 'labels'; length %d should be 1 or %d", nl, nL),
          domain = NA)
   levels(f) <- if (nl == nL) as.character(labels)
                else paste0(labels, seq_along(levels))
   class(f) <- c(if (ordered) "ordered", "factor")
   f
 }

Now we can call as.factor2 as follows:

 as.factor2(c("\U201C",'"3', "1", "2","\U00B5"))
 # [1] “ "3 1 2 µ
 # Levels: "3 “ 1 2 µ

I would not call this a solution to your problem; it is more of a workaround. Moreover, since this concerns teaching students, I would prefer not to reimplement base R functions. I hope someone else can provide a simpler solution.
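One possibly simpler route, offered as a sketch rather than a definitive fix: passing levels explicitly to the ordinary factor() skips the locale-dependent sort, so no base functions need to be redefined. Using unique(x) as the levels is an assumption made here; it yields order of first appearance, and any other fixed order could be supplied instead.

```r
x <- c("\U201C", '"3', "1", "2", "\U00B5")

# Explicit levels: factor() does not sort when levels are supplied,
# so the result no longer depends on LC_COLLATE.
f <- factor(x, levels = unique(x))

# Integer codes now follow order of first appearance on every platform.
codes <- as.integer(f)
```

With explicit levels, as.numeric(f) gives the same codes on Linux, Windows, and Mac OS, which is the property the assignment needed.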
