Is the default ("automatic") ordering of factor levels part of the R specification? Is it alphabetical? The same on all platforms?

It can be tempting to import some data x into R, for example with read.table, and then change its levels with levels(x$V1) <- c(...). Columns are initially imported as factors unless we use the as.is parameter or specify colClasses="character". We might then want to convert the levels of all columns of a large data frame at once, but in that case we need to be sure that all these columns (which we assume here hold the same kind of data, i.e. broadly the same set of levels) have their levels ordered the same way.
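One way to sidestep the question entirely is to pass the levels explicitly instead of relying on the default order. The following is only a minimal sketch: the data frame df and the vector shared_levels are hypothetical stand-ins for imported data, and it assumes every column draws its values from the same known set.

```r
# Hypothetical data frame; stringsAsFactors = FALSE keeps the columns as character
df <- data.frame(V1 = c("b", "a", "c"),
                 V2 = c("c", "c", "a"),
                 stringsAsFactors = FALSE)

# One explicit level set shared by every column (the order is chosen by us,
# not derived from the locale)
shared_levels <- c("a", "b", "c")

# Convert every column to a factor with the same levels in the same order;
# df[] <- keeps the data frame structure
df[] <- lapply(df, factor, levels = shared_levels)

# Inspect the levels of each column
sapply(df, levels)
```

With the levels given explicitly, factor() does not sort anything, so the result does not depend on where the code runs.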

My multi-part question is as follows:

  • Is the sort order part of the R language specification, and therefore platform independent?
  • Is it some kind of alphabetical sorting, and if so, according to which alphabet?

See for example:

 > x = as.factor(c("3","$$$av","1","2","^ab", "^ba", "3","aba","4","-ab","ba",'3',"ba"))
 > x
 [1] 3 $$$av 1 2 ^ab ^ba 3 aba 4 -ab ba 3 ba
 Levels: 1 2 3 4 ^ab -ab aba $$$av ba ^ba
1 answer

Check out the factor code:

 if (missing(levels)) {
     y <- unique(x, nmax = nmax)
     ind <- sort.list(y)
     y <- as.character(y)
     levels <- unique(y[ind])
 }

As you can see, sorting is done using sort.list . In the documentation for this function, you will find:

The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.

And in help("Comparison") you can read:

Beware of making any assumptions about the collating sequence: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character - in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode code-point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.

Thus, the default order of the levels is locale dependent and partly platform dependent; it is not fixed by the R language.
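If a reproducible order is needed, one option is to compute the levels yourself with a locale-independent sort and pass them to factor(). The sketch below reuses the vector from the question and relies on sort(..., method = "radix"), which orders character vectors as in the C locale (by byte value) regardless of the current locale; the names lv and f_fixed are just illustrative.

```r
x <- c("3", "$$$av", "1", "2", "^ab", "^ba", "3", "aba", "4", "-ab", "ba", "3", "ba")

# Default levels: sorted with the collating sequence of the current locale,
# so this result may differ between machines
f_default <- factor(x)
levels(f_default)

# Locale-independent alternative: radix sort uses C-locale (byte) order
lv <- sort(unique(x), method = "radix")
f_fixed <- factor(x, levels = lv)
levels(f_fixed)
# "$$$av" "-ab" "1" "2" "3" "4" "^ab" "^ba" "aba" "ba"
```

Setting the collation for the whole session with Sys.setlocale("LC_COLLATE", "C") before creating the factors is another possibility, at the cost of changing the behaviour of every later sort in that session.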
