This is not specific to Unicode or to R: sorting in general (as in the *nix sort command) may be locale-specific. Resolving differences in LC_COLLATE requires setting LC_COLLATE (presumably to "C") via Sys.setlocale (according to @alistaire's comment) on all machines.
For me on Windows (7):
sort(c("Abc", "abc", "_abc", "ABC"))
[1] "_abc" "abc" "Abc" "ABC"
whereas on Linux (Ubuntu 12.04 ... wow, I need to update this machine) I get
sort(c("Abc", "abc", "_abc", "ABC"))
[1] "abc" "_abc" "Abc" "ABC"
Setting the collation locale as described above with
Sys.setlocale("LC_COLLATE", "C")
gives
sort(c("Abc", "abc", "_abc", "ABC"))
[1] "ABC" "Abc" "_abc" "abc"
identically on both machines.
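Changing LC_COLLATE with Sys.setlocale affects the whole R session. If that is too invasive, one option is to switch the collation locale only for the duration of a single sort. The following is a minimal sketch of that idea; the helper name sort_c is hypothetical and not part of base R, and it assumes the previously active locale can be set again afterwards.

```r
# Hypothetical helper (not from base R): sort with byte-order ("C") collation,
# then restore whatever collation locale was active before the call.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")                     # remember the current setting
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)  # restore it when the function exits
  Sys.setlocale("LC_COLLATE", "C")                       # compare strings by byte values
  sort(x)
}

sort_c(c("Abc", "abc", "_abc", "ABC"))
```

This should return the same order as the "C" result shown above, regardless of which collation locale the session started with.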
The *nix man page for sort gives a bold warning:
*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
Update: It looks like I can reproduce the problem when I include Unicode characters. The problem goes back to sort: try sorting the vector in your example. I can't seem to change the locale (LC_COLLATE or LC_CTYPE) to "en_AU.UTF-8", which would be a potential solution.
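When a locale cannot be set, Sys.setlocale issues a warning and returns an empty string, so it is possible to test whether a given locale name is accepted on a machine before relying on it. A rough sketch follows; the helper name try_collate is hypothetical, and whether "en_AU.UTF-8" is accepted depends on which locales are installed on the system.

```r
# Hypothetical helper: try to switch LC_COLLATE and report whether it worked.
# Note that on success the session's collation locale stays changed.
try_collate <- function(locale) {
  # Sys.setlocale() warns and returns "" when the OS cannot honour the request.
  res <- suppressWarnings(Sys.setlocale("LC_COLLATE", locale))
  !identical(res, "")
}

if (!try_collate("en_AU.UTF-8")) {
  message("en_AU.UTF-8 is not available here; falling back to \"C\"")
  try_collate("C")
}
```

Falling back to "C" is only one possible choice; it gives byte-order sorting, which is consistent across machines but not linguistically aware.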
Jonathan Carroll