Different UTF-8 encodings in OS X file names
I have a small shell script in .x:
    $ cat .x
    u="Böhmáí"
    touch "$u"
    ls > .list
    echo "$u" >.text
    cat .list .text
    diff .list .text
    od -bc .list
    od -bc .text

When I run this script with sh -x .x (the -x is only there to display the commands), I get:
    $ sh -x .x
    + u=Böhmáí
    + touch Böhmáí
    + ls
    + echo Böhmáí
    + cat .list .text
    Böhmáí
    Böhmáí
    + diff .list .text
    1c1
    < Böhmáí
    ---
    > Böhmáí
    + od -bc .list
    0000000 102 157 314 210 150 155 141 314 201 151 314 201 012
              B   o   ̈  **   h   m   a   ́  **   i   ́  **  \n
    0000015
    + od -bc .text
    0000000 102 303 266 150 155 303 241 303 255 012
              B   ö  **   h   m   á  **   í  **  \n
    0000012

The same string, Böhmáí, is encoded with different bytes in the file name and in the file content. In the terminal (UTF-8 encoded), the string looks the same in both cases.
Where is the rabbit?
(This is mostly stolen from a previous answer of mine ...)
Unicode allows some accented characters to be represented in several different ways: as a single "code point" representing the accented character, or as a series of code points representing the unaccented version of the character, followed by the accent(s). For example, "ä" can be precomposed as U+00E4 (UTF-8 0xc3a4, Latin small letter a with diaeresis) or decomposed as U+0061 U+0308 (UTF-8 0x61cc88, Latin small letter a + combining diaeresis).
The OS X HFS+ file system requires all file names to be stored in the UTF-8 representation of their fully decomposed form. In an HFS+ file name, "ä" MUST be encoded as 0x61cc88, and "ö" MUST be encoded as 0x6fcc88.
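To see the byte sequences quoted above without involving the file system, here is a minimal sketch that builds both forms with printf octal escapes and dumps them as hex. The helper name hex and the variable names are made up for this illustration; only printf, od and tr are assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes of a string as one run of hex digits.
hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \t\n'
}

# "ä" precomposed: U+00E4 (octal escapes \303\244 are the bytes c3 a4)
a_precomposed=$(printf '\303\244')
# "ä" decomposed: "a" followed by U+0308 combining diaeresis (bytes cc 88)
a_decomposed=$(printf 'a\314\210')
# "ö" decomposed: "o" followed by U+0308 combining diaeresis
o_decomposed=$(printf 'o\314\210')

echo "precomposed a-diaeresis: $(hex "$a_precomposed")"
echo "decomposed  a-diaeresis: $(hex "$a_decomposed")"
echo "decomposed  o-diaeresis: $(hex "$o_decomposed")"

# Both forms of "ä" render alike in a terminal, but they are different byte strings.
if [ "$a_precomposed" = "$a_decomposed" ]; then
    echo "same bytes"
else
    echo "different bytes"
fi
```

This should print c3a4, 61cc88 and 6fcc88, followed by "different bytes". The octal values 314 210 are the same ones that show up after the "o" in the od -bc .list output in the question.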
So what is happening here is that your shell script contains "Böhmáí" in precomposed form, so it is stored that way in the variable u and written to the .text file that way. But when you create a file with that name (with touch), the file system converts it to the decomposed form for the actual file name. And when you ls, it shows the form the file system holds: the decomposed form.