
Different UTF-8 encodings in OS X file names

I have a small shell script in .x:

    $ cat .x
    u="Böhmáí"
    touch "$u"
    ls > .list
    echo "$u" >.text
    cat .list .text
    diff .list .text
    od -bc .list
    od -bc .text

When I run this script with sh -x .x (the -x is only there to display the commands), I get:

    $ sh -x .x
    + u=Böhmáí
    + touch Böhmáí
    + ls
    + echo Böhmáí
    + cat .list .text
    Böhmáí
    Böhmáí
    + diff .list .text
    1c1
    < Böhmáí
    ---
    > Böhmáí
    + od -bc .list
    0000000 102 157 314 210 150 155 141 314 201 151 314 201 012
              B   o   ̈  **   h   m   a   ́  **   i   ́  **  \n
    0000015
    + od -bc .text
    0000000 102 303 266 150 155 303 241 303 255 012
              B   ö  **   h   m   á  **   í  **  \n
    0000012

The same string, Böhmáí, is encoded with different bytes in the file name and in the file content. In the terminal (which uses UTF-8), it looks the same in both cases.

Where is the catch?

1 answer

(This is mostly stolen from a previous answer of mine ...)

Unicode allows some accented characters to be represented in several different ways: as a single "code point" representing the accented character, or as a series of code points representing the unaccented version of the character followed by the accent(s). For example, "ä" can be precomposed as U+00E4 (UTF-8 0xc3a4, Latin small letter a with diaeresis) or decomposed as U+0061 U+0308 (UTF-8 0x61cc88, Latin small letter a + combining diaeresis).
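The two encodings of "ä" can be reproduced byte for byte with printf octal escapes, so the result does not depend on how an editor or terminal happens to normalize the text. This is only a minimal sketch; the variable names precomposed and decomposed are illustrative.

```sh
#!/bin/sh
# Build "ä" both ways from raw bytes (octal escapes), so nothing
# depends on how this file itself is normalized.
precomposed=$(printf '\303\244')    # U+00E4        -> 0xc3 0xa4
decomposed=$(printf 'a\314\210')    # U+0061 U+0308 -> 0x61 0xcc 0x88

# Dump the bytes of each form in hex.
printf '%s' "$precomposed" | od -An -tx1
printf '%s' "$decomposed"  | od -An -tx1

# The shell compares byte sequences, so the two forms are not equal
# even though a terminal renders them identically.
if [ "$precomposed" = "$decomposed" ]; then
    echo "same bytes"
else
    echo "different bytes"
fi
```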

The OS X HFS+ file system requires all file names to be stored in the UTF-8 representation of their fully decomposed form. In an HFS+ file name, "ä" MUST be encoded as 0x61cc88, and "ö" MUST be encoded as 0x6fcc88.

So what is happening here is that your shell script contains "Böhmáí" in precomposed form, so it is stored that way in the variable u and written to the text file that way. But when you create a file with that name (with touch), the file system converts it to decomposed form for the actual file name. And when you run ls, it shows the form the file system holds: the decomposed form.
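One way to compare such names reliably is to normalize both sides to the same form before comparing. The sketch below assumes python3 is available on the PATH and uses its standard unicodedata module to convert to NFC (precomposed form); the helper name to_nfc is hypothetical, and the two strings are built from the exact bytes shown in the od output of the question.

```sh
#!/bin/sh
# Hypothetical helper: rewrite stdin in NFC (precomposed) form.
# Assumes python3 is on PATH; unicodedata is in its standard library.
to_nfc() {
    python3 -c '
import sys, unicodedata
data = sys.stdin.buffer.read().decode("utf-8")
sys.stdout.buffer.write(unicodedata.normalize("NFC", data).encode("utf-8"))
'
}

# "Böhmáí" in decomposed form, as ls printed it in the question.
decomposed=$(printf 'Bo\314\210hma\314\201i\314\201')
# The same name in precomposed form, as echo wrote it to .text.
precomposed=$(printf 'B\303\266hm\303\241\303\255')

# Normalize the decomposed name, then compare byte for byte.
normalized=$(printf '%s' "$decomposed" | to_nfc)

if [ "$normalized" = "$precomposed" ]; then
    echo "equal after NFC normalization"
else
    echo "still different"
fi
```

With a helper like this, the listing in the question could be written as ls | to_nfc > .list, after which diff .list .text would be expected to report no difference.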
