tr -c -d behavior when deleting bytes with values that are not characters

I find it difficult to understand this paragraph from the Rationale section of http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html:

The ISO POSIX-2:1993 standard had a -c option that behaved similarly to the -C option, but did not supply functionality equivalent to the -c option specified in POSIX.1-2008. This meant that historical practice of being able to specify tr -cd \000-\177 (which would delete all bytes with the top bit set) would have no effect because, in the C locale, bytes with the values octal 200 to octal 377 are not characters.

However, my test on CentOS 6.5 seems to show that it does have an effect.

    $ export LC_ALL=C
    $ export LANG=C
    $ locale
    LANG=C
    LC_CTYPE="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=C
    $ printf "\x41\x42\x81\x82" | od -t x1
    0000000 41 42 81 82
    0000004
    $ printf "\x41\x42\x81\x82" | tr -c -d "\000-\1777" | od -t x1
    0000000 41 42
    0000002

The tr -c -d "\000-\1777" command deleted the bytes with the values \x81 and \x82. Why is the result of my test inconsistent with what the specification says?

1 answer

Since you are using CentOS, your tr command most likely comes from the GNU coreutils package. GNU tr does not (yet) distinguish between the behavior of -c and -C. In recent versions of tr, both -c and -C are equivalent short forms of the --complement option.

According to the GNU documentation for tr:

Currently tr fully supports only single-byte characters. Eventually it will support multibyte characters; when it does, the -C option will cause it to complement the set of characters, whereas -c will cause it to complement the set of values. This distinction will matter only when some values are not characters, and this is possible only in locales using multibyte encodings when the input contains encoding errors.
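A minimal sketch for checking on a given system whether -c and -C behave the same, assuming a tr that accepts both options (GNU coreutils does). The helper names strip_lower_c, strip_upper_c and hex are made up for this illustration, and octal escapes are used in printf because \x escapes are not portable.

```shell
# Hypothetical helpers: each keeps only bytes 000-177 (octal) from stdin.
strip_lower_c() { LC_ALL=C tr -c -d '\000-\177'; }
strip_upper_c() { LC_ALL=C tr -C -d '\000-\177'; }

# Print stdin as hex pairs with od's whitespace removed.
hex() { od -An -t x1 | tr -d ' \n'; echo; }

# Input bytes: 0x41 0x42 0x81 0x82
printf '\101\102\201\202' | strip_lower_c | hex
printf '\101\102\201\202' | strip_upper_c | hex
```

With GNU tr both lines should print 4142, matching the result in the question. An implementation that really separates "values" from "characters" could print something different for the -C line.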

I also find the paragraph you quoted from the POSIX specification vaguely worded, but I'd agree with Ethan Reisner's interpretation that "implementations conforming to the 1993 version of the specification would be broken, but earlier (historical) implementations and implementations conforming to the 2008 (and newer) versions would work."

In any case, GNU tr does not (yet) comply with this part of the 2008 POSIX specification (that is, it does not distinguish characters from values), so it cannot be used to test this distinction.

By the way, you have a redundant 7 in your command tr -c -d "\000-\1777": an octal escape takes at most three digits, so \1777 is read as \177 followed by a literal 7.
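A small sketch of how that extra digit is parsed, assuming a tr that follows the usual rule of one to three octal digits per escape. The sample input a7b is chosen only for illustration.

```shell
# '\1777' is the octal escape \177 (DEL) followed by a literal '7',
# so this deletes DEL and the character 7 from the input.
printf 'a7b\n' | LC_ALL=C tr -d '\1777'
```

This should print ab. In the range "\000-\1777" the trailing 7 only adds a character that \000-\177 already covers, which is why it has no visible effect in the question's test.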


Source: https://habr.com/ru/post/1216101/

