Removing hex bytes with sed - no match

Question

I have a text file with two non-ASCII bytes (0xFF and 0xFE):

 ??58832520.3,ABC
 348384,DEF

Hexdump of this file:

 FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46

I think FF and FE are some kind of leading bytes (they occur throughout my file, although they always seem to be at the beginning of a line).

I am trying to remove these bytes with sed, but nothing I try seems to match them.

 $ sed 's/[^a-zA-Z0-9\,]//g' test.csv
 ??588325203,ABC
 348384,DEF

 $ sed 's/[a-zA-Z0-9\,]//g' test.csv
 ??.

The main question is: how do I remove these bytes? Bonus question: the two regular expressions are direct negations of each other, so one of them should logically filter out these bytes, right? Why does neither of them match the 0xFF and 0xFE bytes?

Update: the direct approach of removing a range of hex bytes (suggested by two answers below) seems to strip the first "legitimate" byte from each line and leave the bytes I'm trying to get rid of:

 $ sed 's/[\x80-\xff]//' test.csv
 ??8832520.3,ABC
 48384,DEF

 FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A

Note that the "5" and "3" are missing from the beginning of each line, and that a new 0A has been added at the end of the file.

Bigger update: this problem seems to be system-specific. It was observed on OS X, but the suggestions (including my original sed expression above) work as I expect on NetBSD.

A solution: the same task seems easy enough with Perl:

 $ perl -pe 's/^\xFF\xFE//' test.csv
 58832520.3,ABC
 348384,DEF

However, I will leave this question open, as this is only a workaround and does not explain what the problem is with sed.

+7

regex sed hex macos

Greg Aug 08 '10 at 17:45

7 answers

The FF and FE bytes at the beginning of your file are what is called a byte order mark (BOM). It can appear at the start of Unicode text streams to indicate the endianness of the text. FF FE indicates UTF-16 in little-endian.

Here is an excerpt from the FAQ:

Q: How should I deal with BOMs?
A: Here are some guidelines to follow:
A particular protocol (for example, the Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
Some protocols allow optional BOMs in the case of untagged text. In those cases:
Where a text data stream is known to be plain text, but of unknown encoding, the BOM can be used as a signature. If there is no BOM, the encoding could be anything.
Where a text data stream is known to be plain Unicode text (but not which endian), the BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
Some byte-oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as an encoding form signature should be avoided.
Where the precise type of the data stream is known (for example, Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE, a BOM should not be used.
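To see which signature, if any, a file starts with, the first bytes can be inspected directly. The following is a minimal sketch, not a complete detector: `bom_kind` is a hypothetical helper name, only the UTF-8 and UTF-16 signatures are checked (a UTF-32LE file, which begins FF FE 00 00, would be reported as UTF-16LE), and it assumes a POSIX `od` and `tr`.

```shell
# bom_kind FILE: print a guess at the BOM found at the start of FILE.
# Hypothetical helper for illustration; only three signatures are checked.
bom_kind() {
    # od dumps the first three bytes as hex; tr removes spaces and the newline.
    sig=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
    case "$sig" in
        efbbbf) echo "UTF-8" ;;
        fffe*)  echo "UTF-16LE" ;;
        feff*)  echo "UTF-16BE" ;;
        *)      echo "none" ;;
    esac
}
```

For the file in the question, `bom_kind test.csv` would report the FF FE signature. Note that the bytes after it in that hexdump are single-byte ASCII rather than two-byte code units, so the signature alone does not prove the rest of the file is UTF-16.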

References

unicode.org/FAQ/UTF BOM

Related Questions

Why would I use a Unicode signature byte order mark (BOM)?
Difference between big-endian and little-endian byte order.

+2

polygenelubricants Aug 08 '10 at 18:57

This will strip the specific bytes FF FE wherever they occur on a line:

 sed -e 's/\xff\xfe//g' hexquestion.txt

The reason your negated regular expressions aren't working is that [] specifies a character class. sed assumes a particular character set, probably ASCII. The bytes in your file are not 7-bit ASCII characters: both begin with F, so their high bit is set, and sed does not know how to deal with them. The solution above does not use character classes, so it should be more portable between platforms and character sets.
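A byte-oriented tool sidesteps the character-class question altogether. As a minimal sketch, assuming a POSIX `tr`: the function name `strip_ff_fe` is hypothetical, the bytes are written as octal escapes (\377 is 0xFF, \376 is 0xFE), and `LC_ALL=C` is set so that each byte is treated as one character.

```shell
# strip_ff_fe: delete every 0xFF and 0xFE byte from standard input.
# Hypothetical helper; reads stdin and writes stdout.
strip_ff_fe() {
    # In the C locale tr works on single bytes, so bytes that are not
    # valid in a multibyte encoding are not rejected.
    LC_ALL=C tr -d '\377\376'
}

# Example use: strip_ff_fe < test.csv > cleaned.csv
```

Unlike an anchored sed expression, this removes the two bytes anywhere in the input, which is only appropriate if they never occur as legitimate data.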

+2

Gary Aug 08 '10 at 20:05

On OS X, the byte order mark is probably being read as a single word. Try either sed 's/^\xfffe//g' or sed 's/^\xfeff//g', depending on endianness.

+1

dawg Aug 08 '10 at 23:07

You can match the hex codes with \xff\xfe and replace them with nothing.

0

schoetbi Aug 08 '10 at 17:53

To show that this is not an issue with the Unicode BOM, but an issue of eight-bit versus seven-bit characters that depends on the locale, try the following:

Show all bytes:

 $ printf '123 abc\xff\xfe\x7f\x80' | hexdump -C
 00000000 31 32 33 20 61 62 63 ff fe 7f 80 |123 abc....|

Have sed remove characters that are not alphanumeric in the user locale. Note that space and 0x7f are removed:

 $ printf '123 abc\xff\xfe\x7f\x80' | sed 's/[^[:alnum:]]//g' | hexdump -C
 00000000 31 32 33 61 62 63 ff fe 80 |123abc...|

Have sed remove characters that are not alphanumeric in the C locale. Note that only "123abc" remains:

 $ printf '123 abc\xff\xfe\x7f\x80' | LANG=C sed 's/[^[:alnum:]]//g' | hexdump -C
 00000000 31 32 33 61 62 63 |123abc|
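The last example sets only LANG. Where LC_ALL or LC_CTYPE is already set in the environment, those take precedence, so LANG=C alone may not change what sed sees. A small sketch of the lookup order follows; `effective_ctype` is a hypothetical helper, and falling back to "C" when nothing is set is an assumption, since the default is implementation-defined.

```shell
# effective_ctype: print the locale name that character classification
# would use, based on the shell's current variables.
# Hypothetical helper; it only inspects variables, it does not query libc.
effective_ctype() {
    # Precedence: LC_ALL, then LC_CTYPE, then LANG. Empty values count
    # as unset. The final "C" fallback is an assumption.
    echo "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}
```

If this prints something other than C or POSIX when LANG=C is set, prefixing the command with LC_ALL=C instead overrides all three variables.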

0

Dennis Williamson Aug 08 '10 at 23:27

Alternatively, you can use ed(1):

 printf '%s\n' H $'g/[\xff\xfe]/s///g' ',p' | ed -s test.csv
 printf '%s\n' H $'g/[\xff\xfe]/s///g' wq | ed -s test.csv # in-place edit

0

bashfu Aug 9 '10 at 12:59

deinst · Accepted Answer · 2010-08-08T17:54:53+0000

 sed 's/[^ -~]//g'

or as another answer implies

 sed 's/[\x80-\xff]//g'

See section 3.9 of the sed info pages, the chapter on escapes.

Edit for OS X: the native LANG setting there is en_US.UTF-8.

try

 LANG='' sed 's/[^ -~]//g' myfile

This works on the OS X machine here. I'm not entirely sure why it does not work under UTF-8.
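Combining the locale override with an anchored pattern gives a sed counterpart to the Perl one-liner from the question. This is a sketch under assumptions: `strip_leading_bom_bytes` is a hypothetical name, the two raw bytes are produced with printf octal escapes so the sed script does not rely on \x escapes being supported, and LC_ALL=C is used instead of LANG because it takes precedence over the other locale variables.

```shell
# strip_leading_bom_bytes: remove FF FE from the start of each input line.
# Hypothetical helper; reads stdin and writes stdout.
strip_leading_bom_bytes() {
    # Build the literal bytes 0xFF 0xFE (octal 377 376).
    bom=$(printf '\377\376')
    # In the C locale sed treats every byte as a character, so the raw
    # bytes in the pattern can match.
    LC_ALL=C sed "s/^$bom//"
}

# Example use: strip_leading_bom_bytes < test.csv
```

Unlike 's/[^ -~]//g', this leaves tabs and any other non-printable bytes elsewhere on the line untouched.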
