Manual conversion of Unicode code points to UTF-8 and UTF-16

I have a university programming exam coming up, and one section is on Unicode.

I have checked everywhere for answers to this, and my lecturer is no help, so you guys are my last resort.

The question would be something like this:

The string 'mЖ丽' consists of the Unicode code points U+006D, U+0416 and U+4E3D. Manually encode the string in UTF-8 and UTF-16, writing the answers in hexadecimal format.

Any help at all would be greatly appreciated as I try to figure it out.

unicode utf-8 utf-16
3 answers

Wow. On the one hand I'm thrilled to know that university courses are teaching the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)

The clearest description I've seen so far of the rules for encoding UCS code points to UTF-8 is in the utf-8(7) man page found on many Linux systems:

 Encoding

 The following byte sequences are used to represent a character.
 The sequence to be used depends on the UCS code number of the
 character:

 0x00000000 - 0x0000007F: 0xxxxxxx
 0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
 0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
 0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

 [... removed obsolete five and six byte forms ...]

 The xxx bit positions are filled with the bits of the character
 code number in binary representation. Only the shortest possible
 multibyte sequence which can represent the code number of the
 character can be used.

 The UCS code values 0xd800-0xdfff (UTF-16 surrogates) as well as
 0xfffe and 0xffff (UCS noncharacters) should not appear in
 conforming UTF-8 streams.

It might be easier to remember a "compressed" version of the chart:

The lead byte of a multi-byte code point starts with a run of 1 bits (one for each byte in the sequence) followed by a 0. Continuation bytes start with 10.

 0x80      5 bits in the lead byte, plus one continuation byte
 0x800     4 bits in the lead byte, plus two continuation bytes
 0x10000   3 bits in the lead byte, plus three continuation bytes

You can derive those ranges by noting how much space you can fill with the bits allowed in each form:

 2**(5 + 1*6) == 2048    == 0x800
 2**(4 + 2*6) == 65536   == 0x10000
 2**(3 + 3*6) == 2097152 == 0x200000
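If you want to double-check that derivation, the arithmetic is easy to reproduce in a few lines of Perl; this is nothing from any module, just the expressions above re-typed:

 #!/usr/bin/perl
 use strict;
 use warnings;

 # lead-byte data bits + (continuation bytes x 6 data bits each)
 printf "2**%2d == %7d == 0x%X\n", 5 + 1*6, 2**(5 + 1*6), 2**(5 + 1*6);
 printf "2**%2d == %7d == 0x%X\n", 4 + 2*6, 2**(4 + 2*6), 2**(4 + 2*6);
 printf "2**%2d == %7d == 0x%X\n", 3 + 3*6, 2**(3 + 3*6), 2**(3 + 3*6);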

I know I could remember the rules for deriving the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)

Update

Once you have built the chart above, you can convert an input Unicode code point to UTF-8 by finding its range, converting from hexadecimal to binary, inserting the bits according to the rules above, and then converting back to hex:

 U+4E3E 

This falls in the range 0x00000800 - 0x0000FFFF ( 0x4E3E < 0xFFFF ), so the representation will be of the form:

  1110xxxx 10xxxxxx 10xxxxxx 

0x4E3E is 100111000111110 in binary. Drop those bits into the x positions above, starting from the right; any x positions left over at the start will be filled with 0 :

  1110x100 10111000 10111110 

There is one x left at the start, which gets filled with 0 :

  11100100 10111000 10111110 

Convert from binary back to hex:

  0xE4 0xB8 0xBE 
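If you want to check a hand conversion like this, here is a minimal Perl sketch that applies the same rules with bit operations. The utf8_bytes function is my own invention for illustration, not a library routine:

 #!/usr/bin/perl
 use strict;
 use warnings;

 # Encode one code point to UTF-8 bytes, following the ranges above.
 sub utf8_bytes {
     my ($cp) = @_;
     my @bytes;
     if ($cp <= 0x7F) {                       # 0xxxxxxx
         @bytes = ($cp);
     }
     elsif ($cp <= 0x7FF) {                   # 110xxxxx 10xxxxxx
         @bytes = (0xC0 | ($cp >> 6),
                   0x80 | ($cp & 0x3F));
     }
     elsif ($cp <= 0xFFFF) {                  # 1110xxxx 10xxxxxx 10xxxxxx
         @bytes = (0xE0 | ($cp >> 12),
                   0x80 | (($cp >> 6) & 0x3F),
                   0x80 | ($cp & 0x3F));
     }
     else {                                   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         @bytes = (0xF0 | ($cp >> 18),
                   0x80 | (($cp >> 12) & 0x3F),
                   0x80 | (($cp >> 6) & 0x3F),
                   0x80 | ($cp & 0x3F));
     }
     return join " ", map { sprintf "0x%02X", $_ } @bytes;
 }

 print utf8_bytes(0x4E3E), "\n";   # 0xE4 0xB8 0xBE, matching the work above
 print utf8_bytes(0x416),  "\n";   # 0xD0 0x96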
---

The Wikipedia articles on UTF-8 and UTF-16 are good.

Procedures for your example string:

UTF-8

UTF-8 uses up to 4 bytes to represent Unicode code points. For the 1-byte case, use the following pattern:

1-byte UTF-8 = 0xxxxxxx bin = 7 bits = 0-7F hex

The first byte of a 2-, 3-, or 4-byte UTF-8 sequence starts with 2, 3, or 4 one bits, followed by a zero bit. Subsequent bytes always begin with the two-bit pattern 10 , leaving 6 bits for data:

2-byte UTF-8 = 110xxxxx 10xxxxxx bin = 5 + 6 (11) bits = 80-7FF hex
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx bin = 4 + 6 + 6 (16) bits = 800-FFFF hex
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx bin = 3 + 6 + 6 + 6 (21) bits = 10000-10FFFF hex †

† Unicode code points are undefined beyond 10FFFF hex .

Your code points are U+006D, U+0416 and U+4E3D, requiring 1-, 2- and 3-byte UTF-8 sequences, respectively. Convert to binary and assign the bits:

U+006D = 1101101 bin = 0 1101101 bin = 6D hex
U+0416 = 10000 010110 bin = 110 10000 10 010110 bin = D0 96 hex
U+4E3D = 0100 111000 111101 bin = 1110 0100 10 111000 10 111101 bin = E4 B8 BD hex

The final sequence of bytes:

6D D0 96 E4 B8 BD

or if null-terminated strings are required:

6D D0 96 E4 B8 BD 00
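If you have Perl handy, you can confirm that whole byte sequence in one line with the core Encode module; the \x{...} escapes below are just the code points from the question:

 perl -MEncode -e 'print uc(unpack("H*", encode("UTF-8", "m\x{416}\x{4E3D}"))), "\n"'
 # prints 6DD096E4B8BD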

UTF-16

UTF-16 uses 2 or 4 bytes to represent Unicode code points. Algorithm:

U+0000 to U+D7FF use the 2-byte values 0000 hex to D7FF hex
U+D800 to U+DFFF are invalid code points, reserved for 4-byte UTF-16 (surrogates)
U+E000 to U+FFFF use the 2-byte values E000 hex to FFFF hex

U+10000 to U+10FFFF use the 4-byte UTF-16 encoding as follows:

  • Subtract 10000 hex from the code point.
  • Express the result as a 20-bit binary number.
  • Use the pattern 110110xxxxxxxxxx 110111xxxxxxxxxx bin to encode the upper and lower 10 bits into two 16-bit words.

Using your code points:

U+006D = 006D hex
U+0416 = 0416 hex
U+4E3D = 4E3D hex

Now we have one more issue. Some machines store the two bytes of a 16-bit word least-significant byte first (so-called little-endian machines), while others store the most significant byte first (big-endian machines). UTF-16 uses the code point U+FEFF (called the byte order mark, or BOM) to help a machine determine whether a byte stream contains big-endian or little-endian UTF-16:

big-endian = FE FF 00 6D 04 16 4E 3D
little-endian = FF FE 6D 00 16 04 3D 4E

With null termination, U+0000 = 0000 hex :

big-endian = FE FF 00 6D 04 16 4E 3D 00 00
little-endian = FF FE 6D 00 16 04 3D 4E 00 00

Since your instructor didn't give a code point that requires 4-byte UTF-16, here is one example:

U+1F031 = 1F031 hex - 10000 hex = F031 hex = 0000111100 0000110001 bin =
110110 0000111100 110111 0000110001 bin = D83C DC31 hex
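Here is a minimal Perl sketch of the same UTF-16 procedure, surrogate pairs and byte order included. The helper names utf16_units and utf16_hex are mine, not standard functions:

 #!/usr/bin/perl
 use strict;
 use warnings;

 # One code point -> one or two 16-bit UTF-16 code units.
 sub utf16_units {
     my ($cp) = @_;
     return ($cp) if $cp <= 0xFFFF;            # BMP: the code point is the code unit
     my $v = $cp - 0x10000;                    # 20-bit value
     return (0xD800 | ($v >> 10),              # high surrogate: 110110xxxxxxxxxx
             0xDC00 | ($v & 0x3FF));           # low surrogate:  110111xxxxxxxxxx
 }

 # Hex-dump a list of code points as UTF-16, byte order mark first.
 sub utf16_hex {
     my ($endian, @cps) = @_;
     my @units = (0xFEFF, map { utf16_units($_) } @cps);   # U+FEFF = BOM
     my $fmt   = $endian eq 'big' ? 'n*' : 'v*';           # n = 16-bit big-endian, v = little-endian
     return uc unpack("H*", pack($fmt, @units));
 }

 print utf16_hex('big',    0x6D, 0x416, 0x4E3D), "\n";   # FEFF006D04164E3D
 print utf16_hex('little', 0x6D, 0x416, 0x4E3D), "\n";   # FFFE6D0016043D4E
 printf "%04X %04X\n", utf16_units(0x1F031);             # D83C DC31

It does no error checking (it will happily "encode" a lone surrogate), which is fine for checking exam answers by hand.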

---

The following program will do the necessary work. It may not be "manual" enough for your purposes, but at least you can use it to check your work.

 #!/usr/bin/perl
 use 5.012;
 use strict;
 use utf8;
 use autodie;
 use warnings;
 use warnings  qw< FATAL utf8 >;
 no warnings   qw< uninitialized >;
 use open      qw< :std :utf8 >;
 use charnames qw< :full >;
 use feature   qw< unicode_strings >;

 use Encode             qw< encode decode >;
 use Unicode::Normalize qw< NFD NFC >;

 # The example string from the question: U+006D, U+0416, U+4E3D.
 my ($x) = "mЖ丽";

 # Write the string through encoding layers, then dump the raw bytes.
 open(U8, ">:encoding(utf8)", "/tmp/utf8-out");
 print U8 $x;
 close(U8);

 open(U16, ">:encoding(utf16)", "/tmp/utf16-out");
 print U16 $x;
 close(U16);

 system("od -t x1 /tmp/utf8-out");

 # Encode in memory as well and print the bytes as hex.
 my $u8 = encode("utf-8", $x);
 print "utf-8: 0x" . unpack("H*", $u8) . "\n";

 system("od -t x1 /tmp/utf16-out");

 my $u16 = encode("utf-16", $x);
 print "utf-16: 0x" . unpack("H*", $u16) . "\n";
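Assuming you save the script as, say, encodings.pl (the file name is mine), the output should look roughly like this; the od dumps show the file bytes, and the encode() lines show the same bytes from memory. Note that Perl's utf16/UTF-16 encoding writes a BOM and defaults to big-endian:

 $ perl encodings.pl
 0000000 6d d0 96 e4 b8 bd
 0000006
 utf-8: 0x6dd096e4b8bd
 0000000 fe ff 00 6d 04 16 4e 3d
 0000010
 utf-16: 0xfeff006d04164e3d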


