Manual conversion of Unicode code points to UTF-8 and UTF-16

I have a university programming exam coming up, and one section is on Unicode.

I have checked everywhere for answers to this, and my lecturer is no help, so you guys are my last resort.

The question would be something like this:

The string 'mЖ丽' consists of the Unicode code points U+006D, U+0416 and U+4E3D. Manually encode the string in UTF-8 and UTF-16, writing the answers in hexadecimal format.

Any help at all would be greatly appreciated as I try to figure it out.

unicode utf-8 utf-16
3 answers

Wow. On the one hand I'm thrilled to know that university courses are teaching the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)

The clearest description I've seen so far of the rules for encoding UCS code points to UTF-8 is in the utf-8(7) man page found on many Linux systems:

 Encoding

 The following byte sequences are used to represent a character.
 The sequence to be used depends on the UCS code number of the
 character:

 0x00000000 - 0x0000007F: 0xxxxxxx
 0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
 0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
 0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

 [... removed obsolete five and six byte forms ...]

 The xxx bit positions are filled with the bits of the character
 code number in binary representation. Only the shortest possible
 multibyte sequence which can represent the code number of the
 character can be used.

 The UCS code values 0xd800-0xdfff (UTF-16 surrogates) as well as
 0xfffe and 0xffff (UCS noncharacters) should not appear in
 conforming UTF-8 streams.

It might be easier to remember a "compressed" version of the chart:

The lead byte of a multi-byte code point starts with a run of 1 bits (one for each byte in the sequence) followed by a 0. Continuation bytes start with 10.

 0x80      5 bits in the lead byte, plus one continuation byte
 0x800     4 bits in the lead byte, plus two continuation bytes
 0x10000   3 bits in the lead byte, plus three continuation bytes

You can derive those ranges by noting how much space you can fill with the bits allowed in each form:

 2**(5 + 1*6) == 2048    == 0x800
 2**(4 + 2*6) == 65536   == 0x10000
 2**(3 + 3*6) == 2097152 == 0x200000
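If you want to double-check that derivation, the arithmetic is easy to reproduce in a few lines of Perl; this is nothing from any module, just the expressions above re-typed:

 #!/usr/bin/perl
 use strict;
 use warnings;

 # lead-byte data bits + (continuation bytes x 6 data bits each)
 printf "2**%2d == %7d == 0x%X\n", 5 + 1*6, 2**(5 + 1*6), 2**(5 + 1*6);
 printf "2**%2d == %7d == 0x%X\n", 4 + 2*6, 2**(4 + 2*6), 2**(4 + 2*6);
 printf "2**%2d == %7d == 0x%X\n", 3 + 3*6, 2**(3 + 3*6), 2**(3 + 3*6);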

I know I could remember the rules for deriving the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)

Update

Once you have built the chart above, you can convert an input Unicode code point to UTF-8 by finding its range, converting from hexadecimal to binary, inserting the bits according to the rules above, and then converting back to hex:

 U+4E3E 

This falls in the range 0x00000800 - 0x0000FFFF ( 0x4E3E < 0xFFFF ), so the representation will be of the form:

  1110xxxx 10xxxxxx 10xxxxxx 

0x4E3E is 100111000111110 in binary. Drop those bits into the x positions above, starting from the right; any x positions left over at the start will be filled with 0 :

  1110x100 10111000 10111110 

There is one x left at the start, which gets filled with 0 :

  11100100 10111000 10111110 

Convert from binary back to hex:

  0xE4 0xB8 0xBE 
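If you want to check a hand conversion like this, here is a minimal Perl sketch that applies the same rules with bit operations. The utf8_bytes function is my own invention for illustration, not a library routine:

 #!/usr/bin/perl
 use strict;
 use warnings;

 # Encode one code point to UTF-8 bytes, following the ranges above.
 sub utf8_bytes {
     my ($cp) = @_;
     my @bytes;
     if ($cp <= 0x7F) {                       # 0xxxxxxx
         @bytes = ($cp);
     }
     elsif ($cp <= 0x7FF) {                   # 110xxxxx 10xxxxxx
         @bytes = (0xC0 | ($cp >> 6),
                   0x80 | ($cp & 0x3F));
     }
     elsif ($cp <= 0xFFFF) {                  # 1110xxxx 10xxxxxx 10xxxxxx
         @bytes = (0xE0 | ($cp >> 12),
                   0x80 | (($cp >> 6) & 0x3F),
                   0x80 | ($cp & 0x3F));
     }
     else {                                   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         @bytes = (0xF0 | ($cp >> 18),
                   0x80 | (($cp >> 12) & 0x3F),
                   0x80 | (($cp >> 6) & 0x3F),
                   0x80 | ($cp & 0x3F));
     }
     return join " ", map { sprintf "0x%02X", $_ } @bytes;
 }

 print utf8_bytes(0x4E3E), "\n";   # 0xE4 0xB8 0xBE, matching the work above
 print utf8_bytes(0x416),  "\n";   # 0xD0 0x96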
---

The Wikipedia articles on UTF-8 and UTF-16 are good.

Procedures for your example string:

UTF-8

UTF-8 uses up to 4 bytes to represent Unicode code points. For the 1-byte case, use the following pattern:

1-byte UTF-8 = 0xxxxxxx bin = 7 bits = 0-7F hex

The first byte of a 2-, 3-, or 4-byte UTF-8 sequence starts with 2, 3, or 4 one bits, followed by a zero bit. Subsequent bytes always begin with the two-bit pattern 10 , leaving 6 bits for data:

2-byte UTF-8 = 110xxxxx 10xxxxxx bin = 5 + 6 (11) bits = 80-7FF hex
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx bin = 4 + 6 + 6 (16) bits = 800-FFFF hex
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx bin = 3 + 6 + 6 + 6 (21) bits = 10000-10FFFF hex †

† Unicode code points are undefined beyond 10FFFF hex .

Your code points are U+006D, U+0416 and U+4E3D, requiring 1-, 2- and 3-byte UTF-8 sequences, respectively. Convert to binary and assign the bits:

U+006D = 1101101 bin = 0 1101101 bin = 6D hex
U+0416 = 10000 010110 bin = 110 10000 10 010110 bin = D0 96 hex
U+4E3D = 0100 111000 111101 bin = 1110 0100 10 111000 10 111101 bin = E4 B8 BD hex

The final sequence of bytes:

6D D0 96 E4 B8 BD

or if null-terminated strings are required:

6D D0 96 E4 B8 BD 00
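If you have Perl handy, you can confirm that whole byte sequence in one line with the core Encode module; the \x{...} escapes below are just the code points from the question:

 perl -MEncode -e 'print uc(unpack("H*", encode("UTF-8", "m\x{416}\x{4E3D}"))), "\n"'
 # prints 6DD096E4B8BD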

UTF-16

UTF-16 uses 2 or 4 bytes to represent Unicode code points. Algorithm:

U+0000 to U+D7FF use the 2-byte values 0000 hex to D7FF hex
U+D800 to U+DFFF are invalid code points, reserved for 4-byte UTF-16 (surrogates)
U+E000 to U+FFFF use the 2-byte values E000 hex to FFFF hex

U+10000 to U+10FFFF use the 4-byte UTF-16 encoding as follows:

  • Subtract 10000 hex from the code point.
  • Express the result as a 20-bit binary number.
  • Use the pattern 110110xxxxxxxxxx 110111xxxxxxxxxx bin to encode the upper and lower 10 bits into two 16-bit words.

Using your code points:

U+006D = 006D hex
U+0416 = 0416 hex
U+4E3D = 4E3D hex

Now we have one more issue. Some machines store the two bytes of a 16-bit word least-significant byte first (so-called little-endian machines), while others store the most significant byte first (big-endian machines). UTF-16 uses the code point U+FEFF (called the byte order mark, or BOM) to help a machine determine whether a byte stream contains big-endian or little-endian UTF-16:

big-endian = FE FF 00 6D 04 16 4E 3D
little-endian = FF FE 6D 00 16 04 3D 4E

With null termination, U+0000 = 0000 hex :

big-endian = FE FF 00 6D 04 16 4E 3D 00 00
little-endian = FF FE 6D 00 16 04 3D 4E 00 00

Since your instructor didn't give a code point that requires 4-byte UTF-16, here is one example:

U+1F031 = 1F031 hex - 10000 hex = F031 hex = 0000111100 0000110001 bin =
110110 0000111100 110111 0000110001 bin = D83C DC31 hex
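Here is a minimal Perl sketch of the same UTF-16 procedure, surrogate pairs and byte order included. The helper names utf16_units and utf16_hex are mine, not standard functions:

 #!/usr/bin/perl
 use strict;
 use warnings;

 # One code point -> one or two 16-bit UTF-16 code units.
 sub utf16_units {
     my ($cp) = @_;
     return ($cp) if $cp <= 0xFFFF;            # BMP: the code point is the code unit
     my $v = $cp - 0x10000;                    # 20-bit value
     return (0xD800 | ($v >> 10),              # high surrogate: 110110xxxxxxxxxx
             0xDC00 | ($v & 0x3FF));           # low surrogate:  110111xxxxxxxxxx
 }

 # Hex-dump a list of code points as UTF-16, byte order mark first.
 sub utf16_hex {
     my ($endian, @cps) = @_;
     my @units = (0xFEFF, map { utf16_units($_) } @cps);   # U+FEFF = BOM
     my $fmt   = $endian eq 'big' ? 'n*' : 'v*';           # n = 16-bit big-endian, v = little-endian
     return uc unpack("H*", pack($fmt, @units));
 }

 print utf16_hex('big',    0x6D, 0x416, 0x4E3D), "\n";   # FEFF006D04164E3D
 print utf16_hex('little', 0x6D, 0x416, 0x4E3D), "\n";   # FFFE6D0016043D4E
 printf "%04X %04X\n", utf16_units(0x1F031);             # D83C DC31

It does no error checking (it will happily "encode" a lone surrogate), which is fine for checking exam answers by hand.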

---

The following program will do the necessary work. It may not be "manual" enough for your purposes, but at least you can use it to check your work.

 #!/usr/bin/perl
 use 5.012;
 use strict;
 use utf8;
 use autodie;
 use warnings;
 use warnings  qw< FATAL utf8 >;
 no warnings   qw< uninitialized >;
 use open      qw< :std :utf8 >;
 use charnames qw< :full >;
 use feature   qw< unicode_strings >;

 use Encode             qw< encode decode >;
 use Unicode::Normalize qw< NFD NFC >;

 # The example string from the question: U+006D, U+0416, U+4E3D.
 my ($x) = "mЖ丽";

 # Write the string through encoding layers, then dump the raw bytes.
 open(U8, ">:encoding(utf8)", "/tmp/utf8-out");
 print U8 $x;
 close(U8);

 open(U16, ">:encoding(utf16)", "/tmp/utf16-out");
 print U16 $x;
 close(U16);

 system("od -t x1 /tmp/utf8-out");

 # Encode in memory as well and print the bytes as hex.
 my $u8 = encode("utf-8", $x);
 print "utf-8: 0x" . unpack("H*", $u8) . "\n";

 system("od -t x1 /tmp/utf16-out");

 my $u16 = encode("utf-16", $x);
 print "utf-16: 0x" . unpack("H*", $u16) . "\n";
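Assuming you save the script as, say, encodings.pl (the file name is mine), the output should look roughly like this; the od dumps show the file bytes, and the encode() lines show the same bytes from memory. Note that Perl's utf16/UTF-16 encoding writes a BOM and defaults to big-endian:

 $ perl encodings.pl
 0000000 6d d0 96 e4 b8 bd
 0000006
 utf-8: 0x6dd096e4b8bd
 0000000 fe ff 00 6d 04 16 4e 3d
 0000010
 utf-16: 0xfeff006d04164e3d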


