In C# string/character encoding, what is the difference between GetBytes(), GetString() and Convert()?

Question

In C# string/character encoding, what is the difference between GetBytes(), GetString() and Convert()?

We are unable to get a Unicode string to convert to a UTF-8 string to send over the wire:

    // Start with our Unicode string.
    string unicode = "Convert: \u10A0";

    // Get an array of bytes representing the Unicode string, two for each character.
    byte[] source = Encoding.Unicode.GetBytes(unicode);

    // Convert the Unicode bytes to their UTF-8 representation.
    byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, source);

    // Now that we have converted the bytes, save them to a new string.
    string utf8 = Encoding.UTF8.GetString(converted);

    // Send the converted string using a Microsoft function.
    MicrosoftFunc(utf8);

Although we have converted the string to UTF-8, it does not arrive as UTF-8 at the other end.


string c# encoding unicode utf-8

Ryall 15 Sep '09 at 12:00


1 answer

Ryall · Accepted Answer · 2009-09-15T12:16:37+0000

After a very hectic and confusing morning, we found the answer to this problem.

The key point we were missing, and what made this so confusing, is that string types are always encoded as 16-bit Unicode (UTF-16, two bytes per code unit). This means that when we call GetString() on the bytes, they are automatically transcoded back to Unicode behind the scenes, and we are no better off than we were in the first place.
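A minimal sketch of that round trip, assuming a plain console program (the class and method names here are made up for illustration and are not part of the original code). It repeats the steps from the question and shows that the string coming out of GetString() is identical to the one that went in, while only the byte arrays differ between encodings:

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper that mirrors the question's code:
    // encode as UTF-16, convert the bytes to UTF-8, then decode back to a string.
    public static string ConvertViaUtf8(string text)
    {
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(text);
        byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

        // GetString() decodes the UTF-8 bytes back into a UTF-16 string.
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string original = "Convert: \u10A0";
        string roundTripped = ConvertViaUtf8(original);

        // The "UTF-8 string" is the same string we started with.
        Console.WriteLine(original == roundTripped);                   // True

        // Only the byte arrays carry a specific encoding.
        Console.WriteLine(Encoding.Unicode.GetBytes(original).Length); // 20
        Console.WriteLine(Encoding.UTF8.GetBytes(original).Length);    // 12
    }
}
```

The byte counts follow from the sample text: ten characters at two bytes each in UTF-16, versus nine one-byte ASCII characters plus one three-byte character in UTF-8.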

When we started getting character errors and double-byte data at the other end, we knew something was wrong, but at first glance we could not see anything wrong with our code. Once we understood the point above, we realized that we need to send a byte array if we want to preserve the encoding. Fortunately, MicrosoftFunc() had an overload that takes a byte array instead of a string. This meant we could convert the Unicode string to the encoding of our choice and then send it exactly as we expect. The code changed to:

    // Convert from a Unicode string to an array of bytes (encoded as UTF-8).
    byte[] source = Encoding.UTF8.GetBytes(unicode);

    // Send the encoded byte array directly! Do not send it as a Unicode string.
    MicrosoftFunc(source);

Summary:

In conclusion, from the above we can see that:

GetBytes(), among other things, performs the equivalent of an Encoding.Convert() from Unicode (since strings are always Unicode) to the encoding the method was called on, and returns an array of encoded bytes.
GetString(), among other things, performs the equivalent of an Encoding.Convert() from the encoding the method was called on to Unicode (since strings are always Unicode), and returns the result as a string object.
Convert() converts a byte array in one encoding to a byte array in another encoding. Strings cannot be used on either side (since strings are always Unicode).
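The relationship between the three methods can be sketched as follows. This is an illustrative console program under stated assumptions, not a description of how the framework implements Convert() internally; ConvertManually is a hypothetical helper written for this example. For well-formed input, decoding with GetString() and re-encoding with GetBytes() yields the same bytes as Encoding.Convert():

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingEquivalenceDemo
{
    // Hypothetical helper: Convert() expressed as GetString() followed by GetBytes().
    public static byte[] ConvertManually(Encoding source, Encoding target, byte[] bytes)
    {
        string decoded = source.GetString(bytes); // source bytes -> UTF-16 string
        return target.GetBytes(decoded);          // UTF-16 string -> target bytes
    }

    public static void Main()
    {
        string text = "Convert: \u10A0";
        byte[] utf16 = Encoding.Unicode.GetBytes(text);

        byte[] viaConvert = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16);
        byte[] viaManual = ConvertManually(Encoding.Unicode, Encoding.UTF8, utf16);
        byte[] direct = Encoding.UTF8.GetBytes(text);

        // All three routes produce the same UTF-8 bytes.
        Console.WriteLine(viaConvert.SequenceEqual(viaManual)); // True
        Console.WriteLine(viaConvert.SequenceEqual(direct));    // True
    }
}
```

When the data starts out as a string, calling GetBytes() on the target encoding is the shortest route; Convert() is mainly useful when the data already exists as bytes in some other encoding.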
