Converting a Unicode character from bytes

In our API, we use byte[] to send data over the network. Everything worked fine until the day when our "foreign" clients decided to send/receive Unicode characters.

As far as I know, Unicode characters occupy 2 bytes; however, we only allocate 1 byte per character in the byte array.

This is how we read a character from the byte[] array:

    // buffer is a byte[6553] and index is a current location in the buffer
    char c = System.BitConverter.ToChar(buffer, m_index);
    index += SIZEOF_BYTE;
    return c;

So the current problem is that the API gets back a weird Unicode character: when I look at its hex code, the least significant byte is correct, but the most significant byte has a value when it should be 0. A quick workaround so far has been to mask it with c & 0x00FF to filter out the most significant byte.
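
For reference, the masking workaround looks roughly like this (a sketch only; it hides the symptom rather than fixing the decoding):

    // Sketch of the workaround described above: drop whatever the
    // neighbouring byte contributed and keep only the low byte.
    char c = System.BitConverter.ToChar(buffer, m_index);
    c = (char)(c & 0x00FF);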

Could you please suggest the correct approach to handling Unicode characters coming from a socket?

Thanks.

Solution:

Kudos to John:

    char c = (char) buffer[m_index];

And, as he mentioned, the reason it works is that the API client sends a character occupying only one byte, while BitConverter.ToChar reads two, so the problem was in the conversion. I still wonder why this worked for a certain set of characters and not for others, since it should have failed in all cases.
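
To illustrate why it only failed for some characters (a sketch, assuming a little-endian machine, the common case for BitConverter): ToChar combines the byte at the index with the byte after it, so the result is only correct when that following byte happens to be zero.

    byte[] buffer = { (byte)'A', 0x00, (byte)'B', (byte)'C' };

    // Bytes 0 and 1 -> 0x0041 -> 'A'; correct only because buffer[1] is 0.
    char looksFine = System.BitConverter.ToChar(buffer, 0);

    // Bytes 2 and 3 -> 0x4342 -> a CJK ideograph, not 'B'.
    char garbage = System.BitConverter.ToChar(buffer, 2);

    // Casting a single byte never reads the neighbouring byte.
    char correct = (char)buffer[2];   // 'B'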

Thanks guys, great answers!

+4
7 answers

You should use Encoding.GetString with the most suitable encoding.

I don't fully understand your situation, but the Encoding class is almost certainly the way to handle this.

Who controls the data here? Your code or your customers' code? Have you determined which format is correct?
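
For the Encoding route, a minimal sketch, assuming the sender uses UTF-8 (byteCount here stands in for however many bytes were actually read from the socket):

    using System.Text;

    // Decode a received byte range into a string; the encoding must match
    // whatever the client actually used when producing the bytes.
    string text = Encoding.UTF8.GetString(buffer, m_index, byteCount);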

EDIT: Well, I looked at your code again: BitConverter.ToChar returns "a character formed by two bytes, starting at startIndex". If you want to use only one byte, just cast it:

 char c = (char) buffer[m_index]; 

I am surprised that your code works at all, as it would break any time the next byte was non-zero.

+5

You should look at the System.Text.ASCIIEncoding.ASCII.GetString method, which takes a byte[] array and converts it to a string (for ASCII).

And at System.Text.UTF8Encoding or System.Text.UnicodeEncoding (UTF-16) for Unicode strings in the UTF-8 or UTF-16 encodings.

There are also methods for converting strings to byte[] in the ASCIIEncoding, UTF8Encoding, and UnicodeEncoding classes: see the GetBytes(String) overloads.
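
A rough round trip with those classes might look like this (the message text is just an example):

    using System.Text;

    string message = "Héllo, wörld";

    // Sender side: string -> byte[]
    byte[] payload = Encoding.UTF8.GetBytes(message);

    // Receiver side: byte[] -> string, using the same encoding
    string decoded = Encoding.UTF8.GetString(payload);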

0

Unicode characters can occupy up to four bytes, but messages are rarely encoded on the wire using four bytes per character. Instead, schemes such as UTF-8 or UTF-16 are used, which only introduce additional bytes when needed.

Take a look at the Encoding class documentation.
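
A small sketch of that variable width, using GetByteCount:

    using System.Text;

    string s = "A€";   // 'A' is in the ASCII range, '€' is not

    int utf8Bytes  = Encoding.UTF8.GetByteCount(s);     // 4: 1 byte for 'A', 3 for '€'
    int utf16Bytes = Encoding.Unicode.GetByteCount(s);  // 4: 2 bytes per char here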

0

Text streams should contain a byte order mark (BOM) that allows you to determine how to interpret the binary data.
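
If a BOM really is present, StreamReader can sniff it and pick the decoder for you; a sketch, assuming the received bytes are wrapped in a stream:

    using System.IO;
    using System.Text;

    using (var ms = new MemoryStream(buffer))
    using (var reader = new StreamReader(ms, Encoding.UTF8,
                                         detectEncodingFromByteOrderMarks: true))
    {
        // Falls back to UTF-8 when no BOM is found.
        string text = reader.ReadToEnd();
    }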

0

It is not clear exactly what your goal is here. From what I can tell, there are two routes you can take:

  • Ignore all data sent in Unicode
  • Process both Unicode and ASCII strings

IMHO, number 1 is the way to go. But it doesn't look like your protocol is necessarily set up to deal with a Unicode string. You will need to do some discovery logic to determine whether the incoming string is a Unicode one. If it is, you can use the Encoding.Unicode.GetString method to convert that particular byte array.
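
A very rough sketch of that discovery idea (the leading flag byte is purely hypothetical and not part of the original protocol):

    using System.Text;

    // Hypothetical framing: the first byte says which encoding follows;
    // byteCount is however many bytes were actually read from the socket.
    bool isUnicode = buffer[0] == 1;
    Encoding enc = isUnicode ? Encoding.Unicode : Encoding.ASCII;
    string text = enc.GetString(buffer, 1, byteCount - 1);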

0

What encoding are your clients using? If some of your clients are still using ASCII, you will need your international clients to use something that maps the ASCII set (1-127) to itself, such as UTF-8. After that, use the UTF-8 encoding's GetString method.
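
A quick sketch of why that works: for bytes in the ASCII range, an ASCII decoder and a UTF-8 decoder produce exactly the same string.

    using System.Text;

    byte[] asciiBytes = Encoding.ASCII.GetBytes("plain ASCII text");

    // Identical results for pure ASCII input, so legacy ASCII clients
    // keep working if everyone standardises on UTF-8.
    string viaAscii = Encoding.ASCII.GetString(asciiBytes);
    string viaUtf8  = Encoding.UTF8.GetString(asciiBytes);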

0

The only real solution is to fix the API. Either tell users to put only ASCII strings in the byte[], or fix it to support ASCII plus whatever other encoding you need to use.

Determining which encoding foreign clients are using from a byte[] alone can be a bit tricky.
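
If the API does get fixed to carry more than ASCII, one option is to make the encoding explicit in the payload rather than guessing; a hypothetical sender-side sketch (the one-byte encoding id is an assumption, not something the current API has):

    using System.Text;

    // Hypothetical framing: [1-byte encoding id][encoded payload].
    byte[] payload = Encoding.UTF8.GetBytes(message);   // message: the string to send
    byte[] frame = new byte[payload.Length + 1];
    frame[0] = 1;                      // made-up ids: 0 = ASCII, 1 = UTF-8
    payload.CopyTo(frame, 1);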

0
