The encoding used in casting from char to byte

Take a look at the following C# code (a function extracted from the BuildProtectedURLWithValidity example at http://wmsauth.org/examples):

    byte[] StringToBytesToBeHashed(string to_be_hashed)
    {
        byte[] to_be_hashed_byte_array = new byte[to_be_hashed.Length];
        int i = 0;
        foreach (char cur_char in to_be_hashed)
        {
            to_be_hashed_byte_array[i++] = (byte)cur_char;
        }
        return to_be_hashed_byte_array;
    }

My question is: what is the difference between a byte and a char in terms of encoding?

I think this really does nothing in terms of encoding, but does it mean that Encoding.Default is effectively being used, so the resulting bytes would depend on how the framework encodes the underlying string on a particular operating system?

And besides, is a char really bigger than a byte (I guess 2 bytes), and does the cast actually drop one of the bytes?

I was thinking of replacing all of this with:

    Encoding.UTF8.GetBytes(stringToBeHashed)

What do you think?

+7
3 answers

The .NET Framework uses Unicode to represent all of its characters and strings. The integer value of a char (which you can get by casting it to int) is equal to its UTF-16 code unit. For characters in the Basic Multilingual Plane (which make up the majority of the characters you will ever encounter), this value is the Unicode code point.

The .NET Framework uses the Char structure to represent a Unicode character. The Unicode Standard identifies each Unicode character with a unique 21-bit scalar number called a code point, and defines the UTF-16 encoding form, which specifies how a code point is encoded into a sequence of one or more 16-bit values. Each 16-bit value ranges from hexadecimal 0x0000 to 0xFFFF and is stored in a Char structure. The value of a Char object is its 16-bit numeric (ordinal) value. - Char Structure
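
As a minimal sketch of that first point (variable names are mine, purely illustrative), casting a char to int exposes the 16-bit UTF-16 code unit it stores, which for BMP characters is the Unicode code point:

    using System;

    class CharCodeUnitDemo
    {
        static void Main()
        {
            char latin = 'D';    // U+0044
            char accented = 'ń'; // U+0144, still in the Basic Multilingual Plane

            // Casting to int exposes the 16-bit UTF-16 code unit stored in the char.
            Console.WriteLine((int)latin);    // 68
            Console.WriteLine((int)accented); // 324
        }
    }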

Casting a char to byte will result in data loss for any character whose value is greater than 255. Try the following simple example to understand why:

    char c1 = 'D';       // code point 68
    byte b1 = (byte)c1;  // b1 is 68

    char c2 = 'ń';       // code point 324
    byte b2 = (byte)c2;  // b2 is 68 too!  (324 % 256 == 68)
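
Conversely, here is a quick sketch (the sample string and variable names are mine, assumed for illustration) of how Encoding.UTF8.GetBytes preserves the same character that the per-character cast collapses:

    using System;
    using System.Text;

    class HashInputDemo
    {
        static void Main()
        {
            string toBeHashed = "Dń";  // 'ń' is U+0144, code point 324

            // The questioned approach: one byte per char, high byte silently dropped.
            byte[] truncated = new byte[toBeHashed.Length];
            for (int i = 0; i < toBeHashed.Length; i++)
                truncated[i] = (byte)toBeHashed[i];

            // UTF-8 encoding: multi-byte sequences preserve the character.
            byte[] utf8 = Encoding.UTF8.GetBytes(toBeHashed);

            Console.WriteLine(BitConverter.ToString(truncated)); // 44-44 ('ń' collapses to 0x44, same as 'D')
            Console.WriteLine(BitConverter.ToString(utf8));      // 44-C5-84
        }
    }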

Yes, you should definitely use Encoding.UTF8.GetBytes.

+14

Casting between byte and char is similar to using ISO-8859-1 (the first 256 Unicode characters), except that it silently loses information for characters above U+00FF.

And besides, is a char really bigger than a byte (I guess 2 bytes), and does the cast actually drop one of the bytes?

Yes. A C# char is a UTF-16 code unit, which is 2 bytes.
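
A short sketch of both claims (the ISO-8859-1 lookup uses the standard Encoding.GetEncoding call; the sample characters are mine, chosen for illustration):

    using System;
    using System.Text;

    class CharSizeDemo
    {
        static void Main()
        {
            // A char is a 16-bit UTF-16 code unit, i.e. two bytes.
            Console.WriteLine(sizeof(char));  // 2

            // For characters up to U+00FF the direct cast matches ISO-8859-1 (Latin-1).
            Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
            char e = 'é';  // U+00E9
            Console.WriteLine((byte)e);                          // 233
            Console.WriteLine(latin1.GetBytes(new[] { e })[0]);  // 233

            // Above U+00FF the cast silently loses information; ISO-8859-1 substitutes '?'.
            char omega = 'Ω';  // U+03A9, code point 937
            Console.WriteLine((byte)omega);                          // 169 (937 % 256)
            Console.WriteLine(latin1.GetBytes(new[] { omega })[0]);  // 63, i.e. '?'
        }
    }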

+4

char is a 16-bit UTF-16 code unit. Casting a char to a byte yields the low byte of the character, but both Douglas and dan04 are mistaken in saying that it always quietly discards the high byte. If the high byte is non-zero, the result depends on whether the "Check for arithmetic overflow/underflow" compiler option is set:

    using System;

    namespace CharTest
    {
        class Program
        {
            public static void Main( string[] args )
            {
                ByteToCharTest( 's' );       // code point 115
                ByteToCharTest( '\u044B' );  // 'ы', code point 1099
                Console.ReadLine();
            }

            static void ByteToCharTest( char c )
            {
                const string MsgTemplate = "Casting to byte character # {0}: {1}";
                string msgRes = "Success";
                byte b;

                try
                {
                    b = ( byte )c;
                }
                catch( Exception e )
                {
                    msgRes = e.Message;
                }

                Console.WriteLine( String.Format( MsgTemplate, (Int16)c, msgRes ) );
            }
        }
    }

Output with overflow checking enabled:

    Casting to byte character # 115: Success
    Casting to byte character # 1099: Arithmetic operation resulted in an overflow.

Output with overflow checking disabled:

    Casting to byte character # 115: Success
    Casting to byte character # 1099: Success
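
For completeness, the same behaviour can also be forced per expression with the checked and unchecked keywords, regardless of the project-wide compiler option. A minimal sketch (not part of the original answer):

    using System;

    class CheckedCastDemo
    {
        static void Main()
        {
            char c = '\u044B';  // code point 1099, high byte is non-zero

            // unchecked: the high byte is discarded, leaving 1099 % 256 = 75.
            byte truncated = unchecked((byte)c);
            Console.WriteLine(truncated);  // 75

            // checked: the same cast throws instead of truncating.
            try
            {
                byte overflow = checked((byte)c);
            }
            catch (OverflowException e)
            {
                Console.WriteLine(e.Message);  // "Arithmetic operation resulted in an overflow."
            }
        }
    }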
+1
