Bitwise AND on signed characters

I have a file that I read into an array of type signed char. I cannot change this.

Now I would like to do this !((c[i] & 0xc0) & 0x80) where c[i] is one of the signed characters.

Now I know from section 6.5.10 of the C99 standard that "each of the [bitwise AND] operands shall have integer type".

And section 6.5 of the C99 specification tells me:

Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) are required to have operands that have integer type. These operators yield values that depend on the internal representations of integers, and have implementation-defined and undefined aspects for signed types.

My question is twofold:

  • Since I want to work with the original bit patterns from the file, how can I convert / cast my signed char to unsigned char so that the bit patterns do not change?

  • Is there a list of these "implementation-defined aspects" anywhere (say, for MSVC and GCC)?

Or you could take a different route and argue that the expression gives the same result for signed and unsigned characters for any value of c[i].

Naturally, I will reward references to relevant standards or authoritative texts and discourage “informed” speculation.

4 answers

As others have noted, your implementation is in all likelihood based on two's complement and will give exactly the result you expect.

However, if you are worried about the result of an operation on a signed value, and all you care about is the bit pattern, just cast directly to the equivalent unsigned type. The results are defined by the standard:


6.3.1.3 Signed and unsigned integers

  • ...

  • Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.


This essentially specifies that the result will be the two's complement representation of the value.

The basis of this is that in two's complement arithmetic the result of a calculation is taken modulo some power of two (two raised to the number of bits in the type), which in turn is equivalent to masking off the relevant number of bits. And the complement of a number is that number subtracted from the power of two.

Thus, adding a negative value is the same as adding any value that differs from it by a multiple of that power of two.

i.e:

  (0 + signed_value) mod (2^N) == (2^N + signed_value) mod (2^N) == (7 * 2^N + signed_value) mod (2^N) 

etc. (if you know modular arithmetic, this should be pretty obvious)

So, if you have a negative number, adding a power of two will make it positive (-5 + 256 = 251), but the low N bits will be exactly the same (0b11111011), and this will not affect the result of a mathematical operation. Since values are then truncated to fit the type, the result is exactly the binary value you expected, even if the result "overflows" (that is, what you would consider an overflow if the number had been positive to start with; this wrapping is also well defined).

So, in 8-bit two's complement:

  • -5 is the same as 251 (i.e. 256-5) - 0b11111011
  • If you add 30 and 251, you get 281. But that is more than 256 and 281 mod 256 is 25. Just like 30 - 5.
  • 251 * 2 = 502. 502 mod 256 = 246. 246 and -10 are both 0b11110110.
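The arithmetic in the list above can be reproduced with a short sketch. The helper names to_u8 and add_u8 are hypothetical, and the code assumes the implementation provides uint8_t from <stdint.h> (an optional type, though practically universal).

```c
#include <stdint.h>

/* Hypothetical helper: converts a signed value to uint8_t. C 6.3.1.3
   defines this conversion as reduction modulo 256, whatever the
   signed representation is. */
static uint8_t to_u8(int value)
{
    return (uint8_t)value;
}

/* Hypothetical helper: adds two 8-bit values. The sum is computed in
   int after integer promotion, then reduced modulo 256 by the cast. */
static uint8_t add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

With these helpers, to_u8(-5) is 251, add_u8(30, to_u8(-5)) is 25, and to_u8(-10) is 246, matching the three bullets.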

Similarly, if you have:

 unsigned int a; int b; a - b == a + (unsigned int) -b; 

Under the hood, this cast is unlikely to be implemented with arithmetic and will almost certainly be a straight assignment from one register/value to another, or be optimized away entirely, since the arithmetic makes no distinction between signed and unsigned (CPU flag interpretation is another matter, but that is an implementation detail). The standard exists to ensure that an implementation does not take it upon itself to do something strange instead, or, I suppose, for some strange architecture that does not use two's complement...
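As a rough check of the identity above, the following sketch compares both sides for given operands. The function name sub_equals_add_negated is made up for illustration, and the sketch assumes b is not INT_MIN, since negating INT_MIN overflows in signed arithmetic.

```c
/* Hypothetical helper: returns 1 when a - b and a + (unsigned int)-b
   agree. Both sides are evaluated in unsigned int, i.e. modulo 2^N.
   b must not be INT_MIN, because -b would overflow in that case. */
static int sub_equals_add_negated(unsigned int a, int b)
{
    return (a - b) == (a + (unsigned int)-b);
}
```

The comparison is expected to hold for any a and any b other than INT_MIN, including cases where a - b wraps around below zero.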


unsigned char UC = *(unsigned char*)&C; this is how you can convert the signed C to unsigned while keeping the "bit pattern". So you can change your code to something like this:

 !(( (*(unsigned char*)(c+i)) & 0xc0) & 0x80) 

Explanation (with links):

761 When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object.

1124 When [sizeof is] applied to an operand that has type char, unsigned char, or signed char (or a qualified version thereof), the result is 1.

These two imply that the unsigned char pointer points to the same byte as the original signed char pointer.
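A minimal sketch of this pointer-based approach follows. The function names byte_at and top_bit_clear are hypothetical; the buffer stands in for the array read from the file.

```c
#include <stddef.h>

/* Hypothetical helper: reads element i of a signed char buffer through
   an unsigned char pointer, so the stored byte is returned unchanged. */
static unsigned char byte_at(const signed char *buf, size_t i)
{
    return *(const unsigned char *)(buf + i);
}

/* The question's test, applied to the raw byte. */
static int top_bit_clear(const signed char *buf, size_t i)
{
    return !((byte_at(buf, i) & 0xc0) & 0x80);
}
```

On a two's complement machine, a buffer holding the values 127, -128 and -1 should read back as 0x7F, 0x80 and 0xFF through byte_at.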


You appear to have something like:

 signed char c[] = "\x7F\x80\xBF\xC0\xC1\xFF";
 for (int i = 0; c[i] != '\0'; i++)
 {
     if (!((c[i] & 0xC0) & 0x80))
         ...
 }

You are (correctly) concerned about sign extension of the signed char type. In practice, however, (c[i] & 0xC0) converts the signed character to a (signed) int, but the & 0xC0 discards any set bits in the more significant bytes; the result of the expression will be in the range 0x00 .. 0xFF. I believe this applies whether you use sign-and-magnitude, one's complement or two's complement binary values. The detailed bit pattern you get for a particular signed character value depends on the underlying representation, but the general conclusion that the result will be in the range 0x00 .. 0xFF is valid.

There is a simple fix for the concern: cast the value of c[i] to unsigned char before using it:

 if (!(((unsigned char)c[i] & 0xC0) & 0x80)) 

The value of c[i] is converted to unsigned char before it is promoted to int (or the compiler may promote to int, then coerce to unsigned char, then promote the unsigned char back to int), and the unsigned value is used in the & operations.

Of course, the code is now merely redundant. Using & 0xC0 followed by & 0x80 is entirely equivalent to just & 0x80.

If you are processing UTF-8 data and looking for continuation bytes, the correct test is:
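To see whether the cast changes the outcome on a given implementation, both forms can be compared exhaustively over every signed char value. This is only a sketch; the function names test_signed, test_unsigned and count_mismatches are invented for illustration.

```c
#include <limits.h>

/* The question's expression applied to the promoted signed value. */
static int test_signed(signed char c)
{
    return !((c & 0xC0) & 0x80);
}

/* The same expression with the cast to unsigned char applied first. */
static int test_unsigned(signed char c)
{
    return !(((unsigned char)c & 0xC0) & 0x80);
}

/* Counts the signed char values for which the two forms disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int v = SCHAR_MIN; v <= SCHAR_MAX; v++) {
        if (test_signed((signed char)v) != test_unsigned((signed char)v))
            mismatches++;
    }
    return mismatches;
}
```

On a two's complement implementation count_mismatches() should return 0, because the mask already discards the sign-extended bits. The cast version is the one whose result does not depend on the signed representation.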

 if (((unsigned char)c[i] & 0xC0) == 0x80) 
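Wrapped into a small function, the continuation-byte test might look like the sketch below. The names is_continuation and count_continuation are hypothetical, and the code assumes 8-bit bytes.

```c
#include <stddef.h>

/* Returns 1 if the byte has the form 10xxxxxx, i.e. it is a UTF-8
   continuation byte. The cast makes the test independent of how
   negative signed char values are represented. */
static int is_continuation(signed char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}

/* Counts the continuation bytes in the first n bytes of buf. */
static size_t count_continuation(const signed char *buf, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)is_continuation(buf[i]);
    return count;
}
```

For example, the two-byte sequence 0xC3 0xA9 (the UTF-8 encoding of "é") arrives in a signed char array as -61 and -87; only the second byte is a continuation byte.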

"Since I want to work with the original bit patterns from a file, how can I convert / cast a signed char to an unsigned char so that the bit patterns remain unchanged?"

As explained in the previous answer to your question on the same topic, any small integer type, whether it is signed or unsigned, will be promoted to type int whenever it is used in an expression.

C11 6.3.1.1

"If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions."

Also, as explained in the same answer, integer literals such as the ones used here are of type int.

Therefore, your expression boils down to the pseudo-code (int) & (int) & (int). The operations will be performed on three temporary int variables, and the result will be of type int.

Now, if the source data contains bits that can be interpreted as sign bits for the particular signedness representation (in practice, this will be two's complement on all systems), you will have problems, since these bits will be preserved when the signed char is promoted to int.

And then the bitwise AND operator performs an AND on every single bit of its integer operands, regardless of their contents (C11 6.5.10/3), whether signed or not. If you have data in the sign bits of the original signed char, it will be lost, since the integer literals (0xC0 or 0x80) have no bits set that correspond to the sign bits.

The solution is to prevent the sign bits from being carried over into the temporary int. One way is to cast c[i] to unsigned char, which is fully defined (C11 6.3.1.3). This tells the compiler that "the entire contents of this variable is an integer, there are no sign bits."
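The effect of the promotion can be made visible with a small sketch. The function name promoted_bits is hypothetical; it converts the promoted value to unsigned int only so that the high bits can be inspected.

```c
/* Hypothetical helper: returns the promoted value of c, converted to
   unsigned int so that the upper bits can be examined. */
static unsigned int promoted_bits(signed char c)
{
    int promoted = c;               /* integer promotion keeps the value */
    return (unsigned int)promoted;  /* well-defined modulo 2^N conversion */
}
```

With a 32-bit int, promoted_bits(-128) is 0xFFFFFF80: the upper 24 bits are copies of the sign. Masking that result with 0xC0u leaves 0x80, which is why the masks in the question discard everything above the low byte.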

Even better, get into the habit of always using unsigned data in all forms of bit manipulation. The purist, 100% safe, MISRA-C compliant way of rewriting your expression is this:

 if ( (((uint8_t)c[i] & 0xc0u) & 0x80u) > 0u)

The u suffix actually forces the expression to unsigned int, but it is good practice to always cast to the intended type. It tells the reader of the code, "I really know what I'm doing, and I also understand all the weird implicit promotion rules in C".

And then, if we know our hex, (0xc0 & 0x80) is pointless: it always evaluates to 0x80. So x & 0xC0 & 0x80 is always equal to x & 0x80. Therefore, simplify the expression:

 if ( ((uint8_t)c[i] & 0x80u) > 0u) 
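The simplification can be confirmed by brute force over all 256 byte values. This is a minimal sketch; the function name count_mask_differences is invented for illustration and uint8_t is assumed to be available.

```c
#include <stdint.h>

/* Counts the byte values for which masking with 0xC0 and then 0x80
   gives a different result from masking with 0x80 alone. */
static int count_mask_differences(void)
{
    int differences = 0;
    for (unsigned int x = 0; x <= 0xFFu; x++) {
        uint8_t byte = (uint8_t)x;
        if (((byte & 0xC0u) & 0x80u) != (byte & 0x80u))
            differences++;
    }
    return differences;
}
```

The function returns 0, since 0xC0 & 0x80 is 0x80 and bitwise AND is associative.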

"Is there a list of these implementation-defined aspects anywhere?"

Yes, the C standard conveniently lists them in Annex J.3. The only implementation-defined aspect you encounter in this case is the signedness representation of integers. In practice, this is always two's complement.

EDIT: The text quoted in the question concerns the fact that various bitwise operators produce implementation-defined results. Even in the annex this is only briefly mentioned as implementation-defined, without exact references. The actual chapter 6.5 does not say much about implementation-defined behavior of &, | and so on. The only operators for which it is stated explicitly are << and >>, where left-shifting a negative number is even undefined behavior, while right-shifting one is implementation-defined.
