Byte Array Bit Offsets - CMAC Subsections

I need to implement as fast as possible the left shift bit of a 16-byte array in JavaCard .

I tried this code:

private static final void rotateLeft(final byte[] output, final byte[] input) { short carry = 0; short i = (short) 16; do { --i; carry = (short)((input[i] << 1) | carry); output[i] = (byte)carry; carry = (short)((carry >> 8) & 1); } while (i > 0); } 

Any ideas on how to improve performance? I was thinking about some of the magic Util.getShort(...) and Util.setShort(...) , but I was not able to get it to work faster than the implementation above.

This is one part of CMAC subkeys calculations, and it is done quite often, unfortunately. If you know a faster way to calculate CMAC connections (both subsections in one loop or something like that), let me know.

+7
performance bit-manipulation bit-shift cryptography javacard
source share
3 answers

When it comes to speed, the known length, hard-coded version is the fastest (but ugly). If you need to shift more than one bit, be sure to update the code accordingly.

 output[0] = (byte)((byte)(input[0] << 1) | (byte)((input[1] >> 7) & 1)); output[1] = (byte)((byte)(input[1] << 1) | (byte)((input[2] >> 7) & 1)); output[2] = (byte)((byte)(input[2] << 1) | (byte)((input[3] >> 7) & 1)); output[3] = (byte)((byte)(input[3] << 1) | (byte)((input[4] >> 7) & 1)); output[4] = (byte)((byte)(input[4] << 1) | (byte)((input[5] >> 7) & 1)); output[5] = (byte)((byte)(input[5] << 1) | (byte)((input[6] >> 7) & 1)); output[6] = (byte)((byte)(input[6] << 1) | (byte)((input[7] >> 7) & 1)); output[7] = (byte)((byte)(input[7] << 1) | (byte)((input[8] >> 7) & 1)); output[8] = (byte)((byte)(input[8] << 1) | (byte)((input[9] >> 7) & 1)); output[9] = (byte)((byte)(input[9] << 1) | (byte)((input[10] >> 7) & 1)); output[10] = (byte)((byte)(input[10] << 1) | (byte)((input[11] >> 7) & 1)); output[11] = (byte)((byte)(input[11] << 1) | (byte)((input[12] >> 7) & 1)); output[12] = (byte)((byte)(input[12] << 1) | (byte)((input[13] >> 7) & 1)); output[13] = (byte)((byte)(input[13] << 1) | (byte)((input[14] >> 7) & 1)); output[14] = (byte)((byte)(input[14] << 1) | (byte)((input[15] >> 7) & 1)); output[15] = (byte)(input[15] << 1); 

And use an array of RAM bytes!

+4
source share

This can help cache CMAC subkeys when reusing the same key (i.e. the same DESFire EV1 session key). For this key, the subsections are always the same.

I think David's answer could be even faster if he used two local variables to cache values ​​that are read twice from the same offset of the input array (from my observations on JCOP, accessing the array is quite expensive even for transient arrays).

EDIT: I can provide the following implementation, which does a 4-bit shift to the right using a short one (a 32-bit int for the cards supporting it will be even faster):

 short pom=0; // X000 to be stored next short pom2; // loaded value short pom3; // 0XXX to be stored next short curOffset=PARAMS_TRACK2_OFFSET; while(curOffset<16) { pom2=Util.getShort(mem_PARAMS, curOffset); pom3=(short)(pom2>>>4); curOffset=Util.setShort(mem_RAM, curOffset, (short)(pom|pom3)); pom=(short)(pom2<<12); } 

Beware, this code assumes the same bias in the source and destination.

You can expand this loop and use constant parameters if necessary.

+2
source share

This is the fastest algorithm for rotating an arbitrary number of bits that I could come up with (I rotate an array of 8 bytes, you can easily convert it to transfer 16):

Use EEPROM to create a mask table for your shifts. The mask only increases the amount of 1 s to the right:

 final static byte[] ROTL_MASK = { (byte) 0x00, //shift 0: 00000000 //this one is never used, we don't do shift 0. (byte) 0x01, //shift 1: 00000001 (byte) 0x03, //shift 2: 00000011 (byte) 0x07, //shift 3: 00000111 (byte) 0x0F, //shift 4: 00001111 (byte) 0x1F, //shift 5: 00011111 (byte) 0x3F, //shift 6: 00111111 (byte) 0x7F //shift 7: 01111111 }; 

Then you first use Util.arrayCopyNonAtomic to quickly replace bytes if the shift is greater than 8:

 final static byte BITS = 8; //swap whole bytes: Util.arrayCopyNonAtomic(in, (short) (shift/BITS), out, (short) 0, (short) (8-(shift/BITS))); Util.arrayCopyNonAtomic(in, (short) 0, out, (short) (8-(shift/BITS)), (short) (shift/BITS)); shift %= BITS; //now we need to shift only up to 8 remaining bits if (shift > 0) { //apply masks byte mask = ROTL_MASK[shift]; byte comp = (byte) (8 - shift); //rotate using masks out[8] = in[0]; // out[8] is any auxiliary variable, careful with bounds! out[0] = (byte)((byte)(in[0] << shift) | (byte)((in[1] >> comp) & mask)); out[1] = (byte)((byte)(in[1] << shift) | (byte)((in[2] >> comp) & mask)); out[2] = (byte)((byte)(in[2] << shift) | (byte)((in[3] >> comp) & mask)); out[3] = (byte)((byte)(in[3] << shift) | (byte)((in[4] >> comp) & mask)); out[4] = (byte)((byte)(in[4] << shift) | (byte)((in[5] >> comp) & mask)); out[5] = (byte)((byte)(in[5] << shift) | (byte)((in[6] >> comp) & mask)); out[6] = (byte)((byte)(in[6] << shift) | (byte)((in[7] >> comp) & mask)); out[7] = (byte)((byte)(in[7] << shift) | (byte)((in[8] >> comp) & mask)); } 

You can optionally remove the mask variable and use a direct link to the table instead.

Using this rather than naive bitwise rotation implementation turned out to be about 450-500% faster.

+1
source share

All Articles