How to use SSE instructions with aligned data in Delphi XE3?

I tried to run the following:

    type
      Vector = array [1..4] of Single;

    {$CODEALIGN 16}
    function add4(const a, b: Vector): Vector; register; assembler;
    asm
      movaps xmm0, [a]
      movaps xmm1, [b]
      addps xmm0, xmm1
      movaps [@result], xmm0
    end;

It raises an access violation on movaps. As far as I know, movaps can only be used if the memory operand is 16-byte aligned. The problem does not occur with movups (which does not require alignment).
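As a point of comparison, here is a minimal sketch of the movups variant, assuming a Win64 target. The names TVector and Add4Unaligned are illustrative and not part of the original code, and the routine uses the registers directly (RCX, RDX, R8 for the three by-reference parameters) rather than the parameter names.

```pascal
program MovupsDemo;
{$APPTYPE CONSOLE}

type
  TVector = array [1..4] of Single;

{$IFDEF CPUX64}
// Hypothetical helper. Win64: a, b and r are passed by reference
// in RCX, RDX and R8 respectively.
procedure Add4Unaligned(const a, b: TVector; out r: TVector);
asm
  movups xmm0, [rcx]   // unaligned load of a
  movups xmm1, [rdx]   // unaligned load of b
  addps  xmm0, xmm1    // four single-precision additions
  movups [r8], xmm0    // unaligned store to r
end;
{$ENDIF}

var
  v1, v2, sum: TVector;
  i: Integer;
begin
{$IFDEF CPUX64}
  for i := 1 to 4 do
  begin
    v1[i] := i;
    v2[i] := 10 * i;
  end;
  Add4Unaligned(v1, v2, sum);
  for i := 1 to 4 do
    Writeln(sum[i]:0:1);
{$ELSE}
  Writeln('This sketch targets Win64 only.');
{$ENDIF}
end.
```

Because movups accepts any address, this version does not depend on where the compiler places v1, v2 and sum.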

So my question is how to align the data: in Delphi XE3, {$CODEALIGN} does not seem to help in this case.

EDIT

Very strange ... I tried the following.

    program Project3;
    {$APPTYPE CONSOLE}

    uses
      windows; // if not using windows, no errors at all

    type
      Vector = array [1..4] of Single;

    function add4(const a, b: Vector): Vector;
    asm
      movaps xmm0, [a]
      movaps xmm1, [b]
      addps xmm0, xmm1
      movaps [@result], xmm0
    end;

    procedure test();
    var
      v1, v2: vector;
    begin
      v1[1] := 1;
      v2[1] := 1;
      v1 := add4(v1,v2); // this works
    end;

    var
      a, b, c: Vector;
    begin
      {$ifndef cpux64}
      {$MESSAGE FATAL 'this example is for x64 target only'}
      {$else}
      test();
      c := add4(a, b); // raises an AV here
      {$endif}
    end.

If "uses windows" is not added, everything works. If the windows unit is used, an exception is raised at c := add4(a, b), but not in test().

Can anyone explain this?

EDIT: Now it all makes sense to me. My conclusions for Delphi XE3 64-bit are:

  • x64 stack frames are aligned to 16 bytes (as required); {$CODEALIGN 16} aligns the code of procedures and functions to 16 bytes, not their data.
  • A dynamic array lives on the heap, which can be configured for 16-byte alignment using SetMinimumBlockAlignment(mba16Byte).
  • However, stack variables are not always aligned to 16 bytes. For example, if you declare an Integer variable before v1, v2 in test() in the example above, the example will not work.
+7
3 answers

You need your data to be 16-byte aligned. This requires some care and attention. You can make sure that the heap allocator aligns to 16 bytes. But you cannot make the compiler align your stack-allocated variables to 16 bytes, because your array has an alignment of 4, the size of its elements. Any variables declared inside other structures will also have 4-byte alignment. This is a difficult hurdle to clear.

I do not think that you can solve your problem with the currently available versions of the compiler, at least not unless you give up stack-allocated variables, which I suspect would be too bitter a pill to swallow. You may have more luck with an external assembler.
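To see where the compiler actually places stack variables in a given build, a small check like the following can help. It is only a sketch: IsAligned16 and Test are hypothetical names, and the reported layout will depend on the compiler version, target and surrounding declarations.

```pascal
program AlignCheck;
{$APPTYPE CONSOLE}

type
  TVector = array [1..4] of Single;

// Hypothetical helper: True when the address p is a multiple of 16.
function IsAligned16(p: Pointer): Boolean;
begin
  Result := (NativeUInt(p) and 15) = 0;
end;

procedure Test;
var
  n: Integer;       // an extra local that may shift the layout of v1 and v2
  v1, v2: TVector;
begin
  n := 0;
  v1[1] := 1;
  v2[1] := 1;
  Writeln('n address mod 16: ', NativeUInt(@n) and 15);
  Writeln('v1 aligned: ', IsAligned16(@v1));
  Writeln('v2 aligned: ', IsAligned16(@v2));
end;

begin
  Test;
end.
```

If either vector reports FALSE, a movaps on that variable would be expected to fault, while movups would still work.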

+2

Use this to make the built-in memory manager return 16-byte aligned blocks:

 SetMinimumBlockAlignment(mba16Byte); 
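A minimal sketch of how this call might be combined with a heap-allocated vector, assuming the default memory manager is in use. The TVector and PVector declarations are illustrative, and the program only prints the alignment it observes rather than assuming it.

```pascal
program HeapAlign;
{$APPTYPE CONSOLE}

type
  TVector = array [1..4] of Single;
  PVector = ^TVector;

var
  p: PVector;
begin
  // Ask the built-in memory manager for 16-byte aligned blocks.
  SetMinimumBlockAlignment(mba16Byte);

  New(p);
  try
    // A remainder of 0 means the block is usable with movaps.
    Writeln('address mod 16 = ', NativeUInt(p) and 15);
  finally
    Dispose(p);
  end;
end.
```

This only affects heap allocations; local variables on the stack are placed by the compiler and are not influenced by this setting.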

Also, as far as I know, "register" and "assembler" are redundant directives, so you can omit them from your code.

-

Edit: you mentioned that this is for x64. I just tried the following on Delphi XE2 compiled for x64 and it works here.

    program Project3;

    type
      Vector = array [1..4] of Single;

    function add4(const a, b: Vector): Vector;
    asm
      movaps xmm0, [a]
      movaps xmm1, [b]
      addps xmm0, xmm1
      movaps [@result], xmm0
    end;

    procedure f();
    var
      v1,v2 : vector;
    begin
      v1[1] := 1;
      v2[1] := 1;
      v1 := add4(v1,v2);
    end;

    begin
      {$ifndef cpux64}
      {$MESSAGE FATAL 'this example is for x64 target only'}
      {$else}
      f();
      {$endif}
    end.
+1

You can write your own memory allocation procedures that allocate aligned data on the heap. You can specify your own alignment size (not only 16 bytes, but also 32 bytes, 64 bytes, etc.):

    procedure GetMemAligned(const bits: Integer; const src: Pointer;
      const SrcSize: Integer; out DstAligned, DstUnaligned: Pointer;
      out DstSize: Integer);
    var
      Bytes: NativeInt;
      i: NativeInt;
    begin
      if src <> nil then
      begin
        i := NativeInt(src);
        i := i shr bits;
        i := i shl bits;
        if i = NativeInt(src) then
        begin
          // the source is already aligned, nothing to do
          DstAligned := src;
          DstUnaligned := src;
          DstSize := SrcSize;
          Exit;
        end;
      end;
      Bytes := 1 shl bits;
      DstSize := SrcSize + Bytes;
      GetMem(DstUnaligned, DstSize);
      FillChar(DstUnaligned^, DstSize, 0);
      i := NativeInt(DstUnaligned) + Bytes;
      i := i shr bits;
      i := i shl bits;
      DstAligned := Pointer(i);
      if src <> nil then
        Move(src^, DstAligned^, SrcSize);
    end;

    procedure FreeMemAligned(const src: Pointer; var DstUnaligned: Pointer;
      var DstSize: Integer);
    begin
      if src <> DstUnaligned then
      begin
        if DstUnaligned <> nil then
          FreeMem(DstUnaligned, DstSize);
      end;
      DstUnaligned := nil;
      DstSize := 0;
    end;

Then use pointers, and procedures that take a third argument to return the result.

You can also use functions, but this is not so obvious.

    type
      PVector = ^TVector;
      TVector = packed array [1..4] of Single;

Then allocate these objects like this:

    const
      SizeAligned = SizeOf(TVector);
    var
      DataUnaligned, DataAligned: Pointer;
      SizeUnaligned: Integer;
      V1: PVector;
    begin
      GetMemAligned(4 {align by 4 bits, ie by 16 bytes}, nil, SizeAligned,
        DataAligned, DataUnaligned, SizeUnaligned);
      V1 := DataAligned;
      // now you can work with your vector via V1^ - it is aligned by 16 bytes and stays in the heap
      FreeMemAligned(nil, DataUnaligned, SizeUnaligned);
    end;

As you may have noticed, we passed nil to GetMemAligned and FreeMemAligned. This parameter is needed when we want to align existing data, for example data that we received as a function argument.
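The core of any such allocator is rounding an address to a multiple of the alignment. The sketch below isolates that step using a mask instead of the shift pair; AlignUp is a hypothetical helper, not part of the routines above, and it assumes the alignment is a power of two.

```pascal
program AlignUpDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: rounds addr up to the next multiple of
// alignment, which must be a power of two.
function AlignUp(addr, alignment: NativeUInt): NativeUInt;
begin
  Result := (addr + alignment - 1) and not (alignment - 1);
end;

var
  raw: Pointer;
  aligned: Pointer;
begin
  // Over-allocate by 15 bytes so a 16-byte aligned address
  // with room for 16 bytes always fits inside the block.
  GetMem(raw, 16 + 15);
  try
    aligned := Pointer(AlignUp(NativeUInt(raw), 16));
    Writeln('aligned mod 16 = ', NativeUInt(aligned) and 15);
  finally
    FreeMem(raw);   // always free the original, unaligned pointer
  end;
end.
```

As with GetMemAligned, the unaligned pointer has to be kept around, because that is the one the memory manager knows about.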

Just use register names directly, rather than parameter names, in assembly routines. That way you will not run into trouble when using the register calling convention; otherwise you risk modifying a register without realizing that the parameter names you used are just aliases for registers.

Under Win64, with Microsoft's calling convention, the first integer or pointer parameter is always passed in RCX, the second in RDX, the third in R8, the fourth in R9, and the rest on the stack. A function returns its result in RAX. But if the function returns a structure ("record"), the result is not returned in RAX but through an implicit argument, by address. The following registers may be modified by your function: RAX, RCX, RDX, R8, R9, R10, R11. The rest must be preserved. See https://msdn.microsoft.com/en-us/library/ms235286.aspx for more details.

Under Win32, with the Delphi register calling convention, the first parameter is passed in EAX, the second in EDX, the third in ECX, and the rest on the stack.

The following table shows the differences:

          64     32
          ---    ---
      1)  rcx    eax
      2)  rdx    edx
      3)  r8     ecx
      4)  r9     stack

So your function will look like this (32-bit):

    procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
    asm
      movaps xmm0, [eax]
      movaps xmm1, [edx]
      addps xmm0, xmm1
      movaps [ecx], xmm0
    end;

Under 64-bit:

    procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
    asm
      movaps xmm0, [rcx]
      movaps xmm1, [rdx]
      addps xmm0, xmm1
      movaps [r8], xmm0
    end;

By the way, according to Microsoft, floating-point arguments in the 64-bit calling convention are passed directly in XMM registers: the first in XMM0, the second in XMM1, the third in XMM2, the fourth in XMM3, and the rest on the stack. So you can pass them by value rather than by reference.
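For scalar values this can be sketched as follows, assuming a Win64 target. AddSingles is a hypothetical function used only for illustration; it relies on the two Single arguments arriving in XMM0 and XMM1 and on a Single result being returned in XMM0.

```pascal
program XmmArgs;
{$APPTYPE CONSOLE}

{$IFDEF CPUX64}
// Hypothetical example. Win64: a arrives in XMM0, b in XMM1,
// and the Single result is returned in XMM0.
function AddSingles(a, b: Single): Single;
asm
  addss xmm0, xmm1   // scalar single-precision add, result stays in XMM0
end;
{$ENDIF}

begin
{$IFDEF CPUX64}
  Writeln(AddSingles(1.5, 2.25):0:2);
{$ELSE}
  Writeln('This sketch targets Win64 only.');
{$ENDIF}
end.
```

No memory operand is involved here, so alignment is not a concern. A four-element array is still passed by reference, which is why the vector routines above dereference RCX, RDX and R8.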

0