You can write your own memory allocation procedures that allocate aligned data on the heap. You can specify your own alignment size (not only 16 bytes, but also 32 bytes, 64 bytes, etc.):

    procedure GetMemAligned(const bits: Integer; const src: Pointer;
      const SrcSize: Integer; out DstAligned, DstUnaligned: Pointer;
      out DstSize: Integer);
    var
      Bytes: NativeInt;
      i: NativeInt;
    begin
      if src <> nil then
      begin
        i := NativeInt(src);
        i := i shr bits;
        i := i shl bits;
        if i = NativeInt(src) then
        begin
          // the source is already aligned, nothing to do
          DstAligned := src;
          DstUnaligned := src;
          DstSize := SrcSize;
          Exit;
        end;
      end;
      Bytes := 1 shl bits;
      DstSize := SrcSize + Bytes;
      GetMem(DstUnaligned, DstSize);
      FillChar(DstUnaligned^, DstSize, 0);
      i := NativeInt(DstUnaligned) + Bytes;
      i := i shr bits;
      i := i shl bits;
      DstAligned := Pointer(i);
      if src <> nil then
        Move(src^, DstAligned^, SrcSize);
    end;

    procedure FreeMemAligned(const src: Pointer; var DstUnaligned: Pointer;
      var DstSize: Integer);
    begin
      if src <> DstUnaligned then
      begin
        if DstUnaligned <> nil then
          FreeMem(DstUnaligned, DstSize);
      end;
      DstUnaligned := nil;
      DstSize := 0;
    end;

Then work with pointers to the vectors, and write procedures that return the result through a third argument.
You can also use functions, but how they return such a result is less obvious.

    type
      PVector = ^TVector;
      TVector = packed array [1..4] of Single;

Then allocate these objects like this:

    const
      SizeAligned = SizeOf(TVector);
    var
      DataUnaligned, DataAligned: Pointer;
      SizeUnaligned: Integer;
      V1: PVector;
    begin
      GetMemAligned(4 {align by 4 bits, i.e. by 16 bytes}, nil, SizeAligned,
        DataAligned, DataUnaligned, SizeUnaligned);
      V1 := DataAligned;
      // now you can work with your vector via V1^ - it is aligned by 16 bytes and stays in the heap
      FreeMemAligned(nil, DataUnaligned, SizeUnaligned);
    end;

As you may have noticed, we passed nil to GetMemAligned and FreeMemAligned. This parameter is needed when we want to align existing data, for example data that we received as a function argument.
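To check that a pointer really sits on the boundary you asked for, a small test like the one below can help. This is only a sketch: `IsAligned` and `AlignUp` are hypothetical helpers, not part of the code above, and `AlignUp` rounds with a mask instead of the shr/shl pair (it leaves an already aligned address unchanged, whereas GetMemAligned always steps forward to the next boundary). It assumes a Delphi version that has `NativeUInt`.

```pascal
program AlignSketch;
{$APPTYPE CONSOLE}

// Hypothetical helpers for illustration; they are not part of the answer's code.

// True when P is a multiple of 2^Bits (Bits = 4 means a 16-byte boundary).
function IsAligned(const P: Pointer; const Bits: Integer): Boolean;
begin
  Result := (NativeUInt(P) and ((NativeUInt(1) shl Bits) - 1)) = 0;
end;

// Rounds Address up to the next multiple of 2^Bits, leaving it unchanged
// when it is already aligned.
function AlignUp(const Address: NativeUInt; const Bits: Integer): NativeUInt;
var
  Mask: NativeUInt;
begin
  Mask := (NativeUInt(1) shl Bits) - 1;
  Result := (Address + Mask) and not Mask;
end;

var
  Raw, Aligned: Pointer;
begin
  // Over-allocate by 15 bytes so that a 16-byte boundary fits inside the block.
  GetMem(Raw, 64 + 15);
  try
    Aligned := Pointer(AlignUp(NativeUInt(Raw), 4));
    // The rounded pointer lies on a 16-byte boundary, so this prints TRUE.
    Writeln(IsAligned(Aligned, 4));
  finally
    FreeMem(Raw);
  end;
end.
```

The original pointer (`Raw` here, DstUnaligned in the procedures above) is still the one that must be passed to FreeMem, never the rounded one.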
Just use the actual register names, not the parameter names, in your assembly routines. That way you will not mess anything up when using the register calling convention; otherwise you risk modifying registers without realizing that the parameter names you used are just aliases for those registers.
In Win64, with Microsoft's calling convention, the first parameter is always passed in RCX, the second in RDX, the third in R8, the fourth in R9, and the rest on the stack. A function returns its result in RAX. But if the function returns a structure ("record"), the result is not returned in RAX; it is written to an address passed as an implicit argument. Your function is allowed to modify the following registers: RAX, RCX, RDX, R8, R9, R10, R11. The rest must be preserved. See https://msdn.microsoft.com/en-us/library/ms235286.aspx for more details.
In Win32, with Delphi's register calling convention, the first parameter is passed in EAX, the second in EDX, the third in ECX, and the rest on the stack.
The following table shows the differences:

         64     32
         ---    ---
     1)  rcx    eax
     2)  rdx    edx
     3)  r8     ecx
     4)  r9     stack

So your function will look like this (32-bit):

    procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
    asm
      movaps xmm0, [eax]
      movaps xmm1, [edx]
      addps xmm0, xmm1
      movaps [ecx], xmm0
    end;

Under 64-bit:

    procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
    asm
      movaps xmm0, [rcx]
      movaps xmm1, [rdx]
      addps xmm0, xmm1
      movaps [r8], xmm0
    end;

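If one source file has to build for both targets, the two bodies can probably be combined with conditional compilation. The sketch below assumes the compiler defines `WIN64` for the 64-bit Windows target (as recent Delphi versions do) and that every other target is Win32 with the register convention. The test harness, including the manual 16-byte alignment of `Raw`, is illustrative and not taken from the code above.

```pascal
program Add4Demo;
{$APPTYPE CONSOLE}

type
  PVector = ^TVector;
  TVector = packed array [1..4] of Single;

procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
asm
{$IFDEF WIN64}
  // Win64: a in RCX, b in RDX, Result in R8
  movaps xmm0, [rcx]
  movaps xmm1, [rdx]
  addps  xmm0, xmm1
  movaps [r8], xmm0
{$ELSE}
  // Win32 register convention: a in EAX, b in EDX, Result in ECX
  movaps xmm0, [eax]
  movaps xmm1, [edx]
  addps  xmm0, xmm1
  movaps [ecx], xmm0
{$ENDIF}
end;

var
  Raw: Pointer;
  A, B, C: PVector;
  i: Integer;
begin
  // Room for three vectors plus 15 spare bytes for rounding up to 16.
  GetMem(Raw, 3 * SizeOf(TVector) + 15);
  try
    A := PVector((NativeUInt(Raw) + 15) and not NativeUInt(15));
    B := PVector(NativeUInt(A) + SizeOf(TVector));
    C := PVector(NativeUInt(B) + SizeOf(TVector));
    for i := 1 to 4 do
    begin
      A^[i] := i;
      B^[i] := 10 * i;
    end;
    add4(A^, B^, C^);
    // Each element of C^ is the sum of the matching elements of A^ and B^.
    for i := 1 to 4 do
      Writeln(C^[i]:0:1);
  finally
    FreeMem(Raw);
  end;
end.
```

movaps faults on an address that is not 16-byte aligned, which is why all three vectors are placed on aligned addresses before the call.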
By the way, according to Microsoft, floating-point arguments in the 64-bit calling convention are passed directly in the XMM registers: the first in XMM0, the second in XMM1, the third in XMM2 and the fourth in XMM3, and the rest on the stack, so you can pass them by value, not by reference.
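As a rough illustration of passing by value, the sketch below adds two Single arguments straight from the registers they arrive in. `AddSingle` is a hypothetical name, the `WIN64` conditional symbol is assumed to be defined by the compiler, and the code relies on the floating-point result being returned in XMM0 under the same convention. A plain Pascal fallback keeps the example buildable on other targets.

```pascal
program AddByValue;
{$APPTYPE CONSOLE}

// Hypothetical example routine, not part of the answer's code.
function AddSingle(a, b: Single): Single;
{$IFDEF WIN64}
asm
  // a arrives in XMM0 and b in XMM1; the Single result is returned in XMM0
  addss xmm0, xmm1
end;
{$ELSE}
begin
  // Fallback for targets where the Win64 convention does not apply
  Result := a + b;
end;
{$ENDIF}

begin
  Writeln(AddSingle(1.5, 2.25):0:2);
end.
```

Because the arguments are scalars passed by value, no alignment requirement applies here, unlike the movaps loads on TVector.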