The simplest solution is always to code it in Pascal first and look at the generated assembler.
Speed-wise, assembler usually only has an advantage in tight loops; in general code the improvement is small, if there is any at all. There is only one piece of assembler in my own code, and there the gain comes from recoding a floating-point vector operation as fixed-point SSE. The saturation that the SIMD instruction sets provide is an added bonus.
Worse, a lot of the poorly informed assembler code floating around the Internet is actually slower than the Pascal equivalent on modern processors, because processor tradeoffs have changed over time.
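To make the saturation remark above concrete, here is a minimal scalar sketch of what a saturating unsigned byte add does in each lane (the behaviour of an instruction such as PADDUSB). The function name SatAddU8 is hypothetical and only for illustration; it is not the SSE routine from my code.

```pascal
program SaturationDemo;
{$mode objfpc}

// Hypothetical helper: scalar model of one lane of an unsigned
// saturating byte add. Results above 255 are clamped, not wrapped.
function SatAddU8(A, B: Byte): Byte;
var
  Sum: Integer;
begin
  Sum := Integer(A) + Integer(B);   // widen first so the sum cannot wrap
  if Sum > 255 then
    Sum := 255;                     // clamp to the largest Byte value
  Result := Byte(Sum);
end;

begin
  WriteLn(SatAddU8(100, 100));  // 200, fits in a byte
  WriteLn(SatAddU8(200, 100));  // 255, clamped instead of wrapping
end.
```

In plain Pascal the clamp costs a compare and a branch per element, while the SIMD instruction does it for a whole register of lanes at once, which is why it counts as a bonus.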
Update:
Then simply load the class property into a local variable in the prologue of your procedure before entering the assembler loop, or move the assembler to a separate procedure. Choose your battles.
Studying the RTL/VCL sources can also give you ideas on how to access certain constructs.
Btw, not all low-level optimization is done in assembler. Much can be done at the Pascal level with some knowledge of pointers, and sometimes the same kind of cache optimization can be done at the Pascal level too (see, for example, Cache optimization of rotating bitmaps).
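A minimal sketch of the local-copy idea, assuming a hypothetical class TScaler with a Factor property. The loop is plain Pascal standing in for the assembler loop; the point is only that the property is read once in the prologue, so the hot loop touches nothing but a local variable.

```pascal
program LocalCopyDemo;
{$mode objfpc}

type
  // Hypothetical class, used only for illustration.
  TScaler = class
  private
    FFactor: Integer;
  public
    property Factor: Integer read FFactor write FFactor;
    procedure Scale(var Data: array of Integer);
  end;

procedure TScaler.Scale(var Data: array of Integer);
var
  LocalFactor: Integer;
  i: Integer;
begin
  // Prologue: read the property once into a local variable.
  LocalFactor := Factor;
  // Stand-in for the assembler loop: it only references the local,
  // never the class instance.
  for i := 0 to High(Data) do
    Data[i] := Data[i] * LocalFactor;
end;

var
  S: TScaler;
  A: array[0..2] of Integer = (1, 2, 3);
begin
  S := TScaler.Create;
  try
    S.Factor := 4;
    S.Scale(A);
    WriteLn(A[0], ' ', A[1], ' ', A[2]);  // prints 4 8 12
  finally
    S.Free;
  end;
end.
```

A local variable is easy to reference from an asm block, whereas a property may hide a getter method or an offset relative to Self.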
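As a small illustration of Pascal-level pointer work, the sketch below walks a buffer with a PByte instead of indexing an array. The function name SumBytes is hypothetical, and whether this beats indexed access depends on the compiler and its settings, so check the generated assembler before relying on it.

```pascal
program PointerWalkDemo;
{$mode objfpc}

// Hypothetical helper: sums Count bytes starting at P by advancing
// the pointer, so no index calculation is needed per element.
function SumBytes(P: PByte; Count: Integer): Cardinal;
begin
  Result := 0;
  while Count > 0 do
  begin
    Inc(Result, P^);  // add the byte the pointer refers to
    Inc(P);           // step to the next byte
    Dec(Count);
  end;
end;

var
  Buf: array[0..3] of Byte = (10, 20, 30, 40);
begin
  WriteLn(SumBytes(@Buf[0], Length(Buf)));  // prints 100
end.
```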
Marco van de Voort