Is there a performance penalty for one-based indexing of arrays?

Is there a performance penalty (albeit a small one) in Julia's use of one-based array indexing, given that machine code usually supports zero-based indexing more directly?


Most likely, Julia simply subtracts 1 from the indices you give it and uses zero-based arrays under the hood. The runtime penalty is therefore the cost of that subtraction (almost certainly negligible).

It would be simple enough to write two small bits of code to test the performance of each approach.
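For example, here is a crude timing sketch along those lines. The `ZeroBased` wrapper and all names are mine, purely for illustration; it simulates zero-based access on top of an ordinary Julia `Vector`:

```julia
# Sketch: compare summing via standard 1-based indexing against a
# hand-rolled zero-based wrapper over the same data.
struct ZeroBased{T}
    data::Vector{T}
end
# Translate a 0-based index to Julia's 1-based indexing.
Base.getindex(z::ZeroBased, i::Int) = z.data[i + 1]

function sum1(a::Vector{Float64})
    s = 0.0
    for i in 1:length(a)
        s += a[i]          # 1-based access
    end
    s
end

function sum0(z::ZeroBased{Float64})
    s = 0.0
    for i in 0:length(z.data)-1
        s += z[i]          # 0-based access (adds 1 internally)
    end
    s
end

v = rand(10^7)
z = ZeroBased(v)
sum1(v); sum0(z)           # warm up the JIT before timing
t1 = @elapsed sum1(v)
t0 = @elapsed sum0(z)
println("1-based: $t1 s, 0-based: $t0 s")
```

On my reading of the generated code, the two timings should land within measurement noise of each other; `@elapsed` is crude, so repeat the runs or use a benchmarking package for anything serious.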


I dug into this a bit, and here is what I found (I used Julia 0.6 for all the experiments below):

```julia
> arr = zeros(5);
> @code_llvm arr[1]

define double @jlsys_getindex_51990(i8** dereferenceable(40), i64) #0 !dbg !5 {
top:
  %2 = add i64 %1, -1
  %3 = bitcast i8** %0 to double**
  %4 = load double*, double** %3, align 8
  %5 = getelementptr double, double* %4, i64 %2
  %6 = load double, double* %5, align 8
  ret double %6
}
```

In this snippet, %1 holds the actual index. Note the line %2 = add i64 %1, -1: Julia really does use 0-based arrays under the hood and subtracts 1 from the index. This produces one extra LLVM instruction, so the LLVM code looks slightly less efficient. However, how this extra arithmetic operation filters down into native code is another question.

On x86 and amd64

```julia
> @code_native arr[1]
        .text
Filename: array.jl
Source line: 520
        leaq    -1(%rsi), %rax
        cmpq    24(%rdi), %rax
        jae     L20
        movq    (%rdi), %rax
        movsd   -8(%rax,%rsi,8), %xmm0   # xmm0 = mem[0],zero
        retq
L20:
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rsp, %rcx
        leaq    -16(%rcx), %rax
        movq    %rax, %rsp
        movq    %rsi, -16(%rcx)
        movl    $1, %edx
        movq    %rax, %rsi
        callq   0xffffffffffcbf392
        nopw    %cs:(%rax,%rax)
```

The good news on these architectures is that their addressing modes support an arbitrary constant displacement. movsd -8(%rax,%rsi,8), %xmm0 and leaq -1(%rsi), %rax are the two instructions affected by Julia's 1-based indexing. Look at the movsd: this single instruction performs both the actual indexing and the subtraction. The -8 displacement is the subtraction (one 8-byte Float64 element). With 0-based indexing, the instruction would simply be movsd (%rax,%rsi,8), %xmm0.

The other affected instruction is leaq -1(%rsi), %rax, which computes the decremented index for the bounds check. However, since the value of %rsi has to be copied into another register for the comparison anyway, a similar instruction would still be generated with 0-based indexing; it would probably just look like leaq (%rsi), %rax.

So, on x86 and amd64 machines, 1-based indexing just means using slightly more complex variants of the same instructions; no additional instructions are generated. Most likely, the code runs exactly as fast as it would with 0-based indexing. If any slowdown exists at all, it would be tied to a specific microarchitecture, present on one CPU model and absent on another. It varies from one piece of silicon to the next, and I would not worry about it.

Unfortunately, I do not know enough about ARM and other architectures, but the situation is probably similar.

Interaction with other languages

When interoperating with another language, such as C or Python, you always have to remember to subtract or add 1 when passing indices across the boundary. The compiler cannot help you here, because the foreign code is not visible to it. So in this case you do pay for one extra arithmetic operation per index. Unless that sits in a very tight loop, the difference is insignificant.
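As a concrete illustration (the variable names are mine, not from the post), here is how a 1-based Julia index is turned into the 0-based byte offset that C-style pointer code expects, on a recent Julia:

```julia
# Sketch: converting a 1-based Julia index to the 0-based offset used by
# C-style pointer arithmetic. Arithmetic on Ptr is in bytes, so the
# element offset is scaled by the element size.
v = [10.0, 20.0, 30.0]
i = 2                                   # 1-based Julia index

x = GC.@preserve v begin                # keep `v` alive while using its pointer
    p = pointer(v)
    unsafe_load(p + (i - 1) * sizeof(Float64))  # 0-based byte offset
end
@assert x == v[i]                       # both address the same element
```

The `(i - 1)` is exactly the one extra arithmetic operation discussed above.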

The elephant in the room

Well, the elephant in the room is bounds checking. Going back to the assembly snippet above, most of the generated code is devoted to it: the first three instructions and everything under the label L20. The actual indexing is just the movq and the movsd. So if you care about really fast code, bounds checking costs you far more than 1-based indexing does. Fortunately, Julia offers ways to mitigate this with @inbounds and --check-bounds=no.
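A minimal sketch of the @inbounds escape hatch (the function name is mine):

```julia
# Sketch: eliding bounds checks with @inbounds. Only do this when the
# loop provably stays in bounds -- once the check is removed, an
# out-of-range access is undefined behaviour.
function sum_inbounds(a::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(a)
        s += a[i]
    end
    s
end

sum_inbounds(collect(1.0:5.0))  # 15.0
```

Inspecting such a function with @code_native should show the cmpq/jae pair and the error branch under L20 disappear, leaving essentially just the movq and movsd.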


Source: https://habr.com/ru/post/1211444/

