EDIT partial solution below (EDIT 2), but I have one more question (see end)
I am trying to compile the following C program with gcc-4.9.2 , on Windows 7, 32 bits, running on a Pentium G3220 (according to the Windows System Information). If I understand correctly, this processor does not have AVX extensions, so it is natural that something happens, I'm just not sure what exactly. At first, I played with optimization using gcc, and I tried -mavx rather "accidentally".
The following program calculates the permutations of the numbers 0 ... n-1 (with n specified as an argument) in lexicographic order, as well as the rank of each permutation (its position in this sequential order) and "unrank" (restores the permutation from the rank) and verifies that they are all true. It should only print βOKβ or βErrorβ at the end.
- With
gcc -O3 program works correctly with all the integer inputs that I checked (1 <= n <= 11). - With
gcc -O3 -mavx it works correctly for 1 <= n <= 7 and for n> = 8, it does not print anything and actually does nothing (almost without delay before exiting). I do not receive a message from the program or from Windows (I would expect that an unknown instruction might fail, but this did not happen).
(On another computer with 64-bit versions of Windows 7 on the -5 i5 kernel and in the same gcc-4.9.2, the program seems to work fine without -mavx when it is compiled in either 32 or 64 bits )
I do not understand why it works correctly for some input values ββand does not work for others. Does anyone have a hint of this?
Here is a complete program followed by a shorter one with the same problem.
#include <stdlib.h>
EDIT
Following Brian Cain's comment, here is a shorter program with the same problem. I removed all input value checks, all things of rank / unrank, and I replaced malloc / free with an array of size 20 (only one now, since b and c are no longer used). Now the program only calculates permutations using the while(next_perm(n, a)); and does nothing with them. However, it should still print βOKβ because the q value does not change after the initial q = 0.
#include <stdlib.h>
EDIT 2: assembly exit explanation
I also add gcc disassembly output (in Intel syntax) found with gcc -O3 -mavx -S -masm=intel and gcc-4.9.2 (see link above for actual compiler binaries). However, it needs some work because, like gcc, it embeds the next_perm call, and it is less readable. I also remove the CFI directives and alignment, and virtually all other directives, to improve readability:
_next_perm: LFB0: push ebp push edi push esi push ebx mov ecx, DWORD PTR [esp+20] mov edx, DWORD PTR [esp+24] lea eax, [ecx-1] test eax, eax jle L12 mov edi, DWORD PTR [edx-4+ecx*4] cmp DWORD PTR [edx-8+ecx*4], edi mov ecx, eax jg L5 jmp L11 L28: mov esi, DWORD PTR [edx+ecx*4] cmp DWORD PTR [edx-4+ecx*4], esi jle L27 L5: sub ecx, 1 jne L28 L4: mov ebx, ecx L7: mov esi, DWORD PTR [edx+ebx*4] mov edi, DWORD PTR [edx+eax*4] mov DWORD PTR [edx+ebx*4], edi mov DWORD PTR [edx+eax*4], esi add ebx, 1 sub eax, 1 cmp ebx, eax jl L7 L2: xor eax, eax test ecx, ecx je L23 L11: sal ecx, 2 lea esi, [edx+ecx] lea ebp, [edx-4+ecx] mov ebx, DWORD PTR [esi] mov edi, DWORD PTR [ebp+0] cmp edi, ebx jle L9 lea eax, [edx+4+ecx] L10: mov esi, eax add eax, 4 mov ebx, DWORD PTR [eax-4] cmp ebx, edi jl L10 L9: mov DWORD PTR [ebp+0], ebx mov eax, 1 mov DWORD PTR [esi], edi L23: pop ebx pop esi pop edi pop ebp ret L27: cmp eax, ecx jg L4 jmp L11 L12: mov ecx, eax jmp L2
The build output is the same with or without -mavx, except for shortcut numbers: there is no AVX instruction, which means that the problem actually lies in main .
This can be verified by adding puts to the main text:
int main(int argc, char **argv) { int n, i, q = 0, a[20]; puts("X"); n = strtol(argv[1], NULL, 0); puts("Y"); for(i = 0; i < n; i++) a[i] = i; puts("Z"); while(next_perm(n, a)); puts(q?"Error":"OK"); return 0; }
Then the programs print only X and Y when they fail, so the problem comes from the AVX instructions used to construct the 'a' in the for loop between Y and Z.
Here is the result of assembling main , again without directives (LC2 points to "Y" and LC3 points to "Z"). The only AVX commands in the ouptut main assembly are between the two puts , and they are used for the for loop, which builds the original "a", that is, the array {0, 1, ..., n -1}. In fact, what happens is that the AVX instructions are used to create several "a" elements at a time (I suppose), and if the length of "a" is not a multiple of 4, then there is an extra step (between L4 and L9) before calling puts (" Z ") in L9, then while (next_perm (n, a)); on L3. Thus, the problem is very simple: if n is small enough, then the AVX loop does not actually start, and there is no error. Here, the maximum allowable n is 4, but it varies between different gcc scripts, it is a bit randomized (it seems I got 8 yesterday).
Labels LC0 and LC4 indicate two arrays of 4 elements that are used by the AVX instructions: LC0 - {0,1,2,3}, and LC4 - {4,4,4,4}. It is not surprising why they are here, even without a deep knowledge of AVX, it smells like a detailed cycle :-)
_main: push ebp mov ebp, esp push edi push esi push ebx and esp, -16 sub esp, 96 call ___main mov DWORD PTR [esp], OFFSET FLAT:LC1 call _puts mov eax, DWORD PTR [ebp+12] mov DWORD PTR [esp+8], 0 mov DWORD PTR [esp+4], 0 mov eax, DWORD PTR [eax+4] mov DWORD PTR [esp], eax call _strtol mov DWORD PTR [esp], OFFSET FLAT:LC2 mov ebx, eax call _puts test ebx, ebx jle L17 lea edx, [ebx-4] lea ecx, [ebx-1] shr edx, 2 add edx, 1 cmp ecx, 3 lea eax, [0+edx*4] jbe L10 vmovdqa xmm1, XMMWORD PTR LC4 lea esi, [esp+16] xor ecx, ecx vmovdqa xmm0, XMMWORD PTR LC0 L5: mov edi, ecx add ecx, 1 sal edi, 4 cmp edx, ecx vmovaps XMMWORD PTR [esi+edi], xmm0 vpaddd xmm0, xmm0, xmm1 ja L5 cmp ebx, eax je L9 L4: lea edx, [eax+1] mov DWORD PTR [esp+16+eax*4], eax cmp ebx, edx jle L9 mov DWORD PTR [esp+16+edx*4], edx lea edx, [eax+2] cmp ebx, edx jle L9 add eax, 3 mov DWORD PTR [esp+16+edx*4], edx cmp ebx, eax jle L9 mov DWORD PTR [esp+16+eax*4], eax L9: mov DWORD PTR [esp], OFFSET FLAT:LC3 call _puts L3: mov DWORD PTR [esp+4], esi mov DWORD PTR [esp], ebx call _next_perm test eax, eax jne L3 mov DWORD PTR [esp], OFFSET FLAT:LC5 call _puts lea esp, [ebp-12] xor eax, eax pop ebx pop esi pop edi pop ebp ret L10: xor eax, eax lea esi, [esp+16] jmp L4 L17: lea esi, [esp+16] jmp L9
Now I understand what is actually happening, but one question remains: why is there no error message when the program tries to run the AVX instruction? He just leaves, or he is killed, but without any hint that something went wrong.