Unknown GCC compilation error for ARM NEON (Critical)

I have an ARM NEON Cortex-A8. I optimized my code using NEON. But when I compile my code, I get this strange error. I don’t know how to fix it.

I am trying to compile the following code (PART 1) using Code Sourcery (PART2) on my host. And I get this strange error (PART 3). Am I something wrong here? Can anyone else compile this and see if they also get the same compilation error?

The strange part in the code, if I comment on the else if(step_size == 4) code part of the code, then the error will disappear. But, unfortunately, my optimization is not complete, so I must have it.

At first, I thought it was a problem with the CodeSourcey compiler (on my host), so I compiled the program directly in my target program (my target runs on Ubuntu). I used gcc there and again, I get the same error, and when I comment on the else if(step_size == 4) part else if(step_size == 4) , then the error disappears.

Help!


PART 1

 #include<stdio.h> #include"arm_neon.h" #define IMAGE_HEIGHT 480 #define IMAGE_WIDTH 640 float32_t integral_image[IMAGE_HEIGHT][IMAGE_WIDTH]; float32x4_t box_area_compute3(int, int , int , int , unsigned int , float); inline int min(int, int); int main() { box_area_compute3(1, 1, 4, 4, 2, 0); return 0; } float32x4_t box_area_compute3(int row, int col, int num_rows, int num_cols, unsigned int step_size, float three) { unsigned int height = IMAGE_HEIGHT; unsigned int width = IMAGE_WIDTH; int temp_row = row + num_rows; int temp_col = col + num_cols; int r1 = (min(row, height))- 1 ; int r2 = (min(temp_row, height)) - 1; int c1 = (min(col, width)) - 1; int c2 = (min(temp_col, width)) - 1; float32x4_t v128_areas; if(step_size == 2) { float32x4x2_t top_left, top_right, bottom_left, bottom_right; top_left = vld2q_f32((float32_t *)integral_image[r1] + c1); top_right = vld2q_f32((float32_t *)integral_image[r1] + c2); bottom_left = vld2q_f32((float32_t *)integral_image[r2] + c1); bottom_right = vld2q_f32((float32_t *)integral_image[r2] + c2); v128_areas = vsubq_f32(vsubq_f32(vaddq_f32(top_left.val[0], bottom_right.val[0]), top_right.val[0]), bottom_left.val[0]); } else if(step_size == 4) { float32x4x4_t top_left, top_right, bottom_left, bottom_right; top_left = vld4q_f32((float32_t *)integral_image[r1] + c1); top_right = vld4q_f32((float32_t *)integral_image[r1] + c2); bottom_left = vld4q_f32((float32_t *)integral_image[r2] + c1); bottom_right = vld4q_f32((float32_t *)integral_image[r2] + c2); v128_areas = vsubq_f32(vsubq_f32(vaddq_f32(top_left.val[0], bottom_right.val[0]), top_right.val[0]), bottom_left.val[0]); } if(three == 3.0) v128_areas = vmulq_n_f32(v128_areas, three); return v128_areas; } inline int min(int X, int Y) { return (X < Y ? X : Y); } 

PART 2

 arm-none-linux-gnueabi-gcc -O0 -g3 -Wall -c -fmessage-length=0 -fcommon -MMD -MP -MF"main.d" -MT"main.d" -mcpu=cortex-a8 -marm -mfloat-abi=hard -mfpu=neon-vfpv4 -o"main.o" "../main.c" 

PART 3

 ../main.c: In function 'box_area_compute3': ../main.c:65: error: unable to find a register to spill in class 'GENERAL_REGS' ../main.c:65: error: this is the insn: (insn 226 225 227 5 c:\program files\codesourcery\sourcery g++\bin\../lib/gcc/arm-none-linux-gnueabi/4.4.1/include/arm_neon.h:9863 (parallel [ (set (reg:XI 148 [ D.17028 ]) (unspec:XI [ (mem:XI (reg:SI 3 r3 [301]) [0 S64 A64]) (reg:XI 148 [ D.17028 ]) (unspec:V4SF [ (const_int 0 [0x0]) ] 191) ] 111)) (set (reg:SI 3 r3 [301]) (plus:SI (reg:SI 3 r3 [301]) (const_int 32 [0x20]))) ]) 1605 {neon_vld4qav4sf} (nil)) ../main.c:65: confused by earlier errors, bailing out cs-make: *** [main.o] Error 1 
+2
source share
4 answers

It's good that I contacted Code Sourcery about this issue, and they consider this a bug in the GCC compiler. So I wrote the do_it4 () {.....} function in the assembly instead of using the built-in functions. Now it works well!

+1
source

I can’t verify this because I don’t have an instrumental binding for it, but this type of error can often be circumvented by changing the code a bit. As a rule, this should not be, and it should be reported as an error, but you use processor-specific functions, which are probably less tested and polished than other compilers.

Since this is a case-sensitive error, and you have several pointers, I very much suspect that the compiler may try to load more data into the registers than necessary because of fears that some kind of overlay might happen, it probably doesn’t actually happen ) Below I will consider the possibility of this, and also do some other things that can reduce the complexity of the code from the point of view of the compiler (although it may seem that this is not so).

 #include<stdio.h> #include"arm_neon.h" #define IMAGE_HEIGHT 480 #define IMAGE_WIDTH 640 float32_t integral_image[IMAGE_HEIGHT][IMAGE_WIDTH]; float32x4_t box_area_compute3(int, int , int , int , unsigned int , float); inline int min(int, int); int main() { box_area_compute3(1, 1, 4, 4, 2, 0); return 0; } /* By putting these in separate functions the compiler will initially * think about them by themselves, without the complications of the * surrounding code. This may give it the abiltiy to optimise the * code somewhat before trying to inline it. * This may also serve to make it more obvious to the compiler that * the local variables are dead after their use (since they are * dead after the call returns, and that the lifetimes of some variable * cannot actually overlap (hopefully reducing the register needs). */ static inline float32x4_t do_it2(float32_t *tl, float32_t *tr, float32_t *bl, float32_t * br) { float32x4x2_t top_left, top_right, bottom_left, bottom_right; float32x4_t A, B; top_left = vld2q_f32(tl); top_right = vld2q_f32(tr); bottom_left = vld2q_f32(bl); bottom_right = vld2q_f32(br); /* By spreading this across several statements I have created several * additional sequence points. The compiler does not think that it * has to dereference all of the pointers before doing any of the * computations.... maybe. */ A = vaddq_f32(*top_left.val, *bottom_right.val); B = vsubq_f32(A, *top_right.val); return vsubq_f32(B, *bottom_left); } static inline float32x4_t do_it4(float32_t *tl, float32_t *tr, float32_t *bl, float32_t * br) { float32x4x4_t top_left, top_right, bottom_left, bottom_right; float32x4_t A, B; top_left = vld4q_f32(tl); top_right = vld4q_f32(tr); bottom_left = vld4q_f32(bl); bottom_right = vld4q_f32(br); A = vaddq_f32(*top_left.val, *bottom_right.val); B = vsubq_f32(A, *top_right.val); return vsubq_f32(B, *bottom_left); } float32x4_t box_area_compute3(int row, int col, int num_rows, int num_cols, unsigned int step_size, float three) { unsigned int height = IMAGE_HEIGHT; unsigned int width = IMAGE_WIDTH; int temp_row = row + num_rows; int temp_col = col + num_cols; int r1 = (min(row, height))- 1 ; int r2 = (min(temp_row, height)) - 1; int c1 = (min(col, width)) - 1; int c2 = (min(temp_col, width)) - 1; float32x4_t v128_areas; float32_t *tl = (float32_t *)integral_image[r1] + c1; float32_t *tr = (float32_t *)integral_image[r1] + c2; float32_t *bl = (float32_t *)integral_image[r2] + c1; float32_t *br = (float32_t *)integral_image[r2] + c2; switch (step_size) { case 2: v128_areas = do_it2(tl, tr, bl, br); break; case 4: v128_areas = do_it4(tl, tr, bl, br); break; } if(three == 3.0) v128_areas = vmulq_n_f32(v128_areas, three); return v128_areas; } inline int min(int X, int Y) { return (X < Y ? X : Y); } 

I hope this helps and that I have not submitted any errors.

+2
source

Line:

 float32x4x4_t top_left, top_right, bottom_left, bottom_right; 

uses all 16 q registers! Not surprisingly, the compiler cannot handle this. You could probably fix this by rewriting to use fewer registers.

0
source

ARM NEON Cortex-A8 supports vfpv3, Cortex-A5 supports vfpv4 and neon2 (for: if you use -mfloat-abi = hard, you miss the ability to emulate in missing software instructions, so you cannot generate code that will be optimized for vfpv4 but will work on vfpv3 with software emulation)

0
source

All Articles