I have one project for implementing a data structure using C, it exports various APIs for another program. Recently, I would like to do some optimization on the Hot function, profiled grofile. Here is the project for your reference.
https://github.com/Incarnation-p-lee/libds There is one hot binary_search_tree_node_insert function, as shown below:
struct binary_search_tree * binary_search_tree_node_insert(struct binary_search_tree *root, struct binary_search_tree *node) { register struct binary_search_tree **iter; if (!node || !root) { pr_log_warn("Attempt to access NULL pointer.\n"); } else { iter = &root; while (*iter) { if (node->chain.nice == (*iter)->chain.nice) { if (*iter == node) { pr_log_info("Insert node exist, nothing will be done.\n"); } else { doubly_linked_list_merge((*iter)->chain.link, node->chain.link); } return *iter; #ifndef OPT_HOT } else if (node->chain.nice > (*iter)->chain.nice) { iter = &(*iter)->right; } else if (node->chain.nice < (*iter)->chain.nice) { iter = &(*iter)->left; #else } else { binary_search_tree_insert_path_go_through(node, iter); #endif } } return *iter = node; } return NULL; }
I want to optimize two else-if parts as half of half, which can greatly affect the pipeline. Therefore, I am writing one binary_search_tree_insert_path_go_through macro instead of these two branches. And the implementation is as follows:
#define binary_search_tree_insert_path_go_through(node, iter) \ asm volatile ( \ "mov $0x18, %%rax\n\t" \ "mov $0x20, %%rdx\n\t" \ "mov 0x8(%1), %%rbx\n\t" \ "mov (%0), %%rcx\n\t" \ "cmp 0x8(%%rcx), %%rbx\n\t" \ "cmovg %%rdx, %%rax\n\t" \ "lea (%%rcx, %%rax), %0\n\t" \ :"+r"(iter) \ :"r"(node) \ :"rax", "rbx", "rcx", "rdx")
But the unit test of this function dropped by about 6-8%. From the objdump code (optimized code on the right) it has fewer instructions, I canβt understand why it costs more time before optimizing.

And there are some details for you:
struct collision_chain { struct doubly_linked_list *link; sint64 nice; }; struct binary_search_tree { struct collision_chain chain; sint32 height; union { struct binary_search_tree *left; struct avl_tree *avl_left; struct splay_tree *splay_left; }; union { struct binary_search_tree *right; struct avl_tree *avl_right; struct splay_tree *splay_right; }; }; struct doubly_linked_list { uint32 sid; void *val; struct doubly_linked_list *next; struct doubly_linked_list *previous; };
It works on a virtual box with 2 core i5-3xxM and cpuinfo as follows:
processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 58 model name : Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz stepping : 9 microcode : 0x19 cpu MHz : 2568.658 cache size : 6144 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm bogomips : 5137.31 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 58 model name : Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz stepping : 9 microcode : 0x19 cpu MHz : 2568.658 cache size : 6144 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm bogomips : 5137.31 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: