We have a messaging system with high performance requirements. We recently noticed that the first message takes much longer than subsequent messages. Each message undergoes a lot of transformation and augmentation as it passes through our system, much of which is done by an external library.
I have just profiled the problem with callgrind, comparing a run with only one message against a run with many messages (the latter providing a baseline for comparison).
The main difference I see is the do_lookup_x function, which takes a huge amount of time. Looking at the various calls to this function, they all seem to come from a common caller: _dl_runtime_resolve. I am not sure what this function does, but to me it looks like this is the first time the various shared libraries are being used, and they are then being loaded into memory by ld.
Is this guess correct? That is, does the binary not load shared libraries into memory until they are about to be used, so that we see a significant slowdown on the first message but not on any of the following ones?
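To make the guess concrete, here is a minimal sketch of how the effect could be checked outside our system. It is not our real code: `time_first_two_calls` and `elapsed_ns` are hypothetical helpers, and `strtol` only stands in for a function from the external library. It assumes Linux with glibc and a binary linked with lazy binding (some toolchains link with `-z now` by default, in which case no difference should show up).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <time.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC samples. */
static long elapsed_ns(struct timespec start, struct timespec end)
{
    return (long)(end.tv_sec - start.tv_sec) * 1000000000L
         + (end.tv_nsec - start.tv_nsec);
}

/*
 * Times two consecutive calls to strtol(), a function that lives in a
 * shared library (libc). If this is the first strtol() call in the
 * process and the binary uses lazy binding, the first call also pays
 * for the dynamic linker's symbol lookup; the second call does not.
 */
void time_first_two_calls(long *first_ns, long *second_ns)
{
    struct timespec t0, t1, t2;
    volatile long sink; /* keeps the compiler from dropping the calls */

    /* Warm-up: clock_gettime() is itself resolved on its first call. */
    clock_gettime(CLOCK_MONOTONIC, &t0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = strtol("42", NULL, 10);   /* first call */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = strtol("42", NULL, 10);   /* second call */
    clock_gettime(CLOCK_MONOTONIC, &t2);
    (void)sink;

    *first_ns = elapsed_ns(t0, t1);
    *second_ns = elapsed_ns(t1, t2);
}
```

Calling `time_first_two_calls` at the very start of `main` and printing both values should show whether the first call is noticeably slower. Cold caches also contribute to the first-call cost, so the numbers are only indicative. Running the same binary with the environment variable `LD_BIND_NOW=1` set asks the dynamic linker to resolve all symbols at startup, which should shrink the gap if the guess is right.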
How can we avoid this?
Note: we are operating at the microsecond scale.