Impossible Cpu time for process / thread with rusage

I compute the delta between the time of the processor that the process / thread uses, between the call to a specific function.

I get an impossible value for the processor process time, some time when the next call returns a shorter processor time, which is impossible.

this is the code i use to calculate processor time:

u64 CpuTime::calculateCpuTime(bool a_thread) { struct rusage l_rusage; int retVal; if(a_thread) { retVal = getrusage(1, &l_rusage); //1 = RUSAGE_THREAD } else { retVal = getrusage(0, &l_rusage); //0 = RUSAGE_SELF } ASSERT(retVal==0); u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+ (static_cast<u64>(l_rusage.ru_utime.tv_usec)); u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*1000000)+ (static_cast<u64>(l_rusage.ru_stime.tv_usec)); if(a_thread) { return userUSeconds + systemUSeconds; } return (userUSeconds + systemUSeconds) / ACE_OS::num_processors_online(); } 

Also, the function that calculates this is asynchronous, no more than 1 thread can access it at a time.

this is an example of an impossible output (process):

2016/03/23 13: 39: 05.187 - No. 2 - 15471/15055 - ThreadDeltaCpuTime = 6000 ThreadCpuTime = 46756013 ProcessCpuTime = 39820759 deltaProcessCpuTime = 250

2016/03/23 13: 39: 05.386 - No. 2 - 15470/15055 - ThreadDeltaCpuTime = 5000 ThreadCpuTime = 46663013 ProcessCpuTime = 39699759 deltaProcessCpuTime = 18446744073709430616

Delta calculation is a simple newValue - oldValue, with an unsigned int, it leads to these numbers.

Example for an impossible output stream:

2016/03/23 13: 38: 59.183 - No. 2 - 15472/15055 - ThreadDeltaCpuTime = 6000 ThreadCpuTime = 47541012 ProcessCpuTime = 39658134 deltaProcessCpuTime = 5250

2016/03/23 13: 38: 59.382 - No. 2 - 15472/15055 - ThreadDeltaCpuTime = 1844 6744073708557617 ThreadCpuTime = 46547013 ProcessCpuTime = 39663009 deltaProcessCpuTime = 4625

I highlighted the thread identifier, also you can see that the function was executed 200 ms after the first.

I am looking for a clue to why this is happening, I also tried using the ACE_OS crack, with the same results (meaning), I could solve this by ignoring the β€œbad” values, but I wonder why this is happening.

EDIT:

This is where I do the actual print:

  u64 computeCpuTime() { u64 deltaCpuTime= CpuTime::getThreadDeltaCpuTime(); u64 CpuTime= getThreadCpuTime(); u64 ProcessCpuTime= getProcessCpuTime(); u64 deltaProcessCpuTime= CpuTime::getProcessDeltaCpuTime(); Log<<value(deltaCpuTime)<<value(CpuTime)<<value(ProcessCpuTime)<<value(deltaProcessCpuTime)<<endlog; return deltaCpuTime; } 

And other functions used in this calculation:

  u64 CpuTime::getThreadDeltaCpuTime() { pid_t thisThread = (pid_t) syscall (SYS_gettid); u64& val = m_threadsCpuMap[thisThread]; u64 oldValue =val; val = calculateCpuTime(true); return val - oldValue; } u64 CpuTime::getProcessDeltaCpuTime() { u64 oldValue = m_processCpu; m_processCpu = calculateCpuTime(false); return m_processCpu - oldValue; } u64 getThreadCpuTime() { return CpuTime::calculateCpuTime(true); } u64 getProcessCpuTime() { return CpuTime::calculateCpuTime(false); } 

The object making the system call ("m_cpuTime") is a singleton, protected mutex, only one thread can access it at a time.

it contains a cpu thread usage map, m_threadsCpuMap (used for delta), as well as the most recent use of the process processor, m_processCpu.

OTHER EDITING:

I simplified it with a simple test, working with a single thread, removing division with active processors, and I only check the Process Cpu time, still with impossible results.

Here are the updated features:

Test:

 for(int i = 0; i < 100000 ; i++) { for(int k = 0; k < 1000000; k++) m = k % i; cpuTime = CpuTime::instance()->getProcessDeltaCpuTime(); } 

Functions used in the test:

  u64 CpuTime::getProcessDeltaCpuTime() { u64 oldValue = m_processCpu; m_processCpu = calculateCpuTime(eThisProcess); Log<<value(oldValue)<<value(m_processCpu)<value( m_processCpu - oldValue)<<endlog; return m_processCpu - oldValue; } u64 CpuTime::calculateCpuTime(int a_type) { struct rusage l_rusage; int retVal; if(a_type == eThisThread) { retVal = /*ACE_OS::*/getrusage(1, &l_rusage); } else { retVal = /*ACE_OS::*/getrusage(0, &l_rusage); } u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+ (static_cast<u64>(l_rusage.ru_utime.tv_usec)); u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*100000)+ (static_cast<u64>(l_rusage.ru_stime.tv_usec)); if(a_type == eThisThread) { return userUSeconds + systemUSeconds; } return (userUSeconds + systemUSeconds)/* / ACE_OS::num_processors_online()*/; 

This is an example of logs, I deleted all the times when the delta was ~ 0

2016/03/29 08: 07: 05.198 - No. 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14797750 m_processCpu = 14798749 m_process Cpu - oldValue = 999

2016/03/29 08: 07: 05.199 - No. 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14798749 m_processCpu = 14799749 m_process Cpu - oldValue = 1000

2016/03/29 08: 07: 05.200 - # 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14799749 m_processCpu = 14800749 m_process Cpu - oldValue = 1000

2016/03/29 08: 07: 05.201 - No. 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14800749 m_processCpu = 14801749 m_process Cpu - oldValue = 1000

2016/03/29 08: 07: 05.202 - # 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14801749 m_processCpu = 14802749 m_process Cpu - oldValue = 1000

2016/03/29 08: 07: 05.203 - # 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14802749 m_processCpu = 13903748 m_process Cpu - oldValue = 18446744073708652615

2016/03/29 08: 07: 05.204 - No. 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 13903748 m_processCpu = 13904748 m_process Cpu - oldValue = 1000

2016/03/29 08: 07: 05.205 - No. 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 13904748 m_processCpu = 13905748 m_process Cpu - oldValue = 1000

This time there was only 1 time of unsuccessful values, out of 100,000 iterations. And after that, all subsequent calculations were also smaller, for example. it was like this: 1 2 3 4 5 6 7 8 9 10 Bad cost, 6 7 8 9 10 ... As if the process had lost count, returned, and then continued normally.

GET OTHER CHANGES:

I completely separated the code, compiled it as simple as possible. and the problem did not arise, it could be a matter of probability (earlier it was 1/100000). this is the code:

 int main() { long int n = 0; long int oldValue = 0; long int newValue = 0; unsigned long int deltaValue = 0; for(int i = 0; i < 1000000; i++) { for(long int m = 0; m <10000; m++) n = m + i; struct rusage l_rusage; int retVal; retVal = getrusage(0, &l_rusage); if(retVal != 0) return 0; long int userUSeconds = l_rusage.ru_utime.tv_sec*1000000 + l_rusage.ru_utime.tv_usec; long int systemUSeconds = l_rusage.ru_stime.tv_sec*100000 + l_rusage.ru_stime.tv_usec; oldValue = newValue; newValue = userUSeconds + systemUSeconds; deltaValue = newValue - oldValue; if(deltaValue != 0) std::cout<<"Old value: "<< oldValue <<" New Value: "<< newValue <<" Delta value: "<<deltaValue<<"\n"; } std::cout<<n; return 0; } 

And from further study of the previous test (which uses our unittest frame), the problem always arises at approximately the same processor time. This means that this is happening for some reason, I just can’t understand what is happening at this exact time, which can cause something like that. Or how it might be caused.

I am not well versed in the linux kernel to understand how this calculation is done.

Another understanding is when using clock () to get the process. this does not happen, even more - it seems that until this moment the values ​​were approximately the same, after that they were not:

2016/03/29 12: 36: 19.158 - No. 2 - 20544/20544 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14598780 m_processCpu = 14598780 m_processCpu - oldValue = 0 oldClockTime = 14580000 m_clockTime = 14580000 mcl_clockTime = 14580000 mcl

2016/03/29 12: 36: 19.158 - No. 2 - 20544/20544 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14598780 m_processCpu = 13699779 m_processCpu - oldValue = 18446744073708652615 oldClockTime = 14580000Tock_clock_time_14580000_clock_clock_time

2016/03/29 12: 36: 19.158 - No. 2 - 20544/20544 - CpuTime :: getProcessDeltaCpuTime: oldValue = 13699779 m_processCpu = 13699779 m_processCpu - oldValue = 0 oldClockTime = 14590000 m_clockTime = 14590000 mcl_clockTime = 14590000 mcl

Which leads me to another question, how is rusage calculated and how is the clock ()? what can lead to a difference between the two?

I worked on the problem by measuring it with clock_gettime, here is the code if 1 is of interest:

  u64 CpuMeasure::calculateCpuTime(int a_type) { struct timespec ts; if(a_type == eThisThread) { if (ACE_OS::clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) == 0) { return (u64)ts.tv_sec * 1000000 + (u64)ts.tv_nsec / 1000; //From nano to mili seconds } } else { if (ACE_OS::clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) == 0) { return (u64)ts.tv_sec * 1000000 + (u64)ts.tv_nsec / 1000; } } ASSERT(false); //we cant be here, unless some thing bad happened return 0; } 

I use ACE_OS for portability, but it works with the default functions from sys / time.h

However, I still doubt the strange behavior of rusage , which might make it give these values.

+6
source share
1 answer

Check these lines carefully: you multiply the number of seconds of the system by 100 thousand instead of 1 million:

 u64 userUSeconds = (static_cast<u64>(l_rusage.ru_utime.tv_sec)*1000000)+ (static_cast<u64>(l_rusage.ru_utime.tv_usec)); u64 systemUSeconds = (static_cast<u64>(l_rusage.ru_stime.tv_sec)*100000)+ (static_cast<u64>(l_rusage.ru_stime.tv_usec)); 

They are present in two of the three examples that you indicated (not in the first).

However, this explains the odd sequence:

2016/03/29 08: 07: 05.201 - No. 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14800749 m_processCpu = 14801749 m_process Cpu - oldValue = 1000

2016/03/29 08: 07: 05.202 - # 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14801749 m_processCpu = 14802749 m_process Cpu - oldValue = 1000

2016/03/29 08: 07: 05.203 - # 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 14802749 m_processCpu = 13903748 m_process Cpu - oldValue = 18446744073708652615

2016/03/29 08: 07: 05.204 - No. 2 - 24011/24011 - CpuTime :: getProcessDeltaCpuTime: oldValue = 13903748 m_processCpu = 13904748 m_process Cpu - oldValue = 1000

at each iteration, she adds 1,000 of us. But I think that as soon as the system time exceeds one second, only 1/10 of a second is added, which leads to a shift of -0.9 seconds, which is exactly 14.8 β†’ 13.9 seconds.

actually:

  18446744073708652615 - 2^64 + 1 = -899000 

i.e. -900000 (0.9 seconds) + 1000 us

+4
source

All Articles