I'm investigating a discrepancy between the output of "top -H -p <pid>" and "mpstat -P ALL 2". My company's app is a multi-threaded process in which we bind each thread to specific CPU cores. Some threads perform TCP/UDP reads, others perform message processing, etc. We therefore know which CPU cores correspond to which of the app's threads.
"mpstat" is a program that reads the /proc/stat file and reports the results. We typically run it as "mpstat -P ALL 2" to get stats every 2 seconds. We discard the first set of results, since it gives stats since boot.
We also have an internal program that records various system metrics, including processor usage, and it gathers this information from /proc/stat as well. So our internal stats gathering process and "mpstat" give essentially the same results, with some variance depending on when /proc/stat is read. The example below was taken several seconds after starting the "mpstat" process:
14:25:02 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
14:25:04 PM all 1.12 0.00 0.51 0.00 0.00 0.03 0.00 0.00 0.00 98.34
14:25:04 PM 0 3.83 0.00 0.55 0.00 0.00 0.00 0.00 0.00 0.00 95.63
14:25:04 PM 1 0.52 0.00 1.55 0.00 0.00 0.00 0.00 0.00 0.00 97.94
14:25:04 PM 2 0.57 0.00 0.57 0.00 0.00 0.00 0.00 0.00 0.00 98.86
14:25:04 PM 3 0.57 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.43
14:25:04 PM 4 1.14 0.00 0.57 0.00 0.00 0.00 0.00 0.00 0.00 98.30
14:25:04 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
14:25:04 PM 6 0.52 0.00 0.52 0.00 0.00 0.00 0.00 0.00 0.00 98.95
14:25:04 PM 7 0.53 0.00 1.05 0.00 0.00 0.00 0.00 0.00 0.00 98.42
14:25:04 PM 8 1.08 0.00 0.00 0.00 0.00 0.54 0.00 0.00 0.00 98.38
14:25:04 PM 9 1.07 0.00 0.53 0.00 0.00 0.00 0.00 0.00 0.00 98.40
14:25:04 PM 10 3.72 0.00 0.53 0.00 0.00 0.00 0.00 0.00 0.00 95.74
14:25:04 PM 11 3.76 0.00 0.54 0.00 0.00 0.00 0.00 0.00 0.00 95.70
14:25:04 PM 12 0.00 0.00 0.53 0.00 0.00 0.00 0.00 0.00 0.00 99.47
14:25:04 PM 13 0.53 0.00 0.53 0.00 0.00 0.00 0.00 0.00 0.00 98.93
14:25:04 PM 14 0.54 0.00 0.54 0.00 0.00 0.00 0.00 0.00 0.00 98.92
14:25:04 PM 15 0.00 0.00 0.54 0.00 0.00 0.00 0.00 0.00 0.00 99.46
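For reference, this is roughly the calculation I understand "mpstat" and our internal tool to be doing: take two snapshots of /proc/stat and divide the change in the idle counter by the change in all counters. This is only a minimal Python sketch, not our actual tool. The function names are made up for illustration, and I'm assuming the usual column order of the cpuN lines (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice), with guest time already counted inside user and nice.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_label: [jiffy counters]} for every 'cpu*' line of /proc/stat."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(value) for value in fields[1:]]
    return counters


def idle_percent(before, after):
    """Percent idle per CPU between two parsed snapshots."""
    result = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        deltas = [n - o for n, o in zip(new, old)]
        # Assumption: only the first 8 counters are summed, because guest and
        # guest_nice are already included in user and nice.
        total = sum(deltas[:8])
        if total > 0:
            result[cpu] = 100.0 * deltas[3] / total  # 4th counter is idle
    return result


def sample_idle(interval_s=2.0, path="/proc/stat"):
    """Read /proc/stat twice, interval_s apart, and return percent idle per CPU."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return idle_percent(before, after)
```

Calling sample_idle(2.0) on the box should give numbers comparable to the %idle column above, keyed by "cpu", "cpu0", "cpu1", and so on.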
Out of habit, we typically use "top -H -p <pid>" to look at CPU%. Here is an example for the same timestamp as above:
top - 14:25:04 up 156 days, 17:46, 2 users, load average: 0.37, 0.24, 0.28
Threads: 27 total, 0 running, 27 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.2 us, 0.6 sy, 0.0 ni, 98.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65801192 total, 50580288 free, 10531340 used, 4689564 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 52792560 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
51768 root -61 0 8500104 6.6g 8244 S 12.3 10.5 1300:13 app.udp.2
51765 root -61 0 8500104 6.6g 8244 S 12.0 10.5 1346:30 app.udp.1
51770 root -61 0 8500104 6.6g 8244 S 12.0 10.5 1281:05 app.udp.4
51769 root -61 0 8500104 6.6g 8244 S 11.6 10.5 1336:50 app.udp.3
51727 root -61 0 8500104 6.6g 8244 S 7.0 10.5 1134:38 app.IT.1
51728 root -61 0 8500104 6.6g 8244 S 7.0 10.5 1026:32 app.IT.2
51756 root -61 0 8500104 6.6g 8244 S 7.0 10.5 1156:19 app.IT.7
51737 root -61 0 8500104 6.6g 8244 S 6.6 10.5 1070:51 app.IT.3
51740 root -61 0 8500104 6.6g 8244 S 6.6 10.5 1052:50 app.IT.5
51747 root -61 0 8500104 6.6g 8244 S 6.6 10.5 1112:28 app.IT.6
51739 root -61 0 8500104 6.6g 8244 S 6.3 10.5 986:49.64 app.IT.4
51763 root -61 0 8500104 6.6g 8244 S 6.3 10.5 1075:28 app.IT.8
51721 root -61 0 8500104 6.6g 8244 S 4.3 10.5 661:24.82 app.sr.1
51725 root -61 0 8500104 6.6g 8244 S 4.3 10.5 672:52.52 app.sr.2
51660 root -61 0 8500104 6.6g 8244 S 2.3 10.5 629:07.48 app
51709 root -61 0 8500104 6.6g 8244 S 1.7 10.5 241:38.15 app.sr.0
51764 root -61 0 8500104 6.6g 8244 S 0.7 10.5 165:54.88 app.Stats
51720 root -61 0 8500104 6.6g 8244 S 0.3 10.5 15:07.43 app.sr.3
51710 root -61 0 8500104 6.6g 8244 S 0.0 10.5 8:35.44 app.sn.0
51711 root -61 0 8500104 6.6g 8244 S 0.0 10.5 5:53.78 app.sn.0
51712 root -61 0 8500104 6.6g 8244 S 0.0 10.5 19:39.35 app.sr.0
51713 root -61 0 8500104 6.6g 8244 S 0.0 10.5 0:26.52 app.D
51718 root -61 0 8500104 6.6g 8244 S 0.0 10.5 15:19.29 app.sr.1
51719 root -61 0 8500104 6.6g 8244 S 0.0 10.5 15:11.80 app.sr.2
51722 root -61 0 8500104 6.6g 8244 S 0.0 10.5 15:17.29 app.sr.4
51723 root -61 0 8500104 6.6g 8244 S 0.0 10.5 14:51.77 app.sr.5
51724 root -61 0 8500104 6.6g 8244 S 0.0 10.5 15:24.45 app.sr.6
The top 4 threads are bound to cores 2-5 (sorry, I didn't enable the last-used-CPU column in the "top" output). Since each of them shows roughly 12% CPU, I would expect cores 2-5 to be about 88% idle in the mpstat output. However, the mpstat results above show those cores at 98-100% idle. Why is there a difference?
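Since I didn't capture the last-used CPU in top, one way I could double-check the thread-to-core mapping is to read /proc/<pid>/task/<tid>/stat directly. Below is a minimal Python sketch of that idea. The helper names are hypothetical, and I'm assuming the field numbering from the proc(5) man page: utime is field 14, stime is field 15, and the CPU the thread last ran on is field 39. The cpu_percent helper reflects my understanding of how a per-thread CPU% is derived (ticks used divided by ticks available on one CPU), not top's actual source.

```python
import os


def parse_thread_stat(line):
    """Parse one /proc/<pid>/task/<tid>/stat line into the fields of interest."""
    # The thread name may contain spaces or parentheses, so split on the
    # last ')' instead of splitting the whole line on whitespace.
    head, _, tail = line.rpartition(")")
    tid_text, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state), so field n is rest[n - 3]
    return {
        "tid": int(tid_text),
        "comm": comm,
        "utime": int(rest[11]),     # field 14, in clock ticks
        "stime": int(rest[12]),     # field 15, in clock ticks
        "last_cpu": int(rest[36]),  # field 39
    }


def read_thread_stats(pid):
    """Return parsed stats for every thread of the given process."""
    stats = []
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                stats.append(parse_thread_stat(f.read()))
        except FileNotFoundError:
            continue  # thread exited between listdir() and open()
    return stats


def cpu_percent(before, after, interval_s, ticks_per_s):
    """CPU% of one thread between two samples, relative to a single CPU."""
    used = (after["utime"] + after["stime"]) - (before["utime"] + before["stime"])
    return 100.0 * used / (ticks_per_s * interval_s)
```

In use, I would call read_thread_stats(pid) twice, 2 seconds apart, pair the samples by tid, and pass os.sysconf("SC_CLK_TCK") as ticks_per_s. The last_cpu value then says which row of the mpstat output each thread should be compared against.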
BTW, I saw a comment in the StackOverflow link below suggesting that I turn off Irix mode. My understanding is that top then divides the CPU% by the number of CPUs. Taking the "top" result for "app.udp.2": 12.3% / 16 cores = 0.77%, which doesn't exactly correspond to the "mpstat" results either. I don't think dividing the "top" CPU% by the number of CPUs is what I want.
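To make the two expectations explicit, here is the arithmetic as a small Python sketch using the four UDP threads from the "top" output. The function names are just for illustration. The first helper assumes the thread is the only thing running on its core and never runs anywhere else. The second is my understanding of what top reports with Irix mode off (which the top man page calls Solaris mode).

```python
NUM_CPUS = 16

# %CPU values copied from the "top" output for the four UDP threads.
top_cpu_percent = {
    "app.udp.2": 12.3,
    "app.udp.1": 12.0,
    "app.udp.4": 12.0,
    "app.udp.3": 11.6,
}


def expected_idle_if_pinned(thread_percent):
    """Idle % I would expect on a core whose only load is this thread."""
    return 100.0 - thread_percent


def solaris_mode_percent(thread_percent, num_cpus):
    """Thread CPU% divided by the number of CPUs (Irix mode off)."""
    return thread_percent / num_cpus


for name, pct in top_cpu_percent.items():
    print(
        f"{name}: expected core idle {expected_idle_if_pinned(pct):.1f}%, "
        f"Irix mode off {solaris_mode_percent(pct, NUM_CPUS):.2f}%"
    )
```

Neither number lines up with what mpstat shows for cores 2-5 (98.86, 99.43, 98.30 and 100.00 % idle), which is the discrepancy I'm asking about.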
Similar resources I've looked at: