Measuring Basic Performance Metrics of QEMU

Application of Perf and Callgrind

Ahmed Karaman - June 22, 2020

Intro

Welcome to the TCG Continuous Benchmarking project! Throughout this project, multiple Linux profiling tools will be used to help you profile your scenarios and locate performance bottlenecks and regressions. A number of QEMU targets will be covered. In general, a pragmatic, tool-agnostic approach will be followed: a wide range of tools will be considered, and the ones most suitable for a given situation will be used.

This particular report presents two Linux profiling tools: Perf and Callgrind. It gives you an overview of their setup, usage, pros and cons, so that you can decide more easily which tool fits your needs. You will also learn two ways of finding the top N most executed QEMU functions of your scenario, without having to know any details of the underlying profiling tools!

Measuring Basic Performance Metrics

For the purpose of this report, basic performance metrics are defined as the number of instructions, the number of branches, and the number of branch misses that occurred while executing a particular scenario. Two methods for measuring them under Linux will be shown: one using the Perf tool, and another using the Callgrind tool.

Perf

Perf is a profiler tool based on sampling and on CPU performance counters. It provides per-task, per-CPU and per-workload counters, as well as source code event annotation. It depends to a great extent on kernel and CPU support. Since it does not instrument the code, its execution speed is very close to that of a regular run of the observed executable or system.
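
Because Perf depends on kernel and CPU support, it can be worth checking which events your machine actually exposes before measuring anything. A minimal sketch (the exact list varies per kernel and CPU; the second form restricts the output to hardware events):

perf list
perf list hw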

Callgrind

Callgrind is a part of Valgrind, an instrumentation framework for building dynamic analysis tools. Valgrind includes multiple tools, each covering its respective area. Callgrind is one of them: it records the number of instructions executed for each line of source code, with per-function, per-module and whole-program summaries, plus extra information about callers, callees, and call graphs for every function. It can also measure branch misses using its own branch-prediction simulation. Since it is based on instrumentation, Valgrind runs programs about 20–300x slower than normal, depending on the tool that is used and on user-defined options.

Prerequisites

  1. Install Perf and Valgrind on your system (Callgrind will be installed as a part of Valgrind). Please refer to the Appendix for details.

  2. Set up and build QEMU from source code (see the build sketch after this list). The methods presented here work for both debug and non-debug QEMU builds, but it makes the most sense to apply them to non-debug builds. In this report (and in all other reports in this series), the QEMU source tree root directory will be denoted as <qemu>, and the QEMU build directory as <qemu-build>.

  3. Download the Coulomb benchmark, which is used in this report. It computes the net forces acting on all n electrons randomly scattered across a 1m x 1m surface. n can be passed as a command line argument; if not, it defaults to 1000.

  4. Compile the program:

    gcc -static coulomb_double.c -o coulomb_double -lm
    

    The -static flag is used just for convenience, so that QEMU invocations do not need to be given the path to the target libraries.
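
As mentioned in step 2, QEMU has to be built from source. A minimal sketch of a typical linux-user build for the x86_64 target (configure flags and directory layout may differ in your setup):

cd <qemu>
mkdir -p build && cd build
../configure --target-list=x86_64-linux-user
make -j$(nproc)

After the build finishes, the emulator binary used throughout this report is available as <qemu-build>/x86_64-linux-user/qemu-x86_64.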

Measuring with Perf

The Perf tool offers a rich set of subcommands to collect and analyze performance data. To get the basic performance metrics of a program, the most suitable subcommand is perf stat. It runs a given executable and gathers performance data using the CPU’s performance counter statistics. By default, perf stat measures multiple metrics (called events in Perf parlance), such as task-clock, context-switches, cpu-migrations and more. Please check the “Perf Resources” section to learn more about Perf.
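
For a quick sanity check, perf stat can also be run without selecting any events, in which case it prints its default event set mentioned above (a sketch; the exact defaults depend on your Perf version and hardware):

sudo perf stat <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double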

Events displayed by Perf can be specified using the -e, --event <event> argument. To only measure the number of instructions, branches and branch-misses, Perf can be run using:

sudo perf stat -e instructions,branches,branch-misses <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double

And the output is:

     8,184,824,850      instructions
     1,968,845,899      branches
        25,016,032      branch-misses             #    1.27% of all branches

       0.846118331 seconds time elapsed

       0.846212000 seconds user
       0.000000000 seconds sys

Measuring with Callgrind

The general command line form for running a program with Callgrind is the following:

valgrind --tool=callgrind [callgrind options] program [program options]

For the same Coulomb program used with Perf, instructions and branch misses can be measured with Callgrind using the following command:

valgrind --tool=callgrind --branch-sim=yes <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double

Console output:

==26339== I   refs:      8,197,830,086
==26339==
==26339== Branches:      1,396,703,000  (1,385,648,906 cond + 11,054,094 ind)
==26339== Mispredicts:      12,692,618  (   10,503,444 cond +  2,189,174 ind)
==26339== Mispred rate:            0.9% (          0.8%     +       19.8%   )

I refs represents the number of instructions executed (it is always printed by default). Branches, branch misses and their corresponding percentages can be seen below it.

Callgrind also produces a file in the current working directory named callgrind.out.<pid>. This data file contains the various performance statistics observed during the measurement. It is inspected in the section “The 25 Most Executed Functions” below.

The 25 Most Executed Functions

In this section, the most executed functions are measured using both Perf and Callgrind. This helps us pinpoint the program hotspots. Please make sure that you’ve followed all of the instructions in the “Prerequisites” section before proceeding.

Using Perf

Perf offers the record subcommand, which runs the program to be analyzed, collects profile data and stores it in a file named perf.data in the current working directory. Since Perf is based on sampling, longer-running programs give it a better chance to sample all executed functions; so this time, the program is executed with an input of 30,000 electrons instead of the default 1000.

sudo perf record <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double -n 30000

To inspect the results, the report subcommand can be used with the --stdio flag, which prints the profile to standard output.

sudo perf report --stdio | head -n 36 | tail -n 28

The output is piped to head and tail to skip the report header and display only the top functions, sorted by percentage.

# Overhead  Command      Shared Object            Symbol
# ........  ...........  .......................  ..............................................
#
    18.80%  qemu-x86_64  qemu-x86_64              [.] float64_mul
    14.05%  qemu-x86_64  qemu-x86_64              [.] float64_add
    13.85%  qemu-x86_64  qemu-x86_64              [.] float64_sub
     6.06%  qemu-x86_64  qemu-x86_64              [.] helper_mulsd
     5.26%  qemu-x86_64  qemu-x86_64              [.] helper_addsd
     4.71%  qemu-x86_64  qemu-x86_64              [.] helper_subsd
     4.57%  qemu-x86_64  qemu-x86_64              [.] helper_lookup_tb_ptr
     3.08%  qemu-x86_64  qemu-x86_64              [.] f64_compare
     2.96%  qemu-x86_64  qemu-x86_64              [.] helper_ucomisd
     1.14%  qemu-x86_64  qemu-x86_64              [.] helper_pand_xmm
     0.81%  qemu-x86_64  qemu-x86_64              [.] float64_div
     0.52%  qemu-x86_64  qemu-x86_64              [.] helper_pxor_xmm
     0.37%  qemu-x86_64  qemu-x86_64              [.] helper_por_xmm
     0.36%  qemu-x86_64  qemu-x86_64              [.] float64_compare_quiet
     0.33%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784043840
     0.32%  qemu-x86_64  qemu-x86_64              [.] helper_cc_compute_all
     0.30%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f37840463c0
     0.29%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784043a80
     0.25%  qemu-x86_64  qemu-x86_64              [.] round_to_int
     0.24%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784046180
     0.22%  qemu-x86_64  qemu-x86_64              [.] soft_f64_addsub
     0.18%  qemu-x86_64  qemu-x86_64              [.] round_to_int_and_pack
     0.18%  qemu-x86_64  qemu-x86_64              [.] helper_cvttsd2si
     0.12%  qemu-x86_64  qemu-x86_64              [.] helper_divsd
     0.11%  qemu-x86_64  qemu-x86_64              [.] float64_to_int32_scalbn
     0.10%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784049115
     0.10%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403ec3b
     0.09%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403f003
     0.09%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403eb83
     0.08%  qemu-x86_64  qemu-x86_64              [.] sf_canonicalize
     0.07%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403d570
     0.07%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403d297
     0.07%  qemu-x86_64  qemu-x86_64              [.] helper_pandn_xmm
     0.07%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f37840463d3
     0.07%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784043a93

Alternatively, you can run the topN_perf Python script from our GitHub repo. The arguments should match how you would normally execute the program with QEMU.

python topN_perf.py -- <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double -n 30000

The script runs both perf record and perf report and prints the list of top functions.

 No.  Percentage  Name                       Caller
----  ----------  -------------------------  -------------------------
   1      16.25%  float64_mul                qemu-x86_64
   2      12.01%  float64_sub                qemu-x86_64
   3      11.99%  float64_add                qemu-x86_64
   4       5.69%  helper_mulsd               qemu-x86_64
   5       4.68%  helper_addsd               qemu-x86_64
   6       4.43%  helper_lookup_tb_ptr       qemu-x86_64
   7       4.28%  helper_subsd               qemu-x86_64
   8       2.71%  f64_compare                qemu-x86_64
   9       2.71%  helper_ucomisd             qemu-x86_64
  10       1.04%  helper_pand_xmm            qemu-x86_64
  11       0.71%  float64_div                qemu-x86_64
  12       0.63%  helper_pxor_xmm            qemu-x86_64
  13       0.50%  0x00007f7b7004ef95         [JIT] tid 491
  14       0.50%  0x00007f7b70044e83         [JIT] tid 491
  15       0.36%  helper_por_xmm             qemu-x86_64
  16       0.32%  helper_cc_compute_all      qemu-x86_64
  17       0.30%  0x00007f7b700433f0         [JIT] tid 491
  18       0.30%  float64_compare_quiet      qemu-x86_64
  19       0.27%  soft_f64_addsub            qemu-x86_64
  20       0.26%  round_to_int               qemu-x86_64
  21       0.25%  0x00007f7b7004c240         [JIT] tid 491
  22       0.25%  0x00007f7b70049900         [JIT] tid 491
  23       0.20%  0x00007f7b700496c0         [JIT] tid 491
  24       0.20%  0x00007f7b7004c000         [JIT] tid 491
  25       0.20%  0x00007f7b7004efbe         [JIT] tid 491

Using Callgrind

Unlike Perf, Callgrind doesn’t need a relatively long-running program to produce meaningful performance data. It works by instrumenting the program with extra instructions that record activity and maintain counters. The results of this tracking are recorded in a file named callgrind.out.<pid> (where <pid> is the PID of the process being measured).
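
If a predictable file name is more convenient than callgrind.out.<pid>, Callgrind also accepts the --callgrind-out-file option. A sketch (the file name used here is arbitrary; the output format is the same either way):

valgrind --tool=callgrind --callgrind-out-file=coulomb.callgrind <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double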

To view the contents of the file, you can use callgrind_annotate (a utility shipped with Valgrind). If you’re following this report in order, you will already have a Callgrind output file in the current working directory; if not, run the program with Callgrind as follows:

valgrind --tool=callgrind <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double

Now run callgrind_annotate on the Callgrind output file, where <pid> is the process ID of the Valgrind run.

callgrind_annotate callgrind.out.<pid> | head -n 50 | tail -n 28

The output is piped to head and tail to display only the top 25 functions, sorted by the number of instructions.

--------------------------------------------------------------------------------
           Ir  file:function
--------------------------------------------------------------------------------
2,014,193,756  ???:0x00000000082db000 [???]
1,677,340,458  <qemu>/fpu/softfloat.c:float64_mul [<qemu-build>/x86_64-linux-user/qemu-x86_64]
1,206,367,069  <qemu>/fpu/softfloat.c:float64_sub [<qemu-build>/x86_64-linux-user/qemu-x86_64]
1,136,213,139  <qemu>/fpu/softfloat.c:float64_add [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  399,610,730  <qemu>/target/i386/ops_sse.h:helper_mulsd [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  308,725,510  <qemu>/target/i386/ops_sse.h:helper_subsd [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  290,848,450  <qemu>/target/i386/ops_sse.h:helper_addsd [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  179,112,825  <qemu>/target/i386/ops_sse.h:helper_ucomisd [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  136,652,565  <qemu>/include/exec/tb-lookup.h:helper_lookup_tb_ptr
  136,174,015  <qemu>/fpu/softfloat.c:f64_compare [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  123,638,928  <qemu>/accel/tcg/tcg-runtime.c:helper_lookup_tb_ptr [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   52,058,289  <qemu>/include/exec/exec-all.h:helper_lookup_tb_ptr
   50,458,684  <qemu>/fpu/softfloat.c:float64_div [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   41,182,050  <qemu>/target/i386/ops_sse.h:helper_pand_xmm [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   41,131,601  <qemu>/include/fpu/softfloat.h:float64_mul
   39,043,872  <qemu>/target/i386/cpu.h:helper_lookup_tb_ptr
   35,822,565  <qemu>/fpu/softfloat.c:float64_compare_quiet [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   33,919,580  <qemu>/target/i386/ops_sse.h:helper_pxor_xmm [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   28,941,066  <qemu>/fpu/softfloat.c:round_to_int [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   28,409,072  <qemu>/target/i386/cc_helper.c:helper_cc_compute_all [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   20,854,735  <qemu>/fpu/softfloat.c:soft_f64_addsub [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   19,778,659  <qemu>/tcg/tcg.c:liveness_pass_1 [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   19,521,936  <qemu>/include/exec/tb-hash.h:helper_lookup_tb_ptr
   16,997,134  <qemu>/fpu/softfloat.c:round_to_int_and_pack [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   15,259,670  <qemu>/target/i386/ops_sse.h:helper_por_xmm [<qemu-build>/x86_64-linux-user/qemu-x86_64]

Alternatively, you can run the topN_callgrind Python script from our GitHub repo. The arguments should match how you would normally execute the program with QEMU.

python topN_callgrind.py -- <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double

The script runs both callgrind and callgrind_annotate and prints a better-formatted list of the top functions, just like in the previous example with Perf.

 No.  Percentage  Name                       Source File
----  ----------  -------------------------  ------------------------------
   1     24.577%  0x00000000082db000         ???
   2     20.467%  float64_mul                <qemu>/fpu/softfloat.c
   3     14.720%  float64_sub                <qemu>/fpu/softfloat.c
   4     13.864%  float64_add                <qemu>/fpu/softfloat.c
   5      4.876%  helper_mulsd               <qemu>/target/i386/ops_sse.h
   6      3.767%  helper_subsd               <qemu>/target/i386/ops_sse.h
   7      3.549%  helper_addsd               <qemu>/target/i386/ops_sse.h
   8      2.185%  helper_ucomisd             <qemu>/target/i386/ops_sse.h
   9      1.667%  helper_lookup_tb_ptr       <qemu>/include/exec/tb-lookup.h
  10      1.662%  f64_compare                <qemu>/fpu/softfloat.c
  11      1.509%  helper_lookup_tb_ptr       <qemu>/accel/tcg/tcg-runtime.c
  12      0.635%  helper_lookup_tb_ptr       <qemu>/include/exec/exec-all.h
  13      0.616%  float64_div                <qemu>/fpu/softfloat.c
  14      0.502%  helper_pand_xmm            <qemu>/target/i386/ops_sse.h
  15      0.502%  float64_mul                <qemu>/include/fpu/softfloat.h
  16      0.476%  helper_lookup_tb_ptr       <qemu>/target/i386/cpu.h
  17      0.437%  float64_compare_quiet      <qemu>/fpu/softfloat.c
  18      0.414%  helper_pxor_xmm            <qemu>/target/i386/ops_sse.h
  19      0.353%  round_to_int               <qemu>/fpu/softfloat.c
  20      0.347%  helper_cc_compute_all      <qemu>/target/i386/cc_helper.c
  21      0.254%  soft_f64_addsub            <qemu>/fpu/softfloat.c
  22      0.238%  helper_lookup_tb_ptr       <qemu>/include/exec/tb-hash.h
  23      0.233%  liveness_pass_1            <qemu>/tcg/tcg.c
  24      0.207%  round_to_int_and_pack      <qemu>/fpu/softfloat.c
  25      0.186%  helper_por_xmm             <qemu>/target/i386/ops_sse.h

Comparison of Perf and Callgrind Results

Perf’s and Callgrind’s underlying profiling methods are very different. As a consequence, differences in their results are unavoidable and, to an extent, expected. In some cases, these differences can even prove useful.

Basic Performance Metrics

Instruction counts obtained by Perf and Callgrind tend to be very similar. The numbers of branches and branch misses, on the other hand, tend to differ to some extent. This is expected, since Perf uses CPU performance counters while Callgrind uses its own simulation for these calculations, and, most likely, their very definitions of a branch are not the same. Interestingly enough, the branch miss percentages reported by the two tools are usually close.

Size of Examined Executable

As noted before, Callgrind is capable of producing detailed performance data even for short-running executables, while Perf is not. Perf simply needs to collect a reasonably large number of samples to produce meaningful results.
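
If a short-running scenario has to be profiled with Perf anyway, raising the sampling frequency can partially compensate for the short runtime. A hedged sketch (-F sets the number of samples per second; the kernel may cap it via the perf_event_max_sample_rate sysctl):

sudo perf record -F 20000 <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double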

Source File Location

Callgrind provides information about the file where the source code of a function is located, unlike Perf, which doesn’t provide this kind of data.

Furthermore, it is possible (for example, when a function contains parts that are inlined from functions defined in other source files) that the code of a single function effectively spans multiple source files. Callgrind makes this distinction and reports such parts separately, while Perf reports just a single item for such a function. The most notable example is helper_lookup_tb_ptr().
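
To collect all pieces that Callgrind attributes to a single function, the annotated output can simply be filtered by the function name. A sketch using the helper_lookup_tb_ptr() example above:

callgrind_annotate callgrind.out.<pid> | grep helper_lookup_tb_ptr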

JIT-ed Code Execution

Perf provides highly granular data on JIT-ed code execution, while Callgrind sums all such cases into one item.

In the previous example, these items can be found in the Perf output:

   13       0.50%  0x00007f7b7004ef95         [JIT] tid 491
   14       0.50%  0x00007f7b70044e83         [JIT] tid 491
...
   17       0.30%  0x00007f7b700433f0         [JIT] tid 491
...
   21       0.25%  0x00007f7b7004c240         [JIT] tid 491
   22       0.25%  0x00007f7b70049900         [JIT] tid 491
   23       0.20%  0x00007f7b700496c0         [JIT] tid 491
   24       0.20%  0x00007f7b7004c000         [JIT] tid 491
   25       0.20%  0x00007f7b7004efbe         [JIT] tid 491

While in Callgrind, there is a single item:

    1     24.577%  0x00000000082db000         ???

Depending on what one wants to do with such data, this can be an advantage or a disadvantage. In the next report, the fact that Callgrind sums up JIT-ed code execution will be used to extract some additional interesting performance metrics for QEMU.

Percentages of Individual Items

Let’s examine all items that surpassed 3% in either the Perf or the Callgrind results:

Function name         Perf     Callgrind
--------------------  -------  ---------
float64_mul           16.25%   20.467%
float64_sub           12.01%   14.720%
float64_add           11.99%   13.864%
helper_mulsd           5.69%    4.876%
helper_addsd           4.68%    3.549%
helper_lookup_tb_ptr   4.43%    4.525%*
helper_subsd           4.28%    3.767%

* The Callgrind percentage for helper_lookup_tb_ptr is obtained by summing up its several items: 1.667% + 1.509% + 0.635% + 0.476% + 0.238% = 4.525%.

It can be seen that the individual results are quite different. However, the relative relationships between individual items are approximately the same for both tools.

For a performance engineer, the differences shown above do not pose a significant problem. Performance improvement workflows usually focus on the use of a single tool, and both Perf and Callgrind can serve that purpose.

In general, a more important factor when judging the usability of a performance tool is its ability to provide the same or very similar results across multiple measurements of the identical scenario. This factor, called stability for the purpose of this report, is examined in more depth in the following section.

Stability of Perf and Callgrind Results

Idea of the Experiment

Stability can be defined as the ability to provide nearly identical results with each run of the profiler.

A simple Python script is used to compare the stability of Callgrind vs. Perf, but first, the Coulomb program is executed with Callgrind once and with Perf three times. This time, the -r, --repeat <n> Perf flag is utilized. It repeats the Perf execution n times and prints the average of all events.

Stability Experiment

This is a Bash script that performs 20 iterations, each executing the Coulomb benchmark under callgrind, perf, perf -r 10 and perf -r 100:

mkdir output
for ((i = 0; i < 20; i++)); do
    valgrind --tool=callgrind <qemu-build>/x86_64-linux-user/qemu-x86_64 ./coulomb_double 2>>./output/out$i.txt &&
        sudo perf stat -e instructions <qemu-build>/x86_64-linux-user/qemu-x86_64 ./coulomb_double 2>>./output/out$i.txt &&
        sudo perf stat -e instructions -r 10 <qemu-build>/x86_64-linux-user/qemu-x86_64 ./coulomb_double 2>>./output/out$i.txt &&
        sudo perf stat -e instructions -r 100 <qemu-build>/x86_64-linux-user/qemu-x86_64 ./coulomb_double 2>>./output/out$i.txt
done

This is a Python script that extracts the instruction counts from each run and outputs a CSV file with the measurements, as well as the average, standard deviation and coefficient of variation of all 20 executions for each of the four methods:

from os import listdir
import csv
import statistics

output_files = listdir('output')
run = 1
results = []

# Extract the instruction counts from each output file. lines[10] holds the
# callgrind count, while lines[14], lines[25] and lines[32] hold the perf,
# perf -r 10 and perf -r 100 counts, respectively (the indices depend on the
# exact layout of the redirected tool output).
for file in output_files:
    with open('output/' + file, "r") as target:
        lines = target.readlines()
        results.append([run,
                        lines[10].split()[3].replace(',', ' '),
                        lines[14].split()[0].replace(',', ' '),
                        lines[25].split()[0].replace(',', ' '),
                        lines[32].split()[0].replace(',', ' ')])
        run += 1

callgrind_results = [int(result[1].replace(' ', '')) for result in results]
callgrind_mean = statistics.mean(callgrind_results)
callgrind_stdev = statistics.stdev(callgrind_results)
callgrind_CV = (callgrind_stdev/callgrind_mean) * 100

perf_results = [int(result[2].replace(' ', '')) for result in results]
perf_mean = statistics.mean(perf_results)
perf_stdev = statistics.stdev(perf_results)
perf_CV = (perf_stdev / perf_mean) * 100

perf_10_results = [int(result[3].replace(' ', '')) for result in results]
perf_10_mean = statistics.mean(perf_10_results)
perf_10_stdev = statistics.stdev(perf_10_results)
perf_10_CV = (perf_10_stdev / perf_10_mean) * 100

perf_100_results = [int(result[4].replace(' ', '')) for result in results]
perf_100_mean = statistics.mean(perf_100_results)
perf_100_stdev = statistics.stdev(perf_100_results)
perf_100_CV = (perf_100_stdev/perf_100_mean) * 100

with open('output.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Run", "callgrind", "perf",
                     "perf -r 10", "perf -r 100"])
    for result in results:
        writer.writerow(result)
    writer.writerow(["Avg", callgrind_mean, perf_mean,
                     perf_10_mean, perf_100_mean])
    writer.writerow(["σ", callgrind_stdev, perf_stdev,
                     perf_10_stdev, perf_100_stdev])
    writer.writerow(["σ (%)", callgrind_CV, perf_CV,
                     perf_10_CV, perf_100_CV])
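
Assuming the script above is saved as stability.py (a hypothetical name) next to the output directory, it can be run and its CSV result pretty-printed as follows:

python3 stability.py
column -s, -t output.csv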

Results of the Experiment

  Run  callgrind      perf           perf -r 10     perf -r 100
-----  -------------  -------------  -------------  -------------
    1  8,197,479,927  8,185,411,320  8,185,345,224  8,185,114,468
    2  8,197,479,919  8,185,195,094  8,184,991,767  8,184,923,223
    3  8,197,479,842  8,184,968,958  8,185,206,200  8,185,084,344
    4  8,197,479,842  8,185,162,965  8,185,236,582  8,184,992,738
    5  8,197,479,919  8,185,173,560  8,185,230,524  8,185,177,606
    6  8,197,479,919  8,185,107,760  8,185,089,134  8,184,985,563
    7  8,197,479,927  8,185,242,718  8,185,067,722  8,184,933,899
    8  8,197,479,919  8,185,259,192  8,185,356,351  8,185,202,083
    9  8,197,479,927  8,185,176,054  8,184,955,923  8,185,048,778
   10  8,197,479,842  8,185,062,738  8,185,061,203  8,185,148,177
   11  8,197,479,842  8,185,065,330  8,184,952,823  8,184,926,132
   12  8,197,479,852  8,185,370,367  8,185,303,501  8,185,125,697
   13  8,197,479,927  8,185,115,629  8,184,950,032  8,185,001,142
   14  8,197,479,842  8,186,555,638  8,185,187,143  8,185,035,473
   15  8,197,481,000  8,185,159,045  8,185,238,812  8,185,032,286
   16  8,197,479,961  8,185,170,493  8,185,029,578  8,185,120,424
   17  8,197,479,842  8,185,125,155  8,184,990,595  8,184,922,657
   18  8,197,480,006  8,187,323,418  8,185,326,603  8,185,092,509
   19  8,197,479,919  8,185,454,074  8,185,320,982  8,185,173,455
   20  8,197,481,000  8,185,164,000  8,185,206,455  8,185,044,700
  Avg  8,197,480,009  8,185,363,175  8,185,152,358  8,185,054,268
    σ        342.272    565,489.783    144,003.202     89,808.068
σ (%)      0.0000041      0.0069085      0.0017593      0.0010972

From the previous experiment, it can be seen that σ (%), the coefficient of variation, is approximately 0.0000041% for Callgrind. For Perf, it is 0.0069085% when perf stat is used without any -r switch. It decreases to 0.0017593% for -r 10, and further to 0.0010972% for -r 100. The maximal value that can be specified after -r is 100, so this is the maximal stability that can be achieved using Perf.

It can be concluded that despite (or perhaps, better said, because of) the slow execution time of Callgrind, it gives very stable results. The stability of Perf results improves as the repetition count increases, but it still doesn’t reach a coefficient of variation as low as Callgrind’s even with the maximum possible repetitions (-r 100).

Resources

If you want to learn more about Perf and Callgrind, please check the resources section below.

Perf Resources

  • Official Perf Wiki

    The official Perf wiki offers a detailed step-by-step tutorial on the usage of Perf. It lists all of the measurable hardware and software events, as well as multiple examples of Perf command usage.

  • Performance Lab: The Power of The Perf Tool - Arnaldo Melo and Jiri Olsa

    In this talk, the speakers show how to build Perf from source and how it can be used to detect and hunt down numerous performance issues. They also cover examples of some interesting Perf features and their favorite usage tips.

  • Linux Perf for Qt Developers - Milian Wolff

    In this talk, the speaker gives a detailed introduction to Perf showing how to use it to find CPU hotspots in the code, as well as some tricks to profile wait times for lock contention issues or disk I/O. He also dives into details on how it is applicable to Qt developers in particular.

Callgrind Resources

  • Callgrind Official Manual

    The official Callgrind manual gives an overview of Callgrind. It provides a guide for basic and advanced usage of the tool, as well as a detailed description of all Callgrind command line arguments.

  • Callgrind Output Format Manual

    This manual covers the internal structure of the Callgrind output file, which was inspected with callgrind_annotate in this report. It includes simple and extended examples of such files as well as the complete grammar of the format.

  • Stanford CS107 Callgrind Guide

    This guide offers a quick introduction to get you up and running with Callgrind. It covers basic usage cases as well as some tips and tricks.



Appendix

Installing Perf

CentOS & RHEL

sudo yum install perf

Fedora

sudo dnf install perf

Arch

sudo pacman -S perf

Debian & Derivatives

sudo apt update && sudo apt install linux-tools-$(uname -r) linux-tools-generic

The uname -r command is used to provide the Linux kernel version instead of writing it manually in the installation command.
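
For instance, on a machine whose kernel release is 5.4.0-33-generic (an arbitrary, illustrative value), the command above expands to:

sudo apt update && sudo apt install linux-tools-5.4.0-33-generic linux-tools-generic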

Installing Valgrind

CentOS & RHEL

sudo yum install valgrind

Fedora

sudo dnf install valgrind

Arch

sudo pacman -S valgrind

Debian & Derivatives

sudo apt install valgrind

Apart from the procedures mentioned above, and for curious and advanced users, or simply those wishing for the latest and greatest, both Perf and Valgrind can also be built and installed from their source code.
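
A rough sketch of what such builds might look like (paths and version numbers are illustrative; Perf lives in the Linux kernel source tree under tools/perf, while Valgrind releases ship a standard configure/make build):

# Perf, from a Linux kernel source tree (produces a ./perf binary)
cd linux/tools/perf
make

# Valgrind (Callgrind included), from a release tarball
cd valgrind-3.XX.X
./configure
make
sudo make install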

