Measuring Basic Performance Metrics of QEMU

Application of Perf and Callgrind

Ahmed Karaman - June 22, 2020

Intro

Welcome to the TCG Continuous Benchmarking project! Throughout this project, multiple Linux profiling tools will be used to help you profile your scenarios and locate performance bottlenecks and regressions. A number of QEMU targets will be covered. In general, a pragmatic, tool-agnostic approach will be followed: a wide range of tools will be considered, and the ones most suitable for a given situation will be used.

This particular report presents two Linux profiling tools: Perf and Callgrind. It gives you an overview of their setup, usage, pros and cons, so that you can decide more easily which tool fits your needs. You will also learn two ways of finding the top N most executed QEMU functions of your scenario, without having to know any details of the underlying profiling tools!

Measuring Basic Performance Metrics

For the purpose of this report, basic performance metrics are defined as the number of instructions, the number of branches, and the number of branch misses that occurred while executing a particular scenario. Two methods for measuring them under Linux will be shown: one using the Perf tool, and another using the Callgrind tool.

Perf

Perf is a profiler tool based on sampling and on CPU performance counters. It provides per-task, per-CPU and per-workload counters, as well as source code event annotation. It depends to a great extent on kernel and CPU support. Since it does not instrument the code, its execution speed is very close to that of a regular run of the observed executable or system.
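
Because Perf depends on kernel and CPU support, it can be worth checking which events your machine actually exposes before measuring anything. A minimal sketch (the exact list varies per kernel and CPU; the second form restricts the output to hardware events):

perf list
perf list hw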

Callgrind

Callgrind is a part of Valgrind, an instrumentation framework for building dynamic analysis tools. Valgrind includes multiple tools, each covering its respective area. Callgrind is one of them: it records the number of instructions executed for each line of source code, with per-function, per-module and whole-program summaries, plus extra information about callers, callees, and call graphs for every function. It can also measure branch misses using its own branch-prediction simulation. Since it is based on instrumentation, Valgrind runs programs about 20–300x slower than normal, depending on the tool that is used and on user-defined options.

Prerequisites

  1. Install Perf and Valgrind on your system (Callgrind will be installed as a part of Valgrind). Please refer to the Appendix for details.

  2. Set up and build QEMU from source code (see the build sketch after this list). The methods presented here work for both debug and non-debug QEMU builds, but it makes the most sense to apply them to non-debug builds. In this report (and in all other reports in this series), the QEMU source tree root directory will be denoted as <qemu>, and the QEMU build directory as <qemu-build>.

  3. Download the Coulomb benchmark, which is used in this report. It computes the net forces acting on all n electrons randomly scattered across a 1m x 1m surface. n can be passed as a command line argument; if not, it defaults to 1000.

  4. Compile the program:

    gcc -static coulomb_double.c -o coulomb_double -lm
    

    The -static flag is used just for convenience, so that QEMU invocations do not need to be given the path to the target libraries.
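
As mentioned in step 2, QEMU has to be built from source. A minimal sketch of a typical linux-user build for the x86_64 target (configure flags and directory layout may differ in your setup):

cd <qemu>
mkdir -p build && cd build
../configure --target-list=x86_64-linux-user
make -j$(nproc)

After the build finishes, the emulator binary used throughout this report is available as <qemu-build>/x86_64-linux-user/qemu-x86_64.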

Measuring with Perf

The Perf tool offers a rich set of subcommands to collect and analyze performance data. To get the basic performance metrics of a program, the most suitable subcommand is perf stat. It runs a given executable and gathers performance data using the CPU’s performance counter statistics. By default, perf stat measures multiple metrics (called events in Perf parlance), such as task-clock, context-switches, cpu-migrations and more. Please check the “Perf Resources” section to learn more about Perf.
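
For a quick sanity check, perf stat can also be run without selecting any events, in which case it prints its default event set mentioned above (a sketch; the exact defaults depend on your Perf version and hardware):

sudo perf stat <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double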

Events displayed by Perf can be specified using the -e, --event <event> argument. To only measure the number of instructions, branches and branch-misses, Perf can be run using:

sudo perf stat -e instructions,branches,branch-misses <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double

And the output is:

     8,184,824,850      instructions
     1,968,845,899      branches
        25,016,032      branch-misses             #    1.27% of all branches

       0.846118331 seconds time elapsed

       0.846212000 seconds user
       0.000000000 seconds sys

Measuring with Callgrind

The general command line form for running a program with Callgrind is the following:

valgrind --tool=callgrind [callgrind options] program [program options]

For the same Coulomb program used with Perf, instructions and branch misses can be measured with Callgrind using the following command:

valgrind --tool=callgrind --branch-sim=yes <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double

Console output:

==26339== I   refs:      8,197,830,086
==26339==
==26339== Branches:      1,396,703,000  (1,385,648,906 cond + 11,054,094 ind)
==26339== Mispredicts:      12,692,618  (   10,503,444 cond +  2,189,174 ind)
==26339== Mispred rate:            0.9% (          0.8%     +       19.8%   )

I refs represents the number of instructions executed (it is always printed by default). Branches, branch misses and their corresponding percentages can be seen below it.

Callgrind also produces a file in the current working directory named callgrind.out.<pid>. This data file contains the various performance statistics observed during the measurement. It is inspected in the section “The 25 Most Executed Functions” below.

The 25 Most Executed Functions

In this section, the most executed functions are measured using both Perf and Callgrind. This helps us pinpoint the program hotspots. Please make sure that you’ve followed all of the instructions in the “Prerequisites” section before proceeding.

Using Perf

Perf offers the record subcommand, which runs the program to be analyzed, collects profile data and stores it in a file named perf.data in the current working directory. Since Perf is based on sampling, longer-running programs give it a better chance to sample all executed functions; so this time, the program is executed with an input of 30,000 electrons instead of the default 1000.

sudo perf record <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double -n 30000

To inspect the results, the report subcommand can be used with the --stdio flag, which prints the profile to standard output.

sudo perf report --stdio | head -n 36 | tail -n 28

The output is piped to head and tail to skip the report header and display only the top functions, sorted by percentage.

# Overhead  Command      Shared Object            Symbol
# ........  ...........  .......................  ..............................................
#
    18.80%  qemu-x86_64  qemu-x86_64              [.] float64_mul
    14.05%  qemu-x86_64  qemu-x86_64              [.] float64_add
    13.85%  qemu-x86_64  qemu-x86_64              [.] float64_sub
     6.06%  qemu-x86_64  qemu-x86_64              [.] helper_mulsd
     5.26%  qemu-x86_64  qemu-x86_64              [.] helper_addsd
     4.71%  qemu-x86_64  qemu-x86_64              [.] helper_subsd
     4.57%  qemu-x86_64  qemu-x86_64              [.] helper_lookup_tb_ptr
     3.08%  qemu-x86_64  qemu-x86_64              [.] f64_compare
     2.96%  qemu-x86_64  qemu-x86_64              [.] helper_ucomisd
     1.14%  qemu-x86_64  qemu-x86_64              [.] helper_pand_xmm
     0.81%  qemu-x86_64  qemu-x86_64              [.] float64_div
     0.52%  qemu-x86_64  qemu-x86_64              [.] helper_pxor_xmm
     0.37%  qemu-x86_64  qemu-x86_64              [.] helper_por_xmm
     0.36%  qemu-x86_64  qemu-x86_64              [.] float64_compare_quiet
     0.33%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784043840
     0.32%  qemu-x86_64  qemu-x86_64              [.] helper_cc_compute_all
     0.30%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f37840463c0
     0.29%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784043a80
     0.25%  qemu-x86_64  qemu-x86_64              [.] round_to_int
     0.24%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784046180
     0.22%  qemu-x86_64  qemu-x86_64              [.] soft_f64_addsub
     0.18%  qemu-x86_64  qemu-x86_64              [.] round_to_int_and_pack
     0.18%  qemu-x86_64  qemu-x86_64              [.] helper_cvttsd2si
     0.12%  qemu-x86_64  qemu-x86_64              [.] helper_divsd
     0.11%  qemu-x86_64  qemu-x86_64              [.] float64_to_int32_scalbn
     0.10%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784049115
     0.10%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403ec3b
     0.09%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403f003
     0.09%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403eb83
     0.08%  qemu-x86_64  qemu-x86_64              [.] sf_canonicalize
     0.07%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403d570
     0.07%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f378403d297
     0.07%  qemu-x86_64  qemu-x86_64              [.] helper_pandn_xmm
     0.07%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f37840463d3
     0.07%  qemu-x86_64  [JIT] tid 18993          [.] 0x00007f3784043a93

Alternatively, you can run the topN_perf Python script from our GitHub repo. The arguments should match how you would normally execute the program with QEMU.

python topN_perf.py -- <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double -n 30000

The script runs both perf record and perf report and prints the list of top functions.

 No.  Percentage  Name                       Caller
----  ----------  -------------------------  -------------------------
   1      16.25%  float64_mul                qemu-x86_64
   2      12.01%  float64_sub                qemu-x86_64
   3      11.99%  float64_add                qemu-x86_64
   4       5.69%  helper_mulsd               qemu-x86_64
   5       4.68%  helper_addsd               qemu-x86_64
   6       4.43%  helper_lookup_tb_ptr       qemu-x86_64
   7       4.28%  helper_subsd               qemu-x86_64
   8       2.71%  f64_compare                qemu-x86_64
   9       2.71%  helper_ucomisd             qemu-x86_64
  10       1.04%  helper_pand_xmm            qemu-x86_64
  11       0.71%  float64_div                qemu-x86_64
  12       0.63%  helper_pxor_xmm            qemu-x86_64
  13       0.50%  0x00007f7b7004ef95         [JIT] tid 491
  14       0.50%  0x00007f7b70044e83         [JIT] tid 491
  15       0.36%  helper_por_xmm             qemu-x86_64
  16       0.32%  helper_cc_compute_all      qemu-x86_64
  17       0.30%  0x00007f7b700433f0         [JIT] tid 491
  18       0.30%  float64_compare_quiet      qemu-x86_64
  19       0.27%  soft_f64_addsub            qemu-x86_64
  20       0.26%  round_to_int               qemu-x86_64
  21       0.25%  0x00007f7b7004c240         [JIT] tid 491
  22       0.25%  0x00007f7b70049900         [JIT] tid 491
  23       0.20%  0x00007f7b700496c0         [JIT] tid 491
  24       0.20%  0x00007f7b7004c000         [JIT] tid 491
  25       0.20%  0x00007f7b7004efbe         [JIT] tid 491

Using Callgrind

Unlike Perf, Callgrind doesn’t need a relatively long-running program to produce meaningful performance data. It works by instrumenting the program with extra instructions that record activity and maintain counters. The results of this tracking are recorded in a file named callgrind.out.<pid> (where <pid> is the PID of the process being measured).
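
If a predictable file name is more convenient than callgrind.out.<pid>, Callgrind also accepts the --callgrind-out-file option. A sketch (the file name used here is arbitrary; the output format is the same either way):

valgrind --tool=callgrind --callgrind-out-file=coulomb.callgrind <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double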

To view the contents of the file, you can use callgrind_annotate (a utility shipped with Valgrind). If you’re following this report in order, you will already have a Callgrind output file in the current working directory; if not, run the program with Callgrind as follows:

valgrind --tool=callgrind <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double

Now run callgrind_annotate on the Callgrind output file, where <pid> is the process ID of the Valgrind run.

callgrind_annotate callgrind.out.<pid> | head -n 50 | tail -n 28

The output is piped to head and tail to display only the top 25 functions, sorted by the number of instructions.

--------------------------------------------------------------------------------
           Ir  file:function
--------------------------------------------------------------------------------
2,014,193,756  ???:0x00000000082db000 [???]
1,677,340,458  <qemu>/fpu/softfloat.c:float64_mul [<qemu-build>/x86_64-linux-user/qemu-x86_64]
1,206,367,069  <qemu>/fpu/softfloat.c:float64_sub [<qemu-build>/x86_64-linux-user/qemu-x86_64]
1,136,213,139  <qemu>/fpu/softfloat.c:float64_add [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  399,610,730  <qemu>/target/i386/ops_sse.h:helper_mulsd [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  308,725,510  <qemu>/target/i386/ops_sse.h:helper_subsd [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  290,848,450  <qemu>/target/i386/ops_sse.h:helper_addsd [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  179,112,825  <qemu>/target/i386/ops_sse.h:helper_ucomisd [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  136,652,565  <qemu>/include/exec/tb-lookup.h:helper_lookup_tb_ptr
  136,174,015  <qemu>/fpu/softfloat.c:f64_compare [<qemu-build>/x86_64-linux-user/qemu-x86_64]
  123,638,928  <qemu>/accel/tcg/tcg-runtime.c:helper_lookup_tb_ptr [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   52,058,289  <qemu>/include/exec/exec-all.h:helper_lookup_tb_ptr
   50,458,684  <qemu>/fpu/softfloat.c:float64_div [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   41,182,050  <qemu>/target/i386/ops_sse.h:helper_pand_xmm [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   41,131,601  <qemu>/include/fpu/softfloat.h:float64_mul
   39,043,872  <qemu>/target/i386/cpu.h:helper_lookup_tb_ptr
   35,822,565  <qemu>/fpu/softfloat.c:float64_compare_quiet [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   33,919,580  <qemu>/target/i386/ops_sse.h:helper_pxor_xmm [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   28,941,066  <qemu>/fpu/softfloat.c:round_to_int [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   28,409,072  <qemu>/target/i386/cc_helper.c:helper_cc_compute_all [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   20,854,735  <qemu>/fpu/softfloat.c:soft_f64_addsub [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   19,778,659  <qemu>/tcg/tcg.c:liveness_pass_1 [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   19,521,936  <qemu>/include/exec/tb-hash.h:helper_lookup_tb_ptr
   16,997,134  <qemu>/fpu/softfloat.c:round_to_int_and_pack [<qemu-build>/x86_64-linux-user/qemu-x86_64]
   15,259,670  <qemu>/target/i386/ops_sse.h:helper_por_xmm [<qemu-build>/x86_64-linux-user/qemu-x86_64]

Alternatively, you can run the topN_callgrind Python script from our GitHub repo. The arguments should match how you would normally execute the program with QEMU.

python topN_callgrind.py -- <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double

The script runs both callgrind and callgrind_annotate and prints a better-formatted list of the top functions, just like in the previous example with Perf.

 No.  Percentage  Name                       Source File
----  ----------  -------------------------  ------------------------------
   1     24.577%  0x00000000082db000         ???
   2     20.467%  float64_mul                <qemu>/fpu/softfloat.c
   3     14.720%  float64_sub                <qemu>/fpu/softfloat.c
   4     13.864%  float64_add                <qemu>/fpu/softfloat.c
   5      4.876%  helper_mulsd               <qemu>/target/i386/ops_sse.h
   6      3.767%  helper_subsd               <qemu>/target/i386/ops_sse.h
   7      3.549%  helper_addsd               <qemu>/target/i386/ops_sse.h
   8      2.185%  helper_ucomisd             <qemu>/target/i386/ops_sse.h
   9      1.667%  helper_lookup_tb_ptr       <qemu>/include/exec/tb-lookup.h
  10      1.662%  f64_compare                <qemu>/fpu/softfloat.c
  11      1.509%  helper_lookup_tb_ptr       <qemu>/accel/tcg/tcg-runtime.c
  12      0.635%  helper_lookup_tb_ptr       <qemu>/include/exec/exec-all.h
  13      0.616%  float64_div                <qemu>/fpu/softfloat.c
  14      0.502%  helper_pand_xmm            <qemu>/target/i386/ops_sse.h
  15      0.502%  float64_mul                <qemu>/include/fpu/softfloat.h
  16      0.476%  helper_lookup_tb_ptr       <qemu>/target/i386/cpu.h
  17      0.437%  float64_compare_quiet      <qemu>/fpu/softfloat.c
  18      0.414%  helper_pxor_xmm            <qemu>/target/i386/ops_sse.h
  19      0.353%  round_to_int               <qemu>/fpu/softfloat.c
  20      0.347%  helper_cc_compute_all      <qemu>/target/i386/cc_helper.c
  21      0.254%  soft_f64_addsub            <qemu>/fpu/softfloat.c
  22      0.238%  helper_lookup_tb_ptr       <qemu>/include/exec/tb-hash.h
  23      0.233%  liveness_pass_1            <qemu>/tcg/tcg.c
  24      0.207%  round_to_int_and_pack      <qemu>/fpu/softfloat.c
  25      0.186%  helper_por_xmm             <qemu>/target/i386/ops_sse.h

Comparison of Perf and Callgrind Results

Perf’s and Callgrind’s underlying profiling methods are very different. As a consequence, differences in their results are unavoidable and, to an extent, expected. In some cases, these differences can even prove useful.

Basic Performance Metrics

Instruction counts obtained by Perf and Callgrind tend to be very similar. The numbers of branches and branch misses, on the other hand, tend to differ to some extent. This is expected, since Perf uses CPU performance counters while Callgrind uses its own simulation for these calculations, and, most likely, their very definitions of a branch are not the same. Interestingly enough, the branch miss percentages reported by the two tools are usually close.

Size of Examined Executable

As noted before, Callgrind is capable of producing detailed performance data even for short-running executables, while Perf is not. Perf simply needs to collect a reasonably large number of samples to produce meaningful results.
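
If a short-running scenario has to be profiled with Perf anyway, raising the sampling frequency can partially compensate for the short runtime. A hedged sketch (-F sets the number of samples per second; the kernel may cap it via the perf_event_max_sample_rate sysctl):

sudo perf record -F 20000 <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double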

Source File Location

Callgrind provides information about the file where the source code of a function is located, unlike Perf, which doesn’t provide this kind of data.

Furthermore, it is possible (for example, when a function contains parts that are inlined from functions defined in other source files) that the code of a single function effectively spans multiple source files. Callgrind makes this distinction and reports such parts separately, while Perf reports just a single item for such a function. The most notable example is helper_lookup_tb_ptr().
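
To collect all pieces that Callgrind attributes to a single function, the annotated output can simply be filtered by the function name. A sketch using the helper_lookup_tb_ptr() example above:

callgrind_annotate callgrind.out.<pid> | grep helper_lookup_tb_ptr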

JIT-ed Code Execution

Perf provides highly granular data on JIT-ed code execution, while Callgrind sums all such cases into one item.

In the previous example, these items can be found in the Perf output:

   13       0.50%  0x00007f7b7004ef95         [JIT] tid 491
   14       0.50%  0x00007f7b70044e83         [JIT] tid 491
...
   17       0.30%  0x00007f7b700433f0         [JIT] tid 491
...
   21       0.25%  0x00007f7b7004c240         [JIT] tid 491
   22       0.25%  0x00007f7b70049900         [JIT] tid 491
   23       0.20%  0x00007f7b700496c0         [JIT] tid 491
   24       0.20%  0x00007f7b7004c000         [JIT] tid 491
   25       0.20%  0x00007f7b7004efbe         [JIT] tid 491

While in Callgrind, there is a single item:

    1     24.577%  0x00000000082db000         ???

Depending on what one wants to do with such data, this can be an advantage or a disadvantage. In the next report, the fact that Callgrind sums up JIT-ed code execution will be used to extract some additional interesting performance metrics for QEMU.

Percentages of Individual Items

Let’s examine all items that surpassed 3% in either the Perf or the Callgrind results:

Function name         Perf     Callgrind
--------------------  -------  ---------
float64_mul           16.25%   20.467%
float64_sub           12.01%   14.720%
float64_add           11.99%   13.864%
helper_mulsd           5.69%    4.876%
helper_addsd           4.68%    3.549%
helper_lookup_tb_ptr   4.43%    4.525%*
helper_subsd           4.28%    3.767%

* The Callgrind percentage for helper_lookup_tb_ptr is obtained by summing up its several items: 1.667% + 1.509% + 0.635% + 0.476% + 0.238% = 4.525%.

It can be seen that the individual results are quite different. However, the relative relationships between individual items are approximately the same for both tools.

For a performance engineer, the differences shown above do not pose a significant problem. Performance improvement workflows usually focus on the use of a single tool, and both Perf and Callgrind can serve that purpose.

In general, a more important factor when judging the usability of a performance tool is its ability to provide the same or very similar results across multiple measurements of the identical scenario. This factor, called stability for the purpose of this report, is examined in more depth in the following section.

Stability of Perf and Callgrind Results

Idea of the Experiment

Stability can be defined as the ability to provide nearly identical results with each run of the profiler.

A simple Python script is used to compare the stability of Callgrind vs. Perf, but first, the Coulomb program is executed with Callgrind once and with Perf three times. This time, the -r, --repeat <n> Perf flag is utilized. It repeats the Perf execution n times and prints the average of all events.

Stability Experiment

This is a Bash script that performs 20 iterations, each executing the Coulomb benchmark under callgrind, perf, perf -r 10 and perf -r 100:

mkdir output
for ((i = 0; i < 20; i++)); do
    valgrind --tool=callgrind <qemu-build>/x86_64-linux-user/qemu-x86_64 ./coulomb_double 2>>./output/out$i.txt &&
        sudo perf stat -e instructions <qemu-build>/x86_64-linux-user/qemu-x86_64 ./coulomb_double 2>>./output/out$i.txt &&
        sudo perf stat -e instructions -r 10 <qemu-build>/x86_64-linux-user/qemu-x86_64 ./coulomb_double 2>>./output/out$i.txt &&
        sudo perf stat -e instructions -r 100 <qemu-build>/x86_64-linux-user/qemu-x86_64 ./coulomb_double 2>>./output/out$i.txt
done

This is a Python script that extracts the instruction counts from each run and outputs a CSV file with the measurements, as well as the average, standard deviation and coefficient of variation of all 20 executions for each of the four methods:

from os import listdir
import csv
import statistics

output_files = listdir('output')
run = 1
results = []

# Extract the instruction counts from each output file. lines[10] holds the
# callgrind count, while lines[14], lines[25] and lines[32] hold the perf,
# perf -r 10 and perf -r 100 counts, respectively (the indices depend on the
# exact layout of the redirected tool output).
for file in output_files:
    with open('output/' + file, "r") as target:
        lines = target.readlines()
        results.append([run,
                        lines[10].split()[3].replace(',', ' '),
                        lines[14].split()[0].replace(',', ' '),
                        lines[25].split()[0].replace(',', ' '),
                        lines[32].split()[0].replace(',', ' ')])
        run += 1

callgrind_results = [int(result[1].replace(' ', '')) for result in results]
callgrind_mean = statistics.mean(callgrind_results)
callgrind_stdev = statistics.stdev(callgrind_results)
callgrind_CV = (callgrind_stdev/callgrind_mean) * 100

perf_results = [int(result[2].replace(' ', '')) for result in results]
perf_mean = statistics.mean(perf_results)
perf_stdev = statistics.stdev(perf_results)
perf_CV = (perf_stdev / perf_mean) * 100

perf_10_results = [int(result[3].replace(' ', '')) for result in results]
perf_10_mean = statistics.mean(perf_10_results)
perf_10_stdev = statistics.stdev(perf_10_results)
perf_10_CV = (perf_10_stdev / perf_10_mean) * 100

perf_100_results = [int(result[4].replace(' ', '')) for result in results]
perf_100_mean = statistics.mean(perf_100_results)
perf_100_stdev = statistics.stdev(perf_100_results)
perf_100_CV = (perf_100_stdev/perf_100_mean) * 100

with open('output.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Run", "callgrind", "perf",
                     "perf -r 10", "perf -r 100"])
    for result in results:
        writer.writerow(result)
    writer.writerow(["Avg", callgrind_mean, perf_mean,
                     perf_10_mean, perf_100_mean])
    writer.writerow(["σ", callgrind_stdev, perf_stdev,
                     perf_10_stdev, perf_100_stdev])
    writer.writerow(["σ (%)", callgrind_CV, perf_CV,
                     perf_10_CV, perf_100_CV])
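
Assuming the script above is saved as stability.py (a hypothetical name) next to the output directory, it can be run and its CSV result pretty-printed as follows:

python3 stability.py
column -s, -t output.csv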

Results of the Experiment

  Run  callgrind      perf           perf -r 10     perf -r 100
-----  -------------  -------------  -------------  -------------
    1  8,197,479,927  8,185,411,320  8,185,345,224  8,185,114,468
    2  8,197,479,919  8,185,195,094  8,184,991,767  8,184,923,223
    3  8,197,479,842  8,184,968,958  8,185,206,200  8,185,084,344
    4  8,197,479,842  8,185,162,965  8,185,236,582  8,184,992,738
    5  8,197,479,919  8,185,173,560  8,185,230,524  8,185,177,606
    6  8,197,479,919  8,185,107,760  8,185,089,134  8,184,985,563
    7  8,197,479,927  8,185,242,718  8,185,067,722  8,184,933,899
    8  8,197,479,919  8,185,259,192  8,185,356,351  8,185,202,083
    9  8,197,479,927  8,185,176,054  8,184,955,923  8,185,048,778
   10  8,197,479,842  8,185,062,738  8,185,061,203  8,185,148,177
   11  8,197,479,842  8,185,065,330  8,184,952,823  8,184,926,132
   12  8,197,479,852  8,185,370,367  8,185,303,501  8,185,125,697
   13  8,197,479,927  8,185,115,629  8,184,950,032  8,185,001,142
   14  8,197,479,842  8,186,555,638  8,185,187,143  8,185,035,473
   15  8,197,481,000  8,185,159,045  8,185,238,812  8,185,032,286
   16  8,197,479,961  8,185,170,493  8,185,029,578  8,185,120,424
   17  8,197,479,842  8,185,125,155  8,184,990,595  8,184,922,657
   18  8,197,480,006  8,187,323,418  8,185,326,603  8,185,092,509
   19  8,197,479,919  8,185,454,074  8,185,320,982  8,185,173,455
   20  8,197,481,000  8,185,164,000  8,185,206,455  8,185,044,700
  Avg  8,197,480,009  8,185,363,175  8,185,152,358  8,185,054,268
    σ        342.272    565,489.783    144,003.202     89,808.068
σ (%)      0.0000041      0.0069085      0.0017593      0.0010972

From the previous experiment, it can be seen that σ (%), the coefficient of variation, is approximately 0.0000041% for Callgrind. For Perf, it is 0.0069085% when perf stat is used without any -r switch. It decreases to 0.0017593% for -r 10, and further to 0.0010972% for -r 100. The maximal value that can be specified after -r is 100, so this is the maximal stability that can be achieved using Perf.

It can be concluded that despite (or perhaps, better said, because of) the slow execution time of Callgrind, it gives very stable results. The stability of Perf results improves as the repetition count increases, but it still doesn’t reach a coefficient of variation as low as Callgrind’s even with the maximum possible repetitions (-r 100).

Resources

If you want to learn more about Perf and Callgrind, please check the resources section below.

Perf Resources

  • Official Perf Wiki

    The official Perf wiki offers a detailed step-by-step tutorial on the usage of Perf. It lists all of the measurable hardware and software events, as well as multiple examples of Perf command usage.

  • Performance Lab: The Power of The Perf Tool - Arnaldo Melo and Jiri Olsa

    In this talk, the speakers show how to build Perf from source and how it can be used to detect and hunt down numerous performance issues. They also cover examples of some interesting Perf features and their favorite usage tips.

  • Linux Perf for Qt Developers - Milian Wolff

    In this talk, the speaker gives a detailed introduction to Perf showing how to use it to find CPU hotspots in the code, as well as some tricks to profile wait times for lock contention issues or disk I/O. He also dives into details on how it is applicable to Qt developers in particular.

Callgrind Resources

  • Callgrind Official Manual

    The official Callgrind manual gives an overview of Callgrind. It provides a guide for basic and advanced usage of the tool, as well as a detailed description of all Callgrind command line arguments.

  • Callgrind Output Format Manual

    This manual covers the internal structure of the Callgrind output file, which was inspected with callgrind_annotate in this report. It includes simple and extended examples of such files as well as the complete grammar of the format.

  • Stanford CS107 Callgrind Guide

    This guide offers a quick introduction to get you up and running with Callgrind. It covers basic usage cases as well as some tips and tricks.



Appendix

Installing Perf

CentOS & RHEL

sudo yum install perf

Fedora

sudo dnf install perf

Arch

sudo pacman -S perf

Debian & Derivatives

sudo apt update && sudo apt install linux-tools-$(uname -r) linux-tools-generic

The uname -r command is used to provide the Linux kernel version instead of writing it manually in the installation command.
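
For instance, on a machine whose kernel release is 5.4.0-33-generic (an arbitrary, illustrative value), the command above expands to:

sudo apt update && sudo apt install linux-tools-5.4.0-33-generic linux-tools-generic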

Installing Valgrind

CentOS & RHEL

sudo yum install valgrind

Fedora

sudo dnf install valgrind

Arch

sudo pacman -S valgrind

Debian & Derivatives

sudo apt install valgrind

Apart from the procedures mentioned above, and for curious and advanced users, or simply those wishing for the latest and greatest, both Perf and Valgrind can also be built and installed from their source code.
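
A rough sketch of what such builds might look like (paths and version numbers are illustrative; Perf lives in the Linux kernel source tree under tools/perf, while Valgrind releases ship a standard configure/make build):

# Perf, from a Linux kernel source tree (produces a ./perf binary)
cd linux/tools/perf
make

# Valgrind (Callgrind included), from a release tarball
cd valgrind-3.XX.X
./configure
make
sudo make install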

