Dissecting QEMU Into Three Main Parts

Code Generation, JIT Execution, and Helpers Execution

Ahmed Karaman - June 29, 2020

Intro

The previous report presented an overview of measuring basic performance metrics of QEMU, and one of these metrics, naturally, was the total number of executed host instructions. This report further utilizes Callgrind to break that total down into the numbers corresponding to the three main parts of QEMU operation: code generation, JIT-ed code execution, and helpers execution.


Breaking Down QEMU Execution Phases

Execution of an instance of QEMU can be split into three main parts: code generation, JIT execution, and helpers execution. Code generation is often referred to as “translation time” (the target code is translated to intermediate code and, in turn, to host code), while JIT execution and helpers execution are often referred to as “execution time” (host code is being executed). JIT and helpers execution are thus similar in the sense that they both execute host code; however, since their origin and internal organization are very different, it is useful to distinguish between the two.

There are some other parts of QEMU that are not taken into account here - for example, initialization of QEMU itself. However, for all intents and purposes, and for measuring emulation of all but the smallest benchmarks, these parts are negligible and not a subject of interest for this report. For example, QEMU initialization will be counted as part of code generation in the calculations of this report, but that still does not impact the accuracy of the results in any substantial way.

The three parts of QEMU execution mentioned above are not, of course, executed sequentially like phases - their execution is interleaved. However, it is still useful to know how much work each part does separately. As its key contribution, this report presents a script called dissect.py that prints the total number of instructions spent in each of these three QEMU parts.

The script is available on the project GitHub page.

Example of Usage

The same Coulomb benchmark from the previous report is used. It can be compiled on an x86_64 Linux machine with:

gcc -static -O2 coulomb_double.c -o coulomb_double -lm

And then the dissect.py script can be invoked using:

./dissect.py -- <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double

The script displays the total number of instructions, and then divides this number into the three components:

Total Instructions:        4,702,865,362

Code Generation:             115,819,309	 2.463%
JIT Execution:             1,081,980,528	23.007%
Helpers:                   3,505,065,525	74.530%
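
As a quick sanity check, the three parts add up exactly to the total number of instructions:

# The three parts fully account for the run:
assert 115_819_309 + 1_081_980_528 + 3_505_065_525 == 4_702_865_362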

Principle of Operation

Callgrind distinguishes two measures for each function: “self” (covering execution only within the function itself) and “inclusive” (covering execution in both the function and all of its callees, to any depth). For example, if main() itself executes 10 instructions and calls work(), which executes 90, then main()’s “self” count is 10 while its “inclusive” count is 100. There is also a --tree option of callgrind_annotate that the dissect.py script makes essential use of.

Firstly, the script executes the passed QEMU invocation command under Callgrind. Secondly, it runs callgrind_annotate with the --tree=caller flag to print the callers of each function. The calculation for each part is then done as follows (a concrete sketch follows the list):

  • The number of “self” instructions for the JIT-ed code can be obtained directly.
  • The number of instructions spent in helpers is calculated by subtracting the “self” number for the JIT-ed code from the corresponding “inclusive” number.
  • The number of code generation instructions is obtained by subtracting the “inclusive” number for the JIT-ed code from the program’s total number of instructions.
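
To make the arithmetic concrete, below is a minimal Python sketch of the calculation, using the numbers from the example output above (the “inclusive” JIT count is reconstructed from them; the variable names are illustrative and are not taken from dissect.py):

# The script roughly performs two steps (the exact flags used by dissect.py
# may differ):
#   valgrind --tool=callgrind <qemu-invocation-command>
#   callgrind_annotate --tree=caller <callgrind-output-file>
# From the annotated output, two counts are read for the JIT-ed code:
total_instructions = 4_702_865_362  # program total reported by Callgrind
jit_inclusive = 4_587_046_053       # JIT-ed code plus everything it calls
jit_self = 1_081_980_528            # JIT-ed code alone

jit_execution = jit_self                              # host code emitted by TCG
helpers = jit_inclusive - jit_self                    # called from the JIT-ed code
code_generation = total_instructions - jit_inclusive  # the rest (incl. QEMU init)

for name, count in [("Code Generation", code_generation),
                    ("JIT Execution", jit_execution),
                    ("Helpers", helpers)]:
    print("{:<17}{:>16,}\t{:.3%}".format(name + ":",
                                         count,
                                         count / total_instructions))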

Comparing 17 Targets of QEMU

Overview

One very handy usage of the dissect.py script is to compare how QEMU performs in each of its three phases across different targets. To perform this task, a small helper Python script is used:

import csv
import os
import subprocess


############### Script Options ###############
qemu_build_path = "<qemu-build>"
benchmark_args = ["-n", "1000"]
targets = {
    "aarch64":  "aarch64-linux-gnu-gcc",
    "alpha":    "alpha-linux-gnu-gcc",
    "arm":      "arm-linux-gnueabi-gcc",
    "hppa":     "hppa-linux-gnu-gcc",
    "m68k":     "m68k-linux-gnu-gcc",
    "mips":     "mips-linux-gnu-gcc",
    "mipsel":   "mipsel-linux-gnu-gcc",
    "mips64":   "mips64-linux-gnuabi64-gcc",
    "mips64el": "mips64el-linux-gnuabi64-gcc",
    "ppc":      "powerpc-linux-gnu-gcc",
    "ppc64":    "powerpc64-linux-gnu-gcc",
    "ppc64le":  "powerpc64le-linux-gnu-gcc",
    "riscv64":  "riscv64-linux-gnu-gcc",
    "s390x":    "s390x-linux-gnu-gcc",
    "sh4":      "sh4-linux-gnu-gcc",
    "sparc64":  "sparc64-linux-gnu-gcc",
    "x86_64":   "gcc"
}
##############################################

# Store dissect.py output for each target
targets_data = []
for target_name, target_compiler in targets.items():
    print("Measuring instructions for target: " + target_name)
    compile_target = subprocess.run([target_compiler,
                                     "-O2",
                                     "-static",
                                     "coulomb_double.c",
                                     "-lm",
                                     "-o",
                                     "/tmp/coulomb_double"])
    dissect_target = subprocess.run((["./dissect.py",
                                      "--",
                                      "{}/{}-linux-user/qemu-{}".format(qemu_build_path,
                                                                        target_name,
                                                                        target_name),
                                      "/tmp/coulomb_double"] + benchmark_args),
                                    stdout=subprocess.PIPE)
    os.unlink("/tmp/coulomb_double")
    # Read the dissect output
    lines = dissect_target.stdout.decode("utf-8").split('\n')
    # Extract measurements
    total_instructions = lines[0].split()[-1]
    code_generation_percentage = lines[2].split()[-1]
    jit_execution_percentage = lines[3].split()[-1]
    helpers_execution_percentage = lines[4].split()[-1]
    # Save measurements to the targets_data list
    targets_data.append([target_name,
                         total_instructions,
                         code_generation_percentage,
                         jit_execution_percentage,
                         helpers_execution_percentage])

# Save output to CSV
csv_headers = ["Target", "Total Instructions",
               "Code Generation %", "JIT Execution %", "Helpers %"]
with open("dissect_targets.csv", "w") as csv_file:
    # Declare the writer
    writer = csv.writer(csv_file)
    # write CSV file header names
    writer.writerow(csv_headers)
    # For each target, write its collected measurements
    for target in targets_data:
        writer.writerow(target)

After being provided with the required options, the script compiles the Coulomb benchmark for each target and then runs dissect.py on the compiled executable. The results are saved in a CSV file.
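
For a quick look at the collected data, the CSV can be printed with a few lines of Python (a sketch that assumes the dissect_targets.csv file written by the script above):

import csv

# Print each row of the results table with aligned columns
with open("dissect_targets.csv") as csv_file:
    for row in csv.reader(csv_file):
        print("{:<10}{:>20}{:>20}{:>18}{:>12}".format(*row))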

Results

Target      Total Instructions   Code Generation %   JIT Execution %   Helpers Execution %
aarch64          4,692,357,988              2.758%           32.437%               64.804%
alpha           10,804,422,926              0.958%           11.042%               88.000%
arm             39,325,544,973              0.483%           76.003%               23.514%
hppa            12,005,435,084              0.975%            8.988%               90.037%
m68k             7,266,676,762              1.116%            5.904%               92.980%
mips            10,440,969,560              1.366%           10.643%               87.990%
mipsel          11,715,714,129              1.247%           10.012%               88.741%
mips64          10,337,898,389              1.409%            9.790%               88.801%
mips64el        11,596,334,956              1.281%            9.118%               89.601%
ppc             12,713,132,146              1.115%           10.215%               88.671%
ppc64           12,716,587,866              1.122%            9.760%               89.119%
ppc64le         12,694,752,808              1.118%            9.611%               89.271%
riscv64          4,149,509,947              5.626%           19.113%               75.261%
s390x           10,946,821,241              0.843%            8.850%               90.307%
sh4             12,728,200,623              1.344%           18.057%               80.598%
sparc64         11,979,151,647              5.634%           12.907%               81.459%
x86_64           4,703,175,766              2.469%           23.005%               74.526%

Discussion of Results

The table above offers a lot of material for discussion and exploration. For now, only a couple of observations will be touched on.

mips/mips64 vs mipsel/mips64el

There is one intriguing thing about the mips targets: the big endian versions are faster than the little endian versions. This is somewhat counterintuitive, since the host is an Intel machine, a little endian system. Let’s see the top 15 functions for the mips target:

./topN_callgrind.py -n 15 -- <qemu-build>/mips-linux-user/qemu-mips coulomb_double-mips

Results:

   No.  Percentage  Function Name                   Source File
  ----  ----------  ------------------------------  ------------------------------
     1     21.974%  soft_f64_addsub                 <qemu>/fpu/softfloat.c
     2     16.445%  soft_f64_mul                    <qemu>/fpu/softfloat.c
     3     10.643%  0x0000000008664000              ???
     4      6.685%  ieee_ex_to_mips.part.2          <qemu>/target/mips/fpu_helper.c
     5      6.340%  soft_f64_mul                    <qemu>/include/fpu/softfloat-macros.h
     6      3.312%  float64_add                     <qemu>/fpu/softfloat.c
     7      3.284%  helper_float_mul_d              <qemu>/target/mips/fpu_helper.c
     8      3.274%  soft_f64_addsub                 <qemu>/include/qemu/bitops.h
     9      3.197%  helper_float_madd_d             <qemu>/target/mips/fpu_helper.c
    10      3.011%  helper_float_sub_d              <qemu>/target/mips/fpu_helper.c
    11      2.753%  helper_float_add_d              <qemu>/target/mips/fpu_helper.c
    12      2.676%  soft_f64_mul                    <qemu>/include/qemu/bitops.h
    13      2.454%  soft_f64_addsub                 <qemu>/include/fpu/softfloat-macros.h
    14      1.606%  float64_sub                     <qemu>/fpu/softfloat.c
    15      1.190%  helper_cmp_d_lt                 <qemu>/target/mips/fpu_helper.c

And for the mipsel target:

./topN_callgrind.py -n 15 -- <qemu-build>/mipsel-linux-user/qemu-mipsel coulomb_double-mipsel

Results:

   No.  Percentage  Function Name                   Source File
  ----  ----------  ------------------------------  ------------------------------
     1     26.635%  soft_f64_addsub                 <qemu>/fpu/softfloat.c
     2     14.656%  soft_f64_mul                    <qemu>/fpu/softfloat.c
     3     10.012%  0x0000000008664000              ???
     4      7.559%  ieee_ex_to_mips.part.2          <qemu>/target/mips/fpu_helper.c
     5      5.650%  soft_f64_mul                    <qemu>/include/fpu/softfloat-macros.h
     6      5.584%  helper_float_mul_d              <qemu>/target/mips/fpu_helper.c
     7      4.603%  helper_float_add_d              <qemu>/target/mips/fpu_helper.c
     8      3.929%  soft_f64_addsub                 <qemu>/include/qemu/bitops.h
     9      3.299%  soft_f64_addsub                 <qemu>/include/fpu/softfloat-macros.h
    10      3.247%  helper_float_sub_d              <qemu>/target/mips/fpu_helper.c
    11      2.385%  soft_f64_mul                    <qemu>/include/qemu/bitops.h
    12      1.060%  helper_cmp_d_lt                 <qemu>/target/mips/fpu_helper.c
    13      1.036%  float64_lt                      <qemu>/fpu/softfloat.c
    14      0.946%  float64_add                     <qemu>/fpu/softfloat.c
    15      0.901%  soft_f64_div                    <qemu>/fpu/softfloat.c

From the two lists above, it is visible that, for some reason beyond QEMU, the big endian mips target uses multiply-add instructions, while the little endian mips target uses a separate multiply instruction and a separate add instruction. This can be concluded from the presence of helper_float_madd_d in the big endian case only. This means that the corresponding helpers are different. Moreover, the number of executed helpers will also be different - fewer helpers will be called in the big endian case, since each fused multiply-add replaces a pair of separate helper calls. Numerically, the outcome will be accurate in both cases; however, the number of invoked helpers matters, resulting in better overall performance of the big endian mips target.

This is not really a QEMU issue; it could be argued that the cross compiler for mips exhibits strange differences between the big endian and little endian cases.

m68k (non-RISC) target vs RISC targets

The m68k instruction set is not a RISC set in the strict sense. For example, it contains an instruction that calculates the mathematical function sin(), which has no counterpart in, for example, the mips or arm instruction sets. Again, let’s examine the top 15 functions for the m68k case:

./topN_callgrind.py -n 15 -- <qemu-build>/m68k-linux-user/qemu-m68k coulomb_double-m68k

Results:

   No.  Percentage  Function Name                   Source File
  ----  ----------  ------------------------------  ------------------------------
     1     21.128%  roundAndPackFloatx80            <qemu>/fpu/softfloat.c
     2      6.646%  floatx80_mul                    <qemu>/fpu/softfloat.c
     3      5.904%  0x00000000082db000              ???
     4      5.542%  floatx80_mul                    <qemu>/include/fpu/softfloat-macros.h
     5      3.958%  subFloatx80Sigs                 <qemu>/fpu/softfloat.c
     6      3.780%  helper_ftst                     <qemu>/target/m68k/fpu_helper.c
     7      3.739%  float64_to_floatx80             <qemu>/fpu/softfloat.c
     8      3.528%  addFloatx80Sigs                 <qemu>/fpu/softfloat.c
     9      2.447%  floatx80_div                    <qemu>/include/fpu/softfloat-macros.h
    10      2.437%  floatx80_mul                    <qemu>/include/fpu/softfloat.h
    11      2.136%  subFloatx80Sigs                 <qemu>/include/fpu/softfloat-macros.h
    12      2.072%  roundAndPackFloat64             <qemu>/fpu/softfloat.c
    13      1.900%  floatx80_sin                    <qemu>/target/m68k/softfloat.c
    14      1.890%  helper_ftst                     <qemu>/include/fpu/softfloat.h
    15      1.884%  floatx80_cos                    <qemu>/target/m68k/softfloat.c

The m68k target has fewer instructions to translate, but some of them require complex softfloat helpers (displayed above), whereas mips, for example, has many more instructions to translate and execute, but the great majority of them are, basically, additions and multiplications that require relatively simple softfloat helpers.

All in all, the m68k approach seems to be more efficient from the standpoint of QEMU performance. There is nothing to improve in that sense for, let’s say, mips - this is just an inherent consequence of the differences between the m68k and mips instruction sets.
