Intro
The previous report presented an overview of measuring basic performance metrics of QEMU, and one of these metrics, naturally, was the total number of executed host instructions. This report further utilizes Callgrind to break down that total number into numbers that correspond to three main parts of QEMU operation: code generation, JIT-ed code execution, and helpers execution.
Breaking Down QEMU Execution Phases
Execution of an instance of QEMU can be split into three main parts: code generation, JIT execution, and helpers execution. Code generation is often referred to as “translation time” (the target code is translated to intermediate code and, in turn, to host code), while JIT execution and helpers execution are often referred to as “execution time” (host code is being executed). JIT and helpers execution are thus similar in the sense that both execute host code; however, since their origin and internal organization are very different, it is useful to distinguish between the two.
There are some other parts of QEMU execution that are not taken into account here - for example, the initialization of QEMU itself. However, for all practical purposes, and for benchmarks of all but the smallest sizes, these parts are negligible and not a subject of interest of this report. For example, QEMU initialization is counted as part of code generation in the calculations of this report, but this does not affect the accuracy of the results in any substantial way.
The three parts of QEMU execution mentioned above are not, of course, executed sequentially like distinct phases - their execution is interleaved. However, it is still useful to obtain information about each part separately. This report presents, as its key idea, a script called dissect.py that prints the total number of instructions spent in each of these three QEMU parts. The script is available on the project GitHub page.
Example of Usage
The same Coulomb benchmark from the previous report can be compiled on an x86_64 Linux machine using:
```
gcc -static -O2 coulomb_double.c -o coulomb_double -lm
```
The dissect.py script can then be invoked using:
```
./dissect.py -- <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double
```
The script displays the total number of instructions, and then divides this number into the three components:
```
Total Instructions:     4,702,865,362

Code Generation:          115,819,309     2.463%
JIT Execution:          1,081,980,528    23.007%
Helpers:                3,505,065,525    74.530%
```
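Note that the three components add up exactly to the total: 115,819,309 + 1,081,980,528 + 3,505,065,525 = 4,702,865,362 instructions - a useful sanity check on the dissection.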
Principle of Operation
Callgrind distinguishes two measures for each function: “self” (covering execution within the function itself only) and “inclusive” (covering execution in both the function and all of its callees, to any depth). In addition, the --tree option of the callgrind_annotate utility is used in an important fashion by the dissect.py script.
Firstly, the script executes the passed QEMU invocation command under Callgrind. Secondly, it executes callgrind_annotate with the --tree=caller flag to print the callers of each function. The calculation for each part is then done as follows:
- The number of “self” instructions for the JIT execution can be obtained directly.
- The number of instructions spent in helpers is calculated by subtracting the “self” number of the JIT from the corresponding “inclusive” number.
- The number of code generation instructions is obtained by subtracting the “inclusive” number of the JIT from the program’s total number of instructions.
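As a concrete illustration of this arithmetic, below is a minimal sketch in Python using the numbers from the x86_64 example above. This is not the actual dissect.py source: the variable names are illustrative, and the JIT “inclusive” count is inferred from the example output (“self” plus helpers), since only its components are shown there.
```python
# A minimal sketch of the dissection arithmetic - not the actual
# dissect.py source. In the real script, these counts are parsed from
# the output of callgrind_annotate --tree=caller.
total_instructions = 4_702_865_362  # program total reported by Callgrind
jit_self = 1_081_980_528            # "self" count of the JIT-ed code
jit_inclusive = 4_587_046_053       # "inclusive" count of the JIT-ed code

# Helpers are callees of the JIT-ed code, so their cost is the part of
# the JIT "inclusive" count that is not "self".
helpers = jit_inclusive - jit_self
# Everything outside the JIT "inclusive" count is code generation
# (plus the negligible QEMU initialization mentioned earlier).
code_generation = total_instructions - jit_inclusive

for name, count in [("Code Generation", code_generation),
                    ("JIT Execution", jit_self),
                    ("Helpers", helpers)]:
    print("{}: {:,} {:.3f}%".format(name, count,
                                    count * 100.0 / total_instructions))
```
Running this sketch reproduces the three component lines of the example output above.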
Comparing 17 Targets of QEMU
Overview
One very handy usage of the dissect.py script is to compare how QEMU performs in each of its three phases across different targets. To perform this task, a small helper Python script is used:
```python
import csv
import os
import subprocess

############### Script Options ###############
qemu_build_path = "<qemu-build>"
benchmark_args = ["-n", "1000"]
targets = {
    "aarch64": "aarch64-linux-gnu-gcc",
    "alpha": "alpha-linux-gnu-gcc",
    "arm": "arm-linux-gnueabi-gcc",
    "hppa": "hppa-linux-gnu-gcc",
    "m68k": "m68k-linux-gnu-gcc",
    "mips": "mips-linux-gnu-gcc",
    "mipsel": "mipsel-linux-gnu-gcc",
    "mips64": "mips64-linux-gnuabi64-gcc",
    "mips64el": "mips64el-linux-gnuabi64-gcc",
    "ppc": "powerpc-linux-gnu-gcc",
    "ppc64": "powerpc64-linux-gnu-gcc",
    "ppc64le": "powerpc64le-linux-gnu-gcc",
    "riscv64": "riscv64-linux-gnu-gcc",
    "s390x": "s390x-linux-gnu-gcc",
    "sh4": "sh4-linux-gnu-gcc",
    "sparc64": "sparc64-linux-gnu-gcc",
    "x86_64": "gcc"
}
##############################################

# Store dissect.py output for each target
targets_data = []
for target_name, target_compiler in targets.items():
    print("Measuring instructions for target: " + target_name)
    # Compile the benchmark for the target
    compile_target = subprocess.run([target_compiler,
                                     "-O2",
                                     "-static",
                                     "coulomb_double.c",
                                     "-lm",
                                     "-o",
                                     "/tmp/coulomb_double"])
    # Run dissect.py on the compiled executable
    dissect_target = subprocess.run((["./dissect.py",
                                      "--",
                                      "{}/{}-linux-user/qemu-{}".format(qemu_build_path,
                                                                        target_name,
                                                                        target_name),
                                      "/tmp/coulomb_double"] + benchmark_args),
                                    stdout=subprocess.PIPE)
    os.unlink("/tmp/coulomb_double")
    # Read the dissect output
    lines = dissect_target.stdout.decode("utf-8").split('\n')
    # Extract measurements (line 1 of the output is blank, hence the indices)
    total_instructions = lines[0].split()[-1]
    code_generation_percentage = lines[2].split()[-1]
    jit_execution_percentage = lines[3].split()[-1]
    helpers_execution_percentage = lines[4].split()[-1]
    # Save measurements to the targets_data list
    targets_data.append([target_name,
                         total_instructions,
                         code_generation_percentage,
                         jit_execution_percentage,
                         helpers_execution_percentage])

# Save output to CSV
csv_headers = ["Target", "Total Instructions",
               "Code Generation %", "JIT Execution %", "Helpers %"]
with open("dissect_targets.csv", "w") as csv_file:
    # Declare the writer
    writer = csv.writer(csv_file)
    # Write the CSV header names
    writer.writerow(csv_headers)
    # For each target, write its collected measurements
    for target in targets_data:
        writer.writerow(target)
```
After being provided with the required options, the script compiles the Coulomb benchmark for each target and then runs dissect.py on the resulting executable. The results are saved to a CSV file.
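As a small follow-up, the generated CSV can be read back and the targets ranked by instruction count - a minimal sketch, assuming the dissect_targets.csv file produced by the script above (note that the totals are stored comma-formatted, exactly as dissect.py prints them):
```python
import csv

# Read back the measurements written by the helper script above and
# list the targets from fewest to most executed instructions.
with open("dissect_targets.csv") as csv_file:
    rows = list(csv.DictReader(csv_file))

# Strip the thousands separators (e.g. "4,702,865,362") before sorting.
rows.sort(key=lambda row: int(row["Total Instructions"].replace(",", "")))
for row in rows:
    print("{:10} {:>16} total instructions, helpers: {}".format(
        row["Target"], row["Total Instructions"], row["Helpers %"]))
```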
Results
| Target | Total Instructions | Code Generation % | JIT Execution % | Helpers Execution % |
| --- | --- | --- | --- | --- |
| aarch64 | 4 692 357 988 | 2.758% | 32.437% | 64.804% |
| alpha | 10 804 422 926 | 0.958% | 11.042% | 88.000% |
| arm | 39 325 544 973 | 0.483% | 76.003% | 23.514% |
| hppa | 12 005 435 084 | 0.975% | 8.988% | 90.037% |
| m68k | 7 266 676 762 | 1.116% | 5.904% | 92.980% |
| mips | 10 440 969 560 | 1.366% | 10.643% | 87.990% |
| mipsel | 11 715 714 129 | 1.247% | 10.012% | 88.741% |
| mips64 | 10 337 898 389 | 1.409% | 9.790% | 88.801% |
| mips64el | 11 596 334 956 | 1.281% | 9.118% | 89.601% |
| ppc | 12 713 132 146 | 1.115% | 10.215% | 88.671% |
| ppc64 | 12 716 587 866 | 1.122% | 9.760% | 89.119% |
| ppc64le | 12 694 752 808 | 1.118% | 9.611% | 89.271% |
| riscv64 | 4 149 509 947 | 5.626% | 19.113% | 75.261% |
| s390x | 10 946 821 241 | 0.843% | 8.850% | 90.307% |
| sh4 | 12 728 200 623 | 1.344% | 18.057% | 80.598% |
| sparc64 | 11 979 151 647 | 5.634% | 12.907% | 81.459% |
| x86_64 | 4 703 175 766 | 2.469% | 23.005% | 74.526% |
Discussion of Results
The table above offers a lot of material for discussion and exploration. For now, only a couple of observations will be discussed.
mips/mips64 vs mipsel/mips64el
There is one intriguing thing about the mips targets: the big endian versions are faster than the little endian versions. This is somewhat counterintuitive, since the host is an Intel machine - a little endian system. Let’s look at the top 15 functions for the mips target:
```
./topN_callgrind.py -n 15 -- <qemu-build>/mips-linux-user/qemu-mips coulomb_double-mips
```
Results:
```
 No.  Percentage  Function Name           Source File
----  ----------  ----------------------  -------------------------------------
   1     21.974%  soft_f64_addsub         <qemu>/fpu/softfloat.c
   2     16.445%  soft_f64_mul            <qemu>/fpu/softfloat.c
   3     10.643%  0x0000000008664000      ???
   4      6.685%  ieee_ex_to_mips.part.2  <qemu>/target/mips/fpu_helper.c
   5      6.340%  soft_f64_mul            <qemu>/include/fpu/softfloat-macros.h
   6      3.312%  float64_add             <qemu>/fpu/softfloat.c
   7      3.284%  helper_float_mul_d      <qemu>/target/mips/fpu_helper.c
   8      3.274%  soft_f64_addsub         <qemu>/include/qemu/bitops.h
   9      3.197%  helper_float_madd_d     <qemu>/target/mips/fpu_helper.c
  10      3.011%  helper_float_sub_d      <qemu>/target/mips/fpu_helper.c
  11      2.753%  helper_float_add_d      <qemu>/target/mips/fpu_helper.c
  12      2.676%  soft_f64_mul            <qemu>/include/qemu/bitops.h
  13      2.454%  soft_f64_addsub         <qemu>/include/fpu/softfloat-macros.h
  14      1.606%  float64_sub             <qemu>/fpu/softfloat.c
  15      1.190%  helper_cmp_d_lt         <qemu>/target/mips/fpu_helper.c
```
And for the mipsel target:
```
./topN_callgrind.py -n 15 -- <qemu-build>/mipsel-linux-user/qemu-mipsel coulomb_double-mipsel
```
Results:
```
 No.  Percentage  Function Name           Source File
----  ----------  ----------------------  -------------------------------------
   1     26.635%  soft_f64_addsub         <qemu>/fpu/softfloat.c
   2     14.656%  soft_f64_mul            <qemu>/fpu/softfloat.c
   3     10.012%  0x0000000008664000      ???
   4      7.559%  ieee_ex_to_mips.part.2  <qemu>/target/mips/fpu_helper.c
   5      5.650%  soft_f64_mul            <qemu>/include/fpu/softfloat-macros.h
   6      5.584%  helper_float_mul_d      <qemu>/target/mips/fpu_helper.c
   7      4.603%  helper_float_add_d      <qemu>/target/mips/fpu_helper.c
   8      3.929%  soft_f64_addsub         <qemu>/include/qemu/bitops.h
   9      3.299%  soft_f64_addsub         <qemu>/include/fpu/softfloat-macros.h
  10      3.247%  helper_float_sub_d      <qemu>/target/mips/fpu_helper.c
  11      2.385%  soft_f64_mul            <qemu>/include/qemu/bitops.h
  12      1.060%  helper_cmp_d_lt         <qemu>/target/mips/fpu_helper.c
  13      1.036%  float64_lt              <qemu>/fpu/softfloat.c
  14      0.946%  float64_add             <qemu>/fpu/softfloat.c
  15      0.901%  soft_f64_div            <qemu>/fpu/softfloat.c
```
From the two lists above, it is visible that, for some reason beyond QEMU itself, the big endian mips target uses multiply-add instructions, while the little endian mips target uses separate multiply and add instructions. This can be concluded from the presence of helper_float_madd_d in the big endian case only. The corresponding helpers are therefore different; moreover, the number of executed helpers also differs - fewer helpers are called in the big endian case. Numerically, the outcome is accurate in both cases, but the number of invoked helpers matters, and it results in the better overall performance of the big endian mips target.
This is not really a QEMU issue; rather, the mips cross compiler exhibits this unexpected difference between its big endian and little endian configurations.
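If the mips cross binutils are installed, one quick way to check this hypothesis (a hypothetical verification step, not part of the original measurements) is to count the multiply-add instructions in the two disassemblies; a non-zero count for the big endian binary and a zero count for the little endian one would confirm the picture above:
```
mips-linux-gnu-objdump -d coulomb_double-mips | grep -c "madd.d"
mipsel-linux-gnu-objdump -d coulomb_double-mipsel | grep -c "madd.d"
```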
m68k (non-RISC) target vs RISC targets
The m68k instruction set is not a RISC instruction set in the strict sense. For example, it contains instructions that calculate the mathematical function sin() - instructions that are not present in, for example, the mips or arm instruction sets. Again, let’s examine the top 15 functions for the m68k case:
```
./topN_callgrind.py -n 15 -- <qemu-build>/m68k-linux-user/qemu-m68k coulomb_double-m68k
```
Results:
```
 No.  Percentage  Function Name           Source File
----  ----------  ----------------------  -------------------------------------
   1     21.128%  roundAndPackFloatx80    <qemu>/fpu/softfloat.c
   2      6.646%  floatx80_mul            <qemu>/fpu/softfloat.c
   3      5.904%  0x00000000082db000      ???
   4      5.542%  floatx80_mul            <qemu>/include/fpu/softfloat-macros.h
   5      3.958%  subFloatx80Sigs         <qemu>/fpu/softfloat.c
   6      3.780%  helper_ftst             <qemu>/target/m68k/fpu_helper.c
   7      3.739%  float64_to_floatx80     <qemu>/fpu/softfloat.c
   8      3.528%  addFloatx80Sigs         <qemu>/fpu/softfloat.c
   9      2.447%  floatx80_div            <qemu>/include/fpu/softfloat-macros.h
  10      2.437%  floatx80_mul            <qemu>/include/fpu/softfloat.h
  11      2.136%  subFloatx80Sigs         <qemu>/include/fpu/softfloat-macros.h
  12      2.072%  roundAndPackFloat64     <qemu>/fpu/softfloat.c
  13      1.900%  floatx80_sin            <qemu>/target/m68k/softfloat.c
  14      1.890%  helper_ftst             <qemu>/include/fpu/softfloat.h
  15      1.884%  floatx80_cos            <qemu>/target/m68k/softfloat.c
```
The m68k target has fewer instructions to translate, but some of its instructions require complex softfloat helpers (displayed above), whereas mips, for example, has many more instructions to translate and execute, but the great majority of them are basically additions and multiplications that require relatively simple softfloat helpers.
All in all, the m68k approach seems to be more efficient from the standpoint of QEMU performance. There is nothing to improve in that sense for, say, mips - this is just an inherent consequence of the differences between the m68k and mips instruction sets.