Measuring QEMU Emulation Efficiency

Comparing Guest Instructions and QEMU Instructions

Ahmed Karaman - August 3, 2020

Intro

This report presents a method for measuring the TCG emulation efficiency in QEMU. This is achieved for seventeen different targets by comparing the number of guest instructions (running the program natively on the target) and the number of QEMU instructions (running the program through QEMU). For each target, the ratio between these two numbers gives a rough estimate of the emulation efficiency for that target.
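As a small illustration of the ratio described above, the sketch below computes it from two instruction counts. The function names are hypothetical and not part of the report's scripts; the sample figures are the aarch64 counts for coulomb_double from the results tables.

```python
def emulation_ratio(guest_instructions: int, qemu_instructions: int) -> float:
    """Return the number of QEMU instructions executed per guest instruction."""
    if guest_instructions <= 0:
        raise ValueError("guest_instructions must be positive")
    return qemu_instructions / guest_instructions


def format_ratio(guest_instructions: int, qemu_instructions: int) -> str:
    """Format the ratio in the report's 1:N notation with three decimals."""
    return "1:{:.3f}".format(emulation_ratio(guest_instructions,
                                             qemu_instructions))


# aarch64 counts for coulomb_double, taken from the results tables
print(format_ratio(182_965_444, 4_424_319_223))  # 1:24.181
```

A larger N means QEMU executes more host instructions for every guest instruction.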

Besides the five benchmarks introduced in the previous report, the Coulomb benchmark is also reused in this report to provide a variety of workloads. This gives a total of six benchmark programs that can be categorized into two groups:

  • Floating point operations (group 1):
    • coulomb_double
    • matmult_double
    • qsort_double
  • Basic int and char operations (group 2):
    • matmult_int32
    • qsort_int32
    • qsort_string

All benchmarks are available on the project GitHub page.


Setup

All measurements in this report are based on the newly released QEMU version 5.1.0-rc2. The number of guest instructions is measured with the libinsn plugin, which is available when QEMU is built with the --enable-plugins option. The general syntax for using the plugin is:

<qemu-executable> -plugin <qemu-plugins-build>/tests/plugin/libinsn.so -d plugin <test-program>
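A minimal sketch of how the command above could be assembled and its output read from Python. The helper names libinsn_command and parse_insn_count are hypothetical; the directory layout follows the plugins build used in this report, and the sample output line ("insns: N") is an assumption about what the plugin prints, so only the last whitespace-separated token is relied on.

```python
import os


def libinsn_command(qemu_build, target, program):
    """Build the argument list for running `program` under the libinsn plugin."""
    qemu_exe = os.path.join(qemu_build,
                            "{}-linux-user".format(target),
                            "qemu-{}".format(target))
    plugin = os.path.join(qemu_build, "tests", "plugin", "libinsn.so")
    return [qemu_exe, "-plugin", plugin, "-d", "plugin", program]


def parse_insn_count(plugin_stderr):
    """Return the last whitespace-separated token of the plugin output as int."""
    tokens = plugin_stderr.split()
    if not tokens:
        raise ValueError("empty plugin output")
    return int(tokens[-1])


# Hypothetical build path and program name, for illustration only
print(libinsn_command("/path/to/build", "aarch64", "./coulomb_double"))
# Assumed shape of the plugin output
print(parse_insn_count("insns: 182965444\n"))
```

The list returned by libinsn_command can be passed directly to subprocess.run with stderr=subprocess.PIPE.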

To measure the number of QEMU instructions, Callgrind is used. Please refer to the “Measuring Basic Performance Metrics of QEMU” report for more details on setting up and using Callgrind.
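Callgrind prints a summary on stderr when the profiled program exits, and the total instruction count appears on its "Collected" line. The sketch below extracts that number with a regular expression instead of relying on a fixed line position. The function name and the sample text are illustrative and not taken from the report.

```python
import re

# Matches the summary line, e.g. "==4242== Collected : 4424319223"
COLLECTED_RE = re.compile(r"Collected\s*:\s*(\d+)")


def parse_callgrind_instructions(callgrind_stderr):
    """Return the instruction count from Callgrind's 'Collected' line."""
    match = COLLECTED_RE.search(callgrind_stderr)
    if match is None:
        raise ValueError("no 'Collected' line found in Callgrind output")
    return int(match.group(1))


# Illustrative excerpt of a Callgrind summary (PID and numbers are made up
# to match the aarch64 coulomb_double row)
sample = (
    "==4242== Events    : Ir\n"
    "==4242== Collected : 4424319223\n"
    "==4242== \n"
    "==4242== I   refs:      4,424,319,223\n"
)
print(parse_callgrind_instructions(sample))
```

Searching for the line by name keeps the parse working even if extra warnings shift the summary up or down.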

To create a plugins build based on the latest QEMU version, this bash snippet is used:

wget https://download.qemu.org/qemu-5.1.0-rc2.tar.xz
tar xfv qemu-5.1.0-rc2.tar.xz
cd qemu-5.1.0-rc2
mkdir build-gcc-plugins
cd build-gcc-plugins
../configure --disable-system --disable-tools --enable-plugins
make

Measurements

The Python script below creates a CSV table for each of the six benchmarks. Each table contains seventeen rows, one for each target. A row contains the target name, number of guest instructions, number of QEMU instructions and the ratio between the two numbers.

import csv
import os
import subprocess
import sys
import tempfile

############### Script Options ###############
qemu_build = "<qemu-plugins-build>"
targets = {
    "aarch64":  "aarch64-linux-gnu-gcc",
    "alpha":    "alpha-linux-gnu-gcc",
    "arm":      "arm-linux-gnueabi-gcc",
    "hppa":     "hppa-linux-gnu-gcc",
    "m68k":     "m68k-linux-gnu-gcc",
    "mips":     "mips-linux-gnu-gcc",
    "mipsel":   "mipsel-linux-gnu-gcc",
    "mips64":   "mips64-linux-gnuabi64-gcc",
    "mips64el": "mips64el-linux-gnuabi64-gcc",
    "ppc":      "powerpc-linux-gnu-gcc",
    "ppc64":    "powerpc64-linux-gnu-gcc",
    "ppc64le":  "powerpc64le-linux-gnu-gcc",
    "riscv64":  "riscv64-linux-gnu-gcc",
    "s390x":    "s390x-linux-gnu-gcc",
    "sh4":      "sh4-linux-gnu-gcc",
    "sparc64":  "sparc64-linux-gnu-gcc",
    "x86_64":   "gcc"
}
##############################################


def measure_qemu_instructions(qemu_exe_path, program_exe_path):
    # Measure the number of QEMU instructions using Callgrind
    with tempfile.NamedTemporaryFile() as tmp_out:
        run_callgrind = subprocess.run(["valgrind",
                                        "--tool=callgrind",
                                        "--callgrind-out-file=" + tmp_out.name,
                                        qemu_exe_path,
                                        program_exe_path],
                                       stdout=subprocess.DEVNULL,
                                       stderr=subprocess.PIPE)
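    callgrind_output = run_callgrind.stderr.decode("utf-8").split("\n")
    # The instruction count is the last field of Callgrind's "Collected" line
    for line in callgrind_output:
        if "Collected" in line:
            return int(line.split(" ")[-1])
    raise RuntimeError("Callgrind output has no 'Collected' line")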


csv_header = ["Target", "Guest Instructions", "QEMU Instructions", "Ratio"]
benchmarks = os.listdir('benchmarks')
libinsn_path = os.path.join(qemu_build, "tests", "plugin", "libinsn.so")
# Create the output directory; do not fail if it already exists
os.makedirs("tables", exist_ok=True)

for benchmark in benchmarks:
    data = []
    benchmark_name = os.path.splitext(benchmark)[0]
    benchmark_path = os.path.join("benchmarks", benchmark)
    for target_name, target_compiler in targets.items():
        with tempfile.NamedTemporaryFile() as tmp_exe:
            # Compile target
            subprocess.run([target_compiler, "-O2", "-static",
                            benchmark_path, "-o", tmp_exe.name, "-lm"])
            # Path of the QEMU user-mode executable for this target
            qemu_exe_path = "{}/{}-linux-user/qemu-{}".format(qemu_build,
                                                              target_name,
                                                              target_name)
            # Run the libinsn plugin (the count is reported on stderr)
            run_qemu_plugin = subprocess.run([qemu_exe_path,
                                              "-plugin",
                                              libinsn_path,
                                              "-d",
                                              "plugin",
                                              tmp_exe.name],
                                             stdout=subprocess.DEVNULL,
                                             stderr=subprocess.PIPE)
            # Guest instructions: last token of the plugin output
            guest_instructions = int(run_qemu_plugin.stderr.decode("utf-8").
                                     split()[-1])
            # QEMU instructions: counted by Callgrind
            qemu_instruction = measure_qemu_instructions(qemu_exe_path,
                                                         tmp_exe.name)
        # Always print three decimals so that trailing zeros are kept
        data.append([target_name,
                     format(guest_instructions, ","),
                     format(qemu_instruction, ","),
                     "1:{:.3f}".format(qemu_instruction / guest_instructions)])

    # newline="" is what the csv module documentation recommends for writers
    with open(os.path.join("tables", benchmark_name) + ".csv", "w",
              newline="") as file:
        writer = csv.writer(file)
        writer.writerow(csv_header)
        writer.writerows(data)
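Because the counts are written with thousands separators, the csv module quotes those fields, and a reader has to strip the commas before converting them back to integers. The sketch below shows one way to load such a table into numeric form. The function name parse_table is hypothetical, and the sample rows are copied from the coulomb_double results.

```python
import csv
import io


def parse_table(csv_text):
    """Map each target to its guest count, QEMU count and numeric ratio."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows[row["Target"]] = {
            "guest": int(row["Guest Instructions"].replace(",", "")),
            "qemu": int(row["QEMU Instructions"].replace(",", "")),
            # "1:24.181" -> 24.181
            "ratio": float(row["Ratio"].split(":")[1]),
        }
    return rows


# Two rows in the shape produced by csv.writer for comma-formatted numbers
SAMPLE = (
    'Target,Guest Instructions,QEMU Instructions,Ratio\r\n'
    'aarch64,"182,965,444","4,424,319,223",1:24.181\r\n'
    'm68k,"55,464,791","7,107,559,194",1:128.145\r\n'
)

table = parse_table(SAMPLE)
print(table["aarch64"])
```

For a file on disk, the same function can be given the result of reading it with open(path, newline="").read().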

Results (Benchmarks Group 1)

coulomb_double

Target    Guest Instructions  QEMU Instructions  Ratio
aarch64          182,965,444      4,424,319,223  1:24.181
alpha            287,894,875     10,720,832,859  1:37.239
arm            4,353,433,161     39,328,640,162  1:9.034
hppa             290,299,145     12,007,537,148  1:41.363
m68k              55,464,791      7,107,559,194  1:128.145
mips             286,969,260      9,957,633,056  1:34.699
mipsel           300,313,870     11,123,315,018  1:37.039
mips64           255,992,742      9,855,532,178  1:38.499
mips64el         266,739,104     11,004,724,703  1:41.257
ppc              239,658,319     13,031,944,195  1:54.377
ppc64            228,263,889     13,034,833,440  1:57.104
ppc64le          220,968,816     13,012,936,191  1:58.890
riscv64          209,944,207      4,069,430,554  1:19.383
s390x            215,191,419     11,013,187,596  1:51.179
sh4              473,219,807     12,728,861,129  1:26.898
sparc64          263,295,373     11,969,980,973  1:45.462
x86_64           225,499,576      4,643,073,756  1:20.590

matmult_double

Target    Guest Instructions  QEMU Instructions  Ratio
aarch64           62,565,037      1,412,678,042  1:22.579
alpha            120,146,835      3,021,375,794  1:25.147
arm              917,721,514      8,723,369,272  1:9.505
hppa              63,330,121      3,346,341,016  1:52.840
m68k              62,270,262      3,327,921,564  1:53.443
mips              87,981,027      2,263,506,435  1:25.727
mipsel            95,981,109      3,176,876,928  1:33.099
mips64            80,557,580      2,277,631,169  1:28.273
mips64el          88,557,574      3,190,361,616  1:36.026
ppc               48,136,797      3,125,669,697  1:64.933
ppc64             64,408,551      3,203,728,174  1:49.741
ppc64le           64,289,333      3,203,064,933  1:49.823
riscv64           78,623,128      1,222,950,784  1:15.555
s390x             46,190,841      2,726,829,922  1:59.034
sh4               88,962,981      3,342,515,085  1:37.572
sparc64           79,003,237      3,207,541,031  1:40.600
x86_64            61,517,622      1,250,647,935  1:20.330

qsort_double

Target    Guest Instructions  QEMU Instructions  Ratio
aarch64          159,746,207      2,658,877,440  1:16.644
alpha            228,521,249      1,949,737,992  1:8.532
arm              662,068,324      9,121,836,857  1:13.778
hppa             247,113,645      3,141,276,704  1:12.712
m68k             203,935,507      4,934,908,874  1:24.198
mips             207,350,635      2,099,043,136  1:10.123
mipsel           207,350,618      2,099,343,286  1:10.125
mips64           188,086,328      1,971,371,119  1:10.481
mips64el         188,086,318      1,968,839,700  1:10.468
ppc              224,876,043      2,736,474,437  1:12.169
ppc64            203,809,886      2,685,763,461  1:13.178
ppc64le          193,040,770      2,642,651,058  1:13.690
riscv64          167,397,846      1,590,611,459  1:9.502
s390x            130,867,251      2,475,571,654  1:18.917
sh4              244,843,868      2,563,068,375  1:10.468
sparc64          190,084,290      3,919,439,599  1:20.619
x86_64           156,689,097      1,987,553,774  1:12.685

Results (Benchmarks Group 2)

matmult_int32

Target    Guest Instructions  QEMU Instructions  Ratio
aarch64           62,555,845        596,194,508  1:9.531
alpha             96,215,385        370,654,042  1:3.852
arm               63,690,750        736,994,597  1:11.571
hppa             103,978,473        667,790,898  1:6.422
m68k              62,534,491        407,647,521  1:6.519
mips              88,083,941        497,767,190  1:5.651
mipsel            88,083,929        497,780,326  1:5.651
mips64            89,460,954        479,725,676  1:5.362
mips64el          89,460,943        463,106,726  1:5.177
ppc               55,843,156        338,959,876  1:6.070
ppc64             64,204,690        390,884,485  1:6.088
ppc64le           64,205,395        390,743,122  1:6.086
riscv64           86,448,202        349,669,158  1:4.045
s390x             62,614,807        492,407,746  1:7.864
sh4               72,780,143        399,937,800  1:5.495
sparc64           86,423,179        489,936,356  1:5.669
x86_64            61,590,922        400,190,791  1:6.498

qsort_int32

Target    Guest Instructions  QEMU Instructions  Ratio
aarch64          151,968,514      2,132,112,102  1:14.030
alpha            221,192,248      1,460,982,497  1:6.605
arm              160,875,621      3,375,777,484  1:20.984
hppa             201,401,936      2,199,407,458  1:10.920
m68k             169,894,134      1,780,208,909  1:10.478
mips             176,712,823      1,501,040,830  1:8.494
mipsel           176,712,809      1,503,808,218  1:8.510
mips64           176,020,831      1,504,536,270  1:8.547
mips64el         176,020,824      1,483,550,240  1:8.428
ppc              202,473,828      1,668,592,063  1:8.241
ppc64            198,918,772      1,780,051,140  1:8.949
ppc64le          188,749,603      1,728,567,792  1:9.158
riscv64          159,048,448      1,289,755,584  1:8.109
s390x            132,119,768      2,114,840,292  1:16.007
sh4              205,090,416      1,879,285,254  1:9.163
sparc64          185,195,979      3,352,756,658  1:18.104
x86_64           145,621,672      1,751,799,973  1:12.030

qsort_string

Target    Guest Instructions  QEMU Instructions  Ratio
aarch64          237,478,279      2,530,968,853  1:10.658
alpha            310,349,344      1,794,207,498  1:5.781
arm              277,491,839      7,167,746,267  1:25.830
hppa             286,010,885      4,608,364,139  1:16.113
m68k             242,574,561      2,295,663,078  1:9.464
mips             331,063,420      2,114,226,632  1:6.386
mipsel           331,063,408      2,111,085,204  1:6.377
mips64           304,640,414      1,969,109,275  1:6.464
mips64el         304,640,409      1,951,425,342  1:6.406
ppc              320,946,236      2,429,421,810  1:7.570
ppc64            272,956,914      2,404,978,156  1:8.811
ppc64le          273,392,915      2,386,256,069  1:8.728
riscv64          216,826,004      1,564,149,511  1:7.214
s390x            165,265,303      4,189,211,923  1:25.348
sh4              287,459,667      2,098,659,130  1:7.301
sparc64          304,142,262      4,130,702,783  1:13.581
x86_64           234,574,652      2,865,446,064  1:12.215

Analysis

The tables above are color coded to show the three best and worst emulation ratios for each benchmark. Within the same benchmark group, the ratios for all seventeen targets are nearly consistent.
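Where the color coding is not visible, the three best and worst ratios can be recovered by sorting a table. The sketch below does this for matmult_int32, assuming that a lower ratio (fewer QEMU instructions per guest instruction) counts as better. The function name is hypothetical; the ratios are copied from the matmult_int32 table.

```python
# Ratios (the N in 1:N) from the matmult_int32 table
MATMULT_INT32 = {
    "aarch64": 9.531, "alpha": 3.852, "arm": 11.571, "hppa": 6.422,
    "m68k": 6.519, "mips": 5.651, "mipsel": 5.651, "mips64": 5.362,
    "mips64el": 5.177, "ppc": 6.070, "ppc64": 6.088, "ppc64le": 6.086,
    "riscv64": 4.045, "s390x": 7.864, "sh4": 5.495, "sparc64": 5.669,
    "x86_64": 6.498,
}


def best_and_worst(ratios, count=3):
    """Return (best, worst) target lists, each ordered from most extreme."""
    ranked = sorted(ratios, key=ratios.get)  # ascending: lowest ratio first
    return ranked[:count], ranked[-count:][::-1]


best, worst = best_and_worst(MATMULT_INT32)
print("best: ", best)    # ['alpha', 'riscv64', 'mips64el']
print("worst:", worst)   # ['arm', 'aarch64', 's390x']
```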

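It’s also clear that the ratio depends on the type of program being emulated. Benchmarks in group 1 (floating point) have a considerably larger emulation ratio than benchmarks in group 2 (int and char operations). For x86_64, for example, the group averages are 1:17.868 and 1:10.248 respectively. The one exception is arm, whose group 1 average (1:10.772) is lower than its group 2 average (1:19.462).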

The Python script below averages the ratios across different tables for each target. The results give a very good overview of QEMU’s emulation efficiency for each of the seventeen targets.

import os
import csv

# Tables directory
tables = os.listdir("tables")
csv_headers = ["Target", "QEMU Efficiency"]

# Initialize target arrays
target_names, target_ratio_sums = [], []
with open(os.path.join("tables", tables[0]), "r") as file:
    # Skip headers line
    file.readline()
    lines = file.readlines()
    for line in lines:
        # Add target name
        target_names.append(line.split(",")[0])
        # Initialize sum to zero
        target_ratio_sums.append(0)

# Number of benchmarks and targets
no_benchmarks = len(tables)
no_targets = len(target_names)

for table in tables:
    with open(os.path.join("tables", table), "r") as file:
        file.readline()
        lines = file.readlines()
        for i in range(len(lines)):
            target_ratio_sums[i] += float(lines[i].split(",")
                                          [-1].split(":")[-1])


# Always print three decimals so that trailing zeros are kept (e.g. 1:8.820)
target_ratio_avgs = ["1:{:.3f}".format(x / no_benchmarks)
                     for x in target_ratio_sums]

with open("efficiency.csv", "w") as file:
    writer = csv.writer(file)
    writer.writerow(csv_headers)
    for i in range(no_targets):
        writer.writerow([target_names[i], target_ratio_avgs[i]])

The script can be run three times, each time with a different set of CSV files in the tables directory, to obtain three tables.

The first table below averages the three benchmarks in group 1. The second table gives the average ratio for the benchmarks in group 2. Lastly, the third table is the average of all six benchmarks.
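As an alternative to running the script three times, the group and overall averages can be computed in one pass once the ratios are in memory. The sketch below is an illustration under that assumption and not the script used for this report. The names GROUPS, RATIOS, average_ratios and efficiency_tables are hypothetical, and only two targets are included to keep it short; their ratios are copied from the results tables.

```python
from statistics import mean

# Benchmark groups as defined in the introduction
GROUPS = {
    "group 1": ["coulomb_double", "matmult_double", "qsort_double"],
    "group 2": ["matmult_int32", "qsort_int32", "qsort_string"],
}

# Ratios (the N in 1:N) for two targets, copied from the results tables
RATIOS = {
    "coulomb_double": {"riscv64": 19.383, "x86_64": 20.590},
    "matmult_double": {"riscv64": 15.555, "x86_64": 20.330},
    "qsort_double":   {"riscv64": 9.502,  "x86_64": 12.685},
    "matmult_int32":  {"riscv64": 4.045,  "x86_64": 6.498},
    "qsort_int32":    {"riscv64": 8.109,  "x86_64": 12.030},
    "qsort_string":   {"riscv64": 7.214,  "x86_64": 12.215},
}


def average_ratios(ratios, benchmarks):
    """Average each target's ratio over the given benchmarks."""
    targets = ratios[benchmarks[0]].keys()
    return {t: mean(ratios[b][t] for b in benchmarks) for t in targets}


def efficiency_tables(ratios, groups):
    """Return one table per group plus an 'overall' table."""
    tables = {name: average_ratios(ratios, members)
              for name, members in groups.items()}
    tables["overall"] = average_ratios(ratios, list(ratios))
    return tables


for name, table in efficiency_tables(RATIOS, GROUPS).items():
    for target, avg in table.items():
        print("{:8} {:8} 1:{:.3f}".format(name, target, avg))
```

For these two targets the printed averages agree with the corresponding rows of the three tables that follow.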

Target    QEMU Efficiency (group 1)
aarch64   1:21.135
alpha     1:23.639
arm       1:10.772
hppa      1:35.638
m68k      1:68.595
mips      1:23.516
mipsel    1:26.754
mips64    1:25.751
mips64el  1:29.250
ppc       1:43.826
ppc64     1:40.008
ppc64le   1:40.801
riscv64   1:14.813
s390x     1:43.043
sh4       1:24.979
sparc64   1:35.560
x86_64    1:17.868

Target    QEMU Efficiency (group 2)
aarch64   1:11.406
alpha     1:5.413
arm       1:19.462
hppa      1:11.152
m68k      1:8.820
mips      1:6.844
mipsel    1:6.846
mips64    1:6.791
mips64el  1:6.670
ppc       1:7.294
ppc64     1:7.949
ppc64le   1:7.991
riscv64   1:6.456
s390x     1:16.406
sh4       1:7.320
sparc64   1:12.451
x86_64    1:10.248

Target    QEMU Efficiency (overall)
aarch64   1:16.270
alpha     1:14.526
arm       1:15.117
hppa      1:23.395
m68k      1:38.708
mips      1:15.180
mipsel    1:16.800
mips64    1:16.271
mips64el  1:17.960
ppc       1:25.560
ppc64     1:23.979
ppc64le   1:24.396
riscv64   1:10.635
s390x     1:29.725
sh4       1:16.149
sparc64   1:24.006
x86_64    1:14.058
