Measuring QEMU Emulation Efficiency

Comparing Guest Instructions and QEMU Instructions

Ahmed Karaman - August 3, 2020

Intro

This report presents a method for measuring TCG emulation efficiency in QEMU. For seventeen different targets, it compares the number of guest instructions (running the program natively on the target) with the number of QEMU instructions (running the program through QEMU). For each target, the ratio between these two numbers gives a rough estimate of the emulation efficiency for that target.
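
As a worked example of this ratio, the short sketch below uses the aarch64 figures from the coulomb_double table later in this report. The helper name `emulation_ratio` is illustrative and is not part of the report's scripts.

```python
def emulation_ratio(guest_instructions, qemu_instructions):
    """Return the ratio as a string, with the guest side normalized to 1."""
    return "1:" + str(round(qemu_instructions / guest_instructions, 3))


# aarch64 figures from the coulomb_double table
print(emulation_ratio(182965444, 4424319223))
```

A larger number on the right-hand side means QEMU executes more instructions per guest instruction for that target.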

Besides the five benchmarks introduced in the previous report, the Coulomb benchmark is reused in this report to provide a variety of workloads. This gives a total of six benchmark programs that can be categorized into two groups:

Floating point operations (group 1):
- coulomb_double
- matmult_double
- qsort_double
Basic int and char operations (group 2):
- matmult_int32
- qsort_int32
- qsort_string

All benchmarks are available on the project GitHub page.

Setup
Measurements
Results (Benchmark Group 1)
Results (Benchmark Group 2)
Analysis

Setup

All measurements in this report are based on the newly released QEMU version 5.1.0-rc2. The number of guest instructions is measured with the libinsn plugin, which is available when QEMU is built with the --enable-plugins option. The general syntax for using the plugin is:

<qemu-executable> -plugin <qemu-plugins-build>/tests/plugin/libinsn.so -d plugin <test-program>
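
The same invocation can be assembled from Python, which is how the measurement script later in this report drives it. The sketch below is a minimal illustration: the helper names and the sample output string are assumptions, and the exact text printed by the plugin may differ between QEMU versions.

```python
import os
import re


def build_plugin_command(qemu_exe, plugins_build, program):
    """Assemble the argument list for running a program under libinsn."""
    libinsn = os.path.join(plugins_build, "tests", "plugin", "libinsn.so")
    return [qemu_exe, "-plugin", libinsn, "-d", "plugin", program]


def parse_insn_count(plugin_stderr):
    """Return the last integer found in the plugin's stderr text."""
    numbers = re.findall(r"\d+", plugin_stderr)
    if not numbers:
        raise ValueError("no instruction count found in plugin output")
    return int(numbers[-1])


# Assumed sample output; check it against your own QEMU build.
sample = "insns: 182965444\n"
print(parse_insn_count(sample))
```

The list returned by `build_plugin_command` can be passed directly to `subprocess.run` with `stderr=subprocess.PIPE` to capture the count.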

To measure the number of QEMU instructions, Callgrind is used. Please refer to the “Measuring Basic Performance Metrics of QEMU” report for more details on setting up and using Callgrind.

To create a plugins build based on the latest QEMU version, this bash snippet is used:

wget https://download.qemu.org/qemu-5.1.0-rc2.tar.xz
tar xfv qemu-5.1.0-rc2.tar.xz
cd qemu-5.1.0-rc2
mkdir build-gcc-plugins
cd build-gcc-plugins
../configure --disable-system --disable-tools --enable-plugins
make

Measurements

The Python script below creates a CSV table for each of the six benchmarks. Each table contains seventeen rows, one for each target. A row contains the target name, number of guest instructions, number of QEMU instructions and the ratio between the two numbers.

import csv
import os
import subprocess
import sys
import tempfile

############### Script Options ###############
qemu_build = "<qemu-plugins-build>"
targets = {
    "aarch64":  "aarch64-linux-gnu-gcc",
    "alpha":    "alpha-linux-gnu-gcc",
    "arm":      "arm-linux-gnueabi-gcc",
    "hppa":     "hppa-linux-gnu-gcc",
    "m68k":     "m68k-linux-gnu-gcc",
    "mips":     "mips-linux-gnu-gcc",
    "mipsel":   "mipsel-linux-gnu-gcc",
    "mips64":   "mips64-linux-gnuabi64-gcc",
    "mips64el": "mips64el-linux-gnuabi64-gcc",
    "ppc":      "powerpc-linux-gnu-gcc",
    "ppc64":    "powerpc64-linux-gnu-gcc",
    "ppc64le":  "powerpc64le-linux-gnu-gcc",
    "riscv64":  "riscv64-linux-gnu-gcc",
    "s390x":    "s390x-linux-gnu-gcc",
    "sh4":      "sh4-linux-gnu-gcc",
    "sparc64":  "sparc64-linux-gnu-gcc",
    "x86_64":   "gcc"
}
##############################################


def measure_qemu_instructions(qemu_exe_path, program_exe_path):
    # Measure the number of QEMU instructions using Callgrind
    with tempfile.NamedTemporaryFile() as tmp_out:
        run_callgrind = subprocess.run(["valgrind",
                                        "--tool=callgrind",
                                        "--callgrind-out-file=" + tmp_out.name,
                                        qemu_exe_path,
                                        program_exe_path],
                                       stdout=subprocess.DEVNULL,
                                       stderr=subprocess.PIPE)
    callgrind_output = run_callgrind.stderr.decode("utf-8").split("\n")
    return int(callgrind_output[8].split(" ")[-1])


csv_header = ["Target", "Guest Instructions", "QEMU Instructions", "Ratio"]
benchmarks = os.listdir('benchmarks')
libinsn_path = os.path.join(qemu_build, "tests", "plugin", "libinsn.so")
# Create the output directory; do not fail if it already exists
os.makedirs("tables", exist_ok=True)

for benchmark in benchmarks:
    data = []
    benchmark_name = os.path.splitext(benchmark)[0]
    benchmark_path = os.path.join("benchmarks", benchmark)
    for target_name, target_compiler in targets.items():
        with tempfile.NamedTemporaryFile() as tmp_exe:
            # Compile target
            subprocess.run([target_compiler, "-O2", "-static",
                            benchmark_path, "-o", tmp_exe.name, "-lm"])
            # Run the libinsn plugin
            run_qemu_plugin = subprocess.run(["{}/{}-linux-user/qemu-{}".
                                              format(qemu_build,
                                                     target_name,
                                                     target_name),
                                              "-plugin",
                                              libinsn_path,
                                              "-d",
                                              "plugin",
                                              tmp_exe.name],
                                             stdout=subprocess.DEVNULL,
                                             stderr=subprocess.PIPE)
            # Measure the instructions
            guest_instructions = int(run_qemu_plugin.stderr.decode("utf-8").
                                     split()[-1])
            qemu_instruction = measure_qemu_instructions("{}/{}-linux-user/qemu-{}".
                                                         format(qemu_build,
                                                                target_name,
                                                                target_name),
                                                         tmp_exe.name)
        data.append([target_name,
                     format(guest_instructions, ","),
                     format(qemu_instruction, ","),
                     "1:" + str(round((qemu_instruction / guest_instructions), 3))])

    with open(os.path.join("tables", benchmark_name) + ".csv", "w") as file:
        writer = csv.writer(file)
        writer.writerow(csv_header)
        writer.writerows(data)
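
The script above reads the QEMU instruction count from a fixed line index (8) of Callgrind's stderr, which can shift if Valgrind prints additional warnings. One possible alternative, sketched below, searches for the summary line by its label instead. The helper name and the sample text are assumptions; the "Collected" label should be checked against the output of the installed Valgrind version.

```python
import re


def parse_callgrind_collected(stderr_text):
    """Return the event count from Callgrind's 'Collected' summary line."""
    for line in stderr_text.splitlines():
        match = re.search(r"Collected\s*:\s*(\d+)", line)
        if match:
            return int(match.group(1))
    raise ValueError("no 'Collected' line found in Callgrind output")


# Illustrative stderr excerpt (assumed shape, abbreviated)
sample = (
    "==1234== Callgrind, a call-graph generating cache profiler\n"
    "==1234== Events    : Ir\n"
    "==1234== Collected : 4424319223\n"
)
print(parse_callgrind_collected(sample))
```

Raising an error when the line is missing makes a failed Valgrind run visible instead of silently producing a wrong number.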

Results (Benchmark Group 1)

coulomb_double

Target	Guest Instructions	QEMU Instructions	Ratio
aarch64	182 965 444	4 424 319 223	1:24.181
alpha	287 894 875	10 720 832 859	1:37.239
arm	4 353 433 161	39 328 640 162	1:9.034
hppa	290 299 145	12 007 537 148	1:41.363
m68k	55 464 791	7 107 559 194	1:128.145
mips	286 969 260	9 957 633 056	1:34.699
mipsel	300 313 870	11 123 315 018	1:37.039
mips64	255 992 742	9 855 532 178	1:38.499
mips64el	266 739 104	11 004 724 703	1:41.257
ppc	239 658 319	13 031 944 195	1:54.377
ppc64	228 263 889	13 034 833 440	1:57.104
ppc64le	220 968 816	13 012 936 191	1:58.890
riscv64	209 944 207	4 069 430 554	1:19.383
s390x	215 191 419	11 013 187 596	1:51.179
sh4	473 219 807	12 728 861 129	1:26.898
sparc64	263 295 373	11 969 980 973	1:45.462
x86_64	225 499 576	4 643 073 756	1:20.590

matmult_double

Target	Guest Instructions	QEMU Instructions	Ratio
aarch64	62 565 037	1 412 678 042	1:22.579
alpha	120 146 835	3 021 375 794	1:25.147
arm	917 721 514	8 723 369 272	1:9.505
hppa	63 330 121	3 346 341 016	1:52.840
m68k	62 270 262	3 327 921 564	1:53.443
mips	87 981 027	2 263 506 435	1:25.727
mipsel	95 981 109	3 176 876 928	1:33.099
mips64	80 557 580	2 277 631 169	1:28.273
mips64el	88 557 574	3 190 361 616	1:36.026
ppc	48 136 797	3 125 669 697	1:64.933
ppc64	64 408 551	3 203 728 174	1:49.741
ppc64le	64 289 333	3 203 064 933	1:49.823
riscv64	78 623 128	1 222 950 784	1:15.555
s390x	46 190 841	2 726 829 922	1:59.034
sh4	88 962 981	3 342 515 085	1:37.572
sparc64	79 003 237	3 207 541 031	1:40.600
x86_64	61 517 622	1 250 647 935	1:20.330

qsort_double

Target	Guest Instructions	QEMU Instructions	Ratio
aarch64	159 746 207	2 658 877 440	1:16.644
alpha	228 521 249	1 949 737 992	1:8.532
arm	662 068 324	9 121 836 857	1:13.778
hppa	247 113 645	3 141 276 704	1:12.712
m68k	203 935 507	4 934 908 874	1:24.198
mips	207 350 635	2 099 043 136	1:10.123
mipsel	207 350 618	2 099 343 286	1:10.125
mips64	188 086 328	1 971 371 119	1:10.481
mips64el	188 086 318	1 968 839 700	1:10.468
ppc	224 876 043	2 736 474 437	1:12.169
ppc64	203 809 886	2 685 763 461	1:13.178
ppc64le	193 040 770	2 642 651 058	1:13.690
riscv64	167 397 846	1 590 611 459	1:9.502
s390x	130 867 251	2 475 571 654	1:18.917
sh4	244 843 868	2 563 068 375	1:10.468
sparc64	190 084 290	3 919 439 599	1:20.619
x86_64	156 689 097	1 987 553 774	1:12.685

Results (Benchmark Group 2)

matmult_int32

Target	Guest Instructions	QEMU Instructions	Ratio
aarch64	62 555 845	596 194 508	1:9.531
alpha	96 215 385	370 654 042	1:3.852
arm	63 690 750	736 994 597	1:11.571
hppa	103 978 473	667 790 898	1:6.422
m68k	62 534 491	407 647 521	1:6.519
mips	88 083 941	497 767 190	1:5.651
mipsel	88 083 929	497 780 326	1:5.651
mips64	89 460 954	479 725 676	1:5.362
mips64el	89 460 943	463 106 726	1:5.177
ppc	55 843 156	338 959 876	1:6.070
ppc64	64 204 690	390 884 485	1:6.088
ppc64le	64 205 395	390 743 122	1:6.086
riscv64	86 448 202	349 669 158	1:4.045
s390x	62 614 807	492 407 746	1:7.864
sh4	72 780 143	399 937 800	1:5.495
sparc64	86 423 179	489 936 356	1:5.669
x86_64	61 590 922	400 190 791	1:6.498

qsort_int32

Target	Guest Instructions	QEMU Instructions	Ratio
aarch64	151 968 514	2 132 112 102	1:14.030
alpha	221 192 248	1 460 982 497	1:6.605
arm	160 875 621	3 375 777 484	1:20.984
hppa	201 401 936	2 199 407 458	1:10.920
m68k	169 894 134	1 780 208 909	1:10.478
mips	176 712 823	1 501 040 830	1:8.494
mipsel	176 712 809	1 503 808 218	1:8.510
mips64	176 020 831	1 504 536 270	1:8.547
mips64el	176 020 824	1 483 550 240	1:8.428
ppc	202 473 828	1 668 592 063	1:8.241
ppc64	198 918 772	1 780 051 140	1:8.949
ppc64le	188 749 603	1 728 567 792	1:9.158
riscv64	159 048 448	1 289 755 584	1:8.109
s390x	132 119 768	2 114 840 292	1:16.007
sh4	205 090 416	1 879 285 254	1:9.163
sparc64	185 195 979	3 352 756 658	1:18.104
x86_64	145 621 672	1 751 799 973	1:12.030

qsort_string

Target	Guest Instructions	QEMU Instructions	Ratio
aarch64	237 478 279	2 530 968 853	1:10.658
alpha	310 349 344	1 794 207 498	1:5.781
arm	277 491 839	7 167 746 267	1:25.830
hppa	286 010 885	4 608 364 139	1:16.113
m68k	242 574 561	2 295 663 078	1:9.464
mips	331 063 420	2 114 226 632	1:6.386
mipsel	331 063 408	2 111 085 204	1:6.377
mips64	304 640 414	1 969 109 275	1:6.464
mips64el	304 640 409	1 951 425 342	1:6.406
ppc	320 946 236	2 429 421 810	1:7.570
ppc64	272 956 914	2 404 978 156	1:8.811
ppc64le	273 392 915	2 386 256 069	1:8.728
riscv64	216 826 004	1 564 149 511	1:7.214
s390x	165 265 303	4 189 211 923	1:25.348
sh4	287 459 667	2 098 659 130	1:7.301
sparc64	304 142 262	4 130 702 783	1:13.581
x86_64	234 574 652	2 865 446 064	1:12.215

Analysis

The tables above are color coded to show the three best and worst emulation ratios for each benchmark. It can be noticed that within the same benchmark group, the ratios for all seventeen targets are nearly consistent.
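
The three best and three worst ratios of a table can also be picked out programmatically. The sketch below does this for the matmult_int32 table, assuming that "best" means the lowest number of QEMU instructions per guest instruction; the function name is illustrative.

```python
# QEMU instructions per guest instruction, from the matmult_int32 table
ratios = {
    "aarch64": 9.531, "alpha": 3.852, "arm": 11.571, "hppa": 6.422,
    "m68k": 6.519, "mips": 5.651, "mipsel": 5.651, "mips64": 5.362,
    "mips64el": 5.177, "ppc": 6.070, "ppc64": 6.088, "ppc64le": 6.086,
    "riscv64": 4.045, "s390x": 7.864, "sh4": 5.495, "sparc64": 5.669,
    "x86_64": 6.498,
}


def best_and_worst(ratios, count=3):
    """Return (best, worst) target lists, each ordered from most extreme."""
    ranked = sorted(ratios, key=ratios.get)
    return ranked[:count], ranked[-count:][::-1]


best, worst = best_and_worst(ratios)
print("best: ", best)
print("worst:", worst)
```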

It’s also clear that the ratio depends on the type of the program being emulated. Benchmarks in group 1 have a considerably larger emulation ratio compared to benchmarks in group 2.

The Python script below averages the ratios across different tables for each target. The results give a very good overview of QEMU’s emulation efficiency for each of the seventeen targets.

import os
import csv

# Tables directory
tables = os.listdir("tables")
csv_headers = ["Target", "QEMU Efficiency"]

# Initialize target arrays
target_names, target_ratio_sums = [], []
with open(os.path.join("tables", tables[0]), "r") as file:
    # Skip headers line
    file.readline()
    lines = file.readlines()
    for line in lines:
        # Add target name
        target_names.append(line.split(",")[0])
        # Initialize sum to zero
        target_ratio_sums.append(0)

# Number of benchmarks and targets
no_benchmarks = len(tables)
no_targets = len(target_names)

for table in tables:
    with open(os.path.join("tables", table), "r") as file:
        file.readline()
        lines = file.readlines()
        for i in range(len(lines)):
            target_ratio_sums[i] += float(lines[i].split(",")
                                          [-1].split(":")[-1])


target_ratio_avgs = ["1:"+str(round((x / no_benchmarks), 3))
                     for x in target_ratio_sums]

with open("efficiency.csv", "w") as file:
    writer = csv.writer(file)
    writer.writerow(csv_headers)
    for i in range(no_targets):
        writer.writerow([target_names[i], target_ratio_avgs[i]])
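
The script above relies on every table listing the targets in the same row order. A variant that keys the sums by target name and parses the files with the csv module (so that quoted fields containing thousands separators are handled by the parser) could look like the sketch below. The function name is illustrative, and the column names are taken from the header written by the measurement script.

```python
import csv
import os
from collections import defaultdict


def average_ratios(tables_dir):
    """Average the 'Ratio' column per target across all CSV tables."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for name in sorted(os.listdir(tables_dir)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(tables_dir, name), newline="") as file:
            for row in csv.DictReader(file):
                # "1:24.181" -> 24.181
                sums[row["Target"]] += float(row["Ratio"].split(":")[-1])
                counts[row["Target"]] += 1
    return {t: round(sums[t] / counts[t], 3) for t in sums}


# Only runs if a "tables" directory from the measurement script exists
if os.path.isdir("tables"):
    for target, average in average_ratios("tables").items():
        print(target, "1:" + format(average, ".3f"))
```

Formatting with `".3f"` keeps trailing zeros, so every ratio is printed with three decimal places.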

The script can be run three times to obtain three tables.

The first table below averages the three benchmarks in group 1. The second table gives the average ratio for the benchmarks in group 2. The third table is the average of all six benchmarks.

Target	QEMU Efficiency (group 1)
aarch64	1:21.135
alpha	1:23.639
arm	1:10.772
hppa	1:35.638
m68k	1:68.595
mips	1:23.516
mipsel	1:26.754
mips64	1:25.751
mips64el	1:29.250
ppc	1:43.826
ppc64	1:40.008
ppc64le	1:40.801
riscv64	1:14.813
s390x	1:43.043
sh4	1:24.979
sparc64	1:35.560
x86_64	1:17.868

Target	QEMU Efficiency (group 2)
aarch64	1:11.406
alpha	1:5.413
arm	1:19.462
hppa	1:11.152
m68k	1:8.820
mips	1:6.844
mipsel	1:6.846
mips64	1:6.791
mips64el	1:6.670
ppc	1:7.294
ppc64	1:7.949
ppc64le	1:7.991
riscv64	1:6.456
s390x	1:16.406
sh4	1:7.320
sparc64	1:12.451
x86_64	1:10.248

Target	QEMU Efficiency (overall)
aarch64	1:16.270
alpha	1:14.526
arm	1:15.117
hppa	1:23.395
m68k	1:38.708
mips	1:15.180
mipsel	1:16.800
mips64	1:16.271
mips64el	1:17.960
ppc	1:25.560
ppc64	1:23.979
ppc64le	1:24.396
riscv64	1:10.635
s390x	1:29.725
sh4	1:16.149
sparc64	1:24.006
x86_64	1:14.058
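
Because both groups contain three benchmarks, the overall average for a target should equal the mean of its two group averages, up to rounding. The sketch below checks this for three rows copied from the tables above; the tolerance value is an assumption chosen to absorb the three-decimal rounding.

```python
# Values copied from the three efficiency tables above
group1 = {"aarch64": 21.135, "m68k": 68.595, "riscv64": 14.813}
group2 = {"aarch64": 11.406, "m68k": 8.820, "riscv64": 6.456}
overall = {"aarch64": 16.270, "m68k": 38.708, "riscv64": 10.635}

TOLERANCE = 0.002  # allows for rounding to three decimals

for target in overall:
    mean_of_groups = (group1[target] + group2[target]) / 2
    consistent = abs(mean_of_groups - overall[target]) < TOLERANCE
    print(target, "consistent" if consistent else "mismatch")
```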


TCG Continuous Benchmarking