Measuring QEMU Performance in System Mode

Comparing the Emulation Performance of Debian for Five Different Targets

Ahmed Karaman - August 24, 2020

Intro

The final report of the TCG Continuous Benchmarking project introduces basic performance measurements for QEMU system mode emulation. Boot-up time and the number of executed instructions are compared for the emulation of five different targets. The report also presents a new tool for finding the top N most executed functions in the emulation process.


Setup

First, create a QEMU system mode build based on version 5.1.0, the latest release at the time of writing. For this report, you only need to build the five targets used.

wget https://download.qemu.org/qemu-5.1.0.tar.xz
tar xfv qemu-5.1.0.tar.xz
cd qemu-5.1.0
mkdir build-gcc-system
cd build-gcc-system
../configure --target-list=aarch64-softmmu,arm-softmmu,mips-softmmu,mipsel-softmmu,x86_64-softmmu
make
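Before moving on, it can help to confirm that the build produced all five binaries at the <target>-softmmu/qemu-system-<target> paths that the commands in this report rely on. This is a minimal sketch, not part of the original workflow: it assumes the build directory (build-gcc-system here) is what the <qemu-system-build> placeholder refers to, and the function name check_qemu_builds is hypothetical.

```shell
# Print the status of each system mode binary and fail if any is missing.
# Assumes the layout used in this report: <build-dir>/<target>-softmmu/qemu-system-<target>
check_qemu_builds() {
  build_dir=${1:-.}
  missing=0
  for target in aarch64 arm mips mipsel x86_64; do
    binary="$build_dir/$target-softmmu/qemu-system-$target"
    if [ -x "$binary" ]; then
      echo "ok:      $binary"
    else
      echo "missing: $binary"
      missing=$((missing + 1))
    fi
  done
  # Exit status is 0 only when every binary was found.
  [ "$missing" -eq 0 ]
}
```

Running check_qemu_builds . from inside build-gcc-system after make finishes should print five "ok:" lines and return success.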

Starting System Emulation

Debian supports a wide variety of architectures, which makes it a good choice of OS for testing system mode emulation. Debian 10 (buster), the stable release at the time of writing, is used.

For each of the five targets (aarch64, arm, mips, mipsel, and x86_64), the Debian image is booted until the setup menu appears, and the emulation is then stopped manually. This ensures that QEMU executes enough instructions for an accurate comparison, while avoiding the unnecessarily lengthy process of actually installing the OS.

Since the OS is never actually installed, there is no need to download the full Debian .iso image, which is around 350 MB per target. The kernel and initial ramdisk files provided by the Debian installer are sufficient.

The snippets below download the required files from the Debian archive and then start the emulation for each target. All emulations use 1024 MB of RAM.

AArch64:

# Initial ramdisk
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-arm64/current/images/cdrom/initrd.gz
# Linux kernel
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-arm64/current/images/cdrom/vmlinuz
# Emulate Debian
<qemu-system-build>/aarch64-softmmu/qemu-system-aarch64 \
-m 1024 -M virt -cpu cortex-a57 -kernel vmlinuz -initrd initrd.gz \
-append "root=/dev/ram"

ARM:

# Initial ramdisk
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-armhf/current/images/cdrom/initrd.gz
# Linux kernel
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-armhf/current/images/cdrom/vmlinuz
# Emulate Debian
<qemu-system-build>/arm-softmmu/qemu-system-arm \
-m 1024 -M virt -cpu cortex-a15 -kernel vmlinuz -initrd initrd.gz \
-append "root=/dev/ram"

MIPS (Malta):

# Initial ramdisk
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-mips/current/images/malta/netboot/initrd.gz
# Linux kernel
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-mips/current/images/malta/netboot/vmlinux-4.19.0-10-4kc-malta
# Emulate Debian
<qemu-system-build>/mips-softmmu/qemu-system-mips \
-m 1024 -M malta -kernel vmlinux-4.19.0-10-4kc-malta -initrd initrd.gz \
-append "root=/dev/ram"

MIPSel (Malta):

# Initial ramdisk
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-mipsel/current/images/malta/netboot/initrd.gz
# Linux kernel
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-mipsel/current/images/malta/netboot/vmlinux-4.19.0-10-4kc-malta
# Emulate Debian
<qemu-system-build>/mipsel-softmmu/qemu-system-mipsel \
-m 1024 -M malta -kernel vmlinux-4.19.0-10-4kc-malta -initrd initrd.gz \
-append "root=/dev/ram"

x86_64:

# Initial ramdisk
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-amd64/current/images/cdrom/initrd.gz
# Linux kernel
wget http://ftp.nl.debian.org/debian/dists/buster/main/installer-amd64/current/images/cdrom/vmlinuz
# Emulate Debian
<qemu-system-build>/x86_64-softmmu/qemu-system-x86_64 \
-m 1024 -kernel vmlinuz -initrd initrd.gz \
-append "root=/dev/ram"
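The per-target downloads above can be collected into one script that stores each target's files under images_debian/<target>/, the layout used by the topN_system.py usage example later in this report. This is a sketch under stated assumptions: the helper names installer_dir, kernel_name and fetch_images are hypothetical, the images_debian/<target> layout is extrapolated from the aarch64 example, and it assumes the buster URLs used above are still reachable.

```shell
# Base URL of the Debian buster installer images used in the snippets above.
DEBIAN_BASE=http://ftp.nl.debian.org/debian/dists/buster/main

# Map a QEMU target name to the Debian installer directory holding its images.
installer_dir() {
  case "$1" in
    aarch64) echo "installer-arm64/current/images/cdrom" ;;
    arm)     echo "installer-armhf/current/images/cdrom" ;;
    mips)    echo "installer-mips/current/images/malta/netboot" ;;
    mipsel)  echo "installer-mipsel/current/images/malta/netboot" ;;
    x86_64)  echo "installer-amd64/current/images/cdrom" ;;
    *)       return 1 ;;
  esac
}

# Kernel file name per target: the Malta images ship a versioned vmlinux.
kernel_name() {
  case "$1" in
    mips|mipsel) echo "vmlinux-4.19.0-10-4kc-malta" ;;
    *)           echo "vmlinuz" ;;
  esac
}

# Download the initrd and kernel of every target into images_debian/<target>/.
fetch_images() {
  for target in aarch64 arm mips mipsel x86_64; do
    url="$DEBIAN_BASE/$(installer_dir "$target")"
    mkdir -p "images_debian/$target"
    wget -P "images_debian/$target" "$url/initrd.gz" "$url/$(kernel_name "$target")" || return 1
  done
}
```

Calling fetch_images once leaves, for example, images_debian/mips/vmlinux-4.19.0-10-4kc-malta alongside the other targets' files.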

The demo below shows a successful emulation of the aarch64 target. All other targets behave the same way, except for x86_64, whose emulation output is shown on “VGA” instead of “serial0”.

System Flow

Boot-up Time

The Linux time command is used to measure the boot-up time of each target's emulation. The emulation is shut down manually as soon as the setup menu appears.

Target    Boot-up Time (s)
aarch64   16.76
arm       27.84
mips      20.76
mipsel    18.80
x86_64    17.97
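When repeating these measurements, a small wrapper can log each run instead of reading the time output by hand. This is a sketch, not the procedure used for the table above: the function name timed_run and the log file boot_times.txt are hypothetical, and it assumes GNU date for the %N (nanoseconds) field.

```shell
# Run a command, then append "<label> <elapsed seconds>" to the log file.
# The command keeps the terminal, so the QEMU console stays visible and the
# emulation can still be stopped manually when the setup menu appears.
timed_run() {
  label=$1
  shift
  start=$(date +%s.%N)   # GNU date: seconds.nanoseconds since the epoch
  "$@"
  status=$?
  end=$(date +%s.%N)
  # awk does the floating-point subtraction that plain sh cannot.
  awk -v l="$label" -v s="$start" -v e="$end" \
    'BEGIN { printf "%s %.2f\n", l, e - s }' >> "${BOOT_LOG:-boot_times.txt}"
  return "$status"
}
```

For example, prefixing the aarch64 emulation command with timed_run aarch64 appends a line of the form "aarch64 <seconds>" once QEMU exits.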

The 25 Most Executed Functions

To find the most executed QEMU functions, as well as the total number of executed instructions, this report introduces a new script, topN_system.py, which is similar to the topN scripts from the “Measuring Basic Performance Metrics of QEMU” report.

The script is based on Perf. Callgrind cannot be used here: system mode emulation is far more resource intensive than user mode emulation, and because Callgrind works through instrumentation, the measurements would take an enormous amount of time to complete.

The script, available on the project GitHub page, takes a single optional argument, -n, that sets the number of top functions to display. If it is not provided, the script defaults to 25.
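As a rough illustration of how a Perf-based measurement of this kind can work, the shell functions below count host instructions with perf stat and list the hottest symbols from perf report in the same table layout as the results in this report. This is an approximation, not the actual topN_system.py: the function names and the /tmp output paths are hypothetical, and the perf report percentages are shares of perf's default sampling event, which may differ from how the real script computes its percentages.

```shell
# Reformat "perf report --stdio --sort symbol" text (read from stdin) into the
# " No.  Percentage  Name" layout used in this report, keeping N rows.
format_top_n() {
  awk -v n="${1:-25}" '
    BEGIN {
      print " No.  Percentage  Name"
      print "----  ----------  ------------------------------"
    }
    /^#/ || NF == 0 { next }   # skip perf header comments and blank lines
    {
      count++
      # $1 is the overhead (e.g. 10.25%), $NF the symbol name after "[.]"
      printf "%4d  %10s  %s\n", count, $1, $NF
      if (count >= n) exit
    }'
}

# Extract the counter value from "perf stat -x," CSV output (read from stdin).
# CSV fields: value, unit, event name, ...
parse_instruction_count() {
  awk -F, '!/^#/ && $3 ~ /^instructions/ { print $1 }'
}

# Hypothetical wrappers that profile a full QEMU command line.
# Parsing runs even if QEMU exits non-zero after being stopped manually.
count_instructions() {
  perf stat -x, -e instructions -o /tmp/topn.stat -- "$@"
  parse_instruction_count < /tmp/topn.stat
}

top_n_functions() {
  n=$1
  shift
  perf record -o /tmp/topn.data -- "$@"
  perf report -i /tmp/topn.data --stdio --sort symbol | format_top_n "$n"
}
```

For instance, top_n_functions 25 followed by the same QEMU command line as in the aarch64 usage example would print a 25-row table once the emulation is stopped.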

A usage example for aarch64 is shown below:

./topN_system.py -- <qemu-system-build>/aarch64-softmmu/qemu-system-aarch64 \
-m 1024 -M virt -cpu cortex-a57 -kernel images_debian/aarch64/vmlinuz -initrd images_debian/aarch64/initrd.gz \
-append "root=/dev/ram"

The script is executed in a similar manner for each of the five targets.

AArch64:

Number of instructions: 136,660,616,915

 No.  Percentage  Name
----  ----------  ------------------------------
   1      10.25%  helper_lookup_tb_ptr
   2       6.69%  liveness_pass_1
   3       4.71%  get_phys_addr_lpae
   4       3.65%  tcg_gen_code
   5       2.73%  tlb_flush_by_mmuidx_async_work
   6       2.57%  tlb_set_page_with_attrs
   7       2.48%  tcg_optimize
   8       2.24%  address_space_translate_internal
   9       2.14%  address_space_ldq_le
  10       1.76%  object_dynamic_cast_assert
  11       1.67%  tb_htable_lookup
  12       1.50%  flatview_do_translate
  13       1.44%  tcg_out_opc.isra.13
  14       1.07%  cpu_get_tb_cpu_state
  15       0.91%  get_phys_addr
  16       0.87%  get_page_addr_code_hostp
  17       0.80%  aa64_va_parameters
  18       0.78%  flatview_translate
  19       0.75%  S1_ptw_translate
  20       0.74%  victim_tlb_hit
  21       0.62%  arm_cpu_tlb_fill
  22       0.62%  object_class_dynamic_cast_assert
  23       0.62%  qht_lookup_custom
  24       0.62%  init_ts_info
  25       0.60%  regime_el

ARM:

Number of instructions: 236,734,647,205

 No.  Percentage  Name
----  ----------  ------------------------------
   1       9.10%  tlb_flush_by_mmuidx_async_work
   2       6.99%  helper_lookup_tb_ptr
   3       3.29%  tlb_set_page_with_attrs
   4       2.86%  liveness_pass_1
   5       2.79%  cpu_get_tb_cpu_state
   6       2.77%  get_phys_addr
   7       2.25%  tcg_gen_code
   8       1.78%  tcg_optimize
   9       1.41%  tb_htable_lookup
  10       1.25%  victim_tlb_hit
  11       1.03%  tcg_out_opc.isra.13
  12       0.91%  address_space_translate_internal
  13       0.85%  get_page_addr_code_hostp
  14       0.83%  object_class_dynamic_cast_assert
  15       0.80%  full_le_ldul_mmu
  16       0.77%  flatview_do_translate
  17       0.77%  object_dynamic_cast_assert
  18       0.73%  address_space_ldl_le
  19       0.72%  helper_uadd8
  20       0.68%  arm_ldl_ptw
  21       0.68%  arm_cpu_tlb_fill
  22       0.67%  find_next_bit
  23       0.60%  qht_lookup_custom
  24       0.56%  cpu_physical_memory_get_dirty.constprop.23
  25       0.54%  init_ts_info

MIPS:

Number of instructions: 180,841,164,833

 No.  Percentage  Name
----  ----------  ------------------------------
   1      16.91%  helper_lookup_tb_ptr
   2       8.08%  liveness_pass_1
   3       2.70%  tcg_gen_code
   4       2.26%  r4k_map_address
   5       1.87%  cpu_exec
   6       1.47%  tcg_optimize
   7       1.30%  object_class_dynamic_cast_assert
   8       1.04%  tcg_out_opc.isra.13
   9       0.77%  address_space_translate_internal
  10       0.74%  access_with_adjusted_size
  11       0.74%  tlb_set_page_with_attrs
  12       0.73%  tlb_flush_page_by_mmuidx_async_0
  13       0.71%  mips_cpu_do_interrupt
  14       0.63%  tb_htable_lookup
  15       0.61%  flatview_do_translate
  16       0.59%  tb_jmp_cache_clear_page
  17       0.49%  flatview_access_valid
  18       0.42%  tcg_out_sib_offset
  19       0.39%  init_ts_info
  20       0.38%  helper_eret
  21       0.36%  memory_region_dispatch_write
  22       0.35%  get_page_addr_code_hostp
  23       0.32%  object_dynamic_cast_assert
  24       0.31%  full_be_ldul_mmu
  25       0.30%  flatview_translate

MIPSel:

Number of instructions: 160,755,845,442

 No.  Percentage  Name
----  ----------  ------------------------------
   1       8.95%  liveness_pass_1
   2       6.29%  helper_lookup_tb_ptr
   3       2.99%  tcg_gen_code
   4       2.65%  r4k_map_address
   5       1.89%  cpu_exec
   6       1.60%  tcg_optimize
   7       1.50%  object_class_dynamic_cast_assert
   8       1.02%  tcg_out_opc.isra.13
   9       0.89%  access_with_adjusted_size
  10       0.86%  tlb_flush_page_by_mmuidx_async_0
  11       0.86%  tb_htable_lookup
  12       0.83%  flatview_do_translate
  13       0.82%  address_space_translate_internal
  14       0.82%  mips_cpu_do_interrupt
  15       0.73%  tlb_set_page_with_attrs
  16       0.72%  flatview_access_valid
  17       0.70%  tb_jmp_cache_clear_page
  18       0.51%  victim_tlb_hit
  19       0.48%  helper_eret
  20       0.43%  get_page_addr_code_hostp
  21       0.43%  memory_region_dispatch_write
  22       0.42%  tcg_out_sib_offset
  23       0.40%  flatview_translate
  24       0.39%  object_dynamic_cast_assert
  25       0.38%  init_ts_info

x86_64:

Number of instructions: 150,991,381,071

 No.  Percentage  Name
----  ----------  ------------------------------
   1      11.30%  helper_lookup_tb_ptr
   2       7.01%  liveness_pass_1
   3       4.48%  tcg_gen_code
   4       3.41%  tcg_optimize
   5       1.84%  tcg_out_opc.isra.13
   6       1.78%  helper_pcmpeqb_xmm
   7       1.20%  object_dynamic_cast_assert
   8       1.00%  cpu_exec
   9       0.99%  tcg_temp_new_internal
  10       0.88%  tb_htable_lookup
  11       0.84%  object_class_dynamic_cast_assert
  12       0.81%  init_ts_info
  13       0.80%  tlb_set_page_with_attrs
  14       0.77%  victim_tlb_hit
  15       0.75%  tcg_out_sib_offset
  16       0.62%  tcg_op_alloc
  17       0.61%  helper_pmovmskb_xmm
  18       0.58%  disas_insn.isra.50
  19       0.56%  helper_pcmpgtb_xmm
  20       0.56%  address_space_ldq
  21       0.49%  address_space_translate_internal
  22       0.49%  x86_cpu_tlb_fill
  23       0.46%  tb_gen_code
  24       0.45%  tcg_out_modrm_sib_offset
  25       0.43%  flatview_do_translate

Conclusion

The results from the topN script and the manual boot-up time measurements are combined in the table below. They closely resemble those from the previous reports that compared performance in user mode.

Target    Boot-up Time (s)    Instructions
aarch64   16.76               136,660,616,915
arm       27.84               236,734,647,205
mips      20.76               180,841,164,833
mipsel    18.80               160,755,845,442
x86_64    17.97               150,991,381,071
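To make the comparison between the two columns concrete, each target can be normalized against aarch64, the fastest one. The awk-based sketch below (the function name relative_to_baseline is hypothetical) reads rows of target, seconds and instruction count without thousands separators and prints both ratios.

```shell
# Print each row's boot-up time and instruction count relative to the first row.
# Input rows: "<target> <seconds> <instructions>", instructions without separators.
relative_to_baseline() {
  awk '
    NR == 1 { t0 = $2; i0 = $3 }   # the first row is the baseline
    { printf "%-8s %.2f  %.2f\n", $1, $2 / t0, $3 / i0 }'
}

# Values taken from the table above.
relative_to_baseline <<'EOF'
aarch64 16.76 136660616915
arm 27.84 236734647205
mips 20.76 180841164833
mipsel 18.80 160755845442
x86_64 17.97 150991381071
EOF
```

With these values, arm comes out at 1.66 times the aarch64 boot-up time and 1.73 times its instruction count, and sorting the targets by either column gives the same order: aarch64, x86_64, mipsel, mips, arm.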
