Intro
The previous report introduced a performance comparison between QEMU versions 5.0 and 5.1-pre-soft-freeze. The results showed an approximate 2.45% performance degradation in all PowerPC targets. To further analyze the results, the report introduced KCachegrind to compare the list of QEMU helpers executed in the two versions as well as to list the callees of these helpers.
This report presents two new Python scripts that facilitate the process of displaying the executed QEMU helpers and function callees without needing to set up KCachegrind. The ppc/ppc64/ppc64le performance degradation is re-analysed using the new scripts. The report also introduces the analysis of three other targets: hppa and sh4, explaining why they were not affected the same way as ppc, and mips, explaining why it showed an increase in performance.
Table of Contents
- Introducing the Scripts
- Re-analyzing ppc Performance 5.0 VS 5.1-pre-soft-freeze
- Analyzing hppa Performance 5.0 VS 5.1-pre-soft-freeze
- Analyzing sh4 Performance 5.0 VS 5.1-pre-soft-freeze
- Analyzing mips Performance 5.0 VS 5.1-pre-soft-freeze
- Appendix
Introducing the Scripts
The list_helpers.py script, as the name suggests, is used to list all helpers executed during a QEMU invocation. In the first part of its output, the script prints the total number of executed instructions. After that, it lists the executed helpers with the following info for each one:
- Number of inclusive instructions
- Overall percentage
- Number of calls
- Number of instructions per call
- Function name
- Source file
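The two derived columns follow from the raw counts. The sketch below assumes that the percentage is taken relative to the total number of instructions and that instructions per call is a truncating integer division; both assumptions agree with the sample outputs shown later in this report. The function name `derived_columns` is illustrative and is not part of the scripts.

```python
def derived_columns(instructions, calls, total_instructions):
    """Return (percentage, instructions_per_call) for one helper row."""
    percentage = instructions / total_instructions * 100
    ins_per_call = instructions // calls  # truncated, as in the tables
    return percentage, ins_per_call


# Raw counts of the helper_divq_EAX row from the x86_64 example run.
pct, per_call = derived_columns(5_714_358, 5_722, 4_701_725_992)
print(f"{pct:.3f}% {per_call}")
```

For this row the sketch reproduces the 0.122% and 998 values printed by list_helpers.py.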
The list_fn_callees.py script is a generalization of the list_helpers.py script. Under the hood, list_helpers.py works by listing the callees of the JIT call. The list_fn_callees.py script extends this by giving the user the ability to list the callees of any executed QEMU function(s). The script takes one required argument, -f, which is a space-separated list of QEMU functions. For each function, the script prints the list of its callees in a similar manner to list_helpers.py.
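The command-line shape described above (a required -f list followed by the QEMU invocation after `--`) can be sketched with argparse. This is only an illustration of the interface: the function name `parse_cli` and the option destinations are assumptions, and the real script may parse its arguments differently.

```python
import argparse


def parse_cli(argv):
    """Parse `-f FUNC [FUNC ...] -- <qemu invocation>` style arguments."""
    parser = argparse.ArgumentParser(
        usage="list_fn_callees.py -f FUNCTION [FUNCTION ...] -- <qemu invocation>"
    )
    parser.add_argument("-f", dest="functions", nargs="+", required=True,
                        help="space-separated QEMU function names")
    # Everything after `--` is collected as the QEMU invocation.
    parser.add_argument("command", nargs="+", help="QEMU invocation")
    return parser.parse_args(argv)


args = parse_cli(["-f", "helper_mulsd", "helper_addsd", "--",
                  "qemu-x86_64", "coulomb_double"])
print(args.functions)
print(args.command)
```

The `--` separator keeps the function names and the QEMU command apart, so options that belong to the QEMU invocation are not mistaken for script options.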
Both scripts are available on the project GitHub page.
Examples of Usage
Compile the coulomb_double benchmark on an x86_64 Linux machine:
gcc -static -O2 coulomb_double.c -o coulomb_double -lm
To list the executed helpers, the list_helpers.py script can be invoked using:
./list_helpers.py -- <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double
Output:
Total number of instructions: 4,701,725,992
Executed QEMU Helpers:
No. Instructions Percentage Calls Ins/Call Helper Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 1,139,103,365 24.227% 21,490,812 53 helper_mulsd <qemu>/target/i386/ops_sse.h
2 906,697,049 19.284% 18,470,085 49 helper_addsd <qemu>/target/i386/ops_sse.h
3 858,124,043 18.251% 16,940,796 50 helper_subsd <qemu>/target/i386/ops_sse.h
4 211,982,677 4.509% 7,202,293 29 helper_ucomisd <qemu>/target/i386/ops_sse.h
5 154,316,493 3.282% 2,655,074 58 helper_lookup_tb_ptr <qemu>/accel/tcg/tcg-runtime.c
6 80,851,232 1.720% 459,382 176 helper_cvttsd2si <qemu>/target/i386/ops_sse.h
7 63,073,459 1.341% 1,261,468 50 helper_divsd <qemu>/target/i386/ops_sse.h
8 30,787,517 0.655% 2,646,130 11 helper_cc_compute_all <qemu>/target/i386/cc_helper.c
9 24,699,785 0.525% 4,939,957 5 helper_pand_xmm <qemu>/target/i386/ops_sse.h
10 14,266,055 0.303% 2,853,211 5 helper_pxor_xmm <qemu>/target/i386/ops_sse.h
11 8,885,615 0.189% 1,777,123 5 helper_por_xmm <qemu>/target/i386/ops_sse.h
12 5,714,358 0.122% 5,722 998 helper_divq_EAX <qemu>/target/i386/int_helper.c
13 2,435,265 0.052% 30,065 81 helper_pcmpeqb_xmm <qemu>/target/i386/ops_sse.h
14 1,900,764 0.040% 211,196 9 helper_pandn_xmm <qemu>/target/i386/ops_sse.h
15 1,200,024 0.026% 19,048 63 helper_pmovmskb_xmm <qemu>/target/i386/ops_sse.h
16 278,000 0.006% 2,000 139 helper_cvtsi2sd <qemu>/target/i386/ops_sse.h
17 260,732 0.006% 24,471 10 helper_cc_compute_c <qemu>/target/i386/cc_helper.c
18 225,270 0.005% 5,006 45 helper_punpcklbw_xmm <qemu>/target/i386/ops_sse.h
19 95,133 0.002% 5,007 19 helper_pshufd_xmm <qemu>/target/i386/ops_sse.h
20 75,090 0.002% 5,006 15 helper_punpcklwd_xmm <qemu>/target/i386/ops_sse.h
21 36,000 0.001% 1,000 36 helper_sqrtsd <qemu>/target/i386/ops_sse.h
22 28,000 0.001% 4,000 7 helper_movmskpd <qemu>/target/i386/ops_sse.h
23 20,028 0.000% 5,007 4 helper_movl_mm_T0_xmm <qemu>/target/i386/ops_sse.h
24 17,000 0.000% 1,000 17 helper_idivl_EAX <qemu>/target/i386/int_helper.c
25 8,000 0.000% 4,000 2 helper_fnstcw <qemu>/target/i386/fpu_helper.c
26 3,354 0.000% 43 78 helper_syscall <qemu>/target/i386/seg_helper.c
27 1,497 0.000% 13 115 helper_cpuid <qemu>/target/i386/misc_helper.c
28 775 0.000% 31 25 helper_pcmpgtl_xmm <qemu>/target/i386/ops_sse.h
29 720 0.000% 6 120 helper_pslldq_xmm <qemu>/target/i386/ops_sse.h
30 625 0.000% 125 5 helper_paddq_xmm <qemu>/target/i386/ops_sse.h
31 558 0.000% 62 9 helper_paddl_xmm <qemu>/target/i386/ops_sse.h
32 528 0.000% 16 33 helper_psubb_xmm <qemu>/target/i386/ops_sse.h
33 372 0.000% 62 6 helper_psllq_xmm <qemu>/target/i386/ops_sse.h
34 310 0.000% 62 5 helper_punpckhqdq_xmm <qemu>/target/i386/ops_sse.h
35 279 0.000% 31 9 helper_punpckhdq_xmm <qemu>/target/i386/ops_sse.h
36 248 0.000% 31 8 helper_pslld_xmm <qemu>/target/i386/ops_sse.h
37 217 0.000% 31 7 helper_punpckldq_xmm <qemu>/target/i386/ops_sse.h
38 216 0.000% 2 108 helper_psrldq_xmm <qemu>/target/i386/ops_sse.h
39 198 0.000% 66 3 helper_punpcklqdq_xmm <qemu>/target/i386/ops_sse.h
40 124 0.000% 2 62 helper_idivq_EAX <qemu>/target/i386/int_helper.c
41 25 0.000% 1 25 helper_pcmpeql_xmm <qemu>/target/i386/ops_sse.h
42 24 0.000% 1 24 helper_rdtsc <qemu>/target/i386/misc_helper.c
To list the callees of helper_mulsd and helper_addsd, the list_fn_callees.py script can be invoked using:
./list_fn_callees.py -f helper_mulsd helper_addsd -- <qemu-build>/x86_64-linux-user/qemu-x86_64 coulomb_double
Output:
Total number of instructions: 4,703,399,623
Callees of helper_mulsd:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 924,195,245 19.650% 21,490,812 43 float64_mul <qemu>/fpu/softfloat.c
Callees of helper_addsd:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 721,996,199 15.351% 18,470,085 39 float64_add <qemu>/fpu/softfloat.c
Principle of Operation
The script executes the passed QEMU invocation command with Callgrind. It then uses callgrind_annotate with two new flags, --tree=calling and --threshold=100.
The --tree=calling flag is used to list the callees of each function, while the --threshold=100 flag sets a threshold on the displayed cost percentage. callgrind_annotate stops printing functions once the cumulative cost percentage of the printed functions reaches or exceeds the given threshold. The default threshold is 99.
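The two steps can be sketched as a pair of command lines. This is a minimal sketch and not the scripts' actual implementation: the function names `build_commands` and `annotate_run` and the use of a temporary output file are assumptions, while the Callgrind and callgrind_annotate options are the standard ones.

```python
import subprocess
import tempfile


def build_commands(qemu_command, callgrind_out):
    """Return the Callgrind and callgrind_annotate command lines."""
    profile = ["valgrind", "--tool=callgrind",
               f"--callgrind-out-file={callgrind_out}"] + list(qemu_command)
    annotate = ["callgrind_annotate", callgrind_out,
                "--tree=calling", "--threshold=100"]
    return profile, annotate


def annotate_run(qemu_command):
    """Profile the QEMU invocation and return the annotated text output."""
    with tempfile.NamedTemporaryFile() as out:
        profile, annotate = build_commands(qemu_command, out.name)
        subprocess.run(profile, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        result = subprocess.run(annotate, check=True,
                                capture_output=True, text=True)
    return result.stdout


profile, annotate = build_commands(["qemu-x86_64", "coulomb_double"],
                                   "callgrind.out")
print(" ".join(profile))
print(" ".join(annotate))
```

Only `build_commands` is exercised here; `annotate_run` needs Valgrind and a QEMU binary to be installed.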
Understanding the "--tree=calling" Flag Output
To better understand the --tree=calling flag output, consider the excerpt below obtained from running callgrind_annotate on the Callgrind output of the x86_64 version of the Coulomb benchmark:
157 8,728,258 * <qemu>/fpu/softfloat.c:float64_to_int32_scalbn [<qemu-build>/x86_64-linux-user/qemu-x86_64]
158 45,938,200 > <qemu>/fpu/softfloat.c:round_to_int_and_pack (459382x) [<qemu-build>/x86_64-linux-user/qemu-x86_64]
159 15,618,988 > <qemu>/fpu/softfloat.c:float64_unpack_canonical (459382x) [<qemu-build>/x86_64-linux-user/qemu-x86_64]
160
161 7,965,240 * <qemu>/include/exec/tb-hash.h:helper_lookup_tb_ptr
162
163 7,350,112 * <qemu>/target/i386/ops_sse.h:helper_cvttsd2si [<qemu-build>/x86_64-linux-user/qemu-x86_64]
164 72,122,974 > <qemu>/fpu/softfloat.c:float64_to_int32_round_to_zero (459382x) [<qemu-build>/x86_64-linux-user/qemu-x86_64]
A line can come in two forms, either with a * after the number of instructions, or with a >.
A line with * contains the measurements of a top-level function. All lines with a > that follow it are this function’s callees. The two line forms carry different sets of measurements.
Example of a top-level function (line number 157):
- 8,728,258: The number of self instructions of the function.
- *: Indicates that this is a top-level function.
- <qemu>/fpu/softfloat.c:float64_to_int32_scalbn: Function source file.
- [<qemu-build>/x86_64-linux-user/qemu-x86_64]: Program executing the function.
Example of a function callee (line number 158):
- 45,938,200: The number of inclusive instructions of the function.
- >: Indicates that this is a callee of the top-level function.
- <qemu>/fpu/softfloat.c:round_to_int_and_pack: Function source file.
- (459382x): Number of function calls.
- [<qemu-build>/x86_64-linux-user/qemu-x86_64]: Program executing the function.
The list_helpers.py and list_fn_callees.py scripts use the information above for searching and printing the callee details of the desired functions.
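As an illustration of how such output could be searched, the sketch below groups callee lines under their top-level function. It assumes that every line has the shape shown in the excerpt (without the excerpt's leading line numbers); the regular expression and the name `parse_callees` are illustrative, and the real scripts may process the output differently.

```python
import re

# Assumed line shape, modelled on the excerpt:
# "<cost> <*|>> <file>:<function> [(<N>x)] [[<object>]]"
LINE_RE = re.compile(
    r"^\s*(?P<cost>[\d,]+)\s+(?P<kind>[*>])\s+"
    r"(?P<file>[^:\s]+):(?P<func>\S+)"
    r"(?:\s+\((?P<calls>\d+)x\))?"
    r"(?:\s+\[(?P<obj>[^\]]*)\])?\s*$"
)


def parse_callees(text):
    """Map each top-level function to a list of (callee, inclusive, calls)."""
    callees = {}
    current = None
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # blank or unrelated line
        cost = int(match["cost"].replace(",", ""))
        if match["kind"] == "*":
            current = match["func"]
            callees.setdefault(current, [])
        elif current is not None:
            calls = int(match["calls"] or 0)
            callees[current].append((match["func"], cost, calls))
    return callees


sample = """\
8,728,258 * <qemu>/fpu/softfloat.c:float64_to_int32_scalbn [<qemu-build>/x86_64-linux-user/qemu-x86_64]
45,938,200 > <qemu>/fpu/softfloat.c:round_to_int_and_pack (459382x) [<qemu-build>/x86_64-linux-user/qemu-x86_64]
15,618,988 > <qemu>/fpu/softfloat.c:float64_unpack_canonical (459382x) [<qemu-build>/x86_64-linux-user/qemu-x86_64]

7,965,240 * <qemu>/include/exec/tb-hash.h:helper_lookup_tb_ptr
"""

result = parse_callees(sample)
for callee, inclusive, calls in result["float64_to_int32_scalbn"]:
    print(callee, inclusive, calls)
```

A top-level function without any > lines, such as helper_lookup_tb_ptr in the sample, ends up with an empty callee list.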
Re-analyzing ppc Performance 5.0 VS 5.1-pre-soft-freeze
The previous report concluded that the changes made in SoftFloat by inlining the float64 compare specializations were the reason behind the PowerPC performance degradation. This section concludes the same, but this time by using the list_helpers.py and list_fn_callees.py scripts instead of KCachegrind.
Finding list of ppc helpers for QEMU 5.0:
./list_helpers.py -- <qemu-build>/ppc-linux-user/qemu-ppc coulomb_double-ppc
Results:
Total number of instructions: 12,713,132,146
Executed QEMU Helpers:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 4,079,150,406 32.092% 14,765,516 276 helper_fmadd <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
2 2,019,505,224 15.888% 39,614,918 50 helper_compute_fprf_float64 <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
3 1,136,334,017 8.940% 8,660,551 131 helper_fsub <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
4 1,057,997,648 8.324% 8,110,271 130 helper_fadd <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
5 773,545,891 6.086% 5,475,082 141 helper_fmul <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
6 760,150,923 5.980% 46,826,121 16 helper_float_check_status <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
7 632,546,190 4.976% 7,209,203 87 helper_fcmpu <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
8 258,993,128 2.038% 913,858 283 helper_fnmadd <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
9 158,379,826 1.246% 1,261,466 125 helper_fdiv <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
10 110,558,579 0.870% 2,167,299 51 helper_lookup_tb_ptr <qemu>/qemu-5.0.0/accel/tcg/tcg-runtime.c
11 109,016,868 0.858% 427,174 255 helper_fmsub <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
12 96,581,006 0.760% 48,290,503 2 helper_reset_fpstatus <qemu>/qemu-5.0.0/include/fpu/softfloat-helpers.h
13 79,473,086 0.625% 459,382 173 helper_fctiwz <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
14 266,000 0.002% 2,000 133 helper_store_fpscr <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
15 247,317 0.002% 1,000 247 helper_fnmsub <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
16 48,000 0.000% 2,000 24 helper_todouble <qemu>/qemu-5.0.0/target/ppc/fpu_helper.c
17 3,486 0.000% 42 83 helper_raise_exception_err <qemu>/qemu-5.0.0/target/ppc/excp_helper.c
18 2,380 0.000% 14 170 helper_dcbz <qemu>/qemu-5.0.0/target/ppc/mem_helper.c
Finding list of ppc helpers for QEMU 5.1-pre-soft-freeze:
./list_helpers.py -- <qemu-master-build>/ppc-linux-user/qemu-ppc coulomb_double-ppc
Results:
Total number of instructions: 13,033,447,522
Executed QEMU Helpers:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 4,079,150,406 31.303% 14,765,516 276 helper_fmadd <qemu>/qemu/target/ppc/fpu_helper.c
2 2,019,505,224 15.498% 39,614,918 50 helper_compute_fprf_float64 <qemu>/qemu/target/ppc/fpu_helper.c
3 1,136,334,017 8.720% 8,660,551 131 helper_fsub <qemu>/qemu/target/ppc/fpu_helper.c
4 1,057,997,648 8.119% 8,110,271 130 helper_fadd <qemu>/qemu/target/ppc/fpu_helper.c
5 950,907,056 7.297% 7,209,203 131 helper_fcmpu <qemu>/qemu/target/ppc/fpu_helper.c
6 773,545,891 5.936% 5,475,082 141 helper_fmul <qemu>/qemu/target/ppc/fpu_helper.c
7 760,150,923 5.833% 46,826,121 16 helper_float_check_status <qemu>/qemu/target/ppc/fpu_helper.c
8 258,993,128 1.988% 913,858 283 helper_fnmadd <qemu>/qemu/target/ppc/fpu_helper.c
9 158,379,826 1.215% 1,261,466 125 helper_fdiv <qemu>/qemu/target/ppc/fpu_helper.c
10 110,558,579 0.848% 2,167,299 51 helper_lookup_tb_ptr <qemu>/qemu/accel/tcg/tcg-runtime.c
11 109,016,868 0.837% 427,174 255 helper_fmsub <qemu>/qemu/target/ppc/fpu_helper.c
12 96,581,006 0.741% 48,290,503 2 helper_reset_fpstatus <qemu>/qemu/include/fpu/softfloat-helpers.h
13 80,391,850 0.617% 459,382 175 helper_fctiwz <qemu>/qemu/target/ppc/fpu_helper.c
14 266,000 0.002% 2,000 133 helper_store_fpscr <qemu>/qemu/target/ppc/fpu_helper.c
15 247,317 0.002% 1,000 247 helper_fnmsub <qemu>/qemu/target/ppc/fpu_helper.c
16 48,000 0.000% 2,000 24 helper_todouble <qemu>/qemu/target/ppc/fpu_helper.c
17 3,486 0.000% 42 83 helper_raise_exception_err <qemu>/qemu/target/ppc/excp_helper.c
18 2,618 0.000% 14 187 helper_dcbz <qemu>/qemu/target/ppc/mem_helper.c
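Comparing the two tables by eye is error-prone, so the difference can be computed instead. The sketch below uses instruction counts copied from the two ppc tables above; the function name `changed_helpers` is illustrative, and only one of the unchanged helpers is listed for brevity.

```python
def changed_helpers(old, new):
    """Return {helper: new - old} for helpers whose instruction count differs."""
    return {name: new[name] - old[name]
            for name in old
            if name in new and new[name] != old[name]}


# Inclusive instruction counts copied from the two ppc tables.
qemu_5_0 = {
    "helper_fmadd": 4_079_150_406,
    "helper_fcmpu": 632_546_190,
    "helper_fctiwz": 79_473_086,
    "helper_dcbz": 2_380,
}
qemu_5_1 = {
    "helper_fmadd": 4_079_150_406,
    "helper_fcmpu": 950_907_056,
    "helper_fctiwz": 80_391_850,
    "helper_dcbz": 2_618,
}

delta = changed_helpers(qemu_5_0, qemu_5_1)
total_delta = 13_033_447_522 - 12_713_132_146  # totals of both runs
for name, diff in sorted(delta.items(), key=lambda item: -item[1]):
    print(f"{name:<16} +{diff:>12,}  {diff / total_delta:6.2%} of the total increase")
```

With these numbers, helper_fcmpu alone accounts for roughly 99.4% of the additional instructions executed in 5.1-pre-soft-freeze.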
To further inspect helper_fcmpu, the list_fn_callees.py script is used to list the helper’s callees.
Finding list of helper_fcmpu callees for QEMU 5.0:
./list_fn_callees.py -f helper_fcmpu -- <qemu-build>/ppc-linux-user/qemu-ppc coulomb_double-ppc
Results:
Total number of instructions: 12,713,132,146
Callees of helper_fcmpu:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 203,052,887 1.597% 7,209,203 28 float64_lt <qemu>/qemu-5.0.0/fpu/softfloat.c
2 140,825,653 1.108% 4,856,057 29 float64_le <qemu>/qemu-5.0.0/fpu/softfloat.c
Finding list of helper_fcmpu callees for QEMU 5.1-pre-soft-freeze:
./list_fn_callees.py -f helper_fcmpu -- <qemu-master-build>/ppc-linux-user/qemu-ppc coulomb_double-ppc
Results:
Total number of instructions: 13,033,447,522
Callees of helper_fcmpu:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 662,239,406 5.082% 12,065,260 54 float64_compare <qemu>/qemu/fpu/softfloat.c
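The callee tables can be checked against the helper tables with a little arithmetic. The sketch below only uses numbers copied from the tables above; the variable and function names are illustrative.

```python
# Numbers copied from the ppc helper and callee tables.
calls_fcmpu = 7_209_203
helper_5_0, helper_5_1 = 632_546_190, 950_907_056

# callee -> (inclusive instructions, calls)
callees_5_0 = {"float64_lt": (203_052_887, 7_209_203),
               "float64_le": (140_825_653, 4_856_057)}
callees_5_1 = {"float64_compare": (662_239_406, 12_065_260)}


def totals(callees):
    """Sum (instructions, calls) over all callees of a helper."""
    return (sum(ins for ins, _ in callees.values()),
            sum(calls for _, calls in callees.values()))


ins_old, calls_old = totals(callees_5_0)
ins_new, calls_new = totals(callees_5_1)

print("softfloat calls:", calls_old, "->", calls_new)
print("callee increase:", ins_new - ins_old)
print("helper increase:", helper_5_1 - helper_5_0)
print("extra instructions per helper_fcmpu call:",
      round((ins_new - ins_old) / calls_fcmpu, 1))
```

The number of softfloat calls is the same in both versions, and the increase in callee instructions equals the increase in helper_fcmpu instructions, which corresponds to about 44 extra instructions per helper call (87 versus 131 in the helper tables).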
This leads to the same conclusion as the previous report: replacing the float64 compare specializations with inline functions that call the standard float64_compare functions is the reason behind the PowerPC performance degradation.
Looking back at the summary of the performance comparison presented in the previous report, besides the performance degradation introduced in the PowerPC targets, several other targets had no change in performance. The next two sections analyze two such targets using the list_helpers.py and list_fn_callees.py scripts.
| Decrease | No Change | Increase |
|---|---|---|
| ppc 2.458% | alpha 0.766% | aarch64 5.679% |
| ppc64 2.453% | arm 0.012% | m68k 4.572% |
| ppc64le 2.456% | hppa 0.026% | mips 4.614% |
| | s390x 0.599% | mipsel 5.043% |
| | sh4 0.015% | mips64 4.651% |
| | sparc64 0.057% | mips64el 5.087% |
| | | riscv64 2.155% |
| | | x86_64 1.609% |
Report 3: QEMU 5.0 and 5.1-pre-soft-freeze Dissect Comparison
Analyzing hppa Performance 5.0 VS 5.1-pre-soft-freeze
Finding list of hppa helpers for QEMU 5.0:
./list_helpers.py -- <qemu-build>/hppa-linux-user/qemu-hppa coulomb_double-hppa
Results:
Total number of instructions: 12,005,480,751
Executed QEMU Helpers:
No. Instructions Percentage Calls Ins/Call Helper Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 3,699,940,930 30.819% 21,503,828 172 helper_fmpy_d <qemu>/target/hppa/op_helper.c
2 3,104,012,400 25.855% 18,478,099 167 helper_fadd_d <qemu>/target/hppa/op_helper.c
3 2,649,102,324 22.066% 16,946,808 156 helper_fsub_d <qemu>/target/hppa/op_helper.c
4 620,535,451 5.169% 7,578,184 81 helper_fcmp_d <qemu>/target/hppa/op_helper.c
5 441,738,119 3.679% 7,605,380 58 helper_lookup_tb_ptr <qemu>/accel/tcg/tcg-runtime.c
6 195,195,906 1.626% 1,262,976 154 helper_fdiv_d <qemu>/target/hppa/op_helper.c
7 82,688,760 0.689% 459,382 180 helper_fcnv_t_d_w <qemu>/target/hppa/op_helper.c
8 15,120,000 0.126% 1,008,000 15 helper_loaded_fr0 <qemu>/target/hppa/op_helper.c
9 604,725 0.005% 8,063 75 helper_excp <qemu>/target/hppa/op_helper.c
10 308,000 0.003% 2,000 154 helper_fcnv_w_d <qemu>/target/hppa/op_helper.c
11 71,424 0.001% 23,808 3 helper_tcond <qemu>/target/hppa/op_helper.c
Finding list of hppa helpers for QEMU 5.1-pre-soft-freeze:
./list_helpers.py -- <qemu-master-build>/hppa-linux-user/qemu-hppa coulomb_double-hppa
Results:
Total number of instructions: 12,008,544,996
Executed QEMU Helpers:
No. Instructions Percentage Calls Ins/Call Helper Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 3,699,940,930 30.811% 21,503,828 172 helper_fmpy_d <qemu-master>/qemu/target/hppa/op_helper.c
2 3,104,012,400 25.848% 18,478,099 167 helper_fadd_d <qemu-master>/qemu/target/hppa/op_helper.c
3 2,649,102,324 22.060% 16,946,808 156 helper_fsub_d <qemu-master>/qemu/target/hppa/op_helper.c
4 620,535,451 5.167% 7,578,184 81 helper_fcmp_d <qemu-master>/qemu/target/hppa/op_helper.c
5 441,738,119 3.679% 7,605,380 58 helper_lookup_tb_ptr <qemu-master>/qemu/accel/tcg/tcg-runtime.c
6 195,195,906 1.625% 1,262,976 154 helper_fdiv_d <qemu-master>/qemu/target/hppa/op_helper.c
7 83,607,524 0.696% 459,382 182 helper_fcnv_t_d_w <qemu-master>/qemu/target/hppa/op_helper.c
8 15,120,000 0.126% 1,008,000 15 helper_loaded_fr0 <qemu-master>/qemu/target/hppa/op_helper.c
9 604,725 0.005% 8,063 75 helper_excp <qemu-master>/qemu/target/hppa/op_helper.c
10 294,000 0.002% 2,000 147 helper_fcnv_w_d <qemu-master>/qemu/target/hppa/op_helper.c
11 71,424 0.001% 23,808 3 helper_tcond <qemu-master>/qemu/target/hppa/op_helper.c
Let’s focus on the floating point comparison helper, helper_fcmp_d. Its number of instructions per call is the same in both versions, so its performance appears to be unaffected by the softfloat changes, as opposed to the ppc/ppc64/ppc64le helper. Now, the list_fn_callees.py script can be used to list the helper_fcmp_d callees for QEMU 5.0:
./list_fn_callees.py -f helper_fcmp_d -- <qemu-build>/hppa-linux-user/qemu-hppa coulomb_double-hppa
Results:
Total number of instructions: 12,005,490,004
Callees of helper_fcmp_d:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 234,512,818 1.953% 7,578,184 30 update_fr0_cmp.isra.6 <qemu>/target/hppa/op_helper.c
2 64,894,460 0.541% 4,514,339 14 float64_compare <qemu>/fpu/softfloat.c
3 53,047,288 0.442% 7,578,184 7 update_fr0_op <qemu>/target/hppa/op_helper.c
4 43,799,210 0.365% 3,063,845 14 float64_compare_quiet <qemu>/fpu/softfloat.c
Further, the same can be done for QEMU 5.1-pre-soft-freeze:
./list_fn_callees.py -f helper_fcmp_d -- <qemu-master-build>/hppa-linux-user/qemu-hppa coulomb_double-hppa
Results:
Total number of instructions: 12,008,552,050
Callees of helper_fcmp_d:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 234,512,818 1.953% 7,578,184 30 update_fr0_cmp.isra.6 <qemu-master>/qemu/target/hppa/op_helper.c
2 64,894,460 0.540% 4,514,339 14 float64_compare <qemu-master>/qemu/fpu/softfloat.c
3 53,047,288 0.442% 7,578,184 7 update_fr0_op <qemu-master>/qemu/target/hppa/op_helper.c
4 43,799,210 0.365% 3,063,845 14 float64_compare_quiet <qemu-master>/qemu/fpu/softfloat.c
From the results, it can be concluded that hppa performance remained the same because its float comparison helper, helper_fcmp_d, already used the standard float64_compare and float64_compare_quiet softfloat functions, unlike the ppc/ppc64/ppc64le targets, which used the float64 compare specializations.
Analyzing sh4 Performance 5.0 VS 5.1-pre-soft-freeze
Let’s now similarly compare sh4 performance in QEMU 5.0 and QEMU 5.1-pre-soft-freeze.
Finding list of sh4 helpers for QEMU 5.0:
./list_helpers.py -- <qemu-build>/sh4-linux-user/qemu-sh4 coulomb_double-sh4
Results:
Total number of instructions: 12,728,140,143
Executed QEMU Helpers:
No. Instructions Percentage Calls Ins/Call Helper Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 3,423,220,999 26.895% 21,503,828 159 helper_fmul_DT <qemu>/target/sh4/op_helper.c
2 3,059,134,203 24.034% 18,478,099 165 helper_fadd_DT <qemu>/target/sh4/op_helper.c
3 2,635,892,905 20.709% 16,946,808 155 helper_fsub_DT <qemu>/target/sh4/op_helper.c
4 561,171,205 4.409% 7,873,279 71 helper_lookup_tb_ptr <qemu>/accel/tcg/tcg-runtime.c
5 192,918,908 1.516% 4,642,299 41 helper_fcmp_gt_DT <qemu>/target/sh4/op_helper.c
6 188,881,026 1.484% 1,262,976 149 helper_fdiv_DT <qemu>/target/sh4/op_helper.c
7 113,909,493 0.895% 2,760,445 41 helper_fcmp_eq_DT <qemu>/target/sh4/op_helper.c
8 83,148,142 0.653% 459,382 181 helper_ftrc_DT <qemu>/target/sh4/op_helper.c
9 310,000 0.002% 2,000 155 helper_float_DT <qemu>/target/sh4/op_helper.c
10 143,000 0.001% 11,000 13 helper_ld_fpscr <qemu>/target/sh4/op_helper.c
11 3,528 0.000% 42 84 helper_trapa <qemu>/target/sh4/op_helper.c
Finding list of sh4 helpers for QEMU 5.1-pre-soft-freeze:
./list_helpers.py -- <qemu-master-build>/sh4-linux-user/qemu-sh4 coulomb_double-sh4
Results:
Total number of instructions: 12,730,023,780
Executed QEMU Helpers:
No. Instructions Percentage Calls Ins/Call Helper Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 3,423,220,999 26.891% 21,503,828 159 helper_fmul_DT <qemu-master>/qemu/target/sh4/op_helper.c
2 3,059,134,203 24.031% 18,478,099 165 helper_fadd_DT <qemu-master>/qemu/target/sh4/op_helper.c
3 2,635,892,905 20.706% 16,946,808 155 helper_fsub_DT <qemu-master>/qemu/target/sh4/op_helper.c
4 561,171,205 4.408% 7,873,279 71 helper_lookup_tb_ptr <qemu-master>/qemu/accel/tcg/tcg-runtime.c
5 192,918,908 1.515% 4,642,299 41 helper_fcmp_gt_DT <qemu-master>/qemu/target/sh4/op_helper.c
6 188,881,026 1.484% 1,262,976 149 helper_fdiv_DT <qemu-master>/qemu/target/sh4/op_helper.c
7 113,909,493 0.895% 2,760,445 41 helper_fcmp_eq_DT <qemu-master>/qemu/target/sh4/op_helper.c
8 84,066,906 0.660% 459,382 183 helper_ftrc_DT <qemu-master>/qemu/target/sh4/op_helper.c
9 296,000 0.002% 2,000 148 helper_float_DT <qemu-master>/qemu/target/sh4/op_helper.c
10 143,000 0.001% 11,000 13 helper_ld_fpscr <qemu-master>/qemu/target/sh4/op_helper.c
11 3,528 0.000% 42 84 helper_trapa <qemu-master>/qemu/target/sh4/op_helper.c
Now, one can spot that the helpers helper_fcmp_gt_DT and helper_fcmp_eq_DT, which are related to floating point number comparisons, had no change in their number of instructions per call. To further inspect the reason behind this, let’s use the list_fn_callees.py script to find their callees.
For QEMU 5.0:
./list_fn_callees.py -f helper_fcmp_gt_DT helper_fcmp_eq_DT -- <qemu-build>/sh4-linux-user/qemu-sh4 coulomb_double-sh4
Results:
Total number of instructions: 12,728,140,143
Callees of helper_fcmp_gt_DT:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 67,576,835 0.531% 4,642,299 14 float64_compare <qemu>/fpu/softfloat.c
2 32,496,093 0.255% 4,642,299 7 update_fpscr <qemu>/target/sh4/op_helper.c
Callees of helper_fcmp_eq_DT:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 39,377,478 0.309% 2,760,445 14 float64_compare <qemu>/fpu/softfloat.c
2 19,323,115 0.152% 2,760,445 7 update_fpscr <qemu>/target/sh4/op_helper.c
For QEMU 5.1-pre-soft-freeze:
./list_fn_callees.py -f helper_fcmp_gt_DT helper_fcmp_eq_DT -- <qemu-master-build>/sh4-linux-user/qemu-sh4 coulomb_double-sh4
Results:
Total number of instructions: 12,730,023,815
Callees of helper_fcmp_gt_DT:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 67,576,835 0.531% 4,642,299 14 float64_compare <qemu-master>/qemu/fpu/softfloat.c
2 32,496,093 0.255% 4,642,299 7 update_fpscr <qemu-master>/qemu/target/sh4/op_helper.c
Callees of helper_fcmp_eq_DT:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 39,377,478 0.309% 2,760,445 14 float64_compare <qemu-master>/qemu/fpu/softfloat.c
2 19,323,115 0.152% 2,760,445 7 update_fpscr <qemu-master>/qemu/target/sh4/op_helper.c
From the results, it can be seen that for the same reason as hppa, the performance of sh4 remained the same despite the changes made in the softfloat implementation.
Analyzing mips Performance 5.0 VS 5.1-pre-soft-freeze
Finding list of mips helpers for QEMU 5.0:
./list_helpers.py -- <qemu-build>/mips-linux-user/qemu-mips coulomb_double-mips
Results:
Total number of instructions: 10,438,170,014
Executed QEMU Helpers:
No. Instructions Percentage Calls Ins/Call Helper Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 2,208,051,015 21.154% 14,936,914 147 helper_float_sub_d <qemu>/target/mips/fpu_helper.c
2 2,061,214,236 19.747% 11,626,196 177 helper_float_mul_d <qemu>/target/mips/fpu_helper.c
3 1,884,622,177 18.055% 7,860,734 239 helper_float_madd_d <qemu>/target/mips/fpu_helper.c
4 1,727,127,172 16.546% 10,609,351 162 helper_float_add_d <qemu>/target/mips/fpu_helper.c
5 468,482,823 4.488% 2,003,882 233 helper_float_msub_d <qemu>/target/mips/fpu_helper.c
6 249,775,460 2.393% 4,218,047 59 helper_cmp_d_lt <qemu>/target/mips/fpu_helper.c
7 207,577,312 1.989% 1,261,468 164 helper_float_div_d <qemu>/target/mips/fpu_helper.c
8 156,426,380 1.499% 3,055,498 51 helper_cmp_d_eq <qemu>/target/mips/fpu_helper.c
9 113,177,370 1.084% 2,169,025 52 helper_lookup_tb_ptr <qemu>/accel/tcg/tcg-runtime.c
10 80,391,850 0.770% 459,382 175 helper_float_trunc_w_d <qemu>/target/mips/fpu_helper.c
11 24,287,676 0.233% 419,248 57 helper_cmp_d_le <qemu>/target/mips/fpu_helper.c
12 4,016,000 0.038% 1,004,000 4 helper_cfc1 <qemu>/target/mips/fpu_helper.c
13 718,350 0.007% 1,000 718 helper_float_sqrt_d <qemu>/target/mips/fpu_helper.c
14 372,000 0.004% 4,000 93 helper_cmp_d_ule <qemu>/target/mips/fpu_helper.c
15 290,000 0.003% 2,000 145 helper_float_cvtd_w <qemu>/target/mips/fpu_helper.c
16 257,384 0.002% 2,962 86 helper_swl <qemu>/target/mips/op_helper.c
17 176,000 0.002% 4,000 44 helper_cmp_d_un <qemu>/target/mips/fpu_helper.c
18 90,000 0.001% 1,000 90 helper_cmp_d_ult <qemu>/target/mips/fpu_helper.c
19 4,171 0.000% 43 97 helper_raise_exception_err <qemu>/target/mips/op_helper.c
Finding list of mips helpers for QEMU 5.1-pre-soft-freeze:
./list_helpers.py -- <qemu-master-build>/mips-linux-user/qemu-mips coulomb_double-mips
Results:
Total number of instructions: 9,956,439,111
Executed QEMU Helpers:
No. Instructions Percentage Calls Ins/Call Helper Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 2,122,805,507 21.321% 14,936,914 142 helper_float_sub_d <qemu-master>/target/mips/fpu_helper.c
2 1,943,156,724 19.517% 11,626,196 167 helper_float_mul_d <qemu-master>/target/mips/fpu_helper.c
3 1,792,488,279 18.003% 7,860,734 228 helper_float_madd_d <qemu-master>/target/mips/fpu_helper.c
4 1,632,785,630 16.399% 10,609,351 153 helper_float_add_d <qemu-master>/target/mips/fpu_helper.c
5 444,449,149 4.464% 2,003,882 221 helper_float_msub_d <qemu-master>/target/mips/fpu_helper.c
6 207,914,606 2.088% 4,218,047 49 helper_cmp_d_lt <qemu-master>/target/mips/fpu_helper.c
7 192,439,696 1.933% 1,261,468 152 helper_float_div_d <qemu-master>/target/mips/fpu_helper.c
8 147,927,209 1.486% 3,055,498 48 helper_cmp_d_eq <qemu-master>/target/mips/fpu_helper.c
9 113,177,370 1.137% 2,169,025 52 helper_lookup_tb_ptr <qemu-master>/accel/tcg/tcg-runtime.c
10 80,391,850 0.807% 459,382 175 helper_float_trunc_w_d <qemu-master>/target/mips/fpu_helper.c
11 20,962,400 0.211% 419,248 50 helper_cmp_d_le <qemu-master>/target/mips/fpu_helper.c
12 4,016,000 0.040% 1,004,000 4 helper_cfc1 <qemu-master>/target/mips/fpu_helper.c
13 706,350 0.007% 1,000 706 helper_float_sqrt_d <qemu-master>/target/mips/fpu_helper.c
14 300,000 0.003% 4,000 75 helper_cmp_d_ule <qemu-master>/target/mips/fpu_helper.c
15 272,000 0.003% 2,000 136 helper_float_cvtd_w <qemu-master>/target/mips/fpu_helper.c
16 257,384 0.003% 2,962 86 helper_swl <qemu-master>/target/mips/op_helper.c
17 180,000 0.002% 4,000 45 helper_cmp_d_un <qemu-master>/target/mips/fpu_helper.c
18 72,000 0.001% 1,000 72 helper_cmp_d_ult <qemu-master>/target/mips/fpu_helper.c
19 4,171 0.000% 43 97 helper_raise_exception_err <qemu-master>/target/mips/op_helper.c
Looking at the results, 13 out of the 19 helpers had a decrease in their number of instructions per call (helper_cmp_d_un rose slightly, from 44 to 45, and the remaining five were unchanged). This implies that a change introduced in the QEMU 5.1-pre-soft-freeze source code is reflected in all of these helpers. To pinpoint this change, the list_fn_callees.py script will be used to display the callees of the top three helpers.
Finding the callees for QEMU 5.0:
./list_fn_callees.py -f helper_float_sub_d helper_float_mul_d helper_float_madd_d -- <qemu-build>/mips-linux-user/qemu-mips coulomb_double-mips
Results:
Total number of instructions: 10,438,169,918
Callees of helper_float_sub_d:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 1,756,973,417 16.832% 14,936,914 117 float64_sub <qemu>/fpu/softfloat.c
2 116,280,528 1.114% 5,537,168 21 ieee_ex_to_mips.part.2 <qemu>/target/mips/fpu_helper.c
Callees of helper_float_mul_d:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 1,498,120,304 14.352% 11,626,196 128 float64_mul <qemu>/fpu/softfloat.c
2 199,090,752 1.907% 9,480,512 21 ieee_ex_to_mips.part.2 <qemu>/target/mips/fpu_helper.c
Callees of helper_float_madd_d:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 1,009,704,817 9.673% 7,860,734 128 float64_mul <qemu>/fpu/softfloat.c
2 365,108,047 3.498% 7,860,734 46 float64_add <qemu>/fpu/softfloat.c
3 160,466,103 1.537% 7,641,243 21 ieee_ex_to_mips.part.2 <qemu>/target/mips/fpu_helper.c
Finding the callees for QEMU 5.1-pre-soft-freeze:
./list_fn_callees.py -f helper_float_sub_d helper_float_mul_d helper_float_madd_d -- <qemu-master-build>/mips-linux-user/qemu-mips coulomb_double-mips
Results:
Total number of instructions: 9,956,439,118
Callees of helper_float_sub_d:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 1,756,973,417 17.647% 14,936,914 117 float64_sub <qemu-master>/fpu/softfloat.c
Callees of helper_float_mul_d:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 1,498,120,304 15.047% 11,626,196 128 float64_mul <qemu-master>/fpu/softfloat.c
Callees of helper_float_madd_d:
No. Instructions Percentage Calls Ins/Call Function Name Source File
---- --------------- ---------- --------------- ---------- ------------------------- ------------------------------
1 1,009,704,817 10.141% 7,860,734 128 float64_mul <qemu-master>/fpu/softfloat.c
2 365,108,047 3.667% 7,860,734 46 float64_add <qemu-master>/fpu/softfloat.c
From the results, it’s seen that the function ieee_ex_to_mips.part.2 disappeared from the callee lists of all three helpers. Comparing the source code of both QEMU versions shows that the reason behind this is the inlining of the ieee_ex_to_mips function and its renaming to ieee_to_mips_xcpt in QEMU 5.1-pre-soft-freeze.
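To put the effect in numbers, the sketch below compares the instructions saved in each of the three helpers with what the out-of-line ieee_ex_to_mips.part.2 callee cost in QEMU 5.0. All values are copied from the tables above; the variable names are illustrative.

```python
# helper -> (instructions in 5.0, instructions in 5.1-pre-soft-freeze,
#            inclusive cost of the ieee_ex_to_mips.part.2 callee in 5.0)
helpers = {
    "helper_float_sub_d":  (2_208_051_015, 2_122_805_507, 116_280_528),
    "helper_float_mul_d":  (2_061_214_236, 1_943_156_724, 199_090_752),
    "helper_float_madd_d": (1_884_622_177, 1_792_488_279, 160_466_103),
}

savings = {name: old - new for name, (old, new, _) in helpers.items()}

for name, (old, new, callee_cost) in helpers.items():
    print(f"{name:<20} saved {savings[name]:>12,} instructions; "
          f"out-of-line callee cost in 5.0 was {callee_cost:>12,}")
```

The saving in each helper is smaller than the former callee cost. One plausible reading, not verified here, is that the inlined logic still executes inside the helper while the call overhead is gone.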
This concludes the four analysis sections in the report. ppc, hppa, sh4, and mips were selected as examples for the analysis because they cover all cases of performance change. The next logical step would be to find the commit that updated the implementation of softfloat, causing the PowerPC performance degradation, as well as the commit that inlined ieee_ex_to_mips, which caused the mips performance improvement.
This is the idea of next week’s report, which will introduce a new method for automatically locating commits that introduce performance improvements or degradations in QEMU.
Appendix
Float Comparison Helper of ppc
void helper_fcmpu(CPUPPCState *env, uint64_t arg1, uint64_t arg2,
                  uint32_t crfD)
{
    CPU_DoubleU farg1, farg2;
    uint32_t ret = 0;

    farg1.ll = arg1;
    farg2.ll = arg2;

    if (unlikely(float64_is_any_nan(farg1.d) ||
                 float64_is_any_nan(farg2.d))) {
        ret = 0x01UL;
    } else if (float64_lt(farg1.d, farg2.d, &env->fp_status)) {
        ret = 0x08UL;
    } else if (!float64_le(farg1.d, farg2.d, &env->fp_status)) {
        ret = 0x04UL;
    } else {
        ret = 0x02UL;
    }

    env->fpscr &= ~FP_FPCC;
    env->fpscr |= ret << FPSCR_FPCC;
    env->crf[crfD] = ret;
    if (unlikely(ret == 0x01UL
                 && (float64_is_signaling_nan(farg1.d, &env->fp_status) ||
                     float64_is_signaling_nan(farg2.d, &env->fp_status)))) {
        /* sNaN comparison */
        float_invalid_op_vxsnan(env, GETPC());
    }
}
Float Comparison Helper of hppa
void HELPER(fcmp_d)(CPUHPPAState *env, float64 a, float64 b,
                    uint32_t y, uint32_t c)
{
    FloatRelation r;
    if (c & 1) {
        r = float64_compare(a, b, &env->fp_status);
    } else {
        r = float64_compare_quiet(a, b, &env->fp_status);
    }
    update_fr0_op(env, GETPC());
    update_fr0_cmp(env, y, c, r);
}
Float Comparison Helpers of sh4
uint32_t helper_fcmp_eq_DT(CPUSH4State *env, float64 t0, float64 t1)
{
    int relation;

    set_float_exception_flags(0, &env->fp_status);
    relation = float64_compare(t0, t1, &env->fp_status);
    update_fpscr(env, GETPC());
    return relation == float_relation_equal;
}

uint32_t helper_fcmp_gt_DT(CPUSH4State *env, float64 t0, float64 t1)
{
    int relation;

    set_float_exception_flags(0, &env->fp_status);
    relation = float64_compare(t0, t1, &env->fp_status);
    update_fpscr(env, GETPC());
    return relation == float_relation_greater;
}
Float Operation Helpers of mips
uint64_t helper_float_ ## name ## _d(CPUMIPSState *env,             \
                                     uint64_t fdt0, uint64_t fdt1)  \
{                                                                   \
    uint64_t dt2;                                                   \
                                                                    \
    dt2 = float64_ ## name(fdt0, fdt1, &env->active_fpu.fp_status); \
    update_fcr31(env, GETPC());                                     \
    return dt2;                                                     \
}

uint64_t helper_float_madd_d(CPUOpenRISCState *env, uint64_t a,
                             uint64_t b, uint64_t c)
{
    /* Note that or1ksim doesn't use fused operation. */
    b = float64_mul(b, c, &env->fp_status);
    return float64_add(a, b, &env->fp_status);
}