Thread.onSpinWait() as YIELD on AArch64

1. Summary

We explored the potential benefits of implementing the YIELD instruction for Java on AArch64.

The YIELD instruction on ARM can be considered similar to PAUSE on x86.

It makes sense to add support for this instruction to the JVM in Thread.onSpinWait() and SpinPause. This can be done even though some CPUs implement the instruction as a NOP.

2. Background

Intel implemented the PAUSE instruction and benefits from the power and performance improvements it provides. On older Intel processors this instruction took ~10 cycles, while on Skylake it takes up to 140 cycles. YIELD, a similar instruction on ARMv8, may provide benefits when used properly.

3. HotSpot

There are a few places in HotSpot where spin locks matter:

Thread.onSpinWait(), JEP Spin-Wait Hints (JDK-8147832, discussion)
1. Implemented as a special intrinsic: PAUSE on x86, no-op on other platforms.
2. As of JDK 9, it is used in the StampedLock, Phaser and SynchronousQueue classes. It is likely to be used more widely going forward.

SpinPause
1. Extern, may be defined in .s files:
   src/os_cpu/linux_arm/vm/linux_arm_64.s: YIELD instruction in the ARM64 OpenJDK port; not implemented in the AARCH64 OpenJDK port.
   src/os_cpu/linux_x86/vm/linux_x86_64.s: REP; NOP = PAUSE
2. Used by synchronized (src/share/vm/runtime/objectMonitor.cpp)
3. Used by safepoints (src/share/vm/runtime/safepoint.cpp)

Using SEV/WFE instructions for spin waits in Java is tricky because it may require a SEV for every write to shared state, and the locking logic assumes that a thread stays on the CPU for a short time and is then possibly parked.
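The "spin briefly, then park" assumption mentioned above can be illustrated with a minimal Java sketch. This is not HotSpot's monitor code: the class name, the SPIN_LIMIT budget and the park interval are hypothetical values chosen only for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

public class SpinThenPark {
    // Hypothetical spin budget; a real runtime would tune this adaptively.
    static final int SPIN_LIMIT = 1_000;

    /** Waits until the flag becomes true; returns the number of spin iterations used. */
    static int awaitFlag(AtomicBoolean flag) {
        int spins = 0;
        while (!flag.get()) {
            if (spins < SPIN_LIMIT) {
                spins++;
                // Spin-wait hint: PAUSE on x86, possibly a no-op on other platforms.
                Thread.onSpinWait();
            } else {
                // Spin budget exhausted: leave the CPU for about 100 us, then re-check.
                LockSupport.parkNanos(100_000L);
            }
        }
        return spins;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean flag = new AtomicBoolean(false);
        Thread setter = new Thread(() -> flag.set(true));
        setter.start();
        int spins = awaitFlag(flag);
        setter.join();
        System.out.println("flag observed after " + spins + " spins");
    }
}
```

The waiter never depends on an explicit wake-up event from the writer, which is why a plain hint instruction fits this pattern more easily than SEV/WFE.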

4. YIELD/PAUSE instruction use in other software

YIELD (ARM) and PAUSE (x86) are used in other software besides Java. To name some:

5. YIELD/PAUSE in Java/JVM - performance and power considerations

5.1. Thread.onSpinWait() intrinsic

The Thread.onSpinWait() method was added in JDK 9 by Azul (JEP Spin-Wait Hints), who implemented it as the PAUSE instruction on x86 hardware. It is now used in several Java classes (StampedLock, Phaser, SynchronousQueue) that are widely used in conventional Java software.

Two sets of benchmarks were used to compare Thread.onSpinWait() performance on Cavium ThunderX and Raspberry Pi 3.

The first is the set of benchmarks developed by Gil Tene for the original Spin-Wait Hints JEP.
The second is a JMH benchmark developed by BellSoft to microbenchmark the same intrinsic and gain additional information.

5.1.1. Linux x86-64 results

To verify our ARM results and give them some context, here are some measurements on an Intel i5.

Gil benchmarks

Results are much better with hyper-threading on the same core, but to compare with the hardware we had, it made sense to bind threads to different cores in the same package. Results are presented for this case, though we measured other cases too.

SpinWaitTest

[Figure: latency by percentile, up to 100% (x86 bound dist 100)]


Intrinsic is off {-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_onSpinWait} (off-bound-dist.hgrm)

#[Mean = 122.32, StdDeviation = 402.79]
#[Max = 2408447.00, Total count = 200000000]
#[Buckets = 35, SubBuckets = 256]
# duration = 30342934158
# duration (ns) per round trip op = 151.71467079
# round trip ops/sec = 6591320
# 50%'ile: 121ns
# 90%'ile: 130ns
# 99%'ile: 144ns
# 99.9%'ile: 191ns

Intrinsic is on (on-bound-dist.hgrm)

#[Mean = 106.27, StdDeviation = 365.33]
#[Max = 1925119.00, Total count = 200000000]
#[Buckets = 35, SubBuckets = 256]
# duration = 27006943978
# duration (ns) per round trip op = 135.03471989
# round trip ops/sec = 7405502
# 50%'ile: 103ns
# 90%'ile: 116ns
# 99%'ile: 136ns
# 99.9%'ile: 184ns

Outcome of results analysis: with the intrinsic on, Thread.onSpinWait() as PAUSE on x86 gives roughly 11-15% lower latency (mean 122.32 ns vs. 106.27 ns, median 121 ns vs. 103 ns, round trip 151.7 ns vs. 135.0 ns).

BellSoft SpinWaitBench JMH benchmark

SpinWaitBench.java — 2 threads volatile ping-pong similar to Gil’s approach

SpinWaitNoAuxBench.java — slightly less harness work per operation

SpinWaitOpBench.java — calculates cost of Thread.onSpinWait() and infra cost
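For orientation, the two-thread volatile ping-pong pattern can be sketched in plain Java as below. This is not the BellSoft or Gil Tene benchmark code and does not use JMH; the class and field names are hypothetical, and the spin counter only loosely corresponds to the totalSpins metric reported in the tables.

```java
public class PingPongSketch {
    // Two volatile "mailboxes": the initiator writes ping, the responder writes pong.
    static volatile long ping = -1;
    static volatile long pong = -1;

    /** Runs the given number of round trips; returns the initiator's spin iterations. */
    static long run(long rounds) {
        ping = -1;
        pong = -1;
        Thread responder = new Thread(() -> {
            for (long i = 0; i < rounds; i++) {
                while (ping != i) {
                    Thread.onSpinWait(); // hint while waiting for the next ping
                }
                pong = i;                // answer
            }
        });
        responder.start();

        long spins = 0;
        for (long i = 0; i < rounds; i++) {
            ping = i;                    // send
            while (pong != i) {
                spins++;
                Thread.onSpinWait();     // hint while waiting for the answer
            }
        }
        try {
            responder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return spins;
    }

    public static void main(String[] args) {
        long rounds = 1_000_000;
        long start = System.nanoTime();
        long spins = run(rounds);
        long elapsed = System.nanoTime() - start;
        System.out.println("ns per round trip: " + (double) elapsed / rounds);
        System.out.println("spins per round trip: " + (double) spins / rounds);
    }
}
```

Running it once normally and once with -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_onSpinWait gives a rough feel for the comparison, although without a harness such as JMH the numbers are only indicative.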

SpinWaitBench

Throughput in ops/us, #/op


Intrinsic Off {-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_onSpinWait}

Benchmark                   Score    Error
SpinWaitBench.pong         20.108 ±  0.560
:L1-dcache-load-misses      3.863 ±  0.704
:L1-dcache-loads          200.545 ± 36.993
:L1-dcache-store-misses     0.010 ±  0.027
:L1-dcache-stores           2.664 ±  0.627
:L1-icache-load-misses      0.042 ±  0.047
:LLC-loads                  1.709 ±  2.007
:LLC-stores                 0.993 ±  0.238
:branch-misses              0.088 ±  0.026
:branches                 131.968 ± 23.358
:consume                   10.054 ±  0.280
:cycles                   308.754 ± 64.203
:dTLB-load-misses           0.002 ±  0.002
:dTLB-loads               200.576 ± 39.358
:dTLB-store-misses         ≈ 10⁻⁴
:dTLB-stores                2.557 ±  0.347
:iTLB-load-misses           0.001 ±  0.003
:iTLB-loads                 0.001 ±  0.005
:instructions             530.368 ± 91.808
:produce                   10.054 ±  0.280
:stalled-cycles-frontend  156.833 ± 39.406
:totalSpins               613.655 ± 11.808
:CPI                        0.582 ±  0.056

Intrinsic On {}

Benchmark                   Score    Error
SpinWaitBench.pong         22.696 ±  1.004
:L1-dcache-load-misses      2.125 ±  2.344
:L1-dcache-loads           22.649 ±  5.490
:L1-dcache-store-misses     0.008 ±  0.026
:L1-dcache-stores           2.594 ±  0.868
:L1-icache-load-misses      0.026 ±  0.050
:LLC-loads                  1.133 ±  2.134
:LLC-stores                 0.975 ±  0.220
:branch-misses              0.455 ±  0.906
:branches                  14.141 ±  3.955
:consume                   11.348 ±  0.501
:cycles                   271.357 ± 97.363
:dTLB-load-misses           0.002 ±  0.006
:dTLB-loads                22.711 ±  7.788
:dTLB-store-misses         ≈ 10⁻⁴
:dTLB-stores                2.674 ±  0.669
:iTLB-load-misses           0.001 ±  0.001
:iTLB-loads                 0.001 ±  0.003
:instructions              67.516 ± 18.550
:produce                   11.348 ±  0.502
:stalled-cycles-frontend  230.792 ± 79.629
:totalSpins                61.882 ±  1.373
:CPI                        4.018 ±  0.477

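Outcome of results analysis: with the intrinsic on, totalSpins, L1 data cache loads, dTLB loads and instructions all drop by roughly 10x, while frontend stalls increase. Latency improves (22.7 vs. 20.1 ops/us). With both threads on the same core the effect is even stronger (higher throughput, about 30x fewer spins). SpinWaitNoAuxBench shows more noise.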

SpinWaitOpBench


Intrinsic off {-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_onSpinWait}

Benchmark     Mode Cnt Score   Error Units
empty         avgt  20 0.314 ± 0.002 ns/op
:cycles       avgt   4 1.016 ± 0.026 #/op
:instructions avgt   4 4.975 ± 0.255 #/op

Benchmark     Mode Cnt Score   Error Units
onSpinWait    avgt  20 0.319 ± 0.006 ns/op
:cycles       avgt   4 1.029 ± 0.076 #/op
:instructions avgt   4 5.033 ± 0.172 #/op

Intrinsic on {}

Benchmark     Mode Cnt Score   Error Units
empty         avgt  20 0.313 ± 0.001 ns/op
:cycles       avgt   4 1.020 ± 0.037 #/op
:instructions avgt   4 5.028 ± 0.251 #/op

Benchmark     Mode Cnt Score    Error Units
onSpinWait    avgt  20  5.215 ± 0.023 ns/op
:cycles       avgt   4 16.922 ± 0.840 #/op
:instructions avgt   4  6.039 ± 0.306 #/op

Outcome of results analysis: PAUSE length on i5-3320M is ~16 cycles.
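The ~16 cycle figure follows from subtracting the onSpinWait cycles/op measured with the intrinsic off from the value measured with it on. The small sketch below redoes that arithmetic with the numbers reported above; the class and method names are hypothetical, and the implied clock frequency is only a plausibility check on the measurements.

```java
import java.util.Locale;

public class SpinHintCost {
    /** Net cost of the hint: cycles/op with the instruction minus cycles/op without it. */
    static double netCycles(double cyclesWithHint, double cyclesBaseline) {
        return cyclesWithHint - cyclesBaseline;
    }

    /** Clock frequency implied by a (cycles/op, ns/op) pair, in GHz. */
    static double impliedGHz(double cyclesPerOp, double nsPerOp) {
        return cyclesPerOp / nsPerOp;
    }

    public static void main(String[] args) {
        // onSpinWait rows from the x86 SpinWaitOpBench results: intrinsic on vs. off.
        double net = netCycles(16.922, 1.029);
        double ghz = impliedGHz(16.922, 5.215);
        System.out.println(String.format(Locale.ROOT,
                "net PAUSE cost: %.1f cycles, implied clock: %.2f GHz", net, ghz));
    }
}
```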

Power consumption of the system is 18.3±1.5 W in both cases when the workload runs on a single CPU, according to powerstat. This method is not very informative, but we had no better means of measuring power.

5.1.2. Linux AArch64 results

To study the potential benefits of implementing YIELD on ARMv8, a patch was developed for jdk10/hs that adds C1 and C2 intrinsic implementations of onSpinWait() as the yield instruction.

Gil benchmarks on ThunderX

SpinWaitTest

[Figure: latency by percentile, up to 99.9% (cav bound dist 99)]

[Figure: latency by percentile, up to 100% (cav bound dist 100)]

#[Mean = 219.41, StdDeviation = 193.52]
#[Max = 2392063.00, Total count = 200000000]
#[Buckets = 35, SubBuckets = 256]
# duration = 65672330449
# duration (ns) per round trip op = 328.361652245
# round trip ops/sec = 3045422
# 50%'ile: 220ns
# 90%'ile: 220ns
# 99%'ile: 230ns
# 99.9%'ile: 230ns

Outcome of results analysis: latency is the same with the intrinsic on and off, and it is high, e.g. for cores in the same package.

SpinWaitBench JMH benchmark

On both Raspberry Pi 3 and Cavium ThunderX, the benchmark results and perf statistics show that the instruction is implemented as a NOP.

On the other hand:

The average number of spins can be counted.
We can study the effect of doing fewer loads by putting many NOPs (32, for instance) into the loop body instead of a single yield. The results below are for this number of NOPs on Cavium, with threads bound to cores in the same package.

SpinWaitBench, throughput ops/us and #/op

Intrinsic Off

Benchmark                    Score    Error
SpinWaitBench.pong          15.780 ± 0.102
:CPI                         1.498 ± 0.033
:L1-dcache-load-misses       1.016 ± 0.041
:L1-dcache-loads            81.948 ± 4.772
:L1-dcache-store-misses      0.022 ± 0.102
:L1-dcache-stores            1.640 ± 0.215
:L1-icache-load-misses       0.012 ± 0.039
:L1-icache-loads           142.277 ± 6.906
:branch-misses               1.036 ± 0.077
:branches                   25.848 ± 1.335
:consume                     7.893 ± 0.051
:cycles                    256.317 ± 1.805
:dTLB-load-misses            0.006 ± 0.011
:dTLB-loads                 82.002 ± 4.179
:dTLB-store-misses           0.002 ± 0.005
:dTLB-stores                 1.631 ± 0.196
:iTLB-load-misses            0.002 ± 0.007
:instructions              171.083 ± 3.045
:produce                     7.887 ± 0.052
:stalled-cycles-backend    126.831 ± 1.510
:stalled-cycles-frontend    10.720 ± 1.970
:totalSpins                140.333 ± 4.825

Intrinsic On

Benchmark                    Score    Error
SpinWaitBench.pong          14.876 ±  0.138
:CPI                         1.058 ±  0.015
:L1-dcache-load-misses       1.017 ±  0.046
:L1-dcache-loads            23.534 ±  1.125
:L1-dcache-store-misses      0.020 ±  0.045
:L1-dcache-stores            1.627 ±  0.188
:L1-icache-load-misses       0.018 ±  0.031
:L1-icache-loads           109.518 ±  4.125
:branch-misses               1.098 ±  0.208
:branches                    6.340 ±  0.460
:consume                     7.438 ±  0.069
:cycles                    270.067 ± 10.309
:dTLB-load-misses            0.009 ±  0.008
:dTLB-loads                 23.588 ±  0.368
:dTLB-store-misses           0.002 ±  0.002
:dTLB-stores                 1.674 ±  0.242
:iTLB-load-misses            0.002 ±  0.004
:instructions              255.178 ± 12.632
:produce                     7.437 ±  0.068
:stalled-cycles-backend    109.726 ±  2.667
:stalled-cycles-frontend    14.605 ±  1.210
:totalSpins                 37.567 ±  1.358

Outcome of results analysis: totalSpins decreased from ~140 to ~38 while throughput stayed about the same. Similar results are expected with a single yield instruction on hardware where it is implemented for either SMT or temporal multithreading.

Throughput and measured latency with 48 ping-pong pairs on a 96-core processor are about the same, though runs with 32 NOPs have a higher deviation (e.g. 15±3 ops/us → 13±18 ops/us per pair).

SpinWaitOpBench

Intrinsic Off

Benchmark     Mode Cnt Score   Error Units
empty         avgt  20 1.508 ± 0.002 ns/op
:cycles       avgt   4 3.097 ± 0.287  #/op
:instructions avgt   4 4.068 ± 0.306  #/op

Benchmark     Mode Cnt Score   Error Units
onSpinWait    avgt  20 1.509 ± 0.002 ns/op
:cycles       avgt   4 3.077 ± 0.221  #/op
:instructions avgt   4 4.002 ± 0.186  #/op

Intrinsic On

Benchmark     Mode Cnt Score   Error Units
empty         avgt  20 1.509 ± 0.002 ns/op
:cycles       avgt   4 3.101 ± 0.225  #/op
:instructions avgt   4 4.016 ± 0.219  #/op

Benchmark     Mode Cnt Score    Error Units
onSpinWait    avgt  20  9.057 ± 0.018 ns/op
:cycles       avgt   4 18.814 ± 1.835  #/op
:instructions avgt   4 36.407 ± 2.472  #/op

Outcome of results analysis: intrinsic with 32 NOPs takes ~16 cycles.

5.2. SpinPause

YIELD can be used for synchronized and SpinPause in the JVM in much the same way as for the Thread.onSpinWait() intrinsic. On x86, spin pause is also done with the PAUSE instruction. The current implementation of SpinPause in linux-aarch64 is an empty function. We made some sanity measurements to check a SpinPause implementation with yield, and they show no regressions. Still, it is generally worth checking in other benchmarks.

6. Conclusions

We propose to add the onSpinWait() intrinsic and a SpinPause implementation to the (Linux) AArch64 port using the yield instruction.
It is harmless for CPUs that implement the instruction as a NOP.
It should be beneficial for CPUs that have SMT or are able to partially shut down or pause for a while.
Where yield works as a NOP, emitting several nop instructions instead may be considered, to decrease memory pressure without losing throughput.
A concrete latency estimate for a potential implementation that should not cause any performance degradation for Java is around 16 cycles.