= String.length() performance https://bugs.openjdk.java.net/browse/JDK-8054307 The tests are done on JDK 9 Sandbox repo, 1x4x2 i7-4790K (Haswell) 4.0 GHz, Linux x86_64. Benchmark code is available at: http://cr.openjdk.java.net/~shade/density/ == Background The performance of length() is important for many workloads, especially those who do low-level operations on Strings. The basic version of length() just polls the array length of the backing array. It String Density code, we need to do a few other things. == Benchmarks The benchmarks are very simple. We call String.length() on a list of Strings, and list is pre-populated with the Strings of either 1-byte or 2-byte characters. "bias" parameter defines the ratio of 1-byte strings to 2-byte strings. === JDK 9 baseline First, JDK 9 baseline: Benchmark (bias) (count) Mode Cnt Score Error Units LengthBench.test 0.00 4096 avgt 25 19.065 ± 0.214 us/op LengthBench.test 0.25 4096 avgt 25 19.339 ± 0.593 us/op LengthBench.test 0.50 4096 avgt 25 19.641 ± 0.416 us/op LengthBench.test 0.75 4096 avgt 25 19.930 ± 0.872 us/op LengthBench.test 1.00 4096 avgt 25 19.170 ± 0.241 us/op Obiviously, the performance does not depend on bias. === String Density, current: Benchmark (bias) (count) Mode Cnt Score Error Units LengthBench.test 0.00 4096 avgt 25 20.248 ± 0.200 us/op LengthBench.test 0.25 4096 avgt 25 26.301 ± 0.226 us/op LengthBench.test 0.50 4096 avgt 25 34.558 ± 0.639 us/op LengthBench.test 0.75 4096 avgt 25 28.562 ± 0.282 us/op LengthBench.test 1.00 4096 avgt 25 20.514 ± 0.174 us/op Now, it gets interesting: a) in two extreme cases, the performance is slightly lower than baseline. b) with the moderate bias, the performance drops off rapidly The first effect is explained by the coder selection overhead: bias = 0.0: 3.46% 3.16% 0x00007f77d0ae1760: mov %eax,-0x14000(%rsp) 1.15% 0.81% 0x00007f77d0ae1767: push %rbp 3.23% 3.02% 0x00007f77d0ae1768: sub $0x20,%rsp 0.36% 0.34% 0x00007f77d0ae176c: movsbl 0x14(%rsi),%r10d ; get String.coder 1.86% 1.36% 0x00007f77d0ae1771: test %r10d,%r10d ; if (coder != 0), jump out 0x00007f77d0ae1774: jne 0x00007f77d0ae178b 3.77% 2.81% 0x00007f77d0ae1776: mov 0xc(%rsi),%r10d ; get String.value 0.24% 0.62% 0x00007f77d0ae177a: mov 0xc(%r12,%r10,8),%eax ; get String.value.arraylength 49.55% 53.71% 0x00007f77d0ae177f: add $0x20,%rsp ; epilogue and return 0.31% 0.65% 0x00007f77d0ae1783: pop %rbp 0.87% 0.54% 0x00007f77d0ae1784: test %eax,0xf24c876(%rip) 6.76% 6.56% 0x00007f77d0ae178a: retq Notice how the unlikely path with non-zero coder is pushed out. The second effect is explained by branch prediction misses, since now we generate both branches. bias = 0.5: 2.33% 1.60% 0x00007fc6f8ae34e0: mov %eax,-0x14000(%rsp) 0.21% 0.14% 0x00007fc6f8ae34e7: push %rbp 2.47% 1.82% 0x00007fc6f8ae34e8: sub $0x10,%rsp 0.13% 0.11% 0x00007fc6f8ae34ec: movsbl 0x14(%rsi),%r11d ; get String.coder 6.72% 9.99% 0x00007fc6f8ae34f1: mov 0xc(%rsi),%r10d ; get String.value 3.26% 2.60% 0x00007fc6f8ae34f5: test %r11d,%r11d ; if (coder == 0), jump out 0x00007fc6f8ae34f8: je 0x00007fc6f8ae3503 1.16% 1.78% 0x00007fc6f8ae34fa: mov 0xc(%r12,%r10,8),%eax ; get String.value.arraylength 28.59% 30.56% 0x00007fc6f8ae34ff: sar %eax ; >> 1 1.46% 0.75% 0x00007fc6f8ae3501: jmp 0x00007fc6f8ae3508 ; jump to return 1.87% 1.97% 0x00007fc6f8ae3503: mov 0xc(%r12,%r10,8),%eax ; get String.value.arraylength 28.40% 30.18% 0x00007fc6f8ae3508: add $0x10,%rsp ; epilogue and return 0.02% 0.05% 0x00007fc6f8ae350c: pop %rbp 0.35% 0.21% 0x00007fc6f8ae350d: test %eax,0xeb6aaed(%rip) 3.44% 1.79% 0x00007fc6f8ae3513: retq perf stats: 2,800,535,113 branches # 664.156 M/sec [38.76%] 206,470,880 branch-misses # 7.37% of all branches [38.82%] Also notice the generated code has similar arraylength accesses on both branches. This is tracked as: https://bugs.openjdk.java.net/browse/JDK-8075132 Trying to optimize the length() yields a series of patches, targeting the branch prediction costs and working around code generator quirks. === Density, patch 1 First patch is very simple: inline isLatin1() and StringUTF16.length: diff -r 66bedbaa3cfe src/java.base/share/classes/java/lang/String.java --- a/src/java.base/share/classes/java/lang/String.java Mon Feb 23 18:23:04 2015 +0000 +++ b/src/java.base/share/classes/java/lang/String.java Thu Mar 12 12:45:25 2015 +0300 @@ -650,10 +650,11 @@ * object. */ public int length() { - if (isLatin1()) { + if (coder == LATIN1) { return value.length; + } else { + return value.length >> 1; } - return StringUTF16.length(value); } /** It yields the performance very similar to Density baseline: Benchmark (bias) (count) Mode Cnt Score Error Units LengthBench.test 0.00 4096 avgt 25 20.143 ± 0.158 us/op LengthBench.test 0.25 4096 avgt 25 26.705 ± 0.206 us/op LengthBench.test 0.50 4096 avgt 25 35.034 ± 1.038 us/op LengthBench.test 0.75 4096 avgt 25 28.693 ± 0.196 us/op LengthBench.test 1.00 4096 avgt 25 20.502 ± 0.209 us/op ...and the generated code is almost the same: bias = 0.5: 2.20% 1.60% 0x00007f8860ae2ee0: mov %eax,-0x14000(%rsp) 0.19% 0.17% 0x00007f8860ae2ee7: push %rbp 2.17% 1.85% 0x00007f8860ae2ee8: sub $0x10,%rsp 0.07% 0.15% 0x00007f8860ae2eec: movsbl 0x14(%rsi),%r11d ; get field $coder 8.03% 9.51% 0x00007f8860ae2ef1: mov 0xc(%rsi),%r10d ; get field $value 3.36% 2.45% 0x00007f8860ae2ef5: test %r11d,%r11d ; test $coder 0x00007f8860ae2ef8: je 0x00007f8860ae2f03 ; jump over 1.71% 1.77% 0x00007f8860ae2efa: mov 0xc(%r12,%r10,8),%eax ; get $value.arraylength 29.66% 31.87% 0x00007f8860ae2eff: sar %eax ; >> 1 1.19% 0.73% 0x00007f8860ae2f01: jmp 0x00007f8860ae2f08 ; jump to return 1.49% 1.83% 0x00007f8860ae2f03: mov 0xc(%r12,%r10,8),%eax ; get $value.arraylength 28.86% 31.55% 0x00007f8860ae2f08: add $0x10,%rsp 0.02% 0.10% 0x00007f8860ae2f0c: pop %rbp 0.17% 0.17% 0x00007f8860ae2f0d: test %eax,0xfa070ed(%rip) 3.71% 2.28% 0x00007f8860ae2f13: retq === Density, patch 2 Second patch tries to dodge double arraylength in the generated code by pre-reading it in local variable: diff -r 66bedbaa3cfe src/java.base/share/classes/java/lang/String.java --- a/src/java.base/share/classes/java/lang/String.java Mon Feb 23 18:23:04 2015 +0000 +++ b/src/java.base/share/classes/java/lang/String.java Fri Mar 13 12:49:10 2015 +0300 @@ -650,10 +650,12 @@ * object. */ public int length() { - if (isLatin1()) { - return value.length; + int len = value.length; + if (coder == LATIN1) { + return len; + } else { + return len >> 1; } - return StringUTF16.length(value); } /** The performance scores are improved relative to Density baseline: Benchmark (bias) (count) Mode Cnt Score Error Units LengthBench.test 0.00 4096 avgt 25 20.051 ± 0.202 us/op LengthBench.test 0.25 4096 avgt 25 23.712 ± 3.189 us/op LengthBench.test 0.50 4096 avgt 25 27.089 ± 5.029 us/op LengthBench.test 0.75 4096 avgt 25 23.991 ± 3.176 us/op LengthBench.test 1.00 4096 avgt 25 21.011 ± 0.137 us/op The results are also experiencing large run-to-run variance because now we also get the CMOV selectors instead of branches in middle biases. However, that optimization relies on branch profile, which is hard to reproduce exactly in String code. JDK-8075132 shows the effect for CMOV more cleanly. bias = 0.0: 3.37% 3.19% 0x00007fdd58ae26e0: mov %eax,-0x14000(%rsp) 1.22% 0.99% 0x00007fdd58ae26e7: push %rbp 3.59% 3.05% 0x00007fdd58ae26e8: sub $0x20,%rsp 0.31% 0.39% 0x00007fdd58ae26ec: mov 0xc(%rsi),%r11d ; get field $value 2.23% 2.03% 0x00007fdd58ae26f0: mov 0xc(%r12,%r11,8),%eax ; get field $value.arraylength 53.92% 59.12% 0x00007fdd58ae26f5: movsbl 0x14(%rsi),%ebp ; get field $coder 0.14% 0.05% 0x00007fdd58ae26f9: test %ebp,%ebp ; test $coder and jump out 0x00007fdd58ae26fb: jne 0x00007fdd58ae2709 0.10% 0.12% 0x00007fdd58ae26fd: add $0x20,%rsp ; return 0.36% 1.11% 0x00007fdd58ae2701: pop %rbp 4.67% 3.24% 0x00007fdd58ae2702: test %eax,0x10e508f8(%rip) 3.89% 3.08% 0x00007fdd58ae2708: retq bias = 0.5: 2.97% 2.57% 0x00007f0a8c4afe60: mov %eax,-0x14000(%rsp) 0.91% 1.45% 0x00007f0a8c4afe67: push %rbp 3.44% 2.08% 0x00007f0a8c4afe68: sub $0x10,%rsp 0.20% 1.00% 0x00007f0a8c4afe6c: mov 0xc(%rsi),%r11d ; get field $value 2.26% 2.01% 0x00007f0a8c4afe70: mov 0xc(%r12,%r11,8),%eax ; get field $value.arraylength 52.56% 55.98% 0x00007f0a8c4afe75: movsbl 0x14(%rsi),%r11d ; get field $coder 0.07% 0.03% 0x00007f0a8c4afe7a: mov %eax,%r10d ; prepare (arraylength >> 1) 0.05% 0.08% 0x00007f0a8c4afe7d: sar %r10d 3.27% 3.47% 0x00007f0a8c4afe80: test %r11d,%r11d ; select either value 0.84% 0.22% 0x00007f0a8c4afe83: cmovne %r10d,%eax 6.78% 6.51% 0x00007f0a8c4afe87: add $0x10,%rsp ; return 0.15% 0.27% 0x00007f0a8c4afe8b: pop %rbp 0.74% 0.25% 0x00007f0a8c4afe8c: test %eax,0xe55a16e(%rip) 3.43% 3.16% 0x00007f0a8c4afe92: retq When this optimization is working properly, the performance improvement is substantial. === patch 3 Third patch tries to eliminate branches entirely, using the fact (LATIN1 == 0) and (UTF16 == 1). This allows to do the unconditional bitwise shift: diff -r 66bedbaa3cfe src/java.base/share/classes/java/lang/String.java --- a/src/java.base/share/classes/java/lang/String.java Mon Feb 23 18:23:04 2015 +0000 +++ b/src/java.base/share/classes/java/lang/String.java Thu Mar 12 12:48:07 2015 +0300 @@ -650,10 +650,7 @@ * object. */ public int length() { - if (isLatin1()) { - return value.length; - } - return StringUTF16.length(value); + return value.length >> coder; } /** The performance with this patch is winning over Density baseline even in pathological (bias=0.5) mode, and comes close to JDK 9 baseline. Benchmark (bias) (count) Mode Cnt Score Error Units LengthBench.test 0.00 4096 avgt 25 20.853 ± 0.263 us/op LengthBench.test 0.25 4096 avgt 25 21.042 ± 0.178 us/op LengthBench.test 0.50 4096 avgt 25 20.951 ± 0.178 us/op LengthBench.test 0.75 4096 avgt 25 20.658 ± 0.187 us/op LengthBench.test 1.00 4096 avgt 25 21.004 ± 0.333 us/op The generated code: 3.75% 3.22% 0x00007f4260ae43e0: mov %eax,-0x14000(%rsp) 1.16% 1.54% 0x00007f4260ae43e7: push %rbp 0.03% 0.10% 0x00007f4260ae43e8: sub $0x10,%rsp 4.03% 2.86% 0x00007f4260ae43ec: mov 0xc(%rsi),%r11d ; get field $value 1.11% 1.75% 0x00007f4260ae43f0: mov 0xc(%r12,%r11,8),%eax ; get $value.arraylength 47.88% 51.88% 0x00007f4260ae43f5: movsbl 0x14(%rsi),%ecx ; get field $coder 1.35% 0.52% 0x00007f4260ae43f9: sar %cl,%eax ; >> 1 10.44% 9.96% 0x00007f4260ae43fb: add $0x10,%rsp ; return 0.93% 0.38% 0x00007f4260ae43ff: pop %rbp 0.21% 0.65% 0x00007f4260ae4400: test %eax,0x10bafbfa(%rip) 0.03% 0.23% 0x00007f4260ae4406: retq = Conclusion 1. The performance of length() with C2 does not necessarily depends on whether isLatin1() or coder bodies are inlined. However, it might be a good idea to inline isLatin1() for consistency, and maybe StringUTF16.length() as well. 2. Pre-reading the value.arraylength into local variable works around JDK-8075132, but still experiences branch (mis)prediction problems. Additionally, it fires the CMOV optimizations that contribute to run-to-run variance in pathological cases. 3. Doing the unconditional shift by coder ID avoid branch misprediction costs, and brings the length() performance closer to JDK 9 baseline. It would be possible to achieve the same effect with the branches, when VM would be able to separate profiles for the different callers.