= String.equals performance with String compression enabled
https://bugs.openjdk.java.net/browse/JDK-8054307

An early prototype for C2 intrinsics is available now, and we can see what problems we should expect going forward. Search for the word "RECOMMENDATION" to see the concrete suggestions on what to fix in the current prototype.

The tests are done on the JDK 9 Sandbox repo, on a 1x4x2 i7-4790K (Haswell) @ 4.0 GHz, running Linux x86_64.

Benchmark code is available at:
  http://cr.openjdk.java.net/~shade/density/

== Equals

Targeted benchmarks for equals() cover the 4 coder combinations for this binary operation: cmp1_cmp1 is String.equals(Latin1, Latin1), cmp1_cmp2 is String.equals(Latin1, UTF16), and so on. Additionally, we test strings of the same length vs. strings of different lengths to assess the shortcutting performance. Our rough estimate from the existing corpus is that around 95% of String comparisons take the "length" shortcut. (A rough sketch of the benchmark shape is included at the end of this note.)

=== Same length

First, the experiment with the same string lengths, which sets us up for the actual character comparisons.

JDK 9 Baseline:

Benchmark              (size)  Mode  Cnt    Score   Error  Units
EqualsBench.cmp1_cmp1       1  avgt   25    4.769 ± 0.001  ns/op
EqualsBench.cmp1_cmp2       1  avgt   25    4.770 ± 0.002  ns/op
EqualsBench.cmp2_cmp1       1  avgt   25    4.771 ± 0.006  ns/op
EqualsBench.cmp2_cmp2       1  avgt   25    4.770 ± 0.002  ns/op
EqualsBench.cmp1_cmp1      64  avgt   25    8.302 ± 0.062  ns/op
EqualsBench.cmp1_cmp2      64  avgt   25    8.222 ± 0.091  ns/op
EqualsBench.cmp2_cmp1      64  avgt   25    8.219 ± 0.059  ns/op
EqualsBench.cmp2_cmp2      64  avgt   25    8.260 ± 0.105  ns/op
EqualsBench.cmp1_cmp1    4096  avgt   25  173.889 ± 9.718  ns/op
EqualsBench.cmp1_cmp2    4096  avgt   25  153.549 ± 6.556  ns/op
EqualsBench.cmp2_cmp1    4096  avgt   25  155.951 ± 9.209  ns/op
EqualsBench.cmp2_cmp2    4096  avgt   25  149.245 ± 0.307  ns/op

JDK 9 String Density:

Benchmark              (size)  Mode  Cnt    Score   Error  Units
EqualsBench.cmp1_cmp1       1  avgt   25    5.206 ± 0.157  ns/op
EqualsBench.cmp1_cmp2       1  avgt   25    3.336 ± 0.008  ns/op
EqualsBench.cmp2_cmp1       1  avgt   25    3.336 ± 0.011  ns/op
EqualsBench.cmp2_cmp2       1  avgt   25    5.073 ± 0.161  ns/op
EqualsBench.cmp1_cmp1      64  avgt   25    6.977 ± 0.370  ns/op
EqualsBench.cmp1_cmp2      64  avgt   25    3.334 ± 0.011  ns/op
EqualsBench.cmp2_cmp1      64  avgt   25    3.340 ± 0.015  ns/op
EqualsBench.cmp2_cmp2      64  avgt   25    8.386 ± 0.256  ns/op
EqualsBench.cmp1_cmp1    4096  avgt   25   93.102 ± 4.729  ns/op
EqualsBench.cmp1_cmp2    4096  avgt   25    3.534 ± 0.196  ns/op
EqualsBench.cmp2_cmp1    4096  avgt   25    3.531 ± 0.316  ns/op
EqualsBench.cmp2_cmp2    4096  avgt   25  151.784 ± 1.555  ns/op

=== Why are some (size = 1) cases slower?

It looks like String Density executes 10+ excess instructions.
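At the Java level, the density version of equals() presumably has roughly the following shape: the coders are compared before anything else. This is a simplified, standalone model of that comparison order; field and helper names are illustrative, and this is not the actual prototype code:

    // Model of the compact-strings comparison order; names are illustrative.
    final class CompactString {
        static final byte LATIN1 = 0, UTF16 = 1;

        final byte[] value;   // 1 byte/char (Latin1) or 2 bytes/char (UTF16) payload
        final byte coder;     // LATIN1 or UTF16

        CompactString(byte[] value, byte coder) {
            this.value = value;
            this.coder = coder;
        }

        boolean contentEquals(CompactString other) {
            if (this == other) {
                return true;                           // reference shortcut
            }
            if (other == null) {
                return false;                          // null check before touching other.coder
            }
            if (coder != other.coder) {                // extra load + compare vs. the baseline...
                return false;                          // ...but a cheap shortcut for mixed pairs
            }
            return java.util.Arrays.equals(value, other.value);  // length check, then contents
        }
    }

The extra coder loads and the coder comparison are what show up as the additional instructions in the per-op counters below.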
Baseline:

Benchmark                                       (size)  Mode  Cnt   Score   Error  Units
EqualsBench.cmp1_cmp1                                1  avgt   25   4.728 ± 0.059  ns/op
EqualsBench.cmp1_cmp1:·CPI                           1  avgt    5   0.283 ± 0.015   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-load-misses         1  avgt    5   0.001 ± 0.002   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-loads               1  avgt    5  21.829 ± 0.306   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-store-misses        1  avgt    5  ≈ 10⁻⁴           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-stores              1  avgt    5   6.247 ± 0.075   #/op
EqualsBench.cmp1_cmp1:·branch-misses                 1  avgt    5  ≈ 10⁻⁴           #/op
EqualsBench.cmp1_cmp1:·branches                      1  avgt    5  15.572 ± 0.376   #/op
EqualsBench.cmp1_cmp1:·cycles                        1  avgt    5  19.691 ± 0.659   #/op
EqualsBench.cmp1_cmp1:·instructions                  1  avgt    5  69.631 ± 2.037   #/op

String Density:

Benchmark                                       (size)  Mode  Cnt   Score   Error  Units
EqualsBench.cmp1_cmp1                                1  avgt   25   5.277 ± 0.257  ns/op
EqualsBench.cmp1_cmp1:·CPI                           1  avgt    5   0.273 ± 0.059   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-load-misses         1  avgt    5   0.001 ± 0.002   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-loads               1  avgt    5  24.041 ± 0.644   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-store-misses        1  avgt    5  ≈ 10⁻³           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-stores              1  avgt    5   6.273 ± 0.154   #/op
EqualsBench.cmp1_cmp1:·branch-misses                 1  avgt    5  ≈ 10⁻³           #/op
EqualsBench.cmp1_cmp1:·branches                      1  avgt    5  18.841 ± 0.338   #/op
EqualsBench.cmp1_cmp1:·cycles                        1  avgt    5  22.032 ± 5.010   #/op
EqualsBench.cmp1_cmp1:·instructions                  1  avgt    5  80.580 ± 1.465   #/op

The difference in instruction count is explained by the coder selection and by code generation quirks. The hot parts of the generated code for the baseline and patched versions are shown below; the differing parts are annotated in the listings.

JDK 9 Baseline:

                    [Verified Entry Point]
  0.05%    0.08%    0x00007f03ccaea980: mov    %eax,-0x14000(%rsp)
  5.08%    3.80%    0x00007f03ccaea987: push   %rbp
           0.01%    0x00007f03ccaea988: sub    $0x10,%rsp
  4.83%    3.61%    0x00007f03ccaea98c: mov    0x14(%rsi),%r10d        ; get $cmp1_1
  0.23%    0.20%    0x00007f03ccaea990: mov    0x10(%rsi),%r11d        ; get $cmp1_2
                    0x00007f03ccaea994: mov    0xc(%r12,%r11,8),%r9d   ; get $cmp1_2.value
  0.26%    0.16%    0x00007f03ccaea999: cmp    %r10d,%r11d             ; check ($cmp1_1 == $cmp1_2)
                    0x00007f03ccaea99c: je     0x00007f03ccaeaa7a      ; bail to "equal!"
  4.59%    2.82%    0x00007f03ccaea9a2: mov    0xc(%r12,%r10,8),%r8d   ; get $cmp1_1.value
  0.26%    0.27%    0x00007f03ccaea9a7: mov    0xc(%r12,%r8,8),%r10d   ; get $cmp1_1.value.arraylen
  4.60%    4.77%    0x00007f03ccaea9ac: mov    0xc(%r12,%r9,8),%ecx    ; get $cmp1_2.value.arraylen
  1.30%    1.70%    0x00007f03ccaea9b1: cmp    %r10d,%ecx              ; check lengths
                    0x00007f03ccaea9b4: jne    0x00007f03ccaeaa81      ; bail to "not equal!"
  8.61%    8.88%    0x00007f03ccaea9ba: lea    (%r12,%r8,8),%r10       ; <--- JDK-8020282
                    0x00007f03ccaea9be: lea    0x10(%r12,%r8,8),%rsi   ; load the start address of $cmp1_1.value
                    0x00007f03ccaea9c3: lea    (%r12,%r9,8),%r10       ; <--- JDK-8020282
                    0x00007f03ccaea9c7: lea    0x10(%r12,%r9,8),%rdi   ; load the start address of $cmp1_2.value
                    ...
JDK 9 String Density:

                    [Verified Entry Point]
  0.10%    0.05%    0x00007ff178aec900: mov    %eax,-0x14000(%rsp)
  4.42%    5.31%    0x00007ff178aec907: push   %rbp
                    0x00007ff178aec908: sub    $0x20,%rsp
  4.21%    4.34%    0x00007ff178aec90c: mov    0x14(%rsi),%r11d          ; get $cmp1_2
  0.66%    0.33%    0x00007ff178aec910: mov    0x10(%rsi),%r10d          ; get $cmp1_1
                    0x00007ff178aec914: movsbl 0x14(%r12,%r10,8),%r9d    ; get $cmp1_1.coder
  0.08%    0.14%    0x00007ff178aec91a: cmp    %r11d,%r10d               ; check ($cmp1_1 == $cmp1_2)
                    0x00007ff178aec91d: jne    0x00007ff178aec929        ; jump over if not equal
                    0x00007ff178aec91f: mov    $0x1,%eax
                    0x00007ff178aec924: jmpq   0x00007ff178aeca31
  3.82%    3.52%    0x00007ff178aec929: xor    %eax,%eax
  0.45%    0.58%    0x00007ff178aec92b: test   %r11d,%r11d               ; check ($cmp1_2 == null)
                    0x00007ff178aec92e: je     0x00007ff178aeca31        ; bail to "not equal!" if null
                    0x00007ff178aec934: movsbl 0x14(%r12,%r11,8),%ebp    ; get $cmp1_2.coder
  0.19%    0.13%    0x00007ff178aec93a: cmp    %ebp,%r9d                 ; check ($cmp1_1.coder == $cmp1_2.coder)
                    0x00007ff178aec93d: jne    0x00007ff178aeca3d        ; jump to slow path if not equal
  4.61%    1.44%    0x00007ff178aec943: test   %r9d,%r9d                 ; check ($cmp1_1.coder == 0)?
                    0x00007ff178aec946: jne    0x00007ff178aeca5d        ; jump to slow path if not
  0.42%    0.50%    0x00007ff178aec94c: mov    0xc(%r12,%r11,8),%r8d     ; get $cmp1_2.value
                    0x00007ff178aec951: mov    0xc(%r12,%r10,8),%r10d    ; get $cmp1_1.value
  0.09%    0.14%    0x00007ff178aec956: mov    0xc(%r12,%r8,8),%r9d      ; get $cmp1_2.value.arraylength
  5.85%    7.35%    0x00007ff178aec95b: mov    0xc(%r12,%r10,8),%ecx     ; get $cmp1_1.value.arraylength
  0.71%    0.81%    0x00007ff178aec960: cmp    %r9d,%ecx                 ; check lengths
                    0x00007ff178aec963: jne    0x00007ff178aeca31        ; bail to "not equal!"
  6.28%    7.93%    0x00007ff178aec969: mov    %r8,%r11                  ; <--- (similar to) JDK-8020282
           0.01%    0x00007ff178aec96c: shl    $0x3,%r11
  1.07%    1.40%    0x00007ff178aec970: lea    0x10(%r12,%r8,8),%rsi     ; load the start address of $cmp1_2.value
  0.06%    0.06%    0x00007ff178aec975: mov    %r10,%r11                 ; <--- (similar to) JDK-8020282
  3.28%    3.03%    0x00007ff178aec978: shl    $0x3,%r11
  0.03%    0.08%    0x00007ff178aec97c: lea    0x10(%r12,%r10,8),%rdi    ; load the start address of $cmp1_1.value
                    ...

RECOMMENDATION: Fix https://bugs.openjdk.java.net/browse/JDK-8020282 to alleviate some of these costs.

== Why are the cmp1_cmp2 and cmp2_cmp1 cases faster?

This happens because String Density shortcuts the equality comparison by comparing the coders first; see the sample generated code above. This allows skipping all other comparisons and loads altogether.

== Why is the cmp1_cmp1 case significantly faster with size = 4096?

Both the baseline and String Density versions have the same hottest block:

  9.00%   11.78%    0x00007fa414aea9f4: vmovdqu (%rdi,%rcx,1),%ymm1
 17.26%   14.77%    0x00007fa414aea9f9: vmovdqu (%rsi,%rcx,1),%ymm0
 16.73%   16.17%    0x00007fa414aea9fe: vpxor  %ymm0,%ymm1,%ymm1
 12.60%   12.45%    0x00007fa414aeaa02: vptest %ymm1,%ymm1
 22.93%   24.44%    0x00007fa414aeaa07: jne    0x00007fa414aeaa64
 12.17%   14.05%    0x00007fa414aeaa09: add    $0x20,%rcx
                    0x00007fa414aeaa0d: jne    0x00007fa414aea9f4

...but the String Density version does less work because of the compressed storage.
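To put a rough number on that, here is a back-of-the-envelope estimate (not a measurement), assuming the 32-byte ymm stride visible in the loop above:

    // Rough estimate of the main-loop trip counts for size = 4096; illustrative only.
    class CompareWorkEstimate {
        public static void main(String[] args) {
            int chars = 4096;
            int strideBytes = 32;                                // one ymm register per vmovdqu
            int baselineIterations = (chars * 2) / strideBytes;  // 8192 bytes of char[] payload -> 256
            int densityIterations  = (chars * 1) / strideBytes;  // 4096 bytes of byte[] payload -> 128
            System.out.println(baselineIterations + " vs. " + densityIterations);
        }
    }

With two vector loads per iteration, this predicts roughly half the loads and half the instructions for the density version, which is what the per-op counters below show.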
JDK 9 Baseline:

Benchmark                                       (size)  Mode  Cnt     Score   Error  Units
EqualsBench.cmp1_cmp1                             4096  avgt    5   151.819 ± 3.888  ns/op
EqualsBench.cmp1_cmp1:·CPI                        4096  avgt          0.326           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-load-misses      4096  avgt          0.059           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-loads            4096  avgt        552.395           #/op
EqualsBench.cmp1_cmp1:·cycles                     4096  avgt        630.489           #/op
EqualsBench.cmp1_cmp1:·instructions               4096  avgt       1936.352           #/op

JDK 9 String Density:

Benchmark                                       (size)  Mode  Cnt     Score   Error  Units
EqualsBench.cmp1_cmp1                             4096  avgt    5    98.257 ± 0.157  ns/op
EqualsBench.cmp1_cmp1:·CPI                        4096  avgt          0.405           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-load-misses      4096  avgt          0.009           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-loads            4096  avgt        290.094           #/op
EqualsBench.cmp1_cmp1:·cycles                     4096  avgt        408.767           #/op
EqualsBench.cmp1_cmp1:·instructions               4096  avgt       1008.241           #/op

This is also why cmp2_cmp2 is not slower: the generated code is the same, it just does not get the compressed-storage improvement that cmp1_cmp1 enjoys.

== Different Length

These benchmarks assess the performance when we compare Strings of different lengths.

JDK 9 Baseline:

Benchmark                     (size)  Mode  Cnt  Score   Error  Units
EqualsDiffLenBench.cmp1_cmp1       1  avgt   25  3.412 ± 0.011  ns/op
EqualsDiffLenBench.cmp1_cmp2       1  avgt   25  3.408 ± 0.001  ns/op
EqualsDiffLenBench.cmp2_cmp1       1  avgt   25  3.413 ± 0.004  ns/op
EqualsDiffLenBench.cmp2_cmp2       1  avgt   25  3.410 ± 0.002  ns/op

JDK 9 String Density:

Benchmark                     (size)  Mode  Cnt  Score   Error  Units
EqualsDiffLenBench.cmp1_cmp1       1  avgt   25  3.762 ± 0.002  ns/op
EqualsDiffLenBench.cmp1_cmp2       1  avgt   25  3.325 ± 0.001  ns/op
EqualsDiffLenBench.cmp2_cmp1       1  avgt   25  3.325 ± 0.004  ns/op
EqualsDiffLenBench.cmp2_cmp2       1  avgt   25  4.154 ± 0.002  ns/op

== Why is the performance in the cmp1_cmp1 and cmp2_cmp2 cases worse?

If you look at the generated code in the section above, you will see that we first compare the coders, and only then compare the lengths if the coders match. The "excess" coder comparison in the cmpX_cmpX cases adds a slight overhead before we can shortcut on length. Any improvement in the coder selection scheme will save us these few cycles.

= Conclusion

equals() performance clearly benefits from compressed storage. The most frequent case, comparing two compressed Strings, benefits tremendously. Coder selection still carries some overhead that needs to be absorbed.
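For reference, the EqualsBench harness used above has roughly this shape. This is an illustrative JMH sketch only; the actual benchmark code lives at http://cr.openjdk.java.net/~shade/density/, and the names and setup details here are assumptions:

    import java.util.Arrays;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class EqualsBench {
        @Param({"1", "64", "4096"})
        int size;

        String latin1a, latin1b;   // compressible (Latin1-only) strings
        String utf16a, utf16b;     // strings forced into the UTF16 coder

        @Setup
        public void setup() {
            char[] cs = new char[size];
            Arrays.fill(cs, 'a');
            latin1a = new String(cs);
            latin1b = new String(cs);
            cs[size - 1] = '\u4e2d';      // one non-Latin1 char keeps the string in UTF16 form
            utf16a = new String(cs);
            utf16b = new String(cs);
        }

        @Benchmark public boolean cmp1_cmp1() { return latin1a.equals(latin1b); }
        @Benchmark public boolean cmp1_cmp2() { return latin1a.equals(utf16a); }
        @Benchmark public boolean cmp2_cmp1() { return utf16a.equals(latin1a); }
        @Benchmark public boolean cmp2_cmp2() { return utf16a.equals(utf16b); }
    }

EqualsDiffLenBench would then differ only in giving the two operands different lengths, so that equals() can bail out before comparing contents.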