= String::charAt performance with String compression enabled

https://bugs.openjdk.java.net/browse/JDK-8054307

An early prototype for C2 intrinsics is available now, and we can see what problems we should expect going forward. Search for the word "RECOMMENDATION" to see the concrete suggestions on what to fix in the current prototype.

The tests are done on the JDK 9 Sandbox repo, 1x4x2 i7-4790K (Haswell) 4.0 GHz, Linux x86_64. Benchmark code is available at:
  http://cr.openjdk.java.net/~shade/density/

== Individual Benchmarks

The individual benchmarks select every character in some random order. "cmp1" is the version with Latin1 chars only. "cmp2" is the version with UTF16 chars.

JDK 9 baseline:

  Benchmark              (size)  Mode  Cnt      Score     Error  Units
  CharAtBench.test_cmp1       1  avgt   25      3.958 ±   0.110  ns/op
  CharAtBench.test_cmp1      64  avgt   25    160.960 ±   4.578  ns/op
  CharAtBench.test_cmp1    4096  avgt   25   9619.311 ±  17.344  ns/op
  CharAtBench.test_cmp2       1  avgt   25      3.947 ±   0.058  ns/op
  CharAtBench.test_cmp2      64  avgt   25    160.494 ±   1.282  ns/op
  CharAtBench.test_cmp2    4096  avgt   25   9655.649 ±  82.634  ns/op

JDK 9 (String Density):

  Benchmark              (size)  Mode  Cnt      Score     Error  Units
  CharAtBench.test_cmp1       1  avgt   25      4.303 ±   0.019  ns/op
  CharAtBench.test_cmp1      64  avgt   25    169.334 ±   0.885  ns/op
  CharAtBench.test_cmp1    4096  avgt   25  10362.591 ±  56.021  ns/op
  CharAtBench.test_cmp2       1  avgt   25      4.471 ±   0.353  ns/op
  CharAtBench.test_cmp2      64  avgt   25    173.377 ±   0.783  ns/op
  CharAtBench.test_cmp2    4096  avgt   25  10623.119 ±  40.157  ns/op

There is a visible performance hit for both cases, in all modes. We can separate out a few cases though:

=== Q1. cmp1 with size=1 and size=4096

baseline:

  # parm0: rsi:rsi = 'java/lang/String'
  # parm1: rdx = int
   3.72%   4.54%  0x00007f99f4ae15e0: mov %eax,-0x14000(%rsp)
   2.35%   2.39%  0x00007f99f4ae15e7: push %rbp
   2.15%   1.75%  0x00007f99f4ae15e8: sub $0x20,%rsp
   1.74%   1.88%  0x00007f99f4ae15ec: mov 0xc(%rsi),%r11d  ; get field $value
   2.60%   1.76%  0x00007f99f4ae15f0: test %edx,%edx
                  0x00007f99f4ae15f2: jl 0x00007f99f4ae162d  ; range check 1
   2.10%   1.70%  0x00007f99f4ae15f4: mov 0xc(%r12,%r11,8),%ebp  ; get arraylength
   7.47%   6.72%  0x00007f99f4ae15f9: cmp %ebp,%edx  ; range check 2
                  0x00007f99f4ae15fb: jge 0x00007f99f4ae1645
  10.13%   6.22%  0x00007f99f4ae15fd: cmp %ebp,%edx  ; JDK-8074383
                  0x00007f99f4ae15ff: jae 0x00007f99f4ae1617
   1.91%   1.24%  0x00007f99f4ae1601: lea (%r12,%r11,8),%r10  ; unpack $value base
   0.17%   0.05%  0x00007f99f4ae1605: movzwl 0x10(%r10,%rdx,2),%eax  ; select $value[$idx]
   0.22%   0.12%  0x00007f99f4ae160b: add $0x20,%rsp  ; epilog and return
   3.52%   2.19%  0x00007f99f4ae160f: pop %rbp
   2.28%   1.36%  0x00007f99f4ae1610: test %eax,0x111529ea(%rip)
   1.17%   0.58%  0x00007f99f4ae1616: retq

patched:

  # parm0: rsi:rsi = 'java/lang/String'
  # parm1: rdx = int
   4.13%   4.02%  0x00007f91f3c49ce0: mov %eax,-0x14000(%rsp)
   1.33%   1.34%  0x00007f91f3c49ce7: push %rbp
   0.03%   0.02%  0x00007f91f3c49ce8: sub $0x20,%rsp
   4.14%   3.69%  0x00007f91f3c49cec: movsbl 0x14(%rsi),%r11d  ; get field $coder
   1.56%   1.17%  0x00007f91f3c49cf1: test %r11d,%r11d  ; if $coder == 0
                  0x00007f91f3c49cf4: jne 0x00007f91f3c49d39
   0.17%   0.21%  0x00007f91f3c49cf6: test %edx,%edx  ; range check 1
                  0x00007f91f3c49cf8: jl 0x00007f91f3c49d61
   3.64%   3.71%  0x00007f91f3c49cfa: mov 0xc(%rsi),%r11d  ; get field $value
   0.02%          0x00007f91f3c49cfe: mov 0xc(%r12,%r11,8),%ebp  ; get $value.arraylength
  11.01%   6.55%  0x00007f91f3c49d03: cmp %ebp,%edx  ; range check 2
                  0x00007f91f3c49d05: jge 0x00007f91f3c49d79
  10.66%   4.07%  0x00007f91f3c49d07: cmp %ebp,%edx  ; JDK-8074383
                  0x00007f91f3c49d09: jae 0x00007f91f3c49d24
   1.61%   1.60%  0x00007f91f3c49d0b: lea (%r12,%r11,8),%r10  ; unpack $value base
                  0x00007f91f3c49d0f: movslq %edx,%r11
   0.07%          0x00007f91f3c49d12: movzbl 0x10(%r10,%r11,1),%eax  ; select $value[$idx]
   4.03%   4.48%  0x00007f91f3c49d18: add $0x20,%rsp  ; epilog and return
   1.40%   1.52%  0x00007f91f3c49d1c: pop %rbp
                  0x00007f91f3c49d1d: test %eax,0xe5392dd(%rip)
   3.21%   2.70%  0x00007f91f3c49d23: retq

This is not new: coder selection has some overhead, which is responsible for the roughly 0.3 ns lost in the patched case. The charAt method is isolated from other invocations by a non-inlineable trampoline, and this is why we don't stand a chance to common the coder selection for a given string. The case of size=4096 experiences the same hit for the same reason.

RECOMMENDATION: It might be possible to absorb some of the losses with better code generation, once JDK-8074383 is fixed.

=== Q2. cmp2 with size=1

baseline:

  [Verified Entry Point]
   5.63%   4.95%  0x00007f03ac6b9a60: mov %eax,-0x14000(%rsp)
   4.72%   4.05%  0x00007f03ac6b9a67: push %rbp
   0.02%   0.02%  0x00007f03ac6b9a68: sub $0x20,%rsp
   8.17%   6.12%  0x00007f03ac6b9a6c: mov 0xc(%rsi),%r11d  ; get field $value
   2.02%   0.95%  0x00007f03ac6b9a70: test %edx,%edx
                  0x00007f03ac6b9a72: jl 0x00007f03ac6b9aad  ; range check 1, ($idx >= 0)
                  0x00007f03ac6b9a74: mov 0xc(%r12,%r11,8),%ebp  ; get $value.arraylength
  10.56%   9.36%  0x00007f03ac6b9a79: cmp %ebp,%edx
                  0x00007f03ac6b9a7b: jge 0x00007f03ac6b9ac5  ; range check 2, ($idx < arraylength)
  17.52%  17.78%  0x00007f03ac6b9a7d: cmp %ebp,%edx  ; redundant <--- JDK-8074383
                  0x00007f03ac6b9a7f: jae 0x00007f03ac6b9a97
   3.13%   3.48%  0x00007f03ac6b9a81: lea (%r12,%r11,8),%r10  ; unpack $value array reference
                  0x00007f03ac6b9a85: movzwl 0x10(%r10,%rdx,2),%eax  ; access $value[$idx]
   0.95%   1.14%  0x00007f03ac6b9a8b: add $0x20,%rsp  ; epilog and return
   6.36%   8.16%  0x00007f03ac6b9a8f: pop %rbp
   2.92%   3.31%  0x00007f03ac6b9a90: test %eax,0xe53b56a(%rip)
   1.97%   2.21%  0x00007f03ac6b9a96: retq

The patched version comes from the StringUTF16.getChar intrinsic:

  [Verified Entry Point]
   0.68%   0.29%  0x00007fa5884ab8e0: mov %eax,-0x14000(%rsp)
   8.50%   5.24%  0x00007fa5884ab8e7: push %rbp
                  0x00007fa5884ab8e8: sub $0x20,%rsp
   0.39%   0.20%  0x00007fa5884ab8ec: movsbl 0x14(%rsi),%r10d  ; get field $coder
   8.31%   5.58%  0x00007fa5884ab8f1: mov 0xc(%rsi),%r11d  ; get field $value
                  0x00007fa5884ab8f5: test %r10d,%r10d
                  0x00007fa5884ab8f8: je 0x00007fa5884ab927  ; if ($coder == 0), jump out
   0.42%   0.37%  0x00007fa5884ab8fa: test %edx,%edx
                  0x00007fa5884ab8fc: jl 0x00007fa5884ab947  ; range check 1, ($idx >= 0)
                  0x00007fa5884ab8fe: mov 0xc(%r12,%r11,8),%ebp  ; get field $value.arraylength
  12.20%  14.01%  0x00007fa5884ab903: sar %ebp  ; arraylength /= 2
   6.65%   8.37%  0x00007fa5884ab905: cmp %ebp,%edx
                  0x00007fa5884ab907: jge 0x00007fa5884ab95d  ; range check 2, ($idx < arraylength/2)
   9.84%   9.98%  0x00007fa5884ab909: mov %r11,%r10  ; unpack $value array reference
                  0x00007fa5884ab90c: shl $0x3,%r10
   0.41%   0.48%  0x00007fa5884ab910: shl %edx  ; $idx *= 2
                  0x00007fa5884ab912: movslq %edx,%r11  ; <--- pesky sign extension, similar to JDK-8074124?
   7.75%  10.85%  0x00007fa5884ab915: movzwl 0x10(%r10,%r11,1),%eax  ; access $value[$idx]
   0.61%   0.68%  0x00007fa5884ab91b: add $0x20,%rsp  ; epilog and return
   0.34%   0.29%  0x00007fa5884ab91f: pop %rbp
                  0x00007fa5884ab920: test %eax,0xe4846da(%rip)
   8.38%   9.45%  0x00007fa5884ab926: retq

As in the case above, we pay for coder selection. We *also* pay for the additional dance around the $idx and arraylength scaling.

RECOMMENDATION: It might be worthwhile to multiply the index right away, and then compare it routinely with arraylength. This will save a few cycles on the arraylength "sar", and will probably play nicer with JDK-8042997.

RECOMMENDATION: We can probably use the fact that we compared $idx with arraylength, so it is known to have a zero upper word, and remove the "movslq" sign extension?
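The two shapes of the UTF16 range check discussed above can be pictured at the Java level. This is a minimal sketch under stated assumptions, not the JDK code: the class CompactString, its fields and method names, and the little-endian byte order are all hypothetical and chosen only for illustration. getCharHalved mirrors what the prototype's generated code appears to do (halve the array length, check, then scale the index), while getCharScaled scales the index first and compares it with the raw array length, as the first recommendation suggests. Whether C2 actually emits better code for the second shape is exactly the open question.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical model of a compressed string; NOT the JDK implementation. */
final class CompactString {
    static final byte LATIN1 = 0; // one byte per char
    static final byte UTF16 = 1;  // two bytes per char

    private final byte[] value;
    private final byte coder;

    CompactString(String s) {
        boolean latin1 = s.chars().allMatch(c -> c < 0x100);
        coder = latin1 ? LATIN1 : UTF16;
        // Little-endian byte order is an assumption made for this sketch.
        value = s.getBytes(latin1 ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_16LE);
    }

    /** Coder selection first, then the per-coder access path. */
    char charAt(int index) {
        if (coder == LATIN1) {
            if (index < 0 || index >= value.length) {
                throw new StringIndexOutOfBoundsException(index);
            }
            return (char) (value[index] & 0xFF);
        }
        return getCharScaled(index);
    }

    /** UTF16 only. Prototype-like shape: halve the length, check, then scale the index. */
    char getCharHalved(int index) {
        if (index < 0 || index >= (value.length >> 1)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return readChar(index << 1);
    }

    /** UTF16 only. Suggested shape: scale the index once, compare with the raw length. */
    char getCharScaled(int index) {
        int offset = index << 1;
        // offset < 0 also catches overflow of the shift for huge positive indices.
        if (index < 0 || offset < 0 || offset >= value.length) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return readChar(offset);
    }

    private char readChar(int offset) {
        return (char) ((value[offset] & 0xFF) | ((value[offset + 1] & 0xFF) << 8));
    }
}
```

Both UTF16 accessors return the same character for every valid index; they differ only in where the multiplication by two happens relative to the range check.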
== Streaming Benchmarks

baseline:

  Benchmark                    (size)  Mode  Cnt     Score     Error  Units
  CharAtStreamBench.test_cmp1       1  avgt   25     3.691 ±   0.044  ns/op
  CharAtStreamBench.test_cmp1      64  avgt   25    18.199 ±   0.174  ns/op
  CharAtStreamBench.test_cmp1    4096  avgt   25  1038.682 ±   0.735  ns/op
  CharAtStreamBench.test_cmp2       1  avgt   25     4.257 ±   0.015  ns/op
  CharAtStreamBench.test_cmp2      64  avgt   25    18.553 ±   0.277  ns/op
  CharAtStreamBench.test_cmp2    4096  avgt   25  1040.079 ±   2.727  ns/op

patched:

  Benchmark                    (size)  Mode  Cnt     Score     Error  Units
  CharAtStreamBench.test_cmp1       1  avgt   25     4.049 ±   0.024  ns/op
  CharAtStreamBench.test_cmp1      64  avgt   25    20.519 ±   0.641  ns/op
  CharAtStreamBench.test_cmp1    4096  avgt   25  1042.815 ±   3.202  ns/op
  CharAtStreamBench.test_cmp2       1  avgt   25     4.082 ±   0.066  ns/op
  CharAtStreamBench.test_cmp2      64  avgt   25    21.254 ±   0.174  ns/op
  CharAtStreamBench.test_cmp2    4096  avgt   25  1051.406 ±   8.339  ns/op

Here we see roughly the same picture, although the performance hit seems to be mostly absorbed. To follow up on these differences, we want to artificially inhibit the loop unrolling that otherwise masks the performance inefficiencies, with -XX:LoopUnrollLimit=1.
This greatly amplifies the difference:

baseline:

  Benchmark                    (size)  Mode  Cnt     Score     Error  Units
  CharAtStreamBench.test_cmp1       1  avgt   25     7.526 ±   1.157  ns/op
  CharAtStreamBench.test_cmp1      64  avgt   25    34.566 ±   0.335  ns/op
  CharAtStreamBench.test_cmp1    4096  avgt   25  1368.801 ±   3.754  ns/op
  CharAtStreamBench.test_cmp2       1  avgt   25     5.442 ±   0.343  ns/op
  CharAtStreamBench.test_cmp2      64  avgt   25    31.971 ±   0.371  ns/op
  CharAtStreamBench.test_cmp2    4096  avgt   25  1360.528 ±   2.888  ns/op

patched:

  Benchmark                    (size)  Mode  Cnt     Score     Error  Units
  CharAtStreamBench.test_cmp1       1  avgt   25     4.742 ±   0.029  ns/op
  CharAtStreamBench.test_cmp1      64  avgt   25    37.306 ±   1.467  ns/op
  CharAtStreamBench.test_cmp1    4096  avgt   25  1509.189 ±  16.245  ns/op
  CharAtStreamBench.test_cmp2       1  avgt   25     4.938 ±   0.040  ns/op
  CharAtStreamBench.test_cmp2      64  avgt   25    40.530 ±   2.339  ns/op
  CharAtStreamBench.test_cmp2    4096  avgt   25  1884.497 ± 120.079  ns/op

=== Q1: test_cmp1 at size=4096

baseline:

   0.02%   0.02%  0x00007f7bf8ae1071: movzwl 0x10(%r10,%rbx,2),%r8d  ; access $value[$idx]
   6.55%   6.74%  0x00007f7bf8ae1077: add %r8d,%eax  ; accumulate
  84.12%  84.46%  0x00007f7bf8ae107a: inc %ebx  ; $idx++
   0.05%          0x00007f7bf8ae107c: cmp %ecx,%ebx  ; check $idx, and loop back
                  0x00007f7bf8ae107e: jl 0x00007f7bf8ae1071

patched:

  13.81%  12.90%  0x00007fdde0062b10: movslq %edi,%r8  ; <--- pesky sign extension, similar to JDK-8074124?
  17.46%  16.68%  0x00007fdde0062b13: movzbl 0x10(%rbx,%r8,1),%r9d  ; access $value[$idx]
  25.26%  24.60%  0x00007fdde0062b19: add %r9d,%eax  ; accumulate
  26.31%  28.41%  0x00007fdde0062b1c: inc %edi  ; $idx++
   8.97%   9.73%  0x00007fdde0062b1e: cmp %ecx,%edi  ; check $idx, and loop back
                  0x00007fdde0062b20: jl 0x00007fdde0062b10

As you can see, the only difference between the two loop bodies is the stray "movslq" sign extension, and it makes the patched version around 10% slower.

RECOMMENDATION: We can probably use the fact that we compared $idx with arraylength, so it is known to have a zero upper word, and remove the "movslq" sign extension?
This is not about the intrinsic, but about general Java code, and thus it is very similar to JDK-8074124.

=== Q2: test_cmp2 at size=64

baseline:

   0.02%          0x00007fb6d4ae18f1: movzwl 0x10(%r10,%rbx,2),%r8d  ; access $value[$idx]
   7.06%   6.40%  0x00007fb6d4ae18f7: add %r8d,%eax  ; accumulate
  82.85%  83.70%  0x00007fb6d4ae18fa: inc %ebx  ; $idx++
   0.02%          0x00007fb6d4ae18fc: cmp %ecx,%ebx  ; check $idx, and loop back
                  0x00007fb6d4ae18fe: jl 0x00007fb6d4ae18f1

patched:

  12.29%  13.70%  0x00007fde58adf740: mov %r10d,%r9d  ; $tmp = $idx * 2
  12.06%  13.22%  0x00007fde58adf743: shl %r9d
  12.78%  14.14%  0x00007fde58adf746: movslq %r9d,%r11  ; <--- pesky sign extension, similar to JDK-8074124?
  13.33%  10.53%  0x00007fde58adf749: movzwl 0x10(%rbx,%r11,1),%r9d  ; access $value[$tmp]
  13.35%  11.92%  0x00007fde58adf74f: add %r9d,%eax  ; accumulate
  14.55%  13.19%  0x00007fde58adf752: inc %r10d  ; $idx++
  12.75%  14.63%  0x00007fde58adf755: cmp %r8d,%r10d  ; check $idx, and loop back
                  0x00007fde58adf758: jl 0x00007fde58adf740

So, in addition to the sign extension, we are also dealing with a less efficient access that requires scaling the index into (index*2) on every iteration.

RECOMMENDATION: Multiply the index in the intrinsic right away, so it can get hoisted out of the loop?

== Conclusion

1. The performance of one-off .charAt() calls is experiencing a minor regression, due to coder selection. The improvements in the same code path (like JDK-8074383) can attenuate those unavoidable costs.

2. The performance of one-off .charAt() calls depends on the quality of the StringUTF16.getChar intrinsic. Notably, a few improvements are in order: computing the index*2 early to use in arraylength comparisons, and removing a stray sign extension.

3. The performance of streaming .charAt calls depends on the absence of sign extension on the hot code path.

4. The performance of streaming .charAt calls also depends on precomputing and reusing the $idx*2 properly.
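Conclusion items 3 and 4 can be pictured in plain Java. The sketch below is an illustration under assumptions, not the benchmark or the JDK code: the class StreamSum and its methods are hypothetical, and the UTF16 payload is assumed to be a little-endian byte[]. sumScaled recomputes idx*2 on every iteration, which is the shape of the patched cmp2 loop; sumStrengthReduced carries the byte offset itself as the loop variable, which is the effect one would hope the compiler reaches once the scaling is exposed early enough.

```java
/** Hypothetical illustration of the streaming loop shapes; NOT the JDK code. */
final class StreamSum {
    /** Recomputes the byte offset (idx * 2) on every iteration. */
    static int sumScaled(byte[] utf16le) {
        int sum = 0;
        int length = utf16le.length >> 1; // number of chars
        for (int idx = 0; idx < length; idx++) {
            int offset = idx << 1;
            sum += (utf16le[offset] & 0xFF) | ((utf16le[offset + 1] & 0xFF) << 8);
        }
        return sum;
    }

    /** Uses the byte offset as the loop variable, so no per-iteration scaling is needed. */
    static int sumStrengthReduced(byte[] utf16le) {
        int sum = 0;
        for (int offset = 0; offset < utf16le.length - 1; offset += 2) {
            sum += (utf16le[offset] & 0xFF) | ((utf16le[offset + 1] & 0xFF) << 8);
        }
        return sum;
    }
}
```

Both methods compute the same sum of char values; the second one simply removes the shift from the loop body by hand.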