= String::charAt performance with String compression enabled

https://bugs.openjdk.java.net/browse/JDK-8054307

An early prototype of the C2 intrinsics is now available, so we can see what problems to expect going forward.
Search for the word "RECOMMENDATION" to find the concrete suggestions on what to fix in the current prototype.

The tests were run on the JDK 9 Sandbox repo, on a 1x4x2 i7-4790K (Haswell) at 4.0 GHz, Linux x86_64.

Benchmark code is available at:
 http://cr.openjdk.java.net/~shade/density/

== Individual Benchmarks

The individual benchmarks select every character of the string in random order. "cmp1" is the version with Latin1 chars
only; "cmp2" is the version with UTF16 chars.
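For orientation, here is a minimal plain-Java sketch of that access pattern. It is not the actual CharAtBench code (see the link above for that): the class name, the helper names, the choice of Cyrillic chars for the UTF16 case and the summing of results are all assumptions made for illustration, and the JMH harness and the non-inlined call boundary are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the benchmark shape; not the actual CharAtBench code.
final class CharAtSketch {
    // "cmp1"-like input keeps every char <= 0xFF; "cmp2"-like input uses Cyrillic chars.
    static String build(int size, boolean latin1) {
        StringBuilder sb = new StringBuilder(size);
        for (int i = 0; i < size; i++) {
            sb.append((char) ((latin1 ? 'a' : 0x0430) + i % 26));
        }
        return sb.toString();
    }

    // Every index in [0, size) exactly once, in a seeded pseudo-random order.
    static int[] shuffledIndices(int size, long seed) {
        List<Integer> idx = new ArrayList<>(size);
        for (int i = 0; i < size; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        int[] order = new int[size];
        for (int i = 0; i < size; i++) order[i] = idx.get(i);
        return order;
    }

    // Selects each character once; the sum only keeps the loads observable.
    static int sumChars(String s, int[] order) {
        int sum = 0;
        for (int i : order) sum += s.charAt(i);
        return sum;
    }
}
```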

JDK 9 baseline:

Benchmark              (size)  Mode  Cnt     Score     Error  Units
CharAtBench.test_cmp1       1  avgt   25     3.958 ±   0.110  ns/op
CharAtBench.test_cmp1      64  avgt   25   160.960 ±   4.578  ns/op
CharAtBench.test_cmp1    4096  avgt   25  9619.311 ±  17.344  ns/op
CharAtBench.test_cmp2       1  avgt   25     3.947 ±   0.058  ns/op
CharAtBench.test_cmp2      64  avgt   25   160.494 ±   1.282  ns/op
CharAtBench.test_cmp2    4096  avgt   25  9655.649 ±  82.634  ns/op

JDK 9 (String Density)

Benchmark              (size)  Mode  Cnt      Score    Error  Units
CharAtBench.test_cmp1       1  avgt   25      4.303 ±  0.019  ns/op
CharAtBench.test_cmp1      64  avgt   25    169.334 ±  0.885  ns/op
CharAtBench.test_cmp1    4096  avgt   25  10362.591 ± 56.021  ns/op
CharAtBench.test_cmp2       1  avgt   25      4.471 ±  0.353  ns/op
CharAtBench.test_cmp2      64  avgt   25    173.377 ±  0.783  ns/op
CharAtBench.test_cmp2    4096  avgt   25  10623.119 ± 40.157  ns/op

There is a visible performance hit for both cases, at every size. We can separate out a few cases though:

=== Q1. cmp1 with size=1 and size=4096


baseline:

                    # parm0:    rsi:rsi   = 'java/lang/String'
                    # parm1:    rdx       = int
  3.72%    4.54%    0x00007f99f4ae15e0: mov    %eax,-0x14000(%rsp)
  2.35%    2.39%    0x00007f99f4ae15e7: push   %rbp
  2.15%    1.75%    0x00007f99f4ae15e8: sub    $0x20,%rsp 
  1.74%    1.88%    0x00007f99f4ae15ec: mov    0xc(%rsi),%r11d          ; get field $value
  2.60%    1.76%    0x00007f99f4ae15f0: test   %edx,%edx              
                    0x00007f99f4ae15f2: jl     0x00007f99f4ae162d       ; range check 1
  2.10%    1.70%    0x00007f99f4ae15f4: mov    0xc(%r12,%r11,8),%ebp    ; get arraylength
  7.47%    6.72%    0x00007f99f4ae15f9: cmp    %ebp,%edx                ; range check 2
                    0x00007f99f4ae15fb: jge    0x00007f99f4ae1645  
 10.13%    6.22%    0x00007f99f4ae15fd: cmp    %ebp,%edx                ; JDK-8074383
                    0x00007f99f4ae15ff: jae    0x00007f99f4ae1617
  1.91%    1.24%    0x00007f99f4ae1601: lea    (%r12,%r11,8),%r10       ; unpack $value base
  0.17%    0.05%    0x00007f99f4ae1605: movzwl 0x10(%r10,%rdx,2),%eax   ; select $value[$idx]
  0.22%    0.12%    0x00007f99f4ae160b: add    $0x20,%rsp               ; epilog and return
  3.52%    2.19%    0x00007f99f4ae160f: pop    %rbp
  2.28%    1.36%    0x00007f99f4ae1610: test   %eax,0x111529ea(%rip) 
  1.17%    0.58%    0x00007f99f4ae1616: retq   


patched:

                    # parm0:    rsi:rsi   = 'java/lang/String'
                    # parm1:    rdx       = int
  4.13%    4.02%    0x00007f91f3c49ce0: mov    %eax,-0x14000(%rsp)
  1.33%    1.34%    0x00007f91f3c49ce7: push   %rbp
  0.03%    0.02%    0x00007f91f3c49ce8: sub    $0x20,%rsp      
  4.14%    3.69%    0x00007f91f3c49cec: movsbl 0x14(%rsi),%r11d          ; get field $coder
  1.56%    1.17%    0x00007f91f3c49cf1: test   %r11d,%r11d               ; if $coder == 0
                    0x00007f91f3c49cf4: jne    0x00007f91f3c49d39  
  0.17%    0.21%    0x00007f91f3c49cf6: test   %edx,%edx                 ; range check 1
                    0x00007f91f3c49cf8: jl     0x00007f91f3c49d61  
  3.64%    3.71%    0x00007f91f3c49cfa: mov    0xc(%rsi),%r11d           ; get field $value
           0.02%    0x00007f91f3c49cfe: mov    0xc(%r12,%r11,8),%ebp     ; get $value.arraylength
 11.01%    6.55%    0x00007f91f3c49d03: cmp    %ebp,%edx                 ; range check 2
                    0x00007f91f3c49d05: jge    0x00007f91f3c49d79  
 10.66%    4.07%    0x00007f91f3c49d07: cmp    %ebp,%edx                 ; JDK-8074383
                    0x00007f91f3c49d09: jae    0x00007f91f3c49d24  
  1.61%    1.60%    0x00007f91f3c49d0b: lea    (%r12,%r11,8),%r10        ; unpack $value base
                    0x00007f91f3c49d0f: movslq %edx,%r11
  0.07%             0x00007f91f3c49d12: movzbl 0x10(%r10,%r11,1),%eax    ; select $value[$idx]
  4.03%    4.48%    0x00007f91f3c49d18: add    $0x20,%rsp                ; epilog and return
  1.40%    1.52%    0x00007f91f3c49d1c: pop    %rbp
                    0x00007f91f3c49d1d: test   %eax,0xe5392dd(%rip) 
  3.21%    2.70%    0x00007f91f3c49d23: retq   

This is not new: coder selection has some overhead, which accounts for the roughly 0.3 ns lost in the patched case.
The charAt method is isolated from other invocations by a non-inlineable trampoline, which is why we have no chance
to common the coder selection for a given string across calls. The size=4096 case takes the same hit for the
same reason.

RECOMMENDATION: It might be possible to absorb some of the losses with better code generation, once JDK-8074383 is fixed.
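To make the coder selection concrete, here is a simplified Java model of a compressed string: a byte[] $value plus a $coder byte, with charAt branching on the coder before it does the range check and the load. This is a sketch, not the prototype's source. The class name, the constants, and the explicit big-endian byte order in the UTF16 branch are assumptions made to keep the example portable; the listings above use a single native-order 16-bit load instead.

```java
// Simplified, hypothetical model of a compressed string; not the JDK implementation.
final class CompactStringModel {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    final byte[] value;
    final byte coder;

    CompactStringModel(String s) {
        boolean latin1 = true;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) { latin1 = false; break; }
        }
        if (latin1) {
            value = new byte[s.length()];
            for (int i = 0; i < s.length(); i++) value[i] = (byte) s.charAt(i);
            coder = LATIN1;
        } else {
            value = new byte[s.length() * 2];
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                value[2 * i] = (byte) (c >> 8);   // assumed byte order: high byte first
                value[2 * i + 1] = (byte) c;
            }
            coder = UTF16;
        }
    }

    char charAt(int index) {
        if (coder == LATIN1) {                    // the extra load and branch on $coder
            if (index < 0 || index >= value.length) {
                throw new StringIndexOutOfBoundsException(index);
            }
            return (char) (value[index] & 0xFF);  // one byte per char
        } else {
            if (index < 0 || index >= (value.length >> 1)) {
                throw new StringIndexOutOfBoundsException(index);
            }
            int off = index << 1;                 // two bytes per char
            return (char) (((value[off] & 0xFF) << 8) | (value[off + 1] & 0xFF));
        }
    }
}
```

When charAt is called through a non-inlined boundary, as in the individual benchmarks, the branch on coder is paid on every call; within an inlined loop over one string it is loop-invariant and can in principle be checked once.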

=== Q2. cmp2 with size=1

baseline:

                  [Verified Entry Point]
  5.63%    4.95%    0x00007f03ac6b9a60: mov    %eax,-0x14000(%rsp)
  4.72%    4.05%    0x00007f03ac6b9a67: push   %rbp
  0.02%    0.02%    0x00007f03ac6b9a68: sub    $0x20,%rsp       
  8.17%    6.12%    0x00007f03ac6b9a6c: mov    0xc(%rsi),%r11d          ; get field $value
  2.02%    0.95%    0x00007f03ac6b9a70: test   %edx,%edx              
                    0x00007f03ac6b9a72: jl     0x00007f03ac6b9aad       ; range check 1, ($idx >= 0)
                    0x00007f03ac6b9a74: mov    0xc(%r12,%r11,8),%ebp    ; get $value.arraylength
 10.56%    9.36%    0x00007f03ac6b9a79: cmp    %ebp,%edx
                    0x00007f03ac6b9a7b: jge    0x00007f03ac6b9ac5       ; range check 2, ($idx < arraylength)
 17.52%   17.78%    0x00007f03ac6b9a7d: cmp    %ebp,%edx                ; redundant <--- JDK-8074383
                    0x00007f03ac6b9a7f: jae    0x00007f03ac6b9a97       
  3.13%    3.48%    0x00007f03ac6b9a81: lea    (%r12,%r11,8),%r10       ; unpack $value array reference
                    0x00007f03ac6b9a85: movzwl 0x10(%r10,%rdx,2),%eax   ; access $value[$idx]
  0.95%    1.14%    0x00007f03ac6b9a8b: add    $0x20,%rsp               ; epilog and return
  6.36%    8.16%    0x00007f03ac6b9a8f: pop    %rbp
  2.92%    3.31%    0x00007f03ac6b9a90: test   %eax,0xe53b56a(%rip)
  1.97%    2.21%    0x00007f03ac6b9a96: retq   


patched version comes from StringUTF16.getChar intrinsic:

                  [Verified Entry Point]
  0.68%    0.29%    0x00007fa5884ab8e0: mov    %eax,-0x14000(%rsp)
  8.50%    5.24%    0x00007fa5884ab8e7: push   %rbp
                    0x00007fa5884ab8e8: sub    $0x20,%rsp         
  0.39%    0.20%    0x00007fa5884ab8ec: movsbl 0x14(%rsi),%r10d       ; get field $coder
  8.31%    5.58%    0x00007fa5884ab8f1: mov    0xc(%rsi),%r11d        ; get field $value
                    0x00007fa5884ab8f5: test   %r10d,%r10d            
                    0x00007fa5884ab8f8: je     0x00007fa5884ab927     ; if ($coder == 0), jump out
  0.42%    0.37%    0x00007fa5884ab8fa: test   %edx,%edx
                    0x00007fa5884ab8fc: jl     0x00007fa5884ab947     ; range check 1, ($idx >= 0)
                    0x00007fa5884ab8fe: mov    0xc(%r12,%r11,8),%ebp  ; get field $value.arraylength
 12.20%   14.01%    0x00007fa5884ab903: sar    %ebp                   ; arraylength /= 2
  6.65%    8.37%    0x00007fa5884ab905: cmp    %ebp,%edx              
                    0x00007fa5884ab907: jge    0x00007fa5884ab95d     ; range check 2, ($idx < arraylength/2)
  9.84%    9.98%    0x00007fa5884ab909: mov    %r11,%r10              ; unpack $value array reference
                    0x00007fa5884ab90c: shl    $0x3,%r10
  0.41%    0.48%    0x00007fa5884ab910: shl    %edx                   ; $idx *= 2
                    0x00007fa5884ab912: movslq %edx,%r11              ; <--- pesky sign extension, similar to JDK-8074124?
  7.75%   10.85%    0x00007fa5884ab915: movzwl 0x10(%r10,%r11,1),%eax ; access $value[$idx]
  0.61%    0.68%    0x00007fa5884ab91b: add    $0x20,%rsp             ; epilog and return
  0.34%    0.29%    0x00007fa5884ab91f: pop    %rbp
                    0x00007fa5884ab920: test   %eax,0xe4846da(%rip)
  8.38%    9.45%    0x00007fa5884ab926: retq   

As in the case above, we pay for coder selection. We *also* pay for the extra scaling of $idx and arraylength: the "sar" that halves arraylength and the "shl" that doubles $idx.

RECOMMENDATION: It might be worthwhile to multiply the index right away, and then compare it routinely with arraylength.
This will save a few cycles on arraylength "sar", and will probably play nicer with JDK-8042997. 
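The two range-check shapes can be compared in plain Java. This is a sketch under stated assumptions, not the intrinsic itself: the class and method names are hypothetical, the byte array length is assumed to be even (as it would be for a UTF16-coded $value), and the unsigned comparison is my addition to keep the scaled variant correct when $idx*2 wraps around for very large indices.

```java
// Hypothetical helper comparing the two range-check shapes; not the intrinsic's code.
final class Utf16BoundsCheck {
    // Shape seen in the patched listing: halve arraylength, then compare with the index.
    static boolean inBoundsHalvedLength(int idx, int byteLength) {
        return idx >= 0 && idx < (byteLength >> 1);
    }

    // Suggested shape: scale the index once and compare with the raw arraylength.
    // Assumes byteLength is even. The scaled value can double as the byte offset.
    static boolean inBoundsScaledIndex(int idx, int byteLength) {
        if (idx < 0) {
            return false;
        }
        int scaled = idx << 1; // wraps for idx >= 2^30, hence the unsigned compare below
        return Integer.compareUnsigned(scaled, byteLength) < 0;
    }
}
```

For even lengths and any int index the two checks agree, so the scaled index can feed both the comparison and the subsequent load.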

RECOMMENDATION: Since $idx has already been compared against arraylength, it is known to be non-negative and to have
a zero upper word, so we can probably remove the "movslq" sign extension?


== Streaming Benchmarks

baseline:

Benchmark                    (size)  Mode  Cnt     Score   Error  Units
CharAtStreamBench.test_cmp1       1  avgt   25     3.691 ± 0.044  ns/op
CharAtStreamBench.test_cmp1      64  avgt   25    18.199 ± 0.174  ns/op
CharAtStreamBench.test_cmp1    4096  avgt   25  1038.682 ± 0.735  ns/op
CharAtStreamBench.test_cmp2       1  avgt   25     4.257 ± 0.015  ns/op
CharAtStreamBench.test_cmp2      64  avgt   25    18.553 ± 0.277  ns/op
CharAtStreamBench.test_cmp2    4096  avgt   25  1040.079 ± 2.727  ns/op

patched:

Benchmark                    (size)  Mode  Cnt     Score   Error  Units
CharAtStreamBench.test_cmp1       1  avgt   25     4.049 ± 0.024  ns/op
CharAtStreamBench.test_cmp1      64  avgt   25    20.519 ± 0.641  ns/op
CharAtStreamBench.test_cmp1    4096  avgt   25  1042.815 ± 3.202  ns/op
CharAtStreamBench.test_cmp2       1  avgt   25     4.082 ± 0.066  ns/op
CharAtStreamBench.test_cmp2      64  avgt   25    21.254 ± 0.174  ns/op
CharAtStreamBench.test_cmp2    4096  avgt   25  1051.406 ± 8.339  ns/op

Here, we see roughly the same picture, although the performance hit seems to be absorbed. To follow up on these differences,
we artificially inhibit the loop unrolling that otherwise masks the inefficiencies, with
-XX:LoopUnrollLimit=1. This greatly amplifies the difference:

baseline:

Benchmark                    (size)  Mode  Cnt     Score   Error  Units
CharAtStreamBench.test_cmp1       1  avgt   25     7.526 ± 1.157  ns/op
CharAtStreamBench.test_cmp1      64  avgt   25    34.566 ± 0.335  ns/op
CharAtStreamBench.test_cmp1    4096  avgt   25  1368.801 ± 3.754  ns/op
CharAtStreamBench.test_cmp2       1  avgt   25     5.442 ± 0.343  ns/op
CharAtStreamBench.test_cmp2      64  avgt   25    31.971 ± 0.371  ns/op
CharAtStreamBench.test_cmp2    4096  avgt   25  1360.528 ± 2.888  ns/op

patched:

Benchmark                    (size)  Mode  Cnt     Score     Error  Units
CharAtStreamBench.test_cmp1       1  avgt   25     4.742 ±   0.029  ns/op
CharAtStreamBench.test_cmp1      64  avgt   25    37.306 ±   1.467  ns/op
CharAtStreamBench.test_cmp1    4096  avgt   25  1509.189 ±  16.245  ns/op
CharAtStreamBench.test_cmp2       1  avgt   25     4.938 ±   0.040  ns/op
CharAtStreamBench.test_cmp2      64  avgt   25    40.530 ±   2.339  ns/op
CharAtStreamBench.test_cmp2    4096  avgt   25  1884.497 ± 120.079  ns/op

=== Q1: test_cmp1 at size=4096

baseline:

  0.02%    0.02%    0x00007f7bf8ae1071: movzwl 0x10(%r10,%rbx,2),%r8d  ; access $value[$idx]
  6.55%    6.74%    0x00007f7bf8ae1077: add    %r8d,%eax               ; accumulate
 84.12%   84.46%    0x00007f7bf8ae107a: inc    %ebx                    ; $idx++
  0.05%             0x00007f7bf8ae107c: cmp    %ecx,%ebx               ; check $idx, and loop back
                    0x00007f7bf8ae107e: jl     0x00007f7bf8ae1071  

patched:

 13.81%   12.90%    0x00007fdde0062b10: movslq %edi,%r8                ; <--- pesky sign extension, similar to JDK-8074124?
 17.46%   16.68%    0x00007fdde0062b13: movzbl 0x10(%rbx,%r8,1),%r9d   ; access $value[$idx]
 25.26%   24.60%    0x00007fdde0062b19: add    %r9d,%eax               ; accumulate
 26.31%   28.41%    0x00007fdde0062b1c: inc    %edi                    ; $idx++
  8.97%    9.73%    0x00007fdde0062b1e: cmp    %ecx,%edi               ; check $idx, and loop back
                    0x00007fdde0062b20: jl     0x00007fdde0062b10  

As you can see, what makes the patched version perform around 10% slower is the stray "movslq" sign extension.

RECOMMENDATION: Since $idx has already been compared against arraylength, it is known to be non-negative and to have
a zero upper word, so we can probably remove the "movslq" sign extension? This is not about the intrinsic, but about
general Java code, and thus it is very similar to JDK-8074124.
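The arithmetic behind this recommendation can be checked in plain Java: for a non-negative int, widening with sign extension and widening with zero extension produce the same 64-bit value, so an index already proven to be in [0, arraylength) needs no separate sign-extending move. The class and method names below are hypothetical and only illustrate the identity; whether C2 can actually exploit it is the open question raised above.

```java
// Hypothetical illustration of the widening identity; not compiler code.
final class IndexExtension {
    // What "movslq" computes: int -> long with sign extension.
    static long signExtend(int idx) {
        return (long) idx;
    }

    // int -> long with zero extension, i.e. the upper 32 bits are cleared.
    static long zeroExtend(int idx) {
        return Integer.toUnsignedLong(idx);
    }

    // True when the sign extension is redundant for this index.
    static boolean extensionIsRedundant(int idx) {
        return signExtend(idx) == zeroExtend(idx);
    }
}
```

The two widenings differ only for negative indices, which the first range check has already rejected by the time the load executes.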

=== Q2: test_cmp2 at size=64

baseline:

           0.02%    0x00007fb6d4ae18f1: movzwl 0x10(%r10,%rbx,2),%r8d  ; access $value[$idx]
  7.06%    6.40%    0x00007fb6d4ae18f7: add    %r8d,%eax               ; accumulate
 82.85%   83.70%    0x00007fb6d4ae18fa: inc    %ebx                    ; $idx++
           0.02%    0x00007fb6d4ae18fc: cmp    %ecx,%ebx               ; check $idx, and loop back
                    0x00007fb6d4ae18fe: jl     0x00007fb6d4ae18f1  

patched:

 12.29%   13.70%    0x00007fde58adf740: mov    %r10d,%r9d              ; $tmp = $idx * 2
 12.06%   13.22%    0x00007fde58adf743: shl    %r9d      
 12.78%   14.14%    0x00007fde58adf746: movslq %r9d,%r11               ; <--- pesky sign extension, similar to JDK-8074124?
 13.33%   10.53%    0x00007fde58adf749: movzwl 0x10(%rbx,%r11,1),%r9d  ; access $value[$tmp]
 13.35%   11.92%    0x00007fde58adf74f: add    %r9d,%eax               ; accumulate
 14.55%   13.19%    0x00007fde58adf752: inc    %r10d                   ; $idx++
 12.75%   14.63%    0x00007fde58adf755: cmp    %r8d,%r10d              ; check $idx, and loop back
                    0x00007fde58adf758: jl     0x00007fde58adf740 

So, in addition to the sign extension, we are also dealing with a less efficient access that requires scaling the index
to (index*2) on every iteration.

RECOMMENDATION: Multiply the index in intrinsic right away, so it can get hoisted out of the loop?
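At the Java level, the effect this recommendation is after looks like replacing the per-iteration $idx*2 with an induction variable that is already a byte offset. The sketch below only illustrates the two loop shapes; it makes no claim about what C2 does with either. The names are hypothetical, and the explicit high-byte-first read is an assumption used to keep the code portable where the listings use a native 16-bit load.

```java
// Hypothetical streaming loops over a UTF16-coded byte[]; not the benchmark code.
final class Utf16Stream {
    // Shape of the patched loop: the index is scaled on every iteration.
    static int sumScalingEachIteration(byte[] value) {
        int chars = value.length >> 1;
        int sum = 0;
        for (int idx = 0; idx < chars; idx++) {
            int off = idx << 1;             // $tmp = $idx * 2, recomputed each time
            sum += readChar(value, off);
        }
        return sum;
    }

    // Shape after strength reduction: the loop variable is the byte offset itself.
    static int sumSteppingByteOffset(byte[] value) {
        int sum = 0;
        for (int off = 0; off + 1 < value.length; off += 2) {
            sum += readChar(value, off);
        }
        return sum;
    }

    // Assumed byte order: high byte first.
    static int readChar(byte[] value, int off) {
        return ((value[off] & 0xFF) << 8) | (value[off + 1] & 0xFF);
    }
}
```

Both loops visit the same chars and return the same sum; the second one simply never needs the shift inside the loop body.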

== Conclusion

1. The performance of one-off <Latin1>.charAt() calls shows a minor regression, due to coder selection.
Improvements in the same code path (like JDK-8074383) can attenuate those unavoidable costs.

2. The performance of one-off <UTF16>.charAt() calls depends on the quality of the StringUTF16.getChar intrinsic. Notably,
a few improvements are in order: computing index*2 early to use in arraylength comparisons, and removing a stray
sign extension.

3. The performance of streaming <Latin1>.charAt depends on the absence of sign extension on the hot code path.

4. The performance of streaming <UTF16>.charAt calls also depends on precomputing and reusing the $idx*2 properly.