= String.length() performance

https://bugs.openjdk.java.net/browse/JDK-8054307

The tests are done on JDK 9 Sandbox repo, 1x4x2 i7-4790K (Haswell) 4.0 GHz, Linux x86_64.

Benchmark code is available at:
 http://cr.openjdk.java.net/~shade/density/

== Background

The performance of length() is important for many workloads, especially those who do low-level operations on Strings.
The basic version of length() just polls the array length of the backing array. It String Density code, we need to do
a few other things.

== Benchmarks

The benchmarks are very simple. We call String.length() on a list of Strings, and list is pre-populated with the Strings
of either 1-byte or 2-byte characters. "bias" parameter defines the ratio of 1-byte strings to 2-byte strings.

=== JDK 9 baseline

First, JDK 9 baseline:

Benchmark         (bias)  (count)  Mode  Cnt   Score   Error  Units
LengthBench.test    0.00     4096  avgt   25  19.065 ± 0.214  us/op
LengthBench.test    0.25     4096  avgt   25  19.339 ± 0.593  us/op
LengthBench.test    0.50     4096  avgt   25  19.641 ± 0.416  us/op
LengthBench.test    0.75     4096  avgt   25  19.930 ± 0.872  us/op
LengthBench.test    1.00     4096  avgt   25  19.170 ± 0.241  us/op

Obiviously, the performance does not depend on bias.


=== String Density, current:

Benchmark         (bias)  (count)  Mode  Cnt   Score   Error  Units
LengthBench.test    0.00     4096  avgt   25  20.248 ± 0.200  us/op
LengthBench.test    0.25     4096  avgt   25  26.301 ± 0.226  us/op
LengthBench.test    0.50     4096  avgt   25  34.558 ± 0.639  us/op
LengthBench.test    0.75     4096  avgt   25  28.562 ± 0.282  us/op
LengthBench.test    1.00     4096  avgt   25  20.514 ± 0.174  us/op

Now, it gets interesting:
 a) in two extreme cases, the performance is slightly lower than baseline.
 b) with the moderate bias, the performance drops off rapidly

The first effect is explained by the coder selection overhead:

bias = 0.0:

  3.46%    3.16%    0x00007f77d0ae1760: mov    %eax,-0x14000(%rsp)
  1.15%    0.81%    0x00007f77d0ae1767: push   %rbp
  3.23%    3.02%    0x00007f77d0ae1768: sub    $0x20,%rsp            
  0.36%    0.34%    0x00007f77d0ae176c: movsbl 0x14(%rsi),%r10d       ; get String.coder
  1.86%    1.36%    0x00007f77d0ae1771: test   %r10d,%r10d            ; if (coder != 0), jump out
                    0x00007f77d0ae1774: jne    0x00007f77d0ae178b
  3.77%    2.81%    0x00007f77d0ae1776: mov    0xc(%rsi),%r10d        ; get String.value
  0.24%    0.62%    0x00007f77d0ae177a: mov    0xc(%r12,%r10,8),%eax  ; get String.value.arraylength
 49.55%   53.71%    0x00007f77d0ae177f: add    $0x20,%rsp             ; epilogue and return
  0.31%    0.65%    0x00007f77d0ae1783: pop    %rbp
  0.87%    0.54%    0x00007f77d0ae1784: test   %eax,0xf24c876(%rip)
  6.76%    6.56%    0x00007f77d0ae178a: retq   

Notice how the unlikely path with non-zero coder is pushed out.

The second effect is explained by branch prediction misses, since now we generate both branches.

bias = 0.5:

  2.33%    1.60%    0x00007fc6f8ae34e0: mov    %eax,-0x14000(%rsp)
  0.21%    0.14%    0x00007fc6f8ae34e7: push   %rbp
  2.47%    1.82%    0x00007fc6f8ae34e8: sub    $0x10,%rsp        
  0.13%    0.11%    0x00007fc6f8ae34ec: movsbl 0x14(%rsi),%r11d       ; get String.coder
  6.72%    9.99%    0x00007fc6f8ae34f1: mov    0xc(%rsi),%r10d        ; get String.value
  3.26%    2.60%    0x00007fc6f8ae34f5: test   %r11d,%r11d            ; if (coder == 0), jump out
                    0x00007fc6f8ae34f8: je     0x00007fc6f8ae3503  
  1.16%    1.78%    0x00007fc6f8ae34fa: mov    0xc(%r12,%r10,8),%eax  ; get String.value.arraylength
 28.59%   30.56%    0x00007fc6f8ae34ff: sar    %eax                   ; >> 1
  1.46%    0.75%    0x00007fc6f8ae3501: jmp    0x00007fc6f8ae3508     ; jump to return
  1.87%    1.97%    0x00007fc6f8ae3503: mov    0xc(%r12,%r10,8),%eax  ; get String.value.arraylength
 28.40%   30.18%    0x00007fc6f8ae3508: add    $0x10,%rsp             ; epilogue and return
  0.02%    0.05%    0x00007fc6f8ae350c: pop    %rbp
  0.35%    0.21%    0x00007fc6f8ae350d: test   %eax,0xeb6aaed(%rip)
  3.44%    1.79%    0x00007fc6f8ae3513: retq   

perf stats:
     2,800,535,113 branches                  #  664.156 M/sec                   [38.76%]
       206,470,880 branch-misses             #    7.37% of all branches         [38.82%]

Also notice the generated code has similar arraylength accesses on both branches. This is tracked as:
 https://bugs.openjdk.java.net/browse/JDK-8075132

Trying to optimize the length() yields a series of patches, targeting the branch prediction costs and working
around code generator quirks.

=== Density, patch 1

First patch is very simple: inline isLatin1() and StringUTF16.length:

diff -r 66bedbaa3cfe src/java.base/share/classes/java/lang/String.java
--- a/src/java.base/share/classes/java/lang/String.java	Mon Feb 23 18:23:04 2015 +0000
+++ b/src/java.base/share/classes/java/lang/String.java	Thu Mar 12 12:45:25 2015 +0300
@@ -650,10 +650,11 @@
      *          object.
      */
     public int length() {
-        if (isLatin1()) {
+        if (coder == LATIN1) {
             return value.length;
+        } else {
+            return value.length >> 1;
         }
-        return StringUTF16.length(value);
     }
 
     /**

It yields the performance very similar to Density baseline:

Benchmark         (bias)  (count)  Mode  Cnt   Score   Error  Units
LengthBench.test    0.00     4096  avgt   25  20.143 ± 0.158  us/op
LengthBench.test    0.25     4096  avgt   25  26.705 ± 0.206  us/op
LengthBench.test    0.50     4096  avgt   25  35.034 ± 1.038  us/op
LengthBench.test    0.75     4096  avgt   25  28.693 ± 0.196  us/op
LengthBench.test    1.00     4096  avgt   25  20.502 ± 0.209  us/op

...and the generated code is almost the same:

bias = 0.5:

  2.20%    1.60%    0x00007f8860ae2ee0: mov    %eax,-0x14000(%rsp)
  0.19%    0.17%    0x00007f8860ae2ee7: push   %rbp
  2.17%    1.85%    0x00007f8860ae2ee8: sub    $0x10,%rsp         
  0.07%    0.15%    0x00007f8860ae2eec: movsbl 0x14(%rsi),%r11d        ; get field $coder
  8.03%    9.51%    0x00007f8860ae2ef1: mov    0xc(%rsi),%r10d         ; get field $value
  3.36%    2.45%    0x00007f8860ae2ef5: test   %r11d,%r11d             ; test $coder
                    0x00007f8860ae2ef8: je     0x00007f8860ae2f03      ; jump over
  1.71%    1.77%    0x00007f8860ae2efa: mov    0xc(%r12,%r10,8),%eax   ; get $value.arraylength
 29.66%   31.87%    0x00007f8860ae2eff: sar    %eax                    ; >> 1
  1.19%    0.73%    0x00007f8860ae2f01: jmp    0x00007f8860ae2f08      ; jump to return
  1.49%    1.83%    0x00007f8860ae2f03: mov    0xc(%r12,%r10,8),%eax   ; get $value.arraylength
 28.86%   31.55%    0x00007f8860ae2f08: add    $0x10,%rsp
  0.02%    0.10%    0x00007f8860ae2f0c: pop    %rbp
  0.17%    0.17%    0x00007f8860ae2f0d: test   %eax,0xfa070ed(%rip)
  3.71%    2.28%    0x00007f8860ae2f13: retq   


=== Density, patch 2

Second patch tries to dodge double arraylength in the generated code by pre-reading it in local variable:

diff -r 66bedbaa3cfe src/java.base/share/classes/java/lang/String.java
--- a/src/java.base/share/classes/java/lang/String.java	Mon Feb 23 18:23:04 2015 +0000
+++ b/src/java.base/share/classes/java/lang/String.java	Fri Mar 13 12:49:10 2015 +0300
@@ -650,10 +650,12 @@
      *          object.
      */
     public int length() {
-        if (isLatin1()) {
-            return value.length;
+        int len = value.length;
+        if (coder == LATIN1) {
+            return len;
+        } else {
+            return len >> 1;
         }
-        return StringUTF16.length(value);
     }
 
     /**

The performance scores are improved relative to Density baseline:

Benchmark         (bias)  (count)  Mode  Cnt   Score   Error  Units
LengthBench.test    0.00     4096  avgt   25  20.051 ± 0.202  us/op
LengthBench.test    0.25     4096  avgt   25  23.712 ± 3.189  us/op
LengthBench.test    0.50     4096  avgt   25  27.089 ± 5.029  us/op
LengthBench.test    0.75     4096  avgt   25  23.991 ± 3.176  us/op
LengthBench.test    1.00     4096  avgt   25  21.011 ± 0.137  us/op

The results are also experiencing large run-to-run variance because now we also get the CMOV selectors instead
of branches in middle biases. However, that optimization relies on branch profile, which is hard to reproduce exactly
in String code. JDK-8075132 shows the effect for CMOV more cleanly.

bias = 0.0:

  3.37%    3.19%    0x00007fdd58ae26e0: mov    %eax,-0x14000(%rsp)
  1.22%    0.99%    0x00007fdd58ae26e7: push   %rbp
  3.59%    3.05%    0x00007fdd58ae26e8: sub    $0x20,%rsp  
  0.31%    0.39%    0x00007fdd58ae26ec: mov    0xc(%rsi),%r11d        ; get field $value
  2.23%    2.03%    0x00007fdd58ae26f0: mov    0xc(%r12,%r11,8),%eax  ; get field $value.arraylength
 53.92%   59.12%    0x00007fdd58ae26f5: movsbl 0x14(%rsi),%ebp        ; get field $coder
  0.14%    0.05%    0x00007fdd58ae26f9: test   %ebp,%ebp              ; test $coder and jump out
                    0x00007fdd58ae26fb: jne    0x00007fdd58ae2709  
  0.10%    0.12%    0x00007fdd58ae26fd: add    $0x20,%rsp             ; return
  0.36%    1.11%    0x00007fdd58ae2701: pop    %rbp
  4.67%    3.24%    0x00007fdd58ae2702: test   %eax,0x10e508f8(%rip)
  3.89%    3.08%    0x00007fdd58ae2708: retq   


bias = 0.5:

  2.97%    2.57%    0x00007f0a8c4afe60: mov    %eax,-0x14000(%rsp)
  0.91%    1.45%    0x00007f0a8c4afe67: push   %rbp
  3.44%    2.08%    0x00007f0a8c4afe68: sub    $0x10,%rsp         
  0.20%    1.00%    0x00007f0a8c4afe6c: mov    0xc(%rsi),%r11d        ; get field $value
  2.26%    2.01%    0x00007f0a8c4afe70: mov    0xc(%r12,%r11,8),%eax  ; get field $value.arraylength
 52.56%   55.98%    0x00007f0a8c4afe75: movsbl 0x14(%rsi),%r11d       ; get field $coder
  0.07%    0.03%    0x00007f0a8c4afe7a: mov    %eax,%r10d             ; prepare (arraylength >> 1)
  0.05%    0.08%    0x00007f0a8c4afe7d: sar    %r10d
  3.27%    3.47%    0x00007f0a8c4afe80: test   %r11d,%r11d            ; select either value
  0.84%    0.22%    0x00007f0a8c4afe83: cmovne %r10d,%eax
  6.78%    6.51%    0x00007f0a8c4afe87: add    $0x10,%rsp             ; return
  0.15%    0.27%    0x00007f0a8c4afe8b: pop    %rbp
  0.74%    0.25%    0x00007f0a8c4afe8c: test   %eax,0xe55a16e(%rip)
  3.43%    3.16%    0x00007f0a8c4afe92: retq   

When this optimization is working properly, the performance improvement is substantial.

=== patch 3

Third patch tries to eliminate branches entirely, using the fact (LATIN1 == 0) and (UTF16 == 1). This allows to 
do the unconditional bitwise shift:

diff -r 66bedbaa3cfe src/java.base/share/classes/java/lang/String.java
--- a/src/java.base/share/classes/java/lang/String.java	Mon Feb 23 18:23:04 2015 +0000
+++ b/src/java.base/share/classes/java/lang/String.java	Thu Mar 12 12:48:07 2015 +0300
@@ -650,10 +650,7 @@
      *          object.
      */
     public int length() {
-        if (isLatin1()) {
-            return value.length;
-        }
-        return StringUTF16.length(value);
+        return value.length >> coder;
     }
 
     /**

The performance with this patch is winning over Density baseline even in pathological (bias=0.5) mode, and comes close
to JDK 9 baseline.

Benchmark         (bias)  (count)  Mode  Cnt   Score   Error  Units
LengthBench.test    0.00     4096  avgt   25  20.853 ± 0.263  us/op
LengthBench.test    0.25     4096  avgt   25  21.042 ± 0.178  us/op
LengthBench.test    0.50     4096  avgt   25  20.951 ± 0.178  us/op
LengthBench.test    0.75     4096  avgt   25  20.658 ± 0.187  us/op
LengthBench.test    1.00     4096  avgt   25  21.004 ± 0.333  us/op

The generated code:

  3.75%    3.22%    0x00007f4260ae43e0: mov    %eax,-0x14000(%rsp)
  1.16%    1.54%    0x00007f4260ae43e7: push   %rbp
  0.03%    0.10%    0x00007f4260ae43e8: sub    $0x10,%rsp  
  4.03%    2.86%    0x00007f4260ae43ec: mov    0xc(%rsi),%r11d        ; get field $value
  1.11%    1.75%    0x00007f4260ae43f0: mov    0xc(%r12,%r11,8),%eax  ; get $value.arraylength
 47.88%   51.88%    0x00007f4260ae43f5: movsbl 0x14(%rsi),%ecx        ; get field $coder
  1.35%    0.52%    0x00007f4260ae43f9: sar    %cl,%eax               ; >> 1
 10.44%    9.96%    0x00007f4260ae43fb: add    $0x10,%rsp             ; return
  0.93%    0.38%    0x00007f4260ae43ff: pop    %rbp
  0.21%    0.65%    0x00007f4260ae4400: test   %eax,0x10bafbfa(%rip) 
  0.03%    0.23%    0x00007f4260ae4406: retq   

= Conclusion

1. The performance of length() with C2 does not necessarily depends on whether isLatin1() or coder bodies are inlined.
However, it might be a good idea to inline isLatin1() for consistency, and maybe StringUTF16.length() as well. 

2. Pre-reading the value.arraylength into local variable works around JDK-8075132, but still experiences branch
(mis)prediction problems. Additionally, it fires the CMOV optimizations that contribute to run-to-run variance in 
pathological cases.

3. Doing the unconditional shift by coder ID avoid branch misprediction costs, and brings the length() performance closer
to JDK 9 baseline. It would be possible to achieve the same effect with the branches, when VM would be able to separate
profiles for the different callers.