= String.equals performance with String compression enabled
https://bugs.openjdk.java.net/browse/JDK-8054307

An early prototype for C2 intrinsics is available now, and we can see what problems we should expect going forward. Search for the word "RECOMMENDATION" to see the concrete suggestions on what to fix in the current prototype.

The tests are done on the JDK 9 Sandbox repo, on a 1x4x2 i7-4790K (Haswell) @ 4.0 GHz, running Linux x86_64.

Benchmark code is available at:
  http://cr.openjdk.java.net/~shade/density/

== Equals

Targeted benchmarks for equals() cover the 4 coder combinations for this binary operation: cmp1_cmp1 is String.equals(Latin1, Latin1), cmp1_cmp2 is String.equals(Latin1, UTF16), and so on. Additionally, we test strings of the same length vs. strings of different lengths to assess the shortcutting performance. Our rough estimate from the existing corpus is that around 95% of String comparisons take the "length" shortcut. (A rough sketch of the benchmark shape is included at the end of this note.)

=== Same length

First, the experiment with the same string lengths, which sets us up for the actual character comparisons.

JDK 9 Baseline:

Benchmark              (size)  Mode  Cnt    Score   Error  Units
EqualsBench.cmp1_cmp1       1  avgt   25    4.769 ± 0.001  ns/op
EqualsBench.cmp1_cmp2       1  avgt   25    4.770 ± 0.002  ns/op
EqualsBench.cmp2_cmp1       1  avgt   25    4.771 ± 0.006  ns/op
EqualsBench.cmp2_cmp2       1  avgt   25    4.770 ± 0.002  ns/op
EqualsBench.cmp1_cmp1      64  avgt   25    8.302 ± 0.062  ns/op
EqualsBench.cmp1_cmp2      64  avgt   25    8.222 ± 0.091  ns/op
EqualsBench.cmp2_cmp1      64  avgt   25    8.219 ± 0.059  ns/op
EqualsBench.cmp2_cmp2      64  avgt   25    8.260 ± 0.105  ns/op
EqualsBench.cmp1_cmp1    4096  avgt   25  173.889 ± 9.718  ns/op
EqualsBench.cmp1_cmp2    4096  avgt   25  153.549 ± 6.556  ns/op
EqualsBench.cmp2_cmp1    4096  avgt   25  155.951 ± 9.209  ns/op
EqualsBench.cmp2_cmp2    4096  avgt   25  149.245 ± 0.307  ns/op

JDK 9 String Density:

Benchmark              (size)  Mode  Cnt    Score   Error  Units
EqualsBench.cmp1_cmp1       1  avgt   25    5.206 ± 0.157  ns/op
EqualsBench.cmp1_cmp2       1  avgt   25    3.336 ± 0.008  ns/op
EqualsBench.cmp2_cmp1       1  avgt   25    3.336 ± 0.011  ns/op
EqualsBench.cmp2_cmp2       1  avgt   25    5.073 ± 0.161  ns/op
EqualsBench.cmp1_cmp1      64  avgt   25    6.977 ± 0.370  ns/op
EqualsBench.cmp1_cmp2      64  avgt   25    3.334 ± 0.011  ns/op
EqualsBench.cmp2_cmp1      64  avgt   25    3.340 ± 0.015  ns/op
EqualsBench.cmp2_cmp2      64  avgt   25    8.386 ± 0.256  ns/op
EqualsBench.cmp1_cmp1    4096  avgt   25   93.102 ± 4.729  ns/op
EqualsBench.cmp1_cmp2    4096  avgt   25    3.534 ± 0.196  ns/op
EqualsBench.cmp2_cmp1    4096  avgt   25    3.531 ± 0.316  ns/op
EqualsBench.cmp2_cmp2    4096  avgt   25  151.784 ± 1.555  ns/op

=== Why are some (size = 1) cases slower?

It looks like String Density executes 10+ excess instructions.
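At the Java level, the density version of equals() presumably has roughly the following shape: the coders are compared before anything else. This is a simplified, standalone model of that comparison order; field and helper names are illustrative, and this is not the actual prototype code:

    // Model of the compact-strings comparison order; names are illustrative.
    final class CompactString {
        static final byte LATIN1 = 0, UTF16 = 1;

        final byte[] value;   // 1 byte/char (Latin1) or 2 bytes/char (UTF16) payload
        final byte coder;     // LATIN1 or UTF16

        CompactString(byte[] value, byte coder) {
            this.value = value;
            this.coder = coder;
        }

        boolean contentEquals(CompactString other) {
            if (this == other) {
                return true;                           // reference shortcut
            }
            if (other == null) {
                return false;                          // null check before touching other.coder
            }
            if (coder != other.coder) {                // extra load + compare vs. the baseline...
                return false;                          // ...but a cheap shortcut for mixed pairs
            }
            return java.util.Arrays.equals(value, other.value);  // length check, then contents
        }
    }

The extra coder loads and the coder comparison are what show up as the additional instructions in the per-op counters below.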
Baseline:

Benchmark                                       (size)  Mode  Cnt   Score   Error  Units
EqualsBench.cmp1_cmp1                                1  avgt   25   4.728 ± 0.059  ns/op
EqualsBench.cmp1_cmp1:·CPI                           1  avgt    5   0.283 ± 0.015   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-load-misses         1  avgt    5   0.001 ± 0.002   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-loads               1  avgt    5  21.829 ± 0.306   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-store-misses        1  avgt    5  ≈ 10⁻⁴           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-stores              1  avgt    5   6.247 ± 0.075   #/op
EqualsBench.cmp1_cmp1:·branch-misses                 1  avgt    5  ≈ 10⁻⁴           #/op
EqualsBench.cmp1_cmp1:·branches                      1  avgt    5  15.572 ± 0.376   #/op
EqualsBench.cmp1_cmp1:·cycles                        1  avgt    5  19.691 ± 0.659   #/op
EqualsBench.cmp1_cmp1:·instructions                  1  avgt    5  69.631 ± 2.037   #/op

String Density:

Benchmark                                       (size)  Mode  Cnt   Score   Error  Units
EqualsBench.cmp1_cmp1                                1  avgt   25   5.277 ± 0.257  ns/op
EqualsBench.cmp1_cmp1:·CPI                           1  avgt    5   0.273 ± 0.059   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-load-misses         1  avgt    5   0.001 ± 0.002   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-loads               1  avgt    5  24.041 ± 0.644   #/op
EqualsBench.cmp1_cmp1:·L1-dcache-store-misses        1  avgt    5  ≈ 10⁻³           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-stores              1  avgt    5   6.273 ± 0.154   #/op
EqualsBench.cmp1_cmp1:·branch-misses                 1  avgt    5  ≈ 10⁻³           #/op
EqualsBench.cmp1_cmp1:·branches                      1  avgt    5  18.841 ± 0.338   #/op
EqualsBench.cmp1_cmp1:·cycles                        1  avgt    5  22.032 ± 5.010   #/op
EqualsBench.cmp1_cmp1:·instructions                  1  avgt    5  80.580 ± 1.465   #/op

The difference in instruction count is explained by the coder selection and by code generation quirks. The hot parts of the generated code for the baseline and patched versions are shown below; the differing parts are annotated in the listings.

JDK 9 Baseline:

                    [Verified Entry Point]
  0.05%    0.08%    0x00007f03ccaea980: mov    %eax,-0x14000(%rsp)
  5.08%    3.80%    0x00007f03ccaea987: push   %rbp
           0.01%    0x00007f03ccaea988: sub    $0x10,%rsp
  4.83%    3.61%    0x00007f03ccaea98c: mov    0x14(%rsi),%r10d        ; get $cmp1_1
  0.23%    0.20%    0x00007f03ccaea990: mov    0x10(%rsi),%r11d        ; get $cmp1_2
                    0x00007f03ccaea994: mov    0xc(%r12,%r11,8),%r9d   ; get $cmp1_2.value
  0.26%    0.16%    0x00007f03ccaea999: cmp    %r10d,%r11d             ; check ($cmp1_1 == $cmp1_2)
                    0x00007f03ccaea99c: je     0x00007f03ccaeaa7a      ; bail to "equal!"
  4.59%    2.82%    0x00007f03ccaea9a2: mov    0xc(%r12,%r10,8),%r8d   ; get $cmp1_1.value
  0.26%    0.27%    0x00007f03ccaea9a7: mov    0xc(%r12,%r8,8),%r10d   ; get $cmp1_1.value.arraylen
  4.60%    4.77%    0x00007f03ccaea9ac: mov    0xc(%r12,%r9,8),%ecx    ; get $cmp1_2.value.arraylen
  1.30%    1.70%    0x00007f03ccaea9b1: cmp    %r10d,%ecx              ; check lengths
                    0x00007f03ccaea9b4: jne    0x00007f03ccaeaa81      ; bail to "not equal!"
  8.61%    8.88%    0x00007f03ccaea9ba: lea    (%r12,%r8,8),%r10       ; <--- JDK-8020282
                    0x00007f03ccaea9be: lea    0x10(%r12,%r8,8),%rsi   ; load the start address of $cmp1_1.value
                    0x00007f03ccaea9c3: lea    (%r12,%r9,8),%r10       ; <--- JDK-8020282
                    0x00007f03ccaea9c7: lea    0x10(%r12,%r9,8),%rdi   ; load the start address of $cmp1_2.value
                    ...
JDK 9 String Density:

                    [Verified Entry Point]
  0.10%    0.05%    0x00007ff178aec900: mov    %eax,-0x14000(%rsp)
  4.42%    5.31%    0x00007ff178aec907: push   %rbp
                    0x00007ff178aec908: sub    $0x20,%rsp
  4.21%    4.34%    0x00007ff178aec90c: mov    0x14(%rsi),%r11d          ; get $cmp1_2
  0.66%    0.33%    0x00007ff178aec910: mov    0x10(%rsi),%r10d          ; get $cmp1_1
                    0x00007ff178aec914: movsbl 0x14(%r12,%r10,8),%r9d    ; get $cmp1_1.coder
  0.08%    0.14%    0x00007ff178aec91a: cmp    %r11d,%r10d               ; check ($cmp1_1 == $cmp1_2)
                    0x00007ff178aec91d: jne    0x00007ff178aec929        ; jump over if not equal
                    0x00007ff178aec91f: mov    $0x1,%eax
                    0x00007ff178aec924: jmpq   0x00007ff178aeca31
  3.82%    3.52%    0x00007ff178aec929: xor    %eax,%eax
  0.45%    0.58%    0x00007ff178aec92b: test   %r11d,%r11d               ; check ($cmp1_2 == null)
                    0x00007ff178aec92e: je     0x00007ff178aeca31        ; bail to "not equal!" if null
                    0x00007ff178aec934: movsbl 0x14(%r12,%r11,8),%ebp    ; get $cmp1_2.coder
  0.19%    0.13%    0x00007ff178aec93a: cmp    %ebp,%r9d                 ; check ($cmp1_1.coder == $cmp1_2.coder)
                    0x00007ff178aec93d: jne    0x00007ff178aeca3d        ; jump to slow path if not equal
  4.61%    1.44%    0x00007ff178aec943: test   %r9d,%r9d                 ; check ($cmp1_1.coder == 0)?
                    0x00007ff178aec946: jne    0x00007ff178aeca5d        ; jump to slow path if not
  0.42%    0.50%    0x00007ff178aec94c: mov    0xc(%r12,%r11,8),%r8d     ; get $cmp1_2.value
                    0x00007ff178aec951: mov    0xc(%r12,%r10,8),%r10d    ; get $cmp1_1.value
  0.09%    0.14%    0x00007ff178aec956: mov    0xc(%r12,%r8,8),%r9d      ; get $cmp1_2.value.arraylength
  5.85%    7.35%    0x00007ff178aec95b: mov    0xc(%r12,%r10,8),%ecx     ; get $cmp1_1.value.arraylength
  0.71%    0.81%    0x00007ff178aec960: cmp    %r9d,%ecx                 ; check lengths
                    0x00007ff178aec963: jne    0x00007ff178aeca31        ; bail to "not equal!"
  6.28%    7.93%    0x00007ff178aec969: mov    %r8,%r11                  ; <--- (similar to) JDK-8020282
           0.01%    0x00007ff178aec96c: shl    $0x3,%r11
  1.07%    1.40%    0x00007ff178aec970: lea    0x10(%r12,%r8,8),%rsi     ; load the start address of $cmp1_2.value
  0.06%    0.06%    0x00007ff178aec975: mov    %r10,%r11                 ; <--- (similar to) JDK-8020282
  3.28%    3.03%    0x00007ff178aec978: shl    $0x3,%r11
  0.03%    0.08%    0x00007ff178aec97c: lea    0x10(%r12,%r10,8),%rdi    ; load the start address of $cmp1_1.value
                    ...

RECOMMENDATION: Fix https://bugs.openjdk.java.net/browse/JDK-8020282 to alleviate some of these costs.

== Why are the cmp1_cmp2 and cmp2_cmp1 cases faster?

This happens because String Density shortcuts the equality comparison by comparing the coders first; see the sample generated code above. This allows skipping all other comparisons and loads altogether.

== Why is the cmp1_cmp1 case significantly faster with size = 4096?

Both the baseline and String Density versions have the same hottest block:

  9.00%   11.78%    0x00007fa414aea9f4: vmovdqu (%rdi,%rcx,1),%ymm1
 17.26%   14.77%    0x00007fa414aea9f9: vmovdqu (%rsi,%rcx,1),%ymm0
 16.73%   16.17%    0x00007fa414aea9fe: vpxor  %ymm0,%ymm1,%ymm1
 12.60%   12.45%    0x00007fa414aeaa02: vptest %ymm1,%ymm1
 22.93%   24.44%    0x00007fa414aeaa07: jne    0x00007fa414aeaa64
 12.17%   14.05%    0x00007fa414aeaa09: add    $0x20,%rcx
                    0x00007fa414aeaa0d: jne    0x00007fa414aea9f4

...but the String Density version does less work because of the compressed storage.
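To put a rough number on that, here is a back-of-the-envelope estimate (not a measurement), assuming the 32-byte ymm stride visible in the loop above:

    // Rough estimate of the main-loop trip counts for size = 4096; illustrative only.
    class CompareWorkEstimate {
        public static void main(String[] args) {
            int chars = 4096;
            int strideBytes = 32;                                // one ymm register per vmovdqu
            int baselineIterations = (chars * 2) / strideBytes;  // 8192 bytes of char[] payload -> 256
            int densityIterations  = (chars * 1) / strideBytes;  // 4096 bytes of byte[] payload -> 128
            System.out.println(baselineIterations + " vs. " + densityIterations);
        }
    }

With two vector loads per iteration, this predicts roughly half the loads and half the instructions for the density version, which is what the per-op counters below show.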
JDK 9 Baseline:

Benchmark                                       (size)  Mode  Cnt     Score   Error  Units
EqualsBench.cmp1_cmp1                             4096  avgt    5   151.819 ± 3.888  ns/op
EqualsBench.cmp1_cmp1:·CPI                        4096  avgt          0.326           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-load-misses      4096  avgt          0.059           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-loads            4096  avgt        552.395           #/op
EqualsBench.cmp1_cmp1:·cycles                     4096  avgt        630.489           #/op
EqualsBench.cmp1_cmp1:·instructions               4096  avgt       1936.352           #/op

JDK 9 String Density:

Benchmark                                       (size)  Mode  Cnt     Score   Error  Units
EqualsBench.cmp1_cmp1                             4096  avgt    5    98.257 ± 0.157  ns/op
EqualsBench.cmp1_cmp1:·CPI                        4096  avgt          0.405           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-load-misses      4096  avgt          0.009           #/op
EqualsBench.cmp1_cmp1:·L1-dcache-loads            4096  avgt        290.094           #/op
EqualsBench.cmp1_cmp1:·cycles                     4096  avgt        408.767           #/op
EqualsBench.cmp1_cmp1:·instructions               4096  avgt       1008.241           #/op

This is also why cmp2_cmp2 is not slower: the generated code is the same, it just does not get the compressed-storage improvement that cmp1_cmp1 enjoys.

== Different Length

These benchmarks assess the performance when we compare Strings of different lengths.

JDK 9 Baseline:

Benchmark                     (size)  Mode  Cnt  Score   Error  Units
EqualsDiffLenBench.cmp1_cmp1       1  avgt   25  3.412 ± 0.011  ns/op
EqualsDiffLenBench.cmp1_cmp2       1  avgt   25  3.408 ± 0.001  ns/op
EqualsDiffLenBench.cmp2_cmp1       1  avgt   25  3.413 ± 0.004  ns/op
EqualsDiffLenBench.cmp2_cmp2       1  avgt   25  3.410 ± 0.002  ns/op

JDK 9 String Density:

Benchmark                     (size)  Mode  Cnt  Score   Error  Units
EqualsDiffLenBench.cmp1_cmp1       1  avgt   25  3.762 ± 0.002  ns/op
EqualsDiffLenBench.cmp1_cmp2       1  avgt   25  3.325 ± 0.001  ns/op
EqualsDiffLenBench.cmp2_cmp1       1  avgt   25  3.325 ± 0.004  ns/op
EqualsDiffLenBench.cmp2_cmp2       1  avgt   25  4.154 ± 0.002  ns/op

== Why is the performance in the cmp1_cmp1 and cmp2_cmp2 cases worse?

If you look at the generated code in the section above, you will see that we first compare the coders, and only then compare the lengths if the coders match. The "excess" coder comparison in the cmpX_cmpX cases adds a slight overhead before we can shortcut on length. Any improvement in the coder selection scheme will save us these few cycles.

= Conclusion

equals() performance clearly benefits from compressed storage. The most frequent case, comparing two compressed Strings, benefits tremendously. Coder selection still carries some overhead that needs to be absorbed.
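For reference, the EqualsBench harness used above has roughly this shape. This is an illustrative JMH sketch only; the actual benchmark code lives at http://cr.openjdk.java.net/~shade/density/, and the names and setup details here are assumptions:

    import java.util.Arrays;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class EqualsBench {
        @Param({"1", "64", "4096"})
        int size;

        String latin1a, latin1b;   // compressible (Latin1-only) strings
        String utf16a, utf16b;     // strings forced into the UTF16 coder

        @Setup
        public void setup() {
            char[] cs = new char[size];
            Arrays.fill(cs, 'a');
            latin1a = new String(cs);
            latin1b = new String(cs);
            cs[size - 1] = '\u4e2d';      // one non-Latin1 char keeps the string in UTF16 form
            utf16a = new String(cs);
            utf16b = new String(cs);
        }

        @Benchmark public boolean cmp1_cmp1() { return latin1a.equals(latin1b); }
        @Benchmark public boolean cmp1_cmp2() { return latin1a.equals(utf16a); }
        @Benchmark public boolean cmp2_cmp1() { return utf16a.equals(latin1a); }
        @Benchmark public boolean cmp2_cmp2() { return utf16a.equals(utf16b); }
    }

EqualsDiffLenBench would then differ only in giving the two operands different lengths, so that equals() can bail out before comparing contents.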