*** Performance impact of decommissioning arrayStorageProperties on legacy code.

Note: by legacy code I mean Java code written before the Valhalla world, a.k.a. Java code without inline classes.
Note: analysis of the performance impact on inline types is in progress.

The difference was measured between jdk-15-valhalla+1-72 and jdk-15-valhalla+1-66, which covers all related HotSpot modifications. Below, "baseline" means the behavior with Valhalla turned off (-XX:-EnableValhalla); no difference in baseline behavior was found between build 66 and build 72. V-66 and v-72 denote the corresponding Valhalla versions.

1. General picture.

About 160 benchmarks were checked. ~30 of them are big or middle-sized third-party benchmarks (SPEC..., Dacapo, Volano); all others are a subset of our microbenchmark base. Only -XX:+UseCompressedOops was checked. No significant changes were found:
- 16 benchmarks got a speedup (v-66 -> v-72), typically around +5% (some up to +10%)
- 14 benchmarks got a degradation (v-66 -> v-72), typically around -5% (some -10%)

In the checked benchmark base the majority of benchmarks have the same performance as the baseline, but 15 benchmarks are slower than the baseline by up to 10% (baseline vs v-72). On the one hand, since it is typical for Valhalla changes to make benchmarks jitter within 10%, we do not consider performance changes below 10% significant (this threshold will be lowered as Valhalla matures). On the other hand, 10% of the benchmarks (from the selected set) degrade with Valhalla, which means we can't leave it as is and should address it sooner or later; otherwise there is a high chance of negative acceptance by the Java community.

2. Detailed "aaload" analysis (other array operations are in progress).

The simplest possible benchmark was used for the analysis:

    Object[] a1;

    @Benchmark
    public void read(Blackhole bh) {
        for (int i = 0; i < size; i++) {
            bh.consume(a1[i]);
        }
    }

The array reference is loaded from the field (a1) on each iteration intentionally, because HotSpot is pretty good at hoisting array checks out of loops (at least for a single array). Object[] is used for a similar reason: HotSpot doesn't generate "array of inline" checks when it can prove that inline types can't appear here. As a result, in this microbenchmark the is-the-array-flattened check is performed on each iteration. The array size is 100 (checking larger arrays didn't show any unique behavior for this particular benchmark).

* v-66 -> v-72

The benchmark performance depends on whether compressed or uncompressed oops are used. Moreover, compressed oops (base) behave differently from compressed oops (base+shift), and v-72 also depends on whether the klass pointer is compressed. Here are the results, times in nanoseconds:

                                 baseline | v-66 | v-72 | v-72 + -XX:-UseCompressedClassPointers
    CompressedOops(base)       :      485 |  555 |  645 | 630
    CompressedOops(base+shift) :      500 |  620 |  700 | 650
    UncompressedOops           :      530 |  655 |  570 |

Decommissioning arrayStorageProperties leads to a +13% speedup in the uncompressed oops case and to a -16% (base) and -13% (base+shift) degradation for compressed oops (v-66 vs v-72).

Here is how much each Valhalla version is slower than the baseline:

                                 v-66 | v-72
    CompressedOops(base)       : -14% | -33%
    CompressedOops(base+shift) : -24% | -40%
    UncompressedOops           : -24% |  -8%

In the uncompressed oops case we got a really positive result, but the compressed oops case got a significant slowdown.

Please note that all times and ratios above refer to the performance of the whole benchmark, not to the performance of the "aaload" operation itself.
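For context - since all of the numbers above include the JMH harness - the full shape of the microbenchmark is roughly the following self-contained JMH class (a sketch: the class name and the @State/@Param/@Setup scaffolding are my assumptions, not necessarily the exact code used):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.infra.Blackhole;

    @State(Scope.Benchmark)
    public class ArrayReadBench {      // hypothetical name

        @Param("100")                  // array size used above
        int size;

        Object[] a1;

        @Setup
        public void setup() {
            a1 = new Object[size];
            for (int i = 0; i < size; i++) {
                a1[i] = new Object();  // plain objects, no inline types
            }
        }

        @Benchmark
        public void read(Blackhole bh) {
            // a1 is read from the field inside the loop on purpose,
            // so the is-the-array-flattened check runs on every iteration
            for (int i = 0; i < size; i++) {
                bh.consume(a1[i]);
            }
        }
    }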
The JMH code around the benchmark has its own effect, and the smaller the examined operation, the larger that effect.

The performance degradation in the compressed oops case is caused by a chain of reasons:
- the tag is checked in the Klass -> an additional dereference
- unpacking the klass pointer -> the same scratch register (r12) is used as the compressed klass base and as the compressed oops base -> more instructions to manage the base address register
- the checked tag is not a single bit -> an extra register is required -> more register spilling

Thorough profiling, with the JMH impact factored out, has shown that "v-66 compressed aaload" is 2x slower than the baseline aaload, while "v-72 compressed aaload" is 3x slower than the baseline. "v-66 uncompressed aaload" is 1.5x slower than the baseline aaload, while "v-72 uncompressed aaload" is 1.3x slower. The key reason is the larger number of instructions; there are no cache or memory behavior differences between v-66 and v-72.

Here is the compressed v-72 code with some comments and questions:

    mov    0x10(%r10),%ebp         ; #1  *getfield, load reference to array
    mov    0xc(%r12,%rbp,8),%r10d  ; #2  load array length (implicit oop unpacking via x86 memory addressing)
    mov    0x8(%rsp),%r11d         ; #3
    cmp    %r10d,%r11d             ; #4
    jae    0x00007f51c7a6e766      ; #5  lines #2-#5 - range check
    mov    0x8(%r12,%rbp,8),%r10d  ; #6  load klass ptr
    lea    (%r12,%rbp,8),%rdi      ; #7  uncompress array oop into rdi
    shl    $0x3,%r10               ; #8
    movabs $0x800000000,%r12       ; #9  0x800000000 - klass ptr base
    add    %r12,%r10               ; #10 lines #8-#10 - uncompress klass ptr
    xor    %r12,%r12               ; #11 restore r12 (it doubles as the compressed oops base register)
    mov    0x8(%r10),%r8d          ; #12 load layout helper
    sar    $0x1d,%r8d              ; #13
    cmp    $0xfffffffd,%r8d        ; #14 lines #13-#14 - extract and test the layout helper tag
    jne    0x00007f51c7a6e643      ; #15
    mov    0x10(%rdi,%r11,4),%r11d ; #16 load the (compressed) element
    mov    %r11,%rbx               ; #17
    shl    $0x3,%rbx               ; #18 finally, the reference from the array is uncompressed into rbx

* Line #2 (range check) and line #7 do the same job - uncompressing the array oop. Why not combine these actions?
* Lines #8, #10 and #12 - unpack the klass ptr and load the layout helper. Why not do it the same way as in line #2 (a single instruction)?
* Lines #12, #13, #14 check the high byte of the layout helper for the value 0xA0 (value type array). 0xA is binary 1010. The highest bit is 1 for all kinds of arrays, and HotSpot knows statically that we have an array here, so there is no need to check that bit. Only one bit needs to be checked, which can be done with a "test" instruction -> one register saved -> less register pressure, less spilling -> less code.

**** That was the analysis of the hot aaload instruction, where all memory is in caches. Cold aaload behavior was also checked, using another benchmark with a large number of different arrays that can't fit into the CPU cache. As expected, a high number of LLC misses was observed. At the same time it was shown that decommissioning arrayStorageProperties didn't increase cache misses. Walking into the Klass doesn't cause cache misses due to the limited number of Klasses. All extra cache misses (extra in comparison with the baseline) happen when the markword or klass ptr is read.

3. In general, the performance regressions of Valhalla checks are caused by three reasons:

- Increased number of instructions. More work (checks) has to be done.

- Complex tags and masks. Having a non-single-bit mask is not an issue by itself, but it always spoils a register and causes more and more register spilling (like an avalanche), which may crash the performance of a tight, sensitive loop. In particular, that induced register spilling is the source of the regression. I will advocate for single-bit masks as much as possible. As for the layout helper tag: 3 values in 8 bits is more than enough.
  By the way: we don't have CMS anymore, biased locking is going away, and the markword became simpler. Could we find one bit in the markword to mark inline type objects?

- More memory loads and cache misses. Unavoidable. The only way out is to make out-of-loop hoisting and check elimination better and better (see the sketch below).
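To illustrate the last point: the microbenchmark in section 2 deliberately re-reads the array from the field inside the loop; if the reference is kept in a local instead, HotSpot can (as noted in section 2, at least for a single array) hoist the flattened-array check out of the loop. A minimal sketch of the two loop shapes (variable names are illustrative, not taken from the actual benchmark code):

    // shape used in the benchmark: a1 is re-read from the field, so the
    // is-the-array-flattened check runs on every iteration
    for (int i = 0; i < size; i++) {
        bh.consume(a1[i]);
    }

    // hoisting-friendly shape: the reference lives in a local, so the check
    // can be performed once, before the loop
    Object[] local = a1;
    for (int i = 0; i < size; i++) {
        bh.consume(local[i]);
    }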