= String::new performance with String compression enabled

https://bugs.openjdk.java.net/browse/JDK-8054307

An early prototype for C2 intrinsics is available now, and we can see what
problems we should expect going forward. Search for the word "RECOMMENDATION"
to see the concrete suggestions on what to fix in the current prototype.

The tests are done on the JDK 9 Sandbox repo, 1x4x2 i7-4790K (Haswell) 4.0 GHz,
Linux x86_64. The disassembly is done with -XX:-UseCompressedOops to avoid
oop encoding/decoding.

Benchmark code is available at:
  http://cr.openjdk.java.net/~shade/density/

== String Construction

These benchmarks assess the performance of the String(char[]) constructor.
"cmp" means a compressible String with all Latin1 characters. "nonCmpBeg" is
a String with almost all Latin1 characters, but one UTF-16 char at the
beginning. "nonCmpEnd" is almost the same, but the UTF-16 char is at the end.

BASELINE:

Benchmark                 (size)  Mode  Cnt    Score   Error  Units
ConstructBench.cmp             1  avgt   50    8.176 ± 0.025  ns/op
ConstructBench.cmp            64  avgt   50   17.258 ± 0.046  ns/op
ConstructBench.cmp          4096  avgt   50  812.693 ± 1.691  ns/op
ConstructBench.nonCmpBeg       1  avgt   50    8.157 ± 0.020  ns/op
ConstructBench.nonCmpBeg      64  avgt   50   17.252 ± 0.053  ns/op
ConstructBench.nonCmpBeg    4096  avgt   50  812.890 ± 1.796  ns/op
ConstructBench.nonCmpEnd       1  avgt   50    8.170 ± 0.025  ns/op
ConstructBench.nonCmpEnd      64  avgt   50   17.287 ± 0.045  ns/op
ConstructBench.nonCmpEnd    4096  avgt   50  811.721 ± 2.339  ns/op

PATCHED:

Benchmark                 (size)  Mode  Cnt     Score   Error  Units
ConstructBench.cmp             1  avgt   50     7.827 ± 0.020  ns/op
ConstructBench.cmp            64  avgt   50    12.016 ± 0.042  ns/op
ConstructBench.cmp          4096  avgt   50   409.501 ± 1.175  ns/op
ConstructBench.nonCmpBeg       1  avgt   50    12.388 ± 0.026  ns/op
ConstructBench.nonCmpBeg      64  avgt   50    25.273 ± 0.061  ns/op
ConstructBench.nonCmpBeg    4096  avgt   50   872.756 ± 2.161  ns/op
ConstructBench.nonCmpEnd       1  avgt   50    12.407 ± 0.038  ns/op
ConstructBench.nonCmpEnd      64  avgt   50    25.622 ± 0.072  ns/op
ConstructBench.nonCmpEnd    4096  avgt   50  1213.420 ± 6.066  ns/op
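For reference, the three workloads can be sketched in plain Java. This is a
hypothetical setup, not the actual JMH benchmark from the URL above; the
method name makeChars and the particular non-Latin1 char are assumptions:

```java
// Sketch of the three workloads: all-Latin1 chars ("cmp"), one UTF-16
// char at the beginning ("nonCmpBeg"), and one at the end ("nonCmpEnd").
// Hypothetical setup code, not the actual JMH benchmark.
static char[] makeChars(int size, int utf16At) {
    char[] cs = new char[size];
    java.util.Arrays.fill(cs, 'a');   // Latin1 payload
    if (utf16At >= 0) {
        cs[utf16At] = '\u4e2d';       // a char that does not fit in Latin1
    }
    return cs;
}

// The benchmark bodies then measure the String(char[]) constructor:
//   new String(makeChars(size, -1));        // cmp
//   new String(makeChars(size, 0));         // nonCmpBeg
//   new String(makeChars(size, size - 1));  // nonCmpEnd
```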
Five phenomena to be explained:

 1. Why is the 4096-char cmp case almost 2x faster?
 2. Why is the 1-char nonCmpBeg case slower by 4 ns in the patched case?
 3. Why is the 4096-char nonCmpBeg case around 10% slower?
 4. Why is the 4096-char nonCmpEnd case almost 50% slower than nonCmpBeg?
 5. Why is the 64-char nonCmpEnd case *NOT* slower than the 64-char nonCmpBeg?

*** ISSUE 1. Why is the 4096-char cmp case almost 2x faster?

In the baseline version, the hottest block of code is the piece of
StubRoutines::jlong_disjoint_arraycopy that does the actual copying. Notice
it is using AVX2:

  3.51%    4.88%  0x00007f1995051d70: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
 11.39%   14.72%  0x00007f1995051d76: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
 16.78%   20.63%  0x00007f1995051d7c: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
 36.11%   20.16%  0x00007f1995051d82: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
 18.23%   22.99%  0x00007f1995051d88: add    $0x8,%rdx
  0.02%    0.02%  0x00007f1995051d8c: jle    Stub::jlong_disjoint_arraycopy+48 0x0x7f1995051d70

In the patched version, the hottest piece of code is the intrinsified
StringCoderLatin1.toBytes:

  0.34%    0.27%  0x00007fba24ae006e: vmovdqu (%rsi,%rdx,2),%xmm1
 38.28%   23.57%  0x00007fba24ae0073: vpor   %xmm1,%xmm0,%xmm0
  1.99%    2.08%  0x00007fba24ae0077: vmovdqu 0x10(%rsi,%rdx,2),%xmm3
  4.99%    9.83%  0x00007fba24ae007d: vpor   %xmm3,%xmm0,%xmm0
  1.33%    1.35%  0x00007fba24ae0081: vptest %xmm2,%xmm0
 12.09%   18.46%  0x00007fba24ae0086: jne    0x00007fba24ae00f6
 12.22%   15.79%  0x00007fba24ae008c: vpackuswb %xmm3,%xmm1,%xmm1
  0.76%    0.40%  0x00007fba24ae0090: vmovdqu %xmm1,(%rdi,%rdx,1)
 14.28%   15.96%  0x00007fba24ae0095: add    $0x10,%rdx
                  0x00007fba24ae0099: jne    0x00007fba24ae006e

Notice it packs the result right into the byte[] array, so if the String is
codeable into Latin1, we have the result almost immediately. Since the result
takes around 2x less footprint, we get an almost 2x performance improvement.

RECOMMENDATION: One immediate observation is that we might do 256-bit wide
comparisons with AVX2 in this case.
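For clarity, here is a scalar Java sketch of what the intrinsified
StringCoderLatin1.toBytes does; this is a hypothetical helper to illustrate
the semantics, not the actual prototype code, and the intrinsic checks high
bytes a whole SIMD stride at a time rather than char-by-char:

```java
// Scalar sketch of StringCoderLatin1.toBytes semantics: pack each char
// into one byte, bailing out on the first char that does not fit into
// Latin1. (Hypothetical helper, not the actual JDK code.)
static byte[] toBytesLatin1(char[] value, int off, int len) {
    byte[] val = new byte[len];
    for (int i = 0; i < len; i++) {
        char c = value[off + i];
        if (c > 0xFF) {
            return null;       // non-Latin1 char: caller falls back to UTF-16
        }
        val[i] = (byte) c;     // keep only the low byte
    }
    return val;
}
```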
LibraryCallKit::inline_string_copy(true) seems to be responsible for this
intrinsic code gen.

*** ISSUE 2. Why is the 1-char nonCmpBeg case slower by 4 ns in the patched case?

While the opportunistic coding into Latin1 is good for 1-byte Strings, when
we are dealing with 2-byte Strings, we do double work. The trouble is this:

  private byte[] initBytes(char[] value, int off, int len) {
      byte[] val = LATIN.toBytes(value, off, len);
      if (val != null) {
          this.coder = LATIN.id();
      } else { // value has non-latin1 char
          this.coder = UNICODE.id();
          val = UNICODE.toBytes(value, off, len);
      }
      return val;
  }

The code allocates the byte[] array twice: once in StringCoderLatin1.toBytes,
and a second time in StringCoderUTF16.toBytes. It does two copies: one packed
copy in Latin1, which fails, and another plain copy in UTF-16. See below for
more pathological cases.

*** ISSUE 3. Why is the 4096-char nonCmpBeg case around 10% slower?

nonCmpBeg and nonCmpEnd delegate to similar blocks in their relevant stubs,
which are compiled to the same vectorized block.
This is the baseline delegating to Stub::jlong_disjoint_arraycopy:

  4.26%    4.56%  0x00007ff645051d70: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
 42.80%   27.21%  0x00007ff645051d76: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
 16.99%   20.88%  0x00007ff645051d7c: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
  5.62%   11.20%  0x00007ff645051d82: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
 16.57%   18.53%  0x00007ff645051d88: add    $0x8,%rdx
  0.02%           0x00007ff645051d8c: jle    Stub::jlong_disjoint_arraycopy+48 0x0x7ff645051d70

...and this is the patched version delegating to Stub::jbyte_disjoint_arraycopy:

  3.44%    4.51%  0x00007f2bb50519d0: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
 12.32%   14.14%  0x00007f2bb50519d6: vmovdqu %ymm0,-0x38(%rsi,%rdx,8)
 16.90%   19.33%  0x00007f2bb50519dc: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
 32.88%   15.19%  0x00007f2bb50519e2: vmovdqu %ymm1,-0x18(%rsi,%rdx,8)
 17.75%   22.63%  0x00007f2bb50519e8: add    $0x8,%rdx
  0.03%           0x00007f2bb50519ec: jle    Stub::jbyte_disjoint_arraycopy+112 0x0x7f2bb50519d0

The prior dance around figuring out the compressibility accounts for the
extra 10%. A more pronounced case of that is below:

*** ISSUE 4. Why is the 4096-char nonCmpEnd case almost 50% slower than nonCmpBeg?

Both nonCmpBeg and nonCmpEnd delegate to the same block in
Stub::jbyte_disjoint_arraycopy, and the hot block there is identical. The
performance difference seems to be explained by the presence of the hot
intrinsified version of StringCoderLatin1.toBytes in the nonCmpEnd case:

                  0x00007fcfe4ae23d1: mov    $0xff00ff00,%ecx
                  ...
                  0x00007fcfe4ae23e0: vmovd  %ecx,%xmm1
  0.03%           0x00007fcfe4ae23e4: vpshufd $0x0,%xmm1,%xmm1     ; xmm1 has a mask now
                  ...
  0.27%    1.42%  0x00007fcfe4ae23f8: vmovdqu (%rsi,%rdx,2),%xmm0   ; load and OR next 16 bytes
 12.54%   16.49%  0x00007fcfe4ae23fd: vpor   %xmm0,%xmm3,%xmm3
  1.01%    3.07%  0x00007fcfe4ae2401: vmovdqu 0x10(%rsi,%rdx,2),%xmm2 ; load and OR next 16 bytes
  1.49%    6.05%  0x00007fcfe4ae2407: vpor   %xmm2,%xmm3,%xmm3
  0.64%    1.97%  0x00007fcfe4ae240b: vptest %xmm1,%xmm3            ; test if there are high bits
  4.43%   15.00%  0x00007fcfe4ae2410: jne    0x00007fcfe4ae2480     ;
  3.44%   10.99%  0x00007fcfe4ae2416: vpackuswb %xmm2,%xmm0,%xmm0
  0.50%    1.41%  0x00007fcfe4ae241a: vmovdqu %xmm0,(%rdi,%rdx,1)
  4.54%   10.23%  0x00007fcfe4ae241f: add    $0x10,%rdx
                  0x00007fcfe4ae2423: jne    0x00007fcfe4ae23f8

While we can't possibly eliminate the pre-scanning, it can still be optimized
with AVX2. See Issue 1 above.

*** ISSUE 5. Why is the 64-char nonCmpEnd case *NOT* slower than the 64-char nonCmpBeg?

With 256-bit streaming ops, 64-char strings are processed in 1-2 strides.
Therefore, it makes no difference where in the string the mismatch occurs.
The performance effects of the streaming part of the pre-scanning start to
manifest only on large strings.

= Conclusion

1. The performance costs of pre-scanning the String when the result fits in
   Latin1 are minimal, and are hidden very well by having less work to do.
   That is, e.g. a substring() that re-scans the copied character block
   should perform no differently than just copying the character block
   blindly.

2. When the String does not fit in Latin1, the costs of pre-scanning are
   clearly visible, especially when the 2-byte char is at the end of a large
   String. This calls for better optimization of the pre-scanning, although
   it should only affect large strings.

3. Given that most of our Strings are 1-byte compressible, it does not seem
   profitable to decouple the pre-scanning from Latin1 encoding. The only
   reason to do that is to avoid the double byte[] allocation, but that
   would have us walking the source char[] twice in the most common cases.
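The trade-off in point 3 can be sketched in Java. This is a hypothetical
decoupled variant, not the actual JDK code: it always allocates the byte[]
once, but pays for that with two passes over the source char[] in the common
all-Latin1 case (one scan pass, one encode pass). The UTF-16 byte order here
is an assumption; the real layout depends on platform endianness:

```java
// Pass 1 of the decoupled variant: scan only, no allocation.
static boolean fitsLatin1(char[] value, int off, int len) {
    for (int i = 0; i < len; i++) {
        if (value[off + i] > 0xFF) return false;   // needs UTF-16
    }
    return true;
}

// Decoupled init: decide the coder first, then allocate and encode once.
static byte[] initBytesDecoupled(char[] value, int off, int len) {
    if (fitsLatin1(value, off, len)) {             // pass 1: scan
        byte[] val = new byte[len];                // single allocation
        for (int i = 0; i < len; i++) {
            val[i] = (byte) value[off + i];        // pass 2: encode Latin1
        }
        return val;
    }
    byte[] val = new byte[len * 2];                // single allocation, UTF-16
    for (int i = 0; i < len; i++) {
        char c = value[off + i];
        val[2 * i]     = (byte) (c >> 8);          // assumed byte order
        val[2 * i + 1] = (byte) c;
    }
    return val;
}
```

This avoids the double allocation of initBytes, but in the dominant
compressible case it touches every source char twice, which is the reason
the note concludes decoupling is not profitable.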