= String::new performance with String compression enabled

https://bugs.openjdk.java.net/browse/JDK-8054307

An early prototype for C2 intrinsics is available now, and we can see what
problems we should expect going forward. Search for the word "RECOMMENDATION"
to see the concrete suggestions on what to fix in the current prototype.

The tests are done on the JDK 9 Sandbox repo, 1x4x2 i7-4790K (Haswell) 4.0 GHz,
Linux x86_64. The disassembly is done with -XX:-UseCompressedOops to avoid
oop encoding/decoding.

Benchmark code is available at:
  http://cr.openjdk.java.net/~shade/density/

== String Construction

These benchmarks assess the performance of the String(char[]) constructor.
"cmp" means a compressible String with all Latin1 characters. "nonCmpBeg" is
a String with almost all Latin1 characters, but one UTF-16 char at the
beginning. "nonCmpEnd" is almost the same, but the UTF-16 char is at the end.

BASELINE:

Benchmark                 (size)  Mode  Cnt    Score   Error  Units
ConstructBench.cmp             1  avgt   50    8.176 ± 0.025  ns/op
ConstructBench.cmp            64  avgt   50   17.258 ± 0.046  ns/op
ConstructBench.cmp          4096  avgt   50  812.693 ± 1.691  ns/op
ConstructBench.nonCmpBeg       1  avgt   50    8.157 ± 0.020  ns/op
ConstructBench.nonCmpBeg      64  avgt   50   17.252 ± 0.053  ns/op
ConstructBench.nonCmpBeg    4096  avgt   50  812.890 ± 1.796  ns/op
ConstructBench.nonCmpEnd       1  avgt   50    8.170 ± 0.025  ns/op
ConstructBench.nonCmpEnd      64  avgt   50   17.287 ± 0.045  ns/op
ConstructBench.nonCmpEnd    4096  avgt   50  811.721 ± 2.339  ns/op

PATCHED:

Benchmark                 (size)  Mode  Cnt     Score   Error  Units
ConstructBench.cmp             1  avgt   50     7.827 ± 0.020  ns/op
ConstructBench.cmp            64  avgt   50    12.016 ± 0.042  ns/op
ConstructBench.cmp          4096  avgt   50   409.501 ± 1.175  ns/op
ConstructBench.nonCmpBeg       1  avgt   50    12.388 ± 0.026  ns/op
ConstructBench.nonCmpBeg      64  avgt   50    25.273 ± 0.061  ns/op
ConstructBench.nonCmpBeg    4096  avgt   50   872.756 ± 2.161  ns/op
ConstructBench.nonCmpEnd       1  avgt   50    12.407 ± 0.038  ns/op
ConstructBench.nonCmpEnd      64  avgt   50    25.622 ± 0.072  ns/op
ConstructBench.nonCmpEnd    4096  avgt   50  1213.420 ± 6.066  ns/op
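For reference, the three workloads can be sketched in plain Java. This is a
hypothetical setup, not the actual JMH benchmark from the URL above; the
method name makeChars and the particular non-Latin1 char are assumptions:

```java
// Sketch of the three workloads: all-Latin1 chars ("cmp"), one UTF-16
// char at the beginning ("nonCmpBeg"), and one at the end ("nonCmpEnd").
// Hypothetical setup code, not the actual JMH benchmark.
static char[] makeChars(int size, int utf16At) {
    char[] cs = new char[size];
    java.util.Arrays.fill(cs, 'a');   // Latin1 payload
    if (utf16At >= 0) {
        cs[utf16At] = '\u4e2d';       // a char that does not fit in Latin1
    }
    return cs;
}

// The benchmark bodies then measure the String(char[]) constructor:
//   new String(makeChars(size, -1));        // cmp
//   new String(makeChars(size, 0));         // nonCmpBeg
//   new String(makeChars(size, size - 1));  // nonCmpEnd
```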
Five phenomena to be explained:

 1. Why is the 4096-char cmp case almost 2x faster?
 2. Why is the 1-char nonCmpBeg case slower by 4 ns in the patched case?
 3. Why is the 4096-char nonCmpBeg case around 10% slower?
 4. Why is the 4096-char nonCmpEnd case almost 50% slower than nonCmpBeg?
 5. Why is the 64-char nonCmpEnd case *NOT* slower than the 64-char nonCmpBeg?

*** ISSUE 1. Why is the 4096-char cmp case almost 2x faster?

In the baseline version, the hottest block of code is the piece of
StubRoutines::jlong_disjoint_arraycopy that does the actual copying. Notice
it is using AVX2:

  3.51%    4.88%  0x00007f1995051d70: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
 11.39%   14.72%  0x00007f1995051d76: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
 16.78%   20.63%  0x00007f1995051d7c: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
 36.11%   20.16%  0x00007f1995051d82: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
 18.23%   22.99%  0x00007f1995051d88: add    $0x8,%rdx
  0.02%    0.02%  0x00007f1995051d8c: jle    Stub::jlong_disjoint_arraycopy+48 0x0x7f1995051d70

In the patched version, the hottest piece of code is the intrinsified
StringCoderLatin1.toBytes:

  0.34%    0.27%  0x00007fba24ae006e: vmovdqu (%rsi,%rdx,2),%xmm1
 38.28%   23.57%  0x00007fba24ae0073: vpor   %xmm1,%xmm0,%xmm0
  1.99%    2.08%  0x00007fba24ae0077: vmovdqu 0x10(%rsi,%rdx,2),%xmm3
  4.99%    9.83%  0x00007fba24ae007d: vpor   %xmm3,%xmm0,%xmm0
  1.33%    1.35%  0x00007fba24ae0081: vptest %xmm2,%xmm0
 12.09%   18.46%  0x00007fba24ae0086: jne    0x00007fba24ae00f6
 12.22%   15.79%  0x00007fba24ae008c: vpackuswb %xmm3,%xmm1,%xmm1
  0.76%    0.40%  0x00007fba24ae0090: vmovdqu %xmm1,(%rdi,%rdx,1)
 14.28%   15.96%  0x00007fba24ae0095: add    $0x10,%rdx
                  0x00007fba24ae0099: jne    0x00007fba24ae006e

Notice it packs the result right into the byte[] array, so if the String is
codeable into Latin1, we have the result almost immediately. Since the result
takes around 2x less footprint, we get an almost 2x performance improvement.

RECOMMENDATION: One immediate observation is that we might do 256-bit wide
comparisons with AVX2 in this case.
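For clarity, here is a scalar Java sketch of what the intrinsified
StringCoderLatin1.toBytes does; this is a hypothetical helper to illustrate
the semantics, not the actual prototype code, and the intrinsic checks high
bytes a whole SIMD stride at a time rather than char-by-char:

```java
// Scalar sketch of StringCoderLatin1.toBytes semantics: pack each char
// into one byte, bailing out on the first char that does not fit into
// Latin1. (Hypothetical helper, not the actual JDK code.)
static byte[] toBytesLatin1(char[] value, int off, int len) {
    byte[] val = new byte[len];
    for (int i = 0; i < len; i++) {
        char c = value[off + i];
        if (c > 0xFF) {
            return null;       // non-Latin1 char: caller falls back to UTF-16
        }
        val[i] = (byte) c;     // keep only the low byte
    }
    return val;
}
```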
LibraryCallKit::inline_string_copy(true) seems to be responsible for this
intrinsic code gen.

*** ISSUE 2. Why is the 1-char nonCmpBeg case slower by 4 ns in the patched case?

While the opportunistic coding into Latin1 is good for 1-byte Strings, when
we are dealing with 2-byte Strings, we do double work. The trouble is this:

  private byte[] initBytes(char[] value, int off, int len) {
      byte[] val = LATIN.toBytes(value, off, len);
      if (val != null) {
          this.coder = LATIN.id();
      } else { // value has non-latin1 char
          this.coder = UNICODE.id();
          val = UNICODE.toBytes(value, off, len);
      }
      return val;
  }

The code allocates the byte[] array twice: once in StringCoderLatin1.toBytes,
and a second time in StringCoderUTF16.toBytes. It does two copies: one packed
copy in Latin1, which fails, and another plain copy in UTF-16. See below for
more pathological cases.

*** ISSUE 3. Why is the 4096-char nonCmpBeg case around 10% slower?

nonCmpBeg and nonCmpEnd delegate to similar blocks in their relevant stubs,
which are compiled to the same vectorized block.
This is the baseline delegating to Stub::jlong_disjoint_arraycopy:

  4.26%    4.56%  0x00007ff645051d70: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
 42.80%   27.21%  0x00007ff645051d76: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
 16.99%   20.88%  0x00007ff645051d7c: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
  5.62%   11.20%  0x00007ff645051d82: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
 16.57%   18.53%  0x00007ff645051d88: add    $0x8,%rdx
  0.02%           0x00007ff645051d8c: jle    Stub::jlong_disjoint_arraycopy+48 0x0x7ff645051d70

...and this is the patched version delegating to Stub::jbyte_disjoint_arraycopy:

  3.44%    4.51%  0x00007f2bb50519d0: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
 12.32%   14.14%  0x00007f2bb50519d6: vmovdqu %ymm0,-0x38(%rsi,%rdx,8)
 16.90%   19.33%  0x00007f2bb50519dc: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
 32.88%   15.19%  0x00007f2bb50519e2: vmovdqu %ymm1,-0x18(%rsi,%rdx,8)
 17.75%   22.63%  0x00007f2bb50519e8: add    $0x8,%rdx
  0.03%           0x00007f2bb50519ec: jle    Stub::jbyte_disjoint_arraycopy+112 0x0x7f2bb50519d0

The prior dance around figuring out the compressibility accounts for the
extra 10%. A more pronounced case of that is below:

*** ISSUE 4. Why is the 4096-char nonCmpEnd case almost 50% slower than nonCmpBeg?

Both nonCmpBeg and nonCmpEnd delegate to the same block in
Stub::jbyte_disjoint_arraycopy, and the hot block there is identical. The
performance difference seems to be explained by the presence of the hot
intrinsified version of StringCoderLatin1.toBytes in the nonCmpEnd case:

                  0x00007fcfe4ae23d1: mov    $0xff00ff00,%ecx
                  ...
                  0x00007fcfe4ae23e0: vmovd  %ecx,%xmm1
  0.03%           0x00007fcfe4ae23e4: vpshufd $0x0,%xmm1,%xmm1     ; xmm1 has a mask now
                  ...
  0.27%    1.42%  0x00007fcfe4ae23f8: vmovdqu (%rsi,%rdx,2),%xmm0   ; load and OR next 16 bytes
 12.54%   16.49%  0x00007fcfe4ae23fd: vpor   %xmm0,%xmm3,%xmm3
  1.01%    3.07%  0x00007fcfe4ae2401: vmovdqu 0x10(%rsi,%rdx,2),%xmm2 ; load and OR next 16 bytes
  1.49%    6.05%  0x00007fcfe4ae2407: vpor   %xmm2,%xmm3,%xmm3
  0.64%    1.97%  0x00007fcfe4ae240b: vptest %xmm1,%xmm3            ; test if there are high bits
  4.43%   15.00%  0x00007fcfe4ae2410: jne    0x00007fcfe4ae2480     ;
  3.44%   10.99%  0x00007fcfe4ae2416: vpackuswb %xmm2,%xmm0,%xmm0
  0.50%    1.41%  0x00007fcfe4ae241a: vmovdqu %xmm0,(%rdi,%rdx,1)
  4.54%   10.23%  0x00007fcfe4ae241f: add    $0x10,%rdx
                  0x00007fcfe4ae2423: jne    0x00007fcfe4ae23f8

While we can't possibly eliminate the pre-scanning, it can still be optimized
with AVX2. See Issue 1 above.

*** ISSUE 5. Why is the 64-char nonCmpEnd case *NOT* slower than the 64-char nonCmpBeg?

With 256-bit streaming ops, 64-char strings are processed in 1-2 strides.
Therefore, it makes no difference where in the string the mismatch occurs.
The performance effects of the streaming part of the pre-scanning start to
manifest only on large strings.

= Conclusion

1. The performance costs of pre-scanning the String when the result fits in
   Latin1 are minimal, and are hidden very well by having less work to do.
   That is, e.g. a substring() that re-scans the copied character block
   should perform no differently than just copying the character block
   blindly.

2. When the String does not fit in Latin1, the costs of pre-scanning are
   clearly visible, especially when the 2-byte char is at the end of a large
   String. This calls for better optimization of the pre-scanning, although
   it should only affect large strings.

3. Given that most of our Strings are 1-byte compressible, it does not seem
   profitable to decouple the pre-scanning from Latin1 encoding. The only
   reason to do that is to avoid the double byte[] allocation, but that
   would have us walking the source char[] twice in the most common cases.
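The trade-off in point 3 can be sketched in Java. This is a hypothetical
decoupled variant, not the actual JDK code: it always allocates the byte[]
once, but pays for that with two passes over the source char[] in the common
all-Latin1 case (one scan pass, one encode pass). The UTF-16 byte order here
is an assumption; the real layout depends on platform endianness:

```java
// Pass 1 of the decoupled variant: scan only, no allocation.
static boolean fitsLatin1(char[] value, int off, int len) {
    for (int i = 0; i < len; i++) {
        if (value[off + i] > 0xFF) return false;   // needs UTF-16
    }
    return true;
}

// Decoupled init: decide the coder first, then allocate and encode once.
static byte[] initBytesDecoupled(char[] value, int off, int len) {
    if (fitsLatin1(value, off, len)) {             // pass 1: scan
        byte[] val = new byte[len];                // single allocation
        for (int i = 0; i < len; i++) {
            val[i] = (byte) value[off + i];        // pass 2: encode Latin1
        }
        return val;
    }
    byte[] val = new byte[len * 2];                // single allocation, UTF-16
    for (int i = 0; i < len; i++) {
        char c = value[off + i];
        val[2 * i]     = (byte) (c >> 8);          // assumed byte order
        val[2 * i + 1] = (byte) c;
    }
    return val;
}
```

This avoids the double allocation of initBytes, but in the dominant
compressible case it touches every source char twice, which is the reason
the note concludes decoupling is not profitable.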