= String.hashCode performance

https://bugs.openjdk.java.net/browse/JDK-8054307

The tests are done on the JDK 9 Sandbox repo, 1x4x2 i7-4790K (Haswell) 4.0 GHz, Linux x86_64.
Benchmark code is available at:
  http://cr.openjdk.java.net/~shade/density/

== Background

String hashcode is a simple P(31) polynomial hash. The JDK implementation, however, caches
String hashcodes, and uses "0" as the marker for a "not yet computed" hashcode. We can exploit
that quirk in our benchmarks by carefully constructing strings with zero hashcodes, containing
either Latin1-only or UTF16 characters. To do that, we pad two known zero-hashcode English and
Russian phrases with zero characters. Since the prefix hashcode is zero, such padding does not
change the hashcode. Given such a string, the JDK implementation always recomputes the hashcode.

While hashcode caching greatly alleviates the hashcode computation costs, in many cases users
deal with young Strings: they compute and use the hashcode once, then discard the string, rinse
and repeat. Then the cost of String hashcode computation may become important.
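The padding trick can be sketched as follows. This is a minimal illustration, not the actual benchmark setup: it uses the empty string as a trivially zero-hashcode prefix instead of the English/Russian phrases. Appending a character c maps the hash h to 31*h + c, so padding with '\u0000' preserves the hash only when the prefix hash is already zero:

```java
public class ZeroHashPadding {
    public static void main(String[] args) {
        // The empty string trivially hashes to 0, the "not computed" cache marker.
        // Appending '\u0000' maps h -> 31*h + 0, so a zero hash stays zero:
        String padded = "" + "\u0000\u0000\u0000\u0000";
        System.out.println(padded.hashCode());  // 0: hashCode is recomputed on every call

        // A non-zero prefix hash would instead be multiplied by 31 per pad character:
        String a = "a" + "\u0000";
        System.out.println(a.hashCode());       // 31 * 'a' = 3007
    }
}
```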
== Benchmarks

Baseline:

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp1       1  avgt   25    19.667 ± 0.226  ns/op
HashCodeBench.cmp1      64  avgt   25    47.512 ± 0.398  ns/op
HashCodeBench.cmp1    4096  avgt   25  3084.317 ± 1.345  ns/op
HashCodeBench.cmp2       1  avgt   25    25.717 ± 0.080  ns/op
HashCodeBench.cmp2      64  avgt   25    47.315 ± 0.020  ns/op
HashCodeBench.cmp2    4096  avgt   25  3083.622 ± 0.913  ns/op

Patched:

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp1       1  avgt   25    19.816 ± 0.085  ns/op
HashCodeBench.cmp1      64  avgt   25    47.889 ± 0.583  ns/op
HashCodeBench.cmp1    4096  avgt   25  3084.170 ± 1.417  ns/op
HashCodeBench.cmp2       1  avgt   25    27.421 ± 0.684  ns/op
HashCodeBench.cmp2      64  avgt   25    49.013 ± 0.206  ns/op
HashCodeBench.cmp2    4096  avgt   25  3089.809 ± 4.960  ns/op

== Analysis

=== cmp1 case

Notice how the cmp1 case is not affected at all, because the code in StringLatin1.hashCode()
is very similar to the baseline version:

baseline:
  public int hashCode() {
      int h = hash;
      if (h == 0) {
          for (char v : value) {
              h = 31 * h + v;
          }
          hash = h;
      }
      return h;
  }

patched (Latin1):
  public static int hashCode(byte[] value) {
      int h = 0;
      for (byte v : value) {
          h = 31 * h + (v & 0xff);
      }
      return h;
  }

One can expect the sub-byte masking to be optimized, and indeed it is folded by the compiler
into a byte load. The disassemblies are nearly identical; the only difference is using either a
zero-extended short read, or a zero-extended byte read.
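The (v & 0xff) masking matters for correctness: it widens the signed byte back to its unsigned Latin1 code point, so the byte-array loop produces the same hash as the char-based one. A quick sketch (the standalone hashCode helper mirrors the patched shape, the class name is ours):

```java
import java.nio.charset.StandardCharsets;

public class Latin1Hash {
    // Same shape as the patched StringLatin1.hashCode: (v & 0xff) widens the
    // signed byte to its unsigned value, which the JIT folds into a movzbl load.
    static int hashCode(byte[] value) {
        int h = 0;
        for (byte v : value) {
            h = 31 * h + (v & 0xff);
        }
        return h;
    }

    public static void main(String[] args) {
        // 'é' is 0xE9: negative as a signed byte, 233 after masking
        String s = "été";
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(hashCode(latin1) == s.hashCode());  // true
    }
}
```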
baseline, cmp1 perfasm (hottest loop):

  0.02%  0.02%  0x00007f5350ae1f90: mov    %esi,%edx               ; %edx = %esi * 31
  0.02%         0x00007f5350ae1f92: shl    $0x5,%edx
  7.46%  3.88%  0x00007f5350ae1f95: movzwl 0x10(%rcx,%rbx,2),%r8d  ; get a[0]
  0.02%         0x00007f5350ae1f9b: sub    %esi,%edx
  7.46%  4.11%  0x00007f5350ae1f9d: add    %r8d,%edx               ; mix a[0]
  7.63%  7.62%  0x00007f5350ae1fa0: movslq %ebx,%r11
  0.02%         0x00007f5350ae1fa3: movzwl 0x12(%rcx,%r11,2),%r8d  ; get a[1]
  0.02%         0x00007f5350ae1fa9: movzwl 0x16(%rcx,%r11,2),%eax  ; get a[3]
                0x00007f5350ae1faf: movzwl 0x14(%rcx,%r11,2),%r9d  ; get a[2]
  8.27%  3.78%  0x00007f5350ae1fb5: mov    %edx,%r11d              ; %r11 = %edx * 31
  0.05%         0x00007f5350ae1fb8: shl    $0x5,%r11d
                0x00007f5350ae1fbc: sub    %edx,%r11d
  7.55%  3.74%  0x00007f5350ae1fbf: add    %r8d,%r11d              ; mix a[1]
  8.01% 19.16%  0x00007f5350ae1fc2: mov    %r11d,%edx              ; %edx = %r11 * 31
  0.03%         0x00007f5350ae1fc5: shl    $0x5,%edx
  7.90%  8.16%  0x00007f5350ae1fc8: sub    %r11d,%edx
  7.71%  4.18%  0x00007f5350ae1fcb: add    %r9d,%edx               ; mix a[2]
  7.85% 15.02%  0x00007f5350ae1fce: mov    %edx,%esi               ; %esi = %edx * 31
  0.02%         0x00007f5350ae1fd0: shl    $0x5,%esi
  7.61% 11.76%  0x00007f5350ae1fd3: sub    %edx,%esi
  7.49%  3.83%  0x00007f5350ae1fd5: add    %eax,%esi               ; mix a[3]
  7.93%  7.74%  0x00007f5350ae1fd7: add    $0x4,%ebx               ; loop counter += 4
                0x00007f5350ae1fda: cmp    %r10d,%ebx              ; loop back branch
                0x00007f5350ae1fdd: jl     0x00007f5350ae1f90

patched, cmp1 perfasm (hottest loop):

  0.02%         0x00007fc18ba45130: mov    %ecx,%edi               ; %edi = %ecx * 31
  0.03%         0x00007fc18ba45132: shl    $0x5,%edi
  8.05%  4.31%  0x00007fc18ba45135: sub    %ecx,%edi
  7.70%  3.86%  0x00007fc18ba45137: movslq %r9d,%r11
                0x00007fc18ba4513a: movzbl 0x10(%rbx,%r11,1),%r8d  ; get a[0]
  0.07%  0.05%  0x00007fc18ba45140: add    %r8d,%edi               ; mix a[0]
  7.97%  7.24%  0x00007fc18ba45143: movslq %r9d,%r11
  0.05%  0.07%  0x00007fc18ba45146: movzbl 0x11(%rbx,%r11,1),%ecx  ; get a[1]
                0x00007fc18ba4514c: movzbl 0x13(%rbx,%r11,1),%edx  ; get a[3]
  0.05%  0.02%  0x00007fc18ba45152: movzbl 0x12(%rbx,%r11,1),%r8d  ; get a[2]
  7.34%  3.59%  0x00007fc18ba45158: mov    %edi,%r11d              ; %r11 = %edi * 31
  0.08%  0.10%  0x00007fc18ba4515b: shl    $0x5,%r11d
  0.02%  0.02%  0x00007fc18ba4515f: sub    %edi,%r11d
  7.76%  3.58%  0x00007fc18ba45162: add    %ecx,%r11d              ; mix a[1]
  8.02% 17.60%  0x00007fc18ba45165: mov    %r11d,%edi              ; %edi = %r11 * 31
  0.07%  0.02%  0x00007fc18ba45168: shl    $0x5,%edi
  7.58%  3.46%  0x00007fc18ba4516b: sub    %r11d,%edi
  7.87% 11.82%  0x00007fc18ba4516e: add    %r8d,%edi               ; mix a[2]
  7.23% 14.38%  0x00007fc18ba45171: mov    %edi,%ecx               ; %ecx = %edi * 31
  0.05%         0x00007fc18ba45173: shl    $0x5,%ecx
  7.39% 10.95%  0x00007fc18ba45176: sub    %edi,%ecx
  7.58%  3.34%  0x00007fc18ba45178: add    %edx,%ecx               ; mix a[3]
  7.61%  7.34%  0x00007fc18ba4517a: add    $0x4,%r9d               ; loop counter += 4
  0.02%  0.02%  0x00007fc18ba4517e: cmp    %r10d,%r9d              ; loop back branch
                0x00007fc18ba45181: jl     0x00007fc18ba45130

It is somewhat surprising that the patched version brings no performance improvement, even
though the memory footprint for a large String is lower. It seems that computation costs
dominate the scenario, and the memory references are predicted perfectly. Character-count-wise,
the amount of work is the same.

=== cmp2 case

In the UTF16 case, however, the code is much more complicated, because it has to deal with
endianness:

  public static int hashCode(byte[] value) {
      int h = 0;
      int length = value.length >> 1;
      for (int i = 0; i < length; i++) {
          h = 31 * h + getChar(value, i);
      }
      return h;
  }

  public static char getChar(byte[] val, int index) {
      index <<= 1;
      return (char)(((val[index++] & 0xff) << HI_BYTE_SHIFT) |
                    ((val[index]   & 0xff) << LO_BYTE_SHIFT));
  }

But getChar() is also intrinsified by the VM, which handles the endianness on its own. The
generated code is rather complicated due to loop unrolling, so we will use -XX:LoopUnrollLimit=1
to limit it.
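The byte composition in getChar() can be checked in isolation. This sketch fixes a big-endian layout (HI_BYTE_SHIFT = 8, LO_BYTE_SHIFT = 0) for illustration; in the JDK these shifts are chosen from the platform byte order at startup, and the class name is ours:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Utf16GetChar {
    // Assumed big-endian layout (high byte first); StringUTF16 picks these
    // shifts to match the native byte order.
    static final int HI_BYTE_SHIFT = 8;
    static final int LO_BYTE_SHIFT = 0;

    static char getChar(byte[] val, int index) {
        index <<= 1;  // char index -> byte index
        return (char) (((val[index]     & 0xff) << HI_BYTE_SHIFT) |
                       ((val[index + 1] & 0xff) << LO_BYTE_SHIFT));
    }

    public static void main(String[] args) {
        byte[] val = {0x04, 0x10, 0x00, 0x41};  // U+0410, U+0041, big-endian
        System.out.println(getChar(val, 0) == '\u0410');  // true
        // Cross-check against a big-endian char view of the same bytes:
        ByteBuffer bb = ByteBuffer.wrap(val).order(ByteOrder.BIG_ENDIAN);
        System.out.println(getChar(val, 1) == bb.getChar(2));  // true
    }
}
```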
Here's the hottest loop:

   5.12%  5.41%  0x00007ff400ae0410: mov    %eax,%edi               ; $acc *= 31
   6.26%  6.92%  0x00007ff400ae0412: shl    $0x5,%edi
  10.35% 11.26%  0x00007ff400ae0415: sub    %eax,%edi
   8.90%  8.81%  0x00007ff400ae0417: mov    %ecx,%r11d
   4.47%  4.58%  0x00007ff400ae041a: shl    %r11d                   ; $idx *= 2
   5.70%  6.35%  0x00007ff400ae041d: mov    %r11d,%r8d
   6.85%  5.93%  0x00007ff400ae0420: add    $0x2,%r8d
   8.75%  7.95%  0x00007ff400ae0424: cmp    %r8d,%r9d               ; range check
                 0x00007ff400ae0427: jb     0x00007ff400ae04a4
   4.78%  3.91%  0x00007ff400ae042d: movslq %r11d,%r10              ; (unnecessary sign extension?)
   4.95%  5.71%  0x00007ff400ae0430: movzwl 0x10(%rdx,%r10,1),%eax  ; get $val[$idx]
   8.31%  8.42%  0x00007ff400ae0436: add    %edi,%eax               ; $acc += $val[$idx]
  11.96% 10.99%  0x00007ff400ae0438: inc    %ecx                    ; $idx++
   3.60%  3.85%  0x00007ff400ae043a: cmp    %r13d,%ecx              ; check limit and back branch
                 0x00007ff400ae043d: jl     0x00007ff400ae0410

The intrinsic version has the range check, and also an unnecessary sign extension (similar to
https://bugs.openjdk.java.net/browse/JDK-8074124).

=== cmp2 (prototype)

In fact, replacing the getChar() body with an Unsafe call like this:

diff -r 66bedbaa3cfe src/java.base/share/classes/java/lang/StringUTF16.java
--- a/src/java.base/share/classes/java/lang/StringUTF16.java  Mon Feb 23 18:23:04 2015 +0000
+++ b/src/java.base/share/classes/java/lang/StringUTF16.java  Fri Feb 27 19:16:03 2015 +0300
@@ -584,10 +584,13 @@
         val[index] = (byte)(c >> LO_BYTE_SHIFT);
     }

+    private static final sun.misc.Unsafe U = sun.misc.Unsafe.getUnsafe();
+    private static final long BYTE_ARR_OFFSET = U.arrayBaseOffset(byte[].class);
+    private static final long BYTE_ARR_SCALE = U.arrayIndexScale(byte[].class);
+
     public static char getChar(byte[] val, int index) {
         index <<= 1;
-        return (char)(((val[index++] & 0xff) << HI_BYTE_SHIFT) |
-                      ((val[index] & 0xff) << LO_BYTE_SHIFT));
+        return U.getChar(val, BYTE_ARR_OFFSET + BYTE_ARR_SCALE * index);
     }

...yields performance very close to both other configurations:
Baseline:

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp2       1  avgt   25    25.717 ± 0.080  ns/op
HashCodeBench.cmp2      64  avgt   25    47.315 ± 0.020  ns/op
HashCodeBench.cmp2    4096  avgt   25  3083.622 ± 0.913  ns/op

Patched:

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp2       1  avgt   25    27.421 ± 0.684  ns/op
HashCodeBench.cmp2      64  avgt   25    49.013 ± 0.206  ns/op
HashCodeBench.cmp2    4096  avgt   25  3089.809 ± 4.960  ns/op

Patched (prototype):

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp2       1  avgt   25    27.261 ± 0.056  ns/op
HashCodeBench.cmp2      64  avgt   25    49.179 ± 0.357  ns/op
HashCodeBench.cmp2    4096  avgt   25  3094.200 ± 4.437  ns/op

The prototype is still a bit slower than the baseline, because it has a few minor codegen
inefficiencies:

   1.06%  0.89%  0x00007f890cae09f0: mov    %eax,%r11d              ; $acc *= 31
  13.06% 15.23%  0x00007f890cae09f3: shl    $0x5,%r11d
   0.91%  0.98%  0x00007f890cae09f7: sub    %eax,%r11d
  12.25% 11.25%  0x00007f890cae09fa: shl    %r9d                    ; $idx *= 2
   1.03%  0.65%  0x00007f890cae09fd: mov    %r9d,%r10d
  11.85% 10.88%  0x00007f890cae0a00: add    $0x2,%r10d
   1.21%  0.83%  0x00007f890cae0a04: mov    %r11d,%eax
  12.45% 11.88%  0x00007f890cae0a07: mov    %r8d,%r9d               ; (unnecessary move)
   0.81%  0.96%  0x00007f890cae0a0a: movslq %r10d,%r10              ; (unnecessary sign extension?)
  12.38% 11.55%  0x00007f890cae0a0d: movzwl 0x10(%rbx,%r10,1),%r11d ; get $val[$idx]
   1.21%  1.26%  0x00007f890cae0a13: add    %r11d,%eax              ; $acc += $val[$idx]
  12.45% 12.74%  0x00007f890cae0a16: mov    %r9d,%r8d               ; (unnecessary move)
   0.79%  0.89%  0x00007f890cae0a19: inc    %r8d                    ; $idx++
  11.54% 13.06%  0x00007f890cae0a1c: cmp    %ecx,%r8d               ; check limit and back branch
                 0x00007f890cae0a1f: jl     0x00007f890cae09f0

== Conclusion

1. String.hashCode performance for Latin1 strings is not affected: the new code is almost the
same as the previous one. If we ever intrinsify the hashCode computation, or if we run a
scenario where the array walks experience cache misses, the new version should win.

2. String.hashCode performance for UTF16 strings is not affected either: the new code performs
the same.
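As an aside, the Unsafe prototype in the cmp2 section can be expressed with the standard JDK 9 VarHandle API; this is our analogue for illustration, not part of the actual patch. A byte-array view var handle reads a char per call in native byte order, much like U.getChar:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class VarHandleGetChar {
    // View the byte[] as chars in native byte order; plain get/set tolerate
    // unaligned byte offsets on heap arrays.
    static final VarHandle CHARS =
        MethodHandles.byteArrayViewVarHandle(char[].class, ByteOrder.nativeOrder());

    static char getChar(byte[] val, int index) {
        return (char) CHARS.get(val, index << 1);  // byte offset = 2 * char index
    }

    public static void main(String[] args) {
        char[] src = {'h', 'i'};
        byte[] val = new byte[src.length * 2];
        for (int i = 0; i < src.length; i++) {
            CHARS.set(val, i * 2, src[i]);  // store in native order
        }
        System.out.println(getChar(val, 0));  // h
        System.out.println(getChar(val, 1));  // i
    }
}
```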