= String.hashCode performance

https://bugs.openjdk.java.net/browse/JDK-8054307

The tests are done on the JDK 9 Sandbox repo, 1x4x2 i7-4790K (Haswell) 4.0 GHz, Linux x86_64.
Benchmark code is available at:
  http://cr.openjdk.java.net/~shade/density/

== Background

String hashcode is a simple P(31) polynomial hash. The JDK implementation, however, caches
String hashcodes, and uses "0" as the marker for a "not yet computed" hashcode. We can exploit
that quirk in our benchmarks by carefully constructing strings with zero hashcodes, containing
either Latin1-only or UTF16 characters. To do that, we pad two known zero-hashcode English and
Russian phrases with zero characters. Since the prefix hashcode is zero, such padding does not
change the hashcode. Given such a string, the JDK implementation always recomputes the hashcode.

While hashcode caching greatly alleviates the hashcode computation costs, in many cases users
deal with young Strings: they compute and use the hashcode once, then discard the string, rinse
and repeat. Then the cost of String hashcode computation may become important.
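The padding trick can be sketched as follows. This is a minimal illustration, not the actual benchmark setup: it uses the empty string as a trivially zero-hashcode prefix instead of the English/Russian phrases. Appending a character c maps the hash h to 31*h + c, so padding with '\u0000' preserves the hash only when the prefix hash is already zero:

```java
public class ZeroHashPadding {
    public static void main(String[] args) {
        // The empty string trivially hashes to 0, the "not computed" cache marker.
        // Appending '\u0000' maps h -> 31*h + 0, so a zero hash stays zero:
        String padded = "" + "\u0000\u0000\u0000\u0000";
        System.out.println(padded.hashCode());  // 0: hashCode is recomputed on every call

        // A non-zero prefix hash would instead be multiplied by 31 per pad character:
        String a = "a" + "\u0000";
        System.out.println(a.hashCode());       // 31 * 'a' = 3007
    }
}
```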
== Benchmarks

Baseline:

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp1       1  avgt   25    19.667 ± 0.226  ns/op
HashCodeBench.cmp1      64  avgt   25    47.512 ± 0.398  ns/op
HashCodeBench.cmp1    4096  avgt   25  3084.317 ± 1.345  ns/op
HashCodeBench.cmp2       1  avgt   25    25.717 ± 0.080  ns/op
HashCodeBench.cmp2      64  avgt   25    47.315 ± 0.020  ns/op
HashCodeBench.cmp2    4096  avgt   25  3083.622 ± 0.913  ns/op

Patched:

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp1       1  avgt   25    19.816 ± 0.085  ns/op
HashCodeBench.cmp1      64  avgt   25    47.889 ± 0.583  ns/op
HashCodeBench.cmp1    4096  avgt   25  3084.170 ± 1.417  ns/op
HashCodeBench.cmp2       1  avgt   25    27.421 ± 0.684  ns/op
HashCodeBench.cmp2      64  avgt   25    49.013 ± 0.206  ns/op
HashCodeBench.cmp2    4096  avgt   25  3089.809 ± 4.960  ns/op

== Analysis

=== cmp1 case

Notice how the cmp1 case is not affected at all, because the code in StringLatin1.hashCode()
is very similar to the baseline version:

baseline:
  public int hashCode() {
      int h = hash;
      if (h == 0) {
          for (char v : value) {
              h = 31 * h + v;
          }
          hash = h;
      }
      return h;
  }

patched (Latin1):
  public static int hashCode(byte[] value) {
      int h = 0;
      for (byte v : value) {
          h = 31 * h + (v & 0xff);
      }
      return h;
  }

One can expect the sub-byte masking to be optimized, and indeed it is folded by the compiler
into a byte load. The disassemblies are nearly identical; the only difference is using either a
zero-extended short read, or a zero-extended byte read.
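The (v & 0xff) masking matters for correctness: it widens the signed byte back to its unsigned Latin1 code point, so the byte-array loop produces the same hash as the char-based one. A quick sketch (the standalone hashCode helper mirrors the patched shape, the class name is ours):

```java
import java.nio.charset.StandardCharsets;

public class Latin1Hash {
    // Same shape as the patched StringLatin1.hashCode: (v & 0xff) widens the
    // signed byte to its unsigned value, which the JIT folds into a movzbl load.
    static int hashCode(byte[] value) {
        int h = 0;
        for (byte v : value) {
            h = 31 * h + (v & 0xff);
        }
        return h;
    }

    public static void main(String[] args) {
        // 'é' is 0xE9: negative as a signed byte, 233 after masking
        String s = "été";
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(hashCode(latin1) == s.hashCode());  // true
    }
}
```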
baseline, cmp1 perfasm (hottest loop):

  0.02%  0.02%  0x00007f5350ae1f90: mov    %esi,%edx               ; %edx = %esi * 31
  0.02%         0x00007f5350ae1f92: shl    $0x5,%edx
  7.46%  3.88%  0x00007f5350ae1f95: movzwl 0x10(%rcx,%rbx,2),%r8d  ; get a[0]
  0.02%         0x00007f5350ae1f9b: sub    %esi,%edx
  7.46%  4.11%  0x00007f5350ae1f9d: add    %r8d,%edx               ; mix a[0]
  7.63%  7.62%  0x00007f5350ae1fa0: movslq %ebx,%r11
  0.02%         0x00007f5350ae1fa3: movzwl 0x12(%rcx,%r11,2),%r8d  ; get a[1]
  0.02%         0x00007f5350ae1fa9: movzwl 0x16(%rcx,%r11,2),%eax  ; get a[3]
                0x00007f5350ae1faf: movzwl 0x14(%rcx,%r11,2),%r9d  ; get a[2]
  8.27%  3.78%  0x00007f5350ae1fb5: mov    %edx,%r11d              ; %r11 = %edx * 31
  0.05%         0x00007f5350ae1fb8: shl    $0x5,%r11d
                0x00007f5350ae1fbc: sub    %edx,%r11d
  7.55%  3.74%  0x00007f5350ae1fbf: add    %r8d,%r11d              ; mix a[1]
  8.01% 19.16%  0x00007f5350ae1fc2: mov    %r11d,%edx              ; %edx = %r11 * 31
  0.03%         0x00007f5350ae1fc5: shl    $0x5,%edx
  7.90%  8.16%  0x00007f5350ae1fc8: sub    %r11d,%edx
  7.71%  4.18%  0x00007f5350ae1fcb: add    %r9d,%edx               ; mix a[2]
  7.85% 15.02%  0x00007f5350ae1fce: mov    %edx,%esi               ; %esi = %edx * 31
  0.02%         0x00007f5350ae1fd0: shl    $0x5,%esi
  7.61% 11.76%  0x00007f5350ae1fd3: sub    %edx,%esi
  7.49%  3.83%  0x00007f5350ae1fd5: add    %eax,%esi               ; mix a[3]
  7.93%  7.74%  0x00007f5350ae1fd7: add    $0x4,%ebx               ; loop counter += 4
                0x00007f5350ae1fda: cmp    %r10d,%ebx              ; loop back branch
                0x00007f5350ae1fdd: jl     0x00007f5350ae1f90

patched, cmp1 perfasm (hottest loop):

  0.02%         0x00007fc18ba45130: mov    %ecx,%edi               ; %edi = %ecx * 31
  0.03%         0x00007fc18ba45132: shl    $0x5,%edi
  8.05%  4.31%  0x00007fc18ba45135: sub    %ecx,%edi
  7.70%  3.86%  0x00007fc18ba45137: movslq %r9d,%r11
                0x00007fc18ba4513a: movzbl 0x10(%rbx,%r11,1),%r8d  ; get a[0]
  0.07%  0.05%  0x00007fc18ba45140: add    %r8d,%edi               ; mix a[0]
  7.97%  7.24%  0x00007fc18ba45143: movslq %r9d,%r11
  0.05%  0.07%  0x00007fc18ba45146: movzbl 0x11(%rbx,%r11,1),%ecx  ; get a[1]
                0x00007fc18ba4514c: movzbl 0x13(%rbx,%r11,1),%edx  ; get a[3]
  0.05%  0.02%  0x00007fc18ba45152: movzbl 0x12(%rbx,%r11,1),%r8d  ; get a[2]
  7.34%  3.59%  0x00007fc18ba45158: mov    %edi,%r11d              ; %r11 = %edi * 31
  0.08%  0.10%  0x00007fc18ba4515b: shl    $0x5,%r11d
  0.02%  0.02%  0x00007fc18ba4515f: sub    %edi,%r11d
  7.76%  3.58%  0x00007fc18ba45162: add    %ecx,%r11d              ; mix a[1]
  8.02% 17.60%  0x00007fc18ba45165: mov    %r11d,%edi              ; %edi = %r11 * 31
  0.07%  0.02%  0x00007fc18ba45168: shl    $0x5,%edi
  7.58%  3.46%  0x00007fc18ba4516b: sub    %r11d,%edi
  7.87% 11.82%  0x00007fc18ba4516e: add    %r8d,%edi               ; mix a[2]
  7.23% 14.38%  0x00007fc18ba45171: mov    %edi,%ecx               ; %ecx = %edi * 31
  0.05%         0x00007fc18ba45173: shl    $0x5,%ecx
  7.39% 10.95%  0x00007fc18ba45176: sub    %edi,%ecx
  7.58%  3.34%  0x00007fc18ba45178: add    %edx,%ecx               ; mix a[3]
  7.61%  7.34%  0x00007fc18ba4517a: add    $0x4,%r9d               ; loop counter += 4
  0.02%  0.02%  0x00007fc18ba4517e: cmp    %r10d,%r9d              ; loop back branch
                0x00007fc18ba45181: jl     0x00007fc18ba45130

It is somewhat surprising that the patched version brings no performance improvement, even
though the memory footprint for a large String is lower. It seems that computation costs
dominate the scenario, and the memory references are predicted perfectly. Character-count-wise,
the amount of work is the same.

=== cmp2 case

In the UTF16 case, however, the code is much more complicated, because it has to deal with
endianness:

  public static int hashCode(byte[] value) {
      int h = 0;
      int length = value.length >> 1;
      for (int i = 0; i < length; i++) {
          h = 31 * h + getChar(value, i);
      }
      return h;
  }

  public static char getChar(byte[] val, int index) {
      index <<= 1;
      return (char)(((val[index++] & 0xff) << HI_BYTE_SHIFT) |
                    ((val[index]   & 0xff) << LO_BYTE_SHIFT));
  }

But getChar() is also intrinsified by the VM, which handles the endianness on its own. The
generated code is rather complicated due to loop unrolling, so we will use -XX:LoopUnrollLimit=1
to limit it.
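The byte composition in getChar() can be checked in isolation. This sketch fixes a big-endian layout (HI_BYTE_SHIFT = 8, LO_BYTE_SHIFT = 0) for illustration; in the JDK these shifts are chosen from the platform byte order at startup, and the class name is ours:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Utf16GetChar {
    // Assumed big-endian layout (high byte first); StringUTF16 picks these
    // shifts to match the native byte order.
    static final int HI_BYTE_SHIFT = 8;
    static final int LO_BYTE_SHIFT = 0;

    static char getChar(byte[] val, int index) {
        index <<= 1;  // char index -> byte index
        return (char) (((val[index]     & 0xff) << HI_BYTE_SHIFT) |
                       ((val[index + 1] & 0xff) << LO_BYTE_SHIFT));
    }

    public static void main(String[] args) {
        byte[] val = {0x04, 0x10, 0x00, 0x41};  // U+0410, U+0041, big-endian
        System.out.println(getChar(val, 0) == '\u0410');  // true
        // Cross-check against a big-endian char view of the same bytes:
        ByteBuffer bb = ByteBuffer.wrap(val).order(ByteOrder.BIG_ENDIAN);
        System.out.println(getChar(val, 1) == bb.getChar(2));  // true
    }
}
```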
Here's the hottest loop:

   5.12%  5.41%  0x00007ff400ae0410: mov    %eax,%edi               ; $acc *= 31
   6.26%  6.92%  0x00007ff400ae0412: shl    $0x5,%edi
  10.35% 11.26%  0x00007ff400ae0415: sub    %eax,%edi
   8.90%  8.81%  0x00007ff400ae0417: mov    %ecx,%r11d
   4.47%  4.58%  0x00007ff400ae041a: shl    %r11d                   ; $idx *= 2
   5.70%  6.35%  0x00007ff400ae041d: mov    %r11d,%r8d
   6.85%  5.93%  0x00007ff400ae0420: add    $0x2,%r8d
   8.75%  7.95%  0x00007ff400ae0424: cmp    %r8d,%r9d               ; range check
                 0x00007ff400ae0427: jb     0x00007ff400ae04a4
   4.78%  3.91%  0x00007ff400ae042d: movslq %r11d,%r10              ; (unnecessary sign extension?)
   4.95%  5.71%  0x00007ff400ae0430: movzwl 0x10(%rdx,%r10,1),%eax  ; get $val[$idx]
   8.31%  8.42%  0x00007ff400ae0436: add    %edi,%eax               ; $acc += $val[$idx]
  11.96% 10.99%  0x00007ff400ae0438: inc    %ecx                    ; $idx++
   3.60%  3.85%  0x00007ff400ae043a: cmp    %r13d,%ecx              ; check limit and back branch
                 0x00007ff400ae043d: jl     0x00007ff400ae0410

The intrinsic version has the range check, and also an unnecessary sign extension (similar to
https://bugs.openjdk.java.net/browse/JDK-8074124).

=== cmp2 (prototype)

In fact, replacing the getChar() body with an Unsafe call like this:

diff -r 66bedbaa3cfe src/java.base/share/classes/java/lang/StringUTF16.java
--- a/src/java.base/share/classes/java/lang/StringUTF16.java  Mon Feb 23 18:23:04 2015 +0000
+++ b/src/java.base/share/classes/java/lang/StringUTF16.java  Fri Feb 27 19:16:03 2015 +0300
@@ -584,10 +584,13 @@
         val[index] = (byte)(c >> LO_BYTE_SHIFT);
     }

+    private static final sun.misc.Unsafe U = sun.misc.Unsafe.getUnsafe();
+    private static final long BYTE_ARR_OFFSET = U.arrayBaseOffset(byte[].class);
+    private static final long BYTE_ARR_SCALE = U.arrayIndexScale(byte[].class);
+
     public static char getChar(byte[] val, int index) {
         index <<= 1;
-        return (char)(((val[index++] & 0xff) << HI_BYTE_SHIFT) |
-                      ((val[index] & 0xff) << LO_BYTE_SHIFT));
+        return U.getChar(val, BYTE_ARR_OFFSET + BYTE_ARR_SCALE * index);
     }

...yields performance very close to both other configurations:
Baseline:

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp2       1  avgt   25    25.717 ± 0.080  ns/op
HashCodeBench.cmp2      64  avgt   25    47.315 ± 0.020  ns/op
HashCodeBench.cmp2    4096  avgt   25  3083.622 ± 0.913  ns/op

Patched:

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp2       1  avgt   25    27.421 ± 0.684  ns/op
HashCodeBench.cmp2      64  avgt   25    49.013 ± 0.206  ns/op
HashCodeBench.cmp2    4096  avgt   25  3089.809 ± 4.960  ns/op

Patched (prototype):

Benchmark           (size)  Mode  Cnt     Score   Error  Units
HashCodeBench.cmp2       1  avgt   25    27.261 ± 0.056  ns/op
HashCodeBench.cmp2      64  avgt   25    49.179 ± 0.357  ns/op
HashCodeBench.cmp2    4096  avgt   25  3094.200 ± 4.437  ns/op

The prototype is still a bit slower than the baseline, because it has a few minor codegen
inefficiencies:

   1.06%  0.89%  0x00007f890cae09f0: mov    %eax,%r11d              ; $acc *= 31
  13.06% 15.23%  0x00007f890cae09f3: shl    $0x5,%r11d
   0.91%  0.98%  0x00007f890cae09f7: sub    %eax,%r11d
  12.25% 11.25%  0x00007f890cae09fa: shl    %r9d                    ; $idx *= 2
   1.03%  0.65%  0x00007f890cae09fd: mov    %r9d,%r10d
  11.85% 10.88%  0x00007f890cae0a00: add    $0x2,%r10d
   1.21%  0.83%  0x00007f890cae0a04: mov    %r11d,%eax
  12.45% 11.88%  0x00007f890cae0a07: mov    %r8d,%r9d               ; (unnecessary move)
   0.81%  0.96%  0x00007f890cae0a0a: movslq %r10d,%r10              ; (unnecessary sign extension?)
  12.38% 11.55%  0x00007f890cae0a0d: movzwl 0x10(%rbx,%r10,1),%r11d ; get $val[$idx]
   1.21%  1.26%  0x00007f890cae0a13: add    %r11d,%eax              ; $acc += $val[$idx]
  12.45% 12.74%  0x00007f890cae0a16: mov    %r9d,%r8d               ; (unnecessary move)
   0.79%  0.89%  0x00007f890cae0a19: inc    %r8d                    ; $idx++
  11.54% 13.06%  0x00007f890cae0a1c: cmp    %ecx,%r8d               ; check limit and back branch
                 0x00007f890cae0a1f: jl     0x00007f890cae09f0

== Conclusion

1. String.hashCode performance for Latin1 strings is not affected: the new code is almost the
same as the previous one. If we ever intrinsify the hashCode computation, or if we run a
scenario where the array walks experience cache misses, the new version should win.

2. String.hashCode performance for UTF16 strings is not affected either: the new code performs
the same.
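As an aside, the Unsafe prototype in the cmp2 section can be expressed with the standard JDK 9 VarHandle API; this is our analogue for illustration, not part of the actual patch. A byte-array view var handle reads a char per call in native byte order, much like U.getChar:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class VarHandleGetChar {
    // View the byte[] as chars in native byte order; plain get/set tolerate
    // unaligned byte offsets on heap arrays.
    static final VarHandle CHARS =
        MethodHandles.byteArrayViewVarHandle(char[].class, ByteOrder.nativeOrder());

    static char getChar(byte[] val, int index) {
        return (char) CHARS.get(val, index << 1);  // byte offset = 2 * char index
    }

    public static void main(String[] args) {
        char[] src = {'h', 'i'};
        byte[] val = new byte[src.length * 2];
        for (int i = 0; i < src.length; i++) {
            CHARS.set(val, i * 2, src[i]);  // store in native order
        }
        System.out.println(getChar(val, 0));  // h
        System.out.println(getChar(val, 1));  // i
    }
}
```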