= Experimenting with binary selection of Coders

https://bugs.openjdk.java.net/browse/JDK-8054307

The tests are done on the JDK 9 Sandbox repo, on a 1x4x2 i7-4790K (Haswell) 4.0 GHz,
Linux x86_64. Benchmark code is available at:
  http://cr.openjdk.java.net/~shade/density/

== Background

It is known from previous experiments that coder selection may be a significant chunk
of work, especially where fast String operations are concerned. Previously, we figured
out that dispatching over static methods is the best alternative:
  http://cr.openjdk.java.net/~shade/density/selection-monomorphic.txt
  http://shipilev.net/blog/2015/black-magic-method-dispatch/

However, in binary cases, when we need to act based on two Strings, and therefore two
coders, the selection is even more complicated. This post tries to figure out the best
way to do a binary op on two coded Strings.

== Benchmark

To keep the test case as close to the current String code as possible, we simulate
String with this class:

  public static class Data {
    byte coder;
    byte[] data;
  }

...and dispatch statically over the "coder" implementations:

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public static int process_0_0(byte[] d1, byte[] d2) {
    return d1.length + d2.length;
  }

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public static int process_0_1(byte[] d1, byte[] d2) {
    return d1.length + d2.length;
  }

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public static int process_1_0(byte[] d1, byte[] d2) {
    return d1.length + d2.length;
  }

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public static int process_1_1(byte[] d1, byte[] d2) {
    return d1.length + d2.length;
  }

There are a few ideas on how to do this dispatch, including:

1) Explicit "coder" ID field, and dispatch on it:

  public int doByID(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    byte c = coder;
    byte oc = o.coder;
    if (c == 0) {
      if (oc == 0) return process_0_0(d, od);
      else         return process_0_1(d, od);
    } else {
      if (oc == 0) return process_1_0(d, od);
      else         return process_1_1(d, od);
    }
  }

(In the subsequent code examples, we shrink the actual dispatch code with a shortcut.)

2) The first element of the byte[] array contains the coder value. The drawback of this
encoding scheme is that we would need to make the processors properly skip the head of
the array, which may introduce some overhead. Also, accessing the array element from
Java code may incur an array bounds check, so we might also want an additional test
with Unsafe.getByte() that bypasses the language-level checks. (A sketch of obtaining
the Unsafe instance follows this list.)

  static final Unsafe U;
  static final long IDX = U.arrayBaseOffset(byte[].class);

  public int doByFirst(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    byte c = d[0];
    byte oc = od[0];
    // ... dispatch over (c, oc), as in doByID ...
  }

  public int doByFirstUnsafe(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    byte c = U.getByte(d, IDX);
    byte oc = U.getByte(od, IDX);
    // ... dispatch over (c, oc), as in doByID ...
  }

3) Tagging the byte[] pointer. This cannot be done in a platform-independent manner
purely from Java code. Having said that, we can emulate pointer tagging by encoding the
coder in the byte[] arraylength. Note that 2-byte String byte[] arrays will almost
always be even-sized, and we can probably make sure that 1-byte String byte[] arrays
are odd-sized. The exact details of how that is done (or whether it is doable to begin
with) are irrelevant for this experiment.

  public int doByLen(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    byte c = (byte) (d.length & 1);
    byte oc = (byte) (od.length & 1);
    // ... dispatch over (c, oc), as in doByID ...
  }
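As promised above, here is a minimal sketch of how the Unsafe instance used by
doByFirstUnsafe is usually obtained; the actual benchmark code (linked above) may do
this differently:

  // Assumed imports: sun.misc.Unsafe, java.lang.reflect.Field
  static final Unsafe U;
  static {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");  // private singleton field
      f.setAccessible(true);
      U = (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }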
We also have a baseline test that does not select coders, but blindly invokes the
binary operation on the data. This test case provides a baseline that behaves like our
current String implementation:

  public int baselineRef(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    return process_0_0(d, od);
  }
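The (bias) and (count) columns in the result tables below are benchmark parameters; the
actual harness is in the benchmark code linked above. As a rough, hypothetical sketch
of the setup, assuming "bias" is the fraction of coder = 1 instances and "count" is the
number of Data pairs processed per invocation (class, field, and method names here are
illustrative, not the actual code):

  // Assumed imports: org.openjdk.jmh.annotations.*, org.openjdk.jmh.infra.Blackhole,
  //                  java.util.Random, java.util.concurrent.TimeUnit
  @State(Scope.Benchmark)
  public static class Pairs {
    @Param({"0.00", "0.25", "0.50", "0.75", "1.00"})
    double bias;                           // assumed: fraction of coder = 1 instances

    @Param("10000")
    int count;                             // assumed: Data pairs per invocation

    Data[] as, bs;

    @Setup
    public void setup() {
      Random r = new Random(42);
      as = new Data[count];
      bs = new Data[count];
      for (int i = 0; i < count; i++) {
        as[i] = make(r);
        bs[i] = make(r);
      }
    }

    private Data make(Random r) {
      Data d = new Data();
      d.coder = (byte) (r.nextDouble() < bias ? 1 : 0);
      d.data = new byte[16 + d.coder];     // odd/even arraylength mirrors the coder (doByLen)
      d.data[0] = d.coder;                 // data[0] mirrors the coder (doByFirst*)
      return d;
    }
  }

  @Benchmark
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.MICROSECONDS)
  public void selectByID(Pairs p, Blackhole bh) {
    Data[] as = p.as;
    Data[] bs = p.bs;
    for (int i = 0; i < as.length; i++) {
      bh.consume(as[i].doByID(bs[i]));     // assumes the doBy* methods live on Data
    }
  }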
== Performance

=== Out of the box (C2) performance

Benchmark                       (bias)  (count)  Mode  Cnt    Score   Error  Units
PairSelect.baselineRef            0.00    10000  avgt  150   67.405 ± 0.394  us/op
PairSelect.baselineRef            0.25    10000  avgt  150   68.047 ± 1.017  us/op
PairSelect.baselineRef            0.50    10000  avgt  150   68.277 ± 1.066  us/op
PairSelect.baselineRef            0.75    10000  avgt  150   67.582 ± 0.331  us/op
PairSelect.baselineRef            1.00    10000  avgt  150   67.635 ± 0.461  us/op
PairSelect.selectByFirst          0.00    10000  avgt  150   80.835 ± 1.278  us/op
PairSelect.selectByFirst          0.25    10000  avgt  150  111.631 ± 1.092  us/op
PairSelect.selectByFirst          0.50    10000  avgt  150  137.566 ± 1.253  us/op
PairSelect.selectByFirst          0.75    10000  avgt  150  110.407 ± 0.812  us/op
PairSelect.selectByFirst          1.00    10000  avgt  150   81.984 ± 1.572  us/op
PairSelect.selectByFirstUnsafe    0.00    10000  avgt  150   73.599 ± 0.823  us/op
PairSelect.selectByFirstUnsafe    0.25    10000  avgt  150  103.865 ± 1.303  us/op
PairSelect.selectByFirstUnsafe    0.50    10000  avgt  150  129.304 ± 1.230  us/op
PairSelect.selectByFirstUnsafe    0.75    10000  avgt  150  104.947 ± 1.664  us/op
PairSelect.selectByFirstUnsafe    1.00    10000  avgt  150   74.479 ± 1.445  us/op
PairSelect.selectByID             0.00    10000  avgt  150   71.071 ± 0.341  us/op
PairSelect.selectByID             0.25    10000  avgt  150   94.972 ± 0.559  us/op
PairSelect.selectByID             0.50    10000  avgt  150  116.038 ± 0.905  us/op
PairSelect.selectByID             0.75    10000  avgt  150   95.571 ± 1.190  us/op
PairSelect.selectByID             1.00    10000  avgt  150   71.455 ± 0.649  us/op
PairSelect.selectByLen            0.00    10000  avgt  150   76.011 ± 1.232  us/op
PairSelect.selectByLen            0.25    10000  avgt  150  105.915 ± 0.407  us/op
PairSelect.selectByLen            0.50    10000  avgt  150  133.260 ± 1.396  us/op
PairSelect.selectByLen            0.75    10000  avgt  150  106.293 ± 1.773  us/op
PairSelect.selectByLen            1.00    10000  avgt  150   76.885 ± 1.180  us/op

As we can see, selectByID beats every other variant, and it is very close to the
baseline, being only ~5% slower in the bias=0.0 and bias=1.0 cases. The other biases
suffer from (unavoidable) branch misprediction misses.

While it is prudent to check the assembly for all cases (and we sure did), it is enough
to look at the disassemblies at bias=0.0 to understand the difference. The epilogue
seems to be the same in all cases (it is separated by a dashed line in the listings
below); what differs is the actual coder selection.
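The annotated listings below are in the style of JMH's perfasm profiler output (hot
regions with per-instruction event percentages). A minimal sketch of requesting such
output through the JMH Runner API, assuming that is how these runs were driven (a
disassembler plugin such as hsdis must be available):

  // A sketch, not the actual launcher used for these runs.
  // Assumed imports: org.openjdk.jmh.runner.Runner, org.openjdk.jmh.runner.RunnerException,
  //                  org.openjdk.jmh.runner.options.*
  public static void main(String... args) throws RunnerException {
    Options opt = new OptionsBuilder()
        .include("PairSelect")
        .addProfiler("perfasm")            // perf events mapped onto hot disassembly
        .build();
    new Runner(opt).run();
  }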
==== baselineRef

...
  2.83%    2.75%   0x00007f09a1190c8c: mov    0x10(%rdx),%r11d     ; get field $o.data
 10.00%   14.56%   0x00007f09a1190c90: mov    0x10(%rsi),%r10d     ; get field $this.data
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  0.52%    1.01%   0x00007f09a1190c94: mov    %r11,%rdx
  0.94%    0.44%   0x00007f09a1190c97: shl    $0x3,%rdx            ; unpack $o.data
  1.53%    1.71%   0x00007f09a1190c9b: mov    %r10,%rsi            ; unpack $this.data
  0.34%    0.47%   0x00007f09a1190c9e: shl    $0x3,%rsi
  0.40%    0.71%   0x00007f09a1190ca2: nop
  0.89%    0.25%   0x00007f09a1190ca3: callq  0x00007f09a1046420   ; call process_0_0
  0.12%    0.25%   0x00007f09a1190ca8: add    $0x10,%rsp           ; epilog and return
  0.89%    0.67%   0x00007f09a1190cac: pop    %rbp
  2.61%    1.56%   0x00007f09a1190cad: test   %eax,0x1723734d(%rip)
  0.12%    0.25%   0x00007f09a1190cb3: retq

The baseline test just unpacks the compressed references to the byte[] arrays before
calling the non-inlinable process_0_0 method.

==== selectByFirst

[Verified Entry Point]
...
  1.38%    1.78%   0x00007f8a35197cac: mov    0x10(%rdx),%r10d        ; get field $o.data
 12.13%   17.38%   0x00007f8a35197cb0: mov    0x10(%rsi),%r9d         ; get field $this.data
  0.99%    0.75%   0x00007f8a35197cb4: mov    0xc(%r12,%r9,8),%r8d    ; get $this.data.arraylength
  0.74%    0.47%   0x00007f8a35197cb9: test   %r8d,%r8d
                   0x00007f8a35197cbc: jbe    0x00007f8a35197cf8      ; fail, jump to exceptional path
  0.21%    0.26%   0x00007f8a35197cbe: mov    0xc(%r12,%r10,8),%r8d   ; get $o.data.arraylength
 31.58%   29.12%   0x00007f8a35197cc3: movsbl 0x10(%r12,%r9,8),%r11d  ; get $this.data[0]
  0.79%    0.34%   0x00007f8a35197cc9: test   %r8d,%r8d
                   0x00007f8a35197ccc: jbe    0x00007f8a35197d11      ; fail, jump to exceptional path
  1.79%    1.20%   0x00007f8a35197cce: movsbl 0x10(%r12,%r10,8),%ebp  ; get $o.data[0]
  2.66%    1.64%   0x00007f8a35197cd4: test   %r11d,%r11d             ; test ($this.data[0] == 0)
                   0x00007f8a35197cd7: jne    0x00007f8a35197d31      ; jump to unlikely path
  0.13%    0.14%   0x00007f8a35197cd9: test   %ebp,%ebp               ; test ($o.data[0] == 0)
                   0x00007f8a35197cdb: jne    0x00007f8a35197d51      ; jump to unlikely path
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  1.84%    1.36%   0x00007f8a35197cdd: lea    (%r12,%r10,8),%rdx      ; unpack $o.data
  0.42%    0.88%   0x00007f8a35197ce1: lea    (%r12,%r9,8),%rsi       ; unpack $this.data
  0.21%    0.29%   0x00007f8a35197ce7: callq  0x00007f8a35046420      ; call process_0_0
  0.54%    0.59%   0x00007f8a35197cec: add    $0x20,%rsp              ; epilog and return
  1.65%    1.28%   0x00007f8a35197cf0: pop    %rbp
  0.91%    0.53%   0x00007f8a35197cf1: test   %eax,0x187e8309(%rip)
  0.34%    0.61%   0x00007f8a35197cf7: retq

==== selectByFirstUnsafe

[Verified Entry Point]
...
  2.20%    2.12%   0x00007f83580fdd0c: mov    0x10(%rdx),%r10d        ; get field $o.data
  9.34%   14.65%   0x00007f83580fdd10: movsbl 0x10(%r12,%r10,8),%r8d  ; get $o.data[0]
 29.93%   24.27%   0x00007f83580fdd16: mov    0x10(%rsi),%r11d        ; get field $this.data
  0.71%    0.78%   0x00007f83580fdd1a: movsbl 0x10(%r12,%r11,8),%ebp  ; get $this.data[0]
  0.32%    0.25%   0x00007f83580fdd20: test   %ebp,%ebp               ; test ($this.data[0] == 0)
                   0x00007f83580fdd22: jne    0x00007f83580fdd48      ; jump to unlikely path
  0.12%    0.14%   0x00007f83580fdd24: test   %r8d,%r8d               ; test ($o.data[0] == 0)
                   0x00007f83580fdd27: jne    0x00007f83580fdd65      ; jump to unlikely path
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  1.97%    1.46%   0x00007f83580fdd29: mov    %r10,%rdx               ; unpack $o.data
  0.76%    0.71%   0x00007f83580fdd2c: shl    $0x3,%rdx
  0.41%    0.34%   0x00007f83580fdd30: mov    %r11,%rsi               ; unpack $this.data
  0.17%    0.29%   0x00007f83580fdd33: shl    $0x3,%rsi
  1.87%    1.88%   0x00007f83580fdd37: callq  0x00007f8357fac420      ; call process_0_0
  0.14%    0.19%   0x00007f83580fdd3c: add    $0x20,%rsp              ; epilog and return
  0.92%    1.07%   0x00007f83580fdd40: pop    %rbp
  1.76%    1.29%   0x00007f83580fdd41: test   %eax,0x15e702b9(%rip)
  0.25%    0.41%   0x00007f83580fdd47: retq

selectByFirstUnsafe notably wins over selectByFirst, because Unsafe allows us to spare
two bounds checks.

==== selectByID

[Verified Entry Point]
...
  2.01%    2.06%   0x00007f04acf71e0c: mov    0x10(%rdx),%r11d   ; get field $o.data
 12.74%   15.67%   0x00007f04acf71e10: movsbl 0xc(%rdx),%r10d    ; get field $o.coder
  4.11%    4.08%   0x00007f04acf71e15: movsbl 0xc(%rsi),%r9d     ; get field $this.coder
  0.64%    0.25%   0x00007f04acf71e1a: mov    0x10(%rsi),%ebp    ; get field $this.data
  0.62%    0.52%   0x00007f04acf71e1d: test   %r9d,%r9d          ; test ($this.coder == 0)
                   0x00007f04acf71e20: jne    0x00007f04acf71e48 ; jump to unlikely path
  0.54%    0.62%   0x00007f04acf71e22: test   %r10d,%r10d        ; test ($o.coder == 0)
                   0x00007f04acf71e25: jne    0x00007f04acf71e65 ; jump to unlikely path
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  1.04%    1.53%   0x00007f04acf71e27: mov    %r11,%rdx          ; unpack $o.data
  0.76%    0.27%   0x00007f04acf71e2a: shl    $0x3,%rdx
  0.78%    0.84%   0x00007f04acf71e2e: mov    %rbp,%rsi          ; unpack $this.data
  0.64%    0.72%   0x00007f04acf71e31: shl    $0x3,%rsi
  1.13%    1.10%   0x00007f04acf71e35: xchg   %ax,%ax
  0.67%    0.37%   0x00007f04acf71e37: callq  0x00007f04ace2b420 ; call process_0_0
  0.13%    0.13%   0x00007f04acf71e3c: add    $0x20,%rsp         ; epilog and return
  0.89%    0.54%   0x00007f04acf71e40: pop    %rbp
  2.22%    2.01%   0x00007f04acf71e41: test   %eax,0x15ff21b9(%rip)
  0.19%    0.20%   0x00007f04acf71e47: retq

selectByID, while code-wise very close to selectByFirstUnsafe, beats it in performance.
This is probably because the indirect memory access in selectByFirstUnsafe costs more
than the more direct load of the coder ID.

==== selectByLen

[Verified Entry Point]
...
  2.02%    0.72%   0x00007f5531194bec: mov    0x10(%rdx),%r10d       ; get field $o.data
 11.18%   14.92%   0x00007f5531194bf0: mov    0x10(%rsi),%r8d        ; get field $this.data
  0.58%    0.61%   0x00007f5531194bf4: mov    0xc(%r12,%r8,8),%r11d  ; get $this.data.arraylength
  1.16%    0.18%   0x00007f5531194bf9: mov    0xc(%r12,%r10,8),%ecx  ; get $o.data.arraylength
 34.01%   29.82%   0x00007f5531194bfe: and    $0x1,%ecx              ; mask
  1.69%    1.06%   0x00007f5531194c01: and    $0x1,%r11d             ; mask
  0.19%    0.34%   0x00007f5531194c05: test   %r11d,%r11d            ; test ($this.coder == 0)
                   0x00007f5531194c08: jne    0x00007f5531194c28     ; jump to unlikely path
  0.59%    0.16%   0x00007f5531194c0a: test   %ecx,%ecx              ; test ($o.coder == 0)
                   0x00007f5531194c0c: jne    0x00007f5531194c49     ; jump to unlikely path
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  1.94%    1.64%   0x00007f5531194c0e: lea    (%r12,%r10,8),%rdx     ; unpack $o.data
  0.06%    0.06%   0x00007f5531194c12: lea    (%r12,%r8,8),%rsi      ; unpack $this.data
  0.27%    0.53%   0x00007f5531194c16: nop
  0.61%    0.45%   0x00007f5531194c17: callq  0x00007f5531046420     ; call process_0_0
  0.08%    0.18%   0x00007f5531194c1c: add    $0x20,%rsp             ; epilog and return
  0.93%    1.64%   0x00007f5531194c20: pop    %rbp
  2.07%    1.25%   0x00007f5531194c21: test   %eax,0x177923d9(%rip)
  0.06%    0.08%   0x00007f5531194c27: retq

selectByLen additionally pays for the masking.

== C1

Benchmark                       (bias)  (count)  Mode  Cnt    Score    Error  Units
PairSelect.baselineRef            0.00    10000  avgt   25   68.054 ±  0.348  us/op
PairSelect.baselineRef            0.25    10000  avgt   25   68.152 ±  0.076  us/op
PairSelect.baselineRef            0.50    10000  avgt   25   68.673 ±  1.247  us/op
PairSelect.baselineRef            0.75    10000  avgt   25   68.265 ±  0.949  us/op
PairSelect.baselineRef            1.00    10000  avgt   25   69.253 ±  4.857  us/op
PairSelect.selectByFirst          0.00    10000  avgt   25   89.529 ±  6.966  us/op
PairSelect.selectByFirst          0.25    10000  avgt   25  115.199 ±  0.550  us/op
PairSelect.selectByFirst          0.50    10000  avgt   25  149.689 ±  8.972  us/op
PairSelect.selectByFirst          0.75    10000  avgt   25  115.809 ±  0.857  us/op
PairSelect.selectByFirst          1.00    10000  avgt   25   84.686 ±  0.642  us/op
PairSelect.selectByFirstUnsafe    0.00    10000  avgt   25   78.038 ±  1.071  us/op
PairSelect.selectByFirstUnsafe    0.25    10000  avgt   25  107.767 ±  2.083  us/op
PairSelect.selectByFirstUnsafe    0.50    10000  avgt   25  136.252 ±  0.691  us/op
PairSelect.selectByFirstUnsafe    0.75    10000  avgt   25  109.709 ±  2.815  us/op
PairSelect.selectByFirstUnsafe    1.00    10000  avgt   25   78.187 ±  2.797  us/op
PairSelect.selectByID             0.00    10000  avgt   25   76.958 ±  4.108  us/op
PairSelect.selectByID             0.25    10000  avgt   25  100.764 ±  0.586  us/op
PairSelect.selectByID             0.50    10000  avgt   25  124.001 ±  0.749  us/op
PairSelect.selectByID             0.75    10000  avgt   25   97.785 ±  0.415  us/op
PairSelect.selectByID             1.00    10000  avgt   25   73.564 ±  0.327  us/op
PairSelect.selectByLen            0.00    10000  avgt   25   81.998 ±  3.652  us/op
PairSelect.selectByLen            0.25    10000  avgt   25  113.365 ±  1.273  us/op
PairSelect.selectByLen            0.50    10000  avgt   25  143.963 ±  1.233  us/op
PairSelect.selectByLen            0.75    10000  avgt   25  114.169 ±  0.668  us/op
PairSelect.selectByLen            1.00    10000  avgt   25   78.964 ±  0.362  us/op

Similar to the C2 case, in C1 select-by-ID wins against all other variants, and is
reasonably close to the baseline. The disassembly shows similar reasons for this
performance difference.
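Before looking at the C1 disassemblies: C1-only numbers like the ones above can be
collected by capping HotSpot tiered compilation at tier 1. A minimal sketch, again
assuming the JMH Runner API as in the launcher sketch above (the exact invocation used
for these runs is not shown in this note):

  // A sketch: with tiered compilation stopped at level 1, only C1 compiles the code.
  Options opt = new OptionsBuilder()
      .include("PairSelect")
      .jvmArgsAppend("-XX:TieredStopAtLevel=1")   // stop tiered compilation at C1
      .build();
  new Runner(opt).run();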
==== baselineRef

[Verified Entry Point]
  0.73%    1.26%   0x00007f50b8713b60: mov    %eax,-0x14000(%rsp)
  2.77%    2.81%   0x00007f50b8713b67: push   %rbp
  0.28%    0.17%   0x00007f50b8713b68: sub    $0x30,%rsp
  1.31%    1.54%   0x00007f50b8713b6c: mov    0x10(%rsi),%esi      ; get and unpack $this.data
  2.16%    0.86%   0x00007f50b8713b6f: shl    $0x3,%rsi
  0.24%    0.34%   0x00007f50b8713b73: mov    0x10(%rdx),%edx      ; get and unpack $o.data
  9.90%   11.96%   0x00007f50b8713b76: shl    $0x3,%rdx
  1.65%    1.74%   0x00007f50b8713b7a: nop
  1.27%    0.34%   0x00007f50b8713b7b: nop
  0.13%    0.09%   0x00007f50b8713b7c: nop
  0.47%    0.54%   0x00007f50b8713b7d: nop
  1.44%    2.38%   0x00007f50b8713b7e: nop
  1.39%    0.28%   0x00007f50b8713b7f: callq  0x00007f50b85ef420   ; call process_0_0
  0.21%    0.43%   0x00007f50b8713b84: add    $0x30,%rsp           ; epilogue and return
  1.37%    1.80%   0x00007f50b8713b88: pop    %rbp
  1.72%    1.24%   0x00007f50b8713b89: test   %eax,0x15e8c571(%rip)
  0.17%    0.36%   0x00007f50b8713b8f: retq

Same as in the C2 case, C1 just unpacks the references and calls process_0_0.

==== selectByFirst

[Verified Entry Point]
  1.20%    1.19%   0x00007fce88b36940: mov    %eax,-0x14000(%rsp)
  1.44%    1.52%   0x00007fce88b36947: push   %rbp
  0.61%    1.07%   0x00007fce88b36948: sub    $0x30,%rsp
  1.29%    1.80%   0x00007fce88b3694c: mov    0x10(%rsi),%esi      ; get and unpack $this.data
  0.50%    0.87%   0x00007fce88b3694f: shl    $0x3,%rsi
  0.77%    1.59%   0x00007fce88b36953: mov    0x10(%rdx),%edi      ; get and unpack $o.data
 12.45%   19.07%   0x00007fce88b36956: shl    $0x3,%rdi
  0.92%    1.69%   0x00007fce88b3695a: cmpl   $0x0,0xc(%rsi)       ; bounds check for $this.data[0]
  0.20%    0.61%   0x00007fce88b36961: jbe    0x00007fce88b36a0f   ; jump to exceptional path
  0.37%    0.57%   0x00007fce88b36967: movsbl 0x10(%rsi),%edx      ; get $this.data[0]
  0.39%    1.06%   0x00007fce88b3696b: cmpl   $0x0,0xc(%rdi)       ; bounds check for $o.data[0]
 27.85%   22.78%   0x00007fce88b36972: jbe    0x00007fce88b36a26   ; jump to exceptional path
  1.03%    0.87%   0x00007fce88b36978: movsbl 0x10(%rdi),%ebx      ; get $o.data[0]
  1.44%    1.45%   0x00007fce88b3697c: cmp    $0x0,%edx            ; test ($this.data[0] == 0)
                   0x00007fce88b3697f: jne    0x00007fce88b36996   ; jump to unlikely branch
  0.31%    0.44%   0x00007fce88b36985: cmp    $0x0,%ebx            ; test ($o.data[0] == 0)
  1.07%    0.82%   0x00007fce88b36988: mov    %rdi,%rdx            ; prepare for call
  1.01%    0.78%   0x00007fce88b3698b: jne    0x00007fce88b369e8   ; not 0? jump to unlikely branch
  0.68%    0.46%   0x00007fce88b36991: jmpq   0x00007fce88b369d0   ; jump to the call

The bounds checks here are responsible for the performance difference against the
other tests.

==== selectByFirstUnsafe

[Verified Entry Point]
  0.56%    0.83%   0x00007fbb4916ce40: mov    %eax,-0x14000(%rsp)
  2.61%    1.44%   0x00007fbb4916ce47: push   %rbp
  0.65%    1.06%   0x00007fbb4916ce48: sub    $0x30,%rsp
  1.55%    2.22%   0x00007fbb4916ce4c: mov    0x10(%rsi),%esi      ; get and unpack $this.data
  0.99%    1.41%   0x00007fbb4916ce4f: shl    $0x3,%rsi
  0.38%    0.92%   0x00007fbb4916ce53: mov    0x10(%rdx),%edi      ; get and unpack $o.data
  8.75%   12.08%   0x00007fbb4916ce56: shl    $0x3,%rdi
  1.75%    1.95%   0x00007fbb4916ce5a: mov    $0x10,%rdx           ; oops, loading the IDX?
  0.63%    1.30%   0x00007fbb4916ce64: movsbl (%rsi,%rdx,1),%ebx   ; get $this.data[0]
  0.32%    0.60%   0x00007fbb4916ce68: movsbl (%rdi,%rdx,1),%edx   ; get $o.data[0]
 23.04%   21.55%   0x00007fbb4916ce6c: cmp    $0x0,%ebx            ; test ($this.data[0] == 0)
                   0x00007fbb4916ce6f: jne    0x00007fbb4916ce86   ; jump to unlikely branch
  0.58%    1.06%   0x00007fbb4916ce75: cmp    $0x0,%edx            ; test ($o.data[0] == 0)
  2.32%    1.73%   0x00007fbb4916ce78: mov    %rdi,%rdx            ; prepare for call
  0.18%    0.25%   0x00007fbb4916ce7b: jne    0x00007fbb4916ced8   ; not 0? jump to unlikely branch
  1.91%    1.08%   0x00007fbb4916ce81: jmpq   0x00007fbb4916cec0   ; jump to the call

Unsafe gets are optimized by C1 as well, and therefore there are no bounds checks.

==== selectByID

[Verified Entry Point]
  1.25%    1.33%   0x00007fc6c44e8040: mov    %eax,-0x14000(%rsp)
  1.83%    1.22%   0x00007fc6c44e8047: push   %rbp
  0.83%    1.03%   0x00007fc6c44e8048: sub    $0x30,%rsp
  1.05%    1.46%   0x00007fc6c44e804c: mov    0x10(%rsi),%edi      ; get and unpack $this.data
  1.30%    0.72%   0x00007fc6c44e804f: shl    $0x3,%rdi
  0.75%    1.51%   0x00007fc6c44e8053: mov    0x10(%rdx),%ebx      ; get and unpack $o.data
 12.60%   16.53%   0x00007fc6c44e8056: shl    $0x3,%rbx
  0.94%    1.40%   0x00007fc6c44e805a: movsbl 0xc(%rsi),%esi       ; get $this.coder
  0.94%    0.44%   0x00007fc6c44e805e: movsbl 0xc(%rdx),%edx       ; get $o.coder
  3.34%    3.75%   0x00007fc6c44e8062: cmp    $0x0,%esi            ; test ($this.coder == 0)
                   0x00007fc6c44e8065: jne    0x00007fc6c44e807f   ; jump to unlikely branch
  0.13%    0.07%   0x00007fc6c44e806b: cmp    $0x0,%edx            ; test ($o.coder == 0)
  1.27%    1.29%   0x00007fc6c44e806e: mov    %rdi,%rsi            ; prepare for call
  0.83%    0.46%   0x00007fc6c44e8071: mov    %rbx,%rdx
  0.72%    0.89%   0x00007fc6c44e8074: jne    0x00007fc6c44e80d0   ; not 0? jump to unlikely branch
  0.99%    1.20%   0x00007fc6c44e807a: jmpq   0x00007fc6c44e80b8   ; jump to the call

Similarly, pulling the zero-th array element indirectly seems to cost more than pulling
the coder ID directly.

==== selectByLen

[Verified Entry Point]
  0.45%    0.54%   0x00007fa5e80d1b40: mov    %eax,-0x14000(%rsp)
  2.49%    2.38%   0x00007fa5e80d1b47: push   %rbp
  0.53%    0.76%   0x00007fa5e80d1b48: sub    $0x30,%rsp
  1.85%    4.05%   0x00007fa5e80d1b4c: mov    0x10(%rsi),%esi      ; get and unpack $this.data
  0.25%    0.27%   0x00007fa5e80d1b4f: shl    $0x3,%rsi
  0.53%    0.60%   0x00007fa5e80d1b53: mov    0x10(%rdx),%edi      ; get and unpack $o.data
 10.41%   15.28%   0x00007fa5e80d1b56: shl    $0x3,%rdi
  2.18%    2.93%   0x00007fa5e80d1b5a: mov    0xc(%rsi),%edx       ; get $this.data.arraylength
  0.11%    0.18%   0x00007fa5e80d1b5d: and    $0x1,%edx            ; mask
  0.45%    0.43%   0x00007fa5e80d1b60: mov    0xc(%rdi),%ebx       ; get $o.data.arraylength
 26.56%   23.39%   0x00007fa5e80d1b63: and    $0x1,%ebx             ; mask
  2.36%    2.35%   0x00007fa5e80d1b66: cmp    $0x0,%edx            ; test ($this.coder == 0)
                   0x00007fa5e80d1b69: jne    0x00007fa5e80d1b80   ; jump to another branch
  0.15%    0.13%   0x00007fa5e80d1b6f: cmp    $0x0,%ebx            ; test ($o.coder == 0)
  1.85%    1.28%   0x00007fa5e80d1b72: mov    %rdi,%rdx            ; prepare for call
                   0x00007fa5e80d1b75: jne    0x00007fa5e80d1bd0   ; not 0? jump to another branch
  2.82%    2.69%   0x00007fa5e80d1b7b: jmpq   0x00007fa5e80d1bb8   ; jump to the call

selectByLen additionally pays for the masking.

= Conclusion

1. Nothing beats selecting by coder ID. It is reasonably close to the baseline, and we
   only seem to pay the cost of the actual field gets and branches. The branching cost
   may be large due to branch prediction misses, which seem unavoidable.

2. Blending the coder ID into the array does not help anything. First, we have to pay
   the cost of the bounds check, which is avoidable with Unsafe. Second, the more
   complicated instruction sequence needed to pull the array element puts this scheme
   behind the plain coder ID. And this does not even take into account the additional
   logic required to skip the zero-th element in the coder implementations.

3. Tagging the pointers, or at least the emulation of such tagging, does not help
   either. The cost of demangling the data from the tagged pointer is significant.

TL;DR: Plain coder ID fields are already the best, and it's unlikely we can do better.