= Experimenting with binary selection of Coders

https://bugs.openjdk.java.net/browse/JDK-8054307

The tests are done on the JDK 9 Sandbox repo, on a 1x4x2 i7-4790K (Haswell) 4.0 GHz,
Linux x86_64. Benchmark code is available at:
  http://cr.openjdk.java.net/~shade/density/

== Background

It is known from previous experiments that coder selection may be a significant chunk
of work, especially where fast String operations are concerned. Previously, we figured
out that dispatching over static methods is the best alternative:
  http://cr.openjdk.java.net/~shade/density/selection-monomorphic.txt
  http://shipilev.net/blog/2015/black-magic-method-dispatch/

However, in binary cases, when we need to act based on two Strings, and therefore two
coders, the selection is even more complicated. This post tries to figure out the best
way to do a binary op on two coded Strings.

== Benchmark

To keep the test case as close to the current String code as possible, we simulate
String with this class:

  public static class Data {
    byte coder;
    byte[] data;
  }

...and dispatch statically over the "coder" implementations:

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public static int process_0_0(byte[] d1, byte[] d2) {
    return d1.length + d2.length;
  }

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public static int process_0_1(byte[] d1, byte[] d2) {
    return d1.length + d2.length;
  }

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public static int process_1_0(byte[] d1, byte[] d2) {
    return d1.length + d2.length;
  }

  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public static int process_1_1(byte[] d1, byte[] d2) {
    return d1.length + d2.length;
  }

There are a few ideas on how to do this dispatch, including:

1) Explicit "coder" ID field, and dispatch on it:

  public int doByID(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    byte c = coder;
    byte oc = o.coder;
    if (c == 0) {
      if (oc == 0) return process_0_0(d, od);
      else         return process_0_1(d, od);
    } else {
      if (oc == 0) return process_1_0(d, od);
      else         return process_1_1(d, od);
    }
  }

(In the subsequent code examples, we shrink the actual dispatch code with a shortcut.)

2) The first element of the byte[] array contains the coder value. The drawback of this
encoding scheme is that we would need to make the processors properly skip the head of
the array, which may introduce some overhead. Also, accessing the array element from
Java code may incur an array bounds check, so we might also want an additional test
with Unsafe.getByte() that bypasses the language-level checks. (A sketch of obtaining
the Unsafe instance follows this list.)

  static final Unsafe U;
  static final long IDX = U.arrayBaseOffset(byte[].class);

  public int doByFirst(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    byte c = d[0];
    byte oc = od[0];
    // ... dispatch over (c, oc), as in doByID ...
  }

  public int doByFirstUnsafe(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    byte c = U.getByte(d, IDX);
    byte oc = U.getByte(od, IDX);
    // ... dispatch over (c, oc), as in doByID ...
  }

3) Tagging the byte[] pointer. This cannot be done in a platform-independent manner
purely from Java code. Having said that, we can emulate pointer tagging by encoding the
coder in the byte[] arraylength. Note that 2-byte String byte[] arrays will almost
always be even-sized, and we can probably make sure that 1-byte String byte[] arrays
are odd-sized. The exact details of how that is done (or whether it is doable to begin
with) are irrelevant for this experiment.

  public int doByLen(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    byte c = (byte) (d.length & 1);
    byte oc = (byte) (od.length & 1);
    // ... dispatch over (c, oc), as in doByID ...
  }
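As promised above, here is a minimal sketch of how the Unsafe instance used by
doByFirstUnsafe is usually obtained; the actual benchmark code (linked above) may do
this differently:

  // Assumed imports: sun.misc.Unsafe, java.lang.reflect.Field
  static final Unsafe U;
  static {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");  // private singleton field
      f.setAccessible(true);
      U = (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }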
We also have a baseline test that does not select coders, but blindly invokes the
binary operation on the data. This test case provides a baseline that behaves like our
current String implementation:

  public int baselineRef(Data o) {
    byte[] d = data;
    byte[] od = o.data;
    return process_0_0(d, od);
  }
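The (bias) and (count) columns in the result tables below are benchmark parameters; the
actual harness is in the benchmark code linked above. As a rough, hypothetical sketch
of the setup, assuming "bias" is the fraction of coder = 1 instances and "count" is the
number of Data pairs processed per invocation (class, field, and method names here are
illustrative, not the actual code):

  // Assumed imports: org.openjdk.jmh.annotations.*, org.openjdk.jmh.infra.Blackhole,
  //                  java.util.Random, java.util.concurrent.TimeUnit
  @State(Scope.Benchmark)
  public static class Pairs {
    @Param({"0.00", "0.25", "0.50", "0.75", "1.00"})
    double bias;                           // assumed: fraction of coder = 1 instances

    @Param("10000")
    int count;                             // assumed: Data pairs per invocation

    Data[] as, bs;

    @Setup
    public void setup() {
      Random r = new Random(42);
      as = new Data[count];
      bs = new Data[count];
      for (int i = 0; i < count; i++) {
        as[i] = make(r);
        bs[i] = make(r);
      }
    }

    private Data make(Random r) {
      Data d = new Data();
      d.coder = (byte) (r.nextDouble() < bias ? 1 : 0);
      d.data = new byte[16 + d.coder];     // odd/even arraylength mirrors the coder (doByLen)
      d.data[0] = d.coder;                 // data[0] mirrors the coder (doByFirst*)
      return d;
    }
  }

  @Benchmark
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.MICROSECONDS)
  public void selectByID(Pairs p, Blackhole bh) {
    Data[] as = p.as;
    Data[] bs = p.bs;
    for (int i = 0; i < as.length; i++) {
      bh.consume(as[i].doByID(bs[i]));     // assumes the doBy* methods live on Data
    }
  }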
== Performance

=== Out of the box (C2) performance

Benchmark                       (bias)  (count)  Mode  Cnt    Score   Error  Units
PairSelect.baselineRef            0.00    10000  avgt  150   67.405 ± 0.394  us/op
PairSelect.baselineRef            0.25    10000  avgt  150   68.047 ± 1.017  us/op
PairSelect.baselineRef            0.50    10000  avgt  150   68.277 ± 1.066  us/op
PairSelect.baselineRef            0.75    10000  avgt  150   67.582 ± 0.331  us/op
PairSelect.baselineRef            1.00    10000  avgt  150   67.635 ± 0.461  us/op
PairSelect.selectByFirst          0.00    10000  avgt  150   80.835 ± 1.278  us/op
PairSelect.selectByFirst          0.25    10000  avgt  150  111.631 ± 1.092  us/op
PairSelect.selectByFirst          0.50    10000  avgt  150  137.566 ± 1.253  us/op
PairSelect.selectByFirst          0.75    10000  avgt  150  110.407 ± 0.812  us/op
PairSelect.selectByFirst          1.00    10000  avgt  150   81.984 ± 1.572  us/op
PairSelect.selectByFirstUnsafe    0.00    10000  avgt  150   73.599 ± 0.823  us/op
PairSelect.selectByFirstUnsafe    0.25    10000  avgt  150  103.865 ± 1.303  us/op
PairSelect.selectByFirstUnsafe    0.50    10000  avgt  150  129.304 ± 1.230  us/op
PairSelect.selectByFirstUnsafe    0.75    10000  avgt  150  104.947 ± 1.664  us/op
PairSelect.selectByFirstUnsafe    1.00    10000  avgt  150   74.479 ± 1.445  us/op
PairSelect.selectByID             0.00    10000  avgt  150   71.071 ± 0.341  us/op
PairSelect.selectByID             0.25    10000  avgt  150   94.972 ± 0.559  us/op
PairSelect.selectByID             0.50    10000  avgt  150  116.038 ± 0.905  us/op
PairSelect.selectByID             0.75    10000  avgt  150   95.571 ± 1.190  us/op
PairSelect.selectByID             1.00    10000  avgt  150   71.455 ± 0.649  us/op
PairSelect.selectByLen            0.00    10000  avgt  150   76.011 ± 1.232  us/op
PairSelect.selectByLen            0.25    10000  avgt  150  105.915 ± 0.407  us/op
PairSelect.selectByLen            0.50    10000  avgt  150  133.260 ± 1.396  us/op
PairSelect.selectByLen            0.75    10000  avgt  150  106.293 ± 1.773  us/op
PairSelect.selectByLen            1.00    10000  avgt  150   76.885 ± 1.180  us/op

As we can see, selectByID beats every other variant, and it is very close to the
baseline, being only ~5% slower in the bias=0.0 and bias=1.0 cases. The other biases
suffer from (unavoidable) branch misprediction misses.

While it is prudent to check the assembly for all cases (and we sure did), it is enough
to look at the disassemblies at bias=0.0 to understand the difference. The epilogue
seems to be the same in all cases (it is separated by a dashed line in the listings
below); what differs is the actual coder selection.
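The annotated listings below are in the style of JMH's perfasm profiler output (hot
regions with per-instruction event percentages). A minimal sketch of requesting such
output through the JMH Runner API, assuming that is how these runs were driven (a
disassembler plugin such as hsdis must be available):

  // A sketch, not the actual launcher used for these runs.
  // Assumed imports: org.openjdk.jmh.runner.Runner, org.openjdk.jmh.runner.RunnerException,
  //                  org.openjdk.jmh.runner.options.*
  public static void main(String... args) throws RunnerException {
    Options opt = new OptionsBuilder()
        .include("PairSelect")
        .addProfiler("perfasm")            // perf events mapped onto hot disassembly
        .build();
    new Runner(opt).run();
  }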
==== baselineRef

...
  2.83%    2.75%   0x00007f09a1190c8c: mov    0x10(%rdx),%r11d     ; get field $o.data
 10.00%   14.56%   0x00007f09a1190c90: mov    0x10(%rsi),%r10d     ; get field $this.data
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  0.52%    1.01%   0x00007f09a1190c94: mov    %r11,%rdx
  0.94%    0.44%   0x00007f09a1190c97: shl    $0x3,%rdx            ; unpack $o.data
  1.53%    1.71%   0x00007f09a1190c9b: mov    %r10,%rsi            ; unpack $this.data
  0.34%    0.47%   0x00007f09a1190c9e: shl    $0x3,%rsi
  0.40%    0.71%   0x00007f09a1190ca2: nop
  0.89%    0.25%   0x00007f09a1190ca3: callq  0x00007f09a1046420   ; call process_0_0
  0.12%    0.25%   0x00007f09a1190ca8: add    $0x10,%rsp           ; epilog and return
  0.89%    0.67%   0x00007f09a1190cac: pop    %rbp
  2.61%    1.56%   0x00007f09a1190cad: test   %eax,0x1723734d(%rip)
  0.12%    0.25%   0x00007f09a1190cb3: retq

The baseline test just unpacks the compressed references to the byte[] arrays before
calling the non-inlinable process_0_0 method.

==== selectByFirst

[Verified Entry Point]
...
  1.38%    1.78%   0x00007f8a35197cac: mov    0x10(%rdx),%r10d        ; get field $o.data
 12.13%   17.38%   0x00007f8a35197cb0: mov    0x10(%rsi),%r9d         ; get field $this.data
  0.99%    0.75%   0x00007f8a35197cb4: mov    0xc(%r12,%r9,8),%r8d    ; get $this.data.arraylength
  0.74%    0.47%   0x00007f8a35197cb9: test   %r8d,%r8d
                   0x00007f8a35197cbc: jbe    0x00007f8a35197cf8      ; fail, jump to exceptional path
  0.21%    0.26%   0x00007f8a35197cbe: mov    0xc(%r12,%r10,8),%r8d   ; get $o.data.arraylength
 31.58%   29.12%   0x00007f8a35197cc3: movsbl 0x10(%r12,%r9,8),%r11d  ; get $this.data[0]
  0.79%    0.34%   0x00007f8a35197cc9: test   %r8d,%r8d
                   0x00007f8a35197ccc: jbe    0x00007f8a35197d11      ; fail, jump to exceptional path
  1.79%    1.20%   0x00007f8a35197cce: movsbl 0x10(%r12,%r10,8),%ebp  ; get $o.data[0]
  2.66%    1.64%   0x00007f8a35197cd4: test   %r11d,%r11d             ; test ($this.data[0] == 0)
                   0x00007f8a35197cd7: jne    0x00007f8a35197d31      ; jump to unlikely path
  0.13%    0.14%   0x00007f8a35197cd9: test   %ebp,%ebp               ; test ($o.data[0] == 0)
                   0x00007f8a35197cdb: jne    0x00007f8a35197d51      ; jump to unlikely path
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  1.84%    1.36%   0x00007f8a35197cdd: lea    (%r12,%r10,8),%rdx      ; unpack $o.data
  0.42%    0.88%   0x00007f8a35197ce1: lea    (%r12,%r9,8),%rsi       ; unpack $this.data
  0.21%    0.29%   0x00007f8a35197ce7: callq  0x00007f8a35046420      ; call process_0_0
  0.54%    0.59%   0x00007f8a35197cec: add    $0x20,%rsp              ; epilog and return
  1.65%    1.28%   0x00007f8a35197cf0: pop    %rbp
  0.91%    0.53%   0x00007f8a35197cf1: test   %eax,0x187e8309(%rip)
  0.34%    0.61%   0x00007f8a35197cf7: retq

==== selectByFirstUnsafe

[Verified Entry Point]
...
  2.20%    2.12%   0x00007f83580fdd0c: mov    0x10(%rdx),%r10d        ; get field $o.data
  9.34%   14.65%   0x00007f83580fdd10: movsbl 0x10(%r12,%r10,8),%r8d  ; get $o.data[0]
 29.93%   24.27%   0x00007f83580fdd16: mov    0x10(%rsi),%r11d        ; get field $this.data
  0.71%    0.78%   0x00007f83580fdd1a: movsbl 0x10(%r12,%r11,8),%ebp  ; get $this.data[0]
  0.32%    0.25%   0x00007f83580fdd20: test   %ebp,%ebp               ; test ($this.data[0] == 0)
                   0x00007f83580fdd22: jne    0x00007f83580fdd48      ; jump to unlikely path
  0.12%    0.14%   0x00007f83580fdd24: test   %r8d,%r8d               ; test ($o.data[0] == 0)
                   0x00007f83580fdd27: jne    0x00007f83580fdd65      ; jump to unlikely path
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  1.97%    1.46%   0x00007f83580fdd29: mov    %r10,%rdx               ; unpack $o.data
  0.76%    0.71%   0x00007f83580fdd2c: shl    $0x3,%rdx
  0.41%    0.34%   0x00007f83580fdd30: mov    %r11,%rsi               ; unpack $this.data
  0.17%    0.29%   0x00007f83580fdd33: shl    $0x3,%rsi
  1.87%    1.88%   0x00007f83580fdd37: callq  0x00007f8357fac420      ; call process_0_0
  0.14%    0.19%   0x00007f83580fdd3c: add    $0x20,%rsp              ; epilog and return
  0.92%    1.07%   0x00007f83580fdd40: pop    %rbp
  1.76%    1.29%   0x00007f83580fdd41: test   %eax,0x15e702b9(%rip)
  0.25%    0.41%   0x00007f83580fdd47: retq

selectByFirstUnsafe notably wins over selectByFirst, because Unsafe allows us to spare
two bounds checks.

==== selectByID

[Verified Entry Point]
...
  2.01%    2.06%   0x00007f04acf71e0c: mov    0x10(%rdx),%r11d   ; get field $o.data
 12.74%   15.67%   0x00007f04acf71e10: movsbl 0xc(%rdx),%r10d    ; get field $o.coder
  4.11%    4.08%   0x00007f04acf71e15: movsbl 0xc(%rsi),%r9d     ; get field $this.coder
  0.64%    0.25%   0x00007f04acf71e1a: mov    0x10(%rsi),%ebp    ; get field $this.data
  0.62%    0.52%   0x00007f04acf71e1d: test   %r9d,%r9d          ; test ($this.coder == 0)
                   0x00007f04acf71e20: jne    0x00007f04acf71e48 ; jump to unlikely path
  0.54%    0.62%   0x00007f04acf71e22: test   %r10d,%r10d        ; test ($o.coder == 0)
                   0x00007f04acf71e25: jne    0x00007f04acf71e65 ; jump to unlikely path
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  1.04%    1.53%   0x00007f04acf71e27: mov    %r11,%rdx          ; unpack $o.data
  0.76%    0.27%   0x00007f04acf71e2a: shl    $0x3,%rdx
  0.78%    0.84%   0x00007f04acf71e2e: mov    %rbp,%rsi          ; unpack $this.data
  0.64%    0.72%   0x00007f04acf71e31: shl    $0x3,%rsi
  1.13%    1.10%   0x00007f04acf71e35: xchg   %ax,%ax
  0.67%    0.37%   0x00007f04acf71e37: callq  0x00007f04ace2b420 ; call process_0_0
  0.13%    0.13%   0x00007f04acf71e3c: add    $0x20,%rsp         ; epilog and return
  0.89%    0.54%   0x00007f04acf71e40: pop    %rbp
  2.22%    2.01%   0x00007f04acf71e41: test   %eax,0x15ff21b9(%rip)
  0.19%    0.20%   0x00007f04acf71e47: retq

selectByID, while code-wise very close to selectByFirstUnsafe, beats it in performance.
This is probably because the indirect memory access in selectByFirstUnsafe costs more
than the more direct load of the coder ID.

==== selectByLen

[Verified Entry Point]
...
  2.02%    0.72%   0x00007f5531194bec: mov    0x10(%rdx),%r10d       ; get field $o.data
 11.18%   14.92%   0x00007f5531194bf0: mov    0x10(%rsi),%r8d        ; get field $this.data
  0.58%    0.61%   0x00007f5531194bf4: mov    0xc(%r12,%r8,8),%r11d  ; get $this.data.arraylength
  1.16%    0.18%   0x00007f5531194bf9: mov    0xc(%r12,%r10,8),%ecx  ; get $o.data.arraylength
 34.01%   29.82%   0x00007f5531194bfe: and    $0x1,%ecx              ; mask
  1.69%    1.06%   0x00007f5531194c01: and    $0x1,%r11d             ; mask
  0.19%    0.34%   0x00007f5531194c05: test   %r11d,%r11d            ; test ($this.coder == 0)
                   0x00007f5531194c08: jne    0x00007f5531194c28     ; jump to unlikely path
  0.59%    0.16%   0x00007f5531194c0a: test   %ecx,%ecx              ; test ($o.coder == 0)
                   0x00007f5531194c0c: jne    0x00007f5531194c49     ; jump to unlikely path
---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
  1.94%    1.64%   0x00007f5531194c0e: lea    (%r12,%r10,8),%rdx     ; unpack $o.data
  0.06%    0.06%   0x00007f5531194c12: lea    (%r12,%r8,8),%rsi      ; unpack $this.data
  0.27%    0.53%   0x00007f5531194c16: nop
  0.61%    0.45%   0x00007f5531194c17: callq  0x00007f5531046420     ; call process_0_0
  0.08%    0.18%   0x00007f5531194c1c: add    $0x20,%rsp             ; epilog and return
  0.93%    1.64%   0x00007f5531194c20: pop    %rbp
  2.07%    1.25%   0x00007f5531194c21: test   %eax,0x177923d9(%rip)
  0.06%    0.08%   0x00007f5531194c27: retq

selectByLen additionally pays for the masking.

== C1

Benchmark                       (bias)  (count)  Mode  Cnt    Score    Error  Units
PairSelect.baselineRef            0.00    10000  avgt   25   68.054 ±  0.348  us/op
PairSelect.baselineRef            0.25    10000  avgt   25   68.152 ±  0.076  us/op
PairSelect.baselineRef            0.50    10000  avgt   25   68.673 ±  1.247  us/op
PairSelect.baselineRef            0.75    10000  avgt   25   68.265 ±  0.949  us/op
PairSelect.baselineRef            1.00    10000  avgt   25   69.253 ±  4.857  us/op
PairSelect.selectByFirst          0.00    10000  avgt   25   89.529 ±  6.966  us/op
PairSelect.selectByFirst          0.25    10000  avgt   25  115.199 ±  0.550  us/op
PairSelect.selectByFirst          0.50    10000  avgt   25  149.689 ±  8.972  us/op
PairSelect.selectByFirst          0.75    10000  avgt   25  115.809 ±  0.857  us/op
PairSelect.selectByFirst          1.00    10000  avgt   25   84.686 ±  0.642  us/op
PairSelect.selectByFirstUnsafe    0.00    10000  avgt   25   78.038 ±  1.071  us/op
PairSelect.selectByFirstUnsafe    0.25    10000  avgt   25  107.767 ±  2.083  us/op
PairSelect.selectByFirstUnsafe    0.50    10000  avgt   25  136.252 ±  0.691  us/op
PairSelect.selectByFirstUnsafe    0.75    10000  avgt   25  109.709 ±  2.815  us/op
PairSelect.selectByFirstUnsafe    1.00    10000  avgt   25   78.187 ±  2.797  us/op
PairSelect.selectByID             0.00    10000  avgt   25   76.958 ±  4.108  us/op
PairSelect.selectByID             0.25    10000  avgt   25  100.764 ±  0.586  us/op
PairSelect.selectByID             0.50    10000  avgt   25  124.001 ±  0.749  us/op
PairSelect.selectByID             0.75    10000  avgt   25   97.785 ±  0.415  us/op
PairSelect.selectByID             1.00    10000  avgt   25   73.564 ±  0.327  us/op
PairSelect.selectByLen            0.00    10000  avgt   25   81.998 ±  3.652  us/op
PairSelect.selectByLen            0.25    10000  avgt   25  113.365 ±  1.273  us/op
PairSelect.selectByLen            0.50    10000  avgt   25  143.963 ±  1.233  us/op
PairSelect.selectByLen            0.75    10000  avgt   25  114.169 ±  0.668  us/op
PairSelect.selectByLen            1.00    10000  avgt   25   78.964 ±  0.362  us/op

Similar to the C2 case, in C1 select-by-ID wins against all other variants, and is
reasonably close to the baseline. The disassembly shows similar reasons for this
performance difference.
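Before looking at the C1 disassemblies: C1-only numbers like the ones above can be
collected by capping HotSpot tiered compilation at tier 1. A minimal sketch, again
assuming the JMH Runner API as in the launcher sketch above (the exact invocation used
for these runs is not shown in this note):

  // A sketch: with tiered compilation stopped at level 1, only C1 compiles the code.
  Options opt = new OptionsBuilder()
      .include("PairSelect")
      .jvmArgsAppend("-XX:TieredStopAtLevel=1")   // stop tiered compilation at C1
      .build();
  new Runner(opt).run();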
==== baselineRef

[Verified Entry Point]
  0.73%    1.26%   0x00007f50b8713b60: mov    %eax,-0x14000(%rsp)
  2.77%    2.81%   0x00007f50b8713b67: push   %rbp
  0.28%    0.17%   0x00007f50b8713b68: sub    $0x30,%rsp
  1.31%    1.54%   0x00007f50b8713b6c: mov    0x10(%rsi),%esi      ; get and unpack $this.data
  2.16%    0.86%   0x00007f50b8713b6f: shl    $0x3,%rsi
  0.24%    0.34%   0x00007f50b8713b73: mov    0x10(%rdx),%edx      ; get and unpack $o.data
  9.90%   11.96%   0x00007f50b8713b76: shl    $0x3,%rdx
  1.65%    1.74%   0x00007f50b8713b7a: nop
  1.27%    0.34%   0x00007f50b8713b7b: nop
  0.13%    0.09%   0x00007f50b8713b7c: nop
  0.47%    0.54%   0x00007f50b8713b7d: nop
  1.44%    2.38%   0x00007f50b8713b7e: nop
  1.39%    0.28%   0x00007f50b8713b7f: callq  0x00007f50b85ef420   ; call process_0_0
  0.21%    0.43%   0x00007f50b8713b84: add    $0x30,%rsp           ; epilogue and return
  1.37%    1.80%   0x00007f50b8713b88: pop    %rbp
  1.72%    1.24%   0x00007f50b8713b89: test   %eax,0x15e8c571(%rip)
  0.17%    0.36%   0x00007f50b8713b8f: retq

Same as in the C2 case, C1 just unpacks the references and calls process_0_0.

==== selectByFirst

[Verified Entry Point]
  1.20%    1.19%   0x00007fce88b36940: mov    %eax,-0x14000(%rsp)
  1.44%    1.52%   0x00007fce88b36947: push   %rbp
  0.61%    1.07%   0x00007fce88b36948: sub    $0x30,%rsp
  1.29%    1.80%   0x00007fce88b3694c: mov    0x10(%rsi),%esi      ; get and unpack $this.data
  0.50%    0.87%   0x00007fce88b3694f: shl    $0x3,%rsi
  0.77%    1.59%   0x00007fce88b36953: mov    0x10(%rdx),%edi      ; get and unpack $o.data
 12.45%   19.07%   0x00007fce88b36956: shl    $0x3,%rdi
  0.92%    1.69%   0x00007fce88b3695a: cmpl   $0x0,0xc(%rsi)       ; bounds check for $this.data[0]
  0.20%    0.61%   0x00007fce88b36961: jbe    0x00007fce88b36a0f   ; jump to exceptional path
  0.37%    0.57%   0x00007fce88b36967: movsbl 0x10(%rsi),%edx      ; get $this.data[0]
  0.39%    1.06%   0x00007fce88b3696b: cmpl   $0x0,0xc(%rdi)       ; bounds check for $o.data[0]
 27.85%   22.78%   0x00007fce88b36972: jbe    0x00007fce88b36a26   ; jump to exceptional path
  1.03%    0.87%   0x00007fce88b36978: movsbl 0x10(%rdi),%ebx      ; get $o.data[0]
  1.44%    1.45%   0x00007fce88b3697c: cmp    $0x0,%edx            ; test ($this.data[0] == 0)
                   0x00007fce88b3697f: jne    0x00007fce88b36996   ; jump to unlikely branch
  0.31%    0.44%   0x00007fce88b36985: cmp    $0x0,%ebx            ; test ($o.data[0] == 0)
  1.07%    0.82%   0x00007fce88b36988: mov    %rdi,%rdx            ; prepare for call
  1.01%    0.78%   0x00007fce88b3698b: jne    0x00007fce88b369e8   ; not 0? jump to unlikely branch
  0.68%    0.46%   0x00007fce88b36991: jmpq   0x00007fce88b369d0   ; jump to the call

The bounds checks here are responsible for the performance difference against the
other tests.

==== selectByFirstUnsafe

[Verified Entry Point]
  0.56%    0.83%   0x00007fbb4916ce40: mov    %eax,-0x14000(%rsp)
  2.61%    1.44%   0x00007fbb4916ce47: push   %rbp
  0.65%    1.06%   0x00007fbb4916ce48: sub    $0x30,%rsp
  1.55%    2.22%   0x00007fbb4916ce4c: mov    0x10(%rsi),%esi      ; get and unpack $this.data
  0.99%    1.41%   0x00007fbb4916ce4f: shl    $0x3,%rsi
  0.38%    0.92%   0x00007fbb4916ce53: mov    0x10(%rdx),%edi      ; get and unpack $o.data
  8.75%   12.08%   0x00007fbb4916ce56: shl    $0x3,%rdi
  1.75%    1.95%   0x00007fbb4916ce5a: mov    $0x10,%rdx           ; oops, loading the IDX?
  0.63%    1.30%   0x00007fbb4916ce64: movsbl (%rsi,%rdx,1),%ebx   ; get $this.data[0]
  0.32%    0.60%   0x00007fbb4916ce68: movsbl (%rdi,%rdx,1),%edx   ; get $o.data[0]
 23.04%   21.55%   0x00007fbb4916ce6c: cmp    $0x0,%ebx            ; test ($this.data[0] == 0)
                   0x00007fbb4916ce6f: jne    0x00007fbb4916ce86   ; jump to unlikely branch
  0.58%    1.06%   0x00007fbb4916ce75: cmp    $0x0,%edx            ; test ($o.data[0] == 0)
  2.32%    1.73%   0x00007fbb4916ce78: mov    %rdi,%rdx            ; prepare for call
  0.18%    0.25%   0x00007fbb4916ce7b: jne    0x00007fbb4916ced8   ; not 0? jump to unlikely branch
  1.91%    1.08%   0x00007fbb4916ce81: jmpq   0x00007fbb4916cec0   ; jump to the call

Unsafe gets are optimized by C1 as well, and therefore there are no bounds checks.

==== selectByID

[Verified Entry Point]
  1.25%    1.33%   0x00007fc6c44e8040: mov    %eax,-0x14000(%rsp)
  1.83%    1.22%   0x00007fc6c44e8047: push   %rbp
  0.83%    1.03%   0x00007fc6c44e8048: sub    $0x30,%rsp
  1.05%    1.46%   0x00007fc6c44e804c: mov    0x10(%rsi),%edi      ; get and unpack $this.data
  1.30%    0.72%   0x00007fc6c44e804f: shl    $0x3,%rdi
  0.75%    1.51%   0x00007fc6c44e8053: mov    0x10(%rdx),%ebx      ; get and unpack $o.data
 12.60%   16.53%   0x00007fc6c44e8056: shl    $0x3,%rbx
  0.94%    1.40%   0x00007fc6c44e805a: movsbl 0xc(%rsi),%esi       ; get $this.coder
  0.94%    0.44%   0x00007fc6c44e805e: movsbl 0xc(%rdx),%edx       ; get $o.coder
  3.34%    3.75%   0x00007fc6c44e8062: cmp    $0x0,%esi            ; test ($this.coder == 0)
                   0x00007fc6c44e8065: jne    0x00007fc6c44e807f   ; jump to unlikely branch
  0.13%    0.07%   0x00007fc6c44e806b: cmp    $0x0,%edx            ; test ($o.coder == 0)
  1.27%    1.29%   0x00007fc6c44e806e: mov    %rdi,%rsi            ; prepare for call
  0.83%    0.46%   0x00007fc6c44e8071: mov    %rbx,%rdx
  0.72%    0.89%   0x00007fc6c44e8074: jne    0x00007fc6c44e80d0   ; not 0? jump to unlikely branch
  0.99%    1.20%   0x00007fc6c44e807a: jmpq   0x00007fc6c44e80b8   ; jump to the call

Similarly, pulling the zero-th array element indirectly seems to cost more than pulling
the coder ID directly.

==== selectByLen

[Verified Entry Point]
  0.45%    0.54%   0x00007fa5e80d1b40: mov    %eax,-0x14000(%rsp)
  2.49%    2.38%   0x00007fa5e80d1b47: push   %rbp
  0.53%    0.76%   0x00007fa5e80d1b48: sub    $0x30,%rsp
  1.85%    4.05%   0x00007fa5e80d1b4c: mov    0x10(%rsi),%esi      ; get and unpack $this.data
  0.25%    0.27%   0x00007fa5e80d1b4f: shl    $0x3,%rsi
  0.53%    0.60%   0x00007fa5e80d1b53: mov    0x10(%rdx),%edi      ; get and unpack $o.data
 10.41%   15.28%   0x00007fa5e80d1b56: shl    $0x3,%rdi
  2.18%    2.93%   0x00007fa5e80d1b5a: mov    0xc(%rsi),%edx       ; get $this.data.arraylength
  0.11%    0.18%   0x00007fa5e80d1b5d: and    $0x1,%edx            ; mask
  0.45%    0.43%   0x00007fa5e80d1b60: mov    0xc(%rdi),%ebx       ; get $o.data.arraylength
 26.56%   23.39%   0x00007fa5e80d1b63: and    $0x1,%ebx             ; mask
  2.36%    2.35%   0x00007fa5e80d1b66: cmp    $0x0,%edx            ; test ($this.coder == 0)
                   0x00007fa5e80d1b69: jne    0x00007fa5e80d1b80   ; jump to another branch
  0.15%    0.13%   0x00007fa5e80d1b6f: cmp    $0x0,%ebx            ; test ($o.coder == 0)
  1.85%    1.28%   0x00007fa5e80d1b72: mov    %rdi,%rdx            ; prepare for call
                   0x00007fa5e80d1b75: jne    0x00007fa5e80d1bd0   ; not 0? jump to another branch
  2.82%    2.69%   0x00007fa5e80d1b7b: jmpq   0x00007fa5e80d1bb8   ; jump to the call

selectByLen additionally pays for the masking.

= Conclusion

1. Nothing beats selecting by coder ID. It is reasonably close to the baseline, and we
   only seem to pay the cost of the actual field gets and branches. The branching cost
   may be large due to branch prediction misses, which seem unavoidable.

2. Blending the coder ID into the array does not help anything. First, we have to pay
   the cost of the bounds check, which is avoidable with Unsafe. Second, the more
   complicated instruction sequence needed to pull the array element puts this scheme
   behind the plain coder ID. And this does not even take into account the additional
   logic required to skip the zero-th element in the coder implementations.

3. Tagging the pointers, or at least the emulation of such tagging, does not help
   either. The cost of demangling the data from the tagged pointer is significant.

TL;DR: Plain coder ID fields are already the best, and it's unlikely we can do better.