Overview of specialized classfile format

Maurizio Cimadamore, October 2014, version 0.2

In this document, we will show the enhancements to the classfile format that are required in order to support type-specialization. As described in [1], the classfile format, in its current state, does not preserve enough type information to allow specialization of generic classes at runtime. To overcome this problem, the valhalla javac compiler [2] might decorate a specializable class with additional information - in the form of the bytecode attributes shown below - such that a relatively mechanical on-demand class specialization process can be defined.

Changelog

0.2

Covers the new layered structure of the TypeVariablesMap attribute.

0.1

Covers new erasure_index field in the TypeVariablesMap attribute.

The TypeVariablesMap attribute

The first thing a specializer runtime might need to know is which type-variables have been marked with the special modifier any in the corresponding source code. Since the source code is subject to type-erasure, all type information involving type-parameters is lost - meaning that an any type-variable is turned into an ordinary type-variable whose bound is simply Object. To make up for this information loss, we define a bytecode attribute, namely TypeVariablesMap, which stores all source-related flags associated with any given type-variable. The structure of this attribute is given below:

TypeVariablesMap_attribute {
    u2 attribute_name_index;
    u4 attribute_length;
    u1 entries_length;
    {
        u2 owner_idx;
        u1 tvars_length;
        {
            u1 flags;
            u2 erasure_idx;
        } tvars_info[tvars_length];
    } entries_info[entries_length];
}

Here, entries_length denotes the number of type-variable mappings in this class/method declaration (a maximum of 255 mappings is supported); each mapping is associated with a given owner, that is, the declaration the type-variables in this mapping belong to. For this purpose, owner_idx points to a constant pool entry of kind CONSTANT_Utf8_info containing the string-based representation of the owning declaration (either a method or a class; see the example below). Each mapping contains tvars_length type-variables, where each type-variable T is associated with an 8-bit flags field (flags), of which only one bit is currently used, with 0 denoting standard type-variables and 1 denoting any type-variables; and with an index (erasure_idx) to a constant pool entry of kind CONSTANT_Utf8_info containing the erased signature of T. Consider the following source:

class Outer<any T> {
    <any Z> void m() {
        class Inner<any U> { }
    }
}

The above program generates two classfiles, one for the toplevel class Outer and one for the local class Inner. Let's look at the TypeVariablesMap attribute for Inner:

TypeVariablesMap:
  LOuter$1Inner;:
    Tvar  Flags  Erased bound
    U     [ANY]  Ljava/lang/Object;
  LOuter;::m()V:
    Tvar  Flags  Erased bound
    Z     [ANY]  Ljava/lang/Object;
  LOuter;:
    Tvar  Flags  Erased bound
    T     [ANY]  Ljava/lang/Object;

Note how the TypeVariablesMap attribute for Inner defines mappings for both the current and the enclosing type-variables (mappings are sorted from innermost to outermost). This allows for fast type-variable lookups (the alternative would have been to rely on the existing InnerClasses and EnclosingMethod attributes, which requires jumping between different classfiles).
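
To make the layout above concrete, here is a minimal sketch of a decoder for the body of a TypeVariablesMap attribute (the bytes that follow attribute_length). The class and member names are hypothetical, the constant pool is modelled as a plain map from index to Utf8 string, and the any flag is assumed to live in the lowest bit of flags; none of this is taken from the actual specializer.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TypeVariablesMapReader {
    // Assumption: the single flag bit in use is the lowest one.
    static final int FLAG_ANY = 0x01;

    /** One tvars_info element: flags plus the erased signature. */
    static final class TvarInfo {
        final int flags;
        final String erasure;
        TvarInfo(int flags, String erasure) { this.flags = flags; this.erasure = erasure; }
        boolean isAny() { return (flags & FLAG_ANY) != 0; }
    }

    /** One entries_info element: the owner and its type-variables. */
    static final class Entry {
        final String owner;
        final List<TvarInfo> tvars = new ArrayList<>();
        Entry(String owner) { this.owner = owner; }
    }

    /** Decodes the attribute body; utf8Pool stands in for the constant pool. */
    static List<Entry> read(byte[] body, Map<Integer, String> utf8Pool) {
        try {
            // DataInputStream is big-endian, like the classfile format.
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(body));
            int entriesLength = in.readUnsignedByte();            // u1 entries_length
            List<Entry> entries = new ArrayList<>();
            for (int e = 0; e < entriesLength; e++) {
                Entry entry = new Entry(utf8Pool.get(in.readUnsignedShort())); // u2 owner_idx
                int tvarsLength = in.readUnsignedByte();          // u1 tvars_length
                for (int t = 0; t < tvarsLength; t++) {
                    int flags = in.readUnsignedByte();            // u1 flags
                    String erasure = utf8Pool.get(in.readUnsignedShort()); // u2 erasure_idx
                    entry.tvars.add(new TvarInfo(flags, erasure));
                }
                entries.add(entry);
            }
            return entries;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Because the entries are sorted from innermost to outermost, a lookup can walk the returned list in order and stop at the first owner that declares the type-variable of interest.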

The BytecodeMapping attribute

The runtime specializer needs to know which opcodes in the erased classfile need to be specialized; for instance, if the erased classfile performs an aload instruction, and the local variable has type any T in the source code, the specializer might need, e.g., to replace the aload with an iload. To allow this rewriting in a straightforward fashion, we introduce an additional bytecode attribute, namely BytecodeMapping, which stores the bytecode offsets of all specializable opcodes in a given method. Extra type information is also stored in this attribute, so that the original (unerased) type information can be reconstructed by the specializer. The structure of this attribute is given below:

BytecodeMapping_attribute {
   u2 attribute_name_index;
   u4 attribute_length;
   u2 mappings_length; 
   {
       u2 bc_offset;
       u2 cp_idx;
   } mappings[mappings_length];
}

Here, mappings_length denotes the number of mappings in this attribute; the mappings are stored in an array (mappings) of size mappings_length, where each mapping is a pair: a bytecode offset (bc_offset) and an index to a constant pool entry of kind CONSTANT_Utf8_info (cp_idx). The cp_idx field is crucial for retrieving the unerased type information associated with a given opcode; the specializer might require this information in order to emit correct opcodes/constant pool entries in the specialized classfiles. An overview of the possible specializable opcodes, along with the type information associated with them, is given in the following table (in this table we use the term 'type' to denote an unerased type signature):

opcode          category  Utf8 value
aloadXX         1         local variable type
astoreXX        1         top-of-stack element type
aaload          1         array element type
aastore         1         array element type
areturn         1         enclosing method return type
dupXX           1         top-of-stack element type
if_acmpXX       1         top-of-stack element type
new             2         class type
anewarray       2         array type
multianewarray  2         array type
ldc             2         class literal type
checkcast       2         cast type
instanceof      2         instanceof type
XXfield         3         instantiated field descriptor
invokeXX        3         instantiated method descriptor

As can be seen, specializable opcodes are divided into three main categories. Opcodes in the first category (such as aload) can be specialized only if the associated unerased type is either an any type-variable or an array type whose element type is an any type-variable; opcodes in the second category can be specialized if the associated unerased type is a class type where at least one type-parameter is an any type-variable (or an array thereof).
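
The rules for the first two categories can be sketched as two predicates over an unerased signature string. This is only an illustration: the class and method names are hypothetical, the set of any type-variable names is assumed to have been collected beforehand (for instance from TypeVariablesMap), and the category 2 check treats a type-variable at any nesting depth of the type arguments as sufficient, which is an assumption rather than something the text above states.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class SpecializableCheck {
    /** Collects the type-variable names referenced in a signature such as LBox<TT;>;. */
    static Set<String> typeVariables(String sig) {
        Set<String> names = new LinkedHashSet<>();
        int i = 0;
        while (i < sig.length()) {
            char c = sig.charAt(i);
            if (c == 'L' || c == '.') {
                // Skip a class name (or inner class suffix) so that letters
                // inside it are never mistaken for a type-variable marker.
                i++;
                while (i < sig.length() && "<;.".indexOf(sig.charAt(i)) < 0) i++;
            } else if (c == 'T') {
                // Type-variable signature: 'T' name ';'
                int end = sig.indexOf(';', i);
                names.add(sig.substring(i + 1, end));
                i = end + 1;
            } else {
                i++;
            }
        }
        return names;
    }

    /** Strips leading array dimensions, e.g. [[TZ; becomes TZ;. */
    static String elementType(String sig) {
        int i = 0;
        while (i < sig.length() && sig.charAt(i) == '[') i++;
        return sig.substring(i);
    }

    /** Category 1: an any type-variable, or an array of one. */
    static boolean category1(String sig, Set<String> anyVars) {
        String elem = elementType(sig);
        return elem.startsWith("T") && elem.endsWith(";")
                && anyVars.contains(elem.substring(1, elem.length() - 1));
    }

    /** Category 2: a class type (or array thereof) mentioning an any type-variable. */
    static boolean category2(String sig, Set<String> anyVars) {
        String elem = elementType(sig);
        if (!elem.startsWith("L")) return false;
        for (String name : typeVariables(elem)) {
            if (anyVars.contains(name)) return true;
        }
        return false;
    }
}
```

With the any type-variable T in scope, category1 accepts TT; and [TT; while category2 accepts LBox<TT;>; and rejects LBox<Ljava/lang/String;>;.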

In the third category we find all opcodes associated with member access (field access/method call). Such opcodes are specializable only if the unerased selector type is a class type where at least one type-parameter is an any type-variable (or an array thereof). Note that, as the specializer might need to emit specialized constant pool entries, the associated Utf8 entry needs to store information about both the unerased member owner type and the unerased member type (after all relevant type-substitution has occurred). The two signatures (owner and member type) are concatenated using the symbol :: (see the example in the following section).
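
As an illustration of how a tool might consume this attribute, the sketch below decodes the body of a BytecodeMapping attribute (the bytes after attribute_length) into a map from bytecode offset to unerased signature, and splits a category 3 signature at the :: separator. The names are hypothetical and the constant pool is again modelled as a simple map from index to Utf8 string; this is not the specializer's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class BytecodeMappingReader {
    /** Returns bc_offset -> unerased signature, in attribute order. */
    static Map<Integer, String> read(byte[] body, Map<Integer, String> utf8Pool) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(body));
            int mappingsLength = in.readUnsignedShort();      // u2 mappings_length
            Map<Integer, String> result = new LinkedHashMap<>();
            for (int i = 0; i < mappingsLength; i++) {
                int bcOffset = in.readUnsignedShort();        // u2 bc_offset
                int cpIdx = in.readUnsignedShort();           // u2 cp_idx
                result.put(bcOffset, utf8Pool.get(cpIdx));
            }
            return result;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Splits a member signature "owner::member" into { owner, member }. */
    static String[] splitMember(String sig) {
        int sep = sig.indexOf("::");
        if (sep < 0) {
            throw new IllegalArgumentException("not a member signature: " + sig);
        }
        return new String[] { sig.substring(0, sep), sig.substring(sep + 2) };
    }
}
```

For a signature such as LBox<TZ;>;::()TZ;, splitMember yields the owner type LBox<TZ;>; and the member type ()TZ;.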

Mapping Examples

In the following sections we present some examples to show how the BytecodeMapping attribute is used in practice. Some of those examples are based upon a slightly simplified version of the Box class in [1] given below:

class Box<any T> {
    T t;

    T get() { return t; }
}

1. aload, astore

The following generates two bytecode mappings (one for aload, one for astore), both pointing to the signature TT;.

<any T> void test(T t0) {
    t0 = t0;
}

Here's the relevant javap output:

<T extends java.lang.Object> void test(T);
descriptor: (Ljava/lang/Object;)V
flags:
Code:
  stack=1, locals=2, args_size=2
     0: aload_1
     1: astore_1
     2: return
BytecodeMapping:
  Code_idx  Signature
      0:    TT;
      1:    TT;
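
To show how such mappings might drive the rewriting, here is a rough sketch that specializes the code above for T = int by turning the mapped aload_<n> and astore_<n> opcodes into iload_<n> and istore_<n>. The opcode values come from the JVM specification; everything else (names, the map-based representation of the attribute, handling only the short one-byte forms, and ignoring the local slot renumbering that long or double would require) is a simplifying assumption.

```java
import java.util.Map;

final class IntSpecializer {
    // Opcode values from the JVM specification.
    static final int ILOAD_0 = 0x1a;
    static final int ALOAD_0 = 0x2a;
    static final int ISTORE_0 = 0x3b;
    static final int ASTORE_0 = 0x4b;

    /**
     * Returns a copy of code in which every mapped aload_n/astore_n whose
     * signature equals tvarSig is replaced by iload_n/istore_n.
     * mapping is bc_offset -> unerased signature, as in BytecodeMapping.
     */
    static byte[] specializeToInt(byte[] code, Map<Integer, String> mapping, String tvarSig) {
        byte[] out = code.clone();
        for (Map.Entry<Integer, String> m : mapping.entrySet()) {
            if (!tvarSig.equals(m.getValue())) continue;
            int pc = m.getKey();
            int op = code[pc] & 0xff;
            if (op >= ALOAD_0 && op <= ALOAD_0 + 3) {
                out[pc] = (byte) (op - ALOAD_0 + ILOAD_0);   // aload_n -> iload_n
            } else if (op >= ASTORE_0 && op <= ASTORE_0 + 3) {
                out[pc] = (byte) (op - ASTORE_0 + ISTORE_0); // astore_n -> istore_n
            }
            // Other opcodes are left untouched in this sketch.
        }
        return out;
    }
}
```

Applied to the three-byte method body above with both mappings pointing to TT;, the result is iload_1, istore_1, return.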

2. aaload, aastore

The following generates (among others) two bytecode mappings (one for aaload, one for aastore), both pointing to the signature TT;.

<any T> void test(T[] tarr, T t) {
    t = tarr[0];
    tarr[0] = t;
}

Here's the relevant javap output:

<T extends java.lang.Object> void test(T[], T);
descriptor: ([Ljava/lang/Object;Ljava/lang/Object;)V
flags:
Code:
  stack=3, locals=3, args_size=3
     0: aload_1
     1: iconst_0
     2: aaload
     3: astore_2
     4: aload_1
     5: iconst_0
     6: aload_2
     7: aastore
     8: return
BytecodeMapping:
  Code_idx  Signature
      2:    TT;
      3:    TT;
      6:    TT;
      7:    TT;

3. areturn

The following generates (among others) a bytecode mapping for areturn pointing to the signature TT;.

<any T> T test(T t) {
    return t;
}

Here's the relevant javap output:

 <T extends java.lang.Object> T test(T);
descriptor: (Ljava/lang/Object;)Ljava/lang/Object;
flags:
Code:
  stack=1, locals=2, args_size=2
     0: aload_1
     1: areturn
BytecodeMapping:
  Code_idx  Signature
      0:    TT;
      1:    TT;

4. dup

The following generates (among others) a bytecode mapping for dup pointing to the signature TT;.

    <any T> void test(T t1, T t2, T t3) {
        t1 = (t2 = t3);
    }

Here's the relevant javap output:

<T extends java.lang.Object> void test(T, T, T);
descriptor: (Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)V
flags:
Code:
  stack=2, locals=4, args_size=4
     0: aload_3
     1: dup
     2: astore_2
     3: astore_1
     4: return
  LineNumberTable:
    line 4: 0
    line 5: 4
BytecodeMapping:
  Code_idx  Signature
      0:    TT;
      1:    TT;
      2:    TT;
      3:    TT;

5. if_acmpne, if_acmpeq

The following generates (among others) two bytecode mappings (one for if_acmpne, one for if_acmpeq), both pointing to the signature TT;.

<any T> void test(T t1, T t2) {
    boolean b1 = t1 == t2;
    boolean b2 = t1 != t2;
}

Here's the relevant javap output:

<T extends java.lang.Object> void test(T, T);
descriptor: (Ljava/lang/Object;Ljava/lang/Object;)V
flags:
Code:
  stack=2, locals=5, args_size=3
     0: aload_1
     1: aload_2
     2: if_acmpne     9
     5: iconst_1
     6: goto          10
     9: iconst_0
    10: istore_3
    11: aload_1
    12: aload_2
    13: if_acmpeq     20
    16: iconst_1
    17: goto          21
    20: iconst_0
    21: istore        4
    23: return
BytecodeMapping:
  Code_idx  Signature
      0:    TT;
      1:    TT;
      2:    TT;
     11:    TT;
     12:    TT;
     13:    TT;

6. new

The following generates (among others) a bytecode mapping for new pointing to the signature LBox<TT;>;.

<any T> void test(T t) {
    new Box<T>();
}

Here's the relevant javap output:

<T extends java.lang.Object> void test(T);
descriptor: (Ljava/lang/Object;)V
flags:
Code:
  stack=2, locals=2, args_size=2
     0: new           #2                  // class Box
     3: dup
     4: invokespecial #3                  // Method Box."<init>":()V
     7: pop
     8: return
BytecodeMapping:
  Code_idx  Signature
      0:    LBox<TT;>;
      4:    LBox<TT;>;::()V

7. anewarray, multianewarray

The following generates two bytecode mappings (one for anewarray, one for multianewarray), each pointing to the corresponding unerased array signature: [TZ; and [[TZ;, respectively.

<any Z> void test() {
    Z[] arr1 = new Z[2];
    Z[][] arr2 = new Z[2][4];
}

Here's the relevant javap output:

<Z extends java.lang.Object> void test();
descriptor: ()V
flags:
Code:
  stack=2, locals=3, args_size=1
     0: iconst_2
     1: anewarray     #2                  // class java/lang/Object
     4: astore_1
     5: iconst_2
     6: iconst_4
     7: multianewarray #3,  2             // class "[[Ljava/lang/Object;"
    11: astore_2
    12: return
BytecodeMapping:
  Code_idx  Signature
      1:    [TZ;
      7:    [[TZ;

8. ldc

The following generates a bytecode mapping for ldc pointing to the signature LBox<TT;>;.

<any T> void test() {
    Class<?> c = Box<T>.class;
}

Here's the relevant javap output:

<T extends java.lang.Object> void test();
descriptor: ()V
flags:
Code:
  stack=1, locals=2, args_size=1
     0: ldc           #2                  // class Box
     2: astore_1
     3: return
BytecodeMapping:
  Code_idx  Signature
      0:    LBox<TT;>;

9. checkcast, instanceof

The following generates two bytecode mappings (one for checkcast, one for instanceof), both pointing to the signature LBox<TZ;>;.

<any Z> void test() {
    Object o = (Box<Z>)null;
    boolean b = (o instanceof Box<Z>);
}

Here's the relevant javap output:

<Z extends java.lang.Object> void test();
descriptor: ()V
flags:
Code:
  stack=1, locals=3, args_size=1
     0: aconst_null
     1: checkcast     #2                  // class Box
     4: astore_1
     5: aload_1
     6: instanceof    #2                  // class Box
     9: istore_2
    10: return
  LineNumberTable:
    line 4: 0
    line 5: 5
    line 6: 10
BytecodeMapping:
  Code_idx  Signature
      1:    LBox<TZ;>;
      6:    LBox<TZ;>;

10. getfield, invokevirtual

The following generates (among others) two bytecode mappings (one for getfield, one for invokevirtual), each pointing to the corresponding unerased member descriptor: LBox<TZ;>;::TZ; and LBox<TZ;>;::()TZ;, respectively.

<any Z> void test(Box<Z> bz) {
    Z z = bz.t;
    z = bz.get();
}

Here's the relevant javap output:

<Z extends java.lang.Object> void test(Box<Z>);
descriptor: (LBox;)V
flags:
Code:
  stack=1, locals=3, args_size=2
     0: aload_1
     1: getfield      #2                  // Field Box.t:Ljava/lang/Object;
     4: astore_2
     5: aload_1
     6: invokevirtual #3                  // Method Box.get:()Ljava/lang/Object;
     9: astore_2
    10: return
  LineNumberTable:
    line 4: 0
    line 5: 5
    line 6: 10
BytecodeMapping:
  Code_idx  Signature
      1:    LBox<TZ;>;::TZ;
      4:    TZ;
      6:    LBox<TZ;>;::()TZ;
      9:    TZ;
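
Once the specializer knows the binding for Z, it has to turn member signatures such as TZ; or ()TZ; into concrete descriptors. The sketch below does this with a small scanner that replaces type-variable references while leaving class names untouched. The names are hypothetical, and the result for parameterized class types (for example LBox<I>;) is purely illustrative notation: this document does not say how specialized class types are named or encoded.

```java
import java.util.Map;

final class SignatureSubstitution {
    /**
     * Replaces each type-variable reference T<name>; for which bindings has
     * an entry with the bound descriptor (e.g. "I" for int).
     */
    static String substitute(String sig, Map<String, String> bindings) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < sig.length()) {
            char c = sig.charAt(i);
            if (c == 'L' || c == '.') {
                // Copy a class name (or inner class suffix) verbatim, so a
                // 'T' inside the name is not treated as a type-variable.
                int j = i + 1;
                while (j < sig.length() && "<;.".indexOf(sig.charAt(j)) < 0) j++;
                out.append(sig, i, j);
                i = j;
            } else if (c == 'T') {
                int end = sig.indexOf(';', i);
                String name = sig.substring(i + 1, end);
                String bound = bindings.get(name);
                // Unbound type-variables are copied unchanged.
                out.append(bound != null ? bound : sig.substring(i, end + 1));
                i = end + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

With Z bound to I, the member types from the mappings above become I for the field and ()I for the method, which is the kind of information needed when emitting specialized constant pool entries.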