Token Codes: IR for Liftable Lambdas

October 2017 (v. 0.6)

John Rose

In the context of our investments in programming with the constant pool, and given recent requests from the Vector API developers to help analyze lambdas for vectorization, we are giving some more thought to the problem of lambda cracking. The rest of this paper must be read as a proposal for further evaluation, rather than a declaration of an immediate plan.

A missing piece of design is an intermediate representation which can be associated with a cracked lambda, that accurately represents the behavior of the lambda body or expression. This document sketches a concrete encoding of Java expressions which is particularly adapted to the needs of functional-style APIs such as Java 8 Stream. Use of such APIs is characterized by many small lambdas, each contributing a small, simple expression. In some cases, it is profitable for the runtime support code to “crack” the lambdas, inspect their bodies, and perform ad hoc optimizations based on the particulars of their expressions.

This can be done today, awkwardly and unreliably, by extracting the the bytecodes behind a lambda. (Sometimes this can be done with the JVMCI or by using serialization tricks.) The bytecodes are then decompiled (mechanically reverse-engineered) to recover the expression. If all goes well the expression is then scrutable by the runtime, and optimization can be performed.

Liftable codes provide a new way to crack lambdas that is more direct and reliable. The key step is defining a new representation for lambda bodies, called token codes, which is midway in complexity between JVM bytecodes and Java source code. (The unit of information is the token, which is somewhat larger than the byte, but still natural to the JVM’s internal environment.) This representation is constructed so that it can be easily lifted back to the original Java expression. At the same time, token codes can also readily be lowered down to bytecodes, for direct execution by the JVM.

Concretely, token code is a series of constant pool constants, all of a type eligible for the ldc operation or for use as arguments to a bootstrap method (such as a lambda metafactory). A well-formed series of token codes expresses a stack action, which is an algorithm that consumes some incoming group of items, which are viewed as the top portion of a stack, very similar to the JVM expression stack. The algorithm produces a new group of items, which replace the incoming items on the stack. Stack actions can produce and/or consume any number of items. They are composed by simple concatenation. Most JVM bytecodes and Java expressions have a natural correspondence to stack actions, as will be shown below.

Most constant pool types express single-token stack actions which simply push themselves onto the stack. Method handle constant pool entries have a more dramatic action: They invoke themselves on the stack, popping arguments and pushing results. When we add some constructs for data movement (PUSH, POP) and grouping (PACK), we have all we need to render Java expressions as stack actions. We will get crucial help from the method handle APIs, which will supply us method handles for the Java expression operators we need, plus access to named methods, fields, and constructors.

The CONSTANT_Integer and CONSTANT_MethodHandle constant pool types play the most important role in token codes. Integer constants provide us with all the ad hoc code points we need to complete the token code instruction set. Method handles self-invoke as described above. These tokens can be pushed as simple constants if they are preceded by a special quoting token called LDC.

Unlike bytecodes, token code has no control flow primitives. Instead, it assumes that control flow will be synthesized using method handle combinators such as guardWithTest and loop. To support such a rendering of control flow, there are special operators for building method handle constants that embody the control flow, and for packing and unpacking bundles of local values shared between basic blocks.

Here are some advantages of token codes:

They are easy to work with, as a single array of Objects, with no cross-reference to any contextual symbol table or block structure.
It is easy to compile down to bytecodes or lambda forms. (The “basic type” information needed to render bytecodes is simple to derive in one pass.)
It is simple to lift them up (decompile them) to expression trees.
They naturally embed in bootstrap specifiers (BSM argument lists) and therefore interoperate with other mechanisms that work with bootstrap methods.
They can represent small ad hoc expression trees, which may be useful for template parameters and other metaprogramming. For example, a “recipe” for a toString method might be parameterized by a concatenated group of field references (and perhaps other limited expressions), with the recipe confector decoding both access paths and names from the tokens.

Here are some disadvantages:

They are less dense than bytecodes. In effect they are a u2-based instruction set instead of u1.
It is hard to represent complex control flow. That must be done with grouping constructs to form basic blocks into method handles, which are then composed using control-flow combinators such as MethodHandles.loop.
They are yet another representation. Although it would be nice if bytecodes could be used more directly, it seems that a new design point is justified, midway between bytecodes and Java expressions.

Token code syntax

A token code sequence is physically represented as a series of constant pool indexes to constants whose types are compatible with ldc. During Java execution an alternative physical representation is an array whose elements are the sequence of values of those constants, either resolved to Object as if by ldc and then boxed, or else in some unresolved form (like Constable). A TokenCode object may be wrapped around this array to disambiguate it from other uses of such arrays, if that is necessary.

There are also grouping constructs, which allows a subsequence of tokens to be nested within a containing sequence. The subsequence must also be a well-formed token code sequence.

A token code sequence which cannot be divided into smaller non-empty subsequences is called an “elemental stack action”. Stack actions may be concatenated into larger stack actions called chains. Chaining is associative, so A1 A2 followed by A3 is the same as A2 A3 preceded by A1, for any three stack actions. There are a few other grouping constructs besides chains, all of which use an element count in the first token to delimit the grouped sub-sequence from surrounding tokens.

Before we give an exhaustive grammar of token codes, here is a summary of fourteen distinct token code operations which are specially encoded by CONSTANT_Integer constant tokens:

n	opcode	count	extension	effect
0	`LDC`	short `C`	`C` tokens	push `C` tokens
1	`LDB`	`C`	`Class`, `C` tokens	push computed value
2	`METHOD`	`C`	`MethodType`, `C` tokens	push anonymous method
3	`CONDY`	`C`	`String`, `Class`, `C` tokens	push computed value
4	`INVOKEC`		`MethodHandle`	invoke leaf method
5	`INVOKEB`	`C`	`C` tokens	invoke computed combinator
6	`MACRO`		`Long`, `List`/`MethodHandle`	execute inlined tokens
7	`INDY`	`C`	`String`, `MethodType`, `C` tokens	invoke computed call site target
8	`PUT`	split `S`/`C`		bury `C` values at `S`
9	`GET`	split `S`/`C`		extract `C` values from `S`
10	`DUP`	split `S`/`C`		copy `C` values from `S`
11	`POP`	split `S`/`C`		drop `C` values at `S`
12	`PACK`	short `C`	`MethodType`(args)value	pop args, push value
13	`UNPACK`	short `C`	`MethodType`(args)value	pop value, push args

The opcode field of an integer is encoded in a compact range of non-negative numbers in [0..13]. This range is also defined by reference to an enumeration of the opcodes, where LDC is the first and so on. For each opcode, there may a long or short count field. The instruction may be followed by one or more extension tokens.

If a field is left blank in the table above, it must be zero. If the extension is blank, the instruction is always a single token. A “short” count is limited to the range 0..255. Counts which accompany slots are short, as are those which are explicitly marked as short in the table. Other counts (which count tokens) are unsigned 24-bit values.

If the instruction refers to a stack slot S, it is a number between zero and D-1, where D is the current depth of the stack. An S value of zero always refers to the top of the stack, while larger values refer to deeper slots in the stack.

Instructions with a stack slot S operate on a counted range of C stack slots from S+C-1 (deep in the stack) to S (near the top of the stack). None of the stack references may be out of bounds, so we require S+C < D+1 as well as S < D.

Additional code points are reserved.

Here are the same opcodes, with summaries of their stack effects:

push constants (0..3)
LDC{V}: ( ) → (vv)
LDB{T;BB}: ( ) → (t))
METHOD{(A)T;LL}: ( ) → (λ(A).{LL}:T)
CONDY{N,T;BB}: ( ) → ({…;BB}:T)

invoke code (4..7)
INVOKEC{M}: (aa) — M → (t)
INVOKEB{(A)T;BB}: (aa) — M → (t)
MACRO{N,R,L}: (nn) — {LL} → (rr)

INDY{N,(A)T;BB}: (aa) — {…} → (t)

manage stack (8..11)
PUT(S,C): (ss,vv) → (vv,ss)
GET(S,C): (vv,ss) → (ss,vv)
DUP(S,C): (vv,ss) → (vv,ss,vv)
POP(S,C): (vv,ss) → (ss) manage argument lists (12..13)
PACK{(A)P}: (aa) → (p)
UNPACK{(A)P}: (p) → (aa)

Notes:

vv are one or more constants resolved from {V}
blocks {BB} are executed at most once (for resolution)
blocks {LL}, may be executed any number of times
aa are of type A and t is of type T
M is a method handle of type (A)T
in INVOKEB, {BB} resolves to M
in MACRO, L resolves to the token list {LL}
length(ss)=S and length(vv)=C
p is of type P, an array or list type

Full grammar

The grammar of well-formed sequences includes constraints on the use of an expression evaluation stack. Each token code sequence (or subsequence) consumes a statically defined number N of consumed inputs and produces another number R of produced outputs. The stack must never underflow but may grow to any desired depth. Every such sequence denotes a stack action.

We will accompany most productions with the two parameters [N,R], for the consumed and produced number of stack elements. Relations between parameters in the head of a grammar rules and those in the body are expressed in WHERE expressions. The terminal elements are the various types of constant pool entries, all named CONSTANT[T] for some constant pool tag T.

Integer constant pool entries which serve as instructions are named with the parameterized terminal INSTRUCTION[OP], where the OP value is the opcode of the instruction. Depending on the opcode, the instruction may also be parameterized by an immediate 24-bit long count field or an 8-bit short count field. For instructions which move on-stack data, the short count is also accompanied by a 16-bit slot field. Depending on the opcode, instruction may also be extended by one or more extension words, the number of which is sometimes influenced by the count field (either long or short).

Here is the full grammar defining well-formed token codes:

TokenCodeBlock[Inputs,Results,Length] =
   ActionChain[N,R]  &  CountedLiterals[C]
   WHERE[Inputs>=N && R>=Results]
   WHERE[C=Length]

ActionChain[N,R] =
   ( )  WHERE[N=0,R=0]  # empty sequence does nothing to stack
 | SingleAction[N1,R1] ActionChain[N2,R2]
            WHERE[N=MAX[N1, N1-R1+N2]]
            WHERE[R=N-N1+R1-N2+R2]
            # these constraints prevent stack underflow

SingleAction[N,R] =
 | LoadValues[R]  WHERE[N=0]  # no inputs, one or more outputs
 | InvokeCode[N,R]  # N is argument count, R is return count
 | MoveData[N,R]  # N is deepest location affected, R-N is TOS change

LoadValues[R] =
   ( INSTRUCTION[LDC,1] )? SelfLiteral  WHERE[R=1]
 | INSTRUCTION[LDC,C] CountedLiterals[C]  WHERE[R=C, 0<C<256]
 | INSTRUCTION[LDB,C] FieldType[_] TokenCodeBlock[0,1,C]
 | INSTRUCTION[METHOD,C] MethodType[N1,R1] TokenCodeBlock[N1,R1,C]
 | INSTRUCTION[CONDY,C] BootstrapName FieldType[_] TokenCodeBlock[3,1,C]

# The token Integer[0] is a reserved instruction, not LDC[0].

InvokeCode[N,R] =
   ( INSTRUCTION[INVOKEC] )?  SelfInvoke[N,R]
 | INSTRUCTION[INVOKEB,C]  MethodType[N,R]  TokenCodeBlock[0,1,C]
 | INSTRUCTION[MACRO] Long[N<<32|R] TokenCodeList
 | INSTRUCTION[INDY,C] BootstrapName MethodType[N,R] TokenCodeBlock[3,1,C]

MoveData[N,R] =
   INSTRUCTION[PUT,C,S]  WHERE[N=S+C,R=C+S]  # remove C items at TOS to copy after #S
 | INSTRUCTION[GET,C,S]  WHERE[N=C+S,R=S+C]  # remove C items at #S and copy to TOS
 | INSTRUCTION[DUP,C,S]  WHERE[N=C+S,R=C+S+C]  # dup C items at #S to TOS
 | INSTRUCTION[POP,C,S]  WHERE[N=C+S,R=S]  # delete C items at #S
 # PUT and USE are linear in their chain of custody, which is to say
 # they move items without changing the number of references to them.
 # POP kills old references; only DUP creates new references to old items.
 # (Minimizing data movement and retention on the stack is an interesting problem.)
 | INSTRUCTION[PACK,C] MethodType[N1,1]  WHERE[N=MAX(C,N1),R=1]
 | INSTRUCTION[UNPACK,C] MethodType[N1,1]  WHERE[N=1,R=MAX(C,N1)]
 # PACK and UNPACK convert groups of stack elements
 # to and from a single "buffer" value, a list or an array.
 # If the count value C is non-zero, it must be no less than N1
 # if C > N1 > 0, the trailing parameter type (N1-1) is repeated to make C parameters
 # if C > N1 = 0 and the buffer is an array, the trailing parameter is
 # taken to the array element type

SelfLiteral =
   CONSTANT[CONSTANT_String] | CONSTANT[CONSTANT_Class]
 | CONSTANT[CONSTANT_Long] | CONSTANT[CONSTANT_Float]
 | CONSTANT[CONSTANT_Double] | CONSTANT[CONSTANT_MethodType]
# some CP types are self-pushing, but not Integer or MethodHandle
# a CONSTANT_Dynamic self-literal is not allowed; use an explicit LDC

SelfInvoke[N,R] =
   mh:CONSTANT[CONSTANT_MethodHandle]
            WHERE[N=(mh.type.parameterCount)]
            WHERE[R=(mh.type.returnType==void?0:1)]
# a method handle token is immediately invoked on the stack
# the method type determines the stack effects [N,R]
# a CONSTANT_Dynamic self-invoke is not allowed; use INVOKEB+LDC

TokenCodeList =
   ConstantDynamic[_]
 | SelfInvoke[0,1]
# the token code list must be a List object and will be memoized
# the constant or method handle may be evaluated statically
# if it is a method handle, it is invoked once to obtain the list

# Utility productions:
MethodType[N,R] = mt:CONSTANT[CONSTANT_MethodType]
        WHERE[N=mt.pCount, R=(mt.rt==void?0:1)]
FieldType[CT] =
   ct:CONSTANT[CONSTANT_Class]  WHERE[CT=resolve(ct)]
 | mt:MethodType[0,1]  WHERE[CT=resolve(mt.rt)]
BootstrapName = CONSTANT[CONSTANT_String]
Long[X] = x:CONSTANT[CONSTANT_Long]  WHERE[X=resolve(x)]
Integer[X] = x:CONSTANT[CONSTANT_Integer]  WHERE[X=resolve(x)]
CountedLiterals[C] = (C)*[CONSTANT[_]]  # a counted sequence of constants
ConstantDynamic[CT] = con:CONSTANT[CONSTANT_Dynamic]  WHERE[con.type=CT]

# Parameterized terminals:
CONSTANT[T] = ## constant pool entry of tag T ##
x:CONSTANT[T] = ## constant pool entry of tag T and value x ##

# Raw instruction format:
INSTRUCTION[OP,C] = Integer[OP.ordinal + (C<<8)] WHERE[C < (1<<24)]
INSTRUCTION[OP] = INSTRUCTION[OP,0]  # missing C must be zero
INSTRUCTION[OP,C,S] = INSTRUCTION[OP, C + (S<<8)]  WHERE[C < 256]

Token code semantics

The execution state of a token code machine is a stack, plus one or more cursors pointing into constant sequences of token codes, plus a resolution table that memoizes the results of instruction linkage actions.

Elements on the stack are untyped (or typed as Object). Therefore non-reference values are always represented on stack by references to boxes. Off-stack values such as method arguments and return values are described unambiguously with field and method types.

When translating to bytecodes, a box may be eliminated if all of its uses are observed to be the same non-reference type. Mixed usage, where the same value is used both as a reference and a primitive, or as two different primitive types, is illegal, although the rules for enforcing this strong typing are not described here.

Any action chain construct (of the form ActionChain[N,R]) is statically described by a pair of numbers [N,R] that encodes the stack effects. The chain requires a stack of depth S which must be at least N items deep, which the expression will use. The expression produces R items and leaves them on the stack in place of the original N items. Often R is one or zero. After the expression executes the stack is therefore of depth S-N+R.

Constants of type CONSTANT_String, CONSTANT_Class, CONSTANT_MethodType, CONSTANT_Long, CONSTANT_Float, and CONSTANT_Double are treated as self-evaluating literals, which simply push themselves onto the stack. The primitives may be boxed. If the token codes are later rendered to bytecodes, boxing can usually be avoided by simple target typing.

Constants of type CONSTANT_MethodHandle are immediately applied to the stack. This gives gives immediate access not only to all methods but all constructors and fields. The INVOKEB instruction allows complex combinations of methods to be created and invoked.

Constants of type CONSTANT_Integer are reserved for token code instructions whose opcodes are provided by the enumeration Expression.Op. Stack actions introduced by these instructions vary in length, format, and semantics. The integer constant itself is bit-encoded into an opcode field, an optional long or short count field, and an optional slot field:

opcode (bits [0..7]) – an index into the Expression.Op enum
long count (bits [8..31]) – a token count (for a block construct)
short count (bits [8..15]) – an argument count in the range 0..255
slot (bits [16..31]) – a location in the stack in the range 0..65535

The stack data movement instructions use the high 16 bits of the long count field as a slot field, leaving the remaining bits to serve as a short count that determines the number of stack slots to move.

The parameterized grammar terminal INSTRUCTION shows how all such instructions are used to construct various types of expressions. If an instruction is not documented to use its count field, that field must be zero. If the count field denotes a number of stack values (to push, pop, or duplicate), it is limited to the range 0..255, whether or not there is an accompanying slot field.

Valid opcodes are:

LDC (0) – load one (or more) typed literal constant(s) from the following token(s)
LDB (1) – execute a block of token codes once, memoize the TOS, and push it
METHOD (2) – create a method handle with a given type and token block
CONDY (3) – emulate or compile to a bytecode dynamic constant
INVOKEC (4) – invoke a following (constant) method handle on stacked arguments
INVOKEB (5) – execute a block of token codes once, memoize the TOS, and invoke it
MACRO (6) – run token codes indirectly available as a single constant
INDY (7) – emulate or compile to a bytecode dynamic invocation
PUT (8) – move from TOS to selected stack position(s)
GET (9) – move to TOS from selected stack position(s)
DUP (10) – copy items at selected stack position(s) to TOS
POP (11) – discard items at selected stack position(s)
PACK (12) – pack stack items into one argument list object
UNPACK (13) – unpack stack items from one argument list object

An LDC expression can be used to “quote” a block of one or more constants of any type, allowing them to be pushed on the stack as data values. The count field determines how many constants are to be pushed. As a special case, if the count field is zero, the instruction functions as a NOP (the instruction is all zero bits).

Constants of type CONSTANT_Dynamic, if they evaluate to int or Integer, or if they evaluate to MethodHandle or a subtype, have unspecified semantics, since after resolution they cannot easily be distinguished from ordinary tokens. To safely use a CONSTANT_Dynamic, incorporate it into a LDC expression. (If the dynamic constant will evaluate to a method handle, ensure self invocation by incorporating it into an INVOKEC expression.) For details on this new kind of constant, see the JEP for dynamic constants.

When invoking methods, the stack has the same order as the standard JVM stack. That is, the receiver must be pushed first, and then the arguments from left to right. Note, however, that a self-invoking method handle token appears after all of its arguments, even though Java bytecodes for invoking a method handle will push the method handle first, because it is the receiver of an explicit invoke or invokeExact method. A self-invoking method handle does not need an explicit reference to an invoke method. For uniformity, and in cases of ambiguity, it may optionally have an INVOKEC instruction as a prefix.

What does token code for an expression look like?

Token code uses an ActionChain to represent an expression of the Java language. The nesting of sub-expressions within an expression tree is reflected naturally in the sequential nesting of chained stack actions.

The values used by the expression must be pre-positioned on the stack, somehow, and any values delivered by the expression will be left on the stack.

The basic rule is that an internal node in an expression tree is represented by a method handle token which collapses the stacked inputs of the expression into zero or one outputs. The leaf nodes of the expression are typically represented by data movement instructions like DUP (non-destructive use) or GET (final use of a value). These draw from the incoming values stacked before the expression executes, or perhaps from previously computed value pushed temporarily into the stack by PUT.

In the simplest or luckiest cases, the data pre-positioned on the stack is in exactly the order needed to feed the stack action that evaluates the desired expression. In the general case, data movement may be needed to glue the stack actions together. Data movement, of course, is from one point of view just another kind of stack action, but it is also a kind of stack action that is straightforwad to refactor or optimize, since its semantics are purely local.

Let’s try a simple floating point expression, Math.sqrt(b*b - 4*a*c). Assuming it is a method body, the bytecodes will look something like this:

float f(float a, float b, float c) {
  //return Math.sqrt(b*b - 4*a*c);
  fload #b; fload #b; fmul
  ldc 4f; fload #a; fmul
  fload #c; fmul; fsub
  invokestatic Math.sqrt; freturn
}

To render this in token code, let us assume that the three parameters are presented in alphabetical order as above. Then a simple rendering of the expression uses a DUP or literal expression for each leaf node, and a method handle for each interior node, in the expression tree.

float f(float a, float b, float c) {
  //stack = (a, b, c)
  //return Math.sqrt(b*b - 4f*a*c);
  DUP #b; DUP #b
  MethodHandle(Float.Ops.MUL)
  LDC 4f; DUP #a
  MethodHandle(Float.Ops.MUL)
  DUP #c
  MethodHandle(Float.Ops.MUL)
  MethodHandle(Float.Ops.SUB)
  MethodHandle(Math::sqrt)
  //stack = (a, b, c, value)
}

The pseudo-code notation #a means to access the value originally named “a” wherever it is on the stack.

A token code assembler would do well to supply a mechanism for giving symbolic names to mid-stack elements like #a, etc., since the offset required to DUP a selected element must vary from place to place. Likewise, a disassembler would do well to track the names of values even as they are copied to various places on the stack.

Minimization of inputs and outputs

As a matter of tidiness, the stack action which represents an expression should, when possible, consume only values needed to evaluate the expression. We say such a stack action has minimized inputs. Likewise, stack actions should have minimized outputs, meaning that every output of the stack action is used by an subsequent stack action, or at a boundary point such as the delivery of a return value from a method body.

Such minimization is not always possible and/or desirable. For example, when a stack action represents a method body, the incoming stack is fixed by the method’s type signature, the stack action will have to “work around” any parameters it does not want to use.

Also, if a stack action A1 is followed by another stack action A2 that needs to use a value V, then even if A1 does not use V it may have to pass V through to A2, as both an incoming and outgoing value. This case is avoidable with some data movement, since the value V can be parked deep in the stack outside of A1’s inputs, and unparked just before A2 is executed. The parking and unparking code would then appear as extra stack actions with wide, messy inputs or outputs, but the original actions A1 and A2 would have their inputs and output minimized.

When a stack action represents a method body that returns a value, the value left on the top of the stack is returned. Any other stacked values (perhaps unused parameter values) are silently ignored. In that case, the outputs of the method body stack action are not minimized. If a minimized action is needed, it is simple to add explicit POP instructions to clear out the useless stack values, but this is usually wasted effort.

Where are the locals?

As described above, stack actions are obviously good for evaluating expression trees. But there are important types of expressions which are not simple trees. The JVM uses local variables when the simple operations available to the stack machine do not directly encode the programmer’s intent. Usually the part that doesn’t fit into the stack model is a named variable in source code, which is mapped to a local.

For example, a named variable might be used to hold a scratch value that must be used twice, like this:

float f(float[] a) {
  //int tem = f();
invokestatic f; istore #tem
  //return a[tem] + b[tem];
aload #a; iload #tem; faload
aload #b; iload #tem; faload
fadd; freturn
}

But this use of locals can be emulated by a pure stack machine, simply by parking the temporary value deep enough in the stack, and then using mid-stack duplication instructions as needed, with an unpark operation on the final use of the value. The DUP class of data movement opcodes handle is designed to handle such use cases:

stack = (a, b, ...)
  //int tem = f();
MethodHandle(f); PUT #tem
stack = (a, b, tem, ...)
  //return a[tem] + b[tem];
DUP #a; DUP #tem; Ops.Float.ALOAD
DUP #b; GET #tem; Ops.Float.ALOAD
MethodHandle(Float.Ops.ADD)
stack = (a, b, ..., value)

Notice that the last (killing) use of a value is GET, while DUP serves to both use and preserve a value. All of the on-stack data movement opcodes merely permute the stack without changing its size, except DUP which grows the stack and POP which shrinks it.

When lowering token codes to bytecodes, mid-stack data movement instructions must be converted to local variable accesses, since the JVM does not natively support such instructions. Identifying values which must be committed to locals, instead of being kept on stack, should be done in a pre-pass before bytecode generation.

Side effects

More crucially, the JVM bytecode set also uses locals to encode expressions which may produce multiple outputs. One of the most common multi-output idioms is the auto-increment array access a[i++]. This expression is coded by both loading the local containing i and storing back to it, even before the array reference is performed. (In fact, there is a special JVM instruction for this common idiom.)

Here is an example:

float f(float[] a, int i) {
  //return a[i++] + a[i++];
aload #a; iinc #i; faload
aload #a; iinc #i; faload
fadd; freturn
}

The corresponding token code actions consume the index value i and also produce it as a secondary value:

stack = (a, i, ...)
  //return a[i++] + a[i++];
DUP #a
GET #i; DUP; LDC 1; MethodHandle(Ops.Integer.ADD); PUT #i
MethodHandle(Ops.Float.ALOAD)
DUP #a
GET #i; DUP; LDC 1; MethodHandle(Ops.Integer.ADD); PUT #i
MethodHandle(Ops.Float.ALOAD)
MethodHandle(Ops.Float.ADD)
stack = (a, i+2, ..., value)

Control flow

Stack actions are excellent for expression simple expressions without control flow. But Java code often has control flow, even in expressions of the form a ? b : c or a && b. It would seem that stack actions are useless for branching control flow, since stack actions cannot “jump over” each other. But in fact they can; the INVOKEB and METHOD opcodes serve as grouping constructs, and the control combinators of java.lang.invoke.MethodHandles can weave together such groups to create the structured control flow statements of Java.

A METHOD grouping construct builds a constant method handle of a selected type from the token codes of its body. This is truly a constant, since the construct has no dependencies on stack state when it executes. The actions inside the method handle can be “replayed” at any time by calling the method handle. An INVOKEB grouping construct is used to weave together several such methods into a control flow combination. The effect of INVOKEB is similar toLDB, in that an initialization expression is executed once to weave the combined control flow, and it is also similar to INVOKEC in that it directly executes the woven control flow method handle.

(The INVOKEB instruction could be awkwardly synthesized from LDB and INVOKEC, but at the risk of blurring the fact that the invoked entity is constant control flow, not a data-dependent method handle. Thus, it “makes the cut” as a basic token code operation.)

The single value returned by the method handle would seem to make it difficult for such actions to return more than one value, but the PACK opcode can be used within the method handle body to group multiple outputs into a single return value. When the package of argument values is received by its intended user, it can be unpacked using UNPACK into an identical group of stack values. Also, there are method handle API points which work directly on packed argument lists, such as spreadArguments and collectArguments. The “envelope” for packing and unpacking may be a simple list or array of boxed values. In the future they may be stored more efficiently using unboxed argument lists.

When lowering to bytecodes, a code generator will want to recognize uses of control flow combinators like guardWithTest, loop, and catchException, and inline their operands into one block of bytecodes, linked together by control flow transfers.

A similar point applies when raising token codes to expressions. It seems straightforward (pending further experimentation) to analyze the nested sequence of stack actions into a nested AST of expressions and statements with control flow operators.

Looking forward

The design sketched above is intended to be easy to analyze, at least for simple expressions without complex control flow. It is also intended to be universal enough to encode the logic of any Java method (but not every detail of constructors). Whether it actually works out that way, or whether it needs further adjustment, requires further experimentation.

Simple experiments to try would seem to include lowering to bytecodes, raising from bytecodes, raising to javac-style ASTs, type inference, liveness analysis, minimization, value numbering, side effect detection, and race analysis. As of today (August 2017) none of these experiments have been tried, but all of them seem reasonable to try, at some point. Such experiments would surely lead to improvements in the basic design.

Perhaps the first thing to try is raising from bytecodes, and then, combined with serialization hooks, use them for lambda cracking in the context of the lanewise expressions of the Panama Vector API.

This proposal may affect a similar proposal for isolated methods. Isolated methods do not need to be bytes at all, but could be constant pool constants composed using something like token code.

The token code interpreter could be used to build a simple but universal bootstrap method for CONSTANT_Dynamic constant pool constants. The sequence of static arguments to this bootstrap method could easily represent simple ad hoc expressions to initialize the constants. For example, a compiled regular-expression constant could be created by interpreting two token codes, a CONSTANT_String containing the pattern to compile and a CONSTANT_MethodHandle pointing to Pattern.compile. Such a universal bootstrap method would reduce pressure to create metafactories for simple types of constants. For invokedynamic, such a universal bootstrap method might be useful if the semantics of the instruction were partly determined by some form of ad hoc metaprogramming, expressed as a method handle expression in token codes.

It is sometimes useful to pull run-time computations back to compile time. In C++ this is done with the constexpr keyword; Java has a limited set of constant evaluations it can perform. Some sort of declaration (@SafeToRunAtCompileTime, @EvalWhenCompile, @Pure, @ConstExpr) could trigger compilation of affected code into token codes instead of bytecode. This might ease both metaprogramming and (safe) compile-time evaluation.

Combined with crackable lambdas, token codes can be used to enforce restrictions on the complexity of lambda bodies, at compile time, in case libraries need to do so. For example, a library may wish to ensure that a lambda body is a pure expression free of side effects. Such library-specific analysis will often be easier to carry out on a token code sequence than on the bytecodes (and associated metadata) of an equivalent method.

Token codes may provide a higher-level basis for compile-time execution of code and construction of constants.

Because of they are easy to compose (by sequence concatenation) token codes could play a role in runtime assembly of ad hoc code fragments. As such they are a candidate to replace the LambdaForm class used internally in the java.lang.invoke package.

References

[Vector API]: https://youtu.be/LGVxiDxIrFM?t=5s
[dynamic constants]: http://openjdk.java.net/jeps/8177279
[unboxed argument lists]: http://bugs.openjdk.java.net/browse/JDK-8182862
[isolated methods]: http://openjdk.java.net/jeps/8158765