In the context of our investments in programming with the constant pool, and given recent requests from the Vector API developers to help analyze lambdas for vectorization, we are giving some more thought to the problem of lambda cracking. The rest of this paper must be read as a proposal for further evaluation, rather than a declaration of an immediate plan.
A missing piece of design is an intermediate representation which
can be associated with a cracked lambda, that accurately represents
the behavior of the lambda body or expression. This document sketches
a concrete encoding of Java expressions which is particularly
adapted to the needs of functional-style APIs such as Java 8 Stream
.
Use of such APIs is characterized by many small lambdas, each
contributing a small, simple expression. In some cases, it is
profitable for the runtime support code to “crack” the lambdas,
inspect their bodies, and perform ad hoc optimizations based on the
particulars of their expressions.
This can be done today, awkwardly and unreliably, by extracting the the bytecodes behind a lambda. (Sometimes this can be done with the JVMCI or by using serialization tricks.) The bytecodes are then decompiled (mechanically reverse-engineered) to recover the expression. If all goes well the expression is then scrutable by the runtime, and optimization can be performed.
Liftable codes provide a new way to crack lambdas that is more direct and reliable. The key step is defining a new representation for lambda bodies, called token codes, which is midway in complexity between JVM bytecodes and Java source code. (The unit of information is the token, which is somewhat larger than the byte, but still natural to the JVM’s internal environment.) This representation is constructed so that it can be easily lifted back to the original Java expression. At the same time, token codes can also readily be lowered down to bytecodes, for direct execution by the JVM.
Concretely, token code is a series of constant pool constants, all of
a type eligible for the ldc
operation or for use as arguments to a
bootstrap method (such as a lambda metafactory). A well-formed series
of token codes expresses a stack action, which is an algorithm that
consumes some incoming group of items, which are viewed as the top
portion of a stack, very similar to the JVM expression stack. The
algorithm produces a new group of items, which replace the incoming
items on the stack. Stack actions can produce and/or consume any
number of items. They are composed by simple concatenation. Most JVM
bytecodes and Java expressions have a natural correspondence to stack
actions, as will be shown below.
Most constant pool types express single-token stack actions which
simply push themselves onto the stack. Method handle constant pool
entries have a more dramatic action: They invoke themselves on the
stack, popping arguments and pushing results. When we add some
constructs for data movement (PUSH
, POP
) and grouping (PACK
), we
have all we need to render Java expressions as stack actions. We will
get crucial help from the method handle APIs, which will supply us
method handles for the Java expression operators we need, plus access
to named methods, fields, and constructors.
The CONSTANT_Integer
and CONSTANT_MethodHandle
constant pool types
play the most important role in token codes. Integer constants
provide us with all the ad hoc code points we need to complete the
token code instruction set. Method handles self-invoke as described
above. These tokens can be pushed as simple constants if they
are preceded by a special quoting token called LDC
.
Unlike bytecodes, token code has no control flow primitives. Instead,
it assumes that control flow will be synthesized using method handle
combinators such as guardWithTest
and loop
. To support such a
rendering of control flow, there are special operators for building
method handle constants that embody the control flow, and for packing
and unpacking bundles of local values shared between basic blocks.
Here are some advantages of token codes:
toString
method might be
parameterized by a concatenated group of field references
(and perhaps other limited expressions), with the recipe
confector decoding both access paths and names from the
tokens.Here are some disadvantages:
MethodHandles.loop
.A token code sequence is physically represented as a series of
constant pool indexes to constants whose types are compatible with
ldc
. During Java execution an alternative physical representation
is an array whose elements are the sequence of values of those
constants, either resolved to Object
as if by ldc
and then boxed,
or else in some unresolved form (like Constable
). A TokenCode
object may be wrapped around this array to disambiguate it from other
uses of such arrays, if that is necessary.
There are also grouping constructs, which allows a subsequence of tokens to be nested within a containing sequence. The subsequence must also be a well-formed token code sequence.
A token code sequence which cannot be divided into smaller non-empty
subsequences is called an “elemental stack action”. Stack actions may
be concatenated into larger stack actions called chains. Chaining is
associative, so A1 A2
followed by A3
is the same as A2 A3
preceded by A1
, for any three stack actions. There are a few other
grouping constructs besides chains, all of which use an element count
in the first token to delimit the grouped sub-sequence from
surrounding tokens.
Before we give an exhaustive grammar of token codes, here is a summary
of fourteen distinct token code operations which are specially encoded
by CONSTANT_Integer
constant tokens:
n | opcode | count | extension | effect |
---|---|---|---|---|
0 | LDC |
short C |
C tokens |
push C tokens |
1 | LDB |
C |
Class , C tokens |
push computed value |
2 | METHOD |
C |
MethodType , C tokens |
push anonymous method |
3 | CONDY |
C |
String , Class , C tokens |
push computed value |
4 | INVOKEC |
MethodHandle |
invoke leaf method | |
5 | INVOKEB |
C |
C tokens |
invoke computed combinator |
6 | MACRO |
Long , List /MethodHandle |
execute inlined tokens | |
7 | INDY |
C |
String , MethodType , C tokens |
invoke computed call site target |
8 | PUT |
split S /C |
bury C values at S |
|
9 | GET |
split S /C |
extract C values from S |
|
10 | DUP |
split S /C |
copy C values from S |
|
11 | POP |
split S /C |
drop C values at S |
|
12 | PACK |
short C |
MethodType (args)value |
pop args, push value |
13 | UNPACK |
short C |
MethodType (args)value |
pop value, push args |
The opcode field of an integer is encoded in a compact range of
non-negative numbers in [0..13]
. This range is also defined by
reference to an enumeration of the opcodes, where LDC
is the first
and so on. For each opcode, there may a long or short count field.
The instruction may be followed by one or more extension tokens.
If a field is left blank in the table above, it must be zero. If the extension is blank, the instruction is always a single token. A “short” count is limited to the range 0..255. Counts which accompany slots are short, as are those which are explicitly marked as short in the table. Other counts (which count tokens) are unsigned 24-bit values.
If the instruction refers to a stack slot S
, it is a number between
zero and D-1
, where D
is the current depth of the stack. An S
value of zero always refers to the top of the stack, while larger
values refer to deeper slots in the stack.
Instructions with a stack slot S
operate on a counted range of C
stack slots from S+C-1
(deep in the stack) to S
(near the top of
the stack). None of the stack references may be out of bounds, so we
require S+C < D+1
as well as S < D
.
Additional code points are reserved.
Here are the same opcodes, with summaries of their stack effects:
push constants (0..3)LDC{V} : ( ) → (vv)LDB{T;BB} : ( ) → (t))METHOD{(A)T;LL} : ( ) →
(λ(A).{LL}:T)CONDY{N,T;BB} : ( ) →
({…;BB}:T) |
invoke code (4..7)
|
manage stack (8..11)PUT(S,C) : (ss,vv) → (vv,ss)GET(S,C) : (vv,ss) → (ss,vv)DUP(S,C) : (vv,ss) → (vv,ss,vv)POP(S,C) : (vv,ss) → (ss) |
manage argument lists (12..13)PACK{(A)P} : (aa) → (p)UNPACK{(A)P} : (p) → (aa) |
Notes:
|
The grammar of well-formed sequences includes constraints on the use
of an expression evaluation stack. Each token code sequence (or
subsequence) consumes a statically defined number N
of consumed
inputs and produces another number R
of produced outputs. The stack
must never underflow but may grow to any desired depth. Every such
sequence denotes a stack action.
We will accompany most productions with the two parameters [N,R]
,
for the consumed and produced number of stack elements. Relations
between parameters in the head of a grammar rules and those in the
body are expressed in WHERE
expressions. The terminal elements are
the various types of constant pool entries, all named CONSTANT[T]
for some constant pool tag T
.
Integer constant pool entries which serve as instructions are named
with the parameterized terminal INSTRUCTION[OP]
, where the OP
value is the opcode of the instruction. Depending on the opcode, the
instruction may also be parameterized by an immediate 24-bit long
count field or an 8-bit short count field. For instructions which
move on-stack data, the short count is also accompanied by a 16-bit
slot field. Depending on the opcode, instruction may also be
extended by one or more extension words, the number of which is
sometimes influenced by the count field (either long or short).
Here is the full grammar defining well-formed token codes:
TokenCodeBlock[Inputs,Results,Length] =
ActionChain[N,R] & CountedLiterals[C]
WHERE[Inputs>=N && R>=Results]
WHERE[C=Length]
ActionChain[N,R] =
( ) WHERE[N=0,R=0] # empty sequence does nothing to stack
| SingleAction[N1,R1] ActionChain[N2,R2]
WHERE[N=MAX[N1, N1-R1+N2]]
WHERE[R=N-N1+R1-N2+R2]
# these constraints prevent stack underflow
SingleAction[N,R] =
| LoadValues[R] WHERE[N=0] # no inputs, one or more outputs
| InvokeCode[N,R] # N is argument count, R is return count
| MoveData[N,R] # N is deepest location affected, R-N is TOS change
LoadValues[R] =
( INSTRUCTION[LDC,1] )? SelfLiteral WHERE[R=1]
| INSTRUCTION[LDC,C] CountedLiterals[C] WHERE[R=C, 0<C<256]
| INSTRUCTION[LDB,C] FieldType[_] TokenCodeBlock[0,1,C]
| INSTRUCTION[METHOD,C] MethodType[N1,R1] TokenCodeBlock[N1,R1,C]
| INSTRUCTION[CONDY,C] BootstrapName FieldType[_] TokenCodeBlock[3,1,C]
# The token Integer[0] is a reserved instruction, not LDC[0].
InvokeCode[N,R] =
( INSTRUCTION[INVOKEC] )? SelfInvoke[N,R]
| INSTRUCTION[INVOKEB,C] MethodType[N,R] TokenCodeBlock[0,1,C]
| INSTRUCTION[MACRO] Long[N<<32|R] TokenCodeList
| INSTRUCTION[INDY,C] BootstrapName MethodType[N,R] TokenCodeBlock[3,1,C]
MoveData[N,R] =
INSTRUCTION[PUT,C,S] WHERE[N=S+C,R=C+S] # remove C items at TOS to copy after #S
| INSTRUCTION[GET,C,S] WHERE[N=C+S,R=S+C] # remove C items at #S and copy to TOS
| INSTRUCTION[DUP,C,S] WHERE[N=C+S,R=C+S+C] # dup C items at #S to TOS
| INSTRUCTION[POP,C,S] WHERE[N=C+S,R=S] # delete C items at #S
# PUT and USE are linear in their chain of custody, which is to say
# they move items without changing the number of references to them.
# POP kills old references; only DUP creates new references to old items.
# (Minimizing data movement and retention on the stack is an interesting problem.)
| INSTRUCTION[PACK,C] MethodType[N1,1] WHERE[N=MAX(C,N1),R=1]
| INSTRUCTION[UNPACK,C] MethodType[N1,1] WHERE[N=1,R=MAX(C,N1)]
# PACK and UNPACK convert groups of stack elements
# to and from a single "buffer" value, a list or an array.
# If the count value C is non-zero, it must be no less than N1
# if C > N1 > 0, the trailing parameter type (N1-1) is repeated to make C parameters
# if C > N1 = 0 and the buffer is an array, the trailing parameter is
# taken to the array element type
SelfLiteral =
CONSTANT[CONSTANT_String] | CONSTANT[CONSTANT_Class]
| CONSTANT[CONSTANT_Long] | CONSTANT[CONSTANT_Float]
| CONSTANT[CONSTANT_Double] | CONSTANT[CONSTANT_MethodType]
# some CP types are self-pushing, but not Integer or MethodHandle
# a CONSTANT_Dynamic self-literal is not allowed; use an explicit LDC
SelfInvoke[N,R] =
mh:CONSTANT[CONSTANT_MethodHandle]
WHERE[N=(mh.type.parameterCount)]
WHERE[R=(mh.type.returnType==void?0:1)]
# a method handle token is immediately invoked on the stack
# the method type determines the stack effects [N,R]
# a CONSTANT_Dynamic self-invoke is not allowed; use INVOKEB+LDC
TokenCodeList =
ConstantDynamic[_]
| SelfInvoke[0,1]
# the token code list must be a List object and will be memoized
# the constant or method handle may be evaluated statically
# if it is a method handle, it is invoked once to obtain the list
# Utility productions:
MethodType[N,R] = mt:CONSTANT[CONSTANT_MethodType]
WHERE[N=mt.pCount, R=(mt.rt==void?0:1)]
FieldType[CT] =
ct:CONSTANT[CONSTANT_Class] WHERE[CT=resolve(ct)]
| mt:MethodType[0,1] WHERE[CT=resolve(mt.rt)]
BootstrapName = CONSTANT[CONSTANT_String]
Long[X] = x:CONSTANT[CONSTANT_Long] WHERE[X=resolve(x)]
Integer[X] = x:CONSTANT[CONSTANT_Integer] WHERE[X=resolve(x)]
CountedLiterals[C] = (C)*[CONSTANT[_]] # a counted sequence of constants
ConstantDynamic[CT] = con:CONSTANT[CONSTANT_Dynamic] WHERE[con.type=CT]
# Parameterized terminals:
CONSTANT[T] = ## constant pool entry of tag T ##
x:CONSTANT[T] = ## constant pool entry of tag T and value x ##
# Raw instruction format:
INSTRUCTION[OP,C] = Integer[OP.ordinal + (C<<8)] WHERE[C < (1<<24)]
INSTRUCTION[OP] = INSTRUCTION[OP,0] # missing C must be zero
INSTRUCTION[OP,C,S] = INSTRUCTION[OP, C + (S<<8)] WHERE[C < 256]
The execution state of a token code machine is a stack, plus one or more cursors pointing into constant sequences of token codes, plus a resolution table that memoizes the results of instruction linkage actions.
Elements on the stack are untyped (or typed as Object
). Therefore
non-reference values are always represented on stack by references to
boxes. Off-stack values such as method arguments and return values
are described unambiguously with field and method types.
When translating to bytecodes, a box may be eliminated if all of its uses are observed to be the same non-reference type. Mixed usage, where the same value is used both as a reference and a primitive, or as two different primitive types, is illegal, although the rules for enforcing this strong typing are not described here.
Any action chain construct (of the form ActionChain[N,R]
) is
statically described by a pair of numbers [N,R]
that encodes the
stack effects. The chain requires a stack of depth S
which must be
at least N
items deep, which the expression will use. The
expression produces R
items and leaves them on the stack in place of
the original N
items. Often R
is one or zero. After the
expression executes the stack is therefore of depth S-N+R
.
Constants of type CONSTANT_String
, CONSTANT_Class
,
CONSTANT_MethodType
, CONSTANT_Long
, CONSTANT_Float
, and
CONSTANT_Double
are treated as self-evaluating literals, which
simply push themselves onto the stack. The primitives may be boxed.
If the token codes are later rendered to bytecodes, boxing can usually
be avoided by simple target typing.
Constants of type CONSTANT_MethodHandle
are immediately applied to
the stack. This gives gives immediate access not only to all methods
but all constructors and fields. The INVOKEB
instruction allows
complex combinations of methods to be created and invoked.
Constants of type CONSTANT_Integer
are reserved for token code
instructions whose opcodes are provided by the enumeration
Expression.Op
. Stack actions introduced by these instructions vary
in length, format, and semantics. The integer constant itself is
bit-encoded into an opcode field, an optional long or short count
field, and an optional slot field:
Expression.Op
enumThe stack data movement instructions use the high 16 bits of the long count field as a slot field, leaving the remaining bits to serve as a short count that determines the number of stack slots to move.
The parameterized grammar terminal INSTRUCTION
shows how all such
instructions are used to construct various types of expressions. If
an instruction is not documented to use its count field, that field
must be zero. If the count field denotes a number of stack values
(to push, pop, or duplicate), it is limited to the range 0..255,
whether or not there is an accompanying slot field.
Valid opcodes are:
LDC
(0) – load one (or more) typed literal constant(s) from the following token(s)LDB
(1) – execute a block of token codes once, memoize the TOS, and push itMETHOD
(2) – create a method handle with a given type and token blockCONDY
(3) – emulate or compile to a bytecode dynamic constantINVOKEC
(4) – invoke a following (constant) method handle on stacked argumentsINVOKEB
(5) – execute a block of token codes once, memoize the TOS, and invoke itMACRO
(6) – run token codes indirectly available as a single constantINDY
(7) – emulate or compile to a bytecode dynamic invocationPUT
(8) – move from TOS to selected stack position(s)GET
(9) – move to TOS from selected stack position(s)DUP
(10) – copy items at selected stack position(s) to TOSPOP
(11) – discard items at selected stack position(s)PACK
(12) – pack stack items into one argument list objectUNPACK
(13) – unpack stack items from one argument list objectAn LDC
expression can be used to “quote” a block of one or more
constants of any type, allowing them to be pushed on the stack as data
values. The count field determines how many constants are to be
pushed. As a special case, if the count field is zero, the
instruction functions as a NOP
(the instruction is all zero bits).
Constants of type CONSTANT_Dynamic
, if they evaluate to
int
or Integer
, or if they evaluate to MethodHandle
or a
subtype, have unspecified semantics, since after resolution they
cannot easily be distinguished from ordinary tokens. To safely use a
CONSTANT_Dynamic
, incorporate it into a LDC
expression.
(If the dynamic constant will evaluate to a method handle, ensure self
invocation by incorporating it into an INVOKEC
expression.)
For details on this new kind of constant, see the JEP for dynamic
constants.
When invoking methods, the stack has the same order as the standard
JVM stack. That is, the receiver must be pushed first, and then the
arguments from left to right. Note, however, that a self-invoking
method handle token appears after all of its arguments, even though
Java bytecodes for invoking a method handle will push the method
handle first, because it is the receiver of an explicit invoke
or
invokeExact
method. A self-invoking method handle does not need
an explicit reference to an invoke
method. For uniformity, and
in cases of ambiguity, it may optionally have an INVOKEC
instruction
as a prefix.
Token code uses an ActionChain
to represent an expression of the Java
language. The nesting of sub-expressions within an expression tree is
reflected naturally in the sequential nesting of chained stack actions.
The values used by the expression must be pre-positioned on the stack, somehow, and any values delivered by the expression will be left on the stack.
The basic rule is that an internal node in an expression tree is
represented by a method handle token which collapses the stacked
inputs of the expression into zero or one outputs. The leaf nodes of
the expression are typically represented by data movement instructions
like DUP
(non-destructive use) or GET
(final use of a value).
These draw from the incoming values stacked before the expression
executes, or perhaps from previously computed value pushed temporarily
into the stack by PUT
.
In the simplest or luckiest cases, the data pre-positioned on the stack is in exactly the order needed to feed the stack action that evaluates the desired expression. In the general case, data movement may be needed to glue the stack actions together. Data movement, of course, is from one point of view just another kind of stack action, but it is also a kind of stack action that is straightforwad to refactor or optimize, since its semantics are purely local.
Let’s try a simple floating point expression, Math.sqrt(b*b - 4*a*c)
. Assuming it is a method body, the bytecodes will look
something like this:
float f(float a, float b, float c) {
//return Math.sqrt(b*b - 4*a*c);
fload #b; fload #b; fmul
ldc 4f; fload #a; fmul
fload #c; fmul; fsub
invokestatic Math.sqrt; freturn
}
To render this in token code, let us assume that the three parameters
are presented in alphabetical order as above. Then a simple rendering
of the expression uses a DUP
or literal expression for each leaf
node, and a method handle for each interior node, in the expression
tree.
float f(float a, float b, float c) {
//stack = (a, b, c)
//return Math.sqrt(b*b - 4f*a*c);
DUP #b; DUP #b
MethodHandle(Float.Ops.MUL)
LDC 4f; DUP #a
MethodHandle(Float.Ops.MUL)
DUP #c
MethodHandle(Float.Ops.MUL)
MethodHandle(Float.Ops.SUB)
MethodHandle(Math::sqrt)
//stack = (a, b, c, value)
}
The pseudo-code notation #a
means to access the value originally
named “a” wherever it is on the stack.
A token code assembler would do well to supply a mechanism for giving
symbolic names to mid-stack elements like #a
, etc., since the offset
required to DUP
a selected element must vary from place to place.
Likewise, a disassembler would do well to track the names of values
even as they are copied to various places on the stack.
As a matter of tidiness, the stack action which represents an expression should, when possible, consume only values needed to evaluate the expression. We say such a stack action has minimized inputs. Likewise, stack actions should have minimized outputs, meaning that every output of the stack action is used by an subsequent stack action, or at a boundary point such as the delivery of a return value from a method body.
Such minimization is not always possible and/or desirable. For example, when a stack action represents a method body, the incoming stack is fixed by the method’s type signature, the stack action will have to “work around” any parameters it does not want to use.
Also, if a stack action A1 is followed by another stack action A2 that needs to use a value V, then even if A1 does not use V it may have to pass V through to A2, as both an incoming and outgoing value. This case is avoidable with some data movement, since the value V can be parked deep in the stack outside of A1’s inputs, and unparked just before A2 is executed. The parking and unparking code would then appear as extra stack actions with wide, messy inputs or outputs, but the original actions A1 and A2 would have their inputs and output minimized.
When a stack action represents a method body that returns a value, the
value left on the top of the stack is returned. Any other stacked
values (perhaps unused parameter values) are silently ignored. In
that case, the outputs of the method body stack action are not
minimized. If a minimized action is needed, it is simple to add
explicit POP
instructions to clear out the useless stack values,
but this is usually wasted effort.
As described above, stack actions are obviously good for evaluating expression trees. But there are important types of expressions which are not simple trees. The JVM uses local variables when the simple operations available to the stack machine do not directly encode the programmer’s intent. Usually the part that doesn’t fit into the stack model is a named variable in source code, which is mapped to a local.
For example, a named variable might be used to hold a scratch value that must be used twice, like this:
float f(float[] a) {
//int tem = f();
invokestatic f; istore #tem
//return a[tem] + b[tem];
aload #a; iload #tem; faload
aload #b; iload #tem; faload
fadd; freturn
}
But this use of locals can be emulated by a pure stack machine, simply
by parking the temporary value deep enough in the stack, and then
using mid-stack duplication instructions as needed, with an unpark
operation on the final use of the value. The DUP
class of data
movement opcodes handle is designed to handle such use cases:
stack = (a, b, ...)
//int tem = f();
MethodHandle(f); PUT #tem
stack = (a, b, tem, ...)
//return a[tem] + b[tem];
DUP #a; DUP #tem; Ops.Float.ALOAD
DUP #b; GET #tem; Ops.Float.ALOAD
MethodHandle(Float.Ops.ADD)
stack = (a, b, ..., value)
Notice that the last (killing) use of a value is GET
, while DUP
serves to both use and preserve a value. All of the on-stack data
movement opcodes merely permute the stack without changing its size,
except DUP
which grows the stack and POP
which shrinks it.
When lowering token codes to bytecodes, mid-stack data movement instructions must be converted to local variable accesses, since the JVM does not natively support such instructions. Identifying values which must be committed to locals, instead of being kept on stack, should be done in a pre-pass before bytecode generation.
More crucially, the JVM bytecode set also uses locals to encode
expressions which may produce multiple outputs. One of the most
common multi-output idioms is the auto-increment array access
a[i++]
. This expression is coded by both loading the local
containing i
and storing back to it, even before the array reference
is performed. (In fact, there is a special JVM instruction for this
common idiom.)
Here is an example:
float f(float[] a, int i) {
//return a[i++] + a[i++];
aload #a; iinc #i; faload
aload #a; iinc #i; faload
fadd; freturn
}
The corresponding token code actions consume the index value i
and
also produce it as a secondary value:
stack = (a, i, ...)
//return a[i++] + a[i++];
DUP #a
GET #i; DUP; LDC 1; MethodHandle(Ops.Integer.ADD); PUT #i
MethodHandle(Ops.Float.ALOAD)
DUP #a
GET #i; DUP; LDC 1; MethodHandle(Ops.Integer.ADD); PUT #i
MethodHandle(Ops.Float.ALOAD)
MethodHandle(Ops.Float.ADD)
stack = (a, i+2, ..., value)
Stack actions are excellent for expression simple expressions without
control flow. But Java code often has control flow, even in
expressions of the form a ? b : c
or a && b
. It would seem that
stack actions are useless for branching control flow, since stack
actions cannot “jump over” each other. But in fact they can; the
INVOKEB
and METHOD
opcodes serve as grouping constructs, and the
control combinators of java.lang.invoke.MethodHandles
can weave
together such groups to create the structured control flow statements
of Java.
A METHOD
grouping construct builds a constant method handle of a
selected type from the token codes of its body. This is truly a
constant, since the construct has no dependencies on stack state
when it executes. The actions inside the method handle can be
“replayed” at any time by calling the method handle. An INVOKEB
grouping construct is used to weave together several such methods
into a control flow combination. The effect of INVOKEB
is similar
toLDB
, in that an initialization expression is executed once to
weave the combined control flow, and it is also similar to INVOKEC
in that it directly executes the woven control flow method handle.
(The INVOKEB
instruction could be awkwardly synthesized from LDB
and INVOKEC
, but at the risk of blurring the fact that the invoked
entity is constant control flow, not a data-dependent method handle.
Thus, it “makes the cut” as a basic token code operation.)
The single value returned by the method handle would seem to make it
difficult for such actions to return more than one value, but the
PACK
opcode can be used within the method handle body to group
multiple outputs into a single return value. When the package of
argument values is received by its intended user, it can be unpacked
using UNPACK
into an identical group of stack values. Also,
there are method handle API points which work directly on packed
argument lists, such as spreadArguments
and collectArguments
.
The “envelope” for packing and unpacking may be a simple list or
array of boxed values. In the future they may be stored more
efficiently using unboxed argument lists.
When lowering to bytecodes, a code generator will want to recognize
uses of control flow combinators like guardWithTest
, loop
, and
catchException
, and inline their operands into one block of
bytecodes, linked together by control flow transfers.
A similar point applies when raising token codes to expressions. It seems straightforward (pending further experimentation) to analyze the nested sequence of stack actions into a nested AST of expressions and statements with control flow operators.
The design sketched above is intended to be easy to analyze, at least for simple expressions without complex control flow. It is also intended to be universal enough to encode the logic of any Java method (but not every detail of constructors). Whether it actually works out that way, or whether it needs further adjustment, requires further experimentation.
Simple experiments to try would seem to include lowering to bytecodes, raising from bytecodes, raising to javac-style ASTs, type inference, liveness analysis, minimization, value numbering, side effect detection, and race analysis. As of today (August 2017) none of these experiments have been tried, but all of them seem reasonable to try, at some point. Such experiments would surely lead to improvements in the basic design.
Perhaps the first thing to try is raising from bytecodes, and then, combined with serialization hooks, use them for lambda cracking in the context of the lanewise expressions of the Panama Vector API.
This proposal may affect a similar proposal for isolated methods. Isolated methods do not need to be bytes at all, but could be constant pool constants composed using something like token code.
The token code interpreter could be used to build a simple but
universal bootstrap method for CONSTANT_Dynamic
constant
pool constants. The sequence of static arguments to this bootstrap
method could easily represent simple ad hoc expressions to initialize
the constants. For example, a compiled regular-expression constant
could be created by interpreting two token codes, a CONSTANT_String
containing the pattern to compile and a CONSTANT_MethodHandle
pointing to Pattern.compile
. Such a universal bootstrap method
would reduce pressure to create metafactories for simple types of
constants. For invokedynamic
, such a universal bootstrap method
might be useful if the semantics of the instruction were partly
determined by some form of ad hoc metaprogramming, expressed as a
method handle expression in token codes.
It is sometimes useful to pull run-time computations back to compile
time. In C++ this is done with the constexpr
keyword; Java has a
limited set of constant evaluations it can perform. Some sort of
declaration (@SafeToRunAtCompileTime
, @EvalWhenCompile
, @Pure
,
@ConstExpr
) could trigger compilation of affected code into token
codes instead of bytecode. This might ease both metaprogramming and
(safe) compile-time evaluation.
Combined with crackable lambdas, token codes can be used to enforce restrictions on the complexity of lambda bodies, at compile time, in case libraries need to do so. For example, a library may wish to ensure that a lambda body is a pure expression free of side effects. Such library-specific analysis will often be easier to carry out on a token code sequence than on the bytecodes (and associated metadata) of an equivalent method.
Token codes may provide a higher-level basis for compile-time execution of code and construction of constants.
Because of they are easy to compose (by sequence concatenation) token
codes could play a role in runtime assembly of ad hoc code fragments.
As such they are a candidate to replace the LambdaForm
class used
internally in the java.lang.invoke
package.
[Vector API]: https://youtu.be/LGVxiDxIrFM?t=5s
[dynamic constants]: http://openjdk.java.net/jeps/8177279
[unboxed argument lists]: http://bugs.openjdk.java.net/browse/JDK-8182862
[isolated methods]: http://openjdk.java.net/jeps/8158765