# Foreign-abi downcall intrinsics technical description

October 2020. Jorn Vernee

Prior knowledge required: https://github.com/openjdk/panama-foreign/blob/foreign-jextract/doc/panama_ffi.md

## How native calls work

At a low level, function calls conform to a contract called a calling convention, or ABI. This is essentially a set of rules for mapping a function invocation in a source language to machine instructions, so that the two sides of a call, the caller and the callee, agree on where and how arguments and return values are placed in machine registers and on the machine stack, and each side can pick them up from those locations.

When doing a native call, we need to adapt from the Java ABI to a target foreign ABI, and do several other things to make sure that all our Java virtual machine invariants can be maintained.

When adapting from one ABI to another, several things need to happen, most notably 'shuffling'. Shuffling is the process of moving arguments between registers and onto or off of the stack so that they are in the right places according to the calling convention of the target function. Other adaptations include copying a struct that is passed by value and passing a pointer to the copy to the target function, or splitting a struct that is passed by value across several registers or stack locations. The target ABI dictates if and how these things should occur.
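As a rough illustration of the shuffling described above, the following Java sketch assigns each argument of a call to a location under a simplified, Windows-x64-like convention, where the register used depends on the position of the argument. The class ShuffleSketch and its methods are hypothetical names made up for this example, and shadow space, structs and varargs are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of argument assignment (hypothetical, not part of the JDK).
// The first four arguments use a register picked by position; any further
// arguments go to consecutive stack slots.
final class ShuffleSketch {
    private static final String[] INT_REGS = {"RCX", "RDX", "R8", "R9"};
    private static final String[] FLOAT_REGS = {"XMM0", "XMM1", "XMM2", "XMM3"};

    static List<String> assign(Class<?>... argTypes) {
        List<String> locations = new ArrayList<>();
        int stackSlot = 0;
        for (int i = 0; i < argTypes.length; i++) {
            boolean isFloat = argTypes[i] == float.class || argTypes[i] == double.class;
            if (i < INT_REGS.length) {
                // Positional: argument i uses the i-th register of its class.
                locations.add(isFloat ? FLOAT_REGS[i] : INT_REGS[i]);
            } else {
                locations.add("STACK[" + stackSlot++ + "]");
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        System.out.println(assign(int.class, double.class, long.class, long.class, int.class));
    }
}
```

A real linker would derive these locations from the target ABI's full rules; the point here is only that each argument ends up with one concrete register or stack location that both sides of the call agree on.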

In Java in particular, there is another step: a thread state transition. This marks the running thread as being inside native code. Upon returning from a native call the thread state is set back to 'in Java', followed right away by a safepoint poll (and a stack re-guard if needed). Among other things, this allows a concurrent GC to inspect the thread's stack while the thread is in native code.

There are also trivial/critical/leaf calls, which don't do this thread state transition. Contrary to regular calls, these calls are not safepoints: they have no oop map associated with them, for instance, so we cannot safely stop the executing thread at this point in the program, because we have no way of recovering a valid JVM state. This also prevents us from having any safepoints further down the call chain, so we cannot do any upcalls into Java from a trivial native call. It also means that we cannot recover from a stack overflow (of the OS signal kind). If we need a safepoint, we have to wait at least until the native call is over and the thread hits the next safepoint after that. This can, for instance, block the GC from doing its work. In other words, trivial calls are only useful for very short calls that don't touch anything Java related.
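To make the difference between regular and trivial calls more concrete, here is a purely conceptual Java model. It is not HotSpot code: the class ThreadStateSketch, the two states and the poll counter are all assumptions made for this example. It only shows that a regular call brackets the native function with state changes and a poll, while a trivial call does neither.

```java
import java.util.function.LongSupplier;

// Conceptual model only: all names are hypothetical, this is not HotSpot code.
final class ThreadStateSketch {
    enum State { IN_JAVA, IN_NATIVE }

    State state = State.IN_JAVA;
    int safepointPolls = 0;

    // Regular native call: transition to 'in native', call, transition back,
    // and poll for a safepoint right away.
    long callWithTransition(LongSupplier nativeFunction) {
        state = State.IN_NATIVE;
        try {
            return nativeFunction.getAsLong();
        } finally {
            state = State.IN_JAVA;
            safepointPolls++; // stands in for the safepoint poll on return
        }
    }

    // Trivial call: no transition and no poll; the thread looks 'in Java' throughout.
    long callTrivial(LongSupplier nativeFunction) {
        return nativeFunction.getAsLong();
    }

    // In this model, a safepoint operation (such as a GC) only has to wait
    // for threads that are not in native code.
    boolean safepointMustWait() {
        return state != State.IN_NATIVE;
    }

    public static void main(String[] args) {
        ThreadStateSketch thread = new ThreadStateSketch();
        thread.callWithTransition(() -> {
            System.out.println("regular call, safepoint must wait: " + thread.safepointMustWait());
            return 0L;
        });
        thread.callTrivial(() -> {
            System.out.println("trivial call, safepoint must wait: " + thread.safepointMustWait());
            return 0L;
        });
    }
}
```

In this model the function called through callWithTransition observes that a safepoint would not have to wait for the thread, while the one called through callTrivial observes that it would, which mirrors why a long-running trivial call can hold up the GC.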

To implement stack walking for threads that are inside an upcall from native code, which essentially have a stack like:

    [Java frame 2] <- top
    [entry frame ]
    [native frame]
    ...
    [native frame]
    [Java frame 1]
    ...

we need to be able to step over all the native frames on the stack. The GC, for instance, has no need to look at these frames, and we are not guaranteed to be able to understand them anyway, so we skip them. This is done using thread-local state stored in a struct called JavaFrameAnchor, which holds, for instance, the 'last Java pc' (program counter) and the 'last Java sp' (stack pointer). When we do an upcall, the JavaFrameAnchor is copied to the stack in the entry frame. Later, when we hit a safepoint during the upcall, the stack walking code looks at the saved frame anchor while walking over the entry frame, and jumps directly to the last Java frame before the upcall.
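The walk over such a stack can be sketched with a small Java model. Everything here is hypothetical: frames are plain objects, and the saved JavaFrameAnchor is reduced to the index of the last Java frame before the upcall, standing in for the saved 'last Java sp' and 'last Java pc'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of stack walking across an upcall; all names are hypothetical.
final class FrameAnchorSketch {
    enum Kind { JAVA, ENTRY, NATIVE }

    static final class Frame {
        final Kind kind;
        final String name;
        // Only used for ENTRY frames: index of the last Java frame before the upcall.
        final int savedAnchor;

        Frame(Kind kind, String name, int savedAnchor) {
            this.kind = kind;
            this.name = name;
            this.savedAnchor = savedAnchor;
        }
    }

    // Index 0 is the top of the stack, as in the diagram above.
    static List<Frame> exampleStack() {
        return Arrays.asList(
                new Frame(Kind.JAVA, "Java frame 2", -1),
                new Frame(Kind.ENTRY, "entry frame", 4),
                new Frame(Kind.NATIVE, "native frame", -1),
                new Frame(Kind.NATIVE, "native frame", -1),
                new Frame(Kind.JAVA, "Java frame 1", -1));
    }

    // Returns the names of the frames the walker actually visits.
    static List<String> walk(List<Frame> stack) {
        List<String> visited = new ArrayList<>();
        int i = 0;
        while (i < stack.size()) {
            Frame frame = stack.get(i);
            visited.add(frame.name);
            // From an entry frame, jump straight to the saved Java frame,
            // skipping the native frames in between.
            i = (frame.kind == Kind.ENTRY) ? frame.savedAnchor : i + 1;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(walk(exampleStack()));
    }
}
```

The native frames are never inspected: the walker only needs the anchor saved in the entry frame to find the next frame it understands.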

## Panama implementation overview

### Binding recipe

The linking step of downcalls determines a so-called 'binding recipe': a set of operators used to pre-process arguments, together with VM_STORE and VM_LOAD operators, which represent storing a Java primitive value into, or loading one from, a register or a stack slot (see [1]). The binding recipe is a simple stack-based IR.

Some of these operators have a different meaning when 'boxing' and 'unboxing'. 'Boxing' is the process of turning native values into Java ones; this happens for the arguments of an upcall, and the return value of a downcall. 'Unboxing' turns Java values into native ones; this happens for the arguments of a downcall, and the return value of an upcall. When unboxing, the operand stack starts with a single (Java) value, and ends with no values. When boxing it is the opposite: we start with no values on the operand stack, and end with one (the returned value).
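The operand stack discipline for the two directions can be sketched in Java as follows. This is a minimal model, not the actual implementation: Address is a hypothetical stand-in for MemoryAddress, registers are modelled as a map from register name to value, and only the four operators needed for a pointer argument and a pointer return value are shown.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal model of boxing and unboxing; all names are hypothetical.
final class BoxUnboxSketch {
    // Hypothetical stand-in for MemoryAddress.
    static final class Address {
        final long value;
        Address(long value) { this.value = value; }
    }

    final Deque<Object> operandStack = new ArrayDeque<>();
    final Map<String, Long> registers = new HashMap<>();

    // Unboxing a pointer argument: starts with one Java value, ends with none.
    void unboxPointerArg(Address arg, String register) {
        operandStack.push(arg);                                     // Stack: Address
        operandStack.push(((Address) operandStack.pop()).value);    // UNBOX_ADDRESS, Stack: long
        registers.put(register, (Long) operandStack.pop());         // VM_STORE, Stack:
    }

    // Boxing a pointer return value: starts with no values, ends with one.
    Address boxPointerReturn(String register) {
        operandStack.push(registers.get(register));                 // VM_LOAD, Stack: long
        operandStack.push(new Address((Long) operandStack.pop()));  // BOX_ADDRESS, Stack: Address
        return (Address) operandStack.pop();                        // hand the single result to the caller
    }

    public static void main(String[] args) {
        BoxUnboxSketch sketch = new BoxUnboxSketch();
        sketch.unboxPointerArg(new Address(0x1000L), "R8");
        sketch.registers.put("RAX", 0x2000L); // pretend the native function returned this
        System.out.println(Long.toHexString(sketch.boxPointerReturn("RAX").value));
    }
}
```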

Here are the descriptions of the operators, based on the javadoc. The syntax is:

    OPERATOR_NAME([field1], [field2])

- VM_STORE([storage location], [type])
  Pops a [type] from the operand stack, and moves it to [storage location]. The [type] must be one of byte, short, char, int, long, float, or double.

- VM_LOAD([storage location], [type])
  Loads a [type] from [storage location], and pushes it onto the operand stack. The [type] must be one of byte, short, char, int, long, float, or double.

- BUFFER_STORE([offset into memory region], [type])
  Pops a [type], and then a MemorySegment from the operand stack, and then stores [type] to [offset into memory region] of the MemorySegment. The [type] must be one of byte, short, char, int, long, float, or double.

- BUFFER_LOAD([offset into memory region], [type])
  Pops a MemorySegment from the operand stack, loads a [type] from [offset into memory region] from it, and pushes it onto the operand stack. The [type] must be one of byte, short, char, int, long, float, or double.

- COPY([size], [alignment])
  Creates a new MemorySegment with the given [size] and [alignment], and copies contents from a MemorySegment popped from the top of the operand stack into this new buffer, and pushes the new buffer onto the operand stack.

- ALLOCATE([size], [alignment])
  Creates a new MemorySegment with the given [size] and [alignment], and pushes it onto the operand stack.

- UNBOX_ADDRESS()
  Pops a MemoryAddress from the operand stack, converts it to a long, and pushes that onto the operand stack.

- BOX_ADDRESS()
  Pops a long from the operand stack, converts it to a MemoryAddress, and pushes that onto the operand stack.

- BASE_ADDRESS()
  Pops a MemorySegment from the operand stack, takes the base address of the segment (the MemoryAddress that points to the start), and pushes that onto the operand stack.

- TO_SEGMENT([size])
  Pops a MemoryAddress from the operand stack, converts it to a MemorySegment with the given size, and pushes that onto the operand stack.

- DUP()
  Duplicates the value on the top of the operand stack (without popping it!), and pushes the duplicate onto the operand stack.

#### Example:

Let's say we want to link a C function with the following declaration:

    struct MyStruct {
      int x;
      int y;
    };

    void* func(int i, double d, void* p, struct MyStruct ms);

Using the Windows x64 C ABI, we get the recipe:

    arg0 (int i):                                  // Stack: int
      VM_STORE(RCX, int.class)                     // Stack:

    arg1 (double d):                               // Stack: double
      VM_STORE(XMM1, double.class)                 // Stack:

    arg2 (void* p):                                // Stack: MemoryAddress
      UNBOX_ADDRESS()                              // Stack: long
      VM_STORE(R8, long.class)                     // Stack:

    arg3 (struct MyStruct ms):                     // Stack: MemorySegment
      BUFFER_LOAD(0, long.class)                   // Stack: long
      VM_STORE(R9, long.class)                     // Stack:

    return (void*):                                // Stack:
      VM_LOAD(RAX, long.class)                     // Stack: long
      BOX_ADDRESS()                                // Stack: MemoryAddress

Note that MyStruct fits into a 64-bit register, so it is passed directly in a register instead of making a copy and passing a reference to that copy to the target function. If MyStruct were bigger than 64 bits, for instance if it were declared as:

    struct MyStruct {
      long long x;
      long long y;
    };

then the recipe for arg3 would be:

    arg3 (struct MyStruct ms):                     // Stack: MemorySegment
      COPY(16, 8)                                  // Stack: MemorySegment
      BASE_ADDRESS()                               // Stack: MemoryAddress
      UNBOX_ADDRESS()                              // Stack: long
      VM_STORE(R9, long.class)                     // Stack:

With the SysV ABI, however, passing this struct is more complex, since it is split across several registers (requiring a DUP). This results in the following recipe:

    arg3 (struct MyStruct ms):                     // Stack: MemorySegment
      DUP()                                        // Stack: MemorySegment MemorySegment
      BUFFER_LOAD(0, long.class)                   // Stack: MemorySegment long
      VM_STORE(RDX, long.class)                    // Stack: MemorySegment
      BUFFER_LOAD(8, long.class)                   // Stack: long
      VM_STORE(RCX, long.class)                    // Stack:

This also shows the flexibility of the binding recipe IR: it can support ABIs that have different strategies for handling arguments.
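To see how the SysV recipe above consumes its operand stack, here is a small Java sketch that executes the same five operators by hand. It rests on several assumptions: a java.nio.ByteBuffer stands in for the MemorySegment, registers are a map from register name to value, and the class SysVStructSketch is a hypothetical name used only for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hand-executed version of the SysV recipe for a 16-byte struct (hypothetical model).
final class SysVStructSketch {
    // Builds a 16-byte 'struct MyStruct { long long x; long long y; }'.
    static ByteBuffer newStruct(long x, long y) {
        ByteBuffer struct = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        struct.putLong(0, x);
        struct.putLong(8, y);
        return struct;
    }

    static Map<String, Long> unboxStruct(ByteBuffer struct) {
        Deque<Object> stack = new ArrayDeque<>();
        Map<String, Long> registers = new LinkedHashMap<>();

        stack.push(struct);                                   // Stack: segment
        stack.push(stack.peek());                             // DUP, Stack: segment segment
        stack.push(((ByteBuffer) stack.pop()).getLong(0));    // BUFFER_LOAD(0), Stack: segment long
        registers.put("RDX", (Long) stack.pop());             // VM_STORE(RDX), Stack: segment
        stack.push(((ByteBuffer) stack.pop()).getLong(8));    // BUFFER_LOAD(8), Stack: long
        registers.put("RCX", (Long) stack.pop());             // VM_STORE(RCX), Stack:

        if (!stack.isEmpty()) {
            throw new IllegalStateException("unboxing must leave the operand stack empty");
        }
        return registers;
    }

    public static void main(String[] args) {
        System.out.println(unboxStruct(newStruct(11L, 22L)));
    }
}
```

Without the DUP, the first BUFFER_LOAD would consume the only reference to the struct and the second half could no longer be loaded.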

### Stages of a downcall

The main class for doing downcalls is jdk.internal.foreign.abi.ProgrammableInvoker [2].

There are five invocation stages for interpreted calls:

1. (ProgrammableInvoker::invokeInterpBindings [3]): Pre-process arguments according to the binding recipe, except for the VM_(STORE/LOAD) operators, and call an injected method handle (the 'leaf' method handle) that moves arguments into registers/stack according to the VM_(STORE/LOAD) operators.
2. (ProgrammableInvoker::invokeMoves [4]): Allocate a buffer, with a 'slot' for each register of the ABI, and a separate buffer for any stack arguments. Fill the buffers with values according to the VM_(STORE/LOAD) operators, as well as the target function address, the pointer to the stack argument buffer, and the size of the stack arguments.
3. (ProgrammableInvoker::invokeNative [5]): Call a pre-generated assembly stub (using JNI), which takes the argument buffer and, for each register in the ABI, copies the value found in the corresponding buffer slot into the register (even if it is not used for this particular call; this allows a single stub to be shared by every call with this ABI). It also copies stack arguments from the stack argument buffer to the native stack, and inserts shadow space, if needed, before finally calling the target function. The return value is loaded from the return registers of the ABI into the buffer, and we return.
4. (ProgrammableInvoker::invokeMoves [4]): We come back to the code from stage #2 to move the values from the buffer into either an Object, or an Object[] if there are multiple VM_(STORE/LOAD) operators. Then we return.
5. (ProgrammableInvoker::invokeInterpBindings [3]): We come back to the code from stage #1, which processes the returned Objects (which are all boxed primitive values) according to the binding recipe for the return value, after which the final value is returned.

It's important here to understand that we have a 'leaf' method handle, which corresponds to stages #2 to #4, that just takes and returns Java primitive values, or, for returns only, returns an Object[] of such values (though this version cannot currently be intrinsified). This method handle implements the VM_(STORE/LOAD) operators of a binding recipe. The code that implements the other operators of a recipe (stages #1 and #5) wraps this leaf method handle, and pre-processes each argument into a Java primitive value, as well as post-processing any primitive values returned by the leaf handle (again according to the recipe).
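The buffer strategy behind the leaf handle (stages #2 to #4) can be modelled roughly like this. The sketch is only an approximation under stated assumptions: the register list is a made-up subset, the target function is a Java interface instead of a native address stored in the buffer, stack arguments are left out, and ArgBufferSketch is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rough model of the argument buffer used by the interpreted leaf call (hypothetical).
final class ArgBufferSketch {
    static final List<String> ARG_REGS = Arrays.asList("RCX", "RDX", "R8", "R9");
    static final int RETURN_SLOT = ARG_REGS.size(); // extra slot for the return register

    // Stand-in for the native target function.
    interface Target {
        long call(long[] registerValues);
    }

    // Stages 2 and 4: fill one slot per VM_STORE, call the stub, read the result back.
    static long invokeMoves(Target target, Map<String, Long> stores) {
        ByteBuffer buffer = ByteBuffer.allocate((ARG_REGS.size() + 1) * Long.BYTES)
                .order(ByteOrder.nativeOrder());
        for (Map.Entry<String, Long> store : stores.entrySet()) {
            int slot = ARG_REGS.indexOf(store.getKey());
            if (slot < 0) {
                throw new IllegalArgumentException("unknown register " + store.getKey());
            }
            buffer.putLong(slot * Long.BYTES, store.getValue());
        }
        stub(target, buffer);
        return buffer.getLong(RETURN_SLOT * Long.BYTES);
    }

    // Stage 3: copy every slot into its 'register', used or not, call the
    // target, and write the return value back into the buffer.
    static void stub(Target target, ByteBuffer buffer) {
        long[] registerValues = new long[ARG_REGS.size()];
        for (int i = 0; i < registerValues.length; i++) {
            registerValues[i] = buffer.getLong(i * Long.BYTES);
        }
        buffer.putLong(RETURN_SLOT * Long.BYTES, target.call(registerValues));
    }

    public static void main(String[] args) {
        Map<String, Long> stores = new LinkedHashMap<>();
        stores.put("RCX", 40L);
        stores.put("R8", 2L);
        System.out.println(invokeMoves(regs -> regs[0] + regs[2], stores));
    }
}
```

Because the stub loads every register slot unconditionally, the same stub works for any call shape of the ABI; the cost is the extra copying through the buffer on every call.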

### Specializing the binding recipe

For stages #1 and #5, instead of calling ProgrammableInvoker::invokeInterpBindings, we can use the binding recipe and method handle combinators to build a method handle chain on top of the leaf method handle that replaces what invokeInterpBindings does. This, for instance, removes the need for an intermediate operand stack: all values flow through the MH chain directly. Doing this cuts the invocation time in half once the call becomes hot and the method handle chain is optimized away (see ProgrammableInvoker::specialize [6]).
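The idea of specializing a recipe with method handle combinators can be illustrated with the standard java.lang.invoke API. The sketch below is not the code in ProgrammableInvoker::specialize: the leaf method, the Address class (a stand-in for MemoryAddress) and the class SpecializeSketch are assumptions made for this example. It shows UNBOX_ADDRESS applied to one argument and BOX_ADDRESS applied to the return value, without any operand stack.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Sketch of building a method handle chain on top of a leaf handle (hypothetical names).
final class SpecializeSketch {
    // Hypothetical stand-in for MemoryAddress.
    static final class Address {
        final long value;
        Address(long value) { this.value = value; }
    }

    // Stand-in for the leaf handle: takes and returns primitives only.
    static long leaf(int i, long p) { return p + i; }

    static long unboxAddress(Address address) { return address.value; }

    static Address boxAddress(long raw) { return new Address(raw); }

    static MethodHandle specialize() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle leaf = lookup.findStatic(SpecializeSketch.class, "leaf",
                MethodType.methodType(long.class, int.class, long.class));
        MethodHandle unbox = lookup.findStatic(SpecializeSketch.class, "unboxAddress",
                MethodType.methodType(long.class, Address.class));
        MethodHandle box = lookup.findStatic(SpecializeSketch.class, "boxAddress",
                MethodType.methodType(Address.class, long.class));

        // UNBOX_ADDRESS on argument 1, then BOX_ADDRESS on the return value.
        MethodHandle withUnboxedArg = MethodHandles.filterArguments(leaf, 1, unbox);
        return MethodHandles.filterReturnValue(withUnboxedArg, box);
    }

    // Convenience wrapper so callers do not have to deal with Throwable.
    static Address call(int i, Address p) {
        try {
            return (Address) specialize().invokeExact(i, p);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(2, new Address(40L)).value);
    }
}
```

The resulting handle has the type (int, Address)Address, so the JIT can inline the whole chain once the call site is hot, which is what makes this approach cheaper than interpreting the recipe on every call.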

### C2 Intrinsification of the leaf handle

For the low-level leaf call (stages #2 and #3) there is also an alternative: we can use a method handle intrinsic called linkToNative [7] (wrapped by a NativeMethodHandle [8]) to implement the leaf method handle (currently only for some call shapes). The intrinsic takes a fallback method handle and a NativeEntryPoint [9], along with the normal primitive values passed to the leaf handle. The default method handle invocation stub behind this intrinsic [10] just calls the fallback handle, which then ends up using the buffer strategy described above.

However, when C2 tries to inline this linkToNative intrinsic method, it generates a specialized call instead (discarding the fallback handle) [11], using a custom IR node called CallNativeNode [12]. This node includes the registers to use to pass arguments, which are then retrieved during matching [13]. C2 also generates a 'native invoker' stub [14] [15], which does the needed thread state transition and sets the fields of JavaFrameAnchor. This is more or less the same as JNI, except that the stub doesn't do any argument shuffling; that is done by the caller (using a CallNativeNode). As a result, stack arguments are not supported in this mode, since calling the stub puts the return address on the stack, and the stub creates its own frame, both of which displace the stack arguments set up by the caller. (As a note: I'd like to replace the assembly stub with a nested call to C2 that generates the native invoker, and uses the CallNativeNode directly to do argument shuffling there, to avoid these problems.)

### Trivial calls

For some calls we can say that we don't need to do any thread state transition. This is currently controlled by adding an attribute to the FunctionDescriptor that is used when linking a downcall. This attribute, a boolean flag, is propagated by the linker to the NativeEntryPoint that is passed to the linkToNative intrinsic. C2 can then retrieve this flag and, based on it, choose not to generate the intermediate 'native invoker' stub, but use a CallNativeNode to call the target function directly, skipping any thread state transition (and the associated overhead).

Note that these trivial calls don't guarantee that a thread state transition will never occur, only that it can be optimized away by C2 in some cases.

## Implementation notes

### GC stack walking

In order to support the new linkToNative intrinsic, the GC stack walking code needed some fixes. The GC tries to process oops that are passed to a callee: it looks at the original bytecode to find the signature of the callee, and derives from that the registers it needs to inspect to find oops. In the case of the linkToNative intrinsic, though, we replace the call in the bytecode with a custom call, so the bytecode information is not correct. Luckily, we don't pass any oops to native code either, so we can instead mark the particular place in the code as being an optimized native call, and skip the subsequent GC operations [16].

Currently the GC and C2 assume that the RBP register will always be saved by the callee when we do a call. The GC code also relies on the fact that stack walking walks over the callee's frame. The RBP register is then filled into a RegisterMap from a known location in the frame right before stepping back to the caller's frame. For optimized native calls however, we don't walk the callee's frame at all, since this is native code (this is different from JNI for instance, which uses an intermediate frame that is walked). However, the current machinery still expects RBP to be saved if it holds an oop, so instead we manually save its location in a thread local struct called JavaFrameAnchor, which also holds some other bookkeeping information. When walking back from an entry frame (the first frame when doing an upcall back into Java), we load the location of the saved RBP register value from the JavaFrameAnchor into the RegisterMap instead [17].