State of Panama Pointers

Concurrent memory access

June 2019 (v. 0.2)

Maurizio Cimadamore

The Panama foreign memory access API allows Java code to directly manipulate a memory segment in a uniform way, regardless of its physical location (e.g. Java heap, native memory, mapped memory, etc.). The API provides several guarantees, such as safety and deterministic deallocation. The goal of this document is to explore how this API can be extended to allow concurrent access to memory resources without violating the aforementioned guarantees.

The memory access API

The typical workflow of the memory access API goes as follows: first, describe the contents of the memory to be accessed using a memory layout; then, derive one or more memory access var handles from that layout; finally, allocate a memory segment matching the layout, access its memory through the var handles and, when done, close the segment.

The following snippet shows how the API can be used in practice to set elements inside a native array whose elements are laid out in a non-contiguous fashion:

static final Sequence seq = Sequence.of(20,
                Group.struct(
                        Value.ofSignedInt(32).withName("elem"),
                        Padding.of(32)
                )); // [ 20 [ i32(elem) x32 ] ]

static final VarHandle elemhandle = seq.toPath()
        .elementPath()
        .elementPath("elem")
        .dereferenceHandle(int.class);

try (MemorySegment segment = MemorySegment.ofNative(seq)) {
  MemoryAddress addr = segment.baseAddress();
  for (long i = 0 ; i < seq.elementsSize() ; i++) {
    elemhandle.set(addr, i, (int)i);
  }
}

This API is not only expressive, but it also boils down to code which the JIT can optimize pretty well, using existing techniques. Efficiency aside, there are three fundamental principles which have driven the design of this API:

  1. safety: it should not be possible to access the memory associated with a segment after that segment has been closed
  2. deterministic deallocation: clients control exactly when the memory associated with a segment is released
  3. efficiency: memory access operations should carry as little overhead as possible

When access to a memory segment occurs within the same thread, it is easy to show how the memory access API satisfies all said guarantees. However, as we shall see in the remainder of this document, satisfying the principles outlined above becomes very hard when dealing with multiple threads accessing the same memory segment concurrently.

Concurrent access modes

In this section we will explore some of the ways in which memory can be accessed. The goal is to identify memory access patterns, a vocabulary which will form the backbone of the discussion throughout the remainder of this document.

It's mine: static thread confinement

Perhaps the simplest memory access mode is what we will refer to as static thread confinement. In this mode, every memory segment has an owning thread, which is established once and for all at segment creation (hence the word static); that is, if thread T creates segment S, we say that T owns S. This implies that no thread other than T can access and/or close S. The following diagram shows the life-cycle of a statically confined memory segment:

Figure 1: Static confinement

This is indeed a very simple life-cycle, as there are only two possible states: first, the segment is created and its ownership is set to the thread that created it. From here, the only possible state transition occurs when the segment is closed (from the very thread that owns it!); at that point, the resources associated with the segment can be released, and the segment will no longer be usable.

Despite the relatively simple nature of static thread confinement, there are a number of situations in which such an access mode might be quite handy; for instance, this access mode is great for clients that need to perform some off-heap allocation in order to pass a struct to some native library. This access mode is probably also good enough for those clients that need to serialize some complex Java object graph into native memory. On the other hand, the restricted nature of this access mode makes it an unfeasible choice in cases where the same memory segment needs to be accessed by more than one thread.

One thread at a time, please: dynamic thread confinement

A disciplined way to add more flexibility to the static confinement access mode is to remove the restriction that the owning thread must remain fixed for the entire life-cycle of the memory segment. This leads to an access mode where multiple threads can access the same memory segment, provided they do so one at a time. This pattern of memory access is captured in the following diagram:

Figure 2: Dynamic confinement

While the state transitions remain simple, any thread that wants to manipulate a memory segment must acquire it first. If this operation is successful, the segment moves into the acquired state, and cannot be acquired by a different thread - that is, the segment acts as if it were statically confined to the thread which acquired it. Once a thread has finished working with the segment, it can release it, thus making it available to other threads. Under this access mode, a memory segment can be closed only if it is not owned by any thread.

This access mode is more flexible than its static counterpart, and can be useful when modeling producer/consumer-like use cases; that is, one thread acquires a segment, writes to it and then releases it, possibly so that another thread can acquire the same segment and read contents from it, before releasing it again. In other words, this access mode imposes cooperation between all the threads that want access to a contended memory segment.
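To make the acquire/release discipline concrete, here is a minimal, runnable sketch of the producer/consumer exchange described above. ConfinedBuffer is a hypothetical stand-in for a dynamically confined segment (an int array plays the role of the segment's memory), not the actual API:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a dynamically confined buffer: at most one
// owning thread at a time; ownership changes only via acquire/release.
class ConfinedBuffer {
    private final AtomicReference<Thread> owner = new AtomicReference<>();
    private final int[] data = new int[8]; // stand-in for segment memory

    void acquire() {
        if (!owner.compareAndSet(null, Thread.currentThread()))
            throw new IllegalStateException("already owned");
    }

    void release() {
        if (!owner.compareAndSet(Thread.currentThread(), null))
            throw new IllegalStateException("not the owner");
    }

    int get(int i)         { checkOwner(); return data[i]; }
    void set(int i, int v) { checkOwner(); data[i] = v; }

    private void checkOwner() {
        if (owner.get() != Thread.currentThread())
            throw new IllegalStateException("access from non-owning thread");
    }
}

public class ProducerConsumer {
    public static void main(String[] args) throws InterruptedException {
        ConfinedBuffer buf = new ConfinedBuffer();

        // Producer: acquire, write, release.
        Thread producer = new Thread(() -> {
            buf.acquire();
            buf.set(0, 42);
            buf.release();
        });
        producer.start();
        producer.join();

        // Consumer: acquire, read, release.
        buf.acquire();
        System.out.println(buf.get(0)); // prints 42
        buf.release();
    }
}
```

Note how each thread can only touch the buffer between its own acquire and release; any access outside that window fails with an exception.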

Off to the races: sharing memory across threads

There are cases where the restrictions put forward by either static or dynamic confinement are just too strong. Consider a one-publisher/many-subscribers use case, where one thread writes to a segment, then makes it available for multiple threads to read from it, possibly in a concurrent fashion. Or, again, consider the case of an off-heap cache, whose contents must be made available to more than just one thread at a time. In all these use cases, any form of confinement is simply too restrictive.

What we really want is a way to share the same memory segment across multiple threads, but do so in a way that makes it impossible for any of the thread accessing the shared memory segment to do so after the segment has already been closed by another thread. The life-cycle of a shared memory segment is shown in the diagram below:

Figure 3: Shared access

Things start to get more convoluted here; as before, a shared segment starts off in a state where it is not owned by any thread. And, as before, any thread that wants to access the segment needs to acquire it first. The important distinction is that multiple threads can acquire the same segment simultaneously. This means that, at any given point in time, a shared memory segment can have zero, one, or many owning threads. As before, a shared segment can only be closed if no thread has acquired it. This is what makes accessing a shared segment safe.

Note that, when a memory segment has multiple owners, it is possible for the threads owning it to perform racy read/write operations. So, while this model guarantees that no thread can access an already closed segment, it does very little to guarantee correctness of access across multiple threads. In other words, the threads accessing the memory segment in a concurrent fashion must implement some sort of synchronization strategy in order to ensure correctness. This can be done using atomic operations (e.g. compare-and-swap) or explicit read/write locks, to name a few examples.
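As a concrete illustration of one such synchronization strategy, the sketch below uses atomic read-modify-write operations through a VarHandle (the same mechanism the memory access API uses for dereference). Two threads increment a counter stored in an int array - a stand-in for a shared segment - and getAndAdd keeps the result correct where plain reads and writes would race:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Two threads incrementing a shared counter backed by an int array.
// Plain get/set would lose updates; getAndAdd (an atomic
// read-modify-write) guarantees the final count is exact.
public class SharedCounter {
    static final VarHandle INT_ARRAY =
        MethodHandles.arrayElementVarHandle(int[].class);

    public static int increment(int[] data, int iterations) {
        Runnable task = () -> {
            for (int i = 0; i < iterations; i++) {
                INT_ARRAY.getAndAdd(data, 0, 1); // atomic increment
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        }
        return (int) INT_ARRAY.getVolatile(data, 0);
    }

    public static void main(String[] args) {
        System.out.println(increment(new int[1], 100_000)); // prints 200000
    }
}
```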

Slice 'n' dice: divide and conquer

An important subclass of shared memory access occurs when a shared segment is created by a master thread, then sliced up into multiple chunks, each of which is assigned to a different worker thread. The worker threads process the contents of the memory segment slice they have access to, and release it once the work on the slice is complete. Once all the worker threads have completed their work, the master thread can resume its work, perform some final operation on the memory segment, and then release it (and, possibly, close it). This pattern is shown in the diagram below (for simplicity we consider a case with 3 worker threads):

Figure 4: Divide and conquer shared access

The diagram in Figure 4 has some commonalities with that in Figure 3. Again, a shared segment starts off in a state where no thread owns it. The master thread (T in the diagram) then performs an acquire operation. After that, we see that multiple worker threads acquire the memory segment (or slices of it). We need three distinct states to model the fact that, during this phase, the memory segment can have one, two or three worker threads working on it at the same time. Once the last worker thread completes and releases the segment, we go back to a state where the master thread is the only one owning the segment. At this point the segment can be released and, eventually, safely closed.

This shared memory access mode, combined with the capability of slicing a memory segment into multiple disjoint sub-segments, makes it very useful for modeling classic divide and conquer problems (e.g. matrix multiplication), where each worker thread is responsible for working on a subset of the input and a master thread is responsible for joining together the various results collected from the worker threads.
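A minimal, runnable sketch of the divide and conquer pattern, with an ordinary long array standing in for the shared segment and index ranges standing in for disjoint sub-segments (names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Divide-and-conquer sketch: the master thread splits an array into
// disjoint slices, one worker per slice sums its slice, and the master
// joins the partial results. Because the slices are disjoint, no
// synchronization is needed beyond the final join().
public class ParallelSum {
    public static long sum(long[] data, int workers) {
        long[] partial = new long[workers];
        List<Thread> threads = new ArrayList<>();
        int chunk = (data.length + workers - 1) / workers;
        for (int w = 0; w < workers; w++) {
            final int id = w;
            int from = id * chunk, to = Math.min(data.length, from + chunk);
            Thread t = new Thread(() -> {
                long s = 0;
                for (int i = from; i < to; i++) s += data[i]; // this worker's slice
                partial[id] = s;
            });
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) t.join(); // master waits for all workers
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        }
        long total = 0;
        for (long p : partial) total += p;     // master joins the results
        return total;
    }

    public static void main(String[] args) {
        long[] data = new long[100];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data, 3)); // prints 5050
    }
}
```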

Implementing concurrent access modes

In this section we will explore how the access modes shown in the previous section can be implemented in the memory access API. As we shall see, concurrent access poses several challenges when it comes to preserving the API's basic safety and efficiency guarantees.

Static confinement

Static confinement is the simplest access mode, and it is also relatively straightforward to implement. The only thing the implementation needs to do is associate an owning thread with each memory segment:

class MemorySegmentImpl implements MemorySegment {
  final Thread _theThread = Thread.currentThread();
  boolean isAlive = true;

  void close() {
    if (!isAlive || _theThread != Thread.currentThread()) {
      throw new IllegalStateException();
    }
    isAlive = false;
    //release memory
  }
}

As we can see, if each segment keeps track of its owning thread, we can guard every operation on the segment, making sure that the segment can only be accessed from the owning thread itself. And, since only the owning thread can access the segment, we can reliably read the isAlive variable (e.g. during each memory access). This leads to a straightforward implementation which is both safe and efficient (because of the absence of any locking operation, either during segment creation or memory access).
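The behavior described above can be demonstrated with a runnable variant of the sketch (StaticSegment is an illustrative name, and the actual memory release step is elided): a close attempt from a non-owning thread fails, while the owner can close normally:

```java
// Runnable variant of the static confinement sketch: both access and
// close are guarded by the same owner-thread + liveness check.
class StaticSegment implements AutoCloseable {
    final Thread owner = Thread.currentThread();
    boolean isAlive = true;

    void checkAccess() {
        if (!isAlive || owner != Thread.currentThread())
            throw new IllegalStateException();
    }

    @Override
    public void close() {
        checkAccess();
        isAlive = false; // release memory here
    }
}

public class StaticConfinementDemo {
    public static void main(String[] args) throws InterruptedException {
        StaticSegment segment = new StaticSegment();

        // A different thread attempting to close the segment fails.
        boolean[] failed = new boolean[1];
        Thread other = new Thread(() -> {
            try {
                segment.close();
            } catch (IllegalStateException e) {
                failed[0] = true;
            }
        });
        other.start();
        other.join();
        System.out.println(failed[0]);       // prints true
        System.out.println(segment.isAlive); // prints true: still open

        segment.close(); // the owning thread can close it
    }
}
```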

Dynamic confinement

Things get significantly more complex as we move on to consider dynamic ownership. The problem with dynamic ownership is that the thread owning a segment can change (see Figure 2). This seems to suggest that the implementation snippet shown above should be tweaked as follows:

class ConfinedMemorySegmentImpl implements MemorySegment {
  /*final*/ Thread _theThread; //this can change
  boolean isAlive = true;

  void acquire() {
    if (!isAlive || _theThread != null) {        
      throw new IllegalStateException();
    } else {
      _theThread = Thread.currentThread();
    }
  }

  void release() {
    if (!isAlive || _theThread != Thread.currentThread()) {
      throw new IllegalStateException();
    } else {
        _theThread = null;
    }
  }

  void close() {
    if (!isAlive || _theThread != null) {
      throw new IllegalStateException();
    }
    isAlive = false;
    //release memory
  }
}

But there are problems with this approach; first, there can be a race between two threads attempting to acquire the same segment; secondly, there can be a race between a thread attempting to close the segment and another attempting to acquire it. Even marking both isAlive and _theThread as volatile is insufficient; in fact, when executing checks like these:

if (!isAlive || _theThread != null) {
  ...
}

It is possible for thread A to see isAlive set to true and then, by the time the owning thread is checked, some other thread B could have closed the segment, thus releasing all resources associated with it. But, since thread A has already performed the liveness check, the acquire operation will go ahead, potentially granting access to an already closed segment. Reversing the order of the checks in the if statement leads to similar issues.

Transferring ownership: the handoff pattern

One possibility to avoid these issues is to revert back to a model where the owning thread cannot change, and add a handoff operation which can be used by the owning thread to release access to the segment and transfer it to a different thread - for instance:

class ConfinedMemorySegmentImpl implements MemorySegment {
  final Thread _theThread;
  boolean isAlive = true;

  ConfinedMemorySegmentImpl(Thread owner) {
    _theThread = owner;
  }

  ConfinedMemorySegmentImpl handoff(Thread newOwner) {
    if (!isAlive || _theThread != Thread.currentThread()) {
      throw new IllegalStateException();
    }
    isAlive = false; //make this segment unusable, but don't release memory!
    return new ConfinedMemorySegmentImpl(newOwner);
  }

  void close() {
    if (!isAlive || _theThread != Thread.currentThread()) {
      throw new IllegalStateException();
    }
    isAlive = false;
    //release memory
  }
}

This leads to a more straightforward implementation, but also a more restrictive one: it is up to the owning thread to explicitly transfer ownership to a new thread - meaning that the owning thread must know the thread it wants to transfer ownership to.
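The following runnable sketch (with hypothetical names, memory release elided) shows the handoff pattern in action: once ownership is transferred, the old segment view is dead and any further operation on it fails:

```java
// Runnable sketch of the handoff pattern: the owner kills its own
// segment view and mints a fresh one confined to the new owner; the
// backing memory is conceptually shared by both views, so it is NOT
// freed on handoff.
class HandoffSegment {
    final Thread owner;
    boolean isAlive = true;

    HandoffSegment(Thread owner) { this.owner = owner; }

    HandoffSegment handoff(Thread newOwner) {
        if (!isAlive || owner != Thread.currentThread())
            throw new IllegalStateException();
        isAlive = false; // this view becomes unusable; memory is NOT freed
        return new HandoffSegment(newOwner);
    }
}

public class HandoffDemo {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {});
        HandoffSegment first = new HandoffSegment(Thread.currentThread());
        HandoffSegment second = first.handoff(worker);

        System.out.println(first.isAlive);          // prints false
        System.out.println(second.owner == worker); // prints true

        // The old view can no longer be handed off again.
        try {
            first.handoff(worker);
        } catch (IllegalStateException e) {
            System.out.println("stale view rejected"); // prints this line
        }
    }
}
```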

Locking

Another possibility is to use a lock. The lock would need to be acquired upon entering any of the operations exposed by the confined segment (this can be achieved by marking all segment methods as synchronized). However, this approach would kill the performance of memory access since, upon each access, we would have to acquire the lock associated with the segment object.

Atomic state updates

Another implementation option is to use atomic updates to the state variable in the confined segment. For instance, we could use a single mutable Thread variable, as shown below:

class ConfinedMemorySegmentImpl implements MemorySegment {
  final AtomicReference<Thread> _theThread = new AtomicReference<>(NO_OWNER);
  
  final static Thread NO_OWNER = new Thread();
  final static Thread CLOSED = new Thread();

  void acquire() {
    if (!_theThread.compareAndSet(NO_OWNER, Thread.currentThread())) {        
      throw new IllegalStateException();
    }
  }

  void release() {
    if (!_theThread.compareAndSet(Thread.currentThread(), NO_OWNER)) {        
      throw new IllegalStateException();
    }
  }

  void close() {
    if (!_theThread.compareAndSet(NO_OWNER, CLOSED)) {        
      throw new IllegalStateException();
    }
    //release memory
  }
}

This synchronization scheme is pleasingly simple and efficient: since all state transitions are atomic updates, we don't have to worry about races. For instance, if two threads attempt to acquire the same segment, only one of them will see the value of _theThread set to NO_OWNER. Similarly, if a thread is attempting to close a segment while another is acquiring it, only one of these operations can succeed, while the other must fail by construction. Also, when the memory segment is accessed for read/write, we only need to check that the current thread is the same as the one stored in the _theThread variable, and we can do so with a getPlain operation: either the access operation will see a matching thread (meaning that the accessing thread is the very owner of the segment), or it will see a mismatch (because the segment has a different owner, no owner at all, or has already been closed), in which case access must fail.
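The scheme above can be exercised with a small runnable demo (illustrative names, memory release elided): while one thread owns the segment, both acquire and close attempts from any other thread fail atomically:

```java
import java.util.concurrent.atomic.AtomicReference;

// Runnable version of the atomic ownership scheme: every state
// transition is a single compareAndSet, so races between acquire,
// release and close are resolved atomically.
class AtomicConfinedSegment {
    static final Thread NO_OWNER = new Thread();
    static final Thread CLOSED = new Thread();

    final AtomicReference<Thread> owner = new AtomicReference<>(NO_OWNER);

    void acquire() {
        if (!owner.compareAndSet(NO_OWNER, Thread.currentThread()))
            throw new IllegalStateException();
    }

    void release() {
        if (!owner.compareAndSet(Thread.currentThread(), NO_OWNER))
            throw new IllegalStateException();
    }

    void close() {
        if (!owner.compareAndSet(NO_OWNER, CLOSED))
            throw new IllegalStateException();
    }
}

public class AtomicConfinementDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicConfinedSegment segment = new AtomicConfinedSegment();
        segment.acquire();

        // While the main thread owns the segment, acquire and close
        // from another thread must both fail.
        int[] failures = new int[1];
        Thread other = new Thread(() -> {
            try { segment.acquire(); } catch (IllegalStateException e) { failures[0]++; }
            try { segment.close();   } catch (IllegalStateException e) { failures[0]++; }
        });
        other.start();
        other.join();
        System.out.println(failures[0]); // prints 2

        segment.release();
        segment.close(); // now succeeds
    }
}
```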

Shared access

Things are similarly convoluted in the shared access mode; again a naive approach would fail to take into account races between multiple threads, as shown in the example below:

class SharedMemorySegmentImpl implements MemorySegment {
  final Set<Thread> owners = new HashSet<>();
  boolean isAlive = true;

  void acquire() {
    if (!isAlive || !owners.add(Thread.currentThread())) {
        throw new IllegalStateException();
    }
  }

  void release() {
    if (!isAlive || !owners.remove(Thread.currentThread())) {
        throw new IllegalStateException();
    }
  }

  void close() {
    if (!isAlive || !owners.isEmpty()) {
      throw new IllegalStateException();
    }
    isAlive = false;
    //release memory
  }
}

Again, consistency between isAlive and owners is at risk here, as it will be possible for a thread to assume that a segment is alive when, by the time access occurs, it is no longer the case. There's an additional issue here in that the owners set must be made thread-safe, or the integrity of the data structure itself is at risk if updated concurrently from multiple threads.

Locking

Since there's no meaningful handoff operation we can define in this case (to keep the segment ownership immutable), we must once again turn to locking. Here we can observe that a read/write lock should be enough to preserve consistency. That is: every access operation is performed while holding the read lock - which many threads can hold at once - while the close operation is performed while holding the write lock, which can only be granted when no reader is active.

Again, while the access pattern described above is safe, it is also inherently slow - as each access operation requires interacting with a lock object, which can be very expensive.
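For concreteness, here is a minimal sketch of such a lock-based shared segment (hypothetical names); using tryLock on the write lock makes close fail fast, rather than block, while readers are active:

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Lock-based sketch of a shared segment: each accessor holds the read
// lock between acquire and release; close needs the write lock, which
// can only be granted when no readers are active.
class LockedSharedSegment {
    final ReadWriteLock lock = new ReentrantReadWriteLock();
    volatile boolean isAlive = true;

    void acquire() {
        lock.readLock().lock();
        if (!isAlive) {              // segment closed before we got here
            lock.readLock().unlock();
            throw new IllegalStateException();
        }
    }

    void release() {
        lock.readLock().unlock();
    }

    void close() {
        if (!lock.writeLock().tryLock()) // fail fast if readers are active
            throw new IllegalStateException();
        try {
            if (!isAlive) throw new IllegalStateException();
            isAlive = false; // release memory here
        } finally {
            lock.writeLock().unlock();
        }
    }
}

public class LockedSharedDemo {
    public static void main(String[] args) {
        LockedSharedSegment s = new LockedSharedSegment();
        s.acquire();
        try {
            s.close(); // fails: a read lock is held
        } catch (IllegalStateException e) {
            System.out.println("close refused while acquired");
        }
        s.release();
        s.close();
        System.out.println(s.isAlive); // prints false
    }
}
```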

Atomic state updates

Like we did for the dynamic confinement access mode, we could use atomic state transitions to guarantee consistency across multiple threads - e.g. by using an atomic counter to keep track of how many threads have acquired the segment. This is a sketch of how such a strategy can be implemented:

class SharedMemorySegmentImpl implements MemorySegment {
  final AtomicInteger activeCount = new AtomicInteger();
  
  final static int NO_OWNER = 0;
  final static int CLOSED = -1;

  void acquire() {
    if (activeCount.updateAndGet(i -> i == CLOSED ? CLOSED : i + 1) == CLOSED) {
      throw new IllegalStateException();
    }
  }

  void release() {
    if (activeCount.getAndUpdate(i -> i <= NO_OWNER ? i : i - 1) <= NO_OWNER) {
      throw new IllegalStateException();
    }
  }

  void close() {
    if (!activeCount.compareAndSet(NO_OWNER, CLOSED)) {        
      throw new IllegalStateException();
    }
    //release memory
  }
}

Performance-wise this implementation strategy would be as efficient as the one shown in the dynamic confinement case, as no expensive lock operation is required before performing memory access - a getPlain on the atomic counter should be enough to preserve basic safety.
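A runnable version of the counter scheme (illustrative names, memory release elided) shows the intended life-cycle: close is refused while the count is positive, and succeeds once every acquire has been matched by a release:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Runnable version of the shared counter scheme: the count of active
// acquirers gates the close operation.
class CountedSharedSegment {
    static final int NO_OWNER = 0;
    static final int CLOSED = -1;

    final AtomicInteger activeCount = new AtomicInteger();

    void acquire() {
        if (activeCount.updateAndGet(i -> i == CLOSED ? CLOSED : i + 1) == CLOSED)
            throw new IllegalStateException();
    }

    void release() {
        if (activeCount.getAndUpdate(i -> i <= NO_OWNER ? i : i - 1) <= NO_OWNER)
            throw new IllegalStateException();
    }

    void close() {
        if (!activeCount.compareAndSet(NO_OWNER, CLOSED))
            throw new IllegalStateException();
    }
}

public class CountedSharedDemo {
    public static void main(String[] args) {
        CountedSharedSegment s = new CountedSharedSegment();
        s.acquire();
        s.acquire();                 // two acquirers (here, the same thread)
        try {
            s.close();               // fails: count is 2, not 0
        } catch (IllegalStateException e) {
            System.out.println("close refused while acquired");
        }
        s.release();
        s.release();
        s.close();                   // succeeds: count dropped back to 0
        System.out.println(s.activeCount.get()); // prints -1 (CLOSED)
    }
}
```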

That said, such a scheme is not without issues: the counter records how many threads have acquired the segment, but not which ones. As a result, nothing prevents a non-cooperative thread from accessing the segment without having acquired it first, or from releasing a segment that some other thread acquired.

To counteract such issues, we would need to resort to a more complex scheme where the atomic counter is replaced by a set of active threads, and then have all operations check against the set, to make sure the operation is indeed possible. Not only does this approach lead to extra synchronization considerations (e.g. how to update such a set in a thread-safe way?), but it also adds significant performance overhead - since each memory access operation will have to do some expensive lookup operation to check that the thread accessing the memory belongs to the set stored in the accessed segment.

Dead end?

Summing up, while the current memory access API lends itself well to the static confinement access mode, and can also be stretched in a reasonably straightforward way to support dynamic confinement, there seems to be no safe and efficient way to implement shared access to a memory segment.

There are also more subtle issues, in that, while the dynamic confinement implementation strategy shown above works reasonably well, it cannot perform as well as the static confinement implementation, because each memory access operation will need to perform a thread check against a mutable thread variable, which cannot be optimized as effectively by the JIT compiler.

Memory handles

The reason why our previous attempts at modeling concurrent memory access modes have failed is that the MemorySegment abstraction was playing a dual role: on the one hand, it allowed access to memory (e.g. by generating addresses that could be used with memory access var handles); on the other hand, it also embodied the state machine modeling the specific memory access mode to be implemented. This caused friction, as there was no simple way to guarantee - especially in the shared case - that each memory access from a given thread followed an acquire operation from that same thread; this means that each memory access needs to be validated again, thus making access more expensive. At the same time, there was no way to guarantee, at the API level, that a release operation from a given thread was only possible after an acquire performed by the same thread.

In this section we will explore a possible solution to these issues; the idea is to move all access operations from the MemorySegment abstraction to a new abstraction, namely MemoryHandle, which has to be explicitly acquired. This is a sketch of what the API looks like:

interface MemorySegment extends AutoCloseable {
  long size();

  void resize(long offset, long size);

  MemorySegment asReadOnly();

  ...

  MemoryHandle acquire();

  void close();
}

interface MemoryHandle extends AutoCloseable {
  MemorySegment segment();

  MemoryAddress baseAddress();

  int getInt(MemoryAddress addr);

  void setInt(MemoryAddress addr, int value);

  //other getter/setters

  void close();
}

As we can see, in order to access a memory segment, we now have to retrieve a handle; the handle supports operations for generating addresses within the segment, as well as helper methods to read/write the segment's memory. After a client has finished accessing memory, it can release the handle; when all handles have been released, the segment can safely be closed, and all the resources associated with it released. Below is an example of how a client would interact with such an API:

try (MemorySegment segment = MemorySegment.ofNative(100)) {
  try (MemoryHandle handle = segment.acquire()) {
     int anInt = handle.getInt(handle.baseAddress());
  }
}

It is easy to see how this API solves the two issues we mentioned above: acquire operations are now naturally matched by release operations - e.g. it is no longer possible to release a segment without first having acquired a handle to it. Secondly, and more importantly, after a handle has been acquired, we know that memory can be safely accessed - we no longer need to worry about the segment being closed by some other thread: as long as the handle is still active, closing the segment will result in an error.

Dynamic confinement

Implementing dynamic confinement with this refined API is straightforward; essentially, we can build on the synchronization model shown earlier (the atomic thread-owner scheme), and treat each MemorySegment::acquire as an acquire operation, and each MemoryHandle::close as the dual release operation.

One benefit of splitting the memory access API this way is that, upon acquire, we can create a fresh memory handle that is specific to the very thread which triggered the acquire. This means that, in the memory handle implementation, the thread owner variable can, once again, be marked as final. This is crucial because, upon memory access, we need to check that (i) the memory handle from which access originated is still valid and (ii) the thread from which access originated is the same as the thread owning the handle. These checks are similar in complexity to the ones shown in the static confinement case (e.g. the thread check compares two thread constants, and can easily be hoisted), so we can now expect comparable memory access performance.
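Here is a minimal sketch of this idea (Segment and Handle are illustrative names, and an int array stands in for the segment's memory): the segment's mutable state is confined to a single atomic reference, while each handle stores its owner in a final field, so the per-access checks compare two constants:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the split API under dynamic confinement: the segment
// tracks ownership atomically; handles are minted per-acquire with a
// *final* owner field, making per-access checks cheap.
class Segment {
    static final Thread NO_OWNER = new Thread();
    final AtomicReference<Thread> owner = new AtomicReference<>(NO_OWNER);

    Handle acquire() {
        Thread current = Thread.currentThread();
        if (!owner.compareAndSet(NO_OWNER, current))
            throw new IllegalStateException();
        return new Handle(this, current);
    }
}

class Handle implements AutoCloseable {
    final Segment segment;
    final Thread owner;   // final: set once at acquire time
    boolean isOpen = true;

    Handle(Segment segment, Thread owner) {
        this.segment = segment;
        this.owner = owner;
    }

    int getInt(int[] memory, int index) {
        // (i) handle still valid, (ii) caller is the owning thread
        if (!isOpen || owner != Thread.currentThread())
            throw new IllegalStateException();
        return memory[index];
    }

    @Override
    public void close() {
        if (!isOpen || !segment.owner.compareAndSet(owner, Segment.NO_OWNER))
            throw new IllegalStateException();
        isOpen = false;
    }
}
```

Once a handle is closed, the segment reverts to the unowned state and can be re-acquired, possibly by a different thread.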

Shared access

For shared memory access we can build on the synchronization model shown earlier (the atomic counter scheme), again treating each MemorySegment::acquire as an acquire operation, and each MemoryHandle::close as the dual release operation.

We have seen before how such a synchronization model, while providing a reasonable performance model, was not strict enough to prevent bad things - such as allowing memory access to a thread which did not do an acquire - from happening. Thanks to the new memory handle abstraction, these issues are no longer present: since memory access is not possible without having acquired a memory handle, we no longer have to worry about non-cooperative threads accessing memory, either by accident or maliciously. Also, since the memory handle API only allows - by construction - a release operation to occur after an acquire, we no longer have to worry about spurious releases from threads which did not acquire the segment in the first place.

The worst thing that can happen is for the same thread to acquire the same segment multiple times - meaning it will potentially have to release all acquired handles to the same segment before the segment can be closed. But this hardly seems a deal breaker, and the performance gains deriving from this simpler synchronization approach seem to make up for it.
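A minimal sketch of the shared case (illustrative names): the counter from the previous section is kept, but the release operation is only reachable through a handle's close method, so mismatched acquire/release pairs are impossible by construction:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of shared access through handles: the acquirer count gates
// close, and release only exists as SharedHandle::close, so a thread
// cannot release a segment it never acquired.
class SharedSegment {
    static final int CLOSED = -1;
    final AtomicInteger activeCount = new AtomicInteger();

    SharedHandle acquire() {
        if (activeCount.updateAndGet(i -> i == CLOSED ? CLOSED : i + 1) == CLOSED)
            throw new IllegalStateException();
        return new SharedHandle(this);
    }

    void close() {
        if (!activeCount.compareAndSet(0, CLOSED))
            throw new IllegalStateException();
    }
}

class SharedHandle implements AutoCloseable {
    final SharedSegment segment;
    boolean isOpen = true; // per-handle flag: each handle releases once

    SharedHandle(SharedSegment segment) { this.segment = segment; }

    @Override
    public void close() {
        if (!isOpen) throw new IllegalStateException(); // already released
        isOpen = false;
        segment.activeCount.decrementAndGet();
    }
}
```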

What about static confinement?

While the approach described in this section seems expressive enough to handle dynamic confinement and shared access with ease, things don't look promising in the simple static confined case. The reason is that the usability cost imposed by the new API (see example above) seems excessive in cases where only one thread wants to access the segment.

One possible solution for this usability issue is to have an UnsharedMemorySegment abstraction which implements both MemorySegment and MemoryHandle. After all, in the static confined case, the life-cycle is very simple: either the segment is alive, or it has been closed - which means that the life-cycle of the segment itself and the life-cycle of the underlying memory handle are indeed one and the same.

If we go down this path, we can recover good usability in the static confinement case:

try (UnsharedMemorySegment segment = UnsharedMemorySegment.ofNative(100)) {
  int anInt = segment.getInt(segment.baseAddress());
}

For consistency, we could also define ConfinedMemorySegment and SharedMemorySegment and only add the acquire operation on these two abstractions - after all, acquiring a statically confined segment doesn't make much sense.

Conclusions (or pick your poison)

In this document we have presented the relevant concurrent memory access modes, and shown how the Panama memory access API can be extended to take them into account. We have seen how the current memory access API fails to handle dynamic confinement efficiently, and completely breaks apart with the shared access mode; to make room for these use cases and support them in an efficient fashion, it is necessary to introduce a new abstraction, namely MemoryHandle, which can be used to safely access memory after a successful acquire operation on a segment.

In terms of the evolution of the memory access API, we have several choices:

  1. only support the static confined access mode
  2. support static confined access mode, and also support creation of unsafe shared segments in sun.misc.Unsafe - such segments will be inherently racy, and therefore could cause hard JVM crashes if used incorrectly (hence the term unsafe)
  3. support the full spectrum of access modes, by splitting the MemorySegment API as described in this document.

One thing worth mentioning is that these options are not necessarily mutually exclusive: for instance we could start by providing (1) or (2), and then evolve to (3) as we learn more about typical usages of the memory access API.