Microbenchmark to Measure the Performance Effect of JVMTI Agent Calls by the Example of GetOwnedMonitorStackDepthInfo
=====================================================================================================================

The main purpose of this benchmark is to compare the performance effect of JVMTI calls made by an
agent loaded into OpenJDK 16 at revision jdk-16+4 with the performance effect when webrev.6 of
JDK-8227745 is applied on top of jdk-16+4. This is done using the JVMTI function
GetOwnedMonitorStackDepthInfo() [1] as an example.

The benchmark is also intended to be a worst case for the unnecessary deoptimizations that webrev.6
performs because of imprecise debug info.

The benchmark has four test cases. Each has a tight and hot loop. The benchmark score is the loop
count / ms while the JVMTI agent performs GetOwnedMonitorStackDepthInfo() calls. The intervals
between the calls are given as an argument to the agent.

The benchmark results show:

(A) Benchmark runs with serial gc showed a negative performance effect of 5%, even with low agent
    activity and even if there were no ea-local objects in scope. This was fixed in webrev.6 by
    letting a target thread that is suspended for object deoptimization spin for a short interval
    before calling wait() (see JavaThread::wait_for_object_deoptimization()).

(B) No effect if the agent is inactive besides the positive effect of escape analysis (ea).

(C) No negative performance effect if there are no ea-local objects (i.e. local to the compiled
    method or local to the thread according to escape analysis) in scope of the target thread (see
    TestCase00_GlobalEscape).

(D) In the worst case performance can be reduced by 8% to 10% by webrev.6 over jdk-16+4 if the
    target thread executes a hot and extremely long loop with ea-local objects but the jit did not
    optimize on that (see TestCase01_ArgEscape). In that case webrev.6 conservatively but
    unnecessarily deoptimizes the hot loop, because the debug info is imprecise: there are ea-local
    objects, but they are not accessed. Subsequently execution will switch to the interpreter. It
    will be back to compiled mode very quickly by means of on-stack-replacement (OSR). But OSR
    compilations are often inferior to ordinary compilations performance-wise. This explains the
    negative performance effect.

(E) The negative effect in (D) can easily be avoided if the long running loop is replaced by an
    outer and an inner loop and the latter is factored out into a separate method (see
    TestCase01_ArgEscape_less_OSR_execution).

(F) Only at intervals shorter than 10ms does the performance effect of the deoptimizations
    necessary for the implementation of JDK-8227745 become significant. If there are ea-local
    objects, then performance will be impacted by 20% - 30%. On the other hand, it will hardly be
    impacted if there are no ea-local objects.
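
The restructuring described in (E) can be sketched as follows. This is a minimal illustration, not
the code of TestCase01_ArgEscape_less_OSR_execution: the class name, the Counter type, the method
names and the chunk size are all assumptions made for this example. The idea is that every call of
the inner method enters at the normal method entry, so after a deoptimization an ordinary
compilation can be used again instead of staying in OSR compiled code until the one long loop ends.

```java
class LoopSplitSketch {
    // Hypothetical stand-in for an object that never escapes the method allocating it.
    static final class Counter { long value; }

    // Hypothetical number of iterations executed per call of the inner method.
    static final long CHUNK = 10_000;

    // Variant 1: a single extremely long loop in one method.
    static long singleLongLoop(long iterations) {
        Counter c = new Counter();
        for (long i = 0; i < iterations; i++) {
            c.value += i & 1;
        }
        return c.value;
    }

    // Variant 2, inner part: the hot loop body lives in its own method.
    static long innerLoop(long from, long to) {
        Counter c = new Counter();
        for (long i = from; i < to; i++) {
            c.value += i & 1;
        }
        return c.value;
    }

    // Variant 2, outer part: calls the inner method once per chunk.
    static long splitLoop(long iterations) {
        long total = 0;
        for (long start = 0; start < iterations; start += CHUNK) {
            long end = Math.min(start + CHUNK, iterations);
            total += innerLoop(start, end);
        }
        return total;
    }

    public static void main(String[] args) {
        long n = 25_000;
        // Both variants compute the same value; only the loop structure differs.
        System.out.println(singleLongLoop(n) == splitLoop(n));
    }
}
```

Both variants do the same work. Whether the split variant actually scores better depends on the jit
and on the chunk size, so it should be measured rather than assumed.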

Results in Detail
-----------------

The results are given in 2 spreadsheets.
(LibreOffice would be a suitable viewer)

GetOwnedMonitorInfoALotResults_serialgc_no_tiered_comp.xlsx
GetOwnedMonitorInfoALotResults_pargc.xlsx

The raw output including vm parameters can be found in results_raw.zip

The runs with serial gc and with tiered compilation disabled were conducted mainly because at first
there was too much variance in the results. The variance was actually caused by the NUMA
architecture of the test machine. It was eliminated by pinning the jvm process to one NUMA
node. Still, the serial gc runs were valuable because they disclosed the performance bug described
above in (A).

Why GetOwnedMonitorStackDepthInfo?
----------------------------------

The following JVMTI operations require deoptimizations in the implementation of JDK-8227745 if
ea-local objects are accessed.

1. Heap operations where objects are visited by following references or by iterating them directly
   on the heap.

2. GetOwnedMonitor[StackDepth]Info

3. PopFrame/ForceEarlyReturn. These operations don't provide access to ea-local objects, but still
   they are not compatible with ea because their implementation interferes with popping frames to
   recover from reallocation failures during frame deoptimization.

4. Get/SetLocalObject

It is assumed that the additional deoptimizations for 1., 3., and 4. are not performance critical.

1. Heap Operations are very heavyweight compared to deoptimizing objects/frames. Agents don't do
   them at a high frequency.

3. For PopFrame/ForceEarlyReturn the target thread is switched to interpreter execution anyway. Also
   I would think that the operations are hardly used, and if they are used, then interactively,
   i.e. not at a high frequency.

4. I would also reckon that Get/SetLocalObject are almost always used in interactive sessions with
   low frequency. The spec does not require a suspended thread (which surprised me, because JDI
   does). Looking at the implementation, however, I'd think the thread must not be running.
   Otherwise the java frame would not be found, because it is searched with a stackwalk in the
   prologue of the vm op, i.e. before the safepoint:

   VM_GetOrSetLocal::get_vframe()
     VM_GetOrSetLocal::get_java_vframe()
       VM_GetOrSetLocal::doit_prologue()

   The stackwalk there is actually unsafe and can crash the VM (I will file a bug report).
   So existing agents will make sure that the target thread is not running before reading locals.

Even for 2. I doubt that a real agent exists that performs these operations at a high
frequency. Rather they are used to analyze hanging systems. Bytecode instrumentation is the
recommended tool for sampling information at high rates.


Is webrev.6 too Conservative Regarding Deoptimizations?
------------------------------------------------------

Yes, webrev.6 is too conservative. Because of imprecise debug info it deoptimizes compiled frames
and objects even if this is not necessary. There are two reasons why a deoptimization can be
unnecessary: first, the existing ea-local objects might not be in the result set of a JVMTI call;
second, the jit might not have optimized on the escape state of the ea-local objects in the result
set. The debug info is imprecise because it does not provide one flag per frame slot that would
indicate that the slot holds a reference to an ea-local object for which ea-based optimizations
exist. Instead there are 2 flags per safepoint: one indicating if any ea-local object is in scope
and another indicating if any ea-local object is passed as argument, if the safepoint is actually a
call site.
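
The difference between the two flags per safepoint and the hypothetical per-slot information can be
illustrated with a toy model. This is a strongly simplified sketch and not HotSpot code: all class,
field and method names are assumptions made for this example, and the real decision also depends
on which frames a JVMTI call inspects.

```java
import java.util.List;

class DebugInfoPrecisionSketch {
    // Toy model of the 2 flags per safepoint described in the text (names are hypothetical).
    static final class SafepointFlags {
        final boolean eaLocalInScope;      // any ea-local object is in scope
        final boolean eaLocalPassedAsArg;  // any ea-local object is passed as argument at a call
        SafepointFlags(boolean eaLocalInScope, boolean eaLocalPassedAsArg) {
            this.eaLocalInScope = eaLocalInScope;
            this.eaLocalPassedAsArg = eaLocalPassedAsArg;
        }
    }

    // Hypothetical precise alternative: one entry per frame slot.
    static final class Slot {
        final boolean inResultSet;             // the JVMTI call would return this object
        final boolean hasEaBasedOptimization;  // the jit optimized on its escape state
        Slot(boolean inResultSet, boolean hasEaBasedOptimization) {
            this.inResultSet = inResultSet;
            this.hasEaBasedOptimization = hasEaBasedOptimization;
        }
    }

    // Conservative decision: deoptimize as soon as one of the two flags is set.
    static boolean deoptimizeCoarse(SafepointFlags flags) {
        return flags.eaLocalInScope || flags.eaLocalPassedAsArg;
    }

    // Precise decision: deoptimize only if a returned object is actually optimized.
    static boolean deoptimizePrecise(List<Slot> slots) {
        for (Slot slot : slots) {
            if (slot.inResultSet && slot.hasEaBasedOptimization) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // An ea-local object is in scope, but it is neither returned nor optimized.
        SafepointFlags flags = new SafepointFlags(true, false);
        List<Slot> slots = List.of(new Slot(false, false));
        System.out.println("coarse:  " + deoptimizeCoarse(flags));
        System.out.println("precise: " + deoptimizePrecise(slots));
    }
}
```

In this model the coarse decision deoptimizes whenever an ea-local object is merely present, which
corresponds to the unnecessary deoptimizations discussed in this section.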

To answer the question we have to look at the potential gain and at the costs. All deoptimizations
from TestCase01_ArgEscape* are unnecessary for both reasons. The potential gain is the performance
difference between jdk-16+4 and webrev.6, because jdk-16+4 does not deoptimize. So in the best case,
which is the worst case for webrev.6, the gain will be around 10% unless the agent samples at
intervals of 1ms or shorter, in which case the gain would be up to 30%.

Monitor information is typically only retrieved in the case of hanging systems or maybe when hitting
a breakpoint, but it is not sampled at high frequencies. For sampling at high frequencies bytecode
instrumentation is the better choice. The worst case is rare and not even that bad. It can be easily
avoided by replacing the (almost) endless loop. So realistically the gain by avoiding unnecessary
deoptimizations will be close to zero.

Now let's look at the costs of being more precise. One has to be aware that optimizations that rely
on the escape state of accessed objects need not be at the current pc of the target thread. They can
be far ahead when execution continues, maybe even in caller methods. So it is not trivial to know
if code with optimizations on the escape state of an escaping object will be reached. Also, besides
elimination of allocations and locks there are optimizations of the memory graph. Altogether it
would make the implementation in the jit and in the runtime significantly more complex if
information on whether ea-based optimizations exist were to be kept for accessed objects. Webrev.6
provides just 2 flags per safepoint in the debug info: one is true iff ea-local objects are in scope
and the other is true at calls that pass ea-local objects as argument. One idea for improvement
could be to split the first flag in two: one for monitors and the other for locals and expressions.
This would help to avoid deoptimization of the frame with the innermost lock. Callers that pass
ea-local objects as arguments have to be deoptimized anyway, because we cannot know if objects
collected so far are arguments.

So it might be possible to split the flag for ea-locals in scope at moderate costs. This would help
to avoid the worst case TestCase01_ArgEscape*. But a new worst case can be constructed:
TestCase01_ArgEscape passes an ea-local that is not in the result set to a callee that does its own
locking. Then we would have to deoptimize the long running loop again.
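
The constructed worst case could look roughly like the following sketch. It is purely illustrative
and not taken from the benchmark: the class, the Point type, the lock object and the method names
are assumptions made for this example, and whether the jit really treats the object as ea-local
depends on inlining and other compilation decisions.

```java
class ArgEscapeWorstCaseSketch {
    // Hypothetical object type that is only ever passed down as an argument.
    static final class Point {
        final int x;
        Point(int x) { this.x = x; }
    }

    // Hypothetical lock owned by the callee. It is unrelated to the argument.
    private static final Object CALLEE_LOCK = new Object();

    // The callee does its own locking, so the monitor in the result set of
    // GetOwnedMonitorStackDepthInfo() is CALLEE_LOCK and not the Point.
    static int callee(Point p) {
        synchronized (CALLEE_LOCK) {
            return p.x + 1;
        }
    }

    // The long running loop passes an object as argument that is never locked itself.
    static long hotLoop(long iterations) {
        Point local = new Point(1);
        long sum = 0;
        for (long i = 0; i < iterations; i++) {
            sum += callee(local);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hotLoop(1_000_000));
    }
}
```

The object passed by hotLoop() is not in the result set, but if it is ea-local the caller frame is
still flagged as passing an ea-local object as argument, which is the situation described above.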

TestCase02_NoEscape can be improved with a third flag.

Well, maybe TestCase01_ArgEscape is not such a corner case. Then it could pay off to do the flag
splitting.

Setup and Execution
-------------------

The benchmark source code is contained in 8227745_GetOwnedMonitorStackDepthInfoALot.patch. It is
implemented as a jtreg test, which makes building it easy. It is executed without jtreg, though.
libGetOwnedMonitorStackDepthInfoALot.c holds the source code of the native JVMTI
agent. GetOwnedMonitorStackDepthInfoALot.java is the actual benchmark.

The benchmark has four test cases that have a test method with a loop. The benchmark score is the
loop count / ms while the JVMTI agent performs GetOwnedMonitorStackDepthInfo() calls. The
intervals between the calls are given as an argument to the agent.
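
How a score in loop count / ms can be derived is sketched below. This is only an illustration of
the arithmetic and not the benchmark's actual measuring code: the class name, the method names and
the use of System.nanoTime() are assumptions made for this example.

```java
import java.util.function.LongSupplier;

class ScoreSketch {
    // Score = loop count divided by the elapsed time in milliseconds.
    static double score(long loopCount, long elapsedNanos) {
        double elapsedMs = elapsedNanos / 1_000_000.0;
        return loopCount / elapsedMs;
    }

    // Runs a test method that returns its loop count and computes the score for that run.
    static double measure(LongSupplier testMethod) {
        long start = System.nanoTime();
        long loopCount = testMethod.getAsLong();
        long elapsedNanos = Math.max(System.nanoTime() - start, 1); // avoid division by zero
        return score(loopCount, elapsedNanos);
    }

    public static void main(String[] args) {
        double result = measure(() -> {
            long n = 0;
            for (long i = 0; i < 50_000_000L; i++) {
                n++;
            }
            return n;
        });
        System.out.println("score (loop count / ms): " + result);
    }
}
```

For example, 2,000,000 iterations in one second (1000 ms) give a score of 2000.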

The agent can be built with the command

make build-test-hotspot-jtreg-native

This results in support/test/hotspot/jtreg/native/lib/libGetOwnedMonitorStackDepthInfoALot.so

See contents of results_raw.zip for command lines to run the benchmark.

Test Environment:

20x Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
128 GB RAM
Linux lu0486 4.4.0-177-generic #207-Ubuntu SMP Mon Mar 16 01:16:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

The jvm process was bound to one NUMA node to get stable results with the command

numactl --cpunodebind=$NODE --membind=$NODE

[1] Specification of GetOwnedMonitorStackDepthInfo()
    https://docs.oracle.com/en/java/javase/14/docs/specs/jvmti.html#GetOwnedMonitorStackDepthInfo