Microbenchmark to Measure the Performance Effect of JVMTI Agent Calls by the Example of GetOwnedMonitorStackDepthInfo
=====================================================================================================================

The main purpose of this benchmark is to measure the performance effect of JVMTI calls made by an agent loaded into OpenJDK 16 at revision jdk-16+4, compared with the performance effect when webrev.6 of JDK-8227745 is applied on top of jdk-16+4. This is done using the JVMTI function GetOwnedMonitorStackDepthInfo() [1] as an example. The benchmark is also meant to be a worst case for webrev.6 with respect to deoptimizations that are unnecessary because of imprecise debug info.

The benchmark has four test cases. Each has a tight and hot loop. The benchmark score is the loop count / ms while the JVMTI agent performs GetOwnedMonitorStackDepthInfoALot calls. The intervals of the calls are given as argument to the agent.

The benchmark results show:

(A) Benchmark runs with serial gc showed a negative performance effect of 5%, even with low agent activity and even if there were no ea-local objects in scope. This was fixed in webrev.6 by letting a target thread that is suspended for object deoptimization spin for a short interval before calling wait() (see JavaThread::wait_for_object_deoptimization()).

(B) If the agent is inactive, there is no effect apart from the positive effect of escape analysis (ea).

(C) There is no negative performance effect if there are no ea-local objects (i.e. objects local to the compiled method or local to the thread according to escape analysis) in scope of the target thread (see TestCase00_GlobalEscape).

(D) In the worst case, webrev.6 reduces performance by 8% to 10% compared with jdk-16+4. This happens if the target thread executes a hot and extremely long-running loop with ea-local objects in scope, but the jit did not optimize based on their escape state (see TestCase01_ArgEscape).
In that case webrev.6 conservatively but unnecessarily deoptimizes the hot loop, because the debug info is imprecise: there are ea-local objects, but they are not accessed. Execution then switches to the interpreter. It returns to compiled mode very quickly by means of on-stack replacement (OSR), but OSR compilations often perform worse than ordinary compilations. This explains the negative performance effect.

(E) The negative effect in (D) can easily be avoided if the long-running loop is replaced by an outer and an inner loop and the latter is factored out into a separate method (see TestCase01_ArgEscape_less_OSR_execution).

(F) Only at intervals shorter than 10 ms does the performance effect of the deoptimizations necessary for the implementation of JDK-8227745 become significant. If there are ea-local objects, performance is reduced by 20% to 30%. If there are no ea-local objects, it is hardly affected.

Results in Detail
-----------------

The results are given in 2 spreadsheets (LibreOffice is a suitable viewer):

  GetOwnedMonitorInfoALotResults_serialgc_no_tiered_comp.xlsx
  GetOwnedMonitorInfoALotResults_pargc.xlsx

The raw output including vm parameters can be found in results_raw.zip.

The runs with serial gc and with tiered compilation disabled were mainly conducted because at first there was too much variance in the results. The variance was actually caused by the NUMA architecture of the test machine. It was eliminated by pinning the jvm process to one NUMA node. Still, the serial gc runs were valuable because they disclosed the performance bug described above in (A).

Why GetOwnedMonitorStackDepthInfo?
----------------------------------

The following JVMTI operations require deoptimizations in the implementation of JDK-8227745 if ea-local objects are accessed.

1. Heap operations where objects are visited by following references or by iterating them directly on the heap.

2. GetOwnedMonitor[StackDepth]Info

3. PopFrame/ForceEarlyReturn. These operations don't provide access to ea-local objects, but they are still not compatible with ea, because their implementation interferes with popping frames to recover from reallocation failures during frame deoptimization.

4. Get/SetLocalObject

It is assumed that the additional deoptimizations for 1., 3., and 4. are not performance critical.

1. Heap operations are very heavyweight compared to deoptimizing objects/frames. Agents don't do them at a high frequency.

3. For PopFrame/ForceEarlyReturn the target thread is switched to interpreter execution anyway. Also, I would think that the operations are hardly used, and if they are used, then interactively, i.e. not at a high frequency.

4. I would also reckon that Get/SetLocalObject are almost always used in interactive sessions with low frequency. The spec does not require a suspended thread (which surprised me, because JDI does). Looking at the implementation, however, I'd think the thread must not be running, because otherwise the java frame would not be found: it is searched with a stackwalk in the prologue of the vm op, i.e. before the safepoint:

     VM_GetOrSetLocal::get_vframe()
     VM_GetOrSetLocal::get_java_vframe()
     VM_GetOrSetLocal::doit_prologue()

   The stackwalk there is actually unsafe and can crash the VM (I will create a bug report). So existing agents will make sure that the target thread is not running before reading locals.

Even for 2. I doubt that a real agent exists that performs these operations at a high frequency. Rather, they are used to analyze hanging systems. Bytecode instrumentation is the recommended tool for sampling information at high rates.

Is webrev.6 too Conservative Regarding Deoptimizations?
-------------------------------------------------------

Yes, webrev.6 is too conservative. Because the debug info is imprecise, it deoptimizes compiled frames and objects even when this is not necessary.
A deoptimization can be unnecessary for two reasons: firstly, the existing ea-local objects might not be in the result set of a JVMTI call; secondly, the jit might not have optimized on the escape state of the ea-local objects in the result set.

The debug info is imprecise because it does not provide one flag per frame slot that would indicate that the slot holds a reference to an ea-local object for which ea-based optimizations exist. Instead there are 2 flags per safepoint: one indicating if any ea-local object is in scope and another indicating if any ea-local object is passed as argument, if the safepoint is actually a call site.

To judge whether it is worth being more precise, we have to look at the potential gain and at the costs.

All deoptimizations from TestCase01_ArgEscape* are unnecessary for both reasons. The potential gain is the performance difference between jdk-16+4 and webrev.6, because jdk-16+4 does not deoptimize. So in the best case, which is the worst case for webrev.6, the gain will be around 10%, unless the agent samples at intervals of 1 ms or shorter, in which case the gain would be up to 30%. Monitor information is typically only retrieved in the case of hanging systems or maybe when hitting a breakpoint, but it is not sampled at high frequencies. For sampling at high frequencies, bytecode instrumentation is the better choice. The worst case is rare and not even that bad. It can easily be avoided by replacing the (almost) endless loop. So realistically the gain from avoiding unnecessary deoptimizations will be close to zero.

Now let's look at the costs of being more precise. One has to be aware that optimizations that rely on the escape state of accessed objects need not be at the current pc of the target thread. They can be far ahead when execution continues, maybe even in caller methods. So it is not trivial to know if code with optimizations on the escape state of an escaping object will be reached. Also, besides elimination of allocations and locks there are optimizations of the memory graph.
Altogether, keeping information on whether ea-based optimizations exist for accessed objects would make the implementation in the jit and in the runtime significantly more complex.

Webrev.6 provides just 2 flags per safepoint in the debug info: one is true iff ea-local objects are in scope and the other is true at calls that pass ea-local objects as argument. One idea for improvement could be to split the first flag in two: one for monitors and the other for locals and expressions. This would help to avoid deoptimization of the frame with the innermost lock. Callers that pass ea-local objects as arguments have to be deoptimized anyway, because we cannot know if objects collected so far are arguments. So it might be possible to split the flag for ea-locals in scope at moderate costs. This would help to avoid the worst case TestCase01_ArgEscape*. But a new worst case can be constructed: TestCase01_ArgEscape passes an ea-local that is not in the result set to a callee that does its own locking. Then we would have to deoptimize the long-running loop again. TestCase02_NoEscape can be improved with a third flag.

Well, maybe TestCase01_ArgEscape is not such a corner case. Then it could pay off to do the flag splitting.

Setup and Execution
-------------------

The benchmark source code is contained in 8227745_GetOwnedMonitorStackDepthInfoALot.patch. It is implemented as a jtreg test, which makes building it easy. Execution is without jtreg.

libGetOwnedMonitorStackDepthInfoALot.c holds the source code of the native JVMTI agent.

GetOwnedMonitorStackDepthInfoALot.java is the actual benchmark. The benchmark has four test cases, each with a test method containing a loop. The benchmark score is the loop count / ms while the JVMTI agent performs GetOwnedMonitorStackDepthInfoALot calls. The intervals of the calls are given as argument to the agent.
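
The score (loop count / ms) and the loop split mentioned for TestCase01_ArgEscape_less_OSR_execution can be pictured with a minimal Java sketch. This is not the code from the patch: the class name, method names, loop body and iteration counts are all hypothetical, and no JVMTI agent is involved, so it only illustrates how such a score could be computed and what "outer loop plus inner loop in a separate method" looks like.

```java
// Hypothetical sketch, not taken from GetOwnedMonitorStackDepthInfoALot.java.
class LoopScoreSketch {

    // Stand-in for a test method with one long-running hot loop.
    // If its frame is deoptimized, the loop can only get back to
    // compiled code through an OSR compilation.
    static long longLoop(long iterations) {
        long sum = 0;
        for (long i = 0; i < iterations; i++) {
            sum += i & 1;
        }
        return sum;
    }

    // Inner loop factored out into its own method. It returns after a
    // bounded number of iterations, so the next call can enter an
    // ordinary (non-OSR) compilation again.
    static long innerLoop(long start, long end) {
        long sum = 0;
        for (long i = start; i < end; i++) {
            sum += i & 1;
        }
        return sum;
    }

    // Same work as longLoop, split into an outer and an inner loop.
    static long splitLoop(long iterations, long chunk) {
        long sum = 0;
        for (long start = 0; start < iterations; start += chunk) {
            sum += innerLoop(start, Math.min(start + chunk, iterations));
        }
        return sum;
    }

    // Score as described in the text: loop count per millisecond.
    static double score(long loopCount, long elapsedNanos) {
        return loopCount / (elapsedNanos / 1_000_000.0);
    }

    public static void main(String[] args) {
        long iterations = 200_000_000L; // arbitrary value for illustration

        long t0 = System.nanoTime();
        long r1 = longLoop(iterations);
        long t1 = System.nanoTime();
        long r2 = splitLoop(iterations, 10_000);
        long t2 = System.nanoTime();

        // Print the results too, so the loops cannot be optimized away.
        System.out.printf("long loop:  %.0f loops/ms (result %d)%n",
                score(iterations, t1 - t0), r1);
        System.out.printf("split loop: %.0f loops/ms (result %d)%n",
                score(iterations, t2 - t1), r2);
    }
}
```

Run without an agent, both variants should score similarly. The difference described in (D) and (E) is only expected when an agent triggers deoptimization of the frame that contains the loop.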
The agent can be built with the command

  make build-test-hotspot-jtreg-native

This results in

  support/test/hotspot/jtreg/native/lib/libGetOwnedMonitorStackDepthInfoALot.so

See contents of results_raw.zip for command lines to run the benchmark.

Test Environment:

  20x Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
  128 GB RAM
  Linux lu0486 4.4.0-177-generic #207-Ubuntu SMP Mon Mar 16 01:16:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

The jvm process was bound to one NUMA node to get stable results with the command

  numactl --cpunodebind=$NODE --membind=$NODE

[1] Specification of GetOwnedMonitorStackDepthInfo()
    https://docs.oracle.com/en/java/javase/14/docs/specs/jvmti.html#GetOwnedMonitorStackDepthInfo