Minimize application tail latency using cache-partitioning-aware G1GC

Author: Ram (Ramki) Krishnan (ramkri123@gmail.com)
Organization: Supportvectors on behalf of Red Hat
Created: 2017/4/16
Type: Feature
State: Draft
Component: vm/gc
Collaborators: Asif Qamar (asif@asifqamar.com), Red Hat

Summary
-------

Enhance G1 to improve application throughput and isolation and to reduce
application tail latency on systems with cache-partitioning-aware processors.
The goal is to address both intra-JVM and inter-JVM scenarios.

Non-Goals
---------

- Extending cache-partitioning awareness to any OS other than Linux.
- Extending cache-partitioning awareness to systems other than those with
  Intel processors supporting the Resource Director Technology (RDT) Cache
  Allocation Technology (CAT) feature [1] [4].
- Managing cache interaction between the JVM and other applications on the
  same system.

Motivation
----------

Caching is an important part of the memory access hierarchy of multi-core,
multi-socket systems, with each level of cache improving memory access
latency by roughly an order of magnitude. Cache sizes are likewise orders of
magnitude smaller than system memory across the hierarchy [2] [3]. Keeping
the working set in cache and avoiding thrashing is therefore critical to
maximizing cache performance and, in turn, overall system performance.

The Last-Level Cache (LLC) is shared by all the cores in a socket and is a
source of contention among applications with differing Quality of Service
(QoS) requirements running on different cores. In this context, JVM garbage
collection reads and/or writes a large number of memory locations and thus
substantially thrashes the LLC working set of currently executing
applications, especially latency-sensitive ones. By isolating garbage
collection to a dedicated region of the LLC, the noisy-neighbor impact on
currently executing applications is minimized, which improves application
throughput and reduces application tail latency.

Description
-----------

An LLC cache-partitioning manager interfaces with the hardware through the
Linux kernel and maintains partitions of different sizes. The
cache-partitioning manager is an infrastructure orchestration component (it
can be implemented as a system component or an external component) and will
be leveraged by this proposal. With this background, there are two distinct
use cases in a system which benefit from this approach.

Use Case 1: Interaction between multiple JVMs

Any garbage collection activity in one JVM has a substantial noisy-neighbor
impact on applications executing in other JVMs because of LLC working set
thrashing. To address this, each JVM has an option to specify a GC LLC cache
partition name, for example
-XX:G1GcCpuLLCCachePartitionName=Part-jvm-gc-interaction, during startup
(the choice of name is strictly exemplary). The cache partition name is
unique per system. All GC threads running in the context of the JVM are
mapped to the aforementioned L3 cache partition for the entire lifetime of
the JVM. Other threads running in the context of the JVM use the entire L3
cache and *are not* restricted to the aforementioned cache partition. This is
achieved by calling the appropriate Intel RDT CAT APIs in the context of the
JVM; one possible mapping mechanism is sketched below. There can be multiple
cache partitions for JVMs with differing QoS requirements, for example one
for JVMs hosting low-latency applications and another for JVMs hosting normal
applications. The low-latency JVM LLC partition would typically be larger
than the normal JVM LLC partition so that the low-latency GC activities are
prioritized and complete faster. In a multi-socket system, the LLC cache
partition could apply to more than one socket; this is an LLC
cache-partitioning-manager configuration option.
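As an illustration of the mapping step above, the following is a minimal
sketch that confines the calling GC worker thread to a named LLC partition
using the Linux kernel's resctrl filesystem, which is one possible backend
for the Intel RDT CAT APIs (the pqos library from [4] is another). It assumes
the cache-partitioning manager has already created and sized the resctrl
group; the function name is illustrative and is not the proposed HotSpot API.

    // Minimal sketch, assuming the resctrl filesystem is mounted at
    // /sys/fs/resctrl and the cache-partitioning manager has already created
    // and sized the group named by -XX:G1GcCpuLLCCachePartitionName.
    #include <fstream>
    #include <string>
    #include <sys/syscall.h>
    #include <unistd.h>

    static bool bind_current_thread_to_llc_partition(const std::string& partition) {
      // Each resctrl group directory exposes a "tasks" file; writing a thread
      // id to it moves that thread into the group's LLC allocation. Threads
      // that are never written here (the application threads) stay in the
      // default group and keep using the full L3 cache.
      std::ofstream tasks("/sys/fs/resctrl/" + partition + "/tasks");
      if (!tasks.is_open()) {
        return false;  // resctrl not mounted or partition not created
      }
      tasks << syscall(SYS_gettid) << std::endl;
      return tasks.good();
    }

Each G1 GC worker thread would make a call like this once at thread start.
Because only GC thread ids are written to the group, mutator threads remain
unrestricted, as required above.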
Use Case 2: Interaction between young GC threads and application threads
within a JVM

Within a JVM, young GC garbage collection within a Thread Local Allocation
Buffer (TLAB) *will not* prevent other application threads from running. In
this scenario, the young GC thread has a substantial noisy-neighbor impact on
the other executing application threads because of LLC working set thrashing.
To address this, each JVM has an option to specify a GC LLC cache partition
name, for example -XX:G1GcCpuLLCCachePartitionName=Part-jvm-gc-local, during
startup (the choice of name is strictly exemplary). The cache partition name
is unique per system. All young GC threads operating within a TLAB use this
LLC cache partition during garbage collection. Other threads running in the
context of the JVM use the entire L3 cache and *are not* restricted to the
aforementioned cache partition. This is achieved by calling the appropriate
Intel RDT CAT APIs in the context of the JVM. In a multi-socket system, the
LLC cache partition could apply to more than one socket; this is an LLC
cache-partitioning-manager configuration option. If there are multiple JVMs
in a system, the aforementioned cache partition is shared across them; its
size is set appropriately based on the number of JVMs.

Summarizing use cases 1 and 2 for single and multiple JVMs in a system:

Single JVM in a system

- -XX:G1GcCpuLLCCachePartitionName=Part-jvm-gc-local for young GC threads
  operating within a TLAB within a JVM.

Multiple JVMs in a system

- -XX:G1GcCpuLLCCachePartitionName=Part-jvm-gc-local for young GC threads
  operating within a TLAB within a JVM. This cache partition is shared across
  the JVMs in the system; its size is set appropriately based on the number
  of JVMs and the probability of concurrent garbage collection events.
- -XX:G1GcCpuLLCCachePartitionName=Part-jvm-gc-interaction for all GC threads
  within a JVM. This cache partition is shared across the JVMs in the system;
  its size is set appropriately based on the number of JVMs and the
  probability of concurrent garbage collection events.

G1 Analytics Enhancements

The following G1 analytics enhancements will be integrated into G1 for
troubleshooting and performance tuning:

- Intel RDT Cache Monitoring Technology (CMT) [5] for monitoring cache
  partition usage
- Intel RDT Memory Bandwidth Monitoring (MBM) [6] for monitoring memory
  bandwidth usage

Testing
-------

Normal testing with the flags (-XX:G1GcCpuLLCCachePartitionName=AA,
-XX:G1GcCpuLLCCachePartitionName=BB, ...) should flush out any correctness
issues. Testing needs specific hardware, which is described in [4].

Performance testing of a system would involve fine-tuning the number of cache
partitions and their sizes based on the applications in a JVM and the number
of JVMs, leveraging the G1 analytics enhancements and existing Linux perf
tools for monitoring cache hits and misses. LLC cache partitioning will
likely increase memory accesses and thus memory bandwidth usage; this has to
be carefully measured and tuned using the G1 analytics enhancements and
existing Linux perf tools.
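To complement the perf-based measurements described above, the following is a
minimal sketch of how the proposed G1 analytics enhancements could sample the
CMT and MBM counters for a partition. It assumes a kernel that exposes the
resctrl monitoring files; the path layout follows the upstream resctrl
documentation, the domain name mon_L3_00 is an example (one directory exists
per L3 cache domain), and the struct and function names are illustrative. The
pqos utility from [4] reports the same counters.

    // Minimal sketch: read CMT occupancy and MBM traffic for a resctrl group.
    #include <cstdint>
    #include <fstream>
    #include <string>

    struct LLCPartitionSample {
      uint64_t llc_occupancy_bytes;  // CMT: current LLC footprint of the group
      uint64_t mbm_total_bytes;      // MBM: cumulative memory traffic of the group
    };

    static bool read_u64(const std::string& path, uint64_t* out) {
      std::ifstream f(path);
      return static_cast<bool>(f >> *out);
    }

    static bool sample_llc_partition(const std::string& partition,
                                     LLCPartitionSample* sample) {
      const std::string base =
          "/sys/fs/resctrl/" + partition + "/mon_data/mon_L3_00/";
      return read_u64(base + "llc_occupancy", &sample->llc_occupancy_bytes) &&
             read_u64(base + "mbm_total_bytes", &sample->mbm_total_bytes);
    }

Sampling these counters before and after a collection pause, and exposing the
deltas through the G1 analytics enhancements, would support the tuning
workflow described above.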
Risks and Assumptions
---------------------

The cache-partitioning function depends on the CPU microarchitecture, with
the assumption that memory addresses are distributed uniformly over the
partition with high probability. This behavior might vary across generations
of microarchitectures from the same processor vendor and across different
processor vendors. Performance testing leveraging the G1 analytics
enhancements described above will address this.

Proof of Concept
----------------

To demonstrate the value of cache partitioning for G1GC, a proof of concept
(POC) was put together; details are below. The POC results clearly
demonstrate that LLC cache partitioning is beneficial for G1GC. In this case
the LLC maps to the L3 cache. Happy to share the code patch if there is
interest.

POC code changes

- An LLC cache partition sized as the percentage specified by
  -LLCCachePartitionPercent= of the L3 cache (total size 25600 KB) was
  created (see the sketch after this section for how such a percentage can be
  translated into a CAT capacity bitmask).
- All G1GC threads were mapped to this cache partition for their entire
  lifetime.

POC results

A jtreg test, hotspot/test/gc/g1/TestStringDeduplicationFullGC.java, was
repeatedly executed with different LLC cache partition sizes. As the
partition size is reduced, G1GC takes substantially more time to complete.

* Partition percent - 5% - approx. time - 10.7 seconds
* Partition percent - 10% - approx. time - 6.7 seconds
* Partition percent - 25% - approx. time - 5.8 seconds
* Partition percent - 50% - approx. time - 5.6 seconds
* Partition percent - 100% (entire cache) - approx. time - 5.5 seconds

POC system configuration

unknown485d60a3a27d.attlocal.net:/root->lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                20
On-line CPU(s) list:   0-19
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               1200.170
BogoMIPS:              4394.42
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-19

unknown485d60a3a27d.attlocal.net:/root->uname -a
Linux unknown485d60a3a27d.attlocal.net 4.11.0-rc5+ #5 SMP Tue Apr 4 12:59:58 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
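The sketch referenced in the POC code changes shows one way a
-LLCCachePartitionPercent value could be turned into a CAT capacity bitmask
through the Linux resctrl interface. It is a minimal illustration under
stated assumptions, not the POC patch itself: the helper name is hypothetical,
the bitmask parameters are passed in rather than read from
/sys/fs/resctrl/info/L3/, and only cache id 0 is written, matching the
single-socket POC host.

    // Minimal sketch: convert a partition-size percentage into a contiguous
    // CAT capacity bitmask and apply it to a resctrl group. total_cbm_bits is
    // the length of /sys/fs/resctrl/info/L3/cbm_mask and min_cbm_bits comes
    // from /sys/fs/resctrl/info/L3/min_cbm_bits; both are passed in here to
    // keep the example short.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <fstream>
    #include <string>

    static bool set_partition_percent(const std::string& partition, int percent,
                                      int total_cbm_bits, int min_cbm_bits) {
      // Round the requested percentage up to whole cache ways, respecting the
      // hardware minimum; CAT requires a contiguous, non-empty bitmask.
      int ways = std::max(min_cbm_bits, (percent * total_cbm_bits + 99) / 100);
      ways = std::min(ways, total_cbm_bits);
      uint64_t mask = (ways >= 64) ? ~0ULL : ((1ULL << ways) - 1);

      char line[64];
      std::snprintf(line, sizeof(line), "L3:0=%llx",
                    static_cast<unsigned long long>(mask));

      // Writing the schemata file resizes the LLC share of every thread that
      // has been assigned to this group (here, the G1GC threads).
      std::ofstream schemata("/sys/fs/resctrl/" + partition + "/schemata");
      if (!schemata.is_open()) return false;
      schemata << line << std::endl;
      return schemata.good();
    }

For example, with a 20-bit capacity bitmask a 25% partition resolves to 5
cache ways and a schemata line of L3:0=1f.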
Follow-up work
--------------

G1 self-tuning to dynamically refine cache partitions: Fine-tuning the number
of cache partitions and their sizes based on the applications in a JVM and
the number of JVMs involves substantial manual effort. By applying predictive
analytics techniques such as time-series analysis to the G1 analytics data,
the appropriate number of cache partitions and their sizes can be determined
automatically as a function of time.

L1/L2 cache partitioning: Extend the LLC partitioning to L1 and/or L2 cache
partitioning to further improve GC performance, especially for young GC
garbage collection within a TLAB.

LLC code segment partitioning: Leverage Intel RDT Code and Data Prioritization
(CDP) [7] to protect latency-sensitive application code segments in the LLC
and further improve GC performance.

Orchestration Ecosystem Integration
-----------------------------------

Popular orchestration ecosystems such as Kubernetes support Intel RDT as part
of node capability discovery [8]. LLC-cache-partitioning-aware JVMs naturally
integrate into this ecosystem.

References
----------

[1] "Intel RDT CAT," https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology
[2] "Intel Broadwell Microarchitecture," https://en.wikichip.org/wiki/intel/microarchitectures/broadwell
[3] "Cavium ThunderX," http://www.cavium.com/pdfFiles/ThunderX_PB_p12_Rev1.pdf
[4] "Intel(R) RDT hardware support," https://github.com/01org/intel-cmt-cat/blob/master/README
[5] "Intel RDT CMT," https://software.intel.com/en-us/blogs/2014/06/18/benefit-of-cache-monitoring
[6] "Intel RDT MBM," https://software.intel.com/en-us/articles/introduction-to-memory-bandwidth-monitoring
[7] "Intel RDT CDP," https://software.intel.com/en-us/articles/introduction-to-code-and-data-prioritization-with-usage-models
[8] "Kubernetes Node Feature Discovery," https://github.com/kubernetes-incubator/node-feature-discovery