package org.openjdk.gcbench.wip;

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Threads(Threads.MAX)
@State(Scope.Benchmark)
public class ArrayVsLinked {

    @Param({"1", "100", "10000", "1000000"})
    private int size;

    Node[] nodes;
    LinkedNode root;

    @Setup
    public void setup() {
        nodes = new Node[size];

        // Build the list back to front, so it holds the values size, size-1, ..., 1.
        LinkedNode linked = new LinkedNode(size, null);
        for (int c = 1; c < size; c++) { // minus first node
            linked = new LinkedNode(size - c, linked);
        }

        // The array holds the same values: 1, 2, ..., size.
        for (int c = 0; c < size; c++) {
            nodes[c] = new Node(c + 1);
        }

        root = linked;

        // Sanity check: both traversals must arrive at the same sum.
        if (array() != linked()) {
            throw new IllegalStateException("Oops, tests return different results");
        }
    }

    @Benchmark
    public int array() {
        int v = 0;
        for (Node n : nodes) {
            v += n.x;
        }
        return v;
    }

    @Benchmark
    public int linked() {
        int v = 0;
        LinkedNode n = root;
        do {
            v += n.x;
            n = n.next;
        } while (n != null);
        return v;
    }

    private static class Node {
        final int x;

        public Node(int x) {
            this.x = x;
        }
    }

    private static class LinkedNode {
        final int x;
        final LinkedNode next;

        public LinkedNode(int x, LinkedNode next) {
            this.x = x;
            this.next = next;
        }
    }

    /*
        i7-4790K, Linux x86_64, JDK 8u101

        Hypothesis: "Array of object references can seldom be faster than the same
        objects tied into the linked list (that is, having "next" field embedded
        into them)"

        Intriguing hypothesis, because it looks true.

        ==== Part I. Initial Introduction

        Benchmark              (size)  Mode  Cnt        Score       Error  Units
        ArrayVsLinked.array         1  avgt   25        5.696 ±     0.073  ns/op
        ArrayVsLinked.array       100  avgt   25       67.588 ±     0.303  ns/op
        ArrayVsLinked.array     10000  avgt   25     7433.365 ±    73.190  ns/op
        ArrayVsLinked.array   1000000  avgt   25  1392598.287 ± 46557.698  ns/op
        ArrayVsLinked.linked        1  avgt   25        4.205 ±     0.051  ns/op
        ArrayVsLinked.linked      100  avgt   25      171.251 ±     2.082  ns/op
        ArrayVsLinked.linked    10000  avgt   25    16331.893 ±    31.125  ns/op
        ArrayVsLinked.linked  1000000  avgt   25  2275335.901 ± 19037.833  ns/op

        Array looks faster, hm! Let's look at the disassembly.

        Hottest loop disassembly for "linked" (size=10000):

          0.65%    0.67%  ↗  0x00007f6895206a91: lea    (%r12,%r11,8),%r10            ; unpack LinkedNode reference
         12.13%   12.34%  │  0x00007f6895206a95: data16 data16 nopw 0x0(%rax,%rax,1)  ; padding
          0.31%    0.38%  │  0x00007f6895206aa0: add    0xc(%r10),%edx                ; add LinkedNode.x to %edx
         81.98%   82.85%  │  0x00007f6895206aa4: mov    0x10(%r10),%r11d              ; load field $next
          2.28%    1.73%  │  0x00007f6895206aa8: test   %eax,0x18892552(%rip)         ; safepoint poll
          0.58%    0.40%  │  0x00007f6895206aae: test   %r11d,%r11d                   ; n != null?
                          ╰  0x00007f6895206ab1: jne    0x00007f6895206a91            ; jump back

        Hottest loop disassembly for "array" (size=10000):

          1.17%    1.00%  ↗  0x00007f5c00fe8f12: mov    0x10(%rsi,%rcx,4),%r11d  ; load Node ref at (%rsi + %rcx*4 + 0x10)
          1.86%    1.80%  │  0x00007f5c00fe8f17: add    0xc(%r12,%r11,8),%edx    ; unpack, load Node.x, add to %edx
         20.61%   21.87%  │  0x00007f5c00fe8f1c: movslq %ecx,%rax
          1.18%    0.82%  │  0x00007f5c00fe8f1f: mov    0x14(%rsi,%rax,4),%r11d  ; load Node ref at (%rsi + %rax*4 + 0x14)
          0.95%    0.73%  │  0x00007f5c00fe8f24: mov    0xc(%r12,%r11,8),%r10d   ; unpack and load Node.x -> %r10d
         14.42%   15.03%  │  0x00007f5c00fe8f29: mov    0x18(%rsi,%rax,4),%r11d  ; load Node ref at (%rsi + %rax*4 + 0x18)
          4.22%    4.21%  │  0x00007f5c00fe8f2e: mov    0xc(%r12,%r11,8),%r8d    ; unpack and load Node.x -> %r8d
         23.20%   23.42%  │  0x00007f5c00fe8f33: mov    0x1c(%rsi,%rax,4),%r11d  ; load Node ref at (%rsi + %rax*4 + 0x1c)
          0.60%    0.54%  │  0x00007f5c00fe8f38: mov    0xc(%r12,%r11,8),%r11d   ; unpack and load Node.x -> %r11d
         17.09%   16.91%  │  0x00007f5c00fe8f3d: add    %r10d,%edx               ; add everything up into %edx
          1.73%    1.47%  │  0x00007f5c00fe8f40: add    %r8d,%edx
          2.85%    2.72%  │  0x00007f5c00fe8f43: add    %r11d,%edx
          6.66%    6.69%  │  0x00007f5c00fe8f46: add    $0x4,%ecx                ; c += 4
          1.30%    1.10%  │  0x00007f5c00fe8f49: cmp    %edi,%ecx                ; loop back
                          ╰  0x00007f5c00fe8f4b: jl     0x00007f5c00fe8f12

        All right, fine: maybe arrays won because the optimizer was able to unroll
        the loop and do four accesses at once!
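
        For reference, the unrolled machine code above is morally equivalent to this
        hand-unrolled Java (a sketch only: the real compiler also emits pre-loops to
        align the induction variable, which are omitted here):

            int v = 0;
            int c = 0;
            for (; c < (nodes.length & ~3); c += 4) { // main loop: four independent loads per iteration
                v += nodes[c].x;
                v += nodes[c + 1].x;
                v += nodes[c + 2].x;
                v += nodes[c + 3].x;
            }
            for (; c < nodes.length; c++) {           // post-loop: picks up the 0..3 leftover elements
                v += nodes[c].x;
            }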

        ==== Part II. Fair Game, episode S01E01

        Big deal, we will cut the unrolling down with -XX:LoopUnrollLimit=1:

        Benchmark              (size)  Mode  Cnt        Score       Error  Units
        ArrayVsLinked.array         1  avgt   25        5.614 ±     0.030  ns/op
        ArrayVsLinked.array       100  avgt   25      100.668 ±     4.206  ns/op
        ArrayVsLinked.array     10000  avgt   25     7594.128 ±    69.510  ns/op
        ArrayVsLinked.array   1000000  avgt   25  1369931.206 ±  4410.174  ns/op
        ArrayVsLinked.linked        1  avgt   25        4.229 ±     0.073  ns/op
        ArrayVsLinked.linked      100  avgt   25      169.909 ±     0.900  ns/op
        ArrayVsLinked.linked    10000  avgt   25    16392.555 ±   103.031  ns/op
        ArrayVsLinked.linked  1000000  avgt   25  2365784.700 ± 65894.460  ns/op

        Still faster, bastard. Let's look at the assembly again.

        Hottest loop disassembly for "linked" (size=10000), unchanged:

          0.61%    0.62%  ↗  0x00007f8b58bc4651: lea    (%r12,%r11,8),%r10            ; unpack LinkedNode reference
         12.31%   12.71%  │  0x00007f8b58bc4655: data16 data16 nopw 0x0(%rax,%rax,1)  ; padding
          0.28%    0.43%  │  0x00007f8b58bc4660: add    0xc(%r10),%edx                ; load and add LinkedNode.x to %edx
         82.07%   82.55%  │  0x00007f8b58bc4664: mov    0x10(%r10),%r11d              ; load field $next
          2.28%    1.62%  │  0x00007f8b58bc4668: test   %eax,0x15e3e992(%rip)         ; safepoint poll
          0.58%    0.43%  │  0x00007f8b58bc466e: test   %r11d,%r11d                   ; n != null?
                          ╰  0x00007f8b58bc4671: jne    0x00007f8b58bc4651            ; jump back

        Hottest loop disassembly for "array" (size=10000):

         12.02%   12.68%  ↗  0x00007f4281200060: mov    0x10(%r10,%r9,4),%r8d  ; load Node ref at (%r10 + %r9*4 + 0x10)
         13.04%   13.20%  │  0x00007f4281200065: add    0xc(%r12,%r8,8),%edx   ; unpack, load Node.x, add to %edx
         62.07%   62.88%  │  0x00007f428120006a: inc    %r9d                   ; c++
          4.95%    4.53%  │  0x00007f428120006d: cmp    %r11d,%r9d             ; loop back
                          │  0x00007f4281200070: jge    0x00007f428120000b
          5.94%    5.26%  ╰  0x00007f4281200072: jmp    0x00007f4281200060

        All right, fine. Maybe the absence of safepoint polls in the counted "array"
        loop, or the complex addressing due to compressed oops, did it!
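
        (To reproduce: the flag goes to the forked benchmark JVM, not to the harness.
        With the JMH Java API — org.openjdk.jmh.runner.*, org.openjdk.jmh.runner.options.*,
        org.openjdk.jmh.profile.LinuxPerfAsmProfiler — something along these lines
        should do. The profiler choice is an assumption, but the perf-annotated
        disassembly above strongly suggests perfasm, which needs Linux perf and an
        hsdis library visible to the JVM:)

            Options opt = new OptionsBuilder()
                    .include(ArrayVsLinked.class.getSimpleName())
                    .param("size", "10000")
                    .jvmArgsAppend("-XX:LoopUnrollLimit=1")    // pass the HotSpot flag to the fork
                    .addProfiler(LinuxPerfAsmProfiler.class)   // perf counters + annotated disassembly
                    .build();
            new Runner(opt).run();                             // throws RunnerException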

        ==== Part III. Fair Game, episode S01E03

        Let's run with "-XX:LoopUnrollLimit=1 -XX:+UseCountedLoopSafepoints -XX:-UseCompressedOops":

        Benchmark              (size)  Mode  Cnt        Score        Error  Units
        ArrayVsLinked.array         1  avgt   25        5.622 ±      0.051  ns/op
        ArrayVsLinked.array       100  avgt   25      107.013 ±      2.822  ns/op
        ArrayVsLinked.array     10000  avgt   25    10308.472 ±     86.381  ns/op
        ArrayVsLinked.array   1000000  avgt   25  2413613.689 ± 194914.392  ns/op
        ArrayVsLinked.linked        1  avgt   25        4.178 ±      0.044  ns/op
        ArrayVsLinked.linked      100  avgt   25      125.352 ±      0.585  ns/op
        ArrayVsLinked.linked    10000  avgt   25    14594.877 ±    315.666  ns/op
        ArrayVsLinked.linked  1000000  avgt   25  3055714.233 ± 320473.514  ns/op

        Finally, they are getting closer! Linked lists FTW! ...Not quite, damn.
        Let's look at the disassembly again. (Note how the field offsets changed:
        with compressed oops disabled, the object header takes 16 bytes and
        references are full 8 bytes, so x now sits at 0x10, next at 0x18, and
        there is no reference unpacking anymore.)

        Hottest loop disassembly for "linked" (size=10000):

          1.02%    1.06%  ↗  0x00007f273d200b00: add    0x10(%r11),%edx        ; load LinkedNode.x and add it to %edx
         94.50%   94.49%  │  0x00007f273d200b04: mov    0x18(%r11),%r11        ; load LinkedNode.next into %r11
          2.97%    3.09%  │  0x00007f273d200b08: test   %eax,0x16f234f2(%rip)  ; safepoint poll
          0.00%           │  0x00007f273d200b0e: test   %r11,%r11              ; if not at the end, loop back
                          ╰  0x00007f273d200b11: jne    0x00007f273d200b00

        Hottest loop disassembly for "array" (size=10000):

         15.57%   16.96%  ↗  0x00007f20a91f23d0: mov    0x18(%r11,%r8,8),%r9   ; load Node ref at (%r11 + %r8*8 + 0x18)
          0.33%    0.36%  │  0x00007f20a91f23d5: add    0x10(%r9),%edx         ; load Node.x and add it to %edx
         70.08%   70.02%  │  0x00007f20a91f23d9: inc    %r8d                   ; c++
          0.01%           │  0x00007f20a91f23dc: test   %eax,0x180d7c1e(%rip)  ; safepoint poll
         12.35%   11.26%  │  0x00007f20a91f23e2: cmp    %r10d,%r8d             ; if not done, loop back
                          ╰  0x00007f20a91f23e5: jge    0x00007f20a91f2383

        But how come?! We do more operations in the "array" case, and it is still
        faster?!

        ==== Part IV. Climax

        Now that we know the hot loops are similar, it is time for hardware counters
        ("No, dad, noooo!"):

        Benchmark                                   (size)  Mode  Cnt      Score      Error  Units
        ArrayVsLinked.array                          10000  avgt   25  10301.105 ±   65.584  ns/op
        ArrayVsLinked.array:CPI                      10000  avgt    5      0.583 ±    0.011   #/op
        ArrayVsLinked.array:L1-dcache-load-misses    10000  avgt    5   2756.053 ±  159.336   #/op
        ArrayVsLinked.array:L1-dcache-loads          10000  avgt    5  29338.375 ±  607.107   #/op  ; <--- same number of loads
        ArrayVsLinked.array:cycles                   10000  avgt    5  39270.945 ±  785.069   #/op
        ArrayVsLinked.array:instructions             10000  avgt    5  67351.912 ±  913.290   #/op
        ArrayVsLinked.linked                         10000  avgt   25  14754.238 ±  254.779  ns/op
        ArrayVsLinked.linked:CPI                     10000  avgt    5      1.126 ±    0.099   #/op  ; <--- abominable CPI
        ArrayVsLinked.linked:L1-dcache-load-misses   10000  avgt    5   3300.364 ± 1232.200   #/op
        ArrayVsLinked.linked:L1-dcache-loads         10000  avgt    5  29656.929 ±  998.873   #/op
        ArrayVsLinked.linked:cycles                  10000  avgt    5  56031.205 ± 4756.404   #/op  ; <--- moar cycles
        ArrayVsLinked.linked:instructions            10000  avgt    5  49754.904 ± 1757.675   #/op  ; <--- fewer instructions (did not help)

        "linked" has an abysmal clocks-per-instruction ratio for such a hot test.
        That happens because the next elements are not known ahead of time: each
        load of "next" depends on the result of the previous one (the loads form a
        _linear_ data dependency chain), and that does not pipeline well.
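
        Indeed, the counters agree: 56031 cycles / 49755 instructions ≈ 1.13 CPI for
        "linked", versus 39271 / 67352 ≈ 0.58 CPI for "array". To see where the chain
        comes from, compare the dataflow of a few traversal steps (a sketch in terms
        of this benchmark's fields, not code that actually appears in it):

            // "linked": a serial chain. No load can even start before the previous
            // one completes, so every step pays the full load-to-use latency.
            LinkedNode n1 = root.next;   // depends on root
            LinkedNode n2 = n1.next;     // depends on n1
            LinkedNode n3 = n2.next;     // depends on n2

            // "array": every address is computable from the induction variable alone,
            // so the CPU can keep several independent loads in flight at once.
            Node r1 = nodes[c];
            Node r2 = nodes[c + 1];
            Node r3 = nodes[c + 2];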

        ==== Part V. Topping the Level

        All right, fine. Let's put it down with toplev
        (https://github.com/andikleen/pmu-tools/).

        "linked" at size=10000 is memory-bound; on top of that, heavy data
        dependencies make it execute only 1 uop/cycle for around 25% of all cycles:

        C0    BE       Backend_Bound:                                47.40 % [ 3.07%]
            This category represents slots fraction where no uops are being delivered
            due to a lack of required resources for accepting new uops in the Backend.
            Backend is the portion of the processor core where the out-of-order
            scheduler dispatches ready uops into their respective execution units, and
            once completed these uops get retired according to program order. For
            example, stalls due to data-cache misses or stalls due to the divider unit
            being overloaded are both categorized under Backend Bound. Backend Bound is
            further divided into two main categories: Memory Bound and Core Bound.

        C0    BE/Mem   Backend_Bound.Memory_Bound:                   35.50 % [ 3.04%]
            This metric represents slots fraction the Memory subsystem within the
            Backend was a bottleneck. Memory Bound estimates slots fraction where
            pipeline is likely stalled due to demand load or store instructions. This
            accounts mainly for (1) non-completed in-flight memory demand loads which
            coincides with execution units starvation, in addition to (2) cases where
            stores could impose backpressure on the pipeline when many of them get
            buffered at the same time (less common out of the two).

        C0    BE/Core  Backend_Bound.Core_Bound.Ports_Utilization.1_Port_Utilized: 26.39 % [ 3.04%]
            This metric represents Core cycles fraction where the CPU executed total
            of 1 uop per cycle on all execution ports. This can be due to heavy
            data-dependency among software instructions, or oversubscribing a
            particular hardware resource. In some other cases with high 1_Port_Utilized
            and L1_Bound, this metric can point to L1 data-cache latency bottleneck
            that may not necessarily manifest with complete execution starvation (due
            to the short L1 latency e.g. walking a linked list) - looking at the
            assembly can be helpful. Tip: consider 'Core Ports Saturation'
            analysis-type as next step.

        C0-T0 BE/Mem   Backend_Bound.Memory_Bound.L1_Bound:          61.58 % [ 3.04%]
            This metric estimates how often the CPU was stalled without loads missing
            the L1 data cache. The L1 data cache typically has the shortest latency.
            However, in certain cases like loads blocked on older stores, a load might
            suffer due to high latency even though it is being satisfied by the L1.
            Another example is loads that miss in the TLB. These cases are
            characterized by execution unit stalls, while some non-completed demand
            load lives in the machine without having that demand load missing the L1
            cache.
            Sampling events: mem_load_uops_retired.l1_hit:pp mem_load_uops_retired.hit_lfb:pp

        "array" at size=10000 is completely dominated by pure computing resources:

        C0    BE/Core  Backend_Bound.Core_Bound.Ports_Utilization.3m_Ports_Utilized.Port_0: 62.49 % [ 3.00%]
            This metric represents Core cycles fraction CPU dispatched uops on
            execution port 0 (SNB+: ALU; HSW+: ALU and 2nd branch)

        C0    BE/Core  Backend_Bound.Core_Bound.Ports_Utilization.3m_Ports_Utilized.Port_2: 76.40 % [ 3.00%]
            This metric represents Core cycles fraction CPU dispatched uops on
            execution port 2 (Loads and Store-address)

        C0    BE/Core  Backend_Bound.Core_Bound.Ports_Utilization.3m_Ports_Utilized.Port_3: 76.42 % [ 3.00%]
            This metric represents Core cycles fraction CPU dispatched uops on
            execution port 3 (Loads and Store-address)

        C0    BE/Core  Backend_Bound.Core_Bound.Ports_Utilization.3m_Ports_Utilized.Port_6: 87.53 % [ 3.00%]
            This metric represents Core cycles fraction CPU dispatched uops on
            execution port 6 (Branches and simple ALU)

        C0    RET      Retiring:                                     75.99 % [ 3.07%]
            This category represents slots fraction utilized by useful work i.e.
            issued uops that eventually get retired. Ideally, all pipeline slots would
            be attributed to the Retiring category. Retiring of 100% would indicate the
            maximum 4 uops retired per cycle has been achieved. Maximizing Retiring
            typically increases the Instruction-Per-Cycle metric. Note that a high
            Retiring value does not necessarily mean there is no room for more
            performance. For example, Microcode assists are categorized under
            Retiring. They hurt performance and can often be avoided. A high Retiring
            value for non-vectorized code may be a good hint for the programmer to
            consider vectorizing the code. Doing so essentially lets more computations
            be done without significantly increasing the number of instructions, thus
            improving the performance.
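
        (These reports came from pmu-tools' toplev. An invocation along these lines
        should reproduce them; option names are from memory, and "benchmarks.jar" is
        whatever uber-jar your JMH build produces. JMH's -f 0 runs the benchmark in
        the host JVM, so toplev measures the workload itself rather than the forking
        harness:)

            $ toplev.py -l3 --long-desc \
                  java -jar benchmarks.jar ArrayVsLinked.linked -p size=10000 -f 0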

        BOTTOM-LINE: DONT. DATE. ROBOTS.

        Use linked lists only when you understand what'ta hell you are doing.
     */
}