== UseCondCardMark support in interpreter

  https://bugs.openjdk.java.net/browse/JDK-8078438

We are using a very basic test that is specially crafted to make the interpreted and compiled code collide. It needs @Contended to rule out false sharing between the sink1 and sink2 fields themselves:

  import org.openjdk.jmh.annotations.*;
  import sun.misc.Contended; // JDK 8 location; jdk.internal.vm.annotation.Contended on JDK 9+

  import java.util.concurrent.TimeUnit;

  @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  @Fork(value = 5, jvmArgsAppend = {"-XX:-RestrictContended"})
  @State(Scope.Group)
  public class CardMarks {

      private Object src = new Object();

      @Contended
      Object sink1;

      @Contended
      Object sink2;

      // Excluded from compilation, so the stores run in the interpreter.
      @Benchmark
      @CompilerControl(CompilerControl.Mode.EXCLUDE)
      @Group("ref")
      public void interp() {
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
          sink1 = src;
      }

      // Runs as compiled code in the same group, concurrently with interp().
      @Benchmark
      @Group("ref")
      public void compiled() {
          sink2 = src;
      }
  }

== Baseline

First, a few baseline tests.
-XX:-UseCondCardMark:

  Benchmark                               Mode  Cnt    Score    Error  Units
  CardMarks.ref                           avgt   25   87.647 ±  5.023  ns/op
  CardMarks.ref:compiled                  avgt   25    1.395 ±  0.094  ns/op
  CardMarks.ref:interp                    avgt   25  167.898 ±  9.982  ns/op
  CardMarks.ref:·CPI                      avgt    5    0.666 ±  0.207   #/op
  CardMarks.ref:·L1-dcache-load-misses    avgt    5    0.233 ±  0.060   #/op
  CardMarks.ref:·L1-dcache-loads          avgt    5    6.176 ±  0.593   #/op
  CardMarks.ref:·L1-dcache-store-misses   avgt    5    0.174 ±  0.029   #/op
  CardMarks.ref:·L1-dcache-stores         avgt    5    2.891 ±  0.210   #/op
  CardMarks.ref:·cycles                   avgt    5   11.605 ±  3.967   #/op
  CardMarks.ref:·instructions             avgt    5   17.416 ±  1.289   #/op

-XX:+UseCondCardMark:

  Benchmark                               Mode  Cnt    Score    Error  Units
  CardMarks.ref                           avgt   25   89.049 ±  3.695  ns/op
  CardMarks.ref:compiled                  avgt   25    2.395 ±  0.073  ns/op
  CardMarks.ref:interp                    avgt   25  181.703 ±  7.419  ns/op
  CardMarks.ref:·CPI                      avgt    5    0.748 ±  0.043   #/op
  CardMarks.ref:·L1-dcache-load-misses    avgt    5    0.392 ±  0.082   #/op
  CardMarks.ref:·L1-dcache-loads          avgt    5    8.916 ±  1.301   #/op
  CardMarks.ref:·L1-dcache-store-misses   avgt    5    0.269 ±  0.060   #/op
  CardMarks.ref:·L1-dcache-stores         avgt    5    2.285 ±  0.379   #/op
  CardMarks.ref:·cycles                   avgt    5   19.831 ±  1.434   #/op
  CardMarks.ref:·instructions             avgt    5   26.533 ±  3.300   #/op

Here, turning on conditional card marks does not help. In fact, the compiled code performs worse (2.395 vs. 1.395 ns/op), because it *both* has a longer instruction path *and* experiences sharing troubles with the interpreted code.
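
The sharing here is on the card table, not on the fields themselves. The sketch below is only a rough illustration of that arithmetic. It assumes 512-byte cards (the `shr $0x9` in the interpreter listing further down), a 64-byte cache line, and the default 128-byte @Contended padding; the object address, field offsets, and all names are made up for the example.

```java
public class CardCollisionSketch {
    // Assumption: one card table byte covers 512 bytes of heap.
    static final int CARD_SHIFT = 9;
    // Assumption: typical 64-byte cache line.
    static final int CACHE_LINE = 64;

    static long cardIndex(long address) {
        return address >>> CARD_SHIFT;
    }

    static long cacheLine(long address) {
        return address / CACHE_LINE;
    }

    public static void main(String[] args) {
        long obj = 0x1000_0000L;      // hypothetical address of the CardMarks instance
        long sink1 = obj + 0x80;      // hypothetical offset after 128 bytes of padding
        long sink2 = sink1 + 8 + 128; // hypothetical: next padded field slot

        // @Contended keeps the two fields on different cache lines...
        System.out.println("fields share a cache line: "
                + (cacheLine(sink1) == cacheLine(sink2)));

        // ...but both stores dirty the same card. The interpreter listing
        // shifts the object address, so the card is picked per object; in this
        // layout even the field addresses land on that same card.
        System.out.println("fields share a card:       "
                + (cardIndex(sink1) == cardIndex(sink2)));
        System.out.println("card of the holder object: " + cardIndex(obj));
    }
}
```

With these made-up addresses the fields are isolated from each other, yet every store from either thread writes the same card table byte, which is the collision the benchmark is built to provoke.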
== Patched

-XX:-UseCondCardMark:

  Benchmark                               Mode  Cnt    Score    Error  Units
  CardMarks.ref                           avgt   25   86.885 ±  7.041  ns/op
  CardMarks.ref:compiled                  avgt   25    1.454 ±  0.090  ns/op
  CardMarks.ref:interp                    avgt   25  172.316 ± 14.006  ns/op
  CardMarks.ref:·CPI                      avgt    5    0.691 ±  0.217   #/op
  CardMarks.ref:·L1-dcache-load-misses    avgt    5    0.235 ±  0.026   #/op
  CardMarks.ref:·L1-dcache-loads          avgt    5    6.161 ±  0.418   #/op
  CardMarks.ref:·L1-dcache-store-misses   avgt    5    0.174 ±  0.021   #/op
  CardMarks.ref:·L1-dcache-stores         avgt    5    2.883 ±  0.192   #/op
  CardMarks.ref:·branch-misses            avgt    5    0.009 ±  0.001   #/op
  CardMarks.ref:·branches                 avgt    5    1.973 ±  0.090   #/op
  CardMarks.ref:·cycles                   avgt    5   12.021 ±  3.829   #/op
  CardMarks.ref:·instructions             avgt    5   17.399 ±  0.744   #/op

This version performs the same as the baseline.

  fast_aputfield  211 fast_aputfield  [0x00007f998bf86a00, 0x00007f998bf86a60]  96 bytes

                    0x00007f998bf86a00: pop    %rax
    0.24%    0.60%  0x00007f998bf86a01: movzwl 0x1(%r13),%ebx
    0.31%    0.79%  0x00007f998bf86a06: mov    -0x28(%rbp),%rcx
    0.68%    1.10%  0x00007f998bf86a0a: shl    $0x2,%ebx
                    0x00007f998bf86a0d: mov    0x28(%rcx,%rbx,8),%edx
    0.55%    1.21%  0x00007f998bf86a11: mov    0x20(%rcx,%rbx,8),%rbx
    0.41%    0.63%  0x00007f998bf86a16: shr    $0x15,%edx
    0.57%    1.13%  0x00007f998bf86a19: and    $0x1,%edx
    0.06%    0.15%  0x00007f998bf86a1c: pop    %rcx
    0.21%    0.42%  0x00007f998bf86a1d: cmp    (%rcx),%rax
    0.77%    1.09%  0x00007f998bf86a20: shr    $0x3,%rax
    2.95%    1.75%  0x00007f998bf86a24: mov    %eax,(%rcx,%rbx,1)
    1.48%    3.05%  0x00007f998bf86a27: shr    $0x9,%rcx
    0.18%    0.16%  0x00007f998bf86a2b: movabs $0x7f9987b82000,%r10
    0.46%    1.10%  0x00007f998bf86a35: movb   $0x0,(%r10,%rcx,1)
                    ...
-XX:+UseCondCardMark:

  Benchmark                               Mode  Cnt    Score    Error  Units
  CardMarks.ref                           avgt   25   40.164 ±  2.718  ns/op
  CardMarks.ref:compiled                  avgt   25    0.882 ±  0.001  ns/op
  CardMarks.ref:interp                    avgt   25   79.445 ±  5.437  ns/op
  CardMarks.ref:·CPI                      avgt    5    0.293 ±  0.043   #/op
  CardMarks.ref:·L1-dcache-load-misses    avgt    5    0.231 ±  0.084   #/op
  CardMarks.ref:·L1-dcache-loads          avgt    5    8.347 ±  1.429   #/op
  CardMarks.ref:·L1-dcache-store-misses   avgt    5    0.228 ±  0.081   #/op
  CardMarks.ref:·L1-dcache-stores         avgt    5    1.895 ±  0.299   #/op
  CardMarks.ref:·branch-misses            avgt    5    0.012 ±  0.004   #/op
  CardMarks.ref:·branches                 avgt    5    4.537 ±  0.490   #/op
  CardMarks.ref:·cycles                   avgt    5    7.317 ±  0.290   #/op
  CardMarks.ref:·instructions             avgt    5   25.029 ±  3.244   #/op

  fast_aputfield  211 fast_aputfield  [0x00007f6a84e07a20, 0x00007f6a84e07aa0]  128 bytes

                    0x00007f6a84e07a20: pop    %rax
    1.38%    1.70%  0x00007f6a84e07a21: movzwl 0x1(%r13),%ebx
    0.07%    0.06%  0x00007f6a84e07a26: mov    -0x28(%rbp),%rcx
    0.92%    0.94%  0x00007f6a84e07a2a: shl    $0x2,%ebx
    0.01%           0x00007f6a84e07a2d: mov    0x28(%rcx,%rbx,8),%edx
    1.72%    1.66%  0x00007f6a84e07a31: mov    0x20(%rcx,%rbx,8),%rbx
    0.11%    0.02%  0x00007f6a84e07a36: shr    $0x15,%edx
    0.81%    0.87%  0x00007f6a84e07a39: and    $0x1,%edx
    0.07%    0.07%  0x00007f6a84e07a3c: pop    %rcx
    1.44%    1.75%  0x00007f6a84e07a3d: cmp    (%rcx),%rax
    0.17%    0.15%  0x00007f6a84e07a40: shr    $0x3,%rax
    0.70%    0.91%  0x00007f6a84e07a44: mov    %eax,(%rcx,%rbx,1)
    1.31%    1.27%  0x00007f6a84e07a47: shr    $0x9,%rcx
    1.23%    1.27%  0x00007f6a84e07a4b: movabs $0x7f6a80bd9000,%r10
    0.12%    0.07%  0x00007f6a84e07a55: cmpb   $0x0,(%r10,%rcx,1)      <----- card mark check
    1.32%    1.24%  0x00007f6a84e07a5a: je     0x00007f6a84e07a65      <----- jump over
                    0x00007f6a84e07a60: movb   $0x0,(%r10,%rcx,1)
                    ...

Now neither the compiled nor the interpreted version experiences sharing, and performance improves substantially: compiled goes from 1.454 to 0.882 ns/op, and interp from 172.316 to 79.445 ns/op. This is the reason for the change.
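
The difference between the two fast_aputfield listings can be modelled in a few lines of Java. This is only a sketch of the barrier shape, not HotSpot code. The 512-byte card size and the dirty value 0 are read off the listings (`shr $0x9`, `movb $0x0`); the clean value, the store counter, and all names are assumptions made for illustration.

```java
import java.util.Arrays;

public class CondCardMarkSketch {
    static final int CARD_SHIFT = 9;        // from `shr $0x9` in the listing
    static final byte DIRTY = 0;            // from `movb $0x0` / `cmpb $0x0`
    static final byte CLEAN = (byte) 0xFF;  // assumed clean value for this model

    public final byte[] cards;
    public int cardStores;                  // hypothetical counter of card table writes

    public CondCardMarkSketch(int heapBytes) {
        cards = new byte[(heapBytes >>> CARD_SHIFT) + 1];
        Arrays.fill(cards, CLEAN);
    }

    static int cardIndex(long address) {
        return (int) (address >>> CARD_SHIFT);
    }

    // -XX:-UseCondCardMark shape: store to the card on every reference write.
    public void markUnconditional(long address) {
        cards[cardIndex(address)] = DIRTY;
        cardStores++;
    }

    // -XX:+UseCondCardMark shape: load the card first, store only if not dirty yet.
    public void markConditional(long address) {
        int idx = cardIndex(address);
        if (cards[idx] != DIRTY) {
            cards[idx] = DIRTY;
            cardStores++;
        }
    }

    public static void main(String[] args) {
        long obj = 0x1000L; // hypothetical object address

        CondCardMarkSketch always = new CondCardMarkSketch(1 << 16);
        CondCardMarkSketch cond = new CondCardMarkSketch(1 << 16);
        for (int i = 0; i < 16; i++) {
            always.markUnconditional(obj);
            cond.markConditional(obj);
        }
        System.out.println("unconditional card stores: " + always.cardStores);
        System.out.println("conditional card stores:   " + cond.cardStores);
    }
}
```

Once the card is dirty, the conditional variant only reads it, so the card table cache line can stay shared between the core running interpreted code and the core running compiled code. The unconditional variant keeps writing the same byte, which forces the line to move between the cores on every store.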