= Reference CAS induces GC store barrier even on failure

This is a rough estimate of the performance impact of:
  https://bugs.openjdk.java.net/browse/JDK-8019968

The tests were run on a 4x10x2 Xeon (Westmere-EX), Solaris 11 x86_64,
jdk-dev @ 2015-07-22, plus patches. We run in six modes, aggregated in
three groups:

  a) -XX:+UseParallelGC
     -XX:+UseParallelGC -XX:+UseNewCode

  b) -XX:+UseParallelGC -XX:+UseCondCardMark
     -XX:+UseParallelGC -XX:+UseCondCardMark -XX:+UseNewCode

  c) -XX:+UseG1GC
     -XX:+UseG1GC -XX:+UseNewCode

Each group exercises a different GC mode; the -XX:+UseNewCode variant in
each group runs with the new code. Our hypothesis is that the change should
help a lot in case (a), where the stray stores into the card mark table make
the store barrier very heavy-weight.

== Benchmarks

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CASToNull {
    Object tombstone;              // never stored into ref
    AtomicReference<Object> ref;

    @Param({"0", "1", "2", "4", "8", "16", "32", "64"})
    int backoff;

    @Setup
    public void setup() {
        tombstone = new Object();
        ref = new AtomicReference<>();
        ref.set(new Object());
    }

    @Benchmark
    public boolean fail() {
        Blackhole.consumeCPU(backoff);
        return ref.compareAndSet(tombstone, null); // expected value never matches
    }

    @Benchmark
    public boolean success() {
        Blackhole.consumeCPU(backoff);
        Object c = ref.get();
        return ref.compareAndSet(c, null);
    }
}

// Same imports as CASToNull.
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CAStoYoung {
    Object tombstone;              // never stored into ref
    AtomicReference<Object> ref;

    @Param({"0", "1", "2", "4", "8", "16", "32", "64"})
    int backoff;

    @Setup
    public void setup() {
        tombstone = new Object();
        ref = new AtomicReference<>();
        ref.set(new Object());
    }

    @Benchmark
    public boolean fail() {
        Blackhole.consumeCPU(backoff);
        return ref.compareAndSet(tombstone, new Object()); // never succeeds
    }

    @Benchmark
    public boolean success() {
        Blackhole.consumeCPU(backoff);
        Object c = ref.get();
        return ref.compareAndSet(c, new Object()); // mostly succeeds
    }

    @Benchmark
    public void spin() {
        Blackhole.consumeCPU(backoff);
        Object next = new Object();
        AtomicReference<Object> ref = this.ref;
        Object cur;
        // Retry until the CAS goes through.
        do {
            cur = ref.get();
        } while (!ref.compareAndSet(cur, next));
    }
}

== Results

=== CASToNull.*

CASToNull.* should always go through the fast path that detects a store of
"null", for which no write barrier is needed. This is our negative control
test. Indeed, no significant difference shows up between the baseline and
the new code in any mode.

Benchmark           (backoff)  Mode  Cnt      Score        Error  Units

# -XX:+UseParallelGC
CASToNull.fail              0  avgt   15   1158.432 ±    27.358  ns/op
CASToNull.fail              1  avgt   15   1627.847 ±   429.314  ns/op
CASToNull.fail              2  avgt   15   1624.253 ±   190.054  ns/op
CASToNull.fail              4  avgt   15   1680.333 ±    28.913  ns/op
CASToNull.fail              8  avgt   15   2399.365 ±    32.580  ns/op
CASToNull.fail             16  avgt   15   3906.060 ±    62.469  ns/op
CASToNull.fail             32  avgt   15   5826.774 ±   129.034  ns/op
CASToNull.fail             64  avgt   15   8504.821 ±    53.115  ns/op

# -XX:+UseParallelGC -XX:+UseNewCode
CASToNull.fail              0  avgt   15   1191.762 ±    22.767  ns/op
CASToNull.fail              1  avgt   15   1803.329 ±   344.910  ns/op
CASToNull.fail              2  avgt   15   1733.315 ±   162.026  ns/op
CASToNull.fail              4  avgt   15   1861.545 ±   172.146  ns/op
CASToNull.fail              8  avgt   15   2385.091 ±    28.901  ns/op
CASToNull.fail             16  avgt   15   3918.301 ±    51.220  ns/op
CASToNull.fail             32  avgt   15   5909.455 ±    92.402  ns/op
CASToNull.fail             64  avgt   15   8546.466 ±   108.750  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark
CASToNull.fail              0  avgt   15   1148.689 ±    32.610  ns/op
CASToNull.fail              1  avgt   15   1214.575 ±    21.240  ns/op
CASToNull.fail              2  avgt   15   1526.381 ±   180.223  ns/op
CASToNull.fail              4  avgt   15   1719.105 ±    61.719  ns/op
CASToNull.fail              8  avgt   15   2405.597 ±    45.498  ns/op
CASToNull.fail             16  avgt   15   3894.482 ±    60.682  ns/op
CASToNull.fail             32  avgt   15   5925.025 ±   183.776  ns/op
CASToNull.fail             64  avgt   15   8530.796 ±    68.172  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark -XX:+UseNewCode
CASToNull.fail              0  avgt   15   1141.622 ±    18.158  ns/op
CASToNull.fail              1  avgt   15   1749.190 ±   427.915  ns/op
CASToNull.fail              2  avgt   15   1637.298 ±   261.124  ns/op
CASToNull.fail              4  avgt   15   1704.933 ±    48.667  ns/op
CASToNull.fail              8  avgt   15   2376.481 ±    32.389  ns/op
CASToNull.fail             16  avgt   15   3921.200 ±    65.509  ns/op
CASToNull.fail             32  avgt   15   5853.103 ±    87.538  ns/op
CASToNull.fail             64  avgt   15   8504.405 ±    82.682  ns/op

# -XX:+UseG1GC
CASToNull.fail              0  avgt   15   1202.096 ±    26.977  ns/op
CASToNull.fail              1  avgt   15   1827.246 ±   387.969  ns/op
CASToNull.fail              2  avgt   15   1572.458 ±   240.386  ns/op
CASToNull.fail              4  avgt   15   1779.501 ±    81.576  ns/op
CASToNull.fail              8  avgt   15   2405.732 ±    30.223  ns/op
CASToNull.fail             16  avgt   15   3957.790 ±    71.167  ns/op
CASToNull.fail             32  avgt   15   5917.228 ±   130.575  ns/op
CASToNull.fail             64  avgt   15   8659.984 ±    56.462  ns/op

# -XX:+UseG1GC -XX:+UseNewCode
CASToNull.fail              0  avgt   15   1213.194 ±    19.937  ns/op
CASToNull.fail              1  avgt   15   1486.297 ±   353.353  ns/op
CASToNull.fail              2  avgt   15   1798.166 ±   103.569  ns/op
CASToNull.fail              4  avgt   15   1821.993 ±   143.017  ns/op
CASToNull.fail              8  avgt   15   2436.366 ±    45.047  ns/op
CASToNull.fail             16  avgt   15   3920.117 ±    51.029  ns/op
CASToNull.fail             32  avgt   15   6003.204 ±   144.333  ns/op
CASToNull.fail             64  avgt   15   8553.080 ±   104.600  ns/op

------------------------------------------------------------------------

# -XX:+UseParallelGC
CASToNull.success           0  avgt   15   2868.559 ±   418.247  ns/op
CASToNull.success           1  avgt   15   2866.673 ±    56.382  ns/op
CASToNull.success           2  avgt   15   3810.401 ±   361.009  ns/op
CASToNull.success           4  avgt   15   3686.814 ±    83.372  ns/op
CASToNull.success           8  avgt   15   5520.490 ±    76.941  ns/op
CASToNull.success          16  avgt   15   9335.767 ±   127.010  ns/op
CASToNull.success          32  avgt   15  14766.018 ±   116.561  ns/op
CASToNull.success          64  avgt   15  20982.783 ±   391.029  ns/op

# -XX:+UseParallelGC -XX:+UseNewCode
CASToNull.success           0  avgt   15   3075.972 ±   302.512  ns/op
CASToNull.success           1  avgt   15   3242.284 ±   454.872  ns/op
CASToNull.success           2  avgt   15   3666.982 ±   160.667  ns/op
CASToNull.success           4  avgt   15   3630.523 ±    70.669  ns/op
CASToNull.success           8  avgt   15   5491.432 ±    82.781  ns/op
CASToNull.success          16  avgt   15   9330.437 ±    80.662  ns/op
CASToNull.success          32  avgt   15  14537.532 ±   213.339  ns/op
CASToNull.success          64  avgt   15  21463.386 ±   174.082  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark
CASToNull.success           0  avgt   15   3090.290 ±   396.683  ns/op
CASToNull.success           1  avgt   15   3647.916 ±   134.391  ns/op
CASToNull.success           2  avgt   15   3613.078 ±   185.156  ns/op
CASToNull.success           4  avgt   15   3669.604 ±   113.880  ns/op
CASToNull.success           8  avgt   15   5552.499 ±   104.412  ns/op
CASToNull.success          16  avgt   15   9384.205 ±   111.441  ns/op
CASToNull.success          32  avgt   15  14930.311 ±  1179.601  ns/op
CASToNull.success          64  avgt   15  20996.914 ±   412.662  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark -XX:+UseNewCode
CASToNull.success           0  avgt   15   2853.955 ±   401.024  ns/op
CASToNull.success           1  avgt   15   3117.257 ±   386.775  ns/op
CASToNull.success           2  avgt   15   3653.674 ±   214.236  ns/op
CASToNull.success           4  avgt   15   3719.997 ±    96.044  ns/op
CASToNull.success           8  avgt   15   5619.496 ±   318.103  ns/op
CASToNull.success          16  avgt   15   9395.054 ±   116.810  ns/op
CASToNull.success          32  avgt   15  14695.077 ±   246.336  ns/op
CASToNull.success          64  avgt   15  20973.765 ±   394.062  ns/op

# -XX:+UseG1GC
CASToNull.success           0  avgt   15   3165.089 ±   490.134  ns/op
CASToNull.success           1  avgt   15   3102.257 ±   543.112  ns/op
CASToNull.success           2  avgt   15   3517.486 ±   298.164  ns/op
CASToNull.success           4  avgt   15   3675.551 ±   100.089  ns/op
CASToNull.success           8  avgt   15   5459.346 ±    42.143  ns/op
CASToNull.success          16  avgt   15   9450.194 ±   119.024  ns/op
CASToNull.success          32  avgt   15  14582.071 ±   269.651  ns/op
CASToNull.success          64  avgt   15  21070.824 ±   384.602  ns/op

# -XX:+UseG1GC -XX:+UseNewCode
CASToNull.success           0  avgt   15   2832.507 ±   190.049  ns/op
CASToNull.success           1  avgt   15   2949.046 ±   300.905  ns/op
CASToNull.success           2  avgt   15   3491.630 ±   354.535  ns/op
CASToNull.success           4  avgt   15   3827.768 ±    88.708  ns/op
CASToNull.success           8  avgt   15   5456.930 ±    71.300  ns/op
CASToNull.success          16  avgt   15   9560.692 ±   111.305  ns/op
CASToNull.success          32  avgt   15  14837.683 ±   137.573  ns/op
CASToNull.success          64  avgt   15  21144.107 ±   228.580  ns/op

=== CAStoYoung.fail

CAStoYoung.fail should never need a write barrier, since the store never
succeeds. Conditional card marking alleviates the barrier cost. The same
holds for G1, although the barrier itself is rather large there. Indeed,
with plain -XX:+UseParallelGC the overhead of the failing CAS drops off
rapidly with the new code; only the backoff = 0 point is too noisy to tell
(± 7801.890 ns/op).

# -XX:+UseParallelGC
CAStoYoung.fail             0  avgt   15  10886.932 ±   367.659  ns/op
CAStoYoung.fail             1  avgt   15   9609.906 ±   414.545  ns/op
CAStoYoung.fail             2  avgt   15   8137.711 ±   835.879  ns/op
CAStoYoung.fail             4  avgt   15   8707.719 ±   100.268  ns/op
CAStoYoung.fail             8  avgt   15   8538.445 ±   319.947  ns/op
CAStoYoung.fail            16  avgt   15   8349.820 ±   334.644  ns/op
CAStoYoung.fail            32  avgt   15   8819.781 ±   185.966  ns/op
CAStoYoung.fail            64  avgt   15   8816.870 ±   170.126  ns/op

# -XX:+UseParallelGC -XX:+UseNewCode
CAStoYoung.fail             0  avgt   15   9995.590 ±  7801.890  ns/op
CAStoYoung.fail             1  avgt   15   2052.901 ±   116.928  ns/op
CAStoYoung.fail             2  avgt   15   1991.903 ±    95.804  ns/op
CAStoYoung.fail             4  avgt   15   2085.363 ±    60.892  ns/op
CAStoYoung.fail             8  avgt   15   2802.850 ±    94.273  ns/op
CAStoYoung.fail            16  avgt   15   4265.955 ±    80.505  ns/op
CAStoYoung.fail            32  avgt   15   6587.419 ±   210.064  ns/op
CAStoYoung.fail            64  avgt   15   8678.888 ±    94.451  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark
CAStoYoung.fail             0  avgt   15   1922.153 ±   435.211  ns/op
CAStoYoung.fail             1  avgt   15   2326.227 ±   401.591  ns/op
CAStoYoung.fail             2  avgt   15   2124.282 ±   203.236  ns/op
CAStoYoung.fail             4  avgt   15   2127.239 ±    28.840  ns/op
CAStoYoung.fail             8  avgt   15   2826.651 ±    71.603  ns/op
CAStoYoung.fail            16  avgt   15   4264.711 ±    80.146  ns/op
CAStoYoung.fail            32  avgt   15   6584.945 ±   139.176  ns/op
CAStoYoung.fail            64  avgt   15   8678.597 ±   108.368  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark -XX:+UseNewCode
CAStoYoung.fail             0  avgt   15   2121.498 ±   377.194  ns/op
CAStoYoung.fail             1  avgt   15   1834.087 ±   214.093  ns/op
CAStoYoung.fail             2  avgt   15   2009.337 ±   408.236  ns/op
CAStoYoung.fail             4  avgt   15   4772.391 ±  5178.709  ns/op
CAStoYoung.fail             8  avgt   15   2975.667 ±   932.668  ns/op
CAStoYoung.fail            16  avgt   15   4253.491 ±    96.041  ns/op
CAStoYoung.fail            32  avgt   15   6537.454 ±   153.321  ns/op
CAStoYoung.fail            64  avgt   15   8584.301 ±    78.052  ns/op

# -XX:+UseG1GC
CAStoYoung.fail             0  avgt   15   1843.731 ±   186.010  ns/op
CAStoYoung.fail             1  avgt   15   1815.408 ±   104.263  ns/op
CAStoYoung.fail             2  avgt   15   1933.081 ±    59.323  ns/op
CAStoYoung.fail             4  avgt   15   2164.548 ±    38.333  ns/op
CAStoYoung.fail             8  avgt   15   2830.278 ±    44.081  ns/op
CAStoYoung.fail            16  avgt   15   4208.434 ±    69.866  ns/op
CAStoYoung.fail            32  avgt   15   6800.963 ±   206.472  ns/op
CAStoYoung.fail            64  avgt   15   8589.649 ±   167.935  ns/op

# -XX:+UseG1GC -XX:+UseNewCode
CAStoYoung.fail             0  avgt   15   1928.165 ±   209.471  ns/op
CAStoYoung.fail             1  avgt   15   1845.913 ±   159.669  ns/op
CAStoYoung.fail             2  avgt   15   1976.318 ±    84.220  ns/op
CAStoYoung.fail             4  avgt   15   2107.321 ±    48.574  ns/op
CAStoYoung.fail             8  avgt   15   2777.636 ±    43.708  ns/op
CAStoYoung.fail            16  avgt   15   4207.685 ±    55.991  ns/op
CAStoYoung.fail            32  avgt   15   6497.191 ±   131.136  ns/op
CAStoYoung.fail            64  avgt   15   8615.024 ±   214.188  ns/op

=== CAStoYoung.success

CAStoYoung.success always induces the store barrier on the success path.
However, in a multithreaded benchmark there are always cases where the
about-to-succeed CAS fails, because another thread changed the reference
between the get() and the compareAndSet(). This explains why plain
-XX:+UseParallelGC is much slower. The new code solves that, as in the
CAStoYoung.fail case.
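
To see how often the about-to-succeed CAS actually fails, a minimal
stand-alone sketch could count the outcomes of the same get-then-CAS
pattern. It is not part of the benchmarks above: the class name
ContendedCasDemo and the thread and iteration counts are made up for
illustration, and it measures nothing about barriers.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo, not part of the benchmark suite.
class ContendedCasDemo {
    /** Runs get-then-CAS from several threads; returns {successes, failures}. */
    static long[] run(int threads, int opsPerThread) {
        AtomicReference<Object> ref = new AtomicReference<>(new Object());
        AtomicLong successes = new AtomicLong();
        AtomicLong failures = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                long ok = 0, failed = 0;
                for (int i = 0; i < opsPerThread; i++) {
                    Object cur = ref.get();
                    // Fails if another thread replaced the value after get().
                    if (ref.compareAndSet(cur, new Object())) ok++; else failed++;
                }
                successes.addAndGet(ok);
                failures.addAndGet(failed);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new long[] { successes.get(), failures.get() };
    }

    public static void main(String[] args) {
        long[] r = run(4, 1_000_000);
        System.out.println("successes=" + r[0] + " failures=" + r[1]);
    }
}
```

With a single thread every CAS succeeds. With several threads on a
multi-core machine some fraction fails; the exact counts vary from run to
run.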
# -XX:+UseParallelGC
CAStoYoung.success          0  avgt   15   9758.306 ±   447.366  ns/op
CAStoYoung.success          1  avgt   15   9799.157 ±  1232.733  ns/op
CAStoYoung.success          2  avgt   15   9032.226 ±   688.622  ns/op
CAStoYoung.success          4  avgt   15   9152.247 ±   480.075  ns/op
CAStoYoung.success          8  avgt   15  13531.909 ±  8079.631  ns/op
CAStoYoung.success         16  avgt   15   9256.949 ±   579.371  ns/op
CAStoYoung.success         32  avgt   15  16545.181 ±   302.252  ns/op
CAStoYoung.success         64  avgt   15  21293.076 ±   190.791  ns/op

# -XX:+UseParallelGC -XX:+UseNewCode
CAStoYoung.success          0  avgt   15   4769.846 ±   614.958  ns/op
CAStoYoung.success          1  avgt   15   5145.693 ±   904.553  ns/op
CAStoYoung.success          2  avgt   15   4961.582 ±   726.803  ns/op
CAStoYoung.success          4  avgt   15   4823.449 ±   142.396  ns/op
CAStoYoung.success          8  avgt   15  11601.428 ±  8267.744  ns/op
CAStoYoung.success         16  avgt   15  10799.277 ±   163.160  ns/op
CAStoYoung.success         32  avgt   15  15779.070 ±   219.367  ns/op
CAStoYoung.success         64  avgt   15  21056.654 ±   431.968  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark
CAStoYoung.success          0  avgt   15   4084.124 ±    83.819  ns/op
CAStoYoung.success          1  avgt   15   4062.072 ±   128.419  ns/op
CAStoYoung.success          2  avgt   15   4453.532 ±    96.601  ns/op
CAStoYoung.success          4  avgt   15   4859.179 ±    66.126  ns/op
CAStoYoung.success          8  avgt   15   6370.763 ±   129.623  ns/op
CAStoYoung.success         16  avgt   15  10045.163 ±   169.578  ns/op
CAStoYoung.success         32  avgt   15  15243.672 ±   247.383  ns/op
CAStoYoung.success         64  avgt   15  21471.432 ±   173.306  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark -XX:+UseNewCode
CAStoYoung.success          0  avgt   15   4018.500 ±   183.154  ns/op
CAStoYoung.success          1  avgt   15   4226.981 ±   171.994  ns/op
CAStoYoung.success          2  avgt   15   4572.614 ±   536.322  ns/op
CAStoYoung.success          4  avgt   15   4965.257 ±   119.945  ns/op
CAStoYoung.success          8  avgt   15   6480.704 ±   119.147  ns/op
CAStoYoung.success         16  avgt   15  10153.789 ±   212.914  ns/op
CAStoYoung.success         32  avgt   15  15568.629 ±   195.838  ns/op
CAStoYoung.success         64  avgt   15  20841.767 ±   191.084  ns/op

# -XX:+UseG1GC
CAStoYoung.success          0  avgt   15   3859.917 ±   109.589  ns/op
CAStoYoung.success          1  avgt   15   4024.143 ±   122.817  ns/op
CAStoYoung.success          2  avgt   15   4292.511 ±   116.272  ns/op
CAStoYoung.success          4  avgt   15   4736.174 ±   121.958  ns/op
CAStoYoung.success          8  avgt   15   6972.395 ±  1620.592  ns/op
CAStoYoung.success         16  avgt   15  10488.929 ±  1600.554  ns/op
CAStoYoung.success         32  avgt   15  15941.075 ±  1247.687  ns/op
CAStoYoung.success         64  avgt   15  21148.458 ±   396.209  ns/op

# -XX:+UseG1GC -XX:+UseNewCode
CAStoYoung.success          0  avgt   15   4126.292 ±    60.947  ns/op
CAStoYoung.success          1  avgt   15   4423.496 ±   167.244  ns/op
CAStoYoung.success          2  avgt   15   4734.201 ±    99.917  ns/op
CAStoYoung.success          4  avgt   15   5130.220 ±    97.021  ns/op
CAStoYoung.success          8  avgt   15   6799.242 ±   318.445  ns/op
CAStoYoung.success         16  avgt   15  10151.588 ±   137.374  ns/op
CAStoYoung.success         32  avgt   15  15583.170 ±   165.248  ns/op
CAStoYoung.success         64  avgt   15  21506.133 ±   431.274  ns/op

=== CAStoYoung.spin

CAStoYoung.spin is the pathological example where CAS failures dominate the
execution when the CAS is contended across multiple threads. This is why
plain -XX:+UseParallelGC is penalized a lot. As before, the new code
alleviates the cost associated with the store barrier there.
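
The difference between the two barrier policies can be pictured with a toy
model. This is only a sketch under assumptions: CardMarkModel, its
single-flag "card table" and both cas* methods are hypothetical names, and
real HotSpot barriers are emitted by the JIT compilers rather than written
in Java. The model merely counts how many card stores each policy performs
for a run of failing CASes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model only; names are hypothetical and this is not HotSpot code.
class CardMarkModel {
    boolean cardDirty;   // stands in for the one card covering the field
    long cardStores;     // how many times the barrier wrote to the card table

    private void postBarrier() {
        cardDirty = true;
        cardStores++;
    }

    /** Old behavior as described above: barrier runs whether or not the CAS succeeded. */
    boolean casBarrierAlways(AtomicReference<Object> ref, Object expect, Object update) {
        boolean ok = ref.compareAndSet(expect, update);
        postBarrier();
        return ok;
    }

    /** New behavior as described above: barrier runs only when the store happened. */
    boolean casBarrierOnSuccess(AtomicReference<Object> ref, Object expect, Object update) {
        boolean ok = ref.compareAndSet(expect, update);
        if (ok) {
            postBarrier();
        }
        return ok;
    }

    public static void main(String[] args) {
        AtomicReference<Object> ref = new AtomicReference<>(new Object());
        Object tombstone = new Object();   // never stored, so every CAS fails
        CardMarkModel always = new CardMarkModel();
        CardMarkModel onSuccess = new CardMarkModel();
        for (int i = 0; i < 1000; i++) {
            always.casBarrierAlways(ref, tombstone, new Object());
            onSuccess.casBarrierOnSuccess(ref, tombstone, new Object());
        }
        System.out.println("always:     " + always.cardStores + " card stores");
        System.out.println("on success: " + onSuccess.cardStores + " card stores");
    }
}
```

For 1000 failing CASes the first policy performs 1000 card stores and the
second none. In the real system each of those stores hits a card table line
that other threads are writing too, which is the cost the spin benchmark
exposes.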
# -XX:+UseParallelGC
CAStoYoung.spin             0  avgt   15  42501.423 ±  5093.349  ns/op
CAStoYoung.spin             1  avgt   15  46319.246 ± 14612.201  ns/op
CAStoYoung.spin             2  avgt   15  58269.539 ± 16824.328  ns/op
CAStoYoung.spin             4  avgt   15  46332.126 ± 10489.046  ns/op
CAStoYoung.spin             8  avgt   15  62295.550 ± 23724.737  ns/op
CAStoYoung.spin            16  avgt   15  85166.436 ± 35161.373  ns/op
CAStoYoung.spin            32  avgt   15  63730.166 ± 26874.916  ns/op
CAStoYoung.spin            64  avgt   15  57391.224 ±  8186.361  ns/op

# -XX:+UseParallelGC -XX:+UseNewCode
CAStoYoung.spin             0  avgt   15  16713.976 ±  5067.542  ns/op
CAStoYoung.spin             1  avgt   15  20013.591 ±  5147.438  ns/op
CAStoYoung.spin             2  avgt   15  13195.552 ±  1149.168  ns/op
CAStoYoung.spin             4  avgt   15  11103.271 ±   341.758  ns/op
CAStoYoung.spin             8  avgt   15  12064.269 ±  1524.717  ns/op
CAStoYoung.spin            16  avgt   15  16958.504 ±  5953.961  ns/op
CAStoYoung.spin            32  avgt   15  25047.935 ±  1374.876  ns/op
CAStoYoung.spin            64  avgt   15  25838.782 ±   733.215  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark
CAStoYoung.spin             0  avgt   15   8214.193 ±  1242.655  ns/op
CAStoYoung.spin             1  avgt   15   7590.829 ±  2271.611  ns/op
CAStoYoung.spin             2  avgt   15   8717.976 ±  4038.639  ns/op
CAStoYoung.spin             4  avgt   15  10460.989 ±  4968.104  ns/op
CAStoYoung.spin             8  avgt   15  12275.134 ±  5282.991  ns/op
CAStoYoung.spin            16  avgt   15  17090.367 ±  4331.339  ns/op
CAStoYoung.spin            32  avgt   15  22330.230 ±   333.581  ns/op
CAStoYoung.spin            64  avgt   15  23523.539 ±  2073.867  ns/op

# -XX:+UseParallelGC -XX:+UseCondCardMark -XX:+UseNewCode
CAStoYoung.spin             0  avgt   15   8353.776 ±  1419.534  ns/op
CAStoYoung.spin             1  avgt   15   6910.703 ±   345.017  ns/op
CAStoYoung.spin             2  avgt   15   9770.812 ±  4705.379  ns/op
CAStoYoung.spin             4  avgt   15   7190.847 ±   259.970  ns/op
CAStoYoung.spin             8  avgt   15  18812.308 ±   192.154  ns/op
CAStoYoung.spin            16  avgt   15  15195.887 ±  2224.822  ns/op
CAStoYoung.spin            32  avgt   15  22779.196 ±   412.195  ns/op
CAStoYoung.spin            64  avgt   15  25651.270 ±   169.812  ns/op

# -XX:+UseG1GC
CAStoYoung.spin             0  avgt   15   6552.965 ±   747.977  ns/op
CAStoYoung.spin             1  avgt   15   6505.077 ±   541.207  ns/op
CAStoYoung.spin             2  avgt   15   8739.285 ±  4096.145  ns/op
CAStoYoung.spin             4  avgt   15   6478.451 ±   863.285  ns/op
CAStoYoung.spin             8  avgt   15  12140.651 ±  4956.752  ns/op
CAStoYoung.spin            16  avgt   15  14914.233 ±  2606.670  ns/op
CAStoYoung.spin            32  avgt   15  20855.356 ±  2231.988  ns/op
CAStoYoung.spin            64  avgt   15  21980.149 ±  1871.782  ns/op

# -XX:+UseG1GC -XX:+UseNewCode
CAStoYoung.spin             0  avgt   15   6353.831 ±   620.468  ns/op
CAStoYoung.spin             1  avgt   15   8214.940 ±  2141.160  ns/op
CAStoYoung.spin             2  avgt   15   6926.123 ±  1136.153  ns/op
CAStoYoung.spin             4  avgt   15   6919.368 ±   371.117  ns/op
CAStoYoung.spin             8  avgt   15  12152.679 ±  5763.521  ns/op
CAStoYoung.spin            16  avgt   15  13735.609 ±  1313.101  ns/op
CAStoYoung.spin            32  avgt   15  19966.317 ±  1985.727  ns/op
CAStoYoung.spin            64  avgt   15  22718.686 ±  2227.863  ns/op

== Conclusion

Making the store barrier conditional on CAS success, that is, skipping it
on the CAS failure path, seems to be a practical alternative to conditional
card marking in heavily-failing CAS scenarios. It does not seem to affect
other uses.
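
To put a rough number on the improvement, the sketch below divides the
plain -XX:+UseParallelGC CAStoYoung.spin scores by the -XX:+UseNewCode
scores from the tables above. The class name SpinSpeedup is made up for
illustration, and the ratios use point estimates only: the error bars on
the baseline are wide, so treat the output as indicative.

```java
// Hypothetical helper; the scores are copied from the CAStoYoung.spin tables.
class SpinSpeedup {
    static final int[] BACKOFF = { 0, 1, 2, 4, 8, 16, 32, 64 };

    // -XX:+UseParallelGC, ns/op
    static final double[] BASELINE = {
        42501.423, 46319.246, 58269.539, 46332.126,
        62295.550, 85166.436, 63730.166, 57391.224
    };

    // -XX:+UseParallelGC -XX:+UseNewCode, ns/op
    static final double[] NEW_CODE = {
        16713.976, 20013.591, 13195.552, 11103.271,
        12064.269, 16958.504, 25047.935, 25838.782
    };

    /** Baseline time divided by new-code time for the i-th backoff value. */
    static double ratio(int i) {
        return BASELINE[i] / NEW_CODE[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < BACKOFF.length; i++) {
            System.out.printf("backoff=%2d  baseline/new = %.2f%n", BACKOFF[i], ratio(i));
        }
    }
}
```

Every ratio comes out above 2, which matches the qualitative reading of the
spin results.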