package org.openjdk;

import org.openjdk.jmh.annotations.*;

import java.net.URI;
import java.util.concurrent.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(4)
public class URIParserBench {

    /*
    URI$Parser optimization experiment, with the goal of improving startup (and
    footprint) without impacting peak performance.

    The baseline includes recent improvements from removing redundant volatiles
    in URI; see JDK-8145862 and JDK-8145680.

    Experiment description:

    - After improving URI w.r.t. volatile fields, I noticed that interpreted
      execution of the URI constructors was suspiciously slow (>10 us).

    - An initial profiling pass showed time being spent in the charAt/substring
      methods wrapping URI$Parser.input. Removing these wrappers made the
      interpreter ~20% faster without affecting peak performance.

    - Further profiling showed time being spent in the synthetic accessor
      methods generated so that the inner Parser class could access private
      fields and methods of its enclosing class, URI. Removing Parser and moving
      all its methods into URI further improved interpreted code, but regressed
      throughput in JITted code because of the accesses to the volatile string
      field...

    - Explored restructuring the code so that string is final instead of
      volatile. This allowed several simplifications and improvements: since
      string, scheme and fragment are always set in the constructor,
      getRawSchemeSpecificPart can simply take a substring of string without
      any loss of performance (and without weakening thread safety). All
      constructors set the string field except an internal constructor used
      when relativizing, normalizing and resolving URIs from other URI objects,
      so I added some micros to measure this. While there might be a minimal
      footprint hit, the throughput hit seems negligible.
      We're also still definitely in better shape than before w.r.t.
      throughput. That said, always creating string for these operations
      *could* be a problem (slightly increased footprint), so I scanned through
      some commonly used open source projects (netty, grizzly): while I found
      plenty of uses of the public URI constructors, I found no use of these
      secondary operations. Rather, these projects have gone to lengths to
      reimplement normalization/relativization directly on Strings to avoid
      the overhead of creating and transforming URIs...

    - Drive-by optimizations: Parser creates schemeSpecificPart eagerly, which
      is unnecessary for many uses. For many types of URIs the SSP equals the
      path, so checking for this saves both footprint and throughput.

    Test setup: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

    Baseline:

    Benchmark                                                    Mode  Cnt    Score    Error  Units
    URIParserBench.createJRT                                     avgt   20   92.821 ±  6.742  ns/op
    URIParserBench.createMixed                                   avgt   20  156.030 ± 11.319  ns/op
    URIParserBench.createModule                                  avgt   20  109.423 ±  8.717  ns/op
    URIParserBench.createWithQuery                               avgt   20  149.273 ± 13.337  ns/op
    URIParserBench.getRelativeURI                                avgt   20  417.349 ± 29.871  ns/op
    URIParserBench.getRelativeURIFromExisting                    avgt   20   96.893 ±  4.218  ns/op
    URIParserBench.getSchemeSpecificPartWithQuery                avgt   20  143.445 ±  7.081  ns/op
    URIParserBench.getSchemeSpecificPartWithoutQuery             avgt   20   95.484 ±  5.312  ns/op
    URIParserBench.getSchemeSpecificPartWithoutSchemeOrFragment  avgt   20  126.345 ±  5.550  ns/op

    Experiment:

    Benchmark                                                    Mode  Cnt    Score    Error  Units
    URIParserBench.createJRT                                     avgt   20   57.524 ±  2.639  ns/op
    URIParserBench.createMixed                                   avgt   20  130.450 ±  8.163  ns/op
    URIParserBench.createModule                                  avgt   20   85.525 ±  3.969  ns/op
    URIParserBench.createWithQuery                               avgt   20  127.554 ±  8.652  ns/op
    URIParserBench.getRelativeURI                                avgt   20  398.856 ± 21.873  ns/op
    URIParserBench.getRelativeURIFromExisting                    avgt   20  171.144 ±  7.601  ns/op
    URIParserBench.getSchemeSpecificPartWithQuery                avgt   20  143.126 ±  7.495  ns/op
    URIParserBench.getSchemeSpecificPartWithoutQuery             avgt   20   81.317 ±  4.413  ns/op
    URIParserBench.getSchemeSpecificPartWithoutSchemeOrFragment  avgt   20  109.398 ±  5.803  ns/op

    For reference, before removing volatiles and volatile inits from URI:

    Benchmark                                                    Mode  Cnt    Score    Error  Units
    URIParserBench.createJRT                                     avgt   10  125.016 ± 14.243  ns/op
    URIParserBench.createMixed                                   avgt   10  191.271 ± 15.204  ns/op
    URIParserBench.createModule                                  avgt   10  135.584 ± 17.641  ns/op
    URIParserBench.createWithQuery                               avgt   10  182.828 ± 17.690  ns/op
    URIParserBench.getRelativeURI                                avgt   10  482.073 ± 30.876  ns/op
    URIParserBench.getRelativeURIFromExisting                    avgt   10  120.606 ± 10.212  ns/op
    URIParserBench.getSchemeSpecificPartWithQuery                avgt   10  144.177 ±  7.413  ns/op
    URIParserBench.getSchemeSpecificPartWithoutQuery             avgt   10  101.323 ± 10.765  ns/op

    All URI shapes are faster to construct with all the experiment changes in
    place, and footprint improves in the common cases by not creating
    schemeSpecificPart eagerly. The getRelativeURI micros exercise the penalty
    of having to create the string field eagerly: it is markedly more expensive
    in isolation (getRelativeURIFromExisting, 96.893 -> 171.144 ns/op) but
    disappears in the noise when combined with URI creation (getRelativeURI,
    417.349 ± 29.871 -> 398.856 ± 21.873 ns/op). All other cases improve or
    stay neutral. Compared to before, when all fields were volatile,
    throughput is clearly better in all cases except
    getRelativeURIFromExisting (120.606 -> 171.144 ns/op), where there's still
    a noticeable regression.
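
    The lazy schemeSpecificPart idea can be sketched as follows, assuming a much
    simplified URI model. The raw scheme-specific part is sliced on demand from
    a final string field, and the path String itself is returned when the SSP
    equals the path. SimpleUri and all of its members are hypothetical names
    chosen for illustration. The sketch only handles hierarchical URIs without
    an authority component and does no validation or escaping, so it does not
    reflect how java.net.URI is actually written.

```java
import java.util.Objects;

// Hypothetical, simplified URI model used only to illustrate the idea; it is
// NOT java.net.URI's implementation. Assumes hierarchical input without an
// authority component (no "//host"), e.g. "jrt:/java.base?query#frag".
final class SimpleUri {
    // Always assigned in the constructor, so it can be final instead of volatile.
    private final String string;
    private final String scheme;   // null if absent
    private final String path;     // never null in this model
    private final String query;    // null if absent
    private final String fragment; // null if absent

    // Computed on first use instead of eagerly in the constructor. The
    // unsynchronized write is a racy single-check: every thread computes the
    // same value from final fields, so at worst the work is repeated.
    private String schemeSpecificPart;

    SimpleUri(String s) {
        this.string = Objects.requireNonNull(s, "s");
        int hash = s.indexOf('#');
        int end = (hash >= 0) ? hash : s.length();
        this.fragment = (hash >= 0) ? s.substring(hash + 1) : null;

        // Simplified rule: a non-empty prefix ending in ':' before any '/', '?' or '#'.
        int colon = s.indexOf(':');
        int slash = s.indexOf('/');
        int qmark = s.indexOf('?');
        boolean hasScheme = colon > 0 && colon < end
                && (slash < 0 || colon < slash)
                && (qmark < 0 || colon < qmark);
        this.scheme = hasScheme ? s.substring(0, colon) : null;

        int pathStart = hasScheme ? colon + 1 : 0;
        int q = s.indexOf('?', pathStart);
        if (q >= 0 && q < end) {
            this.path = s.substring(pathStart, q);
            this.query = s.substring(q + 1, end);
        } else {
            this.path = s.substring(pathStart, end);
            this.query = null;
        }
    }

    String getScheme() { return scheme; }
    String getPath() { return path; }
    String getQuery() { return query; }
    String getFragment() { return fragment; }

    // Raw SSP: everything between "scheme:" and "#fragment".
    String getRawSchemeSpecificPart() {
        String ssp = schemeSpecificPart;
        if (ssp == null) {
            if (query == null) {
                // Without authority or query the SSP equals the path:
                // reuse that String instead of allocating a new one.
                ssp = path;
            } else {
                int start = (scheme == null) ? 0 : scheme.length() + 1;
                int end = (fragment == null)
                        ? string.length()
                        : string.length() - fragment.length() - 1;
                ssp = string.substring(start, end);
            }
            schemeSpecificPart = ssp;
        }
        return ssp;
    }

    @Override
    public String toString() { return string; }

    public static void main(String[] args) {
        SimpleUri plain = new SimpleUri("jrt:/java.base");
        System.out.println(plain.getRawSchemeSpecificPart());                    // /java.base
        System.out.println(plain.getRawSchemeSpecificPart() == plain.getPath()); // true

        SimpleUri full = new SimpleUri("jrt:/java.base?query#frag");
        System.out.println(full.getRawSchemeSpecificPart());                     // /java.base?query
    }
}
```

    Running main prints /java.base, true and /java.base?query. The query-less
    URI hands back its existing path String, and only the URI with a query pays
    for a substring.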
    -Xint

    Baseline:

    Benchmark                                                    Mode  Cnt    Score    Error  Units
    URIParserBench.createJRT                                     avgt   20   13.971 ±  0.130  us/op
    URIParserBench.createMixed                                   avgt   20   39.749 ±  0.768  us/op
    URIParserBench.createModule                                  avgt   20   29.047 ±  0.597  us/op
    URIParserBench.createWithQuery                               avgt   20   39.753 ±  0.919  us/op
    URIParserBench.getRelativeURI                                avgt   20   96.385 ±  1.998  us/op
    URIParserBench.getRelativeURIFromExisting                    avgt   20   21.066 ±  0.423  us/op
    URIParserBench.getSchemeSpecificPartWithQuery                avgt   20   40.650 ±  1.044  us/op
    URIParserBench.getSchemeSpecificPartWithoutQuery             avgt   20   30.110 ±  0.627  us/op
    URIParserBench.getSchemeSpecificPartWithoutSchemeOrFragment  avgt   20   51.043 ±  1.382  us/op

    Experiment:

    Benchmark                                                    Mode  Cnt    Score    Error  Units
    URIParserBench.createJRT                                     avgt   20   10.701 ±  0.151  us/op
    URIParserBench.createMixed                                   avgt   20   26.480 ±  0.503  us/op
    URIParserBench.createModule                                  avgt   20   19.547 ±  0.345  us/op
    URIParserBench.createWithQuery                               avgt   20   26.298 ±  0.458  us/op
    URIParserBench.getRelativeURI                                avgt   20   71.617 ±  1.005  us/op
    URIParserBench.getRelativeURIFromExisting                    avgt   20   25.007 ±  1.029  us/op
    URIParserBench.getSchemeSpecificPartWithQuery                avgt   20   28.534 ±  0.643  us/op
    URIParserBench.getSchemeSpecificPartWithoutQuery             avgt   20   21.027 ±  0.596  us/op
    URIParserBench.getSchemeSpecificPartWithoutSchemeOrFragment  avgt   20   32.636 ±  0.912  us/op

    The experiment greatly benefits interpreted code across the board, except
    for the .*FromExisting case, which sees a small regression (21.066 ->
    25.007 us/op). URI.class is now 4.5 KB smaller than URI.class and
    URI$Parser.class combined were before, which adds to the startup
    improvement.
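
    The interpreter gains above are attributed to two structural changes
    described at the top: dropping the charAt/substring wrappers around
    Parser.input, and folding the inner Parser class into URI. The sketch
    below contrasts the two shapes on scheme parsing only. BeforeUri, AfterUri,
    SchemeChars and ParserShapesDemo are hypothetical names, and the only
    syntax handled is the scheme rule ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    followed by ":". On JDK versions before 11 (that is, before nest-based
    access control, JEP 181), javac compiles the inner class's write to the
    private field BeforeUri.scheme into a call to a synthetic accessor method.
    That is the kind of extra call the profile above pointed at.

```java
// Hypothetical sketch contrasting two code shapes; not the JDK's URI source.
// Only the scheme is parsed: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) ":".
final class ParserShapesDemo {
    public static void main(String[] args) {
        String[] inputs = {"jrt:/java.base", "/java.base?module=accessible", "1abc:x"};
        for (String s : inputs) {
            System.out.println(s + " -> " + new BeforeUri(s).getScheme()
                    + " / " + new AfterUri(s).getScheme());
        }
    }
}

final class SchemeChars {
    static boolean isStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isPart(char c) {
        return isStart(c) || (c >= '0' && c <= '9') || c == '+' || c == '-' || c == '.';
    }
}

// "Before" shape: an inner Parser writes the enclosing object's private
// field and wraps input.charAt/substring in its own methods.
final class BeforeUri {
    private String scheme; // written from the inner Parser

    BeforeUri(String s) {
        new Parser(s).parse();
    }

    String getScheme() { return scheme; }

    private class Parser {
        private final String input;

        Parser(String input) { this.input = input; }

        // Thin wrappers: essentially free once the JIT inlines them, but each
        // is an extra method invocation while running in the interpreter.
        private char charAt(int p) { return input.charAt(p); }
        private String substring(int start, int end) { return input.substring(start, end); }

        void parse() {
            int n = input.length();
            if (n == 0 || !SchemeChars.isStart(charAt(0))) {
                return;
            }
            for (int p = 1; p < n; p++) {
                char c = charAt(p);
                if (c == ':') {
                    // Before JDK 11, javac compiles this write to the outer
                    // class's private field into a synthetic accessor call.
                    scheme = substring(0, p);
                    return;
                }
                if (!SchemeChars.isPart(c)) {
                    return;
                }
            }
        }
    }
}

// "After" shape: parsing lives in the outer class, calls String methods
// directly, and lets the constructor assign a final field.
final class AfterUri {
    private final String scheme;

    AfterUri(String s) {
        this.scheme = parseScheme(s);
    }

    String getScheme() { return scheme; }

    private static String parseScheme(String s) {
        int n = s.length();
        if (n == 0 || !SchemeChars.isStart(s.charAt(0))) {
            return null;
        }
        for (int p = 1; p < n; p++) {
            char c = s.charAt(p);
            if (c == ':') {
                return s.substring(0, p);
            }
            if (!SchemeChars.isPart(c)) {
                return null;
            }
        }
        return null;
    }
}
```

    For the three sample inputs both shapes return jrt, null and null. They
    differ only in how many classes and method calls the work goes through,
    which is the kind of overhead that matters mostly in interpreted code.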
    */

    @Setup
    public void setup() throws Exception {
        urls = new String[1024];
        for (int i = 0; i < 1024; i++) {
            // Roughly half of the URLs carry a query...
            if (ThreadLocalRandom.current().nextDouble(1.0) < 0.5) {
                urls[i] = "jrt:/jdk.location";
            } else {
                urls[i] = "jrt:/jdk.location?query";
            }
            // ...and roughly 10% also carry a fragment.
            if (ThreadLocalRandom.current().nextDouble(1.0) < 0.1) {
                urls[i] = urls[i] + "#fragment";
            }
        }
    }

    @Benchmark
    public URI createJRT() throws Exception {
        return URI.create("jrt:/");
    }

    public int index = 0;
    public String[] urls;
    public URI baseURI = URI.create("jrt:/java.base");
    public URI childURI = URI.create("jrt:/java.base/some/thing/else?query");

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public URI createMixed() throws Exception {
        // Cycle through the pre-generated mix of URL shapes.
        index++;
        return URI.create(urls[index % 1024]);
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public URI createModule() throws Exception {
        return URI.create("jrt:/java.base");
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public URI createWithQuery() throws Exception {
        return URI.create("jrt:/java.base?query");
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public String getSchemeSpecificPartWithQuery() throws Exception {
        return URI.create("jrt:/java.base?query").getSchemeSpecificPart();
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public String getSchemeSpecificPartWithoutQuery() throws Exception {
        return URI.create("jrt:/java.base").getSchemeSpecificPart();
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public String getSchemeSpecificPartWithoutSchemeOrFragment() throws Exception {
        return URI.create("/java.base?module=accessible").getSchemeSpecificPart();
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public URI getRelativeURIFromExisting() throws Exception {
        // Relativize only: both URIs were created once, up front.
        return baseURI.relativize(childURI);
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public URI getRelativeURI() throws Exception {
        // Create both URIs, then relativize.
        return URI.create("jrt:/java.base").relativize(URI.create("jrt:/java.base/test?query"));
    }
}