== String Concatenation via invokedynamic

May-November 2015
aleksey.shipilev@oracle.com
@shipilev

See the discussion on rationale, alternatives, and caveats in the JEP:
  https://bugs.openjdk.java.net/browse/JDK-8085796

= Design overview

== javac changes

With the current change, javac is able to generate three flavors of String concat in the bytecode. All three flavors involve the contained change in javac, which recognizes the String concat tree in the AST, and emits either the StringBuilder append chain, or the invokedynamic call into the JDK. The differences between the flavors may be seen with this simple example:

  String s = "[" + System.nanoTime() + "] " +
             Thread.currentThread().getName() + ":" +
             "Calling an application method \"" + method + "\" without fear and prejudice.";

*** -XDstringConcat=inline

This is the legacy scheme: javac matches the AST looking for String concatenations, and emits the StringBuilder append chains:

  public java.lang.String work() throws java.lang.Exception;
    descriptor: ()Ljava/lang/String;
    flags: ACC_PUBLIC
    Code:
      stack=3, locals=1, args_size=1
         0: new           #2  // class java/lang/StringBuilder
         3: dup
         4: invokespecial #9  // Method java/lang/StringBuilder."<init>":()V
         7: ldc           #10 // String [
         9: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        12: invokestatic  #11 // Method java/lang/System.nanoTime:()J
        15: invokevirtual #12 // Method java/lang/StringBuilder.append:(J)Ljava/lang/StringBuilder;
        18: ldc           #13 // String ]
        20: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        23: invokestatic  #14 // Method java/lang/Thread.currentThread:()Ljava/lang/Thread;
        26: invokevirtual #15 // Method java/lang/Thread.getName:()Ljava/lang/String;
        29: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        32: ldc           #16 // String :
        34: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        37: ldc           #17 // String Calling an application method \"
        39: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        42: aload_0
        43: getfield      #8  // Field method:Ljava/lang/String;
        46: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        49: ldc           #18 // String \" without fear and prejudice.
        51: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        54: invokevirtual #7  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
        57: areturn

*** -XDstringConcat=indy

This is the new bytecode flavor suggested by the JEP. Here, javac also collects all String concat arguments, but instead of lowering the concat into StringBuilder chains, it emits an invokedynamic call to the well-known StringConcatFactory. The dynamic argument list depends on the number of concatenated arguments and their types. The BSM can then emit a specialized concat method to do the actual work.

  public java.lang.String work() throws java.lang.Exception;
    descriptor: ()Ljava/lang/String;
    flags: ACC_PUBLIC
    Code:
      stack=9, locals=1, args_size=1
         0: ldc           #9  // String [
         2: invokestatic  #10 // Method java/lang/System.nanoTime:()J
         5: ldc           #11 // String ]
         7: invokestatic  #12 // Method java/lang/Thread.currentThread:()Ljava/lang/Thread;
        10: invokevirtual #13 // Method java/lang/Thread.getName:()Ljava/lang/String;
        13: ldc           #14 // String :
        15: ldc           #15 // String Calling an application method \"
        17: aload_0
        18: getfield      #8  // Field method:Ljava/lang/String;
        21: ldc           #16 // String \" without fear and prejudice.
        23: invokedynamic #17,  0 // InvokeDynamic #0:makeConcat:(Ljava/lang/String;JLjava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
        28: areturn

  BootstrapMethods:
    0: #94 invokestatic java/lang/invoke/StringConcatFactory.makeConcat:(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/invoke/CallSite;
      Method arguments:
        #95

*** -XDstringConcat=indyWithConstants

This bytecode flavor is an elaboration of the "indy" flavor, which also recognizes that known constants may be passed to the BSM directly, without adding them as dynamic arguments:

  public java.lang.String work() throws java.lang.Exception;
    descriptor: ()Ljava/lang/String;
    flags: ACC_PUBLIC
    Code:
      stack=4, locals=1, args_size=1
         0: invokestatic  #9  // Method java/lang/System.nanoTime:()J
         3: invokestatic  #10 // Method java/lang/Thread.currentThread:()Ljava/lang/Thread;
         6: invokevirtual #11 // Method java/lang/Thread.getName:()Ljava/lang/String;
         9: aload_0
        10: getfield      #8  // Field method:Ljava/lang/String;
        13: invokedynamic #12,  0 // InvokeDynamic #0:makeConcatWithConstants:(JLjava/lang/String;Ljava/lang/String;)Ljava/lang/String;
        18: areturn

  BootstrapMethods:
    0: #85 invokestatic java/lang/invoke/StringConcatFactory.makeConcatWithConstants:(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;Ljava/lang/String;[Ljava/lang/String;)Ljava/lang/invoke/CallSite;
      Method arguments:
        #86 [\u0001] \u0001:Calling an application method \"\u0001\" without fear and prejudice.

The first bootstrap method argument is the "recipe", where each character in the recipe means:

 * \1 (Unicode code point 0001): an ordinary argument. This input is passed
   through a dynamic argument, and is provided during the concatenation method
   invocation. This input can be null.

 * \2 (Unicode code point 0002): a constant. This input is passed through a
   static bootstrap argument.
   This constant can be any value representable in the constant pool. If
   necessary, the factory would call {@code String.valueOf} to perform a
   one-time String conversion.

 * Any other char value: a single character constant.

We also drop empty strings from the recipe, because they carry no meaning at this level. This captures the Java language trick of forcing a String concat with, e.g., a ("" + int)-like expression. Down here in javac, we already know we are in the String concat business, and do not require these markers.

== JDK Changes

After javac has compiled the String concat into the indy call, we need to provide a bootstrap method (BSM) in the JDK to deal with it. See more details on the invokedynamic lifecycle in the Javadoc:
  https://docs.oracle.com/javase/8/docs/api/java/lang/invoke/package-summary.html

The goal for the BSM is to provide a MethodHandle that points to the actual concatenation code. The power of the proposed JEP comes from the opportunity to adapt the concatenation code to the actual argument list. Our prototype uses five potential strategies:

 * BC_SB, or "bytecode StringBuilder":
   This strategy spins up the bytecode that has the same StringBuilder chain javac would otherwise emit. This strategy uses only the public API, and comes as the baseline for the current JDK behavior. In other words, this strategy, in essence, moves the javac-generated bytecode to runtime. The generated bytecode is loaded via Unsafe.defineAnonymousClass, but with the caller class coming from the BSM -- in other words, the protection guarantees are inherited from the method where invokedynamic was originally called. This means, among other things, that the bytecode is verified for all non-JDK uses.

 * BC_SB_SIZED, or "bytecode StringBuilder, but sized":
   This strategy acts similarly to BC_SB, but it also tries to guess the capacity required for the StringBuilder to accept all arguments without resizing. This strategy only makes an educated guess: it only guesses the space required for known types (e.g. primitives and Strings), but does not otherwise convert arguments. Therefore, the capacity estimate may be wrong, and the StringBuilder may have to transparently resize or trim when doing the actual concatenation. While this is not a correctness issue (in the end, that is what BC_SB has to do anyway), it does pose a potential performance problem.

 * BC_SB_SIZED_EXACT, or "bytecode StringBuilder, but sized exactly":
   This strategy improves on BC_SB_SIZED by first converting all arguments to String in order to get the exact capacity the StringBuilder should have. The conversion is done via the public String.valueOf and/or Object.toString methods, and does not touch any private String API.

 * MH_SB_SIZED, or "MethodHandles StringBuilder, sized":
   (initial code contributed by Remi Forax) This strategy avoids spinning up bytecode by building the computation on MethodHandle combinators. The computation is built with public MethodHandle APIs, resolved from a public Lookup sequence, and ends up calling the public StringBuilder API. Therefore, this strategy does not use any private API at all, not even Unsafe.defineAnonymousClass, since everything is handled under the covers by the java.lang.invoke APIs.

 * MH_INLINE_SIZED_EXACT, or "MethodHandles inline, sized exactly":
   (initial code contributed by Peter Levart) This strategy replicates what StringBuilder does: it builds the char[] array on its own and passes it to a String constructor. This strategy requires access to some private JDK APIs, most notably the read-only Integer/Long.stringSize methods that measure the character length of integers, and the private String constructor that accepts char[] arrays without copying. While this strategy assumes particular implementation details of String, it opens the door for building a very optimal concatenation sequence. This is the only strategy that requires porting if private JDK changes occur.
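To see the BSM contract in action, a concat call site can also be linked by hand against the same factory. A minimal sketch, using the StringConcatFactory API as it shipped in JDK 9; the recipe and argument types here are our own example, not something javac emitted:

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.StringConcatFactory;

public class ManualLinkage {
    public static void main(String[] args) throws Throwable {
        // Link the call site by hand, as if javac had emitted:
        //   invokedynamic makeConcatWithConstants:(JLjava/lang/String;)Ljava/lang/String;
        // with the recipe "[\1] \1": constant "[", dynamic long, constant "] ", dynamic String.
        CallSite cs = StringConcatFactory.makeConcatWithConstants(
                MethodHandles.lookup(),
                "makeConcatWithConstants",
                MethodType.methodType(String.class, long.class, String.class),
                "[\u0001] \u0001");

        // The factory hands back the concatenation code as a MethodHandle.
        MethodHandle mh = cs.dynamicInvoker();
        String s = (String) mh.invokeExact(42L, "main");
        System.out.println(s);  // "[42] main"
    }
}
```

Which strategy backs the returned MethodHandle is an implementation detail of the factory; the caller only sees the (long, String) -> String handle.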
= Experiments

Current targeted benchmark set:

 * string_string benchmarks serve the simplest two-string case
 * string_string_int is a more complicated test, routinely treated by OptoStringConcat
 * string_string_long provides a similar test, but it is not optimized by OptoStringConcat

Also, since we are interested in how the concat behaves in the absence of aggressively-optimizing compilers, we run the tests with -XX:TieredStopAtLevel=1 to push the code through C1 only.

For convenience, this is the benchmark code:

  @State(Scope.Benchmark)
  @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @Fork(3)
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public class ConcatBench {
      private static final String a = "CONST";
      private String b;
      private int c;
      private long d;

      @Param({"1", "10", "100"})
      private int size;

      @Setup
      public void setup() {
          StringBuilder sbB = new StringBuilder(size);
          for (int i = 0; i < size; i++) {
              sbB.append("b");
          }
          b = sbB.toString();
          c = 42;
          d = 42;
      }

      @Benchmark
      public String string_string() {
          return a + b;
      }

      @Benchmark
      public String string_string_int() {
          return a + b + c;
      }

      @Benchmark
      public String string_string_long() {
          return a + b + d;
      }
  }

== string_string

=== C2

                            time, ns/op                               alloc pressure, bytes/op
                            [size] -------->                          [size] -------->

  # JDK 9 baseline:         1,           10,          100,            1,    10,   100,
  *,                        44.4 (0.2),  57.3 (0.3),  202.6 (1.1),    56,   72,   256,

  # -XDstringConcat=inline: 1,           10,          100,            1,    10,   100,
  BC_SB,                    44.6 (0.2),  57.3 (0.3),  202.9 (1.0),    56,   72,   256,
  BC_SB_SIZED,              44.7 (0.2),  57.4 (0.3),  203.6 (0.9),    56,   72,   256,
  BC_SB_SIZED_EXACT,        44.7 (0.3),  57.5 (0.4),  203.1 (0.9),    56,   72,   256,
  MH_SB_SIZED,              44.7 (0.2),  57.4 (0.3),  203.2 (1.1),    56,   72,   256,
  MH_INLINE_SIZED_EXACT,    44.7 (0.2),  57.5 (0.4),  203.4 (0.8),    56,   72,   256,

  # -XDstringConcat=indy:   1,           10,          100,            1,    10,   100,
  BC_SB,                    44.7 (0.2),  57.2 (0.2),  203.2 (0.9),    56,   72,   256,
  BC_SB_SIZED,              44.6 (0.2),  57.2 (0.4),  203.1 (1.1),    56,   72,   256,
  BC_SB_SIZED_EXACT,        44.6 (0.3),  57.4 (0.3),  203.1 (0.8),    56,   72,   256,
  MH_SB_SIZED,              70.2 (0.6),  95.6 (0.4),  388.2 (2.3),    88,   120,  488,
  MH_INLINE_SIZED_EXACT,    44.6 (0.3),  57.4 (0.2),  203.7 (1.2),    56,   72,   256,

  # -XDstringConcat=indyWithConstants:
                            1,           10,          100,            1,    10,   100,
  BC_SB,                    44.6 (0.2),  57.4 (0.4),  203.2 (1.3),    56,   72,   256,
  BC_SB_SIZED,              44.5 (0.2),  57.2 (0.4),  202.7 (0.9),    56,   72,   256,
  BC_SB_SIZED_EXACT,        44.6 (0.2),  57.3 (0.3),  203.2 (1.4),    56,   72,   256,
  MH_SB_SIZED,              70.0 (0.5),  95.7 (0.4),  386.7 (1.6),    88,   120,  488,
  MH_INLINE_SIZED_EXACT,    44.6 (0.3),  57.3 (0.4),  203.8 (0.8),    56,   72,   256,

When OptimizeStringConcat is operational, String concatenation is already heavily optimized, and most strategies show neither regressions nor improvements. MH_SB_SIZED fails to adhere to the bytecode shape expected by OptimizeStringConcat, and is thus penalized. MH_INLINE_SIZED_EXACT does the inline copy, so OptimizeStringConcat is not required there. There is also no difference between the "indy" and "indyWithConstants" flavors, since the compiler is able to aggressively inline, and see through the constness of parameters.
=== C1

                            time, ns/op                               alloc pressure, bytes/op
                            [size] -------->                          [size] -------->

  # JDK 9 baseline:         1,            10,           100,          1,    10,   100,
  *,                        101.3 (0.4),  114.1 (0.8),  441.9 (2.4),  128,  144,  560,

  # -XDstringConcat=inline: 1,            10,           100,          1,    10,   100,
  BC_SB,                    102.1 (0.5),  114.2 (0.6),  442.0 (2.3),  128,  144,  560,
  BC_SB_SIZED,              101.8 (0.6),  113.9 (0.5),  442.9 (2.9),  128,  144,  560,
  BC_SB_SIZED_EXACT,        101.9 (0.4),  114.4 (0.6),  443.1 (2.4),  128,  144,  560,
  MH_SB_SIZED,              101.9 (0.7),  114.5 (0.5),  442.7 (3.2),  128,  144,  560,
  MH_INLINE_SIZED_EXACT,    101.7 (0.4),  114.6 (0.5),  442.1 (1.7),  128,  144,  560,

  # -XDstringConcat=indy:   1,            10,           100,          1,    10,   100,
  BC_SB,                    101.7 (0.6),  114.3 (0.5),  442.2 (2.3),  128,  144,  560,
  BC_SB_SIZED,              89.7 (0.5),   114.2 (0.5),  404.3 (2.2),  112,  144,  512,
  BC_SB_SIZED_EXACT,        89.4 (0.2),   114.2 (0.5),  405.7 (3.4),  112,  144,  512,
  MH_SB_SIZED,              324.7 (8.2),  325.2 (2.4),  406.6 (1.5),  112,  144,  512,
  MH_INLINE_SIZED_EXACT,    44.5 (0.3),   57.0 (0.4),   202.1 (1.1),  56,   72,   256,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,          1,    10,   100,
  BC_SB,                    101.7 (0.4),  114.5 (0.6),  441.7 (2.4),  128,  144,  560,
  BC_SB_SIZED,              89.7 (0.4),   114.8 (0.6),  405.0 (2.0),  112,  144,  512,
  BC_SB_SIZED_EXACT,        89.5 (0.4),   114.4 (0.7),  405.5 (2.2),  112,  144,  512,
  MH_SB_SIZED,              307.2 (3.2),  317.0 (1.7),  400.8 (3.6),  112,  144,  512,
  MH_INLINE_SIZED_EXACT,    44.5 (0.3),   57.0 (0.3),   202.2 (1.0),  56,   72,   256,

When OptimizeStringConcat is not working (as with C1), we get substantial improvements in almost all strategies, except for MH_SB_SIZED on smaller strings. The allocation pressure is also down a lot. Again, "indyWithConstants" does not yield any substantial benefit over "indy".
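The MH_SB_SIZED numbers above come from a computation built entirely out of MethodHandle combinators instead of generated bytecode. A minimal sketch of that style (our own illustration, not the actual JDK code), composing a two-argument StringBuilder chain:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CombinatorConcat {
    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup l = MethodHandles.lookup();

        // (int)StringBuilder            -- new StringBuilder(capacity)
        MethodHandle newSB = l.findConstructor(StringBuilder.class,
                MethodType.methodType(void.class, int.class));
        // (StringBuilder, String)StringBuilder -- sb.append(s)
        MethodHandle append = l.findVirtual(StringBuilder.class, "append",
                MethodType.methodType(StringBuilder.class, String.class));
        // (StringBuilder)String         -- sb.toString()
        MethodHandle toStr = l.findVirtual(StringBuilder.class, "toString",
                MethodType.methodType(String.class));

        // (int, String)StringBuilder: new StringBuilder(cap).append(a)
        MethodHandle chain = MethodHandles.collectArguments(append, 0, newSB);
        // (int, String, String)StringBuilder: ...append(a).append(b)
        chain = MethodHandles.collectArguments(append, 0, chain);
        // (int, String, String)String: ...toString()
        chain = MethodHandles.filterReturnValue(chain, toStr);
        // Bind a guessed capacity, as a SIZED strategy would: (String, String)String
        MethodHandle concat = MethodHandles.insertArguments(chain, 0, 16);

        String s = (String) concat.invokeExact("foo", "bar");
        System.out.println(s);  // "foobar"
    }
}
```

Every argument added to the concat deepens this combinator tree, which is one intuition for why this strategy lags behind the bytecode-spinning ones until the JIT fully inlines through the handles.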
== string_string_int

=== C2

                            time, ns/op                               alloc pressure, bytes/op
                            [size] -------->                          [size] -------->

  # JDK 9 baseline:         1,           10,           100,           1,    10,   100,
  *,                        44.5 (0.3),  63.6 (0.4),   203.1 (1.3),   56,   80,   256,

  # -XDstringConcat=inline: 1,           10,           100,           1,    10,   100,
  BC_SB,                    44.6 (0.2),  63.7 (0.3),   203.8 (1.2),   56,   80,   256,
  BC_SB_SIZED,              44.6 (0.2),  63.8 (0.3),   203.8 (0.9),   56,   80,   256,
  BC_SB_SIZED_EXACT,        44.5 (0.3),  63.6 (0.3),   203.6 (1.3),   56,   80,   256,
  MH_SB_SIZED,              44.5 (0.2),  63.8 (0.3),   203.8 (0.8),   56,   80,   256,
  MH_INLINE_SIZED_EXACT,    44.6 (0.3),  63.7 (0.4),   203.9 (1.1),   56,   80,   256,

  # -XDstringConcat=indy:   1,           10,           100,           1,    10,   100,
  BC_SB,                    44.7 (0.3),  63.8 (0.4),   203.5 (1.2),   56,   80,   256,
  BC_SB_SIZED,              44.6 (0.3),  63.6 (0.3),   203.5 (1.1),   56,   80,   256,
  BC_SB_SIZED_EXACT,        63.7 (0.3),  82.8 (0.4),   223.6 (1.0),   80,   104,  280,
  MH_SB_SIZED,              89.6 (0.6),  121.2 (0.8),  401.8 (3.0),   112,  152,  504,
  MH_INLINE_SIZED_EXACT,    49.5 (0.4),  64.1 (0.4),   201.6 (3.8),   56,   80,   256,

  # -XDstringConcat=indyWithConstants:
                            1,           10,           100,           1,    10,   100,
  BC_SB,                    44.5 (0.2),  63.6 (0.3),   203.5 (0.8),   56,   80,   256,
  BC_SB_SIZED,              44.6 (0.3),  63.5 (0.3),   203.2 (0.9),   56,   80,   256,
  BC_SB_SIZED_EXACT,        63.7 (0.4),  82.9 (0.5),   223.0 (1.5),   80,   104,  280,
  MH_SB_SIZED,              89.7 (0.7),  121.5 (0.6),  400.6 (2.3),   112,  152,  504,
  MH_INLINE_SIZED_EXACT,    47.9 (0.5),  63.9 (0.4),   204.7 (1.6),   56,   80,   256,

Again, this case is very well catered for by OptimizeStringConcat. The reasoning is the same as in the string_string cases.
=== C1

                            time, ns/op                                alloc pressure, bytes/op
                            [size] -------->                           [size] -------->

  # JDK 9 baseline:         1,             10,           100,          1,    10,   100,
  *,                        106.9 (1.8),   189.2 (0.7),  788.4 (4.1),  128,  240,  1000,

  # -XDstringConcat=inline: 1,             10,           100,          1,    10,   100,
  BC_SB,                    107.7 (1.1),   190.3 (0.8),  790.6 (4.0),  128,  240,  1000,
  BC_SB_SIZED,              106.5 (1.0),   190.4 (1.3),  791.2 (3.9),  128,  240,  1000,
  BC_SB_SIZED_EXACT,        106.7 (2.3),   189.9 (1.0),  790.6 (5.3),  128,  240,  1000,
  MH_SB_SIZED,              106.8 (2.7),   189.9 (1.2),  791.1 (3.6),  128,  240,  1000,
  MH_INLINE_SIZED_EXACT,    105.0 (0.8),   190.3 (0.9),  788.6 (5.4),  128,  240,  1000,

  # -XDstringConcat=indy:   1,             10,           100,          1,    10,   100,
  BC_SB,                    105.9 (2.3),   190.0 (1.1),  789.6 (5.7),  128,  240,  1000,
  BC_SB_SIZED,              111.6 (1.7),   140.4 (0.9),  417.3 (2.5),  136,  176,  528,
  BC_SB_SIZED_EXACT,        134.4 (1.3),   166.2 (0.9),  444.2 (2.1),  160,  208,  560,
  MH_SB_SIZED,              368.2 (27.2),  361.2 (8.7),  422.8 (2.5),  136,  176,  528,
  MH_INLINE_SIZED_EXACT,    52.8 (0.3),    64.4 (0.6),   202.1 (1.3),  56,   80,   256,

  # -XDstringConcat=indyWithConstants:
                            1,             10,           100,          1,    10,   100,
  BC_SB,                    107.6 (0.9),   190.3 (1.0),  789.9 (5.2),  128,  240,  1000,
  BC_SB_SIZED,              110.5 (1.1),   140.1 (0.6),  417.8 (2.5),  136,  176,  528,
  BC_SB_SIZED_EXACT,        134.0 (1.3),   166.3 (0.8),  443.5 (2.9),  160,  208,  560,
  MH_SB_SIZED,              339.2 (1.9),   347.3 (1.8),  418.8 (3.6),  136,  176,  528,
  MH_INLINE_SIZED_EXACT,    51.2 (0.2),    64.4 (0.5),   203.0 (1.0),  56,   80,   256,

Ditto here. When OptimizeStringConcat does not work, we see immense improvements with the sized strategies.
== string_string_long

=== C2

                            time, ns/op                                alloc pressure, bytes/op
                            [size] -------->                           [size] -------->

  # JDK 9 baseline:         1,            10,           100,           1,    10,   100,
  *,                        102.9 (4.1),  191.3 (0.9),  799.2 (3.2),   128,  240,  1000,

  # -XDstringConcat=inline: 1,            10,           100,           1,    10,   100,
  BC_SB,                    102.1 (0.6),  191.7 (1.2),  802.1 (4.1),   128,  240,  1000,
  BC_SB_SIZED,              102.0 (0.5),  191.8 (1.5),  800.4 (5.9),   128,  240,  1000,
  BC_SB_SIZED_EXACT,        102.1 (0.6),  191.1 (1.2),  799.8 (4.0),   128,  240,  1000,
  MH_SB_SIZED,              102.1 (0.6),  191.5 (0.6),  798.0 (4.7),   128,  240,  1000,
  MH_INLINE_SIZED_EXACT,    102.1 (0.5),  191.4 (1.3),  801.1 (6.0),   128,  240,  1000,

  # -XDstringConcat=indy:   1,            10,           100,           1,    10,   100,
  BC_SB,                    102.1 (0.5),  192.4 (1.2),  801.4 (4.2),   128,  240,  1000,
  BC_SB_SIZED,              121.1 (0.6),  153.2 (0.7),  440.0 (2.7),   152,  192,  552,
  BC_SB_SIZED_EXACT,        63.8 (0.4),   82.6 (0.4),   222.9 (1.3),   80,   104,  280,
  MH_SB_SIZED,              120.8 (0.8),  153.1 (1.0),  440.5 (2.3),   152,  192,  552,
  MH_INLINE_SIZED_EXACT,    52.6 (0.5),   64.2 (0.4),   204.3 (1.2),   56,   80,   256,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,           1,    10,   100,
  BC_SB,                    102.1 (0.8),  191.3 (0.8),  800.5 (5.1),   128,  240,  1000,
  BC_SB_SIZED,              121.3 (0.7),  153.2 (0.9),  439.3 (2.6),   152,  192,  552,
  BC_SB_SIZED_EXACT,        63.8 (0.5),   83.1 (0.4),   223.0 (1.2),   80,   104,  280,
  MH_SB_SIZED,              119.6 (2.7),  153.1 (0.8),  440.3 (2.1),   152,  192,  552,
  MH_INLINE_SIZED_EXACT,    52.3 (0.9),   63.9 (0.3),   204.0 (1.0),   56,   80,   256,

OptimizeStringConcat does not work for all patterns, for example when there is a "long" argument. In these cases, the SIZED strategies start to win big time, because they guess the final storage capacity right. Some strategies are penalized on smaller strings, because precomputing the length carries an overhead that is not recouped by avoiding the "wrong" initial capacity guess of the non-EXACT cases.
=== C1

                            time, ns/op                                alloc pressure, bytes/op
                            [size] -------->                           [size] -------->

  # JDK 9 baseline:         1,            10,           100,           1,    10,   100,
  *,                        113.3 (2.4),  190.3 (1.5),  789.2 (4.1),   128,  240,  1000,

  # -XDstringConcat=inline: 1,            10,           100,           1,    10,   100,
  BC_SB,                    116.0 (0.9),  190.5 (1.2),  790.3 (3.5),   128,  240,  1000,
  BC_SB_SIZED,              112.3 (1.0),  190.0 (0.9),  790.3 (3.9),   128,  240,  1000,
  BC_SB_SIZED_EXACT,        116.3 (2.3),  190.7 (1.0),  791.5 (3.9),   128,  240,  1000,
  MH_SB_SIZED,              118.4 (0.9),  191.0 (1.3),  790.5 (3.1),   128,  240,  1000,
  MH_INLINE_SIZED_EXACT,    114.0 (3.2),  190.3 (1.4),  792.5 (4.6),   128,  240,  1000,

  # -XDstringConcat=indy:   1,            10,           100,           1,    10,   100,
  BC_SB,                    113.8 (3.0),  190.8 (0.9),  790.0 (4.4),   128,  240,  1000,
  BC_SB_SIZED,              123.3 (0.8),  153.2 (1.0),  437.8 (2.2),   152,  192,  552,
  BC_SB_SIZED_EXACT,        144.2 (2.0),  166.5 (0.7),  444.0 (3.1),   160,  208,  560,
  MH_SB_SIZED,              362.6 (8.6),  364.5 (1.7),  438.6 (2.3),   152,  192,  552,
  MH_INLINE_SIZED_EXACT,    62.5 (3.7),   66.0 (1.9),   202.3 (1.1),   56,   80,   256,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,           1,    10,   100,
  BC_SB,                    115.4 (3.9),  190.4 (1.1),  791.2 (3.8),   128,  240,  1000,
  BC_SB_SIZED,              123.9 (1.5),  153.2 (1.0),  436.6 (2.1),   152,  192,  552,
  BC_SB_SIZED_EXACT,        144.9 (1.4),  166.4 (0.9),  443.3 (2.8),   160,  208,  560,
  MH_SB_SIZED,              350.4 (1.5),  360.8 (3.9),  436.8 (3.2),   152,  192,  552,
  MH_INLINE_SIZED_EXACT,    63.8 (2.3),   66.8 (1.5),   202.4 (1.0),   56,   80,   256,

Pretty much the same behavior is seen with C1.
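MH_INLINE_SIZED_EXACT wins here because it can precompute the exact storage for a long before writing a single character. The private Long.stringSize is off-limits to ordinary code, but the underlying idea can be sketched with public APIs only (the method names below are ours, not the JDK's):

```java
public class ExactSizing {
    // Character length of a long's decimal representation, including the
    // '-' sign. Mirrors the idea behind the private Long.stringSize.
    static int stringSize(long v) {
        if (v == Long.MIN_VALUE) {
            return 20;  // "-9223372036854775808"; -v would overflow below
        }
        int size = (v < 0) ? 1 : 0;  // room for the sign
        long x = (v < 0) ? -v : v;
        do {
            size++;
            x /= 10;
        } while (x != 0);
        return size;
    }

    // Exactly-sized (String + long) concat: the StringBuilder never resizes.
    static String concat(String a, long b) {
        StringBuilder sb = new StringBuilder(a.length() + stringSize(b));
        return sb.append(a).append(b).toString();
    }

    public static void main(String[] args) {
        System.out.println(concat("value = ", -12345L));  // "value = -12345"
    }
}
```

The real strategy goes one step further and writes into its own char[] handed to a non-copying String constructor, which is exactly the part that needs private JDK APIs.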
== LogLine

The LogLine benchmark tries to make the case for a more realistic scenario of a logged statement, like this:

  @State(Scope.Benchmark)
  @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @Fork(3)
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public class LogLineBench {

      @Param({"1", "10", "100"})
      private int size;

      private String method;

      @Setup
      public void setup() {
          StringBuilder sbA = new StringBuilder(size);
          for (int i = 0; i < size; i++) {
              sbA.append("a");
          }
          method = sbA.toString();
      }

      @Benchmark
      @CompilerControl(CompilerControl.Mode.DONT_INLINE)
      public String work() throws Exception {
          return "[" + System.nanoTime() + "] " +
                 Thread.currentThread().getName() + ":" +
                 "Calling an application method \"" + method + "\" without fear and prejudice.";
      }
  }

=== C2

                            time, ns/op                                  alloc pressure, bytes/op
                            [size] -------->                             [size] -------->

  # JDK 9 baseline:         1,            10,           100,             1,    10,   100,
  *,                        708.7 (3.4),  720.0 (4.1),  1332.0 (6.5),    888,  904,  1680,

  # -XDstringConcat=inline: 1,            10,           100,             1,    10,   100,
  BC_SB,                    709.1 (3.5),  721.5 (4.0),  1338.1 (6.3),    888,  904,  1680,
  BC_SB_SIZED,              708.3 (3.7),  724.3 (3.1),  1339.0 (6.4),    888,  904,  1680,
  BC_SB_SIZED_EXACT,        710.7 (5.6),  723.0 (2.9),  1335.6 (5.0),    888,  904,  1680,
  MH_SB_SIZED,              709.4 (2.7),  725.7 (3.5),  1343.5 (13.9),   888,  912,  1688,
  MH_INLINE_SIZED_EXACT,    709.0 (4.0),  723.0 (4.9),  1334.5 (10.8),   888,  904,  1680,

  # -XDstringConcat=indy:   1,            10,           100,             1,    10,   100,
  BC_SB,                    708.7 (3.1),  722.9 (3.4),  1340.6 (6.4),    888,  904,  1680,
  BC_SB_SIZED,              447.3 (1.8),  472.0 (2.4),  762.1 (4.0),     560,  592,  960,
  BC_SB_SIZED_EXACT,        268.6 (1.5),  281.7 (1.4),  428.3 (1.8),     336,  352,  536,
  MH_SB_SIZED,              447.9 (1.5),  472.2 (3.2),  764.3 (6.4),     560,  592,  960,
  MH_INLINE_SIZED_EXACT,    229.2 (1.1),  242.3 (1.3),  381.9 (2.1),     288,  304,  488,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,             1,    10,   100,
  BC_SB,                    708.8 (4.1),  723.6 (8.0),  1341.3 (9.4),    888,  904,  1680,
  BC_SB_SIZED,              446.3 (2.2),  471.3 (1.9),  765.3 (4.6),     560,  592,  960,
  BC_SB_SIZED_EXACT,        268.2 (1.5),  280.3 (1.1),  428.3 (1.8),     336,  352,  536,
  MH_SB_SIZED,              447.7 (2.6),  473.4 (1.8),  770.2 (9.3),     560,  592,  968,
  MH_INLINE_SIZED_EXACT,    230.1 (1.3),  242.9 (1.2),  382.3 (3.3),     288,  304,  488,

The original line is rather heavy-weight, so the baseline is relatively slow, and allocates a lot. The SIZED strategies avoid some of the garbage, and thus improve performance a lot. E.g. MH_INLINE_SIZED_EXACT improves performance 3-4x!

=== C1

                            time, ns/op                                  alloc pressure, bytes/op
                            [size] -------->                             [size] -------->

  # JDK 9 baseline:         1,            10,           100,             1,    10,   100,
  *,                        722.6 (3.2),  732.7 (3.5),  1342.7 (8.7),    912,  928,  1704,

  # -XDstringConcat=inline: 1,            10,           100,             1,    10,   100,
  BC_SB,                    723.0 (4.6),  736.0 (3.8),  1347.5 (9.5),    912,  928,  1704,
  BC_SB_SIZED,              722.2 (4.7),  734.0 (4.2),  1346.7 (5.5),    912,  928,  1704,
  BC_SB_SIZED_EXACT,        721.5 (4.3),  733.6 (3.6),  1346.3 (8.1),    912,  928,  1704,
  MH_SB_SIZED,              723.3 (3.6),  733.1 (3.9),  1345.8 (8.9),    912,  928,  1704,
  MH_INLINE_SIZED_EXACT,    722.5 (5.0),  734.1 (3.5),  1347.0 (6.3),    912,  928,  1704,

  # -XDstringConcat=indy:   1,            10,           100,             1,    10,   100,
  BC_SB,                    722.7 (3.9),  733.7 (5.0),  1343.9 (10.1),   912,  928,  1704,
  BC_SB_SIZED,              463.2 (3.6),  486.4 (3.2),  773.7 (3.1),     584,  616,  984,
  BC_SB_SIZED_EXACT,        513.8 (3.1),  539.0 (2.6),  825.7 (3.8),     648,  680,  1048,
  MH_SB_SIZED,              798.5 (7.7),  810.1 (9.1),  881.6 (7.0),     584,  616,  984,
  MH_INLINE_SIZED_EXACT,    342.6 (6.4),  341.1 (9.0),  388.6 (2.3),     288,  304,  488,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,             1,    10,   100,
  BC_SB,                    722.6 (4.8),  736.7 (4.9),  1347.1 (6.0),    912,  928,  1704,
  BC_SB_SIZED,              465.8 (3.3),  489.7 (2.3),  776.5 (5.5),     584,  616,  984,
  BC_SB_SIZED_EXACT,        514.1 (3.5),  539.9 (4.4),  827.7 (3.6),     648,  680,  1048,
  MH_SB_SIZED,              687.8 (4.2),  702.5 (4.6),  781.1 (6.2),     584,  616,  984,
  MH_INLINE_SIZED_EXACT,    329.7 (3.8),  331.8 (6.4),  384.1 (2.2),     288,  304,  488,

These strategies operate well even in the absence of an advanced optimizing compiler.
This is also one of the cases where "indyWithConstants" wins over "indy" in the MH_SB_SIZED strategy -- large constant Strings are processed eagerly during linkage.

== Startup considerations

Obviously, moving the String concat to runtime may adversely impact startup time, since additional steps have to be performed right after the VM starts. Project Lambda had the same concerns, but there it turned out the overhead is actually quite small. A smoke test like the following can be used to contrast the startup costs:

  public class Concat {
      private static String a = "a";
      private static String b = "b";
      private static String c = null;
      private static int d = 42;

      public static void main(String... args) {
          System.out.println(doStuff());
      }

      public static String doStuff() {
          return a + b + c + d;
      }
  }

  $ for S in `seq 1 100`; do time java Concat; done 2>&1 | grep real | \
      sed -e "s/0m//g" -e "s/s//g" | \
      awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }'

The current implementation suffers from a related G1 performance issue, therefore we measure with -XX:+UseParallelGC. See:
  https://bugs.openjdk.java.net/browse/JDK-8136854

  -XDstringConcat=inline:
    *:                      34.1 +- 1.5 ms

  -XDstringConcat=indy:
    BC_SB:                  58.5 +- 2.0 ms
    BC_SB_SIZED:            59.5 +- 2.0 ms
    BC_SB_SIZED_EXACT:      59.0 +- 2.0 ms
    MH_SB_SIZED:            95.1 +- 2.5 ms
    MH_INLINE_SIZED_EXACT:  90.1 +- 2.5 ms

  -XDstringConcat=indyWithConstants:
    BC_SB:                  57.5 +- 2.0 ms
    BC_SB_SIZED:            58.5 +- 2.0 ms
    BC_SB_SIZED_EXACT:      58.5 +- 2.0 ms
    MH_SB_SIZED:            91.1 +- 2.5 ms
    MH_INLINE_SIZED_EXACT:  87.5 +- 2.5 ms

All indy strategies regress the startup time. However, we know that the largest issue with startup time is the initialization of the java.lang.invoke infrastructure itself, see:
  https://bugs.openjdk.java.net/browse/JDK-8086045

To show that in our test, we may add a single line to the test that also uses java.lang.invoke and implicitly initializes it.
Any code with lambdas will do, e.g.:

  private static Runnable r = () -> System.exit(0);

Then the performance looks like this:

  -XDstringConcat=inline:
    *:                      60.0 +- 2.0 ms

  -XDstringConcat=indy:
    BC_SB:                  64.1 +- 2.0 ms
    BC_SB_SIZED:            64.0 +- 2.0 ms
    BC_SB_SIZED_EXACT:      63.5 +- 2.0 ms
    MH_SB_SIZED:            94.5 +- 2.0 ms
    MH_INLINE_SIZED_EXACT:  91.5 +- 2.0 ms

  -XDstringConcat=indyWithConstants:
    BC_SB:                  62.5 +- 2.0 ms
    BC_SB_SIZED:            63.8 +- 2.0 ms
    BC_SB_SIZED_EXACT:      63.3 +- 2.0 ms
    MH_SB_SIZED:            92.9 +- 2.0 ms
    MH_INLINE_SIZED_EXACT:  90.5 +- 2.0 ms

In other words, the BC_* strategies cost an additional ~5 ms, and the MH_* strategies an additional ~30 ms. This cost is paid once, and should vanish as soon as the BSM methods warm up.

== Footprint considerations

The very first experiment one can do is to compile the entire JDK with and without indy String concat. If we then compare the raw .class files, we can see a slight regression in bytecode size:

                                       raw .class files, bytes    JDK image, Kb
  -XDstringConcat=inline                       132,227,587            334,676
  -XDstringConcat=indy                         133,378,513            335,796
  -XDstringConcat=indyWithConstants            133,211,067            335,640

== Interaction with Compact Strings

As noted in the JEP, this change has the potential to collide with other String-related improvements. The improvement of special interest for us is Compact Strings (http://openjdk.java.net/jeps/254), as it changes the internal representation of String instances, and may affect the public API performance. A simple experiment is to mix the current prototypes and run the same set of benchmarks against the combination. We choose LogLineBench from above as the more realistic scenario. Since most strategies use only the public API, no further changes were needed. The only exception was MH_INLINE_SIZED_EXACT, which assumes a particular String shape -- there we had to hack in a crude implementation that assumes all Strings are Latin1 (indeed they are in our test).
                            time, ns/op                                  alloc pressure, bytes/op
                            [size] -------->                             [size] -------->

*** Indify String Concat (baseline, -XDstringConcat=inline)

                            1,            10,            100,            1,    10,   100,
  BC_SB,                    709.1 (3.5),  721.5 (4.0),   1338.1 (6.3),   888,  904,  1680,
  BC_SB_SIZED,              708.3 (3.7),  724.3 (3.1),   1339.0 (6.4),   888,  904,  1680,
  BC_SB_SIZED_EXACT,        710.7 (5.6),  723.0 (2.9),   1335.6 (5.0),   888,  904,  1680,
  MH_SB_SIZED,              709.4 (2.7),  730.7 (10.5),  1343.5 (13.9),  888,  912,  1688,
  MH_INLINE_SIZED_EXACT,    709.0 (4.0),  723.0 (4.9),   1334.5 (10.8),  888,  904,  1680,

*** Compact Strings (baseline, -XDstringConcat=inline)

                            1,            10,            100,            1,    10,   100,
  BC_SB,                    403.9 (2.2),  410.0 (3.0),   721.7 (3.9),    504,  512,  904,
  BC_SB_SIZED,              402.8 (1.8),  409.0 (1.3),   722.7 (6.2),    504,  512,  904,
  BC_SB_SIZED_EXACT,        403.4 (1.9),  408.8 (2.3),   722.6 (3.9),    504,  512,  904,
  MH_SB_SIZED,              402.7 (1.9),  410.5 (9.0),   724.5 (3.5),    504,  512,  904,
  MH_INLINE_SIZED_EXACT,    403.0 (2.1),  410.8 (4.1),   721.5 (3.5),    504,  512,  904,

*** Indify String Concat (optimized, -XDstringConcat=indyWithConstants)

                            1,            10,            100,            1,    10,   100,
  BC_SB,                    709.7 (4.2),  716.4 (10.5),  1338.0 (12.1),  888,  904,  1680,
  BC_SB_SIZED,              447.7 (2.9),  478.9 (9.5),   763.9 (5.2),    560,  600,  960,
  BC_SB_SIZED_EXACT,        268.7 (1.6),  281.3 (1.1),   427.8 (2.5),    336,  352,  536,
  MH_SB_SIZED,              446.1 (2.6),  471.6 (2.5),   763.2 (4.8),    560,  592,  960,
  MH_INLINE_SIZED_EXACT,    229.1 (1.1),  241.2 (1.1),   383.9 (2.5),    288,  304,  488,

*** Compact Strings + Indify String Concat (optimized, -XDstringConcat=indyWithConstants)

                            1,            10,            100,            1,    10,   100,
  BC_SB,                    404.1 (4.8),  410.0 (2.3),   721.0 (3.1),    504,  512,  904,
  BC_SB_SIZED,              248.8 (1.4),  262.5 (1.1),   406.9 (2.0),    312,  328,  512,
  BC_SB_SIZED_EXACT,        275.1 (1.8),  288.2 (2.1),   428.2 (2.0),    344,  360,  536,
  MH_SB_SIZED,              249.3 (1.2),  261.4 (1.3),   408.6 (2.2),    312,  328,  512,
  MH_INLINE_SIZED_EXACT,    187.9 (2.2),  188.9 (3.2),   210.3 (1.2),    168,  176,  264,

One can clearly spot that both changes alone improve performance significantly.
Optimized strategies in Indify String Concat try to avoid the excess work, while Compact Strings tries to improve footprint without sacrificing (much of) the performance, and avoids excess allocation. But both features together push the envelope even further. Take the 100-char case as an example. In the well-tested BC_SB_SIZED strategy, the improvement is 1338 ns/op -> 408 ns/op (3.25x), and the allocation pressure is down from 1680 bytes/op to 512 bytes/op (3.28x)! The hacky and special-cased MH_INLINE_SIZED_EXACT provides even more: >6.36x more throughput, and 6.36x less allocation pressure! In fact, 264 bytes seems to be the space taken by the resulting String itself, meaning that in this mode the concat is completely garbage-free.

Also notice how the "good" BC_SB_SIZED_EXACT strategy regressed with the Compact Strings change. This is due to OptoStringConcat in Compact Strings producing more branchy code that chooses the control path based on String.coder values. Since the OptoStringConcat code shape is introduced in C2, no profile information can be gathered for it, and we end up with a lot of cold code in the "hot" instruction stream. In the indy String concat case, we can generate the bytecode for concatenation early, and so it enjoys the usual profiling mechanics that ordinary Java code does.

== Conclusion

There are four specific observations to be made:

 1) Moving the generated bytecode sequence from javac to the indy translation strategy does not incur performance hits, as we can see with the BC_SB tests. This validates the idea that the move from a javac-compiled bytecode sequence to a runtime-generated one is sound, at least on the performance front.

 2) Advanced strategies can improve performance when the JIT compiler is not able to optimize the append chains. We can even devise strategies that skip the StringBuilder APIs and populate the char[]/byte[] array directly, and these strategies can provide immense performance and allocation pressure improvements.
 3) This feature plays well with other String-related optimizations, notably Compact Strings. The ability to fit the concat implementation to a particular String shape seems to be a huge win.

 4) Bytecode footprint costs manifest as a static overhead of storing MethodTypes and other java.lang.invoke.* infra (around 10 new CP entries), plus a dynamic overhead (a few CP entries) for each new String concat shape.

--------

=== Cramming implicit concats

(Note: this section is not complete, and measures only the "indy" generator, not "indyWithConstants")

A more thorough experiment explores the footprint at the edge of the class file limits. There is a concern that Java programs which already consume most of the class file will fail to compile with the new indy code shape. In order to quantify these costs, we may conduct a simple experiment where we implicitly concatenate multiple arguments. In hindsight, it makes sense to test String concats and int concats, with N arguments, in four scenarios:

 1) N chains, 1 argument each, distinct type in each:
      String s1 = "" + (Type1)t1;
      ...
      String sN = "" + (TypeN)tN;

 2) N chains, 1 argument each, same type:
      String s1 = "" + (String)s1;
      ...
      String sN = "" + (String)s1;

 3) 1 chain, N arguments:
      String s1 = "" + s1 + s2 + ... + sN;

 4) N chains, [1..N] arguments:
      String s1 = "" + s1;
      String s2 = "" + s1 + s2;
      ...
      String sN = "" + s1 + s2 + ... + sN;

These scenarios cover multiple method argument shapes, as well as provide the corner cases of many small concats, one large concat, etc. The results are as follows:
  http://cr.openjdk.java.net/~shade/8085796/class-size.png

It is important to remember that the actual limits in class files are the number of bytecode instructions in one method, and the number of constant pool entries (but not directly the constant pool size).
In the data below, we can see this:

 *) 1 chain, N arguments: the constant pool size is the same entry-wise, and the class size is growing, since we have more bytecode instructions to cram into the method. The sawtooth pattern for the indy tests is due to the cutoff in indy arguments: when our prototype faces an indy call with more than 253 arguments, it splits off another indy call with the rest of the arguments. All those "bulk" 253-sized calls reference the same constant pool entry. The remainder call still has a linearly growing MethodType, until it reaches 253 arguments and folds back. The cycle repeats with the next remainder. In this test, legacy String concat blows the method size limit first.

 *) N chains, [1..N] arguments: here, the indy tests start to cram in more CP entries, since every concat shape requires a MethodType constant and a new method signature. Legacy concat uses the same StringBuilder::append call, which reuses the same CP entry over and over again. The class file size grows quadratically in the indy cases, since we shove in larger and larger MethodTypes. In this test, legacy String concat blows the method size limit first.

 *) N chains, 1 argument each, distinct type each: indy String concat pays for additional CP entries storing the MethodType and method signature, in addition to storing the distinct type. While it seems as if indy String concat would blow the CP entry limit, it is actually legacy String concat that blows the method size limit first.

 *) N chains, 1 argument each, same type: since the code shape is not changing, indy String concat needs only a static number of additional constant pool entries to keep the java.lang.invoke references. You can also see that indy String concat produces a more compact method. As in the other tests, legacy String concat blows the method size limit first.