== String Concatenation via invokedynamic

May-November 2015
aleksey.shipilev@oracle.com
@shipilev

See the discussion on rationale, alternatives, and caveats in the JEP:
  https://bugs.openjdk.java.net/browse/JDK-8085796

= Design overview

== javac changes

With the current change, javac is able to generate three flavors of String concat in the bytecode. All three flavors involve the contained change in javac, which recognizes the String concat tree in the AST, and emits either the StringBuilder append chain, or the invokedynamic call into the JDK. The differences between the flavors may be seen with this simple example:

  String s = "[" + System.nanoTime() + "] " +
             Thread.currentThread().getName() + ":" +
             "Calling an application method \"" + method + "\" without fear and prejudice.";

*** -XDstringConcat=inline

This is the legacy scheme: javac matches the AST looking for String concatenations, and emits the StringBuilder append chains:

  public java.lang.String work() throws java.lang.Exception;
    descriptor: ()Ljava/lang/String;
    flags: ACC_PUBLIC
    Code:
      stack=3, locals=1, args_size=1
         0: new           #2  // class java/lang/StringBuilder
         3: dup
         4: invokespecial #9  // Method java/lang/StringBuilder."<init>":()V
         7: ldc           #10 // String [
         9: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        12: invokestatic  #11 // Method java/lang/System.nanoTime:()J
        15: invokevirtual #12 // Method java/lang/StringBuilder.append:(J)Ljava/lang/StringBuilder;
        18: ldc           #13 // String ]
        20: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        23: invokestatic  #14 // Method java/lang/Thread.currentThread:()Ljava/lang/Thread;
        26: invokevirtual #15 // Method java/lang/Thread.getName:()Ljava/lang/String;
        29: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        32: ldc           #16 // String :
        34: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        37: ldc           #17 // String Calling an application method \"
        39: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        42: aload_0
        43: getfield      #8  // Field method:Ljava/lang/String;
        46: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        49: ldc           #18 // String \" without fear and prejudice.
        51: invokevirtual #6  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        54: invokevirtual #7  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
        57: areturn

*** -XDstringConcat=indy

This is the new bytecode flavor suggested by the JEP. Here, javac also collects all String concat arguments, but instead of lowering the concat into StringBuilder chains, it emits an invokedynamic call to the well-known StringConcatFactory. The dynamic argument list depends on the number of concatenated arguments and their types. The BSM can then emit a specialized concat method to do the actual work.

  public java.lang.String work() throws java.lang.Exception;
    descriptor: ()Ljava/lang/String;
    flags: ACC_PUBLIC
    Code:
      stack=9, locals=1, args_size=1
         0: ldc           #9  // String [
         2: invokestatic  #10 // Method java/lang/System.nanoTime:()J
         5: ldc           #11 // String ]
         7: invokestatic  #12 // Method java/lang/Thread.currentThread:()Ljava/lang/Thread;
        10: invokevirtual #13 // Method java/lang/Thread.getName:()Ljava/lang/String;
        13: ldc           #14 // String :
        15: ldc           #15 // String Calling an application method \"
        17: aload_0
        18: getfield      #8  // Field method:Ljava/lang/String;
        21: ldc           #16 // String \" without fear and prejudice.
        23: invokedynamic #17,  0 // InvokeDynamic #0:makeConcat:(Ljava/lang/String;JLjava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
        28: areturn

  BootstrapMethods:
    0: #94 invokestatic java/lang/invoke/StringConcatFactory.makeConcat:(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/invoke/CallSite;
      Method arguments:
        #95

*** -XDstringConcat=indyWithConstants

This bytecode flavor is an elaboration of the "indy" flavor, which also recognizes that known constants may be passed to the BSM directly, without adding them as dynamic arguments:

  public java.lang.String work() throws java.lang.Exception;
    descriptor: ()Ljava/lang/String;
    flags: ACC_PUBLIC
    Code:
      stack=4, locals=1, args_size=1
         0: invokestatic  #9  // Method java/lang/System.nanoTime:()J
         3: invokestatic  #10 // Method java/lang/Thread.currentThread:()Ljava/lang/Thread;
         6: invokevirtual #11 // Method java/lang/Thread.getName:()Ljava/lang/String;
         9: aload_0
        10: getfield      #8  // Field method:Ljava/lang/String;
        13: invokedynamic #12,  0 // InvokeDynamic #0:makeConcatWithConstants:(JLjava/lang/String;Ljava/lang/String;)Ljava/lang/String;
        18: areturn

  BootstrapMethods:
    0: #85 invokestatic java/lang/invoke/StringConcatFactory.makeConcatWithConstants:(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;Ljava/lang/String;[Ljava/lang/String;)Ljava/lang/invoke/CallSite;
      Method arguments:
        #86 [\u0001] \u0001:Calling an application method \"\u0001\" without fear and prejudice.

The first bootstrap method argument is the "recipe", where each character in the recipe means:

 * \1 (Unicode code point 0001): an ordinary argument. This input is passed
   through a dynamic argument, and is provided during the concatenation method
   invocation. This input can be null.

 * \2 (Unicode code point 0002): a constant. This input is passed through a
   static bootstrap argument.
   This constant can be any value representable in the constant pool. If
   necessary, the factory would call {@code String.valueOf} to perform a
   one-time String conversion.

 * Any other char value: a single character constant.

We also drop empty strings from the recipe, because they carry no meaning at this level. This captures the Java language trick of forcing a String concat with, e.g., a ("" + int)-like expression. Down here in javac, we already know we are in the String concat business, and do not require these markers.

== JDK Changes

After javac has compiled the String concat into the indy call, we need to provide a bootstrap method (BSM) in the JDK to deal with it. See more details on the invokedynamic lifecycle in the Javadoc:
  https://docs.oracle.com/javase/8/docs/api/java/lang/invoke/package-summary.html

The goal for the BSM is to provide a MethodHandle that points to the actual concatenation code. The power of the proposed JEP comes from the opportunity to adapt the concatenation code to the actual argument list. Our prototype uses five potential strategies:

 * BC_SB, or "bytecode StringBuilder":
   This strategy spins up the bytecode that has the same StringBuilder chain javac would otherwise emit. This strategy uses only the public API, and comes as the baseline for the current JDK behavior. In other words, this strategy, in essence, moves the javac-generated bytecode to runtime. The generated bytecode is loaded via Unsafe.defineAnonymousClass, but with the caller class coming from the BSM -- in other words, the protection guarantees are inherited from the method where invokedynamic was originally called. This means, among other things, that the bytecode is verified for all non-JDK uses.

 * BC_SB_SIZED, or "bytecode StringBuilder, but sized":
   This strategy acts similarly to BC_SB, but it also tries to guess the capacity required for the StringBuilder to accept all arguments without resizing. This strategy only makes an educated guess: it only guesses the space required for known types (e.g. primitives and Strings), but does not otherwise convert arguments. Therefore, the capacity estimate may be wrong, and the StringBuilder may have to transparently resize or trim when doing the actual concatenation. While this is not a correctness issue (in the end, that is what BC_SB has to do anyway), it does pose a potential performance problem.

 * BC_SB_SIZED_EXACT, or "bytecode StringBuilder, but sized exactly":
   This strategy improves on BC_SB_SIZED by first converting all arguments to String in order to get the exact capacity the StringBuilder should have. The conversion is done via the public String.valueOf and/or Object.toString methods, and does not touch any private String API.

 * MH_SB_SIZED, or "MethodHandles StringBuilder, sized":
   (initial code contributed by Remi Forax) This strategy avoids spinning up bytecode by building the computation on MethodHandle combinators. The computation is built with public MethodHandle APIs, resolved from a public Lookup sequence, and ends up calling the public StringBuilder API. Therefore, this strategy does not use any private API at all, not even Unsafe.defineAnonymousClass, since everything is handled under the covers by the java.lang.invoke APIs.

 * MH_INLINE_SIZED_EXACT, or "MethodHandles inline, sized exactly":
   (initial code contributed by Peter Levart) This strategy replicates what StringBuilder does: it builds the char[] array on its own and passes it to a String constructor. This strategy requires access to some private JDK APIs, most notably the read-only Integer/Long.stringSize methods that measure the character length of integers, and the private String constructor that accepts char[] arrays without copying. While this strategy assumes particular implementation details of String, it opens the door for building a very optimal concatenation sequence. This is the only strategy that requires porting if private JDK changes occur.
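To see the BSM contract in action, a concat call site can also be linked by hand against the same factory. A minimal sketch, using the StringConcatFactory API as it shipped in JDK 9; the recipe and argument types here are our own example, not something javac emitted:

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.StringConcatFactory;

public class ManualLinkage {
    public static void main(String[] args) throws Throwable {
        // Link the call site by hand, as if javac had emitted:
        //   invokedynamic makeConcatWithConstants:(JLjava/lang/String;)Ljava/lang/String;
        // with the recipe "[\1] \1": constant "[", dynamic long, constant "] ", dynamic String.
        CallSite cs = StringConcatFactory.makeConcatWithConstants(
                MethodHandles.lookup(),
                "makeConcatWithConstants",
                MethodType.methodType(String.class, long.class, String.class),
                "[\u0001] \u0001");

        // The factory hands back the concatenation code as a MethodHandle.
        MethodHandle mh = cs.dynamicInvoker();
        String s = (String) mh.invokeExact(42L, "main");
        System.out.println(s);  // "[42] main"
    }
}
```

Which strategy backs the returned MethodHandle is an implementation detail of the factory; the caller only sees the (long, String) -> String handle.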
= Experiments

Current targeted benchmark set:

 * string_string benchmarks serve the simplest two-string case
 * string_string_int is a more complicated test, routinely treated by OptoStringConcat
 * string_string_long provides a similar test, but it is not optimized by OptoStringConcat

Also, since we are interested in how the concat behaves in the absence of aggressively-optimizing compilers, we run the tests with -XX:TieredStopAtLevel=1 to push the code through C1 only.

For convenience, this is the benchmark code:

  @State(Scope.Benchmark)
  @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @Fork(3)
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public class ConcatBench {
      private static final String a = "CONST";
      private String b;
      private int c;
      private long d;

      @Param({"1", "10", "100"})
      private int size;

      @Setup
      public void setup() {
          StringBuilder sbB = new StringBuilder(size);
          for (int i = 0; i < size; i++) {
              sbB.append("b");
          }
          b = sbB.toString();
          c = 42;
          d = 42;
      }

      @Benchmark
      public String string_string() {
          return a + b;
      }

      @Benchmark
      public String string_string_int() {
          return a + b + c;
      }

      @Benchmark
      public String string_string_long() {
          return a + b + d;
      }
  }

== string_string

=== C2

                            time, ns/op                               alloc pressure, bytes/op
                            [size] -------->                          [size] -------->

  # JDK 9 baseline:         1,           10,          100,            1,    10,   100,
  *,                        44.4 (0.2),  57.3 (0.3),  202.6 (1.1),    56,   72,   256,

  # -XDstringConcat=inline: 1,           10,          100,            1,    10,   100,
  BC_SB,                    44.6 (0.2),  57.3 (0.3),  202.9 (1.0),    56,   72,   256,
  BC_SB_SIZED,              44.7 (0.2),  57.4 (0.3),  203.6 (0.9),    56,   72,   256,
  BC_SB_SIZED_EXACT,        44.7 (0.3),  57.5 (0.4),  203.1 (0.9),    56,   72,   256,
  MH_SB_SIZED,              44.7 (0.2),  57.4 (0.3),  203.2 (1.1),    56,   72,   256,
  MH_INLINE_SIZED_EXACT,    44.7 (0.2),  57.5 (0.4),  203.4 (0.8),    56,   72,   256,

  # -XDstringConcat=indy:   1,           10,          100,            1,    10,   100,
  BC_SB,                    44.7 (0.2),  57.2 (0.2),  203.2 (0.9),    56,   72,   256,
  BC_SB_SIZED,              44.6 (0.2),  57.2 (0.4),  203.1 (1.1),    56,   72,   256,
  BC_SB_SIZED_EXACT,        44.6 (0.3),  57.4 (0.3),  203.1 (0.8),    56,   72,   256,
  MH_SB_SIZED,              70.2 (0.6),  95.6 (0.4),  388.2 (2.3),    88,   120,  488,
  MH_INLINE_SIZED_EXACT,    44.6 (0.3),  57.4 (0.2),  203.7 (1.2),    56,   72,   256,

  # -XDstringConcat=indyWithConstants:
                            1,           10,          100,            1,    10,   100,
  BC_SB,                    44.6 (0.2),  57.4 (0.4),  203.2 (1.3),    56,   72,   256,
  BC_SB_SIZED,              44.5 (0.2),  57.2 (0.4),  202.7 (0.9),    56,   72,   256,
  BC_SB_SIZED_EXACT,        44.6 (0.2),  57.3 (0.3),  203.2 (1.4),    56,   72,   256,
  MH_SB_SIZED,              70.0 (0.5),  95.7 (0.4),  386.7 (1.6),    88,   120,  488,
  MH_INLINE_SIZED_EXACT,    44.6 (0.3),  57.3 (0.4),  203.8 (0.8),    56,   72,   256,

When OptimizeStringConcat is operational, String concatenation is already heavily optimized, and most strategies show neither regressions nor improvements. MH_SB_SIZED fails to adhere to the bytecode shape expected by OptimizeStringConcat, and is thus penalized. MH_INLINE_SIZED_EXACT does the inline copy, so OptimizeStringConcat is not required there. There is also no difference between the "indy" and "indyWithConstants" flavors, since the compiler is able to aggressively inline, and see through the constness of parameters.
=== C1

                            time, ns/op                               alloc pressure, bytes/op
                            [size] -------->                          [size] -------->

  # JDK 9 baseline:         1,            10,           100,          1,    10,   100,
  *,                        101.3 (0.4),  114.1 (0.8),  441.9 (2.4),  128,  144,  560,

  # -XDstringConcat=inline: 1,            10,           100,          1,    10,   100,
  BC_SB,                    102.1 (0.5),  114.2 (0.6),  442.0 (2.3),  128,  144,  560,
  BC_SB_SIZED,              101.8 (0.6),  113.9 (0.5),  442.9 (2.9),  128,  144,  560,
  BC_SB_SIZED_EXACT,        101.9 (0.4),  114.4 (0.6),  443.1 (2.4),  128,  144,  560,
  MH_SB_SIZED,              101.9 (0.7),  114.5 (0.5),  442.7 (3.2),  128,  144,  560,
  MH_INLINE_SIZED_EXACT,    101.7 (0.4),  114.6 (0.5),  442.1 (1.7),  128,  144,  560,

  # -XDstringConcat=indy:   1,            10,           100,          1,    10,   100,
  BC_SB,                    101.7 (0.6),  114.3 (0.5),  442.2 (2.3),  128,  144,  560,
  BC_SB_SIZED,              89.7 (0.5),   114.2 (0.5),  404.3 (2.2),  112,  144,  512,
  BC_SB_SIZED_EXACT,        89.4 (0.2),   114.2 (0.5),  405.7 (3.4),  112,  144,  512,
  MH_SB_SIZED,              324.7 (8.2),  325.2 (2.4),  406.6 (1.5),  112,  144,  512,
  MH_INLINE_SIZED_EXACT,    44.5 (0.3),   57.0 (0.4),   202.1 (1.1),  56,   72,   256,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,          1,    10,   100,
  BC_SB,                    101.7 (0.4),  114.5 (0.6),  441.7 (2.4),  128,  144,  560,
  BC_SB_SIZED,              89.7 (0.4),   114.8 (0.6),  405.0 (2.0),  112,  144,  512,
  BC_SB_SIZED_EXACT,        89.5 (0.4),   114.4 (0.7),  405.5 (2.2),  112,  144,  512,
  MH_SB_SIZED,              307.2 (3.2),  317.0 (1.7),  400.8 (3.6),  112,  144,  512,
  MH_INLINE_SIZED_EXACT,    44.5 (0.3),   57.0 (0.3),   202.2 (1.0),  56,   72,   256,

When OptimizeStringConcat is not working (as with C1), we get substantial improvements in almost all strategies, except for MH_SB_SIZED on smaller strings. The allocation pressure is also down a lot. Again, "indyWithConstants" does not yield any substantial benefit over "indy".
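The MH_SB_SIZED numbers above come from a computation built entirely out of MethodHandle combinators instead of generated bytecode. A minimal sketch of that style (our own illustration, not the actual JDK code), composing a two-argument StringBuilder chain:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CombinatorConcat {
    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup l = MethodHandles.lookup();

        // (int)StringBuilder            -- new StringBuilder(capacity)
        MethodHandle newSB = l.findConstructor(StringBuilder.class,
                MethodType.methodType(void.class, int.class));
        // (StringBuilder, String)StringBuilder -- sb.append(s)
        MethodHandle append = l.findVirtual(StringBuilder.class, "append",
                MethodType.methodType(StringBuilder.class, String.class));
        // (StringBuilder)String         -- sb.toString()
        MethodHandle toStr = l.findVirtual(StringBuilder.class, "toString",
                MethodType.methodType(String.class));

        // (int, String)StringBuilder: new StringBuilder(cap).append(a)
        MethodHandle chain = MethodHandles.collectArguments(append, 0, newSB);
        // (int, String, String)StringBuilder: ...append(a).append(b)
        chain = MethodHandles.collectArguments(append, 0, chain);
        // (int, String, String)String: ...toString()
        chain = MethodHandles.filterReturnValue(chain, toStr);
        // Bind a guessed capacity, as a SIZED strategy would: (String, String)String
        MethodHandle concat = MethodHandles.insertArguments(chain, 0, 16);

        String s = (String) concat.invokeExact("foo", "bar");
        System.out.println(s);  // "foobar"
    }
}
```

Every argument added to the concat deepens this combinator tree, which is one intuition for why this strategy lags behind the bytecode-spinning ones until the JIT fully inlines through the handles.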
== string_string_int

=== C2

                            time, ns/op                               alloc pressure, bytes/op
                            [size] -------->                          [size] -------->

  # JDK 9 baseline:         1,           10,           100,           1,    10,   100,
  *,                        44.5 (0.3),  63.6 (0.4),   203.1 (1.3),   56,   80,   256,

  # -XDstringConcat=inline: 1,           10,           100,           1,    10,   100,
  BC_SB,                    44.6 (0.2),  63.7 (0.3),   203.8 (1.2),   56,   80,   256,
  BC_SB_SIZED,              44.6 (0.2),  63.8 (0.3),   203.8 (0.9),   56,   80,   256,
  BC_SB_SIZED_EXACT,        44.5 (0.3),  63.6 (0.3),   203.6 (1.3),   56,   80,   256,
  MH_SB_SIZED,              44.5 (0.2),  63.8 (0.3),   203.8 (0.8),   56,   80,   256,
  MH_INLINE_SIZED_EXACT,    44.6 (0.3),  63.7 (0.4),   203.9 (1.1),   56,   80,   256,

  # -XDstringConcat=indy:   1,           10,           100,           1,    10,   100,
  BC_SB,                    44.7 (0.3),  63.8 (0.4),   203.5 (1.2),   56,   80,   256,
  BC_SB_SIZED,              44.6 (0.3),  63.6 (0.3),   203.5 (1.1),   56,   80,   256,
  BC_SB_SIZED_EXACT,        63.7 (0.3),  82.8 (0.4),   223.6 (1.0),   80,   104,  280,
  MH_SB_SIZED,              89.6 (0.6),  121.2 (0.8),  401.8 (3.0),   112,  152,  504,
  MH_INLINE_SIZED_EXACT,    49.5 (0.4),  64.1 (0.4),   201.6 (3.8),   56,   80,   256,

  # -XDstringConcat=indyWithConstants:
                            1,           10,           100,           1,    10,   100,
  BC_SB,                    44.5 (0.2),  63.6 (0.3),   203.5 (0.8),   56,   80,   256,
  BC_SB_SIZED,              44.6 (0.3),  63.5 (0.3),   203.2 (0.9),   56,   80,   256,
  BC_SB_SIZED_EXACT,        63.7 (0.4),  82.9 (0.5),   223.0 (1.5),   80,   104,  280,
  MH_SB_SIZED,              89.7 (0.7),  121.5 (0.6),  400.6 (2.3),   112,  152,  504,
  MH_INLINE_SIZED_EXACT,    47.9 (0.5),  63.9 (0.4),   204.7 (1.6),   56,   80,   256,

Again, this case is very well catered for by OptimizeStringConcat. The reasoning is the same as in the string_string cases.
=== C1

                            time, ns/op                                alloc pressure, bytes/op
                            [size] -------->                           [size] -------->

  # JDK 9 baseline:         1,             10,           100,          1,    10,   100,
  *,                        106.9 (1.8),   189.2 (0.7),  788.4 (4.1),  128,  240,  1000,

  # -XDstringConcat=inline: 1,             10,           100,          1,    10,   100,
  BC_SB,                    107.7 (1.1),   190.3 (0.8),  790.6 (4.0),  128,  240,  1000,
  BC_SB_SIZED,              106.5 (1.0),   190.4 (1.3),  791.2 (3.9),  128,  240,  1000,
  BC_SB_SIZED_EXACT,        106.7 (2.3),   189.9 (1.0),  790.6 (5.3),  128,  240,  1000,
  MH_SB_SIZED,              106.8 (2.7),   189.9 (1.2),  791.1 (3.6),  128,  240,  1000,
  MH_INLINE_SIZED_EXACT,    105.0 (0.8),   190.3 (0.9),  788.6 (5.4),  128,  240,  1000,

  # -XDstringConcat=indy:   1,             10,           100,          1,    10,   100,
  BC_SB,                    105.9 (2.3),   190.0 (1.1),  789.6 (5.7),  128,  240,  1000,
  BC_SB_SIZED,              111.6 (1.7),   140.4 (0.9),  417.3 (2.5),  136,  176,  528,
  BC_SB_SIZED_EXACT,        134.4 (1.3),   166.2 (0.9),  444.2 (2.1),  160,  208,  560,
  MH_SB_SIZED,              368.2 (27.2),  361.2 (8.7),  422.8 (2.5),  136,  176,  528,
  MH_INLINE_SIZED_EXACT,    52.8 (0.3),    64.4 (0.6),   202.1 (1.3),  56,   80,   256,

  # -XDstringConcat=indyWithConstants:
                            1,             10,           100,          1,    10,   100,
  BC_SB,                    107.6 (0.9),   190.3 (1.0),  789.9 (5.2),  128,  240,  1000,
  BC_SB_SIZED,              110.5 (1.1),   140.1 (0.6),  417.8 (2.5),  136,  176,  528,
  BC_SB_SIZED_EXACT,        134.0 (1.3),   166.3 (0.8),  443.5 (2.9),  160,  208,  560,
  MH_SB_SIZED,              339.2 (1.9),   347.3 (1.8),  418.8 (3.6),  136,  176,  528,
  MH_INLINE_SIZED_EXACT,    51.2 (0.2),    64.4 (0.5),   203.0 (1.0),  56,   80,   256,

Ditto here. When OptimizeStringConcat does not work, we see immense improvements with the sized strategies.
== string_string_long

=== C2

                            time, ns/op                                alloc pressure, bytes/op
                            [size] -------->                           [size] -------->

  # JDK 9 baseline:         1,            10,           100,           1,    10,   100,
  *,                        102.9 (4.1),  191.3 (0.9),  799.2 (3.2),   128,  240,  1000,

  # -XDstringConcat=inline: 1,            10,           100,           1,    10,   100,
  BC_SB,                    102.1 (0.6),  191.7 (1.2),  802.1 (4.1),   128,  240,  1000,
  BC_SB_SIZED,              102.0 (0.5),  191.8 (1.5),  800.4 (5.9),   128,  240,  1000,
  BC_SB_SIZED_EXACT,        102.1 (0.6),  191.1 (1.2),  799.8 (4.0),   128,  240,  1000,
  MH_SB_SIZED,              102.1 (0.6),  191.5 (0.6),  798.0 (4.7),   128,  240,  1000,
  MH_INLINE_SIZED_EXACT,    102.1 (0.5),  191.4 (1.3),  801.1 (6.0),   128,  240,  1000,

  # -XDstringConcat=indy:   1,            10,           100,           1,    10,   100,
  BC_SB,                    102.1 (0.5),  192.4 (1.2),  801.4 (4.2),   128,  240,  1000,
  BC_SB_SIZED,              121.1 (0.6),  153.2 (0.7),  440.0 (2.7),   152,  192,  552,
  BC_SB_SIZED_EXACT,        63.8 (0.4),   82.6 (0.4),   222.9 (1.3),   80,   104,  280,
  MH_SB_SIZED,              120.8 (0.8),  153.1 (1.0),  440.5 (2.3),   152,  192,  552,
  MH_INLINE_SIZED_EXACT,    52.6 (0.5),   64.2 (0.4),   204.3 (1.2),   56,   80,   256,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,           1,    10,   100,
  BC_SB,                    102.1 (0.8),  191.3 (0.8),  800.5 (5.1),   128,  240,  1000,
  BC_SB_SIZED,              121.3 (0.7),  153.2 (0.9),  439.3 (2.6),   152,  192,  552,
  BC_SB_SIZED_EXACT,        63.8 (0.5),   83.1 (0.4),   223.0 (1.2),   80,   104,  280,
  MH_SB_SIZED,              119.6 (2.7),  153.1 (0.8),  440.3 (2.1),   152,  192,  552,
  MH_INLINE_SIZED_EXACT,    52.3 (0.9),   63.9 (0.3),   204.0 (1.0),   56,   80,   256,

OptimizeStringConcat does not work for all patterns, for example when there is a "long" argument. In these cases, the SIZED strategies start to win big time, because they guess the final storage capacity right. Some strategies are penalized on smaller strings, because precomputing the length carries an overhead that is not recouped by avoiding the "wrong" initial capacity guess of the non-EXACT cases.
=== C1

                            time, ns/op                                alloc pressure, bytes/op
                            [size] -------->                           [size] -------->

  # JDK 9 baseline:         1,            10,           100,           1,    10,   100,
  *,                        113.3 (2.4),  190.3 (1.5),  789.2 (4.1),   128,  240,  1000,

  # -XDstringConcat=inline: 1,            10,           100,           1,    10,   100,
  BC_SB,                    116.0 (0.9),  190.5 (1.2),  790.3 (3.5),   128,  240,  1000,
  BC_SB_SIZED,              112.3 (1.0),  190.0 (0.9),  790.3 (3.9),   128,  240,  1000,
  BC_SB_SIZED_EXACT,        116.3 (2.3),  190.7 (1.0),  791.5 (3.9),   128,  240,  1000,
  MH_SB_SIZED,              118.4 (0.9),  191.0 (1.3),  790.5 (3.1),   128,  240,  1000,
  MH_INLINE_SIZED_EXACT,    114.0 (3.2),  190.3 (1.4),  792.5 (4.6),   128,  240,  1000,

  # -XDstringConcat=indy:   1,            10,           100,           1,    10,   100,
  BC_SB,                    113.8 (3.0),  190.8 (0.9),  790.0 (4.4),   128,  240,  1000,
  BC_SB_SIZED,              123.3 (0.8),  153.2 (1.0),  437.8 (2.2),   152,  192,  552,
  BC_SB_SIZED_EXACT,        144.2 (2.0),  166.5 (0.7),  444.0 (3.1),   160,  208,  560,
  MH_SB_SIZED,              362.6 (8.6),  364.5 (1.7),  438.6 (2.3),   152,  192,  552,
  MH_INLINE_SIZED_EXACT,    62.5 (3.7),   66.0 (1.9),   202.3 (1.1),   56,   80,   256,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,           1,    10,   100,
  BC_SB,                    115.4 (3.9),  190.4 (1.1),  791.2 (3.8),   128,  240,  1000,
  BC_SB_SIZED,              123.9 (1.5),  153.2 (1.0),  436.6 (2.1),   152,  192,  552,
  BC_SB_SIZED_EXACT,        144.9 (1.4),  166.4 (0.9),  443.3 (2.8),   160,  208,  560,
  MH_SB_SIZED,              350.4 (1.5),  360.8 (3.9),  436.8 (3.2),   152,  192,  552,
  MH_INLINE_SIZED_EXACT,    63.8 (2.3),   66.8 (1.5),   202.4 (1.0),   56,   80,   256,

Pretty much the same behavior is seen with C1.
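MH_INLINE_SIZED_EXACT wins here because it can precompute the exact storage for a long before writing a single character. The private Long.stringSize is off-limits to ordinary code, but the underlying idea can be sketched with public APIs only (the method names below are ours, not the JDK's):

```java
public class ExactSizing {
    // Character length of a long's decimal representation, including the
    // '-' sign. Mirrors the idea behind the private Long.stringSize.
    static int stringSize(long v) {
        if (v == Long.MIN_VALUE) {
            return 20;  // "-9223372036854775808"; -v would overflow below
        }
        int size = (v < 0) ? 1 : 0;  // room for the sign
        long x = (v < 0) ? -v : v;
        do {
            size++;
            x /= 10;
        } while (x != 0);
        return size;
    }

    // Exactly-sized (String + long) concat: the StringBuilder never resizes.
    static String concat(String a, long b) {
        StringBuilder sb = new StringBuilder(a.length() + stringSize(b));
        return sb.append(a).append(b).toString();
    }

    public static void main(String[] args) {
        System.out.println(concat("value = ", -12345L));  // "value = -12345"
    }
}
```

The real strategy goes one step further and writes into its own char[] handed to a non-copying String constructor, which is exactly the part that needs private JDK APIs.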
== LogLine

The LogLine benchmark tries to make the case for a more realistic scenario of a logged statement, like this:

  @State(Scope.Benchmark)
  @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
  @Fork(3)
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public class LogLineBench {

      @Param({"1", "10", "100"})
      private int size;

      private String method;

      @Setup
      public void setup() {
          StringBuilder sbA = new StringBuilder(size);
          for (int i = 0; i < size; i++) {
              sbA.append("a");
          }
          method = sbA.toString();
      }

      @Benchmark
      @CompilerControl(CompilerControl.Mode.DONT_INLINE)
      public String work() throws Exception {
          return "[" + System.nanoTime() + "] " +
                 Thread.currentThread().getName() + ":" +
                 "Calling an application method \"" + method + "\" without fear and prejudice.";
      }
  }

=== C2

                            time, ns/op                                  alloc pressure, bytes/op
                            [size] -------->                             [size] -------->

  # JDK 9 baseline:         1,            10,           100,             1,    10,   100,
  *,                        708.7 (3.4),  720.0 (4.1),  1332.0 (6.5),    888,  904,  1680,

  # -XDstringConcat=inline: 1,            10,           100,             1,    10,   100,
  BC_SB,                    709.1 (3.5),  721.5 (4.0),  1338.1 (6.3),    888,  904,  1680,
  BC_SB_SIZED,              708.3 (3.7),  724.3 (3.1),  1339.0 (6.4),    888,  904,  1680,
  BC_SB_SIZED_EXACT,        710.7 (5.6),  723.0 (2.9),  1335.6 (5.0),    888,  904,  1680,
  MH_SB_SIZED,              709.4 (2.7),  725.7 (3.5),  1343.5 (13.9),   888,  912,  1688,
  MH_INLINE_SIZED_EXACT,    709.0 (4.0),  723.0 (4.9),  1334.5 (10.8),   888,  904,  1680,

  # -XDstringConcat=indy:   1,            10,           100,             1,    10,   100,
  BC_SB,                    708.7 (3.1),  722.9 (3.4),  1340.6 (6.4),    888,  904,  1680,
  BC_SB_SIZED,              447.3 (1.8),  472.0 (2.4),  762.1 (4.0),     560,  592,  960,
  BC_SB_SIZED_EXACT,        268.6 (1.5),  281.7 (1.4),  428.3 (1.8),     336,  352,  536,
  MH_SB_SIZED,              447.9 (1.5),  472.2 (3.2),  764.3 (6.4),     560,  592,  960,
  MH_INLINE_SIZED_EXACT,    229.2 (1.1),  242.3 (1.3),  381.9 (2.1),     288,  304,  488,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,             1,    10,   100,
  BC_SB,                    708.8 (4.1),  723.6 (8.0),  1341.3 (9.4),    888,  904,  1680,
  BC_SB_SIZED,              446.3 (2.2),  471.3 (1.9),  765.3 (4.6),     560,  592,  960,
  BC_SB_SIZED_EXACT,        268.2 (1.5),  280.3 (1.1),  428.3 (1.8),     336,  352,  536,
  MH_SB_SIZED,              447.7 (2.6),  473.4 (1.8),  770.2 (9.3),     560,  592,  968,
  MH_INLINE_SIZED_EXACT,    230.1 (1.3),  242.9 (1.2),  382.3 (3.3),     288,  304,  488,

The original line is rather heavy-weight, so the baseline is relatively slow, and allocates a lot. The SIZED strategies avoid some of the garbage, and thus improve performance a lot. E.g. MH_INLINE_SIZED_EXACT improves performance 3-4x!

=== C1

                            time, ns/op                                  alloc pressure, bytes/op
                            [size] -------->                             [size] -------->

  # JDK 9 baseline:         1,            10,           100,             1,    10,   100,
  *,                        722.6 (3.2),  732.7 (3.5),  1342.7 (8.7),    912,  928,  1704,

  # -XDstringConcat=inline: 1,            10,           100,             1,    10,   100,
  BC_SB,                    723.0 (4.6),  736.0 (3.8),  1347.5 (9.5),    912,  928,  1704,
  BC_SB_SIZED,              722.2 (4.7),  734.0 (4.2),  1346.7 (5.5),    912,  928,  1704,
  BC_SB_SIZED_EXACT,        721.5 (4.3),  733.6 (3.6),  1346.3 (8.1),    912,  928,  1704,
  MH_SB_SIZED,              723.3 (3.6),  733.1 (3.9),  1345.8 (8.9),    912,  928,  1704,
  MH_INLINE_SIZED_EXACT,    722.5 (5.0),  734.1 (3.5),  1347.0 (6.3),    912,  928,  1704,

  # -XDstringConcat=indy:   1,            10,           100,             1,    10,   100,
  BC_SB,                    722.7 (3.9),  733.7 (5.0),  1343.9 (10.1),   912,  928,  1704,
  BC_SB_SIZED,              463.2 (3.6),  486.4 (3.2),  773.7 (3.1),     584,  616,  984,
  BC_SB_SIZED_EXACT,        513.8 (3.1),  539.0 (2.6),  825.7 (3.8),     648,  680,  1048,
  MH_SB_SIZED,              798.5 (7.7),  810.1 (9.1),  881.6 (7.0),     584,  616,  984,
  MH_INLINE_SIZED_EXACT,    342.6 (6.4),  341.1 (9.0),  388.6 (2.3),     288,  304,  488,

  # -XDstringConcat=indyWithConstants:
                            1,            10,           100,             1,    10,   100,
  BC_SB,                    722.6 (4.8),  736.7 (4.9),  1347.1 (6.0),    912,  928,  1704,
  BC_SB_SIZED,              465.8 (3.3),  489.7 (2.3),  776.5 (5.5),     584,  616,  984,
  BC_SB_SIZED_EXACT,        514.1 (3.5),  539.9 (4.4),  827.7 (3.6),     648,  680,  1048,
  MH_SB_SIZED,              687.8 (4.2),  702.5 (4.6),  781.1 (6.2),     584,  616,  984,
  MH_INLINE_SIZED_EXACT,    329.7 (3.8),  331.8 (6.4),  384.1 (2.2),     288,  304,  488,

These strategies operate well even in the absence of an advanced optimizing compiler.
This is also one of the cases where "indyWithConstants" wins over "indy" in the MH_SB_SIZED strategy -- large constant Strings are processed eagerly during linkage.

== Startup considerations

Obviously, moving the String concat to runtime may adversely impact startup time, since additional steps have to be performed right after the VM starts. Project Lambda had the same concerns, but there it turned out the overhead is actually quite small. A smoke test like the following can be used to contrast the startup costs:

  public class Concat {
      private static String a = "a";
      private static String b = "b";
      private static String c = null;
      private static int d = 42;

      public static void main(String... args) {
          System.out.println(doStuff());
      }

      public static String doStuff() {
          return a + b + c + d;
      }
  }

  $ for S in `seq 1 100`; do time java Concat; done 2>&1 | grep real | \
      sed -e "s/0m//g" -e "s/s//g" | \
      awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }'

The current implementation suffers from a related G1 performance issue, therefore we measure with -XX:+UseParallelGC. See:
  https://bugs.openjdk.java.net/browse/JDK-8136854

  -XDstringConcat=inline:
    *:                      34.1 +- 1.5 ms

  -XDstringConcat=indy:
    BC_SB:                  58.5 +- 2.0 ms
    BC_SB_SIZED:            59.5 +- 2.0 ms
    BC_SB_SIZED_EXACT:      59.0 +- 2.0 ms
    MH_SB_SIZED:            95.1 +- 2.5 ms
    MH_INLINE_SIZED_EXACT:  90.1 +- 2.5 ms

  -XDstringConcat=indyWithConstants:
    BC_SB:                  57.5 +- 2.0 ms
    BC_SB_SIZED:            58.5 +- 2.0 ms
    BC_SB_SIZED_EXACT:      58.5 +- 2.0 ms
    MH_SB_SIZED:            91.1 +- 2.5 ms
    MH_INLINE_SIZED_EXACT:  87.5 +- 2.5 ms

All indy strategies regress the startup time. However, we know that the largest issue with startup time is the initialization of the java.lang.invoke infrastructure itself, see:
  https://bugs.openjdk.java.net/browse/JDK-8086045

To show that in our test, we may add a single line to the test that also uses java.lang.invoke and implicitly initializes it.
Any code with lambdas will do, e.g.:

  private static Runnable r = () -> System.exit(0);

Then the performance looks like this:

  -XDstringConcat=inline:
    *:                      60.0 +- 2.0 ms

  -XDstringConcat=indy:
    BC_SB:                  64.1 +- 2.0 ms
    BC_SB_SIZED:            64.0 +- 2.0 ms
    BC_SB_SIZED_EXACT:      63.5 +- 2.0 ms
    MH_SB_SIZED:            94.5 +- 2.0 ms
    MH_INLINE_SIZED_EXACT:  91.5 +- 2.0 ms

  -XDstringConcat=indyWithConstants:
    BC_SB:                  62.5 +- 2.0 ms
    BC_SB_SIZED:            63.8 +- 2.0 ms
    BC_SB_SIZED_EXACT:      63.3 +- 2.0 ms
    MH_SB_SIZED:            92.9 +- 2.0 ms
    MH_INLINE_SIZED_EXACT:  90.5 +- 2.0 ms

In other words, the BC_* strategies cost an additional ~5 ms, and the MH_* strategies an additional ~30 ms. This cost is paid once, and should vanish as soon as the BSM methods warm up.

== Footprint considerations

The very first experiment one can do is to compile the entire JDK with and without indy String concat. If we then compare the raw .class files, we can see a slight regression in bytecode size:

                                       raw .class files, bytes    JDK image, Kb
  -XDstringConcat=inline                       132,227,587            334,676
  -XDstringConcat=indy                         133,378,513            335,796
  -XDstringConcat=indyWithConstants            133,211,067            335,640

== Interaction with Compact Strings

As noted in the JEP, this change has the potential to collide with other String-related improvements. The improvement of special interest for us is Compact Strings (http://openjdk.java.net/jeps/254), as it changes the internal representation of String instances, and may affect the public API performance. A simple experiment is to mix the current prototypes and run the same set of benchmarks against the combination. We choose LogLineBench from above as the more realistic scenario. Since most strategies use only the public API, no further changes were needed. The only exception was MH_INLINE_SIZED_EXACT, which assumes a particular String shape -- there we had to hack in a crude implementation that assumes all Strings are Latin1 (indeed they are in our test).
                            time, ns/op                                  alloc pressure, bytes/op
                            [size] -------->                             [size] -------->

*** Indify String Concat (baseline, -XDstringConcat=inline)

                            1,            10,            100,            1,    10,   100,
  BC_SB,                    709.1 (3.5),  721.5 (4.0),   1338.1 (6.3),   888,  904,  1680,
  BC_SB_SIZED,              708.3 (3.7),  724.3 (3.1),   1339.0 (6.4),   888,  904,  1680,
  BC_SB_SIZED_EXACT,        710.7 (5.6),  723.0 (2.9),   1335.6 (5.0),   888,  904,  1680,
  MH_SB_SIZED,              709.4 (2.7),  730.7 (10.5),  1343.5 (13.9),  888,  912,  1688,
  MH_INLINE_SIZED_EXACT,    709.0 (4.0),  723.0 (4.9),   1334.5 (10.8),  888,  904,  1680,

*** Compact Strings (baseline, -XDstringConcat=inline)

                            1,            10,            100,            1,    10,   100,
  BC_SB,                    403.9 (2.2),  410.0 (3.0),   721.7 (3.9),    504,  512,  904,
  BC_SB_SIZED,              402.8 (1.8),  409.0 (1.3),   722.7 (6.2),    504,  512,  904,
  BC_SB_SIZED_EXACT,        403.4 (1.9),  408.8 (2.3),   722.6 (3.9),    504,  512,  904,
  MH_SB_SIZED,              402.7 (1.9),  410.5 (9.0),   724.5 (3.5),    504,  512,  904,
  MH_INLINE_SIZED_EXACT,    403.0 (2.1),  410.8 (4.1),   721.5 (3.5),    504,  512,  904,

*** Indify String Concat (optimized, -XDstringConcat=indyWithConstants)

                            1,            10,            100,            1,    10,   100,
  BC_SB,                    709.7 (4.2),  716.4 (10.5),  1338.0 (12.1),  888,  904,  1680,
  BC_SB_SIZED,              447.7 (2.9),  478.9 (9.5),   763.9 (5.2),    560,  600,  960,
  BC_SB_SIZED_EXACT,        268.7 (1.6),  281.3 (1.1),   427.8 (2.5),    336,  352,  536,
  MH_SB_SIZED,              446.1 (2.6),  471.6 (2.5),   763.2 (4.8),    560,  592,  960,
  MH_INLINE_SIZED_EXACT,    229.1 (1.1),  241.2 (1.1),   383.9 (2.5),    288,  304,  488,

*** Compact Strings + Indify String Concat (optimized, -XDstringConcat=indyWithConstants)

                            1,            10,            100,            1,    10,   100,
  BC_SB,                    404.1 (4.8),  410.0 (2.3),   721.0 (3.1),    504,  512,  904,
  BC_SB_SIZED,              248.8 (1.4),  262.5 (1.1),   406.9 (2.0),    312,  328,  512,
  BC_SB_SIZED_EXACT,        275.1 (1.8),  288.2 (2.1),   428.2 (2.0),    344,  360,  536,
  MH_SB_SIZED,              249.3 (1.2),  261.4 (1.3),   408.6 (2.2),    312,  328,  512,
  MH_INLINE_SIZED_EXACT,    187.9 (2.2),  188.9 (3.2),   210.3 (1.2),    168,  176,  264,

One can clearly spot that both changes alone improve performance significantly.
Optimized strategies in Indify String Concat try to avoid the excess work, while Compact Strings tries to improve footprint without sacrificing (much of) the performance, and avoids excess allocation. But both features together push the envelope even further. Take the 100-char case as an example. In the well-tested BC_SB_SIZED strategy, the improvement is 1338 ns/op -> 408 ns/op (3.25x), and the allocation pressure is down from 1680 bytes/op to 512 bytes/op (3.28x)! The hacky and special-cased MH_INLINE_SIZED_EXACT provides even more: >6.36x more throughput, and 6.36x less allocation pressure! In fact, 264 bytes seems to be the space taken by the resulting String itself, meaning that in this mode the concat is completely garbage-free.

Also notice how the "good" BC_SB_SIZED_EXACT strategy regressed with the Compact Strings change. This is due to OptoStringConcat in Compact Strings producing more branchy code that chooses the control path based on String.coder values. Since the OptoStringConcat code shape is introduced in C2, no profile information can be gathered for it, and we end up with a lot of cold code in the "hot" instruction stream. In the indy String concat case, we can generate the bytecode for concatenation early, and so it enjoys the usual profiling mechanics that ordinary Java code does.

== Conclusion

There are four specific observations to be made:

 1) Moving the generated bytecode sequence from javac to the indy translation strategy does not incur performance hits, as we can see with the BC_SB tests. This validates the idea that the move from a javac-compiled bytecode sequence to a runtime-generated one is sound, at least on the performance front.

 2) Advanced strategies can improve performance when the JIT compiler is not able to optimize the append chains. We can even devise strategies that skip the StringBuilder APIs and populate the char[]/byte[] array directly, and these strategies can provide immense performance and allocation pressure improvements.
 3) This feature plays well with other String-related optimizations, notably Compact Strings. The ability to fit the concat implementation to a particular String shape seems to be a huge win.

 4) Bytecode footprint costs manifest as a static overhead of storing MethodTypes and other java.lang.invoke.* infra (around 10 new CP entries), plus a dynamic overhead (a few CP entries) for each new String concat shape.

--------

=== Cramming implicit concats

(Note: this section is not complete, and measures only the "indy" generator, not "indyWithConstants")

A more thorough experiment explores the footprint at the edge of the class file limits. There is a concern that Java programs which already consume most of the class file will fail to compile with the new indy code shape. In order to quantify these costs, we may conduct a simple experiment where we implicitly concatenate multiple arguments. In hindsight, it makes sense to test String concats and int concats, with N arguments, in four scenarios:

 1) N chains, 1 argument each, distinct type in each:
      String s1 = "" + (Type1)t1;
      ...
      String sN = "" + (TypeN)tN;

 2) N chains, 1 argument each, same type:
      String s1 = "" + (String)s1;
      ...
      String sN = "" + (String)s1;

 3) 1 chain, N arguments:
      String s1 = "" + s1 + s2 + ... + sN;

 4) N chains, [1..N] arguments:
      String s1 = "" + s1;
      String s2 = "" + s1 + s2;
      ...
      String sN = "" + s1 + s2 + ... + sN;

These scenarios cover multiple method argument shapes, as well as provide the corner cases of many small concats, one large concat, etc. The results are as follows:
  http://cr.openjdk.java.net/~shade/8085796/class-size.png

It is important to remember that the actual limits in class files are the number of bytecode instructions in one method, and the number of constant pool entries (but not directly the constant pool size).
In the data below, we can see this:

 *) 1 chain, N arguments: the constant pool size is the same entry-wise, and the class size is growing, since we have more bytecode instructions to cram into the method. The sawtooth pattern for the indy tests is due to the cutoff in indy arguments: when our prototype faces an indy call with more than 253 arguments, it splits off another indy call with the rest of the arguments. All those "bulk" 253-sized calls reference the same constant pool entry. The remainder call still has a linearly growing MethodType, until it reaches 253 arguments and folds back. The cycle repeats with the next remainder. In this test, legacy String concat blows the method size limit first.

 *) N chains, [1..N] arguments: here, the indy tests start to cram in more CP entries, since every concat shape requires a MethodType constant and a new method signature. Legacy concat uses the same StringBuilder::append call, which reuses the same CP entry over and over again. The class file size grows quadratically in the indy cases, since we shove in larger and larger MethodTypes. In this test, legacy String concat blows the method size limit first.

 *) N chains, 1 argument each, distinct type each: indy String concat pays for additional CP entries storing the MethodType and method signature, in addition to storing the distinct type. While it seems as if indy String concat would blow the CP entry limit, it is actually legacy String concat that blows the method size limit first.

 *) N chains, 1 argument each, same type: since the code shape is not changing, indy String concat needs only a static number of additional constant pool entries to keep the java.lang.invoke references. You can also see that indy String concat produces a more compact method. As in the other tests, legacy String concat blows the method size limit first.