% Iterators vs. Cursors: A Case Study in Objects vs. Values > John Rose, 2018-0310 (light edits 2020-0923, name changes 2022-0803) There is a clear parallelism between immutable object types (notably value-based classes) and value types, which are immutable and have similar qualities to value-based classes. Indeed, [recent changes in Project Valhalla] tend to make value classes (per se) conform to the contract of value-based classes so closely as to make migration of VBSs to VCs a desirable choice (in most cases). [recent changes in Project Valhalla]: There is another parallelism between stateful object types and value classes, but it is more subtle. In many (not all) cases a stateful object can be transformed into a value, as long as the value which models the current state of the object is made available to all code that may need to read that state. This means that updating a state must involve discarding the previous value and using the new value which models the updated object state. This all sounds complicated, but it feels simple in practice. "Codes like a class but works like an int" means that the current state value is always assigned to whatever variable(s) are looking at it. In fact, just like an int: ``` int x = 0; Value v = Value(0); … x + 1; // bad code x = x + 1; // good code v.bump(); // bad code v = v.bump(); // good code ``` (Hmmm… It might be time to support a `__NoDiscard` modifier to return values in APIs. The bad code with `x` is a syntax error, but the bad code with `v` is silently accepted. To track this I filed [JDK-8199429] and threw some ideas in to deal with later.) [JDK-8199429]: Let's do an example, where a stateful `Iterator` object is transformed into a stateless `Cursor` value. This code is taken verbatim from the JDK: ``` class CharIterator implements PrimitiveIterator.OfInt { int cur = 0; public boolean hasNext() { return cur < length(); } public int nextInt() { if (hasNext()) { return charAt(cur++); } else { throw new NoSuchElementException(); } } // omitted forEachRemaining } ``` It is an inner class, used to construct a stream over an enclosing object (a `CharSequence`) which has `length()` and `charAt(i)` operations. It could also be used as a normal `Iterator` (and would return boxed `Integer`s, for what that's worth). (To peek under the covers at this, try javap on the local class sometimes known as `java.lang.CharSequence$1CharIterator`.) An instance of this class has two fields: A pointer to the outer `CharSequence`, plus the varying index `cur`. It is often the case that iterators have one fixed field and one varying; sometimes there are additional fixed fields, such as a limit to the varying field. Sometimes there are additional varying fields, such as a prefetched current value. These fixed and varying fields correspond to loop variables, which are (respectively) loop-invariant and loop-varying, but they are neatly packaged inside of the iterator. The public interface of this object is very simple, a variation of `java.util.Iterator`: ``` public interface PrimitiveIterator.OfInt extends … Iterator { @Override boolean hasNext(); int nextInt(); } ``` A for-loop over this iterator looks like this: ``` for (final var t = myIntIterator(); t.hasNext(); ) { var x = t.nextInt(); workOn(x); } ``` This for-loop appears to use just one iteration variable `t`, and it is loop-invariant. But under the hood there is a loop-invariant pointer to a `CharSequence` and a loop-varying index. A recurring problem with iterators is that they require a heap allocation to hold the iteration variables. This is necessary in order to make `t` be loop-invariant, but to encapsulate the loop-varying field `cur`. A recurring fix to the problem is to use something called "escape analysis" to determine that `t` is local to the for-loop, and expand its fields directly into the same stack frame as the for-loop. Here's the same loop as above, but with the fields of the iterator inlined (as source code) into the for-loop: ``` final var t = myIntIterator(); final var tcs = t.this$0; // CharSequence.this for (var tcur = t.cur; tcur < tcs.length(); ) { var x = tcs.charAt(tcur++); workOn(x); } ``` (Note how the `hasNext()` method simplifies down to what you would write in a hand-crafted loop, something like `a[i++]`. We'll want to preserve this effect, which comes from correct usage of `hasNext` and `next`.) While wonderful in theory, the escape analysis fix is fragile in practice. Value types, besides supplying flattened data structures, also supply (as a necessary component) an inherent property that they never "escape" in the sense of escape analysis, and therefore do not require any analysis in order to enjoy their optimization into groups of local variables. The catch is that values have different characteristics, in order to avoid escape analysis and allow flattening. Specifically, they must be identity-free and immutable. That in turn requires different patterns in source code to use them correctly. If a value models the same state variables as the `CharIterator` above, it must appear in the for-loop as a loop-varying value, not a loop-invariant iterator object. In a values-rich world, this loop-varying replacement for iterator will allow loops to be coded generically, as with iterators, but with no heap allocations to hold the state. It seems possible that such loop-varying loop control values could become as common as iterators, so they deserve their own name; we can call them *cursors*. Let's work again through the iterator example to see how a value-based cursor might replace the iterator. First, we'll regroup the `tcs` and `tcur` variables into a value class: ``` value class CharCursor extends … { //CharSequence this$0; // make it a nested class int cur = 0; … } ``` Now let's rename the expanded loop to see how to fill in the value class: ``` var c = myIntCursor(); final var ccs = c.this$0; // CharSequence.this for (var ccur = c.cur; ccur < ccs.length(); ) { var x = tcs.charAt(ccur++); workOn(x); } ``` And then regroup `ccs`/`ccur` back into `c`: ``` for (var c = myIntCursor(); c.hasValue(); c = c.advance()) { var x = c.getAsInt(); workOn(x); } ``` I chose the names more or less arbitrarily, but tried to approach, yet not encroach on, the existing iterator API. > (An earlier draft of this note just used `hasNext` and `next`, but using the same name with a subtly different behavioral contract is one of those clever-but-bad ideas some of us can tangled in. Users might call `get` and expect it (somehow) to include the semantics of `advance`, but that's impossible for a stateless cursor.) The name `get` (or `getAsInt`) allows a cursor to be a `Supplier` (or `IntSupplier`). This is a legitimate implementation of the contract, despite the fact that some suppliers, like all iterators, are stateful. The name `hasValue` could also be `hasCurrent` or `notAtEnd` or `alive`, while the name `get` could be `value` or `current`. The name `advance` could also have been `bump`, `increment`, etc. It could even have been `next`, encroaching on a different part of the `Iterator` API. More importantly, the value version of the loop requires three methods where the object version requires two. Is this evidence of a mistake somewhere? Not really; it is evidence that we are succeeding in the core principle of value types: "codes like a class, works like an int". Compare the previous loop with a simple array/int loop and you will see why: ``` final var a = myArray(); for (var i = 0; i < a.length; i = i + 1) { var x = a[i]; workOn(x); } ``` The same pattern of two iteration variables occurs (`a` and `i`), and the same pattern of loop control operations occurs. Instead of `c.hasValue()` we have `i** The Java language (following C) gives an abbreviated way to get the effect of `t.nextInt()` in a single expression, the famous `a[i++]`. If you look carefully at `a[i++]`, you'll notice that it returns a pair of values: First, the value of the expression is the get element of that array you are scanning. Second, the iteration state quietly updates itself to point at the array element afterwards. Given the widespread use of this deceptively simple expression, it seems worth considering whether value types could benefit from such a syntax also; it would call a method on a value stored in a variable, update the variable, and then call *another* method on the previous value of the variable. Maybe something like: ``` for (var c = myIntCursor(); c.hasValue(); ) { var x = c{.=advance()}.getAsInt(); workOn(x); } ``` Sadly, there don't seem to be any obviously good answers here. And to be honest `i++`, though indispensible, is quite ugly; we're just used to it. So I'll back away from that set of ideas. See [JDK-8199429] for a bit more. **\** Now we can fill in our cursor class more fully: ``` value class CharCursor extends … PrimitiveCursor.OfInt { //CharSequence this$0; // make it a nested class int cur = 0; public boolean hasValue() { return cur < length(); } public int getAsInt() { if (hasValue()) { return charAt(cur); } else { throw new NoSuchElementException(); } } public CharCursor advance() { return this __With { i = i + 1; }; } } ``` (The `__With` token sweeps under the rug the details of how a new value instance is obtained from an old value instance of the same class by updating one of its fields. This is a [separate and complicated conversation], involving concepts like factory methods, primary constructors, reconstructors, deconstructors, and "wither" methods. `__With` in this case is simply a punch-through to the current JVM-level bytecode which handles such things, `withfield`.) [separate and complicated conversation]: The details of the class are not deeply important, although the above is probably close to a working example in current Valhalla prototypes. The code is almost identical to the original `CharIterator` example. (And I didn't have to cherry-pick an example, or work hard on the iterator-to-cursor transition.) Notice that the double effect of the `t.nextInt()` method derives from a tiny "`++`" token embedded in the heart of the method. We can derive the value type semantics by dropping the "`++`". Of course it is not always this easy to make a cursor from an iterator, but it probably not much harder in most cases. If we change "`i++`" to "`i`" in `getInt` to drop the second effect, we had to put it somewhere. In fact, it went into the third method, as one would expect. In [some proposals] for the language support for values, you can write "`i++`" as the complete body of a reconstructor method: [some proposals]: ``` public __Reconstructor advance() { i++; } ``` As a sanity check, let's inline the cursor into our new loop and see how it optimizes: ``` for (var c = myIntCursor(); c.cur < c.this$0.length(); c = __WithField(c.cur, c.cur + 1)) { var x = c.this$0.charAt(c.cur); workOn(x); } ``` This is more complicated than the inlining of the `CharIterator`; is it harder to optimize? For starters, there are no heap objects or object identities, and thus no aliasing. The compiler can accurately track `c` and its components, `c.cur` and `c.this$0` as if they were all plain variables. In fact, the above loop quickly folds to the previous version of the inlined loop, including operations like `ccur++`. The compiler can easily and reliably note that `c.this$0` is in fact a loop invariant, and can be hoisted into a final like `ccs` was above. There's no need to exhibit this loop because its behavior is indistinguishable from the nice loop with `ccs`, above. A key part of optimizing the cursor-based loop is recognizing the loop invariants. This recognition is slightly at risk since when the cursor value is updated, it apparently updates all components of the cursor (including `this$0`), not just the part that changed (which is `cur` only). If the compiler could not inline the crucial call to `c.advance()`, it would be unable to deduce that `c.this$0` is a loop invariant. Of course it also wouldn't see that the thing changing was a nice little `int` variable. To make sure the loop optimizes fully, all three methods of `CharCursor` must be inlined, and their bodies must be easy for the compiler to analyze. Of course, the same is already true with the `CharIterator`, and we have seen that the code of the two classes is almost identical, except that the `CharCursor` splits out two methods from a single `CharIterator` method. Nearly-identical code is a problem of course. If we move away from iterators and toward cursors, will there be a general duplication of loop iteration code, one copy for cursors and one for iterators? After all, iterators aren't going anywhere, and sometimes you need one. What's necessary in order to adopt cursors widely is to have a story for interconverting them with iterators, without always duplicating code. (Sometimes we duplicate code if it helps performance but we try not to indulge too often.) Given the similarity in structure, can cursors be automatically derived from iterators? The answer seems to be no (in general) since deriving the three methods of a cursor from the two methods of an iterator requires somehow splitting the `get` method of the iterator into its two effects, and assigning the bits of code to the `get` and `advance` methods of the cursor. But it seems that iterators can be derived from cursors. Here is a proof, specialized to the current example of `CharCursor` and `CharIterator`: ``` class CharIterator implements PrimitiveIterator.OfInt { CharCursor c; public boolean hasValue() { return c.hasValue(); } public int nextInt() { int x = c.getAsInt(); // (where is my ++) c = c.advance(); return x; } } ``` More generically: ``` abstract class IteratorFromCursor implements Iterator { protected Cursor c; protected IteratorFromCursor(Cursor initialState) { c = initialState; } public boolean hasNext() { return c.hasValue(); } public T next() { T x = c.get(); // (where is my ++) c = c.advance(); return x; } public void forEachRemaining(Consumer block) { c.forEachRemaining(block); } } ``` (In order to fully flatten this class, it will need to be a template class. This is a planned follow-on to value types. Before template classes, the generic iterator will have a slightly more complex heap structure. Where escape analysis succeeds with bespoke iterators, it is likely also to succeed with such generic iterators built on top of cursors.) Given that derivation, it will be reasonable, over time, to refactor some existing iterators as cursors, splitting the two effects of the `get` method by hand, and supplying backward-compatible iterators automatically using generics like the above. What have we learned? - Value classes can incrementally improve on certain kinds of mutable classes, such as iterators. The improvement is that the dependence on escape analysis is removed. - In particular, value classes can reduce heap pressure for loop, in cases where we care about such things. - The improvement requires recoding but is incremental in cost, because the code for cursors and iterators is very similar. - To benefit from cursors, for-loops which use iterators need recoding; this recoding is also incremental in nature. - There is a backward compatibility story for iterators in a world where cursors are the new way to do things. - In order to operate effectively with for-loops, value classes may need a bit more syntax love, so they can "work like an int" when that int is loop variable. Since moving to cursors requires recoding for both loops and iterators, it's fair to wonder whether we are actually stuck with iterators, despite their issues. Of course, most code will stay as it is, but performance sensitive code may benefit from recoding loops. In that case, having a basic supply of cursors on hand makes sense. In any case, simply as a thought experiment, cursors have taught us more about how value types will work.