% Iterators vs. Cursors: A Case Study in Objects vs. Values
> John Rose, 2018-0310 (light edits 2020-0923, name changes 2022-0803)
There is a clear parallelism between immutable object types (notably
value-based classes) and value types, which are immutable and have
similar qualities to value-based classes. Indeed, [recent changes in
Project Valhalla] tend to make value classes (per se) conform to the
contract of value-based classes so closely as to make migration of
VBSs to VCs a desirable choice (in most cases).
[recent changes in Project Valhalla]:
There is another parallelism between stateful object types and value
classes, but it is more subtle. In many (not all) cases a stateful
object can be transformed into a value, as long as the value which
models the current state of the object is made available to all code
that may need to read that state. This means that updating a state
must involve discarding the previous value and using the new value
which models the updated object state.
This all sounds complicated, but it feels simple in practice. "Codes
like a class but works like an int" means that the current state value
is always assigned to whatever variable(s) are looking at it. In
fact, just like an int:
```
int x = 0;
Value v = Value(0);
…
x + 1; // bad code
x = x + 1; // good code
v.bump(); // bad code
v = v.bump(); // good code
```
(Hmmm… It might be time to support a `__NoDiscard` modifier to return
values in APIs. The bad code with `x` is a syntax error, but the bad
code with `v` is silently accepted. To track this I filed [JDK-8199429]
and threw some ideas in to deal with later.)
[JDK-8199429]:
Let's do an example, where a stateful `Iterator` object is transformed
into a stateless `Cursor` value. This code is taken verbatim from the
JDK:
```
class CharIterator implements PrimitiveIterator.OfInt {
int cur = 0;
public boolean hasNext() {
return cur < length();
}
public int nextInt() {
if (hasNext()) {
return charAt(cur++);
} else {
throw new NoSuchElementException();
}
}
// omitted forEachRemaining
}
```
It is an inner class, used to construct a stream over an enclosing
object (a `CharSequence`) which has `length()` and `charAt(i)` operations.
It could also be used as a normal `Iterator` (and would return boxed
`Integer`s, for what that's worth).
(To peek under the covers at this, try javap on the local class
sometimes known as `java.lang.CharSequence$1CharIterator`.)
An instance of this class has two fields: A pointer to the outer
`CharSequence`, plus the varying index `cur`. It is often the case that
iterators have one fixed field and one varying; sometimes there are
additional fixed fields, such as a limit to the varying field.
Sometimes there are additional varying fields, such as a prefetched
current value. These fixed and varying fields correspond to loop
variables, which are (respectively) loop-invariant and loop-varying,
but they are neatly packaged inside of the iterator.
The public interface of this object is very simple, a variation of
`java.util.Iterator`:
```
public interface PrimitiveIterator.OfInt
extends … Iterator {
@Override boolean hasNext();
int nextInt();
}
```
A for-loop over this iterator looks like this:
```
for (final var t = myIntIterator(); t.hasNext(); ) {
var x = t.nextInt();
workOn(x);
}
```
This for-loop appears to use just one iteration variable `t`, and it is
loop-invariant. But under the hood there is a loop-invariant pointer
to a `CharSequence` and a loop-varying index.
A recurring problem with iterators is that they require a heap
allocation to hold the iteration variables. This is necessary in
order to make `t` be loop-invariant, but to encapsulate the loop-varying
field `cur`. A recurring fix to the problem is to use something called
"escape analysis" to determine that `t` is local to the for-loop, and
expand its fields directly into the same stack frame as the for-loop.
Here's the same loop as above, but with the fields of the iterator
inlined (as source code) into the for-loop:
```
final var t = myIntIterator();
final var tcs = t.this$0; // CharSequence.this
for (var tcur = t.cur; tcur < tcs.length(); ) {
var x = tcs.charAt(tcur++);
workOn(x);
}
```
(Note how the `hasNext()` method simplifies down to what you would write
in a hand-crafted loop, something like `a[i++]`. We'll want to preserve
this effect, which comes from correct usage of `hasNext` and `next`.)
While wonderful in theory, the escape analysis fix is fragile in
practice. Value types, besides supplying flattened data structures,
also supply (as a necessary component) an inherent property that
they never "escape" in the sense of escape analysis, and therefore
do not require any analysis in order to enjoy their optimization
into groups of local variables.
The catch is that values have different characteristics, in order to
avoid escape analysis and allow flattening. Specifically, they must be
identity-free and immutable. That in turn requires different patterns
in source code to use them correctly. If a value models the same
state variables as the `CharIterator` above, it must appear in the
for-loop as a loop-varying value, not a loop-invariant iterator
object. In a values-rich world, this loop-varying replacement
for iterator will allow loops to be coded generically, as
with iterators, but with no heap allocations to hold the state.
It seems possible that such loop-varying loop control values
could become as common as iterators, so they deserve their own
name; we can call them *cursors*.
Let's work again through the iterator example to see how a value-based
cursor might replace the iterator. First, we'll regroup the `tcs` and
`tcur` variables into a value class:
```
value class CharCursor extends … {
//CharSequence this$0; // make it a nested class
int cur = 0;
…
}
```
Now let's rename the expanded loop to see how to fill in the value
class:
```
var c = myIntCursor();
final var ccs = c.this$0; // CharSequence.this
for (var ccur = c.cur; ccur < ccs.length(); ) {
var x = tcs.charAt(ccur++);
workOn(x);
}
```
And then regroup `ccs`/`ccur` back into `c`:
```
for (var c = myIntCursor(); c.hasValue(); c = c.advance()) {
var x = c.getAsInt();
workOn(x);
}
```
I chose the names more or less arbitrarily, but tried to approach, yet
not encroach on, the existing iterator API.
> (An earlier draft of this note just used `hasNext` and `next`, but
using the same name with a subtly different behavioral contract is one
of those clever-but-bad ideas some of us can tangled in. Users might
call `get` and expect it (somehow) to include the semantics of
`advance`, but that's impossible for a stateless cursor.)
The name `get` (or `getAsInt`) allows a cursor to be a `Supplier` (or
`IntSupplier`). This is a legitimate implementation of the contract,
despite the fact that some suppliers, like all iterators, are
stateful.
The name `hasValue` could also be `hasCurrent` or `notAtEnd` or
`alive`, while the name `get` could be `value` or `current`.
The name `advance` could also have been `bump`, `increment`, etc. It
could even have been `next`, encroaching on a different part of the
`Iterator` API.
More importantly, the value version of the loop requires three methods
where the object version requires two. Is this evidence of a mistake
somewhere? Not really; it is evidence that we are succeeding in the
core principle of value types: "codes like a class, works like an
int". Compare the previous loop with a simple array/int loop and
you will see why:
```
final var a = myArray();
for (var i = 0; i < a.length; i = i + 1) {
var x = a[i];
workOn(x);
}
```
The same pattern of two iteration variables occurs (`a` and `i`), and the
same pattern of loop control operations occurs. Instead of
`c.hasValue()` we have `i**
The Java language (following C) gives an abbreviated way to get the
effect of `t.nextInt()` in a single expression, the famous `a[i++]`. If
you look carefully at `a[i++]`, you'll notice that it returns a pair of
values: First, the value of the expression is the get element of that
array you are scanning. Second, the iteration state quietly updates
itself to point at the array element afterwards. Given the widespread
use of this deceptively simple expression, it seems worth considering
whether value types could benefit from such a syntax also; it would
call a method on a value stored in a variable, update the variable,
and then call *another* method on the previous value of the variable.
Maybe something like:
```
for (var c = myIntCursor(); c.hasValue(); ) {
var x = c{.=advance()}.getAsInt();
workOn(x);
}
```
Sadly, there don't seem to be any obviously good answers here. And to
be honest `i++`, though indispensible, is quite ugly; we're just used to
it. So I'll back away from that set of ideas. See [JDK-8199429] for a
bit more.
**\**
Now we can fill in our cursor class more fully:
```
value class CharCursor extends … PrimitiveCursor.OfInt {
//CharSequence this$0; // make it a nested class
int cur = 0;
public boolean hasValue() {
return cur < length();
}
public int getAsInt() {
if (hasValue()) {
return charAt(cur);
} else {
throw new NoSuchElementException();
}
}
public CharCursor advance() {
return this __With { i = i + 1; };
}
}
```
(The `__With` token sweeps under the rug the details of how a new
value instance is obtained from an old value instance of the same
class by updating one of its fields. This is a [separate and
complicated conversation], involving concepts like factory methods,
primary constructors, reconstructors, deconstructors, and "wither"
methods. `__With` in this case is simply a punch-through to the current
JVM-level bytecode which handles such things, `withfield`.)
[separate and complicated conversation]:
The details of the class are not deeply important, although the above
is probably close to a working example in current Valhalla prototypes.
The code is almost identical to the original `CharIterator` example.
(And I didn't have to cherry-pick an example, or work hard on the
iterator-to-cursor transition.) Notice that the double effect of the
`t.nextInt()` method derives from a tiny "`++`" token embedded in the
heart of the method. We can derive the value type semantics by
dropping the "`++`". Of course it is not always this easy to make a
cursor from an iterator, but it probably not much harder in most
cases.
If we change "`i++`" to "`i`" in `getInt` to drop the second effect, we had
to put it somewhere. In fact, it went into the third method, as one
would expect. In [some proposals] for the language support for values,
you can write "`i++`" as the complete body of a reconstructor method:
[some proposals]:
```
public __Reconstructor advance() {
i++;
}
```
As a sanity check, let's inline the cursor into our new loop and see
how it optimizes:
```
for (var c = myIntCursor();
c.cur < c.this$0.length();
c = __WithField(c.cur, c.cur + 1)) {
var x = c.this$0.charAt(c.cur);
workOn(x);
}
```
This is more complicated than the inlining of the `CharIterator`; is it
harder to optimize? For starters, there are no heap objects or object
identities, and thus no aliasing. The compiler can accurately track `c` and
its components, `c.cur` and `c.this$0` as if they were all plain
variables. In fact, the above loop quickly folds to the previous
version of the inlined loop, including operations like `ccur++`. The
compiler can easily and reliably note that `c.this$0` is in fact a loop
invariant, and can be hoisted into a final like `ccs` was above.
There's no need to exhibit this loop because its behavior is
indistinguishable from the nice loop with `ccs`, above.
A key part of optimizing the cursor-based loop is recognizing the loop
invariants. This recognition is slightly at risk since when the
cursor value is updated, it apparently updates all components of the
cursor (including `this$0`), not just the part that changed (which is
`cur` only). If the compiler could not inline the crucial call to
`c.advance()`, it would be unable to deduce that `c.this$0` is a loop
invariant. Of course it also wouldn't see that the thing changing was
a nice little `int` variable.
To make sure the loop optimizes fully, all three methods of `CharCursor`
must be inlined, and their bodies must be easy for the compiler to
analyze. Of course, the same is already true with the `CharIterator`,
and we have seen that the code of the two classes is almost identical,
except that the `CharCursor` splits out two methods from a single
`CharIterator` method.
Nearly-identical code is a problem of course. If we move away from
iterators and toward cursors, will there be a general duplication of
loop iteration code, one copy for cursors and one for iterators?
After all, iterators aren't going anywhere, and sometimes you need
one.
What's necessary in order to adopt cursors widely is to have a story
for interconverting them with iterators, without always duplicating
code. (Sometimes we duplicate code if it helps performance but we try
not to indulge too often.) Given the similarity in structure, can
cursors be automatically derived from iterators? The answer seems to
be no (in general) since deriving the three methods of a cursor from
the two methods of an iterator requires somehow splitting the `get`
method of the iterator into its two effects, and assigning the bits of
code to the `get` and `advance` methods of the cursor.
But it seems that iterators can be derived from cursors. Here is a
proof, specialized to the current example of `CharCursor` and
`CharIterator`:
```
class CharIterator implements PrimitiveIterator.OfInt {
CharCursor c;
public boolean hasValue() {
return c.hasValue();
}
public int nextInt() {
int x = c.getAsInt(); // (where is my ++)
c = c.advance();
return x;
}
}
```
More generically:
```
abstract class IteratorFromCursor implements Iterator {
protected Cursor c;
protected IteratorFromCursor(Cursor initialState) {
c = initialState;
}
public boolean hasNext() {
return c.hasValue();
}
public T next() {
T x = c.get(); // (where is my ++)
c = c.advance();
return x;
}
public void forEachRemaining(Consumer block) {
c.forEachRemaining(block);
}
}
```
(In order to fully flatten this class, it will need to be a template
class. This is a planned follow-on to value types. Before template
classes, the generic iterator will have a slightly more complex heap
structure. Where escape analysis succeeds with bespoke iterators, it
is likely also to succeed with such generic iterators built on top of
cursors.)
Given that derivation, it will be reasonable, over time, to refactor
some existing iterators as cursors, splitting the two effects of the
`get` method by hand, and supplying backward-compatible iterators
automatically using generics like the above.
What have we learned?
- Value classes can incrementally improve on certain kinds of
mutable classes, such as iterators. The improvement is that
the dependence on escape analysis is removed.
- In particular, value classes can reduce heap pressure for loop,
in cases where we care about such things.
- The improvement requires recoding but is incremental in cost,
because the code for cursors and iterators is very similar.
- To benefit from cursors, for-loops which use iterators need
recoding; this recoding is also incremental in nature.
- There is a backward compatibility story for iterators in a world
where cursors are the new way to do things.
- In order to operate effectively with for-loops, value classes may
need a bit more syntax love, so they can "work like an int" when
that int is loop variable.
Since moving to cursors requires recoding for both loops and
iterators, it's fair to wonder whether we are actually stuck with
iterators, despite their issues. Of course, most code will stay as it
is, but performance sensitive code may benefit from recoding loops.
In that case, having a basic supply of cursors on hand makes sense.
In any case, simply as a thought experiment, cursors have taught
us more about how value types will work.