THIS DOCUMENT HAS BEEN SUPERSEDED AND IS PROVIDED FOR HISTORICAL CONTEXT ONLY

Data Classes for Java

Brian Goetz, February 2018

This document explores possible directions for data classes in the Java Language. This is an exploratory document only and does not constitute a plan for any specific feature in any specific version of the Java Language.

Background

It is a common (and often deserved) complaint that "Java is too verbose" or has too much "ceremony." A significant contributor to this is that while classes can flexibly model a variety of programming paradigms, this invariably comes with modeling overheads -- and in the case of classes that are nothing more than "plain data carriers", this modeling overhead can be out of line with their value. To write a simple data carrier class responsibly, we have to write a lot of low-value, repetitive code: constructors, accessors, equals(), hashCode(), toString(), etc. And developers are sometimes tempted to cut corners, such as omitting these important methods, leading to surprising behavior or poor debuggability, or pressing an alternate but not entirely appropriate class into service because it has the "right shape" and they don't want to define yet another class.

IDEs will help you write most of this code, but writing code is only a small part of the problem. IDEs don't do anything to help the reader distill the design intent of "I'm a plain data carrier for x, y, and z" from the dozens of lines of boilerplate code. And repetitive code is a good place for bugs to hide; if we can, it is best to eliminate their hiding spots outright.

We don't yet have a formal definition of "plain data carrier", but we probably "know it when we see it". Nobody thinks that SocketInputStream is just a carrier for some data; it fully encapsulates some complex and unspecified state (including a native resource) and exposes an interface contract that likely looks nothing like its internal representation.

At the other extreme, it's pretty clear that:

final class Point {
    public final int x;
    public final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // state-based implementations of equals, hashCode, toString
    // nothing else

}

is "just" the data (x, y). Its representation is (x, y); its construction protocol accepts an (x, y) pair and stores it directly into the representation; it provides unmediated access to that representation; and it derives the core Object methods from that representation.
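To make the cost concrete, here is a minimal sketch of what the elided "state-based implementations" might look like when written by hand. The class name ManualPoint and the toString() format are illustrative assumptions, not part of the original listing.

```java
import java.util.Objects;

// Hand-written carrier for (x, y) with the elided members spelled out.
// ManualPoint is a hypothetical name used only for this sketch.
final class ManualPoint {
    public final int x;
    public final int y;

    public ManualPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Two points are equal exactly when both components are equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ManualPoint)) return false;
        ManualPoint other = (ManualPoint) o;
        return x == other.x && y == other.y;
    }

    // Hash derived from the same components that equals() compares.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    // The string format is an assumption chosen for readability.
    @Override
    public String toString() {
        return "ManualPoint[x=" + x + ", y=" + y + "]";
    }
}
```

Roughly twenty lines of member declarations say nothing more than "this is a carrier for x and y", which is the design intent a reader has to reconstruct.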

Data classes in other OO languages

Other OO languages have explored compact syntactic forms for modeling data-oriented classes: case classes in Scala, data classes in Kotlin, and soon, record classes in C#. These have in common that some or all of the state of a class can be described directly in the class header (though they vary considerably in their semantics, such as constraints on the mutability or accessibility of fields, extensibility of the class, and other restrictions.) Committing in the class declaration to at least part of the relationship between state and interface enables suitable defaults to be provided for various state-related members such as constructors or Object methods. All of these mechanisms (let's call them "data classes") seek to bring us closer to the goal of being able to define Point as:

record Point(int x, int y) { }

The clarity and compactness here are surely attractive -- a Point is just a carrier for two integer components x and y, and from that, the reader immediately knows that there are sensible and correct implementations for the core Object methods, and doesn't have to wade through a page of boilerplate to be able to confidently reason about their semantics. Most developers are going to say "Well, of course I want that."

Meet the elephant

Unfortunately, such universal consensus is only syntax-deep; almost immediately after we finish celebrating the concision, comes the debate over the natural semantics of such a construct, and what restrictions we are willing to accept. Are they extensible? Are the fields mutable? Can I control the behavior of the generated methods, or the accessibility of the fields? Can I have additional fields and constructors?

Just like the story of the blind men and the elephant, developers are likely to bring very different assumptions about the "obvious" semantics of a data class. To bring these implicit assumptions into the open, let's name the various positions.

Algebraic Annie will say "a data class is just an algebraic product type." Like Scala's case classes, they come paired with pattern matching, and are best served immutable. (And for dessert, Annie would order sealed interfaces.)

Boilerplate Billy will say "a data class is just an ordinary class with better syntax", and will likely bristle at constraints on mutability, extension, or encapsulation. (Billy's brother, JavaBean Jerry, will say "these must be for JavaBeans -- so of course I get getters and setters too." And his sister, POJO Patty, remarks that she is drowning in enterprise POJOs, and hopes they are proxyable by frameworks like Hibernate.)

Tuple Tommy will say "a data class is just a nominal tuple" -- and may not even be expecting them to have methods other than the core Object methods -- they're just the simplest of aggregates. (He might even expect the names to be erased, so that two data classes of the same "shape" can be freely converted.)

Values Victor will say "a data class is really just a more transparent value type."

All of these personae are united in favor of "data classes" -- but have different ideas of what data classes are, and there may not be any one solution that makes them all happy.

Encapsulation and boundaries

While we're painfully aware of the state-related boilerplate we deal with every day, the boilerplate is just a symptom of a deeper problem, which is that all classes are asked to pay equally for the cost of encapsulation -- but not all classes benefit equally from it.

To be sure, encapsulation is essential; encapsulating our state (so it can't be manipulated without our oversight) and our representation (so it can be evolved without affecting the API contract) enables us to write code that can operate safely and robustly across a variety of boundaries.

But, not all classes value their boundaries equally. Defending these boundaries is essential for a class like KeyStore or SocketInputStream, but is of far less value for a class like Point or Person. Many classes are not concerned at all with defending their boundaries; perhaps they are private to a package or module and co-compiled with their clients, trust their clients, and have no complex invariants that need protecting. Since the cost of establishing and defending these boundaries (how constructor arguments map to state, how to derive the equality contract from state, etc) is constant across classes, but the benefit is not, the cost may sometimes be out of line with the benefit. This is what Java developers mean by "too much ceremony" -- not that the ceremony has no value, but that they're forced to invoke it even when it does not offer sufficient value.

The encapsulation model that Java provides -- where the representation is entirely decoupled from construction, state access, and equality -- is just more than many classes need. Classes that have a simpler relationship with their boundaries can benefit from a simpler model where we can define a class as a thin wrapper around its state, and derive the relationship between state, construction, equality, and state access from that.

Further, the costs of decoupling representation from API go beyond the overhead of declaring boilerplate members; encapsulation is, by its nature, information-destroying. If we see a class with a constructor that takes an argument x, and an accessor called x(), we often have only convention to tell us that they probably refer to the same thing. Relying on this is a pretty safe guess, but it's just a guess. It would be nicer if tools and library code could mechanically rely on this correspondence -- without a human having to read the spec (if there even is one!) to confirm this expectation.

Digression -- enums

If the problem is that we're modeling something simple with something overly general, simplification is going to come from constraint; by letting go of some degrees of freedom, we hope to be freed of the obligation to specify everything explicitly.

The enum facility, added in Java 5, is an excellent example of such a tradeoff. The type-safe enum pattern was well understood, and easy to express (albeit verbosely), prior to Java 5 (see Effective Java, 1st Edition, item 21.) The initial motivation to add enums to the language might have been irritation at the boilerplate required for this idiom, but the real benefit is semantic.

The key simplification of enums was to constrain the lifecycle of enum instances -- enum constants are singletons, and the requisite instance control is managed by the runtime. By baking singleton-awareness into the language model, the compiler can safely and correctly generate the boilerplate needed for the type-safe enum pattern. And because enums started with a semantic goal, rather than a syntactic one, it was possible for enums to interact positively with other features, such as the ability to switch on enums, or to get comparison and safe serialization for free.

Perhaps surprisingly, enums delivered their syntactic and semantic benefits without requiring us to give up most other degrees of freedom that classes enjoy; Java's enums are not mere enumerations of integers, as they are in many other languages, but instead are full-fledged classes, with unconstrained state and behavior, and even subtyping (constrained to interface inheritance only.)
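As an illustration of how little enums give up, the following sketch shows an enum that carries per-constant state, implements an interface, and can be switched on. Op, its symbol field, and OpDemo are hypothetical names invented for this example; IntBinaryOperator is the standard java.util.function interface.

```java
import java.util.function.IntBinaryOperator;

// An enum that is a full class: it has state, behavior, and implements an interface.
enum Op implements IntBinaryOperator {
    ADD("+") {
        @Override public int applyAsInt(int a, int b) { return a + b; }
    },
    MUL("*") {
        @Override public int applyAsInt(int a, int b) { return a * b; }
    };

    private final String symbol;   // per-constant state

    Op(String symbol) { this.symbol = symbol; }

    String symbol() { return symbol; }
}

class OpDemo {
    // Switching on enum constants is one of the features that builds on
    // the language knowing the full, fixed set of instances.
    static String describe(Op op) {
        switch (op) {
            case ADD: return "addition";
            case MUL: return "multiplication";
            default:  throw new AssertionError(op);
        }
    }
}
```

The only visible constraint is on instance creation: clients cannot call the constructor, and valueOf("ADD") always returns the same ADD instance.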

If we are looking to replicate the success of this approach with data classes, our first question must therefore be: what constraints will give us the semantic and syntactic benefits we want, and, are we willing to accept these constraints?

Why not "just" do tuples?

Some readers may feel at this point that if we "just" had tuples, we wouldn't need data classes. And while tuples might offer a lighter-weight means to express some aggregates, the result is often inferior aggregates.

Classes and their members have names; tuples and their members do not. A central aspect of Java's philosophy is that names matter; a Person with properties firstName and lastName is clearer and safer than a tuple of String and String. Classes support state validation through their constructors; tuples do not. Some data aggregates (such as ranges) have invariants that, if enforced by the constructor, can thereafter be relied upon; tuples do not offer this ability. Classes can have behavior that is derived from their state; co-locating state and derived behavior makes it more discoverable and easier to access.

For all these reasons, we don't want to abandon classes for modeling data; we just want to make modeling data with classes simpler. The major pain of using named classes for aggregates is the overhead of declaring them; if we can reduce this, the temptation to reach for more weakly typed mechanisms is greatly reduced.

Are data classes the same as value types?

With value types coming down the road through Project Valhalla, it is reasonable to ask about the overlap between (immutable) data classes and value types, and whether the intersection of data-ness and value-ness is a useful space to inhabit.

Value types are primarily about enabling flat and dense layout of objects in memory. The central sacrifice of value types is object identity; in exchange for giving up object identity (which entails giving up mutability and layout polymorphism), the runtime can elide object headers, inline values directly into other values, objects, and arrays, and freely hoist values from the heap into registers or onto the stack. The lack of layout polymorphism means we have to give up something else: self-reference. A value type V cannot refer, directly or indirectly, to another V. But value classes need not give up any encapsulation, and in fact encapsulation is essential for some applications of value types (such as "smart pointers" or references to native resources.)

On the other hand, data class instances have identity, which supports mutability (maybe) but also supports self-reference. Unlike value types, data classes are well suited to representing tree and graph nodes.
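A small sketch of the self-reference point, written in the record syntax that later shipped in Java 16 (an assumption relative to this exploratory document). TreeNode, leaf, and sum are hypothetical names.

```java
// A self-referential data class: a binary tree node whose children are nodes.
// A value type could not be declared this way, since it cannot contain itself.
record TreeNode(int value, TreeNode left, TreeNode right) {

    // Convenience factory for a node without children.
    static TreeNode leaf(int value) {
        return new TreeNode(value, null, null);
    }

    // Derived purely from the state: the sum of all values in this subtree.
    int sum() {
        int total = value;
        if (left != null) total += left.sum();
        if (right != null) total += right.sum();
        return total;
    }
}
```

Because the children are references to objects with identity, absent children can be represented as null and whole subtrees can be shared between parents.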

Each of these simplified aggregate forms -- values and data classes -- involves accepting certain restrictions in exchange for certain benefits. If we're willing to accept both sets of restrictions, we get both sets of benefits; the notion of a "value data class" is perfectly sensible for things like extended numerics or tuples.

Towards requirements for data classes

While it is superficially tempting to treat data classes as primarily being about boilerplate reduction, we prefer to start with a semantic goal: modeling data as data. If we choose our goals correctly, the boilerplate will take care of itself, and we will gain additional benefits aside from concision.

So, what do we mean by "modeling data as data", and what are we going to have to give up? What degrees of freedom that classes enjoy do such "plain" data aggregates not need, that we can eliminate and thereby simplify the model? Java's object model is built around the assumption that we want the representation of an object to be completely decoupled from its API; the APIs and behavior of constructors, accessor methods, and Object methods need not align directly with the object's state, or even with each other. However, in practice, they are frequently much more tightly coupled; a Point object has fields x and y, a constructor that takes x and y and initializes those fields, accessors for x and y, and Object methods that characterize points solely by their x and y values. We claim that for a class to be "just a plain carrier for its data", this coupling is something that can be counted upon -- that we're giving up the ability to decouple its (publicly declared) state from its API. The API for a data class models the state, the whole state, and nothing but the state.

Being able to count on this coupling drives a number of advantages. The compiler can generate sensible and correct implementations for standard class members. Clients can freely deconstruct and reconstruct aggregates, or restructure them into a more convenient form, without fear that they will discard hidden data or undermine hidden assumptions. Frameworks can safely and mechanically serialize or marshal them, without the need to provide complex mapping mechanisms. By giving up the flexibility to decouple a class's state from its API, we gain all of these benefits.

One consequence of this is that data classes are transparent; they give up their data freely to all requestors. Otherwise, their API doesn't model their whole state, and we lose the ability to freely deconstruct and reconstruct them.

Use cases for data classes

Applications are full of use cases for simple aggregates that are just wrappers for their data.

All of these applications can benefit from the nominality of classes (both of the aggregate and of the components) and the co-location of data with behavior, but have no need to model them with the full generality of objects. A simpler aggregation mechanism will do -- because they're simple data aggregates, rather than models of stateful processes.

Data classes and pattern matching

One of the big advantages of defining data classes in terms of coupling their API to a publicly specified state description, rather than simply as boilerplate-reduced classes, is that we gain the ability to freely convert a data class instance back and forth between its aggregate form and its exploded state. This has a natural connection with pattern matching; by coupling the API to the state description, there is an obvious deconstruction pattern -- whose signature is the dual of the constructor's -- which can be mechanically generated.

For example, suppose we have data classes as follows:

interface Shape { }
record Point(int x, int y) { }
record Rect(Point p1, Point p2) implements Shape { }
record Circle(Point center, int radius) implements Shape { }

A client can deconstruct a shape as follows:

switch (shape) {
     case Rect(Point(var x1, var y1), Point(var x2, var y2)): ...
     case Circle(Point(var x, var y), int r): ...
     ....
}

with the mechanically generated pattern extractors. This synergy between data classes and pattern matching makes each feature more expressive. However, a not-entirely-obvious consequence of this is that there is no such thing as truly private fields in a data class; even if the fields were to be declared private, their values would still be implicitly readable via the destructuring pattern. This would be surprising if our design center for data classes was that they are merely a boilerplate reduction tool -- but is consistent with data classes being transparent carriers for their data.
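For readers who want to run the idea, here is a hedged sketch using the record patterns that later shipped in Java 21. Two things differ from the listing above and are assumptions of this sketch: Shape is declared sealed so that the switch is exhaustive without a default branch, and ShapeDemo.describe is a hypothetical helper.

```java
// Shape is sealed here (an assumption) so the compiler can check exhaustiveness.
sealed interface Shape permits Rect, Circle { }
record Point(int x, int y) { }
record Rect(Point p1, Point p2) implements Shape { }
record Circle(Point center, int radius) implements Shape { }

class ShapeDemo {
    // Nested deconstruction: each case binds the leaf components directly.
    static String describe(Shape shape) {
        return switch (shape) {
            case Rect(Point(var x1, var y1), Point(var x2, var y2)) ->
                "rect " + Math.abs(x2 - x1) + "x" + Math.abs(y2 - y1);
            case Circle(Point(var x, var y), int r) ->
                "circle r=" + r + " at (" + x + ", " + y + ")";
        };
    }
}
```

Note that describe never calls an accessor explicitly; every component value arrives through the pattern, which is why a component cannot be meaningfully hidden.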

Data classes and externalization

Data classes are also a natural fit for safe, mechanical externalization (serialization, marshaling to and from JSON or XML, mapping to database rows, etc). If a class is a transparent carrier for a state vector, and the components of that state vector can in turn be externalized in the desired encoding, then the carrier can be safely and mechanically marshaled and unmarshaled with guaranteed fidelity, without the security and integrity risks of bypassing the constructor (as built-in serialization does). In fact, a transparent carrier need not do anything special to support externalization; the externalization framework can deconstruct the object using its deconstruction pattern, and reconstruct it using its constructor, which are already public.
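The following is one possible sketch of such a framework, not a definitive design. It assumes the reflection support that later shipped with records in Java 16 (Class.getRecordComponents() and java.lang.reflect.RecordComponent) in place of a deconstruction pattern; Person, Externalizer, toMap, and fromMap are hypothetical names, and a Map stands in for the external encoding.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.RecordComponent;
import java.util.LinkedHashMap;
import java.util.Map;

record Person(String firstName, String lastName, int age) { }

class Externalizer {
    // Deconstruct: read every state component through its accessor.
    static Map<String, Object> toMap(Record r) {
        Map<String, Object> out = new LinkedHashMap<>();
        try {
            for (RecordComponent rc : r.getClass().getRecordComponents()) {
                out.put(rc.getName(), rc.getAccessor().invoke(r));
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    // Reconstruct: call the constructor whose signature matches the state
    // description, so any validation in that constructor still runs.
    static <T extends Record> T fromMap(Class<T> type, Map<String, Object> state) {
        RecordComponent[] components = type.getRecordComponents();
        Class<?>[] types = new Class<?>[components.length];
        Object[] args = new Object[components.length];
        for (int i = 0; i < components.length; i++) {
            types[i] = components[i].getType();
            args[i] = state.get(components[i].getName());
        }
        try {
            Constructor<T> ctor = type.getDeclaredConstructor(types);
            return ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The carrier itself contains no externalization code at all; everything the framework needs follows from the state description.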

Formalizing the requirements

Let's formalize this notion a bit, so we can use this to evaluate potential design choices. We say a class C is a transparent carrier for a state vector S if:

This means that C has a constructor (or factory) which accepts the state vector S, and accessors (or a deconstruction pattern) which produce the components of S, and that for any valid instance, extracting the state vector and then reconstructing an instance from that state vector produces an instance equivalent to the original. Similarly, constructing instances from equivalent state vectors produces equivalent instances. (Mathematically inclined readers will spot the embedding-projection pair.) Moreover, any additional operations on equivalent instances produce equivalent results and preserve the equivalence of the instances.
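The first two conditions can be sketched as executable checks. This is an illustration under stated assumptions rather than the formal definition: the state vector is modeled as an int[], the carrier is a hypothetical record Size (using the record syntax that later shipped in Java 16), and ctor, dtor, roundTrips, and respectsEquality are names invented here.

```java
import java.util.Arrays;

record Size(int width, int height) { }

class CarrierLaws {
    // Projection: extract the state vector {width, height} from an instance.
    static int[] dtor(Size s) { return new int[] { s.width(), s.height() }; }

    // Embedding: build an instance from a state vector.
    static Size ctor(int[] state) { return new Size(state[0], state[1]); }

    // Extracting the state and reconstructing yields an equivalent instance.
    static boolean roundTrips(Size s) { return ctor(dtor(s)).equals(s); }

    // Equivalent state vectors yield equivalent instances.
    static boolean respectsEquality(int[] s, int[] t) {
        return !Arrays.equals(s, t) || ctor(s).equals(ctor(t));
    }
}
```

A class that stored hidden state outside the state vector, or whose equals() consulted anything else, would fail one of these checks for some instance.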

These invariants are an attempt to capture our requirements: that the carrier is transparent, and that there is a simple and predictable relationship between the class's representation, its construction, and its destructuring -- that the API is the representation.

Note that so far, we haven't said anything about syntax or boilerplate; we've only talked about constraining the semantics of the class to be a simple carrier for a specified state vector. But these constraints allow us to safely and mechanically generate the boilerplate for constructors, pattern extractors, accessors, equals(), hashCode(), toString(), externalization, and more.

A starting point

The simplest -- and most draconian -- model for data classes is to say that a data class is a final class with public final fields for each state component, a public constructor and deconstruction pattern whose signature matches that of the state description, and state-based implementations of the core Object methods, and further, that no other members (or explicit implementations of the implicit members) are allowed. This is essentially the strictest interpretation of a nominal tuple.

This starting point is simple and stable -- and nearly everyone will find something to object to about it. So, how much can we relax these constraints without giving up on the semantic benefits we want? Let's look at some directions in which the draconian starting point could be extended, and their interactions.

Interfaces and additional methods

One obvious direction for relaxing this model is to allow data classes to implement interfaces or to declare methods that operate on their state. No one could claim that the following class violates the spirit of data-class-ness:

record Point(int x, int y) {
    boolean isOrigin() {
        return x == 0 && y == 0;
    }
}

The method isOrigin() merely computes a derived property of the state; the obvious place to put this is in the class that models the state. Similarly, no one could object to having Point implement Comparable<Point>.

However, even allowing additional methods is stepping onto a slippery slope; if the method's behavior depends on anything other than the state of the object (including depending on the identity of the instance), then we've violated our "nothing but the state" rule.

Overriding implicit members

The default implementations of constructors and Object methods are likely to be what is desired in a lot of cases, but there may be cases where we want to refine these further, such as a constructor that enforces validity constraints, or an equals() method that compares array components by content rather than delegating to Object.equals(). The natural way to denote this would be to declare explicit versions of these members, and have this suppress the generation of the implicit member.

Allowing refined implementations expands the range of useful data classes, but again exposes us to the risk that the explicit implementations won't conform to the requirements of a plain data carrier.

The most common case of overriding an implicit member is likely to be overriding the constructor, to validate that the state conforms to its invariants. Data classes without representational invariants should not require an explicit constructor, but ideally it should be possible to specify an explicit constructor that enforces invariants -- without having to write out all the constructor boilerplate by hand.

Additional constructors

Related to additional methods are additional constructors. Data classes clearly need a constructor whose signature matches that of the state description (call this the principal constructor); otherwise, we couldn't freely deconstruct and reconstruct their instances. But it may also be desirable to offer additional constructors, which can derive the state from some alternate form. On the surface, this seems reasonable -- so long as the constructor is not squirreling away data that is effectively part of the object state, but not part of the state description.

Additional fields

Related to the previous item is the question of whether a data class can have additional fields beyond its state description. And again, there are cases when this is harmless, and cases when this completely violates our requirements.

An additional field that merely caches a derived property of the state description (whether computed eagerly or lazily) is fine, because it is still logically "nothing but the state". For example:

record Name(String first, String last) {
    private String firstAndLast;
    
    Name(String first, String last) {
        firstAndLast = first + " " + last;
    }
    
    public String firstAndLast() { return firstAndLast; }
}

is well within the spirit of the requirements; the existence of the firstAndLast field is purely an implementation detail, but the behavior of the Name class is derived solely from its state description.

On the other hand, squirreling away additional state which is not derived from the state description, and which affects the user-visible behavior of its methods (especially equals() and hashCode()!), would totally violate the goal that a data class is "just" a carrier for its state. Similarly, if they affected the behavior of mutative methods, this would undermine the requirement that performing identical actions on equal carriers results in equal carriers.

So, even more so than with explicit methods or constructors, additional fields are a significant risk item for undermining the goal that a data class models "the state, the whole state, and nothing but the state."

Extension

Can a data class extend an ordinary class? Can a data class extend another data class? Can a non-data class extend a data class? Again, our model of "plain data carrier" can help us evaluate these.

Extension between data classes and non-data classes, or between concrete data classes, seems immediately problematic. If a data class extends an ordinary class, we would have no control over the equals() contract of the superclass, and therefore no reason to believe that the desired invariants hold.

Similarly, if another class (data or not) were to extend a data class, we'd almost certainly violate the desired invariants. Consider:

record C(STATE_DESCR) { }

class D extends C { 
   ...
}

D d = ...
switch (d) { 
    case C(var STATE_DESCR): assert d.equals(new C(STATE_DESCR));
    ...
}

Deconstructing a C and reconstructing it should yield an equivalent instance -- but in this case, it will not. D is not a plain carrier for C's state description, as it has at least some additional typestate, and perhaps some additional state, which may cause the equality check to fail. The same argument can be made for a concrete data class extending another concrete data class (though we may be able to rescue abstract data classes.)
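The failure can be reproduced today with ordinary classes. In this sketch, Base stands in for the data class C and Derived for D; both names, and the choice of a getClass()-based equals(), are assumptions made for illustration. With an instanceof-based equals() and no extra state in the subclass, the check could pass, which is why the text says the equality check "may" fail.

```java
// An ordinary class standing in for the data class C, with state-based equals().
class Base {
    final int x;

    Base(int x) { this.x = x; }

    @Override
    public boolean equals(Object o) {
        // Compares runtime classes, so a subclass instance never equals a plain Base.
        return o != null && o.getClass() == getClass() && ((Base) o).x == x;
    }

    @Override
    public int hashCode() { return Integer.hashCode(x); }
}

// Adds no fields, yet its runtime type is extra "typestate" beyond C's description.
class Derived extends Base {
    Derived(int x) { super(x); }
}

class ExtensionDemo {
    // Deconstruct via the state description, then reconstruct as a Base.
    static Base rebuild(Base b) { return new Base(b.x); }
}
```

Rebuilding a plain Base yields an equal instance, but rebuilding a Derived does not: the reconstruction has silently dropped the information that the original was a Derived.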

Mutability

One of the thorniest problems is whether we allow mutability, and how we handle the consequences if we do. The simplest solution -- and surely a tempting one -- is to insist that state components of data classes be final. While this is an attractive opening position, this is likely to be too limiting; while immutable data is surely better-behaved than mutable data, mutable data certainly qualifies as "data", and there are many legitimate uses for mutable "plain data" aggregates. (And, even if we required that data class fields always be final, this only gives us shallow immutability -- we still have to deal with the possibility that the contents are more deeply mutable.)

It is worth noting that similar languages that went down the data-class path -- including Scala, Kotlin, and C# -- all settled on not forcing data classes to be immutable, though it's almost certain that their designers initially considered doing so. (Even if we allow mutability, we still have the option of nudging users towards finality, say by making the default for data class fields final, and providing a way to opt out of finality for individual fields.)

Field encapsulation and accessors

Public fields make everyone nervous, even public final fields. If fields can be nonfinal, they certainly need some encapsulation support; even if they cannot, it still may be desirable to encapsulate the field and instead provide a read accessor, to support the uniform access principle.

Encapsulating fields and mediating access to state may serve to protect integrity boundaries (rejecting writes that would violate representational invariants), to detect when writes have happened so that listeners can be notified or cached state can be adjusted, or to make defensive copies on reads for mutable components such as arrays. However, we must be careful to avoid undermining the transparency of data classes; each state component must be readable somehow.

No discussion involving boilerplate (or any question of Java language evolution, for that matter) can be complete without the subject of field accessors (and properties) coming up. On the one hand, accessors constitute a significant portion of the boilerplate in existing code; on the other hand, the JavaBean-style getter/setter conventions are already badly overused. Mutability may drag with it encapsulation, and encapsulation plus transparency may in turn drag accessors with them, but we should be mindful of the purpose of these accessors; it is not to abstract the representation from the API, but at most to enable rejection of bad values and provide syntactic uniformity of access.

(Without rehashing the properties debate, one fundamental objection to automating JavaBean-style field accessors is that it would take what is at best a questionable -- and certainly overused -- API naming convention and burn it into the language. Unlike the core methods like Object.equals(), field accessors do not have any special treatment in the language, and so names of the form getSize() should not either. Also, while equally tedious, writing (and reading) accessor declarations is not nearly as error-prone as writing equals().)

Arrays and defensive copies

Array-valued fields are particularly problematic, as there is no way to make them deeply immutable. But they're really just a special case of mutable objects which do not provide unmodifiable views. APIs that encapsulate arrays frequently make defensive copies when they're on the other side of a trust boundary from their users. Should data classes support this? Unfortunately, this also falls afoul of our requirements for data classes.

Because the equals() method of arrays is inherited from Object, which compares instances by identity, making defensive copies of array components in read accessors would violate the invariant that destructuring an instance of a data class and reconstructing it yields an equal instance -- the defensive copy and the original array will not be equal to each other. (Arrays are simply a bad fit for data classes, as they are mutable, but unlike List their equals() method is based on identity.) We'd rather not distort data classes to accommodate arrays, especially as there are ample alternatives available.
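A short sketch of the problem, using the record syntax that later shipped in Java 16, where the generated equals() compares reference components with Objects.equals(). Samples, Names, and ArrayCopyDemo are hypothetical names; the helper methods simulate a read accessor that returns a defensive copy followed by reconstruction.

```java
import java.util.List;

record Samples(int[] values) { }        // array component: compared by identity
record Names(List<String> values) { }   // List component: compared by content

class ArrayCopyDemo {
    // Simulates "read a defensive copy, then reconstruct" for an array component.
    static Samples rebuildFromCopy(Samples s) {
        return new Samples(s.values().clone());
    }

    // The same round trip for a List component.
    static Names rebuildFromCopy(Names n) {
        return new Names(List.copyOf(n.values()));
    }
}
```

The rebuilt Samples holds the same numbers but is not equal to the original, whereas the rebuilt Names is; an unmodifiable List is one of the ample alternatives that keeps the round-trip invariant intact.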

Thread-safety

Allowing mutable state in data classes raises the question of whether, and how, they can be made thread-safe. (Note that thread-safety is not a requirement for mutable classes; many useful classes, such as ArrayList, are not thread-safe.) Thread-safe classes encapsulate a protocol for coordinating access to their shared mutable state. But, data classes disavow most forms of encapsulation. (Immutable objects are implicitly thread-safe, because there is no shared mutable state to which access need be coordinated.)

Like most non-thread-safe classes, instances of mutable data classes can still be used safely in concurrent environments through confinement, where the data class instance is encapsulated within a thread-safe class. While it might be possible to nibble around the edges to support a few use cases, ultimately data classes are not going to be the right tool for creating thread-safe mutable classes, and rather than reinventing all the flexibility of classes in a new syntax, we should probably just guide people to writing ordinary classes in these cases.
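Confinement might look like the following sketch, written with ordinary classes. MutablePoint plays the role of a mutable plain data aggregate and SafePosition is the thread-safe class that owns it; both names and the snapshot-copy policy are assumptions of this example.

```java
// A deliberately mutable, non-thread-safe plain data aggregate.
final class MutablePoint {
    int x;
    int y;

    MutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

// A thread-safe class that confines one MutablePoint and never lets it escape.
final class SafePosition {
    private final MutablePoint point = new MutablePoint(0, 0);

    // All access to the confined instance happens while holding this object's lock.
    synchronized void moveBy(int dx, int dy) {
        point.x += dx;
        point.y += dy;
    }

    // Hands out a copy, so callers never obtain a reference to the confined instance.
    synchronized MutablePoint snapshot() {
        return new MutablePoint(point.x, point.y);
    }
}
```

The synchronization protocol lives entirely in the ordinary class, which is where encapsulation is available to enforce it; the data aggregate stays simple.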

A concrete proposal

The central compromise we make for data classes is that we give up the ability to decouple the API semantics from the state description, to define non-state-based semantics for equality and hashing, and to hide state from curious readers. In return, we gain the ability for the compiler to generate key class members, as well as the ability to safely and mechanically copy, serialize, and externalize data classes.

What don't we have to give up to get this? Quite a lot. Data classes can be generic, can implement interfaces, can have static fields, and can have constructors and methods, all without compromising these commitments. To start, let's say that

record Point(int x, int y) { }

desugars to

final class Point extends java.lang.DataClass {
    final int x;
    final int y;
    
    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // destructuring pattern for Point(int x, int y)
    // state-based equals, hashCode, and toString
    // public read accessors x() and y()
}

Any interfaces implemented by the data class are lifted onto the desugared class in the obvious way, as are any type variables, static fields, static methods, and instance methods.
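For comparison, here is roughly what the implicit members might look like if written by hand in today's Java. This is a sketch, not the specified desugaring: the java.lang.DataClass superclass and the destructuring pattern are omitted because neither exists in today's Java, the name PointByHand is illustrative, and the exact toString() format is an assumption.

```java
import java.util.Objects;

// Hand-written approximation of the members the compiler would generate.
final class PointByHand {
    final int x;
    final int y;

    public PointByHand(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // public read accessors, named after the state components
    public int x() { return x; }
    public int y() { return y; }

    // state-based equality: two points are equal iff all components are equal
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PointByHand)) return false;
        PointByHand other = (PointByHand) o;
        return x == other.x && y == other.y;
    }

    // state-based hash, consistent with equals()
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    // assumed format; the proposal does not fix one
    @Override
    public String toString() {
        return "PointByHand[x=" + x + ", y=" + y + "]";
    }
}
```

Everything below the constructor is derivable from the state description (int x, int y), which is what makes it a candidate for generation.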

Explicit implementations of implicit methods. Allowing explicit implementations of implicit members -- especially equals() and hashCode() -- is a tradeoff; they allow greater flexibility in using data classes, but increase the risk that the invariants will be violated. As a starting point, we propose that the user not be able to override equals() and hashCode(), that overrides of read accessors are permitted but the returned value must be equals() to the appropriate field, and that toString() can be overridden as desired. If the data class provides an explicit implementation of any allowable implicit member, it is used in place of the implicit member.
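The accessor rule can be illustrated in today's Java with a hypothetical Team carrier (the class name and its members component are assumptions made for this sketch). The explicit accessor returns an unmodifiable view instead of the field itself; because List.equals() compares element by element, the returned value is still equals() to the field, which is what the proposed rule would require.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical carrier with one state component, written as an ordinary class.
// equals(), hashCode(), and toString() are omitted to keep the sketch short.
final class Team {
    final List<String> members;

    Team(List<String> members) {
        this.members = members;
    }

    // Explicit read accessor: callers get a read-only view,
    // but the view is equals() to the underlying field.
    public List<String> members() {
        return Collections.unmodifiableList(members);
    }
}
```

An accessor that returned, say, a sorted or filtered copy would break the rule, since the result would no longer be equals() to the field.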

Explicit constructors. If the data class imposes no invariants, no constructor declaration is needed, and the class acquires a default constructor whose signature matches the state description. An overridden default constructor must delegate to the default principal constructor, as in:

record Range(int lo, int hi) {

    // Explicit default constructor
    @Override
    public Range(int lo, int hi) {
        // validation logic
        if (lo > hi)
            throw new IllegalArgumentException(...);
            
        // delegate to default constructor
        default.this(lo, hi);
    }
}

The default.this() call invokes the constructor that would otherwise have been auto-generated for this data class (including the default super constructor); this avoids the need to write out the tedious and error-inviting sequence of this.x = x assignments. (The rules about statements preceding calls to super or this constructors can be relaxed, and the this reference treated as definitely unassigned for statements preceding the default.this() call.)

Additional constructors may be explicitly declared -- but they must delegate to the default constructor (via the usual this() mechanism.)
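For comparison, the same Range written in today's Java, without default.this(), has to spell out the field assignments. This is a sketch: the name RangeByHand, the error message, and the single-argument convenience constructor are illustrative additions, not part of the proposal.

```java
// Hand-written equivalent of the Range example, in today's Java.
final class RangeByHand {
    final int lo;
    final int hi;

    RangeByHand(int lo, int hi) {
        // validation logic
        if (lo > hi)
            throw new IllegalArgumentException("lo (" + lo + ") > hi (" + hi + ")");

        // these assignments are what default.this(lo, hi) would stand in for
        this.lo = lo;
        this.hi = hi;
    }

    // additional constructor, delegating via the usual this() mechanism
    RangeByHand(int point) {
        this(point, point);
    }
}
```

Because the additional constructor funnels through the two-argument one, the lo <= hi check applies no matter how an instance is created.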

Fields. Given a data class

record Foo(int x, int y) { ... }

we will lift the state components (int x, int y) onto fields of Foo -- along with any annotations specified on the state components. (The Javadoc for data classes will allow class parameters to be documented with the @param tag, as method parameters are now.)

The most restrictive approach would be that fields are always final; we could also consider making them final by default, but allowing mutability to be supported by opting in via a mutability modifier (non-final, unfinal, mutable -- bikeshed to be painted later.) Similarly, the most restrictive approach would be for them to always have package accessibility (or protected for fields of abstract data classes); a less restrictive approach would be to treat these as defaults, but allow them to optionally be declared public.

With respect to additional fields beyond those in the state description, the most restrictive approach would be to prohibit them; it seems inevitable that such additional state would flow into equality or other essential behavior, undermining the invariants of data classes. Relaxing this constraint would likely require tightening others, such as prohibiting an explicit implementation of equals() and hashCode(), and other constraints on constructors (such as requiring that the call to the default constructor appear last.)

To leave room for evolution, as a starting point we will take the most restrictive choices on all of these -- no additional fields, no override of equals() and hashCode(), and flow restrictions on constructors -- so that we have the flexibility to choose later which of these makes the most sense to relax.

Extension. We've already noted that arbitrary extension is problematic, but it should be practical to maintain inheritance from abstract data classes to other data classes. A sensible balance regarding extension is:

This allows us to declare families of algebraic data types, such as the following partial hierarchy describing an arithmetic expression:

interface Node { }

abstract record BinaryOpNode(Node left, Node right) 
    implements Node;

record PlusNode(Node left, Node right) 
    extends BinaryOpNode(left, right);

record MulNode(Node left, Node right) 
    extends BinaryOpNode(left, right);
      
record IntNode(int constant) implements Node;

When a data class extends an abstract data class, the state description of the superclass must be a prefix of the state description of the subclass:

abstract record Base(int x);
record Sub(int x, int y) extends Base(x);

The arguments to the extends Base() clause are a list of names of state components of Sub, not arbitrary expressions; they must be a prefix of the state description of Sub, and must match the state description of Base. This suppresses the local declaration of inherited fields, and also plays into the generation of the default principal constructor (which arguments are passed up to the superclass constructor, vs. which are used to initialize local fields.) These rules are sufficient for implementing algebraic data type hierarchies like the Node example above.
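Written out by hand in today's Java, the Base/Sub pair might look roughly as follows. This is a sketch under the assumption that the generated constructor passes the prefix component x up to the superclass and initializes only y locally; the names BaseByHand and SubByHand are illustrative, and the equals() and hashCode() shown are one plausible state-based implementation.

```java
import java.util.Objects;

// Stands in for: abstract record Base(int x);
abstract class BaseByHand {
    final int x;

    BaseByHand(int x) {
        this.x = x;
    }

    public int x() { return x; }
}

// Stands in for: record Sub(int x, int y) extends Base(x);
final class SubByHand extends BaseByHand {
    // x is inherited, so only y is declared locally
    final int y;

    SubByHand(int x, int y) {
        super(x);     // the prefix of the state description goes to the superclass
        this.y = y;   // the remaining components initialize local fields
    }

    public int y() { return y; }

    // state-based equality over the full state description (x, y)
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SubByHand)) return false;
        SubByHand other = (SubByHand) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

The prefix rule is what lets the split between super(x) and the local assignments be determined mechanically from the two state descriptions.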

Accessors. Data classes are transparent; they readily give up their state through the destructuring pattern. To make this explicit, and to support the uniform access principle for state, data classes implicitly acquire public read accessors for all state components, whose name is the same as the state component. (We will separately explore how arbitrary classes, that do not meet the requirements for data classes, might also benefit from accessor generation.) If write accessors are desired, they can be provided explicitly -- data classes will not bring these automatically.

Reflection. While our implementation is essentially a desugaring into a mostly ordinary class with fields and methods, we don't actually want to erase the data-ness completely; compilers need to be able to identify which classes are data classes, and what their state descriptions are, so they can enforce any restrictions on how they interact with other classes -- so this information must be present in the class file. This can be reflected on Class with methods such as isDataClass() and a method to return the ordered list of fields that are the class's state description.
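The kind of information involved can be sketched in today's Java with a runtime-retained annotation. Nothing here is part of the proposal: StateDescription, MarkedPoint, and the DataClasses helper with its isDataClass() and stateFields() methods are invented names, standing in for whatever the class file and Class would actually carry. The component names are recorded explicitly because Class.getDeclaredFields() makes no promise about the order of the fields it returns.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical marker recording the ordered state description of a class.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface StateDescription {
    String[] value();
}

@StateDescription({"x", "y"})
final class MarkedPoint {
    final int x;
    final int y;

    MarkedPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

// Hypothetical helper standing in for methods that might live on Class.
final class DataClasses {
    private DataClasses() { }

    static boolean isDataClass(Class<?> c) {
        return c.isAnnotationPresent(StateDescription.class);
    }

    // Returns the state fields in the order given by the state description.
    static List<Field> stateFields(Class<?> c) {
        StateDescription sd = c.getAnnotation(StateDescription.class);
        if (sd == null)
            throw new IllegalArgumentException(c + " is not marked as a data class");
        List<Field> fields = new ArrayList<>();
        for (String name : sd.value()) {
            try {
                fields.add(c.getDeclaredField(name));
            } catch (NoSuchFieldException e) {
                throw new IllegalStateException("no field for component " + name, e);
            }
        }
        return fields;
    }
}
```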

Compatibility and migration requirements

It is important that existing classes that meet the requirements for data classes -- of which there are many -- should be able to be compatibly migrated to data classes so that they can benefit from the semantic transparency and syntactic concision of data classes. If an existing class which meets the requirements wants to migrate to be a data class, it should be able to do so in a source- and binary-compatible manner by simply exposing its state through the class header and removing redundant field, constructor, and Object method declarations. The reverse migration is also possible; a class that is a data class can compatibly migrate to a regular class by providing equivalent explicit implementations of the implicit members. However, to be behaviorally compatible, it must continue to conform to the specification of DataClass.

Once a data class is published, changing its state description will have compatibility consequences for clients that are outside of the maintenance boundary. The binary- and source-compatibility impact of such changes can be partially mitigated by declaring new constructors and deconstruction patterns that follow the old state description (so that existing clients can construct and deconstruct them), but depending on existing usage, it may be hard to mitigate the behavioral compatibility issues, as the resulting class may well run afoul of the invariants of plain data carriers from the perspective of legacy clients, such as when deconstructing and reconstructing a data class using an old state description. For data classes operating fully within a maintenance boundary, it may be practical to compatibly refactor both a data class and its clients when changing the state description.
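The behavioral hazard can be sketched in today's Java. Assume a hypothetical Point3 whose state description grew from (x, y) to (x, y, z) and which keeps a two-argument constructor for legacy clients; the class names and the default of 0 for z are assumptions made for illustration. A legacy client that takes an instance apart and rebuilds it using the old description silently drops z, so the round trip no longer yields an equal object.

```java
import java.util.Objects;

// Hypothetical carrier whose state description was (x, y) and is now (x, y, z).
final class Point3 {
    final int x;
    final int y;
    final int z;

    Point3(int x, int y, int z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Retained so that clients written against the old (x, y) description still work.
    Point3(int x, int y) {
        this(x, y, 0);
    }

    public int x() { return x; }
    public int y() { return y; }
    public int z() { return z; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point3)) return false;
        Point3 other = (Point3) o;
        return x == other.x && y == other.y && z == other.z;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y, z);
    }
}

// Hypothetical legacy client: it knows only about x and y.
final class LegacyClient {
    static Point3 copy(Point3 p) {
        return new Point3(p.x(), p.y());
    }
}
```

The old constructor keeps LegacyClient compiling and linking, but copy() is no longer a faithful copy for any point with a nonzero z.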

Summary

The key question in designing a facility for "plain data aggregates" in Java is identifying which degrees of freedom we are willing to give up. If we try to model all the degrees of freedom of classes, we just move the complexity around; to gain some benefit, we must accept some constraints. We think that the sensible constraints to accept are disavowing the use of encapsulation for decoupling representation from API, and for mediating read access to state; in turn, this provides significant syntactic and semantic benefits for classes which can accept these constraints.