Equivalence

Liam Miller-Cushon, April 2019

Background

Correctly implementing equals() and hashCode() requires too much ceremony.

Implementations are time-consuming to write by hand and, worse, expensive to maintain. IDEs can help generate the initial code, but once generated that code needs to be read, and debugged, and maintained as the class changes. Over time these methods become a place for bugs to hide (see the appendix for real-world examples).

Consider all the boilerplate required to correctly implement equals() and hashCode() for a trivial Point class:

class Point {
  final int x;
  final int y;

  Point(int x, int y) {
    this.x = x;
    this.y = y;
  }

  @Override public boolean equals(Object other) {
    if (!(other instanceof Point)) {
      return false;
    }
    Point that = (Point) other;
    return x == that.x && y == that.y;
  }

  @Override public int hashCode() {
    return Objects.hash(x, y);
  }
}

Objective

The goal of this proposal is to create a library solution to help produce readable, correct, and performant implementations of equals() and hashCode().

A secondary use case is to provide new definitions of equivalence for existing types, which differ from those types’ equals() and hashCode() implementations.

The API will likely consist of an interface with a method for testing equality and computing hashCodes for instances of the type.

WARNING: All API details in examples are very much non-final, and included only to illustrate the idea.

interface Equivalence<T> {
  boolean equivalent(T self, Object other);
  int hash(T self);
}

With that hypothetical API, our Point example might look something like this:

class Point {
  int x;
  int y;

  private static final Equivalence<Point> EQ =
      Equivalence.of(Point.class, p -> p.x, p -> p.y);

  @Override public boolean equals(Object other) {
    return EQ.equivalent(this, other);
  }

  @Override public int hashCode() {
    return EQ.hash(this);
  }
}

In the future we expect value classes like Point to become records, but there will always be some cases where a class needs an equals and hashCode implementation and can’t be migrated to a record. This proposal is complementary to records, and would help avoid ever needing to hand-implement equals and hashCode.

Non-goals of this document

There are some opportunities here to make the performance competitive to (or better than) hand-written implementations through language extensions or compiler support, but that is out of scope for this document. The goal is to determine what the best version of the feature we could realize purely as a library is.

It’s possible that a future version of the language will support ‘field references’, such as Foo::x where x is a field. The Equivalence API would dovetail nicely with that feature and might help provide motivation for it, but the details of that feature are out of scope for this document.

Requirements

Should the API support both equals() and hashCode(), or focus solely on equals()?

An advantage of supporting both equals() and hashCode() is that implementing hashCode and equals avoids a common source of bugs. Implementations of hashCode that consider more state than the corresponding equals function are incorrect. Having a single canonical list of state to consider in both equals and hashCode isn’t just about avoiding boilerplate, it’s about

correctness.

(Being able to share the list of state used by equals and hashCode with a Comparator implementation is potentially interesting; see “What’s the relationship to Comparator?” below)

Should the API support custom comparison functions?

One option is for the API to always use Object.equals and Object.hashCode when testing the provided state. Another is to allow custom comparators to be associated with derived state, for example we might want to perform a case-insensitive comparison on a String field.

private static final Equivalence<Point> EQ =
    Equivalence.forClass(Person.class)
        .andThen(Person::age)
        .andThen(Person::name, CASE_INSENSITIVE_EQ); // also of type Equivalence

(Another example where custom comparisons will be helpful is with arrays, where we often want Arrays.deepEquals and Object.deepHashCode instead of Object.equals and Object.hashCode. Arrays are a common-enough special case that it’s natural to discuss preferential treatment in the API, which we do below.)

Facilitate omitting state from hashCode?

The state included in a correct implementation of hashCode must be a subset of the state considered by equals. Sometimes using only a proper subset in hashCode produces a good-enough hash that is faster to compute.

Support for this feature falls out of custom comparisons discussed above. For example it might look something like:

private static final Equivalence<Point> EQ =
    Equivalence.forClass(Point.class)
        .andThen(Point::x)
        .andThen(Point::y, Equivalence.using(Objects::equals, x -> 0));

We could consider adding more explicit support to the API for this-use case (for example: Equivalence.forClass(Point.class)andThen(Point::x).butNotHashing(Point::y), but we think that level of support is unnecessary. It’s an uncommon use-case (the best practice for hash functions is to avoid trivially constructible collisions), and it’s already possible without additions to the API.

Is the hash reduce function customizable?

The time-honoured tradition in many hashCode() implementations has been to combine the hash of each piece of state using (x, y) -> 31 * x + y. This is usually a fine choice; we don’t see a compelling reason to support customization.

Either way, we should absolutely not specify the exact behavior of the default hash reduce function, to allow future improvements to it.

(One slightly aggressive way to ensure that would be to have optional per-JVM-invocation randomization of the hash seed, which could be enabled for testing. Google has been randomizing hash iteration order in our tests for a few years with good results.)

Does equals use instanceof or getClass()?

There’s a choice between implementing equals using instanceof, or using getClass(), or offering the choice of either. We want to try to avoid philosophical debates on which is more correct.

Fortunately we think there’s an easy way to decide this. instanceof is the more permissive default, since the user can still perform an explicit getClass() check in the Equivalence chain, or as a guard before calling Equivalence.equals, for example:

this.getClass() == o.getClass() && EQ.equivalent(this, o);

The reverse is not possible.

What happens to null?

equivalent(nonNull, null) must safely return false in order to be useful for implementing Object.equals(), because of the symmetry requirement.

equivalent(null, null) should probably return true rather than throw, to be as consistent and unsurprising as possible.

What’s the relationship to Comparator?

There are some clear similarities: both Comparator and Equivalence allow extracting state from an instance (for use with compareTo and equals/hashCode, respectively).

There are also some reasons why it may be necessary to approach them as completely separate APIs, and not have one be a generalization of the other.

Comparator can provide half of an Equivalence implementation–x.compareTo(y) == 0 is an OK implementation of equivalent–but it can’t implement hashCode. So making Comparator extend Equivalence would force it to throw UnsupportedOperationException for hashCode.

Alternatively, Equivalence has some of the pieces of a Comparator implementation, including the state that you might want to test in a comparison function. However having Equivalence support comparison would create overlap with Comparator, and the state you want to compare might only be a subset of the state considered by equals and hashCode.

A third approach would be to have a way to create both an Equivalence and Comparator at the same time, by providing a single list of state extractors. This might require having a common super-type, which adds complexity and could muddy the two concepts.

Design Questions

What is the name of the API?

Two options that have been discussed so far are:

  1. Equalator, which references Comparator
  2. Equivalence, since an instance of this type is an equivalence relation.

Our opinion is that coining a new word is unnecessary, when a well-known word exists already in the world of mathematics.

Should arrays be compared for content equality or reference equality?

NOTE: This question is about default behavior; the ability to customize the equals and hashCode functions used for a specific field ensures it will always be possible to make an explicit choice.

There are at least two schools of thought here, each with strong arguments. This document isn’t going to settle the debate, just present the options.

An argument for detecting arrays and automatically using Arrays.{deepEquals,deepHashCode} is that using Object.{equals,hashCode} on arrays is effectively always a bug, so we’re doing users a favor by recognizing and fixing that bug. (Experience with static analysis to disallow calling Object.{hashCode,equals} on arrays has shown this is effectively always the behavior users want.)

An argument against is that it complicates the mental model for arrays. No matter what, users need to learn that arrays use reference equality. When we do them a favor in this context only, we pollute that knowledge, which has consequences. Note that Kotlin chose this approach.

Specialized custom comparisons

Should we try to avoid boxing/varargs overhead in Equivalence, for example by having specialized interfaces like IntEquivalence and overloads of the builder methods like andThenInt(IntFunction) and andThenInt(IntFunction, IntEquivalence)?

This might be necessary to get the desired performance in some situations. On the other hand, it adds significantly to the complexity of the API.

One alternative is to investigate translation strategies that achieve similar performance without surfacing that complexity in the Java API.

equivalent(T, Object) or equivalent(T, T)

There are two candidates for the signature of the equivalent function in Equivalence: equivalent(T, Object), and equivalent(T, T).

Using equivalent(T, Object) allows implementing Object#equals with less ceremony. We anticipate a larger fraction of uses of Equivalence being to implement Object#equals than the fraction of Comparators that implement compareTo, which is an argument for biasing towards this use-case. (The ease of implementing Object#equals is especially significant when used together with concise method bodies.)

public boolean equals(Object other) {
  return EQ.equivalent(this, other);
}

or:

public boolean equals(Object other) = EQ::equivalent;

The advantage of equivalent(T, T) is that uses other than implementing Object#equals are cleaner, and provide some additional type-safety. It also avoids a decision about using getClass() vs. instanceof, since the type check and case will happen separately.

public boolean equals(Object other) {
  return other instanceof Foo that && EQ.equivalent(this, that);
}

or:

public boolean equals(Object other) ->
    other instanceof Foo that && EQ.equivalent(this, that);

Another option would be to use equivalent(T, T) and convert to an Equivalence<Object> before using it to implement Object.equals, to avoid the use-case casts. (But then the awkwardness is at the declaration instead, and it sacrifices additional type-safety.)

Appendix

Straw implementation

just a sketch to illustrate the idea

interface Equivalence<T> {

 boolean equivalent(T self, Object other);

 int hash(T self);

 static <T> Equivalence<T> of(Class<T> clazz, Function<T, ?>... decomposers) {
   return new Equivalence<T>() {
     public boolean equivalent(T self, Object other) {
       if (!clazz.isInstance(other)) {
         return false;
       }
       T that = clazz.cast(other);
       return Arrays.stream(decomposers)
           .allMatch(d -> Objects.equals(d.apply(self), d.apply(that)));
     }

     public int hash(T self) {
       return Arrays.stream(decomposers)
           .map(d -> Objects.hashCode(d.apply(self)))
           .reduce(17, (x, y) -> 31 * x + y);
     }
   };
 }
}

Examples of bugs in equals and hashCode implementations

We have observed a wide array of bugs in implementations of equals and hashCode methods, often in the process of rolling out static analysis to prevent these issues.

Some of them seem obvious in hindsight, and unlikely to occur in code written by experienced Java developers, but invariably they have shown up in real-world code. One factor is that because equals and hashCode methods are seen as boilerplate they often aren’t reviewed with a lot of scrutiny, and bugs can creep in over time, especially as other parts of the class change.