The Broken Java Set Interface
Finding flaws in the Java Collections framework is like shooting fish in a barrel. From an OO design point of view the framework is just plain bad. From the infamous “vector is a list” gaffe to consistent use of inheritance for implementation, it’s enough to send shivers down the spine of any object fanatic.
Today I want to post about an issue I came across in the Java collections hierarchy that I had never seen before. This is a trap that is easy to fall into, so I was surprised that I had never read about it in the many articles I have read about Java collections. The problem is simple and easy to understand, yet it is insidious enough to potentially lead you into a debugging rabbit hole. The problem arises from Oracle’s neglect of contracts (sorry Oracle, when you acquired Sun you acquired their baggage too!).
The wonderful thing about OO design, abstraction and interfaces is the implicit guarantee that if I have some object that conforms to some interface, I don’t need to care what kind of object I’m dealing with. It doesn’t matter if I’m driving a Honda Accord or a Mac truck, when I turn the wheel clockwise I expect my vehicle to turn right accordingly. The fact that the Mac truck implements this through a huge steering box while my little Accord relies on a flimsy rack and pinion system doesn’t matter to me, the driver. The vehicle has exposed an interface to me – a steering wheel – which means they have agreed that it will behave like every other steering wheel I have used. Simple, elegant and the essence of object design.
So let’s take a look at the two main implementations of Java.Collections.Set. We have HashSet and TreeSet, two fundamentally different implementations with very different performance characteristics. All I know is that if someone gives me a Set, it better behave like every other Set I have used.
So what’s the problem here? The problem is, these Sets don’t behave the same, in a subtle but important way. Let’s explore by example. Inspect the following code and try to predict the outcome.
import java.util.*;
public class SetTest {
Set<Thing> hashSet;
Set<Thing> treeSet;
public SetTest() {
hashSet = new HashSet<Thing>();
treeSet = new TreeSet<Thing>();
Thing t1 = new Thing("Thing 1");
Thing t2 = new Thing("Thing 2");
hashSet.add(t1);
hashSet.add(t2);
treeSet.add(t1);
treeSet.add(t2);
System.out.println("Before t1.setString");
System.out.println("HashSet contains t1: " + hashSet.contains(t1) + ", t2: " + hashSet.contains(t2));
System.out.println("TreeSet contains t1: " + treeSet.contains(t1) + ", t2: " + treeSet.contains(t2));
System.out.println();
t1.setString("Different String");
System.out.println("After t1.setString");
System.out.println("HashSet contains t1: " + hashSet.contains(t1) + ", t2: " + hashSet.contains(t2));
System.out.println("TreeSet contains t1: " + treeSet.contains(t1) + ", t2: " + treeSet.contains(t2));
System.out.println();
}
public static void main(String[] args) {
new SetTest();
}
}
class Thing implements Comparable<Thing> {
private String str;
public Thing(String str) {
this.str = str;
}
public void setString(String newStr) { str = newStr; }
public int compareTo(Thing o) { return str.compareTo(o.str); }
@Override
public boolean equals(Object o) {
// flawed equals method for simplicity
if (o instanceof Thing) return ((Thing)o).str.equals(str);
return false;
}
@Override
public int hashCode() {
// flawed hashcode method for simplicity
return str.hashCode();
}
}
For those who aren’t in the mood for reading code, here’s what’s happening. We have some class Thing, with a String field, str. This class identifies as its field, str. This means it must also hash on this field. There have been many words written on the theory of hashCode/equals, so I’ll leave it to you to Google around if you don’t know why it’s important to define these methods together.
We now instantiate two Thing objects and two Sets of Things, one HashSet and one TreeSet. The Things are added to each set, and we check that they are indeed in the sets. We then change the name of one Thing and check again. The result?
HashSet contains t1: true, t2: true
TreeSet contains t1: true, t2: true
After t1.setString
HashSet contains t1: false, t2: true
TreeSet contains t1: true, t2: true
The HashSet no longer thinks it contains the Thing! So why did this happen? It’s probably fairly obvious by now, but I’ll make it explicit. A HashSet uses a hash table to store values. It uses the hash algorithm to figure out if a value is contained in the table. This means that a HashSet is not only a Set, it’s also a Map, but the details of the Map are hidden behind the Set interface. This seems logical until you consider the contract for a Map.
Maps rely on key values that do not change. It is imperative that the key in a mapped object is immutable in order to find the object later. When we changed that str field in Thing, we implicitly altered the key (the hash code of the previous String). Unfortunately this change was not abstracted away properly in the Set. When we use the contains method we are asking the hash table for an object that is identical to an object already in the set, but that hashes differently to how it hashed when first added.
Incidentally this bug also allows us to add duplicate values to a HashSet.
The result of all of this is that we cannot rely on all Java Sets to behave the same. In our own code we could just start using TreeSets if we know we want to have mutable hashable values. But what if we are given a Set by some other piece of code? The whole point of OO encapsulation and information hiding is that we aren’t supposed to care how it’s implemented. Unfortunately we simply can’t behave like this in the Java Collections framework.
It should be noted that Sun/Oracle did seem to consider, at some point, the issues with mutable Set members. In the Set API they state
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
but then fail to follow through with anything at all in the HashSet API description. Regardless, this business of noting in your API, “by the way, depending on the implementation things might break” is not acceptable OO practice, in my opinion.
So what is the solution? If we were to re-implement the Set interface and implementations, what would we do differently? One hacky solution is to force the HashSet to perform a linear search on its hash table when contains() misses. This seems to solve the problem, but we lose the main advantage of a HashSet, O(1) adding, removal and containment checking. Is there a better solution to this problem? What if we were to rewrite the Set hierarchy from scratch, what would we do differently? I don’t have the answers right now, but it’s something I’ll be thinking about.
Tagged collections, java, objected oriented design, oo design, Programming
One of my favourite things to do when learning a new language (or graphics package for that matter) is to write an implementation of a 





We try to do everything in layers, and this allows us to make changes to the final product without affecting the original image. An adjustment layer is a layer that makes certain adjustments to all visible layers beneath it. The advantage is that the adjustment layer can be hidden or deleted and the original images remain intact. I chose a Levels adjustment so that I could change the output white, mid and black levels with respect to the input levels, thereby increasing contrast and effectively “stretching” the dynamic range of the sky. On the right is my levels adjustment. 

