Union-Find (Disjoint Set Union)

Algorithms

What's the question?

You have a pile of items. A thousand, a million, doesn't matter. They start with no relationships at all.

Now someone is going to throw two operations at you, in any order, as many times as they like. The first: merge group A with group B. The second: are X and Y in the same group?

Your job is to answer fast. Both operations, every time, on data structures that keep growing.

That's the union-find problem. It comes up everywhere — checking whether two pixels belong to the same blob, telling whether adding an edge to a graph creates a cycle, deciding which network components are reachable from each other. Anywhere you build connectivity dynamically.


The trick is the representation

You'd think you store each group as a set of elements. You don't. You store each group as a tree of parent pointers.

Every node has one field: a pointer to some other node in its group. The root of the tree points at itself. The root is the group's identity — its name, its representative.

Two nodes are in the same group exactly when they walk up to the same root. That's the whole data structure.

The two operations fall out:

Both cost whatever the height of the tree is. Which leads to the obvious next question.


How bad can it get?

Without care, terrible. Press the naive demo above and watch the visualizer.

The unions in the demo are union(0, 1), union(1, 2), union(2, 3), and so on — each one hooks the previous chain onto the next node. After eleven unions the tree is a straight line, eleven nodes deep.

The final find at the end of the demo walks the whole chain. Eleven hops to answer one query. The data structure has degenerated into a linked list and you're paying per operation.

Two small fixes turn that disaster into "essentially constant time." Both come from the same instinct: keep the tree shallow.


Trick 1 — union by rank

When you merge two trees, you have a choice: hang tree A under tree B's root, or hang tree B under tree A's root.

The clever choice is to hang the shorter one under the taller one. The taller tree's height stays the same. Only if the two trees have the same height does the merged tree get one level deeper.

We don't actually track exact heights — we keep a rough upper bound called the rank. Same rank on both sides, you bump the rank up by one. Otherwise, the smaller rank loses out.

Now switch the demo to union by rank. Same eleven unions, same arguments — but every later union sees node 0 with the highest rank, so every new node attaches to 0 directly. The depth never gets past one. Depth = 1. The find at the end is a single hop.

That outcome is specific to this chain-building sequence. In general, union by rank gets you to per operation — a hard upper bound, since the rank only grows when two same-rank trees merge, and you can show the number of nodes at rank is at least .


Trick 2 — path compression

Here's the second move, and it's a little magical.

When find(x) walks from x up to the root, it touches every node on the path. The walk is unavoidable the first time. But on the way back down, rewire every node it touched so they all point directly at the root.

The whole chain collapses. Next query on any of those nodes is a single hop.

Switch the demo to path compression. This demo uses the plain attachment rule (so the unions build the same eleven-deep chain), but the final find now does compression. Watch closely: the path gets highlighted as the walk happens, and then the arrows snap. Every node on that path is now one hop from the root.

It's free, too — you were going to walk that path anyway. Rewriting the parent pointers as you unwind costs nothing extra in big-O. You just amortize a lot of future work into the one walk you had to do.


The famous α(n) bound

Put both tricks together and the result is one of the great surprises of algorithm analysis.

The amortized time per operation is where is the inverse Ackermann function. Ackermann's function grows hideously fast — even is bigger than the number of atoms in the universe. Its inverse therefore grows hideously slowly.

For any value of that could exist on a computer — or for any value of that could exist anywhere — is less than 5.

So in practice, union-find is constant time. Not "constant on average," not "constant if you're lucky." Constant. The original 1975 proof by Tarjan is one of the most surprising results in computer science.


Code

All of the above fits into about twenty lines of Python:

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))  # everyone starts in their own set
        self.rank   = [0] * n          # rough tree-height estimate

    def find(self, x):
        # Walk up to the root, and on the way back rewire every node
        # we touched to point straight at the root. (Path compression.)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same set
        # Union by rank: hang the shorter tree under the taller one.
        if self.rank[ra] < self.rank[rb]:
            self.parent[ra] = rb
        elif self.rank[ra] > self.rank[rb]:
            self.parent[rb] = ra
        else:
            self.parent[rb] = ra
            self.rank[ra] += 1
        return True

    def connected(self, a, b):
        return self.find(a) == self.find(b)

The recursive find is doing path compression for free — every recursive call's return value gets stored as the parent of the node below, so by the time the recursion unwinds, every node on the path points at the root.

If you don't like recursion, you can rewrite find as a two-pass loop: walk up to find the root, then walk up again and rewire each node's parent to that root. Same result.


The necessarily-true facts

  1. The root represents the set. Every node in the same group walks up to the same root. That's the definition we maintain. find(x) == find(y) exactly when x and y are in the same group.
  2. Find is the basic primitive. Both union and connected are built out of find calls. The whole structure's performance lives or dies on how fast find is.
  3. Naive find is . Without any tricks, find walks one parent pointer at a time up to the root. A pathological union sequence makes the tree as deep as the group is large — per op.
  4. Union by rank caps depth at . Each rank bump requires merging two trees of equal rank, so the number of nodes at rank is at least . With total nodes, no rank exceeds .
  5. Path compression is amortized-free. The walk happens anyway. Rewiring parent pointers during the walk costs per node touched, and pays off on every future query.
  6. Together: amortized. Tarjan's 1975 proof. The constant is the inverse Ackermann function, which is at most 4 for any that could ever physically exist.
  7. Sets cannot be split. Union-find supports merging only, never splitting. Once two elements are in the same set, this data structure cannot separate them.

Where you actually see it

Kruskal's algorithm for minimum spanning trees. Sort edges by weight, walk them in order, and add each edge to the tree unless its endpoints are already connected. The "already connected" check is one find; the "add to tree" step is one union. Without union-find, Kruskal would be slow; with it, the sort dominates.

Connected components in an image. Each pixel is a node. Walk the image; for adjacent pixels of the same color, union them. At the end, find gives every pixel its component label.

Dynamic graph connectivity. Any time edges arrive over time and you need to keep answering "are these two nodes connected yet?" — that's exactly the problem union-find was built for.

Percolation. Drop sites onto a grid one at a time and ask when the top connects to the bottom. The connectivity check is union-find.


Common questions

Why represent each group as a tree instead of a list of members?

A list of members would make find instant — every node would store its group's name directly. But then union would have to walk one whole group and rewrite every member, which is the slow direction. The tree representation balances both ops: union is at the root, and find can be made nearly with the two tricks.

If path compression already flattens trees, why bother with union by rank?

Path compression only helps a node after someone calls find on it. Until then, it might sit deep in the tree. Union by rank prevents the tree from ever getting too deep in the first place, so even the first find on any node is cheap. The two tricks together give the bound; either one alone is worse — pure path compression gets you to amortized, and rank alone gets you worst-case.

What does actually evaluate to?

For , . For , . For , . For , . For up to a tower of 2's that's levels tall, . So no, you will not see a 5 in your lifetime.

Can I undo a union?

Not in plain union-find. The structure is monotonic — sets only get bigger. If you need to roll back unions (for backtracking search, say), you need union-find with rollback: skip path compression, use union by rank only, and keep a stack of pointer-changes so you can replay them in reverse. You give up the bound and go back to , but rollback is now legal.

Why does the recursive find in the code do path compression automatically?

Look at the line self.parent[x] = self.find(self.parent[x]). The recursive call returns the root. We then set x's parent to that root before returning. So every node on the recursion stack ends up pointing straight at the root by the time the call unwinds. One pass, no extra bookkeeping.


Worked examples

Step-by-step solutions on problems that use union-find. Pair these with the exercises below — examples to study, exercises to attempt.

Browse all worked examples →


Exercises

A full exercise set is available for this topic, structured as one worked example + 7 practice problems (across 7 surface contexts) + 2 pattern-resistant check problems.

Open the Union-Find exercise set → Recognize a problem as a partition-membership question and reduce it to find/union calls — optionally decorating each component with extra metadata.