Algorithms

Union-Find — exercises

Recognize a problem as a partition-membership question and reduce it to find/union calls — optionally decorating each component with extra metadata.

1 worked example · 7 practice problems · 2 check problems

Worked example: Kruskal's MST via Union-Find

Problem. Given the weighted undirected graph below on 4 vertices and 5 edges, build a minimum spanning tree using Kruskal's algorithm, with union-find as the cycle check. Report the MST edges (in the order they were added) and the total weight.

Edges, in the format (u, v, weight):

(0, 1, 1),  (0, 2, 4),  (1, 2, 2),  (1, 3, 5),  (2, 3, 3)

Diagnosis. Kruskal walks the edges from cheapest to most expensive, adding each one unless it would close a cycle. "Closes a cycle" reduces to "are its endpoints already in the same component?" — exactly what find(u) == find(v) answers. Union-find is the central data structure of the algorithm; its queries dominate the running time.

Predict before reading on: how many edges will the final MST contain, and why is that count independent of how the algorithm chooses among ties?

Solution. Sort edges by weight: (0,1,1), (1,2,2), (2,3,3), (0,2,4), (1,3,5). Initialize union-find with 4 singletons.

Edge	find(v)	Same root?	Action
`(0,1,1)`	1	no	Add. `union(0,1)`.
`(1,2,2)`	2	no	Add. `union(1,2)`.
`(2,3,3)`	3	no	Add. `union(2,3)`.
`(0,2,4)`	0	yes	Skip.
`(1,3,5)`	0	yes	Skip.

(Each find may also trigger path compression; the table abbreviates the bookkeeping. Whichever variant of union-find you use, the cycle check is the same.)

Predict before reading on: what does it mean for the algorithm if at some point ALL remaining edges produce find(u) == find(v)? What general invariant about the graph does that signal?

Verification. MST = {(0,1,1), (1,2,2), (2,3,3)}. Total weight = 6.

Two sanity checks:

Edge count = 3 = $V - 1$ for $V = 4$ vertices — the required size of any spanning tree.
Brute-force: enumerate all spanning trees (there are 8 by Kirchhoff's theorem on this graph). The minimum-weight one totals 6.

def kruskal(edges, n):
    edges.sort(key=lambda e: e[2])
    parent = list(range(n))

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    mst, total = [], 0
    for u, v, w in edges:
        if find(u) != find(v):
            parent[find(u)] = find(v)
            mst.append((u, v, w))
            total += w
    return mst, total

print(kruskal([(0,1,1),(0,2,4),(1,2,2),(1,3,5),(2,3,3)], 4))
# ([(0,1,1), (1,2,2), (2,3,3)], 6)

Articulate: state in one sentence the precise role union-find plays in Kruskal's algorithm.

Practice problems

Seven problems, seven different surfaces. Each one is the same move from the worked example — partition membership via find/union — applied in a different domain. Try each problem before revealing the answer.

P.1 image processing, 2D grids

Count the number of connected components of 1s in the following 5×5 binary image, using 4-connectivity (cells share an edge):

Find the analogue: which moves of the worked example (find vs union) correspond to "this is a new region" vs "this merges two regions" when scanning the image cell by cell?

show answer

4 components. Walk the image: for each "1" cell, union it with its already-visited 1 neighbors (left and up). At the end, count distinct roots.

The four components are: {(0,0), (0,1), (1,1)}, {(0,4), (1,3), (1,4), (2,3)}, {(3,0), (4,0), (4,1)}, and the singleton {(3,2)}.

P.2 temporal event streams

Friendships are added between users at integer timestamps. The stream is:

t=1:  add (1, 2)
t=2:  add (3, 4)
t=3:  add (5, 6)
t=4:  add (2, 3)
t=5:  add (4, 5)

For each pair below, report the earliest timestamp at which the two users are in the same friend group:

(a) users 1 and 6
(b) users 1 and 3
(c) users 5 and 6

Find the analogue: the worked example didn't care WHEN edges arrived. Here the answer for each query is itself a timestamp. Which step (find or union) needs to be augmented to carry a "first-merged time"?

show answer

Walk the timeline and union at each step. After each union, two components merge — at that timestamp, every pair (a, b) where a is in the old left group and b in the old right group is "now connected." Report the FIRST timestamp at which the queried pair both belongs to the merged group.

(a) t = 5. Components after t=4 are {1,2,3,4}, {5,6}. After t=5: {1,2,3,4,5,6}. 1 and 6 land together at t = 5.
(b) t = 4. Components after t=3 are {1,2}, {3,4}, {5,6}. After t=4: {1,2,3,4}, {5,6}.
(c) t = 3. The union at t=3 directly connects 5 and 6.

P.3 decorated union-find with per-component metadata

Implement union-find so that each component tracks its size as extra metadata, updated in constant time per union. After the following sequence of unions on 7 nodes, report the size of the component containing node 3:

union(1, 2), union(3, 4), union(1, 3), union(5, 6), union(4, 7)

Find the analogue: where in the union routine of the worked example would you insert a constant-time size update? Where do you READ the size to answer the query?

show answer

Size = 5. The component containing 3 is {1, 2, 3, 4, 7}.

Implementation: keep a parallel size[] array indexed by node, where size[root] holds the size of that root's component. On union(a, b) after finding roots ra, rb with ra ≠ rb: set size[new_root] = size[ra] + size[rb]. Read by going size[find(node)].

Trace: {1,2}(2), {3,4}(2), {5}, {6}, {7} → union(1,3) merges to {1,2,3,4}(4) → {5,6}(2) → union(4,7) merges to {1,2,3,4,7}(5). size[find(3)] = 5.

P.4 algebraic equivalence from raw equality facts

You're given a list of equations between variables in ${a, b, c, d, e, f}$ :

a = b,    c = d,    e = f,    b = e

Partition the six variables into their equivalence classes.

Find the analogue: each equation is a "merge these two" instruction. What's the union-find input? After processing all four equations, how do you READ the partition?

show answer

Two classes: {a, b, e, f} and {c, d}.

Trace: start with six singletons. a=b merges {a,b}. c=d → {c,d}. e=f → {e,f}. b=e merges {a,b} and {e,f} → {a,b,e,f}. {c,d} never connects to anything else.

To read the partition: group every variable by find(v).

P.5 programming language type inference

A compiler is processing the following type-equality constraints:

T(x) = T(y),    T(z) = T(w),    T(y) = T(z),    T(x) = int

After processing all four constraints, which variables must have type int?

Find the analogue: the first three constraints are pure equivalence merges, same as the worked example. The fourth constraint pins a concrete type onto one variable. What property of union-find lets that pin propagate to every other variable in the same component?

show answer

All four — x, y, z, w — must have type int.

After the first three constraints, the four variables are in one component: {x, y, z, w}. Pin T(x) = int by attaching the type tag to the root: type_of[find(x)] = int. Any later query type_of[find(v)] for any v in the component returns int, because every member shares the root.

(This is the standard Hindley-Milner unification trick — union-find is the data structure underneath.)

P.6 bipartite testing with parity-augmented union-find

Process the edges (1, 2), (2, 3), (3, 1) in order. After each edge, decide whether the graph is still bipartite.

If a non-bipartite edge appears, identify it.

Find the analogue: bipartiteness fails when an edge creates an ODD cycle. How can union-find be augmented so that find(u) also returns u's "color" relative to its component's root?

show answer

Augment each node with a parity bit relative to its parent. When unioning (u, v) with the new edge, the parities of u and v relative to the root tell you what color each endpoint must be. If they're in the same component AND have the same parity, the new edge would force them to differ — an odd cycle forms, the graph is no longer bipartite.

After (1, 2): 1 is color A, 2 is color B. Bipartite. ✓
After (2, 3): 3 connects to 2 (color B), so 3 is color A. Bipartite. ✓
After (3, 1): 1 and 3 are both color A and already in the same component. The new edge demands they differ. Not bipartite. Triangle is the offending odd cycle.

P.7 offline algorithms, reverse-time technique

Initial graph: 4 nodes {1, 2, 3, 4}, edges {(1, 2), (2, 3), (3, 4)}. Then the following operations:

op 1: delete edge (2, 3)
op 2: query connected(1, 4)?

Answer the query using a single forward union-find pass — no edge-deletion support.

Find the analogue: union-find supports merging but not splitting. Instead of running the operations forward, reformulate: which edges REMAIN at the moment of the query, and what does processing only those edges give you?

show answer

After the deletion at op 1, the surviving edge set is {(1, 2), (3, 4)}. Process THOSE as forward unions on a fresh union-find:

union(1, 2)  ->  components: {1, 2}, {3}, {4}
union(3, 4)  ->  components: {1, 2}, {3, 4}

find(1) ≠ find(4). Not connected.

This is the textbook "process in reverse" trick: a delete-then-query workload over time can be solved by stepping through time backward (so deletions become additions) and answering each query against the union-find state at that simulated point.

Check problems

Two problems designed to resist pattern-matching against the practice set. Each requires you to do something the practice problems didn't.

Check 1 derivation

Prove that in union-by-rank, the number of nodes in any tree whose root has rank $r$ is at least $2^{r}$ . Use this to conclude that the depth of any such tree is at most $lo g_{2} n$ when the structure contains $n$ nodes total.

This is the bound the page asserts but doesn't derive in full. Produce the derivation yourself.

show solution sketch

Lemma. In union-by-rank, a tree whose root has rank $r$ has at least $2^{r}$ nodes.

Proof by induction on r.

Base case (r = 0). A node has rank 0 the moment it's created — it's a singleton tree with 1 node. $1 \geq 2^{0} = 1$ . ✓

Inductive step (r > 0). A root acquires rank $r$ only through the tie case of union-by-rank: two trees, each rooted at a rank- $(r - 1)$ node, are merged, and the resulting root has rank bumped to $r$ . By inductive hypothesis each contributing tree had at least $2^{r - 1}$ nodes, so the merged tree has at least $2 \cdot 2^{r - 1} = 2^{r}$ nodes. ✓

Corollary (depth bound). If the union-find contains $n$ nodes, then no rank can exceed $lo g_{2} n$ , because a rank- $r$ tree requires $2^{r} \leq n$ nodes. The rank of a tree is an upper bound on its depth (ranks only increase, depths follow). Therefore the depth of any tree is at most $lo g_{2} n$ , and every find is $O (lo g n)$ in the worst case — even without path compression.

(With path compression, the amortized bound tightens further to $O (α (n))$ , but that's Tarjan's much harder analysis.)

Check 2 diagnosis

The following code claims to implement Kruskal's MST with full rank-and-compression union-find. It compiles. It produces plausible-looking output on small graphs. But the implementation has a bug — not in the cycle check, in the union logic — that causes union-by-rank to silently break.

def kruskal(edges, n):
    edges.sort(key=lambda e: e[2])
    parent = list(range(n))
    rank_  = [0] * n

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])
        return parent[x]

    mst, total = [], 0
    for u, v, w in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # union by rank
            if rank_[u] < rank_[v]:
                parent[ru] = rv
            elif rank_[u] > rank_[v]:
                parent[rv] = ru
            else:
                parent[rv] = ru
                rank_[u] += 1
            mst.append((u, v, w))
            total += w
    return mst, total

Identify the bug, explain why the output can still look correct on small graphs, and propose a one-line fix (which you can apply to several lines).

Hint: scrutinize what rank_[u] and rank_[v] actually look up, given everything that's happened by this point in the loop.

show solution sketch

Bug. Inside the if ru != rv: branch, the rank lookups and update reference the ORIGINAL endpoints u and v — not the roots ru and rv.

Rank is a property of tree roots, not arbitrary nodes. Once an early union has hung u below another node, rank_[u] is stale and unrelated to the height of the tree it now belongs to. Three things go wrong:

The comparison rank_[u] < rank_[v] is between stale values. The "shorter tree under taller tree" invariant collapses.
The attach decisions can hang taller trees under shorter ones, growing depth.
The update rank_[u] += 1 bumps an interior node's rank field — a value nothing else reads — leaking "phantom rank" into bookkeeping that no longer reflects any tree shape.

Why it still looks right on small graphs. Kruskal's correctness depends ONLY on the cycle check, which is fine — find(u) != find(v) is computed correctly. The buggy union still merges two components into one; it just merges them via a bad attach choice. The MST edges and weight come out correct. The visible symptom is degraded performance: tree depths grow toward $O (n)$ in the worst case, and each find walks farther than it should. To EXPOSE the bug, instrument the code to print the max tree depth after a long sequence of unions and compare against a correct implementation. Or run it on a worst-case input (e.g., 10⁵ unions in a chain pattern) and observe pathological run times.

Fix. Replace every rank_[u] / rank_[v] with rank_[ru] / rank_[rv]:

# Replace each rank_[u] / rank_[v] with rank_[ru] / rank_[rv]:
if rank_[ru] < rank_[rv]:
    parent[ru] = rv
elif rank_[ru] > rank_[rv]:
    parent[rv] = ru
else:
    parent[rv] = ru
    rank_[ru] += 1