Composing flows

Normalizing Flows

A single 1D flow is limited. The transform $f(x) = x + α \sin(2x)$ from the last chapter can flatten a Gaussian into a bimodal shape, or sharpen it into a spike, but the family of densities you can reach from a single such $f$ is small. The fix is composition: stack several invertible functions in sequence. Each one adds a little structure, and the cost, both computational and notational, turns out to be tiny because the change-of-variables formula composes cleanly.

Composition is invertible

Take $K$ invertible differentiable functions $f_{1}, f_{2}, \dots, f_{K}$ and define their composition:

g (x) = (f_{K} \circ f_{K - 1} \circ \dots \circ f_{1}) (x) = f_{K} (f_{K - 1} (\dots f_{1} (x))) .

The composition $g$ is itself invertible: just apply the layer inverses in reverse order. Differentiable too — that's the chain rule. So $g$ is a perfectly legal flow, and we can apply the change-of-variables formula to it.
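To make the "reverse order" point concrete, here is a small sketch. The three layers are toy stand-ins chosen only because they have closed-form inverses (two affine maps and `sinh`); they are not the sine layers used elsewhere in this chapter, and the function names are made up for illustration.

```python
import math

# Each layer is a (forward, inverse) pair. Toy choices for illustration only.
LAYERS = [
    (lambda x: 2.0 * x + 1.0, lambda y: (y - 1.0) / 2.0),  # f_1: affine
    (math.sinh, math.asinh),                               # f_2: sinh
    (lambda x: 0.5 * x - 3.0, lambda y: 2.0 * (y + 3.0)),  # f_3: affine
]

def g(x):
    """Apply f_1 first, then f_2, ..., then f_K."""
    for forward, _ in LAYERS:
        x = forward(x)
    return x

def g_inv(y):
    """Undo the layers last-to-first: f_K^{-1}, ..., f_1^{-1}."""
    for _, inverse in reversed(LAYERS):
        y = inverse(y)
    return y

def g_inv_wrong_order(y):
    """Layer inverses applied first-to-last: generally NOT the inverse of g."""
    for _, inverse in LAYERS:
        y = inverse(y)
    return y

x = 0.7
y = g(x)
print(abs(g_inv(y) - x))              # round-trip error, at floating-point level
print(abs(g_inv_wrong_order(y) - x))  # clearly nonzero
```

The only difference between the two inverse functions is `reversed(...)`, which is all "invert in reverse order" amounts to in code.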

The log-density adds up

The chain rule says $g^{'} (x)$ is the product of all the layer derivatives, evaluated at the chain of intermediate values:

g'(x) = f_{1}'(x) \, f_{2}'(z_{1}) \, f_{3}'(z_{2}) \cdots f_{K}'(z_{K-1}), \qquad z_{k} = f_{k}(z_{k-1}), \quad z_{0} = x .

Plug this into the change-of-variables formula $\log p_{Y}(y) = \log p_{X}(x) - \log | g'(x) |$ and the product turns into a sum:

\log p_{Y}(y) = \log p_{X}(x) - \sum_{k=1}^{K} \log | f_{k}'(z_{k-1}) |

where you compute the chain $z_{0} = x, z_{1} = f_{1}(x), z_{2} = f_{2}(z_{1}), \dots$ forward and accumulate one $\log | f_{k}' |$ term per layer. Cost grows linearly in the number of layers; nothing in the formula needs you to know $g'$ as a single object. This is the feature that makes deep flows practical: you can stack hundreds of layers and the log-density still costs $O(K)$ to evaluate.
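The bookkeeping can be sketched in a few lines. This assumes sine-ripple layers of the form $z + a \sin(2z + φ_{k})$ with $a = 0.32$, a standard normal base density, and phases picked arbitrarily for illustration; the function names are hypothetical too. The last few lines compare the accumulated sum against a finite-difference estimate of $\log g'(x)$, which is one way to convince yourself the chain rule was applied at the right intermediate points.

```python
import math

AMP = 0.32
# Assumed phases: any fixed values would do for this sketch.
PHASES = [0.0, 0.9, 1.8, 2.7, 3.6, 4.5]

def layer(z, phi):
    return z + AMP * math.sin(2.0 * z + phi)

def layer_deriv(z, phi):
    # At least 1 - 2*AMP = 0.36 > 0, so every layer is strictly increasing.
    return 1.0 + 2.0 * AMP * math.cos(2.0 * z + phi)

def log_normal(x):
    """Log-density of the standard normal base (an assumption of this sketch)."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def forward_with_logdet(x, phases=PHASES):
    """Return (g(x), sum_k log|f_k'(z_{k-1})|), computed in one forward pass."""
    z, logdet = x, 0.0
    for phi in phases:
        logdet += math.log(abs(layer_deriv(z, phi)))  # derivative at z_{k-1}
        z = layer(z, phi)                             # then advance to z_k
    return z, logdet

def log_density_at_image(x, phases=PHASES):
    """Return (y, log p_Y(y)) for y = g(x)."""
    y, logdet = forward_with_logdet(x, phases)
    return y, log_normal(x) - logdet

# Sanity check against a central finite difference of g itself.
x, h = 0.3, 1e-6
fd = (forward_with_logdet(x + h)[0] - forward_with_logdet(x - h)[0]) / (2.0 * h)
_, logdet = forward_with_logdet(x)
print(math.log(fd), logdet)  # the two numbers should agree to several decimals
```

Note that the loop never forms $g'$ explicitly: it adds one log term per layer, which is where the $O(K)$ cost comes from.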

The visualization

Below: the same Gaussian on top, the same colored samples, but now the transform is a stack of up to six layers $f_{k}(z) = z + 0.32 \sin(2z + φ_{k})$ with different fixed phases. Drag the slider, or click a preset, to change the depth $K$. A non-integer $K$ means "a few full layers plus a partial layer at the top," which lets the slider move smoothly.


Each layer is f_k(z) = z + 0.32 · sin(2z + φ_k) with a different fixed phase. The slider chooses how many of them are stacked.

A walk through the depths.

K = 0. No layers — the composition is the identity. Output equals input, density equals base. The starting point of any flow.

K = 1. One layer. Same kind of bimodal output you saw in the previous chapter, with peaks where that single layer's $f^{'}$ is small. A single sine ripple isn't very expressive on its own.

K = 3. Three layers, each ripple at a different phase, applied in sequence. The output density now has multiple peaks that aren't at the simple $\pm π /2$ locations from one layer alone — each layer was looking at a transformed input, so the squeeze points show up wherever the cumulative chain rule makes $\prod_{k} f_{k}^{'}$ small.

K = 6. Full depth. The output is densely multimodal: half a dozen peaks of varying height, none of which correspond to anything obvious in the base Gaussian. We've constructed a complicated density from a simple one and a small handful of identical-form layers. No layer alone could produce this; all six together can.

Watch a particular colored dot as you sweep the slider — say one of the green ones near $x = 0$ . At $K = 0$ it sits right at $y = 0$ . As $K$ grows, it migrates around. Each layer "sees" the previous layer's output and adds a sine ripple to it; the result is a path through y-space that depends on the entire stack.

Inverting the stack

Composition handles the forward direction (sample $x$ , apply layers, get $y$ ) but flows also need the inverse direction (given $y$ , find the $x$ that produced it, so we can evaluate the density). Inverting the composition is just inverting each layer in reverse order:

g^{- 1} (y) = (f_{1}^{- 1} \circ f_{2}^{- 1} \circ \dots \circ f_{K}^{- 1}) (y) .

Each $f_{k}^{- 1}$ is a 1D root-finding problem — for the demo we use bisection per layer, which is robust and cheap. In higher dimensions the per-layer inverse becomes the central design constraint of any flow architecture: a layer that's fast to compute forward but slow to invert (or vice versa) is fine for some uses (sampling-only, density-only) but not for the full forward-and-inverse density model.
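Here is one way per-layer bisection could look for the sine layers. It is a sketch, not the demo's actual code: the phases, iteration count, and function names are assumptions. It relies on two facts about $f(z) = z + 0.32 \sin(2z + φ)$: it is strictly increasing (its derivative is at least $1 - 0.64 = 0.36$), and it never moves a point by more than $0.32$, so the solution of $f(z) = y$ must lie in $[y - 0.32, y + 0.32]$.

```python
import math

AMP = 0.32
PHASES = [0.0, 0.9, 1.8]  # assumed phases, for illustration only

def layer(z, phi):
    return z + AMP * math.sin(2.0 * z + phi)

def layer_inverse(y, phi, iters=60):
    """Solve layer(z, phi) = y for z by bisection.

    |layer(z) - z| <= AMP, so the root is bracketed by [y - AMP, y + AMP];
    the layer is increasing, so the sign of layer(mid) - y picks the half.
    """
    lo, hi = y - AMP, y + AMP
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if layer(mid, phi) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def flow(x, phases=PHASES):
    for phi in phases:
        x = layer(x, phi)
    return x

def flow_inverse(y, phases=PHASES):
    # Undo the last layer first.
    for phi in reversed(phases):
        y = layer_inverse(y, phi)
    return y

x = -1.3
print(abs(flow_inverse(flow(x)) - x))  # round-trip error, should be tiny
```

Sixty halvings shrink the starting bracket of width $0.64$ far below double precision, so the fixed iteration count stands in for a tolerance check.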

What this is, and what it isn't

What we have now is a $K$ -layer flow with a fixed structure, fixed phases, and fixed amplitudes. It can produce multimodal densities, but it can't produce specific multimodal densities — we have no knobs that say "put a peak here, put another peak there." For that we need two more pieces.

The first is parameterization: replacing the hand-picked $α$ and $φ_{k}$ with learnable parameters. Each layer becomes a small parameter vector, and the whole flow becomes a parameterized family of densities. The second is training: a procedure for adjusting those parameters so that the flow's density matches a target — usually the empirical density of a dataset. For flows the procedure is direct: maximize the log-likelihood of the data under the model. We'll get to both, but first we have to handle the dimensional jump from 1D to 2D — the $∣ f^{'} (x) ∣$ in our formula has to become $∣ det J_{f} (x) ∣$ , and that's where flows start being more than just "1D change of variables but bigger."

Why does adding more layers help?

Because each layer contributes its own $\log | f_{k}' |$ term to the log-density, evaluated at a point that depends on every previous layer. The space of densities you can reach grows quickly with depth. There's a formal sense in which sufficiently flexible flows are universal density approximators (in 1D, even a single layer from a flexible class can approximate any density; in higher dimensions you generally need depth and the right architecture). For this chapter the simpler observation is enough: more layers, more shapes.

Why don't the layers' effects cancel out as the phases drift?

Because each layer is applied after the previous one, not added to it. If you summed $\sum_{k} 0.32 \sin(2z + φ_{k})$ at a fixed $z$, the terms would interfere: sinusoids of the same frequency always add up to a single sinusoid of that frequency, and when the phases are spread out its amplitude can be small. Composition is different: layer 2 sees layer 1's output, layer 3 sees layer 2's output, and so on. The ripples compound rather than cancel. You can see this in the demo by stepping through K = 1, 2, 3: the output is not getting smoother, it's getting more structured.
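A quick numerical comparison makes the difference visible. The sketch below assumes three phases spaced evenly around the circle, chosen deliberately so that the summed ripples cancel exactly; the demo's real phases are not specified here, and the function names are made up. Same-frequency sines add like phasors, $\sum_{k} a \sin(2z + φ_{k}) = A \sin(2z + Φ)$ with $A e^{iΦ} = a \sum_{k} e^{iφ_{k}}$, so the code computes $A$ directly.

```python
import cmath
import math

AMP = 0.32
# Assumed phases, evenly spaced so the phasors sum to zero.
PHASES = [0.0, 2.0 * math.pi / 3.0, 4.0 * math.pi / 3.0]

def summed(z):
    """Add every ripple to the same input z (this is NOT a flow composition)."""
    return z + sum(AMP * math.sin(2.0 * z + phi) for phi in PHASES)

def composed(z):
    """Apply the layers in sequence; each one sees the previous layer's output."""
    for phi in PHASES:
        z = z + AMP * math.sin(2.0 * z + phi)
    return z

# Phasor sum: amplitude A and phase Phi of the single equivalent sine.
phasor = sum(AMP * cmath.exp(1j * phi) for phi in PHASES)
A, Phi = abs(phasor), cmath.phase(phasor)

grid = [i / 100.0 for i in range(-300, 301)]
sum_vs_single = max(abs(summed(z) - (z + A * math.sin(2.0 * z + Phi))) for z in grid)
sum_ripple = max(abs(summed(z) - z) for z in grid)
comp_ripple = max(abs(composed(z) - z) for z in grid)

print(A)              # combined amplitude of the summed ripples
print(sum_vs_single)  # summed map versus its single-sine form
print(sum_ripple)     # largest displacement under the summed map
print(comp_ripple)    # largest displacement under the composed map
```

With these phases the summed map is the identity up to rounding, while the composed map still moves points by a visible amount, because each layer's ripple is evaluated at an already-shifted input.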

What stops the depth from making the inverse intractable?

Nothing fundamental, and that's a real concern. In 1D each layer inverse is a fast root-find, so the composed inverse is $K$ sequential root-finds: linear in $K$, and still cheap for $K$ in the thousands. In higher dimensions, though, inverting a layer might require solving an equation whose cost scales with the dimension. Architectures like coupling layers (next-but-one chapter) are designed exactly so that the per-layer inverse stays as cheap as the forward pass, even in high dimensions. The composition pattern we built here is the foundation; the engineering is in choosing layer shapes that keep both directions efficient.