Composing flows

Normalizing Flows

A single 1D flow is limited. The transform from the last chapter can flatten a Gaussian into a bimodal shape, or sharpen it into a spike, but the family of densities you can reach with a single such transform is small. The fix is composition: stack several invertible functions in sequence. Each one adds a little structure, and the cost, both computational and notational, turns out to be tiny because the change-of-variables formula composes cleanly.


Composition is invertible

Take $K$ invertible differentiable functions $f_1, f_2, \dots, f_K$ and define their composition:

$$f = f_K \circ f_{K-1} \circ \cdots \circ f_1, \qquad f(x) = f_K(f_{K-1}(\cdots f_1(x) \cdots))$$

The composition is itself invertible: just apply the layer inverses in reverse order. It is differentiable too, by the chain rule. So $f$ is a perfectly legal flow, and we can apply the change-of-variables formula to it.
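As a minimal sketch of "apply the layer inverses in reverse order", the snippet below stacks three layers that happen to have closed-form inverses. The specific layers (an affine map, `sinh`, `atan`) and the names `LAYERS`, `forward`, and `inverse` are assumptions made for illustration; they are not the layers used in this chapter's demo.

```python
import math

# Each layer is a (forward, inverse) pair. These three are hypothetical
# examples picked only because their inverses are available in closed form.
LAYERS = [
    (lambda z: 2.0 * z + 1.0, lambda y: (y - 1.0) / 2.0),  # affine map
    (math.sinh, math.asinh),                               # sinh and its inverse
    (math.atan, math.tan),  # atan maps R onto (-pi/2, pi/2), where tan inverts it
]

def forward(x, layers=LAYERS):
    """Apply the layers in order: first layer first, last layer last."""
    z = x
    for f, _ in layers:
        z = f(z)
    return z

def inverse(y, layers=LAYERS):
    """Undo the stack: invert the last layer first, the first layer last."""
    z = y
    for _, f_inv in reversed(layers):
        z = f_inv(z)
    return z

x = 0.7
y = forward(x)
print("x =", x, " y =", y, " recovered x =", inverse(y))
```

The only structural requirement is that `inverse` walks the list backwards; swapping the order of the inverses would not recover `x` in general.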


The log-density adds up

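The chain rule says $f'(x)$ is the product of all the layer derivatives, evaluated at the chain of intermediate values $z_0 = x$, $z_k = f_k(z_{k-1})$:

$$f'(x) = \prod_{k=1}^{K} f_k'(z_{k-1})$$

Plug into the change-of-variables formula, $\log p_Y(y) = \log p_X(x) - \log |f'(x)|$ with $y = f(x) = z_K$, and the product turns into a sum:

$$\log p_Y(y) = \log p_X(x) - \sum_{k=1}^{K} \log |f_k'(z_{k-1})|$$

where you compute the chain $z_0, z_1, \dots, z_K$ forward and accumulate one term per layer. Cost grows linearly in the number of layers; nothing in the formula needs you to know $f$ as a single object. This is the feature that makes deep flows practical: you can stack hundreds of layers and the log-density still costs only $O(K)$ to evaluate.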
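The accumulation can be sketched in a few lines. The layer form and the amplitude 0.32 follow the demo caption later in this chapter; the phase values in `PHASES` and all function names are hypothetical, chosen only so the example runs. The finite-difference check at the end confirms that the per-layer sum matches the log-derivative of the composed map.

```python
import math

AMP = 0.32                 # amplitude from the demo caption
PHASES = [0.0, 1.1, 2.3]   # hypothetical phases, for illustration only

def layer(z, phi):
    """One layer: f_k(z) = z + AMP * sin(2z + phi)."""
    return z + AMP * math.sin(2.0 * z + phi)

def layer_log_abs_deriv(z, phi):
    """log |f_k'(z)|, with f_k'(z) = 1 + 2*AMP*cos(2z + phi) >= 0.36 > 0."""
    return math.log(1.0 + 2.0 * AMP * math.cos(2.0 * z + phi))

def forward_with_logdet(x, phases):
    """Run the chain forward, accumulating one log-derivative term per layer."""
    z, logdet = x, 0.0
    for phi in phases:
        logdet += layer_log_abs_deriv(z, phi)  # evaluated at the layer's input
        z = layer(z, phi)
    return z, logdet

def log_standard_normal(x):
    """Log-density of the N(0, 1) base distribution."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

x = 0.7
y, logdet = forward_with_logdet(x, PHASES)
log_p_y = log_standard_normal(x) - logdet

# Sanity check: central finite difference of the composed map f at x.
h = 1e-6
fd = (forward_with_logdet(x + h, PHASES)[0]
      - forward_with_logdet(x - h, PHASES)[0]) / (2.0 * h)
print("accumulated log|f'(x)| =", logdet, " finite-difference =", math.log(fd))
print("log p_Y(y) =", log_p_y)
```

Note that `forward_with_logdet` never builds the composed function explicitly; it only needs each layer and its derivative.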


The visualization

Below: the same Gaussian on top, the same colored samples, but now the transform is a stack of up to six layers with different fixed phases. Drag the slider, or click a preset, to change the depth $K$. A non-integer $K$ means "a few full layers plus a partial layer at the top," which lets the slider move smoothly.

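[Interactive figure. Top panel: base distribution $p_X(x) = \mathcal{N}(0, 1)$. Bottom panel: density after $K$ layers, $p_Y(y)$, shown at $K = 3.0$. Both horizontal axes run from -4 to 4. A slider sets $K$.]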

Each layer is $f_k(z) = z + 0.32 \sin(2z + \phi_k)$ with a different fixed phase $\phi_k$. The slider chooses how many of them are stacked.
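One way to realize a non-integer depth is sketched below. How the demo actually blends in the partial layer is not stated here, so treat this as an assumption: the partial layer's amplitude is scaled by the fractional part of the depth, which makes the output vary continuously with the slider and keeps every layer invertible. The phase values and the names `layer` and `forward_fractional` are hypothetical.

```python
import math

AMP = 0.32
PHASES = [0.0, 1.1, 2.3, 3.1, 4.4, 5.2]  # hypothetical phases for six layers

def layer(z, phi, strength=1.0):
    """f_k(z) = z + strength * AMP * sin(2z + phi); strength in [0, 1]."""
    return z + strength * AMP * math.sin(2.0 * z + phi)

def forward_fractional(x, depth):
    """Apply floor(depth) full layers, then one partial layer.

    Assumption: the partial layer is the next layer with its amplitude
    scaled by the fractional part of depth.
    """
    full = int(math.floor(depth))
    frac = depth - full
    z = x
    for phi in PHASES[:full]:
        z = layer(z, phi)
    if frac > 0.0 and full < len(PHASES):
        z = layer(z, PHASES[full], strength=frac)
    return z

for depth in [0.0, 1.0, 2.5, 3.0, 6.0]:
    print("K =", depth, " y =", forward_fractional(0.7, depth))
```

With this choice, a depth of 2.5 means two full layers followed by the third layer at half amplitude, and the output at depth 3.0 is the limit of the outputs as the depth approaches 3 from below.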

A walk through the depths.

K = 0. No layers — the composition is the identity. Output equals input, density equals base. The starting point of any flow.

K = 1. One layer. Same kind of bimodal output you saw in the previous chapter, with peaks where that single layer's derivative $f_1'$ is small. A single sine ripple isn't very expressive on its own.

K = 3. Three layers, each ripple at a different phase, applied in sequence. The output density now has multiple peaks that aren't at the simple locations from one layer alone: each layer was looking at a transformed input, so the squeeze points show up wherever the cumulative chain rule makes $f'(x)$ small.

K = 6. Full depth. The output is densely multimodal: half a dozen peaks of varying height, none of which correspond to anything obvious in the base Gaussian. We've constructed a complicated density from a simple one and a small handful of identical-form layers. No layer alone could produce this; all six together can.

Watch a particular colored dot as you sweep the slider, say one of the green ones. At $K = 0$ it sits right at $y = x$. As $K$ grows, it migrates around. Each layer "sees" the previous layer's output and adds a sine ripple to it; the result is a path through y-space that depends on the entire stack.


Inverting the stack

Composition handles the forward direction (sample $x$, apply layers, get $y$) but flows also need the inverse direction (given $y$, find the $x$ that produced it, so we can evaluate the density). Inverting the composition is just inverting each layer in reverse order:

$$x = f^{-1}(y) = f_1^{-1}(f_2^{-1}(\cdots f_K^{-1}(y) \cdots))$$

Each $f_k^{-1}$ is a 1D root-finding problem; for the demo we use bisection per layer, which is robust and cheap. In higher dimensions the per-layer inverse becomes the central design constraint of any flow architecture: a layer that's fast to compute forward but slow to invert (or vice versa) is fine for some uses (sampling-only, density-only) but not for the full forward-and-inverse density model.
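A per-layer bisection inverse might look like the sketch below. It is not the demo's actual code: the phases, the iteration count, and the function names are assumptions. The bracket comes from the layer form itself: since the sine term is bounded by the amplitude, the solution of $f_k(z) = y$ must lie within 0.32 of $y$, and the layer is strictly increasing because its derivative is at least $1 - 0.64 = 0.36$.

```python
import math

AMP = 0.32
PHASES = [0.0, 1.1, 2.3, 3.1, 4.4, 5.2]  # hypothetical phases for six layers

def layer(z, phi):
    """f_k(z) = z + AMP * sin(2z + phi), strictly increasing since 2*AMP < 1."""
    return z + AMP * math.sin(2.0 * z + phi)

def layer_inverse(y, phi, iters=60):
    """Solve layer(z, phi) = y by bisection.

    |layer(z) - z| <= AMP, so the root is bracketed by [y - AMP, y + AMP].
    """
    lo, hi = y - AMP, y + AMP
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if layer(mid, phi) < y:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies at or to the left of mid
    return 0.5 * (lo + hi)

def forward(x, phases=PHASES):
    z = x
    for phi in phases:
        z = layer(z, phi)
    return z

def inverse(y, phases=PHASES):
    """Invert the stack: last layer first, one bisection per layer."""
    z = y
    for phi in reversed(phases):
        z = layer_inverse(z, phi)
    return z

x = -1.3
y = forward(x)
print("x =", x, " y =", y, " recovered x =", inverse(y))
```

Sixty halvings shrink the initial bracket of width 0.64 below double-precision resolution, so a fixed iteration count is enough here and no tolerance parameter is needed.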


What this is, and what it isn't

What we have now is a $K$-layer flow with a fixed structure, fixed phases, and fixed amplitudes. It can produce multimodal densities, but it can't produce specific multimodal densities: we have no knobs that say "put a peak here, put another peak there." For that we need two more pieces.

The first is parameterization: replacing the hand-picked amplitudes and phases with learnable parameters. Each layer becomes a small parameter vector, and the whole flow becomes a parameterized family of densities. The second is training: a procedure for adjusting those parameters so that the flow's density matches a target, usually the empirical density of a dataset. For flows the procedure is direct: maximize the log-likelihood of the data under the model. We'll get to both, but first we have to handle the dimensional jump from 1D to 2D. The $|f'(x)|$ in our formula has to become $|\det J_f(x)|$, the absolute value of the Jacobian determinant, and that's where flows start being more than just "1D change of variables but bigger."
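For orientation, the maximum-likelihood objective can be written in this chapter's notation. This is only a sketch: the parameter vector $\theta$, the dataset $y_1, \dots, y_N$, and the subscripted symbols $f_{k,\theta}$ and $z_{i,k}$ are notation assumed here for illustration and may differ from what later chapters use.

```latex
\theta^\star = \arg\max_{\theta} \sum_{i=1}^{N} \log p_\theta(y_i),
\qquad
\log p_\theta(y_i) = \log p_X(x_i) - \sum_{k=1}^{K} \log \left| f_{k,\theta}'(z_{i,k-1}) \right|,
\qquad
x_i = f_\theta^{-1}(y_i), \quad z_{i,0} = x_i, \quad z_{i,k} = f_{k,\theta}(z_{i,k-1}).
```

Each data point is first pulled back through the inverse to find its base-space preimage, and then the same per-layer sum from the log-density formula is evaluated along its forward chain.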

Why does adding more layers help?

Because each layer contributes its own term to the log-density, evaluated at a point that depends on every previous layer. The space of densities you can reach grows quickly with depth. There's a formal sense in which sufficiently flexible flows are universal density approximators (in 1D, even a single layer from a flexible class can approximate any density; in higher dimensions you generally need depth and the right architecture). For this chapter the simpler observation is enough: more layers, more shapes.

Why don't the layers' effects cancel out as the phases drift?

Because each layer is applied after the previous, not added to it. If you summed the ripples $\sin(2z + \phi_k)$ against a fixed $z$, the phases would interfere and the sum would average to something close to a single sine of small amplitude. But composition is different: layer 2 sees layer 1's output, layer 3 sees layer 2's output, and so on. The ripples compound rather than cancel. You can see this in the demo by stepping through K = 1, 2, 3: the output is not getting smoother, it's getting more structured.
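The "sum collapses to a single sine" half of this argument can be checked numerically. Sinusoids of the same frequency add like complex phasors, so a sum of ripples evaluated at the same $z$ is exactly one sinusoid, whatever the phases. The sketch below verifies that identity and shows that the composed map is a different function. The phases and all names are hypothetical.

```python
import cmath
import math

AMP = 0.32
PHASES = [0.0, 1.1, 2.3]  # hypothetical phases, for illustration only

def additive_ripple(z):
    """Sum of all ripples evaluated at the same fixed input z."""
    return sum(AMP * math.sin(2.0 * z + phi) for phi in PHASES)

# Phasor sum: sum_k AMP * exp(i * phi_k) = R * exp(i * psi).
phasor = sum(AMP * cmath.exp(1j * phi) for phi in PHASES)
R, psi = abs(phasor), cmath.phase(phasor)

def single_sine(z):
    """The one sinusoid that the additive ripples collapse to."""
    return R * math.sin(2.0 * z + psi)

def composed(x):
    """Composition: each ripple is evaluated at the previous layer's output."""
    z = x
    for phi in PHASES:
        z = z + AMP * math.sin(2.0 * z + phi)
    return z

grid = [-4.0 + 8.0 * i / 400 for i in range(401)]
max_gap = max(abs(additive_ripple(z) - single_sine(z)) for z in grid)
max_diff = max(abs(composed(z) - (z + additive_ripple(z))) for z in grid)
print("sum vs single sine, max gap:", max_gap)
print("composed vs additive map, max difference:", max_diff)
```

The first number is at the level of floating-point rounding, since the identity is exact. The second is not small: the composed map cannot be rewritten as the identity plus one sinusoid, which is why depth adds structure.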

What stops the depth from making the inverse intractable?

Nothing fundamental — and that's a real concern. Each layer inverse is a 1D root-find, fast. The composed inverse is K sequential root-finds, which is linear in K and still cheap for K up to thousands. But: in higher dimensions, the inverse of a layer might require solving an equation whose cost scales with the dimension. Architectures like coupling layers (next-but-one chapter) are designed exactly so that the per-layer inverse stays as cheap as the forward pass, even in high dimensions. The composition pattern we built here is the foundation; the engineering is in choosing layer shapes that keep both directions efficient.