Normalizing Flows

I'd heard about normalizing flows mapping a simple distribution to a complicated one and went down a rabbit hole trying to figure out what you'd actually USE one for. Turns out: sampling from things that are hard to sample from. Say you have a forward simulator and you know it's correct. The ordinary move is to pick some family of distributions, fit it to the simulator's output, and sample from the fit. But if your answer is a mixture of 37 Gaussians, do you really believe that's the model? Normalizing flows let you draw from a normal distribution, push the draw through an invertible transformation, and get something arbitrarily complicated. Because the transformation is invertible, you can go in either direction and read off the exact density anywhere.

Why yet another generative model? Two good answers.

  1. The log-density. With most generative models (think diffusion, VAE, GAN) you don't get cheap, exact access to the log-density of a sample. What does the log-density buy you? Exact maximum likelihood (not a variational bound) and outlier detection, neither of which is straightforward with a model that can't score its own samples.
  2. The math is remarkably simple. Remember the chain rule, change of variables, and all the times you asked "when will we ever use change of variables as anything other than an intellectual exercise?" Here it is. Change of variables lets you push samples from a simple distribution through an invertible map and read off the exact density of the complicated distribution on the other side.

Does change of variables actually make your life easier, or is it still the rote calculus exercise it was in undergrad? Concrete example. Take a standard normal random variable Z and the map

Its derivative is always positive (it never drops below 0.1), so f is strictly increasing and therefore invertible. Push Z through f and you get a perfectly well-defined random variable X = f(Z), clearly not Gaussian and not anything with a name. Two questions: how do you sample from it, and how do you evaluate its density at a point x? Sampling is one line: draw z from the standard normal, compute f(z), done. The density is one formula: invert f to recover the unique z that maps to x, then change of variables gives

    p_X(x) = p_Z(z) / |f'(z)|,   where z = f^{-1}(x).
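
To make the two operations concrete, here is a minimal NumPy sketch. The map in it is an assumption chosen for illustration, f(z) = z + 0.9 sin(z), whose derivative 1 + 0.9 cos(z) never drops below 0.1; it is a stand-in, not necessarily the map used above. The helper names (f_inverse, sample_x, density_x) are made up for this example, and the inverse is found by bisection because this f has no closed-form inverse. The density line is the 1D change-of-variables formula p_X(x) = p_Z(z) / |f'(z)| with z = f^{-1}(x).

```python
import numpy as np

# Hypothetical stand-in map (an assumption, not necessarily the text's map):
# f(z) = z + 0.9*sin(z), so f'(z) = 1 + 0.9*cos(z) >= 0.1 everywhere.
def f(z):
    return z + 0.9 * np.sin(z)

def f_prime(z):
    return 1.0 + 0.9 * np.cos(z)

def f_inverse(x, iters=60):
    """Invert f by bisection; valid because f is strictly increasing."""
    x = np.asarray(x, dtype=float)
    # |f(z) - z| <= 0.9, so the solution of f(z) = x lies in [x - 1, x + 1].
    lo, hi = x - 1.0, x + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_low = f(mid) < x
        lo = np.where(too_low, mid, lo)
        hi = np.where(too_low, hi, mid)
    return 0.5 * (lo + hi)

def standard_normal_pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def sample_x(n, rng):
    """Sampling: draw z ~ N(0, 1) and push it through f."""
    return f(rng.standard_normal(n))

def density_x(x):
    """Density: recover z = f^{-1}(x), then apply change of variables."""
    z = f_inverse(x)
    return standard_normal_pdf(z) / np.abs(f_prime(z))

rng = np.random.default_rng(0)
samples = sample_x(10_000, rng)

# Riemann-sum check that the density integrates to (approximately) one.
grid = np.linspace(-8.0, 8.0, 20001)
total_mass = np.sum(density_x(grid)) * (grid[1] - grid[0])
```

Summing density_x over a fine grid times the grid spacing should come out very close to 1, which is a quick sanity check that the change-of-variables density is properly normalized.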

The advantage of flows is that they describe X in two ways at once. By construction, X = f(Z): that's what gives you sampling. By density, change of variables hands you the exact p_X(x): that's what gives you maximum likelihood and outlier detection. Most generative models give you one half for free and force you to pick up a part-time job to get the other. Energy-based models hand you an unnormalized density and nothing else (no closed-form CDF to invert, no envelope for rejection sampling), so sampling means MCMC. GANs hand you a generator and no density in any usable form. VAEs split the difference: sampling through the decoder is easy, but density evaluation only gets you a variational lower bound (the ELBO) on the log-likelihood. The price flows pay for having both is restricting f to be invertible with a cheap Jacobian determinant, which is what coupling layers and autoregressive flows are about.

This series builds normalizing flows from the simplest possible starting point. Every chapter has at least one interactive visualization. The first chapter is enough to understand what a flow is; the later chapters are about making them expressive, training them, and finally fitting one to a real 2D target.


Chapters

  1. Change of variables in 1D — Push a standard Gaussian through an invertible function and watch the density transform. This chapter is the one mandatory piece of math for everything that follows.
  2. Composing flows — Stacking simple invertible pieces gives a much more expressive transform. The log-density picks up a sum of log-derivative terms.
  3. Going to 2D: the Jacobian determinant — In higher dimensions |f'(z)| becomes |det J_f(z)|. We see why a general Jacobian determinant is too expensive and what to do about it.
  4. Coupling layers (RealNVP) — The trick that makes the Jacobian determinant cheap: split the variables in half, only transform one half conditioned on the other.
  5. Training a flow on two moons — End-to-end: a 6-layer RealNVP in PyTorch, trained on the two-moons dataset by minimizing the negative log-likelihood. About 70 lines of code; in 2000 steps the flow drops the NLL from a Gaussian baseline of 2.84 to 1.24 — capturing ~1.6 nats per point of structure beyond what a Gaussian could.
  6. What's next — Survey: NICE, RealNVP, MAF, IAF, neural spline flows, continuous normalizing flows. Where to go and what to read.

The chapters marked "upcoming" aren't written yet.
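
Chapters 2 through 4 can be previewed in a few lines. The sketch below is a toy 2D affine coupling flow in NumPy with random, untrained weights; the names (make_layer, conditioner, flow_forward, and so on) and the tiny tanh conditioner are hypothetical choices for illustration, not the code from chapter 5. Each layer leaves one coordinate alone and rescales and shifts the other, so its Jacobian is triangular and log|det J| is just the log-scale s; stacking layers adds those terms up.

```python
import numpy as np

def make_layer(rng, hidden=8):
    """Random (untrained) weights for a tiny conditioner; illustrative only."""
    return {
        "W1": rng.normal(size=(hidden, 1)),
        "b1": rng.normal(size=hidden),
        "Ws": 0.5 * rng.normal(size=(1, hidden)),
        "Wt": rng.normal(size=(1, hidden)),
    }

def conditioner(p, a):
    """Map the untouched coordinate a (shape (n,)) to log-scale s and shift t."""
    h = np.tanh(a[:, None] @ p["W1"].T + p["b1"])  # (n, hidden)
    s = np.tanh(h @ p["Ws"].T)[:, 0]               # bounded log-scale, (n,)
    t = (h @ p["Wt"].T)[:, 0]                      # shift, (n,)
    return s, t

def coupling_forward(p, z, keep):
    """Transform the coordinate that is not `keep`, conditioned on z[:, keep]."""
    move = 1 - keep
    s, t = conditioner(p, z[:, keep])
    x = z.copy()
    x[:, move] = z[:, move] * np.exp(s) + t
    return x, s  # triangular Jacobian, so log|det J| = s

def coupling_inverse(p, x, keep):
    """Undo coupling_forward; the kept coordinate is unchanged, so s, t match."""
    move = 1 - keep
    s, t = conditioner(p, x[:, keep])
    z = x.copy()
    z[:, move] = (x[:, move] - t) * np.exp(-s)
    return z

def flow_forward(layers, z):
    """Compose layers, alternating which coordinate is kept; sum the log-dets."""
    log_det = np.zeros(len(z))
    for i, p in enumerate(layers):
        z, s = coupling_forward(p, z, keep=i % 2)
        log_det += s
    return z, log_det

def flow_inverse(layers, x):
    for i, p in reversed(list(enumerate(layers))):
        x = coupling_inverse(p, x, keep=i % 2)
    return x

def log_density(layers, x):
    """Exact log p_X(x) = log p_Z(z) - log|det J_f(z)| with z = f^{-1}(x)."""
    z = flow_inverse(layers, x)
    _, log_det = flow_forward(layers, z)
    log_pz = -0.5 * np.sum(z**2, axis=1) - np.log(2.0 * np.pi)  # 2D standard normal
    return log_pz - log_det

rng = np.random.default_rng(0)
layers = [make_layer(rng) for _ in range(4)]
z = rng.standard_normal((5, 2))
x, log_det = flow_forward(layers, z)
```

Comparing log_det against a finite-difference Jacobian of flow_forward is a good way to convince yourself the shortcut is exact and not an approximation.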


Prerequisites

Probability density basics: what a PDF is, how to integrate one, what it means to sample from a distribution. A bit of calculus (chain rule, derivative of a function, the change-of-variables formula in 1D, even if you don't remember it cleanly). For the later chapters: linear algebra (determinants, matrix-vector products) and a working familiarity with automatic differentiation, since training a flow is "differentiate the log-likelihood through the network."

No machine-learning background required for the first three chapters. The maximum-likelihood chapter (#5) assumes you've seen gradient descent on a parameterized loss before; if you haven't, that's the only piece worth picking up first.