“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Normalizing Flows

Simulation-Based Inference

ABC's four failures from the last post all pointed at the same fix: stop rejecting simulations, start learning a function. Specifically, learn a conditional density $q_{ϕ} (θ ∣ x)$ from simulated $(θ, x)$ pairs. After training, querying it at $x_{obs}$ gives you the posterior in a forward pass — no rejection, no summary statistics, no per-observation rerun.

That's the plan. The question is what kind of object $q_{ϕ}$ can actually be. Most off-the-shelf density estimators are missing one piece or another. Normalizing flows are the one family that has everything you need at once. This post is about why they're the right tool and what they look like.

What we need from the density estimator

Four requirements; drop any one and the method fails.

Tractable density. Given any $θ$ , we can evaluate $q_{ϕ} (θ ∣ x)$ as a number. Without this we can't compute the loss that trains the model — and we can't say "the model thinks $θ_{1}$ is more likely than $θ_{2}$ " with anything more than samples.

Tractable sampling. Given $x_{obs}$ , we can draw $θ^{'}$ from $q_{ϕ} (θ ∣ x_{obs})$ cheaply. Without this we can't actually produce the posterior samples we wanted in the first place.

Flexibility. The posteriors that come out of real simulators are not Gaussians. They're skewed (asymmetric around the mode), often multimodal (two or more bumps when the model is genuinely ambiguous), often bounded (parameters live on intervals or have hard physical constraints), and almost always correlated across dimensions (knowing one parameter narrows the others). The density estimator has to be able to represent all of those shapes. A Gaussian fails three out of four. A mixture of Gaussians handles multimodality but not bounded support or skewness cleanly.

Trainable. The whole object is parameterized by some $ϕ$ and trained by gradient descent on a loss. Everything has to be differentiable.

Most generative models fail one of these. GANs sample beautifully but give you no density. VAEs and diffusion models give you flexibility and sampling but only a variational lower bound on density. Gaussians give you density and sampling but no flexibility. Flows are the one design that delivers all four exactly — no surrogate losses, no variational gaps.

The idea

A normalizing flow is one trick repeated. Take a simple base distribution you can sample from — a standard Gaussian $u \sim N (0, I)$ , almost always. Apply an invertible, differentiable transformation $θ = f_{ϕ} (u)$ . By the change-of-variables formula, the result is a new distribution whose density you can write down exactly.

Stack many such transformations. Each is invertible. Each contributes a Jacobian-determinant term to the log-density. The composition can fit very complicated shapes while staying tractable in both directions — you can evaluate $q_{ϕ} (θ)$ and you can sample from it, both from the same object, both cheaply. The rest is engineering: pick invertible transformations whose Jacobian determinants are easy to compute, and make them conditional on $x$ .

Change of variables, briefly

If $u \sim p_{u}$ and $θ = f_{ϕ} (u)$ with $f_{ϕ}$ invertible, then the density of $θ$ follows from change of variables:

$$\log q_{ϕ} (θ) = \log p_{u} (f_{ϕ}^{-1} (θ)) + \log \left| \det \frac{\partial f_{ϕ}^{-1}}{\partial θ} \right|$$

The intuition: as $f_{ϕ}$ stretches space, it spreads probability mass over a larger volume; the density at any output point is the density at the corresponding input point, divided by the local stretching factor. The log-determinant is that factor in log form. There's a slower derivation with worked 1D examples on the Change of Variables page if you want to see it geometrically. The rest of this post will use the formula above as a given.
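As a quick numerical check of the formula, here is a minimal sketch in plain Python. It assumes a one-dimensional flow $θ = \exp (u)$ with a standard normal base; the function names are illustrative and not taken from any library. If the formula is right, the change-of-variables density should agree with the textbook log-normal density.

```python
import math

def log_normal_base(u):
    """Log-density of the standard normal base distribution p_u."""
    return -0.5 * u * u - 0.5 * math.log(2.0 * math.pi)

def flow_log_density(theta):
    """log q(theta) for theta = exp(u), via change of variables."""
    u = math.log(theta)               # f^{-1}(theta)
    log_abs_det = -math.log(theta)    # log |d f^{-1} / d theta| = log(1 / theta)
    return log_normal_base(u) + log_abs_det

def lognormal_log_pdf(theta):
    """Closed-form log-density of a standard log-normal, for comparison."""
    return (-math.log(theta)
            - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * math.log(theta) ** 2)

for theta in (0.3, 1.0, 2.5):
    print(theta, flow_log_density(theta), lognormal_log_pdf(theta))
```

The two columns should match to floating-point precision: the $-\log θ$ term is exactly the log-determinant correction for the stretching done by $\exp$.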

For a composition $f_{ϕ} = f_{K} \circ f_{K - 1} \circ \dots \circ f_{1}$ , the log-determinants add up:

$$\log q_{ϕ} (θ) = \log p_{u} (u) + \sum_{k = 1}^{K} \log \left| \det \frac{\partial f_{k}^{-1}}{\partial z_{k}} \right|, \qquad z_{K} = θ, \quad z_{k - 1} = f_{k}^{-1} (z_{k}), \quad u = z_{0}$$

So if each layer has a cheap log-det-Jacobian, the whole stack does. That cheapness is the engineering constraint. A general $d \times d$ Jacobian determinant costs $O (d^{3})$ to compute, and that cost is paid for every sample at every training step, which becomes prohibitive as $d$ grows. Practical flow design largely comes down to one question: "what kind of invertible transformations have $O (d)$ log-determinants?" For every layer family below, the answer is the same: make the Jacobian triangular, so its determinant is the product of its diagonal entries.
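To make the cost difference concrete, here is a small sketch in plain Python. The matrix `jac` is a made-up lower-triangular Jacobian, and `det_general` is an illustrative Gaussian-elimination routine rather than a library call. It compares the general $O (d^{3})$ determinant with the $O (d)$ diagonal shortcut.

```python
import math

def det_general(m):
    """Determinant by Gaussian elimination with partial pivoting: O(d^3)."""
    a = [row[:] for row in m]         # work on a copy
    d = len(a)
    det = 1.0
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, d):
            factor = a[r][col] / a[col][col]
            for c in range(col, d):
                a[r][c] -= factor * a[col][c]
    return det

def log_abs_det_triangular(m):
    """log|det| of a triangular matrix: sum of log|diagonal|, O(d)."""
    return sum(math.log(abs(m[i][i])) for i in range(len(m)))

# Hypothetical lower-triangular Jacobian of a 3-dimensional layer.
jac = [
    [2.0, 0.0, 0.0],
    [0.7, 0.5, 0.0],
    [-1.3, 4.0, 3.0],
]
print(math.log(abs(det_general(jac))), log_abs_det_triangular(jac))
```

Both numbers are $\log 3$, since the diagonal product is $2 \cdot 0.5 \cdot 3$. The off-diagonal entries never enter the determinant, which is why a layer can make them arbitrarily complicated without paying for it.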

Layer families

Flows are built from one of a few standard layer types. They all impose triangularity, but they do it differently.

Affine flows

The simplest. Scale and shift each dimension independently:

$$θ_{i} = \exp (s_{i}) \cdot u_{i} + t_{i}$$

The Jacobian is diagonal; the log-determinant is just $\sum_{i} s_{i}$ . Useful as a building block — most flow stacks have an affine layer at the input or output for normalization — but if you only ever use affine layers you can't represent anything more complicated than a shifted, axis-scaled Gaussian. Real expressiveness has to come from non-axis-aligned moves.
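A minimal sketch of an affine layer and a two-layer stack, in plain Python. The layer parameters are arbitrary numbers chosen for illustration, and all function names are hypothetical. It shows the forward map, the inverse, and the log-determinants adding across layers.

```python
import math
import random

def affine_forward(u, s, t):
    """theta_i = exp(s_i) * u_i + t_i, plus log|det| of the forward map."""
    theta = [math.exp(si) * ui + ti for ui, si, ti in zip(u, s, t)]
    return theta, sum(s)

def affine_inverse(theta, s, t):
    """u_i = (theta_i - t_i) * exp(-s_i), plus log|det| of the inverse map."""
    u = [(th - ti) * math.exp(-si) for th, si, ti in zip(theta, s, t)]
    return u, -sum(s)

def std_normal_logpdf(u):
    return sum(-0.5 * ui * ui - 0.5 * math.log(2.0 * math.pi) for ui in u)

def log_density(theta, layers):
    """log q(theta) for a stack of affine layers applied in list order."""
    total_log_det = 0.0
    z = theta
    for s, t in reversed(layers):     # undo the last layer first
        z, log_det = affine_inverse(z, s, t)
        total_log_det += log_det      # inverse log-dets add across layers
    return std_normal_logpdf(z) + total_log_det

# Two made-up layers, each a (log-scales, shifts) pair for 2 dimensions.
layers = [([0.5, -1.0], [1.0, 0.0]), ([0.2, 0.3], [-2.0, 4.0])]
rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(2)]
theta = u
for s, t in layers:
    theta, _ = affine_forward(theta, s, t)
print(theta, log_density(theta, layers))
```

Note that the two layers collapse algebraically into a single affine map with log-scales $s_{1} + s_{2}$, so the stack is still an axis-aligned Gaussian. That is the limitation described above, seen in code.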

Coupling layers (Real NVP, NICE)

The workhorse of practical flows. Split the variables in half: $θ = (θ_{A}, θ_{B})$ . Pass $θ_{A}$ through unchanged. Transform $θ_{B}$ by an affine map whose scale and shift are functions of $θ_{A}$ :

$$θ_{A}^{'} = θ_{A}, \qquad θ_{B}^{'} = \exp (s_{ϕ} (θ_{A})) \cdot θ_{B} + t_{ϕ} (θ_{A})$$

The Jacobian is triangular by construction: $θ_{A}^{'}$ depends only on $θ_{A}$ , $θ_{B}^{'}$ on both halves. The log-determinant is $\sum_{i} s_{ϕ} (θ_{A})_{i}$ over the $B$ coordinates only — cheap to compute, easy to differentiate.

The functions $s_{ϕ}$ and $t_{ϕ}$ can be arbitrary neural networks. They don't have to be invertible themselves; the layer's invertibility comes from the structure (given $θ_{A}^{'} = θ_{A}$ , you can compute $s_{ϕ}$ and $t_{ϕ}$ directly, then invert the affine map to recover $θ_{B}$ ). Stack many coupling layers, alternating which half is the "passthrough" so that every coordinate eventually gets transformed, and the stack becomes far more expressive than any single layer.
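The invertibility argument can be checked on a toy case. The sketch below, in plain Python, assumes a two-dimensional coupling layer with one coordinate in each half. `s_net` and `t_net` are hand-written stand-ins for the neural networks $s_{ϕ}$ and $t_{ϕ}$ , deliberately non-invertible, and the finite-difference Jacobian is there only to verify the analytic log-determinant.

```python
import math

def s_net(theta_a):
    """Stand-in for the scale network s_phi (any function works)."""
    return math.tanh(1.5 * theta_a - 0.3)

def t_net(theta_a):
    """Stand-in for the shift network t_phi (not invertible, and need not be)."""
    return math.sin(2.0 * theta_a) + 0.5

def coupling_forward(theta_a, theta_b):
    s, t = s_net(theta_a), t_net(theta_a)
    out_b = math.exp(s) * theta_b + t
    return theta_a, out_b, s          # log|det| is s (one B coordinate here)

def coupling_inverse(out_a, out_b):
    s, t = s_net(out_a), t_net(out_a)  # computable because out_a == theta_a
    theta_b = (out_b - t) * math.exp(-s)
    return out_a, theta_b, -s

def numeric_log_abs_det(theta_a, theta_b, eps=1e-6):
    """Finite-difference 2x2 Jacobian determinant, for checking only."""
    def f(a, b):
        return coupling_forward(a, b)[:2]
    fa_p, fa_m = f(theta_a + eps, theta_b), f(theta_a - eps, theta_b)
    fb_p, fb_m = f(theta_a, theta_b + eps), f(theta_a, theta_b - eps)
    j00 = (fa_p[0] - fa_m[0]) / (2 * eps)
    j10 = (fa_p[1] - fa_m[1]) / (2 * eps)
    j01 = (fb_p[0] - fb_m[0]) / (2 * eps)
    j11 = (fb_p[1] - fb_m[1]) / (2 * eps)
    return math.log(abs(j00 * j11 - j01 * j10))

a, b = 0.4, -1.2
out_a, out_b, log_det = coupling_forward(a, b)
rec_a, rec_b, _ = coupling_inverse(out_a, out_b)
print((rec_a, rec_b), log_det, numeric_log_abs_det(a, b))
```

The off-diagonal entry `j10` is generally nonzero, because $θ_{B}^{'}$ depends on $θ_{A}$ , but `j01` is zero, so the determinant reduces to the diagonal product $1 \cdot \exp (s)$ .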

Autoregressive flows (MAF, IAF)

A different way to impose triangularity. Transform each coordinate $θ_{i}$ as a function of $θ_{1}, \dots, θ_{i - 1}$ :

$$θ_{i} = \exp (s_{ϕ} (θ_{1 : i - 1})) \cdot u_{i} + t_{ϕ} (θ_{1 : i - 1})$$

Triangular by construction. The asymmetry is in which direction is cheap: MAF (Masked Autoregressive Flow) parallelizes the density evaluation in one neural-net pass but samples sequentially (each $θ_{i}$ needs the previous coordinates filled in first). IAF (Inverse Autoregressive Flow) is the opposite — parallel sampling, sequential density. Pick by which direction you need.
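The asymmetry is easier to see in code. This is a rough sketch of the MAF direction in plain Python. `conditioner` is a hypothetical stand-in for the masked network, and a real implementation would compute every $(s_{i}, t_{i})$ in a single masked pass rather than a loop.

```python
import math
import random

def conditioner(prefix):
    """Stand-in for the autoregressive network: (s, t) from theta_{1:i-1}."""
    h = sum((k + 1) * v for k, v in enumerate(prefix))
    return 0.5 * math.tanh(h), 0.3 * h

def sample(u):
    """Sequential: theta_i cannot be computed until theta_{1:i-1} exist."""
    theta = []
    for u_i in u:
        s, t = conditioner(theta)
        theta.append(math.exp(s) * u_i + t)
    return theta

def log_density(theta):
    """Every (s_i, t_i) depends only on the given theta, so the terms are
    independent of each other; the loop is for clarity, not necessity."""
    log_q, recovered_u = 0.0, []
    for i, th in enumerate(theta):
        s, t = conditioner(theta[:i])
        u_i = (th - t) * math.exp(-s)
        recovered_u.append(u_i)
        log_q += -0.5 * u_i * u_i - 0.5 * math.log(2.0 * math.pi) - s
    return log_q, recovered_u

rng = random.Random(1)
u = [rng.gauss(0.0, 1.0) for _ in range(4)]
theta = sample(u)
log_q_val, recovered_u = log_density(theta)
print(theta, log_q_val)
```

In `sample`, each iteration reads the list it is still building, so the iterations cannot run in parallel. In `log_density`, each iteration reads only the fixed input `theta`, so they can.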

Neural spline flows

Replace the affine inner transform of a coupling or autoregressive layer with a monotone piecewise-rational-quadratic spline. Same triangular structure, dramatically more expressive per layer: a spline can capture skew, heavy tails, and bounded support that a single affine map can't, in one transformation instead of a stack of ten. Fewer layers needed for the same quality. This is the current standard for serious SBI work; the sbi package offers neural spline flows as one of its built-in conditional density estimators.
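For readers who want to see the inner transform itself, here is a sketch of a single monotone rational-quadratic spline in plain Python, following the standard interpolation formula for this spline family. The knot positions and derivatives below are fixed, made-up values. In an actual flow layer they would be produced by the conditioner network, and the inverse (not shown) requires solving a quadratic in each bin.

```python
import math
from bisect import bisect_right

def rq_spline(x, xs, ys, derivs):
    """Monotone rational-quadratic spline through knots (xs[k], ys[k]) with
    positive knot derivatives derivs[k]. Returns (y, log dy/dx).
    Outside [xs[0], xs[-1]] the map is the identity (linear tails)."""
    if x <= xs[0] or x >= xs[-1]:
        return x, 0.0
    k = bisect_right(xs, x) - 1       # bin with xs[k] <= x < xs[k + 1]
    width, height = xs[k + 1] - xs[k], ys[k + 1] - ys[k]
    slope = height / width
    xi = (x - xs[k]) / width          # position inside the bin, in [0, 1)
    d0, d1 = derivs[k], derivs[k + 1]
    denom = slope + (d0 + d1 - 2.0 * slope) * xi * (1.0 - xi)
    y = ys[k] + height * (slope * xi * xi + d0 * xi * (1.0 - xi)) / denom
    dydx = slope ** 2 * (d1 * xi * xi + 2.0 * slope * xi * (1.0 - xi)
                         + d0 * (1.0 - xi) ** 2) / denom ** 2
    return y, math.log(dydx)

# Hypothetical knots: endpoints match and end derivatives are 1, so the
# spline joins the identity tails smoothly.
xs = [-3.0, -1.0, 0.5, 3.0]
ys = [-3.0, -2.0, 1.5, 3.0]
derivs = [1.0, 0.4, 2.5, 1.0]

for x in (-2.0, -0.2, 1.7):
    print(x, rq_spline(x, xs, ys, derivs))
```

The returned `log dy/dx` is the per-coordinate contribution to the layer's log-determinant, playing the role that $s$ plays in the affine case.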

Conditional flows

Everything above describes an unconditional density $q_{ϕ} (θ)$ . For SBI we want $q_{ϕ} (θ ∣ x)$ — a different density for each value of $x$ . The standard trick: make every internal neural network inside the flow take $x$ as an additional input.

Concretely, an embedding network $h_{ψ} (x)$ maps the observation into a fixed-size latent vector. That latent gets concatenated to the inputs of every coupling sub-net $s_{ϕ}$ and $t_{ϕ}$ (or spline-parameter network, or autoregressive sub-net). The flow's per-layer parameters become functions of both the previous coordinates AND the embedding of $x$ .

Now the whole pipeline is a conditional density. Given $x_{obs}$ : run it through $h_{ψ}$ once to get an embedding; sample $u \sim N (0, I)$ ; transform forward through the flow conditioned on the embedding; out comes $θ^{'} \sim q_{ϕ} (θ ∣ x_{obs})$ . Want a density at some specific $θ$ ? Run the inverse, accumulate log-Jacobians, add the base log-density. One trained network handles any $x_{obs}$ with a single pass in either direction.
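The pipeline above can be sketched end to end with a deliberately tiny model, in plain Python. Everything here is an assumption made for illustration: `embed` is a fixed permutation-invariant summary standing in for a learned $h_{ψ}$ , `conditioner` uses made-up weights instead of trained ones, and the flow is a single conditional affine layer over a scalar $θ$ .

```python
import math
import random

def embed(x):
    """Stand-in for h_psi: permutation-invariant summary (mean, log-variance)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return mean, math.log(var + 1e-6)

def conditioner(embedding):
    """Stand-in for the sub-networks: log-scale s and shift t from the embedding."""
    mean, log_var = embedding
    s = 0.25 * log_var - 1.0          # made-up weights, not trained
    t = 0.9 * mean
    return s, t

def sample_posterior(x_obs, n, rng):
    s, t = conditioner(embed(x_obs))  # embedding computed once per observation
    return [math.exp(s) * rng.gauss(0.0, 1.0) + t for _ in range(n)]

def log_q(theta, x_obs):
    s, t = conditioner(embed(x_obs))
    u = (theta - t) * math.exp(-s)    # inverse pass through the flow
    # base log-density plus log|det| of the inverse map (-s)
    return -0.5 * u * u - 0.5 * math.log(2.0 * math.pi) - s

x_obs_1 = [1.2, 0.8, 1.1, 0.9]
x_obs_2 = [3.0, 2.5, 3.4, 3.1]
rng = random.Random(0)
samples = sample_posterior(x_obs_1, 2000, rng)
print(sum(samples) / len(samples), log_q(1.0, x_obs_1), log_q(1.0, x_obs_2))
```

The same two functions serve both observations with no retraining and no new simulations, which is the amortization property in miniature. With one affine layer the conditional density is only a Gaussian whose mean and width depend on $x$ ; swapping in coupling or spline layers changes the shape but not the structure of the calls.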

The embedding network $h_{ψ}$ is the part that handles structured or high-dimensional $x$ . For images it's a CNN; for time series, a temporal CNN or RNN; for graphs, a GNN; for sets, a permutation-invariant network. Whatever embedding produces a useful vector summary of $x$ , plug it in. The summary is learned end-to-end with the rest of the flow — no manual feature engineering, no human-picked summary statistics.

This is the explicit answer to ABC's summary-statistic burden from Post 3: the network learns its own summary as part of training, optimized to be informative about $θ$ rather than chosen by hand to be tractable.

What this object gives us

A conditional normalizing flow is a learnable, flexible, exact-density, exact-sampling model of $q_{ϕ} (θ ∣ x)$ . Each of ABC's four failure modes from the previous post is addressed by this single object:

Wasted simulations: every $(θ, x)$ pair contributes to training, regardless of whether $x$ is near any specific $x_{obs}$ . There's no rejection step throwing data away.

Curse of dimensionality: the embedding network $h_{ψ}$ learns a useful low-dimensional summary of $x$ as part of training. The flow then matches the posterior in $θ$ -space, whose dimension is usually modest. No $ε^{d}$ volume collapse.

Summary-statistic burden: covered by the embedding network. You're still committing to an architecture for $h_{ψ}$ , but the architecture choice is a much weaker commitment than picking a specific summary statistic, and it's the kind of choice ML practice has a lot of precedent for.

No amortization: inference at any new $x_{obs}$ is one forward pass through $h_{ψ}$ and one through the flow. After training, the per-observation cost is a fraction of a second, with no new simulations.

The cost is paid up front in training: simulation budget for the $(θ, x)$ dataset, and the structural commitment to a flow architecture. The next post is about how the training works — what the loss function is, what the data is, how much of it you need, and what diagnostics to run before trusting the output.