“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Neural Posterior Estimation

Simulation-Based Inference

⛓ What you need to know first 5 concepts, 5 layers

The requisite-knowledge inventory for this page, bottom-up: the primitives at the base, combined upward until you reach what this page assumes. Skim the layers you already own; start wherever the ground gets unfamiliar.

We have a conditional density estimator — a normalizing flow $q_{ϕ} (θ ∣ x)$ from the previous post. The flow can represent flexible shapes, evaluate density, and sample, all conditional on $x$ . Now we train it. The interesting part is how the training objective falls out of the problem: it reduces to plain supervised learning on simulated $(θ, x)$ pairs, which is what makes the whole pipeline work.

The pipeline is three steps. Sample $θ$ values from the prior and run the simulator to get matched $(θ, x)$ pairs. Train the flow to maximize $lo g q_{ϕ} (θ ∣ x)$ on those pairs. Query the trained flow at any $x_{obs}$ . That's it. No rejection step, no MCMC, no hand-designed summary statistics, no per-observation rerun. Below we'll derive the loss, then walk through both a from-scratch implementation (so the mechanics are visible) and the production version using the sbi package.

The loss function

Start from what we want. We want $q_{ϕ} (θ ∣ x) \approx p (θ ∣ x)$ for any $x$ the model produces. The cleanest measure of "approximately equal" between two distributions is KL divergence, so the objective is the expected forward KL between true and approximate posteriors:

L (ϕ) = E_{x \sim p (x)} [KL (p (θ ∣ x) ∥ q_{ϕ} (θ ∣ x))]

Forward KL specifically, not reverse. The direction matters. Forward KL is mode-covering: it punishes $q_{ϕ}$ for putting low density where $p$ has support. Reverse KL is mode-seeking: it punishes $q_{ϕ}$ for putting high density where $p$ has none. For posterior estimation we want mode-covering — if the posterior has two bumps and we miss one, the answer is structurally wrong, not just inaccurate.

Expand the KL inside the expectation:

KL (p (θ ∣ x) ∥ q_{ϕ} (θ ∣ x)) = E_{θ \sim p (θ ∣ x)} [lo g p (θ ∣ x) - lo g q_{ϕ} (θ ∣ x)]

The $lo g p (θ ∣ x)$ term has no $ϕ$ in it. Drop it — the gradient is the same. What's left is the cross-entropy term, with the outer expectation over $x$ :

L (ϕ) = - E_{x \sim p (x)} E_{θ \sim p (θ ∣ x)} [lo g q_{ϕ} (θ ∣ x)] + const

Now the two expectations collapse. Sampling $x \sim p (x)$ and then $θ \sim p (θ ∣ x)$ is sampling $(θ, x)$ jointly from $p (θ, x)$ . And we have a sampler for the joint — sample $θ$ from the prior, run the simulator. So:

L (ϕ) = - E_{(θ, x) \sim p (θ, x)} [lo g q_{ϕ} (θ ∣ x)] + const

Monte Carlo estimate with $N$ simulated pairs:

\hat{L} (ϕ) = - \frac{1}{N} i = 1 \sum N lo g q_{ϕ} (θ_{i} ∣ x_{i})

This is supervised learning. $(θ_{i}, x_{i})$ are the inputs and labels, $q_{ϕ} (θ ∣ x)$ is the model, negative log-likelihood is the loss. Every simulated pair contributes a gradient signal regardless of where $x_{i}$ happens to land. No rejection, no summary statistic, no per-observation rerun. The simulator's only job is to generate the training data; everything after is gradient descent.

The reason this is even non-obvious is that the cross-entropy objective is conditional on $x$ , and "sampling $x$ from somewhere then sampling $θ$ from its conditional posterior" sounds hard. The trick is the identity $p (θ, x) = p (x) p (θ ∣ x)$ , read in reverse: joint samples ARE samples from $x$ -marginal-then-conditional, and we know how to make joint samples.

The pipeline, in three steps

Mechanically simple.

Step 1: generate training data. Sample $N$ values of $θ$ from the prior. Run the simulator on each. You get $(θ_{i}, x_{i})$ pairs. The number $N$ is the simulation budget — typically $1 0^{4}$ to $1 0^{6}$ depending on simulator cost and posterior complexity. Rough rule: at least ten times more pairs than the flow has parameters.

Step 2: train the flow. Standard PyTorch / JAX training loop. Sample mini-batches of $(θ, x)$ pairs, compute $- lo g q_{ϕ} (θ ∣ x)$ , backprop, Adam step. Repeat until validation NLL stops improving.

Step 3: query at a real observation. Push $x_{obs}$ through the embedding network once. To sample a posterior $θ^{'}$ , draw $u \sim N (0, I)$ and run forward through the flow conditioned on the embedding. To evaluate a density at a specific $θ$ , run inverse through the flow and accumulate log-Jacobians. Either operation is one forward pass.

The simplest possible version

Before the production version, here's the entire pipeline implemented from scratch with the simplest conditional density you could write down — a conditional Gaussian. The Gaussian is too restrictive to capture the real posterior shape (the pinball's posterior over $p \in [0, 1]$ is bounded, possibly skewed), but it makes the mechanics visible. No flow library, no abstraction. Swap the conditional Gaussian for a flow and the rest is identical.

The pinball simulator, the same model as Post 1:

import numpy as np

def pinball(p, n_stages=8, alpha=0.5, rng=None):
    """Returns final position x given probability of a left kick p."""
    rng = rng or np.random.default_rng()
    x, v = 0.0, 0.0
    for _ in range(n_stages):
        kick = -1.0 if rng.random() < p else 1.0
        v = alpha * v + kick
        x = x + v
    return x

Generate the training set — 100,000 forward simulations. No rejection. Every pair is kept:

N = 100_000
rng = np.random.default_rng(0)

# Prior: Beta(2, 2) — weakly informative on (0, 1).
ps = rng.beta(2, 2, size=N)
xs = np.array([pinball(p, rng=rng) for p in ps])

# Standardize x for numerically stable training.
x_mean, x_std = xs.mean(), xs.std()
xs_norm = (xs - x_mean) / x_std

The conditional density. An MLP takes $x$ , outputs $(μ, lo g σ)$ , and the density is $N (μ (x), σ (x)^{2})$ :

import torch
import torch.nn as nn
import torch.optim as optim

class ConditionalGaussian(nn.Module):
    """q(p | x) modeled as N(mu(x), sigma(x)^2). Mu and log-sigma come from a small MLP."""

    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs (mu, log_sigma)
        )

    def params(self, x):
        out = self.net(x.unsqueeze(-1))
        mu = out[..., 0]
        log_sigma = out[..., 1].clamp(-5, 2)
        return mu, log_sigma

    def log_prob(self, p, x):
        mu, log_sigma = self.params(x)
        sigma = log_sigma.exp()
        z = (p - mu) / sigma
        return -0.5 * z ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)

    def sample(self, x, n=1000):
        mu, log_sigma = self.params(x.expand(n))
        sigma = log_sigma.exp()
        return mu + sigma * torch.randn(n)

Training loop. Sample a mini-batch, compute the NLL exactly as derived above, take an Adam step:

ps_t = torch.tensor(ps, dtype=torch.float32)
xs_t = torch.tensor(xs_norm, dtype=torch.float32)

model = ConditionalGaussian()
opt = optim.Adam(model.parameters(), lr=1e-3)

for step in range(5000):
    idx = torch.randint(0, N, (512,))
    p_batch, x_batch = ps_t[idx], xs_t[idx]
    loss = -model.log_prob(p_batch, x_batch).mean()  # NLL
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step}: nll = {loss.item():.4f}")

Inference at a new observation:

# Suppose true_p = 0.6 produced an observed x in the real world
true_p = 0.6
x_obs_raw = pinball(true_p, rng=rng)
x_obs = torch.tensor((x_obs_raw - x_mean) / x_std, dtype=torch.float32)

with torch.no_grad():
    mu, log_sigma = model.params(x_obs)
    samples = model.sample(x_obs, n=5000).clamp(0, 1).numpy()

print(f"posterior approx: N({mu.item():.3f}, {log_sigma.exp().item():.3f}^2)")

About sixty lines of code, no SBI library. The Gaussian limitation is real: the true posterior over $p \in [0, 1]$ is bounded, often skewed near the edges, and a Gaussian can't represent any of that. The point of this version is the mechanics, not the answer. Swap the ConditionalGaussian for a flow and you have NPE proper.

The real version with the sbi package

The sbi package handles flow construction, the training loop, and the inference utilities. The verbosity drops by about two-thirds:

from sbi.inference import SNPE
from sbi.utils import BoxUniform
import torch

# Prior over theta (p, in the pinball case).
prior = BoxUniform(low=torch.tensor([0.01]), high=torch.tensor([0.99]))

def simulator_torch(theta):
    p = theta.squeeze(-1).numpy()
    xs = np.array([pinball(float(p_)) for p_ in p])
    return torch.tensor(xs, dtype=torch.float32).unsqueeze(-1)

# Generate (theta, x) pairs.
N = 100_000
theta_train = prior.sample((N,))
x_train = simulator_torch(theta_train)

# Train a neural spline flow as posterior estimator.
inference = SNPE(prior=prior, density_estimator='nsf')
inference.append_simulations(theta_train, x_train).train(max_num_epochs=100)
posterior = inference.build_posterior()

# Inference at an observation.
x_obs = torch.tensor([[pinball(0.6)]], dtype=torch.float32)
samples = posterior.sample((5000,), x=x_obs)
log_prob = posterior.log_prob(samples, x=x_obs)  # density evaluation is also free

Under the hood, density_estimator='nsf' builds a conditional neural spline flow — the layer family from the previous post that gives the most expressiveness per parameter. The training loop is exactly the NLL we derived; the embedding network is a small MLP by default. Swap in a CNN by passing embedding_net=... if your observations are images. After training, posterior.sample and posterior.log_prob both work at any $x_{obs}$ in a single forward pass.

This is what production SBI looks like, modulo hyperparameters. The hard part isn't writing the code — it's deciding whether to believe what comes out.

Diagnostics: did inference actually work?

A trained NPE network always produces a posterior. Whether that posterior is right is a separate question, and the network can't answer it for you. The checks below are not optional. They're what separate "I trained a network" from "I have a defensible result."

Coverage testing and simulation-based calibration

The gold-standard check. The idea: for a held-out set of $(θ_{true}, x)$ pairs drawn from the joint, ask whether the network's posterior contains $θ_{true}$ at the right frequency. If you ask for the 90% credible region, $θ_{true}$ should fall inside it about 90% of the time. Falling inside less often means the posterior is overconfident (too narrow); more often means underconfident (too broad).

The cleanest version is the rank-statistics test, also called simulation-based calibration (SBC). For each held-out pair, draw many samples from $q_{ϕ} (\cdot ∣ x)$ and compute the rank of $θ_{true}$ among them. Across the whole held-out set, the distribution of these ranks should be uniform on ${0, 1, \dots, K}$ . A non-uniform rank histogram is direct evidence the posterior is miscalibrated — and the shape of the non-uniformity tells you how. U-shape: overconfident. Inverted U: underconfident. Skewed: biased. The sbi package has SBC as a one-liner; run it before publishing anything.

Sanity check against ABC

For low-dimensional problems where ABC is tractable, run ABC with a tight tolerance on the same observation and compare. NPE and ABC posteriors should agree up to $ε$ -blurring. ABC is slow but well-understood; NPE is fast but can fail silently. Disagreement on a problem where ABC works is a red flag — either ABC's $ε$ is too loose, or NPE is misbehaving in some way that needs diagnosing.

Common failure modes

Insufficient simulation budget. The flow underfits the joint. Posteriors look too broad; validation NLL plateaus high. Fix: more simulations. The rough rule of "at least ten times more pairs than flow parameters" is a starting point, not a guarantee.

Out-of-distribution observations. If $x_{obs}$ lies in a region of $x$ -space that your simulator rarely produces — because the prior puts little mass on $θ$ values that generate it — the network is extrapolating. You'll get a confident-looking posterior that's almost certainly wrong. Fix: widen the prior, or switch to a sequential method (SNPE) that adapts the proposal distribution toward $x_{obs}$ .

Mode collapse. If the true posterior is multimodal, a flow trained with forward KL is theoretically mode-covering, but with finite simulations and finite capacity it can still latch onto one mode. Fix: more flow capacity, more simulations, or a loss that combines forward and reverse KL (an active research area).

Simulator misspecification. Real $x_{obs}$ comes from a process that isn't exactly your simulator. NPE will give you a posterior over the simulator's parameters that best matches the real data, which may have no scientific meaning. This isn't an NPE failure — it's a modeling failure. Fix: actually check that your simulator can produce things that look like real data before trusting any inference. Posterior predictive checks are the right tool.

NPE in context

NPE is the most direct application of the flow-based machinery to SBI. Train a conditional density on simulated pairs, query at any observation. Two siblings show up in the same toolbox.

NLE (Neural Likelihood Estimation). Instead of $q_{ϕ} (θ ∣ x)$ , train a flow $q_{ϕ} (x ∣ θ)$ approximating the likelihood. Then sample the posterior with MCMC using the learned likelihood. Useful when you want to preserve the prior-likelihood split for interpretation or modular updates.

NRE (Neural Ratio Estimation). Train a binary classifier to distinguish $(θ, x)$ pairs drawn jointly from pairs drawn from the marginal product. The classifier's logit is $lo g p (x ∣ θ) / p (x)$ — the likelihood ratio. This is the cleanest theoretical link to traditional likelihood-based inference, and it's the bridge to the frequentist methods from The Likelihood Perspective. Same trained network, Bayesian use (multiply by prior → posterior) or frequentist use (treat as likelihood → confidence intervals).

Sequential variants (SNPE, SNLE, SNRE) iteratively refine the proposal distribution toward the posterior so simulations get spent where they matter most. Standard when the simulator is expensive — you can't afford the wide-net training set NPE wants for a single round.

All of these are variations on the same skeleton. Draw $(θ, x)$ from the joint distribution; use those samples to estimate the posterior, the likelihood, or the ratio. The choice of which quantity to estimate is mostly a choice of which downstream use is most convenient. The capstone post in the next one is about pushing that observation harder: every SBI method, including the rejection-based ones, is a different approximation scheme for the same path integral. Once you see them that way, the family stops looking like unrelated tricks and starts looking like one idea with multiple implementations.