“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

The Likelihood Perspective

Simulation-Based Inference

A reasonable first read of SBI is that it's a Bayesian framework. The posterior shows up everywhere, the prior is built into ABC, most of the literature talks about posterior estimation. The Bayesian framing is convenient, but it's not structural. The actual constraint SBI is working around is that the likelihood is intractable — and that constraint blocks frequentists just as hard. Frequentist SBI methods exist. They're worth seeing, if only to keep the distinction clean.

The misconception to avoid: thinking "SBI exists because you wanted a Bayesian framework." It doesn't. Both frequentist and Bayesian inference need a callable likelihood: MLE maximizes it, MCMC evaluates it inside the acceptance ratio, variational inference differentiates it through the ELBO, and Wald tests use its second derivative (the observed information). When the likelihood disappears, both camps are stuck. They just reach for different replacements.

The shared problem, framework-free

You have a simulator that draws samples from $p(x ∣ θ)$ but cannot evaluate that density. You have an observation $x_{obs}$. You want to learn something about $θ$. That much is shared between framework choices.

A Bayesian wants a posterior. A frequentist wants a point estimate and a confidence interval. Both need a callable function of $θ$ that captures "how plausible is $x_{obs}$ at this $θ$": a likelihood, or something playing the role of one. The three classical frequentist SBI methods below are different ways of constructing a stand-in for the missing likelihood.

Synthetic likelihood

Wood (2010) suggested the most direct trick. Pick a summary statistic $s(x)$, a low-dimensional projection of your observation. For each candidate $θ$, simulate the model many times, compute $s$ on each simulation, and fit a Gaussian to the resulting samples:

$$s(x) ∣ θ \sim \mathcal{N}\big(μ(θ), Σ(θ)\big)$$

You estimate $\hat{μ}(θ)$ and $\hat{Σ}(θ)$ from the simulated summaries, then evaluate the Gaussian density at the observed $s(x_{obs})$. That density is your synthetic likelihood; call it $\hat{L}(θ)$. Now you can do classical likelihood-based inference: maximize $\hat{L}$ for a point estimate, profile it for confidence intervals, do likelihood-ratio tests against nulls.
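The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Wood's implementation: the gamma simulator, the two summaries (sample mean and log sample variance), and the names `simulate`, `summary`, and `synthetic_loglik` are all hypothetical choices made for the example.

```python
import numpy as np

def simulate(theta, n_obs, rng):
    # Hypothetical toy simulator: n_obs draws from a gamma with shape theta.
    return rng.gamma(shape=theta, scale=1.0, size=n_obs)

def summary(x):
    # Assumed summary statistic s(x): sample mean and log sample variance.
    return np.array([np.mean(x), np.log(np.var(x))])

def synthetic_loglik(theta, s_obs, n_sims, n_obs, rng):
    # Simulate at theta, fit a Gaussian to the simulated summaries,
    # and evaluate its log density at the observed summary.
    S = np.array([summary(simulate(theta, n_obs, rng)) for _ in range(n_sims)])
    mu_hat = S.mean(axis=0)
    Sigma_hat = np.cov(S, rowvar=False)
    diff = s_obs - mu_hat
    _, logdet = np.linalg.slogdet(Sigma_hat)
    quad = diff @ np.linalg.solve(Sigma_hat, diff)
    return -0.5 * (len(s_obs) * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
theta_true = 3.0
s_obs = summary(simulate(theta_true, 200, rng))

# Maximize the synthetic log-likelihood over a coarse grid of candidates.
grid = np.linspace(1.0, 6.0, 26)
ll = np.array([synthetic_loglik(t, s_obs, 300, 200, rng) for t in grid])
theta_hat = grid[np.argmax(ll)]
print("synthetic-likelihood estimate:", theta_hat)
```

Because each evaluation uses fresh simulations, the estimated surface is noisy. A grid search tolerates that, whereas a gradient-based optimizer would likely need fixed seeds or some smoothing.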

The source of the bias is clear. The true distribution of $s(x) ∣ θ$ is rarely Gaussian, but synthetic likelihood approximates it with a Gaussian whose first two moments come from simulation. More simulations make $\hat{μ}$ and $\hat{Σ}$ tighter, but they do not remove the error from the Gaussian assumption itself. Replacing the Gaussian with something more flexible (a Gaussian process, a kernel density estimate, a normalizing flow) gives you smoother or more faithful approximations at the cost of more machinery.

Synthetic likelihood is the frequentist sibling of ABC. ABC throws away simulations that don't land near $x_{obs}$; synthetic likelihood keeps all of them and uses them to estimate the local likelihood surface in $s$-space.

Indirect inference and the method of simulated moments

An older idea, popular in econometrics. Pick an auxiliary model that you can actually fit — something tractable. The auxiliary doesn't need to be correct; it just needs to be fittable, and its parameters need to vary smoothly and identifiably with the true model's parameters.

Fit the auxiliary to your real data, giving auxiliary parameter estimates $\hat{β}(x_{obs})$. For each candidate $θ$, simulate from the real (intractable) model at $θ$ and fit the auxiliary to the simulation, giving $\hat{β}(x_{sim}(θ))$. The $θ$ that makes the simulated auxiliary estimates match the real ones is your indirect inference estimate:

$$\hat{θ}_{II} = \arg\min_{θ} \left\| \hat{β}(x_{obs}) - \hat{β}(x_{sim}(θ)) \right\|^{2}$$

Concretely: if the real model is a stochastic differential equation with no closed-form transition density, you might use a simple AR(1) or Gaussian VAR as the auxiliary. The AR(1) doesn't have to capture the SDE's full behavior — it only has to capture differences between behaviors at different SDE parameters. That's a much weaker requirement than getting the likelihood right.
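A rough sketch of that SDE-plus-AR(1) setup follows. Everything specific here is an assumption made for illustration: the drift $-θX$ (an Ornstein-Uhlenbeck process, which does have a closed-form transition density and is used only so the answer can be checked), the Euler-Maruyama step size, the grid search, and the function names `simulate_sde`, `fit_ar1`, and `ii_distance`.

```python
import numpy as np

def simulate_sde(theta, n_steps, dt, rng):
    # Euler-Maruyama path of dX = -theta * X dt + dW (toy stand-in model).
    noise = rng.normal(scale=np.sqrt(dt), size=n_steps - 1)
    x = np.zeros(n_steps)
    for t in range(1, n_steps):
        x[t] = x[t - 1] - theta * x[t - 1] * dt + noise[t - 1]
    return x

def fit_ar1(x):
    # Auxiliary model: AR(1) without intercept, fitted by least squares.
    # Returns beta_hat = (lag-1 coefficient, residual variance).
    x0, x1 = x[:-1], x[1:]
    phi = (x0 @ x1) / (x0 @ x0)
    resid = x1 - phi * x0
    return np.array([phi, resid.var()])

def ii_distance(theta, beta_obs, n_steps, dt, seed):
    # Re-seeding with the same seed for every theta (common random numbers)
    # keeps the objective smooth in theta.
    rng = np.random.default_rng(seed)
    beta_sim = fit_ar1(simulate_sde(theta, n_steps, dt, rng))
    return np.sum((beta_obs - beta_sim) ** 2)

dt = 0.1
theta_true = 1.0
x_obs = simulate_sde(theta_true, 5000, dt, np.random.default_rng(1))
beta_obs = fit_ar1(x_obs)

# Minimize the squared distance between auxiliary estimates over a grid.
grid = np.linspace(0.2, 3.0, 29)
dist = np.array([ii_distance(t, beta_obs, 20000, dt, seed=2) for t in grid])
theta_hat_ii = grid[np.argmin(dist)]
print("indirect inference estimate:", theta_hat_ii)
```

The AR(1) coefficient carries almost all the information about $θ$ here. The residual variance barely moves with $θ$, which is harmless: an auxiliary parameter that does not respond to $θ$ simply contributes a near-constant term to the distance.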

The method of simulated moments is the same idea with moments instead of an auxiliary model. Pick a set of moments — mean, variance, autocorrelation at various lags. Compute them on the real data. For each $θ$ , simulate and compute the same moments. Minimize the distance. This is Hansen-style GMM with simulation substituted for analytic moment computation. Same flavor, different summaries.
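Written in the same notation as the indirect inference estimator, the simulated-moments objective looks roughly like this. The symbols $m(\cdot)$ (the vector of chosen moments) and $W$ (a positive-definite weighting matrix) are introduced here for illustration:

$$\hat{θ}_{MSM} = \arg\min_{θ} \big( m(x_{obs}) - m(x_{sim}(θ)) \big)^{\top} W \big( m(x_{obs}) - m(x_{sim}(θ)) \big)$$

With $W = I$ this is the plain squared Euclidean distance. GMM practice usually takes $W$ to be the inverse of an estimate of the moment covariance, which downweights noisy moments.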

Both are end-to-end frequentist procedures. Confidence regions come from the asymptotic distribution of the distance-minimizing estimator — delta-method covariance, sandwich estimators, the usual machinery. No prior, no posterior, no Bayesian language anywhere.

Neural ratio estimation

The most interesting method is the one that bridges both worlds: neural ratio estimation (NRE). Train a classifier $d_{ϕ}(x, θ)$ to discriminate between $(x, θ)$ pairs sampled jointly from $p(θ) p(x ∣ θ)$ and pairs sampled from the product of marginals $p(θ) p(x)$, with the two classes equally represented. As training approaches the Bayes-optimal classifier:

$$d_{ϕ}(x, θ) \to \frac{p(x ∣ θ)}{p(x ∣ θ) + p(x)} \quad \Longrightarrow \quad \frac{p(x ∣ θ)}{p(x)} = \frac{d_{ϕ}(x, θ)}{1 - d_{ϕ}(x, θ)}$$

That ratio is the likelihood-to-marginal ratio. It's the thing you wanted from the start of inference: a callable function of $θ$ that scores how well $x_{obs}$ matches the model at that $θ$, relative to a baseline.
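To see the classifier trick at work without a deep-learning stack, here is a small sketch that replaces the neural network with logistic regression on quadratic features. That swap is an assumption made so the example runs in plain NumPy. The toy model ($θ \sim \mathcal{N}(0, 1)$, $x ∣ θ \sim \mathcal{N}(θ, 1)$) is also hypothetical, chosen because its true log-ratio is exactly quadratic: $\tfrac{1}{2}\log 2 - \tfrac{1}{4}x^{2} + xθ - \tfrac{1}{2}θ^{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Toy stand-in for a simulator: theta ~ N(0, 1), x | theta ~ N(theta, 1).
theta = rng.normal(size=n)
x = theta + rng.normal(size=n)

# Class 1: joint pairs (x, theta). Class 0: theta shuffled against x,
# which gives draws from the product of marginals p(theta) p(x).
theta_shuffled = rng.permutation(theta)

def features(x, theta):
    # Quadratic features [1, x^2, x*theta, theta^2]; enough for this toy model.
    x, theta = np.broadcast_arrays(np.asarray(x, float), np.asarray(theta, float))
    x, theta = x.ravel(), theta.ravel()
    return np.column_stack([np.ones_like(x), x**2, x * theta, theta**2])

X = np.vstack([features(x, theta), features(x, theta_shuffled)])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Fit logistic regression by Newton's method (cross-entropy loss).
w = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (p - y)
    hess = (X * (p * (1.0 - p))[:, None]).T @ X
    w -= np.linalg.solve(hess, grad)

def log_ratio(x, theta):
    # log d/(1-d) is the classifier's logit, i.e. the estimated log p(x|theta)/p(x).
    return features(x, theta) @ w

w_true = np.array([0.5 * np.log(2.0), -0.25, 1.0, -0.5])
print("fitted weights:", w)
print("exact weights: ", w_true)
```

In a real NRE setup the feature map and weights would be replaced by a neural network, but the data construction (joint pairs versus shuffled pairs) and the logit-as-log-ratio readout stay the same.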

Now you have a choice. Multiply by the prior and you get the Bayesian posterior, since $p(θ)\, p(x ∣ θ) / p(x) = p(θ ∣ x)$; that's the standard NRE-for-SBI pipeline. Or treat it as a likelihood directly: because $p(x_{obs})$ does not depend on $θ$, the ratio is proportional to the likelihood, so you can find the $θ$ that maximizes the estimated ratio, profile out nuisance parameters, and build confidence intervals from the log-ratio drop (by Wilks' theorem, the set of $θ$ whose log-ratio lies within 1.92 of the maximum is an approximate 95% confidence region for a single parameter). Same trained network, two different uses, no retraining.
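Both uses can be shown side by side on a grid. In this sketch the function `log_ratio` is a stand-in for a trained network's logit: it is the exact log-ratio of a hypothetical toy model ($θ \sim \mathcal{N}(0, 1)$, $x ∣ θ \sim \mathcal{N}(θ, 1)$), used so that both answers can be checked by hand. The value of `x_obs` and the grid are arbitrary.

```python
import numpy as np

def log_ratio(x, theta):
    # Stand-in for a trained ratio estimator: exact log p(x|theta)/p(x)
    # for the assumed toy model theta ~ N(0, 1), x | theta ~ N(theta, 1).
    return 0.5 * np.log(2.0) - 0.25 * x**2 + x * theta - 0.5 * theta**2

x_obs = 1.3
grid = np.linspace(-4.0, 6.0, 10001)
lr = log_ratio(x_obs, grid)

# Frequentist use: maximize the ratio, then keep every theta whose
# log-ratio is within 1.92 of the maximum (Wilks, one parameter, 95%).
theta_hat = grid[np.argmax(lr)]
inside = grid[lr >= lr.max() - 1.92]
conf_interval = (inside.min(), inside.max())

# Bayesian use: multiply the same ratio by the prior and normalize on the grid.
log_prior = -0.5 * grid**2
weights = np.exp(lr + log_prior - np.max(lr + log_prior))
posterior = weights / weights.sum()
post_mean = np.sum(grid * posterior)

print("MLE:", theta_hat, "95% CI:", conf_interval)
print("posterior mean:", post_mean)
```

For this toy model the log-ratio is a parabola in $θ$ with unit curvature, so the Wilks interval has half-width $\sqrt{2 \times 1.92} \approx 1.96$ around $x_{obs}$, while the posterior is centred at $x_{obs}/2$. The two statements differ even though they come from the same function.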

This is why particle physics is one of NRE's biggest users. The field cares about both kinds of statements — posterior probabilities and frequentist confidence intervals on cross-sections and coupling constants — and NRE gives you both from one training run. The Bayesian-vs-frequentist choice happens after the network is trained, at the point of constructing the output statement, not at the point of fitting the model.

Why the Bayesian framing dominates anyway

Given that frequentist SBI is fully workable, why does most of the field talk in posterior language? Two reasons, neither structural.

First, a simulator is mechanically a sampler. Push $θ$ in, get $x$ out. Repeating this with $θ$ drawn from any distribution you choose (which then plays the role of a prior) gives you $(θ, x)$ pairs from a joint distribution, and conditioning on $x \approx x_{obs}$ gives you a posterior. The Bayesian update IS what you have when you have a sampler. Frequentist inference needs pivotal quantities or asymptotic distributions of estimators, which require extra work to extract from raw simulator output.
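The "sampling, then conditioning" step is short enough to write out. The sketch below assumes a hypothetical toy simulator ($θ \sim \mathcal{N}(0, 1)$, $x ∣ θ \sim \mathcal{N}(θ, 1)$) and an arbitrary tolerance `eps`; for this model the exact posterior at $x_{obs} = 1$ is $\mathcal{N}(0.5, 0.5)$, which gives something to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Draw (theta, x) pairs from the joint: theta from the prior, x from the simulator.
theta = rng.normal(size=n)            # assumed prior N(0, 1)
x = theta + rng.normal(size=n)        # assumed toy simulator N(theta, 1)

# Condition on x being close to the observation.
x_obs, eps = 1.0, 0.05
keep = np.abs(x - x_obs) < eps
posterior_draws = theta[keep]

print("kept", keep.sum(), "of", n, "simulations")
print("mean:", posterior_draws.mean(), "variance:", posterior_draws.var())
```

No likelihood is evaluated anywhere, and nothing beyond the simulator and a filter is needed. That convenience is a large part of why the posterior is the default output.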

Second, the methods that pair best with neural density estimation — NPE, NLE, the rest of the series — are Bayesian by construction. They estimate $p (θ ∣ x)$ or $p (x ∣ θ)$ as density objects, and density objects play well with the prior-times-likelihood formula. Profile-likelihood frequentism on a learned likelihood requires more thought when the likelihood is a function approximated by a neural network with no analytic gradient structure.

Neither reason is fundamental. NRE shows the bridge cleanly, and the rest of this series will be careful not to over-commit to a framework where the math is genuinely framework-agnostic. If you're an "I want a confidence interval, not a credible interval" reader, the methods later in the series can give you one — you just have to read them slightly differently.

The takeaway

SBI is a toolbox for inference when the likelihood is intractable. Both camps end up reaching into the same toolbox; they just package the outputs differently. The Bayesian framing in this series is mostly notation — simulators sample, sampling conditions, conditioning gives a posterior, so the math reads Bayesian by inertia. The underlying methods are framework-agnostic where they need to be. ABC, which the next post is about, is the most-Bayesian of them. Synthetic likelihood is the most-frequentist. NRE will reappear later, and at that point it'll be obvious that the framework choice was never what mattered.