“Know how to solve every problem that has been solved.” “What I cannot create, I do not understand.” — Richard Feynman

Why Some Likelihoods Can't Be Written Down

Simulation-Based Inference

What you need to know first: Bayes' rule.

Some stochastic models you can run forward easily. Toss in some parameters, get a sample out. Running them backward is the hard part — you see a sample, you want the parameters, and the math you'd use in a textbook stops working. This is the situation that motivates simulation-based inference.

The cleanest way to feel why is to compare two simulators that look almost identical and then diverge. Same inputs, same number of moves — but one of them you can do inference on with one line of textbook math, and the other one you can't.

Two simulators

Both simulators take a single parameter $p \in [0, 1]$: the probability of a "left" kick. Both run for 8 stages. At each stage they sample a kick: $-1$ with probability $p$, $+1$ with probability $1 - p$. They differ only in how they translate kicks into motion.

Simulator one is the Galton board. Each kick is an independent left-right step. The position is just the running sum:

$$x_{t+1} = x_{t} + \text{kick}_{t}$$

Simulator two is a pinball. Each kick changes velocity, and the position moves by that velocity. A friction-like factor $α$ drains some velocity at each step. We'll use $α = 0.5$:

$$v_{t+1} = α v_{t} + \text{kick}_{t}, \qquad x_{t+1} = x_{t} + v_{t+1}$$

The pinball remembers. The same kick at high speed and the same kick at low speed produce different motions, because the kick lands on top of whatever velocity you already had. Galton has no such memory — each step is just $\pm 1$ , regardless of history.
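If it helps to see the two update rules as code, here is a minimal Python sketch. The function names (`sample_kicks`, `galton`, `pinball`) are made up for this illustration, and the starting state $x_0 = 0$, $v_0 = 0$ is an assumption, since the text doesn't spell it out.

```python
import random


def sample_kicks(p, n_stages=8, rng=random):
    """Draw one kick per stage: -1 (left) with probability p, else +1 (right)."""
    return [-1 if rng.random() < p else +1 for _ in range(n_stages)]


def galton(kicks):
    """Final position when each kick is an independent +/-1 step."""
    x = 0                     # assumed start: x_0 = 0
    for kick in kicks:
        x = x + kick          # x_{t+1} = x_t + kick_t
    return x


def pinball(kicks, alpha=0.5):
    """Final position when each kick changes a velocity damped by alpha."""
    x, v = 0.0, 0.0           # assumed start: x_0 = 0, v_0 = 0
    for kick in kicks:
        v = alpha * v + kick  # v_{t+1} = alpha * v_t + kick_t
        x = x + v             # x_{t+1} = x_t + v_{t+1}
    return x


rng = random.Random(0)        # fixed seed so the run is repeatable
kicks = sample_kicks(p=0.3, rng=rng)
print(kicks, galton(kicks), pinball(kicks))
```

Both functions consume the same kick list, so any difference in the output comes purely from how kicks are turned into motion.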

The inference problem

Now the question. You observe a final position $x_{\text{obs}}$ from one of these simulators. You want to estimate $p$. How do you actually do this?

Almost every textbook method reduces to the same primitive: you need the ability to evaluate the likelihood $p (x ∣ θ)$ at a given $(x, θ)$ pair. Maximum likelihood needs to maximize it. MCMC needs to evaluate it to compute acceptance ratios. Variational inference needs it inside the ELBO. Even bog-standard hypothesis testing needs to evaluate it under the null. This isn't a Bayes-vs-frequentist thing — both camps need a callable likelihood function. You can't take the gradient of something you can't compute.

For the Galton board this is easy. Eight independent $\pm 1$ kicks land at $x$ if and only if the number of lefts $k = (8 - x)/2$. The likelihood is just the binomial probability of seeing $k$ lefts out of 8:

$$p(x ∣ p) = \binom{8}{k} p^{k} (1 - p)^{8 - k}, \qquad k = \frac{8 - x}{2}$$

Closed form, one line. You can plug in any $(x, p)$ and get a number back. MLE for $p$ falls out as $\hat{p} = k/8$. Inference on the Galton board is a homework problem.
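As a sketch of what "callable likelihood" means here, the binomial formula fits in a few lines of Python. The names `galton_likelihood` and `galton_mle` and the observation `x_obs = -2` are hypothetical, chosen only for this example.

```python
from math import comb

N = 8  # number of stages, as in the text


def galton_likelihood(x, p, n=N):
    """P(final position = x | left-kick probability p) for the Galton board."""
    if abs(x) > n or (n - x) % 2 != 0:
        return 0.0            # unreachable: too far out, or wrong parity
    k = (n - x) // 2          # number of left kicks needed to end at x
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def galton_mle(x, n=N):
    """Maximum-likelihood estimate of p from one observed final position."""
    return (n - x) / (2 * n)  # k / n with k = (n - x) / 2


x_obs = -2                             # hypothetical observation: 5 lefts, 3 rights
print(galton_likelihood(x_obs, 0.5))   # 56 / 256 = 0.21875
print(galton_mle(x_obs))               # 5 / 8 = 0.625
```

Any method that needs to evaluate $p(x ∣ θ)$ can call `galton_likelihood` directly, which is exactly what the pinball won't offer.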

For the pinball, this stops working. The reason is worth seeing.

What goes wrong with the pinball

Pick three specific kick sequences with the same L/R count — four lefts and four rights each. In Galton, the L/R count is all that matters; all three sequences land at the same position, namely 0. In pinball, the three sequences end up in three different places.

Same number of lefts, same number of rights — three different final positions. Velocity carries memory between stages, and the order of the kicks determines what that memory accumulates to. The L/R count is not a sufficient statistic for the pinball: it doesn't compress the trajectory into something that still predicts $x$ .
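To make the order dependence concrete, here is a small Python sketch that pushes three sequences with four lefts and four rights through both update rules. These three orderings are picked for illustration and are not necessarily the three used on this page; the start state $x_0 = 0$, $v_0 = 0$ is also an assumption.

```python
def galton(kicks):
    """Final position when each kick is an independent +/-1 step."""
    x = 0
    for kick in kicks:
        x += kick
    return x


def pinball(kicks, alpha=0.5):
    """Final position when each kick changes a velocity damped by alpha."""
    x, v = 0.0, 0.0           # assumed start: at rest at position 0
    for kick in kicks:
        v = alpha * v + kick
        x += v
    return x


L, R = -1, +1                 # left and right kicks
sequences = {                 # illustrative orderings, 4 lefts and 4 rights each
    "LLLLRRRR": [L, L, L, L, R, R, R, R],
    "LRLRLRLR": [L, R, L, R, L, R, L, R],
    "RRRRLLLL": [R, R, R, R, L, L, L, L],
}
for name, kicks in sequences.items():
    print(name, galton(kicks), pinball(kicks))
# LLLLRRRR 0 -1.7578125
# LRLRLRLR 0 -0.6640625
# RRRRLLLL 0 1.7578125
```

Early kicks get carried forward through more stages of velocity than late ones, so front-loading the lefts drags the pinball further left than spreading them out.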

Drop a few hundred balls in both simulators and watch what happens to the histograms. The "3 paths" button tags the three sequences from above so you can pick them out by color.

[Interactive demo: ball-drop histograms for both simulators, with balls colored by L count from 0 (all R) to 8 (all L).]

Galton (memoryless): x ← x ± 1, each peg independent. p(x|θ) = C(n,k) · p^k (1−p)^(n−k), closed form.

Pinball (state-dependent): v ← αv ± 1, then x ← x + v. p(x|θ) = ∫ p(path|θ) δ(x − f(path)) d(path), intractable.

In Galton, each histogram bin is monochrome: all balls in bin $x$ share an L count, because L count fully determines $x$. In pinball, bins mix colors. Balls with different L counts can land in the same bin, and balls with the same L count land in different places. There's no compression of the kick sequence that predicts the bin.

The likelihood as a path integral

When a sufficient statistic exists, you can compute $p (x ∣ θ)$ by counting how many sequences yield $x$ and weighting by their probability — and the count is tractable because the sequences collapse under the statistic. Galton's binomial coefficient is exactly that count.

When no sufficient statistic exists, you can't collapse. The likelihood becomes a sum (or integral) over every latent trajectory that produces the observation:

$$p(x ∣ θ) = \int p(\text{path} ∣ θ)\, δ\big(x - f(\text{path})\big)\, d(\text{path})$$

For the discrete pinball this is a sum over $2^{N}$ kick sequences, each weighted by its probability under $θ$, restricted to the ones that end at $x$. With $N = 8$ that's 256 terms, which is fine, but the structure doesn't simplify, so the sum is what you have to compute. With $N = 50$ it's about a quadrillion ($2^{50} \approx 1.1 \times 10^{15}$). With $N = 1000$ the universe runs out of room.
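For small $N$ the sum can be done by brute force, and writing it out shows where the cost comes from. The sketch below is one way to do it in Python; `pinball_pmf` is a hypothetical name, the start state $x_0 = 0$, $v_0 = 0$ is assumed, and grouping positions by exact float equality only works because $α = 0.5$ keeps every position exactly representable. Other values of $α$ would likely need rounding or binning.

```python
from collections import defaultdict
from itertools import product


def pinball(kicks, alpha=0.5):
    """Final position f(path) for one kick sequence."""
    x, v = 0.0, 0.0           # assumed start: at rest at position 0
    for kick in kicks:
        v = alpha * v + kick
        x += v
    return x


def pinball_pmf(p, n=8, alpha=0.5):
    """Distribution of the final position, summing over all 2**n kick paths."""
    pmf = defaultdict(float)
    for kicks in product((-1, +1), repeat=n):
        lefts = kicks.count(-1)
        path_prob = p**lefts * (1 - p) ** (n - lefts)  # p(path | theta)
        pmf[pinball(kicks, alpha)] += path_prob        # keep paths ending at x
    return pmf


pmf = pinball_pmf(p=0.3)
print(sum(pmf.values()))      # total probability, 1 up to rounding
print(2**8, 2**50)            # 256 1125899906842624
```

The loop body is cheap; the problem is that the loop runs $2^{N}$ times, and nothing in the pinball's structure lets you skip iterations the way the binomial coefficient does for Galton.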

This is what people mean when they say the likelihood is intractable. It's not that no expression exists — the path integral above is a perfectly good expression. It's that you can't evaluate it. And the inference machinery you'd reach for assumes evaluability.

Path-dependence is one route, not the only one

Pinball is the cleanest example to picture, but it's not the only way an evaluable likelihood disappears. Models with state-dependent rates have the same issue from a different angle: Lotka-Volterra predator-prey populations, SIR epidemics, coalescent trees in population genetics. The state of the system controls the next event's rate, so different orderings of events produce different probabilities and there's no closed-form marginal.

Implicit generators give you a different flavor of the same problem: a GAN's generator is a simulator with no density at all — there's no $p (x ∣ θ)$ defined anywhere, only samples. High-dimensional observation models can be intractable just because the marginalization integral over nuisance latents is too big to compute, even when each piece has a closed form. The umbrella story is the same: the joint distribution doesn't reduce, and the marginal you'd want for inference is locked behind an integral you can't do.

The simulator is what's left

When you can't evaluate $p (x ∣ θ)$ , you can almost always still sample from it. The simulator IS the model — same object, given in sampling form rather than function form. You set $θ$ , push the button, and out comes an $x$ . That part still works for the pinball, for Lotka-Volterra, for GANs, for the whole intractable-likelihood family.

So the driving question of simulation-based inference is: can we do inference using only the simulator? Without ever evaluating the likelihood — only sampling from it? It turns out we can. The simplest method that does it is Approximate Bayesian Computation, and it amounts to Bayes' rule executed by Monte Carlo. That's where the series goes next.