ABC: Bayes by Monte Carlo

Simulation-Based Inference

Earlier in this series we had a simulator and no likelihood. The pinball, Lotka-Volterra, GANs, anything with intractable path integrals — same setup: a simulator you can run forward from $θ$ to $x$ , with a likelihood $p (x ∣ θ)$ you can't write down as a function. The question we ended on: can we do inference using only the simulator, without ever evaluating $p (x ∣ θ)$ ? The answer is yes, and the simplest method that does it is Approximate Bayesian Computation. For a quick refresher on Bayes' rule itself, take a look at the Bayes' Rule page.

ABC is a three-line algorithm. The first time you see it, it looks like a hack. By the end of this post it should look like textbook Bayes — just with a Monte Carlo step where the likelihood evaluation would normally go.

The algorithm

You have a simulator, a prior on the parameter $θ$ , and one observation $x_{obs}$ . To draw samples from the posterior $p (θ ∣ x_{obs})$ :

1. Sample θ' ~ prior.
2. Simulate x' from the simulator at θ'.
3. If |x' − x_obs| < ε, keep θ' as a posterior sample. Else discard.

Repeat until you have enough accepted θ'.

That's it. The histogram of accepted $θ^{'}$ is your posterior estimate — the widget below shows it building up in real time. The parameter $ε$ is the tolerance — how close the simulated observation has to be to the real one to count as a "hit". The likelihood is never evaluated; the simulator is the only object that touches $θ$ , and only in sampling mode.
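The three steps above can be sketched in a few lines of Python. This is a minimal sketch rather than the widget's actual code: `abc_rejection`, `sample_prior`, and `simulate` are hypothetical names, and the toy simulator (a Gaussian centred on θ) is an assumption chosen only so the example runs end to end.

```python
import random

def abc_rejection(sample_prior, simulate, x_obs, eps, n_accept, rng, max_draws=1_000_000):
    """Plain rejection ABC. Returns the accepted draws and the number of simulator runs."""
    accepted = []
    draws = 0
    while len(accepted) < n_accept and draws < max_draws:
        theta = sample_prior(rng)        # 1. sample theta' from the prior
        x_sim = simulate(theta, rng)     # 2. run the simulator at theta'
        draws += 1
        if abs(x_sim - x_obs) < eps:     # 3. keep theta' only if x' is within eps of x_obs
            accepted.append(theta)
    return accepted, draws

# Toy stand-ins (assumptions for illustration only).
def sample_prior(rng):
    return rng.uniform(0.0, 1.0)

def simulate(theta, rng):
    return rng.gauss(theta, 0.2)

rng = random.Random(0)
accepted, draws = abc_rejection(sample_prior, simulate, x_obs=0.6, eps=0.05,
                                n_accept=200, rng=rng)
print(f"accepted {len(accepted)} of {draws} draws")
```

Note that the loop only ever calls `simulate`; nothing in it evaluates a density. The `max_draws` cap is a safety net for cases where ε is so small that almost nothing is accepted.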

Try it

The widget below runs ABC on the pinball. Set the true $p$ (the parameter you want to recover, marked in orange), click "New observation" to draw an $x_{obs}$ from the pinball at that $p$ , then run sampling. Each candidate $p^{'}$ draws a tick on the prior axis — green if accepted, grey if rejected — and the accepted $p^{'}$ build up the posterior histogram below.

Caveat for real-world inference: you don't have the true $p$ . The demo shows it for pedagogy — so you can check whether the posterior is finding the right answer. In a real ABC run you'd have only $x_{obs}$ and the simulator; the question becomes "how concentrated is the posterior?" rather than "is it near the true value?"

[Interactive widget. Controls: true p (shown at 0.60) and ε (shown at 1.50). Simulator panel: each ball samples p ~ Uniform(0,1); orange line = x_obs, green band = within ε. ABC inference panel: top, each sampled p as a tick (grey = rejected, green = accepted); bottom, posterior histogram of accepted p.]

A few things to try. Set true $p = 0.6$ and run 500 samples; the posterior should pile up around 0.6. Drag $ε$ from 1.5 down to 0.5 and run again — the posterior tightens but the accept rate drops sharply. Move true $p$ to 0.3, draw a new observation, and watch where the next posterior lands. Each fresh observation gives a slightly different posterior; that's not a bug, it's how Bayesian inference at finite data is supposed to behave.

Why it works

ABC looks heuristic. It's actually rejection sampling, the textbook Monte Carlo trick, applied to the joint distribution $p (θ, x)$ and conditioned on $x \approx x_{obs}$ . The "conditioning" part is concrete: throw away every simulated pair $(θ^{'}, x^{'})$ where $x^{'}$ isn't close to $x_{obs}$ , keep the rest. The kept pairs are samples from the conditional distribution.

Standard rejection sampling for a posterior would say: sample $θ^{'} \sim p (θ)$, accept with probability proportional to the likelihood $p (x_{obs} ∣ θ^{'})$. ABC can't evaluate that likelihood as a number. What it can do instead is simple: an if-statement. Run the simulator at $θ^{'}$ to get $x^{'}$; accept $θ^{'}$ if $x^{'}$ lands close enough to $x_{obs}$. The acceptance probability under this rule is:

$$P (\text{accept} ∣ θ^{'}) = \int p (x ∣ θ^{'}) \, \mathbf{1} [\, |x - x_{obs}| < ε \,] \, d x$$

As $ε \to 0$ , this integral becomes proportional to $p (x_{obs} ∣ θ^{'})$ — exactly the likelihood the standard algorithm needs. So weighting $θ^{'}$ samples by "did the simulator land inside the tolerance ball?" is, in expectation, the same as weighting them by the likelihood. The simulator is producing a stochastic Monte Carlo estimate of the acceptance probability instead of evaluating it.
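The limit can be checked numerically in a case where the likelihood is known. The sketch below assumes a toy simulator $x \sim \text{Normal}(θ, 1)$ (not one of the simulators in this series), for which the acceptance probability has a closed form via the normal CDF. Dividing it by the window width $2ε$ should approach the likelihood as $ε$ shrinks.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def accept_prob(theta, x_obs, eps, sigma=1.0):
    """Exact P(accept | theta) for the assumed simulator x ~ Normal(theta, sigma^2)."""
    return normal_cdf(x_obs + eps, theta, sigma) - normal_cdf(x_obs - eps, theta, sigma)

x_obs, theta = 0.3, 1.0
likelihood = normal_pdf(x_obs, theta, 1.0)

ratios = {}
for eps in (1.0, 0.3, 0.1, 0.01):
    # P(accept) / (2 * eps) is the average density over the tolerance window.
    ratios[eps] = accept_prob(theta, x_obs, eps) / (2.0 * eps) / likelihood
    print(f"eps={eps:<5} [P(accept) / (2 eps)] / likelihood = {ratios[eps]:.4f}")
```

The printed ratio moves toward 1 as ε decreases, which is the "proportional to the likelihood" statement with the constant of proportionality $2ε$ made explicit.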

This is the conceptual unlock. ABC isn't a hack — it's Bayes' rule with the likelihood evaluation replaced by a sampling primitive. Once you see it that way, every other SBI method becomes a variation on the same theme: "what if we replaced the rejection step with something more efficient?". The neural methods that come later are doing exactly this.

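A note on the bias. Any finite $ε > 0$ blurs the posterior: ABC samples from $p (θ ∣ \, |x - x_{obs}| < ε)$, which is the posterior you would get if the likelihood were smoothed by the $ε$-kernel in data space, not the exact $p (θ ∣ x_{obs})$. You can shrink $ε$ to make the bias arbitrarily small, but the acceptance rate falls with it, so the number of simulator runs needed for a given number of accepted samples (and hence a given Monte Carlo variance) grows quickly. This bias-variance trade-off is the whole story of ABC tuning.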
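The trade-off is easy to see on a toy problem. The sketch below assumes a simulator $x \sim \text{Normal}(θ, 0.1^2)$ with a Uniform(0, 1) prior (an illustrative choice, not the pinball), runs the same number of draws at three tolerances, and reports the acceptance rate next to the spread of the accepted samples. The function name `run_abc` is hypothetical.

```python
import random
import statistics

def run_abc(eps, x_obs=0.6, n_draws=20_000, noise=0.1, seed=0):
    """Rejection ABC on the assumed toy simulator x ~ Normal(theta, noise^2)."""
    rng = random.Random(seed)            # same random stream for every eps
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, 1.0)    # prior: Uniform(0, 1)
        x_sim = rng.gauss(theta, noise)
        if abs(x_sim - x_obs) < eps:
            accepted.append(theta)
    return len(accepted) / n_draws, statistics.pstdev(accepted)

sweep = {eps: run_abc(eps) for eps in (0.5, 0.2, 0.05)}
for eps, (rate, spread) in sweep.items():
    print(f"eps={eps:<5} accept rate={rate:.3f}  std of accepted theta={spread:.3f}")
```

Smaller ε gives a tighter set of accepted samples but fewer of them per simulator run, which is the cost the paragraph above describes.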

The prior does double duty

One thing easy to skim past: in ABC — and in every SBI method that follows — the prior isn't just where prior belief lives. It's also the distribution the simulator gets queried at. Every candidate $θ^{'}$ you ever try is drawn from the prior. If your prior is uniform on $[0, 1]$ and the true $θ$ is 1.4, no draw will ever land on 1.4 — the algorithm has no mechanism to even propose it. The posterior you end up with is whatever subset of $[0, 1]$ happens to best mimic $x_{obs}$ , but the actual answer is unreachable. Not "biased posterior", not "wide posterior" — flatly cannot recover.

This is harsher than the usual textbook picture, where a wide-but-imperfect prior gets gently corrected by enough data. In SBI the prior is also the proposal distribution, so a too-narrow prior is a structural failure: regions outside its support are never simulated, and no amount of data fixes that. The practical rule is "wider than you think you need". Pick prior support that covers any $θ$ that's even physically plausible, then narrow later if calibration suggests you can.
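The unreachable-truth scenario can be reproduced directly. The sketch below assumes a toy simulator $x \sim \text{Normal}(θ, 0.2^2)$ and, for simplicity, takes the observation to sit exactly at the true value 1.4; both are illustrative assumptions. The prior is Uniform(0, 1), as in the example above.

```python
import random
import statistics

rng = random.Random(0)
true_theta = 1.4
x_obs = true_theta           # assumed observation, placed at the true value for simplicity
eps = 0.2

accepted = []
for _ in range(50_000):
    theta = rng.uniform(0.0, 1.0)       # prior support is [0, 1]; 1.4 is never proposed
    x_sim = rng.gauss(theta, 0.2)       # assumed toy simulator
    if abs(x_sim - x_obs) < eps:
        accepted.append(theta)

print(f"accepted: {len(accepted)}")
print(f"largest accepted theta: {max(accepted):.3f}")
print(f"mean accepted theta:    {statistics.mean(accepted):.3f}")
```

The accepted samples crowd against the upper edge of the prior. A posterior that piles up at a prior boundary like this is a useful warning sign that the support may be too narrow.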

Worked examples

The pinball, in full

Let's walk through what the widget above does in detail, in case the dots didn't quite connect. The simulator is the pinball from Post 1: one parameter $p$ (probability of a left kick), eight stages, final position $x$ . The prior on $p$ is uniform on $[0, 1]$ — before any data, no opinion either way.

You provide an observation by setting "true $p$ " to, say, 0.6 and clicking "New observation". The widget runs the pinball at $p = 0.6$ once and records the landing position as $x_{obs}$ . The orange line in the simulator panel marks where it landed.

Then sampling. Each iteration draws $p^{'}$ uniformly from $[0, 1]$, runs the simulator, and checks whether the simulated $x^{'}$ lands inside the green band around $x_{obs}$, that is, within $ε$ of it. Accept and a green tick goes on the prior axis at $p^{'}$ with a new entry in the posterior histogram; reject and the tick is grey and the simulation is discarded.

After 500 draws at $ε = 1.5$ you typically have around 100 accepted $p^{'}$ values piled up near 0.6. The histogram of those values is the posterior. That's a complete Bayesian analysis of the pinball, done without ever writing down $p (x ∣ p)$ .
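The walkthrough above can be written out as code. One loud assumption: the exact definition of the landing position $x$ comes from Post 1 and is not restated here, so this sketch takes $x$ to be the number of left kicks over the eight stages (a Binomial(8, $p$) draw). If the widget uses a different coordinate, the accept rate will differ from what this prints.

```python
import random
import statistics

N_STAGES = 8  # eight stages, as in the text

def pinball(p, rng):
    """Assumed simulator: x = number of left kicks over N_STAGES independent stages."""
    return sum(rng.random() < p for _ in range(N_STAGES))

rng = random.Random(1)
true_p, eps, n_draws = 0.6, 1.5, 500

x_obs = pinball(true_p, rng)            # "New observation": one run at the true p

accepted = []
for _ in range(n_draws):
    p_candidate = rng.random()          # prior: Uniform(0, 1)
    x_sim = pinball(p_candidate, rng)
    if abs(x_sim - x_obs) < eps:        # inside the tolerance band around x_obs
        accepted.append(p_candidate)

print(f"x_obs = {x_obs}, accepted {len(accepted)} of {n_draws}")
print(f"posterior mean of p: {statistics.mean(accepted):.2f}")
```

Because a single observation of eight stages carries limited information, the posterior mean depends noticeably on which $x_{obs}$ happened to be drawn, which matches the behaviour described for the widget.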

An epidemic model: SIR

For something less toy: ABC on the SIR epidemic model. SIR has three populations — Susceptible $S (t)$ , Infected $I (t)$ , Recovered $R (t)$ — and two rate parameters. $β$ is the infection rate per susceptible-infected contact; $γ$ is the recovery rate per infected individual. The simulator runs a stochastic update (Gillespie algorithm, or a noisy ODE) and produces an $I (t)$ trajectory over, say, 100 days.

Your observation $x_{obs}$ is the actual $I (t)$ trajectory — daily infected counts from a real outbreak. A reasonable prior covers any plausible parameter range:

$$β \sim \text{Uniform} (0.5, 2.0), \quad γ \sim \text{Uniform} (0.1, 0.5)$$

ABC on this looks like:

1. Sample (β', γ') from the prior.
2. Run SIR(β', γ') to get a simulated trajectory I_sim(t).
3. Compute d = Σ_t (I_obs(t) − I_sim(t))².
4. If d < ε, keep (β', γ'). Else discard.

Repeat 100,000 times.
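A runnable sketch of this loop follows, under several assumptions that are not from the text: a closed population of 200 with 5 initially infected, a Gillespie simulator with frequency-dependent infection rate $β S I / N$, and a synthetic "observation" generated from the simulator itself in place of real outbreak data. It also runs far fewer than 100,000 simulations to stay quick, and instead of fixing ε in advance it keeps the closest 2% of simulations, a common practical shortcut that amounts to choosing ε after the fact.

```python
import random

def simulate_sir(beta, gamma, rng, n_pop=200, i0=5, n_days=100):
    """Gillespie simulation of a closed SIR model; returns I(t) sampled once per day."""
    s, i = n_pop - i0, i0
    t, day, infected = 0.0, 0, []
    while day < n_days:
        rate_inf = beta * s * i / n_pop      # infection event: S -> I
        rate_rec = gamma * i                 # recovery event:  I -> R
        total = rate_inf + rate_rec
        if total == 0.0:                     # nobody infected: I(t) stays at 0
            infected.extend([i] * (n_days - day))
            break
        t_next = t + rng.expovariate(total)  # waiting time to the next event
        while day < n_days and day < t_next:
            infected.append(i)               # state is constant until the next event
            day += 1
        t = t_next
        if rng.random() * total < rate_inf:
            s, i = s - 1, i + 1
        else:
            i -= 1
    return infected

def distance(a, b):
    """Sum of squared differences between two trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(42)
i_obs = simulate_sir(1.0, 0.25, rng)         # synthetic stand-in for real data

n_sims, keep_frac = 2000, 0.02
results = []
for _ in range(n_sims):
    beta = rng.uniform(0.5, 2.0)             # prior on beta
    gamma = rng.uniform(0.1, 0.5)            # prior on gamma
    d = distance(i_obs, simulate_sir(beta, gamma, rng))
    results.append((d, beta, gamma))

results.sort()                               # closest simulations first
n_keep = int(keep_frac * n_sims)
accepted = [(b, g) for _, b, g in results[:n_keep]]
eps_used = results[n_keep - 1][0]            # largest distance among accepted runs
print(f"kept {len(accepted)} of {n_sims}; largest accepted distance = {eps_used}")
```

With raw trajectories as the data, the squared-error distance is dominated by the days around the peak, so timing mismatches are penalised heavily. That is one reason summary statistics are often used instead.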

The accepted $(β, γ)$ pairs are your posterior. The marginal on $β$ tells you the infection rate. The marginal on $γ$ tells you the recovery rate. The joint distribution captures the correlation between them, which for SIR usually skews positive: raising $β$ and raising $γ$ push the peak height in opposite directions, so the two changes partly cancel and ABC can't fully separate them without more informative data.

A nice side effect: derived quantities fall out for free. The basic reproduction number $R_{0} = β / γ$ is just $β^{'} / γ^{'}$ computed on each accepted sample. No new inference needed — you already have everything you'd ever ask for as a function of $(β, γ)$ .
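As a sketch of how little work this takes, the helper below turns a list of accepted $(β, γ)$ pairs into a posterior mean and a rough 95% interval for $R_{0}$. The name `r0_summary` is hypothetical, and the five pairs are made-up values used only to exercise the function; they are not output from a real ABC run.

```python
import statistics

def r0_summary(accepted):
    """Posterior mean and nearest-rank 95% interval of R0 = beta / gamma."""
    r0 = sorted(beta / gamma for beta, gamma in accepted)
    lo = r0[int(0.025 * (len(r0) - 1))]
    hi = r0[int(0.975 * (len(r0) - 1))]
    return statistics.mean(r0), (lo, hi)

# Made-up (beta, gamma) pairs, for illustration only.
accepted = [(1.0, 0.25), (1.1, 0.25), (0.9, 0.3), (1.2, 0.3), (1.5, 0.3)]

mean_r0, interval = r0_summary(accepted)
print(f"R0 mean = {mean_r0:.2f}, interval = ({interval[0]:.2f}, {interval[1]:.2f})")
```

The same pattern works for any function of the parameters: apply it to each accepted sample and summarise the results. With only a handful of samples the interval is crude; in practice it would be computed over all accepted pairs.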

Notice that nowhere did we write down $p (I (t) ∣ β, γ)$ . The likelihood for SIR observed this way is genuinely hard — the trajectory depends on every event in its stochastic past, and marginalizing over those events is the same path integral problem from Post 1. The simulator runs forward fine; the inverse direction needs ABC (or something downstream of it).

And many others

The same template applies broadly across stochastic dynamical models. Lotka-Volterra predator-prey has four parameters and two coupled population time series — same ABC loop, larger prior space, more careful summary statistics when the trajectories are long. Coalescent simulators in population genetics produce trees of genetic ancestry; you infer effective population size $N_{e}$ by matching simulated tree statistics (segregating sites, Tajima's D) to observed ones. Stochastic chemical kinetics, agent-based models in ecology, neural mass models in computational neuroscience — ABC is the default first-pass tool wherever the simulator is cheap and a defensible posterior matters more than runtime.

When ABC is the right tool

All of these examples share three features. The parameter space is small, typically one to maybe five dimensions and occasionally up to ten if the summaries are informative. The simulator is cheap, because you're going to run it tens or hundreds of thousands of times. And a defensible answer matters more than amortization across many observations — you're analyzing one outbreak, one ecosystem, one experiment, and you can afford to rerun the loop if a new observation comes in.

Lose any of those conditions and ABC starts to hurt. The acceptance rate at small $ε$ , the curse of dimensionality in $x$ , the burden of designing summary statistics, and the absence of amortization across observations — those are the four reasons people ever bother with neural methods. Without them, you'd never need anything more complicated than ABC. The next post is about each of those failure modes.