Bayes' Rule

Simulation-Based Inference

To follow the rest of this series you need to be comfortable with Bayes' rule. The textbook version takes a chapter. The version that actually matters for SBI takes one page. This is that page.

Every SBI method we look at — ABC, neural posterior estimation, normalizing-flow-based inference, the rest — is Bayes' rule wearing a different computational costume. Knowing the bare rule and what makes it hard in practice is the part that buys you the rest of the series.

The three ingredients

You have a parameter $θ$ you want to know: a probability, a rate, a vector of physical constants, whatever. You have a model that says how observations arise from that parameter, the likelihood $p (x ∣ θ)$. And you have a prior $p (θ)$, which is what you believed about $θ$ before any data showed up.

After seeing an observation $x_{obs}$, you want to update your belief about $θ$. The thing you want is the posterior $p (θ ∣ x_{obs})$: the distribution over $θ$ conditional on the data you actually saw.

The rule

Bayes' rule tells you how to get from prior + likelihood to posterior:

$$p (θ ∣ x_{obs}) = \frac{p (x_{obs} ∣ θ) \, p (θ)}{p (x_{obs})}$$

Two pieces in the numerator and one in the denominator. The numerator multiplies "what the data says about $θ$" (likelihood) by "what we already thought about $θ$" (prior). The denominator $p (x_{obs})$ is the marginal probability of the data, which is what you get if you average the likelihood over the prior:

$$p (x_{obs}) = \int p (x_{obs} ∣ θ) \, p (θ) \, dθ$$

It's just a normalizing constant: its only job is to make the posterior integrate to 1. It doesn't depend on $θ$.

The annoying part

That normalizing constant causes most of the trouble in Bayesian practice. For any model more complicated than a textbook example, the integral over all $θ$ has no closed form, and computing it numerically gets expensive fast. This is why most Bayesian techniques (MCMC, variational inference, importance sampling) avoid computing it explicitly and work with the unnormalized posterior instead:

$$p (θ ∣ x_{obs}) \propto p (x_{obs} ∣ θ) \, p (θ)$$

"Proportional to" turns out to be enough for sampling, optimization, computing ratios: most of what you actually want to do. The unnormalized posterior values are not probability densities, but the ratio between any two of them equals the ratio of the true posterior densities, because $p (x_{obs})$ cancels. Those ratios are what methods like MCMC actually work with.

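To make the "ratios are enough" point concrete, here is a minimal random-walk Metropolis sketch. It is an illustration under stated assumptions, not a production sampler: the target $θ^{8} (1 - θ)^{4}$ is borrowed from the coin example worked through later on this page, and the function names, step size, and burn-in length are all hypothetical choices. The normalizing constant never appears anywhere in the code.

```python
import numpy as np

def log_unnormalized_posterior(theta):
    """Log of theta^8 * (1 - theta)^4, with no normalizing constant."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf  # zero density outside (0, 1)
    return 8.0 * np.log(theta) + 4.0 * np.log(1.0 - theta)

def metropolis(log_target, n_steps=50_000, step_size=0.2, start=0.5, seed=0):
    """Random-walk Metropolis: only differences of log_target are used."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_steps)
    current = start
    current_logp = log_target(current)
    for i in range(n_steps):
        proposal = current + step_size * rng.normal()
        proposal_logp = log_target(proposal)
        # Accept with probability min(1, ratio of unnormalized densities).
        # Any constant factor, including 1 / p(x_obs), cancels in this ratio.
        if np.log(rng.uniform()) < proposal_logp - current_logp:
            current, current_logp = proposal, proposal_logp
        samples[i] = current
    return samples

samples = metropolis(log_unnormalized_posterior)
burned = samples[5_000:]  # discard early steps as burn-in (arbitrary length)
print("estimated posterior mean:", burned.mean())
```

The estimated mean should land close to the closed-form value $9/14$ derived in the worked example, even though the sampler only ever saw the unnormalized density.
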
A worked example

A classic. You flip a coin 10 times and get 7 heads. Is the coin fair? Let $θ$ be the probability of heads. For a prior, suppose you think the coin is roughly fair but you're not sure, so $θ \sim Beta (2, 2)$, a soft bell shape peaked at 0.5. The likelihood is binomial: the probability of 7 heads in 10 flips at parameter $θ$ is $\binom{10}{7} θ^{7} (1 - θ)^{3}$.

Apply Bayes:

$$p (θ ∣ \text{7 heads}) \propto θ^{7} (1 - θ)^{3} \cdot θ^{2 - 1} (1 - θ)^{2 - 1} = θ^{8} (1 - θ)^{4}$$

Which, if you've seen the conjugacy trick, you'd recognize as the kernel of $Beta (9, 5)$. The posterior is more peaked than the prior because it now incorporates data, and shifted above 0.5 because there were more heads than tails. Posterior mean: $9 / (9 + 5) \approx 0.64$.

This is the textbook flavor of Bayesian inference. Conjugate prior + likelihood means the posterior stays in a known family with updated parameters. Everything has a closed form. No Monte Carlo needed.
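
The closed form can be checked numerically by doing Bayes' rule the brute-force way on a grid: multiply likelihood by prior, integrate to get the normalizing constant, and divide. The sketch below assumes a uniform grid and a hand-rolled trapezoid rule; the helper names `beta_pdf` and `integrate` are illustrative, not from any library.

```python
import numpy as np
from math import comb, gamma

theta = np.linspace(0.0, 1.0, 10_001)  # grid over the parameter

def beta_pdf(t, a, b):
    """Density of Beta(a, b) evaluated at t."""
    norm = gamma(a) * gamma(b) / gamma(a + b)
    return t ** (a - 1) * (1 - t) ** (b - 1) / norm

def integrate(y, x):
    """Trapezoid rule for y sampled at points x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

prior = beta_pdf(theta, 2, 2)
likelihood = comb(10, 7) * theta**7 * (1 - theta) ** 3
unnormalized = likelihood * prior

# The denominator of Bayes' rule: likelihood averaged over the prior.
evidence = integrate(unnormalized, theta)
posterior = unnormalized / evidence

# Compare against the conjugate answer, Beta(9, 5).
max_gap = float(np.max(np.abs(posterior - beta_pdf(theta, 9, 5))))
posterior_mean = integrate(theta * posterior, theta)

print("evidence:", evidence)
print("largest gap to Beta(9, 5):", max_gap)
print("posterior mean:", posterior_mean)
```

For this model the evidence also has an exact value, $720 / 6435 \approx 0.112$, so the grid result can be checked against it. A grid works here only because $θ$ is one-dimensional; the cost grows exponentially with the number of parameters.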

Why SBI exists

Now imagine the model is harder. The coin has memory — its probability of heads depends on what came before. Or the observation isn't "7 heads in 10 flips" but a full sequence of timings, sounds, and contact angles. Or the "coin" is actually a 50-dimensional fluid simulation with eight unknown coefficients.

All of those still factor as $p (θ) p (x ∣ θ)$ in principle. But $p (x ∣ θ)$ stops being something you can write down. The numerator of Bayes' rule has a piece you can't evaluate. The posterior, the thing you actually want, becomes inaccessible through direct calculation.

That's where SBI comes in. Even when the likelihood is intractable, the simulator still gives you samples from $p (x ∣ θ)$. ABC, the first SBI method we look at, is the most direct way to use those samples to do Bayesian inference. Every other method in this series is a different way of using simulator samples to get at the same posterior $p (θ ∣ x_{obs})$ that Bayes' rule is asking for.
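
As a preview of the idea, here is a rough sketch of the simplest likelihood-free scheme applied to the coin example: draw $θ$ from the prior, run the simulator, and keep only the draws whose simulated data match the observation. This is an assumption-laden toy, not necessarily how the series presents ABC. The `simulate` function is a hypothetical stand-in for a real simulator, and exact matching is only feasible because the data here is a single small integer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000
x_obs = 7  # observed number of heads in 10 flips

def simulate(theta, rng):
    """Stand-in simulator: heads in 10 flips for each theta in the array."""
    return rng.binomial(n=10, p=theta)

# 1. Draw parameters from the prior, Beta(2, 2).
theta_prior = rng.beta(2.0, 2.0, size=n_draws)

# 2. Push each draw through the simulator. No likelihood is evaluated.
x_sim = simulate(theta_prior, rng)

# 3. Keep the parameters whose simulated data equal the observation.
accepted = theta_prior[x_sim == x_obs]

acceptance_rate = accepted.size / n_draws
print("acceptance rate:", acceptance_rate)
print("posterior mean from accepted draws:", accepted.mean())
```

The accepted draws are samples from the posterior, so their mean should sit near $9/14$. The acceptance rate estimates $p (x_{obs})$, the same normalizing constant that appears in the denominator of Bayes' rule.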

One more thing — samples versus densities

In most of what follows, "the posterior" doesn't mean a function you evaluate at a point. It means a set of samples you can summarize however you want. From a thousand $θ$ samples drawn from the posterior, you can read off the mean, the median, credible intervals, predictive distributions for new $x$, and any nonlinear function of $θ$ you care about.
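
A short sketch of what "summarize however you want" looks like in practice. It assumes the posterior samples come from the coin example's $Beta (9, 5)$, drawn directly with NumPy as a stand-in for whatever sampler produced them. The variable names and the choice of summaries are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A thousand posterior samples of theta (stand-in for any sampler's output).
theta_samples = rng.beta(9.0, 5.0, size=1_000)

# Point summaries.
post_mean = theta_samples.mean()
post_median = np.median(theta_samples)

# Central 95% credible interval from the empirical quantiles.
ci_low, ci_high = np.quantile(theta_samples, [0.025, 0.975])

# Any nonlinear function of theta: transform each sample, then summarize.
odds = theta_samples / (1.0 - theta_samples)
odds_mean = odds.mean()

# Posterior predictive: simulate 10 new flips for each sampled theta.
new_heads = rng.binomial(n=10, p=theta_samples)
prob_at_least_7 = float(np.mean(new_heads >= 7))

print("mean:", post_mean, "median:", post_median)
print("95% credible interval:", (ci_low, ci_high))
print("mean odds of heads:", odds_mean)
print("P(at least 7 heads in 10 new flips):", prob_at_least_7)
```

Every summary is a one-liner over the same array, and none of them requires evaluating a density.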

This is liberating for SBI. Classical methods need likelihood evaluations to find modes and compute densities point by point. SBI methods skip ahead and produce samples directly, which is usually what you wanted in the first place: you were going to compute statistics from those samples anyway.

With that in hand, the rest of the series is about different ways of producing those samples when the likelihood is gone. ABC is next.