Training a Flow on Two Moons

Normalizing Flows

Time to actually do it. We've built up the change-of-variables formula and shown how composition works; now we hand a flow a real dataset and ask it to learn the distribution by gradient descent. The target is the two moons dataset — two interleaved crescents in 2D, the standard non-trivial-but-tractable density that every flow paper at some point fits to. Why two moons? It's two-dimensional (cheap to train, easy to visualize), it has a clearly non-Gaussian shape (so a flow has to actually do something), and the two arms aren't axis-aligned (so a pure affine transformation can't represent it — coupling layers earn their keep).

The objective

Given a dataset of samples ${x_{i}}_{i = 1}^{N}$ drawn from some unknown distribution, we want a flow $f_{ϕ}$ with base distribution $p_{z} = N (0, I)$ such that $x = f_{ϕ}^{- 1} (z)$ reproduces the data distribution. Train it by maximizing the log-likelihood of the data under the flow. By change of variables, that's:

lo g p_{ϕ} (x) = lo g p_{z} (f_{ϕ} (x)) + lo g det \frac{\partial f _{ϕ}}{\partial x} .

The loss is the negative of this, averaged over the dataset:

L (ϕ) = - \frac{1}{N} i = 1 \sum N lo g p_{ϕ} (x_{i}) .

No surrogate, no variational bound, no adversarial game. Exact density, exact gradients. The only thing the gradient descent has to optimize is the flow's parameters $ϕ$ . Everything else is bookkeeping for the change-of-variables formula.

The flow architecture

We need a flow that's flexible enough to bend a Gaussian into two crescents, but with a Jacobian determinant that's cheap to evaluate. The standard choice is the coupling layer (RealNVP). One layer splits the input $x$ into two halves $(x_{A}, x_{B})$ — in 2D, just one coordinate each. Pass $x_{A}$ through unchanged; transform $x_{B}$ by an affine map whose parameters depend on $x_{A}$ :

z_{A} = x_{A}, z_{B} = x_{B} \cdot exp (s_{ϕ} (x_{A})) + t_{ϕ} (x_{A}) .

The functions $s_{ϕ}$ and $t_{ϕ}$ are arbitrary neural networks. The Jacobian of this transformation is triangular by construction (changes in $x_{A}$ only affect $z_{A}$ through the identity; changes in $x_{B}$ only affect $z_{B}$ through the affine map), so its log-determinant is just $\sum s_{ϕ} (x_{A})$ over the $B$ -coordinates. Cheap, differentiable, inverts in closed form.

A single coupling layer leaves one coordinate untouched, which obviously isn't enough. The trick is to stack them and alternate the mask: layer 1 transforms coord 2 conditioned on coord 1; layer 2 transforms coord 1 conditioned on coord 2; layer 3 swaps back; and so on. Six alternating layers is plenty to fit two moons.

The full code

Working PyTorch. The flow is about 60 lines. The training loop is 10. Numerical output below.

"""
Fit a normalizing flow to the two-moons distribution.

Architecture: stack of RealNVP coupling layers in 2D.
Each layer transforms one coordinate as an affine function of the other,
alternating which coordinate is held fixed (the "mask").

Training: minimize negative log-likelihood of the data,
which by change of variables expands to
  -log p_x(x) = -log p_z(f(x)) - log |det df/dx|
where p_z is N(0, I) and f maps data -> base.
"""

import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import make_moons

torch.manual_seed(0)
np.random.seed(0)

# ---- 1. The two-moons dataset --------------------------------------------
X, _ = make_moons(n_samples=2000, noise=0.05, random_state=0)
X = torch.tensor(X, dtype=torch.float32)

mean = X.mean(0); std = X.std(0)
X = (X - mean) / std    # center / rescale to ~unit variance per axis


# ---- 2. A coupling layer -------------------------------------------------
class CouplingLayer(nn.Module):
    def __init__(self, mask, hidden_dim=64):
        super().__init__()
        self.register_buffer("mask", torch.tensor(mask, dtype=torch.float32))
        self.net = nn.Sequential(
            nn.Linear(2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),  # 2 for s, 2 for t
        )

    def _st(self, x_in):
        st = self.net(self.mask * x_in)
        s, t = st.chunk(2, dim=-1)
        s = torch.tanh(s) * (1 - self.mask)   # only act on the unmasked coords
        t = t * (1 - self.mask)
        return s, t

    def forward(self, x):
        # data -> base
        s, t = self._st(x)
        z = self.mask * x + (1 - self.mask) * (x * torch.exp(s) + t)
        log_det = s.sum(-1)
        return z, log_det

    def inverse(self, z):
        # base -> data
        s, t = self._st(z)
        x = self.mask * z + (1 - self.mask) * ((z - t) * torch.exp(-s))
        return x


# ---- 3. Stack of coupling layers -----------------------------------------
class RealNVP(nn.Module):
    def __init__(self, n_layers=6):
        super().__init__()
        masks = [[1.0, 0.0], [0.0, 1.0]] * (n_layers // 2)
        self.layers = nn.ModuleList(CouplingLayer(m) for m in masks)

    def forward(self, x):
        log_det = 0.0
        for layer in self.layers:
            x, ld = layer(x)
            log_det = log_det + ld
        return x, log_det

    def inverse(self, z):
        for layer in reversed(self.layers):
            z = layer.inverse(z)
        return z

    def log_prob(self, x):
        z, log_det = self.forward(x)
        log_pz = -0.5 * (z**2).sum(-1) - np.log(2 * np.pi)
        return log_pz + log_det


# ---- 4. Train ------------------------------------------------------------
flow = RealNVP(n_layers=6)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
batch_size = 256
n_steps = 2000

initial_nll = -flow.log_prob(X).mean().item()
print(f"NLL at random init:           {initial_nll:.4f}")
print()
print(f"{'step':>6}  {'NLL':>9}")
print(f"{'-'*6}  {'-'*9}")
for step in range(n_steps + 1):
    idx = torch.randint(0, len(X), (batch_size,))
    loss = -flow.log_prob(X[idx]).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 200 == 0:
        print(f"{step:>6}  {loss.item():>9.4f}")

# ---- 5. Final metrics ----------------------------------------------------
final_nll = -flow.log_prob(X).mean().item()
baseline_nll = -(-0.5 * (X**2).sum(-1) - np.log(2*np.pi)).mean().item()
print()
print(f"NLL after training:           {final_nll:.4f}")
print(f"NLL of N(0,I) baseline:       {baseline_nll:.4f}   (no flow)")
print(f"Information gained:           {baseline_nll - final_nll:.4f} nats / point")

with torch.no_grad():
    x_gen = flow.inverse(torch.randn(2000, 2))
print()
print(f"True data:    mean = ({X.mean(0)[0]:+.3f}, {X.mean(0)[1]:+.3f})   std = ({X.std(0)[0]:.3f}, {X.std(0)[1]:.3f})")
print(f"Flow samples: mean = ({x_gen.mean(0)[0]:+.3f}, {x_gen.mean(0)[1]:+.3f})   std = ({x_gen.std(0)[0]:.3f}, {x_gen.std(0)[1]:.3f})")

Output:

NLL at random init:           3.0135

  step        NLL
------  ---------
     0     3.0421
   200     1.6504
   400     1.4283
   600     1.3796
   800     1.3192
  1000     1.2957
  1200     1.3021
  1400     1.2910
  1600     1.2863
  1800     1.2037
  2000     1.2377

NLL after training:           1.2414
NLL of N(0,I) baseline:       2.8374   (no flow)
Information gained:           1.5960 nats / point

True data:    mean = (-0.000, +0.000)   std = (1.000, 1.000)
Flow samples: mean = (+0.067, -0.020)   std = (0.966, 1.081)

What the numbers say

The flow starts at NLL ≈ 3.0 (random init: the flow's distribution is roughly a standard Gaussian, which assigns reasonable but not great probability to the data). After 2000 steps it lands around NLL ≈ 1.24. For comparison, a frozen $N (0, I)$ assigns the data NLL ≈ 2.84 — that's the floor if you weren't allowed to do any learning at all. The trained flow improves over that baseline by about 1.6 nats per data point, which is the amount of structure the flow extracted from the crescents that a Gaussian couldn't capture.

The sample statistics line up too: when you draw 2000 samples from the trained flow by pushing standard-normal noise through $f^{- 1}$ , the resulting samples have mean and standard deviation that match the training data to within sampling noise. Visualizing them would show two crescents — the flow has learned the shape, not just the moments.

What just happened

We took an unknown 2D distribution (given to us only as 2000 samples), set up a flexible invertible map between it and a standard Gaussian, and trained the map by maximizing the log-likelihood it assigns to the data. After training, the flow gives you two things simultaneously, exactly: a sampler (draw $z$ , push through $f^{- 1}$ , get a sample from the learned distribution) and a density evaluator (push $x$ through $f$ , accumulate log-Jacobians, get $lo g p_{ϕ} (x)$ ). That's the by construction and by density trick from the intro page playing out concretely.

Compare against what a different generative model would have given you: a VAE trained on the same data would let you sample easily but evaluate density only through the ELBO (a lower bound). A GAN would let you sample but offer no density at all. An energy-based model would let you write down an unnormalized density but require MCMC to sample. The flow is the only one of the four that does both jobs exactly, in one trained object.

Why this matters: simulation-based inference

Two moons is the smallest example. The big use case for flows is simulation-based inference — when you have a forward simulator whose likelihood is intractable. Cases like the pinball simulator, where the position depends on the order of kicks and the likelihood collapses into a sum over $2^{N}$ latent paths. Or Lotka-Volterra predator-prey simulators, where the rate of the next event depends on the current state and you can't marginalize. Or coalescent simulators in population genetics, where the genealogy is hidden inside the data.

In all those cases, you have $(θ, x)$ pairs from the simulator and want the posterior $p (θ ∣ x_{obs})$ . The trick: train a conditional flow $q_{ϕ} (θ ∣ x)$ on simulated $(θ, x)$ pairs. At inference time, plug in $x_{obs}$ and you can sample from the posterior, evaluate its density, do MAP estimation, marginalize over nuisance dimensions — all of it. This is neural posterior estimation, the dominant approach in modern SBI.

The conditional case is essentially what we did here, with two changes: the flow's per-layer neural networks take an extra input $x$ (the observation), and the training pairs are $(θ, x)$ from a simulator instead of just $x$ from a dataset. Everything else — the architecture, the loss, the training loop — is identical. Two moons is the unconditional rehearsal; SBI is the show.

What's next

Two moons is the smallest interesting target. For something genuinely useful you'd want to: (1) replace the affine inner transform with a spline (neural spline flows) to get more expressiveness per layer; (2) handle higher dimensions, which means thinking carefully about the masking strategy (random permutations between layers, or a different layer family like masked autoregressive); (3) make the flow conditional on context for SBI. Each of those is engineering on top of what's already here. The math doesn't change.

Related on this site

Change of variables in 1D and Composing flows are the earlier chapters in this series. Why some likelihoods can't be written down sets up the SBI problem that flows solve; the SBI normalizing-flows post explains why this exact architecture is the right answer.