Change of variables in 1D

Normalizing Flows

Suppose you have a distribution that is easy to handle — say, a standard Gaussian $X \sim N (0, 1)$ — and you'd like to turn it into something more interesting. The machinery for doing that is the change-of-variables formula, and in one dimension the whole thing fits in a single line. This chapter is about that line and what it looks like.

The setup

Pick an invertible, differentiable function $f : \mathbb{R} \to \mathbb{R}$. Apply it to a sample:

$$X \sim N(0, 1), \qquad Y = f(X).$$

The new variable $Y$ has its own probability density $p_{Y}$ , and we want to know what it is. The answer:

$$p_{Y}(y) = \frac{p_{X}(x)}{|f'(x)|}, \qquad x = f^{-1}(y).$$

Two things to read off this. First, $p_{Y}(y)$ is determined by the base density at the preimage $x = f^{-1}(y)$: to evaluate the new density at $y$, you have to undo the transformation and evaluate the base density at the corresponding $x$. Second, the answer is divided by $|f'(x)|$: a stretching transformation ($|f'| > 1$) dilutes the density, a compressing one ($|f'| < 1$) concentrates it.
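A minimal sketch of this recipe in Python, using $f(x) = \sinh(x)$ as a stand-in transform. That choice is an assumption made here for illustration (it is a smooth bijection of the real line with a closed-form inverse), and the function names are hypothetical.

```python
import math

def base_pdf(x):
    """Standard normal density p_X(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f(x):
    """Illustrative transform (an assumption, not from the text): y = sinh(x)."""
    return math.sinh(x)

def f_inverse(y):
    """Closed-form inverse of sinh."""
    return math.asinh(y)

def f_prime(x):
    """Derivative of sinh; cosh(x) >= 1, so this map stretches everywhere."""
    return math.cosh(x)

def flow_pdf(y):
    """Change of variables: undo the transform, then divide by |f'(x)|."""
    x = f_inverse(y)
    return base_pdf(x) / abs(f_prime(x))

# Compare the new density with the base density at a few points.
for y in (0.0, 1.0, 3.0):
    print(f"y = {y:.1f}   p_Y(y) = {flow_pdf(y):.4f}   p_X(y) = {base_pdf(y):.4f}")
```

Because $\cosh(x) \geq 1$ with equality only at $x = 0$, this particular map leaves the density at the origin unchanged and dilutes it everywhere else, which pushes mass into the tails.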

Why the divisor

Probability mass is conserved. If a thin slice of the input line of width $d x$ contains probability $p_{X} (x) d x$ , the corresponding slice of the output line has width $d y = ∣ f^{'} (x) ∣ d x$ — that's just the derivative relationship. The same probability mass now lives in a slice that is $∣ f^{'} (x) ∣$ -times wider, so the output density must be smaller by the same factor:

$$p_{Y}(y)\, dy = p_{X}(x)\, dx \;\Longrightarrow\; p_{Y}(y) = \frac{p_{X}(x)}{|f'(x)|}.$$

The formula is bookkeeping — probability per unit width on the input side equals probability per unit width on the output side, accounting for the local stretching factor $∣ f^{'} ∣$ . There is no machine-learning anywhere in this derivation; it's the same identity every undergraduate calculus course teaches under "u-substitution," read as a statement about densities.
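As a quick sanity check of the bookkeeping, consider an affine map $f(x) = a x + b$ with $a \neq 0$ (an illustrative choice, not one used elsewhere in this chapter). The stretching factor is the constant $|a|$ everywhere, and the formula recovers the familiar fact that an affine image of a standard Gaussian is a Gaussian with mean $b$ and standard deviation $|a|$:

```latex
\begin{aligned}
f(x) &= a x + b, \qquad f'(x) = a, \qquad x = f^{-1}(y) = \frac{y - b}{a}, \\
p_Y(y) &= \frac{p_X\!\left(\frac{y - b}{a}\right)}{|a|}
        = \frac{1}{|a|\sqrt{2\pi}} \exp\!\left(-\frac{(y - b)^2}{2 a^2}\right).
\end{aligned}
```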

The visualization

Below: a standard Gaussian on top, the same samples pushed through $f(x) = x + α \sin(2x)$ on the bottom. Each colored dot above is a sample; the matching dot below is where it lands after the transform. The shaded curves are the density before and after. Drag the slider, or click a preset, to see how the output density changes with the nonlinearity parameter $α$. The factor of 2 inside the sine puts the action right inside the Gaussian's main mass, so the effect is large enough to be visible.

[Interactive figure: slider for $α$, shown at $α = 0.40$. Transform: $y = x + α \sin(2x)$. Each colored dot above is a sample $x \sim N(0, 1)$; its matching dot below is $y = f(x)$.]

A walk through what each preset is doing.

Identity ( $α = 0$ ). The transform is $y = x$ . The output density equals the input — no stretching anywhere, derivative is 1 everywhere, formula gives $p_{Y} = p_{X}$ . The dots stay where they started. This is the trivial flow; everything else builds on it.

Mild stretch ( $α = 0.25$ ). The derivative $f^{'} (x) = 1 + 2 α cos (2 x)$ is largest near $x = 0$ ( $f^{'} (0) = 1.5$ ) and smallest near $x = \pm π /2$ ( $f^{'} (\pm π /2) = 0.5$ ). So the central x's get stretched apart and the x's near $\pm π /2$ get squeezed together. In density-land that means the centre flattens and small humps start to grow at $y \approx \pm π /2$ (the images of the squeezed regions). At this slider position the humps are subtle — keep going.

Bimodal ( $α = 0.45$ ). Same direction, near the limit $|α| = 0.5$, beyond which $f'$ changes sign and $f$ stops being invertible. Here $f'(\pm π/2) = 1 - 2α = 0.1$: the transform compresses by a factor of 10 right where the base Gaussian still has decent mass ($p_{X}(π/2) \approx 0.116$). The change-of-variables formula divides by $0.1$, giving output density peaks of $\approx 1.16$ at $y = \pm π/2$. The bottom panel auto-rescales to fit them, and you can see the central Gaussian peak (around $0.4 / 1.9 \approx 0.21$ in absolute terms, since $f'(0) = 1 + 2α = 1.9$) become small in comparison. The result is a bimodal output density built from a unimodal Gaussian by a single, smooth, invertible function. This is the central thing a flow does.

Compression ( $α = - 0.25$ ). Sign flip. Now the centre compresses ( $f^{'} (0) = 1 - 0.5 = 0.5$ ) while the regions near $x = \pm π /2$ stretch. The central peak of the output density gets taller, and the wings flatten. The dots in the middle of the input bunch up in the middle of the output; the dots away from centre spread out.

Sharp peak ( $α = - 0.45$ ). Near the limit in the other direction. The central derivative is $f'(0) = 1 + 2α = 0.1$, so the output density at $y = 0$ reaches $p_{X}(0) / 0.1 \approx 4.0$. The bottom panel rescales to fit the spike, and you can see how thoroughly the wide input has been compressed into a single tall peak around zero: the inputs with $|x| < 0.5$, about 38% of the base mass, all land in the narrow window $|y| < 0.13$.

The whole story is in the slider: the shape of the output density is dictated by where the transform stretches and where it compresses. A flow is just an explicit choice of those stretches.
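The numbers in the preset walk-through can be reproduced with a few lines of Python. This is a sketch, with hypothetical function names, that relies on one observation: $\sin(2x) = 0$ at $x = 0$ and $x = \pm π/2$, so those points map to themselves for every $α$ and no numerical inversion is needed there.

```python
import math

def base_pdf(x):
    """Standard normal density p_X(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f_prime(x, alpha):
    """Derivative of f(x) = x + alpha * sin(2x)."""
    return 1.0 + 2.0 * alpha * math.cos(2.0 * x)

def density_at_fixed_point(x, alpha):
    """Output density at y = x, valid where sin(2x) = 0 so that f(x) = x."""
    return base_pdf(x) / abs(f_prime(x, alpha))

PRESETS = {
    "Identity": 0.0,
    "Mild stretch": 0.25,
    "Bimodal": 0.45,
    "Compression": -0.25,
    "Sharp peak": -0.45,
}

for name, alpha in PRESETS.items():
    centre = density_at_fixed_point(0.0, alpha)
    side = density_at_fixed_point(math.pi / 2.0, alpha)
    print(
        f"{name:12s} alpha={alpha:+.2f}  "
        f"f'(0)={f_prime(0.0, alpha):.2f}  f'(pi/2)={f_prime(math.pi / 2.0, alpha):.2f}  "
        f"p_Y(0)={centre:.3f}  p_Y(pi/2)={side:.3f}"
    )
```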

The log version

For training, we work with log-densities: products turn into sums, and tiny density values no longer underflow to zero. Take logs of the change-of-variables formula:

$$\log p_{Y}(y) = \log p_{X}(x) - \log |f'(x)|, \qquad x = f^{-1}(y).$$

Two terms. The first is the base log-density at the preimage (cheap — for a Gaussian it's a quadratic in $x$ ). The second is the log of the absolute derivative of the transform, sometimes called the log-determinant of the Jacobian — in 1D the "Jacobian" is just $f^{'} (x)$ . Every flow architecture you'll meet later is in some sense an answer to one question: how do we make that second term cheap to compute when the input lives in high-dimensional space?
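The two terms can be sketched in Python for the sine transform from the visualization. Several things here are assumptions for illustration rather than part of the text: the function names, the bisection inverse (the transform has no closed-form inverse), the synthetic data set, and the use of an average negative log-likelihood as the training objective.

```python
import math
import random

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def log_base_pdf(x):
    """Log of the standard normal density: a quadratic in x."""
    return -0.5 * x * x - LOG_SQRT_2PI

def f(x, alpha):
    return x + alpha * math.sin(2.0 * x)

def f_prime(x, alpha):
    return 1.0 + 2.0 * alpha * math.cos(2.0 * x)

def f_inverse(y, alpha, iters=60):
    """Bisection inverse; assumes |alpha| < 0.5 so that f is strictly increasing.
    Since |f(x) - x| <= |alpha|, the preimage lies in [y - 1, y + 1]."""
    lo, hi = y - 1.0, y + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid, alpha) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def log_flow_pdf(y, alpha):
    """log p_Y(y) = log p_X(x) - log|f'(x)| with x = f^{-1}(y)."""
    x = f_inverse(y, alpha)
    return log_base_pdf(x) - math.log(abs(f_prime(x, alpha)))

def mean_nll(ys, alpha):
    """Average negative log-likelihood of the data under the flow with this alpha."""
    return -sum(log_flow_pdf(y, alpha) for y in ys) / len(ys)

# Synthetic data drawn from the flow with alpha = 0.4 (seeded for repeatability).
rng = random.Random(0)
ys = [f(rng.gauss(0.0, 1.0), 0.4) for _ in range(2000)]
for alpha in (0.0, 0.2, 0.4):
    print(f"alpha = {alpha:.1f}   mean NLL = {mean_nll(ys, alpha):.4f}")

# Far in the tail the density underflows to 0.0, while its log stays finite.
print(math.exp(log_base_pdf(40.0)), log_base_pdf(40.0))
```

The last line shows the practical reason for working in logs: the density at $x = 40$ is too small to represent as a float, while its logarithm is an ordinary number near $-801$.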

For now the answer is "it's a single derivative, so it's free," and that's enough to make the rest of the series work. The next chapter composes 1D flows: stacking transforms gives a much richer family of output densities, and the log-density formula picks up exactly one extra $\log |f_{k}'|$ term per layer.
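A sketch of where the per-layer terms come from, in notation assumed here rather than taken from the text: write $x_0 = x$, $x_k = f_k(x_{k-1})$ for $k = 1, \dots, K$, and $y = x_K$. The chain rule turns the derivative of the composite into a product, and the log turns that product into a sum:

```latex
\begin{aligned}
\frac{dy}{dx_0} &= \prod_{k=1}^{K} f_k'(x_{k-1}), \\
\log p_Y(y) &= \log p_X(x_0) - \sum_{k=1}^{K} \log \left| f_k'(x_{k-1}) \right| .
\end{aligned}
```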

Why does the formula need $f$ to be invertible?

Because the inverse $x = f^{- 1} (y)$ appears explicitly on the right-hand side. A non-invertible $f$ would have multiple $x$ 's mapping to the same $y$ , and the density at $y$ would have to sum the contributions from all of them. That's the formula for a non-invertible push-forward, and it's strictly more expensive (you have to enumerate all preimages). For flows, requiring invertibility is the price we pay for keeping the formula clean and tractable. It also means we can generate by sampling $x \sim p_{X}$ and applying $f$ , AND we can evaluate the density of any $y$ by applying $f^{- 1}$ and the formula. One model, two capabilities.
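Both capabilities can be sketched side by side for $f(x) = x + α \sin(2x)$. The value $α = 0.4$, the bisection inverse, and the function names are assumptions for illustration. The check at the end compares the fraction of generated samples that land in $|y| < 0.5$ with the integral of the evaluated density over the same window.

```python
import math
import random

ALPHA = 0.4  # any |ALPHA| < 0.5 keeps f strictly increasing, hence invertible

def base_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f(x):
    return x + ALPHA * math.sin(2.0 * x)

def f_prime(x):
    return 1.0 + 2.0 * ALPHA * math.cos(2.0 * x)

def f_inverse(y, iters=60):
    """Bisection on [y - 1, y + 1], which brackets the preimage because |f(x) - x| <= |ALPHA|."""
    lo, hi = y - 1.0, y + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample(n, seed=0):
    """Forward direction: draw x from the base and push it through f."""
    rng = random.Random(seed)
    return [f(rng.gauss(0.0, 1.0)) for _ in range(n)]

def density(y):
    """Inverse direction: pull y back to x and apply the change-of-variables formula."""
    x = f_inverse(y)
    return base_pdf(x) / abs(f_prime(x))

ys = sample(20000)
empirical = sum(1 for y in ys if abs(y) < 0.5) / len(ys)

# Midpoint-rule integral of the density over [-0.5, 0.5].
steps = 400
h = 1.0 / steps
integral = sum(density(-0.5 + (i + 0.5) * h) for i in range(steps)) * h

print(f"fraction of samples with |y| < 0.5: {empirical:.4f}")
print(f"integral of density over |y| < 0.5: {integral:.4f}")
```

The two printed numbers should agree up to sampling noise, which is the sense in which the sampler and the density describe the same distribution.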

What if the transform is not a bijection — say, $y = x^{2}$ ?

Then it doesn't qualify as a flow. The push-forward density still exists ( $Y = X^{2}$ has a chi-squared distribution with one degree of freedom if $X$ is standard normal), but computing it requires the multi-preimage formula, and because the transform isn't invertible you can sample $Y$ from $X$ but you can't go back. The whole point of a normalizing flow is that you can go back: the forward direction generates samples, the inverse direction evaluates densities, and both directions are exact. Drop invertibility and you've just got a generative model; it's not a flow anymore.
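For this particular example the multi-preimage sum is short enough to write out. Each $y > 0$ has the two preimages $x = \pm\sqrt{y}$, and $|f'(x)| = 2\sqrt{y}$ at both, so summing their contributions gives the chi-squared density with one degree of freedom:

```latex
\begin{aligned}
p_Y(y) &= \sum_{x \,:\, f(x) = y} \frac{p_X(x)}{|f'(x)|}
        = \frac{p_X(\sqrt{y})}{2\sqrt{y}} + \frac{p_X(-\sqrt{y})}{2\sqrt{y}} \\
       &= \frac{1}{\sqrt{2\pi y}}\, e^{-y/2}, \qquad y > 0.
\end{aligned}
```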

Why call it "normalizing"?

Because the inverse direction normalizes data into the base distribution. Given a complicated empirical sample of $y$ values, applying $f^{- 1}$ should yield $x$ values that look like draws from the base — a standard Gaussian. That's where "normalizing" comes from: the flow is the change of variables that turns data into a standard normal. We then read the term "normalizing flow" forwards, but the underlying meaning points the other way.