Machine Learning

Logistic Regression — exercises

Recognize a problem as binary classification, model P(y=1|x) = σ(xᵀβ), fit by minimizing cross-entropy via gradient descent, then read decision boundary, probability calibration, and odds-ratio interpretation off the learned β.

1 worked example · 7 practice problems · 2 check problems

Worked example: 1D logistic regression by gradient descent

Problem. Fit $P (y = 1 ∣ x) = σ (β_{0} + β_{1} x)$ to the six points $(- 3, 0), (- 2, 0), (- 1, 1), (1, 0), (2, 1), (3, 1)$ . Compute the gradient at the initialization $β = (0, 0)$ , take one explicit gradient-descent step with learning rate $η = 1$ , then state the converged estimate. Verify with code.

Diagnosis. Linear regression had a closed form; logistic regression doesn't. Set up the cross-entropy loss, compute its gradient, descend until the gradient is small. The gradient has the clean form $\nabla L (β) = X^{⊤} (σ (Xβ) - y) / n$ , which is what makes one explicit step computable by hand.
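The gradient formula can be sanity-checked numerically before it is used by hand. The sketch below compares $X^{⊤} (σ (Xβ) - y) / n$ with a central finite difference of the mean cross-entropy on the six points of this example. The evaluation point `beta` and the step `eps` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(beta, X, y):
    # Mean cross-entropy; adequate for the moderate logits used here.
    p = sigmoid(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(beta, X, y):
    # Closed-form gradient X^T (sigma(X beta) - y) / n.
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

# The six points from the worked example.
x = np.array([-3, -2, -1, 1, 2, 3], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

beta = np.array([0.3, -0.2])   # arbitrary test point (assumption)
eps = 1e-6                     # finite-difference step (assumption)
numeric = np.array([
    (loss(beta + eps * e, X, y) - loss(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(2)
])
analytic = grad(beta, X, y)
print("analytic:", analytic)
print("numeric :", numeric)
```

If the two vectors agree to several decimal places, the closed form is safe to use for the hand computation.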

Predict before reading on: eyeball the data before doing any algebra. The points at $x = \pm 1$ have "wrong" labels — what should this do to $β_{1}$ compared with a cleanly-separated dataset? Should it grow or shrink?

Setup. Design matrix $X$ has 6 rows and 2 columns (intercept + feature):

$X = \begin{pmatrix} 1 & -3 \\ 1 & -2 \\ 1 & -1 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix}, \quad y = (0, 0, 1, 0, 1, 1)^{⊤}$

Gradient at $β = 0$ . At zero, $Xβ = 0$ and $σ (0) = 1/2$ for every row. So $σ (Xβ) - y = (1/2 - y_{i})$ , a vector of $\pm 1/2$ entries:

$σ (Xβ) - y = (1/2, 1/2, - 1/2, 1/2, - 1/2, - 1/2)^{⊤}$

Multiply by $X^{⊤}$ and divide by $n = 6$ :

$\nabla L (0) = \frac{1}{6} X^{⊤} (1/2, 1/2, - 1/2, 1/2, - 1/2, - 1/2)^{⊤} = \frac{1}{6} (0, - 4)^{⊤} = (0, - 2/3)^{⊤}$

One GD step. With $η = 1$ : $β^{(1)} = β^{(0)} - η \nabla L = (0, + 2/3)$ . The intercept stays at zero — by the symmetry of the data under $x \to - x, y \to 1 - y$ , it stays at zero forever, and only $β_{1}$ moves.
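The hand arithmetic for this single step can be reproduced in a few lines. This is a small sketch on the same six points; it stops after one update and only reports the loss before and after, so it does not reveal the converged estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-3, -2, -1, 1, 2, 3], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])
n = len(y)

def loss(beta):
    # Mean cross-entropy on the six points.
    p = sigmoid(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

beta0 = np.zeros(2)
residual = sigmoid(X @ beta0) - y   # entries of +1/2 or -1/2
grad0 = X.T @ residual / n          # gradient at the initialization
beta1 = beta0 - 1.0 * grad0         # one step with eta = 1

print("gradient at 0    :", grad0)
print("beta after step  :", beta1)
print("loss before/after:", loss(beta0), loss(beta1))
```

The loss at $β = 0$ is $\log 2$ , since every point is predicted at probability 1/2; the step should lower it.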

Predict before reading on: predict the converged $β_{1}$ : should it be larger or smaller than $2/3$ after one step? Why? Think about the gradient at the new point $β = (0, 2/3)$ .

Converged estimate. Running the iteration to convergence gives $\hat{β} = (0, 0.732)$ . The intercept is exactly zero (the symmetry holds), and the slope is finite (not infinite) because the data is not separable: the $(- 1, 1)$ and $(1, 0)$ points sit on the "wrong" side of the boundary.

Predicted probabilities at the data points are $σ (\hat{β}_{1} x) = (0.10, 0.19, 0.32, 0.68, 0.81, 0.90)$ . The "wrong" points get 0.32 and 0.68 — the model gives them partial credit rather than committing.

Verification.

import numpy as np
def sigmoid(z): return 1 / (1 + np.exp(-z))

x = np.array([-3, -2, -1, 1, 2, 3], dtype=float)
y = np.array([ 0,  0,  1, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones(6), x])

beta = np.zeros(2)
for _ in range(5000):
    p = sigmoid(X @ beta)
    beta -= X.T @ (p - y) / 6     # lr = 1
print(beta)            # ≈ [0., 0.732]
print(sigmoid(X @ beta).round(2))  # ≈ [0.10, 0.19, 0.32, 0.68, 0.81, 0.90]

Articulate: state in one sentence what cross-entropy is actually measuring, and what it would mean for it to be zero on a dataset.

Practice problems

Seven problems, seven different surfaces. Some are pure "set up the design matrix and fit"; others read the fitted model (boundary, probabilities, odds); two are derivations that extend the technique (class imbalance, softmax).

P.1 medical risk scoring, reading a fitted model

A clinical biomarker is measured on eight patients with binary disease outcome $d \in \{0, 1\}$ :

biomarker b  =  1   2   3   4   5   6   7   8
disease   d  =  0   0   0   1   0   1   1   1

A logistic-regression fit gives $\hat{β} = (- 5.77, 1.28)$ (intercept, slope on $b$ ).

Without re-fitting:

(a) Compute the predicted probability of disease at $b = 5$ .

(b) Find the biomarker level at which the model is exactly undecided ( $P = 0.5$ ).

(c) State the odds-ratio interpretation of the slope: by what factor do the odds of disease change per unit increase in biomarker?

Find the analogue: this problem reads a fitted model. The worked example was about obtaining the model; this is about using it. Both lean on the same formula $σ (β_{0} + β_{1} x)$ .

Answer.

(a) $σ (- 5.77 + 1.28 \cdot 5) = σ (0.63) = 1/ (1 + e^{- 0.63}) \approx 0.65$ .

(b) $P = 0.5 \Leftrightarrow β_{0} + β_{1} b = 0 \Leftrightarrow b = - β_{0} / β_{1} = 5.77/1.28 \approx 4.5$ . The model is undecided at biomarker level 4.5.

(c) The model says $\log \frac{P}{1 - P} = β_{0} + β_{1} b$ . A unit increase in $b$ raises the log-odds by $β_{1} = 1.28$ , multiplying the odds by $e^{1.28} \approx 3.60$ . So each additional unit of biomarker multiplies the disease odds by about 3.6. That is the sentence epidemiologists write into clinical papers: the odds ratio comes straight from the fitted slope, with no extra computation.
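The three readings can be checked directly from the quoted coefficients. A minimal sketch using only the standard library; the variable names are chosen here for illustration.

```python
import math

beta0, beta1 = -5.77, 1.28   # fitted values quoted in the problem

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(b):
    # Odds of disease P / (1 - P) at biomarker level b.
    p = sigmoid(beta0 + beta1 * b)
    return p / (1.0 - p)

p_at_5 = sigmoid(beta0 + beta1 * 5)   # (a) predicted probability at b = 5
boundary = -beta0 / beta1             # (b) biomarker level where P = 0.5
odds_ratio = math.exp(beta1)          # (c) odds multiplier per unit of b

print(round(p_at_5, 2), round(boundary, 2), round(odds_ratio, 2))
# The ratio of odds at consecutive biomarker levels equals exp(beta1).
print(odds(6) / odds(5))
```

The last line confirms that the odds ratio does not depend on where along the biomarker scale the unit increase happens.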

P.2 marketing click prediction, multi-feature fit

Ten ad impressions are labelled with whether the user clicked:

age  imps  clicked
22    2     1
28    5     1
31    9     0
35    3     1
40   15     0
45    4     1
52   18     0
55    7     1
60   20     0
62   12     0

Fit $P (click ∣ age, imps) = σ (β_{0} + β_{1} age + β_{2} imps)$ via gradient descent (standardize features first). Report $\hat{β}$ and the in-sample accuracy at the 0.5 threshold. Which feature does the model use to predict not clicking?

Find the analogue: same gradient as the worked example, just with three columns in the design matrix instead of two and ten rows instead of six. Standardize features before fitting so the learning rate works on both at once.

Answer.

import numpy as np

def sigmoid(z):
    return np.where(z >= 0,
                    1.0 / (1.0 + np.exp(-z)),
                    np.exp(z) / (1.0 + np.exp(z)))

# Ten labelled (age, impressions, clicked) rows.
data = np.array([
    [22,  2, 1],  [28,  5, 1],  [31,  9, 0],  [35,  3, 1],  [40, 15, 0],
    [45,  4, 1],  [52, 18, 0],  [55,  7, 1],  [60, 20, 0],  [62, 12, 0],
])
X_raw = data[:, :2].astype(float)
y     = data[:, 2].astype(float)
n     = len(y)

# Standardize features so GD has comparable scales.
X_raw = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X     = np.column_stack([np.ones(n), X_raw])

beta = np.zeros(3)
for _ in range(20000):
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y) / n
    beta -= 0.5 * grad

print("β        :", beta)
print("accuracy :", np.mean((sigmoid(X @ beta) >= 0.5) == y))

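On this data the fit does not settle on finite coefficients. Every clicked row has imps ≤ 7 and every non-clicked row has imps ≥ 9, so the classes are perfectly separable on imps alone and the unregularized loss has no finite minimizer (the situation examined in Check 1). The printed $β$ therefore depends on the iteration count: the coefficient $β_{2}$ on standardized imps is negative and keeps growing in magnitude the longer gradient descent runs, and in-sample accuracy reaches 10/10 once the fitted boundary separates the classes.

The number of impressions is the feature the model uses to predict not clicking: the more impressions of the ad the user has already seen, the less likely they are to click. (Banner blindness.) Age is not needed to separate the classes, so its coefficient carries little information here.

Important caveat with $n = 10$ : these coefficients are noisy, and on separable data they are not even finite without regularization or early stopping. The point of the exercise is the mechanics; in production you'd never trust a 10-sample fit.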
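On a dataset this small it is worth checking whether a single feature already splits the classes, because a separable dataset has no finite unregularized optimum. The sketch below runs that check on the ten rows of this problem. `separates` is a helper name introduced here for illustration.

```python
import numpy as np

# Same ten (age, impressions, clicked) rows as in the problem.
data = np.array([
    [22,  2, 1], [28,  5, 1], [31,  9, 0], [35,  3, 1], [40, 15, 0],
    [45,  4, 1], [52, 18, 0], [55,  7, 1], [60, 20, 0], [62, 12, 0],
])
clicked = data[:, 2] == 1

def separates(values, positive):
    """True if one threshold on `values` splits the two classes perfectly."""
    pos, neg = values[positive], values[~positive]
    return bool(pos.max() < neg.min() or neg.max() < pos.min())

for name, col in [("age", 0), ("imps", 1)]:
    print(name, "separates the classes:", separates(data[:, col], clicked))
```

When a check like this returns `True` for any feature, the fitted coefficients depend on how long the optimizer runs, and a regularized fit is usually the safer thing to report.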

P.3 decision-boundary geometry in 2D

A 2D logistic-regression fit produces $\hat{β} = (1, 2, - 3)$ for $P (y = 1 ∣ x_{1}, x_{2}) = σ (β_{0} + β_{1} x_{1} + β_{2} x_{2})$ .

(a) Write the equation of the decision boundary in the $(x_{1}, x_{2})$ -plane.

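(b) Give a unit vector pointing in the direction of increasing predicted probability.

(c) Compute the signed distance between the origin and the decision boundary, and say which side of the boundary the origin lies on.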

Find the analogue: logistic regression's decision boundary is the affine hyperplane $β_{0} + β_{1} x_{1} + β_{2} x_{2} = 0$ . Everything about this problem is geometry on that hyperplane.

Answer.

(a) Boundary equation: $1 + 2 x_{1} - 3 x_{2} = 0$ , or rearranged $x_{2} = (1 + 2 x_{1}) /3$ .

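(b) The gradient of $β_{0} + β_{1} x_{1} + β_{2} x_{2}$ in the $(x_{1}, x_{2})$ -plane is $(2, - 3)$ . This is the direction of increasing log-odds, hence of increasing probability. Normalize: unit vector $= (2, - 3) / \sqrt{4 + 9} = (2/\sqrt{13}, - 3/\sqrt{13}) \approx (0.555, - 0.832)$ .

(c) The signed distance of a point $(x_{1}, x_{2})$ from the hyperplane $a x_{1} + b x_{2} + c = 0$ , measured along the normal $(a, b)$ , is $(a x_{1} + b x_{2} + c) / \sqrt{a^{2} + b^{2}}$ . Here $a = 2, b = - 3, c = 1$ , so the origin sits at $1/\sqrt{13} \approx + 0.277$ : it is on the $y = 1$ side, since its log-odds are $β_{0} = 1 > 0$ and $P = σ (1) \approx 0.73$ . Equivalently, the boundary lies at offset $- c / \sqrt{a^{2} + b^{2}} \approx - 0.277$ along the unit normal from the origin, so you reach it by moving 0.277 against the direction of increasing probability.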
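The geometry can be verified numerically. This sketch computes the unit normal, the predicted probability at the origin, and the closest boundary point to the origin, then confirms that this point satisfies the boundary equation. The sign convention (positive distance on the $y = 1$ side) is a choice made here.

```python
import numpy as np

beta = np.array([1.0, 2.0, -3.0])     # (beta_0, beta_1, beta_2) from the problem
b0, w = beta[0], beta[1:]

unit_normal = w / np.linalg.norm(w)   # direction of increasing probability
origin_prob = 1.0 / (1.0 + np.exp(-b0))   # P(y = 1) at x = (0, 0)
# Signed distance of the origin from the boundary, positive on the y = 1 side.
origin_signed_dist = b0 / np.linalg.norm(w)
# Closest boundary point: step back from the origin against the normal.
foot = -origin_signed_dist * unit_normal

print("unit normal        :", unit_normal.round(3))
print("P(y=1) at origin   :", round(float(origin_prob), 3))
print("signed distance    :", round(float(origin_signed_dist), 3))
print("boundary residual  :", b0 + w @ foot)   # should be ~0
```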

P.4 polynomial features for circular data

Two classes are arranged as concentric rings in 2D: class 0 inside the unit circle, class 1 between radius 2 and 3. Linear logistic regression on the raw features $(x_{1}, x_{2})$ fails because no straight line separates the two classes.

(a) Why? Explain the failure in one sentence using the decision-boundary geometry from P.3.

(b) Augment the design matrix to $[1, x_{1}, x_{2}, x_{1}^{2}, x_{2}^{2}]$ and fit. Report linear vs polynomial accuracy. Why does this work?

(c) Is the augmented model still a linear model? In what sense?

Find the analogue: the page noted: "logistic regression can only learn linear boundaries — standard fix is hand-engineer nonlinear features." This problem is that fix.

Answer.

(a) The classes are separated by a circle such as $x_{1}^{2} + x_{2}^{2} = 1.5^{2}$ , which is not a straight line. A linear-in- $(x_{1}, x_{2})$ model can only carve out a half-plane, and any half-plane that excludes the inner disk also excludes more than half of the ring that surrounds it, so every choice of $β$ misclassifies a large share of the points.

(b) Code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 50
# Class 0: small radius ring.  Class 1: large radius ring.
r0 = rng.uniform(0, 1, n); t0 = rng.uniform(0, 2*np.pi, n)
r1 = rng.uniform(2, 3, n); t1 = rng.uniform(0, 2*np.pi, n)
inner = np.column_stack([r0 * np.cos(t0), r0 * np.sin(t0)])
outer = np.column_stack([r1 * np.cos(t1), r1 * np.sin(t1)])
X_raw = np.vstack([inner, outer])
y     = np.concatenate([np.zeros(n), np.ones(n)])

def fit(X, lr=0.5, n_iter=20000):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        beta -= lr * X.T @ (p - y) / len(y)
    return beta

# Linear features only.
X_lin = np.column_stack([np.ones(2*n), X_raw])
beta_lin = fit(X_lin)
acc_lin = np.mean((sigmoid(X_lin @ beta_lin) >= 0.5) == y)

# Add x², y² features.
X_poly = np.column_stack([np.ones(2*n), X_raw, X_raw[:,0]**2, X_raw[:,1]**2])
beta_poly = fit(X_poly)
acc_poly = np.mean((sigmoid(X_poly @ beta_poly) >= 0.5) == y)

print("linear features  accuracy:", acc_lin)   # ~ 0.6, near chance
print("polynomial-feat. accuracy:", acc_poly)  # ~ 1.0

Linear-feature accuracy ≈ 0.6 (essentially chance). Polynomial-feature accuracy ≈ 1.0.

Why it works: the boundary is now $β_{0} + β_{1} x_{1} + β_{2} x_{2} + β_{3} x_{1}^{2} + β_{4} x_{2}^{2} = 0$ , which is a conic section. For roughly circular data the fit settles on $β_{3} \approx β_{4} > 0$ , $β_{0} < 0$ and $β_{1}, β_{2} \approx 0$ , recovering $β_{3} (x_{1}^{2} + x_{2}^{2}) = - β_{0}$ , a circle of radius $\sqrt{- β_{0} / β_{3}}$ .

(c) The model is still linear in parameters (it's the same OLS-style design matrix, just with extra columns). The hidden trick is that polynomial features are a poor person's kernel — neural networks generalize this by learning the nonlinear features rather than hand-engineering them.
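To read a radius off fitted coefficients, solve $β_{3} (x_{1}^{2} + x_{2}^{2}) = - β_{0}$ for $r$ , which gives $r = \sqrt{- β_{0} / β_{3}}$ . The sketch below does this for the idealized case $β_{1} = β_{2} = 0$ , $β_{3} = β_{4}$ . The function name and the example coefficients are hypothetical, chosen for illustration rather than taken from a fit.

```python
import math

def circle_radius(beta):
    """Radius of the boundary for beta = (b0, b1, b2, b3, b4), b1 = b2 = 0, b3 = b4."""
    b0, b1, b2, b3, b4 = beta
    if not (b1 == 0 and b2 == 0 and b3 == b4):
        raise ValueError("boundary is a general conic, not a centred circle")
    if -b0 / b3 <= 0:
        raise ValueError("no real boundary: -b0 / b3 must be positive")
    return math.sqrt(-b0 / b3)

# Hypothetical coefficients chosen for illustration, not fitted values.
beta_example = (-4.5, 0.0, 0.0, 2.0, 2.0)
print(circle_radius(beta_example))   # 1.5
```

On a real fit the linear coefficients are only approximately zero, so the boundary is a slightly shifted ellipse and this radius is an approximation.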

P.5 class imbalance via reweighted cross-entropy

A fraud-detection dataset has 99% non-fraud (class 0) and 1% fraud (class 1). Logistic regression trained on the raw cross-entropy $- \sum_{i} [y_{i} \log \hat{p}_{i} + (1 - y_{i}) \log (1 - \hat{p}_{i})]$ is dominated by the majority class: the model can predict "always 0" and hit 99% accuracy with the loss barely complaining.

Derive the class-reweighted cross-entropy where positive examples carry weight $w_{+}$ and negative examples carry weight $w_{-}$ , and write down the resulting gradient. Show that setting $w_{+} / w_{-} = (fraction negatives) / (fraction positives)$ gives the same loss as if classes were balanced.

Find the analogue: this is the same gradient derivation as the worked example, applied to a weighted loss instead of the uniform loss. The structure is unchanged; the algebra just carries weights.

Answer.

Weighted loss:

$L_{w} (β) = - \frac{1}{n} \sum_{i} [w_{+} y_{i} \log \hat{p}_{i} + w_{-} (1 - y_{i}) \log (1 - \hat{p}_{i})]$ where $\hat{p}_{i} = σ (x_{i}^{⊤} β)$ .

Differentiating and using $σ^{'} (z) = σ (z) (1 - σ (z))$ , row $i$ contributes $x_{i}$ times the scalar $w_{+} y_{i} (\hat{p}_{i} - 1) + w_{-} (1 - y_{i}) \hat{p}_{i} = (w_{+} y_{i} + w_{-} (1 - y_{i})) (\hat{p}_{i} - y_{i})$ . The two sides agree because $y_{i}$ is either 0 or 1, so only one of the two terms is active for each row.

So the gradient is $\nabla L_{w} (β) = \frac{1}{n} X^{⊤} W (\hat{p} - y)$ with $W = \operatorname{diag} (w_{+} y_{i} + w_{-} (1 - y_{i}))$ : the unweighted form with each row rescaled by its class weight.

Setting $w_{+} = 1/ p_{+}$ and $w_{-} = 1/ p_{-}$ (inverse class frequencies) makes positives and negatives contribute equal total weight, so the loss "sees" a balanced dataset. Practical equivalent: oversample the minority class to 50%. Both fixes solve the same problem with the same algebra under the hood.
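The equivalence between reweighting and oversampling can be checked on a toy dataset. The sketch below uses made-up data with 6 negatives and 2 positives, so the balancing ratio is $w_{+} / w_{-} = 0.75 / 0.25 = 3$ , and compares the weighted gradient with the plain gradient on a copy of the data in which each positive row is repeated three times. The data, the evaluation point, and the helper name `weighted_grad` are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_grad(beta, X, y, w_pos, w_neg):
    # (1/n) X^T W (p_hat - y) with W = diag(w_pos * y + w_neg * (1 - y)).
    w = w_pos * y + w_neg * (1 - y)
    return X.T @ (w * (sigmoid(X @ beta) - y)) / len(y)

# Toy data (made up for illustration): 6 negatives, 2 positives.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])
beta = np.array([0.1, 0.4])   # arbitrary evaluation point

g_weighted = weighted_grad(beta, X, y, w_pos=3.0, w_neg=1.0)

# Oversample: repeat each positive row 3 times, leave negatives alone.
reps = np.where(y == 1, 3, 1)
X_os, y_os = np.repeat(X, reps, axis=0), np.repeat(y, reps)
g_oversampled = weighted_grad(beta, X_os, y_os, w_pos=1.0, w_neg=1.0)

# Same gradient up to the 1/n normalisation (8 rows vs 12 rows).
print(len(y) * g_weighted)
print(len(y_os) * g_oversampled)
```

The two printed vectors should match, which is the sense in which reweighting and integer oversampling share the same algebra.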

P.6 calibration diagnostics

You've trained a logistic regression and have 1000 held-out predictions $\hat{p}_{i}$ with true labels $y_{i}$ . Bucket the predictions into 10 equal-width bins (0–0.1, 0.1–0.2, …, 0.9–1.0) and compute, for each bucket, the average predicted probability and the empirical fraction of positives.

(a) What should the plot of (mean predicted probability) vs (empirical fraction positive) look like if the model is well-calibrated?

(b) Sketch what the plot looks like for a model that is over-confident at the extremes, predicting 0.05 when the true rate is 0.15 and 0.95 when the true rate is 0.85.

(c) Name the standard fixes for a miscalibrated model.

Find the analogue: the page noted: "probabilities are calibrated only if the linearity assumption holds." This is the diagnostic that surfaces miscalibration.

Answer.

(a) Perfect calibration is the 45° diagonal: bucket $b$ (for $b = 0, \dots, 9$ ) has mean predicted probability $\hat{p} \approx 0.05 + 0.1 b$ when predictions are spread evenly within the bucket, and its empirical rate equals that same number. A well-calibrated model lies along $y = x$ .

(b) Over-confidence at the extremes gives a curve that is flatter than the diagonal. It sits above the diagonal for low $\hat{p}$ (predicted 0.05 maps to actual 0.15), crosses it near the middle, and sits below it at the high end (predicted 0.95 maps to actual 0.85). Between those two points the average slope is $(0.85 - 0.15) / (0.95 - 0.05) \approx 0.78 < 1$ . This is a transposed-sigmoid shape rather than an S-shaped sigmoid: an S-shaped curve, below the diagonal on the left and above it on the right, would indicate under-confidence.

(c) Standard fixes: (i) Platt scaling — fit a second 1D logistic regression that maps the model's raw logits to calibrated probabilities. (ii) Isotonic regression — fit a monotone step function from raw to calibrated probabilities. (iii) Temperature scaling for neural networks — divide logits by a learned scalar $T > 1$ to soften over-confident outputs. (iv) Add regularization during training; over-confidence often comes from unconstrained $∥ β ∥$ .
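The bucketing step itself is short to write down. The sketch below builds the reliability table with equal-width bins and runs it on synthetic predictions whose labels are drawn with exactly the predicted probability, so the model is calibrated by construction. The function name `reliability_table`, the synthetic data, and the sample size (larger than the 1000 in the problem, to keep the per-bin noise small) are assumptions for illustration.

```python
import numpy as np

def reliability_table(p_hat, y, n_bins=10):
    """Per-bin (mean predicted probability, empirical positive rate, count)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index 0..n_bins-1; a prediction of exactly 1.0 lands in the last bin.
    idx = np.clip(np.digitize(p_hat, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():   # skip empty bins
            rows.append((p_hat[mask].mean(), y[mask].mean(), int(mask.sum())))
    return rows

# Synthetic held-out set: labels drawn with exactly the predicted probability.
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.0, 1.0, 100_000)
y = (rng.uniform(0.0, 1.0, p_hat.size) < p_hat).astype(float)

table = reliability_table(p_hat, y)
for mean_pred, frac_pos, count in table:
    print(f"{mean_pred:.3f}  {frac_pos:.3f}  {count}")
```

Plotting the first column against the second gives the calibration curve; for this synthetic data the points should lie close to the diagonal.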

P.7 softmax / multi-class extension

Extend binary logistic regression to $K$ classes. The model is $P (y = k ∣ x) = e^{x^{⊤} β_{k}} / \sum_{j = 1}^{K} e^{x^{⊤} β_{j}}$ with one weight vector $β_{k} \in \mathbb{R}^{p}$ per class.

(a) Write down the negative log-likelihood for one labelled point $(x, y)$ where $y \in \{1, \dots, K\}$ , encoded as a one-hot vector $y_{k} = \mathbb{1} [y = k]$ .

(b) Derive the gradient with respect to $β_{k}$ for a single labelled point. Show it has the form $(\hat{p}_{k} - y_{k}) x$ .

(c) Conclude that the matrix gradient over a dataset has the same $X^{⊤} (\hat{P} - Y)$ structure as binary logistic regression, with $\hat{P}, Y \in \mathbb{R}^{n \times K}$ .

Find the analogue: the binary case takes $σ (z) = e^{z} / (1 + e^{z})$ , which is exactly the $K = 2$ softmax with one class's weights fixed at zero. The structural pattern $X^{⊤} (\hat{p} - y)$ repeats verbatim.

Answer.

(a) Negative log-likelihood for one point: $L = - \log P (y ∣ x) = - \sum_{k} y_{k} \log \hat{p}_{k}$ with $\hat{p}_{k} = e^{x^{⊤} β_{k}} / Z$ and $Z = \sum_{j} e^{x^{⊤} β_{j}}$ .

(b) Differentiate $\log \hat{p}_{k} = x^{⊤} β_{k} - \log Z$ with respect to $β_{k^{'}}$ :

$\partial_{β_{k^{'}}} \log \hat{p}_{k} = \mathbb{1} [k = k^{'}] x - \frac{\partial \log Z}{\partial β_{k^{'}}} = (\mathbb{1} [k = k^{'}] - \hat{p}_{k^{'}}) x$

Since only one $y_{k}$ is 1 (call its index $y_{*}$ ), $\partial_{β_{k^{'}}} L = - (\mathbb{1} [y_{*} = k^{'}] - \hat{p}_{k^{'}}) x = (\hat{p}_{k^{'}} - y_{k^{'}}) x$ ✓

(c) Stack across the dataset: with $\hat{P} \in \mathbb{R}^{n \times K}$ holding predicted probabilities and $Y \in \mathbb{R}^{n \times K}$ the one-hot label matrix, the gradient of the mean loss with respect to the weight matrix $B \in \mathbb{R}^{p \times K}$ is $\nabla_{B} L = X^{⊤} (\hat{P} - Y) / n$ . This has exactly the same shape as the binary case; softmax regression is binary logistic regression generalized to $K$ classes. The same residual $\hat{P} - Y$ is also the gradient with respect to the logits at the output layer of any neural network trained with softmax cross-entropy.
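The link to the binary case can be checked numerically: with $K = 2$ and one class's weights pinned at zero, the softmax probability of the other class is the sigmoid, and the matching column of $X^{⊤} (\hat{P} - Y) / n$ is the binary gradient. The sketch below uses made-up data and 0-based class indices (0 and 1 rather than 1 and 2), both assumptions for illustration.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # shift rows for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data for illustration: 5 points, intercept + one feature.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
labels = np.array([0, 0, 1, 0, 1])         # class index per row
X = np.column_stack([np.ones_like(x), x])
n = len(labels)

beta = np.array([0.2, 0.7])                # arbitrary binary weights
B = np.column_stack([np.zeros(2), beta])   # class-0 weights pinned at zero

P = softmax(X @ B)                         # n x 2 predicted probabilities
Y = np.eye(2)[labels]                      # n x 2 one-hot labels
grad_softmax = X.T @ (P - Y) / n           # p x K gradient matrix
grad_binary = X.T @ (sigmoid(X @ beta) - labels) / n

print(np.allclose(P[:, 1], sigmoid(X @ beta)))        # K = 2 softmax is the sigmoid
print(np.allclose(grad_softmax[:, 1], grad_binary))   # same gradient column
```

Because each row of $\hat{P} - Y$ sums to zero, the two columns of the softmax gradient are negatives of each other, which is why one weight vector is enough in the binary case.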

Check problems

Two problems designed to resist pattern-matching against the practice set. Neither can be solved by recognizing it as similar to one of the seven above.

Check 1 articulation

On perfectly separable data (some $β$ gives every class-1 point a higher $x^{⊤} β$ than every class-0 point), the maximum-likelihood estimator for logistic regression has no finite optimum: $∥ \hat{β} ∥ \to \infty$ as gradient descent runs.

In 150–250 words, explain why. Your explanation should describe the geometry of the cross-entropy loss surface on separable data (what happens to the loss as $β$ is rescaled by a large positive constant?), distinguish "the data is separable" from "the model has converged," and explain why this matters in practice. A reader who has just read the logistic-regression concept page should follow your explanation.

Solution sketch.

Cross-entropy on one data point is $- \log σ (x^{⊤} β)$ when $y = 1$ and $- \log σ (- x^{⊤} β)$ when $y = 0$ . On perfectly separable data, there exists a direction $β^{⋆}$ such that $y_{i} = 1 \Rightarrow x_{i}^{⊤} β^{⋆} > 0$ and $y_{i} = 0 \Rightarrow x_{i}^{⊤} β^{⋆} < 0$ for every $i$ .

Consider the rescaled estimator $β = c β^{⋆}$ for $c \to + \infty$ . Every $y_{i} = 1$ point sees $x_{i}^{⊤} β \to + \infty$ , so $σ (x_{i}^{⊤} β) \to 1$ , so its contribution to the loss $\to 0$ . Same story for $y_{i} = 0$ points with $σ$ going to 0. The total cross-entropy $\to 0$ , monotonically, as $c \to \infty$ .
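The rescaling argument is easy to watch numerically. The sketch below evaluates the mean cross-entropy and the gradient norm at $c β^{⋆}$ for growing $c$ on a small separable dataset. The data and the choice $β^{⋆} = (0, 1)$ are made up for illustration.

```python
import numpy as np

def cross_entropy(beta, X, y):
    # Stable form: log(1 + exp(-s * z)) with s = +1 for y = 1 and -1 for y = 0.
    z = X @ beta
    s = 2 * y - 1
    return np.mean(np.logaddexp(0.0, -s * z))

def grad_norm(beta, X, y):
    # Norm of X^T (sigma(X beta) - y) / n.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.linalg.norm(X.T @ (p - y) / len(y))

# Separable toy data: negatives left of 0, positives right of 0.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta_star = np.array([0.0, 1.0])   # one separating direction
scales = [1, 2, 4, 8, 16]
losses = [cross_entropy(c * beta_star, X, y) for c in scales]
norms = [grad_norm(c * beta_star, X, y) for c in scales]
for c, L, g in zip(scales, losses, norms):
    print(c, L, g)
```

Both columns shrink toward zero as $c$ grows, yet no finite $c$ reaches zero loss, so there is always a slightly better $β$ further out.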

So the loss has no minimizer at finite $β$ : every increase in $∥ β ∥$ along the separating direction strictly decreases the loss. Gradient descent reflects this: $∥ \hat{β} ∥$ grows without bound, and the predicted probabilities sharpen toward step functions.

A small gradient norm is not the same thing as a converged model. On separable data the gradient norm does shrink toward zero, but only because the sigmoid saturates, not because $β$ has stabilized; $∥ β ∥$ is still growing. In practice, this is why production code uses $L_{2}$ regularization (which adds $λ ∥ β ∥^{2}$ and gives a finite minimizer) or simply caps the number of gradient steps.

Check 2 derivation

Derive the Hessian of the cross-entropy loss $L (β) = - \frac{1}{n} \sum_{i} [y_{i} \log σ (x_{i}^{⊤} β) + (1 - y_{i}) \log (1 - σ (x_{i}^{⊤} β))]$ . Express it in matrix form using the design matrix $X$ , the predicted probabilities $\hat{p}_{i}$ , and a diagonal matrix.

Conclude that the Hessian is positive semi-definite, and therefore the cross-entropy loss is convex. State explicitly which property of the sigmoid you used.

Solution sketch.

Start from the gradient: $\nabla L (β) = \frac{1}{n} X^{⊤} (\hat{p} - y)$ with $\hat{p} = σ (Xβ)$ componentwise.

Differentiate again. $\partial \hat{p}_{i} / \partial β_{j} = σ^{'} (x_{i}^{⊤} β) \cdot x_{ij} = \hat{p}_{i} (1 - \hat{p}_{i}) \cdot x_{ij}$ .

So $\partial^{2} L / \partial β_{j} \partial β_{k} = \frac{1}{n} \sum_{i} x_{ij} \hat{p}_{i} (1 - \hat{p}_{i}) x_{ik}$ . In matrix form:

$\nabla^{2} L (β) = \frac{1}{n} X^{⊤} D X, \quad D = \operatorname{diag} (\hat{p}_{i} (1 - \hat{p}_{i}))$

Positive semi-definiteness. For any vector $v$ :

$v^{⊤} (X^{⊤} D X) v = (X v)^{⊤} D (X v) = \sum_{i} \hat{p}_{i} (1 - \hat{p}_{i}) (X v)_{i}^{2}$

Each term is non-negative because $\hat{p}_{i} \in (0, 1)$ means $\hat{p}_{i} (1 - \hat{p}_{i}) > 0$ . The sum is therefore $\geq 0$ for every $v$ , so the Hessian is PSD and the loss is convex.

The sigmoid property used: $σ^{'} (z) = σ (z) (1 - σ (z))$ , which is always positive. This is the workhorse identity that makes the gradient clean and the loss convex — both properties trace to the same simple derivative formula.

Convexity is why gradient descent has no local-minimum traps on logistic regression. It also enables Newton-style optimizers: the Hessian is well-defined and PSD, and positive definite when $X$ has full column rank, so the Newton update $β \leftarrow β - (X^{⊤} D X)^{- 1} X^{⊤} (\hat{p} - y)$ moves along a descent direction. This is what IRLS, "iteratively reweighted least squares", computes.
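A Newton iteration built from this Hessian can be sketched on the six points of the worked example, which are not separable, so the optimum is finite. This is a bare-bones illustration with no step-size control or convergence test, and the fixed iteration count is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The six points from the worked example (not separable).
x = np.array([-3, -2, -1, 1, 2, 3], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])
n = len(y)

beta = np.zeros(2)
for _ in range(25):                          # fixed count, chosen for illustration
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y) / n
    H = X.T @ np.diag(p * (1 - p)) @ X / n   # Hessian (1/n) X^T D X
    beta = beta - np.linalg.solve(H, grad)   # Newton update

# Recompute gradient and Hessian at the final iterate.
p = sigmoid(X @ beta)
grad = X.T @ (p - y) / n
H = X.T @ np.diag(p * (1 - p)) @ X / n
print("beta          :", beta)
print("Hessian eigs  :", np.linalg.eigvalsh(H))
print("gradient norm :", np.linalg.norm(grad))
```

The Hessian eigenvalues should come out strictly positive here because $X$ has full column rank, and the Newton iterate should agree with the gradient-descent estimate from the worked example while needing far fewer iterations.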