Machine Learning

Logistic Regression — exercises

Recognize a problem as binary classification, model P(y=1|x) = σ(xᵀβ), fit by minimizing cross-entropy via gradient descent, then read decision boundary, probability calibration, and odds-ratio interpretation off the learned β.

1 worked example · 7 practice problems · 2 check problems

Worked example: 1D logistic regression by gradient descent

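Problem. Fit $P(y = 1 \mid x) = \sigma(\beta_0 + \beta_1 x)$ to the six points $(x, y) = (-3, 0), (-2, 0), (-1, 1), (1, 0), (2, 1), (3, 1)$. Compute the gradient at the initialization $\beta = (0, 0)$, take one explicit gradient-descent step with learning rate $\eta = 1$, then state the converged estimate. Verify with code.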

Diagnosis. Linear regression had a closed form; logistic regression doesn't. Set up the cross-entropy loss, compute its gradient, descend until the gradient is small. The gradient has the clean form $\nabla L(\beta) = \frac{1}{n} X^\top (p - y)$, which is what makes one explicit step computable by hand.

Predict before reading on: eyeball the data before doing any algebra. The points at $x = -1$ and $x = +1$ have "wrong" labels. What should this do to $\beta_1$ compared with a cleanly separated dataset? Should it grow or shrink?

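Setup. Design matrix $X$ has 6 rows and 2 columns (intercept + feature):

$$X = \begin{pmatrix} 1 & -3 \\ 1 & -2 \\ 1 & -1 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix}, \qquad y = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 1 \\ 1 \end{pmatrix}$$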

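Gradient at $\beta = (0, 0)$. At zero, $x_i^\top \beta = 0$ and $p_i = \sigma(0) = 0.5$ for every row. So $p - y$ is a vector of $\pm 0.5$ entries:

$$p - y = (0.5,\ 0.5,\ -0.5,\ 0.5,\ -0.5,\ -0.5)^\top$$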

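Multiply by $X^\top$ and divide by $n = 6$:

$$\nabla L(0) = \frac{1}{6} X^\top (p - y) = \frac{1}{6} \begin{pmatrix} 0 \\ -4 \end{pmatrix} = \begin{pmatrix} 0 \\ -2/3 \end{pmatrix}$$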

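One GD step. With $\eta = 1$: $\beta^{(1)} = \beta^{(0)} - \eta \nabla L(\beta^{(0)}) = (0, 2/3) \approx (0, 0.667)$. The intercept stays at zero: by the symmetry of the data under $(x, y) \mapsto (-x, 1 - y)$, it stays at zero forever, and only $\beta_1$ moves.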
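
The hand computation above can be sketched in a few lines of NumPy. This is a minimal check, assuming the six points and the learning rate of 1 that appear in the verification code further down; the variable names are illustrative.

```python
import numpy as np

# Six points from the worked example (same arrays as the verification code).
x = np.array([-3, -2, -1, 1, 2, 3], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])    # intercept column + feature

beta_start = np.zeros(2)                     # initialization at the origin
p = 1.0 / (1.0 + np.exp(-(X @ beta_start)))  # every entry is 0.5 here
residual = p - y                             # entries are +0.5 or -0.5
grad = X.T @ residual / len(y)               # (1/n) X^T (p - y)
beta_next = beta_start - 1.0 * grad          # one step, learning rate 1

print("gradient :", grad)
print("new beta :", beta_next)
```

The gradient should come out close to (0, -2/3) and the updated estimate close to (0, 2/3), matching the arithmetic done by hand.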

Predict before reading on: should the converged $\beta_1$ be larger or smaller than the value $2/3$ reached after one step? Why? Think about the gradient at the new point $\beta = (0, 2/3)$.

Converged estimate. Running the iteration to convergence gives $\hat\beta \approx (0, 0.732)$. The intercept is exactly zero (the symmetry holds), and the slope is finite (not infinite) because the data is not separable: the $x = -1$ and $x = +1$ points sit on the "wrong" side of the boundary.

Predicted probabilities at the data points are $(0.10, 0.19, 0.32, 0.68, 0.81, 0.90)$. The "wrong" points get 0.32 and 0.68: the model gives them partial credit rather than committing.

Verification.

import numpy as np
def sigmoid(z): return 1 / (1 + np.exp(-z))

x = np.array([-3, -2, -1, 1, 2, 3], dtype=float)
y = np.array([ 0,  0,  1, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones(6), x])

beta = np.zeros(2)
for _ in range(5000):
    p = sigmoid(X @ beta)
    beta -= X.T @ (p - y) / 6     # lr = 1
print(beta)            # ≈ [0., 0.732]
print(sigmoid(X @ beta).round(2))  # ≈ [0.10, 0.19, 0.32, 0.68, 0.81, 0.90]

Articulate: state in one sentence what cross-entropy is actually measuring, and what it would mean for it to be zero on a dataset.


Practice problems

Seven problems, seven different surfaces. Some are pure "set up the design matrix and fit"; others read the fitted model (boundary, probabilities, odds); two are derivations that extend the technique (class imbalance, softmax).

P.1 medical risk scoring, reading a fitted model

A clinical biomarker $b$ is measured on eight patients with binary disease outcome $d \in \{0, 1\}$:

biomarker b  =  1   2   3   4   5   6   7   8
disease   d  =  0   0   0   1   0   1   1   1

A logistic-regression fit gives $\hat\beta \approx (-5.76, 1.28)$ (intercept, slope on $b$).

Without re-fitting:

(a) Compute the predicted probability of disease at .

(b) Find the biomarker level at which the model is exactly undecided ($P(d = 1 \mid b) = 0.5$).

(c) State the odds-ratio interpretation of the slope: by what factor do the odds of disease change per unit increase in biomarker?

Find the analogue: this problem reads a fitted model. The worked example was about obtaining the model; this is about using it. Both lean on the same formula $P(y = 1 \mid x) = \sigma(x^\top \beta)$.

Answer.

(a) .

(b) $P = 0.5$ exactly when $\beta_0 + \beta_1 b = 0$, so $b^* = -\beta_0 / \beta_1 = 4.5$. The model is undecided at biomarker level 4.5.

(c) The model says $\log \frac{p}{1 - p} = \beta_0 + \beta_1 b$. A unit increase in $b$ raises the log-odds by $\beta_1$, multiplying the odds by $e^{\beta_1} \approx 3.6$. So each additional unit of biomarker multiplies the disease odds by about 3.6. That's the natural-language sentence epidemiologists actually write into clinical papers; odds ratios are what logistic regression gives you for free.
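
The three readings can be sketched in a few lines. The coefficients below are an assumption: they are inferred from the 4.5 boundary and the 3.6 odds ratio quoted in this answer, not taken from a re-fit. The query level `b_query = 6` is a hypothetical value chosen only for illustration.

```python
import math

# Assumed fitted coefficients (intercept, slope), inferred from the answer's
# boundary of 4.5 and odds ratio of about 3.6; not re-fitted here.
beta0, beta1 = -5.76, 1.28

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b_query = 6.0                                # hypothetical biomarker level
prob = sigmoid(beta0 + beta1 * b_query)      # (a) predicted P(disease | b)
boundary = -beta0 / beta1                    # (b) level where P = 0.5
odds_ratio = math.exp(beta1)                 # (c) odds multiplier per unit of b

print(f"P(d=1 | b={b_query}) = {prob:.3f}")
print(f"undecided at b = {boundary:.2f}")
print(f"odds ratio per unit = {odds_ratio:.2f}")
```

Nothing here touches the data: every quantity is a function of the two coefficients alone, which is the sense in which the problem "reads" a fitted model.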

P.2 marketing click prediction, multi-feature fit

Ten ad impressions are labelled with whether the user clicked:

age  imps  clicked
22    2     1
28    5     1
31    9     0
35    3     1
40   15     0
45    4     1
52   18     0
55    7     1
60   20     0
62   12     0

Fit a logistic regression via gradient descent (standardize features first). Report $\hat\beta$ and the in-sample accuracy at the 0.5 threshold. Which feature does the model use to predict not clicking?

Find the analogue: same gradient as the worked example, just with three columns in the design matrix instead of two and ten rows instead of six. Standardize features before fitting so the learning rate works on both at once.

Answer.
import numpy as np

def sigmoid(z):
    return np.where(z >= 0,
                    1.0 / (1.0 + np.exp(-z)),
                    np.exp(z) / (1.0 + np.exp(z)))

# Ten labelled (age, impressions, clicked) rows.
data = np.array([
    [22,  2, 1],  [28,  5, 1],  [31,  9, 0],  [35,  3, 1],  [40, 15, 0],
    [45,  4, 1],  [52, 18, 0],  [55,  7, 1],  [60, 20, 0],  [62, 12, 0],
])
X_raw = data[:, :2].astype(float)
y     = data[:, 2].astype(float)
n     = len(y)

# Standardize features so GD has comparable scales.
X_raw = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X     = np.column_stack([np.ones(n), X_raw])

beta = np.zeros(3)
for _ in range(20000):
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y) / n
    beta -= 0.5 * grad

print("β        :", beta)
print("accuracy :", np.mean((sigmoid(X @ beta) >= 0.5) == y))

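The fit returns $\hat\beta_0$ (intercept), $\hat\beta_1$ (on standardized age), and $\hat\beta_2$ (on standardized imps). Note that this sample is linearly separable on imps alone (every clicked row has imps ≤ 7, every non-clicked row has imps ≥ 9), so the coefficient magnitudes keep growing with the iteration count instead of converging, and a fit run this long reaches in-sample accuracy 10/10.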

The number of impressions has the largest-magnitude negative coefficient — the more impressions of the ad the user has already seen, the less likely they are to click. (Banner blindness.) Age also leans negative but is dominated by imps.

Important caveat with $n = 10$: these coefficients are noisy. The point of the exercise is the mechanics; in production you'd never trust a 10-sample fit.
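
With only ten rows it is worth checking whether a single feature already splits the classes, because a separable sample has no finite maximum-likelihood coefficients (the situation Check 1 describes). A minimal sketch on the table from this problem, with illustrative variable names:

```python
import numpy as np

# Same ten (age, imps, clicked) rows as the table in P.2.
data = np.array([
    [22,  2, 1], [28,  5, 1], [31,  9, 0], [35,  3, 1], [40, 15, 0],
    [45,  4, 1], [52, 18, 0], [55,  7, 1], [60, 20, 0], [62, 12, 0],
])
imps, clicked = data[:, 1], data[:, 2]

max_imps_clicked = imps[clicked == 1].max()      # largest imps among clickers
min_imps_not_clicked = imps[clicked == 0].min()  # smallest imps among non-clickers

# If every clicker has fewer impressions than every non-clicker, a single
# threshold on imps separates the two classes perfectly.
separable_on_imps = max_imps_clicked < min_imps_not_clicked
print(max_imps_clicked, min_imps_not_clicked, separable_on_imps)
```

On this table any threshold between 7 and 9 impressions separates clickers from non-clickers, so the fitted magnitudes depend on how long gradient descent is allowed to run.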

P.3 decision-boundary geometry in 2D

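A 2D logistic-regression fit produces $\beta = (\beta_0, \beta_1, \beta_2)$ for $P(y = 1 \mid x_1, x_2) = \sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$.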

(a) Write the equation of the decision boundary in the $(x_1, x_2)$-plane.

(b) Give a unit vector pointing in the direction of increasing predicted probability.

(c) Compute the perpendicular distance from the origin to the boundary.

Find the analogue: logistic regression's decision boundary is the affine hyperplane $\{x : \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0\}$. Everything about this problem is geometry on that hyperplane.

Answer.

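(a) Boundary equation: $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$, or rearranged $x_2 = -(\beta_0 + \beta_1 x_1) / \beta_2$ (when $\beta_2 \neq 0$).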

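(b) The gradient of the log-odds $\beta_0 + \beta_1 x_1 + \beta_2 x_2$ in the $(x_1, x_2)$-plane is $(\beta_1, \beta_2)$. This is the direction of increasing log-odds, hence of increasing probability. Normalize: unit vector $(\beta_1, \beta_2) / \sqrt{\beta_1^2 + \beta_2^2}$.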

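(c) The signed distance from the origin to the hyperplane is $\beta_0 / \sqrt{\beta_1^2 + \beta_2^2}$, and its absolute value is the perpendicular distance. When this value is negative, the origin is on the $P(y = 1) < 0.5$ side.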
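
A concrete instance may help fix the geometry. The coefficients in this sketch are hypothetical, picked so the arithmetic is round; they are not the values from the problem statement.

```python
import numpy as np

# Hypothetical coefficients: intercept beta0 and slope vector w = (beta1, beta2).
beta0, w = -2.0, np.array([3.0, 4.0])

norm_w = np.linalg.norm(w)          # length of the slope vector, here 5
unit_normal = w / norm_w            # (b) direction of increasing probability
signed_dist = beta0 / norm_w        # (c) signed distance of the origin
perp_dist = abs(signed_dist)        # perpendicular distance to the boundary

# (a) boundary beta0 + w[0]*x1 + w[1]*x2 = 0, solved for x2 as a line in x1
slope, intercept = -w[0] / w[1], -beta0 / w[1]

print("x2 =", slope, "* x1 +", intercept)
print("unit normal:", unit_normal)
print("signed distance:", signed_dist, " perpendicular distance:", perp_dist)
```

With these assumed numbers the boundary is the line $x_2 = -0.75\,x_1 + 0.5$, the unit normal is (0.6, 0.8), and the origin sits 0.4 away on the low-probability side, since the score at the origin is just the negative intercept.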

P.4 polynomial features for circular data

Two classes are arranged as concentric rings in 2D: class 0 inside the unit circle, class 1 between radius 2 and 3. Linear logistic regression on the raw features fails because no straight line separates the two classes.

(a) Why? Explain the failure in one sentence using the decision-boundary geometry from P.3.

(b) Augment the design matrix to $[1, x_1, x_2, x_1^2, x_2^2]$ and fit. Report linear vs polynomial accuracy. Why does this work?

(c) What does the augmented boundary equation look like in terms of $(x_1, x_2)$?

Find the analogue: the page noted: "logistic regression can only learn linear boundaries — standard fix is hand-engineer nonlinear features." This problem is that fix.

Answer.

(a) The classes are separated by a curve of the form $x_1^2 + x_2^2 = r^2$ with $1 < r < 2$ (a circle), which is not a straight line. A linear-in-$x$ model can only carve out a half-plane, so any choice of $\beta$ misclassifies a large share of at least one ring.

(b) Code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 50
# Class 0: small radius ring.  Class 1: large radius ring.
r0 = rng.uniform(0, 1, n); t0 = rng.uniform(0, 2*np.pi, n)
r1 = rng.uniform(2, 3, n); t1 = rng.uniform(0, 2*np.pi, n)
inner = np.column_stack([r0 * np.cos(t0), r0 * np.sin(t0)])
outer = np.column_stack([r1 * np.cos(t1), r1 * np.sin(t1)])
X_raw = np.vstack([inner, outer])
y     = np.concatenate([np.zeros(n), np.ones(n)])

def fit(X, lr=0.5, n_iter=20000):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        beta -= lr * X.T @ (p - y) / len(y)
    return beta

# Linear features only.
X_lin = np.column_stack([np.ones(2*n), X_raw])
beta_lin = fit(X_lin)
acc_lin = np.mean((sigmoid(X_lin @ beta_lin) >= 0.5) == y)

# Add x², y² features.
X_poly = np.column_stack([np.ones(2*n), X_raw, X_raw[:,0]**2, X_raw[:,1]**2])
beta_poly = fit(X_poly)
acc_poly = np.mean((sigmoid(X_poly @ beta_poly) >= 0.5) == y)

print("linear features  accuracy:", acc_lin)   # ~ 0.6, near chance
print("polynomial-feat. accuracy:", acc_poly)  # ~ 1.0

Linear-feature accuracy ≈ 0.6 (essentially chance). Polynomial-feature accuracy ≈ 1.0.

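Why it works: the boundary is now $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 = 0$, which is a conic section. For roughly circular data the fit ends up with $\beta_1, \beta_2 \approx 0$ and $\beta_3 \approx \beta_4$, recovering $x_1^2 + x_2^2 \approx -\beta_0 / \beta_3$, a circle of radius $\sqrt{-\beta_0 / \beta_3}$.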

(c) In terms of $(x_1, x_2)$ the boundary is the quadratic curve from (b), yet the model is still linear in its parameters (it's the same OLS-style design matrix, just with extra columns). The hidden trick is that polynomial features are a poor person's kernel; neural networks generalize this by learning the nonlinear features rather than hand-engineering them.
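
To read a circle off fitted coefficients, one can complete the square. The helper below is a hypothetical sketch (the name `circle_from_coeffs` is not from the source) and assumes both squared features share a single coefficient `c`; when the two fitted values are merely close, passing their average is an approximation.

```python
import math

def circle_from_coeffs(b0, b1, b2, c):
    """Centre and radius of b0 + b1*x1 + b2*x2 + c*(x1**2 + x2**2) = 0.

    Hypothetical helper: assumes both squared features share one
    coefficient c != 0, and completes the square in x1 and x2.
    """
    cx, cy = -b1 / (2 * c), -b2 / (2 * c)
    r_squared = cx**2 + cy**2 - b0 / c
    if r_squared <= 0:
        raise ValueError("coefficients do not describe a real circle")
    return (cx, cy), math.sqrt(r_squared)

# Illustrative coefficients only (not output of the fit above).
centre, radius = circle_from_coeffs(b0=-2.25, b1=0.0, b2=0.0, c=1.0)
print(centre, radius)
```

With zero linear terms the centre is the origin and the radius reduces to $\sqrt{-b_0 / c}$, the same expression as in the "Why it works" paragraph.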

P.5 class imbalance via reweighted cross-entropy

A fraud-detection dataset has 99% non-fraud (class 0) and 1% fraud (class 1). Logistic regression trained on the raw cross-entropy is dominated by the majority class — the model can predict "always 0" and hit 99% accuracy with the loss barely complaining.

Derive the class-reweighted cross-entropy where positive examples carry weight $w_+$ and negative examples carry weight $w_-$, and write down the resulting gradient. Show that setting $w_+ = 1/n_+$ and $w_- = 1/n_-$ (with $n_+$ and $n_-$ the class counts) gives the same loss as if classes were balanced.

Find the analogue: this is the same gradient derivation as the worked example, applied to a weighted loss instead of the uniform loss. The structure is unchanged; the algebra just carries weights.

Answer.

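Weighted loss:

$$L_w(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ w_+\, y_i \log p_i + w_-\, (1 - y_i) \log(1 - p_i) \right]$$

where $p_i = \sigma(x_i^\top \beta)$.

Differentiating and using $\partial p_i / \partial \beta = p_i (1 - p_i)\, x_i$, the contribution from row $i$ is $w_i (p_i - y_i)\, x_i$, with $w_i = w_+$ when $y_i = 1$ and $w_i = w_-$ when $y_i = 0$.

So the gradient is $\nabla L_w = \frac{1}{n} X^\top W (p - y)$ with $W = \mathrm{diag}(w_1, \dots, w_n)$: the unweighted form times a row-rescaling.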

Setting $w_+ = 1/n_+$ and $w_- = 1/n_-$ (inverse class frequencies, with $n_+$ and $n_-$ the class counts) makes positives and negatives contribute equal total weight, so the loss "sees" a balanced dataset. Practical equivalent: oversample the minority class to 50%. Both fixes solve the same problem with the same algebra under the hood.
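
A minimal sketch of the weighted loss and its gradient, assuming inverse class counts as the weights. The synthetic sample and the function names `weighted_loss` and `weighted_grad` are illustrative, not part of the exercise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_loss(beta, X, y, w_pos, w_neg):
    """Class-weighted mean cross-entropy."""
    p = sigmoid(X @ beta)
    return -np.mean(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))

def weighted_grad(beta, X, y, w_pos, w_neg):
    """Its gradient, (1/n) X^T W (p - y) with W = diag(w_i)."""
    w = np.where(y == 1, w_pos, w_neg)       # per-row weight w_i
    return X.T @ (w * (sigmoid(X @ beta) - y)) / len(y)

# Synthetic imbalanced sample (illustrative only): 10 positives, 190 negatives.
rng = np.random.default_rng(0)
y = np.concatenate([np.ones(10), np.zeros(190)])
x = rng.normal(loc=1.5 * y, scale=1.0)       # positives shifted to the right
X = np.column_stack([np.ones_like(x), x])

n_pos, n_neg = y.sum(), (1 - y).sum()
w_pos, w_neg = 1.0 / n_pos, 1.0 / n_neg      # inverse class counts

print("total weight, positives:", w_pos * n_pos)
print("total weight, negatives:", w_neg * n_neg)
print("gradient at beta = 0   :", weighted_grad(np.zeros(2), X, y, w_pos, w_neg))
```

Both class totals come out equal, which is the sense in which the reweighted loss treats the sample as balanced. Because these weights shrink the gradient by roughly a factor of the class size, a practical run might rescale them (for example by $n/2$) before choosing a learning rate.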

P.6 calibration diagnostics

You've trained a logistic regression and have 1000 held-out predictions $\hat p_i$ with true labels $y_i$. Bucket the predictions into 10 equal-width bins (0–0.1, 0.1–0.2, …, 0.9–1.0) and compute, for each bucket, the average predicted probability and the empirical fraction of positives.

(a) What should the plot of $\bar y_k$ (empirical fraction positive, vertical axis) against $\bar p_k$ (mean predicted probability, horizontal axis) look like if the model is well-calibrated?

(b) Sketch what the plot looks like for a model that is over-confident at the extremes — predicting 0.05 when the true rate is 0.15, and 0.95 when the true rate is 0.85.

(c) Name one fix for an over-confident logistic regression.

Find the analogue: the page noted: "probabilities are calibrated only if the linearity assumption holds." This is the diagnostic that surfaces miscalibration.

Answer.

(a) Perfect calibration is the 45° diagonal: bucket $k$ has mean predicted probability $\bar p_k$ and empirical rate $\bar y_k$ equal to that same number. A well-calibrated model lies along $\bar y_k = \bar p_k$.

(b) With predicted probability on the horizontal axis, over-confidence at the extremes gives a curve that is flatter than the diagonal: it sits above the diagonal for low $\bar p_k$ (predicted 0.05 maps to actual 0.15), crosses it somewhere in between, and sits below it for high $\bar p_k$ (predicted 0.95 maps to actual 0.85). Between those two points the average slope is $(0.85 - 0.15) / (0.95 - 0.05) \approx 0.78$, less than 1.

(c) Standard fixes: (i) Platt scaling: fit a second 1D logistic regression that maps the model's raw logits to calibrated probabilities. (ii) Isotonic regression: fit a monotone step function from raw to calibrated probabilities. (iii) Temperature scaling for neural networks: divide logits by a learned scalar $T$ to soften over-confident outputs. (iv) Add regularization during training; over-confidence often comes from unconstrained $\lVert \beta \rVert$.
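
The bucketing diagnostic itself is short enough to sketch. The function name `reliability_table` and the synthetic over-confident predictions below are assumptions made for illustration; the labels are drawn from a softer probability than the one the "model" reports.

```python
import numpy as np

def reliability_table(p_hat, y, n_bins=10):
    """Per-bin (mean predicted probability, empirical positive rate, count)."""
    p_hat, y = np.asarray(p_hat, dtype=float), np.asarray(y, dtype=float)
    # Equal-width bins on [0, 1]; clip so p_hat == 1.0 lands in the last bin.
    idx = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for k in range(n_bins):
        mask = idx == k
        if mask.any():                       # skip empty bins
            rows.append((p_hat[mask].mean(), y[mask].mean(), int(mask.sum())))
    return rows

# Synthetic, deliberately over-confident predictions (illustrative only).
rng = np.random.default_rng(0)
logits = rng.normal(scale=2.0, size=1000)
p_reported = 1.0 / (1.0 + np.exp(-logits))        # what the model claims
p_true = 1.0 / (1.0 + np.exp(-logits / 2.0))      # what actually generates y
y = (rng.uniform(size=1000) < p_true).astype(float)

for mean_p, frac_pos, count in reliability_table(p_reported, y):
    print(f"{mean_p:.2f}  {frac_pos:.2f}  {count}")
```

Plotting the second column against the first gives the reliability curve; for a well-calibrated model the two columns should agree up to sampling noise, which is larger in bins with small counts.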

P.7 softmax / multi-class extension

Extend binary logistic regression to $K$ classes. The model is $P(y = k \mid x) = \exp(x^\top \beta_k) / \sum_{j=1}^{K} \exp(x^\top \beta_j)$ with one weight vector $\beta_k$ per class.

(a) Write down the negative log-likelihood for one labelled point $(x, c)$ where $c \in \{1, \dots, K\}$, encoded as a one-hot vector $y \in \{0, 1\}^K$ with $y_c = 1$.

(b) Derive the gradient with respect to $\beta_k$ for a single labelled point. Show it has the form $(p_k - y_k)\, x$.

(c) Conclude that the matrix gradient over a dataset has the same structure as binary logistic regression, with $\frac{1}{n} X^\top (P - Y)$ in place of $\frac{1}{n} X^\top (p - y)$.

Find the analogue: the binary case takes $\sigma(z) = e^{z} / (e^{z} + e^{0})$, which is exactly the softmax over the two logits $(z, 0)$. The structural pattern repeats verbatim.

Answer.

(a) Negative log-likelihood for one point: $\ell = -\sum_{k=1}^{K} y_k \log p_k$ with $p_k = e^{z_k} / \sum_j e^{z_j}$ and $z_k = x^\top \beta_k$.

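(b) Differentiate with respect to the logit $z_k$, using $\log p_j = z_j - \log \sum_m e^{z_m}$:

$$\frac{\partial \ell}{\partial z_k} = -y_k + p_k \sum_j y_j$$

Since only one $y_j$ is 1 (call it $y_c$), $\sum_j y_j = 1$, and the chain rule through $z_k = x^\top \beta_k$ gives

$$\nabla_{\beta_k} \ell = (p_k - y_k)\, x$$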

(c) Stack across the dataset: with $P$ ($n \times K$) holding predicted probabilities and $Y$ the one-hot label matrix, the gradient with respect to the weight matrix $B = [\beta_1, \dots, \beta_K]$ is $\frac{1}{n} X^\top (P - Y)$. Exact same shape as binary; softmax is binary logistic regression generalized cleanly. This is also the gradient at the output layer of a classification neural network trained with softmax cross-entropy.
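
The matrix form of the gradient can be checked numerically. This sketch compares $\frac{1}{n} X^\top (P - Y)$ against central finite differences on a small random problem; the data, shapes, and function names are illustrative assumptions.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # shift rows for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def nll(B, X, Y):
    """Mean negative log-likelihood; B is (features x classes), Y is one-hot."""
    return -np.mean(np.sum(Y * np.log(softmax(X @ B)), axis=1))

def grad(B, X, Y):
    """Analytic gradient (1/n) X^T (P - Y)."""
    return X.T @ (softmax(X @ B) - Y) / X.shape[0]

# Small random problem (illustrative data, not from the exercises).
rng = np.random.default_rng(0)
n, d, K = 30, 3, 4
X = rng.normal(size=(n, d))
Y = np.eye(K)[rng.integers(0, K, size=n)]    # one-hot labels
B = rng.normal(size=(d, K))

# Central finite differences, one entry of B at a time.
eps = 1e-6
num = np.zeros_like(B)
for i in range(d):
    for k in range(K):
        step = np.zeros_like(B)
        step[i, k] = eps
        num[i, k] = (nll(B + step, X, Y) - nll(B - step, X, Y)) / (2 * eps)

max_gap = np.max(np.abs(num - grad(B, X, Y)))
print("largest gap between numeric and analytic gradient:", max_gap)
```

A gap near floating-point noise is what agreement looks like; a gap of order one would point to a sign or indexing error in the derivation.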


Check problems

Two problems designed to resist pattern-matching against the practice set. Neither can be solved by recognizing it as similar to one of the seven above.

Check 1 articulation

On perfectly separable data (every class-1 point has higher $x$ than every class-0 point), the maximum-likelihood estimator for logistic regression has no finite optimum: $\lVert \beta \rVert \to \infty$ as gradient descent runs.

In 150–250 words, explain why. Your explanation should describe the geometry of the cross-entropy loss surface on separable data (what happens to the loss as $\beta$ is rescaled by a large positive constant?), distinguish "the data is separable" from "the model has converged," and explain why this matters in practice. A reader who has just read the logistic-regression concept page should follow your explanation.

Solution sketch.

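Cross-entropy on one data point is $-\log p_i$ when $y_i = 1$ and $-\log(1 - p_i)$ when $y_i = 0$. On perfectly separable data, there exists a direction $u$ such that $x_i^\top u > 0$ when $y_i = 1$ and $x_i^\top u < 0$ when $y_i = 0$, for every $i$.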

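Consider the rescaled estimator $\beta = c\,u$ for $c > 0$. Every $y_i = 1$ point sees $x_i^\top \beta = c\, x_i^\top u \to +\infty$, so $p_i \to 1$, so its contribution to the loss $-\log p_i \to 0$. Same story for $y_i = 0$ points with $p_i$ going to 0. The total cross-entropy $\to 0$, monotonically, as $c \to \infty$.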

So the loss has no minimizer at finite $\beta$: every increase in $c$ along the separating direction strictly decreases the loss. Gradient descent reflects this: $\lVert \beta \rVert$ grows without bound, and the predicted probabilities sharpen toward step functions.

"Convergence" in the optimizer's sense (gradient norm small) and "the data is separable" are different things: the gradient norm does shrink (toward zero), but only because the sigmoid saturates, not because $\beta$ has stabilized. In practice, this is why production code uses regularization (which adds a penalty such as $\lambda \lVert \beta \rVert^2$ to the loss and gives a finite minimizer) or simply caps the number of gradient steps.

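The runaway behaviour described in the Check 1 solution sketch is easy to observe. A minimal demonstration on a separable 1D toy set (illustrative data, not one of the exercise datasets), recording the norm of β, the loss, and the gradient norm at a few checkpoints:

```python
import numpy as np

# Perfectly separable 1D toy data: every class-1 x exceeds every class-0 x.
x = np.array([-3, -2, -1, 1, 2, 3], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta = np.zeros(2)
trace = {}                                    # step -> (||beta||, loss, ||grad||)
for step in range(1, 10001):
    beta -= 1.0 * X.T @ (sigmoid(X @ beta) - y) / len(y)
    if step in (10, 100, 1000, 10000):
        p = sigmoid(X @ beta)
        p_true = np.where(y == 1, p, 1.0 - p)   # probability given to the true label
        grad = X.T @ (p - y) / len(y)
        trace[step] = (np.linalg.norm(beta), -np.mean(np.log(p_true)), np.linalg.norm(grad))
        print(step, trace[step])
```

Across the checkpoints the norm of β keeps rising while both the loss and the gradient norm keep falling, so a stopping rule based on a small gradient would fire without the estimate ever settling.
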
Check 2 derivation

Derive the Hessian of the cross-entropy loss $L(\beta) = -\frac{1}{n} \sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$. Express it in matrix form using the design matrix $X$, the predicted probabilities $p$, and a diagonal matrix.

Conclude that the Hessian is positive semi-definite, and therefore the cross-entropy loss is convex. State explicitly which property of the sigmoid you used.

Solution sketch.

Start from the gradient: $\nabla L = \frac{1}{n} X^\top (p - y)$ with $p_i = \sigma(x_i^\top \beta)$ componentwise.

Differentiate again. $\partial p_i / \partial \beta = \sigma'(x_i^\top \beta)\, x_i = p_i (1 - p_i)\, x_i$.

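So $\nabla^2 L = \frac{1}{n} \sum_i p_i (1 - p_i)\, x_i x_i^\top$. In matrix form:

$$H = \frac{1}{n} X^\top S X, \qquad S = \mathrm{diag}\big(p_1 (1 - p_1), \dots, p_n (1 - p_n)\big)$$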

Positive semi-definiteness. For any vector $v$:

$$v^\top H v = \frac{1}{n} \sum_i p_i (1 - p_i)\, (x_i^\top v)^2$$

Each term is non-negative because $0 < p_i < 1$ means $p_i (1 - p_i) > 0$. The sum is therefore $\ge 0$ for every $v$, so the Hessian is PSD and the loss is convex.

The sigmoid property used: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which is always positive. This is the workhorse identity that makes the gradient clean and the loss convex; both properties trace to the same simple derivative formula.

Convexity is why gradient descent has no local-minimum traps on logistic regression. It also enables Newton-style optimizers (the Hessian is well-defined and PSD, and positive definite when $X$ has full column rank, so the Newton step is a descent direction). This is what IRLS ("iteratively reweighted least squares") actually computes.
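
As a sketch of how the Hessian gets used, the snippet below runs a few plain Newton iterations on the six points from the worked example and inspects the eigenvalues of the final Hessian. It is a bare illustration under those assumptions (no line search or damping, and `grad_and_hessian` is an illustrative name), not a production IRLS routine.

```python
import numpy as np

def grad_and_hessian(beta, X, y):
    """Gradient (1/n) X^T (p - y) and Hessian (1/n) X^T S X of the mean cross-entropy."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    s = p * (1.0 - p)                         # diagonal of S, each entry in (0, 1/4]
    n = len(y)
    return X.T @ (p - y) / n, (X.T * s) @ X / n

# Six points from the worked example.
x = np.array([-3, -2, -1, 1, 2, 3], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)
for _ in range(10):                           # plain Newton iterations
    g, H = grad_and_hessian(beta, X, y)
    beta -= np.linalg.solve(H, g)             # Newton step: solve H d = g

g, H = grad_and_hessian(beta, X, y)
eigenvalues = np.linalg.eigvalsh(H)
print("beta        :", beta)                  # compare with the GD estimate
print("eigenvalues :", eigenvalues)           # non-negative, as the PSD argument predicts
```

Ten Newton steps should land on essentially the same estimate that gradient descent needed thousands of iterations to reach, which is the practical payoff of having a well-behaved Hessian.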