Recognize a problem as binary classification, model P(y=1|x) = σ(xᵀβ), fit by minimizing cross-entropy via gradient descent, then read decision boundary, probability calibration, and odds-ratio interpretation off the learned β.
1 worked example · 7 practice problems · 2 check problems
Worked example: 1D logistic regression by gradient descent
Problem. Fit P(y=1∣x)=σ(β0+β1x) to the six points (−3,0),(−2,0),(−1,1),(1,0),(2,1),(3,1). Compute the gradient at the initialization β=(0,0), take one explicit gradient-descent step with learning rate η=1, then state the converged estimate. Verify with code.
Diagnosis. Linear regression had a closed form; logistic regression doesn't. Set up the cross-entropy loss, compute its gradient, descend until the gradient is small. The gradient has the clean form ∇L(β)=X⊤(σ(Xβ)−y)/n, which is what makes one explicit step computable by hand.
Predict before reading on: eyeball the data before doing any algebra. The points at x=±1 have "wrong" labels — what should this do to β1 compared with a cleanly-separated dataset? Should it grow or shrink?
Setup. Design matrix X has 6 rows and 2 columns (intercept + feature):
$$X=\begin{pmatrix}1&-3\\1&-2\\1&-1\\1&1\\1&2\\1&3\end{pmatrix},\qquad y=\begin{pmatrix}0\\0\\1\\0\\1\\1\end{pmatrix}$$
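Gradient at β=0. At zero, Xβ=0 and σ(0)=1/2 for every row. So σ(Xβ)−y has entries 1/2−yi, a vector of ±1/2 entries, and summing it against the two columns of X gives:
$$\nabla L(0)=\frac{1}{6}X^\top\left(\tfrac12\mathbf{1}-y\right)=\frac{1}{6}\begin{pmatrix}0\\-4\end{pmatrix}=\begin{pmatrix}0\\-2/3\end{pmatrix}$$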
One GD step. With η=1: β(1)=β(0)−η∇L=(0,+2/3). The intercept stays at zero — by the symmetry of the data under x→−x,y→1−y, it stays at zero forever, and only β1 moves.
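The hand computation can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the names `sigmoid`, `gradient`, and `beta_after_one_step` are illustrative choices, not something the page defines.

```python
import math

# The six data points from the problem statement.
xs = [-3, -2, -1, 1, 2, 3]
ys = [0, 0, 1, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(b0, b1):
    """Mean cross-entropy gradient X^T (sigma(X beta) - y) / n for the 1D model."""
    n = len(xs)
    resid = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
    g0 = sum(resid) / n                              # intercept column (all ones)
    g1 = sum(x * r for x, r in zip(xs, resid)) / n   # feature column
    return g0, g1

g0, g1 = gradient(0.0, 0.0)
eta = 1.0
beta_after_one_step = (0.0 - eta * g0, 0.0 - eta * g1)
print("gradient at zero:", (g0, g1))
print("beta after one step:", beta_after_one_step)
```

The intercept component comes out as zero and the slope component as −2/3, so the step moves only β1.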
Predict before reading on: will the converged β1 be larger or smaller than the value 2/3 reached after one step? Why? Think about the gradient at the new point β=(0,2/3).
Converged estimate. Running the iteration to convergence gives β^=(0,0.732). The intercept is exactly zero (the symmetry holds), and the slope is finite (not infinite) because the data is not separable: the (−1,1) and (1,0) points sit on the "wrong" side of the boundary.
Predicted probabilities at the data points are σ(β^1x)=(0.10,0.19,0.32,0.68,0.81,0.90). The "wrong" points get 0.32 and 0.68 — the model gives them partial credit rather than committing.
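The problem asks for a check in code. The sketch below assumes plain gradient descent from β=(0,0) with η=1; the stopping tolerance, the step cap, and the function name `fit_gd` are assumptions made for illustration.

```python
import math

xs = [-3, -2, -1, 1, 2, 3]
ys = [0, 0, 1, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_gd(xs, ys, eta=1.0, tol=1e-10, max_steps=100_000):
    """Gradient descent on the mean cross-entropy, stopping when the gradient is tiny."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(max_steps):
        resid = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(resid) / n
        g1 = sum(x * r for x, r in zip(xs, resid)) / n
        if math.hypot(g0, g1) < tol:
            break
        b0 -= eta * g0
        b1 -= eta * g1
    return b0, b1

b0, b1 = fit_gd(xs, ys)
probs = [round(sigmoid(b0 + b1 * x), 2) for x in xs]
print("beta_hat:", (round(b0, 3), round(b1, 3)))
print("fitted probabilities:", probs)
```

Run as written, the intercept should stay at zero up to rounding and the slope should settle near 0.73, in line with the estimate and probabilities quoted above.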
Articulate: state in one sentence what cross-entropy is actually measuring, and what it would mean for it to be zero on a dataset.
Practice problems
Seven problems, seven different surfaces. Some are pure "set up the design matrix and fit"; others read the fitted model (boundary, probabilities, odds); two are derivations that extend the technique (class imbalance, softmax).
P.1 medical risk scoring, reading a fitted model
A clinical biomarker is measured on eight patients with binary disease outcome d∈{0,1}:
A logistic-regression fit gives β^=(−5.77,1.28) (intercept, slope on b).
Without re-fitting:
(a) Compute the predicted probability of disease at b=5.
(b) Find the biomarker level at which the model is exactly undecided (P=0.5).
(c) State the odds-ratio interpretation of the slope: by what factor do the odds of disease change per unit increase in biomarker?
Find the analogue:
this problem reads a fitted model. The worked example was about obtaining the model; this is about using it. Both lean on the same formula σ(β0+β1x).
Answer.
(a) σ(−5.77+1.28⋅5)=σ(0.63)=1/(1+e^(−0.63))≈0.65.
(b) P=0.5⇔β0+β1b=0⇔b=−β0/β1=5.77/1.28≈4.5. The model is undecided at biomarker level 4.5.
(c) The model says log(P/(1−P))=β0+β1b. A unit increase in b raises the log-odds by β1=1.28, multiplying the odds by e^1.28≈3.60. So each additional unit of biomarker multiplies the disease odds by a factor of about 3.6. That's the natural-language sentence epidemiologists actually write into clinical papers: odds ratios are what logistic regression gives you for free.
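The three readings can be reproduced directly from the reported coefficients. A minimal sketch, with variable names chosen here for illustration:

```python
import math

# Fitted coefficients as reported in the problem (intercept, slope on b).
b0, b1 = -5.77, 1.28

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

p_at_5 = sigmoid(b0 + b1 * 5)   # (a) predicted probability at b = 5
boundary = -b0 / b1             # (b) biomarker level where P = 0.5
odds_ratio = math.exp(b1)       # (c) odds multiplier per unit of b

def odds(b):
    p = sigmoid(b0 + b1 * b)
    return p / (1.0 - p)

# The ratio of odds one unit apart equals exp(b1) at any starting level.
ratio_check = odds(4.0) / odds(3.0)

print(round(p_at_5, 2), round(boundary, 2), round(odds_ratio, 2), round(ratio_check, 2))
```

The last value illustrates that the odds ratio does not depend on where along b the unit increase is taken.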
P.2 marketing click prediction, multi-feature fit
Ten ad impressions are labelled with whether the user clicked:
Fit P(click∣age,imps)=σ(β0+β1age+β2imps) via gradient descent (standardize features first). Report β^ and the in-sample accuracy at the 0.5 threshold. Which feature does the model use to predict not clicking?
Find the analogue:
same gradient as the worked example, just with three columns in the design matrix instead of two and ten rows instead of six. Standardize features before fitting so the learning rate works on both at once.
Answer.
Fit returns approximately β0≈−0.3, β1≈−0.7 (on standardized age), β2≈−2.0 (on standardized imps). In-sample accuracy 9/10.
The number of impressions has the largest-magnitude negative coefficient — the more impressions of the ad the user has already seen, the less likely they are to click. (Banner blindness.) Age also leans negative but is dominated by imps.
Important caveat with n=10: these coefficients are noisy. The point of the exercise is the mechanics; in production you'd never trust a 10-sample fit.
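The mechanics (standardize, then run the same gradient with three columns) can be sketched as follows. The ten rows below are placeholders invented for illustration, not the problem's data, so the coefficients this prints will not match the answer above; `standardize` and `fit` are hypothetical helper names.

```python
import math

# Placeholder rows (age, impressions, clicked), invented for illustration only.
rows = [
    (22, 1, 1), (25, 2, 1), (31, 1, 1), (35, 3, 1), (41, 2, 1),
    (28, 6, 0), (33, 7, 0), (45, 5, 0), (52, 8, 0), (38, 2, 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def standardize(col):
    """Shift to mean 0 and scale to standard deviation 1."""
    mean = sum(col) / len(col)
    std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / std for v in col]

age = standardize([r[0] for r in rows])
imps = standardize([r[1] for r in rows])
y = [r[2] for r in rows]
X = [[1.0, a, m] for a, m in zip(age, imps)]  # intercept, age, imps

def fit(X, y, eta=0.5, steps=5000):
    """Gradient descent with the gradient X^T (sigma(X beta) - y) / n."""
    beta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        resid = [sigmoid(sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(row[j] * r for row, r in zip(X, resid)) / n
                for j in range(len(beta))]
        beta = [b - eta * g for b, g in zip(beta, grad)]
    return beta

beta = fit(X, y)
preds = [1 if sum(b * v for b, v in zip(beta, row)) >= 0 else 0 for row in X]
accuracy = sum(p == yi for p, yi in zip(preds, y)) / len(y)
print([round(b, 2) for b in beta], accuracy)
```

Because both features are on the same scale after standardization, a single learning rate works for all three coefficients, and their magnitudes can be compared directly.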
P.3 decision-boundary geometry in 2D
A 2D logistic-regression fit produces β^=(1,2,−3) for P(y=1∣x1,x2)=σ(β0+β1x1+β2x2).
(a) Write the equation of the decision boundary in the (x1,x2)-plane.
(b) Give a unit vector pointing in the direction of increasing predicted probability.
(c) Compute the perpendicular distance from the origin to the boundary.
Find the analogue:
logistic regression's decision boundary is the affine hyperplane β0+β1x1+β2x2=0. Everything about this problem is geometry on that hyperplane.
Answer.
(a) Boundary equation: 1+2x1−3x2=0, or rearranged x2=(1+2x1)/3.
(b) The gradient of β0+β1x1+β2x2 in the (x1,x2)-plane is (2,−3). This is the direction of increasing log-odds, hence of increasing probability. Normalize: unit vector =(2,−3)/√(4+9)=(2/√13,−3/√13)≈(0.555,−0.832).
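(c) The perpendicular distance from the origin to the line ax1+bx2+c=0 is |c|/√(a²+b²). Here a=2, b=−3, c=1, so the distance is 1/√13≈0.277. The signed version, measured along the unit normal from (b), is c/√(a²+b²)=+0.277: the linear score at the origin is β0=1>0, so the origin is on the y=1 side, with P=σ(1)≈0.73. The closest boundary point is (−2/13,3/13), reached by moving 0.277 against the normal.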
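The geometry can be checked numerically. A minimal sketch using the coefficients from the problem; the variable names are illustrative, and the last line evaluates the model at the origin as a check on which side of the boundary the origin falls.

```python
import math

# Coefficients from the problem: intercept, weight on x1, weight on x2.
b0, b1, b2 = 1.0, 2.0, -3.0

norm = math.hypot(b1, b2)                 # length of the normal vector (b1, b2)
unit_normal = (b1 / norm, b2 / norm)      # direction of increasing probability
distance = abs(b0) / norm                 # perpendicular distance origin -> boundary

# Closest boundary point to the origin: move from the origin along the normal
# by the signed amount -b0 / norm.
closest = (-b0 * b1 / norm ** 2, -b0 * b2 / norm ** 2)
score_at_closest = b0 + b1 * closest[0] + b2 * closest[1]   # should be 0

p_origin = 1.0 / (1.0 + math.exp(-b0))    # predicted probability at the origin

print(unit_normal, distance, closest, p_origin)
```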
P.4 polynomial features for circular data
Two classes are arranged as concentric rings in 2D: class 0 inside the unit circle, class 1 between radius 2 and 3. Linear logistic regression on the raw features (x1,x2) fails because no straight line separates the two classes.
(a) Why? Explain the failure in one sentence using the decision-boundary geometry from P.3.
(b) Augment the design matrix to [1,x1,x2,x1²,x2²] and fit. Report linear vs polynomial accuracy. Why does this work?
(c) What does the augmented boundary equation look like in terms of (x1,x2)?
Find the analogue:
the page noted: "logistic regression can only learn linear boundaries — standard fix is hand-engineer nonlinear features." This problem is that fix.
Answer.
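(a) The classes are separated by the curve x1²+x2²=1.5² (a circle), which is not a straight line. A linear-in-(x1,x2) model can only carve out a half-plane, and any half-plane that contains the inner disk also contains at least half of the ring around it, so every choice of β misclassifies either part of the disk or at least half of the ring.
(b) Why it works: the boundary is now β0+β1x1+β2x2+β3x1²+β4x2²=0, which is a conic section. For roughly circular data the fit lands on β3≈β4>0 and β1,β2≈0, recovering β3(x1²+x2²)=−β0, a circle of radius √(−β0/β3). Because the rings are separable in the augmented features, ∥β∥ keeps growing without regularization (the situation in Check 1), but the ratio −β0/β3 still fixes the circle.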
(c) The model is still linear in parameters (it's the same OLS-style design matrix, just with extra columns). The hidden trick is that polynomial features are a poor person's kernel — neural networks generalize this by learning the nonlinear features rather than hand-engineering them.
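The comparison can be sketched on a small deterministic sample. Everything about the sample is an assumption made for illustration: 12 evenly spaced points on each of the radii 0.3, 0.6, 0.9 (class 0) and 2.0, 2.5, 3.0 (class 1), plain gradient descent with η=0.1 for 2000 steps, and hypothetical helper names `ring`, `fit`, and `accuracy`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ring(radius, k=12):
    """k evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / k),
             radius * math.sin(2 * math.pi * i / k)) for i in range(k)]

# Assumed sample: class 0 inside the unit circle, class 1 between radius 2 and 3.
points, labels = [], []
for radius, label in [(0.3, 0), (0.6, 0), (0.9, 0), (2.0, 1), (2.5, 1), (3.0, 1)]:
    for p in ring(radius):
        points.append(p)
        labels.append(label)

def fit(X, y, eta=0.1, steps=2000):
    """Gradient descent with the gradient X^T (sigma(X beta) - y) / n."""
    beta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        resid = [sigmoid(sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(row[j] * r for row, r in zip(X, resid)) / n
                for j in range(len(beta))]
        beta = [b - eta * g for b, g in zip(beta, grad)]
    return beta

def accuracy(X, y, beta):
    """Fraction of rows whose linear score has the sign matching the label."""
    hits = sum((sum(b * v for b, v in zip(beta, row)) >= 0) == (yi == 1)
               for row, yi in zip(X, y))
    return hits / len(y)

X_lin = [[1.0, a, b] for a, b in points]
X_poly = [[1.0, a, b, a * a, b * b] for a, b in points]
beta_lin, beta_poly = fit(X_lin, labels), fit(X_poly, labels)
acc_lin = accuracy(X_lin, labels, beta_lin)
acc_poly = accuracy(X_poly, labels, beta_poly)

# Radius of the learned circle, using the average of the two squared-term weights.
radius = math.sqrt(-beta_poly[0] / ((beta_poly[3] + beta_poly[4]) / 2))
print("linear:", acc_lin, "polynomial:", acc_poly, "radius:", round(radius, 2))
```

On this symmetric sample the linear fit has nothing to latch onto and stays near chance, while the augmented fit should place its circle somewhere in the gap between the two classes.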
P.5 class imbalance via reweighted cross-entropy
A fraud-detection dataset has 99% non-fraud (class 0) and 1% fraud (class 1). Logistic regression trained on the raw cross-entropy −∑[yilogp^i+(1−yi)log(1−p^i)] is dominated by the majority class — the model can predict "always 0" and hit 99% accuracy with the loss barely complaining.
Derive the class-reweighted cross-entropy where positive examples carry weight w+ and negative examples carry weight w−, and write down the resulting gradient. Show that setting w+/w−=(fraction negatives)/(fraction positives) gives the same loss as if classes were balanced.
Find the analogue:
this is the same gradient derivation as the worked example, applied to a weighted loss instead of the uniform loss. The structure is unchanged; the algebra just carries weights.
Answer.
Weighted loss:
$$L_w(\beta)=-\frac{1}{n}\sum_i\Big[w_+\,y_i\log\hat p_i+w_-\,(1-y_i)\log(1-\hat p_i)\Big],\qquad \hat p_i=\sigma(x_i^\top\beta)$$
Differentiating and using σ′(z)=σ(z)(1−σ(z)), the contribution from row i is [w+yi(p^i−1)+w−(1−yi)p^i]xi=(w+yi+w−(1−yi))(p^i−yi)xi.
So the gradient is ∇Lw(β)=(1/n)X⊤W(p^−y) with W=diag(w+yi+w−(1−yi)): the unweighted form with each row rescaled by its class weight.
Setting w+=1/p+ and w−=1/p−, where p+ and p− are the fractions of positive and negative examples (inverse class frequencies), gives w+/w−=p−/p+. The positives then carry total weight n·p+·(1/p+)=n and the negatives n·p−·(1/p−)=n, so both classes contribute equally and the loss "sees" a balanced dataset. Practical equivalent: oversample the minority class to 50%. Both fixes solve the same problem with the same algebra under the hood.
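A sketch of the weighted gradient, with a finite-difference check against the weighted loss and a check that inverse-frequency weights balance the two classes. The toy sample (one positive among ten rows), the test point `beta`, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_loss(X, y, beta, w_pos, w_neg):
    """Class-weighted mean cross-entropy."""
    total = 0.0
    for row, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, row)))
        total -= w_pos * yi * math.log(p) + w_neg * (1 - yi) * math.log(1 - p)
    return total / len(X)

def weighted_gradient(X, y, beta, w_pos, w_neg):
    """(1/n) X^T W (p_hat - y) with W = diag(w_pos*y_i + w_neg*(1 - y_i))."""
    grad = [0.0] * len(beta)
    for row, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, row)))
        w = w_pos * yi + w_neg * (1 - yi)
        for j, v in enumerate(row):
            grad[j] += w * (p - yi) * v / len(X)
    return grad

# Hypothetical toy sample: 1 positive among 10 rows.
X = [[1.0, float(i)] for i in range(10)]
y = [0] * 9 + [1]
frac_pos = sum(y) / len(y)
w_pos, w_neg = 1.0 / frac_pos, 1.0 / (1.0 - frac_pos)

beta = [0.1, -0.2]
grad = weighted_gradient(X, y, beta, w_pos, w_neg)

# Central finite differences of the weighted loss, one coordinate at a time.
eps = 1e-6
numeric = []
for j in range(len(beta)):
    up, down = list(beta), list(beta)
    up[j] += eps
    down[j] -= eps
    numeric.append((weighted_loss(X, y, up, w_pos, w_neg)
                    - weighted_loss(X, y, down, w_pos, w_neg)) / (2 * eps))

total_pos = sum(w_pos for yi in y if yi == 1)   # total weight on positives
total_neg = sum(w_neg for yi in y if yi == 0)   # total weight on negatives
print(grad, numeric, total_pos, total_neg)
```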
P.6 calibration diagnostics
You've trained a logistic regression and have 1000 held-out predictions p^i with true labels yi. Bucket the predictions into 10 equal-width bins (0–0.1, 0.1–0.2, …, 0.9–1.0) and compute, for each bucket, the average predicted probability and the empirical fraction of positives.
(a) What should the plot of (mean predicted probability) vs (empirical fraction positive) look like if the model is well-calibrated?
(b) Sketch what the plot looks like for a model that is over-confident at the extremes — predicting 0.05 when the true rate is 0.15, and 0.95 when the true rate is 0.85.
(c) Name one fix for an over-confident logistic regression.
Find the analogue:
the page noted: "probabilities are calibrated only if the linearity assumption holds." This is the diagnostic that surfaces miscalibration.
Answer.
(a) Perfect calibration is the 45° diagonal: bucket b has mean predicted probability p^≈0.05+0.1b and empirical rate equal to that same number. A well-calibrated model lies along y=x.
(b) Put mean predicted probability on the horizontal axis and empirical rate on the vertical axis. Over-confidence at the extremes gives a curve that is flatter than the diagonal: predicted 0.05 maps to actual 0.15, so the curve sits above the diagonal for low p^, and predicted 0.95 maps to actual 0.85, so it sits below the diagonal for high p^. Between those two points the slope is 0.70/0.90≈0.78 instead of 1, and the curve crosses the diagonal near the middle.
(c) Standard fixes: (i) Platt scaling — fit a second 1D logistic regression that maps the model's raw logits to calibrated probabilities. (ii) Isotonic regression — fit a monotone step function from raw to calibrated probabilities. (iii) Temperature scaling for neural networks — divide logits by a learned scalar T>1 to soften over-confident outputs. (iv) Add regularization during training; over-confidence often comes from unconstrained ∥β∥.
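The bucketing step can be sketched as a small function. The name `reliability_table` and the toy predictions below are assumptions made for illustration; the toy data simply reproduces the over-confident pattern from (b), with 20 predictions at 0.05 and 20 at 0.95.

```python
def reliability_table(probs, labels, n_bins=10):
    """Per non-empty bin: (bin index, mean predicted prob, fraction positive, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        table.append((idx, mean_pred, frac_pos, len(members)))
    return table

# Toy over-confident model: says 0.05 where the true rate is 0.15,
# and 0.95 where the true rate is 0.85.
probs = [0.05] * 20 + [0.95] * 20
labels = [1] * 3 + [0] * 17 + [1] * 17 + [0] * 3

table = reliability_table(probs, labels)
for idx, mean_pred, frac_pos, count in table:
    print(idx, round(mean_pred, 2), round(frac_pos, 2), count)
```

Plotting the second column against the third gives the reliability curve; a well-calibrated model would have the two columns agree in every bin.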
P.7 softmax / multi-class extension
Extend binary logistic regression to K classes. The model is P(y=k∣x)=exp(x⊤βk)/∑_{j=1}^{K} exp(x⊤βj) with one weight vector βk∈R^p per class.
(a) Write down the negative log-likelihood for one labelled point (x,y) where y∈{1,…,K}, encoded as a one-hot vector yk=1[y=k].
(b) Derive the gradient with respect to βk for a single labelled point. Show it has the form (p^k−yk)x.
(c) Conclude that the matrix gradient over a dataset has the same X⊤(P^−Y) structure as binary logistic regression, with P^,Y∈Rn×K.
Find the analogue:
the binary case takes σ(z)=e^z/(1+e^z), which is exactly the K=2 softmax with one class's score fixed at 0. The structural pattern X⊤(p^−y) repeats verbatim.
Answer.
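(a) Negative log-likelihood for one point: L=−logP(y∣x)=−∑kyklogp^k with p^k=exp(x⊤βk)/Z and Z=∑jexp(x⊤βj).
(b) Differentiate logp^k=x⊤βk−logZ with respect to βk′. Since ∂logZ/∂βk′=p^k′x:
$$\frac{\partial\log\hat p_k}{\partial\beta_{k'}}=\left(\mathbf{1}[k=k']-\hat p_{k'}\right)x$$
Since only one yk is 1 (call its index y∗), ∂L/∂βk′=−(1[y∗=k′]−p^k′)x=(p^k′−yk′)x ✓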
(c) Stack across the dataset: with P^∈Rn×K holding predicted probabilities and Y∈Rn×K the one-hot label matrix, the gradient with respect to the weight matrix B∈Rp×K is ∇BL=X⊤(P^−Y)/n. Exact same shape as binary; softmax is binary LogR generalized cleanly. This is also the gradient at the output layer of every classification neural network.
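The single-point formula (p^k−yk)x can be checked against finite differences. A minimal sketch: the weight vectors, the input, and the label are arbitrary values chosen for illustration, and classes are indexed from 0 here rather than from 1 as in the text.

```python
import math

def softmax(scores):
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def scores_for(B, x):
    return [sum(b * v for b, v in zip(beta_k, x)) for beta_k in B]

def nll(B, x, y):
    """Negative log-likelihood of class index y; B is a list of K weight vectors."""
    return -math.log(softmax(scores_for(B, x))[y])

def grad(B, x, y):
    """Analytic gradient: row k is (p_hat_k - y_k) * x."""
    p = softmax(scores_for(B, x))
    return [[(p[k] - (1.0 if k == y else 0.0)) * v for v in x] for k in range(len(B))]

# Arbitrary illustrative values: K = 3 classes, p = 2 features.
B = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
x = [1.0, 2.0]
y = 1

analytic = grad(B, x, y)

# Central finite differences, one weight at a time.
eps = 1e-6
max_err = 0.0
for k in range(len(B)):
    for j in range(len(x)):
        up = [list(row) for row in B]
        down = [list(row) for row in B]
        up[k][j] += eps
        down[k][j] -= eps
        numeric = (nll(up, x, y) - nll(down, x, y)) / (2 * eps)
        max_err = max(max_err, abs(numeric - analytic[k][j]))

# Because sum_k (p_hat_k - y_k) = 0, the gradient rows sum to the zero vector.
column_sums = [sum(analytic[k][j] for k in range(len(B))) for j in range(len(x))]
print(max_err, column_sums)
```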
Check problems
Two problems designed to resist pattern-matching against the practice set. Neither can be solved by recognizing it as similar to one of the seven above.
Check 1 articulation
On perfectly separable data (for some β, every class-1 point has higher x⊤β than every class-0 point), the maximum-likelihood estimator for logistic regression has no finite optimum: ∥β^∥→∞ as gradient descent runs.
In 150–250 words, explain why. Your explanation should describe the geometry of the cross-entropy loss surface on separable data (what happens to the loss as β is rescaled by a large positive constant?), distinguish "the data is separable" from "the model has converged," and explain why this matters in practice. A reader who has just read the logistic-regression concept page should follow your explanation.
Solution sketch.
Cross-entropy on one data point is −logσ(x⊤β) when y=1 and −logσ(−x⊤β) when y=0. On perfectly separable data, there exists a direction β⋆ such that yi=1⇒xi⊤β⋆>0 and yi=0⇒xi⊤β⋆<0 for every i.
Consider the rescaled estimator β=cβ⋆ for c→+∞. Every yi=1 point sees xi⊤β→+∞, so σ(xi⊤β)→1, so its contribution to the loss →0. Same story for yi=0 points with σ going to 0. The total cross-entropy →0, monotonically, as c→∞.
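The rescaling argument can be checked numerically. This sketch assumes a four-point separable 1D sample with a slope-only model, so β⋆=1 separates the classes; the data, the step size, and the penalty strength `lam` are all assumptions made for illustration. It also runs gradient descent with and without a small L2 penalty (a common remedy) to contrast a growing slope with one that settles.

```python
import math

# Assumed separable 1D sample, no intercept: sign(x) determines the label.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def loss(b):
    """Mean cross-entropy for P(y=1|x) = sigma(b*x), written as log(1 + exp(-s))."""
    total = 0.0
    for x, y in zip(xs, ys):
        s = b * x if y == 1 else -b * x      # signed score, positive when correct
        total += math.log1p(math.exp(-s))
    return total / len(xs)

def gradient(b):
    return sum(x * (1.0 / (1.0 + math.exp(-b * x)) - y)
               for x, y in zip(xs, ys)) / len(xs)

def run_gd(steps, lam=0.0, eta=0.5):
    """Gradient descent on loss(b) + lam * b**2, starting from b = 0."""
    b = 0.0
    for _ in range(steps):
        b -= eta * (gradient(b) + 2.0 * lam * b)
    return b

scaled_losses = [loss(c) for c in (1.0, 10.0, 100.0)]    # loss along c * beta_star
slopes = [run_gd(t) for t in (100, 1000, 10000)]          # no penalty: keeps growing
slopes_l2 = [run_gd(t, lam=0.01) for t in (1000, 10000)]  # with penalty: settles

print(scaled_losses)
print(slopes)
print(slopes_l2)
```

The unpenalized slope should still be rising after 10,000 steps even though each step is tiny, which is the gap between "gradient is small" and "β has stabilized".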
So the loss has no minimizer at finite β: every increase in ∥β∥ along the separating direction strictly decreases the loss. Gradient descent reflects this: ∥β^∥ grows without bound, and the predicted probabilities sharpen toward step functions.
"Convergence" in the optimizer's sense (gradient norm small) and "the data is separable" are different things: the gradient norm does shrink (toward zero), but only because the sigmoid saturates, not because β has stabilized. In practice, this is why production code uses L2 regularization (which adds λ∥β∥² and gives a finite minimizer) or simply caps the number of gradient steps.
Check 2 derivation
Derive the Hessian of the cross-entropy loss L(β)=−(1/n)∑i[yilogσ(xi⊤β)+(1−yi)log(1−σ(xi⊤β))]. Express it in matrix form using the design matrix X, the predicted probabilities p^i, and a diagonal matrix.
Conclude that the Hessian is positive semi-definite, and therefore the cross-entropy loss is convex. State explicitly which property of the sigmoid you used.
Solution sketch.
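Start from the gradient: ∇L(β)=(1/n)X⊤(p^−y) with p^=σ(Xβ) componentwise. Component j is (1/n)∑i xij(p^i−yi), and only p^i depends on β: by the chain rule, ∂p^i/∂βk=σ′(xi⊤β)xik=p^i(1−p^i)xik.
So ∂²L/∂βj∂βk=(1/n)∑i xij p^i(1−p^i) xik. In matrix form:
$$\nabla^2L(\beta)=\frac{1}{n}X^\top DX,\qquad D=\operatorname{diag}\big(\hat p_i(1-\hat p_i)\big)$$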
Positive semi-definiteness. For any vector v:
$$v^\top(X^\top DX)v=(Xv)^\top D(Xv)=\sum_i\hat p_i(1-\hat p_i)\,(Xv)_i^2$$
Each term is non-negative because p^i∈(0,1) means p^i(1−p^i)>0. The sum is therefore ≥0 for every v, so the Hessian is PSD and the loss is convex.
The sigmoid property used: σ′(z)=σ(z)(1−σ(z)), which is always positive. This is the workhorse identity that makes the gradient clean and the loss convex — both properties trace to the same simple derivative formula.
Convexity is why gradient descent has no local-minimum traps on logistic regression. It also enables Newton-style optimizers: the Hessian is well-defined and PSD, and when X has full column rank X⊤DX is positive definite, so the Newton update β←β−(X⊤DX)⁻¹X⊤(p^−y) moves in a descent direction. This is what IRLS ("iteratively reweighted least squares") actually computes.
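A sketch of the Newton iteration on the six-point dataset from the worked example, with the 2×2 linear system solved by hand. The function name `newton_fit` and the fixed count of 10 iterations are assumptions made for illustration; a production IRLS routine would add a convergence test and safeguards.

```python
import math

# The six data points from the worked example.
xs = [-3, -2, -1, 1, 2, 3]
ys = [0, 0, 1, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_fit(xs, ys, steps=10):
    """Newton's method for the model sigma(b0 + b1*x), starting from zero."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        p = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient X^T (p - y); the 1/n factors cancel in the Newton step.
        g0 = sum(pi - y for pi, y in zip(p, ys))
        g1 = sum(x * (pi - y) for x, pi, y in zip(xs, p, ys))
        # Hessian X^T D X with D = diag(p_i * (1 - p_i)).
        d = [pi * (1.0 - pi) for pi in p]
        h00 = sum(d)
        h01 = sum(x * di for x, di in zip(xs, d))
        h11 = sum(x * x * di for x, di in zip(xs, d))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system H * step = g with the explicit inverse.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 -= s0
        b1 -= s1
    return b0, b1

b0, b1 = newton_fit(xs, ys)
print(round(b0, 3), round(b1, 3))
```

Newton should reach the same estimate as gradient descent on this dataset, in far fewer iterations, because each step uses the curvature in D.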