Recognize a problem as "fit a model that is linear in its parameters," set up the design matrix, solve XᵀXβ = Xᵀy by normal equations or QR, and read coefficient stability vs prediction stability from the conditioning of X.
1 worked example · 7 practice problems · 2 check problems
Worked example: fitting a line by hand
Problem. Fit y=β0+β1x by ordinary least squares to the five data points (1,2.1),(2,3.8),(3,6.2),(4,7.9),(5,10.1). Report β^0,β^1, the residuals, and R2. Do it by hand, then check with code.
Diagnosis. Linear regression in one feature. Set up the design matrix X with a column of ones for the intercept and x for the feature, then solve the normal equations X⊤Xβ^=X⊤y. For one feature the algebra simplifies to the textbook scalar formulas, but the matrix view will be what generalizes.
Predict before reading on: eyeball the data before doing any algebra. What slope do you expect, roughly? What intercept? You should be able to predict β^1 to within 0.1 just by looking.
Solution. Five sums:
∑x=15,∑y=30.1,∑x2=55,∑xy=110.4,xˉ=3,yˉ=6.02
For a single feature, the OLS estimator collapses to
Predict before reading on: these formulas drop out of β^=(X⊤X)−1X⊤y when X has exactly two columns (a ones column plus one feature). Convince yourself this is true — what entries of the 2×2 matrix X⊤X show up in the denominator of β^1?
Residuals. Predicted values y^i=−0.01+2.01xi give (2.00,4.01,6.02,8.03,10.04). Residuals yi−y^i are (+0.10,−0.21,+0.18,−0.13,+0.06). They alternate sign and are small relative to yi — the model fits well.
R-squared.SSres=∑(yi−y^i)2=0.107. SStot=∑(yi−yˉ)2=40.508. So R2=1−0.107/40.508=0.9974. The model captures 99.7% of the variance.
Articulate: state in one sentence what linear regression actually does — what objective it minimizes, and over what.
Practice problems
Seven problems, seven different surfaces. Each is the same move from the worked example — set up the design matrix, solve the normal equations, read the result. The features change, the trick doesn't.
P.1climate trends, time-series fit
Annual temperature anomalies (°C) at one station for five consecutive years are:
year offset x = 0, 1, 2, 3, 4anomaly y = 0.50, 0.55, 0.62, 0.68, 0.75
Fit y=β0+β1x by hand. What's the implied warming rate per decade?
Find the analogue:
same one-feature OLS as the worked example. Compute ∑x,∑y,∑x2,∑xy and plug into the scalar formula.
show answer
With n=5: ∑x=10,xˉ=2, ∑y=3.10,yˉ=0.62, ∑x2=30,∑xy=6.83.
P.2free-fall physics, nonlinear-in-t but linear-in-parameters
A ball dropped from a tower has its height measured at six times. The data:
t (s) = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0h (m) = 5.00, 4.81, 4.22, 3.24, 1.85, 0.12
Fit the model h(t)=h0−21gt2 and extract g. Notice the model is nonlinear in t but linear in the parameters(h0,g).
Find the analogue:
"linear in parameters" is the key property. What feature column does g multiply? Build the design matrix accordingly and use the same OLS solve.
show answer
Reparametrize as h=β0+β1t2 with β0=h0 and β1=−g/2. The design matrix has columns [1,t2], so t2∈{0,0.04,0.16,0.36,0.64,1.0}.
Fit log10(food)=β0+β1log10(income). Report β^1, which is the elasticity — the percentage change in food per percentage change in income.
Find the analogue:
the relationship is nonlinear in the raw variables but linear in their logs. Build a design matrix with log10(income) as the feature column. Same OLS solve.
show answer
Let zi=log10(incomei) and wi=log10(foodi). Compute ∑z=8.584, ∑w=5.088, ∑z2=15.042, ∑zw=8.964.
Plug into the one-feature formula: β^1≈0.751, β^0≈−0.272.
Elasticity ≈ 0.75: a 1% increase in income produces about a 0.75% increase in food spending. Sub-unitary, just as Engel observed in 1857. Anything <1 here is what makes food a "necessity good" in econ jargon.
P.5sensor calibration with inverse application
A pressure transducer is calibrated against a reference. The (pressure, voltage) pairs are:
(a) Fit V=α+βP. (b) Use the fit to convert a future reading of V=3.0 V into kPa.
Find the analogue:
part (a) is the worked example with renamed variables. Part (b) inverts the fit — same parameters, solve algebraically for P in terms of V.
Construct a 200-sample, 3-feature dataset where x3 is a near-exact copy of x1+x2 (differ only at the 7th decimal place). Generate y=1+2x1+x2+ε with small Gaussian noise. Compute κ(X⊤X). Solve the normal equations and report β^. Then solve via np.linalg.lstsq and compare. What's similar, what's different?
Find the analogue:
the worked example assumed X⊤X was well-conditioned. This problem is what happens when it isn't — and surfaces the difference between what the data can tell you (predictions) and what it cannot (individual coefficients).
κ(X⊤X)∼1014, near double-precision limits. The two solvers give different coefficients for x1,x2,x3 — the per-column splits are wildly unstable, often differing by factors of 103. But the predictionsy^=Xβ^ agree to ∼10−4 between methods.
Reason: the data constrains linear combinations of (β1,β2,β3) that lie in the row-space of X, not the individual coefficients. The null direction β3−β1−β2 is essentially unconstrained, so any solver picks an arbitrary value there. The takeaway: when κ is large, look at predictions, not coefficients.
P.7experimental design, weighted least squares
You're given n data points (xi,yi) with known but unequal noise levels σi. The standard OLS estimator treats every point equally; this overweights noisy points and underweights precise ones. Derive the weighted least-squares estimator that minimizes ∑i(yi−xi⊤β)2/σi2. Show your answer matches β^=(X⊤WX)−1X⊤Wy with W=diag(1/σ12,…,1/σn2).
Find the analogue:
the derivation move is the same one that produced the OLS normal equations — set the gradient of the loss to zero. The only difference is the loss has weights.
show answer
Write the weighted loss as L(β)=(y−Xβ)⊤W(y−Xβ). Expand:
L=y⊤Wy−2β⊤X⊤Wy+β⊤X⊤WXβ
Gradient: ∇βL=−2X⊤Wy+2X⊤WXβ=0.
Solving: β^WLS=(X⊤WX)−1X⊤Wy. ✓
Equivalent recipe in practice: rescale each row i of X and y by 1/σi, then run plain OLS on the rescaled system. The rescaling makes the residuals equal-variance, which is what OLS optimality requires.
Aside: this is also the MLE under iid Gaussian noise with known per-point variance σi2. Same algebra, different framing — the same answer drops out of the likelihood derivation.
Check problems
Two problems that resist pattern-matching against the practice set. Neither is solvable by remembering one of the problems above.
Check 1articulation
The normal-equations route to OLS forms X⊤X and solves (X⊤X)β^=X⊤y. The QR route factors X=QR and solves Rβ^=Q⊤y. Mathematically they compute the same β^. Numerically, when κ(X)∼107, the normal-equations solver loses about 14 digits while the QR solver loses only 7.
In 150–250 words, explain why. Your explanation should distinguish what is being computed (the same minimizer) from how it is being computed (a different sequence of finite-precision operations). Make clear that the issue isn't the input data or the algorithm in isolation — it's their interaction. A reader who just finished the linear-regression page should be able to follow your explanation.
show solution sketch
The two routes converge to the same β^ in exact arithmetic — that's what "mathematically the same" means. The difference shows up because finite-precision arithmetic amplifies errors at a rate that depends on the condition number of the matrices being inverted, and the two routes invert different matrices.
Forming X⊤X squares the condition number: κ(X⊤X)=κ(X)2. The normal-equations solver then has to invert that squared-conditioned matrix, and forward error scales as condition number times machine epsilon. With κ(X)∼107 and machine epsilon ∼10−16, the normal equations lose about log10(1014)=14 digits.
QR factorization never forms X⊤X. The Householder or Givens algorithm operates directly on X, with arithmetic whose forward error scales as κ(X) rather than its square. The loss is log10(107)=7 digits — half as many.
The issue isn't X alone (any conditioning is fine if the operations are stable). It isn't the OLS formula alone (the minimizer is well-defined for any non-singular system). It's the realization: which intermediate matrices the algorithm forms and inverts in the working precision. Conditioning is a property of the realization, not the problem.
Check 2derivation
Assume the linear model y=Xβ+ε with iid Gaussian noise ε∼N(0,σ2I). Derive the covariance matrix of the OLS estimator:
Cov(β^)=σ2(X⊤X)−1
Specialize to simple linear regression (single feature plus intercept) and use it to derive the closed-form standard error of the slope:
SE(β^1)=∑i(xi−xˉ)2σ
Discuss qualitatively: what makes SE(β^1) small? What makes it large? Connect at least one factor to a practical experimental-design decision.
show solution sketch
General case. The OLS estimator is β^=(X⊤X)−1X⊤y. Substitute y=Xβ+ε:
β^=(X⊤X)−1X⊤(Xβ+ε)=β+(X⊤X)−1X⊤ε
So β^−β=(X⊤X)−1X⊤ε. Its covariance is
Cov(β^)=(X⊤X)−1X⊤Cov(ε)X(X⊤X)−1
With Cov(ε)=σ2I, the middle three matrices collapse to σ2X⊤X, giving
Cov(β^)=σ2(X⊤X)−1X⊤X(X⊤X)−1=σ2(X⊤X)−1 ✓
Simple linear regression. With X=[1,x], X⊤X=(n∑xi∑xi∑xi2).
The determinant is n∑xi2−(∑xi)2=n∑(xi−xˉ)2 (using ∑(xi−xˉ)2=∑xi2−nxˉ2). The inverse has (2,2)-entry n/(n∑(xi−xˉ)2)=1/∑(xi−xˉ)2.
So Var(β^1)=σ2/∑(xi−xˉ)2, and SE(β^1)=σ/∑(xi−xˉ)2. ✓
Sample sizen: more data → larger sum ∑(xi−xˉ)2 → smaller SE. Standard n scaling.
Spread of x: data spread over a wide range gives much smaller SE than the same number of points clustered together. This is the experimental-design lever: when you can choose where to measure, spread the measurements out. Two points at the extremes of the design range are vastly more informative for the slope than ten points in a narrow cluster.
Practical implication: in a dose-response experiment, putting your samples at the extreme doses (rather than evenly spaced) maximizes the precision of the slope estimate, at the cost of nonlinearity diagnostics. Most experimental-design textbooks call this the "D-optimal" or "extreme-point" design.