Machine Learning

Linear Regression — exercises

Recognize a problem as "fit a model that is linear in its parameters," set up the design matrix, solve XᵀXβ = Xᵀy by normal equations or QR, and read coefficient stability vs prediction stability from the conditioning of X.

1 worked example · 7 practice problems · 2 check problems

Worked example: fitting a line by hand

Problem. Fit $y = β_{0} + β_{1} x$ by ordinary least squares to the five data points $(1, 2.1), (2, 3.8), (3, 6.2), (4, 7.9), (5, 10.1)$ . Report $\hat{β}_{0}, \hat{β}_{1}$ , the residuals, and $R^{2}$ . Do it by hand, then check with code.

Diagnosis. Linear regression in one feature. Set up the design matrix $X$ with a column of ones for the intercept and a column of $x$ values for the feature, then solve the normal equations $X^{⊤} X \hat{β} = X^{⊤} y$ . For one feature the algebra simplifies to the textbook scalar formulas, but the matrix view is what generalizes.

Predict before reading on: eyeball the data before doing any algebra. What slope do you expect, roughly? What intercept? You should be able to predict $\hat{β}_{1}$ to within 0.1 just by looking.

Solution. Five sums:

$\sum x = 15, \quad \sum y = 30.1, \quad \sum x^{2} = 55, \quad \sum x y = 110.4, \quad \bar{x} = 3, \quad \bar{y} = 6.02$

For a single feature, the OLS estimator collapses to

$\hat{β}_{1} = \frac{\sum x y - n \bar{x} \bar{y}}{\sum x^{2} - n \bar{x}^{2}}, \qquad \hat{β}_{0} = \bar{y} - \hat{β}_{1} \bar{x}$

Plug in:

$\hat{β}_{1} = \frac{110.4 - 5 \cdot 3 \cdot 6.02}{55 - 5 \cdot 9} = \frac{110.4 - 90.3}{10} = \frac{20.1}{10} = 2.01$

$\hat{β}_{0} = 6.02 - 2.01 \cdot 3 = - 0.01$

Predict before reading on: these formulas drop out of $\hat{β} = (X^{⊤} X)^{- 1} X^{⊤} y$ when $X$ has exactly two columns (a ones column plus one feature). Convince yourself this is true — what entries of the $2 \times 2$ matrix $X^{⊤} X$ show up in the denominator of $\hat{β}_{1}$ ?

Residuals. Predicted values $\hat{y}_{i} = - 0.01 + 2.01 x_{i}$ give $(2.00, 4.01, 6.02, 8.03, 10.04)$ . Residuals $y_{i} - \hat{y}_{i}$ are $(+ 0.10, - 0.21, + 0.18, - 0.13, + 0.06)$ . They sum to zero (as they must when the model includes an intercept), alternate in sign, and are small relative to the $y_{i}$ , so the model fits well.

R-squared. $SS_{res} = \sum (y_{i} - \hat{y}_{i})^{2} = 0.107$ . $SS_{tot} = \sum (y_{i} - \bar{y})^{2} = 40.508$ . So $R^{2} = 1 - 0.107/40.508 = 0.9974$ . The model captures 99.7% of the variance in $y$ .

Verification.

import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1])
X = np.column_stack([np.ones(5), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)            # [-0.01  2.01]
y_hat = X @ beta
print(1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2))   # 0.9974

Articulate: state in one sentence what linear regression actually does — what objective it minimizes, and over what.

Practice problems

Seven problems, seven different surfaces. Each is the same move from the worked example — set up the design matrix, solve the normal equations, read the result. The features change, the trick doesn't.

P.1 climate trends, time-series fit

Annual temperature anomalies (°C) at one station for five consecutive years are:

year offset x = 0, 1, 2, 3, 4
anomaly    y = 0.50, 0.55, 0.62, 0.68, 0.75

Fit $y = β_{0} + β_{1} x$ by hand. What's the implied warming rate per decade?

Find the analogue: same one-feature OLS as the worked example. Compute $\sum x, \sum y, \sum x^{2}, \sum x y$ and plug into the scalar formula.

Answer.

With $n = 5$ : $\sum x = 10, \bar{x} = 2$ , $\sum y = 3.10, \bar{y} = 0.62$ , $\sum x^{2} = 30, \sum x y = 6.83$ .

$\hat{β}_{1} = (6.83 - 5 \cdot 2 \cdot 0.62) / (30 - 5 \cdot 4) = 0.63/10 = 0.063$ °C/year.

$\hat{β}_{0} = 0.62 - 0.063 \cdot 2 = 0.494$ °C.

Warming rate: 0.63 °C/decade.
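The hand computation can be checked the same way as the worked example. This is a minimal sketch; the variable names are illustrative, not part of the problem statement.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)      # year offset
y = np.array([0.50, 0.55, 0.62, 0.68, 0.75])    # anomaly in °C

X = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
beta0, beta1 = np.linalg.solve(X.T @ X, X.T @ y)

print(f"intercept  = {beta0:.3f} °C")
print(f"slope      = {beta1:.3f} °C/year")
print(f"per decade = {10 * beta1:.2f} °C")
```

Multiplying the per-year slope by 10 is the only step beyond the worked example.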

P.2 free-fall physics, nonlinear-in-t but linear-in-parameters

A ball dropped from a tower has its height measured at six times. The data:

t (s)  = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
h (m)  = 5.00, 4.81, 4.22, 3.24, 1.85, 0.12

Fit the model $h (t) = h_{0} - \frac{1}{2} g t^{2}$ and extract $g$ . Notice the model is nonlinear in $t$ but linear in the parameters $(h_{0}, g)$ .

Find the analogue: "linear in parameters" is the key property. What feature column does $g$ multiply? Build the design matrix accordingly and use the same OLS solve.

Answer.

Reparametrize as $h = β_{0} + β_{1} t^{2}$ with $β_{0} = h_{0}$ and $β_{1} = - g /2$ . The design matrix has columns $[1, t^{2}]$ , where the $t^{2}$ column is $(0, 0.04, 0.16, 0.36, 0.64, 1.0)$ .

Solving $X^{⊤} X \hat{β} = X^{⊤} h$ gives $\hat{β}_{0} \approx 5.00$ , $\hat{β}_{1} \approx - 4.892$ .

So $h_{0} \approx 5.00$ m and $g = - 2 \hat{β}_{1} \approx 9.78$ m/s², within about 0.3% of the standard value of 9.81 m/s².
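One way to reproduce these numbers is sketched below. The only change from the worked example is that the feature column holds $t^{2}$ instead of $t$ ; the names `h0_hat` and `g_hat` are illustrative.

```python
import numpy as np

t = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])          # seconds
h = np.array([5.00, 4.81, 4.22, 3.24, 1.85, 0.12])    # metres

X = np.column_stack([np.ones_like(t), t**2])   # columns [1, t^2]
beta, *_ = np.linalg.lstsq(X, h, rcond=None)

h0_hat = beta[0]
g_hat = -2.0 * beta[1]     # because beta_1 = -g/2
print(f"h0 ≈ {h0_hat:.3f} m, g ≈ {g_hat:.3f} m/s^2")
```

The solver never sees that the model is quadratic in time; it only sees a two-column design matrix.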

P.3 polynomial curve fitting

Fit a quadratic $y = β_{0} + β_{1} x + β_{2} x^{2}$ to:

x = -2, -1,  0,  1,  2
y =  9.1, 4.05, 1.0, 0.05, 0.95

Report $\hat{β}$ . The true generating polynomial (before noise was added) was $(x - 1)^{2}$ .

Find the analogue: polynomial regression is linear regression in disguise. The design matrix gets one column per power of $x$ . Same solve, more columns.

Answer.

Design matrix $X$ with columns $[1, x, x^{2}]$ :

X = [[1, -2, 4],
     [1, -1, 1],
     [1,  0, 0],
     [1,  1, 1],
     [1,  2, 4]]

Solving the normal equations gives $\hat{β} \approx (1.03, - 2.03, 1.00)$ , close to the true $(1, - 2, 1)$ from $(x - 1)^{2} = x^{2} - 2 x + 1$ .

The minor deviation is just the OLS estimator absorbing the noise. With noise-free data it would recover the true coefficients exactly.
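A short sketch of this fit follows, with a second solve on noise-free values of $(x - 1)^{2}$ to illustrate the exact-recovery claim. `np.vander` is used here as one convenient way to build the power columns; stacking the columns by hand would work equally well.

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = np.array([9.1, 4.05, 1.0, 0.05, 0.95])

# increasing=True gives columns [1, x, x^2], matching the design matrix above
X = np.vander(x, N=3, increasing=True)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("noisy data     :", np.round(beta_hat, 3))

# Same solve on the noise-free polynomial (x - 1)^2
y_clean = (x - 1.0) ** 2
beta_clean = np.linalg.solve(X.T @ X, X.T @ y_clean)
print("noise-free data:", np.round(beta_clean, 3))
```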

P.4 economic elasticity, log-log regression

Engel's law says food expenditure grows sub-linearly with income. Given five households:

income (k$)  = 20,  40,  60,  80,  100
food   (k$)  = 5.0, 8.7, 11.7, 14.4, 16.7

Fit $\log_{10} (\text{food}) = β_{0} + β_{1} \log_{10} (\text{income})$ . Report $\hat{β}_{1}$ , which is the elasticity: the percentage change in food per percentage change in income.

Find the analogue: the relationship is nonlinear in the raw variables but linear in their logs. Build a design matrix with $\log_{10} (\text{income})$ as the feature column. Same OLS solve.

Answer.

Let $z_{i} = \log_{10} (\text{income}_{i})$ and $w_{i} = \log_{10} (\text{food}_{i})$ . Compute $\sum z = 8.584$ , $\sum w = 5.088$ , $\sum z^{2} = 15.043$ , $\sum z w = 8.964$ .

Plug into the one-feature formula: $\hat{β}_{1} \approx 0.751$ , $\hat{β}_{0} \approx - 0.272$ .

Elasticity ≈ 0.75: a 1% increase in income produces about a 0.75% increase in food spending. Sub-unitary, just as Engel observed in 1857. Anything $< 1$ here is what makes food a "necessity good" in econ jargon.
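The log-log fit can be sketched in a few lines. The transformation happens before the design matrix is built, and everything after that is the ordinary solve.

```python
import numpy as np

income = np.array([20, 40, 60, 80, 100], dtype=float)   # k$
food = np.array([5.0, 8.7, 11.7, 14.4, 16.7])           # k$

z = np.log10(income)
w = np.log10(food)

X = np.column_stack([np.ones_like(z), z])
beta0, beta1 = np.linalg.lstsq(X, w, rcond=None)[0]
print(f"elasticity ≈ {beta1:.3f}, intercept ≈ {beta0:.3f}")
```

The slope would be identical with natural logs, since changing the base rescales both axes by the same factor; only the intercept depends on the base.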

P.5 sensor calibration with inverse application

A pressure transducer is calibrated against a reference. The (pressure, voltage) pairs are:

P (kPa)  =  0,   10,   20,   30,   40
V (V)    = 0.50, 1.45, 2.51, 3.49, 4.55

(a) Fit $V = α + βP$ . (b) Use the fit to convert a future reading of $V = 3.0$ V into kPa.

Find the analogue: part (a) is the worked example with renamed variables. Part (b) inverts the fit — same parameters, solve algebraically for $P$ in terms of $V$ .

Answer.

(a) $\sum P = 100, \bar{P} = 20$ , $\sum V = 12.50, \bar{V} = 2.50$ , $\sum P^{2} = 3000$ , $\sum P V = 351.4$ . Plug in: $\hat{β} = (351.4 - 5 \cdot 20 \cdot 2.50) / (3000 - 5 \cdot 400) = 101.4/1000 = 0.1014$ V/kPa. $\hat{α} = 2.50 - 0.1014 \cdot 20 = 0.472$ V.

(b) Invert: $P = (V - 0.472) /0.1014$ . For $V = 3.0$ : $P \approx 24.9$ kPa.

This is how every calibration curve in a lab works — fit once, use the inverse forever.
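Both parts can be sketched together as below. The helper name `voltage_to_pressure` is hypothetical, chosen here only to make the inversion step explicit.

```python
import numpy as np

P = np.array([0, 10, 20, 30, 40], dtype=float)     # reference pressure, kPa
V = np.array([0.50, 1.45, 2.51, 3.49, 4.55])       # transducer output, V

# (a) Fit V = alpha + beta * P
X = np.column_stack([np.ones_like(P), P])
alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ V)

# (b) Invert the fitted line to read pressure from a voltage
def voltage_to_pressure(v):
    """Solve v = alpha_hat + beta_hat * P for P (hypothetical helper)."""
    return (v - alpha_hat) / beta_hat

print(f"alpha = {alpha_hat:.3f} V, beta = {beta_hat:.4f} V/kPa")
print(f"V = 3.0 V  ->  P ≈ {voltage_to_pressure(3.0):.1f} kPa")
```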

P.6 numerical conditioning, multicollinearity diagnosis

Construct a 200-sample, 3-feature dataset where $x_{3}$ is a near-exact copy of $x_{1} + x_{2}$ (the two differ only around the 7th decimal place). Generate $y = 1 + 2 x_{1} + x_{2} + ε$ with small Gaussian noise. Compute $κ (X^{⊤} X)$ . Solve the normal equations and report $\hat{β}$ . Then solve via np.linalg.lstsq and compare. What's similar, what's different?

Find the analogue: the worked example assumed $X^{⊤} X$ was well-conditioned. This problem is what happens when it isn't — and surfaces the difference between what the data can tell you (predictions) and what it cannot (individual coefficients).

Answer.

import numpy as np

rng = np.random.default_rng(0)
n  = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
x3 = x1 + x2 + rng.normal(0, 1e-7, n)   # x3 ≈ x1 + x2

X  = np.column_stack([np.ones(n), x1, x2, x3])
y  = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(0, 0.1, n)

print("cond(XᵀX) =", f"{np.linalg.cond(X.T @ X):.2e}")
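beta_ne = np.linalg.solve(X.T @ X, X.T @ y)        # normal equations
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]     # SVD-based least squares
print("β (normal eq):", beta_ne)
print("β (lstsq)   :", beta_ls)
# Predictions, not coefficients, are what's stable.
print("max |ŷ_ne - ŷ_ls| =", np.max(np.abs(X @ beta_ne - X @ beta_ls)))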

$κ (X^{⊤} X)$ comes out around $10^{14}$ or larger, at the edge of what double precision can resolve. The two solvers give different coefficients for $x_{1}, x_{2}, x_{3}$ : the per-column splits are wildly unstable, often differing by factors of $10^{3}$ . But the predictions $\hat{y} = X \hat{β}$ agree to $\sim 10^{- 4}$ between methods.

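Reason: the data constrains only those linear combinations of $(β_{1}, β_{2}, β_{3})$ that lie in the row space of $X$ , not the individual coefficients. Because $x_{3} \approx x_{1} + x_{2}$ , shifting $(β_{1}, β_{2}, β_{3})$ by $t \cdot (1, 1, - 1)$ changes $Xβ$ by only $t (x_{1} + x_{2} - x_{3}) \approx 0$ . The component of $β$ along that direction, proportional to $β_{3} - β_{1} - β_{2}$ , is essentially unconstrained, so each solver lands on an arbitrary value there. The takeaway: when $κ$ is large, look at predictions, not coefficients.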
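The weakly constrained direction can be read directly off a singular value decomposition of $X$ . The sketch below rebuilds the same synthetic features (same seed and sizes as the answer code) and inspects the right singular vector belonging to the smallest singular value. This is one diagnostic among several, not the only way to locate the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
x3 = x1 + x2 + rng.normal(0, 1e-7, n)
X = np.column_stack([np.ones(n), x1, x2, x3])

# Thin SVD: singular values are returned sorted from largest to smallest.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
v_null = Vt[-1]          # right singular vector for the smallest singular value

print("singular values:", s)
print("kappa(X) =", f"{s[0] / s[-1]:.2e}")
print("weakest direction (const, b1, b2, b3):", np.round(v_null, 4))
```

Up to an overall sign, the printed direction should be close to $(0, 1, 1, - 1) / \sqrt{3}$ : moving the coefficients along it leaves the fitted values almost unchanged.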

P.7 experimental design, weighted least squares

You're given $n$ data points $(x_{i}, y_{i})$ with known but unequal noise levels $σ_{i}$ . The standard OLS estimator treats every point equally; this overweights noisy points and underweights precise ones. Derive the weighted least-squares estimator that minimizes $\sum_{i} (y_{i} - x_{i}^{⊤} β)^{2} / σ_{i}^{2}$ . Show your answer matches $\hat{β} = (X^{⊤} W X)^{- 1} X^{⊤} W y$ with $W = diag (1/ σ_{1}^{2}, \dots, 1/ σ_{n}^{2})$ .

Find the analogue: the derivation move is the same one that produced the OLS normal equations — set the gradient of the loss to zero. The only difference is the loss has weights.

Answer.

Write the weighted loss as $L (β) = (y - Xβ)^{⊤} W (y - Xβ)$ . Expand:

$L = y^{⊤} W y - 2 β^{⊤} X^{⊤} W y + β^{⊤} X^{⊤} W Xβ$

Gradient: $\nabla_{β} L = - 2 X^{⊤} W y + 2 X^{⊤} W Xβ = 0$ .

Solving: $\hat{β}_{WLS} = (X^{⊤} W X)^{- 1} X^{⊤} W y$ . ✓

Equivalent recipe in practice: rescale each row $i$ of $X$ and $y$ by $1/ σ_{i}$ , then run plain OLS on the rescaled system. The rescaling gives every noise term unit variance, which is the equal-variance condition that OLS optimality requires.

Aside: this is also the MLE under independent (but not identically distributed) Gaussian noise with known per-point variance $σ_{i}^{2}$ . Same algebra, different framing: the same answer drops out of the likelihood derivation.
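The equivalence between the closed form and the row-rescaling recipe can be checked numerically. The data below are synthetic and the noise levels are assumptions made for the illustration (a precise half and a noisy half); nothing here comes from the problem statement.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.linspace(0, 10, n)
sigma = np.where(x < 5, 0.1, 2.0)            # assumed per-point noise levels
y = 1.0 + 0.5 * x + rng.normal(0, sigma)     # synthetic data, assumed true line

X = np.column_stack([np.ones(n), x])
W = np.diag(1.0 / sigma**2)

# Closed form: (X^T W X)^{-1} X^T W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent recipe: divide each row by sigma_i, then plain OLS
Xs = X / sigma[:, None]
ys = y / sigma
beta_rescaled = np.linalg.lstsq(Xs, ys, rcond=None)[0]

print("closed form :", beta_wls)
print("rescaled OLS:", beta_rescaled)
```

For large $n$ the rescaling route is usually preferable, since it avoids building the dense $n \times n$ matrix $W$ .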

Check problems

Two problems that resist pattern-matching against the practice set. Neither is solvable by remembering one of the problems above.

Check 1 articulation

The normal-equations route to OLS forms $X^{⊤} X$ and solves $(X^{⊤} X) \hat{β} = X^{⊤} y$ . The QR route factors $X = QR$ and solves $R \hat{β} = Q^{⊤} y$ . Mathematically they compute the same $\hat{β}$ . Numerically, when $κ (X) \sim 10^{7}$ , the normal-equations solver loses about 14 digits while the QR solver loses only 7.

In 150–250 words, explain why. Your explanation should distinguish what is being computed (the same minimizer) from how it is being computed (a different sequence of finite-precision operations). Make clear that the issue isn't the input data or the algorithm in isolation — it's their interaction. A reader who just finished the linear-regression page should be able to follow your explanation.

Solution sketch.

The two routes converge to the same $\hat{β}$ in exact arithmetic — that's what "mathematically the same" means. The difference shows up because finite-precision arithmetic amplifies errors at a rate that depends on the condition number of the matrices being inverted, and the two routes invert different matrices.

Forming $X^{⊤} X$ squares the condition number: $κ (X^{⊤} X) = κ (X)^{2}$ . The normal-equations solver then has to invert that squared-conditioned matrix, and forward error scales as condition number times machine epsilon. With $κ (X) \sim 10^{7}$ and machine epsilon $\sim 10^{- 16}$ , the normal equations lose about $\log_{10} (10^{14}) = 14$ digits.

QR factorization never forms $X^{⊤} X$ . The Householder or Givens algorithm operates directly on $X$ , with arithmetic whose forward error scales as $κ (X)$ rather than its square (at least when the residual is small). The loss is $\log_{10} (10^{7}) = 7$ digits, half as many.

The issue isn't $X$ alone (the same $X$ is handled well by QR). It isn't the OLS formula alone (the minimizer is well-defined whenever $X$ has full column rank). It's the realization: which intermediate matrices the algorithm forms and solves with in working precision. The least-squares problem itself has sensitivity governed by $κ (X)$ ; the normal-equations route swaps in a linear system with sensitivity $κ (X)^{2}$ , so the extra seven digits are lost to the algorithm, not to the data.
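The gap between the two routes is easy to observe on a synthetic matrix. The sketch below builds an $X$ with prescribed singular values so that $κ (X) = 10^{7}$ , uses noise-free $y$ so the exact minimizer is known, and compares relative errors. The sizes, seed, and construction are assumptions for illustration, and the exact error figures will vary with hardware and linear-algebra backend.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 6

# Build X = U diag(s) V^T with chosen singular values, so kappa(X) = 1e7.
U, _ = np.linalg.qr(rng.normal(size=(n, p)))
V, _ = np.linalg.qr(rng.normal(size=(p, p)))
s = np.logspace(0, -7, p)
X = (U * s) @ V.T

beta_true = rng.normal(size=p)
y = X @ beta_true            # noise-free, so the exact minimizer is beta_true

# Route 1: normal equations
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: QR, then solve R beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

def rel_err(b):
    """Relative error against the known minimizer."""
    return np.linalg.norm(b - beta_true) / np.linalg.norm(beta_true)

err_ne = rel_err(beta_ne)
err_qr = rel_err(beta_qr)
print(f"kappa(X)         = {np.linalg.cond(X):.1e}")
print(f"normal equations : relative error {err_ne:.1e}")
print(f"QR               : relative error {err_qr:.1e}")
```

Both routes receive the same $X$ and $y$ ; only the sequence of floating-point operations differs.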

Check 2 derivation

Assume the linear model $y = Xβ + ε$ with iid Gaussian noise $ε \sim N (0, σ^{2} I)$ . Derive the covariance matrix of the OLS estimator:

$Cov (\hat{β}) = σ^{2} (X^{⊤} X)^{- 1}$

Specialize to simple linear regression (single feature plus intercept) and use it to derive the closed-form standard error of the slope:

$SE (\hat{β}_{1}) = \frac{σ}{\sqrt{\sum_{i} (x_{i} - \bar{x})^{2}}}$

Discuss qualitatively: what makes $SE (\hat{β}_{1})$ small? What makes it large? Connect at least one factor to a practical experimental-design decision.

Solution sketch.

General case. The OLS estimator is $\hat{β} = (X^{⊤} X)^{- 1} X^{⊤} y$ . Substitute $y = Xβ + ε$ :

$\hat{β} = (X^{⊤} X)^{- 1} X^{⊤} (Xβ + ε) = β + (X^{⊤} X)^{- 1} X^{⊤} ε$

So $\hat{β} - β = (X^{⊤} X)^{- 1} X^{⊤} ε$ . Its covariance is

$Cov (\hat{β}) = (X^{⊤} X)^{- 1} X^{⊤} Cov (ε) X (X^{⊤} X)^{- 1}$

With $Cov (ε) = σ^{2} I$ , the middle three matrices collapse to $σ^{2} X^{⊤} X$ , giving

$Cov (\hat{β}) = σ^{2} (X^{⊤} X)^{- 1} X^{⊤} X (X^{⊤} X)^{- 1} = σ^{2} (X^{⊤} X)^{- 1}$ ✓

Simple linear regression. With $X = [1, x]$ , $X^{⊤} X = \begin{pmatrix} n & \sum x_{i} \\ \sum x_{i} & \sum x_{i}^{2} \end{pmatrix}$ .

The determinant is $n \sum x_{i}^{2} - (\sum x_{i})^{2} = n \sum (x_{i} - \bar{x})^{2}$ (using $\sum (x_{i} - \bar{x})^{2} = \sum x_{i}^{2} - n \bar{x}^{2}$ ). The inverse has $(2, 2)$ -entry $n / (n \sum (x_{i} - \bar{x})^{2}) = 1/ \sum (x_{i} - \bar{x})^{2}$ .

So $Var (\hat{β}_{1}) = σ^{2} / \sum (x_{i} - \bar{x})^{2}$ , and taking the square root, $SE (\hat{β}_{1}) = σ / \sqrt{\sum (x_{i} - \bar{x})^{2}}$ . ✓
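Both results can be sanity-checked by simulation. The sketch below reuses the worked example's design points, assumes a noise level and true coefficients purely for illustration, and compares the empirical covariance of refitted $\hat{β}$ with $σ^{2} (X^{⊤} X)^{- 1}$ . It also confirms that the slope's standard error from the matrix formula equals $σ / \sqrt{\sum (x_{i} - \bar{x})^{2}}$ .

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # design points from the worked example
X = np.column_stack([np.ones_like(x), x])
sigma = 0.2                                  # assumed noise level
beta_true = np.array([0.0, 2.0])             # assumed true coefficients

# Theory
cov_theory = sigma**2 * np.linalg.inv(X.T @ X)
se_slope_matrix = np.sqrt(cov_theory[1, 1])
se_slope_closed = sigma / np.sqrt(np.sum((x - x.mean())**2))

# Simulation: refit on many independent noise draws, one dataset per row of Y
n_sims = 20000
Y = X @ beta_true + rng.normal(0, sigma, size=(n_sims, len(x)))
betas = np.linalg.lstsq(X, Y.T, rcond=None)[0].T     # shape (n_sims, 2)
cov_empirical = np.cov(betas, rowvar=False)

print("theory:\n", cov_theory)
print("simulation:\n", cov_empirical)
print("SE(slope): matrix", se_slope_matrix, "closed form", se_slope_closed)
```

The two standard-error values agree to rounding error, while the simulated covariance matches theory only up to Monte Carlo noise.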

What controls it. Three things:

Noise level $σ$ : smaller noise → smaller SE. Obvious.
Sample size $n$ : more data → larger sum $\sum (x_{i} - \bar{x})^{2}$ → smaller SE. For a fixed spread of $x$ the sum grows in proportion to $n$ , which gives the standard $1/ \sqrt{n}$ scaling.
Spread of $x$ : data spread over a wide range gives much smaller SE than the same number of points clustered together. This is the experimental-design lever: when you can choose where to measure, spread the measurements out. Two points at the extremes of the design range are vastly more informative for the slope than ten points in a narrow cluster.

Practical implication: in a dose-response experiment, putting your samples at the extreme doses (rather than evenly spaced) maximizes the precision of the slope estimate, at the cost of losing any ability to check for nonlinearity. In the experimental-design literature, this two-point layout is the D-optimal design for a straight-line model.
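The effect of the design on the slope's precision can be made concrete with a small comparison. The three layouts and the range $[0, 10]$ below are assumptions chosen for illustration, and `slope_se` is a hypothetical helper implementing $σ / \sqrt{\sum (x_{i} - \bar{x})^{2}}$ with $σ = 1$ .

```python
import numpy as np

def slope_se(x, sigma=1.0):
    """SE of the OLS slope: sigma / sqrt(sum((x - mean(x))^2))."""
    x = np.asarray(x, dtype=float)
    return sigma / np.sqrt(np.sum((x - x.mean())**2))

n = 10
designs = {
    "two extremes": np.repeat([0.0, 10.0], n // 2),    # half the points at each end
    "evenly spaced": np.linspace(0.0, 10.0, n),
    "narrow cluster": np.linspace(4.5, 5.5, n),
}
for name, x in designs.items():
    print(f"{name:15s} SE = {slope_se(x):.3f}")
```

All three layouts use the same number of measurements and the same noise level; only where the points sit changes.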