Linear Regression from Scratch

Machine Learning

Linear regression is the first thing in every ML textbook and it stays useful through the last chapter. Half the deep-learning tricks reduce to "fancier linear regression in a learned feature space," and a fair amount of physics — including the surrogate fit on the Self-Learning Monte Carlo page — is exactly the OLS estimator dressed in domain notation. So it's worth doing this one carefully.

The setup

You have $n$ data points $(x_{i}, y_{i})$ , with $x_{i} \in R^{p}$ the features and $y_{i} \in R$ the target. The model assumes $y_{i} = x_{i}^{⊤} β + ε_{i}$ for some coefficients $β \in R^{p}$ and noise $ε_{i}$ with mean zero. The job is to estimate $β$ from data — pick the line (or hyperplane) that fits best, where "best" needs a definition.

Stack the data into a design matrix $X \in R^{n \times p}$ with one row per observation, and a vector $y \in R^{n}$ of targets:

y = Xβ + ε

If you want an intercept, prepend a column of ones to $X$ . That folds the intercept into $β$ and lets the whole problem be one matrix equation. If you want quadratic terms, append $x^{2}$ columns. Anything you can compute from the raw features that you then combine linearly is still "linear regression" — the linearity is in $β$ , not in $x$ .

The cost function

"Best fit" needs to mean something. The standard choice is sum of squared residuals:

L (β) = i = 1 \sum n (y_{i} - x_{i}^{⊤} β)^{2} = ∥ y - Xβ ∥_{2}^{2}

Squaring is a real choice — you could use $∣ y - Xβ ∣$ (median regression, robust to outliers), or $(y - Xβ)^{4}$ (penalizes big errors harder), or anything else convex. Squared error wins because: (i) it has a closed-form minimizer, (ii) it's the maximum-likelihood estimator if you assume Gaussian noise, and (iii) it makes the math come out clean. It also makes outliers very expensive — one point with a residual of 10 contributes 100 to the loss, so squared error is fragile when your noise has fat tails.

The normal equations

To minimize $L (β) = ∥ y - Xβ ∥^{2}$ , expand and set the gradient to zero:

\nabla_{β} L = - 2 X^{⊤} (y - Xβ) = 0

which gives the normal equations:

X^{⊤} X β = X^{⊤} y ⟹ \hat{β} = (X^{⊤} X)^{- 1} X^{⊤} y

That's the entire algorithm. Five symbols on the right-hand side, no iteration, no learning rate. As long as $X^{⊤} X$ is invertible — equivalently, as long as your features are linearly independent — there's a unique closed-form minimizer and you can compute it in one step.

The code

"""
Linear regression from scratch.

Goal: fit y ≈ X β by minimizing the sum of squared residuals.

Three short demos:
  1. Simple linear regression on synthetic (x, y) data.
  2. Multivariate fit with two near-duplicate features — shows what
     multicollinearity does to the condition number of XᵀX, and how
     QR factorization stays calm while the normal equations panic.
  3. The same OLS machinery used by the Self-Learning Monte Carlo page
     to fit J_eff from (Q, E) samples.
"""

import numpy as np


def ols_normal_equations(X, y):
    """Closed-form least squares: β = (XᵀX)⁻¹ Xᵀ y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def ols_qr(X, y):
    """Same fit via QR. R β = Qᵀ y is upper-triangular — solve by back-sub."""
    Q, R = np.linalg.qr(X)
    return np.linalg.solve(R, Q.T @ y)


def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# 1. Simple linear regression.   y = 0.5 + 2.0·x + noise
# ---------------------------------------------------------------------
print("=" * 62)
print("1. Simple linear regression")
print("=" * 62)

rng = np.random.default_rng(42)
n  = 50
x  = rng.uniform(-2.0, 2.0, n)
y  = 0.5 + 2.0 * x + rng.normal(0.0, 0.5, n)

X    = np.column_stack([np.ones(n), x])     # design matrix [1, x]
beta = ols_normal_equations(X, y)
y_hat   = X @ beta
sigma   = np.sqrt(((y - y_hat) ** 2).sum() / (n - 2))   # residual SE

print(f"  true (intercept, slope) : ({0.5}, {2.0})")
print(f"  fit  (intercept, slope) : ({beta[0]:+.4f}, {beta[1]:+.4f})")
print(f"  R²                      : {r_squared(y, y_hat):.4f}")
print(f"  residual std error  σ̂   : {sigma:.4f}")


# 2. Multivariate with multicollinearity.
# x₂ ≈ x₁ + tiny noise → XᵀX is nearly singular.
# y = 1 + 2 x₁ + 0 x₂ + 1 x₃ + noise   (x₂ is the spurious one)
# ---------------------------------------------------------------------
print()
print("=" * 62)
print("2. Multivariate fit — what multicollinearity does")
print("=" * 62)

n  = 200
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 1e-7, n)             # x1 and x2 differ at ~7th decimal
x3 = rng.normal(0, 1, n)
y  = 1.0 + 2.0 * x1 + 0.0 * x2 + 1.0 * x3 + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x1, x2, x3])
print(f"  cond(XᵀX)        : {np.linalg.cond(X.T @ X):.2e}")
print(f"  cond(X)          : {np.linalg.cond(X):.2e}")

beta_ne = ols_normal_equations(X, y)
beta_qr = ols_qr(X, y)
print(f"  fit (normal eq)  : {np.array2string(beta_ne, precision=2, suppress_small=True)}")
print(f"  fit (QR)         : {np.array2string(beta_qr, precision=2, suppress_small=True)}")
print(f"  ‖β_ne − β_qr‖∞   : {np.abs(beta_ne - beta_qr).max():.2e}")
print(f"  true coefficients: [1, 2, 0, 1]")
print()
print(f"  β1 + β2 (NE)     : {beta_ne[1] + beta_ne[2]:+.4f}")
print(f"  β1 + β2 (QR)     : {beta_qr[1] + beta_qr[2]:+.4f}")
print(f"  ‖ŷ_NE − ŷ_QR‖∞   : {np.abs(X @ beta_ne - X @ beta_qr).max():.2e}")


# 3. The SLMC J_eff fit.   E_true ≈ −J_eff · Q + c   →   regress E on Q.
# ---------------------------------------------------------------------
print()
print("=" * 62)
print("3. The SLMC J_eff fit, replicated")
print("=" * 62)

n          = 800
J_eff_pop  = 1.2497            # the value the real SLMC fit produced
Q          = rng.normal(180.0, 70.0, n)     # rough scale at T = 2.95, L = 16
E          = -J_eff_pop * Q + rng.normal(0, 15.0, n)

X     = np.column_stack([Q, np.ones(n)])
beta  = ols_normal_equations(X, E)
J_eff = -beta[0]
print(f"  N training samples       : {n}")
print(f"  fit J_eff                : {J_eff:.4f}")
print(f"  population J_eff (target): {J_eff_pop}")
print(f"  R²                       : {r_squared(E, X @ beta):.4f}")

Running it

==============================================================
1. Simple linear regression
==============================================================
  true (intercept, slope) : (0.5, 2.0)
  fit  (intercept, slope) : (+0.4145, +2.0151)
  R²                      : 0.9722
  residual std error  σ̂   : 0.3864

==============================================================
2. Multivariate fit — what multicollinearity does
==============================================================
  cond(XᵀX)        : 3.95e+14
  cond(X)          : 2.02e+07
  fit (normal eq)  : [     0.98 -24076.03  24077.99      1.  ]
  fit (QR)         : [     0.98 -24655.27  24657.23      1.  ]
  ‖β_ne − β_qr‖∞   : 5.79e+02
  true coefficients: [1, 2, 0, 1]

  β1 + β2 (NE)     : +1.9629
  β1 + β2 (QR)     : +1.9629
  ‖ŷ_NE − ŷ_QR‖∞   : 1.49e-04

==============================================================
3. The SLMC J_eff fit, replicated
==============================================================
  N training samples       : 800
  fit J_eff                : 1.2589
  population J_eff (target): 1.2497
  R²                       : 0.9725

Reading the numbers

Example 1. The line was generated with intercept $0.5$ and slope $2.0$ , fit recovers $0.41$ and $2.02$ . $R^{2} = 0.97$ says the model explains 97% of the variance in $y$ . The residual standard error $\overset{σ}{^} \approx 0.39$ is close to the noise we baked in ( $σ = 0.5$ , slightly biased downward because the fit uses up two degrees of freedom). For one feature this is all there is.

Example 2 is the one to dwell on. The data has four features but $x_{2}$ is a copy of $x_{1}$ plus floating-point dust. The condition number of $X^{⊤} X$ blows up to $\sim 1 0^{14}$ , near the double-precision limit. Both fits return huge coefficients of opposite sign for $x_{1}$ and $x_{2}$ — $- 24076$ and $+ 24078$ , or $- 24655$ and $+ 24657$ . The fits disagree by 500 units.

But look at what is stable. The sum $β_{1} + β_{2} \approx 1.96$ matches between the two methods to four decimals. The predictions $\overset{y}{^} = X \hat{β}$ from the two methods agree to $1 0^{- 4}$ . The intercept and $β_{3}$ are bang on. What's happening: when $x_{2} \approx x_{1}$ , you can swap mass between $β_{1}$ and $β_{2}$ freely and get nearly identical predictions. The data constrain the sum, not the split. The "instability" is a feature of the question, not the solver — there's no information in this dataset that could pin down the individual coefficients.

This is the single most important diagnostic to internalize about linear regression. Look at predictions, not coefficients. A model with crazy-looking coefficients can still be perfectly good at predicting. Coefficient instability tells you about identifiability — whether the data could distinguish two competing causal stories — not about model quality.

Example 3 is the same machinery again, but the data is now $(Q, E)$ samples from a 2D Ising simulation. We regress the true energy on the nearest-neighbor bond sum, and recover $J_{eff} \approx 1.26$ . This is exactly what the SLMC page does inside fit_surrogate(): stack columns into a design matrix, hand it to numpy.linalg.lstsq, read off a physical coupling. The same three lines work whether your target is house prices or the renormalized Ising coupling.

Normal equations vs QR

The closed-form $(X^{⊤} X)^{- 1} X^{⊤} y$ is pretty but it has a numerical issue: forming $X^{⊤} X$ squares the condition number of $X$ . If $X$ is ill-conditioned with $κ (X) \sim 1 0^{7}$ , then $κ (X^{⊤} X) \sim 1 0^{14}$ and you're losing 14 digits of precision before you even start solving. The QR factorization avoids the squaring entirely:

X = QR ⟹ R \hat{β} = Q^{⊤} y

where $Q$ has orthonormal columns and $R$ is upper-triangular. You solve $R \hat{β} = Q^{⊤} y$ by back-substitution, never forming $X^{⊤} X$ . The condition number stays at $κ (X)$ . In Example 2 above, you can see this matters — the two methods produce different individual coefficients in the same null direction. (They give the same predictions because the residuals see the same row-space of $X$ .) In production, default to $lstsq$ or QR; reach for $(X^{⊤} X)^{- 1} X^{⊤} y$ only when you genuinely need $X^{⊤} X$ for something else (e.g., the covariance matrix of $\hat{β}$ ).

Diagnostics worth running

R-squared measures the fraction of variance in $y$ explained by the model: $R^{2} = 1 - SS_{res} / SS_{tot}$ . It runs from 0 (model no better than the mean) to 1 (model fits perfectly). Useful, but it always increases when you add features, so it overstates the value of complex models. The "adjusted $R^{2}$ " penalizes for the number of parameters and is more honest.

Residual standard error $\overset{σ}{^} = SS_{res} / (n - p)$ is your estimate of the noise level. Compare it to the spread of $y$ itself; if they're close, the model isn't learning much.

Residuals vs fitted values. Plot $\overset{ε}{^}_{i} = y_{i} - \overset{y}{^}_{i}$ against $\overset{y}{^}_{i}$ . If you see structure — a curve, a fan shape, an outlier — your linear model is missing something. A straight model fit to a curved relationship looks fine in $R^{2}$ but the residual plot screams.

Condition number of $X$ . If $κ (X)$ is large, your coefficient estimates are unstable in the worst directions. The data isn't telling you which of two correlated features matters.

What this doesn't do

It can't handle nonlinearity in $β$ . Linear in $β$ is the whole point. If your true relationship is $y = sin (β_{1} x)$ or $y = β_{1} / (1 + β_{2} x)$ , you need a nonlinear least-squares solver (Levenberg-Marquardt, or gradient descent on a parameterized loss). Linear regression with $sin x$ in the design matrix is fine — that's linear in $β$ . But $sin (β_{1} x)$ is not.

It assumes Gaussian residuals at the maximum-likelihood level. If your noise is heavy-tailed (e.g., counts with occasional spikes), the squared loss makes outliers dominate the fit. Switch to $L_{1}$ regression, robust loss functions (Huber), or a different likelihood entirely (Poisson regression for counts, logistic for binary outcomes — the next page in this series).

It doesn't pick variables for you. If you throw 10,000 features at it on 100 data points, $X^{⊤} X$ isn't invertible and the closed form breaks. You need regularization (ridge: $\hat{β} = (X^{⊤} X + λ I)^{- 1} X^{⊤} y$ ), or sparsity (lasso: $L_{1}$ penalty), or a principled feature-selection step. Default OLS has no preference for simple solutions.

It scales linearly in features but cubically in solve cost. The QR factorization is $O (n p^{2})$ in time and $O (n p)$ in memory. With $p \sim 1 0^{4}$ features this is already minutes per fit and gigabytes of memory; with $p \sim 1 0^{6}$ you can't form $X$ at all. That's where gradient descent earns its keep — it doesn't need the full design matrix in memory, only the ability to compute matrix-vector products. Gradient descent lives in the Optimization section because it's the workhorse for everything iterative on the site (logistic regression, neural networks, but also variational energy minimization in Hartree-Fock, FEM nonlinear residuals, MLE in general).

Core idea

Linear regression turns a least-squares problem into a single matrix equation, $X^{⊤} X \hat{β} = X^{⊤} y$ . The closed-form answer is nice; the closed-form answer through QR is nicer when the design matrix is anywhere near ill-conditioned. Look at predictions rather than coefficients when the condition number is high — coefficient instability is the data telling you what it doesn't constrain.

Exercises

A full exercise set is available for this topic, structured as one worked example + 7 practice problems (across 7 surface contexts) + 2 pattern-resistant check problems.

Open the Linear Regression exercise set → Recognize a problem as "fit a model that is linear in its parameters," set up the design matrix, solve XᵀXβ = Xᵀy by normal equations or QR, and read coefficient stability vs prediction stability from the conditioning of X.

References

Strang, G. (2009). Introduction to Linear Algebra, 4th ed. Wellesley-Cambridge Press. Chapter 4 on projections and least squares is the cleanest derivation of the normal equations.
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM. Lectures 18–19 on least squares and QR are the canonical reference on numerical conditioning.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 3 covers diagnostics, ridge, and lasso in the same notation used here.