Linear Regression from Scratch
Machine Learning
Linear regression is the first thing in every ML textbook and it stays useful through the last chapter. Half the deep-learning tricks reduce to "fancier linear regression in a learned feature space," and a fair amount of physics — including the surrogate fit on the Self-Learning Monte Carlo page — is exactly the OLS estimator dressed in domain notation. So it's worth doing this one carefully.
The setup
You have data points , with the features and the target. The model assumes for some coefficients and noise with mean zero. The job is to estimate from data — pick the line (or hyperplane) that fits best, where "best" needs a definition.
Stack the data into a design matrix with one row per observation, and a vector of targets:
If you want an intercept, prepend a column of ones to . That folds the intercept into and lets the whole problem be one matrix equation. If you want quadratic terms, append columns. Anything you can compute from the raw features that you then combine linearly is still "linear regression" — the linearity is in , not in .
The cost function
"Best fit" needs to mean something. The standard choice is sum of squared residuals:
Squaring is a real choice — you could use (median regression, robust to outliers), or (penalizes big errors harder), or anything else convex. Squared error wins because: (i) it has a closed-form minimizer, (ii) it's the maximum-likelihood estimator if you assume Gaussian noise, and (iii) it makes the math come out clean. It also makes outliers very expensive — one point with a residual of 10 contributes 100 to the loss, so squared error is fragile when your noise has fat tails.
The normal equations
To minimize , expand and set the gradient to zero:
which gives the normal equations:
That's the entire algorithm. Five symbols on the right-hand side, no iteration, no learning rate. As long as is invertible — equivalently, as long as your features are linearly independent — there's a unique closed-form minimizer and you can compute it in one step.
The code
"""
Linear regression from scratch.
Goal: fit y ≈ X β by minimizing the sum of squared residuals.
Three short demos:
1. Simple linear regression on synthetic (x, y) data.
2. Multivariate fit with two near-duplicate features — shows what
multicollinearity does to the condition number of XᵀX, and how
QR factorization stays calm while the normal equations panic.
3. The same OLS machinery used by the Self-Learning Monte Carlo page
to fit J_eff from (Q, E) samples.
"""
import numpy as np
def ols_normal_equations(X, y):
"""Closed-form least squares: β = (XᵀX)⁻¹ Xᵀ y."""
return np.linalg.solve(X.T @ X, X.T @ y)
def ols_qr(X, y):
"""Same fit via QR. R β = Qᵀ y is upper-triangular — solve by back-sub."""
Q, R = np.linalg.qr(X)
return np.linalg.solve(R, Q.T @ y)
def r_squared(y, y_hat):
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
return 1.0 - ss_res / ss_tot
# 1. Simple linear regression. y = 0.5 + 2.0·x + noise
# ---------------------------------------------------------------------
print("=" * 62)
print("1. Simple linear regression")
print("=" * 62)
rng = np.random.default_rng(42)
n = 50
x = rng.uniform(-2.0, 2.0, n)
y = 0.5 + 2.0 * x + rng.normal(0.0, 0.5, n)
X = np.column_stack([np.ones(n), x]) # design matrix [1, x]
beta = ols_normal_equations(X, y)
y_hat = X @ beta
sigma = np.sqrt(((y - y_hat) ** 2).sum() / (n - 2)) # residual SE
print(f" true (intercept, slope) : ({0.5}, {2.0})")
print(f" fit (intercept, slope) : ({beta[0]:+.4f}, {beta[1]:+.4f})")
print(f" R² : {r_squared(y, y_hat):.4f}")
print(f" residual std error σ̂ : {sigma:.4f}")
# 2. Multivariate with multicollinearity.
# x₂ ≈ x₁ + tiny noise → XᵀX is nearly singular.
# y = 1 + 2 x₁ + 0 x₂ + 1 x₃ + noise (x₂ is the spurious one)
# ---------------------------------------------------------------------
print()
print("=" * 62)
print("2. Multivariate fit — what multicollinearity does")
print("=" * 62)
n = 200
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 1e-7, n) # x1 and x2 differ at ~7th decimal
x3 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + 1.0 * x3 + rng.normal(0, 0.3, n)
X = np.column_stack([np.ones(n), x1, x2, x3])
print(f" cond(XᵀX) : {np.linalg.cond(X.T @ X):.2e}")
print(f" cond(X) : {np.linalg.cond(X):.2e}")
beta_ne = ols_normal_equations(X, y)
beta_qr = ols_qr(X, y)
print(f" fit (normal eq) : {np.array2string(beta_ne, precision=2, suppress_small=True)}")
print(f" fit (QR) : {np.array2string(beta_qr, precision=2, suppress_small=True)}")
print(f" ‖β_ne − β_qr‖∞ : {np.abs(beta_ne - beta_qr).max():.2e}")
print(f" true coefficients: [1, 2, 0, 1]")
print()
print(f" β1 + β2 (NE) : {beta_ne[1] + beta_ne[2]:+.4f}")
print(f" β1 + β2 (QR) : {beta_qr[1] + beta_qr[2]:+.4f}")
print(f" ‖ŷ_NE − ŷ_QR‖∞ : {np.abs(X @ beta_ne - X @ beta_qr).max():.2e}")
# 3. The SLMC J_eff fit. E_true ≈ −J_eff · Q + c → regress E on Q.
# ---------------------------------------------------------------------
print()
print("=" * 62)
print("3. The SLMC J_eff fit, replicated")
print("=" * 62)
n = 800
J_eff_pop = 1.2497 # the value the real SLMC fit produced
Q = rng.normal(180.0, 70.0, n) # rough scale at T = 2.95, L = 16
E = -J_eff_pop * Q + rng.normal(0, 15.0, n)
X = np.column_stack([Q, np.ones(n)])
beta = ols_normal_equations(X, E)
J_eff = -beta[0]
print(f" N training samples : {n}")
print(f" fit J_eff : {J_eff:.4f}")
print(f" population J_eff (target): {J_eff_pop}")
print(f" R² : {r_squared(E, X @ beta):.4f}") Running it
==============================================================
1. Simple linear regression
==============================================================
true (intercept, slope) : (0.5, 2.0)
fit (intercept, slope) : (+0.4145, +2.0151)
R² : 0.9722
residual std error σ̂ : 0.3864
==============================================================
2. Multivariate fit — what multicollinearity does
==============================================================
cond(XᵀX) : 3.95e+14
cond(X) : 2.02e+07
fit (normal eq) : [ 0.98 -24076.03 24077.99 1. ]
fit (QR) : [ 0.98 -24655.27 24657.23 1. ]
‖β_ne − β_qr‖∞ : 5.79e+02
true coefficients: [1, 2, 0, 1]
β1 + β2 (NE) : +1.9629
β1 + β2 (QR) : +1.9629
‖ŷ_NE − ŷ_QR‖∞ : 1.49e-04
==============================================================
3. The SLMC J_eff fit, replicated
==============================================================
N training samples : 800
fit J_eff : 1.2589
population J_eff (target): 1.2497
R² : 0.9725 Reading the numbers
Example 1. The line was generated with intercept and slope , fit recovers and . says the model explains 97% of the variance in . The residual standard error is close to the noise we baked in (, slightly biased downward because the fit uses up two degrees of freedom). For one feature this is all there is.
Example 2 is the one to dwell on. The data has four features but is a copy of plus floating-point dust. The condition number of blows up to , near the double-precision limit. Both fits return huge coefficients of opposite sign for and — and , or and . The fits disagree by 500 units.
But look at what is stable. The sum matches between the two methods to four decimals. The predictions from the two methods agree to . The intercept and are bang on. What's happening: when , you can swap mass between and freely and get nearly identical predictions. The data constrain the sum, not the split. The "instability" is a feature of the question, not the solver — there's no information in this dataset that could pin down the individual coefficients.
This is the single most important diagnostic to internalize about linear regression. Look at predictions, not coefficients. A model with crazy-looking coefficients can still be perfectly good at predicting. Coefficient instability tells you about identifiability — whether the data could distinguish two competing causal stories — not about model quality.
Example 3 is the same machinery again, but the data is now samples from a 2D Ising simulation. We regress the true energy on the nearest-neighbor bond sum, and recover . This is exactly what the SLMC page does inside fit_surrogate(): stack columns into a design matrix, hand it to numpy.linalg.lstsq, read off a physical coupling. The same three lines work whether your target is house prices or the renormalized Ising coupling.
Normal equations vs QR
The closed-form is pretty but it has a numerical issue: forming squares the condition number of . If is ill-conditioned with , then and you're losing 14 digits of precision before you even start solving. The QR factorization avoids the squaring entirely:
where has orthonormal columns and is upper-triangular. You solve by back-substitution, never forming . The condition number stays at . In Example 2 above, you can see this matters — the two methods produce different individual coefficients in the same null direction. (They give the same predictions because the residuals see the same row-space of .) In production, default to or QR; reach for only when you genuinely need for something else (e.g., the covariance matrix of ).
Diagnostics worth running
R-squared measures the fraction of variance in explained by the model: . It runs from 0 (model no better than the mean) to 1 (model fits perfectly). Useful, but it always increases when you add features, so it overstates the value of complex models. The "adjusted " penalizes for the number of parameters and is more honest.
Residual standard error is your estimate of the noise level. Compare it to the spread of itself; if they're close, the model isn't learning much.
Residuals vs fitted values. Plot against . If you see structure — a curve, a fan shape, an outlier — your linear model is missing something. A straight model fit to a curved relationship looks fine in but the residual plot screams.
Condition number of . If is large, your coefficient estimates are unstable in the worst directions. The data isn't telling you which of two correlated features matters.
What this doesn't do
It can't handle nonlinearity in . Linear in is the whole point. If your true relationship is or , you need a nonlinear least-squares solver (Levenberg-Marquardt, or gradient descent on a parameterized loss). Linear regression with in the design matrix is fine — that's linear in . But is not.
It assumes Gaussian residuals at the maximum-likelihood level. If your noise is heavy-tailed (e.g., counts with occasional spikes), the squared loss makes outliers dominate the fit. Switch to regression, robust loss functions (Huber), or a different likelihood entirely (Poisson regression for counts, logistic for binary outcomes — the next page in this series).
It doesn't pick variables for you. If you throw 10,000 features at it on 100 data points, isn't invertible and the closed form breaks. You need regularization (ridge: ), or sparsity (lasso: penalty), or a principled feature-selection step. Default OLS has no preference for simple solutions.
It scales linearly in features but cubically in solve cost. The QR factorization is in time and in memory. With features this is already minutes per fit and gigabytes of memory; with you can't form at all. That's where gradient descent earns its keep — it doesn't need the full design matrix in memory, only the ability to compute matrix-vector products. Gradient descent lives in the Optimization section because it's the workhorse for everything iterative on the site (logistic regression, neural networks, but also variational energy minimization in Hartree-Fock, FEM nonlinear residuals, MLE in general).
Core idea
Linear regression turns a least-squares problem into a single matrix equation, . The closed-form answer is nice; the closed-form answer through QR is nicer when the design matrix is anywhere near ill-conditioned. Look at predictions rather than coefficients when the condition number is high — coefficient instability is the data telling you what it doesn't constrain.
Exercises
A full exercise set is available for this topic, structured as one worked example + 7 practice problems (across 7 surface contexts) + 2 pattern-resistant check problems.
Open the Linear Regression exercise setReferences
- Strang, G. (2009). Introduction to Linear Algebra, 4th ed. Wellesley-Cambridge Press. Chapter 4 on projections and least squares is the cleanest derivation of the normal equations.
- Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM. Lectures 18–19 on least squares and QR are the canonical reference on numerical conditioning.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 3 covers diagnostics, ridge, and lasso in the same notation used here.