PCA From Scratch
Interview Prep
The problem
Implement Principal Component Analysis: given data
X of shape (N, D) and a target dimension
k, return the top-k projection. The
interviewer will want both eigendecomposition and SVD versions,
and will follow up with: which is more numerically stable? when
does each fail?
The objective
PCA finds the k orthonormal directions in
R^D that maximize the variance of projected data
(equivalently: minimize the reconstruction error). Those directions
are the top-k eigenvectors of the data covariance —
or, equivalently, the top-k right singular vectors of
the centered data matrix.
Pattern: center first, then SVD
Step 1 of any PCA is to subtract the per-feature mean. Forgetting this is the most common bug — PCA on uncentered data finds the direction that points from the origin toward the data cloud's centre, which is rarely the direction of meaningful variance.
Step 2 has two options. The eigendecomposition
forms Xᵀ X / (N − 1) (a D × D covariance
matrix) and finds its eigenvectors. This is fine if
D ≤ N and the data isn't poorly conditioned, but
forming Xᵀ X squares the condition number — small
singular values become invisible. The SVD of
X directly avoids this squaring and is the
industrial-strength choice.
SVD-based PCA (preferred)
def pca_svd(X: np.ndarray, k: int):
"""SVD on the centered data. Preferred — numerically stable, avoids forming X^T X."""
mean = X.mean(axis=0, keepdims=True)
Xc = X - mean
n = Xc.shape[0]
# full_matrices=False yields U: (N, r), S: (r,), Vt: (r, D), r = min(N, D)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# Components are rows of Vt (right singular vectors).
components = Vt[:k] # (k, D)
# Singular values relate to eigenvalues of covariance: λ_i = S_i^2 / (n - 1)
explained = (S[:k] ** 2) / (n - 1)
Z = Xc @ components.T # (N, k)
return Z, components.T, explained, mean Trace (shapes)
Suppose X is shape (1000, 50): 1000 samples of 50-dim data, with strong correlation.
Step 1: mean shape (1, 50); Xc shape (1000, 50).
Step 2: SVD(Xc) gives U (1000, 50), S (50,), Vt (50, 50).
Step 3: keep top k=2 -> components.T shape (50, 2).
Step 4: Z = Xc @ components.T shape (1000, 2).
If the top 2 singular values capture e.g. 92% of the variance
( sum(S[:2]**2) / sum(S**2) = 0.92 ), then reconstruction:
X_hat = Z @ components.T + mean
preserves 92% of the variance in 2-D instead of 50-D. Eigendecomposition variant
import numpy as np
def pca_eig(X: np.ndarray, k: int):
"""Eigendecomposition of the covariance matrix. Works when D <= N."""
# 1) Center
mean = X.mean(axis=0, keepdims=True)
Xc = X - mean
n = Xc.shape[0]
# 2) Covariance (D x D), then symmetric eigendecomposition
cov = (Xc.T @ Xc) / (n - 1) # sample covariance
eigvals, eigvecs = np.linalg.eigh(cov) # ASC order
# 3) Take top-k by eigenvalue
idx = np.argsort(eigvals)[::-1][:k]
components = eigvecs[:, idx] # (D, k)
explained = eigvals[idx]
# 4) Project
Z = Xc @ components
return Z, components, explained, mean
Note the tight relationship between the two: if
X = UΣVᵀ, then
XᵀX = V Σ² Vᵀ. So the eigenvalues of the covariance
are the squared singular values divided by N − 1; the
eigenvectors are the right singular vectors V. They
are literally the same answer up to numerical precision.
Reconstruction (inverse transform)
def inverse_transform(Z, components, mean):
"""Reconstruct in the original space (up to information lost by truncation)."""
return Z @ components.T + mean Complexity
- SVD-based:
O(N · D · min(N, D))for full SVD;O(N · D · k)with randomized / truncated SVD. - Eigendecomposition-based:
O(N · D²)to form the covariance +O(D³)to decompose. - Space:
O(N · D + D · k).
Practical gotchas
PCA is not scale-invariant. Features measured in different
units (e.g., income in dollars, age in years) will be dominated by
the largest-scale feature. The standard fix is to standardize
(x ← (x − μ) / σ) per feature before running PCA —
this gives "PCA on the correlation matrix" rather than covariance.
Always ask whether to standardize in interviews.
Variations worth knowing
- Randomized SVD: approximate the top-k decomposition
in
O(N · D · k)via random projections. Standard for huge sparse matrices;sklearn'sTruncatedSVDuses it. - Incremental PCA: update components as new data arrives, without re-decomposing from scratch. Essential for streaming or out-of-memory workloads.
- Kernel PCA: apply PCA in a feature space defined
by a kernel function — captures nonlinear structure. Cost grows
with
N, notD. - Probabilistic PCA: a generative model where PCA arises as the maximum-likelihood solution under Gaussian noise. Generalizes naturally to factor analysis and EM training.
- ICA (Independent Component Analysis): looks similar but optimizes for statistical independence rather than decorrelation. Used in signal separation (e.g., audio sources, EEG artefacts).
- Autoencoders: a linear autoencoder with MSE loss is exactly equivalent to PCA. Nonlinear autoencoders generalize it to learned nonlinear projections.