PCA From Scratch

Interview Prep

LegendaryML Engineeringmllinear-algebranumpy

The problem

Implement Principal Component Analysis: given data X of shape (N, D) and a target dimension k, return the top-k projection. The interviewer will want both eigendecomposition and SVD versions, and will follow up with: which is more numerically stable? when does each fail?

The objective

PCA finds the k orthonormal directions in R^D that maximize the variance of projected data (equivalently: minimize the reconstruction error). Those directions are the top-k eigenvectors of the data covariance — or, equivalently, the top-k right singular vectors of the centered data matrix.

Pattern: center first, then SVD

Step 1 of any PCA is to subtract the per-feature mean. Forgetting this is the most common bug — PCA on uncentered data finds the direction that points from the origin toward the data cloud's centre, which is rarely the direction of meaningful variance.

Step 2 has two options. The eigendecomposition forms Xᵀ X / (N − 1) (a D × D covariance matrix) and finds its eigenvectors. This is fine if D ≤ N and the data isn't poorly conditioned, but forming Xᵀ X squares the condition number — small singular values become invisible. The SVD of X directly avoids this squaring and is the industrial-strength choice.

SVD-based PCA (preferred)

def pca_svd(X: np.ndarray, k: int):
    """SVD on the centered data.  Preferred — numerically stable, avoids forming X^T X."""
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean
    n = Xc.shape[0]

    # full_matrices=False yields U: (N, r), S: (r,), Vt: (r, D), r = min(N, D)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Components are rows of Vt (right singular vectors).
    components = Vt[:k]                     # (k, D)
    # Singular values relate to eigenvalues of covariance: λ_i = S_i^2 / (n - 1)
    explained = (S[:k] ** 2) / (n - 1)

    Z = Xc @ components.T                   # (N, k)
    return Z, components.T, explained, mean

Trace (shapes)

Suppose X is shape (1000, 50): 1000 samples of 50-dim data, with strong correlation.

Step 1: mean shape (1, 50);  Xc shape (1000, 50).
Step 2: SVD(Xc) gives U (1000, 50), S (50,), Vt (50, 50).
Step 3: keep top k=2 -> components.T shape (50, 2).
Step 4: Z = Xc @ components.T shape (1000, 2).

If the top 2 singular values capture e.g. 92% of the variance
( sum(S[:2]**2) / sum(S**2) = 0.92 ), then reconstruction:
  X_hat = Z @ components.T + mean
preserves 92% of the variance in 2-D instead of 50-D.

Eigendecomposition variant

import numpy as np

def pca_eig(X: np.ndarray, k: int):
    """Eigendecomposition of the covariance matrix.  Works when D <= N."""
    # 1) Center
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean
    n = Xc.shape[0]

    # 2) Covariance (D x D), then symmetric eigendecomposition
    cov = (Xc.T @ Xc) / (n - 1)            # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # ASC order

    # 3) Take top-k by eigenvalue
    idx = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, idx]            # (D, k)
    explained = eigvals[idx]

    # 4) Project
    Z = Xc @ components
    return Z, components, explained, mean

Note the tight relationship between the two: if X = UΣVᵀ, then XᵀX = V Σ² Vᵀ. So the eigenvalues of the covariance are the squared singular values divided by N − 1; the eigenvectors are the right singular vectors V. They are literally the same answer up to numerical precision.

Reconstruction (inverse transform)

def inverse_transform(Z, components, mean):
    """Reconstruct in the original space (up to information lost by truncation)."""
    return Z @ components.T + mean

Complexity

SVD-based: O(N · D · min(N, D)) for full SVD; O(N · D · k) with randomized / truncated SVD.
Eigendecomposition-based: O(N · D²) to form the covariance + O(D³) to decompose.
Space: O(N · D + D · k).

Practical gotchas

PCA is not scale-invariant. Features measured in different units (e.g., income in dollars, age in years) will be dominated by the largest-scale feature. The standard fix is to standardize (x ← (x − μ) / σ) per feature before running PCA — this gives "PCA on the correlation matrix" rather than covariance. Always ask whether to standardize in interviews.

Variations worth knowing

Randomized SVD: approximate the top-k decomposition in O(N · D · k) via random projections. Standard for huge sparse matrices; sklearn's TruncatedSVD uses it.
Incremental PCA: update components as new data arrives, without re-decomposing from scratch. Essential for streaming or out-of-memory workloads.
Kernel PCA: apply PCA in a feature space defined by a kernel function — captures nonlinear structure. Cost grows with N, not D.
Probabilistic PCA: a generative model where PCA arises as the maximum-likelihood solution under Gaussian noise. Generalizes naturally to factor analysis and EM training.
ICA (Independent Component Analysis): looks similar but optimizes for statistical independence rather than decorrelation. Used in signal separation (e.g., audio sources, EEG artefacts).
Autoencoders: a linear autoencoder with MSE loss is exactly equivalent to PCA. Nonlinear autoencoders generalize it to learned nonlinear projections.