Singular Value Decomposition

Linear Algebra

The singular value decomposition factors any real matrix $A \in R^{m \times n}$ as

A = U Σ V^{T},

with $U \in R^{m \times m}$ orthogonal, $V \in R^{n \times n}$ orthogonal, and $Σ \in R^{m \times n}$ rectangular-diagonal with non-negative entries $σ_{1} \geq σ_{2} \geq \dots \geq 0$ on the diagonal. The $σ_{i}$ are the SINGULAR VALUES, the columns of $U$ and $V$ are the LEFT and RIGHT singular vectors. Unlike eigendecomposition, which doesn't exist for non-diagonalizable matrices and is awkward for rectangular ones, the SVD exists for every matrix and is essentially unique (up to sign choices and the freedom to rotate within singular subspaces of repeated singular values).

The decomposition is THE foundational object of linear algebra in the same sense the eigendecomposition is foundational for square Hermitian matrices: it diagonalizes the action of $A$ in a way that respects the inner product on both sides. Almost every theorem about norms, condition numbers, low-rank approximation, and pseudoinverses ultimately reduces to a statement about singular values.

Geometric picture

Read $A x = U Σ V^{T} x$ right-to-left, as three sequential transformations of $x$ : (1) $V^{T} x$ rotates the coordinate frame so the input is expressed in the right-singular-vector basis; (2) $Σ$ stretches each axis by the corresponding $σ_{i}$ (or zeros it out, if $σ_{i} = 0$ ); (3) $U$ rotates again into the left-singular-vector frame. Every linear map is — geometrically — "rotate, stretch along axes, rotate again". The singular values measure the stretching; the singular vectors are the special axes along which the stretching is pure (no shear, no cross-coupling).

Two consequences fall out immediately. $∥ A ∥_{2} = σ_{1}$ — the operator 2-norm equals the largest singular value, since the maximum stretching factor is what the operator norm measures. And $rank (A)$ equals the number of non-zero singular values, since $Σ$ "destroys" any input direction whose $σ_{i} = 0$ .

Connection to eigendecomposition

For any $A$ , the matrices $A^{T} A$ and $A A^{T}$ are symmetric positive semi-definite, so they have orthonormal eigendecompositions. Substituting $A = U Σ V^{T}$ :

A^{T} A = V Σ^{T} Σ V^{T} = V diag (σ_{i}^{2}) V^{T},

A A^{T} = U Σ Σ^{T} U^{T} = U diag (σ_{i}^{2}) U^{T} .

So: $V$ is the eigenvector matrix of $A^{T} A$ , $U$ the eigenvector matrix of $A A^{T}$ , and the squared singular values are the (shared) eigenvalues of either. This is the textbook proof of existence — but it is also a TERRIBLE way to compute the SVD numerically, because forming $A^{T} A$ squares the condition number. Real SVD algorithms (Golub-Kahan bidiagonalisation followed by an implicit QR sweep) work on $A$ directly and never form $A^{T} A$ .

Eckart-Young: best low-rank approximation

The most-used fact about the SVD. Truncating the decomposition to the top $k$ singular triples,

A_{k} = i = 1 \sum k σ_{i} u_{i} v_{i}^{T},

gives the unique best rank- $k$ approximation to $A$ in BOTH the spectral (operator 2-) norm and the Frobenius norm:

rank (B) \leq k min ∥ A - B ∥_{2} = σ_{k + 1}, rank (B) \leq k min ∥ A - B ∥_{F} = \sum_{i > k} σ_{i}^{2} .

The Eckart-Young theorem (1936) is the foundation of every "compress this matrix" technique: image compression, PCA / principal components, recommender systems' latent factors, model order reduction in PDEs. If $A$ 's singular values decay fast, you can throw away almost all of them and lose almost nothing — the bound says exactly how much you lose.

The pseudoinverse

For a rectangular or rank-deficient $A$ , the Moore-Penrose pseudoinverse $A^{+}$ is given by:

A^{+} = V Σ^{+} U^{T}, where (Σ^{+})_{ii} = {1/ σ_{i} 0 σ_{i} > 0 σ_{i} = 0 .

For least-squares $min_{x} ∥ A x - b ∥$ , the minimum-norm solution is $x = A^{+} b$ . Computing it via SVD is the most numerically stable least-squares route — markedly better than the normal equations $(A^{T} A) x = A^{T} b$ when $A$ is ill-conditioned, because forming $A^{T} A$ doubles the condition number whereas SVD works directly on $A$ . The TRUNCATED pseudoinverse (drop singular values below some tolerance) is the standard regularization trick for ill-posed inverse problems.

Code

# SVD decomposition of a near-low-rank matrix, plus Eckart-Young
# truncation and the connection to A^T A.

import numpy as np

np.random.seed(1)
n, r = 40, 4

# Build a rank-r matrix from orthonormal U, V and known singular values
U_true = np.linalg.qr(np.random.randn(n, r))[0]
V_true = np.linalg.qr(np.random.randn(n, r))[0]
sigma_true = np.array([10.0, 4.0, 2.0, 0.5])
A_lr = U_true @ np.diag(sigma_true) @ V_true.T

# Add small Gaussian noise — matrix becomes full rank but nearly low-rank
A = A_lr + 0.01 * np.random.randn(n, n)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
print("Top 8 singular values:")
print(f"  {S[:8]}")
print("(First 4 close to [10, 4, 2, 0.5]; the rest are O(noise).)")

# Eckart-Young: best rank-k approximation in spectral and Frobenius norm
def truncate(U, S, Vt, k):
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

print("\nBest-rank-k approximation: ||A - A_k||_F  vs  sqrt(sum_{i>k} sigma_i^2):")
for k in [1, 2, 3, 4, 5, 10]:
    Ak = truncate(U, S, Vt, k)
    err_actual = np.linalg.norm(A - Ak, 'fro')
    err_theory = np.sqrt(np.sum(S[k:]**2))
    print(f"  k={k:2d}: {err_actual:.4f}  vs  {err_theory:.4f}")

# Connection to eigendecomposition of A^T A
eigs = np.linalg.eigvalsh(A.T @ A)
print(f"\nsigma_max from SVD      = {S[0]:.6f}")
print(f"sqrt(lambda_max of A^T A) = {np.sqrt(eigs[-1]):.6f}")

Output:

Top 8 singular values:
  [9.98948824 4.00100258 1.98685247 0.51553587 0.11850767 0.11375271
 0.10832342 0.10496257]
(First 4 close to [10, 4, 2, 0.5]; the rest are O(noise).)

Best-rank-k approximation: ||A - A_k||_F  vs  sqrt(sum_{i>k} sigma_i^2):
  k= 1: 4.5117  vs  4.5117
  k= 2: 2.0851  vs  2.0851
  k= 3: 0.6325  vs  0.6325
  k= 4: 0.3664  vs  0.3664
  k= 5: 0.3467  vs  0.3467
  k=10: 0.2603  vs  0.2603

sigma_max from SVD      = 9.989488
sqrt(lambda_max of A^T A) = 9.989488

Three things worth noticing. (1) The top four singular values recover the embedded ones $[10, 4, 2, 0.5]$ very accurately; the remaining 36 are noise-floor at $\sim 1 0^{- 1}$ . (2) The Frobenius error of the rank- $k$ truncation matches the Eckart-Young theoretical value to machine precision at every $k$ . (3) The largest singular value equals the square root of the largest eigenvalue of $A^{T} A$ to six decimal places — the SVD/eigendecomposition connection in action.

What it's used for

Least squares and regression. Solve $A x \approx b$ via the truncated pseudoinverse; this is what numpy's np.linalg.lstsq does. Numerically stable even when $A$ is rank-deficient.
PCA / principal components. The principal components of a centred data matrix $X$ are the right singular vectors of $X$ ; the explained variances are $σ_{i}^{2} / (n - 1)$ . Computing PCA via SVD on $X$ directly is the standard recipe — no covariance-matrix construction needed.
Image / data compression. Store only the top $k$ singular triples. Reconstruction error bounded by $σ_{k + 1}$ .
Determining rank in finite precision. The numerical rank is the number of singular values above some tolerance (typically $σ_{1} \cdot ϵ_{mach} \cdot max (m, n)$ ). Counting eigenvalues of $A^{T} A$ gives the same answer in principle but loses precision.
Latent-factor models. Recommender systems, topic models, embeddings — all built on truncated SVD or its randomised / streaming variants.
Model order reduction. PDE simulations: project a high-dim state onto the top singular vectors of a snapshot matrix. The reduced model captures the dominant dynamics for orders of magnitude less work.

Computing the SVD

The standard dense algorithm (Golub-Kahan-Reinsch, 1965-1970) has two phases. (1) Reduce $A$ to upper bidiagonal form via Householder reflections from both sides — analogous to the Hessenberg reduction inside symmetric eigensolvers. (2) Iteratively diagonalize the bidiagonal matrix via implicit QR sweeps. Cost: $O (mn min (m, n))$ for an $m \times n$ matrix. For very large or sparse problems, RANDOMISED SVD (Halko-Martinsson-Tropp 2009) computes a near-optimal rank- $k$ approximation in roughly $O (mnk)$ , using a random projection followed by a small SVD.

Householder QR — the reflections used to reduce $A$ to bidiagonal form during SVD computation.
Condition number — $κ (A) = σ_{1} / σ_{n}$ , a property the SVD makes manifest.
Lanczos iteration — applied to $A^{T} A$ (without forming it), gives a streaming way to compute the largest few singular values of a sparse matrix.