Conjugate Gradient
Linear Algebra
Conjugate gradient (CG) is the canonical iterative solver for when is symmetric positive definite. It uses no matrix factorizations, touches only through mat-vecs, and at iteration produces the BEST approximation to in the Krylov subspace , measured in the -norm. The algorithm needs four -vectors of storage regardless of how long it runs. Combined with a decent preconditioner it dominates large sparse SPD problems: finite-element solvers, Poisson equations on regular grids, Gaussian-process regression, optimization subproblems where the Hessian is SPD.
The quadratic-minimization picture
For SPD , solving is equivalent to minimizing the strictly convex quadratic
The gradient is the negative residual; the unique minimizer is the solution. Steepest descent minimizes by taking the current residual as the search direction and moving along it to the line minimum. It works, but it's slow: successive search directions are forced to be ORTHOGONAL (a property of exact line-search on a convex quadratic), and orthogonality is a weak structural constraint that lets the trajectory zig-zag along narrow valleys when is ill-conditioned. The convergence rate of steepest descent is per iteration, where is the condition number.
CG replaces orthogonality with the stronger -CONJUGACY: search directions are chosen so that for . Geometrically, the directions are orthogonal under the inner product — the natural metric on the level sets of . The payoff: exact-arithmetic CG reaches the exact minimizer in at most iterations, and the per-iteration improvement scales like — a quadratic improvement over steepest descent.
Why the recursion is short
Building a full -conjugate basis explicitly via Gram-Schmidt-with--metric would cost per step. CG's brilliance is that the recursion is only THREE TERMS — analogous to symmetric Lanczos. The new search direction is built from the current residual and the previous direction:
and the iteration along uses the exact line-search step:
The miracle is that these three short formulas produce search directions that are PAIRWISE -CONJUGATE, and residuals that are PAIRWISE ORTHOGONAL. Both properties can be proven by induction; they emerge "for free" from the SPD structure. The residuals are, up to scaling, exactly the Lanczos basis vectors of . CG is Lanczos in disguise, with the basis never stored.
The algorithm
# Classical conjugate gradient. Solves A x = b for symmetric positive
# definite A. Storage: four n-vectors (x, r, p, Ap). Per iteration: one
# mat-vec, two dot products, three saxpy-like updates. No matrix
# factorizations; A is only used through its action.
import numpy as np
def cg(A, b, x0=None, tol=1e-10, max_iter=None):
n = A.shape[0]
x = np.zeros(n) if x0 is None else x0.copy()
r = b - A @ x # initial residual
p = r.copy() # initial search direction
rs_old = r @ r
history = [np.sqrt(rs_old)]
max_iter = max_iter or n
for k in range(max_iter):
Ap = A @ p
alpha = rs_old / (p @ Ap) # step length along p
x = x + alpha * p
r = r - alpha * Ap
rs_new = r @ r
history.append(np.sqrt(rs_new))
if np.sqrt(rs_new) < tol:
return x, history
beta = rs_new / rs_old # update for the search direction
p = r + beta * p # A-conjugate to all previous p's (proof: induction)
rs_old = rs_new
return x, history
# ─── Test 1: well-conditioned SPD ────────────────────────────────────────
np.random.seed(2)
n = 100
Q, _ = np.linalg.qr(np.random.randn(n, n))
eigs = np.linspace(1.0, 50.0, n) # condition number = 50
A = Q @ np.diag(eigs) @ Q.T
x_true = np.random.randn(n)
b = A @ x_true
x_cg, hist = cg(A, b, tol=1e-12)
print(f"n = {n}, kappa(A) = {eigs.max()/eigs.min():.1f}")
print(f" ||x_cg - x_true|| / ||x_true|| = {np.linalg.norm(x_cg - x_true)/np.linalg.norm(x_true):.2e}")
print(f" Iterations to ||r|| < 1e-12 = {len(hist) - 1}")
for k in [0, 1, 2, 5, 10, 20, len(hist) - 1]:
if k < len(hist):
print(f" iter {k:3d}: ||r|| = {hist[k]:.4e}")
# ─── Test 2: ill-conditioned, same problem ─────────────────────────────
eigs_bad = np.geomspace(1.0, 1e6, n) # condition number = 1e6
A_bad = Q @ np.diag(eigs_bad) @ Q.T
b_bad = A_bad @ x_true
_, hist_bad = cg(A_bad, b_bad, tol=1e-8, max_iter=2000)
print(f"\nSame system with kappa(A) = 1e6:")
print(f" Iterations to ||r|| < 1e-8 = {len(hist_bad) - 1}")
print(f" (Convergence rate ~ sqrt(kappa), so ~1000x more iters.)") Output on a 100×100 SPD system:
n = 100, kappa(A) = 50.0
||x_cg - x_true|| / ||x_true|| = 5.83e-15
Iterations to ||r|| < 1e-12 = 68
iter 0: ||r|| = 2.7197e+02
iter 1: ||r|| = 7.0290e+01
iter 2: ||r|| = 3.0827e+01
iter 5: ||r|| = 5.6963e+00
iter 10: ||r|| = 1.0770e+00
iter 20: ||r|| = 9.3834e-02
iter 68: ||r|| = 8.2629e-13
Same system with kappa(A) = 1e6:
Iterations to ||r|| < 1e-8 = 1432
(Convergence rate ~ sqrt(kappa), so ~1000x more iters.) Two things to read off. (1) On a well-conditioned () problem CG converges to in 68 iterations and recovers to relative error — full double precision. (2) Push the condition number to on the same matrix structure and convergence becomes ~20x slower (1432 vs 68 iterations), consistent with the theoretical rate. That single fact — convergence governed by , not — is why CG is the default solver for SPD problems and why PRECONDITIONING (which reduces ) is essential for hard problems.
Convergence bound
The classical bound, due to Hestenes-Stiefel and refined by many, is:
The error reduction is in the -norm , not the Euclidean norm — the residual norm we monitor in practice can fluctuate even while the -norm error decreases monotonically. For the bound simplifies to : to reduce error by , you need iterations. The bound is pessimistic in many cases — CG can exploit "clustered eigenvalues" and finish much faster — but the scaling is sharp.
Preconditioning
The price of CG's dependence is that ill-conditioned problems still need many iterations. A PRECONDITIONER approximates , and we solve the equivalent system with smaller condition number . The PCG algorithm threads through the recursion so the symmetry is preserved; the only extra work is one -solve per iteration.
Choosing is the whole art: it should be cheap to invert and close to . Common choices include diagonal (Jacobi), incomplete Cholesky, sparse approximate inverse, multigrid V-cycle, and domain-decomposition preconditioners. The "perfect" preconditioner is itself — converges in one iteration but is as expensive as solving the original problem.
What it's used for
- Finite-element / finite-difference PDE solves. The stiffness matrix is SPD for elliptic PDEs (Poisson, elasticity at small deformation). PCG with a multigrid or incomplete-Cholesky preconditioner is the standard solver.
- Normal equations . The Gauss-Newton step in nonlinear least squares; also a standard linear least-squares formulation, though numerically inferior to SVD-based approaches.
- Gaussian-process inference. The covariance matrix is SPD; computing and via CG plus Lanczos quadrature avoids the Cholesky.
- Newton subproblems in convex optimization. The Hessian is SPD at a minimum; CG with early termination yields a truncated-Newton or "Hessian-free" method.
- Reservoir simulation, image deblurring, electromagnetic inverse problems — wherever large sparse SPD systems appear.
Variants for the non-SPD case
CG requires SPD. Generalizations relax this in different directions: MINRES handles symmetric indefinite, GMRES handles fully non-symmetric (built on Arnoldi), BiCG and BiCGSTAB use bi-orthogonal Lanczos for non-symmetric problems with short recursions. Each makes a different cost-vs-robustness tradeoff.
Related
- Lanczos iteration — CG and Lanczos are mathematically the same algorithm; CG just avoids storing the basis.
- Condition number — the single quantity governing CG's convergence rate.
- Bi-orthogonal Lanczos — the non-symmetric analogue, behind BiCG / BiCGSTAB.