Conjugate Gradient

Linear Algebra

Conjugate gradient (CG) is the canonical iterative solver for when is symmetric positive definite. It uses no matrix factorizations, touches only through mat-vecs, and at iteration produces the BEST approximation to in the Krylov subspace , measured in the -norm. The algorithm needs four -vectors of storage regardless of how long it runs. Combined with a decent preconditioner it dominates large sparse SPD problems: finite-element solvers, Poisson equations on regular grids, Gaussian-process regression, optimization subproblems where the Hessian is SPD.

The quadratic-minimization picture

For SPD , solving is equivalent to minimizing the strictly convex quadratic

The gradient is the negative residual; the unique minimizer is the solution. Steepest descent minimizes by taking the current residual as the search direction and moving along it to the line minimum. It works, but it's slow: successive search directions are forced to be ORTHOGONAL (a property of exact line-search on a convex quadratic), and orthogonality is a weak structural constraint that lets the trajectory zig-zag along narrow valleys when is ill-conditioned. The convergence rate of steepest descent is per iteration, where is the condition number.

CG replaces orthogonality with the stronger -CONJUGACY: search directions are chosen so that for . Geometrically, the directions are orthogonal under the inner product — the natural metric on the level sets of . The payoff: exact-arithmetic CG reaches the exact minimizer in at most iterations, and the per-iteration improvement scales like — a quadratic improvement over steepest descent.

Why the recursion is short

Building a full -conjugate basis explicitly via Gram-Schmidt-with--metric would cost per step. CG's brilliance is that the recursion is only THREE TERMS — analogous to symmetric Lanczos. The new search direction is built from the current residual and the previous direction:

and the iteration along uses the exact line-search step:

The miracle is that these three short formulas produce search directions that are PAIRWISE -CONJUGATE, and residuals that are PAIRWISE ORTHOGONAL. Both properties can be proven by induction; they emerge "for free" from the SPD structure. The residuals are, up to scaling, exactly the Lanczos basis vectors of . CG is Lanczos in disguise, with the basis never stored.

The algorithm

# Classical conjugate gradient. Solves A x = b for symmetric positive
# definite A. Storage: four n-vectors (x, r, p, Ap). Per iteration: one
# mat-vec, two dot products, three saxpy-like updates. No matrix
# factorizations; A is only used through its action.

import numpy as np

def cg(A, b, x0=None, tol=1e-10, max_iter=None):
    n = A.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                  # initial residual
    p = r.copy()                   # initial search direction
    rs_old = r @ r
    history = [np.sqrt(rs_old)]
    max_iter = max_iter or n
    for k in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)  # step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        history.append(np.sqrt(rs_new))
        if np.sqrt(rs_new) < tol:
            return x, history
        beta = rs_new / rs_old     # update for the search direction
        p = r + beta * p           # A-conjugate to all previous p's (proof: induction)
        rs_old = rs_new
    return x, history

# ─── Test 1: well-conditioned SPD ────────────────────────────────────────
np.random.seed(2)
n = 100
Q, _ = np.linalg.qr(np.random.randn(n, n))
eigs = np.linspace(1.0, 50.0, n)          # condition number = 50
A = Q @ np.diag(eigs) @ Q.T
x_true = np.random.randn(n)
b = A @ x_true

x_cg, hist = cg(A, b, tol=1e-12)
print(f"n = {n}, kappa(A) = {eigs.max()/eigs.min():.1f}")
print(f"  ||x_cg - x_true|| / ||x_true|| = {np.linalg.norm(x_cg - x_true)/np.linalg.norm(x_true):.2e}")
print(f"  Iterations to ||r|| < 1e-12     = {len(hist) - 1}")
for k in [0, 1, 2, 5, 10, 20, len(hist) - 1]:
    if k < len(hist):
        print(f"    iter {k:3d}:  ||r|| = {hist[k]:.4e}")

# ─── Test 2: ill-conditioned, same problem ─────────────────────────────
eigs_bad = np.geomspace(1.0, 1e6, n)       # condition number = 1e6
A_bad = Q @ np.diag(eigs_bad) @ Q.T
b_bad = A_bad @ x_true
_, hist_bad = cg(A_bad, b_bad, tol=1e-8, max_iter=2000)
print(f"\nSame system with kappa(A) = 1e6:")
print(f"  Iterations to ||r|| < 1e-8 = {len(hist_bad) - 1}")
print(f"  (Convergence rate ~ sqrt(kappa), so ~1000x more iters.)")

Output on a 100×100 SPD system:

n = 100, kappa(A) = 50.0
  ||x_cg - x_true|| / ||x_true|| = 5.83e-15
  Iterations to ||r|| < 1e-12     = 68
    iter   0:  ||r|| = 2.7197e+02
    iter   1:  ||r|| = 7.0290e+01
    iter   2:  ||r|| = 3.0827e+01
    iter   5:  ||r|| = 5.6963e+00
    iter  10:  ||r|| = 1.0770e+00
    iter  20:  ||r|| = 9.3834e-02
    iter  68:  ||r|| = 8.2629e-13

Same system with kappa(A) = 1e6:
  Iterations to ||r|| < 1e-8 = 1432
  (Convergence rate ~ sqrt(kappa), so ~1000x more iters.)

Two things to read off. (1) On a well-conditioned () problem CG converges to in 68 iterations and recovers to relative error — full double precision. (2) Push the condition number to on the same matrix structure and convergence becomes ~20x slower (1432 vs 68 iterations), consistent with the theoretical rate. That single fact — convergence governed by , not — is why CG is the default solver for SPD problems and why PRECONDITIONING (which reduces ) is essential for hard problems.

Convergence bound

The classical bound, due to Hestenes-Stiefel and refined by many, is:

The error reduction is in the -norm , not the Euclidean norm — the residual norm we monitor in practice can fluctuate even while the -norm error decreases monotonically. For the bound simplifies to : to reduce error by , you need iterations. The bound is pessimistic in many cases — CG can exploit "clustered eigenvalues" and finish much faster — but the scaling is sharp.

Preconditioning

The price of CG's dependence is that ill-conditioned problems still need many iterations. A PRECONDITIONER approximates , and we solve the equivalent system with smaller condition number . The PCG algorithm threads through the recursion so the symmetry is preserved; the only extra work is one -solve per iteration.

Choosing is the whole art: it should be cheap to invert and close to . Common choices include diagonal (Jacobi), incomplete Cholesky, sparse approximate inverse, multigrid V-cycle, and domain-decomposition preconditioners. The "perfect" preconditioner is itself — converges in one iteration but is as expensive as solving the original problem.

What it's used for

Variants for the non-SPD case

CG requires SPD. Generalizations relax this in different directions: MINRES handles symmetric indefinite, GMRES handles fully non-symmetric (built on Arnoldi), BiCG and BiCGSTAB use bi-orthogonal Lanczos for non-symmetric problems with short recursions. Each makes a different cost-vs-robustness tradeoff.

Related