Conjugate Gradient

Linear Algebra

Conjugate gradient (CG) is the canonical iterative solver for $A x = b$ when $A$ is symmetric positive definite (SPD). It uses no matrix factorizations, touches $A$ only through mat-vecs, and at iteration $k$ produces the BEST approximation to $x$ in the Krylov subspace $K_{k} (A, b)$ (for a zero initial guess), measured in the $A$-norm. The algorithm needs four $n$-vectors of storage regardless of how long it runs. Combined with a decent preconditioner it dominates large sparse SPD problems: finite-element solvers, Poisson equations on regular grids, Gaussian-process regression, and optimization subproblems where the Hessian is SPD.

The quadratic-minimization picture

For SPD $A$ , solving $A x = b$ is equivalent to minimizing the strictly convex quadratic

ϕ (x) = \frac{1}{2} x^{T} A x - b^{T} x .

The gradient $\nabla ϕ (x) = A x - b$ is the negative residual; the unique minimizer is the solution. Steepest descent minimizes $ϕ$ by taking the current residual as the search direction and moving along it to the line minimum. It works, but it's slow: successive search directions are forced to be ORTHOGONAL (a property of exact line-search on a convex quadratic), and orthogonality is a weak structural constraint that lets the trajectory zig-zag along narrow valleys when $A$ is ill-conditioned. The convergence rate of steepest descent is $\sim (κ - 1) / (κ + 1)$ per iteration, where $κ = λ_{\max} / λ_{\min}$ is the condition number.
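
As a rough illustration of the zig-zag behaviour described above, the sketch below runs steepest descent with exact line search on a hypothetical 2×2 diagonal system with $κ = 100$. The function name `steepest_descent`, the test matrix, and the tolerance are illustrative choices, not part of any library.

```python
import numpy as np

def steepest_descent(A, b, x0=None, tol=1e-10, max_iter=10_000):
    """Minimize 0.5 x^T A x - b^T x by exact line search along the residual."""
    x = np.zeros(len(b)) if x0 is None else x0.copy()
    r = b - A @ x                      # residual = negative gradient
    residuals = [r.copy()]
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)     # exact minimizer of phi along r
        x = x + alpha * r
        r = r - alpha * Ar
        residuals.append(r.copy())
    return x, residuals

# Hypothetical example: 2x2 SPD matrix with condition number 100.
A = np.diag([1.0, 100.0])
b = np.array([1.0, 1.0])
x, residuals = steepest_descent(A, b, tol=1e-8)

print(f"steps taken: {len(residuals) - 1}")
print(f"r_0 . r_1 = {residuals[0] @ residuals[1]:.2e}   (successive residuals are orthogonal)")
ratio = np.linalg.norm(residuals[11]) / np.linalg.norm(residuals[10])
print(f"||r_11|| / ||r_10|| = {ratio:.6f}   vs (kappa-1)/(kappa+1) = {99/101:.6f}")
```

For this particular starting point the residual alternates between the directions (1, 1) and (1, -1) and shrinks by a factor of $(κ - 1)/(κ + 1) = 99/101$ per step, so the worst-case rate is attained and on the order of 900 steps are needed for a two-dimensional problem.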

CG replaces orthogonality with the stronger $A$-CONJUGACY: search directions $p_{i}, p_{j}$ are chosen so that $p_{i}^{T} A p_{j} = 0$ for $i \neq j$. Geometrically, the directions are orthogonal under the inner product $⟨ x, y ⟩_{A} = x^{T} A y$, the natural metric on the level sets of $ϕ$. The payoff: exact-arithmetic CG reaches the exact minimizer in at most $n$ iterations, and the per-iteration error reduction scales like $(\sqrt{κ} - 1) / (\sqrt{κ} + 1)$, so the iteration count grows with $\sqrt{κ}$ rather than $κ$: a square-root improvement over steepest descent.

Why the recursion is short

Building a full $A$ -conjugate basis explicitly via Gram-Schmidt-with- $A$ -metric would cost $O (nk)$ per step. CG's brilliance is that the recursion is only THREE TERMS — analogous to symmetric Lanczos. The new search direction is built from the current residual and the previous direction:

p_{k+1} = r_{k+1} + β_{k} p_{k}, \qquad β_{k} = \frac{r_{k+1}^{T} r_{k+1}}{r_{k}^{T} r_{k}},

and the iteration along $p_{k}$ uses the exact line-search step:

α_{k} = \frac{r_{k}^{T} r_{k}}{p_{k}^{T} A p_{k}}, \qquad x_{k+1} = x_{k} + α_{k} p_{k}, \qquad r_{k+1} = r_{k} - α_{k} A p_{k} .

The miracle is that these short formulas produce search directions $\{p_{0}, p_{1}, \dots\}$ that are PAIRWISE $A$-CONJUGATE, and residuals $\{r_{0}, r_{1}, \dots\}$ that are PAIRWISE ORTHOGONAL. Both properties can be proven by induction; they emerge "for free" from the SPD structure. The residuals are, up to scaling, exactly the Lanczos basis vectors of $K_{k} (A, r_{0})$. CG is Lanczos in disguise, with the basis never stored.
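
The two orthogonality claims are easy to check numerically. The sketch below assumes a randomly generated, well-conditioned SPD test matrix and a hypothetical helper `max_offdiag`. It runs a few CG steps while storing every direction and residual (something CG itself never does), then measures how far $P^{T} A P$ and $R^{T} R$ are from diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 30, 8
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # SPD test matrix (illustrative assumption)
b = rng.standard_normal(n)

x = np.zeros(n)
r = b - A @ x
p = r.copy()
P, R = [p.copy()], [r.copy()]        # stored only for this check
for _ in range(steps):
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
    P.append(p.copy())
    R.append(r.copy())

P = np.column_stack(P)
R = np.column_stack(R)

def max_offdiag(M):
    """Largest off-diagonal entry after scaling M to unit diagonal."""
    d = np.sqrt(np.diag(M))
    scaled = M / np.outer(d, d)
    return np.max(np.abs(scaled - np.diag(np.diag(scaled))))

conj_defect = max_offdiag(P.T @ A @ P)   # p_i^T A p_j for i != j
orth_defect = max_offdiag(R.T @ R)       # r_i^T r_j   for i != j
print(f"max |p_i^T A p_j| (normalized): {conj_defect:.2e}")
print(f"max |r_i^T r_j|   (normalized): {orth_defect:.2e}")
```

Both numbers should sit close to machine precision for a well-conditioned matrix like this one; on ill-conditioned problems they drift upward as rounding errors accumulate.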

The algorithm

# Classical conjugate gradient. Solves A x = b for symmetric positive
# definite A. Storage: four n-vectors (x, r, p, Ap). Per iteration: one
# mat-vec, two dot products, three saxpy-like updates. No matrix
# factorizations; A is only used through its action.

import numpy as np

def cg(A, b, x0=None, tol=1e-10, max_iter=None):
    n = A.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                  # initial residual
    p = r.copy()                   # initial search direction
    rs_old = r @ r
    history = [np.sqrt(rs_old)]
    max_iter = max_iter or n
    for k in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)  # step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        history.append(np.sqrt(rs_new))
        if np.sqrt(rs_new) < tol:
            return x, history
        beta = rs_new / rs_old     # update for the search direction
        p = r + beta * p           # A-conjugate to all previous p's (proof: induction)
        rs_old = rs_new
    return x, history

# ─── Test 1: well-conditioned SPD ────────────────────────────────────────
np.random.seed(2)
n = 100
Q, _ = np.linalg.qr(np.random.randn(n, n))
eigs = np.linspace(1.0, 50.0, n)          # condition number = 50
A = Q @ np.diag(eigs) @ Q.T
x_true = np.random.randn(n)
b = A @ x_true

x_cg, hist = cg(A, b, tol=1e-12)
print(f"n = {n}, kappa(A) = {eigs.max()/eigs.min():.1f}")
print(f"  ||x_cg - x_true|| / ||x_true|| = {np.linalg.norm(x_cg - x_true)/np.linalg.norm(x_true):.2e}")
print(f"  Iterations to ||r|| < 1e-12     = {len(hist) - 1}")
for k in [0, 1, 2, 5, 10, 20, len(hist) - 1]:
    if k < len(hist):
        print(f"    iter {k:3d}:  ||r|| = {hist[k]:.4e}")

# ─── Test 2: ill-conditioned, same problem ─────────────────────────────
eigs_bad = np.geomspace(1.0, 1e6, n)       # condition number = 1e6
A_bad = Q @ np.diag(eigs_bad) @ Q.T
b_bad = A_bad @ x_true
_, hist_bad = cg(A_bad, b_bad, tol=1e-8, max_iter=2000)
print(f"\nSame system with kappa(A) = 1e6:")
print(f"  Iterations to ||r|| < 1e-8 = {len(hist_bad) - 1}")
print(f"  (CG's bound scales with sqrt(kappa), not kappa.)")

Output on a 100×100 SPD system:

n = 100, kappa(A) = 50.0
  ||x_cg - x_true|| / ||x_true|| = 5.83e-15
  Iterations to ||r|| < 1e-12     = 68
    iter   0:  ||r|| = 2.7197e+02
    iter   1:  ||r|| = 7.0290e+01
    iter   2:  ||r|| = 3.0827e+01
    iter   5:  ||r|| = 5.6963e+00
    iter  10:  ||r|| = 1.0770e+00
    iter  20:  ||r|| = 9.3834e-02
    iter  68:  ||r|| = 8.2629e-13

Same system with kappa(A) = 1e6:
  Iterations to ||r|| < 1e-8 = 1432
  (CG's bound scales with sqrt(kappa), not kappa.)

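Two things to read off. (1) On a well-conditioned ($κ = 50$) problem CG converges to $∥ r ∥ < 10^{-12}$ in 68 iterations and recovers $x$ to $\sim 10^{-15}$ relative error, i.e. full double precision. (2) Push the condition number to $10^{6}$ with the same eigenvectors and convergence becomes about 21x slower (1432 vs 68 iterations, at a looser tolerance), even though $κ$ grew by a factor of 20000. That is far closer to the $\sqrt{κ}$ scaling of the theoretical rate than to linear growth in $κ$. That single fact, convergence governed by $\sqrt{κ}$ rather than $κ$, is why CG is the default solver for SPD problems and why PRECONDITIONING (which reduces $κ$) is essential for hard problems. Note also that 1432 exceeds $n = 100$: the $n$-step termination guarantee holds only in exact arithmetic, and rounding errors erode the orthogonality of the residuals on ill-conditioned problems.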

Convergence bound

The classical bound, due to Hestenes-Stiefel and refined by many, is:

\frac{∥ x_{k} - x^{*} ∥_{A}}{∥ x_{0} - x^{*} ∥_{A}} \leq 2 \left( \frac{\sqrt{κ} - 1}{\sqrt{κ} + 1} \right)^{k} .

The error reduction is in the $A$-norm $∥ y ∥_{A} = \sqrt{y^{T} A y}$, not the Euclidean norm; the residual norm we monitor in practice can fluctuate even while the $A$-norm error decreases monotonically. For $κ ≫ 1$ the bound simplifies to $∥ x_{k} - x^{*} ∥_{A} ≲ 2 ∥ x_{0} - x^{*} ∥_{A} \exp (- 2 k / \sqrt{κ})$: to reduce the error by a factor $ϵ$, you need $k \approx \frac{1}{2} \sqrt{κ} \log (2 / ϵ)$ iterations. The bound is pessimistic in many cases (CG can exploit "clustered eigenvalues" and finish much faster), but the $\sqrt{κ}$ scaling is sharp.
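
To get a feel for the numbers, the sketch below evaluates the classical $A$-norm bound $2 ((\sqrt{κ} - 1)/(\sqrt{κ} + 1))^{k}$ and compares the exact iteration count it implies with the large-$κ$ estimate $\frac{1}{2} \sqrt{κ} \log (2/ϵ)$. The function names are hypothetical, and the code assumes $κ > 1$ and $0 < ϵ < 2$.

```python
import math

def cg_bound(kappa, k):
    """Upper bound on ||x_k - x*||_A / ||x_0 - x*||_A after k CG steps."""
    s = math.sqrt(kappa)
    return 2.0 * ((s - 1.0) / (s + 1.0)) ** k

def iterations_from_bound(kappa, eps):
    """Smallest k for which the bound drops to eps or below (needs kappa > 1)."""
    s = math.sqrt(kappa)
    rho = (s - 1.0) / (s + 1.0)
    return math.ceil(math.log(eps / 2.0) / math.log(rho))

def iterations_estimate(kappa, eps):
    """Large-kappa approximation: k ~ 0.5 * sqrt(kappa) * log(2 / eps)."""
    return 0.5 * math.sqrt(kappa) * math.log(2.0 / eps)

for kappa in (50.0, 1e4, 1e6):
    k_exact = iterations_from_bound(kappa, 1e-8)
    k_est = iterations_estimate(kappa, 1e-8)
    print(f"kappa = {kappa:9.1e}: bound needs k = {k_exact}, estimate = {k_est:.1f}")
```

These counts are worst-case guarantees on the $A$-norm error, so they are not directly comparable to iteration counts obtained from a residual-norm stopping test.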

Preconditioning

The price of CG's $\sqrt{κ}$ dependence is that ill-conditioned problems still need many iterations. A PRECONDITIONER $M^{- 1}$ approximates $A^{- 1}$, and we solve the equivalent system $M^{- 1} A x = M^{- 1} b$ with smaller condition number $κ (M^{- 1} A) ≪ κ (A)$. The PCG algorithm threads $M^{- 1}$ through the recursion so the symmetry is preserved; the only extra work is one $M^{- 1}$-solve per iteration.

Choosing $M$ is the whole art: it should be cheap to invert and close to $A$ . Common choices include diagonal (Jacobi), incomplete Cholesky, sparse approximate inverse, multigrid V-cycle, and domain-decomposition preconditioners. The "perfect" preconditioner is $M = A$ itself — converges in one iteration but is as expensive as solving the original problem.
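
A minimal sketch of preconditioned CG, assuming the preconditioner is supplied as a function `apply_Minv` that returns $M^{-1} r$ (a hypothetical interface chosen for illustration) and that $M$ is itself SPD. The example uses the Jacobi choice $M = \mathrm{diag}(A)$ on a synthetic, badly scaled matrix; the matrix construction and the relative stopping test are assumptions of this sketch, not something prescribed by the text.

```python
import numpy as np

def pcg(A, b, apply_Minv, tol=1e-8, max_iter=None):
    """Preconditioned CG; stops when ||r|| <= tol * ||b||. Returns (x, iterations)."""
    n = b.shape[0]
    max_iter = 10 * n if max_iter is None else max_iter
    b_norm = np.linalg.norm(b)
    x = np.zeros(n)
    r = b.copy()                      # residual for the zero initial guess
    z = apply_Minv(r)                 # preconditioned residual z = M^{-1} r
    p = z.copy()
    rz_old = r @ z
    for k in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        Ap = A @ p
        alpha = rz_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = apply_Minv(r)             # the one extra solve per iteration
        rz_new = r @ z
        p = z + (rz_new / rz_old) * p
        rz_old = rz_new
    return x, max_iter

# Synthetic test: A = D C D with a well-conditioned core C and bad scaling D.
rng = np.random.default_rng(0)
n = 200
G = rng.standard_normal((n, n))
C = np.eye(n) + 0.01 * (G + G.T) / 2      # SPD, eigenvalues near 1
d = np.geomspace(1.0, 1e2, n)
A = d[:, None] * C * d[None, :]
b = A @ rng.standard_normal(n)

diag_A = np.diag(A).copy()
_, it_plain = pcg(A, b, lambda r: r)                 # M = I, i.e. ordinary CG
x_jac, it_jacobi = pcg(A, b, lambda r: r / diag_A)   # M = diag(A)
print(f"plain CG: {it_plain} iterations, Jacobi PCG: {it_jacobi} iterations")
```

Jacobi works well here only because the ill-conditioning comes entirely from the diagonal scaling; for matrices whose difficulty lies elsewhere (for example discretized PDEs on fine grids) a stronger preconditioner is usually needed.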

What it's used for

Finite-element / finite-difference PDE solves. The stiffness matrix is SPD for elliptic PDEs (Poisson, elasticity at small deformation). PCG with a multigrid or incomplete-Cholesky preconditioner is the standard solver.
Normal equations $A^{T} A x = A^{T} b$. The Gauss-Newton step in nonlinear least squares; also a standard linear least-squares formulation, though numerically inferior to SVD-based approaches because forming the normal equations squares the condition number, $κ (A^{T} A) = κ (A)^{2}$.
Gaussian-process inference. The covariance matrix $K$ is SPD; computing $K^{- 1} y$ and $\log \det K$ via CG plus Lanczos quadrature avoids the $O (n^{3})$ Cholesky.
Newton subproblems in convex optimization. The Hessian is SPD at a minimum; CG with early termination yields a truncated-Newton or "Hessian-free" method.
Reservoir simulation, image deblurring, electromagnetic inverse problems — wherever large sparse SPD systems appear.
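
Because CG needs only the action of $A$, PDE applications like the first item above often never assemble a matrix at all. The sketch below is one way this might look for the 1D Poisson stencil $[-1, 2, -1]$ with Dirichlet boundaries; `poisson_1d_matvec` and `cg_matrix_free` are hypothetical names, and the relative-residual stopping rule is an assumption of this example.

```python
import numpy as np

def poisson_1d_matvec(v):
    """Apply the tridiagonal stencil [-1, 2, -1] without storing a matrix."""
    out = 2.0 * v
    out[:-1] -= v[1:]      # superdiagonal contribution
    out[1:] -= v[:-1]      # subdiagonal contribution
    return out

def cg_matrix_free(matvec, b, tol=1e-10, max_iter=None):
    """CG driven by a matvec callable; stops when ||r|| <= tol * ||b||."""
    max_iter = 10 * b.size if max_iter is None else max_iter
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs_old = r @ r
    stop = (tol * np.linalg.norm(b)) ** 2
    for k in range(max_iter):
        if rs_old <= stop:
            return x, k
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, max_iter

n = 100
rng = np.random.default_rng(0)
x_true = rng.standard_normal(n)
b = poisson_1d_matvec(x_true)          # manufactured right-hand side

x, iters = cg_matrix_free(poisson_1d_matvec, b)
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(f"{iters} iterations, relative error {rel_err:.2e}")
```

The same pattern extends to 2D and 3D stencils, where the memory saved by not storing the matrix becomes significant.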

Variants for the non-SPD case

CG requires SPD. Generalizations relax this in different directions: MINRES handles symmetric indefinite, GMRES handles fully non-symmetric (built on Arnoldi), BiCG and BiCGSTAB use bi-orthogonal Lanczos for non-symmetric problems with short recursions. Each makes a different cost-vs-robustness tradeoff.

See also

Lanczos iteration: CG and Lanczos are mathematically the same algorithm; CG just avoids storing the basis.
Condition number: the single quantity governing CG's convergence rate.
Bi-orthogonal Lanczos: the non-symmetric analogue, behind BiCG / BiCGSTAB.