Pick a stable step size from the Hessian eigenvalues, run the iteration x ← x − η ∇f(x), and recognize the failure modes (divergence above 2/L, slow progress on ill-conditioned problems, traps on non-convex landscapes).
1 worked example · 7 practice problems · 2 check problems
Worked example: gradient descent on an ill-conditioned bowl
Problem. Minimize f(x,y)=x2+10y2 by gradient descent. Start at (1,1) with learning rate η=0.04. Compute the first three iterates by hand. Explain why y converges fast and x converges slowly. State the largest learning rate that doesn't diverge.
Diagnosis. Quadratic minimization is the canonical setting for analyzing gradient descent. The Hessian is constant (no curvature changes during the descent), and the convergence is governed entirely by its eigenvalues. With ∇f=(2x,20y), the Hessian is diag(2,20) — eigenvalues 2 and 20, condition number 10. Per-step contraction in direction i is ∣1−ηλi∣.
Predict before reading on: before doing the arithmetic: predict the contraction factor for each direction at η=0.04. Which direction shrinks faster?
Iterations. Per-step contractions: ∣1−0.04⋅2∣=0.92 for x, ∣1−0.04⋅20∣=0.2 for y. Both contract — neither diverges — but y contracts almost 5× faster.
Pattern.xk=0.92k and yk=0.2k. After k=50 steps, y≈10−35 (long since converged) but x≈0.015 (still 1.5% off). The well-conditioned direction is done; the ill-conditioned direction crawls. Total iterations to drive x down to 10−6 are ∼log(10−6)/log(0.92)≈165.
Predict before reading on: we have η=0.04. The stability bound is η<2/L. Predict the failure mode at η=0.1 exactly, and at η=0.11.
Failure modes. Stability bound: η<2/L=2/20=0.1. At η=0.1, the y-direction has contraction factor ∣1−0.1⋅20∣=1 — pure oscillation, no convergence. At η=0.11, the y-direction grows by factor 1.2 per step — divergent. The x-direction is fine in all three cases; the worst-conditioned direction sets the stability boundary.
Verification.
import numpy as npx = np.array([1.0, 1.0])H = np.diag([2., 20.]) # Hessianfor k in range(4): print(f"iter {k}: x = {x}") x = x - 0.04 * (H @ x)# iter 0: [1. 1.]# iter 1: [0.92 0.2]# iter 2: [0.8464 0.04]# iter 3: [0.778688 0.008]
Articulate: state in one sentence the trade-off between stability (large η diverges) and progress (small η crawls). What does the Hessian condition number represent in this trade-off?
Practice problems
Seven problems, seven surfaces. Some are pure mechanics (run a few steps by hand), some test the spectral analysis (compute optimal η), some explore failure modes (non-smooth, non-convex), and the last two derive extensions of the basic algorithm.
P.11D quadratic minimization
Minimize f(x)=(x−2)2+1 by gradient descent starting at x0=0.
(a) What's the Hessian? What's the maximum stable learning rate?
(b) What learning rate converges to the minimum in one step?
(c) Run two iterations at the one-step-optimal η and confirm.
Find the analogue:
same eigenvalue analysis as the worked example, simpler because x is 1D. With one eigenvalue, there's no condition-number penalty — you can hit the minimum exactly with the right step size.
show answer
(a) f′′(x)=2, so L=2 and stability requires η<2/2=1.
(b) Contraction factor is ∣1−η⋅2∣, which equals zero at η=1/2. One-step convergence at η=0.5.
(c) x1=x0−0.5⋅f′(x0)=0−0.5⋅2(0−2)=0−(−2)=2. Already at the minimum after one step. x2=2−0.5⋅0=2 — stays there. ✓
The general lesson: for a quadratic with a single eigenvalue λ, the optimal η is 1/λ. For multiple eigenvalues, you can't satisfy "one-step convergence" in every direction simultaneously, and the best you can do is compromise.
P.22D ill-conditioned, optimal lr from spectral analysis
A quadratic f(x)=21x⊤Ax has Hessian A=diag(4,20).
(a) Compute the maximum stable constant learning rate.
(b) Compute the optimal constant learning rate η∗=2/(μ+L) where μ,L are the smallest and largest eigenvalues.
(c) State the resulting per-step contraction factor in the worst direction. How many iterations are needed to reduce the worst-direction error by a factor of 10−6?
Find the analogue:
the worked example used a specific η and observed two different contraction rates. This problem picks η to balance the contraction across all eigenvalues. Same spectral analysis, the question is now optimization rather than evaluation.
show answer
(a) Stability bound: η<2/L=2/20=0.1.
(b) η∗=2/(4+20)=2/24=1/12≈0.0833.
(c) Per-step contraction at η∗: ∣1−η∗μ∣=∣1−4/12∣=2/3≈0.667 at the small eigenvalue, and ∣1−η∗L∣=∣1−20/12∣=2/3 at the large eigenvalue — equal by design. (This is the whole point of η∗=2/(μ+L) — it balances the two extreme contractions.)
This rate is (κ−1)/(κ+1) where κ=L/μ=5. For a target reduction of 10−6, the number of iterations is log(10−6)/log(2/3)≈34. Compare to ∼165 at the worked example's sub-optimal η=0.04, where the worst contraction was 0.92. The optimal step gives an order of magnitude fewer iterations.
P.3non-smooth objective, oscillation around the minimum
Apply gradient descent to f(x)=∣x∣ starting at x0=1.05 with fixed η=0.1. The subgradient is ∂f(x)=sign(x) for x=0.
(a) Run the iteration. What happens for k=1,2,…,12?
(b) Describe the long-run behavior in one sentence.
(c) Propose a modification that does converge to the minimum.
Find the analogue:
the worked example had η chosen to be smaller than the stability bound. Here there's no smooth Hessian to set the bound — the gradient is a step function. What goes wrong?
show answer
(a) Iteration: x1=1.05−0.1=0.95,x2=0.85,…,x10=0.05,x11=−0.05,x12=0.05,x13=−0.05,…. Once we cross zero, the gradient direction flips and we bounce back. The iterate enters a stable oscillation between +0.05 and −0.05.
(b) Fixed-η gradient descent on a non-smooth objective oscillates around the minimum with amplitude ∼η/2 rather than converging. The minimum value f(x∗)=0 is never reached — only points where f=0.05.
(c) Fix: use a diminishing step size ηk→0. The classical choice is ηk=1/(k+1) with ∑ηk=∞,∑ηk2<∞, which is exactly the Robbins-Monro condition for stochastic-approximation convergence. The iterates then converge to the minimum, though slowly. An alternative is the proximal operator xk+1=argminx[f(x)+2η1∥x−xk∥2], which gives exact convergence for non-smooth convex f at any fixed η. For f=∣x∣ this is the soft-thresholding operator and underlies modern sparse-recovery algorithms (ISTA, FISTA).
P.4non-convex landscape, two basins from same lr
Consider f(x)=x4−4x2+x (the non-convex example from the concept page). Run gradient descent with η=0.03 from two starting points: x0=−0.4 and x0=+0.4.
(a) Locate the stationary points by hand (roots of f′(x)=4x3−8x+1).
(b) Predict which stationary point each trajectory will reach. Explain.
(c) State the function values at the two minima. Which is the global minimum?
Find the analogue:
the worked example was convex — gradient descent finds the unique minimum from any start. This problem is non-convex. The algorithm is the same; the answer depends on initialization.
(b) From x0=−0.4: f′(−0.4)=4(−0.064)−8(−0.4)+1=−0.256+3.2+1=3.94>0, so the gradient pushes x to the left, away from the local max at +0.126. Trajectory falls into the left basin, converging to x≈−1.473.
From x0=+0.4: f′(0.4)=0.256−3.2+1=−1.944<0, gradient pushes x to the right, away from the local max. Trajectory falls into the right basin, converging to x≈1.347.
(c) f(−1.473)≈−5.46, f(+1.347)≈−2.66. The local minimum at x≈−1.473 is actually the global minimum — surprisingly, the deeper basin is the one further from zero. (This is what the concept-page figure shows.) The starting point determines which basin you find; gradient descent has no way to know there's a deeper one elsewhere.
P.5heavy-ball momentum from a physics analogy
Polyak's heavy-ball method adds momentum to gradient descent:
xk+1=xk−η∇f(xk)+γ(xk−xk−1)
Derive this from a physical analogy. Specifically: a ball of mass m at position x in a potential f(x), subject to a friction force −νx˙, satisfies mx¨+νx˙+∇f(x)=0. Discretize this ODE with explicit Euler at step size h and show it produces the heavy-ball update. Identify η and γ in terms of m,ν,h.
Find the analogue:
the basic GD step is the over-damped limit (no inertia: m=0, just νx˙=−∇f). Adding back mass gives a ball that rolls past local features rather than getting stuck on them. This is the same "physics analogy" intuition used to motivate momentum in deep learning.
show answer
Explicit Euler discretization with step h: x˙(tk)≈(xk+1−xk)/h and x¨(tk)≈(xk+1−2xk+xk−1)/h2.
Divide through by (m/h2+ν/h)=(m+νh)/h2. After algebra:
xk+1=xk−m+νhh2∇f(xk)+m+νhm(xk−xk−1)
Reading off: η=h2/(m+νh),γ=m/(m+νh).
Physical reading: γ ("momentum coefficient") is the relative weight of mass to (mass + friction step) — heavier ball means more momentum carry-over. η ("learning rate") inherits an h2 from the second-derivative term, hence the quadratic scaling of step size with time step. Setting m=0 gives γ=0 and vanilla gradient descent at step η=h/ν — over-damped limit.
This is the same kind of derivation that motivates Nesterov's accelerated gradient (a more careful discretization that achieves the optimal (κ−1)/(κ+1) rate for strongly convex problems, faster than heavy-ball's (κ−1)/(κ+1)). Both are second-order ODE discretizations dressed up as first-order updates.
P.6diagonal preconditioning to fix conditioning
Take the worked example's A=diag(2,20). Find a diagonal preconditioner matrix P such that the transformed problem on z=P−1x has a perfectly-conditioned Hessian (all eigenvalues equal).
Write out the preconditioned iteration zk+1=zk−ηP−⊤∇f(Pzk) for this A. What's its convergence rate?
Find the analogue:
ill-conditioning is what limited the worked example to η<0.1 while x crawled. Preconditioning rescales the axes so the Hessian becomes the identity, eliminating the conditioning problem entirely.
show answer
Choose P=diag(2,20) so that P⊤AP−1=? — actually the cleanest statement is: substitute x=P−1z:
f(x)=21x⊤Ax=21z⊤(P−⊤AP−1)z
With P=diag(2,20), P−⊤AP−1=diag(2/2,20/20)=I.
Eigenvalues all 1, condition number 1. Optimal η=1 gives one-step convergence (in z space). The iteration in z space looks like vanilla GD on a unit-Hessian problem — boring, which is the point.
In x space, the preconditioned step is xk+1=xk−ηA−1∇f(xk), which is one step of Newton's method. So preconditioning is Newton's method in disguise when you precondition with A−1. Practical preconditioners are cheap approximations to A−1 — diagonal preconditioning (Jacobi), incomplete Cholesky, or learned preconditioners — that get most of the benefit without the cost of inverting A. Modern optimizers (Adam, RMSprop) are exactly diagonal preconditioners that estimate A−1 from running second-moment estimates of the gradient.
P.7linear systems via Richardson iteration
Show that solving a symmetric positive-definite linear system Ax=b is equivalent to minimizing the quadratic f(x)=21x⊤Ax−b⊤x.
(a) Compute the gradient and show that the unique minimizer satisfies Ax∗=b.
(b) Write out gradient descent on this quadratic. It produces a famous iterative linear-system solver — name it.
(c) State the stable learning-rate range in terms of the eigenvalues of A.
Find the analogue:
same spectral-analysis machinery as the worked example, applied to a quadratic that has a linear system at its critical point rather than a "pure" minimum. Solving Ax=b and minimizing f are the same problem.
show answer
(a) ∇f(x)=Ax−b (using symmetry of A). Setting this to zero gives Ax=b, and the Hessian A is positive definite by assumption, so the critical point is the unique global minimum.
(b) Gradient descent: xk+1=xk−η(Axk−b)=(I−ηA)xk+ηb. This is the Richardson iteration for Ax=b. It's one of the oldest iterative linear solvers; the modern descendants are Jacobi, Gauss-Seidel, and (most importantly) the conjugate gradient method, which is gradient descent's smarter cousin specifically tuned for SPD linear systems.
(c) With eigenvalues 0<μ≤…≤L of A, stability requires ∥I−ηA∥<1, i.e., η<2/L. Optimal constant η∗=2/(μ+L) gives convergence rate (κ−1)/(κ+1) with κ=L/μ — exactly the same as for general quadratic minimization. Conjugate gradient improves this to (κ−1)/(κ+1) per step, which is why it's the algorithm of choice for large sparse SPD systems in scientific computing.
Check problems
Two problems that don't pattern-match to the practice set. The first derives a result the page asserts but doesn't prove; the second tests when the basic algorithm "succeeds."
Check 1derivation
For the quadratic f(x)=21x⊤Ax−b⊤x with symmetric positive-definite A having eigenvalues 0<μ≤λ2≤…≤L, derive the optimal constant learning rate η∗ for gradient descent.
(a) Show that the per-iteration error-reduction operator is I−ηA.
(b) Derive η∗=2/(μ+L) as the minimizer of maxλ∣1−ηλ∣ over λ∈[μ,L].
(c) Compute the resulting worst-direction contraction factor and show it equals (κ−1)/(κ+1) where κ=L/μ.
show solution sketch
(a) Let ek=xk−x∗ with x∗=A−1b. The GD update is xk+1=xk−η(Axk−b). Substituting xk=x∗+ek and using Ax∗=b: xk+1=x∗+ek−ηAek=x∗+(I−ηA)ek. So ek+1=(I−ηA)ek. ✓
(b) Decompose ek in the eigenbasis of A: ek=∑iαi(k)vi with Avi=λivi. Then ek+1=∑i(1−ηλi)αi(k)vi, so each component contracts by ∣1−ηλi∣. The worst-case contraction (over all components) is maxλ∣1−ηλ∣ with λ∈[μ,L].
This function of η is piecewise linear. For η<1/L, all 1−ηλ are positive, max is at λ=μ: gives 1−ημ, decreasing in η. For η>1/μ, all are negative, max is at λ=L: gives ηL−1, increasing in η. In between, the max is max(1−ημ,ηL−1), minimized where the two are equal:
1−ημ=ηL−1⇒η∗=2/(μ+L) ✓
(c) At η∗: contraction =1−η∗μ=1−2μ/(μ+L)=(L−μ)/(L+μ)=(κ−1)/(κ+1) ✓
This is the "textbook" worst-case rate for constant-step gradient descent on a strongly convex quadratic. It depends only on the condition number κ, not on the dimension of the problem. Conjugate gradient achieves (κ−1)/(κ+1); Nesterov's accelerated gradient achieves the same; the gap between them and plain GD is what makes the latter unsuitable for large ill-conditioned problems in serious numerical work.
Check 2articulation
"Gradient descent converged" and "we found the global minimum" are different claims. In 150–250 words, explain the distinction using the non-convex example from P.4. Your answer should:
Define what "converged" means operationally (in terms of the gradient norm and/or successive iterates).
Explain what gradient descent can guarantee under mild conditions.
Explain what it cannot guarantee, and why this is a property of the loss surface rather than of the algorithm.
Name two practical strategies that partially address the gap (without claiming they solve it).
show solution sketch
"Converged" means the iterates have stopped moving meaningfully — operationally, the gradient norm has dropped below a tolerance, or ∥xk+1−xk∥ is below a tolerance. Gradient descent on a smooth function with a stable step size will reach this state. What it produces is a stationary point: a place where ∇f=0.
What gradient descent guarantees (under mild conditions: smooth f, bounded below, η<2/L): a sequence whose gradient norm goes to zero, with limit points that are stationary. For convex f, every stationary point is a global minimum, so convergence implies optimality.
What it can't guarantee: in P.4's f(x)=x4−4x2+x, there are three stationary points — a local max at x≈0.13, a local min at x≈−1.47, and a global min at x≈1.35. Starting at −0.4 drives the iterate into the local-min basin. The algorithm converges; the gradient at the limit is zero; the loss is locally minimized — and yet it's not the best basin. This is a property of the loss surface (multiple basins of attraction), not the algorithm. No first-order local-search method, with any step-size schedule, can fix this.
Practical strategies that partially address the gap: (i) multiple random restarts — sample many initial x0, run GD from each, keep the best result. Costs nrestart× the single-run cost; helps but no guarantee of finding the global min. (ii) Stochastic noise (SGD or Langevin dynamics) — adding gradient noise lets the iterates jump small barriers and find deeper basins; this is half the reason deep-learning models generalize. Neither strategy provides a worst-case guarantee. Global optimization on non-convex landscapes remains open.