Optimization

Gradient Descent — exercises

Pick a stable step size from the Hessian eigenvalues, run the iteration x ← x − η ∇f(x), and recognize the failure modes (divergence above 2/L, slow progress on ill-conditioned problems, traps on non-convex landscapes).

1 worked example · 7 practice problems · 2 check problems

Worked example: gradient descent on an ill-conditioned bowl

Problem. Minimize $f (x, y) = x^{2} + 10 y^{2}$ by gradient descent. Start at $(1, 1)$ with learning rate $η = 0.04$ . Compute the first three iterates by hand. Explain why $y$ converges fast and $x$ converges slowly. State the largest learning rate that doesn't diverge.

Diagnosis. Quadratic minimization is the canonical setting for analyzing gradient descent. The Hessian is constant (no curvature changes during the descent), and the convergence is governed entirely by its eigenvalues. With $\nabla f = (2 x, 20 y)$ , the Hessian is $diag (2, 20)$ — eigenvalues 2 and 20, condition number 10. Per-step contraction in direction $i$ is $∣1 - η λ_{i} ∣$ .

Predict before reading on: before doing the arithmetic, predict the contraction factor for each direction at $η = 0.04$ . Which direction shrinks faster?

Iterations. Per-step contractions: $∣1 - 0.04 \cdot 2∣ = 0.92$ for $x$ , $∣1 - 0.04 \cdot 20∣ = 0.2$ for $y$ . Both contract, so neither diverges, but $y$ loses 80% of its error each step while $x$ loses only 8%.

$(x_{0}, y_{0}) = (1.0, 1.0), \quad \nabla f = (2 \cdot 1, 20 \cdot 1) = (2, 20)$

$(x_{1}, y_{1}) = (1, 1) - 0.04 \cdot (2, 20) = (0.92, 0.2)$

$(x_{2}, y_{2}) = (0.92, 0.2) - 0.04 \cdot (1.84, 4.0) = (0.8464, 0.04)$

$(x_{3}, y_{3}) = (0.8464, 0.04) - 0.04 \cdot (1.6928, 0.8) = (0.7787, 0.008)$

Pattern. $x_{k} = 0.92^{k}$ and $y_{k} = 0.2^{k}$ . After $k = 50$ steps, $y \approx 10^{-35}$ (long since converged) but $x \approx 0.015$ (still 1.5% of its starting value). The step size is capped by the largest eigenvalue, so it is far too small for the low-curvature $x$ direction: the steep direction is done while the shallow one crawls. Driving $x$ down to $10^{-6}$ takes $\log (10^{-6}) / \log (0.92) \approx 166$ iterations.
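The iteration counts implied by this pattern can be checked with a few lines of Python. This is a minimal sketch; the helper name `steps_to_reach` is introduced here for illustration and rounds up to a whole number of steps.

```python
import math

rate_x, rate_y = 0.92, 0.2     # per-step contraction factors at eta = 0.04
target = 1e-6                  # desired size of each coordinate (both start at 1)

def steps_to_reach(rate, target):
    """Smallest integer k with rate**k <= target, for 0 < rate < 1."""
    return math.ceil(math.log(target) / math.log(rate))

k_x = steps_to_reach(rate_x, target)
k_y = steps_to_reach(rate_y, target)
print(f"x needs {k_x} steps, y needs {k_y} steps")
```

The ratio of the two counts, rather than the ratio of the contraction factors, is the fair measure of how much slower the shallow direction is.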

Predict before reading on: we have $η = 0.04$ . The stability bound is $η < 2/ L$ . Predict the failure mode at $η = 0.1$ exactly, and at $η = 0.11$ .

Failure modes. Stability bound: $η < 2/ L = 2/20 = 0.1$ . At $η = 0.1$ , the $y$ -direction has contraction factor $∣1 - 0.1 \cdot 20∣ = 1$ — pure oscillation, no convergence. At $η = 0.11$ , the $y$ -direction grows by factor 1.2 per step — divergent. The $x$ -direction is fine in all three cases; the worst-conditioned direction sets the stability boundary.

Verification.

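import numpy as np

x = np.array([1.0, 1.0])         # iterate holding both coordinates (x, y)
H = np.diag([2.0, 20.0])         # Hessian of f(x, y) = x^2 + 10 y^2
eta = 0.04                       # learning rate
for k in range(4):
    print(f"iter {k}: {x}")
    x = x - eta * (H @ x)        # gradient of the quadratic is H @ x
# Expected values (up to floating-point rounding and NumPy's print spacing):
# iter 0: [1. 1.]
# iter 1: [0.92 0.2]
# iter 2: [0.8464 0.04]
# iter 3: [0.778688 0.008]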
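The three step sizes discussed under "Failure modes" can be compared the same way. The sketch below assumes 20 steps is enough to make the behavior visible; `run_gd` is a helper name introduced here, not part of the original page.

```python
import numpy as np

H = np.diag([2.0, 20.0])          # Hessian of f(x, y) = x^2 + 10 y^2
start = np.array([1.0, 1.0])

def run_gd(eta, steps=20):
    """Return the iterate after `steps` gradient steps with constant step eta."""
    p = start.copy()
    for _ in range(steps):
        p = p - eta * (H @ p)
    return p

for eta in (0.04, 0.10, 0.11):
    x_final, y_final = run_gd(eta)
    print(f"eta = {eta:.2f}: |x| = {abs(x_final):.3e}, |y| = {abs(y_final):.3e}")
```

At the boundary step size the $y$ coordinate keeps magnitude 1 and only flips sign; just above it, the magnitude grows geometrically.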

Articulate: state in one sentence the trade-off between stability (large $η$ diverges) and progress (small $η$ crawls). What does the Hessian condition number represent in this trade-off?

Practice problems

Seven problems, seven surfaces. Some are pure mechanics (run a few steps by hand), some test the spectral analysis (compute optimal $η$ ), some explore failure modes (non-smooth, non-convex), and the last three extend the basic algorithm (momentum, preconditioning, linear solvers).

P.1 1D quadratic minimization

Minimize $f (x) = (x - 2)^{2} + 1$ by gradient descent starting at $x_{0} = 0$ .

(a) What's the Hessian? What's the maximum stable learning rate?

(b) What learning rate converges to the minimum in one step?

Find the analogue: same eigenvalue analysis as the worked example, simpler because $x$ is 1D. With one eigenvalue, there's no condition-number penalty — you can hit the minimum exactly with the right step size.

Answer

(a) $f^{''} (x) = 2$ , so $L = 2$ and stability requires $η < 2/2 = 1$ .

(b) Contraction factor is $∣1 - η \cdot 2∣$ , which equals zero at $η = 1/2$ . One-step convergence at $η = 0.5$ .

(c) $x_{1} = x_{0} - 0.5 \cdot f^{'} (x_{0}) = 0 - 0.5 \cdot 2 (0 - 2) = 0 - (- 2) = 2$ . Already at the minimum after one step. $x_{2} = 2 - 0.5 \cdot 0 = 2$ — stays there. ✓

The general lesson: for a quadratic with a single eigenvalue $λ$ , the optimal $η$ is $1/ λ$ . For multiple eigenvalues, you can't satisfy "one-step convergence" in every direction simultaneously, and the best you can do is compromise.
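A quick numerical check of both claims, as a sketch: the step size $1/λ$ lands on the minimizer in one step, while the boundary step size $2/L$ bounces forever. The function names are illustrative.

```python
def grad(x):
    return 2.0 * (x - 2.0)        # derivative of (x - 2)^2 + 1

def gd_1d(x0, eta, steps):
    """Run `steps` gradient steps from x0 with constant step eta."""
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

one_step = gd_1d(0.0, 0.5, 1)     # eta = 1/lambda: reaches the minimizer x = 2
boundary = gd_1d(0.0, 1.0, 7)     # eta = 2/L: iterates alternate between 4 and 0
print(one_step, boundary)
```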

P.2 2D ill-conditioned, optimal lr from spectral analysis

A quadratic $f (x) = \frac{1}{2} x^{⊤} A x$ has Hessian $A = diag (4, 20)$ .

(a) Compute the maximum stable constant learning rate.

(b) Compute the optimal constant learning rate $η^{*} = 2/ (μ + L)$ where $μ, L$ are the smallest and largest eigenvalues.

(c) State the resulting per-step contraction factor in the worst direction. How many iterations are needed to reduce the worst-direction error to $10^{-6}$ of its initial value?

Find the analogue: the worked example used a specific $η$ and observed two different contraction rates. This problem picks $η$ to balance the contraction across all eigenvalues. Same spectral analysis, the question is now optimization rather than evaluation.

Answer

(a) Stability bound: $η < 2/ L = 2/20 = 0.1$ .

(b) $η^{*} = 2/ (4 + 20) = 2/24 = 1/12 \approx 0.0833$ .

(c) Per-step contraction at $η^{*}$ : $∣1 - η^{*} μ ∣ = ∣1 - 4/12∣ = 2/3 \approx 0.667$ at the small eigenvalue, and $∣1 - η^{*} L ∣ = ∣1 - 20/12∣ = 2/3$ at the large eigenvalue — equal by design. (This is the whole point of $η^{*} = 2/ (μ + L)$ — it balances the two extreme contractions.)

This rate is $(κ - 1) / (κ + 1)$ where $κ = L / μ = 5$ . For a target reduction of $10^{-6}$ , the number of iterations is $\log (10^{-6}) / \log (2/3) \approx 34$ . Compare to about 166 at the worked example's sub-optimal $η = 0.04$ , where the worst contraction was 0.92. That is roughly five times fewer iterations, though the comparison is not like for like: part of the gain comes from the optimal step and part from this problem's smaller condition number (5 versus 10).
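The numbers in this answer can be reproduced directly from the two eigenvalues. A minimal sketch, with variable names chosen here for illustration:

```python
import math
import numpy as np

eigs = np.array([4.0, 20.0])       # eigenvalues of A = diag(4, 20)
mu, L = eigs.min(), eigs.max()

eta_max = 2.0 / L                  # stability boundary
eta_star = 2.0 / (mu + L)          # balances the two extreme contractions
rates = np.abs(1.0 - eta_star * eigs)
kappa = L / mu
iters = math.log(1e-6) / math.log(rates.max())

print(f"eta_max = {eta_max}, eta_star = {eta_star:.4f}")
print(f"contractions = {rates}, (kappa-1)/(kappa+1) = {(kappa - 1) / (kappa + 1):.4f}")
print(f"iterations for a 1e-6 reduction: {iters:.1f}")
```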

P.3 non-smooth objective, oscillation around the minimum

Apply gradient descent to $f (x) = ∣ x ∣$ starting at $x_{0} = 1.05$ with fixed $η = 0.1$ . The subgradient is $\partial f (x) = \operatorname{sign} (x)$ for $x \neq 0$ .

(a) Run the iteration. What happens for $k = 1, 2, \dots, 12$ ?

(b) Describe the long-run behavior in one sentence.

Find the analogue: the worked example had $η$ chosen to be smaller than the stability bound. Here there's no smooth Hessian to set the bound — the gradient is a step function. What goes wrong?

Answer

(a) Iteration: $x_{1} = 1.05 - 0.1 = 0.95, x_{2} = 0.85, \dots, x_{10} = 0.05, x_{11} = - 0.05, x_{12} = 0.05, x_{13} = - 0.05, \dots$ . Once we cross zero, the gradient direction flips and we bounce back. The iterate enters a stable oscillation between $+ 0.05$ and $- 0.05$ .

(b) Fixed- $η$ gradient descent on a non-smooth objective oscillates around the minimum with amplitude $\sim η /2$ rather than converging. The minimum value $f (x^{*}) = 0$ is never reached — only points where $f = 0.05$ .

(c) Fix: use a diminishing step size $η_{k} \to 0$ . The classical choice is $η_{k} = 1/ (k + 1)$ with $\sum η_{k} = \infty, \sum η_{k}^{2} < \infty$ , which is exactly the Robbins-Monro condition for stochastic-approximation convergence. The iterates then converge to the minimum, though slowly. An alternative is the proximal operator $x_{k + 1} = \arg\min_{x} [f (x) + \frac{1}{2 η} ∥ x - x_{k} ∥^{2}]$ , which gives exact convergence for non-smooth convex $f$ at any fixed $η$ . For $f = ∣ x ∣$ this is the soft-thresholding operator and underlies modern sparse-recovery algorithms (ISTA, FISTA).
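The three behaviors in this answer (fixed-step oscillation, diminishing-step convergence, proximal convergence) can be compared side by side. This is a sketch under stated assumptions: the step counts (40, 2000, 11) are arbitrary choices, and the helper names are introduced here.

```python
import numpy as np

def subgradient_descent(x0, step, n_steps):
    """Iterate x <- x - step(k) * sign(x) on f(x) = |x|; returns the whole path."""
    xs = [x0]
    for k in range(n_steps):
        xs.append(xs[-1] - step(k) * np.sign(xs[-1]))
    return np.array(xs)

fixed = subgradient_descent(1.05, lambda k: 0.1, 40)
decaying = subgradient_descent(1.05, lambda k: 1.0 / (k + 1), 2000)

def soft_threshold(x, eta):
    """Proximal step for |x|: shrink toward zero by eta, clipping at zero."""
    return np.sign(x) * max(abs(x) - eta, 0.0)

x_prox = 1.05
for _ in range(11):
    x_prox = soft_threshold(x_prox, 0.1)

print(f"fixed step:      |x| = {abs(fixed[-1]):.4f}")
print(f"diminishing:     |x| = {abs(decaying[-1]):.6f}")
print(f"proximal (11x):  x   = {x_prox}")
```

The proximal iteration stops exactly at zero because the last shrink step is clipped rather than overshooting.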

P.4 non-convex landscape, two basins from same lr

Consider $f (x) = x^{4} - 4 x^{2} + x$ (the non-convex example from the concept page). Run gradient descent with $η = 0.03$ from two starting points: $x_{0} = - 0.4$ and $x_{0} = + 0.4$ .

(a) Locate the stationary points by hand (roots of $f^{'} (x) = 4 x^{3} - 8 x + 1$ ).

(b) Predict which stationary point each trajectory will reach. Explain.

Find the analogue: the worked example was convex — gradient descent finds the unique minimum from any start. This problem is non-convex. The algorithm is the same; the answer depends on initialization.

Answer

(a) $f^{'} (x) = 4 x^{3} - 8 x + 1 = 0$ . Numerical roots: $x \approx - 1.473$ (global min), $x \approx 0.126$ (local max), $x \approx 1.347$ (local min).

(b) From $x_{0} = - 0.4$ : $f^{'} (- 0.4) = 4 (- 0.064) - 8 (- 0.4) + 1 = - 0.256 + 3.2 + 1 = 3.94 > 0$ , so the gradient pushes $x$ to the left, away from the local max at +0.126. Trajectory falls into the left basin, converging to $x \approx - 1.473$ .

From $x_{0} = + 0.4$ : $f^{'} (0.4) = 0.256 - 3.2 + 1 = - 1.944 < 0$ , gradient pushes $x$ to the right, away from the local max. Trajectory falls into the right basin, converging to $x \approx 1.347$ .

(c) $f (- 1.473) \approx - 5.44$ , $f (+ 1.347) \approx - 2.62$ . The minimum at $x \approx - 1.473$ is the global one, and the minimum at $x \approx 1.347$ is only local: the $+ x$ term tilts the otherwise symmetric double well so the left basin is deeper. (This is what the concept-page figure shows.) The starting point determines which basin you find; gradient descent has no way to know there's a deeper one elsewhere.
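The stationary points and the two trajectories can be checked numerically. A minimal sketch: `np.roots` finds the roots of the cubic $f'$, and `descend` is a helper introduced here that runs 500 plain gradient steps (an arbitrary but ample budget at this step size).

```python
import numpy as np

def f(x):
    return x**4 - 4 * x**2 + x

def df(x):
    return 4 * x**3 - 8 * x + 1

# Stationary points: real roots of 4x^3 + 0x^2 - 8x + 1, sorted left to right.
stationary = np.sort(np.roots([4.0, 0.0, -8.0, 1.0]).real)

def descend(x0, eta=0.03, steps=500):
    """Plain gradient descent on f from x0 with constant step eta."""
    x = x0
    for _ in range(steps):
        x = x - eta * df(x)
    return x

left, right = descend(-0.4), descend(0.4)
print("stationary points:", stationary)
print(f"from -0.4: x = {left:.4f}, f = {f(left):.3f}")
print(f"from +0.4: x = {right:.4f}, f = {f(right):.3f}")
```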

P.5 heavy-ball momentum from a physics analogy

Polyak's heavy-ball method adds momentum to gradient descent:

$x_{k + 1} = x_{k} - η \nabla f (x_{k}) + γ (x_{k} - x_{k - 1})$

Derive this from a physical analogy. Specifically: a ball of mass $m$ at position $x$ in a potential $f (x)$ , subject to a friction force $- ν \dot{x}$ , satisfies $m \ddot{x} + ν \dot{x} + \nabla f (x) = 0$ . Discretize this ODE with explicit Euler at step size $h$ and show it produces the heavy-ball update. Identify $η$ and $γ$ in terms of $m, ν, h$ .

Find the analogue: the basic GD step is the over-damped limit (no inertia: $m = 0$ , just $ν \dot{x} = - \nabla f$ ). Adding back mass gives a ball that rolls past local features rather than getting stuck on them. This is the same "physics analogy" intuition used to motivate momentum in deep learning.

Answer

Explicit Euler discretization with step $h$ : $\dot{x} (t_{k}) \approx (x_{k + 1} - x_{k}) / h$ and $\ddot{x} (t_{k}) \approx (x_{k + 1} - 2 x_{k} + x_{k - 1}) / h^{2}$ .

Substitute into the ODE $m \ddot{x} + ν \dot{x} = - \nabla f$ :

$m (x_{k + 1} - 2 x_{k} + x_{k - 1}) / h^{2} + ν (x_{k + 1} - x_{k}) / h = - \nabla f (x_{k})$

Solve for $x_{k + 1}$ :

$x_{k + 1} (m / h^{2} + ν / h) = (2 m / h^{2} + ν / h) x_{k} - (m / h^{2}) x_{k - 1} - \nabla f (x_{k})$

Divide through by $(m / h^{2} + ν / h) = (m + ν h) / h^{2}$ . After algebra:

$x_{k + 1} = x_{k} - \frac{h ^{2}}{m + ν h} \nabla f (x_{k}) + \frac{m}{m + ν h} (x_{k} - x_{k - 1})$

Reading off: $η = h^{2} / (m + ν h), γ = m / (m + ν h)$ .

Physical reading: $γ$ ("momentum coefficient") is the relative weight of mass to (mass + friction step) — heavier ball means more momentum carry-over. $η$ ("learning rate") inherits an $h^{2}$ from the second-derivative term, hence the quadratic scaling of step size with time step. Setting $m = 0$ gives $γ = 0$ and vanilla gradient descent at step $η = h / ν$ — over-damped limit.

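This is the same kind of derivation that motivates Nesterov's accelerated gradient, a more careful discretization whose convergence rate depends on $\sqrt{κ}$ rather than $κ$ for strongly convex problems. For comparison, plain gradient descent contracts at $(κ - 1) / (κ + 1)$ , while heavy-ball with well-tuned $η$ and $γ$ reaches $(\sqrt{κ} - 1) / (\sqrt{κ} + 1)$ on strongly convex quadratics; Nesterov's method keeps a $\sqrt{κ}$-type guarantee beyond the quadratic case. Both are second-order ODE discretizations dressed up as first-order updates.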
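The algebra can be sanity-checked numerically: solving the discretized ODE for $x_{k+1}$ should give the same number as the heavy-ball update with $η = h^{2}/(m + νh)$ and $γ = m/(m + νh)$. The values of $m$, $ν$, $h$ and the test function $f(x) = x^{2}$ below are illustrative assumptions, not taken from the problem.

```python
def heavy_ball_params(m, nu, h):
    """Map (mass, friction, time step) to (eta, gamma) as derived above."""
    return h**2 / (m + nu * h), m / (m + nu * h)

def ode_step(x_k, x_prev, grad, m, nu, h):
    """Solve the discretized ODE for x_{k+1} directly."""
    lhs = m / h**2 + nu / h
    rhs = (2 * m / h**2 + nu / h) * x_k - (m / h**2) * x_prev - grad
    return rhs / lhs

def heavy_ball_step(x_k, x_prev, grad, eta, gamma):
    """Polyak's update: gradient step plus momentum term."""
    return x_k - eta * grad + gamma * (x_k - x_prev)

m, nu, h = 1.0, 3.0, 0.1            # illustrative values (assumed)
eta, gamma = heavy_ball_params(m, nu, h)
x_prev, x_k = 1.0, 0.9
grad = 2.0 * x_k                    # gradient of f(x) = x^2 at x_k
a = ode_step(x_k, x_prev, grad, m, nu, h)
b = heavy_ball_step(x_k, x_prev, grad, eta, gamma)
print(eta, gamma, a, b)
```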

P.6 diagonal preconditioning to fix conditioning

Take the worked example's $A = \operatorname{diag} (2, 20)$ . Find a diagonal preconditioner matrix $P$ such that the transformed problem in $z = P x$ has a perfectly conditioned Hessian (all eigenvalues equal).

Write out the preconditioned iteration $z_{k + 1} = z_{k} - η P^{- ⊤} \nabla f (P^{- 1} z_{k})$ for this $A$ . What's its convergence rate?

Find the analogue: ill-conditioning is what limited the worked example to $η < 0.1$ while $x$ crawled. Preconditioning rescales the axes so the Hessian becomes the identity, eliminating the conditioning problem entirely.

Answer

Substitute $x = P^{- 1} z$ :

$f (x) = \frac{1}{2} x^{⊤} A x = \frac{1}{2} z^{⊤} (P^{- ⊤} A P^{- 1}) z$

so the Hessian in $z$ space is $P^{- ⊤} A P^{- 1}$ . Choose $P = A^{1/2} = \operatorname{diag} (\sqrt{2}, \sqrt{20})$ . Then $P^{- ⊤} A P^{- 1} = \operatorname{diag} (2/2, 20/20) = I$ , and the iteration reduces to $z_{k + 1} = z_{k} - η z_{k} = (1 - η) z_{k}$ .

Eigenvalues all 1, condition number 1. Optimal $η = 1$ gives one-step convergence (in $z$ space). The iteration in $z$ space looks like vanilla GD on a unit-Hessian problem — boring, which is the point.

In $x$ space, the preconditioned step is $x_{k + 1} = x_{k} - η A^{- 1} \nabla f (x_{k})$ , which at $η = 1$ is exactly one step of Newton's method. So preconditioning with $A^{- 1}$ is Newton's method in disguise. Practical preconditioners are cheap approximations to $A^{- 1}$ (diagonal or Jacobi preconditioning, incomplete Cholesky, learned preconditioners) that get much of the benefit without the cost of inverting $A$ . Adaptive optimizers such as Adam and RMSprop can be read as diagonal preconditioners: they rescale each coordinate using running second-moment estimates of the gradient, a rough proxy for curvature rather than a direct estimate of $A^{- 1}$ .
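A short sketch of the $x$-space form, comparing one preconditioned step against ten plain steps at the best constant step size. It uses a Jacobi preconditioner (the inverse of the diagonal of $A$), which coincides with $A^{-1}$ here only because $A$ is itself diagonal; the starting point is the worked example's.

```python
import numpy as np

A = np.diag([2.0, 20.0])
x0 = np.array([1.0, 1.0])

def grad(x):
    return A @ x                          # gradient of 0.5 * x^T A x

# Jacobi (diagonal) preconditioner: invert only the diagonal of A.
M_inv = np.diag(1.0 / np.diag(A))

x_pre = x0 - 1.0 * (M_inv @ grad(x0))     # one preconditioned step, eta = 1

x_plain = x0.copy()
eta_star = 2.0 / (2.0 + 20.0)             # best constant step without preconditioning
for _ in range(10):
    x_plain = x_plain - eta_star * grad(x_plain)

print("preconditioned, 1 step:", x_pre)
print("plain GD, 10 steps, error norm:", np.linalg.norm(x_plain))
```

For a non-diagonal $A$ the Jacobi choice would only approximate $A^{-1}$, so the one-step result would not carry over exactly.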

P.7 linear systems via Richardson iteration

Show that solving a symmetric positive-definite linear system $A x = b$ is equivalent to minimizing the quadratic $f (x) = \frac{1}{2} x^{⊤} A x - b^{⊤} x$ .

(a) Compute the gradient and show that the unique minimizer satisfies $A x^{*} = b$ .

(b) Write out gradient descent on this quadratic. It produces a famous iterative linear-system solver — name it.

Find the analogue: same spectral-analysis machinery as the worked example, applied to a quadratic that has a linear system at its critical point rather than a "pure" minimum. Solving $A x = b$ and minimizing $f$ are the same problem.

Answer

(a) $\nabla f (x) = A x - b$ (using symmetry of $A$ ). Setting this to zero gives $A x = b$ , and the Hessian $A$ is positive definite by assumption, so the critical point is the unique global minimum.

(b) Gradient descent: $x_{k + 1} = x_{k} - η (A x_{k} - b) = (I - η A) x_{k} + η b$ . This is the Richardson iteration for $A x = b$ , one of the simplest iterative linear solvers. Jacobi and Gauss-Seidel are closely related stationary iterations, and the most important refinement is the conjugate gradient method, gradient descent's smarter cousin tuned specifically for SPD linear systems.

(c) With eigenvalues $0 < μ \leq \dots \leq L$ of $A$ , stability requires $∥ I - η A ∥ < 1$ , i.e., $0 < η < 2/ L$ . Optimal constant $η^{*} = 2/ (μ + L)$ gives convergence rate $(κ - 1) / (κ + 1)$ with $κ = L / μ$ , exactly the same as for general quadratic minimization. Conjugate gradient improves this to roughly $(\sqrt{κ} - 1) / (\sqrt{κ} + 1)$ per step, which is why it's the algorithm of choice for large sparse SPD systems in scientific computing.
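As a sketch of the equivalence, the Richardson iteration below solves a small SPD system and is compared against a direct solve. The matrix and right-hand side are made-up illustrative values, and 200 iterations is an arbitrary but generous budget for this condition number.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])            # illustrative SPD matrix (assumed)
b = np.array([1.0, 2.0])

eigvals = np.linalg.eigvalsh(A)       # ascending eigenvalues of a symmetric matrix
mu, L = eigvals[0], eigvals[-1]
eta = 2.0 / (mu + L)                  # optimal constant step

x = np.zeros(2)
for _ in range(200):
    x = x - eta * (A @ x - b)         # Richardson step = gradient step on f

x_direct = np.linalg.solve(A, b)
print("Richardson:", x)
print("direct:    ", x_direct)
```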

Check problems

Two problems that don't pattern-match to the practice set. The first derives a result the page asserts but doesn't prove; the second tests when the basic algorithm "succeeds."

Check 1 derivation

For the quadratic $f (x) = \frac{1}{2} x^{⊤} A x - b^{⊤} x$ with symmetric positive-definite $A$ having eigenvalues $0 < μ \leq λ_{2} \leq \dots \leq L$ , derive the optimal constant learning rate $η^{*}$ for gradient descent.

(a) Show that the per-iteration error-reduction operator is $I - η A$ .

(b) Derive $η^{*} = 2/ (μ + L)$ as the minimizer of $max_{λ} ∣1 - η λ ∣$ over $λ \in [μ, L]$ .

Solution sketch

(a) Let $e_{k} = x_{k} - x^{*}$ with $x^{*} = A^{- 1} b$ . The GD update is $x_{k + 1} = x_{k} - η (A x_{k} - b)$ . Substituting $x_{k} = x^{*} + e_{k}$ and using $A x^{*} = b$ : $x_{k + 1} = x^{*} + e_{k} - η A e_{k} = x^{*} + (I - η A) e_{k}$ . So $e_{k + 1} = (I - η A) e_{k}$ . ✓

(b) Decompose $e_{k}$ in the eigenbasis of $A$ : $e_{k} = \sum_{i} α_{i}^{(k)} v_{i}$ with $A v_{i} = λ_{i} v_{i}$ . Then $e_{k + 1} = \sum_{i} (1 - η λ_{i}) α_{i}^{(k)} v_{i}$ , so each component contracts by $∣1 - η λ_{i} ∣$ . The worst-case contraction (over all components) is $max_{λ} ∣1 - η λ ∣$ with $λ \in [μ, L]$ .

This function of $η$ is piecewise linear. For $η < 1/ L$ , all $1 - η λ$ are positive, max is at $λ = μ$ : gives $1 - η μ$ , decreasing in $η$ . For $η > 1/ μ$ , all are negative, max is at $λ = L$ : gives $η L - 1$ , increasing in $η$ . In between, the max is $max (1 - η μ, η L - 1)$ , minimized where the two are equal:

$1 - η μ = η L - 1 \Rightarrow η^{*} = 2/ (μ + L)$ ✓

Substituting back gives the worst-case contraction $1 - η^{*} μ = (L - μ) / (L + μ) = (κ - 1) / (κ + 1)$ . This is the "textbook" worst-case rate for constant-step gradient descent on a strongly convex quadratic. It depends only on the condition number $κ$ , not on the dimension of the problem. Conjugate gradient achieves roughly $(\sqrt{κ} - 1) / (\sqrt{κ} + 1)$ , and Nesterov's accelerated gradient has a comparable $\sqrt{κ}$ dependence; the gap between them and plain GD is what makes the latter unsuitable for large ill-conditioned problems in serious numerical work.
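The minimax argument can be confirmed by brute force: scan a grid of step sizes, take the worst contraction over $λ \in [μ, L]$ for each, and locate the minimum. This is a sketch using the worked example's $μ = 2$ , $L = 20$ ; the grid resolutions are arbitrary choices.

```python
import numpy as np

mu, L = 2.0, 20.0                               # extreme eigenvalues (worked example)
etas = np.linspace(0.001, 2.0 / L, 2001)        # candidate step sizes up to the stability bound
lams = np.linspace(mu, L, 201)                  # eigenvalue grid, endpoints included

# worst_rate[i] = max over lambda in [mu, L] of |1 - etas[i] * lambda|
worst_rate = np.abs(1.0 - np.outer(etas, lams)).max(axis=1)

best = np.argmin(worst_rate)
kappa = L / mu
print(f"grid minimizer eta = {etas[best]:.5f}, formula 2/(mu+L) = {2.0 / (mu + L):.5f}")
print(f"grid rate = {worst_rate[best]:.4f}, (kappa-1)/(kappa+1) = {(kappa - 1) / (kappa + 1):.4f}")
```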

Check 2 articulation

"Gradient descent converged" and "we found the global minimum" are different claims. In 150–250 words, explain the distinction using the non-convex example from P.4. Your answer should:

Define what "converged" means operationally (in terms of the gradient norm and/or successive iterates).
Explain what gradient descent can guarantee under mild conditions.
Explain what it cannot guarantee, and why this is a property of the loss surface rather than of the algorithm.
Name two practical strategies that partially address the gap (without claiming they solve it).

Solution sketch

"Converged" means the iterates have stopped moving meaningfully — operationally, the gradient norm has dropped below a tolerance, or $∥ x_{k + 1} - x_{k} ∥$ is below a tolerance. Gradient descent on a smooth function with a stable step size will reach this state. What it produces is a stationary point: a place where $\nabla f = 0$ .

What gradient descent guarantees (under mild conditions: smooth $f$ , bounded below, $η < 2/ L$ ): a sequence whose gradient norm goes to zero, with limit points that are stationary. For convex $f$ , every stationary point is a global minimum, so convergence implies optimality.

What it can't guarantee: in P.4's $f (x) = x^{4} - 4 x^{2} + x$ , there are three stationary points: a local max at $x \approx 0.13$ , the global min at $x \approx - 1.47$ , and a shallower local min at $x \approx 1.35$ . Starting at $+ 0.4$ drives the iterate into the local-min basin. The algorithm converges; the gradient at the limit is zero; the loss is locally minimized; and yet it's not the best basin. This is a property of the loss surface (multiple basins of attraction), not the algorithm. No first-order local-search method can rule this out, whatever its step-size schedule.

Practical strategies that partially address the gap: (i) multiple random restarts: sample many initial $x_{0}$ , run GD from each, keep the best result. Costs $n_{restart} \times$ the single-run cost; helps but gives no guarantee of finding the global min. (ii) Stochastic noise (SGD or Langevin dynamics): adding gradient noise lets the iterates jump small barriers and sometimes find deeper basins, an effect often cited as one reason SGD-trained deep-learning models generalize well. Neither strategy provides a worst-case guarantee. Global optimization on non-convex landscapes remains hard in general.
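Strategy (i) can be sketched on P.4's function in a few lines. The restart interval $[-2, 2]$, the 20 restarts, the seed, and the gradient tolerance are all assumptions made for illustration; with a different interval the restarts could miss the deeper basin entirely.

```python
import numpy as np

def f(x):
    return x**4 - 4 * x**2 + x

def df(x):
    return 4 * x**3 - 8 * x + 1

def descend(x0, eta=0.03, steps=500, tol=1e-8):
    """Plain GD; stops early once the gradient magnitude drops below tol."""
    x = x0
    for _ in range(steps):
        g = df(x)
        if abs(g) < tol:
            break
        x = x - eta * g
    return x

rng = np.random.default_rng(0)                 # fixed seed for repeatability
starts = rng.uniform(-2.0, 2.0, size=20)       # restart range is an assumption
finals = np.array([descend(x0) for x0 in starts])
best = finals[np.argmin(f(finals))]

print("distinct limits:", np.unique(np.round(finals, 3)))
print(f"best of 20 restarts: x = {best:.4f}, f = {f(best):.3f}")
```

Every run satisfies the operational convergence test, yet only the runs that start in the left basin end at the global minimum, which is the distinction this check problem asks for.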