Sigmoid and Its Gradient
Interview Prep
The problem
Implement the sigmoid function σ(x) = 1 / (1 + e^(−x))
in a numerically stable way, then derive and implement its
gradient. The interviewer will ask you to explain why backprop
through sigmoid is essentially free given the forward output.
The trap: naive sigmoid overflows
import numpy as np

def sigmoid_naive(x):
    return 1.0 / (1.0 + np.exp(-x))  # overflows for x << 0: exp(-x) is huge
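For large negative x, e^(−x) exceeds the largest float64 (this
happens once −x passes roughly 709.78) and overflows to infinity,
so NumPy emits a RuntimeWarning. The final value 1 / (1 + inf) = 0
happens to be right, but an infinite intermediate is fragile: the
algebraically equivalent form e^x / (1 + e^x) would give
inf / inf = NaN for large positive x. For large positive x, the
formula is fine, because e^(−x) underflows harmlessly to 0.
So we need a branched implementation: non-negative arguments use
1 / (1 + e^(−x)); negative arguments use the
algebraically equivalent e^x / (1 + e^x). Both branches
only ever exponentiate non-positive numbers, which is safe.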
>>> sigmoid_naive(np.array([1000.0]))
array([1.]) # fine: exp(-1000) underflows to 0, so the result is 1
>>> sigmoid_naive(np.array([-1000.0]))
RuntimeWarning: overflow encountered in exp
array([0.]) # value is right, but only after exp(1000) overflowed to inf
>>> sigmoid(np.array([1000.0, -1000.0]))
array([1., 0.]) # safe: no warning, no inf intermediate
Stable implementation
def sigmoid(x):
    """Numerically stable sigmoid for any sign of x."""
    x = np.asarray(x, dtype=float)            # also accept plain lists
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))  # exponent is <= 0
    ex = np.exp(x[~pos])                      # x < 0 here, so ex < 1
    out[~pos] = ex / (1.0 + ex)
    return out
Pattern: derivative expressible in terms of output
The famous identity is σ'(x) = σ(x) · (1 − σ(x)).
This is the property that makes sigmoid (and its sibling tanh) so
attractive for backprop: during the forward pass you compute and
store s = σ(x); the backward pass reuses
s · (1 − s) with no extra exponentials.
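The caching pattern described above might look like the following minimal sketch. The names `stable_sigmoid`, `SigmoidLayer`, `forward`, and `backward` are hypothetical, chosen for illustration rather than taken from any framework, and the sketch assumes NumPy array inputs.

```python
import numpy as np


def stable_sigmoid(x):
    """Stable sigmoid: only ever exponentiates non-positive values."""
    x = np.asarray(x, dtype=float)
    z = np.exp(-np.abs(x))  # always in (0, 1], so it cannot overflow
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))


class SigmoidLayer:
    """Hypothetical layer that caches its forward output for backprop."""

    def forward(self, x):
        self.s = stable_sigmoid(x)  # store s = sigmoid(x)
        return self.s

    def backward(self, grad_out):
        # Chain rule: dL/dx = dL/ds * s * (1 - s); no new exponentials.
        return grad_out * self.s * (1.0 - self.s)


layer = SigmoidLayer()
x = np.array([-2.0, 0.0, 3.0])
out = layer.forward(x)
grad_x = layer.backward(np.ones_like(x))  # upstream gradient of ones
```

The backward pass touches only the cached `s`, which is the sense in which the gradient is "free" once the forward output exists.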
Derivation
Let s = sigmoid(x) = 1 / (1 + e^{-x})
d/dx s = d/dx (1 + e^{-x})^{-1}
= -(1 + e^{-x})^{-2} * (-e^{-x})
= e^{-x} / (1 + e^{-x})^2
Now factor:
e^{-x} / (1 + e^{-x})^2
= [1 / (1 + e^{-x})] * [e^{-x} / (1 + e^{-x})]
= s * (1 - s) (since 1 - s = e^{-x} / (1 + e^{-x}))
=> ds/dx = s * (1 - s). Maximum at x=0 (s=0.5) giving 0.25.
Gradient implementation
def sigmoid_grad(x):
    """d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
Why it's a poor activation in deep networks
The maximum value of σ' is 0.25, attained
at x = 0. Far from zero the derivative is essentially
zero (the "saturating" regime). Stack a few sigmoid layers and
every backward pass multiplies several factors of ≤ 0.25 together:
ten layers bound that product by 0.25^10 ≈ 9.5 × 10^(−7), before
the weights are taken into account. This is a classic cause of
the vanishing gradient problem, and it is why ReLU dominates
modern hidden layers while sigmoid is now mostly relegated to the
output of binary classifiers.
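To make the shrinkage concrete, here is a toy sketch that composes scalar sigmoids and multiplies their local derivatives. It assumes unit weights and no biases, which is a simplification made for illustration; `sigmoid_scalar` and `chain_gradient` are hypothetical helper names.

```python
import math


def sigmoid_scalar(x):
    """Stable sigmoid for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def chain_gradient(x, depth):
    """Derivative of sigmoid applied `depth` times to x (assumed: unit weights, no biases)."""
    grad = 1.0
    for _ in range(depth):
        s = sigmoid_scalar(x)
        grad *= s * (1.0 - s)  # local derivative, never more than 0.25
        x = s                  # this output feeds the next layer
    return grad


# Compare the actual product of derivatives with the 0.25**depth bound.
for depth in (1, 5, 10):
    print(depth, chain_gradient(0.0, depth), 0.25 ** depth)
```

In a real network the weight matrices also enter the product, so the bound is only part of the picture, but the geometric decay from the activation alone is already visible here.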
Complexity
- Time: O(n) for n inputs.
- Space: O(n) output. Backward reuses the forward output, so no extra storage.
Variations worth knowing
- tanh: tanh(x) = 2σ(2x) − 1. Derivative 1 − tanh²(x). Zero-centered output is a mild advantage over sigmoid; both saturate.
- ReLU and friends: max(0, x). Derivative is 0 or 1. Doesn't saturate on the positive side, which helped make deep networks trainable. Leaky ReLU, GELU, and SiLU are common variants (the last two are smooth).
- Softplus: log(1 + e^x). Smooth ReLU. Its derivative is exactly the sigmoid, a nice algebraic dual.
- Log-sigmoid (logistic loss kernel): log σ(x) = −softplus(−x). Used in numerically stable binary cross-entropy: never form σ explicitly when you only need its log.
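The softplus and log-sigmoid identities in the list above can be sketched with `np.logaddexp`, which computes log(e^a + e^b) without overflow. The helper names `softplus`, `log_sigmoid`, and `bce_with_logits` are assumptions made for this example, as is the choice to average the loss over elements.

```python
import numpy as np


def softplus(x):
    """log(1 + e^x), written as logaddexp(0, x) to avoid overflow."""
    return np.logaddexp(0.0, x)


def log_sigmoid(x):
    """log sigma(x) = -softplus(-x), without ever forming sigma(x)."""
    return -softplus(-np.asarray(x, dtype=float))


def bce_with_logits(logits, targets):
    """Binary cross-entropy from raw logits, averaged over elements."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Uses 1 - sigma(z) = sigma(-z), so both log terms are log-sigmoids.
    per_item = -(targets * log_sigmoid(logits)
                 + (1.0 - targets) * log_sigmoid(-logits))
    return np.mean(per_item)


z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])  # includes extreme logits
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
loss = bce_with_logits(z, y)

# tanh(x) = 2 * sigma(2x) - 1, with sigma recovered as exp(log_sigmoid).
x = np.linspace(-5.0, 5.0, 11)
tanh_identity_ok = np.allclose(np.tanh(x), 2.0 * np.exp(log_sigmoid(2.0 * x)) - 1.0)
```

Working in log space keeps the loss finite even at logits of ±1000, where computing log(σ(x)) directly would take the log of an underflowed 0.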