Sigmoid and Its Gradient


The problem

Implement the sigmoid function σ(x) = 1 / (1 + e^(−x)) in a numerically stable way, then derive and implement its gradient. The interviewer will ask you to explain why backprop through sigmoid is essentially free given the forward output.

The trap: naive sigmoid overflows

import numpy as np

def sigmoid_naive(x):
    return 1.0 / (1.0 + np.exp(-x))   # overflows for x << 0: exp(-x) is huge

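For large negative x, e^(−x) overflows to infinity: in float64 that happens once −x exceeds roughly 709.8, and in float32 at roughly 88.7. NumPy emits a RuntimeWarning, and the result 0 is reached only through 1 / inf, which happens to be the correct limit but relies on inf arithmetic. For large positive x, e^(−x) merely underflows to 0 and the formula is fine. The algebraically equivalent form e^x / (1 + e^x) has the mirror-image problem and fails harder: for large positive x it evaluates inf / inf = NaN. So we need a branched implementation: non-negative arguments use 1 / (1 + e^(−x)); negative arguments use e^x / (1 + e^x). Both branches only ever exponentiate non-positive numbers, which is safe.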

>>> sigmoid_naive(np.array([1000.0]))
array([1.])            # exp(-1000) underflows to 0, so the result is exactly 1: fine

>>> sigmoid_naive(np.array([-1000.0]))
RuntimeWarning: overflow encountered in exp
array([0.])            # right limit, but only via 1 / inf and with a warning

>>> sigmoid(np.array([1000.0, -1000.0]))   # stable version, defined below
array([1., 0.])        # no warning, no inf intermediate

Stable implementation

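def sigmoid(x):
    """Numerically stable sigmoid for any sign of x."""
    x = np.asarray(x, dtype=float)    # accept lists and integer arrays too
    out = np.empty_like(x)
    pos = x >= 0
    # x >= 0: exp(-x) <= 1, so 1 / (1 + exp(-x)) cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # x < 0: exp(x) < 1, so exp(x) / (1 + exp(x)) cannot overflow.
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out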
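
A mask-free variant is sometimes handy when the whole computation should stay a single vectorized expression. The sketch below is an alternative formulation, not the implementation above, and the name sigmoid_abs is made up for illustration. It exponentiates −|x| once, so the argument to exp is never positive, and then picks the matching closed form with np.where.

```python
import numpy as np

def sigmoid_abs(x):
    """Stable sigmoid via exp(-|x|). Hypothetical name, illustration only."""
    x = np.asarray(x, dtype=float)
    z = np.exp(-np.abs(x))            # argument is <= 0, so z lies in [0, 1]
    # With z = e^{-|x|}:
    #   x >= 0  ->  1 / (1 + e^{-x}) = 1 / (1 + z)
    #   x <  0  ->  e^{x} / (1 + e^{x}) = z / (1 + z)
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

vals = sigmoid_abs([-1000.0, -1.0, 0.0, 1.0, 1000.0])
```

Both np.where branches are evaluated for every element, which costs a little extra arithmetic compared with boolean masking, but neither branch can overflow because z never exceeds 1.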

Pattern: derivative expressible in terms of output

The famous identity is σ'(x) = σ(x) · (1 − σ(x)). This is the property that makes sigmoid (and its sibling tanh) so attractive for backprop: during the forward pass you compute and store s = σ(x); the backward pass reuses s · (1 − s) with no extra exponentials.
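
The forward/backward split described above might look like the following minimal sketch. The function names sigmoid_forward and sigmoid_backward, and the idea of returning the output itself as the cache, are assumptions chosen for illustration; real frameworks organize this differently.

```python
import numpy as np

def sigmoid_forward(x):
    """Return the activation s and a cache for the backward pass."""
    x = np.asarray(x, dtype=float)
    z = np.exp(-np.abs(x))                           # stable: argument <= 0
    s = np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))
    return s, s                                      # the cache is just the output

def sigmoid_backward(grad_out, cache):
    """Chain rule: dL/dx = dL/ds * s * (1 - s). No exponentials needed."""
    s = cache
    return grad_out * s * (1.0 - s)

x = np.array([-2.0, 0.0, 2.0])
s, cache = sigmoid_forward(x)
dx = sigmoid_backward(np.ones_like(x), cache)        # upstream gradient of 1
```

The backward function never sees x: one multiply and one subtract per element on the stored output are enough.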

Derivation

Let s = sigmoid(x) = 1 / (1 + e^{-x})

d/dx s = d/dx (1 + e^{-x})^{-1}
       = -(1 + e^{-x})^{-2} * (-e^{-x})
       = e^{-x} / (1 + e^{-x})^2

Now factor:
  e^{-x} / (1 + e^{-x})^2
    = [1 / (1 + e^{-x})] * [e^{-x} / (1 + e^{-x})]
    = s * (1 - s)            (since 1 - s = e^{-x} / (1 + e^{-x}))

=> ds/dx = s * (1 - s).  Maximum at x=0 (s=0.5) giving 0.25.

Gradient implementation

def sigmoid_grad(x):
    """d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
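
A quick way to gain confidence in a hand-derived gradient is to compare it against a central finite difference. The sketch below does that for the sigmoid gradient; the compact sigmoid here is a stand-in so the snippet runs on its own, and the step size h = 1e-5 and the test range are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    """Compact stable sigmoid, a stand-in so this snippet is self-contained."""
    z = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6.0, 6.0, 25)                            # includes x = 0
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)   # central difference
max_err = np.max(np.abs(numeric - sigmoid_grad(x)))       # should be tiny
```

If the analytic formula had a sign error or a missing factor, max_err would be on the order of the gradient itself rather than near rounding noise.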

Why it's a poor activation in deep networks

The maximum value of σ' is 0.25, attained at x = 0. Far from zero the derivative is essentially zero: the "saturating" regime. Stack several sigmoid layers and the backward pass multiplies one such factor of at most 0.25 per layer (alongside the weight matrices), so ten layers shrink the activation part of the gradient by at least 0.25^10 ≈ 9.5 × 10^−7. This is the classic cause of the vanishing gradient problem, and it is why ReLU dominates modern hidden layers while sigmoid is now mostly relegated to the output of binary classifiers.
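
The shrinkage is easy to see on a toy chain. The sketch below stacks scalar sigmoids with unit weights and no biases, which is a deliberate simplification (real layers also multiply by weight matrices); chain_gradient is a hypothetical helper written for this illustration.

```python
import math

def chain_gradient(x0, depth):
    """d(output)/d(x0) for `depth` stacked scalar sigmoids with unit weights."""
    x, grad = float(x0), 1.0
    for _ in range(depth):
        s = 1.0 / (1.0 + math.exp(-x))   # inputs stay small here, naive form is safe
        grad *= s * (1.0 - s)            # local derivative, never more than 0.25
        x = s                            # this layer's output feeds the next one
    return grad

depths = (1, 5, 10)
bound = [0.25 ** d for d in depths]                 # best case: every layer at x = 0
actual = [chain_gradient(0.0, d) for d in depths]   # never exceeds the bound
```

Only the first layer sits exactly at x = 0; later layers receive inputs in (0, 1), so their local derivatives fall below 0.25 and the product decays even faster than the bound.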

Complexity

Time: O(n) for n inputs.
Space: O(n) output. Backward reuses the forward output, so no extra storage.

Variations worth knowing

tanh: tanh(x) = 2σ(2x) − 1. Derivative 1 − tanh²(x). Zero-centered output is a mild advantage over sigmoid; both saturate.
ReLU and friends: max(0, x). Derivative is 0 for x < 0 and 1 for x > 0. Doesn't saturate on the positive side, which helped make very deep networks trainable. Leaky ReLU, GELU, and SiLU are common variants; GELU and SiLU are smooth.
Softplus: log(1 + e^x). Smooth ReLU. Its derivative is exactly the sigmoid — a nice algebraic dual.
Log-sigmoid (logistic loss kernel): log σ(x) = −softplus(−x). Used in numerically stable binary cross-entropy: never form σ explicitly when you only need its log.
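
Several of the identities in this list can be checked numerically. The sketch below builds softplus on np.logaddexp, which computes log(e^a + e^b) without overflow, derives log-sigmoid from it, and then uses that to confirm tanh(x) = 2σ(2x) − 1. The function names are illustrative choices, not a library API.

```python
import numpy as np

def softplus(x):
    """log(1 + e^x), computed as logaddexp(0, x) to avoid overflow."""
    return np.logaddexp(0.0, x)

def log_sigmoid(x):
    """log sigma(x) = -softplus(-x), without ever forming sigma(x)."""
    return -softplus(-np.asarray(x, dtype=float))

x = np.array([-1000.0, 0.0, 1000.0])
ls = log_sigmoid(x)      # stays finite, where log(sigmoid(-1000)) would give -inf

# tanh(t) = 2 * sigma(2t) - 1, recovering sigma as exp(log_sigmoid)
t = np.linspace(-3.0, 3.0, 13)
tanh_matches = np.allclose(np.tanh(t), 2.0 * np.exp(log_sigmoid(2.0 * t)) - 1.0)
```

For x = −1000 the naive route gives log(0) = −inf, while the softplus route returns −1000, which is what a binary cross-entropy loss needs to keep producing useful gradients.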