Sharpe Ratio and Backtest Evaluation

The Sharpe ratio is the most-used performance metric in finance, and the most-abused. Defined simply as risk-adjusted excess return,

$\text{Sharpe} = \frac{E[\text{returns}] - r_{f}}{\sigma_{\text{returns}}}$, annualised by multiplying by $\sqrt{\text{periods per year}}$,

it ranks strategies by a clean, interpretable number that captures the central trade-off: more return for the same risk is better. Sharpe ratios above 1 are good, above 2 are excellent, above 3 are suspicious. This page covers the standard performance metrics (Sharpe, Sortino, drawdown), and — more importantly — the PITFALLS of backtest evaluation that turn nominally great-looking strategies into out-of-sample disasters.

Standard metrics

Each captures a different aspect of risk-adjusted performance.

Sharpe ratio. Excess return divided by standard deviation. The default; reasonable when returns are approximately Gaussian. Penalises ALL volatility — upside AND downside.
Sortino ratio. Like Sharpe but uses DOWNSIDE deviation (std of negative returns only). Doesn't penalise positive volatility — useful for strategies with positive skew (option buyers, trend followers).
Maximum drawdown. Largest peak-to-trough decline as a percentage. Captures the WORST-PATH experience; matters for investors with mark-to-market constraints (redemptions, margin calls, regulatory stops).
Calmar ratio. Annualised return divided by maximum drawdown. The drawdown-based analogue of Sharpe.
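MAR ratio. Similar to Calmar but computed over the full track record: compound annual return since inception divided by the maximum drawdown since inception (Calmar is conventionally quoted over a trailing 36-month window). Used by managed-futures funds.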
Information ratio. Excess return vs a BENCHMARK divided by tracking error (std of excess returns). Used in long-only equity management.

Different metrics emphasise different things; reporting MULTIPLE metrics is standard practice, since each can be gamed individually but combinations are harder to manipulate. A strategy that looks great on Sharpe but bad on max-drawdown deserves a hard look.
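The two metrics in the list that the later listing does not cover, Calmar and the information ratio, can be sketched as below. This is a minimal illustration under assumptions, not a reference implementation: the function names, the synthetic benchmark, and the choice to start the equity curve at 1.0 (so that a loss in the very first period counts as a drawdown) are all choices made here.

```python
import numpy as np

def annualised_return(returns, periods_per_year=252):
    """Geometric (compound) annual growth rate of a per-period return series."""
    growth = np.prod(1.0 + returns)
    years = len(returns) / periods_per_year
    return growth ** (1.0 / years) - 1.0

def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    # Assumption: the curve starts at 1.0, so an initial loss is a drawdown.
    equity = np.concatenate(([1.0], np.cumprod(1.0 + returns)))
    peak = np.maximum.accumulate(equity)
    return np.max((peak - equity) / peak)

def calmar(returns, periods_per_year=252):
    """Annualised return divided by maximum drawdown."""
    mdd = max_drawdown(returns)
    if mdd == 0:
        return np.inf
    return annualised_return(returns, periods_per_year) / mdd

def information_ratio(returns, benchmark, periods_per_year=252):
    """Annualised mean active return divided by annualised tracking error."""
    active = returns - benchmark
    return np.mean(active) / np.std(active, ddof=1) * np.sqrt(periods_per_year)

# Synthetic data (assumed parameters): a benchmark plus a small active bet.
rng = np.random.default_rng(0)
benchmark = rng.normal(0.0003, 0.010, 1008)
strategy = benchmark + rng.normal(0.0002, 0.003, 1008)
print(f"Calmar ratio:      {calmar(strategy):.2f}")
print(f"Information ratio: {information_ratio(strategy, benchmark):.2f}")
```

The information ratio depends only on the active return, so a strategy can have a respectable information ratio and still suffer the benchmark's full drawdown, which is one reason to report both.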

The Sharpe ratio's distributional caveats

Sharpe assumes returns are approximately Gaussian. Three reasons that fails for real strategies, sometimes catastrophically:

(1) Fat tails. Equity index returns are not Gaussian; they have excess kurtosis (more extreme moves than Gaussian predicts). A strategy with negative skew and high Sharpe (option selling, carry trades) has a Sharpe that overstates its risk-adjusted return — the variance doesn't capture the LEFT-TAIL risk. Lo (2002) and others have shown that pension funds historically chased high-Sharpe strategies that then suffered tail events (LTCM, mortgage-backed securities, dispersion trades).

(2) Autocorrelation. Sharpe assumes returns are independent. If returns are POSITIVELY autocorrelated (momentum, illiquid assets with smoothed marks), the standard deviation measured at short horizons UNDERESTIMATES risk at longer horizons, so the annualised Sharpe is artificially inflated. Real-estate funds and private-equity funds famously have unrealistically high Sharpe ratios because of mark smoothing. Lo (2002) derives a correction for autocorrelated returns; Bailey-Lopez de Prado (2014) give further adjustments.

(3) Time aggregation. Sharpe is conventionally annualised by multiplying daily Sharpe by $\sqrt{252}$. This is only valid when daily returns are IID. For strategies with regime changes, mean-reversion, or any time-dependent structure, the annualisation can be wildly wrong.
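Points (2) and (3) can be made concrete with the adjustment from Lo (2002), which replaces the IID scaling factor $\sqrt{q}$ with $\eta(q) = q / \sqrt{q + 2 \sum_{k=1}^{q-1} (q - k) \rho_{k}}$, where $\rho_{k}$ is the lag-$k$ autocorrelation of per-period returns. The sketch below assumes an AR(1) autocorrelation structure purely for illustration; the function names and the example daily Sharpe of 0.05 are hypothetical.

```python
import numpy as np

def annualisation_factor(rho, q=252):
    """Lo (2002) factor eta(q) = q / sqrt(q + 2 * sum_{k=1}^{q-1} (q - k) * rho_k).

    rho[k-1] is the lag-k autocorrelation of per-period returns; lags beyond
    len(rho) are treated as zero. With every rho_k = 0 this reduces to sqrt(q).
    """
    rho = np.asarray(rho, dtype=float)[: q - 1]
    lags = np.arange(1, len(rho) + 1)
    return q / np.sqrt(q + 2.0 * np.sum((q - lags) * rho))

def ar1_autocorrelations(phi, max_lag):
    """Autocorrelations rho_k = phi**k of a stationary AR(1) process (assumed model)."""
    return phi ** np.arange(1, max_lag + 1)

daily_sharpe = 0.05  # hypothetical per-day Sharpe
for phi in (-0.2, 0.0, 0.2):
    eta = annualisation_factor(ar1_autocorrelations(phi, 251))
    print(f"phi={phi:+.1f}: factor={eta:6.2f}, annualised Sharpe={daily_sharpe * eta:.2f}")
```

With positive autocorrelation the factor falls below $\sqrt{252}$, so the naive annualised Sharpe is too high; with negative autocorrelation (mean-reversion) the naive figure is too low.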

Drawdown analysis

Drawdown is the deficit relative to the running peak: $DD_{t} = (P_{\max, t} - P_{t}) / P_{\max, t}$, where $P_{\max, t}$ is the maximum portfolio value up to time $t$. Plotting drawdown over time produces the UNDERWATER CURVE, the "depth" of how much you've lost from the most recent high. Useful summaries:

Maximum drawdown: the worst DD ever observed.
Drawdown duration: how LONG you stayed underwater before recovering.
Calmar ratio: annualised return / max drawdown.
Recovery time: time from drawdown bottom back to previous peak.

Investors care about drawdowns asymmetrically: a 50% drawdown requires a 100% gain to recover, and most investors will redeem before that recovery happens. The PSYCHOLOGY of drawdowns matters as much as the math — most retail investors quit somewhere between 20% and 35% drawdown, and most institutional allocators redeem between 10% and 20%.
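The drawdown summaries and the recovery asymmetry can be sketched in a few lines. The price path and the function names below are made up for illustration; the recovery formula follows from requiring $(1 - d)(1 + g) = 1$, which gives $g = d / (1 - d)$.

```python
import numpy as np

def underwater_curve(prices):
    """Fractional drawdown from the running peak at each time step."""
    prices = np.asarray(prices, dtype=float)
    peak = np.maximum.accumulate(prices)
    return (peak - prices) / peak

def longest_underwater_run(drawdown):
    """Length, in periods, of the longest stretch spent below a prior peak."""
    longest = current = 0
    for d in drawdown:
        current = current + 1 if d > 0 else 0
        longest = max(longest, current)
    return longest

def gain_to_recover(drawdown):
    """Gain needed to get back to the peak after a fractional drawdown d."""
    return drawdown / (1.0 - drawdown)

prices = [100, 110, 99, 88, 95, 110, 121, 115]  # hypothetical portfolio values
dd = underwater_curve(prices)
print("underwater curve:", np.round(dd, 4))
print("max drawdown:    ", round(float(dd.max()), 4))
print("longest run:     ", longest_underwater_run(dd), "periods")
for d in (0.10, 0.20, 0.50):
    print(f"a {d:.0%} drawdown needs a {gain_to_recover(d):.1%} gain to recover")
```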

The multiple-testing problem

The single biggest hazard in quantitative backtesting. The setup: you have a research process that tests $N$ different strategies and picks the BEST one by some criterion (highest Sharpe, highest Calmar, etc.). The "winning" strategy LOOKS impressive in-sample. Is it real or noise?

Generate $N$ RANDOM strategies, each just IID Gaussian noise with TRUE Sharpe = 0. The Sharpe ratios of the $N$ random strategies are approximately normal with mean 0 and std proportional to $1/\sqrt{T}$, where $T$ is the backtest length. Order statistics: the MAXIMUM of $N$ standard normal random variables is approximately $\sqrt{2 \ln N}$ for large $N$ (extreme value theory).

For $N = 1000$ tested strategies: $\sqrt{2 \ln 1000} \approx 3.7$ standard deviations of annualised Sharpe by pure chance. With $T = 4$ years of daily data ($\sim 1000$ observations), one std of annualised Sharpe is roughly $\sqrt{252/1000} \approx 0.5$, so the EXPECTED Sharpe of the best random strategy is approximately $3.7 \times 0.5 \approx 1.85$: a Sharpe above 1.8 PURELY BY CHANCE.
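The arithmetic above can be checked numerically. The sketch below computes the leading-order estimate in annualised Sharpe units and, as a sanity check, simulates the maximum of $N$ standard normal draws directly. The function names, seed, and simulation count are arbitrary choices made here. At moderate $N$ the simulated maximum tends to come in somewhat below $\sqrt{2 \ln N}$, because the asymptotic formula converges slowly.

```python
import numpy as np

def expected_max_sharpe_leading_order(n_trials, n_obs, periods_per_year=252):
    """sqrt(2 ln N) standard deviations, converted to annualised Sharpe units."""
    # Under H0 the annualised Sharpe estimate has std ~ sqrt(periods_per_year / n_obs).
    sharpe_std = np.sqrt(periods_per_year / n_obs)
    return np.sqrt(2.0 * np.log(n_trials)) * sharpe_std

def simulated_max_z(n_trials, n_sims=2000, seed=0):
    """Monte Carlo mean of the maximum of n_trials standard normal draws."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_sims, n_trials)).max(axis=1).mean()

n_trials, n_obs = 1000, 1008
sharpe_std = np.sqrt(252 / n_obs)
z_asymptotic = np.sqrt(2.0 * np.log(n_trials))
z_simulated = simulated_max_z(n_trials)
print(f"leading order: {z_asymptotic:.2f} std -> Sharpe "
      f"{expected_max_sharpe_leading_order(n_trials, n_obs):.2f}")
print(f"simulated:     {z_simulated:.2f} std -> Sharpe {z_simulated * sharpe_std:.2f}")
```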

Code: backtest pitfalls in action

# Standard performance metrics and the multiple-testing hazard.

import numpy as np

def sharpe(returns, rf=0.0, periods_per_year=252):
    excess = returns - rf / periods_per_year
    return np.mean(excess) / np.std(excess, ddof=1) * np.sqrt(periods_per_year)

def sortino(returns, rf=0.0, periods_per_year=252):
    """Sharpe-like ratio using DOWNSIDE deviation only."""
    excess = returns - rf / periods_per_year
    downside = excess[excess < 0]
    if len(downside) == 0: return np.inf
    return np.mean(excess) / np.std(downside, ddof=1) * np.sqrt(periods_per_year)

def max_drawdown(returns):
    cum  = np.cumprod(1 + returns)
    peak = np.maximum.accumulate(cum)
    return np.max((peak - cum) / peak)

# Synthetic real-ish strategy: 4 years of daily returns
rng = np.random.default_rng(42)
mu_d, sd_d = 0.0007, 0.012
returns = rng.normal(mu_d, sd_d, 1008)
returns[200:220] -= 0.005       # inject a drawdown cluster

print(f"Synthetic strategy, 4 years of daily returns:")
print(f"  Annualised return:  {np.mean(returns) * 252:.4f}")
print(f"  Annualised vol:     {np.std(returns) * np.sqrt(252):.4f}")
print(f"  Sharpe ratio:       {sharpe(returns):.4f}")
print(f"  Sortino ratio:      {sortino(returns):.4f}")
print(f"  Max drawdown:       {max_drawdown(returns):.4%}")

# The multiple-testing trap: search 1000 RANDOM strategies (true Sharpe = 0)
# and report the best. A "Sharpe of 1.6" can easily arise PURELY BY CHANCE.
print(f"\nData-mining demo: 1000 random strategies, all with TRUE Sharpe = 0:")
N_strats = 1000
N_days = 252 * 4
fake = rng.normal(0, sd_d, (N_strats, N_days))
sharpes = np.array([sharpe(s) for s in fake])
print(f"  Mean Sharpe across all 1000: {np.mean(sharpes):.4f}")
print(f"  BEST Sharpe found:           {np.max(sharpes):.4f}")
print(f"  Strategy ranked 10th:        {np.sort(sharpes)[-10]:.4f}")
print(f"  Sharpe of 1.6 is the 99.9th percentile of random noise.")

# Deflated Sharpe ratio (Bailey & Lopez de Prado 2014) adjusts an in-sample
# Sharpe for the number N of strategies tried and for non-normal returns.
# For Gaussian noise, to leading order:
#   E[best annualised Sharpe of N] ≈ sqrt(2 ln N) * sqrt(periods_per_year / n_obs).
# At N=1000 and 4 years (1008 obs), the expected max under H0 is:
expected_max_z = np.sqrt(2 * np.log(N_strats))   # in standard deviations
sharpe_std_h0 = np.sqrt(252 / N_days)            # std of annualised Sharpe under H0
print(f"\n  Expected max Sharpe of N=1000 random strategies (Gaussian):")
print(f"  ~ sqrt(2 ln N) = {expected_max_z:.4f} std = {expected_max_z * sharpe_std_h0:.4f} Sharpe (leading order; observed max is lower).")

Output:

Synthetic strategy, 4 years of daily returns:
  Annualised return:  0.0649
  Annualised vol:     0.1882
  Sharpe ratio:       0.3448
  Sortino ratio:      0.5579
  Max drawdown:       24.2051%

Data-mining demo: 1000 random strategies, all with TRUE Sharpe = 0:
  Mean Sharpe across all 1000: 0.0013
  BEST Sharpe found:           1.6375
  Strategy ranked 10th:        1.1791
  Sharpe of 1.6 is the 99.9th percentile of random noise.

  Expected max Sharpe of N=1000 random strategies (Gaussian):
  ~ sqrt(2 ln N) = 3.7169 std = 1.8585 Sharpe (leading order; observed max is lower).

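The synthetic strategy gives Sharpe 0.34 and max drawdown 24%: reasonable-but-mediocre numbers. The data-mining demonstration is the key result: 1000 random strategies, each with TRUE Sharpe = 0. The best one has nominal Sharpe 1.64; the tenth-best has 1.18. Either looks like a publication-worthy alpha if you saw it in isolation. The leading-order estimate ($\sqrt{2 \ln N} \approx 3.7$ std under $H_{0}$, about 1.85 in annualised Sharpe units) is of the same order as the empirical best but overshoots it: with one std of annualised Sharpe equal to 0.5, the observed 1.64 is about 3.3 std, because $\sqrt{2 \ln N}$ converges slowly and overestimates the expected maximum at $N = 1000$.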

Defences against the multiple-testing trap

Three lines of defence, in order of effectiveness.

(1) DEFLATED SHARPE RATIO (Bailey-Lopez de Prado 2014). Given an observed Sharpe and the number $N$ of strategies tested, the DSR is the probability that the observed Sharpe exceeds what the best of $N$ trials would be expected to show PURELY BY CHANCE under $H_{0}: \text{true Sharpe} = 0$. The DSR adjusts for: number of trials (multiple testing), skewness and kurtosis of returns (non-normality), and series length. For any deeply mined strategy, the DSR implies far less confidence than the raw Sharpe suggests.

(2) OUT-OF-SAMPLE TESTING. Split the data into a research period and a hold-out period; develop the strategy on the research data, evaluate on the hold-out. This protects against multiple testing only IF the hold-out is genuinely never used during development; in practice researchers iterate on hold-out performance, defeating its purpose. Tools like SEQUENTIAL hold-out (different out-of-sample periods for different stages of research) can help.

(3) PROBABILITY OF BACKTEST OVERFITTING (PBO). Bailey-Borwein-Lopez de Prado-Zhu (2016) introduce a CV-based test: split the data into multiple periods, rank strategies in each, and compute the probability that the strategy that LOOKED BEST in-sample is below-median out-of-sample. A high PBO (> 0.5) means the backtest is essentially useless. Standard in modern quant research.
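A minimal sketch of defence (1) follows, assuming the commonly quoted form of the deflated Sharpe ratio: a probabilistic Sharpe ratio evaluated against a benchmark equal to the expected maximum Sharpe of $N$ unskilled trials. The function names are hypothetical, the Sharpe ratios are per-period (not annualised), and `trial_sharpe_std` (the dispersion of Sharpe estimates across trials) is an input the researcher has to supply from their own research log. Treat it as an illustration rather than a reference implementation.

```python
import numpy as np
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, trial_sharpe_std):
    """Expected maximum Sharpe across n_trials (>= 2) under H0: true Sharpe = 0."""
    z1 = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * np.e))
    return trial_sharpe_std * ((1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)

def deflated_sharpe_ratio(returns, n_trials, trial_sharpe_std):
    """Probability that the observed per-period Sharpe beats the best-of-N noise level."""
    returns = np.asarray(returns, dtype=float)
    n_obs = len(returns)
    sr = returns.mean() / returns.std(ddof=1)       # per-period, NOT annualised
    z = (returns - returns.mean()) / returns.std(ddof=0)
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4)                          # raw kurtosis, 3 for a Gaussian
    sr0 = expected_max_sharpe(n_trials, trial_sharpe_std)
    denom = np.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return STD_NORMAL.cdf((sr - sr0) * np.sqrt(n_obs - 1) / denom)

# Pick the best of 1000 pure-noise strategies, then deflate its Sharpe.
rng = np.random.default_rng(7)
noise = rng.normal(0.0, 0.012, (1000, 1008))
per_period_sharpes = noise.mean(axis=1) / noise.std(axis=1, ddof=1)
best = noise[np.argmax(per_period_sharpes)]
spread = per_period_sharpes.std(ddof=1)
dsr_best = deflated_sharpe_ratio(best, n_trials=1000, trial_sharpe_std=spread)
dsr_naive = deflated_sharpe_ratio(best, n_trials=2, trial_sharpe_std=spread)
print(f"DSR acknowledging 1000 trials: {dsr_best:.3f}")
print(f"DSR pretending only 2 trials:  {dsr_naive:.3f}")
```

The same return series looks far more convincing when the number of trials is understated, which is why an honest count of everything that was tried matters as much as the formula.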

Other classical pitfalls

Look-ahead bias. Using information at time $t$ that wasn't AVAILABLE at $t$ — e.g., using close-of-day data to "predict" close-of-day returns. Causes the most embarrassing live blow-ups. Defence: timestamp all features explicitly with their AVAILABILITY time, not their REFERENCE time.
Survivorship bias. Backtesting only on currently-existing stocks ignores delisted ones (mostly losers). Inflates returns and Sharpe. Defence: use a survivorship-bias-free dataset that includes delisted issues.
Time-series cross-validation. Standard k-fold CV uses random partitions; for time series this leaks information (the validation set has neighboring training observations). Defence: USE TIME-ORDERED splits with optional EMBARGO between train and test windows. Lopez de Prado's PURGED k-fold is the standard reference.
Triple-barrier labeling. For ML applications, defining the "target" as the next return is naive — it ignores stop-losses, profit-takes, and time-based exits. Triple-barrier (Lopez de Prado) defines the label as the FIRST barrier hit: profit target, stop loss, or time horizon. More realistic but more complex.
Transaction costs. Backtests routinely ignore market impact, spread, financing, taxes. A strategy with a Sharpe of 0.5 before costs and 0.1 after is barely a strategy.
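Two of the defences in this list are mechanical enough to sketch: lagging a signal so that positions only use information available at the time, and time-ordered splits with an embargo. The sketch below is an expanding-window walk-forward scheme with a simple gap before each test window. It is not Lopez de Prado's purged k-fold, and the function names and default parameters are assumptions made for illustration.

```python
import numpy as np

def lag_signal(signal, lag=1):
    """Shift a signal so the position at t uses only information through t - lag (lag >= 1)."""
    signal = np.asarray(signal, dtype=float)
    lagged = np.zeros_like(signal)
    lagged[lag:] = signal[:-lag]
    return lagged

def walk_forward_splits(n_obs, n_folds=4, embargo=5):
    """Yield (train_idx, test_idx) pairs with training strictly before testing.

    The last `embargo` observations before each test window are dropped from
    training, so labels that overlap the boundary cannot leak into the test set.
    """
    fold_edges = np.linspace(0, n_obs, n_folds + 2, dtype=int)
    for i in range(1, n_folds + 1):
        test_start, test_end = fold_edges[i], fold_edges[i + 1]
        train_end = max(test_start - embargo, 0)
        yield np.arange(0, train_end), np.arange(test_start, test_end)

print(lag_signal([1.0, 2.0, 3.0]))
for train_idx, test_idx in walk_forward_splits(100):
    print(f"train 0..{train_idx[-1]:>2}  |  test {test_idx[0]}..{test_idx[-1]}")
```

Each test window is evaluated only with models fitted on earlier data, which mirrors how the strategy would have been run live.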

Kelly criterion — the position-sizing complement to Sharpe-based ranking.
Mean-variance portfolio optimization — the tangency portfolio is the max-Sharpe portfolio.
Value at Risk — risk metric complementing Sharpe; captures tail risk that Sharpe misses.
Statistics & inference — the multiple-testing problem here is the same one that plagues all empirical sciences.