Sharpe Ratio and Backtest Evaluation
The Sharpe ratio is the most-used performance metric in finance, and the most-abused. Defined simply as risk-adjusted excess return, it ranks strategies by a clean, interpretable number that captures the central trade-off: more return for the same risk is better. As a rule of thumb, annualised Sharpe ratios above 1 are good, above 2 are excellent, and above 3 are suspicious. This page covers the standard performance metrics (Sharpe, Sortino, drawdown) and, more importantly, the PITFALLS of backtest evaluation that turn nominally great-looking strategies into out-of-sample disasters.
Standard metrics
Each captures a different aspect of risk-adjusted performance.
- Sharpe ratio. Excess return divided by standard deviation. The default; reasonable when returns are approximately Gaussian. Penalises ALL volatility — upside AND downside.
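- Sortino ratio. Like Sharpe but uses DOWNSIDE deviation: dispersion measured from below-target returns only, commonly the root-mean-square shortfall below zero or a minimum acceptable return (a simpler variant takes the std of negative returns only). Doesn't penalise positive volatility, which makes it useful for strategies with positive skew (option buyers, trend followers).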
- Maximum drawdown. Largest peak-to-trough decline as a percentage. Captures the WORST-PATH experience; matters for investors with mark-to-market constraints (redemptions, margin calls, regulatory stops).
- Calmar ratio. Annualised return divided by maximum drawdown. The drawdown-based analogue of Sharpe.
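- MAR ratio. Like Calmar, annualised return divided by maximum drawdown, but conventionally measured over the full track record since inception, whereas Calmar typically uses a trailing 36-month window. Used by managed-futures funds. (The variant based on average annual drawdown is usually called the Sterling ratio.)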
- Information ratio. Excess return vs a BENCHMARK divided by tracking error (std of excess returns). Used in long-only equity management.
Different metrics emphasise different things; reporting MULTIPLE metrics is standard practice, since each can be gamed individually but combinations are harder to manipulate. A strategy that looks great on Sharpe but bad on max-drawdown deserves a hard look.
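The drawdown-based and benchmark-relative metrics in the list above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names, the arithmetic annualisation of the mean return, and the toy data are assumptions made for the example, not a reference implementation.

```python
import math
from itertools import accumulate
from statistics import mean, stdev


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve, as a fraction."""
    # Equity curve starting at 1.0, compounding each period's return.
    equity = list(accumulate(returns, lambda v, r: v * (1 + r), initial=1.0))
    peaks = accumulate(equity, max)  # running maximum of the equity curve
    return max((p - v) / p for p, v in zip(peaks, equity))


def calmar(returns, periods_per_year=252):
    """Annualised (arithmetic) return divided by maximum drawdown.

    Assumes the series has at least one drawdown; otherwise this divides by zero.
    """
    return mean(returns) * periods_per_year / max_drawdown(returns)


def information_ratio(returns, benchmark, periods_per_year=252):
    """Annualised mean active return divided by annualised tracking error."""
    active = [r - b for r, b in zip(returns, benchmark)]
    return mean(active) / stdev(active) * math.sqrt(periods_per_year)


# Toy data (hypothetical): a 50% loss followed by the 100% gain needed to recover.
toy = [0.10, -0.50, 1.00]
print(f"max drawdown: {max_drawdown(toy):.2%}")
print(f"Calmar (periods_per_year=1): {calmar(toy, periods_per_year=1):.2f}")
```

Reporting these next to Sharpe makes the cross-check described above mechanical: a strategy whose Sharpe and Calmar disagree sharply is one to inspect by hand.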
The Sharpe ratio's distributional caveats
Sharpe assumes returns are approximately Gaussian. Three reasons that fails for real strategies, sometimes catastrophically:
(1) Fat tails. Equity index returns are not Gaussian; they have excess kurtosis (more extreme moves than Gaussian predicts). A strategy with negative skew and high Sharpe (option selling, carry trades) has a Sharpe that overstates its risk-adjusted return — the variance doesn't capture the LEFT-TAIL risk. Lo (2002) and others have shown that pension funds historically chased high-Sharpe strategies that then suffered tail events (LTCM, mortgage-backed securities, dispersion trades).
(2) Autocorrelation. Sharpe assumes returns are independent. If returns are POSITIVELY autocorrelated (momentum, illiquid assets with smoothed marks), the realized standard deviation UNDERESTIMATES true risk — i.e., the Sharpe is artificially inflated. Real-estate funds and private-equity funds famously have unrealistically high Sharpe ratios because of mark smoothing. Bailey-Lopez de Prado (2014) and earlier papers give corrections.
(3) Time aggregation. Sharpe is conventionally annualised by multiplying the daily Sharpe by $\sqrt{252}$ (the square root of the number of trading days per year). This is only valid when daily returns are IID. For strategies with regime changes, mean-reversion, or any time-dependent structure, the annualisation can be wildly wrong.
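To make the autocorrelation and time-aggregation points concrete, here is a sketch of the serial-correlation adjustment usually credited to Lo (2002): the $\sqrt{q}$ scaling is replaced by $\eta(q) = q / \sqrt{q + 2\sum_{k=1}^{q-1}(q-k)\rho_k}$, where $\rho_k$ is the lag-$k$ autocorrelation of returns. The function name and the AR(1)-style autocorrelations $\rho_k = \phi^k$ are illustrative assumptions.

```python
import math


def annualisation_factor(rhos, q=252):
    """eta(q) = q / sqrt(q + 2 * sum_{k=1}^{q-1} (q - k) * rho_k).

    rhos[k-1] is the return autocorrelation at lag k; lags beyond len(rhos)
    are treated as zero. With no autocorrelation this reduces to sqrt(q).
    """
    weighted = sum((q - k) * rho for k, rho in enumerate(rhos, start=1) if k < q)
    return q / math.sqrt(q + 2 * weighted)


naive = math.sqrt(252)
# Hypothetical AR(1)-style autocorrelations rho_k = phi**k.
for phi in (-0.2, 0.0, 0.2):
    rhos = [phi ** k for k in range(1, 252)]
    eta = annualisation_factor(rhos)
    print(f"phi={phi:+.1f}: factor {eta:6.2f} vs naive {naive:.2f}")
```

With positive autocorrelation the adjusted factor falls below $\sqrt{252}$, so the naively annualised Sharpe is overstated; with negative autocorrelation the opposite holds.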
Drawdown analysis
Drawdown is the deficit relative to the running peak: $DD_t = (M_t - V_t) / M_t$, where $V_t$ is the portfolio value at time $t$ and $M_t = \max_{s \le t} V_s$ is the maximum portfolio value up to time $t$. Plotting drawdown over time produces the UNDERWATER CURVE, the "depth" of how much you've lost from the most recent high. Useful summaries:
- Maximum drawdown: the worst DD ever observed.
- Drawdown duration: how LONG you stayed underwater before recovering.
- Calmar ratio: annualised return / max drawdown.
- Recovery time: time from drawdown bottom back to previous peak.
Investors care about drawdowns asymmetrically: a 50% drawdown requires a 100% gain to recover, and most investors will redeem before that recovery happens. The PSYCHOLOGY of drawdowns matters as much as the math — most retail investors quit somewhere between 20% and 35% drawdown, and most institutional allocators redeem between 10% and 20%.
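The underwater curve and its summaries can be computed directly from a return series. The sketch below uses hypothetical helper names and a four-period toy series; it also encodes the recovery arithmetic, since a drawdown of $d$ needs a gain of $d / (1 - d)$ to get back to the previous peak.

```python
from itertools import accumulate


def underwater(returns):
    """Drawdown at each step: (running peak - equity) / running peak."""
    equity = list(accumulate(returns, lambda v, r: v * (1 + r), initial=1.0))
    peaks = accumulate(equity, max)
    return [(p - v) / p for p, v in zip(peaks, equity)]


def longest_underwater_spell(drawdowns, eps=1e-12):
    """Longest run of consecutive periods spent below the previous peak."""
    longest = current = 0
    for d in drawdowns:
        current = current + 1 if d > eps else 0
        longest = max(longest, current)
    return longest


def gain_to_recover(drawdown):
    """Fractional gain needed to return to the previous peak after a drawdown."""
    return drawdown / (1 - drawdown)


# Toy series (hypothetical): up 10%, down 50%, up 50%, up 40%.
dd = underwater([0.10, -0.50, 0.50, 0.40])
print("underwater curve:", [round(d, 4) for d in dd])
print("max drawdown:", round(max(dd), 4))
print("longest spell underwater (periods):", longest_underwater_spell(dd))
print("gain needed after a 50% drawdown:", gain_to_recover(0.50))
```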
The multiple-testing problem
The single biggest hazard in quantitative backtesting. The setup: you have a research process that tests $N$ different strategies and picks the BEST one by some criterion (highest Sharpe, highest Calmar, etc.). The "winning" strategy LOOKS impressive in-sample. Is it real or noise?
Generate $N$ RANDOM strategies, each just IID Gaussian noise with TRUE Sharpe = 0. The estimated Sharpe ratios of the random strategies are approximately normal with mean 0 and std proportional to $1/\sqrt{T}$, where $T$ is the backtest length. Order statistics: the MAXIMUM of $N$ standard normal random variables is approximately $\sqrt{2 \ln N}$ for large $N$ (extreme value theory).
For $N = 1000$ tested strategies: $\sqrt{2 \ln 1000} \approx 3.7$ standard deviations of annualised Sharpe by pure chance. With $4$ years of daily data ($T = 1008$ observations), one std of annualised Sharpe is roughly $\sqrt{252/1008} = 0.5$, so the EXPECTED Sharpe of the best random strategy is approximately $3.7 \times 0.5 \approx 1.85$: a Sharpe above 1.8 PURELY BY CHANCE.
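The chance-level Sharpe depends on both the number of strategies tried and the backtest length. The sketch below tabulates the leading-order estimate $\sqrt{2 \ln N} / \sqrt{\text{years}}$ (using a std of roughly $1/\sqrt{\text{years}}$ for an annualised Sharpe under zero skill) and checks the order-statistics step with a small Monte Carlo. Function names, trial counts, and the seed are assumptions for illustration.

```python
import math
import random


def chance_sharpe(n_strategies, years):
    """Leading-order estimate of the best annualised Sharpe among N zero-skill strategies."""
    return math.sqrt(2 * math.log(n_strategies)) / math.sqrt(years)


def simulated_max_sigma(n_strategies, trials=200, seed=0):
    """Monte Carlo mean of the maximum of N standard normal draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0, 1) for _ in range(n_strategies))
    return total / trials


for n in (10, 100, 1000, 10000):
    row = "  ".join(f"{years}y: {chance_sharpe(n, years):.2f}" for years in (2, 4, 10))
    print(f"N={n:>5}  {row}")

print(f"sqrt(2 ln 1000)           = {math.sqrt(2 * math.log(1000)):.2f} std")
print(f"simulated mean max, N=1000 = {simulated_max_sigma(1000):.2f} std")
```

The simulated mean maximum comes out somewhat below $\sqrt{2 \ln N}$, because that expression is an asymptotic approximation that runs high at finite $N$. It is best read as a conservative bar that a mined Sharpe has to clear.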
Code: backtest pitfalls in action
# Standard performance metrics and the multiple-testing hazard.
import numpy as np


def sharpe(returns, rf=0.0, periods_per_year=252):
    """Annualised Sharpe ratio; rf is an ANNUAL risk-free rate."""
    excess = returns - rf / periods_per_year
    return np.mean(excess) / np.std(excess, ddof=1) * np.sqrt(periods_per_year)


def sortino(returns, rf=0.0, periods_per_year=252):
    """Sharpe-like ratio using DOWNSIDE deviation only.

    Simplified variant: the std of the negative excess returns.
    """
    excess = returns - rf / periods_per_year
    downside = excess[excess < 0]
    if len(downside) == 0:
        return np.inf
    return np.mean(excess) / np.std(downside, ddof=1) * np.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    cum = np.cumprod(1 + returns)
    peak = np.maximum.accumulate(cum)
    return np.max((peak - cum) / peak)


# Synthetic real-ish strategy: 4 years of daily returns
rng = np.random.default_rng(42)
mu_d, sd_d = 0.0007, 0.012
returns = rng.normal(mu_d, sd_d, 1008)
returns[200:220] -= 0.005  # inject a drawdown cluster
print("Synthetic strategy, 4 years of daily returns:")
print(f" Annualised return: {np.mean(returns) * 252:.4f}")
print(f" Annualised vol: {np.std(returns) * np.sqrt(252):.4f}")
print(f" Sharpe ratio: {sharpe(returns):.4f}")
print(f" Sortino ratio: {sortino(returns):.4f}")
print(f" Max drawdown: {max_drawdown(returns):.4%}")

# The multiple-testing trap: search 1000 RANDOM strategies (true Sharpe = 0)
# and report the best. A "Sharpe of 1.6" can easily arise PURELY BY CHANCE.
print("\nData-mining demo: 1000 random strategies, all with TRUE Sharpe = 0:")
N_strats = 1000
N_days = 252 * 4
fake = rng.normal(0, sd_d, (N_strats, N_days))
sharpes = np.array([sharpe(s) for s in fake])
print(f" Mean Sharpe across all 1000: {np.mean(sharpes):.4f}")
print(f" BEST Sharpe found: {np.max(sharpes):.4f}")
print(f" Strategy ranked 10th: {np.sort(sharpes)[-10]:.4f}")
print(" Sharpe of 1.6 is the 99.9th percentile of random noise.")

# Deflated Sharpe ratio (Bailey & Lopez de Prado 2014) adjusts an in-sample
# Sharpe for the number N of strategies tried and for non-normal returns.
# For Gaussian noise: E[max of N] ≈ sqrt(2 ln N) standard deviations, and one
# std of annualised Sharpe is sqrt(periods_per_year / n_obs).
# At N=1000 and 4 years (1008 obs), the expected max under H0 is:
max_sigma_h0 = np.sqrt(2 * np.log(N_strats))
expected_max_h0 = max_sigma_h0 * np.sqrt(252 / N_days)
print("\n Expected max Sharpe of N=1000 random strategies (Gaussian):")
print(f" ~ sqrt(2 ln N) = {max_sigma_h0:.4f} std, i.e. Sharpe ~ {expected_max_h0:.4f}")

Output:

Synthetic strategy, 4 years of daily returns:
 Annualised return: 0.0649
 Annualised vol: 0.1882
 Sharpe ratio: 0.3448
 Sortino ratio: 0.5579
 Max drawdown: 24.2051%

Data-mining demo: 1000 random strategies, all with TRUE Sharpe = 0:
 Mean Sharpe across all 1000: 0.0013
 BEST Sharpe found: 1.6375
 Strategy ranked 10th: 1.1791
 Sharpe of 1.6 is the 99.9th percentile of random noise.

 Expected max Sharpe of N=1000 random strategies (Gaussian):
 ~ sqrt(2 ln N) = 3.7169 std, i.e. Sharpe ~ 1.8585

The synthetic strategy gives Sharpe 0.34, max drawdown 24%: reasonable-but-mediocre numbers. The data-mining demonstration is the key result: 1000 random strategies, each with TRUE Sharpe = 0. The best one has nominal Sharpe 1.64; the tenth-best has 1.18. Either looks like a publication-worthy alpha if you saw it in isolation. The expected-value calculation ($\sqrt{2 \ln N} \approx 3.7$ std under $H_0$, i.e. a Sharpe of about 1.86) is in the same range as the empirical result. It runs somewhat high because $\sqrt{2 \ln N}$ is a leading-order asymptotic that overstates the expected maximum at finite $N$: for $N = 1000$ the expected maximum of standard normals is about 3.24 std, i.e. a Sharpe of about 1.62, which matches the observed 1.64 closely.
Defences against the multiple-testing trap
Three lines of defence, in order of effectiveness.
(1) DEFLATED SHARPE RATIO (Bailey-Lopez de Prado 2014). Given an observed Sharpe $\widehat{SR}$ and the number $N$ of strategies tested, compute the probability that the observed Sharpe could arise PURELY BY CHANCE under $H_0$ (true Sharpe = 0). The DSR adjusts for: number of trials (multiple testing), skewness and kurtosis of returns (non-normality), and series length. The corrected Sharpe is dramatically lower than the raw value for any deeply mined strategy.
(2) OUT-OF-SAMPLE TESTING. Split the data into a research period and a hold-out period; develop the strategy on the research data, evaluate on the hold-out. This protects against multiple testing only IF the hold-out is genuinely never used during development. In practice researchers iterate on hold-out performance, which turns the hold-out into one more in-sample period and defeats its purpose. Tools like SEQUENTIAL hold-out (different out-of-sample periods for different stages of research) can help.
(3) PROBABILITY OF BACKTEST OVERFITTING (PBO). Bailey-Borwein-Lopez de Prado-Zhu (2016) introduce a CV-based test: split the data into multiple periods, rank strategies in each, and compute the probability that the strategy that LOOKED BEST in-sample is below-median out-of-sample. A high PBO (> 0.5) means the backtest is essentially useless. Standard in modern quant research.
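As a rough illustration of defence (1), the sketch below follows the deflated-Sharpe recipe as commonly stated: a probabilistic Sharpe ratio evaluated against the expected maximum Sharpe of $N$ zero-skill trials. It works in per-period (not annualised) units and assumes the cross-trial std of Sharpe estimates is supplied by the caller; here it is set to $1/\sqrt{T}$, the Gaussian-null value. All function and parameter names are hypothetical, and the formula should be checked against the original paper before being relied on.

```python
import math
import random
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected maximum Sharpe across n_trials (>= 2) zero-skill strategies."""
    a = _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    b = _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    return sharpe_std * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(returns, n_trials, sharpe_std):
    """Probability-like score that the observed Sharpe beats the expected best of n_trials."""
    n = len(returns)
    mu = sum(returns) / n
    m2 = sum((r - mu) ** 2 for r in returns) / n
    m3 = sum((r - mu) ** 3 for r in returns) / n
    m4 = sum((r - mu) ** 4 for r in returns) / n
    sr = mu / math.sqrt(m2)   # per-period Sharpe
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2       # raw kurtosis (3 for a Gaussian)
    sr0 = expected_max_sharpe(n_trials, sharpe_std)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _STD_NORMAL.cdf((sr - sr0) * math.sqrt(n - 1) / denom)


# Hypothetical strategy: 4 years of daily Gaussian returns with a small positive mean.
rng = random.Random(0)
rets = [rng.gauss(0.0007, 0.012) for _ in range(1008)]
for n_trials in (2, 10, 100, 1000):
    dsr = deflated_sharpe(rets, n_trials, sharpe_std=1 / math.sqrt(len(rets)))
    print(f"N={n_trials:>4}: DSR = {dsr:.3f}")
```

The same return series scores lower as the number of trials grows, which is the point of the adjustment: the bar rises with the amount of searching that produced the strategy.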
Other classical pitfalls
- Look-ahead bias. Using information at time $t$ that wasn't AVAILABLE at $t$, for example using close-of-day data to "predict" close-of-day returns. Causes the most embarrassing live blow-ups. Defence: timestamp all features explicitly with their AVAILABILITY time, not their REFERENCE time.
- Survivorship bias. Backtesting only on currently-existing stocks ignores delisted ones (mostly losers). Inflates returns and Sharpe. Defence: use a survivorship-bias-free dataset that includes delisted issues.
- Time-series cross-validation. Standard k-fold CV uses random partitions; for time series this leaks information (the validation set has neighboring training observations). Defence: USE TIME-ORDERED splits with optional EMBARGO between train and test windows. Lopez de Prado's PURGED k-fold is the standard reference.
- Triple-barrier labeling. For ML applications, defining the "target" as the next return is naive — it ignores stop-losses, profit-takes, and time-based exits. Triple-barrier (Lopez de Prado) defines the label as the FIRST barrier hit: profit target, stop loss, or time horizon. More realistic but more complex.
- Transaction costs. Backtests routinely ignore market impact, spread, financing, taxes. A strategy with a Sharpe of 0.5 before costs and 0.1 after is barely a strategy.
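The time-ordered splitting with an embargo mentioned in the list above can be sketched as a simple expanding-window generator. This is a walk-forward scheme, not Lopez de Prado's purged k-fold itself; the function name, fold sizing, and embargo handling are assumptions made for illustration.

```python
def walk_forward_splits(n_obs, n_folds, embargo=0):
    """Yield (train, test) index ranges with train strictly before test.

    `embargo` observations immediately before each test window are left out
    of training, so labels or features that overlap the boundary cannot leak.
    """
    fold = n_obs // (n_folds + 1)
    for k in range(1, n_folds + 1):
        test_start = k * fold
        # The last fold absorbs any remainder so every late observation is tested.
        test_end = n_obs if k == n_folds else test_start + fold
        train_end = max(0, test_start - embargo)
        yield range(0, train_end), range(test_start, test_end)


for train, test in walk_forward_splits(n_obs=100, n_folds=4, embargo=5):
    print(f"train [0, {train.stop})  embargo [{train.stop}, {test.start})  "
          f"test [{test.start}, {test.stop})")
```

Each training window ends a fixed number of observations before its test window begins, and no test observation ever precedes its own training data, which is the property random k-fold partitions lack.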
Related
- Kelly criterion — the position-sizing complement to Sharpe-based ranking.
- Mean-variance portfolio optimization — the tangency portfolio is the max-Sharpe portfolio.
- Value at Risk — risk metric complementing Sharpe; captures tail risk that Sharpe misses.
- Statistics & inference — the multiple-testing problem here is the same one that plagues all empirical sciences.