Sharpe Ratio and Backtest Evaluation

The Sharpe ratio is the most-used performance metric in finance, and the most abused. Defined simply as risk-adjusted excess return,

$$S = \frac{E[R - R_f]}{\sigma(R - R_f)},$$

with $R$ the strategy return and $R_f$ the risk-free rate, it ranks strategies by a clean, interpretable number that captures the central trade-off: more return for the same risk is better. As a rule of thumb, annualised Sharpe ratios above 1 are good, above 2 are excellent, above 3 are suspicious. This page covers the standard performance metrics (Sharpe, Sortino, drawdown) and, more importantly, the PITFALLS of backtest evaluation that turn nominally great-looking strategies into out-of-sample disasters.

Standard metrics

Each of the standard metrics captures a different aspect of risk-adjusted performance.

Different metrics emphasise different things; reporting MULTIPLE metrics is standard practice, since each can be gamed individually but combinations are harder to manipulate. A strategy that looks great on Sharpe but bad on max-drawdown deserves a hard look.
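For reference, the three metrics named on this page can be written as below. These are the common textbook forms under assumed notation, not necessarily the exact definitions the page originally tabulated: $r_t$ is the per-period excess return over $T$ periods, $P$ is the number of periods per year, $V_t$ is the portfolio value and $M_t$ its running peak. Note that the `sortino` function in the code further down uses a slightly different downside measure (the sample std of the negative excess returns only).

```latex
\begin{aligned}
\text{Sharpe}  &= \frac{\bar{r}}{\sigma_r}\,\sqrt{P},
  \qquad \bar{r} = \frac{1}{T}\sum_{t=1}^{T} r_t,
  \quad \sigma_r^2 = \frac{1}{T-1}\sum_{t=1}^{T} (r_t - \bar{r})^2 \\[4pt]
\text{Sortino} &= \frac{\bar{r}}{\sigma_{\text{down}}}\,\sqrt{P},
  \qquad \sigma_{\text{down}}^2 = \frac{1}{T}\sum_{t=1}^{T} \min(r_t, 0)^2 \\[4pt]
\text{MaxDD}   &= \max_{t}\ \frac{M_t - V_t}{M_t},
  \qquad M_t = \max_{s \le t} V_s
\end{aligned}
```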

The Sharpe ratio's distributional caveats

Sharpe assumes returns are approximately Gaussian. Three reasons that fails for real strategies, sometimes catastrophically:

(1) Fat tails. Equity index returns are not Gaussian; they have excess kurtosis (more extreme moves than Gaussian predicts). A strategy with negative skew and high Sharpe (option selling, carry trades) has a Sharpe that overstates its risk-adjusted return — the variance doesn't capture the LEFT-TAIL risk. Lo (2002) and others have shown that pension funds historically chased high-Sharpe strategies that then suffered tail events (LTCM, mortgage-backed securities, dispersion trades).

(2) Autocorrelation. Sharpe assumes returns are independent. If returns are POSITIVELY autocorrelated (momentum, illiquid assets with smoothed marks), the realized standard deviation UNDERESTIMATES true risk — i.e., the Sharpe is artificially inflated. Real-estate funds and private-equity funds famously have unrealistically high Sharpe ratios because of mark smoothing. Bailey-Lopez de Prado (2014) and earlier papers give corrections.

(3) Time aggregation. Sharpe is conventionally annualised by multiplying daily Sharpe by $\sqrt{252}$. This is only valid when daily returns are IID. For strategies with regime changes, mean-reversion, or any time-dependent structure, the annualisation can be wildly wrong.
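The three caveats above can be made concrete with a small simulation. This is a sketch under stated assumptions, not a calibrated model: the crash probability, loss size and two-day averaging of marks are made-up illustrative parameters, and all function names are hypothetical. `lo_annualisation_factor` implements the scaling $\eta(q) = q / \sqrt{q + 2\sum_{k=1}^{q-1}(q-k)\rho_k}$ from Lo (2002), which assumes stationary returns and reduces to $\sqrt{q}$ when all autocorrelations are zero.

```python
import numpy as np


def annualised_sharpe(r, periods_per_year=252):
    """Naive annualised Sharpe: per-period Sharpe times sqrt(periods_per_year)."""
    return r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)


def skew_and_excess_kurtosis(r):
    """Sample skewness and excess kurtosis (both 0 for a Gaussian)."""
    z = (r - r.mean()) / r.std()
    return np.mean(z**3), np.mean(z**4) - 3.0


def autocorr(r, lag):
    """Sample autocorrelation of r at the given positive lag."""
    d = r - r.mean()
    return np.dot(d[:-lag], d[lag:]) / np.dot(d, d)


def lo_annualisation_factor(rhos, q=252):
    """Lo (2002) scaling factor; rhos[k-1] is the lag-k autocorrelation.

    Lags beyond len(rhos) are treated as zero (an assumption of this sketch).
    """
    k = np.arange(1, len(rhos) + 1)
    return q / np.sqrt(q + 2.0 * np.sum((q - k) * np.asarray(rhos)))


rng = np.random.default_rng(7)
n = 10_000

# (1) Fat tails: steady small gains with rare large losses (hypothetical
#     option-selling-like payoff). Variance alone hides the left tail.
crash = rng.random(n) < 0.01
short_vol = np.where(crash, -0.08, 0.0015) + rng.normal(0, 0.002, n)
skew, ex_kurt = skew_and_excess_kurtosis(short_vol)

# (2) Autocorrelation: report the average of today's and yesterday's true
#     return, a crude stand-in for smoothed marks.
true_r = rng.normal(0.001, 0.01, n)
smoothed = 0.5 * (true_r[1:] + true_r[:-1])
inflation = annualised_sharpe(smoothed) / annualised_sharpe(true_r[1:])
rho1 = autocorr(smoothed, 1)

# (3) Time aggregation: sqrt(252) versus the autocorrelation-aware factor.
naive_factor = np.sqrt(252)
lo_factor = lo_annualisation_factor([rho1])

print(f"short-vol skew {skew:.2f}, excess kurtosis {ex_kurt:.2f}")
print(f"Sharpe inflation from smoothing: x{inflation:.2f} (lag-1 autocorr {rho1:.2f})")
print(f"annualisation factor: naive {naive_factor:.2f}, Lo-adjusted {lo_factor:.2f}")
```

In this setup the smoothed series reports roughly the same mean with a smaller standard deviation, so its naive Sharpe is inflated; the Lo-adjusted factor is smaller than $\sqrt{252}$ and takes most of that inflation back out.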

Drawdown analysis

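Drawdown is the deficit relative to the running peak: $D_t = (M_t - V_t) / M_t$, where $V_t$ is the portfolio value at time $t$ and $M_t = \max_{s \le t} V_s$ is the maximum portfolio value up to time $t$. Plotting drawdown over time produces the UNDERWATER CURVE, the "depth" of how much you've lost from the most recent high. Useful summaries include the maximum drawdown and ratios built on it, such as Calmar.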

Investors care about drawdowns asymmetrically: a 50% drawdown requires a 100% gain to recover, and most investors will redeem before that recovery happens. The PSYCHOLOGY of drawdowns matters as much as the math — most retail investors quit somewhere between 20% and 35% drawdown, and most institutional allocators redeem between 10% and 20%.
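The underwater curve and the recovery arithmetic above can be sketched in a few lines. The function names and the six-period return series are hypothetical, chosen only so the numbers are easy to check by hand; like the page's own `max_drawdown`, the curve starts from the first period's value rather than from the initial capital.

```python
import numpy as np


def underwater_curve(returns):
    """Fractional drawdown D_t = (M_t - V_t) / M_t for each period."""
    value = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    peak = np.maximum.accumulate(value)
    return (peak - value) / peak


def longest_underwater_spell(drawdown):
    """Longest run of consecutive periods spent below the running peak."""
    longest = current = 0
    for d in drawdown:
        current = current + 1 if d > 0 else 0
        longest = max(longest, current)
    return longest


def gain_to_recover(drawdown_fraction):
    """Gain needed to return to the peak after losing drawdown_fraction."""
    return drawdown_fraction / (1.0 - drawdown_fraction)


# Toy series: a rally, a two-period slide, then a recovery to a new high.
returns = np.array([0.10, -0.20, -0.10, 0.05, 0.30, 0.02])
dd = underwater_curve(returns)

print("underwater curve:", np.round(dd, 4))
print("max drawdown:", round(float(dd.max()), 4))
print("longest spell under water:", longest_underwater_spell(dd), "periods")
print("gain needed after a 50% drawdown:", gain_to_recover(0.5))
```

The asymmetry comes from `gain_to_recover`: the required gain $d / (1 - d)$ grows faster than the loss $d$ itself, which is why deep drawdowns are so hard to climb out of.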

The multiple-testing problem

The single biggest hazard in quantitative backtesting. The setup: you have a research process that tests $N$ different strategies and picks the BEST one by some criterion (highest Sharpe, highest Calmar, etc.). The "winning" strategy LOOKS impressive in-sample. Is it real or noise?

Generate $N$ RANDOM strategies, each just IID Gaussian noise with TRUE Sharpe = 0. The Sharpe ratios of the random strategies are approximately normal with mean 0 and std proportional to $1/\sqrt{T}$, where $T$ is the backtest length. Order statistics: the MAXIMUM of $N$ standard normal random variables is approximately $\sqrt{2 \ln N}$ for large $N$ (extreme value theory).

For $N = 1000$ tested strategies: $\sqrt{2 \ln 1000} \approx 3.7$ standard deviations of annualised Sharpe by pure chance. With 4 years of daily data ($T = 1008$ observations), one std of annualised Sharpe is roughly $\sqrt{252/1008} = 0.5$, so the EXPECTED Sharpe of the best random strategy is approximately $3.7 \times 0.5 \approx 1.85$, i.e. a Sharpe of about 1.8 PURELY BY CHANCE.
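The arithmetic above generalises to any number of trials and any backtest length. The helper below is a hypothetical convenience function (not from any library) that applies the same leading-order approximation, $\sqrt{2 \ln N}$ standard deviations times $\sqrt{P/T}$ per standard deviation; the approximation is asymptotic and tends to overshoot for small $N$, so the figures should be read as rough hurdles rather than exact expectations.

```python
import math


def chance_best_sharpe(n_strategies, n_obs, periods_per_year=252):
    """Rough annualised Sharpe of the best of n_strategies zero-skill backtests.

    Uses the leading-order extreme-value approximation sqrt(2 ln N) for the
    maximum of N standard normals, and sqrt(periods_per_year / n_obs) as one
    standard deviation of the annualised Sharpe estimate under IID returns.
    """
    sigmas = math.sqrt(2.0 * math.log(n_strategies))
    one_sigma = math.sqrt(periods_per_year / n_obs)
    return sigmas * one_sigma


# How the chance hurdle grows with the number of strategies tried (4 years daily).
for n in (10, 100, 1000, 10_000):
    print(f"N = {n:>6}: best-by-chance Sharpe ~ {chance_best_sharpe(n, 1008):.2f}")
```

The hurdle grows only with $\sqrt{\ln N}$, so even a modest search already produces an impressive-looking winner, and lengthening the backtest is the more effective way to lower it.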

Code: backtest pitfalls in action

# Standard performance metrics and the multiple-testing hazard.

import numpy as np

def sharpe(returns, rf=0.0, periods_per_year=252):
    """Annualised Sharpe ratio; rf is an annual risk-free rate."""
    excess = returns - rf / periods_per_year
    return np.mean(excess) / np.std(excess, ddof=1) * np.sqrt(periods_per_year)

def sortino(returns, rf=0.0, periods_per_year=252):
    """Sharpe-like ratio using DOWNSIDE deviation only."""
    excess = returns - rf / periods_per_year
    downside = excess[excess < 0]
    if len(downside) == 0:
        return np.inf
    # NOTE: this is the sample std of the negative excess returns; the textbook
    # downside deviation is sqrt(mean(min(excess, 0)**2)) and gives other values.
    return np.mean(excess) / np.std(downside, ddof=1) * np.sqrt(periods_per_year)

def max_drawdown(returns):
    """Largest peak-to-trough loss, as a fraction of the running peak."""
    cum  = np.cumprod(1 + returns)
    peak = np.maximum.accumulate(cum)
    return np.max((peak - cum) / peak)

# Synthetic real-ish strategy: 4 years of daily returns
rng = np.random.default_rng(42)
mu_d, sd_d = 0.0007, 0.012
returns = rng.normal(mu_d, sd_d, 1008)
returns[200:220] -= 0.005       # inject a drawdown cluster

print(f"Synthetic strategy, 4 years of daily returns:")
print(f"  Annualised return:  {np.mean(returns) * 252:.4f}")
print(f"  Annualised vol:     {np.std(returns) * np.sqrt(252):.4f}")
print(f"  Sharpe ratio:       {sharpe(returns):.4f}")
print(f"  Sortino ratio:      {sortino(returns):.4f}")
print(f"  Max drawdown:       {max_drawdown(returns):.4%}")

# The multiple-testing trap: search 1000 RANDOM strategies (true Sharpe = 0)
# and report the best. A "Sharpe of 1.6" can easily arise PURELY BY CHANCE.
print(f"\nData-mining demo: 1000 random strategies, all with TRUE Sharpe = 0:")
N_strats = 1000
N_days = 252 * 4
fake = rng.normal(0, sd_d, (N_strats, N_days))
sharpes = np.array([sharpe(s) for s in fake])
print(f"  Mean Sharpe across all 1000: {np.mean(sharpes):.4f}")
print(f"  BEST Sharpe found:           {np.max(sharpes):.4f}")
print(f"  Strategy ranked 10th:        {np.sort(sharpes)[-10]:.4f}")
print(f"  Sharpe of 1.6 is the 99.9th percentile of random noise.")

# Deflated Sharpe ratio (Bailey & Lopez de Prado 2014) adjusts an in-sample
# Sharpe for the number N of strategies tried and for non-normal returns.
# For Gaussian noise: E[best of N] ≈ sqrt(2 ln N) std of the Sharpe estimate,
# and one std of annualised Sharpe is sqrt(periods_per_year / n_obs).
# At N=1000 and 4 years (1008 obs), the expected max under H0 is:
expected_max_sigmas = np.sqrt(2 * np.log(N_strats))
expected_max_h0 = expected_max_sigmas * np.sqrt(252 / N_days)
print(f"\n  Expected max Sharpe of N=1000 random strategies (Gaussian):")
print(f"  ~ sqrt(2 ln N) = {expected_max_sigmas:.4f} std = Sharpe {expected_max_h0:.4f}")
print(f"  (observed best: {np.max(sharpes) / np.sqrt(252 / N_days):.1f} std)")

Output:

Synthetic strategy, 4 years of daily returns:
  Annualised return:  0.0649
  Annualised vol:     0.1882
  Sharpe ratio:       0.3448
  Sortino ratio:      0.5579
  Max drawdown:       24.2051%

Data-mining demo: 1000 random strategies, all with TRUE Sharpe = 0:
  Mean Sharpe across all 1000: 0.0013
  BEST Sharpe found:           1.6375
  Strategy ranked 10th:        1.1791
  Sharpe of 1.6 is the 99.9th percentile of random noise.

  Expected max Sharpe of N=1000 random strategies (Gaussian):
  ~ sqrt(2 ln N) = 3.7169 std = Sharpe 1.8585
  (observed best: 3.3 std)

The synthetic strategy gives Sharpe 0.34 and max drawdown 24%: reasonable-but-mediocre numbers. The data-mining demonstration is the key result: 1000 random strategies, each with TRUE Sharpe = 0. The best one has nominal Sharpe 1.64; the tenth-best has 1.18. Either looks like a publication-worthy alpha if you saw it in isolation. The leading-order extreme-value estimate ($\sqrt{2 \ln N} \approx 3.7$ std under $H_0$, an annualised Sharpe of about 1.86) is the right order of magnitude but overshoots the empirical best of 1.64 (about 3.3 std), as expected for an asymptotic approximation at finite $N$.

Defences against the multiple-testing trap

Three lines of defence, in order of effectiveness.

(1) DEFLATED SHARPE RATIO (Bailey-Lopez de Prado 2014). Given an observed Sharpe and the number of strategies tested, compute the probability that the observed Sharpe could arise PURELY BY CHANCE under $H_0$ (true Sharpe = 0). The DSR adjusts for: number of trials (multiple testing), skewness and kurtosis of returns (non-normality), and series length. The corrected Sharpe is dramatically lower than the raw value for any deeply mined strategy.

(2) OUT-OF-SAMPLE TESTING. Split the data into a research period and a hold-out period; develop the strategy on the research data, evaluate on the hold-out. This protects against multiple testing only IF the hold-out is genuinely never used during development; in practice researchers iterate on hold-out performance, defeating its purpose. Tools like SEQUENTIAL hold-out (different out-of-sample periods for different stages of research) can help.

(3) PROBABILITY OF BACKTEST OVERFITTING (PBO). Bailey-Borwein-Lopez de Prado-Zhu (2016) introduce a CV-based test: split the data into multiple periods, rank strategies in each, and compute the probability that the strategy that LOOKED BEST in-sample is below-median out-of-sample. A high PBO (> 0.5) means the backtest is essentially useless. Standard in modern quant research.
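Defences (1) and (3) can be sketched compactly. What follows is a simplified illustration, not the authors' reference implementation: the function names are hypothetical, the DSR uses the expected-maximum benchmark and the skew/kurtosis-adjusted test statistic as they are commonly stated for the 2014 paper, and the PBO uses combinatorially symmetric cross-validation with an assumed 8 blocks and the per-period Sharpe as the ranking criterion.

```python
import math
from itertools import combinations
from statistics import NormalDist

import numpy as np

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()


def per_period_sharpe(returns, axis=0):
    """Non-annualised Sharpe ratio (mean / sample std) along `axis`."""
    r = np.asarray(returns, dtype=float)
    return r.mean(axis=axis) / r.std(axis=axis, ddof=1)


def expected_max_sharpe(n_trials, sr_std):
    """Expected best per-period Sharpe among n_trials (> 1) zero-skill trials."""
    z1 = STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    z2 = STD_NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    return sr_std * ((1 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe(returns, n_trials, sr_std):
    """Probability that the true Sharpe exceeds the chance benchmark."""
    r = np.asarray(returns, dtype=float)
    sr = per_period_sharpe(r)
    z = (r - r.mean()) / r.std()
    skew, kurt = np.mean(z**3), np.mean(z**4)   # kurt is 3 for a Gaussian
    sr0 = expected_max_sharpe(n_trials, sr_std)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return STD_NORMAL.cdf((sr - sr0) * math.sqrt(r.size - 1) / denom)


def pbo(returns_matrix, n_blocks=8):
    """Share of train/test splits where the in-sample winner ranks at or
    below the out-of-sample median. returns_matrix: (n_obs, n_strategies)."""
    m = np.asarray(returns_matrix, dtype=float)
    blocks = np.array_split(np.arange(m.shape[0]), n_blocks)
    n_strats = m.shape[1]
    below_median = []
    for in_ids in combinations(range(n_blocks), n_blocks // 2):
        in_rows = np.concatenate([blocks[i] for i in in_ids])
        out_rows = np.concatenate(
            [blocks[i] for i in range(n_blocks) if i not in in_ids])
        best = np.argmax(per_period_sharpe(m[in_rows]))
        out_sr = per_period_sharpe(m[out_rows])
        rank = np.sum(out_sr <= out_sr[best])    # 1 = worst, n_strats = best
        omega = rank / (n_strats + 1)
        below_median.append(omega <= 0.5)        # same as logit(omega) <= 0
    return float(np.mean(below_median))


# Demo on pure noise: 200 zero-skill strategies, 4 years of daily returns.
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.012, size=(1008, 200))
sr_all = per_period_sharpe(noise)
best = int(np.argmax(sr_all))
dsr_best = deflated_sharpe(noise[:, best], n_trials=200,
                           sr_std=sr_all.std(ddof=1))
pbo_noise = pbo(noise)

print(f"best raw annualised Sharpe: {sr_all[best] * np.sqrt(252):.2f}")
print(f"deflated Sharpe (probability of real skill): {dsr_best:.2f}")
print(f"PBO on pure noise: {pbo_noise:.2f}")
```

On pure noise the raw best Sharpe looks attractive, while the deflated probability stays well short of conventional significance and the PBO sits far from zero; a strategy with a genuine, large edge would instead drive the PBO towards 0.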

Other classical pitfalls
