Validation and the backtest trap
Train/test splits and leakage
The foundational rule of validation: never evaluate a model on the same data you built it on. A model tested on its training data reports how well it memorised, not how well it predicts.
The basic split
Divide the data into a training set (where the model learns) and a test set (held back, untouched, used once to judge it). Good training performance with poor test performance is the signature of overfitting — the model learned noise, not signal.
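As a rough illustration of that signature, the sketch below fits a deliberately bad "model" that only memorises its training rows. Everything here is hypothetical: the data is pure noise generated for the example, and `fit_memoriser` and `mse` are illustrative helpers, not part of any library.

```python
import random

def mse(model, rows):
    """Mean squared error of model over (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

def fit_memoriser(rows):
    """Hypothetical model: a lookup table of the training rows.

    For an x it has never seen, it falls back to the training mean of y.
    """
    table = {x: y for x, y in rows}
    fallback = sum(y for _, y in rows) / len(rows)
    return lambda x: table.get(x, fallback)

rng = random.Random(0)
# Made-up data: y is pure noise, so there is no signal to learn.
data = [(i, rng.gauss(0.0, 1.0)) for i in range(200)]

# Hold back the last 20% as the test set and do not touch it while fitting.
train, test = data[:160], data[160:]

model = fit_memoriser(train)
train_mse = mse(model, train)  # zero: every training row is looked up exactly
test_mse = mse(model, test)    # clearly above zero: nothing was really learned

print(f"train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The gap between the two numbers is the point: the training score measures recall of the rows the model has seen, while the test score measures prediction on rows it has not.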
Why finance breaks the standard recipe
In most machine learning you split the data at random. With financial time series you must not, because the observations are ordered in time and autocorrelated:
- A random split puts future data in the training set and past data in the test set — training on the future to predict the past, which is meaningless and leaks information.
- Adjacent days are correlated, so a random split scatters near-duplicate days across train and test, making the test look easier than live trading would be.
The correct approach respects time order: train on the earlier period, test on the strictly later period — exactly how you would face the market.
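A minimal sketch of a time-ordered split, compared with a shuffled one, might look like this. The helper `chronological_split`, the 80/20 fraction, and the daily series are assumptions made for the example, and the helper assumes one row per date so that the cut falls cleanly between two dates.

```python
import random
from datetime import date, timedelta

def chronological_split(rows, train_fraction=0.8):
    """Split (date, value) rows so every test row is later than every train row.

    Assumes each date appears once; the 0.8 default is an arbitrary choice.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Made-up daily series: 100 consecutive calendar days.
start = date(2020, 1, 1)
rows = [(start + timedelta(days=i), float(i)) for i in range(100)]

train, test = chronological_split(rows)
last_train = max(d for d, _ in train)
first_test = min(d for d, _ in test)

# For contrast: a random split of the same rows.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
rand_train, rand_test = shuffled[:80], shuffled[80:]
rand_last_train = max(d for d, _ in rand_train)
# Test rows dated before the latest training row: the model was trained
# on days that come after these, which is training on the future.
test_rows_in_past = sum(1 for d, _ in rand_test if d < rand_last_train)

print(f"time-ordered: train ends {last_train}, test starts {first_test}")
print(f"random: {test_rows_in_past} of {len(rand_test)} test rows precede training data")
```

In the time-ordered split the training period ends before the test period begins. In the random split, test rows sit in the middle of the training period, which is the situation the two bullets above describe.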
Leakage, the silent killer
Data leakage is any way future information sneaks into training. It hides in subtle places: normalising features using statistics computed over the whole dataset (including the future), filling missing values with future data, or defining a label that peeks ahead. Leakage produces gorgeous test results that evaporate live, because the live model never has the future data the test secretly used. Most 'too good to be true' backtests are leakage, not genius.
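The normalisation case can be sketched in a few lines. This is an illustration under stated assumptions, not a recipe: the trending series is made up, and `fit_scaler` is a hypothetical helper that standardises values with the mean and standard deviation of whatever data it is given.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return a function that standardises with the mean/std of values."""
    mu, sigma = mean(values), pstdev(values)
    return lambda v: (v - mu) / sigma

# Made-up trending series: later values are larger than earlier ones.
series = [float(i) for i in range(100)]
train, test = series[:80], series[80:]

leaky_scale = fit_scaler(series)  # statistics include the test period (leak)
clean_scale = fit_scaler(train)   # statistics from the training period only

leaky_train = [leaky_scale(v) for v in train]
clean_train = [clean_scale(v) for v in train]
clean_test = [clean_scale(v) for v in test]

# With the leaky scaler the training features are centred below zero,
# because the mean they were scaled by already contains the higher future values.
print(f"leaky train mean = {mean(leaky_train):.3f}")
print(f"clean train mean = {mean(clean_train):.3f}")
```

The clean version fits the scaler on the training period and then applies that same fitted scaler to the test period, which mirrors what a live model can actually do: it only ever has past statistics available.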
This content is for educational and informational purposes only and is not investment, financial, tax or legal advice. Trading and investing carry risk, including the possible loss of capital. Any performance shown by third-party tools is hypothetical and not a promise of future results. Do your own research and consider professional advice before making any decision.