Forecasting Stocks

The foundational rule of validation: never evaluate a model on the same data you built it on. A model tested on its training data reports how well it memorised, not how well it predicts.

The basic split

Divide the data into a training set (where the model learns) and a test set (held back, untouched, used once to judge it). Good training performance with poor test performance is the signature of overfitting — the model learned noise, not signal.
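To make the gap between memorising and predicting concrete, here is a minimal sketch using a deliberately silly model: a lookup table that stores every training example and falls back to the training mean for anything unseen. The function names (fit_memoriser, mean_abs_error) and the numbers are made up for illustration; they are not taken from any library or real dataset.

```python
from statistics import mean

def fit_memoriser(train_pairs):
    """Return a predictor that has memorised the training pairs exactly."""
    table = dict(train_pairs)
    # For inputs it has never seen, the model can only guess the training mean.
    fallback = mean(y for _, y in train_pairs)

    def predict(x):
        return table.get(x, fallback)

    return predict

def mean_abs_error(predict, pairs):
    """Average absolute difference between predictions and true values."""
    return mean(abs(predict(x) - y) for x, y in pairs)

# Hypothetical (input, target) pairs; the values are invented.
train_pairs = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2)]
test_pairs = [(5, 10.1), (6, 11.8)]

predict = fit_memoriser(train_pairs)
train_error = mean_abs_error(predict, train_pairs)  # zero: pure recall
test_error = mean_abs_error(predict, test_pairs)    # large: nothing was learned

print(f"train error: {train_error:.2f}, test error: {test_error:.2f}")
```

The training error is zero by construction, which says nothing about the model's quality. Only the held-back pairs reveal that it learned no usable pattern.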

Why finance breaks the standard recipe

In most machine learning you split data randomly. In finance you must not, because the data is ordered in time and autocorrelated:

A random split puts future data in the training set and past data in the test set — training on the future to predict the past, which is meaningless and leaks information.
Adjacent days are correlated, so a random split scatters near-duplicate days across train and test. The model has effectively seen close copies of the test days, which makes the test look easier than live prediction will be.

The correct approach respects time order: train on the earlier period, test on the strictly later period — exactly how you would face the market.
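A time-ordered split can be sketched in a few lines. This is an illustrative sketch, not a prescribed implementation: the helper name chronological_split, the 80/20 cut, and the toy prices are all assumptions made for the example.

```python
from datetime import date, timedelta

def chronological_split(rows, train_fraction=0.8):
    """Split (timestamp, value) rows into an earlier train set and a later test set."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Sort by timestamp so the cut is a point in time, not a random sample.
    ordered = sorted(rows, key=lambda row: row[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical data: ten consecutive days with made-up prices.
start = date(2024, 1, 1)
rows = [(start + timedelta(days=i), 100.0 + i) for i in range(10)]

train, test = chronological_split(rows, train_fraction=0.8)

last_train_day = max(day for day, _ in train)
first_test_day = min(day for day, _ in test)
print(f"train ends {last_train_day}, test starts {first_test_day}")
```

Every test day falls strictly after every training day, so the model is judged only on a period it could not have seen. With real data the timestamps must be unique, or at least the cut must not fall between rows that share a timestamp.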

Leakage, the silent killer

Data leakage is any way information that would not be available at prediction time sneaks into training. It hides in subtle places: normalising features using statistics computed over the whole dataset (including the future), filling missing values with future data (back-filling, for example), or defining a label that peeks ahead. Leakage produces gorgeous test results that evaporate live, because the live model never has the future data the test secretly used. A backtest that looks too good to be true is far more likely to be leakage than genius.
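The normalisation case is the easiest to show in code. The sketch below is one way to avoid it, under these assumptions: z-score scaling, a toy series with invented values, and hypothetical helper names fit_scaler and apply_scaler. The scaling statistics are computed on the training period only and then reused unchanged on the test period.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Compute scaling statistics from the training period only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        raise ValueError("cannot scale a constant series")
    return mu, sigma

def apply_scaler(values, mu, sigma):
    """Z-score values using statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]

# Hypothetical time-ordered series; the last two points are the "future".
series = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
train_raw, test_raw = series[:4], series[4:]

# Correct: statistics come from the training period alone.
mu, sigma = fit_scaler(train_raw)
train_scaled = apply_scaler(train_raw, mu, sigma)
test_scaled = apply_scaler(test_raw, mu, sigma)

# Leaky, shown only for contrast: statistics see the test period too.
leaky_mu, leaky_sigma = fit_scaler(series)

print(f"train-only mean {mu:.2f} vs whole-series mean {leaky_mu:.2f}")
```

The whole-series mean is pulled upward by values the model would not yet have observed, so features scaled with it carry a hint about the future into training. The same fit-on-train, apply-to-test discipline applies to any other fitted preprocessing step.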