Loading and cleaning OHLCV data

Before any analysis, you need clean data. OHLCV stands for Open, High, Low, Close and Volume — the standard candle format you will load most often.

Reading data into a DataFrame

A CSV of daily candles loads in one line, parsing the date column and using it as the index:

import pandas as pd

df = pd.read_csv("ohlcv.csv", parse_dates=["date"], index_col="date")
df = df.sort_index()
print(df.head())

The cleaning that actually matters

Raw market data is messy. Before trusting it, check for:

Missing rows — gaps where a session is absent. Decide whether to forward-fill, drop, or leave gaps; never silently interpolate prices you did not observe.
Duplicate timestamps — drop them with drop_duplicates or you will double-count.
Bad values — zero or negative prices, a High below the Low, volume spikes that are data errors.
Timezone alignment — store everything in one timezone (UTC is safest) so candles from different sources line up.
Adjustments — for equities, decide between raw and split/dividend-adjusted prices and be consistent. Mixing them silently corrupts returns.

A quick sanity check

assert (df["high"] >= df["low"]).all(), "found High below Low"
assert df.index.is_monotonic_increasing, "index not sorted"
print(df.isna().sum())

Garbage data produces confident, wrong conclusions. The discipline of validating inputs is unglamorous and is exactly where most retail backtests quietly go wrong.