Python for markets

Loading and cleaning OHLCV data

5 min

Before any analysis, you need clean data. OHLCV stands for Open, High, Low, Close and Volume — the standard candle format you will load most often.

Reading data into a DataFrame

A CSV of daily candles loads in one line, parsing the date column and using it as the index:

import pandas as pd

df = pd.read_csv("ohlcv.csv", parse_dates=["date"], index_col="date")
df = df.sort_index()
print(df.head())

The cleaning that actually matters

Raw market data is messy. Before trusting it, check for:

  • Missing rows — gaps where a session is absent. Decide whether to forward-fill, drop, or leave gaps; never silently interpolate prices you did not observe.
  • Duplicate timestamps — drop them with drop_duplicates or you will double-count.
  • Bad values — zero or negative prices, a High below the Low, volume spikes that are data errors.
  • Timezone alignment — store everything in one timezone (UTC is safest) so candles from different sources line up.
  • Adjustments — for equities, decide between raw and split/dividend-adjusted prices and be consistent. Mixing them silently corrupts returns.

A quick sanity check

assert (df["high"] >= df["low"]).all(), "found High below Low"
assert df.index.is_monotonic_increasing, "index not sorted"
print(df.isna().sum())

Garbage data produces confident, wrong conclusions. The discipline of validating inputs is unglamorous and is exactly where most retail backtests quietly go wrong.

Finished reading?
Risk disclaimer

This content is for educational and informational purposes only and is not investment, financial, tax or legal advice. Trading and investing carry risk, including the possible loss of capital. Any performance shown by third-party tools is hypothetical and not a promise of future results. Do your own research and consider professional advice before making any decision.