CSV in high-frequency data prep: from raw feeds to normalized events

High-frequency datasets often arrive as messy files: line-delimited JSON captures, compressed archives, separate trade and book feeds, and large CSV files. This guide summarizes practical CSV handling patterns from the official HftBacktest data preparation flow.

Why CSV still appears in HFT pipelines

Even when the source streams are JSON, data vendors and exchange dumps often deliver compressed CSV files with extensions such as .csv.gz. The preparation pipeline in HftBacktest is useful as a pattern: normalize data into a compact event table before simulation, regardless of the original format.
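As a minimal sketch of the intake step, the snippet below streams rows out of a gzip-compressed CSV using only the Python standard library. The function name iter_csv_gz, the file name, and the column names (timestamp, price, amount) are illustrative assumptions, not part of HftBacktest.

```python
import csv
import gzip
import os
import tempfile


def iter_csv_gz(path):
    """Yield each data row of a gzip-compressed CSV as a dict keyed by header."""
    # "rt" decompresses and decodes to text; newline="" lets the csv module
    # handle line endings itself.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        yield from csv.DictReader(fh)


# Demo with made-up rows written to a temporary .csv.gz file.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "trades.csv.gz")
with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as fh:
    fh.write("timestamp,price,amount\n1000,100.5,2\n1001,100.6,1\n")

rows = list(iter_csv_gz(sample_path))
for row in rows:
    print(row["timestamp"], row["price"], row["amount"])
```

Because the function is a generator, rows are decompressed and parsed one at a time, so the whole file never has to fit in memory. All values come back as strings; type conversion is a separate step.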

Most important CSV-specific details for backtesting quality

1) Event order and timestamps are critical

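Timestamp quality matters more than format choice. A backtest replays events in timestamp order, so a wrong or out-of-order timestamp changes what the strategy sees and when it sees it.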
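One way to check timestamp quality before conversion is sketched below. It assumes each event carries an exchange timestamp and a local receipt timestamp; the function name timestamp_issues and the tuple layout are hypothetical, chosen only for illustration.

```python
def timestamp_issues(events):
    """Return (negative_latency, out_of_order) index lists.

    events is a sequence of (exch_ts, local_ts) pairs, an assumed layout.
    negative_latency: rows received locally before the exchange time.
    out_of_order: rows whose local timestamp is earlier than the previous row's.
    """
    negative_latency = []
    out_of_order = []
    prev_local = None
    for i, (exch_ts, local_ts) in enumerate(events):
        if local_ts < exch_ts:
            negative_latency.append(i)
        if prev_local is not None and local_ts < prev_local:
            out_of_order.append(i)
        prev_local = local_ts
    return negative_latency, out_of_order


# Made-up sample: rows 1 and 3 have negative latency, row 3 is also out of order.
sample = [(100, 105), (110, 108), (120, 125), (130, 124)]
negative_latency, out_of_order = timestamp_issues(sample)
print(negative_latency, out_of_order)
```

Negative latency usually points to clock skew between the recorder and the exchange, while out-of-order rows point to capture or merge problems; both are worth resolving before simulation.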

2) Input file order controls realism

When combining feed files, the recommendation is to pass trade files before depth/book files. If a depth update is caused by a trade, processing the trade first makes simulated fills more realistic.

When two events share the same timestamp, the conversion flow prioritizes events from the first input file.
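The tie-breaking rule can be sketched as a sort on (timestamp, input file position). This is an illustration of the idea, not HftBacktest's own merge code; merge_feeds and the event dictionaries are hypothetical.

```python
def merge_feeds(*feeds):
    """Merge event lists by timestamp; on ties, earlier input files win."""
    tagged = []
    for file_index, feed in enumerate(feeds):
        for event in feed:
            tagged.append((event["ts"], file_index, event))
    # Python's sort is stable, so events with the same timestamp and the same
    # source file keep their original relative order.
    tagged.sort(key=lambda item: (item[0], item[1]))
    return [event for _, _, event in tagged]


# Made-up events: a trade and a depth update share timestamp 100.
trades = [{"ts": 100, "kind": "trade"}, {"ts": 102, "kind": "trade"}]
depth = [{"ts": 100, "kind": "depth"}, {"ts": 101, "kind": "depth"}]

merged = merge_feeds(trades, depth)  # trades passed first, so they win ties
order = [(e["ts"], e["kind"]) for e in merged]
print(order)
```

Swapping the argument order to merge_feeds(depth, trades) would place the depth update ahead of the trade at timestamp 100, which is the less realistic ordering described above.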

3) File size and memory planning

For very large CSV inputs, the conversion step may need a larger buffer_size argument. This matters in practice for CSV pipelines: a dataset that parses cleanly can still fail or slow down during conversion if the default buffer is too small.
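A rough way to plan a buffer is to count the data rows in the inputs first. The sketch below assumes the buffer is sized in rows and that each file has one header line; suggest_buffer_rows and the 20% headroom are arbitrary illustrative choices, not values taken from HftBacktest.

```python
import gzip
import os
import tempfile


def count_data_rows(path):
    """Count non-empty lines after the header in a gzip-compressed CSV."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        next(fh, None)  # skip the header line, if any
        return sum(1 for line in fh if line.strip())


def suggest_buffer_rows(paths, headroom_pct=20):
    """Total data rows across all inputs plus a safety margin, in rows."""
    total = sum(count_data_rows(p) for p in paths)
    return total + total * headroom_pct // 100 + 1


# Demo with a made-up three-row file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "depth.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
    fh.write("timestamp,side,price,amount\n1,bid,100,1\n2,ask,101,2\n3,bid,99,3\n")

data_rows = count_data_rows(demo_path)
buffer_rows = suggest_buffer_rows([demo_path])
print(data_rows, buffer_rows)
```

Counting rows costs one extra pass over the compressed files, which is usually cheap compared with a conversion that fails partway through.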

What the normalized structure looks like after CSV prep

HftBacktest normalizes input into an event table with consistent columns and types. Typical fields include an event type, an exchange timestamp, a local (receipt) timestamp, a price, and a quantity.

Persisting to a binary output (for example, .npz) is part of the workflow, but the lesson is reusable: use CSV only for intake, then convert to a strict schema as soon as possible.
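The "strict schema as soon as possible" idea can be sketched with a typed record and a fail-fast converter. The Event class and its field names are assumptions made for illustration; they are not HftBacktest's actual column names, and the binary persistence step is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event; field names are illustrative."""
    kind: str
    exch_ts: int
    local_ts: int
    price: float
    qty: float


def to_event(row):
    """Convert one CSV row (a dict of strings) into a typed Event.

    int() and float() raise ValueError on malformed text, so bad rows fail
    here instead of surfacing later inside the simulation.
    """
    event = Event(
        kind=row["kind"],
        exch_ts=int(row["exch_ts"]),
        local_ts=int(row["local_ts"]),
        price=float(row["price"]),
        qty=float(row["qty"]),
    )
    if event.price <= 0 or event.qty < 0:
        raise ValueError(f"implausible price or quantity: {row!r}")
    return event


# Made-up row as it would come out of a CSV reader (all strings).
raw = {"kind": "trade", "exch_ts": "1700000000000",
       "local_ts": "1700000000012", "price": "100.5", "qty": "2"}
event = to_event(raw)
print(event)
```

Once rows are typed records, every later stage can rely on integer timestamps and numeric prices without re-parsing text.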

Market depth snapshots and daily continuity

For exchange backtesting, the initial order book state matters. The HftBacktest guide describes snapshot behavior and notes that some providers inject snapshots at the start of each day. If a backtest spans several days, you can avoid redundant snapshot loading by carrying forward a suitable initial-state file from one day to the next.
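Carrying book state across days can be sketched as follows: apply one day's depth updates, copy the resulting book, and use that copy as the next day's starting state. The book layout, the helper names, and the convention that a quantity of 0 removes a price level are assumptions for this illustration only.

```python
def apply_depth(book, updates):
    """Apply (side, price, qty) updates in place; qty 0 removes the level."""
    for side, price, qty in updates:
        if qty == 0:
            book[side].pop(price, None)
        else:
            book[side][price] = qty
    return book


def snapshot(book):
    """Return an independent copy of the book to use as an initial state."""
    return {side: dict(levels) for side, levels in book.items()}


# Day 1: start from an empty book and apply made-up updates.
day1_book = {"bid": {}, "ask": {}}
apply_depth(day1_book, [
    ("bid", 100.0, 5.0),
    ("ask", 101.0, 3.0),
    ("bid", 100.0, 0.0),  # level removed
    ("bid", 99.5, 2.0),
])
end_of_day = snapshot(day1_book)

# Day 2: start from the carried-forward state instead of reloading a snapshot.
day2_book = snapshot(end_of_day)
apply_depth(day2_book, [("ask", 101.0, 1.0)])
print(end_of_day, day2_book)
```

Copying the state rather than sharing it keeps the saved end-of-day book unchanged while day 2 updates are applied.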

Source note

This article is based on patterns shown in HftBacktest: Data Preparation.