CSV in high-frequency data prep: from raw feeds to normalized events
High-frequency datasets usually arrive as a pile of messy files: line-delimited JSON captures, compressed archives, separate trade and book feeds, and large CSV files. This guide summarizes practical CSV-handling patterns from the official HftBacktest data preparation flow.
Why CSV still appears in HFT pipelines
Even when source streams are JSON, vendors and exchange dumps can deliver data as compressed CSV files
with extensions like .csv.gz. The preparation pipeline in HftBacktest is useful as a pattern:
normalize data into a compact event table before simulation, regardless of original format.
- Trades and incremental books can be delivered as separate CSV files.
- Large files are often compressed to reduce bandwidth and storage.
- Conversion tools ingest these files and emit structured arrays, which simplifies downstream steps.
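As a minimal sketch of the intake step, the snippet below streams rows out of a gzipped CSV trade dump without decompressing it to disk. The column layout (local timestamp, exchange timestamp, side, price, quantity) is a hypothetical example, not a specific vendor's schema:

```python
import csv
import gzip
import io

# Hypothetical row layout: local_ts, exch_ts, side, price, qty
SAMPLE = b"1700000000000000000,1699999999998000000,buy,42000.5,0.010\n"

def read_trades_csv_gz(raw_gz: bytes):
    """Stream rows from a gzipped CSV trade dump, decompressing on the fly."""
    rows = []
    with gzip.open(io.BytesIO(raw_gz), mode="rt", newline="") as f:
        for local_ts, exch_ts, side, price, qty in csv.reader(f):
            rows.append({
                "local_ts": int(local_ts),
                "exch_ts": int(exch_ts),
                "side": side,
                "price": float(price),
                "qty": float(qty),
            })
    return rows

trades = read_trades_csv_gz(gzip.compress(SAMPLE))
```

In production you would pass the `.csv.gz` path straight to `gzip.open` rather than holding the compressed bytes in memory; the in-memory buffer here just keeps the example self-contained.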
Most important CSV-specific details for backtesting quality
1) Event order and timestamps are critical
Timestamp quality matters more than format choice:
- The first token in raw Binance lines is the local receive timestamp (nanoseconds).
- Exchange and local timestamps can drift apart; the conversion logic reports the observed latency and corrects invalid values.
- The same conversion step also repairs event ordering when timestamps conflict or events arrive out of order.
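The two bullets above can be sketched as one pass over `(exchange_ts, local_ts)` pairs. This is an illustrative simplification of the idea, not HftBacktest's actual correction logic: it reports latency, clamps any negative latency (an event cannot be received before the exchange stamped it), and re-sorts by local receive time.

```python
def report_and_fix(events):
    """events: list of (exch_ts, local_ts) pairs in arrival order, ns."""
    latencies = [lt - et for et, lt in events]
    print(f"latency ns: min={min(latencies)} max={max(latencies)}")
    # Negative latency indicates clock drift between venue and capture host;
    # clamp local_ts so no event is 'received' before it was stamped.
    fixed = [(et, max(lt, et)) for et, lt in events]
    # Re-sort by local receive time so the stream is monotonically ordered.
    fixed.sort(key=lambda e: e[1])
    return fixed
```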
2) Input file order controls realism
When combining feed files, the recommendation is to list trade files before depth/book files: if a depth update was caused by a trade, processing the trade first yields more realistic fills.
When two events share the same timestamp, the conversion flow gives priority to events from the file listed first.
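The tie-breaking rule falls out naturally from a stable sort: concatenate the feeds in priority order, then sort by timestamp. A minimal sketch (event dicts and field names are illustrative):

```python
def merge_feeds(trade_events, depth_events):
    """Merge two feeds by local timestamp.

    Trades are listed first, and Python's sort is stable, so on a
    timestamp tie the trade event precedes the depth event.
    """
    merged = list(trade_events) + list(depth_events)
    merged.sort(key=lambda e: e["local_ts"])
    return merged

trades = [{"local_ts": 5, "kind": "trade"}]
depth = [{"local_ts": 3, "kind": "depth"}, {"local_ts": 5, "kind": "depth"}]
merged = merge_feeds(trades, depth)
```

Here the tied events at `local_ts == 5` come out trade-first, mirroring the "first input file wins" behavior described above.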
3) File size and memory planning
For very large CSV inputs, the conversion step may need a larger buffer_size argument than the default.
This is a practical point for CSV pipelines: a dataset that parses cleanly in small tests can fail or slow down at full size with default buffers.
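To make the memory-planning point concrete, here is a sketch of a converter that preallocates an output buffer and grows it on overflow. The `buffer_size` name echoes the parameter mentioned above, but the growth behavior shown is an assumption for illustration, not HftBacktest's documented semantics:

```python
import numpy as np

def convert_with_buffer(values, buffer_size=100_000):
    """Collect parsed values into a preallocated NumPy buffer.

    If the input exceeds buffer_size, the buffer is doubled, which
    avoids a reallocation per row on very large CSV inputs.
    """
    buf = np.empty(buffer_size, dtype=np.float64)
    n = 0
    for v in values:
        if n == len(buf):
            buf = np.resize(buf, len(buf) * 2)  # grow instead of failing
        buf[n] = v
        n += 1
    return buf[:n]  # trim unused tail
```

Sizing the buffer to the expected row count up front avoids the copies entirely, which is why large inputs warrant an explicit, larger setting.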
What the normalized structure looks like after CSV prep
HftBacktest normalizes input into an event table with consistent columns and types. Typical fields include:
- event code (trade/depth event indicator)
- exchange timestamp
- local timestamp
- price, quantity, order identifiers
- internal numeric fields used for simulation state
Persisting to a binary output (for example, .npz) is part of the workflow, but the lesson
is reusable: use CSV only for intake, then convert to a strict schema as soon as possible.
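A strict schema plus binary persistence can be sketched with a NumPy structured array. The field names and widths below are illustrative only; HftBacktest's real event layout may differ:

```python
import os
import tempfile
import numpy as np

# Illustrative event schema, not HftBacktest's actual layout.
EVENT_DTYPE = np.dtype([
    ("ev", "i8"),        # event code (trade / depth indicator)
    ("exch_ts", "i8"),   # exchange timestamp, ns
    ("local_ts", "i8"),  # local receive timestamp, ns
    ("px", "f8"),        # price
    ("qty", "f8"),       # quantity
])

events = np.zeros(2, dtype=EVENT_DTYPE)
events[0] = (1, 100, 110, 42000.5, 0.01)
events[1] = (2, 101, 112, 42001.0, 0.50)

# Persist once, then read the binary file for every backtest run.
path = os.path.join(tempfile.mkdtemp(), "events.npz")
np.savez_compressed(path, data=events)
loaded = np.load(path)["data"]
```

After this point every column has a fixed type and width, so downstream code never re-parses CSV text.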
Market depth snapshots and daily continuity
For exchange backtesting, the initial order book state matters. The guide describes snapshot behavior and notes that some providers inject snapshots at the start of each day. If a backtest spans multiple days, you can avoid redundant snapshot loading by carrying forward suitable initial-state files.
- Build or keep a day-end snapshot for next-day continuity.
- Validate that your snapshot depth format is consistent with your event feed.
- Do not assume all provider snapshots represent exchange-engine timing.
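The carry-forward idea reduces to a save/load round trip of the day-end book. A minimal sketch, assuming a simple `[[price, qty], ...]` level format (the real snapshot format will depend on your feed and on HftBacktest's expectations):

```python
import os
import tempfile
import numpy as np

def save_eod_snapshot(bids, asks, path):
    """Persist end-of-day book levels so the next day's run can start
    from them instead of re-reading a provider snapshot.

    bids/asks: [[price, qty], ...] — an illustrative format.
    """
    np.savez_compressed(path,
                        bids=np.asarray(bids, dtype="f8"),
                        asks=np.asarray(asks, dtype="f8"))

def load_initial_snapshot(path):
    """Load a previously saved day-end snapshot as next-day initial state."""
    with np.load(path) as f:
        return f["bids"], f["asks"]

path = os.path.join(tempfile.mkdtemp(), "eod_snapshot.npz")
save_eod_snapshot([[42000.0, 1.5]], [[42001.0, 2.0]], path)
bids, asks = load_initial_snapshot(path)
```

The validation bullet above still applies: before reusing such a file, check that its depth format (levels, units, tick size) matches the event feed it will be combined with.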
Source note
This article is based on patterns shown in HftBacktest: Data Preparation.