CSV data cleaning checklist

A repeatable checklist reduces errors and keeps CSV cleanup stable across teams. Apply the same sequence every time so output is predictable, auditable, and easier to trust.

Set the acceptance rules before touching data

Before cleaning, define the expected schema and error policy for each column. If possible, write this down as a small quality contract with allowed delimiters, mandatory columns, and null behavior.
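Such a contract can live as plain data in the cleaning script itself. The sketch below is one hypothetical shape for it; the column names, delimiter, and null policies are illustrative, not prescribed:

```python
# A minimal quality contract, expressed as a plain dict.
# Column names and policies here are examples only.
CONTRACT = {
    "delimiter": ",",
    "required_columns": ["id", "email", "signup_date"],
    "null_policy": {
        "id": "reject",            # a missing id fails the row
        "email": "reject",
        "signup_date": "allow",    # blank is permitted, kept as ""
    },
}

def missing_columns(header, contract):
    """Return the required columns absent from a parsed header row."""
    return [c for c in contract["required_columns"] if c not in header]
```

Checking the header against the contract before any cleaning gives you an early, explainable failure instead of a silent downstream one.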

Checklist step 1: profile first

Data profiling gives you a baseline so you can estimate cleanup scope. Cleaning without profiling usually bakes in hidden assumptions that resurface later as regressions.
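A baseline profile does not need heavy tooling. A cheap pass like the sketch below, using only the standard library, records row count, per-column fill rates, and how many distinct row widths appear; the function name is illustrative:

```python
import csv
import io
from collections import Counter

def profile_csv(text, delimiter=","):
    """Collect a cheap baseline: row count, column fill rates, row widths."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = list(reader)
    # More than one distinct width means ragged rows to investigate.
    widths = Counter(len(r) for r in rows)
    fill = {
        col: sum(1 for r in rows if i < len(r) and r[i].strip() != "")
        for i, col in enumerate(header)
    }
    return {"rows": len(rows), "widths": dict(widths), "fill": fill}
```

Comparing this profile before and after cleanup is also a quick sanity check that the run did not drop more than intended.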

Checklist step 2: normalize structure

Structural issues are easier to fix before semantic rules. Normalize column headers, delimiter assumptions, and whitespace first so all later logic runs against a stable shape.

  1. Standardize header case and spacing.
  2. Decode quoted fields correctly and enforce safe escaping.
  3. Normalize line endings and drop trailing separators left by blank fields.
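The three steps above can be sketched in a single pass. This is one possible implementation, not the only ordering; it leans on Python's `csv` module for quoted-field handling (step 2) and does the header and line-ending work by hand:

```python
import csv
import io

def normalize_structure(text):
    """Normalize line endings, header case/spacing, and cell whitespace."""
    # Fold CRLF and bare CR line endings to LF before parsing.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # csv.reader decodes quoted fields and escaping safely.
    rows = [r for r in csv.reader(io.StringIO(text))
            if any(cell.strip() for cell in r)]  # skip fully blank rows
    header = [h.strip().lower().replace(" ", "_") for h in rows[0]]
    body = [[cell.strip() for cell in row] for row in rows[1:]]
    return header, body
```

Running this before any value-level rule means every later step can assume lowercase underscore headers and trimmed cells.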

Checklist step 3: value rules

Then apply value-level rules in a controlled order. This avoids accidental data loss and makes failures easier to explain.
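One way to keep the order controlled is to express the rules as an explicit list and apply them in sequence. The pipeline below is a hypothetical example for a date column; the rule names and the empty-value markers are assumptions for illustration:

```python
def apply_value_rules(value, rules):
    """Apply (name, fn) rules in list order so runs are reproducible."""
    for _name, rule in rules:
        value = rule(value)
    return value

# Order matters: trim first, then collapse null markers, then reformat.
DATE_RULES = [
    ("trim", str.strip),
    ("empty_marker", lambda v: "" if v in {"N/A", "null", "-"} else v),
    ("slash_to_dash", lambda v: v.replace("/", "-")),
]
```

Because each rule is named, a failed or surprising transformation can be reported against a specific step rather than the pipeline as a whole.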

Checklist step 4: validation and audit

A cleanup run should end with a short verification pass and a small change record.
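The verification pass and the change record can be one small function. This sketch assumes rows are lists of cells and that the only audited change is dropped rows; real runs would record more:

```python
def verify_and_audit(before_rows, after_rows, required_width):
    """Verify the cleaned output, then return a small change record."""
    ragged = [r for r in after_rows if len(r) != required_width]
    if ragged:
        raise ValueError(f"{len(ragged)} ragged rows remain after cleanup")
    return {
        "rows_in": len(before_rows),
        "rows_out": len(after_rows),
        "rows_dropped": len(before_rows) - len(after_rows),
    }
```

Storing the returned record alongside the output file is what makes the run auditable later.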

When to skip a rule

Not every rule applies to every file. Record a one-line decision for each optional step so the process stays transparent.

For example, heavy deduplication is unnecessary on transactional logs where records are designed to repeat. In that case, flag duplicates separately instead of removing them.
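Flagging rather than removing can be as simple as appending a marker column. A minimal sketch, assuming the duplicate key is a single column identified by index:

```python
from collections import Counter

def flag_duplicates(rows, key_index=0):
    """Append a 'dup'/'ok' marker instead of deleting repeated rows."""
    counts = Counter(row[key_index] for row in rows)
    return [
        row + ["dup" if counts[row[key_index]] > 1 else "ok"]
        for row in rows
    ]
```

Downstream consumers can then decide per use case whether repeats are noise or signal, which is the point of flagging over deletion.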