Cleaning a small CSV? Pandas is perfect.
Cleaning a million-plus rows on a memory-limited machine or inside a serverless function? That's where Pandas starts to struggle.
That’s exactly the problem I faced.
In this post, I’ll share:
- Why I avoided Pandas
- My Node.js pipeline with csv-parser
- How I handled common data issues: dates, phone numbers, missing fields
- What I’d do differently next time
Let’s dive in.
Why Not Pandas?
Pandas is fantastic, but it has one real limitation for this kind of job: it loads the entire file into memory. If you're working with files larger than about 500MB, you can run into:
- Memory errors
- Slow performance
- Crashes in limited environments (e.g., cloud functions, small servers)
In my case, I had:
- 1 million+ rows
- Dirty data from multiple sources
- A need to stream and clean data row by row
My Setup: Streaming CSV Processing in Node.js
Here’s the core pipeline using csv-parser and Node streams:
const fs = require('fs');
const csv = require('csv-parser');

let rowCount = 0;
let errorCount = 0;

fs.createReadStream('bigfile.csv')
  .pipe(csv())
  .on('data', (row) => {
    rowCount++;

    // Clean data
    row.email = cleanEmail(row.email);
    row.phone = cleanPhone(row.phone);
    row.date = parseDate(row.date);

    // Validate required fields
    if (!row.email || !row.date) {
      errorCount++;
      logError(row);
      return;
    }

    // Save row to DB, another file, or API...
  })
  .on('end', () => {
    console.log(`✅ Processed ${rowCount} rows`);
    console.log(`⚠️ Found ${errorCount} bad rows`);
  });

function cleanEmail(email) {
  return email?.trim().toLowerCase() || null;
}

function cleanPhone(phone) {
  const digits = phone?.replace(/\D/g, '');
  return digits?.length === 10 ? digits : null;
}

function parseDate(date) {
  const parsed = Date.parse(date);
  return isNaN(parsed) ? null : new Date(parsed).toISOString();
}

function logError(row) {
  fs.appendFileSync('errors.log', JSON.stringify(row) + '\n');
}
Common Data Issues I Ran Into (and How I Fixed Them)
- Inconsistent date formats (MM-DD-YYYY vs DD/MM/YYYY) → Used Date.parse() plus fallback logic (see the sketch after this list)
- Phone numbers in weird formats → Removed non-digits, validated length
- Missing fields → Set defaults or marked as null
- Extra columns → Stripped to schema fields
- Encoding problems → Saved CSVs as UTF-8
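Here's a minimal sketch of the kind of date fallback I mean. It assumes ambiguous all-numeric dates should be tried as MM-DD-YYYY first and DD/MM/YYYY second (your data may need the opposite order), and parseDateWithFallback is just an illustrative name:

function parseDateWithFallback(value) {
  if (!value) return null;

  // First try whatever Date.parse() understands (ISO dates, etc.)
  const direct = Date.parse(value);
  if (!isNaN(direct)) return new Date(direct).toISOString();

  // Fallback: split on -, / or . and try MM-DD-YYYY, then DD/MM/YYYY
  const parts = value.split(/[-\/.]/).map(Number);
  if (parts.length === 3 && parts.every((n) => !isNaN(n))) {
    const [a, b, year] = parts;
    const asMMDD = new Date(year, a - 1, b); // assume MM-DD-YYYY
    if (asMMDD.getMonth() === a - 1) return asMMDD.toISOString();
    const asDDMM = new Date(year, b - 1, a); // assume DD/MM/YYYY
    if (asDDMM.getMonth() === b - 1) return asDDMM.toISOString();
  }

  return null; // give up and let validation flag the row
}

The getMonth() checks catch rollover cases (e.g., month 13 or day 45), so genuinely invalid dates still come back as null and get logged as bad rows.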
Pro Tips for Large CSV Cleaning
- Stream, don’t load → Avoid memory issues by processing row by row
- Validate early → Catch bad data before it pollutes your system
- Log errors → Keep a separate file of rejected rows for review
- Test on a small sample → Always test your logic before full-scale runs (a quick way to do that is sketched after this list)
- Handle edge cases → Empty cells, extra commas, inconsistent headers—these will happen!
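For the small-sample tip, one quick approach is to run the exact same pipeline but stop reading after the first N rows. This is a sketch, assuming you keep a reference to the underlying read stream; the 1,000-row cutoff is arbitrary:

const readStream = fs.createReadStream('bigfile.csv');
let sampleCount = 0;

readStream
  .pipe(csv())
  .on('data', (row) => {
    sampleCount++;
    // ...run the same cleaning and validation logic as above...
    if (sampleCount >= 1000) {
      readStream.destroy(); // stop reading; a few already-buffered rows may still come through
      console.log('Sample run finished, review errors.log before the full run');
    }
  });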
What I’d Do Differently Next Time
- Use a schema definition (like JSON Schema or Zod) to validate and transform rows automatically (a rough Zod sketch follows this list)
- Build a mapping layer for multi-source CSVs (e.g., different column names)
- Consider tools like DuckDB or Polars if I need more advanced queries
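For the schema idea, here's a rough sketch of what a Zod (v3) row schema might look like. The field names mirror the examples above, and I haven't battle-tested this exact schema, so treat it as a starting point rather than the final word:

const { z } = require('zod');

// Hypothetical row schema: normalizes values and rejects rows that fail validation
const rowSchema = z.object({
  email: z.string().trim().email().transform((e) => e.toLowerCase()),
  phone: z
    .string()
    .transform((p) => p.replace(/\D/g, ''))
    .refine((digits) => digits.length === 10, { message: 'Expected 10 digits' }),
  date: z.coerce.date().transform((d) => d.toISOString()),
});

// Inside the 'data' handler, this would replace the manual clean* calls:
// const result = rowSchema.safeParse(row);
// if (!result.success) { errorCount++; logError(row); return; }
// const cleanRow = result.data;

The nice part is that validation and transformation live in one declaration, so adding a new field or a new source format means editing the schema, not the handler.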
Final Thoughts
Handling big data files is less about clever code and more about building durable pipelines that survive messy, real-world input.
If you’re working with CSVs, remember:
- Validate early
- Clean thoughtfully
- Log everything
And when in doubt, stream it, don’t load it all at once.
Have you ever tackled the challenge of cleaning up a huge dataset? What tools or tips have you found to be the most helpful? I’d love to hear your thoughts!