Abrar ahmed

What I Learned Cleaning 1 Million Rows of CSV Data Without Pandas

Cleaning a small CSV? Pandas is perfect.
Cleaning a million rows on a memory-constrained machine or inside a serverless function? That's where Pandas starts to struggle.

That’s exactly the problem I faced.

In this post, I’ll share:

  • Why I avoided Pandas
  • My Node.js pipeline with csv-parser
  • How I handled common data issues: dates, phone numbers, missing fields
  • What I’d do differently next time

Let’s dive in.

Why Not Pandas?

Pandas is fantastic, but it has one big downside: it loads the entire file into memory. If you're working with files larger than about 500MB, you can quickly run into:

  • Memory errors
  • Slow performance
  • Crashes in limited environments (e.g., cloud functions, small servers)

In my case, I had:

  • 1 million+ rows
  • Dirty data from multiple sources
  • A need to stream and clean data row by row

My Setup: Streaming CSV Processing in Node.js

Here’s the core pipeline using csv-parser and Node streams:

const fs = require('fs');
const csv = require('csv-parser');

let rowCount = 0;
let errorCount = 0;

fs.createReadStream('bigfile.csv')
  .pipe(csv())
  .on('data', (row) => {
    rowCount++;

    // Clean data
    row.email = cleanEmail(row.email);
    row.phone = cleanPhone(row.phone);
    row.date = parseDate(row.date);

    // Validate required fields
    if (!row.email || !row.date) {
      errorCount++;
      logError(row);
      return;
    }

    // Save row to DB, another file, or API...
  })
  .on('end', () => {
    console.log(`✅ Processed ${rowCount} rows`);
    console.log(`⚠️  Found ${errorCount} bad rows`);
  });

function cleanEmail(email) {
  return email?.trim().toLowerCase() || null;
}

function cleanPhone(phone) {
  const digits = phone?.replace(/\D/g, '');
  return digits?.length === 10 ? digits : null;
}

function parseDate(date) {
  const parsed = Date.parse(date);
  return isNaN(parsed) ? null : new Date(parsed).toISOString();
}

function logError(row) {
  fs.appendFileSync('errors.log', JSON.stringify(row) + '\n');
}

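A quick note if you want to run the pipeline yourself: csv-parser is the only third-party dependency (fs and streams ship with Node), and the script name below is just a placeholder.

npm install csv-parser
node clean.js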

Common Data Issues I Ran Into (and How I Fixed Them)

  • Inconsistent date formats (MM-DD-YYYY vs DD/MM/YYYY) → Used Date.parse() + fallback logic (sketched after this list)
  • Phone numbers in weird formats → Removed non-digits, validated length
  • Missing fields → Set defaults or marked as null
  • Extra columns → Stripped to schema fields
  • Encoding problems → Saved CSVs as UTF-8
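Since dates were the messiest part, here's a rough sketch of what that fallback logic can look like. The assumption that slash-separated dates are DD/MM/YYYY is mine for illustration (Date.parse() would read them as US-style MM/DD), so adjust the patterns to whatever your sources actually contain:

function parseDateWithFallback(raw) {
  if (!raw) return null;
  const value = raw.trim();

  // Assumption: slash-separated dates are DD/MM/YYYY (e.g. "31/12/2021").
  // Handle them explicitly, because Date.parse() reads slashes as US MM/DD.
  const dmy = value.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
  if (dmy) {
    const [, dd, mm, yyyy] = dmy;
    if (+mm < 1 || +mm > 12 || +dd < 1 || +dd > 31) return null;
    return new Date(Date.UTC(+yyyy, +mm - 1, +dd)).toISOString();
  }

  // Everything else (ISO strings, MM-DD-YYYY, etc.) goes through Date.parse().
  const parsed = Date.parse(value);
  return isNaN(parsed) ? null : new Date(parsed).toISOString();
}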

Pro Tips for Large CSV Cleaning

  • Stream, don’t load → Avoid memory issues by processing row by row
  • Validate early → Catch bad data before it pollutes your system
  • Log errors → Keep a separate file of rejected rows for review
  • Test on a small sample → Always test your logic before full-scale runs (see the sketch after this list)
  • Handle edge cases → Empty cells, extra commas, inconsistent headers—these will happen!
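One way to do that sample run without creating a separate file is to reuse the same pipeline but stop reading after the first N rows. The SAMPLE_SIZE constant and the early-exit logic here are illustrative, not part of the original script:

const SAMPLE_SIZE = 1000; // hypothetical sample size
let sampled = 0;

const source = fs.createReadStream('bigfile.csv');

source
  .pipe(csv())
  .on('data', (row) => {
    if (sampled >= SAMPLE_SIZE) return; // ignore rows still buffered after destroy()
    sampled++;

    // ...run the same cleanEmail / cleanPhone / parseDate logic here...

    if (sampled === SAMPLE_SIZE) {
      source.destroy(); // stop reading the rest of the file
      console.log(`Checked ${sampled} sample rows`);
    }
  });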

What I’d Do Differently Next Time

  • Use a schema definition (like JSON Schema or Zod) to validate and transform rows automatically (a rough Zod sketch is below)
  • Build a mapping layer for multi-source CSVs (e.g., different column names)
  • Consider tools like DuckDB or Polars if I need more advanced queries
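For example, a Zod version might look roughly like this; the field names and rules are illustrative, not something I actually shipped:

const { z } = require('zod');

// Hypothetical schema matching the cleaned row shape from earlier.
const RowSchema = z.object({
  email: z.string().email(),
  phone: z.string().regex(/^\d{10}$/),
  date: z.string().datetime(), // ISO string, i.e. the output of parseDate()
});

// Inside the stream's 'data' handler, instead of hand-rolled checks:
const result = RowSchema.safeParse(row);
if (!result.success) {
  errorCount++;
  logError({ row, issues: result.error.issues });
} else {
  // result.data is the validated row, ready to save
}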

Final Thoughts

Handling big data files is about more than writing code; it's about building durable pipelines that can survive the messiness of real-world data.

If you’re working with CSVs, remember:

  • Validate early
  • Clean thoughtfully
  • Log everything

And when in doubt, stream it, don’t load it all at once.

Have you ever tackled the challenge of cleaning up a huge dataset? What tools or tips have you found to be the most helpful? I’d love to hear your thoughts!
