Cleaning a small CSV? Pandas is perfect.
Cleaning a million-plus rows on a memory-limited machine or inside a serverless function? That's where Pandas starts to struggle.
That’s exactly the problem I faced.
In this post, I’ll share:
- Why I avoided Pandas
- My Node.js pipeline with csv-parser
- How I handled common data issues: dates, phone numbers, missing fields
- What I’d do differently next time
Let’s dive in.
Why Not Pandas?
Pandas is fantastic, but it has one real limitation for this kind of job: it loads the entire file into memory. If you're working with files larger than about 500MB, you can run into:
- Memory errors
- Slow performance
- Crashes in limited environments (e.g., cloud functions, small servers)
In my case, I had:
- 1 million+ rows
- Dirty data from multiple sources
- A need to stream and clean data row by row
My Setup: Streaming CSV Processing in Node.js
Here’s the core pipeline using csv-parser and Node streams:
const fs = require('fs');
const csv = require('csv-parser');

let rowCount = 0;
let errorCount = 0;

fs.createReadStream('bigfile.csv')
  .pipe(csv())
  .on('data', (row) => {
    rowCount++;

    // Clean data
    row.email = cleanEmail(row.email);
    row.phone = cleanPhone(row.phone);
    row.date = parseDate(row.date);

    // Validate required fields
    if (!row.email || !row.date) {
      errorCount++;
      logError(row);
      return;
    }

    // Save row to DB, another file, or API...
  })
  .on('end', () => {
    console.log(`✅ Processed ${rowCount} rows`);
    console.log(`⚠️ Found ${errorCount} bad rows`);
  });

function cleanEmail(email) {
  return email?.trim().toLowerCase() || null;
}

function cleanPhone(phone) {
  const digits = phone?.replace(/\D/g, '');
  return digits?.length === 10 ? digits : null;
}

function parseDate(date) {
  const parsed = Date.parse(date);
  return isNaN(parsed) ? null : new Date(parsed).toISOString();
}

function logError(row) {
  fs.appendFileSync('errors.log', JSON.stringify(row) + '\n');
}
Common Data Issues I Ran Into (and How I Fixed Them)
- Inconsistent date formats (MM-DD-YYYY vs DD/MM/YYYY) → Used Date.parse() plus fallback logic (see the sketch after this list)
- Phone numbers in weird formats → Removed non-digits, validated length
- Missing fields → Set defaults or marked as null
- Extra columns → Stripped to schema fields
- Encoding problems → Saved CSVs as UTF-8
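Here's a minimal sketch of the kind of date fallback I mean. It assumes ambiguous all-numeric dates should be tried as MM-DD-YYYY first and DD/MM/YYYY second (your data may need the opposite order), and parseDateWithFallback is just an illustrative name:

function parseDateWithFallback(value) {
  if (!value) return null;

  // First try whatever Date.parse() understands (ISO dates, etc.)
  const direct = Date.parse(value);
  if (!isNaN(direct)) return new Date(direct).toISOString();

  // Fallback: split on -, / or . and try MM-DD-YYYY, then DD/MM/YYYY
  const parts = value.split(/[-\/.]/).map(Number);
  if (parts.length === 3 && parts.every((n) => !isNaN(n))) {
    const [a, b, year] = parts;
    const asMMDD = new Date(year, a - 1, b); // assume MM-DD-YYYY
    if (asMMDD.getMonth() === a - 1) return asMMDD.toISOString();
    const asDDMM = new Date(year, b - 1, a); // assume DD/MM/YYYY
    if (asDDMM.getMonth() === b - 1) return asDDMM.toISOString();
  }

  return null; // give up and let validation flag the row
}

The getMonth() checks catch rollover cases (e.g., month 13 or day 45), so genuinely invalid dates still come back as null and get logged as bad rows.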
Pro Tips for Large CSV Cleaning
- Stream, don’t load → Avoid memory issues by processing row by row
- Validate early → Catch bad data before it pollutes your system
- Log errors → Keep a separate file of rejected rows for review
- Test on a small sample → Always test your logic before full-scale runs (a quick way to do that is sketched after this list)
- Handle edge cases → Empty cells, extra commas, inconsistent headers—these will happen!
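For the small-sample tip, one quick approach is to run the exact same pipeline but stop reading after the first N rows. This is a sketch, assuming you keep a reference to the underlying read stream; the 1,000-row cutoff is arbitrary:

const readStream = fs.createReadStream('bigfile.csv');
let sampleCount = 0;

readStream
  .pipe(csv())
  .on('data', (row) => {
    sampleCount++;
    // ...run the same cleaning and validation logic as above...
    if (sampleCount >= 1000) {
      readStream.destroy(); // stop reading; a few already-buffered rows may still come through
      console.log('Sample run finished, review errors.log before the full run');
    }
  });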
What I’d Do Differently Next Time
- Use a schema definition (like JSON Schema or Zod) to validate and transform rows automatically (a rough Zod sketch follows this list)
- Build a mapping layer for multi-source CSVs (e.g., different column names)
- Consider tools like DuckDB or Polars if I need more advanced queries
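For the schema idea, here's a rough sketch of what a Zod (v3) row schema might look like. The field names mirror the examples above, and I haven't battle-tested this exact schema, so treat it as a starting point rather than the final word:

const { z } = require('zod');

// Hypothetical row schema: normalizes values and rejects rows that fail validation
const rowSchema = z.object({
  email: z.string().trim().email().transform((e) => e.toLowerCase()),
  phone: z
    .string()
    .transform((p) => p.replace(/\D/g, ''))
    .refine((digits) => digits.length === 10, { message: 'Expected 10 digits' }),
  date: z.coerce.date().transform((d) => d.toISOString()),
});

// Inside the 'data' handler, this would replace the manual clean* calls:
// const result = rowSchema.safeParse(row);
// if (!result.success) { errorCount++; logError(row); return; }
// const cleanRow = result.data;

The nice part is that validation and transformation live in one declaration, so adding a new field or a new source format means editing the schema, not the handler.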
Final Thoughts
Handling big data files is less about clever code and more about building durable pipelines that survive messy, real-world input.
If you’re working with CSVs, remember:
- Validate early
- Clean thoughtfully
- Log everything
And when in doubt, stream it, don’t load it all at once.
Have you ever tackled the challenge of cleaning up a huge dataset? What tools or tips have you found to be the most helpful? I’d love to hear your thoughts!