Abrar ahmed

How to Clean Messy CSV, Excel, and JSON Files in Node.js (Without Pandas)

If you're diving into building a Node.js app that deals with CSV or Excel file uploads, you've probably faced the challenge of messy data. Let me share how I tackle issues like broken headers, odd date formats, and duplicates — all using plain JavaScript.

Why Messy Files Are a Hidden Time Sink for Developers

Uploading files is easy.

Handling them correctly? Not so much.

When you're dealing with user-submitted spreadsheets or pulling in data from other sources, messy formats can quickly throw a wrench into your logic:

  • "Name" vs "name" vs "Full Name" headers
  • Rows with empty or null values
  • Inconsistent date formats (MM/DD/YYYY, DD-MM-YYYY, 2024/05/25)
  • Duplicates that quietly pass through validations
  • Extra white spaces, strange encodings, BOM issues

If you've ever found yourself sifting through a file and getting lost in a maze of strange edge cases — trust me, you're definitely not the only one.


How to Clean Messy Data Files in Node.js (Step-by-Step)

Here’s how I handle messy CSVs in Node.js, without switching languages or installing a full data science stack.

You’ll learn how to:

  • Parse CSV files
  • Normalize headers
  • Clean empty/null values
  • Standardize date formats
  • Deduplicate entries
  • Combine it all in a reusable utility

Parse the CSV File in Node.js

const fs = require('fs');
const csv = require('csv-parser');

function parseCSV(path) {
  return new Promise((resolve, reject) => {
    const rows = [];
    fs.createReadStream(path)
      .pipe(csv())
      .on('data', (row) => rows.push(row))
      .on('end', () => resolve(rows))
      .on('error', reject);
  });
}

This reads the file and returns rows as JavaScript objects.
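
For example, you could call it like this (the file path is just a placeholder, not a real project file):

// Hypothetical usage: swap in whatever path your upload lands at
parseCSV('./uploads/users.csv')
  .then((rows) => {
    console.log(`Parsed ${rows.length} rows`);
    console.log(rows[0]); // first row as a plain object, keys taken from the header line
  })
  .catch((err) => console.error('Failed to parse CSV:', err));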


Normalize Column Headers

Messy headers can ruin your logic. Standardize them early:

function normalizeHeaders(row) {
  const cleanRow = {};
  for (const key in row) {
    const newKey = key.trim().toLowerCase().replace(/\s+/g, '_');
    const value = row[key];
    // Only trim strings; other value types pass through untouched
    cleanRow[newKey] = typeof value === 'string' ? value.trim() : value;
  }
  return cleanRow;
}

"Full Name" becomes full_name, " age " becomes age.


Remove Null or Empty Fields

function cleanRow(row) {
  const cleaned = {};
  for (const key in row) {
    const val = row[key];
    if (val !== null && val !== '' && val !== undefined) {
      cleaned[key] = val;
    }
  }
  return Object.keys(cleaned).length ? cleaned : null;
}

This keeps your dataset compact and safe for further processing.
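
Because it returns null when nothing survives, pairing it with .filter(Boolean) drops fully blank rows. A quick sketch with made-up data:

// Hypothetical rows: the second one is completely empty
const rows = [
  { full_name: 'Jane Doe', age: '34', notes: '' },
  { full_name: '', age: '', notes: '' },
];

const cleaned = rows.map(cleanRow).filter(Boolean);
// => [ { full_name: 'Jane Doe', age: '34' } ]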


Standardize Date Formats with dayjs

npm install dayjs

Then use it:

const dayjs = require('dayjs');
const customParseFormat = require('dayjs/plugin/customParseFormat');

// Needed so dayjs can strictly parse against a list of formats
dayjs.extend(customParseFormat);

function fixDate(value) {
  const parsed = dayjs(value, ['MM-DD-YYYY', 'DD/MM/YYYY', 'YYYY-MM-DD'], true);
  return parsed.isValid() ? parsed.toISOString() : value;
}


Dates matching one of the listed formats become ISO strings like 2024-05-25T00:00:00.000Z; anything that doesn't parse is left as-is.
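
In practice I only run it on the columns that actually hold dates. A small sketch, assuming a signup_date column (rename it to match your own headers):

// Sketch: normalize one hypothetical date column across all rows
function fixDateColumn(rows, column = 'signup_date') {
  return rows.map((row) =>
    column in row ? { ...row, [column]: fixDate(row[column]) } : row
  );
}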


Deduplicate Rows in JavaScript

function deduplicate(data) {
  const seen = new Set();
  return data.filter(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

Great for catching repeated entries that cause DB constraints or logic errors.
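
Since JSON.stringify compares every field, rows only count as duplicates when they match exactly. If you'd rather dedupe on a single column, a small variation does the trick; the email column here is just an example:

// Variation: deduplicate on one hypothetical column instead of the whole row
function deduplicateBy(data, column = 'email') {
  const seen = new Set();
  return data.filter((row) => {
    const key = row[column];
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}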


Final Function: Clean a File from Start to Finish

async function cleanFile(path) {
  const raw = await parseCSV(path);
  const cleaned = raw
    .map(normalizeHeaders)
    .map(cleanRow)
    .filter(Boolean);
  return deduplicate(cleaned);
}

You now get a nice array of consistent, clean objects.
And best of all — no Excel wrangling, no Python, no drama.
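
Wiring it into a script or route handler is then a one-liner. The path below is only an example:

// Hypothetical usage of the full pipeline
(async () => {
  const data = await cleanFile('./uploads/users.csv');
  console.log(`Got ${data.length} clean, deduplicated rows`);
})();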


Here are some bonus tips for tackling real projects:

  • Use the xlsx package for seamless Excel support (see the sketch after this list).
  • Log any rows that get skipped; it makes debugging and reporting much easier.
  • Validate that all required columns are present up front so nothing is lost silently.
  • For large files, stream the data instead of loading it all at once to avoid memory issues in production.
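
For the Excel tip above, here's a minimal sketch using the xlsx package: it reads the first sheet into plain row objects that you can push through the same cleaning pipeline. The file name is only an example.

const XLSX = require('xlsx');

// Minimal sketch: read the first sheet of a workbook into row objects
function parseExcel(path) {
  const workbook = XLSX.readFile(path);
  const firstSheet = workbook.Sheets[workbook.SheetNames[0]];
  // defval keeps empty cells as '' so cleanRow can strip them later
  return XLSX.utils.sheet_to_json(firstSheet, { defval: '' });
}

// const rows = parseExcel('./uploads/report.xlsx');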

In the Next Post…

I’ll share how I deal with:

  • Cleaning Excel (.xlsx) files in Node.js
  • Letting users upload files and preview what’s broken
  • How I almost built a Pandas-style library in JavaScript (and why I didn’t)


Join the Conversation

Have you faced any of this?

How do you handle user-uploaded files? Share your secret!

Are you creating unique utilities for every project, or do you prefer to reuse the same one?

Got a CSV horror story?

Drop a comment — I’m really interested in hearing what everyone else is up to. We all deal with messy data, so let’s swap stories about our struggles and successes!


Thanks for reading!

If you found this helpful, bookmark it or share it with someone wrangling the same kind of files. I’d also love to hear what topics you’d like me to dive into next, whether it’s data cleaning, file uploads, or backend utilities in Node.js.


TL;DR: A Quick Reference

| Task | Code Used | Notes |
| --- | --- | --- |
| Parse CSV | csv-parser | Fast and reliable |
| Normalize headers | trim + lowercase | Avoids mismatch bugs |
| Remove nulls/blanks | .filter(Boolean) | Keeps data usable |
| Fix inconsistent dates | dayjs | ISO-standard conversion |
| Deduplicate rows | Set + JSON.stringify | Prevents duplicate records |
