
I am trying to convert a very large JSON array to CSV in Node.js, but it is taking too much time and also driving the CPU to 100% during the conversion.

  jsonToCsv: function (data) {
    var keys = Object.keys(data[0]);
    var csv = [keys.join(",")];
    console.time("CSVGeneration");
    data.forEach(function (row) {
      var line = ''; 
      keys.forEach(function (key) {
        if (typeof row[key] === 'string') {
          row[key] = "" + file_utils.escapeCsv(row[key]) + "";
        } 
        line += row[key] + ",";
      });
      csv.push(line);  
    });
    console.timeEnd("CSVGeneration");
    csv = csv.join("\n");
    return csv;
  },
  escapeCsv: function (x) {
    if (x)
      return ('' + x.replace(/"/g, '').replace(/,/g, ' ').replace(/\n/g, " ").replace(/\r/g, " ") + '');
    else
      return ('');
  },

On an average run with 1 lakh (100,000) rows, it never even got as far as logging the time; I had to kill the process manually.

Can someone suggest a better alternative to this?

3 Comments
  • Sidenote: Why does your function accept a cb parameter but return the result directly? Furthermore, have you checked whether the rows are actually generated (e.g., with debug output inside the forEach())? Commented Dec 15, 2014 at 10:30
  • Sorry for that. cb is redundant and supposed to be removed. Commented Dec 15, 2014 at 10:33
  • Yes, the rows are generated, but it is taking too much memory and CPU. Is there a more memory-efficient solution to this? Commented Dec 15, 2014 at 10:51

1 Answer


Before answering this: assuming your code is working, this question belongs on https://codereview.stackexchange.com/.

As for your problem:

  • the Array iteration methods like forEach(), while comfortable to write, are usually not very performant. A simple for loop is the better choice in performance-critical situations.
  • in escapeCsv() you apply four different regex replacements, each for just one character. These can be combined into a single one.
  • assuming your data is already structured in a way that allows for CSV conversion (data is an array of objects, each having the same properties), it is not necessary to retrieve the keys individually for each object.

Applying this yields the following code:

function escapeCsv(x) {
    if (x) {
        // One combined regex pass instead of four separate ones:
        // strip quotes, commas and line breaks so a value cannot
        // break the CSV structure.
        return ('' + x).replace(/[",\n\r]/g, '');
    } else {
        return '';
    }
}

function jsonToCsv(data) {
    // All rows are assumed to share the same properties, so the
    // keys are read only once, from the first row.
    var keys = Object.keys(data[0]),
        csv = [keys.join(",")];

    // Reuse a single array for the cells of the current row.
    var row = new Array(keys.length);
    for (var i = 0; i < data.length; i++) {
        for (var j = 0; j < keys.length; j++) {
            if (typeof data[i][keys[j]] === 'string') {
                row[j] = '"' + escapeCsv(data[i][keys[j]]) + '"';
            } else {
                // Keep falsy values like 0 or false; only null/undefined
                // become an empty cell.
                row[j] = data[i][keys[j]] == null ? '' : data[i][keys[j]];
            }
        }
        csv.push(row.join(','));
    }

    return csv.join("\n");
}

This alone yields a performance improvement of about 3-5x according to jsPerf.
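
For completeness, a minimal usage sketch (the sample rows and the output file name are made up):

var fs = require('fs');

var rows = [
    { id: 1, name: 'Alice', note: 'he said, "hi"' },
    { id: 2, name: 'Bob',   note: null }
];

// Produces:
// id,name,note
// 1,"Alice","he said hi"
// 2,"Bob",
fs.writeFileSync('out.csv', jsonToCsv(rows));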

If the CSV you are generating can be streamed to a file or to a client directly, you could improve this even more and reduce the memory load, since the whole CSV no longer has to be kept in memory.
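
A rough sketch of that idea, writing each line as it is produced instead of building one big string (the function name and output path are just for illustration, and backpressure handling is left out for brevity):

var fs = require('fs');

// Stream the CSV row by row; any Writable stream (a file, an HTTP
// response, ...) works the same way.
function jsonToCsvStream(data, outPath, done) {
    var keys = Object.keys(data[0]),
        out = fs.createWriteStream(outPath),
        row = new Array(keys.length);

    out.on('finish', done);
    out.on('error', done);

    out.write(keys.join(',') + '\n');
    for (var i = 0; i < data.length; i++) {
        for (var j = 0; j < keys.length; j++) {
            var value = data[i][keys[j]];
            row[j] = typeof value === 'string'
                ? '"' + escapeCsv(value) + '"'
                : (value == null ? '' : value);
        }
        out.write(row.join(',') + '\n');
    }
    out.end();
}

For very large outputs you would also want to respect the return value of write() (or wait for the 'drain' event) so the stream's internal buffer does not grow without bound.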

Fiddle to play around with the functions: the original ones are named like yours, the new ones carry a 2 suffix.

jsPerf.com comparison


5 Comments

Thanks for the valuable input; I wasn't aware of codereview.stackexchange yet. I will try out the suggested changes and update.
@Hitesh There are libraries for CSV generation btw: github.com/wdavidw/node-csv (a rough usage sketch follows these comments)
I will surely try this too. Actually, the problem I faced was that, before creating this CSV, I already had a huge call stack in addition to the entire data set in memory. At the time of CSV creation, the whole setup was running out of memory. Now I am thinking of moving this in-memory computation to disk altogether. Any suggestions on this?
@Hitesh Hard to suggest something without knowing more details. For lower memory consumption, you might be able to use pipes: read from one file (row-wise, for example), process it and immediately write to an output file. This way, at any point in time, you have just one row of data in memory. But whether that approach is feasible depends on your data and computation ...
I am trying almost the same path, the only difference being that the source is a database instead of a file. Thanks for your inputs.
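
For the library mentioned in the comments above (node-csv), a streaming usage could look roughly like this; treat it as an untested sketch, since the exact require and option syntax depends on the library version (check its documentation):

var fs = require('fs');
var stringify = require('csv-stringify');

// The stringifier is a Transform stream: write one row object at a time
// and pipe the generated CSV straight to a file or an HTTP response.
var stringifier = stringify({
    header: true,
    columns: ['id', 'name', 'note']   // example columns, adjust to your data
});

stringifier.pipe(fs.createWriteStream('out.csv'));

// e.g. rows arriving one by one from a database cursor:
stringifier.write({ id: 1, name: 'Alice', note: 'he said, "hi"' });
stringifier.write({ id: 2, name: 'Bob',   note: null });
stringifier.end();

Quoting and escaping are handled by the library, so a hand-rolled escapeCsv() is not needed in that case.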
