I want to process a file chunk by chunk to avoid memory exhaustion, so I need to consume the file through a read stream. The implementation below seems to work fine when I try it.
I am asking for your expert eyes:
- have I overlooked anything or forgotten some edge case?
- have I made a mistake somewhere that is going to bite me in prod?
- could this be improved? (For example, I'm wondering whether async iteration over the stream would be simpler; there's a sketch after the output below.)
The main code:
const fs = require("fs");

async function processFileByChunk(filePath) {
  try {
    const videoStream = fs.createReadStream(filePath);
    const stats = fs.statSync(filePath);
    await new Promise((resolve, reject) => {
      let bytesRead = 0;
      let countCurrentUploads = 0;
      videoStream.on("readable", async function () {
        while (true) {
          // wait until the previous chunk has finished processing
          await wait(() => countCurrentUploads <= 0, 1000);
          // pull up to 16 MiB from the stream's internal buffer
          const chunk = videoStream.read(16 * 1024 * 1024);
          if (!chunk || !chunk.length) {
            break;
          }
          bytesRead += chunk.length;
          console.log("bytesRead", bytesRead);
          countCurrentUploads++;
          await processChunk(chunk);
          countCurrentUploads--;
        }
        // resolve once the whole file has been consumed
        if (bytesRead >= stats.size) {
          resolve();
        }
      });
      videoStream.on("error", function (error) {
        reject(error);
      });
    });
  } catch (error) {
    console.log(error);
  }
}
Other functions:
async function processChunk(chunk) {
  console.log("process chunk...");
  await delay(2000);
  console.log("process chunk... done");
}

async function wait(fn, ms) {
  while (!fn()) {
    await delay(ms);
  }
}

function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}
Applied to a ~63 MB file, it prints out:
bytesRead 16777216
process chunk...
process chunk... done
bytesRead 33554432
process chunk...
process chunk... done
bytesRead 50331648
process chunk...
process chunk... done
bytesRead 63598316
process chunk...
process chunk... done
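So the stream is consumed as three full 16 MiB chunks (3 × 16,777,216 = 50,331,648 bytes) plus a final chunk of 13,266,668 bytes, which adds up to the file size, so the whole file does get read.

For comparison, here is a rough sketch of the async-iteration alternative I was wondering about. This assumes Node.js 12+ (where readable streams are async iterable); processFileByChunkAlt is just a throwaway name and processChunk is the same stub as above:

const fs = require("fs");

async function processFileByChunkAlt(filePath) {
  // highWaterMark makes the stream hand out buffers of up to 16 MiB each
  const stream = fs.createReadStream(filePath, { highWaterMark: 16 * 1024 * 1024 });
  let bytesRead = 0;
  // for await...of pulls one chunk at a time: the next chunk is not read
  // until processChunk(chunk) has resolved, and stream errors are thrown here
  for await (const chunk of stream) {
    bytesRead += chunk.length;
    console.log("bytesRead", bytesRead);
    await processChunk(chunk);
  }
}

My understanding is that this gives the same one-chunk-at-a-time behaviour without the manual wait/counter bookkeeping, but I may be missing something.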