Highest performance way to fork/spawn many node.js processes from parent process

Question

I am using Node.js to spawn upwards of 100 child processes, maybe even 1000. What concerns me is that the parent process could become some sort of bottleneck if all the stdout/stderr of the child processes has to go through the parent process in order to get logged somewhere.

So my assumption is that in order to achieve highest performance/throughput, we should ignore stdout/stderr in the parent process, like so:

const cp = require('child_process');

items.forEach(function(exec){

   const n = cp.spawn('node', [exec], {
      stdio: ['ignore','ignore','ignore','ipc']
   });

});

My question is, how much of a performance penalty is it to use pipe in this manner:

// (100+ items to iterate over)

items.forEach(function(exec){

   const n = cp.spawn('node', [exec], {
      stdio: ['ignore','pipe','pipe','ipc']
   });

});

such that stdout and stderr are piped to the parent process? I assume the performance penalty could be drastic, especially if we handle stdout/stderr in the parent process like so:

     // (100+ items to iterate over)

    items.forEach(function(exec){

       const n = cp.spawn('node', [exec], {
          stdio: ['ignore','pipe','pipe','ipc']
       });

       n.stdout.setEncoding('utf8');
       n.stderr.setEncoding('utf8');

        n.stdout.on('data', function(d){
          // do something with the data
        });

        n.stderr.on('data', function(d){
          // do something with the data
        });

    });

I am assuming

I assume if we use 'ignore' for stdout and stderr in the parent process, that this is more performant than piping stdout/stderr to parent process.
I assume if we choose a file to stream stdout/stderr to like so

stdio: ['ignore', fs.openSync('/some/file.log'), fs.openSync('/some/file.log'),'ipc']

that this is almost as performant as using 'ignore' for stdout/stderr (which should send stdout/stderr to /dev/null)

Are these assumptions correct or not? With regard to stdout/stderr, how can I achieve highest performance, if I want to log the stdout/stderr somewhere (not to /dev/null)?

Note: This is for a library so the amount of stdout/stderr could vary quite a bit. Also, most likely will rarely fork more processes than there are cores, at most running about 15 processes simultaneously.

If the source comes into question, part of the answers are here: github.com/nodejs/node/blob/master/lib/child_process.js — Alexander Mills
– Alexander Mills, Commented Nov 26, 2016 at 10:25
and here: github.com/nodejs/node/blob/master/lib/internal/… — Alexander Mills
– Alexander Mills, Commented Nov 26, 2016 at 10:29
What does your library do that requires forking that many child processes? — robertklep
– robertklep, Commented Nov 26, 2016 at 12:21
Testing library, similar to Node.js' AVA. All the child-processes have to run on one machine, for the moment. I don't expect to utilize multiple machines, anytime soon. — Alexander Mills
– Alexander Mills, Commented Nov 26, 2016 at 19:53
Without having checked it, my guess would be that they won't be very CPU intensive at all. Passing data from file descriptor to file descriptor in modern OS'es is heavily optimized, and stdout/stderr aren't very special in that regard. — robertklep
– robertklep, Commented Nov 26, 2016 at 20:40

jcaron · Accepted Answer · 2021-01-29 01:07:46Z

4

You have the following options:

you can have the child process completely ignore stdout/stderr, and do logging on its own by any other means (log a to a file, syslog...)
if you're logging the output of your parent process, you can set stdout/stderr to process.stdout and process.stderr respectively. This means the output of the child will be the same as the main process. Nothing will flow through the main process
you can set file descriptors directly. This means the output of the child process will go to the given files, without going through the parent process
however, if you don't have any control over the child processes AND you need to somehow do something to the logs (filter them, prefix them with the associated child process, etc.), then you probably need to go through the parent process.

As we have no idea of the volume of logs you're talking about, we have no idea whether this is critical or just premature optimisation. Node.js being asynchronous, I don't expect your parent process becoming a bottleneck unless it's really busy and you have lots of logs.

edited Jan 29, 2021 at 1:07

answered Nov 26, 2016 at 9:41

jcaron

17.8k6 gold badges36 silver badges47 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

Alexander Mills Over a year ago

I do have control over the child processes to a certain extent - however the volume of logging could vary tremendously, depending on the user, and in some cases I assume a lot of stdout/stderr could be logged

jcaron Over a year ago

How much is "a lot"?

Alexander Mills Over a year ago

I don't know, but if you had 100 processes all sending mega amounts of stdout/stderr to a single process and that process had to handle it, I assume it would be slower than if those 100 processes each independently sent their stdout/stderr to separate files

Alexander Mills Over a year ago

I am looking to gain that 10%-30% performance that might come with optimizing this somehow

jcaron Over a year ago

No, if you set stdout to process.stdout (and the same for stderr), nothing flows through the parent. The parent gives the file descriptor it's using (file, pipe, terminal...) and gives it to the child which uses it directly. If the main process's stdout points to a file, this is exactly the same as opening the file and sending the file descriptor: everything is handled directly by the kernel.

|

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

Are these assumptions correct or not?

how can I achieve highest performance?

Test it. That's how you can achieve the highest performance. Test on the same type of system you will use in production, with the same number of CPUs and similar disks (SSD or HDD).

I assume your concern is that the children might become blocked if the parent does not read quickly enough. That is a potential problem, depending on the buffer size of the pipe and how much data flows through it. However, if the alternative is to have each child process write to disk independently, this could be better, the same, or worse. We don't know for a whole bunch of reasons, starting with the fact that we have no idea how many cores you have, how quickly your processes produce data, and what I/O subsystem you're writing to.

If you have a single SSD you might be able to write 500 MB per second. That's great, but if that SSD is 512 GB in size, you'll only last 16 minutes before it is full! You'll need to narrow down the problem space a lot more before anyone can know what's the most efficient approach.

If your goal is simply to get logged data off the machine with as little system utilization as possible, your best bet is to directly write your log messages to the network.

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered Nov 26, 2016 at 9:27

John Zwinck

252k44 gold badges346 silver badges459 bronze badges

8 Comments

Alexander Mills Over a year ago

Well, this will run on all sorts of systems because this is for a library. So perhaps we can assume to a certain extent that "all things being equal" except the variable in question - which is really whether it is a big penalty to pipe the stdout/stderr of the children to the parent versus just piping it to /dev/null or a file, I just don't know enough about computers to be sure one way or the other.

Alexander Mills Over a year ago

I could test it on my system, but it won't the same as the next person who runs it on theirs, I just want to know the average case / all other things being equal/fixed case

John Zwinck Over a year ago

@AlexanderMills You should have said you're writing a library! That's critically important and you didn't mention it at all! Tell us more about it...is it for internal use on one project only, internal use on many projects in a company where you work, or for use by people you may never interact with? If the latter, it might be reasonable to make this configurable if you can't test which way is better.

Alexander Mills Over a year ago

Thanks yes, I was definitely planning on making it configurable - basically - the user would have a few options including (1) the parent process should just inherit the stdio from the children so that it's all logged in the original terminal, or (2) to send the stdio to a file, or (3) send all stdio to /dev/null

John Zwinck Over a year ago

What does it mean, "send stdio to file and pipe it to the parent"?

|

Collectives™ on Stack Overflow

Highest performance way to fork/spawn many node.js processes from parent process

2 Answers 2

8 Comments

8 Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

8 Comments

8 Comments

Related