The GNU Parallel manual's example at https://www.gnu.org/software/parallel/parallel.html#EXAMPLE:-GNU-Parallel-as-dir-processor describes pretty much what you are doing.
It states:
Using GNU parallel as dir processor has the same limitations as using GNU parallel as queue system/batch manager.
There is a a small issue when using GNU parallel as queue system/batch manager: You have to submit JobSlot number of jobs before they will start, and after that you can submit one at a time, and job will start immediately if free slots are available. Output from the running or completed jobs are held back and will only be printed when JobSlots more jobs has been started (unless you use --ungroup or --line-buffer, in which case the output from the jobs are printed immediately). E.g. if you have 10 jobslots then the output from the first completed job will only be printed when job 11 has started, and the output of second completed job will only be printed when job 12 has started.
And this is what you are seeing.
Try this to see the difference between output that goes through GNU Parallel and output that bypasses it:
seq 100 | parallel --delay 1 -j1 echo | # give 1..100 one per second
  # stdout is cached by GNU Parallel, >&3 is not
  parallel -j4 'sleep 1; echo stdout {}; echo direct {} >&3; sleep 1' 3>&1