Split the URLs into one file per host. Then run 'parallel -j5' on each file.
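A minimal sketch of that approach (the byhost/ directory and the awk field choice are assumptions, not part of the original answer):

# for http://host/path, field 3 of a '/'-split line is the host
mkdir -p byhost
awk -F/ '{ print > ("byhost/" $3) }' urls.txt
# up to 10 hosts at a time, at most 5 concurrent ./scan jobs per host
parallel -j10 "parallel -j5 ./scan :::: {}" ::: byhost/*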
Or sort the URLs and insert a NUL delimiter ('\0') whenever a new host appears, then split on '\0' (removing it) and pass each block to a new instance of parallel:
sort urls.txt |
  # print a NUL before every line whose host differs from the previous line's
  # (\Q...\E keeps metacharacters such as '.' in the hostname literal)
  perl -pe '(not m://\Q$last\E:) and print "\0"; m://([^/]+): and $last=$1' |
  # split the stream on NUL and hand each block to its own inner parallel
  parallel -j10 --pipe --rrs -N1 --recend '\0' parallel -j5 ./scan
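Here the outer parallel reads the stream in NUL-delimited records (--recend '\0'), takes one record at a time (-N1), i.e. one host's block of URLs, strips the record separator (--rrs), and pipes the block to an inner parallel, which runs at most 5 ./scan jobs against that host; up to 10 hosts are worked on at once (-j10).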
Edit:
I think this will work:
cat urls.txt | parallel -q -j50 sem --fg --id '{= m://([^/]+):; $_=$1 =}' -j5 ./scan {}
sem is part of GNU Parallel (it is shorthand for parallel --semaphore). {= m://([^/]+):; $_=$1 =} grabs the hostname from the URL. -j5 tells sem to create a counting semaphore with 5 slots. --fg forces sem not to spawn the job in the background, so each of the outer parallel's 50 slots stays busy until its job finishes. By using the hostname as the ID you get a separate counting semaphore per hostname.
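To see the counting-semaphore behaviour on its own, here is a toy example (the id 'demo' and the sleep jobs are made up, not part of the original command): at most 3 of the 10 jobs run at any moment.

for i in $(seq 10); do
  # sem blocks here until one of the 3 slots under id 'demo' is free
  sem -j3 --id demo "sleep 1; echo job $i"
done
# wait for the remaining 'demo' jobs to finish
sem --wait --id demo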
-q is needed for parallel if some of your URLs contain special shell characters (such as &). They need to be protected from shell expansion, because sem will shell expand them a second time.
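A quick way to see the effect (the URL is hypothetical): without -q the & reaches the shell that sem spawns unprotected, so it backgrounds half the command; with -q the URL should survive both rounds of expansion.

echo 'http://example.com/?a=1&b=2' | parallel -j1 sem --fg --id t echo {}
# prints only http://example.com/?a=1
echo 'http://example.com/?a=1&b=2' | parallel -q -j1 sem --fg --id t echo {}
# prints the full URL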