Split the URLs into one file per host, then run 'parallel -j5' on each file.
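
A minimal sketch of the per-file approach, assuming plain http:// URLs (so the host is the third '/'-separated field) and made-up urls.<host> file names:

# One output file per host; awk keeps the files open while it runs.
awk -F/ '{ print > ("urls." $3) }' urls.txt

# 5 concurrent scans per host; the hosts themselves run side by side.
for f in urls.*; do
  parallel -j5 ./scan < "$f" &
done
wait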

Or sort the URLs and insert a '\0' delimiter each time a new host appears; then split on '\0', strip the '\0', and pass each block to a new instance of parallel:

sort urls.txt | 
  perl -pe '(not m://$last:) and print "\0";m://([^/]+): and $last=$1' |
  parallel -j10 --pipe --rrs -N1 --recend '\0' parallel -j5 ./scan
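
For intuition, this is what the perl stage emits on a tiny invented input (tr makes the '\0' visible as '@'): the delimiter lands just before the first URL of each new host, so --recend '\0' hands the inner parallel one host per block, and --rrs strips the delimiter again.

printf 'http://a/1\nhttp://a/2\nhttp://b/1\n' |
  perl -pe '(not m://$last:) and print "\0";m://([^/]+): and $last=$1' |
  tr '\0' '@'
# http://a/1
# http://a/2
# @http://b/1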

Edit:

I think this will work:

cat urls.txt | parallel -q -j50 sem --fg --id '{= m://([^/]+):; $_=$1 =}' -j5 ./scan {}

sem is part of GNU Parallel (it is shorthand for parallel --semaphore). {= m://([^/]+):; $_=$1 =} grabs the hostname. -j5 tells sem to make a counting semaphore with 5 slots. --fg forces sem not to spawn the job in the background. By using the hostname as the ID you get a separate counting semaphore for each hostname.
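
As a hedged standalone illustration of the per-ID behaviour (the hostnames and the sleep stand-in are invented): each ID gets its own counting semaphore, so at most two of the a.example jobs below run at once, independently of b.example.

for host in a.example a.example a.example b.example; do
  sem --id "$host" -j2 sleep 2   # without --fg, sem backgrounds the job
done
sem --wait --id a.example        # block until all a.example jobs are done
sem --wait --id b.example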

-q is needed for parallel if some of your URLs contain special shell characters (such as &). They need to be protected from shell expansion, because sem will shell-expand them a second time.
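
A quick way to see the effect (the URL and the fixed --id are invented; echo stands in for ./scan): with -q the URL should come out intact, because the extra level of quoting survives the shell that sem spawns.

echo 'http://example.com/?a=1&b=2' |
  parallel -q sem --fg --id example.com -j5 echo {}
# expected: http://example.com/?a=1&b=2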
