Split the URLs into one file per host, then run 'parallel -j5' on each file.
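
A minimal sketch of the per-file approach, assuming plain http:// URLs (so the host is the third '/'-separated field) and made-up urls.<host> file names:

# One output file per host; awk keeps the files open while it runs.
awk -F/ '{ print > ("urls." $3) }' urls.txt

# 5 concurrent scans per host; the hosts themselves run side by side.
for f in urls.*; do
  parallel -j5 ./scan < "$f" &
done
wait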

Or sort the URLs and insert a '\0' delimiter each time a new host appears; then split on '\0', strip the '\0', and pass each block to a new instance of parallel:

sort urls.txt | 
  perl -pe '(not m://$last:) and print "\0";m://([^/]+): and $last=$1' |
  parallel -j10 --pipe --rrs -N1 --recend '\0' parallel -j5 ./scan
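
For intuition, this is what the perl stage emits on a tiny invented input (tr makes the '\0' visible as '@'): the delimiter lands just before the first URL of each new host, so --recend '\0' hands the inner parallel one host per block, and --rrs strips the delimiter again.

printf 'http://a/1\nhttp://a/2\nhttp://b/1\n' |
  perl -pe '(not m://$last:) and print "\0";m://([^/]+): and $last=$1' |
  tr '\0' '@'
# http://a/1
# http://a/2
# @http://b/1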

Edit:

I think this will work:

cat urls.txt | parallel -q -j50 sem --fg --id '{= m://([^/]+):; $_=$1 =}' -j5 ./scan {}

sem is part of GNU Parallel (it is shorthand for parallel --semaphore). {= m://([^/]+):; $_=$1 =} grabs the hostname. -j5 tells sem to make a counting semaphore with 5 slots. --fg forces sem not to spawn the job in the background. By using the hostname as the ID you get a separate counting semaphore for each hostname.
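
As a hedged standalone illustration of the per-ID behaviour (the hostnames and the sleep stand-in are invented): each ID gets its own counting semaphore, so at most two of the a.example jobs below run at once, independently of b.example.

for host in a.example a.example a.example b.example; do
  sem --id "$host" -j2 sleep 2   # without --fg, sem backgrounds the job
done
sem --wait --id a.example        # block until all a.example jobs are done
sem --wait --id b.example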

-q is needed for parallel if some of your URLs contain special shell characters (such as &). They need to be protected from shell expansion, because sem will shell-expand them a second time.
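
A quick way to see the effect (the URL and the fixed --id are invented; echo stands in for ./scan): with -q the URL should come out intact, because the extra level of quoting survives the shell that sem spawns.

echo 'http://example.com/?a=1&b=2' |
  parallel -q sem --fg --id example.com -j5 echo {}
# expected: http://example.com/?a=1&b=2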
