Revisions to Increase speed of Bash script which used grep into a while loop

deleted 68 characters in body


edited Mar 9, 2018 at 10:25


The percentage calculation can be reduced to a single operation like this:

 echo "$even" | awk '{x=gsub(/[AT]/,""); y=gsub(/[GC]/,""); printf "GC_CONT : %.2f%%\n", (y*100)/(x+y) }'

gsub substitutes a pattern and returns the number of substitutions it made, so it can be used to count the matching characters and quickly calculate the percentage.
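
As a rough illustration of the counting trick, here is a minimal sketch that wraps the idea in a shell function, counting A/T against G/C as the full script does. The function name gc_content and the sample sequence are made up for this example, and the guard for an empty line is an addition that is not part of the original command.

```shell
# gc_content is a hypothetical helper name, not from the original script.
gc_content() {
    printf '%s\n' "$1" | awk '{
        at = gsub(/[AT]/, "")   # number of A/T characters removed
        gc = gsub(/[GC]/, "")   # number of G/C characters removed
        total = at + gc
        if (total > 0)
            printf "GC_CONT : %.2f%%\n", (gc * 100) / total
        else
            print "GC_CONT : n/a"   # assumed fallback to avoid dividing by zero
    }'
}

gc_content "ATGCGC"
```

For the sample sequence ATGCGC, two characters match [AT] and four match [GC], so this should print GC_CONT : 66.67%.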

You could also process both the odd and even lines in awk. It is not clear what you are doing with the odd lines, but your complete function can be put in a single awk call:

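awk -F '_' -v Y="$Y" '{
    if (NR%2==1) {
        # odd lines: print the first five "_"-separated fields, then field 6 divided by Y
        printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n", $1, $2, $3, $4, $5, ($6 / Y)
    } else {
        # even lines: gsub returns how many characters it replaced
        x = gsub(/[AT]/, "")
        y = gsub(/[GC]/, "")
        printf "GC_CONT : %.2f%%\n", (y*100)/(x+y)
    }
}' large_file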

EDIT: Based on the OP's requirement, I changed the if block for odd lines. The gsub would remove the "cov." from the number. After passing the shell variable $Y to awk with -v, we can now divide and print in the required format.

Using a single awk script instead of running several commands for every line in a shell loop will significantly speed the operation up.
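
To try the single-pass approach on something small, the sketch below builds a tiny sample file and runs an equivalent awk program over it. The header layout (>NODE_1_length_8_cov_50.0), the function name summarise, and the value of Y are assumptions made for illustration only; the real input format is not shown in this answer.

```shell
# summarise is a hypothetical wrapper: $1 = input file, $2 = divisor Y.
summarise() {
    awk -F '_' -v Y="$2" '
        NR % 2 == 1 {
            # odd lines: assumed header with six "_"-separated fields
            printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n", $1, $2, $3, $4, $5, ($6 / Y)
            next
        }
        {
            # even lines: count A/T and G/C via the return value of gsub
            at = gsub(/[AT]/, "")
            gc = gsub(/[GC]/, "")
            total = at + gc
            if (total > 0)
                printf "GC_CONT : %.2f%%\n", (gc * 100) / total
            else
                print "GC_CONT : n/a"
        }' "$1"
}

# Made-up sample data: one header line followed by one sequence line.
sample=$(mktemp)
printf '%s\n' '>NODE_1_length_8_cov_50.0' 'ATGCGCGC' > "$sample"
summarise "$sample" 100
rm -f "$sample"
```

With this sample the header line should come out as ">NODE 1 length 8 cov", followed by "nucleotidic_cov : 0.5000" and "GC_CONT : 75.00%". The whole file is read once by one awk process, which is where the speed-up over a per-line shell loop comes from.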

improved answer based on op's clarification


edited Mar 9, 2018 at 5:35

amisax


The percentage calculation can be reduced to a single operation like this:

 echo "$even" | awk '{x=gsub(/[AT]/,""); y=gsub(/[GC]/,""); printf "GC_CONT : %.2f%%\n", (y*100)/(x+y) }'

gsub substitutes a pattern and returns the number of substitutions it made, so it can be used to count the matching characters and quickly calculate the percentage.

You could also process both the odd and even lines in awk. It is not clear what you are doing with the odd lines, but your complete function can be put in a single awk call:

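awk -v Y="$Y" '{
    if (NR%2==1) {
        # odd lines: strip "cov." from field 3, then divide the remaining number by Y
        gsub(/cov\./, "", $3)
        printf "%s %s\nnucleotidic_cov : %.4f\n", $1, $2, ($3 / Y)
    } else {
        # even lines: gsub returns how many characters it replaced
        x = gsub(/[AT]/, "")
        y = gsub(/[GC]/, "")
        printf "GC_CONT : %.2f%%\n", (y*100)/(x+y)
    }
}' large_file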

EDIT: Based on the OP's requirement, I changed the if block for odd lines. The gsub removes the "cov." from the number. After passing the shell variable $Y to awk with -v, we can now divide and print in the required format.

Using a single awk script instead of running several commands for every line in a shell loop will significantly speed the operation up.
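
The two steps in the EDIT note (strip "cov." from a field, then divide by a shell variable handed over with -v) can be tried in isolation. This is only a sketch: the function name normalise_cov and the input line "seq1 len100 cov.10" are invented for the example and are not taken from the OP's data.

```shell
# normalise_cov is a hypothetical helper: reads lines on stdin, $1 = divisor Y.
normalise_cov() {
    awk -v Y="$1" '{
        gsub(/cov\./, "", $3)   # field 3 "cov.10" becomes "10"
        printf "%s %s\nnucleotidic_cov : %.4f\n", $1, $2, ($3 / Y)
    }'
}

Y=4
printf '%s\n' 'seq1 len100 cov.10' | normalise_cov "$Y"
```

Here 10 divided by 4 gives 2.5, so the second output line should read "nucleotidic_cov : 2.5000". Passing the value with -v keeps the awk program inside single quotes, so the shell never has to interpolate anything into the awk source itself.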

based on OP's input


edited Feb 27, 2018 at 11:06

amisax


The percentage calculation can be reduced to a single operation like this:

 echo "$even" | awk '{x=gsub(/[AT]/,""); y=gsub(/[GC]/,""); printf "GC_CONT : %.2f%%\n", (y*100)/(x+y) }'

gsub substitutes a pattern and returns the number of substitutions it made, so it can be used to count the matching characters and quickly calculate the percentage.

You could also process both the odd and even lines in awk. It is not clear what you are doing with the odd lines, but your complete function can be put in a single awk call:

awk '{ if(NR%2==1) {
          print $1
       } else {
           x=gsub(/[AB]/,""); 
           y=gsub(/C/,""); 
           printf "GC_CONT : %.2f%%\n", (y*100)/(x+y)
       }
    }' large_file

The percentage calculation can be reduced to a single operation like this:

 echo "$even" | awk '{x=gsub(/[AT]/,""); y=gsub(/[GC]/,""); printf "GC_CONT : %.2f%%\n", (y*100)/(x+y) }'

gsub substitutes a pattern and returns the number of substitutions it made, so it can be used to count the matching characters and quickly calculate the percentage.

You could also process both the odd and even lines in awk. It is not clear what you are doing with the odd lines, but your complete function can be put in a single awk call:

awk '{ if(NR%2==1) {
         gsub(/cov\./,"nucleotidic_cov : ",$3);
         printf "%s %s\n%s\n",$1,$2,$3
       } else {
           x=gsub(/[AT]/,""); 
           y=gsub(/[GC]/,""); 
           printf "GC_CONT : %.2f%%\n", (y*100)/(x+y)
       }
    }' large_file


answered Feb 27, 2018 at 10:09

amisax
