1

got this script to work against a file, composed by lots of line (>500Mb) with this scheme:

odd lines: >BLA_BLA lenght_XX cov.XX
even lines: AGCAGCAGACTCAGACTACAGAT  # on even lines there's a DNA sequence

Its function is to recalc value after "cov." using parameters passed by arguments and replace the older one and calc the percent amount of "G" and "C" into the DNA seq, in even lines.

So, output looks like:

> BLA_BLA lenght_XX
> nucleotidic_cov XX
> DNA seq (the same of even lines)
> GC_CONT: XX

Here's the code (only the loop):

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
    echo -n "${odd##}" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: " 
    echo "scale=4;${odd##*_} / $Y" | bc 
    read even
    echo "${even##}" &&
    ACOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "A")  
    GCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "G")
    CCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "C")
    TCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "T")
    TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
    GCCONT=$(($GCOUNT+$CCOUNT))
    printf "GC_CONT: " 
    echo "scale=2;$GCCONT / $TOTALBASES *100" | bc  
done < "$1"

It's incredibly slow when runs against huge text file (bigger than 500Mb) on a 16 core server. Any idea on how to increase speed of this script?

EDIT

As requested, desidered I/O provided via pastebin: https://pastebin.com/FY0Z7kUW

4
  • 1
    see unix.stackexchange.com/questions/169716/… Commented Feb 27, 2018 at 9:22
  • adding some sample lines(say 5-10) along with expected output would help in suggesting an alternate solution.. also, if this is bioinformatics, see bioinformatics.stackexchange.com Commented Feb 27, 2018 at 9:24
  • Update quest. I wrote it in a wrong way. Commented Feb 27, 2018 at 9:34
  • given sample is not clear.. please do not add any character not actually present in your input... post few lines (to represent different cases) exactly as in your input and post exact output required... and then try to explain how the transformation is done.. Commented Feb 27, 2018 at 9:44

4 Answers 4

3

You’ve reached (to put it mildly) the limits of what can be reasonably done in the shell — you should re-write your script in something like AWK, or Perl, or Python. Using a more advanced language like those will avoid having to run multiple processes for all your text processing; you’ll be able to do it using built-in functions.

3
  • I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python? Commented Feb 27, 2018 at 9:37
  • Text Processing in Python is an old book covering the topic, but you’d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl. Commented Feb 27, 2018 at 9:55
  • This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process. Commented Mar 9, 2018 at 8:24
2

The percentage calculation can be reduced to a single operation like this

 echo "${even##}" | awk '{x=gsub(/[ACT]/,""); y=gsub(/G/,""); printf "GC_CONT : %.2f%%\b", (y*100)/(x+y) }'

gsub substitutes a pattern and return the count of substitutions it has made. So that can be used to quickly calculate the percentage.

You could also process the odd and even lines in awk. It is not clear what you are doing with odd lines but your complete function can be put in a single awk -

awk -F '_' -v Y="$Y" '{ if(NR%2==1) {
    printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n",$1,$2,$3,$4,$5, ($6 / Y)
} else {
    x=gsub(/[AT]/,""); 
    y=gsub(/[GC]/,""); 
    printf "GC_CONT : %.2f%%\n", (y*100)/(x+y)
    }
 }' large_file

EDIT : Based on OP's requirement changed the if block for odd lines. The gsub would remove the "cov." from the number. After passing the shell variable $Y to awk , we can now divide and print in the required format.

Using a single awk script instead of multiple operations will significantly speed the operation up.

10
  • In odd lines the value after "cov." is used to calc a new value, called "nucleotidic_cov" which is printed into a new line. Commented Feb 27, 2018 at 10:52
  • I have updated based on your explanation Commented Feb 27, 2018 at 11:07
  • can't figure out where's the split, between the "cov." value and the variable called Y. Please improve, using only awk the script goes so fast. Commented Mar 2, 2018 at 15:39
  • @Shred in if block we did modified $3 cov\. with nucleotidic_cov. $3 in the line is cov.XX Commented Mar 8, 2018 at 10:48
  • Perhaps no division is made. cov. value needs to be divided by $y , as in script provided, in echo "scale=4;${odd##*_} / $Y" | bc Please improve, and this will be the best answer, cuz is what i need to have. Commented Mar 8, 2018 at 16:12
1

The number of cores matters little if your program isn't parallelized (much).

You could use wc and tr rather than sed and grep, which might speed things up a bit:

ACOUNT=$(echo "${even##}" | tr -d [^A] | wc -m)

But really, I think the major problem is that shell, while an easy thing to program in for quick-and-dirty jobs, is just not the right tool for the job when it comes to raw processing power. I would suggest a more sophisticated programming language, like Perl or Python, which also have threading abilities (thereby allowing you to use all your cores).

You could do it in perl somewhat like this:

#!/usr/bin/perl -w
use strict;
use warnings;

my $y = ...;                              # calculate your Y value here
while(my $odd = <ARGV>) {                 # Read a line from the file(s) passed
                                          # on the command line
    chomp $odd;                           # lose the newline
    my @split = split /_/, $odd;          # split the read line on a "_" boundary
                                          # into an array
    print join("_", @split[0..3]) . "\n"; # print the first four elements of the
                                          # array, separated by "_"
    print $split[$#split] / $y . "\n";    # Treat the final element of the
                                          # @split array as a number, divide it
                                          # by $y, and output the result
    my %charcount = (                     # Initialize a hash table
        A => 0,
        G => 0,
        C => 0,
        T => 0
    );
    my $even = <ARGV>;                    # read the even line
    chomp $even;
    foreach my $char(split //,$even) {    # split the string into separate
                                          # characters, and loop over them
        $charcount{$char}++;              # Count the correct character
    }
    my $total = $charcount{A} + $charcount{G} + $charcount{C} + $charcount{T};
    my $gc = $charcount{G} + $charcount{C};
    my $perc = $gc / $total;
    print "GC_CONT: $perc\n";             # Do our final calculations and
                                          # output the result
}

Note: not tested (beyond "does perl accept this code")

If you want to learn more about perl, run perldoc perlintro and get started ;-)

5
  • Already switched from using tr, which slows the work more than now. Commented Feb 27, 2018 at 9:37
  • Found some error in code while running. Btw, thanks a lot, still a good way to do some research on. Commented Feb 28, 2018 at 13:32
  • What error? Probably should update it then :-) Commented Mar 1, 2018 at 12:01
  • Here's the console log. pastebin.com/uXZW0802 Commented Mar 2, 2018 at 10:57
  • So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked. Commented Mar 5, 2018 at 10:22
0

You are reading a long file line by line and executing multiple commands on each iteration. The main problem you are facing is latency of running those calculations and reading very small chunks of the file at a time.

Stephen Kitt's answer is good, you want to rewrite this in a higher level language in which you can cache file contents and run your string operations much more efficiently.

If you want to rule out the performance of the storage and file system, you can load the file from RAM using:

# mkdir /mnt/tmpfs
# mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
# cp <input_file> /tmp/tmpfs
# <script> /tmp/tmpfs/<input_file>

This should make the process faster as much as you are I/O constrained. But it will never be as good as it could be if rewritten in C or ruby or python.

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.