\$\begingroup\$

I have a file of the following form:

>SDF123.1 blah blah

ATCTCTGGAAACTCGGTGAAAGAGAGTAT

AGTGATGAGGATGAGTGAG...

>SBF123.1 blah blah

ATCTCTGGAAACTCGGTGAAAGAGAGTAT

AGTGATGAGGATGAGTGAG....

I want to extract the sections of this file into individual files (like here).

I wrote the following code, but it runs much more slowly than it did before I added the close command. I had to add close because, without it, awk failed with a "too many open files" error.

Here is the code:

cat C1_animal.fasta | awk -F ' ' '{
        if (substr($0, 1, 1)==">") {filename=(substr($1,2) ".fa")}
        print $0 >> filename; close (filename)
}'

How can I make this code more time efficient? I am new to awk.

\$\endgroup\$

1 Answer

\$\begingroup\$

Close the output file only when necessary, that is, when a new header line switches to a new file, rather than after every line:

File actg.awk

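BEGIN {
    FS = " "    # split fields on runs of blanks (this is also awk's default)
}
# Header line (starts with ">"): switch to a new output file.
/^>/ {
    if (filename != "") {
        close(filename)    # close the previous file so only one is open at a time
    }
    # File name = first field without the leading ">", plus ".fa"
    filename = substr($1, 2) ".fa"
    next    # skip the header itself; remove this line to write it to the file too
}
# Any other line: write it to the current file once a header has been seen.
filename != "" {
    print $0 > filename
}
END {
    close(filename)    # close the last file
}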

With shell command:

awk -f actg.awk C1_animal.fasta

Note: if you are sure no line precedes the first ">" header line, you can drop the filename != "" test.
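To see how the pieces fit together, here is a minimal, self-contained sketch that can be pasted into a shell. The sample data, the fasta_demo directory, and the outdir variable are assumptions made up for illustration; they are not part of the original file or script. Unlike the script above, this variant has no next in the header rule, so the ">" line is also written to each output file, which is what the code in the question did.

```shell
#!/bin/sh
# Hypothetical demo: fasta_demo, sample.fasta and outdir are illustrative names.
mkdir -p fasta_demo

# Tiny stand-in for C1_animal.fasta: two records, two sequence lines each.
cat > fasta_demo/sample.fasta <<'EOF'
>SDF123.1 blah blah
ATCTCTGG
AGTGATGA
>SBF123.1 blah blah
GGGTTTAA
CCCAAATT
EOF

awk -v outdir=fasta_demo '
/^>/ {
    # New record: close the previous output file, if there is one.
    if (filename != "") close(filename)
    # File name = first field without the leading ">", plus ".fa".
    filename = outdir "/" substr($1, 2) ".fa"
}
# No "next" above, so the header line falls through and is written too.
filename != "" { print > filename }
' fasta_demo/sample.fasta

# List the result: one .fa file per header, next to the sample input.
ls fasta_demo
```

Because each file is closed only when the next header arrives, awk holds a single output file open at any time, which avoids both the "too many open files" error and the cost of reopening the file for every line. One caveat of this sketch: the > redirection truncates a file when it is first opened, so if the same identifier appeared in two separate records, the later one would overwrite the earlier one.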

\$\endgroup\$
  • \$\begingroup\$ Thank you, this code worked nicely and was much faster. Could you explain a little how this code works? I am still trying to learn awk. \$\endgroup\$ Commented Sep 6, 2021 at 5:54
