awk split a file into multiple files with specific names in another index file

Question

I have a cluster fasta file (called file) which looks like:

>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO
>7KZL
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
>6GH3
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

the file has 4 groups: 1AB2, 2AC6, 7KZL, and 6GH3. the content during the first >1AB2 and the first >2AC6 belongs to the cluster 1AB2. the content during the first >2AC6 and the first >7KZL belongs to the cluster 2AC6.

I want to separate the file into 4 files at the second >XXXX with specific names in this index file (ind.txt):

HG001 1AB2
HG010 2AC6
HG023 7KZL
HG004 6GH3

The result file should look like:

HG001.fa

>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN

HG010.fa

>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO

HG023.fa

>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

HG004.fa

>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

I tried to use

awk '/^>/ && NF==1; NR==FNR{a[$2]=$1} (substr($1,2) in a) {close(out); out="cluster/"a[substr($1,2)]".fa"}  {print > out}' ind.txt file

but it didn't work, and I couldn't find the solution to the error.

Do you really have those empty sequences, headers with no sequence data? For example, the first line of your file is just >1AB2 and the second line is >1AB2 AA, so there is no sequence associated with 1AB2. The same goes for >2AC6 and various others. Is that a mistake? If not, how should we handle those? — terdon
– terdon ♦, Commented Jun 1, 2022 at 15:22
yes, that is a cluster file made by mmseqs. it means 1AB2, 1AB3, 1SC4,2CD5 belongs to the cluster 1AB2, and 1AB2 is the representative in 4 sequences. — Jiao
– Jiao, Commented Jun 1, 2022 at 15:26
Also a note regarding the try presented: You always have to put first the code for parsing the first file, when you parse two files with awk, and always use next at the end of the first file processing: FNR==NR{...; next}. This try is mixing the previous answer with the first file parsing, leading to unspecified behaviour. — thanasisp
– thanasisp, Commented Jun 2, 2022 at 12:40
means next line of input! So for every line of the first file, next ensures that the rest of the code is not executed. When reading the second file FNR != NR so only the rest of the code is executed. — thanasisp
– thanasisp, Commented Jun 2, 2022 at 13:20

thanasisp · Accepted Answer · 2022-06-01 15:59:34Z

mkdir -p cluster &&
awk 'NR==FNR  {map[">"$2]="cluster/"$1".fa"; next}
     /^>/ && NF==1 {close(out); out=map[$0]; next} 
     out != "" {print > out}
' ind.txt file

The first condition-action (NR==FNR), is parsing the index file, to create the filenames and store them to an array, where the headers of the second file are the hashes.

When a header is found (/^>/ && NF==1), we define the output filename to use.

For any other line we print to the selected filename. Also I added a condition for not printing to a file "cluster/.fa" if ther is no mapping for this header.

Testing with sample inputs created these files:

$ head cluster/*.fa
==> cluster/HG001.fa <==
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA

==> cluster/HG004.fa <==
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

==> cluster/HG010.fa <==
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP

==> cluster/HG023.fa <==
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

Stack Exchange Network

awk split a file into multiple files with specific names in another index file

1 Answer 1

You must log in to answer this question.

Linked

Hot Network Questions

awk split a file into multiple files with specific names in another index file

1 Answer 1

You must log in to answer this question.

Linked

Related

Hot Network Questions