0

I downloaded a tab-separated file from here:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Nsyl
#!genome-build-accession NCBI_Assembly:GCF_000393655.1
#!annotation-source NCBI Nicotiana sylvestris Annotation Release 100
##sequence-region NW_009338801.1 1 504
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338801.1  RefSeq  region  1       504     .       +       .       ID=NW_009338801.1:1..504;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chrom
osome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
##sequence-region NW_009338802.1 1 9484
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338802.1  RefSeq  region  1       9484    .       +       .       ID=NW_009338802.1:1..9484;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chro
mosome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
##sequence-region NW_009338803.1 1 7523
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338803.1  RefSeq  region  1       7523    .       +       .       ID=NW_009338803.1:1..7523;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chro
mosome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
##sequence-region NW_009338804.1 1 46372
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338804.1  RefSeq  region  1       46372   .       +       .       ID=NW_009338804.1:1..46372;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chr
omosome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
NW_009338804.1  Gnomon  pseudogene      32822   34172   .       -       .       ID=gene-LOC104209938;Dbxref=GeneID:104209938;Name=LOC104209938;gbkey=Gene;gene
=LOC104209938;gene_biotype=pseudogene;pseudo=true
NW_009338804.1  Gnomon  exon    32822   34172   .       -       .       ID=id-LOC104209938-1;Parent=gene-LOC104209938;Dbxref=GeneID:104209938;exon_number=1;gb
key=exon;gene=LOC104209938;model_evidence=Supporting evidence includes similarity to: 2 Proteins;number=1
##sequence-region NW_009338805.1 1 53328
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338805.1  RefSeq  region  1       53328   .       +       .       ID=NW_009338805.1:1..53328;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chr
omosome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
NW_009338805.1  Gnomon  gene    10570   12535   .       -       .       ID=gene-LOC104217587;Dbxref=GeneID:104217587;Name=LOC104217587;gbkey=Gene;gene=LOC1042
17587;gene_biotype=protein_coding
NW_009338805.1  Gnomon  mRNA    10570   12535   .       -       .       ID=rna-XM_009770987.1;Parent=gene-LOC104217587;Dbxref=GeneID:104217587,Genbank:XM_0097
70987.1;Name=XM_009770987.1;gbkey=mRNA;gene=LOC104217587;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic f
eature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=ribosome-interacting GTPase 1-like;transcript_id=XM_00977098
7.1
NW_009338805.1  Gnomon  exon    12140   12535   .       -       .       ID=exon-XM_009770987.1-1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_
009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    11826   11939   .       -       .       ID=exon-XM_009770987.1-2;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    11521   11695   .       -       .       ID=exon-XM_009770987.1-3;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    10570   10889   .       -       .       ID=exon-XM_009770987.1-4;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  CDS     12140   12154   .       -       0       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1  Gnomon  CDS     11826   11939   .       -       0       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1  Gnomon  CDS     11521   11695   .       -       0       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1  Gnomon  CDS     10813   10889   .       -       2       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
...

The following command failed to create multiple files for different column names e.g. NW_009592716.1.lst, NC_007500.1.lst, ....

cat GCF_000393655.1_Nsyl_genomic.gff |awk '$3=="CDS"' |
    sed 's/;/\t/g' |
    awk '{print $1,$7,$12}' |
    sed 's/Name=//g' |
    awk 'substr($3,11,11)==1 {print $3$2,$1}'  |
    sort |
    uniq |
    awk '{print >> $2 ".lst"; close($2)}'

it only creates one file instead of multiple files:

$ head NC_007500.1.lst
YP_358649.1- NC_007500.1
YP_358650.1+ NC_007500.1
YP_358650.1- NC_007500.1
YP_358651.1- NC_007500.1
YP_358652.1- NC_007500.1
YP_358653.1- NC_007500.1
YP_358654.1+ NC_007500.1
YP_358655.1+ NC_007500.1
YP_358656.1+ NC_007500.1
YP_358657.1- NC_007500.1
...

How is it possible to make the above command tolerant to different string length?

Thank you in advance,

8
  • 2
    What does your input data look like? Commented Jul 6, 2020 at 6:54
  • 3
    I bet all this could be done in a single awk program, just edit your question with input data and expected filename. Commented Jul 6, 2020 at 6:59
  • It is going to run like a dog if you append/close the output file for every line. Awk will write to at least 1000 output files concurrently, and close them automatically at end. In fact, your close() is invalid anyway: the filename is different to the one in the print. Commented Jul 6, 2020 at 9:48
  • You never need sed when you're using awk and you definitely don't need a pipleine of multiple calls to sed + awk and a UUOC to do anything. For example cat GCF_000393655.1_Nsyl_genomic.gff | awk '$3=="CDS"' | sed 's/;/\t/g' | awk '{print $1,$7,$12}' | sed 's/Name=//g' can be written as just awk '$3=="CDS"{gsub(/;/,"\t"); $0=$1 OFS $7 OFS $12; gsub(/Name/,""); print}' GCF_000393655.1_Nsyl_genomic.gff. If you edit your question to include concise, testable sample input and expected output then we can help you do whatever it is you're trying to do. Commented Jul 6, 2020 at 13:53
  • I updated the question and included the link to the input file and examples for filenames. Commented Jul 9, 2020 at 4:22

3 Answers 3

1

As mentioned in the comments, it should be possible to implement your entire chain of AWK and sed invocations as a single AWK program.

To answer your stated question, to check whether the last character of the third field is “1”, you can use

$3 ~ /1$/

instead of substr, so in your case

$3 ~ /1$/ {print $3$2,$1}
0

I come up with

$3 == "CDS" && $1 ~ /1$/ {
        split($9,A,";") ;
        B=substr(A[4],6) ;
        V[B $7] = $1 ;
}
END {
        for (u in V) {
                print u  >> V[u] ;
                close(V[u]) ;
        }
}

resulting in 17042 files.

  • $3 == "CDS" && $1 ~ /1$/ for awk '$3=="CDS"' and awk 'substr($3,11,11)==1
  • split($9,A,";") ; for sed 's/;/\t/g' and awk '{print $1,$7,$12}'
  • B=substr(A[4],6) ; for sed 's/Name=//g'
  • V[B $7] = $1 ; for sort and uniq

To execute the script, insert code in filter.awk, then

awk -f filter.awk file_to_parse
2
  • Thank you, how do I execute your script? Commented Jul 10, 2020 at 13:20
  • Thank you it works, but how would it be possible to each file an extension .lst? Commented Jul 13, 2020 at 13:40
0

Using gawk and a variable tgt to define the name of the field you want to select out....

awk -F"[\t;:,=]" -v tgt="Genbank" '$3=="CDS"{
   for (f=4; f<=NF; f++) if ($f ~ tgt) {
     if ( $(f+1) ~ /\.1$/ ) out[$(f+1)$7" "$1]=$1".lst"}}
   END{PROCINFO["sorted_in"]="@ind_num_asc"; 
      for (o in out) print o > out[o]}' GCF_000393655.1_Nsyl_genomic.gff

tail *.lst 

==> NW_009592652.1.lst <==
XP_009779696.1- NW_009592652.1

==> NW_009592685.1.lst <==
XP_009779697.1+ NW_009592685.1
XP_009779699.1- NW_009592685.1

==> NW_009592688.1.lst <==
XP_009779700.1+ NW_009592688.1
XP_009779701.1+ NW_009592688.1
XP_009779702.1+ NW_009592688.1

==> NW_009592716.1.lst <==
XP_009779703.1+ NW_009592716.1

Repeat for tgt="Parent" and input from test.gff3 as per comments

tail *.lst

==> NbV1Ch18.lst <==
NBlab18G26040.1+ NbV1Ch18
NBlab18G26050.1- NbV1Ch18
NBlab18G26060.1+ NbV1Ch18
NBlab18G26070.1+ NbV1Ch18
NBlab18G26080.1- NbV1Ch18
NBlab18G26090.1- NbV1Ch18
NBlab18G26100.1- NbV1Ch18
NBlab18G26110.1- NbV1Ch18
NBlab18G26120.1+ NbV1Ch18
NBlab18G26130.1+ NbV1Ch18

==> NbV1Ch19.lst <==
NBlab19G29030.1+ NbV1Ch19
NBlab19G29040.1- NbV1Ch19
NBlab19G29050.1- NbV1Ch19
NBlab19G29060.1- NbV1Ch19
NBlab19G29070.1+ NbV1Ch19
NBlab19G29080.1+ NbV1Ch19
NBlab19G29090.1- NbV1Ch19
NBlab19G29100.1- NbV1Ch19
NBlab19G29110.1- NbV1Ch19
NBlab19G29120.1- NbV1Ch19

Walkthrough

Select your desired field as tgt and initially select out records on CDS

awk -F"[\t;:,=]" -v tgt="Genbank" '$3=="CDS"{

Iterate over the fields until you find tgt

   for (f=4; f<=NF; f++) if ($f ~ tgt) {

Check to see if your target field value $(f+1) ends in .1 and if it does then store your formatted output in the out array with a value of the filename you want it to go to

     if ( $(f+1) ~ /\.1$/ ) out[$(f+1)$7" "$1]=$1".lst"}}

When done then set up awk to iterate over your output array based upon a numeric ascending sort of the indices in the array

   END{PROCINFO["sorted_in"]="@ind_num_asc"; 

Then just iterate over your array printing the indices you want o into the respective files out[o]

      for (o in out) print o > out[o]}' GCF_000393655.1_Nsyl_genomic.gff
5
  • Thank you, it works with the above file but when I tried this file it did not create any files. Any ideas? Commented Jul 13, 2020 at 14:17
  • Yip. Your data structure differs. in the first file you were selecting from the key/value pair in GENBANK but this does not exist with CDS in the second file, which only has only ID and PARENT keys. I suggest you take a look at the original intent and if you intended to select out on PARENT (which look to be the same in the first gff file) then just change $12 to $9 in the first code sample. Commented Jul 13, 2020 at 16:54
  • Thank you, Your orginal code with GCF_000393655.1_Nsyl_genomic.gff produced the following output. ``` $ cat NW_00935579.1.lst XP_009802589.1+ NW_00935579.1 ``` The 2 new versions: awk -F"[\t;]" '$3=="CDS"&&$9~/1$/{ split($12,arr,"="); out[arr[2]$7" "$1]=$1".lst"} END{PROCINFO["sorted_in"]="@ind_num_asc"; for (o in out) print o > out[o]}' t.gff3 awk -F"[\t;=]" '$3=="CDS"&&$12~/1$/{ out[$16$7" "$1]=$1".lst"} END{PROCINFO["sorted_in"]="@ind_num_asc"; for (o in out) print o > out[o]}' t.gff3 produced only: ``` $ cat NbV1Ch06.lst + NbV1Ch06 ``` Commented Jul 14, 2020 at 3:11
  • @user977828 The new versions wont work with the old file because they have different data structures If you post a clarification saying which of the key/value pairs it is that you want to extract from the files then a "one stop" fix may be possible, but until you do..... we are just guessing. Please clarify your original post. Commented Jul 14, 2020 at 12:13
  • I am sorry and I understand the problem. For this file. I would need the first column and the Parent value. Which get reversed in the output files. Commented Jul 15, 2020 at 9:52

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.