awk's substr replacement

Question

I downloaded a tab-separated file from here:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Nsyl
#!genome-build-accession NCBI_Assembly:GCF_000393655.1
#!annotation-source NCBI Nicotiana sylvestris Annotation Release 100
##sequence-region NW_009338801.1 1 504
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338801.1  RefSeq  region  1       504     .       +       .       ID=NW_009338801.1:1..504;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chrom
osome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
##sequence-region NW_009338802.1 1 9484
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338802.1  RefSeq  region  1       9484    .       +       .       ID=NW_009338802.1:1..9484;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chro
mosome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
##sequence-region NW_009338803.1 1 7523
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338803.1  RefSeq  region  1       7523    .       +       .       ID=NW_009338803.1:1..7523;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chro
mosome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
##sequence-region NW_009338804.1 1 46372
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338804.1  RefSeq  region  1       46372   .       +       .       ID=NW_009338804.1:1..46372;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chr
omosome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
NW_009338804.1  Gnomon  pseudogene      32822   34172   .       -       .       ID=gene-LOC104209938;Dbxref=GeneID:104209938;Name=LOC104209938;gbkey=Gene;gene
=LOC104209938;gene_biotype=pseudogene;pseudo=true
NW_009338804.1  Gnomon  exon    32822   34172   .       -       .       ID=id-LOC104209938-1;Parent=gene-LOC104209938;Dbxref=GeneID:104209938;exon_number=1;gb
key=exon;gene=LOC104209938;model_evidence=Supporting evidence includes similarity to: 2 Proteins;number=1
##sequence-region NW_009338805.1 1 53328
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4096
NW_009338805.1  RefSeq  region  1       53328   .       +       .       ID=NW_009338805.1:1..53328;Dbxref=taxon:4096;Name=Unknown;bio-material=USDA:TW 136;chr
omosome=Unknown;gbkey=Src;genome=genomic;mol_type=genomic DNA;tissue-type=leaf
NW_009338805.1  Gnomon  gene    10570   12535   .       -       .       ID=gene-LOC104217587;Dbxref=GeneID:104217587;Name=LOC104217587;gbkey=Gene;gene=LOC1042
17587;gene_biotype=protein_coding
NW_009338805.1  Gnomon  mRNA    10570   12535   .       -       .       ID=rna-XM_009770987.1;Parent=gene-LOC104217587;Dbxref=GeneID:104217587,Genbank:XM_0097
70987.1;Name=XM_009770987.1;gbkey=mRNA;gene=LOC104217587;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic f
eature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=ribosome-interacting GTPase 1-like;transcript_id=XM_00977098
7.1
NW_009338805.1  Gnomon  exon    12140   12535   .       -       .       ID=exon-XM_009770987.1-1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_
009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    11826   11939   .       -       .       ID=exon-XM_009770987.1-2;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    11521   11695   .       -       .       ID=exon-XM_009770987.1-3;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    10570   10889   .       -       .       ID=exon-XM_009770987.1-4;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  CDS     12140   12154   .       -       0       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1  Gnomon  CDS     11826   11939   .       -       0       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1  Gnomon  CDS     11521   11695   .       -       0       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1  Gnomon  CDS     10813   10889   .       -       2       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
...

The following command failed to create multiple files for different column names e.g. NW_009592716.1.lst, NC_007500.1.lst, ....

cat GCF_000393655.1_Nsyl_genomic.gff |awk '$3=="CDS"' |
    sed 's/;/\t/g' |
    awk '{print $1,$7,$12}' |
    sed 's/Name=//g' |
    awk 'substr($3,11,11)==1 {print $3$2,$1}'  |
    sort |
    uniq |
    awk '{print >> $2 ".lst"; close($2)}'

it only creates one file instead of multiple files:

$ head NC_007500.1.lst
YP_358649.1- NC_007500.1
YP_358650.1+ NC_007500.1
YP_358650.1- NC_007500.1
YP_358651.1- NC_007500.1
YP_358652.1- NC_007500.1
YP_358653.1- NC_007500.1
YP_358654.1+ NC_007500.1
YP_358655.1+ NC_007500.1
YP_358656.1+ NC_007500.1
YP_358657.1- NC_007500.1
...

How is it possible to make the above command tolerant to different string length?

Thank you in advance,

I bet all this could be done in a single awk program, just edit your question with input data and expected filename. — Archemar
– Archemar, Commented Jul 6, 2020 at 6:59
It is going to run like a dog if you append/close the output file for every line. Awk will write to at least 1000 output files concurrently, and close them automatically at end. In fact, your close() is invalid anyway: the filename is different to the one in the print. — Paul_Pedant
– Paul_Pedant, Commented Jul 6, 2020 at 9:48
You never need sed when you're using awk and you definitely don't need a pipleine of multiple calls to sed + awk and a UUOC to do anything. For example cat GCF_000393655.1_Nsyl_genomic.gff | awk '$3=="CDS"' | sed 's/;/\t/g' | awk '{print $1,$7,$12}' | sed 's/Name=//g' can be written as just awk '$3=="CDS"{gsub(/;/,"\t"); $0=$1 OFS $7 OFS $12; gsub(/Name/,""); print}' GCF_000393655.1_Nsyl_genomic.gff. If you edit your question to include concise, testable sample input and expected output then we can help you do whatever it is you're trying to do. — Ed Morton
– Ed Morton, Commented Jul 6, 2020 at 13:53
I updated the question and included the link to the input file and examples for filenames. — user977828
– user977828, Commented Jul 9, 2020 at 4:22

Stephen Kitt · Accepted Answer · 2020-07-06 07:15:45Z

1

As mentioned in the comments, it should be possible to implement your entire chain of AWK and sed invocations as a single AWK program.

To answer your stated question, to check whether the last character of the third field is “1”, you can use

$3 ~ /1$/

instead of substr, so in your case

$3 ~ /1$/ {print $3$2,$1}

answered Jul 6, 2020 at 7:15

Stephen Kitt

481k59 gold badges1.2k silver badges1.4k bronze badges

Add a comment |

Archemar · Accepted Answer · 2020-07-10 13:25:52Z

0

I come up with

$3 == "CDS" && $1 ~ /1$/ {
        split($9,A,";") ;
        B=substr(A[4],6) ;
        V[B $7] = $1 ;
}
END {
        for (u in V) {
                print u  >> V[u] ;
                close(V[u]) ;
        }
}

resulting in 17042 files.

$3 == "CDS" && $1 ~ /1$/ for awk '$3=="CDS"' and awk 'substr($3,11,11)==1
split($9,A,";") ; for sed 's/;/\t/g' and awk '{print $1,$7,$12}'
B=substr(A[4],6) ; for sed 's/Name=//g'
V[B $7] = $1 ; for sort and uniq

To execute the script, insert code in filter.awk, then

awk -f filter.awk file_to_parse

edited Jul 10, 2020 at 13:25

answered Jul 9, 2020 at 5:38

Archemar

32.3k18 gold badges75 silver badges107 bronze badges

Thank you, how do I execute your script?

user977828
– user977828

2020-07-10 13:20:57 +00:00
Commented Jul 10, 2020 at 13:20
Thank you it works, but how would it be possible to each file an extension .lst?

user977828
– user977828

2020-07-13 13:40:33 +00:00
Commented Jul 13, 2020 at 13:40

Add a comment |

bu5hman · Accepted Answer · 2020-07-17 07:51:42Z

0

Using gawk and a variable tgt to define the name of the field you want to select out....

awk -F"[\t;:,=]" -v tgt="Genbank" '$3=="CDS"{
   for (f=4; f<=NF; f++) if ($f ~ tgt) {
     if ( $(f+1) ~ /\.1$/ ) out[$(f+1)$7" "$1]=$1".lst"}}
   END{PROCINFO["sorted_in"]="@ind_num_asc"; 
      for (o in out) print o > out[o]}' GCF_000393655.1_Nsyl_genomic.gff

tail *.lst 

==> NW_009592652.1.lst <==
XP_009779696.1- NW_009592652.1

==> NW_009592685.1.lst <==
XP_009779697.1+ NW_009592685.1
XP_009779699.1- NW_009592685.1

==> NW_009592688.1.lst <==
XP_009779700.1+ NW_009592688.1
XP_009779701.1+ NW_009592688.1
XP_009779702.1+ NW_009592688.1

==> NW_009592716.1.lst <==
XP_009779703.1+ NW_009592716.1

Repeat for tgt="Parent" and input from test.gff3 as per comments

tail *.lst

==> NbV1Ch18.lst <==
NBlab18G26040.1+ NbV1Ch18
NBlab18G26050.1- NbV1Ch18
NBlab18G26060.1+ NbV1Ch18
NBlab18G26070.1+ NbV1Ch18
NBlab18G26080.1- NbV1Ch18
NBlab18G26090.1- NbV1Ch18
NBlab18G26100.1- NbV1Ch18
NBlab18G26110.1- NbV1Ch18
NBlab18G26120.1+ NbV1Ch18
NBlab18G26130.1+ NbV1Ch18

==> NbV1Ch19.lst <==
NBlab19G29030.1+ NbV1Ch19
NBlab19G29040.1- NbV1Ch19
NBlab19G29050.1- NbV1Ch19
NBlab19G29060.1- NbV1Ch19
NBlab19G29070.1+ NbV1Ch19
NBlab19G29080.1+ NbV1Ch19
NBlab19G29090.1- NbV1Ch19
NBlab19G29100.1- NbV1Ch19
NBlab19G29110.1- NbV1Ch19
NBlab19G29120.1- NbV1Ch19

Walkthrough

Select your desired field as tgt and initially select out records on CDS

awk -F"[\t;:,=]" -v tgt="Genbank" '$3=="CDS"{

Iterate over the fields until you find tgt

   for (f=4; f<=NF; f++) if ($f ~ tgt) {

Check to see if your target field value $(f+1) ends in .1 and if it does then store your formatted output in the out array with a value of the filename you want it to go to

     if ( $(f+1) ~ /\.1$/ ) out[$(f+1)$7" "$1]=$1".lst"}}

When done then set up awk to iterate over your output array based upon a numeric ascending sort of the indices in the array

   END{PROCINFO["sorted_in"]="@ind_num_asc";

Then just iterate over your array printing the indices you want o into the respective files out[o]

      for (o in out) print o > out[o]}' GCF_000393655.1_Nsyl_genomic.gff

edited Jul 17, 2020 at 7:51

answered Jul 10, 2020 at 18:25

bu5hman

4,8512 gold badges16 silver badges29 bronze badges

Thank you, it works with the above file but when I tried this file it did not create any files. Any ideas?

user977828
– user977828

2020-07-13 14:17:51 +00:00
Commented Jul 13, 2020 at 14:17
Yip. Your data structure differs. in the first file you were selecting from the key/value pair in GENBANK but this does not exist with CDS in the second file, which only has only ID and PARENT keys. I suggest you take a look at the original intent and if you intended to select out on PARENT (which look to be the same in the first gff file) then just change $12 to $9 in the first code sample.

bu5hman
– bu5hman

2020-07-13 16:54:31 +00:00
Commented Jul 13, 2020 at 16:54
Thank you, Your orginal code with GCF_000393655.1_Nsyl_genomic.gff produced the following output. ``` $ cat NW_00935579.1.lst XP_009802589.1+ NW_00935579.1 ``` The 2 new versions: awk -F"[\t;]" '$3=="CDS"&&$9~/1$/{ split($12,arr,"="); out[arr[2]$7" "$1]=$1".lst"} END{PROCINFO["sorted_in"]="@ind_num_asc"; for (o in out) print o > out[o]}' t.gff3 awk -F"[\t;=]" '$3=="CDS"&&$12~/1$/{ out[$16$7" "$1]=$1".lst"} END{PROCINFO["sorted_in"]="@ind_num_asc"; for (o in out) print o > out[o]}' t.gff3 produced only: ``` $ cat NbV1Ch06.lst + NbV1Ch06 ```

user977828
– user977828

2020-07-14 03:11:37 +00:00
Commented Jul 14, 2020 at 3:11
@user977828 The new versions wont work with the old file because they have different data structures If you post a clarification saying which of the key/value pairs it is that you want to extract from the files then a "one stop" fix may be possible, but until you do..... we are just guessing. Please clarify your original post.

bu5hman
– bu5hman

2020-07-14 12:13:31 +00:00
Commented Jul 14, 2020 at 12:13
I am sorry and I understand the problem. For this file. I would need the first column and the Parent value. Which get reversed in the output files.

user977828
– user977828

2020-07-15 09:52:12 +00:00
Commented Jul 15, 2020 at 9:52

Add a comment |

Stack Exchange Network

awk's substr replacement

3 Answers 3

You must log in to answer this question.

Hot Network Questions

awk's substr replacement

3 Answers 3

You must log in to answer this question.

Related

Hot Network Questions