Skip to main content
edited tags; edited tags
Link
Jeff Schaller
  • 68.8k
  • 35
  • 122
  • 263
Source Link

sed: find pattern and replace for another pattern in the same line

I have a file with gene_id and gene names in one line. I want to replace the word after gene_id with the word after gene or after product or after sprot (if some of it missed).

Here is an example of a line:

chrM    Gnomon  CDS 8345    8513    .   +   1   gene_id "cds-XP_008824843.3"; transcript_id "cds-XP_008824843.3"; Parent "rna-XM_008826621.3"; Dbxref "GeneID:103728653_Genbank:XP_008824843.3"; Name "XP_008824843.3"; end_range "8513,."; gbkey "CDS"; gene "semaphorin-3F"; partial "true"; product "semaphorin-3F"; protein_id "XP_008824843.3"; sprot "sp|Q13275|SEM3F_HUMAN";
chrM    StringTie   exon    2754    3700    .   +   .   gene_id "cds-YP_007626758.1"; transcript_id "cds-YP_007626758.1"; Parent "gene-ND1"; Dbxref "Genbank:YP_007626758.1,Gene "ID:15088436"; Name "YP_007626758.1"; Note "TAAstopcodoniscompletedbytheadditionof3'AresiduestothemRNA"; gbkey "CDS"; gene "ND1"; product "NADHdehydrogenasesubunit1"; protein_id "YP_007626758.1"; transl_except "(pos:3700..3700%2Caa:TERM)"; transl_table "2";

I tried to make it with sed:

sed -E 's/[^gene_id] .*?;/[^gene] .*?;|[^sprot] .*?;|[^product] .*?;/g'

But the results were incorrect:

chrM    Gnomon  CDS 8345    8513    .   +   1   gene_id "cds-XP_008824843.3"[^gene] .*?;|[^sprot] .*?;|[^product] .*?;
chrM     StringTie       exon    2754    3700    .       +       .       gene_id "cds-YP_007626758.1"[^gene] .*?;|[^sprot] .*?;|[^product] .*?;

But I want to save all line, but with another word after gene_id, like this:

chrM    Gnomon  CDS 8345    8513    .   +   1   gene_id "semaphorin-3F"; transcript_id "cds-XP_008824843.3"; Parent "rna-XM_008826621.3"; Dbxref "GeneID:103728653_Genbank:XP_008824843.3"; Name "XP_008824843.3"; end_range "8513,."; gbkey "CDS"; gene "semaphorin-3F"; partial "true"; product "semaphorin-3F"; protein_id "XP_008824843.3"; sprot "sp|Q13275|SEM3F_HUMAN";
chrM     StringTie       exon    2754    3700    .       +       .       gene_id "ND1"; transcript_id "cds-YP_007626758.1"; Parent "gene-ND1"; Dbxref "Genbank:YP_007626758.1,Gene "ID:15088436"; Name "YP_007626758.1"; Note "TAAstopcodoniscompletedbytheadditionof3'AresiduestothemRNA"; gbkey "CDS"; gene "ND1"; product "NADHdehydrogenasesubunit1"; protein_id "YP_007626758.1"; transl_except "(pos:3700..3700%2Caa:TERM)"; transl_table "2";

Or like this (if another missed):

chrM    Gnomon  CDS 8345    8513    .   +   1   gene_id "sp|Q13275|SEM3F_HUMAN"; transcript_id "cds-XP_008824843.3"; Parent "rna-XM_008826621.3"; Dbxref "GeneID:103728653_Genbank:XP_008824843.3"; Name "XP_008824843.3"; end_range "8513,."; gbkey "CDS"; gene "semaphorin-3F"; partial "true"; product "semaphorin-3F"; protein_id "XP_008824843.3"; sprot "sp|Q13275|SEM3F_HUMAN";
chrM     StringTie       exon    2754    3700    .       +       .       gene_id "ND1"; transcript_id "cds-YP_007626758.1"; Parent "gene-ND1"; Dbxref "Genbank:YP_007626758.1,Gene "ID:15088436"; Name "YP_007626758.1"; Note "TAAstopcodoniscompletedbytheadditionof3'AresiduestothemRNA"; gbkey "CDS"; gene "ND1"; product "NADHdehydrogenasesubunit1"; protein_id "YP_007626758.1"; transl_except "(pos:3700..3700%2Caa:TERM)"; transl_table "2";

Any help will be very much appreciated.