
I have a file with the following structure:

GO:0000001      mitochondrion inheritance
GO:0000002      mitochondrial genome maintenance
GO:0000003      reproduction
alt_id: GO:0019952
alt_id: GO:0050876
GO:0000005      obsolete ribosomal chaperone activity
GO:0000006      high-affinity zinc uptake transmembrane transporter activity
GO:0000007      low-affinity zinc ion transmembrane transporter activity
GO:0000008      obsolete thioredoxin
alt_id: GO:0000013
GO:0000009      alpha-1,6-mannosyltransferase activity

Each alt_id line is an alternative ID for the GO: code on the nearest preceding GO: line. I'd like to append to each alt_id line the definition of that preceding GO: entry, so that the output looks like this:

GO:0000001      mitochondrion inheritance
GO:0000002      mitochondrial genome maintenance
GO:0000003      reproduction
alt_id: GO:0019952     reproduction
alt_id: GO:0050876     reproduction
GO:0000005      obsolete ribosomal chaperone activity
GO:0000006      high-affinity zinc uptake transmembrane transporter activity
GO:0000007      low-affinity zinc ion transmembrane transporter activity
GO:0000008      obsolete thioredoxin
alt_id: GO:0000013      obsolete thioredoxin
GO:0000009      alpha-1,6-mannosyltransferase activity

How can I copy the definition from a GO: row onto the alt_id rows that follow it? I work with Cygwin in a Windows-based environment.

  • What is the separator between GO:0000001 and mitochondrion inheritance? Commented Oct 25, 2016 at 12:10

2 Answers

1

With awk (I'm not sure whether it will work on Cygwin):

$ awk '{ if(/^alt_id/){$0 = $0" "p} else{p = ""; for (i=2; i<=NF; i++) p = p" "$i} } 1' ip.txt
GO:0000001      mitochondrion inheritance
GO:0000002      mitochondrial genome maintenance
GO:0000003      reproduction
alt_id: GO:0019952  reproduction
alt_id: GO:0050876  reproduction
GO:0000005      obsolete ribosomal chaperone activity
GO:0000006      high-affinity zinc uptake transmembrane transporter activity
GO:0000007      low-affinity zinc ion transmembrane transporter activity
GO:0000008      obsolete thioredoxin
alt_id: GO:0000013  obsolete thioredoxin
GO:0000009      alpha-1,6-mannosyltransferase activity
  • For every line that does not start with alt_id, the variable p saves all columns from the second onwards, joined by single spaces
  • When a line starts with alt_id, the contents of p are appended to the input line held in $0
  • The final 1 is a shortcut for printing the contents of $0
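
Because the one-liner above rebuilds the definition from awk fields, any run of spaces or tabs inside it is collapsed to a single space. A minimal sketch of a variant that keeps the rest of the line verbatim is shown here; it is not the answerer's code, and the function name add_alt_desc and the file name go_sample.txt are made up for illustration.

```shell
# Sample input: a few lines from the question, written to a hypothetical file.
printf '%s\n' \
  'GO:0000003      reproduction' \
  'alt_id: GO:0019952' \
  'alt_id: GO:0050876' \
  'GO:0000008      obsolete thioredoxin' \
  'alt_id: GO:0000013' \
  'GO:0000009      alpha-1,6-mannosyltransferase activity' > go_sample.txt

# add_alt_desc FILE (hypothetical helper)
#   alt_id lines: print the line followed by the saved definition.
#   other lines : save everything after the first field (separator
#                 included), then print the line unchanged.
add_alt_desc() {
  awk '
    /^alt_id:/ { print $0 desc; next }
    { desc = $0; sub(/^[^ \t]+/, "", desc); print }
  ' "$1"
}

add_alt_desc go_sample.txt
```

Since the saved text still carries its leading whitespace, each alt_id line ends up with the same separator as the GO: line it was copied from.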
1

The task can easily be done with sed:

sed '
    N  #append the next line (pattern space becomes `line1\nline2`)
    /\nalt_id/s/\([^0-9]*\)\n.*/&\1/
       #if the next line starts with `alt_id`, append to it the tail of
       #the present line (the text after its last digit)
    P  #print the present line (everything before `\n`)
    D  #delete everything up to `\n`, then restart with the remaining part (line2)
    ' file
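
One thing to watch with the N/P/D script above: `[^0-9]*` captures only the text after the last digit on the line, so a definition that itself contains digits (such as alpha-1,6-mannosyltransferase activity) would be copied only in part. A hedged sketch of a variant that captures everything after the GO: number instead is shown here, assuming GNU sed as shipped with Cygwin. The function name add_alt_desc_npd, the file name go_digits.txt, and the alt_id GO:9999999 are invented for illustration and are not in the question's data.

```shell
# Sample input; the last alt_id line is hypothetical, added only to
# exercise a definition that contains digits.
printf '%s\n' \
  'GO:0000003      reproduction' \
  'alt_id: GO:0019952' \
  'GO:0000009      alpha-1,6-mannosyltransferase activity' \
  'alt_id: GO:9999999' > go_digits.txt

# add_alt_desc_npd FILE (hypothetical helper, GNU sed assumed)
#   N : append the next line, giving `line1\nline2`
#   s : if line2 starts with alt_id, copy whatever follows the
#       GO:<digits> ID on line1 onto the end of line2
#   P : print line1
#   D : drop line1 and restart the cycle with line2
add_alt_desc_npd() {
  sed '
    N
    /\nalt_id/s/GO:[0-9]*\(.*\)\n.*/&\1/
    P
    D
  ' "$1"
}

add_alt_desc_npd go_digits.txt
```

The substitution starts at the first GO: on line1, so it also works when line1 is itself an alt_id line that already had a definition appended in the previous cycle.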

Another way is to use the hold space:

sed '
    #if the line starts with `alt_id:`, append the hold space to it
    /^alt_id:/G
    #remove the newline that G inserted
    s/\n//
    #if the substitution succeeded, skip the remaining commands (go to the end)
    t
    #otherwise (for all other lines) copy the line to the hold space
    h
    #remove all non-space characters from the start up to the first space
    s/\S*//
    #exchange hold space and pattern space: the residue goes into
    #the hold space and the full line comes back to be printed
    x
    ' file
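
The `\S` in the hold-space script is a GNU extension. As a sketch under that observation, the same sequence of commands can be written with the POSIX class `[^[:space:]]` and one `-e` per command, which also avoids putting anything after `t` on the same line. The function name add_alt_desc_hold and the file name go_hold.txt are hypothetical.

```shell
# Sample input: a few lines from the question, written to a hypothetical file.
printf '%s\n' \
  'GO:0000003      reproduction' \
  'alt_id: GO:0019952' \
  'alt_id: GO:0050876' \
  'GO:0000005      obsolete ribosomal chaperone activity' \
  'GO:0000008      obsolete thioredoxin' \
  'alt_id: GO:0000013' > go_hold.txt

# add_alt_desc_hold FILE (hypothetical helper)
#   G  on alt_id lines: append the saved definition after a newline
#   s  remove that newline; t then skips the rest when it succeeded
#   h  on other lines: save the whole line in the hold space
#   s  strip the leading ID, leaving the separator and the definition
#   x  swap, so the definition is saved and the full line is printed
add_alt_desc_hold() {
  sed -e '/^alt_id:/G' \
      -e 's/\n//' \
      -e t \
      -e h \
      -e 's/[^[:space:]]*//' \
      -e x "$1"
}

add_alt_desc_hold go_hold.txt
```

The hold space is overwritten only by GO: lines, so several consecutive alt_id lines all receive the definition of the same GO: entry.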
