1

I have two files. The first looks like this:

FILE1

>comp0_c0_seq1 len=392 path=[1:0-391]
ATGAG...
>comp1_c0_seq1 len=399 path=[1:0-398]
AAGGA...
>comp1_c1_seq1 len=589 path=[1319:0-588]
TATAT...
>comp2_c0_seq2 len=340 path=[1:0-339]
GGAGT...
>comp2_c1_seq1 len=312 path=[924:0-311]
GGTTA...
>comp2_c1_seq2 len=312 path=[924:0-311]
TTATT...
>comp4_c0_seq1 len=800 path=[1:0-581 1284:582-799]
AGAGA...
>comp6_c0_seq1 len=245 path=[815:0-151 745:152-244]
GATTA...

And a second file

FILE2

>contig_1
>contig_2
>contig_3
>contig_4
>contig_5
>contig_6
>contig_7
>contig_8

I can't find a pattern in FILE1 that would let me easily replace the >comp0_c0_seq1 part with >contig_1, and so on. FILE2 has no sequences, only the headers.

I've been trying with sed and awk, but I haven't succeeded.

The output I wish to get is:

>contig_1 len=392 path=[1:0-391]
ATGAG...
>contig_2 len=399 path=[1:0-398]
AAGGA...
>contig_3 len=589 path=[1319:0-588]
TATAT...
>contig_4 len=340 path=[1:0-339]
GGAGT...
>contig_5 len=312 path=[924:0-311]
GGTTA...
>contig_6 len=312 path=[924:0-311]
TTATT...
>contig_7 len=800 path=[1:0-581 1284:582-799]
AGAGA...
>contig_8 len=245 path=[815:0-151 745:152-244]
GATTA...

The files I'm working with are >30,000 contigs long, with very large sequences in between them.

4
  • So you just want to replace the nth sequence header with the nth contig name? Commented May 23, 2013 at 1:56
  • Yes, that is correct. Commented May 23, 2013 at 2:03
  • I want to replace the ambiguous name between ">" and "len=# path=[]", keeping the sequence between contigs. Commented May 23, 2013 at 2:06
  • Are comp[\d]_c[\d]_seq[\d] and contig_[\d] valid regexes for your problem? Commented May 23, 2013 at 2:08

3 Answers

3

Using awk:

awk '{ if(/comp/) { getline $1 < "input2"; } print }' input1

2 Comments

Amazing! I looked for this answer for three days! Thank you so much!
Not that I'd do it this way, but the condition should really go outside the block: awk '/comp/{getline $1<"input2"}1' input1.
1

Using awk without the headache of getline and using both files:

$ awk 'NR==FNR{a[NR]=$0;next}/^>comp/{$1=a[++i]}1' file2 file1
>contig_1 len=392 path=[1:0-391]
ATGAG...
>contig_2 len=399 path=[1:0-398]
AAGGA...
>contig_3 len=589 path=[1319:0-588]
TATAT...
>contig_4 len=340 path=[1:0-339]
GGAGT...
>contig_5 len=312 path=[924:0-311]
GGTTA...
>contig_6 len=312 path=[924:0-311]
TTATT...
>contig_7 len=800 path=[1:0-581 1284:582-799]
AGAGA...
>contig_8 len=245 path=[815:0-151 745:152-244]
GATTA...

This assumes file1 and file2 contain the same number of header lines (the >comp... lines in file1 and the >contig_... lines in file2).
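The equal-count assumption above can be checked before renaming anything. A minimal Python sketch, assuming every header line starts with ">"; the function name count_headers and the short in-memory sample lines are hypothetical stand-ins, not part of the answer:

```python
def count_headers(lines):
    """Count FASTA-style header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Illustrative stand-ins for file1 and file2 (not real data).
file1_lines = [
    ">comp0_c0_seq1 len=392 path=[1:0-391]",
    "ATGAG",
    ">comp1_c0_seq1 len=399 path=[1:0-398]",
    "AAGGA",
]
file2_lines = [">contig_1", ">contig_2"]

# The renaming is only safe when both counts agree.
same = count_headers(file1_lines) == count_headers(file2_lines)
print("header counts match:", same)
```

For real files, an open file object can be passed in place of each list, since the function only iterates over lines.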


If you just want sequentially numbered >contig_ headers, then you don't need file2 at all:

$ awk '/^>comp/{$1=">contig_"++i}1' file1
>contig_1 len=392 path=[1:0-391]
ATGAG...
>contig_2 len=399 path=[1:0-398]
AAGGA...
>contig_3 len=589 path=[1319:0-588]
TATAT...
>contig_4 len=340 path=[1:0-339]
GGAGT...
>contig_5 len=312 path=[924:0-311]
GGTTA...
>contig_6 len=312 path=[924:0-311]
TTATT...
>contig_7 len=800 path=[1:0-581 1284:582-799]
AGAGA...
>contig_8 len=245 path=[815:0-151 745:152-244]
GATTA...
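The same counter-based idea can be sketched in Python for anyone who prefers it over awk. This is only an illustration: the function name number_headers and its prefix parameter are hypothetical, and the default prefix ">contig_" is chosen to match the output requested in the question.

```python
def number_headers(lines, prefix=">contig_"):
    """Yield lines, replacing the first field of each '>comp' header
    with prefix plus a running count. Other lines pass through unchanged."""
    count = 0
    for line in lines:
        if line.startswith(">comp"):
            count += 1
            # The old name is everything up to the first whitespace.
            old_name = line.split(None, 1)[0]
            line = prefix + str(count) + line[len(old_name):]
        yield line


# Illustrative input lines (not real data).
sample = [
    ">comp0_c0_seq1 len=392 path=[1:0-391]\n",
    "ATGAG\n",
    ">comp1_c0_seq1 len=399 path=[1:0-398]\n",
    "AAGGA\n",
]
for out_line in number_headers(sample):
    print(out_line, end="")
```

Because it is a generator, it handles one line at a time and can be fed an open file object directly. Unlike assigning to $1 in awk, it leaves the whitespace in the rest of the header untouched.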

3 Comments

but this can potentially use a lot of memory
With the first solution we are talking 3 kilobytes max, hardly an issue with today's hardware. Anyway, file2 isn't needed at all; I only included the first script as an alternative to using getline. The second script should be used here.
Thank you! The last solution saves me the step of finding out how many contigs each file has. Both perreal's and your answers are excellent :D
0

A Python 2.7 solution (it pulls all of FILE1 into memory, so perreal's solution should be your first option):

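from __future__ import print_function
import re

# Match one record: the header name (group 1), then everything up to the
# next '>comp' header or the end of the data.
pat = re.compile(r'(>comp.*?) .*?(?=(>comp|\Z))', re.DOTALL)
with open('FILE1') as f, open('FILE2') as f2:
    data = f.read()  # reads all of FILE1 into memory
    for match in pat.finditer(data):
        # Swap the old header name for the next name from FILE2.
        new_name = next(f2).rstrip()
        record = match.group(0).replace(match.group(1), new_name, 1)
        print(record, end='')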
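Since the files in the question are large, a line-by-line variant avoids holding FILE1 in memory. The following is a rough sketch under the same assumptions as the answer (headers start with ">comp", and FILE2 holds one replacement name per line); the function name rename_headers is hypothetical:

```python
def rename_headers(seq_lines, name_lines):
    """Yield seq_lines, replacing the first field of each '>comp' header
    with the next line from name_lines. Other lines pass through unchanged."""
    names = iter(name_lines)
    for line in seq_lines:
        if line.startswith(">comp"):
            new_name = next(names, None)
            if new_name is None:
                raise ValueError("ran out of replacement names")
            # The old name is everything up to the first whitespace.
            old_name = line.split(None, 1)[0]
            line = new_name.rstrip() + line[len(old_name):]
        yield line


# Illustrative stand-ins for FILE1 and FILE2 (not real data).
demo_seqs = [
    ">comp0_c0_seq1 len=392 path=[1:0-391]\n",
    "ATGAG\n",
    ">comp1_c0_seq1 len=399 path=[1:0-398]\n",
    "AAGGA\n",
]
demo_names = [">contig_1\n", ">contig_2\n"]
for out_line in rename_headers(demo_seqs, demo_names):
    print(out_line, end="")
```

With real data, two open file objects can be passed in place of the lists, so only one line from each file is held at a time.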

2 Comments

print is a statement in Python 2.7; this looks like a Python 3 solution?
@sudo_O, see the from __future__ import print_function line at the beginning.
