Bash replace string with contents from second file

Question

I have two files. The first looks like this:

FILE1

>comp0_c0_seq1 len=392 path=[1:0-391]
ATGAG...
>comp1_c0_seq1 len=399 path=[1:0-398]
AAGGA...
>comp1_c1_seq1 len=589 path=[1319:0-588]
TATAT...
>comp2_c0_seq2 len=340 path=[1:0-339]
GGAGT...
>comp2_c1_seq1 len=312 path=[924:0-311]
GGTTA...
>comp2_c1_seq2 len=312 path=[924:0-311]
TTATT...
>comp4_c0_seq1 len=800 path=[1:0-581 1284:582-799]
AGAGA...
>comp6_c0_seq1 len=245 path=[815:0-151 745:152-244]
GATTA...

And a second file

FILE2

>contig_1
>contig_2
>contig_3
>contig_4
>contig_5
>contig_6
>contig_7
>contig_8

I can't find a pattern in FILE1 that would let me easily replace the >comp0_c0_seq1 part with >contig_1, and so on. FILE2 has no sequences, only the headers.

I've been trying with sed and awk, but I haven't succeeded.

The output I wish to get is:

>contig_1 len=392 path=[1:0-391]
ATGAG...
>contig_2 len=399 path=[1:0-398]
AAGGA...
>contig_3 len=589 path=[1319:0-588]
TATAT...
>contig_4 len=340 path=[1:0-339]
GGAGT...
>contig_5 len=312 path=[924:0-311]
GGTTA...
>contig_6 len=312 path=[924:0-311]
TTATT...
>contig_7 len=800 path=[1:0-581 1284:582-799]
AGAGA...
>contig_8 len=245 path=[815:0-151 745:152-244]
GATTA...

The files I'm working with are >30,000 contigs long, with very large sequences in between them.

So you just want to replace the nth sequence with the nth contig?
– perreal, Commented May 23, 2013 at 1:56
I want to replace the ambiguous name between ">" and "len=# path=[]", keeping the sequence between contigs.
– Ramirous, Commented May 23, 2013 at 2:06
Are comp[\d]_c[\d]_seq[\d] and contig_[\d] valid regexes for your problem?
– Bill, Commented May 23, 2013 at 2:08

perreal · Accepted Answer · 2013-05-23 02:10:46Z

3

Using awk:

awk '{ if(/comp/) { getline $1 < "input2"; } print }' input1


2 Comments

Ramirous Over a year ago

Amazing! I looked for this answer for three days! Thank you so much!

Chris Seymour Over a year ago

Not that I'd do it this way, but the condition should really go outside the block: awk '/comp/{getline $1<"input2"}1' input1

Chris Seymour · Answer · 2013-05-23 09:17:09Z

Using awk without the headache of getline and using both files:

$ awk 'NR==FNR{a[NR]=$0;next}/^>comp/{$1=a[++i]}1' file2 file1
>contig_1 len=392 path=[1:0-391]
ATGAG...
>contig_2 len=399 path=[1:0-398]
AAGGA...
>contig_3 len=589 path=[1319:0-588]
TATAT...
>contig_4 len=340 path=[1:0-339]
GGAGT...
>contig_5 len=312 path=[924:0-311]
GGTTA...
>contig_6 len=312 path=[924:0-311]
TTATT...
>contig_7 len=800 path=[1:0-581 1284:582-799]
AGAGA...
>contig_8 len=245 path=[815:0-151 745:152-244]
GATTA...

This assumes that file1 contains exactly as many >comp header lines as file2 contains >contig_ lines.
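That assumption can be checked before renaming anything. If file2 turned out to be shorter, headers past its end would likely come out with an empty name, since a[++i] would be unset. Below is a minimal Python 3 sketch of such a check. The helper name count_headers and the in-memory sample lists are made up for illustration and are not part of the answer.

```python
def count_headers(lines, prefix=">"):
    """Count the lines that start with the given prefix."""
    return sum(1 for line in lines if line.startswith(prefix))


# Tiny in-memory stand-ins for file1 and file2 (hypothetical sample data).
file1_lines = [
    ">comp0_c0_seq1 len=392 path=[1:0-391]\n",
    "ATGAG...\n",
    ">comp1_c0_seq1 len=399 path=[1:0-398]\n",
    "AAGGA...\n",
]
file2_lines = [">contig_1\n", ">contig_2\n"]

n_old = count_headers(file1_lines, ">comp")
n_new = count_headers(file2_lines, ">contig_")
print(n_old, n_new, "OK" if n_old == n_new else "MISMATCH")
```

With real files, an open file object can be passed in place of the lists, since the function only iterates over lines and never holds the whole file in memory.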

If you just want increasing >contig_ then you don't need file2 at all:

$ awk '/^>comp/{$1=">contig_"++i}1' file1
>contig_1 len=392 path=[1:0-391]
ATGAG...
>contig_2 len=399 path=[1:0-398]
AAGGA...
>contig_3 len=589 path=[1319:0-588]
TATAT...
>contig_4 len=340 path=[1:0-339]
GGAGT...
>contig_5 len=312 path=[924:0-311]
GGTTA...
>contig_6 len=312 path=[924:0-311]
TTATT...
>contig_7 len=800 path=[1:0-581 1284:582-799]
AGAGA...
>contig_8 len=245 path=[815:0-151 745:152-244]
GATTA...

With the first solution we are talking about 3 kilobytes at most, hardly an issue with today's hardware. Anyway, file2 isn't needed at all; I only included the first script as an alternative to using getline. The second script is the one that should be used here.
Thank you! The last solution saves me the step of finding out how many contigs each file has. Both perreal's and your answers are excellent :D

iruvar · Answer · 2013-05-23 02:17:06Z

0

A Python 2.7 solution (it pulls all of FILE1 into memory, so perreal's solution should be your first option):

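from __future__ import print_function  # makes print() a function in Python 2.7
import re

# Group 1 is the old header ID; each match runs up to the next ">comp" header or the end of input.
pat = re.compile(r'(>comp.*?) .*?(?=(>comp|\Z))', re.DOTALL)
with open('FILE1') as f, open('FILE2') as f2:
    data = f.read()  # reads all of FILE1 into memory
    for fragment in pat.finditer(data):
        # Swap the old ID for the next line of FILE2; replace only the first occurrence.
        fragment = fragment.group(0).replace(fragment.group(1), next(f2).rstrip(), 1)
        print(fragment, end='')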
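The memory concern noted for this answer can be avoided in Python too, by walking both files line by line instead of running a regex over the whole of FILE1. The following is a hedged Python 3 sketch of that alternative, not a rewrite endorsed by the answerer. The function name rename_headers is hypothetical, and io.StringIO objects stand in for the two open files.

```python
import io


def rename_headers(records, names, old_prefix=">comp"):
    """Yield lines from records, replacing the first field of each
    matching header with the next line read from names."""
    names = iter(names)
    for line in records:
        if line.startswith(old_prefix):
            try:
                new_id = next(names).rstrip()
            except StopIteration:
                raise ValueError("fewer replacement names than headers") from None
            # Keep everything after the first space (len=..., path=...).
            _, sep, rest = line.rstrip("\n").partition(" ")
            line = new_id + sep + rest + "\n"
        yield line


# In-memory stand-ins for FILE1 and FILE2 (hypothetical sample data).
fasta = io.StringIO(
    ">comp0_c0_seq1 len=392 path=[1:0-391]\nATGAG...\n"
    ">comp1_c0_seq1 len=399 path=[1:0-398]\nAAGGA...\n"
)
new_names = io.StringIO(">contig_1\n>contig_2\n")
renamed = "".join(rename_headers(fasta, new_names))
print(renamed, end="")
```

With real data, the two StringIO objects would be replaced by open('FILE1') and open('FILE2'), and the yielded lines written straight to the output file. Raising an error when FILE2 runs out is a design choice of this sketch; the original code would instead stop with an unhandled StopIteration.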


2 Comments

Chris Seymour Over a year ago

print is a statement in Python 2.7; this looks like a Python 3 solution?

iruvar Over a year ago

@sudo_O, see the from __future__ import print_function at the beginning
