replace header in a file with list of lines in another file

Question

I have a FASTA file containing ~28,000 sequences. I want to replace the header of each sequence with the corresponding line from another file. Example:

File 1:

sp|B7UM99|TIR_ECO27
MPIGNLGNNVNGNHLIPPAPP.....
sp|P0ACF8|HNS_ECOLI
MSEALKILNNIRTLRAQ........
sp|P24232|HMP_ECOLI
MLDAQTIATVKATIPLLVET..........

File 2:

sp|B7UM99|TIR_ECO27OS=Escherichia coli
sp|P0ACF8|HNS_ECOLI=Human
sp|P24232|HMP_ECOLI=Flavohemoprotein

Desired Output:

sp|B7UM99|TIR_ECO27OS=Escherichia coli
MPIGNLGNNVNGNHLIPPAPP.....
sp|P0ACF8|HNS_ECOLI=Human
MSEALKILNNIRTLRAQ........
sp|P24232|HMP_ECOLI=Flavohemoprotein
MLDAQTIATVKATIPLLVET..........

Stéphane Chazelas · Accepted Answer · 2024-04-02 08:09:10Z

With the GNU implementation of sed:

$ sed -e '/^sp|/{R file2' -e 'd}' file1
sp|B7UM99|TIR_ECO27OS=Escherichia coli
MPIGNLGNNVNGNHLIPPAPP.....
sp|P0ACF8|HNS_ECOLI=Human
MSEALKILNNIRTLRAQ........
sp|P24232|HMP_ECOLI=Flavohemoprotein
MLDAQTIATVKATIPLLVET..........

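Here, the R file command (a non-standard GNU extension) reads one line from file each time it runs and queues it for output at the end of the current cycle, without putting it into the pattern space, while d (standard) deletes the pattern space, so the old header is never printed.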

Add the -i option to edit file1 in-place.
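
Because R consumes file2 one line per matching header, a count mismatch would likely misalign the headers or drop the ones left without a replacement. The following is only a minimal sketch of a pre-check, assuming GNU sed; the shortened sample data, the temporary directory and the output name renamed.fasta are illustrative and not part of the question:

```shell
# Illustrative sample data in a scratch directory (names are assumptions).
workdir=$(mktemp -d)
file1=$workdir/file1
file2=$workdir/file2

printf '%s\n' 'sp|B7UM99|TIR_ECO27' 'MPIGNLGNNVNGNHLIPPAPP' \
              'sp|P0ACF8|HNS_ECOLI' 'MSEALKILNNIRTLRAQ' > "$file1"
printf '%s\n' 'sp|B7UM99|TIR_ECO27OS=Escherichia coli' \
              'sp|P0ACF8|HNS_ECOLI=Human' > "$file2"

# Count the header lines in file1 and the replacement lines in file2.
n_headers=$(grep -c '^sp|' "$file1")
n_new=$(wc -l < "$file2")

if [ "$n_headers" -eq "$n_new" ]; then
    # Same command as above, writing to a new file instead of using -i.
    sed -e "/^sp|/{R $file2" -e 'd}' "$file1" > "$workdir/renamed.fasta"
    cat "$workdir/renamed.fasta"
else
    echo "header count mismatch: $n_headers in file1, $n_new in file2" >&2
fi
```

Writing to a separate file keeps the original intact until the result has been inspected.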

Answered Apr 2, 2024 at 6:02; edited Apr 2, 2024 at 8:09.

Top marks for brevity, I like it and have upvoted. Would be even better with some explanation to help OP.

– canupseq, commented Apr 2, 2024 at 8:06

Rui F Ribeiro · Accepted Answer · 2019-05-09 19:26:29Z

Perhaps the script below is what you need:

#!/bin/bash

# Save the good lines
awk '{if($0 !~ "^sp")print > "result_1" }' < file_1
awk '{if($0 ~ "^sp")print > "result_2" }' < file_2

# Get number of lines in result_1 ( == nl in result_2 )
nl_file=$(wc -l result_1|cut -d' ' -f1)

# Prepare sorting of these files preceded by a number
seq 2 2 $(( ${nl_file} * 2 )) > numbered_file_1
seq 1 2 $(( ${nl_file} * 2 )) > numbered_file_2

# paste content of numbered_file_* and result_* side by side
paste -d ' ' numbered_file_1 result_1 > mergedfiles
paste -d ' ' numbered_file_2 result_2 >> mergedfiles

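# Interleave by line number, then strip the whole numeric prefix
# (all digits, not just the first one) and the separating space
sort -n mergedfiles | sed 's/^[[:digit:]]* //'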

canupseq · Accepted Answer · 2024-04-01 20:15:49Z

You can do that with the sed and paste commands as follows:

$ sed 's/$/\n/' file2 | paste -d ' ' file1 - | sed 's/^sp.* sp/sp/'
sp|B7UM99|TIR_ECO27OS=Escherichia coli
MPIGNLGNNVNGNHLIPPAPP..... 
sp|P0ACF8|HNS_ECOLI=Human
MSEALKILNNIRTLRAQ........ 
sp|P24232|HMP_ECOLI=Flavohemoprotein
MLDAQTIATVKATIPLLVET..........

The first sed prepares the shorter file for pasting by adding a blank line after each entry. Both files then have the same number of lines (assuming each sequence occupies a single line, as in the example) and the old and new headers line up, so we can run the paste command. Finally, the second sed removes the old header text.

There will be a trailing space in sequence lines. If it is important to remove it you can pipe the result to another sed as | sed 's/ $//'.
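
The whole pipeline, including the trailing-space cleanup, can be sketched in one go. This is a hedged variant, not the command above: it swaps in sed G (which appends the empty hold space, so it adds a blank line after each input line) as an assumed more portable replacement for 's/$/\n/', and the shortened sample data, temporary directory and output name merged.fasta are illustrative:

```shell
# Illustrative sample data in a scratch directory (names are assumptions).
workdir=$(mktemp -d)
printf '%s\n' 'sp|B7UM99|TIR_ECO27' 'MPIGNLGNNVNGNHLIPPAPP' \
              'sp|P0ACF8|HNS_ECOLI' 'MSEALKILNNIRTLRAQ' > "$workdir/file1"
printf '%s\n' 'sp|B7UM99|TIR_ECO27OS=Escherichia coli' \
              'sp|P0ACF8|HNS_ECOLI=Human' > "$workdir/file2"

# 1. sed G: add a blank line after each new header so file2 lines up with file1.
# 2. paste: join the two streams line by line with a single space.
# 3. sed: drop the old header, then strip the trailing space on sequence lines.
#    The greedy '^sp.* sp' assumes the new header contains " sp" only at its start.
sed G "$workdir/file2" |
    paste -d ' ' "$workdir/file1" - |
    sed -e 's/^sp.* sp/sp/' -e 's/ $//' > "$workdir/merged.fasta"

cat "$workdir/merged.fasta"
```

Like the original pipeline, this relies on every sequence sitting on exactly one line.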

canupseq · Accepted Answer · 2024-04-02 11:08:54Z

It is good to see that old questions are still answered years later!

This might be a lot easier with awk:

$ awk '/sp/{getline nuhead <"file2";$0=nuhead}1' file1
sp|B7UM99|TIR_ECO27OS=Escherichia coli
MPIGNLGNNVNGNHLIPPAPP.....
sp|P0ACF8|HNS_ECOLI=Human
MSEALKILNNIRTLRAQ........
sp|P24232|HMP_ECOLI=Flavohemoprotein
MLDAQTIATVKATIPLLVET..........
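
A slightly more defensive variant of the same idea is sketched below. It anchors the pattern to lines that start with sp| and checks the return value of getline, so a header is left unchanged (with a warning) if file2 runs out of lines. The variable names headers and newhead, the shortened sample data and the output name renamed.fasta are assumptions made for illustration:

```shell
# Illustrative sample data in a scratch directory (names are assumptions).
workdir=$(mktemp -d)
file1=$workdir/file1
file2=$workdir/file2

printf '%s\n' 'sp|B7UM99|TIR_ECO27' 'MPIGNLGNNVNGNHLIPPAPP' \
              'sp|P0ACF8|HNS_ECOLI' 'MSEALKILNNIRTLRAQ' > "$file1"
printf '%s\n' 'sp|B7UM99|TIR_ECO27OS=Escherichia coli' \
              'sp|P0ACF8|HNS_ECOLI=Human' > "$file2"

awk -v headers="$file2" '
    /^sp[|]/ {
        # getline returns 1 on success, 0 at end of file, -1 on error.
        if ((getline newhead < headers) > 0)
            $0 = newhead
        else
            print "warning: no replacement for " $0 > "/dev/stderr"
    }
    { print }
' "$file1" > "$workdir/renamed.fasta"

cat "$workdir/renamed.fasta"
```

Passing the file name with -v avoids hard-coding it inside the awk program.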
