I have a FASTA file containing ~28,000 sequences. I want to replace the headers of these sequences with the lines listed in another file. Example:

File 1:

sp|B7UM99|TIR_ECO27
MPIGNLGNNVNGNHLIPPAPP.....
sp|P0ACF8|HNS_ECOLI
MSEALKILNNIRTLRAQ........
sp|P24232|HMP_ECOLI
MLDAQTIATVKATIPLLVET..........

File 2:

sp|B7UM99|TIR_ECO27OS=Escherichia coli
sp|P0ACF8|HNS_ECOLI=Human
sp|P24232|HMP_ECOLI=Flavohemoprotein

Desired Output:

sp|B7UM99|TIR_ECO27OS=Escherichia coli
MPIGNLGNNVNGNHLIPPAPP.....
sp|P0ACF8|HNS_ECOLI=Human
MSEALKILNNIRTLRAQ........
sp|P24232|HMP_ECOLI=Flavohemoprotein
MLDAQTIATVKATIPLLVET..........

4 Answers

1

With the GNU implementation of sed:

$ sed -e '/^sp|/{R file2' -e 'd}' file1
sp|B7UM99|TIR_ECO27OS=Escherichia coli
MPIGNLGNNVNGNHLIPPAPP.....
sp|P0ACF8|HNS_ECOLI=Human
MSEALKILNNIRTLRAQ........
sp|P24232|HMP_ECOLI=Flavohemoprotein
MLDAQTIATVKATIPLLVET..........

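Here the R file command (a non-standard GNU extension) reads one line from the file each time it runs and queues it for output at the end of the cycle (it does not go into the pattern space), while d (standard) discards the pattern space, which at that point holds the old header.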

Add the -i option to edit file1 in-place.
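
Because the replacement is purely positional, it may be worth checking that file2 has exactly one line per header before running it. A minimal sketch, assuming GNU sed and headers that start with `sp|`; the function name `replace_headers` is made up for illustration:

```shell
#!/bin/sh
# replace_headers FASTA NEWHEADERS   (hypothetical helper name)
# Prints FASTA with each "sp|" header replaced by the next line of
# NEWHEADERS, but only if both counts agree. Needs GNU sed for R.
replace_headers() {
    fasta=$1
    newheaders=$2
    n_old=$(grep -c '^sp|' "$fasta")     # header lines in the FASTA file
    n_new=$(grep -c '' "$newheaders")    # total lines in the header list
    if [ "$n_old" -ne "$n_new" ]; then
        echo "mismatch: $n_old headers, $n_new replacement lines" >&2
        return 1
    fi
    # Same command as above, with the file names taken from the arguments.
    sed -e "/^sp|/{R $newheaders" -e 'd}' "$fasta"
}
```

Called as `replace_headers file1 file2 > file1.new`, it leaves file1 untouched, so the result can be inspected before anything is overwritten.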

1
  • Top marks for brevity, I like it and have upvoted. Would be even better with some explanation to help OP. Commented Apr 2, 2024 at 8:06
0

Perhaps the script below is what you need:

#!/bin/bash

# Save the good lines
awk '{if($0 !~ "^sp")print > "result_1" }' < file_1
awk '{if($0 ~ "^sp")print > "result_2" }' < file_2

# Get number of lines in result_1 ( == nl in result_2 )
nl_file=$(wc -l < result_1)

# Prepare sorting of these files preceded by a number
seq 2 2 $(( ${nl_file} * 2 )) > numbered_file_1
seq 1 2 $(( ${nl_file} * 2 )) > numbered_file_2

# paste content of numbered_file_* and result_* side by side
paste -d ' ' numbered_file_1 result_1 > mergedfiles
paste -d ' ' numbered_file_2 result_2 >> mergedfiles

# Sort by the leading number, then strip it (it can have several digits)
sort -n mergedfiles | sed 's/^[[:digit:]]\+\s\+//'
0

You can do that with sed and paste commands as follows:

$ sed 's/$/\n/' file2 | paste -d ' ' file1 - | sed 's/^sp.* sp/sp/'
sp|B7UM99|TIR_ECO27OS=Escherichia coli
MPIGNLGNNVNGNHLIPPAPP..... 
sp|P0ACF8|HNS_ECOLI=Human
MSEALKILNNIRTLRAQ........ 
sp|P24232|HMP_ECOLI=Flavohemoprotein
MLDAQTIATVKATIPLLVET.......... 

The first sed prepares the shorter file for pasting by adding a blank line after each entry. Now that both files have the same number of lines and the old and new headers line up, paste joins them line by line. Finally, the second sed removes the old header text. This assumes each sequence occupies exactly one line, as in the example.

There will be a trailing space in sequence lines. If it is important to remove it you can pipe the result to another sed as | sed 's/ $//'.
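
To make the three stages easier to follow, here is the same pipeline split into small functions. This is only a sketch: the function names are invented for illustration, it assumes GNU sed (for `\n` in the replacement) and one sequence line per record, and it folds the trailing-space cleanup into the last step.

```shell
#!/bin/sh
# Stage 1: add a blank line after each new header so the header list
# has as many lines as the FASTA file.
pad_headers() {
    sed 's/$/\n/' "$1"
}

# Stage 2: join the FASTA file ($1) and the padded header list ($2)
# line by line, separated by a single space.
join_lines() {
    pad_headers "$2" | paste -d ' ' "$1" -
}

# Stage 3: drop the old header plus separator on header lines, and the
# trailing space that paste leaves on sequence lines.
swap_headers() {
    join_lines "$1" "$2" | sed -e 's/^sp.* sp/sp/' -e 's/ $//'
}
```

Running `join_lines file1 file2` shows the intermediate form, with the old and new header side by side on one line; `swap_headers file1 file2` gives the final result. Note that the greedy `sp.* sp` would cut too much if a new header itself contained the text " sp".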

0

It is good to see that old questions are still answered years later!

This might be a lot easier with awk:

$ awk '/^sp\|/{getline nuhead <"file2";$0=nuhead}1' file1
sp|B7UM99|TIR_ECO27OS=Escherichia coli
MPIGNLGNNVNGNHLIPPAPP.....
sp|P0ACF8|HNS_ECOLI=Human
MSEALKILNNIRTLRAQ........
sp|P24232|HMP_ECOLI=Flavohemoprotein
MLDAQTIATVKATIPLLVET..........
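
If file2 might be shorter than the number of headers, the one-liner would silently reuse the last header it read. A slightly more defensive sketch of the same idea follows; the function name `swap_headers_awk` and the variable names are made up, and it assumes headers start with `sp|`:

```shell
#!/bin/sh
# swap_headers_awk FASTA NEWHEADERS   (hypothetical helper name)
# Replaces each "sp|" header with the next line of NEWHEADERS and stops
# with an error if the header list runs out.
swap_headers_awk() {
    awk -v hdr="$2" '
        /^sp[|]/ {
            # getline returns 1 on success, 0 at end of file, -1 on error.
            if ((getline line < hdr) <= 0) {
                print "no replacement for header on line " NR > "/dev/stderr"
                exit 1
            }
            $0 = line
        }
        { print }
    ' "$1"
}
```

Usage would be `swap_headers_awk file1 file2 > file1.new`; a non-zero exit status means the output is incomplete and should be discarded.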
