2

I have two .txt files. File1.txt and File2.txt. How can i merge these two files in unix (may be with awk) based on one common column.

File1.txt looks like

Sub_ID  Sam_ID   v1              
1878372 2253734 SAMN06396112           
1883177 2264293 SAMN06414028          
1884646 2275341 SAMN06432785         
1860945 2277481 SAMN06407597  

File2.txt looks like

Sam_ID  code    V3      V4 
2253734 20481   NA      DNA   
2275341 20483   NA      DNA    
2277481 20488   NA      DNA   

Final output file after merging should look like this

Finalfile.txt

Sub_ID  Sam_ID   v1               code   V3      V4
1878372 2253734 SAMN06396112     20481   NA      DNA                
1884646 2275341 SAMN06432785     20483   NA      DNA     
1860945 2277481 SAMN06407597     20488   NA      DNA  

I have tried Join so far but may be i am not fully understanding the command (being new to unix).

sort -k2b File1.txt >sorted_file1.txt 
sort File2.txt >sorted_file2.txt 
join -1 2 sorted_file1.txt sorted_file2.txt > Finalfile.txt

I understand that by k2b i am nominating the second column of File_1 to be common among two and then merging.

4
  • Could you please show us any attempt from your side? Commented Jun 18, 2021 at 3:59
  • You could use programming language that has support for relational db and use it to create an SQLite database and then execute a join between your two "tables" and produce the output you want. Commented Jun 18, 2021 at 3:59
  • What's your field separator in both files? One tab or multiple spaces? Commented Jun 18, 2021 at 4:13
  • One tab. @Cyrus Commented Jun 18, 2021 at 4:15

3 Answers 3

3

Thanks for adding your own attempts to solve the problem - it makes troubleshooting a lot easier.

This answer is a bit convoluted, but here is a potential solution (GNU join):

join -t $'\t' -1 2 -2 1 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sam_ID Sub_ID  v1  code    V3  V4
#2253734    1878372 SAMN06396112    20481   NA  DNA
#2275341    1884646 SAMN06432785    20483   NA  DNA
#2277481    1860945 SAMN06407597    20488   NA  DNA

Explanation:

  • join uses a single character as a separator, so you can't use "\t", but you can use $'\t' (as far as I know)
  • the -1 2 and -2 1 means "for the first file, use the second field" and "for the second file, use the first field" when combining the files
  • in each subprocess (<()), sort the file by the Sam_ID column but exclude the header from the sort (per Is there a way to ignore header lines in a UNIX sort?)

Edit

To specify the order of the columns in the output (to put the Sub_ID before the Sam_ID), you can use the -o option, e.g.

join -t $'\t' -1 2 -2 1 -o 1.1,1.2,1.3,2.2,2.3,2.4 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sub_ID Sam_ID  v1  code    V3  V4
#1878372    2253734 SAMN06396112    20481   NA  DNA
#1884646    2275341 SAMN06432785    20483   NA  DNA
#1860945    2277481 SAMN06407597    20488   NA  DNA
Sign up to request clarification or add additional context in comments.

2 Comments

Hi Jared: Thanks but the only problem i am facing with my attempt is the headers are removed. Is there anyway to avoid that?
In my example the headers from each file are retained. Are you sure they are separated by a single tab?
2

This single awk should work file for merge task:

awk 'FNR==NR {map[$2] = $0; next} {print map[$1], $0}' File1.txt File2.txt | column -t

Sub_ID   Sam_ID   v1            Sam_ID   code   V3  V4
1878372  2253734  SAMN06396112  2253734  20481  NA  DNA
1884646  2275341  SAMN06432785  2275341  20483  NA  DNA
1860945  2277481  SAMN06407597  2277481  20488  NA  DNA

column -t has been used for tabular output only.

PS: If first column from File2 is not found in File1 then you will get a leading space in output.

Comments

1

With cut grep and pr and ProcSub from bash.

pr -mt <(grep -Fv 1883177 file1.txt) <(cut -d' ' -f2- file2.txt)

Output

Sub_ID  Sam_ID   v1                  code    V3      V4
1878372 2253734 SAMN06396112        20481   NA      DNA
1884646 2275341 SAMN06432785        20483   NA      DNA
1860945 2277481 SAMN06407597        20488   NA      DNA

Or paste

paste -d' ' <(grep -Fv 1883177 file1.txt) <(cut -d' ' -f2- file2.txt) | column -t > Finalfile.txt

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.