
Objective: Merge the contents of two files using a common key present in both files.

 file1.txt
 =========
 key1   11
 key2   12
 key3   13


 file2.txt
 =========
 key2   22
 key3   23
 key4   24
 key5   25


 Expected Output:
 ==================
 key1   11
 key2   12    22
 key3   13    23 
 key4   24
 key5   25

Approaches tried:

  1. join command:

    join -a 1 -a 2 file1.txt file2.txt ## full outer join
    
  2. awk:

    awk 'FNR==NR{a[$1]=$2;next;}{ print $0, a[$1]}' file2.txt file1.txt
    

Approach 2 results in only a one-sided outer join (every key from file1.txt is kept, but keys that appear only in file2.txt are dropped), NOT a full outer join:

   key1  11
   key2  12    22
   key3  13    23 

What needs to be modified in approach 2 to result in a full outer join?

  • I need to do a full outer join on multiple CSVs. My keys are sorted, but the values are not. I tried approach 1 and it complained about sorting. What should I do? Commented Mar 15, 2018 at 7:05
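
The sorting complaint in that comment is expected: join requires both inputs to be sorted on the join field. A minimal workaround sketch, assuming a shell with process substitution (bash/zsh):

join -a 1 -a 2 <(sort -k1,1 file1.txt) <(sort -k1,1 file2.txt)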

4 Answers


My solution using join:

join -a1 -a2  -1 1 -2 1 -o 0,1.2,2.2 -e "NULL" file1 file2 

I don't know much about using awk to join large files, so I always use join.

Output:

key1 11 NULL
key2 12 22
key3 13 23
key4 NULL 24
key5 NULL 25
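
As a side note, GNU join can infer the output format with -o auto, so the explicit field list can be dropped. A sketch, assuming GNU coreutils:

join -a1 -a2 -e "NULL" -o auto file1 file2
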
  • -1 1 -2 1 is equivalent to -j 1. Commented Apr 8, 2015 at 13:41
  • @Costas fwiw, POSIX.1-2017 states: Earlier versions of this standard allowed -j, -j1, -j2 options […]. These forms are no longer specified by POSIX.1-2017 but may be present in some implementations. Commented May 24, 2018 at 13:00
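
For reference, the same join written with -j instead of -1/-2 (a sketch; GNU join accepts this form):

join -a1 -a2 -j 1 -o 0,1.2,2.2 -e "NULL" file1 file2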

My solution with awk:

awk '{a[$1]=a[$1]" "$2} END{for(i in a)print i, a[i]}' file1.txt file2.txt

Using the key in the first field as the index, append the second field of each line to the corresponding a[key] (separated by a space). At the end, print every index together with its array element.

Output:

AMD$ awk '{a[$1]=a[$1]" "$2} END{for(i in a)print i, a[i]}' file1.txt file2.txt
key1  11
key2  12 22
key3  13 23
key4  24
key5  25
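
Note that the order of a for (i in a) loop is unspecified in awk, so the keys may come out in any order. If deterministic output matters, one option (a sketch) is to pipe through sort:

awk '{a[$1]=a[$1]" "$2} END{for(i in a)print i, a[i]}' file1.txt file2.txt | sort -k1,1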

With awk, try:

awk '{a[$1]=($1 in a)?a[$1]" "$2:$2};END{for(i in a)print i,a[i]}' file1 file2

For huge files, you should use join instead of the awk approach, since awk stores the entire contents of the files in memory before printing anything.

  • This is just wrong. The smaller file needs to fit in memory as an array (hash table) to overcome the ordering problem; the larger file is processed serially, record by record. In extreme cases it would be a trivial improvement to split the smaller file into, say, 1 GB sections and make multiple passes over the larger file. The first pass might need to be "special": restructuring the input columns into the "joined" format and placing default values in columns for which that pass did not contain the required update data. Commented Feb 8, 2020 at 13:58
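
To make the idea in that comment concrete, here is a hedged sketch of how the question's approach 2 could become a full outer join while keeping only the first (ideally smaller) file in memory; matched keys are deleted from the array so the END block prints only the keys unique to file1.txt (row order and spacing will differ from the expected output above):

awk 'FNR==NR {a[$1]=$2; next}                                # load file1 into memory
     {print $1, (($1 in a) ? a[$1] : ""), $2; delete a[$1]}  # stream file2, join on the fly
     END {for (k in a) print k, a[k]}                        # keys present only in file1
' file1.txt file2.txt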

Your first join seems to be fine here (although it was originally typed in capital letters):

$ join -a 1 -a 2 file1.txt file2.txt
key1 11
key2 12 22
key3 13 23
key4 24
key5 25
