
Objective: Merge the contents of two files using a common key present in both files.

 file1.txt
 =========
 key1   11
 key2   12
 key3   13


 file2.txt
 =========
 key2   22
 key3   23
 key4   24
 key5   25


 Expected Output:
 ==================
 key1   11
 key2   12    22
 key3   13    23 
 key4   24
 key5   25

Approaches tried:

  1. join command:

    join -a 1 -a 2 file1.txt file2.txt ## full outer join
    
  2. awk:

    awk 'FNR==NR{a[$1]=$2;next;}{ print $0, a[$1]}' file2.txt file1.txt
    

Approach 2 results in only a one-sided outer join (every key from file1.txt is kept, but keys that appear only in file2.txt are dropped), NOT a full outer join:

   key1  11
   key2  12    22
   key3  13    23 

What needs to be modified in approach 2 to result in a full outer join?

  • I need to do a full outer join on multiple CSVs. My keys are sorted, but the values are not. I tried approach 1 and it complained about sorting. What should I do? Commented Mar 15, 2018 at 7:05
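
The sorting complaint in that comment is expected: join requires both inputs to be sorted on the join field. A minimal workaround sketch, assuming a shell with process substitution (bash/zsh):

join -a 1 -a 2 <(sort -k1,1 file1.txt) <(sort -k1,1 file2.txt)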

4 Answers


My solution using join:

join -a1 -a2  -1 1 -2 1 -o 0,1.2,2.2 -e "NULL" file1 file2 

I don't know much about using awk to join large files, so I always use join.

Output:

key1 11 NULL
key2 12 22
key3 13 23
key4 NULL 24
key5 NULL 25
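
As a side note, GNU join can infer the output format with -o auto, so the explicit field list can be dropped. A sketch, assuming GNU coreutils:

join -a1 -a2 -e "NULL" -o auto file1 file2
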
  • -1 1 -2 1 is equivalent to -j 1. Commented Apr 8, 2015 at 13:41
  • @Costas fwiw, POSIX.1-2017 states: Earlier versions of this standard allowed -j, -j1, -j2 options […]. These forms are no longer specified by POSIX.1-2017 but may be present in some implementations. Commented May 24, 2018 at 13:00
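
For reference, the same join written with -j instead of -1/-2 (a sketch; GNU join accepts this form):

join -a1 -a2 -j 1 -o 0,1.2,2.2 -e "NULL" file1 file2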

My solution with awk:

awk '{a[$1]=a[$1]" "$2} END{for(i in a)print i, a[i]}' file1.txt file2.txt

Using the key in the first field as the index, append the second field of each line to the corresponding a[key] (separated by a space). At the end, print every index together with its array element.

Output:

AMD$ awk '{a[$1]=a[$1]" "$2} END{for(i in a)print i, a[i]}' file1.txt file2.txt
key1  11
key2  12 22
key3  13 23
key4  24
key5  25
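
Note that the order of a for (i in a) loop is unspecified in awk, so the keys may come out in any order. If deterministic output matters, one option (a sketch) is to pipe through sort:

awk '{a[$1]=a[$1]" "$2} END{for(i in a)print i, a[i]}' file1.txt file2.txt | sort -k1,1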

With awk, try:

awk '{a[$1]=($1 in a)?a[$1]" "$2:$2};END{for(i in a)print i,a[i]}' file1 file2

For huge files, you should use join instead of the awk approach, since awk stores the entire contents of the files in memory before printing anything.

  • This is just wrong. The smaller file needs to fit in memory as an array (hash table) to overcome the ordering problem; the larger file is processed serially, record by record. In extreme cases it would be a trivial improvement to split the smaller file into, say, 1 GB sections and make multiple passes over the larger file. The first pass might need to be "special": restructuring the input columns into the "joined" format and placing default values in columns for which that pass did not contain the required update data. Commented Feb 8, 2020 at 13:58
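
To make the idea in that comment concrete, here is a hedged sketch of how the question's approach 2 could become a full outer join while keeping only the first (ideally smaller) file in memory; matched keys are deleted from the array so the END block prints only the keys unique to file1.txt (row order and spacing will differ from the expected output above):

awk 'FNR==NR {a[$1]=$2; next}                                # load file1 into memory
     {print $1, (($1 in a) ? a[$1] : ""), $2; delete a[$1]}  # stream file2, join on the fly
     END {for (k in a) print k, a[k]}                        # keys present only in file1
' file1.txt file2.txt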

Your first join seems to be fine here (although it was originally typed in capital letters):

$ join -a 1 -a 2 file1.txt file2.txt
key1 11
key2 12 22
key3 13 23
key4 24
key5 25
