unix: merge 2 files using 2nd columns

Question

I'd like to merge two files according to the content of their 2nd columns.

File 1:

"4742"  "209220_at"     2.60700394801826
"104"   "209396_s_at"   2.60651442103297
"749"   "202409_at"     2.59424724783704
"4168"  "209875_s_at"   2.58773204877464
"3973"  "1431_at"       2.52832098784342
"1826"  "207201_s_at"   2.41685345240968

File2:

"653"   "1431_at"       2.14595534191867
"1109"  "207201_s_at"   2.13777517447307
"353"   "212531_at"     2.12706340284672
"381"   "206535_at"     2.11456707231618
"1846"  "204534_at"     2.10919474441178

To have in the end:

"3973"  "1431_at"       2.52832098784342 "653"   "1431_at"       2.14595534191867
"1826"  "207201_s_at"   2.41685345240968 "1109"  "207201_s_at"   2.13777517447307

I have tried comm, diff, some obscure awk one-liner without any success. Any help much appreciated. Ben

jamessan · Accepted Answer · 2011-02-11 17:20:31Z

You can do that with a combination of the sort and join commands. The straightforward approach is

join -j2 <(sort -k2 file1) <(sort -k2 file2)

but that displays slightly differently than you're looking for. It just shows the common join field and then the remaining fields from each file

"1431_at" "3973" 2.52832098784342 "653" 2.14595534191867
"207201_s_at" "1826" 2.41685345240968 "1109" 2.13777517447307

If you need the format exactly as you showed, then you would need to tell join to output in that manner

join -o 1.1,1.2,1.3,2.1,2.2,2.3 -j2 <(sort -k2 file1) <(sort -k2 file2)

where -o accepts a list of FILENUM.FIELDNUM specifiers.

Note that the <() syntax I'm using isn't POSIX sh, so you should sort to a temporary file if you need POSIX sh syntax.

if you are using certain versions of bash, you can also write: join -j2 <(sort -k2 file{1,2})
Except that doesn't provide join with two distinct sets of input.

glenn jackman · Accepted Answer · 2011-02-11 17:41:02Z

2

awk '
  # store the first file, indexed by col2
  NR==FNR {f1[$2] = $0; next}
  # output only if file1 contains file2's col2
  ($2 in f1) {print f1[$2], $0}
' file1 file2

answered Feb 11, 2011 at 17:41

glenn jackman

249k42 gold badges233 silver badges362 bronze badges

1 Comment

glenn jackman Over a year ago

note that this will store the entire file1 in memory, so you might want to put the smaller file as file1

btilly · Accepted Answer · 2011-02-11 16:59:45Z

If the files are small, write a program in a scripting language (Perl, Python and Ruby are all good choices) that reads the first into a hash whose keys are the second column, then read through the second file and use hash lookups to fin out what (if anything) can be joined.

If the files are large then for each file swap the first and second columns, pass them through the unix sort utility, and then in a scripting language merge (plus column reorder) the two sorted files.

kurumi · Accepted Answer · 2011-02-12 03:57:35Z

0

awk 'FNR==NR{a[$2]=$0} NR>FNR && ($2 in a){ print $0,a[$2] } ' file2 file1

answered Feb 12, 2011 at 3:57

kurumi

25.7k5 gold badges47 silver badges52 bronze badges

Collectives™ on Stack Overflow

unix: merge 2 files using 2nd columns

4 Answers 4

2 Comments

1 Comment

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

2 Comments

1 Comment

Comments

Comments

Related