Remove rows from tab delimited files based on a common column with another file

Question

I am having trouble processing a tab delimited file based on the common column Column_4 with another file.

One file is likely to be very small (fewer than 100 rows); the second, however, will have well over 80,000 rows (both have approximately 30 columns).

file1.txt:

Column_1    Column_2    Column_3    Column_4
A1          B1          C1          D1
A2          B2          C2          D2
A3          B3          C3          D3

file2.txt:

Column_1    Column_2    Column_3    Column_4
Aa1          Bb1          Cc1          Dd1
Aa2          Bb2          Cc2          D2
Aa3          Bb3          Cc3          Dd3

desired_output.txt:

Column_1    Column_2    Column_3    Column_4
Aa2          Bb2         Cc2          D2

I've tried a series of cut, grep, awk, etc., but can't seem to get it right.

The ultimate goal is to remove all the non-matching rows from file2.txt, then compare the output to file1.txt.

Thanks Gnouc. I adjusted the Columns to better illustrate what I am looking for. Basically, if Column_4 from file2.txt does not match Column_4 from file1.txt... I do not want to see it.
– cmart2112, Commented Aug 5, 2014 at 17:08

terdon · Accepted Answer · 2014-08-06 13:23:29Z

If I understood your question correctly, this sounds like a typical use case for join ("Join lines on a common field"):

join --header -j 4 -t $'\t' file1.txt file2.txt

You get 7 columns for each matching row.

Here is what I get (for the slightly modified data, see below):

Column_4    Column_1    Column_2    Column_3    Column_1    Column_2    Column_3
D2  A2  B2  C2  Aa2 Bb2 Cc2
D3  A3  B3  C3  Aa3 Bb3 Cc3
D8  A8  B8  C8  Aa8 Bb8 Cc8

(Sorry, the tabs don't display nicely here.)

Column_4 is your matching value, and comes first. You can compare the values of the other columns as you requested in your goal.

If you only want the second table columns, use:

join --header  -j 4 -o 2.1,2.2,2.3,2.4 -t $'\t' file1.txt file2.txt

However, join expects both input files to be sorted on the join field, so you need to pass them through sort first, telling sort to split on tabs and to use only the 4th field as the key:

join --header  -j 4 -o 2.1,2.2,2.3,2.4 -t $'\t' <(sort -t $'\t' -k4,4 file1.txt) <(sort -t $'\t' -k4,4 file2.txt)
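
One caveat with the command above: sort treats the header line like any other row, so the header only stays first if it happens to sort before the data. Here is a minimal sketch of one way around that, assuming GNU join (for --header). The helper name sort_keep_header and the temporary file names are my own, and the sample data is made up for illustration. It avoids process substitution so that it also runs under plain sh.

```shell
#!/bin/sh
# Illustrative data only; the real files have about 30 columns.
tab=$(printf '\t')
workdir=$(mktemp -d)
printf 'Column_1\tColumn_2\tColumn_3\tColumn_4\nA3\tB3\tC3\tD3\nA1\tB1\tC1\tD1\nA2\tB2\tC2\tD2\n' > "$workdir/file1.txt"
printf 'Column_1\tColumn_2\tColumn_3\tColumn_4\nAa3\tBb3\tCc3\tDd3\nAa2\tBb2\tCc2\tD2\nAa1\tBb1\tCc1\tDd1\n' > "$workdir/file2.txt"

# Hypothetical helper: print the header untouched, then the remaining
# rows sorted on the tab-delimited 4th field only.
sort_keep_header() {
    head -n 1 "$1"
    tail -n +2 "$1" | LC_ALL=C sort -t "$tab" -k4,4
}

sort_keep_header "$workdir/file1.txt" > "$workdir/file1.sorted"
sort_keep_header "$workdir/file2.txt" > "$workdir/file2.sorted"

# Use the same locale for join as for sort so both agree on the ordering.
result=$(LC_ALL=C join --header -j 4 -o 2.1,2.2,2.3,2.4 -t "$tab" \
    "$workdir/file1.sorted" "$workdir/file2.sorted")
printf '%s\n' "$result"
rm -rf "$workdir"
```

With this sample data the output should be the header line followed by the single Aa2 row, since D2 is the only Column_4 value present in both files.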

For a better demo, I suggest slightly different source files (this was written before you edited yours):

file1:

Column_1    Column_2    Column_3    Column_4
A0  B0  C0  D0
A2  B2  C2  D2
A3  B3  C3  D3
A8  B8  C8  D8

file2:

Column_1    Column_2    Column_3    Column_4
Aa1 Bb1 Cc1 D1
Aa2 Bb2 Cc2 D2
Aa3 Bb3 Cc3 D3
Aa4 Bb4 Cc4 D4
Aa5 Bb5 Cc5 D5
Aa6 Bb6 Cc6 D6
Aa7 Bb7 Cc7 D7
Aa8 Bb8 Cc8 D8
Aa9 Bb9 Cc9 D9

cuonglm · Accepted Answer · 2015-07-05 07:18:58Z


An awk solution:

$ awk -F"\t" 'FNR==NR{a[$4];next}; $4 in a' OFS="\t" file1 file2
Column_1    Column_2    Column_3    Column_4
Aa2          Bb2          Cc2          D2
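
For readers unfamiliar with the idiom, here is the same program spelled out with comments. Treat it as a sketch rather than a replacement: the variable col and the array name seen are my own additions, so that the key column can be changed (for example to 18) without editing the program, and the sample files are recreated from the question's data in a temporary directory.

```shell
#!/bin/sh
# Recreate the question's sample data (tab-delimited) in a scratch directory.
workdir=$(mktemp -d)
printf 'Column_1\tColumn_2\tColumn_3\tColumn_4\nA1\tB1\tC1\tD1\nA2\tB2\tC2\tD2\nA3\tB3\tC3\tD3\n' > "$workdir/file1.txt"
printf 'Column_1\tColumn_2\tColumn_3\tColumn_4\nAa1\tBb1\tCc1\tDd1\nAa2\tBb2\tCc2\tD2\nAa3\tBb3\tCc3\tDd3\n' > "$workdir/file2.txt"

# col is a hypothetical parameter: the number of the shared key column.
result=$(awk -F '\t' -v col=4 '
    # FNR (per-file line number) equals NR (overall line number) only
    # while the first file is being read: remember its key values.
    FNR == NR { seen[$col]; next }
    # Second file: a bare pattern prints the row when its key was seen.
    $col in seen
' "$workdir/file1.txt" "$workdir/file2.txt")
printf '%s\n' "$result"
rm -rf "$workdir"
```

The header row is printed because Column_4 appears as a key in both files, and matching rows come out in the order they appear in the second file, so no sorting is needed. Only the small file's keys are held in memory, which suits a lookup file of under 100 rows against a file of 80,000+ rows.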

edited Jul 5, 2015 at 7:18

answered Aug 5, 2014 at 17:12


Is this expected to function similarly with a file with 30+ columns? I get different results when I change the $4 to $18.
– cmart2112, Commented Aug 5, 2014 at 17:21

@cmart2112 yes, it should work. Try adding -F"\t" to make awk read tab separated values. If you have spaces in your fields, that might confuse things.
– terdon ♦, Commented Aug 5, 2014 at 17:23

@terdon: Thanks, missing the tab delimiter. Updated my answer.
– cuonglm, Commented Aug 5, 2014 at 17:24

@Gnouc: Thanks. I am getting close. But somehow I am getting the entire contents of file2.txt, not just the row that matches in Column_4. GNU Awk 3.1.8 if that helps
– cmart2112, Commented Aug 5, 2014 at 17:44

Can you give some actual data?
– cuonglm, Commented Aug 5, 2014 at 17:52
