1

I have a file that looks like this. It has 12 columns and 3244343 rows. Lets call this file 1.

variant_id  gene_id tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se    pval_nominal_threshold  min_pval_nominal    pval_beta
chr10_100000235_C_T_b38 ENSG00000227232.5   35211   73  74  0.061157    1.69779e-08 0.510322    0.0890939   0.0006160191.01823e-08  1.17701e-05
chr10_100002628_A_C_b38 ENSG00000227232.5   635545  126 130 0.107438    1.01823e-08 0.405406    0.0696647   0.0006160191.01823e-08  1.17701e-05
chr1_666028_G_A_b38 ENSG00000227232.5   636475  111 115 0.0950413   2.78462e-08 0.411513    0.0729864   0.0006160191.01823e-08  1.17701e-0

I have another file that looks like this with 7 headers and with 1633293 rows. This is file 2.

"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" 101762385
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
"chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115

I am only interested in the variant_id column. This is the first column in both files.

How can I compare these two columns and only print those variant_id values in the first column that are not found in the second file. For the above example, the output should be

chr1_666028_G_A_b38 

because it's found in the first file but not the second.

All the values of the variant_id in the second file are also in the first file. But there are extra IDs in the first file not present in the second, which I want to identify.

4
  • The main problem will likely be the file sizes. Is there anything you already tried? Is awk mandatory? Which shell do you use? Commented Sep 30, 2022 at 13:15
  • I've just started learning awk. I'm not sure where to start as I am a beginner. And I am using bash. Commented Sep 30, 2022 at 13:19
  • Is your first field always the same between file 1&2 ? like is variant_id always double coted ? Commented Sep 30, 2022 at 13:28
  • Thanks for your response and to answer your question Yes. The values of this column (variant_id) in the second file is always found in the first file (variant_id). However, the second file (variant_id) may not contain all of the values, if that makes sense. Commented Sep 30, 2022 at 13:33

2 Answers 2

1

If you are using a system that supports process substitution, you can use grep with the -v flag (show non-matching lines) and -f flag (read patterns from file), where the "file" is a command that just prints the first field of your file1. For example:

$ grep -vf <(awk '{print $1}' file2) file1
variant_id  gene_id tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se    pval_nominal_threshold  min_pval_nominal    pval_beta
"chr1_666028_G_A_b38"   ENSG00000227232.5   636475  111 115 0.0950413   2.78462e-08 0.411513    0.0729864   0.0006160191.01823e-08  1.17701e-0

If that isn't an option, you can just print the first fields into a file and use that:

$ awk '{print $1}' file2 > file2.names
$ grep -vf file2.names file1
variant_id  gene_id tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se    pval_nominal_threshold  min_pval_nominal    pval_beta
"chr1_666028_G_A_b38"   ENSG00000227232.5   636475  111 115 0.0950413   2.78462e-08 0.411513    0.0729864   0.0006160191.01823e-08  1.17701e-0

Alternatively, assuming you have enough RAM to hold all variant IDs from file2 (which you should unless you are working on very old hardware), you can use awk to store all the first fields of one file and then check the other file for them:

$ awk 'NR == FNR{a[$1]++;next}; !($1 in a)' file2 file1
variant_id  gene_id tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se    pval_nominal_threshold  min_pval_nominal    pval_beta
"chr1_666028_G_A_b38"   ENSG00000227232.5   636475  111 115 0.0950413   2.78462e-08 0.411513    0.0729864   0.0006160191.01823e-08  1.17701e-0
10
  • I tried this but the top two didn't work. It just prints out everything I have in file1. E.g. File 1 has 3244343 lines and the output file also have 3244343 lines. Moreover, the output aren't showing the disparities. Commented Sep 30, 2022 at 23:13
  • @HKJ3 what operating system are you using? Have you ever opened the file in Windows? Does the last approach work? Commented Oct 1, 2022 at 10:21
  • I am using a terminal. I am writing the command directly onto the terminal rather in a script. Commented Oct 2, 2022 at 17:22
  • @HKJ3 OK, but please answer the questions I asked: what operating system are you using? Have you ever opened the file in Windows? Does the last approach work? Commented Oct 2, 2022 at 17:22
  • Thank you for your quick response. The last approach does the same thing as the first two commands. It prints out the file 1 contents. I am using a macOS v12.2.1. I am not using windows. Commented Oct 2, 2022 at 17:25
1

I would first sort the two files (excluding the header of course). I would then use awk or cut to select the first column (minus the header) and use comm to select the ones from file1 that are not in file2, using process substitution:

comm -23 <(awk 'NR>1 {print $1;}' file1) <(awk 'NR >1 {print $1;}' file2)

You can make a minor simplification by refactoring the command in the process substitution[1] into a script, say col1-nh ("first column, no header"):

#! /bin/bash

file=$1

awk 'NR>1 {print $1;}' $file

and the command would then be:

comm -23 <(col1-nh file1) <(col1-nh file2)

Again, this assumes that the bodies of the files are sorted. But that is O(N logN) and both col1-nh and comm are O(N) where N is the number of lines, so it should be able to deal with files of the size you mention without problems. You should of course measure the time that each suggested solution takes.


[1] Although with @terdon's suggestion of using NR>1 in the awk invocation, the sed is no longer needed, and one could argue that the command is simple enough as is.

4
  • 1
    Why sort first? You're already assuming a system that supports process substitution, so there is no benefit in sorting first, just do it on the fly: comm -23 <(awk 'NR>1{print $1}' file1 | sort) <(awk 'NR>1{print $1}' file2 | sort). Commented Sep 30, 2022 at 17:00
  • You are right, but I'd rather incur the sort cost once and not have to worry during subsequent invocations. I am assuming that this is a small part of a larger exploration (e.g. doing the same thing for file2). BTW, I'll use your NR>`` suggestion to get rid of the sed` - thanks! Commented Sep 30, 2022 at 19:06
  • I get an error saying /cm/local/apps/torque/var/spool/mom_priv/jobs/19815748.master.cm.cluster.SC: line 15: col1-nh: command not found Commented Oct 2, 2022 at 19:39
  • Did you save the col1-nh script somewhere on your PATH (e.g. $HOME/bin)? Did you make it executable with chmod +x /path/to/col1-nh? Commented Oct 2, 2022 at 21:39

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.