intersection beween 2 files (values in file 1 which fall in range of values in file 2)

Question

I have a file named snp_data containing SNP (Single-Nucleotide Polymorphism) chromosome data. This is a 3-column, white-space delimited CSV file which has the following format:

user@host:~$ cat snp_data

snp_id    chromosome  position
Chr01__912 1 912 1
Chr01__944 1 944 1
Chr01__1107 1 1107 1
Chr01__1118 1 1118 1
Chr01__1146 1 1146 1
Chr01__1160 1 1160 1
...
...
...
Chr17__214708367 17 214708367
Chr17__214708424 17 214708424
Chr17__214708451 17 214708451
Chr17__214708484 17 214708484
Chr17__214708508 17 214708508

Note that for each row the snp_id field has the form Chr{chromosome}__{position} for the corresponding values of chromosome and position.

I have another file named window containing auxiliary data. This is a 5-column, white-space delimited CSV file which has the following format:

user@host:~$ cat window

seqname chromosome start end width
1 Chr1 1 15000 15000
2 Chr1 15001 30000 15000 
3 Chr1 30001 45000 15000
4 Chr1 45001 60000 15000 
5 Chr1 60001 75000 15000 
6 Chr1 75001 90000 15000 
...
...
...
199954 Chr17 214620001 214635000 15000
199955 Chr17 214635001 214650000 15000
199956 Chr17 214650001 214665000 15000
199957 Chr17 214665001 214680000 15000
199958 Chr17 214680001 214695000 15000
199959 Chr17 214695001 214708580 13580

Note the correspondence between the window and snp_data files determined by the value of the chromosome field in the window file and the values of the chromosome and snp_id fields in the snp_data file, e.g. rows with a value of "Chr1" in window correspond to rows in snp_data with a value of 1 for chromosome and whose snp_id rows begin with a prefix of Chr01__.

For each row in the snp_data file (each snp within each chromosome), if that row's position value falls within the range given by the start and end values of any of the rows in window for that particular chromosome, then append the seqname from the window file to the row from the snp_data file.

For the input given above, this would be my desired output:

user@host:~$ cat desired_output

snp_id  chromosome  position   window
Chr01__912  1   912      1
Chr01__944  1   944      1
Chr01__1107 1   1107     1
...
...
...
Chr17__214708367 17 214708367   199959
Chr17__214708424 17 214708424   199959
Chr17__214708451 17 214708451   199959
Chr17__214708484 17 214708484   199959
Chr17__214708508 17 214708508   199959

The main point is that positions are unique only within each chromosome, so I need to compare these 2 files chromosome by chromosome (i.e.. for each chromosome separately). How can I do this?

Can we assume, the width always the same - 15000? If, yes, then we don't need the window file at all. We can do something like this: awk '{print $0, int($3 / 15000 + 1)}' snp_data and get the right result. — MiniMax
– MiniMax, Commented Nov 26, 2017 at 0:19
I originally misinterpreted your question. I just updated my solution and also rewrote the question to make it clearer and address some ambiguities. Please double-check the question statement to make sure that I haven't misinterpreted anything. — igal
– igal, Commented Nov 26, 2017 at 2:23
@Anna1364 I updated the solution to remove the double quotes. — igal
– igal, Commented Nov 26, 2017 at 4:10
@Anna1364 is the "position" in snp_data actually two numbers "912 1" for initial rows, and only 1 number for the last rows? — Gnudiff
– Gnudiff, Commented Nov 28, 2017 at 11:18

igal · Accepted Answer · 2017-11-26 04:10:06Z

Here's a Python script that should do what you want:

#!/usr/bin/env python2
# -*- coding: ascii -*-
"""intersect_snp.py"""

import sys

# Read data from the SNP file into a list
snp_list = []
with open(sys.argv[1], 'r') as snp_file:
    for line in snp_file:
        snp_row = line.split() 
        snp_list.append(snp_row)

# Read data from the "window" file into a dictionary of lists
win_dict = {} 
with open(sys.argv[2], 'r') as win_file:
    for line in win_file:
        seqnames, chromosome, start, end, width = win_row = line.split()
        if chromosome not in win_dict:
            win_dict[chromosome] = []
        win_dict[chromosome].append(win_row)

# Compare data and compute results
results = []

# Iterate over snp data rows
for snp_row in snp_list:

    # Extract the field values for each snp row
    snp_id, chromosome, position = snp_row

    # Convert the chromosome to match the format in the "window" file
    # i.e. `1` -> `Chr1`
    chromosome_name = "Chr{}".format(chromosome)

    # Iterate over the "window" rows corresponding to that chromosome
    for win_row in win_dict.get(chromosome_name, []):

        # Extract the field values for each "window" row
        seqnames, chromosome, start, end, width = win_row

        # Perform the desired comparison
        if int(start) <= int(position) <= int(end):

            # If the comparison returns true, construct the result row
            result = [snp_id, chromosome, position, seqnames]
            results.append(result)
            break

# Print the output column headers
columns = ["snp_id", "chromosome", "position", "window"]
print(" ".join(columns))

# Print the results
for row in results:
    print(' '.join(row))

Notice that this script assumes that all of your rows are data rows. If your input files are named snp_data and window then you might execute it like this:

python intersect_snp.py snp_data window

If your files have header rows then you might use tail to skip/remove the headers and execute it like this instead:

python intersect_snp.py <(tail -n+2 snp_data) <(tail -n+2 window)

Suppose that this is your snp_data file:

snp_id              chromosome  position
Chr01__912          1           912
Chr01__944          1           944
Chr01__1107         1           1107
...
...
...
Chr17__214708367    17          214708367
Chr17__214708424    17          214708424
Chr17__214708451    17          214708451
Chr17__214708484    17          214708484
Chr17__214708508    17          214708508

And that this is your window file:

seqnames    chromosome  start       end         width
1           Chr1        1           15000       15000
2           Chr1        15001       30000       15000
3           Chr1        30001       45000       15000
4           Chr1        45001       60000       15000
5           Chr1        60001       75000       15000
...
...
...
199954      Chr17       214620001   214635000   15000
199955      Chr17       214635001   214650000   15000
199956      Chr17       214650001   214665000   15000
199957      Chr17       214665001   214680000   15000
199958      Chr17       214680001   214695000   15000
199959      Chr17       214695001   214708580   13580

Then if we run this command:

python intersect_snp.py <(tail -n+2 snp_data) <(tail -n+2 window)

We get the following output:

snp_id chromosome position window
Chr01__912 Chr1 912 1
Chr01__944 Chr1 944 1
Chr01__1107 Chr1 1107 1
...
...
...
Chr17__214708367 Chr17 214708367 199959
Chr17__214708424 Chr17 214708424 199959
Chr17__214708451 Chr17 214708451 199959
Chr17__214708484 Chr17 214708484 199959
Chr17__214708508 Chr17 214708508 199959

very good, +1. int(start) <= int(position) and int(position) <= int(end) is idiomatically int(start) <= int(position) <= int(end) — iruvar
– iruvar, Commented Nov 26, 2017 at 2:25
@igal, thanks so much. I tried your python code with a small subset of data which works perfectly fine. But there is a problem when I run it for the entire dataset! I have nearly 10 million SNPs, when I run the script for the entire data-set it does not produce any output! I wonder what might be wrong....? — Anna1364
– Anna1364, Commented Nov 26, 2017 at 21:15
@Anna1364 Does it actually terminate without producing any output or does it just run for a really, really long time? I didn't put any effort into making it efficient. If your input is really large than it might take a long time or possibly hang or crash. — igal
– igal, Commented Nov 26, 2017 at 21:49
@Anna1364 If you post your data somewhere (e.g. GitHub or something) then I'll running the script on your data myself. — igal
– igal, Commented Nov 26, 2017 at 21:51

Gnudiff · Accepted Answer · 2017-11-28 11:32:26Z

To avoid large waiting times, you could do this with the minimalistic SQL engine SQLite that is frequently preinstalled in Linux. It doesn't run a server, and works with SQL databases that are stored in files.

In your snp_data & window directory do:

cat snp_data | tr -s " " > snp_data.csv
sed 's#Chr##g' window | tr -s " " > window.csv

This normalizes the spaces between fields and prepares them for import.

Then import this data into SQL and run the query to get the output:

cat > task.sql
CREATE TABLE snp_data (snp_id text,chromosome int, position int);
CREATE TABLE window (seqname int,chromosome int, c_start int , c_end int, c_width int);

.mode csv
.separator "  "
.import snp_data.csv snp_data
.import window.csv window

.mode column
.header on
SELECT D.snp_id, D.chromosome, D.position, W.seqname FROM snp_data D, window W WHERE W.chromosome=D.chromosome AND D.position BETWEN W.c_start AND W.c_end;

[CTRL+D here to stop input]

And finally:

cat task.sql | sqlite3 my_database.db

In general, this should be faster for large files.

Stack Exchange Network

intersection beween 2 files (values in file 1 which fall in range of values in file 2)

2 Answers 2

You must log in to answer this question.

Hot Network Questions

intersection beween 2 files (values in file 1 which fall in range of values in file 2)

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions