I have a file named snp_data containing SNP (Single-Nucleotide Polymorphism) chromosome data. This is a 3-column, whitespace-delimited file with the following format:
user@host:~$ cat snp_data
snp_id chromosome position
Chr01__912 1 912
Chr01__944 1 944
Chr01__1107 1 1107
Chr01__1118 1 1118
Chr01__1146 1 1146
Chr01__1160 1 1160
...
...
...
Chr17__214708367 17 214708367
Chr17__214708424 17 214708424
Chr17__214708451 17 214708451
Chr17__214708484 17 214708484
Chr17__214708508 17 214708508
Note that for each row the snp_id field has the form Chr{chromosome}__{position} for the corresponding values of chromosome and position.
I have another file named window containing auxiliary data. This is a 5-column, whitespace-delimited file with the following format:
user@host:~$ cat window
seqname chromosome start end width
1 Chr1 1 15000 15000
2 Chr1 15001 30000 15000
3 Chr1 30001 45000 15000
4 Chr1 45001 60000 15000
5 Chr1 60001 75000 15000
6 Chr1 75001 90000 15000
...
...
...
199954 Chr17 214620001 214635000 15000
199955 Chr17 214635001 214650000 15000
199956 Chr17 214650001 214665000 15000
199957 Chr17 214665001 214680000 15000
199958 Chr17 214680001 214695000 15000
199959 Chr17 214695001 214708580 13580
Note the correspondence between the window and snp_data files: the chromosome field in window matches both the chromosome field and the snp_id prefix in snp_data. For example, rows with a value of "Chr1" in window correspond to rows in snp_data that have a chromosome value of 1 and whose snp_id values begin with the prefix Chr01__.
For each row in the snp_data file (each SNP within each chromosome), if that row's position falls within the range given by the start and end values of a row in window for the same chromosome, then I want to append the seqname from that window row to the snp_data row.
For the input given above, this would be my desired output:
user@host:~$ cat desired_output
snp_id chromosome position window
Chr01__912 1 912 1
Chr01__944 1 944 1
Chr01__1107 1 1107 1
...
...
...
Chr17__214708367 17 214708367 199959
Chr17__214708424 17 214708424 199959
Chr17__214708451 17 214708451 199959
Chr17__214708484 17 214708484 199959
Chr17__214708508 17 214708508 199959
The main point is that positions are unique only within each chromosome, so I need to compare these 2 files chromosome by chromosome (i.e., for each chromosome separately). How can I do this?
window file at all. We can do something like this:
awk '{print $0, int($3 / 15000 + 1)}' snp_data
and get the right result.
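One caveat about the shortcut above: int($3 / 15000 + 1) gives a window index counted within each chromosome, so it agrees with seqname only where the numbering starts at 1 (Chr1 in the sample), and a position that is an exact multiple of 15000 lands one window too high. If the seqname values from window are needed as-is, here is a minimal sketch that does read the window file. It assumes that the windows for each chromosome are listed in ascending order and do not overlap (as in the sample), and that stripping "Chr" plus any leading zeros from the window chromosome gives the number used in snp_data. The function name annotate_snps is made up for this illustration.

```shell
# annotate_snps WINDOW_FILE SNP_FILE
# Hypothetical helper: appends the matching seqname to each SNP row.
annotate_snps() {
    awk '
    # First file (window): store every window per chromosome, in file order.
    NR == FNR {
        if (FNR == 1) next                  # skip the header line
        chr = $2
        sub(/^Chr0*/, "", chr)              # "Chr1" -> "1", "Chr17" -> "17"
        n[chr]++
        wstart[chr, n[chr]] = $3
        wend[chr, n[chr]]   = $4
        wname[chr, n[chr]]  = $1
        next
    }
    # Second file (snp_data): extend the header, then look up each position.
    FNR == 1 { print $0, "window"; next }
    {
        chr = $2 + 0
        pos = $3 + 0
        lo = 1
        hi = n[chr] + 0                     # 0 if the chromosome has no windows
        found = ""
        # Binary search over the sorted windows of this chromosome only.
        while (lo <= hi) {
            mid = int((lo + hi) / 2)
            if (pos < wstart[chr, mid])     hi = mid - 1
            else if (pos > wend[chr, mid])  lo = mid + 1
            else { found = wname[chr, mid]; break }
        }
        # Rows with no matching window are printed unchanged.
        if (found == "") print $0
        else             print $0, found
    }
    ' "$1" "$2"
}
```

With the files from the question this would be called as annotate_snps window snp_data > desired_output. Because the lookup is keyed on the chromosome first, equal positions on different chromosomes are never mixed up, and the binary search keeps the cost per SNP small even with roughly 200,000 windows.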