2

I have a file containing SNP data called snp.bed, which looks like this:

head snp.bed

    Chr17   214708483   214708484   Chr17:214708484
    Chr17   214708507   214708508   Chr17:214708508
    Chr17   214708573   214708574   Chr17:214708574

I also have a file called intersect.bed, which looks like this:

head intersect.bed

    Chr17   214708483   214708484   Chr17:214708484 Chr17   214706266   214710783   gene50573
    Chr17   214708507   214708508   Chr17:214708508 Chr17   214706266   214710783   gene50573
    Chr17   214708587   214708588   Chr17:214708580 Chr17   214706266   214710783   gene50573

I want to print out a modified version of snp.bed which contains an extra column appended to each row. If a row in snp.bed matches the first 4 columns of a row in intersect.bed, then I want to print the entire row from snp.bed with an extra column obtained by adjoining the last column from the corresponding row in intersect.bed (the gene name). Alternatively, if a row from snp.bed does not match any row from intersect.bed then adjoin an extra column consisting of the string "NA" instead of the gene name.

This is my desired output:

head snp.matched.bed

    Chr17   214708483   214708484   Chr17:214708484   gene50573
    Chr17   214708507   214708508   Chr17:214708508   gene50573
    Chr17   214708573   214708574   Chr17:214708574   NA

How can I do this?

3
  • Did you try join ... man page for join Commented Nov 6, 2017 at 23:16
  • Where did the HanXRQ prefix in the fourth column of output come from? Commented Nov 6, 2017 at 23:23
  • @MiniMax, yes, sorry. I just dropped that. Commented Nov 7, 2017 at 16:32

3 Answers 3

3

This solution assumes, that files don't have spaces in the beginning of lines. What is different from your examples, which have these spaces.

awk '
{
    str = $1$2$3$4; 
}
FNR == NR {
    arr[str] = $NF;
}
FNR != NR {
    gene_name = arr[str] ? arr[str] : "NA";
    print $0, gene_name;
}' intersect.bed snp.bed 

Output

Chr17   214708483   214708484   Chr17:214708484 gene50573
Chr17   214708507   214708508   Chr17:214708508 gene50573
Chr17   214708573   214708574   Chr17:214708574 NA
3

Here is a solution using awk:

$ awk -F '\t' 'BEGIN{while(getline line<"intersect.bed") {N=split(line,a,"\t"); seen[a[1]"\t"a[2]"\t"a[3]"\t"a[4]]=a[N];}} {if(seen[$0]) {name=seen[$0];} else{name="NA"}; print $0 "\t" name}' snp.bed
Chr17       214708483       214708484       Chr17:214708484 gene50573
Chr17       214708507       214708508       Chr17:214708508 gene50573
Chr17       214708573       214708574       Chr17:214708574 NA

I am assuming single tab characters as the delimiter of both input files.

Note also that I interpreted "first 4th column" as "first four columns."

3

Personally, I think that for this kind of task it's probably best to use a "real" programming language. I like Python, so here's a Python script that does what you want (it's intentionally verbose so that you can understand it and modify it easily):

#!/usr/bin/env python2

# intersect.py

# Read data from the first file
snp_rows = []
with open("snp.bed", 'r') as snp_file:
    for row in snp_file:
        snp_rows.append(row.split())

# Read data from the second file
int_rows = []
with open("intersect.bed", 'r') as int_file:
    for row in int_file:
        int_rows.append(row.split())

# Compare data and compute results
results = []
for row in int_rows:
    if row[:4] in snp_rows:
        results.append(row[:4] + [row[-1]])
    else:
        results.append(row[:4] + ["NA"])

# Print the results
for row in results:
    print(' '.join(row))

Save it to a file and then execute it:

python2 intersect.py

And just for fun, here is a Bash solution using standard commands (just grep and cut):

while read row; do
    match="$(grep -F "${row}" intersect.bed)";
    if [[ -n "${match}" ]]; then
        echo "${row} $(echo ${match} | cut -d' ' -f8)";
    else
        echo "${row} NA";
    fi;
done < snp.bed

Note that in general it isn't recommended to use Bash to do serious text processing. See, for example, the following post:

2
  • Using the bash scripting for text processing are considered not good thing. Read this discussion: Why is using a shell loop to process text considered bad practice?. Commented Nov 6, 2017 at 23:50
  • @MiniMax Yes, you're right of course. That was just there for fun. I added a comment with your link to the bottom of my solution. Commented Nov 6, 2017 at 23:55

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.