1

I have a data stored in three columns like this:

3651 3631 3913
3667 3996 4276
3674 4486 4605
3707 4706 5095
3720 5174 5326
3750 5439 5899
3755 5928 6263
3767 6437 7069
3779 7157 7232
3882 7384 7450
3886 7564 7649
3900 7762 7835
4006 7942 7987
4015 8236 8325
4026 8417 8464
4065 8571 8737
4156 6790 7069
4493 7157 7450
4541 7564 7649
4551 7762 7835
4597 7942 7987
4756 8236 8325
4776 8417 8464

where the 1st column is a specific value, 2nd column is start, and 3rd column is end. There are 825849 lines in the 1st column and 58386 in the 2nd and 3rd. I need to count values from the 1st if they are between start and end.

I know that in my file the first 12 specific values from column no.1 are between the first start and end, the next 5 are between the second start and end, and so on. I need to check whole file. I have tried with this, and it works but really slow:

coords='final_exons.txt'

snp=( $( cat $coords | awk '{print $1}') )
exon_start=( $( cat $coords | awk '{print $2}') )
exon_end=( $( cat $coords | awk '{print $3}') )

i=0
counter=0
for value in ${exon_end[@]}; do
    new_val=$counter
    counter=0
    let "i++"
    for snps in ${snp[@]}; do

        if [[ $value > $snps ]]; then
            #statements
            let "counter++"
            #$counter=$(echo "scale=2; $counter-$new_val" | bc)
        else
            #$new_val=$(echo "scale=2; $counter-$")
            break
        fi
    done
    #echo "NOWENOWE $new_val "
    #echo "COUNTER $value : $counter "
    final=$(echo "scale=2; sqrt(($counter-$new_val)^2)" | bc)
    echo "Exon $i : $final SNPs"
done

Thank you in advance for any hints and tips

link for full data

2 Answers 2

0
  1. Put each value in a line, but mark "start" and "end" by appending them S or E. Now sort the values numerically. You would get something like
3631S
3651
3667
...
3900
3913E
3996S
4006
...
  1. Count the occurrences between a S and an E terminated value.

Write the script and be happy!

awk '
    {print $1} 
    $3!="" {print $2"S"; print $3"E"} 
' final_exons.txt | sort -n | awk '
    !/E|S/ {count++; next}
    /S/ {count=0; next}
    /E/ {print line++": "count}'

The first awk and sort carry out step 1. There is a $3!="" test (it could also be $2!="") because there are more lines in the 1st column than in the 2nd or 3rd.

The second awk corresponds to step 2. It resets the counter if it reads lines with S, increments it if the line has no marker, and prints it (together with line number) if reads line with E.

The output when run on the sample file you provided is 0: 12; 1: 5; 2: 4; 3: 2; ...; 22: 0.

3
  • Can I ask about one more thing ? I've tested your script on my full data file and it's not working like I thought it would. For this sample data everything is correct, but on whole data there's something wrong and I can't figure it out. Commented Mar 15, 2020 at 23:16
  • I added a link to full data Commented Mar 15, 2020 at 23:33
  • @Hartek Check out my edit, I had forgotten -n option to sort and that wouldn't sort numbers correctly when they have differing number of digits! Commented Mar 15, 2020 at 23:58
0

I could use an awk:

awk '{if( ($1 > $2) && ($1 < $3) ){print NR" "$1" "$2" "$3}}' final_exons.txt

But if the col1 and ( col2 and col3 ) are trated separatelly you have better to tear them off into a separated files. Or put the data into the database and then do the interval test inside it. But the most efficient way could be to load the data into the regular array and the test loops realise in some other scripting language ( php, python, perl ...)

2
  • I need to count those values e.g. when the first 12 values from 1st column are between the first start and the first end, I want to count them and print like: 1: 12 2: 5 3: ... and so on Commented Mar 15, 2020 at 20:38
  • Sorry, the problem description is a bit nested. I edited my response (not answer). Commented Mar 15, 2020 at 21:02

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.