3

Input file looks something like this:

chr1    1    G    300
chr1    2    A    500
chr1    3    C    200
chr4    1    T    35
chr4    2    G    400
chr4    3    C    435
chr4    4    A    223
chr4    5    T    400
chr4    6    G    300
chr4    7    G    340
chr4    8    C    400

The actual file is too big to process, so I want to output a smaller file filtering by chromosome (column 1) and position (column 2) within a specific range.

For example, I'm looking for a Linux command (sed, awk, grep, etc.) that will filter by chr4 from positions 3 to 7. The desired final output is:

chr4    3    C    435
chr4    4    A    223
chr4    5    T    400
chr4    6    G    300
chr4    7    G    340

I don't want to modify the original file.

4 Answers 4

9

The solution for potentially unsorted input file:

sort -k1,1 -k2,2n file | awk '$1=="chr4" && $2>2 && $2<8'

The output:

chr4    3    C    435
chr4    4    A    223
chr4    5    T    400
chr4    6    G    300
chr4    7    G    340

If the input file is sorted it's enough to use:

awk '$1=="chr4" && $2>2 && $2<8' file
5
  • +1. Also, if the file is huge, it's worthwhile avoiding unnecessary processing of it. If the file is sorted with sort -k1,1V -k2,2n (note the V version sort on field 1, requires GNU sort), then $1 > "chr4" {exit}; can be added to the start of the awk script to exit immediately after all chr4 lines have been processed. without GNU sort (but input is still sorted), you'd need to use a flag variable to exit early after the chr4 lines: e.g. awk '$1 != "chr4" && found {exit} ; $1=="chr4" && $2>=3 && $2<=7 {print ; found=1}' Commented Aug 7, 2017 at 2:22
  • 2
    OTOH, if the file is huge, it would be better to sort awk's output rather than awk's input. Commented Aug 7, 2017 at 2:31
  • @cas, I've wrote the solution for potentially unsorted input file, so sorting should be made beforehand Commented Aug 7, 2017 at 5:28
  • I’m not sure why one would need to sort at all. There’s no such requirement in the question. Commented Aug 7, 2017 at 7:07
  • @DavidFoerster, read my notation in the answer Commented Aug 7, 2017 at 7:24
7

awk is probably the best tool for the job.  A simple solution, which is similar to one already given, but which actually uses the parameters you specified, is:

awk '$1=="chr4" && $2>=3 && $2<=7'

You may prefer a more general solution, which involves putting the awk command in a shell script, is:

#!/bin/sh
if [ "$#" -lt 3 ]
then
        echo "Usage:    $0 chromosome low_position high_position"
        exit 1
fi
chr="$1"
lo="$2"
hi="$3"
shift 3
awk -vchromo="$chr" -vpos1="$lo" -v pos2="$hi" '$1==chromo && $2>=pos1 && $2<=pos2' "$@"

If run with fewer than three arguments, this reminds you what the arguments should be, and exits.  Otherwise, it saves the first three arguments into shell variables, and then shifts them off the argument list.  Then it invokes awk, passing the shell variable values in as awk variables.

You can invoke this as any of the following:

./myscript chr4 3 7   data

or

./myscript chr4 3 7 < data

or

(some_other_process) | ./myscript chr4 3 7
and, in any case, redirect the output to the new file with >.

4

You could accomplish this with grep:

grep -e '^chr4\s\+[3-7]' input

where expression is: ^chr4 lines beginning with chr4, \s\+ one or more space character, [3-7] matches one digit at range 3 to 7.

Perhaps more useful is to use head or tail to give you as many lines as you want instead of matching them with grep (using grep only to match the first column).

grep -e '^chr4\s\+' input| tail -n +3| head -n 5

grep matches lines beginning with chr4, tail gives lines starting from 3rd line and using head limit output to the first 5 lines (lines 3 to 7) .

5
  • 2
    (1) Note that this solution is very difficult to adapt to the case where the upper bound on the position (column 2) is > 9. For example, how would you adapt this solution to find positions between 7 and 13? (2) If we assume that position numbers can be greater than 9, this answer is broken, as it will match lines where the position is between 30 and 79, or between 300 and 799, etc. Commented Aug 6, 2017 at 21:26
  • @Scott indeed, hence I suggested using head and tail (assuming the rows are sorted and without gaps). Commented Aug 6, 2017 at 21:28
  • But that would require a lot of line counting.  (I believe that the issue of “gaps” isn't really an issue here.  Note that the sample file has data for chr4, but not chr2 or chr3.  Also, there are more rows for chr4` than there are for chr1.)  E.g., if the OP wants to get (some of) the data for chromosome 42,  they would have to figure out how many lines there are for chromosomes 1-41. Commented Aug 6, 2017 at 21:36
  • 1
    @Scott, Not finding multiple digits in the second column can easily be solved by requiring whitespace after the single matching digit. For example: grep -e "^chr4\\s+[3-7]\\s" input. Note that you should also require some whitespace (1 or more) between the chr4 and the digit (i.e. \\s+), not just 0 or more (\\s*). Commented Aug 6, 2017 at 21:44
  • My original answer was a bit unclear, now updated the answer to include example using grep, head and tail. Commented Aug 6, 2017 at 21:54
0

You could use the split utility.

split -p 'chr4    (3|8)' -a 1 my_file output
  • split a file into multiple parts (basically the inverse of cat)
  • p to split on the extended regular expression 'chr4 (3|8)'
  • -a 1 to suffix the created files with a single character
  • output is the prefix name for each created file

Now the file outputb will contain your desired output. You could also modify this to place each chromosone into its own file.

2
  • What version of split supports this -p option?  GNU doesn’t; POSIX doesn’t; and man7.org doesn’t know about it. Commented Aug 7, 2017 at 3:03
  • @Scott I'm on a Mac, so the BSD version I guess? I think csplit offers the same functionality on most *nix platforms. Commented Aug 7, 2017 at 3:18

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.