8

I have a file that looks like the following:

chr19   61336212        +       0       0       CG      CGT    
chr19   61336213        -       0       0       CG      CGG    
chr19   61336218        +       0       0       CG      CGG    
chr19   61336219        -       0       0       CG      CGC    
chr19   61336268        +       0       0       CG      CGG    
chr19   61336269        -       0       0       CG      CGA    
chr19   61336402        +       0       0       CG      CGG    
chr19   61336403        -       0       0       CG      CGT    

I want to split this file on every interval of 10000 in the 2nd field (not every 10000 lines, but every 10000-wide range of values). For this file, the first output should run from the first line (the one with 61336212) up to and including 61346211 (61336212+9999), the next from 61346212 to 61356211, and so on. As you can see, the numbers in the 2nd field are not contiguous.

Is there a way to do this?

2
  • In your example, if the next number after 61346211 is 61346220, say, would you expect the second file of output to cover the range starting at 61346212 or 61346220? Commented Aug 17, 2015 at 18:09
  • The second range should start at 61346212. Commented Aug 17, 2015 at 19:13

4 Answers

13
awk 'NR==1 {n=$2}
     {
       file = sprintf("file.%.4d", ($2-n)/10000)
       if (file != last_file) {
         close(last_file)
         last_file = file
       }
       print > file
     }'

This writes to file.0000, file.0001, ... (the number being int(($2-n)/10000), where n is $2 from the first line).

Note that we close each file once we've stopped writing to it; otherwise you'd reach the limit on the number of simultaneously open files after a few hundred files (GNU awk can work around that limit, but performance then degrades quickly).

We're assuming those numbers are always going up.
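To see why no explicit loop over ranges is needed, here is a minimal sketch that only prints the file name computed for each value instead of writing any files. The helper name show_buckets and the sample values are made up for illustration (only 61336212 comes from the question); the arithmetic is the same int(($2-n)/10000) as above, applied to a one-column input.

```shell
# show_buckets is a hypothetical helper: it reads one number per line and
# prints which output file the awk script above would pick for it.
show_buckets() {
  awk 'NR==1 {n=$1}   # remember the first value as the origin of all ranges
       {
         # int() truncates, so every value from n+k*10000 to n+k*10000+9999 maps to k
         printf "%d -> file.%.4d\n", $1, int(($1-n)/10000)
       }'
}

printf '%s\n' 61336212 61346211 61346212 61356211 61356212 | show_buckets
# 61336212 -> file.0000
# 61346211 -> file.0000
# 61346212 -> file.0001
# 61356211 -> file.0001
# 61356212 -> file.0002
```

The file name changes exactly when a value crosses a multiple of 10000 above the first value, which is why comparing file with last_file is enough to know when to close the previous file.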

8
  • 3
    could you explain what is happening? Commented Aug 17, 2015 at 16:36
  • Could you explain what's going on here? Also, like the comment below, is there a way to make the output file names a constant length, such as file.0000, file.0001 instead of file.1, file.2 ... file.100 ... file.2320? Commented Aug 17, 2015 at 19:15
  • 1
    @Fiximan, I don't feel I can explain much more without paraphrasing the code. What part do you find unclear? Commented Aug 17, 2015 at 19:51
  • Well, I understand the filename generation file = ..., but how does the iteration work? There is no part that says n = n + 10000 nor a lower_boundary <= $2 < upper_boundary part. In general the whole if (file != last_file) { close(last_file) ; last_file = file } is out of my league Commented Aug 17, 2015 at 20:20
  • 1
    @Fiximan, well yes, that's what I'd call paraphrasing if (file != last_file): if the current file is not the same as the previous file, close the previous file, so that only one file is open at a time (we don't need to keep them all open as other solutions do). Commented Aug 17, 2015 at 20:33
7

Hack one-liner version. Perhaps more suitable for Code Golf than this forum though. This generates split1, split2, split3 and so on, as filenames.

awk '{if($2>b+9999){a++;b=$2}print >("split" a)}' file.txt

Naming the output files split001, split002, split003 takes an extra sprintf:

awk '{if($2>b+9999){a++;b=$2}print >sprintf("split%03d",a)}' file.txt

To avoid the gawk slowdown issue identified by @Stéphane Chazelas, use perl:

perl -ne '(undef,$a)=split(/\s+/,$_);if($a>$b+9999){$c++;$b=$a}open(D,sprintf(">>ysplit%03d",$c));print D' <file.txt
5
  • 1
    For this method, is there a way to make the file names sort consecutively? This outputs split1 ... split100 ... split1000, but I'd like something more along the lines of split00001 ... split00100 ... split01000. Commented Aug 17, 2015 at 16:45
  • 1
    Sure, extra sprintf magic now added. Commented Aug 17, 2015 at 16:48
  • Note that if the input has 0, 9999, 12000, 19999, 21000, 22000, that puts 0, 9999 in file1, but 12000, 19999, 21000 in file2, which seems at odds with the requirements. Commented Aug 17, 2015 at 19:58
  • 1
    Note that this would reach the limit on the number of simultaneously open files after a few hundred files (GNU awk can work around that limit, but performance then degrades quickly). Commented Aug 17, 2015 at 19:58
  • 1
    Yeah. I just noticed the problem you mentioned. Commented Aug 17, 2015 at 20:21
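The difference raised in the comments can be checked on the 0, 9999, 12000, 19999, 21000, 22000 example. The sketch below only prints a group number next to each value rather than writing files. The helper names fixed_grid and restarting are hypothetical, and restarting adds an NR==1 test (not in the one-liner above) so that an input starting at 0 still gets group 1.

```shell
# fixed_grid: ranges are anchored at the first value and never move
# (the rule asked for in the question).
fixed_grid() {
  awk 'NR==1 {n=$1} {print $1, int(($1-n)/10000)}'
}

# restarting: a new range begins at the first value that falls outside the
# current one (the rule used by the one-liner above, plus an NR==1 guard).
restarting() {
  awk 'NR==1 || $1 > b+9999 {a++; b=$1} {print $1, a}'
}

printf '%s\n' 0 9999 12000 19999 21000 22000 | fixed_grid
# 0 0
# 9999 0
# 12000 1
# 19999 1
# 21000 2
# 22000 2

printf '%s\n' 0 9999 12000 19999 21000 22000 | restarting
# 0 1
# 9999 1
# 12000 2
# 19999 2
# 21000 2
# 22000 3
```

With the restarting rule the second range starts at 12000 rather than 10000, so 21000 still falls inside it; with the fixed grid 21000 belongs to the third range.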
4
#!/bin/bash
# First and last values of the 2nd field (assumes the file is sorted on it)
first=$( head -n1 file | awk '{print $2}' )
last=$( tail -n1 file | awk '{print $2}' )
for (( i=first ; i<=last ; i=i+10000 )) ; do
   # Keep lines with start <= $2 < end; default field splitting handles spaces and tabs
   awk -v start="$i" -v end="$(( i+10000 ))" '$2 >= start && $2 < end' file \
   >> interval_"$i"_to_"$(( i+10000 ))"
done

Test with interval set to 100:

more inter*
::::::::::::::
interval_61336212_to_61346212
::::::::::::::
chr19   61336212        +       0       0       CG      CGT    
chr19   61336213        -       0       0       CG      CGG    
chr19   61336218        +       0       0       CG      CGG    
chr19   61336219        -       0       0       CG      CGC    
chr19   61336268        +       0       0       CG      CGG    
chr19   61336269        -       0       0       CG      CGA    
::::::::::::::
interval_61336312_to_61346312
::::::::::::::
chr19   61336402        +       0       0       CG      CGG    
chr19   61336403        -       0       0       CG      CGT  

Note: this produces empty files for empty intervals. To remove them, add:

for file in interval* ; do
  if [ ! -s "$file" ] ; then
    rm "$file"
  fi
done

This rereads the whole file on every iteration of the for loop, so it is not the most efficient approach.
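If the repeated reads matter, the same interval_START_to_END names can be produced in a single pass. This is only a sketch under the same assumption as the loop (the file is sorted on the 2nd field); the function name split_intervals and its optional width argument are made up for illustration. Unlike the loop, it truncates rather than appends to existing files and never creates empty ones.

```shell
# split_intervals FILE [WIDTH]: hypothetical single-pass variant of the loop above.
split_intervals() {
  awk -v width="${2:-10000}" '
    NR==1 {first=$2}                     # origin of the first interval
    {
      # lower bound of the interval that contains the current value
      start = first + int(($2-first)/width)*width
      file = sprintf("interval_%d_to_%d", start, start+width)
      if (file != last_file) {
        # input is assumed sorted, so the previous file is finished
        if (last_file != "") close(last_file)
        last_file = file
      }
      print > file
    }' "$1"
}

# Usage, with the default width of 10000:
#   split_intervals file
```

Only one output file is open at any time, so this also stays clear of the open-files limit mentioned in the first answer.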

3

If you mean splitting by value range rather than by line count:

awk 'NR==1 || n+10000<$2{n=$2; portion++}{print > (FILENAME "." portion)}' file
4
  • Note that if the input has 0, 9999, 12000, 19999, 21000, 22000, that puts 0, 9999 in file1, but 12000, 19999, 21000 in file2, which seems at odds with the requirements. Commented Aug 17, 2015 at 20:28
  • Note that this would reach the limit on the number of simultaneously open files after a few hundred files (GNU awk can work around that limit, but performance then degrades quickly). Commented Aug 17, 2015 at 20:28
  • @StéphaneChazelas I am not sure I understand you clearly. If you want 21000 in the 3rd file, use 9999 instead of 10000. Commented Aug 17, 2015 at 20:42
  • from my understanding of the question, the OP wants lines with 0 to 9999 in the first file, 10000 to 19999 in the second file. Commented Aug 17, 2015 at 20:46
