Splitting file for every 10000 numbers ( not lines )

Question

I have a file that looks like the following:

chr19   61336212        +       0       0       CG      CGT    
chr19   61336213        -       0       0       CG      CGG    
chr19   61336218        +       0       0       CG      CGG    
chr19   61336219        -       0       0       CG      CGC    
chr19   61336268        +       0       0       CG      CGG    
chr19   61336269        -       0       0       CG      CGA    
chr19   61336402        +       0       0       CG      CGG    
chr19   61336403        -       0       0       CG      CGT

I want to split this file at every interval of 10000 in the 2nd field (NOT every 10000 lines, but every 10000 numbers). So for this file, the first output should run from the first line (the line with 61336212) up to the line with 61346211 (61336212+9999), the next from 61346212 to 61356211, and so on. As you can see, the numbers in the 2nd field/column are not 'filled' (there are gaps between them).

Is there a way to do this?

In your example, if the next number after 61346211 is 61346220, say, would you expect the second file of output to cover the range starting at 61346212 or 61346220?
– Joe Lee-Moyet, Commented Aug 17, 2015 at 18:09

Stéphane Chazelas · Accepted Answer · 2015-08-17 20:01:05Z

13

awk 'NR==1 {n=$2}
     {
       file = sprintf("file.%.4d", ($2-n)/10000)
       if (file != last_file) {
         close(last_file)
         last_file = file
       }
       print > file
     }'

Would write to file.0000, file.0001... (the number being int(($2-n)/10000) where n is $2 for the first line).
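As a minimal sketch of that arithmetic (it only prints the file name each value would map to and writes no files), the values below are the question's first number, the last number of its first interval, the first number of its second interval, and one further value chosen for illustration:

```shell
# Feed four second-column values through the same bucket computation.
# n is the first value seen; int(($1-n)/10000) is the bucket index.
result=$(printf '%s\n' 61336212 61346211 61346212 61356212 |
  awk 'NR==1 {n=$1}
       { printf "%d -> file.%.4d\n", $1, int(($1-n)/10000) }')
printf '%s\n' "$result"
# 61336212 -> file.0000
# 61346211 -> file.0000
# 61346212 -> file.0001
# 61356212 -> file.0002
```

There is no explicit n = n + 10000 step because the bucket index is recomputed from scratch for every line; the file name changes exactly when the integer part of the quotient changes.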

Note that we close each file once we've stopped writing to it. Otherwise, you'd reach the limit on the number of simultaneously open files after a few hundred files (GNU awk can work around that limit, but then performance degrades quickly).

We're assuming those numbers are always going up.
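If the input might not be sorted, one possible approach is to sort numerically on the second field first. The ordering matters for two reasons: n is taken from the first line, and reopening a closed file with > truncates it, so a file revisited later would lose its earlier lines. The sketch below uses a hypothetical temporary directory and a made-up unsorted.txt purely for demonstration:

```shell
# Hypothetical scratch directory and sample data, created only for this demo.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'chr19 %s +\n' 61346300 61336212 61346212 61336213 > unsorted.txt

# Sort numerically on field 2 so each output file is written in a single run.
sort -k2,2n unsorted.txt |
awk 'NR==1 {n=$2}
     {
       file = sprintf("file.%.4d", int(($2-n)/10000))
       if (file != last_file) {
         if (last_file != "") close(last_file)
         last_file = file
       }
       print > file
     }'
# file.0000 holds 61336212 and 61336213; file.0001 holds 61346212 and 61346300.
```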

edited Aug 17, 2015 at 20:01

answered Aug 17, 2015 at 16:29 by Stéphane Chazelas

could you explain what is happening?
– FelixJN, Commented Aug 17, 2015 at 16:36

Could you explain what's going on here? Also, like the comment below, is there a way to have the output file name length be constant, such as file.0000, file.0001 instead of file.1, file.2 ... file.100 ... file.2320?
– agathusia, Commented Aug 17, 2015 at 19:15

@Fiximan, I don't feel I can explain much more without paraphrasing the code. What part do you find unclear?
– Stéphane Chazelas, Commented Aug 17, 2015 at 19:51

Well, I understand the filename generation file = ..., but how does the iteration work? There is no part that says n = n + 10000 nor a lower_boundary <= $2 < upper_boundary part. In general the whole if (file != last_file) { close(last_file) ; last_file = file } is out of my league
– FelixJN, Commented Aug 17, 2015 at 20:20

@Fixman, well yes, that's what I'd call paraphrasing if (file != last_file): if the current file is not the same as the previous file, close the previous file, so that only one file is open at a time (we don't need to keep them all open as other solutions do).
– Stéphane Chazelas, Commented Aug 17, 2015 at 20:33

Community · Accepted Answer · 2017-04-13 12:38:58Z

7

Hack one-liner version. Perhaps more suitable for Code Golf than this forum though. This generates split1, split2, split3 and so on, as filenames.

awk '{if($2>b+9999){a++;b=$2}print >"split" a}' file.txt

Naming the output files split001, split002, split003 involves this extra sprintf:

awk '{if($2>b+9999){a++;b=$2}print >sprintf("split%03d",a)}' file.txt

To avoid the gawk slowdown issue identified by @Stéphane Chazelas, use perl:

perl -ne '(undef,$a)=split(/\s+/,$_);if($a>$b+9999){$c++;$b=$a}open(D,sprintf(">>ysplit%03d",$c));print D' <file.txt
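Note that these one-liners start a new chunk at the first value that exceeds the current anchor by more than 9999, so the chunk boundaries follow the data rather than a fixed grid of 10000. A small sketch of the difference, using made-up values in a hypothetical demo.txt (the target of > is parenthesized here only for portability across awk implementations, and the grid prefix is an illustrative name):

```shell
# Hypothetical scratch directory and sample data for the comparison.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'chr19 %s\n' 100000 109999 112000 119999 121000 122000 > demo.txt

# Data-anchored chunks, as in the one-liner above.
awk '{if($2>b+9999){a++;b=$2}print > ("split" a)}' demo.txt
# split1: 100000 109999 / split2: 112000 119999 121000 / split3: 122000

# Fixed grid of 10000 counted from the first value, closing finished files.
awk 'NR==1 {n=$2}
     { f = sprintf("grid%03d", int(($2-n)/10000))
       if (f != last) { if (last != "") close(last); last = f }
       print > f }' demo.txt
# grid000: 100000 109999 / grid001: 112000 119999 / grid002: 121000 122000
```

Which behavior is wanted depends on how the question is read: the fixed grid matches "61336212 to 61346211, then 61346212 to 61356211".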

edited Apr 13, 2017 at 12:38 by CommunityBot

answered Aug 17, 2015 at 16:35 by steve

For this method, is there a way to make the file names sort consecutively? This outputs split1 ... split100 ... split1000, but I'd like something more along the lines of split00001 ... split00100 ... split01000.
– agathusia, Commented Aug 17, 2015 at 16:45

Sure, extra sprintf magic now added.
– steve, Commented Aug 17, 2015 at 16:48

Note that if the input has 0, 9999, 12000, 19999, 21000, 22000, that puts 0, 9999 in file1, but 12000, 19999, 21000 in file2, which seems at odds with the requirements.
– Stéphane Chazelas, Commented Aug 17, 2015 at 19:58

Note that this would reach the limit on the number of simultaneously open files after a few hundred files (GNU awk can work around that limit, but then performance degrades quickly).
– Stéphane Chazelas, Commented Aug 17, 2015 at 19:58

Yeah. I just noticed the problem you mentioned.
– agathusia, Commented Aug 17, 2015 at 20:21

FelixJN · Accepted Answer · 2015-08-17 16:38:32Z

#!/bin/bash
# Second field of the first and last lines (awk's default field splitting
# handles runs of spaces or tabs).
first=$(head -n1 file | awk '{print $2}')
last=$(tail -n1 file | awk '{print $2}')
for (( i=first; i<=last; i+=10000 )); do
   # Append the lines whose 2nd field lies in [i, i+10000).
   awk -v start="$i" -v end="$((i+10000))" '$2 >= start && $2 < end' file \
   >> "interval_${i}_to_$((i+10000))"
done

Test with interval set to 100:

more inter*
::::::::::::::
interval_61336212_to_61346212
::::::::::::::
chr19   61336212        +       0       0       CG      CGT    
chr19   61336213        -       0       0       CG      CGG    
chr19   61336218        +       0       0       CG      CGG    
chr19   61336219        -       0       0       CG      CGC    
chr19   61336268        +       0       0       CG      CGG    
chr19   61336269        -       0       0       CG      CGA    
::::::::::::::
interval_61336312_to_61346312
::::::::::::::
chr19   61336402        +       0       0       CG      CGG    
chr19   61336403        -       0       0       CG      CGT

Note: this produces empty files for empty intervals. To remove them, add:

for file in interval* ; do
  if [ ! -s "$file" ] ; then
    rm "$file"
  fi
done

This reads the whole input file once for each step of the for loop, so it is not the most efficient approach.
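A single-pass alternative that keeps the same interval_START_to_END naming could look like the sketch below. It assumes the second field is sorted in ascending order (reopening a closed file with > would truncate it), and it never creates files for empty intervals. The size variable and the sample data are assumptions made for the demo:

```shell
# Hypothetical scratch directory and sample input named "file".
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'chr19 %s +\n' 61336212 61336213 61346212 61366212 > file

awk -v size=10000 '
  NR == 1 { first = $2 }
  {
    # Lower bound of the interval that contains this value.
    start = first + int(($2 - first) / size) * size
    out = sprintf("interval_%d_to_%d", start, start + size)
    if (out != prev) { if (prev != "") close(prev); prev = out }
    print > out
  }' file
# Creates interval_61336212_to_61346212, interval_61346212_to_61356212 and
# interval_61366212_to_61376212; the empty interval in between gets no file.
```

As in the loop version, the upper bound in each file name is exclusive.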

Costas · Accepted Answer · 2015-08-17 16:55:07Z

3

If you mean splitting by the values rather than by line count:

awk 'NR==1 || n+10000<$2{n=$2; portion++}{print > FILENAME "." portion}' file
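To keep only one output file open at a time, the same logic can close the previous chunk before starting the next one. This is a sketch with made-up sample values; the out variable is introduced here for illustration and also avoids an unparenthesized concatenation after >, which not every awk parses the same way:

```shell
# Hypothetical scratch directory and sample input named "file".
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'chr19 %s\n' 100000 105000 120000 125000 140000 > file

awk 'NR==1 || n+10000<$2 {
       if (out != "") close(out)   # finished with the previous chunk
       n = $2; portion++
       out = FILENAME "." portion
     }
     { print > out }' file
# file.1: 100000 105000 / file.2: 120000 125000 / file.3: 140000
```

Each output file is opened exactly once here, so closing it early does not risk truncating earlier output.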

answered Aug 17, 2015 at 16:55 by Costas

Note that if the input has 0, 9999, 12000, 19999, 21000, 22000, that puts 0, 9999 in file1, but 12000, 19999, 21000 in file2, which seems at odds with the requirements.
– Stéphane Chazelas, Commented Aug 17, 2015 at 20:28

Note that this would reach the limit on the number of simultaneously open files after a few hundred files (GNU awk can work around that limit, but then performance degrades quickly).
– Stéphane Chazelas, Commented Aug 17, 2015 at 20:28

@StéphaneChazelas I am not sure I understand you clearly. If you want 21000 in the 3rd file, use 9999 instead of 10000.
– Costas, Commented Aug 17, 2015 at 20:42

From my understanding of the question, the OP wants lines with 0 to 9999 in the first file, 10000 to 19999 in the second file.
– Stéphane Chazelas, Commented Aug 17, 2015 at 20:46
