How to split a large file into two parts, at a pattern?
Given an example file.txt:
ABC
EFG
XYZ
HIJ
KNL
I want to split this file at XYZ so that file1 contains the lines up to XYZ and file2 contains the rest.
This is a job for csplit:
csplit -sf file -n 1 large_file /XYZ/
would silently split the file, creating pieces with prefix file and numbered using a single digit, e.g. file0 etc. Note that using /regex/ would split up to, but not including the line that matches regex. To split up to and including the line matching regex add a +1 offset:
csplit -sf file -n 1 large_file /XYZ/+1
This creates two files, file0 and file1. If you absolutely need them to be named file1 and file2 you could always add an empty pattern to the csplit command and remove the first file:
csplit -sf file -n 1 large_file // /XYZ/+1
creates file0, file1 and file2 but file0 is empty so you can safely remove it:
rm -f file0
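As a concrete run on the sample file from the question, here is a minimal sketch. The scratch directory and the prefix piece are hypothetical choices made here so the output names do not clash with file.txt:

```shell
# Recreate the sample file from the question in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' ABC EFG XYZ HIJ KNL > file.txt

# Split right after the first line matching XYZ.
# -s: silent, -f piece: output prefix, -n 1: one-digit suffix.
csplit -sf piece -n 1 file.txt '/XYZ/+1'

cat piece0   # ABC, EFG, XYZ
cat piece1   # HIJ, KNL
```

Dropping the +1 offset would move the XYZ line to the start of piece1 instead.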
With awk you can do:
awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile
Explanation: The first awk argument (out=file1) defines a variable holding the name of the output file used while the subsequent argument (largefile) is processed. The awk program prints every line to the file named by the variable out ({print >out}). When the pattern XYZ is found, the variable is redefined to point to the new file ({out="file2"}), which then receives all subsequent lines. Because the line is printed before the variable changes, the XYZ line itself ends up in file1.
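If the matching line should be left out of both files, the same variable-switching idea can be adapted. This is only a sketch of one possible variation, not part of the answer above; seen is a hypothetical flag name, and the scratch directory and file.txt are assumed for the demo:

```shell
# Recreate the sample file from the question in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' ABC EFG XYZ HIJ KNL > file.txt

# On the first XYZ line: switch the output file and skip the line.
# The "seen" flag (hypothetical name) means later XYZ lines are kept in file2.
awk '!seen && /XYZ/ { seen = 1; out = "file2"; next } { print > out }' out=file1 file.txt

cat file1   # ABC, EFG
cat file2   # HIJ, KNL
```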
With a modern ksh, here's a shell variant (i.e. without sed) of one of the sed-based answers:
{ read in <##XYZ ; print "$in" ; cat >file2 ;} <largefile >file1
And another variant in ksh alone (i.e. also omitting the cat):
{ read in <##XYZ ; print "$in" ; { read <##"" ;} >file2 ;} <largefile >file1
(The pure ksh solution seems to perform quite well; on a 2.4 GB test file it needed 19-21 sec, compared to 39-47 sec with the sed/cat based approach.)
read and print - you should just let it go to output all its own. The performance gets better if you build the AST toolkit wholly and get all of the ksh builtins compiled in - it's weird to me that sed isn't one of them, actually. But with stuff like while <file do I guess you don't need sed so much...
How does awk perform in your benchmark? And while I'm pretty sure ksh will likely always win this fight, if you're using a GNU sed you're not being very fair to sed - GNU's --unbuffered is a piss-poor approach to POSIXLY ensuring the descriptor's offset is left where the program quit it - there should be no need to slow down the regular operation of the program - buffering is fine - all sed should have to do is lseek the descriptor when finished. For whatever reason GNU reverses that mentality.
while; the printing is implicitly done as the defined side effect of the <## redirection operator. And only the matching line needs printing. (That way the shell feature implementation is most flexible for support of incl./excl.) I'd expect an explicit while loop to be significantly slower (but haven't checked).
head instead of the read; it seems only a little bit slower, but it's terser code: { head -1 <##XYZ ; { read <##"" ;} >file4 ;} <largefile >file3.
{ sed '/XYZ/q' >file1; cat >file2; } <infile
With GNU sed you should use the -u (--unbuffered) switch, so that sed does not read past the matching line and leave nothing for cat. Most other seds should just work though.
To leave XYZ out...
{ sed -n '/XYZ/q;p'; cat >file2; } <infile >file1
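For GNU sed specifically, the two commands above might look like the following. This is a minimal sketch assuming GNU sed (the -u flag is not POSIX) and the sample data from the question; the scratch directory and the names file3 and file4 are hypothetical, chosen so both variants can run side by side:

```shell
# Recreate the sample file from the question in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' ABC EFG XYZ HIJ KNL > infile

# Keep XYZ as the last line of file1; cat picks up where sed stopped reading.
# Assumes GNU sed: -u is the short form of --unbuffered.
{ sed -u '/XYZ/q' >file1; cat >file2; } <infile

# Leave XYZ out of both files (file3/file4 are hypothetical names).
{ sed -u -n '/XYZ/q;p'; cat >file4; } <infile >file3
```

Both commands rely on sed and cat sharing the same open file descriptor, so cat continues from the offset where sed quit.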
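Try this with GNU sed (both commands put the XYZ line in file1; the second one sends the remaining lines to standard output, which is redirected to file2):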
sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file
sed -e '1,/XYZ/{w file1' -e 'd}' large_file > file2
An easy hack is to print either to STDOUT or STDERR, depending on whether the target pattern has been matched. You can then use the shell's redirection operators to redirect the output accordingly. For example, in Perl, assuming the input file is called f and the two output files f1 and f2:
Discarding the line that matches the split pattern:
perl -ne 'if(/XYZ/){$a=1; next} ; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2
Including the matched line:
perl -ne '$a=1 if /XYZ/; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2
Alternatively, print to different file handles, so that lines before the match go to f1 and lines after it go to f2:
Discarding the line that matches the split pattern:
perl -ne 'BEGIN{open($fh1,">","f1");open($fh2,">","f2");}
if(/XYZ/){$a=1; next}$a==1 ? print $fh2 "$_" : print $fh1 "$_";' f
Including the matched line:
perl -ne 'BEGIN{open($fh1,">","f1"); open($fh2,">","f2");}
$a=1 if /XYZ/; $a==1 ? print $fh2 "$_" : print $fh1 "$_";' f