split file into two parts, at a pattern

Question

How to split a large file into two parts, at a pattern?

Given an example file.txt:

ABC
EFG
XYZ
HIJ
KNL

I want to split this file at XYZ so that file1 contains the lines up to XYZ and file2 contains the rest.

don_crissti · Accepted Answer · 2015-05-10 22:37:08Z

This is a job for csplit:

csplit -sf file -n 1 large_file /XYZ/

would silently split the file, creating pieces with prefix file and numbered using a single digit, e.g. file0 etc. Note that using /regex/ would split up to, but not including the line that matches regex. To split up to and including the line matching regex add a +1 offset:

csplit -sf file -n 1 large_file /XYZ/+1

This creates two files, file0 and file1. If you absolutely need them to be named file1 and file2 you could always add an empty pattern to the csplit command and remove the first file:

csplit -sf file -n 1 large_file // /XYZ/+1

creates file0, file1 and file2 but file0 is empty so you can safely remove it:

rm -f file0
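
As a quick check of the offset behaviour, the sketch below runs both forms on the five-line sample from the question. The scratch directory and the prefixes excl and incl are assumptions made for illustration; the csplit options are the ones used above.

```shell
# Work in a scratch directory so existing files are not overwritten.
cd "$(mktemp -d)" || exit 1

# Recreate the sample from the question.
printf '%s\n' ABC EFG XYZ HIJ KNL > large_file

# Without an offset, the matching line starts the second piece.
csplit -sf excl -n 1 large_file /XYZ/

# With +1, the matching line ends the first piece.
csplit -sf incl -n 1 large_file /XYZ/+1

# Show all four pieces, each preceded by its file name.
head excl0 excl1 incl0 incl1
```

Here excl1 begins with XYZ, while incl0 ends with it.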

Janis · Accepted Answer · 2015-05-10 14:37:52Z

With awk you can do:

awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile

Explanation: The first awk argument (out=file1) defines a variable holding the name of the output file; it takes effect when the subsequent argument (largefile) is processed. The awk program prints every line to the file named by the variable out ({print >out}). When a line matches the pattern XYZ, it is printed first and then the variable is redefined to point to the new file ({out="file2"}), which becomes the target for the subsequent data lines.
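
The order of the two rules decides which file receives the XYZ line. The sketch below runs the command on the question's sample and then a reordered variant; the scratch directory and the names alt1 and alt2 are assumptions for illustration only.

```shell
# Work in a scratch directory so existing files are not overwritten.
cd "$(mktemp -d)" || exit 1
printf '%s\n' ABC EFG XYZ HIJ KNL > largefile

# As in the answer: XYZ is printed to file1, and only then is the target switched.
awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile

# Reordered variant: the target is switched before printing,
# so XYZ becomes the first line of the second file instead.
awk '/XYZ/{out="alt2"}; {print >out}' out=alt1 largefile

head file1 file2 alt1 alt2
```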

References:

gawk manual: Redirection http://www.gnu.org/software/gawk/manual/html_node/Redirection.html#Redirection

Janis · Accepted Answer · 2015-05-10 13:31:06Z

With a modern ksh here's a shell variant (i.e. without sed) of one of the sed-based answers:

{ read in <##XYZ ; print "$in" ; cat >file2 ;} <largefile >file1

And another variant in ksh alone (i.e. also omitting the cat):

{ read in <##XYZ ; print "$in" ; { read <##"" ;} >file2 ;} <largefile >file1

(The pure ksh solution seems to be quite performant; on a 2.4 GB test file it needed 19-21 sec, compared to 39-47 sec with the sed/cat based approach.)

It's very fast. But I don't think you need to read and print - you should just let it go to output all its own. The performance gets better if you build the AST toolkit wholly and get all of the ksh builtins compiled in - it's weird to me that sed isn't one of them, actually. But with stuff like while <file do I guess you don't need sed so much...

– mikeserv · 2015-05-10 13:54:46Z

I am curious though - how did awk perform in your benchmark? And while I'm pretty sure ksh will likely always win this fight, if you're using a GNU sed you're not being very fair to sed - GNU's -u (--unbuffered) is a piss-poor approach to POSIXLY ensuring the descriptor's offset is left where the program quit it - there should be no need to slow down the regular operation of the program - buffering is fine - all sed should have to do is lseek the descriptor when finished. For whatever reason GNU reverses that mentality.

– mikeserv · 2015-05-10 14:05:38Z

@mikeserv; The redirection pattern match runs until the pattern is found, and the line containing the pattern is not printed unless you do so explicitly, as shown. (At least that is what my test showed.) Note that there's no while; the printing is done implicitly as the defined side effect of the <## redirection operator, and only the matching line needs printing. (That way the shell feature is most flexible, supporting both inclusion and exclusion of the matching line.) I'd expect an explicit while loop to be significantly slower (but haven't checked).

– Janis · 2015-05-10 14:07:04Z

@mikeserv; Ah, okay. BTW, I just tried the head instead of the read; it seems only a little bit slower, but it's terser code: { head -1 <##XYZ ; { read <##"" ;} >file4 ;} <largefile >file3.

– Janis · 2015-05-10 14:18:05Z

@mikeserv; Good point; it wasn't. But when I activate the builtin (just done and checked the results) it's the same numbers, strangely. (Maybe some function call overhead compared to read?)

– Janis · 2015-05-10 14:23:01Z

mikeserv · Accepted Answer · 2015-05-12 07:29:12Z

{ sed '/XYZ/q' >file1; cat >file2; } <infile

With GNU sed you should use the -u (--unbuffered) switch. Most other seds should just work though.

To leave XYZ out...

{ sed -n '/XYZ/q;p'; cat >file2; } <infile >file1
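
Both commands in the group read from the same open descriptor, so cat picks up wherever the first reader stopped. To illustrate that mechanism without depending on any particular sed, the sketch below swaps in a plain shell read loop for sed. This is a different technique from the one above, offered only as an illustration, and it will likely be much slower on large files. The scratch directory is an assumption, and the loop relies on the shell's read leaving the offset of a regular file just past the line it consumed.

```shell
# Work in a scratch directory so existing files are not overwritten.
cd "$(mktemp -d)" || exit 1
printf '%s\n' ABC EFG XYZ HIJ KNL > infile

{
  # Copy lines to file1 up to and including the first line containing XYZ.
  while IFS= read -r line; do
    printf '%s\n' "$line"
    case $line in *XYZ*) break ;; esac
  done > file1
  # cat inherits the same descriptor and continues after the XYZ line.
  cat > file2
} < infile

head file1 file2
```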

Cyrus · Accepted Answer · 2015-05-10 10:53:17Z

Try this with GNU sed:

sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file
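
Since the one-liner comes without explanation, here is the same command run on the question's sample with each part annotated. The scratch directory is an assumption for illustration; the sed script itself is unchanged.

```shell
# Work in a scratch directory so existing files are not overwritten.
cd "$(mktemp -d)" || exit 1
printf '%s\n' ABC EFG XYZ HIJ KNL > large_file

# -n               suppresses automatic printing; only the w commands write output.
# 1,/XYZ/w file1   writes line 1 through the first XYZ line to file1.
# /XYZ/,${ ... }   covers the first XYZ line through the last line;
#                  inside it, lines matching XYZ are deleted and the rest go to file2.
sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file

head file1 file2
```

Note that the d inside the second range drops every line matching XYZ from file2, not only the first one.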

Shorter: sed -e '1,/XYZ/{w file1' -e 'd}' large_file > file2

– don_crissti · 2015-05-10 11:30:18Z

Community · Accepted Answer · 2017-04-13 12:36:34Z

An easy hack is to print either to STDOUT or STDERR, depending on whether the target pattern has been matched. You can then use the shell's redirection operators to redirect the output accordingly. For example, in Perl, assuming the input file is called f and the two output files f1 and f2:

Discarding the line that matches the split pattern:

perl -ne 'if(/XYZ/){$a=1; next} ; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2

Including the matched line (it becomes the first line of f2):

perl -ne '$a=1 if /XYZ/; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2

Alternatively, print to different file handles:

Discarding the line that matches the split pattern:

perl -ne 'BEGIN{open($fh1,">","f1"); open($fh2,">","f2");}
          if(/XYZ/){$a=1; next} $a==1 ? print $fh2 "$_" : print $fh1 "$_";' f

Including the matched line (as the first line of f2):

perl -ne 'BEGIN{open($fh1,">","f1"); open($fh2,">","f2");}
          $a=1 if /XYZ/; $a==1 ? print $fh2 "$_" : print $fh1 "$_";' f
