770

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python, but I'm wondering if there's any kind of ninja way to do this using Bash and Unix utilities (as opposed to manually looping and counting / partitioning lines).

5
  • 4
    Out of curiosity, after they're "split", how does one "combine" them? Something like "cat part2 >> part1"? Or is there another ninja utility? Mind updating your question? Commented Jan 6, 2010 at 22:47
  • 15
    To put it back together, cat part* > original Commented Jan 6, 2010 at 22:49
  • 13
    yes cat is short for concatenate. In general apropos is useful for finding appropriate commands. I.E. see the output of: apropos split Commented Jan 6, 2010 at 22:51
  • 4
    As an aside, OS X users should make sure their file contains Linux/Unix-style line breaks (LF) instead of classic Mac OS-style line endings (CR) - the split and csplit commands will not work if your line breaks are carriage returns instead of line feeds. TextWrangler from Bare Bones Software can help with this if you're on Mac OS: you can choose how your line break characters look when you save (or Save As...) your text files. Commented Oct 21, 2012 at 21:34
  • 2
    binary version: unix.stackexchange.com/questions/1588/… Commented Apr 26, 2016 at 12:22

14 Answers

1188

Have a look at the split command:

For split (GNU coreutils) 8.32:

$ split --help
Usage: split [OPTION]... [FILE [PREFIX]]
Output pieces of FILE to PREFIXaa, PREFIXab, ...;
default size is 1000 lines, and default PREFIX is 'x'.

With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   generate suffixes of length N (default 2)
      --additional-suffix=SUFFIX  append an additional SUFFIX to file names
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of records per output file
  -d                      use numeric suffixes starting at 0, not alphabetic
      --numeric-suffixes[=FROM]  same as -d, but allow setting the start value
  -x                      use hex suffixes starting at 0, not alphabetic
      --hex-suffixes[=FROM]  same as -x, but allow setting the start value
  -e, --elide-empty-files  do not generate empty output files with '-n'
      --filter=COMMAND    write to shell COMMAND; file name is $FILE
  -l, --lines=NUMBER      put NUMBER lines/records per output file
  -n, --number=CHUNKS     generate CHUNKS output files; see explanation below
  -t, --separator=SEP     use SEP instead of newline as the record separator;
                            '\0' (zero) specifies the NUL character
  -u, --unbuffered        immediately copy input to output with '-n r/...'
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
Binary prefixes can be used, too: KiB=K, MiB=M, and so on.

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation <https://www.gnu.org/software/coreutils/split>
or available locally via: info '(coreutils) split invocation'
$ 

You could do something like this:

split -l 200000 filename

which will create files each with 200000 lines named xaa xab xac ...

Another option, split by size of output file (still splits on line breaks):

split -C 20m --numeric-suffixes input_filename output_prefix

creates files like output_prefix00 output_prefix01 output_prefix02 ... each of maximum size 20 megabytes.
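
As an end-to-end sketch for the 2M-line case from the question (file names here are illustrative), using the -d and --additional-suffix options shown in the help text above:

split -l 200000 -d --additional-suffix=.txt bigfile.txt part_
wc -l part_*.txt          # each part should show 200000 lines, plus any remainder
cat part_*.txt > rejoined.txt && cmp bigfile.txt rejoined.txt   # round-trip check; cmp is silent on success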


14 Comments

you can also split a file by size: split -b 200m filename (m for megabytes, k for kilobytes or no suffix for bytes)
split by size and ensure files are split on line breaks: split -C 200m filename
split produces garbled output with Unicode (UTF-16) input. At least on Windows with the version I have.
@geotheory, be sure to follow LeberMac's advice earlier in the thread about first converting CR (Mac) line endings to LF (Linux) line endings using TextWrangler or BBEdit. I had the exact same problem as you until I found that piece of advice.
The -d option is not available on OS X; use gsplit instead. Hope this is useful for Mac users.
116

Use the split command:

split -l 200000 mybigfile.txt

2 Comments

And can we set the maximum number of output files? For example, split the big file but don't exceed 50 output files, even if there are lines remaining in the big file.
@Dr.jacky -n X seems to do what you want?
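
For instance, to cap the output at 50 files while keeping lines intact (the l/N form from the help text above; file and prefix names are illustrative):

split -n l/50 mybigfile.txt part_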
53

Yes, there is a split command. It will split a file by lines or bytes.

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.

3 Comments

Tried georgec@ATGIS25 ~ $ split -l 100000 /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands/trc_longlands.txt but there are no split files in the directory. Where is the output?
It should be in the same directory. E.g. if I want to split by 1,000,000 lines per file, do the following: split -l 1000000 train_file train_file. and in the same directory I'll get train_file.aa with the first million, then train_file.ab with the next million, etc.
@GeorgeC and you can get custom output directories with the prefix: split input my/dir/.
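
A short sketch of that prefix trick (directory and file names are illustrative); note that split does not create directories, so make them first:

mkdir -p my/dir
split -l 1000000 train_file my/dir/train_file.
ls my/dir    # train_file.aa, train_file.ab, ...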
43

To split a large text file into smaller files of 1000 lines each:

split <file> -l 1000

To split a large binary file into smaller files of 10M each:

split <file> -b 10M

To consolidate split files into a single file:

cat x* > <file>

Split a file, each split having 10 lines (except the last split):

split -l 10 filename

Split a file into 5 files. The file is split such that each piece has the same size (except the last):

split -n 5 filename

Split a file with 512 bytes in each split (except the last split; use 512k for kilobytes and 512m for megabytes):

split -b 512 filename

Split a file with at most 512 bytes in each split without breaking lines:

split -C 512 filename
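
To confirm that a split-and-reassemble round trip is lossless, a byte comparison works (file names are illustrative); the x* glob sorts in the same order split assigns suffixes:

split -l 10 data.txt
cat x* > rejoined.txt
cmp data.txt rejoined.txt && echo "files match"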

1 Comment

n files with the same number of lines appears to need wc unfortunately: stackoverflow.com/questions/3194349/…
20

Split the file "file.txt" into files of 10,000 lines each:

split -l 10000 file.txt

Comments

19

split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:

-n, --number=CHUNKS     generate CHUNKS output files; see explanation below

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

Thus, split -n 4 input output. will generate four files (output.a{a,b,c,d}) with the same number of bytes, but lines might be broken in the middle.

If we want to preserve full lines (i.e. split by lines), then this should work:

split -n l/4 input output.
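
And to stream just one of those line-preserving chunks to stdout, the l/K/N form from the table above applies; for example, the third quarter of the file (the output name is illustrative):

split -n l/3/4 input > third_quarter.txt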

Related answer: https://stackoverflow.com/a/19031247

Comments

18

Use split:

Split a file into fixed-size pieces; it creates output files containing consecutive sections of INPUT (standard input if none is given or INPUT is `-').

Syntax split [options] [INPUT [PREFIX]]

Comments

16

Use:

sed -n '1,100p' filename > output.txt

Here, 1 and 100 are the first and last line numbers to capture in output.txt.
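
To split the whole file this way rather than just the first slice, the sed call needs a loop; a minimal sketch assuming 100 lines per piece (file and output names are illustrative):

total=$(wc -l < filename)
start=1
part=1
while [ "$start" -le "$total" ]; do
    end=$((start + 99))
    sed -n "${start},${end}p" filename > "output_${part}.txt"
    start=$((end + 1))
    part=$((part + 1))
done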

2 Comments

This only obtains the first 100 lines; you need to loop it to successively split the file into the next 101..200, etc. Or just use split, like all the top answers here already tell you.
This was actually what I was looking for!
16

You can also use AWK:

awk -vc=1 'NR%200000==0{++c}{print $0 > c".txt"}' largefile
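
If the split produces many output files, some awk implementations can run out of open file descriptors; a hedged variant that closes each chunk before starting the next (the chunk_ naming is my own):

awk -v lines=200000 '
    NR % lines == 1 { if (out) close(out); out = sprintf("chunk_%04d.txt", ++c) }
    { print > out }
' largefile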

2 Comments

awk -v lines=200000 -v fmt="%d.txt" '{print>sprintf(fmt,1+int((NR-1)/lines))}'
with prefix: awk -vc=1 'NR%200000==0{++c}{print $0 > "prefix"c".txt"}' largefile
13

Here is an example dividing the file "toSplit.txt" into smaller files of 200 lines named "splited00.txt", "splited01.txt", ..., "splited25.txt", ...

split -l 200 --numeric-suffixes --additional-suffix=".txt" toSplit.txt splited
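
If the split might produce more than 100 pieces, the default two-character numeric suffixes run out; widening them with the -a option shown in the help text above avoids that (a sketch using the same names as the answer):

split -l 200 -a 3 --numeric-suffixes --additional-suffix=".txt" toSplit.txt splited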

1 Comment

This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review
12

If you just want to split into files of x lines each, the given answers about split are fine. But I am curious why no one paid attention to the requirements:

  • "without having to count them" -> using wc + cut
  • "having the remainder in extra file" -> split does by default

I couldn't do that without "wc + cut", but I'm using it:

split -l $(expr `wc -l $filename | cut -d ' ' -f1` / $chunks) $filename

This can be easily added to your .bashrc file functions, so you can just invoke it, passing the filename and chunks:

 split -l $(expr `wc -l $1 | cut -d ' ' -f1` / $2) $1

If you want just x chunks with no remainder in an extra file, adapt the formula to add (chunks - 1) to the line count of each file. I use this approach because usually I want x number of files rather than x lines per file:

split -l $(expr `wc -l $1 | cut -d ' ' -f1` / $2 + $2 - 1) $1

You can add that to a script and call it your "ninja way", because if nothing suits your needs, you can build it :-)
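
As a .bashrc sketch of that last variant, using ceiling division so the file count never exceeds the requested number of chunks (the function name split_into is my own invention):

# usage: split_into FILE CHUNKS  -> at most CHUNKS output files
split_into() {
    local total
    total=$(wc -l < "$1")
    split -l "$(( (total + $2 - 1) / $2 ))" "$1"
}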

1 Comment

Or, just use the -n option of split.
2

HDFS getmerge of small files, then split into files of a proper size.

This method can break lines mid-record, since it splits by bytes:

split -b 125m compact.file -d -a 3 compact_prefix

I use getmerge and then split the result into files of about 128 MB each.

# Split into 128 MB, and judge sizeunit is M or G. Please test before use.

begainsize=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $1}' `
sizeunit=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $2}' `
if [ $sizeunit = "G" ];then
    res=$(printf "%.f" `echo "scale=5;$begainsize*8 "|bc`)
else
    res=$(printf "%.f" `echo "scale=5;$begainsize/128 "|bc`)  # Celling ref http://blog.csdn.net/naiveloafer/article/details/8783518
fi
echo $res
# Split into $res files with a number suffix. Ref:  http://blog.csdn.net/microzone/article/details/52839598
compact_file_name=$compact_file"_"
echo "compact_file_name: "$compact_file_name
split -n l/$res $basedir/$compact_file -d -a 3 $basedir/${compact_file_name}

2 Comments

What is "HDFS"? Hadoop distributed file system? Or something else? Can you provide a reference to it?
What are "celling" and "begain"? Is the latter "begin" (or "start")?
2

You could use the split command in Linux to split files as required:

split -l <number of lines per output file> <filename> <prefix>

split -l 1000 file.txt split_file

This will split file.txt into files of 1000 lines each, saved as

split_fileaa

split_fileab

split_fileac

etc.

Comments

1

TL;DR: If you're using split to break up newline-delimited strings of text into some integer N chunks, you most likely want -n l/N, not -n N. The former will only split on newlines, even if the number of lines doesn't divide N evenly. The latter will split your records in other ways, I think by "word," though I am not certain of that.

@Matiji66 alluded to this, but that example is about reading bytes and I failed to notice it when I first glanced at this page, so I'm going to say it explicitly here for those who may stumble on this answer in the future.

Suppose one has a text file that is a list of newline-delimited strings (say, paths). Given the way shell appending works, and the way while read line; do echo $line; done < myfile.txt works, one might expect that to break up this file into N roughly equal pieces you'd say:

# This won't split only at newlines
split -n N myfile.txt myprefix

This is quite wrong! The records, be they paths or something else, will be split at (not completely) random characters, because split doesn't do the accounting per newline, even though the manpage says the record separator is a newline. The instructions at the end of the manpage show how to indicate that you don't want lines to be split, or in what way you want the splitting to happen, but they were written in a way I found confusing. The correct way to ask for the above is:

# This will split only at newlines, so far as I can tell.
split -n l/N myfile.txt myprefix

I think that the default behavior may not split at truly random characters. It may apply some sort of character precedence that tries to keep things it thinks are 'words' together, which could be nice if you're paginating a huge hunk of ASCII text (for example). In my experimentation with doing a find ... | sort -V > myfile.txt and then manually inspecting the output of the split -n N usage I wrote above, it appears to like to keep lines together but doesn't strictly adhere to this; however, it also doesn't produce files of uniform character count according to wc either.

I find this default behavior bizarre, but I'm sure some old head could pipe up at this juncture and explain why it would have made perfect sense to the authors of the split tool, one of whom was apparently Richard Stallman.
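
A quick way to see the difference for yourself (file and prefix names are illustrative; seq just generates numbered lines):

seq 1000 > myfile.txt
split -n 3 myfile.txt byte_       # byte-based chunks; can end mid-number
split -n l/3 myfile.txt line_     # line-based chunks; end only at newlines
tail -c 8 byte_aa; echo           # likely shows a truncated final line
tail -c 8 line_aa; echo           # always ends at a complete line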

1 Comment

split -n l/N requires you to supply the value of N, but split -l 1 works fine for me.
