
I've got a CSV file that needs to be broken up into smaller files. That's no problem with split -l 20000 test.csv; my problem is that the file contains different headers. I want to split roughly every 1000 lines, but each split has to come after a pay header line, so that every new file starts with a cust header line.

cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
  • What should the output files look like? I assume it's not just one set of headers per file. Commented Jun 15, 2018 at 13:42
  • Will the cust header sections need to be broken into multiple files with cust header at the top of each file, e.g. cust header1 - lines 1-1000 and cust header1 - lines 1001-1999; cust header2 - lines 1-200; cust header3 - lines 1-1000? Commented Jun 15, 2018 at 17:18

2 Answers


You could do something like this with awk:

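awk -v filename=output -v cut=1000 '
    # nl counts lines since the last split; nf numbers the output files
    BEGIN { nl = 0; nf = 1; f = filename "." nf }
    # once at least cut lines have been seen, split at the next cust header
    ++nl >= cut && /^cust header,/ {
        close(f); nl = 0; f = filename "." ++nf }
    { print > f }' < file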

It keeps a count of the lines it has seen, and switches to a new output file once the count has reached cut (1000 here) and the current line starts with cust header,. Because it only splits at a cust header line, a file can run somewhat past cut lines. The output files are named output.1, output.2, ... (from the filename variable).
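To see the splitting rule in action, here is a small self-contained sketch that runs the same awk logic on the sample data from the question. The scratch directory from mktemp -d and the deliberately tiny cut=4 are assumptions made only for the demonstration; on real data you would keep cut=1000.

```shell
# Work in a scratch directory so the demo does not touch real files.
cd "$(mktemp -d)" || exit 1

# Sample input: the 13 lines shown in the question.
cat > test.csv <<'EOF'
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
EOF

# Same logic as above, with a small cut so the sample actually splits.
awk -v filename=output -v cut=4 '
    BEGIN { nl = 0; nf = 1; f = filename "." nf }
    ++nl >= cut && /^cust header,/ {
        close(f); nl = 0; f = filename "." ++nf }
    { print > f }' < test.csv

# Show how many lines landed in each output file.
wc -l output.*
```

With cut=4 the sample should end up in three files of 5, 4 and 4 lines, each starting with a cust header line and ending with a pay header line.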


ilkkachu's solution is slick, uses a single executable, and is probably the correct answer. However, I've just never been able to wrap my head around advanced uses of awk. Had ilkkachu not answered first, I might have opted for csplit. csplit will split a text file based on context lines (okay, regular expressions). You could then take that output and further split the files with the split utility you already know:

$ csplit --prefix="MySplit." test.csv '/^cust header,/' '{*}'
0
174
134
134

Those are the byte counts of each chunk (which we ignore in this case). Now, iterate over each MySplit chunk and further split to your 20k requirement:

$ for i in MySplit.*; do
    split --additional-suffix=".$i" -l 20000 "$i"
  done

For example, using -l 2 instead of 20k, the final output given your sample would be (sorted by extension):

$ ls -lhXB
total 44K
-rw-r--r-- 1 hunteke hunteke   0 Jun 15 13:31 MySplit.00
-rw-r--r-- 1 hunteke hunteke 174 Jun 15 13:31 MySplit.01
-rw-r--r-- 1 hunteke hunteke  67 Jun 15 13:27 xaa.MySplit.01
-rw-r--r-- 1 hunteke hunteke  81 Jun 15 13:27 xab.MySplit.01
-rw-r--r-- 1 hunteke hunteke  26 Jun 15 13:27 xac.MySplit.01
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.02
-rw-r--r-- 1 hunteke hunteke  67 Jun 15 13:27 xaa.MySplit.02
-rw-r--r-- 1 hunteke hunteke  67 Jun 15 13:27 xab.MySplit.02
-rw-r--r-- 1 hunteke hunteke 134 Jun 15 13:31 MySplit.03
-rw-r--r-- 1 hunteke hunteke  67 Jun 15 13:27 xaa.MySplit.03
-rw-r--r-- 1 hunteke hunteke  67 Jun 15 13:27 xab.MySplit.03
-rw-r--r-- 2 hunteke hunteke 442 Jun 15 13:06 test.csv
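Note that running split inside a chunk can start a file in the middle of a cust block. If every file has to start with a cust header line, one possible variation is to glue whole csplit chunks back together until a line limit is reached. The following is only a sketch under assumptions: it relies on GNU csplit options (-s to silence the byte counts, -z to drop the empty first chunk, -n 6 so the chunk names sort correctly), and the batch.N names and the tiny limit=9 are made up for the demonstration.

```shell
# Work in a scratch directory so the demo does not touch real files.
cd "$(mktemp -d)" || exit 1

# Sample input: the 13 lines shown in the question.
cat > test.csv <<'EOF'
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
detail header,xxxxxxxx,xxxxxxxxxx,xxx,,
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
cust header,xxx,xxxxxx,xxxxxx
txn header,xxxx,xxx,,xxxx,xxxxx,,xxx
detail header,xxxx,xxxx,xxxxxx,xxxx,xxxx
pay header,,,,,xxxx,xxxxx
EOF

# One chunk per cust block (GNU csplit; six-digit suffixes keep glob order).
csplit -s -z -n 6 --prefix="MySplit." test.csv '/^cust header,/' '{*}'

limit=9   # hypothetical line limit per output file (1000 in the question)
n=1       # number of the current batch file
count=0   # lines already written to the current batch file

for chunk in MySplit.*; do
    lines=$(wc -l < "$chunk")
    # Start a new batch if this chunk would push a non-empty batch past the limit.
    if [ "$count" -gt 0 ] && [ $((count + lines)) -gt "$limit" ]; then
        n=$((n + 1))
        count=0
    fi
    cat "$chunk" >> "batch.$n"
    count=$((count + lines))
done

wc -l batch.*
```

On the sample this should give batch.1 with 9 lines (the first two cust blocks) and batch.2 with 4 lines. Unlike the awk approach, a batch only exceeds the limit when a single cust block is itself longer than the limit.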