I have a folder with 30000 txt files, each 50-60kb in size. I need to merge them into 2.5mb txt files, and remove the files that were merged. My code would need to be something like this pseudocode: for f in *,50; do cat file1,file2...file49 > somefile.txt; done. That is, I need to merge the files in batches of 50, then remove the ones that were used. Can someone please help me?
2 Answers
With zsh:
files=( ./input-file*(Nn.) )
typeset -Z3 n=1
while
  (( $#files > 0 )) &&
    cat $files[1,50] > merged-file$n.txt &&
    rm -f $files[1,50]
do
  files[1,50]=()
  ((n++))
done
There, ./input-file*(Nn.) expands to the files that match ./input-file*, but with 3 glob qualifiers further classifying that:
- N (nullglob): makes the glob expand to nothing instead of aborting with an error when there's no match. You often want that one when setting an array from a glob and it's fine for that array to be empty in the end.
- n (numericglobsort): changes the sorting from a default of lexical to numerical (in effect a combination of both), so that input-file2 sorts before input-file10, for instance.
- . : restricts the match to regular files (ignoring directories, symlinks, fifos...)
typeset -Z3 n makes $n a variable zero-padded to width 3, so we get merged-file001.txt, ... merged-file049.txt...
Then we loop as long as there are elements in the $files array and there's no error, concatenating batches of 50 at a time (and whatever's left for the last batch).
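To see that batching logic behave on disposable data, here is a self-contained bash sketch in the same spirit (the demo file names, the count of 120 and the temp directory are invented for the example, not part of the answer above):

```shell
#!/usr/bin/env bash
# Demo: merge 120 dummy files in batches of 50, deleting each batch once merged.
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Create 120 small input files with zero-padded names so the glob sorts them.
for ((i = 1; i <= 120; i++)); do
  printf 'content %d\n' "$i" > "$(printf 'input-file%03d.txt' "$i")"
done

files=( input-file*.txt )
n=1
while (( ${#files[@]} > 0 )); do
  printf -v padded %03d "$n"
  cat "${files[@]:0:50}" > "merged-file$padded.txt"   # merge the batch...
  rm -f "${files[@]:0:50}"                            # ...then delete it
  files=( "${files[@]:50}" )                          # drop the batch from the array
  n=$((n + 1))
done
```

With 120 inputs this produces merged-file001.txt and merged-file002.txt with 50 lines each, and merged-file003.txt with the remaining 20, with no input files left over.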
The same with bash 4.4+ and GNU tools:
readarray -td '' files < <(
  LC_ALL=C find . -maxdepth 1 -name 'input-file*' -type f -print0 |
    sort -zV
)
n=1
set -- "${files[@]}"
while
  (( $# > 0 )) &&
    printf -v padded_n %03d "$n" &&
    cat "${@:1:50}" > "merged-file$padded_n.txt" &&
    rm -f "${@:1:50}"
do
  shift "$(( $# >= 50 ? 50 : $# ))"
  ((n++))
done
Where find does the job of zsh's ./input-file*(N.), sort -V does the numeric (version) sort, and we use positional parameters and shift in the loop as bash arrays are quite limited.
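A quick way to convince yourself of the slicing and shift semantics used in that loop (toy values, run in bash):

```shell
# Positional-parameter slicing: "${@:1:N}" takes the first N parameters.
# (Offset 1 matters: offset 0 would drag in $0, the script name.)
set -- a b c d e

batch=( "${@:1:3}" )   # first batch of 3
echo "${batch[*]}"     # a b c

shift 3                # drop the consumed batch
echo "$#"              # 2 parameters left: d e
```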
- Why would you use rm -f? That seems to just be adding a (small) risk for no benefit. If something is write protected, maybe it shouldn't be deleted? – terdon, Commented Aug 19, 2021 at 10:46
- @terdon, rm without -f is for interactive use; you generally don't want it in scripts. Here the specification is clear that files that have been merged must be deleted. – Stéphane Chazelas, Commented Aug 19, 2021 at 10:52
- If the cat fails mid-way and some files were added to the archive, no file will be erased. – user232326, Commented Aug 19, 2021 at 17:29
- @ImHere, yes, and n and $files will be left as is, so you can investigate and fix the problem and restart from there in that case. I figured that was probably the best approach. – Stéphane Chazelas, Commented Aug 19, 2021 at 17:33
- Why do you need to set the locale to C for find? Does it fail to report all files? – user232326, Commented Aug 19, 2021 at 17:50
This script is:
- For bash (as tagged),
- Avoiding find (which fails on invalid characters),
- Making sure that only plain files are processed (no dirs),
- Using sort to sort numerically (well, by version),
- Joining k files at a time (variable count), and
- Removing one file at a time (to avoid copying a block of files that won't get erased).
dir="myDir"
readarray -td $'\0' files < <(
  for f in ./"$dir"/in-file*; do
    if [[ -f "$f" ]]; then printf '%s\0' "$f"; fi
  done |
    sort -zV
)

k=50
rm -f ./"$dir"/joined-files*.txt

for i in "${!files[@]}"; do
  n=$((i/k+1))
  cat "${files[i]}" >> ./"$dir"/joined-files$n.txt &&
    rm -f "${files[i]}"
done
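The output index here comes from integer division of the file's 0-based position by k; a quick check of that mapping with the script's k=50:

```shell
# Each input index i (0-based) lands in output batch i/k + 1 (1-based),
# so indices 0..49 go to file 1, 50..99 to file 2, and so on.
k=50
for i in 0 49 50 99 100; do
  echo "index $i -> joined-files$((i / k + 1)).txt"
done
```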
- ...split -c 2500000. You say "My code would need to be something like" but it really doesn't, you only think it needs to be like that. That was a bad solution for your last question and a terrible solution for this almost identical one. Your "need" is an example of an XY Problem.
- split -c 2500000 would make me txt files with missing characters or incomplete sentences... now I have 30 000 txt files as a whole, so for me it would be better to add 1..49, 50..99, 100..149... etc. than making a big file and then splitting it.
- ...$((i+1)) .. $((i+49)) instead of just to $((i+4))). Instead of something so hideously ugly and prone to user error & typos, use printf as I showed you in a comment to your last question. You're supposed to learn from answers you get here, not just repost the same question with trivial variations.
- for file in `find folder -type f | awk 'NR % 50 == 0'`; do echo $file; done ... but how do I now add from $file all files till next $file?