I have a folder with 30000 txt files, each file 50-60 KB. I need to merge them into 2.5 MB txt files, and remove the source files once they have been merged. My code would need to be something like: for f in *,50; do cat file1,file2...file49 > somefile.txt; done Of course this is pseudocode. I would need to merge the files in batches of 50, then remove the used ones. Can someone please help me?
2 Answers
With zsh:
files=( ./input-file*(Nn.) )
typeset -Z3 n=1
while
 (( $#files > 0 )) &&
   cat $files[1,50] > merged-file$n.txt &&
   rm -f $files[1,50]
do
  files[1,50]=()
  ((n++))
done
There ./input-file*(Nn.) expands to the files that match ./input-file*, with 3 glob qualifiers further restricting that:
- N: nullglob: makes the glob expand to nothing instead of aborting with an error when there's no match. That one you often want when setting an array from a glob and it's fine for that array to be empty in the end.
- n: numericglobsort: changes the sorting from the default lexical order to numerical (in effect a combination of both), so that input-file2 sorts before input-file10 for instance.
- .: restrict to regular files (ignore directories, symlinks, fifos...)
typeset -Z3 n makes $n a variable zero-padded to width 3, so we get merged-file001.txt, ... merged-file049.txt...
Then we loop as long as there are elements in the $files array and there's no error, concatenating batches of 50 at a time (and whatever's left for the last batch).
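For readers without zsh at hand, the naming scheme produced by typeset -Z3 can be sketched with printf's %03d in any POSIX shell (this is just an illustration of the zero-padded counter, not part of the answer's code):

```shell
# Sketch: the file names a width-3 zero-padded counter produces
for n in 1 2 49 100; do
  printf 'merged-file%03d.txt\n' "$n"
done
# prints merged-file001.txt, merged-file002.txt, merged-file049.txt, merged-file100.txt
```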
The same with bash 4.4+ and GNU tools:
readarray -td '' files < <(
  LC_ALL=C find . -maxdepth 1 -name 'input-file*' -type f -print0 |
    sort -zV
)
n=0
set -- "${files[@]}"
while
 (( $# > 0 )) &&
   printf -v padded_n %03d "$n" &&
   cat "${@:1:50}" > "merged-file$padded_n.txt" &&
   rm -f "${@:1:50}"
do
  shift "$(( $# >= 50 ? 50 : $# ))"
  ((n++))
done
Where find does the job of zsh's ./input-file*(N.), sort -V does the numeric (version) sort, and we use positional parameters and shift in the loop as bash arrays are quite limited.
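The positional-parameter slicing and shifting can be seen in isolation with a toy batch size; a minimal sketch, using a batch size of 3 instead of 50 and echo standing in for cat/rm:

```shell
# Sketch of the batching loop above, with k=3 and echo in place of cat/rm
set -- a b c d e f g h
k=3
while (( $# > 0 )); do
  echo "batch: ${*:1:k}"          # the first k (or fewer) remaining parameters
  shift "$(( $# >= k ? k : $# ))" # a plain 'shift k' would fail on the last, short batch
done
# prints:
# batch: a b c
# batch: d e f
# batch: g h
```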
- Why would you use rm -f? That seems to just be adding a (small) risk for no benefit. If something is write-protected, maybe it shouldn't be deleted? Commented Aug 19, 2021 at 10:46
- @terdon, rm without -f is for interactive use; you generally don't want it in scripts. Here the specification is clear that files that have been merged must be deleted. – Stéphane Chazelas Commented Aug 19, 2021 at 10:52
- If the cat fails mid-way and some files were added to the archive, no file will be erased. – user232326 Commented Aug 19, 2021 at 17:29
- @ImHere, yes, and n and $files will be left as is, so you can investigate and fix the problem and restart from there in that case. I figured that was probably the best approach. – Stéphane Chazelas Commented Aug 19, 2021 at 17:33
- Why do you need to set the locale to C for find? Does it fail to report all files? – user232326 Commented Aug 19, 2021 at 17:50
This script is:
- For bash (as tagged),
- Avoiding find (which fails on invalid characters),
- Making sure that only plain files are processed (no dirs),
- Using sort to sort numerically (well, by version), and
- Joining in batches of k files (variable count),
- Removing one file at a time (avoids keeping a block of files that won't get erased).
dir="myDir"
readarray -td $'\0' files < <(
   for f in ./"$dir"/in-file*; do
       if [[ -f "$f" ]]; then printf '%s\0' "$f"; fi
   done |
       sort -zV
)
k=50
rm -f ./"$dir"/joined-files*.txt
for i in "${!files[@]}"; do
   n=$((i/k+1))
   cat "${files[i]}"  >> ./"$dir"/joined-files$n.txt &&
       rm -f "${files[i]}"
done
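The batch index comes from integer division: file i (0-based) lands in output file i/k + 1. A small sketch of just that arithmetic, for k=50:

```shell
# Sketch: which joined-files$n.txt each 0-based file index i lands in, for k=50
k=50
for i in 0 49 50 99 100; do
  echo "index $i -> joined-files$((i/k+1)).txt"
done
# prints:
# index 0 -> joined-files1.txt
# index 49 -> joined-files1.txt
# index 50 -> joined-files2.txt
# index 99 -> joined-files2.txt
# index 100 -> joined-files3.txt
```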

- split -c 2500000
- You say "My code would need to be something like" but it really doesn't, you only think it needs to be like that. That was a bad solution for your last question and a terrible solution for this almost identical one. Your "need" is an example of an XY Problem.
- split -c 2500000 would make me txt files with missing characters or incomplete sentences... now I have 30 000 txt files as a whole, so for me it would be better to add 1..49, 50..99, 100..149... etc. than making a big file and then splitting it.
- $((i+1)) .. $((i+49)) instead of just to $((i+4))). Instead of something so hideously ugly and prone to user-error & typos, use printf as I showed you in a comment to your last question. You're supposed to learn from answers you get here, not just repost the same question with trivial variations.
- for file in `find folder -type f | awk 'NR % 50 == 0'`; do echo $file; done, but how do I now add from $file all files till the next $file?