I have a folder with 30000 txt files, each 50-60kb in size. I need to merge them into 2.5mb txt files, and remove the files that were merged. My code would need to be something like this pseudocode: for f in *,50; do cat file1,file2...file49 > somefile.txt; done. That is, I need to merge the files in batches of 50, then remove the ones that were used. Can someone please help me?
2 Answers
With zsh:
files=( ./input-file*(Nn.) )
typeset -Z3 n=1
while
  (( $#files > 0 )) &&
    cat $files[1,50] > merged-file$n.txt &&
    rm -f $files[1,50]
do
  files[1,50]=()
  ((n++))
done
There, ./input-file*(Nn.) expands to the files that match ./input-file*, but with 3 glob qualifiers further classifying that:
- N (nullglob): makes the glob expand to nothing instead of aborting with an error when there's no match. You often want that one when setting an array from a glob and it's fine for that array to be empty in the end.
- n (numericglobsort): changes the sorting from a default of lexical to numerical (in effect a combination of both), so that input-file2 sorts before input-file10, for instance.
- . : restricts the match to regular files (ignoring directories, symlinks, fifos...)
typeset -Z3 n makes $n a variable zero-padded to width 3, so we get merged-file001.txt, ... merged-file049.txt...
Then we loop as long as there are elements in the $files array and there's no error, concatenating batches of 50 at a time (and whatever's left for the last batch).
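To see that batching logic behave on disposable data, here is a self-contained bash sketch in the same spirit (the demo file names, the count of 120 and the temp directory are invented for the example, not part of the answer above):

```shell
#!/usr/bin/env bash
# Demo: merge 120 dummy files in batches of 50, deleting each batch once merged.
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Create 120 small input files with zero-padded names so the glob sorts them.
for ((i = 1; i <= 120; i++)); do
  printf 'content %d\n' "$i" > "$(printf 'input-file%03d.txt' "$i")"
done

files=( input-file*.txt )
n=1
while (( ${#files[@]} > 0 )); do
  printf -v padded %03d "$n"
  cat "${files[@]:0:50}" > "merged-file$padded.txt"   # merge the batch...
  rm -f "${files[@]:0:50}"                            # ...then delete it
  files=( "${files[@]:50}" )                          # drop the batch from the array
  n=$((n + 1))
done
```

With 120 inputs this produces merged-file001.txt and merged-file002.txt with 50 lines each, and merged-file003.txt with the remaining 20, with no input files left over.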
The same with bash 4.4+ and GNU tools:
readarray -td '' files < <(
  LC_ALL=C find . -maxdepth 1 -name 'input-file*' -type f -print0 |
    sort -zV
)
n=1
set -- "${files[@]}"
while
  (( $# > 0 )) &&
    printf -v padded_n %03d "$n" &&
    cat "${@:1:50}" > "merged-file$padded_n.txt" &&
    rm -f "${@:1:50}"
do
  shift "$(( $# >= 50 ? 50 : $# ))"
  ((n++))
done
Where find does the job of zsh's ./input-file*(N.), sort -V does the numeric (version) sort, and we use positional parameters and shift in the loop as bash arrays are quite limited.
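A quick way to convince yourself of the slicing and shift semantics used in that loop (toy values, run in bash):

```shell
# Positional-parameter slicing: "${@:1:N}" takes the first N parameters.
# (Offset 1 matters: offset 0 would drag in $0, the script name.)
set -- a b c d e

batch=( "${@:1:3}" )   # first batch of 3
echo "${batch[*]}"     # a b c

shift 3                # drop the consumed batch
echo "$#"              # 2 parameters left: d e
```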
- Why would you use rm -f? That seems to just be adding a (small) risk for no benefit. If something is write protected, maybe it shouldn't be deleted? – terdon, Commented Aug 19, 2021 at 10:46
- @terdon, rm without -f is for interactive use; you generally don't want it in scripts. Here the specification is clear that files that have been merged must be deleted. – Stéphane Chazelas, Commented Aug 19, 2021 at 10:52
- If the cat fails mid-way and some files were added to the archive, no file will be erased. – user232326, Commented Aug 19, 2021 at 17:29
- @ImHere, yes, and n and $files will be left as is, so you can investigate and fix the problem and restart from there in that case. I figured that was probably the best approach. – Stéphane Chazelas, Commented Aug 19, 2021 at 17:33
- Why do you need to set the locale to C for find? Does it fail to report all files? – user232326, Commented Aug 19, 2021 at 17:50
This script is:
- For bash (as tagged),
- Avoiding find (which fails on invalid characters),
- Making sure that only plain files are processed (no dirs),
- Using sort to sort numerically (well, by version),
- Joining k files at a time (variable count), and
- Removing one file at a time (to avoid copying a block of files that won't get erased).
dir="myDir"
readarray -td $'\0' files < <(
  for f in ./"$dir"/in-file*; do
    if [[ -f "$f" ]]; then printf '%s\0' "$f"; fi
  done |
    sort -zV
)

k=50
rm -f ./"$dir"/joined-files*.txt

for i in "${!files[@]}"; do
  n=$((i/k+1))
  cat "${files[i]}" >> ./"$dir"/joined-files$n.txt &&
    rm -f "${files[i]}"
done
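The output index here comes from integer division of the file's 0-based position by k; a quick check of that mapping with the script's k=50:

```shell
# Each input index i (0-based) lands in output batch i/k + 1 (1-based),
# so indices 0..49 go to file 1, 50..99 to file 2, and so on.
k=50
for i in 0 49 50 99 100; do
  echo "index $i -> joined-files$((i / k + 1)).txt"
done
```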
- ...split -c 2500000. You say "My code would need to be something like" but it really doesn't, you only think it needs to be like that. That was a bad solution for your last question and a terrible solution for this almost identical one. Your "need" is an example of an XY Problem.
- split -c 2500000 would make me txt files with missing characters or incomplete sentences... now I have 30 000 txt files as a whole, so for me it would be better to add 1..49, 50..99, 100..149... etc. than making a big file and then splitting it.
- ...$((i+1)) .. $((i+49)) instead of just to $((i+4))). Instead of something so hideously ugly and prone to user error & typos, use printf as I showed you in a comment to your last question. You're supposed to learn from answers you get here, not just repost the same question with trivial variations.
- for file in `find folder -type f | awk 'NR % 50 == 0'`; do echo $file; done ... but how do I now add from $file all files till next $file?