I want to download and extract a large zip archive (>180 GB) containing multiple small files of a single file format onto an SSD, but I don't have enough storage for both the zip archive and the extracted contents. I know that it would be possible to extract and delete individual files from an archive using the zip command as mentioned in the answers here and here. I could also get the names of all the files in an archive using the unzip -l command, store the results in an array as mentioned here, filter out the unnecessary values using the method given here, and iterate over them in BASH as mentioned here. So, the final logic would look something like this:
- List the zip file's contents using unzip -l and store the filenames in a bash array, using a regular expression to match the single file extension present in the archive.
- Iterate over the array of filenames and successively extract and delete individual files using the unzip -j -d and zip -d commands (a minimal sketch of this loop is shown below).
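For illustration, an unbatched sketch of that loop might look like the following (archive.zip and the .dcm extension are just example values here, not fixed parts of the method):
# List matching entries, then extract and delete them one at a time.
mapfile -t files < <(unzip -Z1 archive.zip | grep '\.dcm$')
for f in "${files[@]}"; do
    unzip -j archive.zip "$f" -d .   # extract a single file, junking internal paths
    zip -d archive.zip "$f"          # remove that file from the archive
done
Since every zip -d call modifies the archive, running it once per file is likely to be expensive on an archive this large, which is what motivates the batching in the script further down.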
How feasible is this method in terms of time required, logic complexity, and computational resources? I am worried about the efficiency of deleting and extracting single files, especially with such a large archive. If you have any feedback or comments about this approach, I would love to hear them. Thank you all in advance for your help.
Edit 1:
It seems this question has become a bit popular. In case anyone is interested, here is a Bash script following the logic outlined above, with batching of the extraction and deletion steps to reduce the number of operations. I have used DICOM files in this example, but it would work for any other file type, or for any files whose names can be matched by a regular expression. Here is the code:
#!/bin/bash

# Check if a zip file is provided as an argument
if [ -z "$1" ]; then
    echo "Usage: $0 <zipfile>"
    exit 1
fi

zipfile=$1

# List the contents of the zip file and store .dcm files in an array
mapfile -t dcm_files < <(unzip -Z1 "$zipfile" | grep '\.dcm$')

# Define the batch size
batch_size=10000
total_files=${#dcm_files[@]}

# Process files in batches
for ((i = 0; i < total_files; i += batch_size)); do
    batch=("${dcm_files[@]:i:batch_size}")

    # Extract the batch of .dcm files into the current directory;
    # passing the array directly avoids eval and its quoting pitfalls
    unzip "$zipfile" "${batch[@]}" -d .

    # Delete the batch of .dcm files from the zip archive
    zip -d "$zipfile" "${batch[@]}"
done

echo "Extraction and deletion of .dcm files completed."
The script would have to be saved under a name like inplace_extractor.sh and marked as executable. If the script and the archive are in the same folder and the archive is named archive.zip, you would run it as ./inplace_extractor.sh archive.zip. Feel free to adjust the batch size or the regular expression, or to account for any subfolders in your archive.
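In other words, with the example names above:
chmod +x inplace_extractor.sh
./inplace_extractor.sh archive.zip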
I tried it with my large archive and the performance was absolutely abysmal while the free disk space rapidly shrank, so I would still recommend going with the approaches suggested in the other answers.