I want to download and extract a large zip archive (>180 GB) containing many small files of a single file format onto an SSD, but I don't have enough storage for both the zip archive and the extracted contents. I know that it is possible to extract and delete individual files from an archive using the zip command, as mentioned in the answers here and here. I could also get the names of all the files in an archive using the unzip -l command, store the results in an array as mentioned here, filter out the unnecessary values using the method given here, and iterate over them in Bash as mentioned here. So, the final logic would look something like this:

  1. List the zip file's contents using unzip -l and store the filenames in a Bash array, using a regular expression to match the single file extension present in the archive.
  2. Iterate over the array of filenames, successively extracting each file with unzip -j -d and then deleting it from the archive with zip -d (a minimal per-file sketch follows this list).
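
For a single entry, one iteration of step 2 would look something like this (the archive name and entry path are placeholders):

unzip -j archive.zip "images/scan0001.dcm" -d .
zip -d archive.zip "images/scan0001.dcm"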

How feasible is this method in terms of time required, logic complexity, and computational resources? I am worried about the efficiency of deleting and extracting single files, especially with such a large archive. If you have any feedback or comments about this approach, I would love to hear them. Thank you all in advance for your help.

Edit 1:

It seems this question has become a bit popular. Just in case anyone is interested, here is a Bash script following the logic I outlined earlier, with batching for the extraction and deletion of files to reduce the number of operations. I have used DICOM files in this example, but this would work for any other file type, or for any files whose names can be matched by a regular expression:

#!/bin/bash

# Check if a zip file is provided as an argument
if [ -z "$1" ]; then
  echo "Usage: $0 <zipfile>"
  exit 1
fi

zipfile=$1

# List the archive entries (unzip -Z1 prints bare names, one per line)
# and store the .dcm entries in an array
mapfile -t dcm_files < <(unzip -Z1 "$zipfile" | grep '\.dcm$')

# Define the batch size
batch_size=10000
total_files=${#dcm_files[@]}

# Process files in batches
for ((i=0; i<total_files; i+=batch_size)); do
  batch=("${dcm_files[@]:i:batch_size}")

  # Extract the batch of .dcm files into the current directory.
  # Passing the array directly avoids the quoting pitfalls of building
  # a command line for eval (filenames containing $, backquotes, or
  # quotes would otherwise break the command or run as shell code).
  unzip "$zipfile" "${batch[@]}" -d .

  # Delete the batch of .dcm files from the zip archive
  zip -d "$zipfile" "${batch[@]}"
done

echo "Extraction and deletion of .dcm files completed."

Save the script under a name like inplace_extractor.sh and mark it as executable. If the script and the archive are in the same folder and the archive is named archive.zip, you would run:
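
chmod +x inplace_extractor.sh
./inplace_extractor.sh archive.zip

Feel free to adjust the batch size or the regular expression, or to account for any subfolders in your archive.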

I tried it with my large archive and the performance was absolutely abysmal while the free disk space rapidly shrank. In hindsight this makes sense: zip -d rewrites the remaining archive through a temporary copy on every invocation, so each batch costs time and scratch space proportional to the whole archive. I would therefore still recommend going with the approaches suggested in the other answers.

  • Decompress gzip file in place, but I don't know if it would work with zip, and even with gzip this method is unsafe. Probably very unsafe with zip, since it's not a streaming format. Commented Nov 12, 2023 at 14:14
  • Oh, it's about downloading... if the server supports resume / offset / range, you could probably cheese it with some flavor of FUSE httpfs like simple-httpfs and only download the requested segments (and not store the zip file locally at all). Commented Nov 12, 2023 at 14:21
  • @frostschutz I found a similar answer suggesting your approach here, and it sounds like it could work; I will be sure to give it a try and report back here. Commented Nov 12, 2023 at 17:30
  • @frostschutz, the example you give is for an archive with one compressed file (gz); I have my doubts this will work with a multi-file format like zip. Commented Nov 12, 2023 at 17:37
  • @RomeoNinov I don't have much money at the moment, and I don't have many uses for external storage besides this situation. Commented Nov 13, 2023 at 6:27

2 Answers

If the zip file:

  • contains trusted content; and
  • is available at a URL
    • on a reliable network connection

then the answers here may help.

In short, use a program that can unzip from a stream. (The standard Info-ZIP unzip cannot: it has to seek to the central directory at the end of the file, so it will not read from a pipe.)

For example:

cd /place/to/store/data
curl https://www.example.org/input.zip | busybox unzip -

or, with bsdtar:

cd /place/to/store/data
curl https://www.example.org/input.zip | bsdtar xvf -
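
If, as in the question, only entries with a particular extension are wanted, bsdtar can filter while it streams; a small sketch, assuming the same placeholder URL and that the archive's DICOM entries match '*.dcm':

cd /place/to/store/data
curl https://www.example.org/input.zip | bsdtar xvf - '*.dcm'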

AFAIK, deleting a file from a zip archive may need twice as much space as the archive itself. So the best option is to attach a USB disk and store the archive there. Then extract the files to the SSD and delete the archive (if it is no longer required).
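
A minimal sketch of that workflow, assuming the USB disk is mounted at /mnt/usb and the SSD target directory is /ssd/extracted (both paths are placeholders):

unzip /mnt/usb/archive.zip -d /ssd/extracted
rm /mnt/usb/archive.zip    # optional, once the extraction has been verified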

  • Thank you for the input; I will carefully monitor the process for any out-of-storage errors. Commented Nov 12, 2023 at 17:31
