I want to download and extract a large zip archive (>180 GB) containing multiple small files of a single file format onto an SSD, but I don't have enough storage for both the zip archive and the extracted contents. I know that it would be possible to extract and delete individual files from an archive using the zip command as mentioned in the answers here and here. I could also get the names of all the files in an archive using the unzip -l command, store the results in an array as mentioned here, filter out the unnecessary values using the method given here, and iterate over them in BASH as mentioned here. So, the final logic would look something like this:
- List the zip file's contents using unzip -l and store the filenames in a bash array, using a regular expression to match the single file extension present in the archive.
- Iterate over the array of filenames and successively extract and delete individual files using the unzip -j -d and zip -d commands (a minimal sketch of this loop is shown below).
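For illustration, an unbatched sketch of that loop might look like the following (archive.zip and the .dcm extension are just example values here, not fixed parts of the method):
# List matching entries, then extract and delete them one at a time.
mapfile -t files < <(unzip -Z1 archive.zip | grep '\.dcm$')
for f in "${files[@]}"; do
    unzip -j archive.zip "$f" -d .   # extract a single file, junking internal paths
    zip -d archive.zip "$f"          # remove that file from the archive
done
Since every zip -d call modifies the archive, running it once per file is likely to be expensive on an archive this large, which is what motivates the batching in the script further down.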
How feasible is this method in terms of time required, logic complexity, and computational resources? I am worried about the efficiency of deleting and extracting single files, especially with such a large archive. If you have any feedback or comments about this approach, I would love to hear them. Thank you all in advance for your help.
Edit 1:
It seems this question has become a bit popular. In case anyone is interested, here is a Bash script following the logic outlined above, with batching of the extraction and deletion steps to reduce the number of operations. I have used DICOM files in this example, but it would work for any other file type, or for any files whose names can be matched by a regular expression. Here is the code:
#!/bin/bash

# Check if a zip file is provided as an argument
if [ -z "$1" ]; then
    echo "Usage: $0 <zipfile>"
    exit 1
fi

zipfile=$1

# List the contents of the zip file and store .dcm files in an array
mapfile -t dcm_files < <(unzip -Z1 "$zipfile" | grep '\.dcm$')

# Define the batch size
batch_size=10000
total_files=${#dcm_files[@]}

# Process files in batches
for ((i = 0; i < total_files; i += batch_size)); do
    batch=("${dcm_files[@]:i:batch_size}")

    # Extract the batch of .dcm files into the current directory;
    # passing the array directly avoids eval and its quoting pitfalls
    unzip "$zipfile" "${batch[@]}" -d .

    # Delete the batch of .dcm files from the zip archive
    zip -d "$zipfile" "${batch[@]}"
done

echo "Extraction and deletion of .dcm files completed."
The script would have to be saved under a name like inplace_extractor.sh and marked as executable. If the script and the archive are in the same folder and the archive is named archive.zip, you would run it as ./inplace_extractor.sh archive.zip. Feel free to adjust the batch size or the regular expression, or to account for any subfolders in your archive.
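In other words, with the example names above:
chmod +x inplace_extractor.sh
./inplace_extractor.sh archive.zip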
I tried it with my large archive and the performance was absolutely abysmal while the free disk space rapidly shrank, so I would still recommend going with the approaches suggested in the other answers.