I'm trying to move parts of a large directory (~40 GiB and ~8 million files) across multiple machines via Amazon S3. Because I need to preserve symlinks, I'm tarring up the directory and uploading the resulting file rather than syncing directly to S3.
Most of the files are already compressed, so I'm not compressing the archive with gzip or bzip2. My command is along the lines of
tar --create --exclude='*.large-files' --exclude='unimportant-directory-with-many-files' --file /tmp/archive.tar /directory/to/archive
While running this, I've noticed that tar appears to use only one core on the eight-core machine. My impression, based on the pegging of that core, the low load average (~1), and the stats I'm seeing from iostat, is that this operation is actually CPU-bound rather than disk-bound, as I'd have expected. Since it's slow (~90 minutes), I'm interested in trying to parallelize tar to make use of the additional cores.
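For what it's worth, a rough way to watch this while the archive is being written (a sketch, assuming the sysstat tools are installed; the 5-second sampling interval is just my choice) is something like

pidstat -u -p $(pgrep -x tar) 5
iostat -x 5

pidstat reports per-process CPU usage and iostat -x reports per-device utilization, so a tar process pinned near 100% of one core alongside mostly idle disks points to a CPU bottleneck rather than an I/O one.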
Other questions on this topic either focus on compression or on creating multiple archives (which, due to the directory structure, is not easy in my situation). It seems most people forget that you can even create a tarball without compressing it.
tar stands for Tape ARchiver; once upon a time it was used to write backups to tapes (it still can, if you have the hardware). The physical on-tape format was exactly the tar format (or rather pax). Since you can't write to a piece of tape in parallel, tar doesn't archive files in parallel. Maybe newer incarnations of it can, though.

On a side note: the format has rather low limits on path length. Make sure you don't run into those before trying to archive 8 million files (a quick check is sketched below).

A tar program could be written to both read and write in parallel: because the output archive is not compressed, the size of each file (and its header) can be computed directly, which would allow different writers to work on different extents of the output. There aren't a lot of applications that need that, however.
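As a sketch of that path-length check (assuming GNU findutils and no newlines in filenames; adjust the directory to your tree):

find /directory/to/archive | awk '{ print length, $0 }' | sort -rn | head -5

This prints the five longest paths with their lengths. The classic ustar header allows a 100-character name plus a 155-character prefix; GNU tar and pax get around this with extended headers, but it's worth knowing in advance whether you're anywhere near those limits.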