This extends a previous question about counting the number of files in a tar file (link): how can I count the files under each subfolder in a tar file? What I would like to have at the end is:

  1. list the folders that contain files
  2. count the number of files within each of those folders

My example listing from tar -tvf myfile.tar looks like this (the real tar file has more files and directories). There are 2 folders in total: folder_files_1 contains 3 files and folder_files_2 contains 4 files.

drwxrwxrwx someuser/users      0 2017-08-07 11:43 ./root_folder/subfolder/folder_files_1/
-rwxr-xr-x someuser/users 538962 2017-08-07 11:43 ./root_folder/subfolder/folder_files_1/i716266.MRDC.270
-rwxr-xr-x someuser/users 538962 2017-08-07 11:43 ./root_folder/subfolder/folder_files_1/i716267.MRDC.266
-rwxr-xr-x someuser/users 538944 2017-08-07 11:43 ./root_folder/subfolder/folder_files_1/i716268.MRDC.287
drwxrwxrwx someuser/users      0 2017-08-07 11:50 ./root_folder/subfolder/folder_files_2/
-rwxr-xr-x someuser/users 538696 2017-08-07 11:50 ./root_folder/subfolder/folder_files_2/i717157.MRDC.8
-rwxr-xr-x someuser/users 538694 2017-08-07 11:50 ./root_folder/subfolder/folder_files_2/i717158.MRDC.4
-rwxr-xr-x someuser/users 538692 2017-08-07 11:50 ./root_folder/subfolder/folder_files_2/i717159.MRDC.34
-rwxr-xr-x someuser/users 538696 2017-08-07 11:50 ./root_folder/subfolder/folder_files_2/i717160.MRDC.5

The closest solution I found points to using awk after tar (see references here and here).

tar tvf myfile.tar | awk '/^d/ {print $0; /$6/; getline; file_no++} END {print file_no}'

/$6/ is meant to match the corresponding folder, e.g. ./root_folder/subfolder/folder_files_1/. But it still does not accurately count the files under each matching directory, i.e. folder_files_1 and folder_files_2.

Any suggestions on how to fix my code?

3
  • The same solution in your other question should work: tar tvf myfile.tar | wc -l Commented Mar 6, 2018 at 19:24
  • @NasirRiley No, it won't. That will count everything in the tar file, now he's asking for only certain paths. Commented Mar 6, 2018 at 19:26
  • The way that he's worded it is somewhat confusing. Perhaps it can be certain that he wants to find only files but I don't see where it says that he's looking for certain paths. The answer right below this will give him what he wants if it's only files but if he only wants certain paths then it's going to get really hairy and convoluted. Commented Mar 6, 2018 at 23:13

4 Answers

1

Another option:

tar tf archive.tar |
    awk '
        { if (gsub("[^/]+$", "")) { h[$0]++} }
        END { for (f in h) { printf "%d\t%s\n", h[f], f } }
    '

The first awk statement strips filenames, and counts the instances of resulting directory paths. The second runs when the input has been fully consumed (i.e. at the end of stdin) and prints the list of paths and their respective counts.

The whole thing can be written on a single line if you prefer (just concatenate the lines as they are). I've split it here for readability.

Result from running against your tarball:

4       ./root_folder/subfolder/folder_files_2/
3       ./root_folder/subfolder/folder_files_1/
1
tar -tvf file.tar | grep '^-' | wc -l

This will count the number of lines in the tar output that start with - (i.e. regular files). Change '^-' to '^[^d]' to count "anything but directories" if you have special types of files in your archive.

Another way, with awk:

tar -tvf file.tar | awk '/^-/ { n++ } END { print n }'

Both of these commands output 7, the total number of files in the archive.


If you want separate counts for each subfolder:

tar -tvf file.tar | awk '/^d/ { d = $NF; next } { n[d]++ } END { for (d in n) print n[d], d }'

This generates

4 ./root_folder/subfolder/folder_files_2/
3 ./root_folder/subfolder/folder_files_1/

for the data that you have provided.

The awk code in this last example picks out the directory name from any line that starts with d and uses it as a key in an associative array. The array entry is incremented for each found file. At the end, all entries and their count are printed.
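One limitation of remembering the most recent directory line is that a file listed after a subdirectory's contents is attributed to that subdirectory instead of its real parent. A minimal sketch of a variant that derives the directory from each file's own path is shown below. The function name count_files_per_dir is made up for illustration, and the sketch assumes the GNU tar -tvf layout shown in the question (five columns before the path); names that begin with whitespace or contain newlines are still not handled.

```shell
# count_files_per_dir: hypothetical helper, reads `tar -tvf` output on stdin.
# Assumes the GNU tar verbose layout: mode, owner/group, size, date, time, path.
count_files_per_dir() {
    awk '
        /^-/ {
            path = $0
            # Strip the five leading columns one at a time, so that
            # spaces inside the path itself are left untouched.
            for (i = 1; i <= 5; i++) sub(/^[^ ]+ +/, "", path)
            # Remove the last path component, keeping the trailing slash.
            sub("[^/]*$", "", path)
            # Files stored at the top level of the archive have no directory part.
            if (path == "") path = "./"
            n[path]++
        }
        END { for (d in n) print n[d], d }
    '
}
```

It would be used as tar -tvf file.tar | count_files_per_dir. The order of the output lines is unspecified in awk, so pipe the result through sort if a stable order matters.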

5
  • 1
    Depending on whether pipes and device files count as "files", you might use something like grep '^[^d]' to specifically omit directories. Commented Mar 6, 2018 at 20:12
  • Works for the data given, but $NF doesn't work if (path)names contain whitespace, and that logic is wrong if the tar contains e.g. /dir/{file1,subdir/[abc],file2} Commented Mar 7, 2018 at 7:20
  • @dave_thompson_085 I understand your note about white spaces, but I don't fully understand the comment about the logic. Are you concerned about sub-subfolders, or about subfolders occurring mixed in with files (I could understand that)? Commented Mar 7, 2018 at 7:29
  • I think that's what @dave is talking about: files from parent directory listed after subdirectory and its files, in which case d should be reset or extracted from the filename. Commented Mar 7, 2018 at 9:00
  • @muru Ah. Yes. Well, in this case this is a simple solution for simple archives... Commented Mar 7, 2018 at 9:06
1

If you have GNU tar, it has a --to-command option:

--to-command=COMMAND
  Pipe extracted files to COMMAND.  The argument is the pathname
  of an external program, optionally with command line
  arguments.  The program will be invoked and the contents of
  the file being extracted supplied to it on its standard
  output.  Additional data will be supplied via the following
  environment variables:

  TAR_FILETYPE
         Type of the file. It is a single letter with the
         following meaning:

                 f           Regular file
                 d           Directory
                 l           Symbolic link
                 h           Hard link
                 b           Block device
                 c           Character device

         Currently only regular files are supported.
  ...
  TAR_FILENAME
         The name of the file.

These variables can be used to safely handle filenames with spaces, etc.

For example, using shell string substitution to remove the filename from the path given, then using sed to print only the paths for non-directories, you can then sort and apply uniq -c to get the count:

tar xf foo.tar --to-command 'echo "$TAR_FILETYPE" "${TAR_FILENAME%/*}"' |
  sed -n '/^[^d]/s/^. //p' |
  sort |
  uniq -c

If you have GNU sed, sort and uniq, you can use their -z options and printf "%s %s\0" instead of echo to safely handle all filenames.

Example:

% tar xf dev/pacaur/byobu/byobu_5.124.orig.tar.gz --to-command 'printf "%s %s\0" "$TAR_FILETYPE" "${TAR_FILENAME%/*}"' | sed -zn '/^[^d]/s/^. //p' | sort -z | uniq -zc | tr '\0' '\n'
     15 byobu-5.124
      2 byobu-5.124/Applications/Byobu.app/Contents
      1 byobu-5.124/Applications/Byobu.app/Contents/MacOS
      8 byobu-5.124/Applications/Byobu.app/Contents/Resources
      4 byobu-5.124/etc/byobu
      3 byobu-5.124/etc/profile.d
      1 byobu-5.124/experimental
     23 byobu-5.124/po
      1 byobu-5.124/snap
     38 byobu-5.124/usr/bin
     43 byobu-5.124/usr/lib/byobu
     18 byobu-5.124/usr/lib/byobu/include
      1 byobu-5.124/usr/share/appdata
      4 byobu-5.124/usr/share/byobu/desktop
     12 byobu-5.124/usr/share/byobu/keybindings
      4 byobu-5.124/usr/share/byobu/pixmaps
      1 byobu-5.124/usr/share/byobu/pixmaps/highcontrast
     11 byobu-5.124/usr/share/byobu/profiles
      4 byobu-5.124/usr/share/byobu/status
      3 byobu-5.124/usr/share/byobu/tests
      3 byobu-5.124/usr/share/byobu/windows
      3 byobu-5.124/usr/share/dbus-1/services
      4 byobu-5.124/usr/share/doc/byobu
     37 byobu-5.124/usr/share/man/man1
      1 byobu-5.124/usr/share/sounds/byobu
0

If you don't mind running it twice (to get the count, then the lines), you can use grep.

For the count:

tar tvf myfile.tar | grep <path> | wc -l

For the lines, just remove the | wc -l

If you'd prefer to just run tar once, you can save the output to a file then cat it to grep and wc. The script all together would look something like this:

tmp_file=$(mktemp)
tar tvf myfile.tar > "$tmp_file"
cat "$tmp_file" | grep <subdir> | wc -l
cat "$tmp_file" | grep <subdir>
rm "$tmp_file"

If you want a one-liner there's probably a hack you can do with process substitution and redirection, but if you're running this with any cadence you'll probably end up putting it in a script/alias/function anyway so this is a little easier to read and understand.

If you have multiple paths in the tar file that you'd like to grep out, you can put them all in a text file and use grep -f <paths file>.
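Note that grep -f on its own prints the lines matching any of the patterns, not a separate count per pattern. As a rough sketch of getting one count per path from a saved listing: the helper name count_matches and the file names listing.txt and paths.txt are made up for illustration. It assumes directory entries in the listing end in / (as GNU tar prints them) and treats every line of the patterns file as a fixed string rather than a regular expression.

```shell
# count_matches LISTING PATTERNS (hypothetical helper)
# For each non-empty line in PATTERNS, print the number of non-directory
# lines in LISTING that contain it, followed by a tab and the pattern.
count_matches() {
    cm_listing=$1
    cm_patterns=$2
    while IFS= read -r cm_pattern; do
        [ -n "$cm_pattern" ] || continue
        # grep -v '/$' drops directory entries; grep -c still prints 0 when
        # nothing matches but exits non-zero, hence the "|| true".
        cm_count=$(grep -v '/$' "$cm_listing" | grep -cF -e "$cm_pattern" || true)
        printf '%s\t%s\n' "$cm_count" "$cm_pattern"
    done < "$cm_patterns"
}
```

With this, tar only runs once: save the listing with tar tf myfile.tar > listing.txt, then call count_matches listing.txt paths.txt.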

2
  • Thanks for your answer, however, if I have more folders and files in my .tar file, I will have to point to them each for the grep <path> which is not the ideal solution. Commented Mar 6, 2018 at 21:00
  • To get a count of each path, that is true, but if you use the script I wrote up, your overhead is minimal and each grep is relatively cheap. You can use multiple patterns in grep. I updated the answer to reflect this, but you can also specify multiple patterns on the command line with '-e'. Commented Mar 7, 2018 at 18:46