My goal is rather simple: I want to create a small database of all files/directories within a big directory.

After looking for a suitable tool, I couldn't find anything but the good ol' du. I figured out that I should use -b if I want the number of bytes the files actually have, and -a to list all files. Great.

Now here's the problem: I do not want to include directories in du's output. I would like the output to look like this per file: size<tab>filename, so I can convert it to CSV or some kind of database. If the output contained directories, I wouldn't know how to tell directories from files after I convert/import it into a database, and I don't want to end up with files accidentally recorded as directories or vice versa. I also don't want to count too much total usage: summing all files in a directory already gives me the whole directory's size, so adding the size of the directory itself on top of that would be too much.

Here is why I don't want to use find -type f... (test directory with 38k files in total, the real one has millions)

$ time find -mindepth 1 -type f -exec du -sb {} > /dev/null \;

real    0m45.631s
user    0m25.807s
sys     0m18.946s

$ time du -ab > /dev/null

real    0m0.154s
user    0m0.057s
sys     0m0.096s

I'm open to suggestions of any other way to achieve my initial goal. I just need a way in the end to "browse" directories and see their sizes (and the sizes of the files they contain) without actually having the filesystem in question mounted. (So kinda like an "offline" baobab you might say)

1 Answer

Note that du reports the disk usage of files, not their size (unless you use the --apparent-size option of the GNU implementation of du (or -b, another GNU extension which is short for --apparent-size --block-size=1)).
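The difference between size and disk usage is easiest to see with a sparse file. The following is a minimal sketch, assuming GNU coreutils (truncate, du -b, du --block-size); the temporary directory and the variable names are only for illustration:

```shell
# Sketch: apparent size vs. disk usage of a sparse file (GNU coreutils assumed).
tmpdir=$(mktemp -d)

# Create a file whose size is 1 MiB without writing any data to it.
truncate -s 1M "$tmpdir/sparse"

# -b is short for --apparent-size --block-size=1: the size in bytes.
apparent=$(du -b "$tmpdir/sparse" | cut -f1)

# Without --apparent-size, du reports the space actually allocated on disk.
usage=$(du --block-size=1 "$tmpdir/sparse" | cut -f1)

printf 'apparent size: %s bytes, disk usage: %s bytes\n' "$apparent" "$usage"

rm -rf "$tmpdir"
```

On file systems that support sparse files, the disk usage reported here is typically much smaller than the apparent size.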

Also note that comma, TAB, newline and " are as valid as any other character in file names on Unix-like systems.

Here, if on a GNU system, you could do:

find . ! -type d -printf '%s,%P\0' |
  LC_ALL=C sed -z 's/"/""/g; s/,/,"/; s/$/"/' |
  tr '\0' '\n'

To output in CSV format and report the size of all files except those of type directory.

Above, we're telling find to output <size>,<filepath> in NUL-delimited records as 0 is the only byte that can't occur in a file name. We process that with sed NUL-record mode (-z) in the C locale (again for it to work with any byte value, even those not forming valid characters in the user's locale), by replacing every " with "" (which is the way double quotes are generally escaped in CSVs), and adding a " after the first occurrence of , in the record (the one after the <size>), and one at the end ($).

tr translates the NULs to NLs on that output as NL is the record delimiter in CSVs. So for instance a 15 byte file called $'a\nb"c' would be rendered as:

15,"a
b""c"
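To check that rendering yourself, here is a small sketch that creates such a file in a scratch directory and runs the same pipeline on it. It assumes GNU find and GNU sed; the scratch directory and the variable names are made up for the example:

```shell
# Sketch: reproduce the example above in a scratch directory (GNU find/sed assumed).
tmpdir=$(mktemp -d)

# A file name containing a newline and a double quote, built portably
# (command substitution only strips trailing newlines, not embedded ones).
name=$(printf 'a\nb"c')

# Write exactly 15 bytes into it.
printf '%s' '123456789012345' > "$tmpdir/$name"

# Same pipeline as above, run from inside the scratch directory.
csv=$(cd "$tmpdir" && find . ! -type d -printf '%s,%P\0' |
  LC_ALL=C sed -z 's/"/""/g; s/,/,"/; s/$/"/' |
  tr '\0' '\n')

printf '%s\n' "$csv"

rm -rf "$tmpdir"
```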

If it's really the disk usage you want, you can replace %s with %b which gives you the disk usage in number of 512-byte units, or %k for number of kibibytes.

If not on a GNU system (-printf, -z are GNU extensions), you could use perl instead:

find . ! -type d -print0 |
  perl -l -0ne '
    if (@s = lstat$_) {
      s/"/""/g; print qq($s[7],"$_")
    } else {warn "$_: $!\n"}'

(-print0 is also a GNU extension, but these days, it's found in most other find implementations. If not, you can use -exec printf '%s\0' {} + instead).

This time, replace 7 with 12 if you want the disk usage in number of 512-byte units.

In any case, note that directories (which are a special type of file that contains a list of filenames with their mapping to inodes) also have a size themselves. The cumulative disk usage of a directory, as reported for instance by du -s some-dir, is the sum of the disk usage of all unique files (of any type, including directories) referenced by that directory recursively, plus the size of that directory itself.

Here, if you want to do du's job by hand and be able to report that same size, you'll also need to record the disk-usage/size of directory files, and also record when files are duplicated (when there are several names for the same file, aka hard links). So in addition to size/disk-usage and file paths, you'd want to record the device and inode number of the file (%D:%i) which is the unique identifier of the file thanks to which du is able to tell when two file paths refer to the same file.
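As a rough illustration of that bookkeeping, the sketch below records device:inode and disk usage for every path and counts each file only once. It assumes GNU find and an awk; the scratch directory, the file names and the variable names are hypothetical:

```shell
# Sketch: count each file once by device:inode, like du does (GNU find assumed).
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/file"
ln "$tmpdir/file" "$tmpdir/hardlink"   # a second name for the same file

# One record per path: "<device>:<inode> <disk usage in 512-byte units>".
# Directories are included on purpose, since du counts them too.
listing=$(find "$tmpdir" -printf '%D:%i %b\n')

paths=$(printf '%s\n' "$listing" | wc -l)
unique=$(printf '%s\n' "$listing" | cut -d' ' -f1 | sort -u | wc -l)

# Add up the disk usage, skipping device:inode pairs that were already seen.
total=$(printf '%s\n' "$listing" |
  awk '!seen[$1]++ { sum += $2 } END { print sum * 512 }')

printf '%s paths, %s unique files, %s bytes of disk usage\n' \
  "$paths" "$unique" "$total"

rm -rf "$tmpdir"
```

Here three paths (the directory and two names) map to only two unique files, so the hard-linked file contributes to the total once. The total should normally agree with du -s --block-size=1 on the same directory, though that is not verified here.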

You may also want to record the type of the files. And directory there is just one of many types of files. There are also regular files, symlinks, fifos, devices, sockets... (%y).
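For instance, a quick sketch of what %y prints for a few file types, again assuming GNU find (the scratch directory and the file names are invented for the example):

```shell
# Sketch: show the type letter that GNU find's %y prints for several file types.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/dir"
: > "$tmpdir/regular"
ln -s regular "$tmpdir/symlink"
mkfifo "$tmpdir/fifo"

# %y is the type letter (d, f, l, p, ...), %P the path relative to the start point.
types=$(cd "$tmpdir" && find . -mindepth 1 -printf '%y %P\n' | LC_ALL=C sort)

printf '%s\n' "$types"

rm -rf "$tmpdir"
```

With the type stored as its own field, separating directories from other files after the import becomes a simple filter on that column.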

  • Thanks for the answer! As stated in my question, the behaviour of -b is what I want. I had no idea that there's a way for find to directly print the file size, that's great! I don't really know sed much and completely disregarded the fact that filenames can contain TAB (I actually didn't know that). Does your answer take care of filenames that have a TAB (or comma in this case)? Also thank you very much for the hint about %y, I'll definitely use that as well! Commented Dec 27, 2020 at 2:26
  • I just checked man find for other possible variables for -printf, things like modification time and such is also very useful. I'd upvote twice if I could. Commented Dec 27, 2020 at 2:29
  • Okay, after trying it out and adding some more variables I'm very satisfied. Only question left is how to properly and safely put the quotes around the filename, now that I have a total of nine comma separated fields in my output. Commented Dec 27, 2020 at 5:18
  • I came up with this: find . -printf '%A@,%C@,%G,%#m,%D,%i,%s,%T@,%U,%y,%P' | LC_ALL=C sed -z 's/"/""/g; s/,/,"/10; s/$/"/' | tr '\0' '\n'. Is that safe, even when filenames contain a comma? I think it should but as a sed noobie I'm not quite sure. Commented Dec 27, 2020 at 5:55
  • @confetti, see edit for what the sed part does. In your case, you'll still need the trailing \0, and you'll need to add the " after the 10th ,, not the first, so s/,/,"/10 instead of s/,/,"/. Also beware that those %T@ output something like 1605892440.0446899670 which, if interpreted as floats, exceeds the precision of most binary floating point representations. Something to bear in mind when comparing timestamps, as 1605892440.0446899670 will likely be considered equal to 1605892440.0446899671 for instance if they're interpreted as numbers. Commented Dec 27, 2020 at 9:43
