
I only recently "discovered" the inline_data feature of ext4, although it seems to have been around for 10+ years.

I ran a few statistics on several of my systems (desktop/notebook and server), specifically on the root filesystems (a sketch of one way to gather such numbers is shown after the list), and found that:

  • Around 5% of all files are < 60 bytes in size. The 60-byte threshold is relevant because that's how much inline data you can fit in a standard 256-byte inode
  • Another ~20-25% of files are between 60 and 896 bytes in size. Again, the "magic number" 896 is how much you can fit in a 1KB inode
  • A further 20% are in the 896-1920 byte range (you guessed it - 1920 bytes is what fits into a 2KB inode)
  • That percentage is even more stunning for directories - 30-35% are below 60 bytes, and further 60% are below 1920 bytes.
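For reference, here is one possible way to collect such statistics for regular files with GNU find and awk (this prints cumulative percentages; the thresholds are the inline-data limits discussed above):

# Cumulative size statistics for regular files on the root filesystem
# (-xdev stays on this one filesystem)
find / -xdev -type f -printf '%s\n' | awk '
    { total++ }
    $1 < 60   { a++ }
    $1 < 896  { b++ }
    $1 < 1920 { c++ }
    END { printf "<60: %.1f%%  <896: %.1f%%  <1920: %.1f%%  (%d files)\n", 100*a/total, 100*b/total, 100*c/total, total }'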

This means that with an inode size of 2048 bytes you can inline roughly half of all files and 95% of all directories on an average root filesystem! This came as quite a shocker to me...

Now, of course, since inodes are preallocated and fixed for the lifetime of a filesystem, large inodes lead to a lot of "wasted" space if you have a lot of them (i.e. a low inode_ratio setting). But then again, allocating a 4KB block for a 5-byte file is also a waste of space. And according to the statistics above, half of all files on the filesystem and virtually all directories can't even fill half of a 4KB block, so that wasted space is not insignificant. The only difference between wasting that space in the inode table and wasting it in the data blocks is that with data blocks you have one more level of indirection, plus potential for fragmentation, etc.
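For concreteness, such a filesystem could be created along these lines (the device name is a placeholder, and the -i bytes-per-inode ratio is just an example value to balance the inode-table overhead mentioned above):

mkfs.ext4 -I 2048 -O inline_data -i 16384 /dev/sdXN   # /dev/sdXN is hypothetical
# sanity-check the result
tune2fs -l /dev/sdXN | grep -E 'Inode size|features'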

The advantages I see in that setup are:

  • When the kernel loads an inode, it reads at least one page (4KB) from disk, no matter whether the inode is 128 bytes or 2KB, so you have zero overhead in terms of raw disk IO...
  • ... but you have the data preloaded as soon as you stat the file, so no additional IO is needed to read the contents
  • The kernel caches inodes more aggressively than data blocks, so inlined data is more likely to stay longer in cache
  • Inodes are stored in a fixed, contiguous region of the partition, so you can't ever have fragmentation there
  • Inlining is especially useful for directories, a) since such a high proportion of them are small, and b) because you're very likely to need the contents of the directory, so having them preloaded makes a lot of sense

What do you think about this setup? Am I missing something here, and are there some potential risks I don't see?

I stress again that I'm talking about a root filesystem, hosting basically the operating system, config files, and some caches and logs. Obviously the picture would be quite different for a /home partition hosting user directories, and even more different for a fileserver, webserver, mailserver, etc.

(I know there are a few threads describing some corner cases where inline_data does not play well with journaling, but those are 5+ years old, so I hope those issues have been sorted out.)

EDIT: Since doubts were expressed in the comments about whether directory inlining works - it does. I have already implemented the setup described here, and the machine I'm writing this on is actually running on a root filesystem with 2KB inodes and inlining. Here's what /usr looks like in ls:

# ls -l /usr
total 160
drwxr-xr-x   2 root root 36864 Jul  1 00:35 bin
drwxr-xr-x   2 root root    60 Mar  4 13:20 games
drwxr-xr-x   4 root root  1920 Jun 16 21:32 include
drwxr-xr-x  64 root root  1920 Jun 25 21:16 lib
drwxr-xr-x   2 root root  1920 Jun  9 01:48 lib64
drwxr-xr-x  16 root root  4096 Jun 22 02:58 libexec
drwxr-xr-x  11 root root  1920 Jun  9 00:10 local
drwxr-xr-x   2 root root 12288 Jun 26 20:22 sbin
drwxr-xr-x 191 root root  4096 Jun 26 20:22 share
drwxr-xr-x   2 root root    60 Mar  4 13:20 src

And if you dive even deeper and use debugfs to examine those directories, the ones with a size of 60 or 1920 bytes have 0 allocated data blocks, while those of 4096 bytes and more do have data blocks.
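For example, this is roughly how you can check an individual directory with debugfs (the device name below is a placeholder, and the exact output format depends on your e2fsprogs version):

# /dev/sdXN stands for the device holding the root filesystem
debugfs -R 'stat /usr/games' /dev/sdXN
# For an inlined directory the Flags field includes 0x10000000 (inline data)
# and the block count is 0; /usr/share and friends show real data blocks instead.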

  • "That percentage is even more stunning for directories - 30-35% are below 60 bytes, and further 60% are below 1920 bytes". AFAIK, directories on ext4 have their size as a multiple of the block size (minimum 4KiB). Commented Jul 1 at 14:49
  • Ah, looks like that (multiple of block size for directories) doesn't apply with inline_data, my bad. Commented Jul 1 at 14:58
  • Similar: Inode size 512 and 1024 bytes functions in ext4 and its pros and cons? (Need a official reference) Commented Jul 1 at 15:00
  • @StéphaneChazelas That's true on a "normal" filesystem without inlining. The contents of the directory (= the list of files and subdirectories) are stored in a normal data block, just like the data of any other regular file. That's why any non-empty directory has a size of at least the allocation size, which by default is 1 block = 4KB. When you turn on inlining, then "suddenly" small directories have a size of 60 bytes and those that don't fit there have a size of 1920 bytes. Commented Jul 1 at 15:01
  • @StéphaneChazelas thanks for that pointer, I didn't find that thread when I searched here initially. But unfortunately it doesn't give any real data/opinions on that subject, either. Commented Jul 1 at 15:19

1 Answer


I can't comment on inline data since I've never used it, but large inodes have been used for many years to store xattrs inside the inode instead of in external blocks, so they also provide a significant performance boost for filesystems that use xattrs (which is pretty much everything these days).
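For illustration, one way to see whether an xattr ended up inside the inode or needed an external block is to look at the "File ACL" field that debugfs reports (it points to the external xattr block; 0 means none was allocated). The file path and device below are placeholders:

touch /root/xattr-test                                  # hypothetical test file
setfattr -n user.comment -v 'small value' /root/xattr-test
sync
# "File ACL: 0" means the attribute fit into the inode's extra space
# and no external xattr block was allocated
debugfs -R 'stat /root/xattr-test' /dev/sdXN | grep -i 'File ACL'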

Only you can decide whether this is a worthwhile change, but it looks like you've done the math and it comes out in favor of the larger inodes.

What would be particularly useful is a side-by-side comparison of two filesystems on the same-sized block device, one with the default inode size (256 bytes) and one with 2048-byte inodes, with the same root filesystem contents copied into them. Compare the total free space available and some performance numbers, like cold-cache "find" and "grep -R" runs or similar.
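Something along these lines for the cold-cache runs, assuming the test filesystem is mounted at /mnt/test (a placeholder):

sync; echo 3 > /proc/sys/vm/drop_caches             # flush page/inode/dentry caches
time find /mnt/test -xdev > /dev/null               # metadata-heavy walk
sync; echo 3 > /proc/sys/vm/drop_caches
time grep -rI sometext /mnt/test > /dev/null 2>&1   # data-heavy read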

  • Yes, I also thought of doing something like that, but haven't had the time/enthusiasm to try it out. It should not be too difficult, though - the easiest way would be to create a .tar of one's own root filesystem, create an ext4 filesystem image in a file, then mount that file via loopback and untar the archive into the image. Then do the same with another image with different mke2fs parameters and run some basic benchmarking. The only thing to look out for is caching - you have to drop caches beforehand, since otherwise the kernel will cache the whole image in memory, which skews the results. A rough sketch of that procedure is below. Commented Jul 2 at 18:03
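A minimal sketch of that procedure, assuming a root filesystem archive named rootfs.tar and an image size of 20G (both placeholders):

truncate -s 20G root-a.img
mkfs.ext4 -F -I 2048 -O inline_data root-a.img   # -F: operate on a regular file
mkdir -p /mnt/test
mount -o loop root-a.img /mnt/test
tar -C /mnt/test -xpf rootfs.tar
df -h /mnt/test                                  # free-space comparison
umount /mnt/test
# repeat with e.g. "mkfs.ext4 -F root-b.img" (default parameters) and benchmark both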
