28

This is a rather low-level question, and I understand that it might not be the best place to ask. But, it seemed more appropriate than any other SE site, so here goes.

I know that on the Linux filesystem, some files actually exist; for example, /usr/bin/bash is one that exists. However (as far as I understand it), some don't actually exist as such and are more like virtual files, e.g. /dev/sda, /proc/cpuinfo, etc. My questions are (there are two, but they are too closely related to be separate questions):

  • How does the Linux kernel work out whether these files are real (and therefore read them from the disk) or not when a read command (or such) is issued?
  • If the file isn't real: as an example, a read from /dev/random will return random data, and a read from /dev/null will return EOF. How does it work out what data to read from this virtual file (and therefore what to do when/if data is written to the virtual file too)? Is there some kind of map with pointers to separate read/write commands appropriate for each file, or even for the virtual directory itself? So, an entry for /dev/null could simply return an EOF.
  • When the file is created, the kernel records its type. Regular disk files are then treated differently from symlinks, block devices, character devices, directories, sockets, FIFOs, etc. It is the kernel's job to know. Commented Nov 20, 2015 at 16:29
  • See the man page for mknod. Commented Nov 21, 2015 at 9:43
  • This is kind of like asking "how does a light switch know whether the light is turned on?" The light switch is in charge of deciding whether the light is turned on. Commented Nov 22, 2015 at 4:28

5 Answers

24

So there are basically two different types of thing here:

  1. Normal filesystems, which hold files in directories with data and metadata, in the familiar manner (including soft links, hard links, and so on). These are often, but not always, backed by a block device for persistent storage (a tmpfs lives in RAM only, but is otherwise identical to a normal filesystem). The semantics of these are familiar; read, write, rename, and so forth, all work the way you expect them to.
  2. Virtual filesystems, of various kinds. /proc and /sys are examples here, as are FUSE custom filesystems like sshfs or ifuse. There's much more diversity in these, because really they just refer to a filesystem with semantics that are in some sense 'custom'. Thus, when you read from a file under /proc, you aren't actually accessing a specific piece of data that's been stored by something else writing it earlier, as under a normal filesystem. You're essentially doing a kernel call, requesting some information that's generated on-the-fly. And this code can do anything it likes, since it's just some function somewhere implementing read semantics. Thus, you have the weird behavior of files under /proc, like for instance pretending to be symlinks when they aren't really.

The key is that /dev is actually, usually, one of the first kind. It's normal in modern distributions to have /dev be something like a tmpfs, but in older systems, it was normal to have it be a plain directory on disk, without any special attributes. The key is that the files under /dev are device nodes, a type of special file similar to FIFOs or Unix sockets; a device node has a major and minor number, and reading or writing them is doing a call to a kernel driver, much like reading or writing a FIFO is calling the kernel to buffer your output in a pipe. This driver can do whatever it wants, but it usually touches hardware somehow, e.g. to access a hard disk or play sound in the speakers.
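
As a rough illustration of the point about device nodes, the sketch below asks the kernel, via stat(), whether a path names a device node and which major and minor numbers it carries. The helper name device_numbers is hypothetical, and the code assumes Linux with glibc, where major() and minor() come from <sys/sysmacros.h>.

```c
/* Sketch: report the major/minor pair stored in a device node.
 * Assumes Linux with glibc; device_numbers is a made-up helper name. */
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() on glibc */
#include <sys/types.h>

/* Returns 1 and fills *maj and *min if path is a block or character
 * device node, 0 if it is some other kind of file, and -1 if stat()
 * fails (for example because the path does not exist). */
int device_numbers(const char *path, unsigned *maj, unsigned *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
        return 0;

    /* st_rdev identifies the device the node refers to; st_dev would
     * instead identify the filesystem that stores the node itself. */
    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);
    return 1;
}
```

On a typical Linux system, calling this on /dev/null should report major 1 and minor 3, while calling it on an ordinary directory returns 0 because no driver is attached to that file.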

To answer the original questions:

  1. There are two questions relevant to whether the 'file exists' or not; these are whether the device node file literally exists, and whether the kernel code backing it is meaningful. The former is resolved just like anything on a normal filesystem. Modern systems use udev or something like it to watch for hardware events and automatically create and destroy the device nodes under /dev accordingly. But older systems, or light custom builds, can just have all their device nodes literally on the disk, created ahead of time. Meanwhile, when you read these files, you're doing a call to kernel code which is determined by the major and minor device numbers; if these aren't reasonable (for instance, you're trying to read a block device that doesn't exist), you'll just get some kind of I/O error.

  2. The way it works out what kernel code to call for which device file varies. For virtual filesystems like /proc, they implement their own read and write functions; the kernel just calls that code depending on which mount point it's in, and the filesystem implementation takes care of the rest. For device files, it's dispatched based on the major and minor device numbers.
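
The dispatch described above can be pictured with a toy user-space model. This is only a sketch: toy_read_fn, toy_drivers, and toy_read are hypothetical names, and the kernel's real tables and function signatures are different. The one detail borrowed from real Linux is that /dev/null and /dev/zero share major 1 and are told apart by minor numbers 3 and 5.

```c
/* Toy model of "look up the driver by major number, let the driver
 * interpret the minor number". Not kernel code; all names are made up. */
#include <stddef.h>
#include <string.h>

typedef long (*toy_read_fn)(unsigned minor_no, char *buf, size_t len);

/* Stand-in for the memory-device driver behind major 1. */
static long mem_read(unsigned minor_no, char *buf, size_t len)
{
    switch (minor_no) {
    case 3:                     /* like /dev/null: always end-of-file */
        return 0;
    case 5:                     /* like /dev/zero: endless zero bytes */
        memset(buf, 0, len);
        return (long)len;
    default:
        return -1;              /* minor not modelled in this sketch */
    }
}

#define TOY_MAX_MAJOR 8

/* The "map with pointers": index is the major number. */
static const toy_read_fn toy_drivers[TOY_MAX_MAJOR] = {
    [1] = mem_read,
};

/* Forward a read to whichever driver owns the major number. Returns -1
 * when no driver is registered, standing in for the I/O error you get
 * from a device node whose numbers match nothing. */
long toy_read(unsigned major_no, unsigned minor_no, char *buf, size_t len)
{
    if (major_no >= TOY_MAX_MAJOR || toy_drivers[major_no] == NULL)
        return -1;
    return toy_drivers[major_no](minor_no, buf, len);
}
```

The design point is that the caller never needs to know what kind of device it is reading; only the table lookup does.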

  • So if, let's say, an old system had its power pulled, the files in /dev would still be there, but I guess they would be cleared when the system starts up? Commented Nov 20, 2015 at 16:47
  • If an old system (one without any dynamic device creation) was shut down, either normally or abnormally, the device nodes would remain on disk just like any file. Then when next bootup happened, they would also remain on disk, and you could use them as normal. It's only in modern systems that anything special happens w.r.t. creating and destroying device nodes. Commented Nov 20, 2015 at 16:55
  • So a more modern system not using a tmpfs would dynamically create and delete them as needed, eg: bootup and shutdown? Commented Nov 20, 2015 at 17:09
  • devtmpfs, the /dev filesystem in modern Linux, is similar to a tmpfs, but has some differences to support udev. (The kernel does some automated node creation on its own before handing off to udev, in order to make boot less complicated.) In all of these cases, device nodes live only in RAM and are created and destroyed dynamically as hardware requires them. Presumably you could also use udev on an ordinary on-disk /dev, but I've never seen this done and there don't seem to be any good reasons to. Commented Nov 20, 2015 at 17:13
18

Here's a file listing of /dev/sda1 on my nearly up-to-date Arch Linux server:

% ls -li /dev/sda1
1294 brw-rw---- 1 root disk 8, 1 Nov  9 13:26 /dev/sda1

So the directory entry in /dev/ for sda1 has an inode number, 1294. It's a real file on disk.

Look at where the file size usually appears. "8, 1" appears instead. This is a major and minor device number. Also note the 'b' in the file permissions.

The file /usr/include/ext2fs/ext2_fs.h contains this C struct (only a fragment is shown):

/*
 * Structure of an inode on the disk
 */
struct ext2_inode {
    __u16   i_mode;     /* File mode */

That struct shows us the on-disk structure of a file's inode. Lots of interesting stuff is in that struct; take a long look at it.

The i_mode element of struct ext2_inode has 16 bits, and it uses only 9 for the user/group/other, read/write/execute permissions, and another 3 for setuid, setgid, and sticky. It's got 4 bits to differentiate among types like "plain file", "link", "directory", "named pipe", "Unix family socket", and "block device".

The Linux kernel can follow the usual directory lookup algorithm, then make a decision based on the permissions and flags in the i_mode element. For 'b', block device files, it can find the major and minor device numbers and, traditionally, use the major device number to look up a pointer to some kernel function (a device driver) that deals with disks. The minor device number is usually used as, say, the SCSI bus device number, the EIDE device number, or something like that.

Some other decisions about how to deal with a file like /proc/cpuinfo are made based on the filesystem type. If you do a:

% mount | grep proc 
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)

you can see that /proc has file system type of "proc". Reading from a file in /proc causes the kernel to do something different based on the type of the file system, just as opening a file on a ReiserFS or DOS file system would cause the kernel to use different functions to locate files, and locate data of the files.
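
A program can make the same observation as the mount command by calling statfs() and looking at the filesystem's magic number. The sketch below assumes Linux with the kernel headers installed (for <linux/magic.h>); fs_name_from_magic and fs_name_of are hypothetical helper names, and the list of recognised filesystems is deliberately tiny. It only shows what user space can see, not how the kernel picks its functions internally.

```c
/* Sketch: identify the filesystem type behind a path from the magic
 * number that statfs() reports. Linux-specific. */
#include <stddef.h>
#include <sys/vfs.h>        /* statfs(), struct statfs */
#include <linux/magic.h>    /* PROC_SUPER_MAGIC, TMPFS_MAGIC, ... */

/* Maps a few well-known magic numbers to readable names. */
const char *fs_name_from_magic(long magic)
{
    switch (magic) {
    case PROC_SUPER_MAGIC:  return "proc";
    case SYSFS_MAGIC:       return "sysfs";
    case TMPFS_MAGIC:       return "tmpfs";
    case EXT4_SUPER_MAGIC:  return "ext2/ext3/ext4";  /* shared magic */
    default:                return "other";
    }
}

/* Returns the name of the filesystem that path lives on, or NULL if
 * statfs() fails. */
const char *fs_name_of(const char *path)
{
    struct statfs sfs;

    if (statfs(path, &sfs) != 0)
        return NULL;
    return fs_name_from_magic((long)sfs.f_type);
}
```

On a system where /proc is mounted as shown above, fs_name_of("/proc/cpuinfo") would be expected to return "proc".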

  • Are you certain, that only "real files on disk" have an inode number displayed? I get 4026531975 -r--r--r-- 1 root root 0 Nov 14 18:41 /proc/mdstat which clearly isn't a "real file". Commented Nov 21, 2015 at 18:02
7

At the end of the day they are all files for Unix; that's the beauty of the abstraction.

The way the files are handled by the kernel, now that is a different story.

/proc and nowadays /dev and /run (aka /var/run) are virtual filesystems in RAM. /proc is an interface/window to kernel variables and structures.

I recommend reading The Linux Kernel http://tldp.org/LDP/tlk/tlk.html and Linux Device Drivers, Third Edition https://lwn.net/Kernel/LDD3/.

I also enjoyed The Design and Implementation of the FreeBSD Operating System http://www.amazon.com/Design-Implementation-FreeBSD-Operating-System/dp/0321968972/ref=sr_1_1

Have a look at the page that is relevant to your question:

http://www.tldp.org/LDP/tlk/dd/drivers.html

  • thanks, I slightly changed the first question after you commented that. Commented Nov 20, 2015 at 14:09
  • Read the last comment please. Commented Nov 20, 2015 at 14:28
4

In addition to @RuiFRibeiro's and @BruceEdiger's answers: the distinction you make is not exactly the distinction the kernel makes. Actually, you have various kinds of files: regular files, directories, symbolic links, devices, sockets (and I always forget a few, so I won't try to make a full list). You can get the type of a file with ls: it's the first character on the line. For example:

$ ls -la /dev/sda
brw-rw---- 1 root disk 8, 0 17 nov.  08:29 /dev/sda

The 'b' at the very beginning signals that this file is a block device. A dash means a regular file, 'l' a symbolic link, and so on. This information is stored in the metadata of the file and is accessible through the stat system call, for instance, so the kernel can treat a regular file and a symbolic link differently, for example.

Then, you make another distinction between "real files" like /bin/bash and "virtual files" like /proc/cpuinfo, but ls reports both as regular files, so the difference is of another kind:

$ ls -la /proc/cpuinfo /bin/bash
-rwxr-xr-x 1 root root  829792 24 août  10:58 /bin/bash
-r--r--r-- 1 root wheel      0 20 nov.  16:50 /proc/cpuinfo

What happens is that they belong to different filesystems. /proc is the mount point of the pseudo-filesystem procfs, whereas /bin/bash is on a regular disk filesystem. When Linux opens a file (it does so differently depending on the filesystem), it populates a struct file data structure which has, among other attributes, a structure of function pointers describing how to use this file. Therefore, it can implement distinct behaviors for different kinds of files.

For example, these are the operations advertised by /proc/meminfo:

static int meminfo_proc_open(struct inode *inode, struct file *file)
{
    return single_open(file, meminfo_proc_show, NULL);
}

static const struct file_operations meminfo_proc_fops = {
    .open       = meminfo_proc_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = single_release,
};

If you look at the definition of meminfo_proc_open, you can see that it registers meminfo_proc_show, whose task is to collect data about memory usage, as the function that generates the file's contents. When you then read the file, seq_read calls meminfo_proc_show to fill a buffer in memory, and the result is returned like the contents of any other file. Each time you open the file and read it, meminfo_proc_show runs again, so the information about memory is refreshed.
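
To make the pattern concrete, here is a user-space miniature of it. Everything in it (toy_file, toy_open, toy_read, toy_meminfo_show, toy_free_pages) is hypothetical and only loosely imitates single_open and seq_read. In this sketch the text is generated lazily on the first read, which is my understanding of how seq_file behaves; either way, the point is that the contents are computed on demand rather than stored anywhere.

```c
/* Toy imitation of a file whose contents come from a "show" callback.
 * Not kernel code; all names are made up for illustration. */
#include <stdio.h>
#include <string.h>

struct toy_file {
    int    (*show)(char *buf, size_t len);  /* content generator */
    char   buf[128];                        /* filled on first read */
    size_t size;                            /* bytes generated */
    size_t pos;                             /* current read offset */
    int    filled;
};

long toy_free_pages;    /* stands in for changing kernel state */

int toy_meminfo_show(char *buf, size_t len)
{
    return snprintf(buf, len, "free pages: %ld\n", toy_free_pages);
}

/* Like single_open(): only remembers which generator to use. */
void toy_open(struct toy_file *f, int (*show)(char *, size_t))
{
    memset(f, 0, sizeof *f);
    f->show = show;
}

/* Like seq_read(): generates the text on the first read, then hands it
 * out piece by piece. Returns 0 once everything has been read. */
size_t toy_read(struct toy_file *f, char *out, size_t len)
{
    if (!f->filled) {
        int n = f->show(f->buf, sizeof f->buf);
        if (n < 0)
            n = 0;
        /* snprintf reports the untruncated length; clamp to what fits */
        f->size = (size_t)n < sizeof f->buf ? (size_t)n : sizeof f->buf - 1;
        f->filled = 1;
    }

    size_t left = f->size - f->pos;
    if (len > left)
        len = left;
    memcpy(out, f->buf + f->pos, len);
    f->pos += len;
    return len;
}
```

Opening the toy file twice with different values of toy_free_pages in between yields different text each time, even though nothing was ever written to it.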

3

All files in a file system are "real" in the sense that they allow file I/O. When you open a file, the kernel creates a file descriptor, which is an object (in the sense of object-oriented programming) that acts like a file. If you read the file, the file descriptor executes its read method, which in turn will ask the file system (sysfs, ext4, nfs, etc.) for data from the file. The file systems present a uniform interface to userspace, and know what to do to handle reads and writes. The file systems in turn ask other layers to handle their requests. For a regular file on, say, an ext4 file system, this will involve lookups in the file system's data structures (which may involve disk reads), and eventually a read from the disk (or cache) to copy data into the read buffer. For a file in, say, sysfs, it generally just sprintf()s something to the buffer. For a block device node, it will ask the disk driver to read some blocks and copy them into the buffer (the major and minor numbers tell the file system which driver to make requests to).
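
The uniform interface is easy to see from user space: the same open() and read() calls work on any path, and only the result differs. A minimal sketch, assuming a Linux system with the usual /dev/null and /dev/zero nodes (read_once is a hypothetical helper name):

```c
/* Sketch: one read() through the ordinary file interface, whatever
 * kind of file the path turns out to be. */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Opens path, issues a single read() of up to len bytes, and returns
 * what read() returned: a byte count, 0 at end-of-file, or -1 if the
 * open or the read failed. */
ssize_t read_once(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, len);
    close(fd);
    return n;
}
```

Called on /dev/null this returns 0 straight away, and called on /dev/zero it fills the buffer with zero bytes, yet the calling code is identical to what you would write for a regular file.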
