I was using the dd command to change a single byte at a specific position on a whole-disk block device (not a partition block device), such as /dev/nvme0n1. The position is not managed by any regular file.

dd of=${DEV:?DEV} seek=${POS:?POS} bs=1 count=1 oflag=seek_bytes conv=notrunc status=none

I then ran into a problem with the sync command: on some machines it hangs or takes far too long to finish.

It seems that sync involves the caches of all files, which is obviously slow and may even hang, possibly because of some kernel-side inconsistency. It is especially slow when several big VMs are running on the host, sometimes taking 30 minutes.

So I started to think that I should not call sync directly. Instead I should tell dd to sync only the part it wrote, with oflag=sync, like this:

dd of=${DEV:?DEV} seek=${POS:?POS} bs=1 count=1 oflag=sync,seek_bytes conv=notrunc status=none

Since the difference between oflag=direct, oflag=sync, and conv=fsync is not obvious, I dug into the dd source. It turns out that:

  • oflag=sync opens the output file with the O_SYNC flag, so each write syscall does not return until its data has been written through, as if it were followed by fsync(fd).
  • conv=fsync makes dd issue one additional fsync syscall on the output file after the last write, before it exits.
  • oflag=direct requires the block size to be a multiple of 512 (or similar). In my case the write is just 1 byte, so dd turns the flag off and falls back to conv=fsync.
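The flag combination can be tried safely on a scratch regular file before pointing it at a real device. The following is only a sketch: poke_byte is a hypothetical helper name, the 16-byte scratch file is made up for illustration, and it assumes GNU dd (for oflag=seek_bytes and status=none).

```shell
# poke_byte FILE POS CHAR: overwrite one byte of FILE at byte offset POS.
# poke_byte is a hypothetical helper; the dd flags are the ones from the question.
poke_byte() {
    poke_file=$1 poke_pos=$2 poke_char=$3
    printf '%s' "$poke_char" |
        dd of="$poke_file" seek="$poke_pos" bs=1 count=1 \
           oflag=sync,seek_bytes conv=notrunc status=none
}

# Try it on a 16-byte scratch file rather than on a real device.
scratch=$(mktemp)
printf 'aaaaaaaaaaaaaaaa' > "$scratch"
poke_byte "$scratch" 5 x
cat "$scratch"; echo
```

Because of conv=notrunc the file keeps its length and only the byte at offset 5 changes. With bs=1 the seek value already counts bytes, so seek_bytes is redundant here but harmless.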

Everything seems fine, but I am not sure about one thing:

If Linux has cached many files that live on the output device /dev/nvme0n1, will my dd command end up syncing all of them? (I only want dd to sync that 1 byte to the device, nothing else.)

I checked the kernel source and guess that a write on an fd opened with O_SYNC eventually calls [fs/sync.c#L180](https://github.com/torvalds/linux/blob/16a8829130ca22666ac6236178a6233208d425c3/fs/sync.c#L180) (at least this is what the fsync syscall eventually calls):

int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
    struct inode *inode = file->f_mapping->host;

    if (!file->f_op->fsync)
        return -EINVAL;
    if (!datasync && (inode->i_state & I_DIRTY_TIME))
        mark_inode_dirty_sync(inode);
    return file->f_op->fsync(file, start, end, datasync);
}

but then I was stuck at

file->f_op->fsync(file, start, end, datasync)

I am not sure how the file system driver handles the fsync, or whether it involves caches created through other fds. It is not obvious.

I will continue to check the kernel source and append an EDIT later.

EDIT: I am almost sure that vfs_fsync_range is what the write syscall eventually calls.

The call stack looks like this:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
        size_t, count)
{
    return ksys_write(fd, buf, count);
}
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
...
        ret = vfs_write(f.file, buf, count, ppos);
}
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
...
    if (file->f_op->write)
        ret = file->f_op->write(file, buf, count, pos);
    else if (file->f_op->write_iter)
        ret = new_sync_write(file, buf, count, pos);
...
}
static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
...
    ret = __generic_file_write_iter(iocb, from);
    if (ret > 0)
        ret = generic_write_sync(iocb, ret);
...
}
static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
{
    if (iocb_is_dsync(iocb)) {
        int ret = vfs_fsync_range(iocb->ki_filp,
                iocb->ki_pos - count, iocb->ki_pos - 1,
                (iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
        if (ret)
            return ret;
    }

    return count;
}
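To make the range arithmetic in generic_write_sync concrete, here is a small sketch that computes the start and end values it would pass to vfs_fsync_range for a write of COUNT bytes at offset POS. sync_range is a hypothetical helper name, and 4096 is just an example offset; the only thing taken from the kernel code is the two expressions ki_pos - count and ki_pos - 1.

```shell
# sync_range POS COUNT: print "start end" for a write of COUNT bytes at POS,
# mirroring the arguments computed in generic_write_sync.
sync_range() {
    range_pos=$1 range_count=$2
    ki_pos=$((range_pos + range_count))   # file position after the write completed
    echo "$((ki_pos - range_count)) $((ki_pos - 1))"
}

sync_range 4096 1
```

For a 1-byte write, start and end are the same offset, so the range handed down covers exactly the byte that was written.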

To be continued...

static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
        int datasync)
{
    struct block_device *bdev = filp->private_data;
    int error;

    error = file_write_and_wait_range(filp, start, end);
    if (error)
        return error;

    /*
     * There is no need to serialise calls to blkdev_issue_flush with
     * i_mutex and doing so causes performance issues with concurrent
     * O_SYNC writers to a block device.
     */
    error = blkdev_issue_flush(bdev);
    if (error == -EOPNOTSUPP)
        error = 0;

    return error;
}

The blkdev_fsync above should be what does the sync work. Beyond this function it becomes hard to analyze. I hope some kernel developers can help me.

It in turn calls functions in mm/filemap.c and block/blk-flush.c. I hope this helps.

I will run a test, but a test alone cannot make me confident, which is why I am asking this question here.

Tested, but since the sync command itself also finished quickly, I cannot tell whether dd oflag=sync is safer than the sync command.

EDIT:

I have managed to confirm that dd oflag=sync is safer and quicker than the sync command, so I believe the answer to this question is yes.

Does write(fd with O_SYNC) only flush the data of THAT fd instead of all caches caused by other fds of the same file?

YES.

The test is like this:

  • repeatedly create big files with random data
for i in {1..10}; do echo $i; dd if=/dev/random of=tmp$i.dd count=$((10*1024*1024*1024/512)); done
  • in another terminal, run sync to confirm that it is very slow, as if it were hanging. Interrupt the sync command.
  • create a test file and get its physical LBA.
echo test > z
DEV=$(df . | grep /dev |awk '{print $1}')
BIG_LBA=$(sudo debugfs -R "stat $PWD/z" $DEV | grep -F '(0)' | awk -F: '{print $2}')
  • in another terminal, run the dd command and confirm that it is very fast.
dd of=${DEV:?DEV} seek=$((BIG_LBA*8*512)) bs=1 count=1 oflag=sync,seek_bytes conv=notrunc status=none <<<"x"
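The seek value in the last step converts the block number reported by debugfs into a byte offset, assuming a filesystem block size of 4096 bytes (the 8*512 factor). A minimal sketch of that conversion follows; block_to_byte_offset is a hypothetical helper name and 34816 is a made-up block number. If the filesystem uses a different block size (tune2fs -l "$DEV" reports it as "Block size"), pass it as the second argument.

```shell
# block_to_byte_offset BLOCK [BLOCK_SIZE]: print BLOCK * BLOCK_SIZE.
# BLOCK_SIZE defaults to 4096, matching the 8*512 factor used in the test.
block_to_byte_offset() {
    echo "$(( $1 * ${2:-4096} ))"
}

block_to_byte_offset 34816        # hypothetical block number, 4096-byte blocks
block_to_byte_offset 34816 1024   # same block number on a 1024-byte-block filesystem
```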

But I still hope someone can point out where in the source code I can confirm the answer.

1 Answer

I'm not sure if you are mounting /dev/nvme0n1 as well as writing directly to some block on the device, but in any case you might usefully trace what I/O is done on the device by using BCC, the BPF Compiler Collection. It has a tool, biosnoop, that shows the command, process ID, block number, read or write length, and latency of every I/O to the disks in the system. In Fedora there is a bcc-tools package, and presumably there is something similar in other distributions.

$ sudo /usr/share/bcc/tools/biosnoop
TIME(s)     COMM           PID    DISK    T SECTOR     BYTES  LAT(ms)
8.241172    dd             12525  sdc     R 4000       4096      0.97
8.242739    dd             12525  sdc     W 4000       4096      1.17
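On a busy host the biosnoop output will be dominated by other processes, so it may help to keep only the rows for dd. The sketch below filters on the COMM column with awk. biosnoop_for_comm is a hypothetical helper name, and the sample input reuses the two rows above plus one made-up row from another process so that it can run without root.

```shell
# biosnoop_for_comm COMM: keep the header line plus rows whose COMM column matches.
biosnoop_for_comm() {
    awk -v comm="$1" 'NR == 1 || $2 == comm'
}

# Canned sample; the last row is hypothetical and only there to be filtered out.
sample='TIME(s)     COMM           PID    DISK    T SECTOR     BYTES  LAT(ms)
8.241172    dd             12525  sdc     R 4000       4096      0.97
8.242739    dd             12525  sdc     W 4000       4096      1.17
8.250000    jbd2/sda1-8    301    sda     W 2048       8192      0.40'

printf '%s\n' "$sample" | biosnoop_for_comm dd
```

Against live output the same filter would be used as sudo /usr/share/bcc/tools/biosnoop | biosnoop_for_comm dd.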
  • Thank you for taking a look. Yes, /dev/nvme0n1 and its partitions are mounted as ext4 file systems etc. The sync command behaves the same way not only for NVMe but also for other devices. I am also an enthusiast of the Linux ftrace-related perf-tools and the bcc tools and considered using them, but before doing that I had already reproduced the issue and confirmed that dd oflag=sync works well, so I dropped the investigation. I might continue as you suggested later because it is interesting. Commented May 11, 2023 at 17:58
