Simulate loss of root filesystem

Question

At $WORK there's been an increase in the failures of VMs on our provider (AWS) recently. It has been suggested that using a more reliable root filesystem might help here.

I first tested this locally: unmounting the root filesystem on a CentOS 7 VM produced a very different set of symptoms from what we had seen in AWS. As I expected, the VM continued to respond to pings, but the other monitoring reported multiple failures. Conversely, the AWS instances immediately stopped responding to pings (which causes our monitoring system to not bother asking about services).

While this favoured the null hypothesis, I want to try it on a monitored AWS VM. However, it won't let me unmount the root filesystem ("umount: /: target is busy.").

We only have access to the storage via the host or the AWS API. Is there another way I can break the root filesystem on an Amazon Linux 2 VM? (Read-only is not broken enough.)

Kernel version is 4.14.327-246.539.amzn2.x86_64

Is / on a partition or on a VG? If the latter, add a disk to the VG, move the other LVs to that disk, then unplug the first disk. (For obvious reasons, 1) this is untested and 2) hence only a comment.)
– Archemar, Commented Sep 2, 2024 at 13:42
Not responding to pings would be a strange failure mode: the kernel (which does said responding) is always in memory and doesn't rely on disk access for networking... assuming it managed to boot. But also, though I'm not familiar with AWS, it surprises me that a disappearing rootfs is even a possibility. I thought that was reserved for cheap "run off a single rack" VPS hosts.
– grawity, Commented Sep 2, 2024 at 13:55
Let's say that I find it vastly less likely that AWS's storage fails than that your setup fails for software reasons.
– Marcus Müller, Commented Sep 2, 2024 at 14:17
@u1686_grawity The kernel might, however, panic very robustly when you pull the root file system from under its feet; that's why "However it won't let me unmount the root filesystem ("umount: /: target is busy.")" would happen: the kernel usually just won't let you.
– Marcus Müller, Commented Sep 2, 2024 at 14:18
@MarcusMüller: Right, but it sounds like OP isn't having that happen on its own, only some kind of undescribed failure?
– grawity, Commented Sep 2, 2024 at 14:33

Marcus Müller · Accepted Answer · 2024-09-02 18:24:59Z

In lieu of anyone else writing an answer:

You can get one of these instances with only EBS storage and remove that storage while the instance is running; however, unless your AWS logs show that this is what's happening to the instances that actually fail, I honestly don't see the diagnostic power of doing that. You'd clearly be doing something that has nothing to do with why your instances fail.

Your system ceasing to respond to pings definitely points to kernel panics/crashes. The fact that the file system is unclean afterwards aligns with that: the kernel just stops working, so pending storage operations are never synced to disk.

So, the first thing would be to use the AWS CLI to get the console output: aws ec2 get-console-output --latest --output text --instance-id …
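To make that concrete, here is a rough Python sketch that wraps the command above and then looks for lines that usually accompany a kernel crash. The function names and the list of marker strings are my own choices for illustration and certainly not exhaustive; the sketch also assumes the aws CLI is installed and has credentials for the account.

```python
import re
import subprocess
from typing import List

# Strings the Linux kernel commonly prints when it crashes or hangs.
# Illustrative only; extend this list to match what you actually see.
CRASH_MARKERS = [
    r"Kernel panic - not syncing",
    r"BUG: unable to handle",
    r"Oops:",
    r"Call Trace:",
    r"Out of memory: Kill",
    r"blocked for more than \d+ seconds",
]
CRASH_RE = re.compile("|".join(CRASH_MARKERS))


def fetch_console_output(instance_id: str) -> str:
    """Run the aws CLI command from the answer and return its text output."""
    result = subprocess.run(
        ["aws", "ec2", "get-console-output", "--latest",
         "--output", "text", "--instance-id", instance_id],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def find_crash_lines(console_text: str) -> List[str]:
    """Return the console lines that contain any of the crash markers."""
    return [line for line in console_text.splitlines()
            if CRASH_RE.search(line)]
```

Something like find_crash_lines(fetch_console_output(instance_id)) for each failed instance should then tell you quickly whether a panic was logged before the host went quiet. An empty result does not prove there was no crash, only that none of these particular strings appeared.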

However, it's really a bit surprising that the Amazon kernel on Amazon EC2 would crash this harshly. Are you using third-party kernel modules? If you have any antivirus kernel modules loaded, these are the prime suspects for kernel instability (sadly); consider whether endpoint security mechanisms really apply to your server use case. (There's a lot of snake oil being sold in the antivirus / threat detection sector.)
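One way to check for third-party modules is to look at the taint flags in /proc/modules. The sketch below is only an illustration (the function name is mine): it relies on the kernel appending flags in parentheses to the line of any module that taints it, such as O for out-of-tree, E for unsigned and P for proprietary.

```python
import re
from typing import Dict

# A tainting module's line in /proc/modules ends with "(FLAGS)",
# optionally followed by + or - while the module is loading/unloading.
TAINT_RE = re.compile(r"\(([A-Z]+)[+-]?\)\s*$")


def tainted_modules(proc_modules_text: str) -> Dict[str, str]:
    """Map module name to taint flags for modules that carry any."""
    found = {}
    for line in proc_modules_text.splitlines():
        match = TAINT_RE.search(line)
        if match:
            # The module name is the first whitespace-separated field.
            found[line.split()[0]] = match.group(1)
    return found
```

On an affected instance you would feed it the contents of /proc/modules, for example tainted_modules(open("/proc/modules").read()). Modules shipped with the distribution kernel should normally not show up in the result, so whatever does appear is a candidate for closer inspection.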
