I've recently suffered a maddeningly small but quite important amount of damage to a hard drive on an ESXi host, affecting a couple of VMs. There's a file that I would very much like to recover, and of course it was somehow left out of my regular backup. The most recent copies are 6 months old. Turns out I need that... oops.
Details:
1) I have used ddrescue (AWESOME tool) within a Parted Magic bootable ISO to recover 99.98% of the VM's drive in question. Unfortunately, the errors appear to be almost entirely of RECENT file writes... so of course they're exactly the sectors I need to recover most.
2) The drive gives IO errors on bad sector reads, but it occasionally SUCCEEDS in reading a previously bad sector! So, recovery is still possible. Slightly more often than that, the drive will have some kind of major malfunction and spin down and back up. Oh, and about 1/4 of those spin-downs won't come back up. (Hard power cycle required; shutdown won't function.) Last, just about every bad sector read comes with a nice audible clicking sound.
3) The important VM disk is NTFS formatted.
4) I can (usually) mount the damaged NTFS volume read-only, and I can (slightly less often) navigate to the folder that contains the file I need. However, the file in question appears to always give an IO error when I do an 'ls' of the folder. The other files in the folder do not give an IO error.
5) I've tried using ntfsinfo/etc... which sounds like exactly what I need... but it won't open the partition at all. (Frustrating, since 'mount' usually will)
6) The file is an Excel 2003-era XLS file, so I'm not sure I can come up with any strings to search the raw disk image for. (Possibly parts of the 6 month old version?)
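On the question of what to search a raw image for: Excel 97-2003 .xls files are OLE2 compound documents, which begin with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1. The following is only a minimal sketch, assuming the ddrescue copy is an ordinary image file and that candidate files start on a sector boundary. The function name and the image path are hypothetical. Every other OLE2 file (old .doc, .ppt, .msi and so on) will match too, and a fragmented file will not be contiguous after its header, so hits are candidates to inspect, not a recovered file.

```python
# OLE2 compound-document signature found at the start of Excel 97-2003 .xls files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole2_headers(image_path, sector_size=512, chunk_sectors=2048):
    """Return byte offsets of sector-aligned OLE2 signatures in a raw image.

    image_path is a hypothetical path to the ddrescue output image.
    """
    hits = []
    with open(image_path, "rb") as img:
        base = 0  # byte offset of the current chunk within the image
        while True:
            chunk = img.read(sector_size * chunk_sectors)
            if not chunk:
                break
            # Only test the first 8 bytes of each sector in the chunk.
            for off in range(0, len(chunk), sector_size):
                if chunk[off:off + 8] == OLE2_MAGIC:
                    hits.append(base + off)
            base += len(chunk)
    return hits
```

Dividing each returned offset by 512 gives a sector number that can be compared against the bad areas listed in the ddrescue mapfile.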
I'd really like to use something like the facilities of debugfs. However, from the man pages it appears the ntfs tools could do the work if only they could be made to open the partition. In particular, I am wondering if the IO errors might be purely within the metadata for the file, and if the directory record could be restored well enough to copy the file contents off. As a last resort, whatever partial file contents I can retrieve would be great.
I've written (relatively simple) kernel modules before, so I could compile a special NTFS module with more debug info enabled (or added). (The file is worth at least a few days of tinkering to try to recover... plus I'm learning cool stuff in the process)
Any pointers?
EDIT:
More drive error information:
The /var/log/messages is showing a lot of NTFS-fs errors of course... but I finally bothered to translate the unhandled sense code message I usually get: sense key 0x3, ASC=0x11, ASCQ=0x4. (which appears to translate to UNRECOVERED READ ERROR - AUTO REALLOCATE FAILED).
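For reference, the translation of that sense data can be sketched as a small lookup. This is deliberately partial: it covers only a handful of standard SCSI sense keys and the two ASC/ASCQ pairs relevant here, and the function name is made up for illustration.

```python
# Partial tables of standard SCSI sense keys and ASC/ASCQ codes.
# Only the entries relevant to this log message are included.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
}

ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x11, 0x04): "UNRECOVERED READ ERROR - AUTO REALLOCATE FAILED",
}


def describe_sense(key, asc, ascq):
    """Return a readable description of a sense key / ASC / ASCQ triple."""
    key_text = SENSE_KEYS.get(key, "UNKNOWN SENSE KEY 0x%X" % key)
    detail = ASC_ASCQ.get((asc, ascq), "ASC=0x%02X ASCQ=0x%02X" % (asc, ascq))
    return "%s: %s" % (key_text, detail)


print(describe_sense(0x3, 0x11, 0x4))
```

Sense key 0x3 is a medium error, which is consistent with damaged sectors on the platters rather than a controller or cabling fault.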
When the drive spins down, I see a "scsi0: * BusLogic BT-958 Initialized" message. I'm not sure if it's the Linux SCSI driver, the ESXi driver, or the drive itself that decides to spin the drive down. If it was the Linux driver, then perhaps I could modify the driver to avoid spinning down. This whole ddrescue thing is made massively more painful by these power-cycle-requiring spindowns.
EDIT2:
Using the "end_request: I/O error, dev sda, sector 7238859" log message right after I 'ls' the directory containing the file in question, I've targeted my ddrescue operation to that sector. I currently plan to take my chances and WRITE that sector back to the live disk if this succeeds. Perhaps I can slowly rebuild my way to the file in question this way. Still, most recoverable bad sectors are recovered in under 20 retries... this one is over 150 so far... *sigh*
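Targeting a single sector comes down to converting the sector number from the kernel log into a byte offset, since ddrescue's -i (input position) and -s (size) options take bytes. A rough sketch of that arithmetic follows, assuming the logged sector is in 512-byte units counted from the start of the whole device. The image and mapfile names and the retry count are placeholders, not the values actually used.

```python
import shlex

SECTOR_SIZE = 512  # assumption: kernel "sector N" messages use 512-byte units


def sector_to_offset(sector, sector_size=SECTOR_SIZE):
    """Byte offset of a sector from the start of the device."""
    return sector * sector_size


def ddrescue_command(sector, count=1, source="/dev/sda",
                     image="sda.img", mapfile="sda.map", retries=20):
    """Build a ddrescue command line limited to `count` sectors at `sector`.

    source, image and mapfile are placeholder names for illustration.
    """
    args = [
        "ddrescue",
        "-d",                                  # direct disc access for input
        "-r", str(retries),                    # number of retry passes
        "-i", str(sector_to_offset(sector)),   # input position in bytes
        "-s", str(count * SECTOR_SIZE),        # size to copy in bytes
        source, image, mapfile,
    ]
    return " ".join(shlex.quote(a) for a in args)


print(ddrescue_command(7238859))
```

With no explicit output position, ddrescue writes to the same offset in the image as it read from on the source, so a recovered sector lands in the right place in the existing copy.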
EDIT3:
The sector error from 'ls' on the file I need is entirely uncooperative (1000+ tries overnight and no luck). I'm hoping that's just metadata when you do an 'ls' ? :)
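One way to test the "just metadata" guess is to work out whether the failing sector lies inside the MFT. The sketch below reads the NTFS geometry from the volume's boot sector and maps a whole-disk sector to an MFT record number. It rests on several assumptions: the partition's start sector is known (for example from fdisk), kernel sectors are 512 bytes, cluster sizes are ordinary, and the MFT is one contiguous run, which is not guaranteed. The function names are illustrative. Note too that an 'ls' also reads the directory's index blocks, which can live outside the MFT, so a negative result here does not rule out metadata.

```python
import struct

KERNEL_SECTOR = 512  # assumption: logged sector numbers are 512-byte units


def parse_ntfs_boot_sector(boot):
    """Extract basic geometry from the first 512 bytes of an NTFS volume."""
    if boot[3:11] != b"NTFS    ":
        raise ValueError("not an NTFS boot sector")
    (bytes_per_sector,) = struct.unpack_from("<H", boot, 0x0B)
    sectors_per_cluster = boot[0x0D]
    mft_lcn, mftmirr_lcn = struct.unpack_from("<QQ", boot, 0x30)
    (raw,) = struct.unpack_from("<b", boot, 0x40)
    cluster_size = bytes_per_sector * sectors_per_cluster
    # A negative value encodes the record size as 2**(-raw) bytes.
    record_size = (1 << -raw) if raw < 0 else raw * cluster_size
    return {
        "cluster_size": cluster_size,
        "record_size": record_size,
        "mft_lcn": mft_lcn,
        "mftmirr_lcn": mftmirr_lcn,
    }


def mft_record_for_sector(disk_sector, part_start_sector, geo):
    """Map a whole-disk sector to an MFT record index.

    Returns None if the sector lies before the start of the MFT. The result
    is only meaningful within the first (assumed contiguous) run of the MFT.
    """
    byte_in_volume = (disk_sector - part_start_sector) * KERNEL_SECTOR
    mft_start = geo["mft_lcn"] * geo["cluster_size"]
    if byte_in_volume < mft_start:
        return None
    return (byte_in_volume - mft_start) // geo["record_size"]
```

A record number larger than the total number of MFT records means the sector is past the MFT and holds either file data or other metadata such as index blocks.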
I do have most of a ddrescue copy, but that doesn't mount (or mounts without files). The damaged drive mounts correctly most of the time... maybe IO errors on the damaged drive force 'mount' to fall back to the mirror that works?
EDIT4:
I've given up for now, pending further suggestions. I've removed the drive and rebuilt the box. I'll keep the drive around in case something comes up.
Use ddrescue or another similar tool to copy as many sectors as possible. Don't do any filesystem-level recovery from the damaged disk; do it from the copy.