
I have a USB storage drive which is using NTFS. It works perfectly in Win10, I can see the drive and the files and read and write without issue.

I was previously able to mount this drive on my Arch server via the UUID.

I recently plugged this USB into my Win10 machine so I could do some file transfers onto another USB, and that was fine. After, I safely removed the USB drives with the eject option.

But now, when I plug the USB back into my Arch server, it throws an error on boot saying it timed out mounting the disk.

I had a look using things like fdisk and lsblk and the UUID has completely disappeared for the USB disk.

fdisk -l shows the disk and its partitions /dev/sdb1 "Microsoft Reserved" and /dev/sdb2 "Microsoft Basic Data".

But lsblk -l only shows sdb; it doesn't show the partitions like it does for the other drives.

Looking in /dev/disk/by-uuid, there is nothing that correlates to the USB disk.

When I try sudo mount /dev/sdb2 /mnt/test it says fsconfig system call failed /dev/sdb2: Can't look up block dev

and sudo mount /dev/sdb2 /mnt/test -t ntfs says Failed to access volume '/dev/sdb2': No such file or directory.

I can still plug this back into my Win10 machine and it's fine.

I have run chkdsk /f on that disk and it looks OK there.

This USB storage drive is a ~30TB RAID with a lot of data so I can't really just reformat.

Any ideas why this randomly stopped working and more importantly how to fix it without having to reformat?


Results from fdisk -l /dev/sdb

The backup GPT table is corrupt, but the primary appears OK, so that will be used.
Disk /dev/sdb: 115.45 PiB, 129986248068418560 bytes, 253879390758630 sectors
Disk model: USB3.0 DISK00       
Units: sectors of 1 * 512 = 512 bytes    
Sector size (logical/physical): 512 bytes / 4096 bytes    
I/O size (minimum/optimal): 4096 bytes / 4096 bytes    
Disklabel type: gpt    
Disk identifier: 6E50C089-2F38-45FD-A894-20BBFDD4B8AF

Device     Start         End     Sectors  Size Type    
/dev/sdb1     34       32767       32734   16M Microsoft reserved    
/dev/sdb2  32768 58598420479 58598387712 27.3T Microsoft basic data    

Partition 1 does not start on physical sector boundary.

Results from sudo gdisk /dev/sdb

GPT fdisk (gdisk) version 1.0.10

Warning! Read error 5; strange behavior now likely!
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

Warning! Error 5 reading partition table for CRC check!
Warning! One or more CRCs don't match. You should repair the disk!
Main header: OK
Backup header: ERROR
Main partition table: OK
Backup partition table: ERROR

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: damaged

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************

Results from dmesg

[ 3480.558117] Buffer I/O error on dev sdb, logical block 126939695379298, async page read
[ 3480.558120] Buffer I/O error on dev sdb, logical block 126939695379299, async page read
[ 3480.914504] sd 5:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 3480.914523] sd 5:0:0:0: [sdb] tag#0 Sense Key : Illegal Request [current] 
[ 3480.914532] sd 5:0:0:0: [sdb] tag#0 Add. Sense: Logical block address out of range
[ 3480.914541] sd 5:0:0:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 e6 e6 e6 e6 e6 00 00 00 00 08 00 00
[ 3480.914546] critical target error, dev sdb, sector 253879390758400 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3480.916343] sd 5:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 3480.916362] sd 5:0:0:0: [sdb] tag#0 Sense Key : Illegal Request [current] 
[ 3480.916375] sd 5:0:0:0: [sdb] tag#0 Add. Sense: Logical block address out of range
[ 3480.916386] sd 5:0:0:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 e6 e6 e6 e6 e6 00 00 00 00 02 00 00
[ 3480.916394] critical target error, dev sdb, sector 253879390758400 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 3480.916408] Buffer I/O error on dev sdb, logical block 126939695379200, async page read
[ 3480.918329] sd 5:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 3480.918341] sd 5:0:0:0: [sdb] tag#0 Sense Key : Illegal Request [current] 
[ 3480.918348] sd 5:0:0:0: [sdb] tag#0 Add. Sense: Logical block address out of range
[ 3480.918356] sd 5:0:0:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 e6 e6 e6 e6 e6 02 00 00 00 02 00 00
[ 3480.918360] critical target error, dev sdb, sector 253879390758402 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 3480.918369] Buffer I/O error on dev sdb, logical block 126939695379201, async page read
[ 3480.920074] sd 5:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 3480.920086] sd 5:0:0:0: [sdb] tag#0 Sense Key : Illegal Request [current] 
[ 3480.920094] sd 5:0:0:0: [sdb] tag#0 Add. Sense: Logical block address out of range
[ 3480.920101] sd 5:0:0:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 e6 e6 e6 e6 e6 04 00 00 00 02 00 00
[ 3480.920105] critical target error, dev sdb, sector 253879390758404 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 3480.920115] Buffer I/O error on dev sdb, logical block 126939695379202, async page read

Results from Win10 sg_readcap.exe -l g: (exact same results on Arch)

Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=0, lbprz=0
   Last LBA=58598424575 (0xda4bcffff), Number of logical blocks=58598424576
   Logical block length=512 bytes
   Logical blocks per physical block exponent=3 [so physical block length=4096 bytes]
   Lowest aligned LBA=0
Hence:
   Device size: 30002393382912 bytes, 2.86125e+07 MiB, 30002.4 GB, 30.0024 TB
  • Is this a GPT or MBR partition label? What lines do you get from sudo blkid | grep -i "microsoft"? Also tell it the fs type, as NTFS is an oddball: sudo mount -t ntfs /dev/sdb2 /mnt/test. See /proc/filesystems and /lib/modules/$(uname -r)/kernel/fs for a complete list of the filesystems. Commented Oct 8 at 4:07
  • It is GPT. sdb doesn't show up at all in blkid, so the grep result is empty. I already ran that mount command twice, once with the type specified and once without; neither works now (when previously the typed one would work). I'm not sure what I'm looking at in those 2 directories; should there be entries for ntfs in both? I don't see ntfs in proc, and I see ntfs3 in the kernel one. Commented Oct 8 at 4:22
  • You should check the kernel messages (dmesg / journalctl -k) after plugging in the drive. It sounds like somehow the kernel is ignoring the partition table. Can you add full output of fdisk -l /dev/sdb? (Maybe also see if opening the drive with gdisk instead give you some hint about what's wrong.) Commented Oct 8 at 5:45
  • Never quite seen anything like this before. It's almost like the enclosure has gone haywire. ~10TB RAID Which mode of RAID? Is 10TB the expected RAID capacity or the size of each drive? Does the size of 27.3TiB (i.e. 30TB) make any sense to you? Do you mind installing sg3_utils and add the output of sg_readcap -l /dev/sdb? Commented Oct 8 at 6:53
  • I also want a 100 Petabyte drive... if it gets detected like that in Linux, you probably can't trust anything the drive gives you in that state. If it works in Windows, do your recovery there. If this is a RAID, recover the individual drives. Commented Oct 8 at 6:56

2 Answers


I'm going to write an answer to share my speculations about what has happened, along with some suggestions.

Disk /dev/sdb: 115.45 PiB, 129986248068418560 bytes, 253879390758630 sectors

Obviously this is the culprit (or at least a symptom of it). The number (253879390758630) is the return value of the SCSI READ CAPACITY (16) command (which is issued to the USB bridge chip in the enclosure). In other words, the number has nothing to do with the data and metadata stored on the drives (unless the RAID controller implements the RAID in a way where a small part of the drives is used to store RAID metadata, if that's a thing for this sort of "hardware RAID"). It might be worth mentioning that the number is E6E6E6E6E6E6 in hexadecimal, btw.

Both Windows and Linux obtain the capacity of a (USB) drive with this SCSI command (although there could be slight differences in the parameters / "configuration"), so I really doubt it has been reporting that crazy number ever since it was first set up.

What makes me doubt it (or the fact that it still appears normal to Windows) further is that Linux probably won't attempt to read beyond the actual capacity of the drive anyway, as long as Windows hasn't "fixed" the (primary) GPT header according to the bogus capacity. There is a field in each GPT header that contains the location of the other one, and I'm pretty sure that Linux relies on that to look for the backup header, in which case it (and gdisk) would not read beyond the actual capacity. (I am quite sure that the Linux kernel itself will never attempt any automatic fix / repair on partition tables.)
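To illustrate the point about that field: the primary GPT header at LBA 1 stores the location of the backup header in its "alternate LBA" field, which is what the kernel and gdisk follow rather than the capacity reported by the enclosure. A minimal sketch of reading those fields (offsets per the UEFI spec; the sample values below are synthetic, chosen to match a 58598424576-sector disk, and were not read from the OP's drive):

```python
import struct

def gpt_header_lbas(header: bytes) -> dict:
    """Extract the LBA fields from a raw 512-byte GPT header."""
    if header[0:8] != b"EFI PART":
        raise ValueError("not a GPT header")
    # Offsets 24..56: current LBA, alternate (backup) LBA,
    # first usable LBA, last usable LBA -- all little-endian u64.
    current, backup, first_usable, last_usable = struct.unpack_from("<4Q", header, 24)
    return {"current_lba": current, "backup_lba": backup,
            "first_usable_lba": first_usable, "last_usable_lba": last_usable}

# Synthetic primary header for illustration (assumed sane values):
hdr = bytearray(512)
hdr[0:8] = b"EFI PART"
struct.pack_into("<4Q", hdr, 24, 1, 58598424575, 34, 58598424542)
print(gpt_header_lbas(bytes(hdr)))
```

On a healthy disk, the primary header's backup_lba points at the last sector; here the primary header apparently points past the end of the real drive.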

My guess is that either it was fine for both OSes until some point, or it was not fine for either since day one (and the partition table was created by something that is able to obtain the real capacity of the RAID through other means).

If I were you, I might get sg3_utils on both Linux and Windows and see if sg_readcap -l gives a very different result (although the result does not necessarily reflect what is sent and received by e.g. the Windows kernel). From there I might decide what further action to take.

As a workaround (for getting the RAID accessible on Linux), you might try something like:

losetup -f -P --sizelimit 30002391285760 /dev/sdb

EDIT: Actually, the size limit you need might be 129986248068418560 minus 33*512. Both fdisk and the kernel (but not gdisk) appear to ignore the (primary) GPT completely if you "trim" more than that (from the "original" capacity, probably as per the Last usable LBA field in the GPT header).
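As a sanity check on those numbers (plain arithmetic from the figures quoted in the question; nothing here touches the drive, and the "33 sectors" reading is my interpretation, matching the space a backup GPT normally occupies at the end of a disk):

```python
SECTOR = 512

# Real capacity per sg_readcap -l: last LBA 58598424575, 512-byte blocks
real_capacity = (58598424575 + 1) * SECTOR
print(real_capacity)   # 30002393382912, matching sg_readcap's "Device size"

# The alternative limit suggested above: the bogus reported capacity
# minus 33 sectors (1 backup GPT header + 32 partition-entry sectors)
alt_limit = 129986248068418560 - 33 * SECTOR
print(alt_limit)       # 129986248068401664
```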

Through that you'd create a loop device that maps to most of the actual capacity of the drive (the primary GPT + partitions + whatever in between). The backup GPT would still be missing, but now the kernel would "expect" it since the whole (virtual) drive is now "readable", which should make it ignore the problem.

Make sure you detach the loop device with losetup -d before unplugging the drive.

You may also attempt to "fix" the primary GPT header by "moving" the backup GPT to right after the last partition with gdisk, but it might get "unfixed" again under certain circumstances in Windows, so it might not be worth the effort.

In the long run, it's better to stop using the enclosure, especially if you need to use it on Linux. Even if it were "perfect" to Windows, it is not to Linux, and you can't really know what else could go wrong.


So far, under no circumstances have I seen fdisk -l report a different capacity from lsblk. So I assume that the nonsensical capacity you saw is what was received by the kernel as well. (You can confirm that by checking the relevant kernel messages, which most likely got logged before those errors you added to your post. I have no idea how the kernel could trigger / receive a different result than what you see in the sg_readcap output. All I can say is that I still think your enclosure just doesn't play along with the Linux kernel in one way or another. Maybe there's some kind of "probe timing" issue?)


So the primary GPT header does contain numbers that reflect the bogus capacity (0xE6E6E6E6E6E6), whereas the backup GPT header contains correct numbers. I have no idea how that could have happened, especially when the CRCs in both headers appear to check out respectively. Apparently something intentionally "fixed" the primary GPT header, without touching the backup one, according to that mysterious capacity of recurring hexadecimal digits that came out of nowhere.

Because tail seems to work fine on the RAID, apparently lseek() has no problem finding out the actual capacity with SEEK_END, whereas fdisk (and gdisk) apparently leverage the BLKGETSIZE64 ioctl to obtain the capacity of the drive -- and that probably somehow gave the bogus one. You may help confirm by running blockdev --getsize64 /dev/sdb.
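The two queries can be compared side by side with a small sketch (both functions are illustrative, not from the post; the ioctl number is the one used on 64-bit Linux, and the ioctl path only works on an actual block device, so running it against /dev/sdb needs root):

```python
import fcntl
import os
import struct

BLKGETSIZE64 = 0x80081272  # _IOR(0x12, 114, size_t) on 64-bit Linux

def size_via_lseek(path: str) -> int:
    """What tail / lseek(SEEK_END) sees."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)

def size_via_ioctl(path: str) -> int:
    """What fdisk / blockdev --getsize64 see (block devices only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(8)
        fcntl.ioctl(fd, BLKGETSIZE64, buf)
        return struct.unpack("<Q", buf)[0]
    finally:
        os.close(fd)

# e.g. compare size_via_lseek("/dev/sdb") with size_via_ioctl("/dev/sdb")
```

If the two disagree on /dev/sdb, that would pin down where the bogus capacity is coming from.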

However, I still could not reproduce the behavior of fdisk / gdisk by "importing" both GPT headers (with dd) onto a loop device. In fact, I have no idea how the kernel (on your system) could have seemingly cached two different capacities for one drive, and, maybe because of that, accepted attempts to read past the actual capacity of the RAID.

(Maybe it's because I didn't adjust the CRC for the partition entries array in the primary header. Maybe if I did, fdisk and gdisk would "accept" it? That could be the case if even blockdev --getsize64 /dev/sdb returns the correct capacity.)

Anyway, there are a few options / approaches in gdisk that could help you fix the primary partition table, such as relocate backup data structures to the end of the disk in the Expert menu, or use backup GPT header (rebuilding main) in the Recovery/transformation menu. However, because of the peculiar behavior observed on your system, these will likely fail in your case, at least when done directly. My suggestion is that you create a loop device that covers the whole RAID first:

losetup -P -f --show --sizelimit 30002393382912 /dev/sdb

Then you fix it through the loop device. Once you are done, detach the loop device, reconnect the drive, and see if the kernel now creates the dev nodes for the partitions.

You should probably attempt to fix the primary GPT header only if you see:

Warning! Disk size is smaller than the main header indicates! Loading
secondary header from the last sector of the disk! You should use 'v' to
verify disk integrity, and perhaps options on the experts' menu to repair
the disk.

instead of any Read error 5 / Error 5 reading when opening the loop device with gdisk.

Also note that the attempt won't actually be written to the RAID unless you choose to write table to disk and exit.

P.S. DO NOT choose the use main GPT header (rebuilding backup) option by mistake! It's your main/primary header that is broken!

  • @Neros hmm, what's the capacity seen in lsblk? Starting to seem like there's some kind of kernel regression... Commented Oct 8 at 11:16
  • lsblk shows the drive as 27.3TB, but it doesn't show the partitions like fdisk does. Commented Oct 8 at 22:23
  • @Neros I'm totally lost then. Not sure if it's gonna help, but maybe you can add the output of sudo head -c 1024 /dev/sdb | hexdump -C and sudo tail -c 512 /dev/sdb | hexdump -C to your post. (It would be good to know whether the second one fails too. In case it does, do sudo dd if=/dev/sdb skip=58598424575 count=1 | hexdump -C instead.) Also, consider adding the kernel messages logged upon the connection of the drive, by that I mean the non-error information about drive before / in between the errors. Commented Oct 9 at 3:06
  • results of those commands - pastebin.com/1YCd5mn1 Commented Oct 9 at 3:35
  • @Neros I have updated my answer. Btw, since you wrote Arch server, is the system regularly updated? What's the kernel version? Commented Oct 9 at 5:51

There are 2 solutions that both worked for me.

Always backup any important data first before you do anything, especially before trying any weird filesystem voodoo!

  1. The safe way.

Copy all the data to some other storage, reformat the disk using a Linux partitioning app so it has one single NTFS partition, check that it mounts on Linux and works properly on Windows, and then copy all the data back.

  2. The possibly-unsafe-but-seems-to-do-the-same-thing-in-the-end way.

Open the disk in a Linux partitioning app (one that can see both the "Microsoft Reserved" and "Microsoft Basic Data" partitions) and simply delete the 'reserved' partition (optionally expand the 'basic' partition to fill the space, but it's not much of a space gain anyway). As soon as I did that on a test drive, it automatically mounted on its own, when previously it wouldn't show up at all.

Both of these methods ended up changing the UUID, so you will need to find the new one (e.g. using sudo blkid /dev/sdc1) and edit your /etc/fstab, or whatever else references the old UUID, to fix your mounts.
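For example, a repaired fstab entry might look something like this (the UUID and mount point are placeholders; adding nofail plus a short device timeout also stops a missing drive from hanging boot, which is what the original timeout-on-boot error suggests was happening):

```
# /etc/fstab -- hypothetical entry; substitute the UUID that blkid reports
UUID=0123456789ABCDEF  /mnt/usb  ntfs3  defaults,nofail,x-systemd.device-timeout=10s  0 0
```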

I have no idea why it was working fine originally, or why it stopped working just because I temporarily plugged it back into a Windows box, so I can't say whether this was a Windows issue, a Linux issue, or an enclosure issue. YMMV, but both of these worked for me.
