
Synopsis

A healthy RAID-5 array had one drive removed and quickly reinserted, and the array started rebuilding onto it. A second drive was then removed within 10 minutes. The original drive assignments (sda, sdb, etc.) have since changed due to further user errors (rebooting/swapping drives). I need advice on the next steps.

Backstory

I am sorry this is so long, but here is the backstory in case it helps.


My name is Mike. I am not a daily user of Linux, but I can usually work my way through what I need to get done with quick searches to remind me of syntax and by reading man pages. I thought I could figure this out given time (it's been months), but I now realize this is not something I am comfortable doing without help, since the data is invaluable to my friend's family. He has no other backups of the data: his backup drive also failed and he did not realize it… he just assumed it was working.

To start, this is a QNAP appliance with a 5-drive RAID-5 array of 8TB drives. He logged in and noticed that one drive was marked unhealthy due to bad blocks, but it was still a member and the array was still working fine, so he wanted to replace it with a new drive before it got worse. Unfortunately, he pulled out the wrong drive. He quickly realized the mistake and put it back in, and the array started rebuilding onto that drive (I saw that in the QNAP logs). Without knowing any better, less than 10 minutes later he pulled out the drive he actually wanted to replace and put in a new one. He then noticed the array was offline and his data was inaccessible, so he put the original drive back in and rebooted the QNAP, hoping that would fix it. Obviously, it didn't.

He then called, and I said we should not do anything until we back up the data on all of the original drives. He happened to have a few 12TB/18TB external drives, so I used dd to clone the md member partitions (/dev/sdX3) onto them (just those partitions, not the whole /dev/sdX disks).

(Exact commands I used, plus a note as to which external drive each image went to:)

dd if=/dev/sda3 of=/share/external/DEV3302_1/2024022_170502-sda3.img  (DST:18TB-1)
dd if=/dev/sdf3 of=/share/DiskImages/2024022_164848-sdf3.img  (DST:18TB-2)
dd if=/dev/sdb3 of=/share/external/DEV3302_1/2024022_170502-sdb3.img  (DST:18TB-1)
dd if=/dev/sdg3 of=/share/external/DEV3305_1/2024022_170502-sdg3.img  (DST:12TB)
dd if=/dev/sdd3 of=/share/DiskImages/2024022_170502-sdd3_Spare.img  (DST:18TB-2)
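My plan for sanity-checking those images is something like the following (a sketch I have not run yet; my understanding is that mdadm can examine a plain image file directly, or the image can be attached read-only with losetup first):

# Sketch only: confirm an image file carries the expected md superblock and Device UUID.
mdadm --examine /share/external/DEV3302_1/2024022_170502-sda3.img | grep -E 'Array UUID|Device UUID|Device Role|Events'

# or attach it as a read-only loop device first and examine that:
loopdev=$(losetup --find --show --read-only /share/external/DEV3302_1/2024022_170502-sda3.img)
mdadm --examine "$loopdev"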

These were just quick backups. Due to the age of the drives (5+ years), we figured we would also replace all of the NAS drives with new ones, so I then repeated this process with each of the drives, one by one, using a process/commands like this:

Insert a new drive into an empty slot (it got assigned sdh), then:

dd if=/dev/sda of=/dev/sdh

Wait 14 hours for it to complete, remove the drive, replace it with another new drive, and repeat:

dd if=/dev/sdb of=/dev/sdh

Etc…
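Before trusting each clone I also want to verify it against its source. Since the new drives may not be exactly the same size as the old ones, my thought was to compare only up to the source drive's size, roughly like this (a sketch, assuming GNU cmp and blockdev are available on the QNAP):

# Sketch only: compare the clone (sdh) against the source (sda), limited to the
# source's size, since the destination drive may be slightly larger.
SRC_SIZE=$(blockdev --getsize64 /dev/sda)
cmp --bytes="$SRC_SIZE" /dev/sda /dev/sdh && echo "clone of sda matches"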

So we should have exact copies of the drives. I assumed (I think incorrectly) that we could power off the QNAP, swap the old drives out for the copies, and then start trying commands like:

mdadm -CfR /dev/md1 --assume-clean -l 5 -n 5 -c 512 -e 1.0 /dev/sda3 /dev/sdb3 /dev/sdg3 missing /dev/sdd3

(I am not certain that command is correct even before the next paragraph)

Unfortunately, after swapping the drives we now had two missing drives instead of one, and the assignments seemed to have changed (e.g., sda is no longer sda). I figured I must have messed up the dd copy of one drive, so we were going to redo the copy of the missing one. I tracked which drives were showing/missing and we reinserted the original disks; now they again have different assignments, but it is back to showing only a single missing drive. I am lost. I might be able to figure out the original order by comparing the drives' Device UUIDs, but I do not want to touch anything before asking for advice.
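(Something like this is what I had in mind for that mapping — just a sketch, and it only reads the superblocks:)

# Sketch: map the current /dev/sdX3 names to their persistent md Device UUIDs,
# roles and event counts, without writing anything.
for d in /dev/sd[a-h]3; do
    echo "== $d"
    mdadm --examine "$d" 2>/dev/null | grep -E 'Device UUID|Device Role|Events|Update Time'
done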

Technical description

Here is the output of the recommended diagnostic commands, at least the ones that are supported on the QNAP.

[QNAPUser@QNAP ~]$ uname -a
Linux QNAP 5.10.60-qnap #1 SMP Mon Feb 19 12:14:12 CST 2024 x86_64 GNU/Linux

[QNAPUser@QNAP ~]$ mdadm --version
mdadm - v3.3.4 - 3rd August 2015

[QNAPUser@QNAP ~]$ smartctl --xall /dev/sda
-sh: smartctl: command not found

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdb
/dev/sdb:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdc
/dev/sdc:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdd
/dev/sdd:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sde
/dev/sde:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdf
/dev/sdf:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdg
/dev/sdg:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdh

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdb3

/dev/sdb3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x0
     Array UUID : 29f7c4cf:b6273e81:34f3f156:1cd1cfe2
           Name : 1
  Creation Time : Thu Aug 17 13:28:50 2017
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 15608143240 (7442.54 GiB 7991.37 GB)
     Array Size : 31216285696 (29770.17 GiB 31965.48 GB)
  Used Dev Size : 15608142848 (7442.54 GiB 7991.37 GB)
   Super Offset : 15608143504 sectors
   Unused Space : before=0 sectors, after=648 sectors
          State : clean
    Device UUID : f49eadd1:661a76d3:6ed998ad:3a39f4a9

    Update Time : Thu Feb 29 17:05:02 2024
  Bad Block Log : 512 entries available at offset -8 sectors
       Checksum : d61a661f - correct
         Events : 89359

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAA. ('A' == active, '.' == missing, 'R' == replacing)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdc3

/dev/sdc3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x0
     Array UUID : 29f7c4cf:b6273e81:34f3f156:1cd1cfe2
           Name : 1
  Creation Time : Thu Aug 17 13:28:50 2017
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 15608143240 (7442.54 GiB 7991.37 GB)
     Array Size : 31216285696 (29770.17 GiB 31965.48 GB)
  Used Dev Size : 15608142848 (7442.54 GiB 7991.37 GB)
   Super Offset : 15608143504 sectors
   Unused Space : before=0 sectors, after=648 sectors
          State : clean
    Device UUID : b50fdcc1:3024551b:e56c1e38:8f9bc7f8

    Update Time : Thu Feb 29 17:05:02 2024
  Bad Block Log : 512 entries available at offset -8 sectors
       Checksum : e780d676 - correct
         Events : 89359

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAA. ('A' == active, '.' == missing, 'R' == replacing)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sde3

/dev/sde3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x0
     Array UUID : 29f7c4cf:b6273e81:34f3f156:1cd1cfe2
           Name : 1
  Creation Time : Thu Aug 17 13:28:50 2017
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 15608143240 (7442.54 GiB 7991.37 GB)
     Array Size : 31216285696 (29770.17 GiB 31965.48 GB)
  Used Dev Size : 15608142848 (7442.54 GiB 7991.37 GB)
   Super Offset : 15608143504 sectors
   Unused Space : before=0 sectors, after=648 sectors
          State : clean
    Device UUID : ae2c3578:723041ba:f06efdb1:7df6cbb2

    Update Time : Thu Feb 29 17:05:02 2024
  Bad Block Log : 512 entries available at offset -8 sectors
       Checksum : 70a95caf - correct
         Events : 89359

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : AAAA. ('A' == active, '.' == missing, 'R' == replacing)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdg3

/dev/sdg3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x0
     Array UUID : 29f7c4cf:b6273e81:34f3f156:1cd1cfe2
           Name : 1
  Creation Time : Thu Aug 17 13:28:50 2017
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 15608143240 (7442.54 GiB 7991.37 GB)
     Array Size : 31216285696 (29770.17 GiB 31965.48 GB)
  Used Dev Size : 15608142848 (7442.54 GiB 7991.37 GB)
   Super Offset : 15608143504 sectors
   Unused Space : before=0 sectors, after=648 sectors
          State : clean
    Device UUID : cf03e7e1:2ad22385:41793b2c:4f93666c

    Update Time : Thu Feb 29 16:38:38 2024
  Bad Block Log : 512 entries available at offset -8 sectors
       Checksum : da1a5378 - correct
         Events : 80401

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)

[QNAPUser@QNAP ~]$ sudo mdadm --examine /dev/sdh3

/dev/sdh3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x0
     Array UUID : 29f7c4cf:b6273e81:34f3f156:1cd1cfe2
           Name : 1
  Creation Time : Thu Aug 17 13:28:50 2017
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 15608143240 (7442.54 GiB 7991.37 GB)
     Array Size : 31216285696 (29770.17 GiB 31965.48 GB)
  Used Dev Size : 15608142848 (7442.54 GiB 7991.37 GB)
   Super Offset : 15608143504 sectors
   Unused Space : before=0 sectors, after=648 sectors
          State : clean
    Device UUID : a06d8a8d:965b58fe:360c43cd:e252a328

    Update Time : Thu Feb 29 17:05:02 2024
  Bad Block Log : 512 entries available at offset -8 sectors
       Checksum : 5b32c26d - correct
         Events : 89359

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAA. ('A' == active, '.' == missing, 'R' == replacing)

[QNAPUser@QNAP ~]$ sudo mdadm --detail /dev/md1  (This is the array that is broken)

mdadm: cannot open /dev/md1: No such file or directory

[QNAPUser@QNAP ~]$ git clone git://github.com/pturmel/lsdrv.git lsdrv
-sh: git: command not found

[QNAPUser@QNAP ~]$ cat /proc/mdstat

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md3 : active raid1 sdd3[0]
      17568371520 blocks super 1.0 [1/1] [U]

md2 : active raid1 sdf3[0]
      7804071616 blocks super 1.0 [1/1] [U]

md322 : active raid1 sdd5[6](S) sdf5[5](S) sde5[4](S) sdg5[3](S) sdh5[2](S) sdb5[1] sdc5[0]
      6702656 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md256 : active raid1 sdd2[6](S) sdf2[5](S) sde2[4](S) sdg2[3](S) sdh2[2](S) sdb2[1] sdc2[0]
      530112 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md13 : active raid1 sde4[5] sdg4[4] sdh4[3] sdb4[2] sdc4[1] sdf4[6]
      458880 blocks super 1.0 [24/6] [_UUUUUU_________________]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md9 : active raid1 sde1[5] sdg1[4] sdh1[3] sdb1[2] sdc1[1] sdf1[6]
      530048 blocks super 1.0 [24/6] [_UUUUUU_________________]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

[QNAPUser@QNAP ~]$ sudo md_checker

Welcome to MD superblock checker (v2.0) - have a nice day~

Scanning system...

RAID metadata found!
UUID:           29f7c4cf:b6273e81:34f3f156:1cd1cfe2
Level:          raid5
Devices:        5
Name:           md1
Chunk Size:     512K
md Version:     1.0
Creation Time:  Aug 17 13:28:50 2017
Status:         OFFLINE
===============================================================================================
 Enclosure | Port | Block Dev Name | # | Status |   Last Update Time   | Events | Array State
===============================================================================================
 NAS_HOST       8        /dev/sdb3   0   Active   Feb 29 17:05:02 2024    89359   AAAA.
 NAS_HOST       7        /dev/sdc3   1   Active   Feb 29 17:05:02 2024    89359   AAAA.
 NAS_HOST       9        /dev/sdh3   2   Active   Feb 29 17:05:02 2024    89359   AAAA.
 NAS_HOST      10        /dev/sdg3   3   Active   Feb 29 16:38:38 2024    80401   AAAAA
 ----------------------------------  4  Missing   -------------------------------------------
===============================================================================================

md_checker is a QNAP command, so you might not be familiar with it, but the output should be useful.

Based on the output above (specifically the Last Update Time and Events), I believe that sdg3 was the first drive to be temporarily pulled from the array and was in the process of rebuilding when the second drive was pulled (now showing as "4 Missing"?). I believe the second drive is now assigned to sde, which is showing "Device Role : spare". I am basing this on the fact that the Events count and Last Update Time of sdb3, sdc3, sdh3 and sde3 are identical.

My goal is to do the recovery using copies of the drives, not the originals, in case something happens that makes the issue worse. We do not need the array to be "healthy" or writable, since we just need to copy the data off of it.
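What I had in mind, purely as a sketch (I have not tried it, and I am not sure a forced assemble even works against read-only devices), was to attach the partition images read-only via loop devices and attempt a forced, read-only assemble:

# Sketch only: attach the saved partition images read-only, then try a forced,
# read-only assemble. mdadm reads each member's role from its superblock, so the
# listing order should not matter here (unlike with --create).
losetup -r /dev/loop1 /share/external/DEV3302_1/2024022_170502-sda3.img
losetup -r /dev/loop2 /share/external/DEV3302_1/2024022_170502-sdb3.img
losetup -r /dev/loop3 /share/DiskImages/2024022_164848-sdf3.img
losetup -r /dev/loop4 /share/external/DEV3305_1/2024022_170502-sdg3.img
losetup -r /dev/loop5 /share/DiskImages/2024022_170502-sdd3_Spare.img

mdadm --assemble --force --readonly --run /dev/md1 /dev/loop[1-5]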

What would be the best way to accomplish this?

How can I be certain of the command and order to reassemble the array, and what is the least destructive way to assemble it?

I would greatly appreciate any advice I can get, since I am starting to confuse myself and am possibly making the issue worse.

  • The rebuild (probably) couldn't complete in 10 minutes, so you had no redundancy when you removed the second drive. I.e. data that was on that RAID array is lost. The next step is to make the controller see 5 individual drives (I don't know enough about QNAP to say if you need to do anything - and if so, what - for that), recreate the RAID and restore data from a backup. Commented Sep 3, 2024 at 10:42
  • Your examine output is a bit of a mystery. sdg3 (device 3) has the oldest update time so it should be the one that got kicked first. Yet all other drives agree that device 4 (presumably sde3 spare) is missing, not device 3... Commented Sep 3, 2024 at 11:04
  • So I don't have a specific answer. Generic advice is to re-create and/or assemble --force with copy-on-write overlay, and hope data corruption will be manageable (depends on write activity while this happened). It will involve some trial&error. Commented Sep 3, 2024 at 11:07
  • @frostschutz Thank you for the link. There was a sub-link to "Making the harddisks read-only using an overlay file", which says to first make a full harddisk-to-harddisk image of every harddisk. I am going to continue cloning the drives until I have each of them (verifying them by their Device UUIDs). Using the overlay looks pretty complicated (my rough reading of the recipe is sketched below), but I think it is a good idea and worth the effort, even if I am using copies of the disks. Thanks again! Commented Sep 5, 2024 at 1:45
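(For reference, my reading of the overlay recipe from that page boils down to something like this per member device — a sketch, assuming dmsetup is available on the QNAP; all writes land in a sparse overlay file instead of on the disk:)

# Sketch of a copy-on-write overlay for one member (repeat per /dev/sdX3).
dd if=/dev/zero of=/share/DiskImages/overlay-sdb3 bs=1M count=0 seek=4096   # 4 GiB sparse file
loop=$(losetup --find --show /share/DiskImages/overlay-sdb3)
size=$(blockdev --getsz /dev/sdb3)                                          # size in 512-byte sectors
dmsetup create sdb3-cow --table "0 $size snapshot /dev/sdb3 $loop P 8"
# then assemble against /dev/mapper/sdb3-cow instead of /dev/sdb3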
