Replace a Failing Drive in a RAID6 Array Using mdadm

Most users who run some sort of home storage server will probably (hopefully) be running some type of RAID array.

It is also likely that at some point one or more of the drives in your array will start to degrade. That could mean read errors, bad sectors, or, worse, complete hardware failure. In that case you will have to replace the faulty drive with a new drive of equal or larger size.

I was experiencing read errors on a new 4TB Western Digital Red NAS drive. I have six of these drives in a RAID6 array on a machine running Ubuntu 13.10, with mdadm acting as the software RAID controller.

Below are the steps I took to replace a failing drive in a RAID6 array managed by mdadm.

Identify the Problem

Running smartctl on the drive in question allowed me to confirm that the drive was indeed experiencing read errors.

$ sudo smartctl -a /dev/sdg

This produces the following results:


=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EFRX-68WT0N0
LU WWN Device Id: 5 0014ee 2092bd325
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat May 31 13:22:51 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(55920) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 559) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x703d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   199   051    Pre-fail  Always       -       439
  3 Spin_Up_Time            0x0027   188   188   021    Pre-fail  Always       -       7591
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2705
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       58
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2995
194 Temperature_Celsius     0x0022   117   102   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       11
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       9
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       12

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2704         266440
# 2  Conveyance offline  Completed: read failure       90%      2648         266440
# 3  Extended offline    Completed: read failure       90%      2646         266440
# 4  Conveyance offline  Completed: read failure       90%      2480         266440
# 5  Extended offline    Completed: read failure       90%      2478         266440
# 6  Conveyance offline  Completed: read failure       90%      2312         266440
# 7  Extended offline    Completed: read failure       90%      2310         266440
# 8  Conveyance offline  Completed: read failure       90%      2144         266440
# 9  Extended offline    Completed: read failure       90%      2142         266440
#10  Extended offline    Completed without error       00%      1985         -
#11  Extended offline    Completed without error       00%      1818         -
#12  Extended offline    Completed without error       00%      1650         -
#13  Extended offline    Completed without error       00%      1482         -
#14  Extended offline    Completed without error       00%      1314         -
#15  Extended offline    Completed without error       00%      1146         -
#16  Extended offline    Completed without error       00%       979         -
#17  Extended offline    Completed without error       00%       811         -
#18  Extended offline    Completed without error       00%       644         -
#19  Conveyance offline  Completed: read failure       90%       468         269312
#20  Extended offline    Completed: read failure       90%       466         269312
#21  Short offline       Completed: read failure       90%       312         269312
3 of 12 failed self-tests are outdated by newer successful extended offline self-test #10

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

You can see from the SMART attributes (Current_Pending_Sector and Offline_Uncorrectable are both non-zero) and from the self-test log that the drive has been hitting read errors. I figured I would replace the drive now, while it was still well within its warranty period, and avoid a headache later.
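If you want to gather fresh evidence before committing to a replacement, you can kick off a new SMART extended self-test and then review the self-test log once it finishes. Both are standard smartctl options; the device name below simply matches my system.

$ sudo smartctl -t long /dev/sdg
$ sudo smartctl -l selftest /dev/sdg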

The Array

Before you begin, it is a good idea to get a bird's-eye view of what your array looks like.

This can easily be accomplished (if you are already using mdadm as your RAID controller) by running:

$ sudo mdadm --detail /dev/md0

This should return results similar to:


/dev/md0:
        Version : 1.2
  Creation Time : Sat Feb  8 00:12:06 2014
     Raid Level : raid6
     Array Size : 15627540480 (14903.58 GiB 16002.60 GB)
  Used Dev Size : 3906885120 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Sat May 31 13:23:12 2014
          State : clean 
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : Sol:0  (local to host Sol)
           UUID : 5fd6fcc6:d2300ce9:7d7184be:4b5e6da3
         Events : 220

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       8       97        5      active sync   /dev/sdg1

You can see that the array state is clean and the array is functioning properly; I still chose to replace the drive preemptively. Make a note of the failing drive's device number and device name.

In this case I will be replacing device number 5, /dev/sdg (partition /dev/sdg1), in the array /dev/md0.
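It is also worth recording the drive's serial number now so that you can identify the physical disk once the machine is open. For example (the exact output will vary by drive):

$ sudo smartctl -i /dev/sdg | grep -i serial

Alternatively, ls -l /dev/disk/by-id/ lists each disk by model and serial number alongside its device name.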

Failing/Removing the Drive
First, we need to mark the drive as failed within the array. This can be done with:

$ sudo mdadm --manage /dev/md0 --fail /dev/sdg1

This tells mdadm to mark /dev/sdg1 as faulty in the array /dev/md0. It will return the following:

mdadm: set /dev/sdg1 faulty in /dev/md0

We can confirm that the drive has been set as faulty with:

$ sudo mdadm --detail /dev/md0

Which returns:


/dev/md0:
        Version : 1.2
  Creation Time : Sat Feb  8 00:12:06 2014
     Raid Level : raid6
     Array Size : 15627540480 (14903.58 GiB 16002.60 GB)
  Used Dev Size : 3906885120 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Sat May 31 13:25:24 2014
          State : clean, degraded 
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : Sol:0  (local to host Sol)
           UUID : 5fd6fcc6:d2300ce9:7d7184be:4b5e6da3
         Events : 222

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       0        0        5      removed

       5       8       97        -      faulty spare   /dev/sdg1

This confirms that the drive has been marked as faulty.
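You can also glance at /proc/mdstat at this point; the faulty member is typically flagged with an (F) after its name, something like sdg1[5](F).

$ cat /proc/mdstat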

Next we need to remove the failed drive from within the array. This can be done with:

$ sudo mdadm --manage /dev/md0 --remove /dev/sdg1

Which returns:

mdadm: hot removed /dev/sdg1 from /dev/md0

We can confirm that the drive has been removed from the active array by running:

$ cat /proc/mdstat

Which returns:


Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid6 sdb1[0] sdd1[2] sdf1[4] sde1[3] sdc1[1]
      15627540480 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UUUUU_]
      
unused devices: <none>

You can see that sdg1 is no longer part of the active array; the [6/5] and [UUUUU_] indicate that only five of the six member devices are currently active.

Shutdown and Replace
It is now safe for you to shut down the machine and physically replace the drive in question.

Having taken note of the drive's serial number in the earlier steps will make identifying the physical drive simple.

Partition and Add
Now that the new drive is in the system, we can go ahead and boot the machine.

Note: Upon boot you may encounter an error about a

Degraded RAID Array.

The system may recommend adding boot flags to the kernel boot parameters and drop you to an initramfs prompt, but I found that if you catch the prompt quickly enough, you can type y and press Enter to force it to boot.
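If you would rather not race that prompt, Ubuntu's mdadm initramfs hook can be told to boot with a degraded array automatically. I did not need to do this here, so treat the following as a hedged sketch of the standard Ubuntu approach rather than a step I tested:

$ echo "BOOT_DEGRADED=true" | sudo tee /etc/initramfs-tools/conf.d/mdadm
$ sudo update-initramfs -u

A one-off alternative is to add bootdegraded=true to the kernel line from the GRUB menu for that single boot.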

At this point, the system should be booted.

We are going to use sgdisk, part of the gdisk package, to copy the partition table from another drive in the array onto our new drive. gdisk was not installed by default on my system, but it can easily be installed through apt-get.

$ sudo apt-get install gdisk

Using sgdisk, we'll first use the -R flag to replicate the partition schema of another drive within the array onto our new drive.

$ sudo sgdisk -R=/dev/sdg /dev/sdf

It’s very important that you put the drives in the correct order: the new (target) drive goes with the -R flag, and the donor (source) drive is the final argument. In the above command we are replicating the partition schema of /dev/sdf onto /dev/sdg.

You will receive a response like this:

The operation has completed successfully.

Next we need to randomize the new drive’s disk and partition GUIDs, since the copy left them identical to the donor drive’s. This can be done with:

$ sudo sgdisk -G /dev/sdg

Which returns:

The operation has completed successfully.

Now we can verify that the partition tables of our two drives are identical.

The donor drive /dev/sdf

$ sudo parted /dev/sdf print

Gives us:


Model: ATA WDC WD40EFRX-68W (scsi)
Disk /dev/sdf: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  4001GB  4001GB                     raid

The receiving drive /dev/sdg

$ sudo parted /dev/sdg print

Gives us:


Model: ATA WDC WD40EFRX-68W (scsi)
Disk /dev/sdg: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  4001GB  4001GB                     raid

Both now have identical partition schemas and flags. Great!
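If you prefer a second opinion from the same tool that did the copying, sgdisk can print the GPT details directly; the two outputs should match apart from the disk GUID, which we just randomized.

$ sudo sgdisk -p /dev/sdf
$ sudo sgdisk -p /dev/sdg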

Adding the Drive Back to the Array
Now that we have a clean drive that is partitioned correctly, it is time to add it back into our array.

Remember, in my case the array device is /dev/md0, but yours could be different.

To add the drive:

Note: Notice that I add the partition /dev/sdg1, not the whole disk /dev/sdg.

$ sudo mdadm --manage /dev/md0 --add /dev/sdg1

Which returns:

mdadm: added /dev/sdg1

Verify Recovery
Now that the drive has been successfully added to the array, we can verify that the rebuild is in progress.

$ cat /proc/mdstat

Which returns:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid6 sdg1[6] sde1[3] sdf1[4] sdc1[1] sdb1[0] sdd1[2]
      15627540480 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UUUUU_]
      [>....................]  recovery =  0.0% (383360/3906885120) finish=1188.8min speed=54765K/sec

Keep in mind that, depending on the size of your array, the recovery process could take a while; in my case it took nearly 20 hours. You can check the status of the recovery at any time by re-running the command above.
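If you would rather not keep re-running it by hand, you can leave a watch on /proc/mdstat. The kernel's RAID rebuild speed limits can also be raised, at the cost of competing with normal I/O; the sysctl value below is only an example, so check your defaults first.

$ watch -n 60 cat /proc/mdstat
$ sudo sysctl -w dev.raid.speed_limit_min=50000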