punkwalrus (punkwalrus) wrote,

Tech - Repairing RAID5 on Linux with mdadm

This may make no sense to most of my friends, but I am recording this for all posterity, in case it shows up in search engines and helps somebody else. If repairing an ATA-based RAID5 on Linux sounds like the most boring thing on God's own Earth, go ahead and skip this entry.

Your box has frozen, and upon reboot, you find your ATA-based RAID5 is hosed. Run fsck.ext2? Hit control-D? What? OH NOES!!1! That's what happened to me. Turns out the motherboard had fried, and took out my controller (and RAM, and video card... but that's another story). But RAID5 is redundant, right? That's why you did that, right? And now people are laughing at you, and saying "RAID is NOT a backup plan! U shoulda used SCSI! n00b!" Unhelpful bastards, the lot. Although, that really didn't happen to me. I got a lot of blank stares, wide-eyed blinks, but one person... one HELPFUL person, on the Linux LJ community showed me the way with his lantern. And I want to repay the favor (I offered him $25, and he never wrote me, so this is second-best).

You read mdadm. It's confusing. Why can't anyone have a simple, step-by-step "how do I fix my RAID5" page? You did a "set and forget" didn't you, back when you installed Fedora Core? Yeah, me too, and a year later, my ATA RAID5 was hosed because I got a cheap power supply. Or was it...? I got a new ASUS mobo, an Antec power supply, and put together a new box. And here's how I fixed it all and got my uncorrupted data back.

My situation: /dev/md0 was my /home directory. Yee! I won't even go into my less-than-stellar offsite backup plan (which failed at all the minute weak points). So here's what I did.

/home = /dev/md0 = /dev/hdc1, /dev/hde1, and /dev/hdg1, spread over the mobo controller and a spare ATA controller.

First, when the boot stopped at the error, I followed the directions, entered my root password, and got a prompt. I edited /etc/fstab and commented out the /dev/md0 line with a #. I rebooted. Things came up fine. Yeah, my /home was empty, but not to fear!
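
A minimal sketch of that fstab edit, assuming the array's line starts with the device name itself (if yours uses LABEL= or something similar, this matches nothing). The function name is made up for illustration, and it keeps a .bak copy so the line can be restored once the array is healthy again.

```shell
# Comment out every active line in an fstab-style file whose first field is
# the given device. A copy of the original is kept as FILE.bak.
# Usage (as root): comment_out_device /dev/md0 /etc/fstab
comment_out_device() {
    dev=$1
    fstab=$2
    cp "$fstab" "$fstab.bak" || return 1
    # Read from the backup, write the edited version over the original.
    sed "s|^${dev}[[:space:]]|# &|" "$fstab.bak" > "$fstab"
}
```

To undo it later, copy the .bak file back or delete the leading "# " by hand.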

At this point, I had 3 array stripes: hdc1 hde1 hdg1. I grepped them out of dmesg. If any two of those are okay, you are okay, I said. I quickly found out hdc1 was the uncool mopey goth drive, and had been kicked out:
md:  adding hdc1 ...
md: bind
md: running: 
md: kicking non-fresh hdc1 from array!
It seems the first partition on my secondary motherboard IDE controller was having that "not so fresh feeling." Oh, then there's this lascivious comment:
md: kicking non-fresh hdc1 from array!
md: unbind
md: export_rdev(hdc1)
md: md0: raid array is not clean -- starting background reconstruction
raid5: device hdg1 operational as raid disk 2
raid5: device hde1 operational as raid disk 1
raid5: cannot start dirty degraded array for md0
RAID5 conf printout:
 --- rd:3 wd:2 fd:1
 disk 1, o:1, dev:hde1
 disk 2, o:1, dev:hdg1
raid5: failed to run raid set md0
Dirty, dirty, naughty bad bad bad array! I saw it looking at those smutty novels in the cafeteria and KNEW it would come to this. But all jokes aside, I wanted to make sure all my drives were okay. They were Western Digital, so I downloaded their repair/diagnostic floppy from their helpful website (the CD ISO didn't work, BTW... it didn't make a bootable CD). A thorough scan showed all systems are a go! So why did hdc fail? It didn't. The array did. And that was good news.
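
The kernel messages quoted above came out of dmesg. A filter along these lines would pull out the same kind of lines; it is only a sketch, the function name is hypothetical, and the hd[ceg]1 pattern assumes my three device names, so swap in your own.

```shell
# Keep only md/raid5 kernel messages and anything that names the three
# member partitions. Reads the kernel log text on stdin.
# Usage: dmesg | md_messages
md_messages() {
    grep -E '(md|raid5):|hd[ceg]1'
}
```

The line to look for is "kicking non-fresh": it names the member the kernel refused to trust.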

Welcome mdadm. Shall I take your hat and coat?

mdadm = multiple-device administrator (md is the kernel's "multiple devices" driver, which I keep calling multidisk), pronounced "em-dee-ad-min" (I think).

After booting, I examined the partitions in my array, starting with the piece that got kicked out, to make sure it wasn't dead or missing:
mdadm --examine /dev/hdc1
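
"Non-fresh" means the kicked member's event counter is behind the others, and --examine prints that counter for each device. Here is a small sketch that boils the output down to one line per member. The helper name is mine, and it assumes the usual layout where each device section starts with a "/dev/...:" header and contains an "Events :" line; the exact format varies with the mdadm version and superblock type.

```shell
# Print "device events" pairs from `mdadm --examine` output read on stdin.
# Usage (as root):
#   mdadm --examine /dev/hdc1 /dev/hde1 /dev/hdg1 | examine_events
examine_events() {
    awk '
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }   # section header, e.g. "/dev/hdc1:"
        $1 == "Events" { print dev, $NF }             # last field is the counter
    '
}
```

A member with a lower number than its siblings is the stale one.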

Next, I hooked up an external USB drive, and assembled the remaining stripes:
mdadm -A -f /dev/md0 /dev/hde1 /dev/hdg1
"Assemble force multidisk 0 with stripes hde1 and hdg1"

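The same forced assemble spelled out with long options, followed by a status check, might look like the sketch below. The wrapper function is hypothetical; --assemble, --force and --detail are regular mdadm options. With one member missing, the detail output should describe the array as degraded rather than failed.

```shell
# Force-assemble an array from the listed members, then print its status.
# Usage (as root): assemble_degraded /dev/md0 /dev/hde1 /dev/hdg1
assemble_degraded() {
    md=$1
    shift
    mdadm --assemble --force "$md" "$@" || return 1
    mdadm --detail "$md"
}
```
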
Then I mounted (stop snickering) /dev/md0 on /home:
mount /dev/md0 /home

... and backed up my data. You know, just in case it tries to influence my data with smutty novels. After that, I went ahead and added hdc1 back (if I had put in a replacement drive, I would have added the new one instead):
mdadm -a /dev/md0 /dev/hdc1
"Add [hotswapadd] to multidisk 0 the partition hdc1"

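For the backup step that comes before the re-add, a plain recursive copy is enough. This is a sketch, not the exact command I ran: the function name and the /mnt/usb mount point are assumptions for illustration.

```shell
# Copy everything under $1 into $2, preserving owners, modes, timestamps
# and symlinks (cp -a).
# Usage (hypothetical mount point for the USB drive):
#   backup_tree /home /mnt/usb/home-backup
backup_tree() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    cp -a "$src/." "$dest/"
}
# Only once the copy checks out, re-add the kicked member. Long-option form:
#   mdadm --manage /dev/md0 --add /dev/hdc1
```

Copying first matters because the rebuild reads every block of the two surviving members, and a second failure during that window would take the array with it.
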
Now what? I like progress bars. Don't you? Luckily, you can have one! But mdadm doesn't do it; you have to cat the multi-disk statistics in the /proc directory:
watch -n 5 cat /proc/mdstat

You'll get this, which will update every 5 seconds:
Every 5.0s: cat /proc/mdstat                            Wed Dec  7 22:07:10 2005

Personalities : [raid5]
md0 : active raid5 hdc1[3] hde1[1] hdg1[2]
      156296192 blocks level 5, 256k chunk, algorithm 2 [3/2] [_UU]
      [=================>...]  recovery = 87.6% (68524692/78148096) finish=5.0min speed=31455K/sec
unused devices: 

Or I hope you do. Rock on! Hope this helps someone. Sure made me respect offsite backup more... :(
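
If you'd rather have the box tell you when the rebuild is done instead of staring at watch, the percentage can be scraped from the same /proc/mdstat text. A rough sketch with made-up function names; it only looks for the "recovery =" line shown above, so it assumes a single array rebuilding at a time.

```shell
# Print the recovery percentage from mdstat-style text on stdin,
# or nothing when no rebuild is in progress.
recovery_percent() {
    sed -n 's/.*recovery = *\([0-9.]*\)%.*/\1/p'
}

# Report progress every 30 seconds until /proc/mdstat stops showing a recovery.
# Usage: wait_for_rebuild
wait_for_rebuild() {
    while pct=$(recovery_percent < /proc/mdstat) && [ -n "$pct" ]; do
        echo "rebuild at ${pct}%"
        sleep 30
    done
    echo "no recovery in progress"
}
```

When it finishes, the status line in /proc/mdstat should read [3/3] [UUU] instead of [3/2] [_UU].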