Your box has frozen, and upon reboot, you find your ATA-based RAID5 is hosed. Run fsck.ext2? Hit control-D? What? OH NOES!!1! That's what happened to me. Turns out the motherboard had fried, and took out my controller (and RAM, and video card... but that's another story). But RAID5 is redundant, right? That's why you did that, right? And now people are laughing at you, and saying "RAID is NOT a backup plan! U shoulda used SCSI! n00b!" Unhelpful bastards, the lot. Although, that really didn't happen to me. I got a lot of blank stares, wide-eyed blinks, but one person... one HELPFUL person, on the Linux LJ community showed me the way with his lantern. And I want to repay the favor (I offered him $25, and he never wrote me, so this is second-best).
You read the mdadm man page. It's confusing. Why can't anyone have a simple, step-by-step "how do I fix my RAID5" page? You did a "set and forget," didn't you, back when you installed Fedora Core? Yeah, me too, and a year later, my ATA RAID5 was hosed because I got a cheap power supply. Or was it...? I got a new ASUS mobo, an Antec power supply, and put together a new box. And here's how I fixed it all and got my uncorrupted data back.
My situation: /dev/md0 was my /home directory. Yee! I won't even go into my less-than-stellar offsite backup plan (which failed at all the minute weak points). So here's what I did.
/home = /dev/md0 = /dev/hdc1, /dev/hde1, and /dev/hdg1, spread over the mobo controller and a spare ATA controller.
First, upon booting to the error, I followed the directions, entered my root password, and got a prompt. I edited /etc/fstab and commented out the /dev/md0 line. I rebooted. Things came up fine. Yeah, my /home was empty, but not to fear!
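For the record, "commenting out" just means sticking a # in front of the md0 line so the boot scripts skip it. My fstab entry looked something like this (the filesystem type and options here are a guess; check your own line):

```
#/dev/md0    /home    ext3    defaults    1 2
```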
At that point, I had 3 array stripes: hdc1, hde1, and hdg1. I grepped them out of dmesg. If any two of those three are okay, you are okay, I said. I quickly found out hdc1 was the uncool mopey goth drive, and got kicked out:
md: adding hdc1 ...
md: bind&lt;hdc1&gt;

It seems the first partition on my secondary motherboard IDE controller was having that "not so fresh feeling." Oh, then there's this lascivious comment:
md: running:
md: kicking non-fresh hdc1 from array!
md: unbind&lt;hdc1&gt;

Dirty, dirty, naughty bad bad bad array! I saw it looking at those smutty novels in the cafeteria and KNEW it would come to this. But all jokes aside, I wanted to make sure all my drives were okay. They were Western Digital, so I downloaded their repair/diagnostic floppy from their helpful website (the CD ISO didn't work, BTW; it didn't make a bootable CD). A thorough scan showed all systems were go! So why did hdc fail? It didn't. The array did. And that was good news.
md: export_rdev(hdc1)
md: md0: raid array is not clean -- starting background reconstruction
raid5: device hdg1 operational as raid disk 2
raid5: device hde1 operational as raid disk 1
raid5: cannot start dirty degraded array for md0
RAID5 conf printout:
 --- rd:3 wd:2 fd:1
 disk 1, o:1, dev:hde1
 disk 2, o:1, dev:hdg1
raid5: failed to run raid set md0
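If all that has already scrolled off your screen, you don't have to squint through the whole boot log; grep pulls the md and raid5 lines back out (a quick sketch; if dmesg's buffer has rotated, /var/log/messages usually has the same lines):

```shell
# Pull every md/raid5 kernel message, to see which member got kicked and why
dmesg | grep -E 'md:|raid5:'
```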
Welcome mdadm. Shall I take your hat and coat?
mdadm = multi-disk administrator, pronounced "em-dee-ad-min" (I think). The "md" is the kernel's name for its multiple-device (software RAID) driver.
After booting, I did a scan of the partitions in my array, examining the missing piece to make sure it wasn't dead or gone entirely:
mdadm --examine /dev/hdc1
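The useful bits in that --examine output are the update time and the event counter: every member keeps a count of array events, and the kicked-out ("non-fresh") member lags behind the survivors. This loop is my suggested way to eyeball it side by side, not a step from the original rescue:

```shell
# Compare timestamps and event counts across all three members;
# the stale one shows older values than the other two
for part in /dev/hdc1 /dev/hde1 /dev/hdg1; do
    echo "== $part =="
    mdadm --examine "$part" | grep -E 'Update Time|Events'
done
```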
Next, I hooked up an external USB drive (for the backup to come), and assembled the remaining stripes:
mdadm -A -f /dev/md0 /dev/hde1 /dev/hdg1
"Assemble force multidisk 0 with stripes hde1 and hdg1"
Then I mounted (stop snickering) /dev/md0 on /home:
mount /dev/md0 /home
... and backed up my data. You know, just in case it tries to influence my data with smutty novels. After that, I went ahead and added the old partition back (or, if I had put in a spare drive, I would have added the new one):
mdadm /dev/md0 -a /dev/hdc1
"Add [hotswapadd] to multidisk 0 the partition hdc1"
Now what? I like progress bars. Don't you? Luckily, you can have one! But mdadm doesn't draw it; you have to cat the multi-disk statistics in the proc filesystem:
watch -n 5 cat /proc/mdstat
You'll get this, which will update every 5 seconds:
Every 5.0s: cat /proc/mdstat          Wed Dec 7 22:07:10 2005

Personalities : [raid5]
md0 : active raid5 hdc1 hde1 hdg1
      156296192 blocks level 5, 256k chunk, algorithm 2 [3/2] [_UU]
      [=================>...]  recovery = 87.6% (68524692/78148096) finish=5.0min speed=31455K/sec

unused devices:
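When the bar fills, one last sanity check I'd suggest (not part of the original ordeal): [_UU] in mdstat should turn into [UUU], and mdadm --detail should call the array clean with every member listed as "active sync":

```shell
# After the rebuild finishes, confirm the array is whole again
cat /proc/mdstat
mdadm --detail /dev/md0 | grep -E 'State :|active sync'
```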
Or something close to it, I hope. Rock on! Hope this helps someone. Sure made me respect offsite backup more... :(