Wednesday, August 06, 2008
Linux RAID tuning and troubleshooting
Ran across this while searching another topic.

http://makarevitch.org/rant/raid/

Labels:

Friday, December 21, 2007
Replacing a failed drive in a Software RAID mirror set
Like I wrote about last time, I have a failing drive in my triple active RAID mirror set on my firewall box. See also "Failing hard drive in a Software RAID". I'm still trying to decide whether the disk has actually failed, or if it is just having issues.

# /sbin/badblocks -sv /dev/sdc2

Since I have unmounted this RAID slice, I'm going to test with a DESTRUCTIVE write/read verification. (Which is also a good way to wipe the disk.)

# /sbin/badblocks -sv -w -t random /dev/sdc2

Well, after a few runs with that, the disk is no longer making "retry" noises. So I'm going to re-add the slice to the RAID array and see what happens.

# /sbin/mdadm /dev/md1 -a /dev/sdc2

And force mdadm to verify the sync:

# echo check > /sys/block/md1/md/sync_action

It seems to be working. I'm guessing that I finally convinced SMART to re-map the bad sector that was causing problems.

Labels:

Wednesday, December 05, 2007
Failed drive slice in a Software RAID after resync
One of the things that I do periodically on my servers is to run a mdadm resync. Because this can put a heavy strain on the disk system, I strongly suggest that you have good backups in place. My home systems run a check about once a month, servers at work run a check early on Tuesday mornings.

The script is very simple, and you can even fire off the command by writing "check" to the sync_action variable of the md process.

#!/bin/sh
# Tells mdadm to verify that the arrays are synchronized.
# This deals with the issue where a seldom-read disk block has gone bad
# by doing a daily/weekly verification of the array.

echo check > /sys/block/md0/md/sync_action
echo check > /sys/block/md1/md/sync_action
echo check > /sys/block/md2/md/sync_action
echo check > /sys/block/md3/md/sync_action
echo check > /sys/block/md4/md/sync_action
echo check > /sys/block/md5/md/sync_action
echo check > /sys/block/md6/md/sync_action


In this particular case, all of my RAID slices verified correctly, except for one of them. In this particular situation I'm running a triple-active RAID1 array. (Instead of using a hot-spare disk, I'm putting live data onto all three disks and using all three actively.)

See also Failing hard drive in a Software RAID

$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[2] sdb1[1] sda1[0]
256896 blocks [3/3] [UUU]

md2 : active raid1 sdc3[2] sdb3[1] sda3[0]
12289600 blocks [3/3] [UUU]

md4 : active raid1 sdc5[2] sdb5[1] sda5[0]
33551616 blocks [3/3] [UUU]

md3 : active raid1 sdc6[2] sdb6[1] sda6[0]
1052160 blocks [3/3] [UUU]

md5 : active raid1 sdc7[2] sdb7[1] sda7[0]
64010880 blocks [3/3] [UUU]

md6 : active raid1 sdc8[2] sdb8[1] sda8[0]
267257216 blocks [3/3] [UUU]

md7 : active raid1 sdf1[2] sde1[1] sdd1[0]
488383936 blocks [3/3] [UUU]

md1 : active raid1 sdc2[3](F) sdb2[1] sda2[0]
12289600 blocks [3/2] [UU_]

unused devices: <none>


The md1 array is my / (root) partition. Since the rest of the disk slices appear to be fine, I'm going to proceed with the assumption that it was a minor glitch.

Step 0: Analyze the failure

The first sign of error was the (F) showing up in /proc/mdstat. Apparently I don't have mdadm configured yet in monitor mode so that it e-mails me when it finds an error.

# grep "sdc2" messages
Dec 4 09:11:58 fw1-shimo kernel: raid1: Disk failure on sdc2, disabling device.
Dec 4 09:12:06 fw1-shimo kernel: disk 2, wo:1, o:0, dev:sdc2


The full detail from the mdadm resync:

# grep "Dec 4 09" messages | grep "md:"
Dec 4 09:08:33 fw1-shimo kernel: md: md6: sync done.
Dec 4 09:08:33 fw1-shimo kernel: md: syncing RAID array md1
Dec 4 09:08:33 fw1-shimo kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Dec 4 09:08:33 fw1-shimo kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Dec 4 09:08:33 fw1-shimo kernel: md: using 128k window, over a total of 12289600 blocks.
Dec 4 09:11:31 fw1-shimo kernel: md: md1: sync done.
#


And finally, evidence from the logs that shows that sdc was having issues:

Dec 4 09:11:34 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:34 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:34 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:34 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:35 fw1-shimo kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:35 fw1-shimo kernel: ata2.00: (BMDMA stat 0x0)
Dec 4 09:11:35 fw1-shimo kernel: ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:35 fw1-shimo kernel: ata2: EH complete
Dec 4 09:11:37 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:37 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:37 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:37 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:50 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:51 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:51 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:51 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:51 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:52 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:52 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:52 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:52 fw1-shimo setroubleshoot: SELinux is preventing /usr/sbin/sendmail.postfix (system_mail_t) "read" to /dev/md1 (proc_mdstat_t). For complete SELinux messages. run sealert -l d5c655f4-6fc3-445b-ab9d-3b21336cb2d0
Dec 4 09:11:52 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:53 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:54 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:54 fw1-shimo kernel: sd 2:0:0:0: SCSI error: return code = 0x08000002
Dec 4 09:11:54 fw1-shimo kernel: sdc: Current: sense key: Medium Error
Dec 4 09:11:54 fw1-shimo kernel: Additional sense: Unrecovered read error - auto reallocate failed
Dec 4 09:11:55 fw1-shimo kernel: end_request: I/O error, dev sdc, sector 25091744
Dec 4 09:11:55 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:55 fw1-shimo kernel: SCSI device sdc: 781422768 512-byte hdwr sectors (400088 MB)
Dec 4 09:11:55 fw1-shimo kernel: sdc: Write Protect is off
Dec 4 09:11:56 fw1-shimo kernel: SCSI device sdc: drive cache: write back
Dec 4 09:11:56 fw1-shimo kernel: SCSI device sdb: 781422768 512-byte hdwr sectors (400088 MB)
Dec 4 09:11:56 fw1-shimo kernel: sdb: Write Protect is off
Dec 4 09:11:57 fw1-shimo kernel: SCSI device sdb: drive cache: write back
Dec 4 09:11:57 fw1-shimo kernel: SCSI device sdc: 781422768 512-byte hdwr sectors (400088 MB)
Dec 4 09:11:57 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:11:57 fw1-shimo kernel: counted 127, received 15
Dec 4 09:11:58 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:11:58 fw1-shimo kernel: raid1: Disk failure on sdc2, disabling device.
Dec 4 09:11:58 fw1-shimo kernel: Operation continuing on 2 devices
Dec 4 09:11:58 fw1-shimo kernel: blk: request botched
Dec 4 09:11:58 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:11:59 fw1-shimo kernel: counted 112, received 16
Dec 4 09:11:59 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:11:59 fw1-shimo kernel: blk: request botched
Dec 4 09:11:59 fw1-shimo kernel: sdc: Write Protect is off
Dec 4 09:12:00 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:00 fw1-shimo kernel: counted 96, received 16
Dec 4 09:12:00 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:01 fw1-shimo kernel: blk: request botched
Dec 4 09:12:01 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:01 fw1-shimo kernel: counted 80, received 16
Dec 4 09:12:01 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:02 fw1-shimo kernel: blk: request botched
Dec 4 09:12:02 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:02 fw1-shimo kernel: counted 64, received 16
Dec 4 09:12:02 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:03 fw1-shimo kernel: blk: request botched
Dec 4 09:12:03 fw1-shimo kernel: SCSI device sdc: drive cache: write back
Dec 4 09:12:03 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:03 fw1-shimo kernel: counted 48, received 16
Dec 4 09:12:04 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:04 fw1-shimo kernel: blk: request botched
Dec 4 09:12:04 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:04 fw1-shimo kernel: counted 32, received 16
Dec 4 09:12:05 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:05 fw1-shimo kernel: blk: request botched
Dec 4 09:12:05 fw1-shimo kernel: ata3.00: WARNING: zero len r/w req
Dec 4 09:12:06 fw1-shimo last message repeated 5 times


Step 1: Drop the failed slice

# /sbin/mdadm /dev/md1 --fail /dev/sdc2
mdadm: set /dev/sdc2 faulty in /dev/md1
# /sbin/mdadm /dev/md1 --remove /dev/sdc2
mdadm: hot removed /dev/sdc2


Step 2: Zero out the failed slice

My thinking here is that by zeroing out the failed slice, I can force the SATA disk to remap any sectors that have gone bad.

# dd if=/dev/zero of=/dev/sdc2
dd: writing to `/dev/sdc2': Input/output error
24577993+0 records in
24577992+0 records out
12583931904 bytes (13 GB) copied, 1916.7 seconds, 6.6 MB/s


Well, that's not a good sign (and the disk was clicking a bit). So I'll run smartctl and check the disk's SMART info (see Monitoring Hard Disks with SMART).

# /usr/sbin/smartctl -i -d ata /dev/sdc
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HD400LJ
Serial Number: S0H2J1KLA07831
Firmware Version: ZZ100-15
User Capacity: 400,088,457,216 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Wed Dec 5 09:43:36 2007 EST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled


However, the "-Hc" output of smartctl says that the disk health is still "PASSED" and not "FAILING". So it's possible that the disk doesn't need to be retired yet.

# /usr/sbin/smartctl -Hc -d ata /dev/sdc
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x05) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (7640) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 130) minutes.


Personally, since I know the drive makes clicking noises and throws an error during the dd wipe, I'm going to swap it out.

Labels:

Friday, May 25, 2007
iSCSITarget on CentOS5
Setting up our test iSCSI SAN box this week. The original plans were to run this on top of Gentoo (which is very powerful and flexible) but after 3 years, I'm not very pleased with Gentoo as a server OS. Which is a whole different topic. So we've migrated over to using CentOS5, which is derived from Red Hat Enterprise Linux 5, a distro that is more suited for corporate use.

There's not much to talk about in terms of the base system. It's a pretty vanilla 64bit CentOS5 install (from DVD) running on top of a dual-CPU dual-core pair of Socket F Opterons. The primary packages that I've installed so far are "Yum Extender" (from stock repositories) and "rdiff-backup" (downloaded as an RPM). The OS runs on top of a 3-disk RAID1 (mirror, all drives active) Software RAID for safety.

I use a semi-customized partition layout on the (3) operating system disks. I have:

a) /boot
b) / (root, the primary OS install area)
c) swap
d) a backup root partition (which is basically a clone of the primary, except for a small change in /etc/fstab) designed for quick recovery from a situation that would hose the primary root partition
e) /var/log (broken out to its own area)
f) /backup/system (a place to store system backups)
g) LVM area (no allocated areas yet)

I mention all that because the first step before installing iscsitarget is to make sure I can recover if things go awry. Since installing iscsitarget involves mucking with the running kernel, I want a good backup of /boot along with making sure GRUB offers me options to boot an older kernel. I'll also freshen my root backup partition.

Step 1 - Backing up /, /boot, and the existing kernel

Simplicity is often best when dealing with the base OS. My methods are crude, but designed to get me back up and running without needing much in the way of software. The primary requirement is a bootable USB pen drive or bootable LiveCD (such as RIPLinuX) with the necessary tools. You could also use the CentOS5 boot DVD.

I'll run with the CentOS install DVD since that's what I have sitting in the optical drive at the moment. When CentOS boots up, enter "linux rescue" at the boot prompt. Note, if you have multiple NICs installed, it's probably better to not start networking (because the CentOS rescue mode takes forever to initialize unconnected NICs).

Select "Skip" when asked about mounting the existing install at /mnt/sysimage. We'll be doing things our own way instead.

Start up Software RAID on the key partitions (/boot, /, the backup /, and the backup partition). The following commands will (usually) startup your existing RAID devices automatically.

# mdadm --examine --scan >> /etc/mdadm.conf
# mdadm --assemble --scan

In my case "md0" is /boot, "md2" is my base CentOS install, "md3" is the backup root partition, and "md5" is where I can store image files. So let's double-check that.

# mkdir /mnt/root ; mount /dev/md3 /mnt/root
# mkdir /mnt/backuproot ; mount /dev/md3 /mnt/backuproot

If we then examine the output of "df -h" or by using "ls" on the mounted volumes we can verify that we know which is which. Let's mount our backup area and create image files. I prefer to kick off the "dd" commands in the background so that I can monitor progress and keep multiple CPUs busy.

# mkdir /mnt/backup ; mount /dev/md5 /mnt/backup
# cd /mnt/backup ; mkdir images ; cd images
# dd if=/dev/md0 | gzip > dd-md0-boot-20070525.img.gz &
# dd if=/dev/md2 | gzip > dd-md2-root-20070525.img.gz &
# dd if=/dev/md3 | gzip > dd-md3-bkproot-20070525.img.gz &

We should also backup the master boot records on each of the hard drives in the unit.

# for i in a b c; do dd if=/dev/sd$i count=1 bs=512 of=dd-sd$i-mbr-20070525.img; done

Unfortunately, the CentOS5 DVD doesn't include tools like "G4L" (Ghost for Linux) or I'd make a second set of backup files using that. I may boot my RIPLinuX CD and see what tools are there. (Because you can never have too many backups.)

Now I can dump the contents of "md2" (the original root) to "md3" (our backup root).

# dd if=/dev/md2 of=/dev/md3

Now for some cleanup stuff...

# mount /dev/md3 /mnt/backuproot
# vi /mnt/backuproot/etc/fstab

We'll need to change any references of "md2" to "md3". Basically flip them around so that "md3" is the official root when /etc/fstab gets processed. I also like to change the prompt and system name to remind myself that I'm using the emergency system. Again, our primary goal is to be able to get a box back up and operational in the case where the primary root partition is hosed. Get it up quickly, then schedule some downtime to deal with it properly.

Now would also be a good time to tune the ext3 file system on your partitions.

The last thing we need to do is edit GRUB's configuration so that we can select our backup root OS from the selection menu.

# mkdir /mnt/boot
# mount /dev/md0 /mnt/boot
# vi /mnt/boot/grub/grub.conf

Things that we'll want to do here (you could also accomplish this by booting the server in normal mode and editing grub.conf there using a more comfortable text editor):

a) Change the timeout=5 value to timeout=15 (or 30 or 60). By default, CentOS doesn't give you very long to pick an alternate boot. I find 5 seconds to be too short of a window, especially on a unit where the storage controller takes a minute or two to scan and setup the drives.

b) Copy the latest "title" section and change "root=/dev/md2" to "root=/dev/md3". I always make the "EMERGENCY" boot option the 2nd one in the list.

# mkdir /mnt/backuproot
# mount /dev/md3 /mnt/backuproot
# vi /mnt/backuproot/etc/sysconfig/network

I like to change the hostname to have "-emergency" tacked onto the end. Which should make it fairly obvious that we are booting up in emergency mode using the backup root partition. I also edit root's .bash_profile to set PS1.

Okay, that was a lot of setup work just to prepare for implementing iSCSITarget (or any other kernel rebuild), but it's always worth it.

Final notes:

- When I test booted the emergency root partition, things didn't work as planned. So while my concept is sound, I may have screwed something up. I think it's an error with /etc/fstab in the emergency partition, so I'll troubleshoot that later.

- It's also possible that you'll need to do a GRUB install on all (3) of the primary mirror disks.

Step 2 - Downloading and compiling the iSCSITarget software

So far, I've found (2) links to be useful here. One is Moving on.... and the other is iSCSI Enterprise Target を CentOS5 にインストールする (japanese). While the 2nd link is in Japanese, it shows the commands in english.

Head over to the The iSCSI Enterprise Target page and download the latest tarball containing the source code. The current version is 0.4.15. If you're using Firefox in CentOS's Gnome shell, it will probably prompt you to open the file with the archive manager. I created a subfolder under /root/iscsitarget-0.4.15 and extracted the contents there.

You will also need to go into Applications -> Add/Remove Software and add the development tools and libraries to your system. (Mostly you just need gcc.)

As noted on jackshck's page, you will also need to install the following packages:

openssl-devel (I installed the x86_64 version)
kernel-devel (again, I'm using the x86_64 version)

Open up a terminal window and go to where you extracted the iscsitarget tarball (I put mine in /root/iscsitarget-0.4.15).

# ls -l /usr/src/kernels
(make note of the kernel folder)
# make KSRC=/usr/src/kernels/2.6.18-8.1.4.el5-x86_64/
# make KSRC=/usr/src/kernels/2.6.18-8.1.4.el5-x86_64/ install

Now we can start up the ietd daemon:

# /etc/init.d/iscsi-target start

And add it to our default runlevel (this is similar to the rc-update command in Gentoo Linux):

# chkconfig iscsi-target on

Step 3 - Creating a target

This is where we get into the nitty-gritty and where I need to take a break and do some research. The /etc/ietd.conf file already exists at this point, but only contains a commented out sample configuration.

Notes:

Dec 21 2007 - The comment about iSCSITarget software for Microsoft Windows really isn't on-topic. But I'll go ahead and list the link to it, but not as an HTML link. Pricing for the real version is currently $395 (Server) or $995 (Professional). And personally, there's no way that I'd recommend running a SAN on top of Microsoft Windows (even Server 2003, which is a nice product).

Labels: , , ,

Thursday, May 10, 2007
Dealing with a failed Software RAID device
As part of my server setup, I like to make sure that plans are working as expected... which means intentionally breaking things like RAID sets.

In this particular case I have a triple-active RAID1 mirror set on the first 3 disks in the system (/dev/sda, /dev/sdb, /dev/sdc). In this RAID1 set, all 3 disks are active, with no hot-spare. I prefer this over a (2) active (1) hot-spare setup because it allows for up to 2 disks to fail before you lose data. And if I'm already dedicating a hot-spare spindle solely for the use of the RAID1 set, I may as well get to use it. The output of /proc/mdstat looks similar to (note that none of the slices are tagged with a "(S)").:

md2 : active raid1 sdc2[2] sdb2[1] sda2[0]
7911936 blocks [3/3] [UUU]


Each disk has quite a few md devices associated with it. In this particular case I have /dev/md0 up through /dev/md5 created. Probably one of the few downsides to SoftwareRAID is that you end up with quite a few md devices to keep track of. But such is the price for just about the ultimate flexibility.

GRUB Note: In a mirrored setup, you must make sure to install GRUB to the MBR (master boot record) on all of the mirror disks. Some Linux distros don't do this on their own and you'll have to do it yourself. Otherwise, when the first disk in the mirror set fails, you'll find you're left with an unbootable system. This is also why I like to make copies of the MBR for each disk in the system (# dd if=/dev/sda of=dd-sda-mbr-date.img bs=512 count=1).

So, after making very good backups, I decided it was time to test whether I could pull a disk and survive. To make sure that I had taken care of the GRUB issue, I shutdown the server and pulled the primary drive in the RAID set.

# cat /proc/mdstat
md2 : active raid1 sdc2[2] sdb2[1]
7911936 blocks [3/3] [_UU]


Ah good, mdadm is *not* happy here (as expected). It knows that one of the disks has failed in the array. So let's shutdown and replace the failed drive with a blank one. In this case, I used a spare drive that I had laying around that had been previously wiped. (Or, with care, you could zero out the drive that you pulled.)

I recomend using "sfdisk" in dump mode to configure the new drive. So if your failed drive is "sda" and one of the good ones is "sdb", you could use:

# sfdisk -d /dev/sdb | sfdisk /dev/sda

After which, you can use the "mdadm" command to add the new slices to the existing RAID arrays.

# mdadm --add /dev/mdX /dev/sdYZ

Last, don't forget to install GRUB to the MBR on the new disk.

Labels:

Monday, May 07, 2007
Brute force disaster recovery for CentOS5
Today's trick is moving a CentOS5 system from an old set of disks over to a new set of disks. Along the way, I'll create an image of the system to allow me to restore it later on.

The CentOS5 system is a fresh install running RAID-1 across (3) disks using Linux Software RAID (mdadm). There are (4) primary partitions (boot, root, swap, LVM) with no data on the LVM partition.

(Why 3 active disks? The normal setup for this server was RAID-1 across 2 disks with a hot-spare. Rather then have a risky window of time where one disk has failed and the hot-spare is synchronizing with the remaining good disk, I prefer to have all 3 disks running. That way, when a disk dies, we still have 2 disks in action. The mdadm / Software RAID doesn't seem to care and it doesn't seem to affect performance at all.)

Because this is RAID-1, capturing the disk layout and migrating over to the new disks will be very easy. It's also a very fresh install, so I'm just going to grab the disk contents using "dd" (most of the partition's sectors are still zeroed out from the original install). Once I've backed up the (3) partitions on the first drive, I'm going to pull the (3) drives and replace them with the new ones.

I'll get the machine up and running with the first replacement drive, then configure the blank 2nd and 3rd drives and add them to the RAID set. That is, if mdadm doesn't beat me to the punch and start the sync on the 2nd/3rd disks automatically.

If things go bad, I can always drop the original disks back in the unit and power it back up. I plan on keeping them around for a few days, just in case. I'll have to recreate the LVM volumes, but there aren't any yet (just a PV and a VG).

One advantage of pulling the old drives out completely and rebuilding using fresh drives - I'll end up with a tested disaster recovery process.

Now for the nitty gritty. I'm using a USB pocket drive formatted with ext3 for the rescue work. Make sure that you plug this in before booting the rescue CD.

  1. Login to the system and power it down.
  2. Boot the CentOS5 install DVD
  3. At the "boot:" enter "linux rescue"
  4. Work your way through the startup dialogs
  5. When prompted whether to mount your linux install, choose "Skip"

This should give you a command shell with useful tools. So let's poke around and check on our system.

  1. Looking at "cat /proc/mdstat" shows that while the mdadm software is running, it has not assembled any RAID arrays.
  2. The "fdisk -l" command shows us that the (3) existing disks are named sda, sdb, sdc. Each has (4) partitions (boot, root, swap, LVM).
  3. My USB drive showed up as "/dev/sdd" so I'll create a "/backup" folder and mount it using "mkdir /backup ; mount /dev/sdd1 /backup ; df -h"

Naturally, we should create a sub-folder under /backup for each machine and possibly create another folder underneath it using today's date. We should grab information about the current disk layout and store it in a text file (fdisk.txt).

  1. # cd /backup ; mkdir machinename ; cd machinename
  2. # mkdir todaysdate ; cd todaysdate
  3. # fdisk -l > fdisk.txt

Now to grab the boot loader and image the two critical partitions (boot and root). We'll grab the boot loader off of all (3) drives because it's so small (and it may not be properly synchronized).

  1. dd if=/dev/sda bs=512 count=1 of=machinename-date-sda.mbr
  2. dd if=/dev/sdb bs=512 count=1 of=machinename-date-sdb.mbr
  3. dd if=/dev/sdb bs=512 count=1 of=machinename-date-sdc.mbr
  4. dd if=/dev/sda1 | gzip > machinename-date-ddcopy-sda1.img.gz
  5. dd if=/dev/sda2 | gzip > machinename-date-ddcopy-sda2.img.gz

Total disk space for my system was around 1.75GB worth of compressed files (8GB root, 250MB boot). You could also use bzip2 if you need more compression. Unfortunately, the CentOS5 DVD does not include the "split" command, which could cause issues if you're trying to write to a filesystem that can't handle files over 2GB in size.

Now you should shut the box back down, burn those files to DVD-R, install the new (blank) disks, and boot from the install DVD again. Again, mount the drive that holds the rescue image files to a suitable path.

  1. dd of=/dev/sda bs=512 count=1 if=machinename-date-sda.mbr
  2. fdisk /dev/sda (fix the last partition)
  3. dd if=/dev/sda bs=512 count=1 of=/dev/sdb
  4. dd if=/dev/sda bs=512 count=1 of=/dev/sdc

That will restore the MBR and partition table from the old drive to the new one. If your new drive has a different size, then the last partition will be incorrectly sized for the disk. Fire up "fdisk" and delete / recreate the last partition on the disk.

Restore the two partition images:

  1. # gzip -dc machinename-date-ddcopy-sda1.img.gz | dd of=/dev/sda1
  2. # gzip -dc machinename-date-ddcopy-sda2.img.gz | dd of=/dev/sda2

At this point, we should be able to boot the system on the primary drive and have Software RAID come up in degraded mode for the arrays. Things that will need to be done once the unit boots:

  1. Tell mdadm about the 2nd (and 3rd) disks and tell it to bring those partitions into the arrays and synchronize them.
  2. Create a new swap area
  3. Recreate the LVM physical volume (PV) and volume group (VG)
  4. Restore any data from the LVM area (we had none in this example)

Getting the Software RAID back up and happy is the trickiest of the steps.

  1. Login as root, open up a terminal window
  2. # cat /proc/mdstat
  3. # fdisk -l
  4. Notice that the swap area on our system is sda3, sdb3, sdc3 and will need to be loaded as /dev/md1.
  5. # mdadm --create /dev/md1 -v --level=raid1 --raid-devices=3 /dev/sda3 /dev/sdb3 /dev/sdc3
  6. # mkswap /dev/md1 ; swapon /dev/md1
  7. Now we're ready to recreate the LVM area
  8. # mknod /dev/md3 b 9 3
  9. # mdadm --create /dev/md3 -v --level=raid1 --raid-devices=3 /dev/sda4 /dev/sdb4 /dev/sdc4
  10. # pvcreate /dev/md3 ; vgcreate vg /dev/md3
  11. Finally, we should add the 2nd and 3rd drive to md0 and md2.
  12. mdadm --add /dev/md0 /dev/sdc1
  13. mdadm --add /dev/md0 /dev/sdb1

Note: If your triple mirror RAID array puts the additional disks in as spares, make sure that you have (a) grown the number of raid devices to 3 for the RAID1 set and (b) make sure that there are no other arrays synchronizing as the same time. It's also best to add the elements one at a time, rather then adding both at the same time. I'm not sure if it's a bug in mdadm or just the way it works, but it took me two tries to get my triple mirror back up with all disks marked as "active" instead of (2) active and (1) hot-spare.

Labels: , , ,

Tuesday, March 06, 2007
My new preferred disk layout for servers
With age comes wisdom? After working with SoftwareRAID and Linux servers for a while, I've changed my preferred disk system design and layout.

RAID

Under the old system, I was running a (2) disk RAID1 (mirror) with a hot-spare disk setup and ready for action. But if you're going to have a hot-spare dedicated to the RAID1 array, why not use it as an active array member? That way, if a disk fails, you still have two good disks. Unfortunately, when a RAID element fails, the load from the rebuild process can often kill the one of the remaining disks in the array.

Is it a likely scenario? Probably not. But Linux's Software RAID handles a triple-active RAID1 mirror without any slowdown, so there's not much reason *not* to implement it that way. Plus it's a useful trick to know for situations where you really *do* need to be that paranoid.

(I'm not sure whether any hardware RAID cards provide for a triple-active mirroring RAID1 configuration.)

Partitions

I've also simplified how many partitions I like to have on the disk. My current disk layouts typically look like:

/dev/sdX1 - /dev/md0 - 250MB - /boot
/dev/sdX2 - /dev/md1 - 12GB - / (primary root)
/dev/sdX3 - /dev/md2 - 12GB - / (backup root)
/dev/sdX5 - /dev/md3 - 32GB - /var/log
/dev/sdX6 - /dev/md4 - 2GB - swap
/dev/sdX7 - /dev/md5 - 64GB - /backup/system
/dev/sdX8 - /dev/md6 - (remainder) - LVM area

During normal operations, we boot and run /dev/md1 as our / (root) partition. The /dev/md2 partition is kept offline and is never mounted. Periodically, after validating that the server is in good health, we will copy the contents of /dev/md1 to /dev/md2, make adjustments to /etc/fstab and the hostname. This requires some server downtime (long enough to setup the 2nd root partition).

In the case where the primary OS is hosed, we can boot from the backup OS partition and get back up and running quickly. That gives us the luxury to continue operations until we can schedule downtime to fix the primary OS partition.

Notice that I've broken /var/log out to its own partition. I do this so that an overflowing set of logs won't take the server box down. Plus, by putting the log files in their own physical partition, it's easy to use a boot CD or USB key to gain access to the logs in case of severe issues.

The other physical partition that I consider necessary is /backup/system. This partition is used to hold images of the boot and root partitions, along with information about the partition layout and images of the MBRs. Basically, it's used to store disaster recovery backups. You should not have this partition mounted during normal operations. Taking the contents of this partition offsite is also a good idea. A basic text file of how the backups were created along with information for how to restore these backups is recommended.

Summary

This setup tries to walk the fine line between keeping it simple, but having enough flexibility to deal with a large set of potential failures. Anything from a two-disk failure, to the primary OS being hosed, to both OS partitions having problems all the way up to boot records or the /boot partition being killed.

Labels: , ,

Tuesday, August 29, 2006
Creating a 4-disk RAID10 using mdadm
Since I can't seem to find instructions on how to do this (yet)...

I'm going to create a 4-disk RAID10 array using Linux Software RAID and mdadm. The old way is to create individual RAID1 volumes and then stripe a RAID0 volume over the RAID1 arrays. That requires creating extra /dev/mdN nodes which can be confusing to the admin that follows you.

1) Create the /dev/mdN node for the new RAID10 array. In my case, I already have /dev/md0 to /dev/md4 so I'm going to create /dev/md5 (note that "5" appears twice in the command).

# mknod /dev/md5 b 9 5

2) Use fdisk on the (4) drives, create a single primary partition of type "fd" (Linux raid autodetect). Note that I have *nothing* on these brand new drives, so I don't care if it wipes out data.

3) Create the mdadm RAID set using 4 devices and a level of RAID10.

# mdadm --create /dev/md5 -v --raid-devices=4 --chunk=32 --level=raid10 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

Which will result in the following output:

mdadm: layout defaults to n1
mdadm: size set to 732571904K
mdadm: array /dev/md5 started.

# cat /proc/mdstat

Personalities : [raid1] [raid10]
md5 : active raid10 sdf1[3] sde1[2] sdd1[1] sdc1[0]
1465143808 blocks 32K chunks 2 near-copies [4/4] [UUUU]
[>....................] resync = 0.2% (3058848/1465143808) finish=159.3min speed=152942K/sec


As you can see, we get around 150MB/s from the RAID10 array. The regular RAID1 arrays only have about 75MB/s throughput (same as a single 750GB drive).

A final note. My mdadm.conf file is completely empty on this system. That works well for simple systems, but you'll want to create a configuration file in more complex setups.

Labels:

Sunday, June 11, 2006
Failing hard drive in a Software RAID
So today's fun is that I have a drive that is failing in my 566Mhz Celeron server. This is a small server with (3) 120GB hard drives.

hda - 120GB (primary drive, 4 partitions)
hdc - CD-ROM
hde - 120GB (second drive in the RAID1 sets, 4 partitions)
hdg - 120GB (backup drive)

During the rebuild of md3 (which is hda4+hde4) I'm getting constant aborts due to a bad block (or blocks) on hda.

# tail -n 500 /var/log/messages
Jun 10 23:17:16 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 10 23:17:16 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 10 23:17:16 coppermine ide: failed opcode was: unknown
Jun 10 23:17:16 coppermine end_request: I/O error, dev hda, sector 100789712
Jun 10 23:17:20 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 10 23:17:20 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789723, sector=100789720
Jun 10 23:17:20 coppermine ide: failed opcode was: unknown
Jun 10 23:17:20 coppermine end_request: I/O error, dev hda, sector 100789720
Jun 10 23:17:20 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 10 23:17:20 coppermine md: md3: sync done.
Jun 10 23:17:20 coppermine RAID1 conf printout:
Jun 10 23:17:20 coppermine --- wd:1 rd:2
Jun 10 23:17:20 coppermine disk 0, wo:0, o:1, dev:hda4
Jun 10 23:17:20 coppermine disk 1, wo:1, o:1, dev:hde4
Jun 10 23:17:20 coppermine RAID1 conf printout:
Jun 10 23:17:20 coppermine --- wd:1 rd:2
Jun 10 23:17:20 coppermine disk 0, wo:0, o:1, dev:hda4
Jun 10 23:17:20 coppermine RAID1 conf printout:
Jun 10 23:17:20 coppermine --- wd:1 rd:2
Jun 10 23:17:20 coppermine disk 0, wo:0, o:1, dev:hda4
Jun 10 23:17:20 coppermine disk 1, wo:1, o:1, dev:hde4
Jun 10 23:17:20 coppermine md: syncing RAID array md3
Jun 10 23:17:20 coppermine md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Jun 10 23:17:20 coppermine md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Jun 10 23:17:20 coppermine md: using 128k window, over a total of 115926464 blocks.


And mdadm will continue to attempt to rebuild the array until the end of time. Which is rather pointless. So the second step is to more closely examine /dev/hda and see whether we're seeing the same block number.

# grep 'hda:' /var/log/messages
May 29 08:43:06 coppermine hda: Maxtor 4R120L0, ATA DISK drive
May 29 08:43:06 coppermine hda: max request size: 128KiB
May 29 08:43:06 coppermine hda: 240121728 sectors (122942 MB) w/2048KiB Cache, CHS=65535/16/63
May 29 08:43:06 coppermine hda: cache flushes supported
May 29 08:43:06 coppermine hda: hda1 hda2 hda3 hda4
Jun 8 22:32:02 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 8 22:32:02 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=80342494, sector=80342480
Jun 8 22:32:04 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 8 22:32:04 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=80342494, sector=80342488
Jun 8 22:32:05 coppermine raid1: hda: unrecoverable I/O read error for block 72089984
Jun 9 05:10:36 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 05:10:36 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 9 05:10:39 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 05:10:39 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789722, sector=100789720
Jun 9 05:10:39 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 9 08:26:40 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 08:26:40 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=54393160, sector=54393152
Jun 9 08:26:42 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 08:26:42 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=54393160, sector=54393160
Jun 9 08:26:42 coppermine raid1: hda: unrecoverable I/O read error for block 46140544
Jun 9 13:13:53 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 13:13:53 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 9 13:13:55 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 13:13:55 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789722, sector=100789720
Jun 9 13:13:55 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 10 18:30:21 coppermine hda: Maxtor 4R120L0, ATA DISK drive
Jun 10 18:30:21 coppermine hda: max request size: 128KiB
Jun 10 18:30:21 coppermine hda: 240121728 sectors (122942 MB) w/2048KiB Cache, CHS=65535/16/63
Jun 10 18:30:21 coppermine hda: cache flushes supported
Jun 10 18:30:21 coppermine hda: hda1 hda2 hda3 hda4
Jun 10 23:17:16 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 10 23:17:16 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 10 23:17:20 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 10 23:17:20 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789723, sector=100789720
Jun 10 23:17:20 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 11 04:08:06 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 11 04:08:06 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 11 04:08:08 coppermine raid1: hda: unrecoverable I/O read error for block 92537216


This shows me that I have a drive that almost always fails at the same block number each time. Another grep of the log files makes this even more clear:

# grep 'unrecoverable' /var/log/messages
Jun 8 22:32:05 coppermine raid1: hda: unrecoverable I/O read error for block 72089984
Jun 9 05:10:39 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 9 08:26:42 coppermine raid1: hda: unrecoverable I/O read error for block 46140544
Jun 9 13:13:55 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 10 23:17:20 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 11 04:08:08 coppermine raid1: hda: unrecoverable I/O read error for block 92537216


So the first step (after backing up the system) is to stop the software RAID from attempting to constantly rebuild array "md3". You can do this with the mdadm tool's "manage mode" commands.

Well, maybe not. I've done a lot of digging in Google, but I can't figure out how to force mdadm to stop a sync that is in progress. So, I'm booting back to the original 2005.1 Gentoo boot CD so that I can manually control the process.

Note that an excellent resource is:
LVM2 and Software RAID in Linux (May 2005)

livecd ~ # fdisk -l

Disk /dev/hda: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 1 249 125464+ fd Linux raid autodetect
/dev/hda2 250 4218 2000376 fd Linux raid autodetect
/dev/hda3 4219 8187 2000376 fd Linux raid autodetect
/dev/hda4 8188 238200 115926552 fd Linux raid autodetect

Disk /dev/hde: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hde1 * 1 249 125464+ fd Linux raid autodetect
/dev/hde2 250 4218 2000376 fd Linux raid autodetect
/dev/hde3 4219 8187 2000376 fd Linux raid autodetect
/dev/hde4 8188 238200 115926552 fd Linux raid autodetect

Disk /dev/hdg: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hdg1 1 238000 119951968+ 8e Linux LVM

livecd ~ # modprobe md
livecd ~ # modprobe raid1
livecd ~ # ls -l /dev/md*
livecd ~ # for i in 0 1 2 3; do mknod /dev/md$i b 9 $i; done
livecd ~ # ls -l /dev/md*
brw-r--r-- 1 root root 9, 0 Jun 12 00:01 /dev/md0
brw-r--r-- 1 root root 9, 1 Jun 12 00:01 /dev/md1
brw-r--r-- 1 root root 9, 2 Jun 12 00:01 /dev/md2
brw-r--r-- 1 root root 9, 3 Jun 12 00:01 /dev/md3
livecd ~ # mdadm --assemble /dev/md0 /dev/hda1 /dev/hde1
mdadm: /dev/md0 has been started with 2 drives.
livecd ~ # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hda1[0] hde1[1]
125376 blocks [2/2] [UU]

unused devices:
livecd ~ #


So that starts up the /boot partition. Now I can check it for errors using e2fsck. The "-c" checks for bad blocks, the "-C" updates any inodes on the system with bad block information, and "-y" answers 'yes' to any questions.

livecd ~ # e2fsck -c -C -y -v /dev/md0
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 376
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****

40 inodes used (0%)
3 non-contiguous inodes (7.5%)
# of inodes with ind/dind/tind blocks: 12/6/0
12593 blocks used (10%)
0 bad blocks
0 large files

26 regular files
3 directories
0 character device files
0 block device files
0 fifos
0 links
2 symbolic links (2 fast symbolic links)
0 sockets
--------
31 files
livecd ~ #


Next, I assemble the RAID1 set for the root volume.

livecd ~ # mdadm --assemble /dev/md2 /dev/hda3 /dev/hde3
mdadm: /dev/md2 has been started with 2 drives.
livecd ~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda3[0] hde3[1]
2000256 blocks [2/2] [UU]

md1 : active raid1 hda2[0] hde2[1]
2000256 blocks [2/2] [UU]

md0 : active raid1 hda1[0] hde1[1]
125376 blocks [2/2] [UU]

unused devices:
livecd ~ # e2fsck -c -C -y -v /dev/md2
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 064
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/md2: ***** FILE SYSTEM WAS MODIFIED *****

6434 inodes used (2%)
16 non-contiguous inodes (0.2%)
# of inodes with ind/dind/tind blocks: 75/2/0
390601 blocks used (78%)
0 bad blocks
0 large files

928 regular files
153 directories
1055 character device files
4025 block device files
0 fifos
0 links
264 symbolic links (264 fast symbolic links)
0 sockets
--------
6425 files
livecd ~ #


The rest of the system is more complex, LVM2 volumes on top of software RAID.

livecd ~ # modprobe dm-mod
livecd ~ # pvscan
PV /dev/hdg1 VG vgbackup lvm2 [114.39 GB / 82.39 GB free]
Total: 1 [114.39 GB] / in use: 1 [114.39 GB] / in no VG: 0 [0 ]
livecd ~ # vgscan
Reading all physical volumes. This may take a while...
Found volume group "vgbackup" using metadata type lvm2
livecd ~ # lvscan
inactive '/dev/vgbackup/backup' [32.00 GB] inherit
livecd ~ # lvchange -a y /dev/vgbackup/backup
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvscan
ACTIVE '/dev/vgbackup/backup' [32.00 GB] inherit
livecd ~ # e2fsck -c -C -y -v /dev/vgbackup/backup
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 608
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgbackup/backup: ***** FILE SYSTEM WAS MODIFIED *****

70 inodes used (0%)
14 non-contiguous inodes (20.0%)
# of inodes with ind/dind/tind blocks: 35/18/0
954693 blocks used (11%)
0 bad blocks
0 large files

55 regular files
6 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
61 files
livecd ~ #


So far so good. But most of the errors are in /dev/md3. So I'm going to assemble /dev/md3 using just one of the drives (/dev/hde4).

livecd ~ # mdadm -v --assemble /dev/md3 /dev/hde4
mdadm: looking for devices for /dev/md3
mdadm: /dev/hde4 is identified as a member of /dev/md3, slot 2.
mdadm: added /dev/hde4 to /dev/md3 as 2
mdadm: /dev/md3 assembled from 0 drives and 1 spare - not enough to start the array.
livecd ~ # cat /proc/mdstat
Personalities : [raid1]
md3 : inactive hde4[2]
115926464 blocks
md2 : active raid1 hda3[0] hde3[1]
2000256 blocks [2/2] [UU]

md1 : active raid1 hda2[0] hde2[1]
2000256 blocks [2/2] [UU]

md0 : active raid1 hda1[0] hde1[1]
125376 blocks [2/2] [UU]

unused devices:


Unfortunately, mdadm is refusing to mount /dev/md3 using just /dev/hde4. So we have to force it:

livecd ~ # mdadm --create /dev/md3 --level 1 --force --raid-disks=1 /dev/hde4
mdadm: Cannot open /dev/hde4: Device or resource busy
mdadm: create aborted
livecd ~ # mdadm --stop /dev/md3
livecd ~ # mdadm --create /dev/md3 --level 1 --force --raid-disks=1 /dev/hde4
mdadm: /dev/hde4 appears to be part of a raid array:
level=1 devices=2 ctime=Sat Oct 22 20:51:12 2005
Continue creating array? y
mdadm: array /dev/md3 started.
livecd ~ # cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 hde4[0]
115926464 blocks [1/1] [U]

md2 : active raid1 hda3[0] hde3[1]
2000256 blocks [2/2] [UU]

md1 : active raid1 hda2[0] hde2[1]
2000256 blocks [2/2] [UU]

md0 : active raid1 hda1[0] hde1[1]
125376 blocks [2/2] [UU]

unused devices:
livecd ~ #livecd ~ # mdadm --create /dev/md3 --level 1 --force --raid-disks=1 /dev/hde4
mdadm: Cannot open /dev/hde4: Device or resource busy
mdadm: create aborted
livecd ~ # mdadm --stop /dev/md3
livecd ~ # mdadm --create /dev/md3 --level 1 --force --raid-disks=1 /dev/hde4
mdadm: /dev/hde4 appears to be part of a raid array:
level=1 devices=2 ctime=Sat Oct 22 20:51:12 2005
Continue creating array? y
mdadm: array /dev/md3 started.
livecd ~ # cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 hde4[0]
115926464 blocks [1/1] [U]

md2 : active raid1 hda3[0] hde3[1]
2000256 blocks [2/2] [UU]

md1 : active raid1 hda2[0] hde2[1]
2000256 blocks [2/2] [UU]

md0 : active raid1 hda1[0] hde1[1]
125376 blocks [2/2] [UU]

unused devices:
livecd ~ #


Now I can scan for LVM2 volumes on the md3 array.

livecd ~ # pvscan
PV /dev/md3 VG vgmirror lvm2 [110.55 GB / 52.55 GB free]
PV /dev/hdg1 VG vgbackup lvm2 [114.39 GB / 82.39 GB free]
Total: 2 [224.95 GB] / in use: 2 [224.95 GB] / in no VG: 0 [0 ]
livecd ~ # vgscan
Reading all physical volumes. This may take a while...
Found volume group "vgmirror" using metadata type lvm2
Found volume group "vgbackup" using metadata type lvm2
livecd ~ # lvscan
inactive '/dev/vgmirror/tmp' [4.00 GB] inherit
inactive '/dev/vgmirror/vartmp' [4.00 GB] inherit
inactive '/dev/vgmirror/opt' [2.00 GB] inherit
inactive '/dev/vgmirror/usr' [4.00 GB] inherit
inactive '/dev/vgmirror/var' [4.00 GB] inherit
inactive '/dev/vgmirror/home' [4.00 GB] inherit
inactive '/dev/vgmirror/pgsqldata' [16.00 GB] inherit
inactive '/dev/vgmirror/www' [4.00 GB] inherit
inactive '/dev/vgmirror/svn' [16.00 GB] inherit
ACTIVE '/dev/vgbackup/backup' [32.00 GB] inherit
livecd ~ # lvchange -a y /dev/vgmirror/tmp
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/vartmp
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/opt
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/usr
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/var
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/home
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/pgsqldata
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/www
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/svn
/dev/cdrom: open failed: Read-only file system
livecd ~ #


Now I can check all of the LVM2 file systems:

livecd ~ # lvscan
ACTIVE '/dev/vgmirror/tmp' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/vartmp' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/opt' [2.00 GB] inherit
ACTIVE '/dev/vgmirror/usr' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/var' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/home' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/pgsqldata' [16.00 GB] inherit
ACTIVE '/dev/vgmirror/www' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/svn' [16.00 GB] inherit
ACTIVE '/dev/vgbackup/backup' [32.00 GB] inherit
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/tmp
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/tmp: ***** FILE SYSTEM WAS MODIFIED *****

15 inodes used (0%)
0 non-contiguous inodes (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
16472 blocks used (1%)
0 bad blocks
0 large files

2 regular files
4 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
6 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/vartmp
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/vartmp: ***** FILE SYSTEM WAS MODIFIED *****

4771 inodes used (0%)
524 non-contiguous inodes (11.0%)
# of inodes with ind/dind/tind blocks: 285/1/0
52582 blocks used (5%)
0 bad blocks
0 large files

4480 regular files
282 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
4762 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/opt
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 288
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/opt: ***** FILE SYSTEM WAS MODIFIED *****

12 inodes used (0%)
0 non-contiguous inodes (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
16443 blocks used (3%)
0 bad blocks
0 large files

1 regular file
2 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
3 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/usr
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/usr: ***** FILE SYSTEM WAS MODIFIED *****

202520 inodes used (38%)
3582 non-contiguous inodes (1.8%)
# of inodes with ind/dind/tind blocks: 2317/17/0
439977 blocks used (41%)
0 bad blocks
0 large files

172474 regular files
26704 directories
0 character device files
0 block device files
0 fifos
2487 links
3333 symbolic links (3248 fast symbolic links)
0 sockets
--------
204998 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/var
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/var: ***** FILE SYSTEM WAS MODIFIED *****

30344 inodes used (5%)
181 non-contiguous inodes (0.6%)
# of inodes with ind/dind/tind blocks: 54/1/0
100055 blocks used (9%)
0 bad blocks
0 large files

29856 regular files
474 directories
0 character device files
0 block device files
0 fifos
0 links
3 symbolic links (3 fast symbolic links)
2 sockets
--------
30335 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/home
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/home: ***** FILE SYSTEM WAS MODIFIED *****

58 inodes used (0%)
0 non-contiguous inodes (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
24717 blocks used (2%)
0 bad blocks
0 large files

33 regular files
15 directories
0 character device files
0 block device files
0 fifos
0 links
1 symbolic link (1 fast symbolic link)
0 sockets
--------
49 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/pgsqldata
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 304
Pass 1: Checking inodes, blocks, and sizes
Inode 1802356, i_blocks is 26312, should be 23952. Fix? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(3625600--3625704) -(3625710--3625711) -(3625716--3625719) -(3625724--3625907)
Fix? yes

Free blocks count wrong for group #110 (6797, counted=7092).
Fix? yes

Free blocks count wrong (4056868, counted=4057163).
Fix? yes


/dev/vgmirror/pgsqldata: ***** FILE SYSTEM WAS MODIFIED *****

1003 inodes used (0%)
90 non-contiguous inodes (9.0%)
# of inodes with ind/dind/tind blocks: 167/19/0
137141 blocks used (3%)
0 bad blocks
0 large files

964 regular files
30 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
994 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/www
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/www: ***** FILE SYSTEM WAS MODIFIED *****

478 inodes used (0%)
0 non-contiguous inodes (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
25147 blocks used (2%)
0 bad blocks
0 large files

455 regular files
14 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
469 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/svn
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 304
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/svn: ***** FILE SYSTEM WAS MODIFIED *****

128 inodes used (0%)
17 non-contiguous inodes (13.3%)
# of inodes with ind/dind/tind blocks: 15/10/0
146674 blocks used (3%)
0 bad blocks
0 large files

98 regular files
21 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
119 files
livecd ~ #


So all of the filesystems on /dev/hde4 check out okay. Now I want to take a closer look at the drives to verify that they have no bad blocks. The best way to do this is with a read-only disk test using badblocks.

# badblocks -sv /dev/hdg1

From the looks of my testing on the various drives, hda is the problem drive with a few surface errors. So I'm going to wholy replace drive hda with a fresh 120GB drive.

So I've moved the cables from hda to connect with hde, and I've put a new 120GB hard drive into the hde position. Since I setup the box properly way back when (installing grub to both disks) things are working very well and the machine booted right back up.

First we copy the partition layout from hda to hde, then I copy the boot sector from hda to hde.

coppermine thomas # sfdisk -d /dev/hda | sfdisk /dev/hde
Checking that no-one is using this disk right now ...
OK

Disk /dev/hde: 238216 cylinders, 16 heads, 63 sectors/track
Old situation:
Warning: The partition table looks like it was made
for C/H/S=*/255/63 (instead of 238216/16/63).
For this listing I'll assume that geometry.
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

Device Boot Start End #cyls #blocks Id System
/dev/hde1 0+ 14945 14946- 120053713+ 6 FAT16
/dev/hde2 0 - 0 0 0 Empty
/dev/hde3 0 - 0 0 0 Empty
/dev/hde4 0 - 0 0 0 Empty
New situation:
Units = sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/hde1 * 63 250991 250929 fd Linux raid autodetect
/dev/hde2 250992 4251743 4000752 fd Linux raid autodetect
/dev/hde3 4251744 8252495 4000752 fd Linux raid autodetect
/dev/hde4 8252496 240105599 231853104 fd Linux raid autodetect
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
coppermine thomas # dd if=/dev/hda bs=512 count=1 of=/dev/hde
1+0 records in
1+0 records out
coppermine thomas # fdisk -l /dev/hda

Disk /dev/hda: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 1 249 125464+ fd Linux raid autodetect
/dev/hda2 250 4218 2000376 fd Linux raid autodetect
/dev/hda3 4219 8187 2000376 fd Linux raid autodetect
/dev/hda4 8188 238200 115926552 fd Linux raid autodetect
coppermine thomas # fdisk -l /dev/hde

Disk /dev/hde: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hde1 * 1 249 125464+ fd Linux raid autodetect
/dev/hde2 250 4218 2000376 fd Linux raid autodetect
/dev/hde3 4219 8187 2000376 fd Linux raid autodetect
/dev/hde4 8188 238200 115926552 fd Linux raid autodetect
coppermine thomas #


Now I need to add the new partitions to the software RAID arrays.

coppermine thomas # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hda2[1]
2000256 blocks [2/1] [_U]

md2 : active raid1 hda3[1]
2000256 blocks [2/1] [_U]

md3 : active raid1 hda4[0]
115926464 blocks [1/1] [U]

md0 : active raid1 hda1[1]
125376 blocks [2/1] [_U]

unused devices:
coppermine thomas # mdadm /dev/md0 -a /dev/hde1
mdadm: hot added /dev/hde1
coppermine thomas # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hda2[1]
2000256 blocks [2/1] [_U]

md2 : active raid1 hda3[1]
2000256 blocks [2/1] [_U]

md3 : active raid1 hda4[0]
115926464 blocks [1/1] [U]

md0 : active raid1 hde1[2] hda1[1]
125376 blocks [2/1] [_U]
[=>...................] recovery = 9.7% (12928/125376) finish=0.5min speed=3232K/sec

unused devices:
coppermine thomas #


Repeat the above for the other 3 RAID1 arrays that are degraded.

At this point, I'm basically done. It's time to make another backup and maybe swap the hda/hde cables to verify that I copied the boot sector correctly.

...

The big problem is that md3 is showing up with only a single drive "[U]" instead of "[U_]". So I need to figure out how to tell mdadm to add /dev/hde4 to the array and force it to resync. (To fix this, you use the "grow" command of mdadm.)

coppermine thomas # mdadm --grow /dev/md3 --raid-disks=2
coppermine thomas # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hde2[0] hda2[1]
2000256 blocks [2/2] [UU]

md2 : active raid1 hde3[0] hda3[1]
2000256 blocks [2/2] [UU]

md3 : active raid1 hda4[0]
115926464 blocks [2/1] [U_]

md0 : active raid1 hde1[0] hda1[1]
125376 blocks [2/2] [UU]

unused devices:
coppermine thomas # mdadm /dev/md3 --add /dev/hde4
mdadm: hot added /dev/hde4
coppermine thomas # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hde2[0] hda2[1]
2000256 blocks [2/2] [UU]

md2 : active raid1 hde3[0] hda3[1]
2000256 blocks [2/2] [UU]

md3 : active raid1 hde4[2] hda4[0]
115926464 blocks [2/1] [U_]
[>....................] recovery = 0.0% (6656/115926464) finish=1153.4min speed=1664K/sec

md0 : active raid1 hde1[0] hda1[1]
125376 blocks [2/2] [UU]

unused devices:
coppermine thomas #

Labels:

Tuesday, April 19, 2005
Gentoo and Software RAID (2004.3)
Going back to the 2004.3 Gentoo Universal boot CD. Trying to get past my previous issue bd_claim issues when setting up a software RAID. This is on my Gigabyte GA-6VA7+ motherboard (notes on the Gigabyte GA-6VA7+ motherboard and other hardware).

Starting with the usual tricks:

1) Boot the system using the Universal CD
2) ifconfig - find out the IP address of the box
3) passwd - change the root password to something you know
4) Start the SSH daemon - /etc/init.d/sshd start

Since these drives were nuked since my last attempt, I have to re-configure the partitions.

# fdisk /dev/hda

Command: n
Command action: p
Partition number: 1
First cylinder: 1
Last cylinder: +128M
Command: a
Partition number: 1
Command: t
Hex code: fd

Command: n
Command action: p
Partition number: 2
First cylinder: [enter]
Last cylinder: +2048M
Command: t
Partition number: 2
Hex code: fd

Command: n
Command action: p
Partition number: 3
First cylinder: [enter]
Last cylinder: +2048M
Command: t
Partition number: 3
Hex code: fd

Command: n
Command action: p
First cylinder: [enter]
Last cylinder: [enter]
Command: t
Partition number: 4
Hex code: fd

Command: p

Command: w


This gives me a 128MB boot area, a 2GB swap area, a 2GB root area, with the rest of the disk set aside for my LVM2 partitions. Repeat the above commands to configure the 2nd disk in the same fashion (/dev/hdc for me).

Now I need to configure software RAID. This is a bit easier then last year since I don't need to muck with the /etc/raidtab file (instead, I'm going to use mdadm).

The following loads the 'md' module and creates the nodes (/dev/md*).

# modprobe md
# ls /dev/md*
ls: /dev/md*: No such file or directory
# for i in 0 1 2 3; do mknod /dev/md$i b 9 $i; done
# ls /dev/md*
/dev/md0 /dev/md1 /dev/md2 /dev/md3


Now, we create our RAID1 sets.

# modprobe raid1
# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1
mdadm: array /dev/md0 started.
# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/hda2 /dev/hdc2
mdadm: array /dev/md1 started.
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hdc2[1] hda2[0]
2000256 blocks [2/2] [UU]
[==>..................] resync = 13.8% (277760/2000256) finish=3.6min speed=7920K/sec
md0 : active raid1 hdc1[1] hda1[0]
125376 blocks [2/2] [UU]

unused devices: <none>


Seems to be working fine. Once each RAID set finishes initialization, I'll create the next one in the series using the following commands:

# mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/hda3 /dev/hdc3
# mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/hda4 /dev/hdc4


The last RAID set will take a while to initialize (2 hours?), so I'm going to go work on other things while it runs. I also need to go back and review the documentation to see what else I need to do when doing software RAID.

Labels: ,

Thursday, March 31, 2005
Installing Gentoo on Software RAID
I'm currently fighting with this again. Apparently, there have been some changes between the 2004.0 CD and the 2005.0 CD (mostly related to the 2.6 kernal and the DevFS vs UDev systems).

First off, some tips-n-tricks:

1) ifconfig - find out the IP address of the box
2) passwd - change the root password to something you know
3) Starting the SSH daemon - The command is: /etc/init.d/sshd start

Now I'll be able to use SecureCRT and have an easier time of copy-n-paste. (Pick 'keyboard interactive' and 'ssh2' under options when setting up the connection in SecureCRT.)

Links:

Gentoo 2004.3 Software RAID Install HOWTO
Gentoo.org - x86 tips and tricks (including software RAID)
The Linux Documentation Project: Software RAID
An older gentoo.org thread on software raid

Now... to show you what I did and where it breaks. I have (2) IDE disks, attached as /dev/hda and /dev/hdc. I plan on mirroring them, and have already created /dev/hda1 (128MB for /boot), /dev/hda2 (2GB for swap), /dev/hda3 (2GB for root), and /dev/hda4 (rest of disk for LVM). All partitions are flagged as type 'fd' (linux raid auto).

# modprobe md
# ls /dev/md*
ls: /dev/md*: No such file or directory
# for i in 0 1 2 3; do mknod /dev/md$i b 9 $i; done
# ls /dev/md*
/dev/md0 /dev/md1 /dev/md2 /dev/md3
# nano -w /etc/mdadm.conf


Contents of my mdadm.conf file:

DEVICE /dev/hda1 /dev/hda2 /dev/hda3 /dev/hda4
DEVICE /dev/hdc1 /dev/hdc2 /dev/hdc3 /dev/hdc4

ARRAY /dev/md0 devices=/dev/hda1,/dev/hdc1
ARRAY /dev/md1 devices=/dev/hda2,/dev/hdc2
ARRAY /dev/md2 devices=/dev/hda3,/dev/hdc3
ARRAY /dev/md3 devices=/dev/hda4,/dev/hdc4


Pretty generic... now to create the raid. This is where we hit our first issue:

# modprobe raid1
# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1
mdadm: /dev/hda1 appears to contain an ext2fs file system
size=72192K mtime=Wed Mar 30 16:46:07 2005
mdadm: /dev/hdc1 appears to contain an ext2fs file system
size=72192K mtime=Wed Mar 30 16:46:07 2005
Continue creating array? y
mdadm: ADD_NEW_DISK for /dev/hda1 failed: Device or resource busy
#


Here are the problems:

1) Even though I have a 128MB partition for /boot, it is still showing 64MB from a previous partitioning scheme.

2) mdadm: ADD_NEW_DISK for /dev/hda1 failed: Device or resource busy

So... take a look at my partition list:

livecd root # fdisk -l

Disk /dev/hda: 76.8 GB, 76869918720 bytes
16 heads, 63 sectors/track, 148945 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 1 249 125464+ fd Linux raid autodetect
/dev/hda2 250 4218 2000376 fd Linux raid autodetect
/dev/hda3 4219 8187 2000376 fd Linux raid autodetect
/dev/hda4 8188 148945 70942032 fd Linux raid autodetect

Disk /dev/hdc: 76.8 GB, 76869918720 bytes
16 heads, 63 sectors/track, 148945 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hdc1 * 1 249 125464+ fd Linux raid autodetect
/dev/hdc2 250 4218 2000376 fd Linux raid autodetect
/dev/hdc3 4219 8187 2000376 fd Linux raid autodetect
/dev/hdc4 8188 148945 70942032 fd Linux raid autodetect
livecd root #


All of which looks right and proper.

Labels: ,

Gentoo mkraid errors
Note: These directions are works-in-progress... in fact, they might not even work at all until I find out why I'm ending up with non-bootable systems (looks like a bug in the 2.6 kernel).

So, I've created my /etc/raidtab file. I've done the "modprobe md" and "modprobe dm-mod" commands. I'm able to "cat /proc/mdstat" which shows that I have no personalities and no unused devices. But, when I try to use the "mkraid" command, I get the following error:

# mkraid /dev/md0
cannot determine md version: no MD device file in /dev.


The fix (according to the wiki is):

cd /dev ; MAKEDEV md

The key resource to setting up Gentoo on a RAID:

Gentoo Install on Software RAID

Now, they're not putting their swap file on a RAID1. The reason that I put my swap in a RAID1 partition is that I don't want the system to crash if a drive dies (which it will, if you have 2 seperate, non-RAID swap partitions). It's a question of whether you want a slightly higher level of stability, or if you want speed.

Other possible gotches:

1) (not confirmed) After you fdisk, you may want to reboot so make sure that the disks are no longer in use. Otherwise you may get "bd_claim" errors when trying to setup the RAID array.

(still fighting with this a few hours later... going to start a new post)

Labels: ,

Thursday, June 17, 2004
Troubleshooting software RAID boot problems
First problem is that the system boots straight into "grub". Probably due to a missing "grub.conf" file, which I'm pretty sure I had written to the proper location earlier in the install. So I'll load the kernel by hand and fix the config file once I get the box to boot properly.
grub>
grub> cat /boot/grub/grub.conf
(no file found)

grub> root (hd0,0)
grub> kernel /kernel-2.6.6-gentoo root=/dev/md2
grub> boot

This starts the kernel boot process, which will then lead me to my second error. In the meantime, if you use [shift-PgUp] and [shift-PgDn] you can scroll back and forth through the boot messages. A minor problem is that since the raid didn't shutdown cleanly, there are numerous "md: mdx: raid array is not clean -- starting background reconstruction" messages. And I can't even begin to troubleshoot the "Kernel panic: No init found" error until resync is done (that's a 2.5-3.0 hour process).

Looking back at my kernel configuration, I see that Software RAID with RAID1 support was compiled as BUILTIN, and ext2/ext3 were installed by default if I remember correctly. Those are two of the possible errors that could cause the kernel not to be able to read from the "/" (root) partition.

Also possible is that the "/" filesystem was not properly mounted during the install. The following is what I've tried to do in order to fix the issue (or at least diagnose the issue). I would strongly recommend that you do not use the following on a production system unless you understand what everything does. Since I don't have any data on the system (other then config files), I have a good amount of latitude with regards to what I can do. Hopefully, since /boot, /usr, /opt, /var and /home are intact, things will go quicker then the first install.

This repair process is basically a complete reinstall because nothing else under "/" (root) actually got written to the hard drive. Only the separately mounted folders such as /boot, /opt, /usr, /var, and /home were properly mounted.

Note: The following steps assume that you have nothing on your partitions worth keeping, or that you've already backed everything up. I tried doing it without formatting the "/" partition, but the bootstrap.sh file keeps dying on line 84. So I'm going to do a format of the "/" partition as well as /opt, /usr, /var, /tmp, and /var/tmp.

Boot the LiveCD, at the boot prompt be sure to pick the boot kernel and pass the arguments that you want, then load the raid and LVM modules.
boot: gentoo -nohotplug
(gentoo kernel now loads)

livecd root # modprobe md
livecd root # modprobe dm-mod
livecd root # modprobe via-rhine (if your network adapter failed to autoload)
livecd root # net-setup eth0 (if your network adapter failed to autoload)

If you have a copy of your /etc/raidtab file on floppy, copy it in now, otherwise you'll have to re-key your /etc/raidtab by hand. If you need, you can temporarly mount partitions to copy files from.
livecd root # nano -w /etc/raidtab
livecd root # raidstart -a
livecd root # cat /proc/mdstat
(verify that all raid sets are up and running)

livecd root # swapon /dev/md1
livecd root # mke2fs -j /dev/md2 (OVERWRITES YOUR / PARTITION)
livecd root # mount /dev/md2 /mnt/gentoo
livecd root # mkdir /mnt/gentoo/boot
livecd root # mount /dev/md0 /mnt/gentoo/boot
livecd root # ls /mnt/gentoo/boot
(if you have files already in /boot, this verifies that you mounted in the proper order)

livecd root # mkdir /etc/lvm
livecd root # echo 'devices { filter=["r/cdrom/"] }' > /etc/lvm/lvm.conf
livecd root # vgscan
(verify that it found your existing volume group or groups)

livecd root # vgchange -ay vgmirror
livecd root # lvscan
(verify that your logical volumes now show up and are "ACTIVE")

livecd root # mke2fs -j /dev/vgmirror/opt (OVERWRITES YOUR / PARTITION)
livecd root # mke2fs -j /dev/vgmirror/usr (OVERWRITES YOUR / PARTITION)
livecd root # mke2fs -j /dev/vgmirror/var (OVERWRITES YOUR / PARTITION)
livecd root # mke2fs -j /dev/vgmirror/home (OVERWRITES YOUR / PARTITION)
livecd root # mke2fs /dev/vgmirror/ttmp (OVERWRITES YOUR / PARTITION)
livecd root # mke2fs /dev/vgmirror/vartmp (OVERWRITES YOUR / PARTITION)

livecd root # mkdir /mnt/gentoo/opt
livecd root # mkdir /mnt/gentoo/usr
livecd root # mkdir /mnt/gentoo/var
livecd root # mkdir /mnt/gentoo/home
livecd root # mount /dev/vgmirror/opt /mnt/gentoo/opt
livecd root # mount /dev/vgmirror/usr /mnt/gentoo/usr
livecd root # mount /dev/vgmirror/var /mnt/gentoo/var
livecd root # mount /dev/vgmirror/home /mnt/gentoo/home

livecd root # mkdir /mnt/gentoo/tmp
livecd root # mount /dev/vgmirror/tmp /mnt/gentoo/tmp
livecd root # chmod 1777 /mnt/gentoo/tmp
livecd root # mkdir /mnt/gentoo/var/tmp
livecd root # mount /dev/vgmirror/vartmp /mnt/gentoo/var/tmp
livecd root # chmod 1777 /mnt/gentoo/var/tmp
livecd root # mkdir /mnt/gentoo/proc
livecd root # mount -t proc none /mnt/gentoo/proc

livecd root # date
livecd root # cd /mnt/gentoo
livecd gentoo # tar -xvjpf /mnt/cdrom/stages/stage1-x86-20040218.tar.bz2
livecd gentoo # tar -xvjf /mnt/cdrom/snapshots/portage-20040223.tar.bz2 -C /mnt/gentoo/usr
livecd gentoo # mkdir /mnt/gentoo/usr/portage/distfiles
livecd gentoo # cp /mnt/cdrom/distfiles/* /mnt/gentoo/usr/portage/distfiles/

livecd gentoo # nano -w /mnt/gentoo/etc/make.conf
(repeat your make.conf from the initial install)

livecd gentoo # mirrorselect -a -s4 -o | grep -ve '^Netselect' >> /mnt/gentoo/etc/make.conf
livecd gentoo # cp -L /mnt/gentoo/etc/make.conf /mnt/gentoo/boot/make.conf-backupcopy
livecd gentoo # cp -L /etc/resolv.conf /mnt/gentoo/etc/resolv.conf
livecd gentoo # cp -L /etc/raidtab /mnt/gentoo/etc/raidtab
livecd gentoo # cp -L /etc/raidtab /mnt/gentoo/boot/raidtab-backupcopy
livecd gentoo # mkdir /mnt/gentoo/etc/lvm
livecd gentoo # cp -L /etc/lvm/lvm.conf /mnt/gentoo/etc/lvm/lvm.conf
livecd gentoo # cp -L /etc/lvm/lvm.conf /mnt/gentoo/boot/lvm.conf-backupcopy
livecd gentoo # chroot /mnt/gentoo /bin/bash
livecd / # env-update
livecd / # source /etc/profile
livecd / # emerge sync
livecd / # cd /usr/portage
livecd / # scripts/bootstrap.sh

If bootstrap runs correctly (and it should now that I re-formatted the /opt, /usr, /var, /home, /tmp, and /var/tmp volumes), I can pick back up with the rest of my original install process

Labels:

Tuesday, June 15, 2004
LVM and Software RAID
Links:

Linux Logical Volume Manager (LVM) on Software RAID - explains the benefits of using LVM, and also how to get LVM running on top of software RAID in RedHat 8.

Google search for how to do software RAID and LVM

The Software-RAID HOWTO - Updated June 2004, so it's not an old outdated document. Also see the official URL.

Gentoo Server Project Wiki

Notes on Building a Linux Storage Server

Gentoo Forums: How to do a gentoo install on a software RAID - Thread was started back in July 2002, but there are posts as recently as June 2004.

Summary:

The "grub" boot loader needs a real physical partition to boot from (or possibly a RAID1 setup, RAID0 definitely doesn't work). What some folks do is to create (2) identical physical partitions on the drives in the RAID array, and then periodically copy from the primary partition to the secondary partition. (See gentoo forum post or search for user "wrex". Also look for posts by "hover" or "blake121666") The copy command is as simple as "dd if=/dev/hda1 of=/dev/hdb1 bs=8192b". You need to customize that to match your configuration (e.g. changing the source/dest partitions and the block size). If I can puzzle it all out, I think I'm going to go with /boot on RAID1. See section 7.3 of the Software RAID HOWTO

The "swap" partition should be mirrored. Otherwise when the disk fails, existing applications with data in the swap partition of the failed drive will die a nasty death. See section 2.3 of the Software RAID HOWTO.

There's a special set of command that need to be done in "grub" if you want to be able to pull the primary disk and still have the system bootable. (In case the primary disk would fail.) This is all mentioned in the gentoo forums thread about software RAID, look for the post by "hover". Alternatively, you can simply use a boot floppy and fixup the grub settings after the primary drive fails.

Replacing a failed disk in a software RAID1 array requires partitioning the drive by hand prior to rebuilding the array. This is also mentioned in the gentoo forums thread about software RAID, look for the post by "hover".

Searching around on my Gentoo 2004.0 Universal CD, I don't see the "mdadm" tools anywhere, but I do see "raidtools".

Labels: ,