Wednesday, July 22, 2009
3ware 9650SE SATA RAID and CentOS 5 Linux
A few months ago, I picked up a 3ware 9650SE 16-port controller for use in my primary office server, which runs CentOS 5. So far, it's been an up-and-down ride.

Problem #1 - The boot process could not find the array disks.

I have a triple-mirror set that I use for my primary operating system drive. On the old system, they were hooked up directly to the SATA ports on the motherboard.

(Triple-mirror means that I created the elements in mdadm, a.k.a. Software RAID, where all 3 drives are active mirror copies of each other. This offers a slight speed-up for reads, allows you to survive 2 drive failures, and puts what would otherwise be an idle hot-swap disk to use.)

On the new system, I decided to attach them to the 3ware card and configure them as JBOD. However, the kernel initrd file (2.6.18-92.1.22.el5) that I was using for the old motherboard did not have drivers installed for the 3ware card.

So I had to create a custom initrd file. I used gzip and cpio to unpack the contents into a directory, copied in the 3ware kernel driver binary from the 2.6.18-92.1.22.el5 folder, and added it to the list of modules to be loaded. Finally, I repacked the initrd file, added a grub.conf entry pointing at it, and booted cleanly.

Unfortunately, this was all done using the CentOS 5.3 install CD, so I was unable to log the session or keep careful track of what I did.
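
Since no log of that session survives, what follows is only a reconstruction of the gzip/cpio round trip, not the exact commands that were run. It assumes the CentOS 5 initrd is a gzip-compressed "newc" cpio archive. The helper names, the scratch directory, the "-3ware" image name, and the 3w-9xxx.ko module name and path are all assumptions for illustration.

```shell
#!/bin/sh
# Sketch: unpack a gzip-compressed cpio initrd, and repack a directory tree.
# Assumes the initrd is gzip + "newc" cpio; helper names are hypothetical.

# unpack_initrd IMAGE DIR  - extract IMAGE into DIR
unpack_initrd() {
    mkdir -p "$2" || return 1
    gzip -dc "$1" | (cd "$2" && cpio -id)
}

# repack_initrd DIR IMAGE  - build IMAGE from the tree under DIR
repack_initrd() {
    (cd "$1" && find . | cpio -o -H newc) | gzip -9 > "$2"
}

# Possible use (paths and module name are assumptions, adjust to your system):
#   unpack_initrd /boot/initrd-2.6.18-92.1.22.el5.img /tmp/initrd
#   cp /lib/modules/2.6.18-92.1.22.el5/kernel/drivers/scsi/3w-9xxx.ko /tmp/initrd/lib/
#   vi /tmp/initrd/init    # add an "insmod /lib/3w-9xxx.ko" line before the RAID setup
#   repack_initrd /tmp/initrd /boot/initrd-2.6.18-92.1.22.el5-3ware.img
```

Writing the result to a new file name, rather than over the stock initrd, keeps the original grub.conf entry usable as a fallback.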

Problem #2 - JBOD is not really supported on the 9650SE

In the BIOS interface, you *can* set up disks as JBOD and tell the controller to export JBOD disks (see controller options). Setting a drive as JBOD does not currently overwrite/erase data on the drive, so it is a fairly safe operation and a good way to hook up a drive with existing data.

Note that if you have "export JBOD" set in the controller options, you can simply hook the drives up, rescan (using the BIOS or 3ware command-line utility called "tw_cli"), and the drives should show up as /dev/sd? in your list.

The major downside of JBOD mode is that write caching is disabled by default. This means that your drives are going to show a much higher utilization percentage (as seen by "atop") than if you had enabled write caching.

Now, you can enable write caching for JBOD drives, but the unit has to be told to do that after every reboot. The command (assuming that your controller is "c0" and the unit is "u12") is:

# tw_cli /c0/u12 set cache=on
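
Because the setting is lost at every reboot, one option is to reissue the command from a boot script such as /etc/rc.local. The sketch below only prints the commands for a list of units. CONTROLLER, UNITS and the function name are placeholders, and it assumes tw_cli is on the PATH.

```shell
#!/bin/sh
# Sketch: re-enable the write cache on a list of 3ware units at boot.
# CONTROLLER and UNITS are placeholders; use the values that
# "tw_cli /c0 show" reports on your system.
CONTROLLER=${CONTROLLER:-c0}
UNITS=${UNITS:-"u12"}

# Print one tw_cli command per unit (nothing is executed here).
cache_on_commands() {
    for unit in $UNITS; do
        echo "tw_cli /$CONTROLLER/$unit set cache=on"
    done
}

# To apply them, for example from /etc/rc.local:
#   cache_on_commands | sh
```

Printing first and piping to sh afterwards makes it easy to eyeball the unit list before anything touches the controller.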

A final note. If you're going to use write caching, you should spring for the BBU (Battery Backup Unit).

Problem #3 - Controlling the RAID

Download and install the "tw_cli" tarball from 3ware. Since you have to have the 3ware driver installed to talk to the card, and the 9650SE prefers "Single Disk" over "JBOD", you're probably going to want to use 3ware RAID instead of Software RAID.

The problem with "Single Disk" mode is that it overwrites the first few sectors on the disk with 3ware control information. So all of the disks in "Single Disk" mode are going to be slightly smaller than a JBOD disk. Be aware that putting a disk into a 3ware array will cause the loss of anything at the start of the disk (such as the partition table) and the number of cylinders will be slightly smaller.

Of course, due to the strange geometry of a disk touched by the 3ware controller, you'll probably have to move the disk to another 3ware controller in order to read the data in the future. Well, maybe. If you're using Software RAID1 mirroring on top of 3ware Single Disks, then the data is highly likely to be in an easy-to-read format, other than the odd starting point for the partitions.

Anyway, some key commands when using the tw_cli application:

# tw_cli show
- Displays the list of controllers installed. Make note of the "c#" nomenclature as you will use those "c#" labels in later commands to refer to a specific controller.

# tw_cli /c0 show
- Displays units/ports for the *first* controller installed.

# tw_cli /c0 rescan
- Use this after inserting/removing a disk using a hot-swap enclosure.

# tw_cli /c0/ux show all
- Displays configuration information for whichever unit # you provided. Replace the "x" with the unit # that you want to look at (such as /c0/u3 or /c0/u12).

Problem #4 - Performance (a.k.a. I/O wait hell)

Unfortunately, the 3ware Linux kernel driver in the Red Hat / CentOS 2.6.18-92.1.22.el5 kernel is not very good. The steps to reproduce the problem are as follows:

1) Create multiple "single disk" units.

2) Make heavy writes to one or more of the units, such as using "dd" to overwrite the unit with zeros.

3) Attempt to access data on the other units.

What you will find is that:

- Performance of the system starts to feel extremely sluggish for any operations that touch drives on the 3ware controller.

- Looking at "atop", you will see that the other drives are now reporting seek times of 100-200ms instead of 1-10ms. Their utilization numbers will be up around 90-100%, even though the number of reads/writes is only in the 2-3 digit range.

- Turning write caching on/off doesn't make a difference.

From my web searches, it seems like this may be a problem specific to kernel versions prior to 2.6.26. Unfortunately, the stock kernel in Red Hat / CentOS is based off of 2.6.18 and I haven't found out yet whether Red Hat / CentOS have backported the fix.

Updates:

- Even the 2.6.18-128 RHEL/CentOS kernel displays sluggishness any time that we access units (a RAID 6, 8-drive unit that is the only thing on the array). We have zero performance problems with drives attached to a different SATA controller running Software RAID.

- I can't recommend using the 9650SE controller with RHEL/CentOS currently. Performance is absolutely horrid under load.


Wednesday, January 28, 2009
Removing a failed, non-existent drive from Software RAID
So, you have a drive that has failed, you've replaced the drive on the fly (using hot-swap SATA) and now you need to remove the old RAID slice.

For example:

md0 : active raid1 sdi1[0] sdc1[2] sdb1[3](F) sda1[1]
264960 blocks [3/3] [UUU]


In this case, sdb1 is marked as failed, and sdi1 is the slice from the newly added drive (via SATA hot-plug). So we want to remove sdb1 with mdadm's remove command:

# mdadm /dev/md0 --remove /dev/sdb1
mdadm: cannot find /dev/sdb1: No such file or directory


Oops, we can't do that because we already swapped out the failed drive (sdb).

The answer is found in the mdadm man page for the remove feature:

-r, --remove remove listed devices. They must not be active. i.e. they should be failed or spare devices. As well as the name of a device file (e.g. /dev/sda1) the words failed and detached can be given to --remove. The first causes all failed device to be removed. The second causes any device which is no longer connected to the system (i.e an open returns ENXIO) to be removed. This will only succeed for devices that are spares or have already been marked as failed.

So instead of specifying the name of the failed RAID slice, we should use the following command:

# mdadm /dev/md0 -r detached  
mdadm: hot removed 8:17


And there you have it, the failed raid slice that is no longer connected to the system has been removed. It will not show up in "/proc/mdstat" any more.


Wednesday, August 06, 2008
Linux RAID tuning and troubleshooting
Ran across this while searching another topic.

http://makarevitch.org/rant/raid/


Friday, December 21, 2007
Replacing a failed drive in a Software RAID mirror set
Like I wrote about last time, I have a failing drive in my triple active RAID mirror set on my firewall box. See also "Failing hard drive in a Software RAID". I'm still trying to decide whether the disk has actually failed, or if it is just having issues.

# /sbin/badblocks -sv /dev/sdc2

Since I have unmounted this RAID slice, I'm going to test with a DESTRUCTIVE write/read verification. (Which is also a good way to wipe the disk.)

# /sbin/badblocks -sv -w -t random /dev/sdc2

Well, after a few runs with that, the disk is no longer making "retry" noises. So I'm going to re-add the slice to the RAID array and see what happens.

# /sbin/mdadm /dev/md1 -a /dev/sdc2

And force mdadm to verify the sync:

# echo check > /sys/block/md1/md/sync_action

It seems to be working. I'm guessing that I finally convinced SMART to re-map the bad sector that was causing problems.


Wednesday, December 05, 2007
Failed drive slice in a Software RAID after resync
One of the things that I do periodically on my servers is to run an mdadm resync. Because this can put a heavy strain on the disk system, I strongly suggest that you have good backups in place. My home systems run a check about once a month; servers at work run a check early on Tuesday mornings.

The script is very simple. You fire off the check by writing "check" to the sync_action file of each md device under /sys/block.

#!/bin/sh
# Tells mdadm to verify that the arrays are synchronized.
# This deals with the issue where a seldom-read disk block has gone bad
# by doing a daily/weekly verification of the array.

echo check > /sys/block/md0/md/sync_action
echo check > /sys/block/md1/md/sync_action
echo check > /sys/block/md2/md/sync_action
echo check > /sys/block/md3/md/sync_action
echo check > /sys/block/md4/md/sync_action
echo check > /sys/block/md5/md/sync_action
echo check > /sys/block/md6/md/sync_action
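
The per-array lines above could also be written as a loop, so the script does not need editing when arrays are added or removed. This is just a sketch. The function name and its optional argument are made up for illustration, and the argument exists only so the loop can be tried against a scratch directory instead of the real /sys/block.

```shell
#!/bin/sh
# start_checks [SYSFS_ROOT]
# Write "check" to the sync_action file of every md array found under
# SYSFS_ROOT (default /sys/block).
start_checks() {
    root=${1:-/sys/block}
    for f in "$root"/md*/md/sync_action; do
        [ -w "$f" ] || continue   # also skips the unexpanded glob when no arrays exist
        echo check > "$f"
        echo "check requested: $f"
    done
}

# From cron, as root:
#   start_checks
```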


In this case, all of my RAID slices verified correctly except for one. I'm running a triple-active RAID1 array here. (Instead of using a hot-spare disk, I'm putting live data onto all three disks and using all three actively.)

See also Failing hard drive in a Software RAID

$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[2] sdb1[1] sda1[0]
256896 blocks [3/3] [UUU]

md2 : active raid1 sdc3[2] sdb3[1] sda3[0]
12289600 blocks [3/3] [UUU]

md4 : active raid1 sdc5[2] sdb5[1] sda5[0]
33551616 blocks [3/3] [UUU]

md3 : active raid1 sdc6[2] sdb6[1] sda6[0]
1052160 blocks [3/3] [UUU]

md5 : active raid1 sdc7[2] sdb7[1] sda7[0]
64010880 blocks [3/3] [UUU]

md6 : active raid1 sdc8[2] sdb8[1] sda8[0]
267257216 blocks [3/3] [UUU]

md7 : active raid1 sdf1[2] sde1[1] sdd1[0]
488383936 blocks [3/3] [UUU]

md1 : active raid1 sdc2[3](F) sdb2[1] sda2[0]
12289600 blocks [3/2] [UU_]

unused devices: <none>


The md1 array is my / (root) partition. Since the rest of the disk slices appear to be fine, I'm going to proceed with the assumption that it was a minor glitch.

Step 0: Analyze the failure

The first sign of error was the (F) showing up in /proc/mdstat. Apparently I don't have mdadm configured yet in monitor mode so that it e-mails me when it finds an error.
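
Until monitor mode is set up, a small filter over /proc/mdstat can at least flag members tagged "(F)". The following is a rough sketch of that idea, not a replacement for mdadm's own monitoring. The function name is hypothetical, and it reads mdstat-formatted text on standard input.

```shell
#!/bin/sh
# failed_members: read mdstat-formatted text on stdin and print
# "ARRAY MEMBER" for every member flagged "(F)" (failed).
failed_members() {
    awk '/^md[0-9]+ :/ {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*$/, "", dev)   # strip the "[3](F)" suffix
                print $1, dev
            }
    }'
}

# Typical use:
#   failed_members < /proc/mdstat
```

Run from cron, any output could be mailed to root. For the md1 line shown earlier it would print "md1 sdc2".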

# grep "sdc2" messages
Dec 4 09:11:58 fw1-shimo kernel: raid1: Disk failure on sdc2, disabling device.
Dec 4 09:12:06 fw1-shimo kernel: disk 2, wo:1, o:0, dev:sdc2


The full detail from the mdadm resync:

# grep "Dec 4 09" messages | grep "md:"
Dec 4 09:08:33 fw1-shimo kernel: md: md6: sync done.
Dec 4 09:08:33 fw1-shimo kernel: md: syncing RAID array md1
Dec 4 09:08:33 fw1-shimo kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Dec 4 09:08:33 fw1-shimo kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Dec 4 09:08:33 fw1-shimo kernel: md: using 128k window, over a total of 12289600 blocks.
Dec 4 09:11:31 fw1-shimo kernel: md: md1: sync done.
#


And finally, evidence from the logs that shows that sdc was having issues:

Dec 4 09:11:34 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:34 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:34 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:34 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:35 fw1-shimo kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:35 fw1-shimo kernel: ata2.00: (BMDMA stat 0x0)
Dec 4 09:11:35 fw1-shimo kernel: ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:35 fw1-shimo kernel: ata2: EH complete
Dec 4 09:11:37 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:37 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:37 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:37 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:50 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:51 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:51 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:51 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:51 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:52 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:52 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:52 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:52 fw1-shimo setroubleshoot: SELinux is preventing /usr/sbin/sendmail.postfix (system_mail_t) "read" to /dev/md1 (proc_mdstat_t). For complete SELinux messages. run sealert -l d5c655f4-6fc3-445b-ab9d-3b21336cb2d0
Dec 4 09:11:52 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:53 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 4 09:11:53 fw1-shimo kernel: ata3.00: (BMDMA stat 0x60)
Dec 4 09:11:54 fw1-shimo kernel: ata3.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
Dec 4 09:11:54 fw1-shimo kernel: sd 2:0:0:0: SCSI error: return code = 0x08000002
Dec 4 09:11:54 fw1-shimo kernel: sdc: Current: sense key: Medium Error
Dec 4 09:11:54 fw1-shimo kernel: Additional sense: Unrecovered read error - auto reallocate failed
Dec 4 09:11:55 fw1-shimo kernel: end_request: I/O error, dev sdc, sector 25091744
Dec 4 09:11:55 fw1-shimo kernel: ata3: EH complete
Dec 4 09:11:55 fw1-shimo kernel: SCSI device sdc: 781422768 512-byte hdwr sectors (400088 MB)
Dec 4 09:11:55 fw1-shimo kernel: sdc: Write Protect is off
Dec 4 09:11:56 fw1-shimo kernel: SCSI device sdc: drive cache: write back
Dec 4 09:11:56 fw1-shimo kernel: SCSI device sdb: 781422768 512-byte hdwr sectors (400088 MB)
Dec 4 09:11:56 fw1-shimo kernel: sdb: Write Protect is off
Dec 4 09:11:57 fw1-shimo kernel: SCSI device sdb: drive cache: write back
Dec 4 09:11:57 fw1-shimo kernel: SCSI device sdc: 781422768 512-byte hdwr sectors (400088 MB)
Dec 4 09:11:57 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:11:57 fw1-shimo kernel: counted 127, received 15
Dec 4 09:11:58 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:11:58 fw1-shimo kernel: raid1: Disk failure on sdc2, disabling device.
Dec 4 09:11:58 fw1-shimo kernel: Operation continuing on 2 devices
Dec 4 09:11:58 fw1-shimo kernel: blk: request botched
Dec 4 09:11:58 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:11:59 fw1-shimo kernel: counted 112, received 16
Dec 4 09:11:59 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:11:59 fw1-shimo kernel: blk: request botched
Dec 4 09:11:59 fw1-shimo kernel: sdc: Write Protect is off
Dec 4 09:12:00 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:00 fw1-shimo kernel: counted 96, received 16
Dec 4 09:12:00 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:01 fw1-shimo kernel: blk: request botched
Dec 4 09:12:01 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:01 fw1-shimo kernel: counted 80, received 16
Dec 4 09:12:01 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:02 fw1-shimo kernel: blk: request botched
Dec 4 09:12:02 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:02 fw1-shimo kernel: counted 64, received 16
Dec 4 09:12:02 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:03 fw1-shimo kernel: blk: request botched
Dec 4 09:12:03 fw1-shimo kernel: SCSI device sdc: drive cache: write back
Dec 4 09:12:03 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:03 fw1-shimo kernel: counted 48, received 16
Dec 4 09:12:04 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:04 fw1-shimo kernel: blk: request botched
Dec 4 09:12:04 fw1-shimo kernel: Incorrect number of segments after building list
Dec 4 09:12:04 fw1-shimo kernel: counted 32, received 16
Dec 4 09:12:05 fw1-shimo kernel: req nr_sec 0, cur_nr_sec 8
Dec 4 09:12:05 fw1-shimo kernel: blk: request botched
Dec 4 09:12:05 fw1-shimo kernel: ata3.00: WARNING: zero len r/w req
Dec 4 09:12:06 fw1-shimo last message repeated 5 times


Step 1: Drop the failed slice

# /sbin/mdadm /dev/md1 --fail /dev/sdc2
mdadm: set /dev/sdc2 faulty in /dev/md1
# /sbin/mdadm /dev/md1 --remove /dev/sdc2
mdadm: hot removed /dev/sdc2


Step 2: Zero out the failed slice

My thinking here is that by zeroing out the failed slice, I can force the SATA disk to remap any sectors that have gone bad.

# dd if=/dev/zero of=/dev/sdc2
dd: writing to `/dev/sdc2': Input/output error
24577993+0 records in
24577992+0 records out
12583931904 bytes (13 GB) copied, 1916.7 seconds, 6.6 MB/s


Well, that's not a good sign (and the disk was clicking a bit). So I'll run smartctl and check the disk's SMART info (see Monitoring Hard Disks with SMART).

# /usr/sbin/smartctl -i -d ata /dev/sdc
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HD400LJ
Serial Number: S0H2J1KLA07831
Firmware Version: ZZ100-15
User Capacity: 400,088,457,216 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Wed Dec 5 09:43:36 2007 EST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled


However, the "-Hc" output of smartctl says that the disk health is still "PASSED" and not "FAILING". So it's possible that the disk doesn't need to be retired yet.

# /usr/sbin/smartctl -Hc -d ata /dev/sdc
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x05) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (7640) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 130) minutes.
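
The overall PASSED verdict is a coarse signal. The attribute table printed by "smartctl -A" carries per-attribute counters that can show remapping activity earlier. The sketch below pulls out two of them. It assumes the drive reports them under the names smartmontools commonly uses (Reallocated_Sector_Ct and Current_Pending_Sector) and that the raw value is the tenth column; I have not checked either against this particular Samsung drive. The function name is made up.

```shell
#!/bin/sh
# sector_counters: read "smartctl -A" output on stdin and print the raw
# values of the reallocated and pending sector attributes, if present.
sector_counters() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
        print $2, $10
    }'
}

# Typical use:
#   /usr/sbin/smartctl -A -d ata /dev/sdc | sector_counters
```

Comparing the numbers before and after a wipe gives a rough idea of whether the drive actually remapped anything.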


Personally, since I know the drive makes clicking noises and throws an error during the dd wipe, I'm going to swap it out.


Friday, May 25, 2007
iSCSITarget on CentOS5
Setting up our test iSCSI SAN box this week. The original plans were to run this on top of Gentoo (which is very powerful and flexible) but after 3 years, I'm not very pleased with Gentoo as a server OS. Which is a whole different topic. So we've migrated over to using CentOS5, which is derived from Red Hat Enterprise Linux 5, a distro that is more suited for corporate use.

There's not much to talk about in terms of the base system. It's a pretty vanilla 64bit CentOS5 install (from DVD) running on top of a dual-CPU dual-core pair of Socket F Opterons. The primary packages that I've installed so far are "Yum Extender" (from stock repositories) and "rdiff-backup" (downloaded as an RPM). The OS runs on top of a 3-disk RAID1 (mirror, all drives active) Software RAID for safety.

I use a semi-customized partition layout on the (3) operating system disks. I have:

a) /boot
b) / (root, the primary OS install area)
c) swap
d) a backup root partition (which is basically a clone of the primary, except for a small change in /etc/fstab) designed for quick recovery from a situation that would hose the primary root partition
e) /var/log (broken out to its own area)
f) /backup/system (a place to store system backups)
g) LVM area (no allocated areas yet)

I mention all that because the first step before installing iscsitarget is to make sure I can recover if things go awry. Since installing iscsitarget involves mucking with the running kernel, I want a good backup of /boot along with making sure GRUB offers me options to boot an older kernel. I'll also freshen my root backup partition.

Step 1 - Backing up /, /boot, and the existing kernel

Simplicity is often best when dealing with the base OS. My methods are crude, but designed to get me back up and running without needing much in the way of software. The primary requirement is a bootable USB pen drive or bootable LiveCD (such as RIPLinuX) with the necessary tools. You could also use the CentOS5 boot DVD.

I'll run with the CentOS install DVD since that's what I have sitting in the optical drive at the moment. When CentOS boots up, enter "linux rescue" at the boot prompt. Note, if you have multiple NICs installed, it's probably better to not start networking (because the CentOS rescue mode takes forever to initialize unconnected NICs).

Select "Skip" when asked about mounting the existing install at /mnt/sysimage. We'll be doing things our own way instead.

Start up Software RAID on the key partitions (/boot, /, the backup /, and the backup partition). The following commands will (usually) start up your existing RAID devices automatically.

# mdadm --examine --scan >> /etc/mdadm.conf
# mdadm --assemble --scan

In my case "md0" is /boot, "md2" is my base CentOS install, "md3" is the backup root partition, and "md5" is where I can store image files. So let's double-check that.

# mkdir /mnt/root ; mount /dev/md2 /mnt/root
# mkdir /mnt/backuproot ; mount /dev/md3 /mnt/backuproot

If we then examine the output of "df -h", or use "ls" on the mounted volumes, we can verify which is which. Let's mount our backup area and create image files. I prefer to kick off the "dd" commands in the background so that I can monitor progress and keep multiple CPUs busy.

# mkdir /mnt/backup ; mount /dev/md5 /mnt/backup
# cd /mnt/backup ; mkdir images ; cd images
# dd if=/dev/md0 | gzip > dd-md0-boot-20070525.img.gz &
# dd if=/dev/md2 | gzip > dd-md2-root-20070525.img.gz &
# dd if=/dev/md3 | gzip > dd-md3-bkproot-20070525.img.gz &

We should also back up the master boot records on each of the hard drives in the unit.

# for i in a b c; do dd if=/dev/sd$i count=1 bs=512 of=dd-sd$i-mbr-20070525.img; done
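
A quick sanity check on those MBR images is to confirm that each one ends with the 0x55 0xAA boot signature. The helper below is a sketch with a made-up name. It only tests the last two bytes of the 512-byte sector and says nothing about whether the boot code itself is intact.

```shell
#!/bin/sh
# mbr_has_boot_signature FILE
# Succeeds if bytes 510-511 of FILE are 0x55 0xAA, the marker that ends
# a valid MBR sector.
mbr_has_boot_signature() {
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Typical use, against the images created above:
#   for i in a b c; do
#       mbr_has_boot_signature dd-sd$i-mbr-20070525.img || echo "sd$i: no boot signature"
#   done
```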

Unfortunately, the CentOS5 DVD doesn't include tools like "G4L" (Ghost for Linux) or I'd make a second set of backup files using that. I may boot my RIPLinuX CD and see what tools are there. (Because you can never have too many backups.)

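Now I can dump the contents of "md2" (the original root) to "md3" (our backup root). Unmount both first ("umount /mnt/root /mnt/backuproot"), because a dd copy to or from a mounted file system can end up inconsistent.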

# dd if=/dev/md2 of=/dev/md3

Now for some cleanup stuff...

# mount /dev/md3 /mnt/backuproot
# vi /mnt/backuproot/etc/fstab

We'll need to change any references of "md2" to "md3". Basically flip them around so that "md3" is the official root when /etc/fstab gets processed. I also like to change the prompt and system name to remind myself that I'm using the emergency system. Again, our primary goal is to be able to get a box back up and operational in the case where the primary root partition is hosed. Get it up quickly, then schedule some downtime to deal with it properly.

Now would also be a good time to tune the ext3 file system on your partitions.

The last thing we need to do is edit GRUB's configuration so that we can select our backup root OS from the selection menu.

# mkdir /mnt/boot
# mount /dev/md0 /mnt/boot
# vi /mnt/boot/grub/grub.conf

Things that we'll want to do here (you could also accomplish this by booting the server in normal mode and editing grub.conf there using a more comfortable text editor):

a) Change the timeout=5 value to timeout=15 (or 30 or 60). By default, CentOS doesn't give you very long to pick an alternate boot. I find 5 seconds to be too short of a window, especially on a unit where the storage controller takes a minute or two to scan and set up the drives.

b) Copy the latest "title" section and change "root=/dev/md2" to "root=/dev/md3". I always make the "EMERGENCY" boot option the 2nd one in the list.

# mkdir /mnt/backuproot
# mount /dev/md3 /mnt/backuproot
# vi /mnt/backuproot/etc/sysconfig/network

I like to change the hostname to have "-emergency" tacked onto the end. Which should make it fairly obvious that we are booting up in emergency mode using the backup root partition. I also edit root's .bash_profile to set PS1.

Okay, that was a lot of setup work just to prepare for implementing iSCSITarget (or any other kernel rebuild), but it's always worth it.

Final notes:

- When I test booted the emergency root partition, things didn't work as planned. So while my concept is sound, I may have screwed something up. I think it's an error with /etc/fstab in the emergency partition, so I'll troubleshoot that later.

- It's also possible that you'll need to do a GRUB install on all (3) of the primary mirror disks.

Step 2 - Downloading and compiling the iSCSITarget software

So far, I've found (2) links to be useful here. One is Moving on.... and the other is iSCSI Enterprise Target を CentOS5 にインストールする ("Installing iSCSI Enterprise Target on CentOS5", in Japanese). While the 2nd link is in Japanese, it shows the commands in English.

Head over to the iSCSI Enterprise Target page and download the latest tarball containing the source code. The current version is 0.4.15. If you're using Firefox in CentOS's Gnome shell, it will probably prompt you to open the file with the archive manager. I created a folder, /root/iscsitarget-0.4.15, and extracted the contents there.

You will also need to go into Applications -> Add/Remove Software and add the development tools and libraries to your system. (Mostly you just need gcc.)

As noted on jackshck's page, you will also need to install the following packages:

openssl-devel (I installed the x86_64 version)
kernel-devel (again, I'm using the x86_64 version)

Open up a terminal window and go to where you extracted the iscsitarget tarball (I put mine in /root/iscsitarget-0.4.15).

# ls -l /usr/src/kernels
(make note of the kernel folder)
# make KSRC=/usr/src/kernels/2.6.18-8.1.4.el5-x86_64/
# make KSRC=/usr/src/kernels/2.6.18-8.1.4.el5-x86_64/ install

Now we can start up the ietd daemon:

# /etc/init.d/iscsi-target start

And add it to our default runlevel (this is similar to the rc-update command in Gentoo Linux):

# chkconfig iscsi-target on

Step 3 - Creating a target

This is where we get into the nitty-gritty and where I need to take a break and do some research. The /etc/ietd.conf file already exists at this point, but only contains a commented out sample configuration.

Notes:

Dec 21 2007 - The comment about iSCSITarget software for Microsoft Windows really isn't on-topic. But I'll go ahead and list the link to it, but not as an HTML link. Pricing for the real version is currently $395 (Server) or $995 (Professional). And personally, there's no way that I'd recommend running a SAN on top of Microsoft Windows (even Server 2003, which is a nice product).


Thursday, May 10, 2007
Dealing with a failed Software RAID device
As part of my server setup, I like to make sure that plans are working as expected... which means intentionally breaking things like RAID sets.

In this particular case I have a triple-active RAID1 mirror set on the first 3 disks in the system (/dev/sda, /dev/sdb, /dev/sdc). In this RAID1 set, all 3 disks are active, with no hot-spare. I prefer this over a (2) active (1) hot-spare setup because it allows for up to 2 disks to fail before you lose data. And if I'm already dedicating a hot-spare spindle solely for the use of the RAID1 set, I may as well get to use it. The output of /proc/mdstat looks similar to the following (note that none of the slices are tagged with a "(S)"):

md2 : active raid1 sdc2[2] sdb2[1] sda2[0]
7911936 blocks [3/3] [UUU]


Each disk has quite a few md devices associated with it. In this particular case I have /dev/md0 up through /dev/md5 created. Probably one of the few downsides to Software RAID is that you end up with quite a few md devices to keep track of. But such is the price for just about the ultimate flexibility.

GRUB Note: In a mirrored setup, you must make sure to install GRUB to the MBR (master boot record) on all of the mirror disks. Some Linux distros don't do this on their own and you'll have to do it yourself. Otherwise, when the first disk in the mirror set fails, you'll find you're left with an unbootable system. This is also why I like to make copies of the MBR for each disk in the system (# dd if=/dev/sda of=dd-sda-mbr-date.img bs=512 count=1).

So, after making very good backups, I decided it was time to test whether I could pull a disk and survive. To make sure that I had taken care of the GRUB issue, I shut down the server and pulled the primary drive in the RAID set.

# cat /proc/mdstat
md2 : active raid1 sdc2[2] sdb2[1]
7911936 blocks [3/3] [_UU]


Ah good, mdadm is *not* happy here (as expected). It knows that one of the disks has failed in the array. So let's shut down and replace the failed drive with a blank one. In this case, I used a spare drive that I had lying around that had been previously wiped. (Or, with care, you could zero out the drive that you pulled.)

I recommend using "sfdisk" in dump mode to configure the new drive. So if your failed drive is "sda" and one of the good ones is "sdb", you could use:

# sfdisk -d /dev/sdb | sfdisk /dev/sda

After which, you can use the "mdadm" command to add the new slices to the existing RAID arrays.

# mdadm --add /dev/mdX /dev/sdYZ
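
With half a dozen md devices per disk, typing each --add by hand gets tedious. Since sfdisk gives the new disk the same partition numbers as the good one, the pairs can be derived from /proc/mdstat. The sketch below only prints the commands. The function name is hypothetical, and it assumes plain sdX-style names where the partition number directly follows the disk name.

```shell
#!/bin/sh
# readd_commands GOOD_DISK NEW_DISK
# Read mdstat-formatted text on stdin and print the "mdadm --add" commands
# that would give NEW_DISK the same array membership as GOOD_DISK.
readd_commands() {
    awk -v gooddisk="$1" -v newdisk="$2" '/^md[0-9]+ :/ {
        for (i = 3; i <= NF; i++) {
            member = $i
            sub(/\[.*$/, "", member)              # strip "[1]" and any "(F)"
            if (index(member, gooddisk) == 1) {
                part = substr(member, length(gooddisk) + 1)
                if (part ~ /^[0-9]+$/)
                    printf "mdadm --add /dev/%s /dev/%s%s\n", $1, newdisk, part
            }
        }
    }'
}

# Typical use: review the output first, then pipe it to sh as root.
#   readd_commands sdb sda < /proc/mdstat
```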

Last, don't forget to install GRUB to the MBR on the new disk.


Monday, May 07, 2007
Brute force disaster recovery for CentOS5
Today's trick is moving a CentOS5 system from an old set of disks over to a new set of disks. Along the way, I'll create an image of the system to allow me to restore it later on.

The CentOS5 system is a fresh install running RAID-1 across (3) disks using Linux Software RAID (mdadm). There are (4) primary partitions (boot, root, swap, LVM) with no data on the LVM partition.

(Why 3 active disks? The normal setup for this server was RAID-1 across 2 disks with a hot-spare. Rather than have a risky window of time where one disk has failed and the hot-spare is synchronizing with the remaining good disk, I prefer to have all 3 disks running. That way, when a disk dies, we still have 2 disks in action. The mdadm / Software RAID doesn't seem to care and it doesn't seem to affect performance at all.)

Because this is RAID-1, capturing the disk layout and migrating over to the new disks will be very easy. It's also a very fresh install, so I'm just going to grab the disk contents using "dd" (most of the partition's sectors are still zeroed out from the original install). Once I've backed up the (3) partitions on the first drive, I'm going to pull the (3) drives and replace them with the new ones.

I'll get the machine up and running with the first replacement drive, then configure the blank 2nd and 3rd drives and add them to the RAID set. That is, if mdadm doesn't beat me to the punch and start the sync on the 2nd/3rd disks automatically.

If things go bad, I can always drop the original disks back in the unit and power it back up. I plan on keeping them around for a few days, just in case. I'll have to recreate the LVM volumes, but there aren't any yet (just a PV and a VG).

One advantage of pulling the old drives out completely and rebuilding using fresh drives - I'll end up with a tested disaster recovery process.

Now for the nitty gritty. I'm using a USB pocket drive formatted with ext3 for the rescue work. Make sure that you plug this in before booting the rescue CD.

  1. Log in to the system and power it down.
  2. Boot the CentOS5 install DVD
  3. At the "boot:" prompt, enter "linux rescue"
  4. Work your way through the startup dialogs
  5. When prompted whether to mount your Linux install, choose "Skip"

This should give you a command shell with useful tools. So let's poke around and check on our system.

  1. Looking at "cat /proc/mdstat" shows that while the mdadm software is running, it has not assembled any RAID arrays.
  2. The "fdisk -l" command shows us that the (3) existing disks are named sda, sdb, sdc. Each has (4) partitions (boot, root, swap, LVM).
  3. My USB drive showed up as "/dev/sdd" so I'll create a "/backup" folder and mount it using "mkdir /backup ; mount /dev/sdd1 /backup ; df -h"

Naturally, we should create a sub-folder under /backup for each machine and possibly create another folder underneath it using today's date. We should grab information about the current disk layout and store it in a text file (fdisk.txt).

  1. # cd /backup ; mkdir machinename ; cd machinename
  2. # mkdir todaysdate ; cd todaysdate
  3. # fdisk -l > fdisk.txt
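
The folder setup above can be wrapped in a small helper so that the date is filled in automatically. This is only a sketch: "make_backup_dir" is a hypothetical name, and the YYYY-MM-DD date format is an assumption (the original just says "todaysdate").

```shell
# make_backup_dir BASE MACHINE
# Creates BASE/MACHINE/YYYY-MM-DD (if needed) and prints the resulting path.
make_backup_dir() {
    dir="$1/$2/$(date +%Y-%m-%d)"
    mkdir -p "$dir" && echo "$dir"
}
```

For example, 'cd "$(make_backup_dir /backup machinename)" && fdisk -l > fdisk.txt' would land the layout dump in today's folder.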

Now to grab the boot loader and image the two critical partitions (boot and root). We'll grab the boot loader off of all (3) drives because it's so small (and it may not be properly synchronized).

  1. dd if=/dev/sda bs=512 count=1 of=machinename-date-sda.mbr
  2. dd if=/dev/sdb bs=512 count=1 of=machinename-date-sdb.mbr
  3. dd if=/dev/sdc bs=512 count=1 of=machinename-date-sdc.mbr
  4. dd if=/dev/sda1 | gzip > machinename-date-ddcopy-sda1.img.gz
  5. dd if=/dev/sda2 | gzip > machinename-date-ddcopy-sda2.img.gz

Total disk space for my system was around 1.75GB worth of compressed files (8GB root, 250MB boot). You could also use bzip2 if you need more compression. Unfortunately, the CentOS5 DVD does not include the "split" command, which could cause issues if you're trying to write to a filesystem that can't handle files over 2GB in size.

Now you should shut the box back down, burn those files to DVD-R, install the new (blank) disks, and boot from the install DVD again. Again, mount the drive that holds the rescue image files to a suitable path.

  1. dd of=/dev/sda bs=512 count=1 if=machinename-date-sda.mbr
  2. fdisk /dev/sda (fix the last partition)
  3. dd if=/dev/sda bs=512 count=1 of=/dev/sdb
  4. dd if=/dev/sda bs=512 count=1 of=/dev/sdc

That will restore the MBR and partition table from the old drive to the new one. If your new drive has a different size, then the last partition will be incorrectly sized for the disk. Fire up "fdisk" and delete / recreate the last partition on the disk.

Restore the two partition images:

  1. # gzip -dc machinename-date-ddcopy-sda1.img.gz | dd of=/dev/sda1
  2. # gzip -dc machinename-date-ddcopy-sda2.img.gz | dd of=/dev/sda2
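
The image and restore steps can be sketched as a pair of helper functions. These are illustrative only: the function names and the 64KB block size are assumptions, not something taken from the original session. The one addition is a "gzip -t" pass, which checks that the compressed file is internally consistent (it does not prove that dd read the whole partition).

```shell
# image_partition SRC DEST
# Stream SRC (a partition such as /dev/sda1) through gzip into DEST,
# then ask gzip to test the finished archive.
image_partition() {
    dd if="$1" bs=65536 2>/dev/null | gzip > "$2" || return 1
    gzip -t "$2"
}

# restore_partition IMAGE DEST
# Decompress IMAGE and write it back onto DEST (this overwrites DEST).
restore_partition() {
    gzip -dc "$1" | dd of="$2" bs=65536 2>/dev/null
}
```

Usage would look like "image_partition /dev/sda1 machinename-date-ddcopy-sda1.img.gz" before the swap and "restore_partition machinename-date-ddcopy-sda1.img.gz /dev/sda1" afterwards. Double-check the target device name before restoring, since dd will not ask.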

At this point, we should be able to boot the system on the primary drive and have Software RAID come up in degraded mode for the arrays. Things that will need to be done once the unit boots:

  1. Tell mdadm about the 2nd (and 3rd) disks and tell it to bring those partitions into the arrays and synchronize them.
  2. Create a new swap area
  3. Recreate the LVM physical volume (PV) and volume group (VG)
  4. Restore any data from the LVM area (we had none in this example)

Getting the Software RAID back up and happy is the trickiest of the steps.

  1. Log in as root, open up a terminal window
  2. # cat /proc/mdstat
  3. # fdisk -l
  4. Notice that the swap area on our system is sda3, sdb3, sdc3 and will need to be loaded as /dev/md1.
  5. # mdadm --create /dev/md1 -v --level=raid1 --raid-devices=3 /dev/sda3 /dev/sdb3 /dev/sdc3
  6. # mkswap /dev/md1 ; swapon /dev/md1
  7. Now we're ready to recreate the LVM area
  8. # mknod /dev/md3 b 9 3
  9. # mdadm --create /dev/md3 -v --level=raid1 --raid-devices=3 /dev/sda4 /dev/sdb4 /dev/sdc4
  10. # pvcreate /dev/md3 ; vgcreate vg /dev/md3
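  11. Finally, we should add the 2nd and 3rd drives to md0 and md2. Only the md0 commands are shown here; md2 is handled the same way using its matching partitions.
  12. # mdadm --add /dev/md0 /dev/sdc1
  13. # mdadm --add /dev/md0 /dev/sdb1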

Note: If your triple mirror RAID array puts the additional disks in as spares, make sure that (a) you have grown the number of raid devices to 3 for the RAID1 set and (b) there are no other arrays synchronizing at the same time. It's also best to add the elements one at a time, rather than adding both at the same time. I'm not sure if it's a bug in mdadm or just the way it works, but it took me two tries to get my triple mirror back up with all disks marked as "active" instead of (2) active and (1) hot-spare.
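
The advice in that note (grow first, then add one element at a time) can be written down as a dry-run plan. This is a sketch under assumptions: "plan_mirror_rebuild" is a hypothetical helper that only prints commands, and it assumes your mdadm supports "--wait" to block until a resync finishes. If the array already reports 3 raid devices, the "--grow" line is unnecessary.

```shell
# plan_mirror_rebuild MD COUNT PART...
# Print (do not run) the commands to grow MD to COUNT active devices and
# then add each partition, waiting for the resync before the next add.
plan_mirror_rebuild() {
    md=$1
    count=$2
    shift 2
    echo "mdadm --grow $md --raid-devices=$count"
    for part in "$@"; do
        echo "mdadm --add $md $part"
        echo "mdadm --wait $md"
    done
}
```

Running "plan_mirror_rebuild /dev/md0 3 /dev/sdb1 /dev/sdc1" prints five commands that can be reviewed and then pasted into a root shell. Keeping it as a printed plan avoids touching the arrays by accident.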


Tuesday, March 06, 2007
My new preferred disk layout for servers
With age comes wisdom? After working with SoftwareRAID and Linux servers for a while, I've changed my preferred disk system design and layout.

RAID

Under the old system, I was running a (2) disk RAID1 (mirror) with a hot-spare disk set up and ready for action. But if you're going to have a hot-spare dedicated to the RAID1 array, why not use it as an active array member? That way, if a disk fails, you still have two good disks. This matters because, when a RAID element fails, the load from the rebuild process can kill one of the remaining disks in the array.

Is it a likely scenario? Probably not. But Linux's Software RAID handles a triple-active RAID1 mirror without any slowdown, so there's not much reason *not* to implement it that way. Plus it's a useful trick to know for situations where you really *do* need to be that paranoid.

(I'm not sure whether any hardware RAID cards provide for a triple-active mirroring RAID1 configuration.)

Partitions

I've also simplified how many partitions I like to have on the disk. My current disk layouts typically look like:

/dev/sdX1 - /dev/md0 - 250MB - /boot
/dev/sdX2 - /dev/md1 - 12GB - / (primary root)
/dev/sdX3 - /dev/md2 - 12GB - / (backup root)
/dev/sdX5 - /dev/md3 - 32GB - /var/log
/dev/sdX6 - /dev/md4 - 2GB - swap
/dev/sdX7 - /dev/md5 - 64GB - /backup/system
/dev/sdX8 - /dev/md6 - (remainder) - LVM area

During normal operations, we boot and run /dev/md1 as our / (root) partition. The /dev/md2 partition is kept offline and is never mounted. Periodically, after validating that the server is in good health, we copy the contents of /dev/md1 to /dev/md2 and make adjustments to /etc/fstab and the hostname. This requires some server downtime (long enough to set up the 2nd root partition).

In the case where the primary OS is hosed, we can boot from the backup OS partition and get back up and running quickly. That gives us the luxury to continue operations until we can schedule downtime to fix the primary OS partition.

Notice that I've broken /var/log out to its own partition. I do this so that an overflowing set of logs won't take the server box down. Plus, by putting the log files in their own physical partition, it's easy to use a boot CD or USB key to gain access to the logs in case of severe issues.

The other physical partition that I consider necessary is /backup/system. This partition is used to hold images of the boot and root partitions, along with information about the partition layout and images of the MBRs. Basically, it's used to store disaster recovery backups. You should not have this partition mounted during normal operations. Taking the contents of this partition offsite is also a good idea. A basic text file of how the backups were created along with information for how to restore these backups is recommended.

Summary

This setup tries to walk the fine line between keeping things simple and having enough flexibility to deal with a large set of potential failures: anything from a two-disk failure, to the primary OS being hosed, to both OS partitions having problems, all the way up to boot records or the /boot partition being killed.


Tuesday, August 29, 2006
Creating a 4-disk RAID10 using mdadm
Since I can't seem to find instructions on how to do this (yet)...

I'm going to create a 4-disk RAID10 array using Linux Software RAID and mdadm. The old way is to create individual RAID1 volumes and then stripe a RAID0 volume over the RAID1 arrays. That requires creating extra /dev/mdN nodes which can be confusing to the admin that follows you.

1) Create the /dev/mdN node for the new RAID10 array. In my case, I already have /dev/md0 to /dev/md4 so I'm going to create /dev/md5 (note that "5" appears twice in the command).

# mknod /dev/md5 b 9 5

2) Use fdisk on the (4) drives, create a single primary partition of type "fd" (Linux raid autodetect). Note that I have *nothing* on these brand new drives, so I don't care if it wipes out data.

3) Create the mdadm RAID set using 4 devices and a level of RAID10.

# mdadm --create /dev/md5 -v --raid-devices=4 --chunk=32 --level=raid10 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

Which will result in the following output:

mdadm: layout defaults to n1
mdadm: size set to 732571904K
mdadm: array /dev/md5 started.

# cat /proc/mdstat

Personalities : [raid1] [raid10]
md5 : active raid10 sdf1[3] sde1[2] sdd1[1] sdc1[0]
1465143808 blocks 32K chunks 2 near-copies [4/4] [UUUU]
[>....................] resync = 0.2% (3058848/1465143808) finish=159.3min speed=152942K/sec


As you can see, we get around 150MB/s from the RAID10 array. The regular RAID1 arrays only have about 75MB/s throughput (same as a single 750GB drive).
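
The numbers in that output can be cross-checked with a little shell arithmetic. This is a sketch with hypothetical function names. It assumes the simple case shown above, where the usable size is the per-device size times the number of devices divided by the number of copies, and where the resync speed stays constant.

```shell
# raid10_capacity_kib DEVICES SIZE_PER_DEVICE_KIB COPIES
# Usable capacity when every block is stored COPIES times.
raid10_capacity_kib() {
    echo $(( $1 * $2 / $3 ))
}

# resync_seconds_left TOTAL_KIB DONE_KIB SPEED_KIB_PER_SEC
# Whole seconds remaining at the current resync speed.
resync_seconds_left() {
    echo $(( ($1 - $2) / $3 ))
}
```

With the values from the transcript, "raid10_capacity_kib 4 732571904 2" gives 1465143808, which matches the block count in /proc/mdstat, and "resync_seconds_left 1465143808 3058848 152942" gives 9559 seconds, which is roughly the 159.3 minutes that mdstat reports.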

A final note. My mdadm.conf file is completely empty on this system. That works well for simple systems, but you'll want to create a configuration file in more complex setups.

Updates:

Most of the arrays that I've built have been based on 7200 RPM SATA drives. For small arrays (4 disks w/ a hot spare), you can often find enough ports on the motherboard. For larger arrays, you'll need to look for PCIe SATA controllers. I've used Promise and 3ware SATA RAID cards. Basically, any card that allows the SATA drives to be seen and is supported directly in the Linux kernel is a good bet (going forward, we're going to switch to Areca at work).


Sunday, June 11, 2006
Failing hard drive in a Software RAID
So today's fun is that I have a drive that is failing in my 566MHz Celeron server. This is a small server with (3) 120GB hard drives.

hda - 120GB (primary drive, 4 partitions)
hdc - CD-ROM
hde - 120GB (second drive in the RAID1 sets, 4 partitions)
hdg - 120GB (backup drive)

During the rebuild of md3 (which is hda4+hde4) I'm getting constant aborts due to a bad block (or blocks) on hda.

# tail -n 500 /var/log/messages
Jun 10 23:17:16 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 10 23:17:16 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 10 23:17:16 coppermine ide: failed opcode was: unknown
Jun 10 23:17:16 coppermine end_request: I/O error, dev hda, sector 100789712
Jun 10 23:17:20 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 10 23:17:20 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789723, sector=100789720
Jun 10 23:17:20 coppermine ide: failed opcode was: unknown
Jun 10 23:17:20 coppermine end_request: I/O error, dev hda, sector 100789720
Jun 10 23:17:20 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 10 23:17:20 coppermine md: md3: sync done.
Jun 10 23:17:20 coppermine RAID1 conf printout:
Jun 10 23:17:20 coppermine --- wd:1 rd:2
Jun 10 23:17:20 coppermine disk 0, wo:0, o:1, dev:hda4
Jun 10 23:17:20 coppermine disk 1, wo:1, o:1, dev:hde4
Jun 10 23:17:20 coppermine RAID1 conf printout:
Jun 10 23:17:20 coppermine --- wd:1 rd:2
Jun 10 23:17:20 coppermine disk 0, wo:0, o:1, dev:hda4
Jun 10 23:17:20 coppermine RAID1 conf printout:
Jun 10 23:17:20 coppermine --- wd:1 rd:2
Jun 10 23:17:20 coppermine disk 0, wo:0, o:1, dev:hda4
Jun 10 23:17:20 coppermine disk 1, wo:1, o:1, dev:hde4
Jun 10 23:17:20 coppermine md: syncing RAID array md3
Jun 10 23:17:20 coppermine md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Jun 10 23:17:20 coppermine md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Jun 10 23:17:20 coppermine md: using 128k window, over a total of 115926464 blocks.


And mdadm will continue to attempt to rebuild the array until the end of time. Which is rather pointless. So the second step is to more closely examine /dev/hda and see whether we're seeing the same block number.

# grep 'hda:' /var/log/messages
May 29 08:43:06 coppermine hda: Maxtor 4R120L0, ATA DISK drive
May 29 08:43:06 coppermine hda: max request size: 128KiB
May 29 08:43:06 coppermine hda: 240121728 sectors (122942 MB) w/2048KiB Cache, CHS=65535/16/63
May 29 08:43:06 coppermine hda: cache flushes supported
May 29 08:43:06 coppermine hda: hda1 hda2 hda3 hda4
Jun 8 22:32:02 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 8 22:32:02 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=80342494, sector=80342480
Jun 8 22:32:04 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 8 22:32:04 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=80342494, sector=80342488
Jun 8 22:32:05 coppermine raid1: hda: unrecoverable I/O read error for block 72089984
Jun 9 05:10:36 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 05:10:36 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 9 05:10:39 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 05:10:39 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789722, sector=100789720
Jun 9 05:10:39 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 9 08:26:40 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 08:26:40 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=54393160, sector=54393152
Jun 9 08:26:42 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 08:26:42 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=54393160, sector=54393160
Jun 9 08:26:42 coppermine raid1: hda: unrecoverable I/O read error for block 46140544
Jun 9 13:13:53 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 13:13:53 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 9 13:13:55 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 9 13:13:55 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789722, sector=100789720
Jun 9 13:13:55 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 10 18:30:21 coppermine hda: Maxtor 4R120L0, ATA DISK drive
Jun 10 18:30:21 coppermine hda: max request size: 128KiB
Jun 10 18:30:21 coppermine hda: 240121728 sectors (122942 MB) w/2048KiB Cache, CHS=65535/16/63
Jun 10 18:30:21 coppermine hda: cache flushes supported
Jun 10 18:30:21 coppermine hda: hda1 hda2 hda3 hda4
Jun 10 23:17:16 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 10 23:17:16 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 10 23:17:20 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 10 23:17:20 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789723, sector=100789720
Jun 10 23:17:20 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 11 04:08:06 coppermine hda: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jun 11 04:08:06 coppermine hda: task_in_intr: error=0x40 { UncorrectableError }, LBAsect=100789712, sector=100789712
Jun 11 04:08:08 coppermine raid1: hda: unrecoverable I/O read error for block 92537216


This shows me that I have a drive that almost always fails at the same block number each time. Another grep of the log files makes this even more clear:

# grep 'unrecoverable' /var/log/messages
Jun 8 22:32:05 coppermine raid1: hda: unrecoverable I/O read error for block 72089984
Jun 9 05:10:39 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 9 08:26:42 coppermine raid1: hda: unrecoverable I/O read error for block 46140544
Jun 9 13:13:55 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 10 23:17:20 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
Jun 11 04:08:08 coppermine raid1: hda: unrecoverable I/O read error for block 92537216
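
The same log lines can be tallied per block, and the raw sector numbers can be related to the partition they fall in. Both helpers below are sketches with hypothetical names. The second one assumes the geometry that fdisk reports for this drive further down (16 heads, 63 sectors/track, hda4 starting at cylinder 8188) and assumes that fdisk's start cylinder is 1-based.

```shell
# count_failed_blocks: read syslog text on stdin and print "COUNT BLOCK"
# for each block named in a raid1 unrecoverable read error, most frequent first.
count_failed_blocks() {
    awk '/unrecoverable I\/O read error for block/ { count[$NF]++ }
         END { for (b in count) print count[b], b }' | sort -rn
}

# partition_offset SECTOR START_CYL HEADS SECTORS_PER_TRACK
# Convert an absolute disk sector into an offset from the start of a
# partition that begins at cylinder START_CYL.
partition_offset() {
    echo $(( $1 - ($2 - 1) * $3 * $4 ))
}
```

Feeding the six lines above through "count_failed_blocks" puts block 92537216 at the top with 4 hits. "partition_offset 100789712 8188 16 63" gives 92537216 and "partition_offset 80342480 8188 16 63" gives 72089984, so for those two errors the raid1 block number lines up with the failing sector's offset inside hda4.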


So the first step (after backing up the system) is to stop the software RAID from attempting to constantly rebuild array "md3". You can do this with the mdadm tool's "manage mode" commands.

Well, maybe not. I've done a lot of digging in Google, but I can't figure out how to force mdadm to stop a sync that is in progress. So, I'm booting back to the original 2005.1 Gentoo boot CD so that I can manually control the process.

Note that an excellent resource is:
LVM2 and Software RAID in Linux (May 2005)

livecd ~ # fdisk -l

Disk /dev/hda: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 1 249 125464+ fd Linux raid autodetect
/dev/hda2 250 4218 2000376 fd Linux raid autodetect
/dev/hda3 4219 8187 2000376 fd Linux raid autodetect
/dev/hda4 8188 238200 115926552 fd Linux raid autodetect

Disk /dev/hde: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hde1 * 1 249 125464+ fd Linux raid autodetect
/dev/hde2 250 4218 2000376 fd Linux raid autodetect
/dev/hde3 4219 8187 2000376 fd Linux raid autodetect
/dev/hde4 8188 238200 115926552 fd Linux raid autodetect

Disk /dev/hdg: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hdg1 1 238000 119951968+ 8e Linux LVM

livecd ~ # modprobe md
livecd ~ # modprobe raid1
livecd ~ # ls -l /dev/md*
livecd ~ # for i in 0 1 2 3; do mknod /dev/md$i b 9 $i; done
livecd ~ # ls -l /dev/md*
brw-r--r-- 1 root root 9, 0 Jun 12 00:01 /dev/md0
brw-r--r-- 1 root root 9, 1 Jun 12 00:01 /dev/md1
brw-r--r-- 1 root root 9, 2 Jun 12 00:01 /dev/md2
brw-r--r-- 1 root root 9, 3 Jun 12 00:01 /dev/md3
livecd ~ # mdadm --assemble /dev/md0 /dev/hda1 /dev/hde1
mdadm: /dev/md0 has been started with 2 drives.
livecd ~ # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hda1[0] hde1[1]
125376 blocks [2/2] [UU]

unused devices: <none>
livecd ~ #


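So that starts up the /boot partition. Now I can check it for errors using e2fsck. The "-c" option runs a read-only scan for bad blocks and adds any it finds to the filesystem's bad-block inode, "-C" turns on progress reporting (the man page documents it as taking a file descriptor argument, as in "-C 0"), "-y" answers 'yes' to any questions, and "-v" gives verbose output.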

livecd ~ # e2fsck -c -C -y -v /dev/md0
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 376
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****

40 inodes used (0%)
3 non-contiguous inodes (7.5%)
# of inodes with ind/dind/tind blocks: 12/6/0
12593 blocks used (10%)
0 bad blocks
0 large files

26 regular files
3 directories
0 character device files
0 block device files
0 fifos
0 links
2 symbolic links (2 fast symbolic links)
0 sockets
--------
31 files
livecd ~ #


Next, I assemble the RAID1 set for the root volume.

livecd ~ # mdadm --assemble /dev/md2 /dev/hda3 /dev/hde3
mdadm: /dev/md2 has been started with 2 drives.
livecd ~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda3[0] hde3[1]
2000256 blocks [2/2] [UU]

md1 : active raid1 hda2[0] hde2[1]
2000256 blocks [2/2] [UU]

md0 : active raid1 hda1[0] hde1[1]
125376 blocks [2/2] [UU]

unused devices: <none>
livecd ~ # e2fsck -c -C -y -v /dev/md2
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 064
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/md2: ***** FILE SYSTEM WAS MODIFIED *****

6434 inodes used (2%)
16 non-contiguous inodes (0.2%)
# of inodes with ind/dind/tind blocks: 75/2/0
390601 blocks used (78%)
0 bad blocks
0 large files

928 regular files
153 directories
1055 character device files
4025 block device files
0 fifos
0 links
264 symbolic links (264 fast symbolic links)
0 sockets
--------
6425 files
livecd ~ #


The rest of the system is more complex: LVM2 volumes on top of software RAID.

livecd ~ # modprobe dm-mod
livecd ~ # pvscan
PV /dev/hdg1 VG vgbackup lvm2 [114.39 GB / 82.39 GB free]
Total: 1 [114.39 GB] / in use: 1 [114.39 GB] / in no VG: 0 [0 ]
livecd ~ # vgscan
Reading all physical volumes. This may take a while...
Found volume group "vgbackup" using metadata type lvm2
livecd ~ # lvscan
inactive '/dev/vgbackup/backup' [32.00 GB] inherit
livecd ~ # lvchange -a y /dev/vgbackup/backup
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvscan
ACTIVE '/dev/vgbackup/backup' [32.00 GB] inherit
livecd ~ # e2fsck -c -C -y -v /dev/vgbackup/backup
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 608
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgbackup/backup: ***** FILE SYSTEM WAS MODIFIED *****

70 inodes used (0%)
14 non-contiguous inodes (20.0%)
# of inodes with ind/dind/tind blocks: 35/18/0
954693 blocks used (11%)
0 bad blocks
0 large files

55 regular files
6 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
61 files
livecd ~ #


So far so good. But most of the errors are in /dev/md3. So I'm going to assemble /dev/md3 using just one of the drives (/dev/hde4).

livecd ~ # mdadm -v --assemble /dev/md3 /dev/hde4
mdadm: looking for devices for /dev/md3
mdadm: /dev/hde4 is identified as a member of /dev/md3, slot 2.
mdadm: added /dev/hde4 to /dev/md3 as 2
mdadm: /dev/md3 assembled from 0 drives and 1 spare - not enough to start the array.
livecd ~ # cat /proc/mdstat
Personalities : [raid1]
md3 : inactive hde4[2]
115926464 blocks
md2 : active raid1 hda3[0] hde3[1]
2000256 blocks [2/2] [UU]

md1 : active raid1 hda2[0] hde2[1]
2000256 blocks [2/2] [UU]

md0 : active raid1 hda1[0] hde1[1]
125376 blocks [2/2] [UU]

unused devices: <none>


Unfortunately, mdadm is refusing to start /dev/md3 using just /dev/hde4. So we have to force it:

livecd ~ # mdadm --create /dev/md3 --level 1 --force --raid-disks=1 /dev/hde4
mdadm: Cannot open /dev/hde4: Device or resource busy
mdadm: create aborted
livecd ~ # mdadm --stop /dev/md3
livecd ~ # mdadm --create /dev/md3 --level 1 --force --raid-disks=1 /dev/hde4
mdadm: /dev/hde4 appears to be part of a raid array:
level=1 devices=2 ctime=Sat Oct 22 20:51:12 2005
Continue creating array? y
mdadm: array /dev/md3 started.
livecd ~ # cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 hde4[0]
115926464 blocks [1/1] [U]

md2 : active raid1 hda3[0] hde3[1]
2000256 blocks [2/2] [UU]

md1 : active raid1 hda2[0] hde2[1]
2000256 blocks [2/2] [UU]

md0 : active raid1 hda1[0] hde1[1]
125376 blocks [2/2] [UU]

unused devices: <none>
livecd ~ #


Now I can scan for LVM2 volumes on the md3 array.

livecd ~ # pvscan
PV /dev/md3 VG vgmirror lvm2 [110.55 GB / 52.55 GB free]
PV /dev/hdg1 VG vgbackup lvm2 [114.39 GB / 82.39 GB free]
Total: 2 [224.95 GB] / in use: 2 [224.95 GB] / in no VG: 0 [0 ]
livecd ~ # vgscan
Reading all physical volumes. This may take a while...
Found volume group "vgmirror" using metadata type lvm2
Found volume group "vgbackup" using metadata type lvm2
livecd ~ # lvscan
inactive '/dev/vgmirror/tmp' [4.00 GB] inherit
inactive '/dev/vgmirror/vartmp' [4.00 GB] inherit
inactive '/dev/vgmirror/opt' [2.00 GB] inherit
inactive '/dev/vgmirror/usr' [4.00 GB] inherit
inactive '/dev/vgmirror/var' [4.00 GB] inherit
inactive '/dev/vgmirror/home' [4.00 GB] inherit
inactive '/dev/vgmirror/pgsqldata' [16.00 GB] inherit
inactive '/dev/vgmirror/www' [4.00 GB] inherit
inactive '/dev/vgmirror/svn' [16.00 GB] inherit
ACTIVE '/dev/vgbackup/backup' [32.00 GB] inherit
livecd ~ # lvchange -a y /dev/vgmirror/tmp
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/vartmp
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/opt
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/usr
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/var
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/home
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/pgsqldata
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/www
/dev/cdrom: open failed: Read-only file system
livecd ~ # lvchange -a y /dev/vgmirror/svn
/dev/cdrom: open failed: Read-only file system
livecd ~ #
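
Typing one lvchange per volume gets tedious. As a sketch (the helper name "inactive_lvs" is hypothetical), the inactive volumes can be pulled out of the lvscan output and fed to a loop. It assumes lvscan keeps the format shown above, with the state in the first column and the quoted device path in the second.

```shell
# inactive_lvs: read lvscan output on stdin and print the device path of
# every logical volume whose state column reads "inactive".
inactive_lvs() {
    awk '$1 == "inactive" { print $2 }' | tr -d "'"
}
```

A loop such as 'lvscan | inactive_lvs | while read -r lv; do lvchange -a y "$lv"; done' would then activate them in turn. Alternatively, "vgchange -a y vgmirror" activates every logical volume in the volume group in one step.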


Now I can check all of the LVM2 file systems:

livecd ~ # lvscan
ACTIVE '/dev/vgmirror/tmp' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/vartmp' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/opt' [2.00 GB] inherit
ACTIVE '/dev/vgmirror/usr' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/var' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/home' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/pgsqldata' [16.00 GB] inherit
ACTIVE '/dev/vgmirror/www' [4.00 GB] inherit
ACTIVE '/dev/vgmirror/svn' [16.00 GB] inherit
ACTIVE '/dev/vgbackup/backup' [32.00 GB] inherit
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/tmp
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/tmp: ***** FILE SYSTEM WAS MODIFIED *****

15 inodes used (0%)
0 non-contiguous inodes (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
16472 blocks used (1%)
0 bad blocks
0 large files

2 regular files
4 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
6 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/vartmp
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/vartmp: ***** FILE SYSTEM WAS MODIFIED *****

4771 inodes used (0%)
524 non-contiguous inodes (11.0%)
# of inodes with ind/dind/tind blocks: 285/1/0
52582 blocks used (5%)
0 bad blocks
0 large files

4480 regular files
282 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
4762 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/opt
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 288
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/opt: ***** FILE SYSTEM WAS MODIFIED *****

12 inodes used (0%)
0 non-contiguous inodes (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
16443 blocks used (3%)
0 bad blocks
0 large files

1 regular file
2 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
3 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/usr
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/usr: ***** FILE SYSTEM WAS MODIFIED *****

202520 inodes used (38%)
3582 non-contiguous inodes (1.8%)
# of inodes with ind/dind/tind blocks: 2317/17/0
439977 blocks used (41%)
0 bad blocks
0 large files

172474 regular files
26704 directories
0 character device files
0 block device files
0 fifos
2487 links
3333 symbolic links (3248 fast symbolic links)
0 sockets
--------
204998 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/var
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/var: ***** FILE SYSTEM WAS MODIFIED *****

30344 inodes used (5%)
181 non-contiguous inodes (0.6%)
# of inodes with ind/dind/tind blocks: 54/1/0
100055 blocks used (9%)
0 bad blocks
0 large files

29856 regular files
474 directories
0 character device files
0 block device files
0 fifos
0 links
3 symbolic links (3 fast symbolic links)
2 sockets
--------
30335 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/home
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/home: ***** FILE SYSTEM WAS MODIFIED *****

58 inodes used (0%)
0 non-contiguous inodes (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
24717 blocks used (2%)
0 bad blocks
0 large files

33 regular files
15 directories
0 character device files
0 block device files
0 fifos
0 links
1 symbolic link (1 fast symbolic link)
0 sockets
--------
49 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/pgsqldata
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 304
Pass 1: Checking inodes, blocks, and sizes
Inode 1802356, i_blocks is 26312, should be 23952. Fix? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(3625600--3625704) -(3625710--3625711) -(3625716--3625719) -(3625724--3625907)
Fix? yes

Free blocks count wrong for group #110 (6797, counted=7092).
Fix? yes

Free blocks count wrong (4056868, counted=4057163).
Fix? yes


/dev/vgmirror/pgsqldata: ***** FILE SYSTEM WAS MODIFIED *****

1003 inodes used (0%)
90 non-contiguous inodes (9.0%)
# of inodes with ind/dind/tind blocks: 167/19/0
137141 blocks used (3%)
0 bad blocks
0 large files

964 regular files
30 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
994 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/www
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 576
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/www: ***** FILE SYSTEM WAS MODIFIED *****

478 inodes used (0%)
0 non-contiguous inodes (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
25147 blocks used (2%)
0 bad blocks
0 large files

455 regular files
14 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
469 files
livecd ~ # e2fsck -c -C -y -v /dev/vgmirror/svn
e2fsck 1.37 (21-Mar-2005)
Checking for bad blocks (read-only test): done 304
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vgmirror/svn: ***** FILE SYSTEM WAS MODIFIED *****

128 inodes used (0%)
17 non-contiguous inodes (13.3%)
# of inodes with ind/dind/tind blocks: 15/10/0
146674 blocks used (3%)
0 bad blocks
0 large files

98 regular files
21 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
119 files
livecd ~ #


So all of the filesystems on /dev/hde4 check out okay. Now I want to take a closer look at the drives to verify that they have no bad blocks. The best way to do this is with a read-only disk test using badblocks.

# badblocks -sv /dev/hdg1
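
Since the same read-only scan gets repeated for each drive, it can be wrapped in a small loop. This is only a sketch: the function name "scan_drives" and the DRY_RUN switch are my own inventions, and it defaults to printing the commands rather than running them.

```shell
#!/bin/sh
# Sketch: print (or run) a read-only badblocks scan for every device given.
# DRY_RUN is a hypothetical switch; it defaults to 1 so that nothing touches
# a disk unless you explicitly set DRY_RUN=0.
scan_drives() {
    for dev in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "badblocks -sv $dev"
        else
            badblocks -sv "$dev"
        fi
    done
}

# Example (prints the commands only):
#   scan_drives /dev/hda1 /dev/hde1 /dev/hdg1
```

Without the -w or -n flags badblocks only reads from the device, which is why it is reasonable to run against partitions that still hold data.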

From the looks of my testing on the various drives, hda is the problem drive with a few surface errors. So I'm going to wholly replace drive hda with a fresh 120GB drive.

So I've moved the cables from hda to connect with hde, and I've put a new 120GB hard drive into the hde position. Since I set up the box properly way back when (installing grub to both disks), things are working very well and the machine booted right back up.

First I copy the partition layout from hda to hde, then I copy the boot sector from hda to hde.

coppermine thomas # sfdisk -d /dev/hda | sfdisk /dev/hde
Checking that no-one is using this disk right now ...
OK

Disk /dev/hde: 238216 cylinders, 16 heads, 63 sectors/track
Old situation:
Warning: The partition table looks like it was made
for C/H/S=*/255/63 (instead of 238216/16/63).
For this listing I'll assume that geometry.
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

Device Boot Start End #cyls #blocks Id System
/dev/hde1 0+ 14945 14946- 120053713+ 6 FAT16
/dev/hde2 0 - 0 0 0 Empty
/dev/hde3 0 - 0 0 0 Empty
/dev/hde4 0 - 0 0 0 Empty
New situation:
Units = sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/hde1 * 63 250991 250929 fd Linux raid autodetect
/dev/hde2 250992 4251743 4000752 fd Linux raid autodetect
/dev/hde3 4251744 8252495 4000752 fd Linux raid autodetect
/dev/hde4 8252496 240105599 231853104 fd Linux raid autodetect
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
coppermine thomas # dd if=/dev/hda bs=512 count=1 of=/dev/hde
1+0 records in
1+0 records out
coppermine thomas # fdisk -l /dev/hda

Disk /dev/hda: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 1 249 125464+ fd Linux raid autodetect
/dev/hda2 250 4218 2000376 fd Linux raid autodetect
/dev/hda3 4219 8187 2000376 fd Linux raid autodetect
/dev/hda4 8188 238200 115926552 fd Linux raid autodetect
coppermine thomas # fdisk -l /dev/hde

Disk /dev/hde: 122.9 GB, 122942324736 bytes
16 heads, 63 sectors/track, 238216 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

Device Boot Start End Blocks Id System
/dev/hde1 * 1 249 125464+ fd Linux raid autodetect
/dev/hde2 250 4218 2000376 fd Linux raid autodetect
/dev/hde3 4219 8187 2000376 fd Linux raid autodetect
/dev/hde4 8188 238200 115926552 fd Linux raid autodetect
coppermine thomas #
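
Rather than comparing the two fdisk listings by eye, the check can be scripted against saved "sfdisk -d" dumps. This is a rough sketch with made-up function names ("normalize_dump", "same_layout"), and it assumes simple device names like hda or sda (letters followed by a partition number).

```shell
#!/bin/sh
# Sketch: compare two "sfdisk -d" dumps while ignoring the disk name.
# Assumes device names of the form /dev/<letters><number>, e.g. /dev/hda1.

# Keep only the partition lines and replace "/dev/hdaN" with "partN".
normalize_dump() {
    grep '^/dev/' "$1" | sed -E 's#^/dev/[a-z]+([0-9]+)#part\1#'
}

# Succeeds (exit status 0) when both dumps describe the same layout.
same_layout() {
    [ "$(normalize_dump "$1")" = "$(normalize_dump "$2")" ]
}

# Example:
#   sfdisk -d /dev/hda > hda.dump
#   sfdisk -d /dev/hde > hde.dump
#   same_layout hda.dump hde.dump && echo "layouts match"
```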


Now I need to add the new partitions to the software RAID arrays.

coppermine thomas # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hda2[1]
2000256 blocks [2/1] [_U]

md2 : active raid1 hda3[1]
2000256 blocks [2/1] [_U]

md3 : active raid1 hda4[0]
115926464 blocks [1/1] [U]

md0 : active raid1 hda1[1]
125376 blocks [2/1] [_U]

unused devices: <none>
coppermine thomas # mdadm /dev/md0 -a /dev/hde1
mdadm: hot added /dev/hde1
coppermine thomas # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hda2[1]
2000256 blocks [2/1] [_U]

md2 : active raid1 hda3[1]
2000256 blocks [2/1] [_U]

md3 : active raid1 hda4[0]
115926464 blocks [1/1] [U]

md0 : active raid1 hde1[2] hda1[1]
125376 blocks [2/1] [_U]
[=>...................] recovery = 9.7% (12928/125376) finish=0.5min speed=3232K/sec

unused devices: <none>
coppermine thomas #


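Repeat the above for the other degraded RAID1 arrays (md1 gets /dev/hde2 and md2 gets /dev/hde3). Note that md3 is a special case, because it shows "[1/1]" rather than "[2/1]".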
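
To see at a glance which arrays still need a partner, the status lines in /proc/mdstat can be parsed. Here is a minimal sketch; the function name "degraded_arrays" is hypothetical, and it simply looks for an underscore inside the "[UU]" style status field.

```shell
#!/bin/sh
# Sketch: list md arrays whose member status field (e.g. "[_U]") contains
# an underscore, meaning one slot has no working device.
# Reads /proc/mdstat by default, or a saved copy given as $1.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /blocks/ && match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0) print name
        }
    ' "${1:-/proc/mdstat}"
}

# Example:
#   degraded_arrays
```

An array that reports "[1/1] [U]" is not listed, because from the kernel's point of view a one-slot mirror with one disk is complete.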

At this point, I'm basically done. It's time to make another backup and maybe swap the hda/hde cables to verify that I copied the boot sector correctly.

...

The big problem is that md3 is showing up as a one-disk array, "[1/1] [U]", instead of a degraded two-disk mirror, "[2/1] [U_]". So I need to figure out how to tell mdadm to make room for /dev/hde4 in the array and force it to resync. (To fix this, you use the "grow" command of mdadm.)

coppermine thomas # mdadm --grow /dev/md3 --raid-disks=2
coppermine thomas # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hde2[0] hda2[1]
2000256 blocks [2/2] [UU]

md2 : active raid1 hde3[0] hda3[1]
2000256 blocks [2/2] [UU]

md3 : active raid1 hda4[0]
115926464 blocks [2/1] [U_]

md0 : active raid1 hde1[0] hda1[1]
125376 blocks [2/2] [UU]

unused devices: <none>
coppermine thomas # mdadm /dev/md3 --add /dev/hde4
mdadm: hot added /dev/hde4
coppermine thomas # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hde2[0] hda2[1]
2000256 blocks [2/2] [UU]

md2 : active raid1 hde3[0] hda3[1]
2000256 blocks [2/2] [UU]

md3 : active raid1 hde4[2] hda4[0]
115926464 blocks [2/1] [U_]
[>....................] recovery = 0.0% (6656/115926464) finish=1153.4min speed=1664K/sec

md0 : active raid1 hde1[0] hda1[1]
125376 blocks [2/2] [UU]

unused devices: <none>
coppermine thomas #
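
The md3 situation is easy to miss because the array looks healthy. A quick way to spot it is to read the "[slots/active]" field and flag any array configured with fewer slots than you expect. This is a sketch only, and "arrays_below" is a name I made up.

```shell
#!/bin/sh
# Sketch: print every md array whose configured slot count (the first number
# in the "[n/m]" field) is below the wanted value, followed by that count.
# $1 = wanted number of slots, $2 = optional saved copy of /proc/mdstat.
arrays_below() {
    awk -v want="$1" '
        /^md[0-9]+ :/ { name = $1 }
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[1] + 0 < want + 0) print name, n[1]
        }
    ' "${2:-/proc/mdstat}"
}

# Example: find mirrors that were created with only one slot.
#   arrays_below 2
```

Any array this reports needs "mdadm --grow ... --raid-disks=N" before the added partition can become an active mirror rather than a spare.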


Tuesday, November 08, 2005
Gentoo 2005.1 Software RAID (part 2) AMD64 on Asus A8V
This is a record of the kernel flags that I'm going to use for my AMD64 system. It's an Asus A8V (K8T800Pro and VT8237) with an Athlon64 3200+ chip along with 2GB of RAM. Hard drives are hooked up to the onboard Promise controller (PDC20378), the onboard SATA controller and the onboard IDE controller. Plus the motherboard has an onboard gigabit ethernet NIC (Marvell 88E8001).

In addition, I have even more hard drives hooked up to a Promise Ultra133 TX2 PCI card (PDC20269) and some HighPoint Rocket133SB PCI cards (HPT302).

# emerge mdadm
# emerge lvm2
# cd /usr/src/linux
# make menuconfig


Linux Kernel v2.6.13-gentoo-r5 Configuration
(C)ode maturity level options
(G)eneral setup
(L)oadable module support
(P)rocessor type and features
--> (P)rocessor family (changed to "AMD-Opteron/Athlon64")
--> (S)ymmetric multi-processing support (turned this one OFF)
(P)ower management options (ACPI, APM)
(B)us options (PCI, etc.)
(E)xecutable file formats
(D)evice drivers
--> ATA/ATAPI/MFM/RLL support
--> --> (g)eneric/default IDE chipset support (should already be ON)
--> (S)CSI device support
--> --> (S)CSI generic support (turn this ON)
--> --> (S)CSI low-level drivers
--> --> --> (S)erial ATA (SATA) support (should already be ON)
--> --> --> --> (I)ntel PIIX/ICH SATA support (turn OFF)
--> --> --> --> (P)romise SATA TX2/TX4 support (turn ON as BUILT-IN)
--> M(u)lti-device support (should already be ON)
--> --> (R)AID support (turn it ON as BUILT-IN)
--> --> --> (R)AID-1 mirroring mode (turn it ON as BUILT-IN)
--> --> (D)evice mapper support (set to MODULE or BUILT-IN)
--> N(e)tworking support
--> --> (E)thernet (10 or 100Mbit)
--> --> --> (E)thernet (10 or 100Mbit) (Turn OFF)
--> --> (E)thernet (1000Mbit)
--> --> --> (I)ntel(R) PRO/1000 Gigabit Ethernet support (turn OFF)
--> --> --> N(e)w SysKonnect GigaEthernet support (EXPERIMENTAL) (turn ON as BUILT-IN)
--> --> --> (B)roadcom Tigon3 support (turn OFF)
--> (C)haracter Devices
--> --> (I)ntel/AMD/VIA HW Random Number Generator (should be ON)
--> --> (I)ntel 440LX/BX/GX, I8xx and E7x05 chipset support (turn it OFF)
--> (S)ound
--> --> (S)ound card support (turn OFF)
(F)ile systems
--> N(e)twork File Systems
--> --> (S)MB file system support (turn ON as BUILT-IN)
--> --> (C)IFS support (turn ON as BUILT-IN)
(P)rofiling support
(K)ernel hacking
(S)ecurity options
(C)ryptographic options
--> (C)ryptographic API (turn ON)
--> --> HM(A)C support (NEW) (turn ON as BUILT-IN)
--> --> (turn ON all other options as MODULE)
(L)ibrary routines

Exit and save your configuration. Then build the kernel (the following command is for 2.6 kernels). Expect the compile to take almost no time at all on an AMD64 chip. I used to wait an hour for all this to happen on my old VIA EPIA.

# make && make modules_install

Once the code finishes compiling, you need to copy the kernel to your /boot partition.

# mount /boot
# ls -l /boot
# ls -l arch/x86_64/boot
# df
# cp arch/x86_64/boot/bzImage /boot/kernel-2.6.13-9Nov2005
# cp System.map /boot/System.map-2.6.13-9Nov2005
# cp .config /boot/config-2.6.13-9Nov2005
# ls -l /boot
# nano -w /boot/grub/grub.conf


Add your new kernel. I'd recommend always leaving the configuration for your old kernel in place and inserting the new config above the old one. That way you get (2) benefits:

1) The "default 0" command will boot the new kernel automatically (because it appears first in the grub.conf file).

2) Your old kernel is still in place, in case the new kernel doesn't boot. There's probably no reason to remove old kernels from the system unless you are running out of space on the /boot partition. (Which is why I use the "df" command to check space.)
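
The "new entry goes above the old one" rule can also be applied with a small script instead of an editor. The sketch below is an assumption about how you might do it, not something from my original session: "add_stanza_first" is a hypothetical helper that prints the merged file to standard output so you can review it before replacing grub.conf.

```shell
#!/bin/sh
# Sketch: print grub.conf ($1) with the stanza stored in file $2 inserted
# above the first existing "title" line, so "default 0" boots the new entry.
# If no title line exists, the stanza is appended at the end instead.
add_stanza_first() {
    awk -v stanza="$2" '
        function emit(   line) {
            while ((getline line < stanza) > 0) print line
            close(stanza)
            print ""
        }
        !inserted && /^title/ { emit(); inserted = 1 }
        { print }
        END { if (!inserted) emit() }
    ' "$1"
}

# Example:
#   add_stanza_first /boot/grub/grub.conf new-entry.txt > /tmp/grub.conf.new
```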

Bugs and goofs:

1) The disks attached to the onboard Promise PDC20378 RAID controller were not recognized by my first kernel (although they showed up when I booted the LiveCD). So I was missing a kernel option. Possibly I hadn't turned on SCSI support, which is what allows you to pick the Promise SATA driver.

This is fixed by adding:

(D)evice drivers
--> (S)CSI device support
--> --> (S)CSI low-level drivers
--> --> --> (S)erial ATA (SATA) support (should already be ON)
--> --> --> --> (P)romise SATA TX2/TX4 support (turn ON as BUILT-IN)

1b) However, once you've turned on that particular driver (CONFIG_SCSI_SATA_PROMISE=y), your system will slow down and become very sluggish anytime that mdadm is rebuilding an array with drives attached to that controller. You will also start to see the following messages in your "dmesg" output:

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x48/0xb0


2) The disks attached to the PCI Rocket133 cards did not show up after the first boot. Same deal as #1, worked with the LiveCD, but I didn't get the driver selection right when I built the first kernel. On the upside, it allowed me to identify the (2) disks that are attached to the Promise PCI controller without any effort (hde and hdg are on the Promise card).

(I'm still troubleshooting the Rocket133.)

Key things to look for in menuconfig for Rocket133 might be:

(D)evice drivers
--> ATA/ATAPI/MFM/RLL support
--> --> SCSI emulation support
--> --> generic/default IDE chipset support
--> --> PCI IDE chipset support
--> --> Generic PCI IDE Chipset Support

Probably the only one that matters is (CONFIG_BLK_DEV_HPT366=y):

--> --> HPT36X/37X chipset support (turn this ON as BUILT-IN)

Yes, the Rocket 133SB (Rocket133SB) HPT302 chip is apparently supported by the HPT366.c file. You can find this by grepping the kernel sources:

# cd /usr/src/linux
# find . -print | xargs grep -i 'hpt302'
# grep -i 'hpt366' .config
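
When chasing several driver options at once, it helps to ask the .config file directly whether each one is built in, a module, or off. A minimal sketch follows; "config_state" is a made-up helper name, and the only assumption is the usual "OPTION=y" / "OPTION=m" line format of a kernel .config.

```shell
#!/bin/sh
# Sketch: report how a kernel option is set in a .config file.
# Prints "y" (built-in), "m" (module), or "n" when the option is absent
# or only appears as "# CONFIG_FOO is not set".
# $1 = option name, $2 = optional path to the config (default ./.config).
config_state() {
    awk -F= -v opt="$1" '
        $1 == opt { print $2; found = 1 }
        END { if (!found) print "n" }
    ' "${2:-.config}"
}

# Example:
#   cd /usr/src/linux
#   config_state CONFIG_BLK_DEV_HPT366
#   config_state CONFIG_SCSI_SATA_PROMISE
```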


...

So, after turning on the two drivers I have all 8 drives showing up again in /proc/partitions:

hde / hdg -- Promise PCI card
hdk, hdo, hds -- Highpoint Rocket133 cards
hda -- motherboard IDE
sda -- Promise motherboard RAID PATA
sdb -- motherboard SATA

Performance is still slow, so now I'm digging through the kernel configs trying to find lines that contain "irq".

One key line that looks interesting:

Sharing PCI IDE interrupts support (CONFIG_IDEPCI_SHARE_IRQ)

Turned that on, but still haven't fixed the sluggishness or the lost ticks issue. I'm very tempted to give up on the PDC20378 chip, except that I know it worked on the LiveCD.


Friday, September 23, 2005
Gentoo 2005.1 Software RAID (part 3)
Picking up with part 7c after compiling the kernel. Now you need to install your kernel into the boot partition. Change the "2.6.12-Sep2005" portion of the filenames to whatever you want.

# cp arch/i386/boot/bzImage /boot/kernel-2.6.12-Sep2005
# cp System.map /boot/System.map-2.6.12-Sep2005
# cp .config /boot/config-2.6.12-Sep2005
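
Because these three copies are repeated after every kernel build, they can be bundled into one function. This is just a sketch: "install_kernel" and its argument order are my own choices, and the architecture directory defaults to i386 to match the commands above.

```shell
#!/bin/sh
# Sketch: copy a freshly built kernel, its System.map and its .config into
# the boot directory under a common tag.
# $1 = kernel source tree, $2 = tag (e.g. 2.6.12-Sep2005),
# $3 = boot directory (default /boot), $4 = arch directory (default i386).
install_kernel() {
    src=$1
    tag=$2
    boot=${3:-/boot}
    arch=${4:-i386}
    cp "$src/arch/$arch/boot/bzImage" "$boot/kernel-$tag" &&
    cp "$src/System.map" "$boot/System.map-$tag" &&
    cp "$src/.config" "$boot/config-$tag"
}

# Example:
#   install_kernel /usr/src/linux 2.6.12-Sep2005
```

The && chain stops at the first failed copy, so a full /boot partition shows up as an error instead of a silently incomplete install.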


If you are using LVM2, you will need to add a line at the end of the autoload file to automatically load the LVM2 module. Note that you may also need to add a line for DHCP support (not 100% sure about that). Since I'm using these boxes for servers with static IPs I don't concern myself with it.

# echo 'dm-mod' >> /etc/modules.autoload.d/kernel-2.6
# cat /etc/modules.autoload.d/kernel-2.6
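
One catch with "echo ... >>" is that running it twice leaves a duplicate line in the file. A small guard avoids that. This is a sketch, and "ensure_line" is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: append a line to a file only if that exact line is not already
# present, so the command is safe to run more than once.
# $1 = line to add, $2 = file.
ensure_line() {
    grep -qxF -- "$1" "$2" 2>/dev/null || echo "$1" >> "$2"
}

# Example:
#   ensure_line 'dm-mod' /etc/modules.autoload.d/kernel-2.6
```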


Time to configure the "/etc/fstab" file. There are pages full of documentation on what goes in this file and the handbook covers some of it. For my VIA EPIA box with only 3 partitions, my fstab file is going to be rather simple.

# nano -w /etc/fstab

/dev/md0 /boot ext2 noauto,noatime 1 2
/dev/md2 / ext3 noatime 0 1
/dev/md1 none swap sw 0 0
/dev/cdroms/cdrom0 /mnt/cdrom auto noauto,ro,user 0 0

#/dev/fd0 /mnt/floppy auto noauto 0 0

proc /proc proc defaults 0 0

shm /dev/shm tmpfs nodev,nosuid,noexec 0 0


For my Celeron box which is using LVM2 partitions, it's more complex.

# nano -w /etc/fstab

/dev/md0 /boot ext2 noauto,noatime 1 2
/dev/md2 / ext3 noatime 0 1
/dev/md1 none swap sw 0 0
/dev/cdroms/cdrom0 /mnt/cdrom auto noauto,ro,user 0 0

#/dev/fd0 /mnt/floppy auto noauto 0 0

/dev/vgmirror/opt /opt ext3 noatime 0 3
/dev/vgmirror/usr /usr ext3 noatime 0 3
/dev/vgmirror/var /var ext3 noatime 0 3
/dev/vgmirror/home /home ext3 noatime 0 3
/dev/vgmirror/tmp /tmp ext2 noatime 0 3
/dev/vgmirror/vartmp /var/tmp ext2 noatime 0 3

proc /proc proc defaults 0 0

shm /dev/shm tmpfs nodev,nosuid,noexec 0 0
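
A typo in fstab is an easy way to end up with a box that will not finish booting, so a quick field-count check before the first reboot is cheap insurance. The sketch below only verifies that each entry has six fields; the name "check_fstab" is made up, and it does not validate devices or mount options.

```shell
#!/bin/sh
# Sketch: verify that every non-blank, non-comment line of an fstab file
# has exactly six whitespace-separated fields.
# $1 = optional path (default /etc/fstab). Exit status is 1 on any bad line.
check_fstab() {
    awk '
        /^[ \t]*(#|$)/ { next }
        NF != 6 {
            printf "line %d: expected 6 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }
    ' "${1:-/etc/fstab}"
}

# Example:
#   check_fstab && echo "fstab looks sane"
```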


Now, some misc stuff (see networking configuration for information on setting up DHCP or static IPs):

# nano -w /etc/conf.d/hostname
# nano -w /etc/conf.d/domainname
# rc-update add domainname default
# nano -w /etc/conf.d/net
(either leave empty for DHCP or configure your IP and gateway)
# rc-update add net.eth0 default
# cat /etc/resolv.conf
(verify your DNS servers if you specified a static IP)
# nano -w /etc/conf.d/clock
(change CLOCK="UTC" to CLOCK="local")
# passwd
(set your root password to something you will remember)

# useradd -m -G users,wheel,audio -s /bin/bash john
# passwd john
(add a user called 'john' and set a password)


And a few other misc options (system logger, job scheduling):

# emerge syslog-ng
# rc-update add syslog-ng default
# emerge dcron
# rc-update add dcron default
# crontab /etc/crontab


I also like to install the "sshd" service at this point so that I can ssh into the box after the initial reboot. (These notes are based on a very old posting that I made about installing sshd on Gentoo Linux.) Alternately, you can do these commands after booting the box for the first time by logging in as root at the console.

# /usr/bin/ssh-keygen -t dsa -b 2048 -f /etc/ssh/ssh_host_dsa_key -N ""
(the key may take a few minutes to generate)
# chmod 600 /etc/ssh/ssh_host_dsa_key
# chmod 644 /etc/ssh/ssh_host_dsa_key.pub
# rc-update add sshd default


Now it's time to install and configure "grub" (the boot loader). Note that where we are saying "/dev/hdc", you will need to change to match the name of your secondary mirror drive.

# emerge grub

(Now, at this point, I got an error at the end of the emerge because I had failed to mount my /proc file system before entering the chroot environment. The fix was easy, requiring me to exit the chroot environment, mount the /proc filesystem and then re-enter the chroot environment.)

# ls -l /boot
# nano -w /boot/grub/grub.conf


Contents of my grub.conf file:

# Which listing to boot as default. 0 is the first, 1 the second etc.
default 0
timeout 30

# Sep 2005 installation (software RAID, no LVM2)
title=Gentoo Linux 2.6.12 (Sep 22 2005)
root (hd0,0)
kernel /kernel-2.6.12-Sep2005 root=/dev/md2


Now I fire up grub and install it onto the MBR of both disks.

# grub --no-floppy
grub> find /grub/stage1
(hd0,0)
(hd1,0)
grub> root (hd0,0)
grub> setup (hd0)
grub> device (hd0) /dev/hdc
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
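
The same grub session can be generated for any number of mirror disks and piped into grub in batch mode. Treat this as a sketch under assumptions: "grub_commands" is a name I invented, it maps each disk to (hd0) in turn just like the "device" line above, and it assumes /boot is the first partition on every disk.

```shell
#!/bin/sh
# Sketch: emit the GRUB (legacy) shell commands that install the boot loader
# onto the MBR of each disk passed as an argument. Each disk is temporarily
# mapped to (hd0) so that it can boot on its own if the other disk dies.
grub_commands() {
    for disk in "$@"; do
        printf 'device (hd0) %s\nroot (hd0,0)\nsetup (hd0)\n' "$disk"
    done
    echo quit
}

# Example (review the output first, then run as root):
#   grub_commands /dev/hda /dev/hdc
#   grub_commands /dev/hda /dev/hdc | grub --batch --no-floppy
```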


Time for the first reboot. Now you need to unmount everything that you can (including LVM) prior to reboot. Since I'm not using LVM2, this is rather simple.

livecd gentoo # exit
livecd / # cd /
livecd / # cat /proc/mounts
(gives you a list of what is mounted)
livecd / # umount /mnt/gentoo/boot
livecd / # umount /mnt/gentoo/proc
livecd / # umount /mnt/gentoo
livecd / # reboot
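
With LVM2 there are many more mount points to unwind, and they have to be unmounted deepest first. A sketch of one way to get that ordering from /proc/mounts is below; the function name "mounts_under" is hypothetical.

```shell
#!/bin/sh
# Sketch: list every mount point at or below a prefix, deepest paths first,
# which is the order in which they can be unmounted.
# $1 = prefix (e.g. /mnt/gentoo), $2 = optional saved copy of /proc/mounts.
mounts_under() {
    awk -v prefix="$1" '
        $2 == prefix || index($2, prefix "/") == 1 { print $2 }
    ' "${2:-/proc/mounts}" | LC_ALL=C sort -r
}

# Example (run outside the chroot):
#   for m in $(mounts_under /mnt/gentoo); do umount "$m"; done
```

Reverse sorting works here because a child path always sorts after its parent, so reversing the list puts /mnt/gentoo itself last.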


Pull the CD-ROM at this point, otherwise the LiveCD will probably boot. Then cross your fingers and watch the console for errors.


Thursday, September 22, 2005
Gentoo 2005.1 Software RAID (part 2) Celeron CPU
Time to configure the Gentoo kernel. I'm configuring this for my Celeron motherboard.

Note the use of "emerge lvm2" since I'm using LVM2 on this system during the initial installation.

# emerge mdadm
# emerge lvm2
# cd /usr/src/linux
# make menuconfig


Linux Kernel v2.6.11 Configuration
(C)ode maturity level options
(G)eneral setup
--> (C)onfigure standard kernel features for small systems (turn ON)
--> --> (O)ptimize for size (turn ON)
(L)oadable module support
(P)rocessor type and features
--> (P)rocessor family (changed to "Pentium-III...")
--> (S)ymmetric multi-processing support (turned this one OFF)
--> M(a)chine Check Exception (turned this OFF)
(P)ower management options (ACPI, APM)
(B)us options (PCI, PCMCIA, EISA, MCA, ISA)
(E)xecutable file formats
(D)evice drivers
--> (A)TA/ATAPI/MFM/RLL support
--> --> (P)ROMISE PDC202{46|62|65|67} support (turn ON)
--> N(e)tworking support
--> --> N(e)twork device support (should already be BUILT-IN)
--> --> --> (E)thernet (10 or 100Mbit)
--> --> --> --> (T)ulip family network device support
--> --> --> --> --> "(T)ulip" family network device support (turn ON as BUILT-IN)
--> --> --> --> --> --> (D)ECchip Tulip (dc2114x) PCI support (turn ON as BUILT-IN)
--> --> --> --> --> --> --> (I left the sub-options alone)
--> --> --> --> (E)ISA, VLB, PCI and on board controllers (turn OFF)
--> (P)arallel port support (turned OFF)
--> M(u)lti-device support (turn it ON)
--> --> (R)AID support (turn it ON as BUILT-IN)
--> --> --> (R)AID-1 mirroring mode (turn it ON as BUILT-IN)
--> --> (D)evice mapper support (set to MODULE, per section 13 of LVM2 guide)
--> (C)haracter Devices
--> --> (I)ntel/AMD/VIA HW Random Number Generator (turn ON as BUILT-IN)
--> (S)ound
--> --> (S)ound card support (turn OFF)
(F)ile systems
--> N(e)twork File Systems
--> --> (S)MB file system support (turn ON as BUILT-IN)
--> --> (C)IFS support (turn ON as BUILT-IN)
(P)rofiling support
(K)ernel hacking
(S)ecurity options
(C)ryptographic options
--> (C)ryptographic API (turn ON)
--> --> HM(A)C support (NEW) (turn ON as BUILT-IN)
--> --> (turn ON all other options as MODULE)
(L)ibrary routines

Exit and save your configuration. Then build the kernel (the following command is for 2.6 kernels). Expect the compile to take about an hour.

# make && make modules_install
