UPDATE: Nov 27, 2007
How to replace a bad drive in a ZFS raidZ setup (which has an available spare)
OK, I have this pool:
pool: raidbox2
state: ONLINE
scrub: [here is some large message with a link to sun's docs on how to solve the problem, which contains some important omissions, which I have covered below]
config:
NAME STATE READ WRITE CKSUM
raidbox2 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c3t10d0 ONLINE 0 0 0
c3t3d0 ONLINE 0 0 0
c3t9d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c3t8d0 ONLINE 0 0 0
c3t5d0 ONLINE 4 0 0
spares
c3t1d0 AVAIL
errors: No known data errors
The c3t5d0 drive shows some read errors. I notice the number increments by 1 or 2 every week, and the kernel spits out errors
in the /var/adm/messages file.
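To keep an eye on those counters without reading the whole status by eye, something like the sketch below could do. It assumes the column layout shown above (NAME STATE READ WRITE CKSUM with plain integer counters); the function name failing_devices is made up for this example.

```shell
# List pool devices whose READ, WRITE or CKSUM counter is non-zero.
# Reads `zpool status` text on stdin, so it can be tried on saved output.
failing_devices() {
    awk 'NF == 5 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ &&
         ($3 + $4 + $5) > 0 { print $1, "read=" $3, "write=" $4, "cksum=" $5 }'
}

# Sample lines taken from the status output above (trimmed).
failing_devices <<'EOF'
NAME STATE READ WRITE CKSUM
raidbox2 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c3t8d0 ONLINE 0 0 0
c3t5d0 ONLINE 4 0 0
spares
c3t1d0 AVAIL
EOF
```

On a live box the input would come from `zpool status raidbox2 | failing_devices`.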
I then type:
zpool replace raidbox2 c3t5d0 c3t1d0
This will "mirror" c3t5d0 and c3t1d0 (the spare). The resilver initially says it will take about 8 hours, but after
a while it speeds up, and it ends up taking 2 hours and change.
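Since the time estimate is so unreliable, polling for completion is easier than trusting it. A minimal sketch, assuming the status text contains the phrase "resilver in progress" while it runs; STATUS_CMD and the function names are hypothetical, added only so the loop can be tried without a real pool.

```shell
# Succeeds (exit 0) while the status text on stdin reports a running resilver.
resilver_running() {
    grep -q 'resilver in progress'
}

# Poll until the resilver is done. STATUS_CMD is a hypothetical override;
# when it is unset the loop calls `zpool status <pool>`.
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while ${STATUS_CMD:-zpool status} "$pool" | resilver_running; do
        sleep "$interval"
    done
    echo "resilver on $pool finished"
}
```

Usage would be `wait_for_resilver raidbox2 300` to check every 5 minutes.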
Once the resilver is finished, you can then OFFLINE the bad drive, by executing:
zpool offline -t raidbox2 c3t5d0
(the -t flag indicates "temporary" offlining, which brings the drive back to online after a reboot)
You will notice, when doing a zpool status, that the raidZ will then change to "degraded" state.
Don't worry.
You then insert the new drive, run format, locate the beast, and do an fdisk (answer "y" to the 100% Solaris partitioning).
Then you type "label" and "y", and then you quit format,
and now you are ready to slap the new drive into operation.
Just type: zpool replace raidbox2 c3t5d0 c3t5d0
(yep, that's correct: both source and target, same ident)
You will notice the beast starts the resilvering (from the spare, which is in use, onto the new drive), which will again
initially say 8 hours, but ends up taking 2 and a bit. (If you are lucky enough to get a Fujitsu, it may be quicker!)
At the end of the resilvering, the new drive goes back to online and the spare goes back to spare status,
all automagically.
end result:
pool: raidbox2
state: ONLINE
scrub: resilver completed with 0 errors on Tue Nov 27 13:51:00 2007
config:
NAME STATE READ WRITE CKSUM
raidbox2 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c3t10d0 ONLINE 0 0 0
c3t3d0 ONLINE 0 0 0
c3t9d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c3t8d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
spares
c3t1d0 AVAIL
errors: No known data errors
voilà!
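The whole swap, in order, can be printed as a checklist. This is only a sketch that echoes the commands used above with the pool and disk names filled in; it runs nothing, and the name swap_plan is made up for the example.

```shell
# Print the drive-swap steps for a pool, a failing disk and a hot spare.
# Nothing is executed; the output is meant to be read and typed by hand.
swap_plan() {
    pool=$1
    bad=$2
    spare=$3
    cat <<EOF
zpool replace $pool $bad $spare
# wait until zpool status no longer reports a resilver
zpool offline -t $pool $bad
# pool is DEGRADED now; swap the hardware, then fdisk + label in format
zpool replace $pool $bad $bad
# wait for the second resilver; the spare should return to AVAIL
zpool status $pool
EOF
}

swap_plan raidbox2 c3t5d0 c3t1d0
```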
UPDATE: Oct 24, 2007
OK guys, I found that the coolest combination for price/performance is Solaris 10 running on a Sun X4200 (or X4100) with 2 CPUs, 8 GB RAM, and a nice SCSI card, with a StorEdge 3300 (JBOD).
This is what you do:
NOTE: everything below the ==== line is outdated. We could not get the disks to respond properly after a disconnect when running ZFS over DiskSuite, and it was not needed anyway, as there is an option
to do stripes over mirrors in ZFS directly. You do it like this:
zpool create -f raidbox1 mirror c0t0d0s2 c0t1d0s0 mirror c0t3d0s2 c0t2d0s0 mirror c0t4d0s2 c0t8d0s2
after this, a zpool status gives you:
pool: raidbox1
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
raidbox1 ONLINE 0 0 0
mirror ONLINE 0 0 0
c0t0d0s2 ONLINE 0 0 0
c0t1d0s0 ONLINE 0 0 0
mirror ONLINE 0 0 0
c0t3d0s2 ONLINE 0 0 0
c0t2d0s0 ONLINE 0 0 0
mirror ONLINE 0 0 0
c0t4d0s2 ONLINE 0 0 0
c0t8d0s2 ONLINE 0 0 0
voilà, a beautiful 3-element stripe over mirrors, which works rather beautifully I must say!
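With more disks the command line gets long, so here is a small sketch that pairs up devices and prints the matching zpool create line. The helper name mirror_stripe_cmd is invented for this example, it only prints the command, and it leaves out the -f used above on purpose.

```shell
# Print a zpool create command that stripes over two-way mirrors.
# Arguments: pool name, then an even number of devices, paired in order.
mirror_stripe_cmd() {
    pool=$1
    shift
    if [ $# -eq 0 ] || [ $(( $# % 2 )) -ne 0 ]; then
        echo "need an even, non-zero number of devices" >&2
        return 1
    fi
    cmd="zpool create $pool"
    while [ $# -gt 0 ]; do
        cmd="$cmd mirror $1 $2"
        shift 2
    done
    echo "$cmd"
}

mirror_stripe_cmd raidbox1 c0t0d0s2 c0t1d0s0 c0t3d0s2 c0t2d0s0 c0t4d0s2 c0t8d0s2
```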
=============================== the below is OBSOLETE and should only be used by the poor dudes who landed in the same trap ========================================
1. Make sure you upgrade all firmware on the drives (lamentably, there is no software to do this on x86, so you have to connect the JBOD to a SPARC and use the patch, with the mload proggy).
2. Once you have all drives nicely working in the JBOD, make sure you give them SMI labels (you can change from EFI to SMI with format -e).
Continue in format, and make sure all drives look like this:
0. c0t0d0 < SUN36G cyl 24619 alt 2 hd 27 sec 107>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@0,0
1. c0t1d0 < SUN72G cyl 14086 alt 2 hd 24 sec 424>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@1,0
2. c0t2d0 < SUN72G cyl 14086 alt 2 hd 24 sec 424>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@2,0
3. c0t3d0 < SUN36G cyl 24619 alt 2 hd 27 sec 107>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@3,0
4. c0t4d0 < SUN36G cyl 24619 alt 2 hd 27 sec 107>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@4,0
5. c0t8d0 < SUN36G cyl 24619 alt 2 hd 27 sec 107>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@8,0
6. c0t9d0 < SUN36G cyl 24619 alt 2 hd 27 sec 107>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@9,0
7. c0t10d0 < SUN36G cyl 24619 alt 2 hd 27 sec 107>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@a,0
8. c0t11d0 < SUN36G cyl 24619 alt 2 hd 27 sec 107>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@b,0
9. c0t12d0 < SUN36G cyl 24619 alt 2 hd 27 sec 107>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@c,0
10. c0t13d0 < FUJITSU-MAX3036NC-0104 cyl 41605 alt 2 hd 2 sec 863>
/pci@0,0/pci1022,7450@2/pci1000,5110@1/sd@d,0
Notice the nice SUN labels on them (you arrive at this using the 'type' option under format).
NOTE: c0t5 is missing from this list, because I took it out while conducting some tests, as we will see below. Also note that I have a 15k Fujitsu (she is 3 times faster than the crappy Seagates), and I also have 2 72-giggers (I use half of them as hot spares).
3. I cleaned all the standard partitions from the drives, and I made s2 (on all the 36-giggers) and s0-s1 (on the 72-giggers) exactly 33.92GB. (I noticed, however, that DiskSuite reports the 72-gigger halves as a little larger, so it is best to use them as spares or as "submirrors".)
4. OK, now the fun part: what I decided to do is make 6 mirrors of 2 drives each, using DiskSuite, and then make 2 stripes on top of them, using ZFS.
5. DiskSuite setup:
First, the METADBs: I chose to allocate 2 extra 20 MB partitions on the internal disks of the server, and it looks like this:
bash-3.00# metadb
flags first blk block count
a m pc luo 16 8192 /dev/dsk/c1t0d0s6
a pc luo 8208 8192 /dev/dsk/c1t0d0s6
a pc luo 16400 8192 /dev/dsk/c1t0d0s6
a pc luo 16 8192 /dev/dsk/c1t1d0s6
a pc luo 8208 8192 /dev/dsk/c1t1d0s6
a pc luo 16400 8192 /dev/dsk/c1t1d0s6
a pc luo 16 8192 /dev/dsk/c1t0d0s7
a pc luo 8208 8192 /dev/dsk/c1t0d0s7
a pc luo 16400 8192 /dev/dsk/c1t0d0s7
a pc luo 16 8192 /dev/dsk/c1t1d0s7
a pc luo 8208 8192 /dev/dsk/c1t1d0s7
a pc luo 16400 8192 /dev/dsk/c1t1d0s7
Second, the concat/stripes:
metainit -f d10 1 1 c0t0d0s2
metainit -f d11 1 1 c0t1d0s0
metainit -f d12 1 1 c0t13d0s2
metainit -f d13 1 1 c0t12d0s2
metainit -f d14 1 1 c0t3d0s2
metainit -f d15 1 1 c0t11d0s2
metainit -f d20 1 1 c0t2d0s0
metainit -f d21 1 1 c0t10d0s2
metainit -f d22 1 1 c0t4d0s2
metainit -f d23 1 1 c0t9d0s2
metainit -f d24 1 1 c0t8d0s2
metainit -f d25 1 1 c0t5d0s2
Third, the mirrors with their 'parent' submirror:
metainit d100 -m d10
metainit d101 -m d21
metainit d102 -m d12
metainit d103 -m d13
metainit d104 -m d14
metainit d105 -m d15
NOTE: I initialized the d101 mirror with d21 as the parent, because somehow c0t1d0s0 is larger than c0t10d0s2, even though I allocated the same amount (in GB) in format. (Maybe next time I will use sectors or real bytes.)
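One possible explanation for the size mismatch is cylinder rounding: the two disk types have different geometries, so the same number of GB lands on a different block count. The sketch below works that out from the geometries in the format listing above (27 heads x 107 sectors for SUN36G, 24 x 424 for SUN72G). That the slice gets rounded up to whole cylinders is my assumption here, not something verified on this box, and the function names are made up.

```shell
# Blocks in a slice = cylinders * heads * sectors-per-track.
slice_blocks() {   # args: cylinders heads sectors
    echo $(( $1 * $2 * $3 ))
}

# Smallest number of whole cylinders that holds at least N blocks.
cyls_for_blocks() {   # args: blocks heads sectors
    per_cyl=$(( $2 * $3 ))
    echo $(( ($1 + per_cyl - 1) / per_cyl ))
}

want=$(slice_blocks 24621 27 107)        # s2 on a SUN36G (24619 + 2 cylinders)
cyls=$(cyls_for_blocks "$want" 24 424)   # whole cylinders needed on a SUN72G
got=$(slice_blocks "$cyls" 24 424)
echo "36G slice: $want blocks"
echo "72G slice: $cyls cylinders = $got blocks ($(( got - want )) extra)"
```

If the rounding assumption holds, sizing every slice to a common block count that fits both geometries would avoid having to pick the smaller side as the parent.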
Fourth, I attached the submirrors to them:
metattach d100 d20
metattach d101 d11
metattach d102 d22
metattach d103 d23
metattach d104 d24
metattach d105 d25
NOTE: d11 is attached as the submirror here, because she's bigger.
The resync takes a long time (do this: while :; do metastat | grep sync; sleep 10; done)
Once it was done (after tremendous chicken quesadillas and SOL Mexican beer (I prefer Dos Equis, but they don't have it) at Cancun (Potsdamer Platz)), I moved on to the ZFS setup.
6. ZFS stripes Setup
zpool create raidbox1 /dev/md/dsk/d100 /dev/md/dsk/d102 /dev/md/dsk/d104
zpool create raidbox2 /dev/md/dsk/d101 /dev/md/dsk/d103 /dev/md/dsk/d105
And VOILÀ
df -k
raidbox1 104509440 20823526 83685745 20% /raidbox1
raidbox2 104509440 33013014 71496200 32% /raidbox2
NOTE: they are already in use because I made up some large whopper files of 10 GB each to test throughput (by using the recommended patches and sunalert cluster zip files; they are wonderful useless bit conglomerates in the latest Solaris U4, cause they are not needed hehhehheheh)
7. DESTRUCT TESTS:
The real fun begins now :
To start:
=> if all is well, it takes almost exactly 2 minutes to copy the 10 gig file from raidbox1 to raidbox2
a. Let's take a drive out and see what happens while copying one of those beauty 10-giggers.
=> if there is a hot spare, ready to rock, the copy finishes in about 2 minutes 30 seconds (similar performance), and the sync process to the hot spare starts about 10 seconds after pulling out the drive
=> if there is no hot spare, the machine pauses for anywhere from 1m30s to 2 minutes (pretty random), and then the copy continues and finishes with no errors after about 4 minutes.
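Turning those copy times into throughput makes the three cases easier to compare. A quick sketch, treating "10 gig" as 10 x 1024 MB (my assumption) and using integer arithmetic, so the figures are rounded down; mb_per_s is a made-up helper name.

```shell
# Average throughput in MB/s (integer) for a copy of SIZE_GB in SECONDS.
mb_per_s() {   # args: size_gb seconds
    echo $(( $1 * 1024 / $2 ))
}

echo "healthy pool:      $(mb_per_s 10 120) MB/s"   # 2 min
echo "with hot spare:    $(mb_per_s 10 150) MB/s"   # 2 min 30 s
echo "without hot spare: $(mb_per_s 10 240) MB/s"   # about 4 min, pause included
```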
Now, the problem is this: when I reinstall the drive (in our case I took out c0t3), zilch happens, i.e., the system does not automagically rebuild it or even see it.
To see this, execute:
iostat -zcnx 5
then wait a cycle, and then
cpu
us sy wt id
0 0 0 100
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 100 0 c0t3d0
metastat reports:
d104: Mirror
Submirror 0: d14
State: Needs maintenance
Submirror 1: d24
State: Okay
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 71130069 blocks (33 GB)
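With six mirrors, picking the broken submirror out of the metastat output by eye gets tedious. A rough sketch that does it with awk, assuming the "Submirror N: name" / "State: ..." line pairs shown above; broken_submirrors is a name invented for this example.

```shell
# Print the name of every submirror whose state is "Needs maintenance".
# Reads metastat text on stdin.
broken_submirrors() {
    awk '$1 == "Submirror" { name = $3 }
         $1 == "State:" && /Needs maintenance/ && name != "" { print name }
         $1 == "State:" { name = "" }'
}

# Sample taken from the metastat output above.
broken_submirrors <<'EOF'
d104: Mirror
Submirror 0: d14
State: Needs maintenance
Submirror 1: d24
State: Okay
EOF
```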
Hmmmmm, although it does not appear to affect the system, the issue is a bit heavy. The SCSI bus somehow has some commands waiting for that drive, and I can't get rid of them. The drive is at 100 %w (but not doing much).
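That pattern (queued commands, 100 in the %w column, zero reads and writes) can be picked out of the iostat output mechanically. A sketch assuming the 11-column `iostat -xn` layout shown above; the name stuck_devices is made up.

```shell
# Print devices that have commands waiting and %w at 100 but do no I/O.
# Columns: r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
stuck_devices() {
    awk 'NF == 11 && $1 ~ /^[0-9.]+$/ &&
         $1 + $2 == 0 && $5 + 0 > 0 && $9 + 0 >= 100 { print $11 }'
}

# Sample taken from the iostat output above.
stuck_devices <<'EOF'
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 100 0 c0t3d0
EOF
```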
I tried devfsadm , then devfsadm disks
(still same iostat output)
then I do this:
metadetach -f d104 d14 (-f cause if not it bitches that I am trying to detach something which aint workin)
metaclear d14
d14: Concat/Stripe is cleared
(still same iostat output)
now: format
choose disk c0t3, and it tells me:
Specify disk (enter its number): 3
selecting c0t3d0
[disk formatted]
/dev/dsk/c0t3d0s2 is part of active ZFS pool raidbox1. Please see zpool(1M).
/dev/dsk/c0t3d0s8 is part of active ZFS pool raidbox1. Please see zpool(1M).
Weirdola.org. OK then, let's try to reset that drive (somehow):
tried fdisk, label, partition, but still the same output from iostat.
OK, let's try using zpool to "steal the drive", and let's see what happens.
bash-3.00# zpool create test6 c0t3d0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c0t3d0s2 is part of active ZFS pool raidbox1. Please see zpool(1M).
/dev/dsk/c0t3d0s8 is part of active ZFS pool raidbox1. Please see zpool(1M).
hmmm, ok, lets do a -f.....
bash-3.00# zpool create -f test6 c0t3d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c0t3d0s2 is part of active ZFS pool raidbox1. Please see zpool(1M).
/dev/dsk/c0t3d0s8 is part of active ZFS pool raidbox1. Please see zpool(1M).
Weirdness, because the drive is A GONER, i.e. it is NOT part of anything: zpool status is all nice, and the disk has been taken out of the d104 mirror.
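A likely reason zpool still complains: ZFS keeps four copies of its label on each device, two at the front and two at the end, 256 KB each, and pulling the disk out of the mirror does not erase them. The sketch below only computes where those four labels would sit for a device of a given size; whether stale labels are really what zpool trips over here is my guess, and label_offsets is an invented helper.

```shell
LABEL=262144   # 256 KB per ZFS vdev label

# Print the byte offsets of the four vdev labels for a device of SIZE bytes.
label_offsets() {   # arg: device size in bytes
    psize=$(( $1 / LABEL * LABEL ))   # usable size, aligned down to 256 KB
    echo 0
    echo "$LABEL"
    echo $(( psize - 2 * LABEL ))
    echo $(( psize - LABEL ))
}

# s2 on c0t3d0 is 71130069 blocks of 512 bytes.
label_offsets $(( 71130069 * 512 ))
```

Where zdb is available, `zdb -l /dev/dsk/c0t3d0s2` should dump whatever labels are still on the slice.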
hmmm, ok, let's totally overwrite all partitions again, and reformat the thingy.
I noticed that zpool uses partition 8 (which is only accessible by typing format -e), so let's try that:
Current partition table (original):
Total disk cylinders available: 24619 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 1 - 91 128.37MB (91/0/0) 262899
1 swap wu 92 - 182 128.37MB (91/0/0) 262899
2 backup wu 0 - 24620 33.92GB (24621/0/0) 71130069
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 usr wm 183 - 24618 33.66GB (24436/0/0) 70595604
7 unassigned wm 0 0 (0/0/0) 0
8 boot wu 0 - 0 1.41MB (1/0/0) 2889
9 alternates wm 0 0 (0/0/0) 0
Let's delete them all, except 2.
When I selected 8, it did not let me do crap, and I could not relabel it,
so I quit format and re-executed it, then typed fdisk and deleted the partition.
Then I quit fdisk and re-entered fdisk.
Then I made one 100% Solaris partition, and when done, typed label, and I saw this:
format> label
[0] SMI Label
[1] EFI Label
Specify Label type[0]: 1
Warning: This disk has an SMI label. Changing to EFI label will erase all
current partitions.
Continue? y
format>
So, I made it an EFI label (some kind of extended crap). Now let's go into partitions:
partition> print
Current partition table (default):
Total disk sectors available: 71116540 + 16384 (reserved sectors)
Part Tag Flag First Sector Size Last Sector
0 unassigned wm 0 0 0
1 unassigned wm 0 0 0
2 unassigned wm 0 0 0
3 unassigned wm 0 0 0
4 unassigned wm 0 0 0
5 unassigned wm 0 0 0
6 unassigned wm 0 0 0
7 unassigned wm 0 0 0
8 reserved wm 71116540 8.00MB 71132923
weird, partition 2 is a goner. Anyway, let's get rid of the 8 beast.
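The numbers in that EFI table are at least consistent with each other, which a bit of shell arithmetic confirms. This is just a sanity check on the figures printed above, assuming 512-byte sectors. As far as I know an EFI label has no whole-disk "backup" slice 2, which would explain why it vanished.

```shell
usable=71116540    # "Total disk sectors available"
reserved=16384     # reserved sectors held by slice 8

echo "slice 8 size: $(( reserved * 512 / 1048576 )) MB"
echo "slice 8 last sector: $(( usable + reserved - 1 ))"
```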
partition> 8
Part Tag Flag First Sector Size Last Sector
8 reserved wm 71116540 8.00MB 71132923
Enter partition id tag[reserved]:
Enter partition permission flags[wm]:
Enter new starting Sector[71116540]: 0
`0' is out of range.
Enter new starting Sector[71116540]:
Enter partition size[16384b, 71132923e, 8mb, 0gb, 0tb]: 0c
hehe
now print:
partition> print
Current partition table (unnamed):
Total disk sectors available: 71116540 + 16384 (reserved sectors)
Part Tag Flag First Sector Size Last Sector
0 unassigned wm 0 0 0
1 unassigned wm 0 0 0
2 unassigned wm 0 0 0
3 unassigned wm 0 0 0
4 unassigned wm 0 0 0
5 unassigned wm 0 0 0
6 unassigned wm 0 0 0
7 unassigned wm 0 0 0
8 unassigned wm 0 0 0
Goner! Noice.
so, now that I am done, and they are all gone (hopefully):
format> label
[0] SMI Label
[1] EFI Label
Specify Label type[1]: 0
Warning: This disk has an EFI label. Changing to SMI label will erase all
current partitions.
Continue? y
Auto configuration via format.dat[no]? no
Auto configuration via generic SCSI-2[no]? no
You must use fdisk to delete the current EFI partition and create a new
Solaris partition before you can convert the label.
format>
ok, I deleted it, recreated it, and :
format> label
[0] SMI Label
[1] EFI Label
Specify Label type[1]: 0
Warning: This disk has an EFI label. Changing to SMI label will erase all
current partitions.
Continue? y
Auto configuration via format.dat[no]?
Auto configuration via generic SCSI-2[no]?
Warning: error writing VTOC.
Illegal request during read: block 71146028 (0x43d9a2c) (24626/14/16)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Illegal request during read: block 71146030 (0x43d9a2e) (24626/14/18)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Illegal request during read: block 71146032 (0x43d9a30) (24626/14/20)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Illegal request during read: block 71146034 (0x43d9a32) (24626/14/22)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Illegal request during read: block 71146036 (0x43d9a34) (24626/14/24)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Warning: no backup labels
Label failed.
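The block numbers in those errors can be checked against the disk geometry. ASC 0x21 is the SCSI code for "logical block address out of range", and the arithmetic below suggests that is exactly what is going on: the cylinder/head/sector triples, converted with the SUN36G geometry (27 heads, 107 sectors), land past the last sector shown in the EFI table earlier. The helper chs_to_block is made up for this sketch.

```shell
HEADS=27       # SUN36G geometry from the format listing
SECTORS=107

# Convert a cylinder/head/sector triple to a block number.
chs_to_block() {   # args: cylinder head sector
    echo $(( ($1 * HEADS + $2) * SECTORS + $3 ))
}

last_sector=71132923   # last sector in the EFI partition table
blk=$(chs_to_block 24626 14 16)
echo "block $blk"
if [ "$blk" -gt "$last_sector" ]; then
    echo "beyond the last sector by $(( blk - last_sector )) blocks"
fi
```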
OOOPS: I will now delete that partition, exit format, and re-execute the format command without the -e option.
Specify disk (enter its number): 3
selecting c0t3d0
[disk formatted]
Disk not labeled. Label it now? y
Warning: error setting drive geometry.
Warning: error writing VTOC.
Illegal request during read: block 71135741 (0x43d71fd) (24622/26/1)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Illegal request during read: block 71135743 (0x43d71ff) (24622/26/3)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Illegal request during read: block 71135745 (0x43d7201) (24622/26/5)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Illegal request during read: block 71135747 (0x43d7203) (24622/26/7)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Illegal request during read: block 71135749 (0x43d7205) (24622/26/9)
ASC: 0x21 ASCQ: 0x0
Warning: error reading backup label.
Warning: no backup labels
Write label failed
FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
fdisk - run the fdisk program
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
format>
Woohooo, excellent. ok lets see....
format> fdisk
No fdisk table exists. The default partition for the disk is:
a 100% "SOLARIS System" partition
Type "y" to accept the default partition, otherwise type "n" to edit the
partition table.
y
format> label
Ready to label disk, continue? y
format>
OK, looking better... now, lets check the partitions:
partition> print
Current partition table (original):
Total disk cylinders available: 24609 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 unassigned wm 1 - 91 128.37MB (91/0/0) 262899
1 unassigned wm 92 - 182 128.37MB (91/0/0) 262899
2 backup wu 0 - 24621 33.92GB (24622/0/0) 71132958
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 unassigned wm 183 - 24619 33.66GB (24437/0/0) 70598493
7 unassigned wm 0 0 (0/0/0) 0
8 boot wu 0 - 0 1.41MB (1/0/0) 2889
9 unassigned wm 0 0 (0/0/0) 0
crap, it still has that impossible-to-reach partition 8. hmmm
but ok, let's see..... let's quickly type label again,
then exit format, and now let's see if zpool will work (after all the destruction we caused, maybe zpool will finally *not* see the magic bits)
bash-3.00# zpool create -f test6 c0t3d0
bash-3.00#
Whoa
hurray, now lets check iostat:
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 100 0 c0t3d0
crap, same shit.
OK, back for some more action tomorrow. We gotta get this drive back to normal WITHOUT REBOOTING!