Monday, April 22, 2019

raid - How do I recover from a faulted zpool where one device is OK, but was temporarily offline?



I have a zpool made up of four 2 TB USB disks in a raidz1 configuration:




[root@chef /mnt/Chef]# zpool status farcryz1
  pool: farcryz1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        farcryz1    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0


In order to test the pool, I simulated a drive failure by pulling the USB cable from one of the drives without taking it offline:



[root@chef /mnt/Chef]# zpool status farcryz1
  pool: farcryz1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        farcryz1    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da4     ONLINE      22     4     0
            da3     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0

errors: No known data errors


Data's still there, pool still online. Great! Now let's try to restore the pool. I plugged the drive back in, and issued the zpool replace command as I was instructed to above:




[root@chef /mnt/Chef]# zpool replace farcryz1 da4
invalid vdev specification
use '-f' to override the following errors:
/dev/da4 is part of active pool 'farcryz1'


Um... That's not helpful. So I tried a zpool clear farcryz1, but that didn't help at all: I still couldn't replace da4. Then I tried a combination of onlining, offlining, clearing, replacing, and scrubbing. Now I am stuck here:



[root@chef /mnt/Chef]# zpool status -v farcryz1
  pool: farcryz1
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h2m with 0 errors on Fri Sep 9 13:43:34 2011
config:

        NAME        STATE     READ WRITE CKSUM
        farcryz1    DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            da4     UNAVAIL      9     0     0  experienced I/O failures
            da3     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0

errors: No known data errors
[root@chef /mnt/Chef]# zpool replace farcryz1 da4
cannot replace da4 with da4: da4 is busy


How can I recover from this situation, where one device in my zpool was unexpectedly disconnected (but is not a failed device) and is now back again, ready to be resilvered?






EDIT: As requested, a tail of dmesg:



(ses3:umass-sim4:4:0:1): removing device entry
(da4:umass-sim4:4:0:0): removing device entry
ugen3.2: at usbus3
umass4: on usbus3
da4 at umass-sim4 bus 4 scbus6 target 0 lun 0
da4: Fixed Direct Access SCSI-6 device
da4: 400.000MB/s transfers
da4: 1907697MB (3906963456 512 byte sectors: 255H 63S/T 243197C)
ses3 at umass-sim4 bus 4 scbus6 target 0 lun 1
ses3: Fixed Enclosure Services SCSI-6 device
ses3: 400.000MB/s transfers
ses3: SCSI-3 SES Device
GEOM: da4: partition 1 does not start on a track boundary.
GEOM: da4: partition 1 does not end on a track boundary.
GEOM: da4: partition 1 does not start on a track boundary.
GEOM: da4: partition 1 does not end on a track boundary.
ugen3.2: at usbus3 (disconnected)
umass4: at uhub3, port 1, addr 1 (disconnected)
(da4:umass-sim4:4:0:0): lost device
(da4:umass-sim4:4:0:0): removing device entry
(ses3:umass-sim4:4:0:1): lost device
(ses3:umass-sim4:4:0:1): removing device entry
ugen3.2: at usbus3
umass4: on usbus3
da4 at umass-sim4 bus 4 scbus6 target 0 lun 0
da4: Fixed Direct Access SCSI-6 device
da4: 400.000MB/s transfers
da4: 1907697MB (3906963456 512 byte sectors: 255H 63S/T 243197C)
ses3 at umass-sim4 bus 4 scbus6 target 0 lun 1
ses3: Fixed Enclosure Services SCSI-6 device
ses3: 400.000MB/s transfers
ses3: SCSI-3 SES Device

Answer




"Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'."




It looks like, after the initial temporary failure, you may only have needed to run zpool clear to reset the error counters.
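For what it's worth, here is a rough sketch of how that recovery might look as a script. The pool and device names come from the question; the run helper and the DRY_RUN switch are my own additions (not part of ZFS), so that nothing touches the pool unless you explicitly ask for it. It assumes the disk came back under the same device name and was never wiped.

```shell
#!/bin/sh
# Sketch only: bring a temporarily disconnected disk back into its pool.
# POOL and DEV are taken from the question; DRY_RUN and run() are
# hypothetical helpers added for this example.
POOL="${POOL:-farcryz1}"
DEV="${DEV:-da4}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_device() {
    # Bring the device back online; ZFS resilvers whatever it missed.
    run zpool online "$POOL" "$DEV"
    # Reset the READ/WRITE/CKSUM error counters for that device.
    run zpool clear "$POOL" "$DEV"
    # Check the device state and the resilver progress.
    run zpool status -v "$POOL"
}

recover_device
```

With the default DRY_RUN=1 this only prints the three zpool commands; run it with DRY_RUN=0 to actually execute them. Once zpool status reports that the resilver has finished, a zpool scrub farcryz1 is a reasonable way to confirm that every block still matches its checksum.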




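If you want to pretend that it's a drive replacement, you probably need to wipe the drive before you try re-adding it to the pool. As long as the disk still carries its old label, ZFS sees it as a member of farcryz1, which is why zpool replace complained that /dev/da4 is part of an active pool.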

