I have a server with a RAID controller that connects to a JBOD over SAS.
The JBOD is populated with 16 SAS HDDs of 3TB each (8 of one make and 8 of another, same rpm).
I have configured 3 logical drives as RAID-5, each with 5 physical HDDs, and the remaining drive as a hot spare.
To test it, I run the following script:
for i in 1 10 50 100 1000
do
    for j in a b c
    do
        dd if=/dev/zero of=/dev/sd$j bs=1G count=$i
    done
done
Every time I run this script, it runs fine for count=1, 10 and 50.
But with 100G, writes to the virtual drives fail at random. Sometimes the write to /dev/sda completes without an error but the write to /dev/sdb crashes. Sometimes the writes to /dev/sda and /dev/sdb complete but the write to /dev/sdc fails.
I suspect that my RAID card might be faulty, because I have already tested the hard disks individually by attaching them directly to the server and running "dd" over the full 3TB.
What do you guys suggest?
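One way to narrow this down is to have the loop report exactly which device and size failed, and to stop at the first error. The following is only a sketch: write_test is a made-up helper name, and conv=fsync (a GNU dd option) is an addition that asks dd to flush the data before it exits, on the assumption that a failed flush then shows up in dd's exit status.

```shell
#!/bin/sh
# write_test TARGET BLOCK_SIZE COUNT
# Hypothetical helper: writes COUNT blocks of zeros to TARGET and prints
# one OK/FAIL line on stdout. dd's own statistics still go to stderr.
# WARNING: this overwrites TARGET, exactly like the original loop.
write_test() {
    target=$1
    bs=$2
    count=$3
    if dd if=/dev/zero of="$target" bs="$bs" count="$count" conv=fsync; then
        echo "OK $target count=$count"
    else
        echo "FAIL $target count=$count"
        return 1
    fi
}

# Usage with the devices and sizes from the question (destructive):
#   for i in 1 10 50 100 1000; do
#       for j in a b c; do
#           write_test /dev/sd$j 1G $i || exit 1
#       done
#   done
```

Stopping at the first FAIL line keeps the log short enough to match against the timestamps in /var/log/messages.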
EDIT:
Server motherboard make/model: SuperMicro X8DTH-6F
RAID controller: LSI MegaRaid SAS 9285-8e with BBU
JBOD: SuperMicro JBOD SC836E26-R1200B
SAS HDDs: 3.5", 6Gbps SAS, 3TB, 7200rpm (8x Seagate ST3000NM0023, 8x Hitachi Ultrastar 7K3000)
OS: Scientific Linux 6.3
The JBOD is connected to the RAID controller through a 6Gbps SAS cable.
EDIT 2:
Here is the text from /var/log/messages:
May 1 18:14:41 fileserver udevd[875]: worker [4083] unexpectedly returned with status 0x0100
May 1 18:14:41 fileserver udevd[875]: worker [4083] failed while handling '/devices/pci0000:00/0000:00:03.0/0000:08:00.0/host0/target0:2:0/0:2:0:0/block/sda'
May 1 18:14:43 fileserver kernel: megasas: Found FW in FAULT state, will reset adapter.
May 1 18:14:43 fileserver kernel: megaraid_sas: resetting fusion adapter.
May 1 18:15:17 fileserver kernel: INFO: task dd:4144 blocked for more than 120 seconds.
May 1 18:15:17 fileserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 1 18:15:17 fileserver kernel: dd D 000000000000000d 0 4144 4132 0x00000004
May 1 18:15:17 fileserver kernel: ffff88060c513bf8 0000000000000082 0000000000000000 ffffea0015a3a8c0
May 1 18:15:17 fileserver kernel: ffff88062d837938 ffff88062d837848 ffff88062ce06ea0 ffff88062ce06ea0
May 1 18:15:17 fileserver kernel: ffff8806133385f8 ffff88060c513fd8 000000000000fb88 ffff8806133385f8
May 1 18:15:17 fileserver kernel: Call Trace:
May 1 18:15:17 fileserver kernel: [] ? sync_page+0x0/0x50
May 1 18:15:17 fileserver kernel: [] io_schedule+0x73/0xc0
May 1 18:15:17 fileserver kernel: [] sync_page+0x3d/0x50
May 1 18:15:17 fileserver kernel: [] __wait_on_bit+0x5f/0x90
May 1 18:15:17 fileserver kernel: [] wait_on_page_bit+0x73/0x80
May 1 18:15:17 fileserver kernel: [] ? wake_bit_function+0x0/0x50
May 1 18:15:17 fileserver kernel: [] ? pagevec_lookup_tag+0x25/0x40
May 1 18:15:17 fileserver kernel: [] wait_on_page_writeback_range+0xfb/0x190
May 1 18:15:17 fileserver kernel: [] filemap_fdatawait+0x2f/0x40
May 1 18:15:17 fileserver kernel: [] filemap_write_and_wait+0x44/0x60
May 1 18:15:17 fileserver kernel: [] __sync_blockdev+0x24/0x50
May 1 18:15:17 fileserver kernel: [] sync_blockdev+0x13/0x20
May 1 18:15:17 fileserver kernel: [] __blkdev_put+0x178/0x1b0
May 1 18:15:17 fileserver kernel: [] ? fsnotify+0x113/0x160
May 1 18:15:17 fileserver kernel: [] blkdev_put+0x10/0x20
May 1 18:15:17 fileserver kernel: [] blkdev_close+0x33/0x60
May 1 18:15:17 fileserver kernel: [] __fput+0xf5/0x210
May 1 18:15:17 fileserver kernel: [] fput+0x25/0x30
May 1 18:15:17 fileserver kernel: [] filp_close+0x5d/0x90
May 1 18:15:17 fileserver kernel: [] sys_close+0xa5/0x100
May 1 18:15:17 fileserver kernel: [] system_call_fastpath+0x16/0x1b
May 1 18:15:17 fileserver kernel: INFO: task scsi_id:4145 blocked for more than 120 seconds.
May 1 18:15:17 fileserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 1 18:15:17 fileserver kernel: scsi_id D 0000000000000013 0 4145 1 0x00000000
May 1 18:15:17 fileserver kernel: ffff880c1a145c18 0000000000000086 0000000000000000 ffff880c00000002
May 1 18:15:17 fileserver kernel: ffff880c1a145be8 00007f9200000066 00000000007fffff ffffffff81fc7990
May 1 18:15:17 fileserver kernel: ffff880c2ee3b098 ffff880c1a145fd8 000000000000fb88 ffff880c2ee3b098
May 1 18:15:17 fileserver kernel: Call Trace:
May 1 18:15:17 fileserver kernel: [] __mutex_lock_slowpath+0x13e/0x180
May 1 18:15:17 fileserver kernel: [] ? exact_match+0x0/0x10
May 1 18:15:17 fileserver kernel: [] mutex_lock+0x2b/0x50
May 1 18:15:17 fileserver kernel: [] __blkdev_get+0x68/0x3c0
May 1 18:15:17 fileserver kernel: [] ? blkdev_open+0x0/0xc0
May 1 18:15:17 fileserver kernel: [] blkdev_get+0x10/0x20
May 1 18:15:17 fileserver kernel: [] blkdev_open+0x71/0xc0
May 1 18:15:17 fileserver kernel: [] __dentry_open+0x10a/0x360
May 1 18:15:17 fileserver kernel: [] ? selinux_inode_permission+0x72/0xb0
May 1 18:15:17 fileserver kernel: [] ? security_inode_permission+0x1f/0x30
May 1 18:15:17 fileserver kernel: [] nameidata_to_filp+0x54/0x70
May 1 18:15:17 fileserver kernel: [] do_filp_open+0x6c0/0xd60
May 1 18:15:17 fileserver kernel: [] ? __do_page_fault+0x1ec/0x480
May 1 18:15:17 fileserver kernel: [] ? cpumask_any_but+0x31/0x50
May 1 18:15:17 fileserver kernel: [] ? unmap_region+0x110/0x130
May 1 18:15:17 fileserver kernel: [] ? alloc_fd+0x92/0x160
May 1 18:15:17 fileserver kernel: [] do_sys_open+0x69/0x140
May 1 18:15:17 fileserver kernel: [] sys_open+0x20/0x30
May 1 18:15:17 fileserver kernel: [] system_call_fastpath+0x16/0x1b
May 1 18:15:17 fileserver kernel: INFO: task fdisk:4176 blocked for more than 120 seconds.
May 1 18:15:17 fileserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 1 18:15:17 fileserver kernel: fdisk D 0000000000000013 0 4176 4147 0x00000004
May 1 18:15:17 fileserver kernel: ffff880c2f56dc18 0000000000000082 0000000000000000 ffff880c30802078
May 1 18:15:17 fileserver kernel: ffff880c2f56dbe8 ffff880c00000721 00000000007fffff ffffffff81fc8368
May 1 18:15:17 fileserver kernel: ffff880c1d6af058 ffff880c2f56dfd8 000000000000fb88 ffff880c1d6af058
May 1 18:15:17 fileserver kernel: Call Trace:
May 1 18:15:17 fileserver kernel: [] __mutex_lock_slowpath+0x13e/0x180
May 1 18:15:17 fileserver kernel: [] ? exact_match+0x0/0x10
May 1 18:15:17 fileserver kernel: [] mutex_lock+0x2b/0x50
May 1 18:15:17 fileserver kernel: [] __blkdev_get+0x68/0x3c0
May 1 18:15:17 fileserver kernel: [] ? blkdev_open+0x0/0xc0
May 1 18:15:17 fileserver kernel: [] blkdev_get+0x10/0x20
May 1 18:15:17 fileserver kernel: [] blkdev_open+0x71/0xc0
May 1 18:15:17 fileserver kernel: [] __dentry_open+0x10a/0x360
May 1 18:15:17 fileserver kernel: [] ? selinux_inode_permission+0x72/0xb0
May 1 18:15:17 fileserver kernel: [] ? security_inode_permission+0x1f/0x30
May 1 18:15:17 fileserver kernel: [] nameidata_to_filp+0x54/0x70
May 1 18:15:17 fileserver kernel: [] do_filp_open+0x6c0/0xd60
May 1 18:15:17 fileserver kernel: [] ? __do_page_fault+0x1ec/0x480
May 1 18:15:17 fileserver kernel: [] ? pde_users_dec+0x25/0x60
May 1 18:15:17 fileserver kernel: [] ? alloc_fd+0x92/0x160
May 1 18:15:17 fileserver kernel: [] do_sys_open+0x69/0x140
May 1 18:15:17 fileserver kernel: [] sys_open+0x20/0x30
May 1 18:15:17 fileserver kernel: [] system_call_fastpath+0x16/0x1b
May 1 18:16:27 fileserver kernel: megaraid_sas: Diag reset adapter never cleared!
May 1 18:17:17 fileserver kernel: INFO: task dd:4144 blocked for more than 120 seconds.
May 1 18:17:17 fileserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 1 18:17:17 fileserver kernel: dd D 000000000000000d 0 4144 4132 0x00000004
May 1 18:17:17 fileserver kernel: ffff88060c513bf8 0000000000000082 0000000000000000 ffffea0015a3a8c0
May 1 18:17:17 fileserver kernel: ffff88062d837938 ffff88062d837848 ffff88062ce06ea0 ffff88062ce06ea0
May 1 18:17:17 fileserver kernel: ffff8806133385f8 ffff88060c513fd8 000000000000fb88 ffff8806133385f8
May 1 18:17:17 fileserver kernel: Call Trace:
May 1 18:17:17 fileserver kernel: [] ? sync_page+0x0/0x50
May 1 18:17:17 fileserver kernel: [] io_schedule+0x73/0xc0
May 1 18:17:17 fileserver kernel: [] sync_page+0x3d/0x50
May 1 18:17:17 fileserver kernel: [] __wait_on_bit+0x5f/0x90
May 1 18:17:17 fileserver kernel: [] wait_on_page_bit+0x73/0x80
May 1 18:17:17 fileserver kernel: [] ? wake_bit_function+0x0/0x50
May 1 18:17:17 fileserver kernel: [] ? pagevec_lookup_tag+0x25/0x40
May 1 18:17:17 fileserver kernel: [] wait_on_page_writeback_range+0xfb/0x190
May 1 18:17:17 fileserver kernel: [] filemap_fdatawait+0x2f/0x40
May 1 18:17:17 fileserver kernel: [] filemap_write_and_wait+0x44/0x60
May 1 18:17:17 fileserver kernel: [] __sync_blockdev+0x24/0x50
May 1 18:17:17 fileserver kernel: [] sync_blockdev+0x13/0x20
May 1 18:17:17 fileserver kernel: [] __blkdev_put+0x178/0x1b0
May 1 18:17:17 fileserver kernel: [] ? fsnotify+0x113/0x160
May 1 18:17:17 fileserver kernel: [] blkdev_put+0x10/0x20
May 1 18:17:17 fileserver kernel: [] blkdev_close+0x33/0x60
May 1 18:17:17 fileserver kernel: [] __fput+0xf5/0x210
May 1 18:17:17 fileserver kernel: [] fput+0x25/0x30
May 1 18:17:17 fileserver kernel: [] filp_close+0x5d/0x90
May 1 18:17:17 fileserver kernel: [] sys_close+0xa5/0x100
May 1 18:17:17 fileserver kernel: [] system_call_fastpath+0x16/0x1b
May 1 18:17:17 fileserver kernel: INFO: task scsi_id:4145 blocked for more than 120 seconds.
May 1 18:17:17 fileserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 1 18:17:17 fileserver kernel: scsi_id D 0000000000000013 0 4145 1 0x00000000
May 1 18:17:17 fileserver kernel: ffff880c1a145c18 0000000000000086 0000000000000000 ffff880c00000002
May 1 18:17:17 fileserver kernel: ffff880c1a145be8 00007f9200000066 00000000007fffff ffffffff81fc7990
May 1 18:17:17 fileserver kernel: ffff880c2ee3b098 ffff880c1a145fd8 000000000000fb88 ffff880c2ee3b098
May 1 18:17:17 fileserver kernel: Call Trace:
May 1 18:17:17 fileserver kernel: [] __mutex_lock_slowpath+0x13e/0x180
May 1 18:17:17 fileserver kernel: [] ? exact_match+0x0/0x10
May 1 18:17:17 fileserver kernel: [] mutex_lock+0x2b/0x50
May 1 18:17:17 fileserver kernel: [] __blkdev_get+0x68/0x3c0
May 1 18:17:17 fileserver kernel: [] ? blkdev_open+0x0/0xc0
May 1 18:17:17 fileserver kernel: [] blkdev_get+0x10/0x20
May 1 18:17:17 fileserver kernel: [] blkdev_open+0x71/0xc0
May 1 18:17:17 fileserver kernel: [] __dentry_open+0x10a/0x360
May 1 18:17:17 fileserver kernel: [] ? selinux_inode_permission+0x72/0xb0
May 1 18:17:17 fileserver kernel: [] ? security_inode_permission+0x1f/0x30
May 1 18:17:17 fileserver kernel: [] nameidata_to_filp+0x54/0x70
May 1 18:17:17 fileserver kernel: [] do_filp_open+0x6c0/0xd60
May 1 18:17:17 fileserver kernel: [] ? __do_page_fault+0x1ec/0x480
May 1 18:17:17 fileserver kernel: [] ? cpumask_any_but+0x31/0x50
May 1 18:17:17 fileserver kernel: [] ? unmap_region+0x110/0x130
May 1 18:17:17 fileserver kernel: [] ? alloc_fd+0x92/0x160
May 1 18:17:17 fileserver kernel: [] do_sys_open+0x69/0x140
May 1 18:17:17 fileserver kernel: [] sys_open+0x20/0x30
May 1 18:17:17 fileserver kernel: [] system_call_fastpath+0x16/0x1b
May 1 18:17:17 fileserver kernel: INFO: task fdisk:4176 blocked for more than 120 seconds.
May 1 18:17:17 fileserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 1 18:17:17 fileserver kernel: fdisk D 0000000000000013 0 4176 4147 0x00000004
May 1 18:17:17 fileserver kernel: ffff880c2f56dc18 0000000000000082 0000000000000000 ffff880c30802078
May 1 18:17:17 fileserver kernel: ffff880c2f56dbe8 ffff880c00000721 00000000007fffff ffffffff81fc8368
May 1 18:17:17 fileserver kernel: ffff880c1d6af058 ffff880c2f56dfd8 000000000000fb88 ffff880c1d6af058
May 1 18:17:17 fileserver kernel: Call Trace:
May 1 18:17:17 fileserver kernel: [] __mutex_lock_slowpath+0x13e/0x180
May 1 18:17:17 fileserver kernel: [] ? exact_match+0x0/0x10
May 1 18:17:17 fileserver kernel: [] mutex_lock+0x2b/0x50
May 1 18:17:17 fileserver kernel: [] __blkdev_get+0x68/0x3c0
May 1 18:17:17 fileserver kernel: [] ? blkdev_open+0x0/0xc0
May 1 18:17:17 fileserver kernel: [] blkdev_get+0x10/0x20
May 1 18:17:17 fileserver kernel: [] blkdev_open+0x71/0xc0
May 1 18:17:17 fileserver kernel: [] __dentry_open+0x10a/0x360
May 1 18:17:17 fileserver kernel: [] ? selinux_inode_permission+0x72/0xb0
May 1 18:17:17 fileserver kernel: [] ? security_inode_permission+0x1f/0x30
May 1 18:17:17 fileserver kernel: [] nameidata_to_filp+0x54/0x70
May 1 18:17:17 fileserver kernel: [] do_filp_open+0x6c0/0xd60
May 1 18:17:17 fileserver kernel: [] ? __do_page_fault+0x1ec/0x480
May 1 18:17:17 fileserver kernel: [] ? pde_users_dec+0x25/0x60
May 1 18:17:17 fileserver kernel: [] ? alloc_fd+0x92/0x160
May 1 18:17:17 fileserver kernel: [] do_sys_open+0x69/0x140
May 1 18:17:17 fileserver kernel: [] sys_open+0x20/0x30
May 1 18:17:17 fileserver kernel: [] system_call_fastpath+0x16/0x1b
May 1 18:18:11 fileserver kernel: megaraid_sas: Diag reset adapter never cleared!
...
...
I can't make anything of it; it's a cry for help!
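Most of the output above is hung-task stack traces from dd, scsi_id and fdisk. The lines that come from the RAID driver itself are the ones tagged megasas or megaraid_sas. A small sketch for pulling just those out of the log (driver_lines is a made-up helper name, not an existing tool):

```shell
# Print only the RAID driver messages from a syslog stream on stdin.
# driver_lines is a hypothetical helper name used for illustration.
driver_lines() {
    grep -E 'kernel: (megasas|megaraid_sas):'
}

# Usage:
#   driver_lines < /var/log/messages
```

On the excerpt above this leaves the "Found FW in FAULT state", "resetting fusion adapter" and "Diag reset adapter never cleared!" messages, which makes the sequence of adapter resets easier to follow.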
Answer
It turned out that Scientific Linux 6.3 on Supermicro servers has some issues with PCIe.
A friend suggested adding the following two options to the kernel line in GRUB:
- pcie_aspm=off
- disable_msi=1
After booting with these options, everything started working fine.
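To confirm that the running kernel was really booted with these options, /proc/cmdline can be checked. The following is a sketch: has_boot_arg is a made-up helper, and the grubby command in the comment is an assumption about how the options might be applied on a RHEL 6 derivative where grubby is installed (it changes the boot configuration, so it is not run here).

```shell
# has_boot_arg ARG [CMDLINE]
# Hypothetical helper: succeeds if ARG appears as a whole word in the
# kernel command line. CMDLINE defaults to the contents of /proc/cmdline.
has_boot_arg() {
    arg=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $arg "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed way to add the options to every installed kernel (run as root):
#   grubby --update-kernel=ALL --args="pcie_aspm=off disable_msi=1"
#
# After a reboot:
#   has_boot_arg pcie_aspm=off && echo "pcie_aspm=off is active"
#   has_boot_arg disable_msi=1 && echo "disable_msi=1 is active"
```

Matching on whole words avoids a false positive from a longer option that merely contains the same text.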
Any thoughts?