Wednesday, February 1, 2017

linux - iSCSI timeouts under high load



I have two servers connected via Gigabit Ethernet: one is the iSCSI target, the other is the initiator. When I run mkfs.ext4 on the initiator, disk I/O on the target slows to a crawl after a while. On the target host I see the following in syslog:



Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119668c 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119668c 6
Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119668d 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119668d 6
Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119668e 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119668e 6

Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 1196696 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 1196696 6
Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119669e 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119669e 6
Sep 14 09:40:04 sh11 tgtd: abort_task_set(1139) found 119669f 0
Sep 14 09:40:04 sh11 tgtd: abort_cmd(1115) found 119669f 6
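
To get a feel for how quickly commands are being aborted, the syslog lines above can be tallied per second. The following is only a minimal sketch: the regular expression is written against the exact line shape shown above, and the reading of the field after "found" as a command identifier is an assumption, not something the log itself states. The function name `count_aborts_per_second` is made up for this illustration.

```python
import re
from collections import Counter

# Matches lines shaped like the ones in the log excerpt, e.g.
#   Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119668c 6
# The hex field after "found" is assumed to identify the aborted command.
ABORT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+tgtd:\s+"
    r"abort_cmd\(\d+\)\s+found\s+(?P<ident>[0-9a-f]+)\s+\d+"
)

def count_aborts_per_second(lines):
    """Count abort_cmd log entries per syslog timestamp (one-second resolution)."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.match(line.strip())
        if match:
            counts[match.group("ts")] += 1
    return counts

# Sample taken from the excerpt above; abort_task_set lines are ignored.
SAMPLE_LOG = """\
Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119668c 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119668c 6
Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119668d 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119668d 6
Sep 14 09:40:04 sh11 tgtd: abort_task_set(1139) found 119669f 0
Sep 14 09:40:04 sh11 tgtd: abort_cmd(1115) found 119669f 6
"""

if __name__ == "__main__":
    for timestamp, n in count_aborts_per_second(SAMPLE_LOG.splitlines()).items():
        print(timestamp, n)
```

On a real system the same function could be fed the lines of the syslog file instead of the sample string; the file's location depends on the syslog configuration.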


The load average climbs to 12 or more:




# uptime
12:37:00 up 23 days, 13:25, 1 user, load average: 12.00, 7.00, 4.00
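
A load average is easier to judge relative to the number of CPUs. The sketch below parses a line of `uptime` output and divides each figure by a CPU count. The count of 1 is an assumption about the Pentium 4 in the target host, and the helper names are hypothetical. Keep in mind that on Linux the load average also counts tasks blocked in uninterruptible I/O wait, so a high figure can point to stalled disk I/O rather than CPU saturation.

```python
import re

def parse_load_averages(uptime_line):
    """Return the 1, 5 and 15 minute load averages from one line of uptime output."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())

def load_per_cpu(uptime_line, cpu_count):
    """Scale each load average by the number of CPUs (cpu_count is supplied by the caller)."""
    return tuple(value / cpu_count for value in parse_load_averages(uptime_line))

if __name__ == "__main__":
    line = "12:37:00 up 23 days, 13:25, 1 user, load average: 12.00, 7.00, 4.00"
    # cpu_count=1 is an assumption about the target host, not taken from the post.
    print(load_per_cpu(line, cpu_count=1))
```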



  • CentOS 6.3

  • tgtd 1.0.24

  • Intel Pentium 4 2.4GHz

  • 1 GB RAM

  • 2 TB WD Caviar Green SATA 2.0





# lspci
00:00.0 Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 02)
00:01.0 PCI bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE Host-to-AGP Bridge (rev 02)
00:1d.0 USB controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 02)
00:1d.1 USB controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 02)
00:1d.2 USB controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 02)
00:1d.7 USB controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 02)

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 82)
00:1f.0 ISA bridge: Intel Corporation 82801DB/DBL (ICH4/ICH4-L) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801DB (ICH4) IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI RV200 QW [Radeon 7500]
02:01.0 Ethernet controller: D-Link System Inc DGE-530T Gigabit Ethernet Adapter (rev 11) (rev 11)
02:02.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE/SATA Controller (rev 50)
02:03.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE/SATA Controller (rev 50)
02:04.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)

02:08.0 Ethernet controller: Intel Corporation 82801DB PRO/100 VE (CNR) Ethernet Controller (rev 82)



Is there a way to tune the target host to avoid these timeouts?



Update
The failing disk reports the following SMART attributes:





# smartctl -A /dev/sdb
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.32-279.2.1.el6.i686] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   167   167   021    Pre-fail  Always       -       6633
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       93
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9444
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       91
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       64
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       722663
194 Temperature_Celsius     0x0022   104   092   000    Old_age   Always       -       46
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
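
Reading an attribute table like the one above by eye is error-prone, so here is a minimal sketch that parses `smartctl -A` text and lists the attributes with the lowest normalized VALUE. It assumes the ten-column layout shown above; the function names are hypothetical. Normalized values are vendor-defined, so a low value is a hint worth a closer look, not proof of a fault, and a disk can be defective with no attribute below its threshold.

```python
def parse_smart_attributes(text):
    """Parse attribute rows from `smartctl -A` output into a list of dicts."""
    attributes = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # banner and header lines are skipped by this check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attributes.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),   # normalized current value
            "worst": int(fields[4]),   # lowest normalized value recorded
            "thresh": int(fields[5]),  # failure threshold
            "raw": fields[9],          # raw counter, kept as text
        })
    return attributes

def lowest_normalized(attributes, limit=3):
    """Return the `limit` attributes with the smallest normalized value."""
    return sorted(attributes, key=lambda attr: attr["value"])[:limit]

# A few rows copied from the output above.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 9444
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 722663
194 Temperature_Celsius 0x0022 104 092 000 Old_age Always - 46
"""

if __name__ == "__main__":
    for attr in lowest_normalized(parse_smart_attributes(SAMPLE_SMART), limit=2):
        print(attr["id"], attr["name"], attr["value"], attr["raw"])
```

In the table above, Load_Cycle_Count stands out with a normalized value of 001 and a raw count of 722663, although its threshold of 000 means SMART does not flag it as failed.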


Answer



The WD Caviar Green disk had a defect that the SMART test did not detect. After the disk was replaced, the problem went away.

