Sunday, July 1, 2018

linux - Massive performance degradation on sustained sequential write



I'm in the process of migrating data into LUKS partitions. Now that the operating system drive is running off LUKS, I tried to start migrating the data drives. Then the server stopped responding.



This LUKS device was opened:



cryptsetup luksOpen /dev/sdc data1



And either of these commands strangled the server:



pv /dev/zero > /dev/mapper/data1
pv /dev/zero > /dev/sdc


Not immediately, but within seconds, the server became unusably slow. Everything blocked on I/O:



root@node51 [~]# ps aux | awk '{if($8~"D"||$8=="STAT"){print $0}}' 

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1197 0.0 0.0 0 0 ? D 06:39 0:00 [jbd2/dm-1-8]
root 1687 0.1 0.0 0 0 ? D 11:15 0:12 [kworker/u96:5]
root 13057 2.0 0.0 0 0 ? D 13:10 0:01 [dmcrypt_write]
root 13644 10.9 0.0 7452 784 pts/1 D+ 13:10 0:08 pv /dev/zero
root 14159 0.0 0.0 98256 6836 ? DNs 13:10 0:00 sshd: root [priv]
root 14772 0.0 0.0 29008 92 ? D 13:11 0:00 /usr/sbin/CRON -f
root 14773 0.0 0.0 98256 6748 ? DNs 13:11 0:00 sshd: root [priv]
root 15411 0.0 0.0 98256 6876 ? DNs 13:11 0:00 sshd: root [priv]
root 16009 0.1 0.0 98256 6840 ? DNs 13:11 0:00 sshd: root [priv]
root 16632 0.5 0.0 98256 6892 ? DNs 13:11 0:00 sshd: root [priv]
root 16900 0.0 0.0 5448 356 pts/3 D+ 13:11 0:00 awk {if($8~"D"||$8=="STAT"){print $0}}
root 28553 0.6 0.0 0 0 ? D 12:12 0:21 [txg_sync]


Of note, for about two seconds, pv reported that it was copying data at more than 2GiB/s. This was just the write-back cache filling up with dirty pages (confirmed by monitoring /proc/meminfo).
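
The backlog can be watched with something along these lines (Dirty and Writeback are the relevant /proc/meminfo fields):

# Watch the dirty-page and write-back totals once per second
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'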



Afterwards, pv recorded a normal 200MiB/s write speed, but it was still ahead by between 2GiB and 3GiB in the write-back cache.



The server load average jumped up past 10.00 because of all the I/O blocking going on.




It took a while to abort the pv write test because the write-back cache needed to drain, but right after the test was aborted, server performance returned to normal.
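
A blocking sync, roughly as below, makes it obvious when that backlog has finally drained:

# sync returns only once the dirty pages have been flushed to disk
time sync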



Interestingly, these commands don't cause the server to lag:



# Reads from dm-crypt block device
pv /dev/mapper/data1 > /dev/zero
# Reads from the raw block device
pv /dev/sdc > /dev/zero


# Writes to a control disk of a different model
pv /dev/zero > /dev/sdi
# Reads from a control disk
pv /dev/sdi > /dev/zero

# Writes to a file on a dm-crypt ext4 filesystem on a solid-state drive
pv /dev/zero > /tmp/empty
# Reads from that same solid-state drive
pv /dev/sda > /dev/zero
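
As a side note, a direct-I/O write (sketch below; the block size and count are arbitrary) bypasses the page cache entirely, which should help separate a write-back pile-up from a genuinely slow device or controller:

# Sequential write with O_DIRECT, so no write-back cache is involved
dd if=/dev/zero of=/dev/mapper/data1 bs=1M count=4096 oflag=direct status=progress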



I have these questions:




  1. Why do sustained sequential writes to this data disk slow down the server so much?

  2. How can I avoid bogging down the other disks when writing to a specific one or few?

  3. Why does this model of hard drive cause performance issues when other drives don't?



I have six disks of the same model (/dev/sdc, /dev/sdd, /dev/sde, /dev/sdf, /dev/sdg, and /dev/sdh) to encrypt and they will have sequential write workloads in the future, so I don't want the server to stall from this problem.







Additional Information



Quick Facts



Server: Dell PowerEdge T320



Kernel: Linux node51 4.4.0-22-generic #39-Ubuntu SMP Thu May 5 16:53:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux




Operating system: Ubuntu Server 16.04 LTS Xenial Xerus 64-bit



Problematic hard drive: Toshiba PH3500U-1I72



I have six of these disks, all known to be healthy. I tested two of them and experienced a server-wide I/O performance drop with both. They read and write at about 200MiB/s near the beginning of the disk.
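
For the record, a quick way to sanity-check a drive's raw sequential read speed (bypassing the page cache) is something like:

# Quick sequential read benchmark, bypassing the page cache
hdparm -t --direct /dev/sdc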



Control (problem-free) hard drive: Samsung SP1614C



This disk has a sustained write speed of 50MiB/s. Could it be that the problematic disk is too fast?




Disk controller: Dell PERC H310



Two solid-state drives and six problematic hard drives are connected to this controller, all of which are passed directly as AHCI. The control disk is connected to a SATA port built into the motherboard.
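
To double-check which controller each disk actually hangs off, lsscsi (if installed) and sysfs are enough; a sketch:

# SCSI host / channel / target / LUN for every disk
lsscsi
# The PCI device (controller) behind /dev/sdc
readlink -f /sys/block/sdc/device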



I/O schedulers



root@node51 [/tmp]# tail -n +1 /sys/block/sd*/queue/scheduler 
==> /sys/block/sda/queue/scheduler <==
noop [deadline] cfq

==> /sys/block/sdb/queue/scheduler <==
noop [deadline] cfq

==> /sys/block/sdc/queue/scheduler <==
[noop] deadline cfq

==> /sys/block/sdd/queue/scheduler <==
[noop] deadline cfq

==> /sys/block/sde/queue/scheduler <==
[noop] deadline cfq

==> /sys/block/sdf/queue/scheduler <==
[noop] deadline cfq

==> /sys/block/sdg/queue/scheduler <==
[noop] deadline cfq

==> /sys/block/sdh/queue/scheduler <==
[noop] deadline cfq

==> /sys/block/sdi/queue/scheduler <==
noop [deadline] cfq


Changing the scheduler of /dev/sdc from noop to deadline made no perceptible difference. Changing it to cfq seemed to reduce the delay somewhat, but I/O operations on the other disks still suffered.
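
(The scheduler changes above were made on the fly, along these lines:)

# Switch /dev/sdc to the deadline elevator at runtime
echo deadline > /sys/block/sdc/queue/scheduler
cat /sys/block/sdc/queue/scheduler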



vm.dirty* kernel parameters




root@node51 [~]# sysctl -a | grep 'vm.dirty'
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200

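
If the machine has tens of gigabytes of RAM, the default 20% dirty_ratio allows several gigabytes of dirty pages to pile up before writers are throttled, which would match the 2GiB-3GiB backlog above. One obvious mitigation to experiment with is capping the backlog in absolute bytes (the values below are illustrative, not a recommendation):

# Cap the write-back backlog in absolute bytes (illustrative values only)
sysctl -w vm.dirty_background_bytes=67108864   # start background write-back at 64MiB
sysctl -w vm.dirty_bytes=268435456             # throttle writers at 256MiB of dirty data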


Examples of slowness detected and logged to /var/log/syslog



ZFS transaction group sync:



May 11 19:28:44 node51 kernel: [ 4080.179688] INFO: task txg_sync:3179 blocked for more than 120 seconds.
May 11 19:28:44 node51 kernel: [ 4080.179905] Tainted: P O 4.4.0-22-generic #39-Ubuntu
May 11 19:28:44 node51 kernel: [ 4080.180110] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 19:28:44 node51 kernel: [ 4080.180357] txg_sync D ffff88060b68baa8 0 3179 2 0x00000000
May 11 19:28:44 node51 kernel: [ 4080.180362] ffff88060b68baa8 ffff880616a96d00 ffff8806133ea940 ffff880603dc2940
May 11 19:28:44 node51 kernel: [ 4080.180366] ffff88060b68c000 ffff880616ad6d00 7fffffffffffffff ffff88056cb8c508
May 11 19:28:44 node51 kernel: [ 4080.180368] 0000000000000001 ffff88060b68bac0 ffffffff818211f5 0000000000000000
May 11 19:28:44 node51 kernel: [ 4080.180372] Call Trace:
May 11 19:28:44 node51 kernel: [ 4080.180381] [] schedule+0x35/0x80
May 11 19:28:44 node51 kernel: [ 4080.180385] [] schedule_timeout+0x1b5/0x270
May 11 19:28:44 node51 kernel: [ 4080.180390] [] ? default_wake_function+0x12/0x20
May 11 19:28:44 node51 kernel: [ 4080.180395] [] ? __wake_up_common+0x52/0x90
May 11 19:28:44 node51 kernel: [ 4080.180398] [] io_schedule_timeout+0xa4/0x110
May 11 19:28:44 node51 kernel: [ 4080.180412] [] cv_wait_common+0xbc/0x140 [spl]
May 11 19:28:44 node51 kernel: [ 4080.180416] [] ? wake_atomic_t_function+0x60/0x60
May 11 19:28:44 node51 kernel: [ 4080.180423] [] __cv_wait_io+0x18/0x20 [spl]
May 11 19:28:44 node51 kernel: [ 4080.180487] [] zio_wait+0x10e/0x1f0 [zfs]
May 11 19:28:44 node51 kernel: [ 4080.180528] [] dsl_pool_sync+0x2c6/0x430 [zfs]
May 11 19:28:44 node51 kernel: [ 4080.180573] [] spa_sync+0x366/0xb30 [zfs]
May 11 19:28:44 node51 kernel: [ 4080.180576] [] ? default_wake_function+0x12/0x20
May 11 19:28:44 node51 kernel: [ 4080.180623] [] txg_sync_thread+0x3ba/0x630 [zfs]
May 11 19:28:44 node51 kernel: [ 4080.180669] [] ? txg_delay+0x180/0x180 [zfs]
May 11 19:28:44 node51 kernel: [ 4080.180676] [] thread_generic_wrapper+0x71/0x80 [spl]
May 11 19:28:44 node51 kernel: [ 4080.180682] [] ? __thread_exit+0x20/0x20 [spl]
May 11 19:28:44 node51 kernel: [ 4080.180686] [] kthread+0xd8/0xf0
May 11 19:28:44 node51 kernel: [ 4080.180688] [] ? kthread_create_on_node+0x1e0/0x1e0
May 11 19:28:44 node51 kernel: [ 4080.180692] [] ret_from_fork+0x3f/0x70
May 11 19:28:44 node51 kernel: [ 4080.180694] [] ? kthread_create_on_node+0x1e0/0x1e0


ext4 journal:



May 11 20:46:46 node51 kernel: [ 6000.186474] INFO: task jbd2/dm-2-8:1148 blocked for more than 120 seconds.
May 11 20:46:46 node51 kernel: [ 6000.193164] Tainted: P O 4.4.0-22-generic #39-Ubuntu
May 11 20:46:46 node51 kernel: [ 6000.199950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 20:46:46 node51 kernel: [ 6000.208323] jbd2/dm-2-8 D ffff88060a6e7c98 0 1148 2 0x00000000
May 11 20:46:46 node51 kernel: [ 6000.208330] ffff88060a6e7c98 0000000000000246 ffff8806133eb700 ffff88060b561b80
May 11 20:46:46 node51 kernel: [ 6000.208333] ffff88060a6e8000 ffff88060aeb68b8 ffff88060a6e7d88 ffff88060a6e7d70
May 11 20:46:46 node51 kernel: [ 6000.208336] ffff88060b561b80 ffff88060a6e7cb0 ffffffff818211f5 ffff8805fd6af900
May 11 20:46:46 node51 kernel: [ 6000.208339] Call Trace:
May 11 20:46:46 node51 kernel: [ 6000.208355] [] schedule+0x35/0x80
May 11 20:46:46 node51 kernel: [ 6000.208361] [] jbd2_journal_commit_transaction+0x240/0x1870
May 11 20:46:46 node51 kernel: [ 6000.208365] [] ? dequeue_entity+0x431/0xa80
May 11 20:46:46 node51 kernel: [ 6000.208368] [] ? dequeue_task_fair+0x51a/0x8a0
May 11 20:46:46 node51 kernel: [ 6000.208372] [] ? wake_atomic_t_function+0x60/0x60
May 11 20:46:46 node51 kernel: [ 6000.208378] [] ? try_to_del_timer_sync+0x5e/0x90
May 11 20:46:46 node51 kernel: [ 6000.208381] [] kjournald2+0xca/0x250
May 11 20:46:46 node51 kernel: [ 6000.208384] [] ? wake_atomic_t_function+0x60/0x60
May 11 20:46:46 node51 kernel: [ 6000.208387] [] ? commit_timeout+0x10/0x10
May 11 20:46:46 node51 kernel: [ 6000.208391] [] kthread+0xd8/0xf0
May 11 20:46:46 node51 kernel: [ 6000.208394] [] ? kthread_create_on_node+0x1e0/0x1e0
May 11 20:46:46 node51 kernel: [ 6000.208397] [] ret_from_fork+0x3f/0x70
May 11 20:46:46 node51 kernel: [ 6000.208399] [] ? kthread_create_on_node+0x1e0/0x1e0
May 11 20:46:46 node51 kernel: [ 6292.776357] perf interrupt took too long (2539 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

Answer





The control disk is connected to a SATA port built into the motherboard.




As stated, the disks experiencing the journal-flush timeout issues are connected to the PERC, the same controller that the 'problematic' Toshibas are connected to.



The PERC H310 is only a basic hardware RAID card. Its CPU is probably easily overwhelmed; either that, or there is a firmware bug. Direct AHCI pass-through is not a very common usage.



I would suggest the I/O is locking up on the PERC, not in the OS.
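
One way to check that theory is to watch per-device latency during the pv write; if the high await/%util is confined to the PERC-attached disks while the on-board SATA disk stays responsive, that points at the controller rather than the kernel. Something like:

# Extended per-device statistics, refreshed every second (sysstat package)
iostat -x 1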




I have created a VPC in aws with a public subnet and a private subnet. The private subnet does not have direct access to external network. S...