I had some problems getting acceptable read/write performance from my RAID5 + crypt + ext4 setup and was finally able to track them down to the following problem:
- Hard disks 4x WD RED 3 TB WDC WD30EFRX-68EUZN0 as /dev/sd[efgh]
- sde and sdf are connected via controller A using a 3 Gb/s SATA link (even though 6 Gb/s would have been available)
- sdg and sdh are connected via controller B using a 6 Gb/s SATA link (link speeds verified as sketched below)
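(The negotiated link speed of each port can be double-checked via libata's sysfs attributes or the kernel log; the exact paths can vary by kernel version, so treat this as a sketch:)
# cat /sys/class/ata_link/link*/sata_spd
# dmesg | grep -i 'SATA link up'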
Write test, 4 runs per disk (everything as expected):
# dd if=/dev/zero of=/dev/sd[efgh] bs=2G count=1 oflag=dsync
sde: 2147479552 bytes (2.1 GB) copied, xxx s, [127, 123, 132, 127] MB/s
sdf: 2147479552 bytes (2.1 GB) copied, xxx s, [131, 130, 118, 137] MB/s
sdg: 2147479552 bytes (2.1 GB) copied, xxx s, [145, 145, 145, 144] MB/s
sdh: 2147479552 bytes (2.1 GB) copied, xxx s, [126, 132, 132, 132] MB/s
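(The [efgh] bracket above is just shorthand for one dd invocation per disk; it does not actually expand inside of=. Written out, assuming bash:)
# for d in sde sdf sdg sdh; do dd if=/dev/zero of=/dev/$d bs=2G count=1 oflag=dsync; done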
Read test using hdparm and dd (everything as expected):
# hdparm -tT /dev/sd[efgh]
# echo 3 | tee /proc/sys/vm/drop_caches; dd of=/dev/null if=/dev/sd[efgh] bs=2G count=1 iflag=fullblock
(sde)
Timing cached reads: xxx MB in 2.00 seconds = [13983.68, 14136.87] MB/sec
Timing buffered disk reads: xxx MB in 3.00 seconds = [143.16, 143.14] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [140, 141] MB/s
(sdf)
Timing cached reads: xxx MB in 2.00 seconds = [14025.80, 13995.14] MB/sec
Timing buffered disk reads: xxx MB in 3.00 seconds = [140.31, 140.61] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [145, 141] MB/s
(sdg)
Timing cached reads: xxx MB in 2.00 seconds = [14005.61, 13801.93] MB/sec
Timing buffered disk reads: xxx MB in 3.00 seconds = [153.11, 151.73] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [154, 155] MB/s
(sdh)
Timing cached reads: xxx MB in 2.00 seconds = [13816.84, 14335.93] MB/sec
Timing buffered disk reads: xxx MB in 3.00 seconds = [142.50, 142.12] MB/sec
2147483648 bytes (2.1 GB) copied, xxx s, [140, 140] MB/s
4× 32 GiB test partitions:
# gdisk -l /dev/sd[efgh]
GPT fdisk (gdisk) version 0.8.10
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
Found valid GPT with protective MBR; using GPT.
Disk /dev/sde: 5860533168 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): xxx
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5860533134
Partitions will be aligned on 2048-sector boundaries
Total free space is 5793424237 sectors (2.7 TiB)
Number Start (sector) End (sector) Size Code Name
1 2048 67110911 32.0 GiB FD00 Linux RAID
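(The command that created these 32 GiB test partitions isn't shown in the original; one plausible way, as a sketch, using sgdisk:)
# sgdisk --new=1:2048:+32G --typecode=1:FD00 /dev/sde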
# mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 --chunk=256K /dev/sd[efgh]1
(some tests later ...)
# mdadm --grow --verbose /dev/md0 --layout=right-asymmetric
# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sat Dec 10 03:07:56 2016
Raid Level : raid5
Array Size : 100561920 (95.90 GiB 102.98 GB)
Used Dev Size : 33520640 (31.97 GiB 34.33 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Sat Dec 10 23:56:53 2016
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : right-asymmetric
Chunk Size : 256K
Name : vm:0 (local to host vm)
UUID : 80d0f886:dc380755:5387f78c:1fac60da
Events : 158
Number Major Minor RaidDevice State
0 8 65 0 active sync /dev/sde1
1 8 81 1 active sync /dev/sdf1
2 8 97 2 active sync /dev/sdg1
4 8 113 3 active sync /dev/sdh1
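(Benchmarks on a freshly created array are only meaningful once the initial resync has finished. /proc/mdstat shows resync progress, and mdadm can block until it completes:)
# cat /proc/mdstat
# mdadm --wait /dev/md0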
I expected the array to perform roughly between 350 and 400 MB/s for continuous reads and writes: a 4-disk RAID5 stripes data over 3 data disks per stripe, so roughly three times the ~120-145 MB/s single-disk rate. Reading or writing the entire volume actually yields results perfectly within this range:
# echo 3 | tee /proc/sys/vm/drop_caches; dd of=/dev/null if=/dev/md0 bs=256K
102975406080 bytes (103 GB) copied, 261.373 s, 394 MB/s
# dd if=/dev/zero of=/dev/md0 bs=256K conv=fdatasync
102975406080 bytes (103 GB) copied, 275.562 s, 374 MB/s
However, the write performance depends heavily on the amount of data written. As expected, the transfer rate increases with the amount of data, but it collapses on reaching 2 GiB and recovers only slowly as the size increases further:
# dd if=/dev/zero of=/dev/md0 bs=256K conv=fdatasync count=x
count=1: 262144 bytes (262 kB) copied, xxx s, [3.6, 7.6, 8.9, 8.9] MB/s
count=2: 524288 bytes (524 kB) copied, xxx s, [3.1, 17.7, 15.3, 15.7] MB/s
count=4: 1048576 bytes (1.0 MB) copied, xxx s, [13.2, 23.9, 26.9, 25.4] MB/s
count=8: 2097152 bytes (2.1 MB) copied, xxx s, [24.3, 46.7, 45.9, 42.8] MB/s
count=16: 4194304 bytes (4.2 MB) copied, xxx s, [5.1, 77.3, 42.6, 73.2, 79.8] MB/s
count=32: 8388608 bytes (8.4 MB) copied, xxx s, [68.6, 101, 99.7, 101] MB/s
count=64: 16777216 bytes (17 MB) copied, xxx s, [52.5, 136, 159, 159] MB/s
count=128: 33554432 bytes (34 MB) copied, xxx s, [38.5, 175, 185, 189, 176] MB/s
count=256: 67108864 bytes (67 MB) copied, xxx s, [53.5, 244, 229, 238] MB/s
count=512: 134217728 bytes (134 MB) copied, xxx s, [111, 288, 292, 288] MB/s
count=1K: 268435456 bytes (268 MB) copied, xxx s, [171, 328, 319, 322] MB/s
count=2K: 536870912 bytes (537 MB) copied, xxx s, [228, 337, 330, 334] MB/s
count=4K: 1073741824 bytes (1.1 GB) copied, xxx s, [338, 348, 348, 343] MB/s <-- ok!
count=8K: 2147483648 bytes (2.1 GB) copied, xxx s, [168, 147, 138, 139] MB/s <-- bad!
count=16K: 4294967296 bytes (4.3 GB) copied, xxx s, [155, 160, 178, 144] MB/s
count=32K: 8589934592 bytes (8.6 GB) copied, xxx s, [256, 238, 264, 246] MB/s
count=64K: 17179869184 bytes (17 GB) copied, xxx s, [298, 285] MB/s
count=128K: 34359738368 bytes (34 GB) copied, xxx s, [347, 336] MB/s
count=256K: 68719476736 bytes (69 GB) copied, xxx s, [363, 356] MB/s <-- getting better
(Below 2 GiB, the first measurement of each series seems to indicate the use of some read cache.)
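(A loop along the following lines reproduces the series above, one pass per count; the bracketed repetitions came from repeated runs:)
# for c in 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K; do dd if=/dev/zero of=/dev/md0 bs=256K conv=fdatasync count=$c 2>&1 | tail -n1; done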
While transferring 2 GiB or more, I observed something strange in iotop:
- Phase 1: In the beginning, "Total DISK WRITE" and "Actual DISK WRITE" are both at about 400 MB/s. dd has an IO value of around 85 % while all others are at 0 %. This phase lasts longer on bigger transfers.
- Phase 2: A few seconds (~16 s) before the transfer is done, a kworker jumps in and steals 30-50 percentage points of IO from dd. The distribution fluctuates between 30:50 % and 50:30 %. At the same time, "Total DISK WRITE" drops to 0 B/s and "Actual DISK WRITE" jumps between 20 and 70 MB/s. This phase seems to last a constant amount of time.
- Phase 3: Within the last 3 s, "Actual DISK WRITE" jumps to over 400 MB/s while "Total DISK WRITE" stays at 0 B/s. dd and kworker are both listed with an IO value of 0 %.
- Phase 4: The IO value of dd jumps up to 5 % for a single second. At the same time, the transfer completes.
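(The same phases can be watched from the writeback side: the Dirty and Writeback counters in /proc/meminfo show how much data is queued in the pagecache and how much is currently being flushed:)
# watch -n1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"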
Some more tests:
# dd if=/dev/zero of=/dev/md0 bs=256K count=32K oflag=direct
8589934592 bytes (8.6 GB) copied, 173.083 s, 49.6 MB/s
# dd if=/dev/zero of=/dev/md0 bs=256M count=64 oflag=direct
17179869184 bytes (17 GB) copied, 47.792 s, 359 MB/s
# dd if=/dev/zero of=/dev/md0 bs=768M count=16K oflag=direct
50734301184 bytes (51 GB) copied, 136.347 s, 372 MB/s <-- peak performance
# dd if=/dev/zero of=/dev/md0 bs=1G count=16K oflag=direct
41875931136 bytes (42 GB) copied, 112.518 s, 372 MB/s <-- peak performance
# dd if=/dev/zero of=/dev/md0 bs=2G count=16 oflag=direct
34359672832 bytes (34 GB) copied, 103.355 s, 332 MB/s
# dd if=/dev/zero of=/dev/md0 bs=256K count=32K oflag=dsync
8589934592 bytes (8.6 GB) copied, 498.77 s, 17.2 MB/s
# dd if=/dev/zero of=/dev/md0 bs=256M count=64 oflag=dsync
17179869184 bytes (17 GB) copied, 58.384 s, 294 MB/s
# dd if=/dev/zero of=/dev/md0 bs=1G count=8 oflag=dsync
8589934592 bytes (8.6 GB) copied, 26.4799 s, 324 MB/s
# dd if=/dev/zero of=/dev/md0 bs=2G count=8 oflag=dsync
17179836416 bytes (17 GB) copied, 192.327 s, 89.3 MB/s
# dd if=/dev/zero of=/dev/md0 bs=256K; echo "sync"; sync
102975406080 bytes (103 GB) copied, 275.378 s, 374 MB/s
sync
Observations from iotop during these tests:
- bs=256K oflag=direct → 100 % IO, no kworker present, bad performance
- bs=1G oflag=direct → < 5 % IO, no kworker present, ok performance
- bs=2G oflag=direct → > 80 % IO, kworker jumps in every now and then, ok performance
- oflag=dsync → < 5 % IO, kworker jumps in every now and then; needs huge block sizes to obtain acceptable speed, but > 2G results in a massive performance drop
- echo "sync"; sync → same as conv=fdatasync; sync returns immediately
What's that mysterious Phase 2 where both processes seem to fight for IO?
Who's transferring the data to the hardware in phase 3?
And most importantly: how can I minimize the strange effect to obtain the full 400 MB/s which the array seems to be able to provide? (Or is this an XY problem?)
There's a long story of trial and error preceding the current state. I switched the scheduler from cfq to noop and decreased the RAID's chunk size from 512 KiB to 256 KiB, yielding slightly better results. Changing to --layout=right-asymmetric didn't change anything. Temporarily deactivating the hard drives' write cache performed worse.
The crypt layer mentioned in the first sentence is currently completely absent and will be re-introduced later.
# uname -a
Linux vm 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
Answer
What you are seeing is an artifact of your dd command line, specifically of the conv=fdatasync option. From the man page:
Each CONV symbol may be:
...
fdatasync: physically write output file data before finishing
...
conv=fdatasync basically instructs dd to issue a single, final fdatasync syscall before returning. Writes are cached in the meantime, while dd runs. Your I/O phases can be explained as follows:
- dd quickly writes into the pagecache, without actually touching the disk;
- when the pagecache is nearly full, a kworker kernel thread begins flushing it to disk; during the pagecache flush, dd is briefly paused (resulting in high iowait), and after some pagecache is freed, dd resumes operation;
- the difference between "Total" and "Actual" disk writes in iotop depends on how the pagecache is, respectively, filled and flushed;
- the cycle continues until the transfer is done.
In short, there is no issue here. If you want to observe uncached behavior, replace conv=fdatasync with oflag=direct: with this flag, dd completely bypasses the pagecache.
To observe cached but synchronized behavior, replace conv=fdatasync with oflag=sync: with this flag, dd syncs each block as it is written to disk.
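For example, keeping the geometry of the earlier tests:
# dd if=/dev/zero of=/dev/md0 bs=256M count=64 oflag=direct
# dd if=/dev/zero of=/dev/md0 bs=256M count=64 oflag=sync
The first bypasses the pagecache entirely; the second pushes each 256 MiB block to disk before writing the next.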
Further optimization can be obtained by fine-tuning your I/O stack (i.e. I/O scheduler, merge behavior, stripe cache, etc.), but that is an entirely different question.
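For instance, two common knobs (values purely illustrative, not recommendations):
# echo 4096 > /sys/block/md0/md/stripe_cache_size
# sysctl -w vm.dirty_background_ratio=5
# sysctl -w vm.dirty_ratio=10
The first enlarges the md RAID4/5/6 stripe cache (in pages per device, typically 256 by default); the sysctls control when background flushing starts and when writers get throttled.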