Saturday, October 27, 2018

linux - abnormal very high CPU during lengthy write operations

I have a set of large files which I sometimes copy back and forth between a Linux box and a Windows box. The files are each about 2 GB, and there tend to be 10 or so of them (they're a VM image). The VM runs on Linux (qemu) and I back it up to a Windows box. In this scenario, the VM is not running.



When I copy the files from the Linux box to the Windows box everything works fine. When I copy the files from the Windows box back to the Linux box, I get anomalous high and continuous CPU usage on the Linux box, and the file transfer goes very (very) slowly.



I'm using socat, lz4, and tar to transport the files. On the Windows box, I'm using Cygwin for socat, tar, etc. (but this doesn't matter much, because the Windows box is behaving fine). I chose lz4 because it's very (very) fast and (like gzip, etc.) provides checksumming.



When I copy the files from the Linux box, the Linux command is:

    tar cvf - *vmdk | lz4 -B64 | socat - TCP-LISTEN:7777,reuseaddr

and the Windows command is:

    socat TCP:linuxserver:7777 - > bigbackup.tar.lz4

This works fine: I get 25% to 100% network utilization, and CPU usage on all systems is less than 25%.




When I copy the files back to the Linux box, the Linux command is:

    socat TCP-LISTEN:7777,reuseaddr - | lz4 -d | tar xvf -

and the Windows command is:

    cat bigbackup.tar.lz4 | socat - TCP:linuxserver:7777



When I run this restore operation to copy the files back to my Linux box, the transfer works as expected for several seconds. Then it begins stalling and slowing, and the CPU on the Linux box starts spiking and then pegs at 100%, while all other programs become less responsive (and sometimes nonresponsive). If I leave it alone, the transfer ultimately completes, but at about 5% of the speed I would expect, with the CPU pegged the entire time.



If I use the Windows Task Manager "Networking" tab or gnome-system-monitor on Linux, the network history looks odd: there are 2 to 5 seconds of data transfer at about 25% utilization, and then nothing for 30 to 40 seconds. This repeats until the transfer is complete. The CPU is at 100% the entire time. In htop (Linux), the socat and lz4 processes use 0 to 2 percent CPU, and the tar process sometimes spikes to 25%, but even when the sum of these is low, some unaccounted-for thread is using the rest of the CPU. I tried renice on the tar process, with no effect.
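Since the per-process view in htop does not account for the load, a system-wide breakdown of CPU time may help narrow down where it goes (user, system, iowait, softirq, and so on). The following is a minimal Python sketch, assuming a Linux /proc/stat in the layout described in proc(5); the function names are made up for illustration and nothing here is specific to this machine.

```python
"""Sketch: split elapsed CPU time into categories using /proc/stat (Linux)."""
import time

# Order of the first eight counters on the aggregate "cpu" line, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return {field: ticks} taken from the aggregate 'cpu ' line."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            values = [int(v) for v in line.split()[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def percent_breakdown(before, after):
    """Percentage of CPU time spent in each category between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values()) or 1  # avoid dividing by zero
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample(interval=5.0, path="/proc/stat"):
    """Take two readings `interval` seconds apart and return the breakdown."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return percent_breakdown(before, after)
```

Calling sample() a few times during a stalled restore and printing the result would show whether the time is being counted as iowait, system, or softirq rather than user time. Each of those points in a different direction, so this only narrows the search; it does not identify the cause.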



If I run the restore process on a different (Windows) box, with the same commands, the transfer proceeds smoothly with network utilization at 25% to 100%, the same as the backup does. Unfortunately, I don't have any other Linux boxes to test with.



That this problem occurs is as boggling to me as a car which can't make left turns on Tuesdays. If the disk in the Linux box were stalling because of slow writes (it is an SSD), I would expect the kernel thread to just block, leaving the CPU otherwise available.
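One way to check the slow-write hypothesis is to watch how much written data is waiting in the page cache while the restore runs. The sketch below, in Python, reads the Dirty and Writeback counters from /proc/meminfo; the function names and the polling interval are illustrative assumptions, and a link between these counters and the stalls is a guess to be tested, not something established here.

```python
"""Sketch: watch the Dirty and Writeback counters in /proc/meminfo (Linux)."""
import time


def parse_meminfo(text, keys=("Dirty", "Writeback")):
    """Return {key: kilobytes} for the requested /proc/meminfo entries."""
    result = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            result[name] = int(rest.split()[0])  # values are reported in kB
    return result


def watch(interval=1.0, count=60, path="/proc/meminfo"):
    """Print the counters once per `interval` seconds, `count` times."""
    for _ in range(count):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print(time.strftime("%H:%M:%S"), info)
        time.sleep(interval)
```

If Dirty climbs during the 2 to 5 seconds of network activity and then drains slowly during the 30 to 40 second pauses, that would be consistent with the writes being flushed in large bursts. If it stays small throughout, the disk is probably not the place to look.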




Here's some information about the hardware and system.




  • Debian GNU/Linux 9.1 (stretch)

  • Intel NUC NUC5CPYB, with Celeron CPU N3050 @ 1.60GHz (2 cores)

  • 8 GB DRAM

  • Realtek RTL8111/8168/8411 PCI Express Gigabit (on board)

  • SanDisk SSD PLUS 480GB

  • the system is used as a network router and VM host (smallish utility VMs), and typical CPU usage is 25% to 50%




I looked at the documentation for tar, and there don't seem to be any flags that control how it writes files (buffering, caching, sync writes, etc.).
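Since tar itself offers no such control, one experiment would be to replace the final `tar xvf -` stage with a small extractor that forces data to disk at regular intervals, so that unwritten data never builds up beyond a fixed amount. The following Python sketch uses the standard library's tarfile module in streaming mode; the function name, the 64 MiB sync interval, and the 1 MiB chunk size are arbitrary assumptions, and whether this helps at all depends on the stalls actually being writeback-related, which is unverified.

```python
"""Sketch: extract a tar stream, calling fsync every `sync_every` bytes."""
import os
import tarfile


def extract_with_periodic_sync(stream, dest_dir,
                               sync_every=64 * 1024 * 1024,
                               chunk=1024 * 1024):
    """Extract regular files from a tar stream into dest_dir.

    Other member types (directories, links, devices) are skipped to keep
    the sketch short. Returns the list of paths written.
    """
    dest_root = os.path.realpath(dest_dir)
    written = []
    # "r|" reads the archive strictly sequentially, as needed for a pipe.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Refuse member names that would land outside dest_dir.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe path in archive: " + member.name)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            src = tar.extractfile(member)
            since_sync = 0
            with open(target, "wb") as out:
                while True:
                    data = src.read(chunk)
                    if not data:
                        break
                    out.write(data)
                    since_sync += len(data)
                    if since_sync >= sync_every:
                        out.flush()
                        os.fsync(out.fileno())  # push pending data to disk
                        since_sync = 0
                out.flush()
                os.fsync(out.fileno())
            written.append(target)
    return written
```

A small wrapper script that calls extract_with_periodic_sync(sys.stdin.buffer, ".") could then take the place of `tar xvf -` at the end of the restore pipeline. Python will be slower than tar on a CPU like this one, so the comparison of interest is whether the long pauses disappear, not the raw throughput.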



Does anyone know why this happens, and is there any way of fixing it or reducing its impact?
