Saturday, October 4, 2014

centos - Gluster + ZFS, deadlock during benchmarking: zfs_iput_taskq 100% cpu



First some background:
I work at a company that runs a PHP web application. We have a storage backend mounted over NFS on several webservers. Today the issue is that when one webserver writes a file over NFS, it sometimes does not appear on the other mounted clients until a few minutes later. The backend is also not redundant, so we cannot perform any "invisible" maintenance.



I've been looking at migrating to a GlusterFS solution (two or three replicated bricks/machines for redundancy). Using XFS as the storage filesystem "behind" Gluster works very well, performance-wise. Gluster also does not seem to have the sync problem mentioned above.



However, I would like to use ZFS as the backend filesystem, the reasons being:





  • Cheap compression (currently storing 1.5TB uncompressed)

  • Very easy to expand the storage volume "live" (one command, compared to the LVM mess; see the sketch after this list)

  • Snapshotting, bit-rot protection and all the other ZFS glory.
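
For context, this is roughly what the first two points look like with ZFS; the pool name (tank) and device names below are just placeholders:

zfs set compression=lz4 tank     # transparent compression on the whole pool
zpool add tank mirror sdc sdd    # grow the pool live with an extra mirrored vdev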



In my demo setup I have three servers running replicated Gluster, with a ZFS backend pool on a separate disk in each server. I'm using CentOS 6.5 with ZFS on Linux (0.6.2) + GlusterFS 3.4. I have also tried Ubuntu 13.10. Everything runs in VMware ESX.



To test this setup I mount the volume over Gluster and then run BlogBench (http://www.pureftpd.org/project/blogbench) to simulate load. The issue is that towards the end of the test, the ZFS storage seems to get stuck in a deadlock: all three machines have "zfs_iput_taskq" running at 90-100% CPU, and the test freezes. If I abort the test, the deadlock does not go away; the only option seems to be a hard reboot.
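
The stuck kernel thread is easy to spot from a shell on any of the bricks (a quick check, nothing exotic):

top -b -n 1 | grep zfs_iput_taskq    # kernel thread pegged at 90-100% CPU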




I have tried:




  • Disabling atime (see the commands after this list)

  • Setting the noop I/O scheduler

  • Different compression settings / no compression

  • BlogBench directly on ZFS: works fine

  • BlogBench on Gluster with an XFS backend: works fine
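
For reference, the first two items were done roughly like this; the pool and device names are placeholders for my setup:

zfs set atime=off tank                        # no access-time updates on the backing pool
echo noop > /sys/block/sdb/queue/scheduler    # noop elevator for the disk holding the pool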




Any ideas? Should I just drop ZFS and go with something else? Are there alternatives?



Regards, Oscar


Answer



ZFS on Linux needs some basic tuning in order to operate well under load. There's a bit of a struggle between the ZFS ARC and the Linux virtual memory subsystem.



For your CentOS systems, try the following:



Create an /etc/modprobe.d/zfs.conf configuration file. It is read when the zfs module is loaded at boot.




Add something like:



options zfs zfs_arc_max=40000000000
options zfs zfs_vdev_max_pending=24


Here zfs_arc_max is roughly 40% of your RAM in bytes (Edit: try zfs_arc_max=1200000000). The compiled-in default for zfs_vdev_max_pending is 8 or 10, depending on the version. The value should be high (48) for SSDs or other low-latency drives, perhaps 12-24 for SAS; otherwise, leave it at the default.
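
If it helps, the ARC figure and the values this ZoL version is actually running with are easy to sanity-check from a shell (a rough sketch, assuming the zfs module is already loaded):

awk '/MemTotal/ { printf "%.0f\n", $2 * 1024 * 0.4 }' /proc/meminfo    # ~40% of RAM in bytes
cat /sys/module/zfs/parameters/zfs_arc_max                             # current ARC cap (0 = built-in default)
cat /sys/module/zfs/parameters/zfs_vdev_max_pending                    # current per-vdev queue depth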



You'll also want to set some floor values in /etc/sysctl.conf:




vm.swappiness = 10
vm.min_free_kbytes = 512000
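
These are picked up at boot; to apply them immediately without rebooting:

sysctl -p    # reloads /etc/sysctl.conf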


Finally, with CentOS, you may want to install tuned and tuned-utils and set your profile to virtual-guest with tuned-adm profile virtual-guest.
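
A minimal sketch of that step, assuming the packages are available from your configured repositories:

yum install tuned tuned-utils
tuned-adm profile virtual-guest
tuned-adm active    # confirm the profile took effect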



Try these and see if the problem persists.



Edit:




Run zfs set xattr=sa storage. The reason is that GlusterFS leans heavily on extended attributes, and the default directory-based xattr storage in ZFS on Linux handles that load poorly; xattr=sa keeps the attributes in the inode instead. You may have to wipe the volumes and start again (I'd definitely recommend doing so).
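
A quick check that the setting took effect (assuming the pool is named storage, as above):

zfs get xattr storage    # should report xattr set to sa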

