Wednesday, January 6, 2016

performance - How to tell if linux disk IO is causing excessive (> 1 second) application stalls



I have a Java application performing a large volume (hundreds of MB) of continuous output (streaming plain text) to about a dozen files on a SAN filesystem. Occasionally, this application pauses for several seconds at a time. I suspect that something related to VxFS (Veritas File System) functionality (and/or how it interacts with the OS) is the culprit.



What steps can I take to confirm or refute this theory? I am aware of iostat and /proc/diskstats as starting points.




Revised title to de-emphasize journaling and emphasize "stalls"



I have done some googling and found at least one article that seems to describe behavior like I am observing: Solving the ext3 latency problem



Additional Information




  • Red Hat Enterprise Linux Server release 5.3 (Tikanga)

  • Kernel: 2.6.18-194.32.1.el5


  • Primary application disk is fiber-channel SAN: lspci | grep -i fibre => 14:00.0 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter (rev 03)

  • Mount info: type vxfs (rw,tmplog,largefiles,mincache=tmpcache,ioerror=mwdisable) 0 0

  • cat /sys/block/VxVM123456/queue/scheduler => noop anticipatory [deadline] cfq


Answer



My guess is that there's some other process that hogs the disk I/O capacity for a while. iotop can help you pinpoint it, if you have a recent enough kernel.
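If iotop is not available, a rough way to check whether the disk itself is slow during a stall is to sample /proc/diskstats once a second and compare snapshots. The following is only a minimal sketch: the function name diskstats_await and the snapshot file paths are made up for illustration, and it assumes a whole-device line with the full set of statistics fields (on older kernels, partition lines carry fewer fields).

```shell
#!/bin/sh
# Sketch: average time per completed I/O (in ms) between two snapshots of
# /proc/diskstats, plus the number of I/Os still in flight.
# Fields used (whole-device lines):
#   $3 device name, $4 reads completed, $7 ms spent reading,
#   $8 writes completed, $11 ms spent writing, $12 I/Os currently in flight.
# Usage: diskstats_await BEFORE_FILE AFTER_FILE DEVICE
diskstats_await() {
    awk -v dev="$3" '
        $3 != dev { next }
        {
            ios = $4 + $8      # total I/Os completed so far
            ms  = $7 + $11     # total ms spent on those I/Os
            if (seen) {
                d_ios = ios - prev_ios
                d_ms  = ms - prev_ms
                if (d_ios > 0) {
                    printf "await_ms=%.1f inflight=%d\n", d_ms / d_ios, $12
                } else {
                    # Nothing completed in this interval.
                    printf "await_ms=n/a inflight=%d\n", $12
                }
            }
            prev_ios = ios; prev_ms = ms; seen = 1
        }
    ' "$1" "$2"
}

# Example loop (device name taken from the question; adjust as needed):
#   cat /proc/diskstats > /tmp/ds.old
#   while sleep 1; do
#       cat /proc/diskstats > /tmp/ds.new
#       echo "$(date +%T) $(diskstats_await /tmp/ds.old /tmp/ds.new VxVM123456)"
#       mv /tmp/ds.new /tmp/ds.old
#   done
```

If the application's pauses line up with intervals where await_ms jumps, or where nothing completes while inflight stays above zero, the storage path is a likely suspect. If the numbers stay flat during a pause, the cause is probably elsewhere.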



If this is the case, it's not about the filesystem, much less about journalling. It's the I/O scheduler that is responsible for arbitrating between competing applications. An easy test: check the current scheduler and try a different one. This can be done on the fly, without restarting. For example, on my desktop, to check the first disk (/dev/sda):



cat /sys/block/sda/queue/scheduler

=> noop deadline [cfq]


shows that it's using CFQ, which is a good choice for desktops but not so much for servers. It is better to set 'deadline' (as root):



echo 'deadline' > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler
=> noop [deadline] cfq



and wait a few hours to see if it improves. If so, set it permanently in the startup scripts (how to do this depends on the distribution).
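On a machine with several block devices it can be handy to see the active scheduler for all of them at once. The sketch below does that by extracting the bracketed entry from each sysfs file. The function names and the SYSFS_BLOCK variable are hypothetical, introduced here only so the code can be pointed at a scratch directory.

```shell
#!/bin/sh
# Print the active scheduler from a sysfs-style file whose content looks
# like "noop anticipatory [deadline] cfq" (the active one is in brackets).
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# List every block device together with its active scheduler.
# SYSFS_BLOCK is a hypothetical override; it defaults to the real /sys/block.
list_schedulers() {
    for f in "${SYSFS_BLOCK:-/sys/block}"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f%/queue/scheduler}
        echo "${dev##*/} $(active_scheduler "$f")"
    done
}
```

As for making the change permanent, one common approach on kernels of this generation is the elevator=deadline kernel boot parameter, which sets the default for all devices. Another is to repeat the echo command for the relevant device from a startup script. Which is appropriate depends on the distribution and on whether every device should use the same scheduler.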

