Monday, March 28, 2016

linux - Tracking down load average



The "load average" on a *nix machine is the "average length of the run queue", or in other words, the average number of processes that are doing something (or waiting to do something). While the concept is simple enough to understand, troubleshooting the problem can be less straight-forward.



Here are the statistics on a server I worked on today that made me wonder about the best way to fix this sort of thing:




  • 1GB RAM free, 0 swap space usage

  • CPU times around 20% user, 30% wait, 50% idle (according to top)


  • About 2 to 3 processes in either "R" or "D" state at a time (tested using ps | grep; a sketch of that command follows this list)

  • Server logs free of any error messages indicating hardware problems

  • Load average around 25.0 (for all 3 averages)

  • Server visibly unresponsive for users
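For the R/D count above, something along these lines works - the exact ps fields are an assumption, since the original check was just "ps | grep":

    # show tasks that are running (R) or in uninterruptible sleep (D),
    # plus the kernel function each one is blocked in (wchan)
    ps -eo state,pid,wchan:32,cmd | awk '$1 ~ /^[RD]/'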



I eventually "fixed" the problem by restarting MySQLd... which doesn't make a lot of sense, because according to mysql's "show processlist" command, the server was theoretically idle.



What other tools/metrics should I have used to help diagnose this issue and possibly determine what was causing the server load to run so high?


Answer




It sounds like your server is IO bound - hence the processes sitting in D (uninterruptible sleep) state.
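A quick way to confirm that is vmstat: the "b" column counts processes blocked in uninterruptible sleep and "wa" shows CPU time spent waiting on IO, which lines up with the 30% wait figure quoted in the question:

    vmstat 5             # refresh every 5 seconds; watch the "b" and "wa" columns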



Use iostat to see what the load is on your disks.
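For example, with extended per-device statistics (iostat is part of the sysstat package):

    iostat -x 5          # high await (ms per IO) and %util on the device holding
                         # the MySQL data would confirm that disk is the bottleneck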



If MySQL is causing lots of disk seeks then consider putting your MySQL data on a completely separate physical disk. If it's still slow and it's part of a master-slave setup, put the replication logs onto a separate disk too.
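As a rough sketch of what that looks like in my.cnf - the mount points here are made up for illustration, with the binary and relay logs split onto a second spindle:

    [mysqld]
    datadir   = /mnt/disk1/mysql
    log-bin   = /mnt/disk2/mysql-bin
    relay-log = /mnt/disk2/relay-bin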



Note that a separate partition or logical disk isn't enough - head seek times are generally the limiting factor, not data transfer rates.

