Tuesday, December 1, 2015

linux - How to diagnose causes of oom-killer killing processes




I have a small virtual private server running CentOS and www/mail/db, which has recently had a couple of incidents where the web server and ssh became unresponsive.



Looking at the logs, I saw that oom-killer had killed these processes, possibly due to running out of memory and swap.



Can anyone give me some pointers on how to diagnose what may have caused the most recent incident? Is the first process killed the likely culprit? Where else should I be looking?


Answer



No, the algorithm is not that simplistic. You can find more information in:



http://linux-mm.org/OOM_Killer
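As a starting point for diagnosis, you can pull the OOM events out of the logs. A sketch, assuming CentOS's default syslog location (log paths vary by distro): the "Kill process"/"Killed process" lines name the victim, while "invoked oom-killer" lines name the process whose allocation triggered the kill — these are often different processes.

```shell
# Kernel OOM messages on CentOS typically land in /var/log/messages.
grep -iE 'killed process|out of memory|invoked oom-killer' /var/log/messages

# The kernel ring buffer also keeps recent events if the box is still up:
dmesg | grep -i oom
```

The lines around each match usually include a per-process memory dump taken at kill time, which is the best snapshot of what was actually using memory when the OOM hit.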




If you want to track memory usage, I'd recommend running a command like:



ps -e -o pid,user,pcpu,size,rss,cmd --sort -size,-rss | head


It will give you a list of the processes that are using the most memory (and probably causing the OOM situation). Remove the | head if you'd prefer to check all the processes.



Put this in your cron, run it every 5 minutes, and append the output to a file. Keep at least a couple of days of history, so you can check later what happened.
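A minimal sketch of such a snapshot script (the script name and log path are illustrative, not anything standard):

```shell
#!/bin/sh
# memsnap.sh (hypothetical name): append a timestamped snapshot of the
# top memory consumers to a log file, for post-mortem OOM analysis.
LOG=/var/log/memlog          # example path; pick one logrotate can manage
{
  date
  ps -e -o pid,user,pcpu,size,rss,cmd --sort -size,-rss | head
  echo
} >> "$LOG"
```

Then a crontab entry along the lines of `*/5 * * * * /usr/local/bin/memsnap.sh` runs it every 5 minutes; a logrotate rule keeping a few days of the file completes the picture.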



For critical services like ssh, I'd recommend using monit to restart them automatically in such situations. It might save you from losing access to the machine if you don't have a remote console to it.
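For reference, a minimal monit stanza for sshd might look like the following. This is a sketch, not a drop-in config: the file path, pidfile location, and start/stop commands all vary by distro.

```
# /etc/monit.d/sshd (illustrative path)
check process sshd with pidfile "/var/run/sshd.pid"
  start program = "/sbin/service sshd start"
  stop program  = "/sbin/service sshd stop"
  if failed port 22 protocol ssh then restart
```

With this in place, monit will restart sshd if the process disappears (for instance, after an OOM kill) or stops answering on port 22.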




Best of luck,
João Miguel Neves


