At seemingly random intervals, the memory usage on our server climbs past the maximum available and the machine starts swapping until CPU usage is also at 100%. It then starts killing off processes when it runs out of swap and we have to restart the server.
When this happens our website and internal systems become unresponsive. I also cannot SSH into the server at that point, so I have no way of identifying the processes that are killing it.
I don't have a huge amount of experience with server admin, but I'm looking for ideas on how to diagnose the problem. Let me know what extra information you may need.
Answer
It could be a fork bomb (i.e. a process that forks children indefinitely and exhausts the machine's resources), or it could be a memory-leak type of issue.
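To get a feel for which of the two it is before the box falls over, you can watch the total process count and the biggest memory consumers (a rough sketch, assuming a standard Linux ps and watch; run either in a spare terminal):

watch -n 5 'ps -e --no-headers | wc -l'
watch -n 5 'ps -eo pid,rss,comm --sort=-rss | head'

A fork bomb shows up as a rapidly climbing process count; a leak shows up as one process's RSS growing steadily.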
Identifying the offending process(es) is the key here. Try this:
When you next restart the server, leave a console open as root and use renice to set its priority to -20. Once that's done, run top (it will inherit the shell's -20 priority) and watch to see what's causing the issue.
These commands ought to do it:
sudo bash                 # open a root shell
renice -n -20 -u root     # give every root-owned process (including this shell) top priority
top                       # top inherits the shell's -20 priority
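Once top is up, pressing M sorts the process list by memory usage and pressing c shows full command lines, which makes the culprit much easier to spot.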
When things start looking tight, resort to killall, or kill the parent process and then mop up the zombies.
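For example (the process name and PIDs below are placeholders for illustration; use whatever top shows):

sudo killall runaway_process
ps -o ppid= -p 1234
sudo kill -9 4321

killall signals every instance by name; the ps line looks up the parent of a surviving child (PID 1234 here) so you can kill that parent (4321 here) and stop it respawning children.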
At -20 you should be able to keep an active connection over SSH and still do your work; it's the same priority as the kernel.
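If you would rather not bump every root-owned process, a more targeted variant is to renice just your shell and the SSH daemon (a sketch assuming the daemon's process name is sshd on your distro):

sudo renice -n -20 -p $$
sudo renice -n -20 -p $(pgrep -o -x sshd)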
Don't forget to look in the logs (the web server's and everything else under /var/log) as well, since they can be quite revealing.
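In particular, when the kernel's OOM killer starts shooting processes it logs the victims, and those entries usually point straight at the culprit (log paths assume a Debian/Ubuntu-style layout; adjust for your distro and web server):

dmesg | grep -i 'killed process'
grep -i 'out of memory' /var/log/kern.log
tail -n 200 /var/log/apache2/error.log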
If you identify the problem, let us know what it is and whether you need any further help.
Good luck.
See the renice and top man pages.