Similar to this question, we have a computing server with 96GB of RAM that is used to run large jobs in parallel.
Occasionally, the total amount of physical RAM is exceeded, which causes the server to become unresponsive, forcing a reboot. To me, this is not acceptable behavior, so I'm looking for ways to fix this.
I know one way would be to set limits using "ulimit -v". However, I'd like to avoid going down that route if possible, as I may occasionally have a single very large process (as opposed to many small ones), so setting a useful threshold is going to be difficult.
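For reference, "ulimit -v" corresponds to the RLIMIT_AS resource limit, so the same cap could also be set programmatically from inside a job. A minimal C++ sketch follows; the 8 GB threshold is only a placeholder value, not a recommendation:

// Minimal sketch: cap this process's virtual address space, the same
// limit that "ulimit -v" sets from the shell. 8 GB is just an example.
#include <sys/resource.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    struct rlimit lim;
    lim.rlim_cur = 8ULL * 1024 * 1024 * 1024;  // soft limit: 8 GB
    lim.rlim_max = 8ULL * 1024 * 1024 * 1024;  // hard limit: 8 GB
    if (setrlimit(RLIMIT_AS, &lim) != 0) {
        perror("setrlimit");
        return EXIT_FAILURE;
    }
    // Allocations beyond the limit now fail (new throws std::bad_alloc,
    // malloc returns NULL) instead of driving the machine into swap.
    return EXIT_SUCCESS;
}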
I suspect the problem may come from the fact that the system has 20GB of swap: instead of killing the offending process(es), the system will push memory out to swap on disk, which makes it unresponsive. Is reducing the amount of swap a good idea?
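One way to test that hypothesis is to watch how free RAM and free swap evolve while a runaway job runs. A rough C++ sketch, relying only on the standard MemFree and SwapFree fields of /proc/meminfo on Linux:

// Rough sketch: print MemFree and SwapFree once per second, to watch a
// runaway job push the system into swap before the OOM killer acts.
#include <fstream>
#include <iostream>
#include <string>
#include <unistd.h>

int main()
{
    while (true) {
        std::ifstream meminfo("/proc/meminfo");
        std::string line;
        while (std::getline(meminfo, line)) {
            if (line.compare(0, 8, "MemFree:") == 0 ||
                line.compare(0, 9, "SwapFree:") == 0)
                std::cout << line << '\n';
        }
        std::cout << "---\n";
        sleep(1);
    }
}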
Any insight or experiences with a similar problem highly appreciated!
EDIT
I made a few experiments using the following leaking C++ program:
#include <vector>
#include <unistd.h>

using namespace std;

int main(int argc, char* argv[])
{
    // Leak roughly 200 MB per second: each vector<int> is never freed.
    while (true) {
        vector<int>* a = new vector<int>(50000000);
        sleep(1);
    }
}
I ran it a first time with a 256MB swap file. The system completely hung for about 5 minutes, then came back to life. In the logs, I saw that the OOM killer had successfully killed my leaky program.
I ran it a second time with no swap. This time, the machine didn't come back to life for at least ten minutes, at which point I rebooted it. This came as a surprise to me, as I expected the OOM killer to fire earlier on a machine with no swap.
What I don't understand is the following: why does Linux wait until the system is completely hung before doing something about the offending process? Is it too much to expect an OS not to be brought down by a single badly coded process?
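For what it's worth, the kernel does expose how it ranks candidates for the OOM killer via /proc/<pid>/oom_score; a small sketch for inspecting the score of a given PID (passed on the command line) might look like this:

// Small sketch: print the kernel's OOM badness score for a given PID,
// as exposed in /proc/<pid>/oom_score; higher scores are killed first.
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " <pid>\n";
        return 1;
    }
    std::ifstream score("/proc/" + std::string(argv[1]) + "/oom_score");
    std::string value;
    if (!(score >> value)) {
        std::cerr << "could not read oom_score for pid " << argv[1] << '\n';
        return 1;
    }
    std::cout << "oom_score for pid " << argv[1] << ": " << value << '\n';
    return 0;
}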