Saturday, August 19, 2017

linux - OOM killer goes insane

On our cluster, nodes sometimes go down when a new process requests too much memory. I was puzzled why the OOM killer does not just kill the guilty process.



The reason turned out to be that some processes get an oom_adj of -17. That makes them off-limits for the OOM killer (unkillable!).
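For reference, -17 is the value the kernel defines as OOM_DISABLE, so a process carrying it is skipped entirely when the OOM killer picks a victim. Here is a minimal sketch for checking a single PID. The function name oom_status and the PROC_ROOT override are my own illustrative choices (PROC_ROOT only exists so the function can be pointed at a test directory instead of /proc).

```shell
#!/bin/bash
# Sketch: report whether one process is exempt from the OOM killer.
# PROC_ROOT is a hypothetical override; it defaults to the real /proc.
OOM_DISABLE=-17   # kernel constant: oom_adj value that disables OOM killing

oom_status() {
    local pid=$1 root=${PROC_ROOT:-/proc} adj
    # The file disappears if the process exits, so treat a failed read as "unknown".
    adj=$(cat "$root/$pid/oom_adj" 2>/dev/null) || { echo "unknown"; return 1; }
    if [ "$adj" -eq "$OOM_DISABLE" ]; then
        echo "exempt"
    else
        echo "killable (oom_adj=$adj)"
    fi
}
```

Calling oom_status with a PID (for example oom_status $$ for the current shell) prints one of the three states. On kernels that no longer expose oom_adj it will just report "unknown".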




I can clearly see that with the following script:



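#!/bin/bash
# Print every process whose oom_adj is non-zero (-17 means exempt from the OOM killer).
for f in /proc/[0-9]*/oom_adj; do
    adj=$(cat "$f" 2>/dev/null) || continue   # the process may have exited already
    pid=${f#/proc/}; pid=${pid%/oom_adj}
    [ "$adj" != "0" ] && ps -p "$pid" | grep -v CMD
done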


OK, it makes sense for sshd, udevd, and dhclient, but then I see regular user processes get -17 as well. Once such a user process causes an OOM event, it will never get killed. This causes the OOM killer to go insane: NFS rpc.statd, cron, everything that happens not to be at -17 gets wiped out. As a result, the node goes down.




I have Debian 6.0 (Linux 2.6.32-3-amd64).



Does anyone know where to control the -17 oom_adj assignment behaviour?



Could launching sshd and Torque mom from /etc/rc.local be causing the overprotective behaviour?
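One way to test that suspicion: according to proc(5), a new process inherits its parent's oom_adj, so walking up the parent chain of an affected user process should show where the -17 first appears. The sketch below does that. The function name oom_lineage and the PROC_ROOT override are hypothetical (the override only lets the function run against a test directory instead of /proc), and it assumes the usual Name: and PPid: fields in /proc/PID/status.

```shell
#!/bin/bash
# Sketch: print "PID NAME OOM_ADJ" for a process and each of its ancestors.
# PROC_ROOT is a hypothetical override; it defaults to the real /proc.
oom_lineage() {
    local pid=$1 root=${PROC_ROOT:-/proc} name adj
    # Stop at PID 0 (the parent of init) or when a status file is unreadable.
    while [ -n "$pid" ] && [ "$pid" -gt 0 ] && [ -r "$root/$pid/status" ]; do
        name=$(awk '/^Name:/ {print $2}' "$root/$pid/status")
        adj=$(cat "$root/$pid/oom_adj" 2>/dev/null || echo "?")
        echo "$pid $name $adj"
        # Move up to the parent process.
        pid=$(awk '/^PPid:/ {print $2}' "$root/$pid/status")
    done
}
```

Running oom_lineage on the PID of one of the unkillable user jobs lists its ancestors from the job up to init. If the chain shows -17 all the way up to a daemon started from /etc/rc.local, that would point at inheritance rather than at something assigning -17 to the job directly.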
