linux - OOM killer goes insane

Saturday, August 19, 2017

On our cluster, nodes sometimes go down when a new process requests too much memory. I was puzzled why the OOM killer does not simply kill the offending process.

The reason turned out to be that some processes have an oom_adj of -17, which puts them off-limits to the OOM killer (unkillable!).

I can clearly see that with the following script:

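#!/bin/bash
# Print every process whose oom_adj is non-zero (-17 exempts it from the OOM killer).
# Match on the value after the colon so that PIDs containing a 0 are not skipped.
for i in $(grep -v ':0$' /proc/[0-9]*/oom_adj 2>/dev/null | awk -F/ '{print $3}'); do
  ps -p "$i" | grep -v CMD
done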

OK, it makes sense for sshd, udevd, and dhclient, but then I see regular user processes get -17 as well. Once such a user process triggers an OOM event, it will never be killed. This makes the OOM killer go insane: NFS rpc.statd, cron, everything that happens not to be at -17 gets wiped out. As a result, the node goes down.
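To trace where a user process picked up its -17, it may help to print each exempt process next to its parent PID. The following is only a sketch, assuming the value is copied from the parent when a process is forked. The function name `list_oom_exempt` and its optional proc-root argument are made up for illustration and are not part of any tool.

```shell
#!/bin/bash
# list_oom_exempt [PROC_ROOT]
# Prints "PID PPID NAME" for every process whose oom_adj is -17.
# PROC_ROOT defaults to /proc; the argument is a hypothetical knob so the
# function can be tried against a fake directory tree.
list_oom_exempt() {
  local root="${1:-/proc}" dir pid adj ppid name
  for dir in "$root"/[0-9]*; do
    [ -r "$dir/oom_adj" ] || continue
    # The process may exit between the glob and the read; skip it if so.
    adj=$(cat "$dir/oom_adj" 2>/dev/null) || continue
    [ "$adj" = "-17" ] || continue
    pid=${dir##*/}
    # /proc/PID/status carries "Name:" and "PPid:" lines.
    ppid=$(awk '/^PPid:/ {print $2}' "$dir/status" 2>/dev/null)
    name=$(awk '/^Name:/ {print $2}' "$dir/status" 2>/dev/null)
    echo "$pid ${ppid:-?} ${name:-?}"
  done
}
```

Calling `list_oom_exempt` with no argument scans the live /proc. If a user job's parent also shows up in the list, that points at inheritance rather than the job setting the value itself.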

I have Debian 6.0 (Linux 2.6.32-3-amd64).

Does anyone know where to control the -17 oom_adj assignment behaviour?

Could launching sshd and Torque mom from /etc/rc.local be causing the overprotective behaviour?
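If the -17 does turn out to be inherited from whatever starts the daemons, one possible workaround is to reset oom_adj to 0 just before launching them. This is a sketch under that assumption, not a confirmed fix. The function name `run_with_default_oom_adj` and the `OOM_ADJ_FILE` override are hypothetical, and a daemon that lowers its own oom_adj after starting would not be affected by it.

```shell
#!/bin/bash
# run_with_default_oom_adj COMMAND [ARGS...]
# Resets oom_adj to 0 inside a subshell, then execs the command there, so
# the command starts from 0 instead of an inherited -17. The calling
# shell's own oom_adj is left untouched.
# OOM_ADJ_FILE is a hypothetical override meant only for dry runs.
run_with_default_oom_adj() {
  (
    adj_file="${OOM_ADJ_FILE:-/proc/self/oom_adj}"
    # echo is a shell builtin, so /proc/self here is the subshell itself.
    echo 0 > "$adj_file" || exit 1
    exec "$@"
  )
}
```

In /etc/rc.local this could be used as, for example, `run_with_default_oom_adj /usr/sbin/sshd`, where the path is shown only as an illustration.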
