Wednesday, February 4, 2015

linux - How to stop Apache from crashing my entire server?

I maintain a Gentoo server with a few services, including Apache. It's fairly low-end (2GB of RAM and a low-end CPU with 2 cores). My problem is that, despite my best efforts, an overloaded Apache crashes the entire server. In fact, at this point I'm close to being convinced that Linux is a horrible operating system that isn't worth the time of anyone looking for stability under load.



Things I tried:




  1. Adjusting oom_adj for the root Apache process (and thus all its children). That had close to no effect. When Apache was overloaded it would grind the system to a halt, as the kernel paged out everything else before it got around to killing anything.

  2. Turning off swap. That didn't help: the kernel would instead evict file-backed pages (the binaries of running processes and other files on /), causing the same effect.

  3. Putting it in a memory-limited cgroup (limited to 512 MB of RAM, a quarter of the total). This "worked", at least in my own stress tests - except the server still keeps crashing under load (basically stalling all other processes, becoming inaccessible via SSH, etc.)

  4. Running it with idle I/O priority. This wasn't a very good idea in the end, because it just caused the system load to climb indefinitely (into the thousands) with almost no visible effect - until you tried to access a part of the disk that wasn't cached. That caused the task to freeze. (So much for good I/O scheduling, eh?)

  5. Limiting the number of concurrent connections to Apache. Setting the number too low caused web sites to become unresponsive due to most slots being occupied with long requests (file downloads).


  6. I tried various Apache MPMs without much success (prefork, event, itk).

  7. Switching from prefork/event+php-cgi+suphp to itk+mod_php. This improved performance, but didn't solve the actual problem.

  8. Switching I/O schedulers (cfq to deadline).
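
For reference, attempts 1 and 3 can be sketched in Python roughly as follows. This is a minimal sketch, not the exact setup described above. It assumes the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory (the mount point varies between systems) and that the kernel exposes /proc/<pid>/oom_score_adj, the newer replacement for oom_adj. The function names and the group name "apache" are made up for illustration.

```python
from pathlib import Path

# Assumption: cgroup v1 memory controller mount point; adjust for your system.
CGROUP_MEMORY_ROOT = Path("/sys/fs/cgroup/memory")
PROC_ROOT = Path("/proc")


def limit_memory(group, limit_bytes, pids, cgroup_root=CGROUP_MEMORY_ROOT):
    """Create the cgroup `group`, cap its RAM, and move `pids` into it."""
    cg = Path(cgroup_root) / group
    cg.mkdir(parents=True, exist_ok=True)
    # Hard limit on the memory charged to this group.
    (cg / "memory.limit_in_bytes").write_text(str(limit_bytes))
    for pid in pids:
        # The tasks file takes one PID per write.
        with open(cg / "tasks", "a") as f:
            f.write(f"{pid}\n")
    return cg


def set_oom_score_adj(pid, score, proc_root=PROC_ROOT):
    """Bias the OOM killer: higher scores are killed first (-1000..1000)."""
    if not -1000 <= score <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    (Path(proc_root) / str(pid) / "oom_score_adj").write_text(str(score))
```

Run as root, something like limit_memory("apache", 512 * 1024 * 1024, [apache_pid]) followed by set_oom_score_adj(apache_pid, 500) would approximate the setup; the value 500 is an arbitrary illustration. Processes forked afterwards inherit both the cgroup membership and the score.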



Just to stress this: I don't care if Apache itself goes down under load, I just want the rest of my system to remain stable. Of course, having Apache recover quickly after a brief period of intensive load would be great to have, but one step at a time.



Right now I am mostly dumbfounded by how humanity, in this day and age, can design an operating system where such a seemingly simple task (don't allow one system component to crash the entire system) seems practically impossible - or at least, very hard to do.



Please don't suggest things like VMs or "BUY MORE RAM".







Some more information gathered with a friend's help:
The processes hang when the cgroup oom killer is invoked. Here's the call trace:




[] ? prepare_to_wait+0x70/0x7b
[] mem_cgroup_handle_oom+0xdf/0x180
[] ? memcg_oom_wake_function+0x0/0x6d
[] __mem_cgroup_try_charge+0x32d/0x478
[] mem_cgroup_charge_common+0x48/0x73
[] ? __lru_cache_add+0x60/0x62
[] mem_cgroup_newpage_charge+0x3b/0x4a
[] handle_mm_fault+0x305/0x8cf
[] ? schedule+0x6ae/0x6fb
[] do_page_fault+0x214/0x22b
[] page_fault+0x1f/0x30



At this point, the Apache memory cgroup is practically deadlocked, and its processes are burning CPU in syscalls (all with the above call trace). This seems like a problem in the cgroup implementation...
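
One way to check whether the group is stuck in OOM handling is to read its memory.oom_control file, which in cgroup v1 reports oom_kill_disable and under_oom as "name value" lines. The following is a small sketch; the helper names are hypothetical and the cgroup path passed in is whatever your system uses.

```python
from pathlib import Path


def parse_oom_control(text):
    """Parse the 'name value' lines of a cgroup v1 memory.oom_control file."""
    state = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed 'name <integer>' lines.
        if len(parts) == 2 and parts[1].isdigit():
            state[parts[0]] = int(parts[1])
    return state


def read_oom_state(cgroup_dir):
    """Read and parse memory.oom_control from the given cgroup directory."""
    return parse_oom_control((Path(cgroup_dir) / "memory.oom_control").read_text())
```

If under_oom stays at 1 while the tasks in the group make no progress, the group is sitting at its limit with its tasks waiting in the OOM handler, which would match the trace above.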
