I maintain a Gentoo server with a few services, including Apache. It's fairly low-end (2GB of RAM and a low-end CPU with 2 cores). My problem is that, despite my best efforts, an overloaded Apache crashes the entire server. In fact, at this point I'm close to being convinced that Linux is a horrible operating system, and that looking for stability under load on it isn't worth anyone's time.
Things I tried:
- Adjusting oom_adj for the root Apache process (and thus all its children). That had close to no effect. When Apache was overloaded it would grind the system to a halt, as the kernel paged out everything else before it got around to killing anything.
- Turning off swap. Didn't help: the kernel would instead evict file-backed pages (the binaries of running processes and other files on /), which had to be re-read from disk and caused the same effect.
- Putting it in a memory-limited cgroup (limited to 512 MB of RAM, 1/4th of the total). This "worked", at least in my own stress tests - except the server keeps crashing under load (basically stalling all other processes, inaccessible via SSH, etc.)
- Running it with idle I/O priority. This wasn't a very good idea in the end, because it just caused the system load to climb indefinitely (into the thousands) with almost no visible effect - until you tried to access an unbuffered part of the disk. This caused the task to freeze. (So much for good I/O scheduling, eh?)
- Limiting the number of concurrent connections to Apache. Setting the number too low caused web sites to become unresponsive due to most slots being occupied with long requests (file downloads).
- I tried various Apache MPMs without much success (prefork, event, itk).
- Switching from prefork/event+php-cgi+suphp to itk+mod_php. This improved performance, but didn't solve the actual problem.
- Switching I/O schedulers (cfq to deadline).
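For anyone trying to reproduce the oom_adj and memory-cgroup attempts from the list above, here is a minimal sketch of what they amount to at the file level. It assumes a cgroup v1 memory controller (the `memory.limit_in_bytes` and `tasks` files) and the legacy `/proc/<pid>/oom_adj` interface; the function names, the directory arguments and the PIDs are hypothetical, and on a real system the cgroup directory would live somewhere under the mounted memory controller and need root.

```python
import os
import tempfile


def limit_cgroup_memory(cgroup_dir, limit_bytes, pids=()):
    """Set a cgroup v1 memory limit and move the given PIDs into the group.

    cgroup_dir is an assumption: on a real system it would be a directory
    under the mounted memory controller; here it can be any directory.
    """
    os.makedirs(cgroup_dir, exist_ok=True)
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    for pid in pids:
        # One PID per write, as the cgroup v1 tasks file expects.
        with open(os.path.join(cgroup_dir, "tasks"), "a") as f:
            f.write(f"{pid}\n")


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write the legacy oom_adj value (-17..15) for one process.

    Higher values make the process a more likely OOM-killer target;
    -17 exempts it. Newer kernels prefer oom_score_adj instead.
    """
    if not -17 <= value <= 15:
        raise ValueError("oom_adj must be between -17 and 15")
    with open(os.path.join(proc_root, str(pid), "oom_adj"), "w") as f:
        f.write(str(value))


if __name__ == "__main__":
    # Dry run against a scratch directory, so nothing on the host changes.
    scratch = tempfile.mkdtemp()
    group = os.path.join(scratch, "apache")
    limit_cgroup_memory(group, 512 * 1024 * 1024, pids=[1234])
    with open(os.path.join(group, "memory.limit_in_bytes")) as f:
        print("limit:", f.read())
```

Children forked after the parent is moved into the group inherit its cgroup, which is why moving only the root Apache process is normally enough.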
Just to stress this: I don't care if Apache itself goes down under load, I just want the rest of my system to remain stable. Of course, having Apache recover quickly after a brief period of intensive load would be great to have, but one step at a time.
Right now I am mostly dumbfounded by how humanity can, in this day and age, design an operating system where such a seemingly simple task (don't allow one system component to crash the entire system) seems practically impossible - or at least, very hard to do.
Please don't suggest things like VMs or "BUY MORE RAM".
Some more information gathered with a friend's help:
The processes hang when the cgroup oom killer is invoked. Here's the call trace:
[] ? prepare_to_wait+0x70/0x7b
[] mem_cgroup_handle_oom+0xdf/0x180
[] ? memcg_oom_wake_function+0x0/0x6d
[] __mem_cgroup_try_charge+0x32d/0x478
[] mem_cgroup_charge_common+0x48/0x73
[] ? __lru_cache_add+0x60/0x62
[] mem_cgroup_newpage_charge+0x3b/0x4a
[] handle_mm_fault+0x305/0x8cf
[] ? schedule+0x6ae/0x6fb
[] do_page_fault+0x214/0x22b
[] page_fault+0x1f/0x30
At this point, the apache memory cgroup is practically deadlocked, and burning CPU in syscalls (all with the above call trace). This seems like a problem in the cgroup implementation...
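One way to confirm from userspace that the group really is sitting in the OOM path is to read its `memory.oom_control` file, which on cgroup v1 reports `oom_kill_disable` and `under_oom` as "key value" lines. The sketch below only parses that format; the helper names are made up, and the path in the usage note is an assumption about where the group is mounted. It does not explain or fix the hang itself.

```python
def parse_oom_control(text):
    """Parse the 'key value' lines of a cgroup v1 memory.oom_control file."""
    state = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" lines; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            state[parts[0]] = int(parts[1])
    return state


def cgroup_is_under_oom(state):
    """True if the kernel reports the group as currently under OOM."""
    return state.get("under_oom") == 1


if __name__ == "__main__":
    # Sample contents in the documented format, not captured from my server.
    sample = "oom_kill_disable 0\nunder_oom 1\n"
    state = parse_oom_control(sample)
    print(state, cgroup_is_under_oom(state))
```

On a live system you would feed it the contents of something like /sys/fs/cgroup/memory/apache/memory.oom_control (hypothetical path). If `under_oom` stays at 1 while the tasks keep burning CPU, that at least matches the call trace above.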