This morning we saw a step increase in CPU utilization and load on our web server, call it host1. Overall CPU usage went from approximately 100% to 500% (on 16 cores, 32 hyperthreads), and the load average went from under 1 to around 6, then remained at that elevated level. Using top, it's clear that the added CPU usage is all from Apache (httpd processes). (We're on Apache 2.4.39, on CentOS 7.6.)
PHP processes and mariadb connections are both unchanged from their baseline levels. Data transferred over the server's network connection is unchanged. Requests served per minute, based on line counts of the server's domlogs are unchanged. There has been no increase in disk utilization, wait time, or service time. Memory usage is also flat.
I checked the Apache error log for the time when the load jumped up, and didn't see anything new, nor did Apache restart at that time.
As a test, I moved all our web traffic from host1 to an effectively identical backup server, host2. Once the traffic was migrated, host1 CPU usage dropped to near zero (a drop of ~400% at that point), while usage on host2 increased to only ~80%, about the level host1 was running at before the sudden increase. So whatever changed, it's something to do with that specific server, not with the nature of our web traffic.
I checked the Apache configurations on both servers and, as expected, they are identical. Looking at Apache status, the number of simultaneous connections appears to be roughly double on host1 compared to host2, which is what you would expect given the higher load/CPU usage: it's taking longer to serve requests, so more simultaneous connections stick around for the same throughput. Subjectively, it seems like some requests are returning in a normal amount of time on host1, while others are taking significantly longer than usual (seconds rather than the expected milliseconds). That could be a cause or a consequence of the higher load, though.
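For reference, the connection comparison is just a quick mod_status check, roughly like the line below; it assumes mod_status is enabled and reachable at /server-status on localhost, which is specific to our setup (and ReqPerSec only appears with ExtendedStatus On).

    # Compare worker counts on each host (assumes mod_status is enabled)
    curl -s "http://localhost/server-status?auto" | grep -E 'BusyWorkers|IdleWorkers|ReqPerSec'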
I tried disabling mod_security on host1 and ruled that out as the cause. Next I tried rebooting host1 completely, but after transferring traffic back, the high load persisted. Any idea what else might cause Apache on a particular server to suddenly and persistently start using more CPU and taking longer to serve the same volume of requests?
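To be clear about how I disabled mod_security for that test: it was done at the config level rather than by unloading the module, roughly like the snippet below. The file path is just a typical CentOS 7 location and may differ on other setups.

    # In the ModSecurity config, e.g. /etc/httpd/conf.d/mod_security.conf (path is an assumption)
    SecRuleEngine Off
    # then apply it without a full restart:
    apachectl graceful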
Edit: For what it's worth, both system and user CPU use stepped up proportionally.
Edit 2: Results of perf top:
Samples: 905K of event 'cycles:ppp', Event count (approx.): 111054988919
Overhead Shared Object Symbol
16.69% [kernel] [k] system_call_after_swapgs
11.02% [kernel] [k] sysret_check
5.64% [kernel] [k] copy_user_enhanced_fast_string
3.79% [kernel] [k] fget_light
3.04% [kernel] [k] __audit_syscall_exit
3.01% [kernel] [k] generic_file_aio_read
2.51% [kernel] [k] __audit_syscall_entry
1.78% [kernel] [k] sys_lseek
1.63% [kernel] [k] __radix_tree_lookup
1.45% [kernel] [k] sys_read
1.43% libpthread-2.17.so [.] __pthread_disable_asynccancel
1.32% [kernel] [k] touch_atime
1.28% [kernel] [k] ext4_file_read
1.23% [kernel] [k] radix_tree_descend
1.11% [kernel] [k] mutex_lock
1.08% [kernel] [k] vfs_read
1.04% [kernel] [k] file_read_actor
0.95% libapr-1.so.0.6.5 [.] apr_file_read
0.85% [kernel] [k] do_sync_read
0.84% [kernel] [k] __d_lookup_rcu
0.81% [kernel] [k] __x86_indirect_thunk_rax
0.79% [kernel] [k] fsnotify
0.78% [kernel] [k] generic_segment_checks
0.77% [kernel] [k] __check_object_size
0.74% [kernel] [k] current_kernel_time
0.73% libapr-1.so.0.6.5 [.] apr_file_read_full
0.71% [kernel] [k] put_page
0.70% [kernel] [k] ext4_llseek
0.70% php-fpm [.] execute_ex
0.69% libaprutil-1.so.0.6.1 [.] getpage
0.65% [kernel] [k] _cond_resched
0.64% libaprutil-1.so.0.6.1 [.] read_from
0.59% [kernel] [k] strncpy_from_user
0.59% libpthread-2.17.so [.] llseek
0.57% [kernel] [k] __fsnotify_parent
0.57% [kernel] [k] cleanup_module
0.55% [kernel] [k] rw_verify_area
0.55% [kernel] [k] __find_get_page
0.51% [kernel] [k] auditsys
0.51% [kernel] [k] security_file_permission
0.48% php-fpm [.] _is_numeric_string_ex
0.47% [kernel] [k] unroll_tree_refs
0.47% libapr-1.so.0.6.5 [.] apr_file_seek
0.42% [kernel] [k] generic_file_llseek_size
0.41% [kernel] [k] fput
0.40% php-fpm [.] zend_hash_find
0.35% [kernel] [k] __virt_addr_valid
0.30% [kernel] [k] path_put
0.27% libaprutil-1.so.0.6.1 [.] getnext
0.27% libz.so.1.2.7 [.] 0x0000000000002d11
0.27% [kernel] [k] tracesys
0.27% [unknown] [k] 0x00002af9a2716d7b
0.27% [kernel] [k] lookup_fast
0.25% [kernel] [k] mutex_unlock
0.25% [kernel] [k] kfree
0.24% [kernel] [k] current_fs_time
For comparison, here are the results from our other server, host2, which is handling a comparable amount of web traffic, plus all our databases currently, and isn't experiencing this high load issue:
Samples: 1M of event 'cycles:ppp', Event count (approx.): 104427407206424
Overhead Shared Object Symbol
27.34% [kernel] [k] retint_userspace_restore_args
13.57% php-fpm [.] zend_do_inheritance
7.56% [kernel] [k] handle_mm_fault
7.56% httpd [.] 0x000000000004470e
6.87% libpthread-2.17.so [.] pthread_cond_broadcast@@GLIBC_2.3.2
4.43% [kernel] [k] cpuidle_enter_state
3.31% libc-2.17.so [.] __memmove_ssse3_back
3.31% [unknown] [.] 0x00002b8a4d236777
2.68% [kernel] [k] cap_inode_permission
2.09% libperl.so [.] S_refto
2.09% [unknown] [k] 0x00007f4479518900
1.37% [kernel] [k] free_pages_prepare
1.20% [kernel] [k] hrtimer_interrupt
0.90% [kernel] [k] irq_return
0.89% [kernel] [k] change_pte_range
0.76% httpd [.] ap_find_command
0.69% [unknown] [k] 0x00007fe15a923be8
0.68% [kernel] [k] lookup_fast
0.60% libperl.so [.] Perl_runops_standard
0.60% libperl.so [.] Perl_pp_multideref
0.60% [unknown] [.] 0x00007f227838a777
0.55% [kernel] [k] system_call_after_swapgs
0.46% [unknown] [.] 0x00007f18e7736015
0.45% libc-2.17.so [.] __memcpy_ssse3_back
0.40% mysqld [.] 0x0000000000b643c4
0.40% ld-2.17.so [.] _dl_map_object_deps
0.39% [unknown] [.] 0x00007f81151f761c
0.35% [kernel] [k] kmem_cache_alloc
0.35% [kernel] [k] vma_interval_tree_insert
0.30% [kernel] [k] _raw_spin_lock_irqsave
0.30% libc-2.17.so [.] __readdir_r
0.30% [kernel] [k] tick_do_update_jiffies64
0.30% libpng15.so.15.13.0 [.] 0x000000000001ff7f
0.30% [unknown] [.] 0x00007f2a3ffee13c
0.30% libpthread-2.17.so [.] __ctype_init@plt
0.26% libc-2.17.so [.] _int_malloc
0.26% libc-2.17.so [.] malloc
0.26% libc-2.17.so [.] malloc_consolidate
0.26% [kernel] [k] mem_cgroup_charge_common
0.26% libperl.so [.] Perl_hv_common
0.26% [kernel] [k] pid_revalidate
0.26% libperl.so [.] Perl_sv_eq_flags
0.26% libcairo.so.2.11512.0 [.] 0x000000000004bf43
0.26% libperl.so [.] S_dopoptosub_at
0.26% libcairo.so.2.11512.0 [.] 0x00000000000d059a
host2 is much more variable in terms of the top functions though; this is just one example. host1 (the one with the problem) has a very stable top few entries in the list.
Edit 3: Having tracked the issue down to mod_security, I've found that it specifically appears to be related to the IP address database, /tmp/ip.pag, growing indefinitely despite the pruning timeouts we have in place. Clearing this file resolves the high CPU usage without disabling mod_security entirely. However, it will presumably grow again if we continue using IP-based rules, so we're looking into modifying our rules, or we may just set up a daily cron to clear old IPs out of the file.
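If we do go the cron route, it would look something like the sketch below. The /tmp paths assume SecDataDir is /tmp, and the graceful restart afterwards is a guess at what's safe while httpd has the file open, not something we've settled on.

    # /etc/cron.d/modsec-ip-cleanup (hypothetical): clear ModSecurity's persistent IP collection daily
    0 4 * * * root rm -f /tmp/ip.pag /tmp/ip.dir && /usr/sbin/apachectl graceful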
Answer
First, sanity-check whether the kernel is trying to tell you something, perhaps about the storage system, given how much of the storage stack shows up in the samples. Review the output of dmesg and the syslog log files.
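For example, using nothing beyond the CentOS 7 defaults:

    dmesg -T | tail -n 100                                       # recent kernel messages with readable timestamps
    grep -iE 'error|warn|fail' /var/log/messages | tail -n 100   # default syslog file on CentOS 7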
Your problem host is spending more of its time in the kernel. Contrast that with the other host, where php-fpm and httpd are in the top 5. Most workloads do their useful work in user space, so low kernel overhead is good.
Note the time spent in the audit interfaces (__audit_syscall_entry and __audit_syscall_exit), and I believe fsnotify is used by the audit system as well. Review any audit rules, both file system watches and system call rules: auditing something as broad as all reads is very expensive. See the auditd package documentation for examples of targeted auditing in a few scenarios.
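A quick way to check what is active, assuming auditd is in use (it is installed by default on CentOS 7):

    auditctl -s    # audit status: enabled flag, backlog, lost events
    auditctl -l    # list the active audit rules
    # A broad syscall rule like "-a always,exit -S read" is what makes every read expensive;
    # a narrow file watch is cheap by comparison, e.g. (example rule, not a recommendation):
    auditctl -w /etc/httpd/conf -p wa -k httpd_conf_changes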
Much more detail about what Linux is up to is available if you profile further: use perf to count syscalls, and you can also capture the entire call graph and render it as a flame graph visualization.
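For example, assuming perf is installed and you have a checkout of Brendan Gregg's FlameGraph scripts (the ./stackcollapse-perf.pl and ./flamegraph.pl paths below point at that checkout):

    perf stat -e 'syscalls:sys_enter_*' -a -- sleep 10      # count syscalls system-wide for 10 seconds
    perf record -F 99 -a -g -- sleep 30                     # sample on-CPU stacks with call graphs
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg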
Edit: those flame graphs are particularly useful in a high CPU utilization scenario if you have symbols installed. You can compare them against the problem stacks in issue reports, and dig further into the specific function that is taking a long time.
Install yum-utils, then use its debuginfo-install helper to install symbols:
debuginfo-install --enablerepo=base-debuginfo httpd
(It is a convenience script, so you don't have to know the -debuginfo suffix convention.)
You found ModSecurity issue Spontaneously high CPU usage #890. I don't actually see a stack like that yet in your profiling output, but if you install debug symbols, you can see interactions between APR, the module code, and the OS kernel, rather than just one of those in isolation.