Saturday, February 25, 2017

linux - Unresponsive nginx while doing “nginx reload”



While reloading nginx, I started getting "possible SYN flooding on port 443" errors in the messages log, and it seems like nginx becomes completely unresponsive at that time (for quite a while), because zabbix reports "nginx is down" with a ping of 0s. RPS at that time is about 1800.



But the server stays responsive on other, non-web ports (SSH, etc.).



Where should I look, and which configs (sysctl, nginx) should I share, to find the root cause of this?
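
A quick way to see whether the listen queue is actually overflowing around the reload would be something like this (assuming standard net-tools/iproute2, and /var/log/messages as the log mentioned above):

$ grep -i 'SYN flooding' /var/log/messages | tail -5     # the kernel messages themselves
$ netstat -s | grep -iE 'listen|SYNs'                    # listen-queue overflows and dropped SYNs
$ ss -lnt '( sport = :443 )'                             # Recv-Q = current accept queue, Send-Q = configured backlog

If the "listen queue of a socket overflowed" counter grows only around reloads, the accept queue on the listening socket, rather than raw SYN volume, is the likely bottleneck.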



Thanks in advance.




UPD:



Some additional info:



$ netstat -tpn |awk '/nginx/{print $6,$7}' |sort |uniq -c
3266 ESTABLISHED 31253/nginx
3289 ESTABLISHED 31254/nginx
3265 ESTABLISHED 31255/nginx
3186 ESTABLISHED 31256/nginx



nginx.conf sample:



worker_processes  4;
timer_resolution 100ms;
worker_priority -15;
worker_rlimit_nofile 200000;

events {
    worker_connections 65536;
    multi_accept on;
    use epoll;
}

http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;

    keepalive_requests 100;
    keepalive_timeout 65;
}
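
The sample does not show the listen directives; unless a backlog= parameter is set there, nginx defaults to a backlog of 511 on Linux, regardless of net.core.somaxconn. The values actually in effect can be read off the live sockets, for example:

$ ss -ltn '( sport = :80 or sport = :443 )'
# for LISTEN sockets, Send-Q is the configured backlog and
# Recv-Q is the number of connections waiting to be accepted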


custom sysctl.conf:



net.ipv4.ip_local_port_range=1024 65535

net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.all.send_redirects=0
net.core.netdev_max_backlog=10000
net.ipv4.tcp_syncookies=0
net.ipv4.tcp_max_syn_backlog=20480
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_syn_retries=2
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216

net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.netfilter.nf_conntrack_max=1048576
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_timestamps=1
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_max_tw_buckets=1400000
net.core.somaxconn=250000

net.ipv4.tcp_keepalive_time=900
net.ipv4.tcp_keepalive_intvl=15
net.ipv4.tcp_keepalive_probes=5
net.ipv4.tcp_fin_timeout=10
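
These values only take effect once loaded; a minimal sketch of re-applying them and spot-checking the SYN-related ones (assuming they live in /etc/sysctl.conf):

$ sysctl -p /etc/sysctl.conf
$ sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_syncookies

Note that with tcp_syncookies=0 the kernel cannot fall back to SYN cookies once the SYN backlog fills, so new SYNs are simply dropped, which matches the "possible SYN flooding" messages.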


UPD2



Under normal load of about 1800 RPS, when I set the nginx backlog to 10000 on ports 80 and 443 and then reloaded nginx, it started using more RAM (3.8 GB of my 4 GB instance were used, and some workers were killed by the OOM-killer), and with worker_priority at -15 the load was over 6 (while my instance has only 4 cores). The instance was quite laggy, so I set worker_priority to -5 and the backlog to 1000 for every port. It now uses less memory and peak load was 3.8, but nginx still becomes unresponsive for a minute or two after a reload. So the problem still persists.
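
Since the OOM kills coincide with the reload, it may be worth watching per-worker memory while the old workers drain; a rough sketch using standard procps tools (RSS is in KiB):

$ watch -n 5 'ps -o pid,rss,vsz,args -C nginx --sort=-rss'
# during a reload the old workers ("worker process is shutting down") and the
# freshly started ones run side by side, so peak memory usage roughly doubles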




Some netstat details:



netstat -tpn |awk '/:80/||/:443/{print $6}' |sort |uniq -c
6 CLOSE_WAIT
14 CLOSING
17192 ESTABLISHED
350 FIN_WAIT1
1040 FIN_WAIT2
216 LAST_ACK
338 SYN_RECV
52541 TIME_WAIT

Answer



If you have:



  keepalive_timeout  65;


I can imagine that it can take a while for connections to be terminated and workers restarted. Without looking at the code, I am not quite sure whether nginx waits for them to expire once it gets a reload.




You could try lowering the value and see if it helps.
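
One way to test that would be to time how long the old workers stay in their shutting-down state after a reload; if they exit well before 65 seconds while nginx is still unresponsive, keepalive_timeout is probably not the cause. A rough sketch (the process title below is what recent nginx versions use for draining workers):

nginx -t && nginx -s reload

# poll the number of draining workers every 5 seconds until none remain
while pgrep -f 'nginx: worker process is shutting down' > /dev/null; do
    echo "$(date +%T)  old workers: $(pgrep -fc 'nginx: worker process is shutting down')"
    sleep 5
done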

