Thursday, July 19, 2018

networking - Intermintent slow MySQL Connections



We keep getting errors like this on our PHP sites.




Can't connect to MySQL server on '192.168.100.85' (4)





web is the web server 192.168.1.116
mysql is the mysql server 192.168.100.85



So I built a script on the web server to make 10,000 mysql connections in a row and time them. It will "sometimes" produce the error. Most times everything runs fine and there is an average time of 5-10 ms for the call to mysql_connect



Some serious Googling showed that the (4) error is due to the connection being cut off by the timeout.




web# grep mysql.connect_timeout /etc/php.ini
mysql.connect_timeout = 1





So I modified the timeout in the script to 30 to see if it would work. The connection errors went away, but occasionally the connection would take 5 seconds.



After more Googling and some tcpdump I found that occasionally when the MySQL server is doing its reverse lookup of the IP the DNS server would fail to respond. So after 5 seconds it would give up and allow the connection.



I have since added skip-name-resolve to the server. But this did not solve the problem.



Now my test would show slow connections taking 3-4.5 seconds instead of the set 5 with the DNS issue.



So I ran my tests again with tcpdump running on both ends.





web# tcpdump -n -s 65535 -w web3-$(date +"%F_%H-%M-%S").pcap host 192.168.100.85 and port 3306
mysql# tcpdump -n -s 65535 -w master1-$(date +"%F_%H-%M-%S").pcap host 192.168.1.116 and port 3306




Here is the packets from the relevant slow connection.



Packets from web:



No.     Time                       Source                Destination           Protocol Info

13312 2010-10-13 10:01:01.201965 192.168.1.116 192.168.100.85 TCP 41560 > mysql [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=904829062 TSER=0 WS=2
13316 2010-10-13 10:01:04.201577 192.168.1.116 192.168.100.85 TCP 41560 > mysql [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=904832062 TSER=0 WS=2
13317 2010-10-13 10:01:04.204837 192.168.100.85 192.168.1.116 TCP mysql > 41560 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1380 TSV=562240314 TSER=904832062 WS=7
13318 2010-10-13 10:01:04.204853 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=904832065 TSER=562240314
13319 2010-10-13 10:01:04.205886 192.168.100.85 192.168.1.116 MySQL Server Greeting proto=10 version=5.0.77-log
13320 2010-10-13 10:01:04.205899 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=61 Win=5840 Len=0 TSV=904832066 TSER=562240316
13321 2010-10-13 10:01:04.205959 192.168.1.116 192.168.100.85 MySQL Login Request userexample
13322 2010-10-13 10:01:04.206800 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=61 Ack=71 Win=5888 Len=0 TSV=562240317 TSER=904832066
13323 2010-10-13 10:01:04.206874 192.168.100.85 192.168.1.116 MySQL Response OK
13324 2010-10-13 10:01:04.208823 192.168.1.116 192.168.100.85 MySQL Request Quit

13325 2010-10-13 10:01:04.208839 192.168.1.116 192.168.100.85 TCP 41560 > mysql [FIN, ACK] Seq=76 Ack=72 Win=5840 Len=0 TSV=904832069 TSER=562240317
13326 2010-10-13 10:01:04.210422 192.168.100.85 192.168.1.116 TCP mysql > 41560 [FIN, ACK] Seq=72 Ack=76 Win=5888 Len=0 TSV=562240320 TSER=904832069
13327 2010-10-13 10:01:04.210437 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=77 Ack=73 Win=5840 Len=0 TSV=904832071 TSER=562240320
13328 2010-10-13 10:01:04.210567 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=73 Ack=77 Win=5888 Len=0 TSV=562240320 TSER=904832069


Packets from mysql:



No.     Time                       Source                Destination           Protocol Info
13315 2010-10-13 10:01:04.204817 192.168.1.116 192.168.100.85 TCP 41560 > mysql [SYN] Seq=0 Win=5840 Len=0 MSS=1380 TSV=904832062 TSER=0 WS=2

13316 2010-10-13 10:01:04.204836 192.168.100.85 192.168.1.116 TCP mysql > 41560 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=562240314 TSER=904832062 WS=7
13317 2010-10-13 10:01:04.206611 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=904832065 TSER=562240314
13318 2010-10-13 10:01:04.206808 192.168.100.85 192.168.1.116 MySQL Server Greeting proto=10 version=5.0.77-log
13319 2010-10-13 10:01:04.207658 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=1 Ack=61 Win=5840 Len=0 TSV=904832066 TSER=562240316
13320 2010-10-13 10:01:04.207815 192.168.1.116 192.168.100.85 MySQL Login Request user=example
13321 2010-10-13 10:01:04.207872 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=61 Ack=71 Win=5888 Len=0 TSV=562240317 TSER=904832066
13322 2010-10-13 10:01:04.207910 192.168.100.85 192.168.1.116 MySQL Response OK
13323 2010-10-13 10:01:04.210817 192.168.1.116 192.168.100.85 MySQL Request Quit
13324 2010-10-13 10:01:04.210849 192.168.100.85 192.168.1.116 TCP mysql > 41560 [FIN, ACK] Seq=72 Ack=76 Win=5888 Len=0 TSV=562240320 TSER=904832069
13325 2010-10-13 10:01:04.211632 192.168.1.116 192.168.100.85 TCP 41560 > mysql [FIN, ACK] Seq=76 Ack=72 Win=5840 Len=0 TSV=904832069 TSER=562240317

13326 2010-10-13 10:01:04.211640 192.168.100.85 192.168.1.116 TCP mysql > 41560 [ACK] Seq=73 Ack=77 Win=5888 Len=0 TSV=562240320 TSER=904832069
13327 2010-10-13 10:01:04.213243 192.168.1.116 192.168.100.85 TCP 41560 > mysql [ACK] Seq=77 Ack=73 Win=5840 Len=0 TSV=904832071 TSER=562240320


As you can see, web re-sent the initial ACK after 3 seconds of no response. But mysql never even saw that initial packet.



I also tried to run a ping flood to check for dropped packets. If you leave it running long enough you will get dropped packets.



web3# ping -f 192.168.100.85
PING 192.168.100.85 (192.168.100.85) 56(84) bytes of data.

....................................................................
--- 192.168.100.85 ping statistics ---
38253 packets transmitted, 38185 received, 0% packet loss, time 460851ms
rtt min/avg/max/mdev = 0.880/3.430/66.904/8.015 ms, pipe 7, ipg/ewma 12.047/1.378 ms


This issue is very intermittent but is happening all the time. I do understand that simply increasing our timeout would help greatly, but I'd rather have the connects always be fast and never have a 3-4 second delay in serving a page.



I contacted our hosting provider and they say its likely a result of the bad interaction between Nagle's Algorithm and Delayed ACK. http://www.stuartcheshire.org/papers/NagleDelayedAck/




It seems to me like packets are being dropped. Any ideas on a way I could better prove it to the hosting guys? Ping dropping 60 out of 38,000 packets does seem kind of minor. Is this something I should just live with?



Thank you for your time in looking into this!


Answer



I wouldn't try it with a ping flood, but I would set up a continous stream and let it run for, say, 2 minutes. That's a more valid test of network "loss" - sometimes ping -f can overrun particular pieces of equipment and give unreliable results.


No comments:

Post a Comment

linux - How to SSH to ec2 instance in VPC private subnet via NAT server

I have created a VPC in aws with a public subnet and a private subnet. The private subnet does not have direct access to external network. S...