Sunday, January 21, 2018

amazon web services - Losing database connection with high traffic on AWS - not sure where to start troubleshooting



I'm designing a highly available WordPress site on AWS using Elastic Beanstalk, and load-testing it with Locust.



Everything looks alright: my EC2 instances are t2.mediums, auto-scaled between 3 and 6 instances across availability zones. The load balancer has cross-zone load balancing enabled (so traffic should be distributed to 3 servers in 3 different zones), and I am using Aurora (db.t2.medium) in a master -> read-replica setup.



Things are fine when I visit the site in my browser, but as soon as I spin up Locust (with 100-500 users, 90-100 second wait times, 10 user hatch rate) my site almost instantly loses the connection to the database and eventually throws 5xx errors.



My Apache/PHP setup is pretty much out of the box from Beanstalk (Amazon Linux AMI, PHP 5.6); specs are listed below.
OPcache is enabled by default, but PHP-FPM is not currently installed.




Here is a diagram of my setup, and then the specs:



[architecture diagram]




  • EC2

    • 3 × t2.medium

      • 2 vCPUs

      • 24 CPU credits/hour

      • 4 GB RAM

  • Apache 2.4

  • PHP 5.6

    • upload_max_filesize => 64M

    • post_max_size => 64M

    • max_execution_time => 120

    • memory_limit => 256M

    • OPcache

      • opcache.enable=1

      • opcache.memory_consumption=128

      • opcache.interned_strings_buffer=8

      • opcache.max_accelerated_files=4000





I am unsure whether this is a hardware configuration issue, or whether I need to tune PHP/Apache/MySQL.


Answer



OK so I think I had several issues:




  1. When I originally created the DBs, I made them t2.micros, which by default only allow about 40 connections at once. I later changed the instance class to db.t2.medium, but max_connections seemed to stay the same. I have recreated the DBs as db.t2.mediums; max_connections is now 90, and I can increase it further if need be.


  2. I misread the Locust documentation and had set the test to hit the site every 90 ms, i.e. every 0.09 seconds, which is a lot. I increased the wait time to 3-10 seconds (actual seconds) and the servers hold up fine now.
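A quick sanity check of what that misreading meant in aggregate load (plain Python; the 500-user figure and wait times come from this post, and this assumes each simulated user fires one request per wait interval, ignoring response time):

```python
def requests_per_second(users, mean_wait_seconds):
    """Rough aggregate request rate: each simulated user sends one
    request per wait interval, so rate ~= users / mean wait."""
    return users / mean_wait_seconds

# What I thought I had configured: 90-100 *second* waits (mean ~95 s)
print(round(requests_per_second(500, 95.0), 1))    # ~5.3 req/s

# What Locust actually did: 90-100 *millisecond* waits (mean ~0.095 s)
print(round(requests_per_second(500, 0.095), 1))   # ~5263.2 req/s

# After the fix: 3-10 second waits (mean 6.5 s)
print(round(requests_per_second(500, 6.5), 1))     # ~76.9 req/s
```

Roughly a 1000x difference in request rate, which explains why the database fell over instantly.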




Increasing the Locust users to 200, however, results in a 75% failure (database disconnect) rate, but I think I can either raise max_connections further or put a CDN in front of the site (which I'll be doing anyway).
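One way to ballpark how high max_connections might need to go is to bound the worst case by the web tier: with non-persistent PHP/WordPress connections, each busy Apache worker can hold one MySQL connection, so the ceiling is roughly servers × MaxRequestWorkers. A sketch of that arithmetic (the 256 figure is an assumption based on the Apache prefork default; check the actual MPM config on the instances):

```python
def peak_db_connections(web_servers, workers_per_server):
    """Worst-case simultaneous DB connections if every Apache
    worker on every server holds one non-persistent connection."""
    return web_servers * workers_per_server

# 3 EC2 instances, assumed prefork default of 256 workers each
worst_case = peak_db_connections(3, 256)
print(worst_case)        # 768 potential connections
print(worst_case > 90)   # True: well above Aurora's max_connections of 90
```

So even with a modest web tier, the theoretical connection demand can exceed a max_connections of 90 by a wide margin, which is why page caching or a CDN (cutting how many requests reach PHP at all) helps as much as raising the limit.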




@michael-sqlbot gets the prize here, he led me down the right path.

