Monday, July 22, 2019

failover - Using Azure Traffic Manager for an immediate increase in capacity

I have a REST web service on Azure with very high but variable load. It is set up to auto-scale using Paraleap so that it can handle peak periods while keeping costs down when things are quieter.



I have never been able to find a way, using any metrics, to predict that a server is about to max out before it actually does. So my current solution is a separate program that constantly checks whether the server is up. If the server starts returning errors, the program tells it to return an error message to a certain percentage of users. Returning a simple error uses fewer of the server's resources, which lets the majority of users keep a working service. The program then tells Paraleap to increase the number of instances. Adding instances normally takes 10-15 minutes, so during this period things aren't great and some users get errors, but eventually the new instances kick in and normal service resumes.
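To make the shedding step concrete, here is a minimal sketch of the idea rather than my actual code. The function names, the 10% step, and the 50% cap are all hypothetical and chosen only for illustration.

```python
import random


def should_shed(shed_fraction, rng=random.random):
    """Return True if this request should get the cheap error response.

    shed_fraction is the share of requests to reject, between 0.0 and 1.0.
    rng is injectable so the decision can be made deterministic in tests.
    """
    return rng() < shed_fraction


def next_shed_fraction(current, healthy, step=0.1, max_fraction=0.5):
    """Adjust the shed fraction after each health check.

    While the health check fails, shed a little more traffic (up to a cap).
    Once it passes, back off by the same step until nothing is shed.
    The step and cap are assumed values, not measured ones.
    """
    if healthy:
        return max(0.0, round(current - step, 2))
    return min(max_fraction, round(current + step, 2))
```

The watchdog would call `next_shed_fraction` after every health check and push the result to the service, which calls `should_shed` per request. The scale-out request to Paraleap is a separate step and is not shown here.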




I hoped Azure Traffic Manager would be my solution. My plan was to use failover mode so that, when a failure was detected on my main web service, I could divert x% of requests to a backup, which would return the main service to a working state. At the same time I would independently tell the main web service to scale, and once that finished, Traffic Manager would divert everything back to the main web service. In other words, I'd get an instant increase in capacity to fill the gap while I boot up new instances.
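The behaviour I am after can be sketched as below. This is only an illustration of the routing decision I want, not anything Traffic Manager exposes; `pick_endpoint` and the endpoint names are hypothetical.

```python
import random


def pick_endpoint(main_healthy, backup_fraction, rng=random.random):
    """Choose which service should receive a request.

    While the main service is healthy, everything goes to it.
    While it is failing, only backup_fraction of requests (0.0 to 1.0)
    are diverted to the backup; the rest still go to the main service.
    """
    if main_healthy:
        return "main"
    return "backup" if rng() < backup_fraction else "main"
```

With `backup_fraction=0.3`, roughly 30% of requests would move to the backup during a failure, and all of them would return to the main service as soon as it reports healthy again.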



Unfortunately, I can't find a way to do this. It looks like Traffic Manager, on detecting a failure, diverts 100% of traffic to the backup. So I'd need to more than double my server capacity just for these moments, i.e. have X instances for the main web service and X+1 waiting in the backup. A failure on the main would divert 100% of requests to the backup, which would have more capacity. I would then launch more instances for the main, and eventually Traffic Manager would send all requests back there, at which point I'd need to add more instances to the backup and leave it sitting idle again. This would be massive overkill and would cost me a fortune!
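To put rough numbers on the cost difference, here is a back-of-the-envelope sketch. It assumes load scales linearly with instance count and that peak demand is about X+1 instances' worth of traffic; both are simplifications, and the function names are hypothetical.

```python
import math


def full_failover_instances(main_instances):
    """Total instances when the backup must absorb 100% of traffic.

    The backup needs one more instance than the main (X + 1),
    so the total is X + (X + 1).
    """
    return main_instances + (main_instances + 1)


def partial_diversion_instances(main_instances, backup_fraction):
    """Total instances when only a fraction of traffic is diverted.

    Assumes demand is about (X + 1) instances' worth and the backup
    only has to serve backup_fraction of it, rounded up.
    """
    backup = math.ceil(backup_fraction * (main_instances + 1))
    return main_instances + backup
```

For example, with 10 main instances, full failover needs 21 instances in total, whereas diverting 20% of traffic would need about 13 under these assumptions.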



Does anyone have any suggestions on how I can manage this better?



Thanks!
