Wednesday, June 14, 2017

Did my registrar screw up or is this how name server propagation works?



My company has a number of domains with a large registrar that shall go unnamed. We are making some changes to our DNS infrastructure, and the first of those is moving our secondary DNS from one server on-site to four servers off-site. We updated the name servers for each domain at the registrar by removing the entry for the old secondary name server and adding the four new ones. I monitored the old secondary server for requests, and once I saw that no new requests had been made for 24 hours, I shut it down. That was this morning. I assumed at this point that everything was good. That was my mistake: I should have checked that name servers at large were returning the correct NS records.




This afternoon we shut down our primary DNS server for maintenance. That is when I started getting alerts from our external monitoring. I checked and, sure enough, the DNS server used by the monitoring service reported that the only NS record for our primary domain was the primary name server. Neither the new secondary servers nor the old secondary were listed.



Is it unreasonable of me to have assumed that because the update was from



ns1.mydomain.com
ns2.mydomain.com


to




ns1.mydomain.com
ns1.backupdns.com
ns2.backupdns.com
ns3.backupdns.com
ns4.backupdns.com


in one step at the registrar that there should be no intermediate state where the only NS record was for ns1.mydomain.com?




Going forward, to be safe, I will obviously leave the old name servers alone until I'm 100% sure the new ones have propagated, and only then remove the old name servers from the registrar. However, I'd still like to know whether my registrar screwed up or my expectation was unreasonable.


Answer




Is it unreasonable of me to have assumed that because the update was from <... trimmed ...>




YES.



Generally speaking, it is unreasonable for you to make ANY assumption about ANY change performed through control panel software (except the standard assumption that it's going to screw up somehow).
That includes registrar DNS management interfaces (which are usually pretty awful on the back end).




The changes you made were probably processed as two separate transactions (one removing the old server, one adding the new ones), and a resolver fetched and cached your NS records after the first transaction but before the second.
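To make that intermediate state concrete, here is a minimal sketch that models an edit applied as a removal followed by an addition. It is purely illustrative: the function name `apply_as_two_transactions` is hypothetical, and nothing here reflects how any particular registrar actually implements updates. The name server lists are the ones from the question.

```python
# Hypothetical model of a back end that applies one edit as two separate
# transactions instead of a single atomic swap. Not any real registrar's code.
OLD = {"ns1.mydomain.com", "ns2.mydomain.com"}
NEW = {
    "ns1.mydomain.com",
    "ns1.backupdns.com",
    "ns2.backupdns.com",
    "ns3.backupdns.com",
    "ns4.backupdns.com",
}


def apply_as_two_transactions(old, new):
    """Return every published state: before, after removals, after additions."""
    after_removal = old & new                      # entries not in `new` are dropped first
    after_addition = after_removal | (new - old)   # then the new entries are added
    return [set(old), after_removal, after_addition]


states = apply_as_two_transactions(OLD, NEW)
for label, state in zip(("before", "after removal", "after addition"), states):
    print(f"{label}: {sorted(state)}")
```

Anyone who caches the middle state sees only ns1.mydomain.com until that cache entry expires, which matches what the external monitoring reported.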






You got bit here because you kind-of Did It Wrong - though in a way that many of us do.
For the future, when decommissioning DNS servers or replacing them with new ones, the safe workflow is:




  1. Build and deploy your new DNS servers. Verify they are functioning correctly.

  2. Add the new DNS servers to the registrar's list of name servers.

  3. Wait (until the change has been picked up on the internet at large).
    This is TTL-dependent, but 24-48 hours is usually a good rule.
    At this point you should start to see queries on the new servers.


  4. Remove the old DNS server from the registrar's list of name servers.

  5. Wait again (until the change is picked up on the internet at large).
    You should stop seeing queries going to the decommissioned server.
    As in (3), 24-48 hours is a good rule to go with.

  6. Unplug the old server and dispose of it per your company's policies.
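The "wait until the change has been picked up" checks in steps 3 and 5 can be partly automated. What follows is a rough sketch, assuming the `dig` command-line tool is installed; the helper names (`normalize`, `query_ns`, `resolvers_missing_new`), the resolver addresses, and the sample answers are all illustrative and not from the original post. It asks a handful of public resolvers for the domain's NS records and reports which ones do not yet return every new server.

```python
import subprocess


def normalize(name):
    """Lower-case a host name and drop the trailing dot that dig prints."""
    return name.strip().rstrip(".").lower()


def query_ns(domain, resolver):
    """Ask one resolver for the domain's NS set via `dig +short NS`."""
    out = subprocess.run(
        ["dig", "+short", "NS", domain, f"@{resolver}"],
        capture_output=True, text=True, check=True, timeout=10,
    ).stdout
    # Skip blank lines and dig's ';;' diagnostic lines (e.g. timeouts).
    return {
        normalize(line)
        for line in out.splitlines()
        if line.strip() and not line.startswith(";")
    }


def resolvers_missing_new(observed, new_servers):
    """Map each resolver to the new name servers it does not return yet."""
    wanted = {normalize(n) for n in new_servers}
    missing = {}
    for resolver, ns_set in observed.items():
        gap = wanted - {normalize(n) for n in ns_set}
        if gap:
            missing[resolver] = gap
    return missing


# Made-up sample answers so the sketch runs without network access.
# In real use, build `observed` with query_ns() for each resolver you care about.
sample = {
    "8.8.8.8": {"ns1.mydomain.com.", "ns1.backupdns.com.", "ns2.backupdns.com."},
    "208.67.222.222": {"ns1.mydomain.com."},
}
new_servers = ["ns1.backupdns.com", "ns2.backupdns.com"]
print(resolvers_missing_new(sample, new_servers))
```

An empty result from the resolvers you sampled is only evidence, not proof, that the change has been picked up everywhere; resolvers you did not ask may still hold older cached data, so the waiting periods in steps 3 and 5 still apply.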



With that workflow, the worst case is that someone still using the "Step 2" information has one extra (lame) NS listed, meaning a server that no longer answers for your domain. They will always have all of your new secondaries, though, so they should always be able to find at least one working name server for your domain.




You combined steps 2, 3, 4, and 5 into one step, and on the back end the removal (4) apparently happened before the addition (2).
Chances are that would never have caused a problem, except that your maintenance happened before everyone had caught up with the "addition" part of the changes. It's a classic edge case and you landed on it.



Now you know, and knowing is 7/16ths of the battle.

