Thursday, February 4, 2016

domain name system - Avoiding DNS timeouts when a dns server fails

We have a small datacenter with about a hundred hosts pointing to three internal DNS servers (BIND 9). Our problem arises when one of the internal DNS servers becomes unavailable.
At that point, all the clients that point to that server start performing very slowly.



The problem seems to be that the stock Linux resolver doesn't really have the concept
of "failing over" to a different DNS server. You can adjust the timeout and the number
of retries it uses (and set rotate so it will work through the list), but no matter
what settings one uses, our services perform much more slowly if a primary DNS server
becomes unavailable. At the moment this is one of the largest sources of service
disruptions for us.
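To put rough numbers on why tuning only softens the problem, here is a minimal sketch of a simplified resolver model. It assumes the resolver tries servers in list order, waits the full timeout on a dead server, gets an immediate answer from a live one, and (with rotate) starts each lookup at the next server in the list. The function name and the model are illustrative assumptions; the real glibc resolver has additional backoff behavior that this ignores.

```python
def expected_extra_delay(n_servers, dead_index, timeout_s, rotate):
    """Average extra seconds per lookup caused by one dead server.

    Simplified model (an assumption, not the exact glibc algorithm):
    a dead server costs timeout_s, a live server answers instantly.
    """
    if not 0 <= dead_index < n_servers:
        raise ValueError("dead_index out of range")
    if rotate:
        # The starting server cycles through the list, so one lookup in
        # n_servers starts at the dead server and pays the timeout.
        return timeout_s / n_servers
    # Without rotate every lookup starts at the first server in the list.
    return float(timeout_s) if dead_index == 0 else 0.0


if __name__ == "__main__":
    # Default timeout of 5 seconds, three servers, first one dead.
    print(expected_extra_delay(3, 0, 5, rotate=False))
    print(expected_extra_delay(3, 0, 5, rotate=True))
    # Same failure with "options timeout:1 rotate".
    print(expected_extra_delay(3, 0, 1, rotate=True))
```

Under this model, a lower timeout and rotate shrink the penalty but never remove it: some share of lookups still waits on the dead server every time, because nothing remembers that the server is down.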



My ideal answer would be something like "RTFM: tweak /etc/resolv.conf like this...",
but if that's an option I haven't seen it.



How do other folks handle this issue?



I can see 3 possible types of solutions:





  • Use linux-ha/Pacemaker and failover IPs (so the DNS VIPs are "always" available).
    Alas, we don't have a good fencing infrastructure, and without fencing
    Pacemaker doesn't work very well (in my experience Pacemaker lowers availability without
    fencing).


  • Run a local DNS server on each node, and have resolv.conf point to localhost.
    This would work, but it would give us a lot more services to monitor and manage.


  • Run a local cache on each node. Folks seem to consider nscd "broken", but dnrd
    seems to have the right feature set: it marks DNS servers as up or down, and
    won't use 'down' DNS servers.
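
The "mark servers up or down" behavior described in the last option can be sketched roughly as follows. This is an illustration of the idea only, not how dnrd is implemented. The class name, the `query_fn(server, name)` callable (assumed to raise `OSError` on a timeout), and the 30 second retry interval are all hypothetical.

```python
import time


class UpstreamPool:
    """Track which upstream DNS servers are up and skip the down ones."""

    def __init__(self, servers, query_fn, retry_after_s=30.0,
                 clock=time.monotonic):
        self.servers = list(servers)
        # Hypothetical callable: query_fn(server, name) returns an answer
        # or raises OSError when the server does not respond.
        self.query_fn = query_fn
        self.retry_after_s = retry_after_s
        self.clock = clock
        self.down_until = {}  # server -> time at which it may be retried

    def is_up(self, server):
        return self.clock() >= self.down_until.get(server, float("-inf"))

    def resolve(self, name):
        # Prefer servers believed to be up; if none are, try them all.
        candidates = [s for s in self.servers if self.is_up(s)] or self.servers
        for server in candidates:
            try:
                answer = self.query_fn(server, name)
            except OSError:
                # Remember the failure so later lookups skip this server.
                self.down_until[server] = self.clock() + self.retry_after_s
                continue
            self.down_until.pop(server, None)
            return answer
        raise OSError("all upstream DNS servers failed")
```

The difference from the stock resolver is the remembered state: only the first lookup after a failure pays the timeout, and later lookups go straight to a live server until the retry interval expires.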





Anycast seems to work only at the IP routing level, and depends on route updates to handle server failure. Multicast seemed like it would be a perfect answer, but BIND does
not support broadcasting or multicasting, and the docs I could find seem to suggest that
multicast DNS is aimed more at service discovery and auto-configuration than at regular DNS resolution.



Am I missing an obvious solution?
