Tuesday, November 15, 2016

domain name system - Windows DNS servers repeatedly requesting records in zone when they get SERVFAIL response



We're seeing high levels (over 2000 requests/second) of DNS queries from our caching DNS servers to external servers. This may have been happening for a long time - it only came to light recently because of performance problems with our firewall. Talking to colleagues at other institutions, it's clear that we're making more queries than they are.



My initial thought was that the problem was a lack of caching of SERVFAIL responses. Having done more investigation, it's clear that the problem is a high level of requests for the failing record from the Windows DNS servers. In our environment, a single query to one of the Windows DNS servers for a record in a zone which returns SERVFAIL results in a stream of requests for that record from all of the Windows DNS servers. The stream of requests doesn't stop until I add a fake empty zone on one of the Bind servers.



My plan tomorrow is to verify the configuration of the Windows DNS servers - they should just be forwarding to the caching Bind servers. I figure we must have something wrong there, as I can't believe no one else would have hit this if it weren't a misconfiguration. I'll update this question after that (possibly closing this one and opening a new, clearer one).
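

For the verification itself, something like this run against each DC should do - if I remember right, the forwarder list shows up in the /Info dump (the server name here is a placeholder, not one of our real DCs):

C:\> dnscmd dc1.example.local /Info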







Our setup is a pair of caching servers running Bind 9.3.6, which are used either directly by clients or via our Windows domain controllers. The caching servers pass queries to our main DNS servers, which run Bind 9.8.4-P2 - these servers are authoritative for our domains and pass queries for other domains on to external servers.
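

The caching tier is plain forwarding - roughly this shape in named.conf (the addresses are RFC 5737 documentation placeholders rather than our real servers, and forward only vs forward first isn't material here):

options {
        directory "/var/named";
        recursion yes;
        // anything not answerable from cache goes to the main DNS servers
        forwarders { 192.0.2.10; 192.0.2.11; };
        forward only;
};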



The behaviour we're seeing is that queries like the one below aren't being cached. I've verified this by looking at network traffic from the DNS servers using tcpdump.



[root@dns1 named]# dig ptr 119.49.194.173.in-addr.arpa.

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.6 <<>> ptr 119.49.194.173.in-addr.arpa.
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 8680
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;119.49.194.173.in-addr.arpa. IN PTR

;; Query time: 950 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sun Mar 9 13:34:20 2014
;; MSG SIZE rcvd: 45
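

For anyone repeating the tcpdump check: watch the caching server's traffic towards one of the main servers while re-running the dig (the interface name and upstream address are placeholders):

[root@dns1 named]# tcpdump -n -i eth0 udp port 53 and dst host 192.0.2.10

If the SERVFAIL were being cached, the second and subsequent digs would produce no upstream packets and a near-zero query time; instead a fresh query goes out every time.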



Querying Google's server directly shows that we're getting a REFUSED response.



[root@dns1 named]# dig ptr 119.49.194.173.in-addr.arpa. @ns4.google.com.

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.6 <<>> ptr 119.49.194.173.in-addr.arpa. @ns4.google.com.
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 38825
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;119.49.194.173.in-addr.arpa. IN PTR

;; Query time: 91 msec
;; SERVER: 216.239.38.10#53(216.239.38.10)
;; WHEN: Sun Mar 9 13:36:38 2014
;; MSG SIZE rcvd: 45



This isn't just happening with Google addresses or reverse lookups, but a high proportion of the queries are for those ranges (I suspect because of a Sophos reporting feature).



Should our DNS servers be caching these negative responses? I read RFC 2308 (http://tools.ietf.org/rfcmarkup?doc=2308), which covers negative caching of NXDOMAIN and NODATA responses, but didn't see anything about REFUSED. We don't specify lame-ttl in the config file, so I'd expect it to default to 10 minutes.



I believe this (the lack of caching) is expected behaviour, so I don't understand why the other sites I've talked to aren't seeing the same thing. I've tried a test server running the latest stable version of Bind and it shows the same behaviour. I also tried Unbound, and that didn't cache SERVFAIL either. There's some discussion of doing this in djbdns here, but the conclusion is that the functionality has been removed.



Are there settings in the Bind config that we could change to influence this behaviour? lame-ttl didn't help (and we were running with the default anyway).
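

For reference, this is the knob I tried, shown with its default - lame-ttl only affects the cache of lame servers, not SERVFAIL answers, which is presumably why it made no difference. (As an aside, later Bind releases - 9.11 onwards - add a servfail-ttl option for exactly this, capped at 30 seconds, but that's not available in the versions we run.)

options {
        lame-ttl 600;      // default: remember lame servers for 10 minutes; 0 disables
        // servfail-ttl 1; // Bind 9.11+ only: cache SERVFAIL briefly, max 30 seconds
};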



As part of the investigation I've added some fake empty zones on our caching DNS servers to cover the ranges generating the most requests. That's dropped the number of requests to external servers, but it isn't sustainable (and feels wrong as well). In parallel with this, I've asked a colleague to get logs from the Windows DNS servers so that we can identify the clients making the original requests.
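

The fake empty zones are nothing clever - each one is a locally authoritative zone containing just an SOA and an NS record, so queries under the covered range get a cacheable NXDOMAIN locally instead of going upstream. Roughly this (the zone name matches the example query above; the file name is illustrative):

zone "194.173.in-addr.arpa" {
        type master;
        file "empty.zone";
};

with empty.zone containing:

$TTL 3600
@       IN      SOA     localhost. root.localhost. ( 1 3600 900 604800 3600 )
@       IN      NS      localhost.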



Answer



The cause was obvious once I looked at the configuration of the Windows DNS servers (something had got lost in the verbal report).



Each DC was configured to forward requests not only to the two caching Bind servers but also to all of the other Windows DNS servers. For queries which succeeded (including NXDOMAIN) that worked fine: the Bind servers would answer, and we'd never fall through to the other Windows DNS servers. However, for queries that returned SERVFAIL, one server would ask all the others, which would in turn ask the Bind servers. I'm really surprised that this didn't cause more pain.



We'll take the extra forwarding out and I fully expect the volume of requests to drop dramatically.
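

For the record, resetting each DC's forwarder list so that it points only at the two caching Bind servers is a one-liner per DC with dnscmd (the server name and addresses are placeholders; the same change can be made through the DNS Manager GUI):

C:\> dnscmd dc1.example.local /ResetForwarders 192.0.2.1 192.0.2.2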

