RHN Bind Update Brings Down RHEL Named
alexs writes "Red Hat's bind update through RHN, which patches the DNS hole, contains a fatal error that reverts all name servers to caching-only servers. This means that anyone running their own DNS service promptly lost all of the DNS records for which they were acting as primary or secondary name servers. Expect quite a few services provided by servers running RHEL to, errr, die until their system administrators can restore their named.conf. Instead of installing the new etc/named.conf as etc/named.conf.rpmnew, Red Hat moved the current etc/named.conf to etc/named.conf.rpmsave and replaced etc/named.conf with the default caching-only configuration. The fix is easy enough, but this is a schoolboy error which I am surprised Red Hat made. Unfortunately we were hit: our servers went down overnight when RHN dropped its bomb, and I am frankly surprised there has not been more of an uproar about this."
Here are the bug details: https://bugzilla.redhat.com/show_bug.cgi?id=453340
One of the bug comments says: "Latest caching-nameserver renamed my named.conf to named.conf.rpmsave in /var/named/chroot/etc" - so this should mean that you can still restore the lost conf file.
A few months prior to the release of RHEL 5.2, they released a kernel update (2.6.18-53.1.6.el5) which added a patch for an issue that could make a system oops when files with names containing a certain character were present on NFS shares. However, this patch also contained a bug which broke NFS lookup caching and crippled NFS performance to the point of NFS being completely unusable when working with many small files. They released a patch for it, but it would only apply cleanly to their testing kernel (which would later become the kernel shipped with 5.2), and they refused to backport it to their then-stable kernel. Shortly after, the vmsplice flaw was found, forcing people to update and bring this bug upon themselves. For us it wasn't that big a problem, since we're using CentOS and don't have anything requiring us to use standard RHEL packages (so we backported the patch and built our own kernel package). But a large number of corporate RHEL users are required to use only standard RHEL system packages because of service contracts with hardware vendors, and hence they could do little to remedy this bug. As we were among the first to report this and post about it on mailing lists, we received a lot of communication from corporate RHEL users/sysadmins asking us for help, further proving that this was a major issue that should have been addressed right away and not postponed to the next major release.
Note as well that the initial release included a default conf file which specified a fixed source port, which of course breaks the fix.
[Updated 10th July 2008] We have updated the Enterprise Linux 5 packages in this advisory. The default and sample caching-nameserver configuration files have been updated so that they do not specify a fixed query-source port. Administrators wishing to take advantage of randomized UDP source ports should check their configuration file to ensure they have not specified fixed query-source ports.
Personally I'm surprised there's not been more uproar about the requirement to move internal DNS servers (yes, that means your Windows Domain Controllers in most corporate environments) outside any NAT'ing devices (e.g. firewalls), as many NATs break the fix by rewriting outbound UDP DNS queries to use the same or incremental source ports. Anyone here moved their AD outside the firewall?
Don't forget to check your named.conf on RHEL 5.x (and CentOS 5.x).
Make sure that any lines like
query-source port 53;
query-source-v6 port 53;
are commented out or deleted so that forwarded DNS queries come from random ports.
Restart BIND if necessary.
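For anyone with more than a couple of boxes to check, the advice above can be sketched as a small Python helper. This is a rough sketch, not a vetted tool: the function name `find_fixed_query_source` and the sample configuration are made up for illustration, and it only skips whole-line `#` and `//` comments (it does not understand `/* ... */` blocks or include files).

```python
import re

# Matches "query-source" or "query-source-v6" followed by a numeric port.
# A wildcard port ("port *") is deliberately not matched.
FIXED_PORT = re.compile(
    r"^\s*query-source(-v6)?\b[^;#]*\bport\s+(\d+)", re.IGNORECASE
)


def find_fixed_query_source(conf_text):
    """Return (line_number, line) pairs that pin the query source port."""
    hits = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.lstrip()
        # Skip lines that are already commented out.
        if stripped.startswith("#") or stripped.startswith("//"):
            continue
        if FIXED_PORT.match(line):
            hits.append((lineno, line.strip()))
    return hits


# Hypothetical sample; substitute the text of your own named.conf.
SAMPLE_CONF = """options {
    query-source port 53;
    // query-source-v6 port 53;
    query-source-v6 port 53;
    query-source address * port *;
};
"""

if __name__ == "__main__":
    for lineno, line in find_fixed_query_source(SAMPLE_CONF):
        print(f"line {lineno}: {line}")
```

On a real server you would read the file yourself (for example /etc/named.conf, or the copy under /var/named/chroot/etc mentioned in the bug comment) and pass its text to the function.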
MS08-037 was released on the same day, and was much loved by ZoneAlarm users :-)
I wish I had mod points with which to mod you up. This is NOT a bug, and a few RHEL test machines I have here updated just fine, keeping their zone files as expected.
Judging by the CERN details, it sounds like there are two things you need to do. You need to be able to predict the 16-bit random number, and the 16-bit random port. My reading (and this was very brief, so someone *please* correct me if I'm wrong here) is that the older DNS servers had two flaws: a flaw in the RNG for the 16-bit transaction number, and they used fixed or predictable ports.
A NAT will reintroduce only the second problem because it gives you predictable ports, but obviously, relying solely on the unpredictability of a 16-bit transaction ID is a little scary. Because of the birthday paradox (assuming the attacker has perfect knowledge of which port you're choosing), an attacker would need to send only something on the order of 2^8 packets to poison the cache.
Hand off DNS queries emerging from AD servers inside your firewall to caching-only servers in your DMZ. I have all my AD servers on RFC1918 IP numbers with no NAT, because they strike me as devices I'd prefer to keep as far away from the big bad Internet as possible.
ian
I just want to clarify a bit about rpmnew vs rpmsave.
Red Hat will create an rpmsave file when we make a significant change to the configuration file, or a mandatory change. Other than that, we keep the original config file, and store the rpm-config as rpmnew.
A pair of shoes, please!
Judging by the CERN details, it sounds like there are two things you need to do. You need to be able to predict the 16-bit random number, and the 16-bit random port. My reading (and this was very brief, so someone *please* correct me if I'm wrong here) is that the older DNS servers had two flaws: a flaw in the RNG for the 16-bit transaction number, and they used fixed or predictable ports.
A NAT will reintroduce only the second problem because it gives you predictable ports, but obviously, relying solely on the unpredictability of a 16-bit transaction ID is a little scary. Because of the birthday paradox (assuming the attacker has perfect knowledge of which port you're choosing), an attacker would need to send only something on the order of 2^8 packets to poison the cache.
No, the birthday problem doesn't apply when you are trying to match a specific person's birthday.
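To put rough numbers on the disagreement, here is a back-of-the-envelope Python sketch. It assumes the simplest possible model, which is my assumption and not something stated in the thread: one outstanding query, a known source port, and a uniformly random 16-bit transaction ID. The function names are made up for illustration.

```python
ID_SPACE = 2 ** 16  # number of possible 16-bit transaction IDs


def p_hit_specific(guesses, space=ID_SPACE):
    """Chance that `guesses` distinct forged IDs include the one ID in use."""
    return min(guesses, space) / space


def p_any_collision(samples, space=ID_SPACE):
    """Birthday-style chance that `samples` random values contain any repeat."""
    p_no_repeat = 1.0
    for i in range(samples):
        p_no_repeat *= (space - i) / space
    return 1.0 - p_no_repeat


if __name__ == "__main__":
    print("2^8 guesses at one specific ID:", p_hit_specific(2 ** 8))
    print("2^15 guesses at one specific ID:", p_hit_specific(2 ** 15))
    print("any repeat among 2^8 random IDs:", p_any_collision(2 ** 8))
```

Under this model, 2^8 guesses cover 256 of 65536 values, well under 1%, and it takes 2^15 guesses for even odds of matching one specific ID. The square-root figure only shows up in the second function, where any pair among the samples may match (roughly 39% for 2^8 samples), which is the distinction the reply is drawing.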
I'm not familiar with the package in question, but I assume it also installed some binaries. If it found that there already was a configfile of that name, it should have asked what to do.
If setting up the caching-nameserver was a matter of changing config options, you don't need a package for that, you need a HOWTO.
I would hazard to guess that unfamiliarity with the package is the real root cause of this. From the package description for caching-nameserver-7.3-3 (which could be a very old version):
The file contents show:
And so there we have it - a package designed to install and maintain the very generic files needed to configure a caching DNS server. DNS server not included.
And sure - this could be a HOWTO. But making a package allows for quick-and-simple configuration. And since this kind of thing is so generic, it really lends itself to packaging. I disagree that it should only be a HOWTO.
This sounds like how RPM has behaved for as long as I can remember. It looks at three versions of a config file: #1 the one from the old package, #2 the one currently on disk and #3 the one in the new package. If the config file hasn't been customized (1 and 2 are identical), it moves the old file to .rpmold (if 1 and 3 differ) and puts #3 into place. If the config file has been customized, it checks whether 1 and 3 differ. If they don't, then nothing's changed, the customized config file's still valid, and it drops #3 in with the .rpmnew extension. But if 1 and 3 differ, then something in the config file may have changed and the customized config file may no longer be valid. But it's got customizations in it that the admin may need to refer to. So it outputs a warning message about what it's doing, moves the customized config file to .rpmsave and installs #3, and the admin's expected to have seen the warning and to merge their customizations into the new config file. You do watch for warnings and errors during the update, right?
In this case RPM is right, old named.conf files aren't valid. If they're based off RH's old stock config files, they have the source port locked and that disables much of the security fix. So the admins do have to check and modify their customized files before the system's finally ready (or at least RPM has to assume they do, since it can't know exactly what their changes were). That's exacerbated by probably having caching-nameserver installed, but I think a stock BIND install has a similar named.conf until you add your own zones to it.
I'd chalk this one up to admins who a) don't understand an inherent limitation of package-management systems (namely, it doesn't know why you changed something, only that you changed it), b) didn't watch the update process for errors, and c) didn't check the systems for functionality after the update.
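The three-way comparison described in this comment can be written down as a small decision function. This is only a sketch of the behaviour as the comment describes it, not RPM's actual code; as far as I know the real outcome also depends on how the package marks the file (for example %config versus %config(noreplace)). The function name and return values are hypothetical.

```python
from typing import Optional, Tuple


def config_action(old_pkg: str, on_disk: str, new_pkg: str) -> Tuple[str, Optional[str]]:
    """Decide what happens to a config file during an upgrade.

    Arguments are the file contents from the old package (#1), the file
    currently on disk (#2) and the new package (#3).

    Returns (live, side_suffix):
      live        -- "new" if the new package's file ends up in place,
                     "disk" if the admin's file stays in place
      side_suffix -- suffix of the extra copy written next to it, or None
    """
    customized = on_disk != old_pkg      # admin edited the file
    pkg_changed = new_pkg != old_pkg     # packager edited the file

    if not customized:
        # Nothing of the admin's to lose; keep a copy only if contents change.
        return ("new", ".rpmold" if pkg_changed else None)
    if not pkg_changed:
        # Customized file is still valid; park the packaged one beside it.
        return ("disk", ".rpmnew")
    # Both sides changed: install the new file, keep the admin's copy aside.
    return ("new", ".rpmsave")


if __name__ == "__main__":
    # The situation in the story: edited named.conf, and a changed default.
    print(config_action("stock v1", "stock v1 + my zones", "stock v2"))
```

The last branch is the case from the story: the customized named.conf ends up as named.conf.rpmsave and the caching-only default takes its place.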
Don't entrust a function like DNS to a single vendor. With some services that is hard, as authors support a limited range of OSes/hardware or charge too high a price per installation to make redundancy affordable.
But not DNS. Free solutions abound, and the commercial ones are quite cheap too. They are available for every imaginable "server-grade" OS/hardware combination. If you use more than one server for DNS in your enterprise, and all of them run on the same platform, you aren't doing your job.
Mind you, I don't blame the victims here — Red Hat screwed up royally, and that's that. Just advising on how to avoid being hit by such (inevitable) mistakes — from any vendor — in the future.
In Soviet Washington the swamp drains you.
Our IP is 207.46.19.254
OrgName: Microsoft Corp
OrgID: MSFT
Address: One Microsoft Way
City: Redmond
StateProv: WA
PostalCode: 98052
Country: US
NetRange: 207.46.0.0 - 207.46.255.255
CIDR: 207.46.0.0/16
NetName: MICROSOFT-GLOBAL-NET
NetHandle: NET-207-46-0-0-1
Parent: NET-207-0-0-0-0
NetType: Direct Assignment
NameServer: NS1.MSFT.NET
NameServer: NS5.MSFT.NET
NameServer: NS2.MSFT.NET
NameServer: NS3.MSFT.NET
NameServer: NS4.MSFT.NET
I agree with the parent.
RH quality has been slipping for some time now. Heck, I don't even know what RH brings to the table as far as distributions are concerned.
Their GUI system-config-* programs are so poorly written that a lot of them don't even work. And those that do can't repaint themselves properly on the screen. Imagine GUI programs that take up to 30 seconds to (re)paint themselves after being covered/exposed. And that's on a beefy dual-core PC!
I don't know so much about their enterprise stuff (thankfully I only have to deal with a single RH server) but Fedora certainly sucks ass. I don't think RH does *any* testing on it - that's for suckers ^H^H^H er...the community to do. Every time there is an update, I say to myself "Okay...what did they break this time?"
RHEL - 5.2 - caching-nameserver-9.3.4-6.P1.el5.i386.rpm
RHEL - 5.1 - caching-nameserver-9.3.3-10.el5.i386.rpm
RHEL - 4.6 - caching-nameserver-7.3-3.noarch.rpm
RHEL - 3.9 - caching-nameserver-7.3-3_EL3.noarch.rpm
Trying to become famous by taking photos. Visit my homepage please.
[This is Dan Kaminsky]
The NAT vendors didn't get as much notice because we didn't realize so many of them were doing this.
If we had, they'd have been brought in from the start.
Now they're scrambling, to their credit. It's a bit of a facepalm for me.
What the fuck is wrong with you people? You think every System Admin out there has just one job to do, and that's administering the servers? In my job I do everything. VOIP phones, new employee setup, updates, backups, desktop support, fixing the copier, following up with accounting and the executive assistant as to why we ran out of paper yet again, etc. etc. etc. The point is the company SHOULD hire another IT person but they can't afford it, and there is no freakin way I could ever test every update that comes out. Of course I monitor things and test them to make sure nothing breaks, and nothing ever has from an update. But it shouldn't be assumed that everybody has the time to update a virtual machine before they update production and then monitor it so closely as to notice something as small as the problem this patch created. Just stop living in a fairy world, will ya?