RHN Bind Update Brings Down RHEL Named

Posted by kdawson on Friday July 18, 2008 @12:14AM from the remind-me-of-your-name-again dept.

alexs writes "Red Hat's update to bind through RHN, which patches the DNS hole, contains a fatal error that reverts all name servers to caching-only servers. This means that anyone running their own DNS service promptly lost all of the DNS records for which they were acting as primary or secondary name servers. Expect quite a few services provided by servers running RHEL to, errr, die until their system administrators can restore their named.conf. Instead of installing the new configuration as /etc/named.conf.rpmnew, Red Hat moved the current /etc/named.conf to /etc/named.conf.rpmsave and replaced /etc/named.conf with the default caching-only configuration. The fix is easy enough, but this is a schoolboy error which I am surprised Red Hat made. Unfortunately we were hit: our servers went down overnight when RHN dropped its bomb, and I am frankly surprised there has not been more of an uproar about this."

You didn't test before deploying an update? by Anonymous Coward · 2008-07-18 00:17 · Score: 5, Insightful

So, you didn't test the update on a non-production server? Just install any old patch and let it take your network down? Who do you work for again? I have to make sure not to do business with that.
1. Re:You didn't test before deploying an update? by suso · 2008-07-18 01:03 · Score: 5, Insightful
  
  Actually, I caught the error just from looking at the output of up2date/yum. It clearly said named.conf saved to named.conf.rpmsave. So all you have to do is compare what changed, implement any changes and copy named.conf.rpmsave over named.conf.
  Just as I said on the day of the release, be careful, don't just blindly update things.
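The comparison step described above can be sketched as a small shell function. This is only an illustration: the name `show_config_changes` is made up here, and it assumes the saved copy sits next to the live file with an `.rpmsave` suffix, as in the story.

```shell
# Show what an update changed in a config file by comparing the live file
# with the copy saved next to it as FILE.rpmsave.
# Returns 0 if the two are identical, 1 if they differ, 2 if no saved copy exists.
show_config_changes() {
    if [ ! -f "$1.rpmsave" ]; then
        echo "no saved copy at $1.rpmsave" >&2
        return 2
    fi
    # diff exits 0 when the files match and 1 when they differ.
    diff -u "$1.rpmsave" "$1"
}
```

Calling `show_config_changes /etc/named.conf` prints a unified diff from your saved configuration to the one the package installed, which is what you would review before copying anything back.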
2. Re:You didn't test before deploying an update? by illumin8 · 2008-07-18 01:26 · Score: 3, Insightful
  
  >So, you didn't test the update on a non-production server? Just install any old patch and let it take your network down? Who do you work for again? I have to make sure not to do business with that.
  No kidding. The only "schoolboy error", as the submitter put it, was not testing the patch on a non-production server before deploying it on a production DNS server.
  
  --
  "When the president does it, that means it's not illegal." - Richard M. Nixon
3. Re:You didn't test before deploying an update? by jocknerd · 2008-07-18 01:57 · Score: 3, Insightful
  
  You know, not everyone has non-production servers. Every server we have IS production. And if you are paying for Red Hat Enterprise, you expect Red Hat to have tested these updates themselves. If this was a Microsoft error, Slashdot would be all over Microsoft for allowing this to happen.
4. Re:You didn't test before deploying an update? by numbsafari · 2008-07-18 02:15 · Score: 3, Insightful
  
  And the rest of slashdot would be all over MS admins who blindly update their systems from AutoUpdate.
  I find it really hard to believe you don't have at the very least a strawman test system. The fact that you don't speaks volumes.
5. Re:You didn't test before deploying an update? by Sleepy · 2008-07-18 02:27 · Score: 5, Insightful
  
  >You know, not everyone has non-production servers. Every server we have IS production. And if you are paying for Red Hat Enterprise, you expect Red Hat to have tested these updates themselves. If this was a Microsoft error, Slashdot would be all over Microsoft for allowing this to happen.
  You are wrong; stop whining. You're just painting yourself as misinformed.
  1) The updates WERE tested.
  2) The admin installed "caching-nameserver", then configured his install to act far outside the default.
  3) He allows automatic updates straight into production. So do you, it seems. Good luck with that! RHEL documentation says to not do this, but you're a bigshot "paying" for something different. I suggest you get a sidekick, and stick to the Windows side of your "enterprise".
  4) He didn't revert his .conf file, as is usually needed when some new line is added to a server .conf. This is SO NORMAL you'd have to be a n00b to get bitten!
  Your MS comparison is apples and oranges. If this guy did TEN MINUTES worth of testing he'd realize something's up, and he could revert the rpm package. How many MS updates prohibit uninstall? Quite a few!
  In Windows, you can't diff the before & after config, since Windows admins would rather be blind to what they're installing, since that's the norm and it's accepted.
6. Re:You didn't test before deploying an update? by GleeBot · 2008-07-18 02:28 · Score: 2, Insightful
  
  And contrariwise, if it's not important enough to test, then it's not important enough to not go down. So grin and bear it.
7. Re:You didn't test before deploying an update? by poot_rootbeer · 2008-07-18 02:31 · Score: 3, Insightful
  
  >You know, not everyone has non-production servers. Every server we have IS production.
  Well, there's your problem right there...
8. Re:You didn't test before deploying an update? by IchNiSan · 2008-07-18 02:56 · Score: 2, Insightful
  
  You mean to tell me you don't even have an old desktop machine sitting around with RHEL on it to "play" with? Come on, pull the other leg. Or maybe find a new line of work. Not being able to afford non-production servers and a test lab is one thing, but not taking the old computer you replaced on the secretary's desk and using that to do some basic testing for mission-critical updates is ridiculous. Or hell, just dual boot your machine if it comes to that. You have to do SOME testing of SOME things.
9. Re:You didn't test before deploying an update? by Dr+Caleb · 2008-07-18 02:59 · Score: 4, Insightful
  
  Perhaps it is his problem, but not his fault. Sounds like he's in the dreaded zone where IT is a necessary evil, not a department that can help leverage the business.
  He gets what he needs, or just barely what he needs. When management hands you crap, you learn to make crapade.
  
  --
  "History doesn't repeat itself, but it does rhyme." Mark Twain
10. Re:You didn't test before deploying an update? by Anonymous Coward · 2008-07-18 02:59 · Score: 1, Insightful
  
  You have a pretty bad view of Windows and Windows Admins. Some of that is true, and well deserved. However, you must remember that MANY MANY mid-sized businesses run Windows and have a limited IT dept. That is my situation. I fix printers, client computers, network, AD, Exchange, app servers, website, and most anything else that pertains to a computer. So I don't have time to test EVERY patch for EVERY server and service. It just isn't practical. So I rely on MS to make sure that Windows updates work, and that is why I pay $1000 for each server... cause it's cheaper than another admin.
11. Re:You didn't test before deploying an update? by ebuck · 2008-07-18 10:03 · Score: 2, Insightful
  
  You are right about making do with what you get, but exactly how did he lack resources in this case? He already has RHEL (and updates, so I'm guessing his support contract is up to date).
  It's not like they're charging more for a non-caching domain name services server. In fact, he took a perfectly good non-caching name server, and then installed pre-packaged configuration files to make it a caching-nameserver. Then he started hacking away at the config file. Small wonder that fixes to the caching-nameserver config file will interfere with his setup. If the world worked any other way then caching-nameserver config files would never get bug fixes, ever.
  He didn't know what he installed, ignored his vendor's documentation warning not to do it this way, ignored the name of the package he was installing, ignored the concept of production in the enterprise (no updates without testing), didn't bother to read RPM's log files, and resorted to fire-fighting in an emergency "failure" scenario. There are half a dozen routine ways this could have been avoided, but he made mistakes along every step of the way.
  In his favor, this sysadmin has balls. After being ignorant of his missteps, he's complaining that RPM saved a copy of his altered config file! I'll bet he won't even diff the changes into it before copying it back to its original name.
  Give this man a fish and he'll complain that you're ruining his diet. Teach him how to fish and he'll complain that you're dumping your fishing responsibilities on him. He just doesn't get it.
A schoolboy error? by something_wicked_thi · 2008-07-18 00:17 · Score: 4, Insightful

What? And isn't it an error of similar proportion to upgrade your primary DNS servers without first testing the new install?
1. Re:A schoolboy error? by something_wicked_thi · 2008-07-18 00:54 · Score: 3, Insightful
  
  IMHO, rhel should have tested this.
  'Course they should. Nobody said otherwise.
  I'm not sure what you're getting at with building from sources. Seems like overkill and doesn't solve the main problem because you can still screw it up. All anyone's saying is that you should test this on a server that you don't care about, or at least test it on one, before upgrading all of them.
2. Re:A schoolboy error? by evilviper · 2008-07-18 01:10 · Score: 1, Insightful
  
  Personally I'm surprised there's not been more uproar about the requirement to move internal DNS servers (yes, that means your Windows Domain Controllers in most corporate environments) outside any NAT'ing devices (eg: firewalls),
  NAT is not a firewall.
  A firewall is not NAT.
  I wouldn't think that practically any major sites are running their public-facing DNS servers from behind a NAT (though I expect most are behind a firewall).
  
  --
  Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
MS by FozE_Bear · 2008-07-18 00:17 · Score: 4, Insightful

If it was a Microsoft product, we'd all be carrying pitchforks and torches....
1. Re:MS by Anonymous Coward · 2008-07-18 00:36 · Score: 1, Insightful
  
  But nobody is using RH in production anyway.
argh by __aardcx5948 · 2008-07-18 00:24 · Score: 2, Insightful

I guess the sysadmins could put in an option in a configuration file somewhere on what files to "keep untouched" when doing package upgrades, no? So that the configuration file wouldn't be overwritten. I think I've seen something similar in Debian distros. Anyway when I install a new (custom) kernel in Ubuntu for example, synaptic asks me if I want to overwrite GRUB's menu.lst with the newly generated one, view the differences or keep my old one etc. Surely there's something similar in Redhat?
That's why they get paid by nicolas.kassis · 2008-07-18 00:24 · Score: 3, Insightful

Half the whole point of a subscription to RHEL is to ensure that the patches they put out are properly QAed. The other half is support, but I never had a chance to test that part out.
1. Re:That's why they get paid by MikeDawg · 2008-07-18 00:30 · Score: 2, Insightful
  
  Umm. . . I disagree completely. The only way I would consider a patch "put out properly" is if it was tested in my exact, or near exact environment. I can only assume that I'm not important enough for that.
  
  --
  YOU'RE WINNER !
  Another lame blog
Test your patches by MikeDawg · 2008-07-18 00:27 · Score: 2, Insightful

What kind of environment are you in where you don't first test your patches that are going out to live production machines? Regardless of the fact that it is linux and not windows, you should always test your patches before you roll them into production.

--
YOU'RE WINNER !
Another lame blog
1. Re:Test your patches by Just+Some+Guy · 2008-07-18 01:38 · Score: 5, Insightful
  
  >What kind of environment are you in where you don't first test your patches that are going out to live production machines? Regardless of the fact that it is linux and not windows, you should always test your patches before you roll them into production.
  Disclaimer: I test first.
  You know, a lot of people work in small shops that can't afford multiple redundant servers. I suspect that businesses with a single DNS/web/mailserver are a lot more common than Slashdotters this morning seem to think. What are those admins supposed to do? They're receiving a critical security patch from a trusted vendor, and I imagine a lot of them feel pretty safe applying that to their sole production server. This doesn't make them stupid or incompetent.
  I have the luxury of lots of hardware that can fill in for other gear in a pinch, but lots of people don't. They don't deserve scorn for it.
  
  --
  Dewey, what part of this looks like authorities should be involved?
Experienced Monkeys... by spankymm · 2008-07-18 00:28 · Score: 2, Insightful

...check for rpm mouse droppings by running find.
RH may have made a small coding mistake - you made an even bigger one.
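A minimal sketch of that check, assuming the leftovers use the usual `.rpmsave` and `.rpmnew` suffixes; the function name `find_rpm_leftovers` is invented for this example.

```shell
# List leftover .rpmsave / .rpmnew files under a directory (default: /etc),
# one path per line, sorted. Permission errors from find are silenced.
find_rpm_leftovers() {
    find "${1:-/etc}" -type f \( -name '*.rpmsave' -o -name '*.rpmnew' \) 2>/dev/null | sort
}
```

Running `find_rpm_leftovers` after every update, and treating any output as something to review, would have flagged the saved named.conf right away.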

--
http://cafepress.com/spankymm - for the Masturbating Monkey in you!
Re:New update? by CrackerJackz · 2008-07-18 00:38 · Score: 2, Insightful

Because the named.conf file gets stomped, the 'backup' RPMSAVE file it creates is the caching-only file, not the original named.conf file.
I caught this a couple of weeks ago on a test server (where *all* patches should be tested first, Microsoft or otherwise). Best way to fix?
cp /etc/named.conf /root/named.conf.backup ; up2date-nox -u ; cp /root/named.conf.backup /etc/named.conf ; /etc/init.d/named restart
Little to no downtime on the prod servers :)
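The same back-up, update, restore idea can be wrapped so that the packaged file is kept for review instead of being discarded. This is a hedged sketch, not a vetted procedure: `update_keeping_config` and the `.from-update` suffix are names made up for illustration, and the update command is whatever you pass in.

```shell
# Usage: update_keeping_config CONFIG COMMAND [ARGS...]
# Runs COMMAND. If it changed CONFIG, the changed version is kept as
# CONFIG.from-update and the original contents are restored.
# The body runs in a subshell so its variables do not leak into the caller.
update_keeping_config() (
    conf=$1
    shift
    backup=$(mktemp) || exit 1
    cp -p "$conf" "$backup" || exit 1
    "$@"                                   # the update step itself
    status=$?
    if ! cmp -s "$conf" "$backup"; then
        # Keep what the update installed so it can be diffed later.
        if [ -f "$conf" ]; then
            cp -p "$conf" "$conf.from-update"
        fi
        cp -p "$backup" "$conf"
        echo "restored $conf; updated version kept as $conf.from-update"
    fi
    rm -f "$backup"
    exit "$status"
)
```

For example, `update_keeping_config /etc/named.conf up2date-nox -u` mirrors the one-liner above; the service still has to be restarted afterwards, as in the original command.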
Well by ledow · 2008-07-18 00:44 · Score: 3, Insightful

Yeah, it's a silly mistake.
But you should be testing things like this first, and whenever you upgrade you should really be looking at/for all .rpmsave or equivalent files first to make sure nothing has changed in the meantime. Otherwise, you're just removing your config and replacing it with the default whatever happens. You should also be checking .rpmnew (or equivalent) each time to check that it hasn't changed in terms of syntax, defaults etc. (which, let's be honest, is quite likely for such an important update - especially given that we hardly know what the exact problem is yet). I wouldn't go so far as to suggest intimate analysis of packages while they are still packed unless the systems you are running are quite critical to the operation of a business.
Part human-error on RH's part (it happens). Part incompetence in not testing the updates yourself first. Chances are that if I were affected by this, I would catch it as part of "right, what did that package change?", or notice as part of usual testing later, and then just move the file. I probably wouldn't even bother to send RH a note.
If you have a DNS server, that suggests that there are reliant computers. As courtesy to all those reliant computers you HAVE to test changes and check carefully what they are doing first. If you were "stung" by it, it suggests you hit this problem on ALL your DNS servers and/or that you only have one DNS server anyway. To deploy packages like this on such a setup is just asking for trouble.
Re:No worries by larry+bagina · 2008-07-18 00:51 · Score: 2, Insightful

Idiotic.... like Debian's openssl "enhancements" that made the random number generator not so random?

--
Do you even lift?
These aren't the 'roids you're looking for.
Mod parent up! by Chrisq · 2008-07-18 01:06 · Score: 3, Insightful

I am sure that many people do not realise that going through a NAT device usually means that predictable port numbers will be allocated.

Of course until we get details of the hole and fix we cannot be 100% sure but it is very likely that exposing predictable port numbers (which the fix randomised) reintroduces the hole.

If DNS software vendors had a year's notice then why didn't the NAT firewall vendors? They could have introduced a patch at the same time.
Re:You are WRONG :D by hughesjr · 2008-07-18 01:21 · Score: 1, Insightful

BUT ... how can you create a caching-nameserver without changing that file???
If you do not change that file, you do NOT have a caching-nameserver ... which was the whole point of installing that package.
Re:What kind of an idiot would...? by nabsltd · 2008-07-18 02:17 · Score: 2, Insightful

RH is the only distro I have ever tried - and I tried many of them - that would silently without any warning or prompt replace your config files with shipped version.
First, it doesn't do this without any warning...the output of rpm (which does the actual install) is forwarded to yum, or rhn, or whatever is running the "figure out everything I need and get it" process, and that is displayed to you when you are applying the patch. It clearly states in that output what happened with the file.
Second, for some updates (particularly security updates like this one), it is appropriate to save the old config file and load a default one, especially if that default one helps provide more security. Then, the admin can figure out what parts of the new default should be applied to their config, merge everything together, and restart the service.
These are the kinds of procedures that good admins do when they make changes to the system in any way.
Re:You are WRONG :D by Anonymous Coward · 2008-07-18 02:43 · Score: 1, Insightful

I hope my sarcasm detector is nonfunctional. Really.
Please read your post again and then look up what you need for a simple caching nameserver.
Welcome to third party packaging... by Venotar · 2008-07-18 02:50 · Score: 3, Insightful

This is news? Redhat (like every OS vendor I've ever dealt with) have been pushing out updates with broken assumptions for years.
In fact, this isn't even the first time they've done something similar when updating bind:
back in 2004 they released RHEL 3 update 4 and many people had precisely the same experience. Additionally, when applied, Update 4 removed the /etc/rc*.d/S*named and /etc/rc*.d/K*named and then shut named off.
As a quick glance at redhat's bugzilla shows, the first problem (the same one you experienced in this release) wasn't a schoolboy mistake on the packager's part, or a bug. It was the result of a poorly understood choice on the part of the person who originally provisioned the machine.
Rather than installing just the original bind-9.2.4, the people who had their named.conf overwritten had installed bind plus a package called caching-nameserver. It's that package that, when updated, backed up and overwrote their bind config. The "caching-nameserver" package should only be installed if you want to run a caching nameserver, because the caching-nameserver package isn't an application at all - it's simply a named.conf file.
The real bug (back in 2004) wasn't actually in Update 4's bind package. As it turns out, the package it replaced incorrectly contained a `chkconfig --del named` in its uninstall script.
Anyone without proper alerting and a good QA process found that one out the hard way. I had customers who'd gotten so blasé about performing nighttime maintenances without proper reversion testing that they scheduled nightly cronjobs that ran up2date at midnight and rebooted the production machine. Naturally, they woke up in the morning to find they'd just suffered 8 hours of downtime.
Lesson? Don't trust the vendor's QC work, don't install unnecessary packages, and make sure to QC your own work! Ask any experienced Windows admin about unintended consequences from "trusted" vendor patches...
Re:bug details by _Sprocket_ · 2008-07-18 02:55 · Score: 3, Insightful

It is a bug when an update overwrites your configuration file.
Normally I'd say you've got a valid point. The problem here is that the config file seems to be part of the intent of the package (please correct me if I'm wrong).
A rough example would be if someone replaced a packaged binary with a custom-compiled version and then complained when the package update overwrote that modified binary.
modified files should not be touched! by Anonymous Coward · 2008-07-18 04:36 · Score: 1, Insightful

The user has misconfigured their DNS and has installed a package called, SURPRISE, caching-nameserver along with the other bind packages.

If the RPM utility saw the configuration file was modified (from a mis-matched checksum against its database), it should not have touched the file.
What would have happened if the user was running BIND in a caching-only mode, but he modified the file to have a 'listen-on' directive for security purposes? He would have been using the package "in the correct" way, but it still could have borked his configuration.
While he probably should have tested, not having your update system modify your configuration behind your back is a pretty good prerequisite to have in an updating system.
Isn't this one of the advantages of having configuration stored in text files? Being able to see changes in a granular fashion? Imagine if it was a binary blob that was updated. Copying back a file (or restoring it from backup) is fairly easy, but if you had some obscure bag of bits it might be harder.
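The checksum comparison discussed above can be sketched as a decision function. This is a simplified illustration of how RPM is generally documented to treat files marked `%config` versus `%config(noreplace)`, not a copy of its implementation; the function name `config_action` and the returned labels are invented for this example.

```shell
# Decide what to do with a config file on upgrade, given three checksums:
#   $1 = previously packaged default, $2 = file currently on disk,
#   $3 = newly packaged default, $4 = "yes" if the file is marked noreplace.
config_action() {
    orig=$1
    current=$2
    new=$3
    noreplace=${4:-no}
    if [ "$current" = "$orig" ]; then
        echo "install-new"                  # admin never edited the file
    elif [ "$new" = "$orig" ] || [ "$current" = "$new" ]; then
        echo "keep-current"                 # nothing new to apply
    elif [ "$noreplace" = "yes" ]; then
        echo "keep-current-write-rpmnew"    # new default lands in FILE.rpmnew
    else
        echo "install-new-write-rpmsave"    # edited file is moved to FILE.rpmsave
    fi
}
```

Under this model, the outcome in the story (edited file moved to .rpmsave, default installed) corresponds to the last branch, while the behaviour the submitter expected corresponds to the noreplace branch.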
Re:chroot is not a security measure by mysidia · 2008-07-18 13:07 · Score: 2, Insightful

These arguments come up all the time. So it is with chroot.
The Linux kernel lost 'securelevel'. ("A hacker can turn it off by mucking around with /dev/mem anyways, or use $kernel_bug_of_the_day to flip the bit")
Python lost 'restricted' mode. (There are some ways to get code out of the restricted jail..)
PHP6 is losing features like safe_mode, open_basedir (Custom extensions may be able to open files despite the open_basedir restriction)
I wouldn't be surprised if chroot itself, and the ext3 'immutable' bit, eventually get removed or get a fat disclaimer not to use them. chroot probably only stays because it is used for some build environments.
Why? Because these security measures aren't perfect. They don't guarantee 100% security against a skilled attacker. They don't satisfy everyone.
Apparently for some folks, security measures aren't acceptable unless they're effective in 100% of situations and against 100% of the possible attackers.
Even if the measures had some very practical uses... the very danger that 'people might think this is a security measure', is worth removing useful features that make life harder for crackers.