Note If you deleted and created a new System partition, but you are installing Windows XP on a different partition, you will be prompted to select a file system for both the System and startup partitions.
I didn't modify the partition table during Windows XP installation, so this KB article is not relevant. What I did was try to install to an existing system with an unformatted NTFS partition as part of the following layout:
...umm, slashdot's lameness filter won't let me post it here, so I've updated the advogato post. It's the actual fdisk layout you would need to reproduce this problem.
If you choose to format 2 different partitions without paying attention, how is that MS's fault?
If they had asked me that question twice, I would indeed have been suspicious. But as I recall (and this is reproducible; if you distrust my memory, you're welcome to try it yourself), it only asked once, and it didn't give me any clue what partition it was talking about. Certainly nothing said "would you like to format as NTFS this existing partition you have marked as ext3?"
I have installed windows on a linux machine many times. Not once has it destroyed the hard drive. MBR, yes. Linux partition, no. All I had to do is go drop a new boot-loader on the machine. Something went wrong, that sucks (a lot), I understand. But that's hardly grounds to declare an entire OS "crap."
It's not that "something went wrong" one time out of many. If you follow a specific, reasonable sequence of actions, the contents of your hard drive will be destroyed 100% of the time. Furthermore, the makers of the OS already knew about this (presumably by someone else's hard drive being destroyed) and decided it wasn't important enough to fix. They just said "this is normal" in a place no one would ever find until after their hard drive had been destroyed. An open source developer would get lynched for that.
How many times you successfully followed a different procedure is totally irrelevant. Nowhere in the installation instructions does it say "don't do this or your hard drive will be destroyed". It was a reasonable way to partition my hard disk, but it made Windows go crazy and delete everything, and the developers don't even care. Why should I ever use that system again? Why should anyone ever use that system again?
I've had Red Hat (years ago) completely corrupt my HD while shutting down. Obviously I should never use any Linux distro ever again!
That's a horrible bug, but did they fix it or just document in a KnowledgeBase that "sometimes it completely corrupts your HD; this is normal"? If the latter, then I could certainly understand why you'd share my desire to never use the system in question again. But I doubt it - I would be shocked to hear that attitude from Linux developers.
[in a later post] Using anecdotal evidence that bashes windows is +3. Using anecdotal evidence that bashes Linux is trolling.
This isn't anecdotal evidence; it is fully scientific.
I start using a given operating system considering as true the falsifiable statement "the developers of this OS will attempt to fix any problem that causes total data loss". I won't use any system for which that statement has been falsified. One solid example is sufficient to do so, and for Windows I have a well-documented one. (Did you see the KnowledgeBase article?) The entire operating system is crap. I won't use it.
It's nothing new really. When I used to install Win98 as an afterthought alongside a Linux distro, I could be sure the lilo mbr would be trashed, and I made sure I had a boot floppy handy to boot back into linux and reinstall lilo.
Consider yourself lucky. When I installed Windows XP alongside Linux, it trashed my entire hard drive. That was the last time I ever installed Windows. The system simply cannot be trusted. I use Linux and OS X exclusively now.
I was under the impression that even with FSFS you still needed to use the hotcopy.py script in order to get a guaranteed consistent backup.
I originally thought so, too, but check out this thread. Old revision files are never modified, old revprop files are modified only when you do "svn propset --revision", and new files are created with a unique tempfile name then svn_fs_fs__move_into_place. My backup script does some additional sanity checking (ensures the dir is an fsfs repository of version 1 or 2, etc.) but you can really get away with just copying the files.
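To make the "unique tempfile name, then move into place" idea concrete, here's a rough Python sketch of the general pattern. It is not Subversion's code; `write_atomically` is a name I made up for illustration, and it assumes a POSIX filesystem where rename within one directory is atomic.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # get the bytes on disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX when both names share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A copy that runs concurrently with this either sees the finished file or doesn't see it at all (it might pick up a stray temp file, which is harmless), which is why a plain file copy is good enough for a store that only ever changes this way.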
The hash is stored in the meta information, and the compare option does only that, comparing the live system to your archive. It does not say anything about the change-detection behaviour used during a backup.
True, but my assumption (which again, I haven't checked) is that they wouldn't have stored this hash if they weren't doing something with it. I don't think the sanity check uses any information that's not gathered for normal operation.
[Time-based checking is not safe...touch example]
True. Your backup from before the move will be correct, so if you were to catch this before you got rid of the pre-move increment, you'd have a way to recover manually. I assume you're talking about after that. Yeah, there's a problem, but I'd say it's an incredibly minor failure when compared to not having incremental backups at all. I've sometimes gone for several backup cycles before realizing anything was wrong, so it would be difficult to convince me they're not worthwhile.
Do you have a real-world example where this might happen? The best I've got is moving messages in Cyrus IMAP - if they were both placed in the same second and have the same length, and you moved one away and the other into its folder before any other mail arrived there, I guess this would happen. I just consider that sequence pretty unlikely, and the consequences not too severe.
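For anyone who wants to see the failure mode rather than argue about it, here's a small Python sketch. The function names are mine, and the "quick" check (size plus whole-second mtime) is only my assumption about what a time-based tool compares; I haven't read rdiff-backup's code.

```python
import hashlib
import os


def quick_signature(path):
    """Size and whole-second mtime: the cheap check a time-based tool might use."""
    st = os.stat(path)
    return (st.st_size, int(st.st_mtime))


def content_hash(path):
    """SHA-256 of the contents: slower, but catches same-size, same-mtime changes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replace_keeping_mtime(path, data):
    """Simulate the unlucky case: new contents, same length, same timestamp."""
    st = os.stat(path)
    with open(path, "wb") as f:
        f.write(data)
    os.utime(path, (st.st_atime, st.st_mtime))
```

If you record both signatures, call `replace_keeping_mtime` with different data of the same length, and check again, the quick signature still matches while the hash does not. That's the whole problem in miniature; how often it happens in practice is the real question.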
This article totally neglects consistency. Recently I've put a lot of effort into getting consistent backups of things:
PostgreSQL by doing pg_dump to a file (easiest, diffs well if you turn off compression), pg_dump over a socket (better if disk space is tight, but you send the whole thing every time), or an elaborate procedure based on archive logs. (It's in the manual, but essentially you ensure logfiles aren't overwritten during the backup and that you copy files in the proper order.)
Other ACID databases with a write-ahead log in a similar way.
Subversion fsfs is really easy - it only changes files through atomic rename(), so you can just copy all the files away.
Subversion bdb is a write-ahead log-based system, easiest way is "svnadmin hotcopy".
Perforce by a simple checkpoint (which unfortunately locks the database for an hour if it's big enough) or a fancy procedure involving replaying journals on a second metadata directory...and a restore procedure that involves carefully throwing away anything newer than your checkpoint.
Cyrus imapd...I still haven't figured out how to do this. The best I've got is to use LVM to get a snapshot of the entire filesystem, but I don't really trust LVM.
...
If you're really desperate, anything can be safely backed up by shutting it down. A lot of people aren't willing to accept the downtime, though.
So you need a carefully-written, carefully-reviewed, carefully-tested procedure, and you need lockfiles to guarantee that it's not being run twice at once, that nothing else starts the server you shut down while the backup is going, etc. A lot of sysadmins screw this up - they'll do things like saying "okay, I'll run the snapshot at 02:00 and the backup at 03:00. The snapshot will have finished in an hour." And then something bogs down the system and it takes two, and the backup is totally worthless, but they won't know until they need to restore from it.
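As a minimal sketch of the lockfile part, here's one way to do it in Python with flock(). This is Unix-only, the names `exclusive_lock` and `AlreadyRunning` are made up for the example, and a real procedure would still need the review and testing mentioned above.

```python
import fcntl
import os
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Raised when someone else already holds the backup lock."""


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive, non-blocking flock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise AlreadyRunning(lock_path)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

If the snapshot job and the backup job both take the same lock, the 03:00 backup fails loudly when the 02:00 snapshot is still running, instead of quietly copying a half-finished snapshot.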
These systems put a lot of effort into durability by fsync()ing at the proper time, etc. If you just copy all the files in no particular order with no locking, you don't get any of those benefits. Your blind copy operation doesn't pay any attention to that sort of write barrier or see an atomic view of multiple files, so it's quite possible that (to pick a simple example) it copied the destination of a move before the move was complete and the source of the move after it was complete. Oops, that file's gone.
The article seems like a good one, though I think it may be a little too cautious. I would need to hear some real world examples before I would give up on incremental backups. Being able to store months worth of data seems so much better than being only able to store weeks because you aren't doing incremental backups.
I think his complaints are no longer relevant. rdiff-backup has a --compare-hash option, though I haven't checked the details. Maybe the author should give it another look...
Besides, if you have an accurate timeserver (you should! time is unbelievably important to software in general!), the timestamp check is pretty safe, barring maliciousness. And if your machine has been compromised, the data coming off it should not be trusted in general. This is just one more case of that.
One thing not mentioned is encryption. [...] Lately I have been using duplicity.
It seems like a great idea, but my impression was that it was missing a lot of the same love, care, testing, documentation, etc. that has been put into rdiff-backup. They're by the same guy, but he obviously has been concentrating largely on the one, and I don't believe they share any code.
Have you looked at brackup? It seems promising, anyway, but I haven't actually tried it. Maybe when it's a little more mature...
I use parameterized SQL queries so the notion of having to check/escape something as simple as single quotes hadn't occurred to me.
Likewise, so neither of us has the problem this article is talking about. CVE's #2 error (which accounts for 14% of all reported security problems, and is present in at least 11% of this guy's web site sample) is really that basic.
As for your checks, I do things a little differently:
In servlets, I typically let the servlet container and Java runtime take care of character encoding issues for me. By the time I get it, it's a java.lang.String, which is valid Unicode. (I think it's stored as UTF-16...but I'm not quite sure, because that's a detail I don't have to bother with.)
I tend to just catch length errors coming back from the database. In general, you need to correctly handle errors coming back from the database anyway. If you don't serialize all accesses to the database, you'll get a unique constraint violation at some point. So I go further and just handle the SQLExceptions in general, rather than trying to avoid them. It makes the error path less efficient, but I just don't care about the efficiency of the error path. It saves code and eases maintenance, since I don't have to put "this field can only be 50 characters long" in as many places.
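Roughly what I mean, sketched in Python with sqlite3 standing in for a servlet and JDBC (so `sqlite3.IntegrityError` plays the role of SQLException). The table and the function names are invented for the example.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        " name TEXT PRIMARY KEY,"
        " CHECK (length(name) <= 50))"  # the length rule lives in one place: the schema
    )
    return conn


def add_user(conn, name):
    """Try the insert and report failure, rather than pre-checking every rule in the app."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        # Raised for both a duplicate name and an over-long name.
        return False
```

The error path costs a round trip to the database, but the "50 characters" rule is stated exactly once, and the duplicate-name race is handled by the same code.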
In short, trusting the client (i.e. the web browser) to not send bad values - either through the INPUT tag's maxlength attribute, JavaScript scrubbing or whatever - is entirely the wrong way to go. The web script must check all user input for validity along with properly escaping everything from the database that's getting sent back via HTML.
It's not just "bad" values, unless you believe that all people with names like O'Malley are out to destroy your website. SQL Injection is a very simple problem: people are confusing arbitrary data with fragments of SQL. The former should be passed in through bind variables - or escaped if you're hellbent on destroying your performance - while the latter should be executed directly. It's fortunate for the O'Malleys out there that this mistake is a huge security hole, or they would be rejected by any kind of automated system. People barely care about security; they certainly don't care about "lesser" bugs.
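Here's the difference in a few lines of Python, using sqlite3 as a stand-in for whatever database you have. The `people` table and `find_by_name` are hypothetical; the point is only that the name travels as a bind variable and is never parsed as SQL.

```python
import sqlite3


def find_by_name(conn, name):
    """Pass the name as a bind variable; the driver never parses it as SQL."""
    cur = conn.execute("SELECT name FROM people WHERE name = ?", (name,))
    return [row[0] for row in cur.fetchall()]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT)")
    conn.executemany(
        "INSERT INTO people (name) VALUES (?)",
        [("O'Malley",), ("Smith",)],
    )
    hits = find_by_name(conn, "O'Malley")  # a legitimate name with a quote in it
    attack = find_by_name(conn, "x' OR '1'='1")  # a classic injection string
    return hits, attack
```

O'Malley is found, and the injection string is treated as an odd name that matches nobody. No escaping and no character blacklist are involved.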
I often see people claim you just check for bad characters - either specific lists of punctuation or anything outside the alphanumeric range - because you can't possibly get all of your database code right. That's an incredibly wrong idea:
You can and must get all of your database code right. I developed Axamol SQL Library for this purpose. Other solutions vary.
The "nothing but alphanumeric" crowd's code doesn't work with Unicode. So José can't type his name in properly.
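To illustrate that last point, this is the sort of filter I keep seeing, written in Python. The pattern is my guess at the typical check, not anyone's actual code.

```python
import re

# ASCII letters and digits only, as such filters are commonly written.
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")


def ascii_only_filter(name):
    """Return True if the name passes the 'nothing but alphanumeric' check."""
    return ASCII_ALNUM.fullmatch(name) is not None
```

It accepts "Jose" and rejects both "José" and "O'Malley", neither of whom did anything wrong.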
Unfortunately, I have to ding them on this - if the password is wrong, it hides the error message from you (you get something generic like "connection failed").
Well, yeah, but that's because it's Good Security Practice (TM). If J. Random H4x0r knew that the username was correct but the password was wrong, then he knows he's already halfway there. You're not supposed to give away that some of the information is right, but some is wrong. You're just supposed to say, "No. Try Again."
I agree that it's a good security practice for the server to say "Authentication failed" instead of "Wrong user" or "Wrong password". But he's talking about the client saying "It failed" instead of "The server said 'NO Authentication failed' after I sent credentials", "my connection to mailserver:143 was refused", or any number of other error messages that would actually be helpful.
There is no security value in the client (which is under the user's control) throwing away information instead of presenting it to the user. It's just poor error handling, and unfortunately many of Apple's client/server applications are guilty of it. I generally resort to using Wireshark when diagnosing problems with them, which is not something you can teach the average user to do. Address Book is the worst - it just stops returning results after you change your password! No error message at all! How horrible is that?
"Mail" connects, tries each possibility, and sets it to the most secure option that works.
Really? For me, it's always made me enter my credentials, then attempted to send them across the Internet in plaintext, complained that it didn't work[1], and then on the next tab, given me an SSL checkbox. I've always thought of Mail.app's configuration as incredibly stupid and insecure.
The proper way to do it would be to first try SSL and if that doesn't work, prompt the user with suitably scary text ("Secure connection failed. If you're sure your administrator is a moron who doesn't care about security, hit 'Try insecurely'; otherwise hit 'Abort' and contact your security team"). Of course, it's less convenient and will produce panicked users when the administrator doesn't care about security, but it should produce panicked users when the administrator doesn't care about security. Anything else is broken.
General rule: security and convenience are opposites. Unfortunately, that often gets overlooked by people who whine "blah blah, software's too hard, blah blah".
[1] - Thankfully. Like anyone who cares about security, I've disabled unencrypted IMAP logins. So it only could work by man-in-the-middle attack - someone accepting (and probably logging) unencrypted connections, and initiating encrypted connections on their behalf.
If they want to judge a student's writing skills, this would be a much better prompt:
Your friend is contemplating cheating on the SAT. Write a letter to dissuade him/her from doing so.
I completely agree. I believe that
students should write essays that persuade people to take specific action. In my first semester of college, I took both "Accelerated Rhetoric" and "Principles of Chemistry I". I wrote many "persuasive essays", but the one I'm most proud of was a plea to the chemistry instructors to stop requiring us to do homework through a stupid web application. Most of my other essays were on stupid subjects and graded quite subjectively, but this one was concrete and objectively successful.
students should not write one-sided essays on complex subjects. Instead, they should be encouraged to write a "shared search for truth". They can go back and put their conclusion in their introduction, but it's valuable to show the complete thought process in considering all aspects of the subject. I'd consider an engineering design incomplete without a trade-offs section; why is this different? It's rare for one side to be completely right and the other completely wrong.
students should not write exclusively in prose. In the real world, people skim your writing. Why not accommodate that? Section headers, bulleted lists, sentence fragments, and even bold text are tools that are inappropriately discouraged in this format.
For some reason (I think I know why but I will leave it to you to come up with your own) the developers seem to prefer working on GPLed products even when there are competing BSD based projects out there. It is because of that the MySQL development is robust and the userbase is strong while postgres remains more or less a niche product.... Given any product space you will find that the GPLed products are more likely to be widely used and have greater developer communities.
I don't buy it. The only examples I can think of are Linux vs. BSD and MySQL vs. PostgreSQL. In both cases, there are a lot of other reasons that might explain the GPLed one's popularity. Mostly historical, I think - the AT&T lawsuit slowed down BSD adoption, and supposedly PostgreSQL used to be a lot slower and harder to use. The PHP people focusing exclusively on MySQL really made its popularity grow.
This isn't to say postgres is not a better product (or firebird or whatever) it's just that in the marketplace it's lost mojo.
The history of open source is littered with BSD-based empty victories like this.
Is it? I'm not familiar with the SPICE landscape, but I am with PostgreSQL:
PostgreSQL, while an excellent product that I still use often, is stagnating while MySQL slowly surpasses it in every way.
Umm, what? How is PostgreSQL stagnating? It's a widely-used product with frequent releases, full-time contributors back to the open-source core, and several commercial support offerings. What do you mean by "MySQL slowly surpasses it in every way"? If you're talking about popularity, MySQL's always been more popular. If you're talking about something technical, well, I have absolutely no idea what it could be.
I am not trying to convince you or anybody else really. The point of the talk was to show people how they could benchmark and profile something to get a real sense of where their application was spending most of its time.
That's cool, but please don't spread this myth:
I don't think this is news to anybody that MySQL is quicker at connecting and issuing simple queries
Even eliminating connection overhead, MySQL was faster in this case.
In the later tests, after you'd turned on the deceptive query cache?
I wasn't actually using the query cache in MySQL in the initial steps in this iterative optimization because MySQL's query cache doesn't kick in for prepare/execute queries.
my suggestion is to turn on prepare/execute emulation in PDO while behind the scenes it will use the faster direct query api calls and thus will also hit the query cache.
The same one I complained about in my first post:
Were you using MySQL's query cache? Turn it off. It shows bigger numbers on some bad benchmarks but doesn't help real situations: artificially claims silly numbers for tiny sets (are your real data as small as your benchmark?), cleared after every DML statement on that table, etc.
Look, I don't care about your progression, I care that you've made in this thread a claim that it's well-known that MySQL is faster in real-world situations. I assert that you're wrong. If you want to convince me, you'll have to give me a single reasonable configuration of a modern MySQL database, a nearly-equivalent single reasonable configuration of a modern PostgreSQL database, and show me the MySQL one being faster in a good approximation of a real-world load. No one's ever done that, so it's certainly not well-known, and it's probably not true.
And yes, I do know my way around PostgreSQL. It's a good database, but no matter how you tweak it, it still has more connection overhead than MySQL does.
Are you talking about the actual connection initiation? Let me add to my list: were you pooling connections? I'd never actually considered that someone wouldn't do this in a web application. Even if MySQL's connection overhead is lower, it doesn't matter, because both can be trivially made 0.
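In case "pooling" is unclear, here's about the smallest pool I can write in Python. It's a toy sketch: `ConnectionPool` is my own name, and `connect` is any callable that returns an open connection for your database driver. Real pools also handle dead connections, timeouts and so on.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Open a fixed number of connections once and hand them out on demand."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())  # pay the connection cost up front

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks if every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it to the pool instead of closing it
```

Each request borrows a connection with `with pool.connection() as conn:` and gives it back afterwards, so the per-request connection cost disappears regardless of which database is on the other end.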
This isn't a PostgreSQL slam, it is simply pointing out that people should benchmark and profile their actual applications and understand the costs.
That I agree with. But it must be done carefully. It's very easy to deceive yourself by benchmarking something you think is similar to your application but is not. Or by comparing two different techniques, with one done improperly.
I have been doing this stuff a long time and have been slammed on /. countless times, but please, slam me for things I actually said or did.
Sure thing. Did you say this?
Part of the reason Lerdorf considers the Web "broken" is that it is inherently insecure for a variety of reasons. One of those reasons sits at the feet of developers.
"You don't know that you have to filter user input," Lerdorf exclaimed.
If you don't like insecurity due to poor input handling, why did you design your language to encourage it? magic-quotes-gpc is the worst language feature I have ever seen. It manipulates one particular set of inputs to make them conform to one set of output which doesn't always apply but is always a bad idea. People should be using bind variables supplied by the database library, not quoting according to MySQL x.x's rules and then sticking things directly into their statements. This is like a giant neon sign called "Security" pointing in the opposite direction from the real thing.
In contrast, Perl has taint mode, a feature you'd do well to emulate. It actually tracks a flag on each variable seeing if it came, directly or indirectly, from untrusted input. If so, it must be untainted before being used in any of a number of security-related situations. It's smart enough not to impose one particular way of untainting, which would probably be inappropriate. It just flags things which are almost certainly wrong. Actual thought needs to go into correcting them, and as users learn the situations taint mode complains about, they trip it less and less often. Correct taint-mode code runs the same with it off, which makes it much superior to magic-quotes-gpc.
I went through a series of optimizations of a sample Web application, and one of many steps was to try MySQL instead of PostgreSQL for that particular application. By profiling it with Callgrind it was obvious that in this particular case MySQL was significantly faster. I don't think this is news to anybody that MySQL is quicker at connecting and issuing simple queries
It's news to me. I haven't seen a recent benchmark that says this, and I'm always skeptical of claims MySQL is faster:
Were you using MySQL as an ACID database? I.e., all tables using a transaction table type, fdatasync() on, real tests telling you that durability is actually working? If not, either run it properly or run PostgreSQL in stupid mode for something approaching an apples-to-apples comparison. fsync = off in $PGDATA/postgresql.conf.
Were you using MySQL's query cache? Turn it off. It shows bigger numbers on some bad benchmarks but doesn't help real situations: artificially claims silly numbers for tiny sets (are your real data as small as your benchmark?), cleared after every DML statement on that table, etc.
For that matter, did you issue any DML statements at all? As the bullet point above mentions, they have much greater impact on performance than their proportion would suggest. For other reasons, too. Doesn't MySQL still just have table-level locks? PostgreSQL's the other extreme; it has MVCC.
Seriously, if you can prove MySQL is faster for a real-life situation, write a paper, lay all your steps out for review. (Or point me at one someone else has done on modern versions of said databases.) There are a lot of potential mistakes in benchmarking, and I won't believe claims unless I actually see that none of them were made.
By the way, what were you saying about Apache header stupidity? The article is annoyingly vague.
This has been a serious flaw in Unixes since I knew about it. The OS will let your HD fill up and overwrite itself. Many *nix flag wavers often defend this behaviour. Why they do is beyond me.
You're completely wrong. When disk usage reaches 100%, write() will return ENOSPC (no space). The superuser will still be able to use the system, because traditionally, there's a 5% reserve which only uid=0 can use. (The "df" goes up to 105%.) The correct semantics are well-defined. If you've seen anything else, it's a bug in whatever system you were using, which no one could seriously defend.
Now, if you're talking about user applications breaking when encountering this condition...yeah, there are certainly some out there that break. There are buggy applications written for every platform. It's just laziness - Unix gives them well-defined semantics they can use to handle it correctly and an easy test environment (quotas).
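Handling it correctly isn't much code. A rough Python sketch (the function name is made up, and what you do on failure is up to the application):

```python
import errno
import os


def write_or_report(path, data):
    """Write data to path; return False (instead of crashing) if the disk is full."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than requested
            view = view[written:]
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return False  # disk full: the caller can warn the user, clean up, retry
        raise
    finally:
        os.close(fd)
```

On Linux you don't even need quotas to exercise the failure path: writes to /dev/full always fail with ENOSPC.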
The article mentioned two people (the original "young man" and the "young lady" later) who told the customer there was a blanket policy, and the spokeswoman who told the press that it was discretionary. I'm guessing that there was a blanket policy. PR people aren't above a bit of revisionist history in the name of damage control.
This strikes me as typical of big companies. They have a couple incidents where someone uses poor judgement or abuses his power [1]. They don't have the guts to confront him, so instead they put in place a blanket policy of mediocrity - it (hopefully) makes that particular incidence of abuse impossible but also prevents excellence. Unless some powerful outside group complains [2], these blanket policies never get removed, even after they're shown again and again to be a bad idea. Discretion erodes until everyone is mindlessly following stupid orders and nothing can get done anymore. Either they coast on their past accomplishments or, if they never had any, the company folds and everyone goes to work for different, smaller companies. The cycle repeats.
[1] - "introduced in response to complaints that staff had mis-sold products last year."
[2] - Like the over-70 crowd; they're disproportionate voters, so they'll probably get their law against discriminating against older consumers. The company's apparently already backtracked on the policy, but it won't be enough to satisfy the seniors. Ironically, they'll create a blanket policy to counter another blanket policy, and it will probably make illegal the sort of good judgement that should have been made in the first place.
Re: the "simple solution for these five conditions", then "that would never scale!"
I suspect Guy 1 was taking a preemptive strike against tangents about caching solutions (some interviewees focus on tweaks to mask not understanding the basic approach to the problem) or perhaps against a quest to find the provable lowest asymptotic efficiency, when they actually use this problem for smallish data sets. He probably just didn't say what he really meant - maybe "small numbers (about 5), focus on the simple/elegant approach rather than performance".
I don't know what the problem was, but I probably agree with Guy 2. Denormalize the table later if you absolutely have to for performance. Your first solution should be simple, elegant, and maintainable. (A number like 5 probably didn't come from the fundamentals of the problem; it came from marketing's list of entries on a web form or something. It'll become 10 later, and five like columns was painful already. 10 is unmaintainable.) If you do remove that elegance later, you'd better have the performance numbers to back it up.
But in general, it does sound like they screwed up the interview. Never count it against someone for you leading them astray, never ask questions to which you can't recognize correct answers, expect a few minor mistakes (or your questions are way too easy), and don't have everyone ask just this sort of question.
Sadly, this sounds harder than the questions I usually weed people out with. My first question is a five-line function (in C, which these people have 10+ years experience with). Strangely, few people seem to grasp that when I use a ridiculously easy five-line function to decide if you get a job or not, you should be careful. Often every line is wrong. Recently I've even been giving them more opportunities to correct themselves - "how would you test this?" "[manually printing the result]" "Okay, it spits out gibberish." "I don't understand how that could be." "Well, it spits out gibberish. What do you do to debug it?" "Uhh..."
I didn't modify the partition table during Windows XP installation, so this KB article is not relevant. What I did was try to install to an existing system with an unformatted NTFS partition as part of the following layout:
...umm, slashdot's lameness filter won't let me post it here, so I've updated the advogato post. It's the actual fdisk layout you would need to reproduce this problem.
If they had asked me that question twice, I would indeed have been suspicious. But as I recall (and this is reproducible; if you distrust my memory, you're welcome to try it yourself), it only asked once, and it didn't give me any clue what partition it was talking about. Certainly nothing said "would you like to format as NTFS this existing partition you have marked as ext3?"
It's not that "something went wrong" one time out of many. If you follow a specific, reasonable sequence of actions, the contents of your hard drive will be destroyed 100% of the time. Furthermore, the makers of the OS already knew about this (presumably by someone else's hard drive being destroyed) and decided it wasn't important enough to fix. They just said "this is normal" in a place no one would ever find until after their hard drive had been destroyed. An open source developer would get lynched for that.
How many times you successfully followed a different procedure is totally irrelevant. Nowhere in the installation instructions does it say "don't do this or your hard drive will be destroyed". It was a reasonable way to partition my hard disk, but it made Windows go crazy and delete everything, and the developers don't even care. Why should I ever use that system again? Why should anyone ever use that system again?
That's a horrible bug, but did they fix it or just document in a KnowledgeBase that "sometimes it completely corrupts your HD; this is normal". If the latter, then I could certainly understand why you'd share my desire to never use the system in question again. But I doubt it - I would be shocked to hear that attitude from Linux developers.
This isn't anecdotal evidence; it is fully scientific.
I start using a given operating system considering as true the falsifiable statement "the developers of this OS will attempt to fix any problem that causes total data loss". I won't use any system for which that statement has been falsified. One solid example is sufficient do so, and for Windows I have a well-documented one. (Did you see the KnowledgeBase article?) The entire operating system is crap. I won't use it.
Consider yourself lucky. When I installed Windows XP alongside Linux, it trashed my entire hard drive. That was the last time I ever installed Windows. The system simply can not be trusted. I use Linux and OS X exclusively now.
True, but my assumption (which again, I haven't checked) is that they wouldn't have stored this hash if they weren't doing something with it. I don't think the sanity check uses any information that's not gathered for normal operation.
True. Your backup from before the move will be correct, so if you were to catch this before you got rid of the pre-move increment, you'd have a way to recover manually. I assume you're talking about after that. Yeah, there's a problem, but I'd say it's an incredibly minor failure when compared to not having incremental backups at all. I've sometimes gone for several backup cycles before realizing anything was wrong, so it would be difficult to convince me they're not worthwhile.
Do you have a real-world example where this might happen? The best I've got is moving messages in Cyrus IMAP - if they were both placed in the same second and have the same length, and you moved one away and the other into its folder before any other mail arrived there, I guess this would happen. I just consider that sequence pretty unlikely, and the consequences not too severe.
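To make that corner case concrete, here is a minimal Python sketch of the two kinds of change check. It is not rdiff-backup's actual algorithm; the function names, the file names, and the choice of SHA-1 are all assumptions for illustration. Two files with the same length and the same whole-second mtime look identical to the cheap check, while a content hash tells them apart.

```python
import hashlib
import os
import tempfile


def quick_signature(path):
    """Cheap change check: file size plus mtime truncated to whole seconds."""
    st = os.stat(path)
    return (st.st_size, int(st.st_mtime))


def content_hash(path):
    """Expensive change check: SHA-1 over the full file contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def demo():
    """Build two same-length, same-second files and compare both checks."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "msg_a")  # hypothetical message files
        b = os.path.join(d, "msg_b")
        with open(a, "wb") as f:
            f.write(b"hello alice")
        with open(b, "wb") as f:
            f.write(b"hello carol")
        # Force both files to carry the same modification time.
        os.utime(a, (1000000000, 1000000000))
        os.utime(b, (1000000000, 1000000000))
        same_quick = quick_signature(a) == quick_signature(b)
        same_hash = content_hash(a) == content_hash(b)
        return same_quick, same_hash
```

Here demo() reports that the quick signatures match while the hashes differ, which is exactly the swap the timestamp check would miss.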
Oops, I meant "consistent" here. "Atomic view" is nonsense.
So you need a carefully-written, carefully-reviewed, carefully-tested procedure, and you need lockfiles to guarantee that it's not being run twice at once, that nothing else starts the server you shut down while the backup is going, etc. A lot of sysadmins screw this up - they'll do things like saying "okay, I'll run the snapshot at 02:00 and the backup at 03:00. The snapshot will have finished in an hour." And then something bogs down the system and it takes two, and the backup is totally worthless, but they won't know until they need to restore from it.
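As a rough sketch of the lockfile part, assuming a POSIX system and Python's fcntl module: the snapshot and the backup run as one sequence under a single exclusive lock, so the backup can never start before the snapshot has finished, and a second run fails immediately instead of overlapping. The names exclusive_lock and run_backup are hypothetical, and the snapshot and backup steps are passed in as plain callables.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive, non-blocking flock on `path` for the with-block.

    Raises BlockingIOError if someone else already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def run_backup(lock_path, snapshot, backup):
    """Run snapshot then backup back to back, never on a fixed clock offset."""
    with exclusive_lock(lock_path):
        snapshot()  # must complete before the backup reads anything
        backup()


def demo():
    """One normal run, then an attempted second run while the lock is held."""
    order = []
    with tempfile.TemporaryDirectory() as d:
        lock = os.path.join(d, "backup.lock")
        run_backup(lock, lambda: order.append("snapshot"),
                   lambda: order.append("backup"))
        with exclusive_lock(lock):
            try:
                run_backup(lock, lambda: order.append("snapshot"),
                           lambda: order.append("backup"))
                second_run_blocked = False
            except BlockingIOError:
                second_run_blocked = True
    return order, second_run_blocked
```

The point of the design is that ordering comes from the code path, not from guessing how long the snapshot takes. A real procedure would also need to hold off whatever might restart the server mid-backup, which this sketch does not attempt.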
These systems put a lot of effort into durability by fsync()ing at the proper time, etc. If you just copy all the files in no particular order with no locking, you don't get any of those benefits. Your blind copy operation doesn't pay any attention to that sort of write barrier or see an atomic view of multiple files, so it's quite possible that (to pick a simple example) it copied the destination of a move before the move was complete and the source of the move after it was complete. Oops, that file's gone.
I think his complaints are no longer relevant. rdiff-backup has a --compare-hash option, though I haven't checked the details. Maybe the author should give it another look...
Besides, if you have an accurate timeserver (you should! time is unbelievably important to software in general!), the timestamp check is pretty safe, barring maliciousness. And if your machine has been compromised, the data coming off it should not be trusted in general. This is just one more case of that.
It seems like a great idea, but my impression was that it was missing a lot of the same love, care, testing, documentation, etc. that has been put into rdiff-backup. They're by the same guy, but he obviously has been concentrating largely on the one, and I don't believe they share any code.
Have you looked at brackup? It seems promising, anyway, but I haven't actually tried it. Maybe when it's a little more mature...
Likewise, so neither of us has the problem this article is talking about. CVE's #2 error (which accounts for 14% of all reported security problems, and is present in at least 11% of this guy's web site sample) is really that basic.
As for your checks, I do things a little differently:
It's not just "bad" values, unless you believe that all people with names like O'Malley are out to destroy your website. SQL Injection is a very simple problem: people are confusing arbitrary data with fragments of SQL. The former should be passed in through bind variables - or escaped if you're hellbent on destroying your performance - while the latter should be executed directly. It's fortunate for the O'Malleys out there that this mistake is a huge security hole, or they would be rejected by any kind of automated system. People barely care about security; they certainly don't care about "lesser" bugs.
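A minimal illustration of the bind-variable approach, using Python's built-in sqlite3 module as a stand-in for whatever database library is actually in use. The users table and the find_user helper are made up for the example. The name is passed as data, separately from the SQL text, so O'Malley is looked up correctly and a hostile string is just a name that matches nobody.

```python
import sqlite3


def find_user(conn, last_name):
    """Look up users by last name; the value is bound, never spliced into SQL."""
    cur = conn.execute(
        "SELECT id, last_name FROM users WHERE last_name = ?",
        (last_name,),
    )
    return cur.fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT)")
    conn.executemany(
        "INSERT INTO users (last_name) VALUES (?)",
        [("O'Malley",), ("Smith",)],
    )
    hostile = "x' OR '1'='1"  # treated as an odd name, not as SQL
    return find_user(conn, "O'Malley"), find_user(conn, hostile)
```

No character filtering is involved anywhere: the apostrophe is legitimate data and is handled as such.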
I often see people claim you just check for bad characters - either specific lists of punctuation or anything outside the alphanumeric range - because you can't possibly get all of your database code right. That's an incredibly wrong idea:
I agree that it's a good security practice for the server to say "Authentication failed" instead of "Wrong user" or "Wrong password". But he's talking about the client saying "It failed" instead of "The server said 'NO Authentication failed' after I sent credentials", "my connection to mailserver:143 was refused", or any number of other error messages that would actually be helpful.
There is no security value in the client (which is under the user's control) throwing away information instead of presenting it to the user. It's just poor error handling, and unfortunately many of Apple's client/server applications are guilty of it. I generally resort to using Wireshark when diagnosing problems with them, which is not something you can teach the average user to do. Address Book is the worst - it just stops returning results after you change your password! No error message at all! How horrible is that?
Really? For me, it's always made me enter my credentials, then attempted to send them across the Internet in plaintext, complained that it didn't work[1], and then on the next tab, given me an SSL checkbox. I've always thought of Mail.app's configuration as incredibly stupid and insecure.
The proper way to do it would be to first try SSL and if that doesn't work, prompt the user with suitably scary text ("Secure connection failed. If you're sure your administrator is a moron who doesn't care about security, hit 'Try insecurely'; otherwise hit 'Abort' and contact your security team"). Of course, it's less convenient and will produce panicked users when the administrator doesn't care about security, but it should produce panicked users when the administrator doesn't care about security. Anything else is broken.
General rule: security and convenience are opposites. Unfortunately, that often gets overlooked by people who whine "blah blah, software's too hard, blah blah".
[1] - Thankfully. Like anyone who cares about security, I've disabled unencrypted IMAP logins. So it only could work by man-in-the-middle attack - someone accepting (and probably logging) unencrypted connections, and initiating encrypted connections on their behalf.
I completely agree. I believe that
I don't buy it. The only examples I can think of are Linux vs. BSD and MySQL vs. PostgreSQL. In both cases, there are a lot of other reasons that might explain the GPLed one's popularity. Mostly historical, I think - the AT&T lawsuit slowed down BSD adoption, and supposedly PostgreSQL used to be a lot slower and harder to use. The PHP people focusing exclusively on MySQL really made its popularity grow.
You can't lose what you never had.
Is it? I'm not familiar with the SPICE landscape, but I am with PostgreSQL:
Umm, what? How is PostgreSQL stagnating? It's a widely-used product with frequent releases, full-time contributors back to the open-source core, and several commercial support offerings. What do you mean by "MySQL slowly surpasses it in every way"? If you're talking about popularity, MySQL's always been more popular. If you're talking about something technical, well, I have absolutely no idea what it could be.
In the later tests, after you'd turned on the deceptive query cache?
The same one I complained about in my first post: Look, I don't care about your progression, I care that you've made in this thread a claim that it's well-known that MySQL is faster in real-world situations. I assert that you're wrong. If you want to convince me, you'll have to give me a single reasonable configuration of a modern MySQL database, a nearly-equivalent single reasonable configuration of a modern PostgreSQL database, and show me the MySQL one being faster in a good approximation of a real-world load. No one's ever done that, so it's certainly not well-known, and it's probably not true.
Okay...so is that what you were talking about here?
If so, why are you complaining about something that has no significance? If not, what are you talking about?
Are you talking about the actual connection initiation? Let me add to my list: were you pooling connections? I'd never actually considered that someone wouldn't do this in a web application. Even if MySQL's connection overhead is lower, it doesn't matter, because both can be trivially made 0.
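For anyone who hasn't seen pooling, here is a bare-bones sketch in Python. The ConnectionPool class is hypothetical, and sqlite3 stands in for a MySQL or PostgreSQL driver. Connections are opened once up front and then borrowed and returned, so the per-request connection cost disappears regardless of which database is cheaper to connect to.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: pay the connection cost `size` times, then reuse."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        self.opened = 0  # how many real connections were ever created
        for _ in range(size):
            self._idle.put(connect())
            self.opened += 1

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks if every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand it back instead of closing it


def demo():
    """Serve 100 'requests' and report how many connections were opened."""
    pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)
    results = []
    for _ in range(100):
        with pool.connection() as conn:
            results.append(conn.execute("SELECT 1").fetchone()[0])
    return pool.opened, sum(results)
```

A production pool would also deal with dead connections and with threads, which this sketch leaves out.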
That I agree with. But it must be done carefully. It's very easy to deceive yourself by benchmarking something you think is similar to your application but is not. Or by comparing two different techniques, with one done improperly.
Sure thing. Did you say this?
If you don't like insecurity due to poor input handling, why did you design your language to encourage it? magic-quotes-gpc is the worst language feature I have ever seen. It manipulates one particular set of inputs to make them conform to one set of output which doesn't always apply but is always a bad idea. People should be using bind variables supplied by the database library, not quoting according to MySQL x.x's rules and then sticking things directly into their statements. This is like a giant neon sign called "Security" pointing in the opposite direction from the real thing.
In contrast, Perl has taint mode, a feature you'd do well to emulate. It actually tracks a flag on each variable recording whether it came, directly or indirectly, from untrusted input. If so, it must be untainted before being used in any of a number of security-related situations. It's smart enough not to dictate a particular way of untainting, since any fixed way would probably be inappropriate. It just flags things which are almost certainly wrong. Actual thought needs to go into correcting them, and as users learn the situations taint mode complains about, they trip it less and less often. Correct taint-mode code runs the same with it off, which makes it much superior to magic-quotes-gpc.
It's news to me. I haven't seen a recent benchmark that says this, and I'm always skeptical of claims MySQL is faster:
Seriously, if you can prove MySQL is faster for a real-life situation, write a paper, lay all your steps out for review. (Or point me at one someone else has done on modern versions of said databases.) There are a lot of potential mistakes in benchmarking, and I won't believe claims unless I actually see that none of them were made.
By the way, what were you saying about Apache header stupidity? The article is annoyingly vague.
You're completely wrong. When disk usage reaches 100%, write() will fail with ENOSPC (no space). The superuser will still be able to use the system, because traditionally, there's a 5% reserve which only uid=0 can use. (The "df" goes up to 105%.) The correct semantics are well-defined. If you've seen anything else, it's a bug in whatever system you were using, which no one could seriously defend.
Now, if you're talking about user applications breaking when encountering this condition...yeah, there are certainly some out there that break. There are buggy applications written for every platform. It's just laziness - Unix gives them well-defined semantics they can use to handle it correctly and an easy test environment (quotas).
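Handling it correctly is not much code. Below is a small Python sketch, assuming a POSIX system; the append_record helper and its injectable write parameter are made up for the example. A full disk (ENOSPC) or an exhausted quota (EDQUOT) is treated as an expected outcome the caller can act on, and every other error still propagates.

```python
import errno
import os


def append_record(fd, data, write=os.write):
    """Write all of `data` to `fd`.

    Returns True on success and False if the disk or quota is full.
    Any other OSError is re-raised. `write` is injectable for testing.
    """
    view = memoryview(data)
    while view:
        try:
            n = write(fd, view)  # may be a short write; loop until done
        except OSError as e:
            if e.errno in (errno.ENOSPC, errno.EDQUOT):
                return False
            raise
        view = view[n:]
    return True
```

A False return can follow a partial write, so a real application would still have to decide what to do with the half-written record (truncate it, retry later, tell the user). Running the program under a small quota is the easy way to exercise that path.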
The article mentioned two people (the original "young man" and the "young lady" later) who told the customer there was a blanket policy, and the spokeswoman who told the press that it was discretionary. I'm guessing that there was a blanket policy. PR people aren't above a bit of revisionist history in the name of damage control.
This strikes me as typical of big companies. They have a couple of incidents where someone uses poor judgement or abuses his power [1]. They don't have the guts to confront him, so instead they put in place a blanket policy of mediocrity - it (hopefully) makes that particular incidence of abuse impossible but also prevents excellence. Unless some powerful outside group complains [2], these blanket policies never get removed, even after they're shown again and again to be a bad idea. Discretion erodes until everyone is mindlessly following stupid orders and nothing can get done anymore. Either they coast on their past accomplishments or, if they never had any, the company folds and everyone goes to work for different, smaller companies. The cycle repeats.
[1] - "introduced in response to complaints that staff had mis-sold products last year."
[2] - Like the over-70 crowd; they're disproportionate voters, so they'll probably get their law against discriminating against older consumers. The company's apparently already backtracked on the policy, but it won't be enough to satisfy the seniors. Ironically, they'll create a blanket policy to counter another blanket policy, and it will probably make illegal the sort of good judgement that should have been made in the first place.
Re: the "simple solution for these five conditions", then "that would never scale!"
I suspect Guy 1 was taking a preemptive strike against tangents about caching solutions (some interviewees focus on tweaks to mask not understanding the basic approach to the problem) or perhaps against a quest to find the provably lowest asymptotic complexity, when they actually use this problem for smallish data sets. He probably just didn't say what he really meant - maybe "small numbers (about 5), focus on the simple/elegant approach rather than performance".
I don't know what the problem was, but I probably agree with Guy 2. Denormalize the table later if you absolutely have to for performance. Your first solution should be simple, elegant, and maintainable. (A number like 5 probably didn't come from the fundamentals of the problem; it came from marketing's list of entries on a web form or something. It'll become 10 later, and five like columns were painful already. 10 is unmaintainable.) If you do remove that elegance later, you'd better have the performance numbers to back it up.
But in general, it does sound like they screwed up the interview. Never count it against someone when you led them astray, never ask questions to which you can't recognize correct answers, expect a few minor mistakes (or your questions are way too easy), and don't have everyone ask just this sort of question.
Sadly, this sounds harder than the questions I usually weed people out with. My first question is a five-line function (in C, which these people have 10+ years experience with). Strangely, few people seem to grasp that when I use a ridiculously easy five-line function to decide if you get a job or not, you should be careful. Often every line is wrong. Recently I've even been giving them more opportunities to correct themselves - "how would you test this?" "[manually printing the result]" "Okay, it spits out gibberish." "I don't understand how that could be." "Well, it spits out gibberish. What do you do to debug it?" "Uhh..."
- jerk twenty-something interviewer ;)