Why eCommerce Sites collapse
Rahul Mehra writes "ZDNet has an interesting article about how eBay and other e-commerce sites collapse under heavy loads. It talks about how massive growth, incomplete planning, rising expectations (24x7 uptimes) and immature technology all contribute." This train of thought, for me at least, leads to a neo-Luddite question - what do you folks think?
The Dell site is one of the busiest and most reliable ecommerce sites around. And it runs NT/IIS, which goes to show that implementation and management are more important than just technology issues.
For me cyberspace has always worked best as an escape from commercialism and not as a place to do business. I really miss the old days in the 1980s when you were not even really allowed to advertise on either CompuServe or Usenet. Right now I'm running Junkbuster on my box and a spam filter on my Hotmail account just to keep it all at bay.
eBay is sort of a cool company though. At least they facilitate the many-to-many personal interaction aspect of commerce. I realize there are technical issues associated with the crash, but to me it's just a case of too much success. They just grew too fast to handle the load.
As an engineer at a major (top 5) ISP I can tell you expectations are the #1 problem with e-commerce. Many people unfamiliar with design, and even with running a business, slap together an ecommerce site. Huge graphics, lots of needless calls to the shopping cart CGI, and general sloppiness contribute to the hassles we deal with.
In a shared environment it's difficult to balance loads to please everyone, especially at a dirt cheap rate. Everyone wants their own box, wants someone to help design, plan and admin it, and they want it cheap. It just can't happen. Many people jumping into ecommerce would fail badly with a traditional brick and mortar store. The Internet and an ISP's expertise can easily cushion their mistakes.
"Why eCommerce Sites collapse?"
Sounds a little like a fox tv special.
Seriously though 24/7 can be done with present day technology. The phone system comes to mind. The question is, are companies willing to follow the discipline that's needed to do so?
World's Deadliest Site Collapses?
World's Funniest Site Collapses?
Fox seems very mindful of the trailer trash demographic.
Obviously a backup system is needed in the event of crashes. But a consideration with mirroring is the delay that occurs because the system is backing up terabytes, with each transaction replicating itself onto the other system. I can just see the bandwidth shrinking. And then let's say that they indeed do have a proper backup solution that works seamlessly. Both identical even down to the wiring. Wouldn't what crashed one crash the other?
A better approach IMHO is prevention. Unbelievable how many companies don't have enough system administrators. Way too many managers and sales reps.
And regarding patches. Anyone ever work with Oracle databases? Just trying to figure out what version you are running is a nightmare. Version numbering schemes like 4.5.2.3.5 just make me want to go bananas.
I don't see this as being a problem with the administrators but with the companies. Too many bigwigs, not enough Joes. Or maybe they should have been using MS-SQL 7.0, they do say that it's what, 3.5 times faster and cheaper. HAH!
-
Why don't they just get it over with? I think TV wants the V-Chip, it'll give them an excuse to "raunch up" their programming with the excuse that viewers can choose to filter it out if they like.
:)
P.S. You do the math.
Whoa, Hemos, you have to explain your thinking on this one. I can't even come close to finding anything with a neo-Luddite feel to it in this article.
> I am hardly an Oracle (or Sun, for that matter) expert, but I thought Oracle used its own filesystem?
Well, I'm not an expert either, but I believe that Oracle is just the program that runs on top of the OS (Sun/Solaris in this case) so it would be using the FS for the OS.
Yup. Erring on the side of the profoundly paranoid helps a lot, as does actually hiring good people. Solid backups, off site and on, VERIFIED, damn it, are just a start. There is RAID, mirroring, backup systems, on-site spare parts (disks, power supplies), redundancy in the equipment, keeping zero-hour tapes (or CDs, these days) and baseline revisions the same way to allow a blast-it-and-rebuild approach should a non-critical patch turn out squirrelly, separating the data and the OS on boxes like you separate the data, metadata, and indexes in databases, mirroring all root drives in the box, having staff on-site 24/7, ISDN at the admins' homes with multiple lines to allow being on more than one line while on the wire to the site on the company-provided workstation, mock-ups of production to test patches out on, and so on.
;)
All makes sense if you want to stay in business. All generally require the big iron pretty early, too. One of the reasons that I like Linux is that it allows most of this for so much lower a cost so that even small companies can really reduce the risks.
Also, most of this basically requires a mainframe, right now, anyway, but that is another issue (especially on this forum!)
Which is why many of us do contract work. If we are going to be treated poorly, we might as well make $80/hr.
Yeah, I know it can enhance performance and all that, but still... this is the sort of low-level kluge that makes hard core techies so leery of Microsoft.
Oracle can be configured to use raw disk space or "cooked" data files. Raw disk space is rarely chosen, but my understanding is that it essentially allows Oracle to write directly to disk, bypassing OS calls. "Cooked" files are just allocated files of a predetermined fixed size that are located on the filesystem like any other text or executable file.
I think you're confused by 8i which has some extensions for treating the database as a fileserver.
Of course, the folks from Sun managed to get it fixed, and it happened to be the patch...
Generally, when the engineers from Sun tell me "Install the patch," I do it. This is not through sheer blindness; if I'm on the phone with Sun, I've obviously used up all my resources, and am looking for two things:
1. A solution to the problem.
2. Someone to blame it on if the solution messes up something else.
Yeaaah...that was like 14 years ago.
Well, on the encryption deal, I was talking about the bandwidth, and I was referring to an incident that I don't think I can refer to directly at this point, where the choice was stay up or go down. Considering how much was lost, I think that turning off the encryption, as much as it makes my skin crawl, was justified.
CPU power wasn't the issue.
On the MBA deal, I think that the fundamental issue is the fact that no one has bothered to figure out the real cost (the "beta") of the risk involved with tech decisions. When they can be shown the consequences without actually having to work for them, the MBAs of the world will make the right decisions. Someone else hasn't done the work for them yet.
And yes, I am just a little tired of cleaning up their messes.
A lot of people have wondered about the colo issues -- considering the security hole, colo couldn't be worse, but of course we are talking about 'frames, and that is a tougher deal (size, juice, support).
That's funny. A few days ago I was talking to an old DBA who was making fun of some of the NT people who were repeating M$ marketing drivel about the capabilities of SQL7 by pointing out that 20 years ago he was keeping 165,000,000 phone records current eight times daily with AT&T. Made me chuckle.
If you can ever get those old mainframe guys talking, you learn some very interesting things. Same with old IBM CEs, old DBAs, and so on. I like learning from other people's experiences -- it is just hard to get the old guys to talk.
These people are ages of '50' or so eh?
In all honesty, I haven't seen many 50 year-old system administrators that know what the hell they are doing. Most of them are just being lazy turds in upper management and just like to keep the title 'system admin', so that they can think they are still in the technical realm of things.
I'm not saying that there aren't 50-year old system admins out there that are good. I'm saying that for you to use age as a qualifier for abilities, you're a moron.
Your system is way too complex if -
8) You don't flinch at paying multithousand dollar yearly support fees.
9) Your system is so unreliable you need a complicated system just to watch it.
10) Consultants were involved. At all.
11) You are swimming in paper. Excuse me. Documents.
Obviously all those people cannot say "IRIX"..
Guys/Gals, ever thought about mainframe webservers? Reliable, proven, eats HUGE volumes of transactions for breakfast and now even takes Java and C++ code... Now where's that Lunix dist for MVS?
whenever you are talking about serious work. You are back to SP clusters and s/390s and S/70s and E10ks and so on.
The Schwab issue is clearer than they are making it out to be -- I have some knowledge of this going back two years, and Schwab has some complete idiots running things, still, even after a series of disasters (some of which didn't make the news). I really can't explain it in any way other than that people who have gotten MBAs seem to only trust other people with MBAs, no matter how poorly they perform. I set up my account there as soon as I could, but after hearing really unpleasant stories for a few years, I finally went to Fidelity.
S/390s aren't unreliable, and Parallel Sysplex stuff works well dynamically, but if a) you are basing your maintenance window and procedures on a salesman's promises to an MBA and b) you aren't keeping the better mainframers because of pay and poor treatment, you will have problems.
Similarly, Cisco routers aren't unreliable. Encryption makes a 30% performance hit. If you are about to be swamped by transactions (if the market is tanking, for instance), then turning the encryption off is a command decision that you "get paid the big bucks for." Not doing so and having systems choke is not a problem with Cisco, any more than undersizing the systems is a technical problem.
I am relatively confident in Schwab -- I would be confident enough to keep my money there if they would take my money as seriously as I do and spend less on MBAs in technical positions and more on technical people in technical positions.
And no, to the best of my knowledge (I have several funds and they have positions in everyone out there), I own no Schwab or Fidelity stock.
At least, that's the way a lot of companies seem to treat sysadmins. A good sysadmin, who keeps systems running smoothly, *appears* to be doing nothing. Why should such a person be paid very much just to run backups and turn a few screws, they'll ask. The sysadmin is viewed as a high-tech janitor, and is given about as much respect. This often results in companies only hiring one sysadmin, or worse, foisting the sysadmin's duties onto other people on a standby basis. So when the fires come, the company is suddenly understaffed. A bad sysadmin, who's always recovering from crashes, restoring backups, rerouting network traffic, looks like a busy employee. If not for him, our machines would all be down, right?, they will say. So what we have here is a scenario that favors either bad sysadmins, overworked sysadmins, or standby sysadmins who are actually full-time employees with other stuff to do and worry about. Welcome to hell.
Does anyone else think that everyone out there *expecting* eBay to be available every second of every day is a bit extreme? I mean look at it this way: you can, most of the time, go on eBay, probably find something close to what you are looking for for about half the price of retail, and even order the damn thing straight to your door within a few days. And you bitch when the service burps?
It's really sickening to hear that people can't get a grip on how far technology has come, and expect it to be way farther than it is.
You should never take life too seriously - You'll never get out of it alive.
One thing the article didn't mention directly, but which plays a very important part in the stability/quality of your infrastructure, is the quality of your sysadmins. It doesn't matter if your boxes are triply redundant hot-swappable never-go-down systems if the sysadmins inadvertently blow away key files periodically. Lots and lots of IT places seem to hire semi-trained monkeys as sysadmins and then wonder why their site is always going down. Look at the chart of outages on the second or third page of that article, notice how often "Failed software upgrade" appears? The problem is that the hardware vendor is usually blamed for those kinds of problems, which draws attention away from the true problem of unqualified sysadmins. Of course most of the Slashdot crowd doesn't fall in that category.
I read the internet for the articles.
>> turning the encryption off is a command
>> decision that you "get paid the
>> big bucks for."
Gee, if I were a hacker, I'd *never ever* wait until a big event (eg. market goes to hell) to start dumping to disk if I had managed to hack into a decent-sized ISP (or worked there and was a pissy sort of person). The prestige for showing an online broker to be vulnerable has to be pretty significant, especially if you moonlight as a "security consultant" or whatever.
Maybe I'm just a wuss, but it seems like
s/get paid/get fined/g;
is a distinct possibility if the ruse is uncovered. (It's also a tacky thing to do)
I suppose that with those sorts of loads, you could make a case for it being statistically infeasible to pull any real information out without a huge amount of disk space to dump the packets onto and a lot of time to pore through them... but people don't change their passwords very often, and you could probably assemble useful information in a reasonable amount of time. And a decent lawyer should have no trouble spooking a jury into overreacting if a trial came to pass.
Either way I submit that the magnitude of the negative publicity that would ensue would make such a decision very hard to justify.
Why not colocate at, say, Above.Net, and rely on their monster pipes for the big loads? It's not like it would cost that much more, and you rely on an extremely high caliber of technical staff to keep things running.
>> Encryption makes a 30% performance hit
In my experience you're off by almost an order of magnitude, in terms of CPU load. If you're only talking about packet throughput, then yeah, the handshake, key exchange, and renegotiation every few minutes adds about 30%. It seems like CPU power is usually the bottleneck in doing SSL transactions on big fat pipes, though.
>> people who have gotten MBAs seem to only trust
>> other people with MBAs
I had the misfortune of working at one of the top business schools in the country for about a year, and this is what I perceived: MBAs without a physical science or engineering background are categorically inept at technical decisions, no matter how much they think they have learned by reading InfoWorld. Negotiation is an MBA's strong point; following through is Someone Else's Job, as best as I could make out. So why don't they recognize that they are likely to make more money (enough to offset the cost) if they hire the best (and most expensive) technical staff? Beats me...
'Cause otherwise you're presenting an opening for someone else to gain publicity as Those Guys That Suck Less (tm) and steal your mindshare and profits. That can't possibly be lost on MBAs. (can it?)
Remember that what's inside of you doesn't matter because nobody can see it.
Doing it right the first time of course isn't that easy, but once it works, don't break it. It must be possible to run a 24x7 site for an entire year while stuck on Gilligan's Island without any way to contact the rest of the world, including the site.
NASA has computers in buildings where at any moment deadly chemicals are around (a few seconds from a small leak and everyone in the building is dead!). Do you think that their IS wants to touch the computers? Not unless they first send everyone else home and empty those tanks. If it weren't so heavy they would probably insist on space suits too.
You decide a year or more in advance how much bandwidth you will get. Then decide how many customers that will support, and you don't allow marketing to sell to any more customers. That's right, you refuse to allow more onto the system. Marketing can deal with this if you make them, and long term satisfaction will go up.
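A rough sketch of that sizing rule, for what it's worth. The link size, per-customer peak figure, headroom percentage, and function names are all made up for illustration; nothing here comes from a real site. Integer arithmetic is used on purpose so the cap is exact.

```python
def max_customers(link_kbps: int, peak_kbps_per_customer: int,
                  headroom_percent: int = 30) -> int:
    """Customer cap for a link, keeping headroom_percent of it in reserve."""
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    if peak_kbps_per_customer <= 0:
        raise ValueError("peak_kbps_per_customer must be positive")
    # Bandwidth we are willing to sell after reserving the safety margin.
    usable_kbps = link_kbps * (100 - headroom_percent) // 100
    return usable_kbps // peak_kbps_per_customer


def can_accept_signup(current_customers: int, cap: int) -> bool:
    """Marketing may sell only while we are below the cap."""
    return current_customers < cap


# Hypothetical numbers: a 45 Mbit/s link, 50 kbit/s peak per customer.
cap = max_customers(45_000, 50, headroom_percent=30)
print(cap, can_accept_signup(cap - 1, cap), can_accept_signup(cap, cap))
```

The point of the second function is the "refuse to allow more onto the system" part: the cap is computed once from the provisioned bandwidth, and signups are checked against it rather than against whatever marketing has promised.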
Once you know how much bandwidth you will have, you make sure you have computers that can deal with it. Mainframes have been doing 24x7 for years. Unix is very close to matching that (with Sun's redundant hot-swappable systems perhaps better, not that Sun is the only choice). I have seen triple redundant systems with a polling mechanism where if one computer gives a different result it is shut off. Guess what: none of this is cheap. That's right, doing business on the internet in volume isn't cheap. Spend the money on systems that will stay up, and enough power that you don't run out, and you will run 24x7. There are plenty of companies that make equipment that is meant for this use.
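The triple-redundant polling idea can be sketched in a few lines. This is only an illustration of majority voting, not how any particular vendor's hardware does it; the node names and the `vote` function are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def vote(results: Dict[str, Hashable]) -> Tuple[Hashable, List[str]]:
    """Return (majority result, nodes that disagreed with it).

    Raises RuntimeError when no result is held by a strict majority,
    because then there is nothing trustworthy to compare against.
    """
    if not results:
        raise RuntimeError("no results to vote on")
    winner, votes = Counter(results.values()).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among redundant nodes")
    dissenters = [node for node, value in results.items() if value != winner]
    return winner, dissenters


# Three hypothetical machines compute the same transaction total.
answer, to_shut_off = vote({"node-a": 42, "node-b": 42, "node-c": 41})
print(answer, to_shut_off)
```

Whatever lands in the dissenters list is the machine you would take out of service and inspect; the remaining two keep answering in the meantime.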
Last, and foremost: hire system administrators that have proven they can keep the systems running 24x7, and pay them to do so. These people are older, in their 50s or so. Hire the best of the experienced, and then give them a deal: you pay them to keep the systems up, whether they are there or not. They should soon find a paycheck arriving every two weeks, with only a few hours a month of work.
Remember, design the system so you can run it from Gilligan's Island (no access by you) without your boss realising, and you will do fine.
Of course reality is that you do have to replace crashed hard drives, but with RAID-6 (RAID-5 plus more redundancy; RAID-6 isn't officially defined) that can be done at any time. You do need to buy more backup tapes once in a while, but automated backups are the norm in 24x7 environments.
This ZDNet article is probably the most fact-filled piece I've read from them.
I tend to agree with the theory that many of these companies still think like start-ups; they act like they don't have any money to spend! Perhaps they're just not aware of where their money is best spent. I can't say I know the start-up web content business mentality to its very ends, but when money is tight you start betting against catastrophe, and hope your odds are good. Duplicate server hardware is expensive for a small shop, but when you have billions of dollars in revenue, and your _entire_ business relies on your information infrastructure, the least you should do is build a duplicate server farm right down to the cables on the power supplies.
Yeah, you'll blow a million dollars on it, and you might not need it, but the maintenance costs are lower than the cost of losing your auction site, on-line trading service, bank, or retail market for five days.
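The trade-off above is just an expected-cost comparison. Here is a back-of-the-envelope sketch; the revenue, outage length, and probability figures are invented purely to show the arithmetic, and the function names are hypothetical.

```python
def expected_outage_loss(revenue_per_day: float, outage_days: float,
                         yearly_probability: float) -> float:
    """Expected yearly loss from one outage scenario."""
    return revenue_per_day * outage_days * yearly_probability


def standby_pays_off(standby_cost_per_year: float, revenue_per_day: float,
                     outage_days: float, yearly_probability: float) -> bool:
    """True when the expected loss exceeds what the duplicate farm costs."""
    loss = expected_outage_loss(revenue_per_day, outage_days, yearly_probability)
    return loss > standby_cost_per_year


# Invented figures: $2M/day revenue, a five-day outage, 1-in-4 chance per year.
print(expected_outage_loss(2_000_000, 5, 0.25))
print(standby_pays_off(1_000_000, 2_000_000, 5, 0.25))
```

For a small shop with a few thousand dollars a day at stake the same function comes out the other way, which is the "betting against catastrophe" case described above. It ignores reputation damage and lost customers, so if anything it understates the case for redundancy.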
You co-locate services at multiple network access points. You use reliable software--the kind you have source code to, so you're not on the phone at midnight with a "knowledge engineer" across the country who is trained in taking bug reports. You need to fix the problem so you hire people who can.
You spread the load at all points (you have multiple web servers, multiple database servers, multiple administration access points, redundant networking hardware), and you always have ample staff around for that 4:00 AM breakage.
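As a minimal sketch of spreading load across several web servers, here is a round-robin picker that skips boxes marked as down. The server names and the `RoundRobinPool` class are hypothetical; a real front end would also run health checks to decide what is up.

```python
from itertools import cycle
from typing import Iterable, List, Set


class RoundRobinPool:
    """Hand out servers in rotation, skipping any marked as down."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._servers: List[str] = list(servers)
        self._healthy: Set[str] = set(self._servers)
        self._ring = cycle(self._servers)

    def mark_down(self, server: str) -> None:
        self._healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self._servers:
            self._healthy.add(server)

    def next_server(self) -> str:
        # Look at each server at most once per call so we never spin forever.
        for _ in range(len(self._servers)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers left")


pool = RoundRobinPool(["web1", "web2", "web3"])
pool.mark_down("web2")  # pretend web2 broke at 4:00 AM
print([pool.next_server() for _ in range(4)])
```

With one of three servers down the site keeps answering on the other two, which is the whole argument for not having a single box at any layer.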
Use it. Put Check Point FireWall-1 in there, too.
Using age as a disqualifier?
Someone who's 50 has a chance to have 30+ years experience in the field. Let's see you, hot shot 25 year old, have 30 years experience.
--
Ben Kosse
Remember Ed Curry!
That first link should have read 'eBay problems probably preventable'.
The first link basically says that the eBay guys weren't paranoid enough about making sure the setup was reliable. This is always a problem. (hey, I'm working on a commercial web site that only got a proper sys-admin 2 years after it started...). Little side-note - one guy says Sun's clustering stuff is not that great... I know Sun have been a bit late in starting doing clustering stuff, but I've also heard that what they have done is pretty good, *shrug*. Actually, they just announced version 3 last week, which also allows clustering of 16 Starfires, for 1024 processors. (they're also making the source code for this available...)
I remember seeing a blurb recently on Microsoft's site slamming Sun for causing the problems at eBay. According to MS, the Sun server failed, causing the outage, while the NT front-end servers were golden. Lots of factors were cited, including the E10K's sensitivity to config changes, reliance on a smaller domain server, and other factors.
:-)
Now we learn that the problem was caused by eBay, and eBay alone, by not keeping up on their vendor patches, and that Sun had fixed this particular bug quite some time earlier.
It would seem that MS needs to print a retraction. Any bets on when we'll see it?
Agreed; it's always more interesting when you can hear the voice of an author or an editor, rather than the bland predigested output of a committee. The Cluetrain manifesto is a good argument that companies shouldn't be homogeneous and faceless on the Web.
Looks like the definition of load balancing to me...
Look at AOL. They introduced their flat rate scheme, had constant service problems, infuriated all their customers -- and now they're by far the dominant ISP. The goal here is market share. It doesn't matter how much your customers bitch, how much other people hate you, how many "Why XXX Sucks" pages there are about you. The only thing that matters is how many bodies you can claim.
Providing quality service probably means that you're doing something wrong, just like making a profit does.
What I'm listening to now on Pandora...
Thinking that over for another minute, that's much more true for cases like AOL where the system was inadequate for the new load than here where they had plenty of hardware but didn't maintain it properly...
Oracle uses files or logical volumes, which are basically glorified disk partitions. My experience is on HP-UX, but what generally happens is that root creates logical volumes for oracle which are accessed via /dev in the root filesystem. Once the LV's are created and opened, nothing should be able to read/write blocks in them except Oracle, under oracle's own user id. It's a basic device locking process.
Apparently Solaris screwed up this arrangement and wrote some blocks in Oracle's space. It's odd that Oracle was then able to crash the OS - the only reason I can think of is that Solaris put something really critical in those blocks, and Oracle overwrote them for some reason while it was aborting.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
I am hardly an Oracle (or Sun, for that matter) expert, but I thought Oracle used its own filesystem?
Also, note that Microsoft's view on the matter is nowhere near the actual cause of the problem. It's as if Microsoft was keeping tabs on this Oracle/Sun combo and decided to come forward with their "competitive analysis" when the time was right. Looks like they had some "Halloween" documents on Oracle/Sun too... ;-)
-------
Warning: Slashdot may contain traces of nuts.
Every job I've ever gone into was a mess. When I first took the job I have now I was pulling my hair out, this place was so bad. I mean everything here was wrong; the users' habits, the way they did things, the software they used, everything was just fucked. Now, a year later, things have started to calm down and I find myself with a LOT of free time to implement back burner projects.
... back to the books ...
A sysadmin should never look idle. There are always things to do, things to improve.
The reason I stayed was that I have so much control over things, and people will listen to me.
And as clueless as my users are, I still like most of them.
Unfortunately the slashdot/linux today conspiracy - lately - is really hurting my productivity. Urgh
support gun control: take guns from cops
I totally agree! I've been preaching this for a couple of years and am implementing a very cool, full-featured middleware solution now.
I read a description of what eBay had a few months ago and was shocked at the predictable crash they were heading toward.
The thing is you can't easily patch a monolithic system to run on loose clusters with replication and redundancy. It will appear much more attractive to continue down the monolithic road and add hot-spares.
Few people seem to get what it takes to build truly scalable and reliable systems.
sdw
Stephen D. Williams
But seriously... Planning planning planning. Don't run everything on one box, keep backups, have backup plans, etc, etc. If you don't do these things your site is bound to have problems.
This sig is false.
Code has become bloated... I remember when I was in development, we had to fit our software on a low density floppy or two, since most of our users would not have HD floppies (Europe was a major factor in this decision) and more than two floppies would raise the Cost of Goods.
Appears to me that a lot of programmers, webmasters and networking people have forgotten how to optimise their crap.
I remember a LARGE bank in Malaysia running their servers on DOS(!) doing transactions at the rate of a couple of thousand a day. Where have we lost our ability to optimise code, data and our thoughts?
This is all very interesting to read about. One would think that the internet is generating "unheard amounts" of load on various systems for the first time. Mainframes (IBM, Unisys, Amdahl) have taken much more than this in terms of load or transactions per second. The problem that I see is that people tend to abandon architectures that have worked in the past for new cool things that vendors shove down their throats. CICS, or for that matter virtually any transaction-intensive database on a proper mainframe, could handle that load (some of my customers are doing 20-30K transactions per second and they are in no way "big" users).
At times, the whole internet revolution reminds me of the "client server" phase that the industry went through. Ziff Davis was one of the proponents of this phase (well, they had to sell them damn magazines, didn't they?), often claiming that a Novell file server would be damaging to companies like IBM. Well, perhaps it is time to step back and examine how some of the legacy systems have worked (heck... imagine your bank telling you that their systems got overloaded on pay day!?!). Then let's see how we can adapt them to the Internet. IBM is doing an awesome job on this and so is HP. I strongly believe that the systems we're seeing today are "prototypes" doing proofs of concept, waiting on the big iron boxes to become internet enabled.
One more point. Most of the classic "brick and mortar" businesses, people who know their technology, customers, systems... are NOT internet or e-business enabled. Let's drop a few names of the Dow Jones components: Ford, GM, GE, Dow, Coke etc. do more business than the e-business startups and probably process more transactions per day on their mainframes. I'd be more concerned about what happens when they start up their internet "storefronts"... Ok... just a few random thoughts before I head into work...
All the touchy-feely bullshit on the web. Countless self-absorbed homepages, insipid rantings and more. Electronic Navel Gazing I'd say. Denis Leary was hilarious, especially that one advert with the kid crying about keeping the net free, and Leary pops up to ask how his mom and dad paid for the computer he was using...
Blar.
As it stands now, eBay's auctions are so time-critical that they're in the same league as online brokerages. And speaking of brokerages...
Fidelity is running TV ads (plastered all over Pirates of Silicon Valley last night) touting the speed of their systems and how seconds count, with a quick disclaimer at the end of the ad that response time depends on network conditions. This is a pet peeve of mine: ads with disclaimers which make the rest of the ad meaningless. Example: "99c Big Macs! That's right, 99 cents! Only 99 cents! Prices may vary." But the point is that they're promoting the idea that the internet is suitable for real-time transactions, even though they recognize that it isn't quite there.
You should invest in MBA Technologies; they're a good company and you obviously love MBAs...
-- your knees hurt, don't they?
...and you don't allow marketing to sell to any more customers. That's right, you refuse to allow more onto the system. Marketing can deal with this if you make them,...
This reminds me of the IBM TV commercial where Bob is at an AA-like meeting..."No one here is stupid"... Then Bob tells them that he forgot to tell his staff to ramp up the website for more hits because of their new PR. Then they all turn on Bob... "That WAS stupid, Bob."
That was funny. You got me thinking about other Fox computer-related specials:
America's Funniest Core Dumps
When Spammers Attack
I Married A SysAdmin
Real Life Reboots
Totally Shocking Backups -- Caught On Tape
/* Alright -- quit yer groanin' */
Save the whales. Feed the hungry. Free the mallocs.
'Prior Proper Planning Prevents Piss-Poor Performance'.
If we can teach this to grunts, -why- do those who are allegedly more intelligent fail -repeatedly- to learn it?
Ah, well. I recall when Comdisco failed in the attempt they made to show Shwlob what was about to happen in a simple email system, too.
Cheers,
Drieux
...the easy way is -always- mined...
We plan, backup, build redundant systems, isolate production from testing and implementation, and still every now and then something happens that makes you realize how young all this technology really is, and that bottlenecks still exist.
I am just coming off a twenty hour day repairing problems in a production system. Both members of a cluster affected (by the clustering software itself of course). In the end we end up hacking out the best fix available on the fly.
Dependability is expensive, and that expense is often hard to justify to economy minded business people. Add to that the fact that even the most secure, stable, and isolated system will eventually break and it is a recipe for some very long days for those of us who answer the pages when it all falls down.
Good thing I enjoy this kind of work. Now it's off to a nap, then back to the office to listen to a vendor tell me his next release will address the trouble, and to explain to a few business folks that simply stating a system will be up 24/7 doesn't make it so.
OTOH, I'm not fazed by ebay's problem. I'll never hand over my CC to Amazon, and half the stuff I look for is uniquely weird - the only kind of things you can get on ebay. The crabbers are probably newbies - geez, live with it, it happens, you know? If they ever used Lynx they'd realize how far browsing on the net has come!
I remember seeing ads for Lotus that went like "the net is screaming for capitalists" or something... I suppose what hurt is that one ad had the line "short stories that nobody reads". That seemed awfully crass. The net thrives on its humanity. Take out the humanity and what have you got? The soulless machine that writers have been warning us about for ages (just reread Fahrenheit 451 - so hard to believe it was written in the 50s!)
It hurts me that everything is done for eyeballs and money. Newbies will never realize how great it was to surf. They aren't wary of schemes that focus on them as a psychographic and a target market. They think, "Oh how nice, they want to give me free webspace". They don't think, "Gee, I don't like having my page cluttered with ads that I don't agree with."
eBay is pretty cool. I snagged a lot of good books dirt cheap there. But it is getting harder. Before, I could make a bid a few days ahead and still win. Now I have to wait until the last few seconds to swoop down on everyone else.
Having worked in a number of environments (small start ups, educational and corporate) there are some things I have noticed that people do not seem to realize:
1) the technology is a lot more fragile than the marketroids will have you believe and the engineers want to believe.
2) Complexity seems to increase as 2 to the nth power, where n is the number of components. E.g., 2 servers have 4 ways they can interact, while 3 would have 8 ways to interact (not counting the fact that there may be many software interdependencies on and between machines).
3) Planning is key, and the central tenet should always be KISS (keep it simple and stupid -or- keep it simple, stupid!).
4) The larger the environment, the more it has a life of its own.
5) The larger the environment the more crucial communication becomes.
6) #5 above can lead to information overload.
7) There is no substitute for an intelligent, well trained, well led staff. All the certification programs and fancy admin tools cannot substitute for that.....
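Point 2 in the list above is easy to check with a few lines. One way to read the 2-to-the-n figure (this reading is an assumption, the list doesn't spell it out) is as the number of subsets of machines that could be involved in any given interaction; the pairwise link count is shown next to it for comparison. Both function names are made up.

```python
def interaction_subsets(n: int) -> int:
    """Number of subsets of n components, i.e. 2 to the nth power."""
    return 2 ** n


def pairwise_links(n: int) -> int:
    """Number of distinct machine-to-machine pairs among n components."""
    return n * (n - 1) // 2


for n in (2, 3, 10, 20):
    print(n, interaction_subsets(n), pairwise_links(n))
```

Either way you count, the number grows much faster than the box count does, which is the argument for KISS in point 3.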
My $.02
putting the 'B' in LGBTQ+
Ok, so eBay goes down and everyone gets all up in arms about it because this stuff is not 100% reliable.
Hey, we knew that. Even the best systems out there are expected to be down a few minutes a year, and most of 'em (including those "super-reliable" Suns) are on the order of a couple of *days* a year. Throw a relational database into the equation and, well, reliability ain't so hot.
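To put numbers on "a few minutes a year" versus "a couple of days a year", here is the conversion between availability percentage and yearly downtime. It is plain arithmetic on a 365-day year; the function names are just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year implied by an availability figure."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


def availability_percent(downtime_minutes: float) -> float:
    """Availability implied by a given amount of yearly downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


print(round(downtime_minutes_per_year(99.999), 2))  # "five nines"
print(round(availability_percent(2 * 24 * 60), 3))  # two days down per year
```

Five nines works out to a bit over five minutes a year, while two days of downtime is roughly 99.45% availability, so the gap between the two kinds of system is about two and a half nines.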
There are ways to deal with that, and eBay didn't do ANY of them.
At a minimum they should have had a hot backup available, PARTICULARLY for the single point of failure -- the database. With a hot backup they could have been back online in a matter of a couple of minutes. It was insane to bet their business on a single Sun/Oracle box! Whoever made that decision should be out on the street.
But they can do a lot better than that with a little middleware infrastructure. There's no reason they can't replicate transactions to multiple databases -- or even split their databases up so they have lots of little ones handling part of the load rather than One Big Server.
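A toy sketch of both ideas: split the data across several small databases by key, and write every transaction to more than one copy. Plain dicts stand in for the databases, and the class, key format, and replica counts are all hypothetical; this is not a description of what eBay runs.

```python
import hashlib
from typing import Any, Dict, FrozenSet, List, Tuple


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Each shard holds several replicas; plain dicts stand in for databases."""

    def __init__(self, num_shards: int, replicas_per_shard: int = 2) -> None:
        self.shards: List[List[Dict[str, Any]]] = [
            [{} for _ in range(replicas_per_shard)] for _ in range(num_shards)
        ]

    def put(self, key: str, value: Any) -> None:
        # Replicate the write to every copy of the owning shard.
        for replica in self.shards[shard_for(key, len(self.shards))]:
            replica[key] = value

    def get(self, key: str,
            failed: FrozenSet[Tuple[int, int]] = frozenset()) -> Any:
        # Read from the first replica that is not marked as failed.
        shard_id = shard_for(key, len(self.shards))
        for index, replica in enumerate(self.shards[shard_id]):
            if (shard_id, index) not in failed:
                return replica[key]
        raise RuntimeError("every replica of shard %d is down" % shard_id)


store = ShardedStore(num_shards=4, replicas_per_shard=2)
store.put("auction:1001", {"high_bid": 25})
sid = shard_for("auction:1001", 4)
print(store.get("auction:1001", failed=frozenset({(sid, 0)})))
```

Losing one replica costs nothing visible, and losing a whole shard takes out only the keys that live on it instead of the entire site. The hard parts a real system has to solve, such as keeping replicas consistent and moving data when shards are added, are left out here.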
Of course that will take some technology that is a bit beyond the duct-tape-and-baling-wire stuff they're using. It's not rocket science but it's gonna be a bitch to do with CGI.
What it all comes down to is that they bet on an infrastructure design that had a single point of failure and were screwed when it failed. That could have been -- SHOULD have been -- foreseen and protected against.
I could maybe see that being OK in a startup that didn't have the cash for duplicate hardware outlay, but eBay has the cash in spades and they STILL didn't do it. There's a certain level of stupidity at work here.
jim frost
jimf@frostbytes.com
I don't see why people are so surprised with this. The Internet still is not regarded as "acceptable" for real-time work, so why are people so affected by system faults?
"Planning planning planning."
Planning is the key. On personal systems, there are UPS devices, floppy drives, RAID configurations (well maybe not so often on a PC), Zip drives, CDRs... all sorts of mediums to circumvent the loss of data. Just because a system is online or owned by a large group is no reason to assume it is secure.
-Clump