'Why You Should Not Use Google Cloud' (medium.com)
A user on Medium named "Punch a Server" says you should not use Google Cloud due to the "'no-warnings-given, abrupt way' they pull the plug on your entire system if they (or the machines) believe something is wrong." The user has a project running in production on Google Cloud (GCP) that is used to monitor hundreds of wind turbines and scores of solar plants scattered across 8 countries. When their project goes down, money is lost. An anonymous Slashdot reader shares the report: Early today morning (June 28, 2018) I receive an alert from Uptime Robot telling me my entire site is down. I receive a barrage of emails from Google saying there is some "potential suspicious activity" and all my systems have been turned off. EVERYTHING IS OFF. THE MACHINE HAS PULLED THE PLUG WITH NO WARNING. The site is down, app engine, databases are unreachable, multiple Firebases say I've been downgraded and therefore exceeded limits.
Customer service chat is off. There's no phone to call. I have an email asking me to fill in a form and upload a picture of the credit card and a government issued photo id of the card holder. Great, let's wake up the CFO who happens to be the card holder. What if the card holder is on leave and is unreachable for three days? We would have lost everything -- years of work -- millions of dollars in lost revenue. I fill in the form with the details and thankfully within 20 minutes all the services started coming alive. The first time this happened, we were down for a few hours. In all we lost everything for about an hour. An automated email arrives apologizing for "inconvenience" caused. Unfortunately The Machine has no understanding of the "quantum of inconvenience" caused.
If millions of dollars are on the line, you should be running your own systems. Seriously. I'm not an IT expert, data infrastructure guy or anything. I'm just a dumb nerd, and I know that. Never trust your data to a third party when millions are at stake -- let alone critical infrastructure reliability.
Beware of the Leopard.
Why was there a second time?
Why you shouldn't use the cloud period
Over 90 percent of Google income is adverts. You would be absolutely insane to trust them with your business or educational institution data.
I'm not saying MS or Amazon is great but at least their revenue model is not based exclusively or largely on data mining of users.
This happened to me. They had some kind of p2p malware going on at the data center, they saw that one of my servers used a p2p service (cryptocurrency) and they literally banned my entire project, causing all servers in all regions to go offline. It took them DAYS to get everything back online with only a "sorry for the inconvenience" email. They cost me money and trust with my users. I had lots of redundancy, just never expected my project to get shut down.
I still use them, but now I spread my services across other cloud providers as well.
If someone else owns your infrastructure, then they but need to flip the switch and your infrastructure vanishes in a puff of, well, cloud.
This is the essence of "cloud". This is the future, everyone tells us.
That's impossible, AI wouldn't let this happen. /s
You need to design the systems such that they have a fall-back and can continue to operate without an internet connection.
Really ...
What are you going to do the next time a major blackout occurs, and the grid wants you to restart your turbines?
Our company tried to use Amazon a few years ago and ran into the same issues. Although Google and Amazon allow you to spin up a single instance, they are really designed for companies that have hundreds if not thousands of servers. Amazon assumes that you have dozens of fault tolerant servers and if one goes down you just replace it with another one. This works great for companies like Netflix but Amazon is a disaster for a company that isn't fully fault tolerant and has critical servers that can't go down. Liquidweb, Rackspace, Linode, and even Digitalocean are more reliable when it comes to wanting to keep a single server up and running with minimal downtime. Now if you need to keep thousands of servers up and don't care if any one server goes down then Amazon works fine.
If an extended system outage can cause "millions of dollars in lost revenue" then you should have a DR plan. Don't put all your eggs in one basket. Have copies of everything at another site (EC2, Azure, Colo, etc) that you can turn on and switch to in this event. If millions of dollars are on the line, then it shouldn't be unreasonable to have such a plan and infrastructure established.
YouTube users, GMail users, etc. have all complained about similar issues with blackbox, zero accountability. One click, boom, you're done.
IANAL, but this is my theory...
We know that Google is controlled by some highly political people. People who want to be able to disconnect you, deplatform you, etc. at the drop of a dime. The more they make their services a customer service blackbox, the easier it is to get away with acting in bad faith.
By bad faith I mean specifically in contractual bad faith. All of the XKCD-citing hipsters miss a very important nuance of the law regarding "deplatforming assholes:" contracts are judged by the "good faith" conduct of both parties and evaluated by reasonable behavior standards.
They do things like tie your account to all of the services, including purchases, and after a few vague "bad behavior incidents" nuke it. Often taking real assets with them because of how those accounts are tied. I don't think, for instance, Microsoft would fare well if they cost someone $2k of XBox Live marketplace purchases because they cussed out a few butthurt players a few times (Microsoft claims it has the authority to do this). Google is the same way on a larger scale.
The more people that are involved, the more people who can be hauled into court, forced to testify, etc. You can demand they answer why they thought a reasonable person would act that way. You can point to flesh and blood people who are the focal point for a real user suffering real economic harm due to one or a few people's biases.
And then win damages.
IMO that is why you see these companies aggressively moving in this direction. It's about not facing as much accountability for acting like dicks.
Seriously. When someone else owns and operates your infrastructure, things like this are going to happen. When that someone earns their revenue from something other than the bill from them you pay every month, it's going to happen a lot more often because they'll be acting based on what's good for their business, not what's good for yours. This is life on any cloud platform. This was life with mainframe service bureaus back when they were the cloud platform of choice.
You have to make a call based on what the trade-offs are. Make sure you know what those trade-offs are going to be, bearing in mind that any contract you have is probably going to say the provider's only responsible for refunding your month's payment no matter what the cost to you of their mistake was. It's that that you're balancing against the cost of running your own hardware, not the monthly bill.
When you entrust your business to an outside cloud service you are entrusting people, organizations, policies, and procedures that you don't and usually can't know with the keys to the success of your business. They can be very useful and cost effective in situations but I would never trust an outside organization for mission critical services.
Turned off their automated systems, and who ever caused the flag to be raised with your Google Payments Account gets in and takes over your entire system, maxes out your CFO's credit card?
Put all your infrastructure under the physical control of some other entity well beyond your reach and then discover they can summarily turn it off and refuse to respond - Duh!
The Cloud Strikes Back!
Never trust your data to a third party when millions are at stake -- let alone critical infrastructure reliability.
While that is reasonable advice, sometimes that isn't an option. Sometimes the only reasonable way to do things is through a third party. Furthermore sometimes the third parties can do a better job than I could do myself, even accounting for their flaws.
Are you saying he should have used Google Wind instead?
#DeleteFacebook
You should not have ANY one single point of failure.
Only 1 card holder? Single point of failure.
More importantly: Only 1 cloud provider? Single point of failure.
If you're running that level of cash, and still insist on outsourcing infrastructure, then fucking distribute it. Mirror the infrastructure between AWS, GCloud, and Azure. Even these companies themselves know this. Look up Amazon's DNS providers. Hint: it's not JUST AWS; they go outside their own shit too *JUST IN CASE* their servers go offline.
I have an email asking me to fill in a form and upload a picture of the credit card and a government issued photo id of the card holder. Great, let's wake up the CFO who happens to be the card holder. What if the card holder is on leave and is unreachable for three days?
Uh, I don't know - take a picture of each and save them on your phone, in case you need them?
You report everything was back up within 20 minutes once you submitted the requested information - that seems pretty good to me.
Now, about your decision to only run one instance of your mission critical application suite on exactly one cloud service...
The story here is you consider it someone else's fault for your failure to plan/prepare for an outage.
Ken
This is not a problem with Google Cloud, this is a problem with all "cloud" platforms. It's really simple, they can be held liable so they put acquit ass-covering in the contract so that they can shut you down on a whim. If this doesn't work for you then you should not use any "cloud" platform.
This is just an example of reality catching up to all the idiots who said "put it in the cloud!" while ignoring all the risks. Play with fire and you'll eventually get burned.
Anons need not reply. Questions end with a question mark.
Maybe I just didn't read enough, but it seems like he doesn't say anywhere exactly what happened. He implies it was a billing issue. That's all. Without knowing exactly what went on, it's very hard to care. I imagine it's something like "well the credit card details changed, oh, and we were 107 days overdue."
Also, millions of dollars are on the line for short downtime and you're billing to a credit card?
I have an email asking me to fill in a form and upload a picture of the credit card and a government issued photo id of the card holder. Great, let's wake up the CFO who happens to be the card holder. What if the card holder is on leave and is unreachable for three days? We would have lost everything -- years of work -- millions of dollars in lost revenue.
Somewhere in Russia, India and Nigeria, several callcenters full of scammers came all at once.
-=This sig has nothing to do with my comment. Move along now=-
This is why, when I design a cloud service, I design for redundancy at the business level too, not just at the host level. I deploy to and run on multiple different cloud vendors at the same time. This way, all my eggs aren't in one basket if one has a billing hiccup or one goes out of service.
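A rough sketch of the routing side of that idea, assuming each vendor exposes some health probe you can call. The provider names and the `is_healthy` callback are hypothetical placeholders, not any vendor's real API, and this only picks where to send traffic; it does not deploy anything.

```python
from typing import Callable, Optional, Sequence

def choose_active_provider(providers: Sequence[str],
                           is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first provider, in priority order, whose health probe passes."""
    for name in providers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A probe that raises is treated the same as an unhealthy provider.
            continue
    return None  # every vendor is down or suspended; time to page a human
```

If the first vendor suspends the account, its probe fails and traffic goes to the next name in the list.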
Good thing I don't work for Amazon. Cause I just found some synergy. Amazon video would be about to get much better, at zero cost to them.
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
I hate to blame the messenger, but either you didn't buy the right service level agreement or Google broke the contract.
If it's the first case, blame yourself and learn a lesson. You get what you pay for. If Google doesn't offer the level of service you need, go elsewhere. If they do, either pay up or go elsewhere.
In the second case, you are rightfully upset but you should be talking to lawyers before talking to Slashdot.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Why was there a second time?
So many of the problems here (ex. paying with a credit card and one that has only a single person's name on it? Having no fallback that can be spun up elsewhere?) are foolish if this has never happened before, and utterly, mind-bogglingly idiotic if this in fact has already happened before. It's one thing to be blind to something you should know could be a problem, it's quite another to be blind and wholly unprepared for a problem you've personally experienced! Something seems fundamentally wrong at this company.
Also, if your entire business can die because it takes an unexpected few days off, then perhaps your business is running a bit too raggedly and doesn't have enough meat on the bones . . .
I remember sigs. Oh, a simpler time!
The company thought they could get away with paying less for server infrastructure. They can. But they get less. This is one of the "less" things they get.
If you value your data, host it yourself, preferably in multiple locations. If you want to go cheap, then you can expect to lose things.
Like your data, or access to it, or availability of it.
It's not such a smart thing to cheap out on the important stuff.
Of course, convincing the bean counters of future risk inherent in what appears to them to be current savings... good luck with that.
Well, best to get rid of your bean counters. :)
Here's a maxim of mine I like to drop on the table during discussions like these:
I've fallen off your lawn, and I can't get up.
Sharecroppers, company stores, vassals, etc... all have digital counterparts these days. Instead of a single entity though, it is spread out among several corporate entities and perpetuated by all levels of government. It's only going to get worse unless/until the people revolt. Problem is most of them don't even realize it. Quite clever way of creating highly productive slaves who think they are free.
As a consultant, a typical war I get into with customers is picking the cloud setup: email, drive, etc. I always recommend Microsoft instead of Google products, and I always have to remind people that, despite the strong brand name, Google is an advertising company, not an enterprise partner. I have countless stories of Google pulling the plug on services because of "reasons". Also, they have no respect for the customer when they drop a product: they just send an email with a month's notice (and that is a good one), and then they pull the plug. Never ever use Google products for enterprise, ever. True story!
Who do you work for? I'm divesting. Hell, as a 6 man startup in the 90's, we knew better than to have only one server farm in one colo. Granted, our failover was to the developmental farm on a T-1 in our office, but it was at least *some* failover. Oh, and the colo texted when there was a problem, real or imaginary.
That was 6 drunk amateurs 2 decades ago.
...(as mentioned in other comments):
1) Don't trust another company with your critical IT infrastructure!
2) Have redundant facilities with different ISPs.
3) Have tested backup/standby power systems.
Yes, it is expensive, but - how much would it cost you to be down a week? A month? There is no free ride.
Running a large project on the cloud makes a lot of sense, for a number of reasons. However, every cloud provider is vulnerable to down time... just for starters. Add to that, he had already been pulled offline by GCP once. Why is this not spread across multiple providers? This is 2018, the headline "Single Point of Failure Can Lead to Downtime" is not news.
This company sucks at running a business.
cloud services as they are marketed by cloud providers. Suicidal for a business if you ask me.
Basically you are paying another party to have control of your business, being responsible to keep your business up and running. You are putting your business, the source of your livelihood and livelihood of all the company's employees on systems you have NO CONTROL OVER. Basically putting your business in the hands of other people, who quite frankly do not give a damn about your business other than it pays its bill every month. Have any kind of problem, and you are out of business until the problem is rectified. As a software engineer, I advise clients against relying upon the cloud for critical business functions. It makes no sense to put applications, or code bases on devices you have NO CONTROL OVER!!!!
That being said, I do see cloud services being used as part of a business's disaster preparedness plan...as a backup system...but nothing more. But giving a cloud provider full control over your business? Plain stupid if you ask me.
You should think about divesting from just-google. The cloud is costing you more already, get yourself a number of real servers with real hosting providers dispersed geographically. Running something solely on Google or Amazon clouds is technically identical to hosting everything on a single server.
Custom electronics and digital signage for your business: www.evcircuits.com
Bring up the SLA and at LEAST request outage credits....
Throughout all of Google's platforms you can expect this kind of treatment. JUST SAY NO TO GOOGLE.
Fortunately, they will only be a threat 30% of the time (40% if they are offshore). If you're smarter than, for example, Google, you can probably develop a strategy to disable them while they are sleeping.
You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
> Great, let's wake up the CFO who happens to be the card holder. What if the card holder is on leave and is
> unreachable for three days?
Unreachable for three days? What if the CFO is dead? Oops!
In some circles, this is called "the campus bus problem"... The guy who knows everything walks out in front of the campus bus. Now what.
This also used to be seen on University computer systems... Some grad student has written processes in use all over the place, running from the home directory... They graduate/move on. The account is deleted; chaos ensues.
Universities learned and moved on.
If you're anything but a whiny kid, you own your mistake, learn from it and move on. Don't blame someone else. The fact they're whining in the quasi pay walled "The Medium" says even more (I for one am fairly tired of getting nagged by them).
Remember when we complained about foreign tech. support? Well welcome to 2018, where there is none, even if you're paying. This "disconnect" seems to be how life goes in this so-called Information Age. Everything is SO optimized for profit today, that getting assistance for anything is getting closer and closer to being non-existent. We were sucked into it when we were given free web browsers, etc., that came with no user's manual. I understood, since the stuff was (beer) free. Fine. But these days, I run into trouble and get no help, even if I'm paying. So the digital wall of fine print that the OP ran into doesn't surprise me at all.
s3 storage is *massively* expensive at scale, compared to in house. Even among cloud providers, there are competitors that are 75% less.
XML is like violence. If it doesn't solve the problem, use more.
Had an identity I used on MS for support and forums for over 15 years. Tried to log in about a month ago and was told it had been temporarily suspended due to violations of their TOS. I tried to find out what happened but they refused to tell me anything. They told me the only way I could get my account back was to have them send a text code to my phone. Only phone I have that accepts text is my google number which MS WON'T ACCEPT. They don't have a system like amazon or google where they can call you @ a home number to have a robot announce a code over the phone. They demand your text#. If you don't have one, you don't get your account back.
Tried following up with their support -- twice -- both times was told I violated something in their TOS, and could I review to see what it might have been (WAY too vague), and to use their text-msg system to recover my access (which I'd told both of the service reps didn't work -- and MS wouldn't take my voice number).
Completely lame.
One, note that just because they are a big name, it does not mean all their decisions are bullet proof guaranteed the best. Dropbox has the exact opposite story to tell.
For another, Netflix has a rather special position. They are *the* go-to reference customer for AWS. Amazon with almost every other breath references just how *awesome* Netflix is doing with AWS. As such, they assuredly have special status, Amazon is not going to just screw with Netflix because the second Netflix so much as whispers a negative AWS experience, their biggest reference customer has gone bad. If there is *any* company on the face of the Earth that can get away with single-sourcing from a cloud vendor, it's Netflix.
For the 98% of customers who are not highly prized marquee customers.. Well your experience will deviate from stories about Netflix.
XML is like violence. If it doesn't solve the problem, use more.
As a cloud migration consultant, I see a lot of companies going dual cloud with a sort of DR model in a second cloud provider to avoid that kind of scenario. Yes, it creates quite an overhead but it could be worth it.
I keep stuff on my laptop, pc, cloud, secure HDD's, safe deposit box. On a business scale, you should be keeping stuff backed up across multiple platforms.
Mission-critical functions should be kept in-house. Never farm out anything that can kill your business if your vendor fails to do their job.
-jcr
The only title of honor that a tyrant can grant is "Enemy of the State."
"We apologize for the inconvenience." bullcrap. I had the same exchange with the morons at Sears appliance service. They kept transferring me to god-awful call centers in Bangalore or some other place where their accents are so thick that talking to someone in the Deep South would be easier. After talking to six different people, I realized that they are all working off the same set of rules: Ask for the same damn information over and over, Try to sympathize, Pretend your computer is frozen, Blow off the customer.
These bastards had the balls to try to sell me a whole-home warranty. Why the eff would I buy that when I can't get you bastards to come fix my dishwasher? No wonder your company is going down in flames and good effing riddance.
And your belief matters how?
Has anyone ever talked to anyone @ Google? In all the years since I first heard "Google", I have never been able to chat with a live person on anything ever.
;)
To be fair, I have chatted with the
Azure/365 folks (took time, many calls (2 months), but did get "their" Information Protection/Crypto issues worked out, once they stopped pointing the finger at me),
Amazon (not so much, selling mostly, interface & whole experience sucks), AWS, never pulled the trigger, but did get through (pricing is mind numbing and complex),
other smaller data centers, GoDaddy (Good/Bad), (their interface just keeps getting worse), others, etc pretty good.
Just my 2 cents
Or is it just a question of how large your account is?
We're just a mid-sized MSP, but there's always a way to get someone on the phone, 24x7. The on-call number customers call is the security-service that does our physical security, they forward the call to the on-call engineer (after the customer is verified using a "password"). The intermediate step is to ensure people don't call the on-call engineer for sysadmin-tasks that could be done during business hours.
Every customer can get the on-call number, provided they cough-up the money. For most, it's not worth it because servers are quite stable these days.
The problem is of course that Google is so big and has so many customers that it's not possible to "know" every single customer anymore.
Windows 2000 - from the guys who brought us edlin
Remember folks:
The simple definition of " Cloud " is infrastructure you neither own nor control.
By offloading this responsibility to a third party ( Google in this case ) you simply add an additional point
of failure in the chain.
With any substantial amount of money on the line, the better way to do things is to have your own servers
( preferably two locations, one primary and one backup ) so if one site goes down, it's more of an annoyance
than a Class A Catastrophe.
Most companies, however, have to get burned before they understand that there is a limit to the number of
corners you can cut.
In a word: arrogance. We see this way too much with Google, the new evil.
When all you have is a hammer, every problem starts to look like a thumb.
... cheapest service they offer, the one that doesn't include 24/7 phone support - let alone a guaranteed SLA, to host your multi-million dollar wind/solar plant, where any service outage will cost you millions in service penalties.
I may be one of the "old timers" who I'm told is thinking about things in an "old school" way when I say this. But I've *always* warned people that "The Cloud" just means you're giving somebody else the responsibility of handling your data and the systems it runs on.
That makes sense sometimes. I'm not "anti cloud". But for anything really critically important to a business, I feel you should have it running locally and THEN consider cloud options as hot-failover sites, backup sites, etc. With cloud hosting, the whole thing is off limits to you as soon as your Internet circuit goes down, for one thing. With it running locally, you can still use it just fine anywhere on your LAN.
But additionally, if the provider hosting your stuff goes bankrupt or merges with someone else, or just plain decides it's not profitable enough without some pricing changes -- where does that leave you? Technically, they can just disappear with your whole software and data configuration overnight. Or they can put trained apes in charge of maintaining things so it suddenly has huge security holes. Who knows?
When you run things yourself, YOU are where the buck stops if things go wrong. If you're good at what you do, that should be more of a comforting thing than a scary thing. I've seen too many shops trying to cut corners on the I.T. hiring budget by bringing in less experienced people who really can't properly run the systems they're supposed to be caring for. The cloud for them is a crutch ... a way to get things done that are beyond their abilities. But that's not an ideal situation for a business to put itself in.
I agree.
Get this: At the initial meeting, I asked if there were any speed issues and the vendor said, "No, you'll operate much faster than you do now."
The fucking latency was shit.
And get this: The firm logged in using RDP.
It was actually just one big duplication of our production servers (I had a dual system) loaded up to the "cloud."
It little behooves the best of us to comment on the rest of us.
child's-play. Try Billions.
Last project involved cloud-computers where we processed a BILLION USD every year (credit card payments)
And you're ALWAYS trusting a 3rd party. Your own IT guys or cloud. Doesn't matter.
What DOES matter is backups and fail-overs. (These guys NEEDED a proper failover)
For the past 15+ years, I have worked with systems that must not go down. No - really, 1 second of down time in a year is intolerable for some applications. (Not always every system I work with, but at least some fall into this category.)
If the original poster of this story failed to keep in mind when designing the infrastructure that a computer that won't ever go down isn't available, then he/she failed the first principle of fault flexible systems. That principle is to come to terms with the level of acceptable fault vs. the available budget. 9's cost money. How many do you want to buy? An uptime SLA of 99% is still more than three and a half days down over 1 year, and 99.99% is still about 53 minutes. A 99.999% SLA will cost more. A lot more.
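For anyone who wants to price out their nines, the arithmetic is short. A minimal sketch, assuming a 365-day year; the function name is just for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year

def allowed_downtime_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year permitted at a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    secs = allowed_downtime_seconds(pct)
    print(f"{pct}% -> {secs / 3600:.2f} hours/year ({secs / 60:.1f} minutes)")
```

At 99% that is roughly 3.65 days a year, at 99.99% roughly 53 minutes, and at 99.999% a little over 5 minutes.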
From the OP " millions of dollars in lost revenue." - this is perhaps the first time I've heard that statement that I actually believe it. That still doesn't mean "you have chosen wisely".
So, let us do a thought experiment in the absence of any data. I'm going to get things wrong here for this particular situation because I don't know details and will be making assumptions.
Data Storage: Sounds like "We don't need backups because we have RAID" issue. Sorry, study availability zones in S3 on Amazon. You can stripe your RAIDs across a local zone, stripe it to other data centers geographically separated. On Linode, you can rsync between data centers. On Rackspace, you can use object storage in multiple data centers. (On Linode and Rackspace this will require software to accomplish, as unlike S3 they do not offer an API to fire off copies.) This ensures that the totality of data is not stored in one basket. That still doesn't back it up - and you need to do that too. Data may need to be restored due to human as well as machine error. One situation I remember was engineering a database system with 4, 8, 16, 24, 36, and 48 hour delayed slaves for super fast point in time recovery - with hot transaction backups that could be selectively applied. Just to add to the server count, it was also Master|Slave^6/Master|Slave^6 - 14 DB servers and 7 ingress data clusters of 5 each. (I'd say if you are in the US, there is a 100% chance you've done something that was processed by this system in the last year if you drive a car.)
Data Acquisition: This should use something like Atom Hopper/RabbitMQ/ServiceMix (oh god, please not ServiceMix) in a clustered environment across data centers. Use DNS round robin as a cheap way to find your ingress servers, but it requires some consideration of DNS timeouts. Datapoints are posted as messages, and servers claim the work entry from the Atom Hopper queue. This prevents the situation where clients cannot report their data, and also protects the processing servers from having a regional failure or failing to process a data message.
Also, systems should be configured so that Ansible, Salt, Git, or Puppet can automatically build systems in the cloud when stressed or even after a total data center failure. Strict deployment and versioning should be enforced so that no server can't be replaced at the drop of the hat somewhere else due to "one off" changes.
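To make the delayed-replica idea concrete: when a bad change lands, you recover from the freshest replica that has not replayed it yet. A minimal sketch, assuming each replica applies changes exactly its configured number of hours late; the function and constant names are made up for illustration.

```python
from typing import Optional, Sequence

# Replica delays in hours, as listed in the comment above.
REPLICA_DELAYS_HOURS = (4, 8, 16, 24, 36, 48)

def pick_recovery_replica(hours_since_error: float,
                          delays: Sequence[int] = REPLICA_DELAYS_HOURS) -> Optional[int]:
    """Return the delay of the freshest replica that has not yet replayed the error.

    A replica delayed by D hours only contains changes older than D hours,
    so it is still clean if the bad change happened less than D hours ago.
    """
    clean = [d for d in delays if d > hours_since_error]
    return min(clean) if clean else None  # None: fall back to real backups
```

Notice the mistake five hours after it happened and the 8 hour replica is the one to promote; notice after more than 48 hours and only the backups can help.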
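The claim-then-acknowledge pattern is what keeps a dead processing server from losing a datapoint. A toy in-memory sketch, assuming a lease that expires if the worker never acknowledges; `LeaseQueue` is a made-up stand-in, not the API of Atom Hopper or RabbitMQ, and it is not thread-safe.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple

class LeaseQueue:
    """In-memory stand-in for a broker; illustrative only."""

    def __init__(self, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._lease = lease_seconds
        self._clock = clock
        self._pending: Dict[str, object] = {}  # message id -> payload
        self._claims: Dict[str, float] = {}    # message id -> lease expiry time

    def post(self, payload: object) -> str:
        """A client reports a datapoint; it stays queued until acknowledged."""
        msg_id = uuid.uuid4().hex
        self._pending[msg_id] = payload
        return msg_id

    def claim(self) -> Optional[Tuple[str, object]]:
        """Hand out the first message that is unclaimed or whose lease expired."""
        now = self._clock()
        for msg_id, payload in self._pending.items():
            expiry = self._claims.get(msg_id)
            if expiry is None or expiry <= now:
                self._claims[msg_id] = now + self._lease
                return msg_id, payload
        return None

    def ack(self, msg_id: str) -> None:
        """The worker finished; only now is the message actually removed."""
        self._pending.pop(msg_id, None)
        self._claims.pop(msg_id, None)
```

If a worker claims a message and then its region falls over, the lease runs out and another worker picks the same message up.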
Treat servers like cattle, not pets.
Last - KNOW YOUR VENDOR. This is absolutely critical. Here is something I suggest you try for yourself - call the support line and time how long it takes to get a human on the phone. Remember that some vendors cost more than others, and there's a good reason for that. Getting, keeping, and paying really smart people to help you out with zero notice costs money. A lot of money. And a vendor that will be willing to let you skip tier I - III support (after you've proven you're not a complete idiot) is even more rare. I know of only one company that will do that, and it's the same one that will win the call-and-get-a-human test. But they cost twice to three times what others do - because they give you what the others will not.
What I hear in the OP's post is that they suffered a catastrophic failure and rate limits and fraud protection kicked in. I do understa
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
Sorry, but this is your fault. We put multi-million and billion dollar clients on AWS and GCP and have never, ever had an issue like this. 1) You're hosting in the cloud and not actually understanding what a cloud provider is or does. 2) You don't have a plan that is reasonable for a multi-million dollar organization, one that includes some level of support or SLA. 3) You're not building with DR in mind. Data should be backed up somewhere safe. Your infrastructure should be 'infrastructure as code' which can be spawned up essentially at a moment's notice. 4) Your admins don't have an escalation path that doesn't involve waking up the CFO. 5) Someone did something wrong, either you didn't pay the bills, or your usage was completely fucked so badly they shut your account down? I bet money that it was you didn't pay your bills. Seriously... no offense, but you guys need to look at your business and spend a bit more time/money/effort on building this stuff up in a way that doesn't just fall over...
Sorry, you must be really bad at AWS.
Amazon's hardware isn't HA, your solution is.
I've run small to huge workloads on Amazon, and saying that it's designed for companies that have hundreds/thousands of servers is totally wrong. In fact, it's not really that well designed for tons of servers because they don't really have a lot of built-in automation to handle hundreds, if not thousands, of servers.
Just because you use a cloud provider doesn't mean that it's turnkey. You still have to know everything. The difference is it costs less and generally it's easier.
Cloud is awesome. Scale indefinitely, work out the kinks, test your product. All on a shoestring/pay-as-you-go budget. Very nice.
Yet cloud-only is shite in production. Provider X (in this case Google) has you by the balls. You do *not* want that. Tried and true fallback and failover with your own Docker setup and a stack of rented blades in a rack, including nightly to-my-desk backups, are an absolute must for any critical infrastructure. No matter how cool Kubernetes and Spanner are handling your stuff today, you at least want to be able to save your data tomorrow when things turn south.
BTW, all this is Captain Obvious speaking.
This isn't news, this is basic web stuff 1-oh'-1 that every third-grade webshop running critical WordPresses knows. It's really the exact opposite of rocket science.
The kid who set up this disaster deserves a smacking.
Lesson learned I guess.
We suffer more in our imagination than in reality. - Seneca
This sounds like asking for trouble to me!
SLAs are what ensure companies can't do this to you. If you don't have an SLA with the cloud provider then you should probably run across multiple clouds and/or in-house infra.
I worked at a hydroelectric project and used GCM to alert people when the values of PLCs hit certain thresholds (low voltage, high temps etc). There is no way I would just use Google for this though. These alarms are important, you should never just depend on one service. We also had a directly connected Pro-face screen (with alarms) in the control room as well as various live graphs, Android apps and widgets and email alerts. So if Google Cloud Messaging went down there were still other ways to detect a problem.
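The "never depend on one service" point can be sketched as a fan-out: check readings against thresholds, then push each alarm to every channel independently, so that one channel failing does not block the others. This is an illustration only. The function names, channel names, and limits are made up for the example and are not taken from the setup described above.

```python
from typing import Callable, Dict, List, Tuple

Notifier = Callable[[str], None]


def check_thresholds(readings: Dict[str, float],
                     limits: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return an alarm message for every reading outside its (low, high) limits."""
    alarms = []
    for name, value in readings.items():
        if name not in limits:
            continue  # no limits configured for this reading
        low, high = limits[name]
        if value < low:
            alarms.append(f"{name} low: {value} < {low}")
        elif value > high:
            alarms.append(f"{name} high: {value} > {high}")
    return alarms


def fan_out(message: str, notifiers: Dict[str, Notifier]) -> List[str]:
    """Send the message on every channel; return the channels that succeeded."""
    delivered = []
    for channel, send in notifiers.items():
        try:
            send(message)
        except Exception:
            continue  # a dead channel must not stop the remaining ones
        delivered.append(channel)
    return delivered
```

If the push channel raises because the provider is down, the email and control-room channels still receive the alarm, and the returned list tells you which ones got through.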
I never use any online cloud. Stupid idea IMHO.
As other people have pointed out, the magic letters here are "SLA". You must have a contract stating what the vendor's responsibilities are and be able to enforce that contract.
A contract is only as valuable as your ability to ensure it is enforceable. When you are dealing with a company the size of Google they can hire some flesh eating lawyers and have the bank account to keep you busy until you die and so if you plan to bring a lawsuit you'd better be prepared for shock and awe. Just having a contract isn't enough by itself.
You are right that a service level agreement is a very good idea but it isn't going to matter if it is cheaper for them to screw you anyway.
Otherwise, you don't have a business, you just have a hobby.
That's a nice sound bite but it's complete BS. When you are a small business or a startup you generally simply don't have the resources to fight a company the size of Google. You can have whatever agreements you want but if they decide to screw you there isn't much you can do about it. I've started several companies where we had to depend more than is ideal on a single large vendor and it's freaking terrifying if/when you don't have alternatives - contract or no.
As much as I appreciate Google's services as an individual, I do not use them for any business-critical need for the exact reasons listed in your article. The service may be generally reliable, but what is not acceptable is the casual way they handle exceptions. It clearly defines the pecking order. They are the masters, you are the peon, the paying peon but still a peon.
If millions are on the line and you don't wake the CFO, he/she/it is going to be very angry. Chances are, if millions are on the line, the CFO will be waking you up even if the problem is technical.
I came to the datacenter drunk with a fake ID, don't you want to be just like me?
This is not about a server failure, router failure, logic bomb, or hacking incident - this is a hosting company deciding to flip a switch and simply turn off your entire infrastructure because an automated process determined "something" in your entire ecosystem was abnormal......
Frack that. never. not 1 dollar.
Absolutely agree. The benefits of "the cloud" which is one of a string of terms which initially had zero meaning and people kept trying to figure it out until they'd invented something... can be had in a private cloud infrastructure, anything else is just leveraged hosting which should definitely be a no-no for enterprise.
If the site was so important, why didn't they plan for an outage? Sounds like they just assumed it would always be up. Why didn't they ask Google about the conditions and possible sources of outages, and plan accordingly? My employer has done multiple cloud deployments, and this is always part of the planning. What do we do if the cloud provider goes down or experiences an error? What are their operational norms around infrastructure changes and the like? What is our downtime tolerance? Is the cost worth the risks (i.e. not that important) or do we invest in multiple sites? Perhaps not the whole story is here, but it sure sounds like they didn't do much due diligence before throwing all in on "cloud".
This posting is provided 'AS IS' without warranty of any kind, implied or otherwise.
Here is my bit of wisdom. Get everything running locally in a cloud container framework like Heroku or whichever one you prefer. When and if you get to the point where you need disaster recovery or more scalability, you can shift everything to the cloud until you recover, or load balance to the cloud at peak periods.
The cloud is not magic. It just means that other systems administrators are running your servers and they do screw up quite often.
If you are really serious about running in the cloud and having high reliability, let me introduce chaos monkey
https://github.com/Netflix/cha...
This is what Netflix uses to make sure that they can keep running when Amazon availability zones, instances, or whatever they call them disappear. I believe that on the day that Amazon accidentally took all of its storage offline and killed half their cloud, Netflix survived and was able to keep going.
Read this for amusement: https://aws.amazon.com/message...
Ooops, we deleted AWS....
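The idea behind that kind of tool can be sketched in a few lines: terminate a random running instance on purpose, then check that enough capacity is left to keep serving. This toy version is not Chaos Monkey's actual interface. The pool dictionary, the function names, and the capacity rule are assumptions for illustration.

```python
import random
from typing import Dict


def kill_random_instance(pool: Dict[str, bool], rng: random.Random) -> str:
    """Mark one randomly chosen running instance as down and return its name."""
    running = sorted(name for name, up in pool.items() if up)
    if not running:
        raise RuntimeError("no running instances to terminate")
    victim = rng.choice(running)
    pool[victim] = False
    return victim


def survives(pool: Dict[str, bool], min_running: int) -> bool:
    """True if at least min_running instances are still up."""
    return sum(pool.values()) >= min_running
```

With three instances and a requirement of two, the service survives the first kill and fails the check after the second, which is the kind of weakness you would rather find in a drill than in an outage.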
Netflix gets less and less stable daily... were you under the impression that large enterprises are making good decisions lately? As someone who works on enterprise infrastructure I can give you a hint, they aren't.
The benefits of "the cloud" which is one of a string of terms which initially had zero meaning and people kept trying to figure it out until they'd invented something
False. "Cloud computing" originally meant that you were leasing instances on someone else's servers, and you could spin up more instances any time you were willing to pay for them. That's still a useful meaning, unfortunately people now have "private clouds" which we just used to call "clusters", so the phrase actually means less than it did originally.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
You put such mission critical stuff on "Google Cloud"?
Why??
Part of the migration should be not.
It little behooves the best of us to comment on the rest of us.
There is a three-part mantra that project managers learn at the enterprise level:
This project fell to the third part.
You work in an industry where an excuse is good enough. Which is nice for you.
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
If a vendor is selling to a market that has an on-prem requirement, it's a non-starter to not offer such a feature.
Many vendors, after doing their market research, have concluded that on-premises requirements such as yours are a rounding error. The benefit of satisfying them does not exceed the opportunity cost of monopoly rents that can be extracted by not satisfying them.
Sooner or later folks are going to have to realize that "Cloud" is just market-speak for another box that you don't have access to (until you pay the $9.95 access fee...). I have run my "own cloud" since 1999. Sits right across from me, and anytime it goes down I have direct and full immediate access. And this is on stuff only I consider important (apparently..)
You keep going until you die..."Me".
In other words, one must also consider malice from the hosting provider when shopping for a hosting provider.
The cloud rained on him. If it's a bad enough storm, you lose everything.
They are *the* go-to reference customer for AWS.
I'd reserve that spot for Amazon themselves, but Netflix is a good third-party reference. Kind of scary to let your direct competitor be the host of your content, though.
If you're not going to be hybrid (which gives you other opportunities) then you simply DNS load balance between regions AND cloud providers. Easiest way? Containerize.
The complexity comes into play around your databases, but there are a myriad of well known solutions to all of these problems.
There's no other rational way to provide solid service than to spread your risk.
Set up DNS failover (or load balancing if you prefer).
Set up in multiple AWS regions.
Set up in multiple GCE regions.
Optionally set up for Azure.
If you're really paranoid, have an on-premise instance somewhere local to you (or a metapod in house or something similar.)
Containerization makes all of this vastly simpler than in the past.
As many others have mentioned - don't trust anyone to be up all the time, trust that at least one of them will be up all the time.
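The "trust that at least one of them will be up" rule can be sketched as a health-checked preference list: probe each deployment in order and send traffic to the first one that answers. This is a simplified illustration, not how any particular DNS failover product works. The hostnames are placeholders and the probe is passed in so it can be an HTTP check, a TCP connect, or anything else.

```python
from typing import Callable, List, Optional, Sequence


def healthy_endpoints(endpoints: Sequence[str],
                      probe: Callable[[str], bool]) -> List[str]:
    """Return the endpoints whose probe succeeds, keeping preference order."""
    alive = []
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                alive.append(endpoint)
        except Exception:
            pass  # a probe that blows up counts as unhealthy
    return alive


def pick_primary(endpoints: Sequence[str],
                 probe: Callable[[str], bool]) -> Optional[str]:
    """Return the most preferred healthy endpoint, or None if all are down."""
    alive = healthy_endpoints(endpoints, probe)
    return alive[0] if alive else None


# Placeholder hostnames: one deployment per provider, plus an on-premise box.
ENDPOINTS = (
    "aws-us-east.example.com",
    "gce-europe-west.example.com",
    "azure-westus.example.com",
    "onprem.example.com",
)
```

If one provider suspends the whole account, its probe fails and traffic moves to the next entry in the list. The DNS record's TTL still limits how quickly clients notice, which is the timeout consideration raised earlier in the thread.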
There seems to be a lot of comments saying why would you ever use someone else's infrastructure. The last few companies I worked for have needed to and they were big companies. You pay services to be reliable. They'd lose business if they weren't.
Building infrastructure is not cheap. There's a reason only a few companies build things like electrical grids and telecommunications networks. Businesses do business with other businesses. Unless your server needs are small and simple or you have a lot of money to burn, there is value in large-scale distributed server centers. Even if your server needs are small, you should get better guaranteed uptime than by doing it locally - they have redundant machines with redundant storage and redundant power supplies with generator backups. Also, having your server and data offsite and backing them up means you don't lose everything if your brick-and-mortar business catches on fire.
Check your provider out if you have concerns and make sure you have a disaster plan. Also, redundant cloud services are always an option.
SMBs run their businesses on SaaS services. So, essentially anyone serving SMBs is in the same boat.