Amazon's Werner Vogels on Large Scale Systems
ChelleChelle writes "When it comes to managing and deploying large scale systems and networks, discipline and focus matter more than specific technologies. In a conversation with ACM Queuecast host Mike Vizard, Amazon CTO Werner Vogels says the key to success is to have a 'relentless commitment to a modular computer architecture that makes it possible for the people who build the applications to also be responsible for running and deploying those systems within a common IT framework.'"
Best question in the interview (page 3):
"MV: Given the size and scope of Amazon, there's a lot of talk in the industry about good computing. Most of the talk is around scientific applications. Do you see good computing playing a role at Amazon in the future?"
Actually, no... Amazon likes bad computing.
"When it comes to managing and deploying large scale systems and networks, discipline and focus matter more than specific technologies.
How about:
When it comes to DOING ANYTHING, discipline and focus matter more than specific technologies.
If you are at a 'small scale' environment and are limited to specific technologies, discipline and focus matter even more. Your choice is less with technologies and more with how you use them.
"the key to success is to have a 'relentless commitment to a modular computer architecture that makes it possible for the people who build the applications to also be responsible for running and deploying those systems within a common IT framework.'"
We have a BINGO!!!!!
It could be worse, it could be Monday.
OpenVMS Cluster
So who has the bigger system? Amazon, or Google?
Werner's comments are only 1/2 true. While many of the things he deals with are website-centric, there is a whole world behind the website. No one buys from Amazon because of the website necessarily; they buy because the fulfillment is (mostly) accurate and fast. These backend systems are not nearly as clean as Werner indicates.
Myopic vision on his behalf imo.
I've been programming for many years now, but I'm new to web-app development. I've been learning Ruby on Rails (for various reasons) and one of the points the book I'm reading makes (Agile Development with Rails) is that good scalability is best achieved through the use of a "share nothing" architecture - basically reduction of chokepoints by reduction of shared content in a system.
I'm studying this as I'm looking at scalability concerns in an app I'm putting together, and I did a google search on the topic, but the only thing of interest I could find was this article, which doesn't really go into the downsides of this approach. What does slashdot think about this?
It's incorrect to suggest that specific technologies don't matter, as some have done. They matter very much, and can often be the difference between a complete failure and a brilliant success.
There was recently a topic here at Slashdot about how Amazon gives their developers a lot of freedom to choose their own languages and implementation tools. There was some speculation that much of their success has been due to their use of languages like Common Lisp and Standard ML. It's widely known that the use of Standard ML, for instance, will directly lead to improved program quality due to its functional nature, garbage collection, and strong typing. Common Lisp offers the developer unmatched power to extend the language, which can often decrease the time it takes to get a piece of software implemented.
Had Amazon built their infrastructure around PHP, it is doubtful that they would be the leader that they are today. Based on PHP's past (rather awful) security performance, Amazon would likely have run into many, many problems and vulnerabilities. But they chose wisely, and used languages that promote solid application design.
So in the end, the tools and specific technologies used are quite important. It doesn't matter how organized you are, or what process you follow, if the technology you're using isn't sufficient for the task at hand.
This is a dupe of a story run in May:
0 8
http://slashdot.org/article.pl?sid=06/05/17/04532
I was RTFA and thought it looked mighty familiar - that or DeJaVu.
$ man woman *
-bash:
One of those 'buzzwords, you know?' was your entire interview buddy. Imagine that? Scalability is achieved through many different technologies with many different engineers? I would never have thought that. I guess you have won that argument.
Jeez, I'm glad this guy is just the CTO, not somebody important.
Under normal circumstances, at Amazon you'll have to support what you wrote. That means if your code crashes all the time, you'll get paged in the middle of the night.
Now, even if you get rid of some incompetent programmer (say by moving him to another team), the rest of the team will still get bogged down with supporting the code he wrote. And since engineers now have to do support for the other teams using their service, their productivity eventually grinds to a halt and new development becomes extremely hard. Things will also stick around forever.
Posted amazonymously.
Just as Amazon was using SOA long before it was named, the same is true of DBMS2. Add that to SAP's adoption, and we're getting somewhere. :)
To err is human. To forgive is good system design.
You'll be able to find a lot more about this technique if you search google for the (quoted) term "shared nothing"
The best resource (though getting dated) on the origins and meaning of shared-nothing vs. shared-something architecture is Greg Pfister's In Search of Clusters, 2nd ed.
There's "degenerate" shared nothing, which is what I find most people referring to today -- you have a web server farm and you don't store session state, or if you do, you "pin" it to a particular server. Or you just rely on the database. It's degenerate because, sure, it's scalable (memory isn't as directly linked to concurrent users), but it really just shifts the burden to the database, which tends to be 1 big box.
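To make the "pin it to a particular server" idea concrete, here is a minimal Python sketch. The server names and the `pin_session` function are hypothetical, made up for illustration, and real load balancers do this in their own ways. It assumes a fixed list of servers and uses a stable hash of the session ID, so the same session always lands on the same box.

```python
import hashlib

def pin_session(session_id: str, servers: list[str]) -> str:
    """Pick one server for a session so its state lives in exactly one place."""
    # hashlib gives the same digest in every process, unlike Python's built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

servers = ["web1", "web2", "web3"]  # hypothetical server names
print(pin_session("session-abc", servers))
```

Note that nothing here helps the database: every pinned server still talks to the same backend, which is exactly the burden shift described above.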
So the question becomes, how do you scale the database horizontally?
In the database world, the term has become somewhat overloaded. Originally it meant physically shared disks and/or memory vs. using network interconnectivity. But with the rise of I/O shipping technologies over networks (iSCSI, high speed NFS/NAS, SAN fibre-channel), this isn't really true anymore. So now, it comes down to how your data is partitioned and how you ship a read/write function to that node. Does a node "own" its data (or a replica)? Or can any node touch any data? That's the debate.
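A rough Python sketch of the "node owns its data" side of that debate follows. The `Node` and `Router` classes are invented for illustration and are not any real database's API. The assumption is that ownership is decided by a stable hash of the key, and the router ships each read or write to the owning node instead of touching the data itself.

```python
import hashlib

class Node:
    """One shared-nothing node: only this node reads or writes its local data."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, op, key, value=None):
        if op == "put":
            self.data[key] = value
            return value
        if op == "get":
            return self.data.get(key)
        raise ValueError(f"unknown op: {op}")

class Router:
    """Ships each operation to the node that owns the key."""
    def __init__(self, nodes):
        self.nodes = nodes

    def owner(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def execute(self, op, key, value=None):
        return self.owner(key).apply(op, key, value)

cluster = Router([Node("n0"), Node("n1"), Node("n2")])
cluster.execute("put", "order:42", {"status": "shipped"})
print(cluster.execute("get", "order:42"))  # {'status': 'shipped'}
```

In a shared-disk design there would be no `owner` step: any node could serve any key, at the cost of coordinating caches between nodes.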
In short, it works well in some cases: read-mostly parallel queries and/or search, which is why Google's using it, or why you see it with data warehouses (Teradata, DB2 UDB). It works OK if you mostly have transactional data updates within a well-defined partitionable set of data (such as the TPC-C benchmark). It works less well when dealing with transactional updates spread across the entire data set (assuming a normal distribution), as you'll need to update replicas with a two-phase commit. The load balancing of your data across nodes also requires care in picking the appropriate partitioning key: sometimes a hash works well, sometimes range-values work well. If you need to re-partition your data for whatever reason, it's going to be a big job.
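For the partitioning-key point, here is a small Python sketch of the two placement schemes and of why re-partitioning hurts. All names, the key format, and the split points are assumptions chosen for illustration. It uses plain hash-modulo placement, which is the simplest scheme, not necessarily what any of the products above do.

```python
import bisect
import hashlib

def hash_partition(key: str, num_nodes: int) -> int:
    """Spread keys evenly across nodes, ignoring key order."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def range_partition(key: str, boundaries: list[str]) -> int:
    """Keep neighbouring keys together; N sorted split points give N + 1 nodes."""
    return bisect.bisect_right(boundaries, key)

def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Share of keys whose node changes when the node count changes."""
    moved = sum(
        1 for k in keys
        if hash_partition(k, old_nodes) != hash_partition(k, new_nodes)
    )
    return moved / len(keys)

boundaries = ["h", "p"]  # hypothetical split points: [..h), [h..p), [p..]
print([range_partition(k, boundaries) for k in ("alice", "mallory", "zed")])

keys = [f"customer-{i}" for i in range(10_000)]  # made-up keys
print(fraction_moved(keys, 4, 5))
```

Hashing balances load but scatters range scans over every node; ranges keep scans local but can create hot spots. With modulo placement, going from 4 to 5 nodes relocates roughly four keys in five, which is one reason re-partitioning is such a big job.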
Commercially, Oracle 10g's Real Application Clusters is an example of a shared disk database, though they use an interconnect between nodes for cache coherency. Microsoft SQL Server, DB2, Teradata, MySQL, etc. are all "shared nothing".
-Stu
Vogels was a distinguished academic in distributed systems prior to Amazon. Read his blog some time. He is quite insightful, and this queuecast was a great one. Yours is the first comment I've seen in many forums over the past few months that seems to think it was tripe, so I find it curious.
His point is that Amazon has found a decentralized architecture that can work reliably but still respond to new demands with agility. That's a huge deal, considering the contortions, pain, and centralized bottlenecks that most large IT shops have to deal with. Not to mention over-obsession with technological buzzes instead of looking at the business architecture of the firm.
Perhaps that's obvious, but perhaps it's important to restate the obvious when most people don't follow it.
-Stu
...in 37signals (the guys who made Ruby on Rails), I wouldn't be surprised to see some more RoR happening there as some commenters here have suggested. Good times...
The Army reading list
At Java One this year there was a pretty interesting talk by a guy from Ebay, talking about "creating an object model that spans the world". If you join the Sun Developer Network (for free) you can get the slides, and even audio/video I believe.
Being bitter is drinking poison and hoping someone else will die
Amazon, interestingly enough, uses a Linux/Oracle RAC data warehouse for data analysis. It is one of the larger deployments in the world, at over 15 terabytes. I found this article, but there have been many over the years, some on Oracle's site.
;-)
And, to re-iterate, Oracle RAC is shared-disk...
-Stu
Thanks for the pointer Mnesia, great reading.
-Stu