Building Scalable Web Sites
briandon writes "It's not a step-by-step guide (and doesn't claim to be one), but Building Scalable Web Sites is the closest thing available to a nuts-and-bolts look at managing the technical aspects of doing a Web-based startup. There's lots of code inside, but the book isn't built around a single, extremely contrived, case study like an online wine store. Instead, most of the chapters follow a general pattern: a topic (like bottlenecks in your application and platform, scaling, or monitoring) is addressed and some rules of thumb that describe the way that the author feels things should be done are set forth and explained, with lots of very specific hints and factoids mixed in along the way. Tools for other languages (in most cases, Perl) are mentioned in passing, but nearly all of the code snippets are in PHP. MySQL 4.1 is the basis for most of the database-centered material." Read the rest of Brian's review.
Building Scalable Web Sites : Building, Scaling, and Optimizing the Next Generation of Web Applications
author
Cal Henderson
pages
330
publisher
O'Reilly Media, Inc.
rating
9/10
reviewer
Brian Donovan
ISBN
0596102356
summary
If you've been kicking around the idea of doing a Web startup, then you should definitely give this book a read.
Henderson's resume, which can be found on his personal website, indicates that he joined Ludicorp about a year before they shut down GNE, their Web-based roleplaying game, to focus on Flickr (which had originally begun as an offshoot of the game). It's his role as web development lead at Ludicorp that led to the inclusion of the "The Flickr Way" sub-subtitle that runs diagonally across the upper right corner of the book's front cover.
The five-page-long first chapter sets the stage for the rest of the book with section headings that are all questions: "What is a Web application?", "How do you build Web applications?", "What is architecture?", and "How do I get started?".
Chapter two, "Web Application Architecture", begins with Henderson drawing an analogy between a web app and a type of multi-tiered dessert known as a trifle - the sponge cake at the bottom of the dish is the database, the next layer up, Jell-O, is the business logic, and so on. The black and white image in the text is identical to the color image included in a slide from an eight-hour workshop that the author gave in San Francisco titled "How We Built Flickr". Having read the book and some reviews of his workshops and looked at the list of talks on Henderson's site (some with PowerPoint decks for download), it seems likely that a lot of the ideas expressed in the book were developed over an extended period of time through repeated presentations.
Next up are the considerations around development environments, beginning with a 3-point list of guidelines for building small-scale web apps up into big ones: use source control, have a one-step build process (literally, if possible, a single button), and track bugs (as well as non-bug items like features and support requests). Readers get to feast their eyes on a cropped screenshot of Flickr's build control panel (two buttons, "perform staging" and "perform deployment", to match the last two steps in the release sequence in an HTML form). For small teams, the author is in favor of allowing multiple developers to trigger releases and he suggests several ways of trying to keep that workable. In version control, Subversion gets the nod and, though no bugtracking tool is singled out as the best, FogBugz garners the highest praise ("extremely effective") and has the shortest list of "cons". The author never comes out and says what the Flickr / Flickr-Yahoo team uses in either area, however.
Chapter four is the most readable introduction to internationalization, localization, and Unicode that I've seen up to this point. MySQL's currently incomplete implementation of UTF-8, sarcastically referred to by some as "UTF-7½" (Google for it), is mentioned in enough detail that a reader can decide whether or not it's likely to be an issue for their app. The book as a whole is packed with little nuggets of information like that - things you might not have otherwise been even peripherally aware of until they bit you.
Input filtering and strategies for avoiding building cross-site-scripting and SQL injection vulnerabilities into your app are addressed in a chapter on data integrity and security that shows the same attention to detail as the rest of the book. The section on UTF-8 filtering, for example, features a three-way benchmark of UTF-8 validation techniques (using regular expressions, iconv, and ord()) and the merits of each approach are considered.
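To give a rough feel for what such a comparison involves, here is a minimal sketch in Python rather than the book's PHP. It is not Henderson's code: the function names are made up for illustration, the built-in codec stands in for iconv, and indexing into the byte string stands in for ord(). The byte ranges are the standard well-formed UTF-8 sequences, which exclude overlong encodings and surrogates.

```python
import re

# Approach 1: let the built-in codec decide (a stand-in for the iconv approach).
def is_valid_utf8_codec(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Approach 2: one regular expression listing every well-formed byte sequence.
_UTF8_RE = re.compile(
    rb"""(?:
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*""",
    re.VERBOSE,
)

def is_valid_utf8_regex(data: bytes) -> bool:
    return _UTF8_RE.fullmatch(data) is not None

# Approach 3: walk the bytes one at a time (a stand-in for the ord() approach).
def is_valid_utf8_bytes(data: bytes) -> bool:
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:
            i += 1
            continue
        # need = number of continuation bytes; lo..hi = allowed range for the first one
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F
        else:
            return False
        if i + need >= n:
            return False  # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + need + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += need + 1
    return True
```

Wrapping each function in the standard library's timeit module would turn this into a benchmark. Which approach wins will depend on the input and the runtime, so the sketch makes no claim about the results.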
The coverage of handling emails programmatically in chapter six is also quite good. Henderson does the basics and then delves into a number of possible pitfalls in considerable detail. The salient aspects of the TNEF (media type application/ms-tnef) format, used by MS Outlook for attachments and metadata, for instance, are explained and pointers are given to open source TNEF parser implementations. I also got a lot out of the section on dealing with email from wireless devices like mobile phones, titled "Wireless Carriers Hate You" (there's that dry British wit again).
The second half of the book (chapters seven through eleven) focuses more on scalability. It's also where you'll find the most material on using MySQL, including but not limited to query profiling and optimization, a discussion of the merits of denormalizing once you begin to reach a certain scale, and a comparison of the different MySQL backends. There's an entire chapter devoted to finding and dealing with bottlenecks - how to determine whether your app is CPU-bound, I/O-bound, or context-switching-bound and what to do about it. The chapter on scaling begins by debunking the "scaling myth" (but he actually tackles several misconceptions at once - namely that scalability is synonymous with speed, that scalability is a byproduct of having written your app in Java, etc.) before getting into vertical vs. horizontal scaling (buying more powerful and expensive servers vs. adding more cheap servers), load balancing, and more. Monitoring (both of web stats and your application itself) and APIs (RSS/RDF/Atom feeds, mobile content delivery formats like WAP and XHTML mobile, and REST, XML-RPC, and SOAP Web services) both get chapters of their own.
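As a toy illustration of the horizontal side of that trade-off, the sketch below hands requests to a pool of cheap servers in round-robin order and lets the pool grow by adding a box. It is not taken from the book; the class name and the hostnames are hypothetical, and a real load balancer would also deal with health checks and failover.

```python
class RoundRobinPool:
    """Hands out backends in rotation; scaling out means adding another one."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._next = 0  # count of requests handed out so far

    def add(self, backend):
        # Horizontal scaling: one more cheap server joins the rotation.
        self._backends.append(backend)

    def pick(self):
        if not self._backends:
            raise RuntimeError("no backends available")
        backend = self._backends[self._next % len(self._backends)]
        self._next += 1
        return backend


pool = RoundRobinPool(["web1", "web2"])  # hypothetical hostnames
first_three = [pool.pick() for _ in range(3)]
pool.add("web3")
next_three = [pool.pick() for _ in range(3)]
```

Vertical scaling has no equivalent in code like this: it means replacing web1 with a bigger machine, not changing the pool.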
Henderson's sense of humor is evident throughout the book, but not in the annoying overly cutesy way that made me want to toss "Extreme Programming Installed" into the circular filing drawer. In the section on software interface design (where he means the interfaces between the layers of the trifle), for example, there's a "Web Application Scale of Stupidity" that places "sanity" in the center and OGF (one giant function) and OOP at the extremes. The process of separating web app logic from presentation is broken down into 3 steps: separating logic code from markup, splitting the markup into per-page files, and moving to a templating system. He closes out the chapter with a breakdown of the hosting, hardware, and networking issues involved in serving up web apps.
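The end state of that three-step separation can be sketched roughly as follows, again in Python rather than the book's PHP. The template text, function names, and page data are invented for illustration, and string.Template from the standard library stands in for whatever templating system a real app would use.

```python
from html import escape
from string import Template

# Presentation: markup lives in its own per-page template, with no logic in it.
PAGE_TEMPLATE = Template("<h1>$title</h1>\n<p>$count photos</p>")

# Logic: gathers the data for the page and knows nothing about markup.
def load_page_data(photos):
    return {"title": "Recent photos", "count": len(photos)}

# Glue: escape every value, then fill in the template.
def render(template, data):
    safe = {key: escape(str(value)) for key, value in data.items()}
    return template.substitute(safe)


page = render(PAGE_TEMPLATE, load_page_data(["a.jpg", "b.jpg"]))
```

Either side can now change without touching the other: a redesign edits only the template, and a new data source edits only load_page_data.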
Technically, I think that Building Scalable Web Sites is 100%. There were just a few niggling flaws. Two dates given (both on page 155), 1990 for the creation of libxml and 1995 for the design of XML-RPC, are incorrect and I spotted a handful of grammatical mistakes (probably proportionately fewer than in this review) that I've already submitted, along with the date mistakes, as errata through the form linked from the O'Reilly catalog page for the book.
Additionally, though the cover does say "The Flickr Way", you won't find many sentences that begin "At Flickr, we [...]". Aside from the "Rolling Your Own" section in chapter seven describing some custom middleware and a protocol that they whipped up for moving files around within their system, there aren't a lot of explicit details about the way that Flickr operates in the book. You'll actually get more insider info from Tim O'Reilly's "Database War Stories" entry regarding Flickr, which is based on Henderson's answers to questions posed by O'Reilly, than from this book.
If you'd like to get a feel for Henderson's style, chapter five ("Data Integrity and Security") is available as a PDF on the O'Reilly catalog page for the book and Henderson has also put some articles online (all PDFs, not much overlap with the material in BSWS) at his website.
You can purchase Building Scalable Web Sites : Building, Scaling, and Optimizing the Next Generation of Web Applications from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
It's not a real book on "Building Scalable Web Sites" unless there's a chapter titled : ;)
"Preparing for Slashdotting: Burn, Baby, Burn"
Maybe /. doesn't want to support Amazon due to their stance on patents. Both Amazon and B&N have affiliate programs. B&N has less (if any) software patents.
Developers: We can use your help.
The underlying architecture doesn't matter as much as most people think it does.
I can write you a scalable, reliable website using MS-Access. It will be slower and require a lot more code, but it will scale and remain up.
The whole "You use platform X, so your app won't scale" has been proven wrong by many large companies running large apps for almost every platform.
To reply to your flames, I'm currently finishing up an educational web app using PHP 5 and MySQL/Cluster 5. Redundant servers in the datacenter with load balancing, backup datacenter with a valid dataset always within a minute of the primary (specs allowed this), and it is *very* scalable. We hammered this with all manner of stress tools, very rarely had a problem. Added another server to the cluster, and went 5x beyond our max projected usage.
I prefer PHP/MySQL, have done ColdFusion, ASP, JSP, Postgres, MSSQL, and Oracle. Each has a cost/benefit that needs to be evaluated. Most projects, though, the platform just doesn't matter so much. PHP/MySQL examples are generally easy to read by everyone and work well for examples in this book.
All my personal sites plus all my side contracting sites run on LAMP.
I really enjoy working with PHP but...do a search on any tech job board and you will find all two job openings for people with LAMP experience. Embarrassed to say it but I went out and learned ASP.Net/C# so I could make a living.
I realize there are VERY large PHP/MySQL sites out there but I haven't had that many opportunities to scale a PHP app in a commercial environment. I wonder how many full time PHP developers there are out there and how many of those work on enterprise level websites. Can't be that many, can it?
(Perhaps we never see these types of openings (LAMP) because developers are so happy with their job that new positions rarely open - heh)
~CrnbrdEater
...but I really can't take any book seriously titled "Building Scalable Web Sites" that explains itself using PHP and mySQL. I know PHP/mySQL have their place but I just don't think of them as industrial strength.
No doubt there will be multiple posts following to tell me how wrong I am, but that's how I see it.
and overstock is cheaper still
023AD01("Child", "Evil");
The Free Software community decided a long time back that Amazon is not threatening anyone with its patents. GNU ended its boycott four years ago.
Your opinion is shared by many... but see this other post on Slashdot for my response.
$nice = $webHosting + $domainNames + $sslCerts
This was mentioned in Rich Bowen's excellent lightning talk and is listed as "experimental" in the Apache 2.0 docs but as an "extension" in the Apache 2.2 docs. Anyone have experience with this? Seems very tweakable...
The Army reading list
Without any caching the M in LAMP quickly becomes a bottleneck.
You're both wrong, the first and second rules of scalable, reliable websites are "Do not talk about scalable, reliable websites".
This text seems to balance general philosophy with LAMP implementation. Anyone got suggestions for other related books, preferably with a .NET or Java specific bent?
http://www.aceshardware.com/read.jsp?id=50000347
PHP just can't cut it.
As we all know, MySQL databases and the PHP (Personal Home Page) language can't be used for building robust enterprise apps (there's not even an enterprise version of PHP). It baffles me to see people building large reliable websites with these technologies. I'm very tempted to e-mail their webmasters and ask them how they are defying logic.
Badass Resumes
Bookpool is the cheapest
Heh. Funny that you should say that, given that my last two jobs involved (among other things) building Really Big web apps using LAMP for places like Dell (internal but vast amounts of data and traffic) and Dun and Bradstreet (publicly facing information service that was frequently in the top 500 busiest sites according to alexa, for what their stats are worth). I know we weren't alone in those projects, either. It really depends on where you are looking, different cities have vastly different characters. If you're in a place with lots of startup/R&D/academia, you'll see a higher % of listings using open source toolchains. If you're in a place where most of the openings are from "traditional"/old-line businesses (say shipping or insurance) you'll see a much higher % listing of things like AS/400, MSFT, etc. Java's the one toolchain that seems to do a reasonable job of thriving in both of those environments (Tomcat/Jboss for the startups, WebLogic for the megacorps, etc. etc.).
So fwiw if you're looking to do large corp/site LAMP work all I can say is look around a bit more. You may end up moving to a different metropolis, but sometimes that's a big win from the career/job pool standpoint anyway.
News for Geeks in Austin, TX
Save yourself $6.80 by buying the book here: Building Scalable Web Sites. And if you use the "secret" A9.com discount, you can save an extra 1.57%! That's a total savings of $7.20, or 22.83%!
"PHP just can't cut it"?
Um, care to explain just what in the hell that statement is based on, since the article you linked doesn't even mention PHP? It compares different webservers and cache settings. Differences in programming languages don't even enter into it.
Here's an article on scalability that's actually relevant to PHP, a case study about Digg.
Conclusion:
"It turns out that it really is fast and cheap to develop applications in PHP. Most scaling and performance challenges are almost always related to the data layer, and are common across all language platforms. [...] There is simply no truth to the idea that Java is better than scripting languages at writing scalable web applications. [...] it just isn't true to say that PHP doesn't scale, and with the rise of Web 2.0, sites like Digg, Flickr, and even Jobby are proving that large scale applications can be rapidly built and maintained on-the-cheap, by one or two developers."
Knock it off already.
I've had enough of the eternal Dimwits constantly bashing this or that with "MySQL not scalable" "PHP not scalable", blablabla.
PHP has arrived in the enterprise market. That's a fact. Yes, I know, Java has been there for 8 years, PHP is messy and quirky (so is Perl), MySQL isn't a DB, we've heard it all before.
In case you haven't noticed: PHP 5 is out. It's a full blown, mature PL and arguably the 400 pound gorilla of SSI solutions with a long history. MySQL 5 is out as well. It's a full blown DB and comes with tons of free x-platform admin and design tools that make building the outline of a large webapp a walk in the park and thus scares the living daylights out of Oracle and IBM. You may have noticed IBM virtually giving their DB2 away for free (beer) since just a few months ago. Guess how that happened.
Imagine someone would come along and tell you that large-scale webapps in Perl are a pipedream. Not too far-fetched in this context, no? And what about slashdot and kuro5hin?
PHP is as good a technology as any other in use when it comes to building large webapps (case in point: www.rubyonrails.org/index.php/ ). Industry strength PHP Frameworks are popping up left, right and center and other places like mushrooms after the rain. And as for MySQL "not being ready for large, scalable apps" - you're being silly.
We suffer more in our imagination than in reality. - Seneca
Yes, people can make high traffic sites in PHP. Notice how those sites are incredibly simple, served almost entirely from cache, and could run off of a single machine instead of several dozen had it been written by people with brains.
Better throw away your 14.400 analog modem...
1) Mod_perl
2) FastCGI
3) FastCGI or a daemon process using Apache2::* to integrate with Apache as an in-any-capacity servlet engine.
You use these to create idioms for how your cgis handle requests.
Then you move on to your Object persistence, Session handling... (may I suggest memcached?)
And you have choices there.
I guess that's what makes Perl nice in this sense... you can pick and choose from all different parts and put it together how you feel comfortable.
You can use HTML::Inline or Mason or SimpleTemplates or XSLTs or straight friggin prints for the View portion and they are all on equal footing.
OTOH if you are going to be cutting and pasting code from the web, then by all means, use Zend or Rails or some vertically integrated system. (Well I shouldn't lump rails in there so much but the ActiveRecord thing rubs me the wrong way)
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
It's not the best in the world but I still enjoy tinkering with http://wsframework.sf.net./ The theme is BAD but I'm working on it!
I'm a noob at this, but isn't the only factor in website speed the speed of the website's internet connection? I'm sure dinkySQL, mySQL, and MS SQL can all handle 10 simultaneous DSL clients connecting at 768k...and 768k * 10 is 7Mbits...those above programs can probably handle 100 or 1000 users, although this increases the bandwidth usage to 70Mbits and 700Mbits/second...for the price of those connections, can the website be written in any language with any database, and the speed difference could simply be accounted for with a $10k quad-processor raid-5 10k rpm system? Or is that even necessary, since most websites have close to zero processing required and can probably cache all the resources in RAM, and a P2 would be fast enough to host across a 1.4Mbit T-1? Or is this book aimed at people with 10+Mbit connections, who may or may not need 3 or 4 computers to send their web pages fast enough across the internet connection bottleneck?
Source - http://www.kaneva.com/channel/channelPage.aspx?communityId=12834&pageId=13293 [kaneva.com]
ajaxWrite is a web-based word processor that can read and write Microsoft Word and other standard document formats. Anytime you need to open, read or write a word processor file, simply point your Firefox browser to www.ajaxwrite.com and in seconds a full-featured program will be available for you to open, edit, print and save. ajaxWrite has been designed to look like Microsoft Word, making it easy for anyone to start using it without needing to learn a new program. ajaxWrite also handles all the popular document formats so it's easy to share your files and collaborate with your co-workers and friends. Once finished with your document, you can easily save your work right to your hard drive. This keeps you organized and works in the same way that you're already accustomed to.
I see all these people bash PHP claiming this that and the other.
Security: PHP isn't the problem, poor implementation is the problem (the coder) PHP's only hand in that is easily giving you the ability to do it. All languages are dangerous to the security ignorant.
Slow: PHP can be slow. So can straight C. If you know what you're doing, PHP can be blazing fast. There is a reason that so many large companies are picking it up. IBM, Oracle, Yahoo, etc...
MySQL gets the same treatment sometimes.
Both of these technologies are tremendous in what they provide. It's said "Knowledge is Power". If you think they're not powerful tools, maybe it's you that lack the knowledge and therefore the power.
In the end, it's still the best tool for the job. If you rule out PHP and MySQL without looking at them, it's your loss.
"Tools for other languages (in most cases, Perl) are mentioned in passing, nearly all of the code snippets are in PHP. MySQL 4.1 is the basis for most of the database-centered material."
I somehow thought that this was about building serious webapps for serious companies (i.e. the ones with the money to create scalable infrastructure). The semi-last sentence in the write-up killed all that. I have a title for other editions in the same series: 'Using legos to build skyscrapers', or 'Building scalable rockets using play-doh'.
Seriously though - such books mustn't focus on one technology or the other. I'm no java-fanboy myself (per se, that is; I like java). But scalability is much better documented using abstraction, divided in all of its relevant parts; networking, hardware, operating systems (lots of it should be here) and software. And not even mentioning php or mysql. Describing scalability in this way immediately erases all your claims of being a professional.
Religion is what happens when nature strikes and groupthink goes wrong.