Slashdot Mirror


Building Scalable Web Sites

briandon writes "It's not a step-by-step guide (and doesn't claim to be one), but Building Scalable Web Sites is the closest thing available to a nuts-and-bolts look at managing the technical aspects of doing a Web-based startup. There's lots of code inside, but the book isn't built around a single, extremely contrived, case study like an online wine store. Instead, most of the chapters follow a general pattern: a topic (like bottlenecks in your application and platform, scaling, or monitoring) is addressed and some rules of thumb that describe the way that the author feels things should be done are set forth and explained, with lots of very specific hints and factoids mixed in along the way. Tools for other languages (in most cases, Perl) are mentioned in passing, but nearly all of the code snippets are in PHP. MySQL 4.1 is the basis for most of the database-centered material." Read the rest of Brian's review.

Building Scalable Web Sites: Building, Scaling, and Optimizing the Next Generation of Web Applications
author: Cal Henderson
pages: 330
publisher: O'Reilly Media, Inc.
rating: 9/10
reviewer: Brian Donovan
ISBN: 0596102356
summary: If you've been kicking around the idea of doing a Web startup, then you should definitely give this book a read.

Henderson's resume, which can be found on his personal website, indicates that he joined Ludicorp about a year before they shut down GNE, their Web-based roleplaying game, to focus on Flickr (which had originally begun as an offshoot of the game). It's his role as web development lead at Ludicorp that led to the inclusion of the "The Flickr Way" sub-subtitle that runs diagonally across the upper right corner of the book's front cover.

The five-page-long first chapter sets the stage for the rest of the book with section headings that are all questions: "What is a Web application?", "How do you build Web applications?", "What is architecture?", and "How do I get started?".

Chapter two, "Web Application Architecture", begins with Henderson drawing an analogy between a web app and a type of multi-tiered dessert known as a trifle - the sponge cake at the bottom of the dish is the database, the next layer up, jell-o, is the business logic, and so on. The black and white image in the text is identical to the color image included in a slide from an eight-hour workshop that the author gave in San Francisco titled "How We Built Flickr". Having read the book and some reviews of his workshops and looked at the list of talks on Henderson's site (some with Powerpoint decks for download), it seems likely that a lot of the ideas expressed in the book were developed over an extended period of time through repeated presentations.

Next up are the considerations around development environments, beginning with a 3-point list of guidelines for building small-scale web apps up into big ones: use source control, have a one-step build process (literally, if possible, a single button), and track bugs (as well as non-bug items like features and support requests). Readers get to feast their eyes on a cropped screenshot of Flickr's build control panel (two buttons, "perform staging" and "perform deployment", to match the last two steps in the release sequence in an HTML form). For small teams, the author is in favor of allowing multiple developers to trigger releases and he suggests several ways of trying to keep that workable. In version control, Subversion gets the nod and, though no bugtracking tool is singled out as the best, FogBugz garners the highest praise ("extremely effective") and has the shortest list of "cons". The author never comes out and says what the Flickr / Flickr-Yahoo team uses in either area, however.

Chapter four is the most readable introduction to internationalization, localization, and Unicode that I've seen up to this point. MySQL's currently incomplete implementation of UTF-8, sarcastically referred to by some as "UTF-7½" (Google for it), is mentioned in enough detail that a reader can decide whether or not it's likely to be an issue for their app. The book as a whole is packed with little nuggets of information like that - things you might not have otherwise been even peripherally aware of until they bit you.

Input filtering and strategies for avoiding building cross-site-scripting and SQL injection vulnerabilities into your app are addressed in a chapter on data integrity and security that shows the same attention to detail as the rest of the book. The section on UTF-8 filtering, for example, features a three-way benchmark of UTF-8 validation techniques (using regular expressions, iconv, and ord()) and the merits of each approach are considered.
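To give a feel for what is being compared, here is a minimal sketch of two of those approaches. It is in Python rather than the book's PHP, and it is not Henderson's code: the function names are hypothetical, the decoder-based check stands in for the iconv approach, and the byte-by-byte walk stands in for the ord() approach. The byte ranges follow the UTF-8 definition (no overlong forms, no surrogates, nothing above U+10FFFF).

```python
def is_valid_utf8_decoder(data: bytes) -> bool:
    """Library approach: let the built-in strict decoder accept or reject."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf8_bytewise(data: bytes) -> bool:
    """Manual approach: inspect each lead byte and its continuation bytes."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # plain ASCII, one byte
            i += 1
            continue
        # For each lead byte: number of continuation bytes, plus the
        # allowed range of the FIRST continuation byte.
        if 0xC2 <= lead <= 0xDF:             # 0xC0/0xC1 would be overlong
            extra, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:                   # exclude overlong 3-byte forms
            extra, lo, hi = 2, 0xA0, 0xBF
        elif lead == 0xED:                   # exclude UTF-16 surrogates
            extra, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= lead <= 0xEF:
            extra, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:                   # exclude overlong 4-byte forms
            extra, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= lead <= 0xF3:
            extra, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:                   # nothing above U+10FFFF
            extra, lo, hi = 3, 0x80, 0x8F
        else:                                # stray continuation or 0xF5+
            return False
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or not lo <= tail[0] <= hi:
            return False
        if any(not 0x80 <= b <= 0xBF for b in tail[1:]):
            return False
        i += 1 + extra
    return True
```

The decoder version is shorter and harder to get wrong; the manual version makes it obvious exactly which byte sequences are being rejected, which is the kind of trade-off the book's benchmark weighs.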

The coverage of handling emails programmatically in chapter six is also quite good. Henderson does the basics and then delves into a number of possible pitfalls in considerable detail. The salient aspects of the TNEF (media type application/ms-tnef) format, used by MS Outlook for attachments and metadata, for instance, are explained and pointers are given to open source TNEF parser implementations. I also got a lot out of the section on dealing with email from wireless devices like mobile phones, titled "Wireless Carriers Hate You" (there's that dry British wit again).
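As a rough illustration of the first step in handling such mail, the sketch below walks a parsed message and reports any TNEF parts so they can be handed to a dedicated parser. It uses Python's standard email package rather than anything from the book; the function name and sample message are made up, the payload is a placeholder, and treating application/vnd.ms-tnef as an alternate spelling of the media type is my assumption.

```python
from email import message_from_string

# Media types treated as TNEF here. The first is the one named in the
# review; the second is an assumed alternate spelling.
TNEF_TYPES = {"application/ms-tnef", "application/vnd.ms-tnef"}


def find_tnef_parts(raw_message: str) -> list:
    """Return the filenames of any TNEF attachments in a raw email."""
    msg = message_from_string(raw_message)
    return [
        part.get_filename() or "(unnamed)"
        for part in msg.walk()                    # visits every MIME part
        if part.get_content_type() in TNEF_TYPES
    ]


# Hypothetical message of the kind Outlook might produce.
SAMPLE = (
    "From: sender@example.com\n"
    "To: upload@example.com\n"
    "Subject: photo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "See attachment.\n"
    "--XYZ\n"
    'Content-Type: application/ms-tnef; name="winmail.dat"\n'
    'Content-Disposition: attachment; filename="winmail.dat"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "AAAA\n"
    "--XYZ--\n"
)
```

Calling find_tnef_parts(SAMPLE) picks out winmail.dat; actually unpacking the files inside it is the job of one of the TNEF parsers the book points to.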

The second half of the book (chapters seven through eleven) focuses more on scalability. It's also where you'll find the most material on using MySQL, including but not limited to query profiling and optimization, a discussion of the merits of denormalizing once you begin to reach a certain scale, and a comparison of the different MySQL backends. There's an entire chapter devoted to finding and dealing with bottlenecks - how to determine whether your app is CPU-bound, I/O-bound, or context-switching-bound and what to do about it. The chapter on scaling begins by debunking the "scaling myth" (but he actually tackles several misconceptions at once - namely that scalability is synonymous with speed, that scalability is a byproduct of having written your app in Java, etc.) before getting into vertical vs. horizontal scaling (buying more powerful and expensive servers vs. adding more cheap servers), load balancing, and more. Monitoring (both of web stats and your application itself) and APIs (RSS/RDF/Atom feeds, mobile content delivery formats like WAP and XHTML Mobile, and REST, XML-RPC, and SOAP Web services) both get chapters of their own.

Henderson's sense of humor is evident throughout the book, but not in the annoying, overly cutesy way that made me want to toss "Extreme Programming Installed" into the circular filing drawer. In the section on software interface design (where he means the interfaces between the layers of the trifle), for example, there's a "Web Application Scale of Stupidity" that places "sanity" in the center and OGF (one giant function) and OOP at the extremes. The process of separating web app logic from presentation is broken down into 3 steps: separating logic code from markup, splitting the markup into per-page files, and moving to a templating system. He closes out the chapter with a breakdown of the hosting, hardware, and networking issues involved in serving up web apps.
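For readers who haven't been through that separation before, the end state looks roughly like the sketch below. It is in Python with the standard library's string.Template rather than the book's PHP, and every name in it (PHOTO_PAGE, load_photo, render_photo_page) is hypothetical; it only illustrates the idea that the logic prepares data and the template owns the markup.

```python
from html import escape
from string import Template

# Presentation: the markup lives in a template. In a real app this would
# be a per-page file on disk rather than an inline string.
PHOTO_PAGE = Template(
    "<h1>$title</h1>\n"
    "<p>Uploaded by $owner</p>\n"
)


def load_photo(photo_id: int) -> dict:
    """Logic: fetch and prepare data. No markup is produced here.

    Hypothetical stand-in for a database lookup.
    """
    return {"title": "Trifle <close-up>", "owner": "cal"}


def render_photo_page(photo_id: int) -> str:
    """Glue: escape the data, then hand it to the template."""
    photo = load_photo(photo_id)
    safe = {key: escape(value) for key, value in photo.items()}
    return PHOTO_PAGE.substitute(safe)
```

Because load_photo never touches HTML, the page can be redesigned without editing the logic, and the escaping happens in exactly one place.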

Technically, I think that Building Scalable Web Sites is 100%. There were just a few niggling flaws. Two dates given (both on page 155), 1990 for the creation of libxml and 1995 for the design of XML-RPC, are incorrect and I spotted a handful of grammatical mistakes (probably proportionately fewer than in this review) that I've already submitted, along with the date mistakes, as errata through the form linked from the O'Reilly catalog page for the book.

Additionally, though the cover does say "The Flickr Way", you won't find many sentences that begin "At Flickr, we [...]". Aside from the "Rolling Your Own" section in chapter seven describing some custom middleware and a protocol that they whipped up for moving files around within their system, there aren't a lot of explicit details about the way that Flickr operates in the book. You'll actually get more insider info from Tim O'Reilly's "Database War Stories" entry regarding Flickr, which is based on Henderson's answers to questions posed by O'Reilly, than from this book.

If you'd like to get a feel for Henderson's style, chapter five ("Data Integrity and Security") is available as a PDF on the O'Reilly catalog page for the book and Henderson has also put some articles online (all PDFs, not much overlap with the material in BSWS) at his website.

You can purchase Building Scalable Web Sites : Building, Scaling, and Optimizing the Next Generation of Web Applications from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

7 of 124 comments

  1. Third rule of scalable, reliable, websites: by Anonymous Coward · · Score: 5, Informative

    The underlying architecture doesn't matter as much as most people think it does.

    I can write you a scalable, reliable website using MS-Access. It will be slower and require a lot more code, but it will scale and remain up.

    The whole "You use platform X, so your app won't scale" has been proven wrong by many large companies running large apps for almost every platform.

    To reply to your flames, I'm currently finishing up an educational web app using PHP 5 and MySQL/Cluster 5. Redundant servers in the datacenter with load balancing, backup datacenter with a valid dataset always within a minute of the primary (specs allowed this), and it is *very* scalable. We hammered this with all manner of stress tools, very rarely had a problem. Added another server to the cluster, and went 5x beyond our max projected usage.

    I prefer PHP/MySQL, have done ColdFusion, ASP, JSP, Postgres, MSSQL, and Oracle. Each has a cost/benefit that needs to be evaluated. Most projects, though, the platform just doesn't matter so much. PHP/MySQL examples are generally easy to read by everyone and work well for examples in this book.

    1. Re:Third rule of scalable, reliable, websites: by Lao-Tzu · · Score: 3, Informative
      I can write you a scalable, reliable website using MS-Access. It will be slower and require a lot more code, but it will scale and remain up.

      You're absolutely correct. However, the easiest way to build a scalable website with MS-Access would be to start caching the entire contents of the database in the application layer. At this point, you might as well be using a flat text file for a database, since the database engine is not solving any problems for you. It's not contributing to developing your application. It's not useful.

      Any good discussion of building scalable websites should start with discussing tools. You can make do with a heavy rock, but when you have the choice to use a hammer instead...

      Just as an example, let's take your educational web application. Kudos for getting it working well. Would you like to port it to Access, with the same availability requirements? If there was an ideal DatabaseDbSQLServer software that could meet the same requirements with half the hardware, would you consider using it?

      The point I'm trying to make is that the underlying architecture does matter. It won't matter to the user when the application is done, but it does matter to people who are Building Scalable Web Sites, the title of this book.

  2. Re:Alright, I know this may be flamebait... by PornMaster · · Score: 3, Informative

    Having read (and enjoyed) the book... despite using PHP for the examples, there's relatively little dependent on PHP in the text. This isn't a "write really fast PHP code" book. It's about designing systems and process instead of just a web site. It's about setting things up in a way that they'll be maintainable, and you won't have hogtied yourself by putting the logic and the HTML together. It mentions the importance of defining a coding style, whatever that is, so when you have a bunch of developers, there will be consistency... and that the choice of style isn't as important as defining one.

    There's lots missing still... and the long focus on unicode, localization, etc is a bit tedious to get through... but overall, it's a book that I wish that people at $WORK were forced to read.

  3. Re:PHP and Industry by truthsearch · · Score: 4, Informative

    *raises hand*

    My company creates very large sites with LAMP. We also do Python, Flash, etc. so it's not because we only know PHP. If you don't see the job postings it's for one of 2 reasons: you're looking in the wrong places; or the jobs are largely found by networking and other methods.

    Large financial companies will find developers through job posting sites and head hunters. These companies usually develop on commercial platforms (.NET, websphere, etc.). But large web sites are usually owned by relatively small companies who use more networking and direct contact with open source developers.

    PHP and MySQL are quite capable of running large web sites. They were not created with large scale in mind, however, so there are special considerations you need to keep in mind. I don't recommend it for every large site, but in the right situations it works.

  4. Here's an article ACTUALLY MENTIONING PHP by warith · · Score: 5, Informative

    "PHP just can't cut it"?

    Um, care to explain just what in the hell that statement is based on, since the article you linked doesn't even mention PHP? It compares different webservers and cache settings. Differences in programming languages don't even enter into it.

    Here's an article on scalability that's actually relevant to PHP, a case study about Digg.

    Conclusion:

    "It turns out that it really is fast and cheap to develop applications in PHP. Most scaling and performance challenges are almost always related to the data layer, and are common across all language platforms. [...] There is simply no truth to the idea that Java is better than scripting languages at writing scalable web applications. [...] it just isn't true to say that PHP doesn't scale, and with the rise of Web 2.0, sites like Digg, Flickr, and even Jobby are proving that large scale applications can be rapidly built and maintained on-the-cheap, by one or two developers."

    1. Re:Here's an article ACTUALLY MENTIONING PHP by Anonymous Coward · · Score: 1, Informative

      So, if PHP is scalable, why is Digg so painfully slow? Seriously, if I open Slashdot (Perl) and Digg (PHP) side by side, I can read about five Slashdot stories and all the 5-rated comments in the time it takes Digg just to start displaying the first page.

      You can't blame the hardware. Scalability is about not needing to keep throwing hardware at the problem...

  5. Re:Prepare for massive PHP bashing in 3, 2, 1, ... by drew · · Score: 2, Informative

    MySQL 5 is out as well. It's a full blown DB and comes with tons of free x-platform admin and design tools that make building the outline of a large webapp a walk in the park and thus scares the living daylights out of Oracle and IBM.

    That may be, but when they release MySQL 12, I still won't be using it if it's still written by the same developers that claimed for years that adding referential integrity to a database just slows it down and programmers should be handling that in their application code, implemented "transactions without atomicity", insert statements that return the value of the "inserted" ID whether the insert succeeded or not, and have otherwise generally demonstrated for at least the last 8 years that they know nothing at all about designing a reliable relational database. (Or maybe I'm just scarred for life from having to truncate corrupted tables once a week the last time I was responsible for maintaining a MySQL database.)

    PHP is a decent platform on the other hand (when combined with a competent database), although I find the language itself to be rather quirky. Maybe they've improved this in PHP5, which I haven't used very much. Personally, I prefer JavaScript as a server side scripting language, but the only platform I know that has that as an option is ASP. It would be nice if PHP would go the ASP route as far as separating out the scripting language from the rest of the platform. Then you could still use PHP's strengths but not be tied to one language. For example, people who like Ruby but just don't get the hype behind Ruby on Rails could still have a really great web development platform.

    --
    If I don't put anything here, will anyone recognize me anymore?