Slashdot Mirror


Building Scalable Web Sites

briandon writes "It's not a step-by-step guide (and doesn't claim to be one), but Building Scalable Web Sites is the closest thing available to a nuts-and-bolts look at managing the technical aspects of doing a Web-based startup. There's lots of code inside, but the book isn't built around a single, extremely contrived, case study like an online wine store. Instead, most of the chapters follow a general pattern: a topic (like bottlenecks in your application and platform, scaling, or monitoring) is addressed and some rules of thumb that describe the way that the author feels things should be done are set forth and explained, with lots of very specific hints and factoids mixed in along the way. Although tools for other languages (in most cases, Perl) are mentioned in passing, nearly all of the code snippets are in PHP. MySQL 4.1 is the basis for most of the database-centered material." Read the rest of Brian's review.

Building Scalable Web Sites : Building, Scaling, and Optimizing the Next Generation of Web Applications
author: Cal Henderson
pages: 330
publisher: O'Reilly Media, Inc.
rating: 9/10
reviewer: Brian Donovan
ISBN: 0596102356
summary: If you've been kicking around the idea of doing a Web startup, then you should definitely give this book a read.

Henderson's resume, which can be found on his personal website, indicates that he joined Ludicorp about a year before they shut down GNE, their Web-based roleplaying game, to focus on Flickr (which had originally begun as an offshoot of the game), and it's his role as web development lead at Ludicorp that led to the inclusion of the "The Flickr Way" sub-subtitle that runs diagonally across the upper right corner of the book's front cover.

The five-page-long first chapter sets the stage for the rest of the book with section headings that are all questions: "What is a Web application?", "How do you build Web applications?", "What is architecture?", and "How do I get started?".

Chapter two, "Web Application Architecture", begins with Henderson drawing an analogy between a web app and a type of multi-tiered dessert known as a trifle - the sponge cake at the bottom of the dish is the database, the next layer up, jell-o, is the business logic, and so on. The black and white image in the text is identical to the color image included in a slide from an eight-hour workshop that the author gave in San Francisco titled "How We Built Flickr". Having read the book and some reviews of his workshops and looked at the list of talks on Henderson's site (some with PowerPoint decks for download), it seems likely that a lot of the ideas expressed in the book were developed over an extended period of time through repeated presentations.

Next up are the considerations around development environments, beginning with a 3-point list of guidelines for building small-scale web apps up into big ones: use source control, have a one-step build process (literally, if possible, a single button), and track bugs (as well as non-bug items like features and support requests). Readers get to feast their eyes on a cropped screenshot of Flickr's build control panel (two buttons, "perform staging" and "perform deployment", to match the last two steps in the release sequence in an HTML form). For small teams, the author is in favor of allowing multiple developers to trigger releases and he suggests several ways of trying to keep that workable. In version control, Subversion gets the nod and, though no bugtracking tool is singled out as the best, FogBugz garners the highest praise ("extremely effective") and has the shortest list of "cons". The author never comes out and says what the Flickr / Flickr-Yahoo team uses in either area, however.

Chapter four is the most readable introduction to internationalization, localization, and Unicode that I've seen up to this point. MySQL's currently incomplete implementation of UTF-8, sarcastically referred to by some as "UTF-7½" (Google for it), is mentioned in enough detail that a reader can decide whether or not it's likely to be an issue for their app. The book as a whole is packed with little nuggets of information like that - things you might not have otherwise been even peripherally aware of until they bit you.

Input filtering and strategies for avoiding building cross-site-scripting and SQL injection vulnerabilities into your app are addressed in a chapter on data integrity and security that shows the same attention to detail as the rest of the book. The section on UTF-8 filtering, for example, features a three-way benchmark of UTF-8 validation techniques (using regular expressions, iconv, and ord()) and the merits of each approach are considered.
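For readers who want a feel for what such a comparison involves, here is a minimal Python sketch of two of the three approaches: a byte-level regular expression and a strict decode, which plays roughly the role that iconv plays in PHP. This is not the book's code. The function names are made up for illustration, and the pattern follows the table of well-formed byte sequences in the Unicode Standard.

```python
import re

# Byte-level pattern for well-formed UTF-8: no overlong encodings,
# no surrogates, nothing above U+10FFFF.
_WELL_FORMED_UTF8 = re.compile(
    rb"(?:"
    rb"[\x00-\x7F]"                          # 1 byte: ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"              # 2 bytes
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # 3 bytes, excluding overlongs
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"   # 3 bytes, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"          # 3 bytes, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # 4 bytes, excluding overlongs
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # 4 bytes, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"       # 4 bytes, up to U+10FFFF
    rb")*"
)


def is_valid_utf8_regex(data: bytes) -> bool:
    """Validate by matching the whole input against the byte pattern."""
    return _WELL_FORMED_UTF8.fullmatch(data) is not None


def is_valid_utf8_decode(data: bytes) -> bool:
    """Validate by asking the built-in codec to decode strictly."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = [b"caf\xc3\xa9", b"\xc0\x80", b"\xed\xa0\x80", b"\xff"]
    for sample in samples:
        print(sample, is_valid_utf8_regex(sample), is_valid_utf8_decode(sample))
```

Both functions should agree on every input. Which one is faster depends on the runtime and the data, which is exactly why a benchmark like the one in the book is worth running on your own stack.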

The coverage of handling emails programmatically in chapter six is also quite good. Henderson does the basics and then delves into a number of possible pitfalls in considerable detail. The salient aspects of the TNEF (media type application/ms-tnef) format, used by MS Outlook for attachments and metadata, for instance, are explained and pointers are given to open source TNEF parser implementations. I also got a lot out of the section on dealing with email from wireless devices like mobile phones, titled "Wireless Carriers Hate You" (there's that dry British wit again).

The second half of the book (chapters seven through eleven) focuses more on scalability. It's also where you'll find the most material on using MySQL, including but not limited to query profiling and optimization, a discussion of the merits of denormalizing once you begin to reach a certain scale, and a comparison of the different MySQL backends. There's an entire chapter devoted to finding and dealing with bottlenecks - how to determine whether your app is CPU-bound, I/O-bound, or context-switching-bound and what to do about it. The chapter on scaling begins by debunking the "scaling myth" (but he actually tackles several misconceptions at once - namely that scalability is synonymous with speed, that scalability is a byproduct of having written your app in Java, etc.) before getting into vertical vs. horizontal scaling (buying more powerful and expensive servers vs. adding more cheap servers), load balancing, and more. Monitoring (both of web stats and your application itself) and APIs (RSS/RDF/Atom feeds, mobile content delivery formats like WAP and XHTML mobile, and REST, XML-RPC, and SOAP Web services) both get chapters of their own.
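As a rough illustration of the CPU-bound vs. I/O-bound question, here is a small Python sketch that compares the CPU time a task consumes with the wall-clock time it takes. This is my own simplification, not a technique taken from the book. The names `profile` and `classify`, the labels, and the 0.5 threshold are all assumptions chosen for the example.

```python
import time
from typing import Callable, Tuple


def profile(task: Callable[[], object]) -> Tuple[float, float]:
    """Run task once and return (cpu_seconds, wall_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return cpu_seconds, wall_seconds


def classify(cpu_seconds: float, wall_seconds: float, threshold: float = 0.5) -> str:
    """Label a measurement by the share of wall time spent on the CPU.

    The threshold is an arbitrary illustrative cut-off, not a standard value.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    if cpu_seconds / wall_seconds >= threshold:
        return "cpu-bound"
    return "io-or-wait-bound"


if __name__ == "__main__":
    # A sleep stands in for waiting on a disk, a database, or the network.
    cpu, wall = profile(lambda: time.sleep(0.2))
    print(classify(cpu, wall))
```

A task that spends most of its wall-clock time off the CPU is waiting on something else, and that something is where to look next. Real diagnosis on a server would lean on system tools rather than a stopwatch inside the app.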

Henderson's sense of humor is evident throughout the book, but not in the annoying, overly cutesy way that made me want to toss "Extreme Programming Installed" into the circular filing drawer. In the section on software interface design (where he means the interfaces between the layers of the trifle), for example, there's a "Web Application Scale of Stupidity" that places "sanity" in the center and OGF (one giant function) and OOP at the extremes. The process of separating web app logic from presentation is broken down into 3 steps: separating logic code from markup, splitting the markup into per-page files, and moving to a templating system. He closes out the chapter with a breakdown of the hosting, hardware, and networking issues involved in serving up web apps.
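To make the logic/markup separation concrete, here is a minimal Python sketch of where the three steps end up, using the standard library's `string.Template` as a stand-in for a real templating system (the book's examples are in PHP). The album data, the function names, and the templates are hypothetical and exist only for illustration.

```python
from html import escape
from string import Template

# Presentation: markup lives in templates. In a real app these would be
# per-page files rather than inline strings.
PAGE_TEMPLATE = Template("<h1>$title</h1>\n<ul>\n$items\n</ul>")
ITEM_TEMPLATE = Template("  <li>$name ($count photos)</li>")


def load_albums():
    """Logic: fetch the data. Hard-coded here; a real app would query a database."""
    return [
        {"name": "Trifle & cake", "count": 12},
        {"name": "Holiday", "count": 3},
    ]


def render_albums(title, albums):
    """Glue: escape values and hand them to the templates."""
    items = "\n".join(
        ITEM_TEMPLATE.substitute(name=escape(a["name"]), count=int(a["count"]))
        for a in albums
    )
    return PAGE_TEMPLATE.substitute(title=escape(title), items=items)


if __name__ == "__main__":
    print(render_albums("My albums", load_albums()))
```

The point of the split is that `load_albums` can change its data source without touching the markup, and a designer can edit the templates without reading the logic.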

Technically, I think that Building Scalable Web Sites is 100% sound. There were just a few niggling flaws. Two dates given (both on page 155), 1990 for the creation of libxml and 1995 for the design of XML-RPC, are incorrect and I spotted a handful of grammatical mistakes (probably proportionately fewer than in this review) that I've already submitted, along with the date mistakes, as errata through the form linked from the O'Reilly catalog page for the book.

Additionally, though the cover does say "The Flickr Way", you won't find many sentences that begin "At Flickr, we [...]". Aside from the "Rolling Your Own" section in chapter seven describing some custom middleware and a protocol that they whipped up for moving files around within their system, there aren't a lot of explicit details about the way that Flickr operates in the book. You'll actually get more insider info from Tim O'Reilly's "Database War Stories" entry regarding Flickr, which is based on Henderson's answers to questions posed by O'Reilly, than from this book.

If you'd like to get a feel for Henderson's style, chapter five ("Data Integrity and Security") is available as a PDF on the O'Reilly catalog page for the book and Henderson has also put some articles online (all PDFs, not much overlap with the material in BSWS) at his website.

You can purchase Building Scalable Web Sites : Building, Scaling, and Optimizing the Next Generation of Web Applications from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

17 of 124 comments

  1. Re:I know they get kickbacks, but... by truthsearch · · Score: 2, Insightful

    Maybe /. doesn't want to support Amazon due to their stance on patents. Both Amazon and B&N have affiliate programs. B&N has fewer (if any) software patents.

  2. PHP and Industry by crnbrdeater · · Score: 4, Insightful

    All my personal sites plus all my side contracting sites run on LAMP.

    I really enjoy working with PHP but...do a search on any tech job board and you will find all two job openings for people with LAMP experience. Embarrassed to say it but I went out and learned ASP.Net/C# so I could make a living.

    I realize there are VERY large PHP/MySQL sites out there but I haven't had that many opportunities to scale a PHP app in a commercial environment. I wonder how many full-time PHP developers there are out there and how many of those work on enterprise-level websites. Can't be that many, can it?

    (Perhaps we never see these types of openings (LAMP) because developers are so happy with their jobs that new positions rarely open - heh)

    --
    ~CrnbrdEater
    1. Re:PHP and Industry by suggsjc · · Score: 5, Insightful

      Or...could it be because the LAMP sites don't need to continually add new developers?

      One of the reasons that you don't find openings specifically looking for LAMP experience is probably "the right tool for the right job": large-scale sites aren't going to use strictly LAMP or any specific architecture, but rather a mix of tools. Also, large-scale sites will probably want people for specific tasks (each aspect of LAMP individually) instead of a jack of all trades.

      --
      When I have a kid, I want to put him in one of those strollers for twins and then run around the mall looking frantic.
    2. Re:PHP and Industry by drinkypoo · · Score: 4, Insightful
      Object-oriented is just not necessary for most things on the web.

      you do realize that for most people, non-OO vs. OO largely boils down to replace(mystring, "foo", "bar") vs. mystring.replace("foo", "bar"). (Whether that's actually correct syntax in any language is another discussion.) You don't have to program in an OO manner just because you're using an OO-capable language.

      With that said, why wouldn't you want to do OO? It's highly useful in a web environment, especially since we tend to think in terms of objects on pages anyway, even if we mean something slightly different by "object".

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    3. Re:PHP and Industry by crnbrdeater · · Score: 2, Insightful

      Depends. Code-behind files really help separate the logic from the display. If you use the built-in data readers and such you do rely on ASP.NET's built-in JavaScript functions, but that is no way to build high-demand sites. We end up not using many of ASP.NET's built-in front-end features for performance reasons.

      While your comment about OO for front-end stuff is true, most of our business is back-end processing. The website itself is just a pretty front for all the work that is being done behind the scenes. With 100+ developers it is helpful to use OO.

      Right or not, most large companies prefer MS over OSS.

      --
      ~CrnbrdEater
    4. Re:PHP and Industry by cartel · · Score: 2, Insightful
      you do realize that for most people, non-OO vs. OO largely boils down to replace(mystring, "foo", "bar") vs. mystring.replace("foo", "bar"). (Whether that's actually correct syntax in any language is another discussion.) You don't have to program in an OO manner just because you're using an OO-capable language.

      I forgot about that. Maybe it's good for small things like that, but with web sites (unless you decide to use Javascript and make it interactive) you generally don't have the need for objects (i.e., people, cars, etc.). You're more or less just spitting out content, and then you're done. And you also have no event handling going on which requires the use of objects.

      I'm curious though. Besides working with strings, can you give me an example of some objects (OO objects, that is) you might have in a normal web page?

    5. Re:PHP and Industry by Doctor+Memory · · Score: 2, Insightful

      If you're up to speed with a basic web framework (e.g. Struts), doing the basic development for a site like that would take a day or so (less if you have a previous project you can cannibalize). That's all the programming; fighting with the designer to lock down the design and codify the CSS and make it all look pretty will take another week (at least!). Seriously, though, if you have a good designer who can mock stuff up in HTML and write the CSS, a couple of weeks should be plenty of time. That doesn't include round-trips with the client when they have new ideas based on what you show them originally, and it doesn't include end-to-end testing and integration with any of their existing systems (inventory / product catalog / sales).

      From what I've seen, professional web site development proceeds at the client's pace, regardless of the language used. Yeah, you can bang out a prototype faster in some technologies, but 80% of the time is spent reworking the site to comply with changing specifications, and it's really a wash whether it's faster to add or move a form field with PHP, Perl, Ruby or Java. If you're doing "proper" development (separating presentation and logic, using CSS to define look-and-feel, writing modular code), change costs are largely invariant across technologies for simple sites (like the one you pointed out).

      --
      Just junk food for thought...
  3. Alright, I know this may be flamebait... by confusednoise · · Score: 4, Insightful

    ...but I really can't take any book seriously titled "Building Scalable Web Sites" that explains itself using PHP and mySQL. I know PHP/mySQL have their place but I just don't think of them as industrial strength.

    No doubt there will be multiple posts following to tell me how wrong I am, but that's how I see it.

    1. Re:Alright, I know this may be flamebait... by Anonymous+Crowhead · · Score: 3, Insightful

      Another thing to note: If you are in charge of "building a scalable website", and you do not know how to "build a scalable website" and thus resort to reading a book entitled "building scalable websites", then you should probably not be "building a scalable website."

    2. Re:Alright, I know this may be flamebait... by budgenator · · Score: 2, Insightful

      If what you are really after is an iron-clad, explosion-proof, enterprise-scaled, industrial-strength web application where any downtime results in a horrendous loss of revenue, you're not going to grab PHP/MySQL off the shelf and run with it any more than you are going to grab anything off the shelf and run with it. The bottom line will always be that a well-designed and well-coded application connecting to a mid-level database on commodity hardware will outperform the best database running on the best hardware when the connecting application is poorly designed and coded. When the stakes are high, every component needs to be critically analysed and tested; nothing gets a free pass based on he-said-she-said or "I think".

      --
      Apocalypse Cancelled, Sorry, No Ticket Refunds
  4. Well, you're not alone by Dekortage · · Score: 2, Insightful

    Your opinion is shared by many... but see this other post on Slashdot for my response.

    --
    $nice = $webHosting + $domainNames + $sslCerts
  5. Nothing about Caching? by zigamorph · · Score: 3, Insightful

    Without any caching the M in LAMP quickly becomes a bottleneck.

  6. Here's an article on actual scalability by Anonymous Coward · · Score: 1, Insightful
  7. It's amazing how many people break these rules by Reverend528 · · Score: 2, Insightful

    As we all know, MySQL databases and the PHP (Personal Home Page) language can't be used for building robust enterprise apps (there's not even an enterprise version of PHP). It baffles me to see people building large reliable websites with these technologies. I'm very tempted to e-mail their webmasters and ask them how they are defying logic.

    1. Re:It's amazing how many people break these rules by ukpyr · · Score: 2, Insightful

      ideal != workable

      I work in Perl (mod_perl), PHP, and Java on a daily basis. For simple one-shot applications or very narrowly focused projects (see your examples) PHP works fine and is a fairly speedy tool to use. When you introduce an environment like an intranet, or interaction with a non-MySQL database, complex processes, or (SHOCK!) non-web applications, Java has a huge advantage due to its rigid structure. For complex, larger-team applications or groups of applications, PHP falls short very quickly.

      Best tool for the job rules the day.

    2. Re:It's amazing how many people break these rules by taosystems · · Score: 2, Insightful

      You're confusing webapp data with commercial data. Sites like wikipedia.org and friendster.com don't need to keep multi-indexes on multi-tables; they only need to update ONE record at a time, with little interaction involved with other data tables. And most of the data stored is static, which, combined with simpler db fetches, makes for a faster site than you think. And let's not forget about caching the compiled PHP code itself. It may not be for the purist, but for the job, it works well.

  8. Re:Here's an article ACTUALLY MENTIONING PHP by warith · · Score: 2, Insightful

    "So, if PHP is scalable, why is Digg so painfully slow?"

    Well, as clearly identified in both the article and the quote I provided, all the scalability issues they encountered were related to their DATABASE LAYER. So my first guess (based on this case study) would be that Digg's database architecture is still inferior to Slashdot's, instead of a knee-jerk condemnation of PHP. YMMV.

    "Scalability is about not needing to keep throwing hardware at the problem..."

    Wrong... scalability is precisely about the ability to meet increased demand gracefully with modest increases in resources (usually hardware).

    Scalability does NOT mean, "the ability to handle increasing load with the same resources".