Taco Bell Programming

Posted by timothy on Sunday October 24, 2010 @09:42AM from the how-dare-you-insult-the-code-monkeys dept.

theodp writes "Think outside the box? Nah, think outside the bun. Ted Dziuba argues there's a programming lesson to be learned from observing how Taco Bell manages to pull down $1.9 billion by mixing-and-matching roughly eight ingredients: 'The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability. This is the opposite of a trend of nonsense called DevOps, where system administrators start writing unit tests and other things to help the developers warm up to them — Taco Bell Programming is about developers knowing enough about Ops (and Unix in general) so that they don't overthink things, and arrive at simple, scalable solutions.'"

Weak error handling by Animats · 2010-10-24 11:54 · Score: 4, Informative
A big problem with shell programming is that the error information coming back is so limited. You get back a numeric status code, if you're lucky, or maybe a "broken pipe" signal. It's difficult to handle errors gracefully. This is a killer in production applications.
Here's an example. The original article talks about reading a million pages with "wget". I doubt the author of the article has actually done that. Our sitetruth.com system does in fact read a million web pages or so a month. Blindly getting them with "wget" won't work. All of the following situations come up routinely:
- There's a network error. A retry in an hour or so needs to be scheduled.
- There's an HTTP error. That has to be analyzed. Some errors mean "give up", and some mean "try again later".
- The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.
- The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".
- The site is really, really slow. Some sites will take half an hour to feed out a page. Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.
- The site doesn't return data at all. Some British university sites have a network implementation which, if asked for an HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing. This requires a special timeout.
- The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.
That's just reading the page text. More things can go wrong in parsing.
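
The "give up" versus "try again later" decision in the list above can be sketched as a small classifier. This is a minimal illustration, not the SiteTruth implementation: the names `Action`, `TRANSIENT_STATUSES`, and `classify_fetch` are hypothetical, and the choice of which status codes count as transient is an assumption, since the comment does not say.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    RETRY_LATER = "retry_later"
    GIVE_UP = "give_up"


# Assumption: this split of status codes is illustrative only. The comment
# above does not say which HTTP errors are treated as transient.
TRANSIENT_STATUSES = {408, 425, 429, 500, 502, 503, 504}


def classify_fetch(status=None, network_error=False):
    """Decide what to do with the result of one fetch attempt.

    Assumes the HTTP client has already followed HTTP-level redirects,
    so any 3xx status that reaches this point is treated as a failure.
    """
    if network_error or status is None:
        # No usable HTTP response at all: schedule another attempt later.
        return Action.RETRY_LATER
    if 200 <= status < 300:
        return Action.OK
    if status in TRANSIENT_STATUSES:
        return Action.RETRY_LATER
    return Action.GIVE_UP
```

A scheduler could call `classify_fetch(status=503)` and requeue the URL for an hour later, while `classify_fetch(status=404)` would drop it.
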
Even routine reading of some known data page requires some effort to get it right. We read PhishTank's entire XML list of phishing sites every three hours. Doing this reliably is non-trivial. PhishTank just overwrites their file when they update, rather than replacing it with a new one. (This is one of the design errors of UNIX, as Stallman once pointed out. Yes, there are workarounds they could do.) So we have to read the file twice, a minute apart, and wait until we get two identical copies. Then we have to check for 1) an empty file, 2) a file with proper XML structure but no data records, and 3) an improperly terminated XML file, all of which we've encountered. Then we pump the data into a MySQL database, prepared to roll back the changes if some error is detected.
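
The checks described above (two identical copies, then empty file, no data records, improperly terminated XML) can be sketched as follows. This is a rough illustration under stated assumptions: `check_feed` is a hypothetical helper, and the record element name `entry` is a placeholder, not taken from PhishTank's actual schema.

```python
import xml.etree.ElementTree as ET


def check_feed(first_copy: bytes, second_copy: bytes, record_tag: str = "entry") -> str:
    """Return "ok" or a short reason why the downloaded feed is unusable.

    first_copy and second_copy are two downloads taken some time apart.
    record_tag is an assumed element name for one data record.
    """
    if first_copy != second_copy:
        # The file was probably being overwritten while we read it.
        return "changed-during-read"
    if not first_copy.strip():
        return "empty"
    try:
        root = ET.fromstring(first_copy)
    except ET.ParseError:
        # Covers truncated or improperly terminated XML.
        return "malformed"
    if root.find(f".//{record_tag}") is None:
        # Well-formed XML, but no data records below the root element.
        return "no-records"
    return "ok"
```

Only a result of `"ok"` would be passed on to the database load; any other result means waiting and fetching again.
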
The clowns who try to do stuff like this with shell scripts and cron jobs spend a big fraction of their time dealing manually with the failures. If you do it right, it just keeps working. One of my other sites, "downside.com", has been updating itself daily from SEC filings for over a decade now. About once a month, something goes wrong with the nightly update, and it's corrected automatically the next night.
1. Re:Weak error handling by arth1 · 2010-10-24 14:21 · Score: 4, Informative
  
  The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.
  Actually, it does. But in any case, this is why you parse the HTML after fetching it with wget -- how else can you get things like JavaScript-generated URLs to work?
  
  The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".
  From the wget man page:
  Wget can follow links in HTML, XHTML, and CSS pages, to create local
  versions of remote web sites, fully recreating the directory structure
  of the original site. This is sometimes referred to as "recursive
  downloading." While doing that, Wget respects the Robot Exclusion
  Standard (/robots.txt).
  
  The site is really, really slow. Some sites will take half an hour to feed out a page.
  And you still haven't looked at the wget(1) man page, or you'd know about the --read-timeout parameter.
  
  Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.
  Not holding up your operation is why you use multiple tools that can run concurrently. A wget of orbitz.com taking forever won't prevent the wget of soggy.com that you scheduled for half an hour later, and neither will stop the parser.
  Of course, if you design an all-eggs-in-one-basket solution that depends on sequential operations, you deserve what you get.
  
  The site doesn't return data at all. Some British university sites have a network implementation which, if asked for an HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing. This requires a special timeout.
  Yes, the --connect-timeout.
  
  The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.
  wget limits to a single connection with keep-alive per instance. (If you want more, spawn more wget -nc commands)
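
Whichever tool does the fetching, the cap of three simultaneous connections per site needs some bookkeeping in whatever launches the fetches. A minimal sketch, assuming a threaded launcher in a single process; `PerHostLimiter` and its methods are hypothetical names, not part of wget or any library.

```python
import threading
from urllib.parse import urlsplit


class PerHostLimiter:
    """Allow at most max_per_host fetches in flight for each hostname."""

    def __init__(self, max_per_host=3):
        self._max = max_per_host
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, url):
        host = urlsplit(url).hostname or ""
        with self._lock:
            # Create the per-host semaphore lazily, the first time a host is seen.
            if host not in self._semaphores:
                self._semaphores[host] = threading.BoundedSemaphore(self._max)
            return self._semaphores[host]

    def try_acquire(self, url):
        """Return True if a fetch of url may start now, False if the host is at its cap."""
        return self._semaphore_for(url).acquire(blocking=False)

    def release(self, url):
        """Call once when a fetch that was allowed to start has finished."""
        self._semaphore_for(url).release()
```

A worker would call `try_acquire(url)` before starting a fetch, requeue the URL if it returns `False`, and call `release(url)` in a `finally` block once the fetch ends.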
  
  Even routine reading of some known data page requires some effort to get it right. We read PhishTank's entire XML list of phishing sites every three hours. Doing this reliably is non-trivial. PhishTank just overwrites their file when they update, rather than replacing it with a new one.
  That's no problem as long as you pay attention to the HTTP timestamp.
  
  (This is one of the design errors of UNIX, as Stallman once pointed out. Yes, there are workarounds they could do.) So we have to read the file twice, a minute apart, and wait until we get two identical copies. Then we have to check for 1) an empty file, 2) a file with proper XML structure but no data records, and 3) an improperly terminated XML file, all of which we've encountered.
  Oh. My.
  I'd do a HEAD as the second request, and check the Last-Modified time stamp.
  If the Date in the fetch was later than this, and you got a 2xx return code, all is well, and there's no need to download two copies, blatantly disregarding the "X-Request-Limit-Interval: 259200 Seconds" as you do.
  It'd be much faster too. But what do I know...
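
The header comparison suggested here can be sketched as a pure function over the header values. This is an illustration only: `copy_is_current` is a hypothetical helper, it assumes both headers are well-formed HTTP dates in GMT, and it inherits the one-second resolution of `Last-Modified`.

```python
from email.utils import parsedate_to_datetime


def copy_is_current(fetch_date: str, head_last_modified: str, status: int) -> bool:
    """Return True if the downloaded copy postdates the file's last modification.

    fetch_date is the Date header from the original GET response.
    head_last_modified is the Last-Modified header from a later HEAD request.
    status is the status code of the original GET.
    """
    fetched_at = parsedate_to_datetime(fetch_date)
    modified_at = parsedate_to_datetime(head_last_modified)
    return 200 <= status < 300 and fetched_at > modified_at
```

If this returns `False`, the file changed during or after the download, and the fetch should be repeated later.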
  
  The clowns who try to do stuff like this with shell scripts and cron jobs spend a big fraction of their time dealing manually with the failures.
  The clowns who do stuff like this with the simplest tools that do the job (
Re:8 keywords? by iamnobody2 · 2010-10-24 15:23 · Score: 4, Informative

8 ingredients? No. I've worked at a Taco Bell, and there are a few more than that. Here are most of them. Hot line: beef, chicken, steak, beans, rice, potatoes, red sauce, nacho cheese sauce, green sauce (only used by request). Cold line: lettuce, tomatoes, cheddar cheese, 3 cheese blend, onions, fiesta salsa (pico de gallo, the same tomatoes and onions mixed with a sauce), sour cream, guacamole, baja sauce, creamy jalapeno sauce. Plus 5 kinds/sizes of tortillas (3 sizes of regular, 2 sizes of die cut), nacho chips, etc. Here's an interesting fact: those Cinnamon Twists you may or may not love? They're made of deep-fried rotini (a type of pasta, usually boiled).

--
nobody's perfect