Learning High-Availability Server-Side Development?
fmoidu writes "I am a developer for a mid-size company, and I work primarily on internal applications. The users of our apps are business professionals who are forced to use them, so they are are more tolerant of access times being a second or two slower than they could be. Our apps' total potential user base is about 60,000 people, although we normally experience only 60-90 concurrent users during peak usage. The type of work being done is generally straightforward reads or updates that typically hit two or three DB tables per transaction. So this isn't a complicated site and the usage is pretty low. The types of problems we address are typically related to maintainability and dealing with fickle users. From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen when there is an extreme load on the system. What papers or projects are available for an engineer who wants to learn to work in a high-availability environment but isn't in one?"
map reduce
http://labs.google.com/papers/mapreduce.html
With great power comes great electricity bills.
You could start by reading this book for a practical approach:
Zawodny is pretty good...
You can't talk about Wikipedia's flaws on Wikipedia
Ok, first up:
1. Check all your SQL and run it through whatever profiler you have. Move things into views or functions if possible.
2. CHECK YOUR INDEXES!! If you have SQL statements running that slow, the likely cause is not having proper indexes for the statements. Either make an index or change your SQL.
3. Consider using caching. For whatever platform you're on there's bound to be decent caching.
That's just the beginning... but the likely cause of most of your problems. We could go on for a month about optimizing.. but in the end if you just stuck with what you have and checked your design for bottlenecks you could get by just fine.
I said no... but I missed and it came out yes.
There are some good presentations on the web about how youtube, digg, google etc handle their scaling issues. Here's an example: http://video.google.com/videoplay?docid=-630496435 1441328559
These are both decent starting points. Please report back if you find something good -- I'd be very interested.
http://highscalability.com/
http://www.allthingsdistributed.com/
I've worked in multiple extremely super-scaled applications (including ones sustaining 70,000 connections at any one time, 10,000 new connections each minute, and 15,000 concurrent throttled file transfers at any one time - all in one application instance on one machine).
:)
The biggest problem I have seen is people don't know how to properly define their thread's purpose and requirements, and don't know how to decouple tasks that have in-built latency or avoid thread blocking (and locking).
For example, often in a high-performance network app, you will have some kind of multiplexor (or more than one) for your connections, so you don't have a thread per connection. But people often make the mistake of doing too much in the multiplexor's thread. The multiplexor should ideally only exist to be able to pull data off the socket, chop it up into packets that make sense, and hand it off to some kind of thread pool to do actual processing. Anything more and your multiplexor can't get back to retrieving the next bit of data fast enough.
Similarly, when moving data from a multiplexor to a thread pool, you should be a) moving in bulk (lock the queue once, not once per message), AND you should be using the Least Loaded pattern - where each thread in the pool has its OWN queue, and you move the entire batch of messages to the thread that is least loaded, and next time the multiplexor has another batch, it will move it to a different thread because IT is least loaded. Assuming your processing takes longer than the data takes to be split into packets (IT SHOULD!), then all your threads will still be busy, but there will be no lock contention between them, and occasional lock contention ONCE when they get a new batch of messages to process.
Finally, decouple your I/O-bound processes. Make your I/O bound things (eg. reporting via. socket back to some kind of stats/reporting system) happen in their own thread if they are allowed to block. And make sure your worker threads aren't waiting to give the I/O bound thread data - in this case, a similar pattern to the above in reverse works well - where each thread PUSHING to the I/O bound thread has its own queue, and your I/O bound thread has its own queue, and when it is empty, it just collects the swaps from all the worker queues (or just the next one in a round-robin fashion), so the workers can put data onto those queues at its leisure again, without lock contention with each other.
Never underestimate the value of your memory - if you are doing something like reporting to a stats/reporting server via. socket, you should implement some kind of Store and Forward system. This is both for integrity (if your app crashes, you still have the data to send), and so you don't blow your memory. This is also true if you are doing SQL inserts to an off-system database server - spool it out to local disk (local solid-state is even better!) and then just have a thread continually reading from disk and doing the inserts - in a thread not touched by anything else. And make sure your SAF uses *CYCLING FILES* that cycle on max size AND time - you don't want to keep appending to a file that can never be erased - and preferably, make that file a memory mapped file. Similarly, when sending data to your end-users, make sure you can overflow the data to disk so you don't have 3mb data sitting in memory for a single client, who happens to be too slow to take it fast enough.
And last thing, make sure you have architected things in a way that you can simply start up a new instance on another machine, and both machines can work IN TANDEM, allowing you to just throw hardware at the problem once you reach your hardware's limit. I've personally scaled up an app from about 20 machines to over 650 by ensuring the collector could handle multiple collections - and even making sure I could run multiple collectors side-by-side for when the data is too much for one collector to crunch.
I don't know of any papers on this, but this is my experience writing extremely high performance network apps
There is a good chance that this was the reason you were not accepted. I work at a very similar firm to the one you describe, one anyone here would recognize, and we do reject people for the reason you mentioned. Basically, the problem is that unless someone does have experience with very large scale applications, we find that they have a pretty steep learning curve ahead of them. While many candidates think that they know how to build a scalable app, what worked for them on an application that has 100k users totally breaks down when there are 100 million users.
Its very difficult to get that kind of knowledge/experience without having actually done it before. The way I got it was by being hired into a project which was a rewrite of a very large scale system, and I got hired right as that project was starting. This was a great way to make it up the learning curve without too much pain, because I got to hear from the team directly about what decisions they thought were wrong about the previous design, and got to participate in the design discussions for the next generation. The team was very experienced at this and the choices they made (and mistakes/bumps along the way) taught me a lot about how to build such an application.
Most databases have async APIs. Postgresql and mysql have them in the C client libraries. Most web development languages, though, do not expose this feature in the language API, and for good reason. Async calls can, in rare cases, be useful for maximizing the throughput of the server. Unfortunately, they're more difficult to program, and much more difficult to test.
High scale web applications have thousands of simultaneous clients, so the server will never run out of stuff to do. Async calls have zero gain in terms of server throughput (requests/s). It may reduce a single request execution time, but the gain does not compensate the added complexity.
If at first you don't succeed, skydiving is not for you
You are talking about two things: reliability and performance. And there are two ways to measure performance: Latency (what one end user sees) and through put (number of transactions per unit time). You have to decide what to address.
You can address reliability and through put by invest a LOT of money in hardware and using things like round robin load balancing, clusters and mirrored DBMSes, RAID 5 and so on. Then losing a power supply or a disk drive means only degraded performance.
Latency is hard to address. You have to profile and collect good data. You may have to write test tools to measure parts of the system in isolation. You need to account for every millisecond before you can start shaving them off
Of course you could take a quick look for obvious stuff like poorly designed SQL data bases, lack of indexes on joined tables and cgi-bin scripts that require a process to be strarted each time they are called.
It's not something you can cram into an elevator pitch; erlang is an entirely different approach to parallelism. If you know how mozart-oz, smalltalk or twisted python work, you've got the basics.
Basically, processes are primitives, there's no shared memory, communication is through message passing, fault tolerance is ridiculously simple to put together, it's soft realtime, and since it was originally designed for network stuff, not only is network stuff trivially simple to write, but the syntax (once you get used to it) is basically a godsend. Throw pattern matching a la Prolog on top of that, dust with massive soft-realtime scalability which makes a joke of well-thought-of major applications (that YAWS vs Apache image comes to mind,) a soft-realtime clustered database and processes with 300 bytes of overhead and no CPU overhead when inactive (literally none,) and you have a language with such a tremendously different set of tools that any attempt to explain it without the listener actually trying the language is doomed to fall flat on its face.
In Erlang, you can run millions of processes concurrently without problems. (Linux is proud of tens of thousands, and rightfully so.) Having extra processes that are essentially free has a radical impact on design; things like work loops are no longer nessecary, since you just spin off a new process. In many ways it's akin to the unix daemon concept, except at the efficiency level you'd expect from a single compiled application. Every client gets a process. Every application feature gets a process. Every subsystem gets a process. Suddenly, applications become trees of processes pitching data back and forth in messages. Suddenly, if one goes down, its owner just restarts it, and everything is kosher.
It's not the greatest thing since sliced bread; there are a lot of things that Erlang isn't good for. However, what you're asking for is Erlang's original problem domain. This is what Erlang is for. I know, it's a pretty big time investiture to pick up a new language. Trust me: you will make all your time back in writing far shorter, far more obvious code than you did in learning the language. You can pick up the basics in 20 hours. It's a good gamble.
Developing servers becomes *really* different when you can start thinking of them as swarms.
StoneCypher is Full of BS
- How To Succeed In The Enterprise Software Market (Hardcover): Useless, its about the industry, not about writing the software.
- Scaling Software Agility: Best Practices for Large Enterprises (The Agile Software Development Series): Useless, it describes how to use agile practices in large enterprise level software teams.
- Groupware, Workflow and Intranets: Reengineering the Enterprise with Collaborative Software (Paperback): Useless, it just describes how various, existing enterprise level software categories are supposed to work together.
- Metrics-Driven Enterprise Software Development: Effectively Meeting Evolving Business Needs (Hardcover): Possibly useful, if he wants to know how to use metrics to help write the software. I have a feeling this doesn't give him techniques for writing enterprise software specifically though, which sounds more like what he wants.
- SAP R/3 Enterprise Software: An Introduction: Useless, it's an SAP manual.
- Essential Software Architecture (Hardcover): Possibly, can't tell from the description whether there is enough enterprise specific information to be useful.
- Large-Scale Software Architecture: A Practical Guide using UML: Aha! Something that sounds like what the guy is asking for.
(Not that I'm saying the books are useless in general, I'm just not sure they're what this guy/girl is looking for.)I think this guy/girl is looking for something along the lines of this comment but in "accepted" book format. It doesn't look like the search returns a "handful"....
The VM speed is not Java's problem. The decrepit servlet architecture, which was designed from the start around one-thread-per-request, is. Anything that fixes this architecture is essentially a patch on a broken system. Even if you escape, you're going to find that many JavaEE components will have you buying stock in RAM manufacturers.
A good JMS provider is nice to have for HA though. Nothing like durable message storage to help you sleep well.
Done with slashdot, done with nerds, getting a life.
- Very low overhead processes; creating a process in Erlang is only slightly more expensive than making a function call in C.
- Higher order functions.
- Pattern matching everywhere (e.g. function arguments, message receiving, etc). If you've want two different behaviours for a function depending on the structure of the data that it is passed (e.g. handlers for two different types of packet, with different headers) you can write two version of the function with a pattern in the argument.
- Guard clauses on functions, lets you implement design-by-contract and also lets you separate out validation of arguments from the body of a function, giving cleaner code.
- Simple message passing syntax, with pattern matching on message receive for out-of-order retrieval.
- Asynchronous message delivery; very scalable.
- Lists and tuples as basic language primitives.
- Gorgeous binary syntax. I've never seen a language as good as Erlang for manipulating binary data.
- Automatic mapping of Erlang processes to OS threads, allowing as many to run concurrently as you have CPUs.
- Network message delivery, allowing Erlang code with only slight modifications to send messages over the network rather than to local processes (the message sending code is the same, only how you acquire the process reference is different).
There are also a few down sides to the language:- The preprocessor is even worse than C's, so metaprogramming is hard (and badly needed; patterns like synchronous message sending or futures require a lot of copy-and-pasting).
- Implementing ADTs is ugly (but no worse than C).
- Variables are single static assignment, which is a cheap cop-out for the compiler writer and makes code convoluted at times.
- Message sending and function call syntax is very different for no good reason. You are meant to wrap exposed (public) messages in function, which makes things even more messy.
- Calling code in other languages is a colossal pain.
- The API is inconsistent (e.g. some modules manipulating ADTs take the ADT as the first argument, some take it as the last).
Erlang is a great language for a lot of tasks, particularly servers, but it's not suited for everything.I am TheRaven on Soylent News
Task queuing to deal with server downtimes, and horizontal scalability.
::SLAPS::) integrated it directly in .NET via WCF, and there are others. In its simplest form, you really just send your jobs to a "queue", and have automated processes pick em up and handle em. If the processes go down, they'll just handle them when they get back up, so even a whole database server farm going down at the same time won't make you lose queued up requests. Nifty (it of course gets more complicated than that, but the basic scenarios can be learned by following an internet tutorial).
:)
The first is handled by just about any messaging/queue system. J2EE has had one for ages, Microsoft has MSMQ that recently (better late than never...
Then horizontal scaling. Why horizontal? Because just taking a random new box and plugging in it the network is easier and faster (especially in case of emergency) than having to take servers down to upgrade them (vertically scaling). Also adds to redundancy, so the more servers you add to your farm, the less likely your system will go down. There are documents on it all over (Microsoft Patterns&Practices has some on their web sites, non-MS documentation is hard to miss if you google for it, and many third partys will be more than happy to spam you with their solutions), but it really just come down to: "Use an RDBMS that handles clustering and table partitioning, use distributed caching solutions, push as much stuff on the client side (stuff that doesn't need to be trusted only!), and make sure that nothing ever depends on ressources that can only be accessed from a single machine (think local flat files, in process session management, etc)".
With that, no matter what goes down, things go on purring, and if someone ever bitch that the system is slow, you just buy a 1000$ server, stick a standard pre-made image on the disk, plug it in, have fun.
Oh, and fast network switches are a must
The problem is, it's hard to explain why. The overhead of using things like that is tremendous; Erlang's message system is used for quite literally all communication between processes, and a system like Windows Events or MSMQ would reduce Erlang applications to a crawl. Erlang uses an ordered, staged mailbox model, much like Smalltalk's. If you haven't used Smalltalk, then frankly I'm not aware of another parallel.
It's important to understand just how fundamental message passing is in Erlang. Send and receive are fundamental operators, and this is a language that doesn't have for loops, because it thinks they're too high level and inspecific (you can make them yourself; I know, that must sound crazy, but once you get it, it makes perfect sense.)You're about to see a completely different approach. I'm not saying it's the best, or the most flexible, but I really like it, and it genuinely is very different. What Erlang does can relatively straightforwardly be imitated with blocking and callbacks in C, but that involves system threads, and then you start getting locking and imperative behavior back, which is one of the things it's so awesome to get rid of (imagine - no more locks, mutexes, spin controls and so forth. Completely unnessecary, both in workload, debugging and in CPU time spent. It's a huge change.)
Really, it's a whole different approach. You've just got to learn it to get it.No, I said that. I wrote some code to help explain it to you, though of course slashdot's retarded lameness filters wouldn't pass it, so I put it behind this link. Sorry it's not inline.
Hopefully that will help. Sorry about the lack of whitespace; SlashDot's amazingly lame lameness filter is triggering on clean, readable code.
StoneCypher is Full of BS