Learning High-Availability Server-Side Development?
fmoidu writes "I am a developer for a mid-size company, and I work primarily on internal applications. The users of our apps are business professionals who are forced to use them, so they are are more tolerant of access times being a second or two slower than they could be. Our apps' total potential user base is about 60,000 people, although we normally experience only 60-90 concurrent users during peak usage. The type of work being done is generally straightforward reads or updates that typically hit two or three DB tables per transaction. So this isn't a complicated site and the usage is pretty low. The types of problems we address are typically related to maintainability and dealing with fickle users. From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen when there is an extreme load on the system. What papers or projects are available for an engineer who wants to learn to work in a high-availability environment but isn't in one?"
map reduce
http://labs.google.com/papers/mapreduce.html
With great power comes great electricity bills.
You could start by reading this book for a practical approach:
Zawodny is pretty good...
You can't talk about Wikipedia's flaws on Wikipedia
Ok, first up:
1. Check all your SQL and run it through whatever profiler you have. Move things into views or functions if possible.
2. CHECK YOUR INDEXES!! If you have SQL statements running that slow, the likely cause is not having proper indexes for the statements. Either make an index or change your SQL.
3. Consider using caching. For whatever platform you're on there's bound to be decent caching.
That's just the beginning... but the likely cause of most of your problems. We could go on for a month about optimizing.. but in the end if you just stuck with what you have and checked your design for bottlenecks you could get by just fine.
I said no... but I missed and it came out yes.
start by being clear about what you want to achieve. If it is HA then you want to look at clustering, failover, network topology, DR plans etc. If it is HP then look for the bottlenecks in the process, don't waste time shaving nanoseconds off something that wasn't bothering anyone. At infrastructure level you might think about cacheing some stuff, or putting a reverse proxy in front of a cluster of responding servers. In general disk reads are expensive but easily cached, disk writes are very expensive and normally you don't want to cache them, at least not for very long. Network bandwidth may be fast or slow, latency might be an issue if you have a chatty application.
Our in-house applications don't get built around performance at all (personally I find it disappointing but I don't write the rules... yet). We generally scale outwards: replicated databases, load distribution systems, etc.
:)
Many of the code guidelines we have established are to aid in this. Use transactions, don't lock tables, use stored procedures and views for anything complicated, things like that.
I guess my answer is that we delegate it to the server group or the dba group and let them deal with it. I guess this means the admins there are pretty good at what they're doing.
More Twoson than Cupertino
There are some good presentations on the web about how youtube, digg, google etc handle their scaling issues. Here's an example: http://video.google.com/videoplay?docid=-630496435 1441328559
These are both decent starting points. Please report back if you find something good -- I'd be very interested.
http://highscalability.com/
http://www.allthingsdistributed.com/
Stress testing? Use LoadRunner or some other tool to simulate users.
If you are using Java on Tomcat, BEA, or Websphere, use a product like PerformaSure to see a call tree of where your Java program is spending it's time. Sorts out how long each SQL takes too, and shows you what you actually sent. If you have external data sources, like SiteMinder, it will show that too.
If you mean "What happens if we lose a bit of hardware" simulate the whole thing on VMware on a single machine and kill/suspend VMs to see how it reacts.
Most importantly, MAKE SURE YOU MODEL WHAT YOU ARE TESTING. IF you are not testing a scaled up version of what users actually do, you have a bad test.
Never answer an anonymous letter. - Yogi Berra
From working on both academic and enterprise software designed specifically to scale, these are four things I've noticed are incredibly important to scalability:
Languages - I recently saw a multi-million dollar product fail because of performance problems. A large part of it was that they wanted to build performance-critical enterprise server software, but wrote it mostly in a language that emphasized abstraction over performance, and was designed for portability, not performance. The language, of course, was Java. Before I get flamed about Java, the issue was not Java itself and alone, but part of it was indeed using a language not specifically designed for a key project objective: performance. The abstraction, I would argue, did the project worse than all the other peformance issues associated with bytecode however. Relevant books on this subject are everywhere.
Libraries - Using other people's code (e.g. search software, DB apps, etc.) will always introduce scalability weaknesses and performance costs in expected and unexpected places. Haphazardly choosing what software to get in bed to can come back to bite you later. It is an occupational hazard, and each database product and framework and even hardware configuration has its own pitfalls. Many IT book on enterprise performance or even whitepapers and academic papers can provide more information.
Abstraction - There is no free lunch. When you make things easier to code, you typically incure some performance penalty somewhere. In C++, Java, and most other high level languages, the sheer notion of modularity and abstraction eventually add so much hidden knowledge and code that developers either lose track of what subtle costs everything is incurring, or are suddenly put in a position where they can't go back and rewrite everything. Sometimes it is better to write a clean, low-level API and limit the abstraction eyecandy or it will come back to bite you. On the other hand, sometimes a poor low-level API is worse than a cleanly abstracted high-level API. In practive, few complex and performance-oriented systems are architected in very high level languages however. I have seen few books on this subject, and it is pure software engineering. Design patterns might help, however.
Audience - Both clientelle and developer audiences make a big difference. Give an idiot a hammer with no instructions... and you get the point. Make sure your developers know what they're doing and what priorities are, and also design your interfaces and manuals in such a way as to keep scalability in mind. Why have a script perform a hundred macro operations when a well-designed API could provide better performance with a single call? This entails both HCI and project development experience.
Wish I could suggest more books, but there's just too many.
I keep hearing about Erlang being the next greatest thing since sliced bread... unfortunately, I don't have time to look into it too much. Could someone give me an 'elevator' pitch on what makes it so great for threading? Is it encapsulation based objects, a thread base class, or what? How does it handle cache coherency on SMP?
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Is that most of them have poor native APIs when it comes to scalability. Some of them have something like
But that is far from optimal. When will they be smart and release an async API that notifies you via callback when complete? This would be very useful for apps that need maximum scalability.
Microsoft's .NET framework is actually a great example of doing the right thing - it has these types of async methods all over the place. But then you have to deal with cross-platform issues and problems inherent with a GC.
It's not that much different for web frameworks either. None that I've tried (RoR, PHP, ASP.NET) have support for async responding - they all expect you to block execution should you want to query a db/file/etc. and just launch boatloads of threads to deal with concurrent users. I guess right now with hardware being cheaper it is easier to support rapid development and scale an app out to multiple servers.
From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen when there is an extreme load on the system.
Is it just me, or is the question hopelessly confused? He's using the term "availability" but it sounds like he's talking about "scalability."
Availability is basically percentage of uptime. You achieve that with hot spares, mirroring, redundancy, etc. Scalability is the ability to perform well as workloads increase. Some things (adding load-balanced webservers to a webserver farm) address both issues, of course, but they're largely separate issues.
The first thing this poster needs to do is get a firm handle on exactly WHAT he's trying to accomplish, before he can even think about finding resources to help him do it.
OtakuBooty.com: Smart, funny, sexy nerds.
I've worked in multiple extremely super-scaled applications (including ones sustaining 70,000 connections at any one time, 10,000 new connections each minute, and 15,000 concurrent throttled file transfers at any one time - all in one application instance on one machine).
:)
The biggest problem I have seen is people don't know how to properly define their thread's purpose and requirements, and don't know how to decouple tasks that have in-built latency or avoid thread blocking (and locking).
For example, often in a high-performance network app, you will have some kind of multiplexor (or more than one) for your connections, so you don't have a thread per connection. But people often make the mistake of doing too much in the multiplexor's thread. The multiplexor should ideally only exist to be able to pull data off the socket, chop it up into packets that make sense, and hand it off to some kind of thread pool to do actual processing. Anything more and your multiplexor can't get back to retrieving the next bit of data fast enough.
Similarly, when moving data from a multiplexor to a thread pool, you should be a) moving in bulk (lock the queue once, not once per message), AND you should be using the Least Loaded pattern - where each thread in the pool has its OWN queue, and you move the entire batch of messages to the thread that is least loaded, and next time the multiplexor has another batch, it will move it to a different thread because IT is least loaded. Assuming your processing takes longer than the data takes to be split into packets (IT SHOULD!), then all your threads will still be busy, but there will be no lock contention between them, and occasional lock contention ONCE when they get a new batch of messages to process.
Finally, decouple your I/O-bound processes. Make your I/O bound things (eg. reporting via. socket back to some kind of stats/reporting system) happen in their own thread if they are allowed to block. And make sure your worker threads aren't waiting to give the I/O bound thread data - in this case, a similar pattern to the above in reverse works well - where each thread PUSHING to the I/O bound thread has its own queue, and your I/O bound thread has its own queue, and when it is empty, it just collects the swaps from all the worker queues (or just the next one in a round-robin fashion), so the workers can put data onto those queues at its leisure again, without lock contention with each other.
Never underestimate the value of your memory - if you are doing something like reporting to a stats/reporting server via. socket, you should implement some kind of Store and Forward system. This is both for integrity (if your app crashes, you still have the data to send), and so you don't blow your memory. This is also true if you are doing SQL inserts to an off-system database server - spool it out to local disk (local solid-state is even better!) and then just have a thread continually reading from disk and doing the inserts - in a thread not touched by anything else. And make sure your SAF uses *CYCLING FILES* that cycle on max size AND time - you don't want to keep appending to a file that can never be erased - and preferably, make that file a memory mapped file. Similarly, when sending data to your end-users, make sure you can overflow the data to disk so you don't have 3mb data sitting in memory for a single client, who happens to be too slow to take it fast enough.
And last thing, make sure you have architected things in a way that you can simply start up a new instance on another machine, and both machines can work IN TANDEM, allowing you to just throw hardware at the problem once you reach your hardware's limit. I've personally scaled up an app from about 20 machines to over 650 by ensuring the collector could handle multiple collections - and even making sure I could run multiple collectors side-by-side for when the data is too much for one collector to crunch.
I don't know of any papers on this, but this is my experience writing extremely high performance network apps
I see a lot of recommendations for various technologies, software packages, etc. -- but I don't think this addresses the original question.
8 925669?initialSearch=1&url=search-alias%3Daps&fiel d-keywords=enterprise+software+
What you are asking about, of course, is enterprise-grade software. This typically involves an n-tier solution with massive attention to the following:
- Redundancy.
- Scalability.
- Manageability.
- Flexilibility.
- Securability.
- and about ten other "...abilities."
The classic n-tier solution, from top to bottom is:
- Presentation Tier.
- Business Tier.
- Data Tier.
All of these tiers can be made up of internal tiers. (For example, the Data Tier might have a Database and a Data Access / Caching Tier. Or the Presentation Tier can have a Presentation Logic Tier, then the Presentation GUI, etc.)
Anyway, my point is simply that there is a LOT to learn in each tier. I'd recommend hitting up good ol' Amazon with the search term "enterprise software" and buy a handful of well-received books that look interesting to you (and it will require a handful):
http://www.amazon.com/s/ref=nb_ss_gw/002-8545839-
Hope this helps.
I don't know if anyone has mentioned it but the key to a web application being scalable horizontally is statelessness. It's much easier to throw another server behind the load balancer than it is to upgrade the capacity on one. I've never been a fan of sticky sessions myself. This requires a different approach to development in terms of memory space and what not. With a horizontally scalable front tier, you can't always guarantee that someone will be talking to the same server on the next request that they were on the previous request. It requires a little more overhead in terms of either replicating the contents of memory between all application servers or on the database tier because you persist everything to the database.
At least that's my opinion.
"Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
You are talking about two things: reliability and performance. And there are two ways to measure performance: Latency (what one end user sees) and through put (number of transactions per unit time). You have to decide what to address.
You can address reliability and through put by invest a LOT of money in hardware and using things like round robin load balancing, clusters and mirrored DBMSes, RAID 5 and so on. Then losing a power supply or a disk drive means only degraded performance.
Latency is hard to address. You have to profile and collect good data. You may have to write test tools to measure parts of the system in isolation. You need to account for every millisecond before you can start shaving them off
Of course you could take a quick look for obvious stuff like poorly designed SQL data bases, lack of indexes on joined tables and cgi-bin scripts that require a process to be strarted each time they are called.
First of all, excellent question.
Second: ignore the ass above who said dump Java. Modern hotspots have made Java as fast or faster than C/C++. The guy is not up to date.
Third: Since this is a web app, are you using an HttpSession/sendRedirect or just a page-to-page RequestDispatcher/forward? As much as its a pain in the ass--use the RequestDispatcher.
Fourth: see what your queries are really doing by looking at the explain plan.
Five: add indexes wherever practical.
Six: Use AJAX wherever you can. The response time for an AJAX function is amazing and it is really not that hard to do Basic AJAX.
Seven: Use JProbe to see where your application is spending its time. You should be bound by the database. Anything else is not appropriate.
Eight: Based on your findings using JProbe, make code changes to, perhaps, put a frequently-used object from the database into a class variable (static).
These are several ideas that you could try. The main thing that experience teaches is this: DON'T optimize and change your code UNTIL you have PROOF of where the slow parts are.
you work for Skype, don't you?
Feed the need: Digitaladdiction.net
The first two sentences here are one of my biggest pet peeves...if application developers don't start becoming more network-aware, and vice versa, I think you're dead meat. Hint: there are very few applications these days that aren't accessed over the network. I see so many "silos" like this when I'm consulting. The network guys and the app guys have no idea what the "other side" does. Where if they actually worked together on these type issues instead of talking past each other, something actually might get done.
So yes, developers absolutely need to worry about HA. It makes a difference whether your app is stateless or not. How the app is health-checked from the load balancer. How chatty the app is on the network. Etc, etc...
Denny
Erecting the wall of separation between church and state is absolutely essential in a free society. - Thomas Jefferson
The point was that he seemed to consider it so academic and so "well known" that he could just dismiss it without considering it.
Google seems to have taken this elementary technique and turned it into a something that can kick the crap out of an over-engineered solution under the right circumstances. I've read the paper, and assuming this is really used how they say it is, I can say that it does a fantastic job of performing AND HA, based on my personal experiences with gmail, google, groups, adwords, maps, analytics, etc.
Fanboy? Maybe, depending on your definition. Impressed? Hell yes.
Yes. If you take that sentence in context, the answer is "Yes." Compared to the likelihood that one of the thousands of worker-machines will fail during any given job, it IS unlikely that the single Master will fail. Moreover, while any given job may take hours to run, it also seems that many take just moments. Furthermore, just because a job may take hours to run doesn't mean it's CRITICAL that it be completed in hours. And, at times when a job IS critical, that scenario is addressed in the preceeding sentence: It is easy for a caller to make the master write periodic checkpoints that the caller can use to restart a job on a different cluster on the off-chance that a Master fails.
If a job is NOT critical, the master fails, the caller determines the failure by checking for the abort-condition, and then restarts the job on a new cluster.
It's not a logical fallacy, nor is it a bad design.
For the benefit of anyone reading thru, here is the parapgraph in question. It follows a detailed section on how the MapReduce library copes with failures in the worker machines.
Task queuing to deal with server downtimes, and horizontal scalability.
::SLAPS::) integrated it directly in .NET via WCF, and there are others. In its simplest form, you really just send your jobs to a "queue", and have automated processes pick em up and handle em. If the processes go down, they'll just handle them when they get back up, so even a whole database server farm going down at the same time won't make you lose queued up requests. Nifty (it of course gets more complicated than that, but the basic scenarios can be learned by following an internet tutorial).
:)
The first is handled by just about any messaging/queue system. J2EE has had one for ages, Microsoft has MSMQ that recently (better late than never...
Then horizontal scaling. Why horizontal? Because just taking a random new box and plugging in it the network is easier and faster (especially in case of emergency) than having to take servers down to upgrade them (vertically scaling). Also adds to redundancy, so the more servers you add to your farm, the less likely your system will go down. There are documents on it all over (Microsoft Patterns&Practices has some on their web sites, non-MS documentation is hard to miss if you google for it, and many third partys will be more than happy to spam you with their solutions), but it really just come down to: "Use an RDBMS that handles clustering and table partitioning, use distributed caching solutions, push as much stuff on the client side (stuff that doesn't need to be trusted only!), and make sure that nothing ever depends on ressources that can only be accessed from a single machine (think local flat files, in process session management, etc)".
With that, no matter what goes down, things go on purring, and if someone ever bitch that the system is slow, you just buy a 1000$ server, stick a standard pre-made image on the disk, plug it in, have fun.
Oh, and fast network switches are a must
The problem is, it's hard to explain why. The overhead of using things like that is tremendous; Erlang's message system is used for quite literally all communication between processes, and a system like Windows Events or MSMQ would reduce Erlang applications to a crawl. Erlang uses an ordered, staged mailbox model, much like Smalltalk's. If you haven't used Smalltalk, then frankly I'm not aware of another parallel.
It's important to understand just how fundamental message passing is in Erlang. Send and receive are fundamental operators, and this is a language that doesn't have for loops, because it thinks they're too high level and inspecific (you can make them yourself; I know, that must sound crazy, but once you get it, it makes perfect sense.)You're about to see a completely different approach. I'm not saying it's the best, or the most flexible, but I really like it, and it genuinely is very different. What Erlang does can relatively straightforwardly be imitated with blocking and callbacks in C, but that involves system threads, and then you start getting locking and imperative behavior back, which is one of the things it's so awesome to get rid of (imagine - no more locks, mutexes, spin controls and so forth. Completely unnessecary, both in workload, debugging and in CPU time spent. It's a huge change.)
Really, it's a whole different approach. You've just got to learn it to get it.No, I said that. I wrote some code to help explain it to you, though of course slashdot's retarded lameness filters wouldn't pass it, so I put it behind this link. Sorry it's not inline.
Hopefully that will help. Sorry about the lack of whitespace; SlashDot's amazingly lame lameness filter is triggering on clean, readable code.
StoneCypher is Full of BS