Learning High-Availability Server-Side Development?
fmoidu writes "I am a developer for a mid-size company, and I work primarily on internal applications. The users of our apps are business professionals who are forced to use them, so they are are more tolerant of access times being a second or two slower than they could be. Our apps' total potential user base is about 60,000 people, although we normally experience only 60-90 concurrent users during peak usage. The type of work being done is generally straightforward reads or updates that typically hit two or three DB tables per transaction. So this isn't a complicated site and the usage is pretty low. The types of problems we address are typically related to maintainability and dealing with fickle users. From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen when there is an extreme load on the system. What papers or projects are available for an engineer who wants to learn to work in a high-availability environment but isn't in one?"
map reduce
http://labs.google.com/papers/mapreduce.html
With great power comes great electricity bills.
"Give it up, give it up," said he, and turned away with a great sweep, like someone who wants to be alone with his laughter.
You could start by reading this book for a practical approach:
Zawodny is pretty good...
You can't talk about Wikipedia's flaws on Wikipedia
so your are having scalability issues because of a poor design - "a second or two slower" - wow how do you get away with that poor performance? anything over a 1.5 seconds and I get major complaints here.
How does high availability come into it? and high availability isn't exactly difficult you just need budget
Generally, Saas (software as a service) providers have to scale their apps. The development issues they have are more or less solved. Look it up on Google... ('saas scalability problem').
nosig today
Ok, first up:
1. Check all your SQL and run it through whatever profiler you have. Move things into views or functions if possible.
2. CHECK YOUR INDEXES!! If you have SQL statements running that slow, the likely cause is not having proper indexes for the statements. Either make an index or change your SQL.
3. Consider using caching. For whatever platform you're on there's bound to be decent caching.
That's just the beginning... but the likely cause of most of your problems. We could go on for a month about optimizing.. but in the end if you just stuck with what you have and checked your design for bottlenecks you could get by just fine.
I said no... but I missed and it came out yes.
As discussed in the previous article in this series, JSON is a useful format for Ajax applications because it allows you to convert between JavaScript objects and string values quickly. In this final article of the series, you'll learn how to handle data sent to a server in the JSON format and how to reply to scripts using the same format.
start by being clear about what you want to achieve. If it is HA then you want to look at clustering, failover, network topology, DR plans etc. If it is HP then look for the bottlenecks in the process, don't waste time shaving nanoseconds off something that wasn't bothering anyone. At infrastructure level you might think about cacheing some stuff, or putting a reverse proxy in front of a cluster of responding servers. In general disk reads are expensive but easily cached, disk writes are very expensive and normally you don't want to cache them, at least not for very long. Network bandwidth may be fast or slow, latency might be an issue if you have a chatty application.
I think the word scalable gets people into trouble when they program for a future that will never arrive. Instead focus on building elegant applications -- they are easy to maintain, and you know you'll be doing a lot of that.
Our in-house applications don't get built around performance at all (personally I find it disappointing but I don't write the rules... yet). We generally scale outwards: replicated databases, load distribution systems, etc.
:)
Many of the code guidelines we have established are to aid in this. Use transactions, don't lock tables, use stored procedures and views for anything complicated, things like that.
I guess my answer is that we delegate it to the server group or the dba group and let them deal with it. I guess this means the admins there are pretty good at what they're doing.
More Twoson than Cupertino
There are some good presentations on the web about how youtube, digg, google etc handle their scaling issues. Here's an example: http://video.google.com/videoplay?docid=-630496435 1441328559
I've heard of companies who offer server networks for websites and corporate server backups in case of a massive flood of traffic. Basically it's just about free cuz you rarely use it but if your website shows up on The Daily Show and you get 1 million visitors, they sense that and host it from the backup on 50 of their servers at once until traffic dies down and bill you for it later.
Same with a corporate network. A bunch of people have to get in their last minute stuff on the last day of the quarter or whatever and your server is going nuts with the traffic so they're there to save you. Just take a day or so and write a "switch" sort of program on your server(s) that detects tons of traffic and contacts the emergency offsite servers that the company has your apps and DBs just sitting on and you use multiple servers of theirs until the traffic dies down. There is a little bit of a higher fee for corporate services but it's still really cheap. It's like a rented server but 99.999% of the time, they don't need to allocate any bandwidth at all to it so it's like 25x cheaper than renting a dozen actual, full time servers.
Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
Well, HA typically has to do with availability not performance. However, if you add redundant equipment, e.g. another column, you can improve performance and improve availability. So, db scaling issues can be resolved by adding memory and CPUs. Applications can be scaled by adding by cloning vertically, add memory, cpus, etc. A redundant column of equipment, e.g. web servers, etc.
These are both decent starting points. Please report back if you find something good -- I'd be very interested.
http://highscalability.com/
http://www.allthingsdistributed.com/
Stress testing? Use LoadRunner or some other tool to simulate users.
If you are using Java on Tomcat, BEA, or Websphere, use a product like PerformaSure to see a call tree of where your Java program is spending it's time. Sorts out how long each SQL takes too, and shows you what you actually sent. If you have external data sources, like SiteMinder, it will show that too.
If you mean "What happens if we lose a bit of hardware" simulate the whole thing on VMware on a single machine and kill/suspend VMs to see how it reacts.
Most importantly, MAKE SURE YOU MODEL WHAT YOU ARE TESTING. IF you are not testing a scaled up version of what users actually do, you have a bad test.
Never answer an anonymous letter. - Yogi Berra
You'll find that Erlang doesn't even blink at those volumes, and that Erlang's entire reason to exist is scalability/reliability. Granted, it's a little severe to pick up a new language, but the benefits are enormous, and it's one of those boons you can't really understand until you've learned it. It is, however, worth noting that transactions on an MNesia database in the multiple gigabytes are typically faster than PHP just invoking MySQL in the first place, let alone doing any work with it.
Erlang is difficult to learn from what's on the web; consider starting with Joe's book.
StoneCypher is Full of BS
I have a question sort of along the same line. I interviewed for a position at a very large internet company, and one of their primary concerns was very high performance and scalability. I went through the phone interviews and then the in-person interviews, and I actually did quite well, and was even told that I did quite well. However, in the end, I was told that while I did well, they would have liked to see more experience with very large web applications (I've worked at smaller companies). So, how do I go about learning something I think I already know, and from your experience, was that not the real reason I was not accepted?
Sorry this is a bit off-topic; I've just been dying to ask the slashdot community and this seems to be the most appropriate forum for the question.
From working on both academic and enterprise software designed specifically to scale, these are four things I've noticed are incredibly important to scalability:
Languages - I recently saw a multi-million dollar product fail because of performance problems. A large part of it was that they wanted to build performance-critical enterprise server software, but wrote it mostly in a language that emphasized abstraction over performance, and was designed for portability, not performance. The language, of course, was Java. Before I get flamed about Java, the issue was not Java itself and alone, but part of it was indeed using a language not specifically designed for a key project objective: performance. The abstraction, I would argue, did the project worse than all the other peformance issues associated with bytecode however. Relevant books on this subject are everywhere.
Libraries - Using other people's code (e.g. search software, DB apps, etc.) will always introduce scalability weaknesses and performance costs in expected and unexpected places. Haphazardly choosing what software to get in bed to can come back to bite you later. It is an occupational hazard, and each database product and framework and even hardware configuration has its own pitfalls. Many IT book on enterprise performance or even whitepapers and academic papers can provide more information.
Abstraction - There is no free lunch. When you make things easier to code, you typically incure some performance penalty somewhere. In C++, Java, and most other high level languages, the sheer notion of modularity and abstraction eventually add so much hidden knowledge and code that developers either lose track of what subtle costs everything is incurring, or are suddenly put in a position where they can't go back and rewrite everything. Sometimes it is better to write a clean, low-level API and limit the abstraction eyecandy or it will come back to bite you. On the other hand, sometimes a poor low-level API is worse than a cleanly abstracted high-level API. In practive, few complex and performance-oriented systems are architected in very high level languages however. I have seen few books on this subject, and it is pure software engineering. Design patterns might help, however.
Audience - Both clientelle and developer audiences make a big difference. Give an idiot a hammer with no instructions... and you get the point. Make sure your developers know what they're doing and what priorities are, and also design your interfaces and manuals in such a way as to keep scalability in mind. Why have a script perform a hundred macro operations when a well-designed API could provide better performance with a single call? This entails both HCI and project development experience.
Wish I could suggest more books, but there's just too many.
This is a broad topic, but I would say begin by identifying your single points of failure. You can then research setting up HA solutions for each of those resources. Also, understand the difference between high-availability and load balancing. Just because your database is fault-tolerant, it does not necessarily mean it can scale to cope with increased traffic.
Draw a high level map of your application and all the server/network resources it uses. Take each one of those components and analyze them for load balancing and fault tolerance. Any single component failure should not affect the overall uptime of the application. Part of a high-availability system is having proper monitoring and notification tools in place. It takes a lot to make a high availability environment work and some of it is not engineering related, but business process related. If your servers are in a data center and a database server goes down, yet your notification system sends an email to a database developer who works 9am to 5pm (maybe on vacation) alerting him/her of the issue... You can see how this can lead to problems. Proper health checks, escalation paths, etc. are all part of making your system work.
My $0.02.
http://www.couchdb.com/ is a distributed replicable non-relational database written in Erlang. It is a very clever system and I was impressed with the language choice of the developer.
Often there are a narrow set of query/reports types are the most common and consume the most resources. Perhaps consider making a nightly customized copy of a view(s) via batch tech that fits that frequent need well and put it on a separate server. This will not only speed up the common need, but also the other queries since their server load is lightened. In general, also make sure your indexing is designed well. In other words, put indexes where they are needed but don't put unnecessary ones. Study the usage needs carefully.
Table-ized A.I.
I keep hearing about Erlang being the next greatest thing since sliced bread... unfortunately, I don't have time to look into it too much. Could someone give me an 'elevator' pitch on what makes it so great for threading? Is it encapsulation based objects, a thread base class, or what? How does it handle cache coherency on SMP?
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Is that most of them have poor native APIs when it comes to scalability. Some of them have something like
But that is far from optimal. When will they be smart and release an async API that notifies you via callback when complete? This would be very useful for apps that need maximum scalability.
Microsoft's .NET framework is actually a great example of doing the right thing - it has these types of async methods all over the place. But then you have to deal with cross-platform issues and problems inherent with a GC.
It's not that much different for web frameworks either. None that I've tried (RoR, PHP, ASP.NET) have support for async responding - they all expect you to block execution should you want to query a db/file/etc. and just launch boatloads of threads to deal with concurrent users. I guess right now with hardware being cheaper it is easier to support rapid development and scale an app out to multiple servers.
Scalable Internet Architectures by Theo Schlossnagle http://www.amazon.com/Scalable-Internet-Architectu res-Developers-Library/dp/067232699X.
Great tutorial at OSCON about it too.
But the biggest piece of advice is to never guess about where things are slow. Measure them and then fix the slow parts. Don't change a thing until you've benchmarked it.
From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen when there is an extreme load on the system.
Is it just me, or is the question hopelessly confused? He's using the term "availability" but it sounds like he's talking about "scalability."
Availability is basically percentage of uptime. You achieve that with hot spares, mirroring, redundancy, etc. Scalability is the ability to perform well as workloads increase. Some things (adding load-balanced webservers to a webserver farm) address both issues, of course, but they're largely separate issues.
The first thing this poster needs to do is get a firm handle on exactly WHAT he's trying to accomplish, before he can even think about finding resources to help him do it.
OtakuBooty.com: Smart, funny, sexy nerds.
I've worked in multiple extremely super-scaled applications (including ones sustaining 70,000 connections at any one time, 10,000 new connections each minute, and 15,000 concurrent throttled file transfers at any one time - all in one application instance on one machine).
:)
The biggest problem I have seen is people don't know how to properly define their thread's purpose and requirements, and don't know how to decouple tasks that have in-built latency or avoid thread blocking (and locking).
For example, often in a high-performance network app, you will have some kind of multiplexor (or more than one) for your connections, so you don't have a thread per connection. But people often make the mistake of doing too much in the multiplexor's thread. The multiplexor should ideally only exist to be able to pull data off the socket, chop it up into packets that make sense, and hand it off to some kind of thread pool to do actual processing. Anything more and your multiplexor can't get back to retrieving the next bit of data fast enough.
Similarly, when moving data from a multiplexor to a thread pool, you should be a) moving in bulk (lock the queue once, not once per message), AND you should be using the Least Loaded pattern - where each thread in the pool has its OWN queue, and you move the entire batch of messages to the thread that is least loaded, and next time the multiplexor has another batch, it will move it to a different thread because IT is least loaded. Assuming your processing takes longer than the data takes to be split into packets (IT SHOULD!), then all your threads will still be busy, but there will be no lock contention between them, and occasional lock contention ONCE when they get a new batch of messages to process.
Finally, decouple your I/O-bound processes. Make your I/O bound things (eg. reporting via. socket back to some kind of stats/reporting system) happen in their own thread if they are allowed to block. And make sure your worker threads aren't waiting to give the I/O bound thread data - in this case, a similar pattern to the above in reverse works well - where each thread PUSHING to the I/O bound thread has its own queue, and your I/O bound thread has its own queue, and when it is empty, it just collects the swaps from all the worker queues (or just the next one in a round-robin fashion), so the workers can put data onto those queues at its leisure again, without lock contention with each other.
Never underestimate the value of your memory - if you are doing something like reporting to a stats/reporting server via. socket, you should implement some kind of Store and Forward system. This is both for integrity (if your app crashes, you still have the data to send), and so you don't blow your memory. This is also true if you are doing SQL inserts to an off-system database server - spool it out to local disk (local solid-state is even better!) and then just have a thread continually reading from disk and doing the inserts - in a thread not touched by anything else. And make sure your SAF uses *CYCLING FILES* that cycle on max size AND time - you don't want to keep appending to a file that can never be erased - and preferably, make that file a memory mapped file. Similarly, when sending data to your end-users, make sure you can overflow the data to disk so you don't have 3mb data sitting in memory for a single client, who happens to be too slow to take it fast enough.
And last thing, make sure you have architected things in a way that you can simply start up a new instance on another machine, and both machines can work IN TANDEM, allowing you to just throw hardware at the problem once you reach your hardware's limit. I've personally scaled up an app from about 20 machines to over 650 by ensuring the collector could handle multiple collections - and even making sure I could run multiple collectors side-by-side for when the data is too much for one collector to crunch.
I don't know of any papers on this, but this is my experience writing extremely high performance network apps
http://video.google.com/videoplay?docid=-55252469
I see a lot of recommendations for various technologies, software packages, etc. -- but I don't think this addresses the original question.
8 925669?initialSearch=1&url=search-alias%3Daps&fiel d-keywords=enterprise+software+
What you are asking about, of course, is enterprise-grade software. This typically involves an n-tier solution with massive attention to the following:
- Redundancy.
- Scalability.
- Manageability.
- Flexilibility.
- Securability.
- and about ten other "...abilities."
The classic n-tier solution, from top to bottom is:
- Presentation Tier.
- Business Tier.
- Data Tier.
All of these tiers can be made up of internal tiers. (For example, the Data Tier might have a Database and a Data Access / Caching Tier. Or the Presentation Tier can have a Presentation Logic Tier, then the Presentation GUI, etc.)
Anyway, my point is simply that there is a LOT to learn in each tier. I'd recommend hitting up good ol' Amazon with the search term "enterprise software" and buy a handful of well-received books that look interesting to you (and it will require a handful):
http://www.amazon.com/s/ref=nb_ss_gw/002-8545839-
Hope this helps.
I don't know if anyone has mentioned it but the key to a web application being scalable horizontally is statelessness. It's much easier to throw another server behind the load balancer than it is to upgrade the capacity on one. I've never been a fan of sticky sessions myself. This requires a different approach to development in terms of memory space and what not. With a horizontally scalable front tier, you can't always guarantee that someone will be talking to the same server on the next request that they were on the previous request. It requires a little more overhead in terms of either replicating the contents of memory between all application servers or on the database tier because you persist everything to the database.
At least that's my opinion.
"Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
Developers need not worry about HA too much. Your IT department should be able to set this up for you rather seamlessly. With things like LVS/Keepalived you can easily implement load balancing and auto-failover for databases, web servers, etc (you don't even need to code in multiple DB servers; VRRP works wonders for this kind of thing.) As long as the application is designed sanely to begin with, HA as it is typically discussed comes down to minimizing the impact of hardware failure by buying two of everything and making failover happen automatically (human response time is anywhere from 5-15 minutes in a best-case scenario, where worst-case for auto failover is http://keepalived.org/) is an excellent solution for something your size. It's basically an LVS frontend with host checking and automatic failover capability (via VRRP,) custom host checks (i.e. run a typical SQL query every 3 seconds to check that everything is ok, if it's not, remove the DB from the pool, do some other stuff and rebalance the cluster. It can all run off one IP on the front end so the app won't notice what happened.)
You are talking about two things: reliability and performance. And there are two ways to measure performance: Latency (what one end user sees) and through put (number of transactions per unit time). You have to decide what to address.
You can address reliability and through put by invest a LOT of money in hardware and using things like round robin load balancing, clusters and mirrored DBMSes, RAID 5 and so on. Then losing a power supply or a disk drive means only degraded performance.
Latency is hard to address. You have to profile and collect good data. You may have to write test tools to measure parts of the system in isolation. You need to account for every millisecond before you can start shaving them off
Of course you could take a quick look for obvious stuff like poorly designed SQL data bases, lack of indexes on joined tables and cgi-bin scripts that require a process to be strarted each time they are called.
I just got hired at a company whose code is a mess and I have to clean it up. The biggest mistakes i see causing this are tables not normalized, business logic is not properly separated (neither in the DB nor in the code), MVC standards not followed, redundant coding and code/data bloat (ie. loading more data and/or code than is needed to perform a given function and/or task).
These are alot to wade through. My best suggestion is to go with a prebuilt MVC architecture and do your best to throw it all into there. The new STRUTS2 is awesome if you know JAVA. Stay away from Ruby on Rails if you want scalability (as even the RUBY ON RAILS site requires PHP to scale). If you are using PHP, PHPulse is the fastest framework out their but is lacking in documentation. Comparable ones like Cake and ZEND have loads of documentation but are more bloated and far slower
First of all, excellent question.
Second: ignore the ass above who said dump Java. Modern hotspots have made Java as fast or faster than C/C++. The guy is not up to date.
Third: Since this is a web app, are you using an HttpSession/sendRedirect or just a page-to-page RequestDispatcher/forward? As much as its a pain in the ass--use the RequestDispatcher.
Fourth: see what your queries are really doing by looking at the explain plan.
Five: add indexes wherever practical.
Six: Use AJAX wherever you can. The response time for an AJAX function is amazing and it is really not that hard to do Basic AJAX.
Seven: Use JProbe to see where your application is spending its time. You should be bound by the database. Anything else is not appropriate.
Eight: Based on your findings using JProbe, make code changes to, perhaps, put a frequently-used object from the database into a class variable (static).
These are several ideas that you could try. The main thing that experience teaches is this: DON'T optimize and change your code UNTIL you have PROOF of where the slow parts are.
you work for Skype, don't you?
Feed the need: Digitaladdiction.net
Is that your point? 'Cause Microsoft is making a metric assload of money off of Windows, and Apple is making money off of OSX too.
I think I smell a Google fan-boy.
Blar.
It sounds like you need to some basic disaster planning. Think in terms of "what if this happens?"
Like you loose your data center? How good is your backup, is it off site, do you have a tested plan for restoring the data and system on an interm basis on someone's system?
Then you can look at some more specific things, what happens if I loose this server, this connection, this router, and specific services, DNS, Email, etc.
The big question $$$ depends on how much you have to loose. If you can afford a day of downtime, you don't have to spend as much effort on HA as say the NYSE, or an airline.
Among others, another possibility Service Availability Forum (http://www.saforum.org/). You can download an open source implementation at http://developer.osdl.org/dev/openais/ and play with it (runs on top of Linux).
These really are 2 different things. Though they do sometimes cross over - oracle RAC is a good example of that.
As for where to read from a developer perspective? (which alot of people replying seemed to have missed the actual question). There are TONNES.
But split the question in two, where can i read about HA:
start here-> http://en.wikipedia.org/wiki/High_availability Theres also many books on the subject (i remember one of the few i happened to like is the things that came out of the sun blueprints books). The problem with HA is it very subjective. You can talk about HA for say web applications and just talk session sharing and an intelligent load balancer (ironically, the same thing gives you scalability until you get to the DB) or you can talk all the way down to fault-tolerant hardware. Also take a look at the whitepapers that came out of such projects as mosix, VAX clusters, oracle HA (both RAC and dataguard), IBM Websphere (There alot in the various IBM sites about HA for all their products and one is bound to be similar to yours in nature), Sun J2EE. Alot of these do go into development aspects as well and give some fantastic concepts and paradigms to follow. But you really need to define the requirements for HA. i.e. 0 dt or 30 minutes dt is a HUGE difference! (and that really is just one scenario in many, and as a developer your usually faced with multiple requirements).
Scalability is a different issue - and usually very application and environment dependent. Again http://en.wikipedia.org/wiki/Scalability is a good start but finding general literature is often very hard because its so dependent on the situation and the application.
Personally, i've found i learn best from example. Such things J2EE application servers (websphere, sun JES, etc), load balancing, oracle RAC vs dataguard, mysql ndb vs replication vs read-only replica database methods, apache, php, samba, windows (most things), pick just about any main-stream application and it'll almost certainly cover both HA and scalability at a level helpful to a developer.
If you want to get even more complex - take a look even cooler forms of scalability and HA that involve things like utility computing (vmware DRS, or egenera for eg). Have a look at their design documents because they offer even more diverse examples of both subjects at a more abstract layer (i.e. even below the OS and entirely on the HW)
In both cases, its hard to go from a "we weren't thinking about HA or scalability scenario when we build it" to "its HA and scalable". HA tends to be a little easier because clusters can wrap themselves around almost any situation, but scaling on such systems usually means "i need bigger, faster and more CPUs, more memory and better disk until i can figure out how to code scaling into it".
Always keep in mind though, the law of diminishing returns almost always applies.
The point was that he seemed to consider it so academic and so "well known" that he could just dismiss it without considering it.
Google seems to have taken this elementary technique and turned it into a something that can kick the crap out of an over-engineered solution under the right circumstances. I've read the paper, and assuming this is really used how they say it is, I can say that it does a fantastic job of performing AND HA, based on my personal experiences with gmail, google, groups, adwords, maps, analytics, etc.
Fanboy? Maybe, depending on your definition. Impressed? Hell yes.
Look for a job where they got lots of oltp!
It's just hard to do, and I've never seen a good book on the subject
(in fact I've considered writing one on and off for years but sadly
the $$ I cam make as a consultant on performance, scalability and
availability far exceeds the likely rewards from publishing a book).
Best advice is to look at some open source projects that are used
in highly scalable applications. The other thing I'd say is that
there isn't one true technique -- at this point everyone makes up
their own solution as they go. Often the applications' characteristics
drive the scaling architecture so each application is different.
Software Engineering for Internet Applications will guide you in the right direction.
On modern hardware, on an internal network, "a second or two" is an eternity. Instead of worrying about what would happen if all 60,000 people used the app at once (unlikely), I'd find the bottlenecks you have now and fix those.
Prioritize. You have statistics already about typical usage, and typical wait and service times. Fix the problem that exists, instead of the problem that doesn't, but might someday.
If moderation could change anything, it would be illegal.
Anyone who uses Java and performance in the same sentence as fast, hasn't been around enough heavy systems to know what the hell they're talking about. You get Performance, Easy of Use , and Cost pick two. It's an old rule, but still applies to day. The C/C++ folks are laughting their arse(s) off at java and performance. And if anyone is old enought to know asembler try not to cringe to much when Java or the C/C++ folks talk about performance.
Yes. If you take that sentence in context, the answer is "Yes." Compared to the likelihood that one of the thousands of worker-machines will fail during any given job, it IS unlikely that the single Master will fail. Moreover, while any given job may take hours to run, it also seems that many take just moments. Furthermore, just because a job may take hours to run doesn't mean it's CRITICAL that it be completed in hours. And, at times when a job IS critical, that scenario is addressed in the preceeding sentence: It is easy for a caller to make the master write periodic checkpoints that the caller can use to restart a job on a different cluster on the off-chance that a Master fails.
If a job is NOT critical, the master fails, the caller determines the failure by checking for the abort-condition, and then restarts the job on a new cluster.
It's not a logical fallacy, nor is it a bad design.
For the benefit of anyone reading thru, here is the parapgraph in question. It follows a detailed section on how the MapReduce library copes with failures in the worker machines.
You can get massive savings in processing by using various caching techniques. Caching lets you save the results of one process for use later.
1. Client side cache. Most developers shudder when they think of a web page being cached on the browser. However, some pages (like help pages, new articles) do not change with real time and can be stored on the client's browser for a few minutes. Learn how to use the HTTP Caching directives to reduce the number of unique pages requested by each user.
2. HTML Output caching. While some parts of a page may change, some elements (such as navigation elements, footers, etc) may not need to be recalculated with each page load. Many app servers let you save sections of the page so that once one user has generated them, you can reuse them again.
3. Database Caching. Frameworks like Hibernate allow you to cache the results of SQL calls so that if the same SQL is reissued (even between different users) the cache reads the result, not the database. Usually you can pick which calls are cached versus which ones have to be live.
Blueprints for High Availability , Evan Marcus and Hal Stern, second edition. http://www.amazon.com/Blueprints-High-Availability -Evan-Marcus/dp/0471430269/ref=cm_taf_title_featur ed?ie=UTF8&tag=tellafriend-20
Deals with the subject of high availability from the IT side rather than programming, but anyone dealing with HA systems needs to understand these issues.
Well ... I've read what others wrote here and I don't think you got many actual answers (welcome to slashdot :) ).
While there were some (very) good points about both scalability and HA, they didn't tell you how to go about learning that; HA and HS are two areas where by reading the books or following case studies, you can understand the basic problems, but not see how you actually go about building a particularly scalable or HA system (because it's usually a system, not a single server).
I've worked in maintenance for a c++ server, where we gave the users guarantees of both low response time under stress and minimal down time.
While working with that, we've had to use different angles in attacking the appearing problems, and usually the solutions we used were particular for the problems couldn't be very well generalized.
For example, when different threads used the same input data, it was better in some situations to duplicate the data for each thread instead of using locks on it, so you didn't incur the delays involved in locking. In other places, we used locking as even with locking those places wouldn't create bottlenecks.
In some value-objects (large data collections), we used copy-on-write pattern with sharing objects values in the same thread (to minimize the allocations done and memory fragmentation), and deep copies when the data was needed to be sent in other threads.
We split the server in multiple different servers handling the different parts of processing (so we could throw hardware at the computation-heavy parts of the application logic) and use load-balancing.
We also used in-memory databases in one case, for storing some of the information.
Some IO operations (like logging) ran on different threads with messages passed to them (having multiple threads for logging for example).
For HA, we had a complete monitoring system, with processes listening for hartbeat and making load-statistics for the different modules, and every computation part had a backup server ready to take over at certain loads.
My point with these examples, is that neither of these solutions can be applied ad-hoc, but each possible bottleneck has to be studied separately and the solutions to avoid it can vary depending on a lot of factors (while you can get ideas, you won't learn scalability or HA from a book).
The best solution I can think of for learning about HS and HA is working in a product that needs them and getting direct experience in improving the scalability and availability.
Tie two birds together: although they have four wings, they cannot fly. (The blind man)
It really depends on the application. We're handling hundreds of concurrent connections and millions of connections a day per server with average CPU utilization hovering averaging 43.09% and never exceeding 63.47%. If you subtract time waiting for I/O and the average drops to 9.89% and peak 23.24%.
Performance bottlenecks often lie in the disks and network, not in the application.
The C10K problem
d =20331189 because this emphasizes that you should do as much work in one thread as possible in order to avoid redundant context switches. Thus don't separate threads based on role where one task must be picked up by many threads to complete -- instead, each thread should be able to execute any sequence through the state machine for any client.)
http://kegel.com/c10k.html
High-Performance Server Architecture:
http://pl.atyp.us/content/tech/servers.html
(Note that this refutes http://ask.slashdot.org/comments.pl?sid=277739&ci
Maybe not a classic, but still nice (see its references):
Threads, Tasks, Coroutines, Processes, and Events:
http://shlang.com/writing/threads-tasks.html
Thanks for your reply, I think I'll give it a shot!
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Hi,
l ications%20scalable%20with%20LB.pdf
...), then you optimize it again and break on I/O in the database. Then you have to rewrite all your requests,
an article I wrote last year about application scaling using load balancing may help you. It will not solve your problems but will certainly help you with the concepts, best practises and traps to avoid, which is a good starting point.
You can get read it online here : http://1wt.eu/articles/2006_lb/ or you can download it as a PDF here :
http://www.exceliance.fr/en/ART-2006-making%20app
Also, what you need is to perform benchmarks frequent during all the cycle of development of your applications. Using
traffic generators, you will simulate a lot of users and see how your application/database behaves. And believe me, it
never breaks where you expected it to ! On the first run, it's almost always caused my too much memory usage.
Then you optimize it (decrappify it in fact), then you break on concurrency (threads, processes, file descriptors,
sockets,
and when you finally saturate the frontend servers with 1% of your target load, you realize that you have to rewrite
everything using a faster language. But at least, you will be able to save time by starting on a few cheap servers, for
the time needed to translate the code for version 2.
You talked about 60000 users. If it's a population of 60000 users, it's not much. If it's 60000 concurrent users, it's
a huge load and you will have to educate yourself in network and operating system tuning, because tuning the app alone
will not be enough.
Good luck!
Willy
How many of those apps are running a java backbone? Sure Java is not the sole cause of the world to slowdown in performance, but don't make it into something that its not (i.e. a performance speed demon). The heavy lifing is not done in java and there are a lot of reasons why. (If people are missing this one, just what the hell are they teaching in the CS Deparments in Colleges these days anyway. This is basic stuff, no brainer material.) Time to grow up and move on.
I would recommend familiarity with streaming architectures, which provide asymptotically better memory usage than the traditional store-and-forward model used in most J2EE and .NET applications. This is especially important if you will be sending large datasets to clients. These two articles outline streaming architecture, including both theory and practical implementations with performance results and analysis.
Another point: for enterprise applications, it generally pays off better to focus on tuning the database tier, in my experience. If that's true in your case, understanding SQL optimizations, lock optimizations, and the various types of table indexes (e.g. clustering indexes) would be your best option.
Actually, being forced to use your app doesn't make them more tolerant of delays. It makes *you* more tolerant because your users can't go away. They still hate the delays.
Erm, quite a few of them. Little site called eBay, for example, who migrated from a C++ impmlementation to Java in 2002. Happen to know one of the top ISPs in the country will be migrating about 20 million mailboxes to Java mailstores in the near future.
When people think scalability has much to do with what language an application is written in, I start suspecting they've never worked in a real data center before.
Like I said before you get Performance, Easy of Use, or Cost - pick 2. So if they opted for Performance and Easy of Use in their BACKBONE, then by the rule of thumb for the last 30 years, their costs are to high for their operations. (But I guess when you have Billions costs don't matter in the short term, but the stock holders might want to look into operational efficiency since it effects ROI.)
So did you forget to make an actual argument, or do you just not have one?
Task queuing to deal with server downtimes, and horizontal scalability.
::SLAPS::) integrated it directly in .NET via WCF, and there are others. In its simplest form, you really just send your jobs to a "queue", and have automated processes pick em up and handle em. If the processes go down, they'll just handle them when they get back up, so even a whole database server farm going down at the same time won't make you lose queued up requests. Nifty (it of course gets more complicated than that, but the basic scenarios can be learned by following an internet tutorial).
:)
The first is handled by just about any messaging/queue system. J2EE has had one for ages, Microsoft has MSMQ that recently (better late than never...
Then horizontal scaling. Why horizontal? Because just taking a random new box and plugging in it the network is easier and faster (especially in case of emergency) than having to take servers down to upgrade them (vertically scaling). Also adds to redundancy, so the more servers you add to your farm, the less likely your system will go down. There are documents on it all over (Microsoft Patterns&Practices has some on their web sites, non-MS documentation is hard to miss if you google for it, and many third partys will be more than happy to spam you with their solutions), but it really just come down to: "Use an RDBMS that handles clustering and table partitioning, use distributed caching solutions, push as much stuff on the client side (stuff that doesn't need to be trusted only!), and make sure that nothing ever depends on ressources that can only be accessed from a single machine (think local flat files, in process session management, etc)".
With that, no matter what goes down, things go on purring, and if someone ever bitch that the system is slow, you just buy a 1000$ server, stick a standard pre-made image on the disk, plug it in, have fun.
Oh, and fast network switches are a must
> 1: First of all _drop_ Java. (when are people going to learn... *sigh*)
0 98885&p=880-8bf/090015ea80098885/TPCONTNT.pdf&toc= y
Sounds like something a 14 year old teenybopper stuck in his mom's basement might say. Dude, where have you been the last 10 years. Java PWNS the server side.
> 2: Read http://h30163.www3.hp.com/NTL/view/?id=090015ea80
Thanks for sending us to some BS terms of use page...loser.
> 3: Use asynchronous IO
Uh, dude, that'll be built into the app server in most cases.
> 4: If at all possible, stay away from threads.
All enterprise apps are multi-threaded you fool. Threading though, is best left to the container, not to business logic developers.
Please, do us a favor, and stay FAR, FAR, FAR away from enterprise computing.
Yaws & MNesia & Erlang => thank you! You just made my life a lot brighter.
The response to this question proves what I've suspected for a long time - Many nerds/geeks/techies who post on slashdot have strong opinions about technology and almost everything else and will go to great lengths to prove that they are right. Finally, there's a question that requires actual knowledge about technology to answer it (not just opinions) and look at the number of posts rated 5 (or even the total number of posts).
Sorry, this is one of my beefs with slashdot.
There are a lot of smart people here who are speaking outside of there fields of expertise. I am not one of those people on this topic. I gave you enough clues to end this argument several comments back. To me this a basic argument, that is normally resovled in the 3rd or 4th year of CS in college. Unfortunely you haven't accepted the simple overview. I have used Java since 1998(and been in the IT business a lot longer). I have a full command of the java language and many others as well. So I am informed on what I am speeking about.
If you are really interest in the subject I recomend a good 4 year program in Computer Science (Berkly is one of the best.). If you are unable to take that route, I recomend a good Computer Club (Hal-PC is the best in the U.S.) much of their stuff is online.
Some things to ponder in your search for IT knowledge in this area are:
1.) Why don't they replace or add user functionality to the Cisco Highspeed Routers in Java (or Ajax if you perfer)?
2.) Why don't they allow highspeed database functions to be added in Java (or Ajax if you perfer)?
3.) In viewing a Highspeed and/or High Availabity Archectures what are the limitations and risks of having an interperted language in the main path of the highspeed data flows.
(This should be enought to get you started.)
========
========
Better questions for the uninitated are:
1.) What is the performace envolope of java/ajax?
2.) What is Java's strong points and weakness
3.) What is the differnece betten Java Beans, Java Applets, Java Application?
4.) What is a Java Server?
5.) etc...
The answers to these question will lead you away from Java as a Highspeed Solution in most cases.
High Availablity(HA)is a different discussion. Rapid Applicatioin Developement(RAD) is also a different descussion. However Java/Ajax can be a very viable solution in these situations.
========
========
(A side note:)
Also the slashdot moderators are for the most part very good writers. But are almost totally clue less about the technical issues, and wouldn't know when to mod up a breif technical comment if it jumped up and bit them in the arse. Thus the posting of AC.
Apprentice with a Technical Architect if you want to learn about HA - High Availabilty.
That's very different than scalability, but a good Tech Arch will know and be able to explain best practices for scalable apps too.
As a technical architect, these are the types of problems I work on daily. From network design, network load balancers, to web server hardware, to application server architecture and clustering to app server hardware to DB server clustering and redundant SAN storage design.
Then add in remote replication at least 200 miles away and disaster recovery systems in alternate locations. Don't forget about local and off-site backups. Some systems need to follow the sun around the world every day with the primary system and replicated secondary systems following. These are for 20,000 plus users and they aren't trivial web sites. These are complex data drive MUST HAVE apps for my business with almost a million transactions a minute.
It can't be down by accident - ever. Well, if it does, I'm fired. My two envelopes are already written.
Oh, and setting the budget for all the software, servers, storage, and networking for all this is also my job.
My biggest complaint with developers is their lack of knowledge about the capabilities of the operating system they are writing for. Java people are the worst. Seems that about 90% of them have bought into the - I don't need to know the OS BS." Read a little about the different CPUs and don't try to tell me that you are CPU bound, so porting from C to Java and from Windows to Solaris on SPARC will be better. That's just stupid.
Oh, did I mention I was a developer for 15 years first and a sysadmin for 5 years? Watch out for Tech Arch without any hands on experience.
The OP sounds valid to me. If a system is not scalable, availability can suffer as a direct result when workloads increase.
I don't think the scalability and availability form a dichotomy.
Unless you have an app that needs to be very tightly written I think the easiest way to write a scalable app is just to break the app down into components that can each be on their own server or duplicated across multiple servers. If each component isn't keeping state for itself then it doesn't matter which copy of a given component you make a request on so you can split tasks between copies with simple load balancing techniques. This also helps keep your application code clean as it makes you keep your components discrete.
I perfer to create a pool of each component in virtual machines and then let VMWare manage the cluster of virtual machines - moving the VMs to the appropiate physical machines as needed to keep performance at peak. I'd suggest one copy of each component per physical machine you have. A lot easier than trying to measure everything and guess where problem areas will be (of course using both methods together is better).
I let my components communicate with each other using XML-RPC. You can use SOAP but I find XML-RPC to be more lightweight and easy to use. Individual components can be written in different languages and even run on different operating systems. Just use whatever tools make the most sense for each component. This is especially good with web apps as it makes it easy to create alternative interfaces. The web-UI becomes a thin, to-the-point, component and it's easy to write alternative UIs for mobile devices, desktop applications, command-line access, etc.
The only real blocking point is when you need to work with large amounts of data. You can use a clustering db to spread the load across multiple machines. You can also break database use up into it's own logical components. If two tables aren't logically connected then there is no reason you can't put those tables in different databases on different physical machines. Some basic tactics such as having one database for data crunching and another for caching and storing session data can be pretty effective. You've already split your application into components so it isn't to hard to see that those different components don't usually need to do everything in a single database.
Being broken into components running in their own virtual machines makes it pretty easy to address availability too. Component pools can detect when an individual component is having issues, isolate it, restart it, replace it, alert an admin, etc.
At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
You say you want a scaling application. But in the next sentence you only speak about scaling up. Scaling goes both ways and you want your application to go both ways. That's important. Otherwise you might be stuck with that ten servers you just needed for a usage peak.
It's just that where I work, one single Master server is NOT 'good enough' and this solution would be laughed out of the meeting. Multiple masters in different physical locations with automatic failover would be the baseline. I guess I'm not seeing it from their perspective.
Blar.
"What is high availability to you?" That is the question HP posed in a Service Guard class I was in once. It's a valid question though. I work with mission critical hospital systems in health care and deal with high availability on a medical hosting service. This means in my particular environment, we need 24/7 operation with minimum or no downtime (
Linux HA Project
IBM HACMP (High Availability Clustered Multi-processing)
HP Service Guard for Linux (also available on HPUX)
Oracle RAC (Real Applications Cluster)
Those are some ares to start with. If you are doing Oracle, you can create a GRID compute environment which will allow for true clustering and load balancing with Oracle in a shared environment with SAN. Once thing to keep in mind is that a SAN is required for most clustering. RedHat also offers the GFS filesystem which is a true proven clusted filesystem. There is another called GPFS which has been used cross platform as well, but required licensing.
When it comes to redundant hardware for HA, make sure you support the minimum requirements for heartbeat paths depending on what clustering solution you want. If you use HACMP or Service Guard, you will likely use a SAN HB and at least 1 redundant network path. Also when using a SAN, use multiple HBAs to provide reduncancy with a multi-path software such as dm-multipath (Linux), Securepath (UNIX), HDLM (Unix), MPIO (IBM UNIX), SDD (IBM UNIX). There are plenty of documents on how to do HA under various environments. I recommend looking at some of the IBM redbooks on HACMP and on Clustering. They also have redbooks for Oracle tuning on Linux with POWER, which will give you an idea about how to do Linux Oracle clusters. If you can create a Oracle Metalink account, you can find out some of the tuning and detailed info about Oracle clusters.
I am sure there are others I am missing, but that covers the base for most clusters. The only other thing is finding a persistent messaging platform (like IBM Websphere MQ - MQ Series) to handle message passing in applications. IPC is good under UNIX for programming, but not as good with clusters, security, or transaction guarantees.
The only other thing to remember is cost. HA environments do incur costs higher than small unreliable environments. Things like mirrored drives, redundant HBAs, redundant power supplies and power feeds, redundant NICs, etc. People worry about petty things like how likely drives will fail, etc. If you architect your environment properly and build your clusters, you build around that. RAID 5 on your SAN, redundant cards, fault tolerant hardware, better reporting mechanisms (HP and IBM integrate daemons on all their OS's to report potential hardware failures with mid-range to high end servers). Look at what your SLA is and what you have provide and then look for the best, most reliable hardware and software to fit in your budget to provide that. Not everyone can buy millions in hardware and software to run a true mission critical environment.
I realize that this is just background to the question, but it strikes me as an odd thing so say:
I have been the victim of numerous really crappy internal applications. It makes me mad, because it shows a lack of respect for my work. What tends to happen in practice, if the users are technically minded, is that they don't use the applications as intended, and invent some primitive system on the side. Excel sheets, and so on.
And the people who support the real system are usually happily unaware of this. They live in a fantasy land, where things work fairly well and the users are pleased.