On Building High Volume Dynamic Web Sites
"Apart from this I have been talking to commercial vendors like BEA (I was very impressed) who provided application servers with load-balancing, replication, etc., starting at $20,000 (Australian) -- they run sites like Amazon.com, Qwest, Wells-Fargo etc.
There is an issue here (is there? I don't have any experience to really know hence am asking you) ... I can build a custom solution with load balancing written at the application level. But how does this affect my maintainability (for example Amazon.com moving from just books to all sorts of other stuff .. how long did it take to redesign the site etc.)?
The site I first built could potentially hold information about a million refugees, and allowed searching on most fields regarding information on a person (wildcard queries). Unfortunately, on doing some stress testing (with around 700,000 records) I found that at most 15 hits could be handled every ten seconds. I optimized the code, switched to a faster JDBC driver, wrote a simple load balancer (and I mean very simple) and limited searching to a few fields as well as preventing bad wildcard queries (e.g., a wildcard at the start would make little if any use of the index). Consequently, I managed to get the system to handle slightly more load (200 hits at 5 seconds) (Hardware was Dual Pentium II 450MHz I think, 512MB RAM, 2x8G Ultra-wide SCSI hard drives, and running Linux of course). BTW, the Kosova refugees article has a lot of misinformation, e.g. encrypted databases, and the time to actually build it was one week (and two weeks of overcoming red tape, etc.)."
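For readers wondering what "preventing bad wildcard queries" could look like, here is a minimal sketch in Python. It is not the poster's code (that site used Java and JDBC); the function name, the set of wildcard characters, and the three-character minimum prefix are all assumptions made for illustration. The idea is simply to reject a search pattern before it reaches the database if it does not start with enough literal characters for an index to help.

```python
# Hypothetical helper: decide whether a user-supplied search pattern
# starts with enough literal characters to make use of a column index.
WILDCARD_CHARS = "%_*?"  # assumed set of wildcard characters


def is_index_friendly(pattern: str, min_prefix: int = 3) -> bool:
    """Return True if the pattern begins with at least `min_prefix`
    literal (non-wildcard) characters."""
    prefix_len = 0
    for ch in pattern:
        if ch in WILDCARD_CHARS:
            break
        prefix_len += 1
    return prefix_len >= min_prefix
```

With this check, a pattern such as "smith%" would be passed on to the database, while "%smith" or "sm%" would be refused with a message asking the user to type more of the name.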
We work on a high-volume site that serves up commodity quotes, news, wx, etc. Each user has 7 different tabbed pages that are completely customizable. It works great. We just use Apache and JServ. They do the load balancing and failover we need. EJB is cool, but overkill even for us. I like our db connection pooling scheme better than EJB anyway. We do a lot of caching and work with a good DBA on getting the queries to run faster. After our site was functional, we spent 2 months doing nothing but optimization. Our HTML is very compact. Our logging is done in a separate thread. We don't use ODBC for SQL queries since it can only do one query at a time. I've got a lot of experience on parallel systems from grad school, so that helped. We don't follow all the rules of OO design. The most notable exception is we don't keep the HTML screens separated from the data objects. That was too much of a performance hit.
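As an illustration of the "logging is done in a separate thread" point, here is a rough sketch using Python's standard logging module. The poster's site ran on Apache and JServ (Java), so this is only an analogous pattern, not their implementation; the function name and the logger name "site.requests" are made up for the example. Request handlers put records on an in-memory queue, and a background thread does the slow write.

```python
import logging
import logging.handlers
import queue


def start_background_logging(target_handler):
    """Route log records through a queue so that the calling thread never
    waits on the slow handler (file, socket, ...).

    Returns the logger to use in request code and the listener, which
    should be stopped at shutdown to flush pending records.
    """
    log_queue = queue.Queue(-1)  # unbounded queue
    # The listener owns a background thread that feeds target_handler.
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()

    logger = logging.getLogger("site.requests")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Request threads only pay for putting a record on the queue.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener
```

Calling listener.stop() at shutdown drains the queue before the background thread exits, so records logged just before a restart are not lost.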
MP3.com uses MySQL! It gets 20 million page views per day! Where did your MySQL crap out?
Are you mad???? mySQL + PHP..THAT'S the way to go...uber speed, did a stress test on a site I run that tracks over 800,000 Half-Life players... no loss, no crash...
This is the BEST article I've seen yet. I do this exact same config (without the BIGIP) and I've seen it work from Single Processor servers on up to Quad Xeon with Orion Fibre Channel SCSI Configs. The bottom line is db calls. Minimize the db calls and you are there. If you are running NT, ISAPI is where it is at. You can cache ISAPI scripts, therefore taking extra load off of the CPU and minimizing the time it takes to complete the process, and it uses fewer handles as well. Great article... Issues of greatest importance... planning, planning, and planning... do it before you start or pay the price later.
At least some part of this is due to shitty HTML, not an underpowered server. Remember, every page of comments is generated as a huge table - and you have to download the entire table before your browser can render it. If CmdrTaco and company had any sense, they'd find a way to do things without tables, so that browsers could progressively render pages as they load, and you wouldn't have to wait so long to get some interesting content to read.
re: data modeling. not sure what you mean. for my purposes excel does just fine against the data collected by IIS/SQL, and a lot of the science I capture in macros so the doctors just point and click and play with charting until they are happy. And there's a real OLAP product that looks pretty sexy but I haven't needed (because user drag+drop edits of excel pivot tables on 10k - 100k row datasets take 10 or 20 seconds, but that's ok).
In the visual studio kits and at the web site, it looks like they have all the components for dropping in / participating in advertising and the business to business and business to consumer data modeling - understanding what your users are doing and interested in, but I haven't had a reason to dig into them.
re: Dell. their maintenance engineer says "tain't so." Solaris ran dog slow on the x86 (they have a corporate goal of running Dell on Dell), and they wanted to be a consumer of tools, not an inventor (it's not their business). He said they are the largest dot-com (by revenue and by a large margin) on the web. And they had only a few minutes of web down time last year (a total of less than 10 minutes). Amazing stuff. He said "treat a PC like you treat a mainframe and it will never go down" (do root cause analysis, have a change control process, test before deployment, etc.). And their rack mounted stuff takes up a lot less space (for the same performance) than the mainframes and Sun gear.
I hadn't thought about it that way, but the 8ways introduced last year are -larger- than any (non-sysplexed) mainframes that existed just a few years ago, i.e. 10s of thousands of startio-s/s, 100s of disks attached, 100s of mbytes/s of IO, 5000 or so mips. Nothing like living on a "faster" exponential curve than everything else that ships in smaller volumes.
Ari
Just as an experiment, I timed how long it took for the page to load after I clicked on the "Reply To This" link below your comment. 44 seconds.
I'd say that /. is most certainly not proof of anything at all about MySQL and Apache.
I'm guessing somehow that AOL does. I hear they get the occasional web traffic. And do you really want to make a popularity argument, anyways? From the perspective of an IIS developer, Apache/perl is a somewhat "weird" solution, too. When Greenspun tells us how to "do it" and lets us consider LINUX, Perl, PHP, MySQL, and other "open source" solutions, I'll start calling him an uc("god").
Maybe when Linux and MySQL are up to the task of backing sites on the scale at which Greenspun is interested in building, he will.
No, Greenspun's no god. But nobody is.
Kosova is the TERRORIST way of naming Kosovo. Spell it how the FREE world spells it.
http://www.freerepublic.com best political news source in town
Check out infotone which has developed client web software that solves this exact problem. Instead of relying upon expensive (and frankly, difficult to scale) server-side stuff, it relies upon your viewers to do much of the "server logic" on their machines in a secure way. It's free, beating the pants off $20K+ solutions.
To start, any large scale system like that would do well with a custom, managed index search as opposed to a simple database freetext field search on just any of the searchable columns.
fourth
PHP is not quite a Competitor to WO, but I don't really think ASP is either... I love PHP, use it every day, and hack it for fun, but when it comes to doing the sort of thing WO is made for (large, logic driven web applications with multiple developers and data sources), WO cannot be beat.
In an n-tier environment (web server, app server, database server) all the application server really does is:
I have worked with a half-dozen commercial application servers and they all work equally well for anything I've built. Since there are 2 excellent open-source app servers, Zope and Enhydra, I see no reason to spend a dime on a commercial product unless of course their proprietary RAD tools are something you can't live without (trust me: you can live without them). Still, none of them beat AOLServer pointing to an Oracle8i DB set up for transactions.
The only thing you really need to remember in transaction processing is that the network is the bottleneck.
WebLogic(not clustered), EJB, and Oracle. 3 million records, complex joins, wildcard searches, heavy-duty number crunching. nothing takes more than 5 seconds. it rocks.
Hello boys and girls,
The word for the day is hackjob. Can you say hackjob?
HACKJOB!!!
I knew you could.
This is false information. Dn.net is far below capacity on its multiple oc3's and double digit amount of t3's. It's either the slashdot code or the employees at andover can't build a cluster properly. BTW, freshmeat runs on dn.net as well and never has any problems.
And there are plenty of quality perl books (perl is synonymous with cgi for many people) and hundreds of thousands of modules and scripts. I think your whole point is irrelevant anyway unless you're unable to program at all and must use ready made code.
I once developed for an e-commerce site and chose FastCGI as a backend solution. It's a lovely solution. On a single processor, we managed to push out 40+ transactions per second, and it was the web server that was loaded, not the fastcgi module (which was sleeping most of the time). In fact a single instance of the fastcgi module could generate hundreds of dynamic html pages per second. Also, the greatest benefit is that a crash in the fastcgi instance will automatically launch a new one without affecting the web server. I am still opting for fastcgi as the primary solution for delivering database information.
Hello, anybody home? You can know everything about setting up database backed sites and still have a site crash and burn. There's this thing called highly unpredictable human element. One newspaper could have mentioned him and the site could have been pelted to death with requests. Then there is this thing called budget.. jeez Really. That site is in the range of exponential user increase which you can't predict reliably at all. Just look in the news about how many sites crash because of the unpredictable human element leading to sudden exponential increase in hits. You're a moron if you judge him by a Boston Marathon site, which I doubt he had a large budget to play it safe with.
mod_perl should be able to handle somewhere close to this on a single machine and easily when load balanced without considering database overhead. A vbscript ASP backed solution should even be able to handle at least half this without, again, considering database overhead. I currently do 4.9 million database backed dynamic pages per day with multiple selects, large inserts, even full text search on a dual xeon 500 and 2gb of ram with persistent connections to an oracle database server with 742 million records. Benchmarks suggest that for the vast majority of the time the perl scripts are waiting for the database (which suggests I should re-evaluate the oracle server, but that's another story).
While your model works quite well for you, it doesn't really address this guy's problem. If he's correct that the database was the bottleneck, there is little a load-balancer can do unless you've replicated your database (big $$$$). Depending on the processing performed by the servlet, he might get an equal benefit by moving it onto a separate box. In either case, it probably wouldn't matter much.
I agree, but for a different reason. No transactions, no rollback, and no foreign keys.
i've developed in both WO and ASP......take it from me, WO is crap. for a lot of reasons. also, there are not a lot of WO people out there. i don't know where (geographically) you are, but my company on the west coast looked for 2 months for ANYBODY to do WO work, to no avail...had to hire Apple personnel (...and boy, did their code suck). ASP, while not super-scalable, is probably the fastest development time-wise. also, everybody and their sister knows ASP, as it's usually the first "middle tier" environment JR coders are exposed to.
If it were me I would use Tango from pervasive software http://www.pervasive.com It's an app server, plus they sell a database, but I agree with others in that Oracle 8i is the way to go.. well once they get it stable on linux in a couple of months ;) Places like theglobe.com use it for everything dynamic (though they started on NT, then added solaris), which is about as dynamic and busy as you can imagine and it has a persistent mod for Apache on linux for speed. It is scalable as you can run the client on your webserver and have it load-split to multiple tango servers and db servers on the back end, heck you can have multiple webservers on the front end too, depending on how much SQL you do in your pages. The best part (IMO) is that it comes with a GUI dev tool that is made for object oriented design. No page based templates here... I know contractors who have had to raise their bid because the client did not take them seriously when they spec'd it for using tango... As for web objects.. if it is so great, why does apple use tango internally? http://www.apple.com/jobs/openings/search.html
Cold Fusion is another possible solution. Runs on linux now too. http://www.allaire.com
Go look at the book on his site. It gives many possible solutions (even states that NT and ASP are viable in many cases).
Of course PHP is faster if you are going to set it up against C cgi's. However, why would anyone go to the trouble of rewriting their site in C only to have it run as a CGI? Anyone writing in C is using some method to run in process, such as writing their own custom apache mods.
Slashdot is a terrible example of a high volume site.
They are only up 85% of the time, and their daytime responses are pathetically slow.
In general, Slashdot is one of the slowest and most unreliable web sites on the Internet today.
I have the data to prove it, too.
what sites have you been administrating that need to do 100 million hits of purely dynamic pages per day without a cluster of servers (or one big monster a'la sun starfire)? I agree that you're going to want to use c or whatever lower level language in this case -- but there are certainly not many sites that are doing this many hits per day. Even www.aol.com only gets around 300 million hits per day (mostly static).
I visit it multiple times per day and have zero problems with it. Slashdot, on the other hand, has minutes where I can't even connect to the http port.
What tools do you guys use to stress test apache?
Have you read about the load-balancing capabilities built into Apache's mod_jserv 1.1? Here's a document outlining it.
> I'm not saying this phoseciously (sic)
facetiously.
I'm surprised no one else mentioned this:
Unless you have a very good reason not to, make your code available for review.
Apparently it's pretty scalable. But I haven't had cause to investigate it, as the sites I've been working on tend to be fairly low (in the 15,000 hits a day range).
Zope is another alternative. Object oriented, persistent CGI, Fast CGI... lots of add ons.
Open source! I can't imagine not using it anymore.
Wouldn't want to.
For me PHP/apache was a quicker start for a small project, but I was really blown away later on a much bigger project by the visual studio tools and the VB / ASP runtime support (how 'bout using http for automated instrumentation steering & data collection? :-).
0 9380,00.html
It took a bit of getting used to but it's amazing how easy the IDE makes creating, debugging and tuning a (distributed) dynamic app - I don't know how they do it, but it looks like VB is compiled (feels as fast as C code?) and even though it is single threaded, it runs like it is multithreaded(?) - and it can run across multiple nodes for both more performance (whenever you need more just add a node) and reliability (scales like a dream and I didn't have to do much more than click buttons and plagiarize code).
They put some full sample apps in their kits that look like what pcweek tested a year ago. See: http://www.zdnet.com/pcweek/stories/news/0,4153,4
If you run the numbers on Compaq vs Sun, the report says Compaq delivered 3200 or so pages per second for $180K in 1/8th of the Unix solution floor space, and the mostly Sun solution delivered 1300 pages per second for $430K. Granted, benchmarking app servers is still a new art. And Compaq's VB number was a little less (but still 2x the Sun number).
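To make the price/performance comparison concrete, here is the arithmetic spelled out as a short Python sketch. It uses only the rounded figures quoted above (3200 pages per second for $180K, 1300 pages per second for $430K); the function name is made up for the example, and the result inherits whatever caveats apply to the benchmark itself.

```python
def dollars_per_page_per_second(system_cost_dollars, pages_per_second):
    """Cost of one page-per-second of measured throughput."""
    return system_cost_dollars / pages_per_second


# Figures as quoted in the comment above (rounded by the poster).
compaq = dollars_per_page_per_second(180_000, 3200)
sun = dollars_per_page_per_second(430_000, 1300)

print(f"Compaq: ${compaq:.2f} per page/second")
print(f"Sun:    ${sun:.2f} per page/second")
print(f"Ratio:  {sun / compaq:.2f}x")
```

On those numbers the Compaq setup comes to about $56 per page-per-second against about $331 for the mostly Sun setup, a bit under a 6x difference.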
This also makes me wonder what's more important, making more folks programmers (someone who is an expert in something else first and a (not so good) programmer second (like me).. java is neat, but basic is really simple), vs making a pretty good programmer even more productive. Seems to me basic (or scripting, or visual programming that connects the dots) "scales" better (produces more solutions) than something more complex (like java or yet the next object abstraction / distraction). So my comparison is really PHP to (compiled) VB, and after the learning curve, I think VB (and the ASP runtime) is a lot (2-3x on NT4, 4-8x on windows 2000) faster, and easier to "point & click" install and manage across multiple nodes (and no more getting paged at midnight when my app gets fed unexpected data by the instrumentation - because the network load balancing mechanism also handles the errors - ie. the "retry" just works.).
re: space.. my lab is short of floor space, plus I hear that folks like exodus charge by the square foot.
fyi, the same cost & performance difference shows up in the www.specbench.org specweb numbers (where there's something strange going on, it looks like nobody but the PC crowd benchmark an actually used web server, most everyone else tests with something I've never seen used on a real site called "zeus"?).
And it looks like compaq has also cracked the scale out database problem. They seem to get near linear DB (tpc-c) scaling going from 32 to 64 to 96 processors. (about 2500 tpm-C per processor). see: http://www.tpc.org/new_result/ttperf.idc
smokin!
Ari
4 million plus hits a day, running off of 4 MySQL servers, each doing up to 250 queries a second. The keys with MySQL are: a) keep it in memory. We use 2GB memory per machine b) either don't do writes, or don't do any selects which take more than a second. That means mostly single row keyed queries. c) Don't do searches on the database, keep another copy of the data, either exported regularly or updated in addition to the database for searching. You can practically use grep on a flat file and support most of your searches, or keep it in memory. Just remember: Distribution is key. If you can, keep a copy of the data on the webserver. If you can't do that, then have multiple locations for the data.
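Point (c) above, keeping a separate copy of the data for searching, can be sketched in a few lines of Python. This is only an illustration of the idea, not the poster's setup: the function names and the (record_id, text) row shape are assumptions, and a real site would reload the copy from a regular export rather than build it from a literal list.

```python
def build_search_copy(rows):
    """Build an in-memory copy of the searchable text.

    `rows` is any iterable of (record_id, text) pairs, for example the
    lines of a regularly exported flat file.
    """
    return [(record_id, text.lower()) for record_id, text in rows]


def search(search_copy, term):
    """Grep-style substring search over the in-memory copy.

    Returns matching record ids; the database is only needed afterwards
    for single-row keyed lookups of those ids.
    """
    needle = term.lower()
    return [record_id for record_id, text in search_copy if needle in text]
```

The search itself never touches the database, which fits the advice to keep the database to single row keyed queries.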
I don't think as far as scalability, AOLServer and TCL are a really fabulous idea. Well okay, maybe scalability, but certainly not portability. And Oracle? How many thousand extra dollars do you happen to have sitting around? My question is this: People will constantly goo over Phil Greenspun all the time, but it is always ALWAYS people who have never run a server as he suggests, or any server. Can one person point me to a non-Greenspun and non-Ars-Digita site that runs AOLServer, TCL and Oracle? Anyone? I'm not saying this phoseciously (sic), I am genuinely curious, because I have not heard of anyone doing this since like 1997 latest.
Wrong. If the database is written to then it writes to disk and loads back to RAM. If the box crashes it will not result in corrupted data.
Philip was only illustrating simple (and high level) concepts through use of the specific software he uses. He never said or implied that you use AOLServer/TCL and Oracle only. If you would have thought a bit harder, you would have realized that his book and his courses are not specific to the particular implementation illustrated.
Relying on the browser to work as expected is right up there with not doing any QA on your web application in the "Asking for problems" category.
You can serve 600 - 1200+ requests per second using mod_perl depending on the dynamic code on average hardware. Before jumping down to a lower level language, you have to have a damn good reason why you're going to cut into development time just to get a doublefold++ improvement when you could have just put that money into hardware and spent your time actually getting the site/service up and running. Of course, if you're comfortable using c and have all the time in the world (or your project isn't very complex -- or has to scale to incredibly insane levels).. Anyway, just keep costs in mind (and remember to balance labor in the equation).
If you're even considering ColdFusion I suggest you also give Tango a try.
and please put a halt to this damned slow non-cacheable form of network pollution called dynamic content. write a simple script that creates pages, and put in frames what really has to be dynamic. i know doublespam.comspam doesn't like this principle, but at least normal people get back the speed they deserve.
Please elaborate on this. I don't really know what you are saying since he said that the columns were indexed.
Actually, AOLServer version 3.0 will be open source. Betas are currently available for download. The code is available both under the AOLserver public license (basically the Mozilla License) and the GPL.
As for the expensive Oracle license: When the OS world produces an RDBMS of similar quality to Oracle, I'm sure many people will consider switching.
And as for the expensive hardware: Well, yeah. You don't run a site of any significance on a generic beige box, and that farm of generic beige boxes you're contemplating isn't cheap, either.
It seems to me that the difference between MySQL and Oracle is that as you move from 10k to 100k to 1M to 10M records, the slope of the curve in increasing query time is significantly flatter for Oracle than MySQL, especially when the database is under heavy load. I had a simple site running of my own design using MySQL, and to pull a row by index when I knew the index, still was occupying 3-4 seconds with MySQL when there were 300k items. Oracle might start at 0.3 seconds instead of 0.03, but it handles these lookups significantly faster.
Coward from the East
I've done a decent bit in WO and ASP. I wouldn't say either is ideal, but I can get a project done in about 1/10th to 1/4 of the time in WO, and it will likely be more feature-rich, more maintainable, and cooler. WO takes an incredible amount of handshaking off your hands, as well as having a transparent persistence layer and other such goodies. On the other hand, unless the project really needs WO's features, its cost is prohibitive. ASP is much easier to learn, and a lot of people know it. On the other hand, if you took your average ASP programmer and threw them at the kind of task that WO is meant to handle, the result would be very unappealing. ASP, being easy to learn, also tends to foster sloppy code and bad programmers. PHP likewise, although it's more likely to be picked up by linuxheads and/or PERL people than, say, a windows newbie who wants a personal web page, so it's more likely to be written well. Speed-wise, I'm told it's faster (haven't tested that myself), it's about as easy to learn (most of that depends on what langs you know already) and it's a different platform for different zealots. And so on. But if you're talking about WO, you should really be comparing it to things like WebSphere and Sapphire/Web, not ASP. It's an app server with comprehensive dev tools and a lot of nice features. Sadly, with the price it has, it's overkill for most projects, and it really /is/ difficult to pick up. It's not for the faint of heart.
- Geoffrey
Is it actually possible to use ACS with Postgres? While it's certainly possible to use AOLserver with Postgres, I'm under the distinct impression that some of the SQL in the ACS is Oracle-specific.
Unfortunately, there doesn't seem to be much way of avoiding this, short of the solution SAP uses to achieve portability between RDBMS's: develop your own SQL variant (in SAP, it's known as "OpenSQL") and use drivers to translate this into the SQL variant used by whatever RDBMS you're running on top of. Unfortunately, this system is a pain in the ass to build, and the generic SQL variant provided to programs tends to suck compared to the 'native' SQL of your RDBMS. In OpenSQL, for example, there's no way to even do a simple join between tables.
You are being throttled by MySQL. Step one, get rid of it. It doesn't handle multiple users acceptably. I have had better results under PostgreSQL and much better results with Oracle 8i. I have a site that runs as a CGI back-end machine for two Suns. Combined load is 350,000 page view per day. MySQL constantly craps out under the load.
That's what philg says anyway. And he should know.
So, why GSWeb or WebObjects?
- It's a framework, not an application server. In other words, you don't build applications, you use the WO framework to build application servers. Each app handles its own memory, threading, resources, etc for incredible performance and scalability. This also makes it extremely flexible as it's a strong OO environment and almost all of the functionality of the app server can be overridden in your application.
- EOF (Enterprise Objects Framework). If you've never used EOF (or GNUstep DB) you don't know what database development could be like.
- It's a standard environment. By this I mean the programming is more like a standard GUI app. You don't have a form that posts data to the same page or another page. Instead, you have a form with an action property that is bound to a method so that when an action is performed (click submit) the method is called. You as a developer don't have to worry about form names and how to do validation, your method can do whatever it needs and return an arbitrary page without messiness like forwarding or including other pages.
There's lots more. Basically, when logic very much outweighs content, WO is hard to beat. On the downside, it is a completely different way to think about web development and there is a learning curve. For OpenStep developers, this will be small, but for programming novices, it could be quite steep. But how productive you can be with a tool is definitely a factor that needs to be considered.
Check out GNUstep Web at www.gnustepweb.org. I have the source for a very simple GSWeb application up at http://zeus1.tzo.com/GuestBook/, this is basically a GSWeb conversion of apple's first WebObjects tutorial app (yes, it's entirely useless and it would be stupid to use WO for a GuestBook :)
The reason is primarily that the system has not been embraced by the community. There aren't people outside of ArsDigita hacking on it, and the people inside are only hacking on it in ways that are good for ArsDigita (as of now, there is no development team for the toolkit, site development teams modify the toolkit, and sometimes their mods get folded into the distribution).
Philip will try to sell you on the "proven" AOLServer (uh, almost no one uses it in the real world) which was a very sketchy open source project last time I looked at it. His system also backends to Oracle, which is not going free anytime soon. Finally, he will try and sell you the ACS as the greatest thing since sliced bread. It was, in 1995.
So read his book, which is pretty good (you could even buy it, it's the only computer book I would put on my coffee table, lots of pretty pictures). Just ignore all the technical parts.
PS: It will also force you to do development in TCL, which sucks ass.
The continued popularity of MySQL amongst the open-source community continues to amaze me. MySQL is not open source. Go read their license. It's not even free as in beer. Almost incidentally, it's not a database either. There is no transaction support, which means that if your box goes down the database is likely to be in an inconsistent state, and there is no easy way to fix it. The stunning thing is that there is a real database that is open source as well: PostgreSQL.
Take a look at
http://www.eddieware.org/
- Easy to learn. Once Apache is configured with Php, it is seductively easy to write code.
- The Db connection pooling comes in handy.
Cons:
- It is yet another language (I mean, I already know perl, right? Why do I have to learn another language?).
- Also, you tend to design page by page. It does not have a great library system to use. It has lots of code snippets you can copy.
Notable sites: Sourceforge.net, and persiankitty.com (reputed to be the yahoo of porn)
- embperl: Looks like Php style of embedding. Comes preconfigured with DBI and friends. But, too cumbersome to program. It also does not encourage component programming. It provides the substrate for you to build other features you might like.
- mason: Has cool component features. Has neat features such as caching with an interesting way of managing it, and autohandlers. Looks ideal for the publishing world, where it evolved. On the cons side, not too many components.
- Apache::ASP: Not used it much.
Notable sites: dejanews uses embperl. techweb, stamps.com use mason.
As an aside, python has the weirdest variable scoping and declaration rules. I ought to know, I have a PhD in programming languages.
The biggest selling points of ACS for me are:
- The documentation. Quite possibly the best documented system. I can take an average joe out of the street and train him to use ACS in a systematic way in no time.
- The Data model. I would expect to pay hundreds of thousands of dollars for such a datamodel. It constantly amazes me to find the little details that I needed in that datamodel.
- Best practices to run a website. How to harden a Unix system and set up the services so that you can sleep peacefully.
The negative points are:
- Alas, it requires Oracle. I just learnt of the existence of ACS/pg and I am rejoicing!
- Not too wide spread usage. I expect this situation would change. Look at it this way. I tried building a community web site. It took me untold hours to gather all the snippets of information for a Php based site. It took me no time at all to build it using ACS. I would use ACS for a community based website over any other toolkit any day.
- No good enough template mechanism. Fortunately, ADP in AOLServer is changing that.
So, in conclusion, php and modperl and zope vs ACS is not the right question. You can implement ACS in php, and I frankly hope somebody does to save a lot of human misery and suffering.
Rama Kanneganti
Somehow my emacs inserted that "use Php" in the previous sentence.
Thanks.
Yeah, but templates aren't part of JServ.
Recommending JServ as an alternative solution to Dynamo is like (strained analogy) giving sheet metal and welding tools to kids vs. giving them a climbing frame.
Caching data in a hashmap is a good first pass solution, but then you have to deal with threading issues, which brings up the problem of State. Writing an LRU thread-safe cache is perfectly possible, but a total waste of time if you're trying to write a web application.
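For reference, the kind of thread-safe LRU cache being described is roughly the following, sketched here in Python rather than the Java a JServ application would use. The class and method names are made up for the example, and it uses one coarse lock around every operation, which is the simplest correct choice rather than the fastest.

```python
import threading
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache guarded by a single lock."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._items = OrderedDict()  # oldest entry first
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._items:
                return default
            # A hit makes the entry the most recently used.
            self._items.move_to_end(key)
            return self._items[key]

    def put(self, key, value):
        with self._lock:
            self._items[key] = value
            self._items.move_to_end(key)
            # Evict the least recently used entry when over capacity.
            if len(self._items) > self._capacity:
                self._items.popitem(last=False)
```

Even this small version leaves open the questions the comment alludes to, such as when cached rows become stale and how several servers keep their caches consistent.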
If you must go for a free application server, choose Zope or The Locomotive. I haven't tried them out, but they seem to have the (minimal) functionality required...
Since I'm an ATG employee, I'm going to restrict my comments to the strictly personal non-Dynamo related side.
I've written server-side Java using Apache Jserv before I ever used Dynamo. Server side Java alone will screw you up horribly, because you can't easily generate dynamic HTML pages, and unless you write your own templating feature you're stuck mixing HTML code with Java code.
This will stymie the designers completely, because the only way they can make changes to the page is to ask the programmers to do it for them. The programmers are stuck doing both design and programming. Forget about making easy progressive changes to the code -- if you have one page that you want to change every third element, you'll have to trawl mounds of printlns.
Not to mention that there's no personalization. There's no database cache so you go to the database every time you need data (which can be horribly expensive). There's no connection pooling. If you want to grab stuff from the database, you have to write dynamic SQL to grab elements... I can't describe how tedious this can get.
If you want to write everything yourself or don't have the money, use Jserv. It's a good Java server, and it does everything it's supposed to. Whether it will do everything you need is another matter.
Your post is very informative, but there is one thing missing. You said you're using multiple Apache boxes which talk to the MySQL box. But how do you distribute the load among the Apache boxes? The incoming traffic needs to be somehow directed towards one of the Apache servers. Also, what if one of them dies? Will it kill the whole cluster or will the other boxes take over?
If you think big enough, you'll never have to do it.
Before I forget.
Another important feature is that these techniques are server side so are independent of the browser used. You can also use PHP with Oracle 8i but most of the really cool stuff is centered around MySQL. Plus, why spend money on software if you don't have to? Spend the money you save on paying for support or donating to the OpenSource projects (PHP and MySQL). Yet another advantage is that the application can be developed and tested on Windows (read Configuring Windows98 For Local Dev) then uploaded to the Apache Server. A good free program with PHP syntax highlighting and tons of other goodies for developing is HTML-Kit.
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
--
Last summer I worked for a while at Planet Online in the UK. (They're big Linux supporters and host linux.org.uk). The focus of their activities is Freeserve, the largest ISP in Europe with two million users. It's specified for ten million.
The entire system runs on Linux, with NetApp filers doing the actual disk storage.
The user database runs in MySQL. Every mail sent or received, every user homepage hit, every DNS hit, involves a query against the back-end database. That's for two million users, almost a hundred thousand of whom are logged in at any one time.
The amazing fact is that it isn't slow, it isn't unreliable, and it isn't reaching its limits. Why? Engineering.
The people at Planet simply have an in-depth understanding of the hardware and software that makes up their platform. They've put in fast kit and lots of memory, and they've tuned the code (long live open source) to make use of it. It flies.
20% as fast? No way. 20% slower maybe. But either way all you're saying is that your proxy server isn't as good at talking HTTP as your FastCGI server is at talking its own protocol. Naked HTTP communication doesn't add much to the content except a few headers, so this should be fixable just by improving the HTTP proxy code to be as efficient as the FastCGI networking code.
FastCGI is a good technology, but the problems you describe with mod_perl have a pretty simple solution. You just use mod_perl as an application server and put a reverse proxy (like Squid or Apache + mod_proxy) in front of it to serve static requests.
Don't fall into the application server trap. Think about what they're telling you. If you load-balance your Java servlet runners, you will still have to come up with a separate solution for load-balancing your web servers in front of them.
Ultimately the proof is in the pudding. None of the big sites use commercial application servers for anything they really care about. I know people who work at Yahoo, Amazon, and others, and they use home grown stuff, often with Apache, and often in C or Perl. You can make a fast system with any number of open source tools like PHP, mod_perl, Resin, AOLServer, etc.
Your search speed issues can be fixed if you put MySQL on a box with enough RAM to keep most of your data in memory and use proper indexing. For full text searching, you want to build an inverted word index of your documents. There are plenty of examples in Perl that you can steal from if you need help. Try looking up the module Search::InvertedIndex on CPAN, or search the Dr. Dobb's Journal archives for an article about this subject.
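To make the inverted word index idea concrete, here is a minimal sketch in Python (not the Search::InvertedIndex module, whose API I am not reproducing here). The function names build_index and search are hypothetical, the tokenizer is a crude lowercase split, and the whole index lives in memory; a real one would be persisted and would handle stemming and stop words.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Crude tokenizer: lowercase runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A lookup is then a handful of set intersections instead of a wildcard scan over 700,000 rows, which is the whole point.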
Now, you need to do some load-balancing. For a good overview and some possible options, read this:
http://www.engelschall.com/pw/wt/loadbalance/
Then check out some of the links to IP-level load-balancers at http://perl.apache.org/guide/download.html#High_Availability_Linux_Project.
There are probably some good FreeBSD-based projects as well, since products like big/ip and Coyote Point's Equalizer are based on it.
If separating content from presentation is what you're after, take a look at the Perl module "Template Toolkit" on CPAN, or look at http://www.freemarker.org/ or http://www.webmacro.org/ for Java.
I'm not the previous Anonymous Coward.
However, I have to look at it the same way -- just like I wouldn't judge MySQL by one site, I won't judge the ACS by one site. Nor will I judge AOLserver by one site.
However, when AOL is pumping 28 thousand hits per second out to the net through AOLserver, I _do_ have to sit up and take notice.
Oh, I've run AOLserver for nearly three years on my Linux box -- and it runs smooth as silk. Very easy to develop dynamic pages in, very scalable, and wicked fast.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
On the other hand, the pages served by /. are quite large by common web standards... lots of query results/text per page, especially for Nested or Flat mode on a story with 400 comments. Definitely not your average web "hit".
This sounds bogus to me. Slashdot runs on MySQL... Slashdot gets a LOT of traffic, and those hits involve a LOT of database access.
140,000 pages per day is a pretty small load. A very small load for a custom C webserver. You could handle 140,000 pages a day easily with ASP and VBScript, for that matter.
I read an article recently stating that VA Linux will be working with TCX to add "real" (non-"hack") replication to MySQL.
No need to write your own template engine, there are several out there to choose from.
Regarding non-DB caching, that is an excellent point. However, it's not really that hard of a problem to solve, either. It's nice to have it solved for you (Dynamo, etc.), but as long as you have everything running in the same JVM, you can cache data in a HashMap (keyed by user ID or whatever); just make sure to watch threading issues.
I've used Dynamo extensively for the last 1.5 years, and I can recommend against it with absolute assurance. It forces you into a slow and unproductive development cycle, it's byzantinely complex, and it's not very reliable. As many of the posts have suggested, you're really going to need to do the basics:
- Improve your db
- Cache the living heck out of everything
- Buy more memory
- Buy more machines
- Etc.
And guess what, Dynamo doesn't help you do any of those.
Marketing != Truth
If you want to write server-side Java (not a bad choice), go with Apache's JServ for absolute sure. It is delightful.
I have written a truly remarkable operating system which this sig is too small to contain.
OH yeah, make sure you pool your database connections. A lot of time is spent creating and tearing 'em down.
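A minimal sketch of what pooling looks like, in Python. Everything here is an assumption for illustration: the ConnectionPool class is hypothetical, and sqlite3 in-memory connections stand in for whatever database driver you actually use. Connections are opened once at startup and handed out from a queue; a caller blocks if all of them are busy.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: connections are created once and reused."""

    def __init__(self, size, factory):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())   # pay the connection cost up front

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free (raises queue.Empty on timeout).
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)        # hand it back instead of closing it


def make_connection():
    # Stand-in for a real database connection; check_same_thread=False
    # lets a connection be used by whichever thread borrows it.
    return sqlite3.connect(":memory:", check_same_thread=False)
```

Usage would look like `pool = ConnectionPool(4, make_connection)` at startup and `with pool.connection() as conn: ...` in each request handler.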
--hunter
RateVegas.com - Vegas Reviews
Check out Enhydra - it's an open source application server that has been in development since about 95 (http://www.enhydra.org) and it runs on Linux. It's also tested and used by FedEx, Huffy Sports and others so you know it must be pretty good.
\forall code \in C, \frac{\Delta readability(code)}{\Delta t} < 0
There's an application server that was developed in-house by what used to be Small World in New York, now iXL's New York office.
It was originally just called "the Filter," but on its 3rd rewrite in 1998 was rechristened "Velma" (yes, from Scooby Doo). Info on it is available at http://velma.ixl.com/.
The first version was Perl-with-CGI, the second C/C++-with-CGI. The third, Velma, is C/C++ with a persistent connection, like most professional app servers.
Though it was originally written for Solaris, I know of at least one port to Linux.
The idea when iXL decided not to support it was to release it Open Source. I don't know if that happened officially, but certainly unofficially it's available. The developers who made it have all left the company, so there's probably no one officially "in charge" of it or of dealing with other open-source contributors, but if someone offered to take it off their hands, I'm sure they'd listen.
It really is good technology, I wrote custom dynamic web sites in it for almost a year.
Please email me (take out the S_P_A_M, sorry) for more info, I think I could hook you up with the right people.
ColdFusion Server, and Sybase...
"What do you do with the mad that you feel when you feel so mad you could bite?" - Mister Rogers
It's worth having a look at what your Java code is doing, and selecting your implementation of it with care. http://www.volano.com/report.html will give you a good comparison of a few of the brands out there. I was working on a project using a Java front end to an Informix db at my last job, and one problem that I vaguely recollect was something like all reads on hashtables being synchronized, i.e. only one thread could read from it at a time. Those sorts of things can produce severe bottlenecks. I think IBM provide some fairly good runtime analysis tools with their jikes compiler for free; have a look around their site (http://www10.software.ibm.com/developer/opensource/jikes/project/index.html).
Never trust a man in a blue trench coat, Never drive a car when you're dead
I managed a reasonably big site (16 CGI hits/sec) using mod_perl, php, mysql, and LOTS OF RAM. Your single best optimization, as listed above, is LOTS OF RAM to cache with. mod_perl has some other great tricks: if you're using templates, put a BEGIN{} block at the top of a module that is a PerlRequire in httpd.conf, and assign templates or other file-IO portions of scripts to globals. They stay in RAM then (but you bought LOTS OF RAM, right?).
:)
Put mysql's data partition
a) on its own partition or better yet DISK
b) on its own scsi controller
If mysql is your bottleneck, run oracle. make lots of index tables. run benchmarks on your queries. with mysql, avoid table joins if you can, it's much faster without them. Optimize the Sh#t out of your tables for query speed.
don't run heavy cpu junk like log analysis on the box that needs to serve dynamic content
use squid cache or another cache, or even plain old ramdisks to hold your static stuff, remember that IO is a huge bottleneck. Try to put everything in ram.
don't run 4 quake3 servers and one unreal tournament server on the box when you are anticipating heavy load
Cache anything you can (did I mention that?). Take slashdot pages for example: every time someone posts a comment, you should take a dump of the dynamic page to flat html. When the next person requests the page, give 'em the dump if there are no new comments (saves hitting mysql every time!). Of course, you cached that html page in your LOTS OF RAM, right?
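A minimal sketch of that page-dump idea, in Python. The PageCache class and its render callback are hypothetical names; the "version" here is just the story's comment count, so the expensive render (the part that would hit mysql) only runs when a new comment has arrived.

```python
class PageCache:
    """Hypothetical cache of rendered pages, keyed by story id."""

    def __init__(self, render):
        self._render = render   # expensive function: story_id -> html
        self._pages = {}        # story_id -> (version, html)

    def get(self, story_id, version):
        cached = self._pages.get(story_id)
        if cached is not None and cached[0] == version:
            return cached[1]    # no new comments: serve the dump
        html = self._render(story_id)
        self._pages[story_id] = (version, html)
        return html
```

In a real setup you would still need a cheap way to learn the current comment count (or have the posting code invalidate the entry directly), otherwise you are back to hitting the database on every request.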
Imagine you need to serve a lot of file download requests. Apache has a built-in maxclients limit of 250, but you can modify that in the source. A dual p3/600 + 1gb ram can easily saturate a t3 with static content..
More stuff.. don't open lots of filehandles if you don't have to. Optimize out any calls that open a new shell (don't use $var = `pwd`; in perl, for example, use built-in functions that don't require a new shell). Modify your linux kernel to allow more open file descriptors, max user processes. Nuke any unwanted ulimit directives in your start-up scripts.
Remove daemons you don't need on the box. Don't run anything you don't absolutely need running.
Run more than one instance of Apache- one compiled with mod_perl or mod_php, another just flat. This saves some of your LOTS OF RAM by using the cache only in the daemons that need it. You can even combine multiple daemons per ip/domain, but use squid to make it look like one.
Did I mention to get LOTS OF RAM?
No, certificates are name based, not ip based, so no problem. (You may need wildcard certificates if you plan on doing www1 ... www99... 'manual/explicit' balancing) [see, someone answered ;-)]
It's been said before, and I'll say it again. Go to Philip Greenspun's web site and read the book, check out the code, download the freeware. This guy and his crew understand high volume db backed sites like nobody else.
amazon has many disparate systems serving many needs for its business. do not assume that transaction middleware was found necessary for running the frontend "retail" website (all the /exec/obidos links, which covers every new product line Amazon has ever entered). there is much business to be done behind the scenes to move all those books, there are other ("non retail") sections of the site, and there's more than one way to skin a cat.
This is one of the few good comments on this article...
I am quite stunned that all the comments mention really specialized stuff (PCGI, mod_perl) but do not seem to care about the design of the thing:
First design, second design, third design, and then you can start your implementation...
A bad design (or no design at all) is the best way to generate real headaches for you...
You can really obviously tell this person to move to a three tier architecture, for example...
Even before saying the least thing about the technology used... My .02 swiss francs ;-)
I'm in the process of comparing Apple's WebObjects to Microsoft's ASP for a somewhat similar project. I'd be very interested to hear from anyone who has info on the WO/ASP comparison, but here's what I have found so far.
ASP is kind of a markup language that you mix with HTML, then pump through a special Web server (typically MS's IIS, but you can buy add-ons to make other Web servers work with it). The special Web server pre-processes the ASP directives, fetching data from databases if necessary and calling external components to do your work. Then the result gets rendered in plain HTML and sent out to the client (Web browser). You need additional products to handle load balancing and failover.
WO is an application server, meaning you write and compile an application, which the users interact with via their Web browsers. It automatically handles load balancing, fail over, and all kinds of good stuff. The developer tools are very complete, and they include a simple Web page editor, a data modeling package, and lots of nice features. WO is pretty expensive, and they charge per CPU that your application is deployed on. You can use Mac OS X Server (BSD Unix based) or Windows NT for deployment.
Both products connect to a range of data sources via ODBC. ASP requires you to embed the SQL in your .asp pages. WO has an object modeler. You set up objects that know about your database's structure, and the SQL is generated automatically when you access those objects.
ASP has the better market share, so you'll be able to find more resources (consultants, books, courses). WO has the better tools. It's a Betamax-VHS problem.
Your choice (ASP, WO, or other) depends on your exact application. I'm not recommending either product, just sharing what (I think) I know.
>How much research did you do exactly? :)
:)
Plenty.
>There are tons of big name sites that use WebObjects.
True, but there are many more that use ASP. At the bookstore, there are dozens of ASP books, and exactly zero WebObjects books. These are the perceptions that my PHB is using to resist WO and bring in ASP.
If anyone has suggestions as to how I can change his mind, I'd love to hear them.
Well, they are quite denormalised, yes!
>> Are you perhaps saying that you only expose some specially created denormalized tables for reporting purposes?
No, the whole app architecture really is built around SQL queries with no joins. By doing this it is more efficient for our file-caching of db queries: you end up caching a smaller number of files, as opposed to the many more files which would essentially contain the same data as one another if you were caching the results from a query involving a join. (That doesn't read very well - I hope you understand what I'm trying to say!)
At the app level, we do do things that are effectively functionally identical to an SQL join. We do all of our db access through a layer of code that makes the caching invisible, and this layer also deals with the 'joins'.
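A rough sketch of what such a layer could look like, in Python. This is only a guess at the shape of the idea, not Jim's actual code: CachedTable and orders_with_customers are hypothetical names, the cache is an in-memory dict rather than files, and the "tables" are whatever single-table lookup function you pass in.

```python
class CachedTable:
    """Hypothetical wrapper: single-row lookups on one table, cached by key."""

    def __init__(self, fetch_row):
        self._fetch_row = fetch_row   # key -> row dict (one join-free query)
        self._cache = {}
        self.misses = 0

    def get(self, key):
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fetch_row(key)
        return self._cache[key]


def orders_with_customers(order_ids, orders, customers):
    """Application-level 'join': combine rows from two cached tables."""
    result = []
    for order_id in order_ids:
        order = orders.get(order_id)
        customer = customers.get(order["customer_id"])
        result.append({**order, "customer_name": customer["name"]})
    return result
```

The caching argument shows up directly: if a thousand orders belong to the same customer, the customer row is cached once, whereas caching joined results would store that customer's data a thousand times.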
HTH,
Jim
Good comments. We also have a highly (98%) dynamic site running apache+php+mysql (plus some big sun boxes in there... somewhere... :-) Anyhow, our site does millions of hits a day (more than a couple) and it takes machines, and architecture, and money. We spend a lot of money on hosting (Exodus, dual gigabit uplinks) and have approx. 100 servers. We've spent hundreds of hours designing our systems to scale big, and be distributed, and run on RICs (ridiculously inexpensive computers). And it does, well! But this is a serious proposition. Get yourself some talent, and don't always believe the guys spouting off "N-tier", "app server", "java" as the only solution... there are many solutions, go discover what's right for YOUR site. Maybe it's a 3-tiered java app server, maybe not. Hire good developers and a DBA, and remember: always build for a 10x load.
The text of the book, up to chapter nine, is available at http://patrick.net/wpt/.
Hmm. I must say, having written dynamic sites in C and scripting languages, that I would rather have a bigger array of servers running scripted stuff than a smaller array of systems running compiled code. C just doesn't cut it (for me) in terms of maintainability and stability - a segmentation fault that can bring down your entire apache (sub)process vs the slight speed decrease by using a scripting language? Nah.
Plus, I dispute your premise; do you have some stats on how many hits/day a mod_perl-driven site can take? On what hardware? How complex a script? Where are the bottlenecks?
It seems to me that it's very likely the bottlenecks are disk I/O or database access, in which case the language used doesn't matter a damn.
ben_ the technologist and platform agnostic
With that design, there will be a period of time where a record has been "written" but is not readable. This will only be viable when new records are being added. In any situation where a record is being updated or deleted, something hitting one of the "read" DB servers will get the old record.
This could get really nasty for an application that reads the DB, and then writes something based on what it just read.
Yes, but can you have Zope AND high volume? Zope is like a three legged dog stuck in the middle of a horse race. But it does some really cool tricks.
But servlets are already persistent. I do wonder if the potential speedup in going from interpreted to compiled is worth it.
(Elegance is not an option)
The latest version of php, version 4, uses the Zend engine. Zend has an optimizer and a caching option. I've used php3 and I am pretty happy with it. php4 is still in beta, but it's almost done.
I hope this helps.
john
ASP has the better market share, so you'll be able to find more resources (consultants, books, courses). WO has the better tools. It's a Betamax-VHS problem.
How much research did you do exactly? :) There are tons of big name sites that use WebObjects. Some are Disney, Adobe, Toyota, BBC, MCI, Toshiba. There are many others. As for finding consultants, WebObjects allows you to write apps in Java, Objective-C (same as Mac OS X's Cocoa!), or WebScript. I'm sure you can find one or two Java developers out there. :)
You can use Mac OS X Server (BSD Unix based) or Windows NT for deployment.
Or Solaris or HP-UX. I believe Solaris is the most common deployment platform for the big guys.
- Scott
------
Scott Stevenson
Tree House Ideas
hmm..oracle has a parallel server option and i believe you can have multiple mysql engines. having the database entirely or partially loaded on a ramdrive or raid-5 array will help. also java servlets can be easily load balanced by using apache jserv...see http://java.apache.org/jserv/howto.load-balancing.html ...
PHP does this, too, and it's integrated into Apache, which has some fairly obvious advantages (don't underestimate the advantage of ubiquity - ever). 90% of the problems in performance I've seen are usually due to bad design, inefficient database use, and often a serious misconfiguration somewhere.
But if none of these are the case, and the original questioner still has database bottlenecks, then the solutions start to get more involved and expensive. Many other first optimizations have been mentioned many times on /. and everywhere else, like: move the database to a separate machine (preferably to a private network behind the front end server(s)), use expires headers on your graphics to encourage longer caching, serve graphics from a different machine, use reverse proxying to break up requests to different machines performing queries, and make everything static that doesn't need to be dynamic. This last one has enabled me to really stretch the abilities of sites, so I'll repeat it: if it doesn't need to be dynamically generated every time, don't do it!
But if there is still a bottleneck at the database, it gets expensive, because no open source databases currently do replication. You need replication to do true mirroring of databases. Guess what? It's expensive. But from the sound of the project, I'm sure the questioner will fix the problem long before that's an issue.
Expanding a vast wasteland since 1996.
This has nothing to do with the stress the box is under; the behavior is the same whether the box does millions of hits a day or mere dozens.
Supposedly, Red Hat is considering working on support for replication? I'm none too clear on how that would work with no support for transactions...
Expanding a vast wasteland since 1996.
...no joins???
This is only possible if the tables are incredibly denormalized. Are you perhaps saying that you only expose some specially created denormalized tables for reporting purposes?
Those are routers, dude. Most people have to take two or three hops just to get to their NAP, and some even that many to make it off their WAN. In Seattle, it takes 5 hops for me to get to the above.net backbone, then a single hop to San Jose, CA, then 4 more hops to hit Alta Vista. That's 11 hops for me, and I'm only 1000 miles away.
Dave
And that applies to any technology. You say your company looked 2 months for WO consulting, but they didn't look hard enough to find us at Subsume Technologies. I have to assume you looked in all the wrong places, which is easy to do in a world full of posers. It's also possible the work you had was crap work that nobody wanted to do, which is also becoming too common, regardless of the technology.
just a perl script that tail -f the update log, and it's not a problem to maintain state if you store which was the last statement executed.
The perl script can take out irrelevant data and pipe it across the network to a waiting daemon.
or you could share the update log and tail it on the client.
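A rough sketch of that log-replay idea, in Python instead of Perl. It is only an illustration under loud assumptions: the update log is an in-memory list rather than a file being tailed, sqlite3 stands in for the slave database, replay and last_applied are hypothetical names, and skipping SELECT statements is just one guess at what "irrelevant data" means.

```python
import sqlite3


def replay(log_lines, replica, last_applied):
    """Apply log entries from position last_applied onward.

    Returns the new position, which the caller must store so the next
    run resumes where this one stopped (that stored number is the state).
    """
    for position in range(last_applied, len(log_lines)):
        statement = log_lines[position].strip()
        # Reads do not change data, so there is no point shipping them.
        if statement and not statement.upper().startswith("SELECT"):
            replica.execute(statement)
    replica.commit()
    return len(log_lines)
```

Note that this gives you the delayed-visibility behavior other posters complain about: a reader on the replica sees old data until the next replay catches up.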
Dynamo for Linux is ALMOST in beta release...
>Cache your dynamic data
One thing that sometimes gets forgotten is that you can reduce calls to the server by using client-side strategies. For example, putting the raw results of a database query in a Javascript array then manipulating it (sorting, etc.) in the browser can be very fast and groovy. It's not a substitute for server-side caching, but if you can take any load onto the client then it's good.
JJ
"And the meaning of words; when they cease to function; when will it start worrying you?"
The original ACS stuff was built on Illustra, a commercialized version of Postgres. Postgres has some very nice features, e.g., tables that can inherit their structure from other tables. And in fact it would be very convenient to use these for ACS 4.0 (coming out in about two months).
But ultimately we gave up on Illustra because it wasn't reliable enough.
Oracle does have some very nice features of its own, e.g., the ability to run Java and PL/SQL inside the database. But I don't want to be remembered for pushing Oracle. In fact, I kind of walked away from their Web tools (we built www.comdex.com in 1996 using Oracle Webserver). They were clunkier and slower than AOLserver.
The thing to keep in mind is that the substrate layers don't matter. If you change out the RDBMS or the operating system or the HTTP server or the scripting language no end-user will care. But they care if the workflow or the data model is changed.
Since several folks have mentioned AOLserver and the Open Source website building toolkit from ArsDigita, I thought folks might be interested in learning that it is being ported to Postgres by myself and a handful of other folks.
As distributed by ArsDigita, the toolkit requires Oracle, certainly the best choice for serious, high-volume sites that require the utmost in reliability and robustness.
However, many of us don't need to scale to the size of eBay or Amazon and aren't interested in paying Oracle's licensing fees for web use (thousands of dollars a year even on Linux x86 systems), so there's been a lot of interest in an Open Source alternative.
We released a preliminary, beta subset of the toolkit running on Postgres last week. The curious may pick up a copy at http://acspg.benadida.com. AOLserver can be found at http://www.aolserver.com and Postgres 6.5 at http://www.postgresql.org.
AOLserver works with a variety of database servers, not just Oracle. Large e-commerce sites don't begrudge Oracle their money. The rest of us can use the ACS with Postgres.
Don't knock what you don't know...
I do a lot of work with application servers. If you're prepared to spend lots of money on app servers ($10,000 - $35,000 per CPU) you can scale to very large numbers of users. The app servers do load balancing and caching for you, scale to support multiple processors per box and multiple boxes, and can scale to hundreds of hits per second (with fast enough hardware).
Have a look at the Netscape Application Server (or the new version, the iPlanet application server, which is not yet available), WebLogic, SilverStream, or IBM WebSphere advanced.
Jim, we're building a setup a little similar to what you're doing. We're also in the UK, wanna ask you some more about your setup. I wouldn't mind emailing you though as I don't want to get flamed for talking about M$ stuff here - email me? olliec@hobomedia.com.
I am curious how people get these performance numbers for their web sites (5hits/sec, 200hits/sec, etc.) What tools exist for determining these numbers?
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
Did you remove unnecessary whitespace from the output as part of your minimization exercise? For example, did you convert the "readable":
This is the title
to:
This is the title
I ask because on a previous contract I had to make optimization suggestions for a high-volume web site, and one of them was to remove "unnecessary" whitespace (how to figure out what unnecessary was, given different browsers' interpretations of the HTML, is another story). The savings approached 20% for the particular situation, but I never got a chance to find out if this was common practice in high-volume "production" web sites or not...
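For anyone wanting to measure this on their own pages, a small sketch in Python. The function names strip_whitespace and savings are made up, and the rules are deliberately naive: it drops whitespace between tags and collapses every other run to one space, which would mangle <pre> blocks and can change rendering where whitespace between inline elements matters.

```python
import re


def strip_whitespace(html):
    # Drop whitespace that sits purely between two tags...
    collapsed = re.sub(r">\s+<", "><", html)
    # ...and squeeze every remaining run of whitespace to a single space.
    collapsed = re.sub(r"\s+", " ", collapsed)
    return collapsed.strip()


def savings(html):
    """Fraction of bytes saved by stripping, between 0 and 1."""
    before = len(html.encode("utf-8"))
    after = len(strip_whitespace(html).encode("utf-8"))
    return (before - after) / before
```

Running savings() over a sample of real generated pages would tell you whether the roughly 20% figure holds for your site; heavily indented template output tends to save more than hand-written HTML.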
----- In a previous message, Sun Tzu wrote:
> I have gone to some trouble to minimize the number of database calls and will work a bit more to minimize the size of returned pages.
Another tool that most people do not know of is WebCatalog http://www.pacific-coast.com Originally developed for E-Commerce, it can do a whole lot more. The language is much easier than PHP/SQL/Perl etc and best of all the database and templates are cached in RAM. It will not hit the disk unless the database is written to. I serve one site with over 2 mil hits a day on it.
Sybase SQL Server 12 - Available on Win, Sun, Linux, HPUX, AIX
Jaguar CTS (component server) - Win, Sun
PowerDynamo (dynamic web content provider) - Win, Sun
The standard practice is to get EA Server with EA Studio which includes the Sybase suite of development tools:
PowerBuilder 7.0 (app dev like VB): win, sun
PowerJ 3.5 (Java dev) : win, sun
PowerSite (HTML/JavaScript/ASP dev):win, sun
Unfortunately, the main components are currently only available on NT and Solaris. AIX and HPUX are on the way.
You do have the option of using Apache for the web server and, through CGI using powerdynamo and jaguar on another machine, but that almost defeats the purpose. You could go all Solaris though.
I have to say that overall, I do like the package, it makes for rapid website development and, since Jaguar is Corba compliant, I can re-use all my business logic within local client desktop applications when I want the added speed and functionality.
"Anyone who can't laugh at himself is not taking life seriously enough." - Larry Wall
There's no reason to have the parent post moderated down to zero. It may sound a bit confrontational, but sometimes the truth hurts. Ok, MySQL is a database, but it isn't a relational database and hence has the problems that our AC indicates. PostgreSQL is a lot closer to catching Oracle than MySQL will ever be.
I've found the Apache/PHP3/FrontPage Extensions for UNIX to be effective on the server.
The client can use FrontPage, or Visual InterDev.
Van
Linux rocks!!! www.dedserius.com
VB != VisualBasic
On the content management side I'd recommend Zope (www.zope.org), which is a truly brilliant piece of software that brings object orientation to Web applications programming and is a total must for easy manipulation of sites and for easily and consistently adding and maintaining site content and design. It's PHP on steroids: it integrates with MySQL, has chat boards, and can integrate Python custom functions.
Brilliant and well worth a look. However, the documentation totally sucks, but it's worth persevering with to solve half your problem.
-he who laughs last, is a bit slow.
journal
There's an archived article by Ralf S. Engelschall that was printed in WebTechniques magazine. It talks about using pools of servers and reverse proxying for load-balancing. You might want to look there.
free experimental electronic music netlabel at www.viablehybrid.com
Bugger. Slashdot ate my A HREF Tag. Sorry.
The article is at http://www.webtechniques.com/archives/1998/05/engelschall/
free experimental electronic music netlabel at www.viablehybrid.com
hmm..oracle has a parallel server option
This probably would be WAAAY out of the question for the needs of the guy asking the question here. OPS requires at least 2 Solaris boxes ($$), Sun Cluster software ($$, and Sun will *NOT* support you AT ALL unless you buy whatever they tell you to buy during the ordering process, which is, consequently, really obfuscated and lengthy), plus you need the Oracle licenses themselves. For a minimal system (say, 2 Ultra 2s, 300 MHz processor in each) you're looking at 600 "power units" @ $150/ea for RISC processors. So, just for Oracle licenses (not including Parallel Server) you're looking at $90k. Sure, it's nice if you're a big buck company, but it's out of reach for most mere mortals.
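The arithmetic behind that figure, as a tiny Python sketch. It assumes, as the numbers in the post imply, one "power unit" per MHz per processor; the function name is hypothetical and the $150 rate is simply the one quoted above.

```python
def oracle_license_cost(cpus, mhz_per_cpu, dollars_per_unit):
    """Return (power_units, total_dollars) under the per-MHz assumption."""
    power_units = cpus * mhz_per_cpu
    return power_units, power_units * dollars_per_unit


# Two 300 MHz processors at $150 per power unit.
units, cost = oracle_license_cost(cpus=2, mhz_per_cpu=300, dollars_per_unit=150)
print(units, cost)
```

The uncomfortable consequence of pricing this way is that upgrading to faster processors raises the license bill in direct proportion.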
There's no sense pushing around a lot of bits if you don't have to, and while you can make the rest of everything pretty fast by periodically dumping dynamic content to static pages and serving them up with Zeus or khttpd, reserving PHP3 or mod_perl use for the very few things that really must be completely dynamic each and every time, and you can use Layer 4 switches to load-balance between servers at a site, and distribute your servers around the world and distribute the load to the closest unloaded server by using customized nameservers, the graphics is at least one thing that can be solved by using the Akamai network.
--
Brad Knowles
http://daily.daemonnews.org/ -- if you're not
If you have the money, definitely go for Oracle8i (and take an Oracle8i performance tuning class). The overall performance and the amount of performance tuning you can do with Oracle8i is incredible.
Hear, hear! I second this motion.
Re: few outsiders working with ACS While there aren't as many people as I would like using it on the outside just yet, there are people who are not working for Ars Digita who are working on the system. You can find these people pretty easily, as they show up at http://www.photo.net/bboard/q-and-a.tcl?topic=web/db, where people discuss using ACS, AOLServer, etc. I think that one of the primary reasons that the system has not been embraced by the open source community is simply that Oracle, while free for development use, is very expensive once you want to go live. However, because of this, a bunch of people decided to port it over to PostgreSQL--there is now a beta of this effort, available here: http://acspg.benadida.com This means that it is now possible to do development of ACS using 100% free, open-source components (AOLServer 3+ is open-source too, thanks to Philip and Hal Abelson at MIT).
Re: AOLServer not being popular and/or proven Hello? The reason it is called AOLServer is because AOL uses it (they bought NaviSoft, makers of NaviServer, which they renamed to AOLServer). That by itself makes it popular and proven--AOL handles 28,000 hits a second using it. Of course there are other people using it too, and not just Ars Digita clients. There are even a few hosting services that'll host AOLServer, like these guys: http://www.am.net
So, what's so great about AOLServer? The nice thing about AOLServer is that out of the box it is ready to handle connections to relational databases. No need to make ODBC calls, etc. AOLServer sets up connections when it starts up and your web pages can get handles from a pool of these connections, use them, then recycle those handles. Again: no overhead for database access. Each AOLServer can handle 8 simultaneous database accesses per second, that is, it can serve up 8 database-backed pages a second. And that's on top of serving up *static* pages, which are an entirely different matter. AOLServer is nice also in that certain high level features are built right in--you can send an email with a one-line command, for example, or grab a web page from someplace else with a one-line command (helpful for doing things like Philip's Bill Gates Wealth Clock, for example) and there's a one-line command for scheduling stuff to run (like cron) from the web server. And the Tcl interpreter is built in, too, so no CGI overhead. If you want to read more about AOLServer and how it stacks up to Apache, check this out for a few quick paragraphs from Philip's book about AOLServer or for more information, go to http://www.photo.net/wtr/aolserver/introduction-1.html and
http://www.photo.net/wtr/aolserver/introduction-2.html
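The pooling idea described above (open the connections once at startup, hand out handles, recycle them) can be sketched roughly as follows. This is not AOLServer's API: the `ConnectionPool` class, its `handle` method, and the use of SQLite are all assumptions made for illustration.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: connections are opened once, then reused."""

    def __init__(self, database, size=8):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets whichever request thread borrows
            # the handle use it.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def handle(self, timeout=5.0):
        # Block until a handle is free; raises queue.Empty after `timeout`.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Recycle the handle instead of closing it.
            self._idle.put(conn)


pool = ConnectionPool(":memory:", size=2)
with pool.handle() as db:
    answer = db.execute("SELECT 1 + 1").fetchone()[0]
print(answer)
```

A page handler pays for a queue lookup rather than a fresh database login on every hit, which is the whole point of the scheme.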
Re: ACS no longer being the greatest thing since sliced bread Well, it does quite a lot out of the box and it is being used to create real, serious, heavy-duty websites. Given that at the moment they are busy expanding like crazy and doing work for clients, it isn't so hard to understand why they may not be driving the toolkit as hard as they could; there's lots in there already, including monitoring services in addition to just a site-building toolkit. And new modules do show up in the toolkit even so, and there's a list of possible future improvements on the arsdigita site somewhere with more stuff. And Philip is thinking about this stuff, don't doubt it. In any case, if you want to see for yourself what the toolkit has, check out this page and see if it meets your needs: http://www.arsdigita.com/pages/toolkit/modules.html
Re: Philip's book Definitely worth reading. Funny, smart, sharp. Definitely look at the technical stuff, though, even if you aren't using ACS, since at the very least the stuff on relational databases is important.
Re: Using Tcl Tcl sucks? Well, you get used to it--now that Tcl 8 has the complete Perl regular expression package, it sucks less. But, an important point is this: when you use AOLServer, you will be using a bunch of AOLServer commands in your Tcl code to get stuff done. So you won't be programming in straight Tcl. There are utility procedures that are part of the ACS toolkit that help too. And over time there will probably be some tools to do some of the grunt work (they already have one out called The Prototyper). But in any event, once you get over a few quirks, it's like programming in any other scripting language. One nice thing is that the language is pretty small, so you can learn it very quickly and get going.
By the way, at Scriptics, the company founded by Tcl's inventor John Ousterhout to support Tcl, Brent Welch, author of one of the better Tcl books and a well known "name" in the Tcl community, has built the Tcl developer's site, dev.scriptics.com, using the Ars Digita toolkit.
And, if you want to learn how to use ACS, Ars Digita offers free 3 week bootcamps in Cambridge, MA and in several other places--look on their site for a "bootcamp" link. Or you can get the problem sets used at bootcamp off their site and learn the stuff at home. If you install everything on your own machine and do the 3 problem sets (note: PS 3 has been replaced with PS 5), you get a $10,000 sign-on bonus if you decide to work for Ars Digita. And speaking of working for Ars Digita, check out their salary structure (http://www.arsdigita.com/pages/jobs/tech-jobs.html). Might make you want to start learning Tcl after all :-)
___
DC
P.S. If you decide to work for them, please mention me so I can get a shot at the Ferrari (actually, I'd just as soon take it in cash) :-)
P.P.S. More seriously, if you want to do the problem sets or a bootcamp, here are a few pages I put together for people like you:
Problem Set Zero This is meant to help people bone up on what they'll need to do the problem sets/bootcamp. Meta Cheat Sheet lists a bunch of useful cheat sheets that I and other bootcampers put together and some other stuff.
dev.scriptics.com, a portal for Tcl developers at Scriptics, the Tcl company--uses the Ars Digita toolkit.
http://www.college411.com a commercial portal for college students. They are not an Ars Digita client but they are using the toolkit.
The ASP approach has some other benefits you can use. All of your heavy processing can be offloaded to Microsoft Transaction Server. MTS uses compiled code only. Visual Basic, Visual C++, Delphi, and Borland C++ Builder all support MTS. MTS is also designed to make administration easier.
The down side to this is that the Microsoft solution is outrageously expensive. You'll be spending several thousands of dollars just for your operating system, server software, and development environment. You can really only consider it if you are well funded.
What happens when the machine goes down and all your data is in RAM? I think this may be acceptable only for e-commerce applications that store only session information in this way.
that's just my opinion.
If you mod me down the terrorists will have won
Just to go against your theory: I've rewritten web apps from C into PHP and seen a performance gain. The reason? PHP scripts run in process with the web server, while my C CGIs had to fork.
6) don't make everything dynamic just because you can
Here's something that may save you a lot of grief for any page that doesn't change minute-to-minute:
Because you need built-in scalability, you might consider dynamic authoring of "static" pages. It will, of course require more front end engineering (perl, etc.), but the result takes most of the processing out of the hit rate equation.
This can be taken a step further by automating the authoring process from an updated database.
More NRE, less maintenance...
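A minimal sketch of that "author static pages from the database" step, assuming a hypothetical `articles` table with trusted `slug` values and a throwaway output directory (none of these names come from the post):

```python
import sqlite3
import tempfile
from html import escape
from pathlib import Path


def publish_static_pages(db, out_dir):
    """Render every row of a hypothetical `articles` table to its own .html file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, title, body in db.execute("SELECT slug, title, body FROM articles"):
        page = (
            "<html><head><title>{t}</title></head>"
            "<body><h1>{t}</h1><p>{b}</p></body></html>"
        ).format(t=escape(title), b=escape(body))
        path = out_dir / (slug + ".html")  # slug is assumed safe for a filename
        path.write_text(page, encoding="utf-8")
        written.append(path)
    return written


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (slug TEXT, title TEXT, body TEXT)")
db.execute("INSERT INTO articles VALUES ('hello', 'Hello', 'Built ahead of time.')")
pages = publish_static_pages(db, tempfile.mkdtemp())
print(pages[0].name)
```

Run from cron or whenever the database changes; the web server then only ever serves plain files, so the database work drops out of the per-hit cost.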
Tangochaz
--------------------
Education has produced a vast population able to read but unable to distinguish what is worth reading. -- George Macaulay Trevelyan
The problem with this article is that it talks
a lot about "reverse proxy".
It even gives examples of some software that
specifically does CACHING proxies.
But this can just shift the problem from one server to another.
There should be more examples of solutions that
do transparent load balancing, vs proxies.
Hint: "proxies" are almost expected to cache.
"load balancing [boxes]" are not.
It's in the works; check out this site. These folks are already set up on SourceForge and the project name is acspg.
Good luck with your research and if you need an Enterprise solution for your services, contact us.
You can't handle the truth.
15 million listings, big Oracle databases, lots of hits
I think that another big and important factor is the location of the user who will see the website... I am located in Argentina, and if I do a tracert www.altavista.com the result shows 19 hops (machines) that my request was sent through... 19 machines to see the fucking altavista page... in the USA... maybe it would be 3.
anyone want to host a scener from argentina in usa?
---- EoF
Take a look at tpc.org. Sorry, but the world's best transaction processor runs on Win2K and IIS. How can a server/database platform that almost doubles what the closest Sun or IBM machine running Oracle can do be crap? This is the pure performance statistic BTW - it has nothing to do with price (all the price/performance stats have been dominated by MS for years). And we are talking about a 4 million dollar Compaq setup vs. over 7 million for the IBM setup and 13 million for the Sun setup. I'd call Win2K an industrial strength solution. And ASP is the de facto standard for programming websites with W2K. I think I'll stick with the facts vs. someone's anecdotal evidence. I'd love to know what sort of data modeling you can do with WO that you can't do with ASP (as I have complete control over my SQL Server and Oracle databases), but alas you did not back up what you said with any relevant examples. Slamming MS is not getting the open source movement anywhere. As an ASP Developer I find your comments offensive and lacking credibility. There is certainly nothing in your message convincing me to switch. Gotta go, my AIX machine needs rebooting...
I've done a few high volume sites, using a number of different technologies (mod_perl, PHP, ASP). There are unique tips and pitfalls for each one, but there are some design issues that apply to all of them.
Abstract Everything
Never build HTML or graphic design right into the code. Use standard include libraries for common things like changing fonts, starting/ending tables, building the body tag, etc. Any kind of frequently used design elements should be programmatically generated from a central location. That way graphic design changes can be made in one place and propagate across the entire site.
Also abstract common database functions, either in include files or by using stored procedures if you're running a real SQL backend. Never put the SQL query right in the code -- I guarantee you'll have multiple copies of it around, and when it comes time to adjust the schema, you'll hate yourself.
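One way to keep SQL out of page code, sketched here with assumptions: the `QUERIES` registry, the `run` helper, and the `users` table are hypothetical names, and SQLite stands in for whatever backend the site really uses.

```python
import sqlite3

# Hypothetical central registry: every SQL statement the site uses lives here,
# so a schema change means editing one file instead of hunting through pages.
QUERIES = {
    "user_by_name": "SELECT id, name FROM users WHERE name = ?",
    "add_user": "INSERT INTO users (name) VALUES (?)",
}


def run(db, query_name, *params):
    """Look up a named query and execute it with bound parameters."""
    return db.execute(QUERIES[query_name], params).fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
run(db, "add_user", "alice")
rows = run(db, "user_by_name", "alice")
print(rows)
```

Stored procedures give the same single point of change on the database side; the registry is just the include-file version of that idea.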
Do logging at the app level
Your application knows a heck of a lot more about what's going on than the web server does. And if you're running a real database, logging to the database is infinitely more scalable and manageable than having hundreds of flat files sitting on the filesystem somewhere.
Depending on the nature of the application and how you handle privacy concerns, it may make sense to log the username, date and time, pagename, and application state for each pageview. Or any number of other things. Logging in the application is the only way to go.
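As a rough illustration of application-level logging to a database, here is a sketch using the fields listed above. The `pageview_log` table, the function names, and SQLite itself are assumptions, not anything the poster specified.

```python
import sqlite3
from datetime import datetime, timezone


def init_log(db):
    """Create the (hypothetical) pageview log table if it is missing."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pageview_log "
        "(logged_at TEXT, username TEXT, pagename TEXT, app_state TEXT)"
    )


def log_pageview(db, username, pagename, app_state):
    """Record one pageview with a UTC timestamp."""
    db.execute(
        "INSERT INTO pageview_log VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), username, pagename, app_state),
    )
    db.commit()


db = sqlite3.connect(":memory:")
init_log(db)
log_pageview(db, "alice", "/quotes", "tab=3")
log_pageview(db, "bob", "/news", "tab=1")
hits = db.execute(
    "SELECT pagename, COUNT(*) FROM pageview_log GROUP BY pagename ORDER BY pagename"
).fetchall()
print(hits)
```

Because the log is a table, questions like "which pages did this user see before the error" become ordinary queries rather than grep sessions over flat files.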
Have flexible debugging code!
Use a central "debug" procedure and call it frequently. Use a querystring value, session value, or something to determine whether "debug" prints the info or keeps quiet; normal users can continue to use the site while developers can get detailed info about what's going on in the app.
If you're brave, you can comment out the calls to debug once you're confident that a section of code is working. I never do, though -- it's a fairly minor performance hit, and it allows customer support reps to provide developers with information on hard to repeat problems.
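A central debug procedure of the kind described could look something like this. The `debug` function, the `debug=1` querystring convention, and the HTML-comment output format are assumptions chosen for the sketch.

```python
import io
import sys


def debug(message, request_params, out=None):
    """Emit `message` only when the request carries debug=1; otherwise stay quiet."""
    if request_params.get("debug") != "1":
        return False
    stream = out if out is not None else sys.stdout
    # An HTML comment keeps the page layout intact for the developer viewing it.
    stream.write("<!-- DEBUG: {} -->\n".format(message))
    return True


page = io.StringIO()
debug("cart has 3 items", {"debug": "1"}, out=page)   # developer request
debug("cart has 3 items", {"user": "alice"}, out=page)  # normal user: silent
captured = page.getvalue()
print(captured, end="")
```

Leaving the calls in place costs one dictionary lookup per call for normal users, which matches the "fairly minor performance hit" the post describes.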
That's it for my secrets. Good Luck.
-b
If I wanted a sig I would have filled in that stupid box.
Clearly you did it wrong then. If you write good perl code, and run it persistently as a part of the webserver using mod_perl, you will get great performance. If you write lousy perl and run it as a CGI, performance will suck.
Excellent suggestion... I'm working through the course description right now... http://photo.net/teaching/one-term-web.html is where you really want to go if you have the time. As far as using open source goes, I'm using this course with Linux, AOLserver, and the trial copy of Oracle, but when I actually build a site for public consumption I'm going to try and use PostgreSQL.
For case 1, let's assume a complete Linux front-to-back solution, with as much free (or mostly free) software as possible:
;) ) load balanced with Virtual Server Project. For a reasonably heavy duty method of doing this relatively cheaply, see Cubix and their "density" series... up to 8 servers in a single box... with hot plug everything. RAID isn't as necessary here... as the systems themselves are effectively your RAID...
Needed Software Components:
1. Favourite Distro of Linux
2. MySQL or Postgres Database (personal pref is for MySQL... not going to get into the pros and cons here...)
3. Dynamic Web-Scripting Language (PHP, Perl, whatever... personal pref for this kind of thing is PHP... again, I'm not debating at the moment...)
4. Linux Virtual Server Project - very solid load-balancing from my experience. Don't know how it compares with the appliances on the market... but it's still solid.
5. HA/Redundancy software (Linux HA project isn't quite there... but they're getting close... there are some commercial packages available - one that's free for non-profit use - http://www.high-availability.com)
Hardware:
NB: For maximum up-time I recommend systems with redundant hardware (backup power supplies, dual NICs, and RAID arrays)
1. Firewall/Load-balancer - preferably using HA/Redundancy software on two machines... Mirrored (RAID 1, right?) boot/system hot-plug drives are a good idea.
2. Web-farm - up to X systems (where X+1 breaks your budget...)
3. Database system - again preferably an HA/Redundancy cluster for maximum availability. I recommend a mirrored boot/system disk again, with a RAID 5 array (or RAID 5+5 - mirrored RAID arrays) for speed and maximum availability... highest RPM drives you can afford can help here a lot for speed, too.
4. 100 BaseT Switch for maximum through-put. Personal preference is for Cisco but your budget dollars may vary.
5. I've mentioned RAID a couple of times... you can get SCSI and IDE RAID these days (SCSI being more common)... the cheapest/fastest one I've seen is from Raidzone - very nice, check them out (up to 15 - 40 GIG hot-plug IDE drives in one array, with a very high through-put). You can also do software RAID, taking a performance hit, but saving coin...
Case 2 assumes that you don't mind using some commercial stuff... and have a bigger budget:
1. Replace Virtual Server with an appliance. (Alteon, F5 and Cisco all make good products... presently my preference is with F5's BigIP.)
2. Replace the built-in Linux firewall with Checkpoint's FireWall-1 running under Linux - or an appliance firewall; a Cisco PIX is very nice, and has very high through-put. The Nokia appliance running Checkpoint and a BSD bastardisation is quite nice.
BlackNova Traders
Let me start off by saying that designing N-tier applications is very difficult; it's a new art and there are not a lot of hard and fast rules. There are many different approaches to take, and not all are appropriate for every circumstance -- a great deal depends on the specific usage characteristics of your system.
With that in mind, your first step is to move your database server and web server to separate boxes. Just doing this gives you a huge performance boost. Also, it's nearly impossible to properly tune the performance of an enterprise-class database server unless it's on a dedicated box. The most important thing on your db server is to have lots of RAM - ideally, you want your entire database to be cached; if it's too big, then you at least want to have the most frequently accessed tables & indexes cached.
Secondly, use only stored procedures to access the database. Stored procedures can give you a huge performance gain over ad-hoc SQL statements. Also, forcing all db access to go thru stored procs makes it much easier to secure your database and to do performance profiling.
I'm not familiar with MySQL, so I can't really say if it's up to the task or not. Both Sybase and Oracle have versions of their enterprise db servers available for Linux. Sybase 11.5 for Linux is gratis for both production and development; 11.9.2 is gratis for development, but needs to be licensed for a production system. I don't know off the top of my head what Oracle's licensing policy is. Of the two, I personally prefer Sybase - I think it's easier to administer, and has a more elegant architecture.
Lastly, you need to take a close look at what kinds of transactions are running most often, and optimize your indexing strategy for those transactions. Take a close look at the locking behavior and see if you are getting a lot of locking contention -- this can be a real performance killer.
On the web server side, I don't see anything wrong with using Java servlets / JSPs over any other competing technique. Typically, the bottleneck in this kind of system is physical I/O, not CPU, so the slight performance gain you would get from using native code vs Java would not help you, particularly if you're using a current JVM with a good JIT compiler. If you are maxing out the CPU utilization on the web server, you might want to consider moving some of the logic from your servlets into stored procedures, balancing the load better between the db and web servers.
I wouldn't start looking at clustering options until I was sure I was getting all the performance I could get out of my existing web & db servers.
"The axiom 'An honest man has nothing to fear from the police'
Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
I've seen various RDBMS mentioned in this discussion, but nothing on Interbase. Is there anyone out there using Interbase now that it's open source or does anyone have any opinions about this RDBMS?
1. if you use Java, test JVMs. some are much slower than others. volano.com's benchmarks make interesting reading. they are a narrow test, mind you. you can also try running an alternate OS (solaris-intel springs to mind as it does have stable JVMs & does SMP better than linux -- if you follow that route, beware of hardware compatibility issues)
2. check the raw query performance of the SQL server. you can use a command line tool for this. if the database is the bottleneck this will show it up. then you can go look at speeding it up.
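For the second point, a small timing harness can stand in for a command line tool. This sketch uses SQLite and a made-up `people` table purely as placeholders; the `time_query` helper is hypothetical, and the real check would run against the production database and its actual queries.

```python
import sqlite3
import time


def time_query(db, sql, params=(), repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        db.execute(sql, params).fetchall()  # fetch everything, as a page would
        best = min(best, time.perf_counter() - start)
    return best


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, surname TEXT)")
db.executemany(
    "INSERT INTO people (surname) VALUES (?)",
    [("name%d" % i,) for i in range(10000)],
)
elapsed = time_query(db, "SELECT id FROM people WHERE surname LIKE ?", ("%99%",))
print("fastest run: %.6f s" % elapsed)
```

If the raw query already takes most of the page's response time, no amount of tuning in the web tier will help.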
I'm currently using Zope. It's open source, completely free for commercial or non-commercial use and is simply frickin amazing.
Do yourself a favor and check it out.
MySQL has some releases under the GPL.
Gah! A mistake here and there is all right, but I'm seeing this over and over on this page. Did you people flunk fourth grade? The word is PERSISTENT, people!
Verily hath their moderation points been wasted upon me.
If the author is only worried about database access time, then he was not clear enough in his original post. Regardless, most large sites like Amazon ARE replicating their databases (even to the point of placing DB servers at different geographical locations, along with server farms.) While this is expensive, it is necessary for any high, high volume site.
I lost much of my faith in Greenspun when I went to his Boston Marathon site to track a friend. It had completely crashed under the load of the users. It led me to believe the guy is more talk than substance when it comes to websites that scale. He has great things to teach, but the key is to take his example and ideas, and port them, not duplicate them IMHO.
Now having said that, I will say that his information is a fantastic read. And those "napkin" and "sketchpad" drawings are great. So my advice is read him for ideas about DB layout, community concepts, DB driven concepts, etc. but not for nuts and bolts. When Greenspun tells us how to "do it" and lets us consider LINUX, Perl, PHP, MySQL, and other "open source" solutions, I'll start calling him an uc("god").
A couple of years ago I developed the LookSmart (search engine) CGI system. This was one of the earliest category-based search systems and at the time was unusual in that every page was customisable in a number of ways - so every page had to be CGI generated.
The equipment we had at our disposal was a couple of fairly big Sun servers, so we had to be able to serve a high volume of CGI-generated traffic off only two machines (rather than using the usual farm arrangement which seems common for high volume sites these days). Last I heard the site was serving some millions of CGI pages a day - upward of a hundred pages a second at peak load. The architecture has stood the test of time - today it's essentially unmodified from the one I designed in 1997.
How was this level of performance achieved? Given the hardware constraints I took an approach which squeezed the maximum possible hits per second out of what we had. I chose Apache as the web server and FastCGI as the interface to a rather large CGI written in C/C++. Using this setup I was able to produce CGI pages not much more slowly than Apache can produce static pages, even when producing fairly complex pages with database content. C/C++ might seem like a strange choice for a CGI language but in terms of sheer speed it was miles better than Perl - the other obvious option at the time. I've done a lot of CGI programming in different systems since then and despite the hype about other CGI systems I'd say they're still not that much easier than C to write CGIs in.
One other trick I used to make the LookSmart site so fast was to avoid the use of conventional databases altogether - at least at the CGI side of things. Our database was regularly exported as a specially-formatted flat file which I mmap()ed into each CGI's memory space. This allowed all the CGI processes to share the same flat file image, while taking advantage of the demand paging system to minimise the memory impact of the database. Other tricks were used to maximise the locality properties of the database. This all resulted in a system which was many, many times faster than SQL database access would have been. (In fact we later tried an ASP/SQL server architecture for comparison and it was several orders of magnitude slower)
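The original was C/C++ with a custom file format that the post does not describe, so the following is only a sketch of the general technique: a read-only `mmap` over a fixed-width flat file. The record size, file name, and helper functions are all assumptions.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 32  # assumed fixed-width records, space-padded


def write_flat_file(path, names):
    """Export step: write each name as one fixed-width ASCII record."""
    with open(path, "wb") as f:
        for name in names:
            f.write(name.encode("ascii")[:RECORD_SIZE].ljust(RECORD_SIZE))


def open_flat_file(path):
    """Map the file read-only; the OS pages it in on demand and can share
    the same physical pages between every process that maps it."""
    f = open(path, "rb")
    return f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def read_record(view, index):
    """Fetch record `index` by offset arithmetic, with no query layer."""
    start = index * RECORD_SIZE
    return view[start:start + RECORD_SIZE].rstrip().decode("ascii")


path = os.path.join(tempfile.mkdtemp(), "categories.dat")
write_flat_file(path, ["Arts", "Business", "Computers"])
f, view = open_flat_file(path)
second = read_record(view, 1)
print(second)
view.close()
f.close()
```

A lookup here is a slice of memory, which is where the speed comes from; the price is that updates require re-exporting the file.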
One downside of this relatively low-level approach to CGI was reduced flexibility. Initially it wasn't easy for non-programmers to modify the behaviour of the site. Over time I gradually added more and more configurability to the system so more of the system could be tinkered with by naive users. Which in itself led to plenty of problems, but that's another story :) In the end I'd developed a fairly rudimentary but fast templating system which adequately served most requirements for flexibility.
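The post gives no details of that templating system, so this is only an illustration of how small a "rudimentary" one can be, using Python's standard `string.Template`. The `PAGE` template and `render` helper are hypothetical.

```python
from string import Template

# Hypothetical page template that a non-programmer could edit in a text file.
PAGE = Template("<h1>$category</h1>\n<p>$count sites listed</p>")


def render(template, **values):
    """Fill in placeholders; unknown $names are left as-is rather than raising."""
    return template.safe_substitute(**values)


html = render(PAGE, category="Science", count=42)
print(html)
```

Plain placeholder substitution keeps the per-request cost close to a string copy, at the expense of having no loops or conditionals in the template itself.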
Would I do it the same way again? Given the same hardware and load constraints, yes I think I would. In a situation where you must extract the maximum possible performance out of the available hardware this kind of approach really worked extremely well. The cost was in increased development effort and more reliance on programmers (ie. usually me) when changes had to be made to the system.
AOLserver is an excellent web server... totally multithreaded, persistent database connections, database abstraction layer, extremely fast.
ACS is an excellent toolkit. Anyone with a technical mind that has gone through its data model will notice how well thought and implemented it is. Really slick. The port for PostgreSQL has its first beta out, and a new beta due in the next couple of weeks. USU Free Software and GNU/Linux Club website runs with ACS/pg and it is great... totally free source software. http://linux.usu.edu. Download ACS/pg at http://acspg.benadida.com and build a serious, reliable, and scalable web site.
Philip Greenspun does a great job when it comes to the web and database backed websites. He supports Free Software (his toolkit and tools are all free) and gives away US$ 10,000 every year in a prize for someone who creates a good, free, web service. He and his company have trained hundreds of people on web services for free, and ArsDigita pays and treats its programmers REALLY well.
As for Oracle, I don't know why you folks cry so much about it. It is the best RDBMS around, period. It's not for everyone though. If you have really important data, you probably have the money to pay for it. Someday PostgreSQL will get there too. It has improved VASTLY. MySQL is not even trying to solve the problems that RDBMSs were designed to solve. Sorry, but it is the truth.
If you are used to Apache and PHP, fine. Go with it, it's your tool of choice. But don't let your religion get in front of your technical mind to the point where you simply can't distinguish what's true and what's not. Let the technical details speak for themselves.
ok, after hearing this $%#@ i thought i would weigh in with a REAL example of how much ASP sucks and how incredibly powerful WO is -- you need skill and talent to be a WO developer; as someone else said, anyone can be an ASP developer. when Dell first developed their on-line store they used some technology that is irrelevant for this discussion. their return rate -- i.e. misconfigured products -- was about 10 percent. things like ordering a SCSI HD with no SCSI controller were a common example of misconfiguration. but hey, what does the average home user know about building a computer? so they said, we need something better. so they turned to WebObjects -- which was then owned by NeXT. it took a handful of NeXT engineers about 3 months to put together the Dell store. once the Dell store was on-line with WO the return rate went from 10 percent to 3. not 3 percent, but 3 actual returns due to misconfigured products. now, fast forward to 1997, Apple buys NeXT and this looks bad: Apple technology is running the Dell store. so what happens next? Microsoft to the rescue. it took a TEAM of MS engineers over 9 months to reengineer the site in ASP. on top of that the site did not have anywhere near the same functionality as it did when it was run by WO. and, and this is a HUGE and, the site went from being served on 4 large Sun servers to being served from over 75 machines running Windows NT. a pretty compelling argument that says ASP is not industrial class stuff. it is great for a small site or a site run by people who suck bill gates' you know what, but it is not a serious development tool. and to correct someone else, WO can be developed on Mac OS X Server -- and soon OS X -- and NT and can be deployed on OS X Server, NT and solaris for sparc. it can also connect to oracle, sybase, informix, etc., through adaptors, not ODBC as someone else said. you can do incredible data modeling with WO which you CAN NOT do with ASP.
and with WO, as the name implies, you deal with data as objects that can be manipulated just like any other application object. try to do that with ASP.
i don't know where you've been living, but you can deploy oracle enterprise server (8i) on linux. you can get 8i for around $3000.
(A general overview) Buy two good Sun servers and use a fiber connection between your database and web server with a switch or router to handle the split, and spend some dough on fancy-schmancy database software. Furthermore, get the highest level of support you can get for your software and *especially* hardware. Don't forget about security either. You are going to have to have some serious firewall and security software installed. Use some form of UN*X or Linux and hire someone to tweak the OS for speed and efficiency. Most importantly, develop a sound system and follow it.
I would like to hear how a CGI page can be cached. An Anonymous Coward said in a post to "cache cgi" or something like that. A CGI script mostly (as in my case) creates pages on the fly through one header routine, so you can do some trick with the meta tag there. When you do that, all the pages are then cached. Some problems may arise in such a caching solution. Is there anything other than that for caching CGI scripts?
Servlets do, in general, have scalability problems all their own. ;-) That being said, here's some advice:
:-) Seriously, they aren't very quick. Actually, using JSP (even though it's a servlet itself) can be quite helpful as it will automatically do a lot of caching for you.
1) Use Weblogic or even better iPlanet. These products have much faster servlet engines than you can get elsewhere.
2) Use squid as a proxy IN FRONT of the Weblogic back end. This will significantly improve your performance, at the cost of somewhat reducing the dynamic nature of your site.
3) Switch from using servlets.
sigs are a waste of space
I am the head programmer for an e-commerce site, and I recently decided to rewrite everything, from the ground up. Using these simple guidelines, I was able to achieve speed increases on the order of 100-500 times.
Avoid monolithic components at all costs. I mean by this: large tables in your database, large columns in your tables, and large, multi-function CGI programs. This also means that everything should do one thing, and one thing only.
Your database design needs to be normalized, and your selects optimized. Understand indexes. If you don't understand the concepts of normalization and database optimization, buy a book. Read it. Grok it. Also, examine the algorithms that are employed by the CGIs. If the algorithms are less than optimal, rewrite them from scratch. Rule of thumb: the source of software speed is ~90% design, ~10% implementation. If you use a fundamentally crappy algorithm or database design, a better implementation won't help you.
Another desirable aspect of a high volume web site is many servers. For example, your server's LAN should probably include: a web server for dynamic pages, a web server for static content (this can include static content masquerading as dynamic content, i.e. a caching server, or images), a secure server (if you need it), a database server, and a log server. That last one is a bit unique, and underrated. The best thing to do is log as much activity as you can; this allows you to identify problem areas. A log server doesn't have to be high powered, but it allows you to correlate activity much better. Customize your hardware to its task. RAID is your friend. Swapping is the devil. The processor isn't always the bottleneck. Keep this kind of stuff in mind, and above all, keep the UNIX philosophy in mind.
Where I work, we have just a basic shopping cart system, but since everything is a small component, it is easy to upgrade and extend. We have something like 25 tables in our database, and none with more than 9 columns.
Are you using MySQL? If so, a quick hint: use INSERT DELAYED instead of INSERT, and EXPLAIN SELECTs. If you have to manually join tables, that's a good thing.
Yes, I'm still a junky. Are you still a bitch?
Freshmeat uses PHP, MySQL and Apache. I'm sure they get a ton of hits every day and the entire website amounts to a huge indexed database.
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
Thank you very much for the multiple MySQL server trick. I've never thought of this, and that's a really useful hack. I'm probably going to use this soon -- in a simplified way: my current MySQL server is currently busy just computing stats and reports, thus slowing the whole process while it could easily be deported to another server. Coool! Thanks A LOT! You've saved me hours of coding on this one. Question: how exactly do you connect the update log to the other database? I can think of several ways.
A few ideas come straight to mind. Persistent CGI (such as Apple WebObjects) is a great idea. mod_perl on Apache can seriously speed things up. Perhaps move to Oracle, as you may benefit from some large performance increases. Linux 2.3.x has khttpd for kernel caching of static web pages (a la AIX).
EraseMe
The advice against MySQL will be an arguing point for some here, but I use Oracle every day. It can be a pain in the ass at times but there is no substitute for the reliability. Just spend the time to set it up right the first time and you have it made.
Spending wisely is another bit of good advice but one big server backed by redundant smaller servers can be a very good option. It has worked for me in the past. The fellow I am replying to is saying don't blow the whole wad on a couple of huge servers. This can be misunderstood. A big server can give a lot of stability if configured correctly.
ACK
Unfortunately, it doesn't help for the columns themselves to be indexed. Take this:
select id from inventory where description like '%foo%';
Indexing doesn't help with this at all since it has to search through the entire description field - 'foo' could be in the middle of the field as much as the beginning.
The solution is to create another table consisting of all the words in the applicable fields and a pointer to the record's location. Like this:
create table word (
word varchar(20),
id int);
where id is the record ID of the record with the word in it. So a description field like this:
HEWLETT PACKARD LASERJET 4
would become four records, with word being 'hewlett', 'packard', 'laserjet' and '4'. Then you can build queries that search for occurrences of those words in the word table, which is - of course - indexed.
I was able to decrease query time from .40 second to .01 second by making this change, and as a bonus the search results are more relevant, too, since parts of words were not considered.
I'm sure that was what the original poster was referring to. If you do it, remember to exclude common words like 'and' or 'for'.
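Here is a rough sketch of that word table in action. It assumes SQLite through Python's standard library rather than whatever database the poster used, and the stop-word list, helper names and second inventory row are invented for the example.

```python
import re
import sqlite3

STOP_WORDS = {"and", "for", "the", "of"}   # illustrative list only

def words_of(text):
    """Lower-case a description and split it into indexable words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def add_item(conn, item_id, description):
    """Store the record plus one word-table row per distinct word."""
    conn.execute("INSERT INTO inventory (id, description) VALUES (?, ?)",
                 (item_id, description))
    conn.executemany("INSERT INTO word (word, id) VALUES (?, ?)",
                     [(w, item_id) for w in set(words_of(description))])

def search(conn, term):
    """Exact-word lookup, which can use the index on word.word."""
    rows = conn.execute("SELECT id FROM word WHERE word = ? ORDER BY id",
                        (term.lower(),))
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (id INTEGER PRIMARY KEY, description TEXT)")
conn.execute("CREATE TABLE word (word VARCHAR(20), id INT)")
conn.execute("CREATE INDEX idx_word ON word (word)")

add_item(conn, 1, "HEWLETT PACKARD LASERJET 4")
add_item(conn, 2, "Toner for LaserJet and DeskJet")   # made-up second record
```

With this in place, search(conn, "laserjet") finds both records through the index, while a partial word such as "laser" finds nothing, which matches the poster's remark that parts of words are no longer considered.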
Hope that helps.
D
----
Okay, I have done a couple of sites that are serving up high volume... Here's a few tips:
www.suckage.com -> Java servlets on Linux with MySQL. It works pretty well. On Linux, use the IBM JDK 1.1.8. It's the only JDK worth anything in terms of performance... Also, it uses native threads so SMP will help 'ya. Honestly, Java on Linux is still not as fast as on Solaris or Windows NT...
www.fatfreeradio.com -> PHP and mySQL. This site is also running pretty well. Cool thing about PHP is it's very quick to develop. Bad thing is it's hard to stick to the MVC paradigm...
--hunter
RateVegas.com - Vegas Reviews
I'm one of the developers working on www.blink.com (we do a web-based bookmarking system), and the system's gotten large enough (thousands of concurrent sessions at any particular time, and many gigs of stuff in the database--sorry I can't be more specific) that I think I can comment on the performance concerns.
First of all, I think java servlets are easily ready for prime time for large, complex sites, much more so than other technologies I've seen. Blink has a huge number of features (even I don't know them all :-)) and there's no way something like this would have worked in perl or php3. We're not at 100K lines of code yet, but we're certainly headed in that direction, and a less rigid language would probably have left us with an unmaintainable system at this point.
Speed hasn't been a problem, at least on the web server and java side. Apache runs on its own machine, and that's hardly ticking over. We have to make sure we've got enough java servers behind it, but that's not a big problem. We use, for the most part, what I regard as fairly low-end PC-type boxes running Linux, not 8-CPU Xeons or E4500s or anything like that.
The database is what kills you, especially when you're highly transactional, as we are. To be able to do things like tell you if there's new (to you) content on a page, we need to know when you last clicked on it, which means we need to update the time last visited every time you click on one of your links. So our write load makes it very difficult to use multiple servers.
I strongly recommend getting hold of a guy who knows SQL very well when you put together your site. Doing schema changes later rather than earlier is more painful (as I know from experience) and the right schema can make you many, many times faster than the wrong one (as I also know from experience :-)). And at this point, if you're doing fairly sophisticated stuff, I'd recommend a good transactional server over something like MySQL. Surprisingly, MS SQL server is a great product, and fairly cheap compared to Oracle (though it comes along with the usual miseries of anything running on Windows).
Delivering real applications over the web is pretty difficult. It's really, really important to think about where you keep state, and where you can avoid keeping state. This can make a noticeable difference in efficiency, too.
In the end, there's no substitute for experience; there's a lot to learn in this field, and it's quite different from traditional applications programming. So get it wherever you can, whether it be by writing a prototype or finding someone who's done it before and describing your project over a couple of beers.
cjs
The world's most portable OS: http://www.netbsd.org.
The best way of dealing with complexity is to simplify and limit features to the essentials. JavaScript, graphics, fancy layout, etc. all can present a bottomless hole for time and bandwidth.
Another thing to realize is that the relational database you are using is probably one of your biggest bottlenecks. Relational databases are slow. MySQL is actually one of the faster ones, but that's because it doesn't make a lot of the guarantees or provide a lot of the features that a "real" relational database provides. You can get a lot of performance from your relational database by tuning it, but ultimately, the architecture and functionality itself present a limit. If speed is of the essence, consider using dbm, plain files, or memory mapped files (Apache has several "databases" internally, and that's its approach.)
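For the dbm suggestion, a minimal sketch of what "no relational database" can look like, assuming Python's standard dbm module and a made-up key layout (the "user:42:homepage" key and its value are purely illustrative). There is no SQL, no joins and no transactions here, just keyed reads and writes against a local file.

```python
import dbm
import os
import tempfile

def put(path, key, value):
    """Store one key/value pair; dbm keeps both as bytes."""
    with dbm.open(path, "c") as db:      # "c": create the file if it is missing
        db[key.encode()] = value.encode()

def get(path, key, default=None):
    """Fetch one value by key, with no SQL parsing or query planning involved."""
    with dbm.open(path, "c") as db:
        raw = db.get(key.encode())
    return raw.decode() if raw is not None else default

# Hypothetical store holding one user's page layout.
store = os.path.join(tempfile.mkdtemp(), "prefs")
put(store, "user:42:homepage", "commodities,news,weather")
```

A real server would keep the file open instead of reopening it per call; the open-per-call style above just keeps the sketch short.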
For a single person project, Java is probably not the best implementation language. It really shines for multiprogrammer projects. You may find that Python, PHP, or Ruby are better choices. Perl and Tcl are also widely used, but they are also the oldest of the scripting languages and have a lot of rough edges and clunkiness. The performance of all of them is excellent when used as server plug-ins. Perl and PHP have by far the largest libraries and toolsets.
Some of the packages I haven't used but that look interesting are the following. Enhydra takes a much better approach to dynamic HTML than most other packages, I think. Erlang / Eddie address scalability, reliability, and clustering in a really clean way (but you have to learn a new language). Zope I think has a lot of good ideas, but I'm not sure I'd use it for a large server.
Most load balancing solutions use some kind of network routing hacks. That's efficient, but can be a pain to configure. However, there is now a load balancing module for Apache that works as a proxy; that's probably also worth looking into.
The Coda file system is a free, next-generation AFS; while I'm not sure Coda is mature enough yet to be used in heavy-duty applications, systems like it help with scalability by replicating files automatically and have been used on some really large web sites.
I would stay away from commercial "application server platforms". They often are mostly repackaged open source software, and they are expensive and complex, and while they claim to be general, they are also usually created with fairly specific commercial applications in mind ("shopping cart", etc.). I think those kinds of packages are worthwhile for corporate developers that work with large teams of programmers and need consistency, documentation, training, and support.
So, to recap, this kind of stuff is still a lot of work and it seems like you are ahead actually. But there are a lot of proven open source tools available that have been used on really big projects. There are also a lot of open source developments coming around that may make this kind of project easier.
You're saying WO is crap, but the only reason you give is the lack of developers. I'm confused. There aren't as many developers out there, mostly all of the WO developers I know work on in house projects. But your average WO consultant is going to be a lot better than your average ASP consultant. I could probably walk down the street and find an "ASP Consultant" but that doesn't make me feel good about the technology...
Here's a shocker, WO isn't easy! If that scares you, go run to MS with their "Hey, anyone can do it!" products and increase the number of worthless MCSE's and ASP developers out there (It's so easy, anyone can do it poorly!). If you've never written WO apps before, you're damn right it's going to take work, and you actually have to learn. Imagine that. If you sit down two people with no ASP/WO experience and ask them to write an app, the ASP person will certainly succeed faster. If you sit down an experienced ASP developer and an experienced WO developer and ask them to write an app, the time difference would be minimal and the WO developer would also have a completely separated interface to ship off to the HTML designers and a database model that can be reused in your other applications (not to mention better performance in a much easier to maintain and modify package).
In my experience, locking contention is usually due to inappropriate indexing and bad SQL coding. I'm not familiar with MySQL, but if you are having to do funky schema changes like splitting the tables it sounds like MySQL isn't ready for prime time yet.
FWIW and FYI, MySQL is optimizing for speed by keeping things simple. They don't support a lot of SQL features for transactions by design. This lets them get some tremendously good performance for some things, for example Slashdot, which is mostly DB reads with occasional DB writes. But I wouldn't run an e-commerce site with it.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
shameless plug:
If you are looking to build a high volume dynamic site you need to look at ATG. The Dynamo app server is widely regarded as great. Sony's The Station is one of the very high volume dynamic sites that use Dynamo.
- Application Cache: Figure out what objects are constant across your entire application, and build a simple object cache that loads them up, holds them in memory, and has the ability to refresh the cache on a background thread.
- NO SESSION STATE: This one is counter-intuitive. You would think that it would be a good idea to cache some objects on a session-by-session basis, but it's not! Think about it this way: An average user clicks about 1 time per minute. By NOT caching objects for that user on a high volume server, you can use the resources to serve a couple thousand more requests. Also, most servers use cookies to hold a session ID, which must be read on each request - turning off sessions globally on your server will dramatically increase your maximum throughput.
- Scalability, NOT Performance: This is a trap that many developers fall into - they think that by optimizing how fast pages come up, they are creating a scalable site. What this translates into is optimizing for a single user's performance. The actual goal is a linear ramp up of response time based on the number of users. I've seen sites that returned pages in record time choke under a load of 20-30 simultaneous users because of this.
- Session State: If you truly need session state, create a separate database or LDAP server, optimized for reads, to hold your session state. Don't use cookies to hold your session ID - use a GUID or something similar as a URL parameter. This lets you have session state, but allows users to transparently switch between Web servers on the front end, and their session follows them. Also, you are only loading up session state on the pages that need it.
- Load Test: After every major change, load test your application, and watch for that linear ramp up. Capture statistics internally in your application about how effective your caching is, how much time is being spent on a particular operation, etc. Use these stats to figure out where to tune next.
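The application cache from the first point above could look roughly like this. It is a sketch in Python, not the poster's code: the class name, the loader callable and the refresh interval are all assumptions made for illustration.

```python
import threading

class ApplicationCache:
    """Holds application-wide objects in memory and reloads them on request."""

    def __init__(self, loader):
        self._loader = loader          # callable returning a dict of fresh objects
        self._lock = threading.Lock()
        self._data = {}
        self._stop = threading.Event()
        self.refresh()                 # load once up front

    def refresh(self):
        """Build the new snapshot first, then swap it in under the lock."""
        fresh = dict(self._loader())
        with self._lock:
            self._data = fresh

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def start(self, interval_seconds):
        """Refresh on a daemon thread every interval_seconds until stop()."""
        def loop():
            # Event.wait returns False on timeout and True once stop() is called.
            while not self._stop.wait(interval_seconds):
                self.refresh()
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

Usage would be something like cache = ApplicationCache(load_constants) followed by cache.start(300), where load_constants is whatever hypothetical function reads the constant objects from the database. Request handlers only ever call cache.get(), so they never wait on the database for this data.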
I think that's about it. If you (or anyone else) has any more questions, feel free to email me @ scotty@auragen.com. I've been doing a lot of this stuff lately, and could probably give you some more pointers.
Scott Severtson
Applications Developer
Scott Severtson
Senior Architect, Digital Measures
The system we are deploying on is capable of sustaining a delivery of more than 230 dynamic pages per second (which we have benchmarked using RSW's test suite) - around 20 million dynamic pages in a 24 hour period.
We are using Compaq PCs and each is fitted with twin PIII @ 500MHz, 1GB RAM, 9Gb SCSI local drive (mirrored) and they are running WinNT 4. All networking is 100Mb, doubled up cards and cables (for fault tolerance). Our hosting ISP (big and in the UK) is responsible for maintenance, support, (main/primary) backup and hardware configuration.
The system is isolated from anything but port 80 requests by a pair of Cisco firewalls, and all the kit plugs into a Cisco Catalyst 5505 switch. Behind the wall sits a BigIP box (for fault tolerance and load balancing - we've found BigIP kit to be very good, BTW) which passes HTTP requests to a pair of identical web/application servers (with the above config) that are both running IIS 4.0.
The backend is connected to the front via a couple of hubs, and consists of a grand total of four boxes (again with the above config) running M$-SQL 7.0. Using Compaq's own clustering solution, these are linked up as two buddy pairs for 24x7 fail-over. Each pair of db servers appears as one virtual SQL server and each pair of db servers shares a 186Gb RAID5 array.
We are using a proprietary scripting engine, which interfaces as an ISAPI filter and currently only runs under NT (hence the choice of M$ technology throughout). This is similar in architecture/functionality to PHP or ASP - although it outperforms both.
Our application architecture allows us to split our db across two db servers (careful db schema design allowing load distribution). Furthermore all of our SQL queries are very atomic (no joins!) and the query results are then turned into script files and saved into a (db query) file cache on the RAID.
Before we implemented caching of db queries as files, the db access was the bottleneck with throughput. Since implementing the query cache we can run at full tilt with our db servers ticking over at well under 10% load. We are using pooled persistent connections to our dbs also.
Our script (embedded in HTML) is compiled into byte code on first access and written out as a file (similar to the stuff Zend are gonna do / are doing). This improves execution speed greatly after the first access.
All files that are accessed by the script engine (which doesn't include the images/etc) are cached into memory. All the application code is stored on the RAID and is accessible to all web servers in the cluster - the performance hit due to accessing a file over the network is only incurred first read (due to the local memory caches on the web servers script engines).
When a db row is changed, the script file in the cache is rewritten and all the web servers are told to refresh the file cached in memory. (Mostly, our app reads from the db - writing to the db is quite infrequent compared to reading, hence this works very well for us).
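A much simplified sketch of the query-cache idea, for anyone who wants to try it. This is not Jim's proprietary engine: it assumes an in-memory dict instead of script files on a shared RAID, SQLite instead of SQL Server, and it simply drops cached results on a write rather than rewriting them. The class and table names are invented.

```python
import sqlite3

class QueryCache:
    """Caches results of single-table reads and drops them when the table changes."""

    def __init__(self, conn):
        self.conn = conn
        self._results = {}     # (table, sql, params) -> list of rows
        self.db_reads = 0      # how many reads actually reached the database

    def read(self, table, sql, params=()):
        key = (table, sql, tuple(params))
        if key not in self._results:
            self.db_reads += 1
            self._results[key] = self.conn.execute(sql, params).fetchall()
        return self._results[key]

    def write(self, table, sql, params=()):
        """Apply the change, then invalidate every cached read of that table."""
        self.conn.execute(sql, params)
        self.conn.commit()
        for key in [k for k in self._results if k[0] == table]:
            del self._results[key]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO article VALUES (1, 'Launch')")
cache = QueryCache(conn)
```

Because the queries are atomic (one table each, no joins), knowing which table a write touched is enough to know which cached results to throw away. That is the property that makes this kind of cache easy to keep correct on a read-mostly site.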
Searching the site is all done through a table of keywords (again this table is file cached and then memory cached) - so we do not incur a performance hit from free-text searching of our pages.
The system is fully scaleable simply through the addition of extra web/application servers should the need arise (which I'm sure it will, one day!). As previously stated, the actual real world load on our db servers is negligible so we should be ok there for a while. If necessary, we can also move from 100Mb Cat5 up to optic-fibre if networking becomes the bottleneck.
Just my 2 pence worth - HTH.
Jim
Probably not a popular view here, but if you are going to use something like Perl to be the front end glue for all your dynamic content, don't expect to be a speed demon when you've got a few hundred simultaneous users. A compiled language, like C, is optimum for thrifty thinking web designers. Small memory footprint. Small loading time. Small CPU expenditure.
Harder to program in, but worth the trouble if you're building a scalable site. Just a personal opinion... I would avoid a real SQL database for simple jobs. While plugging in something like MySQL may make programming a whole lot easier, you pay for it in performance.
1) Free text search on your database is going to slow down your application. If you can, try searching on a summary or title, or try to use a third party search engine, e.g. Verity.
2) Have you indexed the tables?
3) RAM use is very intensive on dynamic sites; try for 1G to start with.
4) 200 req a second is pretty respectable, but page requests, and more importantly dynamic page requests, are what you want to measure.
5) have a look at
http://hotwired.lycos.com/webmonkey/design/site
.html
for some tips about designing redesigning sites
6) Don't make everything dynamic just because you can.
adios
With the important thing being the balancing... separate DB and web servers, ads off of a different box... more of that info is included in the Slash source d/l, so take a look..
"It's tough to be bilingual when you get hit in the head."
140,000 pages per day is a pretty small load.
;)
Obviously, but the main part of the point was that it is 99% idle during the peak loads. I doubt ASP/VBScript could implement starship traders, run 17 separate 64,000 sector games with hundreds of players each, and still be 99% idle on a Celeron 366.
Geeky modern art T-shirts
What happens if I'm running on an old 486? (I know people who are.) What if I have Javascript turned off, or filtered by a firewall? What if I have a browser with a buggy Javascript implementation?
Unless you are in a highly controlled environment where you know the capabilities and configuration of every client, you should never, never, never do anything critical to your site on the client side.
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood
I've been amusing myself lately by creating a discussion system at my unreasonable.org website. The interesting part wasn't so much coding it up (in PHP with PostgreSQL and a little Perl), but in figuring out the process - who's doing what to whom, and who might be doing other whats to someone else as I expand the system. I'll probably hack on it for a few months and then do a rewrite to clean up - but the data model shouldn't have to change.
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood
... the problems you describe with mod_perl have a pretty simple solution. You just use mod_perl as an application server and put a reverse proxy (like Squid or Apache + mod_proxy) in front of it to serve static requests.
Sure, that's the standard solution to the problem of oversized mod_perl-ified processes eating up memory, but it comes at a high price in mod_perl's performance. I've found that mod_perl with a proxy in front of it can be about 20% as fast (in terms of requests/second) as direct access to mod_perl.
"Naked" mod_perl, i.e. mod_perl without a proxy in front of it, is one of the fastest solutions around; but proxied mod_perl is so much slower that you might as well go to something like FastCGI and save all the administration effort that mod_perl requires.
Always keep a sapphire in your mind
The problem with giving students bad grades in courses is that (1) not only were they confused to begin with and didn't learn the material, but (2) they will be pissed off with you later! :-)
Anyway, ACS has nothing to do with AOLserver or Tcl, really. SAP was written in COBOL initially. I'm sure a lot of language snobs threw rocks at them but at the end of the day a good engineer realizes that the value is in the data model and business process workflow. Now that ArsDigita has $20 million in revenue, we're releasing pure Java and Apache versions of ACS. So the anonymous coward (and former bad student) could have learned that his statements about being "forced" to develop in Tcl are simply wrong. All that he would have had to do is visit arsdigita.com.
Anyway, the sad thing is that most people aren't good enough engineers to compare apples to apples. So they compare our toolkit, which is all about data models, to something like PHP, which is probably a fine programming language but it does not even try to do the things that ACS does.
On the scaling issue originally posted: people with sane tech architectures are usually able to rely on serving 10 dynamic requests/second/processor (these would involve a db query and a template merge). Could you do more than this? Sure! The real-time gamer guys build amazing stuff using UDP, custom C code, and in-memory databases of their own design. But what's the point for the average online community or ecommerce publisher? Most people with high traffic sites can afford a rack of processors and a load balancing switch.
Anyway, I don't want to get into a flame war about AOLserver or Tcl. If you want to use our toolkit with pure Java running inside Oracle 8.1.7, go for it! You won't even need a separate Web server. If you want to use Apache instead of AOLserver, go for it! You'll probably need a slightly bigger machine but it won't really affect the end result.
At ArsDigita, instead of trying to keep up with the latest in fashionable languages, we spend our time worrying about "What data models and workflow structures would we need to run the entire MIT Sloan School from a Web-based system?"
Not a God? Now I'm sad.
Not as sad as I was on that Boston Marathon day (when we got 30X the traffic that the customer told us to expect). But is that a failure for our toolkit? No. We had the equivalent of 8 400 MHz CPUs working on the Marathon server (two Sun E450s). If you can serve what seemed like half the US population with two Sun E450s using PHP, that's great. But so what?
There are people using ACS with a fat-ish db server plus a rack of small Web servers plus a load balancer. They don't seem to have any trouble scaling to arbitrary size.
"On really expensive boxes"? Photo.net handles 1 million hits/day, goes to the database on every page load, and runs on a computer with 4 180 MHz processors! There are 10-year-olds whose Quake machines would crush my server like a bug.
I am so sick of being the poster child for AOLserver (5 MB of code, smaller than the Oracle client library) and Tcl (which takes about two hours to learn). They are both open source, they don't get in the way, but if you want to use our toolkit (which is also free and open-source), you can now do it with Apache or with all-Java inside Oracle. Ben Adida ported it to PostgreSQL so you can be 100% open-source.
Anyway, if you want to start from scratch in MySQL and PHP and try to replicate the effort of ArsDigita's 80 full-time developers, be my guest. But don't confuse yourself into thinking that these are "solutions" (as you put it). They are tools.
When your site uses SSL, the certificate is for one IP, while mod_backhand and round robin DNS all respond with different IP addresses. Reverse proxying doesn't scale beyond 4 servers or so. How does Amazon or whatever handle load balancing among SSL web servers?
Of course, I posted this late, so no one will probably read this question...
"There's so much left to know/ and I'm on the road to find out." -Cat Stevens
Pertinent links:
Here, at Server51.net, is some info on SLASH.
-and-
This is the place you really want to go. Slashcode.com. Here, you'll find bug reports, feature requests, and the latest source (and a few trolls...just like home :-).
Also be sure to look through the mailing list archives... (at Server51, I believe)...
Here's my copy of DeCSS. Where's yours?
censorship is a form of noise, which actively seeks to drown out content with silence - Crash Culligan
Of course, don't search just Slashdot for articles like this. Also search the archives of papers that have been presented at various USENIX/SAGE conferences (in particular, LISA), and other USENIX publications, starting at http://www.usenix.org/publications/publications.html.
You will also want to use index sites such as Yahoo! and Excite, as well as search engines like Google, Altavista, and Hotbot, not to mention community directory projects such as dmoz Open Directory.
That's just a sampling of the sorts of research that you should START with. Of course, to do this right, you'll need to do much, much more.
--
Brad Knowles
Brad Knowles
http://daily.daemonnews.org/ -- if you're not
Also, a lot of functionality can be implemented at a lot of different levels. If you're using a powerful and fast database like Oracle, then use its features to do as much as possible.
And finally: optimize your queries. It matters a great deal whether a single page in your site is constructed from 10 or 20 different queries.
Just my 2 cents...
First, design the system: all the tiers (layers) in the n-tier system. Second, define which functionality is needed in which tier to make the system work as defined. Third, choose for each tier the right technology to make it do its best in both performance AND reliability. THEN you start implementing.
:)
If you let your vision get blurred by focussing on one or 2 technologies before you've defined and designed the system, you'll never get to your goal.
As a global constraint there is always the budget and what you can buy for the budget. Another is the lack of skills in certain fields of technology: some people on your team can be great Perl programmers but lousy at building COM components for middle tier software.
I know this sounds vague, but if you first do steps 1 and 2, and do them well, you can then ask which technology is the best choice for which tier/functionality group, given your team's constraints of money and knowledge/skills.
Any platform and tool mentioned in this thread will be, in a certain way, able to give what you asked for. Even Windows 2000 with built-in middleware (COM+), ASP, and an additional SQL Server or the free MSDE (Microsoft Database Engine, a SQL Server lookalike without stored procs). It's however not possible to point NOW in the right direction as to which SET of tools, technologies and languages you'll need to create the high speed web application setup you're after.
Please keep that in mind.
--
Never underestimate the relief of true separation of Religion and State.
I have much experience in this area. I run a hosting company that has built High Availability into the design of the network. Using Linux Virtual Server as a load balancer allows us to scale very easily and provides maximum uptime and speed. For scalability, to add more resources, we simply add new servers behind the LVS. The LVS is a patch to the Linux kernel, so it performs very fast and it provides a layer of security, since you can forward only the ports that you want to. LVS has several different algorithms for scheduling which server gets each request.
Also, using the Roxen WebServer provides great performance for dynamic content. Roxen has built-in database features that perform very well; for instance, Roxen will keep the database connection open for fast access. Roxen will also allow you to write scripts that run very quickly because they get read into the Roxen process memory the first time they execute, and they never have to be loaded again (no forking). Or, you can get away from CGI altogether and write your own Roxen modules that run internally in the server and add functionality. Roxen also has its own built-in markup language called RXML that includes tags for doing database queries, creating images on the fly, and many more. Bottom line, Roxen gives you the ability to do anything you want with your web server, and still gives you the best performance possible.
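To give a feel for what "scheduling algorithms" means here, a toy sketch of two common ones, round robin and least connections. This is plain Python for illustration only, not LVS code or configuration; the class names and server names are made up, and the real thing does this inside the kernel.

```python
import itertools

class RoundRobin:
    """Hands out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Sends each new request to the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        # min() returns the first minimum, so ties go to the earliest-listed server.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a connection to `server` closes."""
        self.active[server] -= 1
```

Round robin is fine when every request costs about the same; least connections copes better when some requests (say, heavy dynamic pages) hold a server much longer than others.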
For example, we have developed smtp and pop servers that run internally in the web servers. Our design allows us to load balance all of our services. The mail server also uses mysql to manage all the messages, email accounts, and aliases. None of the email accounts actually have system user accounts on the servers (higher security and easier to manage). Also, we get better performance out of the mail server, because instead of writing to files for mailboxes, it writes to a database. This is much easier to manage in an environment such as the hosting industry where we must maintain many domains and users.
I was an Apache user for many years, until I discovered Roxen in mid 1998. And I would never go back. I have experience running thousands of web sites, using this concept, successfully.
I also know that Real Networks uses Roxen as their web server and development environment. Maybe somebody from there can post a comment about their experiences as well.
You may say that 50 million hits is an unrealistically high number to be discussing, but considering the rate of growth of web traffic, it won't just be the top ten sites that get this much traffic - many sites can expect it.
Agreed. Almost none of these tools are sufficiently mature to this point.
Use well known, well respected, and evolved tools.
To this point, I would caution against using MySQL - while excellent for smaller jobs, I have found it unable to hold up to high volume concurrent use as well as Oracle. When it comes to an RDBMS, spend the money and get it over with. Free solutions have not closed the gap quite yet.
Cache! Cache whatever you can.
Make use of Akamai or another caching network where useful.
Be ready to spend. Running a fast, large hits web site is expensive.
And spend wisely. Don't fill up your rackspace with full-height servers when 1u servers will do. Don't tie too much of your site to one piece of hardware - you get better tolerance and granularity from many small servers as opposed to a few big servers. The colocations I visit tend to be a textbook case study in how to waste rack space.
100 million hits???? No. I have mistakenly put Perl on one of our most highly visited pages and watched it get crushed like a grape...and this was only a random redirect generator.
I'd happily eat my words if someone could demonstrate PHP handling 100 million hits a day. I'd be pleasantly surprised..as my experience at a site that gets this type of traffic (so yes, I actually have some vague idea of what I'm talking about) has indicated otherwise. Our experiences with PHP (yes, we tried it on part of the site) on even moderately traffic'd parts of our site were not good.
Really, I do this stuff for a living (primarily in server-side Java and these suggestions may show that), and there is no easy way I know to get great maintainability and high performance. Here are a few things to consider, though (there may be some information that was in previous posts):
Well, I'll stop there. There's plenty more to cover, but there are also plenty of books out there. Basically, in my experience, the first rule of maintainability is keeping the problem domain separate, and the first rule in performance is caching.
--- but I don't want a "sig".
One of the most common ways to scale your site is to add hardware load balancing and multiple cheap servers running exact copies of your web site. This is how I have built my company's site.
We use a Cisco LocalDirector to load balance as many Linux servers as we like. It is very easy to insert and remove servers in a configuration like ours. Most large companies (such as Amazon or Yahoo!) will use multiple systems like this.
As for adding on new features, I would say that an iterative process is best. Once again, most large sites will have hundreds of developers working on different features which they will roll out one by one.
Slashdot would be the obvious example, right? So ask CmdrTaco and his crew, and take care to download and fiddle with Slash first!
Ok, I promise I don't work for these guys, but I'd have to highly recommend Resin from www.caucho.com. It's open source and amazingly fast. We can serve dozens of requests per second of a reasonably complicated site on a crappy $400 Linux PC.
Also, what JVM are you using? Definitely try the newest Sun JDK (with the Inprise JIT, which must be downloaded separately) for a single-processor system and IBM's JDK for an SMP box.
For App Servers, you should check out WebSphere from IBM (not too expensive, relatively speaking). Also, TowerJ can DRAMATICALLY speed up Linux server-side Java and improve scalability (towerj.com, I believe), but it ain't cheap.
Good luck!
--JRZ
Be sure to separate dynamic and static content. Make sure your servers that are hitting the database and generating content are not wasting CPU and memory serving images.
Make sure that you use HTTP headers to your advantage. By setting these correctly, you can benefit from caching on the client end, as well as on a front end server (Squid, mod_proxy, etc.).
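To make the header advice concrete, here is a minimal sketch of the response headers involved. The function names and the one-hour lifetime in the usage note are my own assumptions for illustration, not anything a particular server requires; the header names themselves (Cache-Control, Expires, Last-Modified, Pragma) are standard HTTP.

```python
import time
from email.utils import formatdate


def cache_headers(max_age_seconds, now=None):
    """Headers that allow browsers and front end proxies to cache a response."""
    if now is None:
        now = time.time()
    return {
        # HTTP/1.1 caches read max-age; "public" permits shared proxy caches.
        "Cache-Control": f"public, max-age={max_age_seconds}",
        # Older HTTP/1.0 caches only understand an absolute expiry date.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
        # Lets a cache revalidate later with If-Modified-Since.
        "Last-Modified": formatdate(now, usegmt=True),
    }


def no_cache_headers():
    """Headers for per-user pages that must never be served from a cache."""
    return {"Cache-Control": "no-cache, no-store", "Pragma": "no-cache"}
```

Something like `cache_headers(3600)` would suit images and other static content, while `no_cache_headers()` would go on personalized pages.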
Find a tool that lets you separate code and HTML. Perl modules work great in conjunction with mod_perl and one of the embedded Perl modules (HTML::Mason, HTML::Embperl, ePerl, etc.). However, many tools allow you to do similar things, such as Java Server Pages. The key is to be consistent and keep the logic separated from the layout and display.
HTH, Aaron
In Starshiptraders.com (a game), I wrote the entire thing in C -- there isn't even a copy of Apache involved. That system serves about 140,000 pages per day, and at peak times, the single-cpu (Celeron 366/320MB) is about 99% idle. All the files will fit in memory -- I only have about 120MB of data, but it is frequently intensively used. The data is all stored in flat files with pointers to related records in the same and other files. All data is, therefore, directly addressable and there is not much in the way of wasteful I/O. I can't really recommend this approach though for the obvious development and maintainability problems that it entails. ;)
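For anyone curious what "directly addressable" flat files look like, here is a rough sketch in Python rather than C. The record layout (a 16-byte name, a credits field, and the index of a related record) is invented for illustration and is not the game's actual format.

```python
import io
import struct

# Hypothetical fixed-length record: 16-byte name, int credits,
# int index of a related record (-1 meaning "none").
RECORD = struct.Struct("<16sii")


def write_record(f, index, name, credits, related):
    """Store a record at its slot; the slot offset is index * record size."""
    f.seek(index * RECORD.size)
    f.write(RECORD.pack(name.encode("ascii"), credits, related))


def read_record(f, index):
    """Fetch one record with a single seek and read, no scanning."""
    f.seek(index * RECORD.size)
    name, credits, related = RECORD.unpack(f.read(RECORD.size))
    return name.rstrip(b"\0").decode("ascii"), credits, related


# Usage: an in-memory file stands in for the flat file on disk.
data = io.BytesIO()
write_record(data, 0, "trader", 500, 1)
write_record(data, 1, "ship", 20, -1)
_, _, related_index = read_record(data, 0)
related_record = read_record(data, related_index)  # follow the "pointer"
```

Because every record has the same size, a stored index works as a pointer: following a relation is one multiplication and one seek.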
My other dynamic site project, SiteReview.org (user-posted website reviews), is written in PHP, serves pages with Apache, and has a MySQL back end data store. The /. folks are proof that MySQL and Apache are up to the task for some serious work. I have gone to some trouble to minimize the number of database calls and will work a bit more to minimize the size of returned pages. Each of my tables is indexed on the (very few) columns that are used to access it, so I get no full table scans. PHP can be compiled as a module for Apache, eliminating the startup overhead and resulting in quite efficient processing. PHP is also very easy to work with.
Currently that site is a work in progress with very small volume and I therefore have no evidence yet that I did anything right ;). (Anyway, it's at a hosting provider who is not optimized for PHP -- they call a php executable for each of my pages. If the need should arise, I will move it.) However, I think that this approach is a good balance between maintainability, efficiency and scalability. You can start with a single system and, when the load exceeds the capabilities of the box, you can easily offload the database onto a dedicated database server and put up multiple webservers on the front end with DNS round-robin or somesuch.
Geeky modern art T-shirts
I have worked to minimize the database calls and keep the pages as small as I can. By minimizing the database calls I give myself more room to grow before I start needing more hardware, and by keeping the pages small I make the site more slow-connection friendly and make better use of my bandwidth. I think that if you are waiting for something to download it should be what you want (content), not fluff.
I have added features slowly as I have gotten them working. Comments, user logins, syndication pages, etc. I think that if you get a good idea get it online and then work to make it better.
I think you should always keep in mind that anything cool may soon be much bigger, so write a site that is cool when ten people use it and is still cool (and fast) when ten thousand people (or more) are using it.
I would also recommend setting things up so that your content can be syndicated and shared on other sites.
RootPrompt.org's headlines for example can be had in Netscape's RSS format at:
http://rootprompt.org/rss/
and in text format at:
http://rootprompt.org/rss/text.php3
Doing this will allow you to share the content that you create with the world without requiring a lot of machine power on your end.
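For anyone who wants to try the same thing, a headline feed is not much code. This is only a sketch of one way to build one; the function name and the example values in the usage note are made up, and it is not how RootPrompt.org generates its own feed.

```python
import xml.etree.ElementTree as ET


def build_rss(site_title, site_link, description, headlines):
    """Return a minimal RSS 0.91-style document as a string.

    headlines is a list of (title, link) pairs, newest first.
    """
    rss = ET.Element("rss", version="0.91")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_link
    ET.SubElement(channel, "description").text = description
    for title, link in headlines:
        item = ET.SubElement(channel, "item")
        # ElementTree escapes &, < and > in the text for us.
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")
```

Calling `build_rss("Example Site", "http://example.org/", "Headlines", [("First story", "http://example.org/1")])` gives a string you can write to a static file whenever a story is posted, so serving the feed costs no database calls at all.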
Noel
RootPrompt.org -- Nothing but Unix
kayaking
If you're using JServ and Apache, you can use multiple front end servers (e.g. with round robin DNS) and multiple backend JServ engines.
If you want to cluster multiple SQL servers, then you can have multiple read only mysql servers and one write mysql server which updates the other mysql servers from the update log.
Your DB code would have to be aware of the read and write DB servers.
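A rough sketch of what "aware of the read and write DB servers" could mean in application code. The class and the server names are hypothetical, and real code would hand back connections rather than names.

```python
import itertools


class ReadWriteRouter:
    """Send SELECTs to read-only servers in turn, everything else to the writer."""

    def __init__(self, write_server, read_servers):
        self.write_server = write_server
        # Fall back to the write server if no read-only copies are configured.
        self._readers = itertools.cycle(read_servers or [write_server])

    def server_for(self, sql):
        words = sql.split()
        if words and words[0].upper() == "SELECT":
            return next(self._readers)
        return self.write_server


# Usage with made-up host names:
router = ReadWriteRouter("db-write", ["db-read1", "db-read2"])
target = router.server_for("UPDATE users SET visits = visits + 1")  # "db-write"
```

One thing to keep in mind with this layout is that the read-only servers lag the write server by however long the update log takes to apply, so a read that must see a write made a moment ago should go to the write server.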
As for coping with changes: if it's the interface that changes, then you need to create your architecture separating application logic from interface logic,
e.g.
Servlet (Java) code handling DB interaction and application type logic, and an intelligent templating language handling the interface (XSL).
About 6 months ago I began development on a site that I wanted to be scalable to the extreme. After research into which tools best fit my job I decided on AOLServer for many reasons, including: multithreaded vs forking architecture, persistent db connections, shared memory space, proven track record, and simple implementation, including embedded tcl in pseudo-ASP-like pages. Fortunately, since then AOLServer has become Open Source under the GPL, allowing my complete architecture to rely on only Open Source tools: Linux, PostgreSQL, AOLServer, Postfix, and more. I believe AOLServer to be one of the best kept secrets as far as Open Source tools out there. A company named ArsDigita has an Open Source toolkit designed for building online communities, and online forums for any problems in case you get stuck. I would write more but it's 8AM and I haven't been to bed yet ... maybe when I get up after noon ;)
http://www.acme.com/software/thttpd/benchmarks.html
http://www.cs.wustl.edu/~jxh/research/
http://photo.net/wtr/thebook/server.html
http://aolserver.com/features/
http://www.aolserver.com/tcl2k/html/index.htm
http://www.linux-ha.org/
http://www.linuxvirtualserver.org/
http://www.citi.umich.edu/projects/citi-netscape/reports/web-opt.html
http://linuxperf.nl.linux.org/
http://www-4.ibm.com/software/developer/library/java2/index.html
http://www.squid-cache.org/
Hopefully this may be of help to you also. After all this research I was very pleased to go with AOLServer. Even though it was not in the web server comparison at the thttpd site, its model was represented there by Zeus and thttpd, and AOLServer has many additional features that really sold me.
In my experience, locking contention is usually due to inappropriate indexing and bad SQL coding. I'm not familiar with MySQL, but if you are having to do funky schema changes like splitting the tables it sounds like MySQL isn't ready for prime time yet. Dodgy workarounds are no substitute for a quality DB server. My personal preference, Sybase ASE 11.0.3.3 for Linux, is available with a zero-cost license for both production and development deployments. Sybase ASE 11.9.2 has some significant improvements over 11.0 and is zero-cost for development only. (Unless you REALLY need row level locking, 11.0 will probably meet your needs.)
Question - are you splitting between rows or between columns? If you are having to split between rows, the problem is most likely resulting from an inappropriate clustered index. In an insert-intensive database, a bad clustered index will result in a hot spot in the last data and/or index page of the database. The best solution here is to cluster on a surrogate key. Your surrogate key generation algorithm needs to be carefully designed to distribute inserts evenly in the table. If you are doing primarily single-row updates and inserts, you should only be seeing page-level locks. If you are updating records frequently, try and use only fixed-length datatypes (or at least only update the fixed-length fields); this allows in-place updates. You should avoid indexing frequently updated fields, if possible.
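To illustrate the surrogate key point, one well-known way to spread inserts is to reverse the bits of an ordinary counter, so consecutive inserts get keys that sit far apart in the index. This is a sketch of that one technique and not a Sybase feature; the function name and the 32-bit width are assumptions. Whether it helps depends on your server and on how the table is clustered.

```python
def scattered_key(sequence_number, bits=32):
    """Reverse the low `bits` bits of a counter value.

    Consecutive counters (1, 2, 3, ...) map to widely separated keys,
    so inserts do not all pile onto the last page of a clustered index.
    The mapping is one-to-one, so keys stay unique.
    """
    if not 0 <= sequence_number < (1 << bits):
        raise ValueError("sequence_number does not fit in the key width")
    key = 0
    for _ in range(bits):
        key = (key << 1) | (sequence_number & 1)
        sequence_number >>= 1
    return key
```

With an 8-bit key for readability, counters 1, 2 and 3 become 128, 64 and 192. The price is that rows inserted together no longer sit together on disk, so range scans in insert order get slower.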
Database design is an art. Ditto for performance tuning. An expert DBA is worth his/her weight in gold. There's a good reason top DBAs command top rates.
"The axiom 'An honest man has nothing to fear from the police'
Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
Cache your dynamic data
I cannot emphasize enough how important this is. If your site pulls up a page of 50 records, and each record pulls in other database info, create a cache for the whole page. Then create a cache for each record as well. Cache elements as well as entire pages of dynamic data into separate cache tables in the database wherever possible.
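Here is a small sketch of the page-plus-record caching idea. It uses in-memory dictionaries as a stand-in for the cache tables described above, and the class name, function names and expiry times are all assumptions made for illustration.

```python
import time


class TTLCache:
    """Keep built values for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expiry time, value)

    def get_or_build(self, key, build):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, skip the expensive work
        value = build()
        self._store[key] = (now + self.ttl, value)
        return value


record_cache = TTLCache(ttl_seconds=300)  # individual rendered records
page_cache = TTLCache(ttl_seconds=60)     # whole rendered pages


def render_record(record_id, load_record):
    # load_record is whatever function actually hits the database.
    return record_cache.get_or_build(record_id, lambda: load_record(record_id))


def render_page(page_key, record_ids, load_record):
    def build():
        return "\n".join(render_record(r, load_record) for r in record_ids)
    return page_cache.get_or_build(page_key, build)
```

A repeat request for the same page touches neither the records nor the database, and a different page that shares some records only loads the ones it has not seen yet.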
Split your servers
Separate your database and Apache servers. They should communicate with each other through a 100Mb network switch at the least (make sure you use full duplexing as well).
Split your tables
I know this sounds funny, but the current version of MySQL has a tendency to lock tables when doing writes to the table. This means that one update on a table can halt all other reads on the table. MySQL is very fast, so normally this isn't much of a problem, but when you start getting into a high volume of requests, you're going to start to get bogged down. The solution to this is to split your table into multiple tables, i.e. the table mydata becomes mydata_1, mydata_2, mydata_3, etc, where for example mydata_1 might hold records starting with a-e, mydata_2 is for f-h, etc. This might sound tricky but it will save you a lot of trouble. I can't even count the number of times we stared at the mysqladmin processlist and saw one of our tables constantly locked (stopping all reads and writes) before we came up with this solution.
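The lookup that picks a table can be tiny. In this sketch only the first two ranges (a-e and f-h) come from the description above; the remaining ranges and the catch-all table are invented to round out the example.

```python
# (low, high) first-letter ranges; table numbers start at 1.
RANGES = [("a", "e"), ("f", "h"), ("i", "m"), ("n", "r"), ("s", "z")]


def table_for(key, base="mydata"):
    """Return the name of the split table that holds the given key."""
    first = key[:1].lower()
    for number, (low, high) in enumerate(RANGES, start=1):
        if low <= first <= high:
            return f"{base}_{number}"
    # Digits, punctuation and empty keys go to a catch-all table.
    return f"{base}_other"
```

So `table_for("apple")` gives mydata_1 and `table_for("fig")` gives mydata_2. The table name always comes from this fixed set, so it is safe to put into the SQL string, while the key itself should still be passed as a bound parameter.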
Your MySQL server
Should have a ton of memory plus a fast disk. Preferably a gig or more for memory, use RAID or a fast (10,000 RPM+, 6ms or less seek time) SCSI disk for the data, also keep it separate from your OS disk (i.e. don't have your OS running on the same disk as your MySQL tables). Save the high MHz processors for...
Your Apache servers
We've found that the really processor intensive stuff happens on the Apache servers. So you should keep an eye on the load average, etc. on these servers. If they start to get bogged down, you can just pop in another Apache server and split the load.
The nice thing about this setup is that you can keep adding Apache servers as your processing needs go up. From our own experience MySQL is not very processor intensive but very dependent on memory and disk speed. When using MySQL with Linux, you also have to be careful about file system limitations, handlers etc. Tweaking certain variables should help. Good luck!
No, I'm not flaming Slashdot; I love everything else about the site. But its accessibility unfortunately didn't improve with the Andover.net takeover, nor through any of the other changes that have been happening in the last two years.
I'm sure other people's mileage will vary, I'm interested in hearing other people's experience.
----------
In a real emergency, we would have all fled in terror, and you would not have been notified.
Here are some of the advantages I've seen:
Since FastCGI processes are independent of the server, they are less likely to weigh the server down with a heavy processing load, and buggy FastCGI's are less likely to slow down or crash the server. If a FastCGI is going haywire, the problem can be diagnosed with the usual tools for analyzing the process behavior (like ps, top, Sun's proctool, etc). And FastCGI can be configured to adjust the number of running processes to fit the load.
In contrast, technologies like mod_perl or PHP, which are embedded in the server, place an extra load on the server itself. It increases their memory footprint (especially in the case of mod_perl), which can be very problematic when Apache forks extra servers to handle request spikes -- you run the risk of running out of memory. They can make the web servers start up more slowly, and if one of your programs has gone on the blink, it can adversely impact the servers themselves. And embedded programs are not as easy to debug as independent processes.
In the case of servlets, since they all run as threads within a JVM, then if one of them is buggy or slow, it's not easy to find out which one is causing the problem. Usually you just notice that your JVM has slowed down, deteriorating everything else; then you have to go about finding out which thread is responsible.
None of these problems come up with FastCGI. You can write FastCGI processes in whatever language you like, as long as you honor the protocol. And there are re-usable FCGI interfaces for C, C++, Java, Perl, Python and TCL.
I personally happen to like Perl a lot, and I very much like the idea of mod_perl. Programming to the Apache API with Perl is way cool, and so many Perl programmers fall all over themselves praising it as a panacea. But because of the memory impact on the server, I have found it very difficult to implement mod_perl so that the server is stable and doesn't eat up all my RAM. It can be done, with a lot of effort on the part of the web server administrators, but it's certainly a lot harder than it is with FastCGI.
And for the record, I do recognize the strengths of the various other techniques that I've mentioned as well. They all deserve their status as highly respected technologies for server-side programming, but FastCGI ranks up there with them in quality and deserves more attention than it's been getting.
Always keep a sapphire in your mind
Most points mentioned here are covered in detail in this book.
Philip & Alex's Guide to Web Publishing and the Web Tools Review are some good sources of information on this topic. Both can be easily found at http://www.photo.net/. Philip Greenspun, who is the creator of photo.net and wrote the Guide to Web Publishing, also is the founder of ArsDigita. ArsDigita does web dev consulting and offers a free, open source toolkit for building robust, high-utilization sites. The previous poster directed you to a good info source, I'm not sure why they were rated down to 0...
- Ignore your application server vendor. They have to pass on some of the cost to Oracle, and they don't really manage Amazon.com with their product - but they probably do some small part of it so they can say that legally. I'm willing to bet that it's the most unreliable part of Amazon.com.
- Use well known, well respected, and evolved tools. These include things like mod_perl, Apache, and Oracle. Java servlets are getting there, but they just aren't that fast yet on large projects (you saw that they don't scale fantastically, and their JDBC drivers are much slower than Perl's equivalent). AOLServer also looks like a fairly nippy option, but you need to use tcl to program it AFAIK.
- Tune your database. This can't be stressed enough. It may take the rest of your life, but do it anyway. And if you can't do it, then hire a professional. These guys are expensive though - but you get what you pay for in this respect.
- Split up your hardware. A separate DB and Web server can increase your application's speed no end due to removing contention for resources.
- Cache! Cache whatever you can. If using something like mod_perl then stick the "Oops" proxy server in front of it to cache page accesses (there are good reasons why this speeds things up). Cache stuff in your server's ram. Cache stuff in shared memory.
- Be ready to spend. Running a fast, large hits web site is expensive. There's no ifs nor buts about this unless you don't mind downtime. PhilG of "Philip and Alex's" fame estimates something like $100,000+++ a year to run a web site like this, taking into account Oracle costs, support, DBA costs (yes, you do need one), hardware and network costs.
And read "Philip and Alex..." - even if you only get the web version - somewhere off http://photo.net. He debunks the myths of application servers and reducing the costs and time of development of this sort of thing. And read "The Mythical Man Month" - that also debunks the idea of reducing the time to develop complex things. Good Luck!
Matt. Want XML + Apache + Stylesheets? Get AxKit.