What Do You Look For in a Big Iron Review?
ValourX writes "We're starting to write more reviews of enterprise-class hardware and software and although we've done pretty well with our reviews, the high-end products are a lot trickier when it comes to testing and evaluation. Obviously it is not possible to build an enterprise-grade 'your neck is on the line' production environment just for writing reviews, but maybe we can do something smaller, just for testing purposes. What do you as an IT professional want to read in a review for a server OS or a high-speed switch, or a big iron server or proprietary workstation? What tests should we run? What results and feature comparisons are going to be most meaningful to you?"
Well, the two main issues with big iron equipment are how well it handles load and how well it scales. For load, they should max out the system slightly above the recommended specs and see how well it copes. Most people don't care about overall benchmarks; they care about the issues that affect the user. Say it was a web server: we don't care how many pages per second it can handle, but how well we get the web pages when the system is maxed out. Do we wait 5 minutes and then the page just pops in? Or do we wait 5 minutes for the page to load but see the results coming in as it goes? When working above the required load, how much does the system heat up (possibly causing failures in the future)? Second is how well it can scale. Can extra processors be added? Can you add or hot-swap processors while the system is running? What is the maximum RAM it can hold, and is there room to add more? How compatible is it with competitors' equipment (say, an IBM server with a Sun storage array)? How well does it follow the standards, so that you can keep using the server even if the company that produced it dies?
Speed (which is what a lot of people test their big iron on) is really not that important a detail. A PC with a 3 GHz processor will outperform a Sun Fire 15K with multiple processors on any single task. But when it starts handling load, the Sun Fire will handle it better than the PC. When companies decide to buy big iron, they want it to be an investment that lasts at least 3-4 years, preferably 4-10 years, and all they need to do is add components so that it scales over time.
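A minimal Python sketch of the "how does it feel when maxed out" measurement: instead of pages per second, report tail latency from response times you collected yourself during a run pushed slightly past the recommended load. The nearest-rank percentile used here is one common convention (an assumption, not something the thread specifies), and the function names are hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample covering pct% of all samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so integer percentages give exact ranks.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_report(samples_ms):
    """Summarize per-request response times (milliseconds) from an overload run."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }
```

Recording time to first byte as well as total page time would capture the "page just pops in" versus "see it coming in" difference; the same functions work on either list.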
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
If you knew our operations guy, you would test resistance to physical attacks.
Pictures. I like hot chicks standing next to big servers. Big servers in action shots are good too.
I think the number one question we would all like asked is... can you run Linux on it?
I would like to see a review of a high speed, stackable gigabit managed switch.
Real-world numbers from some industry-standard benchmarks would be good. You can get TPC-C and SPECint results from most vendors, but those are run after weeks of tuning by their internal experts.
I would like to see what they get in a regular user's hands.
Phil
I guess today is a passable day to die.
Bossman Compatibility: Verifies that the hardware vendor has taken my boss's boss out to dinner and purchased suitably expensive drinks. Rating based on the number of stars the restaurant received, although points may be docked if the filet mignon was a little overdone. This one is related to the...
CYA Verification: Vendor must have a name recognizable to people who read periodicals such as "CTO Magazine" so, when it breaks down, I can say "who ever heard of XVY Company's gear being bad?" If the vendor is a company like Dell which also sells home PCs, this metric should also include going to my boss's boss's house and verifying that his Dell is running okay so I don't have to hear shit like "I don't know why we got Dell, my desktop at home has problems all the time, too, and it's only six years old!"
Sweetness Factor: Not as much of a factor as it once was, depending on how big of iron we're talking about. But if the thing, say, requires a cooling tower that happens to have a waterfall built into it, that's a point right there. May conflict with....
The Under-Desk Operation Profile: Since it'd take at least a month and a dozen SRs and books of useless paperwork just to get the beastie screwed into a rack at our NOC, the server must both fit nicely under the desk in my cube with all the other machines and not be too loud. Generation of excess heat is a plus since the facilities people have set 61 degrees as a reasonable temperature for my office in the winter.
Extra-App Capacity Testing: For when some moron in another department comes in and convinces my boss's boss that "all that server is doing is running the backend for our entire operation, so can we put our incredibly messy half-working app on it too and treat it like QA?" If this server can alert a Terminator unit to go to the aforementioned coworker's home in the middle of the night and slay him and his family, this requirement can be waived (oh, I wait for the day this will be waived....)
I'm sure there are a few other benchmarks you could run, but honestly these are the Big Five that I decide on.
Every year during my review, I just pray the words "slashdot.org" aren't mentioned.
Make sure it has a good number of phaser arrays and photon torpedo banks.
be used to build a beowulf cluster...
Slackware
What do you as an IT professional want to read in a review for a server OS
Security, scalability, robustness. Use your common sense when thinking of a definition of those terms.
or a high-speed switch
Managed how? (ssh, telnet, www?) Any layer 3 routing or filtering capabilities? Packets per second? Backplane speed?
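For the packets-per-second and backplane questions, a quick sanity check on vendor numbers could look like this Python sketch. It assumes standard Ethernet framing (8 bytes of preamble/SFD plus a 12-byte interframe gap on the wire) and the common vendor convention of counting both directions when quoting switching capacity; the function names are hypothetical.

```python
PREAMBLE_AND_SFD_BYTES = 8   # on-wire overhead before each frame
INTERFRAME_GAP_BYTES = 12    # minimum idle time between frames

def max_frames_per_second(link_bps, frame_bytes=64):
    """Theoretical Ethernet line rate in frames/s for a given frame size."""
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps // wire_bits

def required_backplane_bps(ports, link_bps):
    """Non-blocking fabric: every port sending and receiving at line rate."""
    return ports * link_bps * 2
```

A gigabit port tops out around 1.49 million minimum-size frames per second, so a switch that claims "wire speed" should forward that rate on every port at once.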
or a big iron server or proprietary workstation?
For our research we're phasing out our SGI machines and going to free OSs (Linux cluster, Free OSs on the desktop) so that's not applicable here.
What tests should we run? What results and feature comparisons are going to be most meaningful to you?
Doom 3 framerate?
Trolling is a art,
when i'm choosing a big iron, i try to find one which can get the big creases out of my big pants
As an IBM employee, I want to see brain-washingly favorable reviews of IBM hardware. Especially the ones that will make me money. :-)
Cheers,
Matt
Terrorist, bomb, al Qaeda, nuclear, yellowcake, kill, assassinate. Carnivore is dead... long live Echelon.
Can the system be expanded without rebooting? Can you manage it using computer operators you wouldn't trust to determine which end of a mop should be applied to the floor?
In many a large setting, a big concern is "does it play nice with XYZ?" [Insert cliche about a certain hardware manufacturer that set the "random" Ethernet retry window to minimum, rather than minimum+random, to achieve better performance for its cards, intentionally mucking with interframe spacing....] XYZ is going to be a specific app or another (hardware) product. If the apps are internal (as some of ours are), then you can't help us -- but there are some fairly customizable out-of-the-box apps that you could test against....
Basically, none of these purchases happen in a vacuum. The merits of the technology matter, but "playing nice" is a dealbreaker. If this causes ANYTHING to break, forget it for now. Et cetera.
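For context on the retry-window cliche above, this Python sketch shows standard truncated binary exponential backoff, the randomized wait a compliant Ethernet card uses after a collision. A card that skips the random component would always retransmit first and starve well-behaved neighbours. The 512-bit slot time applies to 10/100 Mbit/s Ethernet (gigabit half duplex differs), and the function names are hypothetical.

```python
import random

SLOT_TIME_BITS = 512   # slot time for 10/100 Mbit/s Ethernet, in bit times
MAX_ATTEMPTS = 16      # the frame is discarded after 16 collisions
BACKOFF_LIMIT = 10     # the backoff exponent stops growing after 10 collisions

def backoff_slots(collisions, rng=random):
    """Pick a random wait, in slot times, after the nth collision (n >= 1)."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_ATTEMPTS:
        raise RuntimeError("excessive collisions, frame dropped")
    k = min(collisions, BACKOFF_LIMIT)
    return rng.randint(0, 2 ** k - 1)

def backoff_delay_us(slots, link_mbps):
    """Convert a backoff in slot times to microseconds at the given link speed."""
    return slots * SLOT_TIME_BITS / link_mbps
```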
When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
Can it survive a good /.ing?
In Soviet Corea, big irons you!
Please tell us if there are any stupidities in installing, running or backing up the software (or software components) related to copy protection. If the company does not respect the paying user, then I have no respect for the company and won't buy their product.
I'd be interested in how well it works after each of the following, both while the system is on and while it is off:
Coffee spilt in one of the CPU PSUs.
Coffee spilt on the keyboard (if present).
Coffee spilt in one of the disk system PSUs.
Swapping two of the disks in a pack...
More seriously, it would be handy to know the ratio of workload handled to watts consumed. Workload:cooling required would also be handy.
Phil
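The workload:watts and workload:cooling ratios Phil asks for could be computed like this minimal Python sketch. It assumes you measured throughput (in whatever unit fits your workload) and real power draw yourself, and uses the standard conversion of about 3.412 BTU/h of heat per watt; the system names and function names are hypothetical.

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # heat output per watt of electrical draw

def ops_per_watt(ops_per_second, measured_watts):
    """Workload handled per watt consumed (higher is better)."""
    if measured_watts <= 0:
        raise ValueError("measured_watts must be positive")
    return ops_per_second / measured_watts

def ops_per_btu_hour(ops_per_second, measured_watts):
    """Workload handled per BTU/h of heat the cooling system must remove."""
    return ops_per_watt(ops_per_second, measured_watts) / WATTS_TO_BTU_PER_HOUR

def rank_by_efficiency(systems):
    """systems maps a name to (ops_per_second, measured_watts); best first."""
    return sorted(systems, key=lambda name: ops_per_watt(*systems[name]), reverse=True)
```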
Who asked for it, and, more importantly, did anyone pay for it, either directly or indirectly?
Help fight continental drift.
Kickbacks.
But I already know that "enterprise" software is going to require me to do far too much work to get something not exactly what I need for far far far more than it's worth.
I have had problems in the past looking at various hardware and comparing the true costs of it, especially support. With the third-party support companies out there (we use Terix, amongst others), there are so many options, and with yearly support contracts in excess of $100,000 for our relatively small company, miscalculating these in a recommendation can be a very big deal.
Just my $.02... oh, also just plain reviews of support companies on different hardware would be good also.
"If voting could really change things, it would be illegal. " - Revolution Books, NY
A good cost analysis is worth a lot. Say you look at a new and shiny server system; it has the latest OS, servers, and features. But what is that worth?
If this "new" server is 5X more expensive (as a package) than another system that gives you the same functionality and comparable performance, then knowing that the alternative exists, and what the price/performance difference is, would be valuable.
:)
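A cost analysis along these lines, folding in the yearly support contracts mentioned earlier in the thread, might start from this Python sketch. The performance figure is assumed to be something you measured on your own workload, and every number in the usage check is hypothetical.

```python
def total_cost(purchase, yearly_support, years):
    """Purchase price plus support over the expected service life."""
    return purchase + yearly_support * years

def cost_per_unit_performance(purchase, yearly_support, years, performance):
    """Lifetime cost divided by measured performance (lower is better).

    performance is any throughput figure you measured yourself, e.g.
    transactions per second on your own workload.
    """
    if performance <= 0:
        raise ValueError("performance must be positive")
    return total_cost(purchase, yearly_support, years) / performance
```

Comparing two candidates over the same service life (the thread suggests 4-10 years for big iron) keeps a cheap box with an expensive support contract from looking better than it is.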
I've worked with too many companies whose products *do not* scale the way they claim, or whose products will technically scale but are at that point virtually useless. Use bogus data, who cares, but test the data volume, throughput, storage, archival, etc. to the limits and make sure the product is still useful. This is the single biggest problem I've had with enterprise installations, and the problem as an architect is that it's difficult to test on a very tight timeline for product evaluation. I've had egg on my face more than once because I had to take the vendor's word for it.
Second, install the application yourself. Don't let the vendor do it for you. And when you install it, install it as an enterprise would. That is, if it's an n-tier application, or has multiple components, don't take the "default" installation and put all of the components on one system. Of course this will work. Try distributing the components over multiple systems like an enterprise would. Often this is where the complexity comes in and products falter.
One company I worked for purchased some software from Tivoli. After 6 months, and a team of engineers onsite from the vendor, they still couldn't get the components to talk for more than a day without problems (after weeks of installation), and still couldn't get useful data out of the database due to its size, so we took our $500mil back and bought something else. Having an evaluation that would've tested this would've saved us a bundle.
akad0nric0
This sentence no verb.
I'd expect such a review to compare the two.
...IN SOVIET KOREA, old people iron YOU!
As someone who has to build, integrate, and then deliver systems to other people's server floors, I have some things that would be nice to know. How much power does the thing ACTUALLY use? Not what the manual says, but real-world usage (all you need is a clamp ammeter and a split extension cord). This test helps us determine power requirements if we deliver 100 of these, as well as cooling requirements.
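Turning one clamp-meter reading into power and cooling numbers for a 100-unit delivery could look like this Python sketch. The power factor defaults to 1.0 (an assumption; many modern supplies with power factor correction are close to it, but measure if you can), and the 80% circuit derating is a common rule of thumb for continuous loads, so check your local electrical code. Function names are hypothetical.

```python
import math

WATTS_TO_BTU_PER_HOUR = 3.412  # essentially all power drawn ends up as heat

def measured_watts(amps, volts, power_factor=1.0):
    """Real power from a clamp-meter current reading on one unit."""
    return amps * volts * power_factor

def fleet_requirements(amps_per_unit, volts, units, power_factor=1.0,
                       circuit_amps=20, derate=0.8):
    """Scale one measured unit up to a whole delivery."""
    total_watts = measured_watts(amps_per_unit, volts, power_factor) * units
    total_amps = amps_per_unit * units
    usable_amps_per_circuit = circuit_amps * derate
    return {
        "total_watts": total_watts,
        "cooling_btu_per_hour": total_watts * WATTS_TO_BTU_PER_HOUR,
        "circuits_needed": math.ceil(total_amps / usable_amps_per_circuit),
    }
```

Measure under a realistic load, not at idle; the difference between the two readings is part of what the review should report.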
My opinion is that you can only truly understand gear when it's failing: show me what happens when you cram 100Mbps down a T1 interface - do packets drop, or does the router punk out? Show me what happens when, while you're doing that, you hot-swap two of the other cards. What happens when you do a processor-fail-over while under load?
Also, attack the box - look for what's listening, and pound it with every known security vulnerability. Tell me whether passwords are stored as one-way hashes ONLY, and what the password recovery procedures are. Does SSH cost extra?
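What "one-way hashes ONLY" means in practice is sketched below in Python: the box stores a salt and a derived digest, so it can verify a password but never recover it, and "password recovery" has to be a reset. A device that can show or email you your old password fails this test. PBKDF2 via the standard library is an illustrative choice (real appliances may use crypt, bcrypt, or something else), and the iteration count is an assumption to tune for your hardware.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor, not a recommendation

def hash_password(password, salt=None):
    """Return (salt, digest); only these are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, stored_digest):
    """Recompute the digest and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_digest)
```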
That's the type of testing I'd like to see.
Thanks for asking,
Need Geek Rock? Try The Franchise!
Frame rate when playing Doom3?
How many shells does it hold, and is it a revolver or clip. Does it come with a silencer to avoid pesky street light cameras?
God spoke to me.
** For a server OS
How easy is it to install? How easy is it to upgrade? If it's a different architecture (i.e., Windows, Linux, Mac), how easy is it to migrate big programs (Exchange, databases) from one to another? How well does it gel with existing servers? Do they recognize one another? Do they acknowledge each other? Can they fit into existing Active Directory-type listings effectively?
Most, if not all, shops are not created overnight. They are built on mistakes or tried-and-true methods that are (usually) quickly outdated. The problems arise when you try to "fix" the existing problems by bringing in more robust OSes and capabilities. The meshing of these matters more to network admins than tales of how well this server did on a single machine in a non-network environment.
** High-speed switch
Does it scale (how easy is it to add one to five or more on a single chain)? How is the admin interface? Is it web-based? Console (i.e., serial port) based? Does it have both, in case the console is all that's available? Can you break it or overrun it with traffic?
** Big iron server or proprietary workstation?
Someone else has mentioned scale, so let me throw in something different: how easy is it to recover? Does it have RAID? (Well, it obviously should.) Break it, remove a disk and see if you can recover from it easily. "Lose" a driver and see how quickly you can recover.
Something I'd love to see is a review that includes a call to the tech support for that server. Don't tell them you're a reviewer, just tell them you've got a problem. See how quickly they respond, how informative they are, and how far it has to go before they call in reinforcements (i.e., higher-level support). Will they send on-site repair? If so, how long did you have to troubleshoot before they decided to? Sometimes a card or part will break, and front-line support will make you bleed through their ignorant manuals step by step when it's clear that Piece A is broken and needs an on-site tech with experience on that hardware to come and replace it.
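The "remove a disk and recover" test works because of parity. This Python sketch shows the idea for a single RAID-5-style stripe: the parity block is the XOR of the data blocks, so any one missing block is the XOR of the survivors. It is a simplification (real controllers rotate parity across disks and rebuild stripe by stripe, which is where rebuild time and exposure to a second failure show up), and the function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks followed by their parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild_missing(stripe, missing_index):
    """Recover the block at missing_index from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)
```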
** What tests should we run?
Stress, along with installing/upgrading hardware.
** What results and feature comparisons are going to be most meaningful to you?
Thinking back over my dealings with big iron hardware while writing this comment, I believe comparisons of tech support, its informativeness and its responsiveness, are something that can immediately be added to the review process.
Something more long-term would be how long did the server run before downtime, problems, burnouts, or hardware failures.
+ doom fps
+ Gentoo compile time
+ Overclocking possibilities
+ Case mods, preferably with blue neon lights
10 ?"Hello World" life was simple then
Reviews for this sort of equipment are pretty much meaningless. I might buy a 16-way server to run Oracle; you might buy the same system to run large-scale data analysis. PCs are easy to review and evaluate; they are a commodity and can be used for any of multiple purposes. When I buy a large SMP system, I am buying it for a specific purpose, and the chances are it will never be repurposed. So before spending uberbucks on a system, I want to talk to the vendor's other customers who are running similar workloads on the same tin. If the vendor gives me a long list of folks who use their systems for similar applications, that is usually a good sign; if they can't, then I move on.
Large-scale SMP systems require a slightly different mentality than PC systems, as anyone who has managed a P690 or E10k will attest. You expect performance, you expect reliability, you expect service, and for what you pay you'd better get it!
*narf!*
Can I pick up chicks with it?
I say we just grow up, be adults and die.
First, how reliable is the hardware, both in terms of MTTF (Mean Time To Failure) and MTTR (Mean Time To Repair)? To what extent can the hardware "self repair"? For example, we have a system which has redundant "hot spares" for the CPUs, memory, and power. If a CPU fails, the hardware, independent of the software, can usually recover the work in progress on that CPU to the "hot spare". The memory subsystem is constantly checking itself. If a memory bank is "weak" or reporting temporary errors, the data in that bank is copied to a spare bank and the spare bank is switched to active, again without any help from the software. Is the I/O redundant? Again, on this system, there are multiple "paths" to the disk subsystem. If one "path" fails, the OS (not the hardware, in this case) will redrive the I/O on another "path". The application is not impacted, other than a slightly longer I/O time.
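MTTF and MTTR combine into steady-state availability as MTTF / (MTTF + MTTR), and redundant hot spares change the math as in this Python sketch. The redundancy formula assumes independent failures and perfect failover, which is optimistic; a review that actually pulls a CPU or a path under load tests exactly that assumption. Function names are hypothetical.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state fraction of time the component is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def redundant(avail, copies):
    """N redundant parts where any one suffices (independent failures assumed)."""
    return 1 - (1 - avail) ** copies

def series(*avails):
    """All parts must be up, e.g. a single point of failure in the path."""
    result = 1.0
    for a in avails:
        result *= a
    return result
```

For example, two redundant paths at 99% each give roughly 99.99%, but putting a single 99% controller in series with them drags the whole path back below 99%.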
How reliable is the OS and other major software components such as the database software? How long can it run between reboots? What causes the reboots that are done?
How secure is the OS and other major software components such as the database? Not only from viruses, worms, and "hackers", but from errant applications? If an application "goes wild", can it cause another application or the OS to terminate?
How auditable is the system? Things such as who accessed the system, when, and what did they do? This is likely a combination of things in the OS, the database software and other "OS level" software. It should not depend in any way on the application itself.
How easy is it to debug problems? As an example, you have something that runs at 2 a.m. to produce an overnight report. This something has a problem. How easy is it for a programmer to determine why the application failed? How fast can it be fixed?
Those are a few of my major concerns.
Nothing is more annoying than buying a big frame and then finding out that a silly little piece of software is no longer maintained. Or, as HP announced today, they are once again changing their HP-UX roadmap, proving once again that they can't be taken seriously if they predict anything further out than 3 months.
It all comes down to the simple fact that, in the end, almost all of the big boxes are the same to the application. Sure, some have hard and some soft partitioning. Sure, you have different CPUs, memory latencies and whatever - to the app it is just a bunch of system calls. But in the end, if you can't run your app on it, it's useless, no matter how fast, redundant or whatever it is. We have completely moved away from selecting the box by its hardware properties. They are all sufficiently redundant and whatever. We go purely by how well the software we need to run is supported on the OS and whether they have a roadmap that can be trusted.
Peter.
2. Assembly : Standard 19" rack? Power needs? Built-in interfaces? Recommended operating environment?
3. Configuration : Specs, options, adheres to open standards? Why is it more/less expensive?
4. Security/Reliability : What to use for management (ssh, telnet, serial console, proprietary menu system). How secure can you make it? What does it do when maxed out?
5. Maintenance : What parts are user replaceable or which parts can the user order without a certification on the product or shipping it off?
6. Tech support : What plans are available and how much do they cost?
7. Can it do what they say it can and how gracefully does it bow out when overloaded?
8. Was it built with common sense in mind?
Feel free to add more...
If you're half as beautiful naked, you'd be 4 times as beautiful with twice as many clothes on.
Break it. Call support. See if you can understand what they are saying. See if they can understand what you are saying. See if they can understand what is wrong, or if they lead you through meaningless troubleshooting steps. See how long it takes for someone to show up to fix it. See whether they can actually get it fixed.
The rest of it is not all that important, really.
I have seen the future, and it is inconvenient.
Do this one last. Pour a gallon of steamy coffee on the appliance while it's operating. If it survives the procedure, give them props in the review. Otherwise, attempt to get technical support for the symptoms. This is not so much to evaluate the potential synergy of electronics and coffee but rather to gauge the support. Mind you, I don't get to deal with a lot of "big iron" things, but the question of support seems to be a very important as well as a very intangible one. All you can go on, really, is testimonials from the web (which, for all you know, are bogus), recommendations, and some perceived reputation of the vendor, but those only cover the extremes, because the people at the extreme ends of (dis)pleasure are the ones most likely to say something. I'm not aware of any systematic way of rating the average user experience. Here's a story: one client has Dells and is thrilled with them, and gets good service all the time, no matter what's going on. Someone else with Dell allegedly had their service contract terminated (they were refunded the money, but still) because the replacement part they needed was more expensive than the support plan. This is just anecdotal evidence - how am I to reconcile things like this into an overall opinion of their service? Maybe reviews should also indicate where the support centre is located or outsourced to, so you can be arbitrarily patriotic (or try to impose sanctions) with your purchases.
Even as you read this, your pants are strangling your loins! Aaa!
This is going to be harsh, but you need to hear it.
Obviously it is not possible to build an enterprise-grade 'your neck is on the line' production environment just for writing reviews
In order for the review to be accurate, that's how it has to be tested. Evaluating enterprise equipment in a non-enterprise environment with people who have no enterprise experience is pretty much worthless...and you're not going to fool anyone.
There's also no market for this sort of thing. Equipment on that level is bought because of high level executive briefings, price negotiations, migration options, and politics. Why? Because the market is so cutthroat and all the features that matter are there. The decisions are not made on whether or not a power cord was included, it was easy to unpack, the manuals were clear, how well built it looks, and how it did on SysMark SuperServerSimulator 2005...which is about the only thing all you 2-guys-with-a-webserver "hardware review" sites know how to do.
Further, when a hardware vendor wants to get a contract, they often provide a unit for evaluation.
On top of that, the major analyst firms already fill what little niche there is, and they have really big names 90% of the important people with Nice Shoes will recognize, which means even if that analyst is wrong, the decision to go with their recommendation is justifiable and won't get the Nice Shoes person fired. You'd be lucky if .01% recognized your name, much less trusted it. "Jones! Why does our website keep crashing?" "Well, we're having a lot of hardware problems." "Why did we go with ABC for our servers?" "Oh, XYZhardware.com said they were the best." "Jones, clean out your desk."
So...sorry, there's no market for what you're trying to do, and you don't have the means to do it.
Please help metamoderate.
I want to know that if all goes wrong, I can switch out for new equipment simply. Inevitably I will have to upgrade my hardware and software, I want to know that the next product to come out will be a seamless transition.
Much like other postings, I want to know how well can it handle change?
CS: It is all sink or swim...oh and did I mention there are sharks in that water?
If the system is from a lesser-known manufacturer then I would like to know some background information to help evaluate the company and help tell if they will be around for the lifespan of the product. If it is from a well-known company, I would like to know the level of commitment they have for the product line and if there has been talk about phasing the product line out. I know it is impossible to predict this completely but there may be some signals if there is a lack of longevity/commitment.
I have very specific requirements when it comes to big dog servers, and I just bought two more today.
1. Does the hardware vendor support Linux or just pay lip service to it?
2. Can I get it without an OS loaded, or can I get it preloaded? If it so much as comes with an OEM Windows CD in the box, I will ship it back.
3. Have they pissed off the community lately?
I just bought over 40k in servers today, and guess who did not see a penny of that? DELL
Got Code?
With the advent of TCP-level load balancing and whatnot, speed is less of an issue than keeping the damn thing up in the first place. So, as a result, I like to see hot-swappable everything. Not just power supplies, not just hard drives, but VME cards, bus cards, and even CPUs in symmetric systems would be a big plus.
When I hear "Big Iron" I think mainframes.
In particular, big IBM mainframes (s/3x0) running something like MVS (maybe VM at a push).
Anyone else think the term "Big Iron" is used inappropriately to describe a bunch of piddling little boxes that don't even need an air-conditioned datacenter equipped with an automatic Halon fire extinguishing system?
So they claim it, but does it work?
Reliability: The quality or state of being reliable
Is the system built using good design methodologies, and practices?
Quality components?
Availability: The quality or state of being available
Does the system have many single points of failure?
Are those points truly susceptible?
Servicability: The quality or state of being serviceable
Can I change broken parts without incurring an outage?
Can I add/remove/change without incurring an outage?
When we spend hundreds of thousands to millions of dollars on equipment, it had better run, and be fixable without the system having to be stopped and incurring an outage!
Test those things, prove they either work or don't, and the best is what will be bought.
Simple one-day, weekend, or even weeklong reviews are meaningless in the corporate IT environment. Hell, the merits of any particular vendor's gear aren't truly relevant either. I've worked in an institutional IT environment and a corporate one, and this is how purchasing works:
1. Requirements solicitation - figure out what needs we need to fill, be it wifi net access, a file server, etc
2. Vendor research - contact the usual suspects in the field (networking, big iron servers, etc) and arrange for consultation and formal bids to be made. NOTE: this step is skipped ENTIRELY if the company/institution already has a corporate account with a vendor that provides the appropriate services that you require.
3. Formal bidding process - pit the vendors against each other; it's fun when you get them onsite to demo their gear. Generally vendors will lower prices to sweeten their bid.
4. Award the contract to one of the vendors, or (more likely) have funding denied to you by the beancounters and end up doing a half-assed implementation of what any of the vendors was going to do.
Individual machine or software reviews are a *tiny* part of the process for securing enterprise level hardware/software services.
------- "From bored to fanboy in 3.8 asian girls" ----------
Develop a standard app on Tomcat that you can use to test. Develop a JMeter test set for that app (I'm thinking a shopping cart or some other transaction-based application). Ramp JMeter from relatively few concurrent users up to the infrastructure's breaking point. Figure out why it breaks. Tell us.
The beauty is that JMeter can run a pretty good stress test with a small farm of clients, and Tomcat is the reference implementation for the servlet and JSP parts of J2EE. You're not just testing Tomcat, though; you're testing the backend database as well, and how well the box/OS deals with Java issues. This sort of test should be fairly portable, so it will work for most hardware/OS combinations.
Big Iron is about scalability and reliability. You can't really test reliability, though you could call support and claim a part is broken and see what support is like. In which case I'd say call three different times with three different theoretical issues and average the response. Scalability, though, is where it's at as far as the machine/OS is concerned.
Now, all that said, I don't doubt that SPEC already has a test that does just that (scalability of J2EE apps), but you might get something from developing your own test.
What if it is just turtles all the way down?
Okay, I'm mainly saying this because we've had so many server failures at work in the last month. What features does the machine have to make it fault tolerant? Can they actually be demonstrated as part of the review? Do the automatic failovers actually operate? Will they operate properly if the server crashes under a full load?
It's good to use your head, but not as a battering ram.
As an IT professional, what I'm really looking for in Big Iron reviews are reviewers who actually have not just a small glimmering of what they are talking about but at least a moderate amount of clue. Like, say, knowing the difference between LPARs and domains. Having actually used large systems would of course be nice, but I suspect that's far too high an expectation...
Wouldn't the audience you seek just judge for themselves what their requirements are?
Is there a consumer brand now addressing 'mission critical'? If so, we have to find a new way to emphasise the 'mission critical' phrase to avoid the wave of 99.999999% claims from everybody selling 10k hardware.
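To see why a 99.999999% claim on commodity hardware is suspect, convert it into allowed downtime per year. This Python sketch assumes an average 365.25-day year; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average-length year

def allowed_downtime_seconds(availability_percent):
    """Maximum downtime per year implied by an availability claim."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR * 60

def nines(count):
    """99.9 for 3 nines, 99.999 for 5 nines, and so on."""
    return 100 * (1 - 10 ** -count)
```

Five nines allows about 5.3 minutes a year; eight nines allows about a third of a second, which is less than a single reboot.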
What about kick-ass database management systems like Model 204, that run on IBM mainframes or plug-compatibles?
1. How much redundancy is available?
a. Are there multiple fans or fan trays?
b. Are there multiple power supplies?
i. How many are needed to power the system?
ii. Can they be powered on and off individually?
c. Are there multiple CPUs?
i. Can they fail independently, without an outage?
ii. Can they be partitioned or dedicated?
d. How about multiple storage controllers?
2. How maintainable is it?
a. Hot-swapability
i. CPUs?
ii. Fans?
iii. Power supplies?
b. Manufacturer longevity
c. Product line stability
d. Off-the-shelf parts?
3. Physical specs
a. It's gotta be rack-mountable, right?
b. How many U high?
c. How deep?
d. Are there pluggy bits on the front, back, both?
e. How much does it weigh?
f. How bloody annoying are the rack rails?
g. Can you open and close it with things mounted directly above and below?
h. Can you swap out any and all parts without unracking?
i. How much heat does it generate?
j. How much power does it require?
k. Is there a maximum rack density specified?
l. Is it loud enough for OSHA to require ear plugs?
4. Expandability
a. How many net ports minimum/maximum?
b. What kind of net ports can it have?
c. How many storage thingies (hard drives, etc)?
d. Is there an upgrade path for the CPU(s)?
5. Serviceability?
a. Is there a "lights out" management board available?
b. Does it require dedicated management software?
c. Does it support SNMP?
i. Standard MIBs?
ii. Custom MIB(s)?
iii. Can it send traps?
d. Are you forced to connect a monitor/keyboard?
e. Is it supported by the obnoxious management/monitoring software of my choice?
6. Miscellaneous
a. Can it run Linux?
b. Does it force me to run Microsoft software?
c. Ok then, what the hell O/S does it run?
d. Can I have the source?
e. Please?
f. There's no SCO crap in there, right?
g. If I fill a whole rack with them, will it impress the chicks?
h. Ok, then how do I impress chicks?
i. What the hell's a chick, anyway?
I'm sure I've left out a ton of stuff, but those are some quick thoughts.
Remember karma isn't REALLY an integer ranging from -1 to +5
Test support. Tweak an obscure setting beyond reason, load the machine beyond its capacity, then ask the vendor to send one of their support engineers. Insist it's not a top-tier person but the ordinary support engineer a customer would run into.
Evaluate how many engineers were sent, how long it took them to find the problem, how quickly they managed to resolve it, and whether the recommendations they presented to prevent the problem from occurring again made sense.
Because when it comes to BIG IRON, the LAST THING YOU WANT IS YOUR NECK ON THE LINE.
I prefer the small, transportable kind that fit in a suitcase. The big ones do a much better job on my shirts, though.
The important thing is the Big Iron Boards. The folding kind are the best, but the ones that drop out of the wall are a close second.
It's a good thing I reread the story, because at first I thought it was about golf clubs. How silly that would have been!
1. Rollout.
2. Administration.
3. Upgradeability.
(All of this, and 1000 more things, are summed up by vendors with this magic word, "scalable"):
I help admin a national network and you're right, it's often hard to know how well a product is going to perform until after the fact. Testing only goes so far, but of all the concerns I've dealt with these are the three I try to stick to.
"All great things are simple & expressed in a single word: freedom, justice, honor, duty, mercy, hope." --Churchill
I learned something about hardware from a simple worm. I wrote an article on it and I'll re-post it here. Additional comments are at the bottom:
-----
Blaster was a worm, and of worms in general I would say that there is little new to be learned from them. They simply exploit holes that haven't been patched in vulnerable software from Microsoft. The security community continues to lambaste Microsoft regarding their alleged push toward making security their #1 priority, which actually comes in second place - after profits of course.
I did learn something new with Blaster, though. I have a very good friend who works for a large ISP. They have a number of Cisco 12000 series GSR routers as well as Foundry BigIron switches. For those who are not familiar with the Cisco 12000 series routers, suffice it to say that it is Cisco's biggest, baddest router: it stands up to 6 feet tall and comes from the factory with a 4-barrel carburetor, dual testosterone modules and a custom paint job with flames painted on the side (pin stripes are optional). These boxes are designed to handle hundreds of gigs of traffic across their backplane and through their interfaces. If the ISP had been forewarned that they would be seeing 300 Mbps of traffic from the MS Blaster worm, they would have said "Bring it on!"
For those of us who aren't CCIEs, Cisco routers and Layer 3 switches have a function called CEF, or Cisco Express Forwarding. By its simplest definition, CEF is a technology that caches routes.
If a packet from my computer is destined for yahoo.com, my computer will first query a DNS server to resolve the host name to its IP address. It will then send packets to my ISP with the destination IP of yahoo.com (66.218.71.198). My ISP's router, presuming it's a Cisco router with CEF enabled, will look at its Internet BGP tables and determine the optimal route my packet should take across the Internet to arrive at that destination. Once the router has processed the route, it caches it so that all future packets coming from my home IP address and destined for yahoo.com will automatically be routed using the cached route. This takes a tremendous load off the router CPU, as each packet no longer needs to be processed by the CPU; hence the term "Express Forwarding".
What the Blaster worm did was send out hundreds of thousands of ICMP pings per second. This usually wouldn't be a problem for the router, except that each packet was destined for a unique IP address. Each route was looked up, routed, and stored in the cache for future packets, only there weren't any future packets. Next, the memory space allocated for caching CEF routes filled up, and once it was full, the router simply purged its cache, so every packet then had to go to the CPU to be routed. Once this happened, all hell broke loose.
CPU utilization on the routers jumped to 100%, which should never happen under normal conditions, but this was clearly not a normal condition, and the internet came to a crawl.
-------------
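The reposted article describes a route cache that is filled on demand and purged when it fills up. Here is a toy model of exactly that behavior, to show why a sweep of unique destinations defeats it while normal traffic barely touches the CPU. It is a simplified illustration under that assumption, not how any real router's forwarding engine is implemented, and the addresses and sizes are invented:

```python
import random


class DemandRouteCache:
    """Toy route cache: a miss costs a CPU lookup, then the route is cached.

    When the cache is full it is flushed entirely, mirroring the
    "router simply purged its cache" behavior described in the article.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: set[str] = set()
        self.slow_path = 0  # packets punted to the CPU for a full lookup
        self.fast_path = 0  # packets forwarded straight from the cache

    def forward(self, dst: str) -> None:
        if dst in self.entries:
            self.fast_path += 1
            return
        self.slow_path += 1           # full route lookup on the CPU
        if len(self.entries) >= self.capacity:
            self.entries.clear()      # cache full: purge everything
        self.entries.add(dst)


def run(destinations: list[str], capacity: int = 1000) -> float:
    """Return the fraction of packets that had to take the slow path."""
    cache = DemandRouteCache(capacity)
    for dst in destinations:
        cache.forward(dst)
    return cache.slow_path / (cache.slow_path + cache.fast_path)


if __name__ == "__main__":
    rng = random.Random(42)
    # Normal traffic: many packets to a few hundred popular destinations.
    popular = [f"10.0.{i // 256}.{i % 256}" for i in range(300)]
    normal = [rng.choice(popular) for _ in range(100_000)]
    # Worm-style sweep: essentially every packet goes to a new address.
    sweep = [f"{rng.randrange(1, 224)}.{rng.randrange(256)}."
             f"{rng.randrange(256)}.{rng.randrange(256)}" for _ in range(100_000)]
    print(f"normal traffic: {run(normal):.1%} slow path")
    print(f"address sweep:  {run(sweep):.1%} slow path")
```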
As you can see, worms can push hardware beyond what it was designed for, and now ISPs are wondering what level of testing goes into these devices to see whether they can withstand severe abuse that falls outside their design parameters.
Hope this helps,
Joel
Good security is based upon reality and common sense. Common sense is a function of having common knowledge.
haff life two, duh!
'What Do You Look For in a Big Iron Review?'
Whether it has a de-scaler and how long it takes to boil.
I always ask for an FMEA - Failure Mode and Effects Analysis - for typical and HA deployments. Big, expensive equipment tends to fail in big, expensive ways, and I want to know all the ways it can fail, all the potential effects of those failures, and what impact they have on my enterprise. Then, I want to know the recommended mechanisms and patterns that can be employed to minimize failure impact.
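FMEA practice commonly ranks failure modes by a Risk Priority Number, the product of severity, occurrence and detection ratings, each usually scored from 1 to 10. Here is a small sketch of that ranking. The components and scores are invented purely for illustration, and a vendor's own FMEA may use different scales:

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    mode: str
    severity: int    # 1 = negligible impact ... 10 = catastrophic
    occurrence: int  # 1 = very unlikely ... 10 = almost certain
    detection: int   # 1 = always caught before impact ... 10 = undetectable

    def __post_init__(self) -> None:
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank(modes: list[FailureMode]) -> list[FailureMode]:
    """Order failure modes from highest to lowest risk."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


if __name__ == "__main__":
    # Hypothetical ratings for illustration only.
    modes = [
        FailureMode("PSU", "single supply fails", 3, 4, 2),
        FailureMode("Backplane", "shared backplane fault", 9, 2, 6),
        FailureMode("Firmware", "failover hangs under load", 8, 3, 8),
    ]
    for m in rank(modes):
        print(f"{m.rpn:4d}  {m.component}: {m.mode}")
```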
visualisations of a Beowulf cluster of $big_iron$
When features are the same or similar from vendor to vendor, their support organization can be the deciding factor. When reviewing big iron, break something on purpose and call the support line. The review should definitely include how that went.
Testing the support system simulates the "your neck is on the line" environment without much infrastructure cost expenditure. It is definitely very valuable information for those trying to narrow down the field. I know I wouldn't consider buying even the best whirly-gig in the world if I can't get it fixed quickly when it's b0rken.
-Rusty
The Master (Angelo Rossitto) in Mad Max Beyond Thunderdome, "Not shit, energy!"
Oh, and if you can add a bit of reliability too, that would be nice.
KFG
Some big-iron-type products out there require some kind of proprietary infrastructure to be built around them in order to take advantage of all of their special features. Examples would be requirements of particular web browsers or browser plug-ins, particular network services, etc.
In the case of software, do they require certain files to be in odd, non-configurable locations in a file system? Does the software make good use of the services of the OS (syslog, inetd, /etc/init.d, DNS, LDAP)?
I always look for "lowest common denominator" access, i.e. will I be able to manage this hardware or software without too much pain using a CLI over a low-bandwidth connection.
Also, how fussy will the vendor be with regard to self-maintenance? Do they have intelligent people on their support lines who will listen to my opinions, or are they script-reading drones who just want to point a finger at another vendor and get you off the phone? Will they even let you change the hardware or software configuration without a tech being there?
Here's something I'd like to see in a big iron review:
* Are the prices openly available?
* If not, can I get them via email, phone, or fax?
* How many phone calls to a sales guy does it take to get a price list?
* You mean he wants to fly out to discuss pricing?
* How much cheaper is my buddy at SavvyCorp able to buy it for, since he knows the right guy to haggle with?
What I mean is, when I have home stuff or testing equipment or generally anything that's "for play", my requirements are erratic, and usually any single requirement can be overcome if the product has a certain "coolness" factor. But that's not what you're talking about anyway.
When you're talking about being on the job, my concerns are often not about benchmarks or anything of the sort. I'm not interested in "I've tried it in some simulated ideal environment, so it should work great in the real world!" The questions are always "but what about a non-simulated environment?" and "what about a non-ideal environment?"
Speed and performance are important, but ultimately I want to know this: when you stick this machine in the rest of my network, which is held together with duct tape and the cat-in-the-jar, and give limited access to users who are going to use it for real-world purposes, and to users who are going to try to do things they shouldn't be doing, how does it work in that environment? If it remains secure, stable and working, that's what I want to hear. I want it to keep working with as little headache as possible, and I want the administration to be easy. If it's fast, that's gravy.
For that, I always rely on my own real-world experience, the real-world experience of geeks that I trust, and "word around the street" in places like /.
Do they even use "big iron" for web servers very much? Aren't they mostly SPARCs or vanilla Intel? File serving and number crunching would be the standard big iron usage, right?
"Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
For example, when you test something like a mainframe environment, you have to realise that it isn't designed to perform the same tasks that a PC, a workstation or a server is meant to perform.
A Mainframe looks like a dinosaur when you grade it by PC standards, but when you actually see what they do and what they are designed to do you quickly realise that no PC or PC cluster could be made to do the same things at anything close to a reasonable cost.
Take I/O operations, for instance.
Your standard PC PCI slots run at 33 MHz and are 32 bits wide. That means a PC has about 120-130 MB/s of bandwidth to move information from one device to another, give or take.
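The peak bandwidth of a parallel bus is just its width in bytes times its clock rate. The roughly 120-130 MB/s figure corresponds to the common 32-bit, 33 MHz PCI slot, and wider or faster slots scale linearly. Here is a small sketch of that arithmetic, assuming one transfer per clock as in conventional PCI:

```python
def peak_bus_bandwidth_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak transfer rate of a parallel bus in MB/s (10**6 bytes/s).

    Assumes one transfer per clock cycle, as in conventional PCI. Real-world
    throughput is lower because of arbitration and protocol overhead.
    """
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz  # MHz = 10**6 transfers/s -> MB/s


if __name__ == "__main__":
    for width, clock in [(32, 33.33), (32, 66.67), (64, 66.67)]:
        rate = peak_bus_bandwidth_mb_s(width, clock)
        print(f"{width}-bit @ {clock} MHz: {rate:.0f} MB/s")
```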
Now when you look at a mainframe environment, you notice that it's very distributed by comparison: a modern, top-of-the-line zSeries has a theoretical 26 terabytes' worth of I/O operations at its disposal.
That completely blows away anything in the PC, workstation or server world. There is no way you could build a PC cluster that does what a mainframe can do cost-effectively, reliably and with backward compatibility, and still be in the same price range.
So when testing computers, test them for what they were designed to do and the environment they were designed to operate in, and avoid making meaningless comparisons between things like a cluster of PC servers' aggregate SPEC CPU score and a mainframe's.
Or comparing the dollars per unit of CPU power of an Itanium processor vs. a PowerPC 970 (Mac G5) processor. It's mostly pointless and meaningless except for curiosity's sake.
A lot of studios and production houses shop around for render servers, so distributed rendering would be a great benchmark. There are free RenderMan-compliant renderers that people could benchmark with, not to mention many open source renderers out there.
Say it was a WebServer We don't care how many pages/second it can handle but how well we get the webpages when the system is maxed out.
Wouldn't this depend to a much larger degree on the software than on the hardware?
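Whichever layer turns out to be responsible, what the quoted comment asks a review to report is the response-time distribution under saturation, not the average or the pages per second. Here is a minimal sketch of summarizing response times with the nearest-rank percentile method. Collecting the samples (with whatever load generator the reviewer uses) is assumed to happen elsewhere, and the timings in the example are made up:

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of response times, for 0 < p <= 100."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples: list[float]) -> dict[str, float]:
    """Median plus tail latency; the tail is what users notice under load."""
    return {
        "median": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Hypothetical page load times in seconds with the system maxed out.
    timings = [0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 1.1, 2.5, 9.0, 300.0]
    print(summarize(timings))
```

Reporting p99 and max alongside the median distinguishes "every page is a bit slower" from "most pages are fine but some users wait five minutes", which is the distinction the original comment draws.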
3.243F6A8885A308D313
- Just how big is this big iron? Is it too big to fit on my ironing board? The device's physical footprint in its facility matters.
- What is its heat dissipation? Ironing boards aren't data centers, people: we want a high heat dissipation, to get the job done quickly.
- Does it have the industry's latest features? I want my big iron to have power management, in case I accidentally leave it plugged in.
These are just a few things I absolutely need to know about before I buy a big iron.
For big iron machines, the new HPC Challenge set of benchmarks would be a good choice. They look at much more than floating point performance (Linpack) and can give good statistics for buyers to look at when choosing a system. The Challenge benchmarks look at disk performance, memory bandwidth, network I/O, bus performance, etc., etc. These are things that people evaluating different systems will want to know, so they can buy a system that suits their particular application needs. The only issue is that they will take a LONG time to set up, let alone run.
1) Understand big iron: machines like the SGI Altix, IBM p690, and Sun Fire E25K. How they scale, the technology behind the processors, and what vendors are doing to push the envelope, so to speak.
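One of the HPC Challenge components is the STREAM memory-bandwidth test. Here is a sketch of what its "triad" kernel computes and how STREAM counts bytes (two 8-byte reads and one 8-byte write per element). Note that a pure Python loop measures interpreter overhead, not memory bandwidth; the real benchmark is compiled C or Fortran, so this is for illustration only:

```python
import time
from array import array

BYTES_PER_DOUBLE = 8


def triad(a: array, b: array, c: array, q: float) -> None:
    """STREAM-style triad kernel: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bytes(n: int) -> int:
    """Bytes STREAM counts for one triad pass: read b and c, write a."""
    return 3 * BYTES_PER_DOUBLE * n


def measure(n: int = 1_000_000, q: float = 3.0) -> float:
    """Apparent bandwidth in MB/s. In pure Python this is dominated by
    interpreter overhead, so treat it as a demonstration, not a result."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, q)
    elapsed = time.perf_counter() - start
    return triad_bytes(n) / elapsed / 1e6
```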
2) Talk to big iron vendors about what products their customers are using on their machines, what they are using it for and how that product benefits them.
3) Programming environments: many big iron machines require programming toolkits like LAM, PVM and MPI. Articles on using and understanding them properly would be cool.
4) Interconnect: Ethernet, Myrinet, Infiniband, what the hell are these? Where are they better suited and why.
5) Benchmarks are good, but avoid benchmarks run at the vendor's location (IBM BlueGene), which are finely tuned by the vendor for the benchmark. SGI's benchmark for Columbia was much better, since it took place at the customer site while the system was being built and after it was completed. While not necessarily real-world (they were synthetic benchmarks), the results were closer to a production environment.
If the steam is powerful enough to quickly get the creases out of underpants! --the elderly underpants gnomes
It's Korea, not "Corea"
For example, their tests for the Cisco CRS-1 are available here and their results here. The beast qualifies as both big iron and very fast switch to me... Watch and learn ;)
With a computer like that, you want clients to think that their machine is so powerful that, if unleashed, it would start trying to take over the world. If the mainframe you bought has loads of blinking lights and makes ominous noises, it's a win. DoD terminology also helps, along with a speech synthesizer that randomly says things like "Pentagon uplink complete" and "retargeting in progress".
This is my sig.
It has to fit the overtightened lug nuts on my car, but not be too large for use in close-in self defense/LARTing situations.
Best Slashdot Co
How about adding to every article about servers a note on which accessories, if any, you would *think* the high-end server vendors would actually include with new machines?
For example, I'm sitting at my desk now setting up a new *ell server with 3 SCSI drives from the factory. The machine can hold 6 drives in total, yet the physical pieces of plastic that hold the drives and attach them to the SCSI backplane are not in the machine. I know this is how *ell ships their servers, but really, why should it take 8 phone calls to *ell to track down the right people to get those extra brackets? What if I needed a floppy drive in a server that didn't come with one? With *ell it's a crap-shoot trying to find one.
So, rants aside (too late): how well do providers back up their products, not just with regular service but with replacement parts and accessories? And why, on servers with hot-swap SCSI backplanes, has it become standard to sell a machine with 3 drives that can hold 6 and not just "throw in" the three $4.95 plastic drive carriers so I can install more drives if I want to? (Rant off again.)
Why do overlook and oversee mean opposite things?
When it comes to enterprise class computing, these are the crucial factors:
Well, I hope this helps. I think the absolute worst criteria is performance, because any system in use for a substantial amount of time will receive an upgrade or two. If you don't have the performance now, you can always request it during the next upgrade. But if you don't have security, stability, or reliability from day one, you might not be around later to perform an upgrade...
The society for a thought-free internet welcomes you.
Seriously, where are they?
For enterprise systems you can't build one and test it in any way that will produce meaningful results; UNLESS you start a business, buy the system, run your company with it, and come back to us in five years with your financials.
Honestly, when it comes to purchasing very expensive machines, I don't think IT departments should be looking at a journalist's review. They need to be doing the research and testing themselves.
Vendors will bend over backwards to get you to buy their big-ticket items. They will generally give you test machines and allow your engineers to hammer away. Those making the purchasing decisions will talk to their engineers, and value their opinions much higher than those of a magazine.
At least that's how it should work, and that is how it does work in top companies that rely on these machines for their entire business.
When it comes to stability, the big OSes have tools the little ones don't: the ability to isolate the misbehavior of one application from others. Within big isolation containers (partitions), quota limits are important to prevent one process from affecting others.
Maybe part of reviewing and testing should include tests to ensure that a second, misbehaving application can't affect the one under the microscope (and vice versa?).
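The shape of such a test can be sketched even with a plain POSIX per-process limit, which is far weaker than the partition-level containment described above: start a deliberately misbehaving process under a quota and confirm the quota stops it. This sketch uses Python's `resource` module with `RLIMIT_CPU` (POSIX only); the "runaway" workload is a stand-in, and a real review would also run the well-behaved application alongside and watch its response times:

```python
import resource
import signal
import subprocess
import sys

# Stand-in for a misbehaving application: burns CPU forever.
RUNAWAY = "while True: pass"


def run_with_cpu_cap(code: str, cpu_seconds: int) -> int:
    """Run `code` in a child Python process with a capped CPU-time quota.

    POSIX only. When the child reaches the soft RLIMIT_CPU limit, the kernel
    sends it SIGXCPU, whose default action terminates the process; if it
    survives that, it is killed at the hard limit. Returns the child's exit
    status (a negative value -N means it was killed by signal N).
    """

    def apply_limits() -> None:
        # Runs in the child after fork() and before exec().
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds + 1))
        resource.setrlimit(resource.RLIMIT_CORE, (0, 0))  # no core dump files

    proc = subprocess.run([sys.executable, "-c", code], preexec_fn=apply_limits)
    return proc.returncode


if __name__ == "__main__":
    status = run_with_cpu_cap(RUNAWAY, cpu_seconds=1)
    if status in (-signal.SIGXCPU, -signal.SIGKILL):
        print("runaway process was stopped by its CPU quota")
    else:
        print(f"unexpected exit status {status}")
```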
- The Kessel run is for nerf herders. I can circumnavigate the entire Central Finite Curve in a lot less than 12 parse
Sorry if this seems harsh (it is), but frankly, if you have to ask then you probably shouldn't be doing a review on this subject.
I think you are in need of some professionals who have already 'served time' with such systems.
Joel
As was previously mentioned, pure CPU, disk or other benchmarks are not directly relevant to big iron servers. What would be useful are benchmarks based on industry-standard workloads, with comparisons to other machines.
A good example would be the benchmarks the Transaction Processing Performance Council uses to test database speed. Naturally I single those benchmarks out because I am a database administrator, but others surely exist for application servers and the like. The key is consistency in benchmarking, with loads of comparisons. I don't care if a machine does 1000 zargs a second unless I know how many zargs other machines can do.
I find that I need much more help with the little iron. My putting game has gone to hell.
I had a similar experience with that vendor. I'm curious to know about which product, but don't want to discuss it on slashdot.
But Herr Heisenberg, how does the electron know when I'm looking?
I've been working with mainframe systems since the late '70s as a systems programmer. I've also been involved with small personal Un*x systems (PCs) since the early '80s.
One has to remember that when dealing with big iron stuff, you are not dealing with the "walk into the store and say I'll take one of those and two of these". You are targeting people who deal with this stuff on a professional level and who know (or should know) their environment to a tee.
First, let's split people into two groups: those who are existing big iron users with an established IT department and the infrastructure associated with it, and those who are looking at scaling up from a room full of smaller servers, are trying to consolidate the workload, and are trying to move away from a "walk into the store and see what's available" approach.
For the first group, what matters is how robust the product is, how well it can be integrated into the existing system(s), and how well it fits with the primary OS and hardware vendor's future direction. What are the maintenance costs? What type of training and support will be available? Will the product (or a follow-on) be around for the long run? There is nothing worse than obtaining a product, investing the time and money into installing it, educating the support, operations, and/or end-users, and then having the product disappear in two years.
For the second group, in addition to some of the items listed above, people new to the big iron arena will want to know how much education and assistance will be available. Some hand-holding will be required during the transition, along with planning and direction to ensure that things are done right the first time. While it is possible to simply move the existing workload into the new environment without changing how things are done, it is probably better to take a close look at what the new environment can offer. This might require focusing on the end result of certain tasks rather than the means.
I thought we were discussing Big-Iron, not toys.....
Like maybe a real computer beast (z/900, IBM-9672,...)
8->
Typically, these types of reviews are conducted exactly as if they were reviewing the distro of the week. "Gee, it installed really easy" or "Gee, it couldn't identify my SoundBlaster PCI that I bought for $10.00 at a garage sale". Rather than discussing the ease of installation, how about focusing on the setup of services? For instance, how much effort does it take, and how well documented is it, to install, configure, and create a simple functioning website? Or, how was the setup process for NAT or DHCP or DNS? What does it take to configure the file system/network shares for end users? What's required on the client side?
Reviews also tend to exist in a vacuum. They ignore existing software and servers and expect a green field environment. That's just not realistic. How about testing interoperability with other major platforms? No review could ever be complete, but they could certainly try to cover more ground.
A good example of a bad enterprise software review is the recent ExtremeTech/eWeek review of Novell Linux Desktop. They ignore the context and entire purpose of the product. It's not sold at retail, but the reviewer insists on treating it as though it were the new Fedora Core release. Wrong, wrong, wrong. How easy is it to manage? To patch? To image? How well does the ZENworks integration work? How well does the product address its target market?
A review should also include vendor practices. When reviewing an MS product, for instance, it might be helpful to point out that Microsoft tends to stop offering service packs after major upgrades are released (à la the recent decision not to release SP5 for Windows 2000). What about service and support included in the product? Is it advisable to buy a support contract? How long will this product be supported? Is support overpriced (à la Cisco)?
The last thing that really gets me is when reviewers endorse a product based on vendor promises of forthcoming functionality or the reviewer's own expectations. "The product kind of sucks now, but wait 'til version 3, because we always know it takes 'em three tries to get it right." Review the product as delivered. If it lacks major functionality, say so. How can you review something that's incomplete?
Frankly, it's no wonder that some vendors no longer participate in "fake" RFPs or vendor roundups. What passes for journalism on some of those IT websites is just embarrassing.
I thought this was a requirement for pretty much any hardware review. I don't believe we should limit this to computer hardware either. My dishwasher currently has 0 fps. I'm looking to upgrade that for sure.
blarg.
Overload the hardware as badly as you can and see how it copes (experience: practically all OSes have a "breaking point" after which you need to restart the machine to recover fully).
Try to install faulty components, see what happens (Experience: even if the manufacturer claims failure tolerance, this is seldom the case).
Check whether the iron really runs at the manufacturer's stated maximum temperature, and what happens a couple of degrees above it (experience: Sun boxes keep running, HP-UX boxes immediately shut down).
Check if the system runs itself down gracefully when UPS reports power is out. Cut power entirely, see what happens.
Check if you can administer everything without touching the iron, including shutting the box down and starting it (Lights Out Management).
"Although it is not true that all conservatives are stupid, it is true that most stupid people are conservative."
I think it's cool to look at rooms full of rackmounts.
I am not in the IT field, but I find it interesting and impressive to hear about how large operations run. I'd really like to hear about different methods for laying out data centers. What rackmounts do they use, what types of air conditioning, what types of power conditioning? How do they organize all their cables? What procedures work for figuring out which server failed, and how do you take it offline and fix it? I know these questions might be really basic to a lot of IT people, but not to me. I also haven't seen many writeups on best practices in these areas.
Can it handle a slashdotting while running as a webserver? Should be easy enough to arrange, just make up a webpage mentioning a) Half Life 2 b) the PATRIOT act c) Google and d) Natalie Portman (preferably with a few mpgs).... Submit and wait for the smoke....
...that says "Univac Inside"!
...Performance statistics.
One of the biggest reasons companies even think of spending lots of money on high-end hardware is its capabilities during error/failure states. For instance, on the IBM Regatta, you can lose processors, memory sticks, bus paths, etc., and its "self-healing" technology keeps the system up. This is a high-dollar item on the pSeries, though it is possible to take advantage of some of it in their non-RS/6000-line pSeries hardware as well.
Compaq NonStop has similar technology for error/failure recovery and bypass, even on x86, but again that's another high-dollar item. HP has had stuff like this for a long time as well, so now that they're one company, they've begun integrating some of those technologies.
All failures, though, have a direct impact on performance. If anything fails and takes the system offline, it is no longer performing as needed or designed. So uptime counts in this case.
Those losses in system uptime, without redundant systems, can be directly seen in a business' financial performance because they may have SLAs to meet where they might pay their customers for outages outside of SLA.
So, error/failure recovery is a huge metric regarding performance of a system.
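SLAs are usually written as availability percentages, and it helps to translate those into hours of downtime. Here is a small sketch of that arithmetic. It assumes a 365-day year, and the "three one-hour abends in one week" example is hypothetical, echoing the scenario below:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; ignores leap years


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime that an availability target permits over a period."""
    return (1 - availability_pct / 100) * period_hours


def measured_availability_pct(outage_hours: float,
                              period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability actually delivered, given total unplanned outage time."""
    return 100 * (1 - outage_hours / period_hours)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_hours(target) * 60
        print(f"{target}% -> {minutes:,.1f} minutes of downtime per year")
    # Hypothetical: three one-hour abends in a single week (168 hours).
    print(f"that week: {measured_availability_pct(3, period_hours=168):.2f}%")
```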
Bottlenecks are also important to test for. This can be done fairly easily by separating the system into subsystems, either logically or by block diagram, stress-testing each subsystem, and then measuring which capabilities of the overall system are affected.
The point of this testing method is to understand what roadblocks might appear during critical operating periods. Example: you have a database or application server performing a lot of internal (memory-related) I/O. This memory I/O might exercise 3 of 4 specific subsystems in the system. When some abnormal behavior is encountered, what happens to that I/O? Does the system have a way of balancing I/O to mitigate the performance impact?
Obviously, these are contrived tests, but they are still necessary, because this situation is likely to happen in a production environment.
Nothing impacts ROI like lots of errors or failures in a short time period. So unless you're going to run 192-hour stress tests to see whether the system returns to normal afterward, I doubt you'll be able to account accurately for everything that might happen in a short time period.
To highlight what I mean, say you have a system that is responsible for running an app server, a small database of login credentials, and a web server. The system has 250 simultaneous users at any one time. For this you'd need a lot of RAM and some decent I/O, depending on how much interactivity the app or web server has with the end users. Let's say, though, that the system encounters an abnormal end due to a hardware bug or, worse, a software/kernel bug. If this happens several times in a week, customers are likely to start asking for payback based on their service contracts.
So there as well, you have two or three possible avenues to investigate to find the cause of the errors/failures.
High-dollar/high-end systems aren't supposed to allow this sort of thing. They don't always mitigate poor application performance, but you shouldn't encounter kernel bugs and the like unless you're doing something you shouldn't or the system is just broken.
--SuperBug
are the manufacturers' specifications.
Product cross-comparison of specifications using an identical test suite rather than the manufacturers' 'tuned' suites.
Real-world test comparison: how well does the box do its job when it's doing everything it will do in deployment at once?
A clear breakdown of cost, so that all the 'gotchas' are visible: proprietary cards or code that is not included, warranty, spare parts turnaround, ease of diagnosis, actual electricity consumption, etc.
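The cost breakdown asked for above is simple arithmetic once the numbers are on the table; the hard part is getting the vendor to disclose them. Here is a sketch under assumed inputs. Every figure in the example is hypothetical, and it leaves out cooling, floor space and staff time:

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 365 * 24  # ignores leap years


@dataclass
class CostInputs:
    purchase: float          # base hardware and bundled software
    extras: float            # proprietary cards, licenses or code not in the base price
    support_per_year: float  # support contract / extended warranty
    avg_watts: float         # measured average draw, not the nameplate rating
    price_per_kwh: float     # local electricity rate
    years: int               # planned service life


def total_cost(c: CostInputs) -> dict[str, float]:
    """Itemized cost of ownership over the service life (cooling not included)."""
    energy_kwh = c.avg_watts / 1000 * HOURS_PER_YEAR * c.years
    breakdown = {
        "purchase": c.purchase,
        "extras": c.extras,
        "support": c.support_per_year * c.years,
        "electricity": energy_kwh * c.price_per_kwh,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    # Every figure here is hypothetical.
    inputs = CostInputs(purchase=250_000, extras=30_000, support_per_year=40_000,
                        avg_watts=4_000, price_per_kwh=0.10, years=4)
    for item, cost in total_cost(inputs).items():
        print(f"{item:>12}: ${cost:,.0f}")
```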
Now I'm the grandest Tiger in the Jungle!
To capture the essence of the enterprise, you need to hire four newly graduated students and have them write the worst program possible in Java. Don't worry, the "worst" part comes automatically. Then, apply this program to several brands of servers and see which one actually survives. That is the one you recommend in the review.
-- "Makes Little Debbie look like a pile of puke!" - Moe Szyslak
Get whatever webserver the vendor recommends, throw /. on it, find the biggest firehose you can and throw the IP of the test system on the /. homepage.
Measure the amount of sweat on the marketoids' foreheads.
The biggest issue we've had while doing research for our database server is our lack of understanding of equivalency between vendor products.
We have $300k-$500k to spend on a database server. We currently run DB2 on a Sun E6500 with 16 CPUs, but are looking to upgrade to either a mid-sized new-generation Sun box or a mid-sized IBM box. At a similar price point, what are the differences in raw throughput, scalability, and feature set when running the same software?
Even if the comparison was with Oracle, or running Java, but with the same level of hardware, it would be useful.
We've pored over the TPC and spec.org benchmarks, but those are vendor-created and it's impossible to find any equivalency. Either the software changes between benchmark runs or the hardware is of a different class.
This kind of head-to-head between server class machines would be invaluable.
The same goes for SAN hardware and switches. Single-point feature set comparisons are pretty useless; you can get those from the vendor. But no vendor will ever provide comparisons. If you could provide that, it would be an invaluable service.
If I post a reply to the story that is offtopic, then it should be moderated as offtopic. However, if the reply is nested in a thread and is ontopic for the thread but offtopic for the story, then it really shouldn't be moderated offtopic. Some of the most interesting little gems I see on /. are hidden in threads and have no relevance to the story; they are one of the great features of Slashdot.
I'd really like to see a feature where authors can designate their posts as offtopic. If you don't want to see them, you could just set that in your preferences. All I can do now is add "OT:" to the subject, as is customary on Usenet.
His idea is pointless.
>What do you as an IT professional want to read in a review for a server OS or a high-speed switch, or a big iron server or proprietary workstation?
As an IT pro, my opinion doesn't matter. Purchases of big iron are rarely decided by the so-called IT professional. It is a complex process that involves financing, the application guys, the database vendor, office politics, etc., and hence should not depend (solely) on the input of IT professionals. For example, if you buy an iSeries server, maybe IBM will give you a big discount on DB2, so whether or not the server has logical partitioning won't be a big deal: you'll be able to buy a couple of Itanium 2 servers with VMware with the money saved on Oracle 10g EE...
Secondly, as those servers are highly specialized, comparing them is like comparing apples and oranges.
Sure enough, if you wonder which server would be the best for Oracle apps, you can probably get TPC figures and compare those, but then you don't need a review, do you?
Alternatively, if you need it for a specific application (or existing environment) that exists only on Sun, why on earth would you want to read a review of Superdome or its comparison with the E15K?
Thirdly, if you can't trust your UNIX vendor, then you'd better not buy UNIX. If you do, don't pay too much attention to reviews.
Most Compaq machines come with an "iLO, Remote Lights-Out Edition", which is a network-accessible console. It gets its own IP and security information, and allows me full access to the machine. As an admin of remote server sites, I find network access to the console (the ability to do reboots, access the LILO prompt, the BIOS, etc.) indispensable, and IMO it is a required feature on any modern server. A serial console doesn't cut it here.
Also, the ability to monitor hardware status from the OS (Linux). Can I tell if a disk in the RAID has failed without calling the ISP and asking them to check the lights? How about the temperature of the CPU?
It's incredibly hard to test enterprise-level hardware and software, because its utility (or flaws) only show up when you try and scale it out.
I know. I used to be an SE working in enterprise environments.
There were really only three measures in the enterprise environment that mattered:
* does it work in our environment? That means interoperating well with other hardware and software platforms. It can be managed using existing tools, or with add-ons to the existing tools. It works on/with the platforms used in-house, with the restrictions our infrastructure teams have.
* does it scale? There's no point in deploying a software distribution system that only works across 2,000 hosts if you need it to work across 75,000 hosts...unless 2,000 hosts are all you need. Scalability also includes "how do we get it installed, and how do we manage it?" It's related to the above, because, well, they'll use their existing tools to push your software out.
On the hardware side, I guess this would be "does it really support as many users/connections/whatever as they say?" 20,000 connections from one box is different than 1 connection from 20,000 boxes. People test the former, but in real life it's the latter.
Show how the hardware device degrades, given the load. For example, some OSes will slow down under load but keep working (Solaris), while others will just barf and refuse to do any more work (HP-UX). Neither one is right for all environments.
* are the positives outweighed by the negatives?
Every product has plusses and minuses. If it doesn't interoperate, is its lack of interoperability made up by some feature that the staff just can't live without? Will it save money, but add maintenance to the environment? Is it too complicated? Is the vendor too small? What about support?
Note that cost isn't as much of a factor as you'd think. Aggravation, at the enterprise level, is a much bigger factor than cost. Windows isn't being replaced because of costs (although they are a factor). It's being replaced because of aggravation.
1.) *ALL* configurations used. This includes the specific configuration files of each device being tested as well as those of the devices surrounding it. Tell me exactly what versions of code were used on everything. I want to be able to completely reproduce your test should I be so inclined. Without full disclosure, the validity and reliability of these sorts of reviews tend to be questionable.
2.) As someone who buys many, many millions of dollars' worth of network equipment, I honestly don't care all that much about the raw throughput of a given box. Turn features on. Turn *lots* of features on. Then tell me how much traffic is passed. As mentioned in #1, define exactly what sort of traffic you pushed through and what the end points might be. Many major vendors' equipment tends to fall apart when ACLs are applied to interfaces and absolutely crumbles when QoS features are turned up. Give me an idea of which features (if any) have an impact on performance, and how much of an impact they have.
3.) Reliability: Run through as many failure modes as you can. Pull linecards, fans, power supplies, CPU modules, switch fabric cards, etc. More importantly, measure any traffic loss when new parts are inserted into a live box. Tell me how much of a hit I take when I reboot the box. For larger boxes, tell me about features that let me gracefully take some or all of the unit offline for maintenance. Any software extensions (e.g. graceful restart for OSPF/BGP/IS-IS) should be tested. Observe the behavior of the box as routing protocol timers are tightened. Many larger boxes allow for degraded operation under extreme fault conditions. Are there mechanisms to ensure that critical traffic makes it through, or do faults hit everything? In short, I want to know what happens when things break. Tell me about the things that could interrupt production traffic.
4.) Management / Operation: You'd be surprised how badly some vendors fall down in this department. Tell me about craft/console interfaces and any caveats that might apply. Tell me about facilities for out-of-band management (if there are any). Tell me about mechanisms to access the box - do ssh/telnet/rsh/whatever work correctly with all common desktop implementations (certain combinations of network hardware and ssh clients are notorious for problems)? If the box supports SNMP, what version, and what facilities are available to limit access to the box? Also verify that the MIBs are available and well documented. The same would apply to TL-1 on carrier equipment. Tell me about syslog facilities and the ease/consistency of configuring them. Tell me about AAA support - RADIUS? TACACS+? Others? What happens when the authentication servers become unavailable? What about debugs on the box? Are they potentially destructive to activate with heavy traffic on the box? Are they complete? Basically, you just need to remember that visibility into the operation of the box is absolutely critical - both reactively (fault management) and proactively (trending performance, etc.).
5.) Security: What mechanisms are available to secure the box, and how well do they work? In some cases there may be both hardware and software measures the vendor has taken. Evaluate how well the vendor has kept up with security patches, how available those patches are, and how much impact applying them might have. Can I apply updates to minor software components without interrupting traffic or rebooting the box? What services does the box offer to the outside world, and how readily can they be disabled? What mechanisms are available not only to block traffic to the box but also to rate-limit it to deal with DoS conditions?
6.) Vendor Support: This is tough to quantify but is arguably among the most importa
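Two of the numbers asked for in #2 and #3 reduce to simple bookkeeping once the measurements exist. A sketch under stated assumptions: `feature_impact` expresses each feature configuration's throughput as a percentage lost against a features-off baseline, and `loss_report` assumes the reviewer sends sequence-numbered probes at a fixed interval through the box during a card pull or reboot and records which ones arrive. Function names and configuration labels are illustrative; the longest gap approximates the outage to within one probe interval, and reordering is ignored.

```python
def feature_impact(baseline_pps: float, with_features: dict[str, float]) -> dict[str, float]:
    """Percent of features-off throughput lost under each named feature configuration."""
    if baseline_pps <= 0:
        raise ValueError("baseline throughput must be positive")
    return {
        name: round(100.0 * (baseline_pps - pps) / baseline_pps, 1)
        for name, pps in with_features.items()
    }

def loss_report(received_seqs: list[int], sent: int, interval_ms: float) -> dict[str, float]:
    """Summarize a probe stream numbered 0..sent-1, one probe every interval_ms."""
    got = {s for s in received_seqs if 0 <= s < sent}  # duplicates and strays ignored
    lost = sent - len(got)
    longest = run = 0
    for seq in range(sent):
        # Length of the current run of consecutive missing probes.
        run = 0 if seq in got else run + 1
        longest = max(longest, run)
    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "longest_gap_ms": longest * interval_ms,
    }
```

Reporting `longest_gap_ms` alongside total loss matters because one long outage and many scattered drops can lose the same number of packets but hurt routing protocols very differently.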
I have been sysadmin'ing 'big iron' type UNIX configurations since 1997. Still do today. I think your question makes some assumptions that don't totally apply.
What sells big iron usually isn't the performance. Well, it is, but it isn't. Usually you go for big iron because you can only scale your application vertically and can't split it out horizontally. You're more forced into a big iron situation than it being a choice.
The number-one issues with big iron tend to be cost, system availability and reliability, redundancy, vendor support, and the ability to deal with troublesome situations (failing disks, bad CPUs or RAM, applications going out of control, etc.). I'm ignoring clustering as another layer on top of that, of course.
So, I guess I'm really going back to my earlier point in that performance specs don't mean as much to me as do all the less measurable items. Like, say, binary compatibility. But really it is more vendor comparison than an iron comparison.
One could do performance runs on large UNIX configurations, but then, you end up with something that is likely to be abstract and/or malleable. You might as well quote TPC specs.
How does it fail? And when it fails, what are the consequences? What's the repair time?
;-)
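The repair time feeds straight into availability through the standard steady-state relation A = MTBF / (MTBF + MTTR). A small sketch; the one assumption worth stating is that MTTR should be measured until service is actually back (data restored and verified), not until the failed part has been swapped. Function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years

def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state fraction of time the service is up: MTBF / (MTBF + MTTR)."""
    if mtbf_h <= 0 or mttr_h < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_h / (mtbf_h + mttr_h)

def expected_downtime_h_per_year(mtbf_h: float, mttr_h: float) -> float:
    """Expected hours of outage per year implied by the availability above."""
    return (1.0 - availability(mtbf_h, mttr_h)) * HOURS_PER_YEAR
```

Because downtime scales almost linearly with MTTR when MTTR is small relative to MTBF, a box that is "back in an hour" but needs a day of restores is, for availability purposes, a box with a one-day repair time.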
How do they want it used? How do they *NOT* want it used?
Memories of restoring and chown'ing gigs of data with management crawling all over me, because the part that failed was back online in an hour, but the infrastructure and restore took 12-14 hours. (Don't trust motherboard RAID without MUCH RESEARCH!)
They will usually tell you the gotchas or limitations: 50 ft of Cat6e cable and no more, only x^2 connections at a time, etc.
I want to know, worst-case scenario, how it will fail.
If it can handle x per second, what happens at 0.95x? or 1.15x? Does it slow down gracefully, does it crash, does it hang?
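One way to turn those questions into a repeatable test is to step the offered load through fractions of the rated capacity and record, per step, how much work completed, how slow it got, and whether the box answered a health probe once the load was removed. A sketch under those assumptions; the steps other than 0.95x and 1.15x, the 99% completion threshold, and the latency budget are arbitrary illustrative choices, and the measurements themselves would come from your own load generator.

```python
from dataclasses import dataclass

# Offered load as fractions of rated capacity. 0.95 and 1.15 are the points
# asked about above; the other steps are arbitrary illustrative choices.
LOAD_STEPS = (0.5, 0.95, 1.0, 1.15, 1.5)

@dataclass
class StepResult:
    load_fraction: float    # offered load / rated capacity for this step
    completed_ratio: float  # requests completed / requests offered (0.0 to 1.0)
    p99_latency_s: float    # 99th-percentile latency of the requests that completed
    recovered: bool         # did a health probe succeed after the load was removed?

def verdict(r: StepResult, latency_budget_s: float, min_completed: float = 0.99) -> str:
    """Classify one load step as ok, degraded, stalled, or not recovered."""
    if not r.recovered:
        return "did not recover (crash or hang)"
    if r.completed_ratio == 0.0:
        return "stalled under load, recovered afterwards"
    if r.completed_ratio >= min_completed and r.p99_latency_s <= latency_budget_s:
        return "ok"
    return f"degraded: {100 * r.completed_ratio:.0f}% completed, p99 {r.p99_latency_s:.2f}s"
```

A box that slows down gracefully should read "ok" up to about 1.0x and "degraded" above it, and still recover; the crash and hang cases the comment worries about show up as the other two verdicts.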
It's one thing to give the engineering limits of a piece of equipment but those truly tough times are when the equipment is overloaded with usage (for whatever reason) and you've got to handle it. If we lived in a perfect world we would never push anything past 50-70% of its engineered design limits but having worked for an ISP 120-150% of its design limits was more the norm. And then when a customer/client/user does something patently stupid to overload the system it's still your problem to get the infrastructure stable and back online.
The above thoughts are just my opinion since I've been doing more support than engineering of late so I'm curious to read others responses as well.
You mean like the Univac 1230? In one control application, the software group set up some code in the background loop to make the A and B registers blink like theater marquees. Cool and a half. Also a rough guide to system load; as it did more, the speed of the marquee slowed down.
I have said it before and I say it again:
People spend way too much time wondering about how to back up things, when they really should pay more attention to restoring things.
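In that spirit, a restore test is only finished once the restored data has been compared against the original. A minimal sketch that compares two directory trees file by file on SHA-256 of the contents, permission bits, and ownership (uid/gid), assuming both trees are reachable as local paths; the function names are illustrative.

```python
import hashlib
from pathlib import Path

def _fingerprint(p: Path) -> tuple[str, int, int, int]:
    """SHA-256 of the contents plus mode bits and ownership."""
    h = hashlib.sha256()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            h.update(chunk)
    st = p.stat()
    return h.hexdigest(), st.st_mode, st.st_uid, st.st_gid

def tree_fingerprints(root: Path) -> dict[str, tuple[str, int, int, int]]:
    """Map each regular file's path relative to root to its fingerprint."""
    return {
        p.relative_to(root).as_posix(): _fingerprint(p)
        for p in root.rglob("*")
        if p.is_file()
    }

def verify_restore(original: Path, restored: Path) -> dict[str, list[str]]:
    """List files missing from, extra in, or different in the restored tree."""
    a, b = tree_fingerprints(original), tree_fingerprints(restored)
    return {
        "missing": sorted(a.keys() - b.keys()),
        "extra": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

Including uid/gid catches the restore that brings the bytes back but leaves everything owned by the wrong user.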
Redundancy. Are there any obvious design flaws? For example, if you have a RAID controller but all disks are connected to the same cable or backplane, it won't do any good. In the worst case, one of the drives will lock up the whole bus. What is the probability that the system will become unavailable in case of a hardware failure? Again, an example: when a CPU fails, do you lose only the process that was running on that particular CPU, all processes assigned to that CPU, or does it trash the whole OS?
Full system restore. What does it take to restore the system from scratch? Is it possible to do it remotely? Do you need to do anything hands-on besides inserting the media to restore from? Too often I have seen systems that need the OS installed before the tapes can be restored over it. Decent systems provide bootable tapes or a boot monitor for restoring from backup media.
If the vendor says the system will survive the failure of a single disk, make sure that you test it. Unplug a disk on the fly and see what really happens.
The list is virtually endless:
- Are there any internal hardware watchdogs to monitor health of hardware?
- What happens when the watchdog fails?
- Is it possible to reroute things around broken components? Can this be done automatically? Does it require downtime?
- What checks are in place to make sure that data doesn't get corrupted?
- How will the failing component affect system load?
- Are there ways to access hardware control panels and console remotely? Are they secure?
On big iron, scalability is a big issue, but it doesn't do any good if the system doesn't stay up.
I want in a review what any geek would want:
The maximum FPS (frames per second) for Half-Life 2, Farcry, and Doom 3 at varying resolutions.
There's no way I would buy a server that can't get above 40 FPS in Half-Life 2 at 1280x1024! If you can't trust it to play your favorite shooter without choking, how could you trust it to support the essential functions of your company?
And a deflector array capable of emitting an inverse tachyon pulse.
A slightly different perspective:
I'm a banker focusing on enterprise software M&A and private placements. In this and previous gigs I've used analyst research from Gartner et al. all the time. Though I don't consider Gartner to be much more than a great way to buy some market attention, people in the financial community pay a great deal of heed to quantifiably positive reviews from the company, because they act as a good proxy for market acceptance, and thus marketability and company value. That is, if you know nothing about a company but analysts like Gartner love it, it's worth taking a good look at.
With that said, the most focused-on chunk of a Gartner report is often the "magic quadrant." This is a Cartesian plane that ranks companies on their "ability to execute" (Y) and "vision" (X). Those sorts of standardized metrics, which easily place each company relative to its competitors, offer a load of perspective on a technology's value. It's never exact and always a bit subjective, but the multidimensional ranking is extremely useful and makes companies much easier to evaluate.
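A review could borrow the idea by scoring each product on the two axes and placing it on the grid. A minimal sketch; the 0-to-1 scores and the 0.5 midpoint are assumptions for illustration, not Gartner's method, while the four quadrant labels are the ones Gartner uses.

```python
def quadrant(execute: float, vision: float, midpoint: float = 0.5) -> str:
    """Place one product on a 2x2 grid: Y = ability to execute, X = vision."""
    high_exec = execute >= midpoint
    high_vision = vision >= midpoint
    if high_exec and high_vision:
        return "Leaders"
    if high_exec:
        return "Challengers"
    if high_vision:
        return "Visionaries"
    return "Niche Players"

def place_all(scores: dict[str, tuple[float, float]]) -> dict[str, str]:
    """Map product name -> quadrant, given {name: (execute, vision)} scores."""
    return {name: quadrant(ex, vi) for name, (ex, vi) in sorted(scores.items())}
```

The hard part is, of course, producing defensible scores; the grid only makes the relative positions easy to read.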
A big ironing board.
That's a short list of things I look for.
A certain manufacturer (name available upon request) said it supported Red Hat Enterprise Linux. That's great. I installed the OS according to the manufacturer's instructions, using the manufacturer's installation CD. After Red Hat was installed and running, I removed a disk from the RAID 5 array. The OS didn't do anything. I mean NOTHING.
If a fault occurs, I want the OS to go bananas. I want the OS to generate syslog messages at the minimum. I want console messages for a console server to see and respond to. I want outgoing emails. I want hell to break loose. Do something.
I called the manufacturer. They said their array manager does that. I pointed out that their array manager is written in Visual Basic and only runs on Windows. The guy on the phone was chagrined. For that matter, apparently no one had ever thought of "making the OS go crazy with messages". I was the first guy to call and ask for that under Linux.
I ended up writing a script that polled the hardware via a command-line interface they provided. This CLI was only mentioned in a previous revision of the documentation. When my company purchased a different server from the same company, the CLI didn't work, as that server used different hardware to manage its array. For that one, there is no CLI!
That's just for the array management. What happens if I have a bad fan, or power supply, or overtemp, or bad memory bank? Pick something and break it. How is it useful to have hardware redundancies without at least notifying the OS of a degraded condition?
I guess this has turned into a bit of a rant, but I can't be the only person to think of these things.
1. What is the hourly rate of consultants?
2. What is the lead time in finding a competent consultant?
3. How many man-hours did it take for the in-house people to configure the Big Iron system from its component bits?
4. Once built, how long was spent optimizing the various interconnects?
Most of my horror experience stems from having to find a VAX VMS consultant to help with a Reuters feed. If your client is expecting a certain setup, and you cannot hire an expert, you get some pretty nasty pinch points.
Once you've been sold a big iron system different levels of support are provided, and can become a significant part of the cost.
C'mon guys - you all know perfectly well that the nice and shiny 4-proc-capable boxes you bought at a 600% price premium, just to have the _option for the future_, still happily run with their single processor. And they are retired on average after 3 years of service.
Perhaps the ONLY thing I have ever seen being upgraded is memory. And storage, of course, is always kept in external arrays, so it isn't even an issue to consider - as long as you have at least one PCI slot to add an additional SCSI controller.
Whenever I have been in this situation, I always want to know what the terms fault tolerant/hot swap really mean in the sales doco.
It is not usually until the cabinets arrive that you get a chance to actually pull out that GBP 16,000 card whilst the power is on to see if the machine still works.
For that matter, I have never managed to get any pre-sales support dude to actually pull out a redundant power supply on their demo rack... Things like this are important: can you pull out the power supplies that are supposed to be hot swap, can you pull a drive from the RAID array whilst the power is on, replace it and actually still see your Oracle serving requests... and the RAID array magically rebuilding the drive?
It is usually at the worst possible time you find out that what is in the brochure isn't what is in the box...
THAT is what I would like to see in a review.
"Steve pulled out the third blade while the power was on. Three of the requests on our test rack took four seconds longer to complete but then the load test continued to serve requests at less than a second per query"
PS: What ever did happen to SGI?
Hello, "big iron" at this point has very little to do with performance and scalability. Those are useful, but at this point a large number of customers don't need more performance. What they need is "better" performance. Cue.... RAS, which stands for Reliability, Availability and Serviceability. The last one tends to be the biggest. When your machine starts to act strangely, you want it to help you figure out the problem, not just hang or return bogus results. You also want to be able to test out patches and hardware upgrades without affecting the availability of the system.
You also want determinism in the way the machine misbehaves. Say something crashes: you don't want to deal with some other part of the system thinking the transaction has been committed to disk when in reality it's sitting in an OS buffer or a disk buffer and just got erased. You want your memory and system buses to detect that 1-in-1E40 chance that a bit got accidentally flipped.
Now, testing this kind of stuff is _HARD_; you need ways to inject errors and see if the machine behaves as expected. This is really sort of a joke on /. I can't imagine taking a "big iron" review here seriously. The majority of the crowd that hangs out on /. thinks a cheap desktop PC running Linux and MySQL is a replacement for a high-end AIX box running DB2. Just because it may perform the same doesn't mean everything is equal. Hell, most Linux systems don't even have kernel crash dumping enabled. Very few of these machines have ECC, and on the ones that do, 99% of the time the machine either automatically reboots as the NMI fires or the NMI gets ignored, rather than handling uncorrectable errors. So much for serviceability. And with the Linux disk caching schemes you can pretty much kiss your reliability goodbye: after one crash your disk may have come back online, but you can't trust the data vs. what you told the other end of your transactional system unless you sent an fsync after every write (which, btw, causes Linux to run so slowly it's truly amazing).
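For reference, the usual pattern for making a single write durable on a POSIX system is write, flush, `fsync`, atomic rename, then `fsync` the containing directory. A sketch of that pattern (the function name is illustrative, and the directory `fsync` is POSIX-only); it only holds if the drives and controllers honour the flush rather than acknowledging it from a volatile cache, which is exactly the kind of behavior a review would need to test.

```python
import os
import tempfile

def durable_replace(path: str, data: bytes) -> None:
    """Atomically replace `path` with `data` and fsync both the file and its directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer to the kernel
            os.fsync(f.fileno())   # ask the kernel to push the data to stable storage
        os.replace(tmp, path)      # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp)             # never leave a half-written temp file behind
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)           # make the rename itself durable
    finally:
        os.close(dir_fd)
```

Timing a loop of these calls against plain buffered writes is a quick way to see the cost the comment above describes.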
If you don't know what the main issues are with "big iron", good luck trying to review them.
... How fast can I run Doom 3 on a fully configured Silicon Graphics Onyx 3000 with InfiniteReality4 Graphics with 128 CPUs and 16 graphics pipes? I bet the thing would bog down to 14fps with uncompressed textures and 4xAA.
1. It takes a while for a good sized IT staff to fully understand the ins and outs of large scale systems.
2. Got JCL?
3. Got DB2? Oracle? other databases with 100k+ tables?
4. Do you have a large temperature-controlled room (no, not your apartment's second bedroom)?
5. Got Staff? Got Unix?
6. How are you going to test integration with other large-scale systems? Many companies use several. Hope you can afford the leases on your "test" systems.
This seems like a pie-in-the-sky project that manufacturers won't support.
But, the main thing I want to know is
UNIX Big Iron or zSeries (AKA Mainframe)??
Here's what I want on the UNIX side:
Dynamic LPARing... break a big honking unit into smaller units or independent servers running on the gigantic server.
The ability to move resources from one lpar to another while UP!
Partial-processor LPARs... the ability to create a small LPAR that uses only part of a processor, for those times you need another server for a low-load job.
Some builtin drive space for OS is nice, but we're using SAN for application storage.
Call home... lose a processor, and it calls service for me. Lose a fan, it calls home.
UNIX big iron should be like that, and IBM has it. We have a brand-new 570 and it is an awesome box: dynamic LPARing, partial CPU for an LPAR (AIX 5.3), plus the POWER5 chip itself. Look no further for UNIX big iron... unless you need bigger... then wait for the 595.
I work with Big Unix. HP, IBM, Sun. Not the Intel platforms, either.
What do I want? I want it to just work, I want it to be flexible, and I want them to fix it if it doesn't.
I don't want to have to dink around with multiple reboots to get the drivers loaded. I don't want to have to explain downtime to load firmware to a "new" switch every 3 months.
I want to be able to put cards in myself, or be able to have the Customer Engineer (CE) do it, and not lose support either way.
I want it to be able to go from serving application ABC on Monday to serving application XYZ on Tuesday without more than a reboot.
I want it to use standards. Not extend them, not interpret them, not futz with them in any way. I want to plug vendor A's storage array into vendor B's SAN switch along with vendor C's SCSI bridge and vendor D's server, and not have the whole thing go "Huh?" and die.
Most importantly, I want Slashdot to get a panel of real administrators, bastard-operator-from-hell types preferably, together for each test (maybe offer free t-shirts or something) and discuss how to beat the crap out of the hardware.
If it doesn't break, you're not thinking like management!
Dismantle the system. Without powering it down. How many components can you remove, following all procedures, before the system becomes unavailable?
How many big shirts I can iron with it per hour.
/*hides and cries.
Then I try to beat that record.
One simple question: Can it run 30,000+ copies of Linux on a single piece of big iron hardware?
Do I have to support it with native tools only, or are there decent third party tools to choose from?
Rather than trying to find ultra generic benchmarks that apply to all computers, try identifying benchmarks specific to the primary and secondary markets the product is targeting.
Include brief comparisons between this product and the major players in its secondary target market. For example, if I am primarily looking at server consolidation products, blade servers might come to mind. However, an 8-way server running VMware might consider server consolidation a strong secondary market. I would like to see how they compare.
One of the big reasons to buy a big system is its purported high-availability features, such as RAID arrays, hot-swap drives, hot-swap power supplies, hot-add RAM, etc.
Performance is really secondary to availability in a lot of these situations. What happens when you pull a drive from an array or mirror set? How long does it take to rebuild? What happens if you pull 2 drives and put one right back? What happens if you scramble which drives are in which array slot then power on? Can you extend a volume 'on the fly' or do you need to back up to tape, re-create the entire array, and restore your data?
For example, HP/Compaq RAID controllers let you replace small hard drives (one drive at a time) with higher-capacity drives. Once all of the drives are higher capacity, you can allocate the rest of the space as a new RAID volume. On an RS/6000 you'd need to back up to tape, redo the entire thing, then restore.
Test enterprise features: connecting dissimilar SANs, FC multi-channel failover, pulling a power supply, using a space heater to simulate an A/C failure and seeing at what temperature (and after how long) the system becomes unstable.
Disconnect random cables, pull random chips from sockets, see if the on-board diagnostics help you locate the problem.
Disassemble the system and re-assemble it. How many tools did you need? How long did it take? How is the internal build quality? Are you going to cut your hands on sharp edges or is it all nicely laid out and easy to service? (Even if the end-user company won't be servicing it, if service is easier then it will likely be faster as well, shortening outages).
For networking equipment, can you build a 100% fault-tolerant clustered hot-failover system? What extra pieces did you need and how much would it cost? How well does the failover work? Seamless? 1-2 second 'burp'?
At the bottom of the