Costs Associated with the Storage of Terabytes?
NetworkAttached asks: "I know of a company that has large online storage requirements - on the order of 50TB - for a new data-warehousing oriented application they are developing. I was astonished to hear that the pricing for this storage (disk, frames, management software, etc...) was nearly $20 million. I've tried to research the actual costs myself, but that information seems all but impossible to find online. For those of you out there with real world experience in this area, is $20 million really accurate? What is the set of viable alternatives out there for storage requirements of this size?"
At about $1/GB of storage, 50TB comes out to around $50,000.
Not sure if this is what you're looking for, but I couldn't help noticing the timing after the last article.
Why is it that 90% of "Ask Slashdot" pieces seem to boil down to "I have no real world experience, and I'm just wondering how I can solve problem X for Y dollars when twenty different vendors all sell solutions for 100 * Y dollars?"?
--
Twoflower
We are looking at something similar but smaller, 20TB, and the price we are looking at is around $2,000,000 Canadian. The price sounds a little steep to me.
Take the new 320 gigabyte hard drives previously mentioned and divide 50,000 gigs (50TB) by 320. You get an approximate cost of having 50TB by multiplying that by $350, the approximate cost of the drive. However, with that much data a RAID is certainly in order. So multiply the number of drives by 1.5 or 1.75 to get the number of drives needed for a RAID. Then multiply that by $350. This comes out to a little over $80,000. The only cost left is the cost of all the RAID controllers (expensive) and networking all the drives together. So for the raw storage of 50 terabytes it costs about $80,000. If you were to buy ultrafast SCSI drives instead of the 320GB drives the price will be multiplied by about 3, since a 100GB super fast SCSI drive is also about $300 with 1/3 of the space. So that brings it to $240,000. Add to that the cost of labor and all the other hardware and I don't see how it could come out to more than 1 million dollars. I'm not an expert, but just doing the math it seems that more than that is too much.
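The drive arithmetic in this post can be checked with a few lines of Python. This is only a rough sketch using the figures quoted above (50,000 GB, 320 GB drives at $350, a RAID multiplier of 1.5 to 1.75); the function name `raw_drive_cost` is made up for illustration, and rounding up to whole drives is an assumption the post does not state.

```python
import math

def raw_drive_cost(capacity_gb, drive_gb, drive_price, raid_factor):
    """Return (drive_count, total_cost) after applying a RAID overhead multiplier."""
    # Round up: you cannot buy a fraction of a drive (assumption, not in the post).
    drives = math.ceil(capacity_gb / drive_gb * raid_factor)
    return drives, drives * drive_price

# Figures quoted in the post: 50,000 GB, 320 GB drives at $350 each.
for factor in (1.5, 1.75):
    drives, cost = raw_drive_cost(50_000, 320, 350, factor)
    print(f"RAID factor {factor}: {drives} drives, ${cost:,}")
```

With the 1.5 multiplier this lands a little over $80,000, which matches the post; the 1.75 multiplier pushes it closer to $96,000. Controllers, chassis and networking are not included.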
The GeekNights podcast is going strong. Listen!
From an earlier slashdot story, you can get 300GB hard drives for around $1 a GB. So you are looking at spending $50,000 on hard drives. Figure 4 IDE drives per computer and you need about 50 computers. That would run you maybe $15,000 at around $300 per computer.
I'd say it would need 10 employees to set it up, including a couple of programmers, a couple of sysadmins, and some techs. That would probably cost you $200,000 if it took them four months.
I'd say you could do it for less than half a million. Throw in $150,000 a year for facilities and maintenance and you have no worries.
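A quick sketch of that estimate, using only the numbers in this post ($1/GB, 300GB drives, 4 drives per computer, $300 per computer, $200,000 of labor). The function name `cluster_estimate` is hypothetical, and the counts are rounded up to whole drives and whole machines, which gives slightly fewer computers than the "about 50" quoted above.

```python
import math

def cluster_estimate(capacity_gb=50_000, drive_gb=300, dollars_per_gb=1,
                     drives_per_pc=4, pc_price=300, labor=200_000):
    """Rough one-time cost of a commodity PC storage cluster."""
    drives = math.ceil(capacity_gb / drive_gb)   # whole drives
    pcs = math.ceil(drives / drives_per_pc)      # whole computers
    drive_cost = drives * drive_gb * dollars_per_gb
    pc_cost = pcs * pc_price
    return {
        "drives": drives,
        "pcs": pcs,
        "drive_cost": drive_cost,
        "pc_cost": pc_cost,
        "total": drive_cost + pc_cost + labor,
    }

print(cluster_estimate())
```

Even after rounding the machine count up to 50, the one-time total stays well under the half-million ceiling suggested above; the $150,000 a year for facilities is a recurring cost and is left out.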
Google does something like this. They have tons of cheap computers with cheap hard drives.
I don't think it still costs $400,000 per terabyte for a 50 terabyte system when my server has a terabyte of storage for about $3000 total. I think to get a good pricing structure, you need to give the speed and size requirements.
From experience, I know that around 30TB is about $1M. I can't see how 50TB is more than that...
(The 30TB came from IBM.)
Taral
WARN_(accel)("msg null; should hang here to be win compatible\n");
-- WINE source code
Imagine how long FORMAT C: would take...
It's more involved than how many bytes you need to store, of course. How fast do they come in and go out? How often do the bits turn over? How reliable does the data need to be, and how fresh the reliability (do you need to mirror it real-time at a remote, hardened site, or back it up once a month)? What systems does the data need to feed and be fed from? What are your labor costs (tape changers, administrators, etc.)? How much wood do you need to buy for office furniture?
Because it makes a nice change from the "I'm stuck, somebody tell me what to do!" pieces.
sorry for sounding a bit trollish, but the current replies here seem to follow the formula of checking the biggest ide drive on pricewatch and multiplying that out to give you a number.
:)
forget all that.
if all you wanted was a pile of ide hard drives, maybe this would be ok, but anybody looking for 50TB of storage is not just looking for some disk to hold the pr0n they downloaded last week. large scale storage systems need to manage multiple host access to high speed (15krpm U3SCSI) drives in flexible raid configurations with maximum redundancy, high speed caching (with GBs of RAM to do it), fiber channel switching, cross platform capability, high end management and monitoring, HSM backup and data migration, offsite vaulting of disaster recovery data, power and air conditioning, and a fat service contract from the vendor. none of the above are going to be found at pricewatch.com.
your best bet is to talk to multiple storage vendors about your needs. call up EMC, Hitachi, IBM, and Fujitsu to start, then let them see each other's numbers. With the amount of money that you are going to spend (and it almost certainly will exceed $10 mil - but maybe not $20), each of these vendors will do backflips to get your business (and EMC is particularly good at junkets - take them for all they're worth
I am not an expert in this field, but Google was willing to tell me lots.
RaidWeb sells rack mountable RAID units that take IDE drives and have SCSI or fibre connectivity. A 12-bay 4U SCSI (with 12x 120GB IDE drives) system comes in at just under $8000, giving over 1TB fault tolerant storage. There are several other companies that have units like this.
Rackmount Solutions sells rackmount cabinets. A 44U cabinet with fans, doors, etc. will come in at around $3000.
In theory, a single cabinet could house 11TB of data, and cost around $91000. This still doesn't consider cabling, cooling, power distribution, networking, a proper server room (air con, false floor for cables, access control), and in all likelihood one or more controlling servers.
More practically, depending on how they are going to make this data accessible, you could be looking at 9 RAID units per cabinet plus three 2U servers and a switch in the remaining space. Each server can support multiple SCSI cards and gigabit networking. Such rackmount computers will set you back in the region of $6000 (incl. network and SCSI adapters, excl. software).
So you can call it $100,000 for 9TB storage ... $600,000 for 54TB. That doesn't answer the management software question, and may not be a suitable solution. But it sure is a lot cheaper than $20 mil ;)
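The per-cabinet figure can be reproduced roughly as follows. This is a sketch using the prices quoted in this post ($8000 per RAID unit, $6000 per server, $3000 per cabinet, about 1TB per RAID unit); the function names are hypothetical, and the switch is left out because no price is given for it.

```python
import math

def cabinet_cost(raid_units=9, raid_unit_price=8_000,
                 servers=3, server_price=6_000, cabinet_price=3_000):
    """Hardware cost of one populated cabinet (switch not included)."""
    return raid_units * raid_unit_price + servers * server_price + cabinet_price

def cabinets_needed(target_tb, tb_per_unit=1, raid_units=9):
    """Whole cabinets required to reach the target capacity."""
    return math.ceil(target_tb / (tb_per_unit * raid_units))

per_cabinet = cabinet_cost()
count = cabinets_needed(50)
print(f"{count} cabinets at ${per_cabinet:,} each = ${count * per_cabinet:,}")
```

Six cabinets give 54TB, and the unrounded total comes in somewhat under the $600,000 quoted above, which leaves a little room for the switch and cabling.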
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
Search terms: IDE Raid Chassis
Sponsored link: raidzone.com
Their 4U 2T system goes for $25K, so 50T would be about $750K and fit in 2 1/2 racks. They claim that they will be doing iSCSI soon, but right now it's just NAS. Still, this is a far cry from $20M. If budget is a concern, you can figure out how to use an array of NAS in place of a SAN.
If you are hell-bent on SCSI or FC, you are going to be into serious dough, as SCSI drives are almost 10X the price of IDE at this time and don't come with as large a capacity (which means that you will need more rackspace, chassis, power, etc.). $20M is probably not too far off. Modern IDE drives with dedicated smart controllers are really not too bad. Just keep a pile of them to swap out bad ones, as you are going to be going through drives pretty quick.
With the size of your drive array, backup is going to be a serious issue. You are going to need a multi-drive robotic array of good size. Those are not cheap either.
i don't know nearly enough to put such a thing together, but i do know enough to know that every real-world project probably costs 50x what a geek-fantasy basement equivalent would cost.
Because in this case, it's pretty obvious that the prices are overly inflated. He's paying almost a thousand times what the raw drives go for.
I think it's pretty reasonable to feel that you could put something like this together for under $100K.
May we never see th
From experience (with EMC - Sun) your price tag sounds a bit on the high side, but not by very much. Consider that EMC storage (after all, mission critical data should be stored on EMC/Hitachi/StorageTek, NOT on consumer IDE) costs much more than consumer IDE/SCSI (25 - 75x), and that's only the disks.
If you're going with EMC, you'll need to put those disks in something, like a frame (cabinet), and for your size, more like 5 cabinets. With that many cabinets, you'll need some sort of SAN switch and associated fibre cables (not cheap). That gets your disks into cabinets and all hooked together.
You wanted to access the data? Then you'll need EMC fibre channel cards ($15k a pop for the Sun 64bit PCI high end jobs). But you'll more than likely be serving data from a cluster of machines, so count on buying three ($45k) per machine (so each card is on a different I/O board hitting the SAN switch, for redundancy).
Who's going to set this up? For that kind of coin, EMC (or whoever you go with) will more than likely set the thing up and burn it in for you on site. The price probably also includes some kind of maintenance contract with turnaround time fitting the criticality of the system.
Yes, my 'big ass storage' experience may be limited, but I think that $20 million for 50TB installed/supported/tested by a big storage vendor is in the ballpark.
Good luck.
Figure you get two IBM Sharks with two expansion frames, maximum cache and 36GB disk eight-packs.
... about $500k
That's like $6MM for most customers.
Fibre channel directors and switches
Tape robot... $1 MM
Storage Mgmt software like TSM... $400,000
The extra $10MM is probably for full-time consultants, a more expensive solution like EMC or a more fault-tolerant solution.
Conformity is the jailer of freedom and enemy of growth. -JFK
What they've prob got is a massive SAN (Storage Area Network) running over 2 or more sites. If one site goes down you can run on the other, even with the sites 30 miles apart.
Also, accessing this amount of data at reasonably high rates is expensive: think StorageTek silos, HDS SANs, etc. All this is high-end, very very fast stuff.
If you've got 50 TB of data running in an OLAP cube you've got to have massive IO capability to properly load and spin the cube around. I.e. the cost ain't in the actual storage media, but in the IO (esp if you've got a split system requiring a multi-site setup).
There should be plenty of examples of this sort of data storage now - telcos to web logs. Pricing, well depends on the deal you can get at the time...
When you get a Symmetrix frame from EMC, you also get a support contract. EMC will send multiple people to your installation for maintenance. EMC will remotely monitor your Symm via modem. They will help you plan your storage needs (including what kind of backup and reliability you need). EMC will provide 24x7 support for everything you need. Then there's management software, etc.
Don't forget that the hardware isn't cheap: Frame, multiple redundant hot swappable power supplies (requires specialty power connection), dozens of scsi drives, dozens of scsi controllers, 10-20 fibre channel connections, an interconnection network between FC and SCSI controllers that includes fiber and copper ethernet, hubs, etc., and a management x86 laptop integrated into the frame.
$20 mil for this is a fair price in my opinion. Anyone who rolls their own is just insane. There are hundreds of engineers behind each of these boxes, and it shows.
No, I don't work for EMC.
Using your sig line to advertise for friends is lame.
Floppies. Lots and lots of floppies. They are so cheap right now! And they come in pretty colors too.
IDE-RAID with 3ware 7500-12 controllers and 3U 14-bay cases (available from rackmountpro, and probably others) could be one possibility, but I don't think you would get a 'flat' storage-space from it, probably have to be segmented instead. As others have pointed out NFS/Samba aren't really manageable ways to handle a filesystem spread amongst multiple machines. People who do this, like archive.org and google, have custom software to access the data stored on their machines. But it doesn't have to be that way forever...
I think iSCSI could give very interesting possibilities for open-source SANs using this type of hardware...maybe front-end servers which map requests as necessary to back-end servers holding the storage, you could have a rather nice fully-resilient highly-scalable system that way, which would just appear as another drive to a client machine, no NFS/SMB etc...
150GB * 334 drives = 50TB = ~$100,000
add $19,900,000 for consulting fees and you've got your 20 million. Speaking as a consultant, that seems reasonable to me.
OK, let's see. 50 terabytes divided by 600 megs per CD means you will need 83,334 CDs (rounded up). At about 20 cents each (retail) that should only set you back about $17k. Add in $100 for some of those heavy duty shelving units from Home Depot and a wintel box to read and write them, and you are looking at well under $20k for total hardware cost. At this point, just go hire someone away from their McJob for a reasonable amount to swap the CDs and you are in business.
Searching eBay for EMC provided some interesting results (these are mostly "buy it now" prices):
EMC Symmetrix 3930 w/ 12 TeraBytes = $57K
(With the proper drive configuration, this unit should be able to deliver up to 70TB in a single system).
This one comes with 12TB of storage (256 x 50GB HDs). If you throw out all 256 of those 50GB HDs (or just give them to me as a consulting fee for saving your company over $19.5 million) and buy 256 x 181GB HDs, you're just short of your 50TB mark (~46,336 GB).
On Pricewatch those drives come out at $999 ea x 256 = $255,744.00. Add the initial $57K and you've got a machine that meets your specification for significantly less than $20 mil.
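For anyone who wants to check the numbers, here is a small sketch of the refit arithmetic. It uses only the figures quoted above (256 drive slots, 181GB drives at $999, a $57K frame); the function name `symmetrix_refit` is made up, and whether the frame actually accepts those drives is not something this calculation can tell you.

```python
def symmetrix_refit(slots=256, drive_gb=181, drive_price=999,
                    frame_price=57_000, target_gb=50_000):
    """Capacity and cost of repopulating a used frame with larger drives."""
    capacity_gb = slots * drive_gb
    drive_cost = slots * drive_price
    return {
        "capacity_gb": capacity_gb,
        "shortfall_gb": target_gb - capacity_gb,  # how far short of 50TB
        "drive_cost": drive_cost,
        "total": drive_cost + frame_price,
    }

print(symmetrix_refit())
```

The total comes out a bit over $300K for roughly 46TB raw, before any RAID overhead, support contract or installation.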
Here are some other EMC machines for sale on eBay:
EMC Symmetrix 3830-36 With 3 TB No Reserve! = $59K
EMC Symmetrix 3700 6TB w/Install & 1YR Mnt! = $48K
EMC Symmetrix 5700 3TB Storage System = $9K
This is what I found by doing minimal research. I'm not 100% sure that the Symmetrix 3930 can handle that configuration (it's not my money) so before you go down this road -- do your research (better than I did).
-Turkey
You could buy IBM if you want to lose your data all over the floor. Why do people always reduce these conversations to price...data is priceless...would you send fine china in a paper bag across country with no insurance (IBM Shark, Hitachi), or in a double wall cardboard box with bubble wrap (EMC Symmetrix/Clariion)...if that data gets lost it's gone, history...hasta la bye bye...it's not all about cost, people, don't get burned, buy the right tool for the right job...
Power Corrupts,Absolute Power Corrupts Absolutely, leaving one person(group)in charge is absolutely corrupt.
i've built a (fully redundant with backup) 100TB array for $2 mil.
if youre interested in buying one, mail me. i can drop one together with no real problems.
-- zurk42 at hushmail d.o.t com
I'm not sure what the prices are running these days, but back in 1999 I put together a 6TB system running RAID 5 on an all fibre-channel system using FC hubs (at the time, switch fabric was too immature) and StorageTek (aka Clariion) arrays for right around $2.5M.
Keep in mind, that's just for the disks, array controllers/cabinets, hubs, and Sun FC cards. No servers are included in that price.
There are so many variables that you didn't go into that it's hard to give you an educated answer to your question, but it seems feasible to get to around 50TB today for that kind of money taking into account the increased storage density that we've gotten in the last couple of years.
Wait until 2023, purchase a 50 TB memory stick for $12 at Walmart.
Lack of capital, plain and simple... that's my answer to this sort of question
$20MM sounds very high. The Sun StorEdge 9980 System costs $2.3MM for 20T, upgradeable to 70T. So say between $5MM and $6MM for 50T. The system is a fully racked SAN - just plug it in and go. http://store.sun.com/catalog/doc/BrowsePage.jhtml?cid=82215&parentId=75082
Does this stuff have to be online for immediate access, or would it be acceptable for it to be online in a very slow filesystem and be available within 1 minute?
I built a system using spectralogic Bullfrog AIT changers, and LSCI's SamFS system. It sticks metadata for the files on your actual disk, but when you request one of the files, it goes to tape and gets it for you. For 50TB (uncompressed), you would be able to get by for under $500,000. However, that's without mirroring tapes. Trust me, you want to mirror your tapes. I've had them fail before. Figure double the price if you are going to mirror. Also, I'm not sure if the new AIT drives are out yet that will hold 100G uncompressed. If so, this will bring the cost down.
I know, the system sounds sketchy, but it works quite well. Seek time is definitely slow, but once it finds it on tape, the actual transfer is quite fast.
Need Free Juniper/NetScreen Support? JuniperForum
I've read a lot of books in my day, but quite frankly, most of what little knowledge I have comes from the kindness of people who have helped me to learn.
I don't think there's any excuse for asking a question without first doing a little basic research, but here we have somebody who has legitimately never had any experience with terabyte storage asking if there's a cheaper way. It's a legitimate question, and one that probably could not be answered by looking in a book. So the person here is right to ask, and has already gotten some very good answers.
I have a somewhat similar problem: how do I make sure that on the order of a terabyte of audio and video data survives the next hundred years? This given that the disk on which the first 80 gig of this data were delivered to me has two errors that have corrupted two of the files already, and the data isn't even a year old.
What I've been doing is asking other people how they've solved the problem, and also thinking about it on my own. It's how problems get solved. I've gotten some very good and thoughtful answers to my questions already.
The big factors in storage cost, briefly:
r) Reliability
s) Speed
c) Cost
In rough terms, c=s*r, meaning the cost will rise dramatically for high speed reliable storage versus low speed crap storage.
In addition, how the storage is designed (and how much more it can cost) depends a lot on data access patterns as well (read-mostly vs write-mostly, oltp vs dss vs datawarehouse vs
Maxtor has 0.3TB IDE for $1/GB. If you built a huge array of IDE controllers for these, your disk cost for 50TB would be around $50k. If some vendor actually built a beast with the requisite number of IDE busses and whatnot, the chassis might run you another $100k. All in all, real cheap storage. But it would suck on performance and reliability, put out too much heat and noise probably, etc, etc.
Highly available disk arrays with extreme disk platter performance and large amounts of caching can easily run $20 million for 50TB, if not more. There are middle of the road solutions though, it doesn't have to be that expensive unless you're going all out for huge concurrency and speed in an OLTP environment that requires 99.999999% uptime.
11*43+456^2
Management software can be pricy as well; Sun's SRM stuff comes in at around a quarter million per 10TB, plus you'll need a server (NT at the moment... yeugh!) for that etc.
All that added together and scaled up probably hits a good chunk of $10m. $20m does seem a little much, but then, it may do a lot more than our system above.
Is HSM a solution? We rarely access very old data but we still like it to be easily available. With HSM we can move data to tape or some other cheaper storage while it still appears to be on the local filesystem. Applications don't know the difference other than they have to wait about 45 seconds as the data is fetched to local storage. In the end it depends on how you access your data. http://searchstorage.techtarget.com/sDefinition/0,,sid5_gci214001,00.html
Sorry, but I've personally seen the carnage when a supposedly-multiply-redundant EMC system lost customer data. Add to that EMC's egregious business methods, and they quickly become the last choice, behind Hitachi and IBM. There's a reason that no one else will work with EMC.
Plus, you probably shouldn't compare the Clariion systems directly with Hitachi/IBM Shark...they're much more on the level of the LSI (StorageTek/IBM/etc) systems.
If you need 50TB of online storage for a highly available application, count on buying triple that in disk. You'll likely be using SAN RAID controllers with snapshot capabilities to minimize or eliminate downtime necessary for backups. This is on top of mirroring and striping (RAID 1+0), which doubles your disk needs. Or, using less expensive controllers, you could fake the snapshot mode (put the disks in 3-disk mirror sets, break one of the 3 out and reassemble a backup stripeset, mount that and back it up).
Your options in such a large environment are extensive - and managing it can get fun.
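The "triple it" rule from this post works out as below. This is a minimal sketch that assumes the tripling comes from two RAID 1+0 mirror copies plus one split-off snapshot copy, as the post describes; the function name and parameter names are invented for illustration.

```python
def raw_capacity_tb(usable_tb, mirror_copies=2, snapshot_copies=1):
    """Raw disk to buy when each usable TB is mirrored and also snapshotted."""
    # RAID 1+0 keeps two copies; the break-away mirror adds a third.
    return usable_tb * (mirror_copies + snapshot_copies)

print(raw_capacity_tb(50), "TB raw for 50 TB usable")
```

So a 50TB requirement turns into 150TB of purchased disk before any spares are counted, which goes some way toward explaining why per-drive price comparisons undershoot.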
While EMC is the established leader in this sector, there are alternatives that could bring that 20MIL price tag down. We use NetApp filers here and in the two years I've worked with them have experienced no trouble. Just thought I'd put my two cents in...
I participated in a Data Warehousing project at a Fortune 100 retailer that we anticipated could grow to over 10 Exabytes if we threw all our data sources at the same DB. It would have kicked butt, albeit at great (probably non-justifiable) expense.
We figured on prices similar to the ones above, though somewhat inflated as this project was several years ago. The problem was EMC.
I worked with a systems engineer who had headbutted management for years over EMC. EMC has NEVER allowed a head-to-head comparison between their products and any competitor, including at the retailer I worked for. In our case at least, apparently any time he got his technical managers close to requesting a comparison between "in-house" EMC systems and normal DEC Alpha / Compaq drive systems, EMC would get wind of it, call everyone with any power, and invoke the 'strategic relationship' and 'technical partnership' phrases. Management would always falter under such onslaught, so bamboozled they couldn't tell which end was up. The comparisons never went forward.
We did some preliminary comparisons ourselves, reading and writing several hundred gigabytes of data using a small C program that SEQUENTIALLY read and rewrote data. The EMC was about the same speed as the standard drive system (slightly slower, but not much) for sequential access.
The comparisons were VERY INFORMATIVE when we read and wrote RANDOM data. EMC was an ABSOLUTE DOG (very, very slow). The problem was that EMC uses a 32 KB buffer because of its mainframe history, so each record we read (a 1 KB record) incurred disk read penalties as if we were reading 32 KB.
Further, we learned by rumor that EMC employs 'read-ahead' software that tries to anticipate the next read location and fills the multi-GB buffer with disk data if it detects a sequential read. Since we occasionally had 2 or 3 sequential reads in the middle of our data (the nature of our data made this happen occasionally), the disk array would apparently go hog wild filling buffers for sequential reads it thought we would use but did not.
The final point was that although EMC had good prices initially, they apparently RENT their equipment (you never own your own hardware), so the prices for upkeep/next-years-rental can spiral up at their salesperson's whim.
That's my 5 cents here. Please be aware these were unofficial studies performed during spare cycles by probably incompetent persons including myself; any correspondence between the truth and the above remarks is purely coincidental. This post provided for entertainment purposes only, please don't sue me, I'm a worthless nobody.
Unitarian Church: Freethinkers Congregate!
The Hitachi solution is, as far as I know, reliable to the first power failure, period. Then it's an empty disk again. I believe they do guarantee it in anything other than a powerfail situation, however. Hence the quad-redundant power + onsite generator requirement. If you really have that kind of budget, call a sales rep and ask them about physically moving it 2 years down the road. Last time I asked, they said "Buy another one, lease an OC3 from Bell and mirror. Don't Turn It Off."
I've only got a budget for 1TB systems. At that scale, it's amazing how cheap it is. 1 HBA on each set of 15 x 72GB 15k U160s (RAID 5), using network sync between the two separate boxes. Came in to about 25 grand. Nice in that you can 'detach' one entire system, back it up, then resync it. This is for a large-dataset low-transaction volume setup, though. Secondly, backup is hideously expensive. Tapes = useless. Get something that lets you snapshot + delta the whole array. Drives are a thousand times cheaper than tapes to manage. (TCO, equipment AND maintenance.) Plus without 100 tapes in parallel, you won't be able to back up that kind of data in a reasonable timeframe.
--Dan
ok, I did a quick google search, and came up with this: http://www.serverworldmagazine.com/monthly/2002/08/sgi.shtml
it's the SGI File Server. According to the article, it scales to over 50TB and costs around $67000 for a 458GB configuration. Based on those numbers, you would need 110 of them to equal 50TB and total cost would be $7.37mln. Obviously, this is without any consultation with SGI, and they may have a better price. So, short answer is: Call SGI and have them quote you a price. Oh, and use GOOGLE before you submit an "ask slashdot" question. Nobody seems to do that.
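The unit count and total in this post can be reproduced with a short calculation. This is a sketch that takes the article's figures at face value ($67,000 per 458GB configuration) and assumes cost scales linearly with no volume discount; the function name `sgi_estimate` is hypothetical.

```python
import math

def sgi_estimate(capacity_gb=50_000, unit_gb=458, unit_price=67_000):
    """Return (units, total_cost), rounding up to whole units."""
    units = math.ceil(capacity_gb / unit_gb)
    return units, units * unit_price

units, total = sgi_estimate()
print(f"{units} units, ${total:,}")
```

Linear scaling is almost certainly pessimistic here, since a single small configuration carries the fixed cost of the server head, so a real quote would likely come in lower.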
FWIW, I used to be a Storage Area Network (SAN) designer for Compaq. The largest cost of creating multi-terabyte storage arrays is not the disks - it's the infrastructure needed to support the disks, i.e. the backplanes and the external RAID controllers, not to mention that everything needs to be dual redundant. Further, all modern SANs are attached to the hosts using Fibre Channel. Fibre Channel switches can run anywhere from $15k each to over $200k each, depending on the size and featureset of the switch. Also, each host attached to the SAN will require one or more fibre channel adapters, which run several thousand dollars apiece.
Based on current internet list prices, a given SAN will cost roughly $250k per terabyte. That's just over $12M for a 50TB SAN. Once you add additional warranty, onsite service, and installation/configuration services (yes, you must pay for the vendor to come on site and set these things up - they are not simple, nor intuitive), you're up closer to the $20M figure in your initial question.
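As a rough check on those figures, the sketch below multiplies out the $250k per terabyte list price and shows how much of the $20M quote would be left for warranty, service and installation. The split is only what the numbers in this post imply, not a vendor breakdown, and the function name is made up.

```python
def san_hardware_cost(capacity_tb, dollars_per_tb=250_000):
    """List-price hardware cost at a flat per-terabyte rate."""
    return capacity_tb * dollars_per_tb

quoted_total = 20_000_000            # figure from the original question
hardware = san_hardware_cost(50)
services = quoted_total - hardware   # implied warranty/service/installation
print(f"hardware ${hardware:,}, implied services ${services:,} "
      f"({services / hardware:.0%} on top of hardware)")
```

On these numbers the services would add about 60% on top of list-price hardware.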
I'd rather be a conservative nutjob than a liberal with no nuts and no job.
...I realize that accepted pricing is well above the price I mentioned. And yes, obviously I left out the maintenance.
The problem is that I find that corporate spending on IT purchases has gotten ridiculous. Let's buy a TEMPEST array! Let's buy something with a Sun nametag because the name sounds good! Let's buy a $2k piece of software for each workstation even though there's a free alternative!
I'm not saying that anyone *provides* something in the price range I was talking about. No one is crazy enough to do so, if companies are willing to pay much, much more. I'm saying that, if you're asking whether it's possible to *build* something like this for the price range I mentioned, off the cuff it doesn't sound so unreasonable.
Yes, a seasoned IT person who works with high-end systems like this will laugh. Why? Because they're used to paying huge amounts of money. Because it's an accepted part of the culture to throw down this much cash. What I want to know is -- how often do people question these basics? How often has someone said "Wait a minute...this is wrong."
Are you telling me that if you were in a third world country without the exorbitant amount of funding that we USians enjoy, and someone asked you to put together a 50TB storage system for under $1M, you'd simply say "It can't be done"? No consideration, nothing?
I mean, when I look at the fact that the *case* on, say, a Sun high end system costs more than a whole cluster of workstations, I start to wonder just how much excess is going on here.
Say we take the bare-metal, dirt cheap approach. Grab a bunch of Linux boxes. Throw RAID on them configured so that 1/3 of your data is overhead for reliability, and a 100Mbps Ethernet card in each. The figure used earlier was $1 per gig. Put six 200GB drives in each. Throw down $250 for the non-drive cost of each system. You have 800GB of data on each system, 400GB of overhead. That's 63 systems. $16K for the systems, $75K for the drives, and we come in to $91K. I left out switches -- you'd need a couple, but certainly not $9K worth.
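Here is that back-of-the-envelope calculation written out, using only the numbers in the paragraph above (six 200GB drives per box, one third of the space given to redundancy, $1 per gig, $250 per box). The function name `linux_array` is hypothetical and switches are left out, as in the post.

```python
import math

def linux_array(capacity_gb=50_000, drives_per_box=6, drive_gb=200,
                dollars_per_gb=1, box_price=250):
    """Return (boxes, box_cost, drive_cost, total) for the cheap approach."""
    raw_per_box = drives_per_box * drive_gb           # 1200 GB raw
    usable_per_box = raw_per_box * 2 // 3             # 1/3 lost to redundancy
    boxes = math.ceil(capacity_gb / usable_per_box)   # whole machines
    box_cost = boxes * box_price
    drive_cost = boxes * raw_per_box * dollars_per_gb
    return boxes, box_cost, drive_cost, box_cost + drive_cost

boxes, box_cost, drive_cost, total = linux_array()
print(f"{boxes} boxes: ${box_cost:,} systems + ${drive_cost:,} drives = ${total:,}")
```

The unrounded figures land within a few hundred dollars of the $16K, $75K and $91K quoted above.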
You'd need some software work done -- an efficient, hierarchical distributed filesystem. I didn't factor this in, which you could consider not fair, but there may be something like this already, and if not, it's a one time cost for the whole world.
Maybe another few systems up near the head of the array to do caching and speed things up, and you still aren't even up to $150K, and you have failover (at least for each one-drive-in-three group).
I haven't looked at this -- it might be smarter, since you'd want to do this hierarchically, to have caches existing within the hierarchy, or maybe Gbit Ethernet at the top level of the hierarchy. And obviously, this may not meet your needs. But as for whether it's possible to build something like this for that much money? Sure, I'd say so.
Finally, existing SANs or any sort of network-attached storage are overpriced, no two ways about it. Very, very healthy profit margins there. Sooner or later, someone is going to start underselling the big IT "corporate solution providers" and is going to kill them unless they trim margins by quite a bit.
May we never see th
Realize that when you're talking about this 50TB storage, you have to ask: are you talking 50TB raw, or are you going to be mirroring these disks, maybe striping, all of which is going to increase your needs.... Oh yeah, what about backing it up? Ever try to spool a 30TB database to tape? Hope you have 28 hours in a day. Think split mirrors for backups. If you need 50TB of usable space, I'd triple that number at LEAST.
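To put a number on the backup-window problem, the sketch below computes the sustained throughput needed to spool a given amount of data in a given window, and how many tape drives that implies. The 15 MB/s per-drive rate in the example is purely a placeholder assumption, not a figure from this thread, and both function names are made up.

```python
import math

def required_mb_per_s(data_tb, window_hours):
    """Sustained throughput (MB/s, decimal units) to move data_tb in the window."""
    return data_tb * 1_000_000 / (window_hours * 3600)

def drives_needed(data_tb, window_hours, drive_mb_per_s):
    """Whole tape drives needed in parallel at a given per-drive rate."""
    return math.ceil(required_mb_per_s(data_tb, window_hours) / drive_mb_per_s)

rate = required_mb_per_s(30, 24)
print(f"30 TB in 24 h needs about {rate:.0f} MB/s sustained")
# 15 MB/s per drive is a hypothetical rate chosen only for illustration.
print(drives_needed(30, 24, 15), "drives in parallel at 15 MB/s each")
```

Swap in the real streaming rate of whatever drive is being considered; the point is only that a full spool inside a day needs a lot of hardware running in parallel, which is why split mirrors get suggested instead.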
I'm an AIX Systems administrator, and yes I do cry myself to sleep at night....
So, would IDE really be that bad? Wouldn't it be better to put together a Beowulf cluster of smaller databases, each tasked with a portion of a search? Intelligent distributed processing is a much faster way to do a query of a database. If you have some large (but not unmanageable) number of nodes, let's say 50 (one per terabyte), with backup nodes extending it to 64, any few failures would be correctable at full load.
I know that I don't have the skillset to put together the 50 Terabyte database right now, but I really believe that I could do it in less than 1 year, with half the budget, assuming free telecom to backup sites.
--Mike--
He is a troll
how many punch cards in a terabyte? and how much space would that take up?
-- sigs suck --
Many companies are selling IDE-based RAID boxes. For instance, take Nexsan's ATABoyII (Sponsored link). It has 1TB of usuable storage (hardware RAID-5 with hot-standby disk and battery-backed cache, redundant power supply and fans). The FCAL version is I believe close to $18,000 each. So 50 TB would be $920,000. Then you need FCAL switches, fiber optic lines, and a few servers to serve the data. Overall, including power conditioning, air conditioning etc, I think $20M is overkill. They're probably going with EMC or Hitachi, which have very nice to configure (GUI), mostly reliable but completely overpriced arrays, and in this case $20M looks right (including consulting fees).
An IBM Shark recovered from one of the buildings damaged by the WTC collapse, which had been soaked in water and had the power cut off, was powered up with no loss of data after 60 days.
Even the data in non-volatile storage (which is only guaranteed for 2 days) survived intact.
The quality of a piece of hardware is directly proportional to the price, else someone would make the equivalent part cheaper. This holds for:
Cheap         Expensive
-----         ---------
AMD           Intel
Soltek/Via    Intel
Memory        Certified ECC Memory
Maxtor        IBM
Dell          Toshiba
Anyone who has built a server with Intel or Intel approved parts can vouch for them being god-damn good things.
It is always the later (or hidden) costs that will byte you :)
It seems that this question is extremely dependent upon the kind of application.
Are you mostly reading, or also frequently writing this data? Are you searching or doing indexed lookups? Is this a nasty bandwidth hog or a trickle? Is this a zillion parallel transactions or only a few users? What kind of latencies are expected? What reliability is required? What access is needed to historical data?
Consider some concrete examples that are *very* different from each other yet could each total 50TB and would have very different solutions:
- Video-on-demand system for a Hollywood studio deciding that peer-to-peer pirate systems can only be beaten by a legitimate system that is better.
- Online credit card transaction system for, say, Visa.
- SETI data that needs to be collected and searched for messages from extraterrestrials.
- Particle accelerator data that needs to be collected at truly horrendous rates.
- Lexis/Nexis database.
- Google database.
- Echelon data.
- IRS data.
- "Dictionary attack" database for a lone crypto-analyst.
The possibilities go on and on. At the minimum a 50 TB database might be a small number of equipment racks with a single computer attached to them, all totaling maybe $100,000.
And on the other end, I can easily imagine a system where $200,000 of a much larger total might be spent for, say, a terabyte of DRAM.
I can easily imagine a system with less than $5,000 of battery backed up power supplies, and I can imagine a system with hundreds of thousands in generators.
This question has enormous dynamic range.
-kb, the Kent who would enjoy working out solutions for specific instances of this question.
is no longer accurate. You aren't banned anymore.
Lasers Controlled Games!
First, we have to get a few assumptions out of the way. Let's assume that we're dealing with IBM/Hollerith punch cards, so we can standardize. Now, let's pretend that these cards hold 80 bytes. The card's design predated ASCII, so while there were 12 punches per character, they didn't even represent as much data as a byte. (Follow the link for more.) So, rather than pretend they're 12 bits * 80, let's assume they were used for more or less today's equivalent of 80 bytes. And, in keeping with the hard drive manufacturers' "truth" in advertising, I'm going to assume that 1TB = 1,000,000,000,000 bytes.
Whew, now that we established those assumptions, 1TB of punch cards would be 12,500,000,000 cards.
Assuming the strict standard dimensions under which these cards were produced, we can say that this stack of cards would be ~1,322 miles high +/- ~99 miles. In terms of volume, these cards would take up ~43,025 cubic yards (+/- ~3,211 cu yds). Assuming roughly 100,000 cu yds of concrete for a major-league baseball stadium (seems to vary a fair amount by stadium design), it would take roughly 2.3 TB of punchcards to equal the volume of concrete in a stadium.
Aww, man, I realized I didn't subtract the volume loss for the diagonal-cut top left corner. Someone else can take that.
( As always, props to Google for my research. :-)
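For anyone who wants to check the punch card arithmetic, here is a small sketch. The post doesn't state the card dimensions it used; the 7.375 x 3.25 inch face and 0.0067 inch thickness below are assumptions, picked because they reproduce the figures quoted above. The 80 bytes per card, the decimal terabyte, and the 100,000 cubic yard stadium are the post's own assumptions.

```python
BYTES_PER_TB = 10**12   # decimal terabyte, as the post assumes
BYTES_PER_CARD = 80     # one byte per column, as the post assumes

# Assumed card dimensions in inches (not stated in the post).
CARD_LENGTH_IN = 7.375
CARD_WIDTH_IN = 3.25
CARD_THICKNESS_IN = 0.0067

INCHES_PER_MILE = 63_360
CUBIC_INCHES_PER_CUBIC_YARD = 36**3

cards = BYTES_PER_TB // BYTES_PER_CARD
stack_miles = cards * CARD_THICKNESS_IN / INCHES_PER_MILE
volume_cu_yd = (cards * CARD_LENGTH_IN * CARD_WIDTH_IN * CARD_THICKNESS_IN
                / CUBIC_INCHES_PER_CUBIC_YARD)
stadium_tb = 100_000 / volume_cu_yd   # TB of cards per stadium of concrete

print(f"{cards:,} cards")
print(f"stack height: about {stack_miles:,.0f} miles")
print(f"volume: about {volume_cu_yd:,.0f} cubic yards")
print(f"one stadium is about {stadium_tb:.1f} TB of cards")
```

The diagonal-cut corner is still ignored here, same as in the post.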
As long as you're here, I've got a question: Why do people buy systems like this?
I design and build software for a living, including stuff for banks, and I've been trying to imagine a system where I really need 50 TB in one place. Email for 10 million? Customer records for 50 million? A search engine for the entire web? For all of these, my designs would end up like Google: an array of cheap, commodity boxes that each are responsible for a portion of the data.
So is it that there are applications that really require this? Is it that some architects are used to drawing the one single "storage" icon and a $20 million bill isn't enough to make them say, "Gosh, is there a better way to do it?"
Or is it that the sysadmin costs and pain associated with maintaining 25 racks of gear make it worth coughing up for the centralized system in the long run?
I used to work for Lockheed Martin on EOSDIS for NASA. The target for the archive was 2 Petabytes. The cost? Something like $1 Billion over 10 years, just to build.
Let's see:
50 Terabytes, that 50 * 1024 = 51200 Gigabytes
* 1024 = 52,428,800 Megabytes
52,428,800 / 1.44 Megs per floppy = 36,408,889 floppies
Yesterday I saw a ten pack of neon floppies for $3.50 at WalMart, so at 35 cents per floppy that's
36,408,889 * $0.35 = $12,743,111.15
So, less than $13 million for all the storage you need! I'd like to see EMC match that deal.
What? You want disk drives with that? Oh....
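The floppy math above checks out. A quick sketch, using the post's own figures (binary units, 1.44 MB per floppy, 35 cents each); the variable names are just for illustration:

```python
import math

TERABYTES = 50
megabytes = TERABYTES * 1024 * 1024      # 52,428,800 MB, binary units as in the post
floppies = math.ceil(megabytes / 1.44)   # treats a floppy as 1.44 MB, as the post does
cost_cents = floppies * 35               # integer cents avoid float rounding

print(f"{floppies:,} floppies, ${cost_cents / 100:,.2f}")
```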
SpyDock: Scientific Python in a Docker container
get:
:) ,
Arena Indy 2600/fibre channel 16 Bay IDE Rackmount
-(16 IDE drives with a SCSI 160/fiber channel interface, 512MB PC133RAM)2+ GB/s transfer rate.
16xWD 8mb SE 160GB 7200 RPM IDE Drives
this gives 1.7Terabytes per enclosure.
use a MSI dual athlon motherboard
-(5PCI slots, dual 2000+ athlonXP, 4Gb PC2100DDR)
put four SCSI160/fibre channel controllers in with two channels per.
you need 24 enclosures, 6 per card, no problem running four loops of 6 enclosures.
enclosures:
$8000 per
x 24 = $192,000
drives
$161 per
x 384 = $62,000
fibre channel controllers and cables:
~$2000 per
4 x 2000 = $8000
server machine:
~$3000
(dual XP 2000+, 4GB ram, 40GB IDE drive for OS)
gigabit network card:
$300 (a good one)
$265,300
EVERYTHING x 2 for redundancy
$530,600
IF 2.5 series linux kernel supports it, you can run linux software RAID on the 24 enclosures and linux will see each enclosure as a single drive, but 2.4 series kernel has a 2Terabyte filesystem size limit.
otherwise, Windows2000 server, software RAID all 24 drives RAID 0 (redundancy is handled per drive enclosure, no server side redundancy is needed). I believe that NTFS and the windows kernel can handle more than 40Terabyte file system sizes.
two identical servers mirror each other with either windows or linux clustering and failover support.
you think that one gigabit network card is too little? how much PCI bandwidth is there available?
there would be a limit on the PCI bus in this situation but there is no way around it, other than waiting on AMD Hammer OR getting an Itanium2 which has multiple PCI buses.
ok, so $530,600 plus $2000 in software, plus setup costs of say $5000, plus three racks @ $5000 per, A/C units $5000
it could be done for about $557,600, give or take about $10,000. and this is with awesome redundancy, RAID 5 underneath RAID 1
max transfer rates would not break about 512Mb/second (PCI 33MHz x 32bit = 1056Mb/s, shared between the drive array and the gigabit ethernet card), but seek and read/write speeds would be incredible.
yes, ill prob get my *** kicked for a 3 page long post, but it's worth it.
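For reference, the bill of materials in the post above can be tallied like this. All prices and quantities are the poster's; the list layout and variable names are made up for illustration, and the drive line uses the poster's rounded $62,000 rather than the exact product.

```python
# Line items as quoted in the post: (description, unit price in USD, quantity).
per_site = [
    ("16-bay enclosure", 8000, 24),
    ("fibre channel controller and cables", 2000, 4),
    ("server", 3000, 1),
    ("gigabit network card", 300, 1),
]
drives_exact = 161 * 384    # 61,824; the post rounds this to $62,000
drives_rounded = 62_000

site_total = sum(price * qty for _, price, qty in per_site) + drives_rounded
redundant_total = 2 * site_total            # "EVERYTHING x 2 for redundancy"

extras = 2000 + 5000 + 3 * 5000 + 5000      # software, setup, three racks, A/C
grand_total = redundant_total + extras

pci_mbit_per_s = 33 * 32                    # 33 MHz x 32 bit, as in the post

print(f"one site: ${site_total:,}")
print(f"doubled:  ${redundant_total:,}")
print(f"total:    ${grand_total:,}")
print(f"PCI bus:  {pci_mbit_per_s} Mb/s shared")
```

Note that the enclosures, not the drives, dominate the total in this build.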
I think you forgot read / write specs. Allow the data to be pullable slowly enough and you could do this real cheap.
Seagate drives worked much better on the floors that we have here. It came down to the fact that Seagate drives handle the daily traffic much better than Maxtor, not to mention the fact that grout stuck to the Seagate drives MUCH better.
That would take WEEKS.
And then the 10 longshoremen making 80K to move these disks around.
I know that, being on slashdot a lot, you see a lot of people making cheap, shoddy, unreliable, and definitely not enterprise-class "solutions" out of string and tinfoil, but for a data warehouse application, that kind of cost is not unreasonable at all.
We have a somewhat-smaller situation at work, with a single Hitachi Lightning SAN providing our data warehouse nodes (two IBM p-series 680 servers) with a terabyte or so of fully-redundant fiber-connected disk. A single terabyte cost us nearly $750,000, and Hitachi bid competitively.
Enterprise-class solutions call for enterprise-sized wallets. Do not expect to slap together a few IDE drives and call it a day, unless you enjoy being fired.
- A.P.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
The problem is that I find that corporate spending on IT purchases has gotten ridiculous.
I find it's gotten too spendthrift, myself. We just got a Blade 2000 (anniversary edition) with only a single system disk. Why the OS of a 30,000 dollar machine is not mirrored is beyond comprehension, to me.
Let's buy a TEMPEST array! Let's buy something with a Sun nametag because the name sounds good!
No, let's buy those things because, if something in them breaks, the production payroll machine doesn't go offline. Or let's buy those things because, if something does break, I can have a tech on-site in 4 hours with a hot-swappable replacement part. Let's buy them because my customers (my users) won't notice the downtime while I pull a CPU module, PCI card, or disk and replace it without powering the server down.
Let's buy a $2k piece of software for each workstation even though there's a free alternative!
No, let's buy a $90,000 piece of software because it allows us to precision-machine aerospace parts more efficiently than hand-drawing the same models in two dimensions on a drafting board, or because we can run simulation testing on our airframe to see how much stress it can take before it destroys itself. Let's spend our money smartly to produce more revenue and profit.
Say we take the bare-metal, dirt cheap approach. Grab a bunch of Linux boxes.
I've seen horror novels with better beginnings...
Throw RAID on them
Apparently "throwing RAID" on something is good enough for enterprise-level.
and a 100Mbps Ethernet card in each.
This will work great on a network where every client is connected at 100/full, and the normal servers have fiber or gigabit uplinks. You may have gotten away with this in 1995, but it's 2002.
The figure used earlier was $1 per gig. Put 6 200 GB drives in each. Throw down $250 for the non-drive cost of each system.
$250 for the rest of the system? Motherboard, RAM, CPU, power supply (dual? Hah!), and case? Our AIX NFS servers have RAIDed MEMORY, not to mention at least triple the amount they'll ever need of that, CPUs, local disk, power supplies, and PCI expansion chassis.
You have 800GB of data on each system, 400GB of overhead. That's 63 systems. $16K for the systems, $75K for the drives, and we come in to $91K. I left out switches -- you'd need a couple, but certainly not $9K worth.
Yeah, you could just go down to CompUSA and pick up a few Netgear 8-ports. Nobody will ever need a VLAN. (The modules in our 6509s cost more than $9k.)
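The quoted commodity estimate is easy to reproduce. A rough sketch using only the quoted figures ($1 per gig, six 200 GB drives per box, 800 GB usable per box, $250 for everything else), with 50TB taken as 50,000 GB; the variable names are hypothetical:

```python
import math

usable_gb_per_system = 800    # 6 x 200 GB drives, 400 GB set aside as overhead
systems = math.ceil(50_000 / usable_gb_per_system)   # 50 TB taken as 50,000 GB

chassis_cost = systems * 250          # non-drive cost per box, as quoted
drive_cost = systems * 6 * 200 * 1    # $1 per GB, as quoted
total = chassis_cost + drive_cost

print(f"{systems} systems, ${chassis_cost:,} + ${drive_cost:,} = ${total:,}")
```

That matches the quoted $16K, $75K and $91K after rounding. Switches, software and labor are left out, exactly as in the quote.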
You'd need some software work done -- an efficient, hierarchical distributed filesystem. I didn't factor this in, which you could consider not fair, but there may be something like this already, and if not, it's a one time cost for the whole world.
Yeah, you could hack something together. Let us know how that goes.
Meanwhile, I'll be enjoying another day of outage-free administration, at least on the machines we built the right way.
- A.P.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
Regardless of quality, cheaper is always better, though. What part of that don't you understand?
Regardless of quality, cheaper is always better, though. What part of that don't you understand?
"Is". It's missing an apostrophe and two consonants.
- A.P.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
Bullshit. Expensive Sun servers crash all the time due to memory and CPU failures. The more CPUs and the more memory, the more chances for failure. These boxes do not have redundant CPUs and memory until you get into absolutely insane price levels. If you care about reliability, it is better to have truly independent machines, and let the software handle the redundancy. Sure, mirror the storage, because you know hard drives fail, and have redundant network interfaces to protect against a switch failure. But don't forget that a "high availability" E6500 is 22 times more likely to crash than a "workstation class" ultra 1.
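The "more components, more chances for failure" argument can be sketched with a simple series model. This assumes every component fails independently with the same probability and that any single failure takes the whole box down; both are simplifications. The per-component probability is a made-up illustration, not measured data, and the count of 22 is borrowed from the post only as an example, since the post doesn't say how its figure was derived.

```python
def system_failure_probability(p_component: float, n_components: int) -> float:
    """Chance that at least one of n independent components fails.

    Assumes no redundancy: one failed component crashes the system.
    """
    return 1.0 - (1.0 - p_component) ** n_components


p = 0.001   # hypothetical per-component failure probability over some period
single = system_failure_probability(p, 1)
many = system_failure_probability(p, 22)
ratio = many / single

print(f"22 components vs 1: about {ratio:.1f}x as likely to fail")
```

For small probabilities the ratio comes out close to the component count. Redundant or fault-tolerant parts break the "any failure is fatal" assumption, which is what the rest of the thread argues about.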
Why the OS of a 30,000 dollar machine is not mirrored is beyond comprehension to me
This is part of what I'm complaining about. Hardware vendors have sold users on expensive, heavily hot-swappable systems where they make huge profit margins. They work very hard to steer clients away from consumer-level stuff, where their profit margins are nearly nonexistent. If you're willing to make a system the fundamental unit of failure here, you can easily buy a $3K system with a second failover $3K system. Why pay five times as much so that you can swap out a CPU instead of just swapping out a whole system?
The whole measure-system-capabilities-by-dollar-value thing is what I'm objecting to -- your first response was "This is a $30K system".
No, let's buy those things because, if something in them breaks, the production payroll machine doesn't go offline.
I severely doubt that more than 10% of the people with TEMPEST systems actually need them. I was looking at one cluster of very overpriced and very underused set of TEMPEST workstations at a company a while ago. They would have been better off with some stock x86 machines.
hot-swappable replacement part
See above. It's much cheaper at this point to buy two consumer-level systems and let failover take over for one system than to buy a single high-end system.
No, let's buy a $90,000 piece of software because it allows us to precision-machine aerospace parts more efficiently...
The price I quoted was $2k. You're listing $90K, which is well into the vertical application market. There -- yes, you don't have much of an option. You need an airfoil simulator that does foo, baz, and bar, and there's only one vendor with it -- you pay for it.
I'm talking about buying horizontal market things like commercial variants of CVS, compilers, or other systems where there are very good free alternatives, yet companies persist in evaluating things based on price.
Apparently "throwing RAID" on something is good enough for enterprise-level
Who's to say that this approach is fundamentally flawed? Sun? IBM? Of course they're going to scoff -- they've got machines and service contracts to sell. A high-level IT person? They've been suffused in the "spend lots more to get decent quality" propaganda from said companies for so long that it'd be hard to get an objective viewpoint.
and a 100Mbps Ethernet card in each
This will work great on a network where every client is connected at 100/full, and the normal servers have fiber or gigabit uplinks
Notice that I mentioned having the front-end systems, the ones doing caching, have faster interfaces.
$250 for the rest of the system
For a file server, very little is needed in terms of CPU juice, or RAM (before you start screaming about caching, as mentioned above, I want a systemwide cache sitting at the front of this). Make the cache able to cache anything on the SAN, so that you're efficiently using your resources. Why would I need PCI expansion chassis or RAIDed memory? I've already listed everything every box needs, and I'm willing to bet that the number of RAM chips you've had suddenly and unexpectedly fail (for God's sake, this is solid state storage) is right up there with numbers of servers hit by lightning.
Nobody will ever need a VLAN. The modules in our 6509s cost more than $9k.
Why would I want a VLAN within my storage system? To the outside world, this is a single entity. For that matter, Cisco systems definitely fall into my "overpriced because IT will buy it because it sounds sexy" category unless you really need the few systems that they do that *no one else* can duplicate in functionality. You can run VLANs off a Linux box.
Meanwhile, I'll be enjoying another day of outage-free administration, at least on the machines we built the right way
As I said earlier, I never claimed that this is available out of box right now -- just that you can build something like this. And neither did I say that your systems are outage-prone. I do think that name brand systems are oversold on vague reliability promises. Is my RAM going to suddenly fail? No.
I've found that the primary reason that purchasers will spend their employer's money is the ASHF (Avoid Shit if it Hits Fans) syndrome. IT personnel are willing to make suboptimal purchasing decisions so that they have someone *else* to point to if something goes wrong. "Sun's supposed to fix that, not us." "This is a best-of-class component that failed."
Now to some extent, the corporate culture fosters this, but I just want to point out that every time I hear people bragging about the cost of the systems they administer, I wince and think about this.
My guess is that this is going to die over the next five years or so. At the moment, there's a glut of secondhand networking and serving systems available from dying dot-coms. Once that's over, though, you have companies in India and Eastern Asia that can't afford to waste the kind of money that US companies do on systems. So you get manufacturers (probably non-US) springing up to create low-cost systems that fill their needs, without the exorbitant profit margins. Eventually, as reputations become established, they'll start selling to US corporations trying to bring down costs and compete with those foreign competitors, and overpriced IT purchases will be a thing of the past.
Linux is part of the advance front of this -- it's cheap to set up, runs on cheap commodity hardware (whose manufacturers make very little profit per unit), and you can build fancy things on top of it. As a matter of fact, that's most of the reason Linux has been propelled into the business market at all -- not because a bunch of geeks think it's sexy to use (though it sure would be neat if that *were* the reason), but because the profit margins are in a more sane range.
Almost all products follow a process of starting out very expensive, becoming more common and understood, commoditization, and eventual drop of profits to near zero. And once a product has reached the end of this process, bringing the price back up is very, very hard.
May we never see th
Expensive Sun servers crash all the time due to memory and CPU failures.
Really? Hadn't noticed. If by "all the time", you mean our E420 with an ecache parity problem (yes, this is a known issue with that series of CPU) which used to go down once a week until I took the faulty CPU offline from the command line, then, yeah, it crashed all the time.
The more CPUs and the more memory, the more chances for failure.
You obviously haven't heard about things like ChipKill and ELIZA fault-tolerance initiatives.
You can run a machine with a bad CPU for months without worrying about it, and bad memory modules can now be cycled out of use without even causing memory access violations.
don't forget that a "high availability" E6500 is 22 times more likely to crash than a "workstation class" ultra 1.
What are you smoking, and where can I get some?
- A.P.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
You know that I am talking about commonly available multi-CPU systems, and not exotic (and insanely expensive) systems with redundant CPUs and memory.
What are you smoking, and where can I get some?
Do you seriously believe that an E6500 or similar system will not crash if there is a faulty CPU? Despite your impressively low slashdot UID, if you believe this, you have virtually no experience with such systems.
What were they doing with them?
Basic numerical analysis.
voltage spike that makes it past the UPS
I've yet to see a spike damage even a system on a cheap surge protector, much less a nice UPS. I *have* seen surges over POTS lines damage equipment, though. Come to think of it, my neighbor's house was hit by lightning at one point, knocking out her modem...yet leaving her computer intact.
Side note: You would use g++ as a compiler for your product? The code it produces is about as efficient as a fully-loaded Excursion full of fat chicks
Not anymore. Take a look at the code that a gcc-3.2 build puts out...It's light years beyond the 2.7 and earlier era, the time that built gcc such a bad rep. It's competitive with the better compilers out there now (at least in generated code...Sun's C++ compiler compiles more quickly). Oh, and the good code generation is on the x86 -- never tried comparing recent builds on SPARC or PPC.
Case in point: we broke IBM AIX 5.1 a few months ago
So you're asking me both to believe that this had to be fixed immediately (as in, whatever you were doing before you broke AIX 5.1 was no longer an option) and that Linux wouldn't have been fixed quickly (and while there probably are issues that have taken a while to fix, I tend to see patch times that beat competing OSes).
They'll meet the ethernet card full-on and be very disappointed at what they see
You're talking raw streaming of a huge sequential series of reads, which may or may not be an issue here -- but that's beside the point. You're leaving out the possibility that data could be interleaved across different machines to avoid exactly this issue. Do it in software, I say -- it's cheaper.
simply because it can do one or two of the things a real piece of networking gear can do
Okay, I'll bite. Short of sheer mass bandwidth that you absolutely require custom hardware for, like a backbone provider, what specific features are you complaining about the lack of?
I've found they spend more in the short term to save more in the long term. If you think doing something the right way is expensive, try doing it the wrong way.
I agree that doing something the wrong way can be more expensive -- I'm just not sure that saving money necessitates "doing it the wrong way".
May we never see th
Don't you think people don't upgrade their systems or implement new ones?
We aren't talking about SANs any more? There isn't a lot of reason to be dropping new OSes or new servers on components of the SAN.
Just as I agreed that there *is* a justification for vertical-market applications, I'm not saying that every copy of AIX should be purged. I just think that items like these are frequently sold in situations where they are not needed. That doesn't mean that they're never needed. I don't claim that Linux is the best alternative if you're using, say, oh, a system that needs to dump process info from the kernel very frequently -- Linux's /proc is inefficient. However, there's also a silly perception that unless something costs excessive amounts of money, it must not be up to par with the competition.
I am beginning to wonder if you have ever worked in a company with more than 100 employees.
Well, you're definitely wrong in the literal sense. :-) However, I suspect what you meant was "I don't think you've ever been in IT handling thousands of users", which you would be correct about -- I'm an engineer, not an IT person.
Which would explain some of the different focus here -- you're complaining that given a list of options from different providers, no one currently gives you what I'm talking about. My interest is in adding another option to that list -- whether it's possible to create a new option for the prices being talked about.
May we never see th
You know that I am talking about commonly available multi-CPU systems, and not exotic (and insanely expensive) systems with redundant CPUs and memory.
Dell PowerEdge 4600. Supports Chipkill and 2 CPUs. $3k.
Or is this still too expensive?