The 1-Petabyte Barrier Is Crumbling
CurtMonash writes "I had been a database industry analyst for a decade before I found 1-gigabyte databases to write about. Now it is 15 years later, and the 1-petabyte barrier is crumbling. Specifically, we are about to see data warehouses — running on commercial database management systems — that contain over 1 petabyte of actual user data. For example, Greenplum is slated to have two of them within 60 days. Given how close it was a year ago, Teradata may have crossed the 1-petabyte mark by now too. And by the way, Yahoo already has a petabyte+ database running on a home-grown system. Meanwhile, the 100-terabyte mark is almost old hat. Besides the vendors already mentioned above, others with 100+ terabyte databases deployed include Netezza, DATAllegro, Dataupia, and even SAS."
No porn collection jokes please.
Oh wait, that was petabyte...
I had been a Porn Collector for a decade before I found 1-gigabyte Porn Collections to write about. Now it is 15 years later, and the 1-petabyte barrier is crumbling.
Since 500GB drives, this has been a reality. A couple of companies started selling petabyte arrays at about the time those drives were established.
Petabyte DBs are old news to techie porn collectors. They always mix their two favorite subjects into one. Tech + Porn = Petabyte+ Porn Database
I have to find my kid. Last time I saw her, she was with her Uncle Micky while he was having his morning martini.
They have many towns now with less than 50k people completely photographed, every street in high res. That has to be well over 1-petabyte, though I doubt it's all in one location, must be distributed?
How many Libraries of Congress are necessary to break the 1-petabyte barrier ??
@neonux
Hard drives keep getting larger. Hard drive consumption keeps getting larger. How much larger it keeps getting really isn't all that impressive.
Petabyte? I never touched a byte!
http://www.clango.org
Take a look at almost any large financial firm. The email retention system alone is much larger than a petabyte, and that's just dealing with the online media, not including what's spooled to tape. Due to deficiencies in RDBMS systems, each of the large firms usually develops its own system for managing the archive on top of the database.
It's not how big it is, but how you use it. :)
Call me old fashioned, but I don't see why anyone but a search engine like google would need anything like a petabyte. You can have only so much useful information about anything. Sounds to me like, fill your garage with sh1t, build a bigger garage.
... DB design and old data that should be purged. Color me unimpressed.
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
I remember encountering a 1+ petabyte database 10 years ago: it was the database to record and analyze particle accelerator experiment data at CERN. And it was built using a commercial object database - not relational. Oh but wait - the relational vendors have told us that OO databases don't scale....
That was ten years ago.
Google Maps' database is far bigger...
A base of 8 tiles, with each becoming four more smaller tiles, in two modes (map/satellite), and 16 zoom levels.
Each tile is approx. 30kB.
(((0.03* (8 * (4^16)))/1024)/1024) == 983.04TB right there.
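For anyone who wants to check that estimate, here is a rough sketch of the same formula in Python. It only mirrors the commenter's arithmetic: the 30 kB tile is rounded to 0.03 MB, only the deepest zoom level is counted, and the function name is made up for illustration.

```python
def deepest_level_size_tb(base_tiles=8, splits=16, tile_mb=0.03):
    """Size in TB (1024-based) of the deepest zoom level only."""
    # Each tile splits into four smaller tiles, `splits` times over.
    tiles = base_tiles * 4 ** splits
    # Convert MB -> GB -> TB using the same /1024 steps as the comment.
    return tiles * tile_mb / 1024 / 1024


print(f"{deepest_level_size_tb():.2f} TB")  # prints 983.04 TB
```

Counting both modes (map and satellite) would double that to roughly 1.9 PB under the same assumptions.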
My calculator doesn't handle numbers big enough for streetview. O_O
No, but I did throw granola at a deaf person once
Gigabyte barrier. Petabyte barrier.
In what sense are these barriers? Does the database resist putting more data in it the closer to a petabyte you get? Is it likely to explode once it reaches 1 petabyte?
... but I do wonder if you've ever heard of Sarbanes-Oxley.
Run and catch, run and catch, the lamb is caught in the blackberry patch.
... we'll need an army of Chris Hansens and a mountain of beartraps. God help us.
Any organisation that wishes to be classed in any way professional knows that the value in its databases has to be protected. That requires them to have the means to recover the data if something bad happens. A hot-mirrored copy is simply not good enough (one corruption would get written to both copies).
As a consequence, the size of commercial databases is limited by the amount of time the organisation is willing to have it unavailable while it is restored, in the case of a disaster, or the time taken to create/update secure, offline, copies.
Not by intrinsic properties of the database or host architecture
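To put a number on the restore-time argument, here is a minimal sketch. The 1 GB/s sustained restore throughput is purely a hypothetical figure chosen for illustration, not something measured or quoted anywhere in this thread.

```python
SECONDS_PER_DAY = 86_400


def restore_days(size_bytes, throughput_bytes_per_s):
    """Days needed to stream a full restore at a constant throughput."""
    return size_bytes / throughput_bytes_per_s / SECONDS_PER_DAY


PB = 10 ** 15  # decimal petabyte
GB = 10 ** 9   # decimal gigabyte

# Assumed rate: 1 GB/s sustained, end to end (hypothetical).
print(f"{restore_days(1 * PB, 1 * GB):.1f} days")  # prints 11.6 days
```

At that assumed rate a full petabyte restore means well over a week of downtime, which is the kind of limit the comment is pointing at.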
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
Imagine having tens of millions, or even just millions, of users, all of them with their records, history, and targeted-ad data. Or some mail provider that stores attachments in a database. Or a file sharing service like those you and I know. That's plenty of information to manage. Add overhead, and it's easy to overfill even the biggest database.
Also I agree with you that bad design might be a concern. Of course there's no big database that couldn't get on a "purge" diet.
Now seems to me we might have a problem with querying such a big bucket of random data. Imagine a query taking months to complete. We're gonna be there in another ten years.
And then we lose the capacity to make electricity. And we can use our CDs, DVDs, let alone magnetic media to... well, dig trenches.
Those pesky petabytes of data are going to doom us.
Plain old sigh.
The LHC will generate several PB of data per year, as will the Large Synoptic Survey Telescope. These projects aren't all that uncommon.
"Seven Deadly Sins? I thought it was a to-do list!"
My porn collection has long since achieved infinity.
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
That is all.
The world will only need 5 large databases.
None of them will ever need more than 640KB^H^HMB^H^HGBMB^H^HTB of RAM and 32MB^H^HGB^H^HTB^H^HPB of storage.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Wow....I think somebody's PB database got too close to a magnet, without a tinfoil hat.
War is the statesman's game, the priest's delight, the lawyer's jest, the hired assassin's trade.- Shelley
http://labs.google.com/papers/bigtable.html
Apparently nobody caught the Wired article on this a couple of months ago?
The Petabyte Age
I think that's obvious... with 1-terabyte hard disks now widespread, reaching a petabyte is quite easy, even for a midsize organization. Where I work, we build our own disk arrays, and reaching 1000 terabytes is just a matter of putting together a few thousand disks, not a big deal.
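A quick sketch of the disk count, for the curious. The RAID-6-style layout (8 data disks plus 2 parity disks per group) is an assumption for illustration only; the comment does not say what redundancy scheme is actually used.

```python
import math


def disks_needed(usable_tb, disk_tb=1, data_disks=8, parity_disks=2):
    """Total disks for a usable capacity, built from fixed-size parity groups."""
    # Number of groups needed to cover the usable capacity.
    groups = math.ceil(usable_tb / (disk_tb * data_disks))
    # Every group carries its parity disks as well as its data disks.
    return groups * (data_disks + parity_disks)


print(disks_needed(1000))  # 125 groups of 10 disks -> prints 1250
```

With no redundancy at all it would be exactly 1000 disks, so "a few thousand" leaves room for mirroring, spares, or smaller drives.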
WalMart's data warehouse is already 4 petabytes: http://storefrontbacktalk.com/story/080307walmart.php
Just because I can hook a shark from a boat, I do not offer to wrestle it in the water.
Is the location of IBM's Managed Storage Services (MSS) division, which deploys SAN for customers in Boulder (including IBM internal) and other locations (over high speed fibre links) on IBM "Shark" (ESS) and DS6000/DS8000 devices. When I worked at IBM their marketing materials stated they were managing over 4 petabytes of data for enterprise customers out of that location alone - that was four years ago! That doesn't count for other MSS locations either, nor all the other areas where IBM implements large amounts of storage for customers. Remember, many if not most of IBM's customers are governments and Fortune 100 companies, particularly high finance. I think they've got some data.
So you want to talk about high levels of storage - IBM has the game covered, considering they invented the HDD.
How much of that data is marketing information?
seriously, is all of that data current and necessary?
seems to me that they should prune off and backup old data.
They're using their grammar skills there.
I need measurements I can understand, like how many Keanu Reeves' brains is a petabyte? And could he hold it indefinitely, or would his head explode at some point? If the latter, can we get him started on it now?
Vincent J. Murphy
Spandex Justice
No no, if your Chia Pet is a-bitin' you're most likely doing something wrong.
Just add water, and sing along: Ch-ch-ch-chia!
Okay, I know that the article is referring to databases, but the comments seem to have drifted toward disk storage, so I will take the bait and go off topic.
Petabyte drives would not really be that impractical for people who like to archive stuff. I just filled up a 300 gig drive and a 750 gig drive with just stuff off of the DVR in under a year. While National Geographic HD may be compressed so badly that it barely looks better than SD, and a one hour show is under 2 gig, try archiving something with a higher bandwidth. For example, I recorded the Olympics, and saved the opening and closing ceremonies and all gymnastic events. A single 4 hour day saved is around 40 gig.
So, let's think media server for HD material. Let's just stick with HDTV for a while. Let's say that I want to archive a Blu-Ray disc on a media server. Let's say, for the sake of argument, that the movie takes up all 50 gig of the disc. Ten movies, 500 gig. 100 movies, 5 terabytes. 1000 movies, 50 terabytes.
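The same scaling in a few lines of Python, as a sketch. It assumes every movie fills a 50 GB disc, as the comment does, and uses decimal units (1 TB = 1000 GB); the function name is just for illustration.

```python
MOVIE_GB = 50  # full dual-layer Blu-ray, as assumed in the comment


def library_size_tb(movie_count, movie_gb=MOVIE_GB):
    """Total archive size in decimal terabytes."""
    return movie_count * movie_gb / 1000


for count in (10, 100, 1000, 20_000):
    print(f"{count:>6} movies: {library_size_tb(count):g} TB")
```

The last row is the interesting one: under these assumptions it takes 20,000 full discs to reach 1000 TB, i.e. one petabyte.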
Now let's say that we are an IMAX theater, and upgrading to the new Imax Digital standard. I read not too long ago that an Imax film is equivalent to 18K (most digital theaters project 2K, although some are now installing 4K systems). So, to keep from having these big massive films around of the 20 year old science documentaries that we keep in rotation, we get the digital versions of these. Does anyone want to do the math?
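Taking up the invitation, here is a rough sketch of one part of the math. It assumes the "K" figures are horizontal pixel counts and that aspect ratio, bit depth, frame rate, and compression all stay the same, none of which the comment specifies, so it only shows how pixel count scales.

```python
def pixel_ratio(width_a_k, width_b_k):
    """How many times more pixels per frame format A has than format B.

    With a fixed aspect ratio, height scales with width, so the pixel
    count scales with the square of the horizontal resolution.
    """
    return (width_a_k / width_b_k) ** 2


print(pixel_ratio(18, 2))  # 18K vs 2K -> prints 81.0
print(pixel_ratio(18, 4))  # 18K vs 4K -> prints 20.25
```

So under these assumptions an 18K frame holds about 81 times the pixels of a 2K frame, and storage for otherwise identical footage would grow by roughly the same factor.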
I am waiting for the day when neural implants can actually read the human brain, and as such, you can archive experiences to some type of storage medium. I am sure Wikipedia has somewhere how much information the human brain processes a second. Now, I am sure we will find a way of compressing stuff, we can already do audio and video, so I am sure one day we will have the ability to compress smell, taste and touch, granted that we actually have a way of capturing these. Still, the amount of data would be massive, and will probably be a whole new avenue for the Porn industry.
Granted, these are extremes, but who would have thought 15 years ago when we first started hitting the 1 gig barrier, that in 2008 we would have discs used for storing movies that have a capacity of 50 gig, and we would even consider saving stuff at a resolution of 1920x1080 and have PCM sound at a bitrate of 4.6Mbps?
Give us the storage space, and we will find a use for it.
...on virus-infested Windows PCs.
"When information is power, privacy is freedom" - Jah-Wren Ryel
My database will never reach 640K.
The hard drive was invented by Chuck Norris and he gave IBM the permission to use it. The petabyte is just a keyword Chuck Norris uses to describe the way he can take down Johnny Lawbreakers with his teeth.
Is if this will run on Linux
We've had petabyte databases on mainframes for a good couple of years. DB2 v9 on zSeries has two new tablespace types that make managing these humungous databases much easier.
So it may be news for the PC world but it's bordering on ancient history on IBM mainframes.
Sigs. We don't need no steenking sigs.
He meant that the terabyte barrier (not the gigabyte barrier) was broken fifteen years ago, correct?
Seems that Yahoo made this claim months ago but for a 2 petabyte database. The article goes on to list a couple of others that have more than 2 petabytes of archived data. So it's safe to say that the petabyte data barrier has been broken for some time.
The article is a bit misleading, and the numbers (IMHO) are a bit sensational. Saying you have a 1 Petabyte database is all fine and good, but how are you measuring that? Total database size? Raw input data size?
I'm guessing it's the former, which skews the results. Any DBA knows that when you load data, you have to index it, maybe partition it, etc -- all of which lead to additional space allocation and overhead, inflating the total size of the database anywhere from 2-8x (or more) the original size.
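As a sketch of what that inflation implies, the snippet below works backwards from a 1 PB total to the raw input size. The 2x and 8x factors are the ones quoted in the comment; treating 1 PB as 1000 TB and the function name are assumptions for illustration.

```python
def raw_data_tb(total_db_tb, overhead_factor):
    """Raw input size implied by a total database size and an inflation factor."""
    return total_db_tb / overhead_factor


for factor in (2, 8):
    print(f"{factor}x overhead: {raw_data_tb(1000, factor):.0f} TB of raw data")
```

In other words, a "1 petabyte database" measured by total size could hold anywhere from about 125 TB to 500 TB of original data within that range.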
And, there are databases out there that actually compress the raw data size, and make it MUCH smaller than it was originally (and they perform WAY better than the DB's in the article).
Nothing exciting here.
One thing to keep in mind in this whole argument is the system capacity of a given data warehouse deployment vs the amount of data being actively analyzed. As an example of this, MySpace has a ~400TB data warehouse with roughly 120TB of user data. Not sure if some of the references above are counting "active" vs "capacity". http://www.asterdata.com/
I work in big storage. We have customers who want support for petabyte-sized _files_. I know of at least one company that was looking to buy several PB of SAN a month.
The NOAA lab I work at is up to 16 petabytes now. Must've broken the 1 petabyte mark several years ago. :P
So when active, the Large Hadron Collider will generate the equivalent volume of data of 50 Libraries of Congress every second.
This is kind of my point. Do companies keep libraries of pr0n, video, music?
The video production team of Girls Gone Wild does.
640 terabytes ought to be more than enough for anyone. There, now you know how big your HD will need to be to qualify as Vista Capable.
The post surprisingly does not mention Aster Data Systems, which is the data warehouse behind MySpace. When web sites start to store and analyze every single user click, you quickly get into massive amounts of data. It's no surprise that the petabyte barrier is being reached, especially with the density of storage increasing at constant cost.
Example: the sound barrier. The aerodynamics of a moving airplane are completely different when traveling faster than the speed of sound, than when traveling slower, so it was a real barrier that required engineering effort to overcome.
Another barrier had to do with fabricating electronic components when the feature size became substantially smaller than the wavelength of light used to expose the masks. My old textbooks said it couldn't be done, but thanks to optical proximity correction and phase-shift masking, we can fabricate 45nm technology semiconductors with a 193nm ultraviolet light source.
But there is no radical new technology innovation needed to make a database just a bit bigger, even if an extra zero gets added to its size.
From the Greenplum article mentioned in the summary:
Ten years ago I was working in a bank that was dealing with a 4 petabyte database. In order to store check images electronically, they have to be retained for seven years, front and back.
It's probably a lot bigger than that now.
On the other hand, if you have the extra space, it invites the usual waste in the form of archive directories for closed-out years, development junk, etc. Spinning round and round, doing nothing.
Yep. That's exactly it. $200 today buys a 1 TB drive. $200 a few years ago bought a 1 GB drive. As the price has fallen the value of the HDD has risen relative to its cost. Those archive directories and development junk aren't being deleted because they have value. Sure, it's not enough value to justify keeping them around when a 1 GB drive costs $200, but they are worth keeping around when a 1 TB drive costs that much.
They aren't "doing nothing" - they just aren't doing enough to be worth keeping until the price drops enough.
All of this is making the 1 TB drive considerably more valuable than the 1 GB drive, despite their original purchase price parity. This is long-tail economics at work. As the individual bits become worth less and less, the value of the bits in total continues to rise, resulting in a completely new set of capabilities.
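The price drop behind that argument, as a small sketch using only the two price points quoted above ($200 for 1 GB then, $200 for 1 TB now). It assumes decimal units, so 1 TB counts as 1000 GB.

```python
def dollars_per_gb(price_usd, capacity_gb):
    """Cost of one gigabyte at a given drive price and capacity."""
    return price_usd / capacity_gb


then = dollars_per_gb(200, 1)     # the $200, 1 GB drive
now = dollars_per_gb(200, 1000)   # the $200, 1 TB drive

print(f"then: ${then:.2f}/GB, now: ${now:.2f}/GB, drop: {then / now:.0f}x")
```

A thousandfold fall in cost per gigabyte is what turns data that was not worth a slice of a $200 drive into data that is cheap enough to keep.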
My DVR is an excellent example of this - it's a thorough change in the way that I watch television. Suddenly, it's a family event that we can all share, because when I want to comment, I can just hit pause, and share my thought. Nothing's lost, if needed we can just hit rewind a bit, and suddenly, instead of being annoyed at my daughter for wanting to comment on a point during a televised debate, I'm excited and interested! No more SHUSHSTing at my family, it's now a much more shared experience.
The price of nonlinear access media has dropped so incredibly that marginal-value bits (like video) are suddenly cheap enough to make it all possible.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Teradata may have crossed the 1-petabyte mark by now too.
Sounds like it was precipitously named, then.
Check out the Emprise 7000. Scales from 1 TB to 1 PB
If you unracked it, you could squeeze it all into a single Volkswagen, yielding 100 LoC/VW (Libraries of Congress per Volkswagen).
There is no reasonable defense against an idiot with an agenda
:wq
The problem is whether the "stated carrying capacity" is stated in "fake GIGs" or real gigs. A recipe for unfortunate events if he got it wrong.
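For scale, here is a sketch of how far apart the two conventions are at the petabyte level. It assumes "fake" means the decimal unit used on drive labels (10^15 bytes) and "real" means the power-of-two unit (2^50 bytes, a pebibyte); the function name is made up for illustration.

```python
PB = 10 ** 15   # decimal petabyte, as printed on drive labels
PIB = 2 ** 50   # pebibyte, the power-of-two unit


def advertised_pb_in_pib(petabytes):
    """Convert an advertised decimal capacity into binary pebibytes."""
    return petabytes * PB / PIB


print(f"{advertised_pb_in_pib(1):.3f} PiB")  # prints 0.888 PiB
```

That is a gap of about 11 percent, noticeably larger than the roughly 7 percent difference between a decimal and a binary gigabyte.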
A horse can't be sick, you know, even if he wants to.
For $200 you could almost get two 1TB drives.
...And for all your, erm, "entertainment" collection sorting and comparison needs, you can use a little app appropriately called SizeMeNow to see how much space each folder uses:
SizeMeNow Info