The 1-Petabyte Barrier Is Crumbling
CurtMonash writes "I had been a database industry analyst for a decade before I found 1-gigabyte databases to write about. Now it is 15 years later, and the 1-petabyte barrier is crumbling. Specifically, we are about to see data warehouses — running on commercial database management systems — that contain over 1 petabyte of actual user data. For example, Greenplum is slated to have two of them within 60 days. Given how close it was a year ago, Teradata may have crossed the 1-petabyte mark by now too. And by the way, Yahoo already has a petabyte+ database running on a home-grown system. Meanwhile, the 100-terabyte mark is almost old hat. Besides the vendors already mentioned above, others with 100+ terabyte databases deployed include Netezza, DATAllegro, Dataupia, and even SAS."
I had been a Porn Collector for a decade before I found 1-gigabyte Porn Collections to write about. Now it is 15 years later, and the 1-petabyte barrier is crumbling.
I have to find my kid. Last time I saw her, she was with her Uncle Micky while he was having his morning martini.
How many Libraries of Congress are necessary to break the 1-petabyte barrier ??
@neonux
Take a look at almost any large financial firm. The email retention system alone is much larger than a petabyte, and that's just dealing with the online media, not including what's spooled to tape. Due to deficiencies in RDBMS systems, each of the large firms usually develops its own system for managing the archive on top of the database.
This is intended as a joke, I assume, but it also brings up the fact that a different sort of data is now being collected.
When I look at CRM systems, they used to contain basically the address and perhaps logs from calls they made to the call center. Now whole phone conversations are logged as well as faxes and letters that are scanned, together with images and video that is available.
Faxes and letters used to have only a reference number and you could look them up in a file cabinet.
So even though not that much more data is collected (these things were already available), it is now all put in the database. Where it used to be an entry 'customer was extremely angry and cursed a lot', it now saves the mp3 for all eternity (where legal).
So yes, the HD space it takes is bigger and thus the amount is bigger, yet that does not automatically mean that the range of data is bigger. E.g., do we suddenly have shoe size or other data available? Could be, but it could also be that we just have different file formats we now save in the database.
Don't fight for your country, if your country does not fight for you.
I remember encountering a 1+ petabyte database 10 years ago: it was the database to record and analyze particle accelerator experiment data at CERN. And it was built using a commercial object database - not relational. Oh but wait - the relational vendors have told us that OO databases don't scale....
That was ten years ago.
... we'll need an army of Chris Hansens and a mountain of beartraps. God help us.
Any organisation that wishes to be classed in any way professional knows that the value in its databases has to be protected. That requires them to have the means to recover the data if something bad happens. A hot-mirrored copy is simply not good enough (one corruption would get written to both copies).
As a consequence, the size of commercial databases is limited by the amount of time the organisation is willing to have it unavailable while it is restored, in the case of a disaster, or the time taken to create/update secure, offline, copies.
Not by intrinsic properties of the database or host architecture.
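To put rough numbers on that restore-time argument, here is a back-of-the-envelope sketch. The throughput figures are assumptions picked for illustration, not measurements from any real system, and `restore_hours` is a hypothetical helper name.

```python
def restore_hours(size_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Hours needed to stream size_bytes back at a sustained throughput."""
    return size_bytes / throughput_bytes_per_sec / 3600


PB = 10**15  # decimal petabyte
GB = 10**9   # decimal gigabyte

# Assumed, illustrative sustained restore rates (not vendor figures).
for rate_gb_per_sec in (1, 10):
    hours = restore_hours(1 * PB, rate_gb_per_sec * GB)
    print(f"1 PB at {rate_gb_per_sec} GB/s: {hours:.1f} hours")
```

At an assumed 1 GB/s sustained, a full petabyte takes about 278 hours (over 11 days) to restore. Even at 10 GB/s it is still more than a day of downtime.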
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
The LHC will generate several PB of data per year, as will the Large Synoptic Survey Telescope. These projects aren't all that uncommon.
"Seven Deadly Sins? I thought it was a to-do list!"
The world will only need 5 large databases.
None of them will ever need more than 640KB^H^HMB^H^HGB^H^HTB of RAM and 32MB^H^HGB^H^HTB^H^HPB of storage.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
http://labs.google.com/papers/bigtable.html
It has an event horizon and is actively acquiring porn on its own?
I drank what? -- Socrates
I need measurements I can understand, like how many Keanu Reeves' brains is a petabyte? And could he hold it indefinitely, or would his head explode at some point? If the latter, can we get him started on it now?
Vincent J. Murphy
Spandex Justice
Data mining is statistically based. The more information that's available to mine, the more accurate the results will be.
A minor quibble. I do data mining for a living. With most data sets, we end up sampling them down, because more data ramps up processing time faster than it improves accuracy. With most problems, more data doesn't improve accuracy measurably once the dataset has reached a certain critical mass. Simplistically, you don't need to flip the coin a billion times to figure out that it comes up heads 50% of the time.
It's a rare problem that we use more than 100,000 records for. They exist, but they're rare.
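The coin-flip point can be sketched in a few lines. This is only an illustration of diminishing returns using the textbook standard error of a proportion, not a description of any particular data mining workflow. The function names and sample sizes are made up for the example.

```python
import math
import random


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)


def estimate_heads(n: int, seed: int = 0) -> float:
    """Fraction of heads in n simulated fair-coin flips."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(n)) / n


# Illustrative sample sizes: each step is 100x more data.
for n in (100, 10_000, 1_000_000):
    print(n, round(estimate_heads(n), 4), round(standard_error(0.5, n), 5))
```

Each 100-fold increase in flips only shrinks the standard error 10-fold (0.05, then 0.005, then 0.0005), while the work grows 100-fold. That is the trade-off behind sampling a big data set down.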
I was taught to respect my elders. The trouble is, it's getting harder and harder to find some.