Dell Says 90% of Recorded Business Data Is Never Read
Barence writes "According to a Dell briefing given to PC Pro, 90% of company data is written once and never read again. If Dell's observation about dead weight is right, then it could easily turn out that splitting your data between live and old, fast and slow, work-in-progress versus archive, will become the dominant way to price and specify your servers and network architectures in the future. 'The only remaining question will then be: why on earth did we squander so much money by not thinking this way until now?'" As the writer points out, the "90 percent" figure is ambiguous, to put it lightly.
90% - just like the percentage of statistics that are made up on the spot.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
I could believe the 90% number. There is plenty of data sitting around in case it is needed. Some of it will be needed. Much of it won't be. How do you predict which is which?
Opportunity too good to pass up
It was just about then that one of my favourite bargain-hunting websites turned up a device called the CORAID EtherDrive. Take a look at the product range at CORAID, but don’t spend too long on it.
That's the same device from a story I submitted yesterday. I hope they don't plan on getting a Z-Series running ZFS.
My work here is dung.
Or at least I've certainly heard it before, that in large storage systems, the average number of times that a file is accessed during its lifetime is less than one. That is, some files are accessed lots of times, but most are never accessed.
That's certainly true of lots of the files I use. For example, I shoot a lot of digital photos and upload them to my computer. A few of them get a lot of use, the rest just sit there occupying some tens of GB of disk space (which is pretty cheap these days) and are never accessed except for the occasional migration to a new disk drive.
Which 90% though? Like the Coca Cola exec who remarked that he was pretty sure half of his advertising budget was wasted, he just wasn't sure which half.
People always bitch that they have to pay for Microsoft (or whatever) Office's features because they only use 5% of its functionality. But you buy all those features at once because you don't know which you will need in the future. Data warehousing is the same way. If you start taking data offline you'll just need that data. That's why analyses of very large data sets are performed before archiving.
But what is really wanted is a way to cluster the database servers, with old data automatically cycled to the slowest, most remote nodes, and with the most frequently-altered data heavily replicated and aggressively synchronized.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Wow, this percentage is the same as /. articles! Well, at least I assume - I haven't read the article.
It’s an odd statistic. How is that data measured? 90% of all documents? 90% of stored bytes? When they said “ever again” did they mean explicitly retrieved by name, or should we include free text searches in that statistic? How long an interval needs to pass before some piece of data is clearly identified as belonging to the 90%, so that steps can be taken to reflect its reduced importance?
Why is so much data being collected? They should go back and review what data they're collecting and why.
RIP America
July 4, 1776 - September 11, 2001
Interesting that this seems to have been written up as a "hardware" or "storage" topic.
The problem is, that IT people dream up all these "write only" applications that record data, without any rational plan for what the data might actually be used for in the business.
For example, some people worry about privacy when they go to the grocery store and know that all their purchases are being tracked by their loyalty card, or worry that the big bad US government is tapping all the E-mail.
In fact, I'm 100% sure that some IT geek had some wet dream years ago about recording everybody's purchases and E-mail and phone calls, and it's being done every which way.
The true "IT application" issue is that there is no real business need for this data 99.999% of the time. It gets recorded, probably gets staged off to tape, maybe indexed in some giant table, and then ... sits there for years with no actual need for it.
I'm sure the IT geeks who dreamed up the technical ability to record all this stuff, thought they were hot shit when they came up with it. Oh, man, those IT architects were just having a big go-round whipping this problem in scalability. In their heads, they were gonna record everything on disk, then go home and fuck the prom queen.
Automated Hierarchical Storage Management has literally been around for decades. It may be new-ish on low-end crap x86 servers, but for say, mainframe users, it isn't new at all.
What is new is the available implementation choices. When your tier choices are between enterprise disk and enterprise tape, you are biased towards keeping data on disk; there are still use cases for HSM with only high-end disk and tape, but they aren't as great. Now with lower-cost disk available, you have a cheap disk choice too, with fairly reasonable access time.
SirWired
...Dell/EMC have several products available to help with this problem.
Coincidence?
A perfect application for my patented write-only memory.
Only a small part of the human brain is active at any given time, but you just try to think without the rest of it...
I'm taking Business Systems units as electives to my degree and they always push the assumption that "more data is always a good thing because it could help you later in decision support etc.". I always found that assumption to be poorly grounded, guess I was right.
A fast SSD or 10,000 RPM'er for your OS or critical apps, and a larger 7200 or 5400 drive for all your other "media"? Personally I've been doing that setup since like 1997...
I imagine it's more a matter of "better safe than sorry" when management asks whether or not something should be kept.
Chances are, this "business data" is somewhat financially related. This also means there's a fair chance the government can/has/will tax something in those documents. And how long are we supposed to keep our records in case of audits?
I saw this over a decade ago when I was working as an IT consultant in the advertising industry. They regularly used only 5% - 10% of their information (and that's being generous). The systems I designed included a server for active work, an archive server for information used in the last 24 months, and then an archive solution (Magneto Optical at the time) that allowed for the information to be available, just not on demand. This idea has been working since for the clients that are still in business.
Businesses already know that most data is stored once and never looked at again. The simple solution would be to offer multiple locations to store data, one for frequently accessed data and one for archival, etc. The problem comes down to training. There are a lot of people who can barely use a computer, and the whole concept of multiple folders would confuse them. Another solution is to solve the issue with software. There are several archival solutions that will look at the file accessed date and move the file to either cheaper disk or even tape. The software leaves a stub file in place in the original location, and if a user tries to access the file, it will pop up a box saying "please wait while the file is restored". This solution is nice in that the users don't have to change how they save data, but it is harder to manage. You have your data spread across multiple systems instead of one, and backups could become harder. Overall, it just depends on which direction you want to go with your data and what makes the most sense.
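For what it's worth, the "look at the file accessed date" step is easy to sketch. This is only a rough illustration in Python, not how any particular archival product works: `find_cold_files` and its parameters are made-up names, and it only lists candidates rather than moving anything or leaving stubs. It also assumes access times are trustworthy, which is often not the case on volumes mounted with noatime or relatime.

```python
import os
import time
from pathlib import Path


def find_cold_files(root, max_idle_days, now=None):
    """Return files under root whose last-access time is older than max_idle_days.

    Hypothetical helper: it only reports candidates for a cheaper tier.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds per day
    cold = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime < cutoff:
                cold.append(path)
    return sorted(cold)
```

Calling something like `find_cold_files("/srv/share", 365)` (the path is just an example) would give the list a migration job could then act on.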
FINALLY !!! AN APPLICATION FOR THE WOM!!!! http://www.national.com/rap/files/datasheet.pdf Bob Pease sure was fore-sighted, since this memory chip was invented back in the seventies!
In my experience with small businesses, it may be never read but will absolutely need to be found for some type of emergency presentation/proposal.
Sig it.
This is one reason I like tape: The drives are expensive, but the tapes are $30-$50 (LTO-4 is $30 on mail-order). So having an autochanger moving all the rarely used data into storage is likely the most efficient way of moving data to long term archiving. Even better is making sure that 2-3 sets of tapes are used (one onsite, one offsite.)
Of course, hard disks by themselves may seem cheaper, but they are not a true archival medium. There are so many moving parts in a HDD and each of them (bearings, heads, spindles, motors, controller card) is a point of failure.
With HDD capacities starting to not grow as exponentially as they did last decade, it would be nice if tape companies would not just catch up with 2-3TB native tape offerings, but be able to offer drives at a lower price so home and SOHO users can use them for long term storage. I'm sure that if someone offered a consumer level tape drive for $500 with a decent capacity, that a lot of small businesses would buy it, especially if it came with decent backup software (Retrospect, Backup Exec, Amanda, bru, or another utility that is similar.) Since some tape drives are even bootable (some HP offerings have a section of the tape to emulate a boot CD or DVD), it would be ideal for bare metal recoveries even by nontechnical users. Pop in the tape, boot the machine, type in the encryption key, select where the data should be restored to, walk off for a bit and it is done.
Even though the SAN companies have said tape is going to die, until another form of media (perhaps super-inexpensive flash media [1]) is as reliable as tapes and can be put in the Iron Mountain case and sent offsite for safekeeping for decades on end, tape will be with us. Only optical comes close to tape for long term archiving abilities.
[1]: I can see someone make flash media that is semi-smart where it is put in a specific case, shipped to an offsite warehouse, and that warehouse plugs in the cases into 5-12VDC. Then over time, the circuitry on the flash drives periodically checks the stored flash media for damage or bit rot, corrects errors by rewriting blocks, and good blocks it would periodically move to ensure that there is a high signal to noise level on all media. Of course, this requires power, while tapes can happily sit in a climate controlled warehouse and be still recoverable.
But if you revised this to say, "Never accessed again a week after its creation", I'd believe it.
So why do all serious RDBMS systems have functionality for dynamically partitioning data based on the relevance of the data? Big databases are often set up to (for instance) have the last month's data on fast storage, and older data on slower/cheaper storage.
this helps me to be a better employee. From now on I'll only save 25% of the data I acquire, because the odds are the other 75% would only be needed 7.5% of the time. In other words, 92.5% chance not likely to be needed at all.
If you're talking about blog entries. Almost all of them (well, almost all of *mine* :-) are written once and never read, unless you count spiders as reading them.
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
...I didn't bother to read any further because I felt it was probably useless data anyway.
Over 92% of fire extinguishers will never be used, we could probably save a bit of space by having the unneeded ones stored off-site, or in less accessible corners of the garage.
Slightly more seriously, we can certainly answer this question posed by the linked article easily: "why on earth did we squander so much money by not thinking this way until now?" The answer is: because you are a moron. Anyone who has given even a moment's thought to storage has known this, either implicitly or explicitly, for a long time. So whoever's included in your "we," Steve Cassidy, is just profoundly stupid. I think that quite easily explains why you all squandered so much money by not thinking about this. Next question?
...at least 70% of the crap you store in your house isn't really needed, either. Do you really ever LOOK at the pictures hanging on the walls? Are you sure you're going to read every book you own, again?
-Styopa
Plan 9 from Bell Labs, an OS they released in the early 90s, had a file system for this. Hard drives as cache with WORM drives for bulk storage.
It did some interesting things. cd /2009/12/25/ puts you in the root of the file system as it existed last Christmas.
Why go to all the trouble of setting up two different systems for live data and archived data when you can spend half the money on just one system for both and more storage space?
Anyone who manages large systems knows that this is very true, yet the data piles up. I've often wished that databases would allow us to make a view or some other type of abstraction which would allow you to make the decision whether or not to join an archive table. Right now, everything needs to be handled on a program by program or query by query basis. Hey, maybe I should quickly patent this idea, then I can license it to Oracle. :)
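Something in this direction can at least be approximated with a plain view. Here is a minimal sketch using SQLite from Python; the table names (`orders_live`, `orders_archive`), the view `orders_all` and the helper `order_totals` are all invented for illustration, and a real system would presumably use the database's own partitioning features instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_live    (id INTEGER PRIMARY KEY, placed TEXT, total REAL);
    CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, placed TEXT, total REAL);

    -- Callers who want everything query the view; everyone else
    -- queries orders_live and never touches the archive.
    CREATE VIEW orders_all AS
        SELECT id, placed, total, 0 AS archived FROM orders_live
        UNION ALL
        SELECT id, placed, total, 1 AS archived FROM orders_archive;
""")
conn.executemany("INSERT INTO orders_live VALUES (?, ?, ?)",
                 [(3, "2010-06-01", 40.0)])
conn.executemany("INSERT INTO orders_archive VALUES (?, ?, ?)",
                 [(1, "2007-01-15", 10.0), (2, "2008-03-02", 25.0)])


def order_totals(conn, include_archive=False):
    """Return (row count, sum of totals), optionally including archived rows."""
    # The table name comes from a fixed two-way choice, never from user input.
    table = "orders_all" if include_archive else "orders_live"
    return conn.execute(f"SELECT COUNT(*), SUM(total) FROM {table}").fetchone()
```

The decision to include the archive becomes a single flag per query rather than a rewrite of each program, which is roughly the abstraction being wished for.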
to anyone aware of Sturgeon's Law. 90% of everything is crud.
You should have been. This isn't new.
Many of the SANs I've worked on automatically migrate data off to slower storage or even tape as it ages without modification. If you do a read on something that's offline, it has to get fetched from the tape juke and it takes a bit.
And if you didn't have that 10% that is eventually needed, you'd be totally screwed. Do we really need to play the 20/20 hindsight game every time somebody thinks of something like this?
Many businesses work with a customer file a few times and then never again - for example lawyers and realtors. I'd like to see a file system that will auto archive data and shift it transparently into long-term storage, and then transparently undo it when needed again.
Excuse me, but please get off my Pennisetum Clandestinum, eh!
I also wonder if +90% of all health insurance benefits go unused each year. And you probably have business data and insurance for some of the same reasons: it's better to have it and not need it than need it and not have it. amirite?
I only post comments when someone on the internet is wrong.
Is there a reliable metric as to which 10% will be needed again at the time the data is written? If not then I don't see what this buys us.
No statement is true, not even this one.
If the data was recorded by Dell computers... then yeah I would expect that 90% of business customers aren't able to read it back.
I create a lot of business data, and 90% of "never read again" or "never read again after 2-3 days" is not far from true.
However, the data serves a purpose. I frequently do searches on the data and you never know what you wrote months or years ago will turn out to be just the document you need. Keeping records for years instead of days has more than paid for itself in the long run.
Now, will these records be useful 5 or 10 years from now? Probably not except to an archivist or someone researching how we did business during 2010 and earlier. Or perhaps to a lawyer *groan*.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
There are some laws that say you must keep some data for a period of time, five years or more.
Even if you know that you almost certainly won't be using it in the future, the law is the law and you must obey.
At least 90% at Write Once Read Never.
Wonder if you could go into business archiving never-read data. I mean you could guarantee privacy....
"Waste not one watt!" - CZ
The German Chaos Computer Club (CCC) was addressing this with its draft law "Datenbrief". Businesses would have to report all types of data collected about a person to that person via slow-mail or e-mail on an annual basis. The time and effort spent on this would constitute a 'fee' for data and thus force them to hold as little data as possible. See http://www.ccc.de/de/datenbrief (German)
I wasted money on a dictionary that has tens of thousands of words but have only ever looked up a few hundred. I should have bought one that just had the words I would actually need.
.sig withheld by request
a) Forbid *unmanaged* storage of documents. If the question "where is the most up-to-date version of this document stored?" is systematically and easily answered, then people can delete the crap from their laptops.
b) Forbid in-company attachments to mails. If the last version can be easily found, including the revision history, a link to this revision is worth *more* than the current state of the document. Most of the space in my inbox is taken up by totally useless attached documents.
c) Forbid the use of formats unsuitable for storing a certain kind of information. (Where I work, they use PowerPoint/Word files for electronic forms.)
d) Provide a good archiving and backup service. Besides the quality improvement from using a service, this also prevents the 100th copy of some data being made in some unsystematic way (forbid this explicitly).
e) Thin clients. Store the data on a server. Deduplicate.
f) I would expect that most of the documents in a company can (and should) be stored in a database.
Dell has been doing this for our company the last 3-4 years now.
This is blinging
Made an order for 6 computers and received 11 in January. Returned the extra 5 and they refunded me for all 11, then took 6 months to realize their mistake.
All this after I called them trying to tell them about their error, and getting some script/screen-reading Indian who didn't understand me.
Imagine what it would have been like if the situation was reversed . . . yikes.
Fuck Dell. This is the kind of dumb thinking that will lead to their inevitable downfall. Welcome to Gateway Country Part Deux.
Back in the day I ran (operations manager) a very large mainframe shop with three supercomputers churning out enough numbers to consume several tons of paper each month. The scientists would make their way to the output distribution area each morning, pick up a 6" stack of 11"x17" paper, and flip to the last page to find an eigenvalue. All too frequently they'd shake their heads and say something like "Should have been higher" and drop the whole stack into the recycling bin strategically posted near the exit.
A purely statistical analysis might suggest that we have the trucks delivering the paper just drop it off at the recycling center, saving wear and tear on the printers, printer ribbon costs, and scientists' time, as they would no longer have to come by to pick up their output. Could probably have cut a couple of staff as well. And the loss of information would be negligible, statistically speaking. The scientists failed to see the humor in this, however, so we continued killing trees at an alarming rate.
Backup snapshots are wasting space 99% of the time!
I swear to God...I swear to God! That is NOT how you treat your human!
Well over 99% of all lifeboats are never used.
What matters is the cost/benefit ratio.
The potential for the data being valuable may be very low, but the cost of storing it is going down all the time. Disk space today is a dime a gigabyte, so let's keep it just in case.
Most people don't understand the nature of large amounts of data like that. They think "I want more, more, more" and never beyond that. Getting data is easy, getting useful data is far more important and for that you need to have your customers spend some time with the database where they can tell you everything that they don't need or want. Once you can confirm the accuracy of that information you can then purge your data of the clutter.
What people really fail to understand though is that getting rid of data is just as important. Unless you're dealing with something like scientific research data, or have a compelling legal reason (SEC etc), or another really good reason (manufacturing plans), then your data needs to have a planned lifecycle just like any other asset. You need to have a date for end of life for data (SQL data, documents, etc) just like you would for emails or other documents. As a rule of thumb, set up an end of life asset policy for your data, notify the stakeholders and users and from that point forward - every chance you have to destroy that data, do so.
If you destroy data when you had a subpoena, knew a subpoena was coming or knew a criminal investigation was coming, you can end up a felon. Any data that isn't destroyed can be used against you in a court of law. However - if your data is destroyed via policy on a given date and that destruction doesn't violate something like an SEC requirement, then you are safe. Yes, I do speak as someone that has at times been heavily involved in litigation (the technical expert that has prepared data for use in court and explained what everything means to lawyers) more than once.
Let's face it, in practice, you are making a backup every Friday of almost everything: database, SVN, CVS, builds, releases, etc... And there is a good reason for it, like a computer burning down or a sysadmin leaving without giving up the password ;).... But in reality, this backup data is almost never used. In my long, long practice I have never seen these backups actually needed; nevertheless, you just have to have them. Period.
And now that Dell's looked at the files, they've been read. There goes that theory.
UTF-8: There and Back Again
I have serious doubts about how they came up with that number. Data captured once can be stored in a data warehouse and analyzed and reused in many different ways for analytics and reporting, so I am not sure how they estimate that 90% of data is never used again (unless, of course they meant that it is not pulled up again on the frontend application side, which would still make no sense at all).
At our hospital, they have replaced the inpatient electronic medical records system at least 3 times in the last 20 years, and our data warehouse, which has been around for more than 15 years, contains a large percentage of that clinical data from the different (current & historical) systems. A lot of this data is still used pretty actively for retrospective research, recruitment of patients for clinical trials, operational and financial resource planning, forecasting, cost-accounting, etc. In other words, at our institution, most of our data is used all the time, but for different purposes.
Folks, hierarchical storage has been discussed in one form or another since the 70s (probably much earlier, but I'm not that old). Everybody and their mothers already have some implementation of archival media.
As for 90% of the data never being read, I beg to differ. Data is sliced and summarized many times in its lifetime (and sometimes those summaries need to be refreshed to include new dimensions or details), even if there's nobody really looking closely at the finest grain. But if you throw away the oft-unused detail, how can you re-summarize?
And one warning to all (mainly Dell): try to tell the judge that you deleted that important evidence of your wrong doing because it was "dead weight" and let me know how that goes.
Having said all that, vendors are apparently just recently becoming aware that there's a need for automated deprecation, for moving unused data to slower/cheaper storage and fetch it back efficiently when needed. From memory to local disk to network storage to slower/cheaper network storage to tape.
I work on communications and control systems for subway and light-rail.
A lot of stuff is recorded in case there is an incident or accident that they want to investigate. Even phone calls to the control center and radio transmissions are recorded. CPUC and FRA regulators come by, especially during construction and early service, and poke around, ask questions, pull records and so on.
There is a regulatory retention period. If nothing happens for that period, the stuff gets deleted. But a lot of minor stuff gets investigated. Supervisors check reports of safety rule violations and such.
I think financial auditing is similar. The auditor wants to be able to randomly select some set of transactions during the audit, but is not going to look at the totality of records.
On another note, I do remember a story about a system admin who worked for the legislature.
He was asked to destroy some backup tapes. Instead, he handed them over to the FBI.
http://articles.latimes.com/1988-09-23/news/mn-2790_1_legislative-counsel
In the case of IT, perhaps they should have whatever number of meetings it takes to come up with a written retention policy. That way, you are covered when you delete something according to written policy.
"We can't solve problems by using the same kind of thinking we used when we created them." -- Albert Einstein
I just answer the phone, ask them to hang on for a second, and then just deposit the phone somewhere silent, like my bedroom, and wait for them to hang up. My current record is 7 minutes and 49 seconds.
This makes it as expensive for them to call me as possible with me just spending 5 seconds of my time.
That would explain why Dell did not pay attention to the blown capacitor issue.
For the major app that I work on for my company, I would say that a lot of the data is write-only until something goes wrong. There is a lot of data that is recorded simply for auditing purposes. The system keeps a copy of every version of a form that it has seen and in ideal situations these data rows, and sometimes entire documents that someone has written, are not looked at again - they are there so that if a problem is found or a complaint made everything can be tracked down to the source and procedures updated (and/or wrists slapped) so the problem is less likely to happen again in future.
I suspect that less than 1% of data is read a week after it is generated. There will be a lot of information out there, be it full documents or rows of stats in a data table, that is generated, made available to people by some means, read (or just skimmed) once by those people, and then "filed" for future reference. It was nearly the same with paper based systems, why should it be any different for electronic storage - the ease of storing and searching through the data (assuming it is well indexed) encourages more data to be stored like this because you don't have quite the same logistical problems associated with massive paper filing systems.
And yes, it will affect how people purchase and use storage. It has done for years, at least for large databases (main active store and transaction logs on fancy drives in a RAID 10 array possibly of SSDs these days, archive data pushed off (by data partitioning inside the one DB or by actually migrating data to another DB) to a slower array of spinning disks, backups to tape and moved off-site) and home users (active content on one drive, gobs of video on recordable media - though with large drives as cheap as they are these days most people don't need offline storage unless they want a proper backup).
'The only remaining question will then be: why on earth did we squander so much money by not thinking this way until now?'
The reason is that for 90% of businesses, the software and processes that could actually manage migrating unused data off to where it's not on primary storage but still accessible are so expensive and complex as to not be worth it.
My company has about 10TB of corporate fileserver data. It's all sitting on SATA disks in a big NAS (well, used to be "big", nowadays it's "small"). Much of that data is important to the company, but may rarely be needed. While I could purchase a tiered storage system with fast FC drives for recently used data, slower SATA for nearline storage and tape for rarely used data plus software to manage it all, in reality it's cheaper to just keep adding SATA shelves to keep it all online. No one wants to pay for a librarian to manage Documentum or other such product to enable us to move unused data offline.
Plus there's the fact that personally, I don't trust tape for "offline" storage -- if it's not spinning and scrubbed, then it may not be readable. I still use tape for offsite disaster recovery, but would feel better if I could replicate data offsite.
In the article they don't even seem to advocate a tape backend, just SAS disks on the front end and SATA for nearline, but I don't see the point in that -- I've got a measured 98% read to write ratio -- SATA with a good NAS gives me more than enough performance for corporate fileserver needs, why would I want to pay the price premium to put a fast SAS cache in front of my SATA disks? SAS or FC gives me about 3X the IOPS at 8X the cost of SATA.
Perhaps in a large company it makes more sense to move to tiered storage, but in the 1000 user range, I just don't see the benefit.
Rate of access does not equal importance of data. How important are, say, dental records or DNA? To the majority of people, probably not too important. However, in law enforcement, they could be very important. The US military has DNA records on all of its members. However, unless you are dead and they are trying to identify your body, 99% of it is just stored and never used.
Medical records are stored and unlikely to be used on a regular basis. However, for someone coming into the emergency room at the local hospital with chest pains, quick and timely access to those records may be important.
What the author seems to be proposing, however, is that records be stored on the basis of how often they will be needed (needed frequently - high speed storage, once in a blue moon, slow or offline storage). In reality, data should be stored on the cost associated with it not being available when needed.
Using the medical example, it seems that patient data would have a high cost of not being available when needed (death). Payroll information, however, which is needed somewhat frequently, has a lower cost if not available (employee having to wait for the information). As such, the metric should not be on how often the data is accessed, but instead on how vital quick access is.
If you can't figure out which 10% you'll need later, you can't use this fact to cut down on your data storage.
Rather than using WORM (Write Once Read Many) storage perhaps Dell should invent Write Once Read Never and put 90% of their info on that. It should be cheap to produce, testing would be a doddle too :-)
while (true != false) process_more_stupid_code();
I developed custom manufacturing tracking systems until the market died (between ongoing 50 year exodus from the US and "Enterprise Solutions") and I tended to store the data in two sets for two reasons.
First, data retrieval for the end-user was faster in the live system if old data were kept elsewhere. Second, it made daily backups of 24/7 systems easier because there was less data to copy.
The "live" system kept recent data (for some companies that was measured in weeks, for others months). The "Archive" system kept it for years (often legally required). Data would be moved from the live to archival system if the last time it had been touched exceeded some time limit we set AND the data wasn't related to some material still sitting on a shelf or moving through the production process.
I doubt I would have abandoned the model despite the speed/storage improvements made over the years.
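The live/archive rule described above (idle longer than some limit AND not tied to material still in the plant) boils down to a small predicate. A rough sketch in Python follows; the `Record` fields and function names are hypothetical, chosen only to mirror the description, and are not taken from the original systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    record_id: int
    last_touched: datetime
    material_in_process: bool  # still on a shelf or moving through production


def should_archive(rec, now, max_idle):
    """Archive only if idle past the limit AND no related material is still active."""
    idle = now - rec.last_touched
    return idle > max_idle and not rec.material_in_process


def split_live_archive(records, now, max_idle):
    """Partition records into (live, archive) lists using should_archive."""
    live, archive = [], []
    for rec in records:
        (archive if should_archive(rec, now, max_idle) else live).append(rec)
    return live, archive
```

The second condition is the part that a purely age-based policy misses: an old record stays live for as long as the physical material it describes is still around.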
This is why systems like Hadoop are taking hold. Working with unstructured and semi-structured data is hard and there is a heck of a lot of it out there. The fact that TFA starts going into storage sort of misses the point--storing it isn't hard. It is the processing... and Hadoop can handle both quite efficiently.
There are a number of tools available to analyze how NAS is being used. Here's one free tool--I'm sure there are others, too. http://www.f5.com/products/data-manager/
1. Oracle already has that, under partitioning. If there's a column you can define intervals on, you can have your database partitioned like that. E.g., you can have the database sliced by year, and move the old tablespaces to another HDD.
Probably DB2 too, though I don't have that much experience with that one.
2. _But_ as Oracle itself points out, if you're doing it because of some delusions of gaining speed, you're doing it wrong. Even if 90% is never read, don't forget that in a well-indexed database, lookup time only increases with the logarithm of the number of records. So don't be surprised if dumping 9 million records out of 10 million into another partition results in much less speed gain than you'd think.
And frankly that's the most common reason I see invoked for partitioning. Someone who has no clue thinks that he'll gain the uber-speed by splitting the data. Some of them are even people from the IT department who should know better. Sometimes you can't move them from there no matter what, because the poor dumb beast already promised that to some PHB as the great performance optimization and would lose face if he admitted he was wrong.
And especially data which is, as in TFA, just dead weight and never read actually has very little impact.
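To put rough numbers on the logarithm argument, here is a small Python sketch. It assumes an idealized index: either a plain binary search, or a B-tree with a fan-out of 100 keys per node. The fan-out is an illustrative guess, not a figure from any particular database.

```python
def binary_search_steps(num_records):
    """Worst-case comparisons for binary search: ceil(log2(n)) for n >= 1."""
    return (num_records - 1).bit_length()


def index_levels(num_records, fanout=100):
    """Levels in an idealized B-tree where each node holds `fanout` keys."""
    levels = 1
    capacity = fanout
    while capacity < num_records:
        capacity *= fanout
        levels += 1
    return levels


for n in (10_000_000, 1_000_000):
    print(n, binary_search_steps(n), index_levels(n))
```

Under these assumptions, going from 10 million to 1 million records takes a lookup from 24 comparisons to 20, or from 4 index levels to 3. That is a modest gain, nowhere near 10 times faster.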
So basically do a proper analysis first.
- are you splitting the data thinking that 1/10 of the data will be 10 times faster to access? Think again.
- are you splitting because of HDD costs? How much data do you have there? I mean, sure, if you're Google or Amazon, it adds up. Otherwise, exactly how much more would the extra complexity cost you? Very few meals are free, and sometimes the extra couple of hundred or even thousands of dollars in just buying fast hard drives can be easily cheaper than the cost of such an overhaul or just the extra admin overhead in the long run.
- do you actually need all that stuff? I don't read that 90% figure in just number of records, but probably most of it is in columns or whole tables which really aren't needed for anything, but are dutifully stored out of some delusion that some day they might be important. Do you actually need all that trivia? Or would you be better off just dropping a few tables instead of partitioning?
E.g. just think about how many details some sites want to know about you just to let you download a patch for a game you've bought. And I don't mean to ship it to you, or check your credit card, but just to register.
Really, there is a difference between data even for mining and pointless trivia. As a trivial example: team's averages for the last season are data that can be used meaningfully, but "which team won the most games (as in, a whole two of them) on a rainy Tuesday night under artificial lighting" is trivia. The thing is so finely sliced that you're seeing statistical flukes. In the case of those registration sites I mentioned, demographics by age intervals are meaningful data, but by exact birthday is pure pointless trivia. Statistics by region are useful data, but statistics by street name and number are pointless trivia. No, seriously, you won't hit some jackpot that allows you to create a genre specifically for gamers on the even side of the road, nor specifically for people born on a day of the year divisible by 3.
At any rate, that's what the 90% figure is really about. And partitioning won't solve that. Even dividing the database between old records and new records won't change the fact that you still have 90% of any given partition consisting of trivia nobody really needs.
Or, hey, if you absolutely can't let go of any byte of trivia, how about just moving those tables wholesale onto a slower HDD? I mean, they're never read anyway.
A polar bear is a cartesian bear after a coordinate transform.
I work in a lab that does analysis of tissue samples from clinical trials. We're required by federal law to document an absurd amount of data. And electronic documentation for the most part isn't feasible with the way the tests are performed and the way the laws are written. End result is that I have to print out around 100 pages a day, just for myself, and initial and date documents several hundred times a day. Add in the other 15 analysts that work in my lab, and yeah, the amount of hand-written data is huge. All of this of course gets copied on a photocopier (extra physical copy stored in a separate offsite location), scanned into a pdf for electronic distribution, and then audited (and it's very rare for any errors in the audit to actually have any impact on the test results). The thing is that those pages and documents get boiled down to just a few numbers. From a 100-page stack of testing info, we typically derive about 20 numbers, and a margin of error for each. Chances of anyone going back to investigate any single sample is vanishingly small, and yet all the data and copies of that data have to be retained.
TL;DR - I can certainly believe that 90% figure, my personal experience would indicate it to be much much higher. I'd like to reduce the amount of data that needs stored; but the law prevents that from happening.
On identifying the 10% which will be needed ahead of time? I think the focus should be the opposite - to preserve MORE data and index it better. It's not hard to imagine that an additional 10% could have been used if made available at the point of need in a relevant format, effectively doubling productivity. How many employees in a company with 10K+ developers are still coding hashtables? Sure there are variations in languages and needs, but someone HAS already written JUST what you need, and if you had access to company-owned code with the same ease as browsing GPL code on Google, you would benefit.
The devil is in the details: figuring which part is the 90% that you'll never need again, and which is the 10% that will be needed. Some of that "write-only" data is stuff that companies are legally obligated to retain, some is CYA records that you hope you'll never need again. In both cases when the court, IRS, etc. orders you to produce the documents, you'd better have them.
Purge policy. This is not news, though the figure may be ambiguous. Any SA can tell you, if asked, how long the data has remained untouched. You see this in database backups, where they go un-queried for years. We give it 3 years, then it's gone. Storing data for 3 years isn't going to break the bank, per se.
What's ambiguous about 90%? It's inaccurate, not very precise, but completely unambiguous.
Probably 95% of the records are for compliance and things legal wants saved for CYA purposes. This is more a function of the legal environment, where everyone wants to sue every business that looks at them funny, and how courts expect tons of documents on everything you've ever done. It'd be an interesting analysis to see what the costs of excess records retention are compared to the legal losses, and more importantly, the losses consumers incur because they can't afford to fight well documented machines or what consumers lose because companies are under-documented.
http://www.accountkiller.com/removal-requested
It is like saying that 99.99% of backups are never restored. You never know in advance which 10% you'll need later.
massive files. zero content. that will get your useless content below 50% right there.
One advantage of being an old fart is that sometimes you can remember the way things were...back in the 70's. For some reason, during the last 2 weeks I've had the coincidental task of explaining what we used to call "Data Structured Systems Development", specifically for people who were concerned about data overload.
Ken Orr and Jean-Dominique Warnier used to say, "Data that is not used is not accurate." DSSD started from the outputs, and systems were designed to capture and use only data that were necessary for those outputs.
Research on information has pretty much debunked the idea that data and information are the same. Even back in the 70's, we had programmers who wanted to capture everything, "..because we might need it some day." The good news is that capturing tons of data allowed us to use statistics to streamline business behavior and do scientific and systems research on the behavior of the businesses. The bad news is that this isn't done very often, and data deteriorates; it never gets to make the transition from bits to information.
"The mind works quicker than you think!"
When I first got into enterprise computing 15 years ago, HSM (Hierarchical Storage Management) was making a comeback. At that time, the vendors were promoting it as something new in their promo cruft, but if you read the technical manuals, they claimed that it was an old idea that wasn't effective before but was now.
Here it is again, and I'm sure it had a resurgence in the middle as well, with NetApp and others coming online. HSM waxes and wanes every handful of years, but it ultimately only makes sense in very rigidly structured realms. It rarely works well for average users in an average mixed-use environment, and any attempt I've seen to implement it in a case like that has failed.
Now in the mainframe world, it used to work brilliantly. I used to work for a telco that had an IBM S/390 with HSM managing the storage. Data would be written to fast directly accessible disk, then gradually automatically migrated off to slower secondary disk, then to tape, and then booted out of the tape library. If you tried to access any file that had been ignored for 'x' and was moved to tape, you would get a pop-up message letting you know that it was being retrieved, and would be available in a few minutes. The tape would be mounted, the file retrieved to fast 1st tier storage, and you were in business. If you were looking at a file older than that, then you'd get a service call ticket number, the ops guys would get a 'mount this tape' request, and you'd get a notification when it was available.
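Roughly, the policy described above boils down to picking a tier by how long a file has been ignored, and picking a recall path by tier. Here is a minimal Python sketch. The tier names and day thresholds are invented for illustration (the real 'x' values were site policy), and none of this is IBM's actual HSM interface.

```python
from datetime import timedelta

# Hypothetical tiers, ordered fastest to slowest, each with the idle age
# at which a file leaves that tier.
TIERS = [
    ("fast_disk", timedelta(days=30)),
    ("slow_disk", timedelta(days=180)),
    ("tape_library", timedelta(days=730)),
]
OFFLINE = "offline_tape"  # booted out of the tape library


def tier_for(idle):
    """Return the tier a file should live on, given how long it sat untouched."""
    for name, limit in TIERS:
        if idle < limit:
            return name
    return OFFLINE


def recall_action(tier):
    """Describe what the user experiences when opening a file on this tier."""
    if tier in ("fast_disk", "slow_disk"):
        return "read directly"
    if tier == "tape_library":
        return "automatic mount, file available in a few minutes"
    return "ticket issued, operator mounts the tape, user notified"
```

For example, tier_for(timedelta(days=400)) lands on the tape library, so the user gets the "being retrieved" message instead of a service ticket.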
Part of the problem though, is trying to keep track of filesystem sizes. With tape on the back-end, you can have effectively infinite filesystems, parts of which are slower than others. The OS has to understand it at a fairly low level to make it work. And the users need to understand it as well.
"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
As with White House emails, to cite a prominent example, business keeps as much as it does Just In Case they get audited/investigated/sued. They want to be able to provide a (non-)paper trail.
90% is probably high for most. In my experience 50% is probably low. I'd figure they crank out twice as much digital effluvia as they use. And nothing makes the other side sit back like telling them 'You want an audit? Fine, we've got elevnty jigglebytes to go through, conservative estimate to go through it, twentyteen months'. Such occurrences aren't common except in knowledge, so they're saving the trash for a rainy day.
"I may be synthetic, but I'm not stupid." -- Bishop 341-B
Someone doesn't understand opportunity cost.
Or, in different words, 99% of tax audit-related data is saved but never read. But if you get audited and you don't have it, you're in trouble, and your costs likely far exceed 100x the cost of having kept the records in the first place.
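The arithmetic behind that is a plain expected-cost comparison. A minimal Python sketch follows; the 1% chance and the cost multiples in the usage note are illustrative numbers, not measurements.

```python
def expected_cost_of_discarding(p_needed, cost_if_missing):
    """Expected loss from not having a record: probability times penalty."""
    return p_needed * cost_if_missing


def worth_keeping(storage_cost, p_needed, cost_if_missing):
    """Keep the record when the expected loss exceeds what storage costs."""
    return expected_cost_of_discarding(p_needed, cost_if_missing) > storage_cost
```

With storage cost taken as 1 unit and a 1% chance of ever needing the record, worth_keeping(1.0, 0.01, 500.0) is True and worth_keeping(1.0, 0.01, 50.0) is False. The break-even penalty is exactly 100x the storage cost, so "far exceeds 100x" means keeping the records wins.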
90% of backup data is also never read. That doesn't make it useless or "dead weight".
What I wonder now is, if the data is never read again, how is it stored? Is the data archived, or sitting on a drive on a server in the back room? The other question is what data is stored there: old PC specs, support documentation, or HR and customer data? Then I wonder, if 90% is never read again, how secure is that data, especially if it is confidential? Just things to wonder about. This is why most companies don't leave old data lying around: so they don't have to worry about storage and safety concerns.
http://www.thetechnologygeek.org
... and not a single mention of the Pareto Principle.
There is nothing really new in this analysis. EMC has for the past 10 years made an entire business out of Information Lifecycle Management. The basic tenet of ILM is that information is most likely to be accessed immediately after it is created, and that the likelihood that it will be accessed drops off as time advances.
The standard example is that of a banking transaction. The transaction record will be most active in the first month: it is created, backed up, and then later read for the statement. After the statement is printed, it is unlikely ever to be resurrected.
So ILM says to progressively migrate data from active, online storage to nearline, and then offline storage as time progresses.
Actually, the whole thing mirrors the way human memory works. The most active memories we have are of the past few days; after that they migrate to long-term memory, and the experiences are summarised and become experience.
I'm not sure this is really ground breaking or startling. A DBA I worked with pointed out that the vast majority of data in a database is written and never read - something you have to take into account when deciding whether or not to place an index on a table (indexes slow down your inserts). It doesn't take much effort to extrapolate that to include any form of data.
Most of the data we store on disk at work is never read, it just sits there taking up space. If we actually thought about it we could put in place a mechanism to move it to an offline storage mechanism of some variety.
Yeah, I had a sig once; I got bored of it.
Data is always needed from manufacturing, operations, sales, marketing, administration, outsourcing etc. Realistically speaking, 10% usage is very high; I'm tending towards 5% usage. But of course it depends on how in depth the researcher went; there are always unofficial sources of data that are used in corporations. Anyway, what's so bad? Big waste means more need, more staff, more employment and higher wages. You always appoint a new rep to oversee data, and monitor changes when new information is required. It shows management incompetency, but hey, I get better pay, why is anyone complaining. I
Turns out we were doing you a favor. You didn't really need it anyway!
I don't think anybody would believe that. Now if it had been Apple...
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
(IAAL) Most of the comments here relating to records retention for audits or compliance purposes are dead on.
That said, nearly every company that I've worked at could benefit from a new or updated records retention policy. Typically, the mentality is to just keep everything until someone finally realizes that certain records are 27 years old and starts asking around: "hey, can we just delete this?"
The better approach is to actively look for stuff to toss, and do so on a monthly or quarterly basis. Not only is it cheaper to store, but for most industries, *not* having the records available for many common causes of litigation is beneficial.
the tagline says "90%... is NEVER read..." yet the article doesn't agree: ALL of it is read, just maybe not more than onct! makes me wonder whether /. folks, among others, only see what they want to see in the headline, and get awlkindza frothy at the mouth to explain their point on what they 'thought' they read. no wonder washington only has to say something in a headline to get folks to believing it! thanks fer lis'nin' seekertom
Keep your data and have it automatically moved to lower cost SATA storage on your SAN.
http://www.compellent.com/Products/Software/Automated-Tiered-Storage.aspx
We have been using a Compellent SAN for about a year now and I couldn't be happier!
Not dumb enough, experienced enough and I've read enough instead of taking a seat of the pants guess - I'll back my approach over yours anytime.
The obvious fucking answer to your strawman is that everyone that stores tapes for a long time keeps them dry. Storing disks for long periods of time is not as simple since they are not designed to last as long and have multiple time based modes of failure, and they are far more fragile so with your strawman poor storage example you would see far more problems with the disks.
The second answer is 7 years and your tiny sample size is nothing, I've had drives spinning for that long - some survived some didn't - stored cold for that long with silica gel - some survived some didn't. See what I wrote about lubricant problems in long term storage in the comment above, there is no perfect environment to store hard drives. They will die over time if it's dry or if it's humid.
The third answer is I've seen a huge number of drives and a much larger number of tapes - guess what experience and the literature say has the larger number of failures. While I don't have 40 dead drives yet the number is starting to mount up and I should destroy them properly some time.
It also annoys me a great deal when people pre-emptively insult others by accusing them of their own faults and invent results that are the opposite of what they actually are. Review what you've written, compare it to what reality exhibits and you'll see I'm not being wilfully ignorant here.