To Purge Or Not To Purge Your Data
Lucas123 writes "The average company pays from $1 million to $3 million per terabyte of data during legal e-discovery. The average employee generates 10GB of data per year at a cost of $5 per gigabyte to back it up — so a 5,000-worker company will pay out $1.25 million for five years of storage. So while you need to pay attention to retaining data for business and legal requirements, experts say you also need to be keeping less, according to a story on Computerworld. The problem is, most organizations hang on to more data than they need, for much longer than they should. 'Many people would prefer to throw technology at the problem than address it at a business level by making changes in policies and processes.'"
The problem is that it's easier to just archive the cruft than it is to go through it all and figure out what's worth keeping.
http://www.geoffreylandis.com
Seems to me that companies would keep all that data just in case a legal issue came up, in order to have a leg to stand on. Lawsuits are unpredictable that way.
Data bulimia is a serious problem. If you know someone affected, make sure to get them the help they need ASAP.
$250k a year for a 5,000-employee company? To put it in perspective, if the average employee at this company is making $60k a year, this company will be paying $1.5 billion in salaries over the same 5 years. To be fair, I think the estimated cost from the article is very much underestimated. But while corporate storage costs more than you'd think, and companies are definitely storing a whole bunch of data they don't need, what about the costs of reviewing and purging that data? That is straight-up time, whether it's reviewing existing data or spending the time to create guidelines for which data to keep. And time costs money. More than storage.
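The arithmetic behind those two figures, as a rough sketch. The 10GB, $5/GB, 5,000 workers and five years come from the summary, the $60k salary is the guess made above, and the function and constant names are made up for illustration.

```python
# Figures from the story summary and the comment above; names are illustrative only.
GB_PER_EMPLOYEE_PER_YEAR = 10
BACKUP_COST_PER_GB = 5        # dollars
EMPLOYEES = 5_000
YEARS = 5
AVERAGE_SALARY = 60_000       # dollars per year, the commenter's assumption


def storage_cost(employees, years):
    """Backup cost the way the summary computes it: a flat yearly charge per employee."""
    return GB_PER_EMPLOYEE_PER_YEAR * BACKUP_COST_PER_GB * employees * years


def salary_cost(employees, years):
    """Total payroll over the same period."""
    return AVERAGE_SALARY * employees * years


storage = storage_cost(EMPLOYEES, YEARS)
salaries = salary_cost(EMPLOYEES, YEARS)
print(f"storage:  ${storage:,}")
print(f"salaries: ${salaries:,}")
print(f"storage is {storage / salaries:.3%} of payroll")
```

On these numbers storage comes to less than a tenth of a percent of payroll, which is the point being made about labor dwarfing disk.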
Whale
Maybe it'd be cheaper for the companies to buy the employees an annual subscription to Penthouse.
Jeez lads, would you lay off the porn?!
Genesis 1:32 And God typed
I work for a storage company, stop messing with my job security.
10 GB of data per user, sure.
10 GB of user data, no way.
Assuming 300 work days per employee, that would mean that the average employee creates 1.2 kB of data per second.
The only way this could be true is if you count data that isn't user generated, and they count the total data storage for the company and divide it by employees.
If so, users deleting their e-mails won't have much of an effect.
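A quick check of that rate, as a sketch. The comment gives the 10 GB and the 300 work days; the 8-hour day is an assumption added here, as are the names.

```python
BYTES_PER_YEAR = 10 * 10**9   # 10 GB, decimal gigabytes
WORK_DAYS = 300               # from the comment
HOURS_PER_DAY = 8             # assumption, not stated in the comment


def bytes_per_second(total_bytes, days, hours_per_day):
    """Average data rate over working time only."""
    working_seconds = days * hours_per_day * 3600
    return total_bytes / working_seconds


rate = bytes_per_second(BYTES_PER_YEAR, WORK_DAYS, HOURS_PER_DAY)
print(f"{rate / 1000:.2f} kB/s")
```

With an 8-hour day this lands at a little under 1.2 kB per second of working time, in line with the figure quoted.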
Apps aren't really designed with this in mind. They come at the problem from a "document creation" perspective rather than a "document lifecycle" one.
This is generally because data has a variable lifespan. Let's take an email as part of a project as an example. As the author I may decide that the email isn't needed after a week, so I set an expiry of 1 week. But you, as the recipient, may take that email and turn it into several tasks, so for you the email is much more important and you want to keep it for much longer.
Users aren't really going to be good at making these decisions unless some application continually bombards them with "go check the status of these 1000 documents you've got".
For example, financial institutions are required to keep data for longer periods for legal purposes as well as traceability (during investigations of fraud or other kinds of crime). The banks I worked for had a legal requirement to keep data in 2 places at least 15 km apart, with all kinds of protection against fire and intrusion.
A good manufacturing company would keep data for longer periods not only to comply with ISO standards, but also to trace manufacturing defects and to have good evidence of past history for the insurance company in case of theft, fire and other kinds of problems.
We used to keep daily changes of source code only for the previous release, and purge those of the rest of the releases (we kept the final source code and patches of all previous releases, but purged the daily changes).
In a nutshell, it depends upon your type of business.
hilarious
So $50 a year is now too much for a large company to tack onto employee costs? If someone is making $30,000 a year, what's another $50? The problem might be in multi-year retention, in which a 2-year employee will require $100 of storage and so on, but this does not account for the diminishing price of memory or other associated costs. Maintaining a 10-year archive at that price, and assuming that employees were putting out 10GB of data 10 years ago, would cost $500 an employee; scale that up to a larger company, and you can see data storage prices in the millions. This is assuming that the data is: 1) stored server-side, and 2) not kept only as a physical backup after, let's say, 3 years. It would be cheaper in the long run to move everything to hard storage after some point x and keep it offline, only to be used in the case of lawsuits and other archival needs. Using a model like this allows for near-unlimited storage time with minimal costs. If a new storage format comes about, the biggest pain might be updating these records, but in terms of memory costs for such an operation, look at the advancements in storage space; 10 years ago, people thought 10GB was large.
It's far better to spend a few $K than to waste literally weeks of time trying to sort things out, especially when you need sales to be selling and not worried about their computers.
--Mike--
I don't know what most major companies' policies are regarding backing up emails (just back up the text or back up emails plus attachments) but, as but one example, I'm sure this would be an easy spot for most companies to dramatically reduce the amount of storage space required. Most business communications I see from corporate personnel have various attachments on every email - things like logos, custom backgrounds, etc. Forget getting rid of all the unnecessary attachments - getting rid of the "look at my pretty email that looks like a page from a spiral-bound notebook with my company logo at the bottom" images, and the hundreds and thousands of duplicates of those images, would reduce storage requirements, bandwidth requirements, and probably make corporate communications look more, you know, professional. So many emails are filled with unnecessary garbage and, if that's being backed up, that garbage can get costly.
Then again, I'm biased - I believe email should just be pure text. Perhaps that's a sign that I'm now old...
"...most organizations hang on to more data than they need, for much longer than they should."
As an infrastructure consultant, I see this EVERYWHERE. At the average client, I find the same INSTALL MEDIA (O/S ISOs) in three or more locations, all of which are being backed up. WHY??? You already have a TRUE backup, it's called the CD the software came on, or the electronic source you downloaded it from. Just gigs upon gigs of wasted space.
And don't even get me started on email limits and policies. I can't think of a single company that actually enforces the mail limits. And even those that do will "extend" the limit or exempt nearly anyone who requests it.
We in the IT field need to create better policies and actually follow through on them...
If at first you don't succeed... How does that go again? Ah, forget it.
I work for a few lawyers and we just began running into issues with "data discovery". Two recent examples:
1.
They are a medium sized law firm and they were involved in a lawsuit with another law firm. The other law firm (much smaller) required a copy of all the data from the firm.
Data from encrypted laptops = 80GB x 6 users
2 hours per laptop to decrypt and image (12 hours)
Data from 4 servers and email = 65GB (2 hours)
That's now over 500GB and 14 billable hours of support.
2.
The law firm was involved in a lawsuit where they were doing discovery and had to review evidence.
They were going to get data from 10 laptops (800GB total) that will require backups of the data and archival for X years (so far it is 1 year and indefinite).
Quickly the data discovery is getting expensive - and annoying on a technical level.
average employee generates 10GB of data per year at a cost of $5 per gigabyte to back it up...
I cry nonsense in the statement above.
I put a 25 cent blank DVD into the DVDwriter of my PC. Then I copy the entire contents of my 'C:\backup' folder onto this DVD. I start the program, and go do something else. Total dedicated time: 2 minutes
When the DVD write is done, I write a label code on the DVD (date, employee, backup number) and put the disk back on the stack in the file cabinet. Total dedicated time: 2 minutes
My salary and benefits: $18/hr
Time used on backup: 0.067 hrs
My cost per gigabyte of backup: $1
So if I'm an average, marginally competent employee, why can I do backups 500% more efficiently than the average?
This statistic must be junk.
Used to be records were kept on paper,
paper was kept in boxes,
and boxes were dated MM/YY.
I came into the office one fine 1998 January 02,
and the hallway was stacked full of boxes dated 01/94,
02/94, 03/94, etc.
Company policy was discard records after three years,
so all records from 1994 were on their way to the dumpster.
The major cost of purging is the manpower and downtime. Therefore it's easier to keep the stuff, possibly with occasional housekeeping if your schema isn't as scalable as it should be. While the legal and tax requirements (which vary from country to country) have a limited lifetime, there are always possibilities, such as legal defences, where old data may be needed. These uses will not require the performance (and cost) of enterprise class storage: speed, redundancy, administration, warranties. So migrate it to a few 1TB drives in someone's desk. That way if subpoenaed you can plausibly have "lost" it, whereas if it's in your interests, it can miraculously be found.
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
On top of what you said - $5 a gigabyte? What is this, 1998? Even if you get WD's highest quality consumer hard drives they're about $1 a gigabyte, plus if you buy them in bulk they're probably considerably cheaper. You can use 2 or 3 of them for data redundancy, and it's still significantly cheaper. I question where they got that number.
My last job had some files from the 1890's. The company had moved from New York to New Jersey to Houston in all that time. I can't imagine that material would ever need to be used, or would be called up during a legal investigation. Even if it were, would the authorities penalize a company for files that were that old??? At some point, everything is trashable or museum material.
This company occasionally needed blueprints from the 1930s/1940s (Great Lakes ships), but none of their ships went back much further than that.
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
I've been in my current position almost a year now, and I've already generated about 1/2 a terabyte of data; and that's only the stuff I've decided is worth keeping (I've probably generated several terabytes in reality),... Of course, I'm probably not your average office worker -- my data is mostly monte carlo simulations of proteins, on the order of millions (some in the billions) of steps long. Some of the largest trajectories are 45 GB (yes, that's one file).
In a world where backup takes money, a law that says to companies "keep every communication backed up" is saying essentially the same thing as "communicate less".
The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
put everything on one disk drive, unRAIDed. when it fails, problem solved. voila, built in obsolescence
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
You're obviously not writing software, doing CAD work, or any kind of computational modeling. It's easy to have that much data -- my source tree alone is 2GB.
Only 10 GB?!?! Pfft! Amateurs,...
I've been in my current position almost a year now, and I've already generated about 1/2 a terabyte of data; and that's only the stuff I've decided is worth keeping (I've probably generated several terabytes in reality),... Of course, I'm probably not your average office worker -- my data is mostly monte carlo simulations of proteins, on the order of millions (some in the billions) of steps long. Some of the largest trajectories are 45 GB (yes, that's one file).
At least that's what you tell your boss so he won't find your porn.
Conversely, keeping all of that data also opens you up to legal trouble. Different types of records should be kept for different lengths of time, in accordance with your company's records schedule.
If you have too many records, you may have to turn over information that could be damaging to your case in any litigation against you - information you aren't even required to keep in the first place. Confidential information may be leaked, stolen, or lost, and the probability of that happening only goes up with time. Additionally, if you have a ton of records that you don't need and won't use, your ability to find the information you do need is severely hampered.
While high storage costs may be a factor for disposing of unneeded data, it is not the reason for doing so. You shouldn't be keeping more data just because storage is getting cheaper.
I can't see how wanting "more slaw" is on topic, or why it would be spelled with two Os. Oooh .. Moore's Law. Then there's the sound-alike Mooers' Law which was summarized as "focus[ing] on [the] idea that people may not want information, as it obliges them to study the information, and come to an understanding about it."
Storage vs Study, Moore vs Mooers - fight!
...whilst policies and procedures often solve a lot of things in a cleaner, more common sense manner there are unfortunately far too many people lacking common sense.
Throwing hardware at it guarantees it'll be done; expecting people to follow policies and procedures will likely leave you with a 50% success rate in ensuring the correct data is kept/binned, and that's if you're lucky.
The world as a whole would be so much more efficient if we could get people to follow policies and procedures or at least the common sense, good practice ones.
Assuming 300 work days per employee, that would mean that the average employee creates 1.2 kB of data per second.
Top posting and absence of editing by Microsoft Outlook users engaged in a brief inter-departmental discussion could easily account for that volume.
Is that what you meant by "isn't user generated"?
Business Intelligence software just may make use of the data. While a lot of businesses are STUPID in their use of BI software, there may be some point where either the company dies or it gets a clue and does some BI analysis on its data.
You actually can do some amazing things with BI. Say, for example, you are storing time card data from employees, and you want to check the effectiveness of managers. So with, say, 20 years of time card data and employee records of which manager is which, you just may find a correlation between different managers and how long people take for their breaks and how many sick days they take. Factor out differences of age and experience in the company, then possibly create a correlation of how much value one department makes over the next.... And you will have, in nice number form, proof that Manager A sucks, while Manager B is effective. Even if people may not like Manager B as much, or the people under him like him but his managers don't... (as they may have been found to be bad managers by the same calculations).
Oddly enough, computers are really good at doing a lot of complex math... Imagine that... So they can far more easily handle crunching 20 years of data and finding correlations, far better than many people's gut feeling.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
They count more than just the stuff you typed as "user data." For example, Linux admins download ISOs, lawyers download PDFs, Windows admins download patches, service packs, and malware cleaning tools, and sales people download porn. All this data is used by the users and must be archived.
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
My 10GB mail box in outlook, when mirrored to my local hard drive in MBOX format, automagically becomes 2 GB - and that's before compression and attachment pruning.
I have no idea what the hell Outlook is doing on the server, if it is just storing things in multiple formats at once or if it is just mis-calculating all the space, but that is one hell of a difference.
Drive is 3-5x more expensive than $1 a gigabyte...raid level 5 means 2+ drives, we're to $6-10 already, then you say the majority of the cost wouldn't be in the media...
From working at a large university, three Fortune 500 companies, and now the small business I work for, I don't think it's even suggestible that most user data is backed up in an out-sourced tape data center. That's an absurd suggestion. The vast majority of data never makes it off either a local hard drive or a temporary, lightly backed up network "drive".
No matter how you skew it, the numbers they came up with in the original post are absurd. Be it either how much the data costs to store, or how much data is being stored - one of those is way out of whack with reality.
I did a back-of-the-envelope calculation on just this question in 2004, and estimated that file deletion was not productive unless we could do it at a rate of at least 17MB per minute (of labor). Four years later the threshold is probably at least 45MB per minute.
Generally, this means that if we can blow away whole disks or huge directories of data, it may pay off. Users going through their files one by one is usually an absolute waste.
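The comment doesn't say which inputs produced 17MB per minute, so the following is only a sketch of how such a break-even threshold can be worked out. The $30/hour labor rate is purely hypothetical, the $5/GB is the article's backup figure, and the function name is invented.

```python
def break_even_mb_per_minute(labor_cost_per_hour, storage_cost_per_gb):
    """MB that must be freed per minute of labor for deletion to pay for itself.

    Deleting is worthwhile only when (MB freed per minute) * (cost per MB)
    exceeds the labor cost per minute. Uses 1 GB = 1000 MB.
    """
    return labor_cost_per_hour * 1000 / (60 * storage_cost_per_gb)


# Hypothetical inputs: $30/hour labor; storage at $5/GB, then at $1/GB.
print(break_even_mb_per_minute(30, 5))
print(break_even_mb_per_minute(30, 1))
```

As storage gets cheaper the threshold rises, which matches the observation that the bar was higher four years later.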
"Not an actor, but he plays one on TV."
Any record destruction policy must include a "litigation hold". A litigation hold means that record destruction must stop when litigation is anticipated or pending. But in a complex enterprise, it is tricky to know what litigation the enterprise anticipates. It was the trickiness of litigation hold that led to the demise of Arthur Andersen. The risks associated with litigation hold give enterprises incentive to store lots more records. --Ben http://hack-igations.blogspot.com/2008/07/document-discovery-litigation-hold.html
Benjamin Wright, Dallas, Texas, benjaminwright.us
Look at how people deal with email. I've got coworkers that have every single email (including mailing lists they've subscribed to) they've ever sent or received since they started (~8 years ago). They've probably got 20GB of email on their laptop. Now we only allow 100MB of server-based email storage, so that helps on the server side, but we're still backing up this guy's laptop.
On the datacenter side, we had a database corruption about 10 years ago so we implemented snapshots, and then snapshots of those snapshots... we actually now carry about seven copies of our database. Why still seven? Because nobody wants to have to recommend that we have fewer copies of data in case we have another problem again. The funny part is that nobody in operations was around at the time of this outage.
At least de-duplication technology is being adopted, which gives us an excuse to hoard even more data. However, from a legal standpoint, telling the judge we don't retain data older than X years is easier than recalling 50k tapes you sent offsite 8 years ago.
Bottom Line: It's just easier to store it than to be the one that "Recommended we delete XYZ files, that's why we don't have the data."
Big whoop.
Holy Enron Batman! I hope you aren't suggesting perjury is better than accountability.
In Soviet Russia meme tires of you!
Congratulations. You're the first person I've seen who understands that.
Accounting understands the need to close one year and open the next. They have processes for what is carried over and how it is identified.
Yet no other department (or application) understands the need to close old data and archive it.
Although I agree that the inertia of keeping records trumps the work of evaluating them, the large financial services company I work for is turning with the tide, starting to focus on deletion and destruction, mainly for potential liability reasons. Not just aged documents, but prior versions, drafts, notes, etc. It makes me wonder what the historians of the future will have left for primary sources--besides the final, signed-off Establishment-sanctioned records of events. Are we on the road to compromising their ability to determine and describe What Really Happened, and thus our own ability to understand our past? Could John M. Blair write "The Control of Oil", or Ron Chernow "Titan: the Life of John D. Rockefeller Sr." fifty years hence?
mmmmmmmmmmmmmmmmmmmmmmmmmm, mooreslaw.
Nope that sounds pretty typical :)
After all, most coders come in every day and re-copy the source tree, libraries and all, to a new folder, in case they make a mistake and need to go back to a previous version.
No? Really? They told me that this was industry standard practice.
A12A.713 is the root of ASC('evil')
IANAL. This is why most companies spend some money developing a retention policy and planning its implementation. It requires a bit of time from every employee to decide if a piece of information is something that requires short term, long term or permanent storage but if you get people into the habit of sorting things like email into folders that reflect the company retention policies (which need to be pretty clear and well planned both from an IT and a legal perspective) then you can reduce the cruft you retain considerably.
With clear policies on when the various categories of information can be safely and legally deleted you can reduce the storage costs and simplify the e-discovery phase if it comes up.
Likewise you need good planning and employee training on what to do when a Hold is placed. I.e., if your company enters litigation, you will place a hold on data deletion and *NOTHING* gets deleted, so that the courts can't find you guilty of attempting to hide information from them in a litigation.
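A minimal sketch of how those two rules could fit together in code. The three categories are the ones named above; the retention periods, the `Record` type and the `may_delete` function are all hypothetical, and real periods would have to come from legal counsel.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention schedule in days; None means never delete.
RETENTION_DAYS = {
    "short_term": 90,
    "long_term": 7 * 365,
    "permanent": None,
}


@dataclass
class Record:
    name: str
    category: str
    created: date


def may_delete(record: Record, today: date, hold_active: bool) -> bool:
    """True only if no hold is in place and the retention period has expired."""
    if hold_active:
        return False  # a litigation hold stops all deletion, whatever the schedule says
    days = RETENTION_DAYS[record.category]
    if days is None:
        return False  # permanent records are never eligible
    return today >= record.created + timedelta(days=days)


memo = Record("lunch-menu.eml", "short_term", date(2008, 1, 2))
print(may_delete(memo, date(2008, 7, 1), hold_active=False))  # True
print(may_delete(memo, date(2008, 7, 1), hold_active=True))   # False
```

The hold check comes first on purpose: once a hold is active, nothing else in the schedule matters.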
Any company that doesn't come up with a retention policy that takes everything into consideration, doesn't train its employees on those policies and doesn't practice what it has decided will be its policy is in for a world of hurt when suddenly it's in court and has to produce emails from a specific individual or individuals from 3 years ago etc.
If your employees can generate 10GB of data during the course of a year, then they can learn how to apply retention principles to it while they do so. It's just one more aspect of the job.
Now there are various attempts at software to automatically filter and organize your data - email and documents etc - according to key words and phrases, email addresses etc. I believe some of them are pretty well evolved and take a lot of the burden off your employees - and cover you when those employees can't be bothered to do what they should be doing according to the rules, but I have no experience with how well these work.
Here's an article on email retention (from a quick google search, no idea how well it's written)
http://searchstorage.techtarget.com/tip/0,289483,sid5_gci1212767,00.html
"The first time I got drunk, I got married. The second time I bought a chimpanzee, after that I stayed sober" Arian Seid
That's nothing. I work for a computer consultancy and I have half a terabyte of attachments and meeting invites alone.
Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
At least that's what you tell your boss so he won't find your porn.
;-)
Bullshit. Proteins are MUCH more interesting than porn.
Actually, I am only half joking - I waste far too much time here on Slashdot, and from time to time I have to give myself a nudge to get on with my job, only to find that the work is more interesting...
They told me that this was industry standard practice.
;-D
No. The industry standard practice is to store the source code for every program you have ever written on punch-cards in a locked filing cabinet.
Or didn't you know that?
(Just to spell it out for the irony-impaired: if this slips under the radar of your world view, google "Real Programmers Don't Eat Quiche".)
What about throwing company policies at a technology problem?
Hypothetically (never happens in the real world of course), what if there was a document management server, a Samba dropbox, where all documentation for deliverables is kept in portable Excel 2003 format? What if content identification is done by creating folders with "project" and "project"_old naming conventions, hyperlinking is done in Excel (because HTML is complicated), and so on ad nauseam for the automated process called "company policy"?
It sounds like you work with a lot of old people that have to send attachments to themselves because they haven't heard of this new-fangled thing called "FTP",...
It is true that a single 1TB IDE desktop drive can be bought for around a hundred bucks, but in the enterprise world most companies use SCSI drives. The largest capacity SAS drive you can buy now is 146GB, so adding in RAID etc., you will have spent $8000 before you walk out of the store with 1TB of usable storage.
From TFA, e-discovery on 1TB of data costs around $1 to $3 million, so suddenly that 1TB IDE drive's "cost" to the company is way more than just $100.
So simply buying another $100 1TB drive before considering other options is not very wise.
I'm not sure if most of you understand what is really being written about here. There are laws in place that REQUIRE that companies retain EVERY document according to a certain set of rules. These rules change depending on the type of company, but a good rule of thumb is 2 years of document retention. Publicly traded companies are under even stricter guidelines, including Sarbanes-Oxley.
Exchange servers alone will generate huge amounts of data in no time at all. When these companies go into litigation (and they almost always do), all of this data is considered discoverable and can cost the legal department huge fees.
When involved in litigation, these documents cannot simply be pulled out of archive and made available for review. There is a very strict set of rules that require that these documents be produced in a non-editable, read-only image format (usually TIFF) and then put into discovery review platforms such as Concordance or Summation. This costs tons of money because companies typically do not produce them in house.
The cost of producing the documents is only the beginning, though. After they are produced, the legal fees for having them reviewed are where the really steep costs come into play. Lawyer fees can run upwards of 250 - 400 dollars per hour. That means that an email that took someone 10 minutes to type out might be reviewed for 30 minutes to 1 hour by the legal team (depending on relevance). So that single email could end up costing several hundred dollars between document production and review.
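The review part of that estimate, as a small sketch using only the hourly rates and review times quoted above. No per-document production figure is given, so it is left out; the function name is made up.

```python
def review_cost(hourly_rate, review_hours):
    """Legal review cost in dollars for one document."""
    return hourly_rate * review_hours


cheapest = review_cost(250, 0.5)   # 30 minutes at $250/hour
priciest = review_cost(400, 1.0)   # 1 hour at $400/hour
print(f"review cost per email: ${cheapest:.0f} to ${priciest:.0f}")
```

So review alone runs roughly $125 to $400 per email on these numbers, before any production cost is added.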
Now, if there are suspicious documents that have links to files that no longer exist, the opposing counsel has the right to do a forensic investigation on the system to look for deleted files. If they are found to be deleted when they should have been kept, the court can actually sanction the company in question... not a good position to be in!
Electronic Discovery is huge business these days and only grows as more and more companies enter litigation each year.
FTP? Is that a new mainframe thing? Will it work with Lotus Notes? I guess that gives too much away about where I work.
Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
The media itself is a small cost, the major cost is in the sending, storing and retrieving the media from off site vendors like Iron Mountain.
With the amount of data growth it's no wonder Warren Buffett has a position in Iron Mountain.
You moron, they counted the BitTorrent data transfers too.
RutSum.com
10GB of original data is easy, and it doesn't take a year, just a week or two. Today and yesterday, I measured physical properties of a lot of output from a particular industrial process (just one plant in a factory, and I only recorded measurements of a few instruments). This only gave me a few hundred MB of raw data, but it will result in several GB of data after analysis. This is all original data, and this is a normal amount of output. I regularly fill several DVDs with this sort of archive data.
Of course, this kind of data (spectroscopic, radiometric, structural anisotropy, etc.) is probably not what the lawyers would be interested in. It would take far longer to explain it to them than to collect and analyze it.
FWIW, the data in question does not involve recording videos or images. We don't have the bandwidth or storage space for that, as it could result in TB per day of raw data in industrial contexts. Camera-based instruments analyze the images in real time. The images are discarded; only the analysis results are recorded.
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
Sure, it's easy to just discard everything older than X. The problem is that there _is_ data you need to keep for a long time, so that crude an approach isn't very effective. (For instance, for financial records you need to keep a record of everything you've bought until you sell it or declare it fully depreciated, and then keep those records for N years longer for tax purposes.)
For my work, I don't usually need files much older than 2-3 years, but occasionally I do need to drag out something 10 years old (typically standards documents or RFCs, though), and one of my customers has an access ring that we installed 4-5 years ago and occasionally need to look up things about. In the telecom business, you regularly need to drag up design documents and database schemas for anything that's still in the field, which is sometimes quite antique. (For instance, the data format of a T1 hasn't changed much since the early 80s, and it's mostly the same as the ~1960 original, even though the implementation hardware and software have changed radically over the decades, and robbed-bit has mostly been abandoned. The European E1 standards were more flexible, since they learned some lessons from T1's signalling limitations, but that means that each different telco does some ugly unique cruft that you have to look up.)
Of course, there are extreme cases - my wife had a summer job in college converting a several-year-old database from a hand-rolled format into a then-current IBM database format, just in case the data got subpoenaed in a regulatory lawsuit of some sort (it never was, AFAIK.) But there's still telco data out there that predates the practical viability of the Relational Database...
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
We're not talking SQL servers here, or customer information databases, we're talking average employees performing their business duties.
It cracks me up all the "experts" of backup on this site suggesting that an average corporate user is generating insane amounts of data. I understand you generate absurd amounts of data, I understand I generate insane amounts of data. I also understand the vast majority of corporate users wouldn't be able to generate 10 gigs of data without a file sharing program and an mp3 player.
The end result of Sarbanes-Oxley, on top of the increasing amount of encryption and the use of high-density short-lived storage, is going to be a frustrating gap in the historical record for future generations.
Let's say your corp is more than 50% likely to go through "e-discovery" once every 10 years. Each worker will generate 10GB * 10 years = 100GB, backing up all the increasing data pile is (pairing the balancing ends of the accumulation for half the accumulation years) 101GB * 5 = 505GB, at $5:GB is $2525, plus about $2M:TB / 505GB = $1.01M, for a total of $1,012,525 per worker, times at least 0.50 probability is at least $506,262 average predictable cost per employee.
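A sketch of that estimate with the inputs as parameters. It follows the same method (apply both the backup rate and the e-discovery rate to the accumulated pile, then weight by probability), and the function names are invented. One caveat: summing the growing pile exactly (10 + 20 + ... + 100) gives 550 GB-years, not the 505 used above, so the result comes out somewhat higher.

```python
def cumulative_gb_years(gb_per_year, years):
    """Total gigabyte-years backed up as the pile grows: 10 + 20 + ... for 10 GB/year."""
    return sum(gb_per_year * year for year in range(1, years + 1))


def expected_cost_per_worker(gb_per_year=10, years=10, backup_per_gb=5,
                             ediscovery_per_tb=2_000_000, probability=0.5):
    """Probability-weighted cost per worker, following the comment's method."""
    gb = cumulative_gb_years(gb_per_year, years)
    backup = gb * backup_per_gb
    ediscovery = gb * ediscovery_per_tb / 1000   # 1 TB = 1000 GB
    return probability * (backup + ediscovery)


print(cumulative_gb_years(10, 10))
print(f"${expected_cost_per_worker():,.0f}")
```

With the defaults this works out to about $551,000 per worker, the same order of magnitude as the figure above; the e-discovery rate dominates and the backup cost barely registers.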
One approach is to keep much less data. But when you keep less data, you have to guess right every time what data you'll need later. If your process discards data that's valuable later (but lost) it better be worth less than the amount you save. That's too hard to know, which is one reason companies keep all the data, and figure it out later.
A better approach is just to cut that $1-3M:TB e-discovery cost. Of course, the best way is to avoid being investigated, but one has less than 100% control over that, especially from inside the IT department. A much better way to do it is to better inventory the data stored as you go along accumulating it, in the terms in which a later e-discovery would search it. Which also can have the benefit of making the info in the data more available in the normal course of business, which can make that data's increased value (and lowered costs of searching it) worth the entire process. The cheaper possible e-discovery would be just a bonus.
What really gets me is how these economics are the true cost of storage. A 1TB drive costs $120, and maybe a better 1TB in a 100% redundant RAID costs $250. But it really costs something like $300,000 over its lifetime (probably replaced every 3 or so years, across the 10 years I analyzed). If IT spent a few hundred hours a year streamlining the navigation of all that data, at a cost of a few tens of thousands of dollars, divided across all those employees, the entire org's IT operations would be much more economical, when the large cumulative risk of e-discovery costs is factored into the true cost.
--
make install -not war
There should be enough local cache for every user to have access to every document they could possibly create, unless you are working at a movie company. Given proper indexing, it should be possible for users to find what they need.
Storage is cheap enough for this to work, even if some documents are slow (compressed, maybe combined as deltas with other very similar documents) or very slow (have to pull from tape or something). But again, all of that which an average user needs should be cacheable on their own local hard drive.
Granted, the tech isn't really there, especially for desktop apps. But how much is it costing not to purge? How much would it cost to write software to make it easier to purge (and train users on that software), vs writing software to better archive (and just taking the hit on hardware)?
Don't thank God, thank a doctor!
I've become the e-discovery guy (at least for email) where I work. Our lawyers told me that the latest revision of FRCP (Federal Rules of Civil Procedure) requires an entity to keep evidence, even if automatic purging systems are in place.
Rule 37 of FRCP says that if you are ordered to hand over the evidence, and you cannot, then the judge can order that "designated facts be taken as established for purposes of the action, as the prevailing party claims". In other words, if the person suing you claims you sent them an email offering a million dollars to not go to court, and you auto-purge your email (taking away the ability to prove you didn't send the email), the judge has the option of deciding that yes you did make an offer of a million dollars via email. 'Twould suck to be you.
It even gets a little worse. Although you must keep evidence after being told you are being taken to court, it turns out you need to keep all evidence in case you are taken to court. I'm told that the criterion here is "reasonable expectation that the matter will go to court". It's reasonable (for example) to expect to end up in court if an employee dies while on the job (and it wasn't due to natural causes). The point here is that if a person dies, you'd better keep any email about the situation that led to the death - 60 day auto-purging email expiration practice be damned.
Auto-purging is a fine thing, as long as you have the ability to exempt items, in case they become evidence.
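To make the "purge, but with exceptions" idea concrete, here is a minimal Python sketch. It is not how any particular mail system does it; the `Message` class, the `legal_hold` flag, and `select_for_purge` are hypothetical names, and the 60-day default just mirrors the expiration practice mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    msg_id: str
    received: datetime
    # Hypothetical flag, set when litigation is reasonably anticipated.
    legal_hold: bool = False


def select_for_purge(messages, now, max_age_days=60):
    """Return messages older than the cutoff that are NOT on legal hold."""
    cutoff = now - timedelta(days=max_age_days)
    return [m for m in messages if m.received < cutoff and not m.legal_hold]
```

The hold check sits in the same place as the age check, so nothing that has been flagged can be purged no matter how old it gets.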
"The most sensible request of government we make is not, "Do something!" But "Quit it!"
You're obviously not writing software, doing CAD work, or any kind of computational modeling. It's easy to have that much data -- my source tree alone is 2GB.
And what about our colleagues in the porn production industry? I mean, one hour of hi-res MPEG is a lot of megabytes. Multiply it by the number of, ah, employees...
Advice: on VPS providers
The dataset in os/390 had a metadata field for its retention period. The file system would delete the file if it was older than X time. Pity that UNIX and Windows do not have such a concept.
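Lacking that field, you can approximate it in userland. A minimal Python sketch, assuming you keep your own table of retention periods per file (the `retention_days` mapping and the `expired_paths` name are made up for illustration, and this is not how os/390 implemented it). It only reports expired files; deleting them is left to the caller.

```python
import os
import time


def expired_paths(retention_days, now=None):
    """retention_days: hypothetical dict of file path -> retention period in days.

    Returns the paths whose last-modified time is older than their period.
    """
    now = time.time() if now is None else now
    expired = []
    for path, days in retention_days.items():
        age_seconds = now - os.path.getmtime(path)
        if age_seconds > days * 86400:  # 86400 seconds per day
            expired.append(path)
    return expired
```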
Folders (no, you can't make your own folders):
# New contains all un-opened mail.
# Read contains opened mail.
# Inbox contains mail held via the Hold button.
# Outbox contains draft and future mail.
# Sent contains mail you sent.*
# Old contains deleted mail.
*If you include yourself as a recipient on a message it appears in your New folder rather than your Sent folder.
Message Retention
# New, Read, Inbox and Outbox messages will be available for 60 days from the original date.
# Sent messages will be available for 7 days from the original date.
# Old messages will be available for 3 business days from the date of deletion.
I hate it.
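For anyone stuck scripting around rules like these, here is a minimal Python sketch of the expiry dates they imply. The folder names and day counts come from the list above; the function names are hypothetical, "business days" is assumed to mean Monday to Friday with no holiday calendar, and "available for N days" is assumed to mean the message expires N days after the reference date.

```python
from datetime import date, timedelta

# Calendar-day retention, counted from the message's original date.
CALENDAR_DAYS = {"New": 60, "Read": 60, "Inbox": 60, "Outbox": 60, "Sent": 7}
# The Old folder counts business days from the date of deletion.
OLD_BUSINESS_DAYS = 3


def add_business_days(start, n):
    """Step forward n weekdays (Mon-Fri); holidays are ignored."""
    current = start
    while n > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            n -= 1
    return current


def expiry_date(folder, original_date, deleted_date=None):
    if folder == "Old":
        if deleted_date is None:
            raise ValueError("Old folder needs the deletion date")
        return add_business_days(deleted_date, OLD_BUSINESS_DAYS)
    return original_date + timedelta(days=CALENDAR_DAYS[folder])
```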
Later on you may need that data:
* To prove you used or thought of or invented something before someone else later applied for a patent on it, or
* To prove a claim that some agreement was made, or
* To show you acted in good faith in some matter
Toss the email and you may lose trails of evidence about such things. Also, you can't always tell at the time you get/send mail which mail will need to be kept, nor for how long. I've had patent-relevant stuff that was needed 10+ years later, for example. Toss the spam, but the rest should be mostly kept. (Attachments might be trimmed a bit more than your own words.)
Predict score -5
why do i care? slow 'news' day?
My score:
9/10 on the WGAS (who gives a) scale, bigger is worse
Back up everything and hide the media somewhere. If they subpoena it, deny everything.
I'd probably do something like this -- assuming I ever back up my data. Which, as far as you know, I don't.
Paleotechnologist and connoisseur of pretty shiny things.
They can't get the password/private keys unless you give them up. It's called the Fifth Amendment: protection against self-incrimination.
In the enterprise world most companies use SCSI drives.
Or fibre channel.
The largest capacity SAS drive you can buy now is 146GB
I just bought a bunch of 300GB SAS drives for one of my NAS. Actually, it looks like 1TB drives are available now too!
Technicalities aside, you're completely right. Folks don't seem to understand that the hard drives you walk out of Best Buy with are not the same ones that plug into your CLARiiON. And then of course, there is backing up that data, because the enterprise backs up more of its data than it doesn't. You need a backup platform, tape systems, tape, backup licenses, off site storage and people to manage all of it.
The "experts" cited in this, they wouldn't happen to be professionals affiliated with companies that offer services to comb through your data and help you get rid of all the old stuff, would they?
$5 per gigabyte to back it up? Fuck YOU. Liar. Try 50 cents per gig or less! This whole article's credibility is nothing...
It usually turns out that a few files take up most of the space. SELECT filename, size FROM allDisks WHERE size > 2 MB ORDER BY size DESC. Add your filter of choice to remove important files from the result, and go ahead and erase a bunch of useless huge files.
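There is no real allDisks table to query, of course, so here is a minimal Python sketch of the same idea against a directory tree. The `large_files` name and the `keep` filter are hypothetical; the 2MB threshold comes from the query above. It only lists candidates, largest first, and leaves the erasing to you.

```python
import os


def large_files(root, min_bytes=2 * 1024 * 1024, keep=lambda path: False):
    """List (path, size) for files over min_bytes, largest first.

    keep is a caller-supplied filter: return True for files that must
    never show up as purge candidates.
    """
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size > min_bytes and not keep(path):
                results.append((path, size))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

For example, `large_files("/home", keep=lambda p: p.endswith(".pst"))` would skip mail archives, assuming that is one of the things you consider important.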