Slashdot Mirror


To Purge Or Not To Purge Your Data

Lucas123 writes "The average company pays from $1 million to $3 million per terabyte of data during legal e-discovery. The average employee generates 10GB of data per year at a cost of $5 per gigabyte to back it up — so a 5,000-worker company will pay out $1.25 million for five years of storage. So while you need to pay attention to retaining data for business and legal requirements, experts say you also need to be keeping less, according to a story on Computerworld. The problem is, most organizations hang on to more data than they need, for much longer than they should. 'Many people would prefer to throw technology at the problem than address it at a business level by making changes in policies and processes.'"

36 of 190 comments

  1. Easier to keep by Geoffrey.landis · · Score: 5, Insightful

    The problem is that it's easier to just archive the cruft stuff than it is to go through it all and figure out what's worth keeping.

    --
    http://www.geoffreylandis.com
    1. Re:Easier to keep by Daimanta · · Score: 5, Insightful

      True, proper archiving takes huge amounts of time since it adds overhead to your operation.

      In an ideal world, everything that you store is automatically labeled and old data will automagically be purged. But storing all kinds of shit is just that much easier. It also doesn't help that data storage is so dirt cheap. 1TB can be bought for around $100 if I am not mistaken. It doesn't pay to kill old useless stuff you have floating on your hard disk.

      --
      Knowledge is power. Knowledge shared is power lost.
    2. Re:Easier to keep by Sobrique · · Score: 4, Insightful
      Add to that legal requirements of retention - you'll need to filter your 'customer communications' from your 'shopping lists'. That's what actually makes this a nuisance - the possibility that there will be legal action in 5 years time, that you'll need to fight.

      Yes, less data needs to be kept, but first there needs to be a _massive_ re-education of the 'data packrat' culture that the users of it have.

    3. Re:Easier to keep by sunking2 · · Score: 4, Insightful

      Cheaper to keep. Every hour I waste cleaning house costs more than it does to keep it stored. Storage continues to get cheaper, salaries typically don't. Sure, that $1.25M is a big scary number. But nothing compared to the salaries/benefits at a 5000 person company. Now you can argue the cost of data retrieval goes way up because chances are it'll take a hell of a lot longer to find, but that's a different argument altogether and you can just as easily question what the cost of not being able to recover something that was cleaned by accident is.

    4. Re:Easier to keep by zappepcs · · Score: 2, Insightful

      The problem is that it's easier to just archive the cruft stuff than it is to go through it all and figure out what's worth keeping or training staff to organize their data and retain only that which is necessary .

      There, fixed that for you. Meta-tags and other efforts might change this in the future, but until there is a generalized understanding of things that should be archived and things that should not, and a better way to store, find, retrieve, and utilize company data, there will be tons of data saved that really should not be. Humans are like that.

    5. Re:Easier to keep by daeg · · Score: 5, Insightful

      The bigger problem is that you will fight different battles. If you're fighting a sales rep that sold your clients to a competitor, you want as much ammunition as possible. If a client is suing you for incorrect information relayed 8 years ago and you're probably guilty, you want as little information as possible.

    6. Re:Easier to keep by COMON$ · · Score: 3, Interesting
      What I want to know is how these numbers are broken down. $5 per gigabyte to back up? Maybe if you factor in the cost of a robotic library. Considering that tapes currently run about $30 a pop for 800GB and that I am on a 12 month rotation, I still don't come NEAR that price. 1.25 million for a 5000 person company? What kind of company? 10GB average is about 9GB over my average user here. Even when I worked at a larger company, we still weren't even breaching 700MB average INCLUDING e-mail.

      Lovely scaremongering, but what did they mean by legal e-discovery? The time it takes to sort through the data or what?

      --
      CS: It is all sink or swim...oh and did I mention there are sharks in that water?
    7. Re:Easier to keep by Chrisq · · Score: 3, Interesting

      We went paperless, and when application forms, etc. arrive they are scanned and stored. Examination of the data showed that very often people would print out all the existing information on a customer and add it to the pile sent for scanning.

      Result, look up a customer and you would find some files scanned half a dozen times.

    8. Re:Easier to keep by TheRaven64 · · Score: 2, Insightful
      The $5 presumably includes the physical media, the backup operator's time spent configuring the system, the hardware for performing the backup, and the safe, secure, off-site storage costs. 10GB per year is a lot more than I produce - my PhD was only 1.5GB in total, including temporary files (build cruft and so on), with only 210MB needed for the subversion repository (176MB after bzip2) - the bzip2'd repository of my book (including all text and code examples) is only 4.6MB. My mail folder is only 3GB, and that contains over ten years of email messages (and would compress very well).

      On the other hand, I don't use Word, which manages to make single-page documents that are more or less plain text take up a few MBs. If you're in a company where everyone sends Word document attachments as emails instead of plain text (I've seen it done[1]) then you could probably generate 10-20MB of data per day from around 5KB of actual content, and backing this up might be cheaper than educating your users. Assuming some other work as well as emails this can easily get to 10GB.

      [1] Even worse was my publisher, who sent me a scanned version of a contract as a Word document. A PNG of the same image was around 100KB, while the word document was 5MB and contained nothing other than the image. A lot of people just treat Word documents as a default container format for any content.

      --
      I am TheRaven on Soylent News
    9. Re:Easier to keep by BobMcD · · Score: 3, Interesting

      you'll need to filter your 'customer communications' from your 'shopping lists'

      Actually, I thought it was a fairly common legal tactic to make the data as difficult to actually find as possible, without revealing too much to the other side.

      "They want records from three years ago? Send a truck with printouts of all the files we have, that'll keep them busy..."

      Does anyone know that this is no longer the case?

    10. Re:Easier to keep by vvaduva · · Score: 3, Insightful

      Well, I did not RTFA in detail but it does not seem to address key regulations like HIPAA and SOX which put hard numbers on data retention. So whether or not it's expensive, you have to do it if you want to be legit. If the issue is discovery, a sound archival system will eliminate expenses related to discovery and would allow one to provide requested information very quickly and efficiently. I say let the legal people fight discovery requests and unless you have something to hide, stick with the requirements for archival and retention. The argument "the less you keep the less they ask for" is simply stupid. In certain SOX-related situations, even the appearance of impropriety will come back to bite you, so I always tell folks to do the right thing: run your business properly, identify document types correctly and stick to regulatory requirements as much as possible.

    11. Re:Easier to keep by Geoffrey.landis · · Score: 2, Interesting

      The problem is that it's easier to just archive the cruft stuff than it is to go through it all and figure out what's worth keeping or training staff to organize their data and retain only that which is necessary .

      There, fixed that for you.

      According to the original article, ("The average employee generates 10GB of data per year at a cost of $5 per gigabyte to back it up ") the cost of backups is fifty dollars a year per employee.

      So if an average employee costs the company $100 per hour (including overhead), then if "training staff to organize their data and retain only that which is necessary" takes more than half an hour per year, it's more cost effective to archive the junk than it is to train the employees to sort it.
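      A quick sketch of that break-even in Python, using only the figures quoted above. The $100 per hour loaded cost is the commenter's assumption, and the constant and function names are made up for illustration.

```python
# Figures quoted in the thread; the hourly cost is the commenter's assumption.
GB_PER_EMPLOYEE_PER_YEAR = 10
BACKUP_COST_PER_GB = 5.0         # dollars
EMPLOYEE_COST_PER_HOUR = 100.0   # dollars, including overhead


def backup_cost_per_employee_per_year():
    """Yearly backup bill for one employee, in dollars."""
    return GB_PER_EMPLOYEE_PER_YEAR * BACKUP_COST_PER_GB


def break_even_training_hours():
    """Hours of training per year that cost as much as a year of backups."""
    return backup_cost_per_employee_per_year() / EMPLOYEE_COST_PER_HOUR


print(backup_cost_per_employee_per_year())  # 50.0
print(break_even_training_hours())          # 0.5
```

      Any training that takes longer than the second number costs more than the backups it is meant to avoid, at least under these inputs.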

      --
      http://www.geoffreylandis.com
    12. Re:Easier to keep by cmause · · Score: 5, Interesting
      There used to be a sort of gentlemen's agreement between attorneys to not dig in to electronically stored information (ESI). That was back when everything important ended up on paper anyway, which was discoverable.

      As time went on, fewer things ended up on paper, but the rules of discovery didn't evolve. That was the time of backing up a U-Haul full of printed out copies of every file, e-mail, etc. that a company had. The opposition then had to dig through mounds of trash in the hope of finding that one incriminating document.

      Then attorneys got more savvy, and in the so-called Rule 26 (refers to the Federal Rules of Civil Procedure), the attorneys would agree on the format of ESI to be exchanged. In December, 2006, the Federal Rules of Civil Procedure changed to directly address ESI and electronic discovery.

      Now, in litigation, parties may still get obnoxious amounts of data, but it's electronic. Once it's processed and converted (usually to TIFFs with extracted text, but sometimes PDF), attorneys can do what amounts to a Google search through the files and find what they want pretty quickly. In fact, paper documents are usually scanned and OCRed so they can be handled and searched in the same manner.

      Actually, I thought it was a fairly common legal tactic to make the data as difficult to actually find as possible, without revealing too much to the other side.

      "They want records from three years ago? Send a truck with printouts of all the files we have, that'll keep them busy..."

      Does anyone know that this is no longer the case?

      So no, it's no longer the case. But the first guy who did it must have thought he was pretty funny.

    13. Re:Easier to keep by guruevi · · Score: 2, Informative

      1) This is the average. Your company might have 700MB/user; in my organization, it's close to 1TB/user/year that gets added. We're doing medical imaging.

      2) It's not just tape libraries. The cost for D2D2T or D2D2D (what we're doing) goes way up compared to a 'simple' backup scheme. Especially if you're like us and require multiple gigabit streams, disk storage can't be just 4 cheap SATA disks in RAID5. We have 2 storage arrays with 14 drives each for general access and another storage array with 10 SATA disks for primary backup, and those things don't come very cheap, especially since you need multiple servers to handle the load.

      3) Encryption, tape rotation or multiple locations add to the costs.

      4) If you're buying a solution, e.g. from IBM (Tivoli), you need to pay for a consultant and/or another employee to get that stuff running. We're doing what we're doing with open source and it's going well, but if you can't and need to pay for software, it adds up (especially for Windows systems).

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    14. Re:Easier to keep by euri.ca · · Score: 2, Interesting

      Let's not get snippy here, but I think the consensus is that:

      • $5/GB is reasonable (or low) for hardcore backups like the source tree, accounting records (anything where you have a person verifying that it's there is super expensive by default)
      • 90-100% of what any typical user makes (the 5GB/year figure) doesn't (or at least shouldn't) make its way into the expensive storage. But it might anyway, because your options for backing up email easily are limited.

      Of the 30 gigs of things I've put on this laptop this year, maybe 100 megs have been checked-in to CVS (and the expensive backups), I doubt accounting and HR have generated another 4.9Gigs on me this year.

  2. Huh? by qoncept · · Score: 4, Insightful

    $250k a year for a 5000 employee company? To put it in perspective, if the average employee at this company is making $60k a year, this company will be paying $1.5 billion in salaries over the same 5 years. To be fair, I think the estimated cost from the article is very much underestimated. But while corporate storage costs more than you'd think, and companies are definitely storing a whole bunch of data they don't need, what about the costs of reviewing and purging that data? That is straight up time, whether it's reviewing existing data or spending the time to create guidelines for which data to keep. And time costs money. More than storage.
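    The comparison is easy to check. A minimal sketch in Python, taking the 10GB per employee and $5 per gigabyte from the summary and the $60k average salary from this comment (that last figure is the commenter's assumption, not the article's):

```python
EMPLOYEES = 5000
GB_PER_EMPLOYEE_PER_YEAR = 10
BACKUP_COST_PER_GB = 5      # dollars
AVERAGE_SALARY = 60_000     # dollars per year, the commenter's assumption
YEARS = 5

storage_per_year = EMPLOYEES * GB_PER_EMPLOYEE_PER_YEAR * BACKUP_COST_PER_GB
storage_total = storage_per_year * YEARS
salary_total = EMPLOYEES * AVERAGE_SALARY * YEARS

print(storage_per_year)                        # 250000
print(storage_total)                           # 1250000
print(salary_total)                            # 1500000000
print(f"{storage_total / salary_total:.4%}")   # 0.0833%
```

    On these numbers the storage bill is well under a tenth of a percent of payroll over the same period.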

    --
    Whale
  3. 10 GB user data? Not likely by arth1 · · Score: 5, Insightful

    10 GB of data per user, sure.
    10 GB of user data, no way.
    If assuming 300 work days per employee, that would mean that the average employee creates 1.2 kB of data per second.

    The only way this could be true is if you count data that isn't user generated, and they count the total data storage for the company and divide it by employees.
    If so, users deleting their e-mails won't have much of an effect.
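    For anyone who wants to check the per-second figure, here is a rough sketch. The 300 work days come from the comment; the 8-hour working day and the decimal gigabyte are assumptions added here.

```python
GB = 10**9  # decimal gigabyte, in bytes (an assumption)


def bytes_per_second(gb_per_year=10, work_days=300, hours_per_day=8):
    """Average data rate while at work, in bytes per second."""
    seconds_worked = work_days * hours_per_day * 3600
    return gb_per_year * GB / seconds_worked


print(round(bytes_per_second()))  # 1157
```

    That is roughly 1.2 kB for every second of every working day.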

  4. It's not the storage... it's the apps by paulhar · · Score: 4, Insightful

    Apps aren't really designed with this in mind. They don't come at the problem from a "document lifecycle" perspective but from a "document creation" one.

    This is generally because data has a variable lifespan. Let's take an email as part of a project as an example. As the author I may decide that the email isn't needed after a week, so I set an expiry of 1 week. But you, as the recipient, may take that email and turn it into several tasks, so for you the email is much more important and you want to keep it for much longer.

    Users aren't really going to be good at making these decisions unless some application continually bombards them with "go check the status of these 1000 documents you've got".

    1. Re:It's not the storage... it's the apps by ubercam · · Score: 3, Informative

      Users aren't meant to be making those decisions, the Records Management department should be... that is if you even have one! If you leave everything up to the users, you WILL have a cluster fuck of records.

      I work in Records Management at a large company with many different divisions in diverse fields. RM is completely left up to us. We manage well over 10,000 boxes and there's only 3 of us. We alone determine when something is to be destroyed (but require authorization from dept heads to be shredded), how long it's kept, etc.

      Disclaimer: We work mainly with paper records, but the exact same principles apply to electronic records.

      You need a retention schedule. Look at your national, state/provincial and municipal laws to determine the minimum legally required length of time each TYPE of record is to be kept. Employee time cards are different from pension plans, sales invoices and legal files. It's not *always* 7 years either. Some are less, some are more, some are permanent. Also, you don't have to shred when the law says it's time if there's a valid business reason to keep that set of records. I mean, let's get this straight. You don't HAVE TO shred at all, but you're digging yourself a deep hole if you don't... "You can get in just as much trouble by keeping records too long as you can by destroying them too quickly." - Dr. Mark Langemo
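      A retention schedule is essentially a lookup table from record type to minimum retention period. A minimal sketch in Python follows; the record types echo the ones named above, but every period in the table is a placeholder invented for illustration, not a legal figure, and the function names are hypothetical.

```python
from datetime import date

# Hypothetical schedule: these periods are placeholders, not legal advice.
# None means "keep permanently".
RETENTION_YEARS = {
    "time_card": 3,
    "sales_invoice": 7,
    "legal_file": 10,
    "pension_plan": None,
}


def add_years(d, years):
    """Return d shifted by whole years, mapping Feb 29 to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def earliest_destruction_date(record_type, closed_on):
    """First date the record may be destroyed, or None if it is permanent."""
    years = RETENTION_YEARS[record_type]
    if years is None:
        return None
    return add_years(closed_on, years)


def eligible_for_destruction(record_type, closed_on, today, business_hold=False):
    """True once the minimum period has passed and nobody needs the record."""
    due = earliest_destruction_date(record_type, closed_on)
    return due is not None and today >= due and not business_hold


print(eligible_for_destruction("time_card", date(2004, 1, 15), date(2008, 7, 1)))
print(eligible_for_destruction("pension_plan", date(1990, 1, 1), date(2008, 7, 1)))
```

      The `business_hold` flag stands in for the "valid business reason to keep" case: passing the legal minimum makes a record eligible, it does not force destruction.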

      If this was all left up to individuals, they would just keep everything. I've seen what this is like, and it's pathetic, maddening and counter productive. Things must be properly named and catalogued down to the file level when put in storage, or you will NEVER find ANYTHING without an exhaustive search EVERY time. It might be alright when it's on your desk or in your local filing area and you know what's where, but when you archive it, you can't assume the guy looking for your file you need knows anything about it. We need explicit details or else we can't help you. At my company we require everyone to fill out a nice sheet detailing the contents of their box, the type of records, dates (most remember dates above all else), sender's name, dept, etc.

      We are by no means a perfect operation here, but we're far better than 90% of other companies out there.

      There is a series of excellent seminars done by Dr. Mark Langemo (sorry no links) to teach you how to deal with records. Also check out ARMA International if you're looking to get in touch with other Records Managers in your area. They have local chapters all over the place.

      To summarize, if your company doesn't have a Records Manager, HIRE ONE NOW and give him/her the resources to get your records under control! Check out ARMA, they have jobs posted on their site. There are also many companies out there that will help you clean up your stuff and get you started on the right track.

  5. Re:hmm by NoisySplatter · · Score: 3, Insightful

    It's not so much that you want your company to have a leg to stand on, it's that you don't want your legal opposition to get their foot in the door. Innocent until proven guilty, remember?

    --
    In Soviet Russia meme tires of you!
  6. It depends upon business by William+Robinson · · Score: 2, Informative

    For example, financial institutions are required to keep data for a longer period for legal purposes as well as traceability (during investigation of fraud or other kinds of crime). The banks I worked for had a legal requirement to keep data in 2 places at least 15 km apart, with all kinds of protection against fire and intrusion.

    A good manufacturing company would keep data for a longer period not only to comply with ISO standards, but also to trace manufacturing defects and to have good evidence of past history for the insurance company in case of theft, fire and other kinds of problems.

    We used to keep the daily source code changes of only the previous releases and purge those of the rest (we kept the final source code and patches of all previous releases, but purged the daily changes).

    In a nutshell, it depends upon your type of business.

    1. Re:It depends upon business by PainKilleR-CE · · Score: 3, Insightful

      Additionally, there are many businesses that don't understand their data retention requirements beyond 'we need to keep some data for 10 years', so instead of compartmentalizing their data and saying 'keep this for 10 years, that for 5 years, and purge this every year and that every 3 months', they just keep everything. Further, if they have a data retention requirement for 3 years or 10 years, they might wait longer before purging it just because it's easier to keep it than it is to go find and remove the 5 or 12 year old data.

      I only recently organized some data being maintained by the company I work for that was basically divided into 'archived' and 'live' data, logs generated by a many-user application. The 'archived' data went back 4 or 5 years with no easy distinction between data that was many years old and data that was generated in the most recent archive. Now at least the data is sorted by date (and being archived by date), so that when someone decides on how long we want to keep it (they can't seem to make up their mind, and while everyone seems to agree that we don't need data from 2005 and earlier, no one's willing to say I can delete it, either), it won't be hard to dump the older data at least on an annual or semi-annual basis.

      --
      -PainKilleR-[CE]
  7. Re:hmm by MrMr · · Score: 5, Interesting

    The top 500 company I worked for did just the opposite: Destroy all data in case a legal issue comes up.
    They called it 'desk cleanout day', and unless you were an official dedicated contact on a particular subject you were to wipe all correspondence of more than a year old.
    (There were also other grades of information, but erase after a year was the default).

  8. Email Attachments by whisper_jeff · · Score: 4, Insightful

    I don't know what most major companies' policies are regarding backing up emails (just back up the text or back up emails plus attachments) but, as but one example, I'm sure this would be an easy spot for most companies to dramatically reduce the amount of storage space required. Most business communications I see from corporate personnel have various attachments on every email - things like logos, custom backgrounds, etc. Forget getting rid of all the unnecessary attachments - getting rid of the "look at my pretty email that looks like a page from a spiral-bound notebook with my company logo at the bottom" images, and the hundreds and thousands of duplicates of those images, would reduce storage requirements, bandwidth requirements, and probably make corporate communications look more, you know, professional. So many emails are filled with unnecessary garbage and, if that's being backed up, that garbage can get costly.

    Then again, I'm biased - I believe email should just be pure text. Perhaps that's a sign that I'm now old...

  9. My last job by dj245 · · Score: 2, Interesting

    My last job had some files from the 1890's. The company had moved from New York to New Jersey to Houston in all that time. I can't imagine that material would ever need to be used, or would be called up during a legal investigation. Even if it were, would the authorities penalize a company for files that were that old??? At some point, everything is trashable or museum material.

    This company occasionally needed blueprints from the 1930s/1940s (great lakes ships), but none of their ships went back much further than that.

    --
    Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
  10. Communicate less by Yvanhoe · · Score: 2, Interesting

    In a world where backup takes money, a law that says to companies "keep every communication backed up" is saying essentially the same thing as "communicate less".

    --
    The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
  11. easy solution by circletimessquare · · Score: 2, Funny

    put everything on one disk drive, unRAIDed. when it fails, problem solved. voila, built in obsolescence

    --
    intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
  12. Re:I'm 500% better than average! by Chris+Mattern · · Score: 2, Insightful

    Unfortunately, writable DVDs are not an acceptable archive medium, and a stack of disks with written labels is not an indexing solution that will scale beyond one person.

  13. Re:10 GB user data? Not likely by value_added · · Score: 2, Funny

    If assuming 300 work days per employee, that would mean that the average employee creates 1.2 kB of data per second.

    Top posting and absence of editing by Microsoft Outlook users engaged in a brief inter-departmental discussion could easily account for that volume.

    Is that what you meant by "isn't user generated"?

  14. Yes--deleting costs money! by mkcmkc · · Score: 4, Insightful

    I did a back-of-the-envelope calculation on just this question in 2004, and estimated that file deletion was not productive unless we could do it at a rate of at least 17MB per minute (of labor). Four years later the threshold is probably at least 45MB per minute.

    Generally, this means that if we can blow away whole disks or huge directories of data, it may pay off. Users going through their files one by one is usually an absolute waste.
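    The inputs behind the 17MB and 45MB figures aren't given, so the sketch below only shows the shape of such a calculation and is not a reconstruction of it. The labor rate and storage cost are placeholders borrowed from elsewhere in this thread ($100 per hour, $5 per gigabyte), and the function name is made up.

```python
def break_even_mb_per_minute(labor_cost_per_hour, storage_cost_per_gb):
    """MB that must be freed per minute of labor for deletion to pay for itself."""
    labor_per_minute = labor_cost_per_hour / 60
    storage_per_mb = storage_cost_per_gb / 1024
    return labor_per_minute / storage_per_mb


# Placeholder inputs, not the commenter's.
print(round(break_even_mb_per_minute(100, 5), 1))    # 341.3
# Halving the storage cost doubles the rate needed to break even.
print(round(break_even_mb_per_minute(100, 2.5), 1))  # 682.7
```

    Whatever the actual inputs, the threshold moves the way the comment says: as storage gets cheaper relative to labor, you have to delete faster for it to be worth doing.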

    --
    "Not an actor, but he plays one on TV."
    1. Re:Yes--deleting costs money! by ckaminski · · Score: 2

      Currently filesystems track the following:

            Creation time
            Last Access Time
            Last Modified Time

      If we also had a

            Last backed up time/scanned time

      that virus scanners and backup software could use instead, then you can track last-access to eliminate files that haven't been opened by end-users in a particular time period for permanent offsiting or removal. Making today's complex HSM architectures easier to implement or not necessary at all.
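      For what it's worth, the last-access half of this can already be sketched with Python's standard `os.stat`. The function name and the idle threshold are made up for illustration. As the comment points out, the result is only as good as the access times themselves: backup and virus-scan passes may touch them, and filesystems mounted with options such as `noatime` or `relatime` don't update them on every read.

```python
import os
import time


def stale_files(root, max_idle_days, now=None):
    """Yield (path, idle_days) for files under root not accessed recently."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime < cutoff:
                yield path, (now - st.st_atime) / 86400
```

      Calling `list(stale_files("/some/share", 365))` would give the candidates for offsiting or removal; what to do with them is a policy decision, not something this sketch decides.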

  15. litigation hold by Benjamin_Wright · · Score: 2, Informative

    Any record destruction policy must include a "litigation hold". A litigation hold means that record destruction must stop when litigation is anticipated or pending. But in a complex enterprise, it is tricky to know what litigation the enterprise anticipates. It was the trickiness of litigation hold that led to the demise of Arthur Andersen. The risks associated with litigation hold give enterprises incentive to store lots more records. --Ben http://hack-igations.blogspot.com/2008/07/document-discovery-litigation-hold.html

    --
    Benjamin Wright, Dallas, Texas, benjaminwright.us
  16. Mod parent way up! by khasim · · Score: 3, Interesting

    Congratulations. You're the first person I've seen who understands that.

    Accounting understands the need to close one year and open the next. They have processes for what is carried over and how it is identified.

    Yet no other department (or application) understands the need to close old data and archive it.

  17. Re:hmm by Lumpy · · Score: 2, Interesting

    That was a common company-wide AT&T policy: wipe everything after 60 days. All email was to be deleted after 60 days. Creating a PST file on your desktop was a fireable offense, and we swept corporate PCs for PST files on a regular basis.

    It really did not stop anyone from keeping info. Many managers simply printed out the emails and kept them in files; one IT manager we let go had 3 years of email printed and stored in file cabinets in his office. It was insane.

    --
    Do not look at laser with remaining good eye.
  18. Store Smarter, Not Just More by Doc+Ruby · · Score: 2, Interesting

    Let's say your corp is more than 50% likely to go through "e-discovery" once every 10 years. Each worker will generate 10GB * 10 years = 100GB, backing up all the increasing data pile is (pairing the balancing ends of the accumulation for half the accumulation years) 110GB * 5 = 550GB, at $5:GB is $2750, plus about $2M:TB * 550GB = $1.1M, for a total of $1,102,750 per worker, times at least 0.50 probability is at least $551,375 average predictable cost per employee.
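    A sketch of that estimate in Python, following the same method: each year's backup covers everything accumulated so far (10, 20, ..., 100GB, which sums to 550GB), and the e-discovery rate is applied to that cumulative volume. The $2M per terabyte is the midpoint of the article's range, and the 0.5 probability and 1000GB terabyte are assumptions carried over from the comment.

```python
GB_PER_YEAR = 10
YEARS = 10
BACKUP_COST_PER_GB = 5              # dollars
DISCOVERY_COST_PER_TB = 2_000_000   # dollars, midpoint of the $1M-$3M range
P_DISCOVERY = 0.5                   # assumed chance of e-discovery in 10 years

# Data on hand at the end of each year: 10, 20, ..., 100 GB.
yearly_volumes = [GB_PER_YEAR * year for year in range(1, YEARS + 1)]
backed_up_gb = sum(yearly_volumes)
backup_cost = backed_up_gb * BACKUP_COST_PER_GB
discovery_cost = DISCOVERY_COST_PER_TB * backed_up_gb / 1000
expected_cost = P_DISCOVERY * (backup_cost + discovery_cost)

print(backed_up_gb)     # 550
print(backup_cost)      # 2750
print(discovery_cost)   # 1100000.0
print(expected_cost)    # 551375.0
```

    The backup line is a rounding error next to the discovery line, which is the point of the argument that follows.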

    One approach is to keep much less data. But when you keep less data, you have to guess right every time what data you'll need later. If your process discards data that's valuable later (but lost) it better be worth less than the amount you save. That's too hard to know, which is one reason companies keep all the data, and figure it out later.

    A better approach is just to cut that $1-3M:TB e-discovery cost. Of course, the best way is to avoid being investigated, but one has less than 100% control over that, especially from inside the IT department. A much better way to do it is to better inventory the data stored as you go along accumulating it, in the terms in which a later e-discovery would search it. Which also can have the benefit of making the info in the data more available in the normal course of business, which can make that data's increased value (and lowered costs of searching it) worth the entire process. The cheaper possible e-discovery would be just a bonus.

    What really gets me is how these economics are the true cost of storage. A 1TB drive costs $120, and maybe a better 1TB in a 100% redundant RAID costs $250. But it really costs something like $300,000 over its lifetime (probably replaced every 3 or so years, across the 10 years I analyzed). If IT spent a few hundred hours a year streamlining the navigation of all that data, at a cost of a few tens of thousands of dollars, divided across all those employees, the entire org's IT operations would be much more economical, when the large cumulative risk of e-discovery costs is factored into the true cost.

    --
    make install -not war

  19. Although I agree with you in principle.... by Degrees · · Score: 2, Informative

    I've become the e-discovery guy (at least for email) where I work. Our lawyers told me that the latest revision of FRCP (Federal Rules of Civil Procedure) requires an entity to keep evidence, even if automatic purging systems are in place.

    Rule 37 of FRCP says that if you are ordered to hand over the evidence, and you cannot, then the judge can order that "designated facts be taken as established for purposes of the action, as the prevailing party claims". In other words, if the person suing you claims you sent them an email offering a million dollars to not go to court, and you auto-purge your email (taking away the ability to prove you didn't send the email), the judge has the option of deciding that yes you did make an offer of a million dollars via email. T'would suck to be you.

    It even gets a little worse. Although you must keep evidence after being told you are being taken to court, it turns out you need to keep all evidence in case you are taken to court. I'm told that the criterion here is "reasonable expectation that the matter will go to court". It's reasonable (for example) to expect to end up in court if an employee dies while on the job (and it wasn't due to natural causes). The point here is that if a person dies, you'd better keep any email about the situation that led to the death - 60 day auto-purging email expiration practice be damned.

    Auto-purging is a fine thing, as long as you have the ability to except items out, in case they become evidence.
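    To make that concrete, here is a minimal sketch of an auto-purge pass with a hold list. The `Message` class, the tag scheme, and the 60-day default are all hypothetical, not any particular mail system's API. The idea is simply that anything tagged with a matter under hold is skipped no matter how old it is.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Message:
    msg_id: str
    received: date
    tags: set = field(default_factory=set)


def purge(messages, today, max_age_days=60, held_tags=frozenset()):
    """Split messages into (kept, purged); anything under hold is always kept."""
    cutoff = today - timedelta(days=max_age_days)
    kept, purged = [], []
    for m in messages:
        on_hold = bool(m.tags & held_tags)
        if m.received < cutoff and not on_hold:
            purged.append(m)
        else:
            kept.append(m)
    return kept, purged


inbox = [
    Message("a", date(2008, 1, 10)),
    Message("b", date(2008, 1, 12), {"incident-report"}),
    Message("c", date(2008, 6, 20)),
]
kept, purged = purge(inbox, date(2008, 7, 1), held_tags={"incident-report"})
print([m.msg_id for m in kept])    # ['b', 'c']
print([m.msg_id for m in purged])  # ['a']
```

    The hard part is not the code but knowing which tags belong in `held_tags`, and knowing it before the purge runs.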

    --
    "The most sensible request of government we make is not, "Do something!" But "Quit it!"