Slashdot Mirror


Media Providers And Short Online Retention?

delfstrom asks: "Retention time for online reference material is decreasing. First it was Deja moving archives offline. Now try to find the AP story you saw on Yahoo from earlier this year about a judge's order against a CyberPatrol decryption tool. You can't, because anything older than 30 days is canned from news.yahoo.com. Likewise, certain online newspapers (not to mention any names) are removing content after a mere 7 days, though for $25 per retrieved article you can go back to 1977. This certainly goes against the philosophy of not breaking links. What responsibility do information providers have in maintaining articles that they post? In this era of electronic publishing, academic papers are beginning to contain URLs in the references. To what extent can we keep copies of such information and provide it to others?"

5 of 13 comments

  1. Exactly my point! by tzanger · · Score: 3

    This is why I've been hoarding data since about 1992 or so. Anything that I deem worth keeping I keep a local copy of, whether it be my old Bluemail .qwk archives, newsgroup postings, HTML pages, Adobe Acrobat files from wherever and whenever, old .mod and .stm/s3ms, you name it. I've got .zip files I'll probably never use again, but I've kept them specifically because I got sick and tired of so-called "permanent" sites taking them off.

    Whenever my hard drive gets full, I do a couple categorization passes (I try to keep them categorized as I go but it's never quite perfect; there's always too many files in my /data/dump directory) and then make an .iso. Two copies are burned, one for my bookshelf and one for work or safe storage.

    As Signal11 once had in his .sig (and ripped from somewhere I'm not sure, but I've seen it in the old taglines of yore): I don't have a solution but I admire your problem.

  2. No sense by SEWilco · · Score: 2

    It doesn't make sense. Storage keeps getting cheaper, and they go and break the bookmarks and links which would bring people back without effort.

  3. How does this compare with "offline" news? by Muggins the Mad · · Score: 2

    I don't know myself, but I'm curious how this compares with things like old newspapers and such.

    I know newsstands tend not to keep even yesterday's papers; it's up to organisations like libraries to do that.

    Do we have any comparable organisations who specifically archive things like online news?
    How do they deal with copyright issues?

    - Muggins the Mad

  4. Re: No sense -- In The Real World... by InitZero · · Score: 3

    Storage keeps getting cheaper,

    There are three issues here. The first is that storage isn't as cheap as you think. The second is that indexes are hard to maintain. Finally, you forget that old text is a good revenue stream.

    Storage

    You are correct that space is cheap for small amounts of storage. If you go to your local computer store, you can buy a 60-gig drive for less than I paid for my first five-meg drive. I have no contention there.

    However, people who archive data for a living don't buy bare 60-gig IDE drives and string them together. It ain't that simple.

    I work for a newspaper. We have every text we have published since 1985 and every picture since 1996 (don't quote me on that last date). They are both inside IBM RS/6000s. The text archive is under 15 gig. The photo archive clocks in at 230 gig (and growing by nearly 600 meg a day).

    Initially, the data lived in a $100,000 HP optical jukebox. When that got too small, we scrapped it and bought IBM 7133 disk arrays. Bare, before you put the first drive in the box, they cost $36,000. Each nine-gig drive is $2,000. (Yes, I know you can get them cheaper. But not hot-swap, not with an IBM warranty, etc.) When you hit 144 gig (9 gig by 16 drives), you've got to buy another 7133. In order to get good performance, you can't just RAID-5 everything in one big SSA loop. You have got to have multiple paths. Each enhanced SSA card is a few thousand dollars.
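
    To put rough numbers on that, here is a back-of-the-envelope sketch in Python using only the prices quoted above. The function name array_cost is made up for illustration, and it assumes raw capacity only: no RAID parity, no hot spares, and none of the SSA cards, so treat the totals as a floor rather than a real quote.

```python
import math

# Figures quoted in the comment above.
ENCLOSURE_COST = 36_000       # bare IBM 7133, in dollars
DRIVE_COST = 2_000            # one hot-swap nine-gig drive, in dollars
DRIVE_GIG = 9                 # capacity of one drive
DRIVES_PER_ENCLOSURE = 16     # 16 x 9 gig = 144 gig per 7133


def array_cost(gig_needed):
    """Return (enclosures, drives, dollars) for gig_needed of raw storage.

    Assumption: RAID parity, hot spares and SSA adapter cards are ignored,
    so the dollar figure is a lower bound.
    """
    drives = math.ceil(gig_needed / DRIVE_GIG)
    enclosures = math.ceil(drives / DRIVES_PER_ENCLOSURE)
    dollars = enclosures * ENCLOSURE_COST + drives * DRIVE_COST
    return enclosures, drives, dollars


full_box = array_cost(144)    # one completely filled 7133
photos = array_cost(230)      # the 230-gig photo archive

print(full_box)                       # (1, 16, 68000)
print(photos)                         # (2, 26, 124000)
print(round(full_box[2] / 144, 2))    # 472.22 dollars per gig
```

    Even on those generous assumptions, a full box works out to several hundred dollars per gig, and the photo archive already spills into a second enclosure.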

    Indexing

    Keeping the raw images isn't that difficult in the grand scheme of things. Indexing and searching for content, however, is far from trivial. Keeping the database well-groomed is hard work. You do want all the stuff these web sites keep online to be searchable, right?

    Storing photographs is especially difficult. For a quick discussion on archiving images, see this post from a week or so ago.

    Revenue

    Newspapers sell you a hundred stories with pictures and comics a day for, generally, 50 cents. However, if you want a story that was in last year's newspaper, they can charge you five dollars for that story and you will pay it.
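
    A quick sketch of that markup in Python, taking the round figures above at face value (exactly 100 stories per 50-cent issue, five dollars per archived story; the variable names are only for illustration):

```python
from fractions import Fraction

stories_per_issue = 100            # "a hundred stories"
issue_price = Fraction(50, 100)    # 50 cents, in dollars
archive_price = Fraction(5)        # five dollars for one old story

# Effective price of a single story in today's paper.
fresh_story_price = issue_price / stories_per_issue

# How many times more an archived story costs than a fresh one.
markup = archive_price / fresh_story_price

print(float(fresh_story_price))    # 0.005
print(markup)                      # 1000
```

    In other words, a story from the archive sells for about a thousand times what it fetched the day it ran.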

    Why on earth would newspapers give you content for free that they spent money to create and archive? Yeah, yeah, information wants to be free and all that, but they still have to make a profit; otherwise there will be no information to be made free.

    Solution?

    The obvious solution is for these media outlets to charge for old stories. That way the links don't break and they have a way to support the archive and indexing costs. Folks here won't like that idea.

    Summary

    It's easy to say that the media should keep everything online all the time. In the real world, however, there are problems with doing just that. The problems are both technical and financial. Information may want to be free but 'wanting' doesn't pay the bills.

    InitZero

  5. Re: No sense -- In The Real World... by InitZero · · Score: 2

    We just installed a 500 Gig RAID for US$20,000 for storing huge (and critical) medical images.

    Does that storage have a single point of failure? Is it mirrored? Is it SSA? Will it work on an RS/6000? Can it be backed up to ADSM/TSM?

    All of these are critical questions for us. There are many solutions that will hold a lot of data for little cost. Take the 1U Maxtor box for example. At under $5,000 for 320 gig, it sounds good. However, it only has one NIC and doesn't support an SSA connection so we can't use it. It doesn't scale well within our application environment.

    InitZero