Archiving Digital History at the NARA
val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a 'digital dark age'?"
Hm. This sounds like a job for OpenOffice...
Ok, I was tempted to make a pr0n joke about this, but I think the bigger question is what kind of indexing system will this use?
I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack, err. haybarn.
Can't they get more storage performance out of their system by (more) aggressively compressing old information? That shouldn't matter too much to the indexing mechanism. Also, it might make sense to tag the importance of different documents so that their compression/archiving treatment can depend on that.
see a Text Widget
Surely data of this sort lends itself well to compression?
It happened with the Great Library of Alexandria, with pagan libraries throughout the Christian era, and more recently has happened with antiquities in Afghanistan and Iraq. The only thing that can reliably preserve data is large scale, geographically widespread distribution of copies.
Perhaps, the answer is compression.
Does anyone know whether there is an upper limit to text compression?
In signal processing, there is a limit called the Shannon Capacity theorem, which gives the maximum amount of information that can be transmitted on a channel. In text compression, is there a similar limit?
Note that the Shannon Capacity theorem does not tell you how to reach that limit. The theorem merely tells you what the limit is. For decades, we knew that the maximum limit on a normal telephone twisted pair is about 56,000 bits per second, according to the theorem. However, we did not know how to reach it until Trellis coding was discovered, according to an electronic communications colleague at the institute where I work.
If we can calculate a similar limit for text compression, then we can know whether further research to find better text compression algorithms would be potentially fruitful. If we are already at the limit, then we should spend the money on finding denser storage media.
"Archiving Digital History at the NARA"
You'll have to pry it from my cold, dead hands!
Ohhhh, NARA, not NRA....
Perhaps it would be best to keep it all, even the stuff that now may seem totally useless, like Clinton administration emails from Janet Reno to Madeleine Albright asking what she thinks about Norman Mineta and his "hot Asian vibe." With search technology improving constantly, it would probably be better than throwing stuff away which could potentially be of interest, or spending time developing the AI to make the task less time-consuming. And besides, we can't make future historians' jobs too easy. They've gotta earn their pay, reminding us of the banalities of this age.
Doesn't matter. We can't absorb the information available at any moment in real time. So we certainly cannot go back and absorb it later.
The abandonment of the notion that information should be evaluated and only the best archived -- as in traditional libraries -- is indeed likely to lead to a dark age. But it will be symmetric to the old ones: can't find the target in the clutter instead of being unable to find it in the desert.
With the new GoogleNARA...
nara.google.com
Oh, wait... I'm getting ahead of myself...
IANAL, but I've seen actors play them on TV
Dear Monica,
I did what last night? Man, I must have been smashed. You sure? ROTFLMAO...
Yours truly,
Bill
Seriously, we're archiving every little tiny 1 and 0 for what reason? There's some things that can just go in a zip file and be put on a CD and that's it. Want them to stick around forever? Have files put out every so often in leftover space on AOL CDs. They'll never be gone forever.
If my grammar and spelling are off, I am [distracted/tired/careless] (take your pick)
In the age of pen and paper, only important stuff was written down. Nowadays all crap is preserved. This is useless. There is a big difference between data and information.
Oh well, what the hell...
Deleting e-mails seems to be a good way to avoid archiving issues.
http://archives.cnn.com/2000/ALLPOLITICS/stories/03/23/whitehouse.email/
Humor from a Genetically Molested Mind
It'd be crazy to suggest the NARA audit every single bit (no pun intended) of archival data to determine whether it's worth archiving or not -- not only is it impossible, it flies in the face of the whole idea of archiving. However, the estimate of 347 petabytes may perhaps be too pessimistic, as surely not every kind of information they have is worth archiving. Just my two cents.
I think more accurately, we are headed towards an age of super-saturation of information. I have no doubt we can store all the data we are currently and will be generating. The question is how do we process it into something meaningful? Just because we have the ability to archive everything does not mean it will be useful to the [insert personally welcomed overlord] of the future.
Maybe historians of the future will be fascinated that Clinton's instant-message signoff was "l8ter d00d", but I doubt it. We'll want to save everything now of course, because we can. But the majority of the information I suspect will just be filtered out when actually searched.
Personally, I take the "you never know" ideology and save everything.
Digital technologies mean that archivists now enjoy orders of magnitude more information than they had in the past. Consider all the hallway and phone conversations or jotted notes lost in a paper-based organization versus having an archive of e-mail, IM, and sticky-note digital files.
Digital technologies mean that archivists now enjoy orders of magnitude more potential accessibility than in the past. Even if paper has greater innate archival lifespan, its physical form makes it inaccessible to all but a select monkish class of archivists colocated with their paper archives. Even the select few archivists who are allowed access to paper archives can only effectively process at best a dozen documents per minute (and only a dozen per hour if they must wander the files to find randomly dispersed documents).
By contrast, digital technologies radically expand access on two dimensions. First, technology expands the number of people that can access an archive in terms of distance -- a remote researcher can have full access, including access to documents in use by other archivists. A low cost to copy documents means a wealth of information. Second, search tools provide prodigious access to the files -- searching/accessing/reading thousands or millions of documents per second.
To say we face a dark age is to presume that paper documents provided far more enlightenment and comprehensiveness of documentation than paper ever actually did.
Two wrongs don't make a right, but three lefts do.
From the article:
and the 2000 census returns were converted into more than 600 million TIFF-format image files, some 40 terabytes of data
Why TIFF!? PNG (or any other lossless format) would reduce that considerably.
People should think outside the box.
The answer to archiving the required volumes is producing fewer volumes. Case in point... we recently spent a week or so at work optimising a process that was I/O bound. The bugger took 10 hours to run. Although purchasing faster disks, converting to RAID0, and other techniques did whittle down the execution time to about 5 hours, the final solution was to redefine the process to reduce the actual IO (removed a COBOL sorting stage in the process), and the process is now 2 hours.
Bottom line: with the 100 + 38 million dollars (FTFA) assigned to the project I am sure I could eliminate a number of redundant positions, optimise some communication channels, retire voluminous individuals, replace inefficient protocols/people, and basically reduce the sources of data. Hell, if the US were to actually have peace instead of demand it, there would be a much reduced need for military intelligence, political rhetoric, and other civil responsibilities. The military could be half the size, and what do you know, we could not only reduce the requirement for archiving, but could actually save money in the process.
Remember, government is a self-supporting process.
Go ahead, mark me a troll.
gus
.. if only.
The ancient, esteemed Great Library of Alexandria was burned to the ground as knowledge literally turned to smoke, lost to mankind forever. Was it barbarians? Motivated by political revenge? Demanded by religious zealots? Accidental byproduct of an act of war?
Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that lead up to the horrible decisions that lead our politicians to make the mistake of the daily business? We want records of this?
Perhaps the easiest way of keeping this knowledge at all interesting or inspiring is to burn it regularly, let people imagine what happened to allow such blunders or let apologists spin tales of delight explaining elegant solutions to how stupid people stumbled upon genius decisions. Conspiracy theorists or intellectual artistry can probably generate far greater truths than the truth will ever reveal.
It would save a great deal of money too, just having a delete key. If we are going to care so little for the decisions in the here and now, why preserve the information to be twisted by people in the future with their own biases and projects? We seem to care so little for truth nowadays, why should that change in the future?
I guess you didn't see how Mr. Ebbers or the founder of Adelphia are facing prison time. Quit trying to spread that liberal lie that white collar crime pays off. By the way, it is inappropriate to refer to Black people with racial slurs. Grow up and learn to be a little more tolerant of diversity.
By the time the government comes up with a half ass solution, archive.org will already have it all organized, online, indexed, and backed up.
Anybody know what the government has spec'ed TFA's archiving system to do? It says it will need to read 16,000 file formats, and be impervious to terrorist attack (?), but not much else...
I wonder what kind of searches and cross-linking will be done, for instance. What kinds of access control will there be? I'd also just like to see what the 16,000 formats are, out of curiosity. Sounds like a project waaaay larger than the $136 million they've allotted for it so far.
Stupid name.... I'm guessing they were chuckling about 'the ERA of NARA'...
How many libraries of congress is that?
http://www.fedora.info/
(Not to be confused with the Linux distribution)
From the website, Fedora is "a general purpose repository service...devoted to...providing open-source repository software that can serve as the foundation for many types of information management systems".
Problem for some is that Fedora can be a little hard to grok. It's not an out-of-the-box repository to install and run, like the repository application mentioned in the article (DSpace). It's an architecture for building repository software. Once you understand the potential for building applications on top of Fedora, you start to see some light at the end of the tunnel for just the sort of issues the article raises.
I think it may be worse than that: there will be a huge proliferation of false information, sensationalistic 'infotainment,' advertising, propaganda, etc... Why, historians of the future may be depending on /. as their main source of information! Think of what a tragedy that would be!
http://en.wikipedia.org/wiki/Signature_bloc
I heard Monica Lewinsky slurped up 1 gigabyte of this digital history.
For he today that sheds his blood with me shall be my brother.
And don't give me shit about my karma or whatever. My karma's fine, I don't care about it. I'm copying this because it's interesting and contributes to the discussion.
What do you think about Ralph's thoughts?
xkcd.com - a webcomic of mathematics, love, and language.
Are we currently experiencing a dark age because we don't have access to every letter, memo, bank statement and laundry ticket created in the 20th century? Archiving everything is an attractively simple approach, but if it turns out to be impractical we can always fall back on common sense and restrict ourselves to archiving the maybe 10% of things that have even a remote chance of being interesting in 100 years' time.
We need to imprint holographic storage on synthetic diamonds. Even if they're slow and expensive, they'll last even longer than the paper records they replace. We'll have to spend a fortune redigitizing all the polymer (CD/DVD, floppy, tape), celluloid (microfilm/fiche) and rotating (disc) media that will age to illegibility within our lifetimes. Until we get holographic gems, we need to archive everything on paper, including those expiring media, in a format easily digitized to a more permanent medium. But of course the government, and barely unaccountable bosses, want the public record to disappear down the memory hole. If they could accelerate the process, including newspapers, they'd spend everything we've got (and more) to make it happen.
--
make install -not war
NARA makes a distinction between a document and a record. Any old piece of paper or email is a document, but a record is something which shows how the US government did business.
For example, the email to my supervisor asking when I can take a week's vacation isn't a record. The leave request form I get him to sign is a record. An email about lunch plans: not a record. An email to a coworker about a grant application probably is.
Besides obvious records (eg: financial and legal records), there are many documents that may or may not be records. For the most part, it's up to each program to decide which documents are records and archive them appropriately.
My father is a blogger.
Every mail is great
If a mail is wasted
The gods get quite irate
Every mail is wanted
Every mail is good
Every mail is needed
In your network neighborhood
Really, the idea of not being able to record and save every post-it note being equated with those times and places where writing itself was denigrated into virtual nonexistence is a bit silly.
KFG
I'm not sure most of this stuff is worth enough to justify the cost of preserving it digitally. Just print em out, and put them in a Raiders of the Lost Ark-style warehouse. The few people who want to see all of the Clinton administration's emails can travel to it and search.
I'd much rather see those hundreds of millions of dollars invested in, for instance, making all out of print recordings and books available on-line. It's a smaller problem (sounds like), but would benefit the world much more than online copies of every government employee's timecard records.
.
the only way to start the process of really archiving is breaking out of expecting single institutions to do it and distribute the task -- distributed archiving can start w bloggers, since they seem to have time on their hands: http://www.mcgeek.com/mainsite/tech/123,37.html/
I don't know about the NASA data sets, but they could certainly save a few petabytes by stripping the stupid HTML part of all Outlook emails...
In 1987, a Mac II came with a 40 MB drive. 17 years later, a PowerMac G5 came with a 160 GB drive. This was a 4000X improvement in capacity, and at least that much in storage density and price (1987's drive was both physically larger and more expensive than 2004's drive).
Assuming we continue the current rate of advance in storage density and price, future archivists should be able to buy a 0.64 PB drive for under $500 in 2021. A mere quarter of a million dollars will provide enough space for a copy of all that stuff.
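For anyone who wants to check the extrapolation, here is a rough Python sketch. It assumes the 1987 to 2004 capacity factor simply repeats over the next 17 years and uses decimal units; the $500 price is the comment's own guess, and the variable names are made up for illustration.

```python
import math

MB, GB, PB = 10**6, 10**9, 10**15

drive_1987 = 40 * MB            # Mac II drive, from the comment
drive_2004 = 160 * GB           # PowerMac G5 drive, from the comment
growth = drive_2004 // drive_1987   # capacity factor over 17 years

# Assumption: the same factor applies again from 2004 to 2021.
drive_2021 = drive_2004 * growth

archive = 347 * PB              # NARA estimate quoted in the story
price_per_drive = 500           # dollars, the comment's assumed 2021 price

drives_needed = math.ceil(archive / drive_2021)
total_cost = drives_needed * price_per_drive

print(f"growth factor: {growth}x")
print(f"2021 drive:    {drive_2021 / PB:.2f} PB")
print(f"drives needed: {drives_needed}")
print(f"total cost:    ${total_cost:,}")
```

The result lands a little above a quarter of a million dollars for a single copy, before any redundancy, servers, or power.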
The Zapruder film was the beginning. In recent years, I've been dumbfounded by the vast extension in recording and documentation of things like crimes in progress, natural disasters, America's Funniest Home Videos, you name it. A plane crashes, and the next day there are ten different home videos from people in the vicinity who had camcorders.
I believe the cost of traditional photography in constant dollars dropped enormously between my parents' time and mine. I know we took about ten times as many silver-on-paper and Kodacolor dye-on-paper snapshots as my parents did. Then we got a camcorder. My parents captured about three hours total of 8 mm silent home movies. I have about forty hours of 8mm and digital-8 camcorder tape.
And since my wife and I got digital cameras, we've been taking five to ten times as many pictures as we did when we used film cameras.
Now, YES, I'm on the format treadmill. Got most of the old 8mm movies transferred to VHS. Got most of the VHS transferred to DVD. Got a lot of the old slides scanned. Got most of my digital images burned to CD. In the last five years, I've probably spent a hundred hours, or 0.2% of my life, on nothing but struggling to copy from old formats to new. I've spent a small fortune getting Shutterfly to print pictures, because to tell the truth I have much more faith in the prints surviving than the CD's.
So, I don't see a digital dark age. I see a bizarre situation in which the quantity of material recorded in digital form continues to increase exponentially for quite some time. _Most_ of it will get lost, and the percentage that survives, say, a hundred years will keep going DOWN exponentially with time.
But I'm guessing the total quantity of 21st century material available to historians of the 23rd century will, in absolute numbers, be just about the same as the total quantity of 20th century material.
It's one of those mind-boggling things like personal death that one can never quite come to grips with. The future is unknown, and we can accept that. But the fact that most of the past is unknown is equally true--and very hard to accept.
"How to Do Nothing," kids activities, back in print!
In 2022, we'll probably have terabyte capacity in our mobile phones. Seriously. In the early 90s, 80 GB of drive space ran about $80,000 according to this archived historical document. Nowadays, I can get an 80 GB drive for about $65 according to froogle, and that's without considering inflation. Sure, at a conservative $1/GB we're looking at $347 million today, but in 17 years' time that'll probably look more like two or three hundred thousand bucks. No biggie for our bloated government.
I don't think backing up a president's email and backing up some minor whitehouse aide's email should have equal importance.
I agree really, but I also find the problem with data is you never know until it's too late. The aide's email could be an international "smoking gun" lost forever vs. an eternally archived Presidential request for diet soda on Air Force One.
I feel that if you can't completely automate backups then the best thing is to give users easy access to backup resources for their own material so they can judge what's most important and what isn't. This happens in some organisations at the moment but not in all; I used to work in a place where I had to make a special appointment with a tech just to burn a CD of stuff on my HD. Guess how much data we regularly lost as an organisation...
Plays violent online games as: Nerfherder76
AOL CDs are closed. Believe me, I've tried...
NARA needs to open up tons and tons of GMail accounts. Where do I send my invites so I can contribute?
All I know about Bush is I had a good job when Clinton was president.
To you and the countless others on /. who offer their corrections in a similar tone: Yes, we get it, the parent poster goofed and you supplied a correction. Given the trivial context here, it's hardly a big deal and doesn't warrant sarcasm. Everyone makes mistakes and plenty of people make mistakes in their work every day, including people who do work where lives are at stake. That's one reason why it is good to work with other people. In life it's far more important to be forgiving, keep things in perspective, and help other people without the wiseacre commentary and then move on.
Digital Citizen
Who says you have to archive all data digitally? The system that's been working for years at our local public and univ. libraries is storing meta information digitally that references a tangible location.
I think you're missing the point, which is that all that data is now much easier to lose, especially in the short term, if it's not taken care of properly.
Perhaps, perhaps not. Sure, digital data can be lost easily, but it can also be copied/backed-up more easily. Assuming $0.01/page for paper copy (a gross underestimate of the cost of paper, toner, and labor for copies), 10 kB data/page (an overestimate), and $10/GB (for high-end maintained storage), the cost ratio is at least 100:1 in favor of digital (and probably 1000:1). Inaccessible formats are a concern, but an automated batch process at the time of initial archiving can, at least, convert the data to some data format standard with a longer likely lifespan (e.g., plain ASCII, RTF, PDF, HTML, etc.)
Paper has its own single-point-of-failure concerns, and the huge cost of copying makes those concerns real. Digital does add some new modes of failure (e.g., format obsolescence), but I think those are not as burdensome as the physical costs of copies.
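The 100:1 figure can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the comment above; the function name is made up, and the second call uses hypothetical inputs (5 cents and 5 kB per page) purely to show how a ratio near 1000:1 could come about.

```python
def paper_to_digital_cost_ratio(paper_cost_per_page=0.01,
                                bytes_per_page=10_000,
                                digital_cost_per_gb=10.0):
    """Cost of copying 1 GB worth of pages on paper, divided by
    the cost of storing 1 GB digitally (decimal GB assumed)."""
    pages_per_gb = 10**9 / bytes_per_page
    paper_cost_per_gb = pages_per_gb * paper_cost_per_page
    return paper_cost_per_gb / digital_cost_per_gb

# Figures from the comment: $0.01/page, 10 kB/page, $10/GB.
print(f"{paper_to_digital_cost_ratio():.0f}:1")

# Hypothetical, less generous-to-paper inputs: $0.05/page, 5 kB/page.
print(f"{paper_to_digital_cost_ratio(0.05, 5_000):.0f}:1")
```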
The guy conflates integrity preservation solutions (RAID) with data format issues.
Major formats will be figured out after the apocalypse, don't worry about that. (Sir, we found over 100 million 4 3/4" plastic discs with digital data on them! Should we try to decode them?) Data will be lost, that's true. But some of it will be figured out, just as when we look back at current histories.
In the past, when societies used paper or papyrus instead of parchment, the recoverability of their information went down because those materials didn't survive as well. At other points, changes in inks (due to convenience of manufacture or cost) also led to lower data survivability.
So this isn't a new thing at all.
But most importantly, don't get too excited about it. You no more should be worried about whether your pr0n collection will survive than the average Greek or Roman was about their inventory/accounting records, or indeed their pr0n collections.
Haven't you heard? There's nothing Reiser4 can't do!
347 petabytes? Why not store it all as petafiles?
*duck*
All too relevant.... Recording every minute detail of communication is not the way our brains work now, and doesn't even seem to be on the horizon for how our brains are going. Why in the world would we want to archive every little detail?
Governmental psychosis is costly.
.
-shpoffo
Has not Google already figured out this problem with GMail? Google, maybe you should bid on the job? Imagine being able to use Google to search the national archive. Hmmm.... :)
if they aren't already
To add insult to injury, slime-sucking lawyers now advise their clients to destroy records, like email, as soon as possible to prevent them from being the subject of discovery in a future lawsuit. At a previous employer, company policy was to nuke all email older than 30 days. Due to the drive to eliminate paper shuffling, email messages were the only record of many policy decisions.
Mea navis aericumbens anguillis abundat
You can calculate the amount of entropy in a document (text or no) and that is a limit to how small you could possibly make it.
I don't recall how close modern methods like arithmetic coding get to that limit, but I know it's close enough that we couldn't double the compression ratio of text documents from the current state of the art.
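To make the entropy idea concrete, here is a small Python sketch that estimates bits per byte from byte frequencies alone. Big caveat: this is only a zeroth-order estimate, an assumption of the sketch rather than the true limit. It is the floor for a coder that treats every byte independently; compressors that model context can go below it, and the real entropy of English text can only be estimated. The function names are made up for illustration.

```python
import math
import zlib
from collections import Counter

def order0_entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-frequency distribution, in bits/byte."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def zlib_bits_per_byte(data: bytes) -> float:
    """Bits per input byte actually achieved by zlib at its highest level."""
    return 8 * len(zlib.compress(data, 9)) / len(data)

sample = (b"In signal processing, there is a limit called the Shannon "
          b"Capacity theorem. In text compression, is there a similar limit? ") * 50

print(f"order-0 estimate: {order0_entropy_bits_per_byte(sample):.2f} bits/byte")
print(f"zlib achieved:    {zlib_bits_per_byte(sample):.2f} bits/byte")
```

On a repetitive sample like this one, zlib should come in well under the order-0 number, which is exactly why the order-0 figure should not be mistaken for the hard limit.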
Trellis coding is a system for dealing with induced errors in modem signalling. It allows you to cancel some of them out. It doesn't actually increase the throughput in an ideal situation.
The thing that allowed us to reach the limit for a phone line is combined amplitude-phase coding, or the creation of the "constellation diagram" for modem encoding.
The constellation defines certain combinations of phase and amplitude that represent groups of bits (a symbol). Trellis coding simply defines additional combinations that are not sent. If you see any of these on the receiving end, then you realize that the constellation is either being twisted (phase error) or shrunk/grown (amplitude error) and you can try to compensate for it.
The name comes from a trellis, like you grow plants on. The legal signals sent should go through the holes in the trellis. If you receive a signal that falls on the trellis (hits the trellis) you adjust it so that it goes through the trellis and assume this adjustment factor can be used to adjust other, valid hits too to more accurately determine the data that was sent.
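A toy illustration of the constellation idea, in Python. This is plain 16-point amplitude/phase mapping with a nearest-point decision at the receiver, and it does not implement trellis coding itself. The grid levels and the natural-binary bit mapping are assumptions chosen for simplicity, not taken from any modem standard.

```python
# Each 4-bit group maps to one point in the I/Q plane, i.e. one
# particular combination of amplitude and phase.
LEVELS = (-3, -1, 1, 3)

CONSTELLATION = {
    bits: complex(LEVELS[bits >> 2], LEVELS[bits & 0b11])
    for bits in range(16)
}

def modulate(bits: int) -> complex:
    """Return the constellation point for a 4-bit value (0..15)."""
    return CONSTELLATION[bits]

def demodulate(received: complex) -> int:
    """Pick the 4-bit value whose constellation point is nearest."""
    return min(CONSTELLATION, key=lambda b: abs(CONSTELLATION[b] - received))

sent = 0b0101
noisy = modulate(sent) + complex(0.4, -0.3)   # small amplitude/phase disturbance
print(f"sent {sent:04b}, received point {noisy}, decoded {demodulate(noisy):04b}")
```

As long as the disturbance stays under half the spacing between points, the nearest-point decision recovers the bits that were sent.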
http://lkml.org/lkml/2005/8/20/95
First, Moore's Law is about transistor density, which has nothing to do with hard drives. Secondly, hard drives haven't been getting any more reliable. That means all these hard drives have to be replaced every few years. It's a nightmare for long-term storage.
You are right -- Gordon Moore spoke only of trends in the number of transistors/IC. Yet his law was, if anything, about advances in the technologies of miniaturization. This miniaturization has had profound, indirect effects on storage. The same technologies that enabled semiconductor engineers to make smaller transistors have helped disk drive designers make denser drives. Smaller heads, faster electronics, and a better understanding of materials lead to advancements in both ICs and HDs.
I'm not sure what you mean about reliability. Perhaps reliability on a per-drive basis remains constant. But reliability on a per-bit basis has improved. How long would a cluster of 4,000 40 MB drives go without a failure in 1987? The reliability of 160 GB of storage has improved.
Yes, storage systems need periodic drive replacement, but by the time a drive needs to be replaced, the indirect effects of Moore's Law will have made that replacement about 1/4 the price of the original drive. Thus, if storage is $1/GB now, securing about $1.33/GB is sufficient to buy both today's storage and have the money needed to buy all subsequent replacements every 3 years in perpetuity. By 2022, a storage array of 100 servers with 6 drives each (an installation only 4 times larger on a device-count basis than the new Wikipedia installation) would provide the needed storage of 347 PB.
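The $1.33 figure is just a geometric series: each replacement costs a quarter of the one before, so the total is 1 + 1/4 + 1/16 + ... = 4/3. A quick Python check, assuming the 1/4 price ratio holds forever and ignoring interest and inflation (function name made up for illustration):

```python
def perpetual_storage_cost(initial_cost=1.0, price_ratio=0.25, generations=50):
    """Sum the cost of the first purchase plus successive replacements,
    each costing price_ratio times the previous one."""
    return sum(initial_cost * price_ratio**k for k in range(generations))

closed_form = 1.0 / (1.0 - 0.25)          # limit of the infinite series
print(f"summed:      ${perpetual_storage_cost():.4f} per GB")
print(f"closed form: ${closed_form:.4f} per GB")

# Implied drive size for the 100-server, 6-drive array to hold 347 PB.
per_drive_tb = 347_000 / (100 * 6)
print(f"each drive would need to hold about {per_drive_tb:.0f} TB")
```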
WANTED: Digital Librarian Archive Asst., Digital Salvage Director, UNIX Admin., eMail Archiver, Information Architect, Mathematician, Psychological Counselor...
All positions require Computer Science Bachelor's and Master's degree and 18 years experience or 2 year Mathematics degree and 10 years experience, except for the Counselor, which requires 6 months Hooters waiting experience and a PRN
send resume and salary reqs. to address_empty at potmail.c
The Admin and the Engineer
or how many Volkswagen Beetles filled with DAT tapes?
:^)
or how many beowulf clusters are needed to search it? sort it?
Think of the crap as padding. When you only save important information, and then you lose information, it was important.
346 petabytes of padding might be overdoing it though.
you can have my violent video games when you pry them from my cold, dead hands.
Prime UID Club
Are we currently experiencing a dark age because we don't have access to every letter, memo, bank statement and laundry ticket created in the 20th century?
It is a matter beyond impeachment that future generations can expect substantial volumes of washie to go unclaimed, forgotten to the sonorous march of history.
"Give them to me."
"What do you want??"
"That Gem...and the Holograms."
You jest, but it's possible something like Wikipedia or (shudder) everything2 will be on some future historian's list of sources.
So historians in 2100 will have to wade through various trolls and defacement attempts to try to get what people thought about in 2005 - but at least they'll know not to click on Goatse links.
DNA combined with the cellular mechanisms to protect and propagate the information.
How expensive is data storage, really? I'll design a ten petabyte (10PB) storage system. You'll see how much it costs. To build this monster machine, I'll be using commercial off-the-shelf hardware organised as a massive Linux cluster.
You may ask "why do you want to build the most powerful Beowulf cluster on Earth when storage companies have all these amazing storage systems?" Well, this system needs to be an open solution. The system will need to grow and evolve as the needs change. Vendor lock-in is simply unacceptable.
"Current storage software for Linux is useless for this massive an archive", you retort. Let me make it clear that there is no software anywhere that can fulfill the needs of the NARA. No matter what solution they choose, they will have to commission the creation of custom software. This software must be open-source, for the sake of the future growth and survival of the system.
Our basic storage unit will be a 300GB SATA drive. It's inexpensive, fairly reliable, and available in bulk quantities. We're going to need a lot of these little drives.
I'm going to create a very reliable system. First, it'll be completely redundant, with two identical systems, kept in sync, at two different geographical locations. I'll first consider the cost of a single location.
So, I need to store 10PB in a single building. I want reliability, and I'll get that by using two different levels of ECC: RAID-5 and distributed data. All disk drives will be part of 4-drive RAID-5 arrays. All these arrays will be part of larger 4-array RAID-5 arrays that are distributed across storage nodes.
Thus, each drive is grouped with 15 others to form a 2-redundancy-layer 16-drive array, with 7 drives dedicated to parity. This leaves us 9/16 usable drives, or 2.7TB.
We therefore have 4-node storage subunits, with 5.4TB usable storage space. Inter-node communications go through gigabit ethernet with jumbo frames. The second gigabit ethernet interface of each node is also connected to a secondary network, for outside access to the node. Here, storage groups of 4 subunits are channeled through a single GbE port. That's 21.6TB which can be accessed through GbE.
Now, we have 512 such storage groups, each addressable at a speed of 1Gbps. At this speed, the complete data store of each group can be transferred in or out in less than 72 hours.
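The capacity numbers above can be re-derived with a short Python sketch. Two things are assumptions here: that each 4-node subunit holds two 16-drive arrays (inferred from 5.4TB being twice 2.7TB), and that the gigabit link runs at its full line rate with no protocol overhead. Units are decimal.

```python
DRIVE_GB = 300

# Two nested RAID-5 layers, each keeping 3 data members out of 4.
data_drives = 3 * 3            # usable drives per 16-drive array
total_drives = 4 * 4
parity_drives = total_drives - data_drives

array_tb = data_drives * DRIVE_GB / 1000
subunit_tb = 2 * array_tb      # assumption: two arrays per 4-node subunit
group_tb = 4 * subunit_tb      # four subunits behind one GbE port
site_pb = 512 * group_tb / 1000

# Time to move one whole group through a 1 Gbps port at line rate.
transfer_hours = group_tb * 1e12 * 8 / 1e9 / 3600

print(f"usable drives per array: {data_drives}/{total_drives} "
      f"({parity_drives} parity)")
print(f"array {array_tb:.1f} TB, subunit {subunit_tb:.1f} TB, group {group_tb:.1f} TB")
print(f"site capacity: {site_pb:.2f} PB")
print(f"full-group transfer: {transfer_hours:.0f} hours")
```

At line rate the transfer comes out comfortably under the 72 hours quoted, which leaves room for real-world overhead, and 512 groups give a bit more than the 10PB target.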
We're using 2U cases for each system. This means that each subgroup uses 32U for servers, and we're using 42U racks. We have free space in every rack to allow for our high power density and communications and service equipment.
Let us consider the cost of a single such 21.6TB storage unit. We have 16 servers and 128 drives. Each server with storage costs around 2.5K$. Each storage unit will be contained in a single rack with independent power conditioning, cooling and communications, at a cost of 15K$/rack, if we use high-end COTS equipment. So, we have a cost of 55K$/unit.
We need to make a cluster out of all this. We need extra computational power. We need a network backbone. We need tape drives to put data in the system. Front-end systems. A way to communicate with the outside world. A network operations center. That's going to add 5M$ to the system.
This means that the total equipment cost for a single location, including 5% of spare units, is around 35M$, including power and cooling equipment.
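The cost figures above can be added up as follows. This is only a sketch: it assumes the 5% spares apply to the storage units and not to the 5M$ of cluster infrastructure, which the post does not spell out.

```python
SERVER_COST = 2_500        # $ per server, storage included
SERVERS_PER_UNIT = 16
RACK_COST = 15_000         # $ per rack (power conditioning, cooling, communications)
UNITS = 512
SPARE_PERCENT = 5          # assumption: spares cover storage units only
INFRASTRUCTURE = 5_000_000 # $ for backbone, front-ends, NOC and so on

unit_cost = SERVER_COST * SERVERS_PER_UNIT + RACK_COST
storage_cost = unit_cost * UNITS * (100 + SPARE_PERCENT) // 100
site_cost = storage_cost + INFRASTRUCTURE

print(f"per unit: {unit_cost:,}$")
print(f"storage incl. spares: {storage_cost:,}$")
print(f"single location: {site_cost:,}$")
```

Under those assumptions the total lands just under 35M$, in line with the figure quoted.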
We need around 35000 sq.ft of floor space to host all this equipment and a 5000 sq.ft operations center. At a cost of 30$/sq.ft, we're talking about 1M$/year for the building.
Let's consider power costs, and do so very conservatively, with a generous overhead on all figures. Each drive uses 15W of power. Each server uses a total of 250W of power. Each storage group uses 4kW of power. Each location therefore uses 2MW of power. Over the course of a year, we're talking about 12.7GWh of power.
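For reference, here is the same power budget worked through in code, assuming every server draws its full 250W around the clock for a 365-day year. That assumption is not stated in the post, and it yields a higher annual figure than the 12.7GWh quoted, so the post may be assuming a lower average draw.

```python
SERVER_W = 250             # total per server, drives included
SERVERS_PER_GROUP = 16
GROUPS = 512
HOURS_PER_YEAR = 24 * 365  # assumption: continuous full-load operation

group_kw = SERVER_W * SERVERS_PER_GROUP / 1000
site_mw = group_kw * GROUPS / 1000
annual_gwh = site_mw * HOURS_PER_YEAR / 1000

print(f"{group_kw:.1f} kW per storage group")
print(f"{site_mw:.3f} MW per location")
print(f"{annual_gwh:.2f} GWh per year at full load")
```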
Now, we still need more power for cooling, which comes in at about 60% of total power costs. T
http://slashdot.org/comments.pl?sid=154005&cid=12917603
My opinion?
The 21st century will disappear from history. In 500 years' time they will know more about the Italy of 1505 than the USA of 2005. Why? The records of Italy will still exist.
The entire digital info system is based on the free ride of petroleum. Petroleum will basically disappear from society fairly soon (either it will simply deplete, or it will become too expensive to drill out), and everything made of plastic and anything requiring high energy density to acquire (like digging up precious metals) will be largely (but not completely) curtailed. The result is that most of what we call "our culture" will be lost soon after the Collapse ( http://slashdot.org/comments.pl?sid=154005&cid=12917603 ) and will be largely ignored from a lack of manpower ( http://www.dieoff.org/ ).
It's a truly stunning prospect - our civilisation, with the possible exception of a few basic texts that can be copied to paper, will simply disappear. It will be seen as a dark age - not from a lack of people writing (as it was in 490 CE) but from a lack of putting things into a survivable substrate.
RS
Shoes for Industry. Shoes for the Dead.
The link should have been:
http://www.amazon.com/exec/obidos/tg/sim-explorer/explore-items/-/0670033375/0/101/1/none/purchase/ref%3Dpd_sxp_r0/103-5019446-5179842
RS
Shoes for Industry. Shoes for the Dead.
If the current administration has its way, we have no business archiving anything.
One of GWB's first acts was to lock down the Reagan administration's (and all subsequent administrations') data forever. The 12 year release cycle that the Ford Administration approved was revoked within weeks of Jan 2000 (some cynics say, to prevent data about Iran-Contra and GHWB's involvement becoming public - but that's just crazy talk).
The only data less available than old parchment in a vault is random magnetic domains and / or the lack thereof.
You can't prosecute what didn't happen... Ask Oliver North about those PROFs backup tapes.
In ten years there will be no "official" record.
Bush will have achieved what countless computer marketing schemes promised: a paperless office.
The corrupt politician's wet-dream - no records.
It all started as "a matter of national security" - but the first victim (*target*) was a cartographer mapping caribou trails through ANWR.
Now, states like Missouri have eliminated publishing certain rules, laws and regulations on the Internet - as too costly. Yep, if you want to read the regs to Chap. 213 R.S.Mo. you have to go to Jefferson City and ask nicely at the Mo. Commission on Human Rights to look at a copy of the new rules. One per Commission....Damn electrons and ink are dangerous to 'merican republicans. Ration them - then burn-bag and deep six 'em.
Authority questions you. Return the favor.
I suspect 99.99% of this information is multiply redundant. With a good compression algorithm, it would fit onto a DVD or a CD or perhaps even a floppy.
"From the 38 million email messages created by the Clinton administration ..."
I am pretty sure that 90% of those emails could be deleted. (Not saying that the administration writes /dev/null stuff ... certainly there are many "cantina" messages, but that's not the point.)
- If I take a look at my emails, I have quite a lot of threads in there, where all later messages include the whole conversation of previous messages.
- If I take a look at my (outbound) emails, there are lots of mails to multiple recipients. It is only necessary to store each one *once*!
- Since most of the emails are internal, i.e. one administration member writes to another one, there is no reason to save the outbox of the sender and inbox of the receiver.
I guess they sent lots of attachments, too. And compression of those binary files isn't as effective as the compression of text.
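The "store it once" idea can be sketched with content hashing: each distinct body is kept a single time and every mailbox just holds a reference to it. This is an illustration only, not how NARA or any mail system actually does it; `MessageStore` and its methods are hypothetical names, and it only catches byte-identical bodies, not quoted threads.

```python
import hashlib

class MessageStore:
    """Toy single-instance store: one copy per distinct message body."""

    def __init__(self):
        self.bodies = {}  # SHA-256 hex digest -> message body
        self.refs = []    # (mailbox, digest) pairs, one per delivery

    def add(self, mailbox: str, body: str) -> str:
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self.bodies.setdefault(digest, body)  # stored only the first time it is seen
        self.refs.append((mailbox, digest))
        return digest

store = MessageStore()
memo = "Refrigerator cleaning day is FRIDAY."
store.add("sender/outbox", memo)
store.add("alice/inbox", memo)
store.add("bob/inbox", memo)
store.add("bob/inbox", "A different message.")

print(len(store.refs), "deliveries,", len(store.bodies), "stored bodies")
```

Attachments would go through the same path as bytes; as noted above, they compress worse than text, but identical copies still collapse to one.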
Don't answer me. Moderate. Slashdot is about moderation, not discussion.
nt.
Who ordered that?
I guess the first question is, why are we even keeping this data around? Give the historians something to argue about and delete some stuff.
This is my sig.
Keep in mind, though, that the prison sentences of Ebbers and the prison sentence of the guy who knocks over a BP are going to be about the same, even though Ebbers stole millions while the gas station thief would have got about $200.
HI, MY NAME IS ISAAC.
Digital Archiving and Long Term Preservation: An Early Experience with Grid and Digital Library Technologies
The first project is the Persistent Digital Archives project [1], which is a joint effort between the San Diego Supercomputer Center (SDSC), the University of Maryland (UMD), and the National Archives and Records Administration (NARA), and is supported by the National Science Foundation under the Partnership for Advanced Computing (PACI) Program. The main goal of this project is to develop a technology framework for digital archiving and preservation based on data grid and digital library technologies, and to demonstrate these technologies on a pilot persistent archive. We have already built a significant prototype using commodity platforms with significant disk caches coupled with heterogeneous tape libraries for back-ups.
Re: Refrigerator use
PLEASE PLEASE PLEASE To all our West Wing office staff we remind you once again that refrigerator cleaning day is FRIDAY and if you leave any foodstuffs they will be THROWN AWAY by 6 pm. Thanks!!!!
As Mr. Ballmer would say.
Hardware and media types can be migrated and validated in a straightforward process. It is the format and representation of the data that is the *hard* problem. Understanding how the information is represented in the digital record is the only way one can conceive of a process to migrate it to a more current representation. Unfortunately, many of the representations are proprietary, e.g., MS Office documents. Open standards for the data representations are the only way forward.