Slashdot Mirror


Snapshotting the Whole Internet?

Anonymous Coward writes "CNN is running a story about a company that is saving periodic 'snapshot' archives of the whole www (or as much as they can) for historical purposes. Interestingly they say that although they might have considered saving everything except ads, they didn't throw away the ads because historians claim that ads give a better "glimpse of what life was like" in the past. I wonder what legal ramifications will arise for possessing such archives of the "whole web" as snapshots-in-time. Thoughts of DeCSS, CPHack, MS Kerberos' click-wrap license, I.P. "ownership" of collected databases cross my mind."

16 of 142 comments

  1. Really? by BadlandZ · · Score: 3
    "The way we're able to pull this off is by having robots that go around and contact every Web server around the world periodically, and download each page -- each image -- off of every one of those sites"

    Why do I just NOT believe this? In order to do this, they would not even be able to search for httpd at every IP, because that would only grab one host per IP, and virtual hosts would be skipped.

    They would actually have to search every registered domain name, dig up all its host names, and then search each one for an httpd, and I don't believe they are actually going to do that (can someone point out to me that they are?).

    One of the biggest problems on the web right now is the lack of organization of information... "You want to know what? Oh, it's on the web, no doubt, but WHERE!" As brilliant as search engines are, they still don't know where everything is, and you frequently will miss what you REALLY want.

    So along come these people, and they make an even larger claim than searching through _all_ of the web: they say they will take a snapshot of ALL of it?

    /me looks at hit logs for home Linux box running apache on modem... Hmm... Don't see anything... been up weeks... I'll keep looking ;-)

    At best, they will get the most popular sites, and try to leave it at that.
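The virtual-host point above can be sketched in code. With name-based virtual hosting, an HTTP/1.1 server picks the site from the `Host` header, so a crawler that only knows an IP address sees one site at most. This is a minimal sketch, not the Archive's crawler: `build_request` and the hostnames are hypothetical, and nothing is sent over the network.

```python
def build_request(hostname: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request for one named host."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {hostname}",  # selects the virtual host on a shared IP
        "Connection: close",
        "",  # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Hypothetical hostnames that could all resolve to the same IP address.
hosts = ["www.example.com", "shop.example.com", "example.org"]
by_host = {h: build_request(h) for h in hosts}

for h, req in by_host.items():
    # Show the header line that differs between the otherwise identical requests.
    print(h, "->", req.split(b"\r\n")[1].decode("ascii"))
```

Each request differs only in its `Host` line, which is why a crawler needs a list of host names rather than a list of addresses.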

  2. gah. slashdot on the brain. by Da_Monk · · Score: 4

    i parsed this as
    "Slashdotting the whole internet"

    help me....

  3. 30 years down the road... by Raunchola · · Score: 3

    Joey: Wow Grandpa, this was how the Internet was back in your day?

    Grandpa: Yep Joey, that's right, this archive shows how the Internet looked back when I was a youngster like yourself.

    Joey: Wow! You know Grandpa, it's too bad what the Internet has become now. Just look at this...porno, warez kiddies, AOLers, and ads everywhere! Talk about a downward spiral.

    Grandpa: Joey?

    Joey: Yeah Grandpa?

    Grandpa: You're still looking at the archive.

    --
    The real Raunchola isn't cool enough to have any imposters
  4. I would like seeing . . by Money__ · · Score: 3

    . . the old internet before Microsoft.
    ___

  5. huge pr0n collection by paled · · Score: 4

    Sounds like an excuse for having the world's largest pr0n collection 8^)

    --
    .
  6. Save the Slashdot archives! by ballestra · · Score: 3

    Future historians will no doubt be fascinated by the "ancient mystery" race of trolls. They developed their own hybrid Latin/Arabic based alphabet, but often reverted to more primitive picture-based communication. The figurehead of their religion was a deity named Natalie Portman, whom they worshipped with sacramental rituals involving hot breakfast cereals. Who these people were, and how they really lived, will sadly remain a mystery.

  7. To answer a few questions... by Rhys Dyfrgi · · Score: 5
    Most people seem to have not found the homepage of the project (not surprising, as I saw no link on the CNN story.) The project is at http://www.archive.org. There are 3 archives there: the web, from 1996 to now, taking 13.8 TB; FTP, in 1996, taking 0.05 TB; and Usenet, from '96 to '98, at 0.592 TB. All this space info is from the front page of the site.

    There is info on the side on how the archive is accessed, created, who pays for it, everything. Read it before you hit that post button another time.
    ---

    --
    END OF LINE
  8. Re:This is not "how we live" at all by Jon Erikson · · Score: 3

    The "teething problems" are what makes it historically interesting! I think historians of the future will be much more interested in looking at the development of the web through trial and error than at the finished product.

    Well I suppose that the sheer amount of perversion and degradation available on the net at this point in history will provide a lot of interest to future historians, so in that context sure it'll be "historically interesting"!

    But, pornography aside, what is there of real historical value on the net? Sure there are any number of mindless geocities homepages full of drivel about people's pets, but sifting through this would drive anyone mad and there are a lot more "insightful" sources already available about today's culture.

    Unfortunately the web as it stands at the moment shows the worst side of humanity rather than its best side - historians looking through terabytes of things like the Anarchist Cookbook, virulent anti-Christian diatribes, terrorist manifestos and race hate sites will hardly pick up a balanced view of society will they?!

    It's not a study, it's an archive. The purpose of this project is to collect data, not to analyze it or place any sort of value judgement on it.

    But unless it will be used as the basis for future studies then this project is a waste of time, so I don't think you have a valid point here.



    ---
    Jon E. Erikson
    --

    Jon Erikson, IT guru

  9. We are the Internet Archive by Anonymous Coward · · Score: 5
    Given our recent exposure, I thought I'd make a few comments since journalists tend to miss important details.

    Our website is at http://www.archive.org.

    We are *NOT* a company. We are a non-profit organization making our archives freely available to researchers, scholars, historians, etc. A for-profit company may not be the right model to ensure the long-term survival of the collections. We only archive publicly available information on the Internet.

    We currently have about 17TB of Web pages and images on disk. We've also got about 6TB of older stuff on tape that we are migrating to disk. We're growing at about 3-4TB/month. We are not yet getting Usenet or streaming media because of labor limitations. Anyone wanna come work for us?

    We buy storage PCs with twenty 75GB IDE hard drives, 2 667MHz CPUs and 512MB RAM. We run Linux, but are migrating to FreeBSD because of the 2GB file size barrier.
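To put the figures above together, here is a rough back-of-the-envelope sketch of how many such storage PCs the stated growth implies. It assumes decimal units (1 TB = 1000 GB) and counts raw disk capacity only, with no allowance for filesystem overhead or redundancy; the function names are made up for illustration.

```python
import math

DRIVES_PER_PC = 20   # from the comment
GB_PER_DRIVE = 75    # from the comment
GB_PER_TB = 1000     # assumption: decimal units


def pc_capacity_tb() -> float:
    """Raw capacity of one storage PC in TB."""
    return DRIVES_PER_PC * GB_PER_DRIVE / GB_PER_TB


def pcs_needed(terabytes: float) -> int:
    """Whole PCs needed to hold the given amount of data."""
    return math.ceil(terabytes / pc_capacity_tb())


print(pc_capacity_tb())       # 1.5 TB per PC
for growth_tb in (3, 4):      # stated growth of 3-4 TB per month
    print(growth_tb, "TB/month ->", pcs_needed(growth_tb), "new PCs per month")
print(pcs_needed(17 + 6))     # 17 TB on disk plus 6 TB still on tape
```

Under these assumptions the growth rate works out to two or three new machines a month, and the existing 23 TB to 16 machines.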

    Access currently requires a bit of UNIX skill. There is no browser interface to our collections. You'll need to be able to write your own search software, as the only index we have right now is a URL index. If you want access, you'll need to fill out a form at http://www.archive.org/proposal.html.

    Kurt Bollacker
    Technical Director, Internet Archive

    kurt@archive.org -- www.archive.org
    P.O. BOX 29244, San Francisco, CA 94129
    vox: 415-561-6796 -- fax: 415-561-6768

  10. Re:This is not "how we live" at all by mikpos · · Score: 4

    So you're saying that because the web isn't perfect and it doesn't reflect the general society, it won't be useful to historians? If you ask me, this would make it more interesting, not less. This transition will have an extremely short lifespan (probably under 20 years in length), so the more data the better (for the historians).

    And, FYI, just because the Royal Family doesn't reflect English society, it does not mean that historians don't find them interesting.

  11. History is in the trials and tribulations by Valdrax · · Score: 5

    Well I suppose that the sheer amount of perversion and degradation available on the net at this point in history will provide a lot of interest to future historians, so in that context sure it'll be "historically interesting"!

    You just don't get it, do you? Should historians gloss over the Holocaust, the Reconstruction, and the Dark Ages simply because they were "icky?" Sometimes the darker elements of society are the most worth examining in a historical context. The whole point of the saying that those who don't study history are doomed to repeat it isn't that you should study only the good points and avoid the bad ones.

    But, pornography aside, what is there of real historical value on the net? Sure there are any number of mindless geocities homepages full of drivel about people's pets, but sifting through this would drive anyone mad and there are a lot more "insightful" sources already available about today's culture.

    Do you think it's not just as frustrating to shuffle through archives of old 19th century newspapers to find ads and articles about the medicine of the day? The point that the man speaking for the Internet Archive was making is that this is not a study of only the famous. With these archives at hand, you can study the transition from the early days of research papers to the rise of pornography and personal websites to the current days of e-commerce to whatever major social trend the web next holds. An archive of the web shows how society has adapted to the format. You can see what issues were hot enough to spur crops of websites only to fade away in the span of a year or two.

    Face the music: the majority of humanity isn't putting out "insightful" commentary. Ignoring the common man is a mistake that many historians simply can't avoid because there's nothing available about them. All the "mindless" Geocities sites give an insight into the kind of people that use them.

    Unfortunately the web as it stands at the moment shows the worst side of humanity rather than its best side - historians looking through terabytes of things like the Anarchist Cookbook, virulent anti-Christian diatribes, terrorist manifestos and race hate sites will hardly pick up a balanced view of society will they?!

    Sounds like you're the one with the hardly balanced view of society if you honestly think that is what the majority of the web is. The fact is that the majority of the web currently is commercial sites and those "mindless" Geocities sites you like to talk down about. Though there are some bad elements on the web, it's also worth historical note that the web led to the coming out of many of these fringe groups. The very anarchy and rebellion of the web is of major historical interest, and the web is becoming one of the more important socio-economic influences of the turn of the century, at least in America.

    But unless it will be used as the basis for future studies then this project is a waste of time, so I don't think you have a valid point here.

    Ah, but it will be. Say in 30 years you want to do some research on the Y2K hysteria of the turn of the century. While there will be plenty of books to read through, a major factor in spreading the word about Y2K was the Web. However, these web sites are already mostly gone from the Web today. Fortunately, the Internet Archive may have already preserved them for future study.

    Would you like to study the rise of Linux or of the web itself? Many of the early web pages about the topics could provide priceless research. Hell, even if you really object to the large amount of pornography, the booming porn industry on the web was a major driving factor in advances in e-commerce. It would also be valuable in studying the "warez" counter-culture of today.

    Plus, like it or not, it's not for you to say. This is being done by a privately funded group. If you really feel so strongly that the web is worthless and should absolutely not be archived for historical purposes, then go torch the place. While you're at it, go ahead and start burning those libraries that hold material about history you object to. Otherwise, your choices are "shut up" and "like it."

    --
    If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").
  12. Re:disk space by kurtbollacker · · Score: 3

    We buy PCs with twenty 75GB IDE hard drives, paying about $11/GB for storage. Pretty cheap these days. We've calculated that the growth of the Web and the growth of disk drives tend to track pretty closely, so the cost of keeping up with the Internet will mean a relatively constant spending rate.
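The "constant spending rate" reasoning above can be illustrated with a toy model. If the amount of new data per month grows by some factor while the price per GB falls by the same factor, the two cancel and monthly spending stays flat. This is only a sketch under assumptions: the starting figure of 3500 GB/month is the midpoint of the 3-4 TB/month quoted elsewhere in this thread, and the 4% monthly factor is made up for illustration.

```python
DRIVES_PER_PC = 20
GB_PER_DRIVE = 75
PRICE_PER_GB = 11.0  # dollars per GB, from the comment


def cost_per_pc() -> float:
    """Storage cost of one PC at the quoted price per GB."""
    return DRIVES_PER_PC * GB_PER_DRIVE * PRICE_PER_GB


def monthly_spend(month: int, start_gb: float, start_price: float,
                  factor: float) -> float:
    """Spending in a given month when data growth and price decline track.

    Assumption: new data per month is multiplied by `factor` each month,
    and the price per GB is divided by the same `factor`.
    """
    added_gb = start_gb * factor ** month
    price = start_price / factor ** month
    return added_gb * price


print(cost_per_pc())  # dollars per storage PC
for month in (0, 12, 36):
    # The growth and price factors cancel, so the figure stays flat.
    print(month, round(monthly_spend(month, 3500, PRICE_PER_GB, 1.04), 2))
```

If the two rates drift apart, the same function with separate growth and price factors would show spending rising or falling instead.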

    Kurt Bollacker,
    Technical Director, Internet Archive
    kurt@archive.org
    www.archive.org

  13. Re:This is not "how we live" at all by Valdrax · · Score: 3

    Well, what else is it going to be used for? Your suggestion that it be used as a reference on "the growth of communications technology" is ridiculous - the growth of hate material and pornography on the web has no correlation with the growth of communications technology at all. This project is not getting a snapshot of web technology, it is getting a snapshot of web content, something entirely different.

    Much of the content of the web relates to the growth of communications technology. You are limiting yourself severely if you are only thinking of the raw bandwidth connections. The growth in the use of non-textual content, multimedia, and scripting languages and applets are all advances in communications technology. Not to mention, the radical growth of the internet, in terms of number of sites and content on sites, helps document that. Plus, your view of what the content of the web actually is is rather stilted. Even if it wasn't, the vast amount of negative material is worth studying in its own right.

    Please demonstrate how believing in God and decent Christian morality is "factually wrong". I doubt you can.

    <offtopicrant>
    Please show how your condescending and arrogant attitude reflects a life led by Christ. It's jerks like you who give the rest of us a bad name. Did he anywhere indicate that he was talking about Christianity? Anywhere? Or was he just responding to your blanket assertion that you were persecuted for your views, which in no place specified Christianity?

    Instead, you assumed he did, or were attempting to sidetrack the issue to make yourself look like the oppressed religious minority. This kind of behavior is what disgusts others. When people look at me, they see someone like you -- an arrogant, bigoted ass who sees the entire world as filled with evil sinners out to get them. It makes me sick.
    </offtopicrant>

    My point is, if you read my post, that this is not a good thing given the exclusive access to the net by a certain portion of society. Would you consider how a society lived through records of its nobility?

    Um.. Let me think. YES!! That's how historians have had to do it for ages. Should we ignore early American politics because it too was primarily run by white, middle-aged landowners? You must attempt to look at all elements of a society to see how the framework fits together. If the web is a rich WASP playground, like you assert, then it's worth studying why exactly this is. Just because one particular class was behind a major force of societal change doesn't make it not worth studying. This kind of PC "1984" style of thought would have us ignore all of our history for the goal of deluding ourselves into thinking we're perfect. Well, we're not. Get over it, and start trying to figure out why. Just because America was led by white landowners early on doesn't mean we should've ignored history to know where we are now. Similarly, we should not ignore the current Internet so that future generations will know how they got to where they will be.

    P.S. Sort your HTML tags out.

    Finally, the prima facie evidence of a troll. Someone who picks at the formatting rather than the content of the person they disagree with. The guy obviously accidentally selected the "Extrans" option, which I too have accidentally done before in the past thinking that it would do the opposite. Besides, you should really preview and check your spelling before being so harsh. Talk about the pot calling the kettle black.

    --
    If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").
  14. This is not "how we live" at all by Jon Erikson · · Score: 4

    Sure, this'll be a useful reference for future generations, won't it? I'm sorry, but as much of a fan of the web as I am, I really wouldn't consider it to be something worthy of archival in the state that it is at the moment. Why? Well, because currently the web is still in the transitional period between the days of ARPAnet and purely academic use and acceptance as a medium through which the general public can communicate. And as such, it's still in a state where teething problems overwhelm content.

    The trouble with the web is that although it is supposedly accessible to anyone with a phone line and a PC, the harsh reality is that cost and communications infrastructure have meant that only those of a certain socioeconomic group are currently able to use the web, and this group is mainly comprised of the privileged, a group which most /.ers fall into by dint of their jobs or backgrounds. Research carried out by both my consultancy group and others indicates that the majority of people able to use the web are white, middle-class and certainly in higher than average tax brackets.

    So given this, how does taking a snapshot of the web give a view of how society is at the moment? It doesn't, any more than looking at the Royal family of England gives a picture of what England is like (despite what some Americans seem to think). The views that are expressed on the web are those of a privileged class who do not have to suffer the effects of current liberal free-market policies and the increasing divide between the rich and the poor.

    No, this exercise will be a "Who's Who" of society, showing only those who are rich enough to be able to afford net access. The majority of people, unable to benefit from the web, will be left by this study as an underclass, something which I view as incredibly wrong and an example of the undeniable arrogance that most people on the net display towards those that are perceived as their inferiors. Indeed, I have suffered the same myself here on this forum for expressing views that people consider "outdated" or "primitive", even though they are held by many others.

    Anyway, any study that attempts to categorise how we live at the moment using the web is doomed to be prejudiced and incomplete. Until everyone is online and has equal access, this is just another arrogant study attempting to categorise who is worth enough to be able to use the net.



    ---
    Jon E. Erikson
    --

    Jon Erikson, IT guru

  15. Interesting, but... by TheFrood · · Score: 3

    This is really a cool idea. Up 'til now, we've taken it for granted that our media would last long enough for historians to make use of it in the future. With the web, you can't assume that's the case, so it's good that someone's taking it upon themselves to archive the web.

    But I want to know more:

    How deep does this archiving go? Are they going to store every single page and image of every single website?

    How much storage space is required for the whole web? Wild guess: A recent /. story put the number of Apache-served websites at about 10 million. Since Apache has roughly two-thirds of the market, that makes the total number of web sites 15 million. If sites average, say, 10MB in size (wild guess), then it looks like 150 terabytes would be enough to store the entire web.

    What software/OSes are they using for this project?

    When do we get to see the archive?
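The storage guess above can be checked with a few lines of arithmetic. This sketch just restates the numbers from the question, using exact fractions so the two-thirds share does not introduce rounding noise; it assumes decimal units (1 TB = 1,000,000 MB) and takes the 10 MB average per site at face value.

```python
from fractions import Fraction

APACHE_SITES = 10_000_000        # Apache-served sites, per the /. story cited
APACHE_SHARE = Fraction(2, 3)    # Apache's rough market share
AVG_SITE_MB = 10                 # wild guess from the question
MB_PER_TB = 1_000_000            # assumption: decimal units

total_sites = APACHE_SITES / APACHE_SHARE
total_tb = total_sites * AVG_SITE_MB / MB_PER_TB

print(int(total_sites))  # 15000000
print(int(total_tb))     # 150
```

The estimate is very sensitive to the average site size: halving or doubling the 10 MB guess halves or doubles the total.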

    --
    If you say "I'll probably get modded down for this..." then I will mod you down.
  16. Google doing something similar by Chairboy · · Score: 4

    You know, Google is doing something similar. They have copies of websites cached from the last time they crawled them.

    On topic, though, doesn't this threaten to change accountability for people who post commitments on their sites that they cannot meet? In the past, these people could change their website at will, and since there wasn't a physical copy, there was no evidence of the previous commitment.

    I anticipate this also being used in court. Think about it, if someone sues for libel, the evidence could be available in the 'snapshot' archive. This converts their project into a legal document, and means that the company doing this net archiving could be in danger of contempt hearings if they don't take extraordinary measures to ensure the integrity of the data.

    I don't know, this sounds like an awfully big responsibility, and I hope this company has a good bunch of lawyers.