Slashdot Mirror


Obtaining Archives of USENET?

Academic Researcher asks: "Google took over Deja and built the mammoth Google Groups, which is a near-complete archive of USENET dating back to the early 1980s. For an academic project, I need to analyse a lot of USENET data. The Terms of Service for Google Groups do not allow automated access (and even if they did, I'd have to write a bunch of tools and reverse engineer it all). Inquiries about purchasing copies of the archive have gone unanswered. No one else seems to have such an archive. Apart from this meaning that the world has one single USENET archive (I hope they have backup floppies!), how can I obtain historical data for research purposes? I'd happily pay money for DVDs of archival material if they were available. Can anyone help?"

13 of 86 comments

  1. Google just bought the data by sartin · · Score: 4, Informative


    Technically, Google did not buy Deja, they just bought a bunch of data. Deja was purchased by buy.com, part of eBay and is now firmly merged into eBay (with some of the folks over in paypal).

  2. Sure! by Sevn · · Score: 5, Funny

    http://groups.google.com

    Search: *

    Then save to file.

    DONE!

    --
    For every annoying gentoo user, are three even more annoying anti-gentoo crybabies. Take Yosh from #Gimp for example.
  3. You understand how much data you're talking about? by rthille · · Score: 4, Interesting


    You're going to want to talk with Jim Gray of this article:
    http://slashdot.org/article.pl?sid=03/07/10/2352251

    --
    Awesome furniture, accessories and cabinetry in Santa Rosa, CA: http://humanity-home.com/
  4. History of selling Usenet archives by shoppa · · Score: 5, Interesting
    In the early 1990s, a company in Vancouver BC proposed a monthly distribution of Usenet via CD. This sparked extensive discussions in the newsgroups (at the time almost exclusively dominated by academic people, of course) with a lot of resentment that some company was going to be making money by selling *their* posts. (Do a Google Groups search for "usenet on CD" to see some of these. They also mention Walnut Creek.)

    In any event the massive number of binary posts (porn, movies, warez, etc) on usenet in the past few years would make the "full" archive of the past few years number in the tens of thousands of CDs. A "full" usenet feed passed up the bandwidth of a T1 about 1998 IIRC.

    Some individuals archive individual usenet groups, or the group is gatewayed back and forth to a mailing list that is archived. This IMHO is more appropriately manageable for research.

    The announcement of Google Groups with a 20 year archive acknowledges several sources for the broad timeframe of the archive (as well as the donors to the preceding Dejanews archive); you might want to check out their specific work.

    1. Re:History of selling Usenet archives by dougmc · · Score: 4, Interesting
      In any event the massive number of binary posts (porn, movies, warez, etc) on usenet in the past few years would make the "full" archive of the past few years number in the tens of thousands of CDs. A "full" usenet feed passed up the bandwidth of a T1 about 1998 IIRC.
      NOBODY has a full archive of the posts made to alt.binaries.* -- I doubt that even the NSA has that much storage to spare for it.

      I believe that a full feed is around 300 GB per day now, with 99+% of that being alt.binaries.*.
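To put the figures in this sub-thread side by side, here is a rough back-of-the-envelope sketch. The 300 GB/day number is the estimate quoted above; the nominal T1 rate (1.544 Mbit/s), decimal gigabytes, and a 700 MB CD capacity are assumptions made for illustration, not measurements.

```python
# Rough arithmetic on the volumes mentioned in the thread; all constants are nominal.
T1_BITS_PER_SECOND = 1_544_000           # nominal T1 line rate
SECONDS_PER_DAY = 24 * 60 * 60
FULL_FEED_BYTES_PER_DAY = 300 * 10**9    # "around 300 GB per day" (decimal GB assumed)
CD_BYTES = 700 * 10**6                   # assumption: 700 MB per CD


def t1_bytes_per_day():
    """Bytes a fully saturated T1 could carry in one day."""
    return T1_BITS_PER_SECOND * SECONDS_PER_DAY / 8


def t1_lines_for_feed(feed_bytes_per_day=FULL_FEED_BYTES_PER_DAY):
    """How many saturated T1 lines the given daily feed would need."""
    return feed_bytes_per_day / t1_bytes_per_day()


def cds_per_day(feed_bytes_per_day=FULL_FEED_BYTES_PER_DAY):
    """How many CDs one day of the given feed would fill."""
    return feed_bytes_per_day / CD_BYTES


def terabytes_per_year(feed_bytes_per_day=FULL_FEED_BYTES_PER_DAY):
    """Yearly volume in decimal terabytes at a constant daily rate."""
    return feed_bytes_per_day * 365 / 10**12


if __name__ == "__main__":
    print(f"T1 capacity: {t1_bytes_per_day() / 10**9:.1f} GB/day")
    print(f"T1 lines for a full feed: {t1_lines_for_feed():.1f}")
    print(f"CDs per day: {cds_per_day():.0f}")
    print(f"Volume per year: {terabytes_per_year():.1f} TB")
```

Under these assumptions a single T1 carries roughly 17 GB per day, so a 300 GB/day feed is on the order of eighteen T1s, and a year at that rate lands near the 100 TB range.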

  5. sequential snapshots are an archive by RGRistroph · · Score: 4, Informative

    Why can't he start archiving now, and then work on his project, testing it on his data as he goes? He might have enough by the time he is done.

    Also, why does it have to be usenet specifically? If he needs a long time period but not a great breadth of groups, he may be able to find mailing list archives that are sufficient. The 9fans and lkml archives both go back quite a ways, just to mention a few off the top of my head. If you go down to your university's sysadmin department and ask them what's the oldest continuously active mailing list they have archives for, you may strike gold.

    In fact, your university's sys admins may have usenet archives also.

    You may find this helpful in scraping web-served mailing list archives into a form you can use:

    http://www.linpro.no/lwp/

    Also there is a Perl script out there that will download the archives of a Yahoo group into an mbox.
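As a rough illustration of the "into an mbox" step, here is a minimal Python sketch using only the standard library. It is not the Perl script or the LWP tooling mentioned above; it assumes the raw messages have already been fetched as RFC 822 style text, and the function name `append_to_mbox` and the sample message are hypothetical.

```python
import email
import mailbox
import os
import tempfile

# Hypothetical sample message standing in for one scraped archive entry.
SAMPLE = (
    "From: someone@example.com\n"
    "Subject: test post\n"
    "Date: Mon, 3 Feb 1992 10:00:00 GMT\n"
    "\n"
    "Body of the message.\n"
)


def append_to_mbox(raw_messages, mbox_path):
    """Append already-fetched raw messages to an mbox file; return how many were added.

    No file locking is done here, so this assumes a single writer.
    """
    box = mailbox.mbox(mbox_path)  # creates the file if it does not exist
    count = 0
    try:
        for raw in raw_messages:
            box.add(email.message_from_string(raw))
            count += 1
        box.flush()
    finally:
        box.close()
    return count


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "list.mbox")
    print(append_to_mbox([SAMPLE, SAMPLE], path), "messages written to", path)
```

Once everything is in one mbox file, the same `mailbox.mbox` object can be iterated to run whatever analysis the project needs.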

  6. I need to analyse a lot of USENET data... by AtariDatacenter · · Score: 4, Insightful

    > For an academic project, I need to analyse a
    > lot of USENET data.

    Are you SURE you're prepared for that much data? And the costs for just storing that much data? Not to mention manipulating it? I think you'd need an academic project just to analyze that factor alone.

  7. Re:Google by Big+Sean+O · · Score: 4, Interesting

    If it's really for an academic project (say a master's thesis), you might want to direct your inquiry to Craig Silverstein. Since he's a grad student, it's likely he would be more interested in your project than the Google corporate types.

    Who knows, mebbe you can parlay the project into an internship into a real live job.

    --
    My father is a blogger.
  8. Translation by truffle · · Score: 4, Funny


    how can I obtain historical data for research purposes ?


    Translation:

    How can I build the largest pr0n collection ever?

    --

    ---
    I support spreading santorum
  9. Re:Yeah by orthogonal · · Score: 4, Funny

    I suppose that if you sent it to the wrong department then it would take *much* longer.

    Well, it's not the department actually.

    What you need to do is fill in the X-Meta tag in your email (or if you use Microsoft email products, the <HEAD> <META> tags) with keywords describing your email. "Hot amine poon-tang" is often a useful meta tag.

    Then, you need to get a lot of people to link to your email (that is, refer to it in their email) with a X-References tag.

    You can do this just by getting it popular on a mailing list, but even more effective is to post it to a number of bloggers. Bloggers, being how they are (self-important but afraid of not being noticed), will endlessly refer to it, and will probably get into interminable "blog spats" about it -- well, to be honest, about something completely tangential to it, but what do you care? It pushes your "Mail-Rank" score up regardless.

    If you can't do that, try contacting "MailKing", a commercial service that sends out a lot of email, purportedly from unrelated individuals, which refers to your email. They'll make it look like your email is really important.

    When your "Mail-Rank" is high enough, Google will have no choice but to notice it and reply.

    Best of luck getting your email noticed!

  10. Follow up by Anonymous Coward · · Score: 5, Informative

    I'm the original poster of the question. Just to answer a few points raised by other people:

    1. I am only interested in USENET data until the early 1990s, and even then only for parts of comp.*. I know that I wouldn't be able to purchase/obtain selective parts of it, but I have quantified that 1980->199[0-5] is on the order of a couple of hundred gigabytes max. This is not a large volume of data if obtained on DAT (I certainly have the facilities/funds for limited hardware/software), or if I could make use of whatever facilities store it at the moment that would also be great (e.g. if I were able to come to an agreement to use a host account located at Google, where I could develop and run tools over the data [1]).

    2. I have so far developed tools to complete part of this work, but found myself violating Google's ToS and having myself blocked from access to Google Groups. My tools simply scraped a large amount of data from a specific newsgroup (comp.something) and performed some automated analysis that I use to correlate with other findings, etc.

    [1] There's an analogy here with obtaining time on a supercomputer for research work, though what I need is not a supercomputer, but superstorage.
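For readers wondering what "automated analysis" over such an archive might look like once the raw articles are available locally, here is a small hedged sketch. It is not the poster's tooling; it assumes articles are plain text with RFC 1036 style headers (`Newsgroups:`, `Date:`), and the function name `posts_per_year` and the sample articles are hypothetical. Very old articles may carry date formats this parser rejects, in which case they are simply skipped.

```python
from collections import Counter
import email
from email.utils import parsedate_to_datetime

# Hypothetical sample articles; real input would come from a local archive.
SAMPLES = [
    "From: a@example.com\nNewsgroups: comp.lang.c\nSubject: x\n"
    "Date: Mon, 3 Feb 1992 10:00:00 GMT\n\nbody\n",
    "From: b@example.com\nNewsgroups: comp.unix.wizards,rec.humor\nSubject: y\n"
    "Date: Tue, 5 Jan 1988 08:30:00 GMT\n\nbody\n",
    "From: c@example.com\nNewsgroups: rec.humor\nSubject: z\n"
    "Date: Tue, 5 Jan 1988 09:00:00 GMT\n\nbody\n",
    "From: d@example.com\nNewsgroups: comp.lang.c\nSubject: w\n"
    "Date: not a date\n\nbody\n",
]


def posts_per_year(raw_articles, group_prefix="comp."):
    """Count articles per year that were posted to any group under group_prefix."""
    counts = Counter()
    for raw in raw_articles:
        msg = email.message_from_string(raw)
        groups = [g.strip() for g in (msg.get("Newsgroups") or "").split(",")]
        if not any(g.startswith(group_prefix) for g in groups):
            continue  # not in the hierarchy of interest
        try:
            when = parsedate_to_datetime(msg.get("Date", ""))
        except (TypeError, ValueError):
            continue  # missing or unparseable Date header
        counts[when.year] += 1
    return counts


if __name__ == "__main__":
    for year, n in sorted(posts_per_year(SAMPLES).items()):
        print(year, n)
```

Because the function only needs an iterable of strings, it could be run over files on a DAT restore or on a remote host account just as well as over in-memory samples.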

  11. Here is what you are looking for. by Anonymous+Cowdog · · Score: 5, Informative

    I am surprised this Ask Slashdot question hasn't (really) been answered after this many hours. Well here is the answer.

    Go to this link:

    http://www.archive.org/web/researcher/proposal.php

    Create an account, log in, write a proposal, submit it, then wait.

    Yes they have Usenet data, not just web data.

    After I submitted a proposal this way, I had to bug them by email and eventually phone even to get the account set up. It's not that they aren't helpful, they are just busy with lots of projects.

    But don't expect hand holding. You need to be comfortable operating in a Unix environment. And I don't know if the data can leave their servers; you might need to do all your processing using their machines.

  12. Re:I disagree by Raul654 · · Score: 4, Funny

    It's just not that useful. 106 TB/year, mostly porn and warez.

    I don't see how you can reconcile those two sentences.

    --


    To make laws that man cannot, and will not obey, serves to bring all law into contempt.
    --E.C. Stanton