Obtaining Archives of USENET?

Posted by Cliff on Friday July 25, 2003 @12:27PM from the dealing-with-the-red-tape dept.

Academic Researcher asks: "Google took over Deja and built the mammoth Google Groups, which is a near-complete archive of USENET dating back to the early 1980s. For an academic project, I need to analyse a lot of USENET data. The Terms of Service for Google Groups will not allow automated access (and even so, I'd have to write a bunch of tools and reverse engineer it all). Inquiries about purchasing copies of the archive have gone unanswered. No one else seems to have such an archive. Apart from this meaning that the world has one single USENET archive (I hope they have backup floppies!), how can I obtain historical data for research purposes? I'd happily pay money for DVDs of archival material if they were available. Can anyone help?"

Google just bought the data by sartin · 2003-07-25 12:34 · Score: 4, Informative

Technically, Google did not buy Deja; they just bought a bunch of data. Deja was purchased by buy.com, part of eBay, and is now firmly merged into eBay (with some of the folks over in PayPal).
1. Re:Google just bought the data by /dev/trash · 2003-07-26 04:31 · Score: 3, Informative
  
  Actually, according to this: http://news.com.com/2100-1023-252449.html?legacy=cnet
  
  eBay bought the consumer website and Google bought everything else, including the deja.com name.
Google by rmohr02 · 2003-07-25 12:53 · Score: 3, Informative

You can't automate a search of Google Groups through the web, but you might be able to work out a deal with Google's corporate offices. There can't be many other people with the data to talk with.
Re:it occurs to me by damien_kane · 2003-07-25 13:15 · Score: 2, Informative

The problem with this is data retention.
I don't personally know of any USENET servers that have more than 10 days' worth of posts.
Some of the groups (like alt.binaries.*) will only have 1-3 days retention.

Nope, this guy needs an archive, not a current snapshot.
sequential snapshots are an archive by RGRistroph · 2003-07-25 13:42 · Score: 4, Informative

Why can't he start archiving now, and then work on his project, testing it on his data as he goes? He might have enough by the time he is done.
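
A minimal sketch of how sequential snapshots could be folded into one archive, assuming each snapshot is just a list of raw article texts pulled from a short-retention server. Consecutive snapshots overlap, so articles are de-duplicated on their Message-ID header. The function name `merge_snapshots` and the sample articles are hypothetical, made up for illustration.

```python
from email import message_from_string


def merge_snapshots(snapshots):
    """Merge article snapshots, keeping the first copy of each Message-ID."""
    archive = {}
    for snapshot in snapshots:  # oldest snapshot first
        for raw in snapshot:
            msg_id = message_from_string(raw)["Message-ID"]
            if msg_id is None:
                continue  # no usable ID, so the article cannot be de-duplicated
            archive.setdefault(msg_id.strip(), raw)
    return archive


# Hypothetical sample data: two daily snapshots that share one article.
day1 = [
    "Message-ID: <a@example>\nSubject: one\n\nbody one\n",
    "Message-ID: <b@example>\nSubject: two\n\nbody two\n",
]
day2 = [
    "Message-ID: <b@example>\nSubject: two\n\nbody two\n",
    "Message-ID: <c@example>\nSubject: three\n\nbody three\n",
]

archive = merge_snapshots([day1, day2])
print(len(archive))  # 3 distinct articles
```

How the snapshots get fetched in the first place (NNTP, a spool directory, something else) is left open here; the sketch only covers the merge step.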

Also, why does it have to be USENET specifically? If he needs a long time period but not a great breadth of groups, he may be able to find mailing list archives that are sufficient. The 9fans and lkml lists both go back quite a ways, just to mention a few off the top of my head. If you go down to your university's sys admin department and ask them what the oldest continuously active mailing list they have archives for is, you may strike gold.

In fact, your university's sys admins may have usenet archives also.

You may find this helpful in scraping web-served mailing list archives into a form you can use:

http://www.linpro.no/lwp/

Also, there is a Perl script out there that will download the archives of a Yahoo group into an mbox.
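
Once the archives are in mbox form, reading them back is straightforward. A minimal sketch using Python's standard `mailbox` module, assuming a plain mbox file on disk; the helper name `list_subjects` and the two sample messages are hypothetical.

```python
import mailbox
import os
import tempfile


def list_subjects(mbox_path):
    """Return (From, Subject) header pairs for every message in an mbox file."""
    box = mailbox.mbox(mbox_path)
    try:
        return [(msg["From"], msg["Subject"]) for msg in box]
    finally:
        box.close()


# Hypothetical two-message mbox, written to a temp file for the demo.
sample = (
    "From alice@example.org Fri Jul 25 12:00:00 2003\n"
    "From: alice@example.org\n"
    "Subject: first post\n"
    "\n"
    "Hello list.\n"
    "\n"
    "From bob@example.org Fri Jul 25 13:00:00 2003\n"
    "From: bob@example.org\n"
    "Subject: Re: first post\n"
    "\n"
    "Hello Alice.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(sample)
    subjects = list_subjects(path)

for sender, subject in subjects:
    print(sender, "|", subject)
```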
google api by krismon · 2003-07-25 19:21 · Score: 2, Informative

Download the free Google API (for non-commercial use); it would make your scripting/programming much easier...
Follow up by Anonymous Coward · 2003-07-26 00:40 · Score: 5, Informative

I'm the original poster of the question. Just to answer a few points raised by other people:

1. I am only interested in USENET data up to the early 1990s, and even then only for parts of comp.*. I know that I wouldn't be able to purchase/obtain selective parts of it, but I have estimated that 1980->199[0-5] is on the order of a couple of hundred gigabytes at most. This is not a large volume of data if obtained on DAT (I certainly have the facilities/funds for limited hardware/software), or if I could make use of whatever facilities store it at the moment, that would also be great (e.g. if I were able to come to an agreement to use a host account located at Google, where I could develop and run tools over the data [1]).

2. I have so far developed tools to complete part of this work, but found myself violating Google's ToS and getting blocked from access to Google Groups. My tools simply scraped a large amount of data from specific newsgroups (comp.something) and performed some automated analysis that I use to correlate with other findings, etc.

[1] There's an analogy here with obtaining time on a supercomputer for research work, though what I need is not a supercomputer, but superstorage.
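
For a sense of what such automated analysis might look like once the raw articles are available offline, here is a minimal sketch that counts posts per year for groups under comp.*. It is not the poster's actual tooling, which is not shown; the function name `posts_per_year`, the `group_prefix` parameter, and the sample articles are all hypothetical.

```python
from collections import Counter
from email import message_from_string
from email.utils import parsedate_to_datetime


def posts_per_year(raw_articles, group_prefix="comp."):
    """Count articles per year, keeping only those posted to matching groups."""
    counts = Counter()
    for raw in raw_articles:
        msg = message_from_string(raw)
        groups = [g.strip() for g in (msg["Newsgroups"] or "").split(",")]
        if not any(g.startswith(group_prefix) for g in groups):
            continue  # not posted to any group under the prefix
        try:
            year = parsedate_to_datetime(msg["Date"]).year
        except (TypeError, ValueError):
            continue  # missing or unparseable Date header
        counts[year] += 1
    return counts


# Hypothetical sample articles (headers only matter here).
articles = [
    "Newsgroups: comp.lang.c\nDate: 4 Mar 1991 10:00:00 GMT\n\nbody\n",
    "Newsgroups: comp.unix.wizards\nDate: 12 Jun 1989 08:30:00 GMT\n\nbody\n",
    "Newsgroups: rec.humor\nDate: 5 Mar 1991 11:00:00 GMT\n\nbody\n",
    "Newsgroups: rec.misc,comp.lang.c\nDate: 6 Mar 1991 09:15:00 GMT\n\nbody\n",
]

counts = posts_per_year(articles)
print(sorted(counts.items()))
```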
Here is what you are looking for. by Anonymous Cowdog · 2003-07-26 00:46 · Score: 5, Informative

I am surprised this Ask Slashdot question hasn't (really) been answered after this many hours. Well here is the answer.

Go to this link:

http://www.archive.org/web/researcher/proposal.php

Create an account, log in, write a proposal, submit it, then wait.

Yes they have Usenet data, not just web data.

After I submitted a proposal this way, I had to bug them by email and eventually phone even to get the account set up. It's not that they aren't helpful, they are just busy with lots of projects.

But don't expect hand holding. You need to be comfortable operating in a Unix environment. And I don't know if the data can leave their servers; you might need to do all your processing using their machines.
Re:That's interesting... by Anonymous Coward · 2003-07-27 08:01 · Score: 1, Informative

Actually, if you are at all genuinely upset by this, Google lets you remove your postings from their archive.

http://groups.google.com/googlegroups/help.html#9