Obtaining Archives of USENET?
Academic Researcher asks: "Google took over Deja and built the mammoth Google Groups, which is a near-complete archive of USENET dating back to the early 1980s. For an academic project, I need to analyse a lot of USENET data. The Terms of Service for Google Groups do not allow automated access (and even if they did, I'd have to write a bunch of tools and reverse engineer it all). Inquiries about purchasing copies of the archive have gone unanswered. No one else seems to have such an archive. Apart from this meaning that the world has one single USENET archive (I hope they have backup floppies!), how can I obtain historical data for research purposes? I'd happily pay money for DVDs of archival material if they were available. Can anyone help?"
http://groups.google.com
Search: *
Then save to file.
DONE!
For every annoying gentoo user, there are three even more annoying anti-gentoo crybabies. Take Yosh from #Gimp for example.
In any event, the massive number of binary posts (porn, movies, warez, etc.) on usenet in the past few years would make the "full" archive of the past few years number in the tens of thousands of CDs. A "full" usenet feed outgrew the bandwidth of a T1 around 1998, IIRC.
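To put a rough number on that: a T1 runs at a nominal 1.544 Mbit/s, so a feed that merely saturates one already fills thousands of CDs a year. The quick check below is only a back-of-the-envelope sketch; the 700 MB CD capacity and the function names are my own assumptions, not figures from the thread.

```python
# Back-of-the-envelope estimate, not a measurement of any real feed.
T1_BITS_PER_SECOND = 1_544_000   # nominal T1 line rate
CD_BYTES = 700 * 10**6           # assumed capacity of one CD-R


def t1_bytes_per_day(bits_per_second=T1_BITS_PER_SECOND):
    """Bytes moved in 24 hours by a link running flat out."""
    return bits_per_second // 8 * 86_400


def cds_per_year(bytes_per_day, cd_bytes=CD_BYTES):
    """Number of CDs needed to hold one year of traffic at that daily rate."""
    return bytes_per_day * 365 / cd_bytes


daily = t1_bytes_per_day()
print(f"{daily / 10**9:.1f} GB per day")           # 16.7 GB per day
print(f"{cds_per_year(daily):,.0f} CDs per year")  # 8,695 CDs per year
```

Since the full feed has only grown since it outgrew a T1, a few years of it landing in the tens of thousands of CDs is plausible.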
Some individuals archive individual usenet groups, or the group is gatewayed back and forth to a mailing list that is archived. This IMHO is more appropriately manageable for research.
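If such a per-group or mailing-list archive is available as an mbox file (an assumption; archives come in many formats), Python's standard mailbox module is enough for a first pass. Here is a minimal sketch that counts posts per year from the Date header. The function name and the file name in the usage line are hypothetical.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime


def posts_per_year(mbox_path):
    """Count messages per year, based on each message's Date header."""
    counts = Counter()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            raw_date = message["Date"]
            if raw_date is None:
                continue  # skip messages with no Date header
            try:
                year = parsedate_to_datetime(str(raw_date)).year
            except (TypeError, ValueError):
                continue  # skip dates that cannot be parsed
            counts[year] += 1
    finally:
        box.close()
    return counts
```

Usage would look something like posts_per_year("comp.lang.c.mbox"). Old usenet posts often carry non-standard date formats, so expect some messages to be skipped until the date handling is extended.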
The announcement of Google Groups with a 20 year archive acknowledges several sources for the broad timeframe of the archive (as well as the donors to the preceding Dejanews archive); you might want to check out their specific work.
I'm the original poster of the question. Just to answer a few points raised by other people:
1. I am only interested in USENET data until the early 1990s, and even then only for parts of comp.*. I know that I wouldn't be able to purchase/obtain selective parts of it, but I have quantified that 1980->199[0-5] is on the order of a couple of hundred gigabytes max. This is not a large volume of data if obtained on DAT (I certainly have the facilities/funds for limited hardware/software), or if I could make use of whatever facilities store it at the moment, that would also be great (e.g. if I were able to come to an agreement to use a host account located at Google, where I could develop and run tools over the data [1]).
2. I have so far developed tools to complete part of this work, but found myself violating Google's ToS and getting myself blocked from access to Google Groups. My tools simply scraped a large amount of data from a specific newsgroup (comp.something) and performed some automated analysis that I use to correlate with other findings, etc.
[1] There's an analogy here with obtaining time on a supercomputer for research work, though what I need is not a supercomputer, but superstorage.
I am surprised this Ask Slashdot question hasn't (really) been answered after this many hours. Well here is the answer.
Go to this link:
http://www.archive.org/web/researcher/proposal.php
Create an account, log in, write a proposal, submit it, then wait.
Yes they have Usenet data, not just web data.
After I submitted a proposal this way, I had to bug them by email and eventually by phone just to get the account set up. It's not that they aren't helpful; they are just busy with lots of projects.
But don't expect hand holding. You need to be comfortable operating in a Unix environment. And I don't know if the data can leave their servers; you might need to do all your processing using their machines.