Obtaining Archives of USENET?
Academic Researcher asks: "Google took over Deja and built the mammoth Google Groups, which is a near-complete archive of USENET dating back to the early 1980s. For an academic project, I need to analyse a lot of USENET data. The Terms of Service for Google Groups do not allow automated access (and even if they did, I'd have to write a bunch of tools and reverse engineer it all). Inquiries about purchasing copies of the archive have gone unanswered. No one else seems to have such an archive. Apart from this meaning that the world has one single USENET archive (I hope they have backup floppies!), how can I obtain historical data for research purposes? I'd happily pay money for DVDs of archival material if they were available. Can anyone help?"
Technically, Google did not buy Deja; they just bought a bunch of data. Deja was purchased by buy.com, part of eBay, and is now firmly merged into eBay (with some of the folks over in PayPal).
But have you tried e-mailing people at Google Groups?
http://groups.google.com
Search: *
Then save to file.
DONE!
For every annoying Gentoo user, there are three even more annoying anti-Gentoo crybabies. Take Yosh from #Gimp for example.
You can't automate a search of Google Groups through the web, but you might be able to work out a deal with Google's corporate offices. There can't be many other people with the data to talk with.
You're going to want to talk with Jim Gray of this article:
http://slashdot.org/article.pl?sid=03/0
Couldn't you write a script to first get a list of all the newsgroups on a server (here I am assuming that the server I use, news.tbaytel.net, is in convergence with the rest of Usenet), then get all the headers in each group, then get all the messages in each? If you use the server provided by your ISP over a fast connection, you can easily max out your connection, as it is more or less a direct line.
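A rough sketch of the planning half of such a script, in Python, with no network access. It parses the response to NNTP's LIST ACTIVE command (lines of the form "name high low status") and splits each group's article range into batches that could then be requested with OVER and ARTICLE. The Group type, the function names, the batch size of 500, and the sample lines are all made up for illustration. Note that this only reaches whatever the server still retains, not the historical archive.

```python
from typing import Iterator, NamedTuple


class Group(NamedTuple):
    name: str
    low: int   # lowest article number the server reports
    high: int  # highest article number the server reports


def parse_list_active(lines: list[str]) -> list[Group]:
    """Parse LIST ACTIVE response lines of the form 'name high low status'."""
    groups = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed lines
        name, high, low = parts[0], int(parts[1]), int(parts[2])
        groups.append(Group(name, low, high))
    return groups


def batches(group: Group, size: int = 500) -> Iterator[tuple[int, int]]:
    """Yield (first, last) article-number ranges covering the whole group."""
    start = group.low
    while start <= group.high:  # an empty group (high < low) yields nothing
        end = min(start + size - 1, group.high)
        yield (start, end)
        start = end + 1


# Hypothetical sample lines, not real server output.
sample = ["comp.lang.c 1200 1 y", "comp.unix.wizards 250 101 y"]
for g in parse_list_active(sample):
    print(g.name, list(batches(g)))
```

Keeping the batches small means a dropped connection only costs one batch, and the script can resume from the last range it finished.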
History will be kind to me, for I intend to write it - Sir Winston Churchill
"Inquiries about purchasing copies of the archive have gone unanswered." I assume he wasn't just searching for "let me buy your data" and actually did contact the corporate offices.
In any event, the massive number of binary posts (porn, movies, warez, etc.) on Usenet in the past few years would make the "full" archive of the past few years number in the tens of thousands of CDs. A "full" Usenet feed outgrew the bandwidth of a T1 around 1998, IIRC.
Some individuals archive individual Usenet groups, or the group is gatewayed back and forth to a mailing list that is archived. This, IMHO, is more appropriately manageable for research.
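For a sense of scale, here is the back-of-the-envelope arithmetic for a feed that exactly saturates a T1. The 1.544 Mbit/s line rate is the nominal T1 figure; the 700 MB CD capacity and the assumption of a constantly saturated link are mine, so treat the result as an order-of-magnitude estimate only.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate
CD_BYTES = 700 * 10**6          # assumed CD capacity (roughly 700 MB)
SECONDS_PER_DAY = 86_400


def bytes_per_day(bits_per_second: int) -> float:
    """Bytes transferred in one day at a constant bit rate."""
    return bits_per_second / 8 * SECONDS_PER_DAY


def cds_per_year(bits_per_second: int, cd_bytes: int = CD_BYTES) -> float:
    """Number of CDs needed to hold one year of traffic at that rate."""
    return bytes_per_day(bits_per_second) * 365 / cd_bytes


daily = bytes_per_day(T1_BITS_PER_SECOND)
yearly_cds = cds_per_year(T1_BITS_PER_SECOND)
print(f"{daily / 1e9:.1f} GB/day, about {yearly_cds:.0f} CDs/year")
```

Under those assumptions a T1-sized feed alone is several thousand CDs a year, so a feed that kept growing past that point would reach the tens of thousands within a few years.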
The announcement of Google Groups with a 20 year archive acknowledges several sources for the broad timeframe of the archive (as well as the donors to the preceding Dejanews archive); you might want to check out their specific work.
Why can't he start archiving now, and then work on his project, testing it on his data as he goes? He might have enough by the time he is done.
Also, why does it have to be Usenet specifically? If he needs a long time period but not a great breadth of groups, he may be able to find mailing list archives that are sufficient. The 9fans and lkml archives both go back quite a ways, just to mention a few off the top of my head. If you go down to your university's sys admin department and ask them what's the oldest continuously active mailing list they have archives for, you may strike gold.
In fact, your university's sys admins may have usenet archives also.
You may find this helpful in scraping web-served mailing list archives into a form you can use:
http://www.linpro.no/lwp/
Also, there is a Perl script out there that will download the archives of a Yahoo group into an mbox.
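If you end up scraping list archives yourself, getting them into mbox form is the easy part. This is a minimal sketch using Python's standard mailbox module. It is not the Perl script mentioned above. The dictionary keys (from, subject, date, body), the function name, and the sample post are all hypothetical.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# Hypothetical scraped post; a real scraper would fill these fields in.
SAMPLE_POST = {
    "from": "alice@example.com",
    "subject": "hello",
    "date": "Mon, 03 Nov 2003 10:00:00 +0000",
    "body": "First post.\n",
}


def append_to_mbox(path: str, posts: list[dict]) -> int:
    """Append posts to the mbox at path; return the total message count."""
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for post in posts:
            msg = EmailMessage()
            msg["From"] = post["from"]
            msg["Subject"] = post["subject"]
            msg["Date"] = post["date"]
            msg.set_content(post["body"])
            box.add(msg)
        box.flush()  # write pending changes to disk
        return len(box)
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    count = append_to_mbox(os.path.join(tmp, "list.mbox"), [SAMPLE_POST])
    print(count, "message(s) written")
```

Once everything is in one mbox, most mail tools and analysis scripts can read it without caring where the messages came from.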
Lots and _lots_ of quality porn was available in those early years.
:-p
I know! Hire me and I will take care of that section of the data (approximately 60%).
Did you just send off emails, or did you actually pick up the phone and try to talk to somebody? You might have much better luck with the latter; Google must get a billion random emails a day.
> For an academic project, I need to analyse a
> lot of USENET data.
Are you SURE you're prepared for that much data? And the costs for just storing that much data? Not to mention manipulating it? I think you'd need an academic project just to analyze that factor alone.
I sent them an e-mail reporting a minor problem and it took them months to get back to me.
Tim
Omnia vestra castrorum habetur nobis.
together on this.
I'm looking for archives back to 1992 of alt.binaries.pictures.erotica.*, alt.binaries.warez.*, and alt.sex.*
Seriously, I'm sure that a number of intelligence agencies have archives of this stuff. I believe the FBI used to get a USENET feed on 9-track 1/2" tape.
Mea navis aericumbens anguillis abundat
You REALLY just want to get all that hot porno and wares and MP3z!!! We can see right through your story! I bet it's "seminal research" too.....
Ron Paul 2012
how can I obtain historical data for research purposes?
Translation:
How can I build the largest pr0n collection ever?
---
I support spreading santorum
DejaNews had a rival for a year or so. They got bought out long ago. Sorry, I can't remember their name. But if Google really won't co-operate, look up those other guys.
Download Google's free API (for non-commercial use); it would make your scripting/programming much easier...
I've always felt that nobody should be allowed to ask for more money than it takes to recoup the direct costs of transferring data which others produced. I wasn't paid by Google or DejaNews for my intellectual property. Tell them they must cease distribution of unlicensed copies of my posts immediately if they refuse to provide them through an automated process to anyone who wants them.
You're thinking of RemarQ, which now has this remarQable homepage.
I'm the original poster of the question. Just to answer a few points raised by other people:
1. I am only interested in USENET data until the early 1990s, and even then only for parts of comp.*. I know that I wouldn't be able to purchase/obtain selective parts of it, but I have quantified that 1980->199[0-5] is on the order of a couple of hundred gigabytes max. This is not a large volume of data if obtained on DAT (I certainly have the facilities/funds for limited hardware/software), or if I could make use of whatever facilities store it at the moment, that would also be great (e.g. if I were able to come to an agreement to use a host account located at Google, where I could develop and run tools over the data [1]).
2. I have so far developed tools to complete part of this work, but found myself violating Google's ToS and getting myself blocked from access to Google Groups. My tools simply scraped a large amount of data from a specific newsgroup (comp.something) and performed some automated analysis that I use to correlate with other findings, etc.
[1] There's an analogy here with obtaining time on a supercomputer for research work, though what I need is not a supercomputer, but superstorage.
I am surprised this Ask Slashdot question hasn't (really) been answered after this many hours. Well here is the answer.
Go to this link:
http://www.archive.org/web/researcher/proposal.ph
Create an account, log in, write a proposal, submit it, then wait.
Yes they have Usenet data, not just web data.
After I submitted a proposal this way, I had to bug them by email and eventually phone even to get the account set up. It's not that they aren't helpful, they are just busy with lots of projects.
But don't expect hand holding. You need to be comfortable operating in a Unix environment. And I don't know if the data can leave their servers; you might need to do all your processing using their machines.
I think, for practical purposes, Google owns most of USENET now, including its complete history; there will probably not be any other institutions (well, maybe the NSA) that keep a history of USENET postings.
300 GB/day -- That's 2 new hard drives each day (Pricewatch.com: 160 GB for $114 x 2 = $228/day), plus the cost of the bandwidth, plus the cost of the RAID backup (I hope). Expensive, but well within reason for a large organization.
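The same arithmetic as a few lines of Python, in case anyone wants to plug in different numbers. The feed size, drive size, and price are the figures quoted above. The yearly total simply assumes all three stay constant for 365 days, and it leaves out bandwidth and backup.

```python
import math

FEED_GB_PER_DAY = 300    # feed size quoted above
DRIVE_GB = 160           # drive capacity quoted above
DRIVE_PRICE_USD = 114    # per-drive price quoted above


def drives_needed(feed_gb: float, drive_gb: float) -> int:
    """Whole drives needed to hold one day's feed."""
    return math.ceil(feed_gb / drive_gb)


def daily_cost(feed_gb: float, drive_gb: float, price_usd: float) -> float:
    """Drive cost per day, ignoring bandwidth and backup."""
    return drives_needed(feed_gb, drive_gb) * price_usd


per_day = daily_cost(FEED_GB_PER_DAY, DRIVE_GB, DRIVE_PRICE_USD)
print(f"${per_day}/day, ${per_day * 365}/year in drives alone")
```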
To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
Your question was just FUCKING STUPID!!!!!!
That's interesting... I'm pretty sure I never gave anyone permission to sell my thousands of copyrighted Usenet posts.
Now maybe somebody will notice the flaw in the "archives are great, you and your copyright can get stuffed" arguments. It's OK for Deja to sell my copyrighted material, and for Google to make it available but only under restrictive agreements? Not by me it's not.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
(Re: Your link)
Psst... get A paying job you bum.
As most people said: call 'em. You can explain what you'd like to achieve and fight off their arguments against it if needed.
And also, check how much your email sounded like "Hey yo Google, could you burn me the usenet data you bought from deja for $XXX million onto a CD for my high-school project? I'll pay for the CD. Thanks".
I'm not saying it did, but if it can be interpreted like that even remotely, the poor first-line email reader person at google is going to delete it before it's even half read.
news.cis.dfn.de (I believe that's it - you can find it mentioned a lot in the "free news servers" newsgroup) has retention in the discussion groups (at least the ones I've looked in) all the way back to (at least) December of last year.
But if you need years of research data, it seems google would be the place to go. If your needs are limited to only a few groups, many of them have archives. The archives, for example, for some of the rec.audio groups go back many, many years and are fully indexed and stored in digests.
I also don't get the complaining about terms. These people have a huge archive of Usenet data; there's nothing to stop you from building one of your own except the fact it would cost a damn fortune to get set up and organized. If you have that kinda money, quit complaining and do it; if you don't, quit complaining about someone else (who does) taking the initiative. It would seem Google has the only reasonably complete Usenet archive - would you rather there were none at all?
Young Minds used to sell Usenet archive discs. They apparently don't anymore, but you could try to buy someone's copies...
The tag did not imply "do not print". Additionally, there is no precedent or tradition for it. No law has ever allowed a publisher to print a book under copyright and then require every copy to be burned on a certain date by all copy holders, even libraries, with the sole exception of the author's manuscript. And remember that copyrights do eventually expire.
Your best posts are probably in basements and attics, in the cabinets under night stands across the land, on yellow, jagged, torn scraps of fan-folded paper with some of the little perforated tracks still attached.
Our very discussion of the X-No-Archive header, BTW, neatly negates the assertion that archiving was not supported by Usenet protocols, etc. Finally, the copyright holder must enforce copyright through civil litigation... I think the damage assessment would be in mills, if not zilch.
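For anyone building their own archive, honoring that header is a few lines. This is a sketch under my own assumptions: X-No-Archive is a convention rather than part of the NNTP standard, I treat only the value "yes" (in any letter case) as an opt-out, and I ignore the variant where people put the line at the top of the message body. The function name is made up.

```python
from email import message_from_string


def archiving_declined(raw_article: str) -> bool:
    """Return True if the article carries an 'X-No-Archive: yes' header."""
    msg = message_from_string(raw_article)
    # Header names are matched case-insensitively by the email package.
    value = msg.get("X-No-Archive", "")
    return value.strip().lower() == "yes"


# Hypothetical article text, not a real post.
article = (
    "From: someone@example.com\n"
    "Newsgroups: comp.misc\n"
    "X-No-Archive: Yes\n"
    "\n"
    "Please do not keep this.\n"
)
print(archiving_declined(article))
```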
Anyway... come on, you are just playing devil's [rria] advocate here! These are not precedents that most at /. really want to set, at least not in quite this way.
I am using 1.4 on RH9 and have had no problems reading /., but something has seemed funny. I think it's the fonts. Maybe I have seen weird fonts? If this is the case and it's bugging you, then set up a CSS or change the default serif/sans-serif fonts. The defaults might have changed in the build. /. is also one of the few sites that seems to never make Firebird 0.6 freeze. I often use Firebird as well as Mozilla. I sometimes wonder if my bird is compatible with my RH9 version of glibc.
Many areas of the law are purposefully made to protect commercial interests and "stimulate the economy." Politicians are still at it! That this also seems to be in corporate interests is no accident, since a great deal of commerce is conducted by corporations.
Consider some alternatives.
The political problem is a lack of clout. Watch out for the researcher lobby! They are armed with regression and have infinitely many means! So I argue that, although it's not what I would like, the law is clearly on Google's side. Posting to the public Usenet groups clearly implies the expectation that the 'works' will be freely copied and archived by users and third parties, since the architecture of the technology relies on copying. Archiving was even a practice of individual users, and definitely of sys admins.
Usenet is a means of communicating with those in the public at large who have the interest to read it. For it to be an effective means of communicating with the public, the information is not only carried on the public networks (like a phone conversation) but clearly remains accessible over time to the general public and to private users individually.
As I said in a reply to another comment, no law allows the publisher to require copies to be burned on a certain date by all copy holders, even by libraries, with the sole exception of the author's manuscript. Finally, all copyrights do eventually expire, with the intention of enriching the public domain.
The full database (and perhaps some missing posts, a la the Google deletion policy) could theoretically be reconstructed from other sources. The database is a convenience that they offer to the public.
If a competitor compiled an equally useful database, then they could in effect create the same service. To do so without directly copying the database under discussion would be a fairly large and costly undertaking. Copying large, useful portions directly from their website would be much cheaper and could easily be scripted. The terms of the ToS simply protect the service (value added) that was created by Deja by creating a "home" in the new protocol, the "web". It not only collected the posts in one place but also transferred the information formerly on NNTP to HTTP, making it more accessible to many users without harming or taking away value from those who use traditional clients.
Since folks visit their database pages and click their ads, stimulating commerce, the laws are serving their intended goals, but perhaps at the expense of the public interest. (That is assuming that your study is in that interest.) Good luck with your study.
He provided much of the early USENET archives. See his page here for details of his contribution and email address.