Obtaining Archives of USENET?
Academic Researcher asks: "Google took over Deja and built the mammoth Google Groups, which is a near-complete archive of USENET dating back to the early 1980s. For an academic project, I need to analyse a lot of USENET data. The Terms of Service for Google Groups will not allow automated access (and even if they did, I'd have to write a bunch of tools and reverse engineer it all). Inquiries about purchasing copies of the archive have gone unanswered. No one else seems to have such an archive. Apart from this meaning that the world has one single USENET archive (I hope they have backup floppies!), how can I obtain historical data for research purposes? I'd happily pay money for DVDs of archival material if they were available. Can anyone help?"
Technically, Google did not buy Deja; they just bought a bunch of data. Deja was purchased by Half.com, part of eBay, and is now firmly merged into eBay (with some of the folks over in PayPal).
http://groups.google.com
Search: *
Then save to file.
DONE!
For every annoying Gentoo user, there are three even more annoying anti-Gentoo crybabies. Take Yosh from #Gimp for example.
You can't automate a search of Google Groups through the web, but you might be able to work out a deal with Google's corporate offices. There can't be many other people with the data to talk with.
You're going to want to talk with Jim Gray of this article:
http://slashdot.org/article.pl?sid=03/0
Couldn't you write a script to first get a list of all the newsgroups on a server (here I am assuming that the server I use, news.tbaytel.net, is in sync with the rest of Usenet), then get all the headers in each, then get all the messages in each? If you use the server provided by your ISP over a fast connection, you can easily max out your connection, as it is more or less a direct line.
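The three steps above (list the groups, then headers, then messages) map onto ordinary NNTP commands. Here is a rough sketch of the planning half only, assuming the server answers `LIST` with the usual `group high low status` lines and supports the common `XOVER` header command. It opens no connection, the names `GroupInfo`, `parse_active_line` and `plan_group_fetch` are made up for illustration, and the sample lines are invented.

```python
from typing import List, NamedTuple


class GroupInfo(NamedTuple):
    name: str
    high: int    # highest article number the server reports
    low: int     # lowest article number the server reports
    status: str  # posting status flag, e.g. "y"


def parse_active_line(line: str) -> GroupInfo:
    """Parse one line of a LIST response: 'group high low status'."""
    name, high, low, status = line.split()
    return GroupInfo(name, int(high), int(low), status)


def plan_group_fetch(info: GroupInfo) -> List[str]:
    """Return the commands to select a group and request its headers.

    Bodies would follow with one 'ARTICLE <n>' per article number; they are
    left out here because that list can be very long.
    """
    if info.high < info.low:
        # A high mark below the low mark usually means the group is empty.
        return [f"GROUP {info.name}"]
    return [f"GROUP {info.name}", f"XOVER {info.low}-{info.high}"]


if __name__ == "__main__":
    # Invented sample lines, not taken from a real server.
    sample = [
        "comp.lang.c 0000012345 0000012000 y",
        "comp.empty 0000000000 0000000001 y",
    ]
    for line in sample:
        for command in plan_group_fetch(parse_active_line(line)):
            print(command)
```

Keep in mind this only reaches whatever the server still retains, so on its own it gets you recent posts, not the 1980s.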
History will be kind to me, for I intend to write it - Sir Winston Churchill
In any event, the massive number of binary posts (porn, movies, warez, etc.) on Usenet in the past few years would make the "full" archive of those years number in the tens of thousands of CDs. A "full" Usenet feed passed the bandwidth of a T1 in about 1998, IIRC.
Some individuals archive individual Usenet groups, or the group is gatewayed back and forth to a mailing list that is archived. This, IMHO, is more appropriately manageable for research.
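To put the T1 figure in perspective, here is a back-of-the-envelope sketch. The nominal T1 rate of 1.544 Mbit/s is standard; the 700 MB CD size is my assumption, and real feeds have protocol overhead that this ignores.

```python
T1_BITS_PER_SECOND = 1_544_000   # nominal T1 line rate
SECONDS_PER_DAY = 24 * 60 * 60
CD_BYTES = 700 * 10**6           # assumption: 700 MB per CD


def bytes_per_day(bits_per_second: int) -> int:
    """Bytes moved in one day by a link running flat out at the given rate."""
    return bits_per_second * SECONDS_PER_DAY // 8


def cds_per_day(bits_per_second: int, cd_bytes: int = CD_BYTES) -> float:
    """How many CDs one day of a saturated link would fill."""
    return bytes_per_day(bits_per_second) / cd_bytes


if __name__ == "__main__":
    print(bytes_per_day(T1_BITS_PER_SECOND))          # 16675200000
    print(round(cds_per_day(T1_BITS_PER_SECOND), 1))  # 23.8
```

So a feed that merely saturates a T1 is already about 24 CDs a day, or several thousand a year, before the binaries really took off.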
The announcement of Google Groups with a 20 year archive acknowledges several sources for the broad timeframe of the archive (as well as the donors to the preceding Dejanews archive); you might want to check out their specific work.
Why can't he start archiving now, and then work on his project, testing it on his data as he goes? He might have enough by the time he is done.
Also, why does it have to be Usenet specifically? If he needs a long time period but not a great breadth of groups, he may be able to find mailing list archives that are sufficient. The 9fans and lkml archives both go back quite a ways, just to mention a few off the top of my head. If you go down to your university's sysadmin department and ask them what's the oldest continuously active mailing list they have archives for, you may strike gold.
In fact, your university's sys admins may have usenet archives also.
You may find this helpful in scraping web-served mailing list archives into a form you can use:
http://www.linpro.no/lwp/
Also, there is a Perl script out there that will download the archives of a Yahoo group into an mbox.
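Whatever does the scraping, the mbox end of it is simple. Below is a minimal sketch using Python's standard `mailbox` module instead of Perl; it is not the script mentioned above. The function name `append_to_mbox` and the dict keys are hypothetical, and it assumes you have already pulled sender, subject, date and body out of the web pages.

```python
import mailbox
from email.message import EmailMessage


def append_to_mbox(path, posts):
    """Append posts to an mbox file, creating the file if needed.

    Each post is a dict with 'from', 'subject', 'date' and 'body' keys
    (a made-up shape for this sketch). No file locking is done, so this
    assumes nothing else is writing to the same mbox.
    """
    box = mailbox.mbox(path)
    try:
        for post in posts:
            msg = EmailMessage()
            msg["From"] = post["from"]
            msg["Subject"] = post["subject"]
            msg["Date"] = post["date"]
            msg.set_content(post["body"])
            box.add(msg)
        box.flush()
    finally:
        box.close()
```

Once everything is in mbox form, most mail tools and scripting languages can read it without further conversion.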
That implies that he emailed them. Contacting via phone sort of forces an answer.
And when you "force an answer," the answer is generally "no."
CEE5210S The signal SIGHUP was received.
Did you just send off emails, or did you actually pick up the phone and try to talk to somebody? You might have much better luck with the latter; Google must get a billion random emails a day.
> For an academic project, I need to analyse a
> lot of USENET data.
Are you SURE you're prepared for that much data? And the costs for just storing that much data? Not to mention manipulating it? I think you'd need an academic project just to analyze that factor alone.
I sent them an e-mail reporting a minor problem and it took them months to get back to me.
Tim
Omnia vestra castrorum habetur nobis.
Seriously, I'm sure that a number of intelligence agencies have archives of this stuff. I believe the FBI used to get a USENET feed on 9-track 1/2" tape.
Mea navis aericumbens anguillis abundat
You REALLY just want to get all that hot porno and wares and MP3z!!! We can see right through your story! I bet it's "seminal research" too.....
Ron Paul 2012
how can I obtain historical data for research purposes ?
Translation:
How can I build the largest pr0n collection ever?
---
I support spreading santorum
But if you follow it up with masses of "C'moooooooon" they generally just give in.
Dejanews had a rival for a year or so. They got bought out long ago. Sorry, I can't remember their name. But if Google really won't co-operate, look up those other guys.
Download the free Google API (for non-commercial use); it would make your scripting/programming much easier...
Seriously, this should actually work. You do not automatically waive all rights just by posting to usenet. I don't think disk space or bandwidth is the only thing stopping them from archiving binaries. If they were not concerned about getting sued, wouldn't they at least archive all the popular binaries (i.e. porn and warez)? They could easily become the world's largest pay site ;-)
IAAAL - I am actually a lawyer
But at least he'd have an answer, which is more than he has now.
I'm the original poster of the question. Just to answer a few points raised by other people:
1. I am only interested in USENET data up to the early 1990s, and even then only for parts of comp.*. I know that I wouldn't be able to purchase/obtain selective parts of it, but I have quantified that 1980->199[0-5] is on the order of a couple of hundred gigabytes at most. This is not a large volume of data if obtained on DAT (I certainly have the facilities/funds for limited hardware/software), or if I could make use of whatever facilities store it at the moment, that would also be great (e.g. if I were able to come to an agreement to use a host account located at Google, where I could develop and run tools over the data [1]).
2. I have so far developed tools to complete part of this work, but found myself violating Google's ToS and having myself blocked from access to Google Groups. My tools simply scraped a large amount of data from a specific newsgroup (comp.something) and performed some automated analysis that I use to correlate with other findings, etc.
[1] There's an analogy here with obtaining time on a supercomputer for research work, though what I need is not a supercomputer, but superstorage.
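The post does not say what the automated analysis actually was, so the following is only an illustration of the kind of pass one might run over raw articles once they are in hand: counting posts per year from the `Date` header. The function name `posts_per_year` is hypothetical, and very old articles may carry date formats this parser rejects, in which case they are skipped.

```python
from collections import Counter
from email import message_from_string
from email.utils import parsedate_to_datetime


def posts_per_year(raw_articles):
    """Count articles per year using each article's Date header.

    raw_articles is an iterable of full article texts (headers, a blank
    line, then the body). Articles with a missing or unparseable Date
    header are skipped rather than guessed at.
    """
    counts = Counter()
    for raw in raw_articles:
        date_header = message_from_string(raw)["Date"]
        if date_header is None:
            continue
        try:
            counts[parsedate_to_datetime(date_header).year] += 1
        except (TypeError, ValueError):
            continue
    return counts
```

The same loop shape works for per-author or per-thread counts by reading `From` or `References` instead.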
I am surprised this Ask Slashdot question hasn't (really) been answered after this many hours. Well here is the answer.
Go to this link:
http://www.archive.org/web/researcher/proposal.ph
Create an account, log in, write a proposal, submit it, then wait.
Yes they have Usenet data, not just web data.
After I submitted a proposal this way, I had to bug them by email and eventually phone even to get the account set up. It's not that they aren't helpful, they are just busy with lots of projects.
But don't expect hand holding. You need to be comfortable operating in a Unix environment. And I don't know if the data can leave their servers; you might need to do all your processing using their machines.
"Inquiries about purchasing copies of the archive have gone unanswered"
So, yes, I guess this qualifies as a dumb question.
I think, for practical purposes, Google owns most of USENET now, including its complete history; there will probably not be any other institutions (well, maybe the NSA) that keep a history of USENET postings.
300 GB/day -- That's 2 new hard drives each day (Pricewatch.com: 160 GB for $114 x 2 = $228/day), plus the cost of the bandwidth, plus the cost of the RAID backup (I hope). Expensive, but well within reason for a large organization.
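The drive arithmetic above, spelled out so the numbers can be swapped for current prices. This is a sketch covering raw drives only; the 365-day year and the function names are my assumptions, and bandwidth and RAID redundancy are left out.

```python
import math


def drives_per_day(feed_gb_per_day: float, drive_gb: float) -> int:
    """Whole drives needed to hold one day of feed."""
    return math.ceil(feed_gb_per_day / drive_gb)


def daily_drive_cost(feed_gb_per_day: float, drive_gb: float, drive_price: float) -> float:
    """Cost of the drives bought for one day of feed."""
    return drives_per_day(feed_gb_per_day, drive_gb) * drive_price


if __name__ == "__main__":
    per_day = daily_drive_cost(300, 160, 114)
    print(per_day)        # 228
    print(per_day * 365)  # 83220, assuming the feed size stays constant
```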
To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
That's interesting... I'm pretty sure I never gave anyone permission to sell my thousands of copyrighted Usenet posts.
Now maybe somebody will notice the flaw in the "archives are great, you and your copyright can get stuffed" arguments. It's OK for Deja to sell my copyrighted material, and for Google to make it available but only under restrictive agreements? Not by me it's not.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Hey... Cunning thought... <ahem> DMCA... <ahem> subpoena... <ahem>
Given that Google may well not have a legal leg to stand on making any sort of money out of using posts where copyright is owned by the poster (no, I don't buy the "you've given implicit agreement" arguments for a second, and I'm betting they don't want to risk their whole business finding out whether a court does either) there's an interesting "deal" to be made here. :-)
news.cis.dfn.de (I believe that's it - you can find it mentioned a lot in the "free news servers" newsgroup) has retention in the discussion groups (at least the ones I've looked in) all the way back to (at least) December of last year.
But if you need years of research data, it seems google would be the place to go. If your needs are limited to only a few groups, many of them have archives. The archives, for example, for some of the rec.audio groups go back many, many years and are fully indexed and stored in digests.
I also don't get the complaining about terms. These people have a huge archive of Usenet data; there's nothing to stop you from building one of your own except the fact it would cost a damn fortune to get set up and organized. If you have that kinda money, quit complaining and do it; if you don't, quit complaining about someone else (who does) taking the initiative. It would seem Google has the only reasonably complete Usenet archive - would you rather there were none at all?
Actually, google is pretty good about removing any pages you don't want listed. I'll bet they'd remove usenet posts too if you could prove you were the poster.
The problem with posting to Usenet, though, is that you're already giving them the right to redistribute it, as all Usenet servers propagate the content to other news servers. And any of those servers can be pay-only, and there's not a whole heck of a lot you can do about it. So Google happening to have a server with extremely good retention doesn't suddenly put them out of compliance. I'd have to check the posting policy on a few news servers to know for sure, though.
-Restil
Play with my webcams and lights here
What I find works is to get someone to repeat with you (to them) "can we have a [insert item here]?" over and over and over and over till they give in.
ask bart...
Free Webmail
The tag did not imply "do not print". Additionally, there is no precedent or tradition for it. No law has ever allowed a publisher to print a book under copyright and then require it to be burned on a certain date by all copy holders, even by libraries, with the sole exception of the author's manuscript; and remember that copyrights do eventually expire.
Your best posts are probably in basements and attics, in the cabinets under nightstands across the land, on yellowed, jagged, torn scraps of fan-folded paper with some of the little perforated tracks still attached.
Our very discussion of the X-No-Archive header, btw, fruitfully negates the assertion that archiving was not supported by Usenet protocols etc. Finally, the copyright holder must enforce copyright through civil litigation... I think the damage assessment would be in mills, if not zilch.
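For anyone building their own archive, honoring that header is a few lines. Here is a sketch assuming the common convention that a header of `X-No-Archive: yes` means "do not archive". The header is a convention, not a formal standard, the function name `may_archive` is made up, and how any particular archive actually matches the header is not something I know.

```python
from email import message_from_string


def may_archive(raw_article: str) -> bool:
    """Return False if the article asks not to be archived.

    Header names are matched case-insensitively by the email module; the
    value is compared to 'yes' after trimming whitespace and lowercasing.
    """
    value = message_from_string(raw_article).get("X-No-Archive", "")
    return value.strip().lower() != "yes"
```

Run it as a filter before anything is written to disk, so opted-out posts never enter the archive in the first place.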
Anyway... come on, you are just playing devil's [rria] advocate here! These are not precedents that most at /. really want to set, at least not in quite this way.
I am using 1.4 on RH9 and have had no problems reading /., but something has seemed funny. I think it's the fonts. Maybe I have seen weird fonts? If this is the case and it's bugging you, then set up a CSS or change the default serif/sans-serif fonts. The defaults might have changed in the build. /. is also one of the few sites that seems to never make Firebird 0.6 freeze. I often use Firebird as well as Mozilla. I sometimes wonder if my bird is compatible with my RH9 version of glibc.
Many areas of the law are purposefully made to protect commercial interests and "stimulate the economy." Politicians are still at it! That this also seems to be in corporate interests is no accident, since a great deal of commerce is conducted by corporations.
Consider some alternatives.
The political problem is a lack of clout. Watch out for the researcher lobby! They are armed with regression and have infinitely many means! So I argue that, although it's not what I would like, the law is clearly on Google's side. Posting to public Usenet groups clearly implies the expectation that third parties will freely copy and archive the 'works', since the architecture of the technology relies on copying. Archiving was even a practice of individual users, and definitely of sysadmins.
Usenet is a means of communicating with those in the public at large who have the interest to read it. For it to be an effective means of communicating with the public, the information is not only carried on the public networks (like a phone conversation) but clearly remains accessible intertemporally to the general public, and to private users individually.
As I said in a reply to another comment, no law allows the publisher to require copies to be burned on a certain date by all copy holders, even by libraries, with the sole exception of the author's manuscript. Finally, all copyrights do eventually expire, with the intention of enriching the public domain.
The full database (and perhaps some missing posts, a la the Google deletion policy) could theoretically be reconstructed from other sources. The database is a convenience that they offer to the public.
If a competitor compiled an equally useful database, then they could in effect create the same service. To do so without directly copying the database under discussion would be a fairly large and costly undertaking. Copying large useful portions directly from their website would be much cheaper and could easily be scripted. The terms of the ToS simply protect the service (value added) that Deja created by building a "home" in the new protocol, the "web". It not only collected the posts in one place but also transferred the information formerly on NNTP to HTTP, making it more accessible to many users without harming or taking away value from those who use traditional clients.
Since folks visit their database pages and click their ads, stimulating commerce, the laws are serving their intended goals, but perhaps at the expense of the public interest. (That is assuming that your study is in that interest.) Good luck with your study.
no, I don't buy the "you've given implicit agreement" arguments for a second
So do you think people should be able to sue any Usenet server in the world for copying their posts without explicit permission, or just Google? If the answer is "just Google", then please explain what attributes of Google take away from them the permission that every other news server has.
He provided much of the early USENET archives. See his page here for details of his contribution and email address.
Essentially, I believe people should have a right to take action against anyone who has used their posts in a way other than they might reasonably have expected when they posted them to Usenet. They have not consented to them being used in any other way.
This is clearly a grey area. Typically posts propagate within a few days, and then remain on servers for 2-4 weeks before they expire. Someone keeping them on a server a bit longer is fine, because there's no explicit standard that says everything expires after max 4 weeks AFAIK.
Someone who's kept them for years, and now offers them in a searchable archive, however, is (IMHO) operating well outside the expectations of normal Usenet practice. Someone who sells a database full of such posts for profit is likewise doing something quite different to what the original posters might have reasonably expected.