Slashdot Mirror


Obtaining Archives of USENET?

Academic Researcher asks: "Google took over Deja and built the mammoth Google Groups, which is a near-complete archive of USENET dating back to the early 1980s. For an academic project, I need to analyse a lot of USENET data. The Terms of Service for Google Groups will not allow automated access (and even so, I'd have to write a bunch of tools and reverse engineer it all). Inquiries about purchasing copies of the archive have gone unanswered. No one else seems to have such an archive. Apart from this meaning that the world has one single USENET archive (I hope they have backup floppies!), how can I obtain historical data for research purposes? I'd happily pay money for DVDs of archival material if they were available. Can anyone help?"

86 comments

  1. Google just bought the data by sartin · · Score: 4, Informative


    Technically, Google did not buy Deja, they just bought a bunch of data. Deja was purchased by buy.com, part of eBay and is now firmly merged into eBay (with some of the folks over in paypal).

    1. Re:Google just bought the data by /dev/trash · · Score: 3, Informative

      Actually, according to this: http://news.com.com/2100-1023-252449.html?legacy=cnet

      eBay bought the consumer website and Google bought everything else, including the deja.com name.

  2. Dumb question by Anonymous Coward · · Score: 0, Redundant
    1. Re:Dumb question by Anonymous Coward · · Score: 1, Funny

      "Inquiries about purchasing copies of the archive have gone unanswered"

      So, yes, I guess this qualifies as a dumb question.

  3. Sure! by Sevn · · Score: 5, Funny

    http://groups.google.com

    Search: *

    Then save to file.

    DONE!

    --
    For every annoying gentoo user, are three even more annoying anti-gentoo crybabies. Take Yosh from #Gimp for example.
  4. Google by rmohr02 · · Score: 3, Informative

    You can't automate a search of Google Groups through the web, but you might be able to work out a deal with Google's corporate offices. There can't be many other people with the data to talk with.

    1. Re:Google by Big+Sean+O · · Score: 4, Interesting

      If it's really for an academic project (say a master's thesis), you might want to direct your inquiry to Craig Silverstein. Since he's a grad student, it's likely he would be more interested in your project than the Google corporate types.

      Who knows, mebbe you can parlay the project into an internship into a real live job.

      --
      My father is a blogger.
  5. You understand how much data you're talking about? by rthille · · Score: 4, Interesting


    You're going to want to talk with Jim Gray of this article:
    http://slashdot.org/article.pl?sid=03/07/10/2352251

    --
    Awesome furniture, accessories and cabinetry in Santa Rosa, CA: http://humanity-home.com/
  6. it occurs to me by Vilim · · Score: 1

    Couldn't you write a script to first get a list of all the newsgroups on a server (here I am assuming that the server I use, news.tbaytel.net, is at convergence with the rest of Usenet), then get all the headers in each, then get all the messages in each? If you use the server provided by your ISP over a fast connection, you can easily max out your connection, as it is more or less a direct line.
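
    A rough sketch of the planning half of such a script, assuming the server answers the NNTP LIST ACTIVE command with lines of the form "group high low status". The sample lines and helper names below are made up for illustration; a real run would read the lines from the server and then request headers and articles group by group.

```python
from typing import NamedTuple


class GroupRange(NamedTuple):
    name: str
    low: int     # lowest article number the server still holds
    high: int    # highest article number the server holds
    status: str  # posting status flag, e.g. "y", "n" or "m"


def parse_active_line(line: str) -> GroupRange:
    """Parse one LIST ACTIVE line: 'group high low status'."""
    name, high, low, status = line.split()
    return GroupRange(name, int(low), int(high), status)


def estimated_count(group: GroupRange) -> int:
    """Upper bound on held articles; numbering may have gaps.

    A high mark below the low mark is how servers report an empty group.
    """
    if group.high < group.low:
        return 0
    return group.high - group.low + 1


# Hypothetical sample lines, shaped like a LIST ACTIVE response body.
SAMPLE = [
    "comp.os.linux.misc 0000534277 0000500001 y",
    "comp.empty.example 0000000000 0000000001 y",
]

groups = [parse_active_line(line) for line in SAMPLE]
plan = {g.name: estimated_count(g) for g in groups}
print(plan)
```

    The count only says how much the server holds right now, so it measures the server's retention rather than the history of the group.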

    --
    History will be kind to me, for I intend to write it - Sir Winston Churchill
    1. Re:it occurs to me by damien_kane · · Score: 2, Informative

      The problem with this is data retention.
      I don't personally know of any USENET servers that have more than 10 days' worth of posts.
      Some of the groups (like alt.binaries.*) will only have 1-3 days retention.

      Nope, this guy needs an archive, not a current snapshot.

    2. Re:it occurs to me by Drakin · · Score: 1

      The problem with this approach is the server's data retention policy. You're not even going to come close to having an archive that extends back a year, let alone longer.

      I know there are some servers where you're lucky if a post is still on the server a week later.

    3. Re:it occurs to me by grondu · · Score: 1

      I don't personally know of any USENET servers that have more than 10 days' worth of posts.
      Some of the groups (like alt.binaries.*) will only have 1-3 days retention.


      You need a better news server. You're never gonna get a huge collection of whacking material from a news server like that. From Supernews:

      alt.binaries.multimedia.erotica has 190466 articles, and 23.6 days of retention.

      And non-pr0n:

      comp.os.linux.misc has 34277 articles, and 281.5 days of retention.

      This hasn't always been the case for Supernews; they recently upgraded storage. You can check retention for any group at Supernews.

      Not a Supernews employee, just a customer.

      --

      I'm the urban spaceman babe, but here comes the twist... I don't exist

  7. RTFA by Anonymous Coward · · Score: 0

    "Inquiries about purchasing copies of the archive have gone unanswered." I assume he wasn't just searching for "let me buy your data" and actually did contact the corporate offices.

    1. Re:RTFA by rmohr02 · · Score: 1

      That implies that he emailed them. Contacting via phone sort of forces an answer.

    2. Re:RTFA by acceleriter · · Score: 1

      And when you "force an answer," the answer is generally "no."

      --

      CEE5210S The signal SIGHUP was received.

    3. Re:RTFA by __aafkqj3628 · · Score: 3, Funny

      But if you follow it up with masses of "C'moooooooon" they generally just give in.

    4. Re:RTFA by rmohr02 · · Score: 1

      But at least he'd have an answer, which is more than he has now.

    5. Re:RTFA by jon+doh! · · Score: 1

      what i find works is to get someone to repeat with you (to them) "can we have a [insert item here]?" over and over and over and over till they give in.

      ask bart...

  8. History of selling Usenet archives by shoppa · · Score: 5, Interesting
    In the early 1990s, a company in Vancouver BC proposed a monthly distribution of Usenet via CD. This sparked extensive discussions in the newsgroups (at the time almost exclusively dominated by academic people, of course) with a lot of resentment that some company was going to be making money by selling *their* posts. (Do a Google Groups search for "usenet on CD" to see some of these. They also mention Walnut Creek.)

    In any event the massive number of binary posts (porn, movies, warez, etc) on usenet in the past few years would make the "full" archive of the past few years number in the tens of thousands of CDs. A "full" usenet feed exceeded the bandwidth of a T1 around 1998, IIRC.

    Some individuals archive individual usenet groups, or the group is gatewayed back and forth to a mailing list that is archived. This IMHO is more appropriately manageable for research.

    The announcement of Google Groups with a 20 year archive acknowledges several sources for the broad timeframe of the archive (as well as the donors to the preceding Dejanews archive); you might want to check out their specific work.

    1. Re:History of selling Usenet archives by dougmc · · Score: 4, Interesting
      In any event the massive number of binary posts (porn, movies, warez, etc) on usenet in the past few years would make the "full" archive of the past few years number in the tens of thousands of CDs. A "full" usenet feed exceeded the bandwidth of a T1 around 1998, IIRC.
      NOBODY has a full archive of the posts made to alt.binaries.* -- I doubt that even the NSA has that much storage to spare for it.

      I believe that a full feed is around 300 GB per day now, with 99+% of that being alt.binaries.*.
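
      For scale, a back-of-the-envelope comparison of those two figures, assuming the nominal T1 line rate of 1.544 Mbit/s and reading "300 GB" as decimal gigabytes. Both figures are the posters' recollections, so treat the output as an order-of-magnitude estimate only.

```python
T1_BITS_PER_SECOND = 1_544_000    # nominal T1 line rate
SECONDS_PER_DAY = 86_400
FEED_BYTES_PER_DAY = 300 * 10**9  # the ~300 GB/day estimate, decimal GB


def bytes_per_day(bits_per_second: float) -> float:
    """Bytes a link can carry in one day if kept fully saturated."""
    return bits_per_second / 8 * SECONDS_PER_DAY


def required_mbit_per_second(daily_bytes: float) -> float:
    """Sustained rate in Mbit/s needed to move daily_bytes in one day."""
    return daily_bytes * 8 / SECONDS_PER_DAY / 1e6


t1_gb_per_day = bytes_per_day(T1_BITS_PER_SECOND) / 1e9
feed_mbit = required_mbit_per_second(FEED_BYTES_PER_DAY)
t1_lines_needed = feed_mbit * 1e6 / T1_BITS_PER_SECOND

print(f"A saturated T1 carries about {t1_gb_per_day:.1f} GB/day")
print(f"300 GB/day needs about {feed_mbit:.1f} Mbit/s sustained")
print(f"That is roughly {t1_lines_needed:.0f} T1 lines")
```

      On these assumptions a T1 tops out near 17 GB/day, so a 300 GB/day feed is on the order of eighteen T1s running flat out.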

    2. Re:History of selling Usenet archives by dargaud · · Score: 1
      > They also mention Walnut Creek

      Hey, I know those guys... They took some of my copyrighted wallpaper images, removed my logos, and resold them on CDs. Assholes.

      --
      Non-Linux Penguins ?
  9. sequential snapshots are an archive by RGRistroph · · Score: 4, Informative

    Why can't he start archiving now, and then work on his project, testing it on his data as he goes? He might have enough by the time he is done.

    Also, why does it have to be usenet specifically? If he needs a long time period but not a great breadth of groups, he may be able to find mailing list archives that are sufficient. The 9fans and lkml both go back quite a ways, just to mention a few off the top of my head. If you go down to your university's sysadmin department and ask them what's the oldest continuously active mailing list they have archives for, you may strike gold.

    In fact, your university's sys admins may have usenet archives also.

    You may find this helpful in scraping web-served mailing list archives into a form you can use:

    http://www.linpro.no/lwp/

    Also there is a Perl script out there that will download the archives of a Yahoo group into an mbox.

    1. Re:sequential snapshots are an archive by Anonymous Coward · · Score: 0

      I already have automated tools that have allowed me to work from (a) some mailing list archives that I have located elsewhere on the net (b) limited "scraping" that I've already made from early fa.* and mod.* newsgroups. I found my IP# temporarily suspended from access to Google Groups after my scraping activities (by what looks like an automated bandwidth/usage monitor).

      Basically my tools (a) build thread/message relationships, (b) eliminate duplicates based on message identifiers, dates, header fields, etc, (c) build quantitative statistics based on word frequency, thread sizes, message densities, etc. This data is then used to hypothesise / justify conclusions about the nature of development of protocols, computing, etc in the 1980s.
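
      A minimal sketch of what steps (a) to (c) might look like, not the poster's actual tools. It assumes articles are available as raw RFC 822-style text, deduplicates on Message-ID only, and takes the last entry of References as the parent; the sample articles and function names are invented for illustration.

```python
import re
from collections import Counter, defaultdict
from email import message_from_string

# Hypothetical sample articles; the third is a duplicate of the first.
RAW_ARTICLES = [
    "Message-ID: <1@site-a>\nSubject: TCP question\n\nIs TCP ready for the switch?\n",
    "Message-ID: <2@site-b>\nReferences: <1@site-a>\nSubject: Re: TCP question\n\nTCP is ready.\n",
    "Message-ID: <1@site-a>\nSubject: TCP question\n\nIs TCP ready for the switch?\n",
]


def load_unique(raw_articles):
    """(b) Keep the first copy of each Message-ID, drop later duplicates."""
    seen = {}
    for raw in raw_articles:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        if msg_id and msg_id not in seen:
            seen[msg_id] = msg
    return seen


def build_threads(articles):
    """(a) Map each parent Message-ID to the IDs of its direct replies."""
    children = defaultdict(list)
    for msg_id, msg in articles.items():
        refs = (msg.get("References") or "").split()
        parent = refs[-1] if refs else None
        if parent in articles:  # ignore parents missing from the corpus
            children[parent].append(msg_id)
    return dict(children)


def word_frequency(articles):
    """(c) Count lowercase word occurrences across all article bodies."""
    counts = Counter()
    for msg in articles.values():
        counts.update(re.findall(r"[a-z']+", msg.get_payload().lower()))
    return counts


articles = load_unique(RAW_ARTICLES)
threads = build_threads(articles)
freq = word_frequency(articles)
print(len(articles), threads, freq.most_common(3))
```

      Real 1980s articles would need more care than this, for example older header conventions and quoted text inflating the word counts.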

    2. Re:sequential snapshots are an archive by RGRistroph · · Score: 1

      It sounds very interesting. At one time I intended to do a thesis on statistical authorship identification using usenet as a test body.

      However, what would word frequency, etc. have to do with the nature of development of protocols and computing?

  10. Let me know when you get that copy... by Anonymous Coward · · Score: 0

    Lots and _lots_ of quality porn was available in those early years.

    I know! Hire me and I will take care of that section of the data (approximately 60%). :-p

  11. These inquiries... by Meowing · · Score: 1

    Did you just send off emails, or did you actually pick up the phone and try to talk to somebody? You might have much better luck with the latter; Google must get a billion random emails a day.

  12. I need to analyse a lot of USENET data... by AtariDatacenter · · Score: 4, Insightful

    > For an academic project, I need to analyse a
    > lot of USENET data.

    Are you SURE you're prepared for that much data? And the costs for just storing that much data? Not to mention manipulating it? I think you'd need an academic project just to analyze that factor alone.

  13. Yeah by TheOnlyCoolTim · · Score: 1

    I sent them an e-mail reporting a minor problem and it took them months to get back to me.

    Tim

    --
    Omnia vestra castrorum habetur nobis.
    1. Re:Yeah by __aafkqj3628 · · Score: 1

      Strange, I sent an email off and got a response within a couple of days.

      I suppose that if you sent it to the wrong department then it would take *much* longer.

    2. Re:Yeah by orthogonal · · Score: 4, Funny

      I suppose that if you sent it to the wrong department then it would take *much* longer.

      Well, it's not the department actually.

      What you need to do is fill in the X-Meta tag in your email (or if you use Microsoft email products, the <HEAD> <META> tags) with keywords describing your email. "Hot amine poon-tang" is often a useful meta tag.

      Then, you need to get a lot of people to link to your email (that is, refer to it in their email) with an X-References tag.

      You can do this just by getting it popular on a mailing list, but even more effective is to post it to a number of bloggers. Bloggers, being how they are (self-important but afraid of not being noticed), will endlessly refer to it, and will probably get into interminable "blog spats" about it -- well, to be honest, about something completely tangential to it, but what do you care? It pushes your "Mail-Rank" score up regardless.

      If you can't do that, try contacting "MailKing", a commercial service that sends out a lot of email, purportedly from unrelated individuals, which refers to your email. They'll make it look like your email is really important.

      When your "Mail-Rank" is high enough, Google will have no choice but to notice it and reply.

      Best of luck getting your email noticed!

  14. I'm looking for the same thing, maybe we can work by Anonymous Coward · · Score: 0

    together on this.

    I'm looking for archives back to 1992 of alt.binaries.pictures.erotica.*, alt.binaries.warez.*, and alt.sex.*

  15. Archives by Detritus · · Score: 1
    Try http://www.nsa.gov.

    Seriously, I'm sure that a number of intelligence agencies have archives of this stuff. I believe the FBI used to get a USENET feed on 9-track 1/2" tape.

    --
    Mea navis aericumbens anguillis abundat
  16. SURE..... by JDWTopGuy · · Score: 2, Funny

    You REALLY just want to get all that hot porno and wares and MP3z!!! We can see right through your story! I bet it's "seminal research" too.....

    --
    Ron Paul 2012
  17. Translation by truffle · · Score: 4, Funny


    how can I obtain historical data for research purposes ?


    Translation:

    How can I build the largest pr0n collection ever?

    --

    ---
    I support spreading santorum
    1. Re:Translation by rthille · · Score: 0, Flamebait

      Um, steal it from the Vatican?

      --
      Awesome furniture, accessories and cabinetry in Santa Rosa, CA: http://humanity-home.com/
    2. Re:Translation by Anonymous Coward · · Score: 0

      There's an entry at snopes.com about the myth of 'a big porn library at the Vatican.'

      There is no such thing, it's something that Kinsey came up with as a quip when he wasn't conducting research, i.e. forcibly masturbating young boys to orgasm to see how much they could endure....

    3. Re:Translation by RevMike · · Score: 1
      Assert(girl && geek && bi && cute && humble)

      That usage has been deprecated. Use AssertTrue instead. :)

  18. dejanews had a rival by geoswan · · Score: 1

    Dejanews had a rival for a year or so. They got bought out long ago. Sorry, I can't remember their name. But if Google really won't co-operate, look up those other guys.

  19. google api by krismon · · Score: 2, Informative

    Download the free Google API (for non-commercial use); it would make your scripting/programming much easier...

    1. Re:google api by darkpurpleblob · · Score: 1
      Download the free Google API (for non-commercial use); it would make your scripting/programming much easier...

      The Google API is not going to be of any use here. It cannot be used to access Google Groups.

  20. Hey, that's my copyrighted data... by Anonymous Coward · · Score: 0

    I've always felt that nobody should be allowed to ask for more money than to recoup direct costs for transferring data which others produced. I wasn't paid by Google or DejaNews for my intellectual property. Tell them they must cease distribution of unlicensed copies of my posts immediately if they refuse to provide them through an automated process to anyone who wants them.

    1. Re:Hey, that's my copyrighted data... by Arioch+of+Chaos · · Score: 3, Interesting

      Seriously, this should actually work. You do not automatically waive all rights just by posting to usenet. I don't think disk space or bandwidth is the only thing stopping them from archiving binaries. If they were not concerned about getting sued, wouldn't they at least archive all the popular binaries (i.e. porn and warez)? They could easily become the world's largest pay site ;-)

      --
      IAAAL - I am actually a lawyer ;-)
    2. Re:Hey, that's my copyrighted data... by Anonymous+Brave+Guy · · Score: 2, Interesting

      Hey... Cunning thought... <ahem> DMCA... <ahem> subpoena... <ahem>

      Given that Google may well not have a legal leg to stand on making any sort of money out of using posts where copyright is owned by the poster (no, I don't buy the "you've given implicit agreement" arguments for a second, and I'm betting they don't want to risk their whole business finding out whether a court does either) there's an interesting "deal" to be made here. :-)

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    3. Re:Hey, that's my copyrighted data... by Restil · · Score: 1

      Actually, Google is pretty good about removing any pages you don't want listed. I'll bet they'd remove usenet posts too if you could prove you were the poster.

      The problem with posting to usenet, though, is that you're already giving them the rights to redistribute it, as all usenet servers propagate the content to other news servers. And any of those servers can be pay only, and there's not a whole heck of a lot you can do about it. So to say that Google suddenly is out of compliance because they happen to have a server with extremely good retention doesn't change the facts much. I'd have to check the posting policy on a few news servers to know for sure though.

      -Restil

      --
      Play with my webcams and lights here
    4. Re:Hey, that's my copyrighted data... by Anonymous Coward · · Score: 0

      Google operates a webserver which distributes usenet posts and puts advertising around them. That is not what I gave implicit permission for when I wrote and posted my texts. The limit is a pay only Usenet service (that is the "nominal fee covering bandwidth costs" for using an automated access system). Google denies automated access. The idea here is that if Google wants to use the comments for profit, they must adhere to the Usenet paradigm.
      In my opinion the implicit permission also does not include extremely long retention times. Most everybody is aware of Google and other archives now, but in the early days the nature of Usenet was more like that of a vocal conversation in a room with a long echo, not that of a collaborative library. I for one have stopped posting to Usenet due to this change. I also mostly stopped writing personally identifiable messages on webpages which I have no control over.

    5. Re:Hey, that's my copyrighted data... by roystgnr · · Score: 1

      no, I don't buy the "you've given implicit agreement" arguments for a second

      So do you think people should be able to sue any Usenet server in the world for copying their posts without explicit permission, or just Google? If the answer is "just Google", then please explain what attributes of Google take away from them the permission that every other news server has.

    6. Re:Hey, that's my copyrighted data... by Anonymous+Brave+Guy · · Score: 1

      Essentially, I believe people should have a right to take action against anyone who has used their posts in a way other than they might reasonably have expected when they posted them to Usenet. They have not consented to them being used in any other way.

      This is clearly a grey area. Typically posts propagate within a few days, and then remain on servers for 2-4 weeks before they expire. Someone keeping them on a server a bit longer is fine, because there's no explicit standard that says everything expires after max 4 weeks AFAIK.

      Someone who's kept them for years, and now offers them in a searchable archive, however, is (IMHO) operating well outside the expectations of normal Usenet practice. Someone who sells a database full of such posts for profit is likewise doing something quite different to what the original posters might have reasonably expected.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  21. RemarQ by Anonymous Coward · · Score: 0

    You're thinking of RemarQ, which now has this remarQable homepage.

  22. Follow up by Anonymous Coward · · Score: 5, Informative

    I'm the original poster of the question. Just to answer a few points raised by other people:

    1. I am only interested in USENET data until the early 1990s, and even then only for parts of comp.*. I know that I wouldn't be able to purchase/obtain selective parts of it, but I have quantified that 1980->199[0-5] is on the order of a couple of hundred gigabytes max. This is not a large volume of data if obtained on DAT (I certainly have the facilities/funds for limited hardware/software), or if I could make use of whatever facilities store it at the moment that would also be great (e.g. if I were able to come to an agreement to use a host account located at Google, where I could develop and run tools over the data [1]).

    2. I have so far developed tools to complete part of this work, but found myself violating Google's ToS and having myself blocked from access to Google Groups. My tools simply scraped a large amount of data from a specific newsgroup (comp.something) and performed some automated analysis that I use to correlate with other findings, etc.

    [1] There's an analogy here with obtaining time on a supercomputer for research work, though what I need is not a supercomputer, but superstorage.

    1. Re:Follow up by Anonymous Coward · · Score: 0

      As someone else suggested:
      1) Call them
      2) Send them a letter on university stationery.

      Google employs academic types and you might find someone sympathetic.

  23. Here is what you are looking for. by Anonymous+Cowdog · · Score: 5, Informative

    I am surprised this Ask Slashdot question hasn't (really) been answered after this many hours. Well here is the answer.

    Go to this link:

    http://www.archive.org/web/researcher/proposal.php

    Create an account, log in, write a proposal, submit it, then wait.

    Yes they have Usenet data, not just web data.

    After I submitted a proposal this way, I had to bug them by email and eventually phone even to get the account set up. It's not that they aren't helpful, they are just busy with lots of projects.

    But don't expect hand holding. You need to be comfortable operating in a Unix environment. And I don't know if the data can leave their servers; you might need to do all your processing using their machines.

    1. Re:Here is what you are looking for. by Anonymous Coward · · Score: 0

      This is exactly what I need from Google. Unfortunately the Web Archive only contains USENET back to 1996 and I'm really only concerned with data before 1995/6 (unless the work expands into something else). I'll suggest that Google should offer something like this.

  24. USENET is Google now by 73939133 · · Score: 1

    I think, for practical purposes, Google owns most of USENET now, including its complete history; there will probably not be any other institutions (well, maybe the NSA) that keep a history of USENET postings.

    1. Re:USENET is Google now by SpaceLifeForm · · Score: 1

      I don't believe Google owns Usenet in any way. I mean, Usenet is NNTP, a P2P protocol, that is actually a method to distribute the postings made to a Usenet server. As to those postings that Google has collected and archived, each individual posting is really copyright of the poster (if not stolen), and I certainly don't recall receiving any compensation from Google for having the use of my copyrighted postings. So, IMHO, Google should provide access to all Usenet postings in the archive, with the access method being programmatic and non-Web based.

      --
      You are being MICROattacked, from various angles, in a SOFT manner.
    2. Re:USENET is Google now by 73939133 · · Score: 1

      Of course, Google doesn't own USENET legally; in fact, I think Google's use of USENET is of questionable legality, since postings traditionally have been made with expiration dates in mind and Google's use far exceeds the implicit license posters to USENET have given people for using their postings.

      What I'm saying is that the institution of USENET these days is to a large degree controlled by Google; Google has changed the way people post, Google archives everything, and most people, looking for an article, will go to Google. I would argue that the P2P aspects of USENET are related to Google these days in the same way the P2P aspects of Napster were related to Napster.

      I stopped using USENET after the Google takeover; you may want to do the same.

    3. Re:USENET is Google now by Yottabyte84 · · Score: 1

      Use the X-No-Archive header, and they won't snarf your posts.

    4. Re:USENET is Google now by 73939133 · · Score: 1

      Use the X-No-Archive header, and they won't snarf your posts.

      They'll just snarf all the responses with bits-and-pieces quoted out of context; that's even worse.

      Besides, Google has snarfed and republished, without permission, thousands of my posts from before X-No-Archive even existed.

    5. Re:USENET is Google now by mink · · Score: 1

      That's the fault of people's newsreaders. They should be set to keep the noarchive bit on reply.
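
      A hedged sketch of both halves of this sub-thread: an archiver that honours the header, and a newsreader that carries it over to replies. It assumes the usual convention of "X-No-Archive: yes"; the function names and the sample article are invented, and nothing here is claimed to be how Google or any particular newsreader behaves.

```python
from email import message_from_string
from email.message import Message


def wants_no_archive(msg: Message) -> bool:
    """True if the article carries 'X-No-Archive: yes' (case-insensitive)."""
    return (msg.get("X-No-Archive") or "").strip().lower() == "yes"


def make_reply(original: Message, body: str) -> Message:
    """Build a reply that inherits the original's no-archive request."""
    reply = Message()
    reply["Subject"] = "Re: " + original.get("Subject", "")
    reply["References"] = original.get("Message-ID", "")
    if wants_no_archive(original):
        reply["X-No-Archive"] = "yes"
    reply.set_payload(body)
    return reply


# Hypothetical article that asks not to be archived.
original = message_from_string(
    "Message-ID: <42@example.invalid>\n"
    "Subject: Retention\n"
    "X-No-Archive: Yes\n"
    "\n"
    "Please do not keep this.\n"
)
reply = make_reply(original, "> Please do not keep this.\nUnderstood.\n")

# An archiver honouring the header would skip both articles.
to_archive = [m for m in (original, reply) if not wants_no_archive(m)]
print(len(to_archive))
```

      Inheriting the header only helps when the replying newsreader cooperates, which is exactly the objection raised above about quoted fragments.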

      --
      Well I've wrestled with reality for thirty five years doctor, and I'm happy to say I finally won out over it.
  25. I disagree by Raul654 · · Score: 1

    300 GB/day -- That's 2 new hard drives each day (Pricewatch.com: 160 GB for $114 x 2 = $228/day), plus the cost of the bandwidth, plus the cost of the RAID backup (I hope). Expensive, but well within reason for a large organization.
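
    The arithmetic behind those numbers, spelled out as a quick sketch. The feed size and drive price are the figures quoted in this thread, not measurements, and the yearly total depends on whether a terabyte is counted as 1000 GB or 1024 GB.

```python
import math

FEED_GB_PER_DAY = 300   # estimated full feed, from this thread
DRIVE_GB = 160          # drive size quoted above
DRIVE_PRICE_USD = 114   # Pricewatch price quoted above
DAYS_PER_YEAR = 365

# Whole drives are bought, so round up.
drives_per_day = math.ceil(FEED_GB_PER_DAY / DRIVE_GB)
cost_per_day = drives_per_day * DRIVE_PRICE_USD
drives_per_year = drives_per_day * DAYS_PER_YEAR

gb_per_year = FEED_GB_PER_DAY * DAYS_PER_YEAR
tb_per_year_decimal = gb_per_year / 1000   # 1 TB = 1000 GB
tb_per_year_binary = gb_per_year / 1024    # 1 TB = 1024 GB

print(drives_per_day, cost_per_day, drives_per_year)
print(f"{tb_per_year_decimal:.1f} TB/year (decimal), "
      f"{tb_per_year_binary:.1f} TB/year (binary)")
```

    The 1024-based figure lines up with the "106 TB/year" quoted further down the thread, and 730 drives a year matches the reply below.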

    --


    To make laws that man cannot, and will not obey, serves to bring all law into contempt.
    --E.C. Stanton
    1. Re:I disagree by rpresser · · Score: 2, Funny

      OK, so say we keep this archive for a year. That's 730 x 160 GB hard drives. Forget internet bandwidth; forget LAN bandwidth; where the fuck are you going to get enough hard drive controller bandwidth to be able to search such a monster?

      "OK, archive, give me the md5 hashes of every article posted between the hours of 10:00 and 11:00 (except Wednesdays) during 2002. Don't forget the porn. Count how many of the hashes contain both the hex strings 'DEAD' and 'BEEF'."

      "Hmm, I'll have to get back to you ... I should have an answer by 2012."

    2. Re:I disagree by Anonymous Coward · · Score: 0

      There are companies dealing with terabytes of new data daily; my employer (a leading tape manufacturer) was asked for advice on how to back that up. From what I gathered, it's mostly useful for places like NOAA or satellite stuff.

    3. Re:I disagree by dougmc · · Score: 1
      Hmm, I'll have to get back to you ... I should have an answer by 2012.
      Actually, if they were storing md5sum, header information etc. into a database as the items came in (which is exactly what the NSA would do if they were tracking Usenet like this), this sort of request could be satisfied with simple SQL in a few seconds. The database wouldn't even be that big (not compared to some corporate databases out there.) But the actual images/movies/warez/etc., that's another matter.

      I'm not saying it's not possible to have this much data archived somewhere. I just don't think anybody has done it -- it's just not that useful. 106 TB/year, mostly porn and warez. And if they do have it, they're not likely to share it with you.

      Of course, the guy who originally asked about this probably doesn't need all the binary data.
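
      A small sketch of the metadata-index idea, using an in-memory SQLite table as a stand-in for whatever database such an archive would really use. The schema, the sample rows and the `record` helper are all assumptions for illustration; only the hash and a few header fields are stored, never the article body.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE articles (
           message_id TEXT PRIMARY KEY,
           newsgroup  TEXT,
           posted_at  TEXT,   -- 'YYYY-MM-DD HH:MM:SS'
           md5        TEXT
       )"""
)


def record(message_id, newsgroup, posted_at, body: bytes):
    """Store header metadata plus the MD5 of the body as an article arrives."""
    digest = hashlib.md5(body).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?)",
        (message_id, newsgroup, posted_at, digest),
    )


# Hypothetical sample rows: a Tuesday, a Wednesday, and an off-hour post.
record("<1@a>", "comp.misc", "2002-03-05 10:15:00", b"first body")
record("<2@b>", "comp.misc", "2002-03-06 10:30:00", b"second body")
record("<3@c>", "comp.misc", "2002-03-05 12:00:00", b"third body")

# Posted 10:00-10:59 during 2002, excluding Wednesdays (%w: 0 = Sunday).
QUERY = """
    SELECT md5 FROM articles
    WHERE strftime('%Y', posted_at) = '2002'
      AND strftime('%H', posted_at) = '10'
      AND strftime('%w', posted_at) != '3'
"""
rows = conn.execute(QUERY).fetchall()

# Count hashes containing both hex strings, as in the challenge above.
both = sum(1 for (h,) in rows if "dead" in h and "beef" in h)
print(len(rows), both)
```

      This only answers questions about fields captured at ingest time; anything that needs the article bodies still means reading the full archive.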

    4. Re:I disagree by Raul654 · · Score: 4, Funny

      It's just not that useful. 106 TB/year, mostly porn and warez.

      I don't see how you can reconcile those two sentences.

      --


      To make laws that man cannot, and will not obey, serves to bring all law into contempt.
      --E.C. Stanton
    5. Re:I disagree by rpresser · · Score: 1
      Actually, if they were storing md5sum, header information etc. into a database as the items came in (which is exactly what the NSA would do if they were tracking Usenet like this),


      You are of course right. Change MD5 Sum to Number of times any of the strings "AB", "CD", or "EF" appear in the body of the article and it meets my original claim of impossibility.
    6. Re:I disagree by Fareq · · Score: 1

      You are of course right. Change MD5 Sum to Number of times any of the strings "AB", "CD", or "EF" appear in the body of the article and it meets my original claim of impossibility.

      and then realize that, if you are looking for information in/about the binaries, you have to assemble the many pieces of the binary, and then decode them.

    7. Re:I disagree by bill_mcgonigle · · Score: 1

      You are of course right. Change MD5 Sum to Number of times any of the strings "AB", "CD", or "EF" appear in the body of the article and it meets my original claim of impossibility.

      Are there more usenet posts than web pages? I only ask because Google manages to index over three billion pages in the manner you refer to.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    8. Re:I disagree by Anonymous Coward · · Score: 0

      Kibbo

  26. There's no such thing as a dumb question. by Anonymous Coward · · Score: 0

    Your question was just FUCKING STUPID!!!!!!

  27. That's interesting... by Anonymous+Brave+Guy · · Score: 2, Insightful
    Technically, Google did not buy Deja, they just bought a bunch of data.

    That's interesting... I'm pretty sure I never gave anyone permission to sell my thousands of copyrighted Usenet posts.

    Now maybe somebody will notice the flaw in the "archives are great, you and your copyright can get stuffed" arguments. It's OK for Deja to sell my copyrighted material, and for Google to make it available but only under restrictive agreements? Not by me it's not.

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    1. Re:That's interesting... by Anonymous Coward · · Score: 1, Informative

      Actually, if you are at all genuinely upset by this, Google lets you remove your postings from their archive.

      http://groups.google.com/googlegroups/help.html#9

    2. Re:That's interesting... by Anonymous+Brave+Guy · · Score: 1
      Actually, if you are at all genuinely upset by this, Google lets you remove your postings from their archive.

      Sure, if you jump through their hoops and do it for each of the thousands of posts you might have made that are stored in their archive.

      This isn't really the point. Personally, I don't dramatically object to having my Usenet posts archived, since I don't post anything there I wouldn't want others to see in future. However, I do object to having my material sold for profit, or to having restrictions imposed on how it may be used by a commercial organisation that has copied it. Google is on shaky legal ground even having it, and their apparent heavy-handedness with the researcher in this case upsets me.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    3. Re:That's interesting... by deanj · · Score: 1
      That's interesting... I'm pretty sure I never gave anyone permission to sell my thousands of copyrighted Usenet posts.

      Someone will just put them on Kazaa, and then the copyrights won't matter, just like everything else up there.

    4. Re:That's interesting... by Anonymous+Brave+Guy · · Score: 1
      Someone will just put them on Kazaa, and then the copyrights won't matter, just like everything else up there.

      It's rather unlikely that anyone would, or could, put the whole of a Usenet archive onto Kazaa. It's even more unlikely that anyone would deliberately single out my own posts. If they did, it's even more unlikely still that those posts would get widely distributed, as most of what I have to say will appeal to a fairly small audience reading the specific newsgroups to which I post. And of course, I am perfectly entitled to use the same legislation as the RIAA and friends to sue anyone who is distributing my work if I really want to, though it's even more unlikely still that I'd bother.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    5. Re:That's interesting... by BobTheLawyer · · Score: 1

      Google didn't purchase your post: they purchased the database containing your post. There are intellectual property rights (often copyright) in a database or catalogue of information and these rights are quite separate from the information itself. This is the kind of right that anyone who compiles a telephone directory will have. When you post to usenet, the sensible legal view has to be that the post remains your copyright but that you grant an implied license to anyone anywhere in the world to read or archive your post. So Google couldn't (and wouldn't need to) ever buy the rights to your actual posts.

    6. Re:That's interesting... by Anonymous+Brave+Guy · · Score: 1
      Google didn't purchase your post: they purchased the database containing your post. There are intellectual property rights (often copyright) in a database or catalogue of information and these rights are quite separate from the information itself.

      I appreciate the distinction you're making here, but aren't the rights to a database usually relevant in cases where the data itself is trivial, such as the phone directory you mentioned? Phone numbers aren't themselves subject to copyright. In this case, I fail to see how wrapping up my posts as part of a database would make them any more legal to copy than if they were copied as separate entities, and it is the morality and legality of that copy being made that I am questioning.

      When you post to usenet, the sensible legal view has to be that the post remains your copyright but that you grant an implied license to anyone anywhere in the world to read or archive your post.

      I challenge that "sensible legal view". When people post to Usenet, they typically expect posts to be distributed freely around the system for a short time period after the post is made, and then to expire. Archiving is a somewhat dicey proposition, since it was neither the original purpose of Usenet nor something provided for within its normal protocols, but you could possibly make a case for it on public interest grounds. I'm not sure what effect (or otherwise) I think the "x-no-archive" header should have here.

      However, archiving for profit, selling on databases of others' copyrighted material for profit, restricting the access to the data by providing an interface to it but then banning certain uses under the ToS, and other practices employed by Deja/Google are much less clear. I'm afraid I don't see how they aren't just distributing material copyrighted by others, without permission, for profit.

      Are you really a lawyer, BTW? If so, what is your level of experience in this field? Is your assessment of the "sensible legal view" based on case law, statute law, or just professional opinion?

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    7. Re:That's interesting... by BobTheLawyer · · Score: 1

      Copyrights in the database are quite separate from copyrights in the content of the database. An analogy: if you cover a Britney Spears song then she retains copyright in her song (and you would need a license from her), but you have copyright in your recording of the song. Having multiple copyrights in one work is very common.

      The question here is whether the implied license granted by someone posting to Usenet is wide enough to let Google operate their Groups service.

      I'm an English IP lawyer (well, I'm Scottish but practice English law). I'm not aware of any case-law anywhere on the status of Usenet. However Courts have to take a wide view of the implied license granted by people who post information on Usenet (and indeed the internet) or else they would be making a wide variety of internet use unlawful. Implied licenses tend to be simple: I've never heard of a court recognising an implied license as complicated/sophisticated as you suggest.

    8. Re:That's interesting... by Anonymous+Brave+Guy · · Score: 1
      Copyrights in the database are quite separate from copyrights in the content of the database.

      Exactly my point; whatever rights Google may have relating to the former Deja database, are independent of my rights over my own work.

      However Courts have to take a wide view of the implied license granted by people who post information on Usenet (and indeed the internet) or else they would be making a wide variety of internet use unlawful.

      I don't see how that necessarily follows...

      Implied licenses tend to be simple: I've never heard of a court recognising an implied license as complicated/sophisticated as you suggest.

      I've never heard of a court either recognising or disputing practices such as Internet archiving yet, either. :-)

      But seriously, while there's an argument in favour of simplicity, I don't see how "We want it to be simple!" could or should translate into "By posting on Usenet I give up most/all useful rights relating to copyright of this material". Even if you decide that archiving is reasonable because I posted the material freely in the first place -- which you might do on public interest grounds, given certain caveats about people being able to prohibit this for their posts in a sensible way -- I don't see how I've granted any implicit rights for commercial use of my post. Then again, maybe you can argue that any non-free ISP providing Usenet access is making commercial use of my posts. Then again again, I'd argue that Usenet itself is essentially non-commercial, and the ISP is charging for the connection to Usenet.

      You could have a fair and reasonable debate about what the implied permissions granted by posting to Usenet should be, in the best interests of both the public interest and the rights of the original poster, but this seems to be avoided at present, until someone feels aggrieved enough to take up the challenge. Unfortunately, in the meantime, people like me are having our material used by others for commercial gain, and while that is somewhat offensive, it's not worth our while to do anything about it.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  28. Damn socialists. by Anonymous Coward · · Score: 0

    (Re: Your link)

    Psst... get a paying job, you bum.

  29. Call 'em, and check that mail by Anonymous Coward · · Score: 0

    As most people said: call 'em. You can explain what you'd like to achieve and fight off their arguments against it if needed.

    And also, check how much your email sounded like "Hey yo Google, could you burn me the usenet data you bought from deja for $XXX million onto a CD for my high-school project? I'll pay for the CD. Thanks".

    I'm not saying it did, but if it can be interpreted like that even remotely, the poor first-line email reader person at google is going to delete it before it's even half read.

  30. long retention by poptones · · Score: 1
    Easynews has retention of at least several weeks, even in the binaries groups. The discussion groups go back considerably longer.

    news.cis.dfn.de (I believe that's it; you can find it mentioned a lot in the "free news servers" newsgroup) has retention in the discussion groups (at least the ones I've looked in) all the way back to at least December of last year.

    But if you need years of research data, it seems Google would be the place to go. If your needs are limited to only a few groups, many of them have archives. The archives for some of the rec.audio groups, for example, go back many, many years and are fully indexed and stored in digests.

    I also don't get the complaining about terms. These people have a huge archive of Usenet data; there's nothing to stop you from building one of your own except the fact that it would cost a damn fortune to get set up and organized. If you have that kinda money, quit complaining and do it; if you don't, quit complaining about someone else (who does) taking the initiative. It would seem Google has the only reasonably complete Usenet archive. Would you rather there were none at all?

  31. Young Minds by Anonymous Coward · · Score: 0

    Young Minds used to sell Usenet archive discs. They apparently don't anymore, but you could try to buy someone's copies...

  32. copyright applied to usenet posts by zarthon · · Score: 1
    Anonymous Brave Guy (457657) writes on 22:50 Monday 28 July 2003 (#6556019):
    Archiving is a somewhat dicey proposition, since it was neither the original purpose of Usenet nor something provided for within its normal protocols, but you could possibly make a case for it on public interest grounds. I'm not sure what effect (or otherwise) I think the "x-no-archive" header should have here.
    The intention behind a user's use of the x-no-archive tag cannot be fully determined; however, the original purpose of the tag had to do with limiting the size of the archive and had nothing to do with copyright law. If you have posts from the 80's you remember the infamous single sided single density 5.25 inch floppy disk that was actually floppy. Huh dude, 2 megs was a frickn hard drive.

    The tag did not imply "do not print". Additionally, there is no precedent or tradition for it. No law has ever allowed a publisher to print a book under copyright and then require every copy to be burned on a certain date by all copy holders, even libraries, with the sole exception of the author's manuscript. And remember that copyrights do eventually expire.

    Your best posts are probably in basements and attics, in the cabinets under night stands across the land, on yellow, jagged, torn scraps of fan-folded paper with some of the little perforated tracks still attached.

    Our very discussion of the x-no-archive header, btw, fruitfully negates the assertion that archiving was not supported by Usenet protocols etc. Finally, the copyright holder must enforce copyright through civil litigation... I think the damage assessment would be in mills, if not zilch.

    Anyway... come on, you are just playing devil's [RIAA] advocate here! These are not precedents that most at /. really want to set, at least in quite this way.

    I am using 1.4 on RH9 and have had no problems reading /. but something has seemed funny. I think it's the fonts. Maybe I have seen weird fonts? If this is the case and it's bugging you, then set up a CSS or change the default serif/sans-serif fonts. The defaults might have changed in the build. /. is also one of the few sites that seems to never make Firebird 0.6 freeze. I often use Firebird as well as Mozilla. I sometimes wonder if my bird is compatible with my RH9 version of glibc.

    1. Re:copyright applied to usenet posts by Anonymous+Brave+Guy · · Score: 1

      Thanks for a thoughtful reply.

      To address your book analogy quickly first, the difference there is that in the case of a book, you've presumably acquired the right to have that book, by buying it, borrowing it from a library or otherwise. When you paid for it, you bought the right to keep it. So did the library when they bought their copy (and they can lend out the copy they bought, but not copy it again themselves and lend out multiple copies from one original). The whole issue isn't really directly comparable to pure information resources such as Usenet posts.

      Anyway... come on, you are just playing devil's [RIAA] advocate here! These are not precedents that most at /. really want to set, at least in quite this way.

      You don't see a problem with the idea that I invested my time to write something, offered it to the general public for free, and now someone else is taking my material, using it for profit and restricting how it's accessible via their system, without giving me any compensation or, realistically, much chance to do a whole lot about it?

      There's more to this than potential financial damages. At issue is whether I have the right to control copying of my information, the whole purpose of copyright. As anyone who's ever been involved in a GPL project can tell you, there are more reasons you might want to control the fruits of your labours and how they may be used than financial gain. As I've argued here many times before, you should have that right, because you invested the time and effort to produce the material. No-one who didn't has any right to claim that "information wants to be free" just so they can benefit from your work without offering anything in return, or use your work in a way that you do not like.

      If information has no value, and good free alternative sources are always available, and all the other mantras of the "make it free" community are true, then they have no need to use your offering, do they? They can just use their own version, or find a free and equally good source elsewhere. Oh, but wait... They can't, can they?

      BTW, my probs with Mozilla relate to how the page renders; it often draws the sidebar too wide, or forgets to draw the main column of text at all. A quick resize of the font fixes that, BTW, so the page is downloading properly but sometimes rendering incorrectly first time out.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  33. lawyers politicians and research alternatives by zarthon · · Score: 1
    Practically no one is lining up to give researchers data! This is a hot point of mine. I find it particularly outrageous that the research that is excluded by law is usually that which is in the public interest, while many private and commercial studies are well funded and legal.

    Many areas of the law are purposefully made to protect commercial interests and "stimulate the economy." Politicians are still at it! That this also seems to be in corporate interests is no accident, since a great deal of commerce is conducted by corporations.

    Consider some alternatives.

    • Do you really want to analyse all of that data?
    • How reliable do you believe the body of the database to be for your purposes? Deletions, omissions, moderation and changes in policies over time all contribute to a less than clean situation.
    • Could you use a subset of the data, obtained by getting some individual group archives available elsewhere?
    • Could you propose that their research team collaborate with you and run the analysis on the data at Google, thus not threatening them with another existing copy? It's not like they aren't into text analysis and internet use!
    • Could you analyse /. instead? Inquiring minds want to know! Thus far your research is stimulating internet activity!
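
    As a rough illustration of the "subset" idea above, here is a minimal Python sketch that tallies articles per year from raw article text of the kind found in individual group archives. The function name `posts_per_year` and the sample articles are hypothetical. Treating `X-No-Archive: yes` as a reason to skip a post is an assumption about what a researcher might want, not something the law or any archive requires.

```python
from collections import Counter
from email import message_from_string
from email.utils import parsedate_to_datetime


def posts_per_year(raw_articles):
    """Count articles per year, skipping ones marked X-No-Archive: yes."""
    counts = Counter()
    for raw in raw_articles:
        msg = message_from_string(raw)
        # Assumption: honour the poster's request not to be archived.
        if (msg.get("X-No-Archive") or "").strip().lower() == "yes":
            continue
        date_header = msg.get("Date")
        if not date_header:
            continue
        try:
            year = parsedate_to_datetime(date_header).year
        except (TypeError, ValueError):
            continue  # unparseable date: leave it out rather than guess
        counts[year] += 1
    return counts


# Hypothetical sample articles, made up for illustration only.
SAMPLE = [
    "From: a@example.com\nNewsgroups: rec.audio.misc\n"
    "Date: Mon, 28 Jul 2003 22:50:00 +0000\nSubject: one\n\nbody\n",
    "From: b@example.com\nNewsgroups: rec.audio.misc\n"
    "Date: Sat, 01 Apr 1995 12:00:00 +0000\nSubject: two\n\nbody\n",
    "From: c@example.com\nNewsgroups: rec.audio.misc\nX-No-Archive: yes\n"
    "Date: Tue, 29 Jul 2003 08:00:00 +0000\nSubject: three\n\nbody\n",
]

if __name__ == "__main__":
    for year, n in sorted(posts_per_year(SAMPLE).items()):
        print(year, n)
```

    The same loop could be fed from whatever per-group archive files you manage to obtain; how those files are split into individual articles depends on the archive's format and is not covered here.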

    The political problem is a lack of clout. Watch out for the researcher lobby! They are armed with regression and have infinitely many means! So I argue that, although it's not what I would like, the law is clearly on Google's side. Posting to the public Usenet groups clearly implies the expectation of third parties freely copying and archiving the 'works', since the architecture of the technology relies on copying. Archiving was even a practice of individual users and definitely of sys admins.

    Usenet is a means of communicating to those in the public at large who have the interest to read it. For it to be an effective means of communicating with the public, the information is not only carried on the public networks (like a phone conversation) but clearly remains accessible intertemporally to the general public, and to private users individually.

    As I said in a reply to another comment, no law allows the publisher to require copies to be burned on a certain date by all copy holders, even libraries, with the sole exception of the author's manuscript. Finally, all copyrights do eventually expire, with the intention of enriching the public domain.

    The full database (and perhaps some missing posts, a la the Google deletion policy) could theoretically be reconstructed from other sources. The database is a convenience that they offer to the public.

    If a competitor compiled an equally useful database then they could in effect create the same service. To do so without directly copying the database under discussion would be a fairly large and costly undertaking. Copying large, useful portions directly from their website would be much cheaper and could easily be scripted. The terms of the ToS simply protect the service (value added) that was created by Deja by creating a "home" in the new protocol, the "web". It not only collected the posts in one place but also transferred the information formerly on NNTP to HTTP, making it more accessible to many users without harming or taking away value from those who use traditional clients.

    Since folks visit their database pages and click their ads, stimulating commerce, the laws are serving their intended goals, but perhaps at the expense of the public interest. (That is assuming that your study is in that interest.) Good luck with your study.

  34. Email David Weisman from UWO by Large+Green+Mallard · · Score: 1

    He provided much of the early USENET archives. See his page here for details of his contribution and email address.