Slashdot Mirror


SpamArchive.org Launched

An anonymous reader writes "SpamArchive.org has just been launched. SpamArchive.org is a community resource that provides a database of known spam to be used for testing, developing, and benchmarking anti-spam tools. The goal of this project is to provide a large repository of spam that can be used by researchers and tool developers. In the past, there were a few small personal spam archives that were used. There was no large set of spam that could be used to test new anti-spam algorithms. Thus, developers could not sufficiently test their techniques across a range of messages. Also, the lack of a "standard" sample of spam made it difficult to effectively benchmark anti-spam tools."

30 of 269 comments

  1. So... by Markus+Landgren · · Score: 5, Funny

    Do they have a mailing list I can sign up for if I want to get updated by e-mail?

    1. Re:So... by RyoSaeba · · Score: 5, Insightful

      LOL, want'em to forward every new spam they receive ?
      Don't you have enough already ? ^_^

      Seriously, this sounds like a great idea.

      I can see a few technical troubles to catalog spam, though.
      Most obvious is that usually spam is personalized, that is the recipient's mail address (or part of it) often appears either in the subject or in the body. So will this archive store every variant of every spam, or just a 'global' model ?
      We also need to define how cataloguing tools are supposed to access the archive, i.e. grab it from a URL? An FTP text file?
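
      One way to approach the "every variant or a global model" question is to mask the recipient's address before fingerprinting a message, so personalized copies collapse to one entry. This is only a sketch: the `fingerprint` function and the `<RCPT>` placeholder are made up for illustration, and nothing here reflects how SpamArchive.org actually stores mail.

```python
import hashlib
import re


def fingerprint(message: str, recipient: str) -> str:
    """Hash a message after masking the recipient's address and local part."""
    local_part = recipient.split("@", 1)[0]
    # Mask the full address first, then any remaining bare local part.
    masked = re.sub(re.escape(recipient), "<RCPT>", message, flags=re.IGNORECASE)
    masked = re.sub(r"\b" + re.escape(local_part) + r"\b", "<RCPT>", masked,
                    flags=re.IGNORECASE)
    # Collapse whitespace so trivial layout differences do not change the hash.
    masked = " ".join(masked.split())
    return hashlib.sha256(masked.encode("utf-8")).hexdigest()


# Two personalized copies of the same (invented) spam.
a = "Subject: alice, make money fast\n\nDear alice@example.com, click here."
b = "Subject: bob, make money fast\n\nDear bob@example.org, click here."
same = fingerprint(a, "alice@example.com") == fingerprint(b, "bob@example.org")
```

      Spam that personalizes in other ways (random strings, first names pulled from elsewhere) would still produce distinct fingerprints, so this only handles the simple case described above.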

      And in any case, until spam filters are hooked directly on the smtp mail server itself, users will still have to take the time to configure their anti-spam tool, launch it regularly to clean the mailbox, and so on...

      For instance Mozilla will incorporate spam filters, but from what I gathered you'll still have to download that freaking spam before it gets filtered, which can take some time if those are big spams (like viruses or such).

      Ok, it sure beats having legitimate mails removed from the server without our knowledge...

      Just my 2 cents of euro.

      --
      Tsuyoikoto ha taisetsu da ne, dakedo namida mo hitsuyousa (Strength is an important thing, but tears too are necessary)
    2. Re:So... by stevenp · · Score: 4, Funny

      > Do they have a mailing list I can sign up for if I want to get updated by e-mail?

      No, but you can open a Hotmail account and receive a daily dose of UP-TO-DATE spam messages FOR FREE.

    3. Re:So... by plumby · · Score: 5, Interesting
      It may partly depend on what user name you picked. I've got two email accounts with my ISP, neither of which I've ever given to anyone. One has a common surname as the account name. The other has a collection of random gibberish as username. The first one receives several spam messages per day. The other one has probably received one in the last 3 months.

      I guess that the spammers quite probably have a standard list of common names that they put in front of @hotmail.com, @aol.com, etc.

      As a tip, though, I've just set my spam levels on hotmail to only receive emails from people that are in my address book. I've not got a single spam on that account (except from MS themselves) since I did that.
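
      The address-book setting amounts to an allowlist check on the sender. A rough sketch of the idea, assuming a hypothetical `is_allowed` helper and an invented address book (this is not how Hotmail implements it):

```python
from email.utils import parseaddr


def is_allowed(from_header: str, address_book: set) -> bool:
    """Accept a message only if its From address is in the address book."""
    _, addr = parseaddr(from_header)
    return addr.lower() in address_book


# Hypothetical address book, stored in lower case.
address_book = {"friend@example.com"}

accepted = is_allowed("A Friend <Friend@Example.com>", address_book)
rejected = is_allowed("Hot Deals <deals@spam.example>", address_book)
```

      Since From headers are easy to forge, this keeps out spam from unknown senders but does nothing against a spammer who guesses an address in your book.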

  2. A hotmail account is just as good by Anonymous Coward · · Score: 5, Funny
    There was no large set of spam that could be used to test new anti-spam algorithms


    Whoever wrote this obviously doesn't have a Hotmail account.
  3. Hard to get worked up about that by RebRachman · · Score: 5, Interesting

    Even I know how to buy a domain name and write a few paragraphs of text on a white background. There is nothing about this archive to hint at its origin or credibility. This is a /. worthy story?

    1. Re:Hard to get worked up about that by arvindn · · Score: 5, Informative

      Even I know how to buy a domain name and write a few paragraphs of text on a white background.
      But you didn't, did you?

      This is a /. worthy story?
      You're missing the point. The story is not on /. because something revolutionary has been done, but because the huge number of /. readers can get together and create a useful database. Obviously it would be no good if no one knew about it. In a sense, the story is worthy because it got on /. :) Kind of a reverse Catch-22, if you like.

      What you can do:
      • Help them implement their automated spam review scripts. As with any project, they need volunteers.
      • Make sure you send them a copy of all the spam you receive. From their page:
        SpamArchive.org's efficiency is proportional to the amount, quality, and variety of spam that is provided. End users can forward known spam to submit@spamarchive.org.
    2. Re:Hard to get worked up about that by RebRachman · · Score: 5, Insightful

      The point is that if they want to do a spam archive, you would expect them to do some minimal research. This page clearly shows that SpamArchive.org has not done the following basic background work:

      1. Told me who they are so that I might trust them.

      2. Told me anything about their technology/database so that I might know if it is really going to be useful. For all I know they haven't even thought about the collection, storage and retrieval issues behind dealing with this.
      3. Collected the archives supposedly uncoordinated that already exist and collated them.
      4. Added even one link to a relevant site. You would assume that to undertake such a project they would at least have visited a few sites before concluding there was nothing out there. Posting a couple of relevant URLs wouldn't be too much work.

      In short, I am not impressed that someone who can do 20 minutes of work is the same someone who can undertake the huge project proposed here. It looks like they think that somehow all they need is for people to send them information by e-mail, and for a few other people to volunteer to do the work. Not a promising start.

  4. Database? by dat00ket · · Score: 5, Funny

    Can't researchers just set up their own hotmail account?

    Seems cheaper.

  5. Trade Spam! by Pathwalker · · Score: 5, Funny

    Now that spam is so collectable, someone should start a service to let people trade it?

    What will someone give me for my rare "Help fund the freedom fighters in Chechnya!" complete with numbered bank accounts to send donations to?

  6. Tell everyone! by some+guy+I+know · · Score: 5, Funny

    I think that they should send email out to everybody describing this great service!

    --
    Those who sacrifice security to condemn liberty deserve to repeat history or something. - Benjamin Santayana
  7. Who are these guys? by gomerbud · · Score: 5, Interesting

    Dude, I could have registered a similar domain and put up a comparable web page within a matter of hours. I hope they really exist.

    Wouldn't it be great if the submit email address was forwarded to someone's ex girlfriend? That's the ultimate form of revenge...

    1) Register domain name.
    2) Put up web page advertising some kind of anti-spam database.
    3) Forward all email sent to the submit address to someone you don't like.
    4) Get slashdotted.

    The end result is that three million people send 100 spams the first hour to the submit address. Within a short amount of time, your foe has 300 million emails in his/her mailbox. Now that's spam.

    --
    Kan jeg få en pils, vær så snill?
  8. What about NANAS? by tsvk · · Score: 5, Informative

    NANAS, or the newsgroup news.admin.net-abuse.sightings, does just this. It is a public archive of spam which can be searched e.g. with Google Groups:

    http://groups.google.com/groups?group=news.admin.net-abuse.sightings

    Why reinvent the wheel? Or does this new spam archive have any new functionality to offer?

  9. NANAS Google Archive by Ricardo+Dias+Marques · · Score: 5, Informative

    Well, there is already a pretty large Email and USENET Spam archive at the NANAS (news.admin.net-abuse.sightings) newsgroup.

    You can check the Google Groups archive

    You can read the NANAS charter at http://www.killfile.org/~tskirvin/nana/charter/nanas.html

  10. Whois.. by Anonymous Coward · · Score: 5, Informative

    says:
    Domain Name: SPAMARCHIVE.ORG
    Owner, Administrative Contact, Technical Contact, Billing Contact:
    Guru Rajan (ID00024772)
    11475 Great Oak Way
    Suite 210
    Alpharetta, GA 30022
    us
    Phone: +1.6789699399
    Email: guru.rajan@ciphertrust.com

    http://www.ciphertrust.com introduces itself as:

    Protect Your Email Gateway
    Anti-spam and email security for the enterprise

    CipherTrust has integrated defenses for all email application-level threats into one, comprehensive device. Our IronMail appliance protects enterprise email systems such as Microsoft Exchange, Lotus Notes and Novell GroupWise against viruses, spam, and intruders, and provides message privacy and policy enforcement.

    1. Re:Whois.. by Anonymous Coward · · Score: 4, Insightful

      So let's get this straight...

      This database is run by a little-known company of mixed reputation that sells its own anti-spam tool.

      It doesn't promise any new functionality that news.admin.net-abuse.* doesn't already provide. There's absolutely no reason to believe that the spams collected here will be any 'better' a sample than those collected by opening a random Hotmail account.

      So, what's in it for Ciphertrust? As well as their own library of spam, they'll have a collection of e-mail addresses of people who are interested in fighting spam.

      And what's in it for us? Anyone? Bueller? Anyone?

  11. The opposite by sholden · · Score: 5, Insightful

    Exactly the opposite is needed for work on mail filters.

    Spam is really easy to find, everyone knows that, create a hotmail account fill out some web forms, post to some newsgroups, put a mailto: on a web page. Wait a little while. Bingo, lots of spam.

    However, non-spam email is harder to find. Using only your own mail produces techniques that work with your particular type of email but not with other people's.

    Non-spam is harder to collect, since email is often private in nature. Removing identifiers from the headers is easy enough, but the body also can contain things like addresses, emails, phone numbers, comparisons of the boss to bacteria, etc.

    A collection of real emails, in which personal information has been replaced with fake data, would be of great use. A few people I know are working on creating such a data set of email. It is aimed at more general email filtering though, not just spam detection, and hence requires categorisation. And it is from academia and hence will probably lose the race with the heat death of the universe for completion.
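
    To make "replaced with fake data" concrete, here is a minimal sketch that swaps each email address in a body for a stable fake one, so that repeated mentions of the same person still line up. The `pseudonymize` function, the regex, and the `example.invalid` naming scheme are all assumptions for illustration; the project mentioned above may work quite differently, and phone numbers, names and postal addresses would need their own handling.

```python
import re

# Deliberately simple pattern; real address syntax is more permissive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(text: str, mapping: dict) -> str:
    """Replace every email address with a consistent fake address."""
    def replace(match):
        real = match.group(0).lower()
        if real not in mapping:
            # Same real address always maps to the same fake one.
            mapping[real] = f"user{len(mapping) + 1}@example.invalid"
        return mapping[real]

    return EMAIL_RE.sub(replace, text)


mapping = {}
out = pseudonymize(
    "Ask bob@corp.example or Bob@corp.example; cc ann@corp.example.", mapping
)
# out == "Ask user1@example.invalid or user1@example.invalid; cc user2@example.invalid."
```

    Sharing one `mapping` across a whole corpus keeps correspondents distinguishable without revealing who they are, provided the mapping itself is never published.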

    I do note they have a 'non-spam' heading on the very sparse web page which is encouraging.

  12. Spam and anti-spam by zedman · · Score: 5, Funny

    Would spammers try to "anti-spam" the spam archive by submitting billions of perfectly normal emails?

    Ian

  13. What if... by serlaten · · Score: 5, Interesting

    ...spammers use the anti-spam tools to create spam that doesn't trigger the automatic spam filters.

    1. Write spam mail
    2. Filter through widely used spam filter
    3. If spam is flagged as spam, rewrite; goto 2
    4. Send
    5. Profit
  14. That could be heaven for spammers.. by heytal · · Score: 4, Insightful

    The archive could give them a lot of valid email addresses...

    Consider this one: You forward a spam to submit@spamarchive.org. The forwarded mail is now a part of the archive. Spammers snoop the archive for email addresses.

  15. Spam archive and stats by minesweeper · · Score: 4, Informative
    If you're looking for 5+ years of archived spam and plots of spam volume versus time, check out this guy's site.

    His page of graphs shows the exponential growth of spam over the past few years.

  16. Not intended purpose by 0x0d0a · · Score: 4, Informative

    This isn't like Distributed Checksum Clearinghouse or some other spam *solution*. It's intended to test what percentage of messages antispam tools get right -- that is, to measure false positives and negatives. It's useless (at least directly) to end users.

    So unless your antispam tool breaks on some names in personalized letters, I would think that it's okay.
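
    Measuring false positives and negatives against a labelled corpus could look roughly like this. The `error_rates` helper, the toy keyword filter, and the four sample messages are invented for illustration; a real benchmark would use the archive's spam plus a separate set of legitimate mail.

```python
def error_rates(classify, spam, ham):
    """Return false negative and false positive rates for a spam classifier.

    classify(message) must return True when it considers the message spam.
    """
    false_negatives = sum(1 for m in spam if not classify(m))  # spam let through
    false_positives = sum(1 for m in ham if classify(m))       # good mail flagged
    return {
        "false_negative_rate": false_negatives / len(spam),
        "false_positive_rate": false_positives / len(ham),
    }


def toy_filter(message):
    """A deliberately naive filter: flag anything containing 'free'."""
    return "free" in message.lower()


spam = ["FREE money now", "Cheap pills"]
ham = ["Lunch at noon?", "Feel free to call me"]
rates = error_rates(toy_filter, spam, ham)
# rates == {"false_negative_rate": 0.5, "false_positive_rate": 0.5}
```

    Note that the false positive rate cannot be computed from a spam-only archive; it needs the non-spam set as well.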

  17. As Admiral Ackbar says... by imag0 · · Score: 4, Funny

    It's a trap!!!

    1) Set up story about new site accepting spam to assist in creating better anti-spam tools.
    2) accept all the submissions from the teeming millions(tm) at a popular tech site or two.
    3) cull all the email addresses from those duped to forward spam to you.
    4) sell said email addresses to spammers.
    5) PROFIT!!!!

  18. Maybe I'm being cynical.... by Maddog+Batty · · Score: 5, Interesting

    If you were a spammer and wanted to collect a large number of valid email addresses, how about this as an idea...

    1) Produce a website pretending to be antispam.

    2) Ask people to send their spam emails to the site (generally including a valid from address of course)

    3) Publish on slashdot so as to get lots of interest.

    4) ???

    5) Profit!

    (Unfortunately, we all know what stage 4 is for spammers...)

    --
    wot no sig
  19. What's the point? by brunnock · · Score: 5, Insightful

    What's the point of testing a filter against a database of known spam if you can't test it against a database of nonspam?

    Anybody can write a filter for bulk mail. How do you differentiate between solicited and unsolicited bulk mail?

  20. How to end spam by Permission+Denied · · Score: 5, Interesting
    I've had the same email address for five years, and I receive zero spam. None whatsoever. I also advertise the email address widely (web, usenet, mailing lists).

    How does this work, you ask? I create a new email address each time I give out my email address. We have a sendmail setup that allows you to make "username+foo@example.com" go to "username@example.com" where "foo" is any arbitrary string.

    So, amazon.com thinks I'm "username+amazon@example.com", securityfocus thinks I'm "username+bugtraq@example.com" and so on. Once I receive spam on one of the addresses, it's trivial to write a filter that matches with near 100% confidence ("username+bugtraq@example.com" should only receive messages originating from securityfocus, etc.). Most times, if an address receives a spam, I can just procmail all mail to the address to /dev/null (eg, no complex rules like for the bugtraq example). This also allows me to track where spammers get their lists.

    We use sendmail. Equivalently, qmail allows "username-foo@example.com" and if you own your own domain, just use "foo@example.com".
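
    The per-sender filtering described above can be sketched as follows. The `split_tag` and `looks_like_spam` helpers and the tag-to-domain table are hypothetical; in the setup described, procmail rules do this job, and the sender domains listed here are only assumed for the example.

```python
def split_tag(address: str, separator: str = "+"):
    """Split 'user+tag@domain' into (user, tag, domain); tag is '' if absent."""
    local, _, domain = address.rpartition("@")
    base, _, tag = local.partition(separator)
    return base, tag, domain


# Assumed mapping from the tag handed out to the domain expected to use it.
EXPECTED_SENDER_DOMAINS = {
    "bugtraq": "securityfocus.com",
    "amazon": "amazon.com",
}


def looks_like_spam(recipient: str, sender_domain: str) -> bool:
    """Flag mail sent to a tagged address from an unexpected domain."""
    _, tag, _ = split_tag(recipient)
    expected = EXPECTED_SENDER_DOMAINS.get(tag)
    if expected is None:
        return False  # unknown or missing tag: no opinion
    sender_domain = sender_domain.lower()
    return not (sender_domain == expected
                or sender_domain.endswith("." + expected))


parts = split_tag("username+bugtraq@example.com")
leaked = looks_like_spam("username+bugtraq@example.com", "spam.example")
```

    For the qmail convention, pass `separator="-"`. A mismatch also tells you which tag (and so which site) the spammer got the address from.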

    I find this advanced filtering stuff fascinating, from a completely academic point of view. I, of course, can't apply any of it since I don't receive any spam, but it's interesting nonetheless. I just read through how the Bayesian filter works. It is very simple: it only filters based on word (token) probabilities. So, it would assign a value to "make," "money" and "fast," but not "make money fast". Seems like you could get much better results if you do something more advanced like Markov chains or a neural net. There's lots of research out there on textual matching, and I'm not sure why people would start out with such a simple algorithm when there may be better things available (where "better" is measured not only by accuracy, but also by training time).
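
    For reference, a per-token filter of that kind can be sketched in a few lines. This is a simplified naive Bayes with add-one smoothing, not necessarily the exact filter described above; the function names and the tiny training set are made up.

```python
import math
from collections import Counter


def train(messages):
    """Count, for each token, how many messages contain it."""
    counts = Counter()
    for m in messages:
        counts.update(set(m.lower().split()))
    return counts


def spam_score(message, spam_counts, ham_counts, n_spam, n_ham):
    """Log-odds that a message is spam; above 0 means spam is more likely."""
    log_odds = math.log(n_spam / n_ham)  # prior from corpus sizes
    for token in set(message.lower().split()):
        # Add-one smoothing so unseen tokens never give a zero probability.
        p_spam = (spam_counts[token] + 1) / (n_spam + 2)
        p_ham = (ham_counts[token] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return log_odds


spam = ["make money fast", "make money now"]
ham = ["meeting moved to monday", "can you make it"]
spam_counts, ham_counts = train(spam), train(ham)

s1 = spam_score("make money fast", spam_counts, ham_counts, len(spam), len(ham))
s2 = spam_score("meeting moved to monday", spam_counts, ham_counts, len(spam), len(ham))
```

    Each token contributes independently, so "fast money make" scores the same as "make money fast"; that is exactly the limitation a Markov-chain or phrase-based model would address.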

    1. Re:How to end spam by CvD · · Score: 4, Insightful

      It is still too much work for me to have to set up a new email address every time I leave it on a website somewhere.

      With an advanced spam filter, you set it up and forget about it...sometimes checking your spamfolder if there are any false positives.

      How do you create new email addresses? Do you have a CGI script interfaced with your alias file or so to easily make new email addresses? That would be useful.

      For me it still is too much work to set up email addresses that way. And you need to start doing this from the beginning, otherwise there will still be an amount of spam that gets sent to your username@example.com address (as is the case with me).

      Cheers,

      Costyn.

  21. Are they legit? by Zocalo · · Score: 5, Informative
    Typical of a Slashdot story. Lots of people asking questions when they could find out the answer and post it in the same amount of time.

    According to WHOIS, "spamarchive.org" was registered by one Guru Rajan, who has an email address at "ciphertrust.com". Also according to WHOIS, "ciphertrust.com" has the same person as technical contact and if you check the website you find they are the vendors of "IronMail: The Secure Internet Email Gateway", an established if not well known product.

    In short, yes, it seems legit, and it probably took me less time to find that out than it took the myriad people asking "is it legit" to post the question. ;)

    --
    UNIX? They're not even circumcised! Savages!
  22. I think it's already been done, but in reverse... by Pendant · · Score: 5, Interesting

    In order to counter the rising tide of spam I recently installed a spamblocker, even though I'm wary of such beasts because of the danger of false positives.

    Sure enough, I have received false positives. But only from one source: my filter traps the Network Solutions email asking for confirmation to proceed with the transfer away of a domain to another registrar. Net$ol changed the format of these emails a while back: they now start off by talking about a "special offer" and it's only towards the end that the real purpose of the message is revealed. My suspicious mind wonders whether these emails are intentionally designed to look like spam to reduce the number of successful transfers... sneaky :(

  23. FALSE STATEMENTS by mgkimsal2 · · Score: 4, Insightful

    ... and I receive zero spam

    Once I receive spam on one of the addresses...

    I also advertise the email address widely ...

    So, you receive no spam, but when you do receive spam, you edit procmail. Which is it?

    Also, you widely advertise your email address, but you don't actually use your email address, but made-up aliases. Which is it?

    You're simply masking the problem, and going thru a moderate amount of gyrations (which most average joe 'net users won't/can't go through) to do so.