Slashdot Mirror


SpamArchive.org Launched

An anonymous reader writes "SpamArchive.org has just been launched. SpamArchive.org is a community resource that provides a database of known spam to be used for testing, developing, and benchmarking anti-spam tools. The goal of this project is to provide a large repository of spam that can be used by researchers and tool developers. In the past, there were a few small personal spam archives that were used. There was no large set of spam that could be used to test new anti-spam algorithms. Thus, developers could not sufficiently test their techniques across a range of messages. Also, the lack of a "standard" sample of spam made it difficult to effectively benchmark anti-spam tools."

4 of 269 comments

  1. Re:So... by RyoSaeba · · Score: 5, Insightful

    LOL, want 'em to forward every new spam they receive?
    Don't you have enough already? ^_^

    Seriously, this sounds like a great idea.

    I can see a few technical problems with cataloguing spam, though.
    The most obvious is that spam is usually personalized: the recipient's mail address (or part of it) often appears in the subject or in the body. So will this archive store every variant of every spam, or just a 'global' model?
    They also need to define how cataloguing tools are supposed to access the archive, i.e. grab it from a URL? An FTP text file?
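One way to picture the 'global model' option is to normalize each message before storing it, so that personalized variants collapse to the same fingerprint. The sketch below is only an illustration: the placeholder token, the function names, and the choice of SHA-256 are assumptions, not anything SpamArchive.org has described.

```python
import hashlib
import re

# Hypothetical placeholder token, chosen for this sketch only.
PLACEHOLDER = "<recipient>"


def normalize_spam(text: str, recipient: str) -> str:
    """Replace the recipient's address (or its local part) with a placeholder."""
    text = text.lower()
    recipient = recipient.lower()
    local_part = recipient.split("@", 1)[0]
    # Try the full address first, then the bare local part (e.g. "jdoe").
    # Word boundaries avoid replacing the local part inside longer words.
    pattern = re.compile(
        r"\b(?:%s|%s)\b" % (re.escape(recipient), re.escape(local_part))
    )
    text = pattern.sub(PLACEHOLDER, text)
    # Collapse whitespace so trivial formatting differences do not matter.
    return " ".join(text.split())


def fingerprint(text: str, recipient: str) -> str:
    """Hash the normalized message so variants can be grouped together."""
    normalized = normalize_spam(text, recipient)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "Hey jdoe, you won! Reply to claim, jdoe@example.com"
    b = "Hey  asmith, you won!\nReply to claim, asmith@example.org"
    print(fingerprint(a, "jdoe@example.com") == fingerprint(b, "asmith@example.org"))
```

This only handles the simplest kind of personalization. Spam that inserts random strings or rewrites its wording would need fuzzier matching than an exact hash.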

    And in any case, until spam filters are hooked directly into the SMTP mail server itself, users will still have to take the time to configure their anti-spam tool, launch it regularly to clean the mailbox, and so on...

    For instance, Mozilla will incorporate spam filters, but from what I gather you'll still have to download that freaking spam before it gets filtered, which can take some time if those are big spams (like viruses or such).

    Ok, it sure beats having legitimate mails removed from the server without our knowledge...

    Just my 2 cents of euro.

    --
    Tsuyoikoto ha taisetsu da ne, dakedo namida mo hitsuyousa (Strength is an important thing, but tears too are necessary)
  2. The opposite by sholden · · Score: 5, Insightful

    Exactly the opposite is needed for work on mail filters.

    Spam is really easy to find, everyone knows that: create a Hotmail account, fill out some web forms, post to some newsgroups, put a mailto: on a web page. Wait a little while. Bingo, lots of spam.

    However, non-spam email is harder to find. Using only your own produces techniques that work with your particular type of email and not with other people's.

    Non-spam is harder to collect, since email is often private in nature. Removing identifiers from the headers is easy enough, but the body can also contain things like addresses, emails, phone numbers, comparisons of the boss to bacteria, etc.

    A collection of real emails in which personal information has been replaced with fake data would be of great use. A few people I know are working on creating such a data set of email. It is aimed at more general email filtering though, not just spam detection, and hence requires categorisation. And it is from academia, and hence will probably lose the race with the heat death of the universe for completion.
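As a rough illustration of replacing personal information with fake data, the sketch below swaps email addresses and phone-like numbers in a message body for stand-ins. The regular expressions, the `example.invalid` aliases, and the fixed fake phone number are all assumptions made for this example, and they would miss names, street addresses, and anything said about the boss.

```python
import re

# Deliberately simple patterns; real mail needs far more careful handling.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

# Hypothetical stand-in value used for every phone number.
FAKE_PHONE = "555-0100"


def scrub(text: str) -> str:
    """Replace emails and phone-like numbers in a message body with fake data."""
    aliases = {}

    def fake_email(match):
        original = match.group(0).lower()
        # Map each distinct address to a stable alias so that repeated
        # mentions of the same person still line up after scrubbing.
        if original not in aliases:
            aliases[original] = "user%d@example.invalid" % (len(aliases) + 1)
        return aliases[original]

    text = EMAIL_RE.sub(fake_email, text)
    text = PHONE_RE.sub(FAKE_PHONE, text)
    return text


if __name__ == "__main__":
    print(scrub("Mail bob@corp.com or call +1 (212) 555-1234. bob@corp.com again."))
```

Keeping a stable alias per address preserves who-talks-to-whom structure, which may matter for filters that look at more than the wording.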

    I do note they have a 'non-spam' heading on the very sparse web page, which is encouraging.

  3. Re:Hard to get worked up about that by RebRachman · · Score: 5, Insightful

    The point is that if they want to do a spam archive, you would expect them to do some minimal research. This page clearly shows that SpamArchive.org has not done the following basic background work:

    1. Told me who they are so that I might trust them.
    2. Told me anything about their technology/database so that I might know if it is really going to be useful. For all I know they haven't even thought about the collection, storage and retrieval issues behind dealing with this.
    3. Collected and collated the supposedly uncoordinated archives that already exist.
    4. Added even one link to a relevant site. You would assume that to undertake such a project they would at least have visited a few sites before concluding there was nothing out there. Posting a couple of relevant URLs wouldn't be too much work.

    In short, I am not convinced that someone who has put in 20 minutes of work is the same someone who can undertake the huge project proposed here. It looks like they think that all they need is for people to send them information by e-mail, and for a few other people to volunteer to do the work. Not a promising start.

  4. What's the point? by brunnock · · Score: 5, Insightful

    What's the point of testing a filter against a database of known spam if you can't test it against a database of nonspam?

    Anybody can write a filter for bulk mail. How do you differentiate between solicited and unsolicited bulk mail?
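To make the point about needing a non-spam set concrete, here is a minimal sketch of scoring a filter against both kinds of mail. The `evaluate` helper, the toy filters, and the sample messages are all made up for illustration. A filter that flags everything looks perfect against a spam-only archive and is useless in practice.

```python
from typing import Callable, Dict, Iterable


def evaluate(
    is_spam: Callable[[str], bool],
    spam: Iterable[str],
    ham: Iterable[str],
) -> Dict[str, float]:
    """Score a filter on known spam and on known non-spam ("ham")."""
    spam = list(spam)
    ham = list(ham)
    if not spam or not ham:
        raise ValueError("need at least one spam and one non-spam message")
    caught = sum(1 for message in spam if is_spam(message))
    false_positives = sum(1 for message in ham if is_spam(message))
    return {
        "spam_caught_rate": caught / len(spam),
        "false_positive_rate": false_positives / len(ham),
    }


def flag_everything(message: str) -> bool:
    """A useless filter that calls every message spam."""
    return True


def keyword_filter(message: str) -> bool:
    """A toy filter that flags any message containing the word 'free'."""
    return "free" in message.lower()


if __name__ == "__main__":
    spam = ["FREE pills", "cheap loans"]
    ham = ["free for lunch?", "meeting notes"]
    print(evaluate(flag_everything, spam, ham))
    print(evaluate(keyword_filter, spam, ham))
```

Without the second list there is no way to compute the false positive rate at all, which is the number that tells you how much legitimate mail gets thrown away.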