SpamArchive.org Launched


Posted by chrisd on Wednesday November 20, 2002 @10:00PM from the spam-is-angry-making dept.

An anonymous reader writes "SpamArchive.org has just been launched. SpamArchive.org is a community resource that provides a database of known spam to be used for testing, developing, and benchmarking anti-spam tools. The goal of this project is to provide a large repository of spam that can be used by researchers and tool developers. In the past, there were a few small personal spam archives that were used. There was no large set of spam that could be used to test new anti-spam algorithms. Thus, developers could not sufficiently test their techniques across a range of messages. Also, the lack of a "standard" sample of spam made it difficult to effectively benchmark anti-spam tools."

20 of 269 comments

Re:So... by RyoSaeba · 2002-11-20 22:11 · Score: 5, Insightful

LOL, want'em to forward every new spam they receive ?
Don't you have enough already ? ^_^

Seriously, this sounds like a great idea.

I can see a few technical troubles with cataloguing spam, though.
The most obvious is that spam is usually personalized, that is, the recipient's mail address (or part of it) often appears either in the subject or in the body. So will this archive store every variant of every spam, or just a 'global' model?
They also need to define how cataloguing tools are supposed to access the archive, i.e. fetch from a URL? An FTP text file?
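
One way to read the 'global' model idea is to mask the recipient's address before comparing messages, so personalized copies of the same spam collapse to one entry. A minimal sketch, assuming the archive knows the recipient of each submission; `canonical_fingerprint` and the `<RCPT>` placeholder are hypothetical names, not anything SpamArchive.org has described:

```python
import hashlib
import re


def canonical_fingerprint(message: str, recipient: str) -> str:
    """Hash a message after masking recipient-specific text (illustrative only)."""
    local_part = recipient.split("@", 1)[0]
    masked = message
    # Mask the full address first, then the bare local part.
    for token in (recipient, local_part):
        if token:
            masked = re.sub(re.escape(token), "<RCPT>", masked, flags=re.IGNORECASE)
    # Collapse whitespace and case so trivial reflowing does not change the hash.
    masked = " ".join(masked.split()).lower()
    return hashlib.sha256(masked.encode("utf-8")).hexdigest()


# Two personalized copies of the same (made-up) spam.
a = canonical_fingerprint(
    "Subject: bob, act now!\nDear bob@example.com, buy stuff", "bob@example.com"
)
b = canonical_fingerprint(
    "Subject: alice, act now!\nDear alice@example.org, buy stuff", "alice@example.org"
)
print(a == b)
```

Short local parts would be masked inside unrelated words as well, so a real catalogue would need something more careful than plain substring replacement.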

And in any case, until spam filters are hooked directly on the smtp mail server itself, users will still have to take the time to configure their anti-spam tool, launch it regularly to clean the mailbox, and so on...

For instance, Mozilla will incorporate spam filters, but from what I gather you'll still have to download that freaking spam before it gets filtered, which can take some time if those are big spams (like viruses or such).

Ok, it sure beats having legitimate mails removed from the server without our knowledge...

Just my 2 cents of euro.

--
Tsuyoikoto ha taisetsu da ne, dakedo namida mo hitsuyousa (Strength is an important thing, but tears too are necessary)
recycled spam by ndevice · 2002-11-20 22:12 · Score: 2, Insightful

With some people already accusing bugtraq of being a repository for exploits that anyone could use for exploit purposes, you'd think that the same could happen to the spam archive.

Soon we'll see old spam being recycled as the new breed of spam trolls mine the archive for inspiration - and maybe just material reuse.

Then, of course, it's not like we don't see recycled spam anyway, so maybe this isn't such a bad thing...

(And if I sound incoherent, it's 2 in the morning. I should be sleeping.)
They're asking for trouble by EmagGeek · 2002-11-20 22:12 · Score: 3, Insightful

Is this really necessary? I mean, come on, how hard is it to find spam for research? Most people get more spam than their Hotmail inbox can handle just for signing up for the account. All a researcher has to do is start clicking the "Remove Me" link in those emails and he or she will have more spam than he or she knows what to do with!
Combine that with posting to some anti-spam newsgroups with their real email address, and bingo boingo, all the spam in the world will come right to them.
This site also creates a problem in that only the spam posted to that site might be used for research. There might be millions of spam emails overlooked because they don't make it onto that site. Think of those poor spammers that won't get filtered :)
Won't someone please think of the children!?!?
1. Re:They're asking for trouble by piranha(jpl) · 2002-11-20 23:06 · Score: 2, Insightful
  
  Is this really necessary? I mean, come on, how hard is it to find spam for research? Most people get more spam than their Hotmail inbox can handle just for signing up for the account. All a researcher has to do is start clicking the "Remove Me" link in those emails and he or she will have more spam than he or she knows what to do with!
  
  Wrong. I've been setting up bogus e-mail accounts on a domain created exclusively for spam research/testing. I've gone through at least a dozen "unsubscribe" links and never received one spam out of it to those test accounts. Perhaps the spammers only highlight records for people who "unsubscribe" when those people were in their database in the first place.
  (The most spam I've received so far in one of these test accounts was from signing up to freefootfetishezine.com.)
  
  This site also creates a problem in that only the spam posted to that site might be used for research. There might be millions of spam emails overlooked because they don't make it onto that site. Think of those poor spammers that won't get filtered :)
  
  That doesn't make sense; they might not get a good sample of the spam if they don't solicit samples, just as much as they might not get a good sample if they do. It makes more sense that they would get more spam--and more diverse spam--from soliciting examples. Consider that submitted samples would come from all over the world, from a variety of sources, and in a variety of languages.
The opposite by sholden · 2002-11-20 22:29 · Score: 5, Insightful

Exactly the opposite is needed for work on mail filters.

Spam is really easy to find, everyone knows that: create a Hotmail account, fill out some web forms, post to some newsgroups, put a mailto: on a web page. Wait a little while. Bingo, lots of spam.

However, non-spam email is harder to find. Using your own produces techniques that work with your particular type of email and not other people's.

Non-spam is harder to collect, since email is often private in nature. Removing identifiers from the headers is easy enough, but the body can also contain things like addresses, emails, phone numbers, comparisons of the boss to bacteria, etc.

A collection of real emails, in which personal information has been replaced with fake data, would be of great use. A few people I know are working on creating such a data set of email. It is aimed at more general email filtering though, not just spam detection, and hence requires categorisation. And it is from academia, and hence will probably lose the race with the heat death of the universe for completion.
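
Replacing personal information with fake data could start with something as crude as the sketch below. It assumes regular expressions are good enough for email addresses and phone numbers; the patterns, the `scrub` helper and the placeholder values are all illustrative, and free-text details (names, street addresses, remarks about the boss) would slip straight through.

```python
import re

# Deliberately loose patterns; real mail will contain forms these miss.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def scrub(body: str) -> str:
    """Replace email addresses and phone-like digit runs with fixed fake values."""
    body = EMAIL_RE.sub("user@example.com", body)
    body = PHONE_RE.sub("555-0100", body)
    return body


sample = "Call me on +1 (404) 555-1234 or mail jane.doe@corp.example.net about the boss."
scrubbed = scrub(sample)
print(scrubbed)
# prints: Call me on 555-0100 or mail user@example.com about the boss.
```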

I do note they have a 'non-spam' heading on the very sparse web page which is encouraging.
Non-spam messages for false hit testing by jjl · 2002-11-20 22:34 · Score: 3, Insightful

An archive of non-spam samples should be collected as well, containing real e-mail messages which aren't spam. These should be more or less normal private e-mails that people volunteer to make public for testing purposes.
The purpose of the non-spam samples would be to help prevent false hits in spam filtering algorithms, just as real spam messages are used to tune the algorithms for detecting spam.
That could be heaven for spammers.. by heytal · 2002-11-20 22:41 · Score: 4, Insightful

The archive could give them a lot of valid email addresses...

Consider this one: You forward a spam to submit@spamarchive.org. The forwarded mail is now a part of the archive. Spammers snoop the archive for email addresses.
Benchmarking "False Positives" by gwappo · 2002-11-20 22:54 · Score: 3, Insightful

It would seem to me that the value of such a repository is limited if all it contains is spam.
If anyone writes an anti-spam tool, it needs to distinguish between spam and non-spam, which makes non-spam equally valuable for spam-filter benchmarking.
Having a log with only spam makes it quite easy to achieve a 100% benchmark (simply reject it all!).
Couldn't find anything about this on the site, so unless I'm missing something, the value of such a log is limited at best.
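
The "simply reject it all" point can be made concrete with a small sketch. It assumes a benchmark that reports two numbers, the share of spam caught and the share of legitimate mail wrongly flagged; `benchmark`, `reject_all` and the sample messages are made up for illustration.

```python
from typing import Callable, Dict, Iterable


def benchmark(
    is_spam: Callable[[str], bool], spam: Iterable[str], ham: Iterable[str]
) -> Dict[str, float]:
    """Score a filter on both corpora; a spam-only archive can fill in only the first figure."""
    spam = list(spam)
    ham = list(ham)
    caught = sum(1 for message in spam if is_spam(message))
    false_hits = sum(1 for message in ham if is_spam(message))
    return {
        "detection_rate": caught / len(spam) if spam else 0.0,
        "false_positive_rate": false_hits / len(ham) if ham else 0.0,
    }


def reject_all(message: str) -> bool:
    """The degenerate filter: everything is spam."""
    return True


spam_samples = ["Buy cheap pills now", "You have won a prize"]
ham_samples = ["Lunch on Friday?", "Patch attached for review"]

result = benchmark(reject_all, spam_samples, ham_samples)
print(result)
# prints: {'detection_rate': 1.0, 'false_positive_rate': 1.0}
```

With only the spam corpus, `reject_all` looks perfect; the non-spam corpus is what exposes it.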
Re:Who are these guys? by Corporate+Troll · 2002-11-20 23:07 · Score: 2, Insightful
Much easier:
- Set up sendmail
- Make script that sends a mail out of a random collection of SPAM, goatse.cx pictures and viruses. Make sure that the FROM: field is faked
- For the paranoid: use free dial-up ISP in order to cover your traces.
- Set script in cronjob and let it run every minute. (or put the script in an infinite loop)
Your ex is gonna love you for that. Not that *I* ever do such things... Don't be astonished if your car is keyed the next day, by the way.
Re:Hard to get worked up about that by RebRachman · 2002-11-20 23:08 · Score: 5, Insightful

The point is that if they want to do a spam archive, you would expect them to do some minimal research. This page clearly shows that SpamArchive.org has not done the following basic background work:

1. Told me who they are so that I might trust them.

2. Told me anything about their technology/database so that I might know if it is really going to be useful. For all I know they haven't even thought about the collection, storage and retrieval issues behind dealing with this.
3. Collected and collated the supposedly uncoordinated archives that already exist.
4. Added even one link to a relevant site. You would assume that to undertake such a project they would at least have visited a few sites before concluding there was nothing out there. Posting a couple of relevant URLs wouldn't be too much work.

In short, I am not impressed that someone who can do 20 minutes of work is the same someone who can undertake the huge project proposed here. It looks like they think that somehow all they need is for people to send them information by e-mail, and for a few other people to volunteer to do the work. Not a promising start.
Is it me or by zBoD · 2002-11-20 23:18 · Score: 2, Insightful

is it exactly the same thing as www.spamrecycle.com, which has existed for a long time now?

BoD

--
BoD
What's the point? by brunnock · 2002-11-20 23:21 · Score: 5, Insightful

What's the point of testing a filter against a database of known spam if you can't test it against a database of nonspam?

Anybody can write a filter for bulk mail. How do you differentiate between solicited and unsolicited bulk mail?
Re:What if... by thing_in_itself · 2002-11-20 23:22 · Score: 3, Insightful

After a certain point though, spammers are pretty much stuck with a few basic "selling points" -- it's hard to sell something if you don't include a product description or URL or address/phone of some sort, and spam filters will evolve to catch those kinds of things unless they're stripped down to their bare bones (as in, just a random bare URL.... hey, wait, that sounds like half the e-mail I send to my friends ;).
Even then, a hypothetical "widely used" spam filter will probably include a user-specific Bayesian filter, so you can create your own local database of what tends to be spam, and more importantly, what tends not to be spam -- and your own "real mail" keywords will probably be highly specific to your interests/career. So you're basically "evolving" a personal blacklist/whitelist to go along with the global filter.
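
As a rough illustration of a user-specific Bayesian filter, here is a minimal naive Bayes sketch with add-one smoothing. It is a toy under stated assumptions, not the scoring used by any particular tool; the `TinyBayes` class, its tokenizer and the training messages are invented for the example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayes:
    """Toy per-user naive Bayes spam scorer (illustrative only)."""

    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Record one message the user has labelled 'spam' or 'ham'."""
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def spam_log_odds(self, text: str) -> float:
        """Positive means more spam-like, negative more like the user's real mail."""
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        totals = {label: sum(c.values()) for label, c in self.counts.items()}
        # Smoothed prior, so an untrained class never causes a division by zero.
        odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for token in tokenize(text):
            # Add-one smoothing with one extra slot for unseen tokens.
            p_spam = (self.counts["spam"][token] + 1) / (totals["spam"] + len(vocab) + 1)
            p_ham = (self.counts["ham"][token] + 1) / (totals["ham"] + len(vocab) + 1)
            odds += math.log(p_spam / p_ham)
        return odds


f = TinyBayes()
f.train("buy cheap pills now", "spam")
f.train("cheap pills click here", "spam")
f.train("kernel patch review attached", "ham")
f.train("meeting notes for the kernel team", "ham")

print(f.spam_log_odds("cheap pills") > 0)   # spam-like words push the score up
print(f.spam_log_odds("kernel patch") < 0)  # this user's "real mail" words push it down
```

The ham counts are what make the filter personal: "kernel" counts as evidence of real mail only because this particular user's legitimate mail contains it.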
But probably the most interesting thing about "spam evolution" is that if spam can get through a spam filter, it's going to be really toned-down and bland. That may not make a difference to you, but it'll drastically lower the spammers' response rates because their ads aren't as flashy. Less profit = less spammers. (This last paragraph wasn't "my idea" -- forget where on the web I saw it.)
Copyright by rockdreamer · 2002-11-20 23:41 · Score: 2, Insightful

Spam, like all written text, is subject to copyright.

Couldn't the spammers sue for copyright infringement?
Re:Whois.. by Anonymous Coward · 2002-11-20 23:54 · Score: 4, Insightful

So let's get this straight...

This database is run by a little-known company of mixed reputation that sells its own anti-spam tool.

It doesn't promise any new functionality that news.admin.net-abuse.* doesn't already provide. There's absolutely no reason to believe that the spams collected here will be any 'better' a sample than those collected by opening a random Hotmail account.

So, what's in it for Ciphertrust? As well as their own library of spam, they'll have a collection of e-mail addresses of people who are interested in fighting spam.

And what's in it for us? Anyone? Bueller? Anyone?
Re:How to end spam by elodan · 2002-11-21 00:29 · Score: 3, Insightful

IMO, all the spam filtering technology we're so busy inventing is missing the point to an extent. It's not so much the problem of finding the spam in your mailbox and having to delete it, as it is to do with the amount of bandwidth downloading the spam eats up.
You and I resent the time we spend deleting rude/crude/criminal/porno spam, but at the end of the day if you've got broadband you only notice the TIME lost.
A user using a cheap Linux handheld in India can't afford the bandwidth to download a hundred graphic-rich spams a day.
Bandwidth costs.
Shouldn't we therefore be looking at ways to stop the spam being sent, or at least limit the propagation of it by filtering it early in the routing process?
Unfortunately I'd guess this messing with other people's email would have legal implications, but can we work round it?
Re:How to end spam by CvD · 2002-11-21 00:32 · Score: 4, Insightful

It is still too much work for me to have to set up a new email address every time I leave it on a website somewhere.

With an advanced spam filter, you set it up and forget about it, sometimes checking your spam folder for false positives.

How do you create new email addresses? Do you have a CGI script interfaced with your alias file or so to easily make new email addresses? That would be useful.

For me it still is too much work to set up email addresses that way. And you need to start doing this from the beginning, otherwise there will still be an amount of spam that gets sent to your username@example.com address (as is the case with me).
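
A script for minting per-site addresses might look roughly like the sketch below. It is a guess at the shape of such a tool, not what the parent poster actually runs: `site_alias` derives a repeatable alias from a private secret, and `aliases_line` formats an entry in the usual `name: target` style of a sendmail aliases file. The helper names, the secret and the naming scheme are all hypothetical.

```python
import hashlib
import hmac


def site_alias(user: str, site: str, secret: bytes, domain: str = "example.com") -> str:
    """Derive a repeatable per-site address such as username-site-ab12cd@example.com."""
    tag = hmac.new(secret, site.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:6]
    label = "".join(ch for ch in site.lower() if ch.isalnum())
    return f"{user}-{label}-{tag}@{domain}"


def aliases_line(alias_address: str, mailbox: str) -> str:
    """Format one 'alias: target' entry for an aliases file."""
    local_part = alias_address.split("@", 1)[0]
    return f"{local_part}: {mailbox}"


alias = site_alias("username", "slashdot.org", secret=b"not-a-real-secret")
print(alias)
print(aliases_line(alias, "username"))
```

Because the tag is derived from the site name, the alias that starts receiving spam also shows which site leaked it.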

Cheers,

Costyn.

--
The Official Steve Ballmer Webpage
Large collection of legitimate e-mail needed more? by tschild · 2002-11-21 01:38 · Score: 2, Insightful

I don't think that a large archive of spam is hard to come by. You don't need to publicly invite submissions either - just acquire a domain and hosting with catchall e-mail service, set up e-mail forwarding to an address for your database, then publish several addresses under that domain where spammers are bound to pick them up (newsgroups, FFA lists) and register them with services that sell their e-mail lists with a lot of different demographic information vectors. You'll get as much input as you have a use for.

For calibrating spam filters you'll probably only want spam from the last few months as spam does evolve - e.g. it's mostly herb*l vi*gra these days.

What is at least equally needed but much harder to come by is a large, representative collection of legitimate e-mail, to test spam filters for false positives. This collection would need to cover diverse languages, cultures and contexts (private, business/x-industry, business/y-industry, system error messages, automatic notification messages etc.)

What is hard about this collection of legitimate e-mail is that the privacy of both sender and recipient is affected, and that, if confidential information is masked or deleted, the e-mail isn't the original one and spam filters might evaluate it differently.

There is one subset of legitimate e-mail available: public archives of mailing lists. But these e-mails don't cover the style of e-mail in other contexts.
FALSE STATEMENTS by mgkimsal2 · 2002-11-21 01:42 · Score: 4, Insightful

... and I receive zero spam

Once I receive spam on one of the addresses...

I also advertise the email address widely ...

So, you receive no spam, but when you do receive spam, you edit procmail. Which is it?

Also, you widely advertise your email address, but you don't actually use your email address, but made-up aliases. Which is it?

You're simply masking the problem, and going thru a moderate amount of gyrations (which most average joe 'net users won't/can't go through) to do so.

--
creation science book
archive of spam not all that useful by pigpen_ · 2002-11-21 02:15 · Score: 2, Insightful

A "standard" archive of spam might work great for benchmarking rule-based filters against each other, but adaptive filters, like the popular Bayesian kind, work best when they learn on your own emails and spams. There's also no point in testing an adaptive filter when you can't also feed it non-spam emails.

--
Zambozay! My brain must've been eatin' a sandwich!