Gmail Spam Filter Testing

first spam? by miketang16 · 2004-06-14 02:36 · Score: 5, Funny

psh.. i've done this to my friends before.. they didn't need to make a website to ask for it...

--
-------
"In times of universal deceit, telling the truth becomes a revolutionary act."
-- George Orwell

Re:first spam? by Anonymous Coward · 2004-06-14 03:14 · Score: 4, Funny

Oh, I didn't know that was you who passed my address along so I could b uy che.ap v1agra! Thanks! Those pi1ls made my p.e..ni.s gr0w 3-5 lnches! It was really very thoughtful of you, Mike.
Re:first spam? by Anonymous Coward · 2004-06-14 03:23 · Score: 0

yeah, me too. my cybersex with miketang16 was never better than it is now that she sold me that viagra
Re:first spam? by miketang16 · 2004-06-14 16:37 · Score: 1

i'm glad it was good for you too

--
-------
"In times of universal deceit, telling the truth becomes a revolutionary act."
-- George Orwell

The Filter is great! by umrgregg · 2004-06-14 02:36 · Score: 5, Funny

Apparently, Google's spam filter even filters messages that aren't there. From the website:

3778 messages were received, totaling 213 MB.
3917 were spam, and Gmail correctly identified 41.9% of these messages.

Fantastic

--
NMG

Re:The Filter is great! by Anonymous Coward · 2004-06-14 02:59 · Score: 5, Funny

No, that's just a classic threaded code bug:

They just forgot the mutex surrounding the two snprintfs... so this user probably got 139 messages in the time it takes to execute snprintf, all spam.

Which is.... about right.
Re:The Filter is great! by aismail3 · 2004-06-14 03:02 · Score: 5, Informative

When I add up the figures from May 13 to 19, I get that 4869 messages were received. 4717 of those were spam, and 1820 were marked, so Gmail's success rate was 38.6%.
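
The recount above is easy to check. This is a minimal sketch that uses only the figures quoted in this thread; the function names `identified_rate` and `is_consistent` are made up for illustration.

```python
def identified_rate(spam_total: int, spam_marked: int) -> float:
    """Percentage of spam messages that the filter actually marked as spam."""
    return 100.0 * spam_marked / spam_total


def is_consistent(received: int, spam: int) -> bool:
    """A spam count can never exceed the number of messages received."""
    return spam <= received


# Figures from the May 13 to 19 recount: 4717 spam, 1820 of them marked.
rate = identified_rate(spam_total=4717, spam_marked=1820)
print(f"success rate: {rate:.1f}%")

# Figures as shown on the website: 3778 received, 3917 spam.
print("website totals consistent:", is_consistent(received=3778, spam=3917))
```

The second check fails for the website's own numbers, which is the joke at the top of this subthread.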
Re:The Filter is great! by PacoTaco · 2004-06-14 03:31 · Score: 1

No, that's just a classic threaded code bug
Occam's Razor says he just can't add.
Re:The Filter is great! by digitalpeer · 2004-06-15 04:59 · Score: 1

Last Week (View)
3778 messages were received, totaling 213 MB. 3917 were spam, and Gmail correctly identified 41.9% of these messages.

And according to my calculations, he's getting more spam than he is mail.

One of the best things Google/GMail could do by Anonymous Coward · 2004-06-14 02:37 · Score: 5, Interesting

Is use the GMail data to operate a checksum blacklist. Obviously, if thousands (or millions) of their users are getting the exact same email, it's probably spam.
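
A checksum blacklist of this kind could look roughly like the sketch below. It is an illustration only, not anything Gmail is known to do: the function names, the SHA-256 choice, and the threshold are all assumptions.

```python
import hashlib
from collections import Counter


def checksum(body: str) -> str:
    """Exact-match fingerprint of a message body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def bulk_checksums(bodies, threshold):
    """Return the checksums seen at least `threshold` times across all mailboxes."""
    counts = Counter(checksum(body) for body in bodies)
    return {digest for digest, seen in counts.items() if seen >= threshold}


# Hypothetical mail seen across many accounts.
delivered = ["Buy now!"] * 5 + ["Lunch on Friday?"]
flagged = bulk_checksums(delivered, threshold=3)

print(checksum("Buy now!") in flagged)
print(checksum("Buy now! xqwv") in flagged)
```

Because a cryptographic hash changes completely when a single character changes, any per-message variation produces a digest that is not on the list.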

Re:One of the best things Google/GMail could do by kryptkpr · 2004-06-14 02:45 · Score: 4, Informative

Spammers have thought of this already, and they send nearly-identical messages.. Ever notice the random strings of letters and/or numbers at the bottom/in the subjects of spams?

--
DJ kRYPT's Free MP3s!
Re:One of the best things Google/GMail could do by lockefire · 2004-06-14 02:46 · Score: 5, Funny

Actually, I get a whole lot of emails with the random words and nothing else. I haven't quite caught on to the advertising strategy in that.
Re:One of the best things Google/GMail could do by Adhemar · 2004-06-14 02:48 · Score: 0, Redundant

Is use the GMail data to operate a checksum blacklist. Obviously, if thousands (or millions) of their users are getting the exact same email, it's probably spam.

Have you read a spam message recently?

Most of the spam messages in my inbox/spam folder tend to have strange xqwv words or rather awkward punctuation { in them. These anomalies change from message to message, even if the rest of the contents is the same. The whole point is to circumvent checksum-based blacklists.

Google has some pretty bright minds aboard, and potentially a huge amount of email to use as a corpus. I strongly believe that Google/GMail is capable of implementing a rather good email filter. But it will be a bit more complex than the solution you suggest.
Re:One of the best things Google/GMail could do by Cruciform · 2004-06-14 02:52 · Score: 5, Interesting

I've been getting them as well.
The only reason I could think of someone sending those around is to bog up Bayesian filters with random crap, possibly lowering their effectiveness.

Any spammers/spam-experts feel like enlightening us? :)
Re:One of the best things Google/GMail could do by Anonymous Coward · 2004-06-14 02:52 · Score: 2, Insightful

Anti-Spammers have thought of this, too. Things like the Distributed Checksum Clearinghouses have "fuzzy" matching.

Google also has enough computer power to generate some sort of Bayesian filter to catch the most common spam system wide, and even a personalized filter on each account to catch the rest.
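
To give a feel for what "fuzzy" matching means, here is a minimal sketch that compares messages by overlapping word triples instead of by an exact hash. This is not the DCC's actual algorithm; word shingling with Jaccard similarity is swapped in purely as an illustration, and every name in it is made up.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences, ignoring case and punctuation."""
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two messages' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


base = "cheap pills delivered to your door order today and save big"
variant = base + " xqwv"          # same spam with a random token appended
other = "are we still meeting for lunch on friday at noon"

print(similarity(base, variant))  # high: only one shingle differs
print(similarity(base, other))    # no shared word triples
```

A threshold on the similarity score then plays the role that exact equality plays in a plain checksum list.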
Re:One of the best things Google/GMail could do by Anne_Nonymous · 2004-06-14 02:52 · Score: 1, Offtopic

Try turning Javascript on, then you'll see what it's all about. (don't)
Re:One of the best things Google/GMail could do by FauxPasIII · 2004-06-14 02:53 · Score: 0

Watch the film Josie and the Pussycats. It does a pretty good job explaining this phenomenon.

Argh, I want a Big Mac !

--
25% Funny, 25% Insightful, 25% Informative, 25% Troll
Re:One of the best things Google/GMail could do by sugar+and+acid · 2004-06-14 02:53 · Score: 1

Except for the many different legit mailing lists that people subscribe to. Any kind of bulk email will be screened by this, thus crippling gmail by preventing mailing lists that people subscribe to from being delivered.
Re:One of the best things Google/GMail could do by jefe7777 · 2004-06-14 02:58 · Score: 1

>>Actually, I get a whole lot of emails with the random words and nothing else.

and when they don't get a bounce from you, what do you think that tells them?

"valid email address found boys...saddle up!"

and then your address goes on those CDs that are sold to everyone and their dog.

enjoy.
Re:One of the best things Google/GMail could do by Pharmboy · 2004-06-14 03:01 · Score: 1

Google has some pretty bright minds aboard

Yes they do, this is just one of the articles discussing this, here.

They have a much higher ratio of PhDs than Microsoft, or just about anyone short of a hospital. They also give their employees the freedom to spend 20% of their time working on any unrelated subject they choose, apparently in the hope that the outcome of this research will benefit Google, or at least will make the better PhDs, the ones with more than one iron in the fire, WANT to work for them.

--
Tequila: It's not just for breakfast anymore!
Re:One of the best things Google/GMail could do by Anonymous Coward · 2004-06-14 03:02 · Score: 0

It's called whitelisting.
Re:One of the best things Google/GMail could do by wo1verin3 · 2004-06-14 03:07 · Score: 2, Informative

It's a good thing you're not using Outlook. :)

I get those in Eudora and they don't seem to do much, my friends with Outlook however... not so lucky. :)
Re:One of the best things Google/GMail could do by pqdave · 2004-06-14 03:10 · Score: 1

You think they bother? That would require having either a valid return address or a legitimate account on the outgoing mail server. It's cheaper to just send mail to everyone--Just because there is nobody using an address now doesn't mean that there won't be next week.
Re:One of the best things Google/GMail could do by Xzzy · 2004-06-14 03:18 · Score: 3, Interesting

My server was set up to forward anything sent to one of my domains to get dumped into a common inbox. I noticed a ways back (before I changed my config to just bounce all this crap) that I'd get a lot of those dictionary emails to random email accounts.

So either it's some kind of probe to find working addresses, or a filter clogger. Or maybe both.

For a few of the random emails I would later start getting "real" spam. Not a majority though.
Re:One of the best things Google/GMail could do by xandroid · 2004-06-14 03:21 · Score: 2, Informative

Try looking at the source -- when this happens to me, I see that the random words are plaintext, and the intended advertisement is in HTML (which I've blocked).

--
$ echo "ceci n'est pas une pipe" | sed -Ee 's/(eci n|pas )//g'
Re:One of the best things Google/GMail could do by Halo1 · 2004-06-14 03:21 · Score: 5, Informative

Most of the time, these messages contain both a text/plain section with only random words, and then a text/html part with the real payload. If you use mutt or so, you most likely only see the text/plain stuff. Another trick is using just a text/html section with random text, but also with an image that contains the real payload.

--
Donate free food here
Re:One of the best things Google/GMail could do by brysnot · 2004-06-14 03:21 · Score: 1

Usually the random text is a text/plain attachment which is parsed by the spam filter. The real message is in the text/html attachment. Most email clients like outlook ignore the text/plain attachment and display only the text/html message.
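
The two-part structure described in the last few comments can be reproduced with Python's standard `email` package. This is a sketch with made-up message content; it only shows that a text-preferring client and an HTML-preferring client end up displaying different bodies of the same message.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "hello"

# First alternative: the text/plain part, filled with filler words.
msg.set_content("orchard velvet timber lantern quiet marble")

# Second alternative: the text/html part carrying the actual pitch.
msg.add_alternative(
    "<html><body><p>The real offer goes here.</p></body></html>",
    subtype="html",
)

# What a text-only client (mutt, for example) would show...
plain_view = msg.get_body(preferencelist=("plain",)).get_content()
# ...versus what a client that prefers HTML would show.
html_view = msg.get_body(preferencelist=("html",)).get_content()

print(msg.get_content_type())
print(plain_view.strip())
print(html_view.strip())
```

A filter that scores only one of the two parts sees a different message from the one the recipient reads.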
Re:One of the best things Google/GMail could do by Lord_Dweomer · 2004-06-14 03:26 · Score: 1

It's either that or it's testing which combinations of things might get through so they can include them with their REAL spam.

--
Buy Steampunk Clothing Online!
Re:One of the best things Google/GMail could do by Anonymous Coward · 2004-06-14 03:34 · Score: 0

Spammer is trying to do two things:
1. break any Bayesian filter used on that mail server/inbox
2. probe for a valid email address.

Adding noise to the filter will allow more mail through as "questionable". This might still be tagged as spam, but not as readily as it would be without the added noise.
If you get a mailing list that's 5 years old, you might want to make sure it's still receiving mail. You don't want to be sending 3000 mails to invalid addresses, now do you?
Re:One of the best things Google/GMail could do by jefe7777 · 2004-06-14 03:34 · Score: 5, Insightful

>> You think they bother?

heh heh...absolutely.

100 known good addresses are worth 10,000 "who the fuck knows" addresses.

>>It's cheaper to just send mail to everyone

no it's not.

let's pretend you are a spammer, and you want to send out spam.

If you target 1 billion questionable addresses, each time a client has a new campaign, then that's 1 billion pieces you have to deliver. every time.

what if you have 1000 clients? that's 1000 billion deliveries.

do you see where this is going? if you don't KNOW WHAT A VALID EMAIL ADDRESS IS, YOU HAVE TO GUESS.

but what if the first time you send out just a "test" to those billion addresses, and then subtract the ones that bounce?

You are left with 50,000 known good addresses.

that's gold. You now have 1/20th of the load, and you are now serving your clients quicker, with a helluva lot less load. you are only using an open relay for 1/20th of the time.

overall a smaller footprint by 1/20th.

you tell me. does it make sense to blindly blast out email?
Re:One of the best things Google/GMail could do by ryen · 2004-06-14 03:37 · Score: 3, Informative

those emails could possibly also contain embedded image tags (known as web beacons). when you open an email and attempt to 'download' the image, some server on the net knows it was you who retrieved the image and has just verified that your email address is active and spammable.
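
On the receiving side, a mail client can spot candidate beacons by listing every image that would be fetched from a remote server. This is a minimal sketch using the standard `html.parser` module; the class name, function name, and sample URL are all hypothetical.

```python
from html.parser import HTMLParser


class RemoteImageFinder(HTMLParser):
    """Collects the src of every <img> that points at a remote server."""

    def __init__(self):
        super().__init__()
        self.remote_images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if src.lower().startswith(("http://", "https://")):
            self.remote_images.append(src)


def find_remote_images(html: str) -> list:
    finder = RemoteImageFinder()
    finder.feed(html)
    return finder.remote_images


sample = (
    '<p>Hi</p>'
    '<img src="http://tracker.example/pixel.gif?id=abc123" width="1" height="1">'
    '<img src="cid:logo">'
)
print(find_remote_images(sample))
```

Images referenced with `cid:` are attached to the message itself, so displaying them contacts no server; only the remote ones can report back.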
Re:One of the best things Google/GMail could do by ckd · 2004-06-14 03:39 · Score: 4, Funny

They have a much higher ratio of PhDs than Microsoft, or just about anyone short of a hospital.

Remind me not to go to your hospital. I want MDs treating me, not people who can give me a dissertation on ancient Sumeria or something. (MDs who also know about ancient Sumeria excepted.)
Re:One of the best things Google/GMail could do by dragonman97 · 2004-06-14 03:47 · Score: 4, Interesting

Indeed - while I was doing a lot of spam fighting at work, I reviewed a honeypot I'd set up, and was amazed. I used mutt to review the messages, and found a couple of messages where the text part was a page or two from "The Wizard of Oz" and the nasty offer for some kind of auto insurance or other crap was in the HTML section, replete with hidden hash busters behind color backgrounds. These guys are sharp - they must be paying some smart programmers a lot of money, and it's only sad that they've sunk to such levels.
Re:One of the best things Google/GMail could do by letxa2000 · 2004-06-14 03:56 · Score: 5, Insightful

Spammer is trying to do two things: 1. break any Bayesian filter used on that mail server/inbox. Adding noise to the filter will allow more mail through as "questionable". This might still be tagged as spam, but not as readily as it would be without the added noise
Except that won't work, as anyone that understands Bayesian filtering will tell you. In the case of every message with "random words" I've checked recently, the random words actually increased the spam score of that message. Why? Because it seems the random words aren't so random and either the same spammer is using the same "random words" over and over or various spammers are using sets of the same words. Over time most of the "random words" they use actually become great indicators of spam since my real email doesn't typically contain the random words they use.
In one recent analysis, 10 random words were inserted by the spammer. He got lucky and 1 of those words actually had a very low score for my Bayesian corpus. Unfortunately (for him), the other 9 words had scores of 99.99%! His use of random words literally nuked any possibility of him getting through my filter.
Anyway, random words will not help spammers get through Bayesian filters. But it seems that many people (both spammers and non-spammers) think it will. But, hey, that's good for me: as long as "random words" is seen by spammers as a viable solution to Bayesian filters, my Bayesian filter will continue to work and will not have to deal with any innovative way to get around the filter (if any exists).
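
The effect described above can be sketched with the naive-Bayes combining formula popularized by Paul Graham's essay "A Plan for Spam". The poster does not say which filter is in use, so the formula, the function name `combine`, and the 0.01 stand-in for the "very low" word score are all assumptions.

```python
import math


def combine(word_probs):
    """Combine per-word spam probabilities into a single message score (0 to 1)."""
    spam = math.prod(word_probs)
    ham = math.prod(1.0 - p for p in word_probs)
    return spam / (spam + ham)


# Nine "random" words the filter has already learned as spam indicators,
# plus one lucky word with a low score (0.01 is an assumed value).
scores = [0.9999] * 9 + [0.01]
message_score = combine(scores)
print(f"{message_score:.6f}")

# For comparison: the lucky word on its own.
print(f"{combine([0.01]):.6f}")
```

One innocent-looking word cannot outweigh nine strongly spammy ones, because the per-word evidence multiplies.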
Re:One of the best things Google/GMail could do by XryanX · 2004-06-14 03:57 · Score: 1

I find that pr0n sites like to do this. They send me an e-mail with an html section that displays a collage of pictures, while the text is simply a news story (usually about Iraq).

Odd thing is that they never use punctuation, and it always looks as if a Neanderthal pieced it together. Perhaps the text message is made by a program that crawls through online news sites for similar articles, and then takes random words of it?
Re:One of the best things Google/GMail could do by FooAtWFU · 2004-06-14 04:00 · Score: 2, Informative

>>It's cheaper to just send mail to everyone
>no it's not.
It doesn't matter how cheap it is when 80% of spam supposedly comes from infected zombie computers. (I'm too lazy to actually LINK to the recent story on this.)

--
The World Wide Web is dying. Soon, we shall have only the Internet.
Re:One of the best things Google/GMail could do by FooAtWFU · 2004-06-14 04:07 · Score: 1

Notice that with Gmail you must click a special link to enable the display of external images. This is a per-message link. (I'd post a screenshot somewhere but this is /. :)

--
The World Wide Web is dying. Soon, we shall have only the Internet.
Re:One of the best things Google/GMail could do by KnightStalker · 2004-06-14 04:09 · Score: 1

Yeah. That list of 50,000 known good addresses isn't going to grow unless you blindly blast out email. Pretend you're an evil spammer who is using networks of residential windows zombie machines to send spam. Or pretend you've cracked a badly maintained server in a country where administrators might be hostile to requests from the U.S. Then you can go ahead and try dictionary attacks against every mail server you can think of without any risk, and send the bounces to whatever, and you don't have to worry about the footprint.

--
* And remember, it's spelled N-e-t-s-c-a-p-e, but it's pronounced "Mozilla."
Re:One of the best things Google/GMail could do by wickidpisa · 2004-06-14 04:52 · Score: 3, Insightful

It may not increase false negatives, but it has decent chances of increasing false positives which is a much greater problem. My best guess is that spammers are hoping that once enough random words are classified as spam words, real emails with those words will start being classified as spam. If they can force enough false positives, people will start turning off bayesian filtering.
Re:One of the best things Google/GMail could do by pqdave · 2004-06-14 04:55 · Score: 1

Either your math is off, or my math is off. My math says 1/20th of 1 billion (us) is 50 million.

I don't think spamming makes sense, blindly or semi-blindly, but I think it's likely that spammers can generate addresses via harvesting and semi-intelligent dictionary attacks with more than 1 in 20 valid addresses. If it were hard to get the ratio under 1 in 100 valid addresses I might agree that tracking bounces is a good idea.

Another thing to consider--Due to IP based filters, bounces will vary based on the sending IP address.
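
The disputed arithmetic is quick to check. This sketch uses only the numbers quoted in the two posts; the variable names are made up.

```python
total = 1_000_000_000            # the "1 billion questionable addresses"

one_twentieth = total // 20      # what 1/20th of the list actually is
ratio_for_50k = total // 50_000  # what fraction 50,000 survivors would be

print(f"1/20th of the list: {one_twentieth:,} addresses")
print(f"50,000 good addresses is 1 in {ratio_for_50k:,}")
```

So a list that shrinks to 50,000 addresses corresponds to a hit rate of 1 in 20,000, not 1 in 20.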
Re:One of the best things Google/GMail could do by kryptkpr · 2004-06-14 05:18 · Score: 1

Because simplistic checksums of spam would not be effective, the main DCC checksums are fuzzy and ignore aspects of messages. The fuzzy checksums are changed as spam evolves. Since the DCC started being used in late 2000, the fuzzy checksums have been modified several times.

While in theory a good plan, this is an uphill battle. Set up some checksums/algorithms, spammers adapt, update said algorithms, spammers adapt.. rinse and repeat.

--
DJ kRYPT's Free MP3s!
Re:One of the best things Google/GMail could do by Anonymous Coward · 2004-06-14 05:32 · Score: 0

and infected zombie computers are controlled by their masters, who know they don't STAY infected all that long.

so why waste a zombie computer's time pumping out messages to addresses of which only a few are good, when you can utilize the zombies to pump out messages to known good addresses? (or do another round of test mails for bounces to add to your known good list)
Re:One of the best things Google/GMail could do by Anonymous Coward · 2004-06-14 05:35 · Score: 0

>> That list of 50,000 known good addresses isn't going to grow unless you blindly blast out email.

he already addressed that. perhaps you should reread his post. you go hunting for valid addresses every so often, but you blast the "known goods" every day.

the dictionary attacks are for discovering, which again was already addressed.

everyone has been getting the emails with no coherent message, and no html (i.e. there wasn't advertising for anything). that's what this thread was talking about.
Re:One of the best things Google/GMail could do by jefe7777 · 2004-06-14 06:00 · Score: 1

my math was off. but my point remains the same. be it 50 million or 50 thousand out of a billion, it is still worth it for spammers to track bounces. i'm not even saying all do. I'm just saying that known good addresses are worth far more to profitable spammers and so good addresses are worth digging for using a variety of techniques.

>>but I think it's likely that spammers can generate addresses via harvesting and semi-intellegent dictionary attacks with more than 1 in 20 valid addresses.

harvesting? are you now contradicting yourself? if it's not important for spammers to have "known-good" addresses (which is my point), why bother with harvesting?

harvesting is a generic term for obtaining known good addresses.

you can harvest from websites, newsgroups, bounces, fake opt-outs , whatever...

the point of harvesting is to have a db of known-good addresses that you can blast for as long as possible.

>>If it were hard to get the ratio under 1 in 100 valid addresses I might agree that tracking bounces is a good idea.

well my 1 in 20 was "conservative" to say the least. i wouldn't doubt that a ratio way in excess of 1 in 100 is plausible.

think about a dictionary attack. how many attempts, given the domain name, would it take to find trichardson@stuff.com ?

arichardson brichardson crichardson drichardson arichards brichards crichards drichards annrichards abrichards ann_richards...to infinity

i think 1 in 100 is way too optimistic.

we're talking about people who continue even when they get less than a 5% response to their spam. if i was a spammer i'd definitely want to minimize the time & effort for that 5% response...and one significant way is to make sure your spam is going to a live person.
Re:One of the best things Google/GMail could do by Rhodnius · 2004-06-14 06:44 · Score: 1

the other 9 words had scores of 99.99%! His use of random words literally nuked any possibility of him getting through my filter.
His use of random spam words caused a thermonuclear fission explosion in your filter?
I guess that would make sure that nothing got past your filter, except maybe some shrapnel...
Re:One of the best things Google/GMail could do by letxa2000 · 2004-06-14 08:37 · Score: 1

It may not increase false negatives, but it has decent chances of increasing false positives which is a much greater problem. My best guess is that spammers are hoping that once enough random words are classified as spam words, real emails with those words will start being classified as spam. If they can force enough false positives, people will start turning off bayesian filtering.
That won't work. Please review other responses regarding Bayesian and/or read some papers on Bayesian filtering. Once you understand how it works you will see why this approach can't work. If you want me to explain it to you, I will, but it would be redundant. It's been explained many times before.
Re:One of the best things Google/GMail could do by soft_guy · 2004-06-14 12:38 · Score: 1

It is possible to get a PhD in medicine. Many researchers and medical school professors have them. Most physicians who see patients do not. An MD is more equivalent to a masters degree in terms of the time spent in school. Where MDs spend most of their time is in residency, specialization, getting licensed, etc. The actual going through medical school and getting the MD is fairly short (2-3 years.)

--
Avoid Missing Ball for High Score
Re:One of the best things Google/GMail could do by JuggleGeek · 2004-06-14 14:10 · Score: 1

Obviously, if thousands (or millions) of their users are getting the exact same email, it's probably spam.
When the NYTimes sends their daily headlines emails, I'm sure they are emailing thousands/millions of people with the exact same email. Reuters, Google, Slashdot also email me and thousands of others every day, all with the same information, all legitimate opt in mail.
Just because someone is sending thousands or millions of mail does *not* mean that that mail is spam.
Re:One of the best things Google/GMail could do by StrongAxe · 2004-06-14 14:53 · Score: 1

It's either that or it's testing which combinations of things might get through so they can include them with their REAL spam.

And just how could this possibly work? So they send you a message composed entirely of random words. It gets through your filters. But since it has no meaningful content nor valid return address, how are they supposed to know that it actually arrived? You couldn't tell them, even if you wanted to.
Re:One of the best things Google/GMail could do by StrongAxe · 2004-06-14 14:57 · Score: 1

Except that won't work, as anyone that understands Bayesian filtering will tell you. In the case of every message with "random words" I've checked recently, the random words actually increased the spam score of that message. Why? Because it seems the random words aren't so random and either the same spammer is using the same "random words" over and over or various spammers are using sets of the same words. Over time most of the "random words" they use actually become great indicators of spam since my real email doesn't typically contain the random words they use.

Their mistake is probably using a real dictionary, and then taking a uniform distribution of random words from it. Unfortunately, words in common usage form a much smaller subset of the language, so random dictionary selections are bound to hit lots of words that nobody has ever heard of.

I could get rid of half of my Nigerian 419 spams by filtering on the word 'modalities' alone.
Re:One of the best things Google/GMail could do by Lord_Dweomer · 2004-06-14 21:44 · Score: 1

You've obviously never heard of those little 1x1 pixel transparent gifs that send info to websites.

--
Buy Steampunk Clothing Online!
Re:One of the best things Google/GMail could do by Guppy06 · 2004-06-14 23:35 · Score: 1

"the text part was a page or two from "The Wizard of Oz""

So the spammers are violating intellectual property laws? Well, that's a whole different story, then! Forward this off to the feds and they'll be arrested inside of a week, likely for violations of the DMCA PATRIOT Act or some such nonsense.
Re:One of the best things Google/GMail could do by pqdave · 2004-06-15 05:00 · Score: 1

my math was off. but my point remains the same. be it 50 million or 50 thousand out of a billion, it is still worth it for spammers to track bounces. i'm not even saying all do. I'm just saying that known good addresses are worth far more to profitable spammers and so good addresses are worth digging for using a variety of techniques.

Spammers need lots of good addresses, but in general it's easier to get more good addresses than it is to weed out the bad ones from a list.

harvesting? are you now contradicting yourself? if it's not important for spammers to have "known-good" addresses (which is my point), why bother with harvesting?

Spammers need good addresses, not known-good. I'll bet that if you offered a spammer the choice of 10,000 known-good addresses, or a million addresses with 15,000 of them good, they'd take the million.

harvesting is a generic term for obtaining known good addresses.

It's a term for obtaining likely-good addresses. Many harvested addresses are bogus, just not as many as randomly-generated addresses.

you can harvest from websites, newsgroups, bounces, fake opt-outs , whatever...

the point of harvesting is to have a db of known-good addresses that you can blast for as long as possible.

well my 1 in 20 was "conservative" to say the least. i wouldn't doubt that a ratio way in excess of 1 in 100 is plausible.

think about a dictionary attack. how many attempts, given the domain name, would it take to find trichardson@stuff.com ?

arichardson brichardson crichardson drichardson arichards brichards crichards drichards annrichards abrichards ann_richards...to infinity

A simple dictionary attack will get a high hit rate against aol.com, msn.com, any of the top ISP's. For smaller domains with fewer valid addresses, use the top 500,000 most popular usernames and combine these with each domain in turn. This will be far better than random.

i think 1 in 100 is way too optimistic.

we're talking about people who continue even when they get less than a 5% response to their spam. if i was a spammer i'd definitely want to minimize the time & effort for that 5% response...and one significant way is to make sure your spam is going to a live person.

Your math is off again. Direct mail would be happy with a 5% response rate, the best estimates I've seen of spammer response rates are below 0.05%.
Spammers need to get out to as many total suckers as possible within the limits of their time and bandwidth. While decreasing the number of undeliverables is a good thing, a spammer would reach the point of diminishing returns long before doing all that would be necessary to pay attention to bounces.

He gave out his e-mail address... by Anonymous Coward · 2004-06-14 02:38 · Score: 5, Funny

... to the entire Slashdot community! Now he's going to be flooded with all sorts of spam and shit. LOL!

Oh... right. :)

Re:He gave out his e-mail address... by umrgregg · 2004-06-14 02:58 · Score: 4, Funny

Notice the reader who submitted the story was anonymous... Gotta love friends who sign you up for spam.

--
NMG
Re:He gave out his e-mail address... by Anonymous Coward · 2004-06-14 02:59 · Score: 0

All you need to do is to have a good spam filter. Try eat this : stefan.andersson@travelservice.se
Re:He gave out his e-mail address... by unknown_host · 2004-06-14 03:54 · Score: 0

The reader wishes to remain anonymous, but you can send spam^H^H mail.. to: i_love_spam@gmail.com
Re:He gave out his e-mail address... by Algan · 2004-06-14 04:31 · Score: 4, Interesting

It's not as bad as you think. I posted a dedicated email address to slashdot two times already, just to see what volume of spam I get. Surprisingly, it's only 2-3 messages every other day or so.

Well, I guess I need a booster shot, so here it is: slashdot@hates.ms. Spam away...

--
If con is the opposite of pro, is Congress the opposite of progress?
Re:He gave out his e-mail address... by LBArrettAnderson · 2004-06-14 05:42 · Score: 1

yes, and the fact that he's giving away his e-mail address to random people is the reason that he thinks the spam filter is so inaccurate. he is getting e-mails from completely normal e-mail addresses from completely normal programs with completely normal subjects and bodies. He is counting things as spam while it is not spam.
Re:He gave out his e-mail address... by Anonymous Coward · 2004-06-14 07:30 · Score: 0

Don't you mean Email you?

It's important to be helpful, I think.
Re:He gave out his e-mail address... by istewart · 2004-06-14 08:08 · Score: 2, Funny

I once posted my AIM screenname (IStewart12) to Slashdot and got a total of one message from a concerned individual warning me about the flood of IMs I was now likely to receive. Must not have been a very active thread.
Re:He gave out his e-mail address... by JuggleGeek · 2004-06-14 14:29 · Score: 1

I'll bet if you run that same test with an address that ends in .com, you'll find that you get a lot more spam.

whining? by Gothmolly · 2004-06-14 02:38 · Score: 5, Insightful

What's Google going to do to protect its users from mail bombs?

Now you're complaining that your free, 1GB-limit, access-from-anywhere email service could be mailbombed? Live with it. If Google "decides" anything more about our emails, we put on our tinfoil hats and scream. If we broadcast a bogus email address, obtained from gmail for clearly sinister purposes, and it gets mailbombed, we whine that Google doesn't "protect" us. What's the story, or are we all just schizophrenic?

Don't want that "vulnerability"? Don't use Gmail!

--
I want to delete my account but Slashdot doesn't allow it.

Re:whining? by supersnail · 2004-06-14 02:47 · Score: 5, Insightful

I don't think it's about protection, just practicality. Google offers a SPAM filter; the little prat tested it and found it wanting.

I think it's more of a problem for Google than the end users. The whole Gmail "get a gigabyte of memory free" business model is predicated on most people using only a small fraction of that gigabyte but feeling good about the capacity being there. If I open up a gmail account, get p*ss*d off with the spam and go elsewhere without closing the account, the 1G will fill up with spam in a couple of months, and Google will end up storing terabytes of spam for customers who no longer use the service.

--
Old COBOL programmers never die. They just code in C.
Re:whining? by Pharmboy · 2004-06-14 02:55 · Score: 5, Insightful

Now you're complaining...

That is his JOB, to point out shortcomings of the system. He is a tester, and he is doing it for FREE. Google doesn't want testers who get 3 emails a day, they want people to test the living shit out of the service and point out what is wrong with it. Everyone knows Google will try to fix all the bugs, so all the press, good or bad, is still good press.

If Google barfs when handling 999 messages in 4 minutes during testing, imagine when several million people have gmail accounts. Fortunately, now Google has an event to look at to see what the problem is. When you are trying to harden a system, YOU MUST BREAK IT OVER AND OVER AGAIN, to see where it is weak. This is what is happening.

My impression is that the techs at Google are spending a significant amount of time saying "oh shit, never thought of that, cool." which is the ENTIRE REASON FOR TESTING. They can't think of every situation by themselves. This is also the entire concept behind "open software is more secure". Google's gmail is going to have bugs at this stage and lots of them, period. Google knows this, hell, everyone knows this (this is why it's in testing, and not open to the public yet, duh)

It's not whining, it's stating the facts, which Google obviously WANTS him to gather, as a TESTER. Seems to me that he is going beyond the call of duty to test their servers, since he is spending a fair amount of his own time.

--
Tequila: It's not just for breakfast anymore!
Re:whining? by GeorgeH · 2004-06-14 03:12 · Score: 1

"What's the story, or are we all just schizophrenic?"

Yeah, we all have multiple personality disorder. Luckily we also have multiple bodies, so we dole these personalities out at around 1 per body.

You're complaining about the lack of consistent thought from a crowd of random web surfers...

--
Why can't I moderate something "Wrong" or at least "Grossly Misinformed"?
Re:whining? by ovlaski · 2004-06-14 03:12 · Score: 2

So what? If they have terabytes and terabytes of spam, they have a huge database to teach their filters with.
Re:whining? by thogard · 2004-06-14 03:23 · Score: 2, Interesting

If I offer 10 of my most leeching customers 1 gig of space, I will need 10 gig of space... or will I? How much of that will be duplicated between at least two users and how much of it will be used by all 10? Remember Google already has copies of almost all the useful stuff on the net. If you grab some random web page and MIME-attach it to email, that's going to waste space in my mailbox, but if Google can figure out that they already have all the images, as well as the text, it's going to compress down to very little. For the first customers it requires a massive increase in needed disk space, but at some point it starts dropping off. Sort of like how much stuff they have to index for the web and image searches.
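
To make the deduplication argument above concrete, here is a minimal sketch assuming a content-addressed store, where each attachment is keyed by its SHA-256 digest so that identical blobs are stored once no matter how many mailboxes reference them. The `DedupStore` class and its method names are hypothetical, made up for illustration; nothing here is known to be how Google actually stores mail.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): identical blobs are kept once."""

    def __init__(self):
        self.blobs = {}      # digest -> bytes, stored a single time
        self.mailboxes = {}  # user -> list of digests (references only)

    def deliver(self, user, attachment):
        """File an attachment for a user, storing the bytes only if they are new."""
        digest = hashlib.sha256(attachment).hexdigest()
        self.blobs.setdefault(digest, attachment)
        self.mailboxes.setdefault(user, []).append(digest)
        return digest

    def logical_bytes(self):
        """What the users think they are using, summed over all mailboxes."""
        return sum(len(self.blobs[d])
                   for refs in self.mailboxes.values() for d in refs)

    def physical_bytes(self):
        """What actually sits on disk after deduplication."""
        return sum(len(b) for b in self.blobs.values())


store = DedupStore()
page = b"<html>some popular web page</html>" * 1000
note = b"a unique private note"
for i in range(10):
    store.deliver("user%d" % i, page)   # ten users mail themselves the same page
store.deliver("user0", note)            # one user also has something unique

print("logical:", store.logical_bytes(), "physical:", store.physical_bytes())
```

The ten copies of the page count ten times against the users' quotas but only once against the disk, which is the "or will I?" in the post. Exact-match hashing only catches byte-identical content, so this is an upper bound on how simple the trick can be, not a claim about real savings.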
Re:whining? by cmacb · 2004-06-14 03:27 · Score: 3, Informative

Actually the TOS for Gmail says that doing things to attract spam is a violation, so they could just close the account on that basis. Also, if you don't sign on for a certain period of time (a few months I think) the account gets deleted. I had a Yahoo ID for years before I ever knew there was an e-mail address associated with it. I never read the mail associated with my AIM id and I probably still have free hotmail and a few other things like that floating around. Failure of these companies to delete idle accounts is what causes all the good names to be taken. I think Google is more on-top of this than many of the others.
Re:whining? by Valluvan · 2004-06-14 03:29 · Score: 3, Interesting

Not many are as gregarious as Pratt. I've been using gmail for some time now. I must say google has done a pretty good job with their spam filters. For not-high-volume users (which most people are), gmail works much better than other email providers (i have yahoo, ureach and hotmail accounts which I use regularly).

Of course, google should improve and filter out the occasional crap I get too. And also offer 1 TB.

--

Science as a way of life.
Re:whining? by AKnightCowboy · 2004-06-14 03:31 · Score: 2, Funny

When you are trying to harden a system, YOU MUST BREAK IT OVER AND OVER AGAIN, to see where it is weak.
Slashdot operated under that philosophy for the first 2-3 years of its existence. ;-)
Re:whining? by Beryllium+Sphere(tm) · 2004-06-14 03:43 · Score: 5, Informative

>The whole Gmail "get a gigabyte of memory free" business model is predicated on most people using only a small fraction of that gigabyte

Why?

Google uses commodity IDE drives. Those retail for about fifty cents a gigabyte. Google's not paying retail.

I read a quote from a Googleperson that by the time the drive is installed in a system, powered, cooled, backed up and administered Google is paying two dollars for a gigabyte.

Good point about the problem of abandoned accounts, which won't bring Google any ad revenue. Wouldn't be surprised if they start euthanizing inactive accounts.
Re:whining? by White+Shadow · 2004-06-14 06:14 · Score: 1

I agree, right now gmail is in a testing stage. However, this makes me wonder why they're using an invite system to get new users. It seems like that would lead to an unrepresentative sampling of users that will test their system. That is, I bet most of the people who use gmail are more technically inclined than jane average email user who uses hotmail or yahoo mail. From a system standpoint, that may be a good test, but from a usability standpoint, that seems like a huge mistake. If they want people to test the system, they should come up with a way to get average people to use the system.

On a side note, I bet joe average email user doesn't care too much about having a gig of email. I bet they would prefer being able to keep the same email address over being able to store lots of email. Hmm, someone should run some user studies to find out . . .
Re:whining? by Anonymous Coward · 2004-06-14 06:36 · Score: 0

Accounts that haven't been accessed in 9 months get nuked. Also, the spam filter (which has been pretty good to me thus far) deletes spam that's older than 30 days.
Re:whining? by Anonymous Coward · 2004-06-14 07:21 · Score: 0

Google doesn't want testers who get 3 emails a day, they want people to test the living shit out of the service and point out what is wrong with it.

If they have any sense they will want a wide range of use profiles together with some heavy-duty stress testing. But the 2-3 emails a day crowd are every bit as important. If you lose sight of this, the service (UI issues, procedures, etc.) could become unusable for just such people.
Re:whining? by Pharmboy · 2004-06-14 07:43 · Score: 1

If they want people to test the system, they should come up with a way to get average people to use the system.

1. Make it work
2. Make it profit
3. There is no step three

Right now I would imagine they are more concerned with the technical aspects first. I would bet they would then open it up to more testers to get feedback on the usability. What good is a nice interface if the system is down all the time?

--
Tequila: It's not just for breakfast anymore!
Re:whining? by Pharmboy · 2004-06-14 07:46 · Score: 1

ack, i meant 2. make it pretty

dammit, struck down by force of habit, lol. That is why I added 3 already, to simply remove the add on comments "make profit".

You have to pardon the error, I have been coding all day, and I am not a programmer. Obviously, I am not as thunk as you drink I am....

--
Tequila: It's not just for breakfast anymore!
Re:whining? by Anonymous Coward · 2004-06-14 07:53 · Score: 0

Slashdot operated under that philosophy for the first 2-3 years of its existence. ;-)

I was going to say they still are! I get 500's all the time, and can tell when they are upgrading code from time to time. Especially late at night, the site will do weird stuff. Even not allow posts for an hour or three when they are running maintenance.

I'm not complaining, I mean, it's free (especially since I pointed most of their advertisers, except ads.osdn.com, at 127.0.0.1 in my hosts file) and overall it's a great site, but it appears to be a bit buggier than it was, say, 5 years ago, although with some great improvements. Perhaps they are more into modifying the code than 5 years ago, so it may be a good thing.

BTW, whoever modded this as flamebait, "you must be new around here". What made the parent funny was the fact that it was so true...
Re:whining? by triffidsting · 2004-06-14 09:42 · Score: 1

Not sure you'd want those "good names" anyway, as the older they get, the more spam lists they get added to...

--
Non, je ne veux pas coucher avec toi ce soir.
Re:whining? by cirisme · 2004-06-14 10:33 · Score: 1

Hotmail/MSN will delete all your mail after 30 days of inactivity.
Re:whining? by Kris_J · 2004-06-14 12:53 · Score: 0, Flamebait

Good point about the problem of abandoned accounts, which won't bring Google any ad revenue. Wouldn't be surprised if they start euthanizing inactive accounts
Why?
CPU time is just as cheap and decent compression will make short work of inactive accounts. You'd have to fill your Gig with compressed attachments to prevent a nice archival system from making you inconsequential.
Re:whining? by StrongAxe · 2004-06-14 15:07 · Score: 2, Insightful

Good point about the problem of abandoned accounts, which won't bring Google any ad revenue. Wouldn't be surprised if they start euthanizing inactive accounts.

Both Yahoo and Hotmail automatically close and erase free mail accounts that are inactive for 30 days. I wouldn't be surprised if most other free email services had similar policies.

If this guy has used 30% of his capacity... by Dagny+Taggert · 2004-06-14 02:38 · Score: 3, Insightful

...how many e-mails has he received in total? I've kept spam for six months before and it totaled less than 100MB...and I get a cubic buttload of crap daily.

--
Don't be a looter...and yes, I know that it's spelled with an "A" instead of an "E".

Re:If this guy has used 30% of his capacity... by Dagny+Taggert · 2004-06-14 03:13 · Score: 0, Offtopic

Good point. Since spam does not, indeed, have volume but is two-dimensional, I should've merely said "buttload".

--
Don't be a looter...and yes, I know that it's spelled with an "A" instead of an "E".
Re:If this guy has used 30% of his capacity... by bhtooefr · 2004-06-14 03:55 · Score: 0, Offtopic

Actually, you're both wrong in one way or another. Spam is digital data. It might be represented in a two dimensional form, but it's most definitely ONE dimensional. Buttload it is.
Re:If this guy has used 30% of his capacity... by Ggggeo · 2004-06-14 05:04 · Score: 1

Then clearly he has received precisely 540 cubic buttloads of spam.
If six months at 1 cubic buttload (cb) per day is 6 months * 30 days * 1 cb = 180 cb = 100 MB, then he must have received 180 cb * 3 = 540 cb = 300 MB.

--
In God we trust...all others please have two forms of ID
Re:If this guy has used 30% of his capacity... by Zeebs · 2004-06-14 05:19 · Score: 4, Funny

and I get a cubic buttload of crap daily

God damned metric system.

--

Happy Noodle Boy says "F###ing doughnut! Mock me? You fried cyclops!!"
Re:If this guy has used 30% of his capacity... by Anonymous Coward · 2004-06-14 07:42 · Score: 0

No, he said that he's getting a cubic buttload of crap, not a cubic buttload of spam. My crap is usually 3-dimensional, but it's sometimes 2-dimensional after some bad Mexican food.

gmail still beta by ryen · 2004-06-14 02:39 · Score: 2, Insightful

isn't gmail still in 'beta' stages? if so, isn't a review of spam filtering techniques a little premature?

Re:gmail still beta by waddgodd · 2004-06-14 02:42 · Score: 5, Funny

>isn't gmail still in 'beta' stages? if so, isn't a review of
>spam filtering techniques a little premature?

What part of Beta TEST escapes you here?

--
Just because you're paranoid doesn't mean they aren't out to get you
Re:gmail still beta by AviLazar · 2004-06-14 02:49 · Score: 1

Testing occurs at all points through the process. This raw data needs to be "processed" & "reviewed" so that viable results can be determined. Once people have these results, they can try and come up with fixes for it.
The fact that this guy posted, on his website, for the net-world to see is just his way of giving the net-world an update. On a personal note, I think it was nice of him to do so. Especially since he will have to kill that e-mail account after giving it to /. people, who I am sure are running to the best and brightest spam servers and submitting his name ;)

--

I mod down so you can mod up. Your welcome.
Re:gmail still beta by ryen · 2004-06-14 02:52 · Score: 1

well.. is Google actually monitoring his account's spam filtering abilities? The gmail service may still be in beta, but from the article:
"He wants to know a)how long it takes to fill up a gig of space and b)how well Gmail's spam filters work."
it looks like this guy is on his own. and therefore any critical public review this early probably wouldn't be in google's best interest (but i think they'll have no problem getting people to sign up anyways). it would be good if he can report his 'official findings' to google to help with their filtering techniques.

Not a fair test by SWroclawski · 2004-06-14 02:40 · Score: 5, Insightful

He's not counting all the mail that Google is rejecting and not even being allowed in for further classification.

Re:Not a fair test by Plutor · 2004-06-14 04:09 · Score: 3, Insightful

Is there any evidence that Google actually does this? I would think that would be terribly non-transparent. Auto-deleting email that it's "really sure" is spam is still dangerous. Even the best-trained Bayesian filters will have false positives sometimes. Is this just random theorizing, or does GMail really fail to deliver some emails it thinks are spam?
Re:Not a fair test by SWroclawski · 2004-06-14 06:01 · Score: 3, Informative

Any evidence that they reject mail for various reasons? I'm sure there is. You can go ahead and see which RFCs they're in compliance with and which they aren't.

If you don't have a PTR record associated with your host, try to send mail to them, or malform your EHLO or something else.

You don't need to be "really sure" mail is spam. I'm talking about doing things like standards compliance checking, which will result in mail being rejected at delivery time.

Is this just random theorizing, or does GMail really fail to deliver some emails it thinks are spam?

There's no reason to get insulting. RFC 2821 has a number of requirements for delivery of mail that many services ignore.
Re:Not a fair test by Anonymous Coward · 2004-06-14 18:29 · Score: 0

'Rejecting' and 'throwing away' are not the same thing. If you reject a message, then the sending server (if it's really a server, and not a spam-drone) will return the message to the sender, with at least some sort of explanation (599 This message was considered spam because it mentioned the word 'penis'?), and the sender KNOWS it wasn't delivered. Filtering and tagging mail into a 'junk' folder, which can be checked for false positives, and/or outright rejecting it, are 100% kosher. It's only when you just /dev/null it automatically that there is a problem, because then neither the sender nor the intended recipient realizes it was lost.
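
The distinction above can be spelled out in a few lines. This is a minimal sketch, assuming only the standard SMTP convention that the first digit of a reply code gives its severity (2xx success, 4xx transient failure, 5xx permanent failure). The `POLICIES` table, the field names, and `silently_lost` are hypothetical names invented for illustration; they model the three handling options discussed in this thread and do not describe any particular server's behaviour.

```python
def reply_class(code):
    """Classify an SMTP reply code by its first digit."""
    classes = {
        2: "success",
        3: "intermediate",
        4: "transient failure",
        5: "permanent failure",
    }
    return classes.get(code // 100, "unknown")


# Hypothetical model of what each side learns under each handling policy.
POLICIES = {
    # Refused during the SMTP dialogue: the sending server sees a 5xx reply.
    "reject":  {"smtp_reply": 552, "sender_notified": True,  "recipient_can_check": False},
    # Accepted, then filed in a junk folder the recipient can review.
    "junk":    {"smtp_reply": 250, "sender_notified": False, "recipient_can_check": True},
    # Accepted, then dropped on the floor: nobody finds out.
    "discard": {"smtp_reply": 250, "sender_notified": False, "recipient_can_check": False},
}


def silently_lost(policy):
    """True when neither the sender nor the recipient can tell the mail vanished."""
    p = POLICIES[policy]
    return not p["sender_notified"] and not p["recipient_can_check"]


for name, p in POLICIES.items():
    print(name, p["smtp_reply"], reply_class(p["smtp_reply"]),
          "silently lost" if silently_lost(name) else "someone can find out")
```

Only the discard row leaves both parties in the dark, which is the case the post objects to.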
Re:Not a fair test by xYoni69x · 2004-06-19 09:48 · Score: 1

Is there any evidence that Google actually does this?

Yes there is. I tested it. Gmail rejects messages containing virus attachments.

Here is the rejection message.

This report relates to a message you sent with the following header fields:

(headers of virus-bearing message omitted)

Your message cannot be delivered to the following recipients:

Recipient address: (the gmail recipient address)
Reason: SMTP transmission failure has occurred
Diagnostic code: smtp;552 Illegal Attachment
Remote system: dns;gsmtp57.google.com (TCP| (SMTP server IP) |39259|216.239.57.27|25) (mx.gmail.com ESMTP)

--
void*x=(*((void*(*)())&(x=(void*)0xfdeb58)))();

Should be interesting, what filters? by Clinoti · 2004-06-14 02:41 · Score: 4, Interesting

Can anyone provide a link or source to the kind of filters google has working on gmail?

--

Let's keep in mind that patents are in place to keep lawyers employed and keep them litigating. -CatGrep

I'll help by L.+VeGas · 2004-06-14 02:41 · Score: 5, Funny

Let's all send him an email and ask him how it's working out.

--
Best Windows Freeware

News... by somethinghollow · 2004-06-14 02:41 · Score: 4, Funny

"Here is also an article talking about Aaron's efforts from webpronews.com"

Since we are talking about spam and obtaining more spam, I don't know if I should read the site the article is on as "web pro news dot com" or "web pron ews dot com"...

I guess I'll figure it out sometime.

Re:News... by Ggggeo · 2004-06-14 05:08 · Score: 1

Or
We B prOn EWS .com
EWS = ??? (use your imagination)

--
In God we trust...all others please have two forms of ID

pre-emptive strike theory by muyuubyou · 2004-06-14 02:41 · Score: 1

cometh to Google

Re:pre-emptive strike theory by umrgregg · 2004-06-14 02:56 · Score: 4, Funny

Right! My only idea is that Google's technology is so advanced, it filters messages before they are even sent. It's gotta be a result of faster-than-light calculations. Boy, I'm gonna buy me some stock.

--
NMG
Re:pre-emptive strike theory by unknown_host · 2004-06-14 03:44 · Score: 0

So much for Quantum Computing... its time for tachyon computing !!
Curiosity may have killed a whole cat, but Schrodinger only killed half...

Welcome to slashdot.. by Fullmetal+Edward · 2004-06-14 02:42 · Score: 0, Redundant

And now he'll have every troll and curious person here sending him spam to.

--
--- [Insert intresting Sig here]

Re:Welcome to slashdot.. by MarkPNeyer · 2004-06-14 03:37 · Score: 1

Isn't that what he'd want, anyhow?

--

My blog
Re:Welcome to slashdot.. by Anonymous Coward · 2004-06-14 03:46 · Score: 0

Sending his spam to where?
Re:Welcome to slashdot.. by Anonymous Coward · 2004-06-14 04:16 · Score: 0

He'll probably hear about the GNAA for sure...

Not that impressive by chrisgeleven · 2004-06-14 02:42 · Score: 4, Informative

Seems like Gmail only filters approx. 50% of spam. That is not very impressive, since the top anti-spam software and e-mail clients (such as Outlook 2003 and Mozilla Thunderbird) can easily reach 95% accuracy in spam filtering.

I am starting to second guess whether I should transfer everything to my Gmail account.

Re:Not that impressive by Apiakun · 2004-06-14 02:44 · Score: 5, Insightful

Don't forget that this is google's first foray into mail software, and it is still in beta. I have so far gotten very little spam in my gmail inbox.
Re:Not that impressive by XO · 2004-06-14 02:47 · Score: 4, Informative

Sure, but those will also mark virtually every legitimate email as spam, as WELL. Yeah, you can have 95% accuracy... but then you have to go through your hundreds of messages marked spam just to find your real email!

(example, after two weeks of using spam-assassin, it decided that every e-mail sent to me was spam.. i no longer received anything in my Inbox, everything was transferred to the Spambox. It took me another two weeks tweaking spam-assassin's kill rate down to about a 50% accuracy, and now i actually receive all my emails.)

--
"Champagne for my real friends - and real pain for my sham friends!" http://ericblade.postalboard.com/
Re:Not that impressive by blueskies · 2004-06-14 02:52 · Score: 1

sounds like a PEBCAK error.
Re:Not that impressive by peeping_Thomist · 2004-06-14 02:52 · Score: 4, Funny

I have so far gotten very little spam in my gmail inbox.

What was that address again?

--
Anything worth doing is worth doing badly -- G.K. Chesterton
Re:Not that impressive by Kredal · 2004-06-14 02:56 · Score: 2, Informative

tikora@gmail.com, I think.

Mine is kredal@gmail.com, if you're interested. (:

--
Whoever stated that signature sizes should be limited to one hundred and twenty characters can just go ahead and kiss my
Re:Not that impressive by javatips · 2004-06-14 03:01 · Score: 1

I'm using a mail service that has SpamAssassin (mailsnare.net). I configured my account so that SpamAssassin marks messages, but I did not create any filters to delete them or move them to a Spam folder.

Lately, I've received legitimate e-mail from someone else with a Yahoo Mail account. SpamAssassin marks it as spam.

However, my mail client, Mozilla Firebird, does not mark it as spam... so it stays in my Inbox (even though it contains the SpamAssassin header and modified title).

Actually, the Firebird Spam filter accuracy is very close to 100% with zero false positive.

Way to go for client-side spam filtering!
Re:Not that impressive by gmuslera · 2004-06-14 03:01 · Score: 1

There are other multiplatform solutions (e.g. popfile, which can easily be "plugged" into both, but there are a lot more choices available) that have over 99% accuracy, but active training is one of the biggest components of their success (every time a message is misclassified, correct the filter's choice). I hope gmail will have a far better spam detection ratio when out of beta and well managed by the user.
Re:Not that impressive by donnyspi · 2004-06-14 03:04 · Score: 1

or PICNIC (Problem in Chair, Not in Computer)
Re:Not that impressive by Apiakun · 2004-06-14 03:14 · Score: 1

Yes, as Kredal said below, it is tikora@gmail.com. Bombs away.
Re:Not that impressive by radd0 · 2004-06-14 03:23 · Score: 1

POPfile has been working quite impressively with a very high success ratio. That is, until just recently (last week) -- there has been this severe influx of German spam lately and it's only nabbing about 50% of them. That's half of between 400-900 messages a day. :-Z
Re:Not that impressive by ravydavygravy · 2004-06-14 04:01 · Score: 3, Informative

Sure, but those will also mark virtually every legitimate email as spam, as WELL. Yeah, you can have 95% accuracy... but then you have to go through your hundreds of messages marked spam just to find your real email!

Rubbish - I've used Thunderbird for many months now, with an account that gets quite a bit of spam. I have yet to see Thunderbird make a wrong guess at what's spam and what's not. If anything, Thunderbird is more likely to go the other way - allowing spam through - than deleting real email.
Re:Not that impressive by furball · 2004-06-14 04:22 · Score: 4, Funny

Mine's gdnguyen@gmail.com.

Please only email me if you're barely legal and running a webcam. Thank you.
Re:Not that impressive by XO · 2004-06-14 06:54 · Score: 1

maybe you guys just don't receive over 800 spams per day :)

when you've been using the same email address for .. 14 years.. it'll probly be different. heh.

--
"Champagne for my real friends - and real pain for my sham friends!" http://ericblade.postalboard.com/
Re:Not that impressive by ggvaidya · 2004-06-14 07:12 · Score: 1

What, you mean the address written OUT NEXT TO HIS LOGIN NAME?!?!

So much for your nick, I s'pose ;)
Re:Not that impressive by Anonymous Coward · 2004-06-14 07:22 · Score: 0

tikora@gmail.com
Re:Not that impressive by epsalon · 2004-06-14 07:53 · Score: 1

Impressive, given that Firebird does not include a mail client. You're probably referring to Thunderbird or Mozilla.

--

Make even shorter URLs - 8LN.org
Re:Not that impressive by jbaratz · 2004-06-14 08:03 · Score: 1

Alright now, let's first realize that there are TWO aspects of spam filters that must be evaluated, and thus a single metric cannot fully encapsulate information about those aspects. The measures to be concerned with are the false positive ratio and the false negative ratio.

A false positive occurs when email that is not spam is marked as such. These are often much more costly: mail marked as spam may be automatically discarded, or otherwise effectively lost to the end user. (Think about what the signal-to-noise ratio (S/N) is in your spam inbox.)

A false negative occurs when spam is not marked as such.

Any evaluation of an anti-spam technology should include these two measures. For example, in my current configuration, SpamAssassin has a false negative rate of about 25%, so it is approximately 75% correct in identifying spam. The false positive rate is 1/1898 (and the one legitimate email that it classified as spam was pretty worthless), which may be viewed as a success rate of 100% in identifying non-spam as such...

As a side note, I could factor in the total volume of mail in the corpus that I'm considering (~18000 messages), and use that to weight and combine the two numbers, and say that my spam filter is in effect 97.37% accurate, but as we've just established, that doesn't tell the real story.
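
To show why the two rates and the single combined number tell different stories, here is a small worked sketch. The counts below are hypothetical, chosen only to make the arithmetic easy to follow; they are not the poster's actual corpus, and `filter_metrics` is an illustrative helper, not part of any spam filter.

```python
def filter_metrics(true_pos, false_neg, false_pos, true_neg):
    """Compute the two error rates plus overall accuracy from raw counts.

    true_pos  = spam correctly marked as spam
    false_neg = spam that slipped through unmarked
    false_pos = legitimate mail wrongly marked as spam
    true_neg  = legitimate mail correctly left alone
    """
    spam = true_pos + false_neg
    ham = false_pos + true_neg
    total = spam + ham
    return {
        "false_negative_rate": false_neg / spam,
        "false_positive_rate": false_pos / ham,
        "accuracy": (true_pos + true_neg) / total,
    }


# Hypothetical mailbox: 1000 spam messages and 9000 legitimate ones.
m = filter_metrics(true_pos=750, false_neg=250, false_pos=1, true_neg=8999)
for name, value in m.items():
    print("%s: %.4f" % (name, value))
```

With these made-up counts the filter misses a quarter of all spam, yet the combined accuracy still comes out above 97%, simply because legitimate mail dominates the corpus. That is the sense in which a single weighted figure "doesn't tell the real story".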
Re:Not that impressive by javatips · 2004-06-14 08:07 · Score: 1

You are right... With all these name changes lately, my mind got mixed up! My mail client is Thunderbird.
Re:Not that impressive by Anonymous Coward · 2004-06-14 08:21 · Score: 0

Ironically, you probably mean firefox.
Re:Not that impressive by XO · 2004-06-14 08:49 · Score: 1

Absolutely correct. If I turn on Spamassassin's bayesian methods, it automatically determines that everything coming in is spam, and drops it in the spam folder.

Turning it off, it pretty much determines that at least a good 80% of everything coming in is spam.. but the false-positive and false-negative ratios are probably about 50% for both, at default settings. I had to adjust the point scores for MANY of its rules to get most of my legitimate email through.. which also decreased the accuracy with which it marks spam..

--
"Champagne for my real friends - and real pain for my sham friends!" http://ericblade.postalboard.com/
Re:Not that impressive by bonhomme_de_neige · 2004-06-14 12:52 · Score: 1

Seems like Gmail only filters approx. 50% of spam.
Keep in mind that his account is getting raped the shit out of in terms of incoming spam, much more so than most people's accounts are likely to be. This probably adversely affects performance.

When I first got a gmail account the filtering was mediocre ... but now it's very good ... in about a month I've had only 3 or so false negatives (apart from one really basic fake bounce that it would not accept as spam no matter how hard I tried, so I just set up a filter to trash them based on a specific string), and one false positive (which is a bit alarming! but still). That's out of not that many total emails (I'm not about to count), but certainly no worse than Mozilla's filter which I used for some time before gmail. Also, I had to train Mozilla's filter for a lot longer to get the good accuracy (both Type I and II) it has now.

--
"Why are you watching the washing machine?"
"I love entertainment, as long as it's clean"

There's Epic Imagery Here Somewhere by FearTheFrail · 2004-06-14 02:43 · Score: 0, Offtopic

Of fighting the good fight...something parallel to that of a penguin, and a gun, and millions of tiny little Windows logos charging forth...

--
___ In the words of Gen. Douglas McArthur: "I'll be right back."

Re:There's Epic Imagery Here Somewhere by Paulrothrock · 2004-06-14 03:02 · Score: 1

A digital Thermopylae.

--
I'm in the hole of the broadband donut.

Should gmail be filtering all emails by Anonymous Coward · 2004-06-14 02:44 · Score: 2, Funny

If I understand what he was talking about on his site, what he considers spam is partially legitimate mailing lists, which are not really spam even if he did not personally sign up for them. E.g., me signing him up for the gay porn of the month club. He may not want it, but I signed him up.

I just want.... by AviLazar · 2004-06-14 02:45 · Score: 2, Funny

to be able to reserve a name without numbers attached to it.... Damn it's going to be a race :(

--

I mod down so you can mod up. Your welcome.

Re:I just want.... by wo1verin3 · 2004-06-14 03:12 · Score: 1

You can get one for 5-10bux....

Try EBAY! :)

What is the big deal? by Zugot · 2004-06-14 02:45 · Score: 2, Insightful

Mozilla Thunderbird or Spamassassin will filter at least as well or even better. Is this just a test to see how quickly we can fill up gmail's disk?

--
-- Bryan

Is this the AventureMail guy? by magefile · 2004-06-14 02:45 · Score: 5, Interesting

The guy who got booted off AventureMail (2GB free) for trying to test their spam filters? The story is on Kuro5hin, if anyone wants to see it.

Re:Is this the AventureMail guy? by Anonymous Coward · 2004-06-14 03:44 · Score: 1

What a moron. The guy signs up for a free email account and when it's suspended his first reaction is to bitch about not violating the TOS, then expects an answer back?? How self-centered can some people be exactly??

If he'd started off apologizing and explaining his plan, he might have gotten some sympathy, but all he's trying to do is argue some technicality. Here's a clue dirvish: no private entity is required to give you anything for free, whether you violate their TOS or not. Here's another clue: thanks for letting me know of this service, I'm impressed that they don't take any nonsense from kiddies who haven't yet learned a thing about living in a society, this will make their service so much better for the rest of us mature users.
Re:Is this the AventureMail guy? by bhtooefr · 2004-06-14 04:48 · Score: 1

No - read "dirvish"'s blog.

The "Spam my AventureMail" page was modeled after this "Spam my GMail".

False positives? by Anonymous Coward · 2004-06-14 02:47 · Score: 0

I wonder, if any, how many messages marked as spam were false positives?

My own gmail testing by Twid · 2004-06-14 02:48 · Score: 5, Informative

I did some testing of my own. I forwarded a ton of spam from my personal account to my gmail account, just to see what would get through and what would be filtered. For me, gmail was really effective, but strangely, one Nigerian e-mail scam mail didn't get tagged.

It was from " Mr Jubril Udeh Manager of Credit and Accounts Department of North Atlantic Securities Sarls Lome-Togo Republic."

Now, the funny part is not that the mail made it through, but that google also decided to show me contextual ads on that account. Currently, the ads are:
- Payroll Cards a Poor Substitute for Checking Account
- Tips for Tackling Check Fraud
- Sophos hoax description: Ethiopian airline letter
- FAP non-US Investment FAQs

In the past the mail has also shown me ads on how to open an off-shore bank account. I'm glad google is willing to help me with the $10.5 million dollars that I'm about to receive! :)

--
- "When you want something with all your heart, the entire universe conspires to give it to you" -Paulo Coelho

Re:My own gmail testing by Anonymous Coward · 2004-06-14 03:58 · Score: 0

Dear Twid,

I am Mr. Jubril Udeh, and that mail is not a hoax, as google agrees!

I discovered an abandoned deposit in my company owned by one of our
foreign customers who died along with his entire family as a result of
an automobile crash. He actually deposited this funds amounting to
US$12,000,000.00 (Twelve million united states dollars), for safe
keeping in my company here in Amsterdam. Company file records shows that
the funds was actually for a project our late costumer wanted to start
in the near future (a multi million Dollar steel plant in Florida, USA),
before his sudden and untimely death. As such since his death none of
his relations or next-of-kin has come forward to lay claims for this
property as the heir, this is the basically the reason why I have
contacted you. My company cannot release the property unless someone
applies for claim as the next-of-kin to the deceased as indicated in our
operating guidelines.

If this proposal is acceptable by you, do not take undue advantage of
the trust I have bestowed in you, I await your urgent mail.

Best Regards,

Mr. Jubril Udeh
Re:My own gmail testing by gcaseye6677 · 2004-06-14 04:16 · Score: 1

Forwarding your spam from one account to another is probably not the best way to test spam filters, since the filters look at the source and headers from the sender to help determine what is spam. If the messages were sent in bulk, with the mail server receiving several similar messages in a short period of time, they are much more likely to be tagged as spam than if the same message was sent once. The messages that were caught in the filter during your test were probably sufficiently 'spammy' to be tagged as spam regardless of the sender.

Spam is always personalized by Sulka · 2004-06-14 02:49 · Score: 4, Informative

Checksums are nearly useless against spam. It only takes one byte to change the checksum value and probably more than 90% of spam contains a personalization code to check which addresses are functional. Different code = different checksum.

This doesn't mean it wouldn't be possible to create a system which would automatically detect individual spam messages based on tagging known spam, you just have to be smarter about the detection than just plain MD5ing the email body.
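
A minimal sketch of that point, assuming a made-up message template and made-up tracking codes (neither comes from a real spam run). The MD5 digests of two personalized copies differ, while a word-shingle Jaccard similarity still rates them as near duplicates. Shingling is just one possible "smarter" comparison, not necessarily what any real filter uses.

```python
import hashlib

# Hypothetical spam template; only the tracking code differs per recipient.
TEMPLATE = (
    "Dear friend, lowest prices on all medications shipped overnight with no "
    "prescription needed. Order today and save up to eighty percent on every "
    "refill. To unsubscribe visit our site and enter code {code} in the form."
)

def md5_hex(text: str) -> str:
    """Plain checksum of the message body."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 4) -> set:
    """Set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

msg_a = TEMPLATE.format(code="a81f3c")
msg_b = TEMPLATE.format(code="7d20be")

same_checksum = md5_hex(msg_a) == md5_hex(msg_b)
similarity = jaccard(shingles(msg_a), shingles(msg_b))

print("checksums match:", same_checksum)
print("shingle similarity:", round(similarity, 2))
```

A real system would still need a similarity threshold, and picking one that avoids false matches is the hard part.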

--
"Although it is not true that all conservatives are stupid, it is true that most stupid people are conservative."

Re:Spam is always personalized by GlassUser · 2004-06-14 03:13 · Score: 2, Interesting

gzip it and compare the files. a short tracking code will make a negligible difference.

--
funny munging
Re:Spam is always personalized by Thuktun · 2004-06-14 04:04 · Score: 4, Informative

gzip it and compare the files. a short tracking code will make a negligible difference.

Not necessarily.

Lempel-Ziv based algorithms, like the one used by gzip, build a compression dictionary on the fly. Any "personalization" added to the message will affect the dictionary to varying degrees from then onward. If it's near the beginning, the personalization would greatly skew the selected dictionary identifiers. Though probably this would have little effect on the actual compression of the data, it would radically change the representation of the compressed image. The farther this personalization is from the start of the data to be compressed, the less effect it will have.
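
A rough way to try this out, assuming Python's zlib module (the same DEFLATE algorithm gzip uses) and a made-up message body and tracking codes. It compresses two personalized copies with the code at the start and again with the code at the end, then reports the compressed sizes and what fraction of byte positions differ. No expected numbers are given here, since they depend on the text used.

```python
import zlib

# Made-up spam body; only the tracking code differs between recipients.
BODY = (
    "Lowest prices on all medications, shipped overnight. "
    "No prescription needed. Order today and save on every refill. "
    "Limited time offer for valued customers only. "
    "Click the link below to claim your discount before it expires."
)

def deflate(text: str) -> bytes:
    return zlib.compress(text.encode("utf-8"), 9)

def differing_fraction(a: bytes, b: bytes) -> float:
    """Fraction of byte positions (over the shorter stream) that differ."""
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a, b)) / n if n else 0.0

def personalize(code: str, at_start: bool) -> str:
    tag = f"ref:{code}"
    return f"{tag}\n{BODY}" if at_start else f"{BODY}\n{tag}"

# Compressed pairs keyed by where the tracking code was placed.
results = {}
for at_start in (True, False):
    a = deflate(personalize("a81f3c", at_start))
    b = deflate(personalize("7d20be", at_start))
    results[at_start] = (a, b)

for at_start, (a, b) in results.items():
    where = "start" if at_start else "end"
    frac = differing_fraction(a, b)
    print(f"code at {where}: sizes {len(a)} vs {len(b)}, {frac:.0%} of bytes differ")
```

The sizes should come out close either way, which is the "little effect on the actual compression" part; a byte-by-byte comparison of the compressed files is what gets hurt.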
Re:Spam is always personalized by FooAtWFU · 2004-06-14 04:10 · Score: 1

I'm just wondering where they'd find the processor time to do this. Their current strategy is all about cheap disk space, not about blazing fast processors. I've seen $2/gb installed (with disk, the presumable RAID, accessories, whatever) as a figure elsewhere in this discussion. But if you want to throw processors into the mix... hmm, maybe they could do it with the Google Compute feature of the Google Toolbar some day... ;)

--
The World Wide Web is dying. Soon, we shall have only the Internet.
Re:Spam is always personalized by knodi · 2004-06-14 08:11 · Score: 1

We did some research into using those dictionary algorithms to measure similarity of documents, and found that they're wildly inaccurate for small stuff; it wasn't practical for our purposes, and we were doing resumes, which are often a bit longer than spam.

--
Austin is more fun than Dallas.
Re:Spam is always personalized by bonhomme_de_neige · 2004-06-14 12:40 · Score: 1

Checksums are nearly useless against spam.

Why use checksums? Surely some clever algorithm is possible which picks up if the messages are similar enough? And if anyone can write such an algorithm effectively, it's the Google crew. See this post for clarification if you don't get what I'm talking about.

--
"Why are you watching the washing machine?"
"I love entertainment, as long as it's clean"

Wow! Lot o' spam. by justkarl · 2004-06-14 02:50 · Score: 2, Interesting

As of May 25th, he was at about 30% of his Gmail account's 1GB capacity

30% * 1000MB= 300MB of spam? I don't think I've got half of that in my life. Maybe 100MB of spam lifetime.
So, let's do more math. Avg. spam message = 5kb. Therefore, 300MB/5kb (300*1024/5) = 61440 messages? Am I right? That is a whole bunch!

Re:Wow! Lot o' spam. by justkarl · 2004-06-14 02:54 · Score: 1

Sorry, I didn't RTFA. I got excited with the math. Next time, I won't be so adventurous with my multiplication.
Re:Wow! Lot o' spam. by MarcoAtWork · 2004-06-14 06:47 · Score: 1

you mustn't have been online very long... I personally get

- about 20 megs a week on my old iname.com email address (been using it since the early 90s, had a domain registered with it, was on my homepage etc. etc. so it's in probably every spammer's list). A significant percentage of them are in some weird encoding/non-US charsets. I really can't use it for anything anymore due to the huge amounts of spam, the fact that iname.com doesn't filter it and that they no longer offer free forwarding.

- even with provider-side filtering enabled probably a few hundred spams a week on my [dictionary word]@[provider name] address: without provider-side filtering I was getting about a few hundred A DAY

- 20-30 spams/week on my various yahoo.com/hotmail.com addresses (most of them correctly tagged as junk/bulk)

just with these I could probably account for a couple of gigs of spam per year. Now, to that you could probably add the many megs of portscanning traffic I've been getting (still spam, although not email), lately I've been seeing A LOT of hosts scanning me on 63000-63008, I wonder what that is (usually get hit 2-3 times/sec all the time)

--
-- the cake is a lie
Re:Wow! Lot o' spam. by JuggleGeek · 2004-06-14 14:37 · Score: 1

61440 messages? Am I right? That is a whole bunch!
I'm getting an average of 400-500 spams a day. That means I'm getting over 61440 spams in five months - and five months from now, unless something changes, that number is likely to rise.

About spam and blocking by AviLazar · 2004-06-14 02:53 · Score: 4, Interesting

While we cannot block every domain name (i.e. if you get spam from $#(*$#sexphreak@yahoo.com) because it will alienate your legitimate contacts, there are many domain names that we can block (i.e. @spam-your-gmail.com). Yahoo provides email/domain name blocking, but limits this to 100 (unless you are paying). Do we know if gmail will have this limitation?
-A
*just for those who didn't know, the above domain names and email accounts are random, any resemblance to an actual domain or email account is purely coincidental, and if you choose to do so, you should sue /., not me :)

--

I mod down so you can mod up. Your welcome.

Re:About spam and blocking by OiPolloi · 2004-06-14 03:13 · Score: 2, Interesting

Gmail allows you to create any number (at least they don't seem to have any limit) of "filters". These are rules that allow you to manage your messages based on sender, recipient, subject, if the message has an attachment, if it has certain words, etc.

So this allows you to block some domains, if you'd like.
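
For what it's worth, a sender-domain rule like that boils down to something like the sketch below. This is not Gmail's implementation; the blocked domain is the made-up one from the parent post, and the function names are hypothetical.

```python
from email.utils import parseaddr

# Made-up example domain taken from the parent post.
BLOCKED_DOMAINS = {"spam-your-gmail.com"}

def sender_domain(from_header: str) -> str:
    """Extract the lowercased domain from a From: header value."""
    _, address = parseaddr(from_header)
    return address.rpartition("@")[2].lower()

def is_blocked(from_header: str, blocked=BLOCKED_DOMAINS) -> bool:
    """True if the sender's domain is on the block list."""
    return sender_domain(from_header) in blocked

print(is_blocked("Cheap Meds <deals@Spam-Your-Gmail.com>"))
print(is_blocked("friend@yahoo.com"))
```

Blocking by domain only helps when the spammer uses a domain of their own; forged or shared domains like yahoo.com stay out of reach, as the parent says.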

--
sena@smux.net, http://smux.net/
Re:About spam and blocking by stevesliva · 2004-06-14 03:45 · Score: 2, Insightful

I've found whitelists, combined with treating everything as junk, to be far more useful than blacklists.

--
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
Re:About spam and blocking by AviLazar · 2004-06-14 03:59 · Score: 1

While white lists are useful, they are not a complete answer - especially for those who get more desired emails than spam emails, or for those who use services such as craigslist.org, ebay, etc. and sometimes get email from unknown senders. Also, does yahoo have a white list? Someone in the past told me that if you place an address in your address book, it is considered white list material; however, I have found that this does not always work (if it does at all).

--

I mod down so you can mod up. Your welcome.
Re:About spam and blocking by IdntUnknwn · 2004-06-14 06:58 · Score: 1

"You can create up to 20 filters in Gmail." --Gmail Help Center

1gb Relieves Spam Concerns by osewa77 · 2004-06-14 02:54 · Score: 4, Interesting

I have subjected my e-mail address, afriguru@gmail.com, to the same abuse, by redirecting all e-mail addresses that receive lots of junk mail to this one and posting the address unprotected to lots of websites and newsgroups. At the initial stage, a lot of 419 scam mails got through, but now I hardly get any spam. No false positives for me so far.
_____________________
Seun Osewa, Abeokuta Nigeria

Re:1gb Relieves Spam Concerns by tji · 2004-06-14 04:25 · Score: 1

I don't think the 1GB storage helps my spam concerns. I still have to wade through the garbage in my inbox.

I .forwarded an old e-mail account to my GMail account. I had used the old account for registering a couple domain names, so it was on a lot of spam lists. I found GMail to be less than impressive in the spam filtering area. I got huge numbers of mortgage spams, which GMail never filters even after repeatedly marking them as spam.

The previous mail server I was using for that account used SpamAssassin, and barely any spam got through.

I ended up using GMail's nice filtering capability to pick off most of the persistent spammers.

If you get mailbombed... by Ieshan · 2004-06-14 02:55 · Score: 1

Select "create a filter". Do so with the text of the bomb.

Select all the messages that it displays as able to be included that you've already archived (one click).

Select "Move to trash".

Viola.

Hmmm.. weird stats... by Mz6 · 2004-06-14 02:56 · Score: 2, Interesting

Hmm.. well let's see...

His last week stats are:

3778 messages were received, totaling 213 MB. 3917 were spam, and Gmail correctly identified 41.9% of these messages.

Something is off... Unless his spams contain attachments, this says that each of his emails was 17 MB in size.

I mean 17.73708.. This is /. afterall. :)

--
Hmmm.

Re:Hmmm.. weird stats... by justkarl · 2004-06-14 03:02 · Score: 1

3778 messages were received, totaling 213 MB. 3917 were spam
Did you get that right? How can 3917 out of a possible 3778 be spam?
Re:Hmmm.. weird stats... by Satai · 2004-06-14 03:11 · Score: 4, Informative

no, you inverted it. You want MB/message, not message/MB.

3778 messages / 213 MB = 17.74 messages / MB
213 MB / 3778 messages = 0.0564 MB / message

So that's pretty reasonable.
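
The same arithmetic as a quick check, using the message count and total size quoted from the site. The KB figure assumes 1 MB = 1024 KB.

```python
messages = 3778   # messages received, as reported on the site
total_mb = 213    # total size in MB, as reported on the site

messages_per_mb = messages / total_mb    # what the parent post computed
mb_per_message = total_mb / messages     # average message size in MB
kb_per_message = mb_per_message * 1024   # same, in KB (assuming 1 MB = 1024 KB)

print(f"{messages_per_mb:.2f} messages / MB")
print(f"{mb_per_message:.4f} MB / message")
print(f"{kb_per_message:.1f} KB / message")
```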
Re:Hmmm.. weird stats... by bluebagger · 2004-06-14 03:14 · Score: 0

The problem is between the chair and the calculator.
__
Dependra cheap web hosting
Re:Hmmm.. weird stats... by Anonymous Coward · 2004-06-14 03:20 · Score: 0

As I said... his stats are off. :)
Re:Hmmm.. weird stats... by Wakkow · 2004-06-14 04:58 · Score: 1

My (not gmail) spam directory, going back less than a month, currently has 2985 messages, with a directory size of 21MB. That's about 6.87kb per message. He has an average of 55kb per message. That seems high to me.

with 1GB of storage... by Anonymous Coward · 2004-06-14 02:56 · Score: 0

... who needs spam filters anyway?

Personally by Anonymous Coward · 2004-06-14 02:57 · Score: 0

I've had no issues with my gmail account getting spam. As of right now, I've had it for about a month, with 50 megs of messages sent/received, and I've yet to find a single spam message in my inbox.

gmail spelling by Anonymous Coward · 2004-06-14 02:58 · Score: 5, Funny

>legitamate

How about having Slashdot editors/Hemos test the gmail spell checker too?

I've seen true Evil.. by ObsessiveMathsFreak · 2004-06-14 02:58 · Score: 1

And It's a Gmail account filled with spam.... ....God help us!

--
May the Maths Be with you!

This won't work for me... by Chuck+Bucket · 2004-06-14 02:58 · Score: 2, Funny

This won't work for me, how will I get emails like:

Home loans and refinancing
Proven techniques help you find a date tonight - guaranteed!
suuuper streeeeeeetch your coock
Drive that new car today
Give the girl what she needs
STRAIGHT TALK ON HAIR TRANSPLANTS
SEXUALLY-EXPLICIT: Rise N Shine, there all here
Your Degree by Fedex shipped
Make your man hood work right
Rooooock Haaaaard Ereeeectiooons In 60 Seeeeeecooooooonds
Sexually Explicit: At home mom's nude on cams
Free Phone Free Shipping Easy Qualify
get the p .e. nis si. ze she wants
Clearance on 6 MegaPixel DigiCamera
FWD: Ciialiss quazy DISCOUNTS - this is better then viagra, $2.0

** NOTICE: all subjects taken from my Yahoo email acct, from emails received this weekend. now you see why I run my own mailserver at home? **

CB

--
free ipod and free gmail!

Re:This won't work for me... by Anonymous Coward · 2004-06-14 03:46 · Score: 0

Running your own mail server isn't advised. Well, if you are on a major ISP anyway.

ISPs tend to treat any outgoing mail from their customer network as coming from a computer infected with a spam server, and block the mail.

Just something to consider if you feel like trying this yourself :)
Re:This won't work for me... by Chuck+Bucket · 2004-06-14 03:52 · Score: 2, Interesting

When signing up for my DSL I made sure all servers were OK to run. Once I had that I set up my mailserver, learned how to admin it, learned how to run SpamAssassin, gave accounts to friends/family and now I have about 10 users that hit it everyday.

So yeah, make sure it's OK with yr ISP before signing up, and then you're free to do what you'd like.

CBV

--
free ipod and free gmail!
Re:This won't work for me... by Anonymous Coward · 2004-06-14 06:40 · Score: 0

Hahah, you're getting ripped off. I've got an offer in my inbox for Cialis for only $1.29 per dose!
Re:This won't work for me... by Chuck+Bucket · 2004-06-16 03:59 · Score: 1

that's fantastic! Please fwd the offer to me...

cbd

--
free ipod and free gmail!

THAT'S NOT THE AARON PRATT YOU'RE THINKING OF!!! by DP · 2004-06-14 02:58 · Score: 0, Offtopic

i mean, not that that's not obvious.... testing the spam whatever capacities of gmail? talk about lame...

but yeah, if you're thinking, woah, i know that kid. you know, the crazy punk guy who always went on about tard factories... well, that's not him.

just so you know.

--

-- d'arcy poirot

some useful metrics by underbider · 2004-06-14 02:58 · Score: 1

somebody please tell him to report PR scores instead of Accuracy!!!!

and is that his girlfriend in the background?

How is he compiling stats? by HellKnite · 2004-06-14 02:59 · Score: 2, Interesting

Anyone know how he's pulling the numbers off the page? Is there some kind of sneaky back-end that we can get stats about our account with? Is he manually entering all this info? Or maybe some kind of "screen-scraping" techniques to pull the data off the page... hmm...

I guess because his stats are about 2-3 weeks behind, it would indicate that things are leaning towards the manual procedure...

0% Spam by yuri · 2004-06-14 03:00 · Score: 5, Interesting

Spam is unsolicited, so google should filter none of his mail.

This guy solicited it.

Re:0% Spam by Anonymous Coward · 2004-06-14 03:38 · Score: 0

Well, technically others solicited it. By his command yes, but others do the soliciting. :p
Re:0% Spam by Anonymous Coward · 2004-06-14 05:58 · Score: 0

It's funny to see how /. people are scrambling to protect their hallowed company, though the article clearly *exposes* a weakness from the side of Google.
Re:0% Spam by StrongAxe · 2004-06-14 15:20 · Score: 1

Spam is unsolicited, so google should filter none of his mail.

This guy solicited it.

He asked other people to spread his e-mail address around, so it would act as a honeypot. He did not ask the spammers themselves for spam.

Deliberately posting your e-mail address in a public forum may be imprudent, but it is no more an explicit solicitation of spam than walking into a bar wearing a miniskirt is a solicitation of rape.

how about... by moondo · 2004-06-14 03:02 · Score: 1

i wonder how it deals with spam from other countries... say, korean / chinese spam?

Re:how about... by shadowcabbit · 2004-06-14 03:53 · Score: 1

I would imagine that if the bayesian filter eats enough korean or chinese spam (and that's an interesting mental picture, feeding the server Szechuan chicken), it will be able to recognize spam in those languages just as easily. Unless it ignores non-latin character sets, which I would find to be a really dumb thing to do.
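
To illustrate why the language shouldn't matter much, here is a toy naive Bayes classifier over character bigrams, trained on a few made-up Chinese messages. Whether Gmail's filter is Bayesian at all is an assumption from the post above, and the training data and class names are invented for the example. Character bigrams are used because Chinese text has no spaces to split words on.

```python
import math
from collections import Counter

def bigrams(text: str) -> list:
    """Overlapping character pairs, ignoring whitespace."""
    chars = [c for c in text.lower() if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

class NaiveBayes:
    LABELS = ("spam", "ham")

    def __init__(self):
        self.counts = {label: Counter() for label in self.LABELS}
        self.docs = {label: 0 for label in self.LABELS}

    def train(self, text: str, label: str) -> None:
        self.counts[label].update(bigrams(text))
        self.docs[label] += 1

    def score(self, text: str, label: str) -> float:
        """Log prior plus log likelihood with add-one (Laplace) smoothing."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = sum(self.counts[label].values())
        logp = math.log(self.docs[label] / sum(self.docs.values()))
        for gram in bigrams(text):
            logp += math.log((self.counts[label][gram] + 1) / (total + vocab))
        return logp

    def classify(self, text: str) -> str:
        return max(self.LABELS, key=lambda label: self.score(text, label))

# Made-up training messages (rough meaning in the comments).
nb = NaiveBayes()
nb.train("免费赢取大奖 点击这里", "spam")   # win a free prize, click here
nb.train("低价药品 免费送货", "spam")       # cheap medicine, free shipping
nb.train("明天下午三点开会", "ham")         # meeting tomorrow at 3pm
nb.train("周末一起吃晚饭吗", "ham")         # dinner together this weekend?

print(nb.classify("点击这里 免费大奖"))     # click here, free prize
print(nb.classify("明天一起开会"))          # meeting together tomorrow
```

Nothing in the code knows what language it is looking at; it only counts symbols it has been fed, which is the point.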

--
"Why Subscribe?" Good question...

If he wanted spam ... by Anonymous Coward · 2004-06-14 03:03 · Score: 0

he should have opened a free ufie.org e-mail account.

On the contrary... by radd0 · 2004-06-14 03:04 · Score: 1

I've found GMail's filtering to be highly effective so far. I haven't received a single message yet.

No, but seriously... I've even used my gmail account for posting on Usenet newsgroups. Initially there was minor training to be done with the SPAM filter, and ever since then I have yet to see a single UCE.

Filtering could use some help by csimpkins · 2004-06-14 03:05 · Score: 2, Funny

This guy gets thousands of Spam mails without a problem, yet I can't receive a simple HTML attachment without the mail being rejected (552 Illegal Attachment). Hrmm...

Lack of updates? by Xiadix · 2004-06-14 03:07 · Score: 5, Interesting

Did anybody else notice that his site hasn't been updated in almost a month (May 25)? Seems his project is no longer working. I wonder if Google booted him.

KevG

It's going to get a lot better... by waytoomuchcoffee · 2004-06-14 03:11 · Score: 4, Interesting

For those of you that don't have Gmail yet, there is a little "Report Spam" button you can use to, well, report spam. When Gmail gets a few million users, and even 1% use this little button, you are going to see the spam detect rate skyrocket.

Re:It's going to get a lot better... by Anonymous Coward · 2004-06-14 03:53 · Score: 0

For those of you that don't have Gmail yet, there is a little "Report Spam" button you can use to, well, report spam. When Gmail gets a few million users, and even 1% use this little button, you are going to see the spam detect rate skyrocket.
Yeah, well, let's hope they implement their "Report Spam" button a lot better than AOL did. If even a single AOL user reports your mailing as spam, that's it, your server is blacklisted and often times you don't even get reject messages, your emails just disappear into the ether. The hoops you have to jump through in order to be able to mail AOLers again are ridiculous. We stopped accepting @aol.com as member contact addresses for this reason, AOL users simply can't sign up for our service.

Yahoo has been slightly problematic in that if some folks report a mailing as spam, it tends to get automatically reshuffled to _all_ Yahoo recipients' "Bulk Mail" folder (even the ones who didn't report it as spam). At least they can still get to it if they look for it, and unlike AOL's strategy, we can continue to send mail.

Hotmail doesn't give us any problem at all. I'm not sure whether that says that their spam filtering is poor or whether they just aren't so aggressive at enforcing one user's definition of spam across all other users.

[For the record, our site lets users play free java-based games. Registration is required and there is an optional newsletter that members must manually tick a checkbox in order to be added to. Yet they'll still call it "spam" when it comes in...]

*cough* by HedonismBot · 2004-06-14 03:13 · Score: 1

...or a mailing list.

--
Sailors. Oh man!

Offtopic, sorry by natefanaro · 2004-06-14 03:16 · Score: 1

Has anyone gotten gmail to work in konqueror? I can get the login but nothing else. I just don't understand why they don't make it compatible. If it works in Safari for OS X then shouldn't it also work in Konqueror?

I just got an invite code and they only allow Konqueror at work. Any attempt at installing a different browser will be noticed (the admins are power crazy) and the resulting beating won't be worth it.

Re:Offtopic, sorry by Anonymous Coward · 2004-06-14 06:45 · Score: 0

good point, I wonder why they didn't think of that one. "I just don't understand why they don't make it compatible" he says. This kid's a genius!
Konqueror and Safari use different engines, btw.

How to never get spam by Ars-Fartsica · 2004-06-14 03:17 · Score: 1

Don't let your email address appear in a public forum of any kind that is or can be crawled. I have employed this technique and can say with a straight face that in six months I have not received ONE piece of spam.

Re:How to never get spam by hey · 2004-06-14 03:24 · Score: 1

That's not totally true.

Spammers can simply guess your address... eg they try all the "known" addresses at a domain- webmaster@domain, info@domain, etc.

For big ISPs (eg AOL) they just try every possible address. eg a@domain, .... zzzzz@domain, etc.
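
A back-of-the-envelope sketch of that guessing attack, assuming lowercase letters only and using example.com as a stand-in domain. It lists candidates in the a, b, ..., zzzzz order described above and counts how many there are up to a given length.

```python
from itertools import islice, product
from string import ascii_lowercase

def candidate_locals(max_len: int):
    """Yield a, b, ..., z, aa, ab, ... up to max_len letters."""
    for n in range(1, max_len + 1):
        for combo in product(ascii_lowercase, repeat=n):
            yield "".join(combo)

def address_space(max_len: int, alphabet_size: int = 26) -> int:
    """Number of distinct local parts of length 1..max_len."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))

# example.com is a placeholder domain for illustration.
first = [f"{local}@example.com" for local in islice(candidate_locals(5), 3)]

print(first)
print(address_space(5), "addresses of up to 5 letters")
```

Around twelve million guesses for five letters is cheap for a spammer; the count grows by a factor of 26 with every extra character, so long addresses get expensive to guess quickly.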
Re:How to never get spam by mumblestheclown · 2004-06-14 03:37 · Score: 5, Funny

Hi! And welcome to the Internet! We're glad to have you aboard.
Just to get you started, I'll give you a quick hint: virtually every internet discussion on spam includes some high and mighty moron that claims that by not giving out his email address, he never gets spam.
The problem is that for every one of those, there are plenty more who follow the same precautions and yet get plenty of spam to those accounts for a variety of reasons. Clearly, your solution is not the answer to "how to never get spam."
A good rule for using the internet is to read a few discussions before you post. This way, you will be less likely to post something that makes you look naive. So sit back, relax, and enjoy a steaming hot cup of STFU while you read and learn!
Re:How to never get spam by Frobisher · 2004-06-14 04:52 · Score: 2, Interesting

Which is why my email address is llllllllooong. 20 characters before the @ sign. I don't post it anywhere and I think 20 chars is outside the range of such brute force methods. I've been spam free for about 2 years. And I mean spam FREE. I get NOTHING to my junk mail folder. It's marvellous!
Re:How to never get spam by Anonymous Coward · 2004-06-14 04:53 · Score: 0

So I take it your posts to alt.binaries.pets got crawled?
Re:How to never get spam by mdielmann · 2004-06-14 05:47 · Score: 1

And the absolutely funniest thing about your post? mumblestheclown (569987) welcoming Ars-Fartsica (166957). Well done!

--
Sure I'm paranoid, but am I paranoid enough?
Re:How to never get spam by chrysrobyn · 2004-06-14 08:27 · Score: 1

Don't let your email address appear in a public forum of any kind that is or can be crawled. I have employed this technique and can say with a straight face that in six months I have not received ONE piece of spam.

That's great for you. That works for some people. Additional tips: never give your e-mail address to any friends who mass mail jokes / family news with you in the to: field or cc: field -- only bcc:, because you never know who it gets forwarded to. Also, skip people who may get viruses that send e-mail to people in their address books. More? Keep your e-mail address non-dictionary guessable, that's even including multiple words especially for the big ISPs. Don't even think of putting your e-mail address on business cards and then giving them out to potential customers / clients. Even this is a partial list. Merely keeping your e-mail off public forums (especially Usenet) isn't good enough. Heck, the address that owns your vanity domain name is just as susceptible. Oh, jeez, and the friends who send those silly free e-postcards to a personal (intentionally spamfree) address cannot be chastised enough.

My personal address is very well protected. I don't get many, but I do get the occasional non-dictionary, non-forum, non-anything spam. All businesses get their unique address, like so many other /.ers do. My point is this: your 6 months are great. They were like my first 18 months, but the last 41 months of being roughly as careful as you don't yield 100% success due to external uncontrollable factors.

Professional addresses, however, require filtering because they need to go out to potential business contacts who may not be courteous with my address.
Re:How to never get spam by Polybius · 2004-06-14 09:31 · Score: 1

University emails are the worst for this. It doesn't matter if you never even use the account, it will definitely get filled up with spam. I hated my University of Iowa email because of how simple it was for spammers to get lists and mass mail the campus.
Re:How to never get spam by Anonymous Coward · 2004-06-14 11:57 · Score: 0

please shut the fuck up, or use some less specious logic.
Re:How to never get spam by JuggleGeek · 2004-06-14 14:22 · Score: 1

Hiding your address is an effective solution for some people. It is *not* an effective solution for everyone.

Re:help! by timbos · 2004-06-14 03:18 · Score: 1

Sure, it's:
prattboy@gmail.com

Cache? by Freon115 · 2004-06-14 03:20 · Score: 5, Funny

Do you really expect the Google servers to go down because of /.? ;)

Re:Cache? by leo_llew · 2004-06-14 03:33 · Score: 5, Funny

Obviously not, they provided a link to the GOOGLE Cache ;)
Re:Cache? by Calamity+Jane · 2004-06-14 15:34 · Score: 2, Informative

The cache link is pointing to the cache of his website, not of Google's.

Viola by doodlelogic · 2004-06-14 03:21 · Score: 4, Funny

If I could stop all the spam I get...I'd feel like a whole string quartet!

Re:Viola by Ieshan · 2004-06-14 04:15 · Score: 1

Okay. Okay. You win. :-\

more power to him! by dioscaido · 2004-06-14 03:25 · Score: 1

Personally, I guard my gmail account as if it were more valuable than Gold.

I waited so long for the invite! And I got exactly the name I wanted, given it's so early in the system's lifespan. Now I bask in the admiration of other geeks as they receive my e-mail from gmail. :-)

I hope they aren't going to kick us all off our accounts once the beta is over....

Re:more power to him! by chrisgeleven · 2004-06-14 03:29 · Score: 1

You're kidding, right? Google would never erase everyone's accounts and start over when the beta is done. That is a public relations nightmare.
Re:more power to him! by AKnightCowboy · 2004-06-14 03:38 · Score: 1

I waited so long for the invite! And I got exactly the name I wanted, given it's so early in the system's lifespan. Now I bask in the admiration of other geeks as they receive my e-mail from gmail. :-)
Umm, you do realize it's not going to be any cooler than a hotmail.com or yahoo.com address in about 6 months right? In fact, that reminds me that I should probably start filtering out mail from gmail.com addresses before the forged spam starts rolling in. I just run my own mailserver with my own domain and I have about 35 gigs free on my mail account right now. :-/
Re:more power to him! by dioscaido · 2004-06-14 03:46 · Score: 1

Well, if I ever got an e-mail from someone at hotmail that wasn't an odd mnemonic of their common name, or with random numbers at the end, I'd be impressed! :) It's not the gmail that I'm psyched about (although right now it is somewhat impressive), but the fact that it's my nickname with no mnemonics or numbers (I have a very common nickname).
Re:more power to him! by JuggleGeek · 2004-06-14 14:44 · Score: 1

Well, if I ever got an e-mail from someone at hotmail that wasn't an odd mnemonic of their common name, or with random numbers at the end, I'd be impressed! :)
But since you hide your email address ("guard it as if it were more valuable than gold", in your words) you aren't likely to get legitimate email from many people, regardless of their email address.

This filter is not adequate by leo_llew · 2004-06-14 03:28 · Score: 1

If this is really the latest status and it's only capable of 50% filtering accuracy, they should rethink releasing it and maybe adopt a better (trainable) filter like Dspam, which can reach an accuracy of more than 99%.

It works really great and Spam is not an issue for me anymore (I had more than 50 Spammails in my inbox AFTER mozilla-thunderbird filtering...)

Careful by Woogiemonger · 2004-06-14 03:30 · Score: 1

When spamming his test account, if you send him a spam-like message and it's recognized as spam, GMail might start thinking that your email address is a spam source and suddenly you won't be able to email anyone who uses Google.

Re:Careful by EdMcMan · 2004-06-14 07:00 · Score: 1

I sure hope that's not the case!

If so, that would be a major problem. You could forge spam mails and block emails from anyone.

Subject line of the day by jargoone · 2004-06-14 03:37 · Score: 1

I laughed when I saw his "Spam subject of the day" section on the updates. My friends and I have a not-quite-daily SSLOTD (Spam Subject Line Of The Day) that we send out. For a long time, most of the subjects fit this pattern:

(verb) her (noun) with your (adjective) (noun)

Hint: the nouns refer to naughty bits. You can figure out the rest. :)

Why do they count? by DrEldarion · 2004-06-14 03:38 · Score: 1

Wait, so spam counts towards the limit in GMail? That sucks horribly. It's possible that you could run out of space just from spam alone? (Although it WOULD take a while...)

It would be nice if they would do what Yahoo Mail does and have the spam folder not count toward your total.

Re:Why do they count? by IdntUnknwn · 2004-06-14 06:43 · Score: 1

Well, you should probably periodically delete your spam rather than let it build up; you have no reason for keeping it.

didn't somebody already sort of attempt this? by cks3 · 2004-06-14 03:39 · Score: 2, Informative

Oh, wait, it was me! http://slashdot.org/comments.pl?sid=105335&cid=8965252

Eh, I only got 180MB worth of email and spam out of the deal though, before I decided to delete the account. The Gmail Spam filter was rather horrible at the time; catching only the most tried and true SPAM, letting tons of other SPAM through, and then randomly flagging legitimate messages from people whom it had not flagged before. I think it has improved some since then.

--
http://www.sampletheweb.com

Wow by EaterOfDog · 2004-06-14 03:40 · Score: 5, Funny

His wang is going to be huge!

--

Crushing my karma one post at a time.

What about the rest of us? by gnugrep · 2004-06-14 03:55 · Score: 1

I've been reading for months about people using gmail. When are the rest of us going to get an account?

Paid yahoo is better by Avumede · 2004-06-14 04:03 · Score: 2, Insightful

I pay the $20 for extra Yahoo email, and I have to say that their spam filtering is much better than gmail's right now. I have about 10 spams a day to clear out of gmail, where with Yahoo it's more like 1, often 0.

People that don't pay for Yahoo don't seem to get such good spam filtering, though.

Google can definitely do better.

Re:Paid yahoo is better by MalikChen · 2004-06-14 04:24 · Score: 1

I pay the $20 for extra Yahoo email, and I have to say that their spam filtering is much better than gmail's right now.

The thing is, you paid $20 for it. He didn't. Unless he was one of those schmucks who bought it off of ebay...
Re:Paid yahoo is better by Avumede · 2004-06-14 05:29 · Score: 1

Very true. My point is not necessarily that it is a better value, but that it is possible to do better than Google currently is.

I should have mentioned that Yahoo does occasionally have false positives, and several people seem to have their otherwise normal mail tagged as spam.

I redirected an old address by FooAtWFU · 2004-06-14 04:03 · Score: 2, Interesting

I redirected an old manager@(two letters here).net site so Gmail gets a carbon copy of all the spam sent there (it's lots, trust me). At first it seemed that my Thunderbird Bayesian filters were doing better, but the trend seems to have reversed lately.

No, I'm not keeping proper statistics. =b

--
The World Wide Web is dying. Soon, we shall have only the Internet.

Calculations? by haxor.dk · 2004-06-14 04:08 · Score: 2, Insightful

So, in less than a month, he has received in excess of 300 Megabytes of useless junk?

I think somebody needs to recalculate exactly how much bandwidth goes to waste because of this SPAM plague. The cost in global comms traffic must be staggering!

Re:Calculations? by Anonymous Coward · 2004-06-14 05:11 · Score: 0

After getting back from a two-weeks trip I got 27005 mails in about 133 Mbytes - I would hit 300 MB in a month, too!

(And it is getting worse...)

http://bloodgate.com/spams/

Cheers,

Tels
Re:Calculations? by haxor.dk · 2004-06-15 02:56 · Score: 1

Nice site, thanks for link.

More focus on false positives. by ron_ivi · 2004-06-14 04:14 · Score: 5, Insightful

Reviews of spam filters always seem to focus on how much stuff they block.

The consequences of blocking a non-spam email (parent not hearing from kid, the customer that would have saved your startup) are so much worse than a spam getting in. I wish the spam filter reviews would focus on those.

Re:More focus on false positives. by Major_Small · 2004-06-14 05:44 · Score: 1

most of the tests I read/care about focus mostly on the false positive : spam getting past the filter ratio...
Re:More focus on false positives. by Anonymous Coward · 2004-06-14 05:49 · Score: 4, Informative

false positive : spam getting past the filter ratio...
A false positive is not one of spam getting past the filter, it's one of non-spam getting blocked.
I.e. the filter says it's spam, and it isn't - in the same way that a false-positive medical test says you have a virus even when you don't.
Re:More focus on false positives. by einTier · 2004-06-14 08:09 · Score: 2, Informative

False positive = condition you are testing for comes up positive, when it should be negative.
False negative = condition you are testing for comes up negative, when it should be positive.
Put in the context of a spam filter, it depends on whether you are testing for spam or for legitimate emails. If you are testing for spam (if spam then...), a false positive would be an email that is not spam getting sent to the spam folder or deleted. A false negative would be spam that lands in your inbox.
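
To make the terminology concrete, here is a minimal Python sketch of the four outcomes when the condition being tested for is "this message is spam". The function name `classify_outcome` is made up for illustration and does not come from any particular filter.

```python
def classify_outcome(is_spam: bool, flagged_as_spam: bool) -> str:
    """Name the outcome when the filter is testing for spam."""
    if flagged_as_spam:
        # The filter said "spam": right if it was spam, otherwise a false positive.
        return "true positive" if is_spam else "false positive"
    # The filter said "not spam": wrong if the message was actually spam.
    return "false negative" if is_spam else "true negative"


# Legitimate mail sent to the spam folder.
print(classify_outcome(is_spam=False, flagged_as_spam=True))
# Spam that lands in the inbox.
print(classify_outcome(is_spam=True, flagged_as_spam=False))
```

If the test were flipped to "this message is legitimate", the two error labels would swap, which is why it helps to state which condition is being tested.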

--
-------------------------------------------------- $665.95 -- retail price of the beast.
Re:More focus on false positives. by Anonymous Coward · 2004-06-14 08:40 · Score: 0

If you are testing for spam (if spam then...), a false positive would be an email that is not spam getting sent to the spam folder or deleted. A false negative would be spam that lands in your inbox.
That's the way I see it too. With this definition, a false positive (legitimate email not getting through) is potentially far more harmful than a false negative (spam getting through).

It Must Be Those Pigeons... by Anonymous Coward · 2004-06-14 04:17 · Score: 0

Working on Gmail...

Now my friends are spamming me by mparaz · 2004-06-14 04:30 · Score: 2, Funny

"Please invite me to GMail!"

Dumb question about SPAM filters.. by StressGuy · 2004-06-14 04:32 · Score: 3, Interesting

I have Mozilla, it has a Bayes SPAM filter. Lately, it's been getting fooled more and more. The messages that make it through have one or more of the following features:

1) Several intentionally mis-spelled words

2) Lots of text in white (so it's invisible or nearly invisible)

3) Message in .GIF form only - no plain text.

Could you add filters that look for, say, more than 10% of the words mis-spelled, text font nearly equal to background color, or no actual text in message? These would take effect in addition to the existing Bayes filter.
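
A rough Python sketch of two of the proposed extra checks (share of misspelled words, and no actual text) might look like the following. Everything here is an assumption for illustration: the `known_words` set stands in for a real spelling dictionary, the 10% threshold is taken from the question above, and the white-on-white check is left out because it would need HTML and style parsing. It is not how Mozilla's filter works.

```python
import re


def misspelled_ratio(text: str, known_words: set[str]) -> float:
    """Fraction of words in `text` that are not in `known_words`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in known_words)
    return unknown / len(words)


def extra_checks(text: str, known_words: set[str],
                 max_misspelled: float = 0.10) -> list[str]:
    """Return the reasons (if any) a message trips the extra heuristics."""
    reasons = []
    if not text.strip():
        # Covers the "message is only an image" case once attachments are stripped.
        reasons.append("no plain text in message")
    elif misspelled_ratio(text, known_words) > max_misspelled:
        reasons.append("too many misspelled words")
    return reasons


# Hypothetical toy dictionary; a real one would be far larger.
known = {"buy", "cheap", "pills", "now", "please", "call", "me", "tomorrow"}
print(extra_checks("please call me tomorrow", known))
print(extra_checks("b uy che.ap p1lls now", known))
print(extra_checks("", known))
```

A fixed misspelling threshold would also catch plenty of legitimate mail from careless typists, so rules like these would probably work better as extra evidence than as hard blocks.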

--
A goal is a dream with a deadline

Re:Dumb question about SPAM filters.. by scifience · 2004-06-14 05:51 · Score: 1

Could you add filters that look for, say, more than 10% of the words mis-spelled...

If you did this, almost all the e-mail sent by AOL users would get filtered out.

Wait, that might not be so bad...
Re:Dumb question about SPAM filters.. by gnu-generation-one · 2004-06-14 08:39 · Score: 1

"Could you add filters that look for, say, more than 10% of the words mis-spelled"

So that'll kill any email from your friends who weren't already blocked by the "from AOL" or "HTML" filters...
Re:Dumb question about SPAM filters.. by Mard · 2004-06-15 00:40 · Score: 1

Could you add filters that look for, say, more than 10% of the words mis-spelled, text font nearly equal to background color, or no actual text in message?

These improvements come with the added benefit of blocking every legitimate @aol.com email I have ever received. Somebody implement them immediately!

--
DRM = Digitally Restricted Media. This is a viral sig, pass it on.

Lousy spam filtering by Animats · 2004-06-14 04:43 · Score: 1

I'm disappointed in Google. Their spam filtering is apparently only 25-50% accurate. I would have expected better.

Single-user spam filters have to solve a tough problem, but Gmail can use a multi-user spam filter, which recognizes similar spams mailed to different mailboxes. The fundamental property of spam is that similar messages go to many people. Google can exploit that, much as Spamcop does.

In theory, Google should be able to recognize spam far more reliably than single-user spam filters. And this is a search problem, something Google is good at. What's wrong over there?
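
As a sketch of the multi-user idea (and only a sketch, not a description of what Google or Spamcop actually do), one could fingerprint each message body and count how many distinct mailboxes have received the same fingerprint. The class name `BulkDetector`, the normalization rule, and the threshold of three mailboxes are all assumptions made for this example.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash a crudely normalized body so trivial variations collide."""
    normalized = re.sub(r"[^a-z]+", " ", body.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class BulkDetector:
    """Flags bodies that have been seen in many different mailboxes."""

    def __init__(self, min_mailboxes: int = 3):
        self.min_mailboxes = min_mailboxes
        self.seen = defaultdict(set)  # fingerprint -> mailboxes that received it

    def observe(self, mailbox: str, body: str) -> bool:
        """Record a delivery; return True once the body looks like bulk mail."""
        recipients = self.seen[fingerprint(body)]
        recipients.add(mailbox)
        return len(recipients) >= self.min_mailboxes


detector = BulkDetector(min_mailboxes=3)
print(detector.observe("alice", "Buy now!!"))
print(detector.observe("bob", "buy NOW"))
print(detector.observe("carol", "Buy... now"))
```

Exact hashing like this is easy to defeat with per-recipient random text, and legitimate mailing lists also reach many mailboxes, so a real system would need fuzzier matching and some way to exempt wanted bulk mail.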

A detection problem by Donny+Smith · 2004-06-14 05:21 · Score: 1

>And this is a search problem,

A detection problem, actually.

They can _find_ spam (and all other) messages; the problem is how to tell which ones are not legitimate while keeping false positives to a minimum.

don't count on stupidity -- by Heisenbug · 2004-06-14 05:35 · Score: 1

I assumed the same thing as the grandparent when I saw those emails -- that they were trying to get normal words marked as spam words, and make the filters less effective with normal messages. It would appear, though, that they're not very bright yet -- they're not targeting the low-scoring words. I expect that'll change before too long. What'll happen to your filter when all of the lowest scoring words it knows suddenly become the highest-scoring?

I don't actually know -- but I do know that you aren't the only one with access to those percentages.

Re:don't count on stupidity -- by letxa2000 · 2004-06-14 08:20 · Score: 1

It would appear, though, that they're not very bright yet -- they're not targeting the low-scoring words. I expect that'll change before too long. What'll happen to your filter when all of the lowest scoring words it knows suddenly become the highest-scoring?
How in the world are the spammers going to target my low-scoring words? Let's see, some of my low-scoring words:
1. Header "EDS". Probably because I know someone that works at EDS.
2. Header "BAY1". Who knows where that comes from, but one of my frequent contacts must have that in a header.
3. Body "ADC". Probably because I talked about A/D Converters from time to time.
4. Body "BCD". Probably because I talk about Binary Coded Decimal from time to time.
5. Body "GND". Probably talking about electrical grounds.
Anyway, that's a few of my sub-1% Bayesian tokens. How does that compare with yours? Or the low-scoring tokens of an accountant? Very little overlap I'd suspect. So how in the world is a spammer going to target low-scoring terms? If they knew them then they'd just slide their spam right past these filters. But they don't know them, they can't know them, and even if they somehow hacked into your system and got your Bayesian statistics, it won't help them get past anyone else's.
Random words and text insertion basically represent spammers kicking and flailing as they drown in the sea of Bayesian anti-spam filters.

Aventuremail not as tolerant by dirvish · 2004-06-14 05:42 · Score: 3, Interesting

I tried to do the same thing with my AventureMail account but AventureMail wasn't cool with it. They deleted my account! You can check out what little data I collected before the account suspension and read the emails to and from AventureMail about the merits of the account suspension at http://3fingersalute.net/aventuremail

--
FoundNews.com - get paid to blog.,

Publisher's Clearing House? by nandhp · 2004-06-14 06:06 · Score: 1

I'm sending prattboy a free DVD player (I got an ad for one) and it also offered me $1,000,000 from Publisher's Clearing House and it had the following pre-filled in and it wasn't the name I entered for the DVD player (Tester, Gmail): Girl, Pratt heow, AR 12333

sent the spam to prattboy@gmail.com by Anonymous Coward · 2004-06-14 06:32 · Score: 0

Since we all know that slashdot is a favorite place to harvest addresses for spam, listing the guy here should do the trick : prattboy@gmail.com
prattboy@gmail.com

Yale Story by dirvish · 2004-06-14 06:35 · Score: 2, Interesting

Here is a discussion from Yale's LawMeme on the legal ramifications of Prattboy's experiment. Does asking others to sign you up for spam count as an opt-in?

--
FoundNews.com - get paid to blog.,

Up to date by Local+Echo · 2004-06-14 06:51 · Score: 1

Am I in the wrong month here? Doesn't his web page say the last stats are for May 25th? This is June, people; his stats are over 2 weeks old. Doesn't he update this page or what?

New spin on the "word salad" strategy by Scott+Richter · 2004-06-14 07:03 · Score: 5, Interesting

Except that won't work, as anyone that understands Bayesian filtering will tell you. In the case of every message with "random words" I've checked recently, the random words actually increased the spam score of that message. Why? Because it seems the random words aren't so random and either the same spammer is using the same "random words" over and over or various spammers are using sets of the same words. Over time most of the "random words" they use actually become great indicators of spam since my real email doesn't typically contain the random words they use.

Right, and my Thunderbird Bayesian filter catches all of those word salad approaches. But they've come up with a new one - what I call the "encyclopedia attack."

What they do is copy an encyclopedia entry and put it at the bottom of their spam. The thing is usually a few paragraphs long, so that textually it dominates the message. The subjects are fairly random, and are occasionally educational ;)

The problem is that the text of this doesn't trip the "too many strange words" flag that's used for word salads. My Thunderbird filter is really having trouble with these. Anyone else having trouble with these spams?

Re:New spin on the "word salad" strategy by letxa2000 · 2004-06-14 08:08 · Score: 1

What they do is copy an encyclopedia entry and put it at the bottom of their spam. The thing is usually a few paragraphs long, so that textually it dominates the message. The subjects are fairly random, and are occasionally educational. The problem is that the text of this doesn't trip the "too many strange words" flag that's used for word salads. My Thunderbird filter is really having trouble with these. Anyone else having trouble with these spams?
I've seen excerpts from books, the Constitution, etc. I haven't had a message like that get past my filter ever, as far as I know. Unless they got dang lucky and sent you an encyclopedia entry for a topic you often discuss it shouldn't have any significant effect. It doesn't matter if the encyclopedia entry "dominates" the spam text. If the spam is spammy and the encyclopedia text is "neutral" (which it will be unless the spammer gets lucky and picks a topic you often discuss) then all the neutral words in the world aren't going to compensate for a few good spammy words. It's not enough to be "neutral" you have to be downright good. Unless they can send a message with headers that are close to what my friends' mails have, unless they know my friends' names, unless they know the topics I often discuss, they're just not going to be able to break through my Bayesian filter by "swamping" it with neutral text. It just doesn't make a difference.
Re:New spin on the "word salad" strategy by Scott+Richter · 2004-06-14 09:24 · Score: 1

I've seen excerpts from books, the Constitution, etc. I haven't had a message like that get past my filter ever, as far as I know.
Perhaps my Thbird filter is just too new - my old Mozilla database was huge, but I started over a few months ago.
It doesn't matter if the encyclopedia entry "dominates" the spam text. If the spam is spammy and the encyclopedia text is "neutral" (which it will be unless the spammer gets lucky and picks a topic you often discuss) then all the neutral words in the world aren't going to compensate for a few good spammy words.
Not so sure about that. If a spam consisted of the words "Buy my viagra," that would be a spam. If those three words were interspersed through an article, I highly doubt it would be tagged as spam. So dilution should be a factor. I don't know exactly how Thbird implements it, but in standard Bayes theory, this is a problem.
It's not enough to be "neutral" you have to be downright good.
Only if you have the threshold on your filter cranked down pretty far.
Unless they can send a message with headers that are close to what my friends' mails have, unless they know my friends' names, unless they know the topics I often discuss, they're just not going to be able to break through my Bayesian filter by "swamping" it with neutral text. It just doesn't make a difference.
Then you've implemented your filter to approximate a whitelist, while most people implement theirs to be more like a blacklist. Particularly for those of us who need to be reachable by people who have never emailed us before, cranking down the level that far isn't an option. As such, neutral things have to be classified more as ham than spam.
Re:New spin on the "word salad" strategy by letxa2000 · 2004-06-14 10:24 · Score: 1

Perhaps my Thbird filter is just too new - my old Mozilla database was huge, but I started over a few months ago.
I'm guessing that's it. Things like this will cause a much more severe reaction when the corpus is small.
Me: It doesn't matter if the encyclopedia entry "dominates" the spam text...
You: Not so sure about that. If a spam consisted of the words "Buy my viagra," that would be a spam. If those three words were interspersed through an article, I highly doubt it would be tagged as spam. So dilution should be a factor. I don't know exactly how Thbird implements it, but in standard Bayes theory, this is a problem.
It'd only be a problem if you're using some Bayesian filter that works on word pairs or context. The simple Bayesian filter proposed by Graham almost two years ago is simply based on tokens. It doesn't matter where they appear in the body of the message, just that they appear. So the phrase "Buy my Viagra" is going to score identically to those same words spread throughout the article. Considering spammers like to try to embed words in small fonts or white-on-white color, the simple approach proposed by Graham makes much more sense than a more complicated multi-word Bayesian filter that looks for word combinations.
Me: It's not enough to be "neutral" you have to be downright good.
You: Only if you have the threshold on your filter cranked down pretty far.
I think that's wrong. I was going to say that your experience is very different than mine but, actually, I think that's wrong.
Due to the way Bayesian works, if you have 40k of completely neutral words and, say, 5 or 6 spammy words, that's going to get tagged as spam regardless of whether you set your threshold to 90%, 50%, or 30%. Neutral words that have a spam probability of, say, 50% just aren't going to be considered for determining whether a message is spam or not. Those words lose importance in the spam decision. The best the spammer could do is try to dilute so many words that all your words were "neutral" and no words were "good" and, thus, it'd be impossible to determine spamminess since no word would be particularly good or spammy. But in reality it's not possible to dilute the value of all words, and diluting the value of the good words is particularly difficult since those same words will be getting flagged as spammy by other users who don't have those same words as "good" words. Not to mention they don't know what your good words are to start with.
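
The argument above can be checked with a small Python sketch of a Graham-style token filter. The combining formula and the default of 15 "most interesting" tokens follow Graham's essay "A Plan for Spam"; the per-token probabilities below are invented for the example, and real filters differ in many details.

```python
from math import prod


def combine(probs):
    """Naive Bayes combination of per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def graham_score(token_probs, n_interesting=15):
    """Score a message using only the tokens farthest from neutral (0.5)."""
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    return combine(interesting[:n_interesting])


# Made-up numbers: five strongly spammy tokens buried in 400 slightly
# ham-leaning "encyclopedia" tokens.
spammy = [0.99] * 5
noise = [0.45] * 400

print(graham_score(spammy + noise))  # only the 15 most interesting tokens count
print(combine(spammy + noise))       # every token counts: the noise wins
```

With these numbers the first score stays very close to 1 and the second collapses toward 0, which matches both sides of the thread: dilution only matters when every token is allowed into the calculation, or when the padding tokens are themselves strongly "good" for that user.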
Me: Unless they can send a message with headers that are close to what my friends' mails have, unless they know my friends' names, unless they know the topics I often discuss, they're just not going to be able to break through my Bayesian filter by "swamping" it with neutral text. It just doesn't make a difference.
You: Then you've implemented your filter to approximate a whitelist, while most people implement theirs to be more like a blacklist. Particularly for those of us who need to be reachable by people who have never emailed us before, cranking down the level that far isn't an option. As such, neutral things have to be classified more as ham than spam.
Uh, no, sorry. Perhaps I misstated myself. If a spammer wants to get through, he is going to have to do the above (know my friends' names, topics I discuss, etc.) to get his spam through and probably have to lose most of the content of the spam he wants me to see. If he wants to tell me to "Buy my Viagra" at the very least he's going to have to know some characteristics of my "good" words and even then he's going to have a hard time getting through if he's talking about Viagra, using red font color, etc. A completely neutral, non-spam message from someone I've never heard from before is going to be neutral and, as such, won't be filtered. Very, very few spams are "neutral." Even when they're trying to dilute Bayesian filters by using random words, their messages are still very
Re:New spin on the "word salad" strategy by MobiusKlein · 2004-06-14 12:08 · Score: 1

All you need to do here is report them to the FBI for copyright infringement!

Hmm - what's the penalty for that per instance?
rbb
Re:New spin on the "word salad" strategy by Scott+Richter · 2004-06-14 12:17 · Score: 1

The challenge for spammers is exactly what Graham said: They need to make their messages look like what your "good" messages look like.
The question is, is your filter identifying ham or spam? Also, what does a "good" message look like? If one has a diversity of "ham" relative to its population size, then it's hard to characterize them. At that point, the task of identifying spam is almost solely based on the characteristics of the spam, as ham can look like anything. If it's a reasonable assumption that ham is ill-defined, then masking can go a long way to getting spam through.
On an empirical level, we have two observations: 1) your Bayesian filter is working fine with "encyclopedia" spams, and 2) mine isn't. I've been training mine for 2 months, and it catches 100% of word salads, and maybe 20% of "encyclopedia" spams. That's a real problem.
I think 2 months training should certainly be enough. The question is, why is it not working, because it's clear that it's not. We'll probably agree that the root cause is that your database is older, broader, and better characterized. I would guess that this allows your ham to be better characterized, while mine is more fuzzy. In other words, my filter may be partially handicapped compared to yours.
From a mathematical standpoint, my ham database is likely a sparse space, which causes myriad problems with such calculations. Even then, it's only the new techniques that are causing problems.
In short, it shouldn't take so long to effectively train a filter. And for young filters that inherently have to separate the spam from the noise rather than the ham, this is a significant problem indeed as dilution really is a problem in such instances. I do agree that, if your filter is working with such well-defined ham, that dilution won't work, because it can separate ham from neutral, meaning that having neutral looking spam won't work. Since my filter can't, it does work.
Re:New spin on the "word salad" strategy by letxa2000 · 2004-06-14 13:43 · Score: 1

The question is, is your filter identifying ham or spam? Also, what does a "good" message look like? If one has a diversity of "ham" relative to its population size, then it's hard to characterize them. At that point, the task of identifying spam is almost solely based on the characteristics of the spam, as ham can look like anything. If it's a reasonable assumption that ham is ill-defined, then masking can go a long way to getting spam through.
Just because you have lots of different types of ham doesn't mean it's any harder for Bayesian to identify it. In the end, it's a simple game of statistics. And it's a game that works very well. One ham doesn't have to look like all the other ham, it just has to look different than spam.
Additionally, people who receive lots of different types of ham are in the definite minority. The vast majority of email users have a relatively short list of contacts that'll eventually produce some fairly predictable ham. Those of us that receive lots of email on lots of subjects from lots of never-written-before users are in the definite minority. And a minority of spam is going to get through our filters anyway. At that point the spammers will be targeting a minority of the minority, and that minority is extremely anti-spam... sounds like a losing business model to me.
On an empirical level, we have two observations: 1) your Bayesian filter is working fine with "encyclopedia" spams, and 2) mine isn't. I've been training mine for 2 months, and it catches 100% of word salads, and maybe 20% of "encyclopedia" spams. That's a real problem. I think 2 months training should certainly be enough. The question is, why is it not working, because it's clear that it's not. We'll probably agree that the root cause is that your database is older, broader, and better characterized. I would guess that this allows your ham to be better characterized, while mine is more fuzzy. In other words, my filter may be partially handicapped compared to yours.
My Bayesian corpus was started in May 2003--just over a year ago. It actually hovered around 99.5% for the first 3 months, then was in the 99.8x% range for about 4 months, and it hasn't dipped below 99.9% for the last 5 months and has been peaking at around 99.98%. My corpus has 9518 good messages and 133,466 spams. The few spams that get through these days are actually some bounces from viruses (which I don't count as spam nor do I report them as spam which is why they still get through from time to time), one or two foreign-language spams, and a few spams that were getting through because they were using whitelisted email addresses from the same domain (I have since modified the whitelist to work on the NAME of the person rather than the email address).
I agree with you, you probably just don't have a finely-tuned Bayesian filter yet. But that's not an inherent flaw in Bayesian, it's just a matter of being patient. If you keep with it Bayesian is going to work great for you--the dilution tactics might just mean that you have to be patient in training your Bayesian filter longer than was necessary a year ago. The end result will be the same, though.
Also, while you are training the Bayesian filter, alternative filters are definitely a plus. In the filter I developed and use (see sig line), the user has the option of enabling common keyword filters that have an updated list of known spam phrases, domains, etc. This helps detect spam while the Bayesian filter is still getting up to speed. Such standard filters are a very important part of helping tune the Bayesian filter initially without having to depend entirely on the user. Once the Bayesian filter is trained, the archaic keyword filters can be disabled. At this point I don't use the keyword filters at all--I depend entirely on Bayesian.
Re:New spin on the "word salad" strategy by mibus · 2004-06-14 15:01 · Score: 2, Insightful

Anyone else having trouble with these spams?

Surely it's the people who aren't having this problem that you want to hear from - they're the ones with good spam filtering ;-)
Re:New spin on the "word salad" strategy by Scott+Richter · 2004-06-14 15:02 · Score: 1

Just because you have lots of different types of ham doesn't mean it's any harder for Bayesian to identify it. In the end, it's a simple game of statistics. And it's a game that works very well. One ham doesn't have to look like all the other ham, it just has to look different than spam.
Au contraire, given the way Bayes' rule works, a posteriori probabilities are intimately related to the statistical variance.
P(spam|X) = P(X|spam)P(spam)/(P(X|spam)P(spam) + P(X|ham)P(ham)) is the adapted Bayes rule as it works with spam, where P refers to conditional or overall probabilities, and X is a given mail signature. For high-variance ham, the problematic term is P(X|ham), which will result in little difference between noise and ham. Put it this way - if you can email yourself a page from an encyclopedia (without spam) and it isn't flagged as spam, then your filter can't tell ham from noise.
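
The rule quoted above is easy to play with numerically. The following Python sketch simply evaluates it; the function name and the example likelihoods are made up, and the prior of 0.5 is an assumption chosen to keep the arithmetic simple.

```python
def posterior_spam(p_x_given_spam: float, p_x_given_ham: float,
                   p_spam: float = 0.5) -> float:
    """P(spam|X) from Bayes' rule, with P(ham) = 1 - P(spam)."""
    p_ham = 1.0 - p_spam
    spam_term = p_x_given_spam * p_spam
    ham_term = p_x_given_ham * p_ham
    return spam_term / (spam_term + ham_term)


# Signature X is common in spam and rare in ham: the posterior is high.
print(posterior_spam(p_x_given_spam=0.8, p_x_given_ham=0.1))
# When ham is so varied that X is about as likely under ham as under spam,
# the posterior falls back to the prior and the signature tells us nothing.
print(posterior_spam(p_x_given_spam=0.8, p_x_given_ham=0.8))
```

The second call illustrates the concern about high-variance ham: as P(X|ham) approaches P(X|spam), the evidence stops moving the posterior away from the prior.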
For the math above, particularly for highly dimensional spaces (like the many descriptors adapted to email signatures using filters) with few samples, Bayes can have issues. Basically, the fewer samples and more dimensions and greater variance per dimension, the coarser the space must inherently be to derive useful statistics. Look at it this way - to get a description of the probability of getting spam signature X from either ham or spam, we have to map the total space spanned by the mails. If we have 10 descriptors, and each is even binary, and we need at least 10 datapoints per cell to get statistics, that means we need at least 10,000 messages. That should give some idea of the problem. Less variance makes the space more dense and inherently more manageable. Duda and Hart's book "Pattern Classification and Scene Analysis" gives a better description than I can, particularly the treatment on Parzen windows and other methods of dealing with the so-called "curse of dimensionality" with Bayesian models.
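
The message count in that example comes from multiplying out the cells of the descriptor space. A tiny Python helper (hypothetical name, same assumptions as the paragraph above: independent binary descriptors and a fixed number of samples per cell) shows how quickly it grows:

```python
def min_messages(n_descriptors: int, values_per_descriptor: int = 2,
                 samples_per_cell: int = 10) -> int:
    """Messages needed to put `samples_per_cell` points in every cell."""
    cells = values_per_descriptor ** n_descriptors
    return cells * samples_per_cell


print(min_messages(10))  # 2**10 = 1024 cells, times 10 samples = 10240
print(min_messages(20))  # doubling the descriptors multiplies the need by 1024
```

Ten binary descriptors give 1,024 cells, so 10 samples per cell means 10,240 messages, in line with the "at least 10,000" figure quoted above.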
I agree with you, you probably just don't have a finely-tuned Bayesian filter yet. But that's not an inherent flaw in Bayesian, it's just a matter of being patient. If you keep with it Bayesian is going to work great for you--the dilution tactics might just mean that you have to be patient in training your Bayesian filter longer than was necessary a year ago. The end result will be the same, though.
I agree - the only thing is, at this point last time - before the "encyclopedia" crap started - I had much more success. My old database was about the size of yours, and worked well. But the thing is, on the type of spam I got with that one, my current one works fine, so who knows. Time will tell.
Also, while you are training the Bayesian filter, alternative filters are definitely a plus. In the filter I developed and use (see sig line), the user has the option of enabling common keyword filters that have an updated list of known spam phrases, domains, etc. This helps detect spam while the Bayesian filter is still getting up to speed. Such standard filters are a very important part of helping tune the Bayesian filter initially without having to depend entirely on the user. Once the Bayesian filter is trained, the archaic keyword filters can be disabled. At this point I don't use the keyword filters at all--I depend entirely on Bayesian.
That's the winning strategy. Personally, since I'm curious, I like to see the dropoff as it learns, but I certainly wouldn't recommend that for common practice. ;)
Re:New spin on the "word salad" strategy by letxa2000 · 2004-06-15 01:59 · Score: 1

Me: Just because you have lots of different types of ham doesn't mean it's any harder for Bayesian to identify it.
You: Au contraire, given the way Bayes' rule works, a posteriori probabilities are intimately related to the statistical variance. P(spam|X) = P(X|spam)P(spam)/(P(X|spam)P(spam) + P(X|ham)P(ham)) is the adapted Bayes rule as it works with spam, where P refers to conditional or overall probabilities, and X is a given mail signature. For high-variance ham, the problematic term is P(X|ham), which will result in little difference between noise and ham. Put it this way - if you can email yourself a page from an encyclopedia (without spam) and it isn't flagged as spam, then your filter can't tell ham from noise.
Again, that's not a problem. If you mail yourself a page from an encyclopedia with no spam then it shouldn't be flagged as spam. The purpose of the Bayesian filter isn't to differentiate ham from noise, the purpose is to differentiate ham from spam. The only question is whether the insertion of noise in spam has any significant effect on the ability of a Bayesian filter to detect spam. It shouldn't, at least once it is properly trained.
As I have already conceded, the use of random words may prolong the training period somewhat in unusual situations such as yours where you receive a lot of mail from unknown senders talking about a large number of topics. But you are definitely out of the ordinary when compared to the bulk of email users. The use of random words may prolong your training period somewhat, but it's going to have almost no effect on a more typical user of email. Certainly, the use of random words cannot achieve the spammers' ultimate goal of defeating Bayesian or making it worthless.
If we have 10 descriptors, and each is even binary, and we need at least 10 datapoints per cell to get statistics, that means we need at least 10,000 messages. That should give some idea of the problem. Less variance makes the space more dense and inherently more manageable.
This is consistent with what I said earlier: The dilution caused by the spammers' use of random words may require that a new Bayesian user be patient for a longer period of time before Bayesian reaches optimum filtering levels. But I don't believe anything has contradicted my statement that a ham doesn't have to look like the rest of your ham for it to not be filtered. An unknown ham will look like noise, and pure noise shouldn't be filtered by Bayesian--only spam. So an unknown ham just has to look different than spam. And if your ham doesn't look different than spam, well, I feel for you. :)
Re:New spin on the "word salad" strategy by Scott+Richter · 2004-06-15 02:56 · Score: 1

Again, that's not a problem. If you mail yourself a page from an encyclopedia with no spam then it shouldn't be flagged as spam.
Then your filter isn't so attuned to your ham as you think. You claimed that your filter knows your email so well that an incoming email doesn't just have to be neutral, but downright good. If you can email yourself an encyclopedia entry from a new, neutral account, that's not the case. At that point, you're not recognizing ham, because you can't. You're recognizing spam, and will be hampered somewhat by dilution.
The use of random words may prolong your training period somewhat, but it's going to have almost no effect on a more typical user of email. Certainly, the use of random words cannot achieve the spammers' ultimate goal of defeating Bayesian or making it worthless.
If I'm having 80% failure on encyclopedia attacks after 2 months, that's getting close to worthless.
An unknown ham will look like noise, and pure noise shouldn't be filtered by Bayesian--only spam.
That's contrary to what you stated earlier, and is precisely what I originally claimed. But if you start down that road, then dilution does become a problem - spams don't have to look like your ham, simply like noise, as your filter has to let noise through since it can't tell ham from noise.
Re:New spin on the "word salad" strategy by letxa2000 · 2004-06-15 09:35 · Score: 1

Then your filter isn't so attuned to your ham as you think. You claimed that your filter knows your email so well that an incoming email doesn't just have to be neutral, but downright good. If you can email yourself an encyclopedia entry from a new, neutral account, that's not the case. At that point, you're not recognizing ham, because you can't. You're recognizing spam, and will be hampered somewhat by dilution.
Dilution doesn't work. :)
See below.
If I'm having 80% failure on encyclopedia attacks after 2 months, that's getting close to worthless.
How many ham and spam are in your Bayesian statistics? It's obviously not the time that makes Bayesian improve in accuracy, it's the amount of data.
Me: An unknown ham will look like noise, and pure noise shouldn't be filtered by Bayesian--only spam.
You: That's contrary to what you stated earlier, and is precisely what I originally claimed. But if you start down that road, then dilution does become a problem - spams don't have to look like your ham, simply like noise, as your filter has to let noise through since it can't tell ham from noise.
Ok, either I'm missing your point or you are missing mine. I went back through the thread and I'm not sure where I contradicted myself. So let's try again. We have several possibilities:
1. Ham, which is what you know for a fact you want to see and is probably from people you've talked to before or on topics you normally discuss.
2. Spam, which is what you know for a fact you don't want to see.
3. Ham Noise, which is mail you want but it might be someone you've never heard from talking about an unusual topic that you don't usually talk about (although I would think the email you receive should either be from someone you've talked to before or on a topic you've discussed before. An unknown person emailing you out of the blue about a topic you never discuss strikes me as relatively unlikely, even for admins on the far side of the bell curve).
4. Spam with noise, which is definitely spam that you know you don't want to see, but has "noise" injected to "dilute" it.
I hope we can agree that the first two are the "extreme" cases and are easily recognized by Bayesian.
So the question is, is there a difference between "ham noise" and "spam with noise?" The answer is definitely yes.
If someone I've never heard from before sends me an email out of the blue discussing the meaning of life (which is unlikely to start with), that's ham noise. There's not going to be anything particularly innocent nor particularly damning about it. It's going to be quite neutral and a Bayesian filter is going to let that through unless the spam threshold is set aggressively low.
However, if a spammer sends a spam that's trying to sell me Viagra and is using standard spammer tricks (hiding dictionary attacks in white text, using red fonts to make their sales pitch stand out, including links to domains we've never seen before or using IP addresses instead of domains, using lots of HTML comments to break up words, etc.) and also embeds the exact same noise as the neutral message above, does that spam magically become neutral? Definitely not. Bayesian only looks at the most interesting aspects, or terms, of the message. While there wouldn't be anything particularly interesting in "ham noise" that would lead to a high spam score, a spam is going to be just as spammy with or without a bunch of neutral text. The neutral text would only "dilute" the spam score if every word is included in the spam probability calculation. I don't know of any Bayesian implementation that recommends that approach precisely for this reason.
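A rough sketch of that argument, assuming Graham-style scoring where only the tokens farthest from 0.50 are combined. The function name, the per-token probabilities, and the amount of padding are all made up for illustration; a real filter would derive the probabilities from its own ham and spam corpora.

```python
def combined_spam_score(token_probs, n_interesting=15):
    """Combine per-token spam probabilities using only the most 'interesting' tokens."""
    # Keep the tokens whose probability is farthest from the neutral 0.5.
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:n_interesting]

    # Naive Bayes style combination of the selected probabilities.
    prod_spam = 1.0
    prod_ham = 1.0
    for p in interesting:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


# Hypothetical token scores: 15 strongly spammy tokens, 500 neutral ones.
spammy_tokens = [0.99] * 15
neutral_padding = [0.5] * 500

plain_score = combined_spam_score(spammy_tokens)
padded_score = combined_spam_score(spammy_tokens + neutral_padding)
noise_only_score = combined_spam_score(neutral_padding)
```

With these made-up numbers the padded spam scores exactly the same as the unpadded spam, because none of the neutral tokens make the top 15, while the noise-only message stays at 0.5. The sketch only covers padding that is neutral for the recipient; padding made of words the filter already considers strongly innocent could displace spammy tokens from the top 15.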
You look at the 15 most interesting terms (at least in the Graham-advocated approach); those that are furthest from 0.50, so you're looking at only terms that are extremely spammy or extremely innocent. All that neutral text is

Improvement over time. by Jett · 2004-06-14 07:48 · Score: 2, Interesting

I've had a gmail account for almost 3 months now. In the first month I got 3 spam messages, they all made it thru the filter. Since then I've gotten 5 more, only 1 of which made it thru. It's not statistically significant yet, but to me it feels like the filter has improved. I'm already up to 5% of my 1gig too...
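For anyone curious how far from significant those numbers are, here is a quick back-of-the-envelope check. It assumes a one-sided Fisher-style test: if the filter had not changed, the 4 spams that got through would be spread at random across all 8 messages. The function name and the choice of test are assumptions, not anything Gmail reports.

```python
from math import comb


def one_sided_p(early_total, early_missed, late_total, late_missed):
    """Chance of at least this many early misses if the filter never changed."""
    total = early_total + late_total
    missed = early_missed + late_missed
    k_max = min(early_total, missed)
    # Hypergeometric tail: every split with at least as many early misses as observed.
    tail = sum(
        comb(early_total, k) * comb(late_total, missed - k)
        for k in range(early_missed, k_max + 1)
    )
    return tail / comb(total, missed)


# First month: 3 spams, all 3 got through. Since then: 5 spams, 1 got through.
p_value = one_sided_p(early_total=3, early_missed=3, late_total=5, late_missed=1)
```

That works out to 5/70, or roughly 0.07, which is above the usual 0.05 cutoff and in line with the "not statistically significant yet" hunch.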

Image-Based Spam and Checksums by BarefootClown · 2004-06-14 08:29 · Score: 2, Interesting

What about checksumming at least the image-based spam? Scan the e-mail for image links (or images included inline). If there's a link, check it against the known list of spam links. If it's in the list, mark the message as spam. Spammers will quickly figure that trick out, though, so step two would be for Google to follow those links and retrieve the images. Run a checksum of the image file itself; if a lot of messages (say, a thousand) include the same image, tag it as spam. This combines spam filtering with the fun of reminding spammers that Google has an order of magnitude more bandwidth than they do. Use their own messages against them: the more you spam, the bigger the Slashdotting (Googling? Alas, that word's already taken.)
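A minimal sketch of the counting step, assuming the image bytes have already been fetched. The class name, the use of SHA-256, and the in-memory dictionary are illustrative choices, not anything Google is known to do.

```python
import hashlib

SPAM_THRESHOLD = 1000  # "say, a thousand" messages sharing one image


class ImageChecksumTracker:
    """Counts how many distinct messages carry a byte-identical image."""

    def __init__(self, threshold=SPAM_THRESHOLD):
        self.threshold = threshold
        self.messages_by_digest = {}  # digest -> set of message ids

    @staticmethod
    def checksum(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def record(self, message_id, image_bytes):
        """Note that this message contains this image; return the digest."""
        digest = self.checksum(image_bytes)
        self.messages_by_digest.setdefault(digest, set()).add(message_id)
        return digest

    def is_spam_image(self, image_bytes):
        """True once enough distinct messages have included the same image."""
        seen = self.messages_by_digest.get(self.checksum(image_bytes), set())
        return len(seen) >= self.threshold


# Small demo with a lowered threshold and fake image data.
tracker = ImageChecksumTracker(threshold=3)
for msg_id in ("m1", "m2", "m3"):
    tracker.record(msg_id, b"fake-banner-image-bytes")
tracker.record("m4", b"holiday-photo-bytes")
```

An exact checksum only matches byte-identical files, so a spammer who changes a single pixel per message would slip past this version.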

For bonus points, keep the downloaded images in the Google cache; keeps them available for the mail user, alleviating the load on the sending site for legitimate messages, and keeps them available for, well, the Google cache.

--

"Make it ten--I am only a poor corrupt official."
--Captain Louis Renault (Claude Rains), Casablanca

Christina Aguilera by smallguy78 · 2004-06-14 09:27 · Score: 0

Why does he have a picture of Christina Aguilera for his website background image?

--
Nothing costs nothing

Re:Christina Aguilera by Anonymous Coward · 2004-06-14 09:48 · Score: 0

Dude...because it's Christina Aguilera.

Posting anonymous for obvious reasons...

What's EWS? by Anonymous Coward · 2004-06-14 10:10 · Score: 0

"Here is also an article talking about Aaron's efforts from webpronews.com"

Can you imagine my disappointment when I visited WebPronEws to find out what kind of porn "ews" is, and it turned out to be yet another dull tech site? :^(

Sigh.

I went to college with aaron...not surprised by kc8jhs · 2004-06-14 10:16 · Score: 1

I'm not surprised that this is the type of thing he would be known for. This, or starting a website for people to complain about that school.

P.S. I told him about this /. post this afternoon, he was unaware of it at the time.

-Mikey P

Invite Plz!!! by KageMonkey · 2004-06-14 10:53 · Score: 0

Through all that spam, I wonder if prattboy will notice if I asked him for a Gmail invite.

The Solution to Spam by vakuona · 2004-06-14 11:39 · Score: 2, Interesting

I have a few ideas that may overcome spam.

This may require an overhaul of the email system, though. One idea is to have multiple addresses bound together: you would give out one email address, and only "authenticated" or approved contacts would get your second address. An email sent simultaneously to both addresses would be delivered directly. Mail sent to only one of the two addresses would be delivered as per normal and would be subject to the normal filtering. I guess the spammers would find ways to get both addresses too and defeat that, but it should be doubly difficult, if not actually an order of magnitude more difficult. How many people get the same spam on different email addresses? This could be useful.
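The routing rule for the paired addresses could look something like this. It is only a sketch: the function name, the return labels, and the example addresses are invented, and real mail would need the recipient list taken from the SMTP envelope rather than a plain Python list.

```python
def route_message(recipients, public_addr, private_addr):
    """Decide what to do with a message under the paired-address scheme."""
    addressed = {r.strip().lower() for r in recipients}
    has_public = public_addr.lower() in addressed
    has_private = private_addr.lower() in addressed

    if has_public and has_private:
        return "deliver"      # sender knows both addresses: skip the filter
    if has_public or has_private:
        return "filter"       # only one address: normal spam filtering applies
    return "not_for_us"       # neither address matched


# Hypothetical address pair for one user.
PUBLIC = "me@example.com"
PRIVATE = "me.trusted@example.com"

friend = route_message([PUBLIC, PRIVATE], PUBLIC, PRIVATE)
stranger = route_message([PUBLIC], PUBLIC, PRIVATE)
```

Here the friend's message is delivered directly and the stranger's goes through the ordinary filter.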

The other is to hit spammers where it hurts: their audience. If a proper ad delivery system (yuck) were rolled out separately from email, and people used their email less for getting information about products and collected it through some RSS-type system instead, the spammers would be left with a dwindling audience unless they switched too. The ads would be strictly opt-in.

Or mail collection rather than mail delivery. If people collected mail rather than having it delivered to them, they could in theory just not collect spam. Why would anyone collect spam?

Lastly, there is education. If people kept their own whitelists of approved mailers, they could in theory get rid of most spam by keeping good whitelists.

How do you intentionally get a lot of spam? by Anonymous Coward · 2004-06-14 14:43 · Score: 0

If someone wanted to receive a lot of spam, what is the most effective way to do so? Sure, you could post messages to some newsgroups including your real email address - but does anyone know of some sure-fire way to get lots of spam very quickly?

Bless you, Hemos & Anon! by Niet3sche · 2004-06-14 18:18 · Score: 1

(Google cache of this site: cache: gmail.prattboy.net)

Wonderful. So we finally have an article which provides ... *drumroll* ... an actual cache to the content! Rockin'. I don't care who hosts the cache (although I suspect that google is better equipped than most, in both cost absorption and raw bandwidth / failover capability), and in fact I don't want it to be slashdot. That'd be stupid, IMO.

Anyway ... yay!

Greediness by pingurslapp · 2004-06-14 20:08 · Score: 1

OK, some of us out here are begging for gmail addresses and doing all sorts of things to get one, and this puZ is begging for spam mail. Simple solution..... mv /home/prattboys"brain" /dev/null

So much for testing.... by Anonymous Coward · 2004-06-15 04:58 · Score: 0

Last night I found an error on prattboy's website that others had pointed out, and I was going to email him questioning it. Woke up this morning to check my own gmail and found this...
This is an automatically generated Delivery Status Notification

Delivery to the following recipient failed permanently:

prattboy@gmail.com

----- Original message -----

WEeeeeee! Maybe they nuked his box? Or was this because of Akamai hosing?

285 comments