New Kind of Spam 'Un-Training' Filters?
Zaphod2016 writes to tell us the Wall Street Journal is reporting that email in-boxes are under a new kind of spam attack. This new spam has confused many people due to its lack of advertising, viruses, or request for personal information. One popular theory is that these innocuous blocks of text, often drawn from popular literature, are being used to "un-train" spam filters to allow more malicious spam through in the future.
Bayesian and other filters do not rely on "spammy" words alone -- they also rely on "unspammy" words, and spammers have no idea what those words are because each person receives different email.
A scenario, with made up (but plausible) numbers: Suppose you're a developer of a Linux driver for the Bozodrive 1000. The majority of your legitimate email comes from Linux driver development mailing lists. A full 50% of those emails contain the word "IRQ." 99% of the emails contain the word "driver," and 15% contain the word "Johannsen" which is in the signature of one of your friends. And precisely 0% of the emails containing any of these terms have ever been found to be spam.
Any decent spam filter will give a huge weight to the presence of these "unspammy" words, because of the extremely high probability of emails containing them to be non-spam. The presence of randomly selected confusion words in empty spams is not going to affect these frequency counts.
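A minimal sketch of that claim, assuming a toy naive Bayes filter with equal priors and add-one smoothing. The `TinyBayes` class, the corpus sizes, and the word lists are all made up for illustration; real filters tokenize and combine evidence differently.

```python
import math
from collections import Counter

class TinyBayes:
    """Toy per-user filter: counts how many messages of each class contain a word."""

    def __init__(self):
        self.msgs = Counter()
        self.words = {"spam": Counter(), "ham": Counter()}

    def train(self, words, label):
        self.msgs[label] += 1
        self.words[label].update(set(words))

    def word_prob(self, word):
        # P(spam | word), assuming equal priors and add-one smoothing.
        s = (self.words["spam"][word] + 1) / (self.msgs["spam"] + 2)
        h = (self.words["ham"][word] + 1) / (self.msgs["ham"] + 2)
        return s / (s + h)

    def score(self, words):
        # Naive Bayes combination of the per-word probabilities, in log-odds form.
        log_odds = sum(math.log(p / (1 - p)) for p in map(self.word_prob, set(words)))
        return 1 / (1 + math.exp(-log_odds))

f = TinyBayes()
for i in range(100):  # legitimate mail, roughly matching the percentages above
    msg = ["linux"]
    if i < 99:
        msg.append("driver")
    if i < 50:
        msg.append("irq")
    if i < 15:
        msg.append("johannsen")
    f.train(msg, "ham")
for _ in range(100):  # ordinary spam
    f.train(["viagra", "cheap"], "spam")

legit = ["irq", "driver", "patch"]
before = (f.word_prob("irq"), f.score(legit))

for _ in range(200):  # "empty" spams built from literature, trained as spam
    f.train(["whale", "ahab", "harpoon"], "spam")
after = (f.word_prob("irq"), f.score(legit))

print("before:", before)
print("after: ", after)
```

In this sketch the confusion words never touch the counts for "irq" or "driver". The extra spam only enlarges the spam corpus those words are absent from, so the legitimate message scores no worse than before.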
In order to defeat a filter by confusing it, the spammer must guess what the SPECIFIC non-spam words for that PARTICULAR email user are, and then produce bogus spam messages containing those words in the appropriate frequencies. This will cause the classification counts for those words to become more equalized, and the value of those words in determining spamminess to be greatly reduced. However, this is an impossible task unless the spammer has access to the actual emails of the target.
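The equalizing effect can be checked with simple arithmetic. This is a sketch under assumed numbers (1000 legitimate and 1000 spam messages already trained, equal priors, add-one smoothing); the function names are hypothetical.

```python
def word_spam_prob(spam_with, spam_total, ham_with, ham_total):
    """P(spam | word) with equal priors and add-one smoothing."""
    s = (spam_with + 1) / (spam_total + 2)
    h = (ham_with + 1) / (ham_total + 2)
    return s / (s + h)

HAM_TOTAL, HAM_WITH_IRQ = 1000, 500   # "IRQ" appears in 50% of legitimate mail
SPAM_TOTAL = 1000                     # none of the existing spam mentions "IRQ"

def after_targeted_poison(n_poison):
    # n_poison extra spams, every one containing "IRQ", all trained as spam.
    return word_spam_prob(n_poison, SPAM_TOTAL + n_poison, HAM_WITH_IRQ, HAM_TOTAL)

for n in (0, 100, 500, 1000):
    print(n, round(after_targeted_poison(n), 3))
```

With these assumed counts, "IRQ" only becomes neutral once it shows up in the spam corpus as often as it does in the legitimate mail, which is exactly the frequency matching the spammer cannot do without seeing the target's email.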
Perhaps the intent of the empty spams is to confuse the filters, but whoever devised the method has no understanding of how these things actually work, whatsoever.
Having a Bayesian filter forget over time also helps shrink the database and helps it adapt as the contents of spam change.
Having the filter forget is the ONLY effective policy. In statistical filtering, it is certainly NOT true that more data == better results. You want a sample of data that most accurately represents the sort of content you are receiving RIGHT NOW. I completely purge my Firefox Bayesian database every couple of months and retrain on recent emails only. The result is ALWAYS an increase in accuracy, particularly a reduction in false positives.
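One way to get the same effect without purging by hand is to rebuild the counts from a sliding window of recent mail. A rough sketch, assuming you keep each trained message with a timestamp and label; `retrain_recent` and the 60-day window are hypothetical choices, not anything a particular mail client does.

```python
from collections import Counter
from datetime import datetime, timedelta

def retrain_recent(messages, now, max_age_days=60):
    """Rebuild per-class word counts from scratch, using only recent messages.

    messages: iterable of (timestamp, label, words), label in {"spam", "ham"}.
    """
    cutoff = now - timedelta(days=max_age_days)
    msgs = Counter()
    words = {"spam": Counter(), "ham": Counter()}
    for when, label, text_words in messages:
        if when < cutoff:
            continue  # too old: this message is simply forgotten
        msgs[label] += 1
        words[label].update(set(text_words))
    return msgs, words

now = datetime(2006, 6, 1)
history = [
    (datetime(2005, 11, 3), "spam", ["mortgage", "refinance"]),  # old campaign
    (datetime(2006, 5, 20), "spam", ["pills", "cheap"]),
    (datetime(2006, 5, 28), "ham", ["irq", "driver"]),
]
msgs, words = retrain_recent(history, now)
print(msgs, words)
```

The old campaign's vocabulary drops out of the database entirely, so it neither takes up space nor skews the statistics for what is arriving now.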
The only way to increase the false positives is to get the spam filter to learn the words that usually appear in your legitimate messages.
Since the spammers have no way of knowing what those words are, there is no way they can bypass your filters.
Take my dad for instance; he isn't on any mailing list; 99% of his email is along the lines of "how are you" and "give my love" etc; pretty run of the mill stuff.
People who ask those sorts of things usually sign their name to their email. Those names will become strong non-spam keywords. ANYTHING your dad talks about specifically will help -- hobbies, places he usually goes, etc. You'd be surprised how much specific, intelligent content even the most "ordinary" of people will produce.
My limited experience is that whatever filtering Hotmail uses has been allowing lots of spam to slip through in the last few weeks.
Anyone else?
How's Yahoo & G-Mail been doing?
[Fuck Beta]
o0t!
Quite a few, apparently.
I read one article which claimed that one spammer in particular "received 10,000 credit card orders in one month [snip] each for $39.95 US."
So that's nearly $400,000 per month. Nice work if you can get it.
Source:
http://www.cbc.ca/story/business/national/2005/04/08/spam-050408.html
If the spammers are now sending round Gutenberg texts, this is entirely appropriate. Project Gutenberg caused probably the first ever spam, when Michael Hart launched the project by trying to mail everyone on ARPANET with the U.S. Declaration of Independence. (source)
-- Ed Avis ed@membled.com
Answer is: No, it won't. At least not with Bayesian. The only way to mess up a Bayesian filter is if they can send you messages that are heavy in words/terms that often appear in your good email. And that's going to vary from user to user. Unless you're sending me the exact words that I use in my daily emails, adding a plethora of other words is not going to make my filter any less accurate or create more false positives. It will either let my filter recognize your "poison" as spam itself or, at worst, be neutral.
My Bayesian filter, among other things, considers an excessive number of infrequently/never used terms as a characteristic that is itself subject to Bayesian classification. So while the "poison words" have no statistical effect on my filter, the fact that a bunch of unusual words are found in a message is going to increase the chance that my filter correctly recognizes the message as spam.
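A guess at how such a feature could be wired in (not the poster's actual code): reduce each message to a pseudo-token describing what fraction of its words the filter has rarely or never seen, then train on that pseudo-token like any other word. The name `rarity_token`, the `RARE:` labels, and the thresholds are all invented for this sketch.

```python
from collections import Counter

def rarity_token(words, seen_counts, min_seen=2):
    """Map a message to a pseudo-token describing how many of its words are unfamiliar."""
    unique = set(words)
    if not unique:
        return "RARE:low"
    # A word is "rare" if it appeared in fewer than min_seen training messages.
    rare = sum(1 for w in unique if seen_counts.get(w, 0) < min_seen)
    ratio = rare / len(unique)
    if ratio >= 0.5:
        return "RARE:high"
    if ratio >= 0.2:
        return "RARE:mid"
    return "RARE:low"

# Hypothetical counts of how many past messages contained each word.
seen = Counter({"driver": 990, "irq": 500, "patch": 120, "kernel": 300})

normal = rarity_token(["driver", "irq", "patch", "kernel"], seen)
poison = rarity_token(["whale", "ahab", "harpoon", "driver"], seen)
print(normal, poison)
```

If most "empty" spams land in the high bucket and most real mail lands in the low one, the bucket token picks up a spammy weight through ordinary training, even though the individual poison words carry no weight at all.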
My spam was constantly growing through about December of last year. This year, it seems to have leveled off. Sure, I'm still getting just under 20,000 per month which sucks, but I see almost none of them and according to my spam stats, the spam has leveled off. Hopefully this is the plateau before it falls. :)
I still want to know: Who are the idiots who BUY spammed products???
Here are actual samples of emails that Gmail and Yahoo have let through to my inbox over the past couple days. First, Gmail:
Attached to the above was an image file that contained an obvious ad. So to Gmail, this apparently looks like a regular text email that happens to have an attached image.
(You can argue about how effective this is, since Gmail thumbnails all images, meaning you'd need to click a separate link to open it and read it.)
Now Yahoo, where I get approximately 1,000 messages to my bulk folder per day - this is the only one that's gotten through to my inbox in the last day:
I know it's basic, but I'd like to add that if you have control of the HTML of the page that you are posting your email to, you can use a simple tool to confuse the mining bots. It doesn't work on forums like Slashdot, but a good scrambler that I've had success with is Enkoder.
I've wondered why more sites don't use Craigslist's method of temporary forwarding from an anonymous, random address that can be easily filtered if need be. Bandwidth?
Dammit Otto, you have lupus.
Close but incorrect. I believe it was an ad for some kind of seminar a guy was giving on the west coast. He was from the east coast and had no contacts to sell this product in the west, so he manually typed in like hundreds of addresses. I don't know if I can find a link, but I remember reading about it.
Ok, apparently googling for "first spam ever" yields this article:
so there you go. First spam May 3, 1978. There's a reply to it from RMS too (his initial reaction was pro spam heh).
I'll just use my special getting high powers one more time...
I work for a fairly large email service provider. Spam isn't dying by any means. We just doubled production hardware last week to have enough SMTP listener processes to be able to accept email. Bayesian is nice for the single user. For an ISP, it isn't. ISPs are bearing the brunt of the expense right now. The day I fear is when ISPs start to go under, or start charging for spam filtering, or simply stop.
Those boxes are running at sustained loads of 40+ and are CPU bound. That's a bit rare in the email world, as you would know if you have ever run a nontrivial system in production.
That spammers will send more spam is something we have been observing in reality. I have seen AOL's numbers, and they are merely two orders of magnitude bigger than ours at the moment.
I can throw myself at the ground, and miss.