Spam Trap Claims 10x-100x Accuracy Gain
SpiritGod21 writes in with a NYTimes article on a new approach to spam detection that claims out-of-the-box improvement of 1 or 2 orders of magnitude over existing approaches. The article wanders off into human-interest territory as the inventor, Steven T. Kirsch, has an incurable disease and an engineer's approach to fighting it. But a description of the anti-spam tech, based on the reputation of the receiver and not the sender, is worth a read.
I read part of TFA, and it seems to be saying that you can id spam mails because they are being sent to a person who gets lots of spam. But that still doesn't take into account the fact that that person also receives legit mail, AND the fact that what is spam to one person isn't spam to another.
Also, seems like a bit of a slashvertisement for what is as yet an unproven technology - the only benchmarks we have are ones they provide.
At least once a week there seems to be another flashy technique to filter or block spam. Great.
Except that this ignores the truth behind the spam problem, that many people don't seem to care about. Spam is, at its root, an economic problem. Spam is sent by people who are making money helping someone sell something. The spam you got this afternoon for discount v!@gra or 0EM software is making money for someone. And as long as someone can still make money off of it, they'll keep doing it.
If you want to stop spam, you need to take away the economic incentive. We've already seen how many spam filtering / blocking programs produced in the past 5 years? But yet the spam problem just keeps growing as the number of "solutions" grows. This tells us that the spammers are more than willing to work on ways to circumvent these reactive techniques, so that they can continue to make money off their deeds.
Once we can stop spam from being profitable, we will finally see it go away. But no sooner.
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
I own a number of domains, and receive all email to each domain in a catch-all account. I receive a great deal of emails to totally fictitious email accounts at my domains. Those recipients receive 0% legitimate emails, so anything sending to those accounts is 100% certainly a spammer. Basically what Abaca is doing is working with all the shades of gray in between. Also, this is a system that can only be employed at the server level. It's not like you could add this technology to your stand alone email client.
Dan East
Better known as 318230.
Misquoted by the Slashdot story as usual. FTA:
Over 99 percent spam blocking means fewer than one mistake in every 100 messages processed. That's 10 to 100 times fewer mistakes than any other available systems.
1) Issue a Fatwah that spam is an insult to Islam.
2) Behead those who insult Islam!
3) No more spam. Allah Akbar
So, if I understood the article correctly, this technology will classify more email as spam the more spam you have received.
No, that's not how it works at all. Let me try putting it as a concrete example. You have a friend, Jane, who likes to swap stupid chain emails, subscribes to all kinds of "voluntary spam," and generally receives 1000 spam mails a day. Jane's a great lady, don't get me wrong, but you know the type of person I mean. You talk to her in real life, but over email she is incredibly annoying, as most of her messages are essentially meaningless.
Now, let's say that BOTH YOU AND JANE receive the same message M. Now, you know Jane, and you know the kind of messages she typically receives (mindless, at least in YOUR eyes). What are the chances that this message M is something that YOU will be interested in? Probably very low. The vast majority of email Jane receives is "crap," at least according to your definition, and so the very fact that Jane received message M greatly increases the likelihood that it is "crap."
Does that make better sense?
Seriously, I don't see how anything working remotely as described can work. First, it guarantees that any OSS mailing list will be flagged as spam because our emails tend to be on the web and we all receive lots of spam. Then how the hell is someone going to know what percentage of spam I receive (or do they expect everyone to give them access to their inbox?)? Even if that were to work, all the spammers would have to do is let the zombies send one email at a time, at which point either they block all my email or they let it all through. Dumb idea or dumb reporting?
Opus: the Swiss army knife of audio codec
How does one initialize this system? Spam is determined by user reputation, yet user reputation is determined by quantity of spam received. Am I missing something? The logic seems circular.
My first attempt at doing this, please feel free to amend/critique:
Your post advocates a
(X) technical ( ) legislative ( ) market-based ( ) vigilante
approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)
( ) Spammers can easily use it to harvest email addresses
(X) Mailing lists and other legitimate email uses would be affected
(X) No one will be able to find the guy or collect the money
( ) It is defenseless against brute force attacks
(X) It will stop spam for two weeks and then we'll be stuck with it
( ) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
( ) Requires too much cooperation from spammers
( ) Requires immediate total cooperation from everybody at once
(X) Many email users cannot afford to lose business or alienate potential employers
( ) Spammers don't care about invalid addresses in their lists
( ) Anyone could anonymously destroy anyone else's career or business
Specifically, your plan fails to account for
( ) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
( ) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
(X) Asshats
( ) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
( ) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
(X) Armies of worm riddled broadband-connected Windows boxes
(X) Eternal arms race involved in all filtering approaches
( ) Extreme profitability of spam
( ) Joe jobs and/or identity theft
( ) Technically illiterate politicians
(X) Extreme stupidity on the part of people who do business with spammers
( ) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook
and the following philosophical objections may also apply:
( ) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
(X) Blacklists suck
(X) Whitelists suck
( ) We should be able to talk about Viagra without being censored
( ) Countermeasures should not involve wire fraud or credit card fraud
( ) Countermeasures should not involve sabotage of public networks
(X) Countermeasures must work if phased in gradually
( ) Sending email should be free
(X) Why should we have to trust you and your servers?
( ) Incompatibility with open source or open source licenses
( ) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
( ) I don't want the government reading my email
( ) Killing them that way is not slow and painful enough
Furthermore, this is what I think about you:
(X) Sorry dude, but I don't think it would work.
( ) This is a stupid idea, and you're a stupid person for suggesting it.
( ) Nice try, assh0le! I'm going to find out where you live and burn your house down!
No. If previous methods let through one in 100 (1%) then a 10x improvement would result in one in 1000 getting through (0.1%).
Oooo! Can I play?
"Anonymous Coward" --> A Condom Warns You
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
This is clever: filtering spam by exploiting properties of spam pumps in general, vs. straight content analysis. The competition of ever-more-sophisticated content scanning techniques on one side, and spammers' escalating workarounds and huge botnets on the other side, is an arms race that shows no sign of abating.
Of course, this approach does still depend on something—probably content analysis—to determine which messages are spam and which are not, so that receivers' spam statistics can be computed.
The smartest (and reportedly most effective) anti-spam technique I know is spamd, which completely sidesteps content analysis. In a nutshell, it's an SMTP proxy that issues a temporary error code to unknown senders; legitimate MTAs retry delivery (at which point spamd lets the message through), while spam pumps don't bother. Voilà—spam gets stopped before it's ever received. A friend of mine reports that spam volume has dropped to zero since he set up spamd for his department.
If I understand the "receiver reputation" approach correctly, it could use spamd (rather than content analysis) to identify spam; similarly, content analysis can supplement spamd. The two are potentially complementary.
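For anyone who hasn't seen the retry trick described above, here is a minimal sketch of the general greylisting idea. It is not spamd's actual implementation: the `Greylist` class, the `(ip, mail_from, rcpt_to)` key, and the 300-second retry delay are all assumptions made for illustration.

```python
class Greylist:
    """Toy greylisting decision: temp-fail unknown senders, accept on a later retry."""

    def __init__(self, min_retry_delay=300.0):
        # Assumed minimum wait (seconds) before a retry is accepted.
        self.min_retry_delay = min_retry_delay
        self.first_seen = {}    # triplet -> time of first delivery attempt
        self.whitelist = set()  # triplets that have already passed

    def check(self, ip, mail_from, rcpt_to, now):
        """Return "accept" or "tempfail" for one delivery attempt at time `now`."""
        key = (ip, mail_from, rcpt_to)
        if key in self.whitelist:
            return "accept"
        first = self.first_seen.get(key)
        if first is None:
            # Never seen before: remember it and ask the sender to try again later.
            self.first_seen[key] = now
            return "tempfail"
        if now - first >= self.min_retry_delay:
            # A real MTA queued the message and came back after the delay.
            self.whitelist.add(key)
            return "accept"
        # Retried too quickly; keep deferring.
        return "tempfail"


g = Greylist()
print(g.check("192.0.2.1", "a@example.org", "b@example.com", now=0.0))
print(g.check("192.0.2.1", "a@example.org", "b@example.com", now=600.0))
```

The bet is simply that a spam pump never makes the second attempt, so nothing about the message body has to be inspected.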
Not much.
Two issues: First, how does the system know that Jane's e-mail is mostly spam? Who tells it? Does it use some other filters to identify the spam in order to determine her spam rate?
Second, how does the system know that the message you received and the message Jane received are the same? Spammers have long been randomizing parts of messages in order to block older spam filters.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Second, how does the system know that the message you received and the message Jane received are the same? Spammers have long been randomizing parts of messages in order to block older spam filters.
An interesting thing, as outlined in TFA that you should R, is that the mails do not have to be the same. They may have different check-sums even. However they are checked against the sending IP-address. If more messages from the same IP address arrive (presumably within a certain time frame), they are all considered spam or ham. Spammers tend to send lots of mails from the same IP address at a time, so that should work.
How they handle mailing lists though is not clear to me really. There are quite some loose ends to the article.
Alright!! I'm going to white list me a new car!
"MightyYar" --> "him gay, try!"
Honeypots have been a published anti-spam technique for a decade. The idea is to publish bogus mailboxes that are not close to any legit mailbox. Any message with a honeypot as any recipient is spam. 100% accurate. (And I blacklist the IP for a week for good measure.) I use a variation, where any message with 3 or more invalid recipients is spam (blacklist IP). That is a little risky since someone may legitimately be trying various mailboxes manually with a telnet session because they forgot the exact name. This technique gives each recipient a score between 0 and 1 that reflects how close to a honeypot that recipient is, with actual honeypots (100% spam) being 1.0.
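The two rules in the comment above (any honeypot recipient means spam, and three or more invalid recipients means spam) can be sketched roughly as follows. The function name, the address sets, and the "unknown" fallback are hypothetical, and the week-long IP blacklist is left out.

```python
def classify_by_recipients(recipients, valid_mailboxes, honeypots, max_invalid=3):
    """Return "spam" if the recipient list trips either rule, else "unknown"."""
    # Rule 1: a honeypot address never receives legitimate mail.
    if any(r in honeypots for r in recipients):
        return "spam"
    # Rule 2: several nonexistent recipients in one message suggests address guessing.
    invalid = sum(1 for r in recipients if r not in valid_mailboxes)
    if invalid >= max_invalid:
        return "spam"
    # Neither rule fired; some other filter has to decide.
    return "unknown"


valid = {"alice@example.com", "bob@example.com"}
traps = {"zzz-trap@example.com"}
print(classify_by_recipients(["alice@example.com", "zzz-trap@example.com"], valid, traps))
print(classify_by_recipients(["alice@example.com"], valid, traps))
```

Rule 2 is the risky one, as noted: a person guessing at a forgotten mailbox name by hand can trip it too.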
From TFA with commentary:
"he has started four companies, all based on his frustrations with existing products or services"
Unless they're all still in business that's probably 3 failures on record.
"Along the way he has amassed a personal fortune of about $230 million"
But he got out before the ship sank and with a bundle of cash too. I wonder what his ex-employees got...
"This is harder on my wife than it is on me," he said during a recent interview. "I just look at it as a problem. Here's a problem and you have four years to solve it or you don't get to solve any more problems."
How philosophical...So he's going to cure himself single handedly of a rare disease in 4 years, because medical research is as easy (and cheap) as writing software or tinkering with a home engineering project. I think he's been watching Crusade and sniffing glue.
"His perspective on his disease is also clear. Fourth on his list is "Why human beings will be extinct in 90 years." He writes, "My incurable blood cancer is minor compared to what is happening with the planet. We have somewhat more than 90 years before humanity is virtually extinct.""
Don't even know where to start on this one. I can't be bothered reading about his reasoning, but he's not the first person to predict the end of the world just beyond his own lifetime.
Oh and by the way he has a bridge, I mean some anti-spam software to sell you.
Gimme a break! Nothing to see here.
These posts express my own personal views, not those of my employer
If the contents are irrelevant, then how does this system determine that any two messages are the same? Your answer is "by the sender IP" (and, unspoken, by a similar send time).
Which then leads me to ask - what about mail relays, where the same IP address sends thousands of emails every day? Wouldn't every email sent by the relay at roughly the same time be considered the same message, and (because almost everybody gets more spam than ham) be classified as spam?
I think the article tag is correct - "snakeoil".
They always measure it backwards, since it makes the numbers sound much better...
If the old way caught 95% and a new way catches 99%, then you could say it's 4.2% better (4/95) or 4 percentage points better, or you could say it's gone from missing 5% to missing 1% for 80% better (4/5), or say it's 5 times better (1% missed compared with 5%). Guess which most people choose to use?
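To make the four ways of spinning the same 95%-to-99% change concrete, here is a small sketch. The function and key names are made up for this example.

```python
def improvement_summaries(old_catch, new_catch):
    """Express one change in catch rate in four different ways."""
    old_miss = 1.0 - old_catch
    new_miss = 1.0 - new_catch
    return {
        # (99 - 95) / 95: relative gain in messages caught
        "relative_gain_in_catch_rate": (new_catch - old_catch) / old_catch,
        # 99 - 95: plain percentage points
        "percentage_point_gain": (new_catch - old_catch) * 100,
        # (5 - 1) / 5: share of the old misses that are now caught
        "reduction_in_misses": (old_miss - new_miss) / old_miss,
        # 5 / 1: how many times fewer messages are missed
        "miss_ratio": old_miss / new_miss,
    }


for name, value in improvement_summaries(0.95, 0.99).items():
    print(f"{name}: {value:.3f}")
```

Same two filters, four headlines; the last one is the "N times better" figure that ends up in press releases.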
That's the problem I have with this. Spam stopped being truly mass produced years ago. Each spam is now normally sent to each user with a different mix of nonsense. The probability of two different people receiving the same message is virtually zero.
Over 99 percent spam blocking means fewer than one mistake in every 100 messages processed. That's 10 to 100 times fewer mistakes than any other available systems.
That still means that the best other systems make a mistake on 1 out of every 10 messages, and the worst ones make a mistake on every single message. That's still ridiculous hyperbole.
(Personally, I'll take the system that makes 100% mistakes, and I'll use the Spam folder as my Inbox.)
Now if you said that it has 1/10 to 1/100 the error rate of normal clients (which is what they're actually claiming, I think), THAT would make mathematical sense AND be an achievement. The Slashdot title of the story is just bad no matter how you spin it.
If it's for-profit but free, you're not the customer -- you're the product (e.g., the Slashdot Beta's "audience").
No, that's not what they're saying at all. RTFA, please, cause you're describing something completely different. (And moderators too, please at least skim TFA before moderating, because modding this "Informative" is bollocks.)
This is a system where they look at the history of who a person has sent e-mail to. If the sender has a short term history of sending e-mail to people who mostly receive spam, the e-mail is considered more likely to be spam. Conversely, if the sender has a short term history of sending email to people who don't receive much spam, the email is considered unlikely to be spam.
It's not about your inbox and its percentages, it's about the ratio of the inboxes the sender has previously sent to.
"Because ratings are based on the most recent 25 emails for each sender, the system reacts instantly to spam attacks, usually within just a few messages."
The system has one big flaw, though -- it only works with static senders. A spammer who changes the envelope from address won't get caught, and might even by luck pick a forged sender address that has a positive latest-25-score.
So the solution for the spammers to defeat this system is to send the spams multiple times to the same recipients, but with different senders. This will increase the overall spam, which I don't see as a good service.
Only the mail relay IP address can be determined unambiguously - that's the host which is connecting to the host which is checking the mail for spamminess.
Thank you for all the comments on the NY Times article.
It would be difficult for me to answer each and every comment, so I'll try to just hit the high points here.
It's quite easy to poke fun at an algorithm which is unknown to you as demonstrated by all the comments.
But what's more relevant is whether really smart people who know the algorithm can find fault with it. There are only two people outside of Abaca who know the algorithm: Stephen Wolfram (author of Mathematica) and University of Waterloo Professor Gordon V. Cormack (a well known figure in the anti-spam community). I picked Wolfram because he's the smartest pure math guy I know. I picked Cormack because I think he is one of the smartest and most respected scientists in the spam field. You could contact either of them and ask them what they think of the approach. I can tell you what they'd say if you did that. They'd tell you it is a simple, elegant algorithm that has no obvious (to them) holes. I know that because the reason I disclosed it to them was to see if I overlooked anything. Neither found any holes. That doesn't prove that there aren't holes. All systems have holes. What this does mean is that a couple of pretty respected experts think it appears to be pretty solid logic.
In fact, Gordon was kind enough to go even further and gave me permission to use the following quote: "This is, by far, the most clever technique I'm aware of for spam filtering." Since Gordon is conference chair for a lot of spam conferences, this is a pretty significant endorsement from someone who KNOWS the full algorithm and who knows the spam space better than just about anyone.
I spent about 4 years studying what others had done in the space. As one commenter pointed out, the recipient reputation system can be thought of as a generalization of the honeypot technique that was first patented by Brightmail.
That's exactly right. My realization is that every email address has statistical value, not just honeypots. So instead of just "black" feedback, the system incorporates "grey" and "white" feedback; every recipient has a priori odds associated with receiving mail. For many years, Brightmail was the de facto standard for spam filtering. Brightmail is just a special case of the algorithm I invented. So instead of learning from honeypots, we learn from ALL recipients and incorporate that statistical input in a mathematically rigorous way in order to compute a statistical likelihood that our prediction was correct. That gives us much more input than a honeypot system: it gives us white, black, and grey values. That is critical to avoiding false positives because good sites (like Yahoo and Hotmail) send email to honeypots all the time. And we incorporate that feedback into a statistical framework that is much more accurate than what Brightmail used.
Exactly how we incorporate that input into spam scoring has not been publicly disclosed. It is not obvious.
People who say that this must be snake oil or cannot work ignore the fact that the system has been in use by real customers for more than a year. We have over 100 customers and are just announcing our existence to the world, so that number should increase quite rapidly now that we are starting to market our product. There are customer testimonials on our website. You can contact them directly to verify that these quotes are legitimate.
Here are statistics from one of our rating servers. There were 1,380,140 messages since the last counter reset. 96% were rated spam. There were 176 false positives and 66 false negatives reported. I just grabbed those stats from one of our live servers right now as I was composing this message. Sometimes we're better, sometimes we're worse, but those numbers are pretty typical.
It's not perfect, but I think those are pretty good error rates for where we are now. And the stats always get better as we add more customers since we get more statistical input and this is just a statistical estimation problem. The more data, the more accurate
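Taking the figures quoted above at face value, the overall reported error rate works out as below. These are only the errors that were reported, so the true rate could be higher; the variable names are made up for this example.

```python
# Figures as quoted from the rating server.
messages = 1_380_140
false_positives = 176
false_negatives = 66

reported_errors = false_positives + false_negatives
error_rate = reported_errors / messages
messages_per_error = messages / reported_errors

print(f"{reported_errors} reported errors, {error_rate:.4%} of all messages")
print(f"roughly one reported error per {messages_per_error:.0f} messages")
```

This lumps false positives and false negatives together. Splitting them per class would need the true spam/ham counts, and the 96% figure is what the system rated as spam, not ground truth.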
You didn't RTFA well enough. That it's about recipients is the selling point.
That's a truth with modifications, though. Look at the quote from the web site I put in my parent post to yours, which clearly shows that it's a block based on who the sender has sent an email to. I'll repeat it, in case you missed it:
"Because ratings are based on the most recent 25 emails for each sender, the system reacts instantly to spam attacks, usually within just a few messages."
Yes, it's a recipient based system in that it assigns a score to the sender based on what the recipients of the emails are. But the blocking occurs due to the score of the sender, based on previous emails, not on the recipient of the current email.
Just think -- if blocking were based on the recipient only, it would either block all or no e-mail to an inbox with a single recipient. It would then only be effective for e-mails with multiple recipients, which doesn't match the claims made.
Again, think, and read the article (and that goes for the moderators too).
(Ah, that explains the completely asshat moderation here, then.)
No, I didn't get it backwards -- RTFA. It's called a recipient verification system, but when you look at their own description on how it operates, you'll find that:
- It looks at the recipients of a message, and based on how much spam each of the recipient accounts gets, assigns a score to the sender.
- This score is accumulated over the last 25 emails.
(The reason for this is rather obvious, if you think about it -- if it based its score on just the last e-mail, if you sent an e-mail to someone who receives a lot of spam, it'd be automatically blocked, and that person would not get any e-mail at all.)
Say a sender sends three e-mails, to foo@foo.invalid, bar@bar.invalid, a bunch more people, and finally baz@baz.invalid. If foo@foo.invalid receives 30% spam, and the overall average is 80%, that means that the e-mail is unlikely to be spam. So a score is saved in a table for the sender. Then it goes to bar@bar.invalid, who also has a low 40% spam rate, and another "good" score is saved for sender. When the sender then after a while sends an email to baz@baz.invalid, who has a spam rate of 95%, the fact that he sent an e-mail to foo and bar earlier will increase the likelihood of his email to baz going through.
Conversely, if foo and bar received more spam than average, an e-mail sent to baz would be scored as more likely to be spam, even if baz received a record low 10% spam.
Yes, in a way, it's receiver based, because it builds the score based on the receivers' ratio of spam to valid e-mails. But the score is applied to the sender, and they state this in clear text on the web site itself. You only have to read past the sales pitch and down to the technical details.
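The foo/bar/baz example above can be sketched as a per-sender rolling window. This is only a guess at the shape of it: the real scoring has not been disclosed, so the plain average, the `SenderReputation` class, and the comparison against an 80% overall spam rate are assumptions for illustration.

```python
from collections import deque


class SenderReputation:
    """Score a sender by the spam rates of the recipients it mailed recently."""

    def __init__(self, window=25, overall_spam_rate=0.80):
        self.window = window                        # "most recent 25 emails"
        self.overall_spam_rate = overall_spam_rate  # assumed comparison point
        self.history = {}                           # sender -> recent recipient spam rates

    def score(self, sender):
        """Mean spam rate of the sender's recent recipients, or None if unseen."""
        rates = self.history.get(sender)
        if not rates:
            return None
        return sum(rates) / len(rates)

    def looks_spammy(self, sender):
        """Judge the next email from the sender's PREVIOUS recipients only."""
        s = self.score(sender)
        return s is not None and s > self.overall_spam_rate

    def record(self, sender, recipient_spam_rate):
        """Remember one delivery; entries older than the window fall off."""
        window = self.history.setdefault(sender, deque(maxlen=self.window))
        window.append(recipient_spam_rate)


rep = SenderReputation()
rep.record("sender@example.org", 0.30)  # mailed foo, who gets 30% spam
rep.record("sender@example.org", 0.40)  # mailed bar, who gets 40% spam
# The next message goes to baz (95% spam), judged on the history so far.
print(rep.looks_spammy("sender@example.org"))
```

Note that the decision for the mail to baz never looks at baz's own 95% rate, which is the point being argued in this subthread.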
> The big assumption is that you can identify the recipients
> of a particular message, but spammers can easily ensure
> that information isn't easily obtained.
Nonsense. You're confusing the body from/to with the envelope from/to.
Spammers can't hide the envelope from/to.
No, you are totally wrong. The system measures the ratio of the sender to the spam of the ratio receiver receiver, and establishes a negative false-positive ratio by building a score based on the spam-spam ratio of the sender receiver. By collecting the sum total products of the receiver sender spam ratio dividend, the sales pitch drives the likelihood of three emails through the foobar baz@incompatible.
In summary, I have no idea what I'm talking about because I didn't RTFA. That I am aware of this fact makes me superior to the lot of you who are arguing over the inner workings of this week's spam-filter vaportech -- which was probably written up in an incomprehensible and inconsistent manner such that it will go over the heads of foolish investors, and part them from their money.
Right back atcha:
courseofhumanevents -> "Must Fence A Nervous Ho"
Linux is not gay, homosexuals are gay.
GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
> My point is, the "powers" that be, in the particular case, are likely incompetent - incapable of successfully pulling off such a conspiracy.
They're the ones creating the successful antispam systems -- you know, the ones that actually scale up on the gateway. The popular vision of bumbling PHB buffoons everywhere is just another stupid slashdot stereotype, fostered by insecure social retards who have to foist their apparent superiority over everyone by scoffing at everything. Sure, they exist, but long-term successful tech companies generally have -- get ready for it -- smart people working for them.
Anyway, the antispam companies don't have the leverage to pull off an end to spam. Symantec and Cloudmark and Ironport and so forth could stand up and scream and rant and rave at ISPs and yell about the need to secure email infrastructure, to block outbound port 25 from residential ranges, to deploy SPF, or hell just to stop bouncing (I'm looking at you Barracuda), but as long as the ISPs run their ranges as open sewers, and just slap in a few boxes to stop everyone else's spam, the spam problem will continue. And they don't like having vendors telling them how to run their business. The people with the power to stop the spam problem, who won't, are not the antispam vendors but the ISPs sending spam. So perhaps I was too harsh about the assessment of the PHB problem -- they certainly do seem to be the norm at ISPs (with notable exceptions like AOL and parts of Roadrunner).
Done with slashdot, done with nerds, getting a life.
But it's really designed to be a corporate product. So even if each spam email contains only one recipient, they all go through the corporate email server, allowing it to see all the various recipients a given sender is emailing.
And there were even hints that the software stored on your corporate mail server might be sharing some information with a central data store, allowing it to get the score of all the recipients that the sender is sending to on any network that is a customer of this product. (So it doesn't matter so much if your company only has 10 people to average across because it is somehow cross checking against the global dataset which is tens of thousands of recipients.)
Linux is not gay, homosexuals are gay.
Not all homosexuals are happy, cheerful people either.
In the free world the media isn't government run; the government is media run.