Slashdot Mirror


Spam Trap Claims 10x-100x Accuracy Gain

SpiritGod21 writes in with a NYTimes article on a new approach to spam detection that claims out-of-the-box improvement of 1 or 2 orders of magnitude over existing approaches. The article wanders off into human-interest territory as the inventor, Steven T. Kirsch, has an incurable disease and an engineer's approach to fighting it. But a description of the anti-spam tech, based on the reputation of the receiver and not the sender, is worth a read.

11 of 419 comments (clear)

  1. Yet another wrong answer... by damn_registrars · · Score: 5, Insightful

    At least once a week there seems to be another flashy technique to filter or block spam. Great.

    Except that this ignores the truth behind the spam problem, that many people don't seem to care about. Spam is, at its root, an economic problem. Spam is sent by people who are making money helping someone sell something. The spam you got this afternoon for discount v!@gra or 0EM software is making money for someone. And as long as someone can still make money off of it, they'll keep doing it.

    If you want to stop spam, you need to take away the economic incentive. We've already seen how many spam filtering / blocking programs produced in the past 5 years? But yet the spam problem just keeps growing as the number of "solutions" grows. This tells us that the spammers are more than willing to work on ways to circumvent these reactive techniques, so that they can continue to make money off their deeds.

    Once we can stop spam from being profitable, we will finally see it go away. But no sooner.

    --
    Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
    1. Re:Yet another wrong answer... by ender- · · Score: 5, Insightful

      If you want to stop spam, you need to take away the economic incentive. We've already seen how many spam filtering / blocking programs produced in the past 5 years? But yet the spam problem just keeps growing as the number of "solutions" grows. This tells us that the spammers are more than willing to work on ways to circumvent these reactive techniques, so that they can continue to make money off their deeds.

      Once we can stop spam from being profitable, we will finally see it go away. But no sooner. But why would the anti-spam software companies want that? If they succeed in actually eliminating spam, they'd also go out of business. It may be profitable for the spammers, but I suspect it's even more profitable for the anti-spam companies.

    2. Re:Yet another wrong answer... by pclminion · · Score: 4, Insightful

      At least once a week there seems to be another flashy technique to filter or block spam. Great.

      It's not "flashy." It's called information theory and statistics. It is an extremely powerful concept that has far more important potential uses than simply filtering spam email. Every new advancement in automated classification and knowledge extraction is VITALLY IMPORTANT to our ability to cope in a world which has suddenly been flooding with SO MUCH information. This power tool is being applied to what some might see as a "silly" problem, but the fact remains that spam is a powerful motivation to researchers to push further limits in the fields of pattern recognition, information and natural language processing.

      If you're against the advancement of information processing techniques, then... uh, okay, I guess. If you can't see beyond spam, you are terribly short sighted.

    3. Re:Yet another wrong answer... by choongiri · · Score: 5, Insightful

      No, if you are harvesting email addresses and sending unsolicited commercial messages to them, it is quite simple:

      You are a spammer.

    4. Re:Yet another wrong answer... by Jimmy_B · · Score: 5, Interesting

      Except that this ignores the truth behind the spam problem, that many people don't seem to care about. Spam is, at its root, an economic problem. Spam is sent by people who are making money helping someone sell something. The spam you got this afternoon for discount v!@gra or 0EM software is making money for someone. And as long as someone can still make money off of it, they'll keep doing it.
      Not exactly. It's making money for the spammer, but it probably isn't making money for the person who hired him. You see, even if no one ever bought anything advertised in spam, it would still be sent. The problem is multilevel marketing, which creates a lot of people desperate to sell unsellable inventory, some of whom pay spammers to advertise it for them. A perceived economic incentive is enough, even if there isn't a real one.
  2. Makes sense by Dan+East · · Score: 4, Informative

    I own a number of domains, and receive all email to each domain in a catch-all account. I receive a great deal of emails to totally fictitious email accounts at my domains. Those recipients receive 0% legitimate emails, so anything sending to those accounts is 100% certainly a spammer. Basically what Abaca is doing is working with all the shades of gray in between. Also, this is a system that can only be employed at the server level. It's not like you could add this technology to your stand alone email client.

    Dan East

    --
    Better known as 318230.
  3. Re:KInda flawed by pclminion · · Score: 4, Informative

    So, if I understood the article correctly, this technology will classify more email as spam the more spam you have received.

    No, that's not how it works at all. Let me try putting it as a concrete example. You have a friend, Jane, who likes to swap stupid chain emails, subscribes to all kinds of "voluntary spam," and generally receives 1000 spam mails a day. Jane's a great lady, don't get me wrong, but you know the type of person I mean. You talk to her in real life, but over email she is incredibly annoying, as most of her messages are essentially meaningless.

    Now, let's say that BOTH YOU AND JANE receive the same message M. Now, you know Jane, and you know the kind of messages she typically received (mindless, at least in YOUR eyes). What are the chances that this message M is something that YOU will be interested in? Probably very low. The vast majority of email Jane receives is "crap," at least according to your definition, and so the very fact that Jane received message M greatly increases the likelihood that it is "crap."

    Does that make better sense?

  4. Form letter by Anonymous Coward · · Score: 5, Funny

    My first attempt at doing this, please feel free to ammend/critique:

    Your post advocates a
    (X) technical ( ) legislative ( ) market-based ( ) vigilante

    approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)

    ( ) Spammers can easily use it to harvest email addresses
    (X) Mailing lists and other legitimate email uses would be affected
    (X) No one will be able to find the guy or collect the money
    ( ) It is defenseless against brute force attacks
    (X) It will stop spam for two weeks and then we'll be stuck with it
    ( ) Users of email will not put up with it
    ( ) Microsoft will not put up with it
    ( ) The police will not put up with it
    ( ) Requires too much cooperation from spammers
    ( ) Requires immediate total cooperation from everybody at once
    (X) Many email users cannot afford to lose business or alienate potential employers
    ( ) Spammers don't care about invalid addresses in their lists
    ( ) Anyone could anonymously destroy anyone else's career or business

    Specifically, your plan fails to account for

    ( ) Laws expressly prohibiting it
    ( ) Lack of centrally controlling authority for email
    ( ) Open relays in foreign countries
    ( ) Ease of searching tiny alphanumeric address space of all email addresses
    (X) Asshats
    ( ) Jurisdictional problems
    ( ) Unpopularity of weird new taxes
    ( ) Public reluctance to accept weird new forms of money
    ( ) Huge existing software investment in SMTP
    ( ) Susceptibility of protocols other than SMTP to attack
    ( ) Willingness of users to install OS patches received by email
    (X) Armies of worm riddled broadband-connected Windows boxes
    (X) Eternal arms race involved in all filtering approaches
    ( ) Extreme profitability of spam
    ( ) Joe jobs and/or identity theft
    ( ) Technically illiterate politicians
    (X) Extreme stupidity on the part of people who do business with spammers
    ( ) Dishonesty on the part of spammers themselves
    ( ) Bandwidth costs that are unaffected by client filtering
    ( ) Outlook

    and the following philosophical objections may also apply:

    ( ) Ideas similar to yours are easy to come up with, yet none have ever
    been shown practical
    ( ) Any scheme based on opt-out is unacceptable
    ( ) SMTP headers should not be the subject of legislation
    (X) Blacklists suck
    (X) Whitelists suck
    ( ) We should be able to talk about Viagra without being censored
    ( ) Countermeasures should not involve wire fraud or credit card fraud
    ( ) Countermeasures should not involve sabotage of public networks
    (X) Countermeasures must work if phased in gradually
    ( ) Sending email should be free
    (X) Why should we have to trust you and your servers?
    ( ) Incompatiblity with open source or open source licenses
    ( ) Feel-good measures do nothing to solve the problem
    ( ) Temporary/one-time email addresses are cumbersome
    ( ) I don't want the government reading my email
    ( ) Killing them that way is not slow and painful enough

    Furthermore, this is what I think about you:

    (X) Sorry dude, but I don't think it would work.
    ( ) This is a stupid idea, and you're a stupid person for suggesting it.
    ( ) Nice try, assh0le! I'm going to find out where you live and burn your
    house down!

  5. Re:Ummmm.... by Mundocani · · Score: 4, Insightful

    The main problem I can see is that even if this system works it is easily circumvented. The big assumption is that you can identify the recipients of a particular message, but spammers can easily ensure that information isn't easily obtained.

    First they can ensure that the message itself doesn't contain any recipient info (a big bcc basically).

    Then they avoid batching recipients based on their domain so he SMTP server can't tell who else is receiving the message.

    The only way to derive the recipients now is to compare all messages against all others in order
    to match them up. So they hash every message and combine those with identical hashes.

    But putting a little unique text in each message during transmission foils that.

    Spammers: 1 New weapons: 0

  6. The inventor responds... by propelCEO · · Score: 5, Informative

    Thank you for all the comments on the NY Times article.

    It would be difficult for me to answer each and every comment, so I'll try to just hit the high points here.

    It's quite easy to poke fun at an algorithm which is unknown to you as demonstrated by all the comments.

    But what's more relevant is whether really smart people who know the algorithm can find fault with it. There are only two people outside of Abaca who know the algorithm: Stephen Wolfram (author of Mathematica) and University of Waterloo Professor Gordon V. Cormack (a well known figure in the anti-spam community). I picked Wolfram because he's the smartest pure math guy I know. I picked Cormack because I think is one of the smartest and most respected scientists in the spam field. You could contact either of them and ask them what they think of the approach. I can tell you what they'd say if you did that. They'd tell you it is a simple, elegant algorithm that has no obvious (to them) holes. I know that because the reason I disclosed it to them was to see if I overlooked anything. Neither found any holes. That doesn't prove that there aren't holes. All systems have holes. What this does mean is that a couple of pretty respected experts think it appears to be pretty solid logic.

    In fact, Gordon was kind of enough to go even further and gave me permission to use the follow quote: "This is, by far, the most clever technique I'm aware of for spam filtering." Since Gordon is conference chair for a lot of spam conferences, this is a pretty significant endorsement from someone who KNOWS the full algorithm and who knows the spam space better than just about anyone.

    I spent about 4 years studying what others had done in the space. As one commenter pointed out, the recipient reputation system can be thought of as a generalization of the honeypot technique that was first patented by Brightmail.

    That's exactly right. My realization is that every email address has statistical value, not just honeypots. So instead of just "black" feedback, the system incorporates "grey" and "white" feedback; every recipient has an apriori odds associated with receiving mail. For many years, Brightmail was the "defacto" standard for spam filtering. Brightmail is just a special case of the algorithm I invented. So instead of learning from honeypots, we learn from ALL recipients and incorporate that statistical input in a mathematically rigorous way in order compute a statistical likelihood that our prediction was correct. That gives us much more input than a honeypot system: it gives us white, black, and grey values. That is critical to avoiding false positives because good sites (like Yahoo and Hotmail) send email to honeypots all the time. And we incorporate that feedback into a statistical framework that is much more accurate than what Brightmail used.

    Exactly how we incorporate that input into spam scoring has not been publicly disclosed. It is not obvious.

    People who say that this must be snake oil or cannot work ignore the fact that the system has been in use by real customer for more than a year. We have over 100 customers and are just annoucing our existence to the world, so that number should increase quite rapidly now that we are starting to market our product. There are customer testimonials on our website. You can contact them directly to verify that these quotes are legitimate.

    Here are statistics from one of our rating servers. There were 1,380,140 messages since the last counter reset. 96% were rated spam. There were 176 false positives and 66 false negatives reported. I just grabbed those stats from one of our live servers right now as I was composing this message. Sometimes we're better, sometimes we're worse, but those numbers are pretty typical.

    It's not perfect, but I think those are pretty good error rates for where we are now. And the stats always get better as we add more customers since we get more statistical input and this is just a statistical estimation problem. The more data, the more accurate

  7. Re:You are also totally wrong by Garridan · · Score: 5, Funny

    No, you are totally wrong. The system measures the ratio of the sender to the spam of the ratio receiver receiver, and establishes a negative false-positive ratio by building a score based on the spam-spam ratio of the sender receiver. By collecting the sum total products of the receiver sender spam ratio dividend, the sales pitch drives the likelihood of three emails through the foobar baz@incompatible.

    In summary, I have no idea what I'm talking about because I didn't RTFA. That I am aware of this fact makes me superior to the lot of you who are arguing over the inner workings of this week's spam-filter vaportech -- which was probably written up in an incomprehensible and inconsistent manner such that it will go over the heads of foolish investors, and part them from their money.