Slashdot Mirror


Bayesian Filtering For Dummies

Dynamoo writes "Bayesian filtering for spam is awfully clever stuff, touched on by Slashdot several times before. There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering. It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name. Those Vikings have a lot to answer for."

281 comments

  1. Yes, we must filter out the dummies by Anonymous Coward · · Score: 5, Funny

    I suggest Slashdot immediately implement this "Bayesian Filter for Dummies" to remove most of the trolls, etc.

    1. Re:Yes, we must filter out the dummies by Anonymous Coward · · Score: 3, Funny

      Wiping out all of the comments just because of the trolls seems a bit extreme.

    2. Re:Yes, we must filter out the dummies by zoikes · · Score: 5, Interesting

      The moderation system (esp. in its current form - moderation by +karma /.ers) will always be better than automated filtering.

      The key problem is adaptation. Bayesian filtering is better than simple keyword filtering, but its performance will degrade over time unless its rules are continuously updated (via analysis of new data). And there's the problem that a troll in one story context may be an insightful comment in another.

      Moderation by humans adapts rapidly, accommodates a variety of contexts, and will reflect (and grow with) the overall /. "culture".

    3. Re:Yes, we must filter out the dummies by dJCL · · Score: 5, Interesting

      I've been using a Bayesian spam filter for months now and I understand how they work... Even though people find the comment funny, a Bayesian troll filter on Slashdot would work...

      If you were to run every Slashdot post through my mail filter as an e-mail message and properly mark the trolls and others you don't want, and the ones you do want, suddenly you would only get the actual good posts, and trolling would die quickly... And because of the user classification system currently in place, Slashdot has a huge db to build up the word stats, so it could happen immediately or faster...

      Seriously, I ask that the Slashdot admins consider adding this to slashcode... even if Slashdot does not use it, others would... there are too many trolls out there as it is on the net and many people put them only a few rungs higher than spammers on the evolutionary ladder (but lower than an amoeba still).

      The logic behind this can actually be extended to allow a user to start filtering stories so that they only get ones that interest them, or even to filter submissions to get rid of the cruft. How often do you think that the trolls post troll story submissions? Save work for the site admins...

      I'm curious if an extension of this idea is how Google News works... anyone know?

      Enjoy.

      --
      On Arrakis: early worm gets the bird. Magister mundi sum!
    4. Re:Yes, we must filter out the dummies by Fembot · · Score: 2, Redundant

      Actually, you might joke about this, but I think that having an optional bonus given (as with the current karma bonus) for messages that aren't likely to be classified as Troll, Offtopic or Redundant might be quite helpful for casual readers.

      It would work based on the moderations of comments for the learning, and of course the bonus would be totally user-definable, so you could set it to 0 and it would have no effect.

      Just my thoughts on the issue

    5. Re:Yes, we must filter out the dummies by dark-br · · Score: 3, Funny

      Would it work for editors too? If so *please* implement it!

    6. Re:Yes, we must filter out the dummies by milkmandan9 · · Score: 2, Funny
      I suggest Slashdot immediately implement this "Bayesian Filter for Dummies" to remove most of the trolls, etc.
      So, tell me...would anyone be left?

      Didn't think so.
    7. Re:Yes, we must filter out the dummies by StealthBadger · · Score: 1

      I wonder how much weight the AC posting name would eventually be given for trolling, one way or the other... It would be an interesting experiment.

      If nothing else, it would settle the old question of whether ACs post worthwhile comments or not in an empirical fashion.

      --
      Searching for Truth, Justice, and the Guy Who Boosted My Wallet a Few Weeks Back....
    8. Re:Yes, we must filter out the dummies by Ryan+Amos · · Score: 1

      Yeah, but often I'll get caught by the lameness filter when I'm posting a "no karma bonus" short reply to another post. Like this one.

    9. Re:Yes, we must filter out the dummies by Fembot · · Score: 1

      yeah but a) it's Bayesian so it learns from moderations
      and
      b) it's optional, so rather than not letting you post at all it lets you post, but some users get the post flagged lower.

    10. Re:Yes, we must filter out the dummies by Drakin · · Score: 2, Insightful

      Unfortunately there's also the problem of some uneducated people with mod points who can't tell the difference between a truly insightful post and one that is a well-written troll. Then there are the people who confuse a troll with humor that's on topic for a given discussion.

      So while it works, there's still some holes in the system.

    11. Re:Yes, we must filter out the dummies by jericho4.0 · · Score: 1
      A Bayesian troll filter would not work, for the simple fact that an email is spam or not spam, but a post is many more things to most people.

      --
      "A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
    12. Re:Yes, we must filter out the dummies by inerte · · Score: 1

      A browser plug-in could be made for this too... Just find what separates comments and hide them if they fall below your "spam threshold". Could be more personalized than just a server-side script.

    13. Re:Yes, we must filter out the dummies by bluelan · · Score: 5, Insightful
      This wouldn't work.

      Bayesian filters for spam work because spam has a significantly different vocabulary distribution than useful e-mail. This is true because spam must deliver a commercial message and play on people's uncertainties.

      Good trolls, on the other hand, look ALMOST like insightful, well-written articles. The vocabulary distribution in good trolls is not significantly different than the vocabulary distribution of useful posts. So, Bayesian filters would be useless, unless you come up with some smarter characteristics on which to train the filter.

      You could easily develop a filter for ascii-art porno. But, those are offtopic or flamebait, not trolls.

      --

      I used to be a narrator for bad mimes. (wright)

    14. Re:Yes, we must filter out the dummies by DeadSea · · Score: 4, Interesting
      Bayesian filters for email really only work because spammers can't see which messages you classify as spam. If you implemented a Bayesian filter for trolls on Slashdot, the trolls would see what words constitute a troll and stop using those words. They would stuff their messages with non-troll words to avoid the Bayesian filter.

      The same thing would happen to your mail if the words in your Bayesian filter were the same as the words in everybody else's. Spammers would be able to see what makes an email seem spammy and they wouldn't do that. Bayesian filtering works for email right now because everybody's filters are a bit different. There is currently no magic bullet to get through everybody's spam filters. Also spammers cannot see your filter so they don't know if their message was filtered. If you opened your archive to me, I could quite easily craft a spam that would land square in your inbox.

    15. Re:Yes, we must filter out the dummies by dJCL · · Score: 1

      A lot of the arguments against push on the idea that either it only can categorize to one thing - troll, or that they will adapt...

      Don't get rid of moderation, just assist it with a troll/offtopic/whatever system, bayesian based...

      And it can do more than just good or bad; there is one Bayesian filter out there that has multiple categories of filter... It can learn to recognize anything, and multiple categories is simple...
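
      To make the "multiple categories" idea concrete, here is a minimal sketch of a naive Bayes classifier with more than two buckets. It is not the code of any particular filter; the class name, the category labels and the add-one smoothing are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from math import log


class MultiCategoryFilter:
    """Toy naive Bayes text classifier with any number of categories (hypothetical)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> posts seen

    def train(self, category, text):
        """Record the words of one post that a moderator put in `category`."""
        self.word_counts[category].update(text.lower().split())
        self.doc_counts[category] += 1

    def classify(self, text):
        """Return the most likely category, or None if nothing was trained."""
        vocabulary = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        words = [w for w in text.lower().split() if w in vocabulary]
        best, best_score = None, float("-inf")
        for category, counts in self.word_counts.items():
            # Log prior plus add-one-smoothed log likelihood of each known word.
            score = log(self.doc_counts[category] / total_docs)
            denominator = sum(counts.values()) + len(vocabulary)
            for word in words:
                score += log((counts[word] + 1) / denominator)
            if score > best_score:
                best, best_score = category, score
        return best
```

      Training it on moderated posts would just be a loop calling `train(moderation_label, comment_text)`; whether the result is any good on well-written trolls is exactly the open question in this thread.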

      It could be done... but will it? prolly not.
      Maybe when I get around to writing my own news site I'll code the ability in... we will see.

      Enjoy.

      --
      On Arrakis: early worm gets the bird. Magister mundi sum!
    16. Re:Yes, we must filter out the dummies by bluGill · · Score: 2, Interesting

      Ahh, but a troll that looks genuine at first, and appears on topic, is worth a reading for the laugh. It needs to be marked funny, and depending on how good it is might need some explanation in a followup post to keep those not in the know from thinking the wrong thing.

      OTOH, first post is always useless and a waste of time. So are a few other posts. ASCII art might be easy to filter, but can you filter the porn ASCII art without blocking the guy trying to make a diagram of some sort so we can better understand what is going on?

    17. Re:Yes, we must filter out the dummies by croddy · · Score: 1
      bayesian (and similar) filters would be a massive boon to search engines, although I suspect few (if any) have implemented them at the moment. suddenly, there could be some teeth behind the "I Found This Informative" button. imagine, a search engine that identifies news stories & whitepapers, or blogs, porn, you name it!

      search engines also present a uniquely robust platform for training such filters, due to the massive number of users & queries in/out.

    18. Re:Yes, we must filter out the dummies by Anonymous Coward · · Score: 0

      You think the current /. moderation system is better than an automated filtering system? If /.'s moderation was random, it would be much better than the garbage it has now. Some sort of smart filtering would have to be even better. So often you see complete garbage posts with a +5 and very good posts from unregistered users that are marked as a -1 troll. The current system is horrible.

    19. Re:Yes, we must filter out the dummies by Inthewire · · Score: 1
      --


      Writers imply. Readers infer.
    20. Re:Yes, we must filter out the dummies by antiMStroll · · Score: 1
      Bayesian filtering:

      - doesn't get hired by marketing firms to skew moderation when their client's product is a topic

      - doesn't register multiple accounts or log in from work in order to self moderate

      - won't generate packs of 'friends' and 'fans' to cross-moderate each other up

      - isn't Ameri-centric.

      Perhaps in theory human moderation is better than cold machine filtering, but only if the theory assumes honesty in moderation.

    21. Re:Yes, we must filter out the dummies by ronabop · · Score: 1
      So...Router named penis, multiple injection into multple 3-way MX queues (mouth, anus, vagina), before going to a mailhub (bitch)...

      Wanna diagram it?

      I need a life with less ascii. And less spam words in my vocabulary.

    22. Re:Yes, we must filter out the dummies by joee · · Score: 1
      DeadSea wrote:
      Also spammers cannot see your filter so they don't know if their message was filtered.

      Unfortunately, they do have one tool to measure whether their message is getting filtered: tracking images.

      If you have images turned on in your HTML-enabled MUA, you are giving the spammer information about (a) your own filter configuration, and (b) the filtering configuration of the general HTML-mail-reading population.

      The good news is that I doubt any spammer is sophisticated enough to use the information in (a), so if your filtering mechanism is better than everyone else's, you may still have an effective filter. As soon as the rest of the population catches up to you, though, spammers will adapt.
    23. Re:Yes, we must filter out the dummies by dJCL · · Score: 1

      that could work, but my god! could you imagine the size of that database, associating a score for every word on the page with every word in the search term... and vice versa? that would get big fast, I know google can keep their db running quick, but this would be an order of magnitude more complex on the query side, not saying it could not be done, just difficult to do...

      --
      On Arrakis: early worm gets the bird. Magister mundi sum!
    24. Re:Yes, we must filter out the dummies by Anonymous Coward · · Score: 0

      Well, at least we would get new trolls. In all honesty, some of them are quite ingenious. The main problem seems to be the ones that start UI flamewars.

    25. Re:Yes, we must filter out the dummies by bluelan · · Score: 1

      Good point, we could filter out most ascii art. But, we couldn't filter ascii porno from ascii landscapes.

      --

      I used to be a narrator for bad mimes. (wright)

  2. A bit of info on Bayesian filtering by jat850 · · Score: 5, Informative

    The BBC article mentions Paul Graham, and I found his page (and some more information on Bayesian networks for spam filtering) here:

    Paul Graham's spam page

    He talks a little bit more about the technical aspects there.

    --
    the blood has stopped pumping, and he's left to decay
    the me that you know is now made up of wires
    1. Re:A bit of info on Bayesian filtering by Rosco+P.+Coltrane · · Score: 2, Informative

      From Paul Graham's page :

      A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.

      Tough filter for users of dating services who happen to be gynecologists ...
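
      For anyone wondering where the 99.97% comes from, the arithmetic can be checked in a few lines. This is a sketch of the combination rule as quoted, assuming the two words are independent and that spam and non-spam are equally likely beforehand; the function name `combine` is made up for the example.

```python
from math import prod


def combine(probabilities):
    """Combine per-word spam probabilities into one score for the message.

    Assumes independent words and an even prior, as in the quoted example.
    """
    spam = prod(probabilities)                   # all words point to spam
    ham = prod(1.0 - p for p in probabilities)   # all words point to non-spam
    return spam / (spam + ham)


# "sex" at .97 and "sexy" at .99, as in the quote above.
print(f"{combine([0.97, 0.99]):.2%}")
```

      Worked by hand: 0.97 x 0.99 = 0.9603, 0.03 x 0.01 = 0.0003, and 0.9603 / 0.9606 rounds to 99.97%.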

      --
      "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    2. Re:A bit of info on Bayesian filtering by letxa2000 · · Score: 5, Insightful
      A gynecologist probably wouldn't have a corpus that indicates that "sex" is a .97 spam probability. That's the great thing about Bayesian: the spam probability for each word depends on the mail and spam YOU receive. It works dang well, just as Paul Graham claims. I'm averaging 99.7% accuracy this week, and the one spam that got through was written in German.

    3. Re:A bit of info on Bayesian filtering by GnuVince · · Score: 5, Insightful
      No, because if they have a lot of legitimate mails with words like "sex", "sexy", "penis", "vagina", "viagra", etc., the filter will adapt. That's the whole point. For PG, "sexy" is a sure sign of spam, but for a sexologist, it is not. You train the filter to recognize your spam. So if "sex" appears as much in your legitimate mail as in your spam, "sex" will not be considered a sign of spam.

      Bayesian filters adapt, that's why they work so well.
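
      A rough sketch of why the same word scores differently for different people. This is a simplified estimate, not the exact weighting from Paul Graham's article (real filters add smoothing and clamp the extremes); the function name and the counts are invented for the example.

```python
def word_spam_probability(spam_hits, ham_hits, spam_total, ham_total):
    """Estimate how strongly one word indicates spam in one user's own mail.

    spam_hits / ham_hits: messages containing the word in each pile.
    spam_total / ham_total: size of each pile.
    """
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    ham_rate = ham_hits / ham_total if ham_total else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen: treat the word as neutral
    return spam_rate / (spam_rate + ham_rate)


# Typical user: "sex" shows up in lots of spam and almost no real mail.
typical = word_spam_probability(300, 2, 1000, 1000)
# Sexologist: the word is just as common in legitimate mail.
sexologist = word_spam_probability(300, 300, 1000, 1000)
print(round(typical, 3), round(sexologist, 3))
```

      With these made-up counts the word is close to a sure sign of spam for the first user and completely neutral for the second, which is the adaptation described above.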

    4. Re:A bit of info on Bayesian filtering by eatdave13 · · Score: 1

      A um... sexologist? Where do I have to go to school to be one of those?

      --
      "Verbing weirds language." -- Calvin
    5. Re:A bit of info on Bayesian filtering by Anonymous Coward · · Score: 0

      My filter marks every German mail as SPAM :-)

    6. Re:A bit of info on Bayesian filtering by Inthewire · · Score: 0, Offtopic
      --


      Writers imply. Readers infer.
  3. It's not bad... by Sheetrock · · Score: 2, Interesting

    I've been using it for a bit on my own e-mail, and it seems to work out. But it's not at the point where I'd be happy to see ISPs implementing it for their customers -- even ignoring the Freedom of Speech issue, it still has the occasional false positive.

    --

    Try not. Do or do not, there is no try.
    -- Dr. Spock, stardate 2822-3.




    1. Re:It's not bad... by letxa2000 · · Score: 2, Insightful
      The question is, which produces more false positives: The occasional Bayesian false positive, or the occasional (or not so occasional) good mail that you'll accidentally delete when you're deleting 150 spams per day? If I'm getting 150 spams per day that's 1050 spams per week which is an awful lot of "deletes." You don't think you're going to accidentally throw out a good message now and then when manually deleting that much spam? I'd venture to say that you'll probably accidentally delete more yourself by accident than Bayesian will toss as false positives.

    2. Re:It's not bad... by Anonymous Coward · · Score: 0

      One of the nice features of Popfile (popfile.sourceforge.net) is the addition of an X-text-classification tag to the header of the email. That way an email client can perform an action based on the tag in the mail header - say, moving it to the 'spam' folder if it contains "X-text-classification: spam"

      If an ISP were to start filtering emails, they could do something like that rather than binning the email. That way the user's email program can put the incoming mails into an appropriate folder dependent on the X-text-classification tag (as opposed to deleting them or having the ISP delete them) and the user can have a look through the 'spam' folder and remove any false positives.
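
      The client-side half of that is small. Below is a sketch of picking a folder from the classification header using Python's standard `email` module; the function name `folder_for` and the "inbox" default are assumptions, and a real mail client would do this with its own filter rules rather than a script.

```python
from email import message_from_string


def folder_for(raw_message, default="inbox"):
    """Pick a destination folder from the classification header, if present."""
    msg = message_from_string(raw_message)
    # Header lookups in the email module are case-insensitive.
    label = msg.get("X-Text-Classification")
    if label is None:
        return default
    return label.strip().lower() or default


raw = (
    "From: someone@example.com\n"
    "X-Text-Classification: spam\n"
    "Subject: Lose 10 pounds\n"
    "\n"
    "Body text here.\n"
)
print(folder_for(raw))
```

      Nothing is deleted: a tagged message just lands in a different folder, so a false positive is still there to be rescued.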

    3. Re:It's not bad... by goofy183 · · Score: 1

      An ISP-wide implementation wouldn't work though. Bayesian filtering is a very personal system. Each person's spam and non-spam emails contain very different words. An ISP-wide system would maybe catch the very extreme spam, but it couldn't be very effective without a high false positive rate.

      Web-based systems like Hotmail could implement a Bayesian system on a per-user basis. Each email teaches the filter. Have the filter get slowly stricter as it sees more email. All the user would have to do is flag any spam that comes through as such and the system would update its rules. The user would check the junk mail box every once in a while for false positives. After using the account for a month or so I bet very little spam would ever get through to any one user. It would take little space, as the word database for my last 12,000 emails is about 500K (words expire after not being seen for ~2 weeks).
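
      The "words expire" part is what keeps a per-user database small. Here is a minimal sketch of that bookkeeping, assuming a simple in-memory store; the class and method names are hypothetical and this is not how any specific filter stores its data.

```python
from datetime import datetime, timedelta


class WordStats:
    """Per-user word counts that forget words not seen recently (hypothetical)."""

    def __init__(self, max_age=timedelta(weeks=2)):
        self.max_age = max_age
        self.counts = {}     # word -> [spam_count, ham_count]
        self.last_seen = {}  # word -> datetime of the latest message with it

    def learn(self, words, is_spam, now):
        """Count each distinct word of one message as spam or non-spam."""
        for word in set(words):
            entry = self.counts.setdefault(word, [0, 0])
            entry[0 if is_spam else 1] += 1
            self.last_seen[word] = now

    def expire(self, now):
        """Drop words older than max_age; return how many were removed."""
        stale = [w for w, seen in self.last_seen.items() if now - seen > self.max_age]
        for word in stale:
            del self.counts[word]
            del self.last_seen[word]
        return len(stale)
```

      One `WordStats` per mailbox, with `expire()` run on a schedule, gives each user their own slowly shifting vocabulary instead of one shared ISP-wide table.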

    4. Re:It's not bad... by pyr0 · · Score: 2, Informative

      That's exactly how SpamAssassin works. My school has it set up on their Exchange servers. The problem with that is... you still get the spam. So even if it does go in another folder, it still wastes your bandwidth and your time in deleting it, and as far as the spammer knows it went through. They still get paid for having sent out a successful spam. The only real advantage is being able to read your legit messages without sorting through the spam, and even that is starting to become an issue because it seems the spammers have become much smarter. Thus, I still get around 10 spams a day in my main inbox... although that is still an improvement on the 100 or so a day I actually receive. I still like the idea of those spam tarpits restricting the bandwidth from known spam domains. All email gets through eventually... which isn't a problem for most normal messages. It is a problem, though, when the spammers suddenly find their connection to remote mail servers throttled and can't send all they want to in as little time as it would normally take.

    5. Re:It's not bad... by HermanAB · · Score: 1

      You should try various filters, since they are not all made equal. SpamProbe (SourceForge) is waaaaaay better than any other, since it counts word pairs as well, while other filters only count single words. It is of course more processor hungry, but I don't notice anything. You'd have to have a very busy server before the processing power required would matter.
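
      To show what "counts word pairs as well" means in practice, here is a small tokenizer sketch. It is not SpamProbe's actual tokenizer; the regular expression and the function name are assumptions for illustration.

```python
import re


def tokens_with_pairs(text):
    """Return single words followed by adjacent word pairs from the text."""
    words = re.findall(r"[a-z0-9'$-]+", text.lower())
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + pairs


print(tokens_with_pairs("Click here now"))
# ['click', 'here', 'now', 'click here', 'here now']
```

      The pairs are why it costs more: a message of n words yields roughly 2n tokens to look up, but "click here" can carry a very different score from "click" and "here" taken separately.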

      --
      Oh well, what the hell...
    6. Re:It's not bad... by Nogami_Saeko · · Score: 1

      Likewise, I've also been using it via POPFile. I'm _extremely_ happy with the results. Within a week of implementing it, my incoming spam was cut by over 90% and the classification error rate is exceptionally low.

      I've been running POPFile since January, and on over 1,200 messages (until my HD blew up), the accuracy was over 98%.

      Good deal!

      N.

      --
      "Nothing strengthens authority so much as silence." - Charles de Gaulle
    7. Re:It's not bad... by wheany · · Score: 1

      I reset my statistics on May 7th, and my current accuracy is 98.59% at work, higher at home. I have received 356 mails, of which 5 have been misclassified, with 1 false positive.

    8. Re:It's not bad... by Inthewire · · Score: 1

      Damn.
      I use POPFile at about 98% accuracy and get around 300 emails per day, sorted into 4 categories (mostly spam).
      I'm happy to report that I've only had one good email reported as SPAM in over 14,000 emails, and considering the content, I'm not surprised.

      I'd rather have a false positive than a false negative.

      I love POPFile

      --


      Writers imply. Readers infer.
  4. Origin of SPAM by brejc8 · · Score: 2, Interesting

    It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name.
    Does anyone have proof that's where the name comes from?

    1. Re:Origin of SPAM by jat850 · · Score: 5, Informative

      Good question ... through Google Groups I found this page.

      --
      the blood has stopped pumping, and he's left to decay
      the me that you know is now made up of wires
    2. Re:Origin of SPAM by Anonymous Coward · · Score: 0

      I'm afraid it is something we Brits were fed a bit too much of after the lean years of WWII! What better way to vent our frustration at the seeming monotony of it than to have Monty Python make fun of it.

    3. Re:Origin of SPAM by Anne+Thwacks · · Score: 4, Informative
      While the Monty Python sketch may have inspired the use of the term, the Monty Python usage was in fact a rehash of a sketch by Peter Sellers, dating back to the 1950s, which referred to the wartime situation where cafes often had fancy things on the menu, but when you came to order, the item in question was not available.

      The sketch is to be found on the album "The Best of Sellers" - probably released in about 1958, and which also features the nursery rhyme

      "Up on the chair behind the door,
      hey diddle, diddle,
      Here comes Poppa
      so up with the chopper
      and split 'im down the middle"

      And "Balham, Gateway to the South", a spoof of the travelogue films that often appeared in the cinema at the time.

      --
      Sent from my ASR33 using ASCII
  5. who're the vikings? by Anonymous Coward · · Score: 1

    Monty Python - vikings? What are you on about?

    1. Re:who're the vikings? by Evil-G · · Score: 5, Informative

      A group of vikings in a monty python sketch drowned out normal conversation by shouting the word "spam" louder and louder. The word was then adopted for all the crap drowning out normal conversation on usenet.

    2. Re:who're the vikings? by RobotRunAmok · · Score: 2, Insightful

      The Monty Python Comedy troupe did a rather famous (in some Geek circles) skit in which the virtues of canned Spiced Ham are literally sung. Inexplicably, a group of Vikings join in the song.

      The poster, obviously better schooled in British farce than luncheon meats, is under the impression that the widely accepted nickname for unsolicited e-mail is derived from the comedy sketch and not from Spam(tm), the food.

      I don't know for certain if he's wrong, but I have a hunch he is. I'm guessing a lot more people have eaten Spam than have digested the Python skit...

    3. Re:who're the vikings? by geeber · · Score: 1

      go to http://erik.selwerd.nl/monthy-python.html all shall be revealed.

    4. Re:who're the vikings? by bethanie · · Score: 1

      I'm guessing a lot more people have eaten Spam than have digested the Python skit...

      You're surely right, but among the circles of people who were in a position to name such a thing, I'm quite positive that Monty Python is the origin of the term "Spam."

      ....Bethanie....

    5. Re:who're the vikings? by Anonymous Coward · · Score: 0

      The first time I remember "spam" in Monty Python was a restaurant sketch featuring, among others, Graham Chapman and Terry Jones as man and wife. Upon them asking about the menu, the waiter comes up with menu items like spam-spam-spam-spam-baked beans-spam-spam-spam... etc etc..

      Later on in the sketch, every other time the word spam was uttered, a bunch of vikings sitting by the neighbouring table started a chorus, going something like "spam spam spam spam lovely SPAAAAM, lovely spaaam". At one time a film clip is shown with vikings rowing in longboats, and a historian explaining how the vikings roamed the seas...

      Great stuff.

    6. Re:who're the vikings? by UserGoogol · · Score: 1

      Yes, but it's an old word, and back then the internet/Usenet had a much higher percentage of Monty Python fans.

      --
      "Never attribute to malice that which can be adequately explained by stupidity." -- Hanlon's Razor
    7. Re:who're the vikings? by NickFitz · · Score: 1
      Inexplicably, a group of Vikings join in the song.

      IIRC, that aspect arose from an occasion when the Python team went into one of the BBC's canteens and found a large group of extras from another production in there, all wearing Viking costumes.

      And for all those who think the spam comes in because of the old jokes about BBC canteen food: I contracted briefly at TV Centre in 1999, and the food is both excellent and cheap :-)

      Dunno what it was like in the late 60s - early 70s.

      --
      Using HTML in email is like putting sound effects on your phone calls. Just say <strong>no</strong>.
    8. Re:who're the vikings? by mlk · · Score: 2, Informative

      Not many people know where the term "spam" comes from, but everyone[1] knows what it means (email wise), and it does come from the Monty Python sketch.

      However it did not originally go with bulk email, but instead with some wanker posting the same post over and over again on a newsgroup or IRC.

      [1] Including "normal" users.

      --
      Wow, I should not post when knackered.
    9. Re:who're the vikings? by Anonymous Coward · · Score: 0
      It is official; Netcraft now confirms: *BSD is dying

      One more crippling bombshell hit the already beleaguered *BSD community when IDC confirmed that *BSD market share has dropped yet again, now down to less than a fraction of 1 percent of all servers. Coming on the heels of a recent Netcraft survey which plainly states that *BSD has lost more market share, this news serves to reinforce what we've known all along. *BSD is collapsing in complete disarray, as fittingly exemplified by failing dead last in the recent Sys Admin comprehensive networking test.

      You don't need to be a Kreskin to predict *BSD's future. The hand writing is on the wall: *BSD faces a bleak future. In fact there won't be any future at all for *BSD because *BSD is dying. Things are looking very bad for *BSD. As many of us are already aware, *BSD continues to lose market share. Red ink flows like a river of blood.

      FreeBSD is the most endangered of them all, having lost 93% of its core developers. The sudden and unpleasant departures of long time FreeBSD developers Jordan Hubbard and Mike Smith only serve to underscore the point more clearly. There can no longer be any doubt: FreeBSD is dying.

      Let's keep to the facts and look at the numbers.

      OpenBSD leader Theo states that there are 7000 users of OpenBSD. How many users of NetBSD are there? Let's see. The number of OpenBSD versus NetBSD posts on Usenet is roughly in ratio of 5 to 1. Therefore there are about 7000/5 = 1400 NetBSD users. BSD/OS posts on Usenet are about half of the volume of NetBSD posts. Therefore there are about 700 users of BSD/OS. A recent article put FreeBSD at about 80 percent of the *BSD market. Therefore there are (7000+1400+700)*4 = 36400 FreeBSD users. This is consistent with the number of FreeBSD Usenet posts.

      Due to the troubles of Walnut Creek, abysmal sales and so on, FreeBSD went out of business and was taken over by BSDI who sell another troubled OS. Now BSDI is also dead, its corpse turned over to yet another charnel house.

      All major surveys show that *BSD has steadily declined in market share. *BSD is very sick and its long term survival prospects are very dim. If *BSD is to survive at all it will be among OS dilettante dabblers. *BSD continues to decay. Nothing short of a miracle could save it at this point in time. For all practical purposes, *BSD is dead.

      Fact: *BSD is dying

    10. Re:who're the vikings? by Anonymous Coward · · Score: 0

      We've got Spam, eggs, bacon, and spam..that's not got much spam in it...

      "But I DONT LIKE SPAM!!!!"

      Ah... well then.

    11. Re:who're the vikings? by m0rphm0nkey · · Score: 1

      Who's on first oar.

  6. Dialect or typo? by isomeme · · Score: 2, Funny
    From the article's subhead:
    just as paper junk mail buried many a front door map
    Is that yet another weird British idiom, or simply a typo for "mat"?
    --
    When all you have is a hammer, everything looks like a skull.
    1. Re:Dialect or typo? by Anonymous Coward · · Score: 0

      It's a typo. How dumb do you think we are?

      No, don't answer that.

    2. Re:Dialect or typo? by Anonymous Coward · · Score: 0

      In imperialistic America, the world is your door map!

    3. Re:Dialect or typo? by Anonymous Coward · · Score: 0
      From the article's subhead:
      just as paper junk mail buried many a front door map
      Is that yet another weird British idiom, or simply a typo for "mat"?
      "Th' flat cap slapped on th' map."

      Standard northern British idiom. Use it in a pub in the region to impress the natives.

    4. Re:Dialect or typo? by Anonymous Coward · · Score: 0

      Well of course I keep a map on my front doorstep. How else would people find their way inside?

  7. Vikings? by cperciva · · Score: 1, Funny

    I'd say that the BBC has more in common with the Normans, actually.

    1. Re:Vikings? by tupps · · Score: 1

      The viking reference is to the Vikings that were in the Monty Python Sketch. From memory there were 2 American tourists trying to order something and Vikings kept popping up singing the Spam song!

      --
      Go out and get sailing!
  8. More Spam! by James+Littiebrant · · Score: 3, Insightful

    I have used a Bayesian filter for some time now, and while it is the BEST filter type I have ever used, nothing is 100% reliable. While this is the best technology for the average user, it is most certainly not perfect. Instead I use a combination of moderate Bayesian filtering and good old-fashioned "block sender" filtering.

    1. Re:More Spam! by letxa2000 · · Score: 0, Troll
      Yep, I've been using my own Bayesian filter since about January. I'm currently filtering 99.7% of all spam and my only false positive was arguably acceptable (it wasn't written in English and I didn't really want it).

      For me, spam is a solved problem. It's actually fun watching the spam roll in and watching the statistics count how many emails have been filtered. Anyone that wants plug-and-play Bayesian filtering along with optional keyword filters, blacklists, whitelists, etc. for their POP3 mailbox with nothing to install on the client or server is invited to check out the site listed in my sig. It just works.

    2. Re:More Spam! by Anonymous Coward · · Score: 0

      Please take your adverts elsewhere.
      I am not going to pay some c**t twelve dollars a year for something that should be free.

    3. Re:More Spam! by Anonymous Coward · · Score: 0

      Maybe he should have put [ADV] in his subject.. =)

    4. Re:More Spam! by letxa2000 · · Score: 0, Troll
      I am not going to pay some c**t twelve dollars a year for something that should be free.

      It should be free? Why? It costs bandwidth... But thanks for your most constructive input.

    5. Re:More Spam! by Anonymous Coward · · Score: 0

      You're welcome. And please look up SpamAssassin or any of the other free anti-spam solutions.

    6. Re:More Spam! by letxa2000 · · Score: 0, Troll
      Yep, Spam Assassin works for many people. Different market--requires downloads and installations and doesn't necessarily work on all platforms. But if it works for you, knock yourself out.

    7. Re:More Spam! by NickFitz · · Score: 1
      Please take your adverts elsewhere. I am not going to pay some c**t twelve dollars a year for something that should be free.

      It's spelt "cunt".

      I can sell you a new keyboard for $12 if yours is playing up...

      Terms and conditions apply. Such as, this is not a genuine offer.
      --
      Using HTML in email is like putting sound effects on your phone calls. Just say <strong>no</strong>.
    8. Re:More Spam! by Anonymous Coward · · Score: 0

      Hell, I'm not even 100% reliable. Once in a while I might open "The Power of Citrus: Lose 10 Pounds in 12 Days or Your Money Back", anyway.

  9. Re:Spiced Ham? by Anonymous Coward · · Score: 0

    Thus proving that the TV generation is full of idiots.

    Now, let's be fair. All it proves is that the poster is an idiot, and the SlashDot Editor-on-Duty is either an idiot or just lazy.

  10. Speaking of dummies... by Anonymous Coward · · Score: 5, Informative

    Someone needs to learn the meaning of "ironic". (Hint: it doesn't mean "weird coincidence".)

    Paul

    1. Re:Speaking of dummies... by MechCow · · Score: 2, Funny

      I thought words were defined by how they are used. It would be ironic if you work for webster.

      --
      On Slashdot I'm a lawyer.
    2. Re:Speaking of dummies... by Anonymous Coward · · Score: 0

      Sad how this is scored "4, Informative".

    3. Re:Speaking of dummies... by happystink · · Score: 1

      Right, the example given isn't irony at all. A better example would be a thousand spoons when all you want is a fork.

      --

      sig:
      See the "..for smart people" banners Wired runs here? Look elsewhere guys.

    4. Re:Speaking of dummies... by Anonymous Coward · · Score: 0

      A better example would be a thousand spoons when all you want is a fork.

      Actually, that's known as a "bummer".

      Irony is when actions specifically taken to accomplish one task do the opposite. Many of Microsoft's "productivity" products qualify.

      Paul

    5. Re:Speaking of dummies... by anonymous+cowfart · · Score: 0

      No no, it's like ten thousand spoons when all you need is a knife, it's meeting the man of your dreams and then meeting his beautiful wife.

      --

      So I'm a pervert. Welcome to the Internet.
    6. Re:Speaking of dummies... by bn557 · · Score: 1

      Isn't the definition of irony something like when the actual meaning of a thing is the opposite of the literal meaning? I don't know... whatever.

      p
      (only posted not anonymously because I think everyone should take credit for their posts, whether on topic or not)

      --
      Humans are slow, inaccurate, and brilliant; computers are fast, accurate, and dumb; together they are unbeatable
  11. Spam = /dev/null by Enraged_jawa · · Score: 1

    Bayesian filtering could stop all the spam that easily? This is great! Where can I download a filter like this? And back in the mid to late 80's or so, at least around Bell Labs where I worked, SPAM stood for Stupid People Asking for Money, when did that change?

    1. Re:Spam = /dev/null by letxa2000 · · Score: 0, Troll
      Bayesian filtering could stop all the spam that easily? This is great! Where can I download a filter like this?

      Quite a few sources, but feel free to try this site which offers it as a service. Nothing to download, nothing to install. Just point your POP3 client to this site and it'll filter your mail for you. I'm at 99.4% of spam filtered for this month and 99.7% for this week.

    2. Re:Spam = /dev/null by GammaTau · · Score: 4, Informative

      Bayesian filtering could stop all the spam that easily? This is great! Where can I download a filter like this?

      You can try bogofilter, ifile, SpamBayes, or POPFile. The newer versions of SpamAssassin also implement some kind of Bayesian filtering.

    3. Re:Spam = /dev/null by Anonymous Coward · · Score: 0

      Nice, but why don't you mention that it costs at least $12/year? $20 for 60,000 emails, which I reckon I will easily get next year (99.9% spam)
      Basically, fuck off with your advertising.

    4. Re:Spam = /dev/null by mnemonic_ · · Score: 2, Insightful

      I like SpamBayes for its ability to be trained on past spam. You can point it to a folder full of past spam and it scores them all, which is much faster than gradually teaching the software to recognize spam through individual email updates.
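
      A minimal sketch of that kind of bulk training, assuming one message per file in a folder. This is not SpamBayes's own code or API; `tokenize` and `train_on_folder` are hypothetical names, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter
from pathlib import Path

# Crude word-like tokens; a real filter would also look at headers, URLs, etc.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

def train_on_folder(folder, counts=None):
    """Add each message's distinct tokens to a running count.

    Returns (counts, n_messages): how many messages contained each token,
    and how many messages were read in this call.
    """
    counts = Counter() if counts is None else counts
    n_messages = 0
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(set(tokenize(text)))  # count each token once per message
        n_messages += 1
    return counts, n_messages
```

      Running it once over a spam folder and once over a folder of legitimate mail would give the two count tables that a Bayesian filter works from.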

      POPFile does not have this convenient ability (yet), though it does do general purpose sorting (i.e. not just differentiate between spam and non-spam, but stuff like work, school, linux or whatever you want). It does take a while to train though.

    5. Re:Spam = /dev/null by Roadmaster · · Score: 1

      I use SpamProbe; it's quite mature, actively maintained, has good performance and plenty of features. Of course this depends on your platform; for Windows I've heard good things about POPFile.

    6. Re:Spam = /dev/null by Anonymous Coward · · Score: 0

      Have you noted those lkhassxx34mmdc in your SPAM lately? These random nonsense words will corrupt your Bayesian filtering stats, and eventually some spam will start to get through. They just put enough random words in the message (often 'hidden' in html-tags).

      Time to think up something new. Reject all HTML-mail?

    7. Re:Spam = /dev/null by letxa2000 · · Score: 1
      Have you noted those lkhassxx34mmdc in your SPAM lately? These random nonsense words will corrupt your Bayesian filtering stats, and eventually some spam will start to get through. They just put enough random words in the message

      Actually, no, it won't. Those weird letter combinations are there to make the messages "distinct" (so it's hard to tell if two spams received by two different people are the same). They are also aimed at filters that break the spam down into sentences or whatever and, again, try to determine if it is spam by comparing it to other reported spam.

      These random words will not do anything to corrupt Bayesian at all. Since an unknown word in Bayesian is assigned some rather-neutral value (such as 0.4 or 0.5), it is not enough to make an otherwise 0.99 spammy message come down to a non-spammy level. I.e., Bayesian only takes into account the most spammy terms ("sex"=0.99, etc.) and the least spammy terms ("Doug"=a friend of mine that appears often in my email=0.01). Inserting random words will generate terms worth 0.4 or 0.5 which will NOT be considered because they are not far away from the neutral 0.5, so all the other spammy indicators will decide the Bayesian score, not the random words.

      The only thing these random words do is make your Bayesian statistics file grow a bit. But if your Bayesian filter is worth its salt, old terms over some age that have only been seen once will be purged, since they are probably useless and there are definitely more interesting tokens that will indicate spam or not.
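
      A minimal sketch of the selection step described above, assuming a Graham-style filter that keeps only the 15 tokens farthest from 0.5 and combines them naive-Bayes style. `spam_score` and the per-token values are hypothetical, chosen only to illustrate the argument.

```python
from math import prod

def spam_score(token_probs, n_interesting=15):
    """Combine the n most 'interesting' token probabilities.

    Interesting means farthest from the neutral value 0.5.
    """
    ranked = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    top = ranked[:n_interesting]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)

spammy = [0.99] * 10 + [0.01] * 2 + [0.9] * 5   # hypothetical per-token values
padded = spammy + [0.4] * 200                    # same message plus 200 random words

print(spam_score(spammy) == spam_score(padded))  # prints True
```

      Note the assumption built into the example: the message already carries at least 15 strongly scored tokens. With fewer than that, some of the 0.4 padding would make it into the top 15 and nudge the score.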

    8. Re:Spam = /dev/null by sabaco · · Score: 2, Informative

      Don't forget SpamProbe as well. I've been using it for a couple weeks, and it has been working very well for me. I've gotten around 1400 messages, and so far 1 false positive and 6 false negatives. I don't know how well the other filters work, but that seems pretty good to me. It's sure a hell of a lot better than the DNS blacklists I use. (I'm still using those. After all, they filter out the first 70% of my incoming mail and are probably faster anyway.)

      --
      This is SO educational! -- Kintaro Oe
    9. Re:Spam = /dev/null by patrickjolliffe · · Score: 1

      Try the mail application which comes as part of Mozilla - it has Bayesian spam filtering built in - details here

    10. Re:Spam = /dev/null by Anonymous Coward · · Score: 0

      A fool and his money are soon parted...
      http://www.smh.com.au/articles/2003/05/23/1053585693509.html

    11. Re:Spam = /dev/null by Enraged_jawa · · Score: 1

      I use Mozilla Firebird, browser only..

    12. Re:Spam = /dev/null by patrickjolliffe · · Score: 1

      Me too. I guess you could try Thunderbird which seems to have this feature, but I haven't tried it myself yet.

    13. Re:Spam = /dev/null by Nogami_Saeko · · Score: 1

      There's a small Perl program (insert.pl) included with POPFile to do exactly what you require in terms of past spam training:

      You just tell it that all of the messages are spam for example, and it adds everything automatically.

      N.

      --
      "Nothing strengthens authority so much as silence." - Charles de Gaulle
    14. Re:Spam = /dev/null by Anonymous Coward · · Score: 0

      So the total length of the message does not count at all: a thousand-word message containing one single spam word 'sex' with 999 good or never-seen-before words would still be classified as spam? Isn't the probability of false positives quite high if it worked this way?

      And I've seen messages get through (bogofilter 0.11.2) that contained an equal or higher amount of random words intermixed with normal spam text. (My template for the spam word list was my spam collection of some 6000 messages.)

    15. Re:Spam = /dev/null by wheany · · Score: 1

      I recommend POPFile. It's multiplatform (it uses Perl, and it has a Windows installer that includes a stripped-down version of Perl), and it can categorise your mails in more categories than just "spam" and "not spam", e.g. "spam", "work", "personal", "tennis" etc.

      It acts as a proxy between your mail program and the POP server, so you can keep using your favorite mail program. It only supports POP at the moment, but IAMP support is on its way.

      POPFile also recognises and counters many common spammer tricks, like commented-out-words (pe<!-- asdasd -->nis), d.o.t.t.e.d w.o.r.d.s, S p a c e d O u t W o r d s, same colored text and background on html mails ("invisible ink"), and more.
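
      A rough sketch of how two of those tricks could be undone before tokenizing. This is not POPFile's implementation; `normalize` and both regular expressions are assumptions made for illustration.

```python
import re

# HTML comments hidden inside a word, e.g. "vi<!-- asdasd -->agra".
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Three or more single letters joined by one repeated separator,
# e.g. "d.o.t.t.e.d", "V-l-a-g-r-a" or "S p a c e d".
SPLIT_WORD = re.compile(r"\b[A-Za-z]([.\- ])[A-Za-z](?:\1[A-Za-z])+\b")

def normalize(text):
    """Strip HTML comments, then glue split-up words back together."""
    text = HTML_COMMENT.sub("", text)
    return SPLIT_WORD.sub(lambda m: m.group(0).replace(m.group(1), ""), text)

print(normalize("vi<!-- asdasd -->agra"))   # prints viagra
print(normalize("d.o.t.t.e.d w.o.r.d.s"))   # prints dotted words
```

      Space-separated letters lose their word boundaries, so "S p a c e d O u t" comes back as one token, "SpacedOut". For a statistical filter that is probably fine, since the glued token is itself a strong signal.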

    16. Re:Spam = /dev/null by wheany · · Score: 1

      And with IAMP I mean IMAP...

    17. Re:Spam = /dev/null by Inthewire · · Score: 1

      Thanks.
      I've been looking for someone to read my mail.
      I wasn't sure I'd been catching it all myself.

      --


      Writers imply. Readers infer.
  12. But I don't like Spam by Anonymous Coward · · Score: 0, Offtopic

    But it doesn't have much Spam in it.

    1. Re:But I don't like Spam by shish · · Score: 1

      On the spam ingredients list:

      Spam: Pork, Ham

      why not just "pig meat"?

      --
      I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
  13. Ironic? by popeydotcom · · Score: 4, Funny

    Interesting yes, ironic, no.

    What's your name, Alanis Morissette ?

    1. Re:Ironic? by JahToasted · · Score: 1

      That song was ironic if you think about it. You have a song called "Ironic" that doesn't contain one example of irony. That in itself is irony...

    2. Re:Ironic? by DavyByrne · · Score: 4, Insightful

      Actually, I've long wondered whether Alanis was quite clever in choosing a title for that song.

      You see, none of the events she describes in the song is an example of irony, making the choice of the title "Ironic," well, ironic.

    3. Re:Ironic? by Anonymous Coward · · Score: 0

      No, that in itself was retarded.

    4. Re:Ironic? by Anonymous Coward · · Score: 0

      I heard some comedian talking about that:
      "Ten thousand spoons when all you need is a knife, isn't it ironic"
      No, not unless the reason you need a knife is to stab the dickhead who keeps leaving thousands of spoons in your house

    5. Re:Ironic? by Anonymous Coward · · Score: 0

      secret-passage.com (not pr0n) has a ditty on that very thing.

    6. Re:Ironic? by pugh · · Score: 1

      Ed Byrne: "'It's like rain on your wedding day,'... that's not ironic. Not unless you're marrying the god of clement weather."

      --
      "I am a die-hard capitalist....but unethical, lying, bastard capitalism is really no better than socialism" - unknown
    7. Re:Ironic? by TheRevenant · · Score: 1

      Actually, at least some of the events ARE ironic, in the sense of "10. The quality of an occurrence being so unexpected or ill-timed that it appears to be deliberately perverse." - Oxford Paperback Dictionary (4th edition, 1994) or "A condition in which one seems to be mocked by fate or the facts." - The Chambers Dictionary, 1998

      See http://www.geocities.com/eirig/

  14. Reminds me of a story by joelt49 · · Score: 3, Funny

    This whole spam thing reminds me of a story I read while in 7th grade. In it, the postage for sending junk mail was decreased to practically nothing. Then, junk mail buried America. Hundreds of years later, archeologists came back and investigated the remains. Their conclusions about our society are kind of humorous. However, the idea of junk mail burying us when the postage goes way down has kind of been proved with spam. Maybe a small tax for spam wouldn't be a bad idea.

    1. Re:Reminds me of a story by kahei · · Score: 1

      Was that book called 'Motel of the Mysteries'? A parody of the Tutankhamen excavation?

      It was good.

      --
      Whence? Hence. Whither? Thither.
  15. No no no, Bayesian Filtering *OF* Dummies please by corebreech · · Score: 3, Funny

    Thank you for your support.

  16. Do spammer's techniques work on slashdot ? by Rosco+P.+Coltrane · · Score: 4, Funny

    Viagra often spelled V-l-a-g-r-a online

    I-f I t-r-o-l-l l-i-k-e t-h-i-s, w-i-l-l i-t p-a-s-s S-l-a-s-h-d-o-t.'s t-r-o-l-l f-i-l-t-e-r ?

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    1. Re:Do spammer's techniques work on slashdot ? by Anonymous Coward · · Score: 2, Funny

      1. refresh slashdot page once a minute
      2. wait 15 seconds
      3. ???
      f-r-i-s-t p-o-s-t!!!

      i-m-a-g-i-n-e a b-e-o-w-u-l-f c-l-u-s-t-e-r o-f B-a-y-e-s-i-a-n F-i-l-t-e-r-s!

    2. Re:Do spammer's techniques work on slashdot ? by MyHair · · Score: 3, Funny

      S-t-e-p-h-e-n K-i-n-g i-s d-e-a-d a-t 5-2 !

      B-S-D i-s d-y-i-n-g ! N-e-t-c-r-a-f-t c-o-n-f-i-r-m-s i-t !

    3. Re:Do spammer's techniques work on slashdot ? by Saucepan · · Score: 1
      I-f I t-r-o-l-l l-i-k-e t-h-i-s, w-i-l-l i-t p-a-s-s S-l-a-s-h-d-o-t.'s t-r-o-l-l f-i-l-t-e-r ?

      It will the first time. But after that the filter will have learned that lots of single-letter words are an excellent troll predictor.

      In Slashdot's case I might even try keeping track of punctuation characters as words. In addition to helping with dash-trolling it could learn to distinguish perl code snippets from ASCII art drawings of the Goatse dude.

      Of course as another poster pointed out it would be impossible to stop all trolling since even humans can't reliably tell the difference between a subtle troll and a genuine fringe opinion.
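
      A small sketch of the tokenizing idea above: keep punctuation characters as tokens of their own and measure how many words are a single letter. The names and the regular expression are hypothetical; a real filter would feed these tokens into its word statistics rather than compute a ratio directly.

```python
import re

# A run of letters/digits, or any single punctuation character as its own token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")

def tokenize(text):
    """Split text into lowercase words and individual punctuation marks."""
    return TOKEN_RE.findall(text.lower())

def single_letter_ratio(tokens):
    """Fraction of word tokens (punctuation excluded) that are one character long."""
    words = [t for t in tokens if t.isalnum()]
    if not words:
        return 0.0
    return sum(len(w) == 1 for w in words) / len(words)

print(single_letter_ratio(tokenize("I-f I t-r-o-l-l")))              # prints 1.0
print(single_letter_ratio(tokenize("a perfectly ordinary comment")))  # prints 0.25
```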

  17. Wrong pic... by Mondoz · · Score: 4, Informative
    It's slightly ironic that the BBC, through the commissioning of Monty Python, also gave 'spam' its name.

    Why then, does the article show a pic from a Monty Python animation about the black spot who goes to seek his fortune...
    You'd think they'd use the actual pic of the skit with the Vikings in the cafe...

    --
    /sig
  18. Hmmm by Anonymous Coward · · Score: 2, Insightful

    So this filter works on analysis of previously filtered mail?

    I can see the casual (mis)use of this technique by your average user rapidly becoming a problem - putting just one email from a legit e-mail sender into the bayesian filter could conceivably snowball into a block on a lot of legit traffic under certain circumstances.

    Above and Below knows I have enough hassle with users and their e-mail already

    1. Re:Hmmm by letxa2000 · · Score: 3, Insightful
      I can see the casual (mis)use of this technique by your average user rapidly becoming a problem - putting just one email from a legit e-mail sender into the bayesian filter could conceivably snowball into a block on a lot of legit traffic under certain circumstances.

      It's natural to think that is the case, but in reality it isn't. Accidentally putting one email in the wrong corpus ("good" or "spam") will not be enough to kill you. If you consistently fail to put them in the right corpus then over time, yes, things would snowball. But that'll only happen over time. A mistake now and then isn't enough to mess things up.

    2. Re:Hmmm by kirkjobsluder · · Score: 1

      I think there is a concern here in that I've found that bayesian filtering works best because it is individualized to me. A shared database could be poisoned by a malicious user.

  19. Required Reading by E-mail Users by Shackleford · · Score: 3, Interesting
    This "Bayesian Filtering for Dummies" article, titled "How to spot and stop spam" on the BBC web site, gave much useful information on the problem of spam and the filtering method used to get around it. It is quite comprehensible, as you certainly don't need to know the probability theory behind Bayesian filtering to understand it. It gives useful information on the problem of spam, and I'd say that this sort of article is required reading for all those who use e-mail. Why? Because it states this fact:

    "The sheer number of spam mail sent means that even tiny response rates, reportedly 0.0001%, means junk mailers turn a profit. "

    And this is why I say that educating users is just about as important as implementing spam filtering technology. If people know that they are perpetuating a serious problem by replying to spam, then that's bad news for spammers.

    About another fact mentioned in the article: It said Paul Graham's filter extracts "the top 15 features that define them as spam." 15? I thought that most Bayesian filters use many more spam-defining features. Because I'd say that there are quite a few more. Just think of the many features that spam tends to have. But he says his filter works well. Interesting.

    1. Re:Required Reading by E-mail Users by dJCL · · Score: 3, Insightful

      From my understanding of his full explanation (I read it a while ago, can't remember where, dig around some), each e-mail has every word examined and given a rating from 0.01 (good) to 0.99 (spam). Then the 15 words farthest from 0.50 are selected, some averaging is done, and if the score is over some threshold (say 0.90) it is called spam and trashed. I use spamunition for my Outlook e-mail (working on moving my e-mail over to Linux, hopefully soon, so I can del my Windows boxen); it can give the stats for each e-mail, and it appears to use the same formula...
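
      For illustration, here is one way the per-word rating could be derived from training counts. It loosely follows Paul Graham's published description as best recalled; the doubling of the good count and the minimum-occurrence cutoff of 5 are assumptions, not something stated in this thread, and `token_probability` is a hypothetical name.

```python
def token_probability(good_count, bad_count, n_good, n_bad):
    """Estimate how spammy a token is, clamped to the range 0.01..0.99.

    good_count / bad_count: occurrences of the token in legit / spam mail.
    n_good / n_bad: number of legit / spam messages trained on (both > 0).
    Returns None for tokens seen too rarely to judge.
    """
    g = 2 * good_count   # assumed: legit mail counted double to limit false positives
    b = bad_count
    if g + b < 5:        # assumed cutoff for rarely seen tokens
        return None
    good_freq = min(1.0, g / n_good)
    bad_freq = min(1.0, b / n_bad)
    p = bad_freq / (good_freq + bad_freq)
    return max(0.01, min(0.99, p))

# Corpus sizes borrowed from the figures in this comment: 1000 legit, 5200 spam.
print(token_probability(0, 40, 1000, 5200))   # prints 0.99 (only ever seen in spam)
print(token_probability(30, 0, 1000, 5200))   # prints 0.01 (only ever seen in legit mail)
```

      The clamping is what produces the 0.01 and 0.99 endpoints: a word never seen in one corpus is treated as strong evidence, but not as certainty.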

      Part of the reason this all works is that spammers slowly change their wording over time to beat the static filters, but the Bayesian filter will still catch it on other parts of the message and add the new wording to the db... The only spam that ever gets through to me now is stuff that is worded exactly like a normal e-mail, and even then they have a hard time, yet all my friends have no problems...

      I think the key here is to (with this software) never delete any e-mail: spam goes to the spam folder, sort the other stuff, and stuff you wanted but don't need, move to another folder just to allow the filter to know what to look for... I have 5200 spam e-mails saved and about 1000 legit mail saved and my accuracy level is about 99.9...

      Read up on it, this stuff really does work.

      Enjoy

      --
      On Arrakis: early worm gets the bird. Magister mundi sum!
    2. Re:Required Reading by E-mail Users by WolfWithoutAClause · · Score: 1
      And this is why I say that educating users is just about as important as implementing spam filtering technology. If people know that they are perpetuating a serious problem by replying to spam, then that's bad news for spammers.

      The reason that spam works so well is that the proportion of idiots out there who either didn't attend the lesson, didn't believe the lesson, or didn't listen in the lesson is small, but significant. There's one born every minute as they say.

      About another fact mentioned in the article: It said Paul Graham's filter extracts "the top 15 features that define them as spam." 15? I thought that most Bayesian filters use many more spam-defining features. Because I'd say that there are quite a few more. Just think of the many features that spam tends to have. But he says his filter works well. Interesting.

      Actually, he wasn't that clear. His software picks the top 15 features that suggest it to be spam, or suggest it to be non-spam, and then judges the message on that. The Bayesian filters record literally thousands of 'rules' as you put it; it's just that at most only a specially chosen subset of 15 apply to any one message. And his software works exceedingly well. My personal installation is running at well over 98% accuracy.

      --

      -WolfWithoutAClause

      "Gravity is only a theory, not a fact!"
    3. Re:Required Reading by E-mail Users by mlk · · Score: 1

      Most bayesian filters will create a db of words, so actually saving your emails is not required.

      --
      Wow, I should not post when knackered.
    4. Re:Required Reading by E-mail Users by dJCL · · Score: 1

      thanks, I'll have to test that out on mine, delete some old stuff...

      I know everything, I just forget where in my mind I placed the files...

      Enjoy.

      --
      On Arrakis: early worm gets the bird. Magister mundi sum!
    5. Re:Required Reading by E-mail Users by kindbud · · Score: 2, Interesting

      I have 5200 spam e-mails saved and about 1000 legit mail saved and my accuracy level is about 99.9...

      Yes, but you haven't reduced your exposure to spam. In fact, it looks like now you have to track your spam intake assiduously so as to keep the filter trained. Not many people would consider this an improvement. :)

      --
      Edith Keeler Must Die
  20. I don't receive spam by Rosco+P.+Coltrane · · Score: 4, Interesting

    In my home mailbox, I don't receive spam. And I only got two 419 Nigerian investment frauds on my professional address in a whole year, despite the fact that my corporate email address is widely publicized and easy to find on google. And amazingly, I never receive spam in my "special bogus registration" hotmail account (useful for programs like RealPlayer, or nytimes.com).

    So existing mail filters work for me, more or less. The few unwanted mails that pass through are easily taken care of by my trusted delete button. This leads me to ask :

    - Do other people really receive that much spam, or am I an isolated case ?

    - Do people who receive spam purchase things online, or register software and other services with their real names and email ?

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    1. Re:I don't receive spam by Anonymous Coward · · Score: 0

      I receive about 40-50 spams a day. Mostly due to my Usenet activity, though a lot of them come to an address for a FAQ on the web I used to edit. (I ought to forward that address to the new editor, heh heh.)

    2. Re:I don't receive spam by Rosco+P.+Coltrane · · Score: 1

      Looks like I'm a troll now, geez ...

      It was a real question though, I post on usenet too, on various mailing lists that get indexed on google somehow, I maintain several opensource projects, I have a homepage with my email in plaintext at the bottom, etc ... but I almost never get spam. I just wondered why :-)

      --
      "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    3. Re:I don't receive spam by dmeranda · · Score: 1

      Maybe how you count, but most of us don't need v1agra or naked ch33r1eaders, and therefore consider all those messages to be spam! .. actually, on second thought ;-)

      The obvious question is how much mail do you receive in total, how much non-spam, and how many false-positives go completely unnoticed by you? I've had my email account since the late 1980's and I get over 200 per day. I also run a mail gateway for a medium sized company, and we get over 30,000 per day.

      There are in fact two big problems with Bayesian filtering (or any content-based filtering) from the perspective of an ISP or company... 1) one person's spam is another person's necessity (and usually that person is your boss or VP), and 2) you still have to waste your bandwidth and CPU before you reject it. Sure the bulk of it is very obvious, but there is an awfully fuzzy and thick gray line between good and bad. And the spammers are adapting rather quickly too. So Bayesian filters are a good tool of last resort, but there are many other tools that should be used too.

    4. Re:I don't receive spam by Rosco+P.+Coltrane · · Score: 1

      The obvious question is how much mail do you receive in total, how much non-spam, and how many false-positives go completely unnoticed by you?

      Well, this is what I can tell you : I've had my corporate email (the one that I really use publicly) for maybe 5 years, and I get maybe 15 mails/day not counting mailing lists, and possibly 300 total, the LKML taking a lot of that extra traffic. During the first year I've worked for my company, I was a support engineer, and I've never had (or heard of) a customer who complained I never answered his email, which would indicate that the filter never rejected a legit email.

      --
      "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    5. Re:I don't receive spam by Jeremi · · Score: 1
      - Do other people really receive that much spam, or am I an isolated case ?

      Yes, they do... I probably get 50-60 spams a day

      Do people who receive spam purchase things online, or register software and other services with their real names and email ?

      I made the mistake of putting my unobfuscated email address on my web page... bad idea :^P

      --


      I don't care if it's 90,000 hectares. That lake was not my doing.
    6. Re:I don't receive spam by An+Onerous+Coward · · Score: 1

      I've had the same Yahoo address for about four years now. I'm up to about 1200 spams a month. 95% end up in the "Bulk Mail" folder, but I still have to look through that for the occasional false positive. The sixty that slip through don't exactly make my heart leap with anticipation either.

      --

      You want the truthiness? You can't handle the truthiness!

    7. Re:I don't receive spam by letxa2000 · · Score: 4, Insightful
      There are in fact two big problems with Bayesian filtering (or any content-based filtering) from the perspective of an ISP or company... 1) one person's spam is another person's necessity

      But that's why Bayesian advocates every user having their own Bayesian statistics. It's not a "one size fits all" for the entire ISP or company, as is the case with most keyword filters. Every user has a different set of Bayesian statistics, which is why it is very difficult for spammers to get around this filter--they have no way of knowing what words are in each user's statistics.

      2) you still have to waste your bandwidth and CPU before you reject it.

      It's better to waste your bandwidth and your CPU than to waste the time of those receiving the spam. IMHO...

      So Bayesian filters are a good tool of last resort, but there are many other tools that should be used too.

      The quicker everyone uses Bayesian filters (as opposed to waiting until all the other filters are incapable of keeping up with spam) the sooner the spammers will be in trouble. I personally use both a Bayesian filter with an up-to-date blacklist of known spamvertised domains, etc. I find that, quite simply, the simple keyword filters catch spam from known spam sites and Bayesian catches the rest. But if I turned off my normal filters Bayesian would have caught it all since those spams are always assigned a high Bayesian score, too. It almost makes sense to turn off the other filters, but they can be useful if a spammer comes up with a truly unique spam and someone else has already identified the domain name. It's rare, but it can happen. So a combination of technologies is probably the best... but a combination that lacks Bayesian is a combination that could be better.

    8. Re:I don't receive spam by HermanAB · · Score: 1

      You could use Popfile (on Sourceforge) for Windoze. It works. AFAIK it counts single words only so it is not the best, but you would get about 97% efficiency. If word pairs are counted, the efficiency can be about 99%. Another method is to use the Mozilla (www.mozilla.org) mail client. It has single-word-counting Bayesian filtering built in and there are versions for every imaginable operating system.
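
      To make the single-word versus word-pair distinction concrete, here is a tiny sketch of a tokenizer that emits both. The function name is hypothetical, and it says nothing about how any particular filter does it or about the accuracy figures quoted above.

```python
def unigrams_and_bigrams(text):
    """Return single words followed by adjacent word pairs."""
    words = text.lower().split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs

print(unigrams_and_bigrams("Click here now"))
# prints ['click', 'here', 'now', 'click here', 'here now']
```

      The pair "click here" can carry a very different spam rating from "click" and "here" taken separately, which is the extra information a pair-counting filter gets to use, at the cost of a larger statistics file.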

      --
      Oh well, what the hell...
    9. Re:I don't receive spam by HermanAB · · Score: 1

      I've had the same e-mail address for > 8 years and get >150 spams per day and only 1 or 2 legit messages. Spamprobe (Sourceforge) removes about 99% of the spam, which still leaves me with a 50% signal to noise ratio in my inbox...

      --
      Oh well, what the hell...
    10. Re:I don't receive spam by yancy · · Score: 1
      Do other people really receive that much spam, or am I an isolated case ?

      I've often wondered this too. What the heck do people do to get themselves in the 100+ spams per day situation?

      Do people who receive spam purchase things online, or register software and other services with their real names and email ?

      I'm guessing yes, and often. I get about 20 spams per day in my years-old seldom used Hotmail account, which I used for plenty of stupid things in the past (in which case, 20 isn't so bad compared to what a lot of people quote).

      I get about 3 per week in my personal account, which I use for *tons* of registrations and memberships, albeit with "legit" organizations and businesses.

      I get zero in my work account, which is limited strictly to business use.

      So again, what the heck are the rest of you doing to get buried in spam? I guess I might consider spam a problem if I got even half of what the vocal spam-haters claim to receive.

      Yancy

      --
      "My license to make fun of everyone comes from knowing I'm the biggest joke of all."
    11. Re:I don't receive spam by Skapare · · Score: 1

      You are not alone. There are lots of people that never get any spam. For 4 years, my mother never received any spam whatsoever. She used her email address at a small ISP only to exchange email with family. I told her long ago to never use her address when providing input to web sites. Later she opened an AOL account and got another email address. She gave out that address over the telephone to a couple of companies. By the year she died, she was getting about 5 spams a day on that AOL address while still none on the small ISP address. There's no telling how the spammers got it. She wanted to blame AOL. I wasn't so sure about that.

      I've used coded addresses for every different place I give out an email address for the past 2 years. So far, only one of those coded addresses has received any spam. That was the address I used on monster.com (which has probably leaked on the recruiter/employer side of the site). My primary address, which I have been using for years before, and is the one my domains are associated with as well, gets 50 to 200 spams a day, typically. I've seen peaks of over 1500 a day generally caused by some very abusive spammers (almost all of that 1500 from the same spammer, apparently). My filtering is based only on the IP address or domain name of the sending mail server (i.e. I do not use any content filtering at all). Up until about a couple weeks ago I was seeing about 1 spam leak through a week. Over the past 2 weeks I've been getting about 2 to 3 leak through a day. I've also noticed a big rise in the number of spams through May 2003. Spammers seem to be getting more aggressive.

      --
      now we need to go OSS in diesel cars
    12. Re:I don't receive spam by ptbarnett · · Score: 1
      Is there any filtering apps for windows that dont automatically delete spam, but download to a special spam folder?

      Cloudmark does this. I don't use it directly, but my installation of SpamAssassin checks the Cloudmark/Razor servers for the message signature.

      Since my email is hosted on a Linux server, I use procmail (with SpamAssassin) to filter spam into a Spam folder.

    13. Re:I don't receive spam by Skapare · · Score: 1
      I've often wondered this too. What the heck do people do to get themselves in the 100+ spams per day situation?

      That's a very good question. But the answer is a strange one. My mail server receives hundreds of spam attempts every day for an accumulated set of several hundred email addresses that have never even existed. Some of them look like dictionary or name list attacks where the spammer tries a few common names (they try more on the larger ISPs, I think). Some of these addresses are rather distinctive. I took one of the distinctive ones and did some google searching. I managed to not only figure out where the spammer probably harvested the address, but I also tracked down who probably submitted the address with a forged domain name to the web board I found it on, including their real email address. Given the reality of it being spammed, I can't say that I blame them for not giving their real address. And no, I did not forward all their spam to them.

      As for my primary email address, to which 50 to 200 spams are attempted daily, it's been around for years and was used in lots of Usenet posts and several domain registrations. It's probably on most spam CDs.

      --
      now we need to go OSS in diesel cars
    14. Re:I don't receive spam by drunkenbatman · · Score: 1

      By the year she died, she was getting about 5 spams a day on that AOL address while still none on the small ISP address. There's no telling how the spammers got it. She wanted to blame AOL. I wasn't so sure about that.

      Heh, it probably was AOL. Check the court transcripts from a recent Slashdot story (i, spammer) and you'll hear the spammer claim that AOL sold him their customer list... to which the AOL rep basically said "Oh, well they could have opted out...".

  21. Apple's Mail app... by useruser · · Score: 4, Interesting

    ...supposedly uses some form of Bayesian reasoning. I've been using it for a year now. I trained it for a couple of weeks, turned it on "automatic filtering" mode, and now I can count the number of times it's misclassified a message on my two hands. I used to get more spam than legit mail; now I can't help but wonder why spam is a problem for people. Until I remember that most people don't use a Mac. Every once in a while, I flip it back into training mode so that I can see the lovely sea of brown-colored spam messages that flood my inbox. I flip it back to automatic mode, Mail automatically moves them to my junk folder, and I can forget about them.

    1. Re:Apple's Mail app... by Anonymous Coward · · Score: 5, Informative

      Actually, the latent semantic analysis (LSA) that Apple uses is not a form of Bayesian reasoning; it uses a singular value decomposition (SVD) to perform generalized factor analysis. However, there is a probabilistic version of LSA out there.

    2. Re:Apple's Mail app... by Anonymous Coward · · Score: 0

      I'm also using Apple's Mail.app...and it used to work well, except for a few false positives. The false positives were fixed by a bit of training.

      Recently (the past couple of months), I've started to get a lot of false negatives. About 20% of spam gets through without the filter detecting them. I've been marking all of them as junk, but it doesn't seem to help.

      I'm thinking of switching to some open source Bayesian filter that I can actually monitor to see what it's doing. Or writing my own.

  22. Evolution and by Gyorg_Lavode · · Score: 2, Insightful

    I have a simple question: is there a way to implement a Bayesian filter for Evolution without having to add an extra stop for the email (i.e. a mail server on my computer from which Evolution picks mail up locally)?

    --
    I do security
    1. Re:Evolution and by C3ntaur · · Score: 3, Informative

      Yes, I've done it and here's how:

      1. Get and install bogofilter.

      2. Make a shell wrapper script that runs bogofilter in passthrough mode, redirect stdout and stderr to files in /tmp for debugging and training bogofilter. Here's mine:

      #!/bin/bash /usr/bin/bogofilter -p -u > /tmp/bogo.out 2> /tmp/bogo.err
      status=$?
      exit $status

      3. Make a new local mail folder in evolution to collect spam.

      4. Make a filter in evolution that runs the wrapper script. Tools->Filters, choose Incoming, choose Add. Add a criterion that looks like this:
      Pipe message to shell command, (path to your wrapper script), returns, 0. Add an action to move the message to your local spam filter.

      5. Be sure evolution is set to apply filters to new mail in the inbox(es) you want bogofilter to act on. Tools->Settings, choose Mail Accounts, choose desired inbox(es), choose edit, choose the Receiving Options tab, check the Apply filters to new messages in INBOX on this server.

      Please be sure to RTF bogofilter M. You will need to train (and retrain) bogofilter with spam and non-spam samples over time. The switches to do this have CHANGED from version to version. If you have set things up as above, you can use /tmp/bogo.out to retrain bogofilter for the last message processed when necessary.

      Good luck, and happy spam filtering!

      --
      Loading...
    2. Re:Evolution and by Anonymous Coward · · Score: 0

      Yes.. give popfile a try (popfile.sf.net). I've got about 99.4% accuracy since starting using it.

    3. Re:Evolution and by necrogram · · Score: 1

      Use the pop3proxy script from the spambayes project. You need not run a full-blown mail server, although there are perks to doing that.

    4. Re:Evolution and by Gyorg_Lavode · · Score: 1

      Thank you. All 3 of you. I know I should run a mail server but right now I'm just too busy to set one up, (I've always had trouble when I tried before). I'll give your suggestions a try and thank you again. Good karma.

      --
      I do security
    5. Re:Evolution and by letxa2000 · · Score: 1
      I've been flamed for "advertising," but try the site in my signature. It is Bayesian and doesn't require you install anything on your machine or server. Just setup an account and point Evolution to the site below and you'll be downloading filtered email. I use Evolution via the site below and I'm filtering 99.4% of spam for the month, 99.7% this week.

    6. Re:Evolution and by necrogram · · Score: 1

      I didn't feel like dicking with procmail and that's why I went pop3proxy. I ran mine so I can do the whole pine thing from the road.

    7. Re:Evolution and by repetty · · Score: 1

      Wow, real simple.

      I've lobbied Ximian about this. I'm really interested to see how long it will take them to add Bayesian filtering. A search on their website didn't return even one hit for me a couple weeks ago.

      --Richard

    8. Re:Evolution and by C3ntaur · · Score: 1

      Actually, I like the ability to interface with external programs for mail filtering/processing. My main reason for preferring this is that it allows each group of developers to continue doing what they do best. Another benefit is that if a better algorithm than Bayesian comes along, it will be a simple matter to "bolt on" that implementation.

      On a side note, I *would* like to see evolution support filtering based on its contact lists. This is a fairly old request, but worth doing. My understanding of the reason that it's not been done yet is that the contact manager part of the application is not thread safe, and can't easily be combined with the mail part (which uses threads).

      I really hope that they can resolve the issue and implement it eventually. Not only would you be able to set up white- and black- lists, but you could also direct incoming mail to various folders based on contact categories. This has real value, and I'd much rather see the Ximian developers focus on it than on implementing Bayesian algorithms that are already usable through the pipe to shell interface.

      --
      Loading...
    9. Re:Evolution and by C3ntaur · · Score: 1

      Ooops, corrections!

      In step 2, the first 2 lines of the shell script lost the newline between them. They should be:

      #!/bin/bash
      /usr/bin/bogofilter -p -u > /tmp/bogo.out 2> /tmp/bogo.err

      In step 4, the last sentence should end with "...local spam FOLDER" (the one you created in step 3).

      --
      Loading...
    10. Re:Evolution and by Anonymous Coward · · Score: 0

      You are being 'flamed' for advertising because you aren't mentioning each time that it costs money to do so. What, are you so scared of saying so because all the other options are free?

    11. Re:Evolution and by colin_zr · · Score: 1

      The method described in the parent sounds reasonable, but I use bogofilter slightly differently. I don't create a temp file and I don't use the -u option. I just run bogofilter directly and check the exit status. I keep pretty much all the mail I receive, so to keep things up-to-date I just rebuild my word lists from scratch every so often.

      Also, I suggest that you add a "stop processing" action after moving the email to your spam folder. Otherwise, if you have filters for mailing lists after the spam filter then spam that comes through a mailing list will appear in the spam folder and the mailing list folder.

    12. Re:Evolution and by letxa2000 · · Score: 1
      No, I'm not scared of saying so. There are free options to spam filtering but most (if not all) require that the mail client first download the spam or require it be installed on the server. Many users don't want to download their spam at all and don't have access to install anything on the mail server (their ISP). This service targets a different set of people than the free alternatives and I don't know of any free service that does the same thing as this one (correct me if I'm wrong).

      I haven't mentioned its cost because THAT, to me, is advertising. I feel my posts are on-topic because they suggest a valid solution to the spam problem--I'm not just posting messages out of the blue advertising the service, I've responded to someone else's queries to which our service is a valid option. Yes, the solution is a service and it costs us bandwidth and, so, yes, it costs the user a little money--welcome to capitalism. Of course, it's also the cheapest non-free service that I know of.

      Anyway, we're not afraid of free options. If you can solve your spam problem for free, great. We developed this system for ourselves and decided it was worthwhile enough to make available to the public. It's obviously not for everyone and if a few pennies per day burns a hole in your pocket then, yes, you need to go for a free option.

  23. Here's one I've used by wiggys · · Score: 3, Insightful
    I set up Popfile a few weeks ago at work to stop the deluge of spam one of our POP3 accounts was getting. I've never used a spam filter before (other than the usual basic keyword-based ones) and I must say that bayesian filtering is very impressive!

    I find in our case it stops 98-99% of spam dead in its tracks. There have been a few false positives, and you do need to check from time to time just in case any genuine emails are misclassified, but it's surprising just how quickly the filter sorts the wheat from the chaff.

    Don't expect miracles but they can save you a lot of time... what I find cool is that it learns so quickly, almost like a complicated neural net should, but it's such a simple idea. I wonder if there are any other uses for this kind of thing?

    --

    Sorry, but my karma just ran over your dogma.

    1. Re:Here's one I've used by Likes+Microsoft · · Score: 1

      I also set up POPfile on my laptop a few weeks ago, with similar results. In addition, I've started using it to sort my personal e-mail into various inboxes in Outlook, some of which I allow myself to read at work (family correspondence), others of which I hold off on (such as solicited advertising from sites like amazon.com) until I'm on my home connection.

      On a related topic. Does anyone know a good Open Source alternative to Outlook? I want it to have an e-mail client, address book, to-do and notes and sync with my palm pilot (I use Windows XP).

      --
      -- Who am I? How did I get here? My God, what have I done?!
    2. Re:Here's one I've used by FFFish · · Score: 1

      Reverend is a general purpose Bayesian classifier, named after Rev. Thomas Bayes. Use the Reverend to quickly add Bayesian smarts to your app. To use it in your own application, you either subclass Bayes or pass it a tokenizing function. Bayesian fun has never been so quick and easy. Many thanks to Christophe Delord for his well written PopF. Orange also looks good.

      Stuff you can do with the Reverend:
      - classify recipes by cuisine
      - who do you write like? Shakespeare, Dickens, Austen, Aesop
      - detect the language of a document
      - is your code more like Guido's or Peter's


      http://www.divmod.org/Reverend/index.html

      --
      Don't like it? Respond with words, not karma.
    3. Re:Here's one I've used by mlk · · Score: 1

      Mozilla with its calendar extra?

      --
      Wow, I should not post when knackered.
    4. Re:Here's one I've used by scherbi · · Score: 1

      >I wonder if there are any other uses for this kind of thing?

      Yes, there is:

      http://groups.google.com/groups?q=venue+group:co mp .lang.python&hl=en&lr=&ie=UTF-8&safe=off&selm=mail man.1048821167.17118.python-list%40python.org&rnum =1

    5. Re:Here's one I've used by Likes+Microsoft · · Score: 1
      Thanks. I just looked at the Calendar add-on page, and it looks as if it doesn't have palm-syncing capability. The use of the ical format looked like a great thing, though. I have found a couple of projects since my last post. They're both in the very early stages, though. Any other leads would be appreciated.
      --
      -- Who am I? How did I get here? My God, what have I done?!
    6. Re:Here's one I've used by HermanAB · · Score: 1

      Mozilla will do all that and the spam filtering too...

      --
      Oh well, what the hell...
    7. Re:Here's one I've used by swillden · · Score: 2, Funny

      >I wonder if there are any other uses for this kind of thing?

      Yes, there is:

      http://groups.google.com/groups?q=venue+group:comp
      .lang.python&hl=en&lr=&ie=UTF-8&safe=off&selm=mail
      man.1048821167.17118.python-list%40python.org&rnum =1

      You mean like automatically deleting unusable links so we don't have to try to figure out how to get them to work?

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
    8. Re:Here's one I've used by Anonymous Coward · · Score: 0

      I've been implementing solutions on Autonomy for years (http://www.autonomy.com)

    9. Re:Here's one I've used by Anonymous Coward · · Score: 0

      I use Yahoo as my calendar, address book, and personal e-mail. Sync perfectly with my Palm TT. Not opensource but free enough for me.

      SPAM filtering is fair but I still get more SPAM through than I would like.

    10. Re:Here's one I've used by GnomeSkull · · Score: 0

      I've also been using popfile for a couple of months. I set it up to filter into 3 categories: spam, work, and personal. Right now it is around 98%, which is better than I thought it would be (especially with the 3 categories; work and personal are often similar in subject). I won't go into its features other than great accuracy, but I highly recommend it!

  24. Dupes! by RayOfLight · · Score: 1

    Perhaps /. could implement Bayesian filtering for killing all the dupes!

  25. Crude but effective by MrWorf · · Score: 5, Insightful

    I simply got to the point that I could count the number of real emails on my hands. So I reversed my previous filter. Instead of filtering spam to my spam folder, I made it default *ALL* mail to the spam folder except from certain known addresses (such as work, friends and my own domain). So far, it has only made one wrong decision, and that was because I hadn't written the email address of a friend correctly.

    This is waaaaay better than any other filter method I've tried and requires no learning period at all :)

    1. Re:Crude but effective by wiggys · · Score: 2, Funny

      You know, I think you're on to something there. I sent you an email offering you money so I can sell the idea... but I've a feeling it's been classified as spam. Shame!

      --

      Sorry, but my karma just ran over your dogma.

    2. Re:Crude but effective by marko123 · · Score: 1, Funny

      I used a similar "Crude but effective" technique at work. I had a job where most days were bad, but some were good. So I told my boss to go fuck himself and now I don't have any bad days any more. Of course, my false positives (the good days) are also gone.

      --
      http://pcblues.com - Digits and Wood
    3. Re:Crude but effective by mlk · · Score: 1

      Ahhh white lists.

      Fantastic things, used with a filter they work wonders.

      --
      Wow, I should not post when knackered.
  26. Brief Tech Notes on Bayesian Filtering by robbyjo · · Score: 5, Informative

    Well, the type of Bayesian learning used in this spam filtering is called "Naive Bayesian", and the engine is trained using a "supervised learning" technique. Naive Bayes has proven very successful for text categorization. Spam filtering is even more successful because we essentially categorize e-mails into two labels: "spam" or "not spam".

    Supervised learning basically works like this. Feed the engine with multiple examples (in this case, e-mails) with labels (in this case, "spam" or "not spam"). The training usually takes thousands of examples to get good enough accuracy. And take note that we need both "spam" and "not spam" examples to enable the learning engine to distinguish them.

    How does Naive Bayes work? Well, think of the full Bayesian network. A Bayes net is basically a cause-and-effect graph with a Conditional Probability Table (CPT) annotated on each node, giving the probabilities of its possible values. A full Bayes net takes the form of a Directed Acyclic Graph (DAG), but Naive Bayes takes the form of a tree instead, due to the "naive" assumption that the words are independent of each other given the label. (Okay, I handwaved a whole lot of details here.) And in learning Naive Bayes, we basically try to construct the tree, with its probabilities, out of the examples.

    Let P(spam) be the fraction of training e-mails labelled "spam" and P(not spam) be the fraction labelled "not spam".

    First, let the filter read all the e-mails and collect the words out of them. Weed out duplicates and stop words (common words like "I", "you", "the", etc.). Let NumVocab be the number of words left after weeding.

    Second, process the e-mails one by one, weeding out the stop words as above (but keeping repeats this time, since we are counting). Imagine you have a big two-dimensional array to store the result (let's call the array "P"). For each label, let "n" be the total number of words in all the e-mails with that label, and for each word "w" let "nw" be the number of times "w" occurs in those e-mails. Using the counts from the "spam" e-mails, store (nw+1)/(n+NumVocab) in P[w][spam]; using the counts from the "not spam" e-mails, store it in P[w][nospam]. The +1 keeps a word that never showed up under one label from getting a probability of zero.

    Repeat until all training e-mails are read.

    And here comes the testing phase...

    When you encounter an e-mail and want to classify whether it's spam or not, you'll need to look up the array P you created earlier. First, you do the weeding phase and scan the words one by one. The algo is like this:

    pspam = P(spam); pnospam = P(not spam);
    foreach unique word w in e-mail do
        pspam = pspam * P[w][spam];
        pnospam = pnospam * P[w][nospam];
    endfor

    if (pspam > pnospam) then return IS_SPAM;
    else return IS_NO_SPAM;
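
    For anyone who wants to poke at it, here is a minimal Python sketch of the training and testing steps above. It is only an illustration, not what any particular filter ships: the stop-word list, the function names (tokenize, train, classify) and the four toy e-mails are all made up for the example. It pools the word counts per label, skips words it never saw in training, and adds up logarithms instead of multiplying, so that long e-mails don't underflow to zero.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"i", "you", "the", "a", "to", "and"}


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words (the "weeding")."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def train(examples):
    """examples: list of (text, label) pairs, label being "spam" or "nospam"."""
    counts = {"spam": Counter(), "nospam": Counter()}
    n_docs = Counter()
    for text, label in examples:
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["nospam"])
    # P(spam) and P(not spam); assumes both labels appear in the examples.
    prior = {c: n_docs[c] / len(examples) for c in counts}
    p = {}
    for c in counts:
        n = sum(counts[c].values())  # total words seen under this label
        # (nw + 1) / (n + NumVocab) for every word in the vocabulary
        p[c] = {w: (counts[c][w] + 1) / (n + len(vocab)) for w in vocab}
    return prior, p, vocab


def classify(text, prior, p, vocab):
    # Sum of logs stands in for the product of probabilities.
    score = {c: math.log(prior[c]) for c in prior}
    for w in set(tokenize(text)):
        if w not in vocab:
            continue  # word never seen in training: ignore it
        for c in score:
            score[c] += math.log(p[c][w])
    return "spam" if score["spam"] > score["nospam"] else "nospam"


# Toy training set, invented for the example.
examples = [
    ("cheap mortgage rates apply now", "spam"),
    ("enlarge now cheap pills", "spam"),
    ("meeting notes for the project", "nospam"),
    ("lunch with the project team", "nospam"),
]
prior, p, vocab = train(examples)
print(classify("cheap pills now", prior, p, vocab))
print(classify("project meeting lunch", prior, p, vocab))
```

    With only four training e-mails this proves nothing about accuracy, of course; it just shows where the counts and the +1 go.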

    Hope this helps.

    --
    Error 500: Internal sig error
    1. Re:Brief Tech Notes on Bayesian Filtering by DuSTman31 · · Score: 2, Insightful

      Spam filtering is even more successful because we essentially categorize e-mails to two labels: "spam" or "not spam"

      True. You could simply have a spam and a not spam category. I don't think that'll necessarily lead to the highest accuracies though.

      Spam naturally seems to come in several categories - porn, penis enlargements, mortgages etc. However, it's unlikely that any one spam will simultaneously advertise porn and mortgages. Simply having a "spam" and a "not" category will not take advantage of distinctions such as that.

      When setting up systems such as popfile, consider creating subcategories for each type of spam you tend to get. More work to train, true, but likely to be more accurate once you're done.

    2. Re:Brief Tech Notes on Bayesian Filtering by Ian+Bicking · · Score: 3, Insightful
      Spam naturally seems to come in several categories - porn, penis enlargements, mortgages etc. However, it's unlikely that any one spam will simultaneously advertise porn and mortgages. Simply having a "spam" and a "not" category will not take advantage of distinctions such as that.
      Why does it matter what category? The user doesn't care what kind of spam it is, merely that it's spam. And this isn't just a UI issue -- the filter is not meant to indicate authoritatively what is spam and what is not. Instead it learns what the particular user considers spam. You're only going to introduce inaccuracy if you create more categories, because the user is sometimes going to miscategorize spams (e.g., porn in penis enlargement). The user is not invested in the result of that subcategorization, so it's not a good goal for training.

      Certainly there are other categorizations that are useful, e.g., work vs. private mail. Bayesian techniques can be used for further categorization, but they should only be used to categorize as far as the user cares to have their mail categorized.

      Bayesian techniques for non-spam wouldn't be that useful, anyway, because non-statistical rules generally work well for everything but spam -- it's only because spammers are specifically trying to defeat non-statistical rules that we need statistical analysis. The only other place for Bayesian techniques, IMHO, is where the user can't articulate the basis of the categorization they desire (but that's probably quite common).

    3. Re:Brief Tech Notes on Bayesian Filtering by dave_mcmillen · · Score: 1

      Supervised learning basically works like this. Feed the engine with multiple examples (in this case, e-mails) with labels (in this case, "spam" or "not spam"). The training usually takes thousands of examples to get good enough accuracy. And take note that we need both "spam" and "not spam" examples to enable the learning engine to distinguish them.

      Thank you for the additional detail! What I wonder is this: would it be useful for people to be able to somehow pool their examples? The number of spam messages I receive is relatively small, so it presumably will take a while for Mozilla (my favourite Bayesian-filtering mail client) to learn to filter them, but if I could share the learning process with friends, perhaps we could eliminate all of the classes of spam that any of us had received.

      (Though in fact I've been amazed at how quickly Mozilla has become incredibly accurate. They must have fed it a bunch of examples before I started using it, no? The very first time "Farm Girl Sluts" appeared, it correctly marked it as junk.)

    4. Re:Brief Tech Notes on Bayesian Filtering by nackrm · · Score: 2, Insightful

      Pooling spam to teach isn't such a good idea. The problem you might run into is that some people, like say a plastic surgeon, might get many emails that have words like penis, vagina, sex, larger, etc. So their filter info might allow some spam to get through. This is also the reason that Mozilla's mail client wouldn't be "pretrained" for you. Instead the email probably had some key qualities to it that were dead giveaways to being spam. One of those is the really long strings of characters used by spam mailers to track live email addresses. There are lots of possibilities there.

      --

      Be a man! View at -1
      acm.cs.uwec.edu
    5. Re:Brief Tech Notes on Bayesian Filtering by Blkdeath · · Score: 1
      Pooling spam to teach isn't such a good idea. The problem you might run into is that some people, like say a plastic surgeon, might get many emails that have words like penis, vagina, sex, larger, etc.

      While I do agree with you, for the most part, it would be plausible to include such e-mails as "wet horny teen sluts want to cum for you" et al.

      I have to say that I did notice Mozilla picked up some mail as SPAM right out of the gate. The unfortunate part was; it picked up several false positives. It was a real bitch going through my mailing list archives and de-SPAM-ifying dozens of messages. But alas, all that done and it's now reliably picking out the SPAM from the regular mail, with false positives and misses declining ever more with each new SPAM I receive.

      It's weird, but this new filter actually makes me WANT to post my e-mail address publicly; I want to feed it more and watch it grow!

      So come get me, spammers, my filter's almost perfect and the irony of it all is; you're only helping it get better! C'mon, teach the lizard a lesson and ask me about my penis!

      --
      BD Phone Home!

      Shameless plug. Like you weren't expecting it.

    6. Re:Brief Tech Notes on Bayesian Filtering by Badge+17 · · Score: 1
      When setting up systems such as popfile, consider creating subcategories for each type of spam you tend to get. More work to train, true, but likely to be more accurate once you're done.
      Interesting, but probably wrong. If you are going to classify into many different groups, and reject some... well, the classification accuracy goes down. If you're classifying into "Normal Mail" and various spam categories, and reject all of the spam... you don't really get any improvement in accuracy.

      Generally, Naive Bayesian classifiers are more accurate with only two groups to choose from. (Assuming that there is the same amount of training data).

      Plus, from a user-interface standpoint, it's a lot easier to "Delete - it's spam" than "Delete - it's mortgage/porn/whatever."

      Of course, there are various works on improving multiclass Bayesian classifiers. Check out Jason Rennie's work: http://www.ai.mit.edu/~jrennie/papers/index.html
    7. Re:Brief Tech Notes on Bayesian Filtering by Chris_Keene · · Score: 1

      "I have to say that I did notice Mozilla picked up some mail as SPAM right out of the gate. The unfortunate part was; it picked up several false positives. It was a real bitch going through my mailing list archives and de-SPAM-ifying dozens of messages."

      In Mozilla, I basically went to each of my normal mail folders (which don't contain spam) in turn, edit>select>all, select 'not junk'. Then went to my junk folder, selected all and marked them as junk.

      This seems like a quick (perhaps a little dirty) way of teaching the filter what I consider is junk.

      Chris

      --
      You will forget this sig before you next see it
    8. Re:Brief Tech Notes on Bayesian Filtering by Anonymous Coward · · Score: 0

      Funny you should mention it, I just received my first spam this morning offering a 6.0% equity mortgage on my penis, if I use Norton 2003 to send a photograph (by the world's smallest webcam) of all my personal information to the wife of Late DR.ONOMUA ANDREW NDLOVU of Zimbabwe.

      Perhaps I should have gone for that other spam message that offered to lengthen it 3 inches in 24 hours - ouch!

      (OK, the first one's a joke, but I really got the second one. Definitely sounds painful...)

  27. can someone plz stop this by INeedWeed · · Score: 0

    pleaaase!!! stop timothy from spamming us with these boring articles...

    There's a very accessible article at BBC News explaining in fairly simple terms the drawbacks of current keyword-based filtering

    maybe a little too simple?
  28. BUT .. ! by Anonymous Coward · · Score: 0

    I *like* spam! I print it out and jack off to it!

    1. Re:BUT .. ! by Anonymous Coward · · Score: 0
      I *like* spam! I print it out and jack off to it!

      And I thought I was the only one!

    2. Re:BUT .. ! by Anonymous Coward · · Score: 0

      I think you're almost certainly being sarcastic, but at any rate you just tell the Bayesian filter to filter out stuff which you don't like. It doesn't have to be spam; you just have to tell it that it's spam.

    3. Re:BUT .. ! by czion3 · · Score: 0

      I too like spam it makes me feel important.

      "Oh man 30 people sent me e-mails I am so popular"

      I wish I were kidding.

    4. Re:BUT .. ! by Anonymous Coward · · Score: 0

      No shit. Man, maybe we should forward each other spam?

  29. Slight modification: white-list+Bayesian is useful by Jeremi · · Score: 4, Interesting
    I've found that if you add a small tweak to the Bayesian Filter, it becomes even more useful. The tweak is this: Any time you tell the Bayesian filter that an email is "non-spam", it auto-adds the From address of that email to a white-list, so that from then on any emails from that address are automatically marked as "non-spam" by the filter, no matter what they contain. (conversely, any time you mark an email as "spam", the source address of that email is removed from the white-list, if it is present)


    This allows your single spam/non-spam feedback to the system to do double duty, so that once the program knows that you consider an email source to be "trusted", it will allow even spammy-looking stuff (read: mailing list digests, plane schedules, bank statements, etc) through to your non-spam folder.
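
    A rough Python sketch of that tweak, in case it helps. The class name, the method names and the stand-in classifier are all made up for the example; any Bayesian filter that can answer "is this body spam?" could be plugged in where is_spam goes.

```python
class WhitelistFilter:
    """Wraps a spam classifier with a sender white-list (illustrative only)."""

    def __init__(self, is_spam):
        # is_spam: hypothetical callable, takes a message body, returns True for spam
        self.is_spam = is_spam
        self.whitelist = set()

    def classify(self, sender, body):
        # Trusted senders bypass the content check entirely.
        if sender.lower() in self.whitelist:
            return "non-spam"
        return "spam" if self.is_spam(body) else "non-spam"

    def feedback(self, sender, body, label):
        # One click from the user does double duty: it updates the white-list
        # here, and a real filter would also retrain its word statistics on body.
        if label == "non-spam":
            self.whitelist.add(sender.lower())
        else:
            self.whitelist.discard(sender.lower())


# Stand-in classifier for the demo: flags anything that mentions "mortgage".
f = WhitelistFilter(lambda body: "mortgage" in body.lower())
print(f.classify("bank@example.com", "Your mortgage statement"))
f.feedback("bank@example.com", "Your mortgage statement", "non-spam")
print(f.classify("bank@example.com", "Your mortgage statement"))
```

    The first call gets flagged by the stand-in classifier; after the "non-spam" feedback the same message from the same sender goes straight through.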


    Of course, if spammers start constructing google-style databases of who your friends are and impersonating their accounts, then this won't work anymore... but if they start that, all hell is going to break loose anyway.

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
  30. Re:Spiced Ham? by Anonymous Coward · · Score: 0

    I have to say you both are idiots.

    Although MP didn't invent the word 'Spam', they did pioneer its use, considering we aren't talking about the spiced canned ham by-product.

  31. Browser ad-blocking the same way? by DrJAKing · · Score: 2, Insightful

    I wonder if a Bayesian classifier could sort out banner ads? I currently use Guidescope to block them, but it would be far better not to rely on a third party to decide what's an ad URL. I think it would work, but training it might be hard.

    (And before anyone says "Don't do that, websites will die" my response would be "Good, let most of them die." I hate ads.)

    1. Re:Browser ad-blocking the same way? by Lord+Kholdan · · Score: 1

      I used to have a firewall app (yes, it was a firewall but it did a ton of other things too) that allowed blocking pictures by size... Alas, I don't remember the name.

    2. Re:Browser ad-blocking the same way? by bhtooefr · · Score: 2, Informative

      Did you have Windows? If you did, it was probably WebWasher. It is free for home use. The download link is buried in the front page, so here's a direct link to the WebWasher Classic site: http://www.webwasher.com/client/home/index.html?lang=de_EN

    3. Re:Browser ad-blocking the same way? by bhtooefr · · Score: 1

      (Mod me down...)

      They also have Linux and Mac versions.

    4. Re:Browser ad-blocking the same way? by CognitivelyDistorted · · Score: 1

      I think it could work. Why don't you implement it and let us know how it turns out? The basic implementation is not that complicated. I did something similar for a class project this semester, although I used boosting with decision stumps instead of "Bayesian" learning. The tricky part is selecting the features. There isn't as much text, so you'd also want to look at things like position in the web page, image size, accompanying tags, etc.

  32. Nostalgia by SirDaShadow · · Score: 1

    Reading about the history of email and instant messaging, it reminded me of how easy it was to echo "Hi there" > /dev/tty01 to send a message to another college acquaintance...ahh the memories...

  33. I'll print off your message and jack off to it!! by Anonymous Coward · · Score: 0

    I'm going to put the printer in landscape mode!

  34. The Normans WERE Vikings (was: Vikings?) by shking · · Score: 1
    1. Normans were Danes (aka Vikings) who took over the northern coast of France. See this article
    2. Much of England was under Viking control at one time, this was called the Danelaw
    --
    -- "At Microsoft, quality is job 1.1" -- PC Magazine, Nov. 1994
    1. Re:The Normans WERE Vikings (was: Vikings?) by cyril3 · · Score: 1

      The Normans were Danes about as much as the Americans are English.

  35. Comment removed by account_deleted · · Score: 2, Informative

    Comment removed based on user account deletion

  36. Beta version better by HybridTheory · · Score: 1
    Current versions are 99.7% accurate at spotting. Other Bayesian filters, such as CRM114, do an even better job.

    So what they are saying here is that Bayesian Filters other than current version do a better job???

    Looks like we'd better get that early beta version reinstalled...

    1. Re:Beta version better by Rhinobird · · Score: 1

      HA!

      the trouble is some of the ones that do a better job are 105-110% accurate.

      --
      If Mr. Edison had thought smarter he wouldn't sweat as much. --Nikola Tesla
  37. I'm afraid of false positives by Anonymous Coward · · Score: 0

    So tell me, Mr. Anderson. What good is an email if you can't read it?

  38. Yes, we must filter out the dummies-AC+5 by Anonymous Coward · · Score: 0

    *sigh* I suggested this back when bayesian was first mentioned. Filtering and classification for the admins, as well as the users. Throw in NNTP and Slashdot could be much better. Oh well, anyone wanna hear about my idea for keeping pancakes from sticking to your kitchen ceiling?

  39. I don't even try to filter spam out. by belroth · · Score: 2, Insightful

    Instead I filter all of my mail for wanted/expected mail into a (large) tree of input folders, mailing lists, company mailings etc.
    Most of what's left is spam, so a quick scan of the inbox (and creation of new rules) weeds out the uncaught desirables and the rest gets dropped in the bitbucket.
    The point being that legitimate mail doesn't try to spoof my filters. I haven't (yet) had any spam arriving where it shouldn't. I'd rather my ISP dumped all the crud in the bin for me, but my marginal cost is low as I'm on ADSL. I now also use a distinct email for each purpose, making it easy to spot where spammers got it from and to create new rules as needed. It's a shame I didn't do this at the start as I have a couple of early ones that are spammed but I can't dump.

    --
    I hereby inform you that I have NOT been required to provide any decryption keys.
    1. Re:I don't even try to filter spam out. by HermanAB · · Score: 1

      Whitelisting, which is pretty much what you are doing, works fine until you start to get more than about 100 spams a day. By then, you really do need a very good filter, since even 99% efficiency is not quite good enough anymore.

      All the spammers need to do to defeat Bayesian filters, is to send 100 times more spams than they currently do, since if you get 10,000 spams, hitting a 99% filter, 100 will get through - sigh.

      --
      Oh well, what the hell...
  40. Where to get a nice Bayesian filter. by zerofoo · · Score: 2, Interesting

    I'm using this now, and it works great!

    Get it here.

    -ted

  41. naive implementation of naive Bayesian by g4dget · · Score: 3, Informative

    Graham's method is called "naive Bayesian", and it's called "naive" for a reason. It works surprisingly well, but it barely scratches the surface of what people are doing with statistical models of text.

    The lack of references on Graham's web site to prior work on text classification makes one wonder whether he just is unfamiliar with a huge body of literature going back decades or whether he just deliberately ignores them. Either way, Graham didn't invent any of the techniques and they are far from state-of-the-art. (Incidentally, you'll probably find Octave or Perl/PDL a more convenient language for implementing this stuff than Lisp.)

    Anybody seriously interested in text filtering should at least do a little bit of background reading. "Readings in Information Retrieval" by Jones and Willett covers some of the basic papers.

    1. Re:naive implementation of naive Bayesian by GnuVince · · Score: 1
      Quote:
      (Incidentally, you'll probably find Octave or Perl/PDL a more convenient language for implementing this stuff than Lisp.)

      I don't know if you are aware, but Paul Graham is an absolute Lisp genius. The first book he wrote was called On Lisp, and it talked about how to write software in Lisp. He explains a technique called bottom-up programming (which, of course, can be used in many other programming languages). With that technique, you don't address the problem directly first: you start by writing a set of operators for your problem. So, once you have those functions/operators/primitives, you have the best language for the job you are going to be doing. So, what Paul did when he wrote Viaweb (Yahoo! Stores) was to first write a language to write WYSIWYG web editors. Now he's in the spam filtering business, so he certainly wrote all the stuff he needs to be as efficient as possible in his programming.

    2. Re:naive implementation of naive Bayesian by g4dget · · Score: 1

      Traditional UNIX style also uses "little languages", interactive exploration, and bottom-up programming. Matlab and Perl programmers take a similar approach. And any good OO developer will go through the same exercise when thinking about what operations to put into the interfaces to each class.

      Graham's stuff on Lisp seems just like his stuff on spam filtering: either he doesn't know what other people have been doing for decades or he chooses to ignore it deliberately.

  42. Mozilla does this by Anonymous Coward · · Score: 3, Informative

    Mozilla incorporates a two-step filter:

    1. Is the sender in the address book? If yes, is not spam, otherwise:
    2. Does the message have a probability of 90% that it is spam based on the Bayes filter? If so, flag as spam, otherwise not spam.
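
    The two steps above can be sketched in a few lines of Python. This is not Mozilla's actual code; the function name and the assumption that the address book is a set of lowercase addresses are mine, and the 0.9 cutoff is simply the 90% figure quoted above.

```python
def two_step_classify(sender, spam_probability, address_book, threshold=0.9):
    """Step 1: known senders pass. Step 2: otherwise apply the Bayes score cutoff."""
    if sender.lower() in address_book:
        return "not spam"
    return "spam" if spam_probability >= threshold else "not spam"
```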

  43. Human filtering by stile · · Score: 3, Funny

    Great. Wanna filter my email for me? ;)

  44. 0.0001% response rates by rippie78 · · Score: 3, Insightful

    The sheer number of spam mail sent means that even tiny response rates, reportedly 0.0001%, means junk mailers turn a profit.
    Are we missing a critical factor of the end user who actually responds to SPAM?
    If spammers survive on 0.0001% response rate, how many people are actually clicking/buying? Are these people who provide the customers for spammers going to stop or use any sort of filters?

    1. Re:0.0001% response rates by jafiwam · · Score: 1

      I do not believe that they survive on the response rate. I think that rather the spammer income (or lack thereof) model is based on selling "marketing" to clueless, inexperienced or dumb businesses and con-men.

      Sometimes the spam is sent out by the person hawking the crap, but usually it is the payment to initiate the spam the spammer is after in the first place.

  45. "Alanis irony" by Danny+Rathjens · · Score: 2, Funny

    That's what we refer to as "Alanis irony", 8^)

    1. Re:"Alanis irony" by joeytsai · · Score: 2, Interesting

      Actually it is ironic when you write a song called "ironic" and there are no ironies in it.

      --
      http://www.talknerdy.org
  46. My solution. by Lord+Kholdan · · Score: 3, Interesting

    I don't use email. Yes, I have a few addresses but I haven't checked them in months. Email is kind of a dead way of communication anyway, beaten by things such as mobile phones and instant messaging.

  47. spam tax by bagsc · · Score: 1

    If their response rate is one in a million, why not put a $0.0001 fee on emails? I use a lot of email, and I don't think that the one cent per hundred is gonna break my bank. (Of course, if you tax it, they'll do it surreptitiously.)

    --
    http://www.accountkiller.com/removal-requested
  48. change the subscription service by Anonymous Coward · · Score: 0

    Change the subscription service to all the articles, but zero AC posting; then you wouldn't have to filter by threshold as severely. How to work out the obvious dodge of multiple login handles I don't know, but maybe somehow it's possible, and merely charging per handle would slow it down considerably.

    whoops, posting as AC....

  49. No Junk Mail please.... by TomMajor · · Score: 3, Funny

    On my mailbox outside my apartment I have a "No Junk Mail please" sticker... This actually works. I tried to put the same sticker on my PC, but the junk mail just keeps on coming... I don't understand....

    --



    Ask me no questions, and I'll tell you no lies...
  50. The best email filter by Spud+the+Ninja · · Score: 2, Interesting

    Why go through all the work of training some software to read your email and decide if you might want to read it when most email programs have white list capabilities?

    If I don't know you, that means I don't want to talk to you. Your email goes straight to a junk folder, which I can quickly scan once every few days for From names I recognize. I can add these names to my white list if I so choose.

    Granted, my job does not involve me soliciting contacts from the public at large, so this wouldn't work for everyone. I use it on my personal Hotmail account though, and I get to not even consider lots of crap every day.

    --
    You can never put too much water in a nuclear reactor.
    1. Re:The best email filter by antelopelovefan · · Score: 1

      I think that this can be a great strategy, especially for certain types of email accounts. I have a Hotmail account that I use whenever I have to provide a (valid) email address in order to sign up for something. Hotmail makes it pretty easy to automatically dump everything that isn't on your Safe List into the Junk folder. My favorite thing about this approach under Hotmail is that you can put entire domains on the Safe List, not just individual email addresses.

    2. Re:The best email filter by metamatic · · Score: 1
      Why go through all the work of training some software to read your email and decide if you might want to read it when most email programs have white list capabilities? If I don't know you, that means I don't want to talk to you.


      I was going to e-mail you explaining why whitelists suck, but obviously there would be no point.

      Suffice it to say that not everybody is an antisocial asshole.
      --
      GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
    3. Re:The best email filter by Spud+the+Ninja · · Score: 1
      I'm not sure how this makes me anti-social; I don't know anybody that uses email or normal mail or telephones to initiate social contact.

      "Hello, you don't know me, I just called to see if you wanted to chat." - Doesn't sound too likely.

      If I've met somebody, I'll exchange email addresses or phone numbers with them if I'd like to talk with them again. I can add this address to my white list.

      As I mentioned before, it's not the best idea for lots of work email accounts.

      --
      You can never put too much water in a nuclear reactor.
  51. Bayesian for windows? by NeoSkandranon · · Score: 1

    As soon as there's a well-written app that works with or on top of programs like Outlook and Netscape, I will be excited. Until then, a huge (and likely most targeted) sector of people remains relatively un-filtered.

    --
    If you can't see the value in jet powered ants you should turn in your nerd card. - Dunbal (464142)
    1. Re:Bayesian for windows? by Anonymous Coward · · Score: 0

      I installed SpamBayes recently and already it works really well. It integrates with Outlook; you just need to train it, and before you know it almost all spam is gone!

    2. Re:Bayesian for windows? by Bryan_W · · Score: 1

      Try Popfile

    3. Re:Bayesian for windows? by NeoSkandranon · · Score: 1

      They say on the website it is explicitly NOT for Outlook Express, the most popular version of Outlook. Is there any way at all to get it working there?

      --
      If you can't see the value in jet powered ants you should turn in your nerd card. - Dunbal (464142)
    4. Re:Bayesian for windows? by GnuVince · · Score: 1

      Check out POPFile. It's a filtering proxy, so you use their great documentation to learn how to set it up (easy) and it works with any email client. The only thing that makes it somewhat hard is first time configuration (buckets and classifying the first messages yourself), but once that's done, you forget about it and it just works.

    5. Re:Bayesian for windows? by NeoSkandranon · · Score: 1

      POPFile has no IMAP/HTTP support; my primary email addresses (and the ones in most need of spam protection) are not POP3.

      --
      If you can't see the value in jet powered ants you should turn in your nerd card. - Dunbal (464142)
    6. Re:Bayesian for windows? by Queuetue · · Score: 1

      If you're using imap, then you should use an imap scrubber, like imapassasin, in which case it doesn't matter what mail client you use.

      If you're using webmail , I think it will have to be filtered at the server - I don't know how you could actually tie in with a web app, unless it has an api to do so.

      Does Outlook Express do webmail?

    7. Re:Bayesian for windows? by antelopelovefan · · Score: 1

      I suppose your point is that you don't want to switch email programs but I highly recommend Mozilla 1.4a.

  52. Newsletters... by TomMajor · · Score: 1

    I get some newsletters in my mailbox that I actually want. Some of them are very similar to some spam mails in structure (having HTML code, pictures and so forth). How do the filters handle this potential problem? I'm not currently using a spam filter, so I don't know. Does anybody know?

    --



    Ask me no questions, and I'll tell you no lies...
    1. Re:Newsletters... by Queuetue · · Score: 1

      Just put the newsletters' addresses on your white list. If your filter is smart enough, you can tell it to let them through - if not, just write a procmail recipe that allows them to skip the filter.

  53. Bogofilter 0.9 NOT 0.11 by LinuxHam · · Score: 1

    I just had to backlevel bogofilter from 0.11 to 0.9. I don't know WTF happened between those two revs, but the filtering algorithm went straight to hell. I had forgotten that I normally get over 100 spams a day until I went to 0.11. Then it all came back and I started losing half an hour a day to sorting out my email.

    I gave it a chance for over two weeks, and it never got even close to the success rate of 0.9. Not that I'm complaining, there was nothing left to improve upon in 0.9 AFAIC. (And yes, I did see that someone decided to reverse the function performed by the -N and -S switches -- thus making my crontab edits a nightmare with troubleshooting)

    So I'm now back at 0.9 and back in nirvana. It's good to be home.

    --
    Intelligent Life on Earth
  54. Re:Slight modification: white-list+Bayesian is use by Sandcastle · · Score: 1
    I think you had better be careful how that is implemented. The most simple solution might not be the best one.

    If you merely place the white list filter before the Bayesian system, and all whitelisted e-mail is sent directly to your inbox, bypassing the filter, then the filter will miss its opportunity to be better trained. It will effectively have very few e-mails to reinforce what "not-spam" is and will only be fed spam. It might become overly aggressive in its filtering, and any non-whitelisted non-spam may have a higher chance of being incorrectly classed as spam.

    If you use the whitelist after the Bayesian filter, then the filter again misses out on the opportunity to be better educated, as regardless of how it classifies the whitelisted "not-spam" e-mail, you'll still receive it.

    The whitelist needs to be incorporated into the Bayesian system itself to ensure the filter is continually trained with what is theoretically known good "non-spam" mail.
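
    Roughly what I mean, as a Python sketch. The class and method names are hypothetical and the Bayesian scorer is passed in as a plain callable; the only point is that whitelisted mail updates the "good" word counts on its way to the inbox.

```python
from collections import Counter


class TrainingWhitelist:
    """Whitelisted mail skips classification but still trains the ham statistics."""

    def __init__(self, whitelist):
        self.whitelist = {address.lower() for address in whitelist}
        self.ham_words = Counter()   # word -> number of good messages containing it
        self.ham_messages = 0

    def receive(self, sender, body, says_spam):
        # says_spam: hypothetical callable, body -> True if the Bayesian part flags it
        if sender.lower() in self.whitelist:
            self.ham_words.update(set(body.lower().split()))
            self.ham_messages += 1
            return "inbox"
        return "junk" if says_spam(body) else "inbox"
```

    The spam side of the statistics would still be trained from whatever the user marks as spam; that part is omitted here.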

    --
    The fact that a fish swims in water does not make it an expert in fluid dynamics. GogglesPisano (199483)
  55. Better idea by Anonymous Coward · · Score: 0

    Why don't we just put a thorough e-mail embargoe against China, and let the communist gov't there shoot the spammers for us?

    I absolutely HATE Chinese spammers with a vengance because they once forged my e-mail address and I got the angry replies, I wanted to drive down to China and shoot that little Chinese guy and stuff his little Chinese computer full of dynamite.

    Fucking Chinese.

  56. Vikings ? by JOW · · Score: 1

    OK, so some English are descendants of the Vikings, but only because when we the Danes came to the UK we did not find sheep to go around.

    --
    I just hate bit SPAM, (www.netnoise.com.kh)
  57. Re:Spiced Ham? by Anonymous Coward · · Score: 0

    MP didn't "pioneer its use" at all. In the skit they are referring to the actual product of spiced ham. How could merely mentioning a product by name pioneer the use of the word? They obviously didn't coin any new term; they were talking about spiced ham.

  58. you're wrong. here's why... by Anonymous Coward · · Score: 0

    "50 years of successful predictive modeling should be enough: lessons for philosophy of science"

    it's a very common misconception, but the fact is that a well-written test (eg. the MMPI) will always be better than a human "expert" (eg. a psychologist).

  59. Re:Slight modification: white-list+Bayesian is use by Plug · · Score: 1

    A slightly different idea that I was considering today works as follows.

    Take the Tagged Message Delivery Agent, a system that will send a challenge message to anyone it doesn't know (isn't in the whitelist), which you have to reply to.

    Then change it so anything allowed through on the whitelist is added to the "Not Spam" category, and anything that is challenged is passed through the filter. If it passes, it doesn't get challenged (but also doesn't get added automatically to Not Spam), and if it _doesn't_ pass, then it gets challenged.

    Few, if any, false positives, and challenges not sent where they don't need to be. Sounds foolproof enough...
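
    As a sketch, the decision flow would look something like this in Python. Nothing here is TMDA's real interface; handle, looks_like_spam and train_ham are names I made up for illustration.

```python
def handle(sender, body, whitelist, looks_like_spam, train_ham):
    """Return "deliver" or "challenge" for one incoming message."""
    if sender in whitelist:
        train_ham(body)          # known sender: feed the "Not Spam" category
        return "deliver"
    if looks_like_spam(body):
        return "challenge"       # unknown sender and spammy: make them confirm
    return "deliver"             # unknown but clean: deliver, but do not auto-train
```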

  60. Re:Slight modification: white-list+Bayesian is use by Wocko · · Score: 1

    You don't necessarily need an explicit whitelist. All you need to do is include the email headers in the list of tokens from which the Bayesian filter learns.

    Then, if you receive non-spam from a friend, their email address is automatically added to the list of non-spammy words.

    Conversely, any time you classify a spam email, then that email address, and potentially the domain if the tokenising is smart, is added to the list of spammy words.

    This is what SpamBayes already does, I believe.
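
    To make the idea concrete, here is a rough Python tokenizer that turns header values into tokens alongside the body words. It is only a sketch and not SpamBayes' actual tokenizer; the "header:word" and "header:domain:..." token formats are my own invention.

```python
import re
from email import message_from_string


def tokenize(raw_message):
    """Return the set of tokens for one message, headers included."""
    msg = message_from_string(raw_message)
    tokens = set()
    for name, value in msg.items():
        prefix = name.lower()
        for word in re.findall(r"[\w.@-]+", value.lower()):
            tokens.add(f"{prefix}:{word}")
            if "@" in word:
                # Also learn the domain, so a whole spammy domain can be penalised.
                tokens.add(f"{prefix}:domain:{word.rsplit('@', 1)[1]}")
    body = msg.get_payload()
    if isinstance(body, str):    # multipart messages are skipped in this sketch
        tokens.update(re.findall(r"[a-z0-9$'-]+", body.lower()))
    return tokens
```

    Because the sender's address shows up as just another token, a friend's address picks up a strongly non-spammy score after a few good messages, without any separate whitelist code.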

  61. Re:Slight modification: white-list+Bayesian is use by Anonymous Coward · · Score: 0

    I use a Unix probability based spam filter written in awk and ksh with a whitelist built in. The whitelist is executed first as it is much faster than the spam filter. It is located at:
    http://www.sofbot.com/

  62. Re:Slight modification: white-list+Bayesian is use by dwsauder · · Score: 1
    Using a whitelist in combination with a bayesian filter is just one thing you can do. There are plenty of other things.

    You could look for the message ID of a message you sent in the header fields of received messages (specifically, the in-reply-to header field). If you find it, it means that the received message is likely to be a reply to a message you sent.
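
    A small Python sketch of that check, assuming you keep a set of the Message-IDs of everything you send (sent_message_ids is a hypothetical store, not something your mail client hands you for free):

```python
from email import message_from_string


def is_reply_to_me(raw_message, sent_message_ids):
    """True if the In-Reply-To header names a message we sent ourselves."""
    msg = message_from_string(raw_message)
    referenced = msg.get("In-Reply-To", "").split()
    return any(message_id in sent_message_ids for message_id in referenced)
```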

    You could look for a phrase from your signature, which could indicate that someone sent a reply and included your original message.

    Besides the words in your signature, you could program in certain other words that automatically trigger a classification as non-spam. Those words might include the names of trademarked products that your company sells or similar types of words. Of course, this is just overriding some of the learning that presumably would happen automatically. But if these are very important words, then you must insist that nothing else the filter does can override the classification as non-spam, and thereby avoid false positives.

    In summary, I think that bayesian classifiers, as Paul Graham proposes them, are just too naive. The addition of a few heuristics could make a big difference.

  63. You asked for it ;) by brad-x · · Score: 1

    How to Increase Your Penis

    And Stop Premature Ejaculation

    FREE Bottle Offer 100% Guaranteed to work.





    Take Advantage of Our FREE Bottle Offer As Seen On TV !!!

    Click here to learn more.

    NB: Amusingly my first revision of this was smacked down by slashdot's inbuilt junk filtering mechanisms. :P

    --
    // -- http://www.BRAD-X.com/ -- //
    1. Re:You asked for it ;) by Blkdeath · · Score: 1
      How to Increase Your Penis

      Dork. :)

      --
      BD Phone Home!

      Shameless plug. Like you weren't expecting it.

    2. Re:You asked for it ;) by Anonymous Coward · · Score: 0

      I hate that spam you posted! It is the worst offender of any spam I've ever seen. I've got three different filters protecting my mailbox, and that message makes it through all three of them. I keep feeding it over and over again to bogofilter, but bogofilter still says it is 0% spam. Damn ESR!

      The other variation that also makes it through all of my filters has the text "get larger nuts and penís, more pleasure, more satisfaction." Why can't bogofilter catch that?

  64. When the revolution comes... by almaw · · Score: 1

    All the people who say "I don't get spam, why do you?" will be the first up against the wall when the revolution comes. Well, first after the damned spammers, anyway.

  65. Yes it will by doublem · · Score: 1

    Given the fact you got modded to +5 funny, yes, if you troll like that it will get through the lameness filter and the people with mod points.

    Good work!

    --
    "Live Free or Die." Don't like it? Then keep out of the USA
  66. Re:Slight modification: white-list+Bayesian is use by Jeremi · · Score: 1

    True, that mostly works... but it doesn't handle the possibility of my friend sending me an email where the spam-keywords overwhelm the "goodness" of his non-spammy email address. I like to know for certain that no matter what my friends send me, it will get to me (of course, if they send me too much crap, they'll lose their "friend" status... ;^))

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
  67. Re:Slight modification: white-list+Bayesian is use by Jeremi · · Score: 1
    In summary, I think that bayesian classifiers, as Paul Graham proposes them, are just too naive. The addition of a few heuristics could make a big difference.


    I disagree -- the heuristics you mention are much more naive than the Bayesian filter. For example, what if someone doesn't quote your signature in their reply? What if their mailer doesn't include the Message ID? What if the email isn't a reply to something you wrote, but a spontaneous email?


    Even if the heuristics did work well (and in my experience they don't), there is still the time factor -- I don't want to spend all of my free time coming up with and implementing new heuristic rules. I want my computer to do the scut work for me. Bayesian does that.

    --


    I don't care if it's 90,000 hectares. That lake was not my doing.
  68. british are french not vikings? by Anonymous Coward · · Score: 0

    what the normans are? frogs, thats what. just because a few vikings pillaged the coast dont mean they are vikings. hah! maybe lucky eddie. the british are basically africans who picked up some civilisation from irish scots french an germans

    1. Re:british are french not vikings? by Cackmobile · · Score: 1

      Umm, actually the Normans are Vikings. Just as some Vikings settled in England, so too in Normandy. Look at your history. They were very different to the French.

      --
      -- Karma Karma Karma Karma, Karma Chameleon - Boy George
  69. Re:Slight modification: white-list+Bayesian is use by Wocko · · Score: 1

    Yes, that would most definitely be a problem.

    It would probably also tell me that it's time to get some new friends :)

  70. STFU by Inthewire · · Score: 1

    n/t

    --


    Writers imply. Readers infer.
  71. If only by Anonymous Coward · · Score: 0

    It is official; Slashdot now confirms: Spam is dying

    One more crippling bombshell hit the already beleaguered Spam community when Slashdot confirmed that Spam market share has dropped yet again, now down to less than a fraction of 1 percent of all servers. Coming on the heels of a recent Slashdot survey which plainly states that Spam has lost more market share , this news serves to reinforce what we've known all along. Spam is collapsing in complete disarray, as fittingly exemplified by failing dead last [samag.com] in the recent Sys Admin comprehensive networking test.

    You don't need to be a Kreskin [amazingkreskin.com] to predict Spam's future. The hand writing is on the wall: Spam faces a bleak future. In fact there won't be any future at all for Spam because Spam is dying . Things are looking very bad for Spam. As many of us are already aware, Spam continues to lose market share. Red ink flows like a river of blood.

    Sex SPAM is the most endangered of them all, having lost 93% of its core developers. The sudden and unpleasant departures of long time Sex SPAM developers Jordan Hubbard and Mike Smith only serve to underscore the point more clearly. There can no longer be any doubt: Sex SPAM is dying .

    Let's keep to the facts and look at the numbers.

    Viagra SPAM leader Theo states that there are 7000 users of Viagra SPAM. How many users of Penis Extender SPAM are there? Let's see. The number of Viagra SPAM versus Penis Extender SPAM posts on Usenet is roughly in ratio of 5 to 1. Therefore there are about 7000/5 = 1400 Viagra SPAM users. BSD/OS posts on Usenet are about half of the volume of Penis Extender SPAM posts. Therefore there are about 700 users of Soft Porno SPAM. A recent article put Sex SPAM at about 80 percent of the Spam market. Therefore there are (7000+1400+700)*4 = 36400 Sex SPAM users. This is consistent with the number of Sex SPAM Usenet posts.

    Due to the troubles of Virginia, abysmal sales and so on, Sex SPAM went out of business and was taken over by Walmart who sell another troubled Dead Tree version. Now Walmart is also dead , its corpse turned over to yet another charnel house.

    All major surveys show that Spam has steadily declined in market share. Spam is very sick and its long term survival prospects are very dim. If Spam is to survive at all it will be among OS dilettante dabblers. Spam continues to decay. Nothing short of a miracle could save it at this point in time. For all practical purposes, Spam is dead .

    Fact: Spam is dying

  72. Re:Slight modification: white-list+Bayesian is use by Inthewire · · Score: 1

    POPFile's Magnets work like this - based on From, To, or Subject.

    --


    Writers imply. Readers infer.
  73. Re:Slight modification: white-list+Bayesian is use by eraserewind · · Score: 1

    I have my main hotmail set to max filtering, i.e. only allow people I have in my address book or safe list.

    I've noticed that recently some spam has been coming through pretending to be amazon.com or bn.com as they are on my safe list (and I'd imagine many other people's too).

    Is this the beginning of a wave of intelligent spam? One step up from them pretending to be from yourself. How soon before one of those Outlook virii is designed to divert the address book info to some spammer, so they can do more than just guess what email addresses people are likely to let through?

    The end is nigh I tells ya! ;)

  74. DON'T implement it like the parent by NoOneInParticular · · Score: 3, Informative
    If you do it like the parent:
    pspam = P(spam); pnospam = P(not spam);
    foreach unique word w in e-mail do
        pspam = pspam * P[w][spam];
        pnospam = pnospam * P[w][nospam];
    endfor

    if (pspam > pnospam) then return IS_SPAM;
    else return IS_NO_SPAM;

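    You'll soon run out of bits to store the floating-point results: multiplying hundreds of probabilities that are each below 1 underflows toward zero. Implement it by adding the logarithms of the probabilities instead of multiplying the probabilities themselves, thus: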

    lpspam = log(P(spam));
    lpnospam = log(P(not spam));
    foreach unique word w in e-mail do
        lpspam = lpspam + log(P[w][spam]);
        lpnospam = lpnospam + log(P[w][nospam]);
    endfor

    if (lpspam > lpnospam) then return IS_SPAM;
    else return IS_NO_SPAM;

    If you have a couple of hundred key-words, this will make a lot of difference concerning the accuracy of the predictions.
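
    For the curious, here is the same comparison as runnable Python. It is only a sketch: p_word is a hypothetical table mapping each word to a pair (P(word|spam), P(word|not spam)), every probability is assumed to be nonzero (i.e. already smoothed), and the 0.5 prior is arbitrary.

```python
import math


def naive_is_spam(words, p_word, p_spam=0.5):
    """Product form: underflows to 0.0 on both sides for long messages."""
    spam, ham = p_spam, 1.0 - p_spam
    for w in set(words):
        word_given_spam, word_given_ham = p_word[w]
        spam *= word_given_spam
        ham *= word_given_ham
    return spam > ham


def log_is_spam(words, p_word, p_spam=0.5):
    """Log form: sums stay comfortably within floating-point range."""
    log_spam, log_ham = math.log(p_spam), math.log(1.0 - p_spam)
    for w in set(words):
        word_given_spam, word_given_ham = p_word[w]
        log_spam += math.log(word_given_spam)
        log_ham += math.log(word_given_ham)
    return log_spam > log_ham
```

    With 400 words that are each twice as likely in spam (say 0.02 versus 0.01), both products in the naive version collapse to 0.0 and the comparison is meaningless, while the log version still sees the difference.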

  75. Possible spam solution? by Anonymous Coward · · Score: 0

    Possible spam solution: require email senders to send a "revocable certificate" which can be downloaded from their web site. Only allow email addresses already in the address book and certified web email addresses to send email. This idea works great as a filter. The hard part would be getting websites to start posting email certificates for download, as well as getting the (optional) filter installed on the end users' machines.

    Could be easily done if a create email certificate option was added to frontpage and the filter was built into outlook. Only downside to filters of course, is that they only cut traffic down where the filtering is done. Of course, if a good filter is implemented and everyone starts using it, I'm sure it would severely cut down on the outgoing spam in the first place.

    BTW: filtered certificates would automatically send a certificate revocation notice back to the sender of the offending email. Certificates could easily be reinstated at any time just by downloading the certificate again. The web browser, of course, would have to work in conjunction with the email software in the certificate transfer process.

    Certificate revocation would only require interaction between the mail server and the mail client.

    Any reason why this isn't being done yet?

    Start thinking of your email address as something more along the lines of a credit card number & think twice about giving it to someone who doesn't offer a certificate.

  76. What does it stand for? by JThaddeus · · Score: 1

    While friends in the US and Britain tell me that SPAM is 'SPiced hAM', I think it's really an acronym for 'Swine Parts and Artificial Meat'.

    --
    "Love is a familiar; Love is a devil: there is no evil angel but Love." --William Shakespeare ('Love's Labors Lost')
  77. non-verbal bayesian filtering by piovere · · Score: 1

    From what I've read on Bayesian filtering, it seems like most of the spam is caught not only from the body text, but also from the spam crap in the headers (e.g. a bad Message-ID). Could other tricks be added into this filtering as "virtual words", such as more picture than text, or similar things?
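
    One way the "virtual words" idea might look, as a rough Python sketch. The token names (subject*..., virtual*no-message-id, virtual*mostly-images), the function name tokenize, and the ten-words-per-image threshold are all assumptions made up for illustration, not taken from any existing filter:

```python
import re
from email import message_from_string

WORD = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(raw_message):
    """Return header tokens, hypothetical 'virtual word' tokens, and body words."""
    msg = message_from_string(raw_message)
    tokens = []
    # Prefix header words so a word in the Subject is counted apart from the body.
    for name in ("Subject", "From"):
        for word in WORD.findall(msg.get(name, "")):
            tokens.append(f"{name.lower()}*{word.lower()}")
    # Virtual word: a fact about the message structure, not a word in it.
    if msg.get("Message-ID") is None:
        tokens.append("virtual*no-message-id")
    body = "" if msg.is_multipart() else msg.get_payload()
    images = len(re.findall(r"<img\b", body, flags=re.IGNORECASE))
    text = re.sub(r"<[^>]*>", " ", body)  # crude tag stripping, enough for a sketch
    words = [w.lower() for w in WORD.findall(text)]
    # Assumed threshold: fewer than ten words per image counts as "mostly images".
    if images and len(words) < 10 * images:
        tokens.append("virtual*mostly-images")
    return tokens + words


sample = (
    "From: deals@example.com\n"
    "Subject: Cheap meds\n"
    "\n"
    '<html><img src="a.gif"><img src="b.gif">Buy now</html>\n'
)
tokens = tokenize(sample)
```

    The virtual tokens can then be fed to the filter like any other word, so it learns their spam probability from the same training data.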

  78. Crude and Inflexible by Anonymous Coward · · Score: 0

    I guess you don't get a lot of new correspondents. Do much web shopping?

    Or do you review your Spam folder periodically?

  79. A Different Approach. by sbot5000 · · Score: 1
    I think we need Dummy Filtering for Bayesians (tm).

    A Plan For SPAM.

  80. Vikings are appropriate by metamatic · · Score: 1

    Of course, Vikings are entirely appropriate to SPAM. The Hormel factory which makes all America's SPAM (as well as the UK's) is located in Austin, Minnesota... and of course, Minnesota has a sports team called the Vikings.

    --
    GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
  81. Problems with current implementations by downwa · · Score: 1

    I was impressed with the concept of Bayesian filtering, and was happy to find the latest Mozilla Mail supporting it with built-in buttons to mark mail as Spam/Not Spam. I wasn't so happy when I discovered many spams slipping through, even though they had the same type of content as previous ones I had marked as spam.

    In viewing the source of one of the mail messages, I discovered embedded HTML comments which split up phrases and even words which might get flagged otherwise. The content of the comments appeared to be randomly generated text, so a filter wouldn't be able to categorize it the same every time. And the placement of the comments within a word wouldn't be the same every time either, so a naive implementation could never filter spam based on previous choices.

    A smarter implementation might try operating on the results a viewer would see, with the HTML tags stripped out. However, as soon as a filter does that, spammers could add JavaScript within comments that generates the spam text, and viewers who allow JavaScript (many) would still see the spam.

    It seems that both the text without the HTML tags, and the contents of the HTML tags, need to be considered separately in order to be able to filter the new generation of spam.
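
    A rough Python sketch of considering the two separately, using the standard library's html.parser. The class and function names (SplitStreams, split_streams) and the tag*/comment* token names are hypothetical, and the sketch does not execute JavaScript, so script-generated text is not covered:

```python
from html.parser import HTMLParser


class SplitStreams(HTMLParser):
    """Collect visible text, tag names, and comment text in separate lists."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.tags = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.visible.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


def split_streams(html):
    """Return (words a viewer would see, tokens describing the markup)."""
    parser = SplitStreams()
    parser.feed(html)
    parser.close()
    # Join text chunks with no separator so a word split by a comment is whole again.
    visible_words = "".join(parser.visible).lower().split()
    markup_tokens = ["tag*" + t for t in parser.tags]
    # The random comment text is ignored; only the presence of a comment is recorded.
    markup_tokens += ["comment*present"] * len(parser.comments)
    return visible_words, markup_tokens


visible, markup = split_streams(
    "<p>Ma<!-- open source linux -->ke mon<!-- xyz -->ey fast</p>"
)
```

    Here the visible stream recovers "make money fast", while the markup stream records one p tag and two comments without letting the random comment text pollute the word statistics.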

    --
    Life's a lot like money-- you spend it, then it's gone. Spend wisely.
  82. This could be used for good or evil. by UnknowingFool · · Score: 1
    In the wrong hands this technology could be a bad thing. I'm sure parents would like to remove all the content they don't want.

    But then again I could use it to filter out all the commercials and forward to Pamela Anderson jiggling.

    Sorry, I thought that said "Baywatch Filtering for Dummies"

    I am SO embarrassed.

    --
    Well, there's spam egg sausage and spam, that's not got much spam in it.
  83. Cease and Desist by RyanK · · Score: 1
    Due to our successes in other cases, we recommend that you discontinue use of our copyright, 'for dummies'.

    In addition, we require reimbursement for our attorneys' fees in the present amount of $140.00.

    I look forward to your response on or before January 30, 1998.

  84. But how do I get the Mozilla filters working? by Dhraakellian · · Score: 1

    One of my accounts is glutted with spam. (Unfortunately, it's not one that I can just close down.) I highlight the spam messages (ctrl+A, usually) and mark them as junk, but Mozilla Mail 1.3 doesn't seem to classify incoming messages as junk, even after several weeks of training. I know basically how Bayesian filtering operates, but how do I get it to work?

    --
    I've read Grocklaw. BoycottNovell, you're no Grocklaw
  85. MOD PARENT UP, Please by billstewart · · Score: 1
    I've been seeing tons of this, and it's clearly there to trick Bayesians.


    And rejecting HTML mail from non-whitelisted sources is probably a good thing anyway :-)

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  86. They're trickier than that now. by billstewart · · Score: 1
    The first round of this was just random characters in subject lines and small amounts of randomness in text. The stuff I'm getting now is different: it's HTML mail with vast amounts of nonsense or English-like HTML comments breaking up words, especially spammish words. I can't show this in real HTML, because it's too likely that Slashcode will eat it, but it works like this:
    Get your Via[COMMENT jwq;fj joihhh h ihgeiohg]gra here ch[COMMENT afdioghdsfhg]eap!
    Ma[COMMENT open source linux]ke Mon[COMMENT slashdot stallman]ey fa[COMMENT can't sleep clowns will eat me]st![COMMENT these aren't the spams you're looking for]!!!

    It looks like a really nasty attack on Bayesian filters, at least until the filters start recognizing HTML comments as a bad thing.
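
    A small Python sketch of treating such comments as a signal, assuming a comment with a letter directly on both sides is there to split a word. The names SPLITTING_COMMENT and comment_signal are made up for illustration:

```python
import re

# A comment with a letter immediately before and after it is splitting a word.
SPLITTING_COMMENT = re.compile(r"(?<=[A-Za-z])<!--.*?-->(?=[A-Za-z])", re.DOTALL)


def comment_signal(html):
    """Return (text with word-splitting comments removed, how many were found)."""
    cleaned, count = SPLITTING_COMMENT.subn("", html)
    return cleaned, count


cleaned, count = comment_signal(
    "Get your Via<!-- jwq;fj joihhh -->gra here ch<!-- afdioghdsfhg -->eap!"
)
```

    The cleaned text can go to the normal word tokenizer, and the count can be turned into an extra token so that the filter learns how spammy word-splitting comments are.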
    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks