Slashdot Mirror


Using gzip As A Spam Filter

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."

268 comments

  1. Grep it instead! by WestieDog · · Score: 2, Funny

    Forget about gzip all the 'cool' geeks use grep! :)

    1. Re:Grep it instead! by dubbayu_d_40 · · Score: 1

      How exactly would you know what to grep for? I believe the value is in pattern identification and grep can't do that. Then again maybe I'm just not 'cool' ;-)

    2. Re:Grep it instead! by Walterk · · Score: 5, Funny

      Just egrep for '(penis|enlarge|money|auction|cash|advance|fortune )'. And hope no hot babes email you complimenting your penis, or mention they want their breasts enlarged, offer you money, auction off your award winning lego collection or anything like that.

    3. Re:Grep it instead! by OzPixel · · Score: 1

      Humour aside, the spammers are already adapting to simple string filtering - a lot of the spam I get these days (all HTML, of course) uses my name inside comments in the middle of any words likely to be filtered, e.g.
      En(!-- name here)large your pe(!-- name here)nis.

      (except with angle brackets instead of parentheses, of course, I can't seem to find a way to get the angle brackets to appear properly ... )

      David.
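
      The comment trick described above can be countered by stripping HTML comments before any string matching. A minimal sketch, assuming the message body is already available as a string; the function name and the sample text are made up for illustration, and real HTML mail would need a proper parser rather than a regex:

```python
import re

# Matches an HTML comment, including ones that span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_html_comments(text: str) -> str:
    """Remove HTML comments so that words split by them are rejoined."""
    return HTML_COMMENT.sub("", text)

# Hypothetical sample in the style of the spam described above.
print(strip_html_comments("En<!-- name here -->large your in<!-- name here -->come"))
# prints: Enlarge your income
```

      After this step a plain keyword filter sees the rejoined words again.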

    4. Re:Grep it instead! by bobbozzo · · Score: 1

      I was doing exactly that (in Eudora), but then I realized I was losing mail from:
      my insurance company (grep for insurance spam)
      messages about income tax changes from friends (grep for tax)...

      Now, I use bogofilter. ID's MUCH more spam than my Eudora filters ever did, with very few false positives.

      --
      Nothing to see here; Move along.
    5. Re:Grep it instead! by some+guy+I+know · · Score: 1

      I can't seem to find a way to get the angle brackets to appear properly

      When using so-called "Plain Old Text" mode, use "&lt;" for "<".
      Two other escapes:
      "&gt;" for ">"
      "&amp;" for "&"

      --
      Those who sacrifice security to condemn liberty deserve to repeat history or something. - Benjamin Santayana
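
      The three escapes listed above are the same ones Python's standard library applies. A small sketch, assuming you want to prepare text for posting programmatically; the wrapper name is hypothetical:

```python
import html

def escape_for_plain_text(text: str) -> str:
    """Escape &, < and > so they display literally instead of being read as markup."""
    # quote=False leaves double and single quotes untouched.
    return html.escape(text, quote=False)

print(escape_for_plain_text("En<!-- name here -->large & co"))
# prints: En&lt;!-- name here --&gt;large &amp; co
```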
  2. Raw data by gazbo · · Score: 5, Informative

    This article will make much more sense if you look at the raw data in tabular form.

  3. It's all spam by amigaluvr · · Score: 4, Funny

    Hey if you compress all of your mail with gzip then it all looks like foreign spam anyway!

    1. Re:It's all spam by greenjinjo · · Score: 5, Interesting

      You know, I noticed something peculiar. If you're from a non-English speaking country, like I am, you can filter the spam by looking at the language of the subject. In my case, if it is English it is almost certainly spam.

      Do English-speaking people receive spam in foreign languages?

    2. Re:It's all spam by Drakin · · Score: 1

      Not often enough that doing something like that would catch even a small portion of Spam...

    3. Re:It's all spam by Anonymous Coward · · Score: 1, Interesting

      A mailing list I am on gets almost exclusively Korean spam.

      Or it did get it, I don't know anymore, I unsubscribed. The admin refused to make it a subscription-only list as it was easier for newbies to post questions to it, and they could get quicker answers.

      Unfortunately all that happened was newbies posted messages to the list and lost any replies from the few dedicated people left, in the deluge of spam.

      Otherwise, it's mostly just more English spam

    4. Re:It's all spam by dkf · · Score: 1
      Oh yes. Especially from Brazil and South Korea (though not so much of those today, thanks to the DDoS over the weekend that took out a fair bit of the core 'net.) Which just left the field open to those spamming in what would nominally be called English.

      If only spammers would always spam me in Korean, it'd be ever so easy to block... :^/

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
    5. Re:It's all spam by FrostedWheat · · Score: 1

      Do English-speaking people receive spam in foreign languages?

      In my @rocketmail.com account, every item of spam is Taiwanese.

      But that's only one account. All my others get flooded with English spam.

    6. Re:It's all spam by Anonymous Coward · · Score: 0

      Yes, we do. And lots of it. Although nowadays I can't really say how much it is anymore, since procmail deletes everything with non western charsets (except unicode). Call it blunt, but hey: I cannot read that stuff, so even if it is not spam, I'd just as well delete it...
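
      The procmail rule itself is not shown above, so the following is only a rough Python equivalent of the idea: reject any message that declares a charset outside an allowed list. The allowed set and the function name are assumptions chosen for illustration.

```python
from email import message_from_string

# Hypothetical allow-list; adjust to the charsets you can actually read.
ALLOWED_CHARSETS = {"us-ascii", "iso-8859-1", "iso-8859-15", "windows-1252", "utf-8"}

def has_allowed_charset(raw_message: str) -> bool:
    """Return False if any MIME part declares a charset outside the allow-list."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        charset = part.get_content_charset()  # lowercased, or None if undeclared
        if charset is not None and charset not in ALLOWED_CHARSETS:
            return False
    return True
```

      Messages that declare no charset at all pass this check, which matches the remark elsewhere in the thread that such mail is not always marked in the headers.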

    7. Re:It's all spam by cellocgw · · Score: 1
      Do English-speaking people receive spam in foreign languages?


      Yes but not often. OTOH, I don't think I've ever received a non-spam email.... ooops, yes I did. I criticized an American summer camp's Italian (mis)spelling on their website, and got back a nice long email all in Italian.


      Anyway, So long as it's a settable filter, language-specific spam control should be fine.

      --
      https://app.box.com/WitthoftResume Code: https://github.com/cellocgw
    8. Re:It's all spam by Jester99 · · Score: 1

      I've received thousands of spam messages (20-30 per day...) and perhaps with a couple exceptions that I'm forgetting, they've *all* been in English.

      Of course, an insane number of those spam messages seem to be duplicates of themselves sent day after day, but still. Everything's in English in my account. :\

    9. Re:It's all spam by raynet · · Score: 1

      Also filtering out Chinese spam is easy.. just look for odd unicode strings :)

      --
      - Raynet --> .
    10. Re:It's all spam by Anonymous Coward · · Score: 0

      I drop any message with Korean or Russian text, although it's not always marked as being Korean in the headers so sometimes it slips through. I believe SpamAssassin has an option to detect languages and block them.

    11. Re:It's all spam by nettdata · · Score: 1

      Do English-speaking people receive spam in foreign languages?

      I have an email address that is used exclusively for the Bugtraq mailing list over at securityfocus, and funnily enough, out of about 30 spams a day it gets, 50% are from some 163.com company based in China, and the other half are from Nigeria (the whole "please help us get millions of dollars out of the country" scam).

      I was always kind of curious as to the response the Nigerian guys got from a security-based mail-list.

      --



      $0.02 (CDN)
    12. Re:It's all spam by Yuan-Lung · · Score: 1

      I have received spam in all sorts of languages, including Chinese, English, Japanese, Korean, Russian, and quite a few that I could not identify with a quick glance.

      English and Chinese (both traditional from Taiwan and simplified from China) are the top 2 that account for over 90% of my spam.

      I think this might be very closely related to the fact that these are the 2 languages I speak. My guess is that either somehow my e-mail address got harvested from sites I visit, or my idiotic friends exposed my address to spam collectors. Either way the information would have been language specific.

    13. Re:It's all spam by Phroggy · · Score: 1

      Do English-speaking people receive spam in foreign languages?

      A pretty wide assortment of them, yes. And yes, if it's not in English, then it's spam, but of course most of my spam is in English.

      --
      $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
      $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
    14. Re:It's all spam by xkenny13 · · Score: 1

      I do ... of course, I am Asian, with an Asian last name, so I get all sorts of unreadable SPAM that apparently uses an Asian character set.

      I also keep getting SPAM in French ... someone who appears to be trying to sell art or something.

    15. Re:It's all spam by mousse-man · · Score: 1

      Lots of my spam is in some Asian character set, in Portuguese and German. Usually, those using German and living in my country of origin will have a small surprise since I call them. And lots of spam gets nuked by DNS blackholing, of course. I haven't seen spammers from Singapore yet - looks like they seem to enforce their laws in really interesting ways. Just why can't the Chinese do that either?

    16. Re:It's all spam by hplasm · · Score: 1
      "Do English-speaking people receive spam in foreign languages?"

      I get a lot of something in foreign languages. Is it spam? I don't know...translator!! Over here!

      --
      ...and he grinned, like a fox eating shit out of a wire brush.
    17. Re:It's all spam by Anonymous Coward · · Score: 0

      I get a small amount of non-English spam, but
      not much.

      Encoding is also informative - HTML mail is a lot
      more likely to be spam; base64 crap is always
      spam; and I haven't yet gotten a mail from a Real
      Person with image tags or Javascript in it.

    18. Re:It's all spam by Anonymous Coward · · Score: 0

      I just tell them all that I'm really interested.
      And to call me at (the number the last Nigerian
      minister wanted me to call 'em at.).

  4. Maybe I am missing something here by Anonymous Coward · · Score: 1, Interesting

    But isn't the spam quite varied, i.e. without long repetitive sequences? Yes, the same post may come in several times, but the text in each is quite varied; e.g. longer xxx, bigger xxx or yyy, heftier yyy and zzz.

    1. Re:Maybe I am missing something here by ackthpt · · Score: 1
      But isn't the spam quite varied,

      Well, the posting sure is, it's under Your Rights Online. Undoubtedly captainclever and timothy are in cahoots to sneak spam related articles under someone's filter.

      --

      A feeling of having made the same mistake before: Deja Foobar
    2. Re:Maybe I am missing something here by 6Yankee · · Score: 3, Funny

      the text in each is quite varied; e.g. longer xxx

      The text in each of my spams seems to have more XXX...

    3. Re:Maybe I am missing something here by ThundaGaiden · · Score: 1

      I think the point would be to have a dictionary lookup with common spam words or phrases
      Eg spam.txt

      "Penis Enlargement"
      "Naked women"
      "Massive amounts of money"

      I think that would cover my hotmail account :)

  5. Slashdot filter by fredrikj · · Score: 4, Interesting

    Sounds very much like that lameness filter on Slashdot that refuses to accept a post if its contents can be compressed easily... of course, it's quite simplistic compared to gzip.

    1. Re:Slashdot filter by vasqzr · · Score: 0



      Right, and as we all know, the lameness filter doesn't work for crap.

      You constantly see the same trolls posted over, and over, and over again.

    2. Re:Slashdot filter by pudge · · Score: 3, Informative

      Um, except that Slash uses gzip for its compression. So, no. :-)

      What is different, as has been pointed out, is that Slash compresses a particular post and looks at how well it compresses, but does not compress/compare with other posts.

    3. Re:Slashdot filter by McCart42 · · Score: 1

      In other words, if your post is more noise than signal, you're on the right track. ;)

      --
      "I may be quite wrong." - Socrates
    4. Re:Slashdot filter by fredrikj · · Score: 2, Funny

      Oops. Well, my experience from my troll accounts is that the filter does a lousy job, I could never have guessed that something that sophisticated was behind it ;)

      Err, ignore the troll account part, I never said that.

  6. this is nice by teejie · · Score: 1

    Slashdot reporting on a 'new' feature of gzip found by a user of k5 which they have been using themselves for quite some time now (lameness filter does exactly the same thing... sigh)

    1. Re:this is nice by gazbo · · Score: 3, Informative
      No, the lameness filter does nothing like this. The lameness filter (strictly the postercomment compression filter) just sees how well the isolated text compresses. Too high compression implies too much repetition (hence likely repeatedly copy+pasted junk), too low compression implies random chars - English contains plenty of redundancy.

      This, on the other hand, talks about gzipping the mail in the context of corpora of known spam or known ham. Thus it serves as a classification of types of English text, whereas the slashdot system only tries to classify whether or not it is actually English text at all.
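
      The first of the two approaches described above, compressing the isolated text and looking at the ratio, can be sketched as follows. This is not Slash's actual code; the function names and both thresholds are assumptions picked for illustration, and Python's zlib (the same DEFLATE algorithm gzip uses) stands in for gzip.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by original size; lower means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw, 9)) / len(raw)

def looks_lame(text: str, low: float = 0.2, high: float = 0.9) -> bool:
    """Flag text that compresses suspiciously well or suspiciously badly.

    The thresholds are hypothetical and would need tuning on real posts.
    """
    ratio = compression_ratio(text)
    return ratio < low or ratio > high
```

      Very short posts barely compress at all because of the fixed overhead of the compressed format, so a real filter would probably only apply the upper threshold to longer texts.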

  7. Re:Text of the full article by Anonymous Coward · · Score: 5, Insightful

    > The current fad among spam filters is word-counting, with various statistical heuristics applied to the results.

    The current fad is in fact Bayesian filtering, sophisticated statistical analysis.

    gzip used this way can be viewed as a very poor Bayesian analysis with substantially lower effectiveness. Let's just skip the half-assed attempt and go straight to the real thing.

  8. Meet the Bayesian Filtering Algorythm by dpete4552 · · Score: 5, Informative

    http://www.paulgraham.com/spam.html

    --
    http://www.archive.org/details/ThePowerOfNightmares
    1. Re:Meet the Bayesian Filtering Algorythm by dilute · · Score: 2, Informative

      Bayesian filtering looks at word occurrence statistics. This is saying just compare the bulk redundancies of a message as compared to a collection of test messages of a known type, without even looking at the "words". May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.

      OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed. How to implement that is, well, left as "an exercise for the reader" -- actually, this has been discussed on /.
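
      For contrast with the compression approach, the "word occurrence statistics" idea can be sketched as a plain naive Bayes classifier. This is a generic textbook version with add-one smoothing and equal priors, not the exact formula from the linked article and not what any particular filter implements; all names and the toy training data are made up.

```python
import math
from collections import Counter

def train(messages):
    """Count how often each lowercase word occurs in a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts

def log_likelihood(message, counts, vocab_size):
    """Sum of log P(word | class), using add-one (Laplace) smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in message.lower().split()
    )

def classify(message, spam_counts, ham_counts):
    """Return 'spam' or 'ham', assuming both classes are equally likely a priori."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_score = log_likelihood(message, spam_counts, vocab_size)
    ham_score = log_likelihood(message, ham_counts, vocab_size)
    return "spam" if spam_score > ham_score else "ham"

# Toy training data, purely illustrative.
spam_counts = train(["buy cheap pills now", "cheap pills free offer"])
ham_counts = train(["meeting moved to tuesday", "please review the patch"])
print(classify("cheap pills offer", spam_counts, ham_counts))  # prints: spam
```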

    2. Re:Meet the Bayesian Filtering Algorythm by coyul · · Score: 5, Informative

      OTOH, it seems to me that some other model, such as a scheme that gives legitimate senders explicit advance AUTHORIZATION to send you email, might be what's needed.

      I understand what you're saying, but there are a couple of problems with this, depending on how you implement it. If you allow potential correspondents to request authorization by email, you'll still have to process at least one message per originating address. That obviously won't work to eliminate spam (or even cut it down to size...) The other option is to force potential correspondents to request authorization over another channel (phone, fax, whatever), but this neatly destroys a lot of the convenience of email. It also eliminates the impersonal nature of email (by forcing a personal contact) when it is partly this impersonality that distinguishes it in the first place (and encourages some first time correspondents to make contact at all...)

      May not be the ultimate filter (and I doubt it could be), but it's real interesting, I think, that this appears to have considerably greater than zero accuracy.

      Actually, the Bayesian filter implemented by POPFile is remarkably accurate. A friend of mine has been using it since it debuted on slashdot in November and it has correctly classified all of the spam he's received since (76% of his email in total, unfortunately...)

      You can also set up POPFile to process the headers of your messages as well as the body, so it will effectively learn the email addresses of people you're willing to receive email from anyway. Depending on how you define words (what you use as token separators), you could attempt to make it generalize to domains as well.

    3. Re:Meet the Bayesian Filtering Algorythm by dilute · · Score: 1

      You could put their mail in a putative spam folder and send them an explanatory message with a link to a web page where they can get "authorized" and put on a "buddy list". On that page you could do a variety of things, depending on whether you just wanted to screen out automated mailers or really wanted to pre-qualify the sender.

      I'm skeptical about heuristic filters, because of the possibility of the occasional false positive, which could be an embarrassment (or worse).

      However, the filtering technology is very much of interest to me, for other reasons... I will take a look at POPFile for sure.

    4. Re:Meet the Bayesian Filtering Algorythm by crisco · · Score: 1

      I get better than 99% accuracy with POPFile across ALL my categories of email. I think that by using more categories than just spam vs nonspam I get better accuracy. Unfortunately, not everyone seems to enjoy the same level of accuracy with it.

      --

      Bleh!

    5. Re:Meet the Bayesian Filtering Algorythm by kirkjobsluder · · Score: 1

      I'm skeptical about heuristic filters, because of the possibility of the occasional false positive, which could be an embarrassment (or worse).

      I think it is better (and recommended by filter programmers) to use the filters as an aid for classification rather than the end of classification. Especially because filters such as spamassassin detect problems with the mail header that are difficult to eyeball at a glance. But honestly, having used spamassassin for the last year I find the concern about false positives to be a bit overblown. Spamassassin just looks for the same features I look for to identify spam. Humans also have false positive rates as well so it is not obvious to me that a filter which examines the entire message would have a higher false positive rate than a human being scanning the from and subject line.

      So at some point, doesn't the sender bear some responsibility for composing a message in such a way that it looks and feels like spam? Almost all of the spam messages I get have more than a half-dozen features that are used to classify them as spam. About half of those features involve malformed header information that almost no legitimate mail user agent produces. The claims that heuristic filters will mean missing the cold-call job offer or the dirty invitation from your sweetie are highly inflated.

  9. Compression detection of spam by Alcohol+Fueled · · Score: 1

    So.. does this mean that we'll be seeing e-mail specific programs from companies that make software like gzip and such?

    --
    Ah am not a crook! (\(-__-)/)
  10. Right tool for right job by WPIDalamar · · Score: 2, Interesting

    Sure, this sounds like a nice academic activity, but really ... In the real world, use the right tool for the right job. I tend to think word redundancy does not correlate directly to spaminess.

  11. HTML by Pilferer · · Score: 5, Interesting

    That's because most spam includes large amounts of HTML.

    My friends do not use HTML in email. Ads for "Crimescene Cocksuckers" do.

    1. Re:HTML by ^BR · · Score: 1, Insightful

      You're a moron that didn't read the article.

      The idea is to have a corpus of spam and a corpus of ham, to append the new message to it and to see in which case the message to test compresses best to classify it.
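
      The idea described above can be sketched in a few lines. This is a minimal illustration, not the article author's code: zlib stands in for gzip (same DEFLATE algorithm), and the function names and tiny corpora are invented. One known limitation: DEFLATE only looks back 32 KB, so with large corpora only the tail of each corpus actually influences the result.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

def extra_bytes(corpus: bytes, message: bytes) -> int:
    """How many more compressed bytes the corpus needs once the message is appended."""
    return compressed_size(corpus + message) - compressed_size(corpus)

def classify(message: bytes, spam_corpus: bytes, ham_corpus: bytes) -> str:
    """Pick the corpus alongside which the message compresses best."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"

# Toy corpora, purely illustrative.
SPAM = b"Click here for a free offer on cheap pills. Enlarge your income today, act now!\n" * 3
HAM = b"The patch for the gzip bug is attached, please review it before the Tuesday meeting.\n" * 3

print(classify(b"Act now: click here for a free offer on cheap pills.", SPAM, HAM))
```

      A message that shares long phrases with one corpus is encoded mostly as back-references to that corpus, which is why it adds fewer bytes there.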

    2. Re:HTML by UberLord · · Score: 1

      Unfortunately, most of my friends use Outlook or Outlook Express which defaults to HTML emails for sending.

      I'd guess most of the Microsoft dominated world does as well....

    3. Re:HTML by phrantic · · Score: 2, Informative

      Another problem with html is that, if there is some level of sophistication on the part of the spammer they can embed a file (a gif or jpg) in the html that has a unique name that is uniquely associated with your email address. You open the mail, the file is requested (it doesn't even have to exist) but the 404 error or the html get can be logged on the server, and then it is a simple matter of matching the requested files to the email address and you have a list of good email addresses. This is a really useful technique for "closed loop marketing" which is the corporate edition of Spam.

      --
      --My sig is bigger than your sig--
    4. Re:HTML by Lussarn · · Score: 1

      You are quite right except that you don't have to embed anything. Just put an image tag in the mail.

    5. Re:HTML by PetWolverine · · Score: 1

      I don't think he's a moron, and I bet he read the article.

      The point of the comment to which you replied was that since spam generally has a lot of HTML, a lot of the same strings, such as "<b>" and "<a href=" will appear again and again, both in the corpus and in the test message. The extra occurrences of these strings in the spam corpus will help the test message to compress better if it's spam. If it's not spam, it generally won't be HTML, so it won't contain all those tags and their occurrence in the corpus won't cause it to compress more.

      --
      I found the meaning of life the other day, but I had write-only access.
    6. Re:HTML by PetWolverine · · Score: 1

      That's a damned good reason to keep HTML support turned off in your mail program.

      Luckily, Apple's Mail program allows this to be turned off. I'm not sure if Outlook or Entourage does, but I consider this feature a major selling point of an email client.

      --
      I found the meaning of life the other day, but I had write-only access.
    7. Re:HTML by Lord_Breetai · · Score: 1

      just put a image tag in the mail.



      True. However even that is suspect. That is, who but spammers just put an image tag and nothing else?



      --
      "You are only young once, but you can be immature forever." -www.animemusicvideos.org
    8. Re:HTML by T-Ranger · · Score: 1
      Spammers who know that text analysis filters exist, so they put all their text into Photoshop and then put an image up on, e.g., geocities.

      One-line messages send really fast, and (e.g.) geocities has to deal with the bandwidth. It's been done.

  12. Great by FungiSpunk · · Score: 1

    Cool! Now I can compress all that useless crap with non-useless crap, compare them, then collect more until it uses just as much space as it did when it was non-compressed! ;)

    --

    "I kill you! You no good 56'ing!"
  13. Excellent by Phosphor3k · · Score: 5, Funny

    Slashdot can use it to filter out duplicate stories.

    1. Re:Excellent by oktokie · · Score: 1

      Just run CommanderTaco through gzip filter.
      He is the source of duplicated posting in our universe.

  14. Legislation by ultrabot · · Score: 0, Flamebait

    I would still rather see a law that would sentence the spammers to death without parole... At least there would be higher barrier of entry to spamming.

    --
    Save your wrists today - switch to Dvorak
    1. Re:Legislation by liquidsin · · Score: 2, Funny

      That's pretty harsh. Once the death sentence has been carried out, I see no reason not to parole them. Have some compassion.

      --
      do not read this line twice.
  15. Nice summary/quote by Anonymous Coward · · Score: 0

    Thanks, I know what compression is.

  16. It won't work for businesses by autocracy · · Score: 4, Funny

    Anything from mid-level management or the marketing department would immediately be marked as spam and trashed. Maybe not very important in the first place, but you'd at least need to be able to say "yeah, I saw the memo on the TPS reports."

    --
    SIG: HUP
    1. Re:It won't work for businesses by blibbleblobble · · Score: 2, Funny

      "Anything from mid-level management or the marketing department would immediately be marked as spam and trashed."

      And the problem?

  17. In additon by some+homeless+guy · · Score: 0, Funny

    In comments submitted on Kuro5hin, a question (see comment) is raised on whether or not Slashdot employs a similar technique (as presented in the article) to foil spam-flooders

  18. Wow, I just can *** see *** that it's spam by Anonymous Coward · · Score: 0

    It seems you're caught up in technology too much to use read...no, forgot, you can't.

  19. Spam Conference talk by Matts · · Score: 4, Interesting

    Jason Rennie gave an extremely interesting talk about this at the MIT Spam Conference this month, although he wasn't using quite as direct a method; instead he was looking at MDL - Minimum Description Length. This is a technique to discover features in corpora that allow you to describe the classification of a corpus in the minimum number of details.

    Basically it's a way to discover features in emails using compression techniques, so rather than having us SpamAssassin developers have to carefully and manually examine emails to see what's new and interesting about them, MDL techniques can automatically detect these features.

    Jason Rennie's web page (talk and paper available) about this is here. Please do read it as it's extremely interesting.

    The one downside of it is that Jason said at the end of his talk that it's extremely slow at doing the feature detection. When asked how slow he said that on a reasonably small corpus it took 4 months (although he said it was written in Perl, so a C port is probably a good plan).

    In comparison to Bayesian techniques the MDL technique presents a great deal of interest - primarily because I work for a company doing spam filtering at the internet level, and so we can't feasibly do personal training which is what makes Bayesian techniques so great (see the talk I gave at the MIT spam conference). Without the personal training Bayes is only about 90-95% effective, so it should be interesting to see where these techniques lead us.

    --

    Matt. Want XML + Apache + Stylesheets? Get AxKit.
    1. Re:Spam Conference talk by ajs · · Score: 2, Interesting

      I think, at the Internet level, RBLs (mirrored by you, obviously for speed's sake) and such are your best weapon. The more of the net you have by the short patch-cables, the more significant you make each RBL that you listen to.

      At the personal level, each of these newly "discovered" techniques (I remember a /. article about using gzip for analysis of other document structures years ago) will make a fine addition to statistical systems like SpamAssassin, which uses them to build a very accurate model of a piece of mail's "spamishness".

    2. Re:Spam Conference talk by Matts · · Score: 2, Insightful

      Actually it's the other way around. DNSBLs (not RBLs - that's a specific term for MAPS' list) are fine for personal users, and even for some businesses, but generally they have way too high a false positive rate for any kind of generic filtering. The SpamAssassin team has done lots of research into this, see for example the slide at the very end of my talk.

      No, for a large scale service you need much lower rates of false positives than any of the DNSBLs provide right now. They're fine as inputs into heuristic or statistical systems, but on their own they are just not accurate enough.

      --

      Matt. Want XML + Apache + Stylesheets? Get AxKit.
    3. Re:Spam Conference talk by archeopterix · · Score: 4, Insightful
      MDL, gzip, neural networks, bayesian filtering and probably a bunch of other spam-filtering methods are all based on the following scheme: get a (big) number of spam messages, a number of non-spam messages (preferably specific to the current user of the filter) and use a learning algorithm on these to produce an automatic classifier.

      What bothers me about this method is that you can never be 100% sure what the learning algorithm will actually learn. My friends seldom send me HTML mail. Most of my spam is HTML. A learning algorithm will probably learn that HTML mail is spam, especially if it never gets HTML "ham" during its training period. Then if one of my clueless friends sends me a HTML message, it will not go through and this is clearly bad.

      I will never trust an automatic filter so as to delete a message marked as "spam" without reading, but I think it can still be useful for ranking messages, so that spam gets read less often and deleted faster.

    4. Re:Spam Conference talk by Anonymous Coward · · Score: 0

      Collateral damage. I can't tell you how much I hate RBL style spam filtering. I used to be fairly ambivalent on whether it was a good idea or not, but after being on the wrong side of a filter because of collateral damage I see that it is NOT the solution. At this point in time I think the best solution for spam is to mark suspect messages but still allow delivery. Let the end user decide if the message is a problem for them, and let the user decide if they read the message or not.

      Systems like SPEWS should die.

    5. Re:Spam Conference talk by ajs · · Score: 2, Interesting

      But, aren't those "false positives" (usually so-called innocent open relays and people sharing netblocks with spammers) what you want?

      In the case of open relays, yes a whole company can be hosed mail-wise when they get on a list, but if multiple BLs agree, then they've got a problem that needs to be fixed.

      For the case of people who share a spammer's address range, I feel for them, but... do I really want to take the pressure off of them in favor of flooding the world with spam? I'd personally be pissed at my ISP for allowing such spammers to screw over MY reputation among the BLs. ISPs should behave accordingly, but right now why would they? They get far more money from spammers than from people who will leave because a few folks listening to the BLs get mail from your customers.

      Spam is an ugly thing, and combating it is hard. Casualties are going to arise. The question is: how do you minimize that list of casualties and make sure that people know the safety dance ahead of time.

    6. Re:Spam Conference talk by Matts · · Score: 1

      No. False positives are always bad. A false positive means you blocked a legitimate mail. A mail that was not spam. A mail that was not from a spammer, but from a person trying to contact you.

      Frankly it's the spammers that should suffer, not the legitimate users. False positives in the fight against spam cause nothing but animosity. We've had DNSBLs for a long time now, and I see nothing but an increase in the level of spam. Are DNSBLs working for you? Maybe. Is the collateral damage model reducing the amount of spam the world sees? Nope. Not remotely.

      Time to move on, try something else. Time to stop more spam and hit them in the pocket. We've no evidence that will work either, but at least we're trying something.

      --

      Matt. Want XML + Apache + Stylesheets? Get AxKit.
    7. Re:Spam Conference talk by TheClarkey · · Score: 1

      Ah but as you've said yourself, you get a bunch of spam and a bunch of non spam. You would simply have to ensure that the non spam messages include a proportion of sample HTML E-Mail messages to ensure that the classifier isn't just going to base it on HTML content. :)

    8. Re:Spam Conference talk by crisco · · Score: 1
      I will never trust an automatic filter so as to delete a message marked as "spam" without reading, but I think it can still be useful for ranking messages, so that spam gets read less often and deleted faster.
      I'll agree, while I'm enjoying wonderful accuracy from POPFile I still scan the spam classified folder for errors. Scanning a folder of spam for one or two wanted emails is much quicker than selectively deleting them mixed in 50/50 with stuff I want.
      --

      Bleh!

    9. Re:Spam Conference talk by BinaryC · · Score: 1

      I thought the same thing at first, but then I tried POPFile (http://popfile.sourceforge.net/) and have been more than impressed. When I first installed it, it obviously had an error rate of like 50%, but it's been down to 0% consistently for the past 3 weeks or so. It's a lot smarter than most filters in that it doesn't just say, "oh, this message has html, it's spam". In fact, it actually ignores HTML, preferring to look at the content of the message. I was also worried about things like receipts from ordering things online, but surprisingly none of those has been marked as spam. For reference, I get about 30 emails a day.

      You can also see exactly what words are in the lists and modify accordingly. You can also set a black-list of words to never be picked up, and you can set up magnets that always classify a message as a certain type (I don't use magnets, it works so well that I don't have to).

      --
      Ne Quid Nimis - All things in moderation
    10. Re:Spam Conference talk by leviramsey · · Score: 1
      For the case of people who share a spammers address range, I feel for them, but... do I really want to take the pressure off of them in favor of flooding the world with spam?

      Please allow me to extrapolate...

      "For the case of people who share an ethnic background with terrorists, I feel for them, but... do I really want to take the pressure off them in favor of flooding the world with spam?"

      The fact is: DNSBLs are the Internet equivalent of those on the right wing who seem intent on banishing anyone of Middle Eastern descent because they could be terrorists or in proclaiming that "when you buy drugs, you are supporting terrorism."

    11. Re:Spam Conference talk by unger · · Score: 1

      please mod parent up

    12. Re:Spam Conference talk by ajs · · Score: 1

      1) Ethnicity is not a contractual or consensual matter. You cannot "opt out" of your ethnicity and go with another provider. The same cannot be said for an ISP. If you do business with an ISP that supports spamming, I think you should expect service to be degraded by that activity, and be pissed with said ISP when/if you find that they've tarnished YOUR reputation by doing business with scum.

      2) The phrase at the end of your mail should be "when you buy drugs from vendors known to support terrorism, you support terrorism". There are holes in that statement, just as there are holes in the logic of blacklists, but voluntary blacklists are one of our best weapons against spam, just as not buying from disreputable vendors is our best defense as simple consumers against the misuse of our money. In the drug example, what's often glossed over is that legal recreational drugs would not have to be purchased from disreputable vendors....

      Remember also that blacklists are reputation repositories like credit reports or campaign donation lists. An ISP is just as free to use a blacklist to determine who to allocate more bandwidth to because they want to support spam as they are to use it for blocking. It's not a unilateral decision process, even when it's unilateral, and I don't think it should be. Also, I think that if my ISP decides to block such access, they should be willing to give me access to an un-blocked server if I want it.

    13. Re:Spam Conference talk by ajs · · Score: 1

      "It's not a unilateral decision process, even when it's unilateral"

      oops. I meant "even when it's unanimous". Duh :)

    14. Re:Spam Conference talk by leviramsey · · Score: 1
      I think you should expect service to be degraded by that activity, and be pissed with said ISP when/if you find that they've tarnished YOUR reputation by doing business with scum.

      Very well. So how about this situation: an anti-porn group decides that the best way to get rid of porno mags is to photograph everyone going into the local 7-Eleven as well as their license plates, regardless of whether or not they come out with a copy of Hustler and buys an ad in the local paper saying "these people support evil pornography by supporting the largest porn dealer in the US," with photos of them and their cars, with names and addresses, courtesy of the DMV. That's exactly what the anti-spam blacklists are: they are designed to punish those who do business with companies that provide services to "undesirables". It's also akin to the '60s, when (not just in the South), various white groups recorded the names of restaurants that served blacks, publicized them, and considered any white that ate at such a restaurant to be a nigger-lover. Some went so far as to park their cars in front of the entrances to such establishments (initiating a DoS attack, basically).

      Also, I think that if my ISP decides to block such access, they should be willing to give me access to an un-blocked server if I want it.

      However, the bulk of ISPs that use the various blacklists do not offer their users a choice, nor do they publicize that the blacklists are in place. They often use the blacklists at the level of their border routers (hell, IIRC, AboveNet (whose CTO happens to be the main guy behind MAPS), a backbone provider (!!!!) uses MAPS at their border routers).

      I have no problem with proper use of the blacklists. Proper use would be having something like SpamAssassin (which can do its own queries to the dbs) use whether something is on the blacklist to give it a slightly higher score on the spam count. However, such a tool must be entirely opt-in, and users should have the ability to use their own .spamassassin files, with customized weights that would include disabling the blacklist lookups.

    15. Re:Spam Conference talk by ajs · · Score: 1

      "Very well. So how about this situation: an anti-porn group decides that the best way to get rid of porno mags is to photograph everyone going into the local 7-Eleven as well as their license plates, regardless of whether or not they come out with a copy of Hustler and buys an ad in the local paper saying "these people support evil pornography by supporting the largest porn dealer in the US," with photos of them and their cars, with names and addresses, courtesy of the DMV."

      It's again a weak analogy, obviously crafted to cast DNSBLs in the light of evil privacy invaders.

      Facts:

      1. You don't get on a DNSBL by sharing a netblock with spammers. Your netblock gets on a DNSBL by having spammers who the owning ISP refuses to squelch. So, your example would be that your city block gets published.

      2. A DNSBL is not published in a forum like a newspaper. It's a stand-alone resource that is queryable. So, your example would be having your city block published in a mail-order list of porn-supporting city-blocks.

      3. No one ever got a date from being on a DNSBL ;-)

      "I have no problem with proper use of the blacklists. Proper use would be having something like SpamAssassin (which can do its own queries to the dbs) use whether something is on the blacklist to give it a slightly higher score on the spam count. However, such a tool must be entirely opt-in, and users should have the ability to use their own .spamassassin files, with customized weights that would include disabling the blacklist lookups."

      So services like cell-phone and pager email should never have spam-filtering, since you're not given a shell via which to modify such a file? Poo! I want my ISP to dump the spam. I'm actually fine with them even dumping the ultra-rare message that looks just like spam, but I wanted to see. I just want the 2-300 pieces of spam that I get per day (which is what you get when your email address pre-dates the existence of spam) to GO AWAY.

      I use SpamAssassin for this, but for those who can't, I think the ISPs should take on that burden and do the right thing. Black-lists, Razor and other strong indicators of spammishness should be very close in score to the threshold so that they almost push a message over the threshold by themselves.

    16. Re:Spam Conference talk by leviramsey · · Score: 1
      ...crafted to cast DNSBLs in the light of evil privacy invaders.

      No, just calling a spade a spade.

      You don't get on a DNSBL by sharing a netblock with spammers. Your netblock gets on a DNSBL by having spammers who the owning ISP refuses to squelch. So, your example would be that your city block gets published.

      That's no different, for all intents and purposes. And even worse, because people with no connection whatsoever to the 7-Eleven get blacklisted.

      A DNSBL is not published in a forum like a newspaper. It's a stand-alone resource that is queryable. So, your example would be having your city block published in a mail-order list of porn-supporting city-blocks.

      Irrelevant. There is no effective difference between "publishing in a forum like a newspaper" and publishing in "a stand-alone resource that is queryable". Both are publishing the information for public consumption and dissemination.

      So services like cell-phone and pager email should never have spam-filtering, since you're not given a shell via which to modify such a file?

      Where do I say that you need a shell? Many ISPs have written PHP scripts on their webservers which allow users to adjust their spamassassin and procmail configurations (there's probably a few on sourceforge, also). What I'm saying is that email services should provide their users with a means to control how the filters work.

      I want my ISP to dump the spam

      Then you enable the "Ultra-fascist shoot on sight mode"... My point is simply that ISPs should a) not implement draconian anti-spam policies without telling the users and b) allow the users who want a different (be it heightened or lessened) level of anti-spam measures to choose and enforce their desired level.

    17. Re:Spam Conference talk by ajs · · Score: 1

      Cool, you get your ISP, I'll get mine.

      Nuff said.

  20. Quantitive, not qualititive by psplay · · Score: 5, Interesting

    It's not simply the words that are used in a mail, but the way they are used (the order) that gives a sentence its meaning.

    For example, two emails:

    1 (ham) "You have won a brand new Convertible, from the competition you entered."

    and

    2 (spam) "A brand new convertible to be won, have you entered?"

    Ham would match about 80% with spam.

    Word matching is a blunt instrument as mentioned. The English language is far too complex for simple calculations, this fact should be taken into consideration, when applying a 'Spam Likelihood' rating to Emails.
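
    A minimal sketch of where a figure like "about 80%" can come from, assuming the match is measured as the share of the spam's distinct words that also occur in the ham. The helper names words and shared_fraction are made up for illustration; word order is thrown away, which is exactly the weakness described above.

```python
import re

def words(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def shared_fraction(a, b):
    """Fraction of the distinct words in `a` that also occur in `b`."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa) if wa else 0.0

ham = "You have won a brand new Convertible, from the competition you entered."
spam = "A brand new convertible to be won, have you entered?"

# 8 of the 10 distinct words in the spam also appear in the ham.
print(shared_fraction(spam, ham))
```

    Measured the other way round (share of the ham's words found in the spam) the number is lower, so the exact percentage depends on the definition chosen.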

    1. Re:Quantitive, not qualititive by iapetus · · Score: 4, Interesting

      If I see either of those in my inbox, it's almost certainly spam. You don't think you really filled in all of those 'feedback' forms about sex toys that you keep getting responses from, do you?

      --
      ++ Say to Elrond "Hello.".
      Elrond says "No.". Elrond gives you some lunch.
    2. Re:Quantitive, not qualititive by Alcohol+Fueled · · Score: 1
      Or how about:

      1 (ham) "You can now increase your performance, having ordered Viagra."

      and

      2 (spam) "You can increase your performance with Viagra, have you ordered?"

      I'm in class, and I'm bored... :)

      --
      Ah am not a crook! (\(-__-)/)
    3. Re:Quantitive, not qualititive by Anonymous Coward · · Score: 0, Funny

      or.....

      1 (ham) "Winning a brand new convertible you have, from entered in the contest you were."

      and

      2 (spam) "Winning the convertible you can, enter you have?"

      Both would immediately be recognized as from recent lame movie, and dumped by the filter...this problem isn't as easy as it appears.

    4. Re:Quantitive, not qualititive by Anonymous Coward · · Score: 1, Funny


      Yoda filter then, this is like?

    5. Re:Quantitive, not qualititive by Anonymous Coward · · Score: 0

      "You've just won a prize, send a £50 administration fee to receive it", is quite a common type of dead tree spam round here.

    6. Re:Quantitive, not qualititive by Anonymous Coward · · Score: 0

      Umm...from both examples can't you just check if it has a question mark? :)

  21. Good, now go to level 2... by zanderredux · · Score: 1

    ... and try to do that with /bin/echo !!!!

  22. Don't compress by Fuzzums · · Score: 3, Funny

    Usually I don't compress my spam.

    I delete it.

    This will save me a lot more space ;-)

    --
    Privacy is terrorism.
  23. Not that different by Synonymous+Soured · · Score: 5, Interesting

    A Bayesian spam filter uses an underlying order-0 Markov model of email messages. gzip uses an underlying order-1 Markov model.

    A Bayesian filter uses words as "symbols." gzip uses bytes as symbols.

    The right thing to do would be to combine them. Take a gzip-style Markov model, using bytes as symbols and conditional probabilities, and plug it into a Bayesian filter. That would (1) make the filter more powerful and (2) make the filter applicable to any sort of data, arbitrary binary or readable text. Negligible computational overhead, sharper discrimination.
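
    A rough sketch of what such a combination might look like, under stated assumptions: one order-1 byte model per class (probability of a byte given the previous byte, with add-one smoothing), compared through Bayes' rule in log space. The class name ByteMarkov, the classify helper and the two tiny training strings are hypothetical; a real filter would train on full mail corpora.

```python
import math
from collections import Counter

class ByteMarkov:
    """Order-1 byte model: P(next byte | previous byte), add-one smoothed."""

    def __init__(self):
        self.pair_counts = Counter()  # (prev, cur) -> count
        self.prev_counts = Counter()  # prev -> count

    def train(self, data: bytes):
        for prev, cur in zip(data, data[1:]):
            self.pair_counts[(prev, cur)] += 1
            self.prev_counts[prev] += 1

    def log_likelihood(self, data: bytes) -> float:
        total = 0.0
        for prev, cur in zip(data, data[1:]):
            num = self.pair_counts[(prev, cur)] + 1
            den = self.prev_counts[prev] + 256  # 256 possible next bytes
            total += math.log(num / den)
        return total

def classify(message: bytes, spam_model, ham_model, prior_spam=0.5):
    """Return 'spam' if the posterior log-odds favour the spam model."""
    log_odds = (math.log(prior_spam) - math.log(1 - prior_spam)
                + spam_model.log_likelihood(message)
                - ham_model.log_likelihood(message))
    return "spam" if log_odds > 0 else "ham"

# Toy training data, invented for this example.
spam_model, ham_model = ByteMarkov(), ByteMarkov()
spam_model.train(b"buy cheap pills now, click here to buy cheap pills")
ham_model.train(b"the meeting is moved to friday, see the agenda attached")

print(classify(b"click here to buy", spam_model, ham_model))
print(classify(b"see the agenda", spam_model, ham_model))
```

    Because the symbols are bytes, the same code runs unchanged on binary attachments or any other data.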

  24. Same old problem... by artemis67 · · Score: 5, Insightful

    Filtering is not a true spam solution. All it takes is for one false positive on a Really Important Email and be accidentally deleted to totally destroy the value of any filtering system.

    Given that, the alternative to having tagged emails automatically deleted is to collect them in a folder and scan the message senders and subject lines. If you're doing that, then the spammer is getting a pitch through to you in the subject line. This therefore does not lessen the incentive for the spammer, but simply causes him to change tactics and put his best pitch in his subject line.

    Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day? Again, filtering starts to break down, because I have SO MANY messages to scan every day that the possibility of me missing a legitimate one is very high.

    1. Re:Same old problem... by isorox · · Score: 2, Informative

      I usually cope by having a couple of folders in kmail I flush spam into

      BODY contains "The following message was sent to you as an opt-in subscriber to RB Express."
      FROM contains Trivia
      TO or CC contains "johnsmith@isorox.co.ku"
      FROM contains theracingpost.com
      TO or CC contains "spam" (I use sitespam@isorox to sign up to sites)
      BODY contains "to receive" AND "more of these offers"
      Move to a Spam folder

      If TO or CC doesn't contain
      isorox.co.ku
      exeter.ac.ku
      ex.ac.ku

      Move to possible Spam

      That gets about 80-90% of my spam.
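
      For anyone not using kmail, the rules above can be sketched as a single function. This is only an illustration: the function name sort_message and the "inbox" fallback are made up, and matching is assumed to be case-insensitive substring matching.

```python
OWN_DOMAINS = ("isorox.co.ku", "exeter.ac.ku", "ex.ac.ku")

def sort_message(sender, recipients, body):
    """Return the folder name for one message.

    sender     -- the From header as a string
    recipients -- the To and Cc headers joined into one string
    body       -- the message body
    """
    sender, recipients, body = sender.lower(), recipients.lower(), body.lower()

    definite_spam = (
        "the following message was sent to you as an opt-in subscriber to rb express." in body
        or "trivia" in sender
        or "johnsmith@isorox.co.ku" in recipients
        or "theracingpost.com" in sender
        or "spam" in recipients
        or ("to receive" in body and "more of these offers" in body)
    )
    if definite_spam:
        return "Spam"
    # Mail not addressed to one of the known domains is only suspicious.
    if not any(domain in recipients for domain in OWN_DOMAINS):
        return "Possible Spam"
    return "inbox"

print(sort_message("TriviaPOP <news@example.com>", "me@isorox.co.ku", "hi"))
```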

      I skim Possible Spam when I get time, usually once or twice a day. I skim Spam about once every 2 days. I've got a couple of rules that just delete the spam straight off (known junk addresses that I'll never need, certain subjects, etc). Keep all my spam too, and check it when I get time, just in case.

    2. Re:Same old problem... by timlewis_atlanta · · Score: 1

      I agree, but until there is worldwide legislation brought in to stop spamming then spam is going to remain a problem. Likewise, the issue with false positives is a problem. That said, I've been using Popfile, never had a single false positive and get almost exactly 1% false negatives. It's about the best solution I've seen that's available today.

    3. Re:Same old problem... by Anonymous Coward · · Score: 0

      I get about three a week. I went over a year where the only email I got was my ISP telling me it was time to pay my monthly bill. Then I started conversing with two relatives over email. Within a week, I started getting spam.

      One of the relatives used Hotmail and the other used MSN.

    4. Re:Same old problem... by djmurdoch · · Score: 4, Interesting

      Filtering is not a true spam solution. All it takes is for one false positive on a Really Important Email and be accidentally deleted to totally destroy the value of any filtering system.

      One of the side effects of spam is that there are no "Really Important Emails" any more. Spam and spam filters have degraded the reliability of email to such an extent that you'd have to be crazy to send anything Really Important by email.

      Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day?

      That's a good point. The solution is to get less spam. You can do that by changing email addresses frequently (a really inconvenient solution that I don't recommend), or by getting spammers shut down (or yourself listwashed by the spammers).

      Let the spammers know that if they send something to you, they'll lose money, and they won't send you so much spam. SpamCop reporting makes this easy. If you want to be listwashed, don't munge your address when you send reports. (This is an option with SpamCop.)

      Some people claim that you'll get more spam or get listbombed or something if you send complaints without munging; that's not my experience. I get 20-30 spams per day, total, at all of my 4 publicly available email addresses. (Ninety to 95 percent of them get caught by the SpamCop filters, which have almost never caught valid email.)

    5. Re:Same old problem... by johnburton · · Score: 1

      Frankly I doubt that very many people are so important that losing a single email is that important. And if it is then email is not the appropriate way to send the information as it's not 100% reliable anyway.

      --
      Sig is taking a break!
    6. Re:Same old problem... by Kragg · · Score: 1

      False positives don't destroy the value of filtering at all. I find it massively helpful not to get irritated by alerts 50 times a day when I receive another bloody spam message.
      And I don't miss the false positives because I scan my spam. But the key point is I don't interrupt what I'm doing in order to respond to spam anymore. Well, less often anyway.

      Spam is bad, but spam is life. Filtering is not perfect, but it is helpful.

      --
      If you can't see this, click here to enable sigs.
    7. Re:Same old problem... by ch-chuck · · Score: 2, Interesting

      If you're looking for the mathematically perfect zero fault spam solution in a world full of Msft and human beings, forget it.

      What happens when I start getting 600-800 a day?

      Start another account and don't give it to strangers who might sell it. Only give it to the person or persons who are going to send that really important email message. Throw in a few random numbers so if one gets leaked to spammers you can track the source (i.e., I gave my employment agency (obviously an important contact) chuck369, and nobody else. Now if chuck369 starts getting spam we know employment agency leaked it). Use 'throw away' accounts for untrusted contacts who might leak it to spammers.
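
      A small sketch of the bookkeeping this needs, assuming the aliases are a base name plus three random digits as in the chuck369 example. The AliasBook class and its method names are invented for illustration; creating the actual mailboxes or aliases is left to whatever the mail provider offers.

```python
import secrets

class AliasBook:
    """Remember which contact each tagged address was handed to."""

    def __init__(self, base):
        self.base = base
        self.issued = {}  # alias -> contact it was given to

    def issue(self, contact):
        """Create a fresh alias such as 'chuck369' for one contact."""
        while True:
            alias = f"{self.base}{secrets.randbelow(900) + 100}"  # digits 100-999
            if alias not in self.issued:
                self.issued[alias] = contact
                return alias

    def leaked_by(self, alias):
        """Return the contact that was given this alias, or None if unknown."""
        return self.issued.get(alias)

book = AliasBook("chuck")
alias = book.issue("employment agency")
# If spam later arrives at `alias`, the lookup names the source of the leak.
print(alias, "->", book.leaked_by(alias))
```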

      --
      try { do() || do_not(); } catch (JediException err) { yoda(err); }
    8. Re:Same old problem... by Anonymous Coward · · Score: 0

      Ask for return/read receipts on important email. As the recipient of potentially important mail, keep an address book as most important messages will come from folks you already know. And if that's not enough, set your filter to send bounce messages instead of just swallowing spam. Your important communique will be retransmitted using alternative means after the sender sees the bounce.

    9. Re:Same old problem... by Anonymous Coward · · Score: 0

      these are my filters:

      1. FREE (case sensitive)
      2. XXX (case sensitive)
      3. $$$
      4. to unsubscribe (case insensitive)
      5. click here (case insensitive)

      unfortunately, spammers are putting more and more (esp. disclaimers) in images, making it hard to filter.

    10. Re:Same old problem... by squiggleslash · · Score: 1
      False positives, in the sense of a specific attempt at a message not getting through, are an absolute inevitability in any anti-spam system, period. My belief is that this can be addressed in two ways (preferably both):
      • Use a system that minimizes the number of false positives
      • Ensure that whatever system you use will result in a rejection getting back to any legitimate mailer
      The second of these two options is probably the most controversial, because it practically implies you make it easy for any spammer to validate your email address, but in practice, if you use a sane enough system for rejecting emails in the first place, that shouldn't be a problem. However, ensuring that, for instance, an email is bounced with a message indicating that it was unable to be delivered gives the sender a chance to get around, on an individual basis, whatever spam trap you've set, whether that involves changing the subject line, going to a web form and sending you the email, using a different email account, etc.

      I've commented in my journal about the system I use which so far has been 99% effective - the odd spam that's gotten through has resulted in a loss of business for the company that sold on the email address, and the hole has been closed so that any other people who were sold the same thing are unable to use it too. The important point though is that there's simply no way an urgent message sent to me will get "dropped". Anyone who sends me such a message and is somehow using a rejected criterion for sending me the email will receive the usual bounce and will have an opportunity to find other means of contacting me.

      Make your spam trap effective. Avoid false-positive criteria. And ensure rejections aren't harmful. Follow those three tenets of anti-spam system design and you can't go wrong.

      --
      You are not alone. This is not normal. None of this is normal.
    11. Re:Same old problem... by MrFredBloggs · · Score: 1

      "That's a good point. The solution is to get less spam. You can do that by changing email addresses frequently (a really inconvenient solution that I don't recommend), or by getting spammers shut down (or yourself listwashed by the spammers)."

      Just use an opt-in system. Either by a Hotmail-style `Exclusive mode` where only emails coming from people in your address book are allowed, or by creating filters/rules, which move valid emails into one or more folders (I use one per person, so I can quickly see who has emailed me, and deal with them in order of priority).

      You can easily ask people you want to email you to put a certain word in the subject or body (a word which would be unlikely to occur in spam), and filter on that to a `new people` folder, and then give them an individual rule/folder once you`ve decided they are ok.

    12. Re:Same old problem... by artemis67 · · Score: 1

      The problem is not the 9,999 messages that you know are going to come from good senders; the problem is the one message that may be coming from an unexpected source that is going to cause you to sift through 50,000 spam emails looking for it.

      That is why filtering fails as a solution.

      You know that email from the headhunter that wanted to double your current pay rate and cut your hours by a third? No you don't, because it got flagged as spam and accidentally deleted.

    13. Re:Same old problem... by Anonymous Coward · · Score: 0

      I think I like your filter approach better than strict whitelisting. I used something similar combined with SpamAssassin (although I'll be switching to Bayesian filtering in the near future) and in spite of receiving 30-40 spam a day, I only really had to deal with a couple of them by hand and most of my normal traffic was properly sorted in the first place.

      The problem we have is not that those of us with some tech know-how can't deal with incoming spam. It's how do we get the majority of Outlook users (which is a huge base) into these things without them having to do anything too complicated? Once we get them and the AOL users to where they're not seeing spam, spam will be less effective (and would therefore hopefully fall off).

    14. Re:Same old problem... by kirkjobsluder · · Score: 1

      Filtering is not a true spam solution. All it takes is for one false positive on a Really Important Email and be accidentally deleted to totally destroy the value of any filtering system.

      Given that, the alternative to having tagged emails automatically deleted is to collect them in a folder and scan the message senders and subject lines. If you're doing that, then the spammer is getting a pitch through to you in the subject line. This therefore does not lessen the incentive for the spammer, but simply causes him to change tactics and put his best pitch in his subject line.


      I guess that this is an interesting question. I keep hearing this argument that filtering is a bad thing because of the risk of false positives. But how is the risk of false positives reduced by removing the filter? Spam filtering for me is a valuable cognitive aid. (One modification to spam assassin would be to put the spam score on the subject line.) I can live with skimming subject lines because many spam models are based on the number of hits from users who buy or click on links in spam.

      I also think that it argues a straw man. I don't read very many comments from people who believe that filtering is "the solution". However, content-based filtering is one valuable tool for sorting through large numbers of messages. By all means we should pursue transport-based and source-based strategies for fighting spam as well. But these have their own problems.

      Finally, if someone wants to cold-call me out of the blue with a Really Important Message, don't they have a responsibility to compose their message without much of the hype, and html text that gets flagged as spam? It would seem that such a cold-call would have no problems getting through as long as they don't make excessive use of all caps, font tags, embedded images, base-64 encoded text, and references to my penis. If it was really important enough to be worth my time, then it probably is not going to have enough spam features to be flagged as spam.

    15. Re:Same old problem... by dlakelan · · Score: 1

      Hah, off topic, but when I first saw it, I thought your sig for the PKD foundation was probably something about Philip K Dick and perhaps was a group looking for a cure for Schizophrenia or some other mental illness.

      I'm sure Polycystic Kidney Disease is bad and it's great that there's a group trying to cure it, but it wasn't as interesting.

      --
      ((lambda (x) (x x)) (lambda (x) (x x))) http://www.endpointcomputing.com a scientific approach to custom computing.
    16. Re:Same old problem... by djmurdoch · · Score: 1

      Whitelisting is a really good solution for some people: those who get email from a relatively fixed and small group of correspondents. But it doesn't work at all well for people who need to be contacted by strangers.

      For example, I'm the Secretary of a non-profit society. I'm also one of the maintainers of an open source project. In both of those roles, I often get email from people I don't know.

      There are schemes for auto-whitelisting people: any mail from someone not in the whitelist triggers a reply requesting a confirmation from the sender. (Spammers never confirm, because they don't see the reply.) I don't think those work very well, because some users are so unsophisticated that they won't respond properly. They also put an extra burden on the sender, and some will just decide not to bother jumping through your hoops.

    17. Re:Same old problem... by statusbar · · Score: 1
      • False positives, in the sense of a specific attempt at a message not getting through, are an absolute inevitability in any anti-spam system, period.

      No... Vipul's Razor works quite well and never gives a false positive! It is not 100% effective though. But it does cut down my incoming spam from 120+ spams a day to 20 or so.

      --jeff++

      --
      ipv6 is my vpn
    18. Re:Same old problem... by zCyl · · Score: 1

      unfortunately, spammers are putting more and more (esp. disclaimers) in images, making it hard to filter.

      So filter for remote-loaded images. If a person wants to send you an image they should attach it or send a link.

    19. Re:Same old problem... by PurpleBob · · Score: 1

      I used to do SpamAssassin's job manually like that too, but the false positive rate got far too high. You never realize what you might catch.

      For example, with that TO or CC restriction, you could never get on a mailing list. Heck, someone with "trivia" in their e-mail address may even send you something important one day.

      --
      Win dain a lotica, en vai tu ri silota
    20. Re:Same old problem... by isorox · · Score: 1

      I filter off email lists I'm subscribed to into appropriate folders before that rule, and as I said, I do check the folders, just not as often as my normal mail. The "Trivia" rule catches "TriviaPOP" spam and some other stuff; it's very unlikely that it will affect anyone apart from joe@trivialpursuit.com, but again I do skim spam subjects every so often. My false positives are almost exclusively mailing lists I've just signed up to and haven't put a rule in place for.

      I never delete email though, not even spam, and everything gets inspected, just in differing orders of priority.

    21. Re:Same old problem... by DaCool42 · · Score: 1

      What if all the "really important" email you get is GPG signed, and your spam filter passes anything that is GPG signed by someone you trust. Then not only is the important email passed, but it can be verified as well. For unsigned messages, regular filtering techniques could be used.

      --

      ----
      All of whose base are belong to the what-now?
  25. Spammers will adjust their tactics by ultrabot · · Score: 5, Interesting

    Obviously it wouldn't be a big problem for the spammers to run their creative gems through gzip, and alter the content until they achieve lower compression ratio. Even including a bunch of garbage after the message might do the trick. I believe equivalent analysis can be done cheaper with non-gzip tools...

    --
    Save your wrists today - switch to Dvorak
    1. Re:Spammers will adjust their tactics by koh · · Score: 1

      Hehehe. Then we would start locating spam by looking for a bunch of garbage somewhere in the message, wouldn't we ?

      So the spammers will get back to send plain content, which will be filtered again using gzip and others, then when they reach their profitability limit they'll switch to garbage, which will be filtered by garbage detectors, then can you say smash the stack ? ;)

      --
      Karma cannot be described by words alone.
    2. Re:Spammers will adjust their tactics by hammy · · Score: 1

      I think you can solve this problem by also storing a gzip file with legitimate emails as well. Try adding the email to both and see whether it gets most compressed by being added to the spam or the legitimate mail file.

      If it's spam the mail should then compress better in the spam zip and hence still be classified as spam.
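
      A minimal sketch of that two-file comparison, assuming Python's zlib module (which uses the same DEFLATE algorithm as gzip) and measuring how many extra compressed bytes the new mail costs when appended to each corpus. The function names and the two tiny corpora are invented for illustration; real corpora would be much larger, and DEFLATE only looks back 32 KB, so only the tail of a big corpus would actually help.

```python
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def extra_bytes(corpus: bytes, message: bytes) -> int:
    """Growth of the compressed corpus when the message is appended."""
    return compressed_size(corpus + b"\n" + message) - compressed_size(corpus)

def classify(message: bytes, spam_corpus: bytes, ham_corpus: bytes) -> str:
    """Pick the corpus that absorbs the message more cheaply; ties count as ham."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"

# Toy corpora, invented for this example.
spam_corpus = (b"Click here to enlarge your fortune. "
               b"Cash advance, no credit check, click here now.\n"
               b"To unsubscribe from more of these offers, click here.")
ham_corpus = (b"Hi, the meeting has moved to Friday at ten. "
              b"Please send the agenda before then.\n"
              b"Thanks for the patch, I will review it over the weekend.")

print(classify(b"Cash advance, no credit check, click here now.",
               spam_corpus, ham_corpus))
```

      Padding a spam with random garbage raises its cost against both corpora, so the comparison is less easily fooled than a single compression ratio.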

  26. Alternative by Dexter77 · · Score: 4, Interesting

    When the spam is filtered at user-account level, you can only do it by parsing a single mail in some way and determining whether it's spam or not. It's like trying to tell whether a movie is bad by looking at one picture. If the spam could be filtered at the server level, by comparing mails that are received into different accounts, you could really tell which ones are part of a mass-mail (spam).

    One problem with this is the right to open other people's mail. But you could use some basic scrambling (rot-13) to make sure that no one sees the inside. It wouldn't make difference to the comparing script.

    Mailing lists might cause a problem too but wouldn't it be easier to allow the mailing lists you belong to than trying to block the ones you don't belong to?
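
    A sketch of the server-side comparison, with one substitution stated plainly: instead of rot-13 it stores a SHA-256 digest of each normalized body, so the comparing script never keeps readable text. The BulkDetector class, its threshold of three accounts and the whitespace/case normalization are all assumptions made for this example; exact digests only catch identical bodies, so personalized spam would need a fuzzier fingerprint.

```python
import hashlib
from collections import defaultdict

class BulkDetector:
    """Flag a message body once it has been delivered to several accounts."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.recipients = defaultdict(set)  # body digest -> accounts that received it

    @staticmethod
    def fingerprint(body: str) -> str:
        # Collapse whitespace and case so trivial variations hash the same.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def observe(self, account: str, body: str) -> bool:
        """Record one delivery; True once the body has reached `threshold` accounts."""
        digest = self.fingerprint(body)
        self.recipients[digest].add(account)
        return len(self.recipients[digest]) >= self.threshold

detector = BulkDetector(threshold=3)
for account in ("alice", "bob", "carol"):
    print(account, detector.observe(account, "Buy now!"))
```

    Mailing lists the user has subscribed to would be exempted before this check, as suggested above.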

    1. Re:Alternative by misof · · Score: 1

      One problem with this is the right to open other people's mail. But you could use some basic scrambling (rot-13) to make sure that no one sees the inside. It wouldn't make difference to the comparing script.

      And why exactly should the rot13 help? If the root of your machine wants to read your (non-encrypted) mails, he does so. Anybody else will still have the same chance to read it, rot13 or not. When the e-mail arrives, a mail daemon takes it and puts it into the appropriate user's mailbox. (Or sends back an error message -- no such user, etc.) The only change will be that this daemon will call another program -- the spam filter. Both of them will run under root privilegies and no user (except for the root) will have a possibility to see your email.

      Notice the word non-encrypted in the previous paragraph. As soon as public-key cryptography becomes more used by general public, there will be some "default" ways to publish your public key. AND there will be no problem for the spammer to obtain your public key automatically with your e-mail address (or maybe to obtain it as soon as he knows your e-mail). When this comes true, server-side spam filtering will become impossible, because the server sees only the encrypted message and has no way to tell whether it's spam or not.

  27. Sequitur Most Likely Superior by Baldrson · · Score: 4, Interesting
    The statistics generated by Sequitur are most likely superior to Gzip.

    As an example of how Sequitur works, the string 'abcabdabcabd' produces the following grammar rules:

    1. 2 c 2 d
    2. a b
    Representing the original string then is the sequence:

    1 1

    The usage counts of the rules are available as output options.
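
    To check that the grammar above really reproduces the string, here is a small sketch that only expands the rules. It does not implement Sequitur's grammar inference, and the function name expand and the dict layout are assumptions made for this example.

```python
def expand(symbols, rules):
    """Recursively replace rule numbers with their right-hand sides."""
    out = []
    for sym in symbols:
        if sym in rules:
            # A rule number: expand its body in place.
            out.append(expand(rules[sym], rules))
        else:
            # A terminal character: keep it as-is.
            out.append(sym)
    return "".join(out)

# The two rules and the top-level sequence from the example above.
rules = {
    1: [2, "c", 2, "d"],
    2: ["a", "b"],
}
sequence = [1, 1]

print(expand(sequence, rules))  # abcabdabcabd
```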

    1. Re:Sequitur Most Likely Superior by mrtroy · · Score: 1

      Actually...it would be interesting if they replaced 'abcabdabcabd' with

      1. 2 2
      2. 3 c 3 d
      3. a b

      And then represent the string with '1'.
      You save half your space! :P
      It's far better when dealing with compression to have more complex rules and smaller strings.

      --
      [I can picture a world without war, without hate. I can picture us attacking that world, because they'd never expect it]
    2. Re:Sequitur Most Likely Superior by A55M0NKEY · · Score: 2, Insightful

      But your rule list is now getting big and still has to be stored. Compression is about minimizing the amount of stuff that has to be stored to recreate the original. It would be nice to have a few simple, very reusable rules that you can use to generate the original with a very few commands.

      --

      Eat at Joe's.

    3. Re:Sequitur Most Likely Superior by mrtroy · · Score: 1

      The whole point of making rules is that they are reusable. So you don't worry about the number of them, they are relatively small in size. Imagine that previous example string repeated itself 1000 times in the whole file...using our rules we would save ourselves 1000 extra characters by making use of one extra rule that might cost us 5 chars at most.

      --
      [I can picture a world without war, without hate. I can picture us attacking that world, because they'd never expect it]
  28. RTFA, jackass by Anonymous Coward · · Score: 0

    It's about using the compression ratio as a measure of similarity between the message and a spam database, and not about redundancy within the message.

  29. Dupes by BESTouff · · Score: 1, Funny

    Do you mean that each time you can find dupes, that's spam ? Oh my god, poor /. ...

  30. in other research by tps12 · · Score: 0

    It's also been determined that the discussion in a typical Slashdot story compresses to less than half of a percent of its original size.

    --

    Karma: Good (despite my invention of the Karma: sig)
  31. Yay! by Anonymous Coward · · Score: 5, Funny

    What an idea!

    I could use this to avoid those people who keep saying the same thing all the time, over and over again...

    Now, how can I convince my mother to use e-mail?

    1. Re:Yay! by Anonymous Coward · · Score: 0

      Compression for parents:

      Byte. . . .Meaning

      0x01. . . .Have you found a girl to settle down with?

      0x02. . . .Are you looking after yourself/cleaning behind your ears/eating properly

      0x03. . . .Why can't you be more like your sibling?

      ... and so on

  32. My spam compression approach by joshv · · Score: 1

    I just use one of those newfangled file compression utilities that you can apply recursively to the compressed output, resulting in any arbitrary degree of compression one desires.

    After at most 10 applications of said compression utility, all emails look like this:
    "1"

    I never see any spam.

    -josh

    1. Re:My spam compression approach by Anonymous Coward · · Score: 0

      Score: 1 (not funny) :-)

  33. What is spam, though? by Big+Mark · · Score: 4, Funny
    The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.
    Ah. I thought to detect really useless, annoying, pointless, bandwidth-sapping and time-consuming email all you had to do was look for "fwd:" in the subject line.

    -Mark
  34. How to stop spam.... by oliverthered · · Score: 3, Informative

    1: Get an email account with unlimited addresses.
    2: when registering use a unique address e.g. slashdot@mydomain.com
    3: Make sure you check/uncheck the give my email address to mailing lists.
    4: If ever you get spam to that account get litigious.

    Use something like mailinglists@mydomain.com, and block anything that doesn't come from mailing lists you've subscribed to.

    --
    thank God the internet isn't a human right.
    1. Re:How to stop spam.... by Jugalator · · Score: 2, Insightful

      Still, you use hotmail (aka "spammer's heaven") here on Slashdot. But thanks for the tip, perhaps we should start trying it out? :-)

      --
      Beware: In C++, your friends can see your privates!
    2. Re:How to stop spam.... by Anonymous Coward · · Score: 0

      Feel free to send as much crap as you want to that hotmail address, I check it once every couple of months. It's the address I use whenever I don't trust a company enough to use a real email address.

    3. Re:How to stop spam.... by Matey-O · · Score: 1

      An 'airtight' hotmail account (One signed up that's not advertised nor given out on USENET or the web) STAYS just as spam free as one from aol or earthlink.

      I've got two hotmail accounts that have been relatively spam free for years.

      I say relatively because you'll still receive spam if they guess [commonfirstname][commonMiddleName][CommonLastname]@msn.com

      Heck, one of 'em's the email I signed up on slashdot with!

      --
      "Draco dormiens nunquam titillandus."
    4. Re:How to stop spam.... by BenV666 · · Score: 1
      Quote:
      Use something like mailinglists@mydomain.com, and block anything that doesn't come from mailing lists you've subscribed to.
      Yes, but these days spammers are also on those mailinglists and wouldn't be blocked...
    5. Re:How to stop spam.... by NoseyNick · · Score: 2, Informative

      I've been doing this for years, and in practice, it just means I get 12 copies of most spams, because they got my address from 12 different places, usually web archives of the mailing-lists.

      You can't refuse mail from non-lists to mailinglists@your.domain, because then nobody can contact you saying "I saw your post on foo-list and was wondering if I could get a copy of foo-prog and if you could tell me how you made it foo bar baz".

      --
      Nick Waterman, Sr Tech Director, #include <stddisclaimer>
    6. Re:How to stop spam.... by Niggle · · Score: 0

      Doesn't work at all...

      1) Put "email address" box on the registration screen.
      2) Put "spam opt-out" box on a screen that can only be accessed after registering. Default it to on.
      3) Sell email address in the gap between registering and first accessing that page.
      4) PROFIT!

      --
      - Blah blah blah, missing scientist. Blah blah blah, atomic bomb. -
    7. Re:How to stop spam.... by pabs · · Score: 1

      Not if you filter the mailing lists as well, like I do. I'm using a pretty standard SpamProbe setup, except I pass all my mail though it (including lkml and debian-devel).

      --

      Odds of being killed by lightning and winning the lottery in the same day: 1 in 2^55

    8. Re:How to stop spam.... by DeadSea · · Score: 4, Interesting

      You need to expand on your step 4.

      When I started getting spam, I wanted it to stop. I realized I couldn't just disable the email address because there might be somebody out there counting on it to contact me. I could disable it and send an autoreply with my current email address, but then spammers would just be able to look at the reply. I needed some solution where people could send me email even if the address they used had been disabled, but spammers wouldn't be able to get my current address. I decided to put a contact form on my website. Now I autorespond to a disabled email address with the contact form url. In addition, I was able to remove email addresses from my website which was a large source of spam.

      Not being able to find a contact form that was secure, I ended up writing my own and releasing it under the GPL. You can get it at http://ostermiller.org/contactform/.

      I also realized that no matter how hard you try, your email address will leak to spammers. Even giving an email address only to your closest friends and family will not prevent it from leaking out. Between the klez virus, gift certificates, invitation, greeting card, and crushlink websites, even my most personal email address leaked to spammers. You can't be afraid to disable an email address and send your friends the new one. Now even if I missed a friend, they can still get a message to me.

    9. Re:How to stop spam.... by deisher · · Score: 1

      This already exists. It's called spamex (www.spamex.com).

      --Mike

    10. Re:How to stop spam.... by Anonymous Coward · · Score: 0

      Ok, I use this method, but it doesn't work very well.

      If you email the people when spam starts from them, they always deny it... You can go for weeks--and I have--in arguments and it is NEVER their fault. They would, of course, never do such a thing.

      Also, I have a serious problem with this tactic and Ebay. Ebay is a huge source of spam for me, and I can't just block that email.

    11. Re:How to stop spam.... by Anonymous Coward · · Score: 0

      If it's a UK company, pull the Data Protection Act. They're screwed: Microsoft hasn't renewed this year, so if they have any of your personal data, make a complaint and screw them.

    12. Re:How to stop spam.... by Anonymous Coward · · Score: 0
      An 'airtight' hotmail account (One signed up that's not advertised nor given out on USENET or the web) STAYS just as spam free as one from aol or earthlink.

      Which is much like saying that a well-maintained Daewoo is just as reliable and fun to drive as a Yugo.

      Since my girlfriend whined that I didn't IM her on MSN, I got a Hotmail account and installed KMerlin. Almost immediately after starting the account, I had two spams.

    13. Re:How to stop spam.... by Phroggy · · Score: 1

      Wow, you've almost got it but not quite. Don't get litigious. Instead:

      Create a new e-mail address, notify the legitimate organization of the change, and deactivate the old address.

      Why? Because slashdot@phroggy.com gets spam (or rather, it would, if that address still worked - instead the mail server just logs "user unknown" errors). Are you suggesting I should sue Slashdot for this? Slashdot did nothing wrong, but before they started obfuscating e-mail addresses automatically, I used that address here. Evil spiders crawl the web (including Slashdot) and harvest e-mail addresses. There's not much you can do about this, except change e-mail addresses when it becomes a problem. If you use a different address for everybody, then it's easier to keep track of who you have to notify of the new address.

      Unless you meant sue the spammers. Sure, go for it, but you don't have to use this method for that.

      --
      $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
      $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
  35. Just use a string entropy calculation algorithm... by Domini · · Score: 3, Interesting

    It's inefficient to have so much memory overhead.

    Besides, if I were a spammer, I could pad it with high entropy data at the end to make up for my warbling.
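
    For reference, a rough sketch of a byte-level entropy calculation, and of how padding shifts it. This is the plain order-0 Shannon entropy (it only looks at byte frequencies, so unlike gzip it does not notice repeated phrases); the function name and the sample data are assumptions for illustration.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte, computed from byte frequencies alone."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Invented example: a repetitive pitch, then the same pitch with padding
# that contains every possible byte value once.
pitch = b"buy now " * 50
padded = pitch + bytes(range(256))

print(shannon_entropy(pitch))
print(shannon_entropy(padded))
```

    The padded version scores higher, which is exactly the trick described above: the measure is cheap, but also cheap to game.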

  36. This happened to a friend of mine by pommiekiwifruit · · Score: 1
    He actually won a nice car from a contest he entered with an internet company! I saw his picture sitting in the car in a dead-tree magazine.

    The only slight problem was that he doesn't drive :-)

  37. Compression algorithms as filters... by Jugalator · · Score: 4, Insightful

    .. sounds like a poor idea to me. Yes, you can measure the amount of redundancy in a message, but:

    a) Spammers might not always use messages redundant enough to be detectable from regular text.

    b) If I happened to use some words a little too often, especially when writing mails discussing technical stuff or posting computer code fragments, would that be classified as spam?

    I think this is a nice filter when sorting out more or less repetitive mails (spam or not) from novels, but a filter based on a spam database sounds better to me.

    --
    Beware: In C++, your friends can see your privates!
  38. High CPU Utilization by Anonymous Coward · · Score: 0

    Wouldn't the CPU resources consumed by this process make it useless in the real world? I can't imagine compressing all of our incoming mail just to check if it's spam or not, the CPU usage would skyrocket (and it's already high with all the av and routing filters that we have). Right now we're using RBL filtering along with some content based filters and catching 99.9% which is a whole lot.

  39. Moron by ^BR · · Score: 1

    Another moron that didn't read the article.

    The proposal is not to see how compressible is the message but to use a compression tool to see how lookalike the message is to a corpus of spam.

    1. Re:Moron by ultrabot · · Score: 4, Interesting

      Another moron that didn't read the article.

      I actually read the article.

      The proposal is not to see how compressible is the message but to use a compression tool to see how lookalike the message is to a corpus of spam.

      Yes, and compression ratio is used to determine this.

      --
      Save your wrists today - switch to Dvorak
    2. Re:Moron by PetWolverine · · Score: 1

      This is remarkably similar to your response to Pilferer's comment.

      You know, if you read the article, and find that everyone else disagrees with you on how to interpret it, that probably isn't a sign that you're the only one who read it. It's more likely a sign that you misinterpreted it and should take another look at it.

      A mind is like a parachute; it only functions when it's open.

      --
      I found the meaning of life the other day, but I had write-only access.
  40. Re:Text of the full article by Anonymous Coward · · Score: 2, Interesting

    Reminds me of a program I helped with for a short time in college called "Siff" (ftp://ftp.cs.arizona.edu/reports/1993/TR93-33.ps) , which would find similar files by taking small fingerprints (32-bit hashes) of 50 byte sequences and finding groups of files that shared a lot of them. It works surprisingly well, even when the files were modified extensively.
    I've often thought since that large mailhubs (yahoo, hotmail, etc) could automatically filter junk mail efficiently by a similar method, perhaps by limiting the delivery rate/fingerprint or just flagging high-occurrence hashes as suspect (and then rating each mail by how many of its fingerprints are among this group, too many without an ADV: or bulk-mail tag would cause a mail to be marked as SPAM).
    I wonder if it'd be possible to have a network of smaller hubs accomplish the same thing, perhaps even using an encrypting checksum instead of a simple hash so that individuals could contribute without anyone being able to recreate their original messages?
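
    A rough sketch of the fingerprint idea, not the actual program: it hashes every 50-byte window with CRC-32 (chosen here only because it is a handy 32-bit hash in the standard library) and keeps all of them, whereas a real tool would presumably keep a sample. The function names and the messages are invented.

```python
import zlib

WINDOW = 50  # bytes per fingerprinted sequence, as described above

def fingerprints(data: bytes, window: int = WINDOW) -> set:
    """Return the 32-bit CRC of every `window`-byte substring of `data`."""
    if len(data) <= window:
        return {zlib.crc32(data)}
    return {zlib.crc32(data[i:i + window]) for i in range(len(data) - window + 1)}

def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of the smaller fingerprint set that also appears in the other."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / min(len(fa), len(fb))

# Invented sample messages: `edited` is `original` with one phrase changed.
original = (b"Dear customer, our records show that you qualify for a special rate. "
            b"Reply today to lock in your low monthly payment before this offer ends. "
            b"This message was sent to you because you asked for partner offers.")
edited = original.replace(b"Reply today", b"Answer right now")
unrelated = (b"Hi all, the build is broken again on the nightly branch. "
             b"I bisected it to the parser change from Thursday; patch attached.")

print(shared_fraction(original, edited))
print(shared_fraction(original, unrelated))
```

    A mailhub could count how often each fingerprint shows up across accounts and treat mail made mostly of very common fingerprints as suspect.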

  41. I can't figure this out... by shivianzealot · · Score: 4, Interesting

    A couple of posts above state that spammers will "just adjust their tactics." Talk like this always puzzles me; on the spammer's side, does this not help him? If I'm selling a combination weight loss drug/mail order bride/penis enlarger/cable descrambler for only three payments of $49.99 in such a manner that every spam blocker in the world filters me, logically I'm only being filtered by people who know better than to buy my "product," thus not irritating them, in effect helping to slow regulation, and I don't lose touch with any significant chunk of my target demographic. Of course, this applies with the exception of corporate environments or similar situations where Joe Insecure has someone else managing spam.

    Can anyone share some +5 Insight on the matter?

    --

    Bored with karma, be a fan/freak

    1. Re:I can't figure this out... by Motherfucking+Shit · · Score: 4, Insightful
      If I'm selling a combination weight loss drug/mail order bride/penis enlarger/cable descrambler for only three payments of $49.99 in such a manner that every spam blocker in the world filters me, logically I'm only being filtered by people who know better than to buy my "product," thus not irritating them, in effect helping to slow regulation, and I don't lose touch with any significant chunk of my target demographic.
      This would make sense if the only people implementing spam filters were end users. Unfortunately, the logic breaks down when you consider that some ISPs do the filtering on behalf of their customers. It breaks down further when you factor in the number of situations in which a) the customer might not even know that the filtering is happening, or b) the customer blindly trusts the ISP's filtering system.

      Take Yahoo, for example. They're a popular webmail service and they also do spam filtering to some extent on inbound email. I would say that, in general, people who use Yahoo mail are not necessarily the type of people who "know better" than to buy spamvertised products. That's not a slam on Yahoo, nor on the people who use Yahoo mail, it's just the way the demographics work out. The ratio of ripe targets to clued-in antispammers is simply better at Yahoo than it is on other domains.

      To that end, Yahoo's spam filters aren't helping the spammers any. A spammer's goal is to get his ad in front of as many potential targets as possible, and Yahoo is full of potential targets. But if Yahoo's filters catch the spammer's message and route it straight to everyone's Bulk Mail folder, there's (thousands|millions) of "targets" who will never see the message.

      So no, I can't agree that filtering helps the spammers any, at least not the big spammers who are after volume. There's probably a bit of "collateral assistance" in that people who would report the spam may never see it, but I'd say that benefit is cancelled out by the number of possible targets lost to filters.
      --
      "BSD: Free as in speech. Linux: Free as in beer. Windows 10: Free as in herpes." --Man On Pink Corner in #52607549.
    2. Re:I can't figure this out... by stilwebm · · Score: 2, Insightful

      It's true that the sellers want that. However, you may have noticed spammers are not always the sellers. The seller is looking for someone to do some "email marketing" for them. They are looking for wide coverage. They want to see things like "your email can be sent to 30 million unique email addresses," which means a few million that might get through, a few thousand that will actually get read, and maybe a few purchases. Spammers are just creepy marketers who want to make it sound like emailing as many people as possible is better, and should cost the seller more. Since they use open relays and random forged "From" email addresses, they never see what email gets blocked. Using images in HTML email they can get an idea of how many emails were read (this is why you should turn off images in email). While the spammer makes a commission on every sold item, they also make money selling lists and marketing services.

      The numbers are part of their pissing contest, and the pool is your inbox. Spammers are not that bright, but their customers are much, much more stupid.

    3. Re:I can't figure this out... by buss_error · · Score: 1
      Can anyone share some +5 Insight on the matter?

      Rule 1: Spammers lie.

      Rule 2: Spammers are stupid. Not to say they are not cunning, but stupid.

      Rule 3: If you think a spammer is telling you the truth, see Rule 1.

      Rule 4: Spammers will stop when they can't make money fast! spamming.

      --
      Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
  42. Stopping Spam by Inflatable+Hippo · · Score: 4, Insightful

    > stupid filtering isnt gonna get you rid of spam... go complain at spammers upstream providers...

    Filters only work to a limited extent, and so might shutting down the spammers, if it were possible.

    But neither is going to solve this problem.

    The only solution I can think of is wide-spread adoption of PGP (or equivalent) aware mailers and certification of mail.

    The problem with mail addresses is that you have no control over their spread. If I give one to a company it'll usually leak out in the end and it's open season on my inbox.

    However if "genuine" mail is certified and mailers use certification validity as a filtering criterion then it simplifies the game hugely.

    Your mailer can spot the people you've genuinely given your address to, and naturally "distrust" uncertified (effectively anonymous) mail or mail whose certificate has been revoked or is unknown to you.

    The "only" things standing in the way of this are:

    1. Slow adoption of certification/encryption in mass market mailers. Usually poor or missing.
    2. Cost/difficulty of getting a valid certificate (e.g. with Verisign).
    3. The pain of typing a password every time you send a mail.
    4. It only works if everyone joins in.

    But nothing's for free and this strikes at the heart of email's usability.

    I'm continually surprised by the lack of certification use at least by large corporations and governments, but I suppose it removes plausible deniability :-)

    1. Re:Stopping Spam by iamchris · · Score: 2, Insightful

      Think about this: Why do I get 1000's of spam emails per month and I get 10's of pieces of junk snail mail/month? Simple: It costs nearly nothing to send millions of spam messages, while it costs a bundle to send junk snail mail.

      A simple solution would be to find a way to charge per email...

      Now, I certainly wouldn't pay per email. But, I shouldn't complain when someone abuses a messaging system that allows millions of messages to be sent out for nearly no cost. I use that system too, on a much smaller scale, for personal and legitimate business use.

      All I can do is ignore as much of the mail as I can, and BOYCOTT anything that is sold via spam.

      Ag.

    2. Re:Stopping Spam by misof · · Score: 2, Interesting

      The only solution I can think of is wide-spread adoption of PGP (or equivalent) aware mailers and certification of mail.

      I have to discourage your optimism a bit. IF the public-key encryption ever finds its way to the general public (I hope and think so), there are two possibilities:
      a) Your public key will be available for the general public -- this is how it will probably work. If someone wants to send you an e-mail, he obtains your public key in a trusted way (e.g. from a trusted key server), encrypts the message and sends it. If the spammer wants to send you spam, once he gets your e-mail address, he does exactly the same. Obtains your public key, encrypts the spam and sends it. The only difference with today's situation: it will be impossible to filter spam on the server side (only to block some spamming IP addresses, no server-side spam filters).
      b) You give your public key only to your friends you trust. This is exactly the approach "everything coming from an address, that's not in my address book, has to be spam." and even contradicts the basic idea: it's your public key...

    3. Re:Stopping Spam by Maengden · · Score: 1

      There's one big "gotcha" that you've overlooked: If certs were successfully adopted widely enough (obviously quite a big *if*, as already noted), then the spamming would become less attractive.

      The next step is either a) spammers figure out how to get into your mailbox anyway or b) spam goes down, then cert use goes down, raising spam use...until an equilibrium is reached.

      I suspect that in the end, certs' usefulness would therefore be self-limiting. Alternatively, you could see this as a way to keep spam at an "acceptable" (equilibrium) level.

    4. Re:Stopping Spam by dlakelan · · Score: 1

      He wasn't talking about encryption as much as signing.

      Imagine that your emailer fires off a bounce with instructions to everyone who sends you an email that isn't SIGNED.

      If you get a SIGNED email then you can filter it into ones signed by someone you know and ones signed by someone you don't know.

      If spammers start signing their emails they are much easier to target legally.

      --
      ((lambda (x) (x x)) (lambda (x) (x x))) http://www.endpointcomputing.com a scientific approach to custom computing.
  43. Well... by ptrangerv8 · · Score: 0

    This will get modded down, but here goes: In Soviet Russia, gzip spams j00! As said above, this is interesting, but not particularly precise... The only way to stop spam is to prosecute the spammers. Just as the anti-fax-spam legislation did, we need the same principle applied to email... I'm not a supporter of more laws and more legal BS, but I think in this case it's an acceptable trade-off ;beer;

  44. Spam compression testing by Alien+Being · · Score: 0

    fits_in_little_blue_can ? "spam" : "ham"

  45. That's not how it's done by marx · · Score: 1
    The compression is not used to determine the redundancy in a single message, but to determine how similar the message is to a large body of known spam or ham messages.

    So if the message compresses very well together with spam, then it's similar to spam, and if it doesn't, then it's not similar to spam.

    1. Re:That's not how it's done by Anonymous Coward · · Score: 0

      Ah, I see... My bad :-/

      Dang, mod me down dammit. :-)

  46. Re:Text of the full article by compling · · Score: 1

    while it is true that bayesian filtering is not heuristic, it isn't all that sophisticated either. there are better statistical modelling approaches, but i guess the bayesian will still give quite good results.

  47. Medical community figured this out years ago. by Anonymous Coward · · Score: 1, Interesting
    I took some Medical Engineering courses. One interesting thing they talked about was looking at the rate of false positives and false negatives of a test, the cost of each failure, as well as the cost of the test as a whole. There were quantitative values called Sensitivity and Specificity that were calculated, but I can't remember the exact ratios.

    Once you had the information, you could adjust the threshold of the test for optimal results, and figure out which tests were the best value.

    In any case the result is that you end up with screening tests that have a lot of false positives, backed up by more expensive tests applied to all the positives to find the real problems.

    You could do the same thing with spam. You'd need to assign a cost to the false positives (missing the job offer), and the false negatives (deleting spam that passed the filter), and adjust the filter accordingly. (Assuming the cost of the tests, in cpu, are negligible, which is different from the medical example.)
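
    For the record, with "spam" as the positive result, sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). Below is a small sketch of picking the threshold that minimizes total cost; the scores, labels, cost figures and function names are all invented for illustration.

```python
def confusion(scores, labels, threshold):
    """Count outcomes when every score >= threshold is flagged as spam."""
    tp = fp = tn = fn = 0
    for score, is_spam in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_spam:
            tp += 1          # spam correctly caught
        elif flagged:
            fp += 1          # real mail wrongly flagged (the lost job offer)
        elif is_spam:
            fn += 1          # spam that slipped through
        else:
            tn += 1          # real mail correctly delivered
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def best_threshold(scores, labels, cost_lost_mail, cost_missed_spam):
    """Try each observed score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]

    def cost(threshold):
        _, fp, _, fn = confusion(scores, labels, threshold)
        return fp * cost_lost_mail + fn * cost_missed_spam

    return min(candidates, key=cost)

# Made-up filter scores and ground truth (True means the mail was spam).
scores = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
labels = [False, False, True, False, True, True]

print(best_threshold(scores, labels, cost_lost_mail=100, cost_missed_spam=1))  # 0.7
print(best_threshold(scores, labels, cost_lost_mail=1, cost_missed_spam=100))  # 0.4
```

    When losing real mail is the expensive failure, the cheapest threshold moves up and more spam gets through, which is the screening-test trade-off in miniature.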

    -- ac at work

  48. RBL by Penguinoflight · · Score: 4, Interesting

    RBL blocks a lot of stuff that isn't spam. It's probably a better idea to go with bayesian filtering. You can read up on it here: http://www.paulgraham.com/better.html

    --
    "And we have seen and do testify that the Father sent the Son to be the Savior of the World"
    1 John 4:14
  49. Email to my girlfriend by FroBugg · · Score: 4, Funny

    Unfortunately, using this my girlfriend would never get any of my emails.

    "I'm sorry. Really, really, really, really sorry. I'm so very, very, very sorry. I'm sorry..."

  50. Better spam filtering.... by JollyFinn · · Score: 1

    Have two folders,
    1->check all the time, from ONLY those who I accept in my list.
    2->rest of the stuff. Spam+Unknown senders...
    Now have quick graphical interface selecting which is spam and which is not.
    3->not spam add list.
    4->spam. Add address for filtering 2nd folder.

    Add spam message for VERY liberal filter.
    (If message is almost exact with a previously received spam ignore it.)

    --
    Emacs is good operating system, but it has one flaw: Its text editor could be better.
  51. Re:Text of the full article by Hal-9001 · · Score: 4, Informative

    The scheme described in the article is not Bayesian at all. It's more like a very crude hash comparison. If two similar messages are concatenated, they should compress very well. If two dissimilar messages are concatenated, they will not compress as well.

    An actual Bayesian filter would perform a statistical analysis of an existing body of spam and non-spam messages, identify key words or phrases that identify a message as spam or non-spam, and calculate the probability for every key word that a message containing that word is spam. Then every new message is classified as spam or non-spam by running a statistical analysis on its content, and the statistics of that message update and improve the probability model.
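
    A toy sketch of the kind of filter described above, in naive-Bayes style. It is not any particular product's formula: the class name, the add-one smoothing, the equal weighting of spam and non-spam, and the training messages are all assumptions made for illustration.

```python
import math
from collections import Counter

class TinyBayes:
    def __init__(self):
        # Per class: in how many messages each word appeared, and message totals.
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Fold one labelled message into the word statistics."""
        self.messages[label] += 1
        self.words[label].update(set(text.lower().split()))

    def word_spam_probability(self, word: str) -> float:
        """Probability that a message containing `word` is spam (add-one smoothed)."""
        in_spam = (self.words["spam"][word] + 1) / (self.messages["spam"] + 2)
        in_ham = (self.words["ham"][word] + 1) / (self.messages["ham"] + 2)
        return in_spam / (in_spam + in_ham)

    def spam_probability(self, text: str) -> float:
        """Combine the per-word probabilities, treating words as independent."""
        log_odds = 0.0
        for word in set(text.lower().split()):
            p = self.word_spam_probability(word)
            log_odds += math.log(p) - math.log(1 - p)
        return 1 / (1 + math.exp(-log_odds))

# Invented training data.
model = TinyBayes()
model.train("cheap pills buy now", "spam")
model.train("buy cheap watches now", "spam")
model.train("meeting notes attached", "ham")
model.train("lunch at noon tomorrow", "ham")

print(model.spam_probability("buy cheap pills"))
print(model.spam_probability("meeting at noon"))
```

    Calling train() on each newly classified message is the "update and improve the probability model" step.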

    --
    "It take 9 months to bear a child, no matter how many women you assign to the job."
  52. Yeeeaaahhhh.... by ryman · · Score: 1

    ...You see, it's just that we're putting the new cover sheets on all TPS reports from now on, so if you could just go ahead and do that for me, that would be great. And I'll make sure you get another copy of that memo.

    --
    "We are far too easily pleased." --C.S. Lewis
  53. Spammers just found another loophole.. by SystematicPsycho · · Score: 4, Interesting

    I received a nice piece of spam the other day. I didn't read it but I usually scroll to the bottom to see if they have the mandatory (in some places mandatory I think) unsubscribe method. This method sure gets me mad -

    To unsubscribe by postal mail, please send your request to:
    P.O Box 272521
    Boca Raton, FL 33427
    Ref # XXXXXX -- scd

    (XXXX.. replaced real reference number)

    It seems that the unsubscription method doesn't have to be by email - just as long as it's by something and it's there. They mustn't be specific in the law. Of course, no one is going to go write a letter by snail mail to unsubscribe to spam, although sending them some dog shit through the mail is tempting. I forgot the site that provides that service. Hrmm I should change my sig.

    --
    Analytic & algebraic topology of locally Euclidean meterization of infinitely differentiable Riemmanian manifold
  54. 32k Window... by pridkett · · Score: 3, Informative

    The fact is, that unless your SPAM corpus and HAM corpus are both under 32k, this won't work. Gzip is fast because it only has a 32k sliding window, meaning that it only searches for like strings in a 32k window around what you're currently compressing. Hate to break it to you, but 32k is not enough for a corpus. I think Bzip2 uses something larger (900k?), but I forget what it is.
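
    A quick way to see the window effect for yourself, sketched with Python's zlib (same DEFLATE algorithm, same 32k window). It repeats a block of random bytes after a gap of filler and measures what the repeat costs; the helper names and the sizes are arbitrary choices for this demo.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

def noise(n: int) -> bytes:
    """Return n pseudo-random bytes, which are essentially incompressible."""
    return bytes(rng.getrandbits(8) for _ in range(n))

def cost_of_repeat(gap: int) -> int:
    """Extra compressed bytes needed to repeat a 1000-byte block after `gap` bytes."""
    block = noise(1000)
    filler = noise(gap)
    without = len(zlib.compress(block + filler, 9))
    with_repeat = len(zlib.compress(block + filler + block, 9))
    return with_repeat - without

near = cost_of_repeat(1_000)     # first copy is still inside the 32k window
far = cost_of_repeat(100_000)    # first copy has slid out of the window

print("repeat after 1 KB of filler costs", near, "bytes")
print("repeat after 100 KB of filler costs", far, "bytes")
```

    In other words, when a message is appended to a big corpus, only the last 32k or so of that corpus can actually help compress it.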

    I'll be happy with spam assassin until I get CRM114 (and mailfilter) trained and working.

    --
    My Slashdot account is old enough to drink...
    1. Re:32k Window... by Vann_v2 · · Score: 1

      Look here.

  55. Similar article on heise was published a year ago by hanzwurst · · Score: 2, Informative

    German newsticker heise had a similar article a year ago, although it does not cover spam explicitly.
    The article has a link to another article published in "Physical Review Letters" which deals with the topic of identifying content/author by applying compression algorithms.
    The underlying idea is that LZ77 compressed data is near to the entropy of a message.

  56. Re:Text of the full article by NoseyNick · · Score: 2, Informative

    > The current fad is in fact Bayesian filtering, sophisticated statistical analysis.

    Bayesian filtering IS word-counting with (not very sophisticated) statistical heuristics applied to the results.

    --
    Nick Waterman, Sr Tech Director, #include <stddisclaimer>
  57. Re:WTF? by CheeseburgerBlue · · Score: 1

    This is just a blatant copyright violation.

    I smell a DMCA suit -- somebody go wake up Rusty's lawyers and throw some clothes on them: it's time for court. Fight of the decade!

    "In this corna: benevolent dictator of community-edited superblog Kuro5hin, Rusty rex! In this corna! the gay robots from Slashdot!"

  58. Re:Just use a string entropy calculation algorithm by a2800276 · · Score: 1

    If I were a spammer, I couldn't care less if some nerd using string entropy calculation filters out my spam, because said nerd using weird home grown filtering is also more likely to a.) not reply anyway b.) submit my open relays to blackhole lists c.) complain to my ISP etc. etc.

    If I were a spammer I'd concentrate more on trying to get average users to open my mail even though they've learned that Cindy's "Haven't seen you in ages, JOE23" Emails aren't real. And how to circumvent whatever anti-spam measures come installed in JOE23's AOL software.

    Anyways, some geek in his dorm room is not likely to have enough money to buy penis prosthetics anyway and can also figure out how to jerk off to free thumbnail-pics.

    If spammers started padding their mail with high entropy data I would set up a filter that filters out mails based on how close the character recognition is to standard English HTML-formatted mails, and discards random junk.

    But then spammers would start not just using high entropy material from /dev/srandom (really nerdy spammers themselves, who know not to trust /dev/random) but generating random characters with similar characteristics as English.

    Then the antispammer would have to use fuzzy-logic spell-checking and the spammer would have to start using random words out of the dictionary and finally spammers would be left with no other option than to send me really nice personalized eCards that say "Happy Birthday!" with a little singing chicken, because I haven't found a way to filter those yet. I can only filter spam with mammals.

  59. Sorry, that's not right by martin-boundary · · Score: 5, Interesting
    Only naive Bayesian models are 0-order Markov. The "naive" refers precisely to the zero order independence assumption. You can have 1-order, 2-order, n-th order Bayesian models if you like. Those are called n-gram models. After that, you can have Bayesian phrase based models if you like, or paragraph based also.

    Bayesian only refers to how you use the probabilities.

    Now gzip implements similar ideas to LZW compression, which uses variable sized prefixes, which is quite different from a 1-order Markov model. For example, an order-1 Markov model is not allowed to depend on more than the current and immediately preceding symbol.

    1. Re:Sorry, that's not right by Synonymous+Soured · · Score: 2, Informative

      Pre-coffee fog. Sorry. Typing got ahead of brain. Tripped up confounding the words-as-symbols/bytes-as-symbols distinction with the model markovity.

      You are correct about the order-1 assertion. That should indeed have been order-N, where N is the length of the longest prefix string maintained explicitly or implicitly by a Ziv-Lempel dictionary or backpointer set. The Ziv-Lempel engines can be regarded as using shortened N-grams to represent classes of longer, yet-unseen N-grams; and they do use Markov models, where the stationary and transition probabilities are all set equal. In these cases, the probabilities only count for being zero or non-zero.

      A "Bayesian Spam Filter" is order-0 if it relies only on token frequencies, where the tokens are complete strings, and not conditional occurrences of word pairs. The assertion is that a spam filter mechanism would be improved if it relied on a higher-order underlying model, and if the symbols were taken to be bytes and not words. The probability of a string is thus the product of the probabilities of its symbol sequence under the order-N model. But any higher-order model, even one using within-message word digrams or trigrams, would probably be an improvement.
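
      The byte-level order-N idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, the add-one smoothing over 256 byte values, the zero-byte padding and the toy training strings are all made up here, not taken from any existing filter. The log-probability of a message is the sum of the log-probabilities of its bytes given the preceding N bytes, and the message goes to whichever model scores it higher.

```python
import math
from collections import defaultdict


class ByteMarkovModel:
    """Order-N model over bytes with add-one smoothing (illustrative only)."""

    def __init__(self, order=2):
        self.order = order
        # counts[context][next_byte] = times next_byte followed context
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, data: bytes) -> None:
        padded = b"\x00" * self.order + data
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.counts[context][padded[i]] += 1
            self.totals[context] += 1

    def log_prob(self, data: bytes) -> float:
        """Sum of per-byte log-probabilities, i.e. log of their product."""
        padded = b"\x00" * self.order + data
        total = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.counts.get(context, {}).get(padded[i], 0)
            # Add-one smoothing over the 256 possible byte values
            total += math.log((seen + 1) / (self.totals.get(context, 0) + 256))
        return total


def classify(message: bytes, spam_model, ham_model) -> str:
    """Pick the model under which the message is more probable."""
    if spam_model.log_prob(message) > ham_model.log_prob(message):
        return "spam"
    return "ham"


spam_model, ham_model = ByteMarkovModel(), ByteMarkovModel()
spam_model.train(b"buy cheap pills now, click here for cheap pills")
ham_model.train(b"the meeting is moved to tuesday, please bring the report")
print(classify(b"cheap pills, click now", spam_model, ham_model))
```

      A word-digram model would look the same with tokens in place of bytes; the byte version simply needs no tokenizer.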

  60. Even Better by HereAllNight · · Score: 2, Informative

    Who needs all of these complicated schemes? I just filter the sending domains as they come. Filter every sender containing "specials", "optin", "offer", "special", "deal", "email", "reward", "value", "promotion" and "super", and all subject lines starting with "friend", and 85% is taken care of right away. So far my formula has had no false positives.

    1. Re:Even Better by Anonymous Coward · · Score: 0

      I have one for "penis enlarger" and that takes care of about 95% of it.

  61. Re:Just use a string entropy calculation algorithm by hammy · · Score: 1

    To counteract that you could also create a second zip containing legitimate emails. The spam mails (even with randomness) should compress better in the Spam zip than the other....
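
    A minimal sketch of the two-corpus comparison, assuming Python's zlib as a stand-in for gzip and tiny made-up corpora (the function names and sample texts are hypothetical). The message is appended to each corpus in turn, and whichever corpus grows less when compressed decides the label.

```python
import zlib


def extra_bytes(corpus: bytes, message: bytes) -> int:
    """Bytes added to the compressed corpus when `message` is appended."""
    with_message = len(zlib.compress(corpus + message, 9))
    return with_message - len(zlib.compress(corpus, 9))


def classify(message: bytes, spam_corpus: bytes, ham_corpus: bytes) -> str:
    """Label the message after the corpus it compresses better against."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy corpora, invented for illustration; real ones would be much larger.
SPAM = (
    b"Limited time offer! Click here to claim your free prize now. "
    b"Buy cheap pills online, no prescription needed. "
    b"Make money fast working from home, click here today. "
    b"You have been selected to win a free vacation, act now. "
)
HAM = (
    b"Hi Bob, the meeting has moved to Tuesday at ten o'clock. "
    b"Please bring the quarterly report and the budget notes. "
    b"Thanks for the patch, I will review it over the weekend. "
    b"Can you send me the slides from yesterday's talk? "
)

msg = b"Click here to claim your free prize! Buy cheap pills online, act now."
print(classify(msg, SPAM, HAM))
```

    Random padding added by a spammer costs about the same against both corpora, so it should largely cancel out in the comparison, which is the point made above.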

  62. Did you read the f*cking article? by ^BR · · Score: 1

    This is exactly the point that it makes.

  63. Yes please! by CoolVibe · · Score: 1
    If the marketing goons would have to write properly punctuated, nicely formatted mails to reach me, instead of that all UPPER CASE or all lower case overhyped brainless repeated drivel they usually pelt me with, I say sign me up!

    ;-)

  64. Disposable Email addresses by KMitchell · · Score: 1
    The problem with mail addresses is that you have no control over their spread. If I give one to a company it'll usually leak out in the end and it's open season on my inbox.


    I came to this realization driving home from work one night. My immediate follow-up thought was, why not make email addresses disposable, with a nice automated interface to control which ones will fwd to your "real" mailbox? I had worked out a rough framework for how I'd implement this at a site-wide level by the time I got home, only to discover that I wasn't the first one to come up with the idea. A quick google search on "disposable email address" found about half a dozen services that do (more or less) what I'd hashed out.


    Doesn't solve everything, but it does give you a lot more control when choosing what to put in the "email" form when you buy something online :)

  65. Re:Just use a string entropy calculation algorithm by Anonymous Coward · · Score: 0

    But then spammers would start not just using high entropy material from /dev/srandom (really nerdy spammers themselves, who know not to trust /dev/random)

    d00d! Where can I download /dev/srandom? I only have /dev/urandom!

    BTW, /dev/urandom is the device that (historically) wouldn't wait for additional entropy. /dev/random is the "more random" one. Nowadays they are both essentially the same on all but the most archaic operating systems.

  66. Re:Just use a string entropy calculation algorithm by Domini · · Score: 1

    Agreed that this is not the best way to filter spam... it is fraught with peril.

    What I was suggesting is that ISPs actually employ these methods... thus the average user will not even know they were spammed. (Most ISPs employ a troop of Geeks who know how to do:
    "strings /dev/random")

    Personally I prefer an active approach (such as ASK), and preferably the one with the features that has a minimal impact on legitimate users. I still receive about 30 spam mails a day, but with a combination of my ISP's anti-spam system and my active spam protection, I see only about 1 every month.

  67. Yawn -- read your papers by Anonymous Coward · · Score: 4, Informative

    There was a paper published in PRL a couple of years ago that wanted to identify languages using gzip (Benedetto et al: Language Trees and Zipping). It sure sounded cool, but was quickly forgotten when Joshua Goodman took a closer look (link is down at the moment, probably IIS, Text version in Google Cache).

  68. Where to read about Markov models etc by A55M0NKEY · · Score: 1

    Where's a good easy to digest description of this? It's pretty interesting.

    --

    Eat at Joe's.

    1. Re:Where to read about Markov models etc by Hast · · Score: 1

      Just look for info on Markov processes, Markov chains and stochastic processes. You might want to look up some info on general statistics as well. Unfortunately all literature I've read has been in Swedish, so I can't really give any useful hints there.

      Basically a Markov/stochastic process is a group of probabilities. E.g., ordinary probability deals with the probability of one person making a phone call or not. Markov/stochastic processes deal with a thousand people doing something.

      M/S processes are also found a lot in queue theory.

  69. Correction by misof · · Score: 2, Insightful

    The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.

    There is a minor problem with this sentence. And with this whole gzip business. It is misleading. Words, phrases? You cannot force gzip to match words; gzip tries to exploit every likeness it finds, even at the character level. E.g., if your "spam dictionary" contains the words sex and pants, mail about sextants will have a good compression ratio. And there is no way to prevent this. That's why the Bayesian filters (operating on words) outperform gzip by a league. That's one of several reasons why I think this article belongs not on /. but in a wastebin instead. It simply presents a worse approach to doing something. Interesting idea, yes, but that's all.

    (Just FYI: it has been proved that the bzip2 algorithm due to Burrows and Wheeler exploits all such repetitions in the input file nearly optimally -- within some small ratio. Hence, it is even worse to use it as a spam filter :-)

    1. Re:Correction by kirkjobsluder · · Score: 1

      There is a minor problem with this sentence. And with this whole gzip business. It is misleading. Words, phrases? You cannot force gzip to match words; gzip tries to exploit every likeness it finds, even at the character level. E.g., if your "spam dictionary" contains the words sex and pants, mail about sextants will have a good compression ratio.

      True, but occasional spam-ham matches are a feature of Bayesian filters as well. The point is not the occasional match, but whether a text is statistically more similar to spam than ham.

      But I would argue that working at the character level or extended phrase level may offer some major advantages. A large part of my spam is formatted as whitespace indented html tags. This is a stylistic trait that would appear to be a very strong diagnostic of spam along with href="http:// and img src="http://. Word-based filters split both of these up into separate tokens, biasing the results somewhat. For some people the ability to filter tokens embedded in base-64 encoded messages can also be useful.

  70. Repost? by fulldecent · · Score: 4, Interesting
    This post looks like it came from my previous reply on a way to detect entropy (non-repetitious content) in P2P files

    Here is a code snippet from the comment:

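    #!/bin/bash
    # Entropic analysis by Full Decent
    # Usage: pass the file to analyse as the first argument
    SIZE=$(wc -c < "$1")
    CSIZE=$(gzip -c --best "$1" | wc -c)
    # Compressed size as a percentage of the original size
    ENTROPY=$(echo "scale=4; $CSIZE / $SIZE * 100" | bc)
    echo "$1 is ${ENTROPY}% entropic"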
    --

    -- I was raised on the command line, bitch

    1. Re: Repost? by Omniscient+Ferret · · Score: 1

      Here's an edit of that script for noting the compression ratio:


      #!/bin/bash
      # Entropic analysis by Full Decent
      # "$*" joins all arguments, so a filename with spaces works even unquoted
      SIZE=$(wc -c < "$*")
      CSIZE=$(gzip -c --best "$*" | wc -c)
      ENTROPY=$(echo "scale=4; $CSIZE / $SIZE * 100" | bc)
      echo "$* is ${ENTROPY}% entropic or less"

      I also edited it to point out that this gives an upper bound of entropy. Using (for example) bzip2 gives different (not necessarily lower) readings; I'd probably use the lower one, but either is useful for a good estimate.

      (And this ignores, say, FLAC for audio, and... I don't feel like going into compression preprocessing. It is tempting, though.)

  71. How about.... by slummerx86 · · Score: 3, Interesting

    if all the email clients had a little button saying "This is Spam" and if you click it the mail gets sent to some nice spam black list agency. They'd wait for about 10 people to do this, then verify it for the spam it is and then A) black list the spammer and B) send anti-spam email (subject: spam sender here ) nice and easy :)

    1. Re:How about.... by athakur999 · · Score: 1

      CloudMark's SpamNet and Authority products combined do the first quite well. SpamNet checks all of your incoming email against their database and deletes it if it's spam. If spam does get through, there's a button you click that will forward the message to them.

      Authority is a server-side product that uses the SpamNet database to block spams at the mail server.

      Unfortunately, SpamNet only works on Outlook (for now) :(

      --
      "People that quote themselves in their signatures bother me" - athakur999
  72. Re:Just use a string entropy calculation algorithm by a2800276 · · Score: 2, Interesting

    d0rk! Ignoring the fact that I was being sarcastic and artistic license would have permitted me to specify /dev/my_ass let me just say this: before you make statements trying to make people look stupid you should probably have a clue what you're talking about.

    While it's true that your measly Linux machine has no /dev/srandom, this device is the source for _s_ecure random data on OpenBSD and it's probably available some other places as well. Some random trivia (pun intended): checking around, I noticed that AIX and Solaris both don't typically have /dev/random at all.

    But anyway, back to your question: if you're sad you don't have /dev/srandom you could try the following:

    ln -s /dev/zero /dev/srandom

  73. E-mail address id-ing by fulldecent · · Score: 1
    I use this technique. I own the domain phor.net and whenever I give out an address, it is in the form freeporn.com@phor.net or monster.com@phor.net

    and I can freely distribute these addresses, because when I get spam (not free pr0n) sent to freeporn.com@phor.net, I can just block them.

    in your AIM profile, you can also link to %n@phor.net which is their screenname. Then you can trace them easily.

    --

    -- I was raised on the command line, bitch

  74. A similar idea (no pun intended) by Ed+Avis · · Score: 1

    The other day I hacked together a script, similarity, which uses gzip compression to work out how similar two files are. I find this useful when searching for almost-duplicate files.
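
    One common way to turn compressed sizes into a similarity score is the normalized compression distance (NCD). The sketch below is not the script mentioned above, just an illustration of the general idea using Python's zlib; the helper names and the pseudo-random sample data are assumptions. Values near 0 mean "almost the same", values near 1 mean "nothing in common".

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)


# Made-up sample data: a 5000-byte "file", a lightly edited copy of it,
# and an unrelated file of the same size.
rng = random.Random(1)
original = bytes(rng.getrandbits(8) for _ in range(5000))
near_copy = original[:2500] + b"EDITED" + original[2500:]
unrelated = bytes(rng.getrandbits(8) for _ in range(5000))

print(f"near copy: {ncd(original, near_copy):.3f}")
print(f"unrelated: {ncd(original, unrelated):.3f}")
```

    With zlib the comparison only sees repeats within 32k, so for larger files a compressor with a bigger window would probably be a better choice.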

    --
    -- Ed Avis ed@membled.com
  75. Re:Just use a string entropy calculation algorithm by a2800276 · · Score: 1

    Just to keep on bickering (sorry, bad habit): strings /dev/random wouldn't work cause my super duper filter checks for the proper distribution of letters, i.e. more e's than q's and, cause it's spam, lots of html thingies.

    You're right on the money as far as filtering at the ISP is concerned; that's where the most benefit would be for the end-user. I see two problems, though.

    First, the ISP has to pay for bandwidth for the incoming email and spend money on filtering, but then isn't rewarded with more time/bandwidth consumed by their clients. Secondly, I think they'd be deathly afraid of inadvertently filtering out some false positives and being sued.

    Think what would happen if some marketing department tries to send their customer the rough draft of a mailing and it keeps getting eaten by the ISP's spam filter.

  76. Wide adoption by fulldecent · · Score: 1
    What I want to see is a backwards-compatible solution to signing e-mail like the following:

    • Every message you send (default behavior) gets MIME-attached with an encoded message digest and public key.
    • You physically type in a password for each e-mail you send (default config).
    • The recipient can add you to their "contacts" and collect your public key.
    • When a "trusted" or even "known" contact (valid public key and checksum) sends you mail it is highlighted or regarded not-spam or...

    This is something that is easily-implementable, backwards-compatible (you don't *need* to read the MIME attachment to check for validity) and trustworthy.

    Negative side effects are that if manual password entry is disabled, viruses can use your mail. (A counter measure would be to have the e-mail specify if the password was cached or manually entered)

    Please let me know if this has been implemented in a mail program yet.

    --

    -- I was raised on the command line, bitch

  77. Re:Text of the full article by timeOday · · Score: 1
    Guess what, Bayesian filtering IS a statistical heuristic applied to word counts.

    First you count the occurrences of each word in spam and nonspam. This gives you the probability that spam contains the word, and that nonspam contains the word. Then you use Bayes' theorem to compute the reverse - the probability that, given a message contains a word, it is spam or nonspam. You take the product of this value for all words in the message. Then you normalize so the sum of probability of spam and nonspam equals 1. (This is a so-called "naive bayesian classifier". Somebody might be using a bayesian network with a more complicated structure, but it would still be based on WORD COUNTING as the first step)
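
    The steps above might be sketched like this. It is a toy, not any particular filter's implementation: the tokenizer, the add-one smoothing, the class name and the four training messages are all assumptions made for the example, and it needs at least one training message of each kind. It multiplies the per-word likelihoods with the prior (in log space) and then normalizes, which is the usual textbook arrangement of the same computation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9'$]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.msg_counts[label] += 1
        # Count each word once per message (presence, not frequency)
        self.word_counts[label].update(set(tokenize(text)))

    def spam_probability(self, text):
        total = sum(self.msg_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            n = self.msg_counts[label]
            # Prior: fraction of training messages with this label
            score = math.log(n / total)
            for word in set(tokenize(text)):
                # P(word | label), add-one smoothed to avoid zeros
                score += math.log((self.word_counts[label][word] + 1) / (n + 2))
            log_score[label] = score
        # Normalize so that P(spam) + P(ham) = 1
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        ham = math.exp(log_score["ham"] - top)
        return spam / (spam + ham)


nb = NaiveBayes()
nb.train("Buy cheap pills now", "spam")
nb.train("Cheap watches, click now", "spam")
nb.train("Meeting moved to Tuesday", "ham")
nb.train("Please review the report before the meeting", "ham")
print(round(nb.spam_probability("cheap pills now"), 3))
```

    Everything after the counting step is arithmetic on those counts, which is the point being made: the word counts come first.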

  78. hotmail by Koatdus · · Score: 1

    Is anyone else considering just blocking ALL email coming from hotmail? I know it sounds draconian, and I actually have 3 or 4 friends that would be put out but it seems that about half of my spam these days is coming from hotmail accounts.

    Perhaps if the word got out that people were blocking hotmail accounts they would clean things up a bit.

    Another major source of spam here is .br . Since I don't speak spanish or whatever that gobbledy gook is I have a rule that autodeletes everything coming from .br .

    --
    Every wrong attempt discarded is a step forward - T. Edison
    1. Re:hotmail by julesh · · Score: 1

      I have 'From:.*hotmail\.com' giving 4 points in my spam filtering system. 9 points blocks the e-mail.
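
      A point-based setup like this could look roughly as follows. Only the hotmail rule and the 9-point threshold come from the comment above; the other two rules and all the names are hypothetical, added just so the example has something to add up.

```python
import re

BLOCK_THRESHOLD = 9  # from the comment: 9 points blocks the e-mail

# (pattern, points) pairs, matched against the raw message headers
RULES = [
    (re.compile(r"^From:.*hotmail\.com", re.I | re.M), 4),        # from the comment
    (re.compile(r"^Subject:.*\bfree\b", re.I | re.M), 3),         # hypothetical rule
    (re.compile(r"^Content-Type:\s*text/html", re.I | re.M), 3),  # hypothetical rule
]


def spam_score(raw_message: str) -> int:
    """Sum the points of every rule that matches."""
    return sum(points for pattern, points in RULES if pattern.search(raw_message))


def is_blocked(raw_message: str) -> bool:
    return spam_score(raw_message) >= BLOCK_THRESHOLD


friend = "From: pal@hotmail.com\nSubject: lunch on Friday?\n\nSee you there."
junk = ("From: winner@hotmail.com\nSubject: FREE prize inside\n"
        "Content-Type: text/html\n\n<html>...</html>")
print(spam_score(friend), is_blocked(friend))
print(spam_score(junk), is_blocked(junk))
```

      The hotmail rule alone stays below the threshold, so mail from friends on hotmail still gets through unless other rules also fire.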

    2. Re:hotmail by Anonymous Coward · · Score: 0

      The spam isn't *coming* from hotmail; it is coming from open mail relays in Asia these days. They're just forging hotmail addresses.

  79. Re:Text of the full article by Arkham · · Score: 2, Informative

    Bayesian filtering IS word-counting with (not very sophisticated) statistical heuristics applied to the results

    This may be the case, but most of the newer filters available now are not really Bayesian filtering by this definition. I use spambayes, and it has some very sophisticated algorithms to determine the statistical probability of the "spamminess" of a ham/spam.

    Some of these fancier algorithms were developed by Gary Robinson and are discussed in some detail here. You can see the results of these different classification techniques (gary combining, chi-squared) in some nice graphs here.

    On a related note, spambayes is VERY accurate in catching spam for me. Amazingly so in fact. It does a far better job than SpamAssassin or the Bayesian filter in Mail.app in my personal experience.

    --
    - Vincit qui patitur.
  80. So the question is... by Anonymous Coward · · Score: 0
    ...if you're at risk from missing legitimate email because you're manually sifting through hundreds of spams a day.... Does the statistical filter give you a higher rate of missed ham, or a lower one? Neither solution is perfect, but which is better?

    There's a good chance that a Bayesian filter will do better than you...out of about 8000 emails, Paul Graham says he missed exactly two legitimate emails, both of which were kinda marginal anyway, and he filtered out 99.7% of his spam.

  81. Bayesian Filters by tacocat · · Score: 1

    Sorry, but I don't see how this is anything different from just another spin on Bayesian Statistical filtering of spam that everyone's been playing with.

    It's hardly patentable. But it is interesting to see. But, once you look at it, not surprising.

  82. Litiogeoususizing (sp) by A55M0NKEY · · Score: 1
    Court fees are expensive and having a lawyer draw up a nice threatening letter can cost you a bundle.

    That's why the Sheisterizer 0.98 BETA ( by Cuisinart ) was created!

    Using the Sheisterizer you too can turn out incomprehensible and threatening sounding letters in your own kitchen ( or wherever you keep your computer ) for a fraction of the cost and effort it used to take.

    Sheisterizer's legaleze generator is guaranteed to produce the most convoluted and obfusticated prose possible liberally sprinkled with obscure and tedious-to-look-up jargon and outdated phraseology plus intimidating references to laws including the DMCA. Latin quotes of ancient Charlemagnian law are used to illustrate the applicability of irrelivant environmental laws to magnify the Quid Pro Quo nature of the implied infringement of the copyrights of ExtremeGonzoPorn film company by using quotes from six different movies created on dates ranging from 3483-3504 ( Chinese Zodiac Calendar ) and DMCA violations in connection with the utilization of a computing device ( Cogito Ergot Rye & Dewey Chetham and Howe L.S.D ) to circumvent the advanced delete on subject read and data streamed on screen button depression situation of the named illustrious and magnanimus institution of higher sloth. Not to mention the manufacture of illicit Schedule I substances and violations of the Mann act and various and sodomy laws in Louisiana's more conservative parishes. Which is why the state of Rhode Island will probably revoke their drivers licenses and extradite them to Saudi Arabia to stand trial for Jay walking.

    The Sheisterizer simulated Neural Net A.I. guarantees that each sentence is impossible to decipher and meaningless but intimidating.

    Beta testers that upgrade to the full release will get the opportunity to beta test Sheisterizer 2 which will include the new Illegal (Registered Trademark) lawyerizing encryption for your sensitive files too!

    --

    Eat at Joe's.

  83. I receive a lot of Russian spam by Mustang+Matt · · Score: 1

    I'm mostly guessing it's Russian. I don't recognize it as any other language and it usually comes from an unmasked .ru domain.

    In Soviet Russia...

    --
    The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
  84. Re:Just use a string entropy calculation algorithm by Domini · · Score: 1

    No problem, it's not bickering if I'm wrong... -grin-

    A friend of mine once sent me mail using Caesar's algorithm (ROT13) which I got pretty easily... then he decided to make a random scrambling... so I proceeded to adapt my program using probabilistic distribution in English (I used texts from HHG as my source reference -grin-), and automatically descramble any such texts. I've also written software that defines recursive rules for any type of language structure, and added the letter distribution probability to this... and voilà!

    PS: Here was the order of probabilities:

    Fortune : etaoinsrhlducmygfwpbvkxjqz
    HHG1 : etaoinhsrdlucgmwyfpbvkxzjq
    HHG2 : etaoinhsrdlucgmwfypbvkzxjq
    HHG3 : etaiohnsrdlucwgmfypbkvxjzq
    HHG4 : etaoihnsrdluwcgyfmpbkvxjzq
    Chaucer : ethoansridlywufmgcbpkvqjxz
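
    An ordering like the ones listed above can be produced with a few lines of code. This is a rough sketch, not the program being described: the function names are made up, ties are broken alphabetically (an arbitrary choice), and the descrambler is only the naive first guess that maps the most frequent cipher letter to the most frequent reference letter and so on down the list.

```python
from collections import Counter


def letter_order(text: str) -> str:
    """Letters a-z that occur in `text`, most frequent first."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


def guess_substitution(ciphertext: str, reference_order: str) -> str:
    """First-guess decoding of a simple substitution cipher by frequency rank.

    `reference_order` should list all 26 letters, most frequent first.
    """
    cipher_order = letter_order(ciphertext)
    table = str.maketrans(cipher_order, reference_order[:len(cipher_order)])
    return ciphertext.lower().translate(table)


FORTUNE = "etaoinsrhlducmygfwpbvkxjqz"  # the "Fortune" ordering listed above

print(letter_order("Hello, World!"))
print(guess_substitution("rrrr ggg nn b", FORTUNE))
```

    On real text the rank-for-rank guess gets the common letters roughly right and the rare ones wrong, which is presumably where extra language rules come in.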

    PS: I can email you a copy of these programs if you want... just email me at lailoken on freeshell.org

    On the topic of ISPs implementing this... if they do get a false positive, then the source user will get a bounce, and the sender can always find another method to get it to you. Really important stuff mostly doesn't get sent by email anyway...
    Besides one can always have opt-out policies regarding spam filtering... thus protecting the ISPs... my ISP does this.

    Once again you are right about the last point, but an active approach would solve this...

  85. Messages from teenagers would be spam by Adam9 · · Score: 4, Funny

    Don't use this filtering if you're a high school teacher or something else that involves getting messages from teenagers..

    [E-mail from skittles9333@some.email marked as spam and deleted] So like, I was like sick, and like, I didn't go to school today. So like, I was told like, that Jim like said, that like you might like, have some homework due like tomorrow. Could you like, tell me what like that homework would like be?

  86. Korean by ONOIML8 · · Score: 1

    I have no idea why but I receive a lot of spam in Korean.

    --
    . Quit playing Monopoly with Bill. Switch to one of many non-Microsoft products today.
  87. Nope by I+Am+The+Owl · · Score: 2, Insightful

    Doesn't work for the Lameness Filter, won't work for spam.

    --

    --sdem
  88. Foreign Language Spam by phorm · · Score: 1

    Not really very often, although since I have an email account on a German provider I have gotten a slight bit of German spam. I think a lot of it comes from "sign up" sites, unless you have a strongly public-visible website with your email address on the main page (damn trafficmagnet ads) - most companies in other countries probably aren't going to bother picking up your email address if they don't expect you to understand the language.

    Since a large portion of popular sites online are in English, it stands to reason that when you sign in your email address on an English site, it gets added to an English spamlist. Since I don't sign up on any Korean/Swiss/etc sites, they haven't gotten my email address yet (or don't care about it).

    That being said, people in N. America and English-speaking countries do get a lot of spam in English from foreign servers - which is where IP range blocklists and SpamAssassin come in handy.

  89. filtering across multiple accounts by klparrot · · Score: 1
    it just means I get 12 copies of most spams

    What about having a filter check all your accounts at once? If you're receiving the same email on more than one account, chances are it's spam.
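
    A rough sketch of that check, assuming the mail from each account is already available as plain body text (the function names, the sample mailboxes and the whitespace/case normalization are all made up for illustration). Each body is hashed, and any hash that turns up in two or more accounts gets flagged.

```python
import hashlib
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash of the body with case and whitespace differences ignored."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cross_account_duplicates(mailboxes):
    """Return fingerprints of bodies seen in at least two accounts.

    `mailboxes` maps an account name to a list of message bodies.
    """
    seen = defaultdict(set)
    for account, bodies in mailboxes.items():
        for body in bodies:
            seen[fingerprint(body)].add(account)
    return {fp for fp, accounts in seen.items() if len(accounts) >= 2}


mailboxes = {
    "work": ["Lunch at noon?", "BUY  CHEAP pills now"],
    "home": ["buy cheap pills now", "Your parcel has shipped"],
    "old":  ["Buy cheap pills now"],
}
suspects = cross_account_duplicates(mailboxes)
print(len(suspects))
print(fingerprint("Buy cheap pills now") in suspects)
```

    An exact hash is easy to defeat by personalizing each copy, and mailing lists you joined from two addresses would also trip it, so this is probably better as one signal among several than as a filter on its own.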

  90. Re:Just use a string entropy calculation algorithm by Anonymous Coward · · Score: 0

    This must be my lucky day! I get only 0s! What're the odds to that??!!11!11!!! LOL!

  91. Zip on DNA & Different Languages. by wilgamesh · · Score: 2, Interesting

    This reminds me that about a year ago, three Italian scientists came up with a way to find species relatedness by using the zip algorithm. One takes the sequence of bacteria 1, and then attaches a little bit of bacteria X sequence to the end of that. Again, one attaches a bit of bacteria X sequence to the end of bacteria 2. And then zipping is done on this concatenation. The final compression size of just the bacteria X part ended up telling us the homology (or relatedness) of bacteria X to bacteria 1 or 2.

    But from reading all these posts, perhaps a Bayesian method would work just as well. There seems to be no inherent advantage to using zip. One still needs a reference piece of work (non-spam email, or bacteria 1) for comparing entropies or probabilities. Of interest also is that the researchers applied their method to generating an accurate language tree of Indo-European languages (grouped by relatedness of course.)

    The ref & abstract of above paper is here:

    Phys. Rev. Lett. 88, 048702 (2002)
    Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto

    In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the implementation of the method to linguistic motivated problems, featuring highly accurate results for language recognition, authorship attribution, and language classification. ©2002 The American Physical Society

  92. Poetry and Prayers by Tisha_AH · · Score: 1

    Of course, if you write in prose you will have a problem. I guess prayers, with repetitive phrasing would also be filtered out.

    --
    Tisha Hayes
  93. Did you read the fucking comment???? by hammy · · Score: 1

    At the risk of feeding a troll.... That was exactly what I was pointing out, that this aspect of the implementation does nullify this person's point! That by comparing the message with both spam and ham you reduce the possibility that spammers can get around this technique by just adding random noise.

    Perhaps you should read people's comments more carefully before making stupid replies!

  94. Brazil by jjga · · Score: 1
    Another major source of spam here is .br . Since I don't speak spanish or whatever that gobbledy gook is

    .br is for Brazil, where people speak Portuguese.

    1. Re:Brazil by Koatdus · · Score: 1

      I stand corrected then... Portuguese..... I don't speak Portuguese and don't know anyone in Brazil...

      --
      Every wrong attempt discarded is a step forward - T. Edison
  95. GZIP used this way ... by fygment · · Score: 3, Interesting

    ... can be universal. The principles used actually have their roots in the theories put forward by R. Solomonoff and Kolmogorov (links below). Any given string of bits can be assigned a "complexity" which is proportional to the length of the shortest program that can generate that string. It isn't usually computable BUT the size of the output file of a compression algorithm can be shown to be a reasonable if crude approximation. The beauty is that this approach (minimum description length or MDL) is clustering email in a very fundamental way without the biases that can be introduced with assumptions required by Bayesian techniques and arguably making use of all the information (rather than a subset chosen by the Bayesian user) contained in the email. Yes, the answers can be the same but the MDL approach is universal and the same classifier without modification could be used for broader clustering tasks, i.e., beyond binary classification of junk/not_junk to multi-class classification junk/best friend/mom/dad/wife/work/etc.

    As an aside, since it could be fully automated it would be interesting to run such an algorithm with a graphical display, say a 2D plot of compression size vs time of day just to see what shakes out.

    By the way, the problematic portion for bioinformatics apps is the compression. DNA sequences often exhibit _expansion_ when put through the common compression schemes. Li has come up with a compression scheme that is more optimal called GenCompress.

    Kolmogorov Complexity - http://www.idsia.ch/~marcus/kolmo.htm
    Minimum Description Length - http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/
    Bioinformatics app - http://www.cs.ucsb.edu/~mli/sam.ps
    GeneCompression Program - http://www.cs.cityu.edu.hk/~cssamk/gencomp/GenCompress1.htm

    --
    "Consensus" in science is _always_ a political construct.
  96. Risk Analysis by po8 · · Score: 1

    All it takes is for one false positive on a Really Important Email and be accidentally deleted to totally destroy the value of any filtering system.

    Huh? You drive around in cars all the time, in spite of the fact that if that system fails (which it not infrequently does) in the wrong way and at the wrong time...you die.

    Technology occasionally fails. The only way to avoid technological failure is to avoid technology. (You'll still have failures: they just won't be technological ones.)

    If you anticipate receiving a communication so Really Important that the consequences of accidentally spam-filtering it are catastrophic, you shouldn't be using e-mail anyhow. I would guess that my personal spam filtering has about the same average false-positive rate as the rate of drops of mail by my software, hardware, and upstream mail providers. At least with the spam-filtered messages, I can keep them around and do a post-mortem on them.

  97. Re:Asian and Turkish spam by Anonymous Coward · · Score: 0

    I'm a native English speaker, but because my first name is Turkish (Kaan), and many of my email addresses are based on my first name, I constantly get singled out as being a Turk and thus interested in Turkish spam. I get 5-10 pieces of Turkish spam every day (which, if you're curious, is just like English spam - phone cards, herbal crap, toner and computer parts, etc. - only it's written in Turkish).

    I also get spam in various Asian languages (I've recognized Chinese mostly), but I have no idea why.

  98. My ruleset for Sendmail by doorbot.com · · Score: 1

    Well, we keep getting these anti-spam software stories on Slashdot, and I thought it was finally time to post my Sendmail ruleset.

    Using this system of RBLs and header checks, I'm able to whitelist certain users/domains/IPs, as well as block serious offenders. In the past few months, I've received one piece of spam (which was subsequently unceremoniously blocked). The worst offender is the Klez virus, which actually sends valid headers (more or less) and is thus harder to filter with my ruleset.

    Also, my ruleset will return a 553 error during the SMTP conversation... no accept-then-delete here. As an alternative, you might wish to use a more robust filter, such as Exim SpamAssassin at SMTP time.

    Without further ado, here's the URL for my ruleset:
    www.doorbot.com/guides/sendmail/antispam/

    I ask that you go easy on my bandwidth as best you can... I'm on a 128kbit upload DSL.

  99. Good history at everything2... by douglips · · Score: 1

    This node at everything2 has a good description of the catfight this paper generated.

  100. Re:Just use a string entropy calculation algorithm by Anonymous Coward · · Score: 0

    dude, you are so 1337.

  101. how to get rid of spam: those "99" values by kipple · · Score: 1

    I think the odds are very low that anyone legitimate will send you a mail SO full of $59.99 or $9.99 or $10.99 offers.

    The trick to remove spam is to delete mails that contain more than two '9's in a row, possibly preceded by a $ sign.
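
    A rough sketch of that rule in Python. The regex, the function names and the threshold are my own assumptions, not anything from a real filter. Since "$10.99" only has two 9s, I read the rule as "count price-like tokens ending in .99 and flag the mail when there are more than two".

```python
import re

# Optional "$", some digits, then ".99" not followed by another digit or letter.
PRICE_99 = re.compile(r"\$?\d+\.99\b")


def count_99_prices(text: str) -> int:
    """Number of price-like tokens such as $59.99, $9.99 or 10.99."""
    return len(PRICE_99.findall(text))


def looks_like_price_spam(text: str, threshold: int = 2) -> bool:
    """Flag the mail when it has more than `threshold` such prices (hypothetical cutoff)."""
    return count_99_prices(text) > threshold


print(looks_like_price_spam("Only $59.99! Was $10.99, now $9.99"))
print(looks_like_price_spam("Meet at 10.30, bring $20"))
```

    A single receipt or invoice could trip this just as easily, so it probably works better as one score among several than as a delete rule.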

    --
    -- There are two kinds of sysadmins: Paranoids and Losers. (adapted from D. Bach)
  102. bzip2 results by K-Man · · Score: 4, Informative

    Several knowledgeable people pointed out that the first try was limited by gzip's 32k window size, so I did a quick run with bzip2, which uses a 900k block, and put the results here. Somewhat different, but still a spread between spam/ham.

    And, of course, do try this at home.
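
    For anyone who wants to see the window-size point in isolation, here is a small sketch using Python's zlib and bz2 modules (my choice of tools for illustration, not what produced the results linked above). The data is a random 40,000-byte block followed by an exact copy of itself, so the only redundancy sits further back than a 32k window can reach.

```python
import bz2
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
block = bytes(rng.getrandbits(8) for _ in range(40_000))  # incompressible on its own
data = block + block  # the only redundancy is a repeat 40,000 bytes back

zlib_size = len(zlib.compress(data, 9))  # DEFLATE, same family as gzip, 32k window
bz2_size = len(bz2.compress(data, 9))    # level 9 means a 900k block

print("input:", len(data), "zlib:", zlib_size, "bz2:", bz2_size)
```

    The zlib output should come out about as large as the input, while bz2 can exploit the repeat. For spam filtering, that suggests a corpus much larger than the window is mostly invisible to gzip when the message is appended at the end.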

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
    1. Re:bzip2 results by juggy · · Score: 1

      I hate to break it to you, but as far as I know bzip2 uses the Burrows-Wheeler transform, which is vastly different from the LZ77 scheme that gzip uses. Unless I am utterly mistaken, your checks don't mean anything :-)

  103. Puts a dent in the old essay idea... by shepd · · Score: 1

    Remember:

    Tell them what you are going to tell them.
    Tell them it again.
    Let them know what you just told them.

    Hmmm...

    --
    If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
  104. Yeah... by silvaran · · Score: 1

    largenay ouray enispay inay ivefay easy inutesmay!

    Unfortunately, I was never very good at Latin...

  105. Similar techniques used to out author using alias by bahco · · Score: 1

    Sorry, can't find references, but similar techniques have been used by a team of Italian researchers to determine which real-life Dutch author published a book using a pseudonym. Something to do with an award for beginning authors, or suchlike.

    Bahco.

    --
    -- The best way to accelerate a computer running Windows is at 9.8 m/s^2.
  106. gzip compression by Anonymous Coward · · Score: 0

    I guess this will really press spam out of existence

  107. Putting it to the test. by kirkjobsluder · · Score: 1

    Ok, I decided to try it out and run my own statistics on it.

    The good news is that with bzip2 it performs about the same as spamassassin. On my K6-200 BSD system it takes about the same time to process an email message as spamassassin does. Both take too much time for my taste, but that is another issue. Run time is proportional to the size of the corpus.

    It's the statistics that bother me. There is no point in comparing the means (in ambiguous terms) without the standard deviation between groups.

    So here is my data. I created a spam and a ham corpus from half of my emails, then wrote a quick script to pipe the other half through the program.

                hratio            sratio
    ham         .122 (sd .098)    .249 (sd .079)
    spam        .276 (sd .046)    .198 (sd .060)

    hratio = compression ratio with ham corpus.
    sratio = compression ratio with spam corpus.

    n(ham) = 93
    n(spam) = 39

    Basically the variance kills compressing with a spam corpus as a test, because there is too much overlap between the ranges. More than half of my spam was within one standard deviation of the ham. The separation between distributions when compressing with the ham corpus is ok but not that great.
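
    The arithmetic behind that conclusion can be checked from the summary numbers alone. The sketch below is illustrative Python, not the script that produced the data. It takes the means and standard deviations from the table above, reading the two garbled entries as .098 and .198, and uses a helper of my own that measures the gap between the means in units of the root-mean-square of the two standard deviations.

```python
# (mean, sd) of the compression ratios reported above; n(ham) = 93, n(spam) = 39.
stats = {
    "hratio": {"ham": (0.122, 0.098), "spam": (0.276, 0.046)},
    "sratio": {"ham": (0.249, 0.079), "spam": (0.198, 0.060)},
}


def separation(ham, spam):
    """Gap between the two means, in units of the RMS of the two sds."""
    (mean_h, sd_h), (mean_s, sd_s) = ham, spam
    rms_sd = ((sd_h ** 2 + sd_s ** 2) / 2) ** 0.5
    return abs(mean_h - mean_s) / rms_sd


def spam_mean_within_ham_sd(ham, spam):
    """True when the spam mean lies inside ham mean +/- one ham sd."""
    (mean_h, sd_h), (mean_s, _) = ham, spam
    return abs(mean_s - mean_h) <= sd_h


for name, groups in stats.items():
    sep = separation(groups["ham"], groups["spam"])
    inside = spam_mean_within_ham_sd(groups["ham"], groups["spam"])
    print(f"{name}: separation {sep:.2f}, spam mean within one ham sd: {inside}")
```

    On these numbers the ham-corpus test separates the groups by about 2 such units and the spam-corpus test by about 0.7, with the spam mean sitting inside the ham band, which matches the overlap described above.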

  108. try blocking com, net and org and the cc's by Anonymous Coward · · Score: 0

    That's not draconian, try my email filter rules:

    1. If incoming email matches an email address in my address book, move to friends folder.
    2. Otherwise, delete it.

    I see no spam and get no false positives.

  109. Do you have non techie friends? by ^BR · · Score: 1

    Most of my non-techie friends greatly enjoy sending HTML mail, whether using Outlook or sendmail, but they sure never promise me a bigger penis or firmer breasts using 100% natural herbal pills.

    HTML is definitely not a classifier of spam, at most one of computer illiteracy.

  110. Uh by autopr0n · · Score: 1

    It depends on whether you're trying to stop spam or go on some crusade to punish people who enable spamming. I think it's ridiculous to block mail from someone because they use the same ISP as someone who sells spamming software, and I certainly wouldn't want some unaccountable 3rd party doing it on my 'behalf', especially since it doesn't benefit me at all (and, in fact, actually harms me, since I'm losing legitimate email).

    --
    autopr0n is like, down and stuff.
  111. Either one would get deleted by me by autopr0n · · Score: 1

    And I think a lot of people would delete the first one as well. I would expect the sweepstakes people to call me.

    --
    autopr0n is like, down and stuff.
  112. brainfart by ^BR · · Score: 1

    Is thinking Hotmail then writing sendmail a precursor sign of some mental disease?

  113. RTFA! by jotaeleemeese · · Score: 1

    The redundancy arises when compressing the email and a body of text you know contains SPAM...

    --
    IANAL but write like a drunk one.
  114. Who wrote this? by Puppet+Master · · Score: 1
    Individual results were also quite clear: while some spam messages compressed slightly better when mixed with ham, ham messages still maintained a margin of 15% or more between the most spamlike ham, and the most hamlike spam.

    Looks like either Dr. Seuss, or members of Monty Python...

    --
    The day Microsoft creates a product that doesn't suck, it will be known as the Microsoft Vacuum Cleaner!
  115. Yep by BillX · · Score: 1

    For me lately, I get about a 50/50 mix of English and Brazilian spam, with the occasional (maybe 10% of total spam) "gibberish" Asian character-set mail.

    --
    Caveat Emptor is not a business model.