Slashdot Mirror


User: letxa2000

letxa2000's activity in the archive.

Stories: 0
Comments: 2,721

Comments · 2,721

  1. Re:Check your filter training database on Analysis of Spam, and a Proposed Solution · · Score: 1
    What do you mean by overtrained?

  2. Re:Wrong on Analysis of Spam, and a Proposed Solution · · Score: 1
    Actually, it probably will eventually. How many non-spam messages do you get with external images embedded in them? Eventually the IMG tag itself will be considered a high indication of spam.

    The IMG tag in my corpus has a 96.101% spam probability. The SRC tag has a 96.027% spam probability. The token GIF within an HTML tag has a 94.647% probability. An HTTP token inside an HTML tag has a 93.528% probability. So the simple html tag IMG SRC="http://www.somesite.com/file.gif" has already scored 4 high-spam indicators. Throw in some headers for good measure and it's very doubtful that such a spam is going to get past my filter.
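As a rough illustration of how four such indicators add up, here is a minimal sketch. The comment does not say how its filter combines token scores, so the rule below (the naive combination described in Paul Graham's "A Plan for Spam") is an assumption, and `combine` is a hypothetical helper name.

```python
from math import prod

def combine(probabilities):
    """Combine per-token spam probabilities, treating tokens as independent."""
    spam = prod(probabilities)                  # evidence for spam
    ham = prod(1.0 - p for p in probabilities)  # evidence for non-spam
    return spam / (spam + ham)

# Token scores quoted above for IMG, SRC, GIF and HTTP inside an HTML tag.
img_tag_tokens = [0.96101, 0.96027, 0.94647, 0.93528]
score = combine(img_tag_tokens)
print(f"combined spam probability: {score:.6f}")
```

Under that assumed rule, the four tokens from a single IMG tag already push the combined score very close to 1 before any header tokens are counted.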

    Unless, of course, your friends have a tendency to send you spam-like embedded images in their email in which case you are the exception, not the rule.

  3. Re:Wrong on Analysis of Spam, and a Proposed Solution · · Score: 1
    It's amazing that years after Bayesian was first introduced that there are 1) People that think that the spammers can get around it. 2) People that think that inserting random words or text will reduce Bayesian effectiveness. 3) People that think that spammers can intentionally "poison" the corpus to make Bayesian less effective.

    NONE of these are true.

    My corpus has been building over the last year. I have 7979 good messages and 89048 spam messages in my corpus. Accuracy continues to increase despite whatever it is the spammers might be trying as of late to get past my filter. My accuracy was 99.35% back in June of last year while last month I scored 99.95%--and that's considering I get lots of email from people all over the world that are essentially "unknown" to me and write with varying levels of English literacy.

    Using made-up words (e.g., xfargs) will not help them get past Bayesian filters because "unknown" words will neither help nor hurt the Bayesian score. Using random words (e.g., inserting sections of the Constitution, poetry, or other random words from the dictionary) actually tends to raise the spam score, at least in the cases I've reviewed. On several occasions I've checked the words that were considered "spammy" in a detected spam with these kinds of random words and, ironically, some of the most spammy words were the random words. The spammer actually made things worse by trying to insert the random words!

    Anyway, those that think that Bayesian isn't the solution either don't fully understand the statistics involved and/or are using a faulty implementation. Despite the fact that my monthly spam has gone from 1638 in March 2003 to 14,119 last month, spam is no longer a problem for me. I see less spam now than I did a year ago even though I'm now receiving almost 10 times as much!

    PS--I'd like to look at your presentation, but it appears to be in Powerpoint (?) format! Come on, you should know better. This is Slashdot! :)

  4. Re:infinite monkeys on Armoring Spam Against Anti-Spam Filters · · Score: 1
    It's much easier than that to defeat Bayesian filtering. Ever \/\/0|\|D3R why you're getting so much spam with obfuscated words?

    Well, let's see... VIAGRA as a word and as split-up in various obfuscated ways:

    VIAGRA: 99.624%
    V.IAGRA: V=76.532%, IAGRA=99.9999%
    VI.AGRA: VI=72.656%, AGRA=99.9999%
    VIA.GRA: VIA=34%, GRA=99.9999%
    VIAG.RA: VIAG=99.9999%, RA=92.604%
    VIAGR.A: VIAGR=99.9999%, A=67.68%

    Now, yes, there are other ways to obfuscate the word. But you can give me just about anything and you'll see results similar to the ones above.
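A minimal sketch of why splitting a word with punctuation does not hide it: the fragments become tokens of their own, and those fragments rarely appear in legitimate mail. The separator set here (whitespace, `.` and `*`) is an assumption rather than the commenter's actual tokenizer, and `tokenize` and `SCORES` are hypothetical names; the scores are the ones quoted above.

```python
import re

# Assumed separators: whitespace, '.' and '*'. Other punctuation such as '!'
# stays inside the token, so "V!AGRA" survives as a single token.
SEPARATORS = re.compile(r"[\s.*]+")

def tokenize(text):
    """Upper-case the text and split it into tokens on the assumed separators."""
    return [tok for tok in SEPARATORS.split(text.upper()) if tok]

# Scores quoted in the comment above (hypothetical lookup table).
SCORES = {"VIAGRA": 0.99624, "V": 0.76532, "IAGRA": 0.999999,
          "VIAG": 0.999999, "RA": 0.92604}

for word in ("VIAGRA", "V.IAGRA", "VIAG.RA"):
    tokens = tokenize(word)
    print(word, "->", [(tok, SCORES.get(tok)) for tok in tokens])
```

Each obfuscated spelling simply trades one spammy token for two, at least one of which scores higher than the original word.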

    Or why you're getting so much spam where the text content is contained primarily in images rather than plaintext?

    HTML "A": 95.426%
    HTML "HREF": 93.306%
    HTML "IMG": 96.434%
    HTML "SRC": 96.357%

    You send me an HTML message with a link to an external image and it's almost guaranteed you'll be caught as spam. And we haven't even discussed the fact that Bayesian doesn't just look at the content, it looks at the message headers, too--and there's lots of good spam indicators there.

    Those things bypass Bayesian filters, that's why!

    Uh, no, they don't. You don't really know what you're talking about, do you? :)

  5. Re:infinite monkeys on Armoring Spam Against Anti-Spam Filters · · Score: 5, Insightful
    I'm not sure I understand why they think this is a problem with Bayesian filtering. Basically, they're saying that if a spammer sends you the same message thousands of times but inserts a few slightly different words each time, and if the thousands of messages get through the Bayesian filter to the user, and if the user doesn't disable HTML bugs on his email client, then we have a problem...?

    First, if the spammer sends thousands of copies of the same message and just changes the "extra words" that he is testing, it will take very little time for Bayesian to adapt to the rest of the message. Suddenly, the rest of the message that previously contained non-spammy words will be considered very spammy and will overwhelm the "extra words" that each message contains. Each time the message is caught as spam, the probability that any future tests get through--regardless of the "extra words"--will be reduced even further.
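The adaptation described above can be sketched with a simple frequency-ratio estimate. This is an assumption about how a per-token score might be computed, not the commenter's implementation; `token_probability` is a hypothetical name, the corpus sizes are the ones the commenter quotes elsewhere (7979 good, 89048 spam), and the 99.9999% cap mirrors the ceiling the commenter mentions for their own filter.

```python
def token_probability(spam_hits, good_hits, n_spam, n_good,
                      floor=0.000001, ceiling=0.999999):
    """Estimate P(spam | token) from how often the token appears in each corpus."""
    if spam_hits == 0 and good_hits == 0:
        return 0.5  # never seen before: neutral, neither helps nor hurts
    spam_freq = spam_hits / n_spam
    good_freq = good_hits / n_good
    p = spam_freq / (spam_freq + good_freq)
    return min(ceiling, max(floor, p))

N_SPAM, N_GOOD = 89048, 7979  # corpus sizes quoted by the commenter

# A word from the fixed part of the message, seen in 5 good mails so far,
# as more and more copies of the campaign get classified as spam.
for copies in (0, 10, 100, 1000):
    p = token_probability(copies, 5, N_SPAM + copies, N_GOOD)
    print(f"{copies:5d} spam copies -> {p:.3f}")
```

The score of the unchanged wording climbs with every caught copy, which is why the rotating "extra words" matter less and less.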

    Second, as the article said, it's a lot of work on the part of the spammer. They'd have to send out thousands of messages to each target to "sniff them out," and most of those wouldn't even be effective, since most would be caught by filters and, of the few that got through, very few would load the HTML bugs to identify themselves.

    Finally, it assumes that those that are using Bayesian filters are filtering their email but leaving their security (inasmuch as HTML bugs) wide open. While there may be some people that use Bayesian and leave HTML bugs active, it has to be a small minority.

    In short, it seems to me they've "found" a way to get around Bayesian that won't work, so to speak. I just don't see the problem.... ??

  6. Re:Thank you.... on Arrest in Caridi FBI Investigation · · Score: 1
    I suppose people will start justifying downloading movies like they do music. "the movies coming out today suck, so I'm not paying for it...but I'll download it". "Actors/directors/producers/cinematographers/ makeup/wardrobe/sound/fx/grips/best boys/art dept/ have to understand they won't get money now from movies and DVD's, they'll have to get it from performing live gigs". Wow...sounds pretty stupid when I put it that way...huh?

    Yeah, because it's a wrong comparison. Going to the movies is like going to a concert. Renting or buying a DVD is like buying music.

    The movie industry doesn't have much to worry about in terms of losing the movie-going population. Going to the movie is "going out" with your wife or girlfriend, getting out of the house for awhile. It's a social activity, and Hollywood does well because many people go for the social aspect of it even if they know the movie is going to suck. There may be a few nerds that would rather download the movie and watch it alone at home, but going to the movies is something couples and even groups of friends do together. It's not going away.

    The threat to the movie industry is that downloading movies could cut into DVD sales and rentals. That's where there's a fair comparison to downloading music and there may be some risk.

    I know someone (sister-in-law) who has downloaded several movies at work before they came out. She saw them first and then went to see them with my wife and me. Like I said, going to the movies isn't just about the movie. It's about the social aspect.

  7. Re:They don't care about us on Wal*Mart continues push for RFID adoption · · Score: 1
    Out of curiosity, regarding RFID at the retail level... are we talking about RFID on product packaging that gets discarded when you take it home and open it up? Or are we talking about RFIDs embedded in clothes such that it goes with you whenever you put it on?

    If it's the former, what's the problem?

    If it's the latter, then I agree we have a problem. But 1) Avoiding WalMart won't solve the problem since once RFIDs are in clothes I'm sure they'll be in all clothes wherever you buy them. 2) If the RFID is in clothes, why can't it just be located and removed?

  8. Re:not as bad as it sounds. on Spirit Rover Communications Error · · Score: 1
    You think? I'm sure they have some "packet sequence number" to make sure commands aren't received out of order and it's all ignored if they are received out of order.

    It'd have to be a pretty delicate protocol to go into safe mode because, essentially, a packet was dropped over hundreds of millions of miles of space. I'm sure packet drops were contemplated.

  9. Re:You poor deluded fool... on Spirit Rover Communications Error · · Score: 1
    Don't taunt happy vending machine...

  10. Re:Maestro update! on The Dirt On Mars, In Words And Pictures · · Score: 1
    I think it has already been proven that "god" did not create life on earth.. if so.. then how do you explain dinosaurs.. or the ancient egyptians?

    Both dinosaurs and ancient Egyptians qualify as "life," I think. This is just my personal belief, but I tend to believe life was created by God through natural means that we (man) are able to understand. While God can do His thing by waving his "magic wand" in the form of an instant miracle, I don't think that's the way God usually works.

    I think God "set up the universe" and let things unfold according to the physical laws that He defined to govern it. When it comes to life, perhaps He knew it would happen eventually under the rules He defined, or perhaps He gave the initial conditions a "push" in the right direction to make it happen. I don't find this at all in contradiction with what is stated in Genesis (although fundamentalists would probably take serious issue with me and my personal beliefs).

    That said, I'm not entirely convinced by evolution either. I don't consider evolution as a contradiction to religion (as I explained in the previous paragraph), but I still haven't seen proof of evolution. As far as I know, we haven't witnessed it--an evolutionary change in a species as we watch. As far as I know we don't have fossils that show "almost wings" sprouting out of dinosaurs that are in the process of evolving into birds. And the idea that a dinosaur was suddenly born mutated with wings, learned to fly, and reproduce (with another similarly mutated dinosaur?) and thus evolved into a bird seems doubtful.

    Evolution is still a theory and, in my mind, takes as much faith to believe in as religion.

  11. Re:Maestro update! on The Dirt On Mars, In Words And Pictures · · Score: 2
    Also, there is the grand philosophical question involved. Are we the reason for the universe? Did God create all of this just for us or are we just another form of life in a freak universe? The existence of life outside of Earth is as huge a revelation to religion as the debunking of the Earth-centric model of the solar system. The spiritual ramifications are enormous, but not often talked about.

    It's only a "huge revelation" to certain religious extremists. I, for one, am a Christian and my religion would be in no way threatened by the presence of intelligent or non-intelligent extraterrestrial life.

    If there is life on Mars, then suddenly Darwinism takes a huge leap and Biblical creationism, at least the most common interpretations, takes a step back.

    Why? I mean, if someone believes God created life on earth, why couldn't he create life on Mars?

    There are some that believe that John 10:14-16 ("As the Father knoweth me, even so know I the Father: and I lay down my life for the sheep. 16 And other sheep I have, which are not of this fold: them also I must bring, and they shall hear my voice; and there shall be one fold, and one shepherd.") refers to other religions. But "other folds" may just as well refer to life on other planets.

    And now we have to ask if other intelligent, self-aware creatures have a soul. Do they have an afterlife?

    Sure, why not?

  12. Re:Maestro update! on The Dirt On Mars, In Words And Pictures · · Score: 1
    Duke University Medical Center undertook a study about the power of prayer. They had a randomized selected group of patients to be prayed for by christian, jewish, and muslim clerics -- and a control that was not prayed for.

    You just think no-one prayed for the "control subject." Do you really think no one who knew the control subject (and was outside the study) prayed? Or maybe even the control subject himself? :)

  13. Re:Spamkiller doesn't care on Filter-foiling Gibberish Becoming A Spam Staple · · Score: 1
    Well, the obvious attack is to harvest words from the same web page that you harvest an address from. It would be devastating, as far as I could tell from my SA tokens... It may need some tuning too, but it could be bad... So, we need options, and many different approaches.

    That's an interesting thought, and I do see where you are going with it. I do see a couple of problems (for the spammers) though.

    1. Virtually everyone gets spam, but not everyone posts their email address on websites. This tactic would only work for the subset of email addresses that happen to appear on websites.

    2. Not all websites with email addresses are going to provide useful content context. If it's a university directory it could be full of other names and addresses and the spammer won't be able to know which (if any) of those people you communicate with. If he includes all of them then he'll probably run into the same problem as using random terms--that these will actually get higher spam scores.

    3. It assumes that what is discussed on the page with your email address is the same type of thing you get in email. This does make sense and may be true in many cases (such as me putting up a website on a topic and having an email address pointing to me). But in many cases, it won't. For example, I participate in a number of forums. I hide my email address anyway, but even if it were published, Slashdot would be the only forum in which there is an overlap between what I discuss in forums and email. In the other forums, my participation is limited to the forum itself--I actually don't discuss any of those topics via email.

    4. Even if it could work, it significantly complicates things for spammers. Most spammers still use pure email address mailing lists, some with email addresses a decade old. If they were to use this approach they'd have to recrawl the entire web and now associate "interesting words" from the webpage with the email address--probably at least 5 or 6 (or more, depending on their email headers) to get the spam score low enough. And that depends on them being able to pick out words from the webpage that are, in fact, interesting. Just picking out "unusual" words (words that aren't contained in most pages) would certainly be logical, but far from certain to work and easily foiled. They'd intentionally have to look for words that are on the target webpage that aren't usually found on others and use those--but those words could just as easily be garbage terms on the webpage. The Bayesian filter automatically ignores garbage terms (because they usually only occur once, and never in good email) but the spammers would actually be LOOKING for the rarest words, which could be garbage. We have the advantage of being able to ignore silly garbage, but they'd be looking for the rarest words which could be silly garbage. And if they actually used those terms enough, they'd become an indicator of spam for the recipient.

    I don't know... I do see what you're getting at and I hadn't really thought of it. But I think that it'd be pretty unwieldy for the spammers and pretty easily foiled by borrowing a tactic from the spammers and populating such pages with exotic terms that the spammers will grab and try to use. You basically "poison" the words they could potentially grab from the page. And while that tactic doesn't work for spammers (since we simply ignore such infrequent words), it would work for us since they'd intentionally be looking for infrequent words to try to use against us.

    At least those are my initial thoughts...

  14. Re:Spamkiller doesn't care on Filter-foiling Gibberish Becoming A Spam Staple · · Score: 1
    OK, your filter flagged my pseudorandom garbage as being spam. See the discussion we're having... what happens when you feed this entire slashdot post into your bayesian filter? I'll bet a shiny nickel that it will be flagged as spam.

    Pass that shiny nickel on over here, then. :) I inserted the entire message I am currently replying to into my Bayesian filter and, without any headers to work with, it got a spam score of 38.59%. It actually wouldn't have been tagged as spam. Why?

    Spammy words:

    0.99000 Body: SHINY
    0.99000 Body: NICKEL
    0.99000 Body: COLLOQUIAL

    Non-Spammy Words:

    0.00114 Body: I2C
    0.00151 Body: DAC
    0.00383 Body: JMP
    0.00429 Body: FILTERING
    0.00447 Body: ADC
    0.00890 Body: SLASHDOT
    0.02195 Body: PROBABILITY
    0.04082 Body: DISCUSSION
    0.04094 Body: BAYESIAN
    0.05913 Body: PROBABLY
    0.06005 Body: PERHAPS
    0.06202 Body: LINKING

    The probability of *any* word, with exceptions for things like highly technical jargon you use frequently, being labelled as spam, is quite high.

    Only if the word is used in much higher proportions in spam than real mail. Plus it doesn't matter if any given word rises. The word "THE" has a 53% spam score right now in my corpus--but that doesn't mean that any given message that contains the word "THE" is going to have a higher probability of being considered spam since Bayesian only considers the "most interesting" tokens. "THE" is only 3% off from neutral so it is doubtful it's going to be considered. It's words like "PORN" with a 99% score (49% from neutral) or RALPH with a 2% score (48% from neutral) that are going to make the case for or against a given message being spam. Words that are pretty much neutral aren't even going to be considered.

    If someone emails you about something other than work - merely engaging in some colloquial conversation and perhaps linking you to a site they found funny or interesting, you may never get that message.

    That has NEVER happened to me. The only emails I've had erroneously filtered by Bayesian (false positives) were random people I had never heard from before writing to me out of the blue and usually in broken English since they were foreigners. I have a popular website and I get literally thousands of unsolicited comments per year. Only a few of those were ever considered spam and even when I got them they weren't even messages I would have cared had I not seen them. I've never missed a relevant email with Bayesian.

    Your entire approach relies on the idea that there are words that are used more frequently in normal conversation than in spam.

    Right, and vice versa. There are words used more frequently in my normal conversations than spam, and there are words used in spam that I NEVER use in my normal conversations. Bayesian uses ALL that information and calculates a very accurate score predicting whether or not a given message is spam based on the words and characteristics of the mail compared to previous good and spam mail.

    If the spamming software gets a little smarter, the lines between the spam-words list and the non-spam-words list will blur so much

    Do you just think that's the case, or do you have any evidence? Everything I've seen in my Bayesian statistics indicates exactly the opposite. My Bayesian stats continue to improve such that words that spammers use to "dilute" their spam score are actually rising in spam probability since I never use them myself. In a recent spam, out of 18 words they inserted to hopefully lower their spam score, 15 of them actually RAISED their spam score. Their efforts were counterproductive.

    They can't blur the line between my spam and non-spam words unless they know, for example, the names of my best friends, the topics I generally discuss via email, etc. It's not good enough to use a lot of words that aren't used in spam since, over time, those are going to be considered spammy (

  15. Re:Spamkiller doesn't care on Filter-foiling Gibberish Becoming A Spam Staple · · Score: 1
    Isn't it somewhat odd that all of those words are exactly 100% guaranteed to be spam words?

    Not 100%. There is no word or term that is a 100% indication of spam. But 99.9999% is as close as you can get in my particular implementation. It means it's been used in many spams (more than 20, I think) and not a single good email.

    Right now, those words are scoring highly because you've flagged a spam message as containing them, but have not flagged any legitimate mail as not containing them.

    Correct.

    As you keep communicating through email, you'll use more and more words, and so a larger and larger percentage of the english dictionary cannot be guaranteed to be spam-words.

    Sure. For example, the word VIAGRA has appeared in just 1 non-spam message and 4144 spam messages. So its spam score is 99.80%. So, sure, if I start talking about Viagra a lot in my email then that particular word score will go down.

    But you are incorrect in that I (or more precisely, those that email me) will use "more and more words." There are thousands of words in the English language, but we don't use most of them. I seriously doubt that anyone that emails me will use the word "Goddess" or "Heterosexual." But if they do, it'll be infrequent. Perhaps "Goddess" or "heterosexual" would then drop to a 99.8% score.

    But it is unimportant that these two words drop from 99.9999% to 99.8%. Even if they dropped to 50% each, the above random words would still have resulted in a score that would be categorized as spam, and they'd only drop to 50% if a lot of my contacts started sending me lots of email about heterosexuals or goddesses.

    Simultaneously, spammers will realize that they can eliminate obscure words, technical jargon, and anything else unusual from their random message generators.

    Again, that's not good enough. They need to use the words that have scores of 1-5% in *MY* Bayesian statistics. And those are the words that are going to be very specific to me. We're talking words like:

    ADC: 0.46% (ADC=Analog Digital Converter)
    AVR: 0.80% (AVR is a type of microcontroller)
    DAC: 0.10% (DAC=Digital Analog Converter)
    I2C: 0.10% (I2C=Protocol for inter-chip communication)
    JMP: 0.40% (JMP is an assembly language instruction)
    RALPH: 5.0% (The name of a friend)
    COLORADO: 9.9% (Where I used to live)

    But if they mention California that's a 55.7% chance of being spam. Oregon is 63%. Arizona is 52%. Florida is 61%. So, for example, just dumping a list of states to hopefully find Colorado (which has a lower score) is going to be counter-productive since most states have HIGHER than average scores. But, of course, someone in California would probably have a low California score and a high Colorado score.

    How will you filter something like "Happy sunshine is today for apples and lamps are sitting on the bed. The desk near the door is computer funny on the airplane. I don't think keys is music by watch for time with knife and bowl. See movie with fries and etc etc rambling..."

    Well, I'd need a message header, too, because they provide a LOT of good information for Bayesian. But right off the bat I can give you the following scores for the words you used in your example:

    SUNSHINE: 99.0%
    DOOR: 90.24%
    MOVIE: 84.77%
    BOWL: 72.27%
    TODAY: 68.053%
    WATCH: 62.50%

    Without the headers your pure-text example message actually snagged a Bayesian score of 86.6% based on my Bayesian statistics. I'd bet you 10 bucks that if you actually sent me a spam with the above text in it that it'd EASILY score over 90% and be tagged as spam.

    This sort of semi-structured nonsense will probably fool your filter if it's only looking at the probability of individual words being used mainly in spam.

    No, it won't fool it. All by itself your sample text was almost tagged as spam. If the spam payload itself (the part that sells me Viagra, sends me to a website, encode

  16. Re:Spamkiller doesn't care on Filter-foiling Gibberish Becoming A Spam Staple · · Score: 4, Interesting
    I get the same statistics as you with my SA install, most of it is given a BAYES_99 score. Unfortunately, many don't train their own filters, and this is rather effective against them.

    True. Although an obvious caveat of using Bayesian to filter is that you HAVE to train it. In the anti-spam service I use (see tagline) it defaults to NOT using Bayesian. If you turn Bayesian on it specifically sends you an email reminding you that you MUST train it or things will actually get worse.

    But you're right, a misused Bayesian filter might actually be worse than no Bayesian filter at all. But that's the case whether or not spammers insert random words.

    There are ways to poison Bayes-filters that are better than this, and that may well be effective. If you sit down and think about it, I'm sure you can think of something too. I'm not going to write them, because it will be too easy for spammers to implement. Fortunately, spammers are stupid, and that buys us some time, but we still need more options.

    Let's talk about them. We're not going to come up with anything that spammers can't come up with so I don't think we're going to make things any easier for them or give away the farm by discussing it publicly.

    I personally have thought about it and I'm unaware of how they could poison Bayesian statistics. I only see two approaches, theoretically. 1) Make your spam get a lower Bayesian score so it gets through. 2) Make non-spam get a higher Bayesian score so it gets caught as a false positive.

    Approach #1: Short of going to the "spam of the future" predicted by Paul Graham, I don't see any way for spammers to really get a lower spam score. I've seen entire sections of the Constitution embedded in spam that still got a 98% spam score. The only way spammers are going to get a lower spam score is by doing things like using the names of my friends, using words related to topics I often discuss, etc. And that's just not possible. Like I said, they might get an occasional lucky shot but what gets through to me most probably won't get through to you. I just don't see any way for them to reliably get past a significant number of Bayesian filters.

    Approach #2: Poison the Bayesian stats such that non-spam mail gets tagged as spam. I'm pretty convinced this isn't possible, either. Again, they'd have to heavily use words that are specifically non-spam for the receiver such that the spam rating for those words increases so high that it is considered spam. But if the words are heavily used in both spam (trying to poison the stats) and non-spam, it's going to float to a middle position, like the word "THE" which has a 53.2% chance of being spam (and that's only because 92% of my mail is spam so a neutral word is usually slightly over 50%). But neutral words are completely ignored by Bayesian--only the "most interesting" are considered, those that are 99% spam or 1%--THOSE are the words that define whether or not the message gets scored as spam or not. Plus if they knew which words to poison, those are the same words they could use to get their spam past the filter to start with... so poisoning the filters is pointless anyway.

    I really don't see how they can get around it. I'd be interested in your views. If you really think it's dangerous to talk about it in public then let me know and I'll email you at your mangled address above. Is that your correct address?

  17. Re:Spamkiller doesn't care on Filter-foiling Gibberish Becoming A Spam Staple · · Score: 1
    Here's a few more from another spam I got today:

    luxurious: 34.8%
    goddess: 99.9999% (Bad choice given porn spam)
    prussia: 99.9999%
    foliate: 99.9999%
    roentgen: 99.9999%
    franca: 99.9999%
    plat: 99.9999%
    mycology: 99.9999%
    immigrate: 99.9999%
    calcite: 99.9999%
    gunfight: 99.9999%
    dame: 99.9999%
    clue: 5.2%
    grandiloquent: 99.9999%
    riverfront: 99.9999%
    canteen: 99.9999%
    heterosexual: 99.9999%
    guest: 51.6%
    chrysolite: 99.9999%
    crockery: 99.9999%
    scorch: 99.9999%

    In other words, ALL the terms this spammer used to supposedly get past a Bayesian filter scored a 99.9999% spam probability except 3 of them (which scored 51.6%, 34.8%, and 5.2%). However, they had 18 random words that scored 99.9999% spam probability. Since my Bayesian filter only considers the 15 most interesting terms (i.e., those furthest away from 50%), it turns out the ONLY terms considered for this particular email are 15 of their spammy-looking "random words." In other words, it doesn't matter what the rest of their email contains... The random words alone score this message as spammy beyond belief. Their own random words even defeat themselves since their lucky shot with "CLUE" (5.2%) isn't even considered since the 18 random words with a 99.9999% score are far more interesting. This spam would have been better off if it hadn't inserted any random words at all.
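The selection step described above can be sketched as follows. The scores are the ones listed in the comment; `most_interesting` is a hypothetical helper, and the rest of the message's tokens (which the commenter's real filter would also rank) are left out.

```python
# Scores quoted in the list above, written as fractions.
scores = {
    "luxurious": 0.348, "goddess": 0.999999, "prussia": 0.999999,
    "foliate": 0.999999, "roentgen": 0.999999, "franca": 0.999999,
    "plat": 0.999999, "mycology": 0.999999, "immigrate": 0.999999,
    "calcite": 0.999999, "gunfight": 0.999999, "dame": 0.999999,
    "clue": 0.052, "grandiloquent": 0.999999, "riverfront": 0.999999,
    "canteen": 0.999999, "heterosexual": 0.999999, "guest": 0.516,
    "chrysolite": 0.999999, "crockery": 0.999999, "scorch": 0.999999,
}

def most_interesting(token_scores, limit=15):
    """Return the `limit` tokens whose scores lie furthest from the neutral 0.5."""
    ranked = sorted(token_scores.items(),
                    key=lambda item: abs(item[1] - 0.5),
                    reverse=True)
    return ranked[:limit]

considered = most_interesting(scores)
print(len(considered), "tokens considered")
print("clue considered:", any(tok == "clue" for tok, _ in considered))
```

Because 18 words sit at the 99.9999% ceiling, all 15 slots are filled before the 5.2% score for "clue" is ever reached.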

    This is a perfect example of why spammers cannot win. They CANNOT get around Bayesian filters except for a very occasional lucky shot when they happen to use a random word that happens to be used frequently by the receiver--but even that proves futile when, in the above example, they get 2 non-spammy lucky shots and 18 damning spam words included in their random words. On balance, their random words have done more damage than good.

    I think time will prove Paul Graham completely right: The spam of the future will be a 1 or 2-line message prompting someone to click a website, and even these will usually be recognized by Bayesian based on their headers alone.

    But the traditional spam arms race is done and Bayesian and statistical filters have won.

  18. Re:Spamkiller doesn't care on Filter-foiling Gibberish Becoming A Spam Staple · · Score: 1
    Ok, granted, those that don't filter spam are in trouble. But as more people become frustrated with spam they will look for ways to deal with it. That may be with services that filter it for them, their ISP implementing the option of filtering, or email clients that support Bayesian. But just as email was originally a geek thing that is now used by virtually everyone, so will filtering.

    The point is, those that don't want to see spam don't have to. The technology exists to ensure you won't see it in any offensive quantity. For those that are willing to make a trivial effort to filter their email, statistics ensure that spammers will not be able to bother them. When enough people start filtering spam, spam will no longer be profitable.

    As for Hotmail and Yahoo, I'm not sure why they haven't implemented Bayesian yet. But I'm sure it's only a matter of time. Yahoo has a "Report as spam" button so it'd be extremely trivial to make that button generate the appropriate Bayesian statistics that would allow spam to no longer be a problem for Yahoo users. Same is true for AOL.

    But, again, I'm not speaking to the effectiveness of all spam filters. I'm talking about the effectiveness of Bayesian filters. The spammers are fighting a battle they can't win when it comes to trying to get their email past Bayesian filters.

  19. Re:Spamkiller doesn't care on Filter-foiling Gibberish Becoming A Spam Staple · · Score: 5, Interesting
    The encoding V*I*A*G*R*A would break out to the letters V I A G R and A.

    V: 76.9% Spam score
    I: 47.2% spam score
    A: 68.8% spam score
    G: 72.2% spam score
    R: 72.2% spam score

    On balance, if I get a message with the individual "words" of V, I, A, G, R, and A, that's going to be leaning towards spam.
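
    The "on balance" claim can be checked with a small sketch. The comment does not say how the filter combines per-token scores, so this assumes the naive-Bayes combination from Paul Graham's "A Plan for Spam", which treats tokens as independent. The `combine` function and the variable names are illustrative, not the commenter's code.

```python
from math import prod


def combine(probabilities):
    """Combine per-token spam probabilities, assuming the tokens are independent."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


# Per-letter scores quoted in the comment above.
letter_scores = {"V": 0.769, "I": 0.472, "A": 0.688, "G": 0.722, "R": 0.722}

# V*I*A*G*R*A splits into six tokens; A occurs twice.
tokens = ["V", "I", "A", "G", "R", "A"]
score = combine([letter_scores[t] for t in tokens])
print(f"combined spam probability: {score:.3f}")
```

    Under that assumption a single neutral token (I, at 47.2%) is outweighed by the five tokens that lean toward spam, and the combined score lands well above 0.5.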

    That's the beauty of Bayesian. Anything the spammers do will eventually come back and bite them in the butt. Even some of the "random words" they are starting to use are getting high spam scores:

    WHEREUPON: 99.9999%
    NEOCONSERVATIVE: 99.9999%
    LIBERAL: 74.3%
    LIBERTY: 84.0%
    MEGATON: 99.9999%
    METHANE: 99.9999%

    These are just a few of the "random words" I found in recent spams and, interestingly, the random words they are using are actually INCREASING their spam probability.

    Statistically, it's a lost cause for the spammers; they just don't realize it yet.

  20. Re:why not filter out 1337 sp3@k? on Filter-foiling Gibberish Becoming A Spam Staple · · Score: 3, Interesting
    You're completely right. I love it that spammers try to conceal their mail with weird combinations of words.

    Examples from my corpus:

    VIAGRA: 99.797%
    V!AGRA: 99.9999%
    AGRA: 99.9999% (from things like VI.AGRA)
    IAGRA: 99.9999%

    PORN: 98.573%
    P0RN: 99.9999%
    PR0N: 99.9999%
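
    One way scores like these could arise is a per-token estimate in the style of Graham's "A Plan for Spam", clamped so no token is ever fully certain. The sketch below is an assumption about the mechanism, not the commenter's actual filter: the 0.999999 ceiling is guessed from the 99.9999% figures, the token counts are hypothetical, and the corpus sizes are the ones quoted earlier in this archive (7979 good, 89048 spam).

```python
def token_probability(good, bad, ngood, nbad, floor=0.000001, good_weight=2):
    """Estimate P(spam | token) from how often the token appears in each corpus.

    good/bad are occurrence counts of the token; ngood/nbad are message counts.
    good_weight=2 follows Graham's bias against false positives.
    """
    if good == 0 and bad == 0:
        return 0.4  # Graham's default for a token never seen before
    g = min(1.0, good_weight * good / ngood)
    b = min(1.0, bad / nbad)
    p = b / (g + b)
    # Clamp so a single token can never be treated as absolute proof.
    return max(floor, min(1.0 - floor, p))


NGOOD, NBAD = 7979, 89048

# Hypothetical counts: an obfuscated token seen only in spam hits the ceiling.
spam_only = token_probability(good=0, bad=50, ngood=NGOOD, nbad=NBAD)
# Hypothetical counts: a token seen equally often in both corpora leans good,
# because the spam corpus is so much larger.
mixed = token_probability(good=10, bad=10, ngood=NGOOD, nbad=NBAD)
print(spam_only, mixed)
```

    Under this scheme any obfuscation that never appears in legitimate mail goes straight to the ceiling, which would be consistent with V!AGRA and P0RN scoring higher than the plain spellings.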

    Plus, the trick is looking for things that give away spam that aren't just words. I call them "characteristics." For example:

    Various pharmacy related terms: 99.9999%
    HTML using % escape sequences: 98.789%
    Http:// references that don't use www: 85.538%
    =?ISO- in Subject: 99.9999%
    Suspicious domains (BIZ, BR, PRO, etc.): 99.174%
    1 "Adult Term": 70.8%
    2 "Adult Terms": 85.7%
    5+ "Adult Terms": 99.9999%
    5+ HTML Comments: 92.0%
    10+ HTML Comments: 98.3%
    30+ HTML Comments: 99.9999%

    In short, there are so many aspects of a message you can analyze and make "Characteristics" that my Bayesian filter can often make a decision entirely based on the characteristics without even looking at some of the terms used within the message. But if the characteristics aren't damning enough, the content virtually always is.
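
    A minimal sketch of how such "characteristics" could be turned into extra tokens for the filter. Everything here is an assumption for illustration: the function names, the token labels, and the regular expressions are hypothetical, and only a few of the characteristics listed above are covered.

```python
import re

# TLDs named in the list above; the "etc." is left open.
SUSPICIOUS_TLDS = {"biz", "br", "pro"}


def bucket(count, thresholds, label):
    """Return a token for the highest threshold reached, or None if none is."""
    reached = [t for t in thresholds if count >= t]
    return f"{max(reached)}+ {label}" if reached else None


def characteristics(subject, body):
    """Derive synthetic tokens from structural features of a message."""
    tokens = set()

    # Count HTML comments and report only the highest bucket reached.
    comments = len(re.findall(r"<!--.*?-->", body, flags=re.DOTALL))
    comment_token = bucket(comments, (5, 10, 30), "HTML Comments")
    if comment_token:
        tokens.add(comment_token)

    if "=?ISO-" in subject.upper():
        tokens.add("ISO-encoded Subject")

    # Inspect the host part of every http:// reference.
    for host in re.findall(r"http://([^/\s\"'>]+)", body, flags=re.IGNORECASE):
        host = host.lower().split(":")[0]  # drop any port number
        if not host.startswith("www."):
            tokens.add("URL without www")
        if host.rsplit(".", 1)[-1] in SUSPICIOUS_TLDS:
            tokens.add("Suspicious domain")

    # Rough check for %XX escapes anywhere in the body.
    if re.search(r"%[0-9A-Fa-f]{2}", body):
        tokens.add("Percent-escaped HTML")

    return sorted(tokens)


subject = "=?ISO-8859-1?Q?Cheap_meds?="
body = ('<!--a--><!--b--><!--c--><!--d--><!--e-->'
        '<a href="http://pills.example.biz/buy%20now">x</a>')
found = characteristics(subject, body)
print(found)
```

    Each returned label would then be trained and scored like any other word, which is how a bucket such as "30+ HTML Comments" can end up with its own spam probability.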

  21. Re:Slippery little fish on Earthquake Prediction Months In Advance · · Score: 1
    there is an earthquake, a few people die or what not and we use our money to reduce global warming more. OR we spend money on trying to predict something that won't matter in say 100 years cause global warming will have killed us anyways. Some priorities need set I believe.

    I'd rather spend money on predicting earthquakes which have been proven to kill people than on Global Warming which hasn't even been proven, let alone proven to kill anyone.

  22. Re:So that means... on Earthquake Prediction Months In Advance · · Score: 1
    The grandparent is also making the very large assumption that if the U.S. had warned Iran ahead of time and then the earthquake occurred, that conspiracists and fanatics around the world wouldn't use that as evidence of U.S. technology to create earthquakes rather than predict them!

  23. DSL in Mexico on Broadband Pricing Across The World? · · Score: 1
    In Mexico, DSL is $89 per month for 512 down, 256 up. It used to be $89 for 256/128, but just last month they actually announced "Everyone at 256 will now be at 512 for the same price." A nice Christmas surprise, I guess.

  24. Re:Why is this so hard to get right? on Touch Screen Voting Trouble in Florida · · Score: 1
    My concern is that the manufacturer will write cheap, buggy, and potentially exploitable code. This is, in fact, precisely the case with the current generation of machines-- the SAIC has deemed them "at high risk of compromise". The SAIC is no paranoid slashdot conspiracist.

    Oh, I agree. The hardware and software should be secure. But questioning the security of the machines is a far cry from what some Slashdotters seem to suggest is an effort by the manufacturer to rig elections, get their man elected, etc.

    By what some Slashdotters say you'd think the manufacturer was Satan personified, hell-bent on electing Bush when, really, they probably have just as many bugs as any other software product. Granted, we want evoting to be secure and bug-free but bugs or problems are not evidence of an evil company intentionally trying to rig elections.

    That's all I was saying...

  25. Re:Why is this so hard to get right? on Touch Screen Voting Trouble in Florida · · Score: 3, Insightful
    Not done. You still have no idea whether the version recorded on some internal paper spool is actually what you voted for on the screen.

    At some point you must trust the election mechanism to work. If you're concerned that the version recorded on some internal spool differs from what you voted for on the screen, then you might as well be concerned about whether the votes are actually counted properly at the end of the day when all the voters have left the building.

    Yes, election fraud can exist. But I don't think it's going to happen at the machine level--it's going to happen at the human level.

    These election machines that are having so many problems (or at least reported problems) should be validated, of course. They should be certified by both parties and then not changed. The source really ought to be open which would make certifying the machines that much easier (both sides review the same source code, both compile the program, and both better produce the exact same executable).
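
    The "both compile, both get the same executable" check could look roughly like this. It is a sketch under assumptions: the file names are made up, the demo writes stand-in bytes rather than real builds, and it presumes the build process is deterministic (many toolchains embed timestamps or paths unless configured not to).

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large executables need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(path_a, path_b):
    """True if two independently produced executables are byte-identical."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo with stand-in files instead of real compiler output.
with tempfile.TemporaryDirectory() as tmp:
    party_a = Path(tmp) / "party_a.bin"
    party_b = Path(tmp) / "party_b.bin"
    tampered = Path(tmp) / "tampered.bin"
    party_a.write_bytes(b"\x7fELF demo build")
    party_b.write_bytes(b"\x7fELF demo build")
    tampered.write_bytes(b"\x7fELF tampered build")
    same = builds_match(party_a, party_b)
    different = builds_match(party_a, tampered)

print(same, different)
```

    Each party would publish its hash; the same comparison against the binary actually installed on a machine would show whether it was changed after certification.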

    But the people who seem to think that the manufacturer of voting machines is going to intentionally write code to conduct election fraud are insane. At least when election fraud normally happens, it is done quietly in dark corners with no evidence. In the case of a voting machine that does the fraud for them, that's like putting the evidence right out there in public. Someday, someone's going to check that machine, take it into evidence, reverse engineer the executable, and you're going to be sitting in jail and your company bankrupt. I don't think they're going to risk it.