Using AI for Spam Filtering (w/ Source Code)

already slashdotted :( by kyknos.org · 2004-07-11 01:17 · Score: 1

There are too many people accessing the Web site at this time.

--

SHE does throw dice.

Re:already slashdotted :( by jafomatic · 2004-07-11 01:20 · Score: 1

Not only that, but the 403.9 we're getting is returned by Microsoft IIS. And only two comments posted? That one sure didn't last long.

--
::jafomatic
Re:already slashdotted :( by pHatidic · 2004-07-11 01:21 · Score: 1

After only 2 comments...

from the i-can't-do-that-dave dept.

Even more mysteriously, who is Dave and what can't Taco do to him?
Re:already slashdotted :( by Anonymous Coward · 2004-07-11 01:31 · Score: 1, Informative

Even more mysteriously, who is Dave and what can't Taco do to him?

2001. Duh.
Re:already slashdotted :( by Anonymous Coward · 2004-07-11 01:39 · Score: 1, Funny

wooooooosh.

Buzzword Bingo! by Anonymous Coward · 2004-07-11 01:18 · Score: 0, Informative

"artificial living organism"

BINGO!

I'll Read the Article... by UberOogie · 2004-07-11 01:20 · Score: 4, Funny

... after we get an AI to counter the Slashdot effect.

--
"Enough of this wretched, whining monkey life." -- Marcus Aurelius, _Meditations_, Book 9, 37

Down by ZeroExistenZ · 2004-07-11 01:20 · Score: 1

That one went down after 3 replies :(
No luck for mirrors?

--
I think we can keep recursing like this until someone returns 1

Re:Down by jafomatic · 2004-07-11 01:23 · Score: 1

It's up, it's just hitting the limit of simultaneous connects. I'm surprised it didn't get deep into swap first; maybe the guy lowered the cap before submitting the article.

--
::jafomatic
Re:Down by ZeroExistenZ · 2004-07-11 01:26 · Score: 1

I just think hitting refresh in the hope that a new connection slot opens up isn't really going to up my chances of reading the article :p

Got the cache now.

--
I think we can keep recursing like this until someone returns 1

Google cache by cs02rm0 · 2004-07-11 01:21 · Score: 5, Informative

Google cache

Artificial living organism by Anonymous Coward · 2004-07-11 01:23 · Score: 4, Funny

I won't believe spam is a living organism till I see Marty Stouffer do a special, complete with comedy 'boing' noises and 'ain't that cute' music as we watch a mother Spam care for her young.

Re:Artificial living organism by ThisIsFred · 2004-07-11 02:06 · Score: 2, Funny

Those aren't my type of nature specials. I'd rather see a spam run down by a cheetah as it tries to escape through my router.

--
Fred

"A fool and his freedom are soon parted"
-RMS
Re:Artificial living organism by School_HK · 2004-07-11 03:46 · Score: 1

I would say this classification is useless for spam filters. Spam is written by humans, so of course its message patterns look human-like. Also, the ANN approach described in the article is, in effect, already used by ALL spam filters. Other filters don't call them neurons; they use things like "rules" or "tokens", so this is nothing new in the field. Filters like SpamBayes use a scoring method that assigns every message a score and uses it to judge the message as ham or spam. That is already much like an ANN, which produces a final 0 or 1. Those "old" filters can also be trained, and some of them automate the training by using their existing ham and spam corpora.

My final conclusion is, this article is talking about old stuff.

I can't let you read this Dave. by Anonymous Coward · 2004-07-11 01:23 · Score: 0, Funny

How about a nice game of chess?

Re:I can't let you read this Dave. by 91degrees · 2004-07-11 01:45 · Score: 1

Aren't you getting your memes muddled?
Re:I can't let you read this Dave. by mog007 · 2004-07-11 05:03 · Score: 4, Funny

Worse than being killed by the AI.. what if the AI decides to not filter spam anymore?

"I'm sorry Dave, but your wife thinks you SHOULD try this V@GR!A substance."

or

"This Nigerian seems very nice, and if it pays off you can get me more delicious RAM."

--
Learn something new.

Spam really needs to be done away with. by ODD97 · 2004-07-11 01:25 · Score: 2, Interesting

I dislike spam, in the same way only more than I dislike all the billboards along the highways. They get in the way of what I really want to see, and essentially make me feel inadequate. Billboards make me feel poor, because I can't afford a new home, or a meal at that expensive restaurant. Spam makes me worry that my penis is too small, my breasts are too small, I'm too fat, I don't send enough money to Nigeria. That said, it's illegal to saw down billboards, but it's not illegal to filter spam so I don't have to see it. The article is slashdotted, so I can't read it, but I think we already have good (free and open, no less) spam filtering available. I use Spam Assassin on my server, plus my mail client has a spam filter for double protection. Both have been learning more and more what constitutes spam, and it's rare that I even see spam anymore. If everyone would use these filters, spam would no longer be as profitable.

--
The emperor is naked.

Re:Spam really needs to be done away with. by Anonymous Coward · 2004-07-11 01:30 · Score: 0

Spam makes me worry that my penis is too small, my breasts are too small...

If you worry about both, I think spam would be relatively low on your list of problems. :)
Re:Spam really needs to be done away with. by Anonymous Coward · 2004-07-11 01:30 · Score: 0

If idiots stopped responding to it, then it will be unprofitable.
Re:Spam really needs to be done away with. by Quirk · 2004-07-11 01:32 · Score: 1

"Spam makes me worry that my penis is too small, my breasts are too small,..."
If you've breasts and a penis their size is the least of your problems. I suppose together they could be killer assets depending on who you do.

--
"Academicians are more likely to share each other's toothbrush than each other's nomenclature."
Cohen
Re:Spam really needs to be done away with. by ODD97 · 2004-07-11 01:40 · Score: 1

I don't need to be your dictionary.com proxy here, do I?

Ok, then, read this. Specifically, entry 3.
Thanks.

(Why does his comment get modded when he's basically just pointing out a small piece of humor from my post?)

--
The emperor is naked.
Re:Spam really needs to be done away with. by clambake · 2004-07-11 02:13 · Score: 1

If idiots stopped responding to it, then it will be unprofitable

Not for the people selling people the idea that idiots will respond to it, i.e. selling the lists of email addresses. It's just like gambling. If you have 650 million people online, and only one of them has to say "yes" for you to make money, it sounds like a great idea to lots of people.
Re:Spam really needs to be done away with. by Anonymous Coward · 2004-07-11 02:25 · Score: 0

I dislike spam, in the same way only more than I dislike all the billboards along the highways. They get in the way of what I really want to see,
Please.. for the rest of us... what you really want to be seeing is the road. If a billboard is blocking that, then you are in trouble.
Re:Spam really needs to be done away with. by kiskoa · 2004-07-11 02:29 · Score: 1

You must be new here.

--
If Yoda so strong in Force is, why words in right order he cannot put?
Re:Spam really needs to be done away with. by ODD97 · 2004-07-11 02:37 · Score: 1

But the people paying for it watch the profitability. If they send out 10,000 e-mails and get no response, and send another 100,000 and get one response, they'll realize it's not worth the money they're spending.

From what I understand, companies do pay for spam somewhere, even if they are well-removed from the actual spammer. Most marketing executives have to justify their expenses somewhere.

--
The emperor is naked.
Re:Spam really needs to be done away with. by slashname3 · 2004-07-11 04:13 · Score: 3, Interesting

I agree. I implemented spamassassin and it has worked wonders. We were seeing anywhere from 3000 to 7000 spam messages a day. Virtually all were tagged as spam by spamassassin.

This past week I implemented another tool called greylisting in the fight against spam.

Over a typical two-day weekend I would see something like 5000 to 8000 spam messages. In the two days since implementing greylisting we have seen 7 (yes, seven) spam messages, which were subsequently tagged as spam by spamassassin.

I never expected it to work that well but it has.

Highly recommended in this fight against spam.
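
For anyone wondering what greylisting actually does: the server temporarily rejects mail from any (client IP, sender, recipient) triplet it has not seen before, and only accepts it if the same triplet retries after a minimum delay. Real mail servers retry; most spam software does not. Below is a minimal Python sketch of that idea. The class name, the 300-second delay, and the reply strings are assumptions for illustration, not the configuration of any particular greylisting package.

```python
import time


class Greylist:
    """Toy in-memory greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.clock = clock          # injectable clock so the logic can be tested
        self.first_seen = {}        # triplet -> time of its first delivery attempt

    def check(self, client_ip, sender, recipient):
        """Return an SMTP-style reply for one delivery attempt."""
        triplet = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Record the first attempt; later attempts reuse the stored timestamp.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "250 OK"
        # Temporary failure: a well-behaved sending server will queue and retry.
        return "451 4.7.1 Greylisted, please try again later"
```

A production setup would also expire old triplets and whitelist known-good senders; this sketch leaves both out.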
Re:Spam really needs to be done away with. by relaxmax · 2004-07-11 05:03 · Score: 1

my penis is too small, my breasts are too small
I wonder...

--
Love all, Trust few, Follow one.
Re:Spam really needs to be done away with. by wideBlueSkies · 2004-07-11 11:54 · Score: 1

Who gives a rat's ass what your penis size is? Please don't take that offensively.

If some chick is gonna' complain about the size of your schlong, then get rid of her...she's not worth it. She's probably also preoccupied with the number of zeroes in your accounts. And she doesn't want to see only one of those either. Bigger is better to those types and it's all they care about.

wbs.

--
Huh?
Re:Spam really needs to be done away with. by Anonymous Coward · 2004-07-11 18:14 · Score: 0

A bit touchy on the subject of small penis eh?
Re:Spam really needs to be done away with. by wideBlueSkies · 2004-07-11 23:28 · Score: 1

Damn right! :)

wbs.

--
Huh?
Re:Spam really needs to be done away with. by Anonymous Coward · 2004-07-15 03:24 · Score: 0

"Spam makes me worry that my penis is too small, my breasts are too small" Do you really worry about both?? !! just a concerned netizen..

The great and powerful Oz has spoken! by carpe_noctem · 2004-07-11 01:26 · Score: 3, Funny

And the AI says....

The page cannot be displayed
There are too many people accessing the Web site at this time.
Please try the following:
Click the Refresh button, or try again later.
Open the www.generation5.org home page, and then look for links to the information you want.
HTTP 403.9 - Access Forbidden: Too many users are connected
Internet Information Services

--
"Quoting famous computer scientists out of context is the root of all evil (or at least most of it) in programming." - K

Re:already slashdotted :( ... not entirely by denominateur · 2004-07-11 01:26 · Score: 1

1 time out of 3 I can access it.

Animal Rights Activists by toetagger1 · 2004-07-11 01:26 · Score: 5, Funny

"living organism ... and techniques to kill it"

Next thing we know, we will have Animal Rights Activists in Washington, D.C. protesting our "spam traps"

--
who | grep -i blond | date cd ~; unzip; touch; strip; finger; mount; gasp; yes; uptime; umount; sleep

Re:Animal Rights Activists by atcdevil · 2004-07-11 02:33 · Score: 0

Heh.... only on Slashdot do trolls get rated as "funny" Because as long as you agree with everyone else, it's OK
Re:Animal Rights Activists by Anonymous Coward · 2004-07-11 04:00 · Score: 0

Next thing we know, we will have Animal Rights Activists in Washington, D.C. protesting our "spam traps"

Not unless you start torturing these so-called lives for the sole benefit of your tastebuds. And I doubt they're yet so sophisticated as to have the ability to suffer.

Hmm... by SilentSheep · 2004-07-11 01:26 · Score: 1

Sounds pretty cool, but i doubt there will ever be a way to completely get rid of spa unless the governments pass laws, and an international body is set up to prosecute spammers. Making the risk too high for them to bother doing it.

--
.

Re:Hmm... by aussie_a · 2004-07-11 02:49 · Score: 1

I agree. Spa is here to stay. Spam, on the other hand, will be susceptible to the approach the article suggested.

The law? by Anonymous Coward · 2004-07-11 01:27 · Score: 0

Can Laws Defeat Spam? No...that cannot be governed by one country's laws. Spammers can exist anywhere on the internet, meaning they can sling their wares from anywhere in the world, making the laws of one country completely irrelevant.

This is just one example where the law is unproactive in addressing the fast paced demands of technology. Technology comes, law follows, more technology, law follows. The law does NOTHING and waits until somebody sues somebody else's ass and then kicks in some legislation to deal with it. It defeats the purpose of the whole legal system in enforcing rules by authority. What authority?

Who would have thought by mst76 · 2004-07-11 01:27 · Score: 2, Funny

> most researchers in the fight against spam have failed to classify it as an artificial living organism

Who would have thought Skynet has its origins in spam?

The Architect? Is that you? by October_30th · 2004-07-11 01:27 · Score: 1

consider the following...

Who talks like this? Really.

--
The owls are not what they seem

Re:The Architect? Is that you? by IICV · 2004-07-11 01:47 · Score: 1

Bill Nye the Science Guy?
Re:The Architect? Is that you? by armando_wall · 2004-07-11 01:51 · Score: 1

Millions of people, perhaps?
Re:The Architect? Is that you? by Tarential · 2004-07-11 05:23 · Score: 1

Me. I also use "thus" and "therefor" on a regular basis. My speech is mixed with Middle English, and it gets more and more pronounced depending on who I'm talking with (namely, their intelligence). I read too much, perhaps, but there is certainly nothing wrong with speaking in such a fashion.

Bayesian filtering by sctprog · 2004-07-11 01:27 · Score: 2, Interesting

Isn't the Bayesian filtering system used in, e.g., Mozilla Mail classified as AI?

Re:Bayesian filtering by Anonymous Coward · 2004-07-11 02:18 · Score: 1, Funny

I wouldn't. It's straightforward, dumb and effective. AI would be the opposite.
Re:Bayesian filtering by School_HK · 2004-07-11 02:39 · Score: 1

Yes. Mozilla uses Bayesian filtering.

Is it any wonder it mimics humans??? by Shoeler · 2004-07-11 01:27 · Score: 5, Insightful

I mean - hello, humans create it.

We're not up against a new being - it's the same type of beings that create scripts for the hell of it that wreak havoc on computer networks because 1) "We can" or 2) "To show them their weaknesses".

It was a very interesting read for sure - the genetic marker bit was quite interesting. Admittedly though I got about 2/3rds the way through it and lost interest.

Blame the spammers I say. ^_^

Re:Is it any wonder it mimics humans??? by ThisIsFred · 2004-07-11 01:49 · Score: 1

Yes, and more than once I've seen Slashdot "researchers" suggest more than one way to kill the organism that creates it.

--
Fred

"A fool and his freedom are soon parted"
-RMS
Re:Is it any wonder it mimics humans??? by ScrewMaster · 2004-07-11 02:16 · Score: 1

You forgot No. 3 ... Profit!

--
The higher the technology, the sharper that two-edged sword.
Re:Is it any wonder it mimics humans??? by TubeSteak · 2004-07-11 04:23 · Score: 1

It was mostly interesting, but the most immediate problem I see is that spam does not evolve like an organism. Organisms slowly evolve while Spam content makes the occasional wild shift in both how and what is used to throw filters off the scent. It's the difference between evolution and creationism. Can this filter handle the genetic equivalent of an act of god?

--
[Fuck Beta]
o0t!
Re:Is it any wonder it mimics humans??? by Jeremi · 2004-07-11 05:38 · Score: 2, Informative

spam does not evolve like an organism. Organisms slowly evolve while Spam content makes the occasional wild shift in both how and what is used to throw filters off the scent

Actually, "occasional wild shifts" are exactly how organisms evolve.

--

I don't care if it's 90,000 hectares. That lake was not my doing.
Re:Is it any wonder it mimics humans??? by Chris+Acheson · 2004-07-11 06:13 · Score: 1

Blame the spammers I say. ^_^

We already do. It doesn't stop the spam, though.

--
AEIOU: open-source anonymous internet currency

Knuth's algorithm by 0x54524F4C4C · 2004-07-11 01:29 · Score: 0

spam = false
foreach word in message
  foreach spamWord in ['penis','viagra','paris hilton']
    if soundex(word) == soundex(spamWord)
      spam = true
    end
  end
end
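
The pseudocode above can be sketched in runnable Python roughly as follows. This is an illustration under stated assumptions, not anything from the article: the `soundex` helper is a plain implementation of standard American Soundex, non-letters are stripped before coding (so "V1@gr@" collapses to "vgr"), and the phrase 'paris hilton' is split into separate words, since a whole two-word phrase would never equal the code of a single word in the message.

```python
# Standard American Soundex digit groups.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of word, or "" if it has no letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code               # vowels reset prev, so repeats count again
    return (result + "000")[:4]


# Assumption: phrases are split into words before coding.
SPAM_WORDS = ["penis", "viagra", "paris hilton"]
SPAM_CODES = {soundex(w) for phrase in SPAM_WORDS for w in phrase.split()}


def is_spam(message):
    """True if any word in message sounds like a listed spam word."""
    return any(soundex(word) in SPAM_CODES for word in message.split())
```

Note that a check this crude also flags harmless words that happen to share a code with a listed word, such as "pens".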

The fa link says to contact Microsoft Support by Secrity · 2004-07-11 01:29 · Score: 0

The site says that There are too many people accessing the Web site at this time. and that I should contact Microsoft Support.

Re:The fa link says to contact Microsoft Support by Pflipp · 2004-07-11 02:02 · Score: 1

Maybe you should send them an email.

And you, and you, and you...

--
"We can confirm that Debian does *not* ship the version with the trojan horse. Our version predates it." [CA-2002-28]
Re:The fa link says to contact Microsoft Support by Secrity · 2004-07-11 03:07 · Score: 1

Oh, I did. I always send mail to MS when I go to a website that says to contact them.

Re:already slashdotted :( ... not entirely by Ahaldra · 2004-07-11 01:31 · Score: 1

1 time out of 3 I can access it.

Maybe if we hit reload more often the site will become accessible again ;-)
This is really a testament of strength of yet another MS product.

On a more serious note, does anybody have a mirror?

--
Code is Speech. No to Censorship.

This guy may take spam a little too seriously... by Anonymous Coward · 2004-07-11 01:31 · Score: 0

From the article: Spam has become the first great plague of the 21st century.
Yeah, spam sucks, but isn't calling it a plague going a bit overboard (and didn't it start before the 20th century?)
Maybe AIDS is a worse plague. I know, it didn't get started in the 21st century, but infection rates have started to climb in the 21st century, esp. in Asia, after showing signs of falling...

The Article by Maddog+Batty · 2004-07-11 01:36 · Score: 4, Informative

Introduction

Spam has become the first great plague of the 21st century. Over 60% of all e-mails are spam, costing U.S. corporations more than $10 billion annually, on top of the productivity lost from scanning through e-mail and deleting spam. Along with this, an estimated 5% of spam campaigns are a pure and outright scam, with the remaining majority pitching products that are dubious at best. It used to be that parents had to worry about their kids surfing and finding pornographic websites; now we have to worry more about our kids opening an e-mail client and finding a pornographic spam message. Spam must be stopped before it cripples the infrastructure of the internet and drives users away from one of the greatest forms of communication, E-mail.

Can Laws Defeat Spam? No. This has to be one of the greatest misconceptions of users. The internet is just that, an "INTERnational NETwork" that cannot be governed by one country's laws. Spammers can exist anywhere on the internet, meaning they can sling their wares from anywhere in the world, making the laws of one country completely irrelevant. Also, the decentralized, self-organizing design of the internet makes it nearly impossible to regulate by external means. It would be easier to regulate the weather than to regulate the internet.

Spam as a Living Organism

Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it. While this classification may sound strange, consider the following:

Spam evolves and adapts based off the rules of natural selection
Through the fight against spam, spam has demonstrated an uncanny ability to adapt to the conditions of its environment, namely the internet. When one barrier against a strain of spam is put up, another, resistant strain appears. This is similar to how bacteria build immunity against antibiotics: the strains that are not immune die, while the ones that are immune take over and become the dominant, drug-resistant strain. This leads to the belief that spam will not die until the barriers of its environment evolve faster than it does.
Spam lives within an eco-system, and we're its food
The internet is a complex chain of systems that all rely on each other for survival. Without an internet protocol, a web browser couldn't exist. Without web servers, the web wouldn't exist. Without ... (you get the picture). This chain of systems can be likened to an eco-system, with spam existing at a parasitic level of species within this system. It consumes resources (bandwidth, servers, time) in its attempt to reach its primary host: us. Once spam reaches its target, its sole purpose is to solicit its "food" from us, primarily money. If it is effective, that strain of spam lives and continues to propagate, otherwise it will die. Can the internet eco-system be modified so spam can't feed?
Spam has genetic traits and markers
Just like any organism, spam contains certain traits that uniquely identify it. These can be a combination of words, information inside the header of the e-mail, the format of the message (HTML, plain text, RTF), the message encoding (base64), whether it contains image links, the number of links, whether it contains hidden text, and so on. Up until recently, spam filters have primarily focused on just one of these traits: the wording of the e-mail. Spam, being an organism, evolved so this marker was hidden within its code, making it difficult at best to filter. It did this by including random, non-spam words in hidden areas of the e-mail, by replacing words like Viagra with V1@gr@, by sending spam as image links, and by encoding the message in a format that filters could not read. The good news is this "gene" is still present, and can be unlocked by identifying the defensive genes wi

--
wot no sig

Re:The Article by Chess_the_cat · 2004-07-11 01:50 · Score: 1

Spam evolves and adapts based off the rules of natural selection
Spam lives within an eco-system, and we're its food
Spam has genetic traits and markers
Huh? Since when do these three criteria determine if something is alive? As far as I remember from high school the criteria were: locomotion, respiration, ingestion, self-reproduction. Genetic traits and markers have nothing to do with life at all. Viruses are nothing but genetic material but they aren't alive. At any rate, Spam doesn't move on its own, it doesn't breathe, it doesn't literally eat, and it doesn't spontaneously reproduce. They're just messages sent over a network. Let's get serious.

--
Support the First Amendment. Read at -1
Re:The Article by kyknos.org · 2004-07-11 01:53 · Score: 1

The criteria you remember are grossly outdated. It is not easy to define life. Spam is certainly not a living being, but your criteria aren't any better.

--

SHE does throw dice.
Re:The Article by clambake · 2004-07-11 02:16 · Score: 2, Insightful

locomotion, respiration, ingestion, self-reproduction

Yeah, fire is alive.
Re:The Article by t7 · 2004-07-11 02:21 · Score: 2, Insightful

"Huh? Since when do these three criteria determine if something is alive? As far as I remember from high school the criteria were: locomotion, respiration, ingestion, self-reproduction."

I believe you are missing the point the creator is trying to make. Spam imitates a living organism by adapting to its surroundings in order to survive. Why does spam do this? Because it is sent by HUMANS who learn to "mutate" and change their message to bypass current spam filters in order for them to survive.

I think this is a very interesting approach and may serve as an effective spam blocking tool until an improved mail protocol is accepted.
Re:The Article by Anonymous Coward · 2004-07-11 02:23 · Score: 0

do the locomotion, doooo the locomotion...

Sorry, just writing a report since nearly 30 hours now, damn deadlines...
Re:The Article by ArbitraryConstant · 2004-07-11 02:38 · Score: 4, Funny

This article advocates a

( ) technical ( ) legislative ( ) market-based ( ) vigilante

approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)

( ) Spammers can easily use it to harvest email addresses
(x) Mailing lists and other legitimate email uses would be affected
( ) No one will be able to find the guy or collect the money
(x) It is defenseless against brute force attacks
( ) It will stop spam for two weeks and then we'll be stuck with it
( ) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
(x) Requires too much cooperation from spammers
( ) Requires immediate total cooperation from everybody at once
(x) Many email users cannot afford to lose business or alienate potential employers
( ) Spammers don't care about invalid addresses in their lists
( ) Anyone could anonymously destroy anyone else's career or business

Specifically, your plan fails to account for

( ) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
( ) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
( ) Asshats
( ) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
( ) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
(x) Armies of worm riddled broadband-connected Windows boxes
(x) Eternal arms race involved in all filtering approaches
( ) Extreme profitability of spam
( ) Joe jobs and/or identity theft
( ) Technically illiterate politicians
( ) Extreme stupidity on the part of people who do business with spammers
( ) Dishonesty on the part of spammers themselves
(x) Bandwidth costs that are unaffected by client filtering
( ) Outlook

and the following philosophical objections may also apply:

(x) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
( ) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
( ) Countermeasures should not involve wire fraud or credit card fraud
( ) Countermeasures should not involve sabotage of public networks
(x) Countermeasures must work if phased in gradually
( ) Sending email should be free
(x) Why should we have to trust you and your servers?
( ) Incompatibility with open source or open source licenses
( ) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
( ) I don't want the government reading my email
( ) Killing them that way is not slow and painful enough

Furthermore, this is what I think about you:

(x) Sorry dude, but I don't think it would work.
( ) This is a stupid idea, and you're a stupid person for suggesting it.

--
I rarely criticize things I don't care about.
Re:The Article by Anonymous Coward · 2004-07-11 04:33 · Score: 0

Locomotion? Plants aren't alive?
Re:The Article by Anonymous Coward · 2004-07-11 04:36 · Score: 0

Spam imitates a living organism by adapting to its surroundings in order to survive.

Now, see that's something that I disagree with.

Spam is an inanimate object, in no way / shape / form does it adapt once it is released from its creator.

What does adapt / mutate is the creator (spammer) as they learn what spams made it past the filters / blocklists and which ones didn't.

(Figuring out how to identify those mutated genetics is something I leave up to those who would practice the betterment of the species.)
Re:The Article by Anonymous Coward · 2004-07-11 06:53 · Score: 1, Insightful

And an oak tree isn't.
Re:The Article by wheany · 2004-07-11 07:09 · Score: 2, Insightful

Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it.

That is not true. I have been using POPFile for 1 1/2 years now, and spam is no longer a problem for me. I see maybe 1 spam per week. I think that the "Bayesian part" of all these filters is just about equally effective; the differences come from the tokenizer. The more data you can extract from the message, the more data the Bayesian classifier has to work with.

The article sounds like the author had just learned about neural nets and decided that they would be the best solution to spam without doing any real research on existing systems.
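
To make the tokenizer/classifier split concrete, here is a minimal Python sketch of a naive Bayes filter with add-one smoothing. It is not POPFile's code or any real filter's implementation: the `tokenize` regex, the class name, and the 0.5 cut-off used in the example are all assumptions chosen for illustration. Swapping in a richer `tokenize` (headers, link counts, encodings) is where real filters differ from one another.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very crude tokenizer (an assumption): lowercase runs of letters, digits, $ ! '."""
    return re.findall(r"[a-z0-9$!']+", text.lower())


class NaiveBayesFilter:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # token counts per class
        self.messages = {"spam": 0, "ham": 0}                # messages seen per class

    def train(self, text, label):
        """Add one labelled message ("spam" or "ham") to the training data."""
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) under the naive independence assumption."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = sum(self.messages.values())
        tokens = tokenize(text)
        scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior, then smoothed per-token likelihoods, in log space.
            score = math.log((self.messages[label] + 1) / (total + 2))
            n = sum(self.counts[label].values())
            for tok in tokens:
                score += math.log((self.counts[label][tok] + 1) / (n + vocab + 1))
            scores[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(scores["ham"] - scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

Usage would look like training on a handful of labelled messages with `train(text, "spam")` or `train(text, "ham")`, then treating anything with `spam_probability(text) > 0.5` as spam.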
Re:The Article by E_elven · 2004-07-11 08:34 · Score: 1

Aha! So we need to detect the spammers, not the spam itself! Hahaa! It shouldn't be too hard, they obviously send a whole lot of e-mail! Problem solved!!!!!

(!!!!!)

--
Marxist evolution is just N generations away!

Re:already slashdotted :( ... not entirely by RupW · 2004-07-11 01:36 · Score: 3, Insightful

This is really a testament of strength of yet another MS product.

No, more likely it's some guy trying to use Windows 2000 Pro as a webserver. It has a ten connection limit; you're supposed to use a server version of Windows for live webservers. I've never seen that error from a server version of Windows.

Smeagle by mfh · 2004-07-11 01:37 · Score: 1

The quote at the top of the page is pretty damn funny; "Tricksy spammers, they'll stop at nothing to get my precious."

I have to ask; if you're going to classify spam as an organism, would you not also have to classify email as an organism? So if spam is predatory in nature, then regular email is not?

And so what if we do this? What guarantee do we have that spammers won't evolve past any thwarting mechanism developed? My thoughts are that you have to keep slowing it down, to the point where only the most experienced spammers can get past the armor. Make it so tough to start spamming that people can't just simply pick it up as a hobby. I can remember back in the early nineties when it was relatively easy to spam and there were quite a few people doing it. But nowadays it's not that easy to just start spamming. There's no guarantee that the email will get through, so while many people try to spam, many fail at it. With no payoff, fewer people take it up.

But at some point, we will have to deal with the moneybags supporting spam. Maybe legislation could make funding spam illegal? Fine the little old ladies who give their money to spammers, and you'll see much of that just stop. Cut off the revenue generated by spam, and you cut off the spam itself.

--
The dangers of knowledge trigger emotional distress in human beings.

porn by Anonymous Coward · 2004-07-11 01:39 · Score: 0

Just what I want, my computer learning from porn ads.

Am I the only one who can see this? by Anonymous Coward · 2004-07-11 01:40 · Score: 0

The first half of the article is total bullshit whereas most of the second half consists of well-known facts about spam and about neural networks.

I wonder if a neural network will work as well as a Bayesian network. My intuition says no, but I guess it's worth a try.

Re:Am I the only one who can see this? by Tomato+Soup · 2004-07-11 04:55 · Score: 1

Neural networks have been investigated for years as a text classification tool, and no, they don't tend to work very well. In addition, it usually takes so long to train them that most categorization researchers don't bother too much with them anymore.
-Ken
Re:Am I the only one who can see this? by CFD339 · 2004-07-11 07:04 · Score: 1

Agreed. Neural networks were once seen as some kind of magical solution into which you could dump chaos and out of which would come order. They seem to be in that same class of "wish fulfillment" that SF writers rely on for the magic way spaceships can do whatever the writer wants them to do. Neural networks are good tools, and they may be the basis for some really nifty sensors in the near term and even AI in the long run -- but they're not magic. The way to stop spam is to convince enough people to switch to verified email only. Convince companies that it's worth potentially losing a contact or two in the long run and they may even do it. I don't see it as likely though. Try telling a sales guy that losing a mail message or two from someone they haven't met yet is worth it and you'll have a fight on your hands. Every single lost email becomes that one thing that is the reason they didn't make their numbers. Gack.

--
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln

Re:This guy may take spam a little too seriously.. by bairy · 2004-07-11 01:41 · Score: 2, Interesting

Compared to AIDS there's no real contest. But spam is a real bastard to everyone on the net, not just because it's seriously annoying, but because some people fall for the scams (419 scam etc) and actually lose money.

Also, it ties up email servers meaning yours can take a little longer. I once got a spam message 2 weeks after it was sent, so what happened to legit email is a mystery.

I think for the damage it does both to servers (slowdown) and to people (moneydown), it could be called a plague.

--

Get paid to search..It's geniune and

Really? by Nestafo · 2004-07-11 01:41 · Score: 2, Funny

Your web server can also be classified as an artificial living organism. But I ain't so sure about that living part anymore...

Why do we do what we do? by Quirk · 2004-07-11 01:42 · Score: 1

There must be a theory that explains why /. hordes hang out at a site carrying news for nerds they presumably want to know, but that becomes inaccessible simply by the sheer numbers of slashdotters.

--
"Academicians are more likely to share each other's toothbrush than each other's nomenclature."
Cohen

Re:Why do we do what we do? by ODD97 · 2004-07-11 01:43 · Score: 2, Insightful

Because we've realized that we don't have to read the article or understand the topic to post something here and get modded "Informative"?

--
The emperor is naked.

How is this news ? by janoc · 2004-07-11 01:45 · Score: 5, Informative

How exactly is this news ? It seems that the author of the neural network idea didn't do his homework - e.g. DSPAM already includes a neural network as an experimental classifier. And compared to the proposed C# solution, DSPAM is a widely used and mature product.

Regards, Jan

Re:How is this news ? by rsilvergun · 2004-07-11 06:22 · Score: 1

It's Sunday.

--
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/

Not new, not genetic, not A.I. -- it's Bayesian by orthogonal · 2004-07-11 01:46 · Score: 5, Interesting

Is Slashdot trying to jump the shark?

We already saw a plagiarized article green-lighted, and now this? Cmdr Taco, Slashdot was a brilliant idea of yours, and I love your site -- but that's because I have reasonably high expectations for it.

First, the submitter of this article has the email address jarhead4067@hotmail.com -- and so does the article's author.

Second, what is presented is not a genetic algorithm. The characteristics of the email to be considered to discover if the email is spam are finite and hard-coded -- and even the thresholds some characteristics must reach to qualify as spam are hard-coded:

    // This can be adjusted... Calculating the misspelled word ratio and
    // any Bayesian probability is time consuming
    if (stats.SpamProbability < .66)

A genetic algorithm is one in which the goal is hard-coded, different means of reaching that goal are generated, and the characteristics of the most successful are used to generate the next "generation"; this is repeated until the goal is reached.
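To make that definition concrete, here is a minimal sketch of a genetic algorithm in Python. Nothing in it comes from the article: the goal (a bit string of all ones), the population size and the mutation scheme are assumptions picked only to keep the example short.

```python
import random

def fitness(bits):
    # The fixed goal of this toy problem: as many 1-bits as possible.
    return sum(bits)

def evolve(length=16, pop_size=20, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        if fitness(pop[0]) == length:
            break  # goal reached
        parents = pop[: pop_size // 2]  # keep the most successful half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]          # crossover of two parents
            child[rng.randrange(length)] ^= 1  # mutation: flip one bit
            children.append(child)
        pop = parents + children               # the next "generation"
    pop.sort(key=fitness, reverse=True)
    return pop[0], history

best, history = evolve()
print(fitness(best), history)
```

Because the best half survives unchanged, the best fitness never drops from one generation to the next. The point is the loop of generate, select and breed, which is absent from a scheme that only trains a classifier on fixed statistics.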

But in this model, each "chromosome" contains statistics about one email. The heart of this model is to train a neural network with known emails ("chromosomes") and then test unknown emails ("chromosomes") against the network.

Neural networks have a checkered history in Artificial Intelligence research. A (very much simplified) model of biologic neurons, neural networks were for a time seen as a great hope for Artificial Intelligence. A neural network basically starts out with an array of input nodes and an array of output nodes, with each input node connected to each output. Each input corresponds to some characteristic of the items the network is trained with: for classifying animals, the inputs would be characteristic of animals, e.g., "furry", "bipedal", "feathered"; each output a classification, e.g., "mammal", "bird", "human".

To train the network, the input nodes are set to the characteristics of an item, and then the strength of the connection of those inputs to the correct outputs is increased (or that of other connections is decreased -- it's the same thing). With enough training, it's possible to isolate the salient characteristics from the ambiguous ones in a mechanistic way.
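A rough sketch of that training loop, using the classic perceptron update rule on the animal example. The three training animals, the learning rate and the number of passes are assumptions made up for illustration, and the rule shown is one common way to "strengthen or weaken connections", not necessarily what any particular system uses.

```python
FEATURES = ["furry", "bipedal", "feathered"]
CLASSES = ["mammal", "bird"]

# (input characteristics, desired outputs) -- hypothetical training items
TRAINING = [
    ([1, 0, 0], [1, 0]),  # dog
    ([1, 1, 0], [1, 0]),  # kangaroo
    ([0, 1, 1], [0, 1]),  # sparrow
]

def predict(weights, biases, x):
    # Each output node fires if its weighted sum of inputs is positive.
    return [
        1 if sum(w * xi for w, xi in zip(weights[o], x)) + biases[o] > 0 else 0
        for o in range(len(CLASSES))
    ]

def train(examples, epochs=10, rate=1):
    weights = [[0] * len(FEATURES) for _ in CLASSES]
    biases = [0] * len(CLASSES)
    for _ in range(epochs):
        for x, target in examples:
            output = predict(weights, biases, x)
            for o in range(len(CLASSES)):
                error = target[o] - output[o]  # +1 strengthen, -1 weaken
                for i, xi in enumerate(x):
                    weights[o][i] += rate * error * xi
                biases[o] += rate * error
    return weights, biases

weights, biases = train(TRAINING)
for x, target in TRAINING:
    print(x, predict(weights, biases, x), target)
```

On this tiny, cleanly separable data set the weights settle after a few passes and every training item is classified correctly.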

This is useful, but it was soon discovered that these simple neural networks, for certain sets of inputs, failed, because of overlapping categories: both birds and humans are bipedal, but only humans are also mammals. In a single layer neural network, the connection strength between input "bipedal" and output "mammal" would fluctuate, unable to describe humans or birds well. These problems can be alleviated by adding additional "hidden" layers of nodes between input and outputs, and by allowing "back-propagation" from output or hidden nodes to layers "previous" to them.

But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks. In any case, neural networks are no new thing.

Of course I have no idea if classifying spam is intractable or not, but I have to question whether using a neural network reliably can outperform Bayesian (or quasi-Bayesian) filtering. My guess is that since Bayesian filtering can judge email by the occurrence of single tokens ("words"), and not just "chromosome" statistics, and given that this "new" method also uses Bayesian filtering to generate one of those "chromosome" statistics anyway (and for only the most difficult to characterize emails to boot), this method itself probably mostly relies on its Bayesian sub-component.

So I'm a bit at a loss to see why this method is in any way revolutionary or even particularly interesting, or why it was green-lighted for Slashdot. Of course, I only gave the linke

--
Opinions on the Twiddler2 hand-held keyboard?

Re:Not new, not genetic, not A.I. -- it's Bayesian by Henry+Stern · 2004-07-11 03:10 · Score: 1

I think that you're jumping the gun a bit on your accusations. Perhaps, as you admit in your last paragraph, you should have read the article a bit more carefully before writing your response.

I can understand your premature conclusion that he is talking about using genetic algorithms from his biological metaphors, but I didn't see any actual mention of them. He's just using a funny name for features.

I wouldn't dismiss neural networks in the way that you do. People did put a lot of hope in the perceptron when it was first invented and did lose faith for about 30 years when it was shown that it couldn't be used to separate XOR. However, multi-layer networks and kernel functions have helped them regain their utility.

Lastly, I wouldn't get too hung up on his use of the output of another classifier (naive bayes text classifier) as an input to his neural network. That is, after all, exactly the same thing as what happens between the hidden layer and the output layer. We do the same thing with SpamAssassin and it works out very well for us.
Re:Not new, not genetic, not A.I. -- it's Bayesian by Anonymous Coward · 2004-07-11 03:16 · Score: 1, Funny

Cmdr Taco, Slashdot was a brilliant idea of yours, and I love your site -- but that's because I have reasonably high expectations for it. (Italics mine)
Really? Why?
Re:Not new, not genetic, not A.I. -- it's Bayesian by Epistax · 2004-07-11 03:26 · Score: 2, Interesting

You had a good piece on neural networks in there so I thought I'd reply about my own experiences. I've made a few networks from scratch in C++ and tried to train them on a few things. From the problems I was having I came to the conclusion that we're training these analog thinkers to solve digital problems, and it's not working so well. Is this a mammal? That's a yes or no question and it is hard to teach a network to answer it. I think neural networks are much better at doing things such as "which". Which animal has the most "mammal essence". One thing I am thinking about doing is giving a cross-sectional view of a city and asking which building is the tallest. I think a network would be much better at answering that.

Another problem is the physical aspect: how many neurons does it take, how should they be linked, and can new ones be grown to solve the problem? I think the 2nd problem is very important. Will every problem be a straight shot input to output 2d map of neurons or will there be backwards traversal? Will these systems settle on a given output or be constantly slightly changing? If you look at an object and decide what it is, your mind will start making things out of it. If you ask a neural network what animal something is and show it a house cat, it's not at all incorrect for it to come up with "Lion" after selecting "Cat". The network is simply thinking about what it is seeing. Again, this implies feedback. I remember seeing one basic model of a neural network where every output node was also an input node. This is a good start but it assumes that no internal thoughts loop back which I believe is incorrect.
As for other issues.. how many neurons? may more grow? I suppose if we truly want a system to be completely organic then we want to start with just the input and output nodes. Let the network figure out that it can't figure it out, and try to guess at the best places to add neurons. I don't know if this has already been done, but I think it's safe to say it hasn't been done well.

I am very interested in this subject and being a computer engineer (er, in school) I am really looking forward to the hardware that can be designed using neural networks for processing.
Re:Not new, not genetic, not A.I. -- it's Bayesian by julesh · 2004-07-14 03:51 · Score: 1

Neural networks have a checkered history in Artificial Intelligence research.

Largely because most people don't understand how they work.

[vastly simplified description of how a single-layer perceptron works snipped]

This is useful, but it was soon discovered that these simple neural networks, for certain sets of inputs, failed, because of overlapping categories: both birds and humans are bipedal, but only humans are also mammals.

That's not why single layer perceptrons fail at all. In fact, a perceptron can cope with independent categorisations very easily. The problem is that they can only make a decision based on a single 'cut' across the input vector-space, meaning that any classification problem with exceptions, or a non-linear rule, cannot be solved by a single layer network. Most real-world classification problems fall into at least one of these categories, so work on perceptrons was abandoned until the development of back-propagation, the first good method for training a multilayer network.
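The single 'cut' limitation is easy to see with XOR, the textbook non-linear rule. The sketch below is an illustration under assumptions: the weight grid is arbitrary, and the two-layer weights are set by hand rather than learned.

```python
from itertools import product

def step(x):
    return 1 if x > 0 else 0

XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def single_unit(a, b, w1, w2, bias):
    # One threshold unit = one straight cut through the input space.
    return step(w1 * a + w2 * b + bias)

def single_unit_solutions(grid):
    # Try every weight combination on the grid; keep those that get XOR right.
    return [
        (w1, w2, bias)
        for w1, w2, bias in product(grid, repeat=3)
        if all(single_unit(a, b, w1, w2, bias) == t for (a, b), t in XOR_CASES)
    ]

def two_layer_xor(a, b):
    # Hidden layer: one unit computes OR, the other AND (hand-set weights).
    h_or = step(a + b - 0.5)
    h_and = step(a + b - 1.5)
    # Output: "OR but not AND", which is XOR.
    return step(h_or - h_and - 0.5)

grid = [x / 2 for x in range(-6, 7)]
print(single_unit_solutions(grid))                   # no single cut works
print([two_layer_xor(a, b) for (a, b), _ in XOR_CASES])
```

The grid search only samples weights, but the result holds in general: no single threshold unit can separate the XOR cases, while one hidden layer is enough.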

These problems can be alleviated by adding additional "hidden" layers of nodes between input and outputs, and by allowing "back-propagation" from output or hidden nodes to layers "previous" to them.

But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks.

Do you have a reference to that conclusion, or are you just making it up? In fact, it has been mathematically shown that, theoretically at least, a 2 layer perceptron can solve any input classification problem. The only remaining task is to determine how to train such a network for any given problem -- this is hard, but I don't think it has ever been proven to be impossible.

How is this any different... by Fooby · 2004-07-11 01:46 · Score: 5, Interesting

from SpamAssassin? It takes a bunch of rules, applies them, and uses a neural net to classify the message. Seems to me SpamAssassin does the same thing, only is more mature and extensible and uses a genetic algorithm rather than a back-propagation neural net.

Re:How is this any different... by janoc · 2004-07-11 02:35 · Score: 1

SpamAssassin does not do any neural networks. It just matches rules against the mail and at the end totals the scores assigned by the rules. If the total is higher than some (arbitrary) threshold, mail gets tagged as spam. That's all. (ignoring Bayesian classifier in SA for now, however that is also treated just as a special case of a rule)
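As a rough illustration of the "match rules, total the scores, compare to a threshold" scheme, here is a sketch in Python. The rule names, patterns, scores and threshold are all invented for this example; they are not real SpamAssassin rules or defaults.

```python
import re

# Hypothetical rules: (name, pattern, static score).
RULES = [
    ("MENTIONS_MORTGAGE", re.compile(r"\bmortgage\b", re.I), 1.5),
    ("ALL_CAPS_SUBJECT", re.compile(r"^Subject: [^a-z]+$", re.M), 2.0),
    ("CLICK_HERE", re.compile(r"click here", re.I), 1.0),
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.0),
]
THRESHOLD = 3.0  # arbitrary cut-off for this sketch

def score_message(text):
    # Collect every rule that matches, then total the scores.
    hits = [(name, score) for name, pattern, score in RULES if pattern.search(text)]
    total = sum(score for _, score in hits)
    return total, [name for name, _ in hits]

def is_spam(text):
    total, _ = score_message(text)
    return total > THRESHOLD

spam = "Subject: LOWER YOUR MORTGAGE NOW\n\nClick here!!!"
ham = "Subject: lunch on Tuesday\n\nSee you at noon."
print(score_message(spam), is_spam(spam))
print(score_message(ham), is_spam(ham))
```

The scores never change at run time, which is the contrast with learned weights drawn in the next paragraph.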

Neural networks do not use any rules - they work on feature vectors extracted from the input and send them through something like a state machine on steroids (I am sorry for AI people who cringe at this comment now ..) where each transition has an assigned weight. The weights are then used to decide whether the mail is spam or not. These weights are in turn assigned by a learning process and adapted continuously, whereas the SpamAssassin rules have static scores, which do not change.

Statistical filters (e.g. Bayesian schemes) work in a similar way - they compute a "spaminess" of each token (probability whether the token is more likely to appear in spam or clear mail) according to a database created by training. From the probabilities of the tokens, probability of the whole message being spam is computed - the formula for that is called Bayes formula, therefore the name of the method. If the probability is higher than a threshold, message is tagged.
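A minimal sketch of that computation, assuming the simple "naive" combination popularised by Paul Graham's filter; real filters differ in smoothing, token selection and clamping. The counts in the usage lines are made up.

```python
from math import prod

def token_spamminess(token, spam_counts, ham_counts, n_spam, n_ham):
    # Relative frequency of the token in the spam and ham training sets.
    s = spam_counts.get(token, 0) / n_spam
    h = ham_counts.get(token, 0) / n_ham
    if s + h == 0:
        return 0.5  # never seen before: no evidence either way
    p = s / (s + h)
    # Clamp so one token can never make the result absolutely certain.
    return min(0.99, max(0.01, p))

def message_spam_probability(token_probs):
    # Bayes' formula under the assumption that tokens are independent.
    spam_side = prod(token_probs)
    ham_side = prod(1.0 - p for p in token_probs)
    return spam_side / (spam_side + ham_side)

spam_counts = {"viagra": 40, "meeting": 1}
ham_counts = {"meeting": 30}
probs = [token_spamminess(t, spam_counts, ham_counts, 100, 100)
         for t in ["viagra", "meeting"]]
print(probs, message_spam_probability(probs))
```

The message is tagged when the combined probability exceeds whatever threshold the filter is configured with.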

Jan
Re:How is this any different... by Anonymous Coward · 2004-07-11 02:42 · Score: 0

I think it's great that people are spreading new ideas about classification, even if the implementation isn't new and I don't see how this specifically tackles the (undoubtedly) evolutionary nature of spam. Go that man Jarhead.

Personally, I would like to see more user input to the spam ID process. The problem: I have been training my filter for a long time and now it seems unresponsive to new spam types. I'll admit, it's been a while since I worked with MLPNNs, but doesn't the growing example base dilute the effects of newly added examples? My proposal: nothing beats human adaptation time, so if I start getting a crop of spam about mortgages (I don't have one and don't want one) I would like to tell the filter (preferably easily) that anything containing that word or any misspelled variant is 99% probably spam. That way the filter does not have to add the message to the existing (growing) example base, and then consider all of the message properties, but can quickly set up a new rule targeted at a single word at high probability of it being spam.

On an associated point, how about automajically considering all commonly misspelled variants of an irritating word as high probability spam markers. Workers in human perception must have an idea of how we can interpret a misspelled word as what it really is. Swopping @ for 'a' is the simplest example I can think of. Otherwise there is always the approach of brute forcing it with exemplars and a sub-net.
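Something like the following could serve as a first cut at both ideas: a user-supplied word rule, plus a crude normaliser for look-alike substitutions. The substitution table and the 0.99 figure are assumptions taken from the examples in this comment, not from any existing filter.

```python
# Hypothetical look-alike table; a real one would need to be much larger.
LOOKALIKES = str.maketrans({"@": "a", "0": "o", "1": "i", "3": "e", "$": "s"})

def normalize(token):
    # Undo common character swaps, then drop anything that is not a letter.
    swapped = token.lower().translate(LOOKALIKES)
    return "".join(ch for ch in swapped if ch.isalpha())

def user_rule_probability(text, banned_words, rule_probability=0.99):
    # Returns the user's fixed probability if any banned word (or an
    # obfuscated variant of it) appears, otherwise None (no opinion).
    for token in text.split():
        if normalize(token) in banned_words:
            return rule_probability
    return None

print(normalize("m0rtg@ge"))
print(user_rule_probability("Cheap M0RTG@GE rates", {"mortgage"}))
print(user_rule_probability("lunch at noon", {"mortgage"}))
```

This only catches character swaps and inserted punctuation; dropped or transposed letters would need an edit-distance check or the brute-force exemplar approach mentioned above.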

I also think that spam hyperlinks should be stored and blacklisted for my browser, so if I ever click on a bogus link I get a warning that the target site (or associated domain) has been tagged as a spammer site/domain. I don't know how often spammers change their sales sites, but it would be an extra inconvenience for them to have to do so as their sites got blacklisted after their previous failed scam. If there is the added penalty at the domain level, that will provide an incentive for domains to clean up their act. A message reading "This domain is known to be associated with spammers" would be rather nasty advertising. Invoking the (unlikely) wonders of collaboration, an internet repository of suspected spammer addys, downloaded to my browser would clearly help.
Re:How is this any different... by Henry+Stern · 2004-07-11 03:17 · Score: 1

The difference between this and SpamAssassin is that he uses a multi-layer neural network where we use a single-layer neural network. His feature space is a bit more expansive: he uses a lot of features that don't indicate a message being spam on their own.

The first thing that I did when I became involved with SpamAssassin was to replace the old genetic algorithm-based score learning tool with one that uses error backpropagation. It only takes a few seconds to run as compared to a few days for the old GA and it consistently finds better solutions. Look at masses/perceptron.c and masses/README.perceptron in the SpamAssassin SVN repository if you're interested in more about what I've done.

See my full response for more details.

Re:Using AI for preventing slashdotting. by Anonymous Coward · 2004-07-11 01:54 · Score: 0

get a life. believe in christ. go to church already. it's sunday.

and you must be typing that message from the web terminal installed in the vestry.

It is so advanced by GarbanzoBean · 2004-07-11 01:54 · Score: 1

That it not only recognizes spam but also Slashdot visitors.

Eco system by Anonymous Coward · 2004-07-11 01:57 · Score: 0

2. Spam lives within an eco-system, and we're its food

Yummy! Slashdotters for breakfast!

ANNs, Artificial Neural Networks by Biotech9 · 2004-07-11 01:57 · Score: 0

Neural networks are basically many interconnected neurons, which are mathematical functions performing a weighted sum of the inputs and delivering an output through a non-linear transfer function. They are used in a research project I have worked on to determine data patterns in 'electronic noses'.
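In code, one such neuron is only a few lines. This sketch assumes the logistic sigmoid as the non-linear transfer function; other choices (tanh, for instance) are just as common.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs...
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # ...passed through a non-linear transfer function (logistic sigmoid).
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    # A layer is several neurons that all read the same inputs.
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

print(neuron([0, 0], [1, 1], 0.0))
print(layer([1.0, 0.0], [[1, 1], [0, 0]], [0, 0]))
```

Interconnecting them means feeding the outputs of one layer in as the inputs of the next.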

Link for info here

Personally, I think this article is very full of buzzwords, but using ANNs to pick out spam is very possible and would return an excellent catch rate, especially if the data collected from being 'trained' could be collected and refined for making better networks.

I'm not sure how 'up for it' home end-users would be, it may take a lot of hardware to run (we use a shit load of Xserves, but maybe that's overkill).

It could be great for actual mail servers, like gmail for example.

Entirely bogus by Anonymous Coward · 2004-07-11 01:57 · Score: 3, Informative

The entire concept is quite ridiculous.

The guy proposes picking nine well-known indicators of spam, ones that could be (and often are) implemented in rule-based spam checkers, then proposes we use a neural network to evaluate a message based on these metrics.

Problems:

1) If you detected spam indicators, this is indicative of spam, no? The whole "fancy" bit of this technique is thus needless.

2) These indicators are not inherent to spam, just represent most current bypassing / obfuscation techniques. If you filter them out, they'll evolve. There is nothing that makes his spam filter follow the arms race.

Re:Entirely bogus by Anonymous Coward · 2004-07-11 03:57 · Score: 0

Did you read the article? Ummm, he states that there are probably other indicators out there, that this is just a starting set... and also, these indicators can adapt over time as spam adapts. With a Bayesian filter, a spell checker, and the other traits, there's only so much the spammers can do to beat filters. I think he stated that a central server should exist so new, adapted versions of this can be downloaded.

I say turn it around by zogger · 2004-07-11 02:00 · Score: 1

Is it really little old ladies who keep spammers going, or is it porn surfers? I don't get much spam anymore, from hardly ever using email or giving out my email addy and moz's built in filter, but from what I remember the spam I used to get was way more porno, porno with anatomy enhancements, porno and viagra, then it fell down the list, toner carts, home mortgage refinance, cheap semi legal drugs, etc.

Anyway, I tend to like my basic idea. Regular email is a default setup of accept everything, then struggle to try to filter out the junk. I prefer universal whitelisting/blacklisting only: have a default setup with email clients to ban all email, that's the default blacklisting, and only let in who you want on an addy by addy basis, that's the whitelisting. If spam wasn't the predominant email, you wouldn't need to do it that way, but it is, so just recognizing that fact means you should go to a default blacklisting. What we are trying to do now is 180 degrees backwards from what logic dictates. With spam and Bayesian schemes, you are struggling and trying to lock the barn door after the horse gets out; that didn't work well way back then, ain't gonna work well now.

You get around "first contact" issues in this system I propose with businesses and mail lists who need access to "new" people regularly by using a webform (cheap/fast/works good enough) or the telephone (for very important transactions) as the first contact.

A next step might be... by panicboy · 2004-07-11 02:02 · Score: 1

Feedback from individual spamfilters is used in realtime to throttle back bandwidth on high-probability spam generators. Not cut 'em off, just slow them down to the point where it'll take weeks to send out their evil payloads. Aggregated spam statistics are used to update individual filters. A reputation system develops to weight individual spamfilter results.

for crying out loud, people . . . by nusratt · 2004-07-11 02:05 · Score: 1

. . . it's Sunday morning. Get off slashdot, so I'll stop getting this damn
HTTP 403.9 - Access Forbidden: Too many users are connected

Problem: I can't access the article.
Solution: It's Sunday morning. Why don't all of you geeks get off slashdot, and get a life.

(Of course, this doesn't apply to *me*:
I *had* a life once, but demonstrated that I'm incapable of maintaining it.)

Re:for crying out loud, people . . . by Anonymous Coward · 2004-07-11 02:18 · Score: 0

It's Sunday morning. Why don't all of you geeks get off slashdot, and get a life.

Uh, not in my timezone.

Besides, what's wrong with swinging by slashdot when you stagger home at 4am? (assuming you're alone, that is.)
Re:for crying out loud, people . . . by Anonymous Coward · 2004-07-11 04:23 · Score: 0

It's Sunday morning. Why don't all of you geeks get off slashdot, and get a life.

shhh... stop making a racket in the House of Slashdot on our holy day! Ser Taco is about to give the eulogy.

AIDS by clambake · 2004-07-11 02:09 · Score: 1

What I note is missing is how to deal with the spammers attacking the network using its own techniques against itself. For example, flipping the ham/spam caches so that "good" mail is classified as spam and spam email is classified as good mail.

Without knowing EXACTLY who is participating in your network, there is no way to guard against this... and once you solve the problem of knowing exactly who is participating, then why not just use that as your uber whitelist?

Don't bother reading this article... by Monkelectric · 2004-07-11 02:16 · Score: 4, Funny

It is *terrible*. Briefly: the author invented a rule based method for classifying email, and then added a few parameters so he could call it a "learning algorithm". As if adjusting the ratio of links to words will allow you to detect spam, then he seems to throw in a Neural Network for no reason.

I think about the only good thing I can say about this article is, at least he's not out killing puppies.

--

Religion is a gateway psychosis. -- Dave Foley

Re:Don't bother reading this article... by Anonymous Coward · 2004-07-11 03:47 · Score: 0

HAHHAHA Dude, you really need to read all of the article, because it's apparent you don't have a freakin' clue what he's talking about. He's basically creating statistics on multiple features within an e-mail, and then using a MLP to learn those features.

Don't throw crap when you don't know crap.

Overly aggressive by Anonymous Coward · 2004-07-11 02:16 · Score: 0

"Up until recently, most researchers in the fight against spam have failed to classify it as an artificial living organism, hindering the development of effective tools and techniques to kill it."

That's because spam isn't an artificial living organism. It's a bunch of emails trying to sell stuff to fools. If you think that it's a useful analogy to draw, then by all means do it, but don't blame everybody else for hindering anti-spam development simply because they didn't consider it and you think it's so cool.

how is this new? by martin-boundary · 2004-07-11 02:17 · Score: 1

Er, how is this idea new? SpamAssassin already does it, and has always done it. The "markers" are simply called rules.

Moreover, the proposed idea of using a central server to coordinate and select rules doesn't work, because everybody gets the same rule sets sent to them and the spammers work out how to bypass them. Bypass one, bypass them all.

Re:how is this new? by Anonymous Coward · 2004-07-11 03:50 · Score: 0

No... the central server idea would replace updated versions of the network, as it adapts to and learns new traits. And then, based on the users' filtered messages, would rebuild its feature space.

Ham filtering by skinfitz · 2004-07-11 02:18 · Score: 4, Interesting

I've given up on Spam filtering and concentrating my efforts on Ham filtering.

Basically the present thinking is based on attempting to filter spam out - I would argue that given the amount of variables involved, it is a method doomed to failure. Current methods also assume that the incoming mail is mostly valid, and are attempting to remove the undesirable parts - spam.

What I am having success with is turning this on its head and assuming that the bulk of incoming mail is bad, and filtering in messages that I want.

The way I am doing this is to use my address book as a whitelist - if an incoming message originates from someone in my address book, then it's delivered into the inbox. If not, then it is moved into a "not in address book" sub folder. Anything my ISP's SpamAssassin-based filtering marks is sent into the "Spam" folder. Doing it this way means that I am only notified of incoming mail that is confirmed from someone in my address book. Periodically I check the other folders (obviously).
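A sketch of that routing logic in Python, under some assumptions: the ISP's filter is taken to mark spam with an `X-Spam-Flag: YES` header, the address book check is assumed to take precedence over that flag, and the folder names simply follow the description above.

```python
from email.utils import parseaddr

def route(headers, address_book):
    # headers: dict of header name -> value
    # address_book: set of lower-cased addresses
    _, sender = parseaddr(headers.get("From", ""))
    if sender.lower() in address_book:
        return "Inbox"   # known sender: the only case that notifies me
    if headers.get("X-Spam-Flag", "").upper() == "YES":
        return "Spam"    # marked upstream by the ISP's filter
    return "Not in address book"

book = {"alice@example.com"}
print(route({"From": "Alice <alice@example.com>"}, book))
print(route({"From": "x@spam.example", "X-Spam-Flag": "YES"}, book))
print(route({"From": "Bob <bob@example.org>"}, book))
```

The From header is trivially forged, so this sorts mail by claimed sender only; that is the guessing game described in the next paragraph.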

We have come to the point I think where the number of variables involved makes filtering in a less intensive process than attempting to deal with the myriad of underhanded techniques that spammers use. By limiting the mail I want to people in my address book, I make it so that spammers are the ones having to deal with the variables as they would have to guess addresses in my address book. If lots of people started filtering like this then we would see spammers using known bulk mail addresses (such as the address iTunes receipts are mailed from); however, we can simply alter the filter to include the originating IP / mailer and so on.

Think of it like fishing - you wouldn't attempt to control an entire ocean and remove the water to leave the fish - you accept that the water is there and develop techniques to get the fish out.

Re:Ham filtering by Anonymous Coward · 2004-07-11 02:44 · Score: 0

There is still a flaw in your filter. You will not be able to filter out new viruses/worms/trojans from your friends who were infected by these new viruses/worms/trojans. Note that I'm talking about new viruses that are not detectable by your AV and ISP.
Re:Ham filtering by droleary · 2004-07-11 03:06 · Score: 1

Periodically I check the other folders (obviously).

If you think that is an obvious step, then you haven't found an actual solution to spam. I'm tired of everyone and their mother coming out with half-assed filtering schemes that do nothing more than shuffle off probable spam into a special place that you still have to look through to avoid possible misclassification. So you are left searching for needles in proportionally larger haystacks. That may work reasonably well for the email traffic you have, but I think you'll find that that kind of solution simply doesn't scale. I get thousands of spam every day along with the handful of good messages, and the only way to really manage that given the current state of email is a blocklist at the server level issuing a slew of 550 rejects.
Re:Ham filtering by david.given · 2004-07-11 03:53 · Score: 2, Informative

Basically the present thinking is based on attempting to filter spam out - I would argue that given the amount of variables involved, it is a method doomed to failure. Current methods also assume that the incoming mail is mostly valid, and are attempting to remove the undesirable parts - spam.
The problem with this approach is that you run the risk of throwing away ham. Because you're starting with mixed spam and ham, and you're picking out the ham, you don't know for sure that what's left is pure spam. Traditional approaches are safer, because they take mixed spam and ham and throw away only what is known to be spam. Therefore (unless the spam selection process is overeager) they won't throw away ham.
(I feel hungry now...)
I use a greylister. It's brilliant. It reduces the amount of spam I get from about 100 to 150 messages per day to about 5 --- and because it does this before the messages are transferred to my machine, I don't even get the overhead of running them through spamassassin or even my MTA.
Greylisting implements the old sender-pays spam filtering system by exploiting the SMTP system. It requires messages to be sent twice: the first time it's rejected with a try-again-later reply. This makes it the sender's responsibility to store the message and resend it --- this is the cost. As most spam engines aren't real SMTP servers, they usually don't bother to retry. Real messages, however, will arrive about half an hour late. (You then implement lots of optimisation so that you don't bother greylisting messages from known good senders, etc.)
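The core decision can be sketched in a few lines of Python. This is an illustration only, not the poster's implementation: the (client IP, sender, recipient) triplet key, the 300-second minimum delay and the use of reply code 451 for the temporary rejection are assumptions based on how greylisting is commonly described.

```python
class Greylist:
    def __init__(self, min_delay=300, whitelist=()):
        self.min_delay = min_delay        # seconds before a retry is accepted
        self.first_seen = {}              # triplet -> time of first attempt
        self.whitelist = set(whitelist)   # known good senders skip the delay

    def check(self, client_ip, sender, recipient, now):
        """Return an SMTP reply code: 250 to accept, 451 to say try later."""
        if client_ip in self.whitelist:
            return 250
        triplet = (client_ip, sender, recipient)
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return 250   # a real mail server stored the message and retried
        return 451       # temporary failure: the cost lands on the sender

g = Greylist()
print(g.check("192.0.2.1", "a@example.org", "me@example.com", now=0))
print(g.check("192.0.2.1", "a@example.org", "me@example.com", now=60))
print(g.check("192.0.2.1", "a@example.org", "me@example.com", now=1800))
```

A production version would also expire old triplets and persist the table across restarts; both are left out here.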
Advantages? It's highly effective. It's completely standards-compliant. It's 100% safe; it won't lose ham unless an upstream mail server goes wrong. It can work before the message body is transmitted. It works against a lot of Outlook Express email viruses too. And, best of all, it's completely invisible to both sender and recipient: set it up, get it going, and it Just Works.
If you're interested, I strongly recommend the one wot I wrote<BLATANT ADVERTISING/>, because it's simple to set up and works on any MTA, but there are lots more around --- the earlier link is a major resource.
Re:Ham filtering by skinfitz · 2004-07-11 04:56 · Score: 1

There is still a flaw in your filter. You will not be able to filter out new viruses/worms/trojans from your friends who were infected by these new viruses/worms/trojans. Note that I'm talking about new viruses that are not detectable by your AV and ISP.

I am aware of this, and that isn't a flaw; firstly my goal is to only notify me of mail from people in my address book - not to catch viruses - for this it works perfectly. Secondly I have virus scanning at my ISP, finally, I use a Mac for my email so even if something does get through, until there is a Mac virus then I'm safe for the time being.
Re:Ham filtering by skinfitz · 2004-07-11 05:14 · Score: 1

I didn't say it was a solution to spam - I don't think such a thing exists - what I am saying is that it is working much better for me than trying to filter out spam.

I've been sitting here for the last few days feeling much better about my email as I am only receiving notification when I receive messages from people I definitely want to receive messages from.

Obviously as it stands this would not scale very well, however the concept is one of variables, not simply using a white list.

For example, most of the junk I see lately uses mis-spellings or lists of words at the end of the message to defeat spam filters; this works as you can spell words quite badly - i.e. v'.1'a'g#ra and we humans can see what it means, however try teaching a machine to catch things like that - most difficult.

What we could do for example is get the machine to check spelling and grammar, and attribute a negative weight (i.e. less likely to be spam) to an email depending on how many words are spelt correctly with "good" grammar. Now obviously not everyone can spell and has good grammar, but most people have spelling checkers and so on and know how to be able to spell if they want to. (This could provide motivation for people to spell correctly!) The spammers could try to mis-spell their crap as much as they wanted - it would start to work against them rather than for them. We need to force them to conform to us, not attempt to adapt to them. If they have to spell words correctly then they are going to get nailed by filters. Yes I know grammar checking would start to get complicated but my point is variables, regardless of what those are; "ham" is easier to spot than "spam" because it has less to "go wrong" as it were. There are less variables involved in spotting ham than spam.
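As a sketch of the spelling half of that idea (grammar checking is left out entirely): score a message by the fraction of its words found in a dictionary and turn that into a negative adjustment. The tiny word set, the tokenising and the maximum bonus of 2.0 are placeholders invented for the example.

```python
import string

def correctly_spelled_ratio(text, dictionary):
    # Split on whitespace and strip punctuation from the ends of each token.
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in dictionary)
    return known / len(tokens)

def ham_adjustment(text, dictionary, max_bonus=2.0):
    # Negative weight = less likely to be spam, scaled by spelling quality.
    return -max_bonus * correctly_spelled_ratio(text, dictionary)

WORDS = {"see", "you", "at", "noon"}  # stand-in for a real word list
print(ham_adjustment("See you at noon.", WORDS))
print(ham_adjustment("buy v'.1'a'g#ra", WORDS))
```

Well-spelled text earns the full bonus while the obfuscated token earns nothing, so the mis-spelling trick stops paying off.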
Re:Ham filtering by skinfitz · 2004-07-11 08:33 · Score: 1

I like the idea of this greylisting - it sounds perfect for my work BSD mail system.

Thanks for the info.

It appears to be implementing the concept I was originally posting about which is concentrating on filtering in mail rather than out - if mail systems behave appropriately then you are accepting. Sounds good and I will be taking a closer look when your site is responding!
Re:Ham filtering by gilgongo · 2004-07-11 09:33 · Score: 1

Greylisting is cool, but it *does* increase bandwidth use. Since we're recommending alternative systems, I think you should also look at tarpitting, and the excellent Spamcannibal in particular.

Spamcannibal uses black lists (any RBLs you want). Once it identifies a spammer it attempts to choke them to death by preventing packets from leaving their machine on port 25.

Running Spamcannibal means that you are contributing to a network that prevents spam from getting to you AND others.

Of course, it relies upon the RBLs to be correct, but I'm sure there's no such thing as a perfect solution for the problem of spam.

--
"And the meaning of words; when they cease to function; when will it start worrying you?"
Re:Ham filtering by david.given · 2004-07-11 10:37 · Score: 1

I like the idea of this greylisting - it sounds perfect for my work BSD mail system.
Greylisting is so simple and so effective --- it amazes me that so few people have heard of it! I originally wrote mine because my feeble P166 server was spending >10 seconds processing each message with SpamAssassin. Now it can reject spam before it even arrives...
Incidentally, as it's hosted on SourceForge, the site should damn well be responding. It's all visible from here. If you still can't get there, I think there may be connectivity problems somewhere.
Re:Ham filtering by skinfitz · 2004-07-11 11:20 · Score: 1

Aha - working now - I think it was the redirector to sourceforge you have.
Re:Ham filtering by Anonymous Coward · 2004-07-11 13:07 · Score: 0

That's pretty impractical and probably illegal - basically you are talking about breaking port 25 on any machine found in a blacklist. How many admins have found their network on some obscure blacklist occasionally, because of incorrect reporting, or in my case we are nat-ed behind a single IP, and it only took one trojaned client machine to get our mail server blacklisted. (fixed the firewall now so it doesn't happen, but regardless)
Re:Ham filtering by corngrower · 2004-07-11 14:35 · Score: 1

My ISP provides a spam filtering service on my email. For me, it works pretty well. About 5% of the mail that ends up in my inbox is spam and only about 1% of the mail that ends up classified as spam is something that wasn't. About 2/3 of the mail that I receive is spam. Even with the small amount of spam I get in my inbox, I can almost always tell it's spam by the subject or the address line. If I can tell it's spam it gets deleted without being opened. If it's in the spam box, it gets deleted without being opened, unless I recognize the sender as someone that I want email from.

The spam filter is a classifier that you can provide feedback to in order to train it. When you open a message, it shows up on a screen that allows you to click on either of two buttons to tell the classifier whether or not the message was spam. If it starts misclassifying too many messages, I update its training with several messages.

This spam filter works for me. It's provided by the ISP so I don't know the name of the package really.

Re:Who talks like this? by nusratt · 2004-07-11 02:35 · Score: 1

post-docs, that's who. ;-)
(see bottom paragraph)

Advertising and Self-Image by Jonathan Quince · 2004-07-11 02:35 · Score: 1

[Billboards] essentially make me feel inadequate. Billboards make me feel poor, because I can't afford a new home, or a meal at that expensive restaurant. Spam makes me worry that my penis is too small, my breasts are too small, I'm too fat, I don't send enough money to Nigeria.

I'm still groggy with the earliness of the hour, so I'll bite here and assume that you're being serious.

The answer is simple: Don't allow your self-image to be formed by other people, particularly low-lifes such as spammers. Seriously, do you give two hoots what a spammer thinks of you? Particularly when this is:

Someone who has never met you, and is not even writing to you personally about your penis or breasts, but rather is sending a mass mailing that will reach porn stars who are hung like horses and FF-cup women the same way it reaches you;
Complete and total scum of the earth who is willing to send spam for money (or in an attempt at making money, which is unsuccessful more often than not);
umm... did I say... a SPAMMER? Do you also care what KKK-type skinheads and convicted child molesters think of you?

it's rare that I even see spam anymore. If everyone would use these filters, spam would no longer be as profitable.

Unfortunately, most people aren't so successful with filters, particularly if they cannot tolerate any false positives at all. Even for those who don't have their own mailservers, every decent ISP nowadays offers server-side filtering -- but it is far from perfect, and I doubt you'll find too many people who claim that filtering even comes close to eliminating the spam problem.

--
Microsoft Windows is, fittingly, the official Desktop OS of Olig

Re:Advertising and Self-Image by bstone · 2004-07-11 03:16 · Score: 1

Unfortunately, most people aren't so successful with filters, particularly if they cannot tolerate any false positives at all. Even for those who don't have their own mailservers, every decent ISP nowadays offers server-side filtering -- but it is far from perfect, and I doubt you'll find too many people who claim that filtering even comes close to eliminating the spam problem.

I note that the article doesn't mention false positives at all. It just mentions one false negative per 1000 emails. I would think that most people are far more concerned about keeping false positives at zero. If some spam gets through, that's a price they're willing to pay (assuming most gets filtered), as long as they still receive their "real" email.

You are all mouth. by Anonymous Coward · 2004-07-11 02:40 · Score: 0

(http://slashdot.org/~October_30th/journal/)

SAYS"
consider the following...

Who talks like this? Really. "

BUT AT THE SITE LINKED ABOVE, JOURNALS"

All things considered, a very pleasant visit.

As usual, my only regret is that I couldn't speak the local language beyond a few simple words I picked up in the first few days. Having to resort to English - which is not my native tongue anyway - always annoys/embarrasses me to no end when traveling.

Isn't it about time for star trekish universal translator gadgets already (and let's not forget about all those flying cars we were promised decades ago)... "

YOU ARE ALL MOUTH.
COME TO THINK OF IT, THAT ONLY MEANS YOU FIT RIGHT IN 'ROUND HERE.

Re:This guy may take spam a little too seriously.. by MURL · 2004-07-11 02:50 · Score: 1

I just got a SPAM 2 days before it was sent...

Subject: rattlesnake 180 piroshki
From: "Randell Workman" <mrqahlst@yahoo.com>
Date: Tue, July 13, 2004 4:49 am

--
--- Have you seen MURL?

Poor Al by Anonymous Coward · 2004-07-11 02:50 · Score: 0

How does Al have time to read all that spam? And if that wasn't bad enough, now he's been slashdotted.

This Guy's an Idiot by magefile · 2004-07-11 02:51 · Score: 3, Informative

For starters, he thinks Internet is short for "INTERnational NETwork" as opposed to a NETwork between entities (vs. network within an entity: intranet).

Then, his criteria:
Is the format of the e-mail HTML?
This is not a bad criterion.

Is the e-mail formatted in valid HTML?
Have you ever seen a commercial program (esp. word, used by Outlook) generate good, 100% valid HTML?

Is the e-mail encoding base64?
No argument here. Unless base64 could be confused with Unicode - don't think so, but not sure.

Does the e-mail contain image links?
Does the e-mail contain "hidden" text that the user cannot see?
Heck, yeah, block it.

Does this e-mail have a large number of recipients?
Most of the spam I get has fewer than 5 recipients, and a lot of my mail is from a listserv with more than 5 recips.

What's the ratio of links to words in this e-mail?
I generally see only one or two links in my spam. Although I do see zero links in most of my ham.

What's the ratio of misspelled words to words in this e-mail?
Dear lord, no. This is a worthless criterion. Maybe if you looked for a ratio of non-letters (@, |, etc) to letters, but not spelling.

What's the Bayesian spam probability of this e-mail?
WTF does this have to do with AI?

Basically, he's stated the obvious, then made some really idiotic assumptions. Plus a shitload of spelling and grammar errors.
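
The alternative suggested above for the misspelling criterion, a ratio of non-letters to letters, might look something like this. It is a minimal sketch: the function name and the choice to ignore digits and whitespace are assumptions, not anything taken from the article.

```python
def symbol_ratio(text):
    """Ratio of punctuation/symbol characters to letters in a message."""
    letters = sum(1 for ch in text if ch.isalpha())
    # Digits and whitespace count as neither letters nor symbols here.
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if letters == 0:
        return float("inf") if symbols else 0.0
    return symbols / letters
```

Ordinary prose scores close to zero, while obfuscated words such as "v|@gr@" score high because symbols replace letters.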

A few problems ... by Titusdot Groan · 2004-07-11 02:52 · Score: 1

I'm not sure how much I trust a spam solution from somebody who doesn't have the mathematical ability to understand the Slashdot effect but here goes anyway ...

From a life form analogy perspective Spam is not evolutionary, it's more an example of intelligent design.

The problem with the proposed method of detecting spam is that spam changes often. It is mutated to get by Spam Assassin, Brightmail and Spam Bayes. This is just another attempt to get ahead of the spammer on the treadmill.

You need to change the tokenizer regularly, you need to handle invisible ink, etc. etc.

This solution has the added difficulty of the training of the neural network -- how long does that take? Something like Spam Bayes starts recognizing new spam after a few messages of the new type.

spam disguised to fight spam by DumbSwede · 2004-07-11 02:52 · Score: 4, Insightful

Having read the article (from Maddog Batty's copy), I'm struck by 3 things:

1. While the author proposes some marvelous cure based on treating spam as an organism, he just lists traits that any spam filter can use, and which most probably do, though he would suggest that most don't. I fail to see how the artificial-life observation improves spam/non-spam determination from the list of traits he proposes filtering on.

2. The article reads like a sales pitch for the author's spam filter.

3. If 2 is true, and it is a sales pitch, then you have the irony of a very effective form of spam that makes it past the slashdot editors.

It's ALIVE!!!!

--
Letter To Iran

Bunk or at best, pseudo-science by Anonymous Coward · 2004-07-11 02:53 · Score: 0

Spam comes from people. People are organisms. People adapt so spam adapts. There is complexity but it follows from the source, there is no emergent behavior.

Killing spam by Alsee · 2004-07-11 02:58 · Score: 1

Have they tried Penicillin?

-

--
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.

Re:Killing spam by ketamine-bp · 2004-07-11 03:03 · Score: 2, Insightful

actually spam is very analogous to bugs (bacteria)..

spam filters kill spam,
antibiotics kill bacteria.

we have spam filters,
we have antibiotics.

the selection pressure posed to spam by spam filters makes spam harder to filter.
the selection pressure posed to bacteria by antibiotics makes them harder to kill.

we then have to develop other spam filters,
and likewise other antibiotics.

too much of a spam filter will result in an adverse effect because you filter ham out.
too much of an antibiotic will result in an adverse drug effect because of toxicity to human cells (e.g. nephrotoxicity, ototoxicity etc.)
Re:Killing spam by Anonymous Coward · 2004-07-11 06:17 · Score: 0

Why waste a good dose of Penicillin when probably a .44 Magnum to the spammer's and advertiser's head would certainly accomplish the same thing cheaper, quicker and return better results.

Go ahead, make my day!

Some comments by Henry Stern · 2004-07-11 03:00 · Score: 4, Insightful

If I were to sum up this approach, it would be SpamAssassin with a multi-layer neural network. I should mention that I maintain the tool that SpamAssassin is using to train its single-layer neural network for version 3.0, so I can honestly say that I have a fair amount of experience in this area.

I'm not too keen on Evans' use of the biological metaphors. I think that they only confuse the issue of what he is doing. I will use the standard terminology, features, from here on out.

What he is doing is finding a nonlinear decision surface between two classes using a universal function approximator. I will explain this in layman's terms.

Imagine a sheet of paper filled with multi-coloured dots where these dots are arranged in clusters and each cluster contains mostly the same number of dots. Starting with a simple example, imagine two clusters of dots, one blue and one red. Assume that you can draw a line that separates the two clusters. That line is called the decision surface. You would say that any new dot that would appear on one side of the line will be called red and the other blue. Any blue dot that appears on the red side of the line would be misclassified as red. This is referred to as a linearly separable problem.

Now, imagine a more complex arrangement of clusters where you can't draw a straight line to separate the red from the blue, but you can separate them using a curved line. This is called a nonlinearly separable problem.

Artificial neural networks are very good for representing these decision surfaces. They are constructed of one or more perceptrons. A perceptron uses an activation function and a transfer function to take a set of inputs and produce a single output. The most popular form of neuron uses a linear activation function and a sigmoid transfer function. The linear activation function is the sum of a set of weighted inputs, i.e. f(X) = sum_i w_i * x_i. The logarithmic sigmoid transfer function is g(x) = 1/(1+exp(-x)). The output of the perceptron for any given input is O(X) = g(f(X)).
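
The perceptron just described can be written out in a few lines. This is a plain sketch of the two formulas in the paragraph above, with function names chosen here for illustration and no bias term, since the formula as given has none.

```python
import math

def activation(weights, inputs):
    """Linear activation: the weighted sum f(X) = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(weights, inputs))

def sigmoid(x):
    """Logarithmic sigmoid transfer function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_output(weights, inputs):
    """O(X) = g(f(X))."""
    return sigmoid(activation(weights, inputs))
```

With all weights at zero the output is exactly 0.5, the midpoint of the sigmoid, whatever the inputs are.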

These perceptrons can be chained together in many different ways. One popular method is the multi-layer perceptron, where a set of neurons in the hidden layer process the inputs and pass on their outputs to the output layer where the final output is formed. I don't have a source for you, but it has been proven that, given a large enough hidden layer, the multi-layer perceptron is a universal function approximator.
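
As a rough illustration of the multi-layer arrangement above, here is a forward pass through one hidden layer and a single output neuron. The function names are made up for this sketch and bias terms are omitted to keep it short; a practical network would normally include them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(hidden_weights, output_weights, inputs):
    """hidden_weights: one weight list per hidden neuron.
    output_weights: one weight per hidden neuron, feeding the output neuron."""
    # Each hidden neuron processes the same inputs with its own weights.
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
              for row in hidden_weights]
    # The output neuron treats the hidden outputs as its inputs.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))
```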

As long as all of the transfer functions are differentiable, you can train a neural network using error backpropagation by gradient descent. I will leave it as an exercise to the reader to learn how it works, but I assure you that it is very simple. Machine Learning by Tom Mitchell has a good section on the subject, as does Fundamentals of Computational Neuroscience by Thomas Trappenberg.
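
For readers who want a feel for the update rule before opening either book, here is a deliberately reduced sketch: gradient descent on squared error for a single sigmoid unit, which is the step backpropagation applies at the output layer. It is not full multi-layer backpropagation, and the learning rate, epoch count, and OR training data are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

def train_unit(samples, learning_rate=0.5, epochs=2000):
    """Fit one sigmoid unit by gradient descent on E = 0.5 * (target - out)^2."""
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for inputs, target in samples:
            out = predict(weights, inputs)
            # Negative gradient of E with respect to the weighted sum;
            # this is where the derivative g'(x) = out * (1 - out) is needed.
            delta = (target - out) * out * (1.0 - out)
            weights = [w + learning_rate * delta * x
                       for w, x in zip(weights, inputs)]
    return weights

# Logical OR; the first input is a constant 1.0 acting as a bias term.
OR_SAMPLES = [
    ([1.0, 0.0, 0.0], 0.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 0.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]
```

The `out * (1.0 - out)` factor is the reason the transfer function has to be differentiable.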

Evans has identified a large set of features of e-mails, some of which on their own convey little or no information about whether an e-mail is spam. He trains the neural network to recognize the combinations of these features which can lead towards the conclusion that a message is or is not spam. While his approach is a good idea, I would hesitate to call it novel. Massey, Thomure, Budrevich and Long did a very similar experiment [3] where they used a multi-layer neural network with SpamAssassin.

While his approach is good, there are some downsides for widespread deployment that need to be addressed first. With a large feature set like he is using, you will probably need a lot of training data to find a good fit with a multi-layer perceptron. To train the single layer neural network for SpamAssassin 3.0, I'm using 160000 messages.

Also, as his own arguments show, spam adapts to spam filter technology. Most of the features that he presents in his whitepaper can be easily fooled by a spammer. They can deliberately manipulate these features to evade the spam filter b

Re:Some comments by Montreal Geek · 2004-07-11 05:58 · Score: 2, Interesting

I think you make a very good point, but given a large enough[1] training corpus, and being very conservative on the weight to assign to error backpropagation, wouldn't it be interesting to see if the decision hyperplane would be able to reshape itself quickly enough to include freshly "evolved" forms of spam as they appear? (Provided, of course, that those consist of variants on previous forms).
I agree, however, that your concern about constructed attacks against detection of specific features is a killer, as it stands. But given a large enough set of features to look for in both form and contents the task becomes increasingly more difficult (hence SpamAssassin's success), would that problem tend to eliminate itself?
I'm using SpamAssassin now, and I think its primary weakness is lack of combinatorial weighing. Feature X is worth n points independently of the presence of other features in the message (or not? I might just have never found how).
-- MG
[1] Where "large enough" is the usual hard problem.
Re:Some comments by rossjudson · 2004-07-11 06:04 · Score: 2, Interesting

What this really points to is the need to have a common framework that a variety of classifiers can operate within. Consensus classification, using diverse techniques, creates a statistical highwire for the would-be spammer to walk. Significant computation can be engaged to calculate email contents that have higher probabilities of fooling bayesian classifiers; fooling two radically different techniques with a single message is pretty hard.

I want to be able to think up a new trait or technique, push it into the framework on a "trial" basis and be able to see the results of it.

Having a domain that's been out there for some time now, I receive about 7k to 12k spam messages a day. Most of these are from zombied PCs broadcasting mail to a random name at an email address. Recently my bayesian classifier has been giving spam scores on these as low as 40%. I have my threshold set at 50%, I think, and I may be lowering it again.

These messages hold hundreds of non-words, together with creatively "uglified" versions of common spam words. The trait I'd like to check for is "ratio of words never seen in ham"; seems like a nice and sensible thing to look for.
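
The proposed trait, "ratio of words never seen in ham", could be prototyped along these lines. This is a sketch only: whitespace tokenization and the helper names are assumptions, and a real filter would reuse whatever tokenizer its Bayesian classifier already has.

```python
def build_ham_vocabulary(ham_messages):
    """Collect every lowercase token seen in known-good mail."""
    vocabulary = set()
    for message in ham_messages:
        vocabulary.update(message.lower().split())
    return vocabulary

def unseen_in_ham_ratio(message, ham_vocabulary):
    """Fraction of tokens in the message that never appeared in ham."""
    words = message.lower().split()
    if not words:
        return 0.0
    unseen = sum(1 for w in words if w not in ham_vocabulary)
    return unseen / len(words)
```

The ratio only becomes meaningful once the ham vocabulary is large, which is the "ton of history" problem raised in the next paragraph.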

Without having a ton of history available and a framework, it's difficult to proceed.

To be honest, I also live in fear of losing my current, finely-tuned bayesian filter...which hasn't given me a false negative in months, and only delivers a few false positives a day.

Neural networks probably represent a better way of combining probabilities gained from multiple techniques. Bayesian stuff works pretty damn well, but we may need to give it a little more "traction" into the problem...
Re:Some comments by Henry Stern · 2004-07-11 07:21 · Score: 1

I think you make a very good point, but given a large enough[1] training corpus, and being very conservative on the weight to assign to error backpropagation, wouldn't it be interesting to see if the decision hyperplane would be able to reshape itself quickly enough to include freshly "evolved" forms of spam as they appear? (Provided, of course, that those consist of variants on previous forms).

I'm not aware of anyone doing online updating of their neural networks for spam classification. I've always been of the impression that error backpropagation and online updating don't mix with multi-layer neural networks because they tend to take O(n^2) time to converge. In addition, you have to deal with the stability/plasticity tradeoff where you want to give the network enough freedom to learn the new patterns while retaining accuracy on the previously-learned patterns.

I agree, however, that your concern about constructed attacks against detection of specific features is a killer, as it stands. But given a large enough set of features to look for in both form and contents the task becomes increasingly more difficult (hence SpamAssassin's success), would that problem tend to eliminate itself?

The spammers are pretty smart. You'll have to trust me that they find and exploit any and every hole that is left open.

Older versions of SpamAssassin had rules with negative scores for recognized e-mail clients (such as pine and mutt). Spammers started putting those headers into their messages to get an extra boost.

I'm using SpamAssassin now, and I think its primary weakness is lack of combinatorial weighing. Feature X is worth n points independently of the presence of other features in the message (or not? I might just have never found how).

I agree with you 100%. I think that SpamAssassin would be much better off with combinatorial weighting. But, I haven't yet found a good way to do it.

One of the ideas that we've had uses what is called a "sigma pi" node, a perceptron with an activation function that looks like f(X) = sum w_ij * x_i * x_j. Aside from the obvious security holes, I'd expect that as a quadratic function, it would require quadratically more training data than the linear activation function.

We've had some other ideas for activation functions, one of which looks like f(X) = (prod m_i * x_i) * (sum w_i * x_i). This one would make the network more or less sensitive given that certain tests have hit. It is nonlinear and does not require the messy cross product stuff of the sigma pi node. I haven't done any experiments with this one yet, but my gut tells me that it probably won't work very well.
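
Written out literally, the two candidate activation functions above look like this. The sketch only transcribes the formulas as posted; the function names are invented here and nothing about how SpamAssassin would actually use them is implied.

```python
import math

def sigma_pi_activation(weights, x):
    """f(X) = sum over i, j of w_ij * x_i * x_j (weights is an n-by-n matrix)."""
    n = len(x)
    return sum(weights[i][j] * x[i] * x[j]
               for i in range(n) for j in range(n))

def gated_sum_activation(m, w, x):
    """f(X) = (prod_i m_i * x_i) * (sum_i w_i * x_i)."""
    gate = math.prod(mi * xi for mi, xi in zip(m, x))
    total = sum(wi * xi for wi, xi in zip(w, x))
    return gate * total
```

The sigma-pi form carries n^2 weights for n features, which matches the concern about needing quadratically more training data. Read literally, the second form is zero whenever any single x_i is zero, so with binary test results it fires only when every test has hit.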

If any of you have better ideas, I'm all ears. Feel free to drop me an e-mail or catch me on irc.freenode.net in #spamassassin. My nick is henry and I'm usually around during business hours.
Re:Some comments by Henry Stern · 2004-07-11 07:32 · Score: 1

These messages hold hundreds of non-words, together with creatively "uglified" versions of common spam words. The trait I'd like to check for is "ratio of words never seen in ham"; seems like a nice and sensible thing to look for.

That sounds like a very good idea. e-mail me and we can look at it further.

Neural networks probably represent a better way of combining probabilities gained from multiple techniques. Bayesian stuff works pretty damn well, but we may need to give it a little more "traction" into the problem...

If you're interested, the conceptual difference between naive bayes and neural networks is that the neural networks try to find the mutual information between features while naive bayes just pretends that there is no mutual information (everything is independent). In most cases, the naive bayes assumption is incorrect but it usually works well anyway.
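
To make the independence assumption concrete, here is a small sketch of how naive Bayes combines per-feature evidence. The function name and the example likelihood values are illustrative assumptions, not figures from any real filter.

```python
import math

def naive_bayes_spam_probability(prior_spam, likelihoods):
    """likelihoods: list of (P(feature | spam), P(feature | ham)) pairs.

    Each feature is treated as independent of all the others, so the
    per-feature likelihoods are simply multiplied (added in log space)."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for p_given_spam, p_given_ham in likelihoods:
        log_spam += math.log(p_given_spam)
        log_ham += math.log(p_given_ham)
    # Normalise: P(spam | features) = spam / (spam + ham).
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))
```

Two copies of the same evidence, say `[(0.8, 0.2), (0.8, 0.2)]`, push the estimate from 0.8 to about 0.94 even if the two features always occur together. That double counting is the ignored mutual information.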
Re:Some comments by davburns · 2004-07-11 08:21 · Score: 1

Another comment: SpamAssassin (2.x)'s GA is kind of a pain to train [1] -- it takes a big corpus of spam & ham, and it has to be representative spam & ham, and the spam has to be recent. Then, it takes a lot of computation to run the GA. (This, as I understand it, is why SA 2.x can never really have rules that get updated like virus filters.) This means that sites using SA must either use yesterday's rules to try to filter today's spam, or use rules that aren't balanced (and may correlate with each other more than they correlate with spam/ham-iness.) I suspect that an ANN would be even harder to train (in terms of corpus requirements and processing), wouldn't it? And you can't just add a hidden node for some pattern you see in spam that gets through.
[1] I'm just stating an observation here -- I don't mean to complain. I very much appreciate the work that the SA team has done and is doing, and look forward to the release of 3.0.
[2] NRFN
Re:Some comments by jarhead4067 · 2004-07-11 16:24 · Score: 1

(SORRY, REPOST FROM FURTHER ON, BUT I WANTED HENRY TO SEE IT)

Wow, I never thought this would generate this much discussion or attention! To everyone that's giving me good feedback (Henry Stern especially), thank you very much as it is appreciated... now for the detractors and script kiddies.

Ok, first off, I was not de-railing existing spam filters, so you guys that took it that I was personally attacking your spam filter, get over yourselves. If you don't have a problem with spam, then congratulations, you get a freakin' cookie. Since most users do, I was trying to create a forum for discussion on new techniques and ideologies for addressing it, not lameass quotes like "this guy's an idiot". And, most of those comments were because people didn't read all the article or didn't understand it. I guess that's what happens when you mix script kiddies with democracy.

Next, if other people have tried and tested this method, great, I wasn't trying to steal their work. This is basically a guy, a garage, and a computer type of project, and I don't have access to research done in deep academic circles. If this overlaps with other work, my deepest apologies to those individuals.

Now, for you people that thought I was trying to say this was the next best thing to canned bread, give me a break! If you read the conclusion, I basically state that no one simple algorithm can adequately address spam (this one included!). That it would take a multi-pronged attack, and that we should start treating spam the same way we treat evolving organisms. That was the point of the article. For the algorithm, I was just trying to present a starting point, not an end.

Whew, now that I've got that off my chest, to the other problems with the article.

Henry, you were absolutely right, the training set does need to be randomly shuffled each time per epoch. My fault on that one guys.
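
Reshuffling the training set for every epoch takes only a few lines. This is a generic sketch, not the article's code; the generator name and the fixed seed are assumptions made so the example is reproducible.

```python
import random

def shuffled_epochs(training_set, n_epochs, seed=0):
    """Yield the training examples in a fresh random order for each epoch."""
    rng = random.Random(seed)
    for _ in range(n_epochs):
        order = list(training_set)  # copy, so the caller's list is untouched
        rng.shuffle(order)
        yield order
```

A training loop would iterate `for epoch in shuffled_epochs(samples, 100):` and present the examples in that order.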

Also, I do apologize about the metaphors, I should have used the proper ANN terminology. I used the metaphors to reinforce my point about the similarities between this and biology.

Now, I am not quite sure why it takes so many training cycles to train the network in SpamAssassin, but since I'm using standard BP, it is quite fast, and does not require an extensive training set because of the generalization traits of an MLP.

I do agree about the ability of spammers to exploit the error inherent within the hidden layer of an MLP. But, I would think it would be extremely difficult to do this, especially if multiple distributions are created of the trained MLP, each with different structures within the hidden layer and with different initial weights. Then, they would only be able to take advantage of certain traits for a small percentage of filters. Also, I would argue that training should also happen on the user's PC, as a background task, say weekly. This allows the network to adapt to new types of spam, and when initialized with random weights, it should make the error within the hidden layer random among different installations, effectively killing that exploit, as each machine will converge differently.

Now, regarding spam messages adapting, I am sure there are many more exploits they can employ, but as of this point, they are running out of options, at least "structural" options. If we can work together to identify the exploitable structures, then we should be able to design a filter that can adapt to the different features within these structures.

That's all I have, thanks again to the folks who gave me constructive feedback on the article. To those whose remarks were out of complete ignorance, **** off.

Thanks SlashDot!

Shawn Evans

Re: not in my timezone. by nusratt · 2004-07-11 03:03 · Score: 1

On a Sunday at 15:00 UTC, what time-zone would you speculate is occupied by the plurality (if not outright majority) of actively-browsing slashdotters?

"Besides, what's wrong with swinging by slashdot when you stagger home at 4am? (assuming you're alone, that is.)"

I think a lot of people would say that if:
(a) you stagger home at 4am Sunday,
(b) you're alone, and
(c) you then swing by slashdot *despite* the troubling evidence of (b) that your time is misused,
then you don't "have a life". ;-)

Little biology? by mattr · 2004-07-11 03:17 · Score: 1

New antispam algorithms are wonderful stuff, kudos to the author. I would have liked to hear more about how exactly it stacks up against say SpamAssassin which has made the news recently for its high quality.

Also, the connection with biology was not clear to me. That is, it seems that genetic analysis tools might be very useful, and the ideas about how spam acts like an organism and has "genes" are great. But it was not clear that this has anything to do with the programming strategy.

For example, the use of a perceptron might be a great idea but to someone not trained in them it is hard to see how a multilayer perceptron would be especially good. Also it is not clear that this is what is used in real world genetic analyses. (For example it would have been interesting if genetic databases and bioinformatics tools like BLAST were mentioned). Also the Chromosome object does not obviously have anything to do with a real chromosome; it is confusing and made me wonder if there was something I was missing, or was it named that way to sound "cool"? Also it was not clear to me whether any of the dynamics of genetic transcription, or gene crossover, mutation, and selection, have anything to do with this project.

Also I am curious about the choice of programming language. Being a perl fanatic I wonder why that is not being used, and of course perl is great at text, and pattern matching, and the important parts of many modules are invariably in C or C++ already, etc.. But also perl is a language of choice for bioinformatics, and there are a number of existing modules for example BioPerl which wraps other programs and Boulder which is an interesting format that could be used to pipe spam genes to other people's filters. Now I don't know if existing bioinformatics tools could be applicable but certainly these are things that ought to come to mind.. and what these tools do is not trivial, and if genes are a valid metaphor for spam components then there is a potential for existing code to be used too. That is something that would be cool.

There are also documented, easy to extend perl modules related to using genetic algorithms or for rolling your own analysis modules, I'm thinking of Genetics and AI::Genetic.

Finally I note the use of the term Corpus. This is really interesting, and suggests the author is into computational linguistics which also represents a massive amount of existing, nontrivial pattern resolution code.

So I'd like to know more about the relationship of both computational biology and computational linguistics to spam. For example, one big part is going to be how to identify genes, or whether you need a generator of pattern matchers that will be able to identify the existence of a gene.

Also there is a short bit about stopping spam by making it literally not pay to spam. I'd like to hear more about how that might be linkable to the biological metaphor.

I don't mean to detract from the work represented by this article, not at all. But I would like to know more about how the system analyzes and exploits the realities of biological dynamics to make a superior antispam tool. For example it would appear that some "genes" might be postulated for links to websites or even mail servers (the vectors of the disease). And some linguistics tools might even help link references to product types as genes.

Finally, and this is just brainstorming really not criticism, I was bothered by the development of a 0 to 1 probability of ham or spam. This is to me the biggest problem with automated filters. I know it can be done, since my antispam method consists of hitt

Correction by LuckyStarr · 2004-07-11 03:20 · Score: 1

Internet stands for "Interconnected Networks".

--
Meme of the day: I browse "Disable Sigs: Checked". So should you.

Excellent post !! by Anonymous Coward · 2004-07-11 03:27 · Score: 0

I'd like to see more posts like this one on /.

I have a slightly different idea. by khasim · 2004-07-11 03:36 · Score: 1

First off, identify the characteristics of the spammer's mail servers. In my experience, they are usually zombies or open relays that I don't have any legitimate contact with anyway. So.....

Seed the spammer's databases with a bogus address. That's easy to do. Just post what looks like a legitimate address in places that spammers are likely to scan.

Then, any email going to that bogus address is broken down and the originating address is put in a blacklist for your FIREWALL. Any connections from those sites are not even acknowledged.

Unless your mail server has previously sent email to that address (this will take care of spammers sending from Hotmail or AOL or some such).

You'd also need a method of "learning" addresses that had not sent email to your bogus address. But that should be easy and similar to "SA-Learn" in SpamAssassin.

There. You only block sites that have sent you spam in the past but it should take care of over 80% of the spam (in my case).
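A rough sketch of the bookkeeping described above, for anyone who wants to see it spelled out. The class and method names are made up for illustration, the addresses are placeholders, and actually feeding `blocked_hosts` into a firewall is left out entirely.

```python
class TrapBlacklist:
    """Blacklist hosts that mail a decoy address, unless we mailed them first."""

    def __init__(self, trap_addresses):
        self.trap_addresses = {a.lower() for a in trap_addresses}
        self.contacted_hosts = set()  # hosts our own server has sent mail to
        self.blocked_hosts = set()    # hosts the firewall should silently drop

    def record_outbound(self, host):
        """Remember a host we sent mail to; such hosts are never blocked."""
        self.contacted_hosts.add(host)
        self.blocked_hosts.discard(host)

    def record_inbound(self, host, recipient):
        """Blacklist the sending host if it wrote to a decoy address."""
        is_trap = recipient.lower() in self.trap_addresses
        if is_trap and host not in self.contacted_hosts:
            self.blocked_hosts.add(host)

    def should_drop(self, host):
        return host in self.blocked_hosts


blacklist = TrapBlacklist(["bait@example.com"])
blacklist.record_outbound("mail.bigprovider.example")   # we have written to them
blacklist.record_inbound("203.0.113.7", "bait@example.com")
blacklist.record_inbound("mail.bigprovider.example", "bait@example.com")
print(blacklist.should_drop("203.0.113.7"))               # zombie gets dropped
print(blacklist.should_drop("mail.bigprovider.example"))  # big provider does not
```

The exemption for hosts you have already mailed is what keeps a big provider from being blocked just because one spammer used it.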

Eventually, the spammers will congregate on major sites that you have legitimate contact with. In which case, those sites can implement throttles to restrict the outbound email.

A nice side benefit is that if someone at one of the open relays DOES try to send you legitimate email (and follows up with a phone call on why you haven't responded), then you can explain that they have an open relay and are spamming the world and they can fix the problem. Everyone is happy!

Re:I have a slightly different idea. by zogger · 2004-07-11 04:23 · Score: 1

That's a good idea, too. I already add rules to the firewall from persistent probes to block that host, especially if I can see it's coming from a *dsl or cable connection, so might as well do it with spam as well. Thanks for the tip!

Yawn. by Pig+Hogger · 2004-07-11 03:54 · Score: 1

YAWFAPD

Yet Another Way For Automatically Pressing Delete.

Of course, this won't solve the bandwidth/resource theft problem...

No, the only solution to kill spam is to immediately nullroute any network that keeps any spammer (or spamming zombie) for more than 2 hours.

Re:Yawn. by minas-beede · 2004-07-11 05:23 · Score: 3, Insightful

"Of course, this won't solve the bandwidth/resource theft problem..."

No, it won't.

Obviously, to solve that problem you need to act earlier in the spam path.

Spammers abuse systems because they look for vulnerable systems and can find them, can distinguish them from secure systems. Think about that - it's true.

Securing systems (as a solution to spam) is based on the ridiculous notion that enough can be secured so that the spammers can't find them. Won't happen. But "distinguish them from secure systems" is still left. What can be done with that?

Well, if secure systems didn't look secure to the spammers they'd not be able to distinguish them and they'd try to abuse systems that can't be abused. That would mean they'd send the spam to traps and that the traps would not deliver any spam other than to what can be determined to be the spammers' own addresses, used to test whether the spam sent gets through (in other words, to re-test to see whether the system is or isn't vulnerable to abuse.)

That's easy to understand, isn't it? If you want to stop the bandwidth theft you're almost surely going to have to act against the bandwidth theft. What's described above is a way to make bandwidth theft not work as well. Break bandwidth theft sufficiently and the spammers won't get enough return on the spam to pay for sending it (or the ones paying the spammers won't get sufficient return - it's the same idea either way.)

With a single ancient Vaxstation and an obsolete MTA I stopped spam to millions of recipients elsewhere: AOL, Hotmail, a large number of destinations. To top it off that Vaxstation was a real email server, so it did two things (and it was slightly harder to stop the spam.) Set up a fake server and everything that comes to it is some form of abuse: none need be delivered as though it is valid email (it isn't valid email. Of course you'd want to deliver the spammers' own test messages: that's what lets them fool themselves into thinking they've found an open relay.) Nowadays this idea works better if you fake an open proxy: open relay abuse is finally on the decline.

If you're an ISP with IP addresses that the spammers check for abusability or with IP addresses that have been abused you can do more than shut off the IP address (and please, I beg of you, do more). Find out where the abuse packets originate that come into the abused system and do whatever you can to get that abuse stopped. If you, for instance, disconnected the abused system and set up something that accepted the incoming abuse packets but sent out no spam, that would be helpful. What you can do depends on the abuse and on the spammer - but the main point is that you don't have to only shut off access, you can do more. Why not do more? You are against spam, and doing more stops some spam. That's in the right direction.

Re:Yawn. by DaCool42 · 2004-07-11 07:39 · Score: 1

it does solve it if it works so well and has such wide spread usage that spam becomes unprofitable. i don't think this sort of filtering is going to accomplish that though.

--

----
All of whose base are belong to the what-now?

Filtering on content will, eventually, fail. by khasim · 2004-07-11 04:01 · Score: 1

I agree that most of his tests are useless. Not to mention, they are easily passed by pasting a few passages from any legitimate source at the end of the message. That will throw off the percentage estimates.

Any tests that are run on the CONTENT of the message will eventually be bypassed as spam gets designed to pass those tests.

I believe that focusing on the SERVERS that send the spam is the only workable approach. Identify which servers send the spam and have your firewall drop those connections.

Kind of like a black/white list, but only with respect to connections.

If you send email to a site, it goes on the white list. Since most companies send email to fewer sites than they receive spam from, incoming connections can be checked against the white list and authorized if on it, before checking on the black list to be dropped.

It will take some processing power, but I don't see any other way to handle it. Content analysis will, eventually, fail.

This also solves a large chunk of the "false positive/negative" issue. Since the email never hits your regular filters (SpamAssassin), then it will not get incorrectly flagged.

If it is a legitimate email coming from a blocked host, the SENDER's system will inform the SENDER that there is a problem with connecting. Then the SENDER gets on the phone. Which should also result in the SENDER fixing their zombie/open relay.

Re: not in my timezone. by Anonymous Coward · 2004-07-11 04:04 · Score: 0

(b) you're alone, and
(c) you then swing by slashdot *despite* the troubling evidence of (b) that your time is misused,
then you don't "have a life". ;-)

So if I hit the town on Saturday night and don't manage to pull I've no life??? Do you bat 1.000? Do tell how.

Why aren't we all ... by ModernGeek · 2004-07-11 04:05 · Score: 1

... using freecache, instead of using something for what it's not supposed to be used for, and doesn't give a very good view of the site to begin with (images, etc, even though they show up because it can still get them from the original site in this case). Why aren't we linking to freecache in our stories? Maybe slashdot could use something that would strip the URL in all links, and always use freecache unless a flag was set specifically not to??

--
Sig: I stole this sig.

SpamByte: Game Over, Spammers/Computer Crackers... by iamcf13 · 2004-07-11 04:17 · Score: 1

mfh (56): What guarantee do we have that spammers won't evolve past any thwarting mechanism developed?

SpamByte: Game Over, Spammers/Computer Crackers...

This post describes two programs I coded that will eliminate lots of spam and malware for Windows systems if used widely. Both programs, when used together properly, make it effectively impossible for a user to receive spam or malware by email.

yawn. Baysian by itself doesn't work and isn't AI by CFD339 · 2004-07-11 04:17 · Score: 2, Informative

Bayesian filters are bypassed just like any other. I'd bet most of us here have tried some form of adaptive filtering with varying results.

He's right in one key respect though -- spam is cheap to send, but spam DESTINATIONS (the links they try to get you to go to) are relatively expensive. You can't register a hundred thousand domains a day. While it's cheap to get one or two, massive domain registration is an expensive proposition. That's currently, IMO, the best way to catch spam once you've gone through the bonehead catch of faked headers.

Personally, I do two stages: First, I catch the obvious stuff -- it says it's from AOL.COM but didn't come from their published servers. duh.

Then, I take those "known spams" and search for the call to action link -- what URL are they trying to send me to. Take the primary part of that (the domain, plus a little more) and make a list of "probable spam destinations".

I do the same thing with known good mail (mail from people I have sent mail to).

Now I have good Bayesian fodder -- actual destination lists both good and bad.

Making a Bayesian list out of those results in a fairly accurate secondary filter.

Email inbound to me now goes through three checks:
1) have I sent you mail before (whitelist)
2) is this obvious bonehead spam
3) how many links in the message are to the same place as the ones in the bonehead spam?

This works to stop 98% of the 400+ spams a day that get sent at me with a very very low false positive ratio.
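For the curious, the three checks could be wired together roughly like this. It is only a sketch under stated assumptions: `link_keys`, `classify`, and the 0.5 threshold are invented for illustration, "the domain, plus a little more" is guessed to mean host plus first path segment, the forged-header test is passed in as a plain boolean, and a simple hit ratio stands in for the Bayesian scoring of destinations.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)


def link_keys(body):
    """Return one key per link: host plus first path segment."""
    keys = []
    for url in URL_RE.findall(body):
        try:
            parts = urlsplit(url)
        except ValueError:        # malformed URL, skip it
            continue
        host = (parts.hostname or "").lower()
        if not host:
            continue
        first_segment = parts.path.strip("/").split("/")[0]
        keys.append(f"{host}/{first_segment}" if first_segment else host)
    return keys


def classify(sender, body, whitelist, looks_forged, spam_keys, threshold=0.5):
    """Apply the three checks in order and return "ham" or "spam"."""
    if sender.lower() in whitelist:      # 1) have I sent you mail before
        return "ham"
    if looks_forged:                     # 2) obvious bonehead spam
        return "spam"
    keys = link_keys(body)               # 3) links shared with known spam
    if keys:
        hits = sum(1 for key in keys if key in spam_keys)
        if hits / len(keys) >= threshold:
            return "spam"
    return "ham"
```

`spam_keys` would be built by running `link_keys` over the mail caught in step 2, and a matching set from known good mail could be used to cancel out shared destinations.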

--
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln

Re:yawn. Baysian by itself doesn't work and isn't by Anonymous Coward · 2004-07-11 04:24 · Score: 0

Bayesian is just one of the features that's captured, and then the rest is plugged into a neural net, which is AI. Read all the article...

AI Is Scary To Me by Goo.cc · 2004-07-11 04:46 · Score: 0

I think that there is an undiscussed danger in AI picking through my e-mail: it might have different tastes than I do.

For example, maybe the AI mail filter will decide that it really likes those e-mails about Viagra or penis enlarging. Perhaps it will get jealous of my significant other and decide to delete her e-mails.

AI sounds cool but what we really need is spider-sense spam filtering!

Oh, I dunno... by warrax_666 · 2004-07-11 05:17 · Score: 1

maybe because freecache is only for large (as in several megabytes) files?

--
HAND.

Re:Oh, I dunno... by Anonymous Coward · 2004-07-11 06:56 · Score: 0

what about using google cache in the first place? google can distribute requests from different countries

Aggressive predators by iamacat · 2004-07-11 05:31 · Score: 1

If spam is a living organism and we want to control it, it's not enough to have a filter that passively nibbles at what swims nearby. Write something that invades spammer's servers, makes charges with all of their credit card numbers and then e-mails a final "spam" with an outlook express-based viral copy of itself before formatting the hard drive. Let it adapt to that!

Whitelists are a waste of time by Anonymous Coward · 2004-07-11 06:05 · Score: 0

The article mentions the use of whitelists as part of a sophisticated email filtering policy, but since an email address can easily be faked, an email may appear to come from someone you know, but actually be spam.

So clearly we shouldn't use a whitelist alone, as the article suggests, it should only be one phase of filtering.

Okay, but then if you are using other methods that are effective at filtering out spam from potentially forged email addresses, wouldn't those methods be just as effective at filtering out emails that weren't on the whitelist as those that were? In other words, isn't the whitelist completely superfluous? To further cap it off, this makes it increasingly difficult for people that you do not necessarily expect to contact you for possibly legitimate, or at the very least, non-spam, reasons. I've had someone I used to know some 20 years ago contact me because they recognized an anecdote that I was relating right here on slashdot. If I had been using a whitelist, that email would have been classified as junk mail and I most likely never would have even noticed it, let alone read it, and I was actually quite happy to hear from the person that I hadn't seen in years.

Also, there is the very real possibility that an otherwise legitimate emailer's computer may have been compromised by a virus/trojan to send out spam. It's not particularly fair to suddenly block these people because there's no practical way for them to notify you to unblock them if they should happen to be able to purge the trojan from their system.

The problem with spam is that the sender addy can be faked, and I can see no legitimate reason to do this other than to prevent your email address from being harvested by a spammer. I do not predict an end to email spam until the email protocol is completely redone (which may never happen).

Who posts this crud? Who submits it? by jaghatarjankare · 2004-07-11 06:06 · Score: 2, Funny

NOTE: The sample code for this application is in C#. C# was chosen over C++ so beginners could better see the structures of the process, and C# was chosen over Java because of the inherent performance advantages of .NET.

What morons. what total losers.

Re:Who posts this crud? Who submits it? by mark-t · 2004-07-11 07:42 · Score: 1

There are no performance advantages of C#/.NET over Java unless one is on a Windows platform, and since the primary advantage of Java is that it is platform independent, it is clear that the author of the article is ignorant of the fact that most mail servers are running some variant of Unix, where the performance advantage would be nonexistent anyway.

What morons. what total losers.

Couldn't have said it better myself.

--
File under 'M' for 'Manic ranting'

Re:Who posts this crud? Who submits it? by jarhead4067 · 2004-07-11 16:20 · Score: 1

Ummm, most readers of this article are going to plug it into a mail client, which is done primarily on a Windows box. And as a former Java Nazi myself (SCJD, SCJA), I can say with utmost confidence that .NET beats the living hell out of Java in just about every category. I hate MS just like the next guy, but Sun seriously effed up Java when they let every freakin vendor under the sun (no pun intended) into the spec. process and didn't open source it. I shed a tear when I came to this conclusion, but then I realized religion doesn't pay my mortgage. Nuff said.

Re:This guy may take spam a little too seriously.. by mrchaotica · 2004-07-11 07:21 · Score: 1

I don't know about that... everyone I know gets spam, but nobody I know gets AIDS!

(it's a joke; laugh!)

--

"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

Re:already slashdotted :( ... not entirely by alexborges · 2004-07-11 07:47 · Score: 1

Oh.... so thats why this MS product is so strong... who wouldve thought... :)

--
NO SIG

"International Network"? by OleManRiver · 2004-07-11 08:25 · Score: 0

It's hard to take someone seriously when they think that "Internet" is short for "International Network".

It's "Inter-connected Network" dumbass.

Cui Bono? Sue the ass off the profiteer by crovira · 2004-07-11 08:28 · Score: 2, Informative

Go after spammers' customers. If they have to pay $10,000 for every spam sent on their behalf, they'll soon stop.

Fuck the spammers. They are merely supplying in response to a demand.

Dry up the demand by an internationally (I know of NO gov't who'd turn down money) backed law making it illegal to have spam sent on your behalf.

The response to spam is NOT going to be technical.

--
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.

If spam is a living organism... by eurleif · 2004-07-11 09:31 · Score: 1

Can we put it in cans and eat it?

Re: not in my timezone. by nusratt · 2004-07-11 09:34 · Score: 1

"So if I hit the town on Saturday night and don't manage to pull I've no life?"

not at all what i meant (or said).
merely saying (amicably), if one seriously wants to be in a relationship, one's time could be better spent than partying until 4AM AND following up by hitting slashdot; e.g., get to sleep earlier to make time to spend Sunday at the park, museum, etc.
anyhow, it was semi-facetious (and apparently poorly worded).

Bayesian is not AI by cipher+chort · 2004-07-11 09:39 · Score: 1

Bayesian filtering is very simply the probability that a word will appear in one context or another. Once you've done this for a huge selection of words you select a few thousand and put them in a dictionary.
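To make the per-word probability concrete, here is a minimal sketch. The function names, the add-one smoothing, and the equal priors are all assumptions for illustration, not what any particular filter ships with.

```python
import math
from collections import Counter


def train_counts(messages):
    """Count in how many messages each lowercase word appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def word_spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """P(spam | word) with add-one smoothing and equal priors."""
    p_word_spam = (spam_counts[word] + 1) / (n_spam + 2)
    p_word_ham = (ham_counts[word] + 1) / (n_ham + 2)
    return p_word_spam / (p_word_spam + p_word_ham)


def message_spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-word probabilities, treating words as independent."""
    log_odds = 0.0
    for word in set(text.lower().split()):
        p = word_spam_probability(word, spam_counts, ham_counts, n_spam, n_ham)
        log_odds += math.log(p) - math.log(1.0 - p)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))
```

A real filter would typically keep only the most telling words per message rather than all of them, which is where the dictionary selection mentioned above comes in.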

There are other techniques that go much further than just checking the "score" of a message based on what keywords show up in it. There are some techniques that try to parse the message for its grammatical structure and the "intent" of the message. These are much more accurate techniques than what is essentially glorified keyword filtering (Bayesian).

--
Someone is WRONG on the Internet!

Genetic Algorithms And Neural Networks by ttv · 2004-07-11 12:06 · Score: 1

Oh how I wish people would differentiate between GA (genetic algorithms) and ANN (artificial neural networks)!!!!!!!!!!! GA is chromosomes, crossover and all that jazz, ANN is neurons, back-propagation et al. Combining the 2 is like the unified theory of everything (talk to mr. Hawking if you don't believe me)!!! </rant mode> To my mind, the article makes an interesting point though :- Categorizing features of an email which can be passed to a classifier. However, looking at it from a human standpoint (one which spam is aimed at), should the only feature of email we are interested in be that of the actual text which appears in our client???? Too often spam classifiers get clouded with hidden text, number of links etc. Do you make your classification of spam in those terms???? Me neither! Perhaps a spam classifier which looks at the context of a message might be more successful!!!! Just a point :-D

Another interesting idea against spam by JSR+$FDED · 2004-07-11 13:57 · Score: 1

Here.

Re:Another interesting idea against spam by MeatNoodle · 2004-07-11 20:28 · Score: 1

Something I think your approach overlooks is purchasing stuff on-line. I get two types of automated response from an online vendor:

1. An order confirmation with an order ID or other follow-up options.

2. A password or registration number that allows me to unlock trial software.

Won't your CRS keep me from getting these, since the follow ups are auto-generated, and don't reply to challenges?

I guess I could add the vendor to my white-list, but I'd need to add the vendor quickly enough so that the one confirmation/password E-Mail isn't rejected. That seems like a tough thing to do. Furthermore, I'd have to get the vendor's white-list entry correct, which is made more difficult by the fact that many on-line vendors use a third party to handle the money transaction/confirmation step.

And lastly, even if I was successful in getting the vendor in my white-list quickly enough, to truly protect against further unwanted E-mails from that vendor (new product offers, or whatever), I'd have to remove them from my white-list again.

Anyway, sounds like the simple CRS approach would need some additional refinements if one makes any on-line purchases.

Just some thoughts.

P.
~~~~~~~~

--
"That's exactly what I said, only different."

fanmail by nounderscores · 2004-07-11 14:48 · Score: 1

Thank you for working on SpamAssassin.

Re:Using AI for preventing slashdotting. by Anonymous Coward · 2004-07-11 15:35 · Score: 0

You dumb soulless bastard. Ever hear of 7AM mass? Guess not, you godless fool.

Spam is good, spam is nice! by Anonymous Coward · 2004-07-11 15:43 · Score: 0

Spam is good, spam is nice, we can't get rid of it, that way is doomed to failure. That is the wrong problem to attack.

The correct way to attack it, is to attach a price to spam that is directed at people that don't want it. Thankfully, this is easy enough. If the cable Internet companies, the DSL companies and AOL/Hotmail/Yahoo got together and just built a database of IP addresses that send spam to them, and charge $5 for each instance of spam sent for that entry to be removed from the database, and hard IP filtered all traffic from every IP in the database, in seconds, the spam problem would be solved. Since the filtering would be done immediately, there would be a very low possibility for more than a few such emails to be sent from any address that spams. Because there could only be a few entries in the database, a thoughtful organization that accidentally gets added, can, in real time, pay their `fine' for spam, and be educated by it, total cost $5-$1000. Before they hit the `pay now' button, they would _probably_ learn to take care of the _problem_. If not, they can learn again, and again, only limited by the depth of their pockets. Since spammers can't pay out $5 for every spam message they send, they could not continue sending email to people that don't want it. In time, measured in days, everyone would know if they can send email to the rest of the planet or not, and if they can't and they want to, they can then switch from the spamful company to one that is responsible.

Why must this work? Because with a 5 trillion dollar a year spam load, someone would either have to pay 5 trillion a year, or the problem would be gone. I'd wager the problem of spam would be gone. By spreading the costs around to those organizations that engage in sending spam, we distribute the solution to those organizations. Innovative companies that solve the problem get rewarded; those that don't, well, they can't send email.

If all companies sent an equal amount of spam, they could cross settle for a net 0 cost, no matter the innocent spam load. If one of them allowed an abnormally high spam load, they would face rising costs. They have only two choices, pay the fines, or stop sending email. Hosting companies would have two classes of IP addresses for lease, those that can't send email, and those that can. The ones that can, may be more expensive, and may entail the hosting provider paying fines and charging the customer to keep the address clean. If the customer can't pay, they quickly migrate the user to the can't send email address, and pay the remaining fines on the good IP address, take the hit, and raise prices to keep the clean IP address clean. The hosting providers that don't care, quickly will find they will accumulate IP addresses that can't send email, in time, none of their addresses will be able to send email, except for those of long term, good customers, their addresses will be clean and remain that way.

The database would be open and free for all to use. It would be best if large numbers of organizations agreed in advance to use the solution, and to contribute to the database. The reason to contribute, $5 for every entry contributed, we reward people, organizations that get spam, think of it as a pay to read. If you read 100 spams a day, you make $500 a day in spam, at that rate, many could quit their day jobs.

To counter forgeries, we charge $1000 for each faked entry. At this rate, few would dare fake large numbers of such entries, and the rare few that tried, would soon find themselves shut out of contributing to the database.

FREECACHE IS USELESS FOR FILES < 5MB by Ayanami+Rei · 2004-07-11 15:57 · Score: 1

How many times does someone have to remind a slashdotter about this shortcoming of freecache?

Yet somehow, after being mentioned in an article ONCE, freecache is the darling INCORRECT answer for every slashdotting-related problem?

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

The author speaks by jarhead4067 · 2004-07-11 16:12 · Score: 1

Wow, I never thought this would generate this much discussion or attention! To everyone that's giving me good feedback (Henry Stern especially), thank you very much as it is appreciated... now for the detractors and script kiddies.

Ok, first off, I was not de-railing existing spam filters, so you guys that took it that I was personally attacking your spam filter, get over yourselves. If you don't have a problem with spam, then congratulations, you get a freakin' cookie. Since most users do, I was trying to create a forum for discussion on new techniques and ideologies for addressing it, not lameass quotes like "this guy's an idiot". And, most of those comments were because people didn't read all the article or didn't understand it. I guess that's what happens when you mix script kiddies with democracy.

Next, if other people have tried and tested this method, great, I wasn't trying to steal their work. This is basically a guy, a garage, and a computer type of project, and I don't have access to research done in deep academic circles. If this overlaps with other work, my deepest apologies to those individuals.

Now, for you people that thought I was trying to say this was the next best thing to canned bread, give me a break! If you read the conclusion, I basically state that no one simple algorithm can adequately address spam (this one included!). That it would take a multi-pronged attack, and that we should start treating spam the same way we treat evolving organisms. That was the point of the article. For the algorithm, I was just trying to present a starting point, not an end.

Whew, now that I've got that off my chest, to the other problems with the article.

Henry, you were absolutely right, the training set does need to be randomly shuffled each time per epoch. My fault on that one guys.
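The per-epoch shuffle amounts to something like the following sketch. The article's code is C#; this is a Python stand-in, and `train` and its `update` callback are hypothetical names rather than anything from the article.

```python
import random


def train(samples, update, epochs, seed=None):
    """Call update(sample) for every sample, reshuffling the order each epoch."""
    rng = random.Random(seed)
    order = list(samples)      # copy so the caller's list is left untouched
    for _ in range(epochs):
        rng.shuffle(order)     # fresh random order for this pass
        for sample in order:
            update(sample)


# Record the presentation order over three epochs of four samples.
seen = []
train(["a", "b", "c", "d"], seen.append, epochs=3, seed=42)
print(len(seen))  # 12
```

Without the shuffle the network sees the samples in the same sequence every pass, which can bias the weight updates toward whatever comes last.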

Also, I do apologize about the metaphors, I should have used the proper ANN terminology. I used the metaphors to re-enforce my point about the similarities between this and biology.

Now, I am not quite sure why it takes so many training cycles to train the network in SpamAssassin, but since I'm using standard BP, it is quite fast, and does not require an extensive training set because of the generalization traits of an MLP.

I do agree about the ability of spammers to exploit the error inherent within the hidden layer of an MLP. But, I would think it would be extremely difficult to do this, especially if multiple distributions are created of the trained MLP, each with different structures within the hidden layer and with different initial weights. Then, they would only be able to take advantage of certain traits for a small percentage of filters. Also, I would argue that training should also happen on the user's PC, as a background task, say weekly. This allows the network to adapt to new types of spam, and when initialized with random weights, it should make the error within the hidden layer random among different installations, effectively killing that exploit, as each machine will converge differently.

Now, regarding spam messages adapting, I am sure there are many more exploits they can employ, but as of this point, they are running out of options, at least "structural" options. If we can work together to identify the exploitable structures, then we should be able to design a filter that can adapt to the different features within these structures.

That's all I have, thanks again to the folks who gave me constructive feedback on the article. To those whose remarks were out of complete ignorance, **** off.

Thanks SlashDot!

Shawn Evans

Destination link/payload aggregation. by Ayanami+Rei · 2004-07-11 16:14 · Score: 1

It'd be bitching if we (or someone) could set up a sort of website or service whereby suspected spam links could be collected and analyzed for trends.

Perhaps webhosts could be identified as being problematic... and contacted. Or maybe it might lead one to a compromised ISP or residential net.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Re:Destination link/payload aggregation. by CFD339 · 2004-07-11 23:00 · Score: 1

Cool, but maybe not needed. Also, long term it isn't a perfect solution and will be worked around by the spammers before we could see massive benefit from the effort of organization. If I as a single individual am receiving more than 400 spams in a day, it means that I'm clearly poking my head into the wrong websites, but also that I have a massive pool to work with if I've been saving them. So do you. So does everyone. Anti-spam will need to be adaptive automatically to succeed. Lots of techniques need to be deployed working together. Spammers send the same message out a dozen different ways. Each exploits the vulnerabilities of one anti-spam technique or another. Some people built good Bayesian word checkers so the spammers now send random quoted articles from online newspapers along with their messages. People use whitelists now, so spammers try to use common domain names in forged "from" headers. Each of these is a good way around some filters. The way to adapt though, is to catch the relatively easy stuff for your particular anti-spam filter, then use that to learn more about the spam itself, which then catches the harder to spot stuff. The way I did it is one way of doing that but by no means the only way. I just make that single assumption -- that a part of the target url consisting of the domain and sometimes a bit more is the most expensive thing for a spammer to change. For me, it means I can use Bayesian techniques on a much smaller subset of the message and am not fooled by a tossed in article from the Times. I have no doubt that soon spammers will be sending me mail with lists and lists of links to other domains as a mask and I'll have to find some way to filter them. For now, this works.

--
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln

Don't Popfile and SpamBayes already DO this??? by macraig · 2004-07-11 19:58 · Score: 1

How is the concept presented in this article really so much different than what the better anti-spam products already do? It appears to me that the only thing this article adds to the discourse of identifying spam is attachment of the jingoistic buzzwords "genetic" and "biological" to the process.

Gee, how original....

You underestimate Neural Nets by obtuse · 2004-07-11 20:34 · Score: 2, Interesting

"But even with these enhancements, it's been conclusively shown that some problems are intractable for neural networks. In any case, neural networks are no new thing."

Not so. Maybe you're still thinking about extremely simple neural nets, because no such proof of intractability exists for larger more complex networks.

Here's proof: Neural Networks can emulate a Universal Turing Machine. Since they can also be emulated by a UTM their limitations are no greater or less than those of any UTM. One citation if this isn't obviously true.

This is exactly why Marvin Minsky has been accused of slandering neural nets unfairly, and hindering AI research. In his book _Perceptrons_ he demonstrated a simple problem that a trivial (one or two layers with no feedback) NN can't solve. A lot of scientists wrote off Neural Nets just as you have, because a toy was the only tool used. Never mind the fact that an only slightly more complex NN can solve such a problem easily. I find it telling that for a human to solve the same problem, one has to construct a strategy to do it. Not the sort of thing I'd assume any extremely simple machine could do. These days Minsky complains that AI isn't trying to build human brains. He's a brilliant man, but in some cases (as with many famous people) his chutzpah occasionally outstrips his judgement. I only wish that great scientists were immune to this.
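The post doesn't name the problem, but the example usually cited in this context is XOR, so that is assumed here. The sketch below uses hand-picked weights (no training) to show a single threshold unit failing over a small weight grid while a net with one hidden layer of two units gets it right. The grid search is only an illustration, not a proof, and all names are made up.

```python
from itertools import product


def step(x):
    """Threshold activation: fire only when the weighted sum is positive."""
    return 1 if x > 0 else 0


XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def single_unit_fits(w1, w2, bias):
    """Does one threshold unit with these weights reproduce XOR?"""
    return all(step(w1 * a + w2 * b + bias) == want
               for (a, b), want in XOR_CASES)


def any_single_unit_fits():
    """Try every weight combination on a grid from -4.0 to 4.0."""
    grid = [i / 2 for i in range(-8, 9)]
    return any(single_unit_fits(w1, w2, bias)
               for w1, w2, bias in product(grid, repeat=3))


def xor_net(a, b):
    """Two hidden threshold units feeding one output unit."""
    h_or = step(a + b - 0.5)          # fires if at least one input is 1
    h_and = step(a + b - 1.5)         # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)   # OR but not AND
```

The hidden units are the "strategy" part: one detects "at least one", the other "both", and the output unit combines them.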

Lots of less qualified people complain that neural nets aren't useful because they have some unpleasant experience with them. They have no idea of the variety of neural nets. It's like using a Playstation and complaining that computers are not useful.

As for spam filtering with AI, unless you have the narrow definition of AI, the Bayesian techniques of SpamAssassin are AI, as is the Latent Semantic Analysis done by OS X Mail.app for spam filtering. LSA, while computationally expensive on a PC, is regarded as equivalent to a particular type of 3-layer neural net (see Kohonen self-organizing maps).

One thing you have right. Neural nets are "no new thing." They're as old as biological brains. Novelty is not a criterion for usefulness.

--
Assembly is the reverse of disassembly.

Re:You underestimate Neural Nets by k8to · 2004-07-12 01:51 · Score: 1

In practice, however, neural nets are labor intensive to set up, slow to execute, and inexact as compared to specialized solutions, which are often more apparent.

Certainly if you are working on some problem domain for which a specialized solution is _not_ apparent, neural nets become something to consider. Usually, however, the right course is to find that specialized solution.

--
-josh

Re:already slashdotted :( ... not entirely by Ahaldra · 2004-07-11 21:40 · Score: 1

No, more likely it's some guy trying to use Windows 2000 Pro as a webserver. It has a ten connection limit.

Ahhh, that's interesting, thx. That sheds some light on MS's business strategies:

Step One: Sell user Overpriced OS advertising "so you can host your own Web site on the Internet"
Step Two: Profit!
Step Three: Post story from user's internet page to slashdot.
Step Four: User upgrades to server version.
Step Five: More Profit!
;-)

--
Code is Speech. No to Censorship.

Re:already slashdotted :( ... not entirely by RupW · 2004-07-11 22:01 · Score: 1

Step One: Sell user Overpriced OS advertising "so you can host your own Web site on the Internet"

:-) I hadn't seen that. On the next page, though, they own up to what you're actually getting:

IIS 5.1 for Windows XP Professional is designed for users developing a Web service for home or for office use. It can service only 10 simultaneous client connections, only one Web site, and it does not have all the features of the server versions.

What is cool about this is. by LWATCDR · 2004-07-12 01:38 · Score: 1

It could also filter out 133t speak.

I had to love the comment about the ratio of misspelled to correctly spelled words. As one of the worst spellers in the world I fear for my future emails. I would also worry about highly technical emails getting flagged. Spell checkers think any word they do not know is misspelled.
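
A rough sketch of that worry, not the article's code: the function below computes the fraction of words missing from a dictionary (a slight variation on the misspelled-to-correct ratio, chosen to avoid dividing by zero). The tiny `KNOWN_WORDS` set and the function name are made up for illustration.

```python
import re

# Stand-in for a real spell-checker dictionary (illustrative only).
KNOWN_WORDS = {"the", "server", "is", "down", "please", "restart", "it", "now"}

def misspelled_ratio(text, dictionary=KNOWN_WORDS):
    """Fraction of words in `text` that are not in `dictionary`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)

print(misspelled_ratio("The server is down"))        # every word is known
print(misspelled_ratio("Please restart nginx now"))  # "nginx" counts as misspelled
```

A perfectly spelled technical term like "nginx" is counted the same as a typo, which is exactly how jargon-heavy mail could end up scored as spam.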

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.

Re:Using AI for preventing slashdotting. by buleriando · 2004-07-12 04:59 · Score: 1

> goto chruch already.

Pronounced as in "goto crutch already".

How Freudian.

Multi-element detection and adaptation by SpyderFan · 2004-07-12 05:15 · Score: 1

The author has accurately identified that we must look at spam in many different ways. The text of the spam message, while important, is not as important as the "tricks" used to elude filters as an accurate identifier of spam.

The spammers, by necessity, must disguise the text to elude simple Outlook filters. By doing this, they have introduced more accurate, yet harder to detect, indicators that the message is spam.

By understanding the tricks used by spammers, and using "real intelligence" to detect the tricks, it becomes possible to accurately detect spam without relying on training and other burdensome processes.

Blue Squirrel's anti-spam products detect spam using the text as well as the tricks used to disguise the text.

Almost all of the methods for detecting spam are included.

Whitelist (by e-mail)
Blacklist (by e-mail)
Blacklist - RBL (by IP)
Dictionary - detects tricks used to throw off Bayesian analyzers
Bayesian Analyzer - Trainable
Challenge/Response for false positives
EMail Stamps - Give spammers the option to pay
Bouncer - Trick spammers into taking you off their list
Good Words - Make sure you get messages you are interested in
Anti-Virus Detection and removal
Script removal
Dangerous attachment removal
Detection and removal of Web Bugs
HTML "loudness" detection
SPF / MS Caller ID
Reply possible detection

There is more, and new techniques are added as they are shown to be helpful. Weighting of each of the techniques is simple. Administrators can keep control centralized, or give each user control over their own account and its settings.
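
As a generic illustration of weighting several techniques (not Blue Squirrel's actual API or scoring scheme, which is not described here), one common approach is a weighted sum of per-technique signals compared against a threshold. The signal names, weights, and threshold below are all hypothetical.

```python
def spam_score(signals, weights):
    """Weighted sum of detector outputs; detectors without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in signals.items())

def is_spam(signals, weights, threshold=1.0):
    """Flag a message when its combined score reaches the threshold."""
    return spam_score(signals, weights) >= threshold

# Hypothetical weights an administrator might tune.
weights = {
    "rbl_hit": 0.8,        # sender IP found on a blacklist
    "bayes": 1.0,          # Bayesian analyzer probability, 0..1
    "html_loudness": 0.4,  # shouting fonts and colors, 0..1
    "whitelisted": -2.0,   # known-good sender pulls the score down
}

message = {"rbl_hit": 1, "bayes": 0.5, "html_loudness": 0.5}
from_friend = dict(message, whitelisted=1)

print(is_spam(message, weights))      # combined score is about 1.5
print(is_spam(from_friend, weights))  # whitelist weight outweighs the rest
```

A negative weight lets a whitelist hit override the other detectors without special-casing it.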

There is an SDK available upon request for plug-in analysis.

Nothing is hidden. Every message gets a report. A web interface lets users see the quarantined messages and a detailed report on why each was not allowed through.

It learns the valid users. Alternatively, the program links into LDAP, Active Directory, or RADIUS (for ISPs), or allows the import of users and automatic generation of passwords.

It may be the most complete anti-spam system developed to date. It does not rely on one technique or method, but rather combines them all, and then allows for new techniques.

Slashdot Mirror

197 comments