Paul Graham on Fighting Spam
Ramakrishnan M writes "Paul Graham, the Lisp guru, is back with a great technique to fight spam. It is based on a trust metric, and he claims that only 5 out of 1000 spams leaked through the system, with 0 false positives. Worth looking at."
How does this compare to SpamAssassin? Does anybody know any figures?
Fleur de Sel
The proper way to get rid of spam is to get rid of spammers. Make it illegal to send spam, to market using spam, and to host spammers.
Make each link in the chain liable!
Fight Spammers!
I propose we define spam as unsolicited automated email. This definition thus includes some email that many legal definitions of spam don't. Legal definitions of spam, influenced presumably by lobbyists, tend to exclude mail sent by companies that have an "existing relationship" with the recipient.
This needs to happen. Just because I buy a book from a company doesn't mean I want their stupid monthly mailing list.
This seems very similar to SpamAssassin, which a lot of us are using with great success.
I got an email last night about this! Also, it asked me to help out his Nigerian cousin...
of course! it sounds so obvious now.
jeez, that alone would cut down on spam. cross reference that with my trusted address book, and I'll probably be able to filter all spam.
I have that feeling you get when you've been stuck with a problem, and some guy looks at the code for about 2 seconds and finds a problem.
The Kruger Dunning explains most post on
(Yeah, yeah, I know...)
But if you do, check out Cloudmark's SpamNet. I've been quite pleased with its ability to stop spam, and it gets better the more people use it.
1) Lisp...ever since i ran into scheme, I have _loved_ the concept of lisp based languages. A nice Hoo-ha to anyone who says there are no practical applications of lisp based languages. (except haskell...which personally, i think sucks! if one of our own professors hadn't invented it, it would be dead by now)
2) _0_ false positives. I'm perfectly happy to settle for "some small number of spams getting through" given there are NO false positives. Early on in the article he states that he realizes this is a critical problem, and from the start keeps no false positives as a goal. It is far better to have no false positives than to have a 100% no-spam rate with that in mind...
3) the statistical word analysis is really interesting..."describe" is innocent. unfortunately....what happens when a few smart spammers get their hands on this analysis *sigh*
When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
The best way to avoid a torrent of gloppy manjuice shooting all over your naked buttocks every time you even CONSIDER turning your computer on is to remove Linux from your hard drive immediately.
By installing a stable, sensible OS like Xenix, you can ensure an ejaculate-free user experience.
Create an E-Mail address called, say, spam@example.net.
Put a link to it on your website, but tell people not to use it for anything, E.G.
<a href="mailto:spam@example.net">Spam trap - don't use me</a>
Then, it'll get harvested along with all the others on your site. That mail box will fill up with spam, and nothing else.
What good is that? Well, you've got a ready-made list of messages to filter *out* of your other mail boxes!
So, just write a script that checks each inbound E-Mail against the spam list. If it matches, you *know* it's either:
1. Spam
or
2. An E-Mail that somebody has also sent to the "Don't use me" address.
In either case, you don't want to read it, so it gets auto-deleted. Nice.
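A minimal sketch of the spam-trap check described above, in Python. It assumes that "matches" means an identical message body once case and whitespace are normalized; the names `fingerprint`, `build_trap_index` and `is_trap_match` are made up for illustration. Spam that is personalized per recipient would slip past an exact match like this, so treat it as a starting point only.

```python
import hashlib


def fingerprint(body: str) -> str:
    """Hash a message body after collapsing whitespace and case."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_trap_index(trap_bodies):
    """Fingerprint every message that arrived at the spam-trap address."""
    return {fingerprint(body) for body in trap_bodies}


def is_trap_match(body: str, trap_index) -> bool:
    """True if this inbound message was also sent to the trap address."""
    return fingerprint(body) in trap_index


# Hypothetical contents of the spam@example.net mailbox.
trap_index = build_trap_index(["Buy   cheap pills NOW", "You have won a prize"])
print(is_trap_match("buy cheap pills now", trap_index))  # True
print(is_trap_match("Lunch at noon?", trap_index))       # False
```

Hashing the bodies keeps the lookup cheap even when the trap mailbox is large.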
Oh, I think I'll patent this, and not tell any of you about the royalty I'm going to charge in 15 years time. Hahahahahahaha!!!
Oh, by the way, first post, first post... NOT!
that means CmdrTaco reduces his spam intake to around 500/day.
The One Rule Of Chess You'll Ever Need: Don't play someone who carries a kit in their bookbag.
There are some internet filters out there that use Fuzzy Logic instead of databases. They are able to determine what category a web page belongs in without ever having seen the web page before.
This technology should also be able to be applied to spam.
I hope yahoo reads that article.
Even if someone develops a clever algorithm that's 99% effective, won't the spammers just find a way around it? It's sort of like the music industry and their vain attempts at copy protection. Some of these spammers are smart, computer-savvy people too.
One feature of spammers is that they adapt to any sort of anti-spam technology. What's to stop spammers from writing spam filled with 'non-spam' words?
__ Someday, but not this morning, I'll finally learn to use the preview button.
His sample code is written in LISP! Run away! RUN AWAY!
I think that spam is a necessary evil that can be easily controlled. If we make a law to simply ban spam then we might be banning other things like mailing lists. I personally receive NO SPAM in my main account and less than one piece a day in my "junk mail account." That's including things that the spam filter catches. All people have to do is to be careful with their e-mail addresses. Spam is not a problem for people who use a modicum of common sense.
I wonder when Paul will release arc to the world.
Sadly once the spammer knows this method's being used, he'll start chucking in obscure (but valid) words... ah well, maybe at least spam will start getting interesting to read, assuming the spammer tries to use the word in context.
"Buy my superlatively efficacious mail list."
Maybe not...
Tom
Oh arse
BUT, now, the best spam filters out there already use statistical properties. SpamAssassin does this, for example, and it works *extremely* well. Before I found SpamAssassin, I had a huge procmail recipe that used its scoring mechanism to do basically the same thing -- but of course SpamAssassin does it better, so I switched :)
From the article:
Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
Hmm.... take an average adult geek and yes, an email mentioning sex or sexy can go to /dev/null immediately without as much as a second glance... :-)
On the other hand if you run the statistics on email of an average horny teenager, the probabilities might get a bit different.
Kaa
Kaa's Law: In any sufficiently large group of people most are idiots.
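The 99.97% figure quoted from the article can be checked with a few lines of Python. This is a sketch, assuming the words are treated as independent evidence and combined as ab / (ab + (1-a)(1-b)); the function name `combine` is made up for illustration. Different per-word numbers, say from a different user's mail, go through the same function.

```python
from math import prod


def combine(probabilities):
    """Combine per-word spam probabilities, treating words as independent."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


# "sex" at .97 and "sexy" at .99, as quoted from the article.
print(f"{combine([0.97, 0.99]):.4f}")  # 0.9997
```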
Spammers will try to work around filters, as they don't care that no one wants their crap. Further, filtering it doesn't solve the bandwidth situation, as the lines are still tied up with the bits running through the system until it hits the filter.
There is only one good solution for spam: killing spammers. It should be done, and it should be done brutally and painfully. When known criminal spammers like Ralsky (who ran a child pornography site at one point) are brutally murdered, others may think twice before firing up "EmailBlaster 2002".
STOP MISUSING APOSTROPHES, YOU MORONS!!!
To quote the author: "I get a lot of email containing the word "Lisp", and (so far) no spam that does".
He obviously isn't getting the "Lesbians with a Lisp" pr0n......
"Mary had a crypto key, she kept it in escrow, and everything that Mary said, the Feds were sure to know."
(insert (lisp joke (here)))
send spam on how to get rid of spam.
Here's how: the spam should be written as a 'multipart/alternative' with an html version of the spam as the primary alternate. The text version contains an innocuous message intended to pass the statistical spam filter. The spam message is entirely contained as an /image/ within the html. The text of the spam becomes invisible to the filter but not to the poor schmuck who gets the email.
I'm guessing here that the inclusion of a single image tag in the html is unlikely to trigger the spam filter, and supplying a wealth of evidence that the email is 'not' spam in the unseen alternate text will let the letter through.
what his spam filter would make of his article?
-- MartinG To mail me: echo kewyjlcxyzvjfxbqwh | tr bcefhjklqvwxyz
When I said market using spam, that includes the company that hires someone who spams.
This is the brilliant part, and crucial to the endeavour, and so easy to implement!
It appears all the nay-sayers here haven't even read the article (no surprise). With as little code as needed to implement this it should be a must in the next mozilla mail/pine etc. code base.
only infrmatn esentil to understandn mst b tranmitd
Having had the same email address since '93, I receive close to 1000 spams per day to my personal account (which is also aliased from root/postmaster/webmaster).
I've tried everything under the sun to reduce the amount that I see in my mailbox; SpamAssassin being one of the best so far. But even that lets through quite a bit (around 10%).
So I decided to attack it from a different angle. I wrote a series of perl-scripts that I plunked into my procmail file.
The scripts work by checking the address of the sender each time a message is received. That address is looked up in a database. If it exists in the db, and it's marked as "authorized", it's just passed into my mailbox.
If it's marked as denied, it goes to /dev/null.
If it's never been seen before, an authentication message is sent to the sender asking them to reply to it to authorize themselves. If that authmessage is bounced back, a db entry is made as "denied".
If it's replied to in a normal fashion, that email is marked as "authorized" and any queued up mail from that person is pushed out.
The concept is that spam will almost never have a valid reply-to; so it will bounce and be marked as denied.
Even if the email doesn't bounce, no spammer alive will reply to it; so after 30 days, that email is marked as "denied".
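The flow described above can be sketched as a small state machine. This is an illustrative Python sketch, not the poster's perl scripts: the `Gatekeeper` class, its method names, the in-memory dictionaries standing in for the database, and the use of plain day numbers for dates are all assumptions.

```python
from dataclasses import dataclass, field

AUTHORIZED, DENIED, PENDING = "authorized", "denied", "pending"
EXPIRY_DAYS = 30  # unanswered challenges are denied after this long


@dataclass
class Gatekeeper:
    status: dict = field(default_factory=dict)    # sender -> state
    asked_on: dict = field(default_factory=dict)  # sender -> day the challenge went out
    queued: dict = field(default_factory=dict)    # sender -> held messages

    def receive(self, sender, message, today):
        """Return 'deliver', 'drop', 'challenge' or 'hold' for one inbound message."""
        state = self.status.get(sender)
        if state == AUTHORIZED:
            return "deliver"
        if state == DENIED:
            return "drop"
        self.queued.setdefault(sender, []).append(message)
        if state == PENDING:
            return "hold"
        self.status[sender] = PENDING
        self.asked_on[sender] = today
        return "challenge"  # caller sends the authentication message

    def challenge_replied(self, sender):
        """Sender answered the challenge: authorize and release queued mail."""
        self.status[sender] = AUTHORIZED
        return self.queued.pop(sender, [])

    def challenge_bounced(self, sender):
        """Challenge bounced: the address is bogus, so deny it."""
        self.status[sender] = DENIED
        self.queued.pop(sender, None)

    def expire(self, today):
        """Deny senders who have ignored the challenge for EXPIRY_DAYS."""
        for sender, day in self.asked_on.items():
            if self.status[sender] == PENDING and today - day >= EXPIRY_DAYS:
                self.challenge_bounced(sender)


gate = Gatekeeper()
print(gate.receive("friend@example.org", "hi", today=0))     # challenge
print(gate.challenge_replied("friend@example.org"))          # ['hi']
print(gate.receive("friend@example.org", "again", today=1))  # deliver
```

A manual whitelist entry, such as a mailing list address, would simply set the sender's state to authorized up front.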
Since I've set this up (for myself and my 10-year-old son who receives porn in his box (grrr!!!!)), it has worked flawlessly. The "real" email is unharmed, while the spam is stopped.
Oh, and I have a web-based control page so that users can manually add email addresses (for lists and such).
This week, for the first time in YEARS, I don't have spam in my mailbox anymore.
Hurray!
Now if I can only stop that damned dictionary-based scanning of my servers, I'll be set. Thank the gods that I don't have metered service.
While you are there check out his book "On Lisp" (available for free at http://www.paulgraham.com/onlisptext.html). It is an extremely well written book and gives a flavor of what makes lisp special - its macros. Because lisp has such a regular syntax you can do amazing things with macros.
My only complaint about On Lisp is that it only has one chapter on the Common Lisp Object System, which is very powerful - multimethods, method combination, and a metaobject protocol - and could have used more explanation; I don't think it talks about lisp's exception handling at all.
But for a flavor of why people love lisp give this well written book a try!
I guess you never wish to converse with a blind person, or someone who's restricted to a text only medium then?
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
He isn't fighting spam, he is filtering it. There is a difference. Filtering still costs in bandwidth. Fighting it would eliminate the source and free up the gigabytes of bandwidth lost for this marketing purpose.
Filtering is fine for now, but ultimately it must be fought and defeated.
--------
It's OK to be social, just don't tell anyone about it.
Using Graham's system, write a message that will get a very high mark. The highest mark will win.
The message has to be understandable English. Please post your entry as a reply to this message.
Trollem mirabilem hanc subnotationis exigiutas non caperet
Great... now that they know, they'll spam me with gifs and jpeg.
All humans are mortal. Socrates is a human. Socrates is dead.
As for multipart/alternative... right now anything I get that has a content-type other than text/plain goes to a special folder, where it usually gets deleted without even being opened... fortunately most of my friends use proper mailers that send text/plain :-)
"The best argument against democracy is a five minute chat with the average voter."
--Winston Churchill
This does several things:
Can I use that feature for my own (commercial or open source) mail client development?
There are several papers describing using Naive Bayes classification, as well as other AI techniques, to filter spam here. Look for the section on "Document Filtering".
Although to be honest, I don't understand how the algorithm works. However I'm sure some enterprising soul can probably work it out and code something (hell I will if someone can explain it in decent mathematical terms).
All we need then is a repository of spam mail and non-spam mail to "teach it".
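As a rough sketch of what "teaching it" could look like, the Python below counts how often each token appears in a spam corpus and in a non-spam corpus. The tokenizer is a simplified assumption (lowercased runs of letters, digits, dollar signs, apostrophes and dashes), and `tokenize` and `train` are made-up names, not code from the article.

```python
import re
from collections import Counter

# Assumed token shape: letters, digits, '$', apostrophes and dashes.
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split text into lowercase tokens (a simplified, assumed tokenizer)."""
    return [t.lower() for t in TOKEN.findall(text)]


def train(messages):
    """Count how many times each token appears across a corpus."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


# Tiny stand-in corpora; a real filter would need thousands of messages.
bad = train(["Click here for FREE sex", "free offer, click now"])
good = train(["Lunch tonight?", "The Lisp macro works, apparently"])
print(bad["free"], bad["click"], good["lisp"])  # 2 2 1
```

The two tables of counts are all the state such a filter needs to keep per user.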
Whatcha reckon?
Avantslash - View Slashdot cleanly on your mobile phone.
I'm continually amazed at the people who are beating their heads up against a very simple problem. The answer is not statistics, it is not heuristics, it is not AI, it is not procmail.
The answer is verification...aka whitelists. Check out TMDA, tmda.sourceforge.net. This program assumes you don't want mail from anybody whom you haven't explicitly allowed, or who has verified that they are a real person and not a spammer.
Verification is simple, and some people will point out that it could be defeated by a spammer. But, the economics of spam do not make it feasible for a spammer to attempt to defeat TMDA.
TMDA is similar to making your phone number private. You only get phone calls from people you have given your number to, and you never get telemarketers.
TMDA user since December 2001. Spam messages that tried to get in, 12,133, spam messages that got in 3, false positives, 0. Time I've spent tweaking and modifying the program since installation, 0 minutes.
For those of us that are not LISP gurus, can someone explain what he's doing with the following code:
(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
  (unless (< (+ g b) 5)
    (max .01
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))
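One way to read the snippet quoted above is to transliterate it into Python. This is a sketch of my reading, not the author's code: `good` and `bad` are taken to be tables mapping each word to how often it occurred in non-spam and spam, and `ngood` and `nbad` to be the number of messages in each corpus. The function name `word_probability` is made up.

```python
def word_probability(word, good, bad, ngood, nbad):
    """Spam probability for one word, or None if it is too rare to judge."""
    g = 2 * good.get(word, 0)  # non-spam count is doubled, biasing against false positives
    b = bad.get(word, 0)       # spam count is used as-is
    if g + b < 5:              # mirrors (unless (< (+ g b) 5) ...): too few sightings
        return None
    bad_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, g / ngood)
    # Share of the evidence that points to spam, clamped to [.01, .99]
    # so that no single word can ever be treated as absolute proof.
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


good = {"lisp": 10, "offer": 5}
bad = {"sex": 97, "offer": 30}
print(word_probability("sex", good, bad, ngood=100, nbad=100))              # 0.99
print(word_probability("lisp", good, bad, ngood=100, nbad=100))             # 0.01
print(round(word_probability("offer", good, bad, ngood=100, nbad=100), 2))  # 0.75
print(word_probability("rare", good, bad, ngood=100, nbad=100))             # None
```

So a word seen only in spam tops out at .99, a word seen only in good mail bottoms out at .01, and anything seen fewer than five times (counting good sightings double) gets no score at all.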
a nice idea to filter spam ...another one to fight it.
1. the MTAs (mail transport agents like sendmail etc) establish trust relationships between themselves automatically or manually. They also maintain a user's safelist (i.e. address book + list of addresses the user wants to receive mail from)
2. All email over the trusted links and from addresses in the safelist are delivered unfiltered.
3. For each email sent over an untrusted link
a. Perform MD5 over message body.
b. Ask neighbouring trusted agents if they have received an email whose MD5 is given.
c. If the no. of positives is greater than a threshold, reject as spam.
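Step 3 above can be sketched in Python as follows. To keep it self-contained, each trusted neighbour is modelled as a plain set of digests it has already seen; in a real deployment that lookup would be a network query between MTAs, which is not shown. The names `body_digest` and `is_bulk` are made up for illustration.

```python
import hashlib


def body_digest(body: bytes) -> str:
    """Step 3a: MD5 hex digest of the message body."""
    return hashlib.md5(body).hexdigest()


def is_bulk(body: bytes, neighbours, threshold: int) -> bool:
    """Steps 3b and 3c: poll the neighbours, reject above the threshold."""
    digest = body_digest(body)
    positives = sum(1 for seen in neighbours if digest in seen)
    return positives > threshold


spam = b"Make money fast"
# Two of three hypothetical neighbours have already received this body.
neighbours = [{body_digest(spam)}, {body_digest(spam)}, set()]
print(is_bulk(spam, neighbours, threshold=1))       # True
print(is_bulk(b"Hi Mum", neighbours, threshold=1))  # False
```

An exact digest only catches byte-identical bodies, so a spammer who varies each copy slightly would evade it.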
You could develop a corpus of spam over a long period of time, and look for shifts in the data. What this paper describes is distinguishing between a spam-corpus and a legit-corpus, but you could also compare a spam-1999 corpus to a spam-2002 corpus, and see if the spammers are up to anything new.
Not that it would be useful, but it might be kind of cool to try it out and see.
It's not wasting time, I'm educating myself.
I have no doubts about the research that goes into the calculation of words that were in spam, since pretty much everyone gets similar types of spam and it's not difficult to collect spam marketed to many demographics.
What I do wonder about is his collection of non-spam. I agree that this approach is very good, but I think a hash of non-spam needs to be collected by an end user or for a specific demographic.
For instance in his article he said that the word madam almost never appears in his non-spam mails. Well, he isn't a woman. It is quite a common business practice to send e-mails with the greeting madam. Also the vocabulary used in a personal e-mail environment would be drastically different than in a business environment.
Say you're an AOL teeny-bopper... the chances that another teen is sending you an e-mail with red text (ff0000 was one of the key words that had a 99% chance of spam) are much greater than in a business e-mail environment (although I actually use bright red sometimes when replying in-line to e-mails at the office).
So like I said before, I really think the hash of 'good' e-mails has to come from an end user or at the very least from a demographic...
Your mammas flamebait.
However to test such an idea I need a repository of spam mail - something I don't have. Hotmail junk is no good, it's just the same old adverts regurgitated over and over again.
Does anyone have anything like the 4000 junk emails that this guy has? If so, please could you pop me an email to org dot ewtoo at silver as I'd really appreciate it!
From the article:
In the spam filtering business, false positives are your biggest worry...Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability...an email containing both words would have a 99.97% chance of being a spam.
False positives could be a HUGE problem in this case...imagine the agony if you missed this email from your wife: "I'm feeling REALLY sexy today - meet me at the motel off 12th street at noon for some lunch-hour sex!"
Actually it's a pattern of characters it's working with; English has nothing to do with it. The concept will work for any pattern as he's defined it, and therefore any language.
We are all afraid that new and powerful spam filters will filter out an email that was directed at us, but honestly, how many of us haven't accidentally deleted one ourselves? My spam deleting technique is
1.Check name
2.Check subject
3.Decide
And even this system has been known to delete a false positive or two (Hey, I didn't know Alisa knew my email and "Hi" from the name "Alisa" just sounds like spam)
My point being, I doubt if any spam system will ever truly get to the point of never deleting a false positive, but it doesn't mean you should avoid spam filters, or leave them set at settings that make little to no difference.
Why is a mouse that spins?
So, do we give 'em the Iron Maiden or stick to 'em Transylvania style?
Ok, I read the article but quickly, and at the end of it I wasn't sure how he ultimately told the system that an individual e-mail was spam or that it was legitimate, so it would know into which bin to toss those words...is that a manual process?
I set up a homebrew whitelist (which still shows me the potential spam) I'm pretty happy with. I'm trying to figure out if I should keep in the subject based whitelisting or not...some spammers use my typical "hey" or "hi" subjects now...and it's the part of the system that grows the most. I'm just worried I'll send out mail to someone and they'll reply with a different e-mail address...maybe I should expire subjects?
Hmmm.
SO YOU'RE GOING TO DIE: The Comic for Dealing with Death
I wonder what Bernard Shifman would make of this article?
What is our 'CS Consultant' up to these days?
Are you local? There's nothing for you here!
it looks great, and i will try it for my account that i use eudora or outlook for...however, i use a hotmail address for my main account (so it can travel wherever with me), and their custom filtering system sucks (if i may say so)...the only things they let you filter on are subject, From Name, From Addr, & To or CC lines...no option to filter on message content, which is where this would be useful...oh well, i guess that's what i get for using hotmail...i should get a real e-mail account...
"Facts are meaningless. You could use facts to prove anything that's even remotely true." - Homer Simpson
Making spam illegal would probably cut down on people buying email lists and starting to spam in their free time because it seems like a great way to make some money. It might even cut down on the "legitimate businessmen" types here who do it professionally. It's going to have no effect internationally, however, and there's really not much you can do about it.
There's an interesting point about this in the article, however, when Graham says:
I would agree with this - it seems to me that for a lot of "crimes" of this nature, drugs being the best example, the solution is not criminalization but regulation. People aren't going to stop dealing or using drugs, nor is it something as serious (like murder) that it's worth it to put them in jail anyway. If drugs were regulated, however, most of the problems could be easily reduced. Enforce strict controls to prevent cutting, ban advertisement, and tie sellers to treatment programs to help get people off of drugs. As long as there's no incentive for people to buy them illegally (ie, their being much cheaper or, as it is now, the only supply), people will buy them from regulated sellers.
Similarly if you regulate spam and make people attach footers you'll be less likely to drive people overseas to spam while also making it much easier to filter out.
Of course, there's still not much you can do about the Koreans, other than trying to get their government to do the same thing.
Besides, do you really want to encourage the government to effectively prohibit certain kinds of non-victimizing (non-kiddie porn) speech online?
It was with the help of spam that with just a simple herbal supplement I was able to add three inches to my penis (an increase of over 20%). I had assumed it was just a scam, and nobody was more surprised than me that it worked.
Well, except my wife.
Phallic Symbols in LOTR
Good method. I work with Bayesian techniques often and I had thought of the same thing but for a different purpose: automatic classification of emails. When you receive an email, your mail reader would propose a list of potential folders into which you might want to put your email after (or before) having read it. And the best thing is that it learns with time and it gets better. And as this article shows, this method can also automatically filter emails. Now if I have time to get involved in the Evolution project or kmail, ...
I'll do it for cheesy poofs.
Feel free to review the work at http://research.microsoft.com/~horvitz/junkfilter.htm
They came up with similar processes to both filter and to categorize. Bayesian analysis is very flexible, and while Paul Graham is not the first to try this, his work looks very exciting.
I had nothing to do with any of this work; just a fan of Bayesian research.
Michael
I spent about six months writing software that looked for individual spam features before I tried the statistical approach...[cut]...Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability.
Of course these probabilities may vary from person to person.
Check this out:
Digiportal's innovative new ChoiceMail program means the end of spam. I really don't like the idea of using someone else's server to manage my white-list, but all someone needs to do is publish an open-source CGI script to do this... integrated with qmail.
Look in UseNet. The group news.admin.net-abuse.sightings is where people post their spams. Enjoy!
No replies made to AC posts. Please log in.
Usually, I don't mind getting spam on my (insert your favourite free web-mail here), but my university email account is something personal. So, armed with whois command, I started complaining up and down, around the world's ISPs. My question is: Has anyone done this? Any success/dismay stories?
I know I can install a spam filter on my email client, but I prefer to have the email stored on university's mail server. That way I can ssh from anywhere and read my email (pine) and newsgroup(tin), ah.. the nostalgia..
While he advocates generating probability tables from an individual user's corpus of messages, I would imagine that most users will have many low-spam-probability words in common.
Even easier, since he assigns a low score to unknown words, appending a sequence of random sets of letters to the end of the message would have much the same effect.
Checking for phrases (rather than words) can mitigate this a bit, but all in all, this still looks like a stopgap measure.
Roving Web-Teleoperated Robot
1. Create layout of spam
2. Take a screenshot
3. Convert to low res PNG or JPG
4. Mail the JPG to 100,000 annoyed geeks
5. ???
6. Profit
This message brought to you by the Council of People Who Are Sick of Seeing More People.
"Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set"
The full details of the patent can be seen here.
Patent Link
I'm surprised you guys don't check at the patent office first before you get all excited about a new idea. Doh!
The bikini - security through obscurity since 1943
The latest trick from spammers is sending out HTML e-mails with their ads. Not a problem by itself, but by embedding the entire spam ad as a single GIF or JPEG image, there's no text for the spambot to filter out. It's easy to trap false positives with this, too, since a family member or friend might want to send out photos without necessarily attaching text as well. Boom, statistical analysis is instantly useless, and we have to go back to the old tricks -- filtering out known spam e-mail and domain sources.
I actually had to close down my hotmail account; the spam would exceed the 2MB within 24 hours after being cleaned (and that's with the wonderful MS spam filter set on "high.")
BTW, these days I'm getting individual spams that are 170 KB in size. Talk about rude...
Phallic Symbols in LOTR
"but ultimately each user should have his own per-word probabilities based on the actual mail he receives ... perhaps best of all makes it hard for spammers to tune mails to get through the filters"
My Karma: ran over your Dogma
StrawberryFrog
Did you happen to read the article? He discusses this at length. He makes a strong argument that his system is actually pretty robust, since to get around it consistently the spam has to look just like your real email, which is pretty darn hard for them to do.
In a lot of ways this problem is like cheating in games. As long as you're the only one who knows the exploit, you can be pretty sure that it's not going to get fixed, though you'll still get kicked off every server you play on. Similarly, with his method a spammer might be able to find a particular phrasing that's likely to get through, though his messages will still be deleted on arrival. But even if he does, if he starts sending you too many emails or starts selling his technique the filter will adapt with the spam and start filtering it out.
Can this algorithm also be applied to Slashdot comments, and tell whether or not they will be rated "+5, Interesting"?
The world will end in 5 minutes. Please log out.
Sexy young hot teen lesbian girls with who lisp.
See my other post
The bikini - security through obscurity since 1943
Of course, the problem now is that spammers won't use ff0000 as a colour, they won't start with Dear Sir or Madam, and we'll just have to start again.
I think the best way is to make a similar list of words you find in valid emails, rather than a list of things that occur in spam.
One idea that I use that I've never seen used anywhere else, is change your email address to:
user.aug02@domain.co.uk, and that way any spammers will only have a valid address for max 31 days. Change your email address each month. Humans can work it out, bots can't.
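A small Python sketch of the rotating address, assuming the tag is the three-letter English month plus a two-digit year as in the example above. `monthly_address` and `is_current` are hypothetical helper names.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def monthly_address(user: str, domain: str, when: date) -> str:
    """Build an address such as user.aug02@domain.co.uk for the given month."""
    tag = f"{MONTHS[when.month - 1]}{when.year % 100:02d}"
    return f"{user}.{tag}@{domain}"


def is_current(address: str, user: str, domain: str, today: date) -> bool:
    """Accept mail only if it was sent to this month's address."""
    return address.lower() == monthly_address(user, domain, today)


print(monthly_address("user", "domain.co.uk", date(2002, 8, 16)))
# user.aug02@domain.co.uk
```

Accepting the previous month's address for a few days of grace would be an easy extension.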
Get your own free personal location tracker
Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.
So what's to keep spammers from reading this article, and tailoring their spam to stop using 'hos' and 'ladies' and start including words like 'tonight' and 'apparently'?
'This week only! All the hiz'oes and liz'adies you could want on our website. Sign up tonight and receive a free two month membership! Apparently we'd uh... like your business!'
D
Taco whacking off to the girls volleyball team?
You're new here, aren't you.
I think rather than lump stuff into "spam" and "non-spam", it should assign a ranking number and preferably display a color to represent the ranking.
If you are tired, then you can ignore or defer the grey areas.
Also, if the display list displayed the first X characters from the content, then one can often check without reading the whole thing. (Perhaps filter out non-indicative words like "the" and "and" to make it more compact to display.)
I don't think there is one magic technique because spammers will work around it if it gets popular. Thus, a combination of machine and human working together will be more effective IMO.
Table-ized A.I.
>> based on the assumption that it is written in English
There's no reason to think that Spanish, German, or even Chinese spam doesn't follow the same statistical word frequency rules.
>> following the simple steps outlined in the URL above
What if you are subscribed to mailing lists, or have mail bots that send you useful messages (like "your server is down")? The usual answer is "just configure those in advance" but that's a pain and not very robust. My hosting company was bought out and their automated server status messages just started to come from a new domain. If I had this kind of filter I would have missed them.
What I want to know is:
Would this also work with email viruses? I think it would, since a virus would also have a defined pattern to it and the program would pick it up after the first one.
Could this be made part of the SMTP protocol or built into the backbone layer of the network? Again, I see no major reason why it couldn't.
Problems that I have with it are:
Since each word is treated as a token and everything else is not, I'm sure that spammers would quickly figure out that a spam like this just might work:
<HTML>
<BODY>
Enlarge <!-- elephant --> penis [etc..]
</BODY>
</HTML>
which would show the message but hide the balancing words, so it could be possible to change the delta into your favor.
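One possible countermeasure to the hidden-comment trick is to tokenize only what the reader would actually see. The Python below is a sketch under that assumption: it uses regular expressions rather than a real HTML parser, so it will not cope with every malformed message, and `visible_tokens` is a made-up name.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")


def visible_tokens(html: str):
    """Tokenize only the visible text: drop comments first, then tags."""
    text = COMMENT.sub("", html)  # comments vanish entirely, rejoining split words
    text = TAG.sub(" ", text)     # tags act as token separators
    return [t.lower() for t in TOKEN.findall(text)]


sample = "<HTML><BODY>Cheap <!-- elephant --> pills</BODY></HTML>"
print(visible_tokens(sample))  # ['cheap', 'pills']
```

With the comments stripped before counting, the balancing word never reaches the filter.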
Does anyone else have thoughts on how this might be broken?
III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIII
SMTP is designed broken because it:
1) Allows senders to be faked.
2) Is slow.
3) Requires bounces for broken messages.
4) Allows loops.
5) Cross-subscription to mailing lists, complicated mailing list management.
6) MIME.
7) Add your gripe here.
See http://cr.yp.to/im2000.html
Mailing lists are sent out by machine too. As a list administrator, if you sign up and I get one of these, then you're not going to be signed up much longer.
One is not a problem, 10 is an annoyance, and when you're dealing with 10k people subscribed, it's a royal pain in the ass.
So my automated bounce system will unsubscribe you, even if you have given permission for the list, paid for the list, or whatever.
Is there any way in Postfix/Sendmail/Exim/Whatever to strip HTML tags out of incoming mails?
Get your own free personal location tracker
Laws will never stop spammers. The damages are very hard to prove, especially when the judge/jury don't realize that their ISP filters their mail for 95+% of the spam already. Most people just don't GET it. And most spammers are sending the spam from another country, running a fly-by-night operation, so prosecution is nearly impossible.
Filters are helpful, but they still require huge resources to receive the e-mail and process it. And as stated in the article, the risk of a false positive is often much worse than just receiving the spam.
There are already only a few mail relays that are willing to send out spam, and virtually nobody accepts ANY mail from them. The spam going out is coming through illegally used mail servers. This shows what is to be the solution to the problem of spam: ISPs will only act to stop spam when the spammer is damaging their system.
At least 99% of spam gets deleted without the enclosed links ever being clicked. The company hosting the web site just sees their customer getting some success with their business. They don't know why, and they really don't believe/care when someone e-mails them to say that the user spammed them from a mail relay in China. The user probably paid for 2 gig/month of traffic, and they are well under quota.
It's time to change that. With a SETI@Home / Prime95 type application, we could easily DDoS a daily spammer off the net. Slashdot alone could easily field 10000 users willing to put their cable modems up to the task of pounding spammers' accounts (and possibly the hosting ISP) off the net. Beat them down until the account appears to be deleted. Maybe then ISPs would hold users accountable for being spammers. Web hosting contracts might start including fines ($500+) for abusing the service, rather than just the scary risk of a cancelled account. All we have to do is beat them down before the few clueless morons come buying and make it worth their while.
Legal? Sure, I don't see why not. I can send 10 HTTP requests to the ISP in a second... I've never heard of a law that says I can't do that every second. As long as the computers involved are from willing users (sysadmins get permission in writing first), there is no 'hacking'. Every DDoS case I've ever heard of involved charges of 2k+ computers 'hacked', rather than the ensuing attack. Even if it is illegal, this is vigilantism that nobody (other than the hosting ISP) is going to complain about.
In addition to looking for spam you could use this approach to initially filter your valid email into various topical inboxes.
So, with some modifications it could move all the mail you get from work, from various hobby/educational/professional listserves and put them into separate folders for you to scan.
I'd find this tremendously useful, and it would probably even enable me to subscribe to a few more listserves - without the worry of being buried in the resulting email.
This reasoning is statistically invalid. It is only true if the chance of the word "sexy" appearing in a message is independent of the chance of the word "sex" appearing. In other words, only if knowing that the word "sex" appears tells you nothing about how likely the word "sexy" is to appear, can you reason as he is doing above. That's probably a very poor assumption in this case.
He is doing:
P(sex and sexy) = P(sex) * P(sexy)
The correct formula is:
P(sex and sexy) = P(sex) * P(sexy | sex)
where the last term means the probability of "sexy" given that "sex" appears. Maybe his approach is good enough for his purposes, but the statistical foundations are not correct.
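To make the gap concrete, here is a small Python sketch with invented counts (the numbers are assumptions for illustration, not from anyone's corpus). It compares the product of the two marginal probabilities with the chain-rule value that uses the conditional term.

```python
from fractions import Fraction

# Invented counts for illustration only.
total_messages = 1000
with_sex = 100    # messages containing "sex"
with_sexy = 50    # messages containing "sexy"
with_both = 40    # messages containing both words

p_sex = Fraction(with_sex, total_messages)
p_sexy = Fraction(with_sexy, total_messages)

# Treating the words as independent: multiply the marginals.
independent_estimate = p_sex * p_sexy

# Chain rule: P(sex and sexy) = P(sex) * P(sexy | sex).
p_sexy_given_sex = Fraction(with_both, with_sex)
chain_rule_value = p_sex * p_sexy_given_sex

print("independence assumption:", independent_estimate)
print("chain rule:", chain_rule_value)
print("observed joint frequency:", Fraction(with_both, total_messages))
```

With these made-up counts the independence shortcut is off by a factor of eight, because the two words tend to show up together.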
Well I'm wondering if spammers could bias the score by simply adding a list of "counter-words"
Kind of the way web sites bias themselves with search engines.
Remember statistics do have weak points. Think "Lies, damn lies, and..."
if it does, it might take a few emails of the exact same kind of spam until your filter starts ignoring it.
Why not have a centralized free online database of the same kind, that way as soon as an email is sent to just a few people, the filter starts to recognize them right away.
So basically your mail program would be contributing and borrowing the hashtable from this central DB online for each "mail session".
If everyone used this, a spammer would be stopped dead in his tracks after the first few emails sent.
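A rough Python sketch of what "contributing and borrowing the hashtable" could look like. Everything here (the class name, the method names, counting each token once per message) is an assumption made for illustration; it ignores the network, trust, and poisoning problems a real shared database would have.

```python
from collections import Counter


class SharedCorpus:
    """Toy stand-in for the central database: pooled token counts."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def contribute(self, tokens, is_spam):
        """Fold one already-classified message into the pooled counts."""
        unique = set(tokens)  # count each token once per message
        if is_spam:
            self.spam.update(unique)
            self.n_spam += 1
        else:
            self.ham.update(unique)
            self.n_ham += 1

    def borrow(self):
        """Return a copy of the pooled tables for one mail session."""
        return {
            "spam": dict(self.spam),
            "ham": dict(self.ham),
            "n_spam": self.n_spam,
            "n_ham": self.n_ham,
        }


corpus = SharedCorpus()
corpus.contribute(["click", "here", "free", "free"], is_spam=True)
corpus.contribute(["click", "now"], is_spam=True)
corpus.contribute(["lunch", "here"], is_spam=False)
tables = corpus.borrow()
print(tables["spam"]["click"], tables["n_spam"])
```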
--me
Your filter's usefulness is inversely proportional to the number of people who use it, since it is trivial to bypass by a spammer who knows its details.
Space is cheap. It would be far better to NEVER DELETE YOUR EMAIL. Instead you should just toss it into different folders. A folder per mailing list, a folder for verified spam, a folder for filtered mail (suspected spam), and a general inbox works well.
Michael
I agree - I think things have gotten so bad, that it might not be practical to use algorithms to detect spam. I am using a permission-based system like the Si20. It is called ChoiceMail and it is put out by DigiPortal. If a spammer wants to send you email, they must first ask your permission. If it is a friend, you just give them the OK, and they are forever on your whitelist. I have been using this for about a month, and I too, get ZERO spam.
Can you imagine the day everyone uses this. You send mail to a public list and get back 2000 messages asking you to "authenticate" yourself.
This is a bad plan for working in the large.
Clippy/BOB/etc were based on Bayesian techniques, right? Does this mean M$ could soon build this into Exchange/Outlook?
dundunDUNNNN
[o]_O
As described, it would be very hard for legit spam to get through. However, what I'm thinking is that they could have their normal 5 KB of email which is spam, and at the bottom (or anywhere else) just add 20 KB of words they know are "good words". Throw HTML comment tags around it and it's never seen by the viewer, but the large number of "good words" outnumbers the "bad words", causing a spam message to be considered good...
I don't know if that'll really work, but it's a thought.
"cogito, ergo sum"
I think DNSBLs and legal actions can be effective, and perhaps additional approaches will arise, but filtering should only be a temporary tactic because the victim ends up paying for the spam anyway.
The beautiful thing about this approach that people seem to be missing is that it evolves as spam does.
I don't know how it will work with images, though.
How about *paying* for e-mails?
It's been suggested before, but if all e-mails had a small (say US$0.02) charge associated with sending them, bulk e-mailers would have to be much more careful. They like it because it's virtually free, so a tiny tiny number of replies will pay off. If you change the economics, you change their business model.
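The change in economics can be put as back-of-the-envelope arithmetic. In the Python sketch below, only the US$0.02 charge comes from the comment above; the mailing size, response rate, and revenue per response are invented for illustration. Amounts are in whole cents to keep the arithmetic exact.

```python
def campaign_profit_cents(n_sent, one_response_in, revenue_cents, postage_cents):
    """Profit of a bulk mailing in cents.

    n_sent          -- number of messages sent
    one_response_in -- one paying response per this many messages
    revenue_cents   -- revenue per response, in cents
    postage_cents   -- per-message sending charge, in cents
    """
    responses = n_sent // one_response_in
    return responses * revenue_cents - n_sent * postage_cents


# Assumed figures: 1,000,000 messages, 1 response per 100,000, $50 per response.
free_mail = campaign_profit_cents(1_000_000, 100_000, 5_000, 0)
paid_mail = campaign_profit_cents(1_000_000, 100_000, 5_000, 2)

print("sending is free:   $%.2f" % (free_mail / 100))
print("2 cents a message: $%.2f" % (paid_mail / 100))
```

Under these assumed numbers a mailing that clears $500 when sending is free loses $19,500 once each message costs two cents.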
By the time one can apply the filters, you have already received the spam. This is a load on your resources. In some cases your in-box may even fill up (yes, I've received 1000's of the same piece of spam in the same hour, exceeding the capacity of my allotted storage and effectively DOSing me from real e-mail) or you may exceed limitations from forwarding services.
The spammers don't really care. Or notice. Their goal is to hit millions of victims, knowing that some of them will respond. The response is all they care about. Filter your e-mail all you want, you were not going to respond to them anyway. All they care about is reaching the mark that doesn't know any better, and this filter doesn't do anything to stop that (unless it is applied automatically by ISPs, unlikely due to the fear of false positives).
What might help is a two fold attack on what they want: responses from marks. I suggest the following:
A massive education campaign to educate the general Internet user to never respond to (or even read) strange messages that show up in your e-mail. Banner ads would seem a good place to start, it would be a public service if a good percentage of banners were replaced with ones that educated the Internet users who still make spam profitable. This might even have the long term effect of improving banner revenue: if banners compete with spam as a way to get out a message they have a lower value than if the public is taught to not buy from spam and even to aggressively resist doing business with a spammer. In the long run an antispam banner campaign could improve banner revenue for those who help fight spam. Ideally another great way to get the word out would be UCE, but that poses a moral dilemma....
The other thing that could affect the spammer is if the ads are not getting the desired results for the advertisers. What needs to happen here isn't filtering, it's massive negative response to the advertiser. No response doesn't hurt them, but making them respond themselves to unwanted responses is a more suitable way to respond to those who originate unwanted messages to us in the first place. These people need to get responses that waste their time and resources like they are wasting ours. Obviously those who supply 800 numbers are a prime target for this, while those who supply only postal addresses make it too costly to respond. I think such negative response campaigns need to be coordinated from major popular sites to be truly effective (not just from a few geeks who spend their day on an anti-spam website; their efforts are much better applied to getting the spam sources into black holes and getting ISPs to block or filter spam). It sure would be nice to see the Slashdot effect applied to spammers rather than the poor schmuck who puts up a small but interesting website.
Interested in others' thoughts in this area.
I'm an American. I love this country and the freedoms that we used to have.
Just out of curiosity, what if a bunch of geeks set up servers to DOS-flood sites that spammed. (This would not be the return address, since those are usually phony, but the website that sells the goods being advertized.)
If such was possible, then Viagra.com would think twice about starting another spam campaign.
Table-ized A.I.
Freedom of speech is not the freedom to trespass on my computer equipment, or to use my resources to make me listen to your advertising!
This is not a prohibition on your paying your money to spread your advertising. This is a prohibition on you spending my money to spread your advertising.
Commercial speech does have some constitutional protection, but not to the same level as non-commercial speech. But even with pure political speech, there is no requirement for me to pay for your speech.
As for hitting the delete key, at that point, you have already tied up at least 2 of my computers, used my disk storage, my time, my bandwidth without paying for it.
If you want to spam, no problem, just pay me in advance.
Fight Spammers!
The article proposes that "one cooperative project that I think really would be a good idea would be to accumulate a giant corpus of spam."
This brings to mind a huge, quivering, pink mass of luncheon meat sitting on the Midwestern prairie just down the road from the World's Largest Ball of Twine.
Nukeeeeeeeem!
This is the best paragraph of the whole article:
So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.
The Bayesian filter. You can run, but you can't hide!
+1 Insightful, -1 Troll. What can I say, I'm an Insightful Troll.
Leave it to the Slashbots to try to destroy the rule with the exception.
He can add them himself, dolt.
funny, insightful, whatever.
I mean really, most spammers are desperate, "be-your-own-boss" morons. Well, maybe just misguided, but yeah.
what kind of modding is this?
(4 - Informative) ???
it's just a paragraph from the original article. You modders would already be informed if you'd read the article.
oktay
---------------
Founder of The Free Linux CD Project
In addition to filters being individually tuned, the system allows for "whitelists" - Any mail address on the user's whitelist automagically bypasses the filter.
The difference between this and other whitelist approaches is that "new" people who are sending you legit mail (Like Horny Teenager's latest BF/GF) will likely get through, as opposed to having to authenticate in some manner.
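In Python, the whitelist-then-filter order described above might look like the sketch below. The function name, folder names, and the 0.9 cutoff are assumptions for illustration; the spam probability is whatever the per-user filter produced.

```python
def route(sender, spam_probability, whitelist, threshold=0.9):
    """Pick a folder for one message.

    Whitelisted senders skip the filter entirely. Everyone else, including
    people who have never written before, is judged on the message itself,
    so no challenge or authentication step is needed.
    """
    if sender.strip().lower() in whitelist:
        return "inbox"
    return "spam" if spam_probability > threshold else "inbox"


whitelist = {"friend@example.com"}

print(route("Friend@Example.com", 0.99, whitelist))    # whitelisted, bypasses filter
print(route("new.person@example.org", 0.10, whitelist))  # unknown sender, low score
print(route("offers@example.net", 0.99, whitelist))    # unknown sender, high score
```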
retrorocket.o not found, launch anyway?
I had my first spam before I received my e-mailed copy of the journal. It was "related" to the topic of the journal, and said something like "I sure agree with what you wrote about in the journal. What's your opinion about http://my.url ?" But the To: line was the clue. It included not only me@myrealaddress.com but also that of smart.guy@nospam.address.com (another poster in the journal.) It was very apparent that the author had simply harvested the HTML and dropped it into his address database.
It was only hours before I was getting offers for detoxifying myself, HGH, climax gel and all-free teen pr0n.
John
A few quick comments about this. Although powerful, such approaches suffer from being somewhat too 'black-box'. That is, you turn control over to the computer to make decisions based upon statistical recurrences. This leaves you very vulnerable to several problems.
For instance, the author remarks that he believes a bigger corpus of spam would help train filters. That's true, but misleading: it would help train filters that distinguish between his 'nonspam' corpus and his 'spam' corpus. In this case, he is surely increasing his true-positives.. his rejection of things that really are spam. But his false-positive rate is not helped at all, because his samples are so biased.
(Example: 10 spams get the word 'blunderbuss' but he has no regular email with that word. Therefore, any future email may be rejected because of the word 'blunderbuss', even though there is no basis to know whether the word CAN be used legitimately.)
If the system is done intelligently, this will simply mean that having a lopsided sample will do nothing (the resolving power will be dominated by the smaller of the two samples), but this may be counterintuitive to some.
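The 'blunderbuss' case can be tried out on a per-word estimate of the kind the article describes. The Python sketch below is reconstructed from memory of the article, not quoted from this thread: doubling the non-spam count, skipping words seen fewer than five times, and clamping to 0.01/0.99 should all be treated as assumptions. The corpus sizes are invented.

```python
def token_spam_probability(good, bad, ngood, nbad, min_count=5):
    """Estimate how strongly one word points to spam.

    good, bad   -- occurrences of the word in the non-spam and spam corpora
    ngood, nbad -- number of messages in each corpus
    Returns None when the word has been seen too rarely to judge.
    """
    g = 2 * good  # assumed bias toward non-spam, to reduce false positives
    b = bad
    if g + b < min_count:
        return None
    good_freq = min(1.0, g / ngood)
    bad_freq = min(1.0, b / nbad)
    p = bad_freq / (good_freq + bad_freq)
    return max(0.01, min(0.99, p))


# 'blunderbuss': 10 spam hits, no legitimate hits, 1000 messages per corpus.
print(token_spam_probability(good=0, bad=10, ngood=1000, nbad=1000))
# One legitimate use pulls the estimate down noticeably.
print(token_spam_probability(good=1, bad=10, ngood=1000, nbad=1000))
# Too few sightings: no opinion at all.
print(token_spam_probability(good=0, bad=3, ngood=1000, nbad=1000))
```

Under these assumptions the word pegs at the upper clamp until a single legitimate use turns up, which is the bias the comment is worried about.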
Another problem is that you don't know WHY choices are being made, and that's bad science. Ok, ok, so this isn't science, it's Spam prevention, but I like science.
---N
who is a rabid lisp loser.
Please stop posting crap from people like him.
Thread on google
http://makeashorterlink.com/?N2AC15981
ASK is a system similar to yours with some tweaks:
If you send someone email, and they reply to it, leaving in your .sig, they are automatically whitelisted.
Mailing lists are handled automagically.
Check it out:
http://a-s-k.sf.net
The built in spam filters for Outlook and Hotmail are just so much less efficient than Spamassassin or Razor/SpamNET.
My recent experience shows about 90% of the spam I get can be detected by SpamAssassin, 70% by SpamNET and about the same for Hotmail. The Outlook/Outlook Express filters are basically blacklists and catch maybe 40% if properly maintained.
It does sound very similar, so why haven't they been able to implement a Bayesian filter as successfully as the lisp guru?
What's the legality of a DDoS where each attacker is an individual person and not a "zombie"?
I recall during the RIAA DoS discussion there were some methods of DoSing that were rather legit. (Slow HTTP request for instance - G, sleep 5, E, sleep 5, T, sleep 5, etc etc. Not a huge bandwidth hog but wreaks havoc with HTTP servers if enough people do it.)
retrorocket.o not found, launch anyway?
Well, my spamfile did (thanks procmail) and I submitted it to spamcop and noticed they have a freephone number.
Now I could ring up and it would cost THEM money, which is a little teensy bit of payback - but imagine posting that freephone number on a site somewhere where like minded people hang out. They could all ring up the number, cost the company money and tie up their staff by chatting to them about their product.
It might make them rethink their spamming.
no sig.
Paul is taking an interesting approach here, but he's not correct in saying that SpamAssassin doesn't use a statistical approach. He has a bit of a point in noting that his system will generate a prediction probability which is more intuitive than SpamAssassin's scoring system in terms of determining how likely a message is to be spam, but there is also an attractive element to the simplified, non-math way that SA uses scores, which allows them to be more understandable to non-math people.
Seems like a number of the points which Paul makes in the article about spammers being defeatable, about the basic premise that they must get their message through in order to be successful, and that the war on spam is winnable are extensions from my interview with Salon a few months back, but his statistical approach fails to make use of one factor which I believe is critical (and which SpamAssassin attempts to exploit), which is that those commercial messages must convey a commercial message, in other words, they have to be a message, and have some sort of linguistic component which encourages the reader to do something. A purely statistical approach to spam filtering will lose the power of doing analysis of higher-order linguistic concepts.
SpamAssassin's approach is to use the universe's best known natural language processors (humans) to build rules which they believe can differentiate linguistic elements of spam vs nonspam messages, and then use the best optimization and statistical tools we have (currently only using decent tools, not the best tools) to determine how to score those rules against individual messages. The scoring system is very simplistic today, just being a simple sum of the scores of the various rules (though it's slightly nonlinear because of the properties of some of the rules, like the auto-whitelist). Future SpamAssassin development directions include extending the scoring system to be much more non-linear, including examining statistically the frequency of occurrence of combinations of rule triggers.
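For readers who have not seen the "simple sum of the scores" style, here is a toy Python version. The rule names, weights, and threshold are invented for illustration and are not SpamAssassin's actual rules or scores.

```python
# Hypothetical hand-written rules: (name, score, test on the message).
RULES = [
    ("SUBJECT_ALL_CAPS", 1.5, lambda m: m["subject"].isupper()),
    ("MENTIONS_FREE", 2.0, lambda m: "free" in m["body"].lower()),
    ("MANY_EXCLAMATIONS", 2.5, lambda m: m["body"].count("!") >= 3),
]
THRESHOLD = 5.0  # assumed cutoff: at or above this, call it spam


def score(message):
    """Return (total score, names of the rules that fired)."""
    hits = [(name, weight) for name, weight, test in RULES if test(message)]
    total = sum(weight for _name, weight in hits)
    return total, [name for name, _weight in hits]


spammy = {"subject": "WIN NOW", "body": "Free prize!!! Click here"}
normal = {"subject": "Lunch on Friday?", "body": "Are you free at noon?"}

for msg in (spammy, normal):
    total, fired = score(msg)
    print(msg["subject"], total, fired, "SPAM" if total >= THRESHOLD else "ok")
```

The list of fired rules is what makes this style easy to explain to non-math people: each point on the score has a name attached.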
Automated rule-creation certainly has its place (for example, SpamAssassin's spam-phrase rule, or the auto-whitelist), but I truly believe that the ideal spam filtering system will always have to make the best use it can of human language processing skills. Using this combination of human/computer power, I believe that SpamAssassin can (and often does for many existing users) achieve better ROC performance than anything else.
He proposes you define it as unsolicited automated mail. But that's not it exactly. It's only automated unsolicited mail that you don't want. If he had been looking for that Raleigh three speed and had happened to get unsolicited automated mail offering him one, he would have been delighted to get that piece of spam. So sometimes you don't know you wanted it until AFTER you've read it. I would rather avoid filtering on headers if possible.... if the above email came from the same open mail relay as 2 tons of porn email, that doesn't change the fact that I would want the above email anyway.
Spam is mail you don't want. The automated feature is irrelevant. If an army of trained monkeys were copy and pasting the mail to you by hand, would this make it not spam? Of course not.
Is it mail you want just because your friend sent it to you? Even though it's a forwarded chain letter? No, then it's junk.
The goal of filters shouldn't be to filter out automated unsolicited mail, it should be to filter out mail you don't want. So if you are a horny teenager you might want to let all the sex mails through.... that doesn't mean they're not spam. But the spam status is really irrelevant. Very good article.... just replace 'delete as spam' with 'delete as unwanted'.
My primary concern comes from the fact that most of the spams I receive are either Korean or English, while most of the legitimate mails are in Norwegian. Sending me Korean mail is pointless anyhow, but I fear that simply the _use_ of English will make his scheme produce lots of false positives.
Oh well, I'll probably make my own authentication scheme. It does seem like the way to go. Or, of course, I could subscribe to a few mailing lists just to give his algorithm more entropy to work with.
I was going to post a message suggesting exactly this but you beat me to the punch.
:-D
Why doesn't my email app have this already???
Black holes are where the Matrix raised SIGFPE
Wow. It's described down to a level of detail that would make you think they've already written the Outlook add-in for it. I wonder why we haven't seen it yet?
The only way to end spam is to end human life.
Extreme but 100% effective.
We don't even have to commit murder, we can just chain them up and leave them to rot in their basements. They don't have any friends so no one would come to set them free.
And even if they did have someone they owed money to or something and they came by and set them free. I bet the probability of them sending more spam would be greatly diminished.
...is in the footnotes:
[2] As a rule of thumb, the more qualifiers there are before the name of a country, the more corrupt the rulers. A country called The Socialist People's Democratic Republic of X is probably the last place in the world you'd want to live.
So what does that make of United States of America?
The CRM114 active filter uses the Bayesian
technique described, but extends the probabilities
to _phrases_ (including interrupted phrases) not just words.
For example, the phrase
Mary had a little lamb
would insert hash marker entries on
mary, had, a, little, lamb, mary had, mary a,
mary little, mary lamb, mary had a, mary had
little, mary had lamb, mary had a little
and so on. My experiments say that you are
just about out of significance at five words
and it doesn't pay to go past that.
The advantage of this is that it's often not
words, but phrases that have the higher-level
"meaning" (grammatical context?) that is even
_more_ indicative of spam versus nonspam than
the singular words taken alone.
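One way to read the "Mary had a little lamb" example: for every position, take that word alone and then combined with every in-order subset of the next four words. The Python sketch below generates features on that reading. It is a guess at the scheme from the example, not CRM114's actual code, and it keeps the phrases as plain strings instead of hashing them.

```python
from itertools import combinations


def sparse_phrases(tokens, window=5):
    """Return word and interrupted-phrase features within a sliding window.

    Each feature starts at tokens[i] and adds any in-order subset of the
    following window - 1 tokens, so "mary lamb" is produced alongside
    "mary had a little lamb".
    """
    features = []
    for i, head in enumerate(tokens):
        tail = tokens[i + 1 : i + window]
        for size in range(len(tail) + 1):
            for combo in combinations(tail, size):
                features.append(" ".join((head,) + combo))
    return features


phrases = sparse_phrases("mary had a little lamb".split())
print(len(phrases))
print(phrases[:9])
```

A full five-word window yields 16 features per starting word, which is one reason going past five words gets expensive quickly.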
You can grab crm114 at:
http://crm114.sourceforge.net
-WSY
Isn't it better to worry about the 'evil' html up on web pages rather than in emails? Fscking warez sites use 10 times as much evil html tricks as spammers.... where's the outrage there?
Fscking lop.com for example.... took so goddam long to clean that shit off my system.
(It's still an alpha version, and at present it works in only a very small subset of environments, requiring MS Outlook/CDO/.NET; but the author seems to be soliciting offers to have it rewritten for a normal platform/language:
Until then, whether it is really free is questionable, since it is based on non-free tech...)
VKh
One filter that I found blocks about 50% of the spam I get is to filter by the To: field. Some (50%) spammers either don't include the To: field or have their list server address in it... Either way, it is not addressed to me, so it goes to /dev/null
The bad part of this is that it will filter out all the mailing lists that you chose to subscribe to.
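A minimal Python sketch of the To: check, using the standard library's email package. The function name and addresses are made up; as noted above, a check like this will also reject mailing-list traffic unless the list addresses are added to the set it accepts.

```python
from email import message_from_string
from email.utils import getaddresses


def addressed_to_me(raw_message, my_addresses):
    """True if any address in the To: header is one of mine."""
    msg = message_from_string(raw_message)
    recipients = getaddresses(msg.get_all("To", []))
    mine = {addr.lower() for addr in my_addresses}
    return any(addr.lower() in mine for _name, addr in recipients)


me = ["me@example.com"]
direct = "To: Me <me@example.com>\nSubject: hello\n\nhi\n"
no_to = "Subject: hello\n\nhi\n"
bulk = "To: list-server@bulk.example\nSubject: hello\n\nhi\n"

for raw in (direct, no_to, bulk):
    print("deliver" if addressed_to_me(raw, me) else "/dev/null")
```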
Of course, when filtering the bodies of messages, the easiest defeater is encoding the bodies of the messages. It's easy to block all messages which have "longer" within 1 or 2 words of "thicker" or "intense", but it's much harder to block SGkgVGhlcmUsDQogDQpUaG91Z2h0IHlvdSBtaWdodCB3YW50IH RvIHRha2Ug. Then you're back to blacklisting based on senders and domains and header information. Of course, this is for the ISP I work for, for personal mail I could just reject all encoded mails.
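An alternative to rejecting encoded mail is to decode it before the body filter sees it. Here is a sketch with Python's standard email package; the function name is made up, and it only handles text parts, leaving attachments alone.

```python
import base64
from email import message_from_string


def decoded_text(raw_message):
    """Return the text parts of a message with transfer encodings undone."""
    msg = message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name in the header
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


# Build a base64-encoded sample body for the demonstration.
encoded_body = base64.b64encode(b"longer thicker").decode("ascii")
raw = (
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + encoded_body + "\n"
)
print(encoded_body)
print(decoded_text(raw))
```

Once the body is back in plain text, the same word-proximity rules apply as for unencoded mail.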
To read makes our speaking English good. - X. Harris
Isn't spam assassin also using some sort of statistical scheme? I've seen some simple perl script things based on averages. I think spam assassin does more than that, but I've never really checked it out. Does anyone know how this is different or comparable to other spam filters?
-- Eric
Several people have pointed out that by the time spam reaches you to be filtered, it has already used resources.
That's why the large ISPs such as AOL and the DSL/Cable providers need to put this on their _outgoing_ connections, just to be able to quickly identify a machine which suddenly begins to produce spam. This would, of course, presume that they are responsible enough to care.
A lot of spam comes from open relays, hacked machines, unscrupulous ISPs here and in Asia, etc. Obviously all connections to the internet can't be filtered. But I think that an ISP can save itself time and money by eliminating their own occasional problems.
And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
This statement is wrong -- it would only be correct if a spam containing sex and containing sexy were independent, which seems definitely wrong.
If a person sent him an email containing both the words sex and sexy (and perhaps a few other related words), which seems very possible, the probability of being a spam will go way too high and it will be very hard for the system to classify it as non spam. This might seem inevitable, but it doesn't have to be the case.
Of course generating the entire joint distribution over all possible words is impossible, but there are very good approximations, for example he could use a Bayes net.
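For reference, the 99.97% figure falls out of the naive combination below, which multiplies the per-word probabilities as if the words were independent. The Python function is a sketch of that combination as I understand it (equal priors assumed), using the .97 and .99 values quoted from the article.

```python
from math import prod


def combined_probability(word_probs):
    """Combine per-word spam probabilities, assuming independence.

    Computes prod(p) / (prod(p) + prod(1 - p)), i.e. naive Bayes with
    equal priors for spam and non-spam.
    """
    spam_side = prod(word_probs)
    ham_side = prod(1.0 - p for p in word_probs)
    return spam_side / (spam_side + ham_side)


both = combined_probability([0.97, 0.99])
print("%.2f%%" % (100 * both))
```

Because each strongly correlated word is counted as fresh evidence, a handful of related words drives the result toward 1 very quickly, which is the overconfidence described above.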
www.facestat.com - See how strangers judge you.
Or, if you're not willing to sacrifice (or mess with) your MDA, check out ASK. It does about the same thing and works with sendmail, procmail, qmail, etc.
A-S-K
There are limits to the First Amendment right to freedom of speech. Exceptions have been established by courts in instances of defamation, causing panic, incitement to crime, sedition, and obscenity.
How much spam have you read with defaming remarks about Britney Spears's latest sex pics? I've seen so much spam "advertising" rape and molestation and child pornography; it might not be a literal persuasion to commit such crimes, but it certainly is obscene to most readers. The "indecency" provisions of the Communications Decency Act were struck down by the Supreme Court in 1997 (Reno v. ACLU), but obscenity itself remains unprotected speech.
So there may be grounds to strike at spammers on these exceptions to the First Amendment; however, prosecuting spammers would be a precedent-setting case, blazing new territory in the legal system.
All of this, however, would have little effect on non-US organizations. Besides... who would attempt to prosecute one spammer when there are so many more ready and willing to take that place?
In short, because the morons that support spammers are not likely ever to bother with filters.
The one exception - Filtering with bounce messages. This will cause SOME spammers (not all) to take you off their lists. Since implementing fake bounce messages triggered for every identified spam (See spambouncer.org), my spam counts have halved, from 90+ spams/day to 30-35, and decreasing. Unfortunately, some spammers (azoogle.com) blatantly ignore bounces, and others have non-bounceable return paths. If more people bounce their spams back, those who DO have bounceable (but ignored) returns will have their bandwidth costs increase.
I think the ultimate solution is that the spammers themselves have to be fought. Legislation is one - If 1 in 100,000 people respond positively to spam mail and only 1 in 100,000,000 sue for $500-1000, spam quickly stops being profitable. Also, some form of "voluntary" DDoS of spammers would be nice. Not voluntary for the spammers, but for all those who are attacking. For example, download a small app that each day presents you with an article, that basically states, "Today's target is xxxx - They are targeted because yyyy" and the evidence is presented against them. User can now decide if they want to participate. To minimize legal risks, trickery such as an absurdly slow HTTP GET would be useful. (G, sleep 5, E, sleep 5, T, etc etc) - Doesn't increase bandwidth costs, but the server will probably be brought to its knees rather quickly from having to serve too many simultaneous connections. A client could easily spool up 40-50 such connections with minimal use of local resources, but the server would have to open up hundreds of thousands of simultaneous connections, causing the server to fork like crazy.
retrorocket.o not found, launch anyway?
I might actually read that kind of spam in the future.
Subject: __________ (noun) Enlargement in ____ (number) days!!!!!
Hello _________(name),
Would you like to __________(verb) for only the cost of _______(number) ___________ (hot beverage)?
Just use this link ____________(website) to get started __________(date)!
If you'd like to unsubscribe, ________(verb) _________ (place).
Mordor...a magical, mythical land where women are more rare than dragons--but where every man would rather find a dragon
Whoever modded me as a troll, YOU try wading through Paul Hudak's courses. /growl/
When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability.
Poor guy. Based on my corpus, "sex" indicates a .97 probability of the containing email being plans for Saturday night!
One point I'm missing here - how would a spammer know that your filter had sent his message to /dev/null? I can see how they can adapt to measures that prevent their messages from being sent, as they get the bounce notice. But, as far as the spammer knows, a filtered message has reached its destination. Put another way, what "scoring function" could a spammer use to optimize against filters? How do they know that their messages are being read vs being automatically dumped? I ask this because I suspect they can't know, which means that once a good set of filters is in place the spammers will be unable to evade them.
Well, any spam fighting must start with spam recognition, which has to involve some filtering. So this probabilistic technique is as good as any other for yet another approach to single out spam messages.
Now when you are 100% sure that something you've received is spam, it's time to complain to the sender's providers to have his account closed ASAP (and the upstream providers, and spamcop etc.)
The best approach is hand-written complaints. Being lazy, I use SpamBouncer to do the job for me (and I have actually received a couple of manual followups to these autocomplaints leading to reported spammers' account closures).
VKh
For all I care you can consider any email with an image in it SPAM.. Even if it's not, I'm not interested.
Also.. you suggest tailoring the regular text part of the message to look like a regular legitimate mail. However, since the person sending the email does not know you, or your interests, any word they use (except maybe 'the', 'a', 'you the man') will probably get flagged as high risk anyway.
I think the method described in the article has its strong points, the best of which is that it's customized automatically for each user's own definition of spam mail and the mails he receives.
Oktay
---------------
Founder of The Free Linux CD Project
As pointed out elsewhere, spammers can get information about whether or not you've viewed one of their messages when you view the HTML if it asks for any external data such as images.
I use Tiny Personal Firewall to prevent programs from accessing the network in ways that I don't want them to. For example, I have told it that Outlook Express should only be allowed to talk to my servers, and even then, only on ports 25 and 143 (send mail and IMAP). This stops all images from being downloaded or other HTML calls from going off of my machine and letting spammers know that I've viewed their mail.
The nice part of this is that if I decide that I want to view images in an html mail message (nytimes news stories for example), I just right click on the tiny personal firewall icon and disable the firewall, and then just enable it after.
...or tehre will B lots of angary Korean DIABLO player out 2 get u! GIVE ME ITEM?!!? SOJ! ^_^
For the same reason artificial intelligence has been held back by reliance on "symbolist" languages such as LISP:
Everyone wants to believe they are smart enough to tell the computer the rules of behavior rather than realizing they should be teaching the computer to think statistically which is to say rationally.
Of course since the primary religion pushed by both government and media is the moral virtue of ignoring statistics (to the point that actuaries are now thought of as reactionaries) there should be no surprise that the high priesthood of "AI" has failed not only to produce artificially intelligent software but has done so through the theological bias of rules as commandments for the faithful computer.
Seastead this.
All this needs now is a system to keep track of the email addresses of spammers and PIPE them into other OPT-IN sites. This creates a perpetual loop, because spam bots always reply, ALWAYS REPLY BACK ad infinitum.
I like to sow the seeds of destruction.
Signed Anon Coward
Unfortunately, I don't grok LISP. Could someone please translate the code snippets into Perl or C so I can figure out what he's saying there?
Thanks,
Brant
xIf xYou xCan xRead xThis xYou xHave xWon xA xFabulous xVacation! xClick xHere xTo xRecieve xYour xPrize!
Spammers will start mispelling "hype words" to get them past. (They already do this in titles.)
I can envision having a spelling check to find such, but then you could be filtering out legitamate bad spellers, such as me.
Table-ized A.I.
works with sendmail or qmail or whatever, that filters out the spam messages and autoforwards the message to a list of congressmen, etc with a message "want my vote? make this type of UCE illegal like in Europe".
The Adult Happy Meal - "I'm lovin' it!"
Step 1. Get your own domain name
Step 2. Use abuse@yourdomainnamehere.com as your email address
Step 3. Enjoy spam free mail
After 5 years and numerous public news group postings, I have yet to receive a single spam.
I question his testing methods. If I read the article right (oops, slashdot faux pas, I admitted to reading the article) he built the Bayesian map from about 4000 messages, then tested the efficacy of his algorithm against those same 4000 messages! He waves his hands about why that's OK, but wouldn't it make more sense to take 10 minutes to build his map against the first 2000 messages and test it against the remaining 2000? I really don't trust algorithms that use the input data combined with the desired results to derive those same results against the same input data.
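The split suggested above (build the map on one half, test on the other) could be sketched roughly like this. This is only an illustration under stated assumptions: `holdout_split`, `evaluate`, `train_keyword_filter`, and the four-message `corpus` are made up for the example, and the keyword filter is a toy stand-in, not the Bayesian filter from the article.

```python
def holdout_split(messages, train_fraction=0.5):
    """Split a list of (text, is_spam) pairs into a training and a test part."""
    cut = int(len(messages) * train_fraction)
    return messages[:cut], messages[cut:]


def evaluate(classify, test_set):
    """Return (false_positives, missed_spam) for a classify(text) -> bool function."""
    false_positives = sum(
        1 for text, is_spam in test_set if classify(text) and not is_spam
    )
    missed_spam = sum(
        1 for text, is_spam in test_set if not classify(text) and is_spam
    )
    return false_positives, missed_spam


def train_keyword_filter(train_set):
    """Toy stand-in filter: flag any word seen only in training spam."""
    spam_words, good_words = set(), set()
    for text, is_spam in train_set:
        (spam_words if is_spam else good_words).update(text.lower().split())
    spam_only = spam_words - good_words
    return lambda text: any(word in spam_only for word in text.lower().split())


# Hypothetical miniature corpus, ordered so each half holds spam and non-spam.
corpus = [
    ("cheap pills online", True),
    ("meeting notes attached", False),
    ("cheap loans today", True),
    ("lunch meeting tomorrow", False),
]

train_set, test_set = holdout_split(corpus)
classifier = train_keyword_filter(train_set)
false_positives, missed_spam = evaluate(classifier, test_set)
```

The point is only that `evaluate` never sees a message the filter was trained on, so the two error counts say something about mail the filter has not met before.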
;)
Secondly, over time, assuming that spammers put forth any effort into bypassing his filters, the filters will become much less useful. Spammers will intentionally misspell key words to lower their total spam rating. The easy solution to this is to make the map using a running total of only the messages from the last 3 months, or 6 months, or whatever period works best, but he should have at least mentioned that. Otherwise, over time the massive weight from the old emails will drown out any new spam identifying words.
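A rough sketch of the "running total of only the last few months" idea, assuming each stored message keeps its arrival date. The function name `windowed_counts`, the 90-day default, and the sample `history` are all assumptions for illustration; nothing here comes from the article.

```python
from collections import Counter
from datetime import date, timedelta


def windowed_counts(messages, today, window_days=90):
    """Count tokens per class using only messages newer than the window.

    messages: iterable of (received_date, tokens, is_spam) tuples.
    """
    cutoff = today - timedelta(days=window_days)
    spam, good = Counter(), Counter()
    for received, tokens, is_spam in messages:
        if received < cutoff:
            continue  # too old: no longer influences the word statistics
        (spam if is_spam else good).update(tokens)
    return spam, good


# Hypothetical history: the January spam falls outside a 90-day window.
history = [
    (date(2002, 1, 5), ["mortgage", "free"], True),
    (date(2002, 8, 1), ["m0rtgage", "free"], True),
    (date(2002, 8, 10), ["lisp", "macro"], False),
]
spam_counts, good_counts = windowed_counts(history, today=date(2002, 8, 17))
```

With this, an old spelling like "mortgage" drops out of the counts once the spam that used it ages past the window, while the newer misspelling still counts.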
All in all, it sounds like a great system, though, pending the results of a real test against emails other than the ones you built the map from.
you can multiply.
can you sit up and walk too?
fucktard.
...otherwise they'll have all your personal data and your phone # for future direct marketing, and they'll know their spam had reached you so they'll have your interests more narrowly classified, making you a more valuable direct marketing target!
VKh
Make it Distributed and make it work with eudora, and i'll gladly use it.
spamnet (see link above) promises to make it so that, if you add a filter to your email, and it (or you) shows promise as a good spam filterer, that filter gets added to those that all subscribers get. unfortunately, it's currently only for outlook, but i expect it will either add support for other clients, or someone will come up with an open source alternative...
- Entertaining Bits from the Ancient Kernel Tree
Not to filter posts for spam, but for, you know, quality!
My primary concern comes from the fact that most of the spams I receive are either Korean or English, while most of the legitimate mails are in Norwegian. Sending me Korean mail is pointless anyhow, but I fear that simply the _use_ of English will make his scheme produce lots of false positives.
I don't get it. Does Korean spam use something other than WORDS to communicate? Or do their mail headers look any different than Norwegian ones? What makes you think your deleting Korean spam, and thus marking those Korean words (heck, all of them) as spam will be a problem? The filter gets built up for the user, based on the user's email. How could this not work for you? Why would marking of a few English words as not being spam be a bad thing?
If you have a driveway that connects to a public road, then people can park there. Your house is connected to a public road, I can walk in and watch TV. Your car is on a public road, I can use it without your permission.
A spammer that I tracked down was very unhappy that I knocked on his door. He claimed I was trespassing. How could I, he opted in by having his house accessible by a public road.
If spamming is legal and honorable, why don't you post your real name, address, and phone number with each spam and on each website that you spam about?
Fight Spammers!
This is the greatest idea since sliced bread. Better even! I do like the idea of making the corpus distributed but think that keeping a personal corpus of data is also a very good idea.
One added button can drive it all: "Delete as spam". What a wonderful idea!
I think the solution to spam has been found!
Are there any utilities to test the effectiveness of a spam filter? Suppose that I wanted to install this Bayesian filter but I don't have spam (already deleted) to create my hash tables. Is there some web site that will send me a bunch of known spam messages to create the weights against or to test an existing spam filter?
Share bicycle touring info worldwide: http://wheretocycle.com
Prove me wrong by:
Fight Spammers!
Nice system for list matching:
a-s-k.sf.net
It's the content I want to block. I don't want the spam to be sent to me in the first place. I don't want it to use up my bandwidth, which is half the reason for refusing spam in the first place. Plus, when handling other people's mail, it's one thing to block suspected spam sources for them; it's another thing entirely to examine the content, even if it's just computer logic doing it. If I am able to deploy the ability to examine mail for unacceptable content, then what else will I have to test for later? What will the government expect me to be able to do?
I'll stick with blocking dedicated spam houses, ISPs that harbor spammers, open relays, open proxies, dialup pools, and certain countries, by IP address and/or domain name. And I'll continue to block anything that can't get their reverse DNS right (this feature alone took out half the spam with very little collateral damage).
now we need to go OSS in diesel cars
Example: 10 spams get the word 'blunderbuss' but he has no regular email with that word. Therefore, any future email may be rejected because of the word 'blunderbuss', even though there is no basis to know whether the word CAN be used legitimately.
I don't claim to understand the article fully, but I'll take a stab at responding to your example...
Let's say you have an email with 200 words in it (including the header, etc). Let's assume that your friend the history buff is sending you some pictures of a blunderbuss. Of course he's kind enough to provide a description of the pictures and a short history of the blunderbuss.
Now, this goes through the filter which splits up the words into 200 tokens. Even if blunderbuss has a .99 probability of being a spam word, the message is not rejected based simply on matching this word. This one word will influence the final probability calculation (eg, moving towards a "hit" for spam), but the other 199 words could (and will) push the email towards the "valid mail" threshold. Thus, the .99 is negated by the near-zero probabilities of other words in the email (sorry, I'm not a mathematician -- my apologies if that's confusing). There is an example in the article which explains this.
If the email contained only the word blunderbuss (disregarding the words in the header), then it's probably spam. However, if the email contained only the word blunderbuss it's probably not very useful in the first place (unless you're an international spy ;)).
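To put numbers on the blunderbuss example, here is a minimal sketch of a naive-Bayes-style combining rule of the kind the article describes: the product of the per-word spam probabilities divided by that product plus the product of their complements. The per-word values are invented for illustration, and as far as I understand the article it combines only the most "interesting" tokens rather than every word, which this sketch ignores.

```python
from math import prod


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities into one message-level probability."""
    spam_side = prod(token_probs)
    ham_side = prod(1.0 - p for p in token_probs)
    return spam_side / (spam_side + ham_side)


# Hypothetical per-word probabilities, not taken from any real corpus.
only_blunderbuss = [0.99]                 # a message containing just the one word
history_buff_mail = [0.99] + [0.2] * 5    # same word plus five innocent-looking words

p_alone = combined_spam_probability(only_blunderbuss)
p_with_context = combined_spam_probability(history_buff_mail)
```

On its own, the .99 word leaves the message at .99. With only five mildly innocent words next to it, the combined figure in this sketch already falls below 0.1, which is the "other words push it back" effect described above.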
Well in the UK (not sure about elsewhere) dialling 141 before the number withholds caller id. Worth remembering :-)
no sig.
Spammers could take the easy way out. Pay people to read it. If people read it and become customers, they make more money than it cost for the spam.
"I have not received *ANY* spam in over a month.
;-)
Zero spam, period."
I haven't checked my hotmail account since a month + 1 week ago...
I think a lot of spammers have been getting "mailbox full: too much spam" messages from me
Free as in *BUUURP!*
Yeah, since mailing list maintainers rarely use the email protocol properly and have little regard for their end users' requests - it makes sense you would automatically remove people.
Of course, people who sign up for mailing lists would just manually add you to their client. Problem solved.
I'd unsubscribe if I could get my inbox to lose weight now. What a sexy way to get university diplomas removed from naked redheads!
When i navigate to a site and they have massive amounts of Javascripts triggering new browser windows to domains I never requested, the browser session is no longer under my control or 'opt in' in any way.
In both cases you can go to ridiculous lengths like downloading the content locally first, turning off scripting, disconnecting from the internet, and then viewing the content. But that's not really relevant. If I go to slashdot.org with my standard browser settings and another user posts an innocent looking link in the middle of a discussion that I click on which goes to a site that spawns 1000 browsers worth of goatse and 3 installations of some kind of trojan horse, I did not opt into that. And god that really sucked.... my ceo walked by right when it happened. Much more negative consequences than some spammers getting demographic info on me.
Spammers are generally lame, but don't put up much malicious script. Web sites, including ones linked to from this one, DO. Spammers want to sell you something, not install trojans on your machine.
Of course, cid suppression might work on the final (non-signaling!) voice link to the guy you're dialing so, unless he is connected in real time to that SCP, he won't know instantly that it's you calling - such setups do happen in some callcenters.
VKh
Countries with Union of Soviet Socialist Republics (USSR), People's Republic of China followed closely by United States of America (USA). Scary.
Good names: Canada wins!
+1 Funny
(too chicken not to post as Anon)
i expect Iraq, Cuba, South Korea, Iran, Saudi Arabia and various others do have silly long official names, not that anyone uses them.
So you mark it as 'delete as unwanted'.
:)
The point is that the filter list should be targeted to the individual user's desires, not conformance to a general idea of 'what is spam?'.
Because that method is more universal.... and inclusive of the other goal. While the other goal cannot be expanded the other way around (the part about filtering on headers automatically for example).
But by the way I don't believe you.... And the spammers don't either. They are sure that you want to extend your penis or increase your bust.... or both
What I like best about this approach is that it lets you define what spam is, instead of having to rely on someone else's (possibly different) definition. For example, I hate receiving urban legend "forward me or die a slow death" emails. These generally pass through my spam filters. If I instead marked these as spam using the process described, before long they would be filtered out too. And, because of the statistical approach, future non-urban-legend emails from said "friend" would not be blocked. Neat.
Senator Mary Landrieu
724 Hart Senate Office Building
Washington, DC 20510-0001
Dear Senator Landrieu:
Earlier this month the Federal Communications Commission (FCC) issued a record fine of nearly $5.4 million to Fax.com for transmitting unsolicited advertisements via fax machine (ie. "junk faxing"). Coincidentally, the Associated Press published a series of three articles covering the state of unsolicited e-mail advertising ("spam"). I'm left wondering how the FCC can come down hard on junk faxers but how spammers (arguably of a lower moral class) are allowed to continue to operate nearly unmolested.
The law Fax.com was found to be guilty of breaking is Section 227 of Title 47 of the United States Code. The relevant text follows:
Restrictions on the use of automated telephone equipment:
It shall be unlawful for any person in the United States (...) to use any telephone facsimile machine, computer, or other device to send an unsolicited advertisement to a telephone facsimile machine(.)
It is my understanding that the reasoning behind this law is based on the ownership of resources. Fax machines are purchased and maintained at the owner's expense and only the owner's expense. An unsolicited advertisement sent to this fax machine amounts to nothing less than the use of these expensive resources without prior consent. In effect "junk faxing" is considered theft and as such the offenders are held accountable by law.
What does this have to do with spam? In my opinion, everything.
Receiving an e-mail is by all accounts more expensive than receiving a fax. While several companies are now offering stand-alone e-mail clients, I have yet to see one of those with a lower price tag than a fax machine. But even if their price tags were the same, an e-mail station requires that the owner not only pay a monthly fee for a telephone line but also a second monthly fee for the e-mail account itself.
Of course not even an end client is enough to receive an e-mail. The e-mail account you would be paying for is maintained on a very large (and very expensive) e-mail server, complete with its dedicated (and pricey) connection to the internet. There is simply nothing comparable to an e-mail server in the faxing domain. While a bank of fax machines doesn't cost more than the price of the machines and their associated telephone lines, the price of a dedicated e-mail server and the associated connections can easily resemble that of a small car.
So why is it that the FCC is given free rein to crack down on junk faxers but spammers are free to consume valuable equipment they do not own?
If you are familiar with the AP articles I mentioned earlier you will know that spam is steadily eliminating the usefulness of e-mail itself. It has been estimated that spam accounts for up to 80% of the e-mail traffic to major e-mail domains such as Hotmail and Yahoo, a problem that their respective owners are all but powerless to fix. As more and more internet resources are tied up by these advertisements, the owners of these resources have had to resort to cutting off offending service providers from the rest of the internet entirely. Customers are finding themselves unable to use the internet access they have paid for simply because another customer of that same provider is abusing theirs.
But even then the providers are powerless to drop spammers. Spammers in the recent AP articles have proudly boasted of the way they outright defraud unsuspecting internet service providers when signing up for an account. And when the provider threatens action, the spammer threatens the provider with legal action. In recent months a spammer was even successful in receiving a legal injunction against their service provider, preventing the provider from stopping the spammer from abusing their resources.
I have little problem with receiving advertisements through the U. S. Postal Service. I know that the delivery cost for every article in my mailbox has been entirely paid by the sender. And while I am not happy with the current situation with telemarketers (I must pay for local telephone service before I have the "privilege" of being contacted by telemarketers), I must grudgingly admit that the state and federal laws designed to restrict telemarketing have been mostly successful. But I am not happy about paying several thousand dollars for a computer and $20.00 a month simply to have my e-mail account flooded to capacity with advertisements for products and services I have no interest in (and preventing legitimate e-mail from reaching me in the process). I am sure that you yourself have been bombarded with advertisements for websites featuring "nasty teens" or "penis enhancement." I have noticed that your office no longer maintains an e-mail address accessible to the public.
The examples of spam I mentioned in the last paragraph bring me to another point: I have noticed on your website your stated commitment to enforcing decency laws on the internet, to protecting children from access to objectionable material on the internet. It should be obvious by now to even the most casual of internet users that the biggest offender in this area is the spammer. While a user must actively attempt to locate a website in order to find such material on the world wide web, the mere existence of an e-mail account all but guarantees that the owner will have such material delivered to them on a daily (if not hourly) basis.
In my opinion the solution to this problem is very simple: expand 47 U.S.C. 227 to prohibit unsolicited e-mail advertisements in exactly the same way it prohibits unsolicited fax advertisements. Nothing more, and certainly nothing less.
I have seen some ineffective bills drift through both houses of Congress that are written to allow unsolicited messages so long as they have an "opt-out" mechanism. Ignoring the fact that such legal loopholes would essentially negate the law entirely (can you prove that you tried to opt out?), it quite literally sickens me the way some of your fellow members of Congress feel that spam is somehow an issue dealing with the freedom of speech. The mere existence of the internet and the supposed changes it has on how business and the legal system work (even though such "changes" have been shown to be a lie) have helped to convince these poor fools that people should somehow have a right to use and abuse the property of others. Does my neighbor have the constitutional right to break my kneecap so long as they provide me with the ability to "opt out" of future kneecappings?
The United States Constitution guarantees that all citizens are free to say what they want. It does not guarantee a soapbox upon which they can say it. Just as I am not guaranteed the right to have a billboard on Interstate 10, spammers should not have the "right" to use the resources of others simply because they're there.
Expanding 47 U.S.C. 227 to include e-mail is an extremely important issue to me, and given the interests stated on your website I hope it is an important issue to you as well. I know that you are up for re-election this November and I intend to find out how your competitors feel on the issue too.
Wow! Microsoft really does innovate.
You watch: now some company is going to implement his idea in their filter software and patent it as their own! They will then threaten to sue anyone else who uses it.
Anonymous Cowards suck.
Paul Graham on Fighting Spam:
Wham wham wham wham wham wham wham.
(it's deeper than you think)
slashdot: where everyone yells sarcastic metaphors to themselves to understand the issue
I have some very good spam filtering based on the content (which is almost a giveaway!). The only way this will keep working is if the spammers do not know what I am filtering on. Please let's keep it this way. If smart people like Paul start divulging their good techniques, the spammers will start changing their content too ...
The only way we can win is if everyone comes up with their own filters, keeps really quiet about it, and the spammers continue to spam thinking everything is okay. And of course the SPAM-lovers could still continue to receive all the spam they like without realising that anything is different in their little world.
Something to think about.
DO NOT PANIC
I'm glad this article author touched on what I consider the ultimate solution:
If you hired someone to read your mail and discard the spam, they would have little trouble doing it.
There are lots of unemployed people in the tech sector, why not hire them? Heck, let ME be your spam filter!
Then again, there are privacy concerns. Oh well.
For an interesting read, please see my paper: 'An Analytical Look at Spam'. I touch on the "Hire a secretary" solution along with an extensive analysis of the entire spam situation.
I'm wondering if spammers could manipulate those probabilities, not to get their spam through, but to increase the "false positives"? Sort of like a "denial of service" attack.
If you've never heard of a product you cannot know you want it. So you won't search Google for a combo USB drive/MP3 player/keychain fob. But if you get an email about it, you may realize you want it.
I agree with your suggestion of a middle ground. Some email you know you want, some email you know you don't want, but some email you're unsure of. If I get unsolicited automated mail about an upcoming Tcl convention because of some forums I'm on, like I did recently, do I automatically want to trash it? I didn't know there were any, I wouldn't have looked for one. And I'm not going to go.... but I did, unbeknownst to myself, want to get that information to then be able to make that decision. I generally am willing to read ANY unsolicited automated email that pertains to programming or software, and I don't care if the person sending it out subcontracts their bulk mailings to a company that also does bulk mailing for porn sites, which is why I would be very wary of header-based filters. Ultimately it is only the content of the email that decides whether I want it, regardless of whether it's commercial or non-commercial, automated or individually sent, from a person I know or from someone I've never met.
I think that the Tcl convention announcement would probably get by this guy's filters, since he weights words about programming as non-spam. So ultimately what he has isn't a spam filter, but a content filter, which I think is more important.
Marc Damashek, "Gauging Similarity with n-Grams: Language-Independent Categorization of Text", Science, 267, 843-848, 10 February 1995.
One recent trick that I've just started seeing in spams is a simple tactic that might do quite well w.r.t. defeating content filters.
What this spammer does is insert HTML comments in the middle of every word, with a random word inside each comment, i.e.:
MA<!-- fish -->KE MO<!-- now -->NEY FA<!-- account -->ST!
Content filters may need to get a bit trickier (eg by parsing HTML).
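One cheap countermeasure, short of parsing the HTML properly, would be to strip comments before tokenizing. The sketch below uses a plain regular expression, which is an assumption on my part: a regex is not a real HTML parser and a determined spammer could probably find markup that slips past it. The name `strip_html_comments` is made up for the example.

```python
import re

# Non-greedy match so each comment is removed separately;
# DOTALL lets a comment span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text):
    """Remove HTML comments so words split by them are rejoined."""
    return HTML_COMMENT.sub("", text)


raw = "MA<!-- fish -->KE MO<!-- now -->NEY FA<!-- account -->ST!"
cleaned = strip_html_comments(raw)
tokens = cleaned.split()
```

After stripping, the sample line reads "MAKE MONEY FAST!" again, so a word-based filter sees the same tokens it would have seen in a plain-text spam.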
Space is cheap
Don't discount the fact that most people are working off hotmail/yahoo/webmail/company accounts with restricted quotas.
It would not take very long to fill 50 megs with just email (not including attachments).
If you are on various mailing lists you really will need to delete some mail occasionally.
If you are on an open mailing list it is especially annoying when your spam filter lets mail through because the list is not a spammer but the real sender is.
The easiest way to circumvent this is obviously to just put a lot of innocent words at the bottom of the email after the sales pitch. That way they can counterbalance the bad probability readings of all their market talk. Or even better, put all the market talk in a JPEG/GIF and then add a few innocent words at the end in white text.
It is a good idea, but spam still has a way to get around it.
Yup. Use the intelligence of hundreds of thousands of fellow spam haters across the internet.
Vipul's Razor: http://razor.sourceforge.net/
Pyzor: http://pyzor.sourceforge.net/
DCC: http://www.rhyolite.com/anti-spam/dcc/
Yes, they do work. No, spammers can't get round them just by changing formatting or including random characters.
Government of the people, by corporate executives, for corporate profits.
Problem is that its "randomness" would give it away. The main message wouldn't be so random (a necessary part of all languages) AND legitimate messages wouldn't have a random and not random part (why should they?). The only flaw I can see is how much of a threshold does there need to be to prevent "false positives"?
It would not work because after a thousand spams in English, a legitimate English mail arrives and gets marked as spam. And as I said, "SENDING ME KOREAN MAIL IS POINTLESS ANYHOW". That means IT DOESN'T MATTER. gah.
Yes it could. There's one disadvantage I see: the human is part of the loop. The "message" has to hit an eyeball in order to be judged as spam. That represents a window for a spammer. That window will naturally shrink as the filter becomes more efficient. However, by manipulating the 'message'[1], a spammer can increase the chances that his 'message' will hit an eyeball. The proverbial 'cat & mouse' game will ensue. It's going to be an interesting battle.
[1] Remember in language there's more than one way to say the same thing.
Bernie is a moron spammer.
Fight Spammers!
Do not bother making spam illegal. Make spam cost money. Make all unwanted email cost money.
How? Here's an idea. (Disclaimer: I haven't spent too much time thinking about this.) All email must come with an "electronic stamp", or some equivalent thing that costs the sender money, or computer time, or something. Make it possible for the recipient to "refund" the sender, or otherwise not charge the sender. Now, tie a spam filter into this, so that wanted email automatically gets sent refunds, and unwanted email automatically does not.
Result? Mail to/from your friends is free. Mailing lists may cost slightly more, if list members sometimes fail to refund the list maintainer. In the unlikely event that you email a sociopath, they will earn the postage from a single email from you. In the unlikely event that you send email to a friend and it is eaten by the spam filter (ie. a false positive) you will notice the lack of refunded postage, and surmise that your letter never got through and react accordingly (or surmise that your "friend" has turned into a cheapskate.) Otherwise, this change will cost you little or nothing. Spammers, on the other hand, will go out of business rapidly. "Opt in" lists really will be opt in, and the first or second or nth time you decline to refund their postage, perhaps they will count that as you wanting to be off their list. The amount of postage should I guess be something comparable to the rates the US Postal Service charges for bulk mail and presorted first class mail. Perhaps $.25 would be enough. Perhaps you'd require higher postage for people you'd never conversed with before. Mail that has insufficient postage can result in "insufficient postage" notification to the sender; the recipient is not shown the email until sufficient postage is provided.
Of course, woe to you and your wallet if somebody hijacks your account and sends out 1,000,000 emails allegedly from you... But maybe, like an ATM machine, your "electronic stamp" vendor knows not to sell you more than $5 of unrefunded stamps per day, and automatically telephones you, or cuts you off, if you send more than your limit. (Still, that would make hacking profitable. Bad. Maybe the destination of the postage must be traceable, and the recipient must be liable for refunding if a crime was involved in the sending.)
I suppose our spam filters still might get spam from politicians and corporations. For people using spam filters, it will just be money that we can take to the bank. For people without spam filters, but with the sense to press the "no refund" button on the mailer, they will still get to keep the postage, though they will have earned it.
--- Ben Chase
An alternative approach is to automatically ask any unrecognized email addresses if they belong to a real person. TMDA does this for all non-whitelisted email addresses. The idea is that spammers do not put real email addresses on their spam, so will not be able to respond to a request for authentication. If the emailer doesn't respond to the authentication request, then TMDA blacklists the address for the future. Result -- no spam.
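The challenge-response flow described above might look roughly like this as a state machine. To be clear, this is not TMDA's code or API: the function names, the three sets, and the example addresses are all hypothetical, and a real system would also have to hold the message, send the challenge mail, and time out unanswered challenges.

```python
def handle_incoming(sender, whitelist, blacklist, pending):
    """Decide what to do with mail from `sender`; returns an action string."""
    if sender in whitelist:
        return "deliver"
    if sender in blacklist:
        return "reject"
    pending.add(sender)  # hold the message and ask the sender to confirm
    return "challenge"


def confirm(sender, whitelist, pending):
    """Called when the sender answers the challenge: trust them from now on."""
    if sender in pending:
        pending.discard(sender)
        whitelist.add(sender)


def expire(sender, blacklist, pending):
    """Called when a challenge goes unanswered: block the address."""
    if sender in pending:
        pending.discard(sender)
        blacklist.add(sender)


# Hypothetical walk-through with made-up addresses.
whitelist, blacklist, pending = {"friend@example.org"}, set(), set()

first = handle_incoming("stranger@example.com", whitelist, blacklist, pending)
confirm("stranger@example.com", whitelist, pending)
second = handle_incoming("stranger@example.com", whitelist, blacklist, pending)

spam_try = handle_incoming("bulk@example.net", whitelist, blacklist, pending)
expire("bulk@example.net", blacklist, pending)
spam_retry = handle_incoming("bulk@example.net", whitelist, blacklist, pending)
```

The whole scheme rests on the assumption stated in the comment: that spammers do not read replies to the addresses they send from.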
"the basic premise of the filter is that the spammer HAS to tell you what he's selling, and in the process of doing that, gives himself away as a spammer. "
True, however there's more than one way to say "Hot, lusty babes at my site". Second, his filter has no concept of location: is the biasing part of the spam before or after the message? Human 'pattern recognition' coupled with a deep dictionary allows us to spot such deceptions.
The solution is simple. All images should be sent to a secondary address (a receptacle, for the purpose of this), and this address is NOT public to anybody but those who are authorized to send attachments; accordingly, any attachments sent to the primary address just get bounced.
This sig no verb.
Hmm, the next step in the arms race would be to reject a mail that has too many words that have never been seen before.
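That "too many never-seen words" check could be sketched as a simple ratio test. The 0.5 threshold and the tiny `known` vocabulary are arbitrary assumptions for illustration; a real filter would need a far larger vocabulary, or it would reject every first mail from a new correspondent.

```python
def unseen_fraction(tokens, known_words):
    """Fraction of tokens that never appeared in previously seen mail."""
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token.lower() not in known_words)
    return unseen / len(tokens)


def too_many_unseen(tokens, known_words, threshold=0.5):
    """Flag a message when more than `threshold` of its words are new."""
    return unseen_fraction(tokens, known_words) > threshold


# Hypothetical vocabulary built from earlier mail.
known = {"make", "money", "fast", "meeting", "tomorrow"}

obfuscated = ["m4ke", "m0ney", "fast"]
ordinary = ["Meeting", "tomorrow"]

flag_obfuscated = too_many_unseen(obfuscated, known)
flag_ordinary = too_many_unseen(ordinary, known)
```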
It'd be great now if he offered an implementation which we could all use.
I think a progressive, ever-going implementation is best. I also think it's best to filter based on headers first, and not download any spam (to save bandwidth) and then filter based on message content (for the messages downloaded) and move any spam to a spam folder.
Then the user simply looks at the spam folder and looks for false-positives, and marks them as "legit". Then the Bayesian filter recalculates.
Same thing for false negatives, and for the messages not downloaded. The user can look at the headers of the messages not downloaded and say if they're spam. Then the Bayesian filter recalculates.
Another good thing to do is to give a "password" to your friends for them to put in headers sent to you. I.e., 13y4890dshfpljk2134y9073254y32p9ur. Any message with that in the header would be given a 0% probability of being spam, as only those you gave that to would know to put it in the header. Should it become compromised, you can change it (or just don't give it to people who might compromise it).
Back to the Bayesian filter, another good thing might be to have varying levels of "spam". I.e., if something is almost certainly spam (i.e., 99.99999999% likely to be spam, as would a message with the header "Get fucked for free and make lots of $$$$$"), it would be placed in a DEFINITELY SPAM FOLDER. Other things would be placed in a "PROBABLY SPAM FOLDER". Etc.
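The friend "password" and the graded spam folders could be combined in one routing step, sketched below. Everything specific here is an assumption: the header name `X-Friend-Token`, the folder names, and the 0.99 / 0.9 cut-offs are invented for the example, and the probability is taken as already computed by whatever filter is in use.

```python
def choose_folder(spam_probability, headers, friend_token=None,
                  definite=0.99, probable=0.9):
    """Pick a folder from the spam probability, honouring a shared secret header."""
    # A matching token overrides the filter entirely (treated as 0% spam).
    if friend_token is not None and headers.get("X-Friend-Token") == friend_token:
        return "inbox"
    if spam_probability >= definite:
        return "definitely-spam"
    if spam_probability >= probable:
        return "probably-spam"
    return "inbox"


# Hypothetical secret handed out to friends.
secret = "13y4890dshfpljk2134y9073254y32p9ur"

from_friend = choose_folder(0.999, {"X-Friend-Token": secret}, friend_token=secret)
obvious_spam = choose_folder(0.9999, {}, friend_token=secret)
borderline = choose_folder(0.93, {}, friend_token=secret)
normal_mail = choose_folder(0.02, {}, friend_token=secret)
```

Only the "probably" folder would need regular checking for false positives, which is where the user's "legit" corrections would come from.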
Anyways, Bayesian Analysis is a really great method.
If you're interested in Bayesian Analysis, there's a great phylogeny program which gives you (basically) a bootstrapped maximum likelihood tree (calculated from millions of trees) via Bayesian Analysis: MrBayes.
social sciences can never use experience to verify their statemen
This is NOT the fault of SMTP. (RFC 2821)
SMTP is only a protocol for the transporting of messages. The format of the message is irrelevant. All that is required in the message is that the server knows who the message is going to. The from address given in SMTP is not the one that you see in your browser. It is simply used for logging purposes, and was originally intended as a way for sites to help debug each other's mail servers.
The real culprit that allows the headers to be faked is the arpanet message formatting standard. (RFC 2822). SMTP messages are defined as a block of 7bit characters. It's the messages themselves that allow the exploits, not the SMTP portion itself.
I am the penguin that codes in the night.
Is there a shortcut that I'm missing?
-jon
Remember Amalek.
I just invented this great spam filter! It counts the number of people in the cc: field! Then I multiply it by 10, and that's the percentage chance it gets chucked! Only 14% as much spam gets through, with NO false positives!
Note to M1-ers: a curt but otherwise insightful message is not "Flamebait" or "Troll".
Paul's
Why Arc is not especially object oriented
I would personally like to see Paul Graham spend even more time fighting OOP than spam. The second one is a lost cause arms-race IMO.
Here is something that rang true with me on his OOP musings:
Object-oriented programming is like crack for these people: it lets you incorporate all this scaffolding right into your source code. Something that a Lisp hacker might handle by pushing a symbol onto a list becomes a whole file of classes and methods.
I think using databases (properly) is the same way: a single relational formula does most of the work of a bunch of classes and "hand-indexing" these classes and methods together. (AKA GOF-math)
OOP hard-wires the "noun structure model" into the code (what Paul calls "scaffolding"). LISP and relational techniques tend to use *formulas* to manage these instead of physical code structure. IOW, we don't build structures, we order the information to build *itself* into the needed structures. (OO has the concept of "self-handling nouns", but it lacks the concept of self-handling structures, or interlinks, between those nouns.)
It is less disruptive to change a formula than to change the physical structure of the code.
OOP fans spend too much time looking for "the proper pattern or model". If you do it right, there is no one proper model or structure: it is virtual views that you create on an as-needed basis and can change on an as-needed basis without a bunch of code rework. You can also have multiple different views without them stepping on each other.
OOP creates code and work that is unnecessary and fragile.
(oop.ismad.com)
Table-ized A.I.
You have the right to speak. But you don't have the right to make me pay for the message, nor to listen to it.
"The most important right is the right to be left alone"
(* I'm not sure what the benefit would be to having a few words from the text. For me (and most likely other people as well), that is enough of an inconvenience that I may as well just scan through the entire email. *)
The point is to make it easier to eye-scan if you are worried about false positives. It helps by: 1. Making it easier to review many messages, and 2. Ranking so as to not check the flagrant ones if desired.
Table-ized A.I.
If the MS patent/approach is so good, why did they give up on it and adopt Brightmail for MSN and Hotmail?
Apple also has a similar, albeit more "theoretically correct" probabilistic anti-spam filter using latent-semantic indexing. Mossberg claims he's getting a 95% catch rate in the WSJ.
A
His algorithm works because spam uses the same repetitive syntax. Because so many spam/emails are sent out - it can be flagged by pattern recognition... based on the assumption that it is written in English!
Huh? Where do you get that? The algorithm has NO KNOWLEDGE of syntax or structure. It knows only the presence (or absence) of words in the message, nothing of how they are grouped, positioned, ordered, related, structured, etc. There is zero grammar / pattern recognition as far as I can tell. As long as your corpus or database of reference mail is in the same language as the emails you wish to test, then the algorithm would work just fine. Perhaps you were thinking it used Markov chains?
As long as we have people on this earth that are actually stupid enough to watch Jerry Springer, convert to Mormonism, or buy penis enlargement pills, there will be lots of lame talk shows, moron dook knockers, and spam. And you will receive some of it.
For every genius that comes up with a cool new way to filter spam, there are thousands of idiots ordering up their first spam-marketed item. All you can do is try to ignore as much of it as possible. Filter, but don't expect to get rid of spammers and regain the resources they waste.
...just my 2 gil.
...it won't stop spam hosted offsite (i.e. the spam loads the HTML from elsewhere) or spam consisting of graphic images hosted elsewhere. They don't contain any HTML that would trigger such a filter.
I'm noticing a lot of spammers moving to this in order to evade keyword filters.
Anyway, I like spam. At $500 an email, chasing spammers is a profitable pastime. Make money AND perform a social good! (and learn lots about the legal system!)
Most of the spam I seem to get is in non-alphabetic character sets (Korean/Japanese/Chinese, I'm not sure, I can't read it). I guess I hit the VIA support site in Taiwan one too many times or something.
I don't know much about that character set, but I suspect they don't use the same separator characters that his filter is looking for to separate its tokens.
Connect to any server on port 25 (the SMTP port), and fake envelope senders all you want. Cross-subscribe mailing lists all you want. SMTP wasn't designed with authentication and security in mind at all. Furthermore, it is darn slow. Granted it's not SMTP's fault only. It's the architecture's fault. I should have been more generic.
Parsing email messages themselves is a pain in the ass too.
Gee, a bit OFF TOPIC wouldn't you say?
What's the difference between spam and denial-of-service attacks? A spammer does nothing but send many unsolicited packets, just like a DoS perpetrator. If receiving spam is a "choice," as you say, then getting DoSed is also a "choice," isn't it?
Make up your mind: Either spam is illegal, or DoS attacks are legal. There is no basis for treating one differently from the other.
What Would Jesus Do
(for a Klondike bar)?
They also have a paper from 1998 describing it here
This wouldn't work for me anyway since my personal correspondence frequently contains the words "sex" and "sexy", not to mention "stud muffin".
It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
You just have to allow the mailing list or otherwise. It is completely useless against spam on the list.
So I have this big database on my machine based on my own e-mail? If my machine crashes, I have to start all over? And when the SPAMmers figure out they can send an innocent-looking e-mail with embedded SPAM images, then where are we?
So I'll make my suggestion to eliminate spoofed-address SPAM again:
1. Sending mail server generates a content key based on the contents of an e-mail being sent.
2. Sending mail server uses this key with a private key to create a public key.
3. Sending mail server sends the e-mail, along with the public key to the receiving server.
4. Receiving mail server generates a content key from the e-mail contents.
5. Receiving mail server sends the content key and the public key back to the sending mail server.
6. Sending mail server uses its private key plus the content key to re-generate the public key.
7. Sending mail server compares the public key to the one sent by the receiving mail server.
8. If the keys match, the receiving mail server allows the mail to enter the recipient's mailbox.
9. If the keys don't match, the mail is bounced.
This should eliminate spoofed e-mail, which is the only type I get. This technique also keeps the second transaction to a minimum exchange of keys. The keys add traffic, but the eliminated SPAM traffic more than makes up for the penalty. As more and more mail servers are updated with this feature, spoofing is all but eliminated. The remaining "spoofable" domains can be explicitly severed from the net or blocked.
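The nine steps above can be sketched roughly as below. What the post calls a "public key" behaves more like a message authentication tag, so this sketch swaps in an HMAC over a SHA-256 digest of the message. That substitution, and every class and function name here, is an assumption made for illustration, not part of the original proposal. The network round trip in step 5 is stood in for by a direct method call.

```python
import hashlib
import hmac
import os


def content_key(body: bytes) -> bytes:
    """Steps 1 and 4: derive a key from the message contents."""
    return hashlib.sha256(body).digest()


class SendingServer:
    def __init__(self):
        # The private key never leaves the sending server.
        self._secret = os.urandom(32)

    def _tag_for(self, key: bytes) -> bytes:
        # Steps 2 and 6: combine the private key with the content key.
        return hmac.new(self._secret, key, hashlib.sha256).digest()

    def send(self, body: bytes):
        # Step 3: ship the message together with its tag.
        return body, self._tag_for(content_key(body))

    def confirm(self, key: bytes, tag: bytes) -> bool:
        # Step 7: recompute the tag and compare in constant time.
        return hmac.compare_digest(self._tag_for(key), tag)


def receive(body: bytes, tag: bytes, claimed_sender: SendingServer) -> bool:
    """Steps 4, 5, 8 and 9: ask the claimed sender to confirm the tag."""
    return claimed_sender.confirm(content_key(body), tag)
```

A message is accepted only if the server named as the sender recognises the tag, so a spoofed sender or an altered body gets bounced. The sketch says nothing about how the receiver locates the genuine sending server, which the scheme would still need to solve.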
Xesdeeni
Two things:
(a) Parody spam from your friends would probably make it through the filters, since the headers of the message indicating it was coming from a frequent non-spam sender would be too strong to make the contents of the message trip the filter.
(b) Parody spam from your friends would no longer be funny if you never received spam, so it might as well get deleted anyway.
This kind of research has been going on since the early 90's. MS is not the only one; I think they started somewhere around 96-98. There are many people doing the same all over the world. Is it legal to patent such a thing with a broad meaning while someone else is releasing it as public information?
The term "probabilistic classifier" covers just about every classification algorithm one way or the other.
One might wonder why no one has been using it on a large scale: according to the results from many different people, the highest accuracy is about 90%, and that is already tweaked with word/phrase weighting. Also, everyone will get different results, and in the beginning it isn't that good before a lot of training. If you are seeing much higher accuracy, it just means your data set is smaller than you think.
The best solution in my opinion is using sneakemail.com.
Sneakemail is a free service that you can use to generate disposable email addresses.
These "sneak email" addresses are aliases of your real address, which is kept hidden.
You can enter these Sneakemail addresses into web forms or use them to contact e-businesses without the risk of your real address being abused or bought and sold.
Consider each Sneakemail address as an informal agreement between you and an online business or organization.
You agree to allow them to contact you through this address, and they in turn, by accepting and using this address, agree not to abuse this privilege by sending you unwanted solicitations or to give or sell your address to others.
If they abuse this privilege, by using Sneakemail, you have more control.
This was an excellent article, and gives me great hope that through technological measures we can finally kill spam.
I'm reminded of what Declan McCullagh said in his recent editorial. Through writing code, not necessarily lobbying for more perfect laws, we can overcome some of the obstacles we face online.
Graham makes a bunch of excellent points about how more perfect spam filtering will eliminate spammers for economic reasons. As we've seen political and legal methods don't work.
Maybe the concept of a P2P network could be harnessed in order to fight spam. For each spam tagged as actual spam by a real human, by a ridiculously large CRC (1024 bit or something--to rule out possibly tagging innocent mail), the CRC could be traded via the P2P network. Automatic updating, almost instantly. A client could be written in about 2k of code.
Interacting with the email client would be another story, but just an idea.
The only problem I can think of would be sabotage. Anyone could tag legitimate mass mailings as spam (such as a mailing list).
Any comments on this idea?
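One possible shape for the idea, including a crude guard against the sabotage problem: a fingerprint only counts as spam once enough distinct people have reported it. SHA-512 stands in for the "ridiculously large CRC", and the class name and the threshold of three reporters are invented for this sketch. The P2P transport itself is left out.

```python
import hashlib
from collections import defaultdict


class SharedSpamList:
    """Toy stand-in for the shared list that peers would trade."""

    def __init__(self, min_reporters=3):
        self.min_reporters = min_reporters
        # fingerprint -> set of reporter ids that tagged it
        self._reporters = defaultdict(set)

    @staticmethod
    def fingerprint(message: str) -> str:
        return hashlib.sha512(message.encode("utf-8")).hexdigest()

    def report(self, reporter_id: str, message: str) -> None:
        # Repeat reports from the same person are only counted once.
        self._reporters[self.fingerprint(message)].add(reporter_id)

    def is_spam(self, message: str) -> bool:
        reporters = self._reporters.get(self.fingerprint(message), ())
        return len(reporters) >= self.min_reporters
```

An exact hash like this stops matching as soon as a spammer changes a single character per recipient, so it would only catch truly identical copies.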
Why can't a spammer just append a "normal" looking email to the spam message? Then all but the spam message -- which can be a small part of the email -- will look statistically "good." Perhaps just using words is not the solution, but other attributes of the message (say, structure or whitespace). Still, I think it's a good approach, and I think statistical analysis on the header is great!
I saw some of the posts here on forged headers. I'm a newbie Linux user. Instead of using linux@local as my machine name, I decided to give it a name. So I named it after the computer in 2001, a space odossey (I'm sure I mangled that), and a year. But since it wasn't a fully qualified domain name, which I don't understand yet, my email headers say something to the effect of: not name of machine, not a fully qualified domain name, message may be forged.
Now that I have more than one box running Linux, and serving web sites, it's a little difficult going back to linux@local, and getting rid of the "forged domain" or whatever message. I have one box serving one web site with Apache, but no email service yet because I haven't studied Sendmail or other email applications. I have another box serving another site, same situation. These boxes are only serving one site right now, so I can give them a fully qualified domain name of the site, but I will be switching them to virtual hosting as soon as I can get that to work. This will preclude me from using one of the domain names as the fully qualified domain name. So I am currently stuck with email message headers that identify my emails (kmail in Linux and Webmail ((ISP provided email remote login-Windows)) as using forged headers, which they are not.
Some of my emails are being sent to dev/null or whatever, from people whose system or network uses tight filtering rules due to this FQDN issue. But that's something I'll have to live with.
Not every email with a "forged" header or domain, or one that does not resolve (I'm behind a Linkie NAT firewall on one of my boxes, the workstation is invisible to the net) is spam.
Well, spammers' techniques will evolve. Maybe using "goal defining" software. For example a spammer would simply tell his software what goal he's trying for (get people to see the luscious babes at my site). The software would then figure out what combination of words, sentence structure, etc. would be needed to maximize his hit rate on your mailbox.
is it considered good netiquette to go along with the spam and then at some point become irrational with them? and then post the email exchange to a website for others to enjoy?
i would like to do this.
Software testers needed for
Is the spam for Taiwanese products, or just routed through open mail relays in Taiwan? If it's the latter, we could certainly outlaw using spam as a marketing tool for US entrepreneurs. If your company or home business sends out spam from Taiwan to US computers, you would still be breaking the law.
Don't forget that Friday is Hawaiian shirt day.
2) If you don't read the spam, they have no revenue.
3) You're gaining the valuable benefits of spam without paying for them.
4) Therefore, not reading spam is STEALING!!!
Oh, and
5) ???
6) Profit!!!
We can all talk all day about Spam-busting techniques, but honestly, can we all get together and make sure that our nine year old doesn't get porn mail all the time? Stopping porn spam would really knock the wind out of the sails of all spammers everywhere. I mean, this thing seems like slam dunk legislation. I know that many of you will say that this is a slippery slope of legislation and scream "THINK ABOUT OUR FREEDOMS," but no one wants their children to see pornography.
Really, all we need is some new-era Tipper Gore to scream the phrase we all hate at a Senate hearing... and no more porn spam:
"Won't somebody please think about the children?!?"
The chilling effects of this will be monumental. Why the current Right-Wing U.S. administration hasn't gone after this is totally beyond me. It's a cheap and easy target. Shows that they reinforce family values. I hardly agree with anything super-right wing, but this whole children-looking-at-steaming-hot-teens thing is ridiculous.
Whether enforced or not, in the United States soliciting pornography to a minor is still very much illegal. I think that the /. crowd can really sell that tagline to our local legislator and put a real strike back in the spam wars.
You work for VA Software by any chance?
Any other ideas?
Free, legal music for iTunes users.
There are several classification techniques in the field of machine learning that are all more powerful than simple naive Bayes. In fact in graduate school I built one that outperformed N.B. by a significant margin.
If people want to claim a "great new idea" they should research what has been done in the field first.
What if spammers just put all their wonderful words of wisdom in a large picture or a flash file, thus "hiding" it? Of course a lot of people have html turned off but the vast majority do not.
"It's so convenient to have a system where everyone is a criminal" - A. Hitler
If filtering, as described, were widely implemented, SPAM would become ineffective to the point that it would no longer exist. The cost of "making each link in the chain liable" is much greater than the benefit which can be achieved by other means.
He is assuming that you can just multiply the word by word probabilities together. This is a standard assumption. If you don't do something like this you get a combinatorial explosion, just like you said. More to the point, if you don't do something like this, your data becomes sparse. In the limit of making no assumptions you are reduced to recognising only the spam you have already seen, you have no capacity for generalisation and all the new spam gets through. No statistical method is any use if it doesn't generalise. Any method that works in practice has some kind of assumption hidden inside to make it go.
One reason I like the Bayesian approach is that it is pretty transparent. When an implementation is making the independence assumption, it is clearly apparent, and if you need to relax the assumption, for example by looking at word pairs, it is clear enough how to go about it. Graham does discuss this towards the end of his article.
Often the main effect of the independence assumption in practice is to exaggerate the confidence with which the classifier classifies things. Since Graham is not using his probabilities as input to subsequent processing he gets away with this,
Yeah they could, but it would take too much effort. They first have to make the probabilities factor change by sending you a whole lot of legitimate email (as if) and then later send you a spam message that can finally contain the words they made less likely to be spam.. wait a minute!
Hmm, sounds like a great idea. How about this: an ELIZA-style bot starts a mail conversation with you, and after sending 10 mails back and forth the bot sends you a spam message. The bot has beaten your spam filters, because the filters don't think someone on your contact list would send you a spam message, right? And the spam message will actually be read, quite attentively, because you don't understand it and you think the bot is a person!
well don't be surprised if you experience it one day, remember this message, it started it all!
Consider:
If they are REPEATED innocuous messages that match against PAST "innocuous" messages that I decided were spam, that is going to pick this up.
Then your message goes into the corpus as "spam."
And messages that are written as multipart/alternative with statistically similar "innocuous" messages will be matched as spam.
You don't know the parameters. The parameters essentially involve the subjects I discuss with my family, or with friends, or with business associates, or with technical associates.
How can you possibly construct, as a "spam-meister," messages that resemble those without being someone that I regularly communicate with?
No, this "defeat" represents nothing of the sort.
If you're not part of the solution, you're part of the precipitate.
Anytime someone asks for my e-mail address, it's their_business_name@conesus.com or their_personal_name@conesus.com.
If I ever get spam from a certain address, I can block the address, and go to the site in question and change my address to something else.
But the coolest part is if anybody sends a mass-email to me and my buds, they usually include a personal_message_to_me@conesus.com.
Don't eat your soul to fill your belly.
conesus.com
But this isn't enough, by itself, to classify a message. Messages do not solely consist of one or two words; they consist of many. And collecting the statistics together requires calculating a "relevance factor," based on all the words.
The one used for Naive Bayesian Inference is the Rf calculation, and you'll notice it involves doing a logarithm-based weighting.
The formula doesn't care what words are used, or that you think one folder contains "spam" and that another contains "gold."
In my corpus, the word sex is used in 65 different mail folders, mostly probably pretty "innocently."
Drawing conclusions based on one or two words is, unfortunately, pretty incomplete. It might well be that the one use of "sexy" in a particular message doesn't force it into the Spam/Phonesex folder because it makes even more extensive mention of Enlightenment and WindowMaker and GTK Themes and winds up being very strongly tied to the X/WindowManager folder because there are several other words not related to sexual activity that make it (correctly) appear relevant to a discussion of window managers.
Graham is drawing an analogy based on two words (words likely to grip adolescent attention!); reality involves adding everything up, and those two words certainly don't tell the whole story of the whole corpus.
If you're not part of the solution, you're part of the precipitate.
The "foreign language" Spam that I get gets nicely refiled by Ifile into my Spam/Foreign folder.
That folder has a corpus of messages assortedly written in Han, French, Kanji, Korean, Finnish, Spanish, and Russian, and Ifile nicely recognizes that words in those languages provide evidence that messages seem most relevant to go into that folder.
Ultimately, it all involves human classification:
I go through them, and read them, perhaps just browsing titles when I see that spam seems appropriately filed.
By leaving the messages in the folder, I indicate that they were correctly filed, and should become part of the corpus.
That then involves human intervention as I move the messages to where they should have been.
Note that Ifile is useful for filing good messages, not merely for throwing away spam.
Indeed, the more that you use Bayesian filtering for, the more folders with distinctive kinds of message that you have, the better it gets at discriminating where messages should go. I don't have one "Spam" folder; I've got about 8 for different sorts of spam. I don't have one 'inbox' for all my "good" mail; the mail gets thrown into a veritable huge chasm of mail folders. The more there are, the better.
If you're not part of the solution, you're part of the precipitate.
The typical formula is the relevance Rf of a message to folder f, built from the per-folder word frequencies Ff(w).
There may be a bit of a "fight" between the words, but if all the messages containing the string my_wife@frobozz.org go in the Honey folder, and occasionally contain phrases like That dress was so sexy or the likes, that will change the Ff(w) value for f = Honey , and the message will be appropriately routed, perhaps into the subfolder Honey/Rendezvous where you put the weekly messages of that sort from your wife.
Of course, there's then the non-technical problem, namely locating a wife that would actually send that message.
If you're not part of the solution, you're part of the precipitate.
Ifile source code is available that dates back as far as about 1996.
If you're not part of the solution, you're part of the precipitate.
I've had this theory for a long while about a technique that could be used to defeat spam once and for all. Despite what the author of this article states, trying to fight spam by analyzing the content is not going to defeat it, and as has been pointed out, there are many ways to work around that solution.
Targeting the sending addresses, and most other techniques like that, simply leads to wars of one-upmanship as the spammer and spam fighter struggle to find better techniques to hide and detect spam, respectively.
So what's the theory? Fairly simple, really, and the technology is already available, but not widely implemented. Spam largely suffers from an identity problem. Consider that junk mail that arrives in the post box can easily be identified and/or blocked through legal means if necessary, largely because we know where it comes from. The reason spam has proliferated is because SMTP traffic is largely anonymous - mail servers basically trust the mail they receive and have no real way to verify the information being presented to them. Yes, they can check From: and To: headers to verify that the email is local / remote / relay attempt, whatever. But with the number of open relays on the net, it's easy to forge and bypass these checks.
By using SSMTP (SMTP over SSL), all email can be logged with identifying information from the original sender. If enough servers on the net start to support SSMTP, and increasingly mandate its use, eventually I'd be able to block all regular SMTP traffic. This has the added advantage of making email more secure.
But how does this stop spam? Well, it doesn't directly stop spam, but it means that we would legitimately be able to identify who originally sent the email. Once that happens, the spammer can no longer hide behind anonymous gateways. It probably wouldn't even matter too much if open relays were accidentally left open - so long as the open relay didn't support SMTP but only supported SSMTP.
Ideally, every user would require their own secure certs to properly identify the sender, but this would probably add too much cost for the average user, and may be rejected for privacy reasons. But so long as the mail servers themselves were configured this way, we would always be able to identify very quickly where the email was originally sourced, thus giving a recipient an easy place to target (and hence sue if it comes to that).
As this takes off, it may actually be a way to make spam legitimate. The secure cert attached to the email could have an incentive allowing users to opt-in or opt-out automatically. A user could set their mail to say "yes, I'm willing to put up with ads if you're willing to pay me for it" putting the cost back on the person responsible for the spam in the first place - the advertiser.
Anyway, it seems to me like a fairly simple way to solve this - but it does take a lot of co-operation to get there. Something that hasn't happened yet for IPv6, another new protocol that doesn't really seem to be getting off the ground. So what am I missing?
We're talking about Naive Bayesian Filtering, where we apply Bayes' formula as though the words were independent, even though we know they're not quite.
What you're missing is that the real formula doesn't just involve two words; it involves all of the words in the message.
The usual formula is Rf, and you'll notice that it involves multiplying the occurrences of words in the message with the logarithm of their frequencies in each folder.
The word "sexy" may usually be enough to consign messages to the Spam/Websex folder, but if there are some occurrences of the term "sexy window manager" in a discussion of some window managers, the fact that the names Enlightenment, WindowMaker, stupid, memory-hungry and themes occur rather a lot in X/WM and never in the Spam folders means the relevance total will most likely favor the right folder.
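A small sketch of that kind of scoring: word counts in the message multiplied by the log of each word's frequency in a folder, summed, with the highest total winning. This is not Ifile's actual code or its exact formula. The function names and the add-one smoothing (so unseen words don't produce log(0)) are assumptions for illustration.

```python
import math
from collections import Counter


def relevance(message_words, folder_counts, vocabulary_size):
    """Sum of (occurrences in message) * log(word frequency in folder)."""
    total = sum(folder_counts.values())
    score = 0.0
    for word, n in Counter(message_words).items():
        # Add-one smoothing: an assumption, not something the thread specifies.
        freq = (folder_counts.get(word, 0) + 1) / (total + vocabulary_size)
        score += n * math.log(freq)
    return score


def best_folder(message_words, folders):
    """folders maps folder name -> Counter of words seen in that folder."""
    vocabulary = set()
    for counts in folders.values():
        vocabulary.update(counts)
    return max(folders,
               key=lambda name: relevance(message_words, folders[name],
                                          len(vocabulary)))
```

With a Spam/Websex folder full of "sexy", "hot" and "click", and an X/WM folder full of "window", "enlightenment" and "themes", a message reading "sexy window enlightenment themes" still lands in X/WM, because three words pull it there against one.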
If you're not part of the solution, you're part of the precipitate.
If you get a new email message that has a bunch of "non spam" words in it, it seems likely that it will not be marked as spam. As the article said, spammer's vocabulary is really limited.
Hurray for LisP! :)
I'd like to see mail browsers add a nice big "SPAM" button that can do a number of configurable actions, and has a useful default. I suggest as the default that it forge and send back a "no such user" message, save the message in a "past spam" folder, and occasionally invoke a naive Bayesian statistical analysis program (as Graham describes) to create a filter for the future (then filter out email with a high probability of being spam). Perhaps it could optionally do other things, such as forwarding a copy to a list of email addresses (e.g., your local "abuse" account, the newsgroup news.admin.net-abuse.sightings, and email addresses of well-known spam killers), or calling on other spam killers like SpamAssassin to check it.
Perhaps there could be checkbox beside each action like "don't do it when you press SPAM", "do it when you press SPAM", or "confirm before doing it when you press SPAM" - that way, you could get rid of chain letters without sending them to net-abuse.
By building easily-invoked SPAM-handling capabilities right into the mail browsers, people will be able to fight back more easily.
I know the Mozilla folks are considering anti-SPAM measures; I hope they're willing to build in this kind of functionality, so that it's enabled by default.
- David A. Wheeler (see my Secure Programming HOWTO)
I use hotmail. I got lots of spam. I went into the filters menu. I set up a filter for every letter of the alphabet (yes, 26 filters) in the subject line. I then went into the tagline menu and inserted the tagline "All mail to me at insertnamehere@hotmail.com must have a totally blank subject line or it will be automatically deleted."
I then went in and put accept to all my friends' email addresses. I have no more spam, and no problems receiving mail.
Works like a charm.
Spammers intentionally hide their identity and try to make their spam hard to filter.
Fight Spammers!
This technique, if I understand it well enough (IANA genius), would work pretty well as a porn site filter--scan the site before it's displayed, decide if it's porn, filter based on the probabilities.
As far as building a corpus of porn sites versus a corpus of non-porn sites, I'm not sure of the best way--perhaps it'd be enough to pre-create the probabilities for porn sites through careful and excruciatingly thorough research (good work for some enterprising /. readers), and compare it to a corpus of sites the user usually visits. "Accidentally" get a porn site? Add it to the correct corpus using a -Porn- button.
If the technique is as good as Mr. Graham says it is, it might put to rest the concerns of those who fear that other kinds of filters exclude innocent content and therefore restrict speech.
I realize the ingrates who run Slashdot will not see this (AC limits indeed, I remember when...).
Since this technique is so effective in combating spam and is easy to set up, why not set up filtering on "First Post", "Goatcx", etc., etc.? You get the picture. Your moderator workload will drop dramatically. And as an experiment, see how well it does at filtering and/or sorting submitted stories by their acceptability for posting on Slashdot.
I saw Eric Horvitz demo this (along with a lot of other impressive stuff) when I was at MSFT. The spam filtering works very well for him. And yes, he's already written an Outlook COM plugin that does it.
The problem is that Eric works in MS Research, not on a product team. MSR does an excellent job developing cool new technology, and a very bad job working with the product groups to ship it out the door. (Likewise, the product teams do a poor job working with Research.)
The ultimate example of that is the MSAgent technology... otherwise known as "Clippy". Horvitz was the brain behind the original (and very cool) concept. But the Office product team couldn't take the concept and ship it in a useful form, so it shipped in the painful form we all know and hate.
Eventually, Microsoft will figure out how to do successful technology transfer from MSR to the product teams. Hopefully spam filtering will be the first one to get it right.
"The entire success of spam depends on human eyes reading it."
Which raises the question of how that "filter" database is going to be generated? Hold the envelope to one's head like that old Johnny Carson routine? Nope, looking at the spam then saying "this is spam, plonk". Opportunity there even if smaller than before.
The only "trait" that all spam mail has is that the same message is sent to hundreds or thousands of recipients. A trait which can not be altered.
The Distributed Checksum Clearinghouse (DCC) filters on exactly this aspect. You can find it here
The mail server runs DCC on every incoming message and computes a fuzzy checksum for the message. This checksum is then reported to a central set of servers, which record the presence of this checksum and then report back to the mail server the number of times others have reported a similar message. If you get a high number back it's spam, and the mail server rejects the message.
Similar messages generate identical checksums. So personalizations and random tokens do nothing to circumvent the filtering.
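To make the idea concrete, here is a rough Python sketch of the report-and-count loop described above. It is not DCC's actual algorithm: the normalisation rule (lowercase, drop tokens containing digits, strip punctuation), the `Clearinghouse` class standing in for the central servers, and the threshold of 3 are all assumptions made up for illustration.

```python
import hashlib
import re
from collections import Counter


def fuzzy_checksum(body):
    """Hash a normalised form of the message so trivial variations collide.

    Assumed normalisation (not DCC's real one): lowercase everything,
    drop tokens that contain digits (typical of random tracking strings),
    and strip all non-letters from the remaining tokens.
    """
    tokens = body.lower().split()
    kept = [re.sub(r"[^a-z]", "", t) for t in tokens if not re.search(r"\d", t)]
    kept = [t for t in kept if t]
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


class Clearinghouse:
    """Hypothetical stand-in for the central servers: counts reports per checksum."""

    def __init__(self):
        self.counts = Counter()

    def report(self, checksum):
        # Record this sighting and return how many times it has been seen so far.
        self.counts[checksum] += 1
        return self.counts[checksum]


def is_bulk(body, clearinghouse, threshold=3):
    """Report the message and flag it once enough similar copies were reported."""
    return clearinghouse.report(fuzzy_checksum(body)) >= threshold
```

A mail server would call `is_bulk()` once per incoming message and reject when it returns True. A real fuzzy checksum has to survive far more aggressive rewording than this toy normalisation does.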
I think that if every existing sendmail/qmail server ran DCC then spam would simply cease to function instantly. Currently, though, I don't perceive there to be a sufficient number of mail servers computing and reporting checksums to make it 100% effective, but my server is currently filtering out about 95% of spam mail.
This is not as good as the 99.95% reported by this article, but DCC will be more resistant to spammers getting clever and attempting to use statistically rare words or phrases to defeat the anti-spam filter.
I will never live for sake of another man, nor ask another man to live for mine.
Or because a woman wears a skirt, it is ok to grab her. Just because you get slapped does not mean that it is right, legal, or proper to grab her.
A spammer is a thief by definition.
Fight Spammers!
If P2P is killing music then surely it will also kill porn in the end so why worry about spam? Of course this needs as much help as possible so get sharing pornography right away.
I did this some time ago: the Unique Spam Invoicing System (USIS), aka "Spammer Nailer". And I am really planning to bill the spammers. The idea: spammers collect email addresses with harvesters, so this page contains a unique address and a service agreement, which says that by sending an e-mail to the address, you agree to the terms of service, which you can read at the URL. And as the address is unique and I have the web logs, there is at least some chance of nailing the spammer.
Another idea is to start putting lists of legislators' email addresses (as well as email addresses of their major supporters) on web pages so that spammers start spamming them, too. Legislators hire others to read their emails, and they surely have filters (false positives aren't a problem here!), but it could eventually become obvious even to legislators. Especially if you get the personal email addresses (according to many legislatures, it's legal to share the email address with spammers - if they don't want it to be, they'll need to pass a law to make it illegal!).
Another idea: a non-profit organization creates and maintains a database of HASHES of email addresses that do NOT want spam (say MD5 and SHA-1 of canonicalized email addresses, e.g., all lower case; an entire site could be represented by "@mycompany.com"). Anyone can download the database, for a small fee. Anyone can add or remove their email address from the list for FREE (and it must always be free); they just need to subscribe/unsubscribe, with a separate email to confirm (to show that they really did add their email address to the list; entire sites could require "root" or "postmaster" to represent them). Then legislation can be enacted that gives serious $$ penalties to any spam to the "no-spam" list. Capturing the database wouldn't do any good; it would only provide hashes and date/time stamps.
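A minimal Python sketch of how such a hashed list could be looked up, assuming canonicalization means trimming whitespace and lowercasing, and that a whole site is stored as the hash of "@domain". The `NoSpamList` class and its method names are hypothetical, and the confirmation email and date/time stamps are left out.

```python
import hashlib
from typing import Tuple


def canonicalize(address: str) -> str:
    """Assumed canonical form: surrounding whitespace removed, all lower case."""
    return address.strip().lower()


def address_hashes(address: str) -> Tuple[str, str]:
    """Return the (MD5, SHA-1) hex digests of the canonicalized address."""
    data = canonicalize(address).encode("utf-8")
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()


class NoSpamList:
    """Hypothetical in-memory version of the database: it stores hashes only."""

    def __init__(self):
        self._entries = set()

    def subscribe(self, address_or_site: str) -> None:
        # Accepts "user@example.com" or a whole site written as "@example.com".
        self._entries.add(address_hashes(address_or_site))

    def unsubscribe(self, address_or_site: str) -> None:
        self._entries.discard(address_hashes(address_or_site))

    def is_listed(self, address: str) -> bool:
        # A recipient is covered by its own entry or by its site's entry.
        canon = canonicalize(address)
        site = "@" + canon.rpartition("@")[2]
        return (address_hashes(canon) in self._entries
                or address_hashes(site) in self._entries)
```

A sender would call `is_listed()` on each recipient before mailing. Someone holding only the hashes can test addresses they already have, but cannot read addresses out of the list directly.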
Anyway, just an idea.
- David A. Wheeler (see my Secure Programming HOWTO)
Sorry, it just ain't.
You can't encrypt real mail. Real mail takes days to weeks to arrive somewhere. Real mail can also be dangerous (anthrax anyone?). If I read my real mail out at the mall - do I have privacy rights to it? What about if I read my email there? You have a lot of fuzziness about privacy rights being dependent on where you are and what you're doing..... if you have privacy rights at home, then you have privacy rights at home to read your real mail, read your email, and surf the web.
My primary email addresses now are hotmail addresses. Reading my email obviously means going out on the web. So I've already left my 'house' in your terms, and I read my email with a web browser. Are you going to say that email that goes to a real email server is a different animal than email that goes to hotmail/yahoo/etc.? That's a pretty flimsy distinction.... they're SMTP packages in either case, and that's what defines if it's email. Old style email clients work the way they do because of the state of technology at the time... NOT because of specific design decisions.
I also take exception to all these metaphors that rely on the physical universe. In both cases you're sitting in front of your computer, looking at your monitor. You're not 'going out' in one instance and not in the other. Applying old paradigms inappropriately is why legislation on the net is so fscked up.
My nine year old had no trouble learning to program in Haskell and really enjoyed it.
I'm saying what he has is NOT a spam filter, it's a trainable content filter. THERE IS A DIFFERENCE! He wouldn't WANT what he says he wants, something that blocks unsolicited automated email. The filter has nothing, ultimately, to do with the 'spamness' of the message, only with whether you like the content and headers - and I think the inclusion of header filtering is a mistake, because it's only being included because he thinks he's making a spam filter, when in reality he's making a content filter. My concern is that he defines spam as 'automated unsolicited email' when he should define it as 'any email I don't want'. Sorry Aunt Bertha, I've had it with your forwarded joke of the day emails; even though I continue to want personal emails from you, I don't want the forwarded jokes of the day - see how header filtering doesn't work there, but pure content filtering does?
Like a user agent, yes. Excite used to have a news clipper feature, you could make up a category of keywords to look for "category:cyborg = cyborg, implant, neurochemical, mind-machine interface" and it would grab stories that matched the guidelines. At first it would be a rough fit, but with each story you could mark 'I like it' or 'I don't like it' and it would make some behind the scenes list of other keywords to use or avoid, and over a couple weeks it would start surprising you with all kinds of things you didn't know you wanted, but that you loved as soon as you read. And yes, that's why I'm so insistent that the judgement be just on the content and not on the 'spamness' of the headers, because I had a really positive experience with this kind of technology, and the articles were not rated at all according to their source. Just because I like one story from AMA doesn't mean I'll like others, just because I hated one story from Newsweek didn't mean I'd hate another one, and I think the same is equally true of the mailing source of an email.
Or do you think that you should still have to delete the forwarded joke of the day emails manually, every day, for the rest of your life? And why would you want to do that when this could filter them without filtering the personal emails?
But legislators are some of the least efficient communicators on the planet. They won't even think twice about ignoring email, since they still think dead tree connotes authority.
No, it is if you say, move the car, if you are blocked. Then you decide to take the car for a long drive.
Spamming is stealing.
Fight Spammers!
(* Yes, but how many spammers are going to reply to your challenge? Zero! And that alone will make the challenge an effective tool. *)
If confirmation requests become a wide-spread practice, they *will* take advantage of that.
Too many techniques here assume that what works in obscurity works en masse also.
Not the case. Spammers target the widest-used techniques. When something becomes widely used, kazaam!
Table-ized A.I.
Well I thought this sounded cool so I spent an hour or two coding it up in a VB macro for Outlook.
There's now a "Delete Spam" button on my toolbar that moves the selected message to a "Spam" folder. There's an event handler that runs whenever a new message comes in, analyzes it, and if it looks like spam puts it in a "Probable Spam" folder. There's a macro which analyzes all the messages in the "Spam" folder and all the messages in my Inbox to generate the word probabilities hash table.
I did a quick run through my deleted mail folder, used the "Delete Spam" button to move a representative sample of spam (250 messages) to the Spam folder (I didn't do them all just to save time). I then ran the analyzer to get an initial hash. Then I analyzed the messages in my deleted mail folder, wrote the scores and subject lines to a text file, and moved most of the spam that didn't get flagged as spam to the spam folder, and re-ran the analyzer.
Bingo. That simple technique has caught every spam I've gotten since. From time to time I can check the "Probable Spam" folder and move those messages to the "Spam" folder and re-run the analyzer to improve it. We'll see how it weathers over time, but it's already doing better than I have any right to expect.
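For anyone who wants to try the same thing outside Outlook, here is a rough Python sketch of a word-probability table plus a scoring step. This is not the VB macro described above. The tokenizer, the clamping to 0.01..0.99, the 0.4 score for unseen words, and the use of the 15 most extreme words are all assumptions picked for illustration.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split a message into lowercase word-like tokens (assumed tokenizer)."""
    return [t.lower() for t in TOKEN.findall(text)]


def train(spam_msgs, good_msgs):
    """Build the word -> spam probability table from two piles of messages."""
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    good_counts = Counter(t for m in good_msgs for t in tokenize(m))
    nspam = max(len(spam_msgs), 1)
    ngood = max(len(good_msgs), 1)
    table = {}
    for word in set(spam_counts) | set(good_counts):
        s = min(1.0, spam_counts[word] / nspam)  # how common the word is in spam
        g = min(1.0, good_counts[word] / ngood)  # how common it is in good mail
        # Clamp so that no single word is ever treated as absolute proof.
        table[word] = min(0.99, max(0.01, s / (s + g)))
    return table


def spam_probability(text, table, unknown=0.4, top_n=15):
    """Combine the most extreme word probabilities into one score in 0..1."""
    probs = [table.get(t, unknown) for t in set(tokenize(text))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:top_n]
    if not probs:
        return unknown
    spam_like, good_like = 1.0, 1.0
    for p in probs:
        spam_like *= p
        good_like *= 1.0 - p
    return spam_like / (spam_like + good_like)
```

A handler for new mail would call `spam_probability()` and move anything above a cutoff of your choosing (say 0.9, again an assumption) to "Probable Spam", then re-run `train()` whenever the two folders change.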
Actually, there's two, and both are easily found by simply entering "bayesian spam" into Google:
-
-
Please feel free to enlighten me about the above two; I'm not investing in Paul's employer, so the first issue is not nearly as important as the second, but as a spam-victim, I truly do so want to believe there's a magic anti-spam bullet. I just have trouble believing this particular story based on the data at hand. There already exists a generic method (ifile) to fold this technique into a procmail script, which means you don't need any special-purpose email program; any company that thinks it's going to replace the email-browser is dreaming. I also downloaded some experimental code for Emacs GNUS, but it was too clunky for anything more than a demonstration of the rating method. Since ifile works with procmail, any ISP could use it to tag suspect email, so it wouldn't matter if there are both geeks and sex-starved teens (not the same thing?) in the audience; each can do with the extra tags as they wish.
The Google results also show this is not a new problem; research, heavy research, has been applied to Bayesian network classification of spam emails since 1994 ... so it is the first approach, or close to it. My first question is then, "if it works so well, with 0% false positives, then why did everyone, even Microsoft, abandon it?" ifile has been around for a long time, yet not even the Linux distros include it by default. That's a little suspicious, don't you think? If the method is so foolproof, why are there no fools using it?
Excuse my greying cynicism, but there's no mention in Paul's paper of how he's accounting for the mass failure of the corpus of work that goes before him, and I get a little dubious when one lone programmer claims they can out-think large numbers of trained professionals and academics. Yes, it does happen, but when you hear hoofbeats in the street, it's usually not a zebra.
Would this also work with email viruses? I think it would, since the virus would also have a defined pattern to it and the program would pick it up after the first one.
I actually proposed this on Advogato many moons ago, in February of 2001.
-Waldo Jaquith
I really like the elegance of this approach, but Mr. Graham neglected to brag about one important capability: It transcends the language of the text it filters. A good database of spam and non-spam messages in French, Italian, Greek, Russian, Arabic, Thai, Korean, Japanese, or whatever (you'll notice I put some double-byte languages in there) will generate good filtering of any sort of messages. This will continue to increase in importance as the Far East gets to be a larger portion of the Internet (for both legitimate and spam users.)
mail.yahoo.com or some other big free mail program should implement this. Yahoo has a "report as spam" button on every message you read, so that would be an easy way to build the spam group. As for the "good" group, dunno.
-shane
"Not knowing when the dawn will come, I open every door." - Emily Dickinson
And in the US, calls to an 800 (888/877/866/855) number from a payphone result in a 28-cent charge to the RECIPIENT (ie: the spammer), which is paid to the operator of the payphone.
Something to think about next time you have time to kill at a shopping mall or airport with rows of unused payphones....