Plan for Spam, Version 2
bugbear writes "I just posted a new version of the Plan for Spam Bayesian filtering algorithm. The big change is to mark tokens by context. The new version decreases spams missed by 50%, to 2.5 per 1000, even though spam has gotten harder to filter since the summer. I also talk about how spam will evolve, and what to do about it."
Hopefully we see even more stuff like this coming out of the spam conference
I run PopFile at work and it rules!
Please carry on with this Bayesian Spam filtering! It'll be the death of spam yet!
It's Christmas everyday with BitTorrent.
I'm using the filters in Moz 1.3 alpha and the Base64 encoded emails are not being recognized and flagged as spam. I've trained and trained and trained.
They almost always get through.
Anyone else experience this?
Also, how can you flag an ad that is an image? Block all HTML email?
I dunno.
But will it enlarge my penis?
I just posted a new version of ...
While I recognize it's a valid project, this type of announcement is more deserving of a frontpage at Sourceforge or Freshmeat. Now if there was a huge breakthrough, we could expect to see it posted here, right?
January 2003
(This article was given as a talk at the 2003 Spam Conference. It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam, and what I plan to do in the future.)
The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot.
Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].
When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn't everyone using it? When I read the papers I found out why. Pantel and Lin's filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.
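When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than .03% false positives [4]. It's always alarming when two people trying the same experiment get widely divergent results. It's especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than .03% false positives means that it is.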
So why did we get such different numbers? I haven't tried to reproduce Pantel and Lin's results, but from reading the paper I see five things that probably account for the difference.
One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with data sets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.
But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn't know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don't ignore data. You'd think this lesson would be too obvious to mention, but I've had to learn it several times.
Third, Pantel and Lin stemmed the tokens, meaning they reduced e.g. both ``mailing'' and ``mailed'' to the root ``mail''. They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.
Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.
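A minimal sketch of the "15 most significant tokens" idea, assuming each token already has a spam probability strictly between 0 and 1 and that the tokens are combined with the usual naive Bayes product. The function names are made up for illustration; this is not the filter's actual code.

```python
from math import prod

def interestingness(p):
    """Distance of a token's spam probability from the neutral value .5."""
    return abs(p - 0.5)

def combined_spam_probability(token_probs, n=15):
    """Combine only the n most interesting token probabilities.

    Assumes every probability is strictly between 0 and 1, so the
    denominator can never be zero.
    """
    top = sorted(token_probs, key=interestingness, reverse=True)[:n]
    spam = prod(top)
    nonspam = prod(1.0 - p for p in top)
    return spam / (spam + nonspam)
```

With a fixed cutoff like this, padding a message with mildly innocent words changes nothing once 15 stronger tokens are present, which is the spoofing resistance the paragraph above describes.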
Finally, they didn't bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double.
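One way to picture that knob, as a sketch only: the per-token probability below weights occurrences in the nonspam corpus by `good_weight` (2 reproduces the doubling just described). The function name, the parameter names, and the handling of unseen tokens are assumptions for illustration; the clamp values .01 and .99 are the thresholds note [6] attributes to the earlier filter.

```python
def token_spam_probability(bad_count, good_count, nbad, ngood,
                           good_weight=2, lo=0.01, hi=0.99):
    """Spam probability of one token from its corpus counts.

    bad_count / good_count: occurrences in the spam / nonspam corpus.
    nbad / ngood: number of messages in each corpus (assumed > 0).
    good_weight: bias against false positives; 2 counts nonspam double.
    """
    g = good_weight * good_count
    b = bad_count
    if g + b == 0:
        return None  # token never seen; the caller decides what to do
    bad_freq = min(1.0, b / nbad)
    good_freq = min(1.0, g / ngood)
    p = bad_freq / (good_freq + bad_freq)
    return max(lo, min(hi, p))
```

Raising `good_weight` pushes every token's probability toward the innocent end, trading filtering rate for fewer false positives.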
I don't think it's a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live human spammer working actively to defeat your filter.
Tokens
Another project I heard about after the Slashdot article was Bill Yerazunis' CRM114 [5]. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing.
Once I understood how CRM114 worked, it seemed inevitable that I would eventually have to move from filtering based on single words to an approach like this. But first, I thought, I'll see how far I can get with single words. And the answer is, surprisingly far.
Mostly I've been working on smarter tokenization. On current spam, I've been able to achieve filtering rates that approach CRM114's. These techniques are mostly orthogonal to Bill's; an optimal solution might incorporate both.
``A Plan for Spam'' uses a very simple definition of a token. Letters, digits, dashes, apostrophes, and dollar signs are constituent characters, and everything else is a token separator. I also ignored case. Now I have a more complicated definition of a token:
Case is preserved.
Exclamation points are constituent characters.
Periods and commas are constituents if they occur between two digits. This lets me get ip addresses and prices intact.
A price range like $20-25 yields two tokens, $20 and $25.
Tokens that occur within the To, From, Subject, and Return-Path lines, or within urls, get marked accordingly. E.g. ``foo'' in the Subject line becomes ``Subject*foo''. (The asterisk could be any character you don't allow as a constituent.)
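The rules above can be approximated with a couple of regular expressions. This is a sketch under assumptions, not the filter's real tokenizer: the helper names, the `headers` dictionary, and the crude URL pattern are all invented for the example.

```python
import re

# Constituents: letters, digits, $, ', !, -; plus . or , between two digits.
TOKEN_RE = re.compile(r"(?:[A-Za-z0-9$'!-]|(?<=[0-9])[.,](?=[0-9]))+")
PRICE_RANGE_RE = re.compile(r"^\$([0-9][0-9.,]*)-([0-9][0-9.,]*)$")
URL_RE = re.compile(r"https?://\S+")  # crude URL matcher, an assumption
MARKED_HEADERS = ("To", "From", "Subject", "Return-Path")

def tokenize(text, context=None):
    """Split text into case-preserving tokens, optionally marked by context."""
    tokens = []
    for raw in TOKEN_RE.findall(text):
        m = PRICE_RANGE_RE.match(raw)
        # A price range like $20-25 becomes the two tokens $20 and $25.
        parts = ["$" + m.group(1), "$" + m.group(2)] if m else [raw]
        for part in parts:
            tokens.append(f"{context}*{part}" if context else part)
    return tokens

def tokenize_message(headers, body):
    """Tokenize marked header lines, URLs in the body, then the rest of the body."""
    tokens = []
    for name in MARKED_HEADERS:
        if name in headers:
            tokens += tokenize(headers[name], context=name)
    for url in URL_RE.findall(body):
        tokens += tokenize(url, context="Url")
    tokens += tokenize(URL_RE.sub(" ", body))
    return tokens
```

For example, `tokenize("FREE!!! only $20-25, see 192.168.0.1.")` keeps the ip address whole and splits the price range, and a Subject line containing `foo` yields `Subject*foo`.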
Such measures increase the filter's vocabulary, which makes it more discriminating. For example, in the current filter, ``free'' in the Subject line has a spam probability of 98%, whereas the same token in the body has a spam probability of only 65%.
In the Plan for Spam filter, all these tokens would have had the same probability, .7602. That filter recognized about 23,000 tokens. The current one recognizes about 187,000.
The disadvantage of having a larger universe of tokens is that there is more chance of misses. Spreading your corpus out over more tokens has the same effect as making it smaller. If you consider exclamation points as constituents, for example, then you could end up not having a spam probability for free with seven exclamation points, even though you know that free with just two exclamation points has a probability of 99.99%.
One solution to this is what I call degeneration. If you can't find an exact match for a token, treat it as if it were a less specific version. I consider terminal exclamation points, uppercase letters, and occurring in one of the five marked contexts as making a token more specific. For example, if I don't find a probability for ``Subject*free!'', I look for probabilities for ``Subject*free'', ``free!'', and ``free'', and take whichever one is farthest from .5.
Here are the alternatives [7] considered if the filter sees ``FREE!!!'' in the Subject line and doesn't have a probability for it.
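A rough sketch of how such alternatives might be generated and looked up. It assumes a token is made less specific along three axes (context prefix, trailing exclamation points reduced to one and then none, and case from all uppercase to initial cap to lowercase); the reduction steps and function names are assumptions for illustration, not a description of the filter's actual code.

```python
def degenerate(token):
    """List the token followed by progressively less specific versions."""
    context, sep, word = token.rpartition("*")
    stem = word.rstrip("!")
    bangs = len(word) - len(stem)

    # Exclamation points: as written, then a single one, then none.
    bang_forms = []
    for n in (bangs, 1, 0):
        if n <= bangs and "!" * n not in bang_forms:
            bang_forms.append("!" * n)

    # Case: only move toward less specific forms, never add capitals.
    case_forms = [stem]
    if stem.isupper() and len(stem) > 1:
        case_forms.append(stem.capitalize())
    if stem.lower() != stem:
        case_forms.append(stem.lower())

    prefixes = [context + "*", ""] if sep else [""]
    return [p + c + b for p in prefixes for b in bang_forms for c in case_forms]

def lookup(token, probs):
    """Exact match if known; otherwise the degenerate form farthest from .5."""
    if token in probs:
        return probs[token]
    known = [probs[t] for t in degenerate(token)[1:] if t in probs]
    if not known:
        return None
    return max(known, key=lambda p: abs(p - 0.5))
```

Under these assumptions `degenerate("Subject*FREE!!!")` starts with the token itself and ends with plain `free`.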
If you do this, be sure to consider versions with initial caps as well as all uppercase and all lowercase. Spams tend to have more sentences in imperative voice, and in those the first word is a verb. So verbs with initial caps have higher spam probabilities than they would in all lowercase. In my filter, the spam probability of ``Act'' is 98% and for ``act'' only 62%.
If you increase your filter's vocabulary, you can end up counting the same word multiple times, according to your old definition of ``same''. Logically, they're not the same token anymore. But if this still bothers you, let me add from experience that the words you seem to be counting multiple times tend to be exactly the ones you'd want to.
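Another effect of a larger vocabulary is that when you look at an incoming mail you find more interesting tokens, meaning those with probabilities far from .5. I use the 15 most interesting to decide if mail is spam. But you can run into a problem when you use a fixed number like this. If you find a lot of maximally interesting tokens, the result can end up being decided by whatever random factor determines the ordering of equally interesting tokens. One way to deal with this is to treat some as more interesting than others.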
For example, the token ``dalco'' occurs 3 times in my spam corpus and never in my legitimate corpus. The token ``Url*optmails'' (meaning ``optmails'' within a url) occurs 1223 times. And yet, as I used to calculate probabilities for tokens, both would have the same spam probability, the threshold of .99.
That doesn't feel right. There are theoretical arguments for giving these two tokens substantially different probabilities (Pantel and Lin do), but I haven't tried that yet. It does seem at least that if we find more than 15 tokens that only occur in one corpus or the other, we ought to give priority to the ones that occur a lot. So now there are two threshold values. For tokens that occur only in the spam corpus, the probability is .9999 if they occur more than 10 times and .9998 otherwise. Ditto at the other end of the scale for tokens found only in the legitimate corpus.
I may later scale token probabilities substantially, but this tiny amount of scaling at least ensures that tokens get sorted the right way.
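The two-threshold rule is small enough to write down directly. In this sketch the function name is hypothetical, and the values .0001 and .0002 for the legitimate end are an assumed mirror image of the spam-side values given above.

```python
def one_corpus_probability(bad_count, good_count, frequent=10):
    """Probability for a token seen in only one corpus, else None.

    Tokens seen more than `frequent` times get the slightly more extreme
    value, so they sort ahead of rarer one-corpus tokens.
    """
    if bad_count > 0 and good_count == 0:
        return 0.9999 if bad_count > frequent else 0.9998
    if good_count > 0 and bad_count == 0:
        return 0.0001 if good_count > frequent else 0.0002  # assumed mirror
    return None  # seen in both corpora (or neither): use the normal formula
```

With the counts from the example, ``Url*optmails'' (1223 occurrences) now sorts ahead of ``dalco'' (3 occurrences) when tokens are ranked by distance from .5.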
Another possibility would be to consider not just 15 tokens, but all the tokens over a certain threshold of interestingness. Steven Hauser does this in his statistical spam filter [8]. If you use a threshold, make it very high, or spammers could spoof you by packing messages with more innocent words.
Finally, what should one do about html? I've tried the whole spectrum of options, from ignoring it to parsing it all. Ignoring html is a bad idea, because it's full of useful spam signs. But if you parse it all, your filter might degenerate into a mere html recognizer. The most effective approach seems to be the middle course, to notice some tokens but not others. I look at a, img, and font tags, and ignore the rest. Links and images you should certainly look at, because they contain urls.
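A sketch of the middle course using Python's standard `html.parser`: tags other than a, img, and font are dropped, while the noticed tags contribute their name and attribute values (so urls survive for tokenizing). Which parts of a noticed tag to keep is an assumption here, as are the class and function names.

```python
from html.parser import HTMLParser

NOTICED_TAGS = {"a", "img", "font"}

class NoticedTagCollector(HTMLParser):
    """Collects visible text plus the name and attribute values of noticed tags."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in NOTICED_TAGS:
            self.pieces.append(tag)
            self.pieces.extend(value for _, value in attrs if value)

    def handle_data(self, data):
        if data.strip():
            self.pieces.append(data.strip())

def html_to_pieces(html):
    """Return the text fragments a tokenizer would go on to split."""
    collector = NoticedTagCollector()
    collector.feed(html)
    collector.close()
    return collector.pieces
```

The returned pieces would then be fed to whatever tokenizer the filter uses, with urls marked as such.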
I could probably be smarter about dealing with html, but I don't think it's worth putting a lot of time into this. Spams full of html are easy to filter. The smarter spammers already avoid it. So performance in the future should not depend much on how you deal with html.
Performance
Between December 10 2002 and January 10 2003 I got about 1750 spams. Of these, 4 got through. That's a filtering rate of about 99.75%.
Two of the four spams I missed got through because they happened to use words that occur often in my legitimate email.
The third was one of those that exploit an insecure cgi script to send mail to third parties. They're hard to filter based just on the content because the headers are innocent and they're careful about the words they use. Even so I can usually catch them. This one squeaked by with a probability of .88, just under the threshold of .9.
Of course, looking at multiple token sequences would catch it easily. ``Below is the result of your feedback form'' is an instant giveaway.
The fourth spam was what I call a spam-of-the-future, because this is what I expect spam to evolve into: some completely neutral text followed by a url. In this case it was from someone saying they had finally finished their homepage and would I go look at it. (The page was of course an ad for a porn site.)
If the spammers are careful about the headers and use a fresh url, there is nothing in spam-of-the-future for filters to notice. We can of course counter by sending a crawler to look at the page. But that might not be necessary. The response rate for spam-of-the-future must be low, or everyone would be doing it. If it's low enough, it won't pay for spammers to send it, and we won't have to work too hard on filtering it.
Now for the really shocking news: during that same one-month period I got three false positives.
In a way it's a relief to get some false positives. When I wrote ``A Plan for Spam'' I hadn't had any, and I didn't know what they'd be like. Now that I've had a few, I'm relieved to find they're not as bad as I feared. False positives yielded by statistical filters turn out to be mails that sound a lot like spam, and these tend to be the ones you would least mind missing [9].
Two of the false positives were newsletters from companies I've bought things from. I never asked to receive them, so arguably they were spams, but I count them as false positives because I hadn't been deleting them as spams before. The reason the filters caught them was that both companies in January switched to commercial email senders instead of sending the mails from their own servers, and both the headers and the bodies became much spammier.
The third false positive was a bad one, though. It was from someone in Egypt and written in all uppercase. This was a direct result of making tokens case sensitive; the Plan for Spam filter wouldn't have caught it.
It's hard to say what the overall false positive rate is, because we're up in the noise, statistically. Anyone who has worked on filters (at least, effective filters) will be aware of this problem. With some emails it's hard to say whether they're spam or not, and these are the ones you end up looking at when you get filters really tight. For example, so far the filter has caught two emails that were sent to my address because of a typo, and one sent to me in the belief that I was someone else. Arguably, these are neither my spam nor my nonspam mail.
Another false positive was from a vice president at Virtumundo. I wrote to them pretending to be a customer, and since the reply came back through Virtumundo's mail servers it had the most incriminating headers imaginable. Arguably this isn't a real false positive either, but a sort of Heisenberg uncertainty effect: I only got it because I was writing about spam filtering.
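Not counting these, I've had a total of five false positives so far, out of about 7740 legitimate emails, a rate of .06%. The other two were a notice that something I bought was back-ordered, and a party reminder from Evite.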
I don't think this number can be trusted, partly because the sample is so small, and partly because I think I can fix the filter not to catch some of these.
False positives seem to me a different kind of error from false negatives. Filtering rate is a measure of performance. False positives I consider more like bugs. I approach improving the filtering rate as optimization, and decreasing false positives as debugging.
So these five false positives are my bug list. For example, the mail from Egypt got nailed because the uppercase text made it look to the filter like a Nigerian spam. This really is kind of a bug. As with html, the email being all uppercase is really conceptually one feature, not one for each word. I need to handle case in a more sophisticated way.
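So what to make of this .06%? Not much, I think. You could treat it as an upper bound, bearing in mind the small sample size. But at this stage it is more a measure of the bugs in my implementation than some intrinsic false positive rate of Bayesian filtering.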
Future
What next? Filtering is an optimization problem, and the key to optimization is profiling. Don't try to guess where your code is slow, because you'll guess wrong. Look at where your code is slow, and fix that. In filtering, this translates to: look at the spams you miss, and figure out what you could have done to catch them.
For example, spammers are now working aggressively to evade filters, and one of the things they're doing is breaking up and misspelling words to prevent filters from recognizing them. But working on this is not my first priority, because I still have no trouble catching these spams [10].
There are two kinds of spams I currently do have trouble with. One is the type that pretends to be an email from a woman inviting you to go chat with her or see her profile on a dating site. These get through because they're the one type of sales pitch you can make without using sales talk. They use the same vocabulary as ordinary email.
The other kind of spams I have trouble filtering are those from companies in e.g. Bulgaria offering contract programming services. These get through because I'm a programmer too, and the spams are full of the same words as my real mail.
I'll probably focus on the personal ad type first. I think if I look closer I'll be able to find statistical differences between these and my real mail. The style of writing is certainly different, though it may take multiword filtering to catch that. Also, I notice they tend to repeat the url, and someone including a url in a legitimate mail wouldn't do that [11].
The outsourcing type are going to be hard to catch. Even if you sent a crawler to the site, you wouldn't find a smoking statistical gun. Maybe the only answer is a central list of domains advertised in spams [12]. But there can't be that many of this type of mail. If the only spams left were unsolicited offers of contract programming services from Bulgaria, we could all probably move on to working on something else.
Will statistical filtering actually get us to that point? I don't know. Right now, for me personally, spam is not a problem. But spammers haven't yet made a serious effort to spoof statistical filters. What will happen when they do?
I'm not optimistic about filters that work at the network level [13]. When there is a static obstacle worth getting past, spammers are pretty efficient at getting past it. There is already a company called Assurance Systems that will run your mail through Spamassassin and tell you whether it will get filtered out.
Network-level filters won't be completely useless. They may be enough to kill all the "opt-in" spam, meaning spam from companies like Virtumundo and Equalamail who claim that they're really running opt-in lists. You can filter those based just on the headers, no matter what they say in the body. But anyone willing to falsify headers or use open relays, presumably including most porn spammers, should be able to get some message past network-level filters if they want to. (By no means the message they'd like to send though, which is something.)
The kind of filters I'm optimistic about are ones that calculate probabilities based on each individual user's mail. These can be much more effective, not only in avoiding false positives, but in filtering too: for example, finding the recipient's email address base-64 encoded anywhere in a message is a very good spam indicator.
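A sketch of how a per-user filter might check for that indicator. It looks for runs of base64 characters in the raw message, decodes them at each of the four possible alignments, and searches the result for the recipient's address. The 16-character minimum, the function name, and the decision to ignore addresses split across line-wrapped base64 are assumptions made for the example.

```python
import base64
import binascii
import re

# Runs of base64 alphabet characters, optionally followed by padding.
B64_RUN_RE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def contains_encoded_address(message_text, address):
    """True if `address` appears base64-encoded somewhere in message_text."""
    needle = address.lower().encode("ascii")
    for run in B64_RUN_RE.findall(message_text):
        # The encoded data may not start on a 4-character boundary of the run.
        for offset in range(4):
            chunk = run[offset:].rstrip("=")
            chunk += "=" * (-len(chunk) % 4)
            try:
                decoded = base64.b64decode(chunk)
            except (binascii.Error, ValueError):
                continue  # this alignment is not valid base64
            if needle in decoded.lower():
                return True
    return False
```

A positive result could be fed to the filter as one more token (a hypothetical marker such as ``Encoded*recipient'') rather than treated as a verdict on its own.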
But the real advantage of individual filters is that they'll all be different. If everyone's filters have different probabilities, it will make the spammers' optimization loop, what programmers would call their edit-compile-test cycle, appallingly slow. Instead of just tweaking a spam till it gets through a copy of some filter they have on their desktop, they'll have to do a test mailing for each tweak. It would be like programming in a language without an interactive toplevel, and I wouldn't wish that on anyone.
Notes
[1] Paul Graham. ``A Plan for Spam.'' August 2002. http://paulgraham.com/spam.html.
Probabilities in this algorithm are calculated using a degenerate case of Bayes' Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam.
The first assumption is widespread in text classification. Algorithms that use it are called ``naive Bayesian.''
The second assumption I made because the proportion of spam in my incoming mail fluctuated so much from day to day (indeed, from hour to hour) that the overall prior ratio seemed worthless as a predictor. If you assume that P(spam) and P(nonspam) are both .5, they cancel out and you can remove them from the formula.
If you were doing Bayesian filtering in a situation where the ratio of spam to nonspam was consistently very high or (especially) very low, you could probably improve filter performance by incorporating prior probabilities. To do this right you'd have to track ratios by time of day, because spam and legitimate mail volume both have distinct daily patterns.
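For reference, the cancellation can be written out with Bayes' Rule for a single token w. The notation is chosen here for illustration; the note itself gives no formula.

```latex
P(\text{spam} \mid w)
  = \frac{P(w \mid \text{spam})\,P(\text{spam})}
         {P(w \mid \text{spam})\,P(\text{spam}) + P(w \mid \text{nonspam})\,P(\text{nonspam})}
  = \frac{P(w \mid \text{spam})}{P(w \mid \text{spam}) + P(w \mid \text{nonspam})}
  \quad \text{when } P(\text{spam}) = P(\text{nonspam}) = .5
```

Incorporating priors means keeping the first form and plugging in measured values for P(spam) and P(nonspam) instead of .5.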
[2] Patrick Pantel and Dekang Lin. ``SpamCop-- A Spam Classification & Organization Program.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.
[3] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. ``A Bayesian Approach to Filtering Junk E-Mail.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.
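[4] At the time I had zero false positives out of about 4,000 legitimate emails. If the next legitimate email was a false positive, this would give us .03%. These false positive rates are untrustworthy, as I explain later. I quote a number here only to emphasize that whatever the false positive rate is, it is less than 1.16%.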
[5] Bill Yerazunis. ``Sparse Binary Polynomial Hash Message Filtering and The CRM114 Discriminator.'' Proceedings of 2003 Spam Conference.
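[6] In ``A Plan for Spam'' I used thresholds of .99 and .01. It seems justifiable to use thresholds proportionate to the size of the corpora. Since I now have on the order of 10,000 of each type of mail, I use .9999 and .0001.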
[7] There is a flaw here I should probably fix. Currently, when ``Subject*foo'' degenerates to just ``foo'', what that means is you're getting the stats for occurrences of ``foo'' in the body or header lines other than those I mark. What I should do is keep track of statistics for ``foo'' overall as well as specific versions, and degenerate from ``Subject*foo'' not to ``foo'' but to ``Anywhere*foo''. Ditto for case: I should degenerate from uppercase to any-case, not lowercase.
It would probably be a win to do this with prices too, e.g. to degenerate from ``$129.99'' to ``$--9.99'', ``$--.99'', and ``$--''.
You could also degenerate from words to their stems, but this would probably only improve filtering rates early on when you had small corpora.
[8] Steven Hauser. ``Statistical Spam Filter Works for Me.'' http://www.sofbot.com.
[9] False positives are not all equal, and we should remember this when comparing techniques for stopping spam. Whereas many of the false positives caused by filters will be near-spams that you wouldn't mind missing, false positives caused by blacklists, for example, will be just mail from people who chose the wrong ISP. In both cases you catch mail that's near spam, but for blacklists nearness is physical, and for filters it's textual.
In fairness, it should be added that the new generation of responsible blacklists, like the SBL, cause far fewer false positives than earlier blacklists like the MAPS RBL, for whom causing large numbers of false positives was a deliberate technique to get the attention of ISPs.
[10] If spammers get good enough at obscuring tokens for this to be a problem, we can respond by simply removing whitespace, periods, commas, etc. and using a dictionary to pick the words out of the resulting sequence. And of course finding words this way that weren't visible in the original text would in itself be evidence of spam.
Picking out the words won't be trivial. It will require more than just reconstructing word boundaries; spammers both add (``xHot nPorn cSite'') and omit (``P#rn'') letters. Vision research may be useful here, since human vision is the limit that such tricks will approach.
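The simplest form of the idea in this note, as a sketch: squeeze out everything but letters, look for dictionary words in the result, and report those that were not visible as ordinary words in the original text. The function name and the plain substring search are assumptions. It does nothing about added or omitted letters, and words that straddle real word boundaries will produce false hits.

```python
import re

def hidden_words(text, dictionary):
    """Dictionary words found only after removing spaces and punctuation."""
    lowered = text.lower()
    visible = set(re.findall(r"[a-z]+", lowered))
    squeezed = re.sub(r"[^a-z]", "", lowered)
    found = {word for word in dictionary if word in squeezed}
    return found - visible
```

Any word this returns was, by construction, not readable by a single-word tokenizer, which is the "evidence of spam" the note mentions.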
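[11] In general, spams are more repetitive than regular email. They want to pound that message home. I currently don't allow duplicates in the top 15 tokens, because you could get a false positive if the sender happens to use some bad word multiple times. (In my current filter, ``dick'' has a spam probability of .9999, but it's also a name.) It seems we should at least notice duplication though, so I may try allowing up to two of each token, as Brian Burton does in SpamProbe.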
[12] This is what approaches like Brightmail's will degenerate into once spammers are pushed into using mad-lib techniques to generate everything else in the message.
[13] It's sometimes argued that we should be working on filtering at the network level, because it is more efficient. What people usually mean when they say this is: we currently filter at the network level, and we don't want to start over from scratch. But you can't dictate the problem to fit your solution.
Historically, scarce-resource arguments have been the losing side in debates about software design. People only tend to use them to justify choices (inaction in particular) made for other reasons.
Thanks to Sarah Harlin, Trevor Blackwell, and Dan Giffin for reading drafts of this paper, and to Dan again for most of the infrastructure that this filter runs on.
Reply or e-mail; don't vaguely moderate. Ex-O'Reilly/MIT employee, now a full-time Google employee.
You, sir, are my hero. A true Stalin.
I'll be curious how spammers counteract this. Probably just send more and more to those who aren't filtered. I never thought of filtering all combinations of capitalization.
My users were complaining about spam again today. I walked over to discuss it with them and lo and behold, all stuff they signed up for except 2 klez emails.
rm -fr ~/Mail
would do the trick.
Note to self: get smarter troll to guard door.
Anyway, I've said a few times the only way to effectively stop spam is to make it more expensive to the companies having it done. Filtering, blocking ports, refusing mail from RBL'd hosts all helps, but it will not stop until it is fully against the law and people bring legal action to stop it.
Even people who are supposed to be clueful don't get it. I got spammed to buy EZ-Pass for the PA Turnpike. I sent a nastygram to the state DoT. The keyboard monkey responded that I should look closely at the email, that I signed up to receive it. If I had a dollar for every site that claimed I signed up with them I would be rich. What an idiot.
Some states, like California, have anti-spam laws, but curiously, they only cover spam sent from California to California. My state's telephone do-not-call list covers all calls to my number, no matter where they originate.
Now, I understand that there would be problems with international spam, but stopping domestic spam would be a huge boon to everyone. It seems like this legislation would be wildly popular, and easy to pass.
I really like this analytic approach. I've been using Hotmail's spam filtering, which merely removes e-mails from addresses not in my address book. While this is effective most of the time and very easy to implement, there does seem to be a major problem with false positives, i.e. when I give my e-mail address to someone who's not in my address book.
Does anyone think AOL or Hotmail could start using such a system as the one outlined in the article?
First of all, let's realize that email is communication is data transmission. Spam is noise. This immediately brings to mind Claude Shannon's work on information and entropy. He made it very clear that noise can be reduced to a level that is O log(n) that of the information transmitted. This means that as we have more and more email out there, we are going to get more and more noise, unless we change something.
Let's go back to the definition of information. Basically, it's stuff that nobody knows about. If it is surprising to you, it is information (in non-technical language). That suggests that perhaps the information content (and therefore spam) could be reduced if, instead of secretively emailing our friends individually, we CC'd them on all our missives. This would make the amount of information lower (since people would be less surprised by our further revelations, having seen the foregoing matter) and therefore spam might even be eliminated.
As long as it stops all the emails I get from Ubi Lumjobo trying to get me to accept $21.5m from South Africa then I'll be happy. :)
Or the people that try to make my breasts larger...
Or viagra...
Martin Piper
Owner - ReplicaNet and RNLobby
The irony is, Spam evolves, yet people still fall for spam. If it didn't work, we'd have seen the last of it years ago.
I've downloaded MailWasher and have just started looking through it (so I don't know what it uses for filtering.) I've noticed a lot of the recent junk is html with the ploy spelled among comment tags, i.e.:
Are any of the filters able to handle these?
Lastly, has anyone ever bothered to combat spam with spam? I.e. send out a letter explaining what people are likely to get, aside from their credit card charged out to a pr0n site, sugar pills, photocopies of something you can find in any library, identity theft, etc.?
A feeling of having made the same mistake before: Deja Foobar
And the conflict rages on. The better filters we use, the sneakier the spam artists get. Now we're developing self-modifying algorithms to detect and kill spam, and I'm sure the spammers are developing self-modifying algorithms to craft filter-tricking spam.
How long before the back-and-forth of spam filters and spam crafters becomes self-aware? It's got to happen. Eventually the spam filters will become a skeptic consciousness that *feels* its way through spam and spots the phoneys, and the spam crafters will become a persuasive consciousness that tries to think and write as a close friend or relative.
...
This is a wonderful tool that is being developed. However, I don't think any one tool will succeed in eliminating spam. From a spammer's point of view, if my income depends on messages making it through filters, by damn I will bypass those filters by whatever means I can. These assholes send penis enlargement advertisements to my mother -- If her gender doesn't stop them, neither will an email filter.
On a different subject, in a story about a week ago, someone posted a link to a peer-peer network of spam emails for MS Outlook available at http://www.cloudmark.com that will trap a significant amount of emails based on (and this is overly simplified, of course) users' votes. Does such a solution exist in the open source world?
<:
The real scary part of the article is about what he called "Spam of the Future". It's really interesting. Basically, it's a spam message that has a lot of seemingly normal text that won't get caught in the spam filter, because it IS normal text. It's then followed by a link, usually to a porn site.
Here is your opt-in FREE! porn!
Moneyed corporations, non-working 'poor' and criminal prisoners are turning productive citizens into tax-slaves.
Without spam, how else would I be able to sit home every day and make $1,000 a week watching TV while playing with my 12 inch penis?
Reply or e-mail; don't vaguely moderate. Ex-O'Reilly/MIT employee, now a full-time Google employee.
The latest development version of SpamAssassin has an interesting application of Bayesian filtering. Basically, it takes all of SA's existing heuristics, uses them to develop a sense of what is and is not spam, and then pumps the results through a Bayesian filter that learns from these messages.
As with any other SA test, no single element of the chain is trusted enough to definitively call something spam, but if a message would have squeaked through before, this new filter can put the final nail in its coffin through word analysis against previous spam.
So, why did I use a subject about "ENDING spam"? Because one of the tools that spammers have is SA itself. They can use it to score their messages and determine how "spamish" they are. The problem now is that each SA installation will have subtly different scoring, and the message may be "ok" according to the spammer's version, but my version has a better sense of the mail that *I* get.
SpamAssassin is definitely a tool worth checking out if you have not already. Install it in daemon mode (spamd) and then use "spamc -f" in your procmailrc or the equiv for your MTA.
Very nice tool, and a real time-saver for me.
I have just set up a system which parses spam email, locates any Web addresses, strips out the parameters, and then visits the Web site. Just think if we ALL did this. So rather than the poor spammer only getting a .001% hit rate, they get an astounding 100% hit rate. So 1 million emails sent, 1 million instant Web page hits. And it is not like they can complain about this, after all they are ASKING for the hits.
Even better is that my domain gets multiple spams from the same company.
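For anyone wondering what the "locate Web addresses and strip out the parameters" step might look like, here is a minimal sketch in Python. It is an assumption about how such a system could work, not the poster's actual code: the regular expression and the function name extract_bare_urls are made up for illustration, and the sketch deliberately stops short of visiting anything.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Rough pattern for http/https links in plain text or HTML (illustrative only).
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_bare_urls(message_text):
    """Return the unique URLs in a message with query strings and fragments removed."""
    bare_urls = []
    for match in URL_PATTERN.findall(message_text):
        # Drop trailing punctuation that usually belongs to the sentence, not the URL.
        parts = urlsplit(match.rstrip(".,;)"))
        # Rebuild the URL without the query and fragment, where tracking IDs tend to live.
        bare = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if bare not in bare_urls:
            bare_urls.append(bare)
    return bare_urls
```

Stripping the query string matters because that is typically where a per-recipient identifier would be carried, so a visit to the bare URL does not confirm any particular address.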
2.5 per 1000
:)
So it catches 2 or 3 of every 1000 spam messages. My worry would be how many non-spam messages it catches?
I'd hate it to tag any personal mail as spam
Could Bayesian filtering be applied to filter offtopic posts as well?
Good people do not need laws to tell them to act responsibly, while bad people will find a way around the laws-Plato
Ooooops! Wrong window. Sorry.
Modest doubt is called the beacon of the wise. - William Shakespeare
Overblown? The fact that you would need more than one email account to keep from having your time wasted by spam proves otherwise.
The basics are, you take all good mails, and create a database of words used in them. Make a different database for spam mails. Then, for each incoming mail, compare to each database, and classify as spam or non-spam.
The algorithm starts out conservative, i.e., you get most of the mail classified as good. For each "good" email that is spam, you manually re-classify it.
Then, after a few weeks, the filter does all the work. It is basically using word-databases to compare emails and classify them the way you, the user would. Periodically you will receive another spam email, then you re-classify it, and never see an email like it again (in your inbox).
Bogofilter and CRM114 are among the more successful efforts so far, but there are many. And they are FAR more successful than blacklist/whitelist/fixed token comparison filters. But Bayesian filtering is just a near optimal way to replicate the classification of the user, which is also why it works so well.
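The two-database idea described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how bogofilter or CRM114 actually work: the class name WordDatabaseFilter, the tokenizer, the add-one smoothing, and the way per-word probabilities are combined are all choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class WordDatabaseFilter:
    def __init__(self):
        self.good_words = Counter()  # word -> number of good mails containing it
        self.spam_words = Counter()  # word -> number of spam mails containing it
        self.good_count = 0
        self.spam_count = 0

    def train(self, text, is_spam):
        """Add one mail to the good or spam database (also used to re-classify)."""
        words = set(tokenize(text))
        if is_spam:
            self.spam_words.update(words)
            self.spam_count += 1
        else:
            self.good_words.update(words)
            self.good_count += 1

    def word_spam_probability(self, word):
        # Add-one smoothing keeps unseen words away from exactly 0 or 1.
        spam_rate = (self.spam_words[word] + 1) / (self.spam_count + 2)
        good_rate = (self.good_words[word] + 1) / (self.good_count + 2)
        return spam_rate / (spam_rate + good_rate)

    def spam_probability(self, text):
        """Combine per-word probabilities, treating words as independent evidence."""
        log_odds = 0.0
        for word in set(tokenize(text)):
            p = self.word_spam_probability(word)
            log_odds += math.log(p) - math.log(1 - p)
        # Clamp so very long messages cannot overflow the exponential.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))
```

Re-classifying a spam that slipped through is just another call to train(text, is_spam=True), which is why the filter drifts toward the individual user's own idea of spam.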
why not just go back to blocking all of china
Spam filters are great, but it seems that only the Net-savvy are using them. Savvy users aren't the people spammers are making all their money from--they are making money off the naive and inexperienced users. These users aren't going to go out and install the latest Bayesian filters on their system, and the major email readers won't (and probably shouldn't) come with them automatically activated.
To make spam cost-ineffective for the spammers, we've got to stop it (or flag it) before it gets to the end-user. It would obviously be a mistake to allow ISPs to automatically delete all email that fails their spam filters, but I think it would be appropriate for them to include something in the headers flagging such email as probable spam. Then future email readers could detect this header and handle it gracefully, like moving it to a "spam" folder on the user's machine. Once this happens and Grandpa no longer gets email asking him to test the latest Viagra alternative, spam may become a thing of the past.
I think I speak for everyone when I say false positives are the only real hindrance to the filtering of spam. I get roughly 20 emails a day, 75% of which are spam. If one of them slips past the filter and I see it, it doesn't bother me so much. Spam is no longer a problem. What is an absolute necessity, though, (and probably less so for me than other people) is that none of my legitimate email is filtered as spam. I'd rather have 100 spams filtered improperly than one legit email.
Whale
Yeah, 2.5 per 1000 getting through is a proof that his ideas are obviously flawed. Having a working system is the best proof that an idea works :)
Travis
So you'd rather have 1000 out of 1000 pieces of junk mail over 2.5 out of 1000. Way to keep things in perspective! (Actually, what you were doing is trying to show off what little intellect you have.)
Interesting read and all, but CmdrTaco, you forgot to mention how many spam mails you've received already today ...
Inquiring minds want to know!
;)
I'm really excited about all of the neat stuff happening with Bayesian filtering and related technologies, but I just wanted to put in a plug for TMDA, Tagged Message Delivery Agent, which uses a whitelist-centric strategy. Since I began using it, the amount of spam I have to look at is virtually at zero. If you haven't read about it yet, check it out.
Everyone but the folks at SpamAssassin have been focusing on the idea that any one technique for identifying spam is doomed to diminishing returns.
Over at SpamAssassin, they've been busily creating a system that collects "good enough" tests by the dozens and uses them to collectively score a message and determine its general "spamishness". The system relies on a complex scoring system that is determined, not by the whim of human programmers, but by the results of a genetic training system that pits one set of scores against another until equilibrium is reached for a given set of example spam and non-spam.
See my other post here for how Bayesian filtering will be used to allow this system to feed back on itself and improve as it sees more of your spam and non-spam....
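To make the "dozens of good-enough tests" idea concrete, here is a small Python sketch of additive scoring. Everything in it is hypothetical: the rule names, the hand-picked weights, and the threshold of 5.0 are assumptions for illustration and are not SpamAssassin's real rules or scores (which, per the post above, come out of genetic training rather than anyone's guess).

```python
# Each rule is (name, test function, weight). Names and weights are made up.
RULES = [
    ("subject_all_caps", lambda subject, body: subject.isupper(), 1.5),
    ("mentions_free", lambda subject, body: "free" in body.lower(), 1.0),
    ("many_exclamations", lambda subject, body: (subject + body).count("!") >= 3, 2.0),
    ("html_red_font", lambda subject, body: "ff0000" in body.lower(), 2.5),
]


def score_message(subject, body, rules=RULES, threshold=5.0):
    """Sum the weights of every rule that fires; flag the mail if the total reaches the threshold."""
    hits = [(name, weight) for name, test, weight in rules if test(subject, body)]
    total = sum(weight for _, weight in hits)
    return total, total >= threshold, [name for name, _ in hits]
```

No single rule reaches the threshold on its own, so a legitimate mail that trips one test still gets through, while a mail that trips several does not.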
The article mentions compiling a vast collection of spam. Such a project is already underway at SpamArchive.
Paul Graham, great book, ANSI lisp, if there is one person that knows spam it is this guy.
---- Berlin Brown http://www.newspiritcompany.
The victims are expending considerable amounts of individual CPU time to classify mail which they must read (albeit mechanically).
Rejecting mail from IP addresses known to send spam (or teergrubing to tie up spammer resources) puts the burden back on the spammers, where it belongs.
Absent effective out-of-band defenses (such as the courts and the legal system), wasting money on filtering is a foolish effort just to benefit a few innocent sources who choose to share an IP address with spammers. And if they pay money to a spam-tolerant ISP, are they really innocent ?
You probably get no spam to your home or cell phone because it's too expensive to set up a company in China and make phone calls to the US, just to get around the laws. Unfortunately, it *is* basically free to send spam mail. If they could call you for free from outside the US, they would be doing that too.
I have been quite excited with all the new ideas being put to use in fighting spam recently. Unfortunately, whenever I find one that is implemented, it doesn't work with my mail server or my client. It seems like there should be a standard API that spam filters could implement (using SOAP or XML-RPC or something), so that the various mail servers and email clients could use a single plug-in to add spam filtering. This would allow the people who are good at spam filter code to focus on that one problem, and the people who are good at writing email plugins and GUI code can do what they are good at.
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
The url for the project is popfile.sourceforge.net
I didn't try it yet, but I will try it really soon now!
It doesn't have to be much, just 1/8 cent per email or so. That's all it would take.
If you post it, they will read.
>Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam...
Spoken like a true geek.
Spicy SPAMBURGER
Ingredients:
1 (12-ounce) can SPAM® luncheon meat, cut into 4 slices
1 green pepper, cut into thin strips
1 small onion, thinly sliced
½ cup MIRACLE WHIP or MIRACLE WHIP LIGHT Salad Dressing
½ teaspoon ground red pepper
4 hamburger buns, split
Lettuce and tomato slices (optional)
Instructions:
Cook SPAM®, green peppers and onions in large skillet 5 minutes, stirring occasionally. Mix MIRACLE WHIP and red pepper. Spread evenly on hamburger buns. Place SPAM®, peppers and onions on bottom halves of buns. If desired, top with lettuce and tomato and cover with top halves of buns. Makes 4 sandwiches.
***
I lied. I hate spam of any kind. Bravo Anti-spammers.
-Eric
OK, signal and noise. What if the signal was all in one frequency band and the noise all in another. Problem separating them? No.
What if, in effect, a similar distinction held for spam in the transmission channel - that spam by itself selected a pathway to the recipient that was never used by the signal? Block that pathway and the spam never gets through.
Spam doesn't select a pathway but spammers do. If you could block relay spam at the open relays it would be dead. You can't, of course - the open relays are controlled by people who don't know the need to block spam. You know that, I know that. If you can't change the people then change the open relays (from the spammers' points of view.) Set up a system that looks like an open relay and stop the spam. An open relay honeypot.
I asked an operator of such a honeypot how he did last year:
> How did 2002 end?
From March 7 to December 26 2002, the total was:
235,624,232
Using one Pentium 90 he stopped spam to 235 million recipients. Think about that number when you see filter people reporting what they stop just for their own domains. This was spam to recipients all over, not simply to the honeypot operator's domain: he operates at the relay level. He stopped 100% of the spam, no deception deceived him, no tuning was needed, no valid email was caught - it is perfect filtering. Perfect filtering - who else has that?
And you can do it at home on your DSL or cable connection (the guy above uses sendmail -bd, but Windows users have a program they can use):
http://jackpot.uk.net/
Yeah, I know, spammers are switching to open proxies. So, write an open proxy honeypot. That, too, will be 100% efficient. In addition you now are giving spammers reason to fear every open relay and every open proxy they detect. FEAR. The SPAMMERS have to scramble. They have to scramble and they have to show everything they do to overcome the technique - there is no stealth way to look for open relays and open proxies.
The problem is solved, it is a matter of implementation and of getting active systems everywhere in the net space (so there's no safe IP space for the spammers anywhere.)
Remember: A single Pentium 90, 235 million spam messages stopped in 10 months.
Why are you posting Anonymous Coward? Are you afraid someone will post your email to a few spam lists. :)
For those coming late to the story, Joel Spolsky demonstrated in his well-known column [joelonsoftware.com] recently that Bayesian filtering of spam is an intractable problem.
Where? There's no mention of Bayesian anything on that page. The closest thing I can see is "Bad Spam Filters," which is about a different kind of filter.
I hope you all realize that at best you're buying time, not solving the Spam problem. It won't take long for these guys to find ways through the filter.
The problems need to be solved on a different level. The problem is not the messages themselves, it's that people are allowed to send these messages to anybody they want without any real challenges as to their authenticity.
Let me explain how I have things set up right now, and hopefully my stance on this issue will be a little clearer. All my messages come into the same mailbox. I have a bunch of email aliases, though. If I sign up for Slashdot, for example, then I create a new alias like 'slashdot@insertdomainnamehere.com'. I then add that email address into my 'email allowed' list so that it gets funneled through into a visible folder. If that address gets abused, I shut down the email alias.
My personal friends are treated a little differently. Once they email me, I add their address into my list of friends, and they get put into a friends folder. I treat this differently than a registration place because my friends all need one address to contact me at, I don't mind them sharing it with each other. If my address changes, then their messages still get through.
I plan on going farther down the road. I'm going to give people an email address, and when they email it they get an automated message with instructions on how to 'request permission' to send me email. When permission is granted, they don't get that message anymore. It basically means that the only messages that get through to me are the ones that have a human behind them to read the response and then go through the proper channels to reach me.
I'm not claiming to have done anything new here. I'm basically mimicking the way IM works, and I'm doing it without having to do anything real fancy. Outlook's Rules Wizard is doing quite a bit of the work here. But since people actually have to take the time to request my authorization, it means that it's a message meant for ME as opposed to a message meant for anybody who's out there. With an approach like this, it'd be a lot harder for spammers to get through.
You don't speak for everyone. On the contrary, I think that most people realize that e-mail delivery isn't guaranteed - and therefore they expect that truly vital messages will need to be backed up with a phone call or some other means, to be sure the message was delivered.
I would prefer to lose one or two legitimate mails in return for a virtually zero rate of missed detections.
Sean
It doesn't catch all the spam, and it occasionally has a false positive. This will be true of any spam filter we implement, because spam continues to change. SpamAssassin runs on some of the mailservers I connect to, but it tends to perform worse than Mail.app. So until we can get each user's spam filter customized at the server, spam identification is going to have to stay client-based. It sounds like Paul Graham's tools are getting a little more efficient, but does any of this make a big difference for the end user?
Can't you see that everyone is buying station wagons?
http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
...until the email server at work got hacked and someone stole the entire address list. Since then, all of us have been getting spam by the bucketloads. And since I depend on people being able to get my current work address, I can't change it. Thank God for SpamAssassin!
-Looking for a job as a materials chemist or multivariat
It's all fine and dandy to have a spamtrap account if you never plan to read it, but what if you want to get online bank statement notifications or other important notices? I just noticed my friendly credit card company (Capital One) took it upon themselves to introduce my previously spam-free e-mail account to their business partners so they could introduce me to the wonderful world of buying fucking flowers for Valentine's Day. Thanks a lot, assholes. And no, they have NO option to opt out of this fucking crap. The spam is posted from the same address as the statement notifications with a friendly disclaimer saying they're not in any way affiliated. Nice.
Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
Pigs would be pretty rare if we never killed them.
I think you are wrong.
Bayesian filtering is merely using a statistically optimized method to duplicate the classification of the user for which it is working. If trained on enough of YOUR email, it will work exceedingly well in classifying YOUR future email.
Put another way, I tried blacklisting filtering, and fixed token filtering, and performance was pretty poor. In contrast, I am quite happy with bogofilter's performance. But, of the various methods, only Bayesian filtering takes the preferences of the individual user as its primary basis for sorting email.
BTW, your link is pretty much useless in showing why Spolsky may or may not think Bayesian methods are intractable. He more or less just rants that draconian MTA-based filters are doing harm - I agree with him. But the word Bayesian doesn't even appear on the page to which you linked. And that makes you a Troll.
... is "0D". Some HTML editor out there, apparently only used by spammers, encodes its output with an ASCII "0D" at the end of each line. These spams get the highest scores I've seen.
People who disagree with you are not automatically evil, greedy, or stupid.
"Who knows, maybe Joel's wrong."
Has he ever been wrong before? I have read a few of his writings and I wasn't impressed. It's not like he is some sort of a deity or something.
I am curious as to why you would cite him as some sort of an infallible source.
War is necrophilia.
I host several domains as a hobby for my family. Recently my ip address made it into a listing on spews.org. Am I a spammer? By no means. Am I screwed? Absolutely. After reading spamming newsgroups I found that I am not alone. At first I was just getting blocked because I was sending mail ( my own smtp server ) from a "known" spamming source when in fact I'm not a source of spam. My IP happens to fall into a larger block of ip's that my ISP owns, some of which are sources of spam.
This was a minor setback, but now other services are starting to use bulk email sources as deny lists for their offerings. My free DNS provider, zoneedit, now prohibits me from adding / modifying any of my zones. This is simply not acceptable to me. The way spews is set up, it is not easy for my IP to get off the list. My ISP cannot just call them up and take me off. There has to be a way to avoid this, and eliminating spam at a higher level would be a good start.
Unfortunately, it might work at first, but we've seen offtopic posters and first posters evolve. Alas, they seem to be a form of semi-intelligent life and once their numbers start to dwindle you can almost bet some internet environmentalist society will crop up and declare them endangered: "where once, great herds of them swept majestically across the plains, now only a few cling to the ever encroaching egalitarian dark forces of the internet."
It's probably just easier to round them up and send them to Guantanamo.
If you want to filter spam for yourself, great. You probably appreciate all of the issues involved.
The irony is, though, that the better joe-surfer has spam filtered *for* him, the less he'll realize that it's a problem -- and the less political stink spam will have associated with it.
Why are you letting these clowns ruin our country?
I've been using qconfirm http://smarden.org/qconfirm/ and it has eliminated my spam problems completely. We have a nice web interface that our users can surf to in order to manually release any messages that may be important; otherwise they sit in the queue waiting for the sender to validate (confirm). I have also eliminated the double bounce issues that sometimes would come up. Full details on this setup are listed at the above link. Give it a try! -GG
sneakemail.com is my new way of eliminating spam.
Quit modding this down! It's the honest-to-god TRUTH!
If only I'd given a different address for each I could figure out definitively who the culprit is.
That hasn't worked for snail mail. Junk mailers don't stop sending their mail just because they have to use postage; they just up the price for their masters.
You bring up an interesting point. If everybody in the world were the sort of people on Slashdot, there would be no spam. Even at very low cost, there would be zero response, and it wouldn't be worth their effort.
The problem is the real morons. The kind who are taken in by the stupidest spam tricks, like the "future spam" he describes (nonsensical but grammatical set of English text designed to slip past Bayesian filters, followed by a URL.) What kind of a moron would click on such a URL? The kind of moron with more money than brains. (Probably not much money, but clearly zero brains.)
It would be lovely to filter out those emails before they reach the morons, but that's unfortunately impractical and illegal in the general case. Maybe we all need to subsidize a cheap ISP for morons.
1. Collect spammers scalps
2. ???
3. Profit
This is where the problem of spam will be solved: by having a web of trust between the mail servers, which sign the message in a manner that makes it easier to backtrack a message, and if these servers also do filtering, well, we kill two birds with one stone. The problems are:
- CPU intensive
- Need to look at every message
- Seeding the filter database
- Building trust with other servers
And others of course. Think about this one: what does the typical email a porn star would get look like? What we think of as spam might not be someone else's.
How would the system scale?
And what would stop a spammer from installing a server with a bogus filter database, or just signing off on each message as being legit?
Perhaps filtering based on each user's personal corpus of valid email is the only workable solution, or that spammers will kill off email as a usable means of communication.
III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIII
Why does no-one ever mention ifile? They seem to have been doing this for quite some time (since 1996?) and have a neat trick for avoiding all those boring "training" steps (you tell ifile how to classify messages by moving them into the folder you think they should be in).
Watch this Heartland Institute video
In practice, I find that bogofilter (a Bayesian spam filter) works better and requires less maintenance than SpamAssassin, which, in turn, is better than any of the anti-spam tools I had used previously (such as SpamBouncer, which I found almost useless).
i say we start our own. we are the people who dont respond to spammers, you'd think theyd be happy to remove us.
anyone with me?! start up a do not spam/call list?
You don't have a problem with spam? So no one should have a problem, right? Spam does not depend solely on you doing something stupid. Any mail you send to someone could be improperly fwd'd or even posted somewhere that you don't want it to be. From there, you're screwed.
Damn AC.
I'm very savvy about where my good email addresses go, and never had to worry about spam-- but I recently started getting Windows Messenger popups. (I disabled it shortly after getting them) But the point is that spam will continue to be a problem to anyone who uses the internet at all.
Spammers will continue to create new and ever more effective ways of bombarding us up to the point that even the most savvy of us will be unable to ignore the problem.
[cliche]If you're not part of the solution, you're part of the problem![/cliche]
Spam filters seek to classify emails as spam/nonspam based on differences in the emails. The spammers however have absolute control over the content of their emails, so such methods are doomed to a life of one-step-ahead. There is one characteristic of spam which can never be changed by the spammer: spam is computer-generated and mass-mailed. Legit emails are not.
My idea is this: The system maintains an initially empty whitelist. When mail is received from a sender not on the whitelist, autoreply with a message explaining the situation and requesting an email back whose first line or subject contains a random word or phrase from the dictionary. Human beings will grumble, respond, and get added to the whitelist. Spammers won't give your email the personal attention it needs to get past, so you remain blissfully unaware of it.
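The whitelist-plus-challenge scheme described here could be sketched roughly as follows in Python. It is a bare-bones model under assumptions: the class name ChallengeWhitelist, the action strings, and the choice to hold unanswered mail in memory are all invented for illustration, and actually sending the autoreply is left out.

```python
import secrets


class ChallengeWhitelist:
    def __init__(self, words):
        self.words = [word.lower() for word in words]  # dictionary to draw challenges from
        self.whitelist = set()                         # senders who answered a challenge
        self.pending = {}                              # sender -> (challenge word, held mail)

    def receive(self, sender, subject, body):
        """Return (action, payload) for one incoming mail."""
        sender = sender.lower()
        if sender in self.whitelist:
            return ("deliver", [(subject, body)])
        if sender not in self.pending:
            # First contact: hold the mail and ask the sender to echo a random word.
            word = secrets.choice(self.words)
            self.pending[sender] = (word, [(subject, body)])
            return ("challenge", word)
        word, held = self.pending[sender]
        lines = body.splitlines()
        first_line = lines[0] if lines else ""
        if word in subject.lower() or word in first_line.lower():
            # Correct reply: whitelist the sender and release everything that was held.
            del self.pending[sender]
            self.whitelist.add(sender)
            return ("deliver", held)
        held.append((subject, body))
        return ("hold", None)
```

A bulk mailer that never reads the autoreply stays in the pending table forever, which is the whole point of the scheme; a real version would also need to expire held mail and avoid challenging forged sender addresses.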
High-speed Road Trip (18.000KPH)
"Use another account for regular everyday things, and make sure it isn't something simple like abc123@hotmail.com. I do that and never get spam to my real accounts. This whole spam thing is way overblown."
You remind me of the guy who fixed his leaky roof by using an umbrella in his house.
...because they tell me how to drive my car.
Laws are necessary when a practice can be used to cause others harm.
I'm curious if you have any idea how many spammers that represents.
Also, isn't it easy for a spammer to work around a spam honeypot -- create a hotmail account, add it to your spam list, and verify that it did go through?
What good will just responding to spam do? You have to buy everything advertised in spams. Only terrorists fail to do that.
And you could do even better by forwarding the spam to everyone in your address book.
Does it work to send back an undeliverable message to the spam sender? Is there software to do this?
Ultimately, what it will take is filters good enough to block most of the existing crop of chicken-boners, backed by laws recognizing circumvention of the filters as a form of computer cracking. The latter would make it legally risky to develop, distribute, or use filter-busting tools (which have no conceivable legit use outside of filter-improvement research, for which a narrowly carved exception could be made).
/. If the government wants us to respect the law, it should set a better example.
Ok do the following, then maybe I'll care about your opinion: 1. Solve world hunger so tribes in Africa don't need pork to survive. 2. Find jobs for the farmers who currently raise hogs as a primary or supplemental income, jobs that require the same skills, knowledge, etc. that they have been building for generations. 3. Find me an example of a civilization that is flourishing without pork. 4. Find a place to sell grain considered feed quality for animals but not people. 5. Do the above with the following animals: cattle, chickens, fish, sheep. I apologize for the off-topic post, but people who refuse to realize that more is at stake than pig lives bother me. Additionally, anyone who says pigs are intelligent has not ever been around pigs; in addition to being intelligent, they are also quite mean.
"I am the Flail of God!" -Genghis Khan
No he didn't. He was miffed because spamcop blacklisted his provider's IP and was bouncing email from his list. He discussed bouncing spam mail vs tagging spam mail and suggested that tagging was a better solution. The word Bayesian doesn't show up anywhere on a search of his site.
Sounds great, but it's trivially detectable by trying to use the relay to mail one of your hotmail accounts.
I do take the point that many spammers are simply too dumb and lazy to do that, but I expect there's evolution in action amongst them and we can't expect that situation to last forever.
If you were blocking sigs, you wouldn't have to read this.
"discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term." This is a really good and novel approach. Along with key words like "XXX", bright colors are sure indicators of SPAM. After all, who uses bright flashing red colors in his/her daily emails?
It's curiosity. They don't know, so they click. Even warnings that clicking on random things in your email box is a bad idea don't stop it. The clicker HAS TO KNOW.
You already know. You are not curious, merely annoyed. So you don't click.
I don't see how you can educate people other than with the 'burned fingers', like the speeder that can't keep his foot off the gas pedal... until it gets him into a major wreck. A co-worker of mine has finally learned not to run email attachments after getting infected with a trojan he clicked on out of curiosity.
Despite what individuals may do, this is just part of overall human nature.
Actually, spam is a pork product, but barely. They liquefy pork byproduct and can it. I'm not kidding, the stuff is liquid when they put it in the can. Kinda disappointing, I always thought that they shot the cans right through the pig. Now if only they did either of the above to real spammers.
"I am the Flail of God!" -Genghis Khan
I went from over 500 spam a day down to about 3 or so, and I figured out that those last 3 are due to the fact that they are bypassing the filter (I have a bunch of different urls and the server that it is all hosted on also has its own name - so mail sent to that username at that host doesn't get sent through any filters, and the way that the filters are set up there - pair.com - I can't trap that particular servername).
I have been very impressed with SA and am writing scripts to track the stats even better (I love seeing what it has pulled out everyday).
So far I have had zero false positives out of about 1-2megs of mail being filtered everyday for nearly a month now.
SA has multiple different ways of searching the mail - any one of them can be easily bypassed by any given e-mail - but all of them together are really damn good at getting rid of spam.
I'm very impressed with it and how well it learns (although straight "out of the box" - or perhaps I should say "straight out of the tar.gz" - it brought me down from 500+ spam to 5-10 a day, and then I tweaked how my accounts were filtering into SA and that fixed the rest).
There are some odd things afoot now, in the Villa Straylight.
maintain a list of email address that you will accept from... there are many solutions already in place to deal with getting new people on your list. this is the only way.
MARIJUANA, SHROOMS, X: ONLINE?! - E
Just ask your ISP to either:
1) stop lumping you in with spammers
2) stop hosting known spammers
or change ISPs.
Because of where I work, I have to use Outlook Express. I know, sucks to be me. OE does have a filter setting so I can at least start putting keywords in and have mail sent to different boxes. I have found that a large share (greater than 95%) of spam sent to me is "personalized", meaning that somewhere in the spam is my name.
Co-workers, friends, family, don't call me by my name, so I add my name to the kill-filter list and most spam goes bye bye. I only wish OE had an option to kill-filter anything with HTML in it since nearly 100% of my incoming spam contains HTML, sound, images and whatnot.
I'd love to see M$ get their act together and fix OE and Outlook and include modern filtering techniques (such as discussed in the main article) but I doubt it'll ever happen.
So rise up, all ye lost ones, as one, we'll claw the clouds.
I've started seeing html mail with naughty words split with html comments. The filter tags the spam from header fields, and learns that "gra" is spammish.
The spammers are starting to jump through hoops to get around this. It must be working on a large scale.
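One way a filter could cope with words split by HTML comments is to delete the comments before tokenizing, so the pieces rejoin into the word the reader actually sees. The Python sketch below is only an illustration of that idea: the regular expressions and the function name visible_words are assumptions, and it does not handle other splitting tricks such as empty tags inside a word.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
HTML_TAG = re.compile(r"<[^>]+>")


def visible_words(html):
    """Tokenize roughly what a reader would see, ignoring comments and tags."""
    # Remove comments with nothing in their place so split words rejoin.
    without_comments = HTML_COMMENT.sub("", html)
    # Replace real tags with a space so adjacent words stay separate.
    text = HTML_TAG.sub(" ", without_comments)
    return re.findall(r"[a-z0-9']+", text.lower())
```

With this preprocessing the filter sees the whole word again instead of learning fragments like "gra".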
Pardon me for the munged subject. I can't see how to get a "greater than" symbol into Slashdot.
Anyone know if someone is working on something to combat the ridiculous amount of spam via instant messaging? I've been receiving daily spam instant messages via AIM for the past month. The messages are always from different screen names and are always advertising webcams. It's getting really annoying. The simple solution is to just block people not on my buddy list, but that's just as bad as other types of false positives.
If you have your own server or host, you can set up different aliases for various online services that require email addresses. So if you go to randomsite.com, you can create an alias called randomsite@domain.com. When you start receiving spam at that address, you simply forward everything to /dev/null. You also get evidence that randomsite.com is selling email addresses, should you wish to pursue the matter further.
Combine that with the usual safeguards, and you probably won't receive any spam. I know I don't, at least...
---
Open Source Shirts
I attend Dalhousie University here in Halifax, Canada. Our NOC implemented a spam filtering tool (I think SpamAssassin) and it marks emails that it believes to be spam by adding an "X-Is-Spam" SMTP header to the email. Anyone who wants to use this can then just add a corresponding rule in procmail. Lee
Ever try playing hostmaster or postmaster for multiple domain names (or a single one at that)? There are times when having all your DNR's or requests for DNS changes sent to abc123@ might seem like a good idea, but that just brings us back to the problem at hand.
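For anyone not using procmail, the same check is easy to do in a script. A minimal sketch, assuming the server really does add a header named "X-Is-Spam" (the poster is not sure of the tool, and the header's value is unknown, so this only tests for its presence); the function name is made up:

```python
from email import message_from_string

def is_tagged_spam(raw_message: str, header: str = "X-Is-Spam") -> bool:
    """Return True if the upstream filter added its spam marker header.

    Assumption: the mere presence of the header means "spam"; the real
    server may put a score or yes/no value in it instead.
    """
    msg = message_from_string(raw_message)
    return msg.get(header) is not None

# Hypothetical messages for illustration.
tagged = "From: a@example.com\nX-Is-Spam: yes\n\nBuy now\n"
clean = "From: b@example.com\n\nLunch tomorrow?\n"
print(is_tagged_spam(tagged), is_tagged_spam(clean))
```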
Start message with: kitten computer candy mother father pig money happy birthday yams game scanner program office printer scanner paper car automobile doing ...
... penguin telephone camel
At the end: HOT HOT RUSSIAN TEENS! [insert link here]
So how about that? That kind of stuff is already used on search engines!
No I'm not trolling.
lol
I suppose if that method works for you, then rock out with your cock out.
but for those of us that own urls and/or companies that are web facing (in that everyone out there needs to see it in order to bring in money - not just a page that your parents and your best friend visit once a month) - you need to have an email (or in many cases - many emails) that is public.
(or in my case you also have a bunch of urls that you thought were amusing)
those emails get hit by bots and then you are added to lists - not to mention that you are added to lists once you have registered a domain name.
(obviously it helps to filter out the X@domainname.com where X is not one of the valid emails for that address - many hosted companies will simply let anything through that is at that url, and spam takes advantage of that)
like someone else on here said - while your method works for you, to then wonder why it is a problem for others is naive.
In the end - I use spamassassin and it f'in rules.
There are some odd things afoot now, in the Villa Straylight.
Simply use a free account for any registration required sites / internet posting and only check it when necessary to confirm registration. Use another account for regular everyday things, and make sure it isn't something simple like abc123@hotmail.com. I do that and never get spam to my real accounts. This whole spam thing is way overblown.
Well, that won't work in a lot of cases. I can create an e-mail account on my ISP (Roadrunner) and within hours I am getting spam without having even used it. They must be allowing easy access to the account list. Free accounts are worse (hotmail, yahoo): create an account and you're guaranteed to get spam, even if you've kept the e-mail address a complete secret.
On the other hand, at work, I don't get a single piece of spam because I am careful with the address.
Zoot!
I don't think it would be nearly as complicated to check for duplicate posts...
Every person has a public and private key assigned by the govt. He sends his public key in each of the emails, along with the email hash. The recipient's mail app checks the key via a database run by the govt. The database can be very redundant since keys don't change much, just like DNS.
Now if the root key servers go down, after a latency the cache values go bad, email clients would then automatically accept all email. This would hold any sender accountable, at the price and risk of lack of privacy in say political mailing lists.
"Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
"Also, isn't it easy for a spammer to workaround a spam honeypot -- create a hotmail account, add it to your spam list, and verify that it did go through."
Yes. So far many don't (I don't know of any that do, but spammers do, eventually, stop sending to a honeypot.) Ralsky never caught on to the Moscow honeypot that was whacking him last year (I think he's the one who told Shiksaa - visit NANAE to find out who she is - that SPEWS was killing him, just at the time of the major whacking Ralsky was getting.) (Chuckle.) I looked for spammer dropbox addresses in trapped spam 3 years ago - I figured they'd use the same address every once in a while in the list of victims. I sorted the list of recipients, sorted again, removing duplicates, and compared. No differences: each victim showed up once. They could do it, they don't. Years of experience have taught them that they can test for open relays and abuse them incautiously - nobody does anything to counter them. They think they own the internet because people ignore their attempts to relay. It's easy to knock the smirk off their faces: pay attention to illicit connection attempts.
There is a project already in motion to collect all recipient addresses for honeypot-collected spam in a central location. If any address shows up too frequently then that's a suspicious address. The real problem isn't what the spammers do or could do, it is that too few people use this very simple method to wreck the spam path.
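The "shows up too frequently" check is just counting. A small sketch of the idea, not the actual project's code; the function name and the threshold of 3 are assumptions picked for illustration:

```python
from collections import Counter

def suspicious_recipients(recipients, threshold=3):
    """Return recipient addresses seen at least `threshold` times.

    `recipients` is every recipient address collected from trapped spam.
    A spammer's own dropbox address would be expected to repeat across
    runs, while ordinary victims mostly appear once.
    """
    counts = Counter(addr.strip().lower() for addr in recipients)
    return sorted(addr for addr, n in counts.items() if n >= threshold)

# Hypothetical trapped recipient list.
trapped = ["a@example.com", "drop@example.net", "b@example.com",
           "Drop@example.net", "drop@example.net"]
print(suspicious_recipients(trapped))
```

This is the same test as the sort-and-compare described above, except that it reports which addresses repeat instead of only whether any do.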
My original honeypot went down last week (I retired in 2001; I haven't really checked to see what the current managers are doing with it.) This year I only captured relay messages, delivered nothing. When it went down last week it had captured over 100 relay test messages in January. You can also go after spammers with these (and I did - no results yet to report, I'm hoping for some big results.) Spammers could detect that - but too late.
There's a sneakier version of what you suggested that the spammers could use. I won't tell them what it is.
Volume is the key - many honeypots are needed, quickly, to whack them before they adapt. Same for open proxies. It is an absolutely simple approach. You could set up Granny's system to run a honeypot and it would work, if she has a connection to a segment the spammers search for open relays. http://jackpot.uk.net/
Try Jackpot and see for yourself, if you can.
Maybe having spamtrap addresses works if you only use the internet as a personal communication medium. But what if you run an online business, and need to keep email addresses on your websites?
That's why antispam technology is important to me.
(exhales loudly as he reclines the brown chair)
Upon reading these extremely fine articles, my mind picks and dances at one particular point, and that is the SIZE of corpuses to use for the training. It seems to me, that at infinitely large bodies of training material, both spam and non-spam tokens would have equal chances of being passed or rejected. Even for large (4000) bodies of corpus, would you really want to be training with equal numbers of spam examples vs. non-spam examples? It seems to me that the filter could cycle unto itself, giving the word "the" superior priority to "mortgage", and so-on-and-so-forth such that the filter would have learned so many words -- regardless of good vs. bad -- that the filter would again (raises fist to clear throat) turn in on itself; cycle unto its own voidance.
Does anyone have any ideas on this? If I missed something from the article, such as the "weighting" system he gives to known "good" text (which I still see as being futile at large sample sizes) please inform me.
hi, I like pancakes -.-- -.-- --..
Just as every Elvis fan longs to visit Graceland, SPAM fans worldwide now have their own pilgrimage to make. In Austin, Minnesota a 16,500 square-foot SPAM Museum opened in September 2001.
Museum visitors will be welcomed to the world of SPAM luncheon meat with a variety of interactive and educational games, fun exhibits and remarkable video presentations.
try { do() || do_not(); } catch (JediException err) { yoda(err); }
The real fact of the matter is that for most people the hassle is nearly as bad as the spam! I don't want to spend the time setting up such things. And when people have set them up *for* me I get too many false positives, if only because my interests differ from theirs. Thus any filter has to be trained with user data and be trainable in an unobtrusive, easy fashion.
The only software I know of that does this is Apple's Mail program in OSX. Unfortunately the program has many limitations and annoyances. (Damn that drawer) However Apple's approach to Spam ought to be followed by all other email clients. Adding Bayesian inference to an email client is very easy. Putting it in the server is a mistake because you *can't* easily click and label an email as spam. As with unfortunately too much Open Source software, the interface has been ill conceived.
Pigs would be pretty rare if we never killed them.
Actually, there is a bit of truth to this. There are a lot of cows in America. Why? Because they taste good, or so I'm told, and they make nice clothes.
There are more trees in America now than when Columbus landed. Why? Because we can make all kinds of nifty things out of them, so people grow and sell them, and replant what they sell.
Similarly, hunting "culls the herd" a bit, controlling a population that would otherwise likely starve to death.
And, for what it's worth, I'm a vegetarian.
Thomas Galvin
I should mention that while 235 million spam messages stopped is wonderful and impressive it isn't the focus. Spammers can send relay spam because they can easily and reliably find open relays. The real goal of honeypots is to undo that - make finding open relays difficult and unreliable. Some open relays may still relay only 50 messages/day for about 1000 recipients. Setting up honeypots in the same zone with these makes the real open relays less easy to find, and that is the real goal. We need to destroy, completely, the ability of spammers to find open relays. Causing them grief using information from trapped relay test messages is another way to make finding open relays too difficult for them.
Put enough painful traps in an IP region and the spammers will have reason to not test at all in that region. Then it won't matter if there's an open relay in the region because the spammers will never look there for it and won't find it.
(Sure, secure all you can, but securing them by making them honeypots is far more powerful than just bolting the door, so to speak.)
The same discussion works for open proxies and for any other TCP/IP service they decide to abuse. Punish them if they try. Drive them away. Make not bothering you (or anyone else) the course they choose. Make them suffer if they don't choose wisely.
I use Mac OS X and Entourage. Can anyone recommend a good filter for me? I use the built in one, but it's not as accurate as it used to be. Anything that gets by, I report to SpamCop and use the Entourage Bouncer script, but I'm starting to doubt that Bouncer is any good.
...closing your eyes when someone hits you with their fists. But it's better than nothing.
Why not cut the flow of shit at the source? Why can't countries fight against spam by enforcing laws and stopping this waste of time and network resources (=spam) once and for all?
Whoever mentions "freedom of speech" gets shot in the head. You can speak all you like but you can't make me listen to your "speech". Spam is forcing me to listen.
My exception safety is -fno-exceptions.
Someone always has to post a comment to the effect that "spam would go away if you just make it illegal...blah, blah, blah...." Hmm... let's examine that, shall we? Every internet-connected country would have to outlaw spam, in theory, for this pipe dream to work. Then you would have to have a super-secret international spam police. Then magically, overnight, everyone would be so scared to send spam that it would just stop.... Ha, that's laughable. The U.S. outlawed various narcotics, and that sure as heck stopped all the drug sellers and users, didn't it? Come on. The knee-jerk reaction Americans have to any unpopular subject is always government regulation. Instead of taking action on our own, we are always inclined to be babysat, until that imposes on our freedoms; then all of a sudden we want government out of the picture. Take a few simple steps on your own and reduce, if not eliminate, your spam consumption.
A) DON'T sign up for lists that you know are junk! If you do, sign up with a throw-away account.
B) Use an email address that isn't easy for spammers to find using dictionary attacks, unlike "bob123@aol.com". Creating an address 11 characters or longer without common words will help reduce the number of spam collectors finding your address.
C) Don't post your address where it can be easily collected by mail harvesters.
D) Don't use Yahoo, MSN, and the like for email and then complain that you get spam. They are FREE accounts. How do you think they make money to keep offering those accounts? Why do you think they offer free accounts? To sell your email address to spammers!!!!!!
E) Support programmers and companies who offer definitive EULAs against selling your address and/or provide methods to block spam effectively.
F) Consider using software that sends an autoresponder to anyone you have not received email from before, requiring them to authenticate themselves before their mail can reach your box.
Put the work on spammers and not on your sys admin or yourself
Try Cloudmark's SpamNet. It's amazing. It's P2P based with almost 270,000 people right now, and it blocks about 60%-95% of my incoming spam (depending on whether or not it's made its rounds through the P2P network yet). I love it and they offer a quick and dirty plugin for Outlook 2000 and Outlook XP. Enjoy!
-Christopher Wu
http://www.christopherwu.net/
I dread to think how many (well intentioned) hours I've wasted on this problem. A part of me wants to provide a workable solution to my company, but I am reluctant to implement a system that blocks even one legitimate message. So I continue to research a better solution every time this topic appears on Slashdot. I am beginning to think I should just wait for adequate legislation.
Go somewhere random
"Maybe we all need to subsidize a cheap ISP for morons."
Good idea... we already have the cheap ISP for morons. Now who's going to kick in some money to help pay for everyone's bill?
Not bad, but I'm having better luck with SpamNet - www.cloudmark.com
Not all free emails are equal. I have a Yahoo account I haven't told people about; it has received, I think, 1 or maybe 2 spams in the 9 months I have had it. I created a Hotmail account that I haven't told people about and it has received 5-6 spams in the 1 month I have had it.
Either I'm lucky, people pick on Hotmail more, or Yahoo does what I tell it on their subscription forms.
Another program is SpamPal, which also acts as a pop proxy. It also has a plugin structure, and one of the plugins is a Bayesian filter. This is in addition to included support for using available spam blacklist stuff like SPEWS, ORDB, SpamCop and a whole bunch of other DNSBL lists (even the ability to block entire domains like .kr, .ch and so on). It's a rather cool piece of software.
-'fester
...at least in any version I've looked at, is a "language" filter. Maybe 90% of the email I receive is in Norwegian, with hardly any spam. Most of my English mail is spam, simply because I have very little legitimate mail in English. Is there any guesstimate (a la winXP's "language recognition")? By the way, that function is a major PITA for writing English references in a Norwegian paper.
Kjella
Live today, because you never know what tomorrow brings
Fraud is the cause of spam. If the FTC/FBI/Attorneys General would see Spam not as a problem of unwanted mail, but as a problem of people committing fraud on a widespread basis and enforce those laws, we'd be fine.
Chances are the fraud, deception and other similar laws are far more severe than any penalties for sending spam -- if spammers had to think about a trip to a federal, pound-me-in-the-ass prison for getting caught perpetrating interstate fraud because the government was making an effort to find and prosecute these people, they might choose a different line of work.
Spam laws in and of themselves will only lead to an awful bureaucratization of email and they won't be enforceable anyway.
This strikes me as dubious -- it naively seems AI-complete. How does Sponsky even come up with a rigorous definition of "spam"? It's a user-specific concept -- what if my Mom forwards me spam, saying "how do I get rid of this?" What if a spammer does the same thing from a one-time hotmail account -- would it get past the filter?
Nothing can be said without viddying his proof, but something smells wrong in diaperland.
Yours in Christ,
eSolutions
Well, there would be less spam, but there would still be some.
See, spam doesn't have to actually sell anything. There is some level of spam that is just spam companies (they call themselves marketing consultants) convincing people that spam really works, even if it doesn't. The company may not sell anything, but that's OK, the spam/scammer can just move on to another company, or try to convince the company they need to "run their ad longer".
I've had enough abrasive sigs. Kittens are cute and fuzzy.
It's great that so many people like you are trying to stop that disease that's growing in the Cyber World - Spam.
I know that there is still a big run to reach the time when we will receive 0 (zero) spam messages, but with your help, with the help of one government's laws, and maybe with a bit of sense from the spammers themselves, we will reach it!
KISS - Keep It Simple, Stupid!
"These people learn quick, after their servers make their way to the open relay blacklists. Just make sure it happens every time when you receive a spam that have been apparently sent through an open relay. Forward the spam to relays@ordb.org with the first line:
"Relay: IP_address"
But I am the open relay. That's my approach. I'm only "open" for the spammer relay tests - the spam gets flushed (actually usually it is archived but as far as the spammer is concerned it's the same thing.) Jackpot will stop storing spam received from a particular IP when a specified number of spam messages from that IP is reached.
You really ought to try Jackpot - it's a neat program. http://jackpot.uk.net/
What you say does very often apply for open proxies - a lot of spam I've trapped came first through an open proxy.
I'm not very tech-savvy, though I admire those who are. I hate spam, and used to get lots of it. Here are my fixes.
My ISP makes Brightmail spam filtering available to all users at no cost... if they opt in to it. All Brightmail's catches are held in a spam folder until you get round to reviewing and deleting them. It takes a couple of clicks to wipe out a dozen spams.
Anything that gets through Brightmail then is filtered through the Spamcop mail forwarding service I've set up - my ISP allows me multiple email ID's, so I don't download or read the "public" one any more. Anything that's blocked by Spamcop is ipso facto more insidious than the Brightmail harvest, so I happily punish the "clever" spammers by reporting them to their ISPs, web hosts, etc. With Spamcop's "quick reporting" option, it only takes a couple of clicks to report dozens of spammers.
Not much gets through both. If it does, I delete it. The problem's become almost invisible to me.
(I'd still kinda like my own Bayesian filter, though...)
I'd like to experiment with my own anti-spam software; to do so I'd like to be able to modify a pop/smtp proxy.
Anyone know of a decent GPL'd (BSD'd, MIT'd) pop and smtp proxy coded in C or (better) C++?
How about one that runs under MS-Windows?
Thanks.
Opinions on the Twiddler2 hand-held keyboard?
The new wave of spam, at least from what I've been seeing, has hard-to-filter topics and NO content except a linked-in image.
The image has all the data on it, in graphical format. Both words and images. It's readable to a human of course..
How does one propose the programs 'view' an image and determine if its selling you something or just a picture of someone's kids.. ?
---- Booth was a patriot ----
I wonder what the implications on the OpenSource community are going to be because of this? Details can be found here.
Don't think that a small group of dedicated individuals can't change the world. It's the only thing that ever has.
Less controversial than ISPs trashing suspected spam is ISPs trashing virus email - that almost never gets false positives, and almost nobody minds (or at least, almost nobody minds if the virus part gets deleted, if it was an attached document on a real email message.) That won't stop Good Times Hoaxes, which are wetware problems rather than software problems, and it's a much more common feature for corporate email systems (because they're usually the suckers\\\\\\\customers for Certain Popular Email Systems and Certain Popular Word Processors which make it easy to auto-execute code.)
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
This is a guy who shrugs off his false-positives as "understandable, because they were very spam-like when you look at them"
What if you are a person who deals with financial data over e-mail? What if you routinely help people with their web pages? What if you send long blocks of code?
What has been done here is to publish statistical results based on ONE MONTH of mail sent to ONLY ONE PERSON.
So are these results in any way relevant? No.
Have a banker, a programmer, a web page designer, a salesman, and someone who runs a porn site, all run this alg.
but then I guess those kinds of people don't deserve to get their legit emails, so ignoring potential datasets for these people is okay and doesnt in any way invalidate the supposed effectiveness of a filter.
-- 'The' Lord and Master Bitman On High, Master Of All
Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
This sounds like a good reason not to use a Bayesian filter. A Bayesian filter has no way of knowing how catastrophic it would be if it filtered out a real e-mail containing those words from a cute girl at my college. Unless I had received such an e-mail before, which I have not.
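For what it's worth, the 99.97% figure quoted above can be reproduced with the usual naive-Bayes combining formula, which treats the tokens as independent and starts from even prior odds. A short sketch; the function name is made up for illustration:

```python
def combine(probs):
    """Combine per-token spam probabilities p1..pn as

        (p1 * ... * pn) / (p1 * ... * pn + (1 - p1) * ... * (1 - pn))

    assuming independent tokens and equal priors for spam and non-spam.
    """
    prod = 1.0
    inv = 1.0
    for p in probs:
        prod *= p
        inv *= (1.0 - p)
    return prod / (prod + inv)

# "sex" at .97 and "sexy" at .99, as in the quoted passage.
print(f"{combine([0.97, 0.99]):.2%}")
```

Note that this only combines the evidence the filter has. A strongly non-spam token in the same message (a known sender's name, say) enters the same product and pulls the result back down.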
The shareholder is always right.
I use Pocomail as a mail client. Its filters allow me to leave all mail on the server, except mail from addresses already in my address book, plus whatever other filters I set up for mail lists, etc. It took all of five minutes to set this up, and no spam gets downloaded. I'm running POP; what the client is really doing is marking spam as already read, so it stays on the server. I peek at the server once a day to grab any good mail that got caught, and then delete the rest. Yes, I do have to take that extra step of looking at the mail on the server, but I'm only seeing the headers so it's not that big of a deal. It takes all of 30 seconds to scan the headers, and now that I've been doing this for a while, it's very clean and I rarely have anything on the server worth keeping. The Bayesian stuff is cool and all, I just don't see that it's enough of an improvement over my current system to justify installing a new app.
There are also tagged versions of Unix email clients around, which let you receive messages to yourname+tag@your-isp.net, letting you do the same with tags that you did with addresses in your own domain, but surprising numbers of humans and web-forms seem unable to use those addresses correctly. (They also don't work for me, because my email forwarder doesn't know how to translate myname+tag1@emailforwarder-domain.com into myrealname+tag@my-real-isp.com.)
Fastmail.fm has a nice intermediate version, using subdomains - tag@username.fastmail.fm is equivalent to username+tag@fastmail.fm, so you can give people human-readable tagged names and do all the same processing tricks. It's pretty limited use in their free service, but has much more flexible tools in their paid service.
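To make the two forms concrete, here is a sketch that reduces either style of tagged address to the same (user, tag, domain) triple. It only encodes the equivalence as described above; it is not Fastmail's actual implementation, and the function name and `base_domain` parameter are hypothetical:

```python
def split_tag(address: str, base_domain: str = "fastmail.fm"):
    """Return (user, tag, domain) for plus-style or subdomain-style tags.

    username+tag@fastmail.fm  -> ("username", "tag", "fastmail.fm")
    tag@username.fastmail.fm  -> ("username", "tag", "fastmail.fm")
    Untagged addresses come back with an empty tag.
    """
    local, _, domain = address.lower().partition("@")
    if "+" in local:
        user, _, tag = local.partition("+")
        return user, tag, domain
    suffix = "." + base_domain
    if domain.endswith(suffix):
        # Subdomain form: the user is the subdomain, the local part is the tag.
        user = domain[: -len(suffix)]
        return user, local, base_domain
    return local, "", domain

print(split_tag("username+tag@fastmail.fm"))
print(split_tag("tag@username.fastmail.fm"))
```

Once the tag is split out, a filter rule can key on it the same way it would key on a per-site alias.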
The other approach that helps with filter-evaders is collaborative filter nets such as Vipul's Razor or Cloudmark. Some recipients will still get stuck reading the spam, but they'll mark it so most recipients can auto-trash it.
Bill Stewart
See my web log for an idea for an ISP level filter. The basic idea is to temporarily reject messages that look like spam. Spammers cannot deal with that delay. I'm testing it right now and the results look good.
Sure they should. The mail client for Mac OS X does. It starts off in "training" mode where it only flags what it thinks is spam but keeps it in your inbox. Once you're satisfied that it's working (you train it by correcting its spam/nonspam decisions), you switch to "automatic" mode and spam goes into a Junk folder.
How to solve most of our problems: 1.Lots of nuclear plants. 2.Cure aging.
I have also seen this a lot lately. Are there any legitimate reasons to base64 encode your entire email message or are there any standard mailers that base64 encode mail? If no and no, then you should be able to tag any base64 encoded mail as spam, no?
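An alternative to flagging everything that is base64 encoded is to decode the body first and let the filter score the decoded text. A minimal sketch using Python's standard `email` module; the function name and the sample message are made up, and it only handles single-part messages:

```python
import base64
from email import message_from_string

def decoded_body(raw_message: str) -> str:
    """Return the body text with any Content-Transfer-Encoding undone."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload(decode=True)  # decodes base64 / quoted-printable
    if payload is None:
        # Multipart container: there is no single body to decode here.
        return ""
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")

# Hypothetical base64-encoded message for illustration.
encoded = base64.b64encode(b"cheap mortgage rates").decode("ascii")
raw = ("Content-Type: text/plain; charset=utf-8\n"
       "Content-Transfer-Encoding: base64\n"
       "\n" + encoded + "\n")
print(decoded_body(raw))
```

The encoding itself can still be kept as one more token for the filter to weigh, rather than used as a hard rule.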
Four fifths of all our troubles in this life would disappear if we would just sit down and keep still. -C. Coolidge
Without using filtering software.
1. Change your e-mail address and drop the old one. (This way you are starting off with a clean slate and not on any mailing lists.)
2. Make sure your ISP doesn't post or sell your e-mail address.
3. Make your e-mail address simple for people to remember but hard for a computer to crack, for example m1nam3@isp.com. Use similar methods as you would in making a password. That prevents a common-name e-mail address.
4. On your webpage, make a CGI/PHP/ASP/whatever form to send you an e-mail. When you want people to e-mail you, give them the link to that page. Make sure that there are no parameters that can make your program e-mail others, and also that your e-mail address is not listed in any of the source that is visible to the web user.
5. Only give your e-mail to people you can relatively trust. If you can't trust them, then give them a link to your webpage.
6. When filling out forms on the network asking for your e-mail, either use an alternate e-mail or read the company's privacy claims and make sure that you do not check (or uncheck) something stating that they will send you e-mail or ads.
7. Use SpamAssassin or other e-mail filtering on your system.
8. Forward all spam to ucs@ftc.gov with all the headers.
9. See if your e-mail client has an automatic bounce-back. If so, bounce the message back to the sender.
10. If you want to post your e-mail address, then I would make a graphic (JPG, PNG) of your e-mail address. That way it keeps most computers from reading it.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
One of my ISPs' implementation of SpamAssassin seems to be using it as part of their rating heuristic.
Bill Stewart
But on the other hand, we want users to believe messages like this...
So how should the stupid users know which messages to believe? Next, we'll hear the end users saying, "If it sounds too good to be true, then use WinXP."
The price of freedom is eternal litigation.
Trickier than you might think considering slashdot editors spell most words in the English language with 3 or 4 incorrect variations.
I've reviewed the replies, and I think most of them are crud. I think the real, underlying reason why so many people argue so strongly against legal solutions here and in other technical forums is that technical solutions involve giving money to technical people, while legal solutions don't. There's a lot of people who are making money fighting spam, and if the spam problem actually started to abate, then sales of anti-spam software and, more importantly, sales of anti-spam expertise would drop severely.
Vipul's Razor, Cloudmark, etc. are collaborative filters that let humans mark messages as spam and share the spam ratings, so even Spam-Of-The-Future messages that evade filterbots are likely to get caught by humans. That means that if a roughly-identical message gets sent to N people, and sneaks by their spam filterbots, the first few humans to read it send in ratings that let everybody else's filterbot kill it for them. They do some kind of hashing function to catch similar-but-not-identical messages, which is necessary because message headers will obviously be different for every recipient, but have useful information, and message bodies for different recipients may be identical, but often have some recipient-customization, like "Dear Bob" and "remove-2184242314231-Bob@spammer.com".
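I don't know what hashing Razor or Cloudmark actually use, so the following is only a toy illustration of the general idea: hash each normalized line of the body separately and compare the sets, so that one customized line ("Dear Bob") does not hide the match. All names here are invented for the sketch:

```python
import hashlib

def line_signatures(body: str) -> set:
    """Hash each non-empty line after lowercasing and collapsing whitespace."""
    sigs = set()
    for line in body.splitlines():
        norm = " ".join(line.lower().split())
        if norm:
            sigs.add(hashlib.sha1(norm.encode("utf-8")).hexdigest())
    return sigs

def similarity(a: str, b: str) -> float:
    """Fraction of line signatures shared between two bodies (Jaccard index)."""
    sa, sb = line_signatures(a), line_signatures(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical pair of messages differing only in the greeting.
to_bob = "Dear Bob\nBuy now\nCheap loans\nClick here"
to_alice = "Dear Alice\nBuy now\nCheap loans\nClick here"
print(similarity(to_bob, to_alice))
```

A shared service would only need to exchange the signatures, not the messages themselves, which also keeps recipients' mail private.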
Bill Stewart
Errrrrr, no.
What do you think these ill-fated populations did before we were around to so skillfully manage their populations for them?
Has anyone found a Bayesian filter that not only redirects spam into a spam folder but also sorts its history of redirected mail into a probability list, so that it's easy to check the mails that were close to being accepted?
Of the 4 programs I just looked at, none mentioned this feature but pretty much everyone complains about periodically having to scan their 'spam' folder for false +ves, and a history sorted into probability would make that easier.
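If a filter keeps the score it assigned to each redirected message, the review list is a one-line sort. A sketch assuming the quarantine is available as (subject, spam probability) pairs, which is a made-up representation rather than any particular program's format:

```python
def review_order(quarantined):
    """Sort quarantined mail so the least spam-like messages come first.

    `quarantined` is an iterable of (subject, spam_probability) pairs.
    Messages near the cutoff are the likeliest false positives, so they
    go to the top of the list.
    """
    return sorted(quarantined, key=lambda item: item[1])

# Hypothetical spam folder contents.
folder = [("HOT DEALS", 0.999), ("Re: invoice", 0.91), ("Free loans", 0.95)]
for subject, prob in review_order(folder):
    print(f"{prob:.3f}  {subject}")
```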
Stemmo
If you're using Outlook Express instead of Full-Scale More Expensive Outlook, you're probably fetching the mail using POP3 or IMAP instead of the MS Exchange proprietary protocols. If so, there are filter programs that you can run on your own PC that proxy POP3 from a server, so you can tell your Outlook Express that your email server is 127.0.0.1:pop3, and the proxy fetches it from mail.example.com for you, which gives it a hook to hang filtering tools on. There are probably similar filters for IMAP by now.
Bill Stewart
If we can write software that automatically filters spam, can we write some that can give us the names and addresses of those sending it, so that they can be punished under the law?
Before we were around to manage the population they had all of the land to roam free. Now we are taking up all of their habitat, and yes, if we don't keep the herd in control with hunting they will overpopulate and starve to death. I mean, look at Africa: everything is either city, private land or national park, and they have to cull the herds of elephants to keep the population from decimating the entire landscape.
"Please proceed to grab your ankles. The anal injection process with proceed in 5, 4, 3, 2, 1...... WHOS YOUR DADDY!!!
I understand what you are saying about the constitutionality of outlawing SPAM but I'm not sure that I buy into that argument. For sure it would be argued at the Supreme Court level.
The reason I say that I don't buy into it is that there are examples where government can limit our rights. We have the right to bear arms but if I walk down the street with a bazooka I could be arrested. We have the right to travel and move about but I still need to have a driver's license to drive a car and I can forget about driving a tank down main street.
So, although spammers have the right to free speech, outlawing SPAM probably does not violate that right. We have the right to free speech but we are not allowed to yell "FIRE!" in a movie theatre.
My State has anti-spamming laws but they focus on deception and fraud. The Spammer must not fake his return address and the subject line must not be deceptive. Things like that. This law has already been challenged up to the State Supreme Court and has been held as constitutional.
The race isn't always to the swift... but that's the way to bet!
If there are a small number of honeypots, yes, it's easy to stop them. But what if everybody had one, or at least everybody who didn't need a real SMTP server? All those cable modems out there, which aren't allowed to run servers because of blazingly clueless policies by cable companies, could be running honeypots, especially teergruben, which run valid SMTP vvvv...eeee..rrr..yyyy...sssss...loooowwwww....lyyyyy and can keep spammers tied up for long times. They don't usually look like open relays; they usually look like end users. Cable modem companies could be heroes by having people running the things. (And if spammers respond to this by not sending email to domains hosted on cable modems, that's a big win too....)
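For anyone wondering what a teergrube boils down to, here is a minimal sketch in Python, assuming asyncio and a made-up delay. The names drip_chunks, drip, handle_client and main, the port 2525 and the greeting host are all hypothetical, and this is not a complete SMTP implementation (it answers DATA with a plain 250, for instance). It only shows the core trick: every reply is well formed but dribbled out a byte at a time.

```python
import asyncio

DELAY_SECONDS = 2.0  # assumed pause between chunks; a real teergrube would tune this


def drip_chunks(reply: str, size: int = 1):
    """Split one SMTP reply line (plus CRLF) into tiny chunks for slow sending."""
    data = reply.encode("ascii") + b"\r\n"
    return [data[i:i + size] for i in range(0, len(data), size)]


async def drip(writer, reply: str, delay: float = DELAY_SECONDS):
    """Send a well-formed reply one chunk at a time, pausing after each chunk."""
    for chunk in drip_chunks(reply):
        writer.write(chunk)
        await writer.drain()
        await asyncio.sleep(delay)


async def handle_client(reader, writer):
    """Keep the client busy: greet slowly, then answer every command slowly."""
    try:
        await drip(writer, "220 mail.example.com ESMTP")
        while True:
            line = await reader.readline()
            if not line:
                break  # client hung up
            if line.strip().upper().startswith(b"QUIT"):
                await drip(writer, "221 Bye")
                break
            # Simplification: a real server would reply 354 to DATA, etc.
            await drip(writer, "250 OK")
    except ConnectionError:
        pass
    finally:
        writer.close()


async def main(host: str = "0.0.0.0", port: int = 2525):
    """Listen on an unprivileged port (hypothetical choice) and tarpit everyone."""
    server = await asyncio.start_server(handle_client, host, port)
    async with server:
        await server.serve_forever()
```

To try it, call asyncio.run(main()) and connect with telnet to port 2525. Each connection costs the tarpit almost nothing, while the sender sits waiting on every reply.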
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
I would be more than happy to hog-tie one of the spammers to try it. I even have the potato gun we can use....:p
"Please proceed to grab your ankles. The anal injection process with proceed in 5, 4, 3, 2, 1...... WHOS YOUR DADDY!!!
I think I'm relatively Net-savvy, but I don't have the facilities to run my own filters as I use a commercial webmail service. Remember that Net-Savvy isn't the same as sysadmin.
Cheers, Paul
If you're not interested in procmail, check out TMDA or ASK (active spam killer), both of which are also at sourceforge.
Whitelists work really well. The Nigerian bank scammers are the only ones who actually read their return mail, so I see one of those every now and then, but that's it.
Build stuff. Stuff that walks, stuff that rolls, whatever.
"All those cable modems out there, which aren't allowed to run servers because of blazingly clueless policies by cable companies, could be running honeypots, especially teergruben [tuxedo.org], which run valid SMTP vvvv...eeee..rrr..yyyy...sssss...loooowwwww....lyy yyy"
Jackpot has a tarpit value, or tarpitting can be disabled. Jackpot will run on Windows systems if a JVM is installed. Those wanting to do this to screw spammers hitting their cable/DSL connections should be quite pleased at the results.
http://jackpot.uk.net/
Incidentally, I push Jackpot but it is another person's project - Jack Cleaver. It's a very fine piece of software.
Get a spam detection system working at the sendmail level that specifically sees when a bulk emailing is happening, makes the sender think it is accepting the messages, and silently throws them all away. Legitimate 1-5 emails from legitimate users go through fine, but make sendmail by default look like an open relay and act like it is sending everything the spammer is flinging at it, while never actually doing so. This way it is impossible for the spammer to figure out what is a real server, what is a honeypot, etc., and they will simply quit.
If everything looks like a real deal, it makes it darn hard to see what is real and what is fake.
There are NO legitimate uses for a mass emailing through an open relay...
It was from Microsoft. It said "Fight spam with MSN 8!"
Maybe someone can explain why an on-topic question gets a -1 Overrated?
Opinions on the Twiddler2 hand-held keyboard?
SPAM® luncheon meat ingredients: Chopped pork shoulder meat with ham meat added, salt, water, sugar, sodium nitrite.
Why not look up the IPs in the email in any of the several open relay databases and use the results as virtual words?
I have a program that filters solely based on the IPs in the headers, and it catches most of my spam with very few false positives. That's without even doing any content based filtering.
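Here is a rough sketch of how the "virtual words" idea might look in Python. It is not the parent's program. The zone name dnsbl.example.org and the function names are placeholders I made up, and the regex only handles IPv4. The one piece that is standard is the lookup convention: reverse the octets, append the list's zone, and treat "an A record exists" as "listed".

```python
import ipaddress
import re
import socket

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def header_ips(received_headers):
    """Collect distinct non-private IPv4 addresses found in Received: values."""
    found = []
    for header in received_headers:
        for text in IPV4_RE.findall(header):
            try:
                ip = ipaddress.IPv4Address(text)
            except ValueError:
                continue  # e.g. 999.1.1.1 matches the regex but is not an address
            if not ip.is_private and text not in found:
                found.append(text)
    return found


def dnsbl_query_name(ip, zone):
    """Build the DNSBL lookup name: octets reversed, then the list's zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def is_listed(ip, zone):
    """True if the list returns an A record for this address (needs network)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False


def virtual_words(received_headers, zones, lookup=is_listed):
    """Turn DNSBL hits into tokens a Bayesian filter can count like words."""
    tokens = []
    for ip in header_ips(received_headers):
        for zone in zones:
            if lookup(ip, zone):
                tokens.append(f"dnsbl:{zone}")
    return tokens
```

The tokens would simply be appended to the message's word list before training or scoring, so the filter learns for itself how much weight each list deserves. The lookup argument is there so the DNS call can be swapped out for testing.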
WWJD? JWRTFA!
Check out the newest Mozilla alpha release. The e-mail client has a generic filtering plugin API, with currently one plugin developed: a Bayesian filter.
So the next Mozilla version will have it.
Computers. You can't live with them, you can't live without them.
http://msn.com.com/2100-1106-981177.html Apparently there is an upcoming conference at...MIT? Unfortunately there is not too much info in this article.
if you don't feel better tomorrow, we'll just cut your legs off about here. - Theodoric of York
The cost of sending email is generally substantially less than the cost of receiving email. And this is one of the reasons the spam problem is so pervasive; it's so cheap to spam, and costs the victims so much. The cost of spam is the cost of receiving email, plus the cost of having to deal with junk you don't want. What filtering does is reduce or eliminate the personal cost of dealing with junk in your mailbox. But that still means your mail servers are dealing with more than twice the number of network connections, and twice the number of SMTPD processes, plus the added cost of applying the Bayesian filtering (which can't be less than the cost of receiving and queueing mail because it has to open it up to apply the test).
Sure, I would rather not have to wear out my 'd' key, and filtering can save it. But the real costs of spam are at the servers having to deal with all those spam connections that continue to happen despite the fact that spam isn't being accepted. I currently refuse spam using SPEWS and several other DNSBLs, and this means the spammer gets a 5XX refusal code, so they know it doesn't work here. Yet they don't clean their lists (there are over 200 email addresses here being spammed regularly that have never even existed). They keep on spamming. They keep on using my bandwidth. They keep on using CPU cycles, virtual RAM, and swap space. They keep on costing me money because I have to add more mail server capacity sooner than I should have to.
At least by refusing the mail to begin with my costs are lower. I'm looking to some solutions where I can have huge lists of IP addresses I refuse IP layer traffic from, next, to further reduce the costs.
And efforts by certain anti-spam groups, which get labeled as "collateral damage", are in fact working at some ISPs to get spammers shut down and kicked out. If these methods were used by everyone, then every ISP would be forced to eject spammers, and then we'd finally see the spam levels going down. And as the ISPs clean up, their address space drops from being listed, and the "collateral damage" (we (TINW) call it "peer pressure") itself will be reduced, too.
Bayesian filtering sounds like a nice idea. It's just not trying to solve the right problem, which is the total cost of spam.
now we need to go OSS in diesel cars
No. My problem's with the senders, not the messages. What Hotmail should do is send back an email saying "Your message has been rejected because you have not been authorized by this user. If you'd like to request authorization, click here and follow the instructions."
This works great unless you, like most people, register for service at some website that needs to send you email (a confirmation, a receipt, a password, whatever). Most of these emails are automated, and your "you are not authorized" email is going to hit a blackhole in most cases.
To achieve this we need a project that has a defined module interface. The base of the plugin wouldn't do anything other than provide module integration and possibly a user interface. Each module would be made by people or groups as needed and would integrate completely into the pre-existing framework. There would be modules for blacklists, modules for whitelists, modules for Bayesian filtering, modules for rules-based filtering, modules for filtering encrypted messages or checking signed ones. You name it, and if it doesn't exist you can just write it yourself, or contract out for it, because there is a list of standards that will allow it to interact nicely. Corporations should like the idea too: it gives them a nice free tool that they can modify as needed. Perhaps some might even aid in its development.
Just my rant. Oh, and if someone makes this a reality, remember you were inspired by Anonymous Coward!
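To make the module idea concrete, here is one possible shape for such an interface, sketched in Python. Everything here is hypothetical (FilterModule, WhitelistModule, KeywordModule, FilterFramework, and the "first module with an opinion wins" rule); no existing project is being described. The base does nothing but run modules in order.

```python
from abc import ABC, abstractmethod
from email.message import Message
from typing import List, Optional, Tuple


class FilterModule(ABC):
    """Hypothetical module interface: each module gives a verdict on a message."""

    name = "unnamed"

    @abstractmethod
    def check(self, msg: Message) -> Optional[bool]:
        """Return True for spam, False for wanted mail, None for no opinion."""


class WhitelistModule(FilterModule):
    name = "whitelist"

    def __init__(self, senders):
        self.senders = {s.lower() for s in senders}

    def check(self, msg):
        sender = (msg.get("From") or "").lower()
        return False if sender in self.senders else None


class KeywordModule(FilterModule):
    """Stand-in for a rules-based module; a Bayesian one would plug in the same way."""

    name = "keywords"

    def __init__(self, words):
        self.words = [w.lower() for w in words]

    def check(self, msg):
        subject = (msg.get("Subject") or "").lower()
        return True if any(w in subject for w in self.words) else None


class FilterFramework:
    """The base: no filtering logic of its own, it only integrates modules."""

    def __init__(self, modules: List[FilterModule]):
        self.modules = list(modules)

    def classify(self, msg: Message) -> Tuple[bool, str]:
        """Ask modules in order; the first one with an opinion decides."""
        for module in self.modules:
            verdict = module.check(msg)
            if verdict is not None:
                return verdict, module.name
        return False, "default"  # nobody objected, so deliver
```

Putting the whitelist first means a known sender is never flagged by later modules. A different combining rule (voting, weighted scores) would only change classify, not the modules.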
We need to define "false positive". Differences between different tests might hinge on what the definition of a "false positive" is. And further, where and how filtering should, or should not, be deployed can also depend on what is a "false positive" for the recipients where it could be deployed.
Given that spammers totally disregard failed delivery, even if they get an SMTP response code indicating this when they do direct delivery, and do not clean up their lists, one effect is that even where the spam is refused or filtered into a junk box, the spam keeps coming. And the rate of growth is still substantial as more and more people attempt to get their cut of the pie that is there for spammers as a result of this theft of delivery resources they commit to keep their own costs down (i.e. spamming with an unmaintained list of a million is cheaper than working to clean it down to just those who actually want it).
ISPs that host spammers who steal delivery resources from recipients (and their ISPs) are just as guilty of theft as the spammers themselves, as far as I'm concerned, because they could put a stop to it, but don't because it means more revenue for themselves (and increases the level of theft the rest of us incur). So to me, my anger is not only aimed at the spammers, but also to those who support the spammers, and even to those who support those who support spammers (e.g. the other customers of the ISPs harboring spammers). So I don't want any of their mail. I don't want their servers to contact my servers at all. I don't want my servers to have to spend any time queueing and classifying their mail. And I certainly don't want it in my mailbox.
So basically, the mail from all the other customers of a bad ISP (because they harbor spammers) is also unwanted mail. If it gets rejected by whatever tool I use to block spam, then it is not a false positive at all.
IP address based tools like DNS based blacklists often provide exactly what I need. Since all the mail from places where spam comes from is what I want to refuse, by blocking the whole mail server, the effect is an excellent match. And the SPEWS DNSBL even lets me block the rest of the ISP so I can "send the message" back to their other (so called "legitimate") customers that I don't care for them to be supporting an ISP that supports spammers.
So basically, what I have now seems to do the job very well. That's because of what I happen to define as the mail I don't want to get. Others who define the mail they don't want to get differently might need to use a different approach, and maybe Bayesian filtering is right for them. It isn't right for me because it isn't really addressing the problems I have, which is that my mail servers are still being bombarded by spammers attempting to send spam. As it is now, I reject all this junk mail during the SMTP session with no reception of the message content, no queueing, and no text processing. All a Bayesian filter would do for me is increase my costs because then I'd have to receive content I already know I don't want, and process that content to make a decision I already have made. So there's nothing in it for me.
now we need to go OSS in diesel cars
Sorry this is OT, but....
Are you trying to suggest that without hunting, wild populations would die out?
As a matter of fact, as you can see in this article from the NYT, it is extremely important to manage the deer population, or else you end up with a bunch of starving deer.
My grandfather was a licensed hunter in Germany (one of the very few exceptions to their strict firearm laws), and they used very scientific methods to determine how many deer could be sustained per hectare of forest and what not, and had to cull the herd when the population increased to certain levels, and ceased hunting altogether when changes (a disease, or whatever) made the population too low.
Yes, it is a natural cycle that has been repeating for hundreds of thousands of years, but no one wants a starved deer staggering through their neighborhood trying to forage on your front lawn because there isn't enough food to go around in the region's wooded areas.
::.. check out some Cell Phone Reviews
I noticed that there is a lack of MS Outlook-specific Bayesian filtering programs out there. I've been using a Bayesian filter called Spammunition with great results. It integrates into Outlook, and it's free. Still a beta product though, so it's a little buggy. But the Bayesian approach definitely works. I don't have to deal with spam any more!
Well, I've heard a lot of things about the legal system in the US
I wonder what else might be unconstitutional in the United States if mandatory labels that people can use for filtering were. Examples might be:
- having "secret" e-mail addresses that aren't published in any place spambots can reach - by withholding that address from spambots you deny spammers their constitutional right to send anything they want to your address!
- doorlocks - doorlocks can be used for the unconstitutional purpose of denying somebody who wants to talk to you their right to do so; they make it easy to violate the constitution by keeping the door shut when someone wants to speak
- caller identification (telephones) - something that can be used for violating the constitution by not answering calls from certain people
- earplugs - terrible tools that seduce you to act unconstitutionally! Don't think you can keep the window open in summer and put something in your ears to be able to sleep despite loud people on the street, oh no, maybe these people want to yell something at you, and you're depriving them of their rights!
In my view, it's absurd to establish any connection between anti-spam laws (and even more so if it's just about labels) and the right of free speech, because the right of free speech does not mean and has never meant that everyone must sacrifice their time for listening to / receiving any kind of speech. No anti-spam law prevents people from signing up to mailing lists about low mortgages, teen porn and printer cartridges, setting up opt-in mailing lists about these topics, searching for such material on the Internet etc., so it really takes a lot of imagination to see a connection between anti-spam laws and the right of free speech. Recently, I've seen a report about people in the US being interrogated by the FBI after criticizing Bush's war politics. Maybe these were just atypical cases, but I think if you're worried about the right of free speech in the United States at all, this would be the kind of issue you'd have to look at, and certainly not one of many possible channels of advertising (one that costs people particularly much time whether they want it or not) being restricted.
The Jackpot honeypot will optionally deliver relay test messages.
Spammy thinks that he has found a real live open relay because it delivers his test messages.
You can set the minimum time between test messages and the max recipients.
Easy solution, deliver test messages.
The JackPot Honeypot does exactly that.
see www.jackpot.uk.net
Even when relaying is turned completely off, however, many spammers keep hitting my JackPots.
My jackpot farm of 13 IP addresses on 4 machines has eaten over 2 million spam in the last 50 days.
Total spams logged: 9754 Total spam recipients 26392 in the last 24 hours.
bz
I pull all the servers from the emails, and email about 30 people at each basically saying f&3k you leave me alone....It's been optimized over time and works for the most part....when a spammer gets my email addr ..its normally not for more than 3 days... before they stop....I've had quite a few spammers saying that they aren't spamming me, but that I signed up....they just get the 30 messages again....some of them even keep sending me messages saying I've been removed...and they get 30 more.....of course I have a very limited number of people I give my email address to so I only have to filter their email address, and they know not to email me if they use a different email addr. I get about 3 spams a day now....versus 70-80 before and it's only been 3 months....and of course most of the spam I get is from people who have bought my email addr from some list...but hopefully in time it'll be off the lists...
It is not necessary that spam filtering be done at the client. There is no reason that the same logic could not be used on the server as well.
.....
Consider the Following:
MailServer has a "Spam-User@yoursmallco.com" and a "nonSpam-user@yoursmallco.com".
Heuristic spam filter scans inboxes for spam and prefixes "SPAM" to the subject of suspected spam.
Client filters for "SPAM"
oh-nevermind.
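For what it's worth, the tag-on-the-server, filter-on-the-client split described above is easy to sketch in Python. The function names tag_if_spam and client_rule, the folder names, and the looks_like_spam callback are all placeholders of mine; any heuristic or Bayesian check could be passed in as the callback.

```python
from email import message_from_string


def tag_if_spam(raw_message: str, looks_like_spam, prefix: str = "SPAM") -> str:
    """Server side: prefix the Subject of suspected spam, leave other mail alone."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    if looks_like_spam(msg) and not subject.startswith(prefix):
        del msg["Subject"]  # drop the old header so it is not duplicated
        msg["Subject"] = f"{prefix} {subject}".strip()
    return msg.as_string()


def client_rule(raw_message: str, prefix: str = "SPAM") -> str:
    """Client side: a plain subject rule decides which folder the mail lands in."""
    subject = message_from_string(raw_message).get("Subject", "")
    return "Junk" if subject.startswith(prefix) else "Inbox"
```

The point of the split is that the client needs nothing smarter than a subject rule, which nearly every mail program already has.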
This isn't science, that much is certain.
An irreproducible result is noise. Not
only is Graham not releasing his code or
data set, he's not providing enough information
to reproduce the algorithms precisely enough
to evaluate their performance on independently
gathered data.
In short, this is marketing, not research.
-I like my women like I like my tea: green-
One of the references is:
Bill Yerazunis. ``Sparse Binary Polynomial Hash Message Filtering and The CRM114 Discriminator.''
Anyone else recognize the reference to Dr Strangelove? I love that movie!
The thing I find amusing/interesting/whatever about this is that there is exactly such a drug, and that it is wildly popular.
I have used lots of spam filters including SpamAssassin, junkfilter, homegrown scripts, etc. Nothing works nearly as well as Bayesian filtering. I started with bayespam but found ESR's bogofilter better performing (it's in C as opposed to Perl, so lower startup time) and it fits more easily into my mail architecture. No false positives and many hundreds of spams caught. I have a feeling this is going to be the best spam filtering technology for some time to come. Spammers won't be able to out-evolve it. I also like the fact that I don't have to periodically update rulesets or anything because it is self-maintaining.
If you haven't tried it, check it out! You won't regret it.
This only hides the spam from the reader - it doesn't eliminate the cost to ISPs and end users to receive, store, and download it. It also doesn't hide the spam from the truly clueless who actually BUY the junk products that spam usually advertises.
The right solution is to force ISPs to shut down spammers COLD when they begin getting complaints - many ISPs *STILL* do not do this, some even accept premium payments from spammers to let them continue using their service (whether it be access to actually send the spam, or to host websites or reply-boxes so they can collect their cash from the suckers).
If anyone *really* wants to be educated on some ways that have been proven to be effective in getting (some) ISPs to clean up their act, here are some URLs:
SPEWS - Spam Prevention and Early Warning System
http://www.spews.org/
ROKSO - Who's behind your spam.
http://www.spamhaus.org/rokso/
(Note - I am not affiliated with nor do I represent either of these sites - I just happen to agree with their goals and methods)
Any questions, see news://news.admin.net-abuse.email - lurk for a week first, then post. If you don't have Usenet access or a decent newsreader, you can get there through Google Groups: http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&group=news.admin.net-abuse.email
In either case, you might want to read the FAQ first: try
http://www.samspade.org/d/nanaefaq.html
or
http://www.spamfaq.net/
I've written up a HOWTO on setting Bogofilter to work with Exim and the Cyrus IMAP daemon. Hopefully somebody will find the document useful.
This is kinda offtopic, but would it make sense? This would require some sort of a sendmail re-write...
When you send an email, a "signature" (hash/ID/whatever) is stored in a database on that outgoing server. When the receiver's sendmail gets the email, it connects to that server, verifies the signature and if there is a match, the email is forwarded to the user's account.
This would eliminate those emails with fake domains and relaying would be over as your "outgoing" server would have to keep the "signature" database. Of course, there should be an option to relay certain domains (like subdomains).
Once a "legit" spam is detected (sent from a real domain), the user could set the signature as "undesirable" and that information would be shared with real-time blocking lists. Also, other sendmails trying to match the signature would fail and the email would be discarded right away (probably after a certain threshold).
I think that if we adopt that kind of server, eventually all sendmail servers will have that option and most spam will slowly cease to exist as users won't accept emails without a proper "signature".
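A toy model of the callback scheme, in Python, to show the moving parts. This is only an in-process sketch under my own assumptions: OutgoingServer and accept are hypothetical names, SHA-256 stands in for the "signature (hash/ID/whatever)", and a real version would need a network protocol between the two servers. The "threshold" for undesirable reports is also left out; one report blocks the message here.

```python
import hashlib


class OutgoingServer:
    """Stands in for the sender's mail server and its signature database."""

    def __init__(self, domain: str):
        self.domain = domain
        self.sent = set()         # signatures of mail this server really sent
        self.undesirable = set()  # signatures recipients have reported as spam

    @staticmethod
    def signature(raw_message: bytes) -> str:
        """Assumed signature: a SHA-256 digest of the raw message."""
        return hashlib.sha256(raw_message).hexdigest()

    def send(self, raw_message: bytes) -> str:
        """Record the signature at send time and return it."""
        sig = self.signature(raw_message)
        self.sent.add(sig)
        return sig

    def verify(self, sig: str) -> bool:
        """Answer a receiver's callback: did we send this, and is it still OK?"""
        return sig in self.sent and sig not in self.undesirable

    def mark_undesirable(self, sig: str) -> None:
        """A recipient reports the message; later callbacks for it will fail."""
        self.undesirable.add(sig)


def accept(raw_message: bytes, claimed_server: OutgoingServer) -> bool:
    """Receiver side: recompute the signature and ask the claimed origin about it."""
    return claimed_server.verify(OutgoingServer.signature(raw_message))
```

A message with a forged domain fails because the claimed server never recorded its signature, and a reported message fails for every later recipient that checks.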
-- Leeeter than leet
Why do we need additional laws? If I could charge a spammer with harassment well then we have all the law we need. Last I checked every state has laws against people who harass other people.
If I tell a spammer to stop and they don't stop, I should be able to call 911 on their ass. If they don't give me the ability to say stop (if you gag a rape victim, is it suddenly not rape because they didn't say "stop"?) because they give a false return address, shouldn't I be able to press even more serious charges? By that point they're basically stalkers.
I should get one opt out attempt and after that they get hauled off to jail.
Or is your analogy just a vain attempt to gain Karma? Personally I like the idea of charging spammers with harassment. But how well would it hold up in court?
Maybe we should just stop calling it spam and looking for spam laws and just call it what it is: harassment. We could start taking names and kicking ass immediately if that were the case.
Ben
Work Safe Porn
Making spam illegal isn't going to stop spam at all. The only way we'll ever stop spam is with Vigilante Justice.
;)
Blackholing open relays, as well as cracking into spammer's computers and fucking them up something fierce will be the only real way to reduce spam.
Oh yeah, and don't forget the spam filters, SpamAssassin rules!
``Yeah, I know, spammers are switching to open proxies. So, write an open proxy honeypot. That, too, will be 100% efficient. In addition you now are giving spammers reason to fear every open relay and every open proxy they detect. FEAR. The SPAMMERS have to scramble. They have to scramble and they have to show everything they do to overcome the technique - there is no stealth way to look for open relays and open proxies.''
Do you think they care? They'll just move their tasks to net providers that take no interest in security. And if that doesn't work, look for open proxies in third-world countries, etc.
Most people who eat pork also have access to other, non-meat foods.
Most people who eat pork have incisors and are omnivorous homo-sapiens.
Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
Yes.
Seems you only seem to know this fellow if you have moderator duty.
I've been toying with the idea of forwarding all my Korean and Chinese spam (60% of the spam I receive is in those languages, another 20% is English-language but relayed via .kr or .cn servers) to their embassies. Currently, .kr and .cn ISPs are being bribed into giving spammers free rein. The Chinese and Korean governments could put a stop to that (IIRC South Korea does have spamming legislation), they just need to be made aware of the seriousness of the problem.
Sending the government a few spams won't do that, but sending them all the spam anyone receives might.
wile thiz is n interesding aproach, my eggsperienz is that sbammers will kepe comeing up wid wayz to git zee mezzage akross beekuz da gray mattah beetween yor earz is damm gud at getten zee mezzage (eefen if yor philter iznt).
ugly, but if you are hawking miracle drugs or nigerian ministerial assistance pleas dignity is kinda low on the priority list...
Yeah, poor old Indrema. I remember when MS announced the XBox not long after the Indrema was announced. Poor John Gildred. :^)
For anyone who cares much about what the Indrema was (going to be), visit my old site: Indrema Informer
-bill!
(on to more important things, like Tux Paint and the Zaurus)
No, no. Private filters are almost certainly fine. I'm saying that the government mandating labels on spam so as to achieve the effect of prohibiting that speech through private means might not be.
If you want to filter spam based upon its content sans mandated labeling, go ahead.
If governmental authority is wholly uninvolved (i.e. not even invoked by private actors), there's no first amendment issue.
-- This and all my posts are in the public domain. I am a lawyer. I am not your lawyer, and this is not legal advice.
Still, why should mandated labeling be any problem at all in connection with the First Amendment? After all, I thought the First Amendment was about free speech, not about coercion of people to receive messages.
I don't think it's the case, but if the First Amendment had the absurd consequence of restricting people's rights to decide which messages they want to see, I guess the wording of the First Amendment should be changed.
"Do you think they care? They'll just move thier tasks to net providers that take no interest in security. And if that doesn't work, look for open proxies in third-world countries, etc.."
Let's suppose the spammers, through diligent use of open relay and open proxy honeypots, are down to one last 3rd world country where they can find systems to abuse. Do we (a) cry at our misfortune or (b) try to persuade operators in that one last country to run honeypots?
As it stands they still look for open relays (and I'd guess open proxies) in the good old USA. Why not be an early example for the operators in that last 3rd world country and run a honeypot now, so they can see the advantage?
There's been some mighty fine honeypot success overseas. Moscow isn't 3rd world, of course, but that honeypot was a sensation. I don't even know where (what country) the 235-million-trapping honeypot is located. Some mighty old hardware has been used for honeypots - even stuff the 3rd world might easily have to spare. They can run Jackpot on Windows systems. If they're on the net in sufficient number to matter to the spammers then there are almost certainly sufficient resources that can be used to fight the spammers.
I invite you to try Jackpot. Just load it and start it, trap relay tests only. You may be surprised.
http://jackpot.uk.net/
Science is built up of facts, as a house is with stones. But a collection
of facts is no more a science than a heap of stones is a house.
-- Jules Henri Poincaré
- this post brought to you by the Automated Last Post Generator...