Asynchronous Programming for Spam Elimination
ttul writes "Stas Bekman (formerly the maintainer of mod_perl) has been quietly building an asynchronous programming framework to build high performance network applications in Perl. His recent Perl.com article describes how he has used the Event::Lib module (that lives on top of the popular libevent library) to write a traffic-shaping email proxy to get rid of spam. Asynchronous programming is challenging at the best of times. Read on to find out how to do it the easy way in Perl."
so they wrote an asynchronous proxy that slows down connections. cool trick, but not any kind of scalable solution. :)
the core assumption, and the only thing that makes this work, is that botnet spam software will _always_ just give up after 30 seconds; if this throttling technique ever became commonplace, spammers would just write their own asynchronous mailer -- it's not THAT hard. windows has the same kind of async networking support (either through the winsock API and/or IO completion ports, or what have you) and i'm sure the spam/botnet software authors have no qualms about holding open a couple thousand sockets on the rooted windows machine (times a few hundred thousand machines.) furthermore, i bet there are some shitty legitimate MTAs that would just give up too, causing actual mail to get discarded
(that, and they shoulda used twisted or something :) -- using a pool of apache/mod_perl instances to handle connections is grossly inefficient.)
ok, ok, maybe this sounds overly critical. it's a clever, thinking-out-of-the-box idea, but certainly not the panacea we're looking for to stop spam.
-fren
"Where are we going, and why am I in this handbasket?"
Sometimes I sits and programs, and sometimes I just sits ...
Your post advocates a
(X) technical ( ) legislative ( ) market-based ( ) vigilante
approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)
( ) Spammers can easily use it to harvest email addresses
(X) Mailing lists and other legitimate email uses would be affected
( ) No one will be able to find the guy or collect the money
( ) It is defenseless against brute force attacks
(X) It will stop spam for two weeks and then we'll be stuck with it
( ) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
( ) Requires too much cooperation from spammers
(X) Requires immediate total cooperation from everybody at once
( ) Many email users cannot afford to lose business or alienate potential employers
( ) Spammers don't care about invalid addresses in their lists
( ) Anyone could anonymously destroy anyone else's career or business
Specifically, your plan fails to account for
( ) Laws expressly prohibiting it
( ) Lack of centrally controlling authority for email
( ) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
( ) Asshats
( ) Jurisdictional problems
( ) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
(X) Huge existing software investment in SMTP
( ) Susceptibility of protocols other than SMTP to attack
( ) Willingness of users to install OS patches received by email
(X) Armies of worm riddled broadband-connected Windows boxes
(X) Eternal arms race involved in all filtering approaches
( ) Extreme profitability of spam
( ) Joe jobs and/or identity theft
( ) Technically illiterate politicians
( ) Extreme stupidity on the part of people who do business with spammers
( ) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook
and the following philosophical objections may also apply:
( ) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
( ) Any scheme based on opt-out is unacceptable
( ) SMTP headers should not be the subject of legislation
( ) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
( ) Countermeasures should not involve wire fraud or credit card fraud
(X) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually
( ) Sending email should be free
( ) Why should we have to trust you and your servers?
( ) Incompatibility with open source or open source licenses
( ) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
( ) I don't want the government reading my email
(X) Killing them that way is not slow and painful enough
Furthermore, this is what I think about you:
(X) Sorry dude, but I don't think it would work.
( ) This is a stupid idea, and you're a stupid person for suggesting it.
( ) Nice try, asshole! I'm going to find out where you live and burn your house down!
"Where are we going, and why am I in this handbasket?"
Boffoonery - downloadable Comedy Benefit for Bletchley Park
Asynchronous Programming = programming with futures
must we rename everything every time that someone "discovers" it?
This guy goes and makes it multithreaded... Great, just what we need.
Top 10 Reasons To Procrastinate
10.
Except "asynchronous programming" is already a well-known term among many web developers: Asynchronous Programming with JavaScript, HTML DOM, and XMLHttpRequest
Forget async io (completion stuff) in Windows...
They can just make the SPAM program multithreaded and start a new thread for each new connection (each using *synchronous* IO).
There's no interprocess communication involved, so it should be trivial.
Isn't that an oxymoron?
The article is correct - mail servers do not mind waiting a few minutes/hours/days to deliver their mail. Unfortunately, end-users do mind. The inherent delays for just about every message would be particularly painful for business email users, but even residential ISP customers are constantly opening tickets when they observe a delay (I work closely with several large ISPs, which is how I know).
Delays aside, I just can't buy into network-layer rate limiting when it comes to email. The metric for anti-spam success is measured in "messages" (or more accurately, "recipients"). Nobody ever calls their local email admin to say, "hey, I've received 1.3 megabytes of spam this week, what gives?"; instead, the problem is always quantified by the number of individual messages the end user had to look at and consider before deciding what to do.
Because of this, rate-limiting should be done per-recipient. That way, there's no question what a particular sender is going to get through. Once they pass the limit you've specified for their class of IP (known mail server, dynamic IP, etc) during whatever timeframe, they receive an SMTP 4xx error until that timeframe is up. That still slows them down, but you can't get around it with smaller messages, etc.
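To make that concrete, here is a rough Python sketch of the kind of per-recipient limiter I have in mind. Everything in it is an assumption for illustration: the class name `RecipientRateLimiter`, the IP class labels, the limits, the one-hour window, and the choice of 451 as the 4xx reply. It counts accepted recipients per sender IP in a fixed window and answers with a temporary failure once the limit for that IP's class is used up.

```python
import time

# Hypothetical limits: recipients accepted per window for each class of IP.
LIMITS = {"known_mail_server": 1000, "dynamic_ip": 10}
WINDOW_SECONDS = 3600  # assumed timeframe


class RecipientRateLimiter:
    """Fixed-window counter of accepted recipients, keyed by sender IP."""

    def __init__(self, limits=LIMITS, window=WINDOW_SECONDS, clock=time.monotonic):
        self.limits = limits
        self.window = window
        self.clock = clock
        self.state = {}  # sender IP -> (window start time, recipients accepted)

    def check_rcpt(self, sender_ip, ip_class):
        """Return the SMTP reply code to send for one RCPT TO command."""
        now = self.clock()
        start, count = self.state.get(sender_ip, (now, 0))
        if now - start >= self.window:
            # The timeframe is up, so the sender gets a fresh allowance.
            start, count = now, 0
        if count >= self.limits[ip_class]:
            self.state[sender_ip] = (start, count)
            return 451  # temporary failure: a real MTA will retry later
        self.state[sender_ip] = (start, count + 1)
        return 250  # recipient accepted
```

Message size never enters into it, so sending smaller messages buys the sender nothing: only the recipient count matters.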
Got on Vans but they look like sneakers!!!
[full disclosure and shillery alert: I work with Stas at MailChannels] :)
You make some very good points -- and these are all concerns we had when we set out to build this software.
Fortunately for the world, these concerns have turned out to be unwarranted. Furthermore, our experience in actually deploying this technology has been far more breathtaking than we had imagined -- both in terms of spam mitigation and improvements in scalability.
> the core assumption, and the only thing that makes this work, is that botnet spam software will _always_ just
> give up after 30 seconds;
I have a theory that spammers will always be impatient. I believe this theory for several reasons:
1. Spam campaigns are now recognized by anti-spam companies in minutes or hours. New campaigns therefore have a very short life expectancy and have to be completed as fast as possible. If mail can't get delivered fast, it's time to move on to a new domain to get it moving again. With collaborative filters like Cloudmark recognizing campaigns in less than 60 seconds, spammers obviously have to move traffic fast.
2. Botnets are not unlimited in their size or bandwidth capacity. Typically botnets these days are between 1,000 and 10,000 hosts. Any larger and the command and control channels are very quickly noticed and shut down by service providers. Botnets cost money too -- $250/hour for a 10K botnet is typical.
3. Spammers' raison d'etre is to send lots of mail and hope that a small percentage of recipients buy something. The only way to make the business profitable is to send huge amounts of mail. If all zombie traffic in the world was magically being slowed down, spamming would no longer be profitable and spammers would tend to focus more on things like highly targeted phishing instead. Not surprisingly, we're already starting to see this.
4. Because #3 isn't going to happen any time soon, and in light of the technical constraints (1 and 2), spammers have no choice but to abort their connections within a very short time frame. It's just the nature of the economic beast. Hanging on is just for posterity. It doesn't make economic sense.
5. It works. And it's very very scalable. By slowing down traffic and multiplexing what remains, mail server load drops by 90%. In big installations, that means no more being paged in the middle of the night because your cluster of 4-way Xeons with 8GB of RAM is borked by a distributed spam burst.
Oh -- and of course you can't just slow everything down. It's important to be very selective so as not to delay everything.
> if this throttling technique ever became commonplace, spammers would just write their
> own asynchronous mailer -- it's not THAT hard...
Actually, it is that hard. Even Stas got a headache working on this project.
But even if it was easy, it would be pointless for a spammer to launch more than one connection per zombie. If a sender is marked as suspicious, the sender's concurrency is severely limited. One connection per zombie, at 5 bytes per second -- that's just not economic.
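For anyone wondering what "5 bytes per second" looks like in code, here is a minimal Python asyncio sketch of slow-dripping bytes to a suspicious peer. To be clear, this is not our Perl/Event::Lib code; the function name `drip`, its parameters, and the injectable `sleep` are all made up for illustration.

```python
import asyncio


async def drip(data, write, bytes_per_second=5.0, chunk=1, sleep=asyncio.sleep):
    """Hand data to write() in small pieces, pausing after each one so the
    average rate stays near bytes_per_second. Returns the bytes written.

    write is any callable that accepts bytes, for example the write method
    of an asyncio.StreamWriter. sleep is injectable so the pacing can be
    tested without actually waiting.
    """
    delay = chunk / bytes_per_second
    sent = 0
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        write(piece)
        sent += len(piece)
        # Yield to the event loop; other connections are served meanwhile.
        await sleep(delay)
    return sent
```

Because the coroutine only sleeps between writes, one process can hold many throttled connections open at almost no cost. At 5 bytes per second, a sender that hangs up after 30 seconds has moved about 150 bytes.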
> furthermore, i bet there are some shitty legitimate MTAs that would just give up too, causing actual
> mail to get discarded
Let's just say the gap between the patience of spammers and the patience of legitimate MTAs is very large indeed. And by carefully fingerprinting and assessing sender reputation, this problem can be minimized to the point where it is a far smaller problem than content filter false positives.
I also want to point out that this technology does not make email suck by slowing it down. It in fact speeds up delivery of legitimate mail in most cases because the load is so reduced on the rest of the infrastructure.
Just talk to our customers. One of them was running four 4-way Xeon boxes with 8GB of RAM each -- all this to service the spam filtering needs of just 10,000 end users. He told us he hadn't slept a full night in months because of load-based outages. Since installing the software Stas built, the only alert he's received is a notification that the load level dropped below the panic threshold!
Reading TFA, I couldn't help but notice that everything looks like a nail to these guys. I know this is the mod_perl author but did it ever occur to him that Perl and Apache (APR) are the wrong tools for this job? Doing this in (eg) Erlang would have negated any threading or concurrency issues, but I digress; doing it in Perl may be cool in a geeky masochistic kind of way that I cannot relate to.
<sarcasm>
Now I must return to writing my device driver, I've never really done any kernel hacking before and I'm a web designer so I'm doing it all in javascript.
</sarcasm>
Specifically, the article did take into account botnets, and they're just forcing good SMTP compliance. It shouldn't affect well-designed mailing list software, nor would it sabotage public networks.
So yes, there is the whole issue of the arms race -- people will just correct their botnets to handle this quirk. But your other categorizations are grossly unfair.
Build it, and they will come^Hplain.
Most if not all mail transfer agents no longer operate as open relays by default, a problem which used to be the main contribution to spam. People blamed the complexity of Sendmail for that and other problems, so many distros moved to other mail transfer agents for their default. A few years ago Sendmail was still about 65% of the mail servers.
What is Sendmail's current market share, and how common are others like Exim, qmail, and Postfix?
Beta is broken and the link to classic doesn't work. Stop wasting our time or there won't be anybody left here.
Perl is good for scripting but 24/7 high performance apps?
Don't make me laugh. Something this CPU and I/O intensive should
be written in C/C++ or even assembler at a push, not a scripting
language. Seems to me this project has been written in Perl for
the sake of writing it in Perl, not because it confers any
advantages over doing it in a lower level language.
We implemented greylisting. It is the answer. I watch as tens of thousands of emails per day are bounced away into oblivion. At first, ham had to wait a while, but now that the database is built, no one waits anymore. Not only that, server CPU is negligible because SpamAssassin doesn't run on resent mail that has been marked as ham. Combine this with a few scripts that do some basic purging of spam addresses from the database, and we're good to go. Let's not reinvent the wheel. Why don't we just build greylisting right into the SMTP protocol? And while we're at it, let's build encryption in too -- feeling challenged?
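For those who haven't seen greylisting, the core logic is tiny. Below is a bare-bones Python sketch, not our production setup: the class name `Greylist`, the five-minute minimum delay, the in-memory storage, and the 451 reply are all assumptions for illustration. The first time a (client IP, sender, recipient) triplet shows up it gets a temporary failure; a real MTA retries later and is let through from then on.

```python
import time


class Greylist:
    """In-memory greylist keyed on (client IP, envelope sender, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # assumed: seconds a new triplet must wait
        self.clock = clock
        self.first_seen = {}        # triplet -> time of first delivery attempt
        self.passed = set()         # triplets that have already retried properly

    def check(self, client_ip, mail_from, rcpt_to):
        """Return the SMTP reply code for this delivery attempt."""
        triplet = (client_ip, mail_from.lower(), rcpt_to.lower())
        if triplet in self.passed:
            return 250  # known good: no waiting anymore
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            self.passed.add(triplet)
            return 250  # retried after the delay, as a real MTA would
        return 451      # temporary failure: come back later
```

Fire-and-forget spam software generally never retries, so its triplets never make it into `passed`. A real deployment would keep this in a database and purge old entries.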