vjs · Slashdot Mirror

User: vjs

vjs's activity in the archive.

Stories: 0
Comments: 3
First seen: 2001-12-01
Last seen: 2001-12-02

Comments · 3

Re:Another spam system on Distributed Spam Detection · 2001-12-02 08:12 · Score: 1

Such systems can be cool, but they have two major shortcomings. The first is that they cannot start rejecting spam before it has been seen and manually reported by at least one good guy. From my logs, it seems the bad guys like to burst their spews at odd hours, such as when they get home from a hard day begging with a "homeless please help" sign.
Second, it is practically impossible to maintain a list of only good guys once it has more than a tiny number of members. If there is any real incentive, the bad guys will get on the list with as many aliases as they need to skew the system. You must either keep the list tiny enough that all members are known to all other members, or you must assume that bad guys are present. Voting or trust schemes can ensure that no more than 5% or perhaps even 1% of members are secret bad guys, but that's not good enough for an anti-spam system that hopes to have a false negative rate lower than 40% and a false positive rate of less than 1%.
As I understand it, this Razor can be used with spam traps (addresses that get no legitimate mail) to largely avoid the first problem. If you are extremely careful and lucky about keeping secrets, spam traps can fix the second problem. The need for lucky secrecy comes in keeping the bad guys from knowing about any of your spam traps lest they send them legitimate mail (e.g. CERT advisories).
A major problem with spam traps is getting the bad guys to spam them. It is easy to build a spam trap that receives some spam, but if you want to reject more than 10-20% of spam, you need more. For example, you need to get the big commercial and political outfits to send their wonderful news to your traps, but they're not going to scrape domain contacts or netnews or use the standard dictionary attack list. (My copy of the standard dictionary attack list is fairly complete. Used with a DCC client, it collects a lot of spam.)
All of that is why I believe in automated checksum reporting without any humans in the loop. I think you must start rejecting copies of a spew within minutes and ideally seconds of its start. That's why one of the design criteria of the DCC is that servers should send the checksums of a message to their peers within seconds of when its recipient count reaches "bulk."
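To make the "bulk" trigger concrete, here is a toy sketch in Python, not the DCC's actual code or protocol. The class name, the method names, and the threshold of 20 are all invented for illustration. It only shows the idea of counting recipients per checksum and pushing a checksum to peers the moment it crosses the threshold, with no human involved.

```python
from collections import defaultdict

BULK_THRESHOLD = 20  # hypothetical cutoff, not a value taken from the DCC


class ChecksumServer:
    """Toy server: counts recipients per checksum and floods bulk ones to peers."""

    def __init__(self, name):
        self.name = name
        self.counts = defaultdict(int)  # checksum -> recipients seen so far
        self.bulk = set()               # checksums already known to be bulk
        self.peers = []                 # other ChecksumServer objects

    def report(self, checksum, recipients=1):
        """Record a client report; return True if the checksum is bulk."""
        if checksum in self.bulk:
            return True
        self.counts[checksum] += recipients
        if self.counts[checksum] >= BULK_THRESHOLD:
            self._mark_bulk(checksum)
            return True
        return False

    def _mark_bulk(self, checksum):
        # Mark before flooding so that mutual peers do not loop forever.
        if checksum in self.bulk:
            return
        self.bulk.add(checksum)
        for peer in self.peers:
            peer._mark_bulk(checksum)
```

With two servers peered to each other, the first copy of a spew that a client reports to the second server is already flagged once the first server has seen enough recipients.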
There is a third problem with Fabien Penso's system as I understand it. That is that none of the SMTP envelope or headers are reliable indications of spam, if you want a low false negative rate. If there is one thing that spammers can invent, it is new usernames.
Re:Similar to DCC on Distributed Spam Detection · 2001-12-01 16:46 · Score: 1

Whether the checksum is SHA or MD5 is obviously completely irrelevant to whether the input of the hash is "fuzzy."
Some people think that SHA may be more secure than MD5. To date that supposed weakness in MD5 is at most a suspicion. For the purposes of detecting spam, it is also completely irrelevant, since the ability of a bad guy to compute collisions is not interesting. Using MD5 or SHA instead of a long CRC is mostly just good politics for dealing with people who don't understand or care to think about any relevant threat model. The hash must be long enough to have a probability of collision less than the probability of failures elsewhere, whether in hardware or software. For that you want 64 or 128 bits. There is very common and reasonably fast code to compute MD5, so I chose MD5 for the existing DCC checksums. There is nothing in the DCC protocol that requires the future DCC checksums to use MD5.
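The point about hash length can be checked with the standard birthday approximation. The following is a rough sketch in Python. The function name and the message counts are made up for illustration, and it assumes the hash output is uniformly distributed, which is the accidental-collision case and has nothing to do with a bad guy computing collisions on purpose.

```python
import math


def collision_probability(num_messages, hash_bits):
    """Approximate chance that any two of num_messages distinct inputs collide.

    Uses the birthday approximation 1 - exp(-n(n-1) / 2^(bits+1)),
    assuming a uniformly distributed hash of hash_bits bits.
    """
    exponent = -num_messages * (num_messages - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for bits in (32, 64, 128):
        print(bits, collision_probability(10**9, bits))
```

Under those assumptions a 32-bit checksum is all but certain to collide somewhere among a billion messages, while the probability falls off steeply at 64 bits and again at 128 bits.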
"Normalizing" the message is the essence of "fuzziness." Whether you convert the message to a grammar tree, histogram of words, ignore typical spammer "customizing," or anything else before computing the checksum, you are doing no more or less than "normalizing," at least for any useful meaning of the word I can think of.
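As a minimal illustration of "normalize, then checksum," here is a Python sketch. The normalization rules (lowercase, drop digits, collapse everything that is not a letter) are invented for the example and are not the DCC's actual fuzzy checksums. The hash itself is plain MD5 from the standard library; all of the fuzziness lives in normalize().

```python
import hashlib
import re


def normalize(body):
    """Hypothetical normalization, only meant to show the idea."""
    text = body.lower()
    text = re.sub(r"\d+", "", text)       # drop digits such as per-recipient IDs
    text = re.sub(r"[^a-z]+", " ", text)  # collapse punctuation and whitespace
    return text.strip()


def fuzzy_checksum(body):
    """MD5 of the normalized text; swapping the hash would not change the fuzziness."""
    return hashlib.md5(normalize(body).encode("ascii")).hexdigest()
```

Two copies of a message that differ only in case, punctuation, or an embedded number get the same checksum, while an unrelated message gets a different one.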
Vernon Schryver vjs@rhyolite.com
Re:I wouldn't trust this too much. on Distributed Spam Detection · 2001-12-01 16:26 · Score: 1

I'm inclined to trust the DCC far more, but only because it is my code. The DCC is completely independent of NANAE. I suspect most DCC users don't know what "NANAE" means.

Except that both this package and the DCC involve exchanges of checksums, I don't see major similarities between the two. Perhaps that is just my NIH syndrome talking. The DCC has been in use for a bunch of mailboxes since last year.

I think there is a major problem common to both that I deal with by saying "don't do that." That problem is dealing with bad guys. What happens if a bad guy subscribes to a mailing list you like such as CERT advisories, and submits checksums for those messages? My answer is that if your DCC server accepts checksums from DCC clients not under your personal thumb, then you must whitelist all of your incoming mailing lists because your DCC server only detects "bulkness" and not "unsolicited bulkness."
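
The whitelist rule above amounts to a two-step decision, sketched here in Python. This is not the DCC's whitelist format or API. The function name, the idea of keying the whitelist on a sender address, and the example addresses are all assumptions made for illustration.

```python
def classify(sender, is_bulk, whitelist):
    """Return "accept" or "reject" for one message.

    is_bulk is the answer from a checksum server that only knows "bulkness."
    whitelist is a set of lowercase sender addresses for mailing lists the
    recipient actually subscribed to (a hypothetical representation).
    """
    if sender.lower() in whitelist:
        return "accept"  # solicited bulk mail, e.g. a subscribed advisory list
    return "reject" if is_bulk else "accept"


if __name__ == "__main__":
    wanted = {"advisories@lists.example.org"}  # made-up address
    print(classify("advisories@lists.example.org", True, wanted))
    print(classify("offers@bulk.example.net", True, wanted))
```

The whitelist check comes first on purpose: a bulk hit from the server says nothing about whether the mail was solicited, so only the whitelist can rescue a subscribed list.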

If you accept checksums from strangers, then the effectiveness of your system for detecting bulkness increases significantly, but you can't trust people you don't know. Worse, by the time you have a significant number of users, the hassles of bookkeeping force you to assume that at least a few of them are bad guys.

Then there are mistakes by good guys. What happens if a good guy accidentally submits the checksum(s) of a CERT advisory? The answer for the DCC is the same as for bad guys. If you feed your DCC server with anything except spam traps that cannot receive any legitimate mail, you can consider all hits to be unsolicited bulk email. If you let humans submit checksums of what they think is spam, you cannot trust them to never make mistakes, and so must treat your DCC server as telling you only about "bulkness."

Vernon Schryver vjs@rhyolite.com