paulgraham.com · Domains · Slashdot Mirror

Re:Yet another reason to switch to Lisp by Dr.+Photo · 2003-03-04 07:08 · Score: 2, Insightful · on Aspect-Oriented Programming with AspectJ

Yawp.

What I got from the review is simply further confirmation of Paul Graham's hypothesis that more modern programming languages are asymptotically approaching the capabilities Lisp has had for decades.

The whole "join points / pointcuts" thing seems to me like a watered-down version of Scheme's (and other Lisps') dynamic-wind and call/cc.

(And didn't the B-type Golgafrinchans end up populating some utterly insignificant little blue-green planet orbiting a small unregarded yellow sun far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy? ;-)

Re:tar pits dont work by Anonymous Coward · 2003-02-28 13:33 · Score: 0 · on Using Statistics to Cause Spammers Pain

If you look at the article referenced in the original article, you'll see that Mr. Lamb is proposing exactly that: using Bayesian filtering to make the decision to throttle/tarpit the incoming mail. Sounds to me like a great idea. You still process and accept spam, you just do it very S L O W L Y. If enough people do so, then the spammer's ability to send out x messages in a specific amount of time drops, and that impacts his bottom line.

If Only... by 6e7a · 2003-02-27 04:35 · Score: 1 · on The Riddle of Baghdad's Battery

If only the wise and intelligent people who invented that battery were running Iraq right now. But, like nerds in high school, these people are too easily disposed of.

On a more serious note... by Obiwan+Kenobi · 2003-02-21 08:25 · Score: 5, Insightful · on Advice You Would Give to Your 12 Year-Old Self?

I know it's easy to go the "+1 Funny" route here and tell everyone to get ready to jerk off a lot and buy stock in [some company that will explode with profits], but after thinking about this for awhile, I've deduced my advice to a sentence:

Don't take shit.

My life from 12-17 generally consisted of me putting up with bullying, putting up with being put down, putting up with people who had no business trying to tell me what to do, and even when they had that right, they did it all the wrong way. A little standing up for yourself goes a long way.

What would I tell myself? When that bully picks on you, punch him in the face as hard as you can. Go Ender on him--don't stop until they pull you off of him. I guarantee that he'll never try it again, yet this amazing fact eluded me, and I just assumed that no matter what I did, and that included fighting back, that I would be stuck in my little hole of miserableness forever.

Don't let your boss walk all over you. When I entered the "corporate world," also known as the Full Time Job, my little "Computer Operator" job got me nothing but headaches and more miserableness. Just when I thought I had escaped the clutches of bullies and put downs, here comes Office Politics to screw it all up again. Suddenly my boss would take credit for all my work and leave me hung out to dry when I made a mistake, holding me up to the whole place as an example of How To Screw Up Rightly. The more I think about it, the more it hurts in the futility of it all.

Did I ever finally grow some gumption and let it fly? Sure. But it was far too late. The damage had been done, and this fantastic article rang so true my ears are still ringing. I told off my old boss, let the higher ups know what was going on, and moved on to greener pastures. I settled down, found a wonderful wife and now have a gorgeous 8 month old daughter who I value more than my own life. And I'll be sure to let her know, when she turns 12, that life isn't about the microcosm of high school, or the inmates, er, students in it.

My greatest hope would be that my 12 year old self would be, at the very least, left alone. And that's more than most depressed, repressed teenagers get.

Re:F Table? by pjrc · 2003-02-18 15:11 · Score: 1 · on Why Nerds Are Unpopular

what are these 'F' and 'D' tables people keep referring to?

You could click on the link, but since reading through these posts is more interesting than actually reading the article (if only the first paragraph or two), I'll do you the service of quoting the first two paragraphs of the article so you don't have to actually click on that link and leave the comfort of slashdot.....

Why Nerds are Unpopular
February 2003
When we were in junior high school, my friend Rich and I made a map of the school lunch tables according to popularity. This was easy to do, because kids only ate lunch with others of about the same popularity. We graded them from A to E. A tables were full of football players and cheerleaders and so on. E tables contained the kids with mild cases of Down's Syndrome, what in the language of the time we called "retards."
We sat at a D table, as low as you could get without looking physically different. Our table was populated by complete nerds, cases of delayed pubescence, and recent immigrants from China. We were not being especially candid to grade ourselves as D. It would have taken a deliberate lie to say otherwise. Everyone in the school knew exactly how popular everyone else was, including us.

Bayesian filtering by Anonymous Coward · 2003-02-15 21:03 · Score: 0 · on Spam Catchers Block Latest Crypto-Gram

I highly recommend you read up on this. Even if you don't go for the gory statistical details, read Paul Graham's overview:

http://www.paulgraham.com/better.html

It works quite well, even when spammers try to evade it using techniques like you mentioned. For example, a message with this:

Highten S/e/x/u/a/l Satisfation, 1 0 0 % Safe ... was easily caught and filtered, even though every keyword is misspelled or mangled, and even though the body of the message was seemingly spam-innocent.

Re:Unfortunately, posting to /. can generate spam. by Deven · 2003-02-13 04:10 · Score: 1 · on My Short Life As An Unintentional Porn Spammer

Moral: spammers hoover slashdot, so don't post your email here, ever.

Screw that. I refuse to hide or obfuscate my email address. I've been using the Internet for 15 years. I remember the time when the Internet was mostly spam-free, and people rarely forged email addresses even though everyone knew how to.

My real email address is deven@ties.org -- this is my primary personal email address, not a spam-trap address. I know that the spammers are harvesting addresses from Slashdot and everywhere else. I don't care. Let them have the address. I've never hidden it, and I never will. I'm stubborn that way. (It's akin to refusing to change your lifestyle in response to terrorism, even when you know you're at risk...)

Of course, since I don't hide my email address, I get tons of spam, along with "Joe job" bounces/replies for spams forged in my name, plus more bounces copied to postmaster, since I receive postmaster mail for several domains. Bring it on! It just provides me with a larger corpus of bogus email to use for Bayesian filtering, or whatever other technique I may experiment with...

I firmly believe that a technical solution will be required to solve the spam problem. Legislation won't prevent the virtually-untraceable international spams, and may not even prevent local ones if it's not zealously enforced. Social controls haven't been effective. We need to prevent the spam from being delivered in the first place, or at least mark it as suspicious so legitimate mail doesn't drown in the noise so easily.

Beyond basic filtering like SpamAssassin and Bayesian filtering, there are other technical solutions worth exploring. Human validation techniques like TMDA might help. Finding a way to punish spammers and drive up their costs, such as E-Stamps or selling interrupt rights (original paper: HTML or PDF), might be effective. (But likely a higher barrier to legitimate mail.) Some sort of PGP-style Web of Trust might be very effective if done well, but it would be difficult to build. Perhaps some "soundness" principles could be borrowed from Usenet II to create a similar system for email...

Let's cross our fingers and hope to find a truly effective solution (or combination of solutions) in the near future!

Spammers benefit from spam conferences by Vadim+Makarov · 2003-02-06 15:48 · Score: 3, Insightful · on Slashback: NWLink, Vivendi, Gatherings

Don't be mistaken, these spam conferences are closely watched by the spammers as well.

Two weeks ago, I read the A Plan for Spam article from the last conference, announced on Slashdot. There, the author describes spam-of-the-future as "some completely neutral text followed by a url".

Voila, the future has come. Yesterday I got a short message in Russian, in a friendly tone, with a URL. Just like the ones I sometimes get. I'm a webmaster of a site with diverse content, and strangers sometimes send me stuff like this for news etc. There is absolutely no way to tell whether it's spam or not without visiting the URL.

While the developers wrestle with one strategy and openly discuss the remedies, the spammer sees it and picks the next strategy, always ahead of you! Who benefits more from these conferences, good folks or the spammers?

One fix I'd propose is to stop publishing and webcasting the conference stuff. Then the spammers would have to attend in person. You know what happens next. A spammer surrounded with angry geeks :)

Obvious answer? by Dr.+Photo · 2003-02-03 05:48 · Score: 5, Interesting · on Kishotenketsu Programming?

Ruby is a programming language that was born and raised in its native Japan, which means it may very well be, by definition, what you say you're looking for.

Incidentally, Ruby, though purely-OO, supports nifty things like true closures, and you can end up doing functional programming without realizing it at first [Ruby, of course, is designed with this sort of thing in mind]. It was the realization that I was doing this (or something very close to it), in conjunction with Paul Graham's essays that got me interested in Scheme (a sleek, lightweight dialect of Lisp).

So, perhaps the only real answer is to learn as many interesting programming languages as you can, and use the broadened perspective you gain to make an informed decision for yourself.

RBL by Penguinoflight · 2003-01-27 02:11 · Score: 4, Interesting · on Using gzip As A Spam Filter

RBL blocks a lot of stuff that isn't spam. It's probably a better idea to go with Bayesian filtering. You can read up on it here: http://www.paulgraham.com/better.html

Re:Naive Bayesians probably don't work in long run by BlackjackGuy · 2003-01-19 02:36 · Score: 1 · on MIT Spam Conference Conclusions

Actually Bayesian filters are still extremely effective in the circumstances you mention. Paul Graham talks about all of them in his article A Plan For Spam. I'll run down the list.

a) Short messages do get caught. The Bayesian filter doesn't just look at the email body; it looks at the email header as well. There's just as much damning evidence of spam in the email header as the body. All the URLs and header "signatures" of spammers are pretty easily identified after you've gotten a lot of spam.

b) I've noticed that misspelled words aren't that bad frankly. The email headers are still valuable here. You'd have to misspell every word in the whole spam message to really try to get the filter to choke. And then, the spammers' return rate is going to go down. They don't want that. And even if they do misspell EVERYTHING, the email header is still going to be fishy.

c) Spammers can't really use common words to get their message across. If they do, they get a worse return rate, and they don't make any money. They need flashy marketing words, things like FREE and SALE and VIAGRA and PORN. Plus, most of them use terms like "unsubscribe" or "offers" to try to make it sound legitimate. These words are all dead ringers for spam. And again, the email header is of course going to get them caught.

d) Most of the spam I get now is simply just a picture. And the Bayesian filter I use catches all of them. Again, not to sound like a broken record, but the email headers are really effective in catching this stuff, regardless of the message body. And the HTML tags that spammers use in their email are also pretty recognizable. Things like color codes or whatever. Specifically for pictures, the IMG tag of course needs to have a domain name in the URL of the image, and that most likely is going to be good evidence for spam.
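To make the header point concrete, here is a minimal Python sketch of a tokenizer that feeds both headers and body to the filter. It is an illustration only, not the code of any particular filter: the tokenize name, the token regex and the "header*word" marking are assumptions I picked for the example.

```python
import re
from email import message_from_string

# Characters treated as part of a token (an assumption for this sketch).
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")

def tokenize(raw_message):
    """Return tokens from both the headers and the body of a message.

    Header tokens are prefixed with the header name so that "free" in the
    Subject line is counted separately from "free" in the body.
    """
    msg = message_from_string(raw_message)
    tokens = []
    for name, value in msg.items():
        for word in TOKEN_RE.findall(value):
            tokens.append(f"{name.lower()}*{word.lower()}")
    body = msg.get_payload()
    if isinstance(body, str):  # multipart messages are skipped in this sketch
        tokens.extend(word.lower() for word in TOKEN_RE.findall(body))
    return tokens
```

Even a message whose body is just one image tag still produces a pile of header tokens (sender domain, mailer, subject words) for the statistics to chew on.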

Bayesian filters still work well against all the scenarios you point out.

My notes for the proceedings (very long post!) by babbage · 2003-01-18 21:58 · Score: 5, Interesting · on MIT Spam Conference Conclusions

I was waiting for the review to show up on Slashdot, as the conference was really good. The audio proceedings have been put online, but I'm not sure if they can take a Slashdotting, so please be gentle :) If you have 8 hours to spare, the whole day was pretty good & worth listening to, but the schedule as planned isn't exactly the sequence people spoke in, so you may have to jump around the RealAudio stream a little bit.

Turning my notes for the day into something vaguely coherent, here are some highlights from the proceedings. There are a couple of speakers that I didn't write anything down for, but from mid-morning on this should be pretty comprehensive. Apologies in advance if my notes lead me to attribute certain comments to the wrong speaker -- if anyone notices any mistakes please feel free to add corrections:

Bill Yerazunis - CRM114 & MailFilter

Because Perl "freaks him out", Yerazunis came up with the CRM114 minilanguage (points for anyone that gets the joke in the name without googling for it :), then wrote MailFilter in CRM114 as an implementation of a filter that can be used with Procmail or SpamAssassin or what have you. The basic idea is to decompose a message into a set of "features" composed of various permutations of single words, consecutive words, words appearing within a certain distance of one another, etc, such that the set of features N is very much bigger than the set of words X. You then analyze the features in various ways and if you get above a certain arbitrary threshold, you flag the message as spam & handle it accordingly.
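To give a feel for how the feature set gets so much bigger than the word set, here is a small Python sketch. It is my own illustration of the general idea from the talk, not CRM114's actual algorithm; the features function, the window size and the "<gap>" notation are all assumptions.

```python
def features(text, window=4):
    """Generate single-word, adjacent-pair, and within-window pair features."""
    words = text.lower().split()
    feats = set(words)  # every single word is a feature
    for i, first in enumerate(words):
        # Pair this word with each later word that falls inside the window.
        for gap in range(1, window):
            j = i + gap
            if j >= len(words):
                break
            # "<0>" marks consecutive words, "<1>" one skipped word, etc.
            feats.add(f"{first} <{gap - 1}> {words[j]}")
    return feats
```

With window=3, the four-word text "buy cheap pills now" yields 9 features (4 words plus 5 pairs), and the ratio keeps growing as the window widens.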

He claimed that with this software he could get better than 99.9% accuracy in nailing spam, and a similar percentage in correctly passing "ham" (the term everyone was using for legit mail; ham that gets falsely identified as spam is a false positive). One of Yerazunis' observations is that the best way to defeat the spam problem is to disrupt the economics: if a 99.9% or better filter rate were to become the norm, then the cost of delivering spam can be pushed higher than the cost of traditional mail and the problem will naturally go away without requiring legislation (which would be nice anyway, but we can't count on it).

The drawback of CRM114/MailFilter is that it can only handle about 20k of text per second, so it's not appropriate for large scale use yet. Still an interesting project to watch though: crm114.sourceforge.net

John Graham-Cumming - POPfile

Most of his very entertaining talk was about the ingenious tricks that spammers resort to to obfuscate spam against filters, including most diabolically one example that placed each column of monospace text in the message into an HTML column, so that the average HTML-capable mail client would render the message properly, but it would be absolute gibberish to most mail filters. The ultimate lesson was that any good filter has to focus not on "ascii-space" (the literal bytes as transmitted) but the "eye space" (the rendered text as seen by the user), which by extension may mean that any full scale spam parser/filter could also have to include a full-scale HTML & Javascript engine. Yikes!
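As a tiny first step toward "eye space", here is a Python sketch (mine, not POPfile's code; the class and function names are made up) that throws away tags, comments, scripts and styles and keeps only the text a reader would see. It does not lay out tables, so it would not undo the column trick described above; that really would need a rendering engine.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text a reader would see, ignoring tags, comments and scripts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.hidden_depth = 0  # greater than zero inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Joining with no separator lets "V<!-- x -->iagra" read as "Viagra".
    return "".join(parser.parts)
```

The trade-off of joining with no separator is that text from adjacent block elements gets glued together, so a real filter would want to insert whitespace at block-level tags.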

As for Graham-Cumming's software, it's a Perl application, available for all platforms (Windows, Mac, & of course Linux) that allows users to filter POP3 mail. Interesting stuff if you're a POP user: popfile.sourceforge.net

John Draper - ShopIP

Most of Draper's work seemed to be focused on profiling spammers, as opposed to profiling spam itself, by throwing out a series of honeypot addresses & using data collected to hunt down spammers. spambayes.sourceforge.net

Paul Judge, CipherTrust

Judge's big argument, which no one really disagrees with, is that spam has become not just a nuisance, but an actual information security issue. To that end, he is advocating much more collaborative effort to address the problem than we have seen to date: conferences like this, mailing list discussions, better tools, and public data repositories of known spam [and ham]. To that last point, one of his observations (which others made as well) was that there are no universally agreed on standards for what qualifies as spam, so repositories for spam will not be accurate for all users (spam for your programmers will be the bread & butter of your marketing department, etc). Plus, there are obvious privacy issues in publishing your spam & ham for public scrutiny. And to add another wrinkle, one danger of public spam/ham databases is that spammers can poison them with false data, screwing things up for everyone. That said, he encouraged users to help out with building spamarchive.org.

Paul Graham

The man who organized the conference and kicked everything this week off with his landmark paper from last fall, A Plan for Spam. Graham's spam filtering technique famously makes use of Bayesian statistics, a technique popular with nearly all of the speakers. The nice thing about statistical approaches, as opposed to heuristics, simple phrase matching, RBLs, etc, is that they can be very robust & accurate; the down sides are that they have to be trained against a sufficiently large "corpus" of spam (most techniques have this property though) and they have to be continually retrained over time (again, this is common). Graham was too modest to produce numbers, but subjectively his results seemed to be even better than what Yerazunis gets with MailFilter, by an order of magnitude or more.
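For anyone who wants the gist before reading the paper, here is a minimal Python sketch of the naive Bayes combining step as I understand it from A Plan for Spam. The function names are mine, and it assumes the per-token probabilities have already been computed from your corpus and clamped away from exactly 0 and 1 (otherwise the division can blow up).

```python
from math import prod

def combine(probabilities):
    """Combine per-token spam probabilities into one message probability.

    Assumes every probability lies strictly between 0 and 1.
    """
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)

def most_interesting(probabilities, n=15):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:n]
```

Two tokens at 0.99 each combine to well over 0.99, while a 0.99 and a 0.01 cancel out to roughly 0.5, which is why a few strongly "innocent" words can rescue a legit message.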

Like other speakers, he predicted that spammers are going to make their messages appear more & more like "normal" mail, so we're always going to have to be persistent about this -- as one example, he showed us an email he received IN ALL CAPS from a non-English speaker asking for programming help, and although it was legit, the filters insisted otherwise. "That message is the one that keeps me up at night."

Everyone interested in the spam issue should go read Graham's paper immediately.

Robert Rothe, eXpurgate

Rothe works for Eleven, an ASP company from Berlin selling a spam management service/application called eXpurgate. His talk was short on details about how the tool worked (mainly that it searches for bulk mail), focusing instead on the high level functionality it provides to users -- basically, they classify mail as safe, questionable, or dangerous, and let the users handle them accordingly. Another speaker that sees spam as a network security issue, so they built their system accordingly, with privacy of the client's mail content in mind etc.

Like many speakers, he warned about the dangers of an anti-spam "monoculture": that Bayesian techniques might be great, but if that's all anyone uses then spammers will catch on and adjust their messages to look more like normal mail, to the point that Bayesian filters won't work anymore. As a result, we're going to need to attack the problem from several angles, using different techniques, to keep the spammers off balance as much as possible.

Matt Sergeant, SpamAssassin

SA is a well known Perl application for heuristically profiling messages as spam, adding headers to the message saying for example "I am 72% sure this is spam because it has X Y Z", and passing off the message to procmail or whatever to be handled accordingly. SpamAssassin can handle a message throughput great enough that it can be deployed at the network level (whereas some of the others, which might have somewhat better hit rates, are still too inefficient at this point). Deployed this way, the differences in effectiveness for single vs. multiple users become very apparent, as 99% effective rates fall down into the 95-80% range. This happens because, again, different users define different things as spam, so mapping one fingerprint to all users can never work quite right. For an example of a tool that your company can deploy right now & get fast, decent results, SA looks like a good choice; but for the long run it looks like a Bayesian technique is going to get better performance, and SA is adding a statistical component to its toolkit. Good talk.

Barry Warsaw, Python Labs

This was another example of the "monocultures are dangerous" philosophy, as Warsaw explained how he is helping to use a variety of anti-spam techniques -- from clever Exim MTA configuration to good use of SpamAssassin & Procmail to fine tuning of the MailMan mailing list engine -- to work together to manage the spam problem for all things Python (Python.org, Zope, many mailing lists, a few employees, etc).

He pointed out that some very simple filters can be surprisingly effective: run a sanity check on the message's date; look for obviously forged headers; make sure the recipients are legit; scan for missing Message-Id headers; etc. In response to the person that originally posted the article, yes, he did mention blocking outgoing SMTP as an effective element of a many tiered spam management approach.
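Two of those cheap checks are easy to sketch with Python's standard email package. This is my own illustration, not Warsaw's configuration; the header_warnings name and the one-week tolerance are assumptions picked for the example.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

def header_warnings(raw_message, now=None, max_skew=timedelta(days=7)):
    """Return a list of cheap sanity-check failures for one message."""
    now = now or datetime.now(timezone.utc)
    msg = message_from_string(raw_message)
    warnings = []
    # Header lookup is case-insensitive, so this also matches "Message-ID".
    if not msg.get("Message-Id"):
        warnings.append("missing Message-Id")
    date_header = msg.get("Date")
    try:
        if date_header is None:
            raise ValueError("no Date header")
        sent = parsedate_to_datetime(date_header)
        if sent.tzinfo is None:  # treat dates without a usable zone as UTC
            sent = sent.replace(tzinfo=timezone.utc)
        if abs(now - sent) > max_skew:
            warnings.append("implausible Date")
    except (TypeError, ValueError):
        warnings.append("missing or unparseable Date")
    return warnings
```

Neither check proves anything on its own, which fits the tiered approach: each one is just another cheap signal to pass along to the next layer.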

Among other tricks for getting the different filtering tiers to play nice together, they make heavy use of the X-Warning header so that if an alarm goes off in one tier of their mail architecture, other components can respond appropriately. Cited projects included ElSpy and SpamBayes.

Barry Shein, founder & CEO of The World -- or as he laughingly put it, "President of the World". Har har har

This talk was mostly a let down for me -- Shein has made his views very well known, and his ranting, rambling talk didn't really introduce any new ideas for anyone that had read that interview (some good jokes & quotes though).

His core argument is that spam is "the rise of organized crime on the internet", that filters are nice but that the mail architecture itself is fundamentally flawed, and that ISPs like his -- in 1989, The World was the world's first dialup ISP -- are being killed by the problem. Shein was very annoyed that all these talented people are having to clean up a mess like this when we should be out working on more interesting stuff, and not having to worry about this issue. His big hope seemed to be that legislation will someday come to the rescue, but he sounded very pessimistic. (Others in the room seemed to feel that this was a very interesting machine learning problem, and weren't really fazed by his pessimism -- but then most of the people in the room don't run ISPs.)

He also suggested that we need to find a way to make spammers pay for the bandwidth they are consuming (rather than having users & ISPs shoulder the burden) but didn't seem to know how we might go about implementing this. At all.

Fun rant to cheer along to, but for me it wasn't very constructive in the end.

Jean-David Ruvini, eLabs SmartLook

This was an interesting product. Ruvini's company is developing an extension to Outlook 2000 & XP that will watch the way users categorize messages into folders, come up with a profile for what kinds of messages end up in which folders, and then try to offer similar categorization on an automatic basis. Think of it as Procmail for Outlook, without having to mess with (or even be aware of!) all the nasty recipes.

Obviously if you have a spam folder, then spam will be one of the categories it looks for, but more broadly it will try to categorize all your mail as you would ordinarily categorize it. This makes SmartLook a broader tool than "just" a spam manager.

SmartLook is another statistical filter, though it uses non-Bayesian algorithms to get results. eLabs' tests suggest that the product is able to properly categorize messages about 96% of the time, with no false positives, and (for their tests, mind you) that it performed better than Bayes filters over three months of usage.

One nice property of this tool was that it works well with different [human] languages -- some strategies fall apart &/or need retraining when you switch from English to some other language. For certain markets (eLabs seems to be a European company, perhaps French?) this is a crucial feature, and having a tool that works with one of the biggest mail clients out there (most people don't use Mutt or Pine, sadly enough) can be very valuable. Very clever -- watch for the inevitable embrace & extend three years from now.

Eric Raymond

He didn't say anything about guns, but he did try to correct one of the other speakers for misusing the term "hacker."

Like Graham, ESR is a Lisp fan, but he knows that the vast majority of people aren't, and he also knows that the vast majority of people need to be using something like Graham's spam software. So on a lark, he came up with a clean version in C, named it BogoFilter, and put it on Sourceforge, where a community sprang up to, well, embrace & extend it.

As good as Graham's Bayesian algorithm is, ESR felt -- as did many of the other speakers -- that the nature of your spam/ham corpus is much more significant than the relative difference among any handful of reasonably good algorithms. (Back to the often repeated point about how corpus effectiveness falls apart when used for a group of users, as opposed to individuals.) To that end, he strongly feels that the best way to deal with the spam problem is to get good tools into the hands of as many people as possible, and to make them as easy to use as possible (ahh, the old "open source UIs always suck" argument :). As an example, one of the first things he did was to patch the Mutt mail agent so that it had two delete keys: one for general deletion, one for "get rid of this because it's spam." That second key, and interface touches like it, seem like the way to get average people to start using filters on a regular basis.

Joshua Goodman, Microsoft Research

Unlike ESR, Goodman felt that algorithm selection does make a big difference, but this being Microsoft he refused to disclose what algorithms his team is working with -- except to say that, when delivered, they will be more accessible for average users than SpamAssassin, Procmail recipes, or Mutt :)

Microsoft has been working on the spam problem since 1997, but because of how big they are they've had unique problems in bringing solutions to market. As a case in point, they tried to introduce spam filters to a 1999 Outlook Express release, but were immediately sued by email greeting card company Blue Mountain because their messages were being inaccurately categorized as spam. With that in mind, they have been very reluctant to bring new anti-spam software out since then because they would like to see legislation protecting "good faith spam prevention efforts."

As a very large player, Microsoft faced certain difficulties in developing useful filters -- it may make sense for you as an individual to filter all mail from Korea, but this doesn't work so well if you are trying to attract customers *from* Korea :). This has forced them to put a lot of work into thoroughly testing different strategies before offering them to the public.

In spite of what millions of webmail users may have expected, Hotmail & MSN are currently being filtered by Brightmail's service, and plans are underway to reintroduce spam management features to client side software again. (Just imagine how bad it would be if they weren't paying someone to filter for them! Unfortunately, no hecklers piped up to ask if they are really selling Hotmail's user database to spammers, and if that is a source of annoyance for his team.)

An interesting barrier his group has had to grapple with was what he called the "Chinese menu" or "madlibs" spam generation strategy: that it's easy to come up with a template for spam -- "[a very special offer] [to make your penis bigger] [and please your special lady friend all night!]" vs. "[an exclusive deal] [for genital enlargement] [that will boost your sex life!]" etc -- and have a small handful of options for each 'bucket' multiplying into a huge variety of individual messages that are easy for a human to group together but almost impossible for software to identify.
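The multiplication is easy to see in a few lines of Python. This is just my own illustration of the arithmetic, not anything Goodman showed; the function names are made up, and the slot contents are whatever the spammer puts in each bucket.

```python
from itertools import product
from math import prod

def count_variants(slots):
    """Number of distinct messages a slot template can generate."""
    return prod(len(options) for options in slots)

def variants(slots):
    """Yield every message built by picking one option from each slot."""
    for choice in product(*slots):
        yield " ".join(choice)
```

Three buckets with two options each already give 8 messages, and six buckets with five options each give 5**6 = 15625, all from a template that a human would recognize at a glance.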

Michael Salib, extremely funny MIT student

Unlike nearly all other filter writers of the day, Salib's approach was heuristic: find a handful of reasonable spam discriminators, throw them all against his mail, and see how much he can identify that way. "It's sketchy, but this is a class project. I don't have to be realistic. [...] These results may be completely wrong."

Much to his surprise, he's trapping a lot of spam. He pulls in a little bit of RBL data ("the first two or three links from Google, whatever"), looks for some patterns and so on, and then churns it through LMMSE, an electrical engineering technique that as far as he can tell doesn't seem to be known in other fields. Basically this involves running the messages through a series of scary-but-fast-to-calculate linear equations. It turns out that he can process this much faster than a Bayes filter, to the point that customizing his approach for each user in a network would actually be feasible.

For a small spam corpus, he got results better than SpamAssassin did, though for a large corpus his results were worse; he couldn't really account for why this would be the case, or predict how things would scale as the corpus continued to grow.

When a member of the audience [who was apparently familiar to Salib -- I don't know who it was] questioned the RBL tactic and asked whether authenticating remote users might be the answer, Salib's response was "yes, I agree, but then you *do* work for Verisign, who is in the verification business, so you would say that."

Right on, Salib -- his talk was easily the funniest & breeziest of the day :)

David Lewis, general researcher

The core of Lewis' argument, as ESR said earlier in the day, is that for any machine learning technique the quality of the learning corpus is much more important than the algorithm used. Bayes is one such algorithm, but there are many other good ones in the literature. In a dig at Goodman's refusal to disclose algorithms, Lewis pointed out that all of this has been publicly discussed since the first machine learning paper was published in 1961.

Observations: "lots of task inspecific stuff works badly, but task specific stuff helps a lot." It is important to use different corpuses [corpi?] for training and for general use, so that you don't train your machine to focus too much on certain types of input (this is a point that Microsoft's Goodman made as well).

As Graham did, Lewis emphasized that spam is going to slowly start looking more like natural text, and we're going to have to deal with this as time goes on. www.daviddlewis.com/events/

Jon Praed, Internet Law Group

To a burst of tremendous applause, this talk began with the sentence "my name is Jon Praed, and I sue spammers."

He brought a legal take on the "not everything is spam to everybody" angle, emphasizing that we need a precise definition of what qualifies as Unsolicited Commercial Email (UCE). In particular, it has been difficult trying to pin down if the mail was really unsolicited, as this is where the spammers have the most wiggle room. However, if you can track down the spammer, they have to date rarely been able to verify that the user asked for mail, and so Praed has been able to successfully prosecute several spammers on this angle. He doesn't expect this to work forever though.

According to Praed, "laws against spam exist in every state, and more are pending", but he doubts that a legal solution will ever be completely effective as long as spam is lucrative. By analogy, he pointed out that people still rob banks and that has never been legal.

Praed informed the audience that there are several ways to get back at spammers, including injunctions, bankruptcy, and contempt, and all of these can be very effective. He pointed out that, to be blunt, a lot of these people are desperate low-lifes, and spam has been their biggest success in life. After these legal responses, their lives all get much worse. It hadn't occurred to me to see spammers as pitiful before, but I can now. Most importantly, Praed stressed that these legal remedies can be very effective, and he strongly warned against taking vigilante action. This is almost always worse than the spam itself, and it only serves to get you in even deeper trouble than the spammer.

As for the sources of spam, most comes from offshore spam houses, abuse of free mail accounts (Hotmail & Yahoo, free signups at ISPs, etc) and bulk software (which may apparently soon become illegal in certain areas, provided that a law can be found to ban spam software while allowing things like MailMan or MajorDomo). Interestingly, he questioned the idea that header spoofing is a big problem, and claimed that in every case he has dealt with he has been able to track down the messages to a legit source sooner or later.

Suggestion: if you get a spam citing a trademarked product [e.g. Viagra], forward it to the trademark holder and they will almost always follow up on it. Suggestion: be fast in trying to track down spammers, as some of them have gotten in the habit of leaving sites up long enough for mail recipients to visit, but taking them down before investigators get a chance to take a look. Legal observation: spam is almost always fraud, and can be prosecuted accordingly.

Praed wrapped up his talk by citing the encouraging precedent that the famous Verizon Online vs. Ralsky case set: [a] that the court is interested in where the harm occurs, not where the person doing harm was when causing it (so if you send spam to someone in Alaska and spam is a capital offence in Alaska, you can be tried as a citizen of that state even if you caused the harm from somewhere else), and [b] it is assumed that you have to be familiar with a remote ISP's acceptable usage policies, and ignorance is no defence (just as you can't say "I didn't know it was illegal to shoot someone", Ralsky couldn't say that he didn't know Verizon prohibits spam; he had to have known that the AUP wouldn't allow what he was doing, so he deliberately didn't read it). That precedent makes future prosecution of spammers much more promising. While, again, legal solutions may never eliminate the spam problem, a precedent like this can be an important supplement to filtering efforts (the stick to the filter's carrot, or something -- my lousy analogy, not Praed's).
David Berlind, ZDNet executive editor

His talk was primarily about how he receives a huge quantity of email from ZDNet readers, and he can't afford to use any spam filtering strategy that would allow *any* false positives. As one of the speakers said -- sorry, I forget who (Microsoft's Goodman?) -- getting a 0% false positive rate is easy: just classify nothing as spam. Getting a 100% hit rate is also easy: just classify everything as spam. Any solution besides those two is always going to have some degree of error either way, and determining how much of what kind of error you want to accept is up to you. Most users will tolerate a moderate false negative rate (some spam gets through) if it means that the false positive rate (legit mail is deleted) is very low. In Berlind's case, the false positive rate has to be vanishingly small, because reading all customer mail is a critical sign of respect for him.
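The trade-off between the two kinds of error can be made concrete with a small sketch. Everything here is an assumption for illustration: the scores and labels are made-up data, and `error_rates` is a hypothetical helper, not something any speaker presented.

```python
def error_rates(scored, threshold):
    """Return (false_positive_rate, false_negative_rate) for a threshold.

    scored: list of (spam_score, is_spam) pairs, score in [0, 1].
    A message is classified as spam when its score is >= threshold.
    """
    ham = [s for s, is_spam in scored if not is_spam]
    spam = [s for s, is_spam in scored if is_spam]
    false_pos = sum(1 for s in ham if s >= threshold)  # legit mail flagged as spam
    false_neg = sum(1 for s in spam if s < threshold)  # spam that gets through
    return false_pos / len(ham), false_neg / len(spam)


# Made-up scores, purely for illustration.
SAMPLE = [
    (0.05, False), (0.20, False), (0.60, False),
    (0.40, True), (0.90, True), (0.99, True),
]

for t in (0.0, 0.5, 0.95, 1.01):
    fp, fn = error_rates(SAMPLE, t)
    print(f"threshold={t:.2f}  false positives={fp:.2f}  false negatives={fn:.2f}")
```

A threshold above 1 classifies nothing as spam (no false positives, every spam missed), and a threshold of 0 classifies everything as spam. Someone in Berlind's position would push the threshold up and accept more spam in the inbox.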

Further, his business is also a legitimate mass emailer, sending out millions of free newsletters to users every day, and if Shein's proposal to bill bulk mailers were to catch on then even a very low rate would quickly put his company in the red. One obvious solution, which wasn't mentioned: start charging a subscription for these mailings, and make them profitable. I don't want to see this happen but if it did then the economics would tilt back toward making things feasible again.

Berlind is appreciative of the anti-spam work that is being done, but at the same time is skeptical of how pragmatic most of what is being proposed can really be. He feels we need a massive effort to rework the way mail is handled [Y2K anyone? It could get IT people back to work...], and to that end hopes ZDNet can help promote such a cooperative effort between the parties working on this. They don't want to be involved -- they are journalists & publishers, not standards developers -- but they are eager to get things going & want to cover the story as it progresses.

Like Shein said, he feels it's a waste for all these talented people to be working on combating penis enlargement offers, and hopes that we can find a way to get past this and work on real problems, "like world peace." This comment got a chuckle from the audience, but he seemed like the kind of guy that really meant that, and more importantly, he was right. A smart guy like Paul Graham or Bill Yerazunis shouldn't have to waste time tinkering with how many Viagra offers he can automagically delete when there are more fun things to be doing.
Ken Schneider, Brightmail

As mentioned earlier, Brightmail provides an ASP service for real time filtering of both incoming & outgoing mail. As would perhaps be expected, bigger ISPs and networks attract larger amounts of spam: 50% of mail coming into big ISPs and 40% coming into big companies is now spam. Brightmail offers the Probe Network, a <slashdot-killfile-term>patented</slashdot-killfile-term> system of decoy honeypot addresses that gather data for analysis at their logistics center, which in turn distributes spam filtering rules to their clients where a plugin for $MTA (using the open source or proprietary MTA of the client's choice) can act on the database.

An interesting property of their system is that they have a mechanism for both aging out dormant rules as well as for reactivating retired ones, so that the currently active ruleset can be kept as lean & efficient as possible. A big source of difficulty for them is legitimate commercial opt-in lists, because things have gotten more shady & blurry over time and it's now hard to tell this mail from much of the spam out there. Whitelists help here, but the problem is still difficult.

After each speaker had his turn, there was a panel discussion, but not much really happened there, and the moderator cut things short after only a couple of minutes. The original plan was for everyone to go out for Chinese food afterwards and continue the discussions over dinner, but when 580 people signed up that plan obviously fell apart. :) And so, here end the notes...

spambayes? by spongman · 2003-01-18 17:18 · Score: 4, Informative · on MIT Spam Conference Conclusions

Did anyone there talk about Spambayes? I've been using this open-source spam filter for several months now and lurking on their mailing list and I have been really impressed at the lengths they've gone to to provide a mature framework for testing their statistical theories over many varied sets of spam/ham corpora.

While they started out with the bayesian algorithm described by Paul Graham, they quickly discovered that the effectiveness of his algorithm tends to depend on the values of some quite sensitive tuning parameters, and that different people can get wildly differing degrees of success depending on their configuration and the types of spam/ham that they receive. Gary Robinson wrote an interesting critique of Paul's algorithm and helped the spambayes team incorporate his so-called chi-squared combining scheme (which apparently isn't bayesian at all). It doesn't seem to depend so much on 'magic' numbers, and their testing framework showed that it works surprisingly well for both small and large sets of messages.
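For the curious, here is a rough sketch of what chi-squared combining looks like as I understand it: per-word spam probabilities are combined with Fisher's method, once as evidence for spam and once as evidence for ham. This is an illustration written from that description, not code taken from spambayes, and the names `chi2q` and `chi_combine` are my own.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with `dof` degrees of
    freedom exceeds x2. Only valid for even `dof`."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(word_probs):
    """Combine per-word spam probabilities (each strictly between 0 and 1)
    into one score: near 1 is spam, near 0 is ham, near 0.5 is unsure."""
    n = len(word_probs)
    if n == 0:
        return 0.5
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in word_probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in word_probs), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0


print(chi_combine([0.99, 0.97, 0.95]))  # strongly spammy words
print(chi_combine([0.99, 0.01]))        # conflicting evidence
```

One nice property is that strongly conflicting evidence lands in the middle of the range instead of being forced to one extreme, which gives a natural "unsure" category.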

It's still under active development although most of the ongoing work is centered around the user interface components (POP proxies, Outlook plugins, etc...) whereas the actual spam classifier hasn't changed much in a while.

Well worth looking into if you're getting too much spam. Who isn't?

Tsk Tsk CmdrTaco by Kernel+Corndog · 2003-01-18 03:26 · Score: 2, Informative · on Spammers Busted

I thought for sure, with as much spam as you get, you'd be the first one to try out the bayesian mail filters that Paul Graham wrote about. One of the ones he suggested was CRM114. With a reputed catch rate of 99.8%, do you really not want to try it that much?

Re:Do we need this? by Anonymous Coward · 2003-01-15 07:57 · Score: 0 · on Carping Over Creative Commons

To some extent, it's true that consumers want quality in books, and publishers try to provide this, but I don't think it's the main reason for publishers to exist. Heck, use a spelling-checker and spend a day putting it into DocBook and you've got more than half of those issues covered for free.

Plus, with a few notable exceptions (Knuth comes to mind), the books on my bookshelf have errors. Lots of them. Not just minor typos, either, but blatant lies, especially in technical books. If anything, the error rate in published books is higher than, say, HOWTOs or blogs.

Plus, files on the web can be corrected easily. My technical books are full of scribbles in the margins with corrections I've had to make.

The main thing publishers are good for, to me, is putting books on paper. The fact that people (including me) buy books that are available for free on the internet is a testament to this. Paul Graham's excellent book "On Lisp" is available for free on his website, yet he has a list of a couple dozen people who want to buy a paper copy (it's out of print).

Re:The evolution of languages by Anonymous Coward · 2003-01-12 17:15 · Score: 0 · on The D Language Progresses

First, if you've only done C/C++/Java and the like, you're really missing out. Go learn some real languages. :-)

As to why to design new languages, I think Paul Graham put it rather nicely, so I'll leave it at that.

...presumably do all that by writing header files, includes, classes, etc...

In a language like C, that's not true at all. Suppose you wanted to make a new control structure -- like Pascal's do-while loop. There's really no way to do it (short of writing your own interpreter, or embedding a C compiler in your program).

But in other languages, it's not only possible, it's common. If I want a new loop construct in Lisp, I can just write it. Lisp is written in Lisp, so my control structure will feel just as "native" as the standard ones. There are some things I've seen in Lisp (like anaphoric macros) that I couldn't even imagine how to begin writing in any other language.

I recall Chuck Moore saying something to the effect of "Imagine if in your OO programming language, you had to do a function call to access a member variable; that's how clumsy a lot of languages feel to a Forth programmer."

Re:what about by merriam · 2003-01-10 03:23 · Score: 1 · on Top Ten Software Innovators?

Software innovators, right?

tim berners-lee

Ahead of Ted Nelson? Anyway, that's more information and networks than software.

alan turing

Of course.

larry wall

That's five moderated mentions so far of Wall. Here's one of John McCarthy. Quoting Paul Graham:

In 1960, John McCarthy published a remarkable paper in which he did for programming something like what Euclid did for geometry.

I think Wall would probably mention McCarthy.

bill gates ??

I call that business.

steve wozniak jay miner

hardware

Re:EFF said it better by F452 · 2002-12-31 03:27 · Score: 1 · on The Spam Problem: Moving Beyond RBLs

The MIT conference is likely to be a failure because the organizers are only presenting the tried and failed filtering approaches of the past. Those approaches are now well understood, they can mitigate the problem but can never do more than that. Filters suffer from reverse network effects, the more widely used they are the greater the incentive to program around them.

I think they will talk a lot about using Bayes, which I don't think has been widely tried with respect to email filtering.

Spammers are already trying work arounds to get past statistical filtering, but I don't know if they'll be as successful.

Spam with images? They'll have to embed them in html which is itself a red flag.

Like Paul Graham said in that article, the spammers can't hide their message very well.

Re:Programming "Career" by richieb · 2002-12-26 10:22 · Score: 3, Insightful · on Engineering Careers Short-Circuiting

I don't think programming Emacs plugins is all that important personally. Lisp is only really of use in the AI field.

You're talking esoterica and dusty cobwebbed corners of the field -- not anything that 99% of engineers will ever need to know.

Thanks. That was exactly the answer I was expecting. I suggest you take a look at this article to start with.

talented people by n2kra · 2002-12-23 05:06 · Score: 1 · on Recruiting Help for Open Source Projects?

how do I attract talented people?

try a LFSP

Re:Using design patterns by jorleif · 2002-12-19 01:55 · Score: 1 · on PHP5 Coming Soon

You are primarily speaking about design where Paul Graham was speaking of actual code. It should be noted that even though they are called _design_ patterns, they are actually patterns in program code. The point he was trying to make in this article was that regularities in code are a sign of the human compiler at work, and that means that the programmer is doing the job that the compiler should be doing. As an example of how it would work, consider the Singleton pattern you mentioned. Instead of doing something like this:

public class Foo {
    private static Foo instance = null;
    private Foo() { }
    // Create the single instance on first use, then reuse it.
    public static Foo getInstance() {
        if (instance == null) {
            instance = new Foo();
        }
        return instance;
    }
}

You should be able to write an abstraction Singleton declaring something similar to the above code (this would be the macro Paul Graham was recommending). After this you could write:

public class Singleton Foo

(or something similar). Now this way the Singleton code would not be duplicated. Note that I personally do find design patterns a very powerful technique, and am not trying to dis them like Paul Graham was. However I would find it very neat if one could define Singletons as I described above.
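As a rough illustration of writing the abstraction once and then just declaring it, here is how the idea might look in Python, where a class decorator can stand in for the macro. The `singleton` decorator below is a hypothetical helper of my own, not something from Paul Graham's article, and it is only a sketch (it ignores thread safety, for one).

```python
def singleton(cls):
    """Hypothetical decorator: the decorated name always yields one shared instance."""
    instances = {}

    def get_instance(*args, **kwargs):
        # Create the instance on first use, then keep returning the same one.
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return get_instance


@singleton
class Foo:
    def __init__(self):
        self.value = 0


a = Foo()
b = Foo()
print(a is b)  # True: both names refer to the same object
```

The Singleton bookkeeping lives in one place, and each class that wants it needs only the one-line declaration.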

Re:Comments from the co-author of Water by Xandis · 2002-12-02 20:17 · Score: 1 · on Water, a Newish Web Language Out of MIT

I think making money is a very good thing and I have no preference for open or closed source products but when I see these consulting fees:

Consulting Services
Architect: $3,000 per day
Programmer: $1,500 per day
Jumpstart Workshop: $20,000 for one-week, up to 6 developers

and the other ways money is to be made ($100,000 for joining a committee!!!), I get the feeling that you guys are out to line your pockets by faking out stupid corporations who will waste gazillions on anything they don't understand. I get that curl feeling...

Good luck though since you will be competing with Sun and Microsoft.

I leave this bit to ponder from http://www.paulgraham.com/langdes.html:

"If you look at the history of programming languages, a lot of the best ones were languages designed for their own authors to use, and a lot of the worst ones were designed for other people to use.

When languages are designed for other people, it's always a specific group of other people: people not as smart as the language designer. So you get a language that talks down to you. Cobol is the most extreme case, but a lot of languages are pervaded by this spirit.

It has nothing to do with how abstract the language is. C is pretty low-level, but it was designed for its authors to use, and that's why hackers like it.

The argument for designing languages for bad programmers is that there are more bad programmers than good programmers. That may be so. But those few good programmers write a disproportionately large percentage of the software.

I'm interested in the question, how do you design a language that the very best hackers will like? I happen to think this is identical to the question, how do you design a good programming language?, but even if it isn't, it is at least an interesting question. "

-----

Re:Tey would but unlike /. editors they actually h by ArnoZ · 2002-11-28 11:30 · Score: 1 · on When Personalization Runs Amuck

How about using a statistical approach similar to Bayesian filters to detect duplicate stories?

LISP, methinks by euroderph · 2002-11-26 08:46 · Score: 1 · on Has Software Development Improved?

After ten plus years of coding in 8080 ASM, PL/I, FORTRAN, C, and C++, I've settled on Java as the tool of choice. But it's only recently that I've dug into LISP, and I'm fascinated by what I find. Read the story about the Yahoo online store that two guys wrote, but then they hired a dozen more people so the VC's wouldn't ask stupid questions like "How could only two guys possibly write something so excellent?". The story can be found here.

I don't know how well LISP stands up in a production environment -- in that I have no idea how easy it is (or is not) to read and understand and maintain someone else's LISP code -- but just for the perspective and the opportunity to get a revelation, I'd advise giving LISP a serious shot.

Re:Database? by stevenp · 2002-11-20 22:46 · Score: 2, Informative · on SpamArchive.org Launched

The learning mechanisms for detecting spam, like Bayesian classification, require a large number of messages to build a good spam detection profile. The average 500 message JunkMail folder is not big enough for the purpose.

A different way to filter ads? by Alethes · 2002-11-19 09:46 · Score: 2 · on Browsers Which Protect Your Privacy?

What if you could have automatic ad filtering work just like spam filtering using the Bayesian classification technique?

Re:I don't even use email anymore by ichimunki · 2002-11-19 03:02 · Score: 1 · on Email (As We Know It) Doomed?

Whitelisting is not a good solution. Not even remotely-- unless you figure out a way to authenticate unknown users that you do want to hear from (which I think is likely to lead to a proliferation of methods for doing this and create a lot of confusion). What is a good solution is Paul Graham's solution using probabilities. Now I don't know if you can hook it into MS Outlook/Exchange easily, but every sensible email solution I've seen would easily allow for this kind of filtering... maybe a selling point for moving people away from the one email client that causes more problems than it ever seems to solve.

Re:Great, more censorship by I+Am+The+Owl · 2002-11-17 15:53 · Score: 2, Informative · on As the Spam Turns

Bayesian filters are not "nearly perfect."

Really? You mean blocking 995 out of 1000 isn't "nearly perfect"? 99.5% seems pretty damn close to perfect to me...

i'm happy this is starting to appear... by zonker · 2002-11-14 08:11 · Score: 0 · on Mozilla Adding Spam Filters

...in clients. i'm a big fan of popfile but i'm looking forward to a day when eudora etc. will perhaps use some kind of bayesian or adaptive latent semantic analysis filtering techniques in addition to their current methods.

Sort by Spam Probability by Krellan · 2002-11-14 07:38 · Score: 5, Insightful · on Mozilla Adding Spam Filters

It seems too many people distrust spam filters because of the chance of accidentally blocking an important legitimate message as if it were spam.

Many spam filters are strictly binary: a message is either spam, or not spam. This is not ideal, because "gray area" messages - between these two extremes - will likely not be sorted correctly.

I propose adding a new sort option to email clients.

Sort by Spam Probability

This would be an additional field that can be displayed in a message list, similar to "To", "From", "Subject", and the like. As in the article, probabilities would range from 99% (almost certain spam) to 1% (most likely an innocent message). Notice that 100% accuracy either way is not claimed.

This way, the user can see up front the messages that are most likely not spam. The spam messages will be relegated to the bottom of the list, possibly colored to indicate their likelihood of being spam. If there is a message in the "gray area", it will most likely appear in the list between the legitimate messages and the spam, so the user will have a chance to see the message and make a decision, without the message being lost in the shuffle.

This would be a great feature. I hope this gets into Mozilla's mail client.
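A minimal sketch of the proposed sort, assuming each message already carries a spam probability from some classifier. The `Message` class, the `band` thresholds, and the sample inbox are all made up for illustration; nothing here is Mozilla code.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str
    spam_probability: float  # assumed to come from a classifier, 0.01 to 0.99


def sort_by_spam_probability(messages):
    """Least spam-like first, so likely-legitimate mail tops the list."""
    return sorted(messages, key=lambda m: m.spam_probability)


def band(probability, ham_below=0.2, spam_above=0.9):
    """Coarse label a client could use for coloring; thresholds are arbitrary."""
    if probability < ham_below:
        return "legit"
    if probability > spam_above:
        return "spam"
    return "gray"


INBOX = [
    Message("offers@example.com", "Act now", 0.99),
    Message("friend@example.org", "Lunch on Friday?", 0.01),
    Message("list@example.net", "Newsletter you may have signed up for", 0.55),
]

for m in sort_by_spam_probability(INBOX):
    print(f"{m.spam_probability:4.0%}  {band(m.spam_probability):5}  {m.subject}")
```

The gray-area newsletter ends up between the friend's mail and the obvious spam, which is exactly where the user has a chance to notice it.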

(BTW, another feature that would be great to see in mail clients would be datestamping of the actual time the message was downloaded. Many spammers, and innocent people with misconfigured clocks, send emails with wild dates that are not to be trusted. You can see this in yearly archives of GNU "mailman" mailing lists! Datestamping emails as they are downloaded will also keep mailboxes in order when sorted by date, as newly arrived messages will always be at the bottom, instead of being scattered throughout the inbox. But sorting by spam probability will probably become more popular than sorting by date....)

Re:Microsoft's Patent by McFly777 · 2002-11-14 07:09 · Score: 2 · on Mozilla Adding Spam Filters

I hope somebody can find some prior art on this. I just (quickly) read the claims and body of the patent and it sounds very much like the techniques that have been described here previously.

Unfortunately, the Patent was issued in Dec 2000; the first time I heard this idea was the Paul Graham implementation in the last few months.

So, if this is all old hat to anyone out there, please do everyone a favor and find that prior art and let everyone know, so that, in 5 years when MS tries to enforce this patent, there is a defense.

------

I accidentally posted this as an AC (score:0) so I am reposting it, but in the meantime another AC post claims to have some prior art. According to that AC, this article (fixed link) may be helpful. I couldn't read it myself as the Full Text requires ACM membership. Perhaps somebody with access could take a look and review its potential applicability as prior art. (i.e. Does it explicitly mention using Bayesian techniques to filter spam?)

Re:Microsoft's Patent by DaveAtFraud · 2002-11-14 06:50 · Score: 5, Informative · on Mozilla Adding Spam Filters

This is from Paul Graham's site with regard to the Microsoft patent. Patents tend to be very narrow in scope such that, if some aspects change, the patent may no longer apply. Pick on any typical consumer product such as hair dryers, stereos, you name it. They all have patents and they're all different and they don't "infringe" on each other unless they're virtually identical.

"Bayesian filtering" aka "Naive Bayes" by ghamerly · 2002-11-14 06:27 · Score: 5, Informative · on Mozilla Adding Spam Filters

This approach is more commonly called "Naive Bayes" classification in the field of machine learning. It is naive because it considers each word to be a feature (dimension), but it also considers each word in an email to be conditionally independent of all other words in the document (which is not true, but really useful in practice).

The author of the web page on using this technique to classify spam (Paul Graham) has a better explanation of Naive Bayes on this web page.

I've written my own naive Bayes classifier to identify spam, with less positive results than he reports. However, naive Bayes can be a very effective technique, and I can believe his results.

The two things you have to beware of when using it are "smoothing" probabilities of words you've never seen (you don't want them to always be zero, as straight naive Bayes will give you), and you need LOTS of training data for naive Bayes to work well. That means that you need to already have a fair amount of spam to identify spam well.
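To make the smoothing point concrete, here is a minimal naive Bayes sketch with add-one (Laplace) smoothing. It is written for illustration only: the class name, the whitespace tokenizer, and the tiny training set are all assumptions, and a real filter would need far more data, as noted above.

```python
import math
from collections import Counter


class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha          # smoothing constant (1.0 = add-one)
        self.word_counts = {}       # label -> Counter of word frequencies
        self.doc_counts = Counter() # label -> number of training documents
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        """Log of prior times likelihood for each label (unnormalized)."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for w in text.lower().split():
                # Smoothing keeps unseen words from forcing the probability to zero.
                score += math.log((counts[w] + self.alpha)
                                  / (total_words + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


nb = NaiveBayes()
nb.train("spam", "buy cheap viagra now")
nb.train("spam", "cheap pills buy now")
nb.train("ham", "meeting notes for the lisp project")
nb.train("ham", "lisp macro patch attached")
print(nb.classify("cheap viagra"))
print(nb.classify("lisp patch"))
```

Working in log space avoids underflow when a message has many words, and the per-word sum is exactly where the (untrue but useful) independence assumption shows up.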

You can see a paper I wrote on using naive Bayes to classify hard drive failures here, or look for more stuff on naive Bayes on Google. Also, don't reinvent the wheel: Andrew McCallum has written a very good toolkit for doing these sorts of things in Bow.

Re:Filtering by Gabe+Garza · 2002-11-14 06:04 · Score: 5, Informative · on Mozilla Adding Spam Filters

Actually, using only the body isn't just a hack, it's a relatively new technique invented by Paul Graham that seems to produce excellent results. It makes a lot of sense: Spam is Spam because the body contains commercial or otherwise unwanted material--it's only natural that the most direct and accurate Spam filters are going to analyze the body. Bayesian classification like this is computationally tractable and appears to work. You can read more about it here.

Aaaarrrggghhh!!! by upper · 2002-11-04 15:31 · Score: 2 · on Working Bayesian Mail Filter

The patent covers any method at all like Paul Graham's method. He discusses the patent here.

The patent claims boil down to using a probabilistic classifier to recognize spam. There are many claims, but they're mostly trivial elaborations. Probabilistic classifiers aren't new, and there's no claim they invented them. And it doesn't look like they had to solve any real technical hurdles to apply it. It's one of the most egregiously obvious patents I've seen in a while.

I say there's only one way to test whether an idea is obvious to people skilled in the field, and that's to pose the problem to people skilled in the field and see if they can find the solution. Anything less is a joke.

Not to diss Horvitz and Heckerman -- they're big names in Bayesian inference and Bayes nets. They've been behind a bunch of solid research.

Yahoo! Mail by sfe_software · 2002-11-03 07:48 · Score: 2 · on Working Bayesian Mail Filter

No one has mentioned it so far, but Yahoo mail has a Bulk Mail folder. SPAM is automatically sent there, and I have yet to see a single false positive (and false negatives are quite rare as well).

The system works surprisingly well. I checked the FAQ and it doesn't go into any detail about how it works, but I wouldn't doubt if something like this is being used.

I've been thinking, and it seems that this could potentially have a lot of use, aside from Spam filtering. Perhaps a mail client could let you categorize email in general (SPAM, Business-related, forwarded stuff from AOL users, etc), and learn how to spot and organize things.

I'm putting this (either the POPfile or bogofilter) into place with a modified SquirrelMail, just to give it a good run; I might try and modify it to also categorize other types of email, just to see if something like that could work.

I could easily see a mail client (web-based or otherwise) that lets you drag mail to specific folders, and eventually learns how to do this for you (and of course you can always correct it by simply dragging to another folder, which also contributes to the learning process)...

After reading this article my mind is just spinning with ideas... Bayesian search engines... perhaps speech/voice recognition applications... classifying text/html/doc files... organize songs (processing the lyrics)... ugh, I should stop now :)

Re:Whas that? by sfe_software · 2002-11-03 07:02 · Score: 5, Informative · on Working Bayesian Mail Filter

If you had just clicked the POPFile link, you would see the explanation.

I also highly recommend this link, as it goes into quite a lot of detail on this filtering technique. After reading it, I am going to give the Perl variation a shot.

Re:I've tried many things by BitwizeGHC · 2002-09-20 11:37 · Score: 2 · on David Sorkin on Internet Law and Spam

Use this. But on a large scale. Perhaps by convincing ISP's to install Bayesian filters on their mail relays? The spam gets silently dropped, and the good mail goes through. No need for the kind of sabre-rattling and politics that accompanies a blacklist plan.

Bayesian vs not isn't really the point by XDG · 2002-09-17 09:18 · Score: 4, Insightful · on More on Bayesian Spam Filtering

Gary is both right in some respects and irrelevant in others. Here's the key line in his article that deflates it a bit:

It is untested as of now. It is based purely on theoretical reasoning. If anyone wants to try and test it in comparison to other techniques, I'd be very interested in hearing the outcome.

On the other hand Paul Graham has actually tested his model and it works. I've worked it up in perl and tested it on my own data set and it works there, too. Paul acknowledges that he's being a bit fast and dirty, but the proof is in the pudding. The rest is just academic quibbling over the fine points.

I'm not sure why this particular article needed to be posted, as it's just one of several alternative approaches and an untested one at that. On Paul's page, he also lists several published academic papers with other alternatives -- all actually tested, of course.

Gary is basically right in questioning the use of the word "Bayesian". Paul's approach is more about weighing "evidence" as given by the appearance of certain words, rather than in figuring out the probability of spam assuming a "prior". See Paul's explanation, but if you check the article he references at the end, you'll note that the method Paul uses is only one of several methods to solve an underspecified problem. It's a reasonable guess, not necessarily the only guess.

This is similar to Paul's approach except for including a "prior" assumption of p(spam) -- the expected probability of any email being spam, calculated from the historically observed frequency of spam. By leaving it out, Paul implicitly assumes that 50% of mail is spam -- that's his "prior" estimate of the spam rate. Given the other adjustments he makes to his sample, that appears to be acceptable in practice. (Paul overweights the spam prior, but also overweights the effects of "good" words.)

I'd personally prefer to overweight the "good" e-mails entirely rather than just put a "good-multiplier" on them like Paul does, but that's just quibbling over small bits.

As to the bit that Gary raises about Paul assuming a spam probability for an unknown word -- Paul originally said .2, then revised to .4, but really should have put it at .5 or just excluded it from all calculations. A new word has no robustness as a predictor (which is why Paul dropped words that didn't appear five times anyway). In practice, a new word at .4 isn't going to be among the 15 most interesting words to make the calculation from, anyway.
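To see why the unknown-word value hardly matters, here is a sketch of the combining step as I read Paul's article: take the most "interesting" words (those furthest from 0.5), then combine their probabilities. The function names, the explicit `prior` parameter, and counting each distinct token once are my own assumptions; with `prior=0.5` the prior cancels out, which is the implicit 50% assumption discussed above.

```python
def interesting(word_probs, tokens, limit=15, unknown=0.4):
    """Pick the `limit` probabilities furthest from the neutral value 0.5.

    word_probs: dict of word -> spam probability (assumed clamped to 0.01..0.99).
    Words never seen before get the `unknown` value.
    """
    probs = [word_probs.get(t, unknown) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return probs[:limit]


def combine(probs, prior=0.5):
    """Combined spam probability, treating each word as independent evidence."""
    spam = prior
    ham = 1.0 - prior
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# Made-up word probabilities for illustration.
WORD_PROBS = {"viagra": 0.99, "remove": 0.95, "gnu": 0.01}
tokens = "please remove me from the viagra list".split()
print(combine(interesting(WORD_PROBS, tokens, limit=2)))
```

With a limit of 2, the five unknown words at .4 are all pushed out by "viagra" and "remove", so whether the default is .4 or .5 makes no difference to the result. Passing a different `prior` shows the effect of an explicit p(spam) estimate.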

-XDG

Bayesian anti-spam filters by abischof · 2002-09-12 02:04 · Score: 3, Interesting · on Mozilla 1.2 Betas Start Flowing

Remember that Slashdot article on Paul Graham's method of spam blocking through Bayesian filters?

In case not, the basic idea is that spam can be fairly reliably detected through statistical analysis of word choice. For instance, a message containing the word "GNU" probably isn't spam, while one containing "remove" might just be (but see the write-up for more detail).

Anyhow, there's been a bug filed requesting Bayesian filtering for Mozilla. If you're interested in the feature, you may wish to vote for the bug (of course, you'll need a free Bugzilla account to vote).

Ever heard of open standards? by leandrod · 2002-09-10 21:23 · Score: 2 · on Mozilla Rising ... As A Platform

POSIX, X Window System, NFS, LDAP, GTK+ and Gnome.

All of these can be run on any platform, providing a cross-platform, single-login environment. And throw in Scheme and Common Lisp for languages even more powerful and high-level than Java or C#.

Substitute or add C++ and wxWindows or Qt and KDE or Objective C and GNUStep or whatever you like for Lisp, GTK+ and Gnome if you don't like copyleft or too much openness or multiple languages. Why, you can even use Java or .Net now.

Even MS had an open standards strategy to migrate all users to Xenix, before it realised it had power enough to get users into a proprietary lock-in.

See Fink for Mac OS X. It's based on Debian, and installs all the missing parts of open standards support on Mac OS X. Granted, it would be more difficult to do on MS W32, but not impossible.

CygW32 is already part of the answer; refine it, rework it for dpkg, integrate it better with MS W32 -- especially by making X Window get its configuration from the registry and integrate its windows into the MS W32 desktop -- and you have everything Mozilla is supposed to do, but better, faster, more powerful. And native.

You need to learn basic software engineering by Anonymous Coward · 2002-08-27 06:35 · Score: 5, Insightful · on Why are Businesses Willing to Spend More for Software?

Several of your items sound reasonable, but are actually foolish.

For instance number 2 on your list is that the more developers that you have working on a project, the more likely it is to be completed and delivered on time. In fact software engineering literature from The Mythical Man-Month on comes to exactly the opposite conclusion. Adding bodies to software development creates basic infrastructure problems that make the project more difficult to accomplish, let alone to accomplish in a timely manner.

Real world project data supports that. Most large projects fail. Larger projects fail worse, more often. At one point Sun had an internal rule that no project would be accepted that was supposed to take more than 6 months or cost over $1,000,000. The odds of failure were just too high.

Now of course a given project has minimum realistic needs in terms of how many skills are required, and how much work will be needed. There is a minimum team size for a given project. But for the optimal productivity and probability of successful completion, you really want a team that is close to that minimum.

And that gets us into your technology decisions. Agreed that when you are trying to bid on a contract for a client, you do what the client wants. If the client wants a series of unproductive technologies, you have to increase your development costs because of your projected unproductivity, but that isn't the time to sell them on the benefits of their letting you work more productively.

However at this moment we are not dealing with an RFP. So I am going to point out to you that if there are real software engineering reasons to want development teams that are as small as feasible, then there is good reason to want a development environment that makes developers more productive and therefore reduces the minimum size of development team that you need for your projects. I won't say which tools and what environment that is because the answer depends on a lot of hard to evaluate factors (though I admit to thinking that your choices "leave room for improvement"), but I can point to Beating The Averages for a sample essay showing how much of a difference it can make. And I cannot emphasize enough that it is a very powerful experience for a developer when they see first hand what kind of difference it makes for them to be in a productive environment.

That said, there is a good business case for not experimenting with your development environment. It is one that puzzles most techies, so it is worth explaining.

If you start making changes in areas that your company does not personally have a lot of experience in, then some of your decisions will be good and some will be bad. The problem is that the bad ones will cost you far more than the good ones win you - you are essentially gambling on your ability to pick correctly in an area that you know nothing about.

Therefore outside of areas of strong company expertise there is a lot of pressure to try to make similar decisions to your competitor. Those are no more likely to be good or bad than trying to make your own choices would be, but they have the decided advantage that you won't accidentally choose badly where your opponent chose well, in an area that turns out to be the deciding factor.

Incidentally this is a principle that explains the advice offered by Paul Graham in Revenge of the Nerds. In that special case Paul is talking about how a nerd should take advantage of their area of expertise. And the answer is to pick a line of business to which your special knowledge applies, go into that business, and let your correct decisions tell. But when you do that, do not simultaneously attempt to rethink every other area of business that you must deal with, because you will get a lot of that wrong!

Re:My Question by yankeezulu · 2002-08-26 05:16 · Score: 2, Informative · on Ask Larry Wall

There are others who would say the same about Java... take a look at what Paul Graham (Lisp master) has to say about Java in item 2: Java's Cover, by P. Graham

Re:Another idea! Need repository of spam by GloomyTrousers · 2002-08-25 02:07 · Score: 1 · on Paul Graham on Fighting Spam

He (Paul Graham) has got a few links to other people's spam collections

Why bother? by Howl · 2002-08-21 19:39 · Score: 1 · on Politicians Seek Spam Loophole

All this challenge and whitelist business is simply not needed.

Use bogofilter - it uses Bayesian statistics to filter spam and it's very good. See also A Plan for Spam by Paul Graham.

Graham's Plan for Spam by Wise+Dragon · 2002-08-20 03:08 · Score: 2 · on Haiku vs Spam

Graham's Plan for Spam

Bayesian Algorithm

Re:As requested by ichimunki · 2002-08-20 02:59 · Score: 2 · on Haiku vs Spam

Maybe want to try this:
Paul Graham has a spam plan
Statistics don't lie?

Slashdot Mirror

Domain: paulgraham.com

Comments · 1,105