Domain: paulgraham.com
Stories and comments across the archive that link to paulgraham.com.
Comments · 1,105
-
Re:What about blind people?
That's possible, but difficult. The bogus tags themselves reveal why that's so. They are not valid HTML, but they have the form of valid closing tags. Though I don't know the pre-XML (read fairly current) HTML spec very well, and being too lazy to look it up at this hour, I nevertheless seem to recall that it says browsers should ignore tags they don't recognize. In any event, browsers are notoriously liberal about what they will render, so as to make the "user experience" nicer, and the job of standardization impossible. 8) All of this makes it tough to strip out bogosities. However I think that it's a requirement to do that if Bayesian filtering is to survive the current round of slime-bucket SPAM-mongering countermeasures.
The other countermeasure I've seen get through SpamAssassin is stuff like this:
Hey, how's it going? You know, you were right about <a href="slime-sucking-spam-site.com">that site!</a> They <em>do</em> have erection meds for much less. How do you think they get away with it?
Cheers,
Your low-life SPAM-sluicing buddy.
This was predicted in Paul Graham's original Plan for Spam. Quoting:
To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character.
There's still grist for the Bayesian mill in messages like the example, but it's thin grist, indeed. -
Re: Go to spam sites and check them...
junkgoof wrote:
Good points, actually. I wonder why you were modded to 0?
Thank you. It's nice to get a reply from someone who can deal with the facts and isn't a raving, foaming-at-the-mouth political nitwit.
My post wasn't modded down; my karma had been damaged earlier in the day by three politically-based attacks on two of my posts by one or more silly children who shouldn't have been entrusted with mod points. One, a post chock full of factual information, was first hit as being "Overrated" at 1, then as a "Troll" at 0, leaving it at -1.
Re:I'd rather have a sales tax than an income tax
Take a look at them and judge for yourself whether either deserved "Overrated" or the second one also deserved "Troll." Better yet, look at the posts to which they are replies and read mine in context.
Being modded down dropped my slashdot karma level to "Bad," which affected the starting score of any new messages I might post. While the nitwit was doing that, I was posting elsewhere in slashdot on the topic of spam, so if you find any value in my comments about spam and Filters that Fight Back (and as far as Paul Graham knows I am still the first and only person on the planet actually implementing FFB), you (and others) might be annoyed that the effect of the political moderation was to reduce the visibility of my messages about spam.
If you search for messages posted by me you will find at least several in which I make the case that Filters that Fight Back is presently the only effective way to carry costs back to those who pay for the spam to be sent. It's not my idea; it's Paul Graham's idea:
Paul Graham is the man who brought us Bayesian filtering in his August, 2002 paper, A Plan for Spam. Many software developers have since incorporated Bayesian filtering in one form or another into email clients and servers. This year he offered new thoughts in Filters that Fight Back, and I've been implementing them.
Along the way I concluded that I don't care whether or not I confirm that my email address is "active." The spammers are already sending me spam inviting me to visit their Websites. OK, I'll visit. I'll visit every URL they send me that looks like a spam Website, and for good measure I'll download the entire site for research purposes. Every URL, every time.
Thanks again for being a real person. BTW, my seppuku sig was not directed at you or at any particular poster. It's a general comment on the frequency of moronic posts. Being out of date or not having kept up to date on the latest in spam technology is not moronic.
-
Re:They're annoying
druske wrote:
Unfortunately, this technique would encourage the "click this link" sort of spam, where the spammer gets paid as an affiliate of some website.
First, payments for "click-throughs" have pretty much died because the same people who would spam you would also generate false clicks with the same lack of scruples. Now and into the future, "click-throughs" will only generate fees or commissions if they result in completed sales.
Second, putting images and phony text into spam makes it all that much easier to identify and filter out.
druske also wrote:
I like Bayesian filtering as well, though it needs to be smarter about the insertion of HTML comments in the middle of words (Viagra), punctuation (V'i'a'g'r'a), additional spacing (V i a g r a), etc. to get around the latest bag of tricks.
Then you don't understand Bayesian filtering, at least not as it was proposed by Paul Graham, who brought it onto the scene 15 months ago. Bayesian filters, properly implemented, love peculiar constructs because those never occur in legitimate email.
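To make that point concrete, here is a minimal Python sketch of a per-token spam probability, loosely following the rule described in A Plan for Spam (good counts doubled, tokens seen fewer than five times ignored, results clamped to .01 and .99). The function name, the token counts, and the corpus sizes are made up for illustration; this is not code from any real filter.

```python
def token_probability(word, good, bad, ngood, nbad):
    """Return a spam probability for one token, or None if it is too rare.

    good/bad map tokens to occurrence counts in the ham and spam corpora;
    ngood/nbad are the number of messages in each corpus.
    """
    g = 2 * good.get(word, 0)  # good counts are doubled to bias against false positives
    b = bad.get(word, 0)
    if g + b < 5:              # too little evidence to say anything
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


# Hypothetical counts: the obfuscated token shows up only in spam.
good = {"meeting": 40, "free": 10}
bad = {"v'i'a'g'r'a": 12, "free": 60}

p_obfuscated = token_probability("v'i'a'g'r'a", good, bad, 1000, 1000)
p_meeting = token_probability("meeting", good, bad, 1000, 1000)
p_free = token_probability("free", good, bad, 1000, 1000)
```

With these invented counts the obfuscated token lands at the .99 ceiling precisely because it never appears in legitimate mail, while an ordinary word like "free" ends up somewhere in the middle.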
RFTA:
A Plan for Spam
-
Re:Jess
What turned me off Java was not the fact that it had large libraries, but the way the libraries seemed bolted on. I realize that having clear separation between the libraries and the language syntax is a good idea technically, but programming languages are for people. System.out.println() is perfectly understandable, but a convenience function would be nice---or even a "print" statement.
As libraries get big (as they certainly have in Java), it gets harder to find what you're looking for in them. A big problem in language design is making big libraries convenient and intuitive. Java looks like it hasn't tried all that hard.
Jess looks like a layer of abstraction, and abstraction is generally a Good Thing. For example, HTML templating systems are often more pleasant than a bunch of perl scripts with the page hard coded. Hey, if it lets you take complex logic out of Java, I'd say it's good.
-
Re:Bayesian filtering
How does heavy blocking of "spam friendly countries and ISP's" serve to deter more spam? I imagine that can only happen if such blocking becomes ubiquitous, and in the same way, if content-based Bayesian filters that fight back become equally ubiquitous, that would serve as an even stronger deterrent, without the same kind of collateral damage that accompanies blacklisting.
Frankly, the only serious long-term solution I can see for the problem of spam is to totally redesign SMTP to provide at the very least strong authentication of mail servers. Until then, IMHO content-based filtering is still a far better interim solution.
-
Re:innovation
Production systems aren't supposed to innovate; they're supposed to implement the ideas that have been proven in more experimental systems, often in a somewhat dumbed-down/weakened/safer version. It would be pretty shocking if any major vendor implemented any radical, unheard of, unproven programming concepts into a commercial system.
As a good example, look at garbage collection. How many years had non-mainstream languages been doing it well before it was adopted into 'serious' languages used by a significant number of people doing 'real work'? If you want more examples of mainstream programming systems ignoring proven innovations just go to Lisp fanatic Paul Graham's webzit, even though he sounds like an Amiga user in many cases. -
Re:Politicians for Ya
"The battle on spam must be fought on all available fronts..."
I disagree. There is only one front that spam can be fought on successfully, and that is economic.
When spam is no longer profitable, spammers will give it up voluntarily. As long as it is profitable (especially as profitable as it is right now) people will continue to do it, regardless of one law or a thousand.
Each law passed to try to stop it restricts someone's freedoms, and it's pretty much unavoidable that the law will restrict some who do not deserve such restriction. If that would eliminate spam, that might be considered a reasonable tradeoff. Since it will not, each and every law passed solely on the basis of "We've got to DO something!" is a bad idea.
The best thing I've seen so far is A Plan for Spam. I tend to agree with the author's assessment of how likely it is to work, specifically because it destroys the economics of spam.
I think most of our collective energy should be going to integrating things like "A Plan for Spam" into common email programs, possibly extending it to allow people to join anti-spam clubs, so they don't have to label the thousand or so emails the thing needs to train with themselves.
Simply having a button that says "Spam!" will help a lot of people deal with the frustration involved with spam, and if it is pitched to the general public as "Every time you kill an email as spam (instead of just deleting it) you're helping to put spammers out of business," you're going to have to beat them off with a stick. People would love to have a way to get back at spammers.
-
Popups
After seeing a lot of complaints about popups/popunders, I couldn't help but quote Paul Graham:
"In this scenario, spam would, like OS crashes, viruses, and popups, become one of those plagues that only afflict people who don't bother to use the right software." -
Re:Yahoo!
They are not switching from C; they are switching from YScript!, a proprietary Yahoo language.
I think that the most highly evolved languages are the languages that are most likely to succeed. Check this article out, it describes how programmers at Viaweb leveraged Lisp as their secret weapon in the bidding wars for Yahoo. Lisp can be considered a scripting language, but it conquered all other forms of compiled competition in this case.
Anyways, I think that PHP's approach towards the integration of generated and static content smacks of previous-generation languages (namely ASP.) ASP.NET and JSP have evolved the paradigm, with their clear distinctions between compiled code and html. I find ASP.NET with C# to be SO much easier to maintain than the legacy ASP scripts at my company.
Also, it's nice being able to hand pages to the graphics design department without worrying about them deleting important chunks of code. It's a problem when they see the page icon in Dreamweaver and mistake it for a graphic. They hit "delete". There goes Important Function #3! Arrgh!
-
/. needs TrackBack
Heh, I just posted something about comment spam and a possible solution to my website...
So what else can be done about it? I'm surprised no one has mentioned Bayesian filtering of comments. Like most people who've heard of it, I first found out about Bayesian filtering from A Plan for Spam, and how it can identify spam. Since then virtually every spam blocking system has started using Bayesian techniques for at least some part of identification.
Read the rest... -
Re:Sad
Wi-Fi doesn't spam children with hardcore porn. Perhaps you're getting it confused with unfiltered email or certain instant messaging services which are banned from just about every elementary school I know of. Could it be that your real concern is baseless?
-
Last entry in the "problems"
From this page:
Why have email as part of the system? Why not just have a blacklist of spam sites and encourage people to beat on them?
Several people have written suggesting a "DDoS@Home" project of this type. (Two correspondents who shall remain nameless simultaneously invented this catchy name.) But I think mail should remain in the system for two reasons: (a) it tells you which sites to pound, and when, and (b) if you included it as part of a filter, you could get more users.
On the other hand, if some group managed to launch a DDoS@Home project aimed at spammers, that would be enormously amusing. I'd sign up for it.
Sounds like a challenge. So who's going to be the first to post a URL to the SourceForge project page? ;-) -
Re:Maybe it's a pre-emptive patent
Microsoft has patents on Bayesian filters? I was under the impression that Paul Graham came up with the idea of applying them to email filtering, and that Bayesian math itself was pioneered by Rev. Thomas Bayes in the mid 1700's.
-
Re:Spam is bad...mmmkay?
Paul Graham's paper on Bayesian filtering, although incomplete, is a great start to understanding how it all works. http://www.paulgraham.org.
You mean this one?
Several attempts have been made to attack the tokenizer, which is one area DSPAM has a considerable lead on other tools. DSPAM performs several different deobfuscation techniques prior to tokenizing a message.
In other words, spammers have already started to attack bayesian filters (or at least filters that identify keywords) and DSPAM is using techniques to deal with those particular attacks. The bayesian filter didn't automatically learn to defend against the tokenizer attacks--humans had to intervene and write code. And the code they wrote doesn't deal in general with attacks against the tokenizer--it deals with the particular attacks that have been tried so far.
We can both imagine further attacks on the tokenizer, and we can both imagine defenses against those attacks. This is an arms race. It's not a very satisfactory long-term solution.
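As a rough illustration of what "deobfuscation prior to tokenizing" can look like, here is a small Python sketch that undoes three tricks seen in spam: HTML comments inside words, punctuation between letters, and extra spacing. It is only a sketch under my own assumptions, not DSPAM's actual code, and the function name and separator set are hypothetical.

```python
import re

# Runs of single letters, each followed by one separator character,
# e.g. V'i'a'g'r'a or V i a g r a. At least four letters are required
# so that ordinary words like "don't" are left alone.
SPLIT_WORD = re.compile(r"\b(?:[A-Za-z]['.*\- ]){3,}[A-Za-z]\b")


def deobfuscate(text):
    """Undo a few known obfuscations before the text is tokenized."""
    # Drop HTML comments inserted in the middle of words.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    # Rejoin letters that were split apart by punctuation or spaces.
    return SPLIT_WORD.sub(lambda m: re.sub(r"[^A-Za-z]", "", m.group(0)), text)


cleaned = deobfuscate("buy V i a g r a or V'i'a'g'r'a, V<!-- x -->iagra too")
```

Each rule here answers one specific trick, which is the arms-race point: a new separator or a new kind of markup needs a new rule written by a human.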
Even attaching ham messages doesn't quite do the trick, for the reasons I mentioned in my previous email.
I believe the reason you gave was that you thought the "ham"-identifying tokens would be too particular to the individual receiver? Again, I'm not so sure this is true--for example, any filter that I use has to (at a minimum) identify as "ham" almost all email from the linux-kernel mailing list and a dozen other lists on various topics. Any spammer can download the archives of a few big mailing lists and test out their spam against a bayesian filter that passes mail on those lists.
I doubt the ham each of us receives is *that* unique. And if even only 10% of the mail we receive is significantly generic, then this is enough---a spam filter that wrongly identifies 10% of my mail as spam is close to useless to me.
--Bruce Fields
-
Filters That Fight Back
This is really bizarre. There are almost 300 comments on this item and no one has even mentioned Paul Graham's proposal for Filters That Fight Back:
www.paulgraham.com
The idea is to raise the costs of spam to the spammers, if not at the spam sending side, then at the spamwebsite side. Most spam solicits visits to a website. If a relatively small percentage of Net users were to employ Bayesian filters and/or other techniques to identify and segregate spam, then to accept the explicit invitation in each spam to visit one or more URLs provided, and maybe even download the entire sites a few times, the cost of running a spamwebsite server for the tiny numbers of orders they get would rise sharply.
I don't have it completely automated yet. I'm still using filters in my email client, but they are good enough that no spam gets through to my New Mail folder, and a whitelist ensures that there are no false positives in any mail from anyone I already know I wish to hear from. What goes to my spam folder contains a few false positives of people who have never written to me before, but mostly those whose email contains garbage like HTML.
Once a day or so I simply save the cleaned spam folder to a file and ftp it to one of my servers. There, scripts take over and faithfully accept the explicit invitations in the spam to visit their websites.
As more people do this, the traffic will dramatically increase at the spamwebsites, but orders will not increase. At some level or other, either in their server farm or to their upstream provider, those sites pay for bandwidth. As they get bumped up into higher bandwidth pricing tiers, their margins on the small numbers of orders they get from complete nitwits will drop.
Think of it as a servo system: If the level of spam annoys you, set your filter to fighting back. As more people do that, spam will level off and drop. As it drops to a level at which fewer people bother to set their filters to fighting back, an equilibrium will be achieved. There will still be spam, but a whole lot less than there is now. Think mosquitos and birds. Birds control mosquito populations. There are still mosquitos, but a lot less than there would be if there were no birds. Be a bird -- eat spamwebsites.
The weak point in Graham's proposal is that it really needs a universal whitelist to prevent spammers or other malicious third parties from causing massive traffic to innocent websites by sending out spam that provides URLs that are not the spammer's. It's not clear how such a whitelist would work, who would run it, how sites would get onto it (or off, if they turn bad), or whether someone will come up with a neat P2P solution.
It is clear, though, that anyone receiving 20-100 spams a day can easily review the filtered spams or the extracted URLs and simply delete those that appear innocent. Then scripts do the rest.
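The review step can be sketched in a few lines of Python: pull the URLs out of the saved spam folder, drop duplicates, and drop anything on a whitelist so that innocent sites are never touched. This is a minimal sketch, not my actual scripts; the whitelist contents, the sample text, and the function name are assumptions for illustration, and it deliberately stops at producing a list for a human to look over.

```python
import re
from urllib.parse import urlparse

# Crude URL matcher; trailing punctuation may be captured with the URL.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def candidate_urls(spam_text, whitelist):
    """Return unique URLs found in spam_text whose host is not whitelisted."""
    found = []
    for url in URL_RE.findall(spam_text):
        host = (urlparse(url).hostname or "").lower()
        if not host:
            continue
        # Skip whitelisted domains and their subdomains.
        if any(host == w or host.endswith("." + w) for w in whitelist):
            continue
        if url not in found:
            found.append(url)
    return found


# Hypothetical spam text; cheap-meds.example is a placeholder domain.
sample = (
    "Visit http://cheap-meds.example/buy now! "
    'Also see <a href="http://www.paulgraham.com/">this</a> '
    "and http://cheap-meds.example/buy"
)
urls = candidate_urls(sample, {"paulgraham.com"})
```

The whitelist check is the important part: without it, anyone can aim the traffic at a third party just by mentioning that party's URL in a spam.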
-
Re:am I the only one....
Although I never tried "extreme programming" or other buzzwords, I have always programmed in pairs with friends and coworkers. When both of us are "in the zone" I can't really feel any difference between my ideas and their ideas; it's just "ideas" which are implemented in code.
I can't verbalize why I'm writing a code fragment the way I am writing it...
I can and, in fact, I do it to myself all the time. So when I'm pair programming I just speak aloud my personal monologue. The "why" is usually short and interesting, and the "how" can be communicated through code.
While our team might produce more code working separately, I definitely see a good improvement in code quality when we work together on the same code.
While I wholeheartedly agree with having lots of meetings and discussions during the design phase (requirements, functional spec, detailed design) and during the review phase (post mortem, code reviews)
And this is where we see that my style is exactly the opposite of yours. I think detailed design, specs, and code reviews are worth nothing. In my experience all this bureaucracy was never really useful for anything. Detailed design, for example, implies that you know in advance what you will do and what the requirements are, and this has never been the case.
Instead of wasting my time with corporate cruft, I'm much more productive using some language which allows fast prototyping, quick redesign, quick fixing of errors and incremental development.
I guess this goes to show how people code differently :) -
Re:Duh, where do you think the geek-girls go?
BTW - full disclosure: not single (or looking), just trying to find out if you geek-boys really want what you say you want...
(emphasis mine)
See, that's the problem. All (most?) the geek girls are already taken. And so are all the hot girls (and of course all the hot geek girls :P)
Of course, a bigger problem is our lack of social skills. I was talking to a relatively tech-savvy semi-loner girl (someone who is conceivably within my reach), and I literally tripped over my own bag, and threw the Palm in my hand 20 feet while walking and talking to her. Luckily, she didn't say anything about it. But you get the idea...
--Quentin -
Re:50Ghz processors...
Here we come, won't that be great. 10Mfps in Quake4D, milliseconds from start to crash in windows.
Nonsense. What we get is redundancy, and we can actually use it. See, the thing with faster computers is, they allow a greater level of abstraction in programs, both on the programmer and the user side. This has unfortunately not yet happened, since too many programmers stubbornly stick to C and its likes.
Granted, using high-level programming languages does not automatically make programs more stable, but it does give better chances of resuming or recovering from errors, and more importantly, it allows programmers to focus on more important aspects of programs. Best of all: we can use all that extra power for human-computer interface enhancements, such as speech, video or natural language recognition.
Two random links which I'm too lazy to label:
http://www.paulgraham.com/hundred.html
TUNES -
MIT Grad Students vs. Ga Tech Sophomores
JSP is fantastically simpler than "J2EE", which is the recommended-by-Sun way of building applications, but still it seems to be too complex for seniors and graduate students in the MIT computer science program, despite the fact that they all had at least one semester of Java experience in 6.170.
\begin{humor}
That's funny. I had a couple of sophomore co-ops down here at Georgia Tech build a significant J2EE app with JSPs, servlets, and an Oracle back end in a semester. They had only one semester of prior Java experience. Perhaps you should beef up your CS curriculum up there at the "Georgia Tech of the North!"
\end{humor}
In all seriousness, I enjoyed your article. I thought you were too hard on Java, but Java's honor has been amply defended in other posts.
I think you're too critical of your affinity for Lisp. After starting out with Pascal and the C languages (C, C++, Java) and discovering functional programming later in life, I find myself drawn to Lisp (and ML, and ...) - especially for AI. Paul Graham has a pretty good article about the timeliness of Lisp that may make you feel a little better about your "Lisp zealotry."
Happy Hacking!
Chris -
Lisp, Java and C++
There's a pair of books that make interesting reading together. One is Paul Graham's On Lisp. Whatever you may think of his statements against other languages, he knows Lisp and he does an excellent job of explaining how to use it well. I didn't understand how to use Lisp macros effectively until I read it, or why to use them.
Shortly after reading it, I read Modern C++ Design by Andrei Alexandrescu. Reading that, I started to understand some of the power of generic programming. If you understand when these two books are explaining two very different implementations of the same things, then you have grasped the essence of some very powerful techniques.
Frankly, generic programming is one place where Java is still definitely lagging. Fortunately, there is currently an effort to fix that.
Lisp's greatest strength is also its greatest weakness. The language eschews nearly all syntax. All structure within a program that would be expressed syntactically in other languages consists of levels of parentheses and order of arguments to various functions, macros and special forms. There is great power in this. It means that what you add to the language fits in seamlessly. That is the point of Graham's title for his book. Read the first chapter for his explanation.
Unfortunately, this very scarcity of imposed syntax puts a burden on the programmer to format for clarity and to learn to read in a language where some of the familiar signposts simply are not present. That task is certainly possible, but it puts many people off. -
Java's Cover
The blog seems to be down, but in case anyone was interested in a similar story:
Paul Graham (of Bayesian filtering and Lisp fame) wrote an excellent article called Java's Cover.
It is about why he thinks Java is bad technology -- despite never having used the language. Very interesting read.
Thomas -
Re:Thats a ridiculous question to ask the internet
In Why Nerds Are Unpopular, Paul Graham pointed out something that doesn't really have to do with nerds or unpopularity, but which is pretty insightful just the same: in "real life", you're part of a community that was typically thrown together by geographical location, and it can be hard to find others who share your interests. With the internet, it's easy to find groups of people who are interested in obscure things. Hey, we've got slashdot....
-
Re:J2EE is not slow
Remember that paper (I forget who wrote it) about the huge market advantage web programming in Lisp gave his company?
You mean Beating the Averages by Paul Graham.
-
Re:Functional Programming?? *hisss*
Hm. Getting offtopic here, but I rather dislike it when people blame the tool they can't use.
Here's what Paul Graham (yes, of Bayesian filtering fame) has to say about functional programming. There is an amusingly appropriate quote: In business, there is nothing more valuable than a technical advantage your competitors don't understand.
Alright, stretching to get back on topic, I'll assert this: To a programmer, knowing functional programming is about as useful as reading literature, analyzing politics, studying science, or traveling abroad.
Take that as you will.
--
Dum de dum. -
Shameless Paul Graham link
Someone mentioned Lisp macros? Here are the obligatory Paul Graham links, posted AC to avoid karma-whoring:
- Revenge of the Nerds, cool intro of why Lisp is cool, and
- On Lisp, free book (and the best one) about Lisp macros.
-
Re:Random Trivia Note
And supposedly the same Robert Morris that helped write what became Yahoo! Stores.
-
Re:Why are students so passive - one story
"Anyone else experienced anything similar?"
Read the letters page of any issue of 2600 for what the kids think about it: learning off your own back is considered disorderly and threatening in many schools.
p.s. article
-
Re:Unfortunately... Re:Don't fully agree.
so, the whole point is that Lisp is not a programming language but a kind of language definition language? Just a raw parse tree, and Build Your Own Syntax. See why I say it's difficult? You haven't ANYTHING done for you in advance.
Oh, come on. Common Lisp has about 1000 defined symbols (i.e. variables, functions, macros, classes ...). It includes an extremely powerful exception system, highly flexible OOP, and all the mundane stuff like lots of standard datatypes, control structures, IO, pretty printing etc. People frequently bash it because it's too big.
You don't have to do any kind of language design when you do Lisp programming. You can get a long way with just using plain function definitions. Yet you can easily define new syntaxes, control structures and stuff.
never got to understand why Lisp programmers think of the macro system as the primary and more exclusive power of the language, now I start to see it. But how do to learn to create those domain-specific languages? It is so far away from conventional academic lectures, that one needs to forget almost everything to start thinking that way!
Back when I was the proud owner of a Commodore C 128, I used to think similar things about useless stuff like GOSUB. Why can't we just stay with the more familiar GOTO that everyone understands?
Get over it. Learning new tools is useful, but it's work. Get a good book on Lisp macros, and dive in.
And I'm not convinced that that syntaxlessness is indispensable. [...] I would prefer to have some syntactic sugar
You are not alone. And, given that you can actually define a new syntax, many people tried to come up with alternatives to raw s-expressions. And indeed succeeded. However, none of these alternatives ever got too popular (the most successful attempt might be the Dylan language, which started with s-expressions, but dropped them). People could have used alternative syntaxes, but the vast majority chose not to. -
Re:Can anyone
Concerning advantages of Python
Python is totally object orientated, and very intelligently designed in this department. Whereas in Perl (5) you have to jump through hoops to create objects, especially OO modules, in Python it's as easy as assigning a variable a new value.
Alright, let's set something straight here. The world is on a huge object oriented high. As has been said about strict types, object oriented programming is a hammer and everything all of a sudden looks like a nail.
Any language that is *only* object oriented is forcing you to look at everything as nails.
Try Lisp, you'll feel much better.
(Insert language here) is just a watered down Lisp. -
Re:Can anyone
explain what the major advantages of using Python are. I have only ever looked at it very briefly and even more briefly at Jython. From this very limited experience I can't really think of a compelling reason to use Python over some of the more mainstream languages, other than perhaps as a scripting type glue.
If you are using Java then Python is a step up because it offers first class functions and some other incredibly powerful constructs.
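For anyone who hasn't seen what "first class functions" buys you, here is a tiny Python sketch. The helper names (compose, apply_all) are made up for the example; the point is only that functions can be passed around, returned, and stored like any other value.

```python
def compose(f, g):
    """Return a new function that applies g first, then f."""
    return lambda x: f(g(x))


def apply_all(funcs, value):
    """Call every function in funcs on the same value."""
    return [f(value) for f in funcs]


# Functions are ordinary values: build one from two others...
strip_then_upper = compose(str.upper, str.strip)
shouted = strip_then_upper("  lisp ")

# ...keep them in a list...
results = apply_all([len, str.upper], "abc")

# ...or hand one to a library function as an argument.
by_length = sorted(["python", "lisp", "java"], key=len)
```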
Unfortunately, although Python's effort is applaudable, it really is only a first class imperative language that has added some features of Lisp.
If you are going to choose a new language to learn, then you should be learning Lisp. Most people avoid it because it looks complicated but, believe me, after using it for many years, Lisp is gorgeous.
I highly suggest you check out Paul Graham's website and read his articles about Lisp before you waste any time learning any other language.
All languages nowadays are slowly adding individual pieces of Lisp functionality. Why not just use Lisp (no reason to wait a decade for all the "popular" languages to finally come full circle and become Lisp dialects). -
Re:Comparison of Bayesian spam filters
I've always wondered how Paul Graham has managed to get so much hype built up about his work. The idea of using Bayesian filters to classify spam had been around about 5 years prior to his "A Plan For Spam" - check out, for example, this paper by Mehran Sahami (a very cool guy who works here at Stanford as well as at Google) from 1998: http://citeseer.nj.nec.com/sahami98bayesian.html (and if you search around on Citeseer you'll undoubtedly find many other papers on spam classifying from even earlier, though not all use Naive Bayes).
Mathematically, Graham's version of Naive Bayes is pretty weak - look at the original A Plan for Spam, he chooses all kinds of random numbers based purely on trial and error, rather than backing them up with mathematical reasoning:
I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur more than five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough). And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.
That's just one paragraph, stuff like that is all over the paper. There are many more logical ways to bias the classifier away from false-positives, which I'm not sure if it's worth getting into. Having spent the summer implementing many different variations on spam filtering, I can say confidently that Graham's variation is definitely far from the best. -
Re:Comparison of Bayesian spam filters
I always wondered how Graham felt about the hundreds of Bayesian filters written after he published his article. After all it was supposed to be a killer feature of a webmail system he (together with others, of course) writes to demo his Arc language.
Then again, he's probably still insanely rich from the ViaWeb (a.k.a Yahoo! Store) deal, and doesn't really have to care about lost business advantage much. Becoming a millionaire to be able to concentrate on hacking seems to be a good career plan
:-) -
Re:What I don't understand...
The gibberish you see in the subject header and the to/from headers is designed to fool Bayesian filters. These filters assign a weighted probability that a message is spam by examining its content. Using gibberish words and random text instead of words like 'sex', 'porn' or 'teens' is an attempt to fool the filter with words it won't recognise from its active database.
Paul Graham has done some brilliant and successful work using Bayesian filters -
WHICH set of basics?
Even if I only have a library of 5 books, I'd still rather have only one that covers the basics. That way, I don't have to dig through five different, sometimes conflicting, explanations of the same concepts.
Ah, but which basics? The term "object-oriented" means different things to different people. The differences you're seeing in these books are probably caused by this.
Go read this article by Jonathan Rees, and you'll see why Java OO != Smalltalk OO != CLOS OO (to name two models you've probably seen, and one you probably haven't). -
It is the other way around or "A Plan for RIAA"
Yet again with apologies to Paul Graham, I wrote it before: implement collaborative Bayesian filters in all major P2P clients. Train the filters to reject known RIAA search strings, known RIAA IP numbers, known RIAA nicknames. Iterate this across all participants. Let the filters learn while the RIAA tries to beat them. Go back to step 1.
-
The 100 year language
I think that Alan Kay was way ahead of his time - getting kids to program. I think that, much like the spreadsheet revolutionized office work by allowing dynamic analysis of data, the next big language will be one that is simple enough to allow average office workers to speed up and automate their own work. Abstraction is key.
VBA is being used currently for a lot of that work - but it is truly horrible. Wharton has started teaching its MBA students Python.
Check out what Paul Graham has to say about programming languages in 100 years (basically they won't change much).
http://www.paulgraham.com/hundred.html
And Artima had a discussion on this topic, "After Java and C# - what is next?". http://www.artima.com/weblogs/viewpost.jsp?thread=6543 -
Re:expressive
Lisp. See "Succinctness is Power".
-
Almost catching up
Python is indeed getting closer. Just give them some more years to finish the last three items and suddenly we'll have thousands more using a Really Good language.
Note to moderators: this is a troll :-) -
Re:My approach
Ok, now that you're getting more concrete it sounds like you do understand how all this works. However, I don't think you've proven that a Bayesian filter can't keep up with spammers. In Better Bayesian Filtering, Paul Graham talks about degeneration, where if a token isn't found in the corpus, you fall back on a simpler version of it, e.g. fewer exclamation points. Add a couple simple degeneration techniques and you'll catch PcarEcatNcabIcanS and pppppeeeeennnniiisssxyqhh.
Basically, there are two directions a spammer can go to bypass the Bayesian filter's favorite keywords. Either they break words up so that the filter sees shorter tokens, or they do something else that makes the filter see longer tokens. Shorter non-word tokens will just end up being learned as spam markers. Longer tokens can be handled via n-tuplets plus a few degeneration techniques. I think the number of different ways you can obfuscate words but still keep them recognizable to humans is limited.
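As a rough illustration of the degeneration idea, here is a minimal Python sketch. The function names, the particular fallback rules and the default of 0.4 for unknown tokens are assumptions made for this example, not Graham's actual rules. It only handles exclamation points, case and repeated characters, so it would not undo the interleaved-letter trick or strip a random suffix.

```python
import re

def degenerate_forms(token):
    """Yield the token itself, then progressively simpler fallbacks.

    The fallback rules are illustrative assumptions: fewer exclamation
    points, lower case, and runs of a repeated character collapsed.
    """
    lowered = token.lower()
    candidates = [
        token,
        re.sub(r"!+", "!", token),        # fewer exclamation points
        token.rstrip("!"),                # no trailing exclamation points
        lowered,
        lowered.rstrip("!"),
        # Collapse runs of one character; this also mangles legitimate
        # double letters, so it is only tried last.
        re.sub(r"(.)\1+", r"\1", lowered.rstrip("!")),
    ]
    seen = set()
    for form in candidates:
        if form and form not in seen:
            seen.add(form)
            yield form

def lookup_probability(token, probabilities, default=0.4):
    """Return the stored probability of the first fallback form that is known."""
    for form in degenerate_forms(token):
        if form in probabilities:
            return probabilities[form]
    return default  # hypothetical score for tokens never seen in any form

known = {"free": 0.9, "penis": 0.99}
print(lookup_probability("FREE!!!", known))               # falls back to "free"
print(lookup_probability("pppppeeeeennnniiisss", known))  # collapses to "penis"
```

The order of the fallbacks matters: the most specific form is tried first, so a token that the corpus already knows exactly is never replaced by a cruder one.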
-
Re:My approach
Since neither you nor the moderators who modded you up could be bothered to actually read Paul Graham's article, I'll quote it for you here:
In a sense, though, my filters do themselves embody a kind of whitelist (and blacklist) because they are based on entire messages, including the headers. So to that extent they "know" the email addresses of trusted senders and even the routes by which mail gets from them to me. And they know the same about spam, including the server names, mailer versions, and protocols.
-
Re:Does making this public help spammers?
Bayesian filtering is currently considered by many the best spam filtering mechanism. Since the detailed data set is different for everybody, and it learns from spam and non-spam messages, the only way a spammer could avoid Bayesian filters would be to either customize spam for each recipient (not practical) or make spam messages look a lot like normal messages (making them much less intrusive, but also impossible to filter through any mechanism other than a whitelist). See Paul Graham's spam pages for further info.
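To make the "learns from spam and non-spam messages" part concrete, here is a small Python sketch of per-user training and per-token scoring. It is loosely modelled on the rules Graham describes in A Plan for Spam (double the good counts, skip rare tokens, clamp to .01 and .99). The function names, the token-list input format and the exact cutoff comparison are assumptions made for this illustration.

```python
from collections import Counter

def train(messages):
    """Count token occurrences over one corpus; each message is a list of tokens."""
    counts = Counter()
    for tokens in messages:
        counts.update(tokens)
    return counts

def token_spam_probability(word, good, bad, ngood, nbad):
    """Estimate how spammy one token is for this particular user.

    good/bad are token counts from the user's own nonspam and spam mail;
    ngood/nbad are the numbers of messages in each corpus (assumed nonzero).
    Returns None when the token is too rare to say anything about.
    """
    g = 2 * good.get(word, 0)  # doubled to bias against false positives
    b = bad.get(word, 0)
    if g + b <= 5:             # assumption: "more than five times in total"
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))  # never fully certain either way

good = train([["meeting", "lunch"]] * 3)
bad = train([["viagra", "lunch"]] * 10)
for word in ("viagra", "meeting", "lunch", "unseen"):
    print(word, token_spam_probability(word, good, bad, ngood=3, nbad=10))
```

Because the counts come from each user's own mail, two users can end up with very different scores for the same word, which is why a spammer cannot tune a message against one filter and expect it to pass everywhere.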
Security through obscurity would be pointless. Unless you are using a spam filter you wrote yourself and aren't going to give to anyone else, it won't help.
Even if you offered a filtering service without giving the filtering program to anyone (to prevent reverse-engineering), spammers could always use the service as an oracle to figure out ways around it through trial and error. -
Not really
I'm sure that is what the spammers hope and believe, but in fact most Bayesian filters associate a probability factor to each token or word, and they make a decision based on the set of tokens with the highest or lowest scores. For example, in Paul Graham's seminal Plan for Spam he describes using only the 15 most significant tokens to make the determination of the message's spamminess. So it really doesn't help to try to bury words like "penis" or "viagra" in a mass of obscure or invented words, however large; the filters will ignore those and home in on the bad words.
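A minimal Python sketch of why padding does not help, assuming each token already has a spam probability between 0 and 1. Only the 15 scores farthest from the neutral 0.5 are kept, and they are combined with the naive Bayes product formula from Graham's essay. The function name and the sample numbers are made up for this illustration.

```python
from math import prod

def message_spam_probability(token_probs, n=15):
    """Combine only the n most significant token probabilities.

    A token is "significant" when its probability is far from 0.5 in
    either direction. Probabilities are assumed to lie strictly between
    0 and 1, so the denominator cannot be zero.
    """
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)

spammy = [0.99] * 15            # fifteen strongly spammy tokens
padding = [0.4] * 1000          # a mass of unfamiliar, near-neutral filler
print(message_spam_probability(spammy))
print(message_spam_probability(spammy + padding))  # same score: the filler is ignored
```

The filler tokens sit close to 0.5, so they never make it into the top 15 and have no effect on the result.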
In fact, the spammers' choice of obscure or invented words as padding is dumb. If they would use regular words such as do occur in the legitimate email you want to read, there's actually a chance that over time they could render Bayesian filters less potent, because the good words would become more associated with spam than with legitimate mail. Careful attention to the training corpus is needed to avoid this happening. -
Re:Not quite ready? Of course it is.
"I think OS's should have even more time spent on making better GUI's, with as much written language removed from it as possible"
Hackers and painters
In summary: a GUI-only interface is to a text interface as film is to literature. It may feel easier to use at first, but the limitations are significant, it makes it more difficult to think outside the designers' box, and it cripples the linguistic abilities which most people love to practise.
Ever use the Lego programming language?