Listen, if you were making an absolute boat load of money off of other peoples' work, wouldn't you get upset if some newfangled Internet thing came along and disturbed your flow of income?
All this proves is that all of us, together, can help to stomp out the music executive thieves over time with continued support of decentralized pee two pee programs like KaZAA, Napster, and GNUtella.
The RIAA et al. can see ten years down the road and realize that things don't look good for them. Don't worry, we'll all get through this and then the music artists will finally get some penance for their years of hard work, singing, and dancing.
I'm on the official list of objectors about this particular levy law. Don't worry, we've got some pretty convincing evidence to show at the hearing that illustrates that the proposed levies are much too high and should be struck down like a red-headed stepchild.
It will be interesting to see the outcome. If it passes, the market for blank media and mp3 players will be hit hard.
We last touched on this subject in October, when someone was searching for cheap SSL certs. We've also discussed why certs are so expensive. Why not take it one step further and discuss ways of making and authenticating our own certs for free...or as close to free as possible?
Ladies and gentleman, a round of applause for the only Slashdot editor who reads Slashdot!
More than 1.1 billion trolls are killed worldwide each year. For no reason.
Trolls are an unhealthy food source. Most people who eat trolls also have access to other, non-meat foods.
Trolls are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
Sadly, the average broadband dorm-dwelling college student could, if s/he had enough storage space, download 600 songs in 20 seconds.
Just imagine what they could get in an hour of work.
Unfortunately, it's not uncommon for many college students to have hundreds of GigaBytes of MP3 songs that they did not pay for. It's all too confusing, and most students don't even understand what they're doing is wrong.
I help with freshman arrival and computer orientation for them, and 90% of them have a first question of "How do I download MP3s?".
A federal judge ordered Verizon Communications Tuesday to turn over the name of a customer suspected of downloading songs over the Internet, handing a victory to recording companies in their fight against online piracy.
But what if I start my own ISP and the database of customer records is indexed without any information that would be able to identify the person or phone line that's dialing in to use our Web access services?
Any payment information would be done with cash only and written on pencil and paper kept in a lockbox or safe of some sort so that no matter what a court rules, my customers remain anonymous.
We now live in a world where anonymity no longer exists, we have to pay for music, and an ugly 18 year old Canadian chick is at the top of the US music charts?
More than 1.1 billion pigs are killed worldwide each year. For no reason.
Pork is an unhealthy food source. Most people who eat pork also have access to other, non-meat foods.
Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
It's always great to see some competition in the stagnant console market.
Yeah, OK, I'll bite.
Console gaming systems are among the highest selling products of all-time. What's fascinating is that sales figures of console systems like PS2, X-Box, Gamecube, etc. have shown to be recession-proof and nearly always in the black year after year for the past decade or two.
With so many games to choose from, it's no wonder the video game industry is headed for a record-breaking year," said Richard Ow, senior account manager, NPDFunworld. "Annual 2001 was the year of the console, and now with so many households owning one or more of them, the goal of video game publishers is to be 'the game' on everyone's holiday gift list," said Ow.
Year-to-date (January through October) 2002 total console hardware dollar volume increased 29 percent, compared to the same time period last year. Another bright spot for the industry was the category of console accessories, showing an impressive 27 percent increase in dollar volume versus last year. (from this GameWorld story)
So, while yours was an interesting troll attempt, your inclusiong of a blatant and easily-disproved non-fact in the first sentence is what messed you up.
We don't need your (what you think are) witty comments making fun of a real business with a real business plan. Hell, nearly any business plan looks good next to the business that you believe in that give away their products for free. It's 2003, and you and your compadres are still trying to figure out how to make money by writing software for free. As a math minor, I must say that you're in a for one hell of a problem.
Maybe this company is spearheaded by someone who has money to blow, and so he or she decided they wanted to give the game console market a try. I don't think you personally are the one who should be judging them or their company on a world stage like Slashdot that receives millions of readers every day.
Sure, it's said that bad press is better than no press at all, but I just wish people like Michael wouldn't make the good-hearted folks like myself here at Slashdot look like bitter old unemployed programmers.
Just my two cents.
A faithful Slashdot contributor, -- Amsterdam Vallon
More than 1.1 billion pigs killed each year
on
Phantom Game Console
·
· Score: -1, Offtopic
More than 1.1 billion pigs are killed worldwide each year. For no reason.
Pork is an unhealthy food source. Most people who eat pork also have access to other, non-meat foods.
Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
Without spam, how else would I be able to sit home every day and make $1,000 a week watching TV while playing with my 12 inch penis?
Archive Version (b/c it's a personal site)
on
Plan for Spam, Version 2
·
· Score: 5, Interesting
January 2003
(This article was given as a talk at the 2003 Spam Conference. It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam, and what I plan to do in the future.)
The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot.
Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].
When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn't everyone using it? When I read the papers I found out why. Pantel and Lin's filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.
When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than.03% false positives [4]. It's always alarming when two people trying the same experiment get widely divergent results. It's especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than.03% false positives means that it is.
So why did we get such different numbers? I haven't tried to reproduce Pantel and Lin's results, but from reading the paper I see five things that probably account for the difference.
One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with data sets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.
But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn't know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don't ignore data. You'd think this lesson would be too obvious to mention, but I've had to learn it several times.
Third, Pantel and Lin stemmed the tokens, meaning they reduced e.g. both ``mailing'' and ``mailed'' to the root ``mail''. They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.
Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.
Finally, they didn't bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double.
I don't think it's a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live human spammer working actively to defeat your filter.
Tokens
Another project I heard about after the Slashdot article was Bill Yerazunis' CRM114 [5]. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing.
Once I understood how CRM114 worked, it seemed inevitable that I would eventually have to move from filtering based on single words to an approach like this. But first, I thought, I'll see how far I can get with single words. And the answer is, surprisingly far.
Mostly I've been working on smarter tokenization. On current spam, I've been able to achieve filtering rates that approach CRM114's. These techniques are mostly orthogonal to Bill's; an optimal solution might incorporate both.
``A Plan for Spam'' uses a very simple definition of a token. Letters, digits, dashes, apostrophes, and dollar signs are constituent characters, and everything else is a token separator. I also ignored case. Now I have a more complicated definition of a token:
Case is preserved.
Exclamation points are constituent characters.
Periods and commas are constituents if they occur between two digits. This lets me get ip addresses and prices intact.
A price range like $20-25 yields two tokens, $20 and $25.
Tokens that occur within the To, From, Subject, and Return-Path lines, or within urls, get marked accordingly. E.g. ``foo'' in the Subject line becomes ``Subject*foo''. (The asterisk could be any character you don't allow as a constituent.) Such measures increase the filter's vocabulary, which makes it more discriminating. For example, in the current filter, ``free'' in the Subject line has a spam probability of 98%, whereas the same token in the body has a spam probability of only 65%.
In the Plan for Spam filter, all these tokens would have had the same probability,.7602. That filter recognized about 23,000 tokens. The current one recognizes about 187,000.
The disadvantage of having a larger universe of tokens is that there is more chance of misses. Spreading your corpus out over more tokens has the same effect as making it smaller. If you consider exclamation points as constituents, for example, then you could end up not having a spam probability for free with seven exclamation points, even though you know that free with just two exclamation points has a probability of 99.99%.
One solution to this is what I call degeneration. If you can't find an exact match for a token, treat it as if it were a less specific version. I consider terminal exclamation points, uppercase letters, and occurring in one of the five marked contexts as making a token more specific. For example, if I don't find a probability for ``Subject*free!'', I look for probabilities for ``Subject*free'', ``free!'', and ``free'', and take whichever one is farthest from.5.
Here are the alternatives [7] considered if the filter sees ``FREE!!!'' in the Subject line and doesn't have a probability for it.
If you do this, be sure to consider versions with initial caps as well as all uppercase and all lowercase. Spams tend to have more sentences in imperative voice, and in those the first word is a verb. So verbs with initial caps have higher spam probabilities than they would in all lowercase. In my filter, the spam probability of ``Act'' is 98% and for ``act'' only 62%.
If you increase your filter's vocabulary, you can end up counting the same word multiple times, according to your old definition of ``same''. Logically, they're not the same token anymore. But if this still bothers you, let me add from experience that the words you seem to be counting multiple times tend to be exactly the ones you'd want to.
Another effect of a larger vocabulary is that when you look at an incoming mail you find more interesting tokens, meaning those with probabilities far from.5. I use the 15 most interesting to decide if mail is spam. But you can run into a problem when you use a fixed number like this. If you find a lot of maximally interesting tokens, the result can end up being decided by whatever random factor determines the ordering of equally interesting tokens. One way to deal with this is to treat some as more interesting than others.
For example, the token ``dalco'' occurs 3 times in my spam corpus and never in my legitimate corpus. The token ``Url*optmails'' (meaning ``optmails'' within a url) occurs 1223 times. And yet, as I used to calculate probabilities for tokens, both would have the same spam probability, the threshold of.99.
That doesn't feel right. There are theoretical arguments for giving these two tokens substantially different probabilities (Pantel and Lin do), but I haven't tried that yet. It does seem at least that if we find more than 15 tokens that only occur in one corpus or the other, we ought to give priority to the ones that occur a lot. So now there are two threshold values. For tokens that occur only in the spam corpus, the probability is.9999 if they occur more than 10 times and.9998 otherwise. Ditto at the other end of the scale for tokens found only in the legitimate corpus.
I may later scale token probabilities substantially, but this tiny amount of scaling at least ensures that tokens get sorted the right way.
Another possibility would be to consider not just 15 tokens, but all the tokens over a certain threshold of interestingness. Steven Hauser does this in his statistical spam filter [8]. If you use a threshold, make it very high, or spammers could spoof you by packing messages with more innocent words.
Finally, what should one do about html? I've tried the whole spectrum of options, from ignoring it to parsing it all. Ignoring html is a bad idea, because it's full of useful spam signs. But if you parse it all, your filter might degenerate into a mere html recognizer. The most effective approach seems to be the middle course, to notice some tokens but not others. I look at a, img, and font tags, and ignore the rest. Links and images you should certainly look at, because they contain urls.
I could probably be smarter about dealing with html, but I don't think it's worth putting a lot of time into this. Spams full of html are easy to filter. The smarter spammers already avoid it. So performance in the future should not depend much on how you deal with html.
Performance
Between December 10 2002 and January 10 2003 I got about 1750 spams. Of these, 4 got through. That's a filtering rate of about 99.75%.
Two of the four spams I missed got through because they happened to use words that occur often in my legitimate email.
The third was one of those that exploit an insecure cgi script to send mail to third parties. They're hard to filter based just on the content because the headers are innocent and they're careful about the words they use. Even so I can usually catch them. This one squeaked by with a probability of.88, just under the threshold of.9.
Of course, looking at multiple token sequences would catch it easily. ``Below is the result of your feedback form'' is an instant giveaway.
The fourth spam was what I call a spam-of-the-future, because this is what I expect spam to evolve into: some completely neutral text followed by a url. In this case it was was from someone saying they had finally finished their homepage and would I go look at it. (The page was of course an ad for a porn site.)
If the spammers are careful about the headers and use a fresh url, there is nothing in spam-of-the-future for filters to notice. We can of course counter by sending a crawler to look at the page. But that might not be necessary. The response rate for spam-of-the-future must be low, or everyone would be doing it. If it's low enough, it won't pay for spammers to send it, and we won't have to work too hard on filtering it.
Now for the really shocking news: during that same one-month period I got three false positives.
In a way it's a relief to get some false positives. When I wrote ``A Plan for Spam'' I hadn't had any, and I didn't know what they'd be like. Now that I've had a few, I'm relieved to find they're not as bad as I feared. False positives yielded by statistical filters turn out to be mails that sound a lot like spam, and these tend to be the ones you would least mind missing [9].
Two of the false positives were newsletters from companies I've bought things from. I never asked to receive them, so arguably they were spams, but I count them as false positives because I hadn't been deleting them as spams before. The reason the filters caught them was that both companies in January switched to commercial email senders instead of sending the mails from their own servers, and both the headers and the bodies became much spammier.
The third false positive was a bad one, though. It was from someone in Egypt and written in all uppercase. This was a direct result of making tokens case sensitive; the Plan for Spam filter wouldn't have caught it.
It's hard to say what the overall false positive rate is, because we're up in the noise, statistically. Anyone who has worked on filters (at least, effective filters) will be aware of this problem. With some emails it's hard to say whether they're spam or not, and these are the ones you end up looking at when you get filters really tight. For example, so far the filter has caught two emails that were sent to my address because of a typo, and one sent to me in the belief that I was someone else. Arguably, these are neither my spam nor my nonspam mail.
Another false positive was from a vice president at Virtumundo. I wrote to them pretending to be a customer, and since the reply came back through Virtumundo's mail servers it had the most incriminating headers imaginable. Arguably this isn't a real false positive either, but a sort of Heisenberg uncertainty effect: I only got it because I was writing about spam filtering.
Not counting these, I've had a total of five false positives so far, out of about 7740 legitimate emails, a rate of.06%. The other two were a notice that something I bought was back-ordered, and a party reminder from Evite.
I don't think this number can be trusted, partly because the sample is so small, and partly because I think I can fix the filter not to catch some of these.
False positives seem to me a different kind of error from false negatives. Filtering rate is a measure of performance. False positives I consider more like bugs. I approach improving the filtering rate as optimization, and decreasing false positives as debugging.
So these five false positives are my bug list. For example, the mail from Egypt got nailed because the uppercase text made it look to the filter like a Nigerian spam. This really is kind of a bug. As with html, the email being all uppercase is really conceptually one feature, not one for each word. I need to handle case in a more sophisticated way.
So what to make of this.06%? Not much, I think. You could treat it as an upper bound, bearing in mind the small sample size. But at this stage it is more a measure of the bugs in my implementation than some intrinsic false positive rate of Bayesian filtering.
Future
What next? Filtering is an optimization problem, and the key to optimization is profiling. Don't try to guess where your code is slow, because you'll guess wrong. Look at where your code is slow, and fix that. In filtering, this translates to: look at the spams you miss, and figure out what you could have done to catch them.
For example, spammers are now working aggressively to evade filters, and one of the things they're doing is breaking up and misspelling words to prevent filters from recognizing them. But working on this is not my first priority, because I still have no trouble catching these spams [10].
There are two kinds of spams I currently do have trouble with. One is the type that pretends to be an email from a woman inviting you to go chat with her or see her profile on a dating site. These get through because they're the one type of sales pitch you can make without using sales talk. They use the same vocabulary as ordinary email.
The other kind of spams I have trouble filtering are those from companies in e.g. Bulgaria offering contract programming services. These get through because I'm a programmer too, and the spams are full of the same words as my real mail.
I'll probably focus on the personal ad type first. I think if I look closer I'll be able to find statistical differences between these and my real mail. The style of writing is certainly different, though it may take multiword filtering to catch that. Also, I notice they tend to repeat the url, and someone including a url in a legitimate mail wouldn't do that [11].
The outsourcing type are going to be hard to catch. Even if you sent a crawler to the site, you wouldn't find a smoking statistical gun. Maybe the only answer is a central list of domains advertised in spams [12]. But there can't be that many of this type of mail. If the only spams left were unsolicited offers of contract programming services from Bulgaria, we could all probably move on to working on something else.
Will statistical filtering actually get us to that point? I don't know. Right now, for me personally, spam is not a problem. But spammers haven't yet made a serious effort to spoof statistical filters. What will happen when they do?
I'm not optimistic about filters that work at the network level [13]. When there is a static obstacle worth getting past, spammers are pretty efficient at getting past it. There is already a company called Assurance Systems that will run your mail through Spamassassin and tell you whether it will get filtered out.
Network-level filters won't be completely useless. They may be enough to kill all the "opt-in" spam, meaning spam from companies like Virtumundo and Equalamail who claim that they're really running opt-in lists. You can filter those based just on the headers, no matter what they say in the body. But anyone willing to falsify headers or use open relays, presumably including most porn spammers, should be able to get some message past network-level filters if they want to. (By no means the message they'd like to send though, which is something.)
The kind of filters I'm optimistic about are ones that calculate probabilities based on each individual user's mail. These can be much more effective, not only in avoiding false positives, but in filtering too: for example, finding the recipient's email address base-64 encoded anywhere in a message is a very good spam indicator.
But the real advantage of individual filters is that they'll all be different. If everyone's filters have different probabilities, it will make the spammers' optimization loop, what programmers would call their edit-compile-test cycle, appallingly slow. Instead of just tweaking a spam till it gets through a copy of some filter they have on their desktop, they'll have to do a test mailing for each tweak. It would be like programming in a language without an interactive toplevel, and I wouldn't wish that on anyone.
Notes
[1] Paul Graham. ``A Plan for Spam.'' August 2002. http://paulgraham.com/spam.html.
Probabilities in this algorithm are calculated using a degenerate case of Bayes' Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam.
The first assumption is widespread in text classification. Algorithms that use it are called ``naive Bayesian.''
The second assumption I made because the proportion of spam in my incoming mail fluctuated so much from day to day (indeed, from hour to hour) that the overall prior ratio seemed worthless as a predictor. If you assume that P(spam) and P(nonspam) are both.5, they cancel out and you can remove them from the formula.
If you were doing Bayesian filtering in a situation where the ratio of spam to nonspam was consistently very high or (especially) very low, you could probably improve filter performance by incorporating prior probabilities. To do this right you'd have to track ratios by time of day, because spam and legitimate mail volume both have distinct daily patterns.
[2] Patrick Pantel and Dekang Lin. ``SpamCop-- A Spam Classification & Organization Program.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.
[3] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. ``A Bayesian Approach to Filtering Junk E-Mail.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.
[4] At the time I had zero false positives out of about 4,000 legitimate emails. If the next legitimate email was a false positive, this would give us.03%. These false positive rates are untrustworthy, as I explain later. I quote a number here only to emphasize that whatever the false positive rate is, it is less than 1.16%.
[5] Bill Yerazunis. ``Sparse Binary Polynomial Hash Message Filtering and The CRM114 Discriminator.'' Proceedings of 2003 Spam Conference.
[6] In ``A Plan for Spam'' I used thresholds of.99 and.01. It seems justifiable to use thresholds proportionate to the size of the corpora. Since I now have on the order of 10,000 of each type of mail, I use.9999 and.0001.
[7] There is a flaw here I should probably fix. Currently, when ``Subject*foo'' degenerates to just ``foo'', what that means is you're getting the stats for occurrences of ``foo'' in the body or header lines other than those I mark. What I should do is keep track of statistics for ``foo'' overall as well as specific versions, and degenerate from ``Subject*foo'' not to ``foo'' but to ``Anywhere*foo''. Ditto for case: I should degenerate from uppercase to any-case, not lowercase.
It would probably be a win to do this with prices too, e.g. to degenerate from ``$129.99'' to ``$--9.99'', ``$--.99'', and ``$--''.
You could also degenerate from words to their stems, but this would probably only improve filtering rates early on when you had small corpora.
[8] Steven Hauser. ``Statistical Spam Filter Works for Me.'' http://www.sofbot.com.
[9] False positives are not all equal, and we should remember this when comparing techniques for stopping spam. Whereas many of the false positives caused by filters will be near-spams that you wouldn't mind missing, false positives caused by blacklists, for example, will be just mail from people who chose the wrong ISP. In both cases you catch mail that's near spam, but for blacklists nearness is physical, and for filters it's textual.
In fairness, it should be added that the new generation of responsible blacklists, like the SBL, cause far fewer false positives than earlier blacklists like the MAPS RBL, for whom causing large numbers of false positives was a deliberate technique to get the attention of ISPs.
[10] If spammers get good enough at obscuring tokens for this to be a problem, we can respond by simply removing whitespace, periods, commas, etc. and using a dictionary to pick the words out of the resulting sequence. And of course finding words this way that weren't visible in the original text would in itself be evidence of spam.
Picking out the words won't be trivial. It will require more than just reconstructing word boundaries; spammers both add (``xHot nPorn cSite'') and omit (``P#rn'') letters. Vision research may be useful here, since human vision is the limit that such tricks will approach.
[11] In general, spams are more repetitive than regular email. They want to pound that message home. I currently don't allow duplicates in the top 15 tokens, because you could get a false positive if the sender happens to use some bad word multiple times. (In my current filter, ``dick'' has a spam probabilty of.9999, but it's also a name.) It seems we should at least notice duplication though, so I may try allowing up to two of each token, as Brian Burton does in SpamProbe.
[12] This is what approaches like Brightmail's will degenerate into once spammers are pushed into using mad-lib techniques to generate everything else in the message.
[13] It's sometimes argued that we should be working on filtering at the network level, because it is more efficient. What people usually mean when they say this is: we currently filter at the network level, and we don't want to start over from scratch. But you can't dictate the problem to fit your solution.
Historically, scarce-resource arguments have been the losing side in debates about software design. People only tend to use them to justify choices (inaction in particular) made for other reasons.
Thanks to Sarah Harlin, Trevor Blackwell, and Dan Giffin for reading drafts of this paper, and to Dan again for most of the infrastructure that this filter runs on.
More than 1.1 billion pigs are killed worldwide ea
on
Plan for Spam, Version 2
·
· Score: -1, Offtopic
More than 1.1 billion pigs are killed worldwide each year. For no reason.
Pork is an unhealthy food source. Most people who eat pork also have access to other, non-meat foods.
Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
Amsterdam Vallon's post contained a handful of facts coupled with clear, readable statements backing his claims up. He was moderated mostly as a "Troll" and his post sits invisible at a score of 0.
The parent post by Tewfik is a few "sentences" of garbled English prose which contains more nonsense than factual information. His post sits at 5 and is moderated "Insightful".
I think the moderations and comments mentioned above show that the Pro-Linux factions are moronic and easily manipulated, much like employees who Microsoft chooses to hire.
The sad thing isn't that Linux is dead, it's that its followers won't admit it.
As a developer, I'm asked on average of once or twice every year to suddenly pick-up a new technology and learn it within a couple weeks so that I can write a new program for release 6-12 months later as itself or jointly with the hardware guys.
When it comes to good, thorough documentation and API releases, I've always thought that this is an area where Linux is truly lacking. Hypothetically speaking, I think a coder learning Java for a new Windows P2P program that he must write would have a much easier time than a programmer who must learn Perl or C on his Linux box and create a network-intensive application that installs and runs the same way on all distributions of Linux, as well as Mac OS X.
I figure opinions from the "non enlightened", as many of you will probably call me, will help you to improve Linux, especially its documentation and user-friendliness.
Yeah, those of us who have to dress nicely for work and who would never write "Credit Card sized 5GB HD to become late this year" on a website that gets millions of hits a day are real dorks, aren't we?
cmdrtaco is getting married cmdrtaco is connected to the following things cmdrtaco is connected to because cmdrtaco is in the bag cmdrtaco is getting married to the fine woman this website is run by cmdrtaco is still known to post hoaxes or wild cmdrtaco is gay" or cmdrtaco is a big fan of both gnome and kde cmdrtaco is brilliant cmdrtaco is nothing more than a perl script operated by lucasfilms cmdrtaco is lame cmdrtaco is very unhappy with the current va situation cmdrtaco is confident things will continue as usual cmdrtaco is on the site cmdrtaco is in the bag joining us as guest cmdrtaco is this story cmdrtaco is my hero cmdrtaco is the one that is laying on the purple couch with the notebook cmdrtaco is on vacation cmdrtaco is a big cardcaptor sakura fan as well cmdrtaco is also one of the linux cmdrtaco is a torvelian cmdrtaco is a fair replacment cmdrtaco is an idiot cmdrtaco is because the whole of the cmdrtaco is quoted as saying cmdrtaco is hanging with the techtv people cmdrtaco is a big fan too cmdrtaco is a dedicated programmer cmdrtaco is psychic
With the increasing complexity of security issues, has your team at HP Research centers been interrogated or questioned mildly by any federal crime/security groups such as the RSA, CIA, or FBI?
As a computer scientist, I have found through years of extensive research that my digitally-compiled code tends to run faster on a computer than my black ink fountain pen does on a piece of paper.
From the article: Imagine never having to polish your shoes again, having glistening white teeth without visiting a dentist, or owning a mobile phone powered by a miniature battery.
When my work shoes need polishing, my wife does it or I pay 50 cents and have them done at the Mall Foodcourt or something.
I got glistening white teeth by using Colgate Gel. It's $14.99 and quite a bargain.
And my mobile phone gets 6 hours of battery life on a single charge, which is plenty since I don't really talk to many people.
So, honestly, could someone point me toward some practical real-world applications of this supposed "Next Generation" of nanotechnology?!
Listen, if you were making an absolute boat load of money off of other peoples' work, wouldn't you get upset if some newfangled Internet thing came along and disturbed your flow of income?
All this proves is that all of us, together, can help to stomp out the music executive thieves over time with continued support of decentralized pee two pee programs like KaZAA, Napster, and GNUtella.
The RIAA et al. can see ten years down the road and realize that things don't look good for them. Don't worry, we'll all get through this and then the music artists will finally get some penance for their years of hard work, singing, and dancing.
I'm on the official list of objectors about this particular levy law. Don't worry, we've got some pretty convincing evidence to show at the hearing that illustrates that the proposed levies are much too high and should be struck down like a red-headed stepchild.
It will be interesting to see the outcome. If it passes, the market for blank media and mp3 players will be hit hard.
Ladies and gentleman, a round of applause for the only Slashdot editor who reads Slashdot!
More than 1.1 billion trolls are killed worldwide each year. For no reason.
Trolls are an unhealthy food source. Most people who eat trolls also have access to other, non-meat foods.
Trolls are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
Sadly, the average broadband dorm-dwelling college student could, if s/he had enough storage space, download 600 songs in 20 seconds.
Just imagine what they could get in an hour of work.
Unfortunately, it's not uncommon for many college students to have hundreds of GigaBytes of MP3 songs that they did not pay for. It's all too confusing, and most students don't even understand what they're doing is wrong.
I help with freshman arrival and computer orientation for them, and 90% of them have a first question of "How do I download MP3s?".
A federal judge ordered Verizon Communications Tuesday to turn over the name of a customer suspected of downloading songs over the Internet, handing a victory to recording companies in their fight against online piracy.
But what if I start my own ISP and the database of customer records is indexed without any information that would be able to identify the person or phone line that's dialing in to use our Web access services?
Any payment information would be done with cash only and written on pencil and paper kept in a lockbox or safe of some sort so that no matter what a court rules, my customers remain anonymous.
Is this feasible?
Let me get this straight.
We now live in a world where anonymity no longer exists, we have to pay for music, and an ugly 18 year old Canadian chick is at the top of the US music charts?
I need a Tums...
More than 1.1 billion pigs are killed worldwide each year. For no reason.
Pork is an unhealthy food source. Most people who eat pork also have access to other, non-meat foods.
Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
Yeah, OK, I'll bite.
Console gaming systems are among the highest selling products of all-time. What's fascinating is that sales figures of console systems like PS2, X-Box, Gamecube, etc. have shown to be recession-proof and nearly always in the black year after year for the past decade or two.
So, while yours was an interesting troll attempt, your inclusiong of a blatant and easily-disproved non-fact in the first sentence is what messed you up.
Re-write for a B-.
Michael,
We don't need your (what you think are) witty comments making fun of a real business with a real business plan. Hell, nearly any business plan looks good next to the business that you believe in that give away their products for free. It's 2003, and you and your compadres are still trying to figure out how to make money by writing software for free. As a math minor, I must say that you're in a for one hell of a problem.
Maybe this company is spearheaded by someone who has money to blow, and so he or she decided they wanted to give the game console market a try. I don't think you personally are the one who should be judging them or their company on a world stage like Slashdot that receives millions of readers every day.
Sure, it's said that bad press is better than no press at all, but I just wish people like Michael wouldn't make the good-hearted folks like myself here at Slashdot look like bitter old unemployed programmers.
Just my two cents.
A faithful Slashdot contributor,
-- Amsterdam Vallon
More than 1.1 billion pigs are killed worldwide each year. For no reason.
Pork is an unhealthy food source. Most people who eat pork also have access to other, non-meat foods.
Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
Without spam, how else would I be able to sit home every day and make $1,000 a week watching TV while playing with my 12 inch penis?
January 2003
.03% false positives [4]. It's always alarming when two people trying the same experiment get widely divergent results. It's especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than .03% false positives means that it is.
.7602. That filter recognized about 23,000 tokens. The current one recognizes about 187,000.
.5.
.5. I use the 15 most interesting to decide if mail is spam. But you can run into a problem when you use a fixed number like this. If you find a lot of maximally interesting tokens, the result can end up being decided by whatever random factor determines the ordering of equally interesting tokens. One way to deal with this is to treat some as more interesting than others.
.99.
.9999 if they occur more than 10 times and .9998 otherwise. Ditto at the other end of the scale for tokens found only in the legitimate corpus.
.88, just under the threshold of .9.
.06%. The other two were a notice that something I bought was back-ordered, and a party reminder from Evite.
.06%? Not much, I think. You could treat it as an upper bound, bearing in mind the small sample size. But at this stage it is more a measure of the bugs in my implementation than some intrinsic false positive rate of Bayesian filtering.
.5, they cancel out and you can remove them from the formula.
.03%. These false positive rates are untrustworthy, as I explain later. I quote a number here only to emphasize that whatever the false positive rate is, it is less than 1.16%.
.99 and .01. It seems justifiable to use thresholds proportionate to the size of the corpora. Since I now have on the order of 10,000 of each type of mail, I use .9999 and .0001.
.9999, but it's also a name.) It seems we should at least notice duplication though, so I may try allowing up to two of each token, as Brian Burton does in SpamProbe.
(This article was given as a talk at the 2003 Spam Conference. It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam, and what I plan to do in the future.)
The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot.
Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].
When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn't everyone using it? When I read the papers I found out why. Pantel and Lin's filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.
When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than
So why did we get such different numbers? I haven't tried to reproduce Pantel and Lin's results, but from reading the paper I see five things that probably account for the difference.
One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with data sets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.
But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn't know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don't ignore data. You'd think this lesson would be too obvious to mention, but I've had to learn it several times.
Third, Pantel and Lin stemmed the tokens, meaning they reduced e.g. both ``mailing'' and ``mailed'' to the root ``mail''. They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.
Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.
Finally, they didn't bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double.
I don't think it's a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live human spammer working actively to defeat your filter.
Tokens
Another project I heard about after the Slashdot article was Bill Yerazunis' CRM114 [5]. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing.
Once I understood how CRM114 worked, it seemed inevitable that I would eventually have to move from filtering based on single words to an approach like this. But first, I thought, I'll see how far I can get with single words. And the answer is, surprisingly far.
Mostly I've been working on smarter tokenization. On current spam, I've been able to achieve filtering rates that approach CRM114's. These techniques are mostly orthogonal to Bill's; an optimal solution might incorporate both.
``A Plan for Spam'' uses a very simple definition of a token. Letters, digits, dashes, apostrophes, and dollar signs are constituent characters, and everything else is a token separator. I also ignored case. Now I have a more complicated definition of a token:
Case is preserved.
Exclamation points are constituent characters.
Periods and commas are constituents if they occur between two digits. This lets me get ip addresses and prices intact.
A price range like $20-25 yields two tokens, $20 and $25.
Tokens that occur within the To, From, Subject, and Return-Path lines, or within urls, get marked accordingly. E.g. ``foo'' in the Subject line becomes ``Subject*foo''. (The asterisk could be any character you don't allow as a constituent.)
Such measures increase the filter's vocabulary, which makes it more discriminating. For example, in the current filter, ``free'' in the Subject line has a spam probability of 98%, whereas the same token in the body has a spam probability of only 65%.
In the Plan for Spam filter, all these tokens would have had the same probability,
The disadvantage of having a larger universe of tokens is that there is more chance of misses. Spreading your corpus out over more tokens has the same effect as making it smaller. If you consider exclamation points as constituents, for example, then you could end up not having a spam probability for free with seven exclamation points, even though you know that free with just two exclamation points has a probability of 99.99%.
One solution to this is what I call degeneration. If you can't find an exact match for a token, treat it as if it were a less specific version. I consider terminal exclamation points, uppercase letters, and occurring in one of the five marked contexts as making a token more specific. For example, if I don't find a probability for ``Subject*free!'', I look for probabilities for ``Subject*free'', ``free!'', and ``free'', and take whichever one is farthest from
Here are the alternatives [7] considered if the filter sees ``FREE!!!'' in the Subject line and doesn't have a probability for it.
If you do this, be sure to consider versions with initial caps as well as all uppercase and all lowercase. Spams tend to have more sentences in imperative voice, and in those the first word is a verb. So verbs with initial caps have higher spam probabilities than they would in all lowercase. In my filter, the spam probability of ``Act'' is 98% and for ``act'' only 62%.
If you increase your filter's vocabulary, you can end up counting the same word multiple times, according to your old definition of ``same''. Logically, they're not the same token anymore. But if this still bothers you, let me add from experience that the words you seem to be counting multiple times tend to be exactly the ones you'd want to.
Another effect of a larger vocabulary is that when you look at an incoming mail you find more interesting tokens, meaning those with probabilities far from
For example, the token ``dalco'' occurs 3 times in my spam corpus and never in my legitimate corpus. The token ``Url*optmails'' (meaning ``optmails'' within a url) occurs 1223 times. And yet, as I used to calculate probabilities for tokens, both would have the same spam probability, the threshold of
That doesn't feel right. There are theoretical arguments for giving these two tokens substantially different probabilities (Pantel and Lin do), but I haven't tried that yet. It does seem at least that if we find more than 15 tokens that only occur in one corpus or the other, we ought to give priority to the ones that occur a lot. So now there are two threshold values. For tokens that occur only in the spam corpus, the probability is
I may later scale token probabilities substantially, but this tiny amount of scaling at least ensures that tokens get sorted the right way.
Another possibility would be to consider not just 15 tokens, but all the tokens over a certain threshold of interestingness. Steven Hauser does this in his statistical spam filter [8]. If you use a threshold, make it very high, or spammers could spoof you by packing messages with more innocent words.
Finally, what should one do about html? I've tried the whole spectrum of options, from ignoring it to parsing it all. Ignoring html is a bad idea, because it's full of useful spam signs. But if you parse it all, your filter might degenerate into a mere html recognizer. The most effective approach seems to be the middle course, to notice some tokens but not others. I look at a, img, and font tags, and ignore the rest. Links and images you should certainly look at, because they contain urls.
I could probably be smarter about dealing with html, but I don't think it's worth putting a lot of time into this. Spams full of html are easy to filter. The smarter spammers already avoid it. So performance in the future should not depend much on how you deal with html.
Performance
Between December 10 2002 and January 10 2003 I got about 1750 spams. Of these, 4 got through. That's a filtering rate of about 99.75%.
Two of the four spams I missed got through because they happened to use words that occur often in my legitimate email.
The third was one of those that exploit an insecure cgi script to send mail to third parties. They're hard to filter based just on the content because the headers are innocent and they're careful about the words they use. Even so I can usually catch them. This one squeaked by with a probability of
Of course, looking at multiple token sequences would catch it easily. ``Below is the result of your feedback form'' is an instant giveaway.
The fourth spam was what I call a spam-of-the-future, because this is what I expect spam to evolve into: some completely neutral text followed by a url. In this case it was was from someone saying they had finally finished their homepage and would I go look at it. (The page was of course an ad for a porn site.)
If the spammers are careful about the headers and use a fresh url, there is nothing in spam-of-the-future for filters to notice. We can of course counter by sending a crawler to look at the page. But that might not be necessary. The response rate for spam-of-the-future must be low, or everyone would be doing it. If it's low enough, it won't pay for spammers to send it, and we won't have to work too hard on filtering it.
Now for the really shocking news: during that same one-month period I got three false positives.
In a way it's a relief to get some false positives. When I wrote ``A Plan for Spam'' I hadn't had any, and I didn't know what they'd be like. Now that I've had a few, I'm relieved to find they're not as bad as I feared. False positives yielded by statistical filters turn out to be mails that sound a lot like spam, and these tend to be the ones you would least mind missing [9].
Two of the false positives were newsletters from companies I've bought things from. I never asked to receive them, so arguably they were spams, but I count them as false positives because I hadn't been deleting them as spams before. The reason the filters caught them was that both companies in January switched to commercial email senders instead of sending the mails from their own servers, and both the headers and the bodies became much spammier.
The third false positive was a bad one, though. It was from someone in Egypt and written in all uppercase. This was a direct result of making tokens case sensitive; the Plan for Spam filter wouldn't have caught it.
It's hard to say what the overall false positive rate is, because we're up in the noise, statistically. Anyone who has worked on filters (at least, effective filters) will be aware of this problem. With some emails it's hard to say whether they're spam or not, and these are the ones you end up looking at when you get filters really tight. For example, so far the filter has caught two emails that were sent to my address because of a typo, and one sent to me in the belief that I was someone else. Arguably, these are neither my spam nor my nonspam mail.
Another false positive was from a vice president at Virtumundo. I wrote to them pretending to be a customer, and since the reply came back through Virtumundo's mail servers it had the most incriminating headers imaginable. Arguably this isn't a real false positive either, but a sort of Heisenberg uncertainty effect: I only got it because I was writing about spam filtering.
Not counting these, I've had a total of five false positives so far, out of about 7740 legitimate emails, a rate of
I don't think this number can be trusted, partly because the sample is so small, and partly because I think I can fix the filter not to catch some of these.
False positives seem to me a different kind of error from false negatives. Filtering rate is a measure of performance. False positives I consider more like bugs. I approach improving the filtering rate as optimization, and decreasing false positives as debugging.
So these five false positives are my bug list. For example, the mail from Egypt got nailed because the uppercase text made it look to the filter like a Nigerian spam. This really is kind of a bug. As with html, the email being all uppercase is really conceptually one feature, not one for each word. I need to handle case in a more sophisticated way.
So what to make of this
Future
What next? Filtering is an optimization problem, and the key to optimization is profiling. Don't try to guess where your code is slow, because you'll guess wrong. Look at where your code is slow, and fix that. In filtering, this translates to: look at the spams you miss, and figure out what you could have done to catch them.
For example, spammers are now working aggressively to evade filters, and one of the things they're doing is breaking up and misspelling words to prevent filters from recognizing them. But working on this is not my first priority, because I still have no trouble catching these spams [10].
There are two kinds of spams I currently do have trouble with. One is the type that pretends to be an email from a woman inviting you to go chat with her or see her profile on a dating site. These get through because they're the one type of sales pitch you can make without using sales talk. They use the same vocabulary as ordinary email.
The other kind of spams I have trouble filtering are those from companies in e.g. Bulgaria offering contract programming services. These get through because I'm a programmer too, and the spams are full of the same words as my real mail.
I'll probably focus on the personal ad type first. I think if I look closer I'll be able to find statistical differences between these and my real mail. The style of writing is certainly different, though it may take multiword filtering to catch that. Also, I notice they tend to repeat the url, and someone including a url in a legitimate mail wouldn't do that [11].
The outsourcing type are going to be hard to catch. Even if you sent a crawler to the site, you wouldn't find a smoking statistical gun. Maybe the only answer is a central list of domains advertised in spams [12]. But there can't be that many of this type of mail. If the only spams left were unsolicited offers of contract programming services from Bulgaria, we could all probably move on to working on something else.
Will statistical filtering actually get us to that point? I don't know. Right now, for me personally, spam is not a problem. But spammers haven't yet made a serious effort to spoof statistical filters. What will happen when they do?
I'm not optimistic about filters that work at the network level [13]. When there is a static obstacle worth getting past, spammers are pretty efficient at getting past it. There is already a company called Assurance Systems that will run your mail through Spamassassin and tell you whether it will get filtered out.
Network-level filters won't be completely useless. They may be enough to kill all the "opt-in" spam, meaning spam from companies like Virtumundo and Equalamail who claim that they're really running opt-in lists. You can filter those based just on the headers, no matter what they say in the body. But anyone willing to falsify headers or use open relays, presumably including most porn spammers, should be able to get some message past network-level filters if they want to. (By no means the message they'd like to send though, which is something.)
The kind of filters I'm optimistic about are ones that calculate probabilities based on each individual user's mail. These can be much more effective, not only in avoiding false positives, but in filtering too: for example, finding the recipient's email address base-64 encoded anywhere in a message is a very good spam indicator.
But the real advantage of individual filters is that they'll all be different. If everyone's filters have different probabilities, it will make the spammers' optimization loop, what programmers would call their edit-compile-test cycle, appallingly slow. Instead of just tweaking a spam till it gets through a copy of some filter they have on their desktop, they'll have to do a test mailing for each tweak. It would be like programming in a language without an interactive toplevel, and I wouldn't wish that on anyone.
Notes
[1] Paul Graham. ``A Plan for Spam.'' August 2002. http://paulgraham.com/spam.html.
Probabilities in this algorithm are calculated using a degenerate case of Bayes' Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam.
The first assumption is widespread in text classification. Algorithms that use it are called ``naive Bayesian.''
The second assumption I made because the proportion of spam in my incoming mail fluctuated so much from day to day (indeed, from hour to hour) that the overall prior ratio seemed worthless as a predictor. If you assume that P(spam) and P(nonspam) are both
If you were doing Bayesian filtering in a situation where the ratio of spam to nonspam was consistently very high or (especially) very low, you could probably improve filter performance by incorporating prior probabilities. To do this right you'd have to track ratios by time of day, because spam and legitimate mail volume both have distinct daily patterns.
[2] Patrick Pantel and Dekang Lin. ``SpamCop-- A Spam Classification & Organization Program.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.
[3] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. ``A Bayesian Approach to Filtering Junk E-Mail.'' Proceedings of AAAI-98 Workshop on Learning for Text Categorization.
[4] At the time I had zero false positives out of about 4,000 legitimate emails. If the next legitimate email was a false positive, this would give us
[5] Bill Yerazunis. ``Sparse Binary Polynomial Hash Message Filtering and The CRM114 Discriminator.'' Proceedings of 2003 Spam Conference.
[6] In ``A Plan for Spam'' I used thresholds of
[7] There is a flaw here I should probably fix. Currently, when ``Subject*foo'' degenerates to just ``foo'', what that means is you're getting the stats for occurrences of ``foo'' in the body or header lines other than those I mark. What I should do is keep track of statistics for ``foo'' overall as well as specific versions, and degenerate from ``Subject*foo'' not to ``foo'' but to ``Anywhere*foo''. Ditto for case: I should degenerate from uppercase to any-case, not lowercase.
It would probably be a win to do this with prices too, e.g. to degenerate from ``$129.99'' to ``$--9.99'', ``$--.99'', and ``$--''.
You could also degenerate from words to their stems, but this would probably only improve filtering rates early on when you had small corpora.
[8] Steven Hauser. ``Statistical Spam Filter Works for Me.'' http://www.sofbot.com.
[9] False positives are not all equal, and we should remember this when comparing techniques for stopping spam. Whereas many of the false positives caused by filters will be near-spams that you wouldn't mind missing, false positives caused by blacklists, for example, will be just mail from people who chose the wrong ISP. In both cases you catch mail that's near spam, but for blacklists nearness is physical, and for filters it's textual.
In fairness, it should be added that the new generation of responsible blacklists, like the SBL, cause far fewer false positives than earlier blacklists like the MAPS RBL, for whom causing large numbers of false positives was a deliberate technique to get the attention of ISPs.
[10] If spammers get good enough at obscuring tokens for this to be a problem, we can respond by simply removing whitespace, periods, commas, etc. and using a dictionary to pick the words out of the resulting sequence. And of course finding words this way that weren't visible in the original text would in itself be evidence of spam.
Picking out the words won't be trivial. It will require more than just reconstructing word boundaries; spammers both add (``xHot nPorn cSite'') and omit (``P#rn'') letters. Vision research may be useful here, since human vision is the limit that such tricks will approach.
[11] In general, spams are more repetitive than regular email. They want to pound that message home. I currently don't allow duplicates in the top 15 tokens, because you could get a false positive if the sender happens to use some bad word multiple times. (In my current filter, ``dick'' has a spam probabilty of
[12] This is what approaches like Brightmail's will degenerate into once spammers are pushed into using mad-lib techniques to generate everything else in the message.
[13] It's sometimes argued that we should be working on filtering at the network level, because it is more efficient. What people usually mean when they say this is: we currently filter at the network level, and we don't want to start over from scratch. But you can't dictate the problem to fit your solution.
Historically, scarce-resource arguments have been the losing side in debates about software design. People only tend to use them to justify choices (inaction in particular) made for other reasons.
Thanks to Sarah Harlin, Trevor Blackwell, and Dan Giffin for reading drafts of this paper, and to Dan again for most of the infrastructure that this filter runs on.
More than 1.1 billion pigs are killed worldwide each year. For no reason.
Pork is an unhealthy food source. Most people who eat pork also have access to other, non-meat foods.
Pigs are some of the most intelligent beings on our planet. Why do we kill them by the billions? Just to enjoy the transient pleasure of tasting their flesh?
Amsterdam Vallon's post contained a handful of facts coupled with clear, readable statements backing his claims up. He was moderated mostly as a "Troll" and his post sits invisible at a score of 0.
The parent post by Tewfik is a few "sentences" of garbled English prose which contains more nonsense than factual information. His post sits at 5 and is moderated "Insightful".
I think the moderations and comments mentioned above show that the Pro-Linux factions are moronic and easily manipulated, much like employees who Microsoft chooses to hire.
The sad thing isn't that Linux is dead, it's that its followers won't admit it.
Apple chose BSD over Linux
LNUX is worth a few pennies and in danger of being de-listed off the NASDAQ stock exchange
MCSEs are getting more jobs than Linux sysadmins these days
Apple chose Trolltech/KDE over the Gecko/Linux renderer
OS X already won. Any geek or scientist worth their salt who wants a UNIX home computer already has an iBook or a G4.
As a developer, I'm asked on average of once or twice every year to suddenly pick-up a new technology and learn it within a couple weeks so that I can write a new program for release 6-12 months later as itself or jointly with the hardware guys.
When it comes to good, thorough documentation and API releases, I've always thought that this is an area where Linux is truly lacking. Hypothetically speaking, I think a coder learning Java for a new Windows P2P program that he must write would have a much easier time than a programmer who must learn Perl or C on his Linux box and create a network-intensive application that installs and runs the same way on all distributions of Linux, as well as Mac OS X.
I figure opinions from the "non enlightened", as many of you will probably call me, will help you to improve Linux, especially its documentation and user-friendliness.
One for those of you who dress nicer than me.
Yeah, those of us who have to dress nicely for work and who would never write "Credit Card sized 5GB HD to become late this year" on a website that gets millions of hits a day are real dorks, aren't we?
Pffft.
cmdrtaco is getting married
cmdrtaco is connected to the following things
cmdrtaco is connected to because
cmdrtaco is in the bag
cmdrtaco is getting married to the fine woman this website is run by
cmdrtaco is still known to post hoaxes or wild
cmdrtaco is gay" or
cmdrtaco is a big fan of both gnome and kde
cmdrtaco is brilliant
cmdrtaco is nothing more than a perl script operated by lucasfilms
cmdrtaco is lame
cmdrtaco is very unhappy with the current va situation
cmdrtaco is confident things will continue as usual
cmdrtaco is on the site
cmdrtaco is in the bag joining us as guest
cmdrtaco is this story
cmdrtaco is my hero
cmdrtaco is the one that is laying on the purple couch with the notebook
cmdrtaco is on vacation
cmdrtaco is a big cardcaptor sakura fan as well
cmdrtaco is also one of the linux
cmdrtaco is a torvelian
cmdrtaco is a fair replacment
cmdrtaco is an idiot
cmdrtaco is because the whole of the
cmdrtaco is quoted as saying
cmdrtaco is hanging with the techtv people
cmdrtaco is a big fan too
cmdrtaco is a dedicated programmer
cmdrtaco is psychic
With all the disease over there, I think they actually mean "Open Sores".
(Wow, an actual non-troll use of the "sores" joke!)
Carly,
With the increasing complexity of security issues, has your team at HP Research centers been interrogated or questioned mildly by any federal crime/security groups such as the RSA, CIA, or FBI?
As a computer scientist, I have found through years of extensive research that my digitally-compiled code tends to run faster on a computer than my black ink fountain pen does on a piece of paper.
Does it hurt as much as they say it does?
From the article:
Imagine never having to polish your shoes again, having glistening white teeth without visiting a dentist, or owning a mobile phone powered by a miniature battery.
When my work shoes need polishing, my wife does it or I pay 50 cents and have them done at the Mall Foodcourt or something.
I got glistening white teeth by using Colgate Gel. It's $14.99 and quite a bargain.
And my mobile phone gets 6 hours of battery life on a single charge, which is plenty since I don't really talk to many people.
So, honestly, could someone point me toward some practical real-world applications of this supposed "Next Generation" of nanotechnology?!
Research and Development Centers on Nanotechnology
Academic
Industry
National Laboratories
List Your Center on this Website
Funding and Sponsoring Agencies and Societies
Government
Professional Societies
Non-Profit Organizations
Major Research Reports and Publications
Books
Periodicals
Reports
Submit Your Publication
Conferences