DrSchlock · Slashdot Mirror

Re:don't bother........ on Why Learning Assembly Language Is Still Good · 2004-06-11 17:55 · Score: 5, Funny

I have a favorite fortune on my system...

The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.

Re:Vectors..... on How Apple's Mail.app Junk Filter Works · 2004-05-18 18:23 · Score: 1

Ugh. The magic doesn't come from vectors. Vectors are just how you throw the numbers around. The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test. So, forget all about "vectors", the real workhorse is the histogram.

I don't think this is really true either. They're definitely representing documents by vectors, where dimensions correspond to words. (I would bet they've added extra dimensions for features like message length and number of recipients, too.) There's more than one way to compute the distance between two vectors, but they're all pretty easy.
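To make the "vectors vs. histograms" point concrete, here is a minimal sketch in which a document becomes a sparse vector of word counts and two common distances are computed on it. The function names and the tokenizing regex are illustrative assumptions, not anything taken from Apple's filter, and the chi-squared formula is just one common form of that distance.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def chi_squared_distance(a, b):
    """One common chi-squared histogram distance: sum (a_i - b_i)^2 / (a_i + b_i)."""
    total = 0.0
    for w in a.keys() | b.keys():
        # A Counter returns 0 for a missing word, so the denominator is never 0 here.
        total += (a[w] - b[w]) ** 2 / (a[w] + b[w])
    return total


a = bag_of_words("Buy cheap pills now")
b = bag_of_words("now pills cheap buy")  # same words, different order
c = bag_of_words("lunch at noon")        # no words in common with a
print(cosine_distance(a, b))             # 0.0
print(chi_squared_distance(a, c))        # 7.0
```

Either way the same counts are being compared; the histogram and the vector are one object, and the distance function is the easy part.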

The hard part is using this collection of labeled vectors to generate a rule that correctly predicts the labels of new vectors, i.e., divides up the vector space in a good way.

Image classification is rather different: as you observe, a lot more effort goes into extracting meaningful features of the image. In document classification, you can do a certain amount of this, perhaps some sort of syntactic analysis; but usually most of your features still end up just being words, which are easy to pick out. In both problems you then have to divide up the vector space somehow.

You can argue that the ease of representing a document makes document classification an easier problem. But it's not a feature of this algorithm; everyone currently doing document classification pretty much ends up using a bag-of-words vector, because it's easy and works very well... even though it seems intuitively very silly to throw out word ordering information.

Re:Kinda like Mozilla Mail? on How Apple's Mail.app Junk Filter Works · 2004-05-18 17:39 · Score: 5, Informative

This spam filtering feature seems pretty similar to the one found in Mozilla Mail. In fact, I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.

Not exactly Bayesian, no. It's a different kind of document classification algorithm, which the article calls Latent Semantic Analysis. Basically they represent each message as a point in a high-dimensional space (based on the unordered words in the document), and figure out which parts of the space tend to be occupied by spam e-mails. This involves quite a lot of computation to determine a likely boundary between the parts of the space representing spam and non-spam messages, given only a collection of labeled points.

To make this train and run reasonably quickly, they have to do dimensionality reduction on the space: they collapse dimensions which tend to be correlated or redundant or useless. (If "teens" and "gushing" generally appear together in messages, they probably don't need two separate dimensions; if "hi" is equally likely to appear in spam and non-spam, it may not need a dimension at all.)
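As a rough illustration of collapsing dimensions, the sketch below drops words that are about equally common in spam and non-spam, then merges words that always appear in the same messages. This is deliberately NOT what LSA does (LSA normally uses a truncated singular value decomposition of the word-document matrix); it is a crude stand-in for the intuition, and the function name, the `min_gap` threshold, and the toy messages are all assumptions.

```python
from collections import defaultdict


def prune_and_merge(docs, labels, min_gap=0.2):
    """Return groups of words; each group acts as one remaining dimension.

    docs   -- list of sets of words, one set per message
    labels -- parallel list of "spam" / "ham" strings (both must occur)
    """
    n_spam = labels.count("spam")
    n_ham = labels.count("ham")
    vocab = set().union(*docs)

    # Step 1: drop words that are about equally common in spam and non-spam.
    kept = []
    for word in sorted(vocab):
        in_spam = sum(1 for d, l in zip(docs, labels) if l == "spam" and word in d)
        in_ham = sum(1 for d, l in zip(docs, labels) if l == "ham" and word in d)
        if abs(in_spam / n_spam - in_ham / n_ham) >= min_gap:
            kept.append(word)

    # Step 2: merge words that occur in exactly the same set of messages.
    groups = defaultdict(list)
    for word in kept:
        signature = frozenset(i for i, d in enumerate(docs) if word in d)
        groups[signature].append(word)
    return sorted(groups.values())


docs = [
    {"hi", "teens", "gushing"},
    {"hi", "teens", "gushing", "free"},
    {"hi", "meeting"},
    {"hi", "lunch", "meeting"},
]
labels = ["spam", "spam", "ham", "ham"]
print(prune_and_merge(docs, labels))
# [['free'], ['gushing', 'teens'], ['lunch'], ['meeting']]
```

Six words shrink to four dimensions: "hi" disappears because it says nothing about the label, and "teens" and "gushing" share one dimension because they never appear apart.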

A naive-Bayes classifier is much simpler: Assuming that the words in a document occur independently of one another given the document type, it selects the document type (spam or non-spam) that maximizes the total probability of the observed words. There's no training beyond counting how often each word occurs with each document type.
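A minimal sketch of such a classifier, to show how little there is to it. Training is only counting; classification sums log-probabilities per document type. The class name, the add-one smoothing, and the inclusion of a prior based on how many messages of each type were seen are assumptions of this sketch, not details of any particular mail client.

```python
import math
from collections import Counter


class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of word occurrences
        self.doc_counts = Counter()  # label -> number of training documents

    def train(self, words, label):
        """Training is nothing more than counting words per label."""
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1

    def classify(self, words):
        """Return the label with the highest log-probability for these words."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())

        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior: how common this label was in training (assumption).
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                # Add-one smoothing so an unseen word does not zero out the product.
                score += math.log((counts[w] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayes()
nb.train("cheap pills cheap".split(), "spam")
nb.train("free pills now".split(), "spam")
nb.train("lunch meeting today".split(), "ham")
nb.train("project meeting notes".split(), "ham")
print(nb.classify(["cheap", "pills"]))    # spam
print(nb.classify(["meeting", "notes"]))  # ham
```

Logs are summed rather than probabilities multiplied only to avoid floating-point underflow on long messages; the chosen label is the same.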

Naive Bayes typically works nearly as well as more complex methods, and runs much faster. But presumably Apple feels their LSA implementation is fast enough, and sufficiently more accurate than simpler techniques to be worthwhile.

Re:More like Montgomery Burns last words in... on Project Grizzly Bear-Proof Suit Up For Auction · 2004-05-10 08:38 · Score: 2, Funny

Or the classic Mike Scioscia -

Can't... lift... arm... or... speak... at... normal... rate...

Re:Stealing Japanese technology... on Build Your Own Wireless Beer Pitcher Monitoring System · 2004-05-07 17:17 · Score: 1

It *is* (apparently) a new invention; it happens to address the same problem, but in an entirely different way. The Cornell students measure pitcher tilt; MERL measures electrical capacitance.

Re:Measure weight on Build Your Own Wireless Beer Pitcher Monitoring System · 2004-05-07 17:09 · Score: 1

That's all you have to do. Just measure the decrease in weight. Why do they have to make it any more complicated than it needs to be? *sigh*

It's not really that simple, though. You'd need some sort of force sensor in the bottom of a pitcher, like a spring. The problem is that the force would change all the time: when you lift and lower the pitcher (think of the force on your feet in an accelerating elevator), when it bangs on the table, when it tilts, etc. You could add some sort of timer to make sure the force decrease lasts a while, but now we're getting away from simplicity again. Plus wear and tear on the sensor would probably be rather high.
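To show what the "timer" idea costs in practice, here is a small sketch that only accepts a weight decrease once it has persisted for several consecutive samples, so brief spikes and dips from lifting or banging the pitcher are ignored. The function name, the gram units, the threshold, and the sample values are all hypothetical; a real sensor would need calibration and probably better filtering.

```python
def detect_pours(readings, threshold=50.0, hold=3):
    """Return (sample_index, new_weight) for each sustained weight decrease.

    readings  -- weight samples taken at a fixed interval (assumed grams)
    threshold -- minimum drop below the last accepted weight that counts
    hold      -- how many consecutive low samples are required
    """
    if not readings:
        return []
    baseline = readings[0]  # last accepted stable weight
    run = []                # consecutive samples well below the baseline
    events = []
    for i, weight in enumerate(readings[1:], start=1):
        if weight <= baseline - threshold:
            run.append(weight)
            if len(run) >= hold:
                # The drop has lasted long enough: accept it as the new weight.
                baseline = sum(run) / len(run)
                events.append((i, baseline))
                run = []
        else:
            run = []  # a transient; start counting again
    return events


# A spike (lift), a one-sample dip, then a real pour down to 800.
samples = [1000, 1000, 1400, 700, 1000, 1000, 800, 800, 800, 800]
print(detect_pours(samples))  # [(8, 800.0)]
```

Even this toy version needs a baseline, a threshold, and a hold time to tune, which is the point: the weight approach stops being simple as soon as the pitcher moves.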

Re:How is this so different? on Build Your Own Wireless Beer Pitcher Monitoring System · 2004-05-07 16:44 · Score: 3, Informative

It's a different way of solving the same problem, and a reasonably clever one at that. Each idea has its points; the original capacitance method is cheaper, as the authors observe, but it also doesn't work well with viscous fluids that cling to the side of a container and conduct electricity around its circumference.

An alternate solution... on Build Your Own Wireless Beer Pitcher Monitoring System · 2004-05-07 16:34 · Score: 4, Informative

The same problem can also be solved by measuring capacitance of the glass across the remaining fluid. (I don't really understand this, but I believe it's fairly simple.)

The article references this, in fact.

http://www.merl.com/projects/iGlassware

Re:Major Problem on The Trouble With Using D&D Rules In Videogames? · 2004-04-12 17:10 · Score: 1

Actually, since they dropped the A when 3rd edition came out, regular D&D is now officially more advanced than AD&D. Better and much, much more rules-consistent, too; not quite enough for a computer, but pretty good.

Re:Google vs. spammers on Google's Gmail To Offer 1GB E-mail Storage? · 2004-03-31 16:11 · Score: 2, Insightful

What I don't understand is how they can justify this as a business. How will they recoup the cost of providing millions of gigs of storage and huge amounts of bandwidth indefinitely?

Well, this kind of service wouldn't actually require 1 gig/user. It's not like they're handing you your own sealed-off hard drive. Most people will never use anything like that much space, I suspect, and the company would only have to pay for the amount used in practice.

They'd recoup costs the same way they do for search: through targeted ads. They're already pretty much caching the Internet; if anybody could handle this kind of space and bandwidth, it would be Google...

Re:And just what does this say to the Interviewee? on Only 32% of Java developers really know Java · 2004-03-21 09:54 · Score: 1

create a hashmap of numeric counters such that I am going to increment a million times, but only store 100k keys into the table

He means: map 100k keys to integers, then repeatedly ("a million times") access and increment those integers.
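Read that way, the task is small. A sketch in Python, with a dict standing in for the Java hash map under discussion; the function name, the integer keys, and the random access pattern are assumptions made only to have something runnable.

```python
import random


def count_accesses(num_keys=100_000, num_increments=1_000_000, seed=0):
    """Map num_keys keys to counters, then increment them num_increments times."""
    rng = random.Random(seed)
    counters = {key: 0 for key in range(num_keys)}  # 100k keys stored once
    for _ in range(num_increments):
        key = rng.randrange(num_keys)  # pick an existing key
        counters[key] += 1             # look it up and increment in place
    return counters


counters = count_accesses(num_keys=1_000, num_increments=10_000)
print(len(counters), sum(counters.values()))  # 1000 10000
```

The table never grows past the original keys; all the work is in the repeated lookups and updates.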

Comments · 11