Yes, and it is remarkably underwhelming. I love how he takes credit for Bayesian spam filtering from his rather naive 2002 proposal when far more sophisticated proposals (e.g. Sahami et al.'s 1998 paper on junk mail filtering) precede it by a number of years. His proposal would never pass peer review at any reasonable venue since there's absolutely nothing new about it (and it is in fact flawed, though to be fair, many of the flaws were fixed in his follow-up article).
Another joke is how he claims to have written the "first web-based application". Click the link, and then you get the caveat: he defines "web based application" very precisely to fit rather arbitrary aspects of his app. Bzzt.
Intelligent mail sorting. This isn't my idea, but unfortunately I can't remember where I heard it (probably someone else's Slashdot post). The same methods that are used today in Bayesian spam filtering could be used to sort my mail into folders for me. This might prove a little more difficult than spam filtering because you have n categories instead of just 2 (spam and not spam), but it would certainly be useful.
Yes, folder categorization is MUCH harder than spam classification, and the implication is you have many more errors. One way to minimize the impacts of misclassification error is to simply use the machine-learning-determined categorization to sort the inbox, rather than automatically file mail into your folders. An implementation of this in Lotus Notes (yuck!) is described here. This would be an amazingly useful feature.
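For what it's worth, here is a rough sketch of the "sort the inbox, don't auto-file" idea in Python. It is a toy multinomial naive Bayes with one class per folder; the class name `FolderRanker`, the helper `sort_inbox`, and the whitespace tokenizer are all made up for illustration and have nothing to do with the Lotus Notes implementation mentioned above.

```python
import math
from collections import Counter, defaultdict


class FolderRanker:
    """Toy multinomial naive Bayes over bag-of-words, one class per folder."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.doc_counts = Counter()              # folder -> messages seen
        self.vocab = set()

    def train(self, folder, text):
        words = text.lower().split()
        self.word_counts[folder].update(words)
        self.doc_counts[folder] += 1
        self.vocab.update(words)

    def rank(self, text):
        """Return folders ordered from most to least likely for this message."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for folder, n_docs in self.doc_counts.items():
            counts = self.word_counts[folder]
            # Add-one smoothing so unseen words do not zero out a folder.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            scores[folder] = score
        return sorted(scores, key=scores.get, reverse=True)


def sort_inbox(ranker, messages):
    """Group the inbox by predicted folder instead of moving anything.

    A wrong guess only puts a message in the wrong part of the list;
    nothing gets filed away where the user will not see it.
    """
    return sorted(messages, key=lambda m: ranker.rank(m)[0])
```

The point of `sort_inbox` is that a misclassification costs you a glance at the wrong cluster rather than a lost message, which matters a lot once n folders push the error rate up.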
3 weeks ago, for shits and giggles, I precalculated md5 passwords based on a 5 million word dictionary. I dropped all of the results in a PostgreSQL database. Took about 12 hours to complete, mainly because the app I wrote to handle it was kinda poor and a quick hack. If I were to re-do it, I would use my workstations to create the checksums, and do the inserts.
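A minimal sketch of that kind of precomputed lookup table, in Python. It uses the stdlib `sqlite3` module as a stand-in for PostgreSQL, and the table name `md5_lookup` and both function names are invented here. The one design choice worth noting is that all inserts go through a single transaction; committing row by row is a common reason quick hacks like this crawl, though that is only a guess about the original app.

```python
import hashlib
import sqlite3


def build_table(words, db_path=":memory:"):
    """Hash every word with MD5 and store digest -> word for reverse lookup."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS md5_lookup "
        "(digest TEXT PRIMARY KEY, word TEXT)"
    )
    # Generator, so a large dictionary is never held in memory as hashes.
    rows = ((hashlib.md5(w.encode("utf-8")).hexdigest(), w) for w in words)
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR IGNORE INTO md5_lookup VALUES (?, ?)", rows
        )
    return conn


def lookup(conn, digest):
    """Return the dictionary word for an MD5 hex digest, or None."""
    row = conn.execute(
        "SELECT word FROM md5_lookup WHERE digest = ?", (digest,)
    ).fetchone()
    return row[0] if row else None
```

This only works against unsalted hashes, since a salt changes the digest of every dictionary word.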
I don't know why it takes them so long to come up with a result. Needless to say, I am gonna have to 1 up them now. Tonight, I am gonna start precalculating a database similar to theirs. Difference is, mine will run MUCH faster.
If it took 12 hours to load a mere 5 million passwords, you'll need around 6,770,663 hours to load the entire space of hashes considered by this tool. Somehow I doubt people will be very impressed if it takes you that long to one-up them. But hey, good luck with it!
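The extrapolation is just a linear scale-up of the quoted load rate. The sketch below reproduces the figure under the assumption that the tool's space is 36^8 candidates (eight characters drawn from a 36-symbol alphabet); that size is inferred from the number and is not stated anywhere in the thread.

```python
def hours_to_load(space_size, loaded=5_000_000, load_hours=12):
    """Scale the observed load time linearly to a larger password space."""
    return space_size / loaded * load_hours


# Assumption: 8 characters from a 36-symbol alphabet (e.g. a-z plus 0-9).
ASSUMED_SPACE = 36 ** 8

hours = hours_to_load(ASSUMED_SPACE)
years = hours / (24 * 365)
print(f"{hours:,.0f} hours, about {years:,.0f} years")
```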
True in principle but irrelevant in practice -- the password space they are considering can be represented in roughly 48 bits. So the chance of a collision when hashing such passwords into a 128-bit MD5 hash space is EXCEEDINGLY LOW.
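To put a number on "exceedingly low", here is a birthday-style upper bound in Python. It assumes MD5 behaves like a uniformly random function on these inputs, which is an idealization, and the function name is made up for this post.

```python
from fractions import Fraction


def collision_upper_bound(input_bits, hash_bits):
    """Union bound on the chance that any two of 2**input_bits inputs collide.

    Each pair collides with probability 2**-hash_bits under the
    random-function assumption, so the bound is (number of pairs) / 2**hash_bits.
    """
    n = 2 ** input_bits
    pairs = n * (n - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


bound = collision_upper_bound(48, 128)
print(f"P(any collision) <= {float(bound):.3e}")  # just under 2**-33
```

So even hashing the whole 48-bit space, the odds of seeing a single collision are on the order of one in ten billion.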
My criticism stems from significant experience developing GUIs with Swing under v1.4, which is the latest non-beta release. And sorry, Swing is still dog slow for anything but trivial GUI stuff (e.g. menus and dialogs).
Maybe it's "good enough" in 1.5 but I doubt it. The problem is one of poor design choices (e.g. object-creation-happy APIs), and all the kludges in the world (e.g. caching compiled native code) are unlikely to change that.
More FUD based on no evidence. Java has not been slow for years.
Unless you use Swing, which is a critical component for writing desktop apps unless you resort to non-standard libraries.
Sun, swallow your pride for just a second and deprecate Swing (yes, ALL of it!). Then adopt SWT instead and you'll be doing the community a huge service. It's no surprise that the only usable desktop Java apps I'm aware of use SWT. In addition to Eclipse which most here are probably aware of, take a look at Azureus for example. I'd love to see someone port Azureus to Swing as an exercise to demonstrate just how bad Swing really is in comparison.
Right, preserving the integrity of the search results in the case of malicious users injecting false information into the system sounds like a challenging problem.
But a p2p search engine seems to be the only way to go for an open-source search engine implementation.
Agreed... I think it's solvable too -- just difficult, which makes it all the more fun.
Also I accidentally linked to the wrong paper in my original reply. I meant to link to this one, but that other paper was at least marginally related.
Something that would make a nice opensource project would be to include p2p search functionality in apache itself.
This way all the modified web servers would make a giant distributed search engine.
Some nice algorithms like Koorde or Kademlia could be used.
Anyone thought about starting something like this?
We looked into something a lot like what you suggest (and actually have it up and running inside our intranet with 2k or so users). The problem with doing this on the internet is that p2p techniques are MUCH more susceptible to spamming than centralized techniques in general (because, for one, p2p reputation systems are very difficult to get right). Another problem is that most existing p2p search methods work great for finding popular content but not very well for finding that very specific piece of information that maybe only you are looking for at the current moment. Kademlia/Chord are DHTs and do not solve the text search problem on their own. While some p2p networks have adapted DHTs for keyword searching, the results still leave a lot to be desired (IMO).
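To make the DHT point concrete, here is a toy Python sketch of the usual keyword-search adaptation: each term is hashed to a node that stores the posting list for that term. The `ToyDHT` class is hypothetical and places keys with a simple modulo, which stands in for real Kademlia or Chord routing and is not how either actually works.

```python
import hashlib
from collections import defaultdict


class ToyDHT:
    """Exact-key put/get spread over a fixed set of in-process 'nodes'."""

    def __init__(self, node_count):
        self.nodes = [defaultdict(set) for _ in range(node_count)]
        self.lookups = 0  # counts simulated network round trips

    def _node_for(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key].add(value)

    def get(self, key):
        self.lookups += 1
        return set(self._node_for(key).get(key, set()))


def publish(dht, doc_id, text):
    """Index a document by storing its id under every distinct term."""
    for term in set(text.lower().split()):
        dht.put(term, doc_id)


def search(dht, query):
    """AND-query: fetch one full posting list per term, then intersect."""
    postings = [dht.get(term) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```

Notice what is missing: the DHT only answers "who holds this exact key", so a multi-term query ships entire posting lists around, there is no ranking, and any peer can `publish` junk under popular terms. Those gaps are the text search and spam problems described above.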
My wish would mostly be to improve responsiveness.
I'm no fan of MS by any stretch of the imagination, but Gnome menus on my 2-way 2.4GHz Xeon server feel like a crappy Java Swing app on my much older Windows desktop machine.
I heard that 2.6 kernel should help things a bunch but I'm still not seeing it.
I agree including blogging is just stupid.... however an integrated p2p file sharing tool or user-friendly personal webserver would rock. Everyone has to share/move around files at some point, whereas very few have anything interesting enough to say to justify a blog. Yeah I realize we have scp but that's not real friendly when you need to move something to a non-linux box, or want to push something out to multiple parties.
My main problem with FC2 has been that Java doesn't work with the i686 kernel. That's a pretty major problem and frankly I'm surprised they would make it a release in such a state.
Stick with FC1, which I have no complaints about whatsoever.
Very well said. I've reported several bugs in Java runtimes. Years back they were quite responsive, but the last one I submitted was well over six months back, and I have received no word yet. The previous bug I reported took about 7 months to be confirmed. That's pathetic. And I always submit concise source code clearly demonstrating the problem.
Sun -- either get your act together, or open Java up.
Funny that they should mention that, given that Abelson & Sussman, in MIT's own CS courses (available via video lecture), make the comment that programmers who have to go through a planning phase before they program aren't real programmers. It looks like MIT made most of the programmers who have no respect for the software development cycle.
And you are full of shit. I've taken the course, and never heard any such nonsense. Perhaps you should be a little more specific in your citation -- nobody is going to dig through an *entire semester* of video lectures to see if what you say has any merit.
FYI MIT has required since the early 80's at least that all CSE grads take a software engineering course which is very heavy on specifications and design documentation.
we have to trust any mail service at some point. My point? I'll trust them not to actually read them. Anonymous ad fetching? That's OK.
OK, maybe Google isn't evil now, but what if Microsoft were to buy them out? And who's to say going public won't pressure them to derive as much $$$ out of their data as possible? Anonymous ad fetching could only be the beginning. Yes you have to trust your mail provider, but the sheer scale of Google's service is what scares people. That is, it's not just your mail they may be able to access, but your mail and that of (almost) all your friends, family, and co-workers! That's a hugely valuable asset and there will be pressures for Google to exploit it in all kinds of (possibly nasty) ways.
Kudos though to Sergey who does seem to be seriously thinking about privacy issues (though the interviewer's suggestions about using an encrypted store are laughably naive).
One of the ways you could provide some clarity about the privacy issue would be to push this data through an encrypted store. You could keep the indexes unencrypted, but keep the rest of the data encrypted.
What a pointless idea. The index contains almost all the data contained in the store (excepting maybe stop words), and besides, the store needs to be decrypted to render the messages as HTML. So how would this brain-dead idea provide *any* privacy whatsoever? OK, so it might protect against an evil system administrator who might happen to run off with one of the store's disks, but not the index, and not manage to get hold of a decryption key (which must reside on *every* machine that does rendering of store-derived information). Even using segmented store encryption would offer little privacy.
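Here is a small Python sketch of why an unencrypted index gives the game away. It assumes a positional inverted index (term to document and word offset), which is what phrase search needs; the stop-word list and both function names are invented for the example, and nothing here claims to describe how Google's index is actually laid out.

```python
from collections import defaultdict

# Hypothetical stop-word list; real systems use their own.
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}


def build_index(doc_id, text, index=None):
    """Add one message to a positional index: word -> [(doc_id, position)]."""
    index = defaultdict(list) if index is None else index
    for pos, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            index[word].append((doc_id, pos))
    return index


def reconstruct(index, doc_id):
    """Rebuild a message from the index alone; '_' marks dropped stop words."""
    slots = {}
    for word, postings in index.items():
        for d, pos in postings:
            if d == doc_id:
                slots[pos] = word
    if not slots:
        return ""
    return " ".join(slots.get(i, "_") for i in range(max(slots) + 1))
```

Indexing "send the password to bob tonight" and calling `reconstruct` gives back everything except the two stop words. Even a non-positional index would still leak the full bag of words per message.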
A crack would imply it breaks the encryption scheme. However, seeing as it only works on music someone has legally purchased, it's clear to me that this relies on having access to the decryption keys. So it sounds as if they simply reverse engineered the decryption protocol. Not an easy task by any means, but it's not as interesting as something like DeCSS which involved determining both the decryption keys and decryption algorithm.
Since you can't do low level memory optimizations (e.g. to maximize cache utilization), Java will never approach well optimized C/C++ implementations, at least for memory intensive apps. (And let's not even bring up that bloated mess known as Swing...)
For example, here is a recent paper that includes a comparison of Java and C/C++ for XML processing tasks. The conclusion is that Java based implementations are not yet competitive with the C/C++ implementations.
Don't get me wrong, I *love* Java, but when I do highly performance critical stuff, it's just not the right tool for the job.
Re:It's like Netscape v. Microsoft in that... (on Google v. Microsoft · Score: 1)
Yeah, but Lucent and AT&T labs had just as powerful groups of theoreticians, almost all of whom have by now been laid off. These sorts spend their lives avoiding doing anything that might actually be considered practical (with a few rare exceptions). Besides, building a search engine is more a complex engineering problem than anything else. You're better off hiring people with strong database or text analytics backgrounds than the theoretical types. That is, if you expect anything to actually get done:-)
Having attended the last 2 code-cons, I can highly recommend the event. The focus is on working or near-working applications in p2p, privacy, encryption, and other topics most Slashdotters know and love. The crowd is also great... you'll learn a lot simply talking to people between presentations. Bram and Len have done a great job with the program and this year looks to be no exception.
I'd like to second that review. We now use our Vonage number as our "primary" line. Quality is typically as good as any other service, though sometimes there is a noticeable "half duplex" quality about it (speaking into the phone silences any incoming speech). Other than that, the service is great. I love the web-based configuration options for voice mail, forwarding, etc. And the international rates are very very good.
.... but yes I agree that a machine capable of playing "perfect" chess is a long way off!
Sorry, I thought you were arguing in favor of the author's unfounded claims that humans are unlikely to be beaten consistently by machines any time soon...
I took a look at the entire site and found no such explanation. Any other suggestions? I'm curious too.
Unfortunately I can't mod it up as I've already posted, but it's (perhaps painfully) spot on, and funny as hell to boot.
Doh! Broken link. The paper I meant to cite is actually here
Google Sucks Search on Google