Over at the ASF a bunch of smart people are building Hadoop and HBase. The latter is an open-source version of BigTable, similar to Hypertable, but written in Java (not C++) and very actively developed in the open under the ASF umbrella.
Google sponsored several Summer of Code... summers, and good things came out of it. Some of the SoC projects actually ended up as Lucene contributions, too.
Are you sure it's Lucene and not Wikipedia's use of Lucene? (I never use Wikipedia's internal search, so I really don't know.)
Did you know that Amazon uses Lucene for "search inside the book", for example? Does that suck, too?
Wikipedia search may not be great, but Lucene itself is an amazing toolkit. I tend to think that without Lucene half of the companies that offer some kind of search (think Web 2.0) wouldn't know what to do.
Lucene is great and free. FAST, Autonomy, Google Appliance, Endeca, etc. are all *massive* and *expensive*. Compare that to the free and super-flexible Lucene! Oh, and it's not like there is no professional support and services around the Lucene stack! Just look at http://sematext.com/ and its client list and you'll see some big names.
It's important to remember that Skype comes from the same people who brought us Kazaa. It's the DNA.
Yeah. Look for updates to this over on TechCrunch - here.
This is much like Tylenol - it lowers body temperature and temporarily removes pain, but doesn't cure the underlying cause.
This is hardly surprising. Microsoft has a ton of researchers working on all kinds of things search. At this year's SIGIR (Special Interest Group on Information Retrieval), Microsoft had the most papers presented by far. It is also interesting to note that SIGIR 2006 was not very far from Redmond - in Seattle. More about SIGIR 2006 at http://www.sigir2006.org/ (MS was also a "Diamond Sponsor" for the event).
It's not that working for a high-profile service such as SourceForge is not fun, it's that working for a startup is often more fun, and working on your own startup even more fun. It's not fun joining a company that has been around for years - everything there has been pretty much figured out. Creative work has been done. Facing scaling issues has been done, massive email servers - done. Now it's maintenance phase, tweak here and there phase, rewrite component X from scratch (not so fun), and so on.
In startups you are the one who has to figure all this stuff out, and for lots of people that's a lot more interesting (see .signature below).
This is an experiment you performed on your own?
Really? Where did you get this information? I haven't seen this information published anywhere... but would love to see where this info comes from.
You clearly didn't read the Washington Post article.
The 16-year-old kid who logs onto MySpace at 02:41 is using the same computer in the basement that mom and dad use the next morning at 07:45 to log into their bank accounts, pay bills, trade some stock, and so on.
That's why even a free MySpace is a good target. As a matter of fact, MySpace is an excellent target because it has highly loyal and extremely active users who log into MySpace multiple times a day. This means that if the phishers' crack stays on the site even for a very short amount of time, they will be able to grab a solid number of usernames. If they did that on Simpy, for example, a site with nowhere near that many daily active users, the catch would be a lot smaller. That's why phishers targeted MySpace.
With MySpace being so popular and with its users regularly logging in on a daily basis, I wonder what the impact of this was in terms of:
1) the total number of "phished" accounts
2) the number of "phished" accounts in terms of a percentage of the total userbase.
Professors are not the ones who will decide whether Wikipedia will make the grade or not. The populace will. And the populace has already decided. I know a number of people who now go to Wikipedia first, Google second.
I am surprised nobody mentioned Apache Harmony - http://incubator.apache.org/harmony/ - that's an open-source Java SE implementation.
Koders and even Krugle guys precede Google's code search, but they are going to have a hard time attracting more developers' eyeballs - check this.
Too bad one can't get Google code search on there, too, but you can imagine how far that graph curve would be.
If you think providing equivalent service for less $$$ is possible, why not start that business?
I heard those VooDoo PCs come with built-in "nano-microphones".
Keep this stuff in mind when you leave your offices. Turn those electricity-wasting air-warming boxes off at night. Or at least make them Zzzz.
We are slowly working towards that, but we are not at the point where this can be done both fast and well. Unless you have FBI/NSA/CIA/government resources, of course.
There is a great little company in Brooklyn, NY called Alias-i. Some years ago they built this interesting "tool" called....guess....ThreatTracker. Information Extraction, Named Entity Recognition and other interesting stuff, if you are into this.
No, I don't work for them, but their LingPipe toolkit has some cooooool stuff.
I hope it turns out as good as its blurb makes it sound. I believe Pierre Omidyar's Omidyar Network was founded with the same or similar goals in mind.
But how do you know you didn't get any false positives? You could know that only if you examined every single email marked as spam. Did you really do that? If so, then what is the point of running spam filters? The whole problem is that spam consumes our time, and thus we want to get rid of it without ever seeing it. We don't want to monitor our spam filters.
This is not a surprise. It is simply another example of nature's laws on the web. This is not much different from the now well-known fact that most stories on Digg are submitted by a handful of people (see: Top 100 Digg Users Control 56% of Digg's HomePage Content).
Hah, interesting! Here is a post on a very related topic: Social Spam and Spam Incentives, as it relates to Simpy. It asks about incentives, about the choices of things that are "spamvertised" (who follows "home loan" links on a site that so obviously stinks of rotten spam?), etc.