One of the key aspects of the automobile, in contrast to other forms of transportation, is that it is more deadly to anyone getting in the way or disobeying the unwritten rules of the road. It's like the Mafia - they don't have to kill everybody, just enough to send a message.
Now, if suddenly we have cars which don't run red lights, and which stop every time for pedestrians or dogs, cats, etc. which appear in front of the vehicle, chaos will ensue.
Imagine walking down a crowded sidewalk. You're constantly being blocked, jostled, and otherwise impeded by people who show little concern for your presence, because you're not a threat.
If the motor-death equation is suddenly removed, the same situation will occur on our sacred highways - walking, bicycling, and other un-American forms of transportation will take over the streets!
AFAIK they use some kind of search on the lexicon for the inverted index. For instance, the string "nut" is matched to "nutmeg", "donut", etc., and the document lists for those terms are merged together. Phrase search would also be done using all matching words, eg "nut hol" would expand to phrase searches like "donut hole", "peanut holder", etc.
The exact method for matching the search string to the lexicon isn't clear. It could be a suffix tree, but it may be as simple as grep-like scanning of the words, since there aren't that many relative to the text size.
Looking at mail.app it seems to do this process on each keystroke. It's not terribly fast, but it gets the job done.
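A minimal sketch of the grep-like variant described above, assuming a plain in-memory inverted index. The names `build_index` and `substring_search` are made up for illustration, and nothing here is known to match what the product actually does.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def substring_search(index, fragment):
    """Scan the lexicon grep-style and merge the document lists of every
    term that contains the fragment."""
    fragment = fragment.lower()
    hits = set()
    for term, doc_ids in index.items():
        if fragment in term:
            hits |= doc_ids
    return hits

# Hypothetical sample documents.
docs = {
    1: "nutmeg is a spice",
    2: "a donut hole",
    3: "plain bagel",
}
index = build_index(docs)
print(sorted(substring_search(index, "nut")))
```

The scan is linear in the number of distinct terms, not in the text size, which is why it can be tolerable even when rerun on each keystroke.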
I'm afraid they've slapped their name on another marketing move with little potential. They're not the first into the market, and they're many features short of being the best.
The Google Search Appliance was their first cola: a minimal port of their web search and crawling platform, with a Google brand name as the only selling point. It's a feature-deficient, PHB-created product in a market with better and cheaper alternatives. Sales have been disappointing, and it's on its second or third product manager.
Now we get a half-baked, non-portable, closed product that shows us ads when we search our own data. Sorry, Google, you should be able to do better with all that IPO cash.
Re:SQL is good for some things, but not for others (on "An Alternative to SQL?", Score: 1)
The SQL3 standard includes recursive queries. DB2 has had them for years.
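For a feel of what a recursive query looks like, here is a small sketch that uses SQLite through Python's `sqlite3` module as a stand-in. The `emp` table and the `reports_of` helper are hypothetical, and DB2's exact syntax may differ from what SQLite accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, boss TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("alice", None), ("bob", "alice"), ("carol", "bob"), ("dave", "alice")],
)

def reports_of(conn, boss):
    """Return everyone below `boss` in the hierarchy, at any depth."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(name) AS (
            SELECT name FROM emp WHERE boss = ?
            UNION ALL
            SELECT emp.name FROM emp JOIN chain ON emp.boss = chain.name
        )
        SELECT name FROM chain ORDER BY name
        """,
        (boss,),
    ).fetchall()
    return [row[0] for row in rows]

print(reports_of(conn, "alice"))
```

The first SELECT seeds the result with direct reports, and the second one keeps joining back onto the rows found so far until nothing new turns up.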
Unfortunately he's not the first person to recognize the phenomenon. IBM was similarly full of fools when it tanked in the early 90's, and I doubt if I was the first to notice. My father might have been, but he wasn't very forthcoming about it :-).
I just went through a couple of rounds of interviews with a spam filtering company about doing something similar. The problem these days is that spammers have figured out that "V1AGRA" can be spelled in a number of ways which fool word-based spam filters. There is also a lot of hidden information, such as html and urls, which may be significant, but is difficult to identify with exact string matching.
The approach used to be:
1. Find features (usually well-delimited words) in the message.
2. Look up the features in a database of precalculated scoring information.
3. Add up the scores for all the features found, using some buzzwordy algorithm.
Nowadays the features may not be so obvious. For instance "V1AGRA" may not be present in the feature database, but if "VIAGRA" is, we should be able to link to it via some sort of approximate match, or substring match. Here we can see that both strings have "AGRA" in common, and score accordingly. Longer strings, like "Former Dictator of Nigeria", provide more material to match on.
One problem with substring matching is that substrings can overlap, yielding multiple matches for the same piece of text. A string of length n has roughly n^2/2 different substrings (n(n+1)/2 if you count every start and end position), so our feature space is enormous. Adding up all the feature scores from multiple overlapping hits in a useful way is also much more difficult.
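To make the overlap problem concrete, here is a rough sketch that enumerates substrings and intersects them for two strings. The helper names and the minimum length of 3 are assumptions for illustration, not anything a real filter is known to use.

```python
def substrings(s, min_len=1):
    """Return the set of distinct substrings of s with at least min_len chars."""
    return {
        s[i:j]
        for i in range(len(s))
        for j in range(i + min_len, len(s) + 1)
    }

def shared_substrings(a, b, min_len=3):
    """Substrings of at least min_len characters that occur in both strings."""
    return substrings(a, min_len) & substrings(b, min_len)

def longest_shared(a, b, min_len=3):
    """Longest common substring, or an empty string if there is none."""
    common = shared_substrings(a, b, min_len)
    return max(common, key=len) if common else ""

n = len("V1AGRA")
print(n * (n + 1) // 2)                       # start/end position pairs
print(sorted(shared_substrings("V1AGRA", "VIAGRA")))
print(longest_shared("V1AGRA", "VIAGRA"))
```

The same four characters produce three separate hits ("AGR", "GRA" and "AGRA"), which is exactly the double counting that makes naive score addition unreliable.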
One way out of this mess is to pick a really simple scoring method. Gzip "scores" (in compression amount) messages on how many characters match, in substrings beyond a certain length (3, in DEFLATE), using a greedy algorithm. It's a simple tool for gauging the similarity of two files.
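A quick sketch of the compression trick, using Python's `zlib` module (the same DEFLATE algorithm gzip uses). The `extra_bytes` helper and the sample texts are made up for illustration; the idea is only that a message resembling the reference adds fewer compressed bytes than an unrelated one.

```python
import zlib

def csize(data: bytes) -> int:
    """Size of the data after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))

def extra_bytes(reference: bytes, message: bytes) -> int:
    """Compressed bytes the message adds when appended to the reference.
    Smaller means the message shares more substrings with the reference."""
    return csize(reference + message) - csize(reference)

# Hypothetical sample texts.
known_spam = (b"Dear friend, I am the former dictator of Nigeria "
              b"and I need your help moving funds.")
similar = b"I am the former dictator of Nigeria and I need your help."
unrelated = b"Quarterly zoning board minutes: parking variance approved by vote."

print(extra_bytes(known_spam, similar), extra_bytes(known_spam, unrelated))
```

This only works while the reference fits in the compressor's 32 KB window, so for a large corpus it is more of a gauge than a classifier.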
The IBM method seems a bit more sophisticated. I've looked up similar methods in bioinformatics textbooks. They handle overlapping, and appear to choose their features with a substring-counting approach.
Before you begin using a trademark, you may wish to check if someone else is using it. This can be a laborious process of reading newspapers and periodicals, books, online news and web sites, and the like, to find prior uses of the term you wish to trademark.
Fortunately, a new interweb tool called a search engine makes this process much faster than it used to be. A number of fantastic, dynamic companies, such as Yahoo, Lycos, and Altavista, have sprung up to build these engines using cutting-edge technology.
To use a search engine, go to the search engine website and type the word you wish to search for into the text box presented. Press the "search" button, and, in as little as 30 seconds, the search engine will scan all of the websites it has indexed, and present a list of results ranked by importance, using sophisticated relevance-calculating techniques.
Some newer search engines allow scanning of up-to-the-minute news articles, and even popular forums such as Usenet.
iPod runs an embedded OS, as does the Airport Express (Broadcom chipset, I believe, with Linux, like Linksys routers). Apple seems to have more embedded OS's than regular ones.
You may laugh, but this isn't far from the procedure used to get high-quality EEG data. The stuff you get from outside the skull is generally junk.
Researchers used to piggyback on severe epilepsy patients, whose condition had gotten so bad as to require surgery to remove or alter parts of the brain that triggered the seizures. This operation required a bit of reconnaissance to find the offending grey matter, so a craniotomy (skylight in the cranium) was standard diagnostic procedure, and the operation usually had a few extra minutes for experimental measurements.
Some of the more advanced people used to insert probes all the way into the brain to trigger the seizures; the whole process was guided by EEG's to gradually refine the location of the source.
One of my programs was set up to take EEG's from an 8x8 electrode array, which was laid upon the brain after the skull and membrane were removed. I almost got to attend one such procedure live, but I was scratched from the roster at the last minute - that's a lesson as to why software shouldn't be too reliable.
As far as using a soundcard, I'm not surprised at all. A soundcard is basically a two-channel A/D converter. You need a lot more channels to compete nowadays, but for the price, you can't beat the commodity hardware. The only additional hardware you need is a bank of preamps, and possibly a clock/timer board to make sure the sampling is precise. And, of course, a drill.
You're not thinking creatively enough. One of the main uses of this technology will be to avoid annoying people, panhandlers, Hare Krishnas, people you owe money to, and so forth.
With a bit of computer aid we'll be able to walk right through a crowd of Dianetics pamphleteers without making eye contact or slowing down. Of course I do this already, but there will be much less risk of collision.
This was a fairly expected move due to the tax cuts you note. Cash in the bank can be realized either as a capital gain in the stock, if the company holds onto the cash, or as a dividend of about the same amount. The old system encouraged capital gains instead of dividends, but the new one makes dividends preferable.
MS went for years without paying any dividend, because stockholders were able to get their returns in price appreciation. Now, expect flatter pricing, with more dividends. That's good news for stockholders, but bad news for stock option holders.
That's funny; I noticed that same "worse than High School" look in the party photo that Peter Norvig showed at a Google recruiting presentation. Apparently they even have a "Google Dance", which I imagine to be somewhat like PeeWee Herman in a biker bar, but less graceful.
After the Google appliance, this seems like an expected move. The desktop is certainly key from a marketing sense.
However I don't see a lot of overlap with web search. The major pieces won't work the same:
Crawling: People want fresh information, eg that marketing report that just went out five minutes ago. Many web sites are happy to be crawled once a month. Keeping up with user edits on a filesystem is going to be a lot harder, and users will probably not be happy with heavy reindexing cycles. The ultimate would be heavily integrated with the filesystem, keeping an eye on all file activity, and refreshing the index appropriately. I believe Longhorn's delays are related to this problem.
Indexing: Desktops have a lot of file types, and strange crypts like the Outlook mail store. Certainly Google has some support in this area, but more may be needed. There are also other document units like email messages instead of files, or even database records.
Fetching: Granted, a simple search toolbar will work, but I've been more impressed with, for example, Apple's Sherlock protocol, which allows multiple search "channels", eg Web, News, Stocks, etc., some from third party providers. IIRC this is what Firefox uses.
Ranking: Pagerank is definitely not going to work, although that may not be such a handicap when hit counts are in the one or two-digit range. Still, it's not a competitive advantage.
Latent Semantic Indexing has been around for a while, and I've forgotten many of the details. As some have mentioned it's a dimension reduction technique, and the result is a set of eigenvectors, each of which describes a set of terms which correlate well with each other (or anticorrelate, I think components can be negative too).
In English terms, the technique finds sets of words that occur together in different subject areas, and gives them weights which reflect how often they occur together. For instance, "baseball" and "bat" may emerge as common companions in some documents, so they might get weights of 1.0 for both (in one eigenvector/topic) if they always occur together - meaning a query for "bat" should always return hits for "baseball" too. However if "bat" gets diluted by documents about flying animals, then its weight in the "baseball"-"bat" vector will be reduced, say to 0.5. Then queries for "bat" will not necessarily map to baseball documents, but to both areas, represented by different eigenvectors.
That's confusing enough, but LSI gives a clean method for managing all of these relative probabilities in a global space of word occurrence vectors. The "latent" part is how it discovers these topic areas automatically, by clustering words which occur together. This process is similar to data mining for common subsets, but with LSI the members of the subsets are actually weighted for significance.
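Here is a toy sketch of the baseball/bat example. Real LSI runs a full singular value decomposition of the term-document matrix; this version swaps in plain power iteration to recover only the single dominant term vector, and the three-term vocabulary and document counts are invented for illustration.

```python
import math

def cooccurrence(term_doc):
    """Term-by-term matrix C = A * A^T for a term-by-document count matrix A."""
    n = len(term_doc)
    return [
        [sum(a * b for a, b in zip(term_doc[i], term_doc[j])) for j in range(n)]
        for i in range(n)
    ]

def dominant_vector(matrix, iterations=100):
    """Power iteration: repeatedly multiply and normalise to approach the
    eigenvector with the largest eigenvalue."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        vec = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

# Rows are terms, columns are documents (hypothetical counts):
# two baseball documents mention "baseball" and "bat", one mentions "bat" and "cave".
terms = ["baseball", "bat", "cave"]
term_doc = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]
weights = dict(zip(terms, dominant_vector(cooccurrence(term_doc))))
print(weights)
```

In this toy data "bat" and "baseball" carry most of the weight in the dominant vector, with "cave" trailing; a full decomposition would also produce a second vector that separates the flying-animal usage.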
I came up with this idea a couple of years ago, after a few in-flight disintegrations of air tankers. The idea is that the JDAM kit can drop anything on a dime, costs less than $20k, and could probably cost a lot less in a non-military configuration.
There are actually a lot of pros that I didn't think about initially. Besides the safety problem with diving into fire zones, there's also a fuel problem, since each climb out consumes almost as much as taking off. This constraint reduces the weight capacity of each mission -- many tankers seem to fly with only a fraction of their rated weight.
The ability to load a plane up to its full capacity with retardant, fly to a fire area, and make repeated, accurate drops from high altitude, without running out of gas, seems like a major plus to me. There are also benefits in being able to make "quick response" drops, eg from Smoke Jumper aircraft, with less risk.
Of course, I'm not buying, just looking for competitive information :-).
To date, all we've gotten from Google is a two-page flyer listing some of the features; I don't get the sense that they're overly excited about the product.
I found out an interesting fact a while ago: Google schools all of its new employees in intellectual property law, in a course lasting several days, covering patents, trade secrets, copyrights, and the like. This is a paranoia level approaching IBM, where every copy machine has a traceable watermark. Even sales people can't reveal competitive analyses, or any high-level marketing research, even if it might help a sale. Requests, for instance, for a feature comparison of the Google search appliance vs. its competition are met with a stony wall of silence (and appropriately so, I might add).
So, if you keep track, Google interviews contain almost no information, and are mainly public relations exercises. Vague statements about the corporate culture, some well-aligned musings about the company's future direction, and oh look at the time, the interview's over.
I suspect most of their searches are done by an Amiga behind the coffee bar.
The issue is whether the company will bet the farm on search or not. Many companies build search tools of various kinds, but few try to do so as a pure play. MSN has recently refocused its efforts on search, but it hasn't historically tried to make money on that alone.
He said "search engine companies", not search engines. Companies which do other things don't qualify. MSN, for instance, is affiliated with some company that makes computer mice.
The Japanese gained over 80% market share for DRAM in the 80's, and then a mysterious fire destroyed a glue factory that was needed for some aspect of production. Alas, production dropped. DRAM prices went through the roof, and stayed that way until the Koreans broke the monopoly in the 90's.
But there was no hint of wrongdoing. Would you like some whale sushi?
Competing against dogs for DBA jobs.
Much as I like this platform, I find that the display tends to get blurry after a few hours of use. Are they working on a fix for this problem?
And if they start distributing Kool Aid, watch out.
Google is the internet!