Google Suggest Dissected
sammykrupa writes "Google suggest Javascript code dissected and rewritten for all of you web developers out there. Cool piece of web reverse-engineering!" Joel Spolsky astutely notes that this will raise the bar in terms of how people expect the "internets" to work.
Even though it's an M$-spawned horror, it has brought a new revolution to JavaScript. Now it can load data from the server without having to refresh the screen. Flash has an XMLSocket, but I have never seen anyone use it until now (pointers please).
:)
Even though Google Suggest looks great, I'd vote for CGI::IRC as the biggest killer HTML/JavaScript browser app.
Clientside Javascript is powerful, we never realized how much
Quidquid latine dictum sit, altum videtur
Google suggest is a neat idea, but a potentially destructive one.
Small sites should *not* try to do this kind of thing on a live site. The amount of pressure this could put on a bad database structure (or even a well formed one) is considerable. Think about how many database hits a user could perform in a very short space of time: (user enters something, (database hit) backspace (database hit) types another letter (database hit)), then multiply it by a hundred or more people if your site gets a moderate amount of traffic.
Google can get away with this because they have considerable bandwidth and large server farms. We've been seeing people trying to copy Google Suggest for the last couple of weeks in #javascript/freenode and in #php/freenode. The people trying to copy it generally do not understand how potentially bad this can be for a single server.
Anyhow, my advice is, don't do it unless you have the resources to scale your site. The cost of such an insignificant feature (let's face it, all it does is save the user one or two clicks) seems to outweigh the gain. If you do decide to do it, and your site gets popular, and you're on some kind of shared host, your sysadmin is going to hate you, and the other site admins will probably meet you at your house, torches in hand.
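For anyone who tries it anyway, the hit-per-keystroke problem described above can at least be softened on the client. What follows is only a rough sketch, not how Google does it: createDebouncedLookup, the delay value and the injectable timers object are all made-up names for illustration. The idea is to wait until the user pauses before contacting the server, so a burst of keystrokes and backspaces costs one request instead of one per key.

```javascript
// Hypothetical helper: wraps a lookup so that only the last keystroke in a
// burst reaches the server. `timers` is injectable so the sketch can be
// exercised without real clocks; by default it uses the standard timers.
function createDebouncedLookup(lookup, delayMs, timers) {
  const t = timers || {
    set: function (fn, ms) { return setTimeout(fn, ms); },
    clear: function (id) { clearTimeout(id); }
  };
  let pending = null;

  return function onInput(text) {
    if (pending !== null) {
      t.clear(pending); // a newer keystroke supersedes the older one
    }
    pending = t.set(function () {
      pending = null;
      lookup(text); // the only point at which the backend is contacted
    }, delayMs);
  };
}
```

Wired to a keyup handler with a delay of a few hundred milliseconds, typing "sou" quickly would produce a single lookup for "sou". This does nothing for server load once many users are typing at once, so the scaling warning still stands.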
BeauHD. Worst editor since kdawson.
LiveSearch does something very similar, is Open Source and has existed since April ;)
If you are looking for more XMLHttpRequest examples that tightly integrate JS and PHP (other server-side languages would be possible), see JPSpan.
I don't quite understand all the hype about Google Suggest. The technique has existed for at least two years on Mozilla (and even longer on IE). So something like this has been possible for a long time, but maybe everyone was just scared of using JS for "serious" stuff.
People know when they're sitting behind copious bandwidth. And you could well grow accustomed to an all-text page weighing the better part of a megabyte, due to a heinous amount of information parked in hidden JavaScript data structures, giving you that near-whiplash inducing responsiveness.
In fairness, Google Suggest, like Gmail, works very nicely for me on a 56k dialup. Gmail takes a few seconds for its initial load, true, but then it's like lightning. Suggest doesn't even have the slow initial load, since webhp.htm comes in at only 3.6kB. I'm very impressed.
Now I've no doubt that the bandwagon will bring us massive slow bloat as everyone gets his dog to code up vaguely similar functionality, but Google haven't done that.
Not to dismiss the neat reverse engineering he did, but is the actual discovery that big a deal? It's just a keypress handler, and some server communication. No big deal on any graphical user interface other than a web page.
Google have good UIs because they hire smart people. Other people don't because they don't hire smart people, or hire the wrong type of smarts (graphic designer instead of sw engineer for the coding part of a website, and vice versa).
I've looked at using the XMLHTTP object a couple of times in the past, and noted that this is partly how Google Suggest works.
XMLHTTP is a COM object included with recent versions of Internet Explorer. You can call it from client side JavaScript in a web page. The object will make a request to the URL you specify, and return the result into either a string variable, or an MSXML DOM object. You can then have the javascript output the results to an object (eg, a div tag) on the page without doing a full page reload.
I wrote a small tech demo that implemented a virtual tree - so when you expand a branch in the tree the client only retrieved the data it needed. This was borrowed from the approach the MSDN web site uses. The advantages to it are that it doesn't download the same data over and over like when you expand a branch in a server side tree. You also don't have to do any work at all to remember the state of the tree since there's no full page refreshes involved.
Google Suggest is similar, except that it is a virtual list rather than a virtual tree. A virtual list lets you list lots of items and jump around in the list without needing to download the entire data set when the page loads.
Another use for this would be dynamic forms - forms that alter the state of controls based on selections the user made in previous controls.
The biggest surprise to me was that Google have implemented this on a site live to the public. In using XMLHTTP I found it a little bit prone to locking up the browser when waiting for responses to requests. Additionally it's Windows only, so could never have been implemented on an external web site.
I'll be looking with interest at the Mozilla side of Google's implementation, since I didn't think an equivalent existed until now. Two different implementations of the same functionality are still going to put a damper on the technology though: different code for different browsers is usually more trouble than it's worth.
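For what it's worth, the two code paths can usually be hidden behind one small function. The sketch below is an illustration under stated assumptions, not Google's code: createRequest and fetchText are made-up names, and env stands in for the browser's window object (passed in only so the sketch can run outside a browser). It tries the native XMLHttpRequest first and falls back to the "Microsoft.XMLHTTP" COM object. Passing true to open() makes the request asynchronous, which is the usual way to avoid the browser locking up while it waits.

```javascript
// `env` stands in for the browser's global object (window); passing it in is
// an assumption made so the sketch can run outside a browser.
function createRequest(env) {
  if (env.XMLHttpRequest) {
    return new env.XMLHttpRequest(); // native object (Mozilla and others)
  }
  if (env.ActiveXObject) {
    return new env.ActiveXObject("Microsoft.XMLHTTP"); // COM object in IE
  }
  return null; // no way to make background requests here
}

// Fetch `url` and hand the response text to `onText` without a page reload.
// Returns false if the environment offers no request object at all.
function fetchText(env, url, onText) {
  const req = createRequest(env);
  if (req === null) {
    return false;
  }
  req.onreadystatechange = function () {
    // readyState 4 means the response has fully arrived
    if (req.readyState === 4 && req.status === 200) {
      onText(req.responseText);
    }
  };
  req.open("GET", url, true); // true = asynchronous, so the UI is not blocked
  req.send(null);
  return true;
}
```

In a page this would be called as fetchText(window, someUrl, callback), with the callback writing the text into a div. Everything after createRequest is the same on both browsers, so the "different code for different browsers" cost is limited to a few lines.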
1. Google performs several possible searches for each key you press
2. Google already knows the estimated number of results for millions of queries
Both of these suggest a heck of a lot of computing power. This type of thing might not scale up for general use in the near future - but still...
we're talking massive computational power and one of the largest databases ever created.
I'm a bit worried the Googleplex is going to wake up one day and declare to all us 'organics':
"yo bitches - you work for me now"
IMHO google shouldn't be the international standard moral censor of the web.
As a concerned parent (I'm not, but pretend) I wanted to help protect my teenage daughter so I looked for information by typing "sexual diseases". Granted the search would have worked, but as an unknowledgeable home user I thought there were no results.
IMHO, as well as prompting with common queries not involving any sequences of glyphs that the pope might blush at, google suggest should treat people with more respect and also return suggested spelling corrections and search result count for all exact search queries.
At worst, if the user types "cunt", google suggest should include all suggestions with "cunt" in them. And in that case where it is an extremely offensive word to white heterosexual christians (as that appears to be the only metric by which google can be bothered to censor), if the user types it, google should produce suggestions including less offensive words too.
I fear that might be the case. I learned to code HTML and to put a decent webpage, designed the way I wanted it, online with relative ease, at the age of 14. It took time to learn it, but it was fairly straightforward - I wanted a large header in Verdana, I put in "FONT FACE" and "H1" tags, I wanted a table with a specific background color, I put in a "BGCOLOR" etc.
Today, we have two languages (XHTML and CSS) instead of one (HTML), and while it certainly does a lot to improve interoperability and platform independence, it is two languages to learn, not one. Throw in stuff like JavaScript, and you have even more.
Of course one can choose not to use XHTML and CSS, but that's not the way we want it, right? We want people to use the standards, to write code which won't crash Firefox, and not to use proprietary solutions. Doing this is taking more and more effort. We have the skills and time to do and learn this, but not everyone has.
If we want a wide adoption of standards, and an Internet for everyone, where everyone has equal opportunities, the only way is to make the standards easy to use, so people will use them of their own free will.
Otherwise, in 10 years we'll be designing our fancy webpages, while the Joe Users who don't have the time or skills to learn the 13 languages required have no choice but to hire a professional, or use a crappy proprietary solution which won't allow them to take their ideas to their full potential, and this is a great loss for everyone.
Saying "You must do *complicated thing* because it's the specified standard!" will only work with people like us.
Christ, Google Suggest has been around for all of 8 days now. It was released December 10, and is a Google Labs project, which according to the website "showcases a few of our favorite ideas that aren't quite ready for prime time."
Shutting down free speech with violence isn't fighting fascism. It IS fascism!
After seeing google suggest, I built the same thing last weekend for CPAN modules. It's at http://teknikill.net/cpan/
The next thing I need to do is include the value of the dropdown box and limit the results on that.
This makes for an interesting way to sum up the internet into 26 words/phrases.
Check it out:
A - Amazon
B - Best Buy
C - CNN
D - Dictionary
E - eBay
F - FireFox
G - Games
H - Hotmail
I - Ikea
J - Jokes
K - Kazaa
L - Lyrics
M - Mapquest
N - News
O - Online Dictionary
P - Paris Hilton
Q - Quotes
R - Recipes
S - Spybot
T - Tara Reid
U - UPS
V - Verizon
W - Weather
X - XBox
Y - Yahoo
Z - Zip Codes
If I had to sum up the internet in 26 words/phrases, I don't think I could have done it better than Google. Of course, that is keeping in mind that Google Suggest has some pretty serious filters in place, so instead of P being "Porn" it is "Paris Hilton." Not too far off, if you think about it.
The thing that's interesting to me is that this is not really much different than the GMail compose address area, but suddenly it's brilliant in this respect.
Annoyingly, I'd written almost identical functionality for my own personal use maybe a year before I ever saw it in GMail (though it was already in place when I got GMail so I have no idea when they put it in there) because I really wanted standard combo boxes with pre-populated choices that also let you key in another choice. You could even use the arrow keys or the mouse just like Google's interfaces, with result caching (just like Google!) on the client side. Suddenly no one can stop talking about it.
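To make the client-side result caching concrete, here is a rough sketch of one way it could work. It is not my original code and not Google's: createSuggestCache, fetchSuggestions and limit are hypothetical names, and the fetch is shown as a plain synchronous function for brevity where a real page would use an asynchronous request. Besides remembering answers for prefixes already seen (which covers backspacing), it narrows locally whenever a shorter prefix came back with fewer rows than the limit, because in that case the server had nothing more to give.

```javascript
// Hypothetical sketch: `fetchSuggestions(prefix)` stands in for the server
// round trip and is assumed to return at most `limit` strings.
function createSuggestCache(fetchSuggestions, limit) {
  const cache = new Map(); // prefix -> array of suggestion strings

  return function suggest(prefix) {
    if (cache.has(prefix)) {
      return cache.get(prefix); // seen before, e.g. after a backspace
    }
    // If a shorter prefix returned fewer than `limit` rows, that list was
    // complete, so the longer prefix can be answered by filtering it.
    for (let n = prefix.length - 1; n > 0; n--) {
      const shorter = cache.get(prefix.slice(0, n));
      if (shorter && shorter.length < limit) {
        const narrowed = shorter.filter(function (s) {
          return s.startsWith(prefix);
        });
        cache.set(prefix, narrowed);
        return narrowed;
      }
    }
    const rows = fetchSuggestions(prefix); // only now contact the server
    cache.set(prefix, rows);
    return rows;
  };
}
```

Note that startsWith is case-sensitive, while a SQL LIKE on the server may not be, so a real version would have to normalize case on both sides.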
The only difference between this and the GMail address bar is that in GMail the complete address book is pre-populated, while here they use the browser's DOM object to pass a request for the data (that's how mine did it, since I was working on thousands of distinct combinations and didn't want page load times to become unmanageable). Populating the data is as simple as (if it were PHP):
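// Assumes $pdo is an open PDO connection and $user_input holds the typed prefix.
// Escape LIKE wildcards, then bind the prefix instead of interpolating it into the SQL.
$prefix = addcslashes($user_input, '%_\\') . '%';
$stmt = $pdo->prepare("SELECT search_text, result_count FROM common_user_searches WHERE search_text LIKE ? ORDER BY search_frequency DESC LIMIT 10");
$stmt->execute(array($prefix));
echo "<data-result>";
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // htmlspecialchars keeps the attribute value well-formed XML
    echo "<result text=\"" . htmlspecialchars($row['search_text'], ENT_QUOTES) . "\" count=\"" . (int) $row['result_count'] . "\"/>";
}
echo "</data-result>";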
I'll definitely agree, it's incredibly clever, but it's not so bleeding clever that it demands as much attention as they're getting for it.
Slay a dragon... over lunch!
XmlHttpRequest to fetch data on demand has been around for a long time. For example, MSDN has been using this technique for years now. I have been using it for 9+ months on an application that recently went into production.
The reason you have not seen it in use much is
Google's best engineering continues to be in the back end - that is what makes this thing possible, and why no one else would likely be able to replicate this. The ability to search billions of records that fast is simply staggering.
Wolf5K is a JavaScript clone of Wolf3D in 5 KB. I deobfuscated it and posted a series of tutorials on how it works here. There is also a C++ translation and enhancement series of tutorials here. Full ready-to-compile source is included for all tutorials.
The task of deobfuscating code is quite tedious but not too daunting. The main thing is getting the whitespace back in so you can see where all the functions begin and end. You then have to understand the language well enough that you can read the code and figure out what's going on without hints from comments or descriptive variables.
For Wolf5K I just started by working on the simple functions first and then by process of elimination worked my way through the code and finished with the raycasting function.
Translating it all to C++ was then quite easy because by then you have a very good grasp of how the code is supposed to work.
Work Safe Porn
*cringe*
This works just fine until something doesn't work *perfectly*, and then all hell breaks loose. I will give you a real world example. I'm currently working at a law firm (I'm starting law school in either fall 05 or 06) that uses a common "industry standard" database tool -- it is a flat-file DB that uses B+ trees. [B trees are like binary trees, but they have many children instead of two, and B+ trees store information only in the leaves.] The idea behind B+ trees is that because of the high degree of branching of the tree, you should never have to take more than 2-3 "slow memory" accesses to find your page. (i.e. the entire first node lives in memory. assuming a branching factor of 256, 16777216 records can be reached within 3 accesses.) Building these trees is also a time-intensive process since there are a lot of writes that happen to parent nodes, and it is very likely that pages get flushed from virtual memory. The problem is that no one has a CS background and so no one understands the memory hierarchy, virtual memory, caching, write-on-update, LRU/MRU page replacement, et cetera, so when Concordance is *slow as all hell* -- no one knows why. [The answer is: When indexing a large database, the programmers seem to have been sloppy and the main node spills over onto a second memory page. Once other nodes begin to spill over, you get a case of "thrashing" in which every time your computer pulls a node back into the "working set" of what lives in physical memory, it kicks something else you need out of physical memory. Google for "thrashing" and the "row-major" and "column-major" order problem.]
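The arithmetic behind the "3 accesses" figure is just powers of the branching factor. A small sketch to check it (reachableRecords and levelsNeeded are names made up for this illustration, and the model ignores details like partially filled nodes):

```javascript
// Number of records reachable through `levels` node visits when every node
// has `fanout` children.
function reachableRecords(fanout, levels) {
  return Math.pow(fanout, levels);
}

// Smallest number of levels whose capacity covers `records`.
function levelsNeeded(fanout, records) {
  let levels = 0;
  let capacity = 1;
  while (capacity < records) {
    capacity *= fanout;
    levels += 1;
  }
  return levels;
}

console.log(reachableRecords(256, 3));    // 16777216
console.log(levelsNeeded(256, 16777216)); // 3
console.log(levelsNeeded(2, 16777216));   // 24, for a plain binary tree
```

Three slow accesses against twenty-four is the whole point of the high branching factor, and it is also why things fall apart once nodes stop fitting in the pages they were sized for.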
*My* firm took a huge risk and hired someone with a CS degree (masters) rather than a paralegal, and they did some experimenting. I've gotten under-the-hood of many of their apps, and the things I've discovered have been shocking. (And these are industry "standard" solutions.) They've reaped the benefits of having someone that actually understands the underlying technology. Here is the archetypical example:
BUT: I am giving you a secret peek at the innards of foo!
Very long story short: at some level, there must be someone technical so that when things "go wrong" (like why many people accessing a shared hard drive over Ethernet for disk-intensive operations is a bad idea due to the nature of a bus architecture...) all hell doesn't break loose.
When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in vi. (Larry Wall)
Therefore, at present, this works only for English; with other languages it can happen that it suggests porn-prone search terms for the refinement of terms that have, as such, nothing to do with pornography. Some examples:
- the first suggestion for 'fille' (French for 'girl') is 'nue' (naked)
- the 5th suggestion for 'dzieci' (Polish for 'children') is 'nago' (naked)
- suggestions for 'mund' (German for 'mouth') contain 'mund auf sperma rein' (open mouth, introduce sperm), 'mund ficken' (fuck in the mouth), "mund arsch" (mouth ass)
- devochki (with Cyrillic letters: Russian for "little girls") gives the suggestions "devochki porno"
- the first suggestion for 'smot...' with Cyrillic letters (smotret': Russian for 'watch'/'look at') is "smotret' porno"
I think this is probably quite problematic - someone enters a search term that has nothing to do with pornography, and Google suggests something pornographic for 'refinement'. Of course, this is not due to Google's intent, but due to the distribution of the things people search for and of contents on the Internet. I suppose this is one of the problems Google will want to address before offering Suggest as an option on the main page.
unfortunately, "google suggest" is not as good as it could be.
why? valuable implicit information gained through the human-computer interaction is not fully exploited by "google suggest". for illustration, see the following example:
let's say i'm searching for "southwest". and for the sake of logic, let's assume that i either don't know the correct spelling or that i'm a lazy dog ;). so i start by typing "sou". after a short delay, google suggests "southwest airlines". ok, this seems to be what most people are searching for when entering "sou". luckily, "southwest" is the second most common suggestion listed in the drop-down list, so i just hit 'cursor-down' and 'enter' to autocomplete and search for "southwest". everything ok so far.
now comes the problem:
the top result displayed by google is.. southwest airlines! this of course doesn't make sense because if i wanted to search for southwest airlines, i would have happily accepted google's first suggestion already. actually, "google suggest" knows about my preference for "southwest" over "southwest airlines" and yet doesn't use this "extra-"information gained thanks to human-computer interaction! so my brain feels slightly offended ;)
to put it simply: if an average user is selecting a search term from a list of suggested search terms, he probably wants to search for that exact search term but not for any of the other also displayed suggested search terms. if not, an average user would have probably selected another search term out of the displayed list of suggestions. so to me, this looks like if the bright google guys forgot about the fact that the act of selection from a list also implicitly includes information about what does not get selected.
suggestion for a better "google suggest":
as a probably not perfect but working solution, "google suggest" could simply exploit this implicit user interaction information by excluding all explicitly deselected (and eventually all not explicitly selected) suggested search terms from the search query. in the example:
excluding all explicitly deselected search terms yields:
southwest -"southwest airlines" (voilà! southwest airlines is not the top result anymore ;)
excluding all explicitly deselected and all not explicitly selected search terms:
southwest -"southwest airlines" -"soulseek" -"south park" (etc.. you get the point)
that's pretty easy to implement - with an obvious benefit for average users.
disclaimer: i'm talking about expectations of average users here. iow: about users that are probably just interested in the few topmost results, i.e. the intersection and not the set union of results (but that's probably the point of web searching anyway ;). sure, there are people who are interested in the set union and not the intersection.. all they need is hitting backspace accordingly.
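to back up the "pretty easy to implement" claim, here is a minimal sketch of the query rewriting proposed above. it is an illustration only, not anything google does: buildQuery and quoteTerm are made-up names, and the list of rejected suggestions is assumed to come from whatever the drop-down displayed.

```javascript
// Quote a term if it contains whitespace, as in the example queries above.
function quoteTerm(term) {
  return /\s/.test(term) ? '"' + term + '"' : term;
}

// `chosen` is the suggestion the user picked; `rejected` lists suggestions
// that were displayed but not picked. Each rejected term becomes a quoted
// exclusion, matching the -"..." form used in the examples.
function buildQuery(chosen, rejected) {
  const parts = [quoteTerm(chosen)];
  for (const term of rejected) {
    if (term !== chosen) {
      parts.push('-"' + term + '"');
    }
  }
  return parts.join(" ");
}

console.log(buildQuery("southwest", ["southwest airlines"]));
// southwest -"southwest airlines"
```

passing only the suggestions ranked above the chosen one gives the first variant, passing every displayed suggestion gives the second.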