Interview With Google's Director of Research

Re:[ot]Google's data structure? by Anonymous Coward · 2001-06-21 00:13 · Score: 1

none. It's a gigantic Perl script written entirely of RegExen and one tr///.

use strict is REMed out.

Re:Voice activated search engine by Anonymous Coward · 2001-06-21 03:21 · Score: 1

I love when people don't read the article and post

He did read the article. He said "I'm not sure that's not why they were working with BMW." Note the double negative, hence he is sure that is why they were working with BMW.

you want choices? by Anonymous Coward · 2001-06-21 08:47 · Score: 1

nah!

bestbet at shootybangbang.com bounces you straight to the *best bet*

sometimes it is smart, sometimes it is stupid

Re:Voice activated search engine by Anonymous Coward · 2001-06-21 01:37 · Score: 2

(car cuts driver off)
"Fuck you, asshole!"
(computer beeps)
[25,945 results found.]

Re:Actual Questions for Ask Jeeves by Tony+Shepps · 2001-06-21 13:09 · Score: 1

Quality! I've linked back to your site too. Now I wonder if anyone else was doing this...

Actual Questions for Ask Jeeves by Tony+Shepps · 2001-06-21 02:18 · Score: 2

In the good old days, ask.com let you see everything being asked of Jeeves, unfiltered. I watched it for a while, saving off the really weird questions, and made a page of it here.

Happy reading, and remember, you're looking at the end of the human race.

Re:Actual Questions for Ask Jeeves by billybob · 2001-06-21 05:11 · Score: 1

hehe, pretty funny :)

It is too bad they took that away.

--
Joseph?

Yikes, Zephyr Interactive? by drsoran · 2001-06-21 01:37 · Score: 1

So these are the guys we can complain to whenever we hit one of those heavy Flash-laden sites like Ford? Geez, I went to their site and the stupid background Flash process was eating up 85% of my CPU time. I love when these companies add useless bullshit eye candy to a site.

Re:Prepositions need love too by Malc · 2001-06-21 00:26 · Score: 2

Judging by the article, they build lists of words, and find their intersection. I can't imagine how big the lists for common words (e.g. articles) would be. Perhaps they had to cut them out due to hardware constraints?

Deja by Tet · 2001-06-21 00:44 · Score: 2

The most interesting part about the interview was the snippet that implies Google didn't have much of a say in the Deja archives being down after the buyout. So it wasn't the complete cock up that we all thought it was. They still handled the PR really badly, though. If they'd just told people what was happening, I'm sure they wouldn't have come across half as badly as they did.

--
"The invisible and the non-existent look very much alike." -- Delos B. McKown

Re:Deja by i0lanthe · 2001-06-21 02:18 · Score: 1

So it wasn't the complete cock up that we all thought it was.
News flash! Slashdot (retroactively) gives someone the benefit of the doubt! Gif at 11.

--
"The Crystal Wind is the Storm, and the Storm is Data, and the Data is Life"

Google also does Mac searches! by GPS+Pilot · 2001-06-21 04:14 · Score: 1

google.com/mac

But the Google/Mac logo isn't as cool as the Google/Linux logo -- it contains all the fruity colors that Apple has largely abandoned.

--
That that is is that that that that is not is not.

Re:[ot]Google's data structure? by K-Man · 2001-06-21 01:37 · Score: 2

The documents are assigned id's 1..n and, for each word, an ordered list of id's of documents containing the word is constructed. When a search asks for, say, "cheese fondue" the array for "cheese" and the array for "fondue" are retrieved and merged using a sorted list merge (fast, since the arrays are already ordered). The result is a list of document id's that were in both lists, i.e. documents containing both words.

There are various ways to speed this up by compressing the arrays, hash joins, etc., but the basic idea is the same.
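As a rough illustration of the sorted list merge described above, here is a minimal Python sketch. The posting lists and the function name `intersect_sorted` are made up for the example; nothing here is taken from Google's code.

```python
def intersect_sorted(a, b):
    """Return the IDs present in both sorted lists, using a two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, skip it
        else:
            j += 1  # b[j] cannot appear later in a, skip it
    return out

# Hypothetical posting lists: word -> ascending document IDs.
postings = {
    "cheese": [1, 4, 7, 9, 12],
    "fondue": [2, 4, 9, 15],
}

both = intersect_sorted(postings["cheese"], postings["fondue"])
print(both)  # [4, 9]
```

Each list is walked once, so the cost is linear in the combined length of the two lists, which is why keeping them ordered pays off.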

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Re:[ot]Google's data structure? by K-Man · 2001-06-21 03:11 · Score: 3

That's true if the data is changing. However most search engines do web crawls in large chunks, and index the data once in one large block. Under such conditions dynamic management of hit lists and other data structures is not necessary. Basically, the bytes are packed as tight as they can get them so that it all fits into memory.

As far as I can tell from their paper, Google manages its web crawls the same way. It partitions the data into "barrels" and indexes each separately. Once the indices are built, they aren't updated. They also extend the hit lists to include word position and some other attributes for each hit.
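To show why storing word positions in the hit lists is useful, here is a small Python sketch of a positional index that can answer exact-phrase queries. The structure (`word -> {doc_id: [positions]}`) and the sample documents are assumptions for illustration, not the layout of Google's barrels.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted doc IDs where the words of `phrase` appear consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Only documents containing every word can match.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[words[0]][doc_id]:
            # Word k of the phrase must sit k positions after the first word.
            if all(start + k in index[w][doc_id] for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return hits

docs = {
    1: "hail to the chief",
    2: "the chief said hail to everyone",
    3: "chief among them was hail",
}
index = build_positional_index(docs)
print(phrase_search(index, "hail to the chief"))  # [1]
print(phrase_search(index, "hail to"))            # [1, 2]
```

Document 2 contains all four words but not in order, so only the position check separates it from document 1.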

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Re:Prepositions need love too by rsidd · 2001-06-21 02:30 · Score: 1

It says it's ignoring them, but the top few "hits" typically do include the exact page. I just tried, for instance, "All your base are belong to us". It claims to ignore "are" and "to" but the top few hits contain the exact phrase. (The same happens with your example "Hail to the chief", though it says it's ignoring "to the".)

Re:Prepositions need love too by Zagadka · 2001-06-21 11:07 · Score: 2

Now bear in mind that Google couldn't even come up with the phrase, however much I +'d it to death, on its top ten list. If I only have that one phrase in memory on Google, I can't find it.

The problem is that you +'ed it too much. If you search for +"+but +that +the +dread" you'll notice that it gives you some warnings. Google's ignoring all of the +'s you added, because you're using some of them incorrectly. ("dread" is not a stop word, for example)

Instead, try searching for "but +that +the dread". Then you'll get what you're looking for.

Re:Voice activated search engine by FFFish · 2001-06-21 12:14 · Score: 2

Oh, gahd.

That's just great. Now the cell-phone dolts in the SUVs will be using Google *at the same time* to check on their facts, *while* they are driving...

--
Don't like it? Respond with words, not karma.

Re:Disturbing Search Requests by ergo98 · 2001-06-21 01:16 · Score: 1

That site is absolutely hilarious! Thank you for the link.

Masturbation Techniques by ergo98 · 2001-06-21 00:30 · Score: 5

Google absolutely blows away the competition, however it is humorous seeing entries in my log file related to people looking for masturbation tips (from the beginner level "How To" style queries, to full blown searches for advanced techniques). The page in question is entitled "Hey Jerk : Get Off My Computer!" (and relates to pop-up ad windows) and I'm, uh, proud to see that it ranks #2 for searches for "jerk off technique" (I've had dozens of related hits appearing). While it is humorous seeing searching going a little off-track, I am very curious how many consumers know that each link you follow passes on where you came from, so for instance I see log entries like

200x-xx-xx xx:xx:xx xxx.xxx.xxx.xxx GET /rants/jerk/index.htm 200 5986 334 270 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;+DigExt) http://google.yahoo.com/bin/query?p=jerk+off&b=21&hc=0&hs=5
-or-
200x-xx-xx xx:xx:xx xxx.xxx.xxx.xxx GET /rants/jerk/index.htm 200 5986 437 1292 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;+DigExt;+sureseeker.com) http://www.google.com/search?q=guys+who+jerk+off
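For anyone wondering how easy it is to pull the search phrase back out of a referrer, here is a short Python sketch using the standard library. The function name `search_terms` and the sample URLs are hypothetical; the `q` and `p` parameter names are the ones visible in the two log lines above.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms(referrer):
    """Return the search phrase embedded in a referrer URL, or None."""
    query = parse_qs(urlsplit(referrer).query)
    # "q" and "p" are the parameter names seen in the two log lines above.
    for key in ("q", "p"):
        if key in query:
            return query[key][0]
    return None

# Hypothetical referrers with a harmless query.
print(search_terms("http://www.google.com/search?q=cheese+fondue"))
print(search_terms("http://google.yahoo.com/bin/query?p=cheese+fondue&b=21"))
```

`parse_qs` decodes the `+` signs back into spaces, so the site owner sees the phrase exactly as it was typed.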

Re:Masturbation Techniques by daviddennis · 2001-06-21 04:28 · Score: 2

The OmniWeb browser on MacOS X has a very nice feature, enabled by default, which simply disables all pop-up windows. You can disable all pop-ups, or disable only pop-ups that are not the result of you manually clicking on a link.

Unfortunately, OmniWeb's JavaScript support is lacking in other areas, but that feature is brilliant, and their text display is the cleanest I've ever seen in any program. Linux users should get MacOS X just to rest their bad font weary eyes :-).

D

----
Re:Masturbation Techniques by Krilomir · 2001-06-21 04:46 · Score: 2

This is one of the reasons I found Gnutella fun when it first came out ... just looking at all those searches. It became even more fun when people began using the Gnutella-search-stream as a chat-feature ;)

Re:why I like google by daviddennis · 2001-06-21 04:37 · Score: 2

Perhaps the best news, though, is that

http://www.google.com/windows/

doesn't work. Great job!

D
----

Voice activated search engine by funkman · 2001-06-21 00:05 · Score: 2

They are working with BMW to see if they can integrate the search engine into the car to do a search based on what you say.

Even out of the scope of a car - this feature would be awesome if it were integrated with cable (or satellite) and the TV room

Get me Gilligan's Island ... Click

Re:Voice activated search engine by funkman · 2001-06-21 01:38 · Score: 3

I love when people don't read the article and post. From page 2 of the article:
What other kinds of search are you developing?
We have a voice-search project with BMW -- BMW wants to put voice search into their 7 Series cars. They want to put microphones in the cars -- you can just speak whatever your search is and then it gives you answers back on a display. Then you just say the result number and the search jumps to that result.
Re:Voice activated search engine by SpookyFish · 2001-06-21 00:12 · Score: 1

As nifty as it would be to be able to voice search google from your car, I'm not sure that's not why they were working with BMW.

I know BMW is/was talking to them about their corporate site search offering, where they handle the indexing/searching of your intranet (in BMW's case) or public site.

I suspect this is where a large part of google's revenue will be in the coming years -- that's great, if it keeps the ad banners off their public service!

Re:Perks by ethereal · 2001-06-21 00:38 · Score: 2

All search engines spider ahead of time and store; to do otherwise would take forever to get you any search results ("It's a terrible strain on the animators' wrists." :) My impression from the article was not that they generate whole searches ahead of time, but that they categorize by the individual search words, and then when you type in a query they generate the intersection of the pages on their many word lists. Then one miracle occurs, and ...

Caution: contents may be quarrelsome and meticulous!

--

Your right to not believe: Americans United for Separation of Church and

Re:Regex: won't happen by griffjon · 2001-06-21 04:01 · Score: 2

*sigh* you're right.

But in the case where they would implement my ability to submit a RegEx, I could give them lots of flex on the time in return for the exact one page that I want. How hard could it possibly be?
(dodging)

--
Returned Peace Corps IT Volunteer

Re:Prepositions need love too by griffjon · 2001-06-21 00:15 · Score: 3

I'm just waiting for them to implement a RegEx interface. Now THAT would be some love for the geeks out here.

--
Returned Peace Corps IT Volunteer

Dumb question (?) by DonK · 2001-06-21 01:07 · Score: 2

I missed an answer to "How come for the last N months the Google front page has stated:

Search 1,346,966,000 web pages

and this number doesn't change?"

Re:Dumb question (?) by QuantumG · 2001-06-21 12:50 · Score: 1

It's just a rounding error.

--
How we know is more important than what we know.
Re:Dumb question (?) by SpaceLifeForm · 2001-06-21 05:52 · Score: 1

Because that is what is HARDCODED in the HTML.

Why don't they update it? That is the question!

--
You are being MICROattacked, from various angles, in a SOFT manner.

Re:Prepositions need love too by King+Babar · 2001-06-21 01:59 · Score: 3

For example, searching for: "Hail to the chief" would ignore to and the. In order to actually search for the phrase (which I indicated that I wanted to do by surrounding it in quotation marks), I would have to type "Hail +to +the chief". Hardly user-friendly.

And, actually, that's not quite right, either. It's apparently always going to blow off your "the" (I just tried it). This is, alas, a seriously hard problem. What you were doing was looking for what actually amounts to a single chunk of information: the title of a fanfare played for the president. Unfortunately, the English version of the title is four words long although the title itself might in some cases act just like a single word (or noun phrase). So:

That was one of the worst "Hail to the chief"s that I have ever heard.

Yes, you might even pluralize it just like a noun. So that's one problem right there: search terms that really are tantamount to a single lexical item might be four or more words long, and might even be inflected.

Ideally, you'd like to index separately these multi-word chunks, especially if you can prove they occur way more often than expected. So in your example, "hail" and "chief" co-occur on about 28,000 pages, while "hail" alone is on 510,000 and "chief" alone is on over 1,500,000. If Google indexes 1.5 billion pages (or so), and the terms were independent, then, you'd expect something like 500 co-occurrences, and 28,000 is so outrageously out of line you would know that something is up.
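The independence estimate can be checked with a few lines of Python. The counts are the ones quoted above; the variable names are just for this sketch. If the two words were independent, the expected number of pages containing both would be P(hail) × P(chief) × N, which simplifies to hail × chief / N.

```python
total_pages = 1_500_000_000  # approximate index size quoted above
hail = 510_000               # pages containing "hail"
chief = 1_500_000            # pages containing "chief"
observed = 28_000            # pages containing both

# Expected co-occurrences if the two words were statistically independent.
expected = hail * chief / total_pages
print(expected)             # 510.0
print(observed / expected)  # roughly 55 times the independent estimate
```

With these figures the estimate comes out at about 510 pages, so the observed 28,000 is roughly 55 times what independence would predict.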

Now, I'm guessing that *local* co-occurrence information is eventually going to prove even handier in this regard. So, for example, "hail to" comes up 157,000 times, which is about 1/3 of all "hail" pages. That's very unlikely unless there's something systematic (and very possibly exploitable) going on.

The big problem is that you can't really do much with function words alone, since they're just too staggeringly frequent. In running English text, the frequency of "the" is just about 70,000 per million. In other words, 7% of all English text consists of the definite article, and most web pages contain many distinct copies. You've got to kill that. Unfortunately, by omitting "the", you lose a lot of potentially useful information about definiteness of the noun phrase. In the "hail to the chief" example, the song title itself is just one example of a (somewhat) productive expression "hail to [definite-NP]", which has a specific kind of meaning implied (interestingly, usually sarcastic or abusive). Picking up on this could be very useful.

So suppose I typed into deja "bush mass-mooning Gothenburg". I'll get 9 hits. That's nice, but google might want to do more, and provide additional examples of president (or candidate) Bush being derided in public. Or maybe give me pages that refer to the same incident being described as the Swedish version of "hail to the chief".

So there is no doubt that function words need love, but I'd argue for a love that seeks to understand them and their weird little contributions to meaning rather than just a way to make sure you can nail a song title exactly.

--

Babar

the technology behind google by dizco · 2001-06-21 01:37 · Score: 2

There's an excellent presentation at technetcast by Jim Reese (chief operations engineer @ google) called "the technology behind google", in mp3 format. It's much more technical than this interview, really a very good listen. get it here

--sean

Re:[ot]Google's data structure? by daytrip · 2001-06-21 02:10 · Score: 3

You'll probably get a reasonable idea at this page:

http://www-db.stanford.edu/~backrub/google.html.

Also, try a lookup for a Bloom filter, which google uses, I think. Most search engines work by inverting the index, and then merging the lists. Taking the intersection of all the keywords gives you the membership, then you apply ranking to the membership. Pretty simple concept. I don't know of any search engines that use a trie, or use any form of stemming.
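A toy version of "invert, intersect, then rank" might look like the Python below. The sample documents are invented, and the scoring (counting keyword occurrences) is only a stand-in for whatever ranking a real engine applies.

```python
from collections import defaultdict

def invert(forward_index):
    """Turn doc_id -> list of words into word -> set of doc_ids."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return inverted

def search(forward_index, inverted, keywords):
    """Intersect the keyword sets, then rank members by keyword occurrences."""
    if not keywords:
        return []
    members = set.intersection(*(inverted.get(k, set()) for k in keywords))

    def score(doc_id):
        # Placeholder ranking: total occurrences of the keywords in the doc.
        return sum(forward_index[doc_id].count(k) for k in keywords)

    # Highest score first; ties broken by document ID.
    return sorted(members, key=lambda d: (-score(d), d))

forward_index = {
    1: ["cheese", "fondue", "recipe"],
    2: ["cheese", "shop"],
    3: ["fondue", "cheese", "cheese", "party"],
}
inverted = invert(forward_index)
print(search(forward_index, inverted, ["cheese", "fondue"]))  # [3, 1]
```

Document 2 drops out at the intersection step, and document 3 outranks document 1 only because of the placeholder score.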

-js

Re:Is she hot or not? by ashitaka · 2001-06-21 01:03 · Score: 1

Lame geek reaction to a woman.

See a female who ain't your mother,
run in circles, sweat and stutter.

From the article:
"...people like my husband would get crazy. He just wants to find pages that have his words."

Lesbian? Not. Competent? Hell, yes!

--
If you don't want to repeat the past, stop living in it.

Re:Smarter Searches by Xofer+D · 2001-06-21 01:26 · Score: 2

Last semester, I did a directed study about applying approximate machine reasoning to human information access, specifically to searching hypertexts of metadata. One of the ideas I looked at was an article about a search engine called FuzzyBase (pdf) which was developed by three people including my professor, who works in the SFU Communication Networks Laboratory. FuzzyBase did just what you suggest - it used an interactive user session to disambiguate user queries. There are several interesting technologies which use this sort of thing to obtain unambiguous search keys, and most involve the usage of semantic ontologies. If you want to get started looking at this stuff, have a look at some of the articles on this page, especially the online links at the end of the page. There are already search engines that do this to some degree.

--
The Signal/Noise ratio can be improved in two ways. Remaining silent is the OTHER way.

German queries at fireball.de by harmonica · 2001-06-21 02:41 · Score: 2

German search engine fireball.de has a page that lets you see what others have requested in the last 30 seconds. There are some sick people out there...

MP3 of that talk by harmonica · 2001-06-21 03:03 · Score: 3

You probably mean The Technology Behind Google. It's a 73 min MP3, very interesting!

Re:MP3 of that talk by htmlboy · 2001-06-21 06:50 · Score: 1

That sounds like the same guy (and mostly the same topics). Good call.

chris

Re:Yeah Suckah! by htmlboy · 2001-06-21 01:37 · Score: 5

Google gave a talk for ACM here last semester (got a t-shirt, woohoo!). The speaker described how they're used. They have thousands of linux boxes, and they're used to store websites (to be searched and cached copies) and to do searching on the pages they have (I think that's how it went). I got the impression that linux is used because it's free (important with thousands of licenses), it's reliable, and they found it a good platform for the searching backend software.

an interesting side note: they found that when one of the linux boxes stops working, it's more cost effective to replace it than to fix the problem (hardware, at least). google throws out a lot of good hardware because of that. the lecture hall was begging for a student donation program of some sort when the google guy mentioned that :)

chris

Send messages to the staff! by dead_penguin · 2001-06-21 00:54 · Score: 5

With the giant display of scrolling queries (filtered, though) they have in their lobby, I think it's time to start sending little messages to the Google staff using searches.

"Help, I'm stuck in here!!" is an obvious classic to try. If enough of us do it, it might even get noticed...

"Intelligence is the ability to avoid doing work, yet getting the work done".

--

It's only software!

Re:Send messages to the staff! by Tofuhead · 2001-06-21 07:19 · Score: 2

Before I read your post, I had the same idea. I just sent one that said "Sorry, am I DOSing the Google lobby scroller?" Then, after reading this post, I did a search for "jerk off technique."

Hope those scroller babies don't log IPs. It would look like I was so bored (at work right now) that I decided to SPAM their scroller, which had somehow gotten me into some kind of masturbatory mood.

< tofuhead >
--
It is still the dark of night.
Re:Send messages to the staff! by binner · 2001-06-21 05:24 · Score: 1

Sarcastic or not...that's damn funny!

-Ben

--
Say what you mean, mean what you say! But please know what #$@% you are talking about!
Re:Send messages to the staff! by jaredcat · 2001-06-21 01:38 · Score: 1

now thats a good use of my time

Re:Smarter Searches by gorilla · 2001-06-21 01:57 · Score: 2

Google already has this. If you do a search on 'slishdot' it asks you if you meant slashdot.
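A very simple "did you mean" can be sketched in Python with the standard library's `difflib`. This is only an illustration of the idea; the vocabulary list and the `did_you_mean` helper are hypothetical, and it is not a claim about how Google produces its suggestions.

```python
import difflib

def did_you_mean(query, known_terms):
    """Return the closest known term to `query`, or None if nothing is close."""
    if query in known_terms:
        return None  # spelled like a known term, nothing to suggest
    matches = difflib.get_close_matches(query, known_terms, n=1, cutoff=0.8)
    return matches[0] if matches else None

# Hypothetical vocabulary of terms the engine has seen often.
known_terms = ["slashdot", "google", "altavista", "excite"]
print(did_you_mean("slishdot", known_terms))  # slashdot
```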

Re:[ot]Google's data structure? by costas · 2001-06-21 01:35 · Score: 2

A speculative answer since b-trees are my bread and butter (I am just now specing a 2TB data-mine): hundreds of thousands of entries (or hundreds of millions) should not really bother a b-tree. From the articles about Google, I am guessing they have implemented some sort of distributed b-tree app server, across all those COTS linux boxes.

I am curious as to what kind of implementation they are using; Google's roots would suggest some hacked form of Berkeley DB with lots of performance improvements.

Oh, well, just some guesswork... if I am close, I am expecting a job offer by the way :-)...

Gnut by QuantumG · 2001-06-21 12:29 · Score: 1

type "monitor" in gnut and you will see all the gnutella search requests going through your node. Often you see people searching for an exact filename and you know that their transfer stopped halfway through and they're looking for the rest of it, so you can get some idea of what is available out there in a completely passive manner.

--
How we know is more important than what we know.

Sing it brother by QuantumG · 2001-06-21 12:36 · Score: 1

How about a simple exact: query type? Too damn slow I suppose.

--
How we know is more important than what we know.

Re:Google is still sloppy and second-rate. by QuantumG · 2001-06-21 12:56 · Score: 1

So is your mom.

--
How we know is more important than what we know.

I remember when... by T3kno · 2001-06-21 00:53 · Score: 1

I switched from dogpile to google. It was the day that I read on /. that you could search for "more evil than satan" on google and the first hit was www.microsoft.com. That was a great day.

--
(B) + (D) + (B) + (D) = (K) + (&)

Re:I remember when... by TheShadow · 2001-06-21 01:42 · Score: 1

Yeah, now it just brings up a bunch of pages talking about when doing the search brought up Microsoft. :)

--
"What do you want me to do? Whack a guy? Off a guy? Whack off a guy? Cause I'm married."
Re:I remember when... by HoaryCripple · 2001-06-21 03:44 · Score: 1

Yeah. This sucks. Google needs to manually change the directory so that microsoft.com comes up again when doing the query.

--
Check out crippl3.net.
Booyah

Re:Smarter Searches by Louis+Savain · 2001-06-21 02:01 · Score: 2

Interesting work. Thanks for the helpful links.

Re:Smarter Searches by Louis+Savain · 2001-06-21 03:26 · Score: 2

Google already has this. If you do a search on 'slishdot' it asks you if you meant slashdot.

Thanks for this suggestion. Although it is a good example of interaction between the engine and the user, it seems to be based on a simple spelling check. Rather, I was thinking more in terms of what Monika Henzinger referred to as a topic based query. For example, typing 'bicycle' and receiving a choice of 'bicycle repair', 'bicycle racing', 'bicycle sales', 'bicycle parts', 'bicycle touring', etc...

Re:Smarter Searches by Louis+Savain · 2001-06-21 05:23 · Score: 2

Thanks for the info on Excite's zoom feature. I am impressed. I wonder how they go about creating their topic associations. Do they compile it manually or do they have an automated tool that searches previous user inputs to come up with the most common keyword associations? An automated tool would, of course, be much more efficient and cheaper to operate.

Smarter Searches by Louis+Savain · 2001-06-21 00:32 · Score: 4

Monika Henzinger: You can try to return documents that are specifically on this topic. We're developing more sophisticated techniques to return documents that might not mention the query words, but are [still relevant to] the topic. We're getting away from just pure word matches and getting more into topics.

This is interesting. I wonder if there might be a way for the engine to have a two way back-and-forth "conversation" with the user. IOW, if the engine interprets the query to have several possible meanings, a few multiple choice questions might clarify the meaning and narrow the search parameters. I think this could be more helpful than doing a blind guess of the user's intention.

Re:Smarter Searches by Fencepost · 2001-06-21 00:56 · Score: 2

I wonder if there might be a way for the engine to have a two way back-and-forth "conversation" with the user. IOW, if the engine interprets the query to have several possible meanings, a few multiple choice questions might clarify the meaning and narrow the search parameters.
I believe it was Altavista that had (and may still have, though I don't see any sign of it) something along these lines - after a query, it would also present an option to narrow the query by selecting some other key words that appeared in some of the pages. If I recall correctly this was not on the main query results pages, but there was a link to it.
For the example someone posted earlier where he gets a lot of hits from people looking for masturbation tips, using that option would present you with several groupings of words - one group might include "masturbate" and other terms likely to be found on that sort of pages, another group might include "network," "security," and "adware." Each group and each word within a group had a checkbox that could be used to select additional words to use in limiting the search.
I suspect that this was dropped for load reasons, though I could be wrong - it may be that people just didn't use it and they decided it wasn't worth the hassle.

-- fencepost

--
fencepost
just a little off
Re:Smarter Searches by PeterBecker · 2001-06-21 17:03 · Score: 1

You might want to take a look at this, too: http://www.guidebeam.com/ They work on top of Google, I got this URL just some days ago and didn't find the time to check the information on their site but it might be what you are looking for. HTH, PeterB

--
-- CAUTION: Don't read this posting.
Re:Smarter Searches by timboy3 · 2001-06-21 01:47 · Score: 1

Excite Search has a "Zoom in" feature that does something like this -- it gives you a set of alternate queries you can try to narrow the set of results returned. (Look at search.excite.com, try a query, and then use the Zoom In button.)
Re:Smarter Searches by timboy3 · 2001-06-21 04:13 · Score: 1

Right. I just tried your example (bicycle) on Excite's "Zoom In", and got these suggestions, among others:
bicycle parts
bicycle accessories
recumbent bicycles
Schwinn bicycles
mountain bicycles
used bicycles
..

phone book function by spasm · 2001-06-21 00:44 · Score: 2

Dunno if anyone's noticed the new 'phone book' function - type "your name" {your city/state/zip code} if you live in north america and see what comes back as the first google find. Your home address & phone number, at least if you're in the phone book.

I first noticed this function when searching for information on the professional work of someone who I was going to be working with - and the #1 thing google spat up was his home address and phone number. I know I could have found this almost immediately if I went actively looking for it, but it was a bit creepy anyway. I guess the reason I'm disturbed is that it wouldn't have occurred to me to go looking for that information, but once it was thrust in my face like that, I could immediately think of reasons it might be handy to have it.. In the event, I didn't copy it down anywhere, but, well, I could think of people who wouldn't hesitate to call me at 3am if they had my home number..

Fortunately google seems willing to at least let you opt out - http://www.google.com/help/pbremoval.html - which is fine for people who know about google and its more esoteric functions, but ain't going to help Jane Shmoe when she starts wondering why so many more people seem to know where she lives and what her home number is - people who wouldn't necessarily have gone looking for the information (that would be rude..) but who don't mind having it when it's 'handed' to them.

Re:phone book function by freeweed · 2001-06-21 01:35 · Score: 2

Doesn't seem to work for me up here in Canada, although my name does come up with some interesting stuff that I've never seen online before :)
As for not having your phone number/address on the internet... that's why the phone companies are required by law to allow you to de-list. Without the internet, it takes me all of 5 minutes to drive to my local library, where they have phone books from around the world for the taking. Oh yes, and the white pages here only list first initial anyway :)

--
Endless arguments over trivial contradictions in books written by ignorant savages to explain thunder in the dark.

Re:[ot]Google's data structure? by markprus · 2001-06-21 01:15 · Score: 1

I would imagine google uses a highly compressed inverted index stored probably in a flat file format. If you would like to read some academic literature on the subject you can find a great list of resources compiled by Prof. Torsten Suel.

Re:Prepositions need love too by LocalYokel · 2001-06-21 00:22 · Score: 3

Search terms have all kinds of problems.

I had the same problem yesterday when I was searching for "quotes about Shakespeare". "to be or not to be" (with quotes) pulls up the proper category, but the first result it comes up with is the GNU homepage, because GNU's not Unix!. The second link is to Am I Hot or Not, BTW...

Strangely enough, it warns about "or", and if I want to use it in a search, it must be in CAPS, but then how do I search for something in ORegon? For some reason, it says nothing about "not", so I don't know what's up with their search terms anymore.

--
E2 IN2 IE?

Search Query by BierGuzzl · 2001-06-21 01:09 · Score: 1

The most common query to hit my site is "fuck the skull of jesus".

isn't Google always getting itself in the news? by Delrin · 2001-06-21 00:01 · Score: 1

Is Google's technology really so groundbreaking? Didn't Yahoo take it in a bigger leap? I got this feeling when I read this article back in April: are Google's business strategy or practices really all that newsworthy? You decide. :)

Re:isn't Google always getting itself in the news? by markov_chain · 2001-06-21 02:12 · Score: 2

There is nothing technological that Google is doing that isn't done by other engines (Excite, Hotbot).
Really. Google uses a patented ranking algorithm, described by Page and Brin (Stanford graduate students who founded Google) in a paper titled The PageRank Citation Ranking: Bringing Order to the Web (1998) . The algorithm does very well at recognizing relevant documents. Last I looked, other search engines used mostly sets of hand-tuned hacks which did not do as well. Has this changed? I'd appreciate some references, refereed if possible.
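For readers who have not seen the paper, the core idea can be sketched as a power iteration in a few lines of Python. This is a textbook simplification, not Google's production ranking: the four-page link graph is invented, 0.85 is the commonly cited damping value, and the sketch assumes every link target also appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict of page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: three pages point at "c", and "c" points at "a".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c
```

Note that "a" ends up ranked well above "b" and "d" even though it has only one inbound link, because that link comes from the highly ranked "c". That is the part keyword counting alone cannot capture.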
~

--
Tsunami -- You can't bring a good wave down!
Re:isn't Google always getting itself in the news? by DejaMorgana · 2001-06-21 04:27 · Score: 1

1. Re: sellouts. Google is the only search engine that has remained a search engine and not tried to become Baby AOL for the sake of an instant dollar. While it does feature sponsored links, it labels them clearly as such. It does not disguise them as "featured links", "popular results", or any other pseudo-content. It does not surround its search results with ads masquerading as error messages, and it does not offer to run a duplicate search on Amazon.com. It's straightforward and simple.
2. Re: search technology - they do actually innovate. Google was one of the first to use an index of link popularity and relevance to rank the sites in their index, which gives you better results than a search engine that just looks for instances of your search phrase.
3. Re: spin control. Well, first of all, Salon.com is not exactly tech press. Secondly, I've never seen a single negative article about _any_ search engine in a major tech site. Not just Google. Even ridiculous engines like Ask Jeeves hardly ever get negative press.
Re:isn't Google always getting itself in the news? by Weh · 2001-06-21 01:48 · Score: 1

Whilst other "search engines" have or are trying to become portals google is still a search engine only. That's a difference... I didn't move to google because everyone was using it, I moved to Google because it consistently gives me the best search results. And their search technology *is* different...

Re:Yahoo took a much bigger leap - it licensed Goo by Delrin · 2001-06-21 01:41 · Score: 1

no doubt, they are a good engine, but aren't we just repackaging existing services? Or more succinctly, making slight improvements on existing technologies?

Re:Prepositions need love too by TimMann · 2001-06-21 11:16 · Score: 1

In running English text, the frequency of "the" is just about 70,000 per million. In other words, 7% of all English text consists of the definite article, and most web pages contain many distinct copies. You've got to kill that.

Well, actually you don't. AltaVista indexes every word, including "the". This helps it do exact phrase queries. For instance, try searching for "The Who".

Prepositions need love too by zpengo · 2001-06-21 00:01 · Score: 4

A recent development in Google technology left me very dismayed -- They started ignoring "common words."

This makes sense on a general level, but when you try searching for a phrase embedded in quotation marks, it's frustrating to have Google decide which parts of a literal string to search for and which to ignore. If I had wanted it to ignore parts of it, I wouldn't have indicated that it was a literal phrase, dangnabbit!

It is possible to include the common words that you typed in the search phrase, but you have to add an Altavista-style '+' before each one.

For example, searching for: "Hail to the chief" would ignore to and the. In order to actually search for the phrase (which I indicated that I wanted to do by surrounding it in quotation marks), I would have to type "Hail +to +the chief". Hardly user-friendly.

Oh, well.

--

Got Rhinos?

Re:Prepositions need love too by BitchAss · 2001-06-21 01:02 · Score: 2

I tried your search for "to be or not to be" using the +'s in front and I got this back:

Google always searches for pages containing all the words in your query, so you do not need to use + in front of words. [details] The word "or" was ignored in your query -- for search results including one term or another, use capitalized "OR" between words.[details] The following words are very common and were not included in your search: to be to be. [details]

That seems so pointy-haired-bossish.

--
Like sex? Read and write about it! Indecent Blogging
Re:Prepositions need love too by 3-State+Bit · 2001-06-21 06:15 · Score: 1
It says it's ignoring them, but the top few "hits" typically do include the exact page. I just tried, for instance, "All your base are belong to us". It claims to ignore "are" and "to" but the top few hits contain the exact phrase. (The same happens with your example "Hail to the chief", though it says it's ignoring "to the".)
You're such an f'ing troll, I was about to say anonymously, but then how can you have such a low ID? Sigh. Here goes my rant.

"I tried, for instance, 'All your base are belong to us'." Yeah. Uh-huh. That phrase is so frequent that if I hear "base belong" I think of that phrase. Hell: here's the word belong on google. Four of the top ten searches, including the top two, highlight "belong" in the full phrase "all your base are belong to us" visible in the summary. What were you smoking? Man.

My point, dear rsidd, just so I'm not being flamebait, is that if you want to see how Google treats your phrases, pick a random, not-too-common phrase out of a book you know is an etext, and see if you can find that etext based on that short phrase. Let's say you remember the phrase "but that the dread" of something after death, but you're only sure of the first part. What's this from? (Hamlet's soliloquy. "Who would fardels bear...but that the dread of something after death...puzzles [paralyzes] the will [to end the bad things] and makes us rather bear those ills we have than fly to others we know not of....")

Now look here:
+"+but +that +the +dread"
Returns:
"Results 1 - 10 of about 3,410"
(none of these includes the phrase I searched on.)
Now this:
+"+but +that +the +dread" -Hamlet -Shakespeare
Returns:
"Results 1 - 10 of about 2,480"

In other words, after I make sure that Hamlet's quotation was not just lurking on another page past the first ten that I looked at, I saw how many pages of the 3,410 could possibly have to do with the quotation I was looking for. Only 930 (subtract above).
Now let's look at a full-fledged full-text search engine, Altavista. (no affiliation, but I use altavista whenever I need a phrase and don't care how popular or "valued" the site is that it appears on--do you know that Google adjusts importance based on how much linkage a site gets on other sites? This doesn't mesh with phrase-based searching.)

Anyway,
"but that the dread"
on altavista returns, not surprisingly, a top ten pages that EACH (every one of the ten) reference Hamlet's soliloquy. (Although one is a satire including the phrase and being about a cat. It begins
"To go outside, and there perchance to stay
Or to remain within: that is the question:" and includes the phrase I searched on).
Total number of search results returned with the above search?
"We found 434 results:"

Now bear in mind that Google couldn't even come up with the phrase, however much I +'d it to death, on its top ten list. If I only have that one phrase in memory on Google, I can't find it. Period. But what if I want more power than just that. What if I wasn't just looking for it (because if I had been, I might include words like "play" or "shakespeare", which I could reasonably guess is where I got the phrase stuck in my mind from), but rather, for instance, wanted to know how many times anyone on the Internet (that a search engine indexes) has used the words "but that the dread", except in quoting Shakespeare. Therefore, the following progression. (After each one, I looked at the top ten pages and added a phrase to eliminate one or more of them).
"but that the dread"
We found 434 results:

"but that the dread" -Hamlet -Shakespeare
We found 65 results:

"but that the dread" -hamlet -shakespeare (lowercase this time, because Altavista treats uppercase as forced-uppercase and lowercase as either.)
We found 48 results:

"but that the dread" -hamlet -shakespeare -"that is the question" (Still fairly clearly an allusion to Shakespeare.)
We found 13 results:

"but that the dread" -hamlet -shakespeare -"that is the question" -"whether 'tis"
We found 8 results:

"but that the dread" -hamlet -shakespeare -"that is the question" -"whether 'tis" -"undiscover'd country"
We found 7 results:

"but that the dread" -hamlet -shakespeare -"that is the question" -"whether 'tis" -"undiscover'd country" -"undiscovered country"
We found 4 results:

The four results?
1. From http://www.lang.nagoya-u.ac.jp/~matsuoka/EG-Clare.html:
  "
  The sound seemed taken out of her voice; it was husky as the notes on an old harpsichord when the strings have ceased to vibrate. She read her answer in my face, I suppose, for I could not speak. Her look was one of intense fear, but that died away into an aspect of most humble patience. At length she seemed to force herself to face behind and around her: she saw the purple moors, the blue distant hills, quivering in the sunlight, but nothing else.
  
  'Will you take me home?' she said meekly.
  
  I took her by the hand, and led her silently through the budding heather - we dared not speak; for we could not tell but that the dread creature was listening, although unseen - but that IT might appear and push us asunder. I never loved her more fondly than now when - and that was the unspeakable misery - the idea of her was becoming so inextricably blended with the shuddering thought of IT. She seemed to understand what I must be feeling. She let go my hand, which she had kept clasped until then, when we reached the garden gate, and went forwards to meet her anxious friend, who was standing by the window looking for her. I could not enter the house: I needed silence, society, leisure, change - I knew not what - to shake off the sensation of that creature's presence. Yet I lingered about the garden - I hardly know why; partly, I suppose, because I feared to encounter the resemblance again on the solitary common, where it had vanished, and partly from a feeling of inexpressible compassion for Lucy. In a few minutes Mistress Clarke came forth and joined me. We walked some paces in silence.
  "
2. From http://www.clareweb.com/eolas/coclare/history/dutton_survey/dutton_survey_chapter5.5.htm:
  "
  Mr. Ledwich, in his Epitome of the Antiquities of Ireland, says, that in the reign of King John the clergy did not receive any tithes; the veneration for the church at that time was so great, that regulations were unnecessary; they were supported by oblations. The piety of modern times, I fear, would influence but very small collections. The whole ecclesiastical revenue to a late period was divided into four parts, one to the Bishop, one to the clergy, one to the poor, and one to support the church and other uses, and he says this mode exists at this day in the diocese of Clonfert.
  
  To throw as much light on this subject as possible, I shall make a few extracts from Mr. Rawson's admirable Survey of Kildare, lately published. In page 27 he mentions one tithe-dealer having exacted thirty shillings per acre for wheat;** "the dread of citation, and the loss of his straw, made the timorous ploughman yield to any terms." Again, page 31, "It must appear evident to every man, that the entire weight of the church establishment falls on the sweat from the brow of industry, whilst the feeder of one thousand bullocks does not pay as much as the herdsman for his garden. Can it be denied, but that the dread of tithe keeps much land in pasture, which would otherwise give bread to thousands, encrease population twenty-fold, do away all necessity of emigration, and make little Ireland not only a granary to England, but to the whole world." In page 33, and which deserves peculiar attention, "The assertors, that the titles to tithes and to estates are of equal strength, should consider that, if estates were to be let at undefined rents from year to year, and the landlord at each harvest to view the crops and exact some proportion in lieu of rent, would any occupier in such case be anxious to till or improve? Would not the kingdom soon become a dreary uninhabited waste? Yet exactly such is the conduct towards the tenth of the produce, the tithe. Let the land-holder be ascertained at what yearly rent he is to pay for one and the other, and all complaint is at an end.[...]
  "
3. From http://www.victorybaptist.org/books/johnbunyan/fearofgod/part1.htm:
  "
  3. Add to this the revelation of God's goodness, and it must needs make his presence dreadful to us; for when a poor defiled creature shall see that this great God hath, notwithstanding his greatness, goodness in his heart, and mercy to bestow upon him: this makes his presence yet the more dreadful. They "shall fear the Lord and his goodness" (Hosea 3:5). The goodness as well as the greatness of God doth beget in the heart of his elect an awful reverence of his majesty. "Fear ye not me? saith the Lord; will ye not tremble at my presence?" And then, to engage us in our soul to the duty, he adds one of his wonderful mercies to the world, for a motive, "Fear ye not me?" Why, who are thou? He answers, Even I, "which have" set, or "placed the sand for the bound of the sea by a perpetual decree, that it cannot pass it; and though the waves thereof toss themselves, yet can they not prevail; though they roar, yet can they not pass over it?" (Jer 5:22). Also, when Job had God present with him, making manifest the goodness of his great heart to him, what doth he say? how doth he behave himself in his presence? "I have heard of thee," says he, "by the hearing of the ear, but now mine eye seeth thee; wherefore I abhor myself, and repent in dust and ashes" (Job 42:5,6).
  
  And what mean the tremblings, the tears, those breakings and shakings of heart that attend the people of God, when in an eminent manner they receive the pronunciation of the forgiveness of sins at his mouth, but that the dread of the majesty of God is in their sight mixed therewith? God must appear like himself, speak to the soul like himself; nor can the sinner, when under these glorious discoveries of his Lord and Saviour, keep out the beams of his majesty from the eyes of his understanding. "I will cleanse them," saith he, "from all their iniquity, whereby they have sinned against me, and I will pardon all their iniquities whereby they have sinned, and whereby they have transgressed against me." And what then? "And they shall fear and tremble for all the goodness, and for all the prosperity that I procure unto it" (Jer 33:8,9). Alas! there is a company of poor, light, frothy professors in the world, that carry it under that which they call the presence of God, more like to antics, than sober sensible Christians; yea, more like to a fool of a play, than those that have the presence of God. They would not carry it so in the presence of a king, nor yet of the lord of their land, were they but receivers of mercy at his hand. They carry it even in their most eminent seasons, as if the sense and sight of God, and his blessed grace to their souls in Christ, had a tendency in them to make men wanton: but indeed it is the most humbling and heart-breaking sight in the world; it is fearful.
  "
4. From http://www.users.globalnet.co.uk/~turing/T/003397.html:
  "
  But that the dread of someone else could win that game, puzzles the will and makes us rather bear those ills we have. Thus conscience does make cowards of us all; and thus the native hue of resolution is sicklied over with the pale cast of thought, and enterprises of great pith and moment. With this regard their currents turn awry, and lose the name of action.
  "
  
  (This last attempts to quote the original, except for the phrase "of someone else could win that game".)
So now, we see that only THREE documents out of the entire indexed online world have the words "but that the dread" appearing outside of a work that relates directly to Shakespeare. (Possibly we have a false negative on very long documents that include the words "but that the dread" and, independently, some other phrase I negated, such as "that is the question" or "whether 'tis". I don't think it's likely though.)

This startling conclusion is one that you could not find with Google, which could not even be bothered to find for you where the phrase "but that the dread" comes from. Apparently each of Altavista's 434 original results, except these latter three, are correct positives. (In the sense that the phrase is from the context in which I heard it, as a part of a soliloquy in Hamlet.)

I used to use Altavista, and was sad to hear, at a conference held here in Boston by some technology head there, that lots of people only use Altavista as a "backup" in case Google can't find what they're looking for. He was very proud of the idea that Altavista didn't have what he called "stop words" (Google's "the", "a", etc.), but rather full-text indexing. (He did mention that only the first 378K of a text were indexed or something, but I think any document that long is also available for download somewhere in chapters...). Anyway, at that time I was saddened that Altavista wasn't doing too well; it was what I used, since it seemed like it had an expert, powerful system. (With such conveniences as a NEAR keyword to show that two phrases mustn't just occur within the same document but within several words of each other. The back-end, but not the user interface, he told some of us afterward over refreshments, was fully Regular Expression, and an expert user could combine things like boolean operators with NEAR and a few other keywords (up to an impressive depth) to get basically any query she wanted.)
Today I use Google because, chances are, the site that I'm interested in is the one other people are interested in who know about that subject. (From Google's site:
"
PageRank Explained

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search. Of course, important pages mean nothing to you if they don't match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.
")

Which of course is usually exactly what I want. Unless I have a phrase stuck in my mind like "but that the dread". In that case, like those upon whom I frowned a year or two ago, I head over to Altavista "as a last resort after Google fails" (sigh) and use its all-but-RegEx features.
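
For anyone who wants to see the "votes from important pages count more" idea from the PageRank blurb quoted above in concrete form, here is a minimal Python sketch. It is not Google's code. The damping value of 0.85, the fixed iteration count, the handling of pages with no outgoing links, and the toy link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its "votes".
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no links spreads its vote evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: B is linked to by A, C and D.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["B"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Note that nothing in this score looks at the query text, which is why the blurb says it has to be combined with text matching.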
Robert Viragh.
~
Re:Prepositions need love too by 3-State+Bit · 2001-06-24 12:14 · Score: 1

Now bear in mind that Google couldn't even come up with the phrase, however much I +'d it to death, on its top ten list. If I only have that one phrase in memory on Google, I can't find it.

The problem is that you +'ed it too much. If you search for +"+but +that +the +dread" you'll notice that it gives you some warnings. Google's ignoring all of the +'s you added, because you're using some of them incorrectly. ("dread" is not a stop word, for example)

Instead, try searching for "but +that +the dread". Then you'll get what you're looking for.

Wow, that's so informative! (uh, that wasn't sarcastic) It doesn't actually /say/ it's ignoring them, it just says some of them are redundant. Thanks!
I wanted to make sure that it was still a phrase search, though, and the sites didn't just come up because but, that, the, and dread were "kinda near" each other or all there or something. Removing the quotation marks (so it's not a phrase search) gets me irrelevant pages. Rearranging the words WITHIN the quotation marks gets me irrelevant pages. So far so good. Now I search for something very obscure to get a phrase to phrase-quote. I downloaded the complete works of William Shakespeare (etext -- other formats) [1]
First I wanted to use only stop words (hehe. I got the list from most common words in English.)
Now I wrote a program (C++ unfortunately, I don't know perl :( ) to go through the text file word-by-word, resetting a CurrentPhrase and CurrentNumWords whenever it passes a word that isn't one of these: "of a to in is that it for was on are as with at be this from I by what", and setting HighPhrase to CurrentPhrase whenever CurrentNumWords exceeds HighestNumWords. Thus, at the end, we end up with a HighPhrase that has the largest number of words in a row taken only from these allowed words. The phrase it ends up with? "from what it is to a", which, grepping shaks12.txt, I find at:
"
Oph. Could beauty, my lord, have better commerce than with honesty?
Ham. Ay, truly; for the power of beauty will sooner transform
honesty from what it is to a bawd than the force of honesty can
translate beauty into his likeness. This was sometime a paradox,
but now the time gives it proof. I did love you once.
Oph. Indeed, my lord, you made me believe so.
"
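
The longest-run scan described above can be sketched in a few lines of Python. This is not the original C++ program, which isn't shown. The allowed-word list is the one quoted in the post, while the tokenizing rule (runs of letters and apostrophes, case ignored) and the function name are assumptions.

```python
import re

# Allowed words, copied from the list quoted in the post.
ALLOWED = set("of a to in is that it for was on are as with at be this "
              "from I by what".lower().split())


def longest_allowed_run(text):
    """Return the longest run of consecutive words drawn only from ALLOWED."""
    best, current = [], []
    # Assumption: a "word" is a run of letters or apostrophes; case is ignored.
    for word in re.findall(r"[A-Za-z']+", text):
        if word.lower() in ALLOWED:
            current.append(word)
            if len(current) > len(best):
                best = list(current)  # new longest phrase so far
        else:
            current = []  # any other word resets the current phrase
    return " ".join(best)


sample = ("for the power of beauty will sooner transform honesty "
          "from what it is to a bawd than the force of honesty")
print(longest_allowed_run(sample))  # prints "from what it is to a"
```

On the full etext you would pass in the contents of the file instead of the sample sentence.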
Now first I try altavista: +"from what it is to a", which instantly (~ 1 second) spits back "We found 45 results:" including what looks like lots of hamlet. Now -hamlet
+"from what it is to a" -hamlet:
We found 10 results:

Now google. First I tried: +"+from +what +it +is +to +a" which sputtered for such a long time loading that I thought I would get back "your search took too long, please make it more specific." (which had happened to me before). But waiting patiently, I got:
"Results 1 - 10 of about 39. Search took 30.45 seconds" (which mostly look like they're from Hamlet, too)
I was thinking, wow, I'd never had a search take that long. It was probably because these search words are rarely if ever asked for together so that their intersections were hardly cached at all. Next, I wanted to make sure that most of the 39 were relevant, so:
+"+from +what +it +is +to +a" -hamlet
Brings: Results 1 - 1 of 1. Search took 30.54 seconds
I was surprised it took this long again, since I would think it would have had my results cached from the first time around. When I hit the search button again, it only took 0.31 seconds to come up with the same results.
Anyway, what does this prove? Altavista is STILL better at phrase searching: Google missed 9 things with the phrase "from what it is to a" but without Hamlet. (WITH Hamlet the two are about even: 38 of Google's 39 hits versus 35 of Altavista's 45.) Plus, Altavista answered instantly and Google took >30 seconds. So Altavista is better for phrase searching, even when you obey all of Google's tempersome rules. :)

Robert Viragh

[1]FYI, this is 5.2 megabytes, and gets paginated into 2184 pages in M$ Word, 10-sized font. At 50 non-blank lines per page, if Shakespeare had been productive for 50 years, working 16 hours a day 6 days a week, he would have been able to spend about 2.3 hours on each line.* Prolific my ass. On the other hand, there is not a line of his you could find that has not been specifically, actively considered for at least half an hour in total by a single scholar. Of course not all of them are interested in all of Shakespeare. I am not so obsessed with him as I appear to be either. I've only actually read a few of his plays, and a relatively small percent of the ones I was SUPPOSED to in school.

*Numbers: 50 years * 52 weeks * 6 days per week * 16 hours per day = 249,600 hours. 2184 pages * 50 lines per page = 109,200 lines. Dividing the first by the second gives about 2.3 hours per line.
~
Re:Prepositions need love too by Kinchie · 2001-06-21 00:44 · Score: 1

Prepositions need love?
Proposition your preposition daily.

--
Protege Posterioram Tuam
Re:Prepositions need love too by Modus+Nonsens · 2001-06-21 00:07 · Score: 1

Yeah, I agree with what you say. Why would they exclude common words *inside quotation marks*? That goes against the standard that I am used to using!

Re:[ot]Google's data structure? by jon_c · 2001-06-21 01:01 · Score: 1

I would seriously doubt they have a SQL interface for their DB. I would also bet your mom's poop that they don't use a commercial database for the website indexing.

I figure it's something derived from a B-Tree (like a binary tree - but better for databases), distributed across a cluster of boxen (Linux, right?).

I'm sure there's a hell of a lot more to it than that. A hell of a lot more. Hell, let's ask him.

begin question
Hey Google guy, how is the webpage index data stored and retrieved? What data structures and what algorithms are used? How many boxes do you have for indexing?
end question

maybe he'll answer.

-Jon

--
this is my sig.

Re:why *I* like google by jon_c · 2001-06-21 02:53 · Score: 1

who else has BSD only searches?.. and not only that, a cool BSD google logo!

http://www.google.com/bsd

-jon

--
this is my sig.

Re:[ot]Google's data structure? by jon_c · 2001-06-21 02:58 · Score: 1

Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although CPUs and bulk input/output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.

BigFiles
BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.

Repository

Figure 2. Repository Data Structure
The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.
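
As a rough illustration of the repository layout described above (records stored back to back, each prefixed by docID, length, and URL, with the page compressed by zlib), here is a small Python sketch. The header field widths and byte order are assumptions; the excerpt does not give the real on-disk format.

```python
import struct
import zlib

# Hypothetical header layout (an assumption, not the real format):
# docID (8 bytes), URL length (2 bytes), compressed page length (4 bytes).
HEADER = struct.Struct("<QHI")


def pack_record(doc_id, url, html):
    """Build one record: header, then the URL, then the zlib-compressed page."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_records(blob):
    """Walk records stored one after another, yielding (docID, URL, HTML)."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repo = pack_record(1, "http://example.com/a", "<html>first</html>")
repo += pack_record(2, "http://example.com/b", "<html>second</html>")
for record in unpack_records(repo):
    print(record)
```

Because each record carries its own lengths, the file can be read front to back with no other index, which matches the claim that the other structures can be rebuilt from the repository.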

Document Index
The document index keeps information about each document. It is a fixed width ISAM (Index sequential access mode) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL. This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search.
Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs and is sorted by checksum. In order to find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial because otherwise we must perform one seek for every link which assuming one disk would take more than a month for our 322 million link dataset.
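
The single-URL lookup described above (checksum the URL, then binary-search a checksum-sorted list) might look something like this in Python. CRC-32 is only a stand-in, since the excerpt does not say which checksum is used, and the function names and sample URLs are made up for illustration.

```python
import bisect
import zlib


def checksum(url):
    # Stand-in checksum (assumption): CRC-32; the excerpt does not name one.
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_docid):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((checksum(u), d) for u, d in url_to_docid.items())


def lookup(table, url):
    """Binary-search the sorted table; return the docID or None."""
    key = checksum(url)
    # (key,) sorts before every (key, docID) pair, so this finds the first match.
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


table = build_table({"http://example.com/a": 1, "http://example.com/b": 2})
print(lookup(table, "http://example.com/b"))
```

The batch mode mentioned in the excerpt would instead sort the incoming URL checksums and merge them against the table in one sequential pass, avoiding a seek per link.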

Lexicon
The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.

Hit Lists
A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure 3.
Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font.
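
To make the two-byte layout concrete, here is a hedged Python sketch of packing and unpacking plain and fancy hits. The field widths come from the excerpt; the order of the fields within the 16 bits and the function names are assumptions. Since 12 bits cannot hold the value 4096, this sketch clamps large positions to 4095.

```python
FANCY_FONT = 0b111  # font value 7 is the flag for a fancy hit


def pack_plain(cap, font, position):
    """1 capitalization bit, 3 font bits (0-6), 12 position bits."""
    assert 0 <= font < FANCY_FONT
    position = min(position, 0xFFF)  # clamp to the largest 12-bit value
    return (int(cap) << 15) | (font << 12) | position


def pack_fancy(cap, hit_type, position):
    """1 capitalization bit, font bits set to 7, 4 type bits, 8 position bits."""
    assert 0 <= hit_type < 16
    position = min(position, 0xFF)  # clamp to the largest 8-bit value
    return (int(cap) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position


def unpack(hit):
    """Decode a 16-bit hit into (kind, capitalized, font or type, position)."""
    cap = bool(hit >> 15)
    font = (hit >> 12) & 0b111
    if font == FANCY_FONT:
        return ("fancy", cap, (hit >> 8) & 0xF, hit & 0xFF)
    return ("plain", cap, font, hit & 0xFFF)


print(unpack(pack_plain(True, 3, 100)))
print(unpack(pack_fancy(False, 2, 17)))
```

Anchor hits would further split the 8 position bits into 4 bits of position and 4 bits of docID hash, which this sketch leaves out.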

The length of a hit list is stored before the hits themselves. To save space, the length of the hit list is combined with the wordID in the forward index and the docID in the inverted index. This limits it to 8 and 5 bits respectively (there are some tricks which allow 8 bits to be borrowed from the wordID). If the length is longer than would fit in that many bits, an escape code is used in those bits, and the next two bytes contain the actual length.

Forward Index
The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordID's. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter. Furthermore, instead of storing actual wordID's, we store each wordID as a relative difference from the minimum wordID that falls into the barrel the wordID is in. This way, we can use just 24 bits for the wordID's in the unsorted barrels, leaving 8 bits for the hit list length.
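
A small Python sketch of the barrel-relative wordID trick described above: store the wordID as an offset from the lowest wordID in its barrel, so it fits in 24 bits and leaves 8 bits for the hit list length. The equal-width barrel ranges and the span of 2**18 wordIDs per barrel are assumptions chosen only so the numbers work out.

```python
NUM_BARRELS = 64        # the excerpt says 64 barrels were used
BARREL_SPAN = 1 << 18   # assumption: each barrel covers 2**18 wordIDs


def pack_entry(word_id, hit_count):
    """Return (barrel, 32-bit value holding a 24-bit relative wordID and 8-bit length)."""
    barrel = word_id // BARREL_SPAN
    assert barrel < NUM_BARRELS and 0 <= hit_count < 256
    relative = word_id - barrel * BARREL_SPAN  # offset from the barrel's minimum wordID
    return barrel, (relative << 8) | hit_count


def unpack_entry(barrel, packed):
    """Recover the absolute wordID and the hit list length."""
    return barrel * BARREL_SPAN + (packed >> 8), packed & 0xFF


barrel, packed = pack_entry(1_000_000, 7)
print(barrel, unpack_entry(barrel, packed))
```

Hit lists longer than the 8-bit field allows would need the escape code described in the hit list section, which is not modelled here.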

Inverted Index
The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docID's together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.
An important issue is in what order the docID's should appear in the doclist. One simple solution is to store them sorted by docID. This allows for quick merging of different doclists for multiple word queries. Another option is to store them sorted by a ranking of the occurrence of the word in each document. This makes answering one word queries trivial and makes it likely that the answers to multiple word queries are near the start. However, merging is much more difficult. Also, this makes development much more difficult in that a change to the ranking function requires a rebuild of the index. We chose a compromise between these options, keeping two sets of inverted barrels -- one set for hit lists which include title or anchor hits and another set for all hit lists. This way, we check the first set of barrels first and if there are not enough matches within those barrels we check the larger ones.
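
The "quick merging" that docID-sorted doclists allow is the classic two-pointer intersection. A minimal Python sketch, with made-up docIDs, assuming each doclist is a plain sorted list with no duplicates:

```python
def intersect_doclists(a, b):
    """Merge two docID-sorted doclists, keeping docIDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # document contains both words
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return out


hamlet_docs = [1, 4, 7, 9]     # hypothetical doclist for one word
dread_docs = [2, 4, 9, 12]     # hypothetical doclist for another
print(intersect_doclists(hamlet_docs, dread_docs))  # prints [4, 9]
```

Each list is read once, front to back, so the cost grows with the combined list lengths. That also hints at why very common words are expensive: their doclists are enormous.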

--
this is my sig.

Northernlight by JPMH · 2001-06-21 02:32 · Score: 2

Northernlight categorises its returns into "Custom Search Folders", subject by subject.

Re:Much like McDonalds by Carnivore · 2001-06-21 05:11 · Score: 1

I thought that they switched to "billions and billions served" about 10 years ago...

Here is the real google info... by jwater · 2001-06-21 01:48 · Score: 5

Here at Slashdot it seems like people only can complain about a service. Most of the posts are rants without understanding of the dynamics below them.

I think we all could use more understanding of the topic. A link to the paper that started it all here.

1. When was the last time that "to" or any other preposition helped the average query? Your Grandmother does not know that this word is meaningless 99.9% of the time, so Google tries to improve their relevancy.
2. Google has not sold out. Their ads are the simplest in the industry. They give access to users like you and me at a reasonable rate. Who wants to wait for 345x123 pixel banner ads anyway?
3. Have you noticed the spelling feature? Google will correct your spelling. This is a function of the tons of bigrams that they have stored.
4. Here is a link to more papers [Warning: Technical] here.

Re:Here is the real google info... by Popocatepetl · 2001-06-21 06:28 · Score: 1

Please mod the post to which I am replying up a point (Interesting). It is true.
Re:Here is the real google info... by 4thAce · 2001-06-21 04:38 · Score: 1

3. Have you noticed the spelling feature? Google will correct your spelling.

Has anybody told them that their name is a misspelling?

Search engine, correct thyself.

--
Inventor of the LOLbalrog meme.

Re:More on language translation... by FTL · 2001-06-21 01:40 · Score: 2

> (translated from English to Korean
> and then back to English again)

And that's the catch. Most documents are readable after they've been put through the blender once. But two passes through the blender result in garbage.

The Fish is quite good for the one-way trips that it was designed for. A round trip ticket through the Fish is usually deadly.
--

--
Slashdot monitor for your Mozilla sidebar or Active Desktop.

its been said by zerocool^ · 2001-06-21 00:16 · Score: 1

It's been said here on Slashdot before, but everyone should check out that if you change your language preferences on Google, one of the languages is Bork Bork Bork!, the Swedish Chef's language.

also, what happened to searching for 666, the first entry it spat up was microsoft?

zero

--
sig?

Re:[ot]Google's data structure? by Angelo+Torres · 2001-06-21 03:35 · Score: 1

Check out The Anatomy of a Search Engine.

read the article by mr_gerbik · 2001-06-21 00:25 · Score: 2

They filter what gets projected.. maybe you should have read the next sentence before posting.

"That's a filtered version, except that the filter doesn't work well in other languages. So we had people here from BMW, and they told me that there were some German queries that got through that shouldn't have.

[Note to self: Curse on Google only in foreign tongues.]"

Re:read the article by mr_gerbik · 2001-06-21 01:07 · Score: 2

Have you ever had experience with filtering software? Any filtering software worth 2 cents looks for that kind of shit.. purposeful misspellings, replacements like 0s for Os, 1s for ls. I think Google is smart enough to make a filter like this. So no.. "britney spears suk1ng c0ck" isn't going to get through. Beyotch.
Re:read the article by Eloquence · 2001-06-21 01:01 · Score: 1

*sigh* That's why he misspelled the words.

--
Re:read the article by Rogerborg · 2001-06-21 20:30 · Score: 2
- They filter what gets projected.. maybe you should have read the next sentence before posting.
Uh huh, and maybe you should have read the trailing ;) before replying.

;)
--
If you were blocking sigs, you wouldn't have to read this.

Re:why *I* like google by mr_gerbik · 2001-06-21 03:12 · Score: 2

AND...

Mac only searches.. and a cool Mac logo!
http://www.google.com/mac

AND...

US Government searches... and a "cool" US logo?
http://www.google.com/unclesam

why I like google by mr_gerbik · 2001-06-21 01:13 · Score: 3

who else has linux only searches?.. and not only that, a cool linux google logo!

http://www.google.com/linux

-gerbik

Re:why I like google by ryanf · 2001-06-21 03:30 · Score: 1

Go under http://www.google.com/linux.

Try searching for "news".

Guess what comes up #2?

Ryan Finley

--

Ryan Finley
SurveyMonkey.com -- Create your own professional surveys

How to improve the timeliness searches? by Jim+Madison · 2001-06-21 14:16 · Score: 1

Kudos on creating the most relevant search engine.

My question is, what are you doing to improve the timeliness of searches? Often, there is a conservative bias as older sites have more links to them. As I watch the results from my site get integrated, it seems that your processing cycle is about a month, which makes Google not the SE of choice for researching recent news events. I might also add that this seems like a bigger imperative given the recent acquisition of deja/usenet.

Keep up the good work (and don't ever sell out baby, no matter what riches the VC put in front of your nose).

--
Hey democracy lovers, add Quorum as a c

SatireWire: interview with Jeeves by mrBlond · 2001-06-21 00:27 · Score: 2

http://www.satirewire.com/features/satire-jeevesinterview.shtml
--
mrBlond (I don't email from Malaysia)

--
CowboyNeal for president!
"Hit any user to continue."

Re:Disturbing Search Requests by don_carnage · 2001-06-21 01:07 · Score: 2

You probably should check out this site: Disturbing Search Requests

--

--
Wooden armaments to battle your imaginary foes!

Re:Disturbing Search Requests by don_carnage · 2001-06-21 01:31 · Score: 2

I can't remember where I found that -- it may have even been here on /.

It kinda makes you want to start checking those referer logs, eh? I once found someone who was looking for 'priceless pissing'. No clue how they ended up on my site!

$ grep google /usr/apache/logs/referer_log

--

--
Wooden armaments to battle your imaginary foes!

Re:new search engine by xp · 2001-06-21 00:33 · Score: 1

This company claims they are writing the new search engine for Google. Click on clients and then #6.

In fact they seem to be claiming that they built most of Google. It's a pity their own web-site looks so bad though. Here is an excerpt: We also built a sophisticated server system to run the show and organized the site's starting database

Re:Why Google is my favorite search engine by PingXao · 2001-06-21 02:57 · Score: 1

Wait until that last posted company "helps" them with their web interface! I avoid Flash sites at all costs. I seem to remember that messing with the simple interface was the beginning of the end for Deja.

The criticisms being made here about how Google omits certain words apply equally to their newsgroup searches. Very annoying. The advanced groups search lets you search for an "exact phrase". Or so it says. It doesn't let you search that way at all. They have done a pretty good job so far with deja's data, however. I missed it all being out there. I look forward to their improvements over time.

It's good to know... by Rackemup · 2001-06-21 00:03 · Score: 1

It's good to know that at least one dot-com is still going strong. Good work Google-people!

I love google.. it's fast, gives lots of results and the page isn't cluttered with dozens of banner ads like some other *cough* search-engine-portal-wannabes *cough*.

Maybe someday I'll get to use my networking skills on that server farm they've got going there... ahhhh a guy can dream eh?

Re:[ot]Google's data structure? by kerrbear · 2001-06-21 02:24 · Score: 1

When a search asks for, say, "cheese fondue" the array for "cheese" and the array for "fondue" are retrieved and merged using a sorted list merge (fast, since the arrays are already ordered). The result is a list of document id's that were in both lists, i.e. documents containing both words.

That would work ok, except that the process of updating the lists would be very expensive. Indexing every word on the internet would be trivial, but keeping the addresses for those words in sorted order would be extremely non-trivial.

Imagine the word 'test' for example. You gotta believe that 'test' is on about a hundred million web pages, with more being added each day. That's one hundred million sorted addresses, probably taking up more than 700,000 disk blocks (100,000,000 addresses x ~30 bytes per address / 4096 bytes per block). Every time you add a new page with the word 'test' (or take one away), you have to update the list. That's a lot of disk block rearranging. Now multiply this by all the words on the web and you can see what a huge amount of rewriting has to be done. I don't think linear address lists would cut it.

Now they could have some kind of funky indexing scheme for all the addresses. But it's still freakin expensive to update them all. The article mentioned they update every 28 days. Does this mean they stop everything every 28 days to update, or does it mean that it takes 28 days to do an update? Regardless, this could mean that Google is always 28 days out of date. Another search engine that beat this number could potentially compete by saying they are more up to date.

You have to imagine that as the internet grows larger, this is going to get even more time consuming.
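
For reference, the sorted list merge from the quoted "cheese fondue" example can be sketched in a few lines of Python. This is a minimal illustration, assuming each word maps to a list of document ids already sorted in ascending order; the function name and the sample ids are made up for the example.

```python
from typing import List


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Return the ids present in both ascending lists, in ascending order."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return common


cheese = [2, 5, 8, 13, 21]   # hypothetical document ids containing "cheese"
fondue = [5, 9, 13, 34]      # hypothetical document ids containing "fondue"
both = intersect_sorted(cheese, fondue)
```

The merge reads each list once, so the query side is cheap; the objection raised in this comment is about keeping the lists sorted when pages are added or removed, which the sketch does not address.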

Yahoo took a much bigger leap - it licensed Google by arete · 2001-06-21 01:03 · Score: 2

What Yahoo did was license google, instead of what they were doing before, licensing Inktomi. Google rocks.

http://news.cnet.com/news/0-1005-200-5561996.html

--
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot

What do you expect, a monolith? by arete · 2001-06-22 02:17 · Score: 2

Yahoo is repackaging existing services - they're repackaging Google. And yahoo has more name recognition, so more people use it. And they bring in more revenue in ads, so more money goes to Google to develop.

Google OTOH, is developing new technology. Most of that development is incremental -things get better and better. Until we actually find an alien monolith to give us all our science, this is how most advancements happen.

--
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot

Re: weird google pages (was "why *I* like google") by wishus · 2001-06-21 04:32 · Score: 2

www.google.com/redhat - Doesn't do anything special, but the URL is there
www.google.com/palm - Looks to be made for monochrome PDA browsers
www.google.com/ie - For Pocket IE maybe?
wishus
---

method for increasing hits by jvj24601 · 2001-06-21 01:35 · Score: 2

A friend of mine (web developer) says that he's created a way to increase the hit count among all the sites he creates. He uses a server-side Perl script to determine if the Google bot is hitting a page, and includes links to *all* of the homepages of the sites they are hosting. So if he includes this script on every page of every site he hosts, then every page links to every site.

Does this work? I mean, they include (in plain English) something like "Here are some of the other sites we, [our web design firm], created and host" along with a short blurb. It sounds like it would work, right?

Re:method for increasing hits by b0bby · 2001-06-21 03:40 · Score: 1

My understanding is that Google ranks highest links which are from sites which are themselves linked from lots of sites, so this might have some use. At the very least it'll help everything get indexed.
Re:method for increasing hits by neves · 2001-06-22 02:30 · Score: 1

Sites about the same topic have a lot of links between them. If you register everybody who's in your referrer log, this whole network of sites will be more relevant to searches. They don't say that having unpopular sites gives you no ranking, they just say that the more popular the site that links to you, the more ranking you'll have.
Re:method for increasing hits by MattCutts · 2001-06-21 08:54 · Score: 1

Hey, if you go back and read the PageRank papers, you'll see that it doesn't help you to just get a lot of links from bad sites. That was one of the points of the original mathematics.
So your friend might think he's improving his rank at Google, but he's really just wasting time. But if it makes him happy, who are we to criticize?
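
To make the point about link quality concrete, here is a small power-iteration sketch of the simplified PageRank formulation. It is a toy under stated assumptions, not Google's production ranking: the damping factor of 0.85, the uniform handling of pages without outlinks, and the tiny example graph are all choices made for this illustration, and every link target must also appear as a key in `links`.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Approximate PageRank by repeated redistribution of rank."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Pages without outlinks spread their rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


links = {
    "hub": ["endorsed"],                      # well-linked page, one outlink
    "a": ["hub"], "b": ["hub"], "c": ["hub"],
    "f1": ["farmed"], "f2": ["farmed"],       # pages nobody links to
    "endorsed": [], "farmed": [],
}
rank = pagerank(links)
```

In this graph `farmed` has two inbound links and `endorsed` has only one, yet `endorsed` ends up with the higher score, because its single link comes from a page that is itself linked to by others.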

new search engine by Aalschover · 2001-06-21 00:10 · Score: 2

This company claims they are writing the new search engine for Google. Click on clients and then #6.

It really says 'To fullfill their needs, we built a brand new searcg engine for Google.....'

[flash alert]

Re:Yeah Suckah! by i0lanthe · 2001-06-21 02:13 · Score: 1

I got the impression that linux is used because it's free (important with thousands of licenses), it's reliable, and they found it a good platform for the searching backend software.

There was a googletalk at CMU just this month where someone asked this question, and the answer confirmed your impression "it's free and it works", plus the speaker also mentioned "and you can fix problems yourself" (yay open source), which they have done at least once and submitted a patch for I forget what (and I think he said that it was not a beautiful enough patch to be accepted, but I mention it anyway lest anyone think that they are not "good citizens").

--
"The Crystal Wind is the Storm, and the Storm is Data, and the Data is Life"

More on language translation... by sdo1 · 2001-06-21 01:07 · Score: 3

These translation services (such as BabelFish on AltaVista) still have quite a way to go before they're completely reliable. Especially when you translate from one language to another, you might end up with something similar to this (translated from English to Korean and then back to English again):

Will be complete and on the front of the L it will be reliable to translation service (as the BabelFish is same) a yet positively is thin method to Altavista. It was special and when you from one language also translate in different one thing, you in child one silence comfort ended to this, (and the that time English back mac tayn Great Britain from again under translate again in a Korean):

-S

--
--- What parts of "shall make no law", "shall not be infringed", and "shall not be violated" don't you understand?

Regex: won't happen by brlewis · 2001-06-21 03:31 · Score: 2

You can pre-build lists of matches by word, but regex is too general a concept. You can't pre-build an index that will help speed up a query based on some yet-to-be-specified regex. There's just no way to do it fast.

probably a suffix tree. by gagganator · 2001-06-21 01:37 · Score: 1

i dont think your question is offtopic at all. i was looking for the answer to this but the interview answer seemed evasive

i would guess they use a suffix tree. roughly, a suffix tree stores a suffix of a string (in this case webpage) in each branch from the root; branches split further when there are different endings to a suffix. i used this for a search algorithm for a plagiarism detector for my school. i believe this algorithm is most popularly used for searching dna sequences

the director states that they presearch words and store them. it almost sounded like they search this tree for words and store the results in tables that they then join upon request. more informed speculation?

--
the animal doesnt even have opposable thumbs, focker!

mujen.com is better by oplspopo112 · 2001-06-21 03:14 · Score: 1

I still think http://mujen.com is better than google. Google will fall one day just like yahoo, hotbot, and the rest.

Re:mujen.com is better by oplspopo112 · 2001-06-21 03:39 · Score: 1

How is hotbot or google better than mujen??? I thought mujen uses hotbot, google, and some other search engine.

Re:[ot]Google's data structure? by wrinkledshirt · 2001-06-21 00:21 · Score: 1

From the article, she says that they update the index every 28 days. That sort of smacks of a non-database-like structure, you think? If they were using a database-like structure, then they'd be able to update it more regularly, no?

--

--------
Bleah! Heh heh heh... BLEAH BLEAH!!! Ha ha ha ha...

Re:[ot]Google's data structure? by wrinkledshirt · 2001-06-21 00:26 · Score: 1

Ah. Gotcha.

--

--------
Bleah! Heh heh heh... BLEAH BLEAH!!! Ha ha ha ha...

[ot]Google's data structure? by wrinkledshirt · 2001-06-21 00:08 · Score: 3

Okay, this is so off-topic it's not even funny.

Anybody have an inkling of a clue of the data structure that Google uses (or probably uses) to store all its words? I was just thinking that maybe it was some sort of balanced binary tree with each node containing a word, two pointers to the next two words further down the tree, and the root of a linked list of all the pages that word is contained in? I know binary search trees are supposed to be fast, but I was wondering if that'd be good enough for something with probably hundreds of thousands of words?

I'm assuming they're not using some sort of sql LIKE "%searchword%", I can't imagine any kind of cluster that could speed that process up, although I don't really know all that much about the process or what the main benefits of clustering are.

Anyway, hugely sorry for the offtopic post, it's just something that's been on the brain lately...

--

--------
Bleah! Heh heh heh... BLEAH BLEAH!!! Ha ha ha ha...

Re:[ot]Google's data structure? by blamanj · 2001-06-21 01:10 · Score: 5

Probably they use a trie or the related Patricia tree. These are very space efficient and relatively fast.
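
For anyone unfamiliar with the structure, here is a minimal trie sketch in Python. Whether Google uses anything like it is the commenter's guess, not something the interview confirms; the class names and the idea of hanging a set of document ids off the node where a word ends are assumptions for illustration only.

```python
from typing import Dict, Set


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.doc_ids: Set[int] = set()  # filled only where a word ends


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str, doc_id: int) -> None:
        """Record that doc_id contains word, creating nodes as needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.doc_ids.add(doc_id)

    def lookup(self, word: str) -> Set[int]:
        """Return the documents containing exactly this word."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        return set(node.doc_ids)


trie = Trie()
trie.insert("linux", 1)
trie.insert("linux", 2)
trie.insert("lint", 3)
```

Words that share a prefix ("linux" and "lint" share "lin") share nodes, which is where the space saving comes from; a Patricia tree goes further by collapsing chains of single-child nodes.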
Re:[ot]Google's data structure? by Violet+Null · 2001-06-21 00:14 · Score: 1

Well, I don't _know_, but why would you need a binary tree? I can only assume from reading the article, but I think they do do something along the lines of doing LIKE statements on a recently fetched page. The key difference, though, is that they do the LIKE statement (or whaddeva), and then cache the results. So when you do a search for "linux", it already knows which of its pages contains the word linux; instead of having to do a wasteful LIKE again, it can simply do SELECT * FROM MatchPages WHERE MatchTerm = 'linux'. If you have two or more search terms, it finds the intersection of the results, etc.

But that's me.
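
The cached-lookup idea above can be tried out with Python's built-in `sqlite3` module. This is only a sketch of the commenter's guess, not a description of Google's storage: the `MatchPages` and `MatchTerm` names come from the comment, while the `PageId` column, the whitespace tokenizer, and the helper functions are assumptions.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MatchPages (MatchTerm TEXT, PageId INTEGER)")
conn.execute("CREATE INDEX idx_match_term ON MatchPages (MatchTerm)")


def index_page(page_id: int, text: str) -> None:
    """Do the expensive matching once, when the page is fetched."""
    terms = set(text.lower().split())
    conn.executemany(
        "INSERT INTO MatchPages (MatchTerm, PageId) VALUES (?, ?)",
        [(term, page_id) for term in terms],
    )


def search(terms: List[str]) -> List[int]:
    """Look up each term in the cached table and intersect the results."""
    if not terms:
        return []
    query = " INTERSECT ".join(
        "SELECT PageId FROM MatchPages WHERE MatchTerm = ?" for _ in terms
    )
    rows = conn.execute(query, [term.lower() for term in terms])
    return sorted(row[0] for row in rows)


index_page(1, "Linux kernel news")
index_page(2, "Linux desktop")
index_page(3, "kernel panic")
```

With these three pages indexed, `search(["linux", "kernel"])` only has to read the precomputed rows for each term rather than rescanning page text.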
Re:[ot]Google's data structure? by Violet+Null · 2001-06-21 00:23 · Score: 1

They say they update it _completely_ every 28 days. From previous work on other, less-cool search engines, I took this to mean that their spiders crawl the entire web every 28 days, refreshing the cached pages.

Re:Is she hot or not? by JAVAC+THE+GREAT · 2001-06-21 02:57 · Score: 1

Just because she's married doesn't mean she's straight.

Nice poem by the way. Did you write that yourself?
---

--
Know someone who is stealing cable? Report them!

Search 1,346,966,000 web pages by thgood · 2001-06-21 01:23 · Score: 2

Do you know how long it has been since they changed that number on their homepage????

I emailed Google about it and they gave me some crap about it being too difficult....

What the mess...

Much like McDonalds by freeweed · 2001-06-21 01:37 · Score: 2

They've been claiming '99 billion served' for several years now. Either they have a Y2Kish problem with their signs, or they're about to unleash the biggest wave of advertising the world has ever seen.

One Hundred Billion Served! Could become as common as that evil Castaway DVD commercial that's repeated at least 50 times a night on TV.

--
Endless arguments over trivial contradictions in books written by ignorant savages to explain thunder in the dark.

Re:Disturbing Search Requests by tb3 · 2001-06-21 01:20 · Score: 1

Well, that's one opinion. TMI, in my opinion. (I know there are some really sick puppies out there, I just didn't know how sick. Not sure I wanted to, either.)

"What are we going to do tonight, Bill?"

--

www.lucernesys.com Horizon: Calendar-based personal finance

Google Merchandise! by skunkeh · 2001-06-21 20:01 · Score: 1

What could be more fun than your very own Google brand lava lamp?

www.googlestore.com

Re:Is she hot or not? by yukonbob · 2001-06-21 03:19 · Score: 1

Funny how their backup cycle works, too ... every 28 days.

-bch

also... by verbatim_verbose · 2001-06-21 03:47 · Score: 1

Sure it's a good search engine, but one of the best things about it is this: http://www.google.com/intl/xx-bork/

G3r/\/\4/\/ Pr1d3 b4by, w00t!!!!!1 by Supa+Mentat · 2001-06-21 00:29 · Score: 1

Anyway... I started using google exclusively a long time ago and I'm glad to see it's progressing. BMW is also my favorite car company and I think that the search technology that they're working on is really cool so that's good. But dammit, what she says about German websites is so fscking true, you've no idea! I speak English well, better than most Americans (seriously, I lived in the US for a while and the schools you guys have are just depressingly bad) but I wish I could get more info in German, it's not as if we don't have anything to put up on the web. Props go out (see I know some slang in English) to all meine Deutsche Freundinnen!

--
"A witty saying proves nothing." - Voltaire

Perks by Violet+Null · 2001-06-21 00:03 · Score: 1

And let's not forget the on-site masseuses.

Where do I sign up?

I find it interesting, though, that Google works by generating the searches ahead of time and storing them. I would think that space would be a killer (especially for the common words, eg, car), but that's why I'm not in the search engine biz.

Does explain why Google can't handle words with symbols or numbers with decimals very well, though.

Yeah Suckah! by Louis_Cyphier · 2001-06-21 00:03 · Score: 2

We're all aware of the fact that google r0x0rs, but one thing I've always been curious about Google and their "linux boxen" is, do they only use Linux for their servers, or do they have other practical uses, IE Quake Servers, and just workstations, or is Linux used only for price reasons? Anyone know?

--
,/""-. / `-. ( ,--._ `-. "\_ `-. `,

Why Google is my favorite search engine by sketerpot · 2001-06-21 00:12 · Score: 1

* It isn't cluttered up with pictures, banner ads, and fancy web design that takes a whole minute to load
* It gets good results by looking at people's links.

I just wish that it had a "Strict Boolean" mode, where it wouldn't try to second guess what you typed in. And that it wouldn't omit common words in quotation marks.

Google Parody by eyesyte · 2001-06-21 10:04 · Score: 1

The only Google parody I know of is http://www.aybabtu.com. Anyone know of any others?

--
Logic is overrated.

Slashdot Mirror

135 comments