Google Index Doubles

Re:This is news ? by Manip · 2004-11-10 23:02 · Score: 1, Funny

Yes because we at /. love Google..

Google is a constant source of information and a geek's friend - if the index has doubled, so has our supply of information. Information rules!

Image Search by TupperTrenine · 2004-11-10 23:04 · Score: 4, Funny

Have they updated image search yet?

Re:Image Search by dapyx · 2004-11-11 00:53 · Score: 0, Offtopic

Probably they're waiting for the US to upgrade their government.

--
I'm sorry, the number you have dialed is an imaginary number. Please rotate your phone 90 degrees and dial again.
Re:Image Search by BoldAC · 2004-11-11 01:27 · Score: 2, Interesting

While waiting for the update to their image search, everybody should optimize their web pages... google-style.

For those of you who don't believe in having keywords in your URLs... just use Google's own story, for example.

http://www.google.com/googleblog/2004/11/googles-index-nearly-doubles.html

"Google Index Nearly Doubles" is in the url and the first header. Look at how they do thinks... and your google traffic will increase.
Re:Image Search by BoldAC · 2004-11-11 01:33 · Score: 1

Gesh... I need coffee...

Edit:
Look at how they think and your google traffic will increase.
Re:Image Search by jjh37997 · 2004-11-11 03:19 · Score: 1

Nope..... no Abu Ghraib torture pics yet.
Re:Image Search by Anonymous Coward · 2004-11-11 09:58 · Score: 0

The reason the page is named that way is because of recent (well, a few months) changes in the Blogger.com software. Since Google owns Blogger.com, it's natural that they would use the Blogger.com software to generate their own blog.

Whether or not it has anything to do with where your page will rank on google is pure speculation.
Re:Image Search by Asphalt · 2004-11-11 10:11 · Score: 1

I have one of the largest image indexes on a particular subject on one of my websites (sorry, no nudity).
None of the images show up anywhere on Google Images. People often email me asking me why not.
*shrug*
Google does what Google does. People seem to like it.
I have to say that I've been finding it a little lacking lately ... searching for articles that I know exist often yields nothing.
But, hey, you really can't argue with $180/share.

Re:This is news ? by Zork+the+Almighty · 2004-11-10 23:05 · Score: 1

In related news, the sun has set for today and will rise again tomorrow. The web is growing. Google is indexing it. It isn't news, it's a factoid.

--

In Soviet America the banks rob you!

More pages v.s more relevant pages by xiando · 2004-11-10 23:06 · Score: 5, Insightful

Personally I find that the lack of relevant pages is the biggest problem with search engines, not the lack of pages with information. It seems I always find what I'm looking for eventually; what I need improved is the time I spend looking through spam-bomb pages before I find a page with the correct information.

These spam-pages seem to be increasing; I mean those pages with just a bunch of keywords or the output of some search system.

--
9/11: Never forget it was a false-flag operation

Re:More pages v.s more relevant pages by Kithraya · 2004-11-10 23:51 · Score: 5, Insightful

I'm especially irritated by the increasing number of highly-ranked pages that are nothing more than another search engine's results. If Google could find some way to identify and remove these from my result set, Google's usefulness to me would increase 10 times over.
Re:More pages v.s more relevant pages by metlin · 2004-11-11 00:08 · Score: 2, Interesting

Google has a problem with this because some of those searches are actually useful.

For instance, when I search for something technical, I often run into search results from DBLP, arXiv, CiteSeer and the like -- although these are really search results within themselves, they're immensely useful to me.

Since we both effectively have a conflict of interest - Google would need to figure out a way to strike a balance.
Re:More pages v.s more relevant pages by Eric+Giguere · 2004-11-11 00:12 · Score: 1

Absolutely. This is why I always tell people to "think like a librarian" when it comes to finding information in a search engine, whether it be Google or not. That said, I don't know how much is being taught about libraries and library organization these days, so maybe that's a meaningless thing to say.
Eric
How to detect Internet Explorer (as opposed to Firefox)
Re:More pages v.s more relevant pages by __aahlyu4518 · 2004-11-11 00:20 · Score: 2, Informative

Personally I find that the lack of relevant pages is the biggest problem with search engines, not the lack of pages with information.

Actually.... information IS relevant data. If it's not relevant to what you want, then it is just data...
Re:More pages v.s more relevant pages by juglugs · 2004-11-11 00:20 · Score: 1, Funny

What's a library?

--
This sig is in Spanish when you're not looking....
Re:More pages v.s more relevant pages by corrie · 2004-11-11 00:27 · Score: 2, Interesting

However, results from places like Starware Search are not useful, and elevate my blood pressure with all the attempts at spamming me.

Just because I use Firefox and Adblock doesn't mean I now want to visit all possible spam sites in existence.

I don't care if Starware and friends make their money from advertising or not. The point is that Google is ALREADY a search engine, and a pretty good one at that. What is the point of returning results from another search engine, especially if the other one does not even have a specialised domain?
Re:More pages v.s more relevant pages by jez9999 · 2004-11-11 00:30 · Score: 4, Interesting

One thing that would really help me sometimes would be if Google allowed you to do an 'exact match' search. No, I don't mean enclosing something in double quotes, that still ignores capitalization, whitespace, and most non-letter characters. I'd like to be able to search for pages that have the EXACT string '#windows EFNET', for example, or '/usr/bin/' or whatever. '/Usr/biN' wouldn't match, and nor would '#windows^^EFNET' (where ^ is equal to a space :-) ).

I sent an e-mail to Google about this and the guy who replied didn't seem to think it was possible... anyone know if it is?

--
== Jez ==
Do you miss Firefox? Try Pale Moon.
Re:More pages v.s more relevant pages by fishbot · 2004-11-11 00:31 · Score: 1

It's what we'd get if we printed everything linked to by Google :)
Re:More pages v.s more relevant pages by Anonymous Coward · 2004-11-11 00:38 · Score: 1, Funny

I don't want to start an old discussion again ... but here is where rdf, ... can play a role. At my school they are starting to deposit articles, ... in a repository that has metadata based on the dublin core. Hope this will help searching for that kind of info: papers, ... ?

Anyway, I believe google also has a personalised search:

http://labs.google.com/personalized

Maybe this can help.
Re:More pages v.s more relevant pages by Anonymous Coward · 2004-11-11 00:45 · Score: 1, Funny

That's why engines showing clustered results may well end up beating Google at its own game.
Re:More pages v.s more relevant pages by MoobY · 2004-11-11 01:02 · Score: 1

The same goes for duplicate information. I don't want 200 versions of wikipedia listed when I'm looking for a specific article, nor the same man page 200 times when I'm researching some aspect of a unix command beyond what its man page covers.

--
--- Sigmentation Fault - Comments Dumped
Re:More pages v.s more relevant pages by PsychoSlashDot · 2004-11-11 01:03 · Score: 5, Insightful

What I've read on the Google help pages seems to indicate that they don't index punctuation or capitalization. When you search for something, your string is looked for within an existing index, and appropriate reference materials are shown. Including punctuation wouldn't result in any hits within their index, meaning no results.

Now, obviously, it is theoretically possible to do just about anything. But in this case, with the architecture they have in place, anyone ever doing what you're asking would require a full-text search through their multi-TB dataset, which I suspect is highly impractical.

My point is that as I understand it, Google has coded a number of shortcut tricks which allow reasonable search times, and full-text string-exact searching would prevent them from using those shortcuts, resulting in search times they don't seem to think are reasonable.

--
"Oh no... he found the .sig setting."
Re:More pages v.s more relevant pages by Anonymous Coward · 2004-11-11 01:10 · Score: 1, Funny

It is an interesting problem, exact string matching. If you think about how it would be done, it is relatively simple for a short piece of text: just call strstr on a chunk of text. The problem is that google likely does not index large bodies of text. Instead, google indexes bags of terms. Each term is likely a stemmed word that no longer resembles the original word. In this way, google compresses the document, saving space, while making it faster to look up keywords in a document.

The only way I think google could provide exact string matching is to search their google cache. The problem or limitation with the google cache, if you didn't notice, is that google does not cache every page, hence the word cache. While disk space is cheap, it is also slow to access, so even if it is feasible for google to store all 8 billion pages on disk, it is unlikely you would want to wait that long to search for your exact match.

There are some tricks that could be used to narrow in on which documents to do exact string checking in. First they take the string you passed in and do the normal tokenization, breaking it down into parts. Then they come up with a result set. Now they can start doing exact string matching within that returned result set. The issue with that is that it is unpredictable how long that process will take, as each document is of arbitrary size. The best they could do would be to do an exact string match in the summary text and return the documents in that set first, followed by the other documents, which is very close to what they actually do.
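To make the "bags of terms" idea concrete, here is a minimal sketch of an inverted index, assuming a normalization step that lowercases text and drops punctuation. The function names and sample documents are made up for illustration, stemming is left out, and nothing here is claimed to be how Google actually does it.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and keep only alphanumeric runs (punctuation is lost)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


# Hypothetical sample documents.
docs = {
    1: "Join #windows EFNET tonight",
    2: "Windows users on EFnet: read the /usr/bin/ notes",
    3: "Nothing relevant here",
}
index = build_index(docs)
```

Once the text has gone through `tokenize`, "#windows" and "Windows" are the same term, so the index alone cannot tell them apart.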
Re:More pages v.s more relevant pages by maxwell+demon · 2004-11-11 01:30 · Score: 1

The stuff you find in /usr/lib :-)

--
The Tao of math: The numbers you can count are not the real numbers.
Re:More pages v.s more relevant pages by Erasmus+Darwin · 2004-11-11 01:39 · Score: 3, Interesting

"But in this case, with the architecture they have in place, anyone ever doing what you're asking would require a full-text search through their multi-TB dataset, which I suspect is highly impractical."
Actually, they could cut that down considerably. For example, say we were doing an exact search for '#windows EFNET' as in the original example. The first thing they could do is start with a traditional search on "#windows EFNET". At that point, they've cut their multi-TB dataset down to just a few megs or less of likely matches (in this case, only 10 pages matched). Then they could do a full-text check on each result, looking for an exact match and discarding all the rest.
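A rough sketch of that two-step idea, under stated assumptions: the first phase here is a plain scan over normalized terms standing in for a real index lookup, and the names and sample documents are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_ids(docs, query):
    """Phase 1: ids of documents containing every normalized query term."""
    terms = set(tokenize(query))
    return [doc_id for doc_id, text in docs.items()
            if terms <= set(tokenize(text))]


def exact_search(docs, query):
    """Phase 2: keep only candidates whose raw text contains the query verbatim."""
    return [doc_id for doc_id in candidate_ids(docs, query)
            if query in docs[doc_id]]


# Hypothetical sample documents.
docs = {
    1: "Join #windows EFNET tonight",
    2: "Windows users hang out on EFnet",
    3: "#windows  EFNET with two spaces",
    4: "Unrelated page",
}
```

With these documents, the loose phase keeps 1, 2 and 3, and the verbatim check then discards 2 (wrong case, no '#') and 3 (two spaces). The expensive substring test only ever runs on the small candidate set.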
Re:More pages v.s more relevant pages by LeoNomis · 2004-11-11 01:58 · Score: 1

I'm especially irritated by the increasing number of highly-ranked pages that are nothing more than another search engine's results. If Google could find some way to identify and remove these from my result set, Google's usefulness to me would increase 10 times over.

Don't you mean 20 times over?
Re:More pages v.s more relevant pages by cavemanf16 · 2004-11-11 03:16 · Score: 1

What you guys don't realize is that the orders of magnitude more work it takes to perform the whole "capitalized/not capitalized" search makes this unreasonable for Google to attempt to do. A long while back our CRM application was consistently getting hung on queries that involved customer first/last name combinations because it WAS capitalization sensitive. You see, when you tell a computer to search for "Joe JingleheiMerScHmIdT" WITH a capitalization sensitive search, it has to go through every single combination of capitalization in that name. But when all it needs to do is match "J" to "j or J", "o" to "o or O", and so forth, the search takes MUCH less time.

While it is somewhat frustrating that Google can't do this (and do it in 0.2s), the reality is that you gain a whole lot more processing power for Google's algorithm to do its thing in presenting you with the best results when you're not sure exactly what you want out of the search. I think Google has struck a good balance so far.
Re:More pages v.s more relevant pages by Zemplar · 2004-11-11 03:29 · Score: 1

I agree that this would be a nice feature, but the fact of the matter is that the vast majority of users don't have an OS (Windows) that is case sensitive, save for a very small list of exceptions. I would also go so far as to suggest that most users don't even think with a thought process that distinguishes data by capitalization.

In the meantime, I'd suggest using Google's "special search" feature that can be found here http://www.google.com/options/specialsearches.html .
Re:More pages v.s more relevant pages by ShecoDu · 2004-11-11 03:31 · Score: 1

I don't know a lot about search engines, but maybe somebody else can update my comment.

Google uses an advanced indexing algorithm based on words and coordinates; add some hashes here and there and you get the relevant results. They could add a few characters to their valid-word-chars list, but trying to get something more exact would need a new algorithm.

You should try reading a technical document about search algorithms, just for kicks :)
Re:More pages v.s more relevant pages by PMuse · 2004-11-11 05:38 · Score: 2, Insightful

How about a NEAR operator? Sure, AND OR NOT are nice, but my results would be a lot more relevant if I could eliminate results where the search terms appeared a thousand words apart.

--
"We reject as false the choice between our safety and our ideals." --The American President (20.1.2009)
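For the NEAR idea above, a minimal sketch of a proximity check, assuming word positions are available for each document. The function names and the `max_gap` parameter are hypothetical, not an existing Google operator.

```python
import re


def positions(text, term):
    """Word offsets at which term occurs in text (case-insensitive)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [i for i, word in enumerate(words) if word == term.lower()]


def near(text, a, b, max_gap=10):
    """True if some occurrence of a lies within max_gap words of some occurrence of b."""
    return any(abs(i - j) <= max_gap
               for i in positions(text, a)
               for j in positions(text, b))
```

A page where the two terms sit a thousand words apart would pass a plain AND but fail `near(...)` with a small `max_gap`.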
Re:More pages v.s more relevant pages by Spoing · 2004-11-11 05:52 · Score: 1
I'm especially irritated by the increasing number of highly-ranked pages that are nothing more than another search engine's results. If Google could find some way to identify and remove these from my result set, Google's usefulness to me would increase 10 times over.
Agreed. I'd like to add these sites to a global block list; stumble on them during a search -- GRRRR! -- click 'block host' and never see the site again (bonus if the link can be removed or marked as 'already read').
--
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Re:More pages v.s more relevant pages by Anonymous Coward · 2004-11-11 06:10 · Score: 0

Try using "." it doesn't solve all your problems, but it can help. If you said foo.bar, then google treats it as one 7 character word, where the '.' can be a single space or certain other characters. It's great for finding.exact.quotes
Re:More pages v.s more relevant pages by tsiolkovsky · 2004-11-11 14:12 · Score: 1

Even if this were true, it seems that it would still be possible to take the query results and run them through another filter for punctuation, capitalization etc.

This wouldn't require a whole rewrite, just stick one more filter between the user's query and the final presentation of the data in HTML. You know that the query system has to be highly modular. The data passes through several modules before it is sent back to the user. They just need to add one more module that filters out results that don't meet the query's exact capitalization and punctuation.
Re:More pages v.s more relevant pages by KaiSeun · 2004-11-11 15:01 · Score: 1

That would be a good idea for them to do, but on the user side, the best solution in my opinion would be to just develop the ability to skim through the short descriptions, identifying what looks like spam and what is real. Do this fast enough, and you could go through 10 results maybe every 5 seconds. Not terribly efficient, nor does it solve the problem that google hasn't done anything yet, but a solution is needed now. Of course, people who currently do this probably are also able to skim through articles and pick up key points.
Re:More pages v.s more relevant pages by Anonymous Coward · 2004-11-13 20:57 · Score: 0

Guess what...

You're still an ass, Erasmus

I'm all alone by tcdk · 2004-11-10 23:06 · Score: 4, Funny

8 billion pages and not a single link to my blog.

Can't figure out if I should just shoot myself or maybe just open a subscription to /.

--
TC - My Photos..

Re:I'm all alone by Zork+the+Almighty · 2004-11-10 23:11 · Score: 4, Funny

If you shoot yourself, will your blog readers know? I mean, it's kind of like the tree in the forest thing.

--

In Soviet America the banks rob you!
Re:I'm all alone by dotmike · 2004-11-10 23:21 · Score: 1

Warning!
Re:I'm all alone by pAnkRat · 2004-11-10 23:30 · Score: 0

Try using:

site:blog.tc.dk

as a search.

It lists 4 hits. I think your search shows that nobody links to your page, but hey, that's life (for a geek)

--
we need an "-1 Plain wrong" moderation option!
Re:I'm all alone by tadmas · 2004-11-10 23:43 · Score: 3, Informative

8 billion pages and not a single link to my blog.

Perhaps you should just tell them where it is.
Re:I'm all alone by Anonymous Coward · 2004-11-10 23:49 · Score: 0

http://www.google.com/search?hl=en&q=site%3Ablog.tc.dk&btnG=Google+Search
Re:I'm all alone by Anonymous Coward · 2004-11-11 00:45 · Score: 0

Perhaps you should just tell them where it is.

No, it already knows. See the other people suggesting a "site:" search.
Re:I'm all alone by Anonymous Coward · 2004-11-11 00:48 · Score: 0

I think your search shows that nobody links to your page,

which is what he's complaining about. You just prove that google knows where he is but still nobody cares. Does that make it worse or better?
Re:I'm all alone by Temkin · 2004-11-11 01:03 · Score: 1

You're not alone... They don't index my site either. :P
Re:I'm all alone by tcdk · 2004-11-11 01:17 · Score: 1

My site is actually indexed, they just don't index anybody who links to me.

--
TC - My Photos..
Re:I'm all alone by ttldkns · 2004-11-11 01:51 · Score: 1

It indexes the URL I leave on my slashdot comments. It's quite useful actually, I just managed to reclaim my home page from google's cache! I lost it in a format! I'm really happy now, even tho it is a small piece of poo!

--
How many computers are too many?
Re:I'm all alone by MasterOfUniverse · 2004-11-11 03:54 · Score: 1

a dead tree actually..

--
"There is no flag large enough to cover the shame of killing innocent people."--Howard Zinn
Re:I'm all alone by jfengel · 2004-11-11 05:39 · Score: 1

I've always wanted to build a dead-man-switch email system. Something that pings you every week to see if you're still alive, and if you don't respond it sends emails. Something to protect you if you're blackmailing somebody, or let your boss know what you really think of him now that you're beyond retribution. Or maybe just a sappy final love letter to your wife. That sort of thing.

But boy would you have to build safeguards into that. "Uh, sorry, I never meant to admit my homosexual attraction to you, but see I went on vacation and forgot about the deadman switch..."
Re:I'm all alone by rob_squared · 2004-11-11 06:37 · Score: 1

Maybe you should have a more popular name:
http://www.google.com/search?hl=da&q=the&btnG=S%C3%B8g&lr=

--
I don't get it.
Re:I'm all alone by Anders · 2004-11-11 09:47 · Score: 1

I've always wanted to build a dead-man-switch email system.
I always thought that people just wanted their porn to disappear when they do.

Do this affect how fresh their index will be? by Jugalator · 2004-11-10 23:07 · Score: 3, Insightful

I wonder if it'll take longer to index twice as many pages? Or if they, along with this change, improved their spider and/or added hardware. Otherwise I'm not sure this change is for the better, unless you like to search for really obscure topics.

--
Beware: In C++, your friends can see your privates!

Re:Do this affect how fresh their index will be? by andres32a · 2004-11-10 23:41 · Score: 1

Actually no. Better search results means fewer necessary searches, which in turn will make the entire process more time-effective. And anyway, you can't just stop indexing webpages just because it might take longer to index them. You just need to improve on hardware or the technology itself.
Re:Do this affect how fresh their index will be? by Jugalator · 2004-11-11 00:19 · Score: 1

Better search results means fewer necessary searches, which in turn will make the entire process more time-effective.

Search results? Are you talking about a person searching? I was mostly concerned about how quickly Google can update their complete index now that it doubled in size. I understand for my part it might get better, as long as the index is kept up-to-date.

And anyway, you can't just stop indexing webpages just because it might take longer to index them. You just need to improve on hardware or the technology itself.

Yes, I realize this too, however I just wonder if Google made the necessary hardware/tech changes to maintain their current freshness of the index so we aren't getting an index, say, twice as big but taking twice as long to reflect all the always ongoing fluctuations on the web. I'm not sure if that would really be an improvement. More broken links and all that.

--
Beware: In C++, your friends can see your privates!

What is new about this. by hanssprudel · 2004-11-10 23:07 · Score: 3, Interesting

What the article does not point out is why this is something important. For just about forever google's store has been converging on 2**32 documents. Some people have speculated that Google simply could not update their 100,000+ servers with a new system that allowed more. Apparently they have now done the necessary architecture changes to allow for identifying documents by 64-bit (or larger) identifiers and are back in the business of making their search more comprehensive.

Good timing to coincide with MSN's attempt to start a new search engine too!

Re:What is new about this. by Jugalator · 2004-11-10 23:16 · Score: 3, Interesting

Good timing to coincide with MSN's attempt to start a new search engine too!

Yes, they'd better fight back, as they now have a serious competitor in MSN.
It's giving very accurate results.

Doesn't anyone find it strange that Google gave the same top result there a while back?

MSN must be using a very similar algorithm.

Maybe a bit too similar...?

*tinfoil hat on*

--
Beware: In C++, your friends can see your privates!
Re:What is new about this. by slavemowgli · 2004-11-10 23:23 · Score: 2, Insightful

I don't quite believe that Google would've limited themselves that way (using 32 bit identifiers for documents) - that would've been incredibly short-sighted.

--
quidquid latine dictum sit altum videtur.
Re:What is new about this. by Anonymous Coward · 2004-11-10 23:34 · Score: 4, Interesting

For just about forever google's store has been converging on 2**32 documents. Some people have speculated that Google simply could not update their 100,000+ servers with a new system that allowed more. Apparently they have now done the necessary architecture changes to allow for identifying documents by 64-bit (or larger) identifiers and are back in the business of making their search more comprehensive.
As someone who routinely follows these things, I couldn't agree more with your statement. My company operates a number of sites, and over the past 6 months, we've seen an obvious trend. Sites with, say, 5000+ pages, which used to be entirely indexed in Google, gradually had pages lost from Google. A search for site:somesite.com would return 5000 results 6 months ago. 3 or 4 months ago, the same search gave maybe 1000 results. This month maybe 500 or 600. We were definitely of the opinion that Google's index was "maxed out" and was dropping large portions of indexed sites in favor of attempting to index new sites.

Now after seeing this story, I did a search and found literally all 5000+ pages are indexed once again. This is a huge step forward for webmasters everywhere. If your site had been slowly edged out of Google's index it's most likely back in its entirety now.

Thanks G.
Re:What is new about this. by Zork+the+Almighty · 2004-11-10 23:55 · Score: 1

Ha! Google itself is #4 on MSN's results.

--

In Soviet America the banks rob you!
Re:What is new about this. by pchan- · 2004-11-10 23:57 · Score: 1

heh, bedope.com. i haven't seen that site since Be Inc went under. they were the site to introduce the most numerically advanced version of linux, ever!

"You'll note that other versions of Linux are languishing at version 6.3 or even 2.2 - only Be Dope Linux Version 27.1 with AVN (Advanced Version Numbering) brings you a version of Linux numbered at 27.1".
Re:What is new about this. by Jugalator · 2004-11-11 00:24 · Score: 2, Insightful

Wow, Microsoft must have fixed it...
It now no longer shows microsoft.com as top hit.

Haha, I guess the joke reached MS headquarters. :-P

--
Beware: In C++, your friends can see your privates!
Re:What is new about this. by Anonymous Coward · 2004-11-11 00:26 · Score: 0

Actually, it first gave Microsoft.com as top hit.
It now gives Be Dope with that search.

Funny...
Re:What is new about this. by BetterThanCaesar · 2004-11-11 01:06 · Score: 1

Or maybe all the other pages mentioning "more evil than satan himself" got higher rank anyway. The same happened to the corresponding Google query.

Wikipedia/Google bomb:

However, the first Google bomb mentioned in the popular press may have occurred accidentally in 1999, when users discovered that the query "more evil than Satan " returned Microsoft's home page. Now, it returns links to several news articles on the discovery.

As you see on the MSN search page, the same is happening here. I doubt they've made changes to target this exact query.

--
"Stop failing the Turing test!" -- Dilbert
Re:What is new about this. by Anonymous Coward · 2004-11-11 01:12 · Score: 0

Is Google not allowed to be short-sighted?
Re:What is new about this. by bighoov · 2004-11-11 01:40 · Score: 3, Interesting

Probably not short sighted, but rather a space and cpu efficiency issue. Space - If you have 64-bit doc ids, even if you index 2^48 documents you're still wasting 16 bits per stemmed word per document. CPU - dealing with 64-bit integers on 32-bit hardware usually involves multiple loads, and decreases what can fit in the hardware data caches.
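A back-of-the-envelope sketch of the space argument, assuming each posting stores one fixed-width document id and nothing else. The posting count is a made-up figure, and real indexes typically compress their posting lists, so treat this only as an illustration of the 2x factor.

```python
import struct


def posting_bytes(num_postings, id_format):
    """Bytes needed to store num_postings fixed-width document ids."""
    return num_postings * struct.calcsize(id_format)


postings = 1_000_000                      # hypothetical number of (term, document) pairs
size32 = posting_bytes(postings, "<I")    # 32-bit unsigned ids
size64 = posting_bytes(postings, "<Q")    # 64-bit unsigned ids

# Bits actually needed to number 2**48 documents, and the slack in a 64-bit id.
bits_needed = (2 ** 48 - 1).bit_length()
wasted_bits = 64 - bits_needed
```

Here `size64` is exactly twice `size32`, and `wasted_bits` comes out to the 16 bits mentioned above.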
Re:What is new about this. by Dayflowers · 2004-11-11 01:44 · Score: 1

It might have been an option of compromise.

One thing is for sure: they stayed on the 4bill for a looooong time.

--
I am a speak english. Do you not? - Saroto
Re:What is new about this. by goatpunch · 2004-11-11 03:15 · Score: 1

Yes, before the leap to 8 billion pages they were indexing 4285199744 pages, which is 99.8% of 2^32 (4294967296) - these numbers seem too close to be a coincidence (they differ by about 10 million).

--
Worst BBC News Stories
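The arithmetic behind the 2^32 comparison above is quick to check. This sketch uses only the two figures quoted in the comment; the variable names are made up.

```python
indexed = 4_285_199_744        # page count quoted before the jump to 8 billion
limit = 2 ** 32                # number of distinct 32-bit identifiers

gap = limit - indexed          # how far the old count was from the ceiling
percent_of_limit = round(indexed / limit * 100, 1)
```

The gap works out to 9,767,552 pages, a little under 10 million, and the old count is 99.8% of the 32-bit ceiling.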
Re:What is new about this. by cavemanf16 · 2004-11-11 03:19 · Score: 1

Maybe not, but now it does list www.google.com! ROFL!

http://beta.search.msn.com/results.aspx?q=%22more+evil+than+satan+himself%22&FORM=QBHP
Re:What is new about this. by captwheeler · 2004-11-11 07:48 · Score: 1

An MSN search on "more evil than satan" returns google.com as the number one site.
http://beta.search.msn.com/results.aspx?q=%22more+evil+than+satan%22&FORM=QBRE

...too bad one of the many smart people at Microsoft didn't get to make a witty response.

--
Thanks for putting on the feedbag. Thanks for going all out. Thanks for showing me your Swiss Army knife.
Re:What is new about this. by mdfst13 · 2004-11-11 12:31 · Score: 1

"MSN must be using a very similar algorithm."

You are not the only one saying that Microsoft is copying Google. Basically, they are indexing whatever Google does. Apparently they didn't have enough content of their own.

no update on the images by bvdbos · 2004-11-10 23:07 · Score: 3, Informative

Unfortunately they didn't update the image-search yet.

Re:no update on the images by Anonymous Coward · 2004-11-10 23:35 · Score: 0

You really got the hots for Lyndie England huh?
Re:no update on the images by Anonymous Coward · 2004-11-11 03:03 · Score: 0

Yeah, I really like their image search, but the ancientness of it makes it less and less useful as time passes. Anyone know of a decent alternative that has a more up-to-date index?

Google makes minor change to website - news at 11! by Sanity · 2004-11-10 23:08 · Score: 3, Insightful

Does every minor Google or Apple related thing deserve a slashdot story? Can slashdot create a "Fanboy" section for insignificant stories advocating Google (with their software patent) and Apple (with their iTunes DRM)? That way I could filter them out more easily.

Ofcourse ... by El_Muerte_TDS · 2004-11-10 23:10 · Score: 1

I made my internet mirror world reable.

Re:Ofcourse ... by loyalsonofrutgers · 2004-11-11 00:20 · Score: 1

You have a mirror of the internet? That must be the one that George Bush uses.
Re:Ofcourse ... by krymsin01 · 2004-11-11 00:28 · Score: 1

Yeah, but in the mirror Spock has a goatee...

--
stuff
Re:Ofcourse ... by Zemplar · 2004-11-11 03:33 · Score: 1

How did you do that?

Include a spell checker?
Re:Ofcourse ... by Scaba · 2004-11-11 07:28 · Score: 1

You mean Spock has a goatse...picture of Jim Kirk.

Quality - not quantity by seanyboy · 2004-11-10 23:10 · Score: 3, Insightful

Google needs to stop obsessing about the number of indexed pages, and start concentrating on the quality. Since pagerank was switched off, 2 out of 5 searches now seem to be jammed with pages full of nothing but random words and adverts. It's even more galling when the adverts are Google Ads. Much as I love Google, they're becoming increasingly less effective as a tool.

--
Training monkeys for world domination since 1439

Re:Quality - not quantity by Ingolfke · 2004-11-10 23:18 · Score: 3, Funny

I agree search engines are so 1990. I rely exclusively on word of mouth to find websites. If Firefox would add a button to the toolbar that said 'Cool Sites', maybe with an icon of a pair of glasses, and have the button link to a webpage with links to the latest cool sites on the net, that would certainly be the end of Google and their 8 billion pages. Pah!
Re:Quality - not quantity by Onionesque · 2004-11-10 23:21 · Score: 2, Insightful

To paraphrase Churchill, Google is the worst system devised by the wit of man, except for all the others. Where else would you go? Yahoo? Hey, how about AltaVista?
The problems faced by Google in their battle against the scumbags who would game the system are faced by every other search engine. Google, IMHO, handles them better.
Re:Quality - not quantity by seanyboy · 2004-11-10 23:30 · Score: 1

Agreed, they still need to know when people are being frustrated by the search results they're being given. And I'm finding it increasingly difficult to find what I want with Google.

--
Training monkeys for world domination since 1439
Re:Quality - not quantity by Anonymous Coward · 2004-11-10 23:39 · Score: 0

No, this is a good thing. It's taken them longer and longer to get pages in the damn index.
Re:Quality - not quantity by dabadab · 2004-11-10 23:49 · Score: 3, Informative

"Since pagerank was switched off"

Since when is Pagerank switched off?

--
Real life is overrated.
Re:Quality - not quantity by INT+21h · 2004-11-10 23:55 · Score: 1

Tried Stumbleupon? It has a plugin for Firefox iirc.
Re:Quality - not quantity by Anonymous Coward · 2004-11-10 23:58 · Score: 0

He was referencing the old Netscape default setup, I believe.
Re:Quality - not quantity by seanyboy · 2004-11-10 23:58 · Score: 4, Interesting

My bad. I'd skimmed a few things on the web, and assumed that it had been switched off. Looks instead as though Google have changed how it works. See PageRank is dead. I need to investigate further.

--
Training monkeys for world domination since 1439
Re:Quality - not quantity by Beolach · 2004-11-11 00:00 · Score: 1

That was actually how Yahoo! got started. A few college drop-outs started making a webpage linking to their favorite sites... and their friends started going to it, and their friends' friends, and their friends' friends' friends... and then somebody offered to pay them to advertise on the site. And we ended up with this.

--
Join moola.com, play games to earn money.
Re:Quality - not quantity by WhiteDragon · 2004-11-11 00:06 · Score: 1

At the bottom of every search results page, there is a link that says, "Dissatisfied? Help us improve". I've clicked on it once or twice when encountering a particularly spammed keyword, and they have fixed it!

--
Did you mount a military-grade, variable-focus MASER on an unlicensed artificial intelligence?
Re:Quality - not quantity by melvster · 2004-11-11 00:47 · Score: 0

Remember over 50% of google employees work in advertising or marketing.

They are an advertising company. The search is just a teaser to get you to their site.
Re:Quality - not quantity by grazzy · 2004-11-11 01:17 · Score: 1

they also made the "cool pages" the parent is talking about.. which always has sucked bad :)
Re:Quality - not quantity by mavenguy · 2004-11-11 02:27 · Score: 1

Not just a pair of glasses, but a pair of glasses mended with tape
Re:Quality - not quantity by Ramses0 · 2004-11-11 02:38 · Score: 1

StumbleUpon.com ... you can thank me (or demonize me!) later. :^)

--Robert
Re:Quality - not quantity by Anonymous Coward · 2004-11-11 03:39 · Score: 0

For you young ones... anyone who was on the net before 1995 remembers the time when "cool site of the day" was the best way to find new interesting pages. In other words, it was a joke for us old fogeys.
Re:Quality - not quantity by Anonymous Coward · 2004-11-11 05:05 · Score: 0

You know that Zawodny works for Yahoo's search team, right? Not exactly the most impartial source to trust..
Re:Quality - not quantity by PMuse · 2004-11-11 05:43 · Score: 1

First, I have to give reluctant kudos to MSN for parsing long boolean queries such as
(((A AND B) NOT (C OR D)) AND E)
Google needs to play catch-up here.
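
To make the nested query above concrete, here is a minimal sketch of how such an expression could be evaluated with plain set operations. The tiny `index` (term to page ids) is made up for illustration, and this says nothing about how MSN or Google actually parse or execute queries.

```python
# Hypothetical toy inverted index: term -> set of ids of pages containing it.
index = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4, 5},
    "C": {3},
    "D": {6},
    "E": {2, 4, 7},
}


def pages(term):
    """Return the set of page ids for a term (empty set if the term is unknown)."""
    return index.get(term, set())


# (((A AND B) NOT (C OR D)) AND E)
#   AND -> intersection, OR -> union, NOT -> set difference
result = ((pages("A") & pages("B")) - (pages("C") | pages("D"))) & pages("E")

print(sorted(result))
```

With this made-up data, pages 2 and 4 survive: they contain A, B and E but neither C nor D.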

Second, we need SORT OPTIONS. It's not that hard to allow sorting by date, title, file type, and number of hits. Again, MSN has stolen a march on Google in this area.

--
"We reject as false the choice between our safety and our ideals." --The American President (20.1.2009)
Re:Quality - not quantity by ral315 · 2004-11-12 08:35 · Score: 0

And notice that his blog is back in the number 1 spot above his home page. So the article is basically pointless now.

And I for one welcome... by mu22le · 2004-11-10 23:10 · Score: 2, Funny

No, wait, they are our internet search overlords since, like, 1999?

Mhm to anonymous coward or not to anonymous coward?
Will moderators smack my karma below zero?

Re:Google makes minor change to website - news at by Anonymous Coward · 2004-11-10 23:10 · Score: 0

Nothing beats Apple. It is superior and makes people instant geeks without knowing shiat.

Re:This is news ? by PerpetualMotion · 2004-11-10 23:12 · Score: 3, Interesting

A bigger index does not equal better search results, however, with the press this will generate, it will equal profits.

Re:Google Schmoogle by seanyboy · 2004-11-10 23:13 · Score: 2, Funny

They already have.

--
Training monkeys for world domination since 1439

slashdotting by Zork+the+Almighty · 2004-11-10 23:15 · Score: 4, Funny

In case of slashdotting use this mirror.

--

In Soviet America the banks rob you!

Re:slashdotting by juglugs · 2004-11-10 23:45 · Score: 3, Funny

No, no, no... Use this Mirror...

--
This sig is in Spanish when you're not looking....
Re:slashdotting by flewp · 2004-11-11 00:10 · Score: 1, Redundant

You can always try the google cache just in case too!

--
WWJD.... for a Klondike bar?
Re:slashdotting by osvejda · 2004-11-11 01:30 · Score: 1, Funny

This should be modded down. The mirror is out of date.
Re:slashdotting by xlcus · 2004-11-11 01:39 · Score: 2, Funny

or this mirror ;-)

Re:This is news ? by Anonymous Coward · 2004-11-10 23:16 · Score: 1, Funny

They doubled the index by counting all the stuff on your hard drive indexed and sent to them by Google Desktop Search.

Re:This is news ? by dotmike · 2004-11-10 23:18 · Score: 3, Insightful

Yeah, but it'd be news if the sun set twice in one night or rose twice as bright.

It's more the exponential increase in the size of the index rather than the piecemeal addition.

Nonsense. by MadFarmAnimalz · 2004-11-10 23:20 · Score: 2, Funny

over eight billion pages crawled

You don't just go from 4 billion to 8 billion overnight.

They are probably just crawling the same 4 billion twice.

--
Blearf. Blearf, I say.

Re:Nonsense. by Powercntrl · 2004-11-11 00:49 · Score: 0, Offtopic

So they're using Slashcode's dupe-checking module?

--

---
DRM is like antifreeze, to the MPAA/RIAA it's sweet, to the consumers it's poison.

Makes you wonder... by manmanic · 2004-11-10 23:21 · Score: 5, Insightful

Does this mean that I've been missing a huge amount of important information until now? I'd just assumed that Google covered the entire relevant web but now it seems to cover the whole same amount again. My Google alerts also seem to have started producing a lot more results, which suggests that a lot of these new pages are rated quite highly. Who knows how much more quality content on the web we're just not seeing?

Re:Makes you wonder... by krymsin01 · 2004-11-10 23:29 · Score: 1

Yes, you missed all the good info. By now, all the new pages Google is indexing are out of date and irrelevant.

--
stuff
Re:Makes you wonder... by jlar · 2004-11-10 23:29 · Score: 5, Interesting

"Does this mean that I've been missing a huge amount of important information until now?"

Maybe the steep increase is due to all the new file formats they are indexing now. That might be useful for some people (although I sometimes find it kind of annoying that a search returns MS-Word documents).
Re:Makes you wonder... by Politburo · 2004-11-11 03:19 · Score: 1

That might be useful for some people (although I sometimes find it kind of annoying that a search returns MS-Word documents).

This isn't Google's fault. I'd rather that people didn't put documents on the web in Word format, but people do it. I still need the information that's in the document, though, and I would like Google to index it. Same with PDFs, or any other format that contains text. An option would be nice for those who are looking for HTML only (or similar).
Re:Makes you wonder... by RedWizzard · 2004-11-11 09:13 · Score: 2, Informative

Maybe the steep increase is due to all the new file formats they are indexing now.
The steep increase is probably due to an architecture change. Google has, for a long time, been indexing around 4 billion pages. That implies that they have been giving each page a 32-bit unique identifier, and had exhausted that id space. It would be a lot of work for them to seamlessly upgrade all their software to support a larger id, and it has taken them a long time to do so. Now that they have, the large jump in pages is simply due to the fact that they can index much more of the web.
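
The arithmetic behind the 32-bit theory is easy to check. The sketch below only illustrates the speculation above; it is not a confirmed description of Google's internals. The index size used is the 4,285,199,774 figure quoted elsewhere in this thread as Google's page count before the change.

```python
ID_BITS = 32  # assumption from the comment above: one 32-bit id per page

# Number of distinct ids an unsigned 32-bit field can hold.
max_ids = 2 ** ID_BITS

# Page count reported before the index doubled (quoted elsewhere in this thread).
reported_index_size = 4_285_199_774

# How many ids would have been left under the 32-bit assumption.
headroom = max_ids - reported_index_size

print(f"id space:  {max_ids:,}")
print(f"indexed:   {reported_index_size:,}")
print(f"remaining: {headroom:,} ({headroom / max_ids:.2%} of the id space)")
```

Under that assumption the old index sat within about ten million pages of the ceiling, which is at least consistent with the theory, though it proves nothing.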

Re:This is news ? by krymsin01 · 2004-11-10 23:23 · Score: 1

You are damned right that'd be news. It means we slipped out of reality and headed into the twilight zone. (CUE MUSIC)

--
stuff

Re:Google makes minor change to website - news at by timdorr · 2004-11-10 23:24 · Score: 1

Maybe it's just me, but I'd call the doubling of information available for me to search a pretty significant improvement. Especially when the last update was only a 1b increase ("only" is a relative term, of course...).

--
Tim Dorr
Owner/Manager
A Small Orange

Re:Google makes minor change to website - news at by Anonymous Coward · 2004-11-10 23:24 · Score: 0

I guess it's all in the wording

Re:Google makes minor change to website - news at by Anonymous Coward · 2004-11-10 23:26 · Score: 0

Gentoy and this weeks kiddies favourite Umbongo Linux come pretty close.

Google needs your cookie badly by Anonymous Coward · 2004-11-10 23:27 · Score: 2, Informative

Until today you could save your google settings without loosing your privacy. You can still save those settings but google refuses to use them when you block their cookie. In my case I get 10 search results although I like to receive 100. Seems that they are making many dollars on a user's cookie, and now they are a public company my privacy is less important than "stock holders' interests".

Re:Google needs your cookie badly by Anonymous Coward · 2004-11-10 23:52 · Score: 3, Informative

You can still save those settings but google refuses to use them when you block their cookie. In my case I get 10 search results although I like to receive 100.
Create a keyword bookmark with the URL
http://www.google.com/search?q=%s&num=100

Give it the keyword 100, then type 100 search_term in the address bar to use it.
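
For anyone unsure what the browser does with that bookmark, here is a rough sketch of the substitution: the `%s` placeholder is replaced by the URL-encoded search terms. The helper name `expand_keyword_bookmark` is made up for illustration, and the exact encoding a given browser applies is an assumption.

```python
from urllib.parse import quote_plus

# The bookmark URL from the comment above; %s marks where the search terms go.
TEMPLATE = "http://www.google.com/search?q=%s&num=100"


def expand_keyword_bookmark(template, search_terms):
    """Substitute URL-encoded search terms for the %s placeholder."""
    return template.replace("%s", quote_plus(search_terms))


# Typing "100 printer problem linux" in the address bar would then request:
print(expand_keyword_bookmark(TEMPLATE, "printer problem linux"))
```

Because `num=100` travels in the URL itself, no cookie is needed to get 100 results per page.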
Re:Google needs your cookie badly by marc252 · 2004-11-10 23:54 · Score: 0

Well, if you consider all your privacy goes away because of a cookie from Google, you'd better stop using cellular phones, regular telephone lines, credit cards, internet connections from private places, snail mail, bank accounts, and ahh! don't you go walking through city streets, there are cameras monitoring your activity....
Once you've done all this you will find yourself living alone in a forest, then you can surely shout out loud "I'm free!!!" but,
be careful, don't shout too loud, there might be some satellite monitoring forests for people who try to live aside from the system! Good luck
Re:Google needs your cookie badly by Anonymous Coward · 2004-11-11 00:16 · Score: 0

Great solution! http://www.google.com/search?num=100&hl=enl&q=%s works too. Thanks.
Re:Google needs your cookie badly by Anonymous Coward · 2004-11-11 01:38 · Score: 0

> save your google settings without loosing your privacy

How does saving the settings make your privacy not tight? That doesn't make sense.
Re:Google needs your cookie badly by Anonymous Coward · 2004-11-11 02:39 · Score: 0

> and now they are a public company my privacy is less important than
> "stock holders' interests".

May I treat you to the obligatory "Duh!" ?
Re:Google needs your cookie badly by Anonymous Coward · 2004-11-11 04:47 · Score: 0

You are so the man. Cheers!
Re:Google needs your cookie badly by oojah · 2004-11-11 06:07 · Score: 1

Presuming you are using Mozilla (I don't use Firefox, but I guess it should work the same), find the file searchplugins/google.src in your mozilla directory.

After the line starting

You may also want to change the updateCheckDays value as well as it looks as though it will overwrite your modified google.src file (although I'm not sure about this).

This modifies the default google search behaviour that you get when you type in the URL bar, press up then return.

Cheers,

Roger

--
Do you have any better hostages?
Re:Google needs your cookie badly by Everyman · 2004-11-11 08:14 · Score: 1

The instructions for cookie-less preferences at Google-Watch have been updated. By editing your bookmark and adding four characters, the Google sabotage is defeated.

Re:Google makes minor change to website - news at by Zork+the+Almighty · 2004-11-10 23:28 · Score: 1

The extra 3 billion pages are probably link farms.

--

In Soviet America the banks rob you!

Re:Google makes minor change to website - news at by dotmike · 2004-11-10 23:29 · Score: 2, Funny

At the same time, can Slashdot create a "Curmudgeon" section for those who like to gripe about the less than monumental significance of some story topics?

Google domination. by Anonymous Coward · 2004-11-10 23:30 · Score: 2, Informative

Local tabloid Aftonbladet is running a poll on search engine use:

Google (81.4 %)
Yahoo (2.2 %)
MSN (3.8 %)
Other (11.4 %)
Don't know (1.2 %)

61730 votes so far.

I'm a little surprised, either the masses who use the "default" (MSN?) aren't bothering to answer, or google is simply very very dominant and those "default using masses" do not exist [in this country].

Re:Google domination. by Mostly+a+lurker · 2004-11-11 00:06 · Score: 2, Insightful

the masses who use the "default" (MSN?) aren't bothering to answer
I think it is more that many users of IE just do not twig that their failed page access resulted in an automatic query to MSN.
In reality, most users make occasional deliberate queries to Google and more frequent accidental queries to MSN.
Re:Google domination. by Darthmalt · 2004-11-11 05:44 · Score: 1

I watched a friend of mine type in the name of a website wrong so of course it brought up the MS search engine.

In the MS search box she then proceeded to type in google and hit enter. Does anybody else see the incredible irony in this?

If I kept eating so much spam... by dos_dude · 2004-11-10 23:31 · Score: 2, Funny

... my weight would probably double, too.

Re:This is news ? by Ford+Prefect · 2004-11-10 23:32 · Score: 2, Interesting

A bigger index does not equal better search results, however, with the press this will generate, it will equal profits.

It would be terribly easy to get trillions of pages indexed. For instance, a site I've been working on has a public calendar system, with results fished out of a database. There are very few actual events in it at the moment, but with the 'Previous' and 'Next' links it'll run from 1970 to 2038. A naïve web-crawler would index every single month for every single year, but Google would appear to have crawled over just a few, presumably flagging the pages as too similar to warrant further investigation.

With stuff like public web forums, Slashdot and the like, I can easily imagine comparatively small sites producing thousands of pages apiece. Is there useful information in there? Quite possibly, but it definitely needs treating in a different manner to an old-fashioned, static-pages-only site...
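
One well-known way to flag pages as "too similar" is to compare their sets of word shingles. The sketch below is only an illustration of that general technique, with made-up page text; how Google actually decides to stop crawling such a calendar is not known from this thread.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def calendar_page(year):
    """Hypothetical text of an empty calendar page; only the year changes."""
    return (f"Events calendar January {year} Previous month Next month "
            "Mon Tue Wed Thu Fri Sat Sun No events scheduled this month Contact us")


article = "Google announced that its index now covers more than eight billion pages"

# Two empty calendar months are nearly identical...
print(jaccard(calendar_page(1970), calendar_page(1971)))
# ...while an unrelated page shares no shingles at all.
print(jaccard(calendar_page(1970), article))
```

A crawler using something like this could stop following 'Next' links once several consecutive pages score above a chosen threshold.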

--
Tedious Bloggy Stuff - hooray?

Re:MODS ARE STUPID by Anonymous Coward · 2004-11-10 23:35 · Score: 0

He was just speaking what the rest of us were thinking. Right now, we're thinking that you don't know how to spell "repetitive".

Re:Google thieves my bandwidth by Anonymous Coward · 2004-11-10 23:35 · Score: 5, Informative

Google respects the robots.txt file. Use it.

Re:Google makes minor change to website - news at by Anonymous Coward · 2004-11-10 23:38 · Score: 0

No, you don't know shit infinity.

Microsoft by Cookeisparanoid · 2004-11-10 23:38 · Score: 4, Interesting

A lot of people have been asking what the point of the article is and why it matters. Well, possibly because Microsoft announced the launch of their search engine ( http://news.bbc.co.uk/1/hi/technology/4000015.stm ) and are claiming more pages indexed than Google (5 billion), so Google have responded by effectively doubling their pages indexed.

8 billions.... by DrYak · 2004-11-10 23:39 · Score: 1, Funny

Of which 80% is V1AGR@ advertising,
and 19% is pr0n.
There's debate if the remaining 1% contains pirated music and movie or plans for DIY nukes.

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

Re:8 billions.... by Anonymous Coward · 2004-11-11 00:25 · Score: 0

Of which 80% is V1AGR@ advertising,
and 19% is pr0n.

and considering what goggle and M$N are trying to do to each other, perhaps the answer might just be in their advertising...

/rumour - posting as an A/C due to premature moderation

RTFA by Anonymous Coward · 2004-11-10 23:39 · Score: 0

They are probably just crawling the same 4 billion twice

From the article: These are not just copies of the same pages, but truly diverse results that give more information.

Re:RTFA by Tim+C · 2004-11-11 00:33 · Score: 1

It was a joke.

--
It's official. Most of you are morons.

Mine is bigger than yours!!! by ayjay29 · 2004-11-10 23:46 · Score: 4, Informative

From BBC News here.

In a statement Microsoft said its search engine returned results from five billion web pages - more than any other search engine.

But this quickly won a response from Google which announced that its index has now grown to more than 8 billion pages.

Prior to the Microsoft announcement, Google was only indexing 4,285,199,774 web pages.

Steve Ballmer is soon to announce that his daddy is one hundrad years old, and kan kick your daddy's ass...

--
Offtopic, Inflammatory, Inappropriate, Illegal, or Offensive comments might be moderated up.

Re:Mine is bigger than yours!!! by Anonymous Coward · 2004-11-11 02:15 · Score: 0

Prior to the Microsoft announcement, Google was only indexing 4,285,199,774 web pages.

Interesting world we live in when 4 billion is a small enough figure to be prefixed with "only".
Re:Mine is bigger than yours!!! by qbwiz · 2004-11-11 02:22 · Score: 1

A large world, actually. 4 billion is less than one per person.

--
Ewige Blumenkraft.

Re:This is news ? by Anonymous Coward · 2004-11-10 23:46 · Score: 0

Dictionary definition of redundant
re.dun.dant adj
1a: exceeding what is necessary or normal: SUPERFLUOUS
b: characterized by or containing an excess; specif: using more words than necessary
c: characterized by similarity or repetition
d chiefly Brit: being out of work: laid off
2: PROFUSE, LAVISH
3: serving as a duplicate for preventing failure of an entire system (as a spacecraft) upon failure of a single component
-- re.dun.dant.ly adv

Learn what it means, mods.

Grrrrr by squoozer · 2004-11-10 23:47 · Score: 4, Funny

Now it's going to be even harder to get my name in the top spot. Why was I cursed with the surname Smith!

--
I used to have a better sig but it broke.

Re:Grrrrr by Anonymous Coward · 2004-11-11 00:14 · Score: 0

You're not the only Squoozer Smith?

Wow.
Re:Grrrrr by ceeam · 2004-11-11 00:15 · Score: 0

Have you considered changing it to Squoozer? http://www.google.com/search?q=squoozer
Re:Grrrrr by Insipid+Trunculance · 2004-11-12 03:27 · Score: 1

I hope you aren't called John

*Ducks*

--
Wanted : A Signature.

Searching LiveJournal.com by hackrobat · 2004-11-10 23:49 · Score: 4, Informative

Looks like they've added a gazillion LiveJournal pages to their index. I used to have a Google search box on my LJ that didn't throw up relevant results until last week or so. Now it works perfectly, just like builtin search (like what you see in MT and WordPress).

Re:Searching LiveJournal.com by grazzy · 2004-11-11 01:54 · Score: 1

Easily disproved:
http://www.google.com/search?hl=en&lr=&c2coff=1&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&q=site%3Alivejournal.com&btnG=Search

2,3 million
Re:Searching LiveJournal.com by cavemanf16 · 2004-11-11 06:17 · Score: 2, Insightful

MSN's "msnbot" has been crawling/spidering my webserver (which runs Geeklog and is just another blog of my random crap) pretty extensively for weeks now. (Like 5 times a day, it seems.) Searching on Google for my site's name now reveals more results from my site, but not a lot of those circle-jerk style search results pages that are just trying to generate some ad revenues. However, using the beta.search.msn.com site DOES yield a lot more random crap (mostly blogs and personal webservers) that somehow generated some kind of link to my site because of the title of one of my articles, someone linking to my site in one of their blog posts, etc.

I have a feeling MSN's new search site is gonna be mostly blogs and advertisements, not relevant information. I think it's good Google has indexed more pages, but I still believe their algorithm will continue to provide more USEFUL results than MSN. (BTW, the googlebot doesn't hit my site too frequently which tells me Google's bot understands that my site isn't updated too frequently, nor is it linked to from other important sites)

Geeks who understand marketing by Mostly+a+lurker · 2004-11-10 23:50 · Score: 1

What Google has going for them is that they combine technical know how with marketing smarts. I still use Google as my primary search engine because it produces better results. Google understands though that, in the market at large, they need to play the numbers game. Fine they say. Within hours of the Microsoft announcement, out comes this.

Frankly, I love it any time someone can best Microsoft. The next big thing may well be consumers putting their data on servers provided by the likes of Google, Microsoft and Yahoo -- running their applications there and having PCs that are little more than very easy to use display devices. If so, I would not mind seeing Google with the dominant market share. I trust them with that kind of power a lot more than Microsoft.

Doubled? Wait a minute... by 't+is+DjiM · 2004-11-10 23:50 · Score: 5, Funny

From 4 to 8 billion pages... I guess they just indexed the google cache...

--
--Use ant to make .war

Re:Doubled? Wait a minute... by fronti · 2004-11-10 23:57 · Score: 1

rotf... perhaps it's a bug in the indexer.. But when I take a look in my logfiles, there is a real "fight": googlebot vs. new-msnbot vs. Ask Jeeves.. and all 3 index the whole transcode, mplayer, xvid mailinglist archive ( http://www.itdp.de ), tons of small files :) (ok I know about robots.txt)

Competing with Microsoft's 5bn? by Richard+W.M.+Jones · 2004-11-10 23:51 · Score: 4, Informative

On the same day that this story hits the BBC. In that story Microsoft claim that they have 5 billion pages indexed, more than the 4.2 billion pages indexed (at that point) by Google. The BBC have just updated the story with the 8bn figure.

I smell competition!

Rich.

--
libguestfs - tools for accessing and modifying virtual machine disk images

Does this mean...? by jimicus · 2004-11-10 23:51 · Score: 3, Insightful

Does this mean twice as many pages with "Search for 'printer problem linux' on Kelkoo"?

Re:Does this mean...? by mikael · 2004-11-11 00:21 · Score: 1

Probably in the same way that Daylight Savings Time gives you an extra hour of sunlight each day.

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
Re:Does this mean...? by elFarto+the+2nd · 2004-11-11 02:36 · Score: 1

Do a search for uranium hexafluoride, and look at the Sponsored Links
Regards
elFarto
Re:Does this mean...? by Piquan · 2004-11-11 08:55 · Score: 1

Do a search for uranium hexafluoride, and look at the Sponsored Links
I didn't get any sponsored links.
Re:Does this mean...? by elFarto+the+2nd · 2004-11-11 09:46 · Score: 1

For those that don't get the sponsored links here it is: Uranium Fantastic low prices here. Feed your passion on eBay.co.uk! www.ebay.co.uk

robots.txt by ReKleSS · 2004-11-10 23:51 · Score: 3, Informative

Yes, this is probably a troll, but anyway... I take it you've never heard of the robots.txt file? You sound like you might want to read up on it. It's designed to help control the spidering of your pages for whatever reason, particularly cases like yours or situations where a spider would get confused and end up doing something stupid (recursive stuff, etc).
-ReK
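
For anyone who wants to see what a robots.txt rule actually does, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and the example.com URLs are made up for illustration; a real file lives at the root of your own site, and well-behaved spiders check it before fetching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every spider out of the images directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
# parse() takes the file's lines; no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler asks before fetching each URL.
image_allowed = parser.can_fetch("Googlebot", "http://example.com/images/photo.jpg")
page_allowed = parser.can_fetch("Googlebot", "http://example.com/index.html")

print("images:", image_allowed)
print("pages: ", page_allowed)
```

The file is purely advisory: it only helps with spiders that choose to honour it.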

--
md5sum -c reality.md5 reality: FAILED md5sum: WARNING: 1 of 1 computed checksum did NOT match

Re:Google thieves my bandwidth by MobiusClark · 2004-11-10 23:53 · Score: 1

Erm... Have you considered putting a robots.txt file on your webserver?

The Googlebot is quite well designed and should honour any instructions you put in it.

Take a look http://www.google.com/search?q=robots.txt

meta-no-archive by Anonymous Coward · 2004-11-10 23:54 · Score: 3, Interesting

apparently my sites will never get a good ranking on google because I don't want the search engine to cache the site. So I'm using meta no-archive tags. That's the only thing I can figure out why the sites rank so poorly on google, when they come up in the top 10-20 hits on yahoo and other search engines. The keywords for the searches are valid, the sites are relevant to the keyword searches, yet the sites don't show in the top 100-300 on google.

I've avoided all the usual spam type of tags (auto refreshing, hidden text, cloaking, etc.) and the sites are legitimate and on the up and up, and yet the only page or two that google is spidering are the one or few that appear to be without the no-archive tags and possibly the revisit/expire tags.

Is Google's policy "allow us to cache your site, or get penalized"? Anyone else run into a similar problem or can shed some light on this? The only other thing I can think of is the robots text file, which keeps googlebot, and then other spiders through a *, from entering images directories. The spiders, including googlebot, aren't restricted from entering any other directories; they are given free rein.

Anyone else with problems with no-cache, no archive, tight revisit/expire times, or similar non-spam tags that result in penalties in google ranking?

I've been using google exclusively for a few years now. But the poor page ranking of sites on my server got me wondering about other sites that may be relevant to my own searches which may be excluded or penalized by google. So I've started using Yahoo search again, as much as I hate Yahoo (what they do with advertising to Yahoo groups and Yahoo mail is a shame). It appears that Yahoo is including better results because other sites show up with higher ranking that actually are relevant. So I've learned that Google isn't as perfect as I thought it was, which was disappointing in itself. It was easy using one search site. Now I have to use two to make sure I'm getting good results. Anyone know if there is a plugin for Firefox with both Google and Yahoo search boxes on the toolbar?

Re:meta-no-archive by kill-1 · 2004-11-11 01:31 · Score: 1

I use "no-archive" on several websites, and they don't seem to get penalized. From my observations one of the most important things for a good Google ranking is still page rank. I would check the number and quality of backlinks of your site and the sites ranking higher than yours. Maybe that's the reason.
Re:meta-no-archive by Mant · 2004-11-11 02:33 · Score: 1

How many sites link to yours, and how do they rank in Google? That is going to determine your page rank more than the content of the page.

I'm sure I've seen some way of doing a sort of backwards search on a page, that will show all the pages in Google that link to it.
Re:meta-no-archive by jandersen · 2004-11-11 02:36 · Score: 1

This is exactly why I don't use Google anymore - not since 2002, or thereabouts. I suspect it has something to do with who buys adverts and who doesn't. What I seemed to observe was that when I searched on Google I would always get a lot of results that were irrelevant or only remotely relevant, but which pointed to commercial sites. The most grotesque was once when I searched for nonsense words (just to see what happened) and I got results like 'Buy books about [nonsense-word] on Amazon'. I mean, that is simply totally worthless; at least to me. Not to mention deceitful.

Yahoo, which I prefer now, does the same, but at least they are honest about it and display these links separately as 'Sponsored Links'.
Re:meta-no-archive by justMichael · 2004-11-11 03:34 · Score: 2, Informative

I'm sure I've seen some way of doing a sort of backwards search on a page, that will show all the pages in Google that link to it.
The search you are looking for looks like this: link:slashdot.org
Re:meta-no-archive by Anonymous Coward · 2004-11-11 04:43 · Score: 0

You do notice that these are *on* the side, right? Not with the regular search results? As opposed to Yahoo which puts them right above their regular results, only slightly separating them. I guess putting "Sponsored Links" right above it isn't much of a giveaway either?

http://www.google.com/search?query=blah+blah+blah
Re:meta-no-archive by Anonymous Coward · 2004-11-11 12:27 · Score: 0

Google probably doesn't penalize you for not allowing a cache, but they should.

Nothing pisses me off more than sites that allow google to view their logged in information (many scientific journals do this) but not let google cache it. If you click on the link, you can't get any information unless you pay to subscribe.

If you don't want your information available to the public, then why have it in a search engine???

If you don't mind people seeing your website, then why not let google cache it??
Re:meta-no-archive by Anonymous Coward · 2004-11-12 03:38 · Score: 0

What does selling a subscription have to do with what I posted?

I'm not charging anyone to see the content.

In case you haven't noticed, google doesn't update their search engine on-the-fly, nor do they do it very often. If you've been paying attention at all, you'll know that google doesn't update very frequently, in fact takes months to update, because when they do, the howls of protest come out from some sites losing ranking, while others stay quiet while they get a better rank. Some of it has to do with changing algorithms, and some may think they don't do this often, but I'd say they change their algorithms daily, tweaking all the time, so every index update changes ranking.

So this should give you a hint as to why not let google cache it. Web sites that have information that changes frequently is the #1 reason. Google honors the revisit tag, and possibly the expire tag. But if you have a revisit tag that says revisit in one day, and they don't change their index for 6 weeks, and you change your site on November 11 for Christmas, and their last update was November 1, with the next update January 4th, they can spider your site daily and you are still screwed for the Christmas shopping season, when 40-60% of sales happen for some businesses.

There are a lot of other reasons, but the reason outlined above is the #1 reason for the sites on my server. It's just the opposite of what you state, we want the information available to the public, and we don't want outdated, old, no longer accurate information available to the public.

Thanks to google penalizing the refresh tag (actually thanks to the spammers abusing it, but google is the one penalizing it), I still have sites and directories out there linking to pages that either have moved, or had their urls renamed due to a misspelling. Yet I still have hits coming in for the misspelled urls, and they are ignoring the server side redirection that Apache does through the httpd.conf file. So what would you rather have, a missing file with a cached page that shows nothing, or a listing that shows the up-to-date site, including the correctly spelled url sub-page? The cache feature won't work in the above example.

Scientific journals are selling their journals, not giving it away as free content. If they allowed caching, would anyone buy what they could get free through the cache? I personally think that scientific journals should be free, not subscription. I think a lot of important information is locked away because of this. But it appears, thanks to the internet, that scientific journals are being forced to evolve along with a lot of other industries. From what I read in the last year or two, the subscription only journals are coming under pressure from freely published journals. So they may be charging the content providers (the authors) to publish, and publishing openly in the future. This would be a good thing for everyone, I'm sure you would agree. And I use the cache for slow loading sites since the cache is faster. And I use it for other purposes. But if you use google enough, you'll realize that some industries, and some fields/categories don't use, or avoid the cache entirely. When you see this, take a look at the web sites, what their content is, and maybe you can figure out why they avoid the cache also, which can be any one of a number of reasons, but the reasons are usually similar for particular industries/fields/categories, and which may be completely different for other industries/fields/categories.

The sites on my server are being served by Apache on Linux. Their uptimes are measured in months to years, with better than 99.9% availability. I've had less luck with Google itself than with the sites on my server. And that doesn't include the backup servers. So availability is not an issue.

It's even worse than that! by Anonymous Coward · 2004-11-10 23:57 · Score: 0

I don't quite believe that Google would've limited themselves that way (using 32 bit identifiers for documents) - that would've been incredibly short-sighted.

What was even more short sighted was their use of two digits to store the year value for the file dates. Something about the amount of space saved by not using those extra two bytes (four for unicode).
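On the 32-bit point quoted above, here is a quick back-of-the-envelope check. It is arithmetic only: the 8 billion figure is the rounded page count from the announcement, `bits_needed` is a made-up helper name, and nothing here says anything about how Google actually stores document IDs.

```python
# Rough arithmetic only: how many documents can an unsigned 32-bit ID address?
MAX_32BIT_IDS = 2 ** 32               # 4,294,967,296 distinct values
REPORTED_INDEX_SIZE = 8_000_000_000   # rounded figure from the announcement


def bits_needed(count: int) -> int:
    """Smallest number of bits that gives each of `count` items a unique ID."""
    return max(1, (count - 1).bit_length())


print(MAX_32BIT_IDS)                        # 4294967296
print(REPORTED_INDEX_SIZE > MAX_32BIT_IDS)  # True
print(bits_needed(REPORTED_INDEX_SIZE))     # 33
```

So an index of roughly 8 billion pages would need at least 33 bits per ID, which is consistent with the idea that a hard 32-bit limit is unlikely to be the whole story.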

Re:What? by poohsuntzu · 2004-11-10 23:58 · Score: 2, Insightful

It isn't about having a better search engine, so much as it is knowing how to use it. If you are looking for information on a recipe for oriental rice using asian spice, how would you search?

Bad search example:

oriental rice recipe asian spice

Good search example:

recipe+"oriental rice"+spice

See the difference? Google tries its best to get rid of the spam pages, but it won't ever combat them all. Half of the work has to be done by you: understanding the best way to describe to the search engine what it is you want to do. The better you explain it, the better it can search for you.

--
"We're breaking out the ramen noodles. . . "
"Really? Is it someone's birthday?"

Re:Google thieves my bandwidth by Rakshasa+Taisab · 2004-11-11 00:03 · Score: 2, Insightful

You can rant all you want, but Google still has a fair use right to your images. They are reduced-resolution images and therefore legal for non-commercial use.

Not to mention robots.txt, but that is so obvious it shouldn't need to be mentioned.

--
- These characters were randomly selected.

great but where are the .txt and directories? by js7a · 2004-11-11 00:06 · Score: 2, Informative

Google won't be within reach of the pinnacle until they index .txt files, directory listings, and anonymous ftp sites.

Re:great but where are the .txt and directories? by geminidomino · 2004-11-11 00:21 · Score: 2, Informative

One out of 3 ain't a bad start. Add a few more keywords to narrow down the google-crawling.
Re:great but where are the .txt and directories? by Lehk228 · 2004-11-11 00:22 · Score: 1

They *do* index directory listings; just search for "index of"

--
Snowden and Manning are heroes.
Re:great but where are the .txt and directories? by Anonymous Coward · 2004-11-11 01:47 · Score: 0

or "parent directory" to catch those fugly IIS autogenerated indexes, too.
Re:great but where are the .txt and directories? by dandman · 2004-11-11 03:56 · Score: 1

For directories and other files (including, interestingly but worryingly zips etc) I found much unexpected data in the Wayback Machine

Now while it's not exactly a search engine itself, it's in the same family, and I use it instead of GoogleCache when needed.
Most informative were the snapshots you can find of sites recorded whilst they were in development (ie, before they turned off directory listings and turned on security settings)
Good for retrieving any backups you forgot to make - although a bit hard (and slow) to re-assemble if using a web-whacker to grab the bits automatically - the mirror files are all over the place.
Re:great but where are the .txt and directories? by irc.goatse.cx+troll · 2004-11-11 08:22 · Score: 1

You get better results with intitle:"Index Of"
Saves you from some spam traps, anyways.

--
Pain lasts, kid. Its how you know you're alive. Sometimes I think this growing up thing is just pain management-TheMaxx

Re:Google thieves my bandwidth by jvj24601 · 2004-11-11 00:09 · Score: 5, Informative

Well, if you know that Google is indexing your site and "stealing" your bandwidth, then you must have looked at the server logs, right? You'd see the name of the search bot is googlebot. Search for it, and you'll find that the first relevant link explains how to prevent googlebot from accessing your site.

The logs would probably also show failed attempts to find the file /robots.txt. Similar info is gained from searching on that term as well.
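If you'd rather not eyeball the logs, a small script can do the counting. This is only a sketch: it assumes the common Apache "combined" log format, the sample lines are invented for illustration (documentation IP addresses, made-up paths and sizes), and `summarize_bot_hits` is a hypothetical helper name.

```python
import re

# Pulls the request path, status code, and user-agent out of a
# combined-format access log line (assumed format, see the note above).
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def summarize_bot_hits(lines, bot_token="googlebot"):
    """Return (requests made by the bot, its 404s on /robots.txt)."""
    hits = 0
    robots_404 = 0
    for line in lines:
        match = LINE_RE.search(line)
        if not match or bot_token not in match.group("agent").lower():
            continue
        hits += 1
        if match.group("path") == "/robots.txt" and match.group("status") == "404":
            robots_404 += 1
    return hits, robots_404


# Invented sample lines, not real log data.
sample = [
    '192.0.2.10 - - [11/Nov/2004:00:09:00 +0000] "GET /robots.txt HTTP/1.0" '
    '404 208 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.10 - - [11/Nov/2004:00:09:01 +0000] "GET /pics/a.jpg HTTP/1.0" '
    '200 51234 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '198.51.100.7 - - [11/Nov/2004:00:09:02 +0000] "GET /index.html HTTP/1.1" '
    '200 1043 "-" "Mozilla/5.0"',
]

print(summarize_bot_hits(sample))  # (2, 1)
```

A nonzero second number means the crawler asked for /robots.txt and didn't find one, which is the hint mentioned above.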

Re:What? by LiquidCoooled · 2004-11-11 00:10 · Score: 3, Interesting

I see the difference...

Search terms: oriental rice recipe asian spice
Search Results: Results 1 - 10 of about 254,000 for oriental rice recipe asian spice . (0.40 seconds)
Search Effectiveness: REASONABLE. Good list of relevant items matched.

Search terms: recipe+"oriental rice"+spice
Search Results: Your search - recipe+"oriental rice"+spice - did not match any documents.
Search Effectiveness: UTTER SHITE

The user wants SIMPLICITY. If google cannot give decent results for simple search criteria, then people will go elsewhere.

It's the KISS principle in effect.

--
liqbase :: faster than paper

Re:What? by poohsuntzu · 2004-11-11 00:14 · Score: 0, Troll

The examples were only examples, nothing more, which is why I said "example". I'm quite sure most readers (who aren't out with a jackboot) will get the drift of what I am saying.

If a user wants simplicity, then they will get a simple search. If a user wants an advanced and refined search, then that requires advanced knowledge of google.

If people go elsewhere, oh well. Those who know how to use the search engine properly will still be here, educating those who do not. Know why? Because eliminating all spam and fake pages from searches won't happen. It just won't, due to the time it would take to check each and every page for content, much less content-defeating methods.

--
"We're breaking out the ramen noodles. . . "
"Really? Is it someone's birthday?"

Turn them off then. by aug24 · 2004-11-11 00:14 · Score: 1

You can customise your page to only have stories in your interests, and Google is one of the story types.

I'm moderating at the mo, and I'd have moderated you 'muppet', but I thought I'd be useful instead ;-)

J.

--
You're only jealous cos the little penguins are talking to me.

Re:Turn them off then. by aug24 · 2004-11-11 00:31 · Score: 1

Actually, no, it appears I'm the muppet. Google is one of the stories you have to have. I apologise!

J.

--
You're only jealous cos the little penguins are talking to me.

So, to sum up... by kahei · 2004-11-11 00:15 · Score: 5, Insightful

I am feeding this troll because there are people who really _do_ think like that and I wish I could yell at them to their faces :)

You put content in a place where it is publicly accessible. You explicitly and proactively made that content available to everyone, including 'the average surfer' and googlebots. You took no steps to make it available only to the select few of whom you approve.

Now you are all cross and bothered because average surfers / googlebots have read / copied your content, such as it is.

The solution is to drown yourself in a bucket. I have a bucket.

--
Whence? Hence. Whither? Thither.

Re:So, to sum up... by Anonymous Coward · 2004-11-13 11:06 · Score: 0

Hear hear.

This person belongs with the idiots who use right-click blockers to 'protect' their precious 'content', and the whiny teens who have a LiveJournal or blog and use it to publicly slag off everyone and everything, and then throw a tantrum when someone reads it and responds...

If you don't want people to read it, and possibly even *gasp* save it for future use, don't post it publicly.

Closer to re-adding entries by Gambit+Thirty-Two · 2004-11-11 00:17 · Score: 1

I regularly watch where my nickname, full name, parents' names, etc. come up in Google. I've noticed in the past couple of months, my hits have DRASTICALLY reduced. They just disappeared from the database. But over the past 2 days, I've gotten notifications (thanks, Google Alerts) about new pages being indexed and voila! They come up in a search again.

Proximity search will help by Sai+Babu · 2004-11-11 00:19 · Score: 3, Insightful

This is why I've been begging google folks to implement NEAR operator!

Here is an example MSN search: http://search.msn.com/results.aspx?FORM=SMCRT&q=fish%20NEAR%20ahi%20NEAR%20recipe

--
Now I'm the grandest Tiger in the Jungle!

Re:Proximity search will help by Anonymous Coward · 2004-11-11 04:41 · Score: 0

Last I knew, double quotes did that, in effect?
Re:Proximity search will help by mazarin5 · 2004-11-11 05:32 · Score: 2, Informative

Google has a near operator: *
Only useful in a quoted string.
Example:
Thomas * Edison

--
Fnord.
Re:Proximity search will help by Sai+Babu · 2004-11-11 06:39 · Score: 1

Wowsa! This is a big help. Thanks for info. While not the cashew I was groping for, it's still a very tasty nut.

--
Now I'm the grandest Tiger in the Jungle!
Re:Proximity search will help by exquisito · 2004-11-11 08:45 · Score: 1

Yeah, there is no NEAR operator, but check out this site which hacks Google into doing them: http://www.staggernation.com/cgi-bin/gaps.cgi
Re:Proximity search will help by Spy+Hunter · 2004-11-14 13:23 · Score: 1

Google makes these kinds of operators almost entirely redundant. I can't remember the last time using search operators gave me better Google results; even quotes are unnecessary 99% of the time. Google already prioritizes the pages which contain your search words in the order you specify, and pages which contain your search words in close proximity, and I believe it even does this in a phrase-sensitive way (so if you search for two common two-word phrases, Google recognizes this and prioritizes results accordingly, instead of prioritizing results based on one four-word phrase). This works much better than a NEAR operator because it is automatic, based on the phrases actually used in the pages you're searching for. Notice if you search for "ahi fish recipe" on Google you get results about the same as your example MSN search; no operators necessary.

--
main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}

Re:What? by anti-trojan · 2004-11-11 00:24 · Score: 0

There is no (+) operator to use with Google. It is being used by other search engines, but not the way you wrote it.

--
Virus infects both Windows and Linux!

Re:What? by jez9999 · 2004-11-11 00:25 · Score: 4, Informative

Erm, that's only because of the bizarre plus signs the grandparent poster put in - try this. Note to grandparent: Just about any modern search engine assumes words not prefixed by anything are to be included in the Boolean search query. No need for +.

--
== Jez ==
Do you miss Firefox? Try Pale Moon.

Still works without the quotation marks .. by RedLaggedTeut · 2004-11-11 00:26 · Score: 1

Well, it seems Microsoft has dropped to rank 5 in spiritual ranking, should I sell my stock?

--
I'm still trying to figure out what people mean by 'social skills' here.

Just tried the beta of the new MSN Search by Mostly+a+lurker · 2004-11-11 00:29 · Score: 3, Funny

I received this response:

This site is temporarily unavailable, please check back soon.
Didn't get the results you expected? Help us improve.

It is not clear to me how I can help them improve. Suggest they switch their servers to Linux?

Re:Just tried the beta of the new MSN Search by Henk+Poley · 2004-11-11 10:24 · Score: 1

They are already using a frontend served by Akamai, which runs Linux.

Advertising a deficiency by fleener · 2004-11-11 00:31 · Score: 1

When a search engine announces it has increased its index of pages, it advertises a deficiency....

"Oh, if you just added several billion pages, were you giving me crap before? How many more billions of pages are you not indexing right now?"

Google's announcement merely gives its users reason to question the size and comprehensiveness of Google's index.

Re:Advertising a deficiency by skraps · 2004-11-11 02:46 · Score: 1

Riiight.. because search engines are supposed to be birthed by God with a complete index already in place. None of that crawling business to make the index larger.

--
Karma: -2147483648 (Mostly affected by integer overflow)
Re:Advertising a deficiency by fleener · 2004-11-11 04:04 · Score: 1

No, I said making a big public announcement of that sort is advertising a deficiency, not that building the index is bad. It's negative public relations. Read before criticizing.

Re:What? by LiquidCoooled · 2004-11-11 00:32 · Score: 1

The problem is, you tell people to use quotes and pluses and cryptic search terms.

When google cannot find anything, it comes up and tells them the opposite:

Tip: Try removing quotes from your search to get more results.

People don't need to know the quoting syntax, or the inclusion format rules, they just need to click the "Advanced Search". :)

When you make a comparison regarding how much better your way is than everybody else's, make sure your facts are clear. I agree it was a mistake, and I agree with your sentiment, but most users don't even know how to type a quote character.

--
liqbase :: faster than paper

Re:Google makes minor change to website - news at by shrykk · 2004-11-11 00:40 · Score: 1

You can block Apple stories in your user preferences page.

--
#define struct union /* Reduce memory usage */

Re:Google thieves my bandwidth by rdc_uk · 2004-11-11 00:48 · Score: 1

"Google still has a fair use right to your images. They are reduced resolution images and therefor legal for non-commercial use."

FYI; nothing google does is "non commercial".

Even the stuff they let out "for free" serves to funnel their adverts to you, which is their source of revenue. i.e. it is a commercial activity.

Ergo; their use of other people's data (or data ABOUT other people's data, such as a thumbnail of someone else's copyright imagery) is in NO WAY non-commercial.

Do web crawler really have a future? by Anonymous Coward · 2004-11-11 00:52 · Score: 0

It seems to me the larger and more dynamic web sites become, the less and less useful web crawlers will become. I suspect it will get to the point where site admins will have to regularly submit a keyword list to the search engines.

Re:Google makes minor change to website - news at by kjamez · 2004-11-11 01:01 · Score: 0, Offtopic

i don't know if it's news or not, but c|net news was reporting gmail now offers free pop access.

that's cool.

i have gmail invites for free that require no ipod or free lcd signup. i just have no one to give them to. everyone i *know* has one already. i have six.

and it's more fun to talk to someone than just to submit them to the gmailinvitecache.

--
you can't have everything, where would you put it?

Re:In other news... by Anonymous Coward · 2004-11-11 01:03 · Score: 0

Including dictionary.com it would seem...

Re: Index update by geoff_smith82 · 2004-11-11 01:06 · Score: 1

I am happy because this is the first Google update that has indexed some files on one of my websites that are going to be used for a program I wanted to write. I registered the domain and created the basic website 11 months ago and have been waiting since! So finally I will be able to get to work on it.

The real reason... by Anonymous Coward · 2004-11-11 01:08 · Score: 0

Every kid in China has been asked to make a webpage about their family as a school project.

try +the by leuk_he · 2004-11-11 01:20 · Score: 2, Informative

Yes there is, try to search for

The Doctor

vs

+the doctor

Re:try +the by martingunnarsson · 2004-11-11 02:09 · Score: 1

Yes, the plus sign on google is only used to force a search for "common words", which are otherwise filtered out. These are usually simple words like are, is, how, what etc.

--
Martin

Re:What? by gus+goose · 2004-11-11 01:22 · Score: 1

Hmmm... actually:
http://www.google.com/help/refinesearch.html

There IS a + operator, and you are modded "informative"....

gus

--
.. if only.

I know why it has doubled... by jmcmunn · 2004-11-11 01:23 · Score: 2, Interesting

Because every blogger in the universe has added at least 3 pages since the last index. I fail to see how it is significant to me that there are now 8 billion mostly worthless sites out there. The number of actually useful sites has not gone up considerably.

Web API by Anonymous Coward · 2004-11-11 01:31 · Score: 0

Just noticed this on the Google pages; does anyone know where their web API came from:
http://news.google.com/apis/
oh, and why do they now own:
http://www.keyhole.com/
now then, a picture with each Google Local response?

That's gota be the quickest dupe!! by mcoko · 2004-11-11 01:36 · Score: 0, Offtopic

Amazing... Just as fast as Google hit 8 billion, Slashdot duped the story. Subscribers will see it two to three stories about this one.

--
www.fotoforay.com

Still censoring images? by Anonymous Coward · 2004-11-11 01:38 · Score: 0

Yeah, but are they still censoring stuff? Like pictures of American war crimes in Iraq (just try a search for abu graib and lyndie england, Google still returns none of the pictures of her torturing "inmates"). Google appears to be very open to censorship by commercial and government interests. I'm afraid unless this changes I am going to have to stop using them.

Read more carefully. by Anonymous Coward · 2004-11-11 01:42 · Score: 0

As you are the foremost of several identical replies, somehow marked (+5 insightful) instead of (-1 redundant), I will answer you. But consider this a response for all of you who have replied.

Firstly, read my words. I am fully aware of the existence of robots.txt. The clue was where I said Unless I adhere to their own arbitrary rules. In your rush to correct me you have all seemed to have been missing your reading comprehension. Thanks should also go to moderators who have yet again branded me troll because of 5 or 6 of you hot headedly misresponding to the same mistaken point.

The real question here is why I, a UKian, should have to KNOW about robots.txt at all? Why should I have to find out, or somehow just instinctively know, about this arcane piece of information, just to host my own website for my friends to visit? Why is this an opt-out list instead of an opt-in list? Why am I automatically expected to want these ridiculous bots crawling over my personal space, violating my privacy? This is an American company with dubious notions of personal data privacy and no clear data retention policy indexing the contents of my site for all to see, when this is against my wishes, basing its decision on its own ridiculous opt-out list. That is the scandal. Next time, perceive the beam instead of the mote etc. etc.

And when they make the original mistake, how easy do you think it is to make them cleanse the site from their archives, cache, etc? I can tell you, I hope you are never in that situation. In the end it took a legal threat for them to take notice of me.

Google are thieves. Just because they thieve from everyone, that their thievery is diluted a trillion times does not make it OK. They take our words, our information and our images and they use it to make money for themselves. And we do not see one red cent of it. It is paid to the rich backers.

I am aware that anonymity on the Internet is the motivating cause of many insults. But you should consider your words carefully, I was offended by being told to go and drown in a bucket. A family member drowned recently and it brought back to me the horror of a loss of life, especially in one so young. Please try to be more polite in future. I debated with myself to include this paragraph, as it would just open me up to more abuse from you, but I will give you the benefit of the doubt. We often say things in haste we repent slowly.

Re:Read more carefully. by RandoX · 2004-11-11 02:45 · Score: 1

I'll give you the benefit of the doubt in assuming your thoughts and words are genuine, and you're not trolling.

The internet is not a private place. Other than the security procedures the developer implements, there is no "opt in" for any web site, any more than you can only allow people on your whitelist to call your phone number. Robots.txt is not JUST a Google or an American tool. It is recognized by many international search engines or other indexing spiders. Having a web site comes with a certain amount of responsibility, including protecting the information you want to keep private, and telling the spiders that index sites that you don't want them. I think the vast majority of users will agree that Google provides a valuable service, and by conforming to the rules they allow a way to keep sites out that don't want to be included in searches. Just add this:

User-agent: *
Disallow: /

to the file to keep all robots out. Now, robots that don't pay attention to your requests are a legitimate problem.
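If you want to check what a given robots.txt actually tells a well-behaved crawler, Python's standard library ships a parser for it. A minimal sketch, assuming example.com URLs and a hypothetical `allowed` helper; it only shows what the rules say, not whether any particular bot obeys them.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above, plus a narrower variant for comparison.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_IMAGES_ONLY = "User-agent: *\nDisallow: /images/\n"


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(BLOCK_ALL, "Googlebot", "http://example.com/photos/cat.jpg"))      # False
print(allowed(BLOCK_IMAGES_ONLY, "Googlebot", "http://example.com/index.html"))  # True
print(allowed(BLOCK_IMAGES_ONLY, "Googlebot", "http://example.com/images/a.jpg"))  # False
```

The second rule set is there to show that you don't have to block the whole site if it's only, say, an image directory you want left alone.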
Re:Read more carefully. by Mant · 2004-11-11 03:18 · Score: 4, Insightful

Robots.txt isn't something that only applies to Google; it is (supposed) to be honoured by all search engines, and uses the Robots Exclusion Standard. So, when you claim these are Google's arbitrary rules, you are in fact wrong. They are neither Google's nor arbitrary (at least no more than any web standard).

So your clue is not so much of a clue, as robots.txt doesn't fit your description.

As for why you should know about it, you are putting up a web site, it is part of running a web site. You might as well complain why you need to know about HTML, CSS or registering a domain name. Quite what coming from the UK has to do with it (something I also do), I have no idea.

"I simply do not want the average surfer to be able to visit my site, I am not interested in serving my pages to them, they simply would not appreciate or understand what it is I am showing."

Then a publicly accessible website is the wrong place. It is not your personal space, and it isn't private. You made it available to the world, nobody made you. To turn around and complain when (some of) the world visits it is hypocracy.

It's like putting up posters around a town, then running around complaining that all these people are looking at them, won't appreciate them, and you don't want them to. It also comes across as condescending and arrogant, which probably explains the nastiness of some of the responses.

You opted in when you put up the publicly accessible website. If all search engines had to be opt-in, nobody could find anything on the web, and it would lose a lot of its utility. You're assumed to want them crawling because the vast majority of people do; they want their site to be found. If you don't, though, no problem: just use the standards for stopping searches, or password-protect the site. No scandal at all, just hysterics.

Showing the low res thumbnail of your image isn't violating your copyright either. The only legitimate claim you have is the amount of time it took to remove something from the cache.

The "thieves" accusation is even more ridiculous. If you put something up on the web people can see for free, you can't complain. There are options if you want to protect it. Google doesn't claim your work as theirs (which would be 'stealing' or at least copyright violation), they help people find your public web site.

If you don't want a public website but made one, whose fault is that? If you are going to run a website and can't be bothered to find out how to do it properly, you can't blame Google.
Re:Read more carefully. by Anonymous Coward · 2004-11-11 03:19 · Score: 0

If you wanted to keep the information private, why did you post it on a publicly accessible webpage in the first place? You should have included an authentication mechanism to limit access, or made it a private BBS instead of part of the internet.
Re:Read more carefully. by Anonymous Coward · 2004-11-11 03:56 · Score: 0

"The "thieves" accusation is even more ridiculous. If you put something up on the web people can see for free, you can't complain. There are options if you want to protect it. Google doesn't claim you work as theirs (which would be 'stealing' or at least copyright violation), they help people find you public web site."

I hardly agree with this. If someone writes a book and gives it away for free, someone else can't simply put "John Doe wrote: " in front of the book and then use it to make money, whether by selling it, giving it away when you buy something, or even giving it away when you read their ads.
The point is moot though, because again you can opt out of being cached.

*Posted anonymously because I modded in this thread.
Re:Read more carefully. by Anonymous Coward · 2004-11-11 04:27 · Score: 0

I think you mean hypocrisy. If hypocracy meant anything, it might be a deficient system of government.
Re:Read more carefully. by Anonymous Coward · 2004-11-11 05:21 · Score: 0

Hear hear! Stupid self-centered egocentric selfish grandparent BS. "I published information for the world, how dare they steal it from me!!!"

In the latest FireFox by mandrake*rpgdx · 2004-11-11 01:56 · Score: 1

you can click on the google search bar and it will bring down a choice of search engines, including yahoo.

--
click me

DUPE PULLED by LiquidCoooled · 2004-11-11 01:59 · Score: 0, Offtopic

The dupe has been dropped :)

2x google is enough for anyone.

Do the moderation points given in that article get returned?

--
liqbase :: faster than paper

OT: The dupe is removed! by helge · 2004-11-11 02:02 · Score: 0, Offtopic

It seems that the dupe of this article http://slashdot.org/comments.pl?sid=129334 "Google Cranks Up Index" is removed! Is this the first time it happens on Slashdot?

Re:OT: The dupe is removed! by Anonymous Coward · 2004-11-11 02:35 · Score: 0

I was able to find one link to the dupe through Google News: Google Cranks Up Index

googlewhack? by CarrotLord · 2004-11-11 02:18 · Score: 1

Does this mean the end of the googlewhack? Or the beginning of a whole new googlewhacky world?

--
Quidquid latine dictum sit, altum videtur.

Also: Gmail gets POP access by scrm · 2004-11-11 02:21 · Score: 1

Google also started implementing
POP access for Gmail today (my account has it enabled already). There's no IMAP yet, and we know there were ways of doing this before, but it's an interesting direction for Google to take. As stated in the article, they don't intend to start charging for POP access or mail forwarding in the future. So how can Gmail's ad-based business model continue to be viable when its users can read their mail from external clients and via external addresses?

--
---- scrm

GOOGLE DUPE STORY HIDDEN, BUT STILL THERE! by Anonymous Coward · 2004-11-11 02:37 · Score: 0

Somebody quickly mirror this link before Michael destroys it!

Google Cranks Up Index

Re:What? by Anonymous Coward · 2004-11-11 02:48 · Score: 0

but... you don't know how to use it properly. the plus sign is not used in google searches in the way you specify. from google's own help page which you obviously haven't read yourself:

" + " Searches

Google ignores common words and characters such as "where" and "how", as well as certain single digits and single letters, because they tend to slow down your search without improving the results. Google will indicate if a common word has been excluded by displaying details on the results page below the search box.

If a common word is essential to getting the results you want, you can include it by putting a "+" sign in front of it. (Be sure to include a space before the "+" sign.)

Another method for doing this is conducting a phrase search, which simply means putting quotation marks around 2 or more words. Common words in a phrase search (e.g., "where are you") are included in the search.

that's what you get for being an arrogant dick.

Re:This is news ? by Beyond_GoodandEvil · 2004-11-11 02:55 · Score: 1

"if the index has doubled so has our supply of information. Information rules!" Not to be a spelling Nazi but you misspelled pr0n.

--
I laughed at the weak who considered themselves good because they lacked claws.

Re:What? by Anonymous Coward · 2004-11-11 02:56 · Score: 0

The user wants SIMPLICITY. If google cannot give decent results for simple search criteria, then people will go elsewhere.

Which user? The same people who threatened to move to Canada if Bush was reelected? I know I want relevance.

Where will that user go? Yahoo? MSN? Or any other search engine that is no easier to search or to get relevant results from? Google is still the best, but people want more.

Google is not hard to master if you spend a few minutes to read their guide.

try technorati... by ndrtkr · 2004-11-11 03:03 · Score: 1

if you want up-to-date results, screw google and try Technorati, then you'll know who's talking about you...

still, it seems that you are the only one talking about you! :D

--
- live from Costa Rica !

Re:What? by nmg196 · 2004-11-11 03:08 · Score: 1

There is the need for the + sign if you want to force Google to include the word in the search when normally it would class it as an ignored word.

Useful if you want to find things that incorporate "noise words" in their names:

eg "+the guru" compared to "guru"
(film)

Re:What? by FlopEJoe · 2004-11-11 03:10 · Score: 1

I don't know... I've had good results searching for:

asian nurses spice

No fancy pluses or quotes needed. But I see we're looking for different things.

Wow by xnot · 2004-11-11 03:12 · Score: 1

My porn resources just doubled!

"Double your pleasure... double your fun..."

How can we prove this? by Eric_Cartman_South_P · 2004-11-11 03:12 · Score: 1

How can anyone prove this?

Is there any way to spider their spider, to prove they have that many pages in an index?

Re:How can we prove this? by Anonymous Coward · 2004-11-11 04:39 · Score: 0

Well, now that the election's over, there are probably some Diebold machines collecting dust somewhere.

Why isn't there a headline on Microsoft's new.... by Anonymous Coward · 2004-11-11 03:18 · Score: 0

Why isn't there a headline on Microsoft's search engine which directly competes with Google?

Seems you guys love biting the hand that feeds you ...

67256 votes by Anonymous Coward · 2004-11-11 03:28 · Score: 0

Updated figures for 67256 (+5526) respondents:

Google (81.2%)
Yahoo (2.2%)
Msn (3.8%)
Other (11.5%)
Don't know (1.2%)

Dark Net by xnot · 2004-11-11 03:32 · Score: 1

Has google made any progress in indexing the so-called "Dark Net"?

http://news.bbc.co.uk/1/hi/sci/tech/1721006.stm

Re:Google makes minor change to website - news at by Anonymous Coward · 2004-11-11 03:34 · Score: 0

...weird. I thought this was the "fanboy section".

index-- by Doc+Ruby · 2004-11-11 03:42 · Score: 1

Search for (abu ghraib), and find only pictures of a harsh wartime prison. None of the famous torture pictures which appear on the web, even though there are several showing Iraqis happily celebrating. Either their excuse that their index is just too old is now obviously bogus, or their image search should never have lost its "Beta" label. Either way, it's obviously dangerous to rely on Google, or any one Web filter, for any accuracy. A much more useful search system would include a multi-index client behind the browser's "Address" input widget. It would query multiple competing search indices (from among a user-defined list with popular defaults), returning collated results including the "messages" (ads) sent by the responding index. Accepting multiple different query formats, like Google, Yahoo, and others (and translating to query the respective indices), it could completely take over the search function, as long as it didn't play favorites with one engine over another (like the locked-in pages from these engines today). Mozilla plugin, anyone?
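To make the collation step concrete, here is a rough sketch of just the merging part. It makes no network calls and uses no real search API: each engine is represented by a plain ranked list of URLs, and the engine names, URLs, and the `collate` function are all placeholders for illustration.

```python
from itertools import zip_longest


def collate(results_by_engine):
    """Round-robin merge of ranked URL lists, dropping duplicate URLs.

    `results_by_engine` maps an engine name to its ranked list of URLs.
    Returns a list of (url, [engines that returned it]) in merged order,
    so no single engine's ranking dominates the top of the list.
    """
    sources = {}
    order = []
    names = list(results_by_engine)
    ranked_lists = (results_by_engine[name] for name in names)
    for row in zip_longest(*ranked_lists):
        for name, url in zip(names, row):
            if url is None:  # this engine has run out of results
                continue
            if url not in sources:
                sources[url] = []
                order.append(url)
            sources[url].append(name)
    return [(url, sources[url]) for url in order]


# Placeholder data standing in for real responses.
merged = collate({
    "engine_a": ["http://example.com/1", "http://example.com/2", "http://example.com/3"],
    "engine_b": ["http://example.com/2", "http://example.com/4"],
})
for url, engines in merged:
    print(url, engines)
```

Taking one result from each engine in turn is only one possible policy; a real client would also need the per-engine query translation described above, which this sketch leaves out.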

--
make install -not war

He who mods down by Anonymous Coward · 2004-11-11 03:42 · Score: 0

...doesn't get the joke. ...also has a small penis, most likely.

courts have ruled in google's favor by Anonymous Coward · 2004-11-11 03:44 · Score: 0

Thumbnails are fair use

Beginning of the end for Google! by Anonymous Coward · 2004-11-11 03:49 · Score: 0

Google was great a few years ago. Now, with 8 billion pages, search results consist mostly of irrelevant nonsense such as web logs, links to stupid pages from lousy personal web sites like Angelfire and Geocities, etc.

For some reason Google ranks these junk pages higher than the primary source of information users are searching for. So much for Google's superior search algorithms; they are next to useless now that they have this much data to search.

I rarely use Google anymore because I waste too much time filtering through crap results!

Damn... by Phil+John · 2004-11-11 03:57 · Score: 1

...a search for Phil John now yields two results before my slashdot user page. :o(

--
I am NaN

Re: Case-sensitive search takes more effort? by Alwin+Henseler · 2004-11-11 03:57 · Score: 1

What you guys don't realize is the orders of magnitude higher that it takes to perform the whole "capitalized/not capitalized" search

I beg your pardon? You didn't ever follow any basic programming courses, did you? What you're saying is nonsense.

Case-sensitive searching is just EXACT comparison of text strings, if you compare:

"Joe JingleheiMerScHmIdT" with
"Joe JinGleheiMerScHmIdT"

there's no match, because the "G" doesn't match "g". This kind of searching is easy, simple & fast. Case-INsensitive comparison just means filtering the strings through "make all uppercase" or "make all lowercase" (or other filters) before doing the comparison. This is EXTRA work, but for most applications, insignificant (fast, simple & easy as well).
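To put that in code, here is a minimal Python sketch using the two strings above. The function names are invented for illustration, and `casefold()` stands in for the "make all lowercase" filter.

```python
a = "Joe JingleheiMerScHmIdT"
b = "Joe JinGleheiMerScHmIdT"

def equal_exact(x, y):
    # Case-sensitive: a plain character-by-character comparison.
    return x == y

def equal_ignoring_case(x, y):
    # Case-insensitive: normalize both strings first, then compare.
    # The normalization is the extra work.
    return x.casefold() == y.casefold()

print(equal_exact(a, b))          # False: "g" does not match "G"
print(equal_ignoring_case(a, b))  # True
```

Both comparisons are linear in the length of the strings; the case-insensitive one just makes an extra pass over each.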

A long while back our CRM application was consistently getting hung on queries that involved customer first/last name combinations because it WAS capitalization sensitive.

You're confusing the programming technique itself with a badly coded implementation (your CRM app).

Google already had more than 5 billion by Dryth · 2004-11-11 03:57 · Score: 1

For quite some time now, searching for extremely common words (e.g. "the" by itself) has turned up a page count of around 5,400,000,000. The number on Google's front page seems to update less frequently than the actual number indexed.

Still, I suppose it isn't unreasonable for most people to go by the number on the front page.

Re:Google already had more than 5 billion by izomiac · 2004-11-11 05:16 · Score: 1

Whoa, and that doesn't even count many of the non-english sites Google indexes... maybe the only thing they did was update the number on the bottom of the webpage...
Re:Google already had more than 5 billion by adpowers · 2004-11-11 19:33 · Score: 1

I'd believe it. Notice how when you search for [the] now, it returns exactly 8 billion pages? Who wants to bet Google has code in there that limits the number of results listed (not that you can view them anyway) so no one really knows /how/ much they have indexed? They are secretive about their computer count for competition reasons, so I wouldn't be surprised if they limit how much the public knows about the number of indexed pages. However, I, like others, have noticed a bunch of new Google Alerts being sent out, so maybe they did update, but the index may be much bigger than 8 billion.

Andrew

Re:Google thieves my bandwidth by Rakshasa+Taisab · 2004-11-11 03:57 · Score: 1

I can't find those adverts you are talking about, perhaps you are talking of some other Google image search?

--
- These characters were randomly selected.

Re:Google thieves my bandwidth by roman_mir · 2004-11-11 04:10 · Score: 1

What is more interesting to me is why search engines don't work the other way around. Why do we need a robots.txt file to tell the robots to sod off, when anyone can come up with a new spider overnight?

I think it would be more appropriate to have the robots.txt file contain invitations, so that the spiders would always check first and crawl the site only if they are welcome.

--
You can't handle the truth.

Blogs and Google's index by mpost4 · 2004-11-11 04:32 · Score: 1

I wonder if they are going to take some action on blogs. They seem to be skewing the results a bit. Google bombing is still popular, so I would think that google would do something to clean up the problem, such as giving less weight to a link from a blog and to the text the blog uses to describe it.

I have found it annoying that sometimes I will be searching for something and my own website is a hit. I admit I am not surprised, because I put things in the blog that interest me and then search for the same things.

you don't know what you are talking about by Anonymous Coward · 2004-11-11 04:34 · Score: 0

the + in front of a word means it MUST be included.
without it all words are optional.

just an FYI, a - in front of a word means it MUST NOT be included.

Re:you don't know what you are talking about by toddestan · 2004-11-11 05:38 · Score: 1

the + in front of a word means it MUST be included.
without it all words are optional.

Anyone who has done some Googlewhacking knows that is false. But for search engines other than Google, this might be true.

no its not by Anonymous Coward · 2004-11-11 04:38 · Score: 0

the + sign is used to FORCE a word to HAVE to appear in the results.

searching for

military victories +french

will only bring back results that MUST HAVE french in them, making the results MORE RELEVANT

military victories -french

does the opposite, it says I only want military victories that DON'T have the word french in them!

god, people can't even read some simple documentation.

Re:no its not by martingunnarsson · 2004-11-11 11:11 · Score: 1

god, people can't even read some simple documentation.

Yeah, like yourself. This is straight from the most basic of the Google search instructions found here:

"The Basics of Google Search

To enter a query into Google, just type in a few descriptive words and hit the 'enter' key (or click on the Google Search button) for a list of relevant web pages. Since Google only returns web pages that contain all the words in your query, refining or narrowing your search is as simple as adding more words to the search terms you have already entered. [...]

Google ignores common words and characters such as "where" and "how", as well as certain single digits and single letters, because they tend to slow down your search without improving the results. Google will indicate if a common word has been excluded by displaying details on the results page below the search box.

If a common word is essential to getting the results you want, you can include it by putting a "+" sign in front of it. (Be sure to include a space before the "+" sign.)"

Fucking idiot.

--
Martin

When will they get around to indexing blogspot? by .killedkenny · 2004-11-11 04:54 · Score: 1

Looks like Google doesn't even index their own blog hosting site. The title of my blogspot blog is nowhere to be found in a Google search, and that blog is over 6 months old.

Re: Case-sensitive search takes more effort? by delphi125 · 2004-11-11 04:54 · Score: 1

I think you are the confused one.

Great-Great Grandparent said Google uses an index.

Great Grandparent said that the index could be used to prefilter results.

Grandparent posted a problem with this - admittedly not as clearly as he could have.

You posted some crap which displays that you don't know what you are talking about at all, and insulting someone who is raising a valid objection.

The magic of Google is that using a very clever index, it can find relevant results amazingly quickly - as do all search engines. The whole point is that they don't search everything.

While prefiltering could in theory help, searching for exact matches is still far more expensive than is realistic. The reason for this is not the very specific searches proposed above, but rather people wanting to search for '#Windows' (with a capital W).

I'm sure you understand that searching for such a term in all documents which match the (case-insensitive, indexed) term 'windows' would be prohibitively expensive even if only a million such queries were put to Google each day.

The only sensible way in which such a thing could be achieved is if Google 'randomly' selected some searches to 'improve'. No keyword or special symbol, just IF there is CPU time and for a suitable (case sensitive) search term, do the post-processing. Having said that, it would be better to recognize 'common' case sensitive keywords, such as pH, PhD, EFNET, or whatever, and simply use those as separate keywords in the index.
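A toy version of that last idea, purely to illustrate it: everything here (the whitelist, the tuple keys, the documents) is invented, and nobody outside Google knows how their index is actually laid out. The ordinary index stays case-folded, and only whitelisted terms get an extra exact-case entry.

```python
from collections import defaultdict

# Hypothetical whitelist of terms where capitalization carries meaning.
CASE_SENSITIVE_TERMS = {"pH", "PhD", "EFNET"}

def build_index(docs):
    """Map terms to document ids: folded for everything, exact for the whitelist."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[("folded", token.lower())].add(doc_id)
            if token in CASE_SENSITIVE_TERMS:
                index[("exact", token)].add(doc_id)
    return index

def lookup(index, term, exact=False):
    key = ("exact", term) if exact else ("folded", term.lower())
    return sorted(index.get(key, set()))

docs = {
    1: "join #windows on EFNET",
    2: "efnet is an irc network",
    3: "soil pH and a PhD thesis",
}
index = build_index(docs)

print(lookup(index, "efnet"))              # [1, 2]
print(lookup(index, "EFNET", exact=True))  # [1]
```

The exact-case lookup costs the same as any other index lookup; the price is paid at indexing time and in index size, which is why it only makes sense for a limited list of terms.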

Re:Google thieves my bandwidth by That's+Unpossible! · 2004-11-11 05:00 · Score: 1

What is more interesting to me is why search engines don't work the other way around. Why do we need a robots.txt file to tell the robots to sod off, when anyone can come up with a new spider overnight?

I think it would be more appropriate to have the robots.txt file contain invitations, so that the spiders would always check first and crawl the site only if they are welcome.

That is what they do. If you don't want to let any spiders crawl your site, you can say:

User-agent: *
Disallow: /

If you want something fancier, you could have robots.txt served up by a program. But the features of robots.txt suffice for most.
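For the curious, here is roughly how a well-behaved spider would read such a file, sketched with Python's standard `urllib.robotparser`. The bot names and the "invite only" file are made up for the example, and of course none of this stops a spider that simply ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url):
    """Return True if robots_txt permits the given user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Shut out everybody.
deny_all = "User-agent: *\nDisallow: /\n"

# "Invitation" style: one named (hypothetical) bot is welcome, everyone else is not.
invite_only = (
    "User-agent: FriendlyBot\n"
    "Disallow:\n"
    "\n"
    "User-agent: *\n"
    "Disallow: /\n"
)

url = "http://example.com/page.html"
print(allowed(deny_all, "AnyBot", url))
print(allowed(invite_only, "FriendlyBot", url))
print(allowed(invite_only, "OtherBot", url))
```

An empty `Disallow:` line means "nothing is off limits", so the second file works as a whitelist for polite spiders.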

--
Ironically, the word ironically is often used incorrectly.

Re:Google makes minor change to website - news at by mcc · 2004-11-11 05:49 · Score: 1

Uh...

Google, the foremost search engine, makes a pretty serious change (doubling their index size while suddenly announcing a new dedication to increasing their index) on the exact same day that an intended major competitor of theirs (MSN search, which slashdot already had a story on) launches...

I'd say that's pretty significant. Since you apparently don't want any Apple or Google news, why not just disable those stories...?

--
Irritable, left-wing and possibly humorous bumper stickers and t-shirts

JavaScript by Anonymous Coward · 2004-11-11 05:53 · Score: 0

"The documents in Google's index are in dozens of file types from HTML to PDF, including PowerPoint, Flash, PostScript and JavaScript."

I'm sure people write a lot of great information in JavaScript. Is this a sign of Google going down hill?

How to get rich:
1) Create expensive, hyped-up IPO for search company
2) Announce you're getting double the pages since you now index JavaScript
3) Watch as share prices double
4) ???
5) Enjoy yourself on out-of-the-way tropical island, with your GOOG shares sold just before the crash.

OT: hotmail storage increased to 250MB today by peter303 · 2004-11-11 06:11 · Score: 1

Seems like lots of M$ and Google services are being enhanced today. I welcome the new storage.

Finally stopped using 32-bit int by glyph42 · 2004-11-11 06:13 · Score: 1

I guess they finally got around to changing their page indexing scheme from a 32-bit unsigned integer. The number sat at 2^32 - epsilon for what seemed like an eternity! I expect the sudden doubling of pages is simply the backlog that built up while they waited for the conversion to 64-bit, or whatever.
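For reference, the arithmetic behind the guess, as a small Python sketch. Whether Google really used a 32-bit page ID is pure speculation on my part; the counter function below is hypothetical and only shows what the ceiling would look like.

```python
UINT32_LIMIT = 2 ** 32  # number of distinct values in a 32-bit unsigned integer

def next_id_uint32(current):
    """Simulate incrementing a 32-bit unsigned counter (wraps around at the limit)."""
    return (current + 1) % UINT32_LIMIT

print(f"{UINT32_LIMIT:,}")               # 4,294,967,296
print(next_id_uint32(UINT32_LIMIT - 1))  # 0: the counter wraps, so no new unique IDs
```

So about 4.29 billion is the most pages you could number with such an ID, which is in the same ballpark as the count the front page was showing.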

--
Music speeds up when you yawn, but does not change pitch.

msn search by alexislashdot · 2004-11-11 06:22 · Score: 1

I just did a couple of comparative searches on google and the new http://beta.search.msn.com/ and it is the first time in a long while that I have seen another search engine return more results, and faster, than google. At least for some keywords.

Try it yourself.

Re:msn search by demon4 · 2004-11-12 06:06 · Score: 0

Did you read the other /. article that says that the msn search may use google to update its results?

Re:What? by Anonymous Coward · 2004-11-11 06:24 · Score: 0

quite wrong. *BZZZRT* thanks for playing, take your FUD with you on the way out.

How to Find Anything by not_hylas(+) · 2004-11-11 06:36 · Score: 1

No holds barred, find anything.
One of the best sites on the web (totally private, no ads).

http://www.searchlore.org/

Knock politely, ask for Fravia.

http://fravia.2113.ch/phplab/mbs.php3/mb001

Win-Linux centric.

P.S. Some friendly advice, don't piss off the natives ... you've been warned.

http://www.searchlore.org/tools.htm

--
~hylas

Re: Case-sensitive search takes more effort? by Alwin+Henseler · 2004-11-11 06:59 · Score: 1

I think you are the confused one.

No confusion here, but misunderstanding perhaps. For clarification, let me summarise:

jez9999 writes he/she would like to 1) "search the web" for exact match '#windows EFNET'. Obviously a massive amount of work, impossible for quick search queries.

Google uses an index, which is updated/refreshed every so many weeks, and only contains a very limited/filtered subset of "the web". Logical, this is the whole point of using an index.

PsychoSlashDot writes that Google's index works in a way that doesn't allow search 1)

Erasmus Darwin proposes to do a traditional search, and then run full-text search 1) on 2) the retrieved subset of Google's index. I think it's very important to make a distinction here between 2) this subset of Google's index, and 3) the actual web content that this subset of Google's index refers to. 2) would be quickly accessible to Google, although using it differently could require major changes in Google's hardware/software infrastructure. To do full-text search on 3), you'd have to actually retrieve and process the web content itself, which could quickly become a huge task unless the search returns only a very small number of results.

I think we can agree that 3) can be regarded as very time-consuming, but that 2) may or may not be possible (ask Google).

cavemanf16's comment may have been meant to point this out (valid point), but what cavemanf16 actually wrote (CRM app stuff), says that full-text search becomes way more expensive if you include case-sensitivity. That is plain nonsense.

Searching the subset of web content that Google's index refers to (3) may be too much work for Google, but maybe adding case-sensitive or punctuation search within the subset of Google's index (2) IS do-able. Again, only Google knows.

I know why.... by Mostly+Monkey · 2004-11-11 07:47 · Score: 1

They must have crawled into Slashdot's -1 threshold cache.

--
Chika Chik-ah... do-e ow ow.

Whoa! Did they disclose this to investors? by rbrome · 2004-11-11 08:44 · Score: 1

Whoa - wait a minute...

If that is true - if Google's system has had a design flaw limiting it to 4.3 billion pages until now - then that is a really huge weakness, risk, and vulnerability that the company has had until now.

Thinking back, they must have known about this for a long time - before they went public. If that's the case, did they disclose this weakness/risk to investors in their S-1? If not, did they break the law by not doing so?

Re:Google makes minor change to website - news at by Anonymous Coward · 2004-11-11 09:50 · Score: 0

The big issue with Google is that their page count has been stuck at around 4 billion for a few years now. Which, as covered elsewhere, suggests that someone was using 32-bit unsigned integers somewhere...

Nice to see that they've finally patched everything up and can now index beyond 4 billion pages.

for one second... by _Qiang_ · 2004-11-11 12:08 · Score: 0

I thought that was "Google enlarged"

Client-side full-text search by gottabeme · 2004-11-11 15:13 · Score: 1

Seems like what is needed is a client-side app to take the first X number of Google results and do a case-sensitive search on the client computer, instead of on Google's servers. Sure, it'd take a while to download the first X number of results, but it would work if you really needed a more specific search.
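The filtering half of such an app is trivial. A rough sketch, assuming the first X results have already been downloaded into (url, text) pairs; the URLs and page text below are invented, and the downloading itself is left out.

```python
def case_sensitive_filter(results, phrase):
    """Keep only the results whose page text contains phrase with exact case."""
    return [url for url, text in results if phrase in text]

# Hypothetical pages fetched from the first few (case-insensitive) search results.
fetched = [
    ("http://example.com/irc", "Join #Windows on EFNET for help"),
    ("http://example.com/glass", "cleaning your windows without streaks"),
    ("http://example.com/os", "Windows tips and tricks"),
]

hits = case_sensitive_filter(fetched, "Windows")
print(hits)
```

Python's `in` operator on strings is already an exact, case-sensitive substring match, so punctuation such as "#Windows" works the same way. The slow part is the downloading, not the matching.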

--
"Those who consume the bulk of goods are those who make them. We must never forget this secret of our prosperity."

Let's see... by CatOne · 2004-11-11 15:33 · Score: 1

2.7 billion from dealtime.net, 5 billion from slickdeals.net, and 250 million from blogs, and 50 million from real web sites?

Guess 50 million ain't bad.

Interesting milestone by NG+Resonance · 2004-11-11 15:44 · Score: 1

A search engine now indexes more web pages than there are members of the human race.

Re:What? by fingerfucker · 2004-11-11 16:26 · Score: 1

It's not because these spaces are "bizarre", it's because instead of recipe+"oriental rice"+spice, the great grandparent should have put in some spaces before the pluses: recipe +"oriental rice" +spice

Re:MODS ARE STUPID by Vombatus · 2004-11-11 17:10 · Score: 1

I spelled it write you ass. You should learn how to spell (or how to use a dictionarie sight for that matter)

Surely you mean you spelt it right. And surely you mean dictionary.

If you are so fond of the dictionary, try using it occasionally to ensure that you not only have the right spelling, but the right homophone (or is that the write homophone?)

--
This sig is intentionally blank

Page rank is good by Anonymous Coward · 2004-11-11 18:02 · Score: 0

I should've included that info in the top post. The particular site in question is linked by other sites fairly well, and some of the linking sites are highly ranked directories that place my site as a premier or feature site. And if a competitor complained about the site, and google looked at it for problems, they should've left it alone because the keyword results leading to the site are highly relevant and no one is led to the site through misleading keywords or other methods.

The number and quality of backlinks are the first things I checked and double checked to make sure everything stayed ok with the site. It isn't easy to fake or mislead on this, which is one of the reasons google uses this method, and one of the reasons why I would be expecting much better results explicitly because of this. But I highly doubt that google is alone in using this method for search results, since this has been talked about for more than a year. I'm sure that Yahoo is using this as one of their algorithms as well, which would explain the excellent ranking of the site in question on their search engine.

Something else is going on with google. And that is what got me wondering about search results when I use the search engine myself.

324 comments