Everyman · Slashdot Mirror

This article has too much fluff on In Google We Trust · 2004-03-14 03:54 · Score: 4, Informative

I was disappointed in the piece. Because I'm the founder of Google Watch, the reporter on the piece, David Hochman, called me twice in the last three weeks to talk about Google, for a total of about an hour. I have a feeling that the piece came out the way it did because he was constrained by his editors. The NYT has a custom-filtered AdWords feed from Google, and it's one of the reasons why the Digital NYT is in the black. Their record of publishing trenchant pieces about Google has been rather lame now for several years. Money talks, both at the NYT and at Google.

Yes, Google has some problems on Search Beyond Google · 2004-02-23 09:13 · Score: 4, Insightful

Yes, Google has a spam problem. It has been getting worse over the last year. In April 2003, Google stopped crawling the web once per month and recalculating PageRank from that monthly crawl. Since then, there has been a question of whether PageRank can even be calculated accurately by Google.

I speculated about a 4-byte docID overflow problem in an essay last June at Google Watch. In recent months Google started a "Supplemental Index" for some curious, unexplained reason. Their total number of pages indexed was recently updated to 4,285,199,774 -- just below the maximum for an unsigned 32-bit integer (4,294,967,295). It looks as suspicious now as it did last June.
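The arithmetic behind that suspicion is easy to check. What follows is only a sketch: it assumes the docID is an unsigned 4-byte integer, which is the essay's speculation and not anything Google has confirmed, and the variable names are illustrative.

```python
# Assumption: docIDs are stored as unsigned 4-byte (32-bit) integers.
MAX_UNSIGNED_32 = 2**32 - 1        # 4,294,967,295
MAX_SIGNED_32 = 2**31 - 1          # 2,147,483,647

reported_pages = 4_285_199_774     # page count cited in the comment above

# How many more IDs fit before an unsigned 32-bit counter runs out.
headroom = MAX_UNSIGNED_32 - reported_pages

print(f"unsigned 32-bit maximum: {MAX_UNSIGNED_32:,}")
print(f"reported pages indexed:  {reported_pages:,}")
print(f"remaining headroom:      {headroom:,}")
print(f"fraction of range used:  {reported_pages / MAX_UNSIGNED_32:.4f}")
```

The reported count is already past the signed 32-bit limit, so the comparison only makes sense against the unsigned maximum.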

Last November, Google began using an on-the-fly filter to further refine the search results for ecommerce sites. Some spam was deleted, a lot of other spam took its place, and a lot of mom and pop ecommerce sites were dropped inadvertently. Many people were unhappy.

Further evidence that Google's old ranking system is broken is the fact that three famous Googlebombs, "french military victories," "weapons of mass destruction" and "miserable failure" are all still working. The first one is eleven months old. It used to be that such Googlebombs were suppressed at the next monthly crawl, when PageRank was recalculated. Now it seems that suppressing them is beyond Google's ability. How else can you explain why Google puts up with these widely-publicized embarrassments?

Google's results remain unsurpassed for noncommercial sites from EDU, ORG, and GOV domains, however. Their crawling of the noncommercial sector is the most complete of any engine. Google probably does so well here because spam isn't much of a problem in this area.

So far Yahoo doesn't appear to be making much of an effort at covering the noncommercial web. It should be added that Google has more of a spam problem simply because spammers have been focused on Google for so long. Once Yahoo gets the same attention from spammers, then we'll be able to make a fair comparison of Yahoo with Google.

Pushing opiates is good for Wall Street on Google Considering IPO Auction Online · 2003-10-23 23:50 · Score: -1, Troll

Investing in Google is a good bet, because they have zero scruples when it comes to making money. Their AdWords program pushes illegal opiates. The FDA doesn't care, the big pharmaceutical companies only care that many of the prescription drugs are coming from outside the U.S., and it seems that only Britain cares -- advertising drugs to the public in Britain is not allowed, so Google screens them out for their UK engine.

Re:Watching Google Watch! on Google Tracking Frequent Users · 2003-10-06 07:11 · Score: 1

Chris Beasley is the person behind the Google-watch-watch.org site. He explains on the site why he did it:

"I am Chris Beasley, the author of this site. Why did I make it? Because I love Google. Google is a great company, a good company, a responsible company. They are in a position of tremendous power and they do not abuse it. They never sacrifice their vision for the sake of making a buck. They are benign innovators, if only other companies (here's looking at Bill), were this good."

I'm Daniel Brandt, who started the Google-Watch.org site. I object to the fact that Mr. Beasley quotes from a Salon piece that's over a year old. Farhad Manjoo, the author of that piece, ambushed me. Most of his telephone interview with me was about privacy issues at Google, and most of his piece was about how I don't like my NameBase rankings. I never mentioned Donald Rumsfeld or United Airlines in the interview -- Mr. Manjoo made up these examples to make me look stupid.

I pointed out to Mr. Beasley that the piece he was quoting on his site was biased and inaccurate. He ignored me, and keeps quoting from the piece.

I complained to Mr. Manjoo's editor at Salon, Andrew Leonard, a year ago when the piece came out. Mr. Leonard ignored me. I have never been given a chance to correct the record.

Now this Salon drivel gets picked up by Slashdot and gets modded up to a 5. It's not enough that I was gleefully slashdotted by the readers of this site when the piece first came out in Salon.

Strange counts for five weeks now on What's Wacky with Google? · 2003-10-06 04:46 · Score: 5, Interesting

The counts have been broken for the last five weeks. A count for the word "the" produced fairly consistent results until then of about 3.4 billion. Then it shifted five weeks ago to 5.2 billion. Lately it has been under 2 billion. Now it's just over 2 billion.

Webmasters who have various directories, and know exactly how many pages are in each one, began noticing five weeks ago that Google was reporting approximately twice as many pages in each directory as have ever existed there. Until five weeks ago, Google's count was fairly close to the actual number (assuming that you get a full crawl).

GoogleWatch speculates on the reason why Google has been behaving strangely ever since it stopped doing the traditional deep crawl once per month. The last standard deep crawl was in April but it wasn't used -- Google threw out this data (by their own admission) and reverted to earlier data. The speculative piece was written last June.

Since it was written, Google has started showing "supplemental results" on many searches. It looks like they are running a parallel index. Why would they do this? All the problems Google has been having, along with the supplemental index, seem to support GoogleWatch's theory.

Too broken to even conspire.... on How Objective Is Microsoft's Search? · 2003-08-24 10:33 · Score: 1

Search engines are so inconsistent these days that "broken" is a better description than "conspiracy." Using the appearance of being "broken" as a cover for "conspiracy" is way beyond reach, for Google or for Microsoft.

Example: Since May, Google reports 20 backlinks for www.google-watch.org and shows 16 of them. But Alltheweb reports 24,646 backlinks for us and shows the first 3,510.

And yet, we have no complaints. Our pages at Google Watch rank very well in Google searches.

Google's cache copy - the larger issue on Web Caching: Google vs. The New York Times · 2003-07-14 02:36 · Score: 5, Interesting

The question is framed very narrowly by Slashdot, so this discussion misses the larger issues. The cache copy is an issue in Google's main index for many webmasters. The Google News situation is a subset of a larger problem; the cached link doesn't exist in Google News. Google News is a much narrower issue. I'd like to bring up the issue of full-text caching done by Google in their main index.

My problem with the cache is that it gives Google a competitive advantage that is unfair, and furthers their monopoly. This is especially unfair since it is most likely illegal -- assuming that you could ever get a good test case into court, or get a class action lawsuit going by some webmasters, publishers, or search engines.

To add to the attractiveness of the cache copy, consider what Google has done:

1) The cache copy makes it possible to highlight the search terms, whether or not you have the toolbar installed.

2) The download time for the cache copy from Google's servers is always faster than from the original website.

3) You never get a 404 "not found" or a DNS lookup failure for the cache copy.

4) The link to the page recommended by Google for bookmarking at the top of the cache copy is a link to Google's copy, not to the original page.

5) How about all that Google branding on the top of the cache copy? Priceless.

I feel the cache should be opt-in, not opt-out. The only way you can avoid it right now is to place a "noarchive" meta on every page in your site. On some file types, such as .txt files, there's no place to insert a "noarchive" and Google goes ahead and caches it anyway.
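To make the opt-out concrete, here is a minimal sketch of how a crawler might look for that meta tag. The class and function names are hypothetical, and nothing here claims to be how Google's crawler actually works; it only shows that the tag depends on the file having HTML markup to carry it.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # The content value is a comma-separated list of directives.
            self.directives.update(d.strip().lower() for d in content.split(","))

def opts_out_of_cache(document_text):
    """Return True if the document carries a robots 'noarchive' directive."""
    parser = RobotsMetaParser()
    parser.feed(document_text)
    return "noarchive" in parser.directives

html_page = ('<html><head><meta name="robots" content="index, noarchive">'
             '</head><body>Hello</body></html>')
text_file = "A plain .txt file has no head section and no meta tags."

print(opts_out_of_cache(html_page))
print(opts_out_of_cache(text_file))
```

The plain-text input can never signal an opt-out this way, which is the gap described above for .txt files.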

The cache copy tends to keep eyeballs on google.com, and increases their searches. You may have noticed that many major news sites won't link to other websites in their stories anymore, but rather just mention the relevant site without putting a link behind it. That's because they don't want eyeballs wandering off of their page. A wandering eyeball may not come back and look at more ads. That's basically one of the big reasons behind the cache copy as well -- it keeps eyeballs from wandering as much as they would without the cache.

All the Google partners -- AOL, Earthlink, Yahoo, Netscape -- don't include the cache links, and I assume that this is the reason. They don't want people wandering off to Google and staying there.

As new competition is organizing to challenge Google's monopoly, from places such as Overture (Alltheweb and AltaVista), Yahoo (Inktomi), AskJeeves/Teoma and Microsoft, these engines have to consider whether to fight Google on the cache copy, or offer their own cache copy even if they think it is illegal. There isn't really any middle ground on this.

Many observers with legal expertise feel that while the snippets are "fair use" of a website's content, offering the full text in a cache version is not. Copyright law requires "express permission," but Google only offers an incomplete and inconvenient opt-out. I suspect that the legal departments of these other engines are more inclined to challenge Google rather than launch into their own violations of copyright law.

Cartoon about Friedman and Google on Does Google = God? · 2003-06-29 09:23 · Score: 2, Informative

A site at www.google-watch.org put up a cartoon about this that pretty much says it all.

Re:Big Brother Google on Gator Examined · 2003-05-23 09:24 · Score: 1

And fortunately for American democracy, NameBase made Oliver North easy to convict because he accepted an illegal gratuity from a former CIA security officer. This person installed North's security gate. A reporter got the name of a "Mr. Robinette" from the gate's manufacturer, and NameBase led a CBS reporter to one Glenn Robinette, who admitted all in front of a CBS camera early the next morning. See http://www.namebase.org/ollie.html for a Washington Post column on this. That was in 1987. Where was Google in 1987?

Re:Evan Williams denies it... on Google To Create "Blog" Search; Potentially Remove From Main · 2003-05-12 06:25 · Score: 1

Evan Williams says Andrew Orlowski is full of crap, but what are we to make of this?

"Deal May Freshen Up Google's Links: Blogger Acquisition Taps Into Some of Newest Material on the Web"

by David F. Gallagher, New York Times, February 24, 2003, p.C5

Last paragraph in the story:

"Now Mr. Williams has hinted that a sophisticated search engine just for Weblogs is on the agenda, and status reports for the Blogger service last week indicated that his team was already taking advantage of Google's infrastructure. 'Suddenly we have the resources of Google, where I personally am no longer thinking about servers and bandwidth,' Mr. Williams said."

Is the New York Times full of crap too, after interviewing you, Evan?

Slashdotters prove the article is accurate on The Googlewashing Of Our Language · 2003-04-03 13:48 · Score: 1

The basic point of The Register piece is that PageRank often causes mediocrity to rise to the top.

Look at the comments in this thread. A perfectly coherent, valid critique of PageRank, posted by The Register, is now drowning in a sea of blogger-like idiocy surrounding the keywords "second superpower," due to the fact that Slashdot, with its scandalously-high, Googly-geeky ranking status, couldn't leave well enough alone.

Shame on Google. Shame on Slashdot.

But if Google retains all data, it's cool, right? on Bookseller Purges Records to Avoid PATRIOT Act · 2003-02-20 15:38 · Score: 2, Interesting

ISPs and search engines are affected by the Patriot Act also. The authorities can claim that search terms are part of the URL, because they get logged with the URL in normal httpd logging. Therefore they fall under the definition of "routing and addressing" information that is subject to "tap and trace device" scrutiny. Judges are required to approve orders for such scrutiny without a showing of probable cause.
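As a rough illustration of why search terms end up in ordinary web server logs, the sketch below pulls them out of a single log line. The log line, IP address, and query are invented, and the "q" parameter name is an assumption about how a search URL is formed.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical access-log line in Common Log Format; the IP address,
# timestamp, and query are invented for illustration.
log_line = (
    '192.0.2.10 - - [20/Feb/2003:15:38:00 -0800] '
    '"GET /search?q=patriot+act+bookseller&hl=en HTTP/1.0" 200 5120'
)

def search_terms_from_log(line):
    """Pull the q= search terms out of the request line of a log entry."""
    request = line.split('"')[1]            # 'GET /search?q=... HTTP/1.0'
    method, url, protocol = request.split(" ")
    query = urlsplit(url).query             # 'q=patriot+act+bookseller&hl=en'
    values = parse_qs(query).get("q", [])   # '+' is decoded to a space
    return values[0].split() if values else []

print(search_terms_from_log(log_line))
```

Because the terms travel inside the requested URL, anything that records the URL records the search as well.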

Google saves your cookie ID, your IP number, your search terms, the date and time stamp, and your browser configuration with every search request you make to Google, and Google retains all this data indefinitely, and Google will not comment on their dealings with the authorities.

But this is cool because Google has cute colored letters in their logo, right?

But if Google does it, it's cool? on Ebay's Flexible Privacy Policy · 2003-02-20 07:46 · Score: 3, Interesting

I'll bet Google does the same thing as eBay -- it's just that Google isn't dumb enough to brag about it. From New York Times, 28 November 2002, page E6:

"Google currently does not allow outsiders to gain access to raw data because of privacy concerns. Searches are logged by time of day, originating I.P. address (information that can be used to link searches to a specific computer), and the sites on which the user clicked. People tell things to search engines that they would never talk about publicly -- Viagra, pregnancy scares, fraud, face lifts. What is interesting in the aggregate can seem an invasion of privacy if narrowed to an individual.

"So, does Google ever get subpoenas for its information? 'Google does not comment on the details of legal matters involving Google,' Mr. Brin responded."

The EFF seems to agree with Google Watch on Should you Fear Google? · 2003-02-18 12:02 · Score: 4, Informative

From the Electronic Frontier Foundation's analysis of the Patriot Act:

"1. Be careful what you put in that Google search. The government may now spy on web surfing of innocent Americans, including terms entered into search engines, by merely telling a judge anywhere in the U.S. that the spying could lead to information that is "relevant" to an ongoing criminal investigation. The person spied on does not have to be the target of the investigation. This application must be granted and the government is not obligated to report to the court or tell the person spied upon what it has done."

A modest proposal on Google Responds to SearchKing's Lawsuit · 2003-01-09 18:20 · Score: 2

If Google reduces Slashdot's PageRank from an 8 to a zero and keeps it there, I'll take down my Google Watch site.

Google's reply to SearchKing claims they have the right to do this to SearchKing -- or anyone else -- for no reason whatsoever, because it's their opinion protected by the First Amendment. Are you listening, Google? Rid yourself of a pesky critic and raise the IQ of the Web in one simple step!

Big Brother under a cute logo? on A Peek Into the Google · 2002-11-28 09:03 · Score: 3, Interesting

"Mr. Poindexter is pursuing a scheme he thought up right after 9/11 and then sold to the Bush administration. Total Information Awareness, or T.I.A., aims to use the vast networking powers of the computer to 'mine' huge amounts of information about people and thus help investigative agencies identify potential terrorists and anticipate terrorist activities. All the transactions of everyday life -- credit card purchases, travel and telephone records, even Internet traffic like e-mail -- would be grist for the electronic mill." -- New York Times editorial, 18 November 2002

"Google currently does not allow outsiders to gain access to raw data because of privacy concerns. Searches are logged by time of day, originating I.P. address (information that can be used to link searches to a specific computer), and the sites on which the user clicked. People tell things to search engines that they would never talk about publicly -- Viagra, pregnancy scares, fraud, face lifts. What is interesting in the aggregate can seem an invasion of privacy if narrowed to an individual.

"So, does Google ever get subpoenas for its information? 'Google does not comment on the details of legal matters involving Google,' Mr. Brin responded." -- New York Times, 28 November 2002

Question: What would be the fastest, most efficient, and most revealing approach to data mining the Internet?

Answer: Pay Google for a back-door feed on who's searching for what.

Question: Has Google ever, in their entire existence, issued any sort of statement suggesting that their sense of public responsibility would preclude being used in this way, or that the information they collect would never be sold for a price?

Answer: No.

Question: If Google decided to sell out, could they be held liable for privacy violations? Would we even find out about it?

Answer: No. The Homeland Security Act exempts companies from lawsuits or government prosecution after they turn over information to the new agency. Such information is exempt from the Freedom of Information Act. Officials who release this information can get up to six months in prison and a $5,000 fine.

All my cookie has is this little number! on Mr Anti-Google · 2002-08-29 07:06 · Score: 1

Google's privacy page says:

"Google notes and saves information such as time of day, browser type, browser language, and IP address with each query."

They also save search terms with each query.

Some of you seem to think that it all has to be saved in the cookie. No, all you need in the cookie is a unique ID number. Then all this information is saved on Google's end under your ID number.
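A minimal sketch of that arrangement, assuming a server-side table keyed by the cookie's ID. The table layout, field names, and sample values are all invented; this shows the general technique, not Google's actual storage.

```python
from collections import defaultdict

# Hypothetical server-side log keyed by the cookie's ID. The fields follow
# the items listed in the privacy-page quote above, plus the search terms.
server_log = defaultdict(list)

def record_query(cookie_id, ip, browser, timestamp, terms):
    """Store one search under the visitor's cookie ID (on the server)."""
    server_log[cookie_id].append(
        {"ip": ip, "browser": browser, "time": timestamp, "terms": terms}
    )

# The cookie itself carries only the ID; everything else is kept remotely.
cookie = {"ID": "a1b2c3d4e5f60718"}   # invented value

record_query(cookie["ID"], "192.0.2.10", "MSIE 6.0", "2002-08-29 07:01", "pregnancy test")
record_query(cookie["ID"], "192.0.2.10", "MSIE 6.0", "2002-08-29 07:04", "divorce lawyer")

# One lookup by ID returns the visitor's whole search history.
history = [entry["terms"] for entry in server_log[cookie["ID"]]]
print(history)
```

The cookie stays tiny, yet the ID is enough to tie every later search to the same record.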

I don't know how many times I've heard someone say, "But cookies are harmless! All they have is this little number!" Simply amazing.

Here's my essay on Mr Anti-Google · 2002-08-29 06:12 · Score: 4, Interesting

Hi all. I'm the evil Daniel Brandt who has the gall to criticize your beloved Google. Sorry the site is down. We're being synflooded, apparently by one or more slashdotters, since it started with the slashdot post. It's probably one of those who posted here, saying that if we can't keep our site going, then we don't belong in Google. We have our own router, so we hope to be able to clear things up shortly.

A few points missed in the Salon piece:

When the author of the piece interviewed me, I specifically pointed out that I felt my site did okay in Google, and that I was speaking for the public interest. The so-called "royal we" that Mr. Manjoo, the interviewer and author, refers to sarcastically, is used because I'm speaking for a tax-exempt, nonprofit public charity, Public Information Research, Inc. We do not sell widgets. Some of the comments in Slashdot have me mixed up with another person who is selling ads based on PageRank. But then, who expects Slashdotters to actually read the article?

My main site in Google is www.pir.org and it has a PageRank of 7. The www.namebase.org, with a PR of 6, is a streamlined CGI version of the main site, without all the essays and cartoons. NameBase began in the early 1980s and has been on the Internet since early 1995.

The other problem I have with the author's spin is that a good half of the interview was about Google's cookie. Most of the work I put into www.google-watch.org has to do with the cookie. In the article, the cookie is briefly mentioned, and most of the article is about how selfish and silly I am to think that Google should rank me higher.

My complaint about Google is not that PIR got the short end of the stick from Google, but that Google's stick should be longer.

My essay about PageRank is below.

_____________________

PageRank: Google's Original Sin

by Daniel Brandt

By 1998, the dot-com gold rush was in full swing. Web search engines had been around since 1995, and had been immediately touted by high-tech pundits (and Forbes magazine) as one more element in the magical mix that would make us all rich. Such innovations meant nothing less than the end of the business cycle.

But the truth of the matter, as these same pundits conceded after the crash, was that the false promise of easy riches put bottom-line pressures on companies that should have known better. One of the most successful of the earliest search engines was AltaVista, then owned by Digital Equipment Corporation. By 1998 it began to lose its way. All the pundits were talking "portals," so AltaVista tried to become a portal, and forgot to work on improving their search ranking algorithms.

Even by 1998, it was clear that too many results were being returned by the average search engine for the one or two keywords that were entered by the searcher. AltaVista offered numerous ways to zero in on specific combinations of keywords, but paid much less attention to the "ranking" problem. Ranking, or the ordering of returned results according to some criteria, was where the action should have been. Users don't want to figure out Boolean logic, and they will not be looking at more than the first twenty matches out of the thousands that might be produced by a search engine. What really matters is how useful the first page of results appears on search engine A, as opposed to the results produced by the same terms entered into engine B. AltaVista was too busy trying to be a portal to notice that this was important.

Enter Google

By early 1998, Stanford University grad students Larry Page and Sergey Brin had been playing around with a particular ranking algorithm. They presented a paper titled "The Anatomy of a Large-Scale Hypertextual Web Search Engine" at a World Wide Web conference. With Stanford as the assignee and Larry Page as the inventor, a patent was filed on January 9, 1998. By the time it was finally granted on September 4, 2001 (Patent No. 6,285,999), the algorithm was known as "PageRank," and Google was handling 150 million search queries per day. AltaVista continued to fade; even two changes of ownership didn't make a difference.

Google hyped PageRank, because it was a convenient buzzword that satisfied those who wondered why Google's engine did, in fact, provide better results. Even today, Google is proud of their advantage. The hype approaches the point where bloggers sometimes have to specify what they mean by "PR" -- do they mean PageRank, the algorithm, or do they mean the Public Relations that Google does so well:

"PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves 'important' weigh more heavily and help to make other pages 'important.'"

Google goes on to admit that other variables are also used, in addition to PageRank, in determining the relevance of a page. While the broad outlines of these additional variables are easily discerned by webmasters who study how to improve the ranking of their websites, the actual details of all algorithms are considered trade secrets by Google, Inc. It's in Google's interest to make it as difficult as possible for webmasters to cheat on their rankings.
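The weighted-vote idea in Google's description can be sketched as a short iterative calculation. This is a simplified textbook version and not Google's production algorithm, whose details are trade secrets as noted above. The damping value of 0.85 is the one suggested in Brin and Page's paper, and the four-page web is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated vote-passing.

    links maps each page to the list of pages it links to. Every page
    in this sketch has at least one outgoing link, so pages without
    outlinks are not handled.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page's vote is split evenly among the pages it links to.
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# Invented four-page web: A, B and C all link to "hub"; "hub" links to A.
toy_web = {
    "hub": ["A"],
    "A": ["hub"],
    "B": ["hub"],
    "C": ["hub"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy web, A has a single incoming link, the same number as nobody else has fewer than (B and C have none), yet A ends up far above B and C because its one vote comes from the highest-ranked page.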

It's all in the ranking

Beyond any doubt, search engines have become increasingly important on the web. E-commerce is very attuned to the ranking issue, because higher ranking translates directly into more sales. Various methods have been designed by various engines to monetize the ranking situation, such as paid placement, pay per click, and pay for inclusion. On June 27, 2002, the U.S. Federal Trade Commission issued guidelines that recommended that any ranking results influenced by payment, rather than by impartial and objective relevance criteria, ought to be clearly labeled as such in the interests of consumer protection. It appears, then, that any algorithm such as PageRank, that can reasonably pretend to be objective, will remain an important aspect of web searching for the foreseeable future.

Not only have engines improved their ranking methods, but the web has grown so huge that most surfers use search engines several times a day. All portals have built-in search functions, and most of them have to rely on one of a handful of established search engines to provide results. That's because only a few engines have the capacity to "crawl" or "spider" more than two billion web pages frequently enough to keep their database current. Google is perhaps the only engine that is known for consistent, predictable crawling, and that's only been true for less than two years. It takes almost a week to cover the available web, and another week to calculate PageRank for every page. Google's main update cycle is about 28 days, which is a bit too slow for news-hungry surfers. In August, 2001 they also began a second "mini-crawl" for news sites, which are now checked every day. Results from each crawl are mingled together, giving the searcher an impression of freshness.

For the average webmaster, the mechanics of running a successful site have changed dramatically from 1996 to 2002. This is due almost entirely to the increased importance of search engines. Even though much of the dot-com hype collapsed in 2000 and 2001 (a welcome relief to noncommercial webmasters who remembered the pre-hype days), the fact remains that by now, search engines are the fundamental consideration for almost every aspect of web design and linking. It's close to a wag-the-dog situation. That's why the algorithms that search engines consider to be consistent with the FTC's idea of impartial and objective ranking criteria deserve closer scrutiny.

What objective criteria are available?

Ranking criteria fall into three broad categories. The first is link popularity, which is used by a number of search engines to some extent. Google's PageRank is the original form of "link pop," and remains its purest expression. The next category is on-page characteristics. These include font size, title, headings, anchor text, word frequency, word proximity, file name, directory name, and domain name. The last is content analysis. This generally takes the form of on-the-fly clustering of produced results into two or more categories, which allows the searcher to "drill down" into the data in a more specific manner. Each method has its place. Search engines use some combination of the first two, or they use on-page characteristics alone, or perhaps even all three methods.

Content analysis is very difficult, but also very enticing. When it works, it allows for the sort of graphical visualization of results that can give a search engine an overnight reputation for innovation and excellence. But many times it doesn't work well, because computers are not very good at natural language processing. They cannot understand the nuances within a large stack of prose from disparate sources. Also, most top engines work with dozens of languages, which makes content analysis more difficult, since each language has its own nuances. There are several search engines that have made interesting advances in content analysis and even visualization, but Google is not one of them. The most promising aspect of content analysis is that it can be used in conjunction with link pop, to rank sites within their own areas of specialization. This provides an extra dimension that addresses some of the problems of pure link popularity.

Link popularity, which is "PageRank" to Google, is by far the most significant portion of Google's ranking cocktail. While in some cases the on-page characteristics of one page can trump the superior PageRank of a competing page, it's much more common for a low PageRank to completely bury a page that has perfect on-page relevance by every conceivable measure. To put it another way, it's frequently the case that a page with both search terms in the title, and in a heading, and in numerous internal anchors, will get buried in the rankings because the sponsoring site isn't sufficiently popular, and is unable to pass sufficient PageRank to this otherwise perfectly relevant page. In December 2000, Google came out with a downloadable toolbar attachment that made it possible to see the relative PageRank of any page on the web. Even the dumbed-down resolution of this toolbar, in conjunction with studying the ranking of a page against its competition, allows for considerable insight into the role of PageRank.

Moreover, PageRank drives Google's monthly crawl, such that sites with higher PageRank get crawled earlier, faster, and deeper than sites with low PageRank. For a large site with an average-to-low PageRank, this is a major obstacle. If your pages don't get crawled, they won't get indexed. If they don't get indexed in Google, people won't know about them. If people don't know about them, then there's no point in maintaining a website. Google starts over again on every site for every 28-day cycle, so the missing pages stand an excellent chance of getting missed on the next cycle also. In short, PageRank is the soul and essence of Google, on both the all-important crawl and the all-important rankings. By 2002 Google was universally recognized as the world's most popular search engine.
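One way to picture a PageRank-driven crawl is a priority queue with a fixed page budget. Google's real scheduler is not public, so this is only a sketch under that assumption; the site names, PageRank values, and budget are all invented.

```python
import heapq

def crawl_order(sites, page_budget):
    """Visit sites in descending PageRank until the page budget runs out.

    sites: list of (name, pagerank, page_count) tuples.
    Returns a dict of {name: pages_crawled} for the sites that were reached.
    """
    # heapq is a min-heap, so negate PageRank to pop the highest first.
    heap = [(-rank, name, pages) for name, rank, pages in sites]
    heapq.heapify(heap)
    crawled = {}
    while heap and page_budget > 0:
        _, name, pages = heapq.heappop(heap)
        taken = min(pages, page_budget)
        crawled[name] = taken
        page_budget -= taken
    return crawled

# Invented sites: (name, toolbar PageRank, number of pages).
sites = [
    ("big-portal.example", 8, 5_000),
    ("mid-site.example", 5, 3_000),
    ("deep-database.example", 4, 100_000),
]
result = crawl_order(sites, page_budget=10_000)
print(result)
```

Under these made-up numbers the large, low-ranked database site is reached last and gets only the leftover budget, so most of its pages never enter the index for that cycle.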

How does PageRank measure up?

In the first place, Google's claim that "PageRank relies on the uniquely democratic nature of the web" must be seen for what it is, which is pure hype. In a democracy, every person has one vote. In PageRank, rich people get more votes than poor people, or, in web terms, pages with higher PageRank have their votes weighted more than the votes from lower pages. As Google explains, "Votes cast by pages that are themselves 'important' weigh more heavily and help to make other pages 'important.'" In other words, the rich get richer, and the poor hardly count at all. This is not "uniquely democratic," but rather it's uniquely tyrannical. It's corporate America's dream machine, a search engine where big business can crush the little guy. This alone makes PageRank more closely related to the "pay for placement" schemes frowned on by the Federal Trade Commission, than it is related to those "impartial and objective ranking criteria" that the FTC exempts from labeling.

Secondly, only big guys can have big databases. If your site has an average PageRank, don't even bother making your database available to Google's crawlers, because they most likely won't crawl all of it. This is important for any site that has more than a few thousand pages, and a home page of about five or less on the toolbar's crude scale.

Thirdly, in order for Google to access the links to crawl a deep site of thousands of pages, a hierarchical system of doorway pages is needed so that the crawler can start at the top and work its way down. A single site with thousands of pages typically has all external links coming into the home page, and few or none coming into deep pages. The home page PageRank therefore gets distributed to the deep pages by virtue of the hierarchical internal linking structure. But by the time the crawler gets to the real "meat" at the bottom of the tree, these pages frequently end up with a PageRank of zero. This zero is devastating for the ranking of that page, even assuming that Google's crawler gets to it, and it ends up in the index, and it has excellent on-page characteristics. The bottom line is that only big, popular sites can put their databases on the web and expect Google to cover their data adequately. And that's true even for websites that had their data on the web long before Google started up in 1999.
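The dilution down a doorway hierarchy can be approximated with simple division. The sketch assumes each page splits its score evenly among its links and that deep pages receive no other inbound links; it ignores damping, and the fan-out of 50 is an invented figure.

```python
def share_at_depth(home_score, links_per_page, depth):
    """Raw score reaching one page `depth` clicks below the home page,
    assuming each page splits its score evenly among its links and
    receives no other inbound links."""
    return home_score / (links_per_page ** depth)

home_score = 1.0        # arbitrary unit for the home page
links_per_page = 50     # invented fan-out of each doorway page

for depth in range(4):
    print(depth, share_at_depth(home_score, links_per_page, depth))
```

Each level down divides the score by the fan-out again, so a page three clicks deep sees only a tiny fraction of what the home page holds.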

What about non-database sites?

There are other areas where PageRank has a negative effect, even for sites without a lot of data. The nature of PageRank is so discriminatory that it's rather like the exact opposite of affirmative action. While many see affirmative action as reverse discrimination, no one would claim (apart from economists who advocate more tax cuts for the rich) that the opposite, which would be deliberate discrimination in favor of the already-privileged, is a solution for anything. Yet this is essentially what Google claims.

Those who launch new websites in 2002 have a much more difficult time getting traffic to their sites than they did before Google became dominant. The first step for a new site is to get listed in the Open Directory Project. This is used by Google to seed the crawl every month. But even after a year of trying to coax links to your new site from other established sites, the new webmaster can expect fewer than 30 visitors per day. Sites with a respectable PageRank, on the other hand, get tens of thousands of visitors per day. That's the scale of things on the web -- a scale that is best expressed by the fact that Google's zero-to-ten toolbar is a logarithmic scale, perhaps with a base of six. To go from an old PageRank of four to a new rank of five requires several times more incoming links. This is not easy to achieve. The cure for cancer might already be on the web somewhere, but if it's on a new site, you won't find it.
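A rough sketch of what a logarithmic toolbar would imply, assuming the base really is about six as guessed above. The "raw score" here is a hypothetical stand-in for link popularity in arbitrary units; nothing in it is a published Google formula.

```python
def raw_score_needed(toolbar_value, base=6):
    """Smallest raw score that reaches a given toolbar value: base ** value."""
    return base ** toolbar_value


def toolbar_value(raw_score, base=6, top=10):
    """Largest value v (capped at `top`) with base ** v <= raw_score."""
    value = 0
    while value < top and raw_score >= base ** (value + 1):
        value += 1
    return value


# Going from four to five means multiplying the raw score by the base.
ratio = raw_score_needed(5) // raw_score_needed(4)
print("raw score multiplier from 4 to 5:", ratio)
print("toolbar value for a raw score of 7775:", toolbar_value(7775))
```

With integer arithmetic there is no floating-point rounding at the boundaries, so a raw score one unit short of the next power of six still reports the lower toolbar value.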

PageRank also encourages webmasters to change their linking patterns. On search engine optimization forums, webmasters even discuss charging for little ads with links, according to the PageRank they've achieved for their site. This would benefit those sites with a lower PageRank that pay for such ads. Sometimes these PageRank achievements are the result of link farms or other shady practices, which Google tries to detect and then penalizes with a PageRank of zero. At other times professional optimizers get away with spammy techniques. Mirror sites and duplicate pages on other domains are now forbidden by Google and swiftly punished, even when there are good reasons for maintaining such sites. Overall, linking patterns have changed significantly because of Google. Many webmasters are stingy about giving out links (which can dilute your transference of PageRank to a given site), at the same time that they're desperate for more links from others.

What should Google do?

We feel that PageRank has run its course. Google doesn't have to abandon it entirely, but they should de-emphasize it. The first step is to stop reporting PageRank on the toolbar. This would mute the awareness of PageRank among optimizers and webmasters, and remove some of the bizarre effects that such awareness has engendered. The next step would be to replace all mention of PageRank in their own public relations documentation, in favor of general phrases about how link popularity is one factor among many in their ranking algorithms. And Google should adjust the balance between their various algorithms so that excellent on-page characteristics are not completely cancelled by low link popularity.

PageRank must be streamlined so that the "tyranny of the rich" characteristics are scaled down in favor of a more egalitarian approach to link popularity. This would greatly simplify the complex and recursive calculations that are now required to rank two billion web pages, which must be very expensive for Google. The crawl must not be PageRank driven. There should be a way for Google to arrange the crawl so that if a site cannot be fully covered in one cycle, Google's crawlers can pick up where they left off on the next cycle.
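The "pick up where they left off" idea could look something like the sketch below. It is a hypothetical design, not a description of how Google's crawler works: the class name, the per-cycle `budget`, and the URL list are all made up, and no pages are actually downloaded.

```python
from collections import deque


class ResumableCrawl:
    """Per-site crawl queue that is kept between cycles (hypothetical design)."""

    def __init__(self, urls):
        self.pending = deque(urls)   # pages not yet fetched, in crawl order
        self.done = []               # pages fetched in earlier cycles

    def run_cycle(self, budget):
        """Fetch up to `budget` pages, then stop; leftovers stay queued."""
        fetched = []
        while self.pending and len(fetched) < budget:
            url = self.pending.popleft()
            fetched.append(url)      # a real crawler would download the page here
        self.done.extend(fetched)
        return fetched


site = ResumableCrawl([f"/page{i}" for i in range(10)])
first = site.run_cycle(budget=4)
second = site.run_cycle(budget=4)
third = site.run_cycle(budget=4)
print(first, second, third, sep="\n")
```

The point of the design is that the budget limits how much is fetched per cycle but not which pages are ever reached, so a deep site is covered after enough cycles regardless of its PageRank.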

Google is so important to the web these days that it probably ought to be a public utility. Regulatory interest from agencies such as the FTC is entirely appropriate, but we feel that the FTC addressed only the most blatant abuses among search engines. Google, which only recently began using sponsored links and ad boxes, was not even an object of concern to the Ralph Nader group, Commercial Alert, that complained to the FTC.

This was a mistake, because Commercial Alert failed to look closely enough at PageRank. Some aspects of PageRank, as presently implemented by Google, are nearly as pernicious as pay for placement. There is no question that the FTC should regulate advertising agencies that parade as search engines, in the interests of protecting consumers. Google is still a search engine, but not by much. They can remain a search engine only by fixing PageRank's worst features.

*

[Daniel Brandt is founder and president of Public Information Research, Inc., a tax-exempt public charity that sponsors NameBase. He began compiling NameBase in 1982, from material that he started collecting in 1974, and is now the programmer and webmaster for PIR's several sites. He participates in various forums where webmasters share observations about the often-secretive algorithms, bugs, and behavior of various search engines. Brandt has been watching Google's interaction with NameBase ever since Google, in October, 2000, became the first search engine to go "deep" on PIR's main site by crawling thousands of dynamic pages.]

What about Google's log data? on NYT Discovers the Panopticon · 2002-07-25 02:10 · Score: 2, Insightful

I'm more worried about what you and I cannot find on Google, but which the FBI can.

Google's privacy policy claims that they do not collect identifiable information from the user. However, many users now have static IP numbers. New laws passed by Congress last year give authorities the right to obtain the information in Google's possession, apparently without a showing of probable cause, just as they now have the right to obtain logging information from Internet service providers, and borrowing records from librarians. With the new Patriot Act, the use of the GET instead of the POST method for Google searching makes their case even weaker, as the authorities can claim that the search terms are part of the URL, and that they get logged with the URL in normal httpd logging. Therefore they may fall under the definition of "routing and addressing" information that is subject to "trap and trace device" scrutiny. Judges are required to approve orders for such scrutiny without a showing of probable cause.
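The GET versus POST point is easy to see with an ordinary access log. The sketch below parses a made-up Common Log Format line; the IP address, timestamp, and query are invented for illustration, and the function name is hypothetical. With GET the search terms sit in the request line that the server logs by default, while a POST request line carries only the path.

```python
from urllib.parse import urlsplit, parse_qs


def search_terms_from_log_line(line):
    """Pull the q= parameter out of the request line of a Common Log Format entry."""
    request = line.split('"')[1]                 # e.g. 'GET /search?q=... HTTP/1.0'
    method, target, _protocol = request.split(" ")
    query = parse_qs(urlsplit(target).query)     # decodes '+' and %XX escapes
    return method, query.get("q", [])


# Invented sample lines; 192.0.2.10 is a reserved documentation address.
get_line = ('192.0.2.10 - - [25/Jul/2002:02:10:00 -0700] '
            '"GET /search?q=patriot+act+library+records HTTP/1.0" 200 5120')
post_line = ('192.0.2.10 - - [25/Jul/2002:02:11:00 -0700] '
             '"POST /search HTTP/1.0" 200 5120')

get_result = search_terms_from_log_line(get_line)
post_result = search_terms_from_log_line(post_line)
print(get_result)
print(post_result)
```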

The fact that Google records unique cookie ID, plus IP number, plus date and time, makes much of their information "identifiable." Authorities can also do a "sneak and peek" search of a Google user's hard drive when he isn't home, retrieve a Google cookie ID, and then demand a keyword search history from Google for this ID.

Google has refused to address this issue. They do not respond to inquiries about why they need a cookie that expires in 2038, nor have they responded to recommendations that they institute a log retention policy, in which logs are destroyed after 60 days or so. There is nothing quite so revealing as a history of all the search terms that someone has used in Google searches.

Librarians are worried about the new law, and the American Library Association is recommending retention policies as one of the only means at their disposal to avoid compromising their profession. It's even illegal for a librarian to disclose that the FBI came a-knocking for their records!

Meanwhile, as librarians are struggling with this issue, Google is doing 150 million searches per day, and continues to fly under the radar because their colored logo is so cute.

Google is as guilty as Gator on Web Publishers Sue Gator · 2002-06-27 06:27 · Score: 1

This lawsuit is potentially another nail in Google's cache copy, to the delight of webmasters everywhere.

If Gator offered an opt-out for the publishers suing them, such that if a publisher put a "noarchive" or a "nohijack" meta on every page on their site for the benefit of Gator's software, would this cause the publishers to drop their lawsuit?

That's what Google offers. And no, it wouldn't satisfy the publishers. Copyright protection has to do with opt-in -- which is express, prior permission. There is no way that the failure to opt out is the same as express, prior permission.

Of course, you can argue that a simple robots.txt exclusion can keep Google off of your site. But many webmasters cannot afford to disallow Google altogether, because their referrals from Google are a significant portion of their total traffic (from 30 to 70 percent).
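For reference, this is what the all-or-nothing robots.txt exclusion looks like, checked with Python's standard `urllib.robotparser`. The robots.txt body and the example.org URL are illustrative; "SomeOtherBot" is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Shuts Googlebot out of the entire site and says nothing about anyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

page = "http://www.example.org/index.html"
googlebot_ok = allowed(ROBOTS_TXT, "Googlebot", page)
other_ok = allowed(ROBOTS_TXT, "SomeOtherBot", page)
print("Googlebot allowed:", googlebot_ok)
print("SomeOtherBot allowed:", other_ok)
```

The rule has no middle setting: it removes the cache copy only by removing the site from Google's index as well, which is the trade-off described above. The narrower opt-out is the per-page `noarchive` robots meta tag, which still has to be added by the webmaster.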

Fortunately, it appears that as Google's monopoly increases, the cache copy problem won't increase at the same rate. Recent major portals that have contracted with Google to provide search results (earthlink.net, netscape.com, and aol.co.uk), are not showing the "Cached" link. Big portals recognize the need to keep searchers on their own site, and they have the clout to make this happen.

But most webmasters are smaller than Earthlink, Netscape, and AOL. The Google cache copy puts its own branding at the top of our HTML. To add injury to insult, recently Google's blurb began stating that if you want to bookmark this page, you should use Google's URL instead of the original site. Google adds value to the cache copy by highlighting your search terms. Finally, their servers are so fast that many Google searchers get into the habit of ignoring the original sites altogether. How can the average webmaster compete with this?

This situation robs webmasters of control over their own material. Yet Slashdotters typically love Google, and one thing they love the most is the Google cache copy.

It's a good thing that Slashdotters don't have the final say on such matters.

PageRank is destroying the Web on Modeling Linking on the Web · 2002-04-18 02:38 · Score: 1

This is valuable research. It's important to understand the implications. If you're a little guy and you are looking for a Web-based career, you have basically two choices: 1) become a photographer and remain independent, or 2) work for the big guys.

Anything between these two extremes is a very slippery slope.

You can use this research to argue that the practice of ranking sites by counting incoming links, a practice that began only a few years ago, is fundamentally altering the nature of the Web.

This new Web is one of the reasons for the popularity of the weblogs -- it's the only way for the little guy to participate. It used to be that you felt like you were participating by putting up your own Web page. This is no longer true, because a new Web page from an average person no longer draws traffic. No traffic means no feedback, and no sense of participation.

More research like this will encourage search engines to discover algorithms that do less damage to the Web, one would hope.

PageRank tweaks are a minor problem on Google Relists Operation Clambake · 2002-03-22 03:55 · Score: 2, Interesting

There is one thing that's scarier than Google's willingness to compromise the PageRank system at the first hint of a perceived inconvenience. That's their completely inadequate privacy policy.

It's boiler-plate: they say they'll change it whenever they like, but there's no mention of whether the previous data they've collected would fall under the old or new policy. Add to this the fact that the ownership and control of Google will most likely be shifting over the next few years, if Google goes public. Bill Gates could buy the whole thing with the loose change he carries in his pocket.

Google apparently has no interest in destroying old data, and intends to keep it all as long as possible. It's a potential gold mine as a corporate asset, and a potential disaster in terms of civil liberties and privacy.

Google has no good reason for collecting any of the data they collect; they just do it.

They claim that none of it is "personally identifiable," without mentioning the fact that many IP numbers are static, and even if they aren't, new laws give the feds the power to make it "personally identifiable" without probable cause.

Google's outrageous cookie policy just makes it that much easier to tie it all together, for those who don't erase cookies frequently.

Google sets a cookie that expires in 2038 for anyone who visits any page of theirs and doesn't already have a Google cookie. They use a unique ID number in their cookie, and with this number they also log the Internet address (IP) number, date and time, search terms, and browser information. This is both unnecessary and scary.
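Here is a minimal sketch of why a long-lived unique ID matters more than the IP number alone. The record layout, IDs, addresses, and search terms are all invented for illustration and say nothing about how Google actually stores its logs.

```python
from collections import defaultdict


def histories_by_cookie(records):
    """Group (cookie_id, ip, timestamp, terms) records into per-cookie histories."""
    histories = defaultdict(list)
    for cookie_id, ip, timestamp, terms in records:
        histories[cookie_id].append((timestamp, ip, terms))
    for entries in histories.values():
        entries.sort()               # ISO-style timestamps sort chronologically
    return dict(histories)


# Hypothetical log records: the same cookie shows up from two different IP numbers.
records = [
    ("id-aaa", "192.0.2.10", "2002-03-01T09:00", "symptom checker"),
    ("id-bbb", "198.51.100.7", "2002-03-01T09:05", "widget prices"),
    ("id-aaa", "203.0.113.44", "2002-03-20T22:15", "divorce lawyer"),
]

histories = histories_by_cookie(records)
for timestamp, ip, terms in histories["id-aaa"]:
    print(timestamp, ip, terms)
```

Even when the IP number changes between sessions, the cookie ID ties the searches into one history, and any single record with a static IP is enough to attach a subscriber to the whole list.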

There is nothing more revealing about a person than a history of that person's Google search terms. (Some of us use the Internet for something other than merely selling more and more widgets.)

Since Congress passed the Patriot Act last October, a showing of probable cause is not required for pen register or trap-and-trace information, and judges must grant the order. The definition of this sort of surveillance has been expanded for the Internet, and now includes "other dialing, routing, addressing, and signaling information." Search terms for engines such as Google are part of the URL address. The law's exclusion of "content" for this surveillance -- language that refers to the body of email messages -- is insufficient to exclude Web search terms in the URL. The FBI could set up Carnivore at Google (the feds will be happy to fork over the cost of any needed hardware or software), and we wouldn't even know about it. Similarly, the FBI can present a court order for Google's logs, from a judge who was required to sign without a showing of probable cause.

I was able to get the CIA to instantly withdraw their cookies this week. That's because even the CIA is accountable to the public (on the cookie issue at least) under federal guidelines. But there is no accountability for Google, even though the data they have collected is more revealing than anything the CIA has collected recently, by orders of magnitude.

How long before the feds zero in on Google's data? Why can't Google abandon most cookie use, and destroy logs after 30 days?

If they sit on their data without doing anything about their policies, they may wake up one day and discover that the feds have appropriated the entire thing. Already it may be too late; there's at least one former National Security Agency employee with a top secret clearance who is now a Google software engineer.

-- Daniel Brandt
Public Information Research, Inc.

A better microtrap on All MS Settlement Comments Now Online · 2002-03-02 13:18 · Score: 1

The DOJ alphabetical name list is an HTML file that is nearly 2 megs long, and it doesn't even have anchors to each person's comment. Some browsers will choke on this. Even if they don't, it takes two extra steps to go from a name to a comment.

There's a better list of exactly the same names at:

http://www.pir.org/mslist.html

Each name is anchored directly to that person's comment using a CGI redirect to the specific DOJ file. And this file, full of anchors, is only 1.2 megs.
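For anyone curious, a CGI redirect of this kind can be very small. The sketch below is a guess at the general shape, not the actual script: the lookup table, the example.gov file name, and the anchor are hypothetical placeholders.

```python
from urllib.parse import quote


def cgi_redirect(target_file, anchor):
    """Build CGI response headers that redirect to one anchor in one file."""
    location = f"{target_file}#{quote(anchor)}"
    return f"Status: 302 Found\r\nLocation: {location}\r\n\r\n"


# Hypothetical lookup table: commenter name -> (file holding the comment, anchor).
COMMENT_FILES = {
    "Example, Jane": ("http://www.example.gov/comments/part-07.htm", "jane-example"),
}

target_file, anchor = COMMENT_FILES["Example, Jane"]
response = cgi_redirect(target_file, anchor)
print(response, end="")
```

The index page then only needs one short link per name, which keeps it small while the heavy comment files stay on the original server.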

I hope the current judge on the case is more competent than DOJ's webmasters.

Wisdom from the Lab's founder on MIT Media Lab Tightens Its Belt · 2002-01-23 17:09 · Score: 1

"If your refrigerator notices that you are out of milk, it can 'ask' your car to remind you to pick some up on your way home. Appliances today have all too little computing. A toaster should not be able to burn toast. It should be able to talk to other appliances. It would really be quite simple to brand your toast in the morning with the closing price of your favorite stock. But first, the toaster needs to be connected to the news."
...
"The notion of an instruction manual is obsolete. The fact that computer hardware and software manufacturers ship them with product is nothing short of perverse."

The above are from Nicholas Negroponte in his book, "Being Digital" (New York: Vintage Books, 1996), pp. 213, 215.

My two cents:

Apparently no one at the Media Lab has ever been forced to use Windows. The Media Lab has been operating in the stratosphere for nearly 17 years now, completely oblivious to the socio-economic-political infrastructure, and to the everyday lives of billions of people. They should just sell all their assets and donate them to the poor.

Ban personal spiders, please on Bandwidth Demand at American Universities · 2002-01-13 06:02 · Score: 1

Universities should ban personal spiders. Those dudes in their dorm rooms with their caps on backwards have no conception of what it's like for a nonprofit site with tens of thousands of pages of free, noncommercial content.

They sic their personal spider on our site instead of using our site search engine, and download thousands of cross-links. We can get hit as often as 15 times per second from a single surfer, and sometimes end up blocking the entire .edu domain because we're so ticked off.

Never had a customer from a university anyway who is interested in paying for serious content. Bunch of freeloaders, they are....
