Slashdot Mirror


El Reg Says Google Choking on Spam Sites

Grubby Games writes "The Register is reporting that Google is full, and in trouble." From the article: "Recently, we featured a software tool that can create 100 Blogger weblogs in 24 minutes, called Blog Mass Installer. A subterranean industry of sites providing 'private label articles,' or PLAs exists to flesh out 'content' for these freshly minted sites. And as a result, legitimate sites are often caught in the cross fire. But the new algorithms may not be solely to blame. Google's chief executive Eric Schmidt has hinted at another reason for the recent chaos. In Google's earnings conference call last month, Schmidt was frank about the extent of the problem. 'Those machines are full,' he said. 'We have a huge machine crisis.'" James Robertson points out that's a fairly selective bit of quoting.

44 of 234 comments

  1. Everyone - Attention by NotQuiteReal · · Score: 5, Funny
    Please start deleting items from the Internet. It is getting full.

    Thanks!

    --
    This issue is a bit more complicated than you think.
    1. Re:Everyone - Attention by endrue · · Score: 5, Funny

      eh... just defrag

      - Andrew

      --
      I meta-moderate because I care.
    2. Re:Everyone - Attention by cashman73 · · Score: 3, Insightful

      Can we start by getting rid of MySpace? That seems to be a huge waste of space and bandwidth, and with 60 million subscribers, would definitely cut down on the internet's bloat.

    3. Re:Everyone - Attention by ZachPruckowski · · Score: 5, Funny

      SFD (Sites for Deletion):
      myspace.com

      Problem solved.

    4. Re:Everyone - Attention by Anonymous Coward · · Score: 2, Funny

      no, just reduce the font. That'll do it.

    5. Re:Everyone - Attention by SydBarrett · · Score: 4, Funny

      ATTN: HELPDESK

      Could someone do a quick backup first? There might be something on the internet that I might need later. I think you can just use Ghost or whatever you IT guys do. Also, please burn it to CD and have it on my desk by COB today.

      -Executive Chief Officer SydBarrett

    6. Re:Everyone - Attention by jb.hl.com · · Score: 2, Informative

      No. Not solved. MySpace is nowhere near a problem. I've yet to see MySpace used as a link farm, anywhere, hell I've never even seen a MySpace page in a Google result (except results for "MySpace" obviously). I'd probably count Blogger as the thing that should be deleted, as well as poorly configured WordPress installations which allow anonymous commenting.

      Maybe you should let your own little personal prejudices slide a bit. MySpace isn't the great Internet evil, you know.

      --
      By summer it was all gone...now shesmovedon.
    7. Re:Everyone - Attention by feepness · · Score: 4, Insightful

      Hmmm, it appears requiring a sense of humor for access to the internets might cut down on 'indignant post' volume as well.

    8. Re:Everyone - Attention by Xichekolas · · Score: 3, Funny

      I hate it when I have to buy a new Internet because mine is full. Hopefully I can get one of the new perpendicular Internets that holds more... wonder if I will need to turn my monitor sideways to use it...

      --

      Self-referential Sigs are cool on /. these days...

      54

    9. Re:Everyone - Attention by Anonymous Coward · · Score: 3, Funny

      Myspace is emo.
      Emo sucks.
      If you use Myspace you are emo.
      If you use Myspace you suck.

    10. Re:Everyone - Attention by ozmanjusri · · Score: 2, Funny
      Internet Cleaning

      DO NOT CONNECT TO THE INTERNET FROM MAY 7 23:59 pm (GMT) UNTIL 12:01am (GMT) MAY 8.

      *** Attention ***

      It's that time again! As many of you know, each year the Internet must be shut down for 24 hours in order to allow us to clean it. The cleaning process, which eliminates dead email and inactive ftp, www and gopher sites, allows for a better-working and faster Internet.

      This year, the cleaning process will take place from 23:59 pm (GMT) on March 31st until 00:01 am (GMT) on April 2nd. During that 24-hour period, five powerful Internet-crawling robots situated around the world will search the Internet and delete any data that they find.

      In order to protect your valuable data from deletion we ask that you do the following:

      * 1. Disconnect all terminals and local area networks from their Internet connections.
      * 2. Shut down all Internet servers, or disconnect them from the Internet.
      * 3. Disconnect all disks and hard drives from any connections to the Internet.
      * 4. Refrain from connecting any computer to the Internet in any way.
      We understand the inconvenience that this may cause some Internet users, and we apologize. However, we are certain that any inconveniences will be more than made up for by the increased speed and efficiency of the Internet, once it has been cleared of electronic flotsam and jetsam.

      We thank you for your cooperation.

      Interconnected Network Maintenance Staff Main Branch, Massachusetts Institute of Technology

      Sysops and others: Since the last Internet cleaning, the number of Internet users has grown dramatically. Please assist us in alerting the public of the upcoming Internet cleaning by posting this message where your users will be able to read it.

      Please pass this message on to other sysops and Internet users as well.

      --
      "I've got more toys than Teruhisa Kitahara."
  2. How accurate is the Register Article? by xmas2003 · · Score: 5, Informative
    James Robertson suggests that Orlowski mis-reports it again and says that the Register report is a "fairly nasty bit of selective quoting" and was referenced in the DIGG commentary that Google's not full.

    With hardware (and bandwidth) getting cheaper, I find it hard to believe that Google has actually run out of space. But certainly the explosion in the number of web pages is an issue, especially with auto-generated pages. One current example is the V7ndotcom Elursrebmem SEO contest (white-hat celiac charity site I'm supporting) - that nonsense phrase returned zero results on January 15th, 2006 ... but now returns almost 5,000,000 ... of which I gotta believe the vast majority were NOT typed in by humans.

    So maybe it's more that the techniques/algorithms used to spider and index are struggling with the bazillions of web pages out there. Or it could just be disgruntled webmasters PO'ed that their web site isn't listed!

    --
    Hulk SMASH Celiac Disease
    1. Re:How accurate is the Register Article? by david.given · · Score: 4, Informative
      Andrew Orlowski seems to have this weird grudge against Google --- he's been posting reams of violently anti-Google stories for, well, years now. It's reached the stage where if the subject line has 'Google' in it, and Orlowski's byline is attached, I just skip over; even if there's actual information there, it's going to be so wrapped up in snide misreporting as to be useless.

      Be warned.

    2. Re:How accurate is the Register Article? by Richard_at_work · · Score: 4, Informative

      The Register is one of the most biased, spinning tech news sites I've ever read, and I first started reading it 6 years ago; it's only got worse since then. I actually refuse to browse the site these days, only reading their articles when directly linked, and pretty much all of them have some really evil spin on them.

    3. Re:How accurate is the Register Article? by Anonymous Coward · · Score: 3, Informative

      Orlowski? Isn't he the guy that also hates Wikipedia, with his sneering remarks about wiki-fiddlers and barely restraining himself from referring to them as Wikipedophiles?

      I don't know what his problem is, perhaps he just needs pageviews for the advertisers. So: write knocking article about popular website, fans of the website look, pageviews escalate.

      Google -- check.
      Wikipedia -- Check
      Slashdot -- ?

      (The captcha word for this submission was "referral". How do they do that?)

  3. Google is Full!? by aftk2 · · Score: 4, Funny

    Wow...so there really is an end to the internet.

    --
    concrete5: a cms made for marketing, but strong enough for geeks.
  4. Spammer jokes by +InvaderSkoodge · · Score: 3, Funny

    I just realized that many of the jokes we apply to lawyers could also be used on spammers with good effect:

    So what do you have when you push 50% of all the spammers in the world into a hole and bury them? A good start.

    Did you know that if you took all the spammers in the world and lined them up end to end around the equator of the earth that two thirds of them would drown?

    1. Re:Spammer jokes by cashman73 · · Score: 5, Funny

      A stingy old spammer who had been diagnosed with a terminal illness was determined to prove wrong the saying, "You can't take it with you." After much thought and consideration, the old spammer finally figured out how to take at least some of his money with him when he died. He instructed his wife to go to the bank and withdraw enough money to fill two pillow cases. He then directed her to take the bags of money to the attic and leave them directly above his bed. His plan: When he passed away, he would reach out and grab the bags on his way to heaven. Several weeks after the funeral, the deceased spammer's wife, up in the attic cleaning, came upon the two forgotten pillow cases stuffed with cash. "Oh, that darned old fool," she exclaimed. "I knew he should have had me put the money in the basement."

  5. more internet space by Anonymous Coward · · Score: 5, Funny

    I'm not a computer person, but couldn't Google just upgrade to a bigger disk drive?

    I saw one at bestbuy.com that looks pretty good.

  6. Adsense is to blame by wackysootroom · · Score: 5, Insightful

    In creating adsense, google opened the floodgates for spammers who do not want to create good content. In fact, there are even people who copy tons of content from wikipedia and throw up adsense on the top and sides of the pages.

    There are people who are literally making $10,000 or more per month just putting up junk content sites that are auto generated for the purpose of creating adsense revenue.

    Don't get me wrong, I think adsense is a good thing, but Google's allowance of spam sites is giving adsense a bad name.

    1. Re:Adsense is to blame by merreborn · · Score: 3, Interesting

      ...It's not like google invented internet advertising.

      Banner ads were taking the same path. If anything, we should thank google for making internet advertising less intrusive.

    2. Re:Adsense is to blame by Snowmit · · Score: 5, Funny

      Please tell me how I too can make $10,000 or more per month just by putting up junk content from the comfort of my home. Is there a program that I have to order to learn to do this? Should I act now?

      --
      I have a lot of opinions about Cyborgs and Architects
    3. Re:Adsense is to blame by truthsearch · · Score: 2, Insightful

      What's interesting is that Google is pretty good at blocking these spam sites from the index, like the wikipedia copies. But since Yahoo and MSN are terrible at blocking them these spammers are making Google money without flooding Google's own index.

      I believe this is all an unintentional consequence of AdSense. I'm sure the people at Google knew some of this would happen, but probably not to this extent.

  7. The Reg Might Be On To Something by cfoster611 · · Score: 3, Informative

    I glanced at the Google results for some of my own sites, and the Reg is correct: Google's index is completely out of date, even for a super small-time guy like me.

    I know the GoogleBot indexes the site almost every day. Yet, while one of my sites is completely out of date (the Cache is from 2005), another is almost completely up to date.

    Google's got problems.

    --
    --- Kicking the Cheat since late 2002
  8. How Google crawls a site by jamie · · Score: 5, Interesting

    Meanwhile, for no good reason, here's some gorgeous stats porn on how Google (and Yahoo and MSN) crawled a sample website. The animations and closeups of the trees are very cool.

  9. Re:Finally, an explanation by NewWorldDan · · Score: 4, Interesting

    Over the past 6 months or so, I've been finding a lot of link farms in my search results. Oh, irony of ironies, SEOs are making search results worthless.

  10. Before you get too scared.... by Lxy · · Score: 2, Funny

    Just remember that /dev/null filled up years ago. Yet, we seem to be doing just fine.

    --

    There is no reasonable defense against an idiot with an agenda
    :wq
  11. I've heard of the user being ignorant... by TheNoxx · · Score: 2, Interesting

    You know, writing code and assuming that an end user somewhere will do the dumbest thing imaginable, but I guess nobody ever imagined the possible effects of collusion between extreme stupidity and cleverness (spammers). I know I never would have thought that someone would go to such lengths and spend so much time to barely scrape out a living while pissing off countless hordes of people. How do you go about creating enough international legislation and cooperation to catch these guys without crippling the internet with regulation? Are third world countries even capable of compliance? All I can think of is that we need something on the level of the UN where tech-heavy countries are given jurisdiction over other nations that don't have the resources needed to police these kinds of things in exchange for a fee, or maybe a guarantee that said nation will dedicate x amount of troops to any areas needing occupation to stop civil war or genocide or something. Am I over-reacting here? I just can't help but think that dealing with this problem without any legal consequence for the spammers is just encouraging and allowing them to come up with ways around whatever solution is currently in place.

    Eh, or I could be completely off my rocker, and just not competent enough to see a simple and effective method of combating these guys.

    --
    Ex nihilo nihil fit.
  12. Fud Light by Loconut1389 · · Score: 2, Interesting

    I do hate it when searching for something about 4-10 pages in a row are purely sites that pretend to have what you're looking for but are merely meta dumps with adwords or other advertising mechanisms on them. Some of them even have valid cached pages. That said, this article, while certainly Fud, is only Fud Light. I personally prefer Fud Dark- at least I can generally laugh at the article's absurdity. This one was more or less just plain retarded.

  13. Google Indexing by k4_pacific · · Score: 4, Funny

    Some of you might recall that for a long time the Google index stood at around 4 billion pages. It turned out that this was because of the limited number of unique 32 bit index values. To handle this, Google created two index values to reference each page. One is called the "Selector", and the other is called the "Offset". Simply put, the selector is left shifted by 4 bits and added to the offset so that Google can find any page on the internet simply by knowing its selector and offset. According to the article, Google has exhausted these values as well, and will introduce something called "protected mode page rank" where the selector is shifted farther to create a greater range of values.

    --
    Unknown host pong.
    1. Re:Google Indexing by don.g · · Score: 2, Funny

      You mean segment, not selector (in your real/v86 mode analogue). Selectors only came in with protected mode. Personally, that small incorrect detail entirely ruined the joke for me.

      --
      Pretend that something especially witty is here. Thanks.
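For anyone who missed the reference: the scheme in the joke is the 8086 real-mode address calculation, where a 16-bit segment is shifted left by 4 bits and added to a 16-bit offset. A minimal sketch of that arithmetic follows; the function name `real_mode_address` is made up for illustration and has nothing to do with anything Google actually runs.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Return the linear address for a 16-bit segment:offset pair.

    Illustrative helper (hypothetical name): this is the 8086 real-mode
    rule the joke borrows, i.e. segment shifted left by 4 bits plus offset.
    """
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    return (segment << 4) + offset


# Different segment:offset pairs can name the same linear address.
a = real_mode_address(0x1234, 0x0005)
b = real_mode_address(0x1000, 0x2345)

# The largest pair reaches just past the 1 MiB mark.
highest = real_mode_address(0xFFFF, 0xFFFF)

print(hex(a), hex(b), hex(highest))
```

The shift only buys about 1 MiB of address space from two 16-bit values, which is why the "fix" in the joke is to shift farther.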
  14. Right by chazzf · · Score: 2

    So says Andrew Orlowski. Remind me why we take him seriously?

    --
    No statement is true, not even this one.
  15. There is an obvious solution by shoma-san · · Score: 3, Funny

    Do what I do when the toilet bowl is full of crap - FLUSH.

  16. Re:Google is full. Try this... by j_snare · · Score: 4, Informative

    Try this...

    Go to yahoo and search for "slashdot poneys". This will bring up a bunch of results, all approximately 1 month old.

    Now do the same search on google. Notice how many of the results from yahoo do not appear in the google results at all.

    Google has such a big backlog that they don't get around to spidering new sites for several months. While google does give priority to certain high-profile sites like slashdot and visits those frequently, most other sites do not get indexed for several months.


    Okay, so I tried this, just for kicks. You can verify, by a single click:
    Yahoo: http://search.yahoo.com/search?p=slashdot+ponies
    Google: http://www.google.com/search?hl=en&q=slashdot+ponies

    Since when does 44900 results on Yahoo mean that they have more than 92100 results on Google? As far as what's appearing, I was able to find most every one I saw on Yahoo on the first 2 or so pages of Google's results. I also see more results on Google that look like they'll show me more of what I'm looking for (since I am probably looking for the April 1st joke, screenshots especially).

    Works alright for me. Looks like I don't have a reason to switch again yet.

  17. One idea? by 955301 · · Score: 4, Insightful


    Well given that a human would have a hard time deciding if the page was autogen'ed if the text was in their second language, this *is* quite an issue.

    So it sounds like Google needs to *shudder* have a user feedback system where humans with logins add moderation metadata to the search results and in return get results based on this moderation en masse.

    I know what you're thinking, /. has it and it sux, but does it really? I'm always pretty confident that the goatse and gnaa and all that other crap will never make it to a score of 5 when I'm on it. Maybe that's what Google needs to throw the weight back in their court - human intervention on a colossal scale.

    It would withstand abuse since a massive amount of human-inputted data would keep spambots from trying to exploit the moderation system. What's more, their toolbar could incorporate the control to flag a page as autogen'ed garbage.

    --
    You are checking your backups, aren't you?
    1. Re:One idea? by IamTheRealMike · · Score: 2, Interesting
      How does a moderator prove they are in fact a legit human and not a bot?

      I foresee a time when to access large parts of the net you will be required to use some central "proof of life" system. The current mish-mash of captchas isn't working. We have custom English captchas on a forum I admin and it doesn't seem to stop the bots: presumably when they get stuck they call for help.

      It's hard to believe a third of Google's index is auto-generated crap, but then I couldn't really believe the "50% of net traffic is spam or viruses" claim either and I'm pretty sure that one turned out to be true. It appears that an unregulated commons will always degenerate into a wasteland without some form of governance and law enforcement; perhaps rather than an arms race the only solution is for the internet to grow its own legal system and police force (how that'd work is left as an exercise to the imagination)

    2. Re:One idea? by humble.fool · · Score: 2, Informative

      Hey, looks like they are:

      http://googleblog.blogspot.com/2006/04/this-is-test-this-is-only-test.html The Googleblog shows that they have a cookie-based "block this site from results" feature in general beta test to random people on the site.

      --
      Being anonymous is not cowardice.
  18. If google and the spammers have an arms race... by s-gen · · Score: 5, Interesting

    ...then eventually the spam sites will actually contain the information you were looking for.

  19. SQL Solution by Hoi+Polloi · · Score: 2, Funny

    Delete from internet.world
    where lower(page_text) like '% beastiality%'
    or lower(page_text) like '% lose weight%'
    or lower(page_text) like '% refinance%'
    or lower(page_text) like '% ebay%'
    or lower(page_text) like '% make money fast%'
    or lower(page_text) like '% enlarge your%'
    or lower(page_text) like '% teens%';
    commit;

    --
    It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
  20. Re:Are you sure? by Ctrl-Z · · Score: 2, Funny

    Well, would you look at that! Together, you both found both ends of the Internet!

    --
    www.timcoleman.com is a total waste of your time. Never go there.
  21. Careful... by Skadet · · Score: 3, Informative

    3. DDoS the spammers and linkfarmers. Yes, it's illegal. Yes, I don't give a fuck. No, not the sender. It's more likely than not a hijacked PC. DDoS the linked page. Blow the one who decided that spam is the way to advertize his service off the net. Don't worry, you won't start a war. That's already running. Needn't do it right away, but I'd reserve that as an option if the rest fails.

    Careful, that linked page is 99.9% likely to be a legitimate user's hacked hosting account. What's faaaaaar more effective is a phone call (or even an email!) to the hosting company. When I worked support for a hosting company and I got a call about this, it'd take me all of 2 seconds to suspend the account.

    DDoSing the linked page is:
    1. no skin off of the spammer's nose
    2. a pain in the ass to the hosting company
    3. far more time-consuming and less effective than a quick phone call.

    We're smarter than those spammers, let's act like it.

  22. Re:Google is full. Try this... by LunaticTippy · · Score: 4, Funny
    You did it wrong. Try searching for "slashdot poneys" just like the OP misspelt.

    44 on yahoo, 229 on google.

    Wait, what was I saying?

    --
    Man, you really need that seminar!
  23. Re:Google is full. Try this... by pembo13 · · Score: 2, Insightful

    I guess the OP didn't expect you to actually try it out.

    --
    "Thanks for all the money you paid to us. We've used it to buy off ISO among other things" -Microsoft
  24. Re:Google is full. Try this... by Anonymous Coward · · Score: 2, Interesting

    Interesting though that they index fairly different things.

    Top 10 results for "slashdot poneys" on yahoo:

    1. slashdot.cuteness.org (not on google)
    2. jfaughnan.blogspot.com (#1 on google)
    3. jfaughnan.blogspot.com (#1 on google)
    4. index.cristal-trace.com (not on google, outdated link)
    5. mfrost.typepad.com (#22 on google)
    6. pcdq.blogspot.com (not on google)
    7. www.ninme.com (#15 on google)
    8. www.firstworld.biz (not on google, spam)
    9. musicindustry.firsindustry.com (not on google, spam)
    10. girls-having-sex-with-horses.danielblog.info (not on google, spam)

    Top 10 on google:

    1. jfaughnan.blogspot.com (#2 on yahoo)
    2. slashdot.org (not on yahoo)
    3. slashdot.org (not on yahoo)
    4. linux.slashdot.org (#27 on yahoo)
    5. linux.slashdot.org (#27 on yahoo)
    6. mitternachts-lied.net (#22 on yahoo)
    7. interviews.slashdot.org (not on yahoo)
    8. linuxfr.org (#19 on yahoo)
    9. www.releton.com (not on yahoo)
    10. www.japancar.fr (not on yahoo)

    Both yahoo and google are missing pages from their indexes. Some appear on one but not the other. Yahoo was slightly worse at indexing spam sites. (Is www.releton.com spam?)

    I'd say both are 'full' in the sense that neither seems to have enough capacity to index everything.
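The comparison above can be tallied mechanically. Here is a rough sketch, assuming the ranks are exactly as annotated in the two lists (`None` stands for "not on the other engine"); the variable and function names are hypothetical, and this only counts the ten results per engine quoted above, so it says nothing about either index as a whole.

```python
from typing import Dict, Optional, Sequence

# Rank on the *other* engine for each top-10 result, copied from the
# annotations in the two lists above. None = not found on the other engine.
yahoo_top10_rank_on_google = [None, 1, 1, None, 22, None, 15, None, None, None]
google_top10_rank_on_yahoo = [2, None, None, 27, 27, 22, None, 19, None, None]


def summarize(ranks: Sequence[Optional[int]]) -> Dict[str, int]:
    """Count how many results the other engine is missing, has at all,
    and also places in its own top 10."""
    missing = sum(1 for r in ranks if r is None)
    also_top10 = sum(1 for r in ranks if r is not None and r <= 10)
    return {
        "missing": missing,
        "found": len(ranks) - missing,
        "also_top10": also_top10,
    }


print("Yahoo top 10 as seen by Google:", summarize(yahoo_top10_rank_on_google))
print("Google top 10 as seen by Yahoo:", summarize(google_top10_rank_on_yahoo))
```

On these numbers each engine is missing at least half of the other's top 10, which fits the "both are missing pages" reading, though three of the Yahoo-only results were marked as spam.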