Slashdot Mirror


Google Crawls The Deep Web

mikkl666 writes "In their official blog, Google announces that they are experimenting with technologies to index the Deep Web, i.e. the sites hidden behind forms, in order to be 'the gateway to large volumes of data beyond the normal scope of search engines'. For that purpose, the engine tries to automatically get past the forms: 'For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML'. Nevertheless, directives like 'nofollow' and 'noindex' are still respected, so sites can still be excluded from this type of search."
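As a rough illustration of the select-menu case described in the summary, here is a minimal sketch, not Google's actual crawler. It assumes a page with a single form, only looks at `<select>` menus, and only builds URLs for GET forms. The names `FormParser` and `candidate_urls` are made up for this example.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode


class FormParser(HTMLParser):
    """Collect the action, method, and <select> option values of a form."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.selects = {}      # field name -> list of option values
        self._current = None   # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "select" and attrs.get("name"):
            self._current = attrs["name"]
            self.selects[self._current] = []
        elif tag == "option" and self._current and attrs.get("value"):
            self.selects[self._current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


def candidate_urls(html):
    """Return one URL per combination of select values, for GET forms only."""
    parser = FormParser()
    parser.feed(html)
    if parser.method != "get" or not parser.selects:
        return []  # assumption: POST forms are never submitted automatically
    names = list(parser.selects)
    combos = product(*(parser.selects[name] for name in names))
    return [parser.action + "?" + urlencode(dict(zip(names, combo)))
            for combo in combos]
```

The number of URLs grows with the product of the option counts, so a real crawler would presumably have to sample rather than enumerate every combination.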

6 of 197 comments

  1. good and bad by ILuvRamen · · Score: 3, Insightful

    Well, first of all, it's about time they learned how to read advanced sites! If your site depends on input from the user to display content, you're basically invisible to Google. Now all they need is something to read text in flash files and they've got something going. But on the other hand, this is almost auto-fuzzing, which could be considered hacking, and I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.

    --
    Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
    1. Re:good and bad by QuoteMstr · · Score: 5, Insightful

      And should we not make any progress because we might step on a few toes while doing it? If Google can get into your uber-secret-private-database, so can a random user, or a random Russian cracker. Fix your damn site if you're worried about this particular attack.

    2. Re:good and bad by Bogtha · · Score: 4, Insightful

      Now all they need is something to read text in flash files and they've got something going.

      They've indexed Flash for about four years now.

      I bet they'll often get results they didn't intend to and expose data that's supposed to be protected and private.

      No doubt. There are a lot of clueless developers out there who insist on ignoring security and specifications time and time again. I have no sympathy for people bitten by this; you'd think they'd have learnt from GWA (Google Web Accelerator) that GET is not off-limits to automated software.

      --
      Bogtha Bogtha Bogtha
  2. Re:Google, consider this... by poot_rootbeer · · Score: 3, Insightful

    Do you realize the amount of wasted time the operators of some websites will spend, processing the trash data that doing this will create? I speak mainly of feedback forms, e-mail signups, and the like.

    If your site uses GET for a non-idempotent action like sending a feedback form or signing up for an email newsletter, you're doing it Wrong.
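A minimal sketch of the rule being argued here, assuming a hypothetical feedback endpoint (the function `handle_feedback`, the `/feedback` paths, and the `inbox` list are all made up for illustration). A GET only renders the form and changes nothing, so a crawler that follows it creates no trash data. Only a POST stores a message.

```python
SAFE_METHODS = {"GET", "HEAD"}


def handle_feedback(method, form_data, inbox):
    """Hypothetical feedback endpoint. Returns a (status, body) pair."""
    if method in SAFE_METHODS:
        # Safe request: show the empty form and leave server state untouched,
        # even if the query string happens to carry a message.
        return 200, "<form method='post' action='/feedback'>...</form>"
    if method == "POST":
        # The only branch that changes anything on the server.
        inbox.append(form_data.get("message", ""))
        return 303, "/feedback/thanks"  # redirect target after storing
    return 405, "Method Not Allowed"
```

With this split, a spider that requests `/feedback?message=whatever` over GET gets the form back and the inbox stays empty.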

  3. Re:Oops... by orkysoft · · Score: 5, Insightful

    Unfortunately, there are tons of sites whose developers did not understand the part about GET being for looking up stuff, and POST being for making changes on the server.

    --
    I suffer from attention surplus disorder.
  4. Re:sites can still be excluded by danielsfca2 · · Score: 3, Insightful

    I never understood the point of robots.txt crap. Why put the site up if you don't want people to find it?

    Well, I'm glad you asked. The presence (and continued following) of the robots.txt standard is crucial for these reasons:

    - Scripts with potentially infinite results. If you have a calendar script on your site that shows this month's calendar with links to "next month" and "previous month", then without robots.txt, the search engine could index back into prehistoric times and past the death of the Sun, with blank event calendars for each month. This is stupid. With your robots.txt file you tell the spider which URLs it's in BOTH of your best interests not to crawl. You save server resources and bandwidth; Google saves its time and resources.

    - If you have a duplicate copy or copies of your site for development, or perhaps an experimental "beta" version of your site, you don't want it competing with the real site for search engine placement, or worse, causing SE spiders to think you're a filthy spammer with duplicate content all over the place. So you disallow the dupes with robots.txt. Now sure, ideally that server could be inside your firewall instead of on the Web, but it gets more challenging when your dev team is on a different continent.

    - Temporary crap that has no value to the outside world. Once again, it's a waste of both your time and the search engine's to index it.

    The above are all reasons why you might want some or all of the content on a site not indexed.
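The three cases above could be covered by a robots.txt along these lines. This is only a sketch: the paths `/calendar/`, `/beta/`, and `/tmp/`, the `ExampleBot` agent name, and the `allowed` helper are assumptions for illustration. The check uses Python's standard `urllib.robotparser` to show how a well-behaved spider would read the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt covering the three cases: an endless calendar
# script, a duplicate beta copy of the site, and temporary files.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /beta/
Disallow: /tmp/
"""


def allowed(url, agent="ExampleBot"):
    """Return True if the rules above let the given agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)
```

For example, `allowed("http://example.com/calendar/2999/01")` is rejected while `allowed("http://example.com/about.html")` is permitted. Note that robots.txt only asks crawlers to stay away; it is not access control.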