Slashdot Mirror


Searching the 'Deep Web'

abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

22 of 193 comments

  1. Deep Web? by Traicovn · · Score: 5, Insightful

    I bet you this new 'Deep Web' search technology would be something that does not observe robots.txt...

    --

    [Something witty and intelligent should have appeared here.]
    {Traicovn}
    1. Re:Deep Web? by Anonymous Coward · · Score: 3, Insightful

      Good. If you leave things publicly accessible on an open web server, that's your own damned fault. Let the engines crawl where they please.

    2. Re:Deep Web? by AndroidCat · · Score: 2, Insightful
      # go away. No, really - this means you!
      User-agent: *
      Disallow: /

      And if they don't listen, feed them a huge maze of generated links that eventually lead to goatse or something. Or just block their crawler at the router and they can search their intranet.

      --
      One line blog. I hear that they're called Twitters now.
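
A minimal sketch of how a well-behaved crawler would read the three-line robots.txt quoted above, using Python's standard `urllib.robotparser`. The bot name `ExampleBot` and the `example.com` URLs are made up for illustration, and nothing here stops a crawler that simply chooses not to check.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the comment above, verbatim.
ROBOTS_TXT = """\
# go away. No, really - this means you!
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and example.com are hypothetical names.
    for url in ("http://example.com/", "http://example.com/data/raw.log"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Compliance is voluntary: the file only tells a crawler what the site owner wants, which is why the comment falls back on link mazes and router blocks.
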
    3. Re:Deep Web? by Anonymous Coward · · Score: 2, Insightful

      Well, I know that we use robots.txt to cover some directories that are publicly accessible and hold data we want people to be able to get at, yet that data is pretty useless unless you reach it from our link. We do signal processing, and our data tables and raw log files would be completely useless as search results and can really skew a web search.

  2. But if you bypass the front pages... by oneiros27 · · Score: 3, Insightful

    Of course, it's nice to know that the content's there, but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?

    I couldn't care less about Ticketmaster whining about deep linking, but there's probably some stuff out there that may cause other problems if it's taken out of the context of its intended point of entry.

    I'm afraid that this is going to give people more reason to go back to using frames, and 'detecting' if their content has been hijacked, and writing more bad code that causes multiple windows to pop up all over the place, and/or crash browsers.

    --
    Build it, and they will come^Hplain.
    1. Re:But if you bypass the front pages... by CAIMLAS · · Score: 4, Insightful

      but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?

      Hello, 1996 is calling; they want their paranoia back!

      Goodness, you aren't serious, are you? Have you used a search engine in the last couple years? Have you not ever looked for porn yourself? Just hop over to images.google.com and enter the name of a porn star - bam, shitloads of smut. Not only that, but search google.com for a porn star's name (many of which you could easily find by searching for 'famous porn stars', I'm sure) and you'll find gallery after gallery of porn, open and free.

      There is no such thing as protecting your kids from porn on the internet anymore. If you don't want to have them looking at porn, don't let them online or police their actions.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
  3. From the article by sczimme · · Score: 4, Insightful


    Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.

    There is a reason for this: a Google search should turn up pointers to the items in the so-called "deep web" (*gag*). To use one of the examples above: if I am looking for information on patents, the search terms I use should point me to the US Patent and Trademark Office. It shouldn't have to point me to all 12 bajillion patent filings.

    Besides, what makes anyone think this is going to fly after all the hubbub over "deep-linking"?

    --
    I want to drag this out as long as possible. Bring me my protractor.
  4. Re:robots.txt should be ignored anyway by Anonymous Coward · · Score: 1, Insightful

    Oh okay, so your testing directory should be indexed? The best place to test is to actually have your files on a server. The easiest way to do this is to just put it in a "test" directory or something on your server. A simple line in your robots.txt file and that test directory does not get indexed.

    It would be a pain in the ass to have a test directory require a login and password all the time (if you don't want people to look at it BUT robots.txt doesn't work anyway).

  5. Google by nycsubway · · Score: 2, Insightful

    Generally, google finds the pages that the authors want to be searched. That's why you submit your site to google. Even if you don't submit your site to google, if it's on a domain that google searches and there is a link to it, it'll be found.

    With google storing more than 4 billion web pages, I'd hate to see what kind of crap the other 99% is.

    Perhaps they count each iteration of a dynamic page as a separate page? Even so, google's news page does a great job searching in real time for pages that change dynamically.

  6. 1 percent,? by zonix · · Score: 4, Insightful
    The article alleges that current search services like Google manage to access less than 1% of the web [...]

    1 percent, and I still don't have a problem feeling lucky almost every time I do a search on google.

    z
    --
    What would an EWOULDBLOCK block, if an EWOULDBLOCK could block would? -- me
  7. Relevancy by Traicovn · · Score: 4, Insightful

    Judging by the problems with relevancy that often occur in current search engines (I think of google-bombing and of meta keywords, which are now completely useless for many search engines), why would a customer pay to add more data to the search engine? The idea of course is 'because they'll be more relevant and, with more information indexed, will come up more often'. However, if search engines start searching more and more of this 'deep web', how badly will relevancy be affected? I mean, the more data that is in there, the more chances there are of relevancy being broken, and if the weighting is in favor of these 'featured' searches, then relevancy may be even more broken. Sure, these companies will have more traffic directed to them, but will it merely be useless traffic from frustrated users searching for something else?

    I run a search engine for an educational institution, and I will admit, Google misses a significant number of our documents. On the other hand, some of those documents are scripts that when queried will create a (virtually) infinite amount of data (calendar scripts, etc). How deep do we really need to go though? Do we really need to include calendar entries for the year 2452?

    I'm also confused: is this search service 'pay by the searcher' or 'pay by the content provider'? It seems to be content provider to me.

    --

    [Something witty and intelligent should have appeared here.]
    {Traicovn}
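
One way a crawler might avoid the endless-calendar problem raised above is to skip URLs whose date falls outside a small window around the present. This is only a sketch: the `year` query parameter, the `calendar.cgi` script name, and the two-year window are assumptions for illustration, not how any particular search engine works.

```python
from urllib.parse import urlparse, parse_qs


def within_calendar_window(url, current_year, window=2):
    """Return False for URLs whose (hypothetical) 'year' parameter is
    more than `window` years away from current_year."""
    query = parse_qs(urlparse(url).query)
    years = query.get("year")
    if not years:
        return True  # no year parameter, so nothing to filter on
    try:
        year = int(years[0])
    except ValueError:
        return False  # malformed year: treat as not worth crawling
    return abs(year - current_year) <= window


if __name__ == "__main__":
    # example.edu and calendar.cgi are made-up names.
    for url in (
        "http://example.edu/calendar.cgi?year=2005&month=3",
        "http://example.edu/calendar.cgi?year=2452&month=3",
    ):
        print(url, within_calendar_window(url, current_year=2004))
```

A real crawler would more likely combine a rule like this with a cap on link depth or on pages fetched per script, since generated pages do not always expose a tidy parameter.
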
  8. Limitations of Google by PingKing · · Score: 3, Insightful

    One limitation of Google is the fact that a site that bases its navigation on a drop-down menu or submission form (i.e. choose a section from the list and click Go) cannot be spidered by Google.

    Personally, I find this infuriating. A site I once worked on was available in numerous languages, which could be selected from a drop-down list box. The upshot of this is that Google has only cached the site in English, meaning users who would use the other languages do not get my site returned when they search in Google.

    We need an open-source alternative that can address these problems, as well as get rid of the security concerns and mysterious methods Google uses to rank sites.

    --

    Patriotism - the last resort of scoundrels.
    1. Re:Limitations of Google by Stiletto · · Score: 4, Insightful


      Solution: Web designers, stop trying to be so clever.

      If you want your site to be spiderable, don't hide it behind javascript and flash!
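
The drop-down problem in this sub-thread can be illustrated with a toy link extractor. This is a sketch of a naive spider that follows only `<a href>` links, which is an assumption about crawler behaviour made for illustration; the sample page and its paths are invented.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags only, as a naive spider might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one plain link, plus a language chooser built
# from a form and a drop-down list.
PAGE = """
<a href="/en/index.html">English</a>
<form action="/go" method="post">
  <select name="lang">
    <option value="/fr/index.html">Francais</option>
    <option value="/de/index.html">Deutsch</option>
  </select>
  <input type="submit" value="Go">
</form>
"""


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    print(extract_links(PAGE))
```

Only the English page shows up in the result; the French and German pages exist solely as option values, so adding ordinary links to them alongside the drop-down would make them reachable.
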

  9. Re:PHP? by andygrace · · Score: 2, Insightful
    Well the front pages might be, with a few top stories, but the real problem lies in getting at all the information that is stored in SQL databases ...

    There are reams of stuff in there that a search engine can't see. XML could be used to deep search these entire databases, rather than just the stuff that's pulled into the UI by the PHP code.
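
As a rough sketch of the XML idea, the snippet below dumps the rows of a database query into an XML document that a crawler could be handed directly. It uses an in-memory SQLite database in place of whatever SQL server a real site runs, and the `stories` table, its columns, and the tag names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_query_as_xml(conn, query, root_tag="records", row_tag="record"):
    """Run query and return its rows as an XML string, one element per column."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    root = ET.Element(root_tag)
    for row in cursor:
        record = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            ET.SubElement(record, column).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical table standing in for a site's story archive.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stories (id INTEGER, title TEXT)")
    conn.execute("INSERT INTO stories VALUES (1, 'Searching the Deep Web')")
    print(export_query_as_xml(conn, "SELECT id, title FROM stories"))
```

The column names become tag names here, which only works when they are valid XML names; a real export would need to sanitise them and decide which tables are safe to publish at all.
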

  10. Re:Maybe I'm just missing the point... by Anonymous Coward · · Score: 1, Insightful

    Sounds to me like that IS the point.

    I don't need a search engine to index the interfaces, I need it to index the DATA.

    Now, I'll admit, it would be a nice bonus if it can then map that back, and turn its search results into a link to the data via the desired interface - but I'd settle for just getting the data.

  11. Re:Oh yeah, a whole new pair of dimes by dsanfte · · Score: 4, Insightful

    I wish you luck using that credit card number without the appropriate expiration date. The FUD spreaders rarely mention the fact that exp dates are almost never stored with the numbers themselves.

    --
    occultae nullus est respectus musicae - originally a Greek proverb
  12. Re:Oh yeah, a whole new pair of dimes by Zone-MR · · Score: 2, Insightful

    So are you implying that your credit card information is currently available on web pages, with no password protection, and the only thing stopping hackers is that it isn't listed in a search engine?

  13. Re:Oh yeah, a whole new pair of dimes by Anonymous Coward · · Score: 1, Insightful

    I think it means that the crawler will use your credit card.. so you can just search for terms like "What did I buy today", or "If I was going to buy something -- oh? I did buy something?"

    This can't be done reliably with current technology (i.e. google)

  14. Warnings are there to limit liability. by oneiros27 · · Score: 3, Insightful

    It's rather stupid, but it has to do with legal practices.

    If you have no warnings, then someone can claim that you forced your content on them, and they didn't know what they were getting into, and it was offensive.

    Putting up warnings, which inform the user that they shouldn't enter your site if it's illegal for them to do so, shifts part of the burden of responsibility to them, and away from you.

    So, if you're sued for having distributed offensive material, you can claim that you provided warnings, and that the person chose to disregard them. [Sort of like putting up 'wet floor' signs -- if someone gets hurt, they made an active decision to ignore the sign]

    --
    Build it, and they will come^Hplain.
  15. brute forcing 48 passwords by Anonymous Coward · · Score: 1, Insightful

    Huh?

    It's safe to guess that the exp'y is within 4 years. (otherwise, move onto another card)

    That's an amazing 48 possible "passwords" to brute force (assuming that cc subscription dates are uniformly distributed. any research on this?). I *THINK* there are >48 web merchants... Hmm.

    This, of course, doesn't use the resources mentioned in the other posts.
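
The figure of 48 above is simply 4 years times 12 months. A small sketch of that arithmetic, assuming expiry dates are month/year pairs and that the window starts in January of an arbitrary year (the year 2004 below is just a placeholder):

```python
from itertools import product


def expiry_keyspace(start_year, years=4):
    """List every (month, year) pair in a window of `years` whole years."""
    year_range = range(start_year, start_year + years)
    return [(month, year) for year, month in product(year_range, range(1, 13))]


if __name__ == "__main__":
    candidates = expiry_keyspace(2004)
    # 4 years * 12 months
    print(len(candidates))
```

A keyspace that small is the point of the comment: a secret with only a few dozen possible values offers very little protection on its own.
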

  16. Re:With the 10% that is crawled by danielsfca2 · · Score: 2, Insightful

    Hey cheapskate. Maybe if you subscribed to Salon you wouldn't have that problem. Independent news sites like Salon are going to disappear if they get no revenue. Maybe next time you visit salon.com, it'll say "Thanks to our former subscribers for the support. Due to our operating costs going through the roof but only four people subscribing, we've been forced to go out of business. This domain was bought by Fox News in bankruptcy proceedings. Click here to go there now."

    If you're too cheap to pay for anything, you have to be satisfied with things like ad-supported internet access (see NetZero) and ad-supported news (like salon's day-pass, and fucking TV, where's the complaining about CNN?). Yes, the ads are more intrusive than they were in 1999. The venture capital investment is gone and advertisers won't pay jack for barely-there banner ads. Now they want your full attention for a moment. So WTF is salon.com supposed to do, just say, "Everything is free! No ads! When the bandwidth bill comes, we'll just mail them some monopoly money"??

    If ad-supported websites didn't exist, the only people who could afford to publish on the Internet would be the conglomerated media who make their money from--say it with me--ad revenue from TV (etc.). Get it yet?

    Now, Mr. Troll, get back under your bridge.

  17. And analogously ... by cookie_cutter · · Score: 2, Insightful

    If you have a public mail server, you deserve any spam you get...