Slashdot Mirror


Google Sheds Light On 'Dark Web' With PDF Search

CWmike writes "Google this week took another step in its effort to shed light on the so-called Dark Web, announcing that its search engine can now search scanned documents in a PDF. In April, Google announced that it was looking for ways for its search engine to index HTML forms such as drop-down boxes or select menus that otherwise couldn't be found or indexed." An announcement is available at the official Google blog, and it contains some demonstration searches.

3 of 78 comments

  1. Re:Just what we needed by Ed Avis · · Score: 2, Interesting

    Well, if it's a form with a GET request then it should be safe to request it, since GET is meant merely to display some information. Forms using the POST method, which performs an action, are less safe and I'd hope Google is not trying to spider those.

    "If people want their sites to be indexed, they shouldn't use forms for navigation."

    So the alternative is automatically generating pages and pages of links to every possible item in the database just so that search engines can follow them? If a form is the most natural and convenient interface for a human there's no reason the spider can't use it too.
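The GET-versus-POST distinction above can be sketched in a few lines. This is only an illustration of the idea, not a description of how Google's spider actually works: the class name `GetFormCollector` is hypothetical, and it assumes the simplest case of a form whose only input is a single select menu. It turns each option of a GET form into a plain URL a spider could fetch, and skips POST forms entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class GetFormCollector(HTMLParser):
    """Hypothetical helper: list the URLs reachable through GET forms."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []
        self._action = None  # action of the GET form being parsed, else None
        self._select = None  # name of the <select> being parsed, else None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # HTML forms default to GET when no method is given.
            method = (attrs.get("method") or "get").lower()
            if method == "get":
                self._action = attrs.get("action") or ""
            else:
                self._action = None  # POST may perform an action: skip it
        elif tag == "select" and self._action is not None:
            self._select = attrs.get("name")
        elif tag == "option" and self._action is not None and self._select:
            value = attrs.get("value")
            if value:
                query = urlencode({self._select: value})
                target = urljoin(self.base_url, self._action)
                self.urls.append(target + "?" + query)

    def handle_endtag(self, tag):
        if tag == "form":
            self._action = None
            self._select = None
        elif tag == "select":
            self._select = None
```

Feeding a page to `GetFormCollector(page_url).feed(html)` and reading `.urls` gives one candidate link per menu option, which is the same set of pages a human could reach through the form.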

    --
    -- Ed Avis ed@membled.com
  2. Not so new? by Archon-X · · Score: 5, Interesting

    Google has long favoured PDFs and gives them boosted results, on the theory that anyone who makes a PDF has something serious to say, I guess.

    You may have noticed of late that people are wise to this - there are a bunch of sites that are embedding popular search terms / results in PDF files, and cluttering their sites with adverts.

  3. Tesseract by mcrbids · · Score: 4, Interesting

    Not so sure about PDFs as an image format - which is exactly what you have when you use PDF to hold scanned documents. I think the more interesting point is that they feel they have an OCR package good enough to be trustworthy. I wonder if it's based on the Tesseract OCR software that they adopted a while back?

    I played with it for a while, and got very poor results from the command line. Even when I made a PNG or BMP of a single full-screen word, "HELLO", in a 200-pixel font with GIMP (about as perfect as input gets!) I'd often get "HEHO" or "H3H0" or god only knows what else.

    Of course, this was when the project relaunch was first announced a year or two ago, so I certainly hope it's better now! Looking at their web page, it does appear that there's some significant activity going on. Yay Google!

    Maybe I'll try it again, and see if it's worth using yet?
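For anyone wanting to repeat the experiment described above, here is a minimal sketch that drives the `tesseract` command-line tool from Python. The helper names and the `hello.png` file name are hypothetical, and it assumes a Tesseract version that accepts `stdout` as the output base, so the recognised text is printed instead of written to a file.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list: tesseract <image> stdout -l <lang>."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Return the recognised text, or None if tesseract is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout.strip()
```

Calling `ocr_image("hello.png")` on the test image and comparing the result with "HELLO" would show quickly whether the "HEHO" problem is still there.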

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.