Slashdot Mirror


Google Sheds Light On 'Dark Web' With PDF Search

CWmike writes "Google this week took another step in its effort to shed light on the so-called Dark Web, announcing that its search engine can now search scanned documents in a PDF. In April, Google announced that it was looking for ways for its search engine to index HTML forms such as drop-down boxes or select menus that otherwise couldn't be found or indexed." An announcement is available at the official Google blog, and it contains some demonstration searches.

5 of 78 comments

  1. Re:Just what we needed by denmarkw00t · · Score: 4, Informative

    I think you've got this wrong, to some extent. I don't think it's going to "submit" to see what options go where, but more just indexing the options from forms to give a better idea of what's going on in the page. Suddenly Google can go "Hey, this isn't just a form, it's a form pertaining to X," and thus make its results more relevant by being able to index more of a site as a whole.
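To make the idea concrete, here is a minimal sketch of reading a form's option labels without ever submitting it. This is only an illustration using Python's standard `html.parser`, not a description of how Google's crawler actually works; the names `FormOptionIndexer` and `index_form_options` are made up for this example.

```python
from html.parser import HTMLParser


class FormOptionIndexer(HTMLParser):
    """Collects the visible <option> labels, grouped by <select> name."""

    def __init__(self):
        super().__init__()
        self.options = {}        # select name -> list of option labels
        self._select = None      # name of the <select> currently open
        self._in_option = False  # True while inside an <option>

    def handle_starttag(self, tag, attrs):
        if tag == "select":
            self._select = dict(attrs).get("name") or ""
            self.options.setdefault(self._select, [])
        elif tag == "option" and self._select is not None:
            self._in_option = True
            self.options[self._select].append("")

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False
        elif tag == "select":
            self._select = None
            self._in_option = False

    def handle_data(self, data):
        # Only text inside an <option> counts as a label.
        if self._in_option:
            self.options[self._select][-1] += data


def index_form_options(html):
    """Return {select_name: [label, ...]} for every <select> in the page."""
    parser = FormOptionIndexer()
    parser.feed(html)
    parser.close()
    return {name: [label.strip() for label in labels]
            for name, labels in parser.options.items()}
```

Fed a page with a "state" drop-down, this yields the labels a user would see, which is the kind of text an indexer could attach to the page to work out what the form is about.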

  2. Re:Just what we needed by Reckless+Visionary · · Score: 4, Informative

    If people want their sites to be indexed, they shouldn't use forms for navigation. It's not rocket science.

    This isn't about people who want their sites indexed. It's about sites that Google wants to index, but which aren't designed to be indexed. If you prefer not to be indexed, Google says they will abide by robots.txt.
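For anyone curious what "abiding by robots.txt" amounts to on the crawler side, here is a rough sketch using Python's standard `urllib.robotparser`. The helper name `allowed`, the user agent `ExampleBot`, and the example.com URLs are placeholders for illustration, not anything Google has published.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(allowed(rules, "ExampleBot", "https://example.com/public/page.html"))
print(allowed(rules, "ExampleBot", "https://example.com/private/form.html"))
```

A well-behaved crawler would run a check like this before requesting a page and skip anything under a disallowed path.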

    --
    I think I'll stop here.
  3. Re:1000 years of darkness coming to an end? by BorgAssimilator · · Score: 2, Informative

    Well yes, but it doesn't mean that no one will want to try and find it.

    Just look at /b/...

    --
    "Intelligence has nothing to do with politics!"
    -Londo Mollari
  4. Re:'Scanning is the reverse of printing.' by fiannaFailMan · · Score: 4, Informative

    "Scanning is the reverse of printing." -- WTF?! Because of artifacts?

    And isn't this what View as HTML has ALWAYS been about?

    Points awarded for techtard clarity, but the person at Google who thought of writing a press release aimed at techtards should be firmly smacked.

    Calm down please. The guy is trying to explain the concept to a broader audience, or 'techtards' as you so pompously refer to them along with your out-of-context quote, and he's doing a fine job of explaining how it is hard for a computer to interpret scanned text. The days are gone when the web was the preserve of nerds with zero social skills. Get over it.

    --
    Drill baby drill - on Mars
  5. Re:'Scanning is the reverse of printing.' by PotatoFarmer · · Score: 5, Informative

    I'm not sure if you got the point of this - it's about using a form of OCR to translate embedded document images within a PDF, rather than simply sucking the text out of the PDF itself, as you rightly point out is already available in the View as HTML option for PDF search results.

    Scanning is the reverse of printing because, well, it's the reverse of printing. When you're scanning something, you're taking a purely human-readable document and translating its contents into a machine-interpretable form. This is pretty much the exact opposite of printing from a computer.