Google Sheds Light On 'Dark Web' With PDF Search
CWmike writes "Google this week took another step in its effort to shed light on the so-called Dark Web, announcing that its search engine can now search scanned documents in a PDF. In April, Google announced that it was looking for ways for its search engine to index HTML forms such as drop-down boxes or select menus that otherwise couldn't be found or indexed."
An announcement is available at the official Google blog, and it contains some demonstration searches.
After reading that, I've come to the conclusion that some parts of the internet should definitely remain in the dark.
I'm not sure you got the point of this. It's about using a form of OCR to read the scanned document images embedded in a PDF, rather than simply pulling out the text stored in the PDF itself, which, as you rightly point out, is already available through the View as HTML option for PDF search results.
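To make the distinction concrete, here's a rough sketch of how you might tell the two kinds of PDF apart using only the Python standard library. It is a heuristic I made up for illustration, not how Google or any real PDF library does it: `classify_pdf` and its helper names are hypothetical, and it assumes content streams are either uncompressed or Flate-compressed. A file with text-showing operators (`Tj`/`TJ`) has text you can pull straight out; a file with only image objects is the scanned case where OCR is needed.

```python
import re
import zlib

# Hypothetical helper names; a crude byte-level heuristic, not a real PDF parser.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
TEXT_OP_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b", re.DOTALL)  # BT ... Tj/TJ
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def decode_stream(raw: bytes) -> bytes:
    """Try Flate (zlib) decompression; fall back to the raw bytes."""
    try:
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        return raw


def classify_pdf(data: bytes) -> str:
    """Return 'text', 'image-only', or 'unknown' for raw PDF bytes."""
    streams = [decode_stream(m.group(1)) for m in STREAM_RE.finditer(data)]
    # Any text-showing operator means there is text to extract directly.
    if any(TEXT_OP_RE.search(s) for s in streams):
        return "text"
    # No text operators but image objects present: likely a scan, so OCR.
    if IMAGE_RE.search(data):
        return "image-only"
    return "unknown"
```

You'd call it as `classify_pdf(open("some.pdf", "rb").read())`. Binary image data can contain a stray `BT ... Tj` by chance, and files that use object streams or other filters will fool it, so treat the answer as a hint rather than a verdict.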
Scanning is the reverse of printing because, well, it's the reverse of printing. When you're scanning something, you're taking a purely human-readable document and translating its contents into a machine-interpretable form. This is pretty much the exact opposite of printing from a computer.
Google has long favoured PDFs and gives them boosted results, on the theory, I guess, that anyone who makes a PDF has something serious to say.
You may have noticed of late that people have wised up to this: there are a bunch of sites embedding popular search terms and results in PDF files, and cluttering their sites with adverts.