Slashdot Mirror


Google Sheds Light On 'Dark Web' With PDF Search

CWmike writes "Google this week took another step in its effort to shed light on the so-called Dark Web, announcing that its search engine can now search scanned documents in a PDF. In April, Google announced that it was looking for ways for its search engine to index HTML forms such as drop-down boxes or select menus that otherwise couldn't be found or indexed." An announcement is available at the official Google blog, and it contains some demonstration searches.

12 of 78 comments

  1. Re:1000 years of darkness coming to an end? by philspear · · Score: 5, Funny

    After reading that, I've come to the conclusion that some parts of the internet should definitely remain in the dark.

  2. Re:Just what we needed by denmarkw00t · · Score: 4, Informative

    I think you've got this wrong, to some extent. I don't think it's going to "submit" to see what options go where, but more just indexing the options from forms to give a better idea of what's going on in the page - suddenly Google can go "Hey, this isn't just a form, but it's a form pertaining to X." and thus make their results more relevant by being able to index more of a site as a whole.

  3. Re:Just what we needed by Reckless Visionary · · Score: 4, Informative

    If people want their sites to be indexed, they shouldn't use forms for navigation. It's not rocket science.

    This isn't about people who want their sites indexed. It's about sites that Google wants to index, but which aren't designed to be indexed. If you prefer not to be indexed, Google says they will abide by robots.txt.
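    For anyone wondering what "abide by robots.txt" looks like from the crawler's side, here is a minimal sketch using Python's standard urllib.robotparser. The rules text, the "ExampleBot" agent name and the example.com URLs are all made up for illustration; this is not Google's crawler code, just the general idea.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: keep every crawler out of /private/.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.com/catalog.html",
                "https://example.com/private/order-form.html"):
        print(url, is_allowed(EXAMPLE_RULES, "ExampleBot", url))
```

    A well-behaved crawler would run a check like this before fetching a page, form or no form.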

    --
    I think I'll stop here.
  4. Re:Just what we needed by spitzak · · Score: 4, Insightful

    I think it is just going to look in the contents of the controls. This would be really useful, for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.
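    As a rough sketch of what "looking in the contents of the controls" could mean, the snippet below pulls the visible text out of the option elements on a page so it can be indexed like any other text. It uses only Python's standard html.parser. The class and function names are hypothetical, and this is a guess at the general approach, not a description of how Google actually does it.

```python
from html.parser import HTMLParser


class OptionTextExtractor(HTMLParser):
    """Collects the visible text of every <option> element it sees."""

    def __init__(self):
        super().__init__()
        self._in_option = False
        self._buffer = []
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # A new <option> implicitly closes an unclosed previous one.
            self._flush()
            self._in_option = True

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush()

    def handle_data(self, data):
        if self._in_option:
            self._buffer.append(data)

    def _flush(self):
        if self._in_option:
            # Collapse runs of whitespace inside the option label.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.options.append(text)
        self._buffer = []
        self._in_option = False


def extract_option_text(html: str) -> list:
    """Return the option labels found in html, in document order."""
    parser = OptionTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing option with no closing tags
    return parser.options


if __name__ == "__main__":
    page = (
        '<form action="/buy"><select name="product">'
        '<option value="1">Widget Model XJ123</option>'
        '<option value="2">Widget Model XJ124</option>'
        "</select></form>"
    )
    print(extract_option_text(page))
```

    Nothing gets submitted here; the labels are simply read out of the markup, which is why a search for "Widget Model XJ123" could match the page.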

  5. Re:'Scanning is the reverse of printing.' by fiannaFailMan · · Score: 4, Informative

    "Scanning is the reverse of printing." -- WTF?! Because of artifacts?

    And isn't this what View as HTML has ALWAYS been about?

    Points awarded for techtard clarity, but the person at Google who thought writing a press release aimed at techtards should be firmly smacked.

    Calm down please. The guy is trying to explain the concept to a broader audience, or 'techtards' as you so pompously refer to them along with your out-of-context quote, and he's doing a fine job of explaining how it is hard for a computer to interpret scanned text. The days are gone when the web was the preserve of nerds with zero social skills. Get over it.

    --
    Drill baby drill - on Mars
  6. Re:'Scanning is the reverse of printing.' by PotatoFarmer · · Score: 5, Informative

    I'm not sure if you got the point of this - it's about using a form of OCR to translate embedded document images within a PDF, rather than simply sucking the text out of the PDF itself, as you rightly point out is already available in the View as HTML option for PDF search results.

    Scanning is the reverse of printing because, well, it's the reverse of printing. When you're scanning something, you're taking a purely human-readable document and translating its contents into a machine-interpretable form. This is pretty much the exact opposite of printing from a computer.

  7. There are "dark webs", but this isn't them. by Jane Q. Public · · Score: 4, Insightful

    A "dark web" is a private network, accessible by members over the internet but not accessible to outsiders. (A VPN is one example of a kind of "dark web".)

    But as you say, this is something completely different.

  8. Re:Just what we needed by Arthur Grumbine · · Score: 3, Funny

    for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.

    Shenanigans! And I've been looking everywhere for that elusive XJ123, since the manufacturer stopped producing it. How dare you get my hopes up!

    --
    Now that I think about it, I'm pretty sure everything I just said is completely wrong.
  9. Not so new? by Archon-X · · Score: 5, Interesting

    Google has long favoured PDFs - and gives them boosted results, on the assumption that anyone who makes a PDF has something serious to say, I guess.

    You may have noticed of late that people are wise to this - there are a bunch of sites that are embedding popular search terms / results in PDF files, and cluttering their sites with adverts.

  10. Re:Cool, and definitely worthwhile, but... by Firehed · · Score: 4, Insightful

    Why not just search for "teeth medicine" then? Google hasn't relied on direct keyword matching alone for years now (for example, a search for "computer" may yield results containing related terms such as "PC" or "Mac" even if the original keyword "computer" doesn't appear anywhere on the site).

    Remember that Yahoo started out as a category browser in its very early days, and now categories are really just another keyword. Google and all of the other search engines are designed to work well for the lowest common denominator of internet users - as someone with a 3-digit UID, I imagine you're not in that group. Trying to outsmart Google will probably just make its algorithm feel unnatural/broken.

    --
    How are sites slashdotted when nobody reads TFAs?
  11. Tesseract by mcrbids · · Score: 4, Interesting

    Not so sure about PDFs as an image format - which is exactly what you have when you use PDF to hold scanned documents. I think the more interesting point is that they feel they have an OCR package good enough to be trustworthy. I wonder if it's based on the Tesseract OCR software that they adopted a while back?

    I played with it for a while, and got very poor results from the command line. Even when I made a png or bmp of a full-screen single word "HELLO" in a 200-pixel font with GIMP (about as perfect as input gets!) I'd often get "HEHO" or "H3H0" or god only knows what else.

    Of course, this was when the project relaunch was first announced a year or two ago; I certainly hope it's better now! Looking at their web page, it does appear that there's some significant activity going on. Yay Google!

    Maybe I'll try it again, and see if it's worth using yet?

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
  12. Re:Just what we needed by AVryhof · · Score: 3, Funny

    for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.

    Shenanigans! And I've been looking everywhere for that elusive XJ123, since the manufacturer stopped producing it. How dare you get my hopes up you insensitive clod!

    There. Fixed that for you.