Google Sheds Light On 'Dark Web' With PDF Search
CWmike writes "Google this week took another step in its effort to shed light on the so-called Dark Web, announcing that its search engine can now search scanned documents in a PDF. In April, Google announced that it was looking for ways for its search engine to index HTML forms such as drop-down boxes or select menus that otherwise couldn't be found or indexed."
An announcement is available at the official Google blog, and it contains some demonstration searches.
anything is possible, as it says in the manual.
greed, fear & ego (in any order) are unprecedented evile's primary weapons. those, along with deception & coercion, helps most of us remain (unwittingly?) dependent on its' life0cidal hired goons' agenda. most of yOUR dwindling resources are being squandered on the 'wars', & continuation of the billionerrors stock markup FraUD/pyramid schemes. nobody ever mentions the real long term costs of those debacles in both life & any notion of prosperity for us, or our children, not to mention the abuse of the consciences of those of us who still have one. see you on the other side of it. the lights are coming up all over now. conspiracy theorists are being vindicated. some might choose a tin umbrella to go with their hats. the fairytail is winding down now. let your conscience be yOUR guide. you can be more helpful than you might have imagined. there are still some choices. if they do not suit you, consider the likely results of continuing to follow the corepirate nazi hypenosys story LIEn, whereas anything of relevance is replaced almost instantly with pr ?firm? scriptdead mindphuking propaganda or 'celebrity' trivia 'foam'. meanwhile; don't forget to get a little more oxygen on yOUR brain, & look up in the sky from time to time, starting early in the day. there's lots going on up there.
we note that yahoo deletes some of its (relevant) stories sooner than others. maybe they're short of disk space, or something?
http://news.google.com/?ncl=1216734813&hl=en&topic=n
http://www.cnn.com/2008/TECH/science/09/23/what.matters.thirst/index.html
http://www.nytimes.com/2007/12/31/opinion/31mon1.html?em&ex=1199336400&en=c4b5414371631707&ei=5087%0A
(deleted)http://news.yahoo.com/s/ap/20080918/ap_on_re_us/tent_cities;_ylt=A0wNcyS6yNJIZBoBSxKs0NUE
http://www.nytimes.com/2008/05/29/world/29amnesty.html?hp
http://www.cnn.com/2008/US/06/02/nasa.global.warming.ap/index.html
http://www.cnn.com/2008/US/weather/06/05/severe.weather.ap/index.html
http://www.cnn.com/2008/US/weather/06/02/honore.preparedness/index.html
http://www.cnn.com/2008/TECH/science/09/28/what.matters.meltdown/index.html#cnnSTCText
http://www.cnn.com/2008/SHOWBIZ/books/10/07/atwood.debt/index.html
http://www.nytimes.com/2008/06/01/opinion/01dowd.html?em&ex=1212638400&en=744b7cebc86723e5&ei=5087%0A
http://www.cnn.com/2008/POLITICS/06/05/senate.iraq/index.html
http://www.nytimes.com/2008/06/17/washington/17contractor.html?hp
http://www.nytimes.com/2008/07/03/world/middleeast/03kurdistan.html?_r=1&hp&oref=slogin
(deleted, still in google cache)http://biz.yahoo.com/ap/080708/cheney_climate.html
http://news.yahoo.com/s/politico/20080805/pl_politico/12308;_ylt=A0wNcxTPdJhILAYAVQms0NUE
http://www.cnn.com/2008/POLITICS/09/18/voting.problems/index.html
(deleted)http://news.yahoo.com/s/nm/20080903/ts_nm/environment_arctic_dc;_ylt=A0wNcwhhcb5It3EBoy2s0NUE
(talk about cowardly race fixing/bad theater/fiction?) http://money.cnn.com/2008/09/19/news/economy/sec_short_selling/index.htm?cnn=yes
http://us.lrd.yahoo.com/_ylt=ApTbxRfLnscxaGGuCocWlwq7YWsA/SIG=11qicue6l/**http%3A//biz.yahoo.com/ap/081006/meltdown_kashkari.html
http://www.nytimes.com/2008/10/04/opinion/04sat1.html?_r=1&oref=slogin
(the teaching of hate as a way of 'life' synonymous with failed dictatorships) http://news.yahoo.com/s/ap/20081004/ap_on_re_us/newspapers_islam_dvd;_ylt=A0wNcwWdfudITHkACAus0NUE
(some yoga & yogurt makes killing/getting killed less stressful) http://news.yahoo.com/s/ap/20081007/ap_on_re_us/warrior_mind;_ylt=A0wNcw9iXutIPkMBwzGs0NUE
(the old bait & switch...your share of the resulting 'product' is a fairytail nightmare?)
http://news.yahoo.com/s/ap/20081011/ap_on_bi_ge/where_s_the_money;_ylt=A0wNcwJGwvFIZAQAE6ms0NUE
is it time to get real yet? A LOT of energy is being squandered in attempts to keep US in the dark. in the end (give or take a few 1000 years), the creators will prevail (world withou
"Google announced that it was looking for ways for its search engine to index HTML forms such as drop-down boxes or select menus that otherwise couldn't be found or indexed."
Great. So basically, it's going to fuss with forms and pretend to be a user clicking "submit". That seems like a BRILLIANT idea, because, naturally, every HTML form out there is used purely for navigation...
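For anyone wondering what "pretending to be a user" might look like in practice, here is a minimal sketch. It is not Google's method (the announcement gives no implementation details); it only assumes a crawler that reads the option values of a drop-down and builds the GET URLs a browser would request, skipping POST forms because submitting those may change state on the server. The class name, function name, and example.com page are all made up for illustration, and the sketch assumes the form action carries no query string of its own.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class SelectFormParser(HTMLParser):
    """Collect action, method and <select> option values for each <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> seen
        self._form = None     # form currently open, if any
        self._select = None   # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._form = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "selects": {},
            }
            self.forms.append(self._form)
        elif tag == "select" and self._form is not None and attrs.get("name"):
            self._select = attrs["name"]
            self._form["selects"][self._select] = []
        elif tag == "option" and self._select is not None and attrs.get("value"):
            # Options with an empty or missing value attribute are skipped.
            self._form["selects"][self._select].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form":
            self._form = None
            self._select = None


def candidate_urls(page_url, html):
    """Return one URL per option value, for GET forms only."""
    parser = SelectFormParser()
    parser.feed(html)
    urls = []
    for form in parser.forms:
        if form["method"] != "get":
            continue  # never auto-submit POST forms
        base = urljoin(page_url, form["action"])
        for name, values in form["selects"].items():
            for value in values:
                urls.append(base + "?" + urlencode({name: value}))
    return urls


# Hypothetical page: one navigation-style GET form, one state-changing POST form.
PAGE = """
<form action="/catalog" method="get">
  <select name="state">
    <option value="">Pick one</option>
    <option value="CA">California</option>
    <option value="NY">New York</option>
  </select>
</form>
<form action="/delete" method="post">
  <select name="id"><option value="1">one</option></select>
</form>
"""

urls = candidate_urls("http://example.com/index.html", PAGE)
print(urls)
```

Only the GET form yields URLs here; the POST form is ignored, which is the distinction the complaint above hinges on.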
If people want their sites to be indexed, they shouldn't use forms for navigation. It's not rocket science.
Please help metamoderate.
Increasing the number of items that can be searched is great, but the actual searching algorithms really haven't gotten THAT much better in the past 3 years or so.
Obviously, you can't have breakthroughs every year (or maybe even every 5 years) but search as an algorithm still has much more room to improve. I'd love to see an improvement in that, as opposed to just increasing the number of pages indexed.
Still cool though...
If you can read this... 01110101 01110010 00100000 01100001 00100000 01100111 01100101 01100101 01101011
Goatse?
Referenced article is talking about the "deep web", not dark web.
"Scanning is the reverse of printing." -- WTF?! Because of artifacts? And isn't this what View as HTML has ALWAYS been about? Points awarded for techtard clarity, but the person at Google who thought writing a press release aimed at techtards should be firmly smacked.
-- Dedicated Cthulhu cultist since 1982 A.C.E.
It's DEEP web, not dark. This is the internet, not astrophysics.
I always thought the "dark" web was the seedy underside of password-protected forums and such where warez pirates and so on operated, releasing cracks for software and then letting it trickle down into more visible channels. Well, before torrents and TPB, at least.
I just started reading and it says "powerful search engines such as Google and Yahoo". Yahoo is a search engine? A Powerful one? It's an advertising index, Spam search, Ad finder? I call BS, no one thinks Yahoo is a powerful search engine!
Every time I use image search and see most are not related, I look at Google asking ME to help them label pictures to help. I feel guilty for not helping, and comfort myself knowing Google has a far better shot at image recognition than I ever will.
Never heard of either before. Looks like there's a competition going on to see who comes up with the next buzzword.
The filesystem is the package manager
A "dark web" is a private network, accessible by members over the internet but not accessible to outsiders. (A VPN is one example of a kind of "dark web".)
But as you say, this is something completely different.
The Deep, Deep, Dark, Dark, Deep, Dark Web...coming soon to a web browser near you!
MCSE? No, sir...I don't do Windows. Yes, I am an idealist. What's your point?
Google has long favoured PDFs and gives them boosted results, presumably on the theory that anyone who makes a PDF has something serious to say.
You may have noticed of late that people are wise to this - there are a bunch of sites that are embedding popular search terms / results in PDF files, and clustering their sites with adverts.
What I would really like to see is OCR for mathematical formulas, with the results stored in some standard format. Given a standard input such as LaTeX, the engine would search for mathematical equations. Right now I find it a pain to look for a formula that I know exists, but don't know its name.
This would help bring together a lot of research that is done, but hard to sort through. Then, implement a smart system using a program like Mathematica to find variations of the equations, etc., and see where duplicates exist. Maybe we'll find that we've discovered things that weren't looked at thoroughly enough.
Nice feature, but I think it only works with PDF? I would love to see the same with DjVu as well.
How about adding the word *scanned* to the headline, just as in the original headline?
That way others won't have to read the summary going "Hey, I thought Google was searching PDFs for the last 10 years."
-David
dark web.. oh geez. eternal September has only just started.
apparently the world at large loves to shit on standards and practices.
it's been a while since search engines actually returned results I was looking for. google, yahoo, msn, metacrawler,.. they all want my money. "-com" + adblock doesn't really help anymore. I'm so sick and tired of the net. it once was the best thing that ever happened to the world. now it's the hyper-communication tool for fart jokes and perversion.
guess that tells you a lot about humanity.
It's not about fate, it's about character.
there be no shelter here, the frontline is everywhere!
It'll be even cooler when Google are able to automatically detect things like citations and references, and add hyperlinks as appropriate.
It still sort of bugs me that scientific papers are written in LaTeX, and not hypertext, especially considering that the web (in its current form) originated at CERN.
-- If you try to fail and succeed, which have you done? - Uli's moose
There's a module in CPAN for this. It rips out the images and runs them through Tesseract. It's worked well the few times I've tried it. Certainly well enough for search engine indexing.
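Not the CPAN module itself, but the same idea can be sketched in a few lines of Python, assuming poppler's pdfimages and the tesseract command-line tool are both installed and on the PATH. The function names are made up for illustration, and nothing here is tuned for accuracy.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_cmd(pdf_path, out_prefix):
    """Command line for poppler's pdfimages: dump embedded images as PNG."""
    return ["pdfimages", "-png", str(pdf_path), str(out_prefix)]


def tesseract_cmd(image_path, lang="eng"):
    """Command line for tesseract: OCR one image and print text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_scanned_pdf(pdf_path, lang="eng"):
    """Return the OCR text of every image embedded in a scanned PDF."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # pdfimages writes page-000.png, page-001.png, ... next to the prefix.
        subprocess.run(pdfimages_cmd(pdf_path, prefix), check=True)
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                tesseract_cmd(image, lang),
                check=True, capture_output=True, text=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)
```

Calling ocr_scanned_pdf("scan.pdf") would return plain text that could be fed to an indexer; how good that text is depends entirely on the scan and the Tesseract version.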
Also, my understanding of the "dark web" concept was that it referred to sites that had no links going to them, so no spiders are able to access them. I'm not seeing how any of this would fix the "problem".
The only news here is that Google doesn't already index form content in drop down boxes and selection menus. Seems that would have been a fairly obvious extension.
Maybe not
Not so sure about PDFs as an image format - which is exactly what you have when you use PDF to hold scanned documents. I think the more interesting point is that they feel they have an OCR package good enough to be trustworthy. I wonder if it's based on the Tesseract OCR software that they adopted a while back?
I played with it for a while, and got very poor results from the command line. Even when I made a png or bmp of a full screen single word "HELLO" in 200 pixel font with GIMP (about as perfect as input gets!) I'd often get "HEHO" or "H3H0" or god only knows what else.
Of course, this is when the project relaunch was first announced a year or two ago, I certainly hope it's better now! Looking at their web page, it does appear that there's some significant activity going on. Yay Google!
Maybe I'll try it again, and see if it's worth using yet?
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Very soon they will start evaluating JavaScript too; that will shed more light on the dark internet.
Some kid's blog will have a new entry "How did I crash Google?"
Deep web is information buried under layers that are not easily penetrable by current indexing tech.
Dark web can either be physically separate from the internet or a virtual network that is hidden through encryption, secrecy, or both.
Women are like electronics: you don't know how damaged they are until you try to turn them on.
But in their Repairing Aluminum Wiring example, the PDF reads:
and the Google HTML reads:
Maybe this IS one of their better examples.
But the difference between web and net is probably not as important as the difference between deep and dark.
You have a good point. If the program could determine which values are undefined, and what the defined portions of the problem are, then I think I have a solution. It would be similar to what happens to your program code as it's being compiled. The compiler doesn't care what the actual variable is, just if that variable is the same as another.
For your solution, the database entry would be something like this:
(arbitrary value 1)^2 = (arbitrary value 2)^2 + (arbitrary value 3)^2, (arbitrary value 1)!=(arbitrary value 2) && (arbitrary value 1)!=(arbitrary value 3) && (arbitrary value 2)!= (arbitrary value 3)
Then any symbol could be transformed into these arbitrary values, and equality would only be based on same symbols within a single equation.
I'm sure there is a much simpler way of stating this, but I'm at a loss for words. Hopefully you can understand what I'm trying to say.
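One simpler way to state it might be: rename every symbol to a placeholder in order of first appearance, then compare the resulting strings. Below is a rough sketch of that idea only. It assumes variables are single ASCII letters and equations are plain text like "c^2 = a^2 + b^2"; the function name and the v1, v2, ... placeholders are made up here, and it does not handle reordered terms, which would need a real computer algebra system.

```python
import re

# Runs of letters, runs of digits, or any single non-space character.
_TOKEN = re.compile(r"[A-Za-z]+|\d+|\S")
_VARIABLE = re.compile(r"[A-Za-z]")


def canonical_form(equation):
    """Rename single-letter variables to v1, v2, ... by first appearance."""
    names = {}
    out = []
    for token in _TOKEN.findall(equation):
        if _VARIABLE.fullmatch(token):
            if token not in names:
                names[token] = "v%d" % (len(names) + 1)
            out.append(names[token])
        else:
            # Numbers, operators and multi-letter words (sin, log) are kept.
            out.append(token)
    return "".join(out)


print(canonical_form("c^2 = a^2 + b^2"))
print(canonical_form("z^2 = x^2 + y^2"))
print(canonical_form("x^2 = x^2 + y^2"))
```

The first two equations collapse to the same string, so a database lookup would match them. The third reuses a symbol, so it gets a different string, which covers the "value 1 != value 2" conditions without storing them separately.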
"Baby shark wisdom cleaner" 2.0 ?
Squirrel!