Google Sheds Light On 'Dark Web' With PDF Search

Re:1000 years of darkness coming to an end? by Quantos · 2008-10-31 09:49 · Score: 1

Hey Bill, it's great to see that they finally gave you that day pass....

--
Some people are only alive because it's against the law for me to hunt them down and kill them.

Re:1000 years of darkness coming to an end? by philspear · 2008-10-31 09:49 · Score: 5, Funny

After reading that, I've come to the conclusion that some parts of the internet should definitely remain in the dark.

Just what we needed by SuperBanana · 2008-10-31 09:50 · Score: 0

"Google announced that it was looking for ways for its search engine to index HTML forms such as drop-down boxes or select menus that otherwise couldn't be found or indexed."

Great. So basically, it's going to fuss with forms and pretend to be a user clicking "submit". That seems like a BRILLIANT idea, because, naturally, every HTML form out there is used purely for navigation...

If people want their sites to be indexed, they shouldn't use forms for navigation. It's not rocket science.

--
Please help metamoderate.

Re:Just what we needed by denmarkw00t · 2008-10-31 09:58 · Score: 4, Informative

I think you've got this wrong, to some extent. I don't think it's going to "submit" to see what options go where, but rather just index the options from forms to give a better idea of what's going on in the page - suddenly Google can go "Hey, this isn't just a form, but it's a form pertaining to X." and thus make their results more relevant by being able to index more of a site as a whole.
Re:Just what we needed by Reckless Visionary · 2008-10-31 09:58 · Score: 4, Informative

If people want their sites to be indexed, they shouldn't use forms for navigation. It's not rocket science.
This isn't about people who want their sites indexed. It's about sites that Google wants to index, but which aren't designed to be indexed. If you prefer not to be indexed, Google says they will abide by robots.txt.
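
For anyone who wants to see how such a rule would be evaluated, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the /search path and the example.com URLs are all made up for illustration; it only shows how a well-behaved crawler reads the file, not what Google's crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every crawler out of /search,
# the path where an imagined form handler lives.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A URL produced by submitting the form is blocked; a plain page is not.
form_result_allowed = may_fetch(
    ROBOTS_TXT, "Googlebot", "http://example.com/search?model=XJ123")
static_page_allowed = may_fetch(
    ROBOTS_TXT, "Googlebot", "http://example.com/products.html")
```

A Disallow rule is a prefix match on the path, so one line covers every query string the form could generate.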

--
I think I'll stop here.
Re:Just what we needed by spitzak · 2008-10-31 10:03 · Score: 4, Insightful

I think it is just going to look in the contents of the controls. This would be really useful, for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.
Re:Just what we needed by Arthur Grumbine · 2008-10-31 10:50 · Score: 3, Funny

for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.
Shenanigans! And I've been looking everywhere for that elusive XJ123, since the manufacturer stopped producing it. How dare you get my hopes up!

--
Now that I think about it, I'm pretty sure everything I just said is completely wrong.
Re:Just what we needed by Ed Avis · 2008-10-31 11:00 · Score: 2, Interesting

Well, if it's a form with a GET request then it should be safe to request it, and it's used merely to display some information. Forms using the POST method, which performs an action, are less safe and I'd hope Google is not trying to spider those.

If people want their sites to be indexed, they shouldn't use forms for navigation.
So the alternative is automatically generating pages and pages of links to every possible item in the database just so that search engines can follow them? If a form is the most natural and convenient interface for a human there's no reason the spider can't use it too.
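
A rough sketch of the GET-only rule, assuming a spider that simply skips anything declared as POST. The class name, function name and sample page are hypothetical; only the standard html.parser module is used.

```python
from html.parser import HTMLParser

class FormMethodCollector(HTMLParser):
    """Collect (action, method) for every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attributes = dict(attrs)
            # HTML treats a missing method attribute as GET.
            method = (attributes.get("method") or "get").lower()
            self.forms.append((attributes.get("action") or "", method))

def crawlable_actions(html: str) -> list:
    """Return the actions of forms a cautious spider might try: GET only."""
    collector = FormMethodCollector()
    collector.feed(html)
    return [action for action, method in collector.forms if method == "get"]

# Made-up page with one GET form, one POST form, and one with no method.
SAMPLE_PAGE = """
<form action="/catalog" method="get"><select name="model"></select></form>
<form action="/order" method="POST"><input name="qty"></form>
<form action="/lookup"><input name="q"></form>
"""

safe_actions = crawlable_actions(SAMPLE_PAGE)
```

This only inspects what the page declares. A site that changes state on a GET request would still be hit, which is exactly the case that makes the rule less safe in practice.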

--
-- Ed Avis ed@membled.com
Re:Just what we needed by martin-boundary · 2008-10-31 11:44 · Score: 1

What if I don't want to buy a widget? I'd like to see a Google filter which hides all the product pages from its listing. As I see it, those kinds of pages are just spam. Who wants to buy the same product from a zillion different places all over the web?
It might actually be useful for a search engine to read the product name in a pulldown as a simple indicator that the page should be penalized as content free. I would probably pay to use that kind of search engine.
Re:Just what we needed by Anonymous Coward · 2008-10-31 12:20 · Score: 0

No, that's not what it's doing at all. Please stop perpetuating the idiot Slashdot stereotype and research beyond the summary before you open your mouth.
Re:Just what we needed by ShieldW0lf · 2008-10-31 13:17 · Score: 1

You're mistaken.

"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes and radio buttons on the form, we choose from among the values of the HTML," they noted in a blog post. "Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting and includes content not in our index, we may include it in our index much as we would include any other Web page."

This is not very far removed from a brute force hack on your website. Better make sure you do proper fuzz testing
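
To make the quoted description concrete, here is a small sketch of the "generate URLs from the form's values" step. It is not Google's code; the function name, the form fields and the example.com address are invented, and it only covers select menus and radio buttons, where the candidate values come straight from the HTML.

```python
from itertools import product
from urllib.parse import urlencode

def candidate_urls(action: str, choices: dict) -> list:
    """Build one GET URL per combination of the offered input values."""
    names = sorted(choices)  # fixed order so the output is reproducible
    urls = []
    for combo in product(*(choices[name] for name in names)):
        urls.append(action + "?" + urlencode(list(zip(names, combo))))
    return urls

# Hypothetical form: a model pulldown and a language radio button.
FORM_CHOICES = {
    "model": ["XJ123", "XJ124"],
    "lang": ["en", "de"],
}

urls = candidate_urls("http://example.com/catalog", FORM_CHOICES)
```

The number of URLs is the product of the option counts, so a form with a few large pulldowns grows quickly. That is presumably why the quote talks about trying URLs and keeping only results that look valid and new.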

--
-1 Uncomfortable Truth
Re:Just what we needed by rohan972 · 2008-10-31 14:00 · Score: 1

What if I don't want to buy a widget? I'd like to see a Google filter which hides all the product pages from its listing.
Which, incidentally, would probably also boost Google's ad business since they would no longer be providing free advertising. Sales in the sponsored links, info in the search results, sounds good to me.

--
http://marriedmansexlife.com/
Re:Just what we needed by Firehed · 2008-10-31 15:55 · Score: 1

That's under the completely unsafe assumption that forms are being used properly. There have been numerous instances of people putting full SQL queries (with DB connection data) in a GET form - see TheDailyWTF.
Though I suppose that's a bad example, as it would be really damn easy for Google to index THOSE sites. Just swap in a SELECT * and you're all set :)

--
How are sites slashdotted when nobody reads TFAs?
Re:Just what we needed by AVryhof · 2008-10-31 23:47 · Score: 3, Funny

for instance if you search for "Widget Model XJ123" it will now find a page by a manufacturer where the only place they list it is in a pulldown list that lets you choose the product to buy.
Shenanigans! And I've been looking everywhere for that elusive XJ123, since the manufacturer stopped producing it. How dare you get my hopes up you insensitive clod!
There. Fixed that for you.

--
Make America grate again!
Re:Just what we needed by Tweenk · 2008-11-01 00:40 · Score: 1

It looks like they will only use GET requests, not POST requests. You may have trouble if you use GET requests to make changes on your site (which nearly everybody with minimal experience knows you should never do).

--
Those who would give up liberty to obtain working drivers, deserve neither liberty nor working drivers.
Re:Just what we needed by Anonymous Coward · 2008-11-01 01:33 · Score: 0

This is an insolence!
It's nothing short of automated vandalism.
This will taint online polls, may cause garbage posts in forums that don't have captchas, etc...
How can this be legal when even portscanning is sometimes considered a break-in attempt?!
Re:Just what we needed by ronocdh · 2008-11-01 06:44 · Score: 1

Is it funny or weird or normal that the only hit on Google for "Widget Model XJ123" is this thread?
Re:Just what we needed by Zaiff Urgulbunger · 2008-11-01 18:04 · Score: 1

I suspect they'd only submit a form if the method is "get" rather than "post"... which technically is okay, although in practice it will likely upset some websites!
Re:Just what we needed by Mr. Slippery · 2008-11-02 06:42 · Score: 1

Is it funny or weird or normal that the only hit on Google for "Widget Model XJ123" is this thread?

When in doubt, remove possibly extraneous search terms. I had to dig, but I found an xj123 model...

--
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood

Cool, and definitely worthwhile, but... by religious freak · 2008-10-31 09:57 · Score: 1

Increasing the number of items that can be searched is great, but the actual searching algorithms really haven't gotten THAT much better in the past 3 years or so.

Obviously, you can't have breakthroughs every year (or maybe even every 5 years) but search as an algorithm still has much more room to improve. I'd love to see an improvement in that, as opposed to just increasing the number of pages indexed.

Still cool though...

--
If you can read this... 01110101 01110010 00100000 01100001 00100000 01100111 01100101 01100101 01101011

Re:Cool, and definitely worthwhile, but... by mikael · 2008-10-31 11:05 · Score: 1

I'm still waiting for a context modifier for keywords, so when you type something like 'mechanics:teeth' you get all the technical matches for gears, and when you type 'medicine:teeth' you would get all the medical matches for dentistry.

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
Re:Cool, and definitely worthwhile, but... by Firehed · 2008-10-31 16:00 · Score: 4, Insightful

Why not just search for "teeth medicine" then? Google hasn't relied on direct keyword matching alone for years now (for example, a search for "computer" may yield results containing synonyms such as "PC" or "Mac" even if the original keyword "computer" doesn't appear on the site at all).
Remember that Yahoo started out as a category browser in its very early days, and now categories are really just another keyword. Google and all of the other search engines are designed to work well for the lowest common denominator of internet users - as someone with a 3-digit UID, I imagine you're not in that group. Trying to outsmart Google will probably just make its algorithm feel unnatural/broken.

--
How are sites slashdotted when nobody reads TFAs?
Re:Cool, and definitely worthwhile, but... by Cochonou · 2008-10-31 19:34 · Score: 1

Have you had a look at exalead.com? It makes good strides in this direction (even if it fails your mechanics:teeth context modifier).
Re:Cool, and definitely worthwhile, but... by Dan541 · 2008-10-31 21:38 · Score: 1

When are they going to add Gmail contents to their search results?

--
An SQL query goes to a bar, walks up to a table and asks, "Mind if I join you?"
Re:Cool, and definitely worthwhile, but... by evilviper · 2008-11-01 06:14 · Score: 1

Just use http://www.clusty.com/ . The search results are just as good as Google's, and it generates a list of categories that you can select from.
Admittedly, "mechanical" isn't in there... The categories are quite a bit more specific, such as "baby", "shark", "wisdom", "cleaner", etc.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:Cool, and definitely worthwhile, but... by religious freak · 2008-11-02 19:15 · Score: 1

I'd settle for being able to do any kind of special character search using google, or any search engine, for that matter. When trying to look up programming related content, the lack of ability to search by special characters can be a real pain.

--
If you can read this... 01110101 01110010 00100000 01100001 00100000 01100111 01100101 01100101 01101011

Dark web? Deep Web! by Anonymous Coward · 2008-10-31 10:06 · Score: 2, Insightful

Referenced article is talking about the "deep web", not dark web.

Re:1000 years of darkness coming to an end? by BorgAssimilator · 2008-10-31 10:07 · Score: 2, Informative

Well yes, but it doesn't mean that no one will want to try and find it.

Just look at /b/...

--
"Intelligence has nothing to do with politics!"
-Londo Mollari

'Scanning is the reverse of printing.' by irishdaze · 2008-10-31 10:07 · Score: 0, Troll

"Scanning is the reverse of printing." -- WTF?! Because of artifacts? And isn't this what View as HTML has ALWAYS been about? Points awarded for techtard clarity, but the person at Google who thought writing a press release aimed at techtards should be firmly smacked.

--
-- Dedicated Cthulhu cultist since 1982 A.C.E.

Re:'Scanning is the reverse of printing.' by fiannaFailMan · 2008-10-31 10:20 · Score: 4, Informative

"Scanning is the reverse of printing." -- WTF?! Because of artifacts?
And isn't this what View as HTML has ALWAYS been about?
Points awarded for techtard clarity, but the person at Google who thought writing a press release aimed at techtards should be firmly smacked.
Calm down please. The guy is trying to explain the concept to a broader audience, or 'techtards' as you so pompously refer to them along with your out-of-context quote, and he's doing a fine job of explaining how it is hard for a computer to interpret scanned text. The days are gone when the web was the preserve of nerds with zero social skills. Get over it.

--
Drill baby drill - on Mars
Re:'Scanning is the reverse of printing.' by PotatoFarmer · 2008-10-31 10:27 · Score: 5, Informative

I'm not sure if you got the point of this - it's about using a form of OCR to translate embedded document images within a PDF, rather than simply sucking the text out of the PDF itself, as you rightly point out is already available in the View as HTML option for PDF search results.

Scanning is the reverse of printing because, well, it's the reverse of printing. When you're scanning something, you're taking a purely human-readable document and translating its contents into a machine interpretable form. This is pretty much the exact opposite of printing from a computer.
Re:'Scanning is the reverse of printing.' by Anonymous Coward · 2008-10-31 14:23 · Score: 0

Hmm. Yep, they are decompiling PDF documents to extract information. This would violate TOS on certain websites...
Re:'Scanning is the reverse of printing.' by KamuZ · 2008-10-31 18:13 · Score: 1

configure robots.txt please! :)
Re:'Scanning is the reverse of printing.' by simplerThanPossible · 2008-11-01 00:58 · Score: 1

images of text, not images of things. To obtain text from a photograph of a person, or a painting, is beyond even Google at the moment...
BTW: I wish Adobe used this OCR, so search works on a pdf of scanned text.

small nit-pick by RJBeery · 2008-10-31 10:16 · Score: 2

It's DEEP web, not dark. This is the internet, not astrophysics.

Dark? by Anonymous Coward · 2008-10-31 10:22 · Score: 0

I always thought the "dark" web was the seedy underside of password-protected forums and such where warez pirates and so on operated, releasing cracks for software and then letting it trickle down into more visible channels. Well, before torrents and TPB, at least.

BS, TFA says Yahoo is a search engine! by Eganicus · 2008-10-31 10:28 · Score: 1

I just started reading and it says "powerful search engines such as Google and Yahoo". Yahoo is a search engine? A Powerful one? It's an advertising index, Spam search, Ad finder? I call BS, no one thinks Yahoo is a powerful search engine!

Re:BS, TFA says Yahoo is a search engine! by hairyfeet · 2008-10-31 13:09 · Score: 1

Actually I switched back to Yahoo search from Google search and find it's become pretty damned good. Especially the little "more" tab, which when pulled on, say "Dead Space", will give me reviews, codes, walkthroughs, etc. Compare that to the "more" in Google, which gives me crap like Google blogs. If you haven't tried it lately, they have really gotten a lot better. I guess what they really needed was the fear of MSFT put in them.

--
ACs don't waste your time replying, your posts are never seen by me.

Image search needs help I guess by Eganicus · 2008-10-31 10:33 · Score: 1

Every time I use image search and see most are not related, I look at Google asking ME to help them label pictures to help. I feel guilty for not helping, and comfort myself knowing Google has a far better shot at image recognition than I ever will.

Re:Dark web? Deep Web! by HishamMuhammad · 2008-10-31 10:36 · Score: 1

Never heard of either before. Looks like there's a competition going on to see who comes up with the next buzzword.

--
The filesystem is the package manager

There are "dark webs", but this isn't them. by Jane Q. Public · 2008-10-31 10:43 · Score: 4, Insightful

A "dark web" is a private network, accessible by members over the internet but not accessible to outsiders. (A VPN is one example of a kind of "dark web".)

But as you say, this is something completely different.

Re:There are "dark webs", but this isn't them. by wdebruij · 2008-10-31 19:52 · Score: 1

Indeed, it is called the deep web.
Even the first link uses that term. The submitter messed up (and the editors didn't catch it. News at 11)
Re:There are "dark webs", but this isn't them. by Anonymous Coward · 2008-10-31 22:15 · Score: 0

That would be a dark NET, not web

Re:Dark web? Deep Web! by element-o.p. · 2008-10-31 10:51 · Score: 1

The Deep, Deep, Dark, Dark, Deep, Dark Web...coming soon to a web browser near you!

--
MCSE? No, sir...I don't do Windows. Yes, I am an idealist. What's your point?

Not so new? by Archon-X · 2008-10-31 11:04 · Score: 5, Interesting

Google has long favoured PDFs - and gives them boosted results, on the assumption that anyone who makes a PDF has something serious to say, I guess.

You may have noticed of late that people are wise to this - there are a bunch of sites that are embedding popular search terms / results in PDF files, and clustering their sites with adverts.

Re:Not so new? by gad_zuki! · 2008-10-31 11:36 · Score: 1

I've noticed this. Lots of the top 5 search results for the items I usually search are PDFs. I just figured that publishing a PDF is something large organizations usually do, and large organizations tend to have a higher PageRank. I'm not sure if PDF is in itself something that can raise a score.
Re:Not so new? by Spy Hunter · 2008-10-31 12:20 · Score: 1

The new part is that Google can now index PDFs that have no text, only embedded images, via OCR. These are pretty common as a way of posting scanned multi-page documents online; for example, many older academic papers are posted this way. Google Scholar should become more useful due to this.

--
main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}

More to it by spud.dups · 2008-10-31 11:46 · Score: 2, Insightful

What I would really like to see is OCR for mathematical formulas, with the results stored in some standard format. Using a standard input, like LaTeX, the engine would search for mathematical equations. Right now I find it a pain to look for a formula that I know exists, but whose name I don't know.

This would help bring together a lot of research that is done, but hard to sort through. Then, implement a smart system using a program like Mathematica to find variations of the equations, etc., and see where duplicates exist. Maybe we'll find that we've discovered things that weren't looked at thoroughly enough.

Re:More to it by martin-boundary · 2008-10-31 13:40 · Score: 1

That won't solve the problem you're having, since mathematical formulas contain arbitrary variable names.
So if you're looking for the pythagorean formula z^2 = x^2 + y^2 say, (ie you've forgotten the name), then you'll miss the documents which contain c^2 = a^2 + b^2, etc.
And that's the easy case, because a lot of people write z^2 = x^2 + y^2. What if for some reason your natural inclination is to type u^2 + f^2 = K^2? You'd be missing out on virtually every relevant link, because most mathematicians like to keep the case consistent within categories of objects, and tend to label variables so that they are alphabetically close together.
Re:More to it by PPH · 2008-10-31 14:39 · Score: 1

Actually, if the problem of recognizing a formula in text or graphics has been solved, the second part, graphing the formula, normalizing the graph and storing/retrieving graphs that meet certain criteria is quite simple.
In other words, getting from the graphics to z^2 = x^2 + y^2 is the tough part. Once you're there, understanding that z^2 = x^2 + y^2 is equivalent to a^2 + b^2 = c^2 is easy.

--
Have gnu, will travel.
Re:More to it by martin-boundary · 2008-10-31 15:01 · Score: 1

Once you're there, understanding that z^2 = x^2 + y^2 is equivalent to a^2 + b^2 = c^2 is easy.

What I'm saying is that's the tough part, whereas the OCR is comparatively easy (eg InftyReader).
You can only transform an equation if you know its meaning (ie the rules of transformation embodied by the context in which it is being written). And understanding the meaning is a hard AI problem.
Re:More to it by PPH · 2008-11-01 05:49 · Score: 1

And understanding meaning is a hard AI problem
Not within a restricted knowledge domain. Mathematics, engineering, physics, etc. are some excellent examples of such domains.
Been there, done that. Back when the Internet was still text based.

--
Have gnu, will travel.
Re:More to it by martin-boundary · 2008-11-01 12:14 · Score: 1

Mathematical formulas on their own are not a restricted domain. The reason is that the symbols and operations are overloaded to such an extent that you cannot, by looking only at a formula, know what it means. The surrounding context is completely necessary.
For example, the "pythagorean" formula discussed above, z^2 = x^2 + y^2, doesn't carry any restrictions that tell you something. Are the variables numbers? In what kind of range? Are they matrices? Are they operators? Are the "2"s indices (labels) or exponents?
You might say it doesn't matter, it's just a string. But if you don't know the answer to the above questions, you can't say if a^2 = b^2 + c^2 in another document is a substitute string that works just as well for the query.
What you get when you do a query on formulas as strings is just noise. For an illustration of what I mean, try googling the string "39". There are too many overlapping domains which use this string to be able to find what a random person might be looking for. The same is true with formulas on their own.
Re:More to it by PPH · 2008-11-02 07:55 · Score: 1

The reason is that the symbols and operations are overloaded to such an extent that you cannot, by looking only at a formula, know what it means.
So, how do humans read and "understand" such a formula, sitting by itself, with no surrounding context? Answer: They don't. The same holds true for machines. The following equation: z^2 = x^2 + y^2 only makes sense if the terms and notation are defined for the context, most likely in the surrounding text. Likewise, typing in the search term: c^2 = a^2 + b^2 doesn't give either a human or a machine enough to go on. In either case, there are two approaches. One, prompt the user for further constraints. Or two, the 'Google' response, which is to list every possible solution.
So far, we haven't addressed anything not covered in a 101 level class on declarative languages. A decade ago.

--
Have gnu, will travel.
Re:More to it by martin-boundary · 2008-11-02 12:47 · Score: 1

In either case, there are two approaches. One, prompt the user for further constraints. Or two, the 'Google' response, which is to list every possible solution.

Precisely, and both known approaches are imho useless for the purpose of the OP, which is to type in an equation and obtain relevant documents in the case that he doesn't remember the context or wants variations.
The 'google' type response for formulas has high recall and very low precision. In fact, it effectively exists already for code searches, eg try searching for "i++".
The 'further constraint' questioning approach is redundant for the problem at hand: if the searcher knew the constraints, then he'd be able to do a text search for the _problem description_ which would give him much higher precision right off the start.

Re:1000 years of darkness coming to an end? by Anonymous Coward · 2008-10-31 12:22 · Score: 0

For instance, the goatse guy's innards? Yeah, I'd rather he had kept those where the sun don't shine, too.

Please add DjVu by wikinerd · 2008-10-31 12:59 · Score: 1

Nice feature, but I think it only works with PDF? I would love to see the same with DjVu as well.

*SCANNED* PDFs by DavidD_CA · 2008-10-31 13:16 · Score: 1

How about adding the word *scanned* to the headline, just as in the original headline?

That way others won't have to read the summary going "Hey, I thought Google was searching PDFs for the last 10 years."

--
-David

*sigh* eternal september by eltaco · 2008-10-31 14:06 · Score: 1

dark web.. oh geez. eternal September has only just started.
apparently the world at large loves to shit on standards and practices.
it's been a while since search engines actually returned results I was looking for. google, yahoo, msn, metacrawler,.. they all want my money. "-com" + adblock doesn't really help anymore. I'm so sick and tired of the net. it once was the best thing that ever happened to the world. now it's the hyper-communication tool for fart jokes and perversion.
guess that tells you a lot about humanity.

--
It's not about fate, it's about character.
there be no shelter here, the frontline is everywhere!

Re:*sigh* eternal september by chebucto · 2008-10-31 14:38 · Score: 1

I thought you were being cynical, but then I found http://www.fart-joke.com/ . Ah, well, all good things must come to an end.
How ironic that the uselessness of the web as a serious communications tool should be discussed on the web.

--
The English word fart is one of the oldest words in the English vocabulary.
Re:*sigh* eternal september by eltaco · 2008-10-31 15:28 · Score: 1

just don't google bukkake or hentai.
but, yeah, I get your point, there are still safe havens. granted. /. being one of the very few.
you know what I like about genmay.net or somethingawful.com? they once spearheaded the development that the net is now witness to. I visited them regularly for my local and esoteric laugh.
but then that shit hit mainstream - it was just the logical conclusion to the net. now "tits or gtfo" is common - same as 1337 once was a marker, now it's public and even grounds for bemusement. "OMG EPIC FAIL!" is common lingo. and all these people have in common is a general interest for the net. they aren't programmers or admins (devs or IT specialists, for those just joining us).
maybe I'm an elitist prick. maybe I'm just holding on to a past that once was - but one thing I know for sure, the future that I hoped for 10 years ago is surely not the present we have today.

--
It's not about fate, it's about character.
there be no shelter here, the frontline is everywhere!
Re:*sigh* eternal september by ciderVisor · 2008-11-01 07:32 · Score: 1

just don't google bukkake or hentai.
Well, not until you get home, anyhow...

--
Squirrel!

What'll be really cool is.... by moosesocks · 2008-10-31 15:02 · Score: 1

It'll be even cooler when Google are able to automatically detect things like citations and references, and add hyperlinks as appropriate.

It still sort of bugs me that scientific papers are written in LaTeX, and not hypertext, especially considering that the web (in its current form) originated at CERN.

--
-- If you try to fail and succeed, which have you done? - Uli's moose

Re:What'll be really cool is.... by Anonymous Coward · 2008-11-01 03:44 · Score: 0

IEEE, for example, does not allow hyperlinks in PDF documents. Otherwise you could just add a URL where more content is publicly available. For some reason they don't even allow hyperlinks within the document (chapters, figures, ...).

So? by jlarocco · 2008-10-31 15:17 · Score: 1

There's a module in CPAN for this. It rips out the images and runs them through Tesseract. It's worked well the few times I've tried it. Certainly well enough for search engine indexing.

Also, my understanding of the "dark web" concept was that it referred to sites that had no links going to them, so no spiders are able to access them. I'm not seeing how any of this would fix the "problem".

The only news here is that Google doesn't already index form content in drop down boxes and selection menus. Seems that would have been a fairly obvious extension.

--
Maybe not

Tesseract by mcrbids · 2008-10-31 16:21 · Score: 4, Interesting

Not so sure about PDFs as an image format - which is exactly what you have when you use PDF to hold scanned documents. I think the more interesting point is that they feel they have an OCR package good enough to be trustworthy. I wonder if it's based on the Tesseract OCR software that they adopted a while back?

I played with it for a while, and got very poor results from the command line. Even when I made a png or bmp of a full screen single word "HELLO" in 200 pixel font with GIMP (about as perfect as input gets!) I'd often get "HEHO" or "H3H0" or god only knows what else.

Of course, this was when the project relaunch was first announced a year or two ago; I certainly hope it's better now! Looking at their web page, it does appear that there's some significant activity going on. Yay Google!

Maybe I'll try it again, and see if it's worth using yet?

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.

Re:Tesseract by Anonymous Coward · 2008-11-01 09:38 · Score: 0

Even when I made a png or bmp of a full screen single word "HELLO" in 200 pixel font with GIMP (about as perfect as input gets!) I'd often get "HEHO" or "H3H0" or god only knows what else.
Try 12-point Times New Roman :)

javascript? by emiraga · 2008-10-31 18:57 · Score: 1

Very soon they will start evaluating JavaScript too; that will shed more light on the dark internet.

Some kid's blog will have a new entry "How did I crash Google?"

Re:Dark web? Deep Web! by Majik Sheff · 2008-11-01 02:45 · Score: 1

Deep web is information buried under layers that are not easily penetrable by current indexing tech.

Dark web can either be physically separate from the internet or a virtual network that is hidden through encryption, secrecy, or both.

--
Women are like electronics: you don't know how damaged they are until you try to turn them on.

Interesting that their example didn't work! by Chapter80 · 2008-11-01 03:08 · Score: 1

Google's first example shows some OCR failures. I figured that their "demo" would be scrubbed of errors - or at least they'd show you one of their better examples.

But in their Repairing Aluminum Wiring example, the PDF reads:

In 1972, manufacturers modified both aluminum wire and switches and outlets to improve the performance of aluminum wired connections Sale of the old style wire, switches and outlets still on dealers' shelves however, continued after 1972

and the Google HTML reads:

In 1972, manufacturers modified both aluminum wire and switches and outlets to improve the performance of aluminum wired connections Sale of the oÃa style wire, switches ano outlets stilf on dealers' shelves however, continued after 1972

Maybe this IS one of their better examples.
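
For a quick count of how far off it is, here is a back-of-the-envelope sketch that compares the two excerpts word by word. The function name is made up, and the positional comparison only works because both excerpts happen to have the same number of words; real OCR scoring would need a proper alignment.

```python
def ocr_mismatches(reference: str, ocr_output: str) -> list:
    """Pair up words by position and return the (expected, got) pairs that differ.

    Assumes both texts have the same number of words, which holds for this
    excerpt but not for OCR output in general.
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if len(ref_words) != len(ocr_words):
        raise ValueError("word counts differ; positional comparison is not meaningful")
    return [(r, o) for r, o in zip(ref_words, ocr_words) if r != o]

# The second sentence of the excerpts quoted above, shortened.
PDF_TEXT = "Sale of the old style wire, switches and outlets still on dealers' shelves"
GOOGLE_HTML = "Sale of the oÃa style wire, switches ano outlets stilf on dealers' shelves"

mismatches = ocr_mismatches(PDF_TEXT, GOOGLE_HTML)
error_rate = len(mismatches) / len(PDF_TEXT.split())
```

On this excerpt, three of the thirteen words ("old", "and", "still") come out wrong, and the first sentence comes through clean.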

Yes, that's true. by Jane Q. Public · 2008-11-01 05:46 · Score: 1

But the difference between web and net is probably not as important as the difference between deep and dark.

z^2 = x^2 + y^2 by spud.dups · 2008-11-01 06:09 · Score: 1

You have a good point. If the program could determine which values are undefined, and what the defined portions of the problem are, then I think I have a solution. It would be similar to what happens to your program code as it's being compiled. The compiler doesn't care what the actual variable is, just if that variable is the same as another.

For your solution, the database entry would be something like this:

(arbitrary value 1)^2 = (arbitrary value 2)^2 + (arbitrary value 3)^2, (arbitrary value 1)!=(arbitrary value 2) && (arbitrary value 1)!=(arbitrary value 3) && (arbitrary value 2)!= (arbitrary value 3)

Then any symbol could be transformed into these arbitrary values, and equality would only be based on same symbols within a single equation.

I'm sure there is a much simpler way of stating this, but I'm at a loss for words. Hopefully you can understand what I'm trying to say.
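
One way to sketch the idea in code, assuming every run of letters is a variable and renaming each one by order of first appearance. The function name and the v1, v2, ... placeholders are invented for illustration; this is only the renaming step, not a real formula search.

```python
import re

def canonical_form(formula: str) -> str:
    """Rename variables to v1, v2, ... in order of first appearance."""
    renamed = {}

    def placeholder(match):
        name = match.group(0)
        if name not in renamed:
            # Distinct symbols always get distinct placeholders.
            renamed[name] = "v%d" % (len(renamed) + 1)
        return renamed[name]

    # Drop whitespace, then treat every run of letters as one variable;
    # digits and operators are left untouched.
    compact = re.sub(r"\s+", "", formula)
    return re.sub(r"[A-Za-z]+", placeholder, compact)

# Same shape, different letters: the canonical forms agree.
same = canonical_form("z^2 = x^2 + y^2") == canonical_form("c^2 = a^2 + b^2")

# Same equation written the other way round: the canonical forms differ.
reordered = canonical_form("a^2 + b^2 = c^2") == canonical_form("z^2 = x^2 + y^2")
```

Here `same` is True and `reordered` is False, so renaming alone handles the choice of letters but not rearranged terms, and it says nothing about what the symbols mean.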

Re:z^2 = x^2 + y^2 by martin-boundary · 2008-11-01 13:01 · Score: 1

(See also my other comments on this thread). You're touching on issues which are problems of mathematical logic.
If you think of say z^2 = x^2 + y^2 as an expression that belongs to a formal grammar of arithmetic (computer languages have formal grammars too), then you could use the rules of the grammar to check when two formulas are equivalent, and you could store in your database a canonical form for each such formula. This sort of thing was actually proposed by the great mathematician Hilbert a hundred years ago to automate all of mathematics. Unfortunately, there are fundamental (mathematical) limits to such a program, which make it impossible to know if two random formulas are indeed the same in general.
But I think this is putting the cart before the horse, because mathematicians (and physicists and engineers) tend to write the same formulas, with the same symbols, in completely different contexts. So even if you can write down all the equivalent ways of expressing the formula z^2 = x^2 + y^2 in arithmetic, say, then you'll still get a lot of search hits for z^2 = x^2 + y^2 which have nothing to do with arithmetic (ie require a different formal grammar), and you're back where you started, namely, you have to look at the words in the text to know which subject is being discussed.
For example, you can't know (from the formula alone) if the "2" is the second power of each variable x, y, z, or if the "2" is merely a label, ie the second component of some set of vectors x, y, z. Indeed some people like to write the components of a vector v= (v_1, v_2) with lower indices, other people like to put the indices at the top v = (v^1, v^2). So in this case, the formula z^2 = x^2 + y^2 is merely one part of the vector equation z = x + y (the other part being z^1 = x^1 + y^1), and this has nothing to do with Pythagoras, but will show up if you search for the equation.
Re:z^2 = x^2 + y^2 by spud.dups · 2008-11-04 07:03 · Score: 1

Ah, point well taken. Thanks for explaining that more thoroughly.

Re:Dark web? Deep Web! by ciderVisor · 2008-11-01 07:23 · Score: 1

"Baby shark wisdom cleaner" 2.0 ?

--
Squirrel!
