Searching the 'Deep Web'

With the 10% that is crawled by Trigun · 2004-03-09 01:51 · Score: 5, Funny

being pretty much total crap, I'd really hate to see the other 90%!

Re:With the 10% that is crawled by Zone-MR · 2004-03-09 02:27 · Score: 4, Informative

It could actually be useful content.

Let me give you an example. I run a forum. The main index page doesn't contain much information, just an overview of the latest posts and a brief introduction.

The rest of the content is what people submit. Here is the problem. The pages are generated dynamically. They end up having URLs like http://domain/index.php?act=showpost&postid=1244

Google sees index.php as one page, and does not attempt to submit any data via get/post. This means that effectively the most valuable content is missed.

Of course making it crawl /?yada=yada links has problems, namely the possibility of getting stuck in an infinite loop where data and links are tracked using sessions, and an infinite number of URLs could potentially yield valid, although very similar, results.
Re:With the 10% that is crawled by hagardtroll · 2004-03-09 02:29 · Score: 1

What I hate is when google returns a result, then you click on it and it is just a teaser for some subscription information. Or a book that someone is selling. I am looking for real information, not a paid advertisement. Speaking of which...

The article linked to in the story was exactly that!

"Want to read the whole article? You have two options: Subscribe now, or watch a brief ad and get a free day pass. If you're already a subscriber log in here."
Re:With the 10% that is crawled by Turing+Machine · 2004-03-09 02:54 · Score: 2, Interesting

http://domain/index.php?act=showpost&postid=1244

Google sees index.php as one page, and does not attempt to submit any data via get/post.

Hmm... I see plenty of pages in Google that have URLs with GET parameters, so there must be some way of getting it to crawl them. Or am I misunderstanding what you're saying? Maybe the key here is to provide an alternate route to those pages without doing anything fancy (drop-down menus, radio buttons, etc.). Just generate another page that contains a regular link to all your pages. You could hide that page from your regular users by, say, linking it to a 1x1 pixel transparent GIF. A robot will find it, but most of your users won't even notice.
Re:With the 10% that is crawled by Zone-MR · 2004-03-09 02:59 · Score: 2, Interesting

Hmm... I see plenty of pages in Google that have URLs with GET parameters, so there must be some way of getting it to crawl them. Or am I misunderstanding what you're saying? Maybe the key here is to provide an alternate route to those pages without doing anything fancy (drop-down menus, radio buttons, etc.). Just generate another page that contains a regular link to all your pages. You could hide that page from your regular users by, say, linking it to a 1x1 pixel transparent GIF. A robot will find it, but most of your users won't even notice.

Yeah, I can see that google sometimes lists pages with GET content in its index. It doesn't want to do it for a lot of pages though, and I haven't figured out why. There seems to be nothing different in the HTML.

Hypothetically speaking, what's there to stop someone doing a:

<?php
// rand() can't be interpolated inside a double-quoted string, so concatenate it.
print("<a href='thispage.php/" . rand() . "'>Some page...</a>");
?> ... and looping google?
Re:With the 10% that is crawled by MyHair · 2004-03-09 02:59 · Score: 1

I've noticed some ?parameter=value URLs from Google, usually for Slashdot, so I'm guessing they enable that for certain sites by hand.

However, you can modify Apache and/or PHP to use URL/URI names for dynamic pages. You could remap your example query to http://domain/showpost/1244/ and the engines will probably index it. I'm not sure why more message board software doesn't do this. (Okay, probably because it requires httpd & server-side processing coordination.)
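A rough sketch of that remapping idea, written in Python rather than as Apache/PHP configuration. The `/showpost/<id>/` layout comes from the example above; the function name, the regex, and the `/index.php` target are assumptions for illustration only, not how any particular message board does it.

```python
import re
from urllib.parse import urlencode

# Hypothetical clean-URL pattern: /showpost/<digits>/ (trailing slash optional).
CLEAN_URL = re.compile(r"^/showpost/(\d+)/?$")

def to_query_url(path):
    """Map a static-looking path back to the dynamic query URL.

    Returns None when the path does not match the clean-URL pattern.
    """
    match = CLEAN_URL.match(path)
    if match is None:
        return None
    query = urlencode({"act": "showpost", "postid": match.group(1)})
    return "/index.php?" + query

print(to_query_url("/showpost/1244/"))
```

The server would do this translation internally, so the crawler only ever sees the static-looking form.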
Re:With the 10% that is crawled by Anonymous Coward · 2004-03-09 03:07 · Score: 0

In that case, get a day pass, and then modify your Salon cookie to extend that pass for, say, three years. It's pretty simple.
Re:With the 10% that is crawled by Turing+Machine · 2004-03-09 03:12 · Score: 1

print("<a href='thispage.php/" . rand() . "'>Some page...</a>"); ... and looping google?

As I understand it, looping is in fact a big problem for robots. There are a number of ways of getting around it. A brute-force method would be to just limit the search tree depth to say, 20 levels or so (I pulled that number out of my butt, of course, so it would need some tuning based on how many levels you're likely to see on a real site).

It wouldn't surprise me to learn that more sophisticated robots (e.g., Google) actually do fairly sophisticated content analysis of the pages they retrieve to decide whether what they're seeing is really "new" or the same junk they've already seen before.
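To make both ideas concrete, here is a minimal sketch of a crawler loop with a depth cap and a crude "seen this already" check. It assumes a caller-supplied `fetch` function (hypothetical) that returns a page body and its links, and it uses an exact hash of the body as a stand-in for real content analysis, which would presumably be much fuzzier. Nothing here is claimed to be how Google or any real robot works.

```python
import hashlib
from collections import deque

def crawl(start_url, fetch, max_depth=20):
    """Breadth-first crawl with a depth limit and exact-duplicate detection.

    `fetch` is a caller-supplied function: url -> (body_text, list_of_links).
    Returns the URLs whose content was new, in the order they were visited.
    """
    seen_urls = {start_url}
    seen_bodies = set()
    indexed = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        body, links = fetch(url)
        fingerprint = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if fingerprint in seen_bodies:
            continue  # same content under a new URL: don't index, don't follow
        seen_bodies.add(fingerprint)
        indexed.append(url)
        if depth >= max_depth:
            continue  # brute-force cap on how deep the robot will go
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append((link, depth + 1))
    return indexed
```

With a page that links to an endless chain of identical copies of itself, this stops after the first page; with endlessly unique pages, it stops at the depth cap.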
Re:With the 10% that is crawled by dealsites · 2004-03-09 04:55 · Score: 2, Informative

I agree that the search engines do not index dynamically generated pages very well. This page on my site http://www.dealsites.net/index.php?module=MyHeadlines&func=view&myh=menu&gid=22&pid=2&eid=504&tid=300&context= hasn't seemed to attract any of the search engines yet. I'm not sure why, the data changes hourly and I have a direct link to that page on my site.

However, when search engines do start doing deep crawls, especially if they do POSTs and GETs, then the bandwidth of the web site will go up tremendously. While it is important to get crawled, what happens when your site uses more bandwidth for search engines than users? Also what would prevent other companies from developing their own search engines? Then you might have 20 or more search engines doing deep crawls every month. Many websites are operated on low-cost low-bandwidth hosting plans.
Re:With the 10% that is crawled by bheer · 2004-03-09 05:21 · Score: 2, Informative

Yeah, I can see that google sometimes lists pages with GET content in its index. It doesn't want to do it for a lot of pages though, and I haven't figured out why. There seems to be nothing different in the HTML.

One word: backlinks. Pages, even with request parameters, that get linked to from lots of popular (high-pagerank) sites get indexed.

--
Go somewhere random
Re:With the 10% that is crawled by Anonymous Coward · 2004-03-09 05:36 · Score: 0

"Of course making it crawl /?yada=yada links has problems,"

Which is why google does not crawl them. Which is probably good. I have a small photo gallery (http://bloodgate.com/photos/) and while it contains only 440something images, google could "view" each image in almost infinite variations. With blue, red, green etc color schemes, blue, red, white, black etc backgrounds (exactly 24 bits of different background colors - but luckily only a dozen or so are reachable via links), footer on/off, menu on/off, etc etc etc. And all these options can be combined.

$search_engine would never finish crawling this database, although it would not find new content.
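For a feel of the numbers, here is a back-of-the-envelope count using only the options with a stated size: roughly 440 images, a 24-bit background colour, footer on/off and menu on/off. The colour schemes are left out because their number isn't given, so this is a lower bound on the valid URLs, not on what is actually reachable via links.

```python
images = 440                 # "440something" images, per the post
background_colors = 2 ** 24  # 24-bit background colour
footer_options = 2           # footer on/off
menu_options = 2             # menu on/off

variations = images * background_colors * footer_options * menu_options
print(f"{variations:,}")  # 29,527,900,160
```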

Cheers,

Tels
Re:With the 10% that is crawled by Persol · 2004-03-09 05:42 · Score: 1

I'm sorry, but most of the people who replied to this parent are simply wrong. Google WILL crawl an entire forum if you treat it right. For instance, google has crawled just about every page on my forum:
http://www.google.com/search?q=site:www.mageacademy.com+forum+-sid&hl=en&lr=&ie=UTF-8&oe=UTF-8&filter=0

The problem is in the way it crawls. Originally, it was finding the same page MANY different ways. Most aggravating was it treating new SID tags as new pages. This can be seen below:
http://www.google.com/search?q=site:www.mageacademy.com+forum+sid&hl=en&lr=&ie=UTF-8&oe=UTF-8&filter=0

Out of my 60+ pages, it indexed 300+. The solution seems to have 3 steps:
1) remove any user unique tags when you see a bot visiting
2) limit the bot to view only certain pages (no post/login/search/etc)
3) make your files 'look' static (Change forum.php?f=1 to forum1.htm)

www.phpbb.com has a couple threads on how to do this. Even if you use another forum, the general idea is the same.
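Step 1 boils down to making every visitor-specific URL collapse to a single canonical one. A minimal sketch of that normalisation in Python, assuming the per-visitor parameter is called `sid` (the function name and the example host are made up; the phpBB modifications themselves are PHP and may work quite differently):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sid"}  # assumed name of the per-visitor parameter

def strip_session_id(url):
    """Return the URL with session parameters removed and everything else kept."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_session_id("http://example.com/forum.php?f=1&sid=abc123"))
```

Two links that differ only in their SID then come out identical, which is the property that stops one page from being counted several times.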
Re:With the 10% that is crawled by SoSueMe · 2004-03-09 05:53 · Score: 1

Want to read more about "Deep Web"? Try this link.
It has been around for a couple years and was just updated in October.
Enjoy
Re:With the 10% that is crawled by Feztaa · 2004-03-09 07:54 · Score: 1

It requires mod_rewrite, specifically. Gallery does it, for example.
Re:With the 10% that is crawled by WNight · 2004-03-09 07:57 · Score: 1

Make the forum and message number into a directory, but leave the optional components (UTF-8, etc) as script parameters. This way google will recognize that a given message is the same, despite having three users link to it with different parameters.

Better yet though, provide a nested view (like Slashdot) and have that the default forum view, have everything else (threaded, individual message, etc) be additional options. Google will follow a link to a single message, then seeing a ton of links to the same 'page' with option differences will hopefully follow the first link, a small 'view thread in nested mode' link, and see all of the content at once.

Any bot-specific stuff (other than a robots.txt) will probably be counter-productive because the search engines try to avoid letting people provide them with edited content to please the bot. (You'll probably lose ranking if the non-google-UA bot that checks up on this sees different content.)
Re:With the 10% that is crawled by danielsfca2 · 2004-03-09 09:32 · Score: 2, Insightful

Hey cheapskate. Maybe if you subscribed to Salon you wouldn't have that problem. Independent news sites like Salon are going to disappear if they get no revenue. Maybe next time you visit salon.com, it'll say "Thanks to our former subscribers for the support. Due to our operating costs going through the roof but only four people subscribing, we've been forced to go out of business. This domain was bought by Fox News in bankruptcy proceedings. Click here to go there now."

If you're too cheap to pay for anything, you have to be satisfied with things like ad-supported internet access (see NetZero) and ad-supported news (like salon's day-pass, and fucking TV, where's the complaining about CNN?). Yes, the ads are more intrusive than they were in 1999. The venture capital investment is gone and advertisers won't pay jack for barely-there banner ads. Now they want your full attention for a moment. So WTF is salon.com supposed to do, just say, "Everything is free! No ads! When the bandwidth bill comes, we'll just mail them some monopoly money"??

If ad-supported websites didn't exist, the only people who could afford to publish on the Internet would be the conglomerated media who make their money from--say it with me--ad revenue from TV (etc.). Get it yet?

Now, Mr. Troll, get back under your bridge.
Re:With the 10% that is crawled by Kent+Recal · 2004-03-09 10:04 · Score: 1

Mod parent +1 funny.

Since when do "independent news sites" need "business" at all?
Last time I checked indymedia was doing quite fine. Kuro5hin and others seem to do okay, too (some banner ads here and there, but that's just normal when you get some traffic).
Re:With the 10% that is crawled by jelle · 2004-03-09 13:59 · Score: 1

Google is so fast, it can probably (almost?) search its own database to see if it has seen the link already. If the load is too much, then restrict the search to a fraction, such as only once per 25 links in a search branch, or once per second, or maybe just random inspections. Then the robot will loop for a bit, but that's it.

If google is smart, then they'll have robots close to as many servers as possible, preferably at least 1U colocated at each more than insignificant hosting provider, so that crawling bandwidth is cheap and plentiful.

--
--- Hindsight is 20/20, but walking backwards is not the answer.
Re:With the 10% that is crawled by Persol · 2004-03-09 14:02 · Score: 1

You'll probably lose ranking if the non-google-UA bot that checks up on this sees different content
Surprisingly Google has thought of this (and I'm guessing most other engines). It will not complain if the link points to the same page with different GET variables (removing SID), as long as the actual text is the same.

Your directory idea is basically how the PHPBB modification does it, except it is a unique HTML name for each thread instead of a unique directory. For some odd reason, google will still follow links to either a folder or a file with different GET variables (although not consistently). You have to use robots.txt to block the GET variables (which is made easier when the same page has different names depending on whether it uses GET variables).

How PHPBB does it
Re:With the 10% that is crawled by coyotedata · 2004-03-09 15:48 · Score: 1

From the more has to be better departure
Re:With the 10% that is crawled by ElliotLee · 2004-03-09 16:32 · Score: 1

Whoa, you guys clearly aren't webmasters.
Just generate another page that contains a regular link to all your pages.
Yes, it's called a site map.
You could hide that page from your regular users by, say, linking it to a 1x1 pixel transparent GIF.
Not a good idea. These are perceived as bugs that track users, and hiding links on a page may even be penalized by Google. (Search engines want to see the page exactly the way the user does, in order to provide the most relevant content.) Use a normal link.
Re:With the 10% that is crawled by instarx · 2004-03-09 19:36 · Score: 1

Since they had to pay salaries.

Deep Web? by Traicovn · 2004-03-09 01:52 · Score: 5, Insightful

I bet you this new 'Deep Web' search technology would be something that does not observe robots.txt...

--

[Something witty and intelligent should have appeared here.]
{Traicovn}

Re:Deep Web? by Anonymous Coward · 2004-03-09 01:54 · Score: 3, Insightful

Good. If you leave things publicly accessible on an open web server, that's your own damned fault. Let the engines crawl where they please.
Re:Deep web? by AllUsernamesAreGone · 2004-03-09 01:59 · Score: 1

tubgoatgirlse-fu?

Hell, it even sounds like the name of a Lovecraftian Horror..
Re:Deep Web? by Anonymous Coward · 2004-03-09 02:18 · Score: 2, Interesting

User-agent: *
Disallow: /s3kr3t/

trawler: "Hey cool, thx for the tip I never would have thought to try /s3kr3t/"
Re:Deep Web? by AndroidCat · 2004-03-09 02:21 · Score: 2, Insightful

# go away. No, really - this means you!
User-agent: *
Disallow: /
And if they don't listen, feed them a huge maze of generated links that eventually lead to goatse or something. Or just block their crawler at the router and they can search their intranet.

--
One line blog. I hear that they're called Twitters now.
Re:Deep Web? by JDevers · 2004-03-09 02:32 · Score: 2, Informative

If I'm not mistaken, the original reason for robots.txt was to prevent endless loops from confusing spiders, not to "cover" some information that would otherwise be easily accessible. Of course, others use it for other things now...
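For reference, this is roughly what "observing robots.txt" looks like from the spider's side. The sketch uses Python's standard `urllib.robotparser` and feeds it the `/s3kr3t/` rule quoted earlier in this thread; the `ExampleBot` user agent and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Parse the rules directly instead of fetching them over the network.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /s3kr3t/",
])

def allowed(url, user_agent="ExampleBot"):
    """True if the parsed robots.txt rules let this user agent fetch the URL."""
    return rules.can_fetch(user_agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/s3kr3t/list.html"))
```

The file is purely advisory: the check only happens if the crawler's author chooses to run it.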
Re:Deep Web? by Anonymous Coward · 2004-03-09 02:35 · Score: 0

Maybe, seems like a smart spider should check its content for uniqueness then, instead of leaving it to a user (some of whom would love a spider to index thousands of pages laden with banner ads and popups)
Re:Deep Web? by Anonymous Coward · 2004-03-09 02:36 · Score: 2, Insightful

Well, I know that we use robots.txt to cover some directories that are both publicly accessible, and that we want people to be able to get the data in, yet that data is pretty useless unless you are visiting it from our link. We do signal processing, and looking at our data tables and our raw log files would be completely useless and can really alter a web search.
Re:Deep Web? by AndroidCat · 2004-03-09 02:53 · Score: 1

How unique? Using the URL as a seed, generating content isn't hard. That could lead to a gradual arms-race between spiders and poison. (Some people leave poison pages with email addresses for spammer spiders to harvest.)
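A toy illustration of the "URL as a seed" trick, to show why an exact-duplicate check alone would not catch it. Everything here (the word list, the function name, the four links per page) is invented for the sketch; it is not taken from any real spider trap.

```python
import hashlib
import random

WORDS = ["deep", "web", "crawl", "index", "link", "page", "data", "spider"]

def generated_page(url, n_words=12, n_links=4):
    """Deterministically generate filler text and child links from a URL.

    The same URL always yields the same page, so it looks like stable content,
    while every child URL yields a different page.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    links = [f"{url.rstrip('/')}/{rng.getrandbits(32):08x}" for _ in range(n_links)]
    return text, links

text, links = generated_page("/lostsouls/start")
print(text)
print(links)
```

Because nothing is stored, the server can serve an effectively unbounded tree of such pages at almost no cost.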

--
One line blog. I hear that they're called Twitters now.
Re:Deep Web? by Rorschach1 · 2004-03-09 03:18 · Score: 2, Funny

Doesn't observe it? It probably relies on it - tells you where the good stuff is!
Re:Deep web? by djhertz · 2004-03-09 05:08 · Score: 1, Funny

I had not seen tubgirl before.. Google found it right away. How... damaging to my eyes and soul.

--
Modest doubt is called the beacon of the wise - William Shakespeare
Re:Deep web? by Anonymous Coward · 2004-03-09 05:29 · Score: 0

Actually, crap floats. Gold sinks ;-)
Re:Deep Web? by xmedar · 2004-03-09 08:37 · Score: 1

That's a cheap gag...

--
Any sufficiently advanced man is indistinguishable from God

Porn!!!!!!! by Anonymous Coward · 2004-03-09 01:53 · Score: 0, Funny

Best way to do a full deep search then

ignore robots.txt by Anonymous Coward · 2004-03-09 01:54 · Score: 1, Informative

These new deep-web crawlers try and ignore the robot access control files. They try and intelligently determine if they're in some type of infinite looping situation, but basically this is how they work.

Damn ... by Anonymous Coward · 2004-03-09 01:54 · Score: 2, Funny

I remember browsing the WWW directory in '93 and being able to scroll through all the sites on my VAX session at university. Are you telling me I am one of the few people who actually ever reached the end of the internet?

Re:Damn ... by radicalskeptic · 2004-03-09 01:58 · Score: 1, Funny

Was the end guy hard?

--
WARNING: If accidentally read, induce vomiting.
Re:Damn ... by Anonymous Coward · 2004-03-09 02:14 · Score: 0

not if you have enough health points.

By the way everybody knows that the end of the internet is Here
Re:Damn ... by adept256 · 2004-03-09 02:44 · Score: 1

No, I've been there too.

--

I ran a benchmark on my quantum computer, now I can't find it anywhere!

Oh yeah, a whole new pair of dimes by stienman · 2004-03-09 01:54 · Score: 3, Funny

"Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

Yeah. It means I'll be able to use someone else's credit card for more of my transactions, since finding credit cards, SSNs and other...uh...'deep web' stuff will be so much more accessible.

-Adam

Re:Oh yeah, a whole new pair of dimes by dsanfte · 2004-03-09 02:17 · Score: 4, Insightful

I wish you luck using that credit card number without the appropriate expiration date. The FUD spreaders rarely mention the fact that exp dates are almost never stored with the numbers themselves.

--
occultae nullus est respectus musicae - originally a Greek proverb
Re:Oh yeah, a whole new pair of dimes by Zone-MR · 2004-03-09 02:18 · Score: 2, Insightful

So are you implying that you're credit card information is currently available on web pages, with no password protection, and the only thing stopping hackers is that it isn't listed in a search engine?
Re:Oh yeah, a whole new pair of dimes by Anonymous Coward · 2004-03-09 02:21 · Score: 1, Insightful

I think it means that the crawler will use your credit card.. so you can just search for terms like "What did I buy today", or "If I was going to buy something -- oh? I did buy something?"

This can't be done reliably with current technology (i.e. google)
Re:Oh yeah, a whole new pair of dimes by Anonymous Coward · 2004-03-09 02:26 · Score: 0

Umm, actually, you don't need the right expiration date.. just one that hasn't expired. Try it next time you order something online with your own card. The security code thing on the other hand does make a difference.
Re:Oh yeah, a whole new pair of dimes by pohzer · 2004-03-09 02:47 · Score: 1

FYI expiration info, balance, credit limit are all readily available for free on certain cc hacker channels on the Internet - all you need is the card number. Apparently courtesy of your friendly neighborhood non-secure banking institution.
Re:Oh yeah, a whole new pair of dimes by Anonymous Coward · 2004-03-09 03:58 · Score: 0

So are you implying that you're credit card information ...

No, he isn't credit card information.

So are you implying that your credit card information ...
Re:Oh yeah, a whole new pair of dimes by Anonymous Coward · 2004-03-09 04:57 · Score: 0

Sowwy, I made yet another typo... I can only hope someday you will forgive me.

-- Zone-MR
Re:Oh yeah, a whole new pair of dimes by poot_rootbeer · 2004-03-09 05:30 · Score: 1

The FUD spreaders rarely mention the fact that exp dates are almost never stored with the numbers themselves.

If by "almost never" you mean "usually", I'd be inclined to agree with you.

We're talking about application designers that are foolish enough to store credit card numbers in a publicly accessible location to begin with. Do you really think any of them have given thought to deliberately obfuscating the data model enough to store expiration dates somewhere other than right next to the CC numbers and account holder names?
Re:Oh yeah, a whole new pair of dimes by Anonymous Coward · 2004-03-09 08:50 · Score: 0

Um, actually the exp date doesn't matter much. Or at least it doesn't to capital one accounts, they (and others I'd assume) will process charges regardless of if the exp date is correct or not (as long as it's some point in the future).
Re:Oh yeah, a whole new pair of dimes by Anonymous Coward · 2004-03-09 10:08 · Score: 0

sometimes the CC-lists are stored in some obfuscated format: ".xls"
anyone know how to decipher that?

Deep Web? by dingo · 2004-03-09 01:54 · Score: 2, Funny

Why do I get the feeling that you will get a lot more search results for Linda Lovelace when searching the "Deep Web"

--
The Borg assimilated my race & all I got was this lousy T-shirt

robots.txt should be ignored anyway by Anonymous Coward · 2004-03-09 01:55 · Score: 1, Troll

If you don't want it indexed and looked at, don't put it on the web in the first place.

Re:robots.txt should be ignored anyway by Anonymous Coward · 2004-03-09 01:58 · Score: 1, Insightful

Oh okay, so your testing directory should be indexed? The best place to test is to actually have your files on a server. The easiest way to do this is to just put it in a "test" directory or something on your server. A simple line in your robots.txt file and that test directory does not get indexed.

It would be a pain in the ass to have a test directory require a login and password all the time (if you don't want people to look at it BUT robots.txt doesn't work anyway).
Re:robots.txt should be ignored anyway by AndroidCat · 2004-03-09 02:30 · Score: 1

If you didn't want your crawler fed a million-zillion pages of /lostsouls/{BF538DE0-71AB-11D8-AD10-00A0248B8F67}.html that link to four other pages like it (and so on), then you should have listened to robots.txt.

--
One line blog. I hear that they're called Twitters now.
Re:robots.txt should be ignored anyway by Anonymous Coward · 2004-03-09 04:00 · Score: 0

How's it going to find your test directory unless there is a publicly available link to it?

Please think before posting. It makes you look stupid if you don't.
Re:robots.txt should be ignored anyway by jrnchimera · 2004-03-09 08:37 · Score: 1

How about a type of service that runs on a webserver that allows a search engine to query for valid links on the server? That way one could configure the "valid link server" to tell search engines what the valid/allowed links are? Sort of the "next step" over the robots.txt file..

Deep web? by hookedup · 2004-03-09 01:55 · Score: 4, Funny

Doesn't crap sink? Not sure I want to know what the other 90-odd percent is. After tubgirl, goatse, etc.. what else could possibly be next..

deep web? by rjelks · 2004-03-09 01:55 · Score: 4, Funny

Is it just me, or does this sound like we're gonna get more pr0n when we search?

-

--

Tech News, Reviews and Tutorials

No... by Anonymous Coward · 2004-03-09 01:56 · Score: 1, Interesting

but it will get us 90% more useless results. The regular search spam on Google is bad enough (it's getting to the level of bad results AltaVista had before Google took over the throne) without this extra noise...

only 1%??? by Spetiam · 2004-03-09 01:56 · Score: 1

so maybe that's why google never tells me anything about servicing this teletype machine...

it's amazing to think how much more information we'd have access to if google (or another search engine) could search 90% of what's out there. i mean, just at 1% we already say, "google knows all"

Maybe I'm just missing the point... by robslimo · 2004-03-09 01:56 · Score: 5, Interesting

...but I don't want to see the guts of a web form. If I understand correctly, they're talking about crawling into databases, actually parsing a Microsoft Access file, for instance. I see that as having dubious merit, and potentially pissing off web site owners. Web site designers go to a lot of trouble to provide the interface they want you to see to their data. This would just sidestep the interface and dump you into the data.

At the very least, it might require an overhaul or extension to the robots exclusion specification to keep spiders out of your data.

Re:Maybe I'm just missing the point... by Anonymous Coward · 2004-03-09 02:10 · Score: 1, Insightful

Sounds to me like that IS the point.

I don't need a search engine to index the interfaces, I need it to index the DATA.

Now, I'll admit, it would be a nice bonus if it can then map that back, and turn its search results into a link to the data via the desired interface - but I'd settle for just getting the data.

What if? by AMD-lover · 2004-03-09 01:56 · Score: 1

One could Deep Web your system and find all your prOn and your webcam feed?

Isn't the first 10 results the most important? by toesate · 2004-03-09 01:57 · Score: 1

Nevermind the rest of the 99%, especially if they are dups or trashy info.

This 99% might not be *intended* as public info either. Privacy is a consideration here.

--
Hey, that's my password you are typing

But if you bypass the front pages... by oneiros27 · 2004-03-09 01:57 · Score: 3, Insightful

Of course, it's nice to know that the content's there, but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?

I could care less about Ticketmaster whining about deep linking, but there's probably some stuff out there that, if it isn't seen in the context of its intended point of entry, may have other problems.

I'm afraid that this is going to give people more reason to go back to using frames, and 'detecting' if their content has been hijacked, and writing more bad code that causes multiple windows to pop up all over the place, and/or crash browsers.

--
Build it, and they will come^Hplain.

Re:But if you bypass the front pages... by Zone-MR · 2004-03-09 02:21 · Score: 1

Of course, it's nice to know that the content's there, but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?

... because so many teenage children will be deterred by the disclaimer. "Oh, damn, I'm not 18 so I can't see her titties".

Also there is nothing stopping the sites from checking the referrer to display the disclaimer on first EXTERNAL entry. Also search engines at present are hardly intelligent enough to automatically avoid directing people to pages beyond the disclaimer.
Re:But if you bypass the front pages... by Gr8Apes · 2004-03-09 02:46 · Score: 1

I'm afraid that this is going to give people more reason to go back to using frames, and 'detecting' if their content has been hijacked, and writing more bad code that causes multiple windows to pop up all over the place, and/or crash browsers.

Popups are only a problem for IE browsers. No one else ever sees them, unless they really want to.

--
The cesspool just got a check and balance.
Re:But if you bypass the front pages... by Chemisor · 2004-03-09 05:33 · Score: 1

> but how many children are now going to be able to
> bypass the disclaimer pages on porn sites because
> of deep linking?

How many children want to read a disclaimer page anyway? Or agree that they are not old enough to do something?
Re:But if you bypass the front pages... by CAIMLAS · 2004-03-09 06:07 · Score: 4, Insightful

but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?

Hello, 1996 is calling; they want their paranoia back!

Goodness, you aren't serious, are you? Have you used a search engine in the last couple years? Have you not ever looked for porn yourself? Just hop over to images.google.com and enter the name of a porn star - bam, shitloads of smut. Not only that, but search google.com for a porn star's name (many of which you could easily find by searching for 'famous porn stars', I'm sure) and you'll find gallery after gallery of porn, open and free.

There is no such thing as protecting your kids from porn on the internet anymore. If you don't want to have them looking at porn, don't let them online or police their actions.

--
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Re:But if you bypass the front pages... by bronaugh · 2004-03-09 06:27 · Score: 1

Everyone loves a moron! (They taste like chicken; best parbroiled)

-- Funny comments from the Vegan Underground
Re:But if you bypass the front pages... by leifm · 2004-03-09 09:06 · Score: 1

Anymore? There never really was. But that was the case pre internet as well, in middle school we had a ton of magazines, now kids have the internet.

--

"Windows Me offers tremendous reliability and stability improvements..." -- Paul Thurott

PHP? by TGK · 2004-03-09 01:57 · Score: 4, Interesting

Since I moved my site over to a PHP-based system, nothing beyond my index page gets a second look from google. As web content moves away from static pages to more dynamic solutions (particularly XML) a more sophisticated crawler is needed, one that can read over this bewildering maelstrom of data and extract from it meaning and content.

While I find it highly unlikely that this system will do well with large databases (or even databases at all for that matter) it is a step in the right direction. Google will probably have their version up on labs inside a month.

--
Killfile(TGK)
No trees were killed in the creation of this post. However, many electrons were inconvenienced.

Re:PHP? by pubjames · 2004-03-09 02:03 · Score: 1

Since I moved my site over to a PHP-based system, nothing beyond my index page gets a second look from google

Perhaps you are doing something wrong? All the dynamic PHP sites I know of are fully indexed by Google.
Re:PHP? by andygrace · 2004-03-09 02:09 · Score: 2, Insightful

Well the front pages might be, with a few top stories, but the real problem lies in getting at all the information that is stored in SQL databases ...
There are reams of stuff in there that a search engine can't see. XML could be used to deep search these entire databases, rather than just the stuff that's pulled into the UI by the PHP code.
Re:PHP? by DeadSea · 2004-03-09 02:14 · Score: 5, Interesting
Keep in mind that googlebot comes in two flavors, freshbot, and deepbot.
Freshbot is meant to update the google cache for pages that change frequently. Freshbot may pull pages as much as every couple hours for really popular pages that change frequently.
Deepbot goes out once every month or two and follows links. The higher your pagerank, the deeper into your site it will go. If you want more of your site to get crawled here are some tips:
1. Make your pages *look* static (end in .html)
2. Avoid CGI parameters except for handling form data (no ? in url)
3. Put all pages in the document root, or in very shallow subdirectories. Google goes after less and less as the directories get deeper.
It is likely that deepbot just hasn't run since you updated your site, so freshbot is just pulling your front page occasionally.
BTW: I noticed you have a link to my cheat sheet on your links page. Thanks! :-)
Re:PHP? by Xner · 2004-03-09 02:17 · Score: 4, Informative

I'm not exactly sure what you mean. If it is accessible by clicking on links, most search engines should be able to index it. If you want to be extra-friendly you can use $PATH_INFO to make dynamic pages look more like static ones, e.g.:
http://site.com/blah/prog.php/stat/1
instead of
http://site.com/blah/prog.php?stat=1
I use it all the time and it works really well.
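
To illustrate the idea above, here is a minimal sketch of how a script might turn the extra path segments back into parameters. It is written in Python rather than PHP, the function name parse_path_info is hypothetical, and it assumes the path simply alternates keys and values.

```python
def parse_path_info(path_info):
    """Turn a path such as '/stat/1/page/2' into {'stat': '1', 'page': '2'}.

    Assumption: segments alternate key, value, key, value.
    A trailing key with no value is ignored.
    """
    parts = [p for p in path_info.split("/") if p]
    # Even-indexed segments are keys, odd-indexed segments are values.
    return dict(zip(parts[0::2], parts[1::2]))


params = parse_path_info("/stat/1")
print(params)
```

With this layout, http://site.com/blah/prog.php/stat/1 carries the same information as the ?stat=1 form, but contains no query string for a crawler to shy away from.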

--
Pathman, Free (as in GPL) 3D Pac Man
Re:PHP? by ip_vjl · 2004-03-09 02:17 · Score: 1

As web content moves away from static pages to more dynamic solutions (particularly XML), a more sophisticated crawler is needed, one that can read over this bewildering malstrom of data and extract from it meaning and content.

It's all in how you build your pages.

For PriorArtDatabase.com there are only a handful of actual 'pages' ... everything is actually pulled from source XML files. But the URLs are created in such a way that they appear to be separate pages to a search engine. I've seen the googlebot on thousands of documents from the site, even though they are actually being handled by the same script on the server.
Re:PHP? by kisrael · 2004-03-09 02:22 · Score: 1
1. Make your pages *look* static (end in .html)
Another way of looking static is to put, say, an "index.cgi" within a subdirectory, and then only link to the subdirectory name. For example, a typical month's archive at my site kisrael.com has a URL like http://kisrael.com/arch/2004/03/ even though it's all dynamically generated. (I wasn't smart enough and/or didn't have enough access to my rented webserver to pull off that trick where that URL ends up going to, say, arch/index.cgi and /2004/03/ gets interpreted as parameters. Instead, I created a bunch of subdirs by hand, each containing a tiny "index.cgi" that perl includes the main program, which in turn inspects its URL to parse out the 2004 and 03 and show the right stuff.)
--
SO YOU'RE GOING TO DIE: The Comic for Dealing with Death
Re:PHP? by pubjames · 2004-03-09 02:33 · Score: 1

Or even better, use Apache mod_rewrite
Re:PHP? by poot_rootbeer · 2004-03-09 05:37 · Score: 1

Since I moved my site over to a PHP-based system, nothing beyond my index page gets a second look from Google.

Have you considered using mod_rewrite or a similar solution to convert your complex URLs with query string parameters aplenty into something that looks like a vanilla filepath?

For example, using mod_rewrite the URL of the page I'm typing this on

http://slashdot.org/comments.pl?sid=99804&op=Reply&threshold=3&commentsort=0&mode=flat&pid=8509086

could be rewritten to look like

http://slashdot.org/comments.pl/sid%3D99804/op%3DReply/threshold%3D3/commentsort%3D0/mode%3Dflat/pid%3D8509086.html

to web spiders. ..
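
The mapping described above can be sketched in a few lines. This is not mod_rewrite configuration, only an illustration in Python of the URL transformation itself; the function name query_to_path is hypothetical, and the server would still need a rule to map the path form back to the real script.

```python
from urllib.parse import urlsplit, parse_qsl, quote


def query_to_path(url):
    """Rewrite '?a=1&b=2' into '/a%3D1/b%3D2.html' so the URL looks like a file path."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query)
    if not pairs:
        # Nothing to rewrite: leave URLs without a query string alone.
        return url
    # Percent-encode each 'key=value' pair, including the '=' sign.
    segments = [quote(f"{key}={value}", safe="") for key, value in pairs]
    return f"{parts.scheme}://{parts.netloc}{parts.path}/" + "/".join(segments) + ".html"


print(query_to_path("http://slashdot.org/comments.pl?sid=99804&op=Reply"))
```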
Re:PHP? by Anonymous Coward · 2004-03-09 08:12 · Score: 0

"over this bewildering malstrom of data and extract form it meaning and content"
malstrom?
What the heck is a malstrom? Perhaps you meant maelstrom?

From the article by sczimme · 2004-03-09 01:58 · Score: 4, Insightful

Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.

There is a reason for this: a Google search should turn up pointers to the items in the so-called "deep web" (*gag*). To use one of the examples above: if I am looking for information on patents, the search terms I use should point me to the US Patent and Trademark Office. It shouldn't have to point me to all 12 bajillion patent filings.

Besides, what makes anyone think this is going to fly after all the hubbub over "deep-linking"?

--
I want to drag this out as long as possible. Bring me my protractor.

Re:From the article by Professeur+Shadoko · 2004-03-09 03:55 · Score: 2, Interesting

Right.
But if you are interested in a specific subject..
Let's say you have a technical problem.
Chances are somewhere on the planet someone submitted the same problem on a web-based forum.

Now you want google to give you THAT specific message.
You don't want google to tell you "hmmm... I guess the solution must be in one of those zillions of forums here, here, and here".
Re:From the article by retinaburn · 2004-03-09 05:30 · Score: 1

However, each of those patents would probably contain a link to the main page, so theoretically the main page would rank higher. And how many pages would link directly to the patent pages? Just because it searches deep doesn't mean they are doing away with whatever page ranking algorithm they use.
Re:From the article by Anonymous Coward · 2004-03-09 13:35 · Score: 0

Can someone point me back to this "deep-linking"?
It sounds familiar

Spiders? by Vo0k · 2004-03-09 01:58 · Score: 4, Interesting

...and I wonder about something different.
Has anyone tried this yet? Change your user agent string to one matching the googlebot and crawl the web. I'm pretty sure many "registration only" websites would magically open themselves, but I wonder about other differences too :)

--
Anagram("United States of America") == "Dine out, taste a Mac, fries"

Re:Spiders? by MyHair · 2004-03-09 03:07 · Score: 2, Interesting

Good question. I haven't tried it yet, but I've run into several sites that Google indexes but the site refuses me entry until I register (which I don't). Some of them are clever enough to put Javascript (or something) in to prevent you from looking at Google's cache of that page. Yeah, I could get around that, but usually by then I figure I don't care what that site has to say.
Re:Spiders? by agurk · 2004-03-09 05:26 · Score: 1

It is unlikely, although possible, that sites return pages other than their "real" pages to googlebot, but remember that these pages will show up in the google cache in the form that is given to google.

So if you suspect a site to do this, just check the google cache.
Re:Spiders? by poot_rootbeer · 2004-03-09 05:41 · Score: 2, Informative

I can't speak for everyone, but here we check not only a spider's User Agent string, but also whether the request is coming from Google's IP range or elsewhere. So your results may not be so great.

Then again, I've defeated many registration (er, pr0n) gateways by just setting a Referer header identical to the URL I'm requesting, so some defenses are better than others...
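
A minimal sketch of the two-part check described above (User-Agent string plus source address), in Python. The network below is a documentation-only placeholder, not Google's real address space, and the names CRAWLER_NETWORKS and looks_like_crawler are hypothetical.

```python
import ipaddress

# Assumption: placeholder network for illustration only (192.0.2.0/24 is
# reserved for documentation). A real deployment would list the crawler's
# actual published address ranges here.
CRAWLER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def looks_like_crawler(user_agent, remote_addr):
    """Accept a request as the crawler only if both the UA and the source IP match."""
    if "googlebot" not in user_agent.lower():
        return False
    addr = ipaddress.ip_address(remote_addr)
    # A spoofed User-Agent from an unlisted address fails this second test.
    return any(addr in network for network in CRAWLER_NETWORKS)
```

A browser that only changes its User-Agent string passes the first test and fails the second, which is why the results "may not be so great" against sites that do both.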
Re:Spiders? by Anonymous Coward · 2004-03-09 08:37 · Score: 1, Interesting

This is intriguing...

Tell us more.

(May I request an example URL?)
Re:Spiders? by ChaosDiscord · 2004-03-09 08:50 · Score: 1

Change your user agent string to one matching the googlebot.... I'm pretty sure many "registration only" websites would magically open themselves...

Indeed. I do exactly this to access the Insiders Only content on IGN. (You'll also need to disable javascript). I'd feel bad about it, but these pricks clearly intend to deceive. I find links to interesting content through Google, but the link leads somewhere else. I don't mind paid content (I pay for two online magazines), but attempting to mislead both Google and visitors is wrong. So I report the abuse, tweak my browser, and head back to enjoy free, free content. (The magic string is "Googlebot/2.1 (+http://www.googlebot.com/bot.html)". You can try it with one of these links.) I hate sites that try to mislead Google. If you don't want to provide the content to random people, don't provide it to Google. If you want to pimp your product be honest about it and purchase AdWords.

--
Search 2010 Gen Con events

Pay to search by zzxc · 2004-03-09 01:59 · Score: 1

I'd happily pay Google a monthly fee to gain access to extensive databases of information that take money to acquire and maintain... as long as this fee was reasonable. The current Google searches should stay as is, but if people want access to do a time consuming search on every single slashdot message ever posted, for example, the advertising would not pay for this effort. However, I wouldn't pay Yahoo! for this in a million years. Premium google searches might include pages not ranked high enough to be crawled in the normal google search, full image search -- bandwidth intensive, and full news search -- google most likely will have to pay license fees to the news sources to do this. Most news publications charge a fee to access old articles.

BOFFINS ENHANCE SEARCH BY IGNORING robots.txt by Anonymous Coward · 2004-03-09 02:00 · Score: 0, Funny

More shocking news at 11am !

Get ready to tighten up those dynamic site scripts by pubjames · 2004-03-09 02:00 · Score: 0, Redundant

My guess is that they will be looking at ways of automatically polling dynamic web sites to extract all the data from the database. So if a site has a page, for instance

www.site.com/index.asp?content=10,

the search engine will try content=1 to content=n to see what it gets.
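
As a rough sketch of that guess, a crawler could generate the candidate URLs like this (Python; the function name candidate_urls and the upper bound of 3 are assumptions for illustration, and nothing is actually fetched):

```python
def candidate_urls(base_url, param, start, stop):
    """Yield base_url?param=start through base_url?param=stop, inclusive."""
    for n in range(start, stop + 1):
        yield f"{base_url}?{param}={n}"


urls = list(candidate_urls("http://www.site.com/index.asp", "content", 1, 3))
for url in urls:
    print(url)
```

A real crawler would also need a stopping rule, since it cannot know n in advance.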

Privacy and Crap by jackb_guppy · 2004-03-09 02:00 · Score: 2, Interesting

Going after the other 90% does not mean that new things will come to the top. Oh, there may be a few cool items like "Who really shot JFK" or the launch code for a Trident.

But in reality the other 90% is most likely best left unfound. Who really wants to know that their parents were not married in the manner that they told?

Just as in archaeology, you will find a nice vase or two... but the rest is rubble.

You understand that digging in a garbage dump is the best way to find things in archaeology, because people cleaned their houses back then too. That is what the other 90% is... a dump of information.

whats in the deep web? by Anonymous Coward · 2004-03-09 02:01 · Score: 0

I wish the internet was like a book. Except in this case the internet can have a variable index.
A search engine produces an abstract of the website.
I know that I'm sounding like an academic, but things could be more organised in the first page, I mean place.

And yes I use the net for research purposes more than entertainment. For me the internet hype has died down, and I now refer more to books these days, as they seem to be easier to find.

Google by nycsubway · 2004-03-09 02:02 · Score: 2, Insightful

Generally, google finds the pages that the authors want to be searched. That's why you submit your site to google. Even if you don't submit your site to google, if it's on a domain that google searches and there is a link to it, it'll be found.

With google storing more than 4 billion web pages, I'd hate to see what kind of crap the other 99% is.

Perhaps they count each iteration of a dynamic page as a separate page? Even so, google's news page does a great job searching in real time for pages that change dynamically.

--
http://github.com/gbook/nidb

Re:Google by SpaceLifeForm · 2004-03-09 02:08 · Score: 1

Perhaps Google needs to only subscribe to Slashdot, set the preferences, and Google will then find anything you ever need.

--
You are being MICROattacked, from various angles, in a SOFT manner.

Top 4 by UncleBiggims · 2004-03-09 02:02 · Score: 5, Informative

About.com lists the top 4 places to search the deep web as:

Anybody use any of these sites? Are they any good? Just wondering why this is getting to be news if sites like these already exist.

Are you Corn Fed?

Re:Top 4 by BReflection · 2004-03-09 02:15 · Score: 3, Informative

Search Systems is another good site. They make 17,834 public databases accessible.

--
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
Re:Top 4 by Anonymous Coward · 2004-03-09 02:20 · Score: 0

I've had about.com mess up the relevancy of way too many of my searches. These links probably aren't any good at all. Then when you open up an about.com link it's INSIDE A FRAME!
ugh....

It's almost as bad as those sites that feature ebay items up for auction and pages which have hundreds of search terms in them just so they show up and then try to sell you viagra and movie tickets.
Re:Top 4 by Anonymous Coward · 2004-03-09 03:49 · Score: 0

Turbo10
Re:Top 4 by Anonymous Coward · 2004-03-09 04:59 · Score: 0

...I'm not sure I want to see the bottom.

I ran two search terms through each of the four.

Search term A is a completely unique word which happens to be one of my .com domain names. Normally this gets around a hundred hits on Google. Here it got zero.

Search term B is an actual (though arcane) archaeological term which averages a few thousand hits in most search engines. Again, zero.

Same story for 'search systems'.

Turbo 10 did the best finding three instances of the first and seven of the second.

On balance, I agree with most of the thread, in that it really doesn't sound like the 99% is really so hot or useful.

1 percent,? by zonix · 2004-03-09 02:02 · Score: 4, Insightful

The article alleges that current search services like Google manage to access less than 1% of the web [...]

1 percent, and I still don't have a problem feeling lucky almost every time I do a search on google.

z

--
What would an EWOULDBLOCK block, if an EWOULDBLOCK could block would? -- me

Relevancy by Traicovn · 2004-03-09 02:02 · Score: 4, Insightful

Judging by the problems with relevancy that often occur in current search engines (I think of the problem with meta keywords, which for many search engines are now completely useless, and google-bombing), why would a customer pay to add more data to the search engine? The idea of course is 'because they'll be more relevant and, because they have more information, will come up more often'. However, if search engines start searching more and more of this 'deep web', how badly will relevancy be affected? I mean, the more data that is in there, the more chances there are of relevancy being broken, and if the weighting is in favor of these 'featured' searches, then relevancy may be even more broken. Sure, these companies will have more traffic directed to them, but will it merely be useless traffic by frustrated users searching for something else?

I run a search engine for an educational institution, and I will admit, Google misses a significant number of our documents. On the other hand, some of those documents are scripts that when queried will create a (virtually) infinite amount of data (calendar scripts, etc). How deep do we really need to go though? Do we really need to include calendar entries for the year 2452?

I'm also confused, is this search service 'pay by the searcher' or 'pay by the content provider'. It seems to be content provider to me.

--

[Something witty and intelligent should have appeared here.]
{Traicovn}

Re:Relevancy by Traicovn · 2004-03-09 02:33 · Score: 1

Also, how do they know that they are only indexing 1% of the web? Have they already indexed the rest, but just aren't sharing it? Or maybe the rest is already weeded out because it really was pretty useless and caused relevancy problems in the first place?

--

[Something witty and intelligent should have appeared here.]
{Traicovn}

Limitations of Google by PingKing · 2004-03-09 02:07 · Score: 3, Insightful

One limitation of Google is the fact that a site that bases its navigation on a drop-down menu or submission form (i.e. choose a section from the list and click Go) cannot be spidered by Google.

Personally, I find this infuriating. A site I once worked on was available in numerous languages, which could be chosen from a drop-down list box. The upshot of this is that Google has only cached the site in English, meaning users who would use the other languages do not get my site returned when they search in Google.

We need an open-source alternative that can address these problems, as well as get rid of the security concerns and mysterious methods Google uses to rank sites.

--

Patriotism - the last resort of scoundrels.

Re:Limitations of Google by Stiletto · 2004-03-09 02:33 · Score: 4, Insightful

Solution: Web designers, stop trying to be so clever.

If you want your site to be spiderable, don't hide it behind javascript and flash!
Re:Limitations of Google by Anonymous Coward · 2004-03-09 03:56 · Score: 0

A site I once worked on was available in numerous languages, which could be chosen by choosing from a drop down list box

Then your site was broken. You should have used HTTP's content negotiation to provide alternate languages.
Re:Limitations of Google by quacking+duck · 2004-03-09 06:17 · Score: 1

I considered this very thing when designing my webpage, where the menus are javascript-drawn.

My solution: load the links normally inside a <div id=...>, but after the page loads and the JS menus are drawn, it replaces the contents of the DIV using the innerHTML function. Consequently, web spiders are able to crawl down to my sub-pages despite not having JS (not that any engines *have* crawled them, mine's just a small personal site hosted on my university account, please don't /. it!), but anyone visiting with a JS-enabled browser will see it more or less as I intend.

My email's also JS-drawn using Hiveware's Email Enkoder (http://hiveware.com/enkoder_form.php), so spam harvesters would have to do some serious work to get at it.

Article by Anonymous Coward · 2004-03-09 02:09 · Score: 3, Informative

When Yahoo announced its Content Acquisition Program on March 2, press coverage zeroed in on its controversial paid inclusion program, whereby customers can pony up in exchange for enhanced search coverage and a vaunted "trusted feed" status. But lost amid the inevitable search-wars storyline was another, more intriguing development: the unlocking of the deep Web.

Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.

Today, the deep Web remains invisible except when we engage in a focused transaction: searching a catalog, booking a flight, looking for a job. That's about to change. In addition to Yahoo, outfits like Google and IBM, along with a raft of startups, are developing new approaches for trawling the deep Web. And while their solutions differ, they are all pursuing the same goal: to expand the reach of search engines into our cultural, economic and civic lives.

As new search spiders penetrate the thickets of corporate databases, government documents and scholarly research databanks, they will not only help users retrieve better search results but also siphon transactions away from the organizations that traditionally mediate access to that data. As organizations commingle more of their data with the deep Web search engines, they are entering into a complex bargain, one they may not fully understand.

Case in point: In 1999, the CIA issued a revised edition of "The Chemical and Biological Warfare Threat," a report by Steven Hatfill (the bio-weapons specialist who became briefly embroiled in the 2001 anthrax scare). It's a public document, but you won't find it on Google. To find a copy, you need to know your way around to the U.S. Government Printing Office catalog database.

The world's largest publisher, the U.S. federal government generates millions of documents every year: laws, economic forecasts, crop reports, press releases and milk pricing regulations. The government does maintain an ostensible government-wide search portal at FirstGov -- but it performs no better than Google at locating the Hatfill report. Other government branches maintain thousands of other publicly accessible search engines, from the Library of Congress catalog to the U.S. Federal Fish Finder.

"The U.S. Government Printing Office has the mandate of making the documents of the democracy available to everyone for free," says Tim Bray, CTO of Antarctica Systems. "But the poor guys have no control over the upstream data flow that lands in their laps." The result: a sprawling pastiche of databases, unevenly tagged, independently owned and operated, with none of it searchable in a single authoritative place.

If deep Web search engines can penetrate the sprawling mass of government output, they will give the electorate a powerful lens into the public record. And in a world where we can Google our Match.com dates, why shouldn't we expect that kind of visibility into our government?

When former Treasury Secretary Paul O'Neill gave reporter Ron Suskind 19,000 unclassified government files as background for the recently published "Price of Loyalty," Suskind decided to conduct "an experiment in transparency," scanning in some of the documents and posting them to his Web site. If it weren't for the work of Suskind (or at least his intern), Yahoo Search would never find Alan Greenspan's scathing 2002 comments about corporate-governance reform.

The CIA and Dick Cheney notwithstanding, there is no secret government conspiracy to hide public documents from view; it's largely a matter of bureaucratic inertia. Federal information technology organizations may not solve that proble

Imagine it follows all forms by BibelBiber · 2004-03-09 02:09 · Score: 1

Imagine you have a script installed, testwise, something like webmin and that searchengine hops and clicks all round and crashes your server. Of course you're not supposed to have scripts like that openly usable but mistakes like that do happen.

Security through Obscurity ... by jobbegea · 2004-03-09 02:11 · Score: 1

isn't the way to go.

This would indeed force admins/designers to think about what data is really private. Which is not a bad idea

--

Net sa best, mar it koe minder

What color hat will the robots have? by w3weasel · 2004-03-09 02:15 · Score: 1, Interesting

Search my database????

How the fsck is the bot gonna have the DBI string to interface my DB without knowing the name of the DB, the name of the account that created the DB or the user account on the DB with correct permissions to read the info??????????????

Hmmm... sounds like marketing hype.

--

Just as irrigation is the lifeblood of the Southwest, lifeblood is the soup of cannibals. -- Jack Handy

Re:What color hat will the robots have? by Anonymous Coward · 2004-03-09 05:51 · Score: 0

Someone will pay you $$$ in exchange for an account. That someone is probably going to be a 3rd-party, who in turn is getting paid by, say, Yahoo!, which in turn is fuelled by premium subscription fees.

retarded moderators by Anonymous Coward · 2004-03-09 02:16 · Score: 0

Who the hell modded this down? please point out the flamebait that is in the parent post.

Bad kitty! by Underholdning · 2004-03-09 02:17 · Score: 4, Interesting

There's a perfectly good reason why a webcrawler doesn't (and shouldn't) crawl the backend databases. I have customers with items and prices in their database. They update that on a daily basis. I have customers that provide directory solutions. We update that information on a daily basis. Now, imagine the turmoil that will arise when people find outdated items using their favorite search engine, which crawls the database once in a blue moon. Nuff said. Bad idea.

--
Underholdning.info

Re:Bad kitty! by cowscows · 2004-03-09 03:55 · Score: 2, Informative

Exactly. The article mentions things like flight schedules and classified ads. Those sorts of rapidly and constantly changing information sources need a completely different system to effectively search them. Fortunately, they've already been invented. Orbitz, and cheap tickets, and expedia are a few of many that handle flight schedules. Any website for a local newspaper probably does a decent job with classified ads.

If I want to find cheap airline tickets, I put "airline tickets" into google, and it'll give me a list of websites that are designed to help me find airline tickets. It doesn't try and find the actual flights for me, and that's ok.

This deep web browser idea is going to end up being a feature bloated search engine that does lots of things, but does them all poorly, and does nothing particularly well.

--
One time I threw a brick at a duck.

Who cares how deep by Anonymous Coward · 2004-03-09 02:17 · Score: 0

If what gets presented at the end of the day is the .01% that has been paid for by some commercial entity.

Also

What does deep crawling have to do with the relevance of information? 99% of the web is crap so I am quite happy with the 1% that google returns.

Useless statistic of the week by Alomex · 2004-03-09 02:17 · Score: 2, Funny

The article alleges that current search services like Google manage to access less than 1% of the web.

There's a useless statistic if you ask me.

I just wrote a cgi script that, upon requesting the url "http://bogus.com/nnnnn" returns a page with the text "nnnnn" where nnnnn is any number up to 1000 digits long. So there, I just added 10^1000 pages to the "deep web" of which google indexes none! (gasp).

So there, Google now indexes less than 0.001% of the deep web.
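
For the curious, the kind of script described above fits in a few lines. This is a sketch in Python, not the poster's actual CGI code; the handler name render_page and the status-code return convention are assumptions.

```python
def render_page(path):
    """Return (status, body) for a request path such as '/12345'.

    Any path made only of ASCII digits, up to 1000 digits long, yields a
    'page' whose entire content is that number.
    """
    number = path.lstrip("/")
    if number.isascii() and number.isdigit() and len(number) <= 1000:
        return 200, number
    return 404, "not found"


print(render_page("/12345"))
# There are 10**1000 distinct numbers below a 1 followed by 1000 zeros.
print(len(str(10**1000)))
```

Every one of those URLs is a distinct, valid page, which is the point: counting "pages" on dynamic sites says little about how much useful content is being missed.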

Re:Useless statistic of the week by pohzer · 2004-03-09 03:04 · Score: 1

Or, you could have just as easily generated seemingly interesting (but nonsensical) webpages to feed google, using target keywords as fodder for the generating engine, and cross-linking them all to each other as well as your other ecommerce website on those same keywords. Voilà... Google increases your pagerank and sends you traffic so you can sell keyword widgets.

It's done all the time. Ever see a website that says "Chrissy sank her deep wet girls gone wild across his hungry enlarge it today guaranteed free results, causing a thrusting money back guarantee. His debt consolidation enlarged a statistically significant amount as she stroked her poker tour, and he blew the giant jackpot rewards......"

True nature of the deep database problem by andygrace · 2004-03-09 02:19 · Score: 5, Informative

I don't think most posters understand the issue. Most websites are now run out of content management systems, and search engines just trawl the web storing current pages. This is fine in a static internet, but with pages changing on a minute-by-minute basis (for example, a news site that pulls out the latest headlines), all you're going to have indexed in Google is what's on the page today.

Now say I was looking for info from a few weeks ago - Google is not necessarily the best way of finding this info. It's all still sitting there in the database, but it's not on the site's front page. archive.org may have a copy of it, but it would be much better to have google.com talk XML in a standard method to the news site's content management system, and have ALL the data there for a search.

Re:True nature of the deep database problem by poot_rootbeer · 2004-03-09 05:43 · Score: 1

it would be much better to have google.com talk XML in a standard method to the news site's content management system, and have ALL the data there for a search.

Then what would be the user's motivation to come to the news site, and spend any time there? They could just go to Google and leech all the same content for free.
Re:True nature of the deep database problem by Kent+Recal · 2004-03-09 10:17 · Score: 1

Ah, you guys are talking about RSS.

Funny by BenBenBen · 2004-03-09 02:21 · Score: 4, Interesting

Google's always been good enough for me.

--
The Slashdot Paradox: "100% Overrated"

Armed with this info by Anonymous Coward · 2004-03-09 02:23 · Score: 0

Now when a slashdotting occurs, the victim's servers are in deeper trouble.

IPO Baby by glenrm · 2004-03-09 02:26 · Score: 0, Flamebait

So Yahoo is paying Salon to make sure everybody knows they are still in the search biz and don't you forget it... Sure that little search company Google is IPO'ing soon and everybody from Mamma to Ask Jeeves is having their stocks party like it is 1999, but don't you forget about good old Yahoo, I mean we have the deep web tech...

--
Onward to the Aether Sphere!

Re:IPO Baby by Anonymous Coward · 2004-03-09 02:57 · Score: 0

Have you actually tried search.yahoo.com? Right now I am finding that the results are more relevant than Google and it seems to always find a few more pages on the really obscure searches. A lot of my Googlewhacks are not whacks at all on search.yahoo.com.

only missing 90 TB? by DeathBunnyRanger · 2004-03-09 02:27 · Score: 2, Funny

the internet is only 90 terabytes?

that is what salon says, and I think that is bull, given my favorite porn site offers 20gigs of raunchy action.

Re:only missing 90 TB? by Anonymous Coward · 2004-03-09 02:47 · Score: 0

perhaps it's 90TB of text. 20GB of porn trims down to about 1-2kb of text.

Probably redundant... by jwthompson2 · 2004-03-09 02:28 · Score: 1

As a web designer and admin I will have to find ways to make that data as inaccessible as possible....oh wait, I already do that because it is a good security measure...Database only listens on localhost so unless my server is breached it is already hidden behind the interface, not to mention that Apache already keeps people from reading my PHP. But if these 'deep web' searches are going to resort to trying to crack security then we have another thing to worry about...

--
Even if I knew that tomorrow the world would go to pieces, I would still plant my apple tree. -Martin Luther

Yahoo and my spam trap by Anonymous Coward · 2004-03-09 02:30 · Score: 0

1x10E50000000 web pages searched...

Genetic data... by Rakishi · 2004-03-09 02:35 · Score: 1

Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.

There already are numerous immense databases to store medical and genomic data, and the linking between them is only now becoming usable. There is a reason there are so many of them and that so much effort has been put into methods for displaying their data. I don't want to have the data spit at me because it's useless, a waste of time when there are better, faster, more robust, and nicer interfaces I could use.

It's probably the same with half the other stuff they want to search: there already are good methods of searching it for those who know where to look. Not only that, but the existing methods provide data-specific information, which an automated search engine cannot do and without which the data is useless to the people who actually use it. And those who don't need to look don't have a need for the data. Do YOU really need a 300 page list of A, T, C and Gs?

Re:AKA goodbye robots.txt by Zone-MR · 2004-03-09 02:42 · Score: 1

AKA "What's a robots.txt file?" says the innocent web crawling robot. :P

Nah, I'm sure the contents of the robots.txt file will be read, and the file itself will be listed in the index too ;)

Good-bye riaa.org.

More search results by earthforce_1 · 2004-03-09 02:43 · Score: 1, Funny

So instead of 5,234,169 search results returned, we will see 45,961,384 results?

Yippee!!!!!

--
My rights don't need management.

The bottom line. by BReflection · 2004-03-09 02:44 · Score: 1

The bottom line is that without patching the breach in communication between the database owner and the search engine we will never be able to get past the challenge of the 'deep web'.

With static systems such as yours that provide no links to much or all of the information, the only way a search engine could ever index the database would be for the owner to actually send the database to Google (for example). Due to issues of trust (i.e. the database getting leaked such as Microsoft's source code), this is next to impossible.

The next likely alternative would be a simple change in database standards. If all of this information on the deep web really is free and publicly available, just not searchable due to a lack of technical innovation, then the simple solution is to have database owners publish index files of their databases which search engines could then incorporate into their indexes.
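As a rough illustration of what such a published index file might contain, here is a minimal sketch. The table layout, the URL pattern and the JSON format are all assumptions made up for the example; no existing standard is implied.

```python
import json
import sqlite3


def build_index(conn, base_url):
    """Return one index entry per record: a link plus a short preview."""
    rows = conn.execute("SELECT id, title, body FROM records ORDER BY id")
    return [
        {
            "url": f"{base_url}?id={record_id}",  # hypothetical URL pattern
            "title": title,
            "summary": body[:80],  # preview only, not the full record
        }
        for record_id, title, body in rows
    ]


# An in-memory database stands in for the owner's real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO records (title, body) VALUES (?, ?)",
    [
        ("Flight 101", "Departs 09:00 from gate 4."),
        ("Patent 7", "A method for indexing."),
    ],
)
index = build_index(conn, "http://example.com/record")
index_json = json.dumps(index, indent=2)  # the file a search engine would fetch
```

The owner keeps the database itself; only titles, previews and links leave the site.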

Many database owners will react in fear to this idea, as the difficulty of getting the information on their website often leads to revenue through you looking at more ads, etc. However, the recent advent of Google Print should quickly put their fears to rest. Google is indexing books, something very akin to a database, but does not offer the entire book for download, instead providing a preview and a link to where you can purchase the book. Similarly, a link to the database content can still require users to register before they can search for their content. The point is that search engines are not responsible for actually giving us the content, just showing us where we can find it.

I believe that the indexing of databases by their owners will in the end lead to more people finding the information they want, and therefore more people visiting the site, in turn earning the content provider more business and giving the consumers what they want. The comparatively trivial roadblocks between us and the 'deep web' are undeserving of the daunting connotation attached to its name. All we need is a little innovation!

--
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"

Typo? by jez9999 · 2004-03-09 02:50 · Score: 1

The article alleges that current search services like Google manage to access less than 1% of the web

Surely that should be 10%, given the 90% statistic mentioned later on?

--
== Jez ==
Do you miss Firefox? Try Pale Moon.

Insight on the "deep web" by saddino · 2004-03-09 02:53 · Score: 3, Funny

99% of the "deep web" probably looks like this. Indexable? Sure. Necessary? No.

Re:Insight on the "deep web" by Jerf · 2004-03-09 05:11 · Score: 1

Oh, you can do better than that. Consider this site. How deep is it? As deep as you want it to be. Useful? Less so.

I remember one that actually did sentence fragments but I can't find it in Google. (Probably because the search terms I'm using are flooded with other relevant hits.)

Salon's Subscription Service Sux by Phurd+Phlegm · 2004-03-09 02:54 · Score: 1

I've tried to read some of these Salon things by using their "watch an ad, get a day free" interface and it never works. I have no idea how it is supposed to work, and no interest in debugging their software. Aren't there enough articles to comment on in fora that sorta kinda work?

P.S., I suppose if I used Inyerneck Exploder that it would Just Work, but after having to use Microsoft Outlook at work, I've decided to never voluntarily use any of Bill's stuff again.

Re:Salon's Subscription Service Sux by wolverine1999 · 2004-03-09 07:55 · Score: 1

Works for me - I use firefox however.

--
SCIREV.NET - fanfics,reviews & more
Re:Salon's Subscription Service Sux by MsRee · 2004-03-10 04:47 · Score: 1

It works fine for me on Opera 7.23 on Win 98 and XP. You do have to accept a bunch of cookies required for login (and sometimes the link to get into Premium is far from obvious), but it works fine.

--
In Soviet Russia, TV watches you!

more illegal stuff???? by Anonymous Coward · 2004-03-09 02:59 · Score: 0

Yeha. Now I can access the member part of pr0n sites by search engines. Why pay? Missing/corrupt rar's of my warez ain't a Problem any longer... Otherwise they will never reach that 90%...

How?? by Haydn+Fenton · 2004-03-09 03:01 · Score: 3, Interesting

I think I have a pretty good understanding of how Google works.

People submit their site, Google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages, etc., so that pretty much the whole site is included.
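That link-following process can be sketched as a breadth-first walk. This is only an illustration over a made-up, in-memory site; it is not a description of how Google's crawler is actually built.

```python
from collections import deque


def crawl(start, links):
    """Breadth-first walk: visit the start page, then everything it links to."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site: /orphan exists on the server but nothing links to it.
site = {
    "/": ["/about", "/posts"],
    "/about": ["/"],
    "/posts": ["/posts/1", "/posts/2"],
    "/posts/1": ["/posts"],
    "/posts/2": [],
    "/orphan": ["/"],
}
crawled = crawl("/", site)
missed = sorted(set(site) - set(crawled))  # pages the walk never reaches
```

Here `missed` ends up holding only `/orphan`: the page links out, but since nothing links in, a walk that starts from the main page cannot discover it.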

This obviously means pages which are not linked do not get included in Google's search, so I'm not surprised at the fact that less than 1% is ever crawled.

So how does this new method of crawling work? How can it possibly know what files are on the server if they are not linked in any way? The only way I can think of is a brute-force type method, which seems extremely stupid to me, since that would consume so much of the search engine's resources.

This also brings me to the next point. As a few people have mentioned, there are certain pages on the web which append a string to the end or the beginning of the URL, for example yourname.ismyfriend.com or www.somegamesite.com/attack.php?player=bob&attacks=5, so how many of these would the crawler try before deciding that was enough and moving on to the next link?
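One conceivable answer is a fixed budget of query-string variants per page. The sketch below assumes a limit of 3, which is an arbitrary number chosen for the example, and it is a guess at a policy rather than anything a real search engine is known to use.

```python
from collections import Counter
from urllib.parse import urlsplit


def should_fetch(url, seen_per_path, limit=3):
    """Allow at most `limit` query-string variants of the same host and path."""
    parts = urlsplit(url)
    key = (parts.netloc, parts.path)
    if seen_per_path[key] >= limit:
        return False
    seen_per_path[key] += 1
    return True


counts = Counter()
urls = [
    f"http://www.somegamesite.com/attack.php?player=bob&attacks={n}"
    for n in range(1, 11)
]
fetched = [u for u in urls if should_fetch(u, counts)]  # only the first 3 pass
```

A cap keyed on host and path does nothing about wildcard subdomains such as yourname.ismyfriend.com; those would need a similar budget keyed on the parent domain.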

Also, since most of the internet is porn, and this newfound technology will reveal another 90 percent or so of the internet, are we suddenly going to be showered with explicit sites?

Re:How?? by MImeKillEr · 2004-03-09 03:41 · Score: 4, Interesting

People submit their site, Google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages, etc., so that pretty much the whole site is included.

Google doesn't just search pages submitted - I've got an Apache webserver running at home, doling out pages for family photos and stats for a local UT2K3 server. I hadn't set up a robots.txt to stop search engines from crawling it (didn't think I needed to) and one day entered my URL in Google, only to find it.

I've never submitted the URL to google.

Should we assume that Google's already crawled a majority of the sites out there?

BTW, Yahoo has no record of my site in their database.

--
Cruising the internet on my TI-99/4A @ a whopping 300 baud!
Re:How?? by poot_rootbeer · 2004-03-09 05:49 · Score: 1

As you've said, web spiders typically work by following links from one page to another.

But "a href" is not the only way to get from page to page on the Web. There are also form submits, DHTML, and a hundred varieties of Javascript tricks and techniques.

Deep-linking would presumably try to simulate human interaction well enough to take advantage of these more complex methods. For closed-ended systems, eg select one option from this pull-down menu, deep-linking will probably work well, but for more open-ended interfaces, eg type a 250-word essay on why you love Skippy peanut butter, problems of scalability and usefulness will arise.
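For the closed-ended case, a crawler could in principle read the options out of a pull-down menu and request each one in turn. The sketch below uses a made-up form and only handles `<select>` fields submitted via GET; how any real deep-web crawler does this is not something the article tells us.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class SelectOptions(HTMLParser):
    """Collect the form action and the option values of each <select>."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.current = None  # name of the <select> being parsed, if any
        self.options = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action", "")
        elif tag == "select":
            self.current = attrs.get("name")
            self.options[self.current] = []
        elif tag == "option" and self.current is not None and "value" in attrs:
            self.options[self.current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.current = None


# Hypothetical page with a single pull-down menu.
page = """
<form action="/schedule" method="get">
  <select name="airport">
    <option value="JFK">New York</option>
    <option value="LHR">London</option>
  </select>
</form>
"""
parser = SelectOptions()
parser.feed(page)
urls = [
    f"{parser.action}?{urlencode({name: value})}"
    for name, values in parser.options.items()
    for value in values
]
```

Two options give two URLs to fetch. A free-text field has no such list to enumerate, which is where the scalability problem comes in.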
Re:How?? by LinuxXPHybrid · 2004-03-09 12:30 · Score: 1

> I've never submitted the URL to google.

Google has a submission page, but it doesn't really do much. The way it works is that a page gets indexed if and only if an inbound link is found in Google's current index.

That means ..., yes, there are a number of pages that are not indexed in Google, simply because no one or no page links to those pages/websites.
Re:How?? by guttersn · 2004-03-09 16:08 · Score: 1

It's been reported also that pages visited with the Google toolbar with no incoming links have been visited shortly after by Googlebot. This is more likely to happen if you have the advanced options turned on in the toolbar, which phone home to Google to get PageRank info about the page you're visiting.

Warnings are there to limit liability. by oneiros27 · 2004-03-09 03:11 · Score: 3, Insightful

It's rather stupid, but it has to do with legal practices.

If you have no warnings, then someone can claim that you forced your content on them, and they didn't know what they were getting into, and it was offensive.

Putting up warnings, which inform the user that they shouldn't enter your site if it's illegal for them to do so, shifts part of the burden of responsibility to them, and away from you.

So, if you're sued for having distributed offensive material, you can claim that you provided warnings, and that the person chose to disregard them. [Sort of like putting up 'wet floor' signs -- if someone gets hurt, they made an active decision to ignore the sign]

--
Build it, and they will come^Hplain.

Re:AKA goodbye robots.txt by heikkile · 2004-03-09 03:16 · Score: 1

I'm sure the contents of the robots.txt file will be read, and the file itself will be listed in the index too ;)

Something like this: robots.txt

--

In Murphy We Turst

Re:Funny...doesn't work for .gov by jobbegea · 2004-03-09 03:37 · Score: 1

I cannot imagine there is no .gov domain with these directories indexed

--

Net sa best, mar it koe minder

Re:Get ready to tighten up those dynamic site scri by Anonymous Coward · 2004-03-09 03:39 · Score: 0

No kidding. So can I sue Google for crashing my site and using up bandwidth for trolling my database? How does this jibe with "fair use" on my copyrighted materials? Hmmmm

The observer influences the experiment by AndroidCat · 2004-03-09 03:42 · Score: 1

Welcome to page /lostsouls/BF538DE1-71AB-11D8-AD10-00A0248B8F67!

Last crawled by:
yahoobot on 03/09/04 at 04:12.
spammerscum on 03/05/04 at 14:41.
googlebot on 02/29/04 at 10:38.
machoproducts on 02/23/04 at 18:21.
machoproducts on 02/23/04 at 18:20.
machoproducts on 02/23/04 at 18:18.

Have a nice day!

--
One line blog. I hear that they're called Twitters now.

Where are FireFox's cookies stored? by Anonymous Coward · 2004-03-09 03:47 · Score: 0

Where are FireFox's cookies stored?

The 99:1 Rule by Anonymous Coward · 2004-03-09 03:50 · Score: 0

Perhaps it's a new rule - 1% of the Web contains 99% of the useful information?

another form of DOS by ramar · 2004-03-09 04:03 · Score: 2, Interesting

If the boys with fat pipes start indexing "deeper" into sites, I think we're going to see a lot of sites going offline until they've been refactored to handle this sort of thing.

The frontend webservers that serve the static pages are fine (they're already being spidered now), but the dynamic content, largely dependent on databases and such, very likely wasn't built to handle this sort of load. Once the new engines get their hooks into these pieces, they're going to be in trouble.

Re:another form of DOS by Kent+Recal · 2004-03-09 10:20 · Score: 1

I can imagine some eBay engineers sweating.
Don't worry, your managers will make some expensive agreement before one of the spiders can hit you.

Deja vu by imgumbydamnit · 2004-03-09 04:03 · Score: 1

The cover story of the March issue of Technology Review is "Search Beyond Google" (click CURRENT ISSUE, privacy invasion required). Like the Salon article, they mention Dipsie, but they also cover a search engine (Mooter) that uses a MindMap style interface.

--
To err is human. To arr is pirate.

TOC by yagu · 2004-03-09 04:25 · Score: 1

Think of what Google does as generating an "Internet Table of Contents". While we may disagree on how well Google does this (I happen to think they do an amazing job, considering the complexity of the task), they essentially are giving us "pointers" into the internet.

A TOC represents a tiny fraction of a book, yet yields a powerful tool to gain access to specific and targeted pages in the book. A TOC need not "crawl" every word of every page of a book to be useful. Similarly, Google has developed their methods to give a reasonable representation of the Web and at the same time a powerful tool to gain access to the part of the Web relevant to your request.

I know this isn't a perfect analogy, but I think Google has gotten it close to right. I'm not sure what additional depth would gain for the effort invested.

I see one issue here... by innerweb · 2004-03-09 04:33 · Score: 1

And there are probably more. How will this mesh with the DB laws being pushed? If you are storing "deep web" pages that are parts of someone else's database, is that not flying in the face of what these new laws are going to be all about? Or am I misunderstanding what these laws are trying to accomplish?

It seems to me that in the future a search on Yahoo (or Google) will wind up pointing you to many results that are themselves current active sub-searches of a website's localized database. Anything else would seem to violate what they are trying to protect now.

InnerWeb

--
Freud might say that Intelligent Design is religion's ID.

bad idea by falsification · 2004-03-09 04:54 · Score: 1

So Google and Yahoo want to suck all the data out of my database, eliminate the middle man (me and my crazy web page interface to my data), and serve the world my data, denying me the ability to interact with my own customer base?

I just don't think that is going to fly.

if it... by maxpublic · 2004-03-09 05:23 · Score: 1

...results in more porn I'm all for it. You can never have too much porn.

Max

--
My god carries a hammer. Your god died nailed to a tree. Any questions?

*Look* static? Be static, dammit! by Chemisor · 2004-03-09 05:38 · Score: 1

> 1. Make your pages *look* static

I have not run across a lot of pages that actually need to be dynamically generated. Shopping carts and account settings need it, but if you make everything dynamic, like most misguided web developers do these days, you simply succeed at slowing your site down to a crawl and evoking a long stream of curses from people like me, who still think that broadband access is not worth $60 a month.

Re:*Look* static? Be static, dammit! by Anonymous Coward · 2004-03-09 09:44 · Score: 0

Why would your piss poor connection slow down dynamic content? If anything, people with slow connections should notice less when they are hitting a dynamic page, as the network will be slower than the server.

And it will only slow down the server if it's done piss poorly anyway.
Re:*Look* static? Be static, dammit! by Chemisor · 2004-03-10 03:32 · Score: 1

> why would your poor connection slow down dynamic content.

Because of high latency. Low throughput slows down everything, but dynamic content suffers more due to the amount of cross-talk it generates.

> And it will only slow down the server if it's done poorly anyway.

Judging from what I see on ALL the web sites I visit, there is simply no one left who can do it well.

brute forcing 48 passwords by Anonymous Coward · 2004-03-09 05:44 · Score: 1, Insightful

Huh?

It's safe to guess that the exp'y is within 4 years. (otherwise, move onto another card)

That's an amazing 48 possible "passwords" to brute force (assuming that cc subscriptions dates are uniformly distributed. any research on this?). I *THINK* there are >48 web merchants... Hmm.

This, of course, doesn't use the resources mentioned in the other posts.

On a related note... by cr0sh · 2004-03-09 06:05 · Score: 4, Interesting

What about the "invisible web"?

The so-called invisible web is indirectly related to the "deep web", with the exception that most of it isn't connected at all to the main web. Slashdot has had some articles regarding these hidden segments of the web - but has any progress been made on finding these "lost networks"?

Current theory on networks explains how and why these networks form and separate from the main web of connections, mainly due to loss of one of the tenuous threads from a supernode to the outlier nodes. When this loss occurs (an intermediary site goes offline, or popularity wanes, or a large meganode dies or stagnates), the network fragments - and getting back to the pages/sites within is nearly impossible, unless you already have a link to the inside, or a friend provides it to you.
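The fragmentation described here can be shown on a toy graph. The node names and link structure below are invented for the example; it is a sketch of the idea, not a model of the real web.

```python
def reachable(start, links):
    """Return every node that can be reached by following links from start."""
    found = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(links.get(node, ()))
    return found


# Hypothetical web: one intermediary is the only route to a small cluster.
web = {
    "hub": ["news", "intermediary"],
    "news": ["hub"],
    "intermediary": ["island-a"],
    "island-a": ["island-b"],
    "island-b": ["island-a"],
}
before = reachable("hub", web)

# The intermediary site goes offline, taking its outbound links with it.
offline = {
    page: [t for t in targets if t != "intermediary"]
    for page, targets in web.items()
    if page != "intermediary"
}
after = reachable("hub", offline)
lost = sorted(before - after - {"intermediary"})  # pages that were cut off
```

The two island pages still exist and still link to each other, but nothing reachable from the hub points at them any more. Adding a single link into the cluster from any reachable page would bring both back.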

Now, it is a good thing that this phenomenon exists - it seems to exist in all robust, evolving networks - whether those networks be electronically connected, socially connected (ie, Friendster, Orkut, or plain-ole social groupings), or bio/chemo connected (ie, the brain, the body, etc).

Even so, I wonder at all the information out there which I *can't* access, because it isn't indexed in some way. Sometimes you come across fragments and echoes in other archives (news, mail, irc) that lead to these far-off and displaced "locations" - but it is rare, and tedious to do unless you are looking for very needful information.

So I ask again, has anything been done to further the "searching" within/for the "invisible web"?

--
Reason is the Path to God - Anon

Re:On a related note... by DarkMan · 2004-03-09 14:01 · Score: 1

That's an interesting question, but ultimately, I can't see there being anything interesting in these invisible sections.

It will only take one link to reconnect a separate section. Whilst this may not be much for many networks, with search engines that walk the entire network, it's then going to re-enter the indices. At this point, it's connected by more than one link, and thus a bit more robust.

So, these invisible sections will only contain things that no one links to - which is a pretty good definition of not interesting, in general.

The only exceptions would be very specialist sites, of little interest to the general population. Even that is insufficient to ensure separation, because those specialist sites would have to be made by and for people who have no interest in the rest of the internet (think how many people link to where they work, or sites they developed, for example).

In short, I can't see any such search turning up anything worthwhile.

Wrong Conclusion... by telbij · 2004-03-09 06:22 · Score: 1

Why does everyone assume the top 10% of results on Google must be all the best information? Some people even said that in the same breath they complained about Google Spam. Ridiculous!

The fact is there is TONS of great independently published stuff that will never be found through Google because the author doesn't take the time to play the SEO game and advertise their page all over the web. Google's algorithm is far from the final word in relevancy algorithms. The evolution will continue until we have search engines that are smarter than humans. Of course, the evolution will probably continue after that, just without our interaction.

what google should do by CAIMLAS · 2004-03-09 06:32 · Score: 1

google should start a 'google development' search engine. normal google would still be available, but the googledev would have the same initial database, but use different algorithms and procedures with which it would classify material, thus yielding different results for the same searches... 'cutting edge' google. or it could even have its own search crawler, for that matter. that way they can start finding new ways to combat spammers.

--
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers

Probably not what he meant... by Anonymous Coward · 2004-03-09 06:34 · Score: 0

It's likely he was talking about younger children accidentally stumbling across something inappropriate, not a teenager with a box of tissues. Admittedly, ads alone are awfully lewd right now - but I can see the vague idea he's getting at.

I've found this to be true.... by Jasonv · 2004-03-09 07:12 · Score: 1

I've had a lot of actual experience with this. I've been researching a bunch of stuff on the history of Quebec City, and been using the Internet for most of it. Using Google and a few other search engines, I'll find a lot of information, but most of it is second-hand, urban legend and, often, completely wrong. Not that I don't expect that, but I'd also expect to find good sources listed; they do exist.

For example: try finding a biography on 'Louis Hebert' on the net. You'll find a few pages, some of them good mixed in with the expected crap. But what you WON'T find -- in fact someone had to tell me to look here -- is the entry in The Dictionary of Canadian Biography which includes a bibliography giving original sources. It is hands down the best source for the type of information I need.

After weeks of searching for different historical figures I didn't even come across The Canadian Encyclopedia. This site has detailed information, including video and pics, on absolutely everyone and everything Canadian, but I've never seen it come up in the first half dozen pages of a search. Google won't find it even when you limit it to that site.

But both these resources are all but lost in the 'deep web'.

Some 'island' sites have no links to them by wolverine1999 · 2004-03-09 07:52 · Score: 1

Some island sites, as I call them, have no links to them, so crawling will never find them.

Some of them do advertise on local tv/radio/papers/whatever but aside from that you'd never find them if someone doesn't tell you the URL.

This is of course not part of the 'deep web' but it is part of the "invisible web". It is impossible for any algorithm to find this on its own.

--
SCIREV.NET - fanfics,reviews & more

solutions by TaraByte · 2004-03-09 09:04 · Score: 1

First, Google does crawl dynamic sites using GET variables.

Second, if you install Apache's mod_rewrite you can change all your dynamic pages to "appear" static, thus allowing them to be more easily indexed.

--
Security is inversely proportional to the commitment of one desiring to circumvent it.

It's a Design Problem, not Search Problem by Anonymous Coward · 2004-03-09 09:11 · Score: 0

The reason government resources are not widely searched is because there is little incentive to make them searchable. Government websites are notoriously hard to use compared to commercial ones simply because businesses have a vested interest (money) to get their pages searched. High ranks in search results = more visitors = more business = more money. Government agencies do not work that way, not that they are intentionally trying to hide information, they just aren't as focused on the user as business sites are.

Ever seen an adult site tied up behind a complicated form? No, and you won't because these sites want to be searched and are designed with search engines in mind. There is no technical reason any website cannot be searched. All content should be browsable following standard HTML links; if human users want to aggregate the results, a form can be used. But if someone has to use a form to get at the content, that's a flaw in the design and the designer is to blame, not search engines.

Progress by vidnet · 2004-03-09 09:14 · Score: 1

The Deep Web, aka crapflooding submission forms

And analogously ... by cookie_cutter · 2004-03-09 10:58 · Score: 2, Insightful

If you have a public mail server, you deserve any spam you get...

Meta Tags Suck by Anonymous Coward · 2004-03-09 11:33 · Score: 0

Now maybe my site will come up more when people search for items in my database. Yay, more sales for me! I'll probably still mod_rewrite my urls in apache so that they look nicer, and are easier for people to "use," but at least it won't be as much of a necessity for the search engines to index me.

FYI mod_rewrite in apache will change a URL from:

http://www.mydomain.com/foo.jsp?blah=bar

to something like...

http://www.mydomain.com/foo/bar/

iirc, you can have it redirect any url you want to any other query string, using regex or just plain strings. also supposedly nice for fixing trailing slash issues (some systems assume a directory is a file when you don't put a trailing slash in the url).
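The mapping itself is just a regular-expression substitution. The snippet below is not Apache configuration and does not use mod_rewrite; it is a plain Python illustration of the idea, with a hypothetical rule that serves the static-looking path from the dynamic URL behind it.

```python
import re

# Hypothetical rule: /foo/<value>/ is served by /foo.jsp?blah=<value>.
# The trailing slash is optional, which covers the trailing-slash issue.
PRETTY = re.compile(r"^/foo/([^/]+)/?$")


def rewrite(path):
    """Map a static-looking path to the dynamic URL behind it."""
    return PRETTY.sub(r"/foo.jsp?blah=\1", path)


internal = rewrite("/foo/bar/")       # with trailing slash
no_slash = rewrite("/foo/bar")        # without trailing slash
untouched = rewrite("/other/page")    # does not match, passes through
```

Visitors and crawlers only ever see the `/foo/bar/` form; the query string stays on the server side.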

The web is one place by SphericalCrusher · 2004-03-09 15:58 · Score: 1

There is no deeper side to the web. The only thing they can do to make their search give more topics is to just go after something else in the website code, instead of metatags. Maybe even offer some kind of free promotion method, such as AddMe, for the users of their Instant Messaging and E-Mail services.

Google is like the US to other countries. We may have not been the first in space, but we sure as hell have been the farthest.

--
"Instant gratification takes too long." - Carrie Fisher

Sailing the seas of cheese by bluethundr · 2004-03-09 18:19 · Score: 1

A couple of years ago, I went to the H2k2 conference here in New York City. I saw a fascinating talk there where I first heard the term "deep web" and some of its ramifications for national security. National security was very much on our minds at the time being only roughly a mile and a half from what we call "Ground Zero" (never liked that term).

The guy giving the speech claimed that he was a retired FBI agent and seemed to have a great deal of insight into the inner workings of national intelligence. As pointed out in the article, the speaker made the same claim that search engines only gleaned about 1% of the total information on the web. He recommended a tool called Copernic (as well as one other one that I can't remember right now) that bills itself as a "deep web" search tool. But all it appears to do is assemble the results from a bunch of other search engines. I don't recall it ever returning anything significantly "deeper" than what your average google search can yield, however.

Back to the topic of national security, he made mention that terrorist communities are thriving on the fact that only 1% of the total amount of information on the web is readily accessible. All kinds of information that would be beneficial for the NSA to know is just plain inaccessible.

He also faulted the intelligence communities for hiring "blonde haired pretty boy" college graduates, fresh out of school, to analyze data in foreign languages instead of hiring local speakers. A 4.0 linguistics student will still miss out on a lot of the nuance to a conversation that a native, say Pashto, speaker will clue right into. Of course, the argument could be made that at least the "loyalties" of an American college graduate are almost guaranteed to be in the right place, but you can't ignore that he/she will be blind to much of the subtext of a conversation in a foreign language.

A little offtopic, but more alarmingly a point was made about the lack of digitization in the NSA of intelligence documents. Meaning that an agent will typically risk life and limb gaining access to a piece of information, who will then pass that info to a "runner" who places it in an "orange envelope" to signify its classified status. Then that same orange envelope goes into a locked filing cabinet where a good 7 or 8 times out of 10 it never sees the light of day and no attempt is made to analyze it.

But such is the challenge of the modern age. We are drowning in all of the information we produce. Vannevar Bush addressed this issue with astounding clarity right after World War II.

Quoth the Doctor:

"There is a growing mountain of research. But there is increased evidence that we are being bogged down today as specialization extends. The investigator is staggered by the findings and conclusions of thousands of other workers--conclusions which he cannot find time to grasp, much less to remember, as they appear. Yet specialization becomes increasingly necessary for progress, and the effort to bridge between disciplines is correspondingly superficial."

...and...

"The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships."

We are dealing with this problem (access to the information we produce) to a far greater extent than at any time in human history. The web, which was at one point designed and intended to be a more effective way to deal with and disseminate the oceans of data we produce, has little more than square-rigged ships to skim its surface.

--
Quod scripsi, scripsi.

you are missing the point... by frovingslosh · 2004-03-09 19:54 · Score: 1

There is plenty of very good information out there that isn't indexed. For example, I found a lot about the top level finances of my company, including compensation of the president and vice presidents, that was made a matter of public record when they filed the information as part of an IPO. However, unless I had found the IPO on the SEC website because I found a financial site that let me search for IPOs, I would have never known that the information was available to the public. No search engine would find it, even when given the name of the company and the names of the people involved, or the company name and the term IPO (and, interestingly, the copy of the IPO file documents that had been provided to myself and other managers were doctored to omit this information). At the very least, I would want all government information to be searchable on common search engines like Google. Not that I think they should be able to publish all of the information on the web that they do; but if they are going to publish it, then it should be easily searchable.

--
I'm an American. I love this country and the freedoms that we used to have.

So will cost 100x as much to run a website now? by paylett · 2004-03-09 20:19 · Score: 1

If some search engine is going to try and pull up every single record stored on some website's database - and do this every month or however often, then surely this is going to generate a heck of a lot more traffic than is necessary.

(Or even better, what happens when deep search engine #A starts crawling deep search engine/directory #B :)

--

Believing something doesn't make it true. Not believing something doesn't make it false.

So that's the trick by MsRee · 2004-03-10 04:51 · Score: 1

I've heard that too. Never got it to work, but I disable advanced features. Google has a way of indexing ridiculous things (guestbook signings and things) while completely overlooking any actual content I place online. It's amusing as hell.

--
In Soviet Russia, TV watches you!
