Searching the 'Deep Web'

With the 10% that is crawled by Trigun · 2004-03-09 01:51 · Score: 5, Funny

being pretty much total crap, I'd really hate to see the other 90%!

Re:With the 10% that is crawled by Zone-MR · 2004-03-09 02:27 · Score: 4, Informative

It could actually be useful content.

Let me give you an example. I run a forum. The main index page doesn't contain much information, just an overview of the latest posts and a brief introduction.

The rest of the content is what people submit. Here is the problem. The pages are generated dynamically. They end up having URLs like http://domain/index.php?act=showpost&postid=1244

Google sees index.php as one page, and does not attempt to submit any data via get/post. This means that effectively the most valuable content is missed.

Of course making it crawl /?yada=yada links has problems, namely the possibility of getting stuck in an infinite loop where data and links are tracked using sessions, and an infinite number of URLs could potentially yield valid, although very similar results.
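
One way a crawler might hedge against the loop described above is to normalize URLs (dropping session-style parameters) and cap how many variants of one path it fetches. This is a minimal sketch, not a description of how Google's crawler works: the parameter names in SESSION_PARAMS, the cap value, and the CrawlFrontier class are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical names of parameters that only carry session state.
SESSION_PARAMS = {"sid", "session", "phpsessid"}
# Arbitrary cap on distinct query-string variants fetched per path.
MAX_VARIANTS_PER_PATH = 3


def canonicalize(url):
    """Drop session-style parameters and sort the remaining ones."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


class CrawlFrontier:
    """Tracks which URLs were already accepted for fetching."""

    def __init__(self):
        self.seen = set()
        self.per_path = Counter()

    def should_fetch(self, url):
        canon = canonicalize(url)
        if canon in self.seen:
            return False  # same page reached through a different session or ordering
        parts = urlsplit(canon)
        key = (parts.netloc, parts.path)
        if self.per_path[key] >= MAX_VARIANTS_PER_PATH:
            return False  # too many variants of this script already queued
        self.seen.add(canon)
        self.per_path[key] += 1
        return True
```

The cap is the trade-off the thread keeps circling: set it low and a forum's posts go unindexed, set it high and a session-tracking site can keep the crawler busy indefinitely.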
Re:With the 10% that is crawled by Turing+Machine · 2004-03-09 02:54 · Score: 2, Interesting

http://domain/index.php?act=showpost&postid=1244

Google sees index.php as one page, and does not attempt to submit any data via get/post.

Hmm... I see plenty of pages in Google that have URLs with GET parameters, so there must be some way of getting it to crawl them. Or am I misunderstanding what you're saying? Maybe the key here is to provide an alternate route to those pages without doing anything fancy (drop-down menus, radio buttons, etc.). Just generate another page that contains a regular link to all your pages. You could hide that page from your regular users by, say, linking it to a 1x1 pixel transparent GIF. A robot will find it, but most of your users won't even notice.
Re:With the 10% that is crawled by Zone-MR · 2004-03-09 02:59 · Score: 2, Interesting

Hmm... I see plenty of pages in Google that have URLs with GET parameters, so there must be some way of getting it to crawl them. Or am I misunderstanding what you're saying? Maybe the key here is to provide an alternate route to those pages without doing anything fancy (drop-down menus, radio buttons, etc.). Just generate another page that contains a regular link to all your pages. You could hide that page from your regular users by, say, linking it to a 1x1 pixel transparent GIF. A robot will find it, but most of your users won't even notice.

Yeah, I can see that google sometimes lists pages with get content in its index. It doesn't want to do it for a lot of pages though, and I haven't figured out why. There seems to be nothing different in the HTML.

Hypothetically speaking, what's there to stop someone doing a:

<?php
// Emit a link whose path differs on every request, so each fetch exposes a "new" URL.
print("<a href='thispage.php/" . rand() . "'>Some page...</a>");
?> ... and looping google?
Re:With the 10% that is crawled by dealsites · 2004-03-09 04:55 · Score: 2, Informative

I agree that the search engines do not index dynamically generated pages very well. This page on my site http://www.dealsites.net/index.php?module=MyHeadlines&func=view&myh=menu&gid=22&pid=2&eid=504&tid=300&context= hasn't seemed to attract any of the search engines yet. I'm not sure why; the data changes hourly and I have a direct link to that page on my site.

However, when search engines do start doing deep crawls, especially if they do POSTs and GETs, then the bandwidth of the web site will go up tremendously. While it is important to get crawled, what happens when your site uses more bandwidth for search engines than users? Also what would prevent other companies from developing their own search engines? Then you might have 20 or more search engines doing deep crawls every month. Many websites are operated on low-cost low-bandwidth hosting plans.
Re:With the 10% that is crawled by bheer · 2004-03-09 05:21 · Score: 2, Informative

Yeah, I can see that google sometimes lists pages with get content in its index. It doesn't want to do it for a lot of pages though, and I haven't figured out why. There seems to be nothing different in the HTML.

One word: backlinks. Pages, even with request parameters, that get linked to from lots of popular (high-pagerank) sites get indexed.

--
Go somewhere random
Re:With the 10% that is crawled by danielsfca2 · 2004-03-09 09:32 · Score: 2, Insightful

Hey cheapskate. Maybe if you subscribed to Salon you wouldn't have that problem. Independent news sites like Salon are going to disappear if they get no revenue. Maybe next time you visit salon.com, it'll say "Thanks to our former subscribers for the support. Due to our operating costs going through the roof but only four people subscribing, we've been forced to go out of business. This domain was bought by Fox News in bankruptcy proceedings. Click here to go there now."

If you're too cheap to pay for anything, you have to be satisfied with things like ad-supported internet access (see NetZero) and ad-supported news (like salon's day-pass, and fucking TV, where's the complaining about CNN?). Yes, the ads are more intrusive than they were in 1999. The venture capital investment is gone and advertisers won't pay jack for barely-there banner ads. Now they want your full attention for a moment. So WTF is salon.com supposed to do, just say, "Everything is free! No ads! When the bandwidth bill comes, we'll just mail them some monopoly money"??

If ad-supported websites didn't exist, the only people who could afford to publish on the Internet would be the conglomerated media who make their money from--say it with me--ad revenue from TV (etc.). Get it yet?

Now, Mr. Troll, get back under your bridge.

Deep Web? by Traicovn · 2004-03-09 01:52 · Score: 5, Insightful

I bet you this new 'Deep Web' search technology would be something that does not observe robots.txt...

--

[Something witty and intelligent should have appeared here.]
{Traicovn}

Re:Deep Web? by Anonymous Coward · 2004-03-09 01:54 · Score: 3, Insightful

Good. If you leave things publicly accessible on an open web server, that's your own damned fault. Let the engines crawl where they please.
Re:Deep Web? by Anonymous Coward · 2004-03-09 02:18 · Score: 2, Interesting

User-agent: *
Disallow: /s3kr3t/

trawler: "Hey cool, thx for the tip I never would have thought to try /s3kr3t/"
Re:Deep Web? by AndroidCat · 2004-03-09 02:21 · Score: 2, Insightful

# go away. No, really - this means you!
User-agent: *
Disallow: /
And if they don't listen, feed them a huge maze of generated links that eventually lead to goatse or something. Or just block their crawler at the router and they can search their intranet.

--
One line blog. I hear that they're called Twitters now.
Re:Deep Web? by JDevers · 2004-03-09 02:32 · Score: 2, Informative

If I'm not mistaken, the original reason for robots.txt was to prevent endless loops from confusing spiders, not to "cover" some information that would otherwise be easily accessible. Of course, others use it for other things now...
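
The robots.txt rules quoted in this thread can be checked with Python's standard urllib.robotparser. This is a minimal sketch: the agent name "ExampleBot" and the allowed() helper are made up for illustration, and the rules are the ones posted above. It also shows why the file is only advisory: the crawler itself runs the check, so a crawler that skips it sees everything.

```python
from urllib.robotparser import RobotFileParser

# The rules posted earlier in the thread.
RULES = """\
User-agent: *
Disallow: /s3kr3t/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if a well-behaved crawler named `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)
```

For example, allowed("/s3kr3t/page.html") is False while allowed("/index.html") is True, but nothing stops a client from reading the Disallow line and requesting /s3kr3t/ anyway.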
Re:Deep Web? by Anonymous Coward · 2004-03-09 02:36 · Score: 2, Insightful

Well, I know that we use robots.txt to cover some directories that are both publicly accessible, and that we want people to be able to get the data in, yet that data is pretty useless unless you are visiting it from our link. We do signal processing, and looking at our data tables and our raw log files would be completely useless and can really alter a web search.
Re:Deep Web? by Rorschach1 · 2004-03-09 03:18 · Score: 2, Funny

Doesn't observe it? It probably relies on it - tells you where the good stuff is!

Damn ... by Anonymous Coward · 2004-03-09 01:54 · Score: 2, Funny

I remember browsing the WWW directory in '93 and being able to scroll through all the sites on my VAX session at university. Are you telling me I am one of the few people who actually ever reached the end of the internet?

Oh yeah, a whole new pair of dimes by stienman · 2004-03-09 01:54 · Score: 3, Funny

"Will access to this new level of specific information change how we deal with companies, governments and private institutions?"

Yeah. It means I'll be able to use someone else's credit card for more of my transactions, since finding credit cards, SSNs and other...uh...'deep web' stuff will be so much more accessible.

-Adam

Re:Oh yeah, a whole new pair of dimes by dsanfte · 2004-03-09 02:17 · Score: 4, Insightful

I wish you luck using that credit card number without the appropriate expiration date. The FUD spreaders rarely mention the fact that exp dates are almost never stored with the numbers themselves.

--
occultae nullus est respectus musicae - originally a Greek proverb
Re:Oh yeah, a whole new pair of dimes by Zone-MR · 2004-03-09 02:18 · Score: 2, Insightful

So are you implying that your credit card information is currently available on web pages, with no password protection, and the only thing stopping hackers is that it isn't listed in a search engine?

Deep Web? by dingo · 2004-03-09 01:54 · Score: 2, Funny

Why do I get the feeling that you will get a lot more search results for Linda Lovelace when searching the "Deep Web"?

--
The Borg assimilated my race & all I got was this lousy T-shirt

Deep web? by hookedup · 2004-03-09 01:55 · Score: 4, Funny

Doesn't crap sink? Not sure I want to know what the other 90-odd percent is. After tubgirl, goatse, etc.. what else could possibly be next..

deep web? by rjelks · 2004-03-09 01:55 · Score: 4, Funny

Is it just me, or does this sound like we're gonna get more pr0n when we search?

-

--

Tech News, Reviews and Tutorials

Maybe I'm just missing the point... by robslimo · 2004-03-09 01:56 · Score: 5, Interesting

...but I don't want to see the guts of a web form. If I understand correctly, they're talking about crawling into databases, actually parsing a Microsoft Access file, for instance. I see that as having dubious merit, and potentially pissing off web site owners. Web site designers go to a lot of trouble to provide the interface they want you to see to their data. This would just sidestep the interface and dump you into the data.

At the very least, it might require an overhaul or extension to the robots exclusion specification to keep spiders out of your data.

But if you bypass the front pages... by oneiros27 · 2004-03-09 01:57 · Score: 3, Insightful

Of course, it's nice to know that the content's there, but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?

I could care less about Ticketmaster whining about deep linking, but there's probably some stuff out there that, if it isn't taken in the context of its intended point of entry, may have other problems.

I'm afraid that this is going to give people more reason to go back to using frames, and 'detecting' if their content has been hijacked, and writing more bad code that causes multiple windows to pop up all over the place, and/or crash browsers.

--
Build it, and they will come^Hplain.

Re:But if you bypass the front pages... by CAIMLAS · 2004-03-09 06:07 · Score: 4, Insightful

but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?

Hello, 1996 is calling; they want their paranoia back!

Goodness, you aren't serious, are you? Have you used a search engine in the last couple years? Have you not ever looked for porn yourself? Just hop over to images.google.com and enter the name of a porn star - bam, shitloads of smut. Not only that, but search google.com for a porn star's name (many of which you could easily find by searching for 'famous porn stars', I'm sure) and you'll find gallery after gallery of porn, open and free.

There is no such thing as protecting your kids from porn on the internet anymore. If you don't want to have them looking at porn, don't let them online or police their actions.

--
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers

PHP? by TGK · 2004-03-09 01:57 · Score: 4, Interesting

Since I moved my site over to a PHP-based system, nothing beyond my index page gets a second look from google. As web content moves away from static pages to more dynamic solutions (particularly XML) a more sophisticated crawler is needed, one that can read over this bewildering maelstrom of data and extract from it meaning and content.

While I find it highly unlikely that this system will do well with large databases (or even databases at all for that matter) it is a step in the right direction. Google will probably have their version up on labs inside a month.

--
Killfile(TGK)
No trees were killed in the creation of this post. However, many electrons were inconvenienced.

Re:PHP? by andygrace · 2004-03-09 02:09 · Score: 2, Insightful

Well the front pages might be, with a few top stories, but the real problem lies in getting at all the information that is stored in SQL databases ...
There are reams of stuff in there that a search engine can't see. XML could be used to deep search these entire databases, rather than just the stuff that's pulled into the UI by the PHP code.
Re:PHP? by DeadSea · 2004-03-09 02:14 · Score: 5, Interesting
Keep in mind that googlebot comes in two flavors, freshbot, and deepbot.
Freshbot is meant to update the google cache for pages that change frequently. Freshbot may pull pages as much as every couple hours for really popular pages that change frequently.
Deepbot goes out once every month or two and follows links. The higher your pagerank, the deeper into your site it will go. If you want more of your site to get crawled here are some tips:
1. Make your pages *look* static (end in .html)
2. Avoid CGI parameters except for handling form data (no ? in url)
3. Put all pages in the document root, or in very shallow subdirectories. Google goes after less and less as the directories get deeper.
It is likely that deepbot just hasn't run since you updated your site, so freshbot is just pulling your front page occasionally.
BTW: I noticed you have a link to my cheat sheet on your links page. Thanks! :-)
Re:PHP? by Xner · 2004-03-09 02:17 · Score: 4, Informative

I'm not exactly sure what you mean. If it is accessible by clicking on links, most search engines should be able to index it. If you want to be extra-friendly you can use $PATH_INFO to make dynamic pages look more like static ones, e.g.:
http://site.com/blah/prog.php/stat/1
instead of
http://site.com/blah/prog.php?stat=1
I use it all the time and it works really well.

--
Pathman, Free (as in GPL) 3D Pac Man
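
The PATH_INFO approach described above can be sketched in Python: the web server hands the script whatever follows its name in the path (for example /stat/1), and the script turns that into parameters. Reading the segments as alternating key/value pairs is an assumption for illustration; the post doesn't say how prog.php interprets /stat/1, and params_from_path_info is a hypothetical helper.

```python
from urllib.parse import unquote


def params_from_path_info(path_info):
    """Turn '/stat/1/page/2' into {'stat': '1', 'page': '2'}."""
    # Ignore empty segments produced by leading, trailing or doubled slashes.
    segments = [unquote(segment) for segment in path_info.split("/") if segment]
    if len(segments) % 2:
        raise ValueError("expected key/value pairs in PATH_INFO")
    # Even positions are keys, odd positions are their values.
    return dict(zip(segments[0::2], segments[1::2]))
```

With this, prog.php/stat/1 and prog.php?stat=1 carry the same information, but the first looks like a static path to a crawler that is wary of query strings.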

From the article by sczimme · 2004-03-09 01:58 · Score: 4, Insightful

Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.

There is a reason for this: a Google search should turn up pointers to the items in the so-called "deep web" (*gag*). To use one of the examples above: if I am looking for information on patents, the search terms I use should point me to the US Patent and Trademark Office. It shouldn't have to point me to all 12 bajillion patent filings.

Besides, what makes anyone think this is going to fly after all the hubbub over "deep-linking"?

--
I want to drag this out as long as possible. Bring me my protractor.

Re:From the article by Professeur+Shadoko · 2004-03-09 03:55 · Score: 2, Interesting

Right.
But if you are interested in a specific subject..
Let's say you have a technical problem.
Chances are somewhere on the planet someone submitted the same problem on a web-based forum.

Now you want google to give you THAT specific message.
You don't want google to tell you "hmmm... I guess the solution must be in one of those zillions of forums here, here, and here".

Spiders? by Vo0k · 2004-03-09 01:58 · Score: 4, Interesting

...and I wonder about something different.
Has anyone tried this yet? Change your user agent string to one matching the googlebot and crawl the web. I'm pretty sure many "registration only" websites would magically open themselves, but I wonder about other differences too :)

--
Anagram("United States of America") == "Dine out, taste a Mac, fries"
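
The experiment proposed above amounts to sending requests with a crawler-style User-Agent header. A minimal sketch with Python's standard urllib follows; the CRAWLER_UA string is illustrative and build_request is a hypothetical helper, so check the crawler's published string before relying on it.

```python
from urllib.request import Request

# Illustrative crawler-style User-Agent string (an assumption, not verified here).
CRAWLER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def build_request(url, user_agent=CRAWLER_UA):
    """Build (but do not send) a request that claims to come from a crawler."""
    return Request(url, headers={"User-Agent": user_agent})
```

Passing the result to urllib.request.urlopen() would send it. Whether a site opens up depends on whether it checks anything beyond this header.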

Re:Spiders? by MyHair · 2004-03-09 03:07 · Score: 2, Interesting

Good question. I haven't tried it yet, but I've run into several sites that Google indexes but the site refuses me entry until I register (which I don't). Some of them are clever enough to put Javascript (or something) in to prevent you from looking at Google's cache of that page. Yeah, I could get around that, but usually by then I figure I don't care what that site has to say.
Re:Spiders? by poot_rootbeer · 2004-03-09 05:41 · Score: 2, Informative

I can't speak for everyone, but here we check not only a spider's User Agent string, but also whether the request is coming from Google's IP range or elsewhere. So your results may not be so great.

Then again, I've defeated many registration (er, pr0n) gateways by just setting a Referer header identical to the URL I'm requesting, so some defenses are better than others...
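
The server-side check described in this comment (User-Agent plus source address) might look roughly like the sketch below. The network in TRUSTED_CRAWLER_NETS is a reserved documentation range standing in for a real crawler's addresses, and is_trusted_crawler is a hypothetical helper; neither comes from the poster's site.

```python
import ipaddress

# Hypothetical crawler network: a reserved documentation range, NOT Google's real addresses.
TRUSTED_CRAWLER_NETS = [ipaddress.ip_network("192.0.2.0/24")]


def is_trusted_crawler(user_agent, remote_addr):
    """Accept a request as a crawler only if both the header and the source IP match."""
    if "googlebot" not in user_agent.lower():
        return False
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in TRUSTED_CRAWLER_NETS)
```

A spoofed User-Agent from an ordinary address fails the second test, which is why the header trick alone may not get far on sites that do this.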

Privacy and Crap by jackb_guppy · 2004-03-09 02:00 · Score: 2, Interesting

Going after the other 90% does not mean that new things will come to the top. Oh, there may be a few cool items like "Who really shot JFK" or the launch code for a Trident.

But in reality the other 90% would most likely be best left unfound. Who really wants to learn that their parents were not married in the manner they were told?

Just as in archaeology, you will find a nice vase or two... but the rest is rubble.

You understand that digging through a garbage dump is the best place to find things in archaeology, because people cleaned their houses then too. That is what the other 90% is... a dump of information.

Google by nycsubway · 2004-03-09 02:02 · Score: 2, Insightful

Generally, google finds the pages that the authors want to be searched. That's why you submit your site to google. Even if you don't submit your site to google, if it's on a domain that google searches and there is a link to it, it'll be found.

With google storing more than 4 billion web pages, I'd hate to see what kind of crap the other 99% is.

Perhaps they count each iteration of a dynamic page as a separate page? Even so, google's news page does a great job searching in real time for pages that change dynamically.

--
http://github.com/gbook/nidb

Top 4 by UncleBiggims · 2004-03-09 02:02 · Score: 5, Informative

About.com lists the top 4 places to search the deep web as:

Anybody use any of these sites? Are they any good? Just wondering why this is getting to be news if sites like these already exist.

Are you Corn Fed?

Re:Top 4 by BReflection · 2004-03-09 02:15 · Score: 3, Informative

Search Systems is another good site. They make 17,834 public databases accessible.

--
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"

1 percent,? by zonix · 2004-03-09 02:02 · Score: 4, Insightful

The article alleges that current search services like Google manage to access less than 1% of the web [...]

1 percent, and I still don't have a problem feeling lucky almost every time I do a search on google.

z

--
What would an EWOULDBLOCK block, if an EWOULDBLOCK could block would? -- me

Relevancy by Traicovn · 2004-03-09 02:02 · Score: 4, Insightful

Judging by the problems with relevancy that often occur in current search engines (I think of the problem with meta keywords, which for many search engines are now completely useless, and google-bombing), why would a customer pay to add more data to the search engine? The idea of course is 'because they'll be more relevant and because they have more information will come up more often'. However, if search engines start searching more and more of this 'deep web', how badly will relevancy be affected? I mean, the more data that is in there, the more chances there are of relevancy being broken, and if the weighting is in favor of these 'featured' searches, then relevancy may be even more broken. Sure, these companies will have more traffic directed to them, but will it merely be useless traffic by frustrated users searching for something else?

I run a search engine for an educational institution, and I will admit, Google misses a significant number of our documents. On the other hand, some of those documents are scripts that when queried will create a (virtually) infinite amount of data (calendar scripts, etc). How deep do we really need to go though? Do we really need to include calendar entries for the year 2452?

I'm also confused, is this search service 'pay by the searcher' or 'pay by the content provider'. It seems to be content provider to me.

--

[Something witty and intelligent should have appeared here.]
{Traicovn}

Limitations of Google by PingKing · 2004-03-09 02:07 · Score: 3, Insightful

One limitation of Google is the fact that a site that bases its navigation on a drop-down menu or submission form (i.e. choose a section from the list and click Go) cannot be spidered by Google.

Personally, I find this infuriating. A site I once worked on was available in numerous languages, which could be chosen from a drop-down list box. The upshot of this is that Google has only cached the site in English, meaning users who would use the other languages do not get my site returned when they search in Google.

We need an open-source alternative that can address these problems, as well as get rid of the security concerns and mysterious methods Google uses to rank sites.

--

Patriotism - the last resort of scoundrels.

Re:Limitations of Google by Stiletto · 2004-03-09 02:33 · Score: 4, Insightful

Solution: Web designers, stop trying to be so clever.

If you want your site to be spiderable, don't hide it behind javascript and flash!

Article by Anonymous Coward · 2004-03-09 02:09 · Score: 3, Informative

When Yahoo announced its Content Acquisition Program on March 2, press coverage zeroed in on its controversial paid inclusion program, whereby customers can pony up in exchange for enhanced search coverage and a vaunted "trusted feed" status. But lost amid the inevitable search-wars storyline was another, more intriguing development: the unlocking of the deep Web.

Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.

Today, the deep Web remains invisible except when we engage in a focused transaction: searching a catalog, booking a flight, looking for a job. That's about to change. In addition to Yahoo, outfits like Google and IBM, along with a raft of startups, are developing new approaches for trawling the deep Web. And while their solutions differ, they are all pursuing the same goal: to expand the reach of search engines into our cultural, economic and civic lives.

As new search spiders penetrate the thickets of corporate databases, government documents and scholarly research databanks, they will not only help users retrieve better search results but also siphon transactions away from the organizations that traditionally mediate access to that data. As organizations commingle more of their data with the deep Web search engines, they are entering into a complex bargain, one they may not fully understand.

Case in point: In 1999, the CIA issued a revised edition of "The Chemical and Biological Warfare Threat," a report by Steven Hatfill (the bio-weapons specialist who became briefly embroiled in the 2001 anthrax scare). It's a public document, but you won't find it on Google. To find a copy, you need to know your way around to the U.S. Government Printing Office catalog database.

The world's largest publisher, the U.S. federal government generates millions of documents every year: laws, economic forecasts, crop reports, press releases and milk pricing regulations. The government does maintain an ostensible government-wide search portal at FirstGov -- but it performs no better than Google at locating the Hatfill report. Other government branches maintain thousands of other publicly accessible search engines, from the Library of Congress catalog to the U.S. Federal Fish Finder.

"The U.S. Government Printing Office has the mandate of making the documents of the democracy available to everyone for free," says Tim Bray, CTO of Antarctica Systems. "But the poor guys have no control over the upstream data flow that lands in their laps." The result: a sprawling pastiche of databases, unevenly tagged, independently owned and operated, with none of it searchable in a single authoritative place.

If deep Web search engines can penetrate the sprawling mass of government output, they will give the electorate a powerful lens into the public record. And in a world where we can Google our Match.com dates, why shouldn't we expect that kind of visibility into our government?

When former Treasury Secretary Paul O'Neill gave reporter Ron Suskind 19,000 unclassified government files as background for the recently published "Price of Loyalty," Suskind decided to conduct "an experiment in transparency," scanning in some of the documents and posting them to his Web site. If it weren't for the work of Suskind (or at least his intern), Yahoo Search would never find Alan Greenspan's scathing 2002 comments about corporate-governance reform.

The CIA and Dick Cheney notwithstanding, there is no secret government conspiracy to hide public documents from view; it's largely a matter of bureaucratic inertia. Federal information technology organizations may not solve that proble

Bad kitty! by Underholdning · 2004-03-09 02:17 · Score: 4, Interesting

There's a perfectly good reason why a webcrawler doesn't (and shouldn't) crawl the backend databases. I have customers with items and prices in their database. They update that on a daily basis. I have customers that provide directory solutions. We update that information on a daily basis. Now, imagine the turmoil that will arise when people find outdated items using their favorite search engine, which crawls the database once in a blue moon. Nuff said. Bad idea.

--
Underholdning.info

Re:Bad kitty! by cowscows · 2004-03-09 03:55 · Score: 2, Informative

Exactly. The article mentions things like flight schedules and classified ads. Those sorts of rapidly and constantly changing information sources need a completely different system to effectively search them. Fortunately, they've already been invented. Orbitz, and cheap tickets, and expedia are a few of many that handle flight schedules. Any website for a local newspaper probably does a decent job with classified ads.

If I want to find cheap airline tickets, I put "airline tickets" into google, and it'll give me a list of websites that are designed to help me find airline tickets. It doesn't try and find the actual flights for me, and that's ok.

This deep web browser idea is going to end up being a feature bloated search engine that does lots of things, but does them all poorly, and does nothing particularly well.

--
One time I threw a brick at a duck.

Useless statistic of the week by Alomex · 2004-03-09 02:17 · Score: 2, Funny

The article alleges that current search services like Google manage to access less than 1% of the web.

There's a useless statistic if you ask me.

I just wrote a cgi script that, upon requesting the url "http://bogus.com/nnnnn" returns a page with the text "nnnnn" where nnnnn is any number up to 1000 digits long. So there, I just added 10^1000 pages to the "deep web" of which google indexes none! (gasp).

So there, Google now indexes less than 0.001% of the deep web.

True nature of the deep database problem by andygrace · 2004-03-09 02:19 · Score: 5, Informative

I don't think most posters understand the issue - most websites are now run out of content management systems, and search engines just trawl the web storing current pages. This is fine in a static internet, but with pages changing on a minute by minute basis (for example, a news site that pulls out the latest headlines), all you're going to have indexed in Google is what's on the page today.

Now say I was looking for info from a few weeks ago - Google is not necessarily the best way of finding this info. It's all still sitting there in the database, but it's not on the site's front page. archive.org may have a copy of it, but it would be much better to have google.com talk XML in a standard method to the news site's content management system, and have ALL the data there for a search.

Funny by BenBenBen · 2004-03-09 02:21 · Score: 4, Interesting

Google's always been good enough for me.

--
The Slashdot Paradox: "100% Overrated"

only missing 90 TB? by DeathBunnyRanger · 2004-03-09 02:27 · Score: 2, Funny

the internet is only 90 terabytes?

that is what salon says, and I think that is bull, given my favorite porn site offers 20gigs of raunchy action.

Insight on the "deep web" by saddino · 2004-03-09 02:53 · Score: 3, Funny

99% of the "deep web" probably looks like this. Indexable? Sure. Necessary? No.

How?? by Haydn+Fenton · 2004-03-09 03:01 · Score: 3, Interesting

I think I have a pretty good understanding of how google works..

People submit their site, google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages etc. So that pretty much the whole site is included.

This obviously means pages which are not linked do not get included in Google's search, so I'm not surprised at the fact that less than 1% is ever crawled.

So how does this new method of crawling work? How can it possibly know what files are on the server if they are not linked in any way. The only way I can think of is a brute-force type method, which seems extremely stupid to me, since that would consume so much of the search engine's resources.

This also brings me onto the next point. Like a few people have mentioned, there are certain pages on the web which append a string onto the end or before the beginning of the URL, for example yourname.ismyfriend.com or www.somegamesite.com/attack.php?player=bob&attacks=5, so how many variations would the crawler try before deciding that was enough and moving on to the next link?

Also, since most of the internet is porn, and this newfound technology will reveal another 90 percent or so of the internet, are we suddenly going to be showered with explicit sites?

Re:How?? by MImeKillEr · 2004-03-09 03:41 · Score: 4, Interesting

People submit their site, google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages etc. So that pretty much the whole site is included.

Google doesn't just search pages submitted - I've got an Apache webserver running at home, doling out pages for family photos and stats for a local UT2K3 server. I hadn't set up a robots.txt to stop search engines from crawling it (didn't think I needed to) and one day entered my URL in google, only to find it.

I've never submitted the URL to google.

Should we assume that Google's already crawled a majority of the sites out there?

BTW, Yahoo has no record of my site in their database.

--
Cruising the internet on my TI-99/4A @ a whopping 300 baud!

Warnings are there to limit liability. by oneiros27 · 2004-03-09 03:11 · Score: 3, Insightful

It's rather stupid, but it has to do with legal practices.

If you have no warnings, then someone can claim that you forced your content on them, and they didn't know what they were getting into, and it was offensive.

Putting up warnings, which inform the user that they shouldn't enter your site if it's illegal for them to do so, shifts part of the burden of responsibility to them, and away from you.

So, if you're sued for having distributed offensive material, you can claim that you provided warnings, and that the person chose to disregard them. [Sort of like putting up 'wet floor' signs -- if someone gets hurt, they made an active decision to ignore the sign]

--
Build it, and they will come^Hplain.

another form of DOS by ramar · 2004-03-09 04:03 · Score: 2, Interesting

If the boys with fat pipes start indexing "deeper" into sites, I think we're going to see a lot of sites going offline until they've been refactored to handle this sort of thing.

The frontend webservers that serve the static pages are fine (they're already being spidered now), but the dynamic content, largely dependent on databases and such, very likely wasn't built to handle this sort of load. Once the new engines get their hooks into these pieces, they're going to be in trouble.

On a related note... by cr0sh · 2004-03-09 06:05 · Score: 4, Interesting

What about the "invisible web"?

The so-called invisible web is indirectly related to the "deep web", with the exception that most of it isn't connected at all to the main web. Slashdot has had some articles regarding these hidden segments of the web - but has any progress been made on finding these "lost networks"?

Current theory on networks explains how and why these networks form and separate from the main web of connections, mainly due to loss of one of the tenuous threads from a supernode to the outlier nodes. When this loss occurs (an intermediary site goes offline, or popularity wanes, or a large meganode dies or stagnates), the network fragments - and getting back to the pages/sites within is nearly impossible, unless you already have a link to the inside, or a friend provides it to you.

Now, it is a good thing that this phenomenon exists - it seems to exist in all robust, evolving networks - whether those networks be electronically connected, socially connected (ie, Friendster, Orkut, or plain-ole social groupings), or bio/chemo connected (ie, the brain, the body, etc).

Even so, I wonder at all the information out there which I *can't* access, because it isn't indexed in some way. Sometimes you come across fragments and echoes in other archives (news, mail, irc) that lead to these far-off and displaced "locations" - but it is rare, and tedious to do unless you are looking for very needful information.

So I ask again, has anything been done to further the "searching" within/for the "invisible web"?

--
Reason is the Path to God - Anon

And analogously ... by cookie_cutter · 2004-03-09 10:58 · Score: 2, Insightful

If you have a public mail server, you deserve any spam you get...
