Searching the 'Deep Web'
abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"
being pretty much total crap, I'd really hate to see the other 90%!
I bet you this new 'Deep Web' search technology would be something that does not observe robots.txt...
[Something witty and intelligent should have appeared here.]
{Traicovn}
Best way to do a full deep search then
These new deep-web crawlers try to ignore the robot access control files. They try to intelligently determine if they're in some type of infinite looping situation, but basically this is how they work.
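For contrast, here is a rough sketch of a crawler that does honor the exclusion file and keeps a crude guard against looping. Everything specific in it is made up for illustration: the robots.txt contents, the `ExampleBot` agent name, the example.com URLs and the depth cap of 5.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would fetch this file
# from the site it is about to visit.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url, agent="ExampleBot"):
    """Return True if the rules in ROBOTS_TXT permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

def should_visit(url, seen, depth, max_depth=5):
    """Skip disallowed URLs, URLs already seen, and anything too deep.

    The `seen` set and the depth cap together are the loop guard.
    """
    if depth > max_depth or url in seen or not allowed(url):
        return False
    seen.add(url)
    return True
```

A crawler that "ignores" the file simply never calls something like `allowed()`, which is why the exclusion standard only works on the honor system.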
I remember browsing the WWW directory in '93 and being able to scroll through all the sites on my VAX session at university. Are you telling me I am one of the few people who actually ever reached the end of the internet?
Will access to this new level of specific information change how we deal with companies, governments and private institutions?
Yeah. It means I'll be able to use someone else's credit card for more of my transactions, since finding credit cards, SSNs and other...uh...'deep web' stuff will be so much more accessible.
-Adam
Why do I get the feeling that you will get a lot more search results for Linda Lovelace when searching the "Deep Web"?
The Borg assimilated my race & all I got was this lousy T-shirt
If you don't want it indexed and looked at, don't put it on the web in the first place.
Doesn't crap sink? Not sure I want to know what the other 90-odd percent is. After tubgirl, goatse, etc., what else could possibly be next?
Is it just me, or does this sound like we're gonna get more pr0n when we search?
-
Tech News, Reviews and Tutorials
but it will get us 90% more useless results. The regular search spam on Google is bad enough (it's getting to the level of bad results AltaVista had before Google took over the throne) without this extra noise...
so maybe that's why google never tells me anything about servicing this teletype machine...
it's amazing to think how much more information we'd have access to if google (or another search engine) could search 90% of what's out there. i mean, just at 1% we already say, "google knows all"...but I don't want to see the guts of a web form. If I understand correctly, they're talking about crawling into databases, actually parsing a Microsoft Access file, for instance. I see that as having dubious merit, and potentially pissing off web site owners. Web site designers go to a lot of trouble to provide the interface they want you to see to their data. This would just sidestep the interface and dump you into the data.
At the very least, it might require an overhaul or extension to the robots exclusion specification to keep spiders out of your data.
One could Deep Web your system and find all your prOn and your webcam feed?
Nevermind the rest of the 99%, especially if they are dups or trashy info.
These 99% might not be *intended* public info too. Privacy is a consideration here.
Hey, that's my password you are typing
Of course, it's nice to know that the content's there, but how many children are now going to be able to bypass the disclaimer pages on porn sites because of deep linking?
I could care less about Ticketmaster whining about deep linking, but there's probably some stuff out there that may cause other problems if it isn't seen in the context of its intended point of entry.
I'm afraid that this is going to give people more reason to go back to using frames, and 'detecting' if their content has been hijacked, and writing more bad code that causes multiple windows to pop up all over the place, and/or crash browsers.
Build it, and they will come^Hplain.
Since I moved my site over to a PHP-based system, nothing beyond my index page gets a second look from google. As web content moves away from static pages to more dynamic solutions (particularly XML) a more sophisticated crawler is needed, one that can read over this bewildering maelstrom of data and extract from it meaning and content.
While I find it highly unlikely that this system will do well with large databases (or even databases at all for that matter) it is a step in the right direction. Google will probably have their version up on labs inside a month.
Killfile(TGK)
No trees were killed in the creation of this post. However, many electrons were inconvenienced.
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.
There is a reason for this: a Google search should turn up pointers to the items in the so-called "deep web" (*gag*). To use one of the examples above: if I am looking for information on patents, the search terms I use should point me to the US Patent and Trademark Office. It shouldn't have to point me to all 12 bajillion patent filings.
Besides, what makes anyone think this is going to fly after all the hubbub over "deep-linking"?
I want to drag this out as long as possible. Bring me my protractor.
...and I wonder about something different. :)
Has anyone tried this yet? Change your user agent string to one matching the googlebot and crawl the web. I'm pretty sure many "registration only" websites would magically open themselves, but I wonder about other differences too
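The mechanics of the experiment are simple enough. A minimal sketch using Python's standard library follows; the User-Agent string below is an assumed example, not necessarily what any real crawler sends, and the example.com URL is a placeholder. The request is only built here, not sent.

```python
from urllib.request import Request

# Assumed User-Agent value for illustration only; the string a given crawler
# really sends may differ.
CRAWLER_AGENT = "Googlebot/2.1 (+http://www.google.com/bot.html)"

def build_request(url, agent=CRAWLER_AGENT):
    """Build (but do not send) a request carrying a custom User-Agent header."""
    return Request(url, headers={"User-Agent": agent})

req = build_request("http://example.com/members/")
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would actually send it. Whether a site lets you in then depends on whether it trusts the header alone or also checks where the request comes from.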
Anagram("United States of America") == "Dine out, taste a Mac, fries"
I'd happily pay Google a monthly fee to gain access to extensive databases of information that take money to acquire and maintain... as long as this fee was reasonable. The current Google searches should stay as is, but if people want access to do a time consuming search on every single slashdot message ever posted, for example, the advertising would not pay for this effort. However, I wouldn't pay Yahoo! for this in a million years. Premium google searches might include Pages not ranked high enough to be crawled in the normal google search, full image search -- bandwidth intensive, and full news search -- google most likely will have to pay license fees to the news sources to do this. Most news publications charge a fee to access old articles.
More shocking news at 11am !
My guess is that they will be looking at ways of automatically polling dynamic web sites to extract all the data from the database. So if a site has a page, for instance
www.site.com/index.asp?content=10,
the search engine will try content=1 to content=n to see what it gets.
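If that guess is right, the enumeration step could be sketched as below. This is only an illustration of the idea, not how any real engine is known to work; the function name and the range 1..3 are made up.

```python
from urllib.parse import urlencode

def candidate_urls(base, param, start, stop):
    """Yield base?param=i for every i from start to stop inclusive."""
    for i in range(start, stop + 1):
        yield f"{base}?{urlencode({param: i})}"

urls = list(candidate_urls("http://www.site.com/index.asp", "content", 1, 3))
for url in urls:
    print(url)
```

The hard part is not generating the URLs but knowing when to stop, since the crawler has no idea what n is.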
Going after the other 90% does not mean that new things will come to the top. Oh, there may be a few cool items like "Who really shot JFK" or the launch code for a Trident.
But in reality the other 90% is most likely best left unfound. Who really wants to know that their parents were not married in the manner that they claimed?
Just as in archaeology, you will find a nice vase or two... but the rest is rubble.
You understand that digging in a garbage dump is the best place to find things in archaeology, because people cleaned their houses back then too. That is what the other 90% is... a dump of information.
I wish the internet was like a book. Except in this case the internet can have a variable index.
A search engine produces an abstract of the website.
I know that I'm sounding like an academic, but things could be more organised in the first page, I mean place.
And yes, I use the net for research purposes more than entertainment. For me the internet hype has died down, and I now refer more to books these days, as they seem to be easier to find.
Generally, google finds the pages that the authors want to be searched. That's why you submit your site to google. Even if you don't submit your site to google, if it's on a domain that google searches and there is a link to it, it'll be found.
With google storing more than 4 billion web pages, I'd hate to see what kind of crap the other 99% is.
Perhaps they count each iteration of a dynamic page as a separate page? Even so, google's news page does a great job searching in real time for pages that change dynamically.
http://github.com/gbook/nidb
- Complete Planet
- The Invisible Web Directory
- Librarian's Index to the Internet
- INFOMINE
Anybody use any of these sites? Are they any good? Just wondering why this is getting to be news if sites like these already exist.
Are you Corn Fed?
1 percent, and I still don't have a problem feeling lucky almost every time I do a search on google.
What would an EWOULDBLOCK block, if an EWOULDBLOCK could block would? -- me
Judging by the problems with relevancy that often occur in current search engines (I think of the problem with meta keywords, which for many search engines are now completely useless, and google-bombing), why would a customer pay to add more data to the search engine? The idea of course is 'because they'll be more relevant and because they have more information will come up more often'. However, if search engines start searching more and more of this 'deep web', how badly will relevancy be affected? I mean, the more data that is in there, the more chances there are of relevancy being broken, and if the weighting is in favor of these 'featured' searches, then relevancy may be even more broken. Sure, these companies will have more traffic directed to them, but will it merely be useless traffic by frustrated users searching for something else?
I run a search engine for an educational institution, and I will admit, Google misses a significant number of our documents. On the other hand, some of those documents are scripts that when queried will create a (virtually) infinite amount of data (calendar scripts, etc). How deep do we really need to go though? Do we really need to include calendar entries for the year 2452?
I'm also confused: is this search service 'pay by the searcher' or 'pay by the content provider'? It seems to be content provider to me.
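One cheap answer to the calendar problem is a date horizon: the crawler only follows calendar pages within a window around today. A minimal sketch, assuming the crawler can already pull a year and month out of the URL; the window sizes (10 years back, 2 ahead) are arbitrary choices for illustration.

```python
from datetime import date

def within_horizon(year, month, today, years_back=10, years_ahead=2):
    """True if the calendar page for (year, month) falls inside the crawl window."""
    earliest = date(today.year - years_back, 1, 1)
    latest = date(today.year + years_ahead, 12, 31)
    return earliest <= date(year, month, 1) <= latest

today = date(2004, 3, 9)
print(within_horizon(2004, 4, today))
print(within_horizon(2452, 1, today))
```

With a check like this, the page for next month gets crawled and the one for 2452 does not.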
[Something witty and intelligent should have appeared here.]
{Traicovn}
One limitation of Google is the fact that a site that bases its navigation on a drop-down menu or submission form (i.e. choose a section from the list and click Go) cannot be spidered by Google.
Personally, I find this infuriating. A site I once worked on was available in numerous languages, which could be chosen from a drop-down list box. The upshot of this is that Google has only cached the site in English, meaning users who would use the other languages do not get my site returned when they search in Google.
We need an open-source alternative that can address these problems, as well as get rid of the security concerns and mysterious methods Google uses to rank sites.
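For what it's worth, a spider could get past a simple drop-down by reading the option values out of the HTML and building one URL per option. A rough sketch under several assumptions: the form uses GET, the sample page and the `lang` field are invented, and the form's action URL is passed in by hand rather than parsed from the markup.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class SelectOptions(HTMLParser):
    """Collect the option values of every named <select>, keyed by its name."""

    def __init__(self):
        super().__init__()
        self.options = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self._current = attrs.get("name")
            if self._current:
                self.options[self._current] = []
        elif tag == "option" and self._current and "value" in attrs:
            self.options[self._current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None

# Hypothetical page with a language chooser.
PAGE = """
<form action="/home" method="get">
  <select name="lang">
    <option value="en">English</option>
    <option value="fr">Francais</option>
    <option value="de">Deutsch</option>
  </select>
</form>
"""

def option_urls(html, action="http://example.com/home"):
    """Return one GET URL per option found in the page's drop-downs."""
    parser = SelectOptions()
    parser.feed(html)
    return [f"{action}?{urlencode({name: value})}"
            for name, values in parser.options.items()
            for value in values]

print(option_urls(PAGE))
```

This only helps with GET forms and fixed option lists; free-text search boxes and POST forms are a much harder problem.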
Patriotism - the last resort of scoundrels.
When Yahoo announced its Content Acquisition Program on March 2, press coverage zeroed in on its controversial paid inclusion program, whereby customers can pony up in exchange for enhanced search coverage and a vaunted "trusted feed" status. But lost amid the inevitable search-wars storyline was another, more intriguing development: the unlocking of the deep Web.
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.
Today, the deep Web remains invisible except when we engage in a focused transaction: searching a catalog, booking a flight, looking for a job. That's about to change. In addition to Yahoo, outfits like Google and IBM, along with a raft of startups, are developing new approaches for trawling the deep Web. And while their solutions differ, they are all pursuing the same goal: to expand the reach of search engines into our cultural, economic and civic lives.
As new search spiders penetrate the thickets of corporate databases, government documents and scholarly research databanks, they will not only help users retrieve better search results but also siphon transactions away from the organizations that traditionally mediate access to that data. As organizations commingle more of their data with the deep Web search engines, they are entering into a complex bargain, one they may not fully understand.
Case in point: In 1999, the CIA issued a revised edition of "The Chemical and Biological Warfare Threat," a report by Steven Hatfill (the bio-weapons specialist who became briefly embroiled in the 2001 anthrax scare). It's a public document, but you won't find it on Google. To find a copy, you need to know your way around the U.S. Government Printing Office catalog database.
The world's largest publisher, the U.S. federal government generates millions of documents every year: laws, economic forecasts, crop reports, press releases and milk pricing regulations. The government does maintain an ostensible government-wide search portal at FirstGov -- but it performs no better than Google at locating the Hatfill report. Other government branches maintain thousands of other publicly accessible search engines, from the Library of Congress catalog to the U.S. Federal Fish Finder.
"The U.S. Government Printing Office has the mandate of making the documents of the democracy available to everyone for free," says Tim Bray, CTO of Antarctica Systems. "But the poor guys have no control over the upstream data flow that lands in their laps." The result: a sprawling pastiche of databases, unevenly tagged, independently owned and operated, with none of it searchable in a single authoritative place.
If deep Web search engines can penetrate the sprawling mass of government output, they will give the electorate a powerful lens into the public record. And in a world where we can Google our Match.com dates, why shouldn't we expect that kind of visibility into our government?
When former Treasury Secretary Paul O'Neill gave reporter Ron Suskind 19,000 unclassified government files as background for the recently published "Price of Loyalty," Suskind decided to conduct "an experiment in transparency," scanning in some of the documents and posting them to his Web site. If it weren't for the work of Suskind (or at least his intern), Yahoo Search would never find Alan Greenspan's scathing 2002 comments about corporate-governance reform.
The CIA and Dick Cheney notwithstanding, there is no secret government conspiracy to hide public documents from view; it's largely a matter of bureaucratic inertia. Federal information technology organizations may not solve that proble
Imagine you have a script installed as a test, something like Webmin, and that search engine hops and clicks all around and crashes your server. Of course you're not supposed to have scripts like that openly usable, but mistakes like that do happen.
isn't the way to go.
This would indeed force admins/designers to think about what data is really private. Which is not a bad idea
Not so great, but it could be worse
Search my database????
How the fsck is the bot gonna have the DBI string to interface my DB without knowing the name of the DB, the name of the account that created the DB or the user account on the DB with correct permissions to read the info??????????????
Hmmm... sounds like marketing hype.
Just as irrigation is the lifeblood of the Southwest, lifeblood is the soup of cannibals. -- Jack Handy
Who the hell modded this down? please point out the flamebait that is in the parent post.
There's a perfectly good reason why a webcrawler doesn't (and shouldn't) crawl the backend databases. I have customers with items and prices in their database. They update that on a daily basis. I have customers that provide directory solutions. We update that information on a daily basis. Now, imagine the turmoil that will arise when people find outdated items using their favorite search engine, which crawls the database once in a blue moon. Nuff said. Bad idea.
Underholdning.info
If what gets presented at the end of the day is the .01% that has been paid for by some commercial entity.
Also
What does deep crawling have to do with the relevance of information? 99% of the web is crap so I am quite happy with the 1% that google returns.
The article alleges that current search services like Google manage to access less than 1% of the web.
There's a useless statistic if you ask me.
I just wrote a cgi script that, upon requesting the url "http://bogus.com/nnnnn" returns a page with the text "nnnnn" where nnnnn is any number up to 1000 digits long. So there, I just added 10^1000 pages to the "deep web" of which google indexes none! (gasp).
So there, Google now indexes less than 0.001% of the deep web.
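For anyone who wants to play along, a script like that fits in a few lines. This is a sketch written as a WSGI app rather than the poster's actual CGI script, which isn't shown. The arithmetic at the end uses the poster's count of 10^1000 pages and the "more than 4 billion" pages figure quoted elsewhere in this thread.

```python
import re

# A path made of a slash followed by 1 to 1000 decimal digits.
NUMERIC_PATH = re.compile(r"/([0-9]{1,1000})")

def app(environ, start_response):
    """WSGI app: echo the number in the path back as the page text."""
    match = NUMERIC_PATH.fullmatch(environ.get("PATH_INFO", ""))
    if match:
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [match.group(1).encode("ascii")]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

PAGES = 10 ** 1000       # the poster's count of distinct pages
INDEXED = 4 * 10 ** 9    # "more than 4 billion" pages, per this thread

# 0.001% is 1 part in 100,000, so compare in integers to avoid float underflow.
print(INDEXED * 100_000 < PAGES)
```

Something like `wsgiref.simple_server.make_server("", 8000, app)` would serve it locally. The point stands either way: "percent of the web indexed" means little when one script can inflate the denominator at will.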
I don't think most posters understand the issue: most websites are now run out of content management systems, and search engines just trawl the web storing current pages. This is fine in a static internet, but with pages changing on a minute-by-minute basis (for example, a news site that pulls out the latest headlines), all you're going to have indexed in Google is what's on the page today.
Now say I was looking for info from a few weeks ago - Google is not necessarily the best way of finding this info. It's all still sitting there in the database, but it's not on the site's front page. archive.org may have a copy of it, but it would be much better to have google.com talk XML in a standard method to the news site's content management system, and have ALL the data there for a search.
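To make that concrete, here is a sketch of what the content management system's side of such an exchange might look like. No such standard exists as far as this thread says, so the element names (`archive`, `article`, `title`, `link`) and the sample rows are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive rows; a real CMS would read these from its database.
ARTICLES = [
    {"id": 1, "date": "2004-02-20", "title": "Old headline",
     "url": "http://news.example.com/story?id=1"},
    {"id": 2, "date": "2004-03-09", "title": "Today's headline",
     "url": "http://news.example.com/story?id=2"},
]

def export_archive(articles):
    """Serialise every article, not just today's front page, as one XML document."""
    root = ET.Element("archive")
    for article in articles:
        item = ET.SubElement(root, "article",
                             id=str(article["id"]), date=article["date"])
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "link").text = article["url"]
    return ET.tostring(root, encoding="unicode")

print(export_archive(ARTICLES))
```

The search engine would then index the weeks-old story straight from the feed instead of hoping it was still on the front page when the spider came by.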
Google's always been good enough for me.
The Slashdot Paradox: "100% Overrated"
Now when a slashdotting occurs, the victim's servers are in deeper trouble.
So Yahoo is paying Salon to make sure everybody knows they are still in the search biz and don't you forget it... Sure that little search company Google is IPO'ing soon and everybody from Mamma to Ask Jeeves is having their stocks party like it is 1999, but don't you forget about good old Yahoo, I mean we have the deep web tech...
Onward to the Aether Sphere!
that is what salon says, and I think that is bull, given my favorite porn site offers 20gigs of raunchy action.
As a web designer and admin I will have to find ways to make that data as inaccessible as possible....oh wait, I already do that because it is a good security measure...Database only listens on localhost so unless my server is breached it is already hidden behind the interface, not to mention that Apache already keeps people from reading my PHP. But if these 'deep web' searches are going to resort to trying to crack security then we have another thing to worry about...
Even if I knew that tomorrow the world would go to pieces, I would still plant my apple tree. -Martin Luther
1x10E50000000 web pages searched...
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page. There already are numerous immense databases to store medical and genomic data, and the linking between them is only now becoming usable. There is a reason there are so many of them and that so much effort has been put into methods for displaying their data. I don't want to have the data spit at me because it's useless, a waste of time when there are better, faster, more robust, and nicer interfaces I could use. It's probably the same with half the other stuff they want to search: there already are good methods of searching it for those who know where to look. Not only that but the existing methods provide data-specific information which an automated search engine cannot do and without which the data is useless to those people who actually use it. And those who don't need to look don't have a need for the data. Do YOU really need a 300 page list of A,T,C and Gs?
AKA "What's a robots.txt file?" says the innocent web crawling robot. :P
;)
Nah, I'm sure the contents of the robots.txt file will be read, and the file itself will be listed in the index too
Good-bye riaa.org.
So instead of 5,234,169 search results returned, we will see 45,961,384 results?
Yippee!!!!!
My rights don't need management.
The bottom line is that without patching the breach in communication between the database owner and the search engine we will never be able to get past the challenge of the 'deep web'.
With static systems such as yours that provide no links to much or all of the information, the only way a search engine could ever index the database would be for the owner to actually send the database to Google (for example). Due to issues of trust (i.e. the database getting leaked such as Microsoft's source code), this is next to impossible.
The next likely alternative would be a simple change in database standards. If all of this information on the deep web really is free and publicly available, just not searchable due to a lack of technical innovation, then the simple solution is to have database owners publish index files of their databases which search engines could then incorporate into their indexes.
Many database owners will react in fear to this idea, as the difficulty of getting the information on their website often leads to revenue through you looking at more ads etc etc, however the recent advent of Google Print should quickly put their fears to rest. Google is indexing books, something very akin to a database, but does not offer the entire book for download, instead providing a preview and a link to where you can purchase the book. Similarly, by providing a link to the database content, a user may still be required to register before they can search for their content. The point is that search engines are not responsible for actually giving us the content, just showing us where we can find it.
I believe that the indexing of databases by their owners will in the end lead to more people finding the information they want, and therefore more people visiting the site, in turn earning the content provider more business and giving the consumers what they want. The comparatively trivial roadblocks between us and the 'deep web' are undeserving of the daunting connotation attached to its name. All we need is a little innovation!
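As a sketch of what an owner-published index file could be: the script below maps each word in a record's title to the public URL of that record and writes the result out as JSON, while the record bodies never leave the owner's server. The table layout, the db.example.org URL and the JSON format are all assumptions for illustration; no such standard is described in this thread.

```python
import json
import sqlite3

def build_index(conn, base_url):
    """Map each lowercase title word to the public URLs of matching records."""
    index = {}
    rows = conn.execute("SELECT id, title FROM records ORDER BY id")
    for rec_id, title in rows:
        for word in set(title.lower().split()):
            index.setdefault(word, []).append(f"{base_url}?id={rec_id}")
    return index

# Hypothetical database; only titles are indexed, bodies are never exported.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO records (title, body) VALUES (?, ?)",
    [("Patent filing for a widget", "full text stays on the owner's server"),
     ("Library catalog entry", "full text stays on the owner's server")],
)

index_file = json.dumps(build_index(conn, "http://db.example.org/record"),
                        sort_keys=True)
print(index_file)
```

A search engine that picked up this file could send people to the right record page, and the owner still decides whether that page sits behind registration or ads.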
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
The article alleges that current search services like Google manage to access less than 1% of the web
Surely that should be 10%, given the 90% statistic mentioned later on?
== Jez ==
Do you miss Firefox? Try Pale Moon.
99% of the "deep web" probably looks like this. Indexable? Sure. Necessary? No.
P.S., I suppose if I used Inyerneck Exploder that it would Just Work, but after having to use Microsoft Outlook at work, I've decided to never voluntarily use any of Bill's stuff again.
Yeah. Now I can access the member part of pr0n sites by search engines. Why pay? Missing/corrupt rar's of my warez ain't a Problem any longer... Otherwise they will never reach that 90%...
I think I have a pretty good understanding of how google works.
People submit their site, google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages etc., so that pretty much the whole site is included.
This obviously means pages which are not linked do not get included in google's search, so I'm not surprised at the fact that less than 1% is ever crawled.
So how does this new method of crawling work? How can it possibly know what files are on the server if they are not linked in any way? The only way I can think of is a brute-force type method, which seems extremely stupid to me, since that would consume so much of the search engine's resources.
This also brings me onto the next point. Like a few people have mentioned, there are certain pages on the web which append a string onto the end or before the beginning of the URL, for example yourname.ismyfriend.com or www.somegamesite.com/attack.php?player=bob&attacks=5
so how many times would the crawler decide was enough to move onto the next link?
Also, since most of the internet is porn, and this newfound technology will reveal another 90 or so percent of the internet, are we suddenly going to be showered with explicit sites?
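The link-following crawl described above is easy to sketch on a toy site. The link table below is invented; it stands in for fetching each page and extracting its links. Note that `/orphan` exists but nothing links to it, so the walk never finds it.

```python
from collections import deque

# Toy site: page -> pages it links to. "/orphan" exists but has no inbound links.
LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/1", "/products/2"],
    "/products/1": [],
    "/products/2": ["/"],
    "/orphan": [],
}

def crawl(start, links):
    """Breadth-first walk over links; returns pages in the order they are found."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order

found = crawl("/", LINKS)
print(found)
print("/orphan" in found)
```

The `seen` set is also what stops the crawler from going around in circles when pages link back to the front page.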
It's rather stupid, but it has to do with legal practices.
If you have no warnings, then someone can claim that you forced your content on them, and they didn't know what they were getting into, and it was offensive.
Putting up warnings, which inform the user that they shouldn't enter your site if it's illegal for them to do so, shifts part of the burden of responsibility to them, and away from you.
So, if you're sued for having distributed offensive material, you can claim that you provided warnings, and that the person chose to disregard them. [Sort of like putting up 'wet floor' signs -- if someone gets hurt, they made an active decision to ignore the sign]
Build it, and they will come^Hplain.
Something like this: robots.txt
In Murphy We Turst
I cannot imagine there is no .gov domain with these directories indexed
Not so great, but it could be worse
NO kidding. so can I sue Google for crashing my site and using up bandwidth for trolling my database? How does this jive with "fair use" on my copyrighted materials? hmmmm
Last crawled by:
yahoobot on 03/09/04 at 04:12.
spammerscum on 03/05/04 at 14:41.
googlebot on 02/29/04 at 10:38.
machoproducts on 02/23/04 at 18:21.
machoproducts on 02/23/04 at 18:20.
machoproducts on 02/23/04 at 18:18.
Have a nice day!
One line blog. I hear that they're called Twitters now.
Where are FireFox's cookies stored?
Perhaps it's a new rule - 1% of the Web contains 99% of the useful information?
If the boys with fat pipes start indexing "deeper" into sites, I think we're going to see a lot of sites going offline until they've been refactored to handle this sort of thing.
The frontend webservers that serve the static pages are fine (they're already being spidered now), but the dynamic content, largely dependent on databases and such, very likely wasn't built to handle this sort of load. Once the new engines get their hooks into these pieces, they're going to be in trouble.
The cover story of the March issue of Technology Review is "Search Beyond Google" (click CURRENT ISSUE, privacy invasion required). Like the Salon article, they mention Dipsie, but they also cover a search engine (Mooter) that uses a MindMap style interface.
To err is human. To arr is pirate.
Think of what Google does as generating an "Internet Table of Contents". While we may disagree on how well Google does this (I happen to think they do an amazing job, considering the complexity of the task), they essentially are giving us "pointers" into the internet.
A TOC represents a tiny fraction of a book, yet yields a powerful tool to gain access to specific and targeted pages in the book. A TOC need not "crawl" every word of every page of a book to be useful. Similarly, Google has developed their methods to give a reasonable representation of the WEB and at the same time a powerful tool to gain access to the part of the WEB relevant to your request.
I know this isn't a perfect analogy, but I think Google has gotten it close to right. I'm not sure what additional depth would gain for the effort invested.
It seems to me that in the future a search on Yahoo (or google) will wind up pointing you to many results that are themselves current active sub-searches of a website's localized database. Anything else would seem to violate what they are trying to protect now.
InnerWeb
Freud might say that Intelligent Design is religion's ID.
I just don't think that is going to fly.
...results in more porn I'm all for it. You can never have too much porn.
Max
My god carries a hammer. Your god died nailed to a tree. Any questions?
> 1. Make your pages *look* static
I have not run across a lot of pages that actually need to be dynamically generated. Shopping carts and account settings need it, but if you make everything dynamic, like most misguided web developers do these days, you simply succeed at slowing your site down to a crawl and evoking a long stream of curses from people like me, who still think that broadband access is not worth $60 a month.
Huh?
It's safe to guess that the exp'y is within 4 years. (otherwise, move onto another card)
That's an amazing 48 possible "passwords" to brute force (assuming that cc subscriptions dates are uniformly distributed. any research on this?). I *THINK* there are >48 web merchants... Hmm.
This, of course, doesn't use the resources mentioned in the other posts.
The so-called invisible web is indirectly related to the "deep web", with the exception that most of it isn't connected at all to the main web. Slashdot has had some articles regarding these hidden segments of the web - but has any progress been made on finding these "lost networks"?
Current theory on networks explains how and why these networks form and separate from the main web of connections, mainly due to loss of one of the tenuous threads from a supernode to the outlier nodes. When this loss occurs (an intermediary site goes offline, or popularity wanes, or a large meganode dies or stagnates), the network fragments - and getting back to the pages/sites within is nearly impossible, unless you already have a link to the inside, or a friend provides it to you.
Now, it is a good thing that this phenomenon exists - it seems to exist in all robust, evolving networks - whether those networks be electronically connected, socially connected (ie, Friendster, Orkut, or plain-ole social groupings), or bio/chemo connected (ie, the brain, the body, etc).
Even so, I wonder at all the information out there which I *can't* access, because it isn't indexed in some way. Sometimes you come across fragments and echoes in other archives (news, mail, irc) that lead to these far-off and displaced "locations" - but it is rare, and tedious to do unless you are looking for very needful information.
So I ask again, has anything been done to further the "searching" within/for the "invisible web"?
Reason is the Path to God - Anon
Why does everyone assume the top 10% of results on Google must be all the best information? Some people even said that in the same breath they complained about Google Spam. Ridiculous!
The fact is there is TONS of great independently published stuff that will never be found through Google because the author doesn't take the time to play the SEO game and advertise their page all over the web. Google's algorithm is far from the final word in relevancy algorithms. The evolution will continue until we have search engines that are smarter than humans. Of course, the evolution will probably continue after that, just without our interaction.
google should start a 'google development' search engine. normal google would still be available, but the googledev would have the same initial database, but use different algorithms and procedures with which it would classify material, thus yielding different results for the same searches... 'cutting edge' google. or it could even have its own search crawler, for that matter. that way they can start finding new ways to combat spammers.
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
It's likely he was talking about younger children accidentally stumbling across something inappropriate, not a teenager with a box of tissues. Admittedly, ads alone are awfully lewd right now - but I can see the vague idea he's getting at.
I've had a lot of actual experience with this. I've been researching a bunch of stuff on the history of Quebec city, and been using the Internet for most of it. Using Google and a few other search engines, I find a lot of information, but most of it is second-hand, urban legend and, often, completely wrong. Not that I don't expect that, but I'd also expect to find good sources listed; they do exist.
For example: try finding a biography on 'Louis Hebert' on the net. You'll find a few pages, some of them good mixed in with the expected crap. But what you WON'T find -- in fact someone had to tell me to look here -- is the entry in The Dictionary of Canadian Biography which includes a bibliography giving original sources. It is hands down the best source for the type of information I need.
After weeks of searching for different historical figures I didn't even come across The Canadian Encyclopedia. This site has detailed information, including video and pics, on absolutely everyone and everything Canadian, but I've never seen it come up in the first half dozen pages of a search. Google won't find it even when you limit it to that site.
But both these resources are all but lost in the 'deep web'.
Some island sites, as I call them, have no links to them, so crawling will never find them.
Some of them do advertise on local tv/radio/papers/whatever but aside from that you'd never find them if someone doesn't tell you the URL.
This is of course not part of the 'deep web' but it is part of the "invisible web". It is impossible for any algorithm to find this on its own.
SCIREV.NET - fanfics,reviews & more
First, Google does crawl dynamic sites using GET variables.
Second, if you enable Apache's mod_rewrite you can make all your dynamic pages "appear" static, thus allowing them to be more easily indexed.
Security is inversely proportional to the commitment of one desiring to circumvent it.
The reason government resources are not widely searched is that there is little incentive to make them searchable. Government websites are notoriously hard to use compared to commercial ones simply because businesses have a vested interest (money) in getting their pages searched. High ranks in search results = more visitors = more business = more money. Government agencies do not work that way. It's not that they are intentionally trying to hide information; they just aren't as focused on the user as business sites are.
Ever seen an adult site tied up behind a complicated form? No, and you won't, because these sites want to be searched and are designed with search engines in mind. There is no technical reason any website cannot be searched. All content should be browsable by following standard HTML links; if human users want to aggregate the results, a form can be used. But if someone has to use a form to get at the content, that's a flaw in the design and the designer is to blame, not search engines.
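A rough illustration of why the form is the problem, using only Python's standard html.parser. The page snippet, paths and class name are invented for the example, and real crawlers are obviously more elaborate: a link-following crawler picks up the href targets, but all it gets from a form is the action URL, with no idea what to submit.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets of <a> tags and note any <form> actions seen."""

    def __init__(self):
        super().__init__()
        self.links = []  # URLs a crawler can simply follow
        self.forms = []  # form endpoints; the content behind them stays hidden

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self.forms.append(attrs.get("action", ""))

# Hypothetical page: two browsable category links plus a search form.
page = """
<a href="/catalog/widgets/">Widgets</a>
<a href="/catalog/gadgets/">Gadgets</a>
<form action="/search" method="post"><input name="q"><input type="submit"></form>
"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/catalog/widgets/', '/catalog/gadgets/']
print(parser.forms)  # ['/search']
```

Anything reachable only by posting a query to /search never shows up in parser.links, which is exactly the content that ends up in the "deep web".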
The Deep Web, aka crapflooding submission forms
If you have a public mail server, you deserve any spam you get...
Now maybe my site will come up more when people search for items in my database. Yay, more sales for me! I'll probably still mod_rewrite my urls in apache so that they look nicer, and are easier for people to "use," but at least it won't be as much of a necessity for the search engines to index me.
FYI mod_rewrite in apache will change a URL from:
http://www.mydomain.com/foo.jsp?blah=bar
to something like...
http://www.mydomain.com/foo/bar/
iirc, you can have it redirect any url you want to any other query string, using regex or just plain strings. also supposedly nice for fixing trailing slash issues (some systems assume a directory is a file when you don't put a trailing slash in the url).
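For anyone who wants to see the mapping itself, here is a small Python sketch of the idea. This is not Apache configuration; mod_rewrite does the equivalent with regex-based RewriteRule directives inside the server. The pattern and the rewrite() helper are assumptions made up to match the foo.jsp example above. Note the direction: the visitor (or crawler) asks for the static-looking path, and the server quietly maps it back to the dynamic URL.

```python
import re

# Hypothetical pattern for the example above: /foo/<value>/ with the
# trailing slash optional, which also covers the trailing-slash issue.
PRETTY = re.compile(r"^/foo/([^/]+)/?$")

def rewrite(path):
    """Map a static-looking path to the dynamic URL that really serves it."""
    match = PRETTY.match(path)
    if match is None:
        return path  # no rule applies; leave the path untouched
    return "/foo.jsp?blah=" + match.group(1)

print(rewrite("/foo/bar/"))    # /foo.jsp?blah=bar
print(rewrite("/foo/bar"))     # /foo.jsp?blah=bar
print(rewrite("/index.html"))  # /index.html
```

The search engine only ever sees /foo/bar/, so it has no reason to treat the page as dynamic.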
There is no deeper side to the web. The only thing they can do to make their search give more topics is to just go after something else in the website code, instead of metatags. Maybe even offer some kind of free promotion method, such as AddMe, for the users of their Instant Messaging and E-Mail services.
Google is like the US to other countries. We may have not been the first in space, but we sure as hell have been the farthest.
"Instant gratification takes too long." - Carrie Fisher
A couple of years ago, I went to the H2k2 conference here in New York City. I saw a fascinating talk there where I first heard the term "deep web" and some of its ramifications for national security. National security was very much on our minds at the time being only roughly a mile and a half from what we call "Ground Zero" (never liked that term).
...and...
The guy giving the speech claimed that he was a retired FBI agent and seemed to have a great deal of insight into the inner workings of national intelligence. As pointed out in the article, the speaker made the same claim that search engines only gleaned about 1% of the total information on the web. He recommended a tool called Copernic (as well as one other one that I can't remember right now) that bills itself as a "deep web" search tool. But all it appears to do is assemble the results from a bunch of other search engines. I don't recall it ever returning anything significantly "deeper" than what your average google search can yield, however.
Back to the topic of national security, he made mention that terrorist communities are thriving on the fact that only 1% of the total amount of information on the web is readily accessible. All kinds of information that would be beneficial for the NSA to know is just plain inaccessible.
He also faulted the intelligence communities for hiring "blonde haired pretty boy" college graduates, fresh out of school, to analyze data in foreign languages instead of hiring local speakers. A 4.0 linguistics student will still miss out on a lot of the nuance in a conversation that a native speaker of, say, Pashto will clue right into. Of course, the argument could be made that at least the "loyalties" of an American college graduate are almost guaranteed to be in the right place, but you can't ignore that he/she will be blind to much of the subtext of a conversation in a foreign language.
A little offtopic, but more alarmingly, a point was made about the lack of digitization of intelligence documents in the NSA. Meaning that an agent will typically risk life and limb gaining access to a piece of information and then pass that info to a "runner" who places it in an "orange envelope" to signify its classified status. Then that same orange envelope goes into a locked filing cabinet where a good 7 or 8 times out of 10 it never sees the light of day and no attempt is made to analyze it.
But such is the challenge of the modern age. We are drowning in all of the information we produce. Vannevar Bush addressed this issue with astounding clarity right after World War II.
Quoth the Doctor:
"There is a growing mountain of research. But there is increased evidence that we are being bogged down today as specialization extends. The investigator is staggered by the findings and conclusions of thousands of other workers--conclusions which he cannot find time to grasp, much less to remember, as they appear. Yet specialization becomes increasingly necessary for progress, and the effort to bridge between disciplines is correspondingly superficial."
"The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships."
We are dealing with this problem (access to the information we produce) to a far greater extent than at any time in human history. The web, which was at one point designed and intended to be a more effective way to deal with and disseminate the oceans of data we produce, has little more than square-rigged ships to skim its surface.
Quod scripsi, scripsi.
There is plenty of very good information out there that isn't indexed. For example, I found a lot about the top level finances of my company, including compensation of the president and vice presidents, that was made a matter of public record when they filed the information as part of an IPO. However, if I hadn't found the IPO on the SEC website, by way of a financial site that let me search for IPOs, I would never have known that the information was available to the public. No search engine would find it, even when given the name of the company and the names of the people involved, or the company name and the term IPO (and, interestingly, the copy of the IPO filing documents that had been provided to me and other managers was doctored to omit this information). At the very least, I would want all government information to be searchable on common search engines like Google. Not that I think they should be able to publish all of the information on the web that they do; but if they are going to publish it, then it should be easily searchable.
I'm an American. I love this country and the freedoms that we used to have.
(Or even better, what happens when deep search engine #A starts crawling deep search engine/directory #B :)
Believing something doesn't make it true. Not believing something doesn't make it false.
I've heard that too. Never got it to work, but I disable advanced features. Google has a way of indexing ridiculous things (guestbook signings and things) while completely overlooking any actual content I place online. It's amusing as hell.
In Soviet Russia, TV watches you!