Searching the 'Deep Web'
abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"
but it will get us 90% more useless results. The regular search spam on Google is bad enough (it's getting to the level of bad results AltaVista had before Google took over the throne) without this extra noise...
...but I don't want to see the guts of a web form. If I understand correctly, they're talking about crawling into databases, actually parsing a Microsoft Access file, for instance. I see that as having dubious merit, and potentially pissing off web site owners. Web site designers go to a lot of trouble to provide the interface they want you to see to their data. This would just sidestep the interface and dump you into the data.
At the very least, it might require an overhaul or extension to the robots exclusion specification to keep spiders out of your data.
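For reference, this is roughly how a well-behaved spider consults the existing exclusion rules today. It is a minimal sketch using Python's standard urllib.robotparser; the robots.txt contents, the /db/ and /cgi-bin/ paths, the ExampleBot name, and the is_allowed helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the disallowed paths are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /db/
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/db/prices.mdb"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Note that this only helps against spiders that bother to ask; nothing in the file is enforced by the server.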
Since I moved my site over to a PHP-based system, nothing beyond my index page gets a second look from Google. As web content moves away from static pages to more dynamic solutions (particularly XML), a more sophisticated crawler is needed, one that can read over this bewildering maelstrom of data and extract from it meaning and content.
While I find it highly unlikely that this system will do well with large databases (or even databases at all for that matter) it is a step in the right direction. Google will probably have their version up on labs inside a month.
Killfile(TGK)
No trees were killed in the creation of this post. However, many electrons were inconvenienced.
...and I wonder about something different. :)
Has anyone tried this yet? Change your user agent string to one matching the Googlebot and crawl the web. I'm pretty sure many "registration only" websites would magically open themselves, but I wonder about other differences too.
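For anyone who wants to try it, a minimal sketch with Python's standard urllib: it only builds the request with a Googlebot-style User-Agent header and does not send anything. The build_request helper and the example.com URL are made up for illustration, and sites that verify crawlers by IP address or reverse DNS will presumably not be fooled by a header alone.

```python
from urllib.request import Request

# User-Agent string that Googlebot has commonly identified itself with.
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")


def build_request(url: str, user_agent: str = GOOGLEBOT_UA) -> Request:
    """Build (but do not send) an HTTP request carrying a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("http://example.com/members-only")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
    # Passing req to urllib.request.urlopen would actually perform the fetch.
```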
Anagram("United States of America") == "Dine out, taste a Mac, fries"
Going after the other 90% does not mean that new things will come to the top. Oh, there may be a few cool items like "Who really shot JFK" or the launch code for a Trident.
But in reality the other 90% is most likely best left unfound. Who really wants to know that their parents were not married in the manner they claimed?
Just as in archaeology, you will find a nice vase or two... but the rest is rubble.
You understand that digging through a garbage dump is the best place to find things in archaeology, because people cleaned their houses back then too. That is what the other 90% is... a dump of information.
Search my database????
How the fsck is the bot gonna have the DBI string to interface my DB without knowing the name of the DB, the name of the account that created the DB or the user account on the DB with correct permissions to read the info??????????????
Hmmm... sounds like marketing hype.
Just as irrigation is the lifeblood of the Southwest, lifeblood is the soup of cannibals. -- Jack Handy
There's a perfectly good reason why a webcrawler doesn't (and shouldn't) crawl the backend databases. I have customers with items and prices in their database. They update that on a daily basis. I have customers that provide directory solutions. We update that information on a daily basis. Now, imagine the turmoil that will arise when people find outdated items using their favorite search engine, which crawls the database once in a blue moon. Nuff said. Bad idea.
Underholdning.info
User-agent: *
Disallow: /s3kr3t/
trawler: "Hey cool, thx for the tip, I never would have thought to try /s3kr3t/"
Google's always been good enough for me.
The Slashdot Paradox: "100% Overrated"
http://domain/index.php?act=showpost&postid=1244
Google sees index.php as one page, and does not attempt to submit any data via GET/POST.
Hmm... I see plenty of pages in Google that have URLs with GET parameters, so there must be some way of getting it to crawl them. Or am I misunderstanding what you're saying? Maybe the key here is to provide an alternate route to those pages without doing anything fancy (drop-down menus, radio buttons, etc.). Just generate another page that contains a regular link to all your pages. You could hide that page from your regular users by, say, linking it to a 1x1 pixel transparent GIF. A robot will find it, but most of your users won't even notice.
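The "alternate route" page described above could be generated with something like this. It is only a sketch: build_link_page is a made-up helper, and the example URLs are placeholders patterned on the index.php?act=showpost style mentioned earlier in the thread. Whether a given crawler will actually follow such links is an assumption, not something this guarantees.

```python
from html import escape


def build_link_page(urls):
    """Return a bare HTML page with one plain <a href> link per URL."""
    items = "\n".join(
        f'<li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>'
        for u in urls
    )
    return f"<html><body>\n<ul>\n{items}\n</ul>\n</body></html>"


if __name__ == "__main__":
    # Placeholder URLs; a real site would pull these from its own database.
    pages = [f"/index.php?act=showpost&postid={n}" for n in (1, 2, 3)]
    print(build_link_page(pages))
```

Ampersands in the query strings are escaped to &amp; so the page stays valid HTML.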
... and looping google?
Yeah, I can see that Google sometimes lists pages with GET content in its index. It doesn't want to do it for a lot of pages though, and I haven't figured out why. There seems to be nothing different in the HTML.
Hypothetically speaking, what's there to stop someone doing a:
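<?php
// Emit a link whose path changes on every request, so a crawler never sees the same URL twice.
print("<a href='thispage.php/" . rand() . "'>Some page...</a>");
?>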
I think I have a pretty good understanding of how Google works.
s =5
So how many times would the crawler decide was enough before moving on to the next link?
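I have no idea what Google actually does, but one obvious defence is a hard cap on pages fetched per host. A minimal sketch under that assumption: crawl, get_links, max_pages_per_host, and the trap.example host are all made up here, and get_links stands in for "fetch the page and extract its links".

```python
import itertools
from collections import Counter, deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_pages_per_host=50):
    """Breadth-first crawl that gives up on a host after a fixed page budget."""
    seen = {start_url}
    per_host = Counter()
    queue = deque([start_url])
    fetched = []
    while queue:
        url = queue.popleft()
        host = urlparse(url).netloc
        if per_host[host] >= max_pages_per_host:
            continue  # budget for this host is spent, skip the URL
        per_host[host] += 1
        fetched.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


if __name__ == "__main__":
    counter = itertools.count()

    def trap_links(url):
        # Simulates a page that links to a brand-new URL on every visit.
        return [f"http://trap.example/thispage.php/{next(counter)}"]

    pages = crawl("http://trap.example/thispage.php", trap_links,
                  max_pages_per_host=5)
    print(len(pages), "pages fetched before giving up")
```

With the cap in place the endless chain of fresh URLs costs the crawler a bounded number of fetches instead of looping forever.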
People submit their site, Google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages, and so on, so that pretty much the whole site is included.
This obviously means pages which are not linked do not get included in Google's search, so I'm not surprised at the fact that less than 1% is ever crawled.
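The point about unlinked pages is easy to show with a toy link graph. This is a sketch of plain link-following, not of any real search engine; the SITE dictionary and its page names are invented, with "/orphan" standing in for a page nobody links to.

```python
from collections import deque

# Toy site: each page maps to the pages it links to.
SITE = {
    "/": ["/about", "/posts"],
    "/about": ["/"],
    "/posts": ["/posts/1"],
    "/posts/1": [],
    "/orphan": ["/"],  # exists on the server, but nothing links to it
}


def reachable(graph, start):
    """Return every page a link-following crawler can reach from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


if __name__ == "__main__":
    found = reachable(SITE, "/")
    print("crawled:", sorted(found))
    print("missed: ", sorted(set(SITE) - found))
```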
So how does this new method of crawling work? How can it possibly know what files are on the server if they are not linked in any way? The only way I can think of is a brute-force type method, which seems extremely stupid to me, since that would consume so much of the search engine's resources.
This also brings me to the next point. As a few people have mentioned, certain pages on the web put a string at the end or the beginning of the URL, for example yourname.ismyfriend.com or www.somegamesite.com/attack.php?player=bob&attack
Also, since most of the internet is porn, and this newfound technology will reveal another 90% or so of the internet, are we suddenly going to be showered with explicit sites?
Right.
But if you are interested in a specific subject..
Let's say you have a technical problem.
Chances are somewhere on the planet someone submitted the same problem on a web-based forum.
Now you want Google to give you THAT specific message.
You don't want Google to tell you "hmmm... I guess the solution must be in one of those zillions of forums here, here, and here".
If the boys with fat pipes start indexing "deeper" into sites, I think we're going to see a lot of sites going offline until they've been refactored to handle this sort of thing.
The frontend webservers that serve the static pages are fine (they're already being spidered now), but the dynamic content, largely dependent on databases and such, very likely wasn't built to handle this sort of load. Once the new engines get their hooks into these pieces, they're going to be in trouble.
The so-called invisible web is indirectly related to the "deep web", with the exception that most of it isn't connected at all to the main web. Slashdot has had some articles regarding these hidden segments of the web - but has any progress been made on finding these "lost networks"?
Current theory on networks explains how and why these networks form and separate from the main web of connections, mainly due to loss of one of the tenuous threads from a supernode to the outlier nodes. When this loss occurs (an intermediary site goes offline, or popularity wanes, or a large meganode dies or stagnates), the network fragments - and getting back to the pages/sites within is nearly impossible, unless you already have a link to the inside, or a friend provides it to you.
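To make the fragmentation idea concrete, here is a toy model. It is a deliberate simplification: links are treated as two-way (real hyperlinks are one-way), and the WEB graph and its node names are invented. Removing the single "hub" node cuts the outlying pages off from everything else.

```python
# Toy web: each node maps to its neighbours, with links treated as two-way.
WEB = {
    "portal": ["main", "hub"],
    "main": ["portal"],
    "hub": ["portal", "outlier_a", "outlier_b"],
    "outlier_a": ["hub", "outlier_b"],
    "outlier_b": ["hub", "outlier_a"],
}


def remove_node(graph, node):
    """Return a copy of graph with node and every link to it deleted."""
    return {n: [m for m in nbrs if m != node]
            for n, nbrs in graph.items() if n != node}


def connected_components(graph):
    """Return the list of connected components, each as a set of nodes."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(graph[n])
        seen |= component
        components.append(component)
    return components


if __name__ == "__main__":
    print("before:", connected_components(WEB))
    print("after: ", connected_components(remove_node(WEB, "hub")))
```

Once the hub is gone, the two outlier pages still link to each other, but no path from the portal side reaches them, which is the "lost network" situation described above.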
Now, it is a good thing that this phenomenon exists - it seems to exist in all robust, evolving networks - whether those networks be electronically connected, socially connected (i.e., Friendster, Orkut, or plain-ole social groupings), or bio/chemo connected (i.e., the brain, the body, etc).
Even so, I wonder at all the information out there which I *can't* access, because it isn't indexed in some way. Sometimes you come across fragments and echoes in other archives (news, mail, irc) that lead to these far-off and displaced "locations" - but it is rare, and tedious to do unless you are looking for information you badly need.
So I ask again, has anything been done to further the "searching" within/for the "invisible web"?
Reason is the Path to God - Anon