Is the Internet Becoming Unsearchable?
wergild asks: "With more and more sites going to a database driven design, and most search engines not indexing anything that contains a query string in it, we're missing a lot of content. I've also heard that some search engines won't index certain extensions like php3 or phtml. Is anything being done about this? How can you use dynamic, database driven content and still get it indexed into the major search engines?" Is keyword searching obsolete? Do you think it's time to index sites by the type of content they carry rather than the content itself? Will larger indexing databases (or a series of smaller, decentralized ones) help?
...to "force" search engines to search certain pages. Currently, you can only tell a searchbot to "piss off". There is no way to tell a searchbot "hey!!!! come look at this...."
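To make the "piss off" point concrete, here is a minimal sketch using Python's standard robots.txt parser. The robots.txt contents, the paths, and the "examplebot" agent name are all made up for illustration. Note that the rules only express where a bot should not go.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; the paths are made up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def allowed(path, agent="examplebot"):
    """Return True if the rules above let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)
```

A well-behaved bot would call something like allowed("/cgi-bin/search?q=linux") before fetching and skip the URL when it returns False. Nothing in these rules invites the bot to a page it has not already found by following links.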
Yeah, give me a minute to back that statement up. :)
Honestly though. With something that is inherently dynamic like the internet, it is already nearly impossible to catalogue it and make it searchable. To illustrate, take any given news site. Today they might have articles about Clinton, tomorrow it might be news about a big fire. Search engines can't just direct you to those sites based on queries, because who knows what data they have.
Even if a search engine were able to validate the content on every site before it gave you the URL, the content could still change by the time you actually got to see it.
So quite literally there isn't even a clue of a way to catalogue a database generated web site. Now granted, I know there are plenty of sites like Slashdot where the 'content' eventually settles down and becomes static. Still, how are you going to get some stupid program to verify and validate that for *every* dynamically generated web page? I don't think you can.
The web was created to be open and dynamic and it will stay that way. I've heard people say that maybe there should be *more* interoperability between things like search engines and spiders. This in my mind would do more damage.
Besides is it so bad that spiders don't get these pages? It probably isn't even reasonable because it would add that much more complexity to the search engine to catalogue what it finds. How do you rank content?
Anyway... just my 2 cents or so...
This space for sale
Still, I see a potential threat in information becoming unmanageable and, most of all, in ways of finding information being abused (like using unrelated keywords just to get some visitors). Stanislaw Lem, the Polish SF writer, described this situation in many of his books, starting in the 60s, when no one was even starting to think about such problems. Sooner or later we'll have a large branch of computer science dealing only with searching for information on the Internet; search services are already available, but they are either incomplete or not evaluated. The latter is the key, and Google is the first service I'm aware of which tries to automate evaluation (by counting links pointing to a specific page).
There was a lot of talk about "Internet agents" a couple of years ago (I remember an article in Scientific American...). Could some good soul explain to me what the situation is now?
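For what "counting links pointing to a specific page" could look like in its very simplest form, here is a rough sketch. It is plain inbound-link counting over a made-up link graph, not a claim about how Google actually ranks pages; the function name and the graph format are my own assumptions.

```python
from collections import Counter


def rank_by_inbound_links(link_graph):
    """Rank pages by how many distinct other pages link to them.

    link_graph maps a page name to the list of pages it links to.
    """
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):   # count each linking page only once
            if target != source:      # a page linking to itself doesn't count
                counts[target] += 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

With a graph like {"a": ["b", "c"], "b": ["c"], "c": ["a"]}, page "c" comes out on top because two other pages point at it. Keyword stuffing on your own page doesn't change the count, since only links from other pages are tallied.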
Regards,
January
I've done some work on a spider and these are the types of pages I spider:
html htm asp php shtml php3
I guess I'll add phtml :)
Other extensions and urls with query strings are ignored. This is mainly for self defense. There are many, many infinite loops and blackholes on the web and they're hard to avoid. For instance, my spider once got stuck on a server that would return the contents of /index.html for any non-existent path. Also, all links on the homepage were relative (not a bad thing) and one was invalid. The call sequence is below.
GET /index.html
found foo/broken.html
GET /foo/broken.html
webserver couldn't find path, so returns /index.html
GET /foo/foo/broken.html
etc.
What was the programmer thinking?
This is just one example of the blackholes that lurk on the web. It was completely unexpected and pretty difficult to detect. What if someone wanted to write a search engine trap? I don't believe there is a simple solution to this problem.
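A rough sketch of the kind of self-defense filter described above, for anyone who wants to play with it. The extension list is the one from this post; the depth limit and the "same directory name twice in one path" rule are my own guesses at a heuristic, and they will reject some perfectly legitimate URLs.

```python
from urllib.parse import urlsplit

# Extensions from the post above, plus phtml.
ALLOWED_EXTENSIONS = {"html", "htm", "asp", "php", "shtml", "php3", "phtml"}
MAX_DEPTH = 8  # assumption: anything deeper than this is probably a loop


def should_fetch(url, seen):
    """Return True if the spider should request `url`.

    `seen` is the set of URLs already fetched.
    """
    if url in seen:
        return False
    parts = urlsplit(url)
    if parts.query:                       # skip anything with a query string
        return False
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return False
    directories = segments[:-1]
    if len(directories) != len(set(directories)):
        return False                      # repeated directory, e.g. /foo/foo/
    last = segments[-1] if segments else ""
    if "." in last:
        extension = last.rsplit(".", 1)[1].lower()
        if extension not in ALLOWED_EXTENSIONS:
            return False
    return True                           # paths without an extension are allowed
```

On the trap above, /foo/broken.html would still be fetched once, but /foo/foo/broken.html is rejected, so the loop stops after one wasted request. Comparing a hash of each response body against pages already stored would catch the same trap in a different way, at the cost of fetching the page first.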
Ryan
One way round search engines missing query URLs is to write out static pages for the purpose of submitting them to search engines. There are many clever ways of having truly dynamic sites without the need for long URLs; you just have to put some effort into it.
Search engines not picking up on php3 is a bit worrying though. All my sites are written purely in php3, although I never seem to have any problems with getting listed.
Gateway pages are a good way of making sure you get listed with the keywords you want, although they aren't very dynamic and, unless you get really clever, don't tend to reflect the contents of a regularly updated site... however it seems to me that you can only really hope for *a* listing these days, not an index of all of your site.
Even Google has a 3 month disclaimer on its submit page; that's a mighty long time if you are looking for support on a brand new motherboard.
LASE seems to be the way to go... subject-specific full text indexes which spider regularly and can index specialised data, keeping it up to date.
However you would still need a search engine to find a LASE that will get you what you want, but at least it's a bit more structured!
There are many ways round the search engine problems, and keeping on top of them is a full time job. Submit-it doesn't come close; it hasn't changed in the past 3 years. Search engines, however, have!
IMO a combination of all of the above will get you where you want. Keywords and Meta Tags still count, and you have to be persistent.
Some time later, it occurred to me to try and monitor the efficiency of web indexing tools using a spider trap.
The methodology is like this:
Anyone done this? I'm particularly interested in knowing how spiders handle large websites -- have been ever since I was doing a contract job on Hampshire County Council's Hantsweb site a few years ago and caught AltaVista's spider scanning through a 250,000 document web that at the time had only a 64K connection to the outside world. (Do the math! :)
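For anyone who wants to actually do the math, here is a back-of-envelope sketch. I'm assuming "64K" means 64 kbit/s, and the 10,000 byte average document size is a pure guess, so plug in your own numbers.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(documents, avg_doc_bytes, link_bits_per_second):
    """Days needed to pull every document once over a saturated link."""
    total_bits = documents * avg_doc_bytes * 8
    return total_bits / link_bits_per_second / SECONDS_PER_DAY


# 250,000 documents, assumed 10,000 bytes each, over an assumed 64 kbit/s line.
print(round(crawl_days(250_000, 10_000, 64_000), 1))  # prints 3.6
```

That is more than three and a half days with the line completely saturated and no protocol overhead, which leaves nothing for anyone else trying to use the site.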
I think the real problem with searching isn't that the Internet is growing too large. The central reason information is so hard to find is the unfortunately ever-changing nature of HTML. (Yes, I know there are much better solutions out there -- I work with some of them on a daily basis. However, we seem to presently be stuck with HTML and its variants.)
It's a self-feeding monster, whose typical cycle goes as follows: SearchEngineInc (a division of ConHugeCo) creates a new technology that really impresses people with its ability to find what they want more quickly. (Right now SearchEngineInc is probably Google, at least in my view.)
Once the new technology takes root, content authors (well, maybe not the authors so much as their PHBs) note that SearchEngineInc doesn't bring their business (which sells soybean derivatives) to the top of the search list (when people type ``food'' into the search engine). Said PHBs make the techies work around this ``problem'', and all of a sudden SearchEngineInc's technology isn't so great anymore because the HTML landscape it maps has changed.
A similar situation occurs when PHBs think their site doesn't ``look'' quite as good as others. (Insert my usual rant about content vs. presentation here.) Whether via a hideous HTML-abusing web authoring program, or via all sorts of hacks that God never intended to appear in anything resembling SGML, the HTML landscape is changed there as well, and SearchEngineInc's product becomes less effective.
What's the solution to this? I'm not quite sure. Obviously there are better technologies out there that are at least immune to PHBs' sense of ``aesthetics'' but I would wager few of them are immune from hackery. I'd say that search engine authors are doomed for all time to stay just one step ahead of the web wranglers. At least it assures them that their market segment won't go away any time soon. :-)
It disturbs me that so many pron sites have hidden in their html code (and sometimes not even hidden) huge lists of adult film stars just to get hits from search engines.
If you do a search for Cortknee or Lotta Top you'll get a bazillion hits and 90%+ of them are "Click here to see young virgins having sex for the first time on their 18th birthday!"
As we all know, but nobody likes to admit, pron is the fuel that makes the net go 'round.
Many other sites have taken hints from the pron people. I'm sure that it was a deal of some sort, but every time I do a search on metacrawler for anything, I get a link to search a certain bookstore for books on the same topic.
Commercialism and shady practices are what are making the net so hard to search.
LK
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
Okay, I just got done with my research paper for college last week, and although I can pull a paper out of some orifice of my body, researching is always a pain.
Our library has a wonderful online database where you can type in keywords and search for them, but the keywords only look as far as the title, author, or abstract of the book. If you want to look up some narrow topic, you can't expect that there are books written exactly on that topic, but there are always bound to be a few books out there that have a few pages dedicated to that subject (which isn't listed in the abstract). So, what do you do? You have to get your hands dirty.
My topic: Holy Wisdom (I won't bore you with details, but just stick with the subject). Looking in the online database, I find that there are zero books on the subject. Darn. Let's do some lookin...
After I read in a few Religion Dictionaries, I find that Holy Wisdom is also called "Sophia." I go back to the catalog, type in "Sophia," and I get one book. I skim this one book, and find that Sophia has sometimes been associated with the Holy Trinity. So, I go back to the catalog, enter "Holy Trinity," and BOOM, I get back 400 results (anyone seeing a similarity here...). Let's limit them...we'll search within the results for "History of," and I get back about 11 results. I read the abstracts, find a few books of interest, and start skimmin...
...Well, whadda know, there's a page in one book that talks about Sophia, and half a chapter in another book that talks about Sophia as well. There's a few more sources for the paper!
Now, for those of you who just don't understand what I'm trying to say here, just read from here on, cause here's my point: Computers aren't smart enough yet to "guess" at what we want, and personally, I don't think they ever will. Internet keyword searches are just like asking someone to help you who has no idea what your topic is...they can only search for what you ask them to search for.
Internet keyword searches are a hassle, and many times the first few returns won't be anything CLOSE to what you want (search for "Computer Science," you get back porn, search for "Linux," you get back porn, search for "White House,"...). But if you learn how to dig, like the people who lived fifty years ago WITHOUT Boolean Searches, you'll find what you're looking for. Sometimes, it's just like searching for a topic...you might not find anything directly, but you can't sum up an entire book in just a paragraph either!
Try some links, look around, and it'll be there!
I'm going to say a naughty word: artificial intelligence. I'm hoping we soon ( 5 years) get good enough at this "indexing" stuff to create semantic models of Web content rather than purely syntactic models. (Google is a small step in the right direction.) If so, then perhaps dynamic pages can be indexed according to their location (role?) in an "ontology" rather than via the frequency of essentially meaningless character strings. That may sound farfetched, but it seems to me that the Web finally provides a real _financial_ incentive with near-term payoff for that kind of research. Hitherto, the quest has been purely academic. And where there's the lure of a real payoff, stuff often happens quickly (usually -- batteries and flat-screen technologies being notable exceptions).
I hope that after I die the one word people use to describe me is "resurrected."
Not so, fortunately. A certain very large telco (which I'm not yet allowed to name) is now running its Intranet directory on an XML/XSL application which I've written. The application was developed on Linux and is currently running on Linux, although the customer intends to move it to Solaris.
My XML intro course is online; it's a little out of date at the moment but will be updated over the next few months.
XML and particularly RDF do have a lot to offer for search engines - see my other note further up this thread.
I'm old enough to remember when discussions on Slashdot were well informed.
Why doesn't anyone use the ScriptAlias directive? It does the same thing as query strings, but makes it look nicer, like the rest of the web. You can "say" you're looking at a directory or a .html file, but in reality you are viewing a single script. For an example go to http://store.wolfram.com/. There are no directories on the server side, it's all served off of one script. Yet, to the user, it appears as a hierarchical directory structure, complete with .html files. The only query string is your session id, which is appended to the URL in case your browser doesn't support cookies (however, these are not there if a robot views the site). Anyway, a simple directive like ScriptAlias can save everyone a lot of trouble. If anyone has questions about its usage, send me an email.
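To give a feel for the script side of this, here is a minimal sketch in Python. It assumes the server hands the part of the URL after the script's prefix to the script in the PATH_INFO environment variable, as CGI does. The /section/category/item.html layout and the field names are hypothetical, not what store.wolfram.com actually uses.

```python
import os

# Hypothetical URL layout: /section/category/item.html
FIELDS = ("section", "category", "item")


def parse_path_info(path_info):
    """Turn '/books/mathematica/42.html' into a dict of parameters."""
    segments = [s for s in path_info.split("/") if s]
    if segments and segments[-1].endswith(".html"):
        # The .html suffix is cosmetic; there is no such file on disk.
        segments[-1] = segments[-1][: -len(".html")]
    params = dict.fromkeys(FIELDS)        # every field starts out as None
    params.update(zip(FIELDS, segments))  # extra segments are ignored
    return params


def current_request_params():
    """Read the path the server passed to this script (CGI convention)."""
    return parse_path_info(os.environ.get("PATH_INFO", "/"))
```

The script then does its database lookup with those parameters exactly as it would have with ?section=books&category=mathematica&item=42, but a spider that skips query strings sees an ordinary looking .html URL.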
Jon
Engineering and the Ultimate
I hate to do an "amen to that, brother" post, but I'm going to do so.
Any reasonable search term is likely to present results like "Search returned 417,373 hits. Hits 1-10 displayed." You have to then winnow by adding include and exclude words until you get it down to a manageable 7,422 hits, then you browse them.
The truth is, I turn to wide searches quite rarely. I tend to find and "bookmark" authoritative sites I find on a given topic and return to those over and over again. It is only when a site grows noticeably stale or I have to research a new topic that I turn, reluctantly, to search engines. As for indexing database sites, I like the idea of extending the robot hack. Slightly less appealing would be to have a new HTML tag to include "bot content" in any page, including dynamic pages. An XML solution is a good idea, but I wonder how long before every extant site gets XML-aware? On top of that, XML is almost too flexible, making it likely that a hundred competing methods for indexing dynamic pages will appear and no one will know which one to cling to.
It depends on what technology you're using to generate the pages.
Zope sites for instance, are totally dynamically generated, even those pages that would normally be static. But the entire content of the site that's stored in the ODB is traversable via 'normal' URLs. This means that search engines can easily index your entire site.
Note, however, that this only works if you've taken care to expose your content via links. If you've deliberately hidden your content behind a search interface (and you can still do this with Zope), then your site will be no more indexable than any other dynamic site.
--
The real Webmaven is user ID 27463. I don't rate an imposter, because my ID is such a lame-ass high number.
Don't overlook XML-RPC, which builds on the XML spec to provide a way of serving data over the web to remote clients.
Then there's RSS, which is a way of serving up a news channel or other changing data. These applications are here and in use. Together, these XML-based technologies will someday provide the data layer for the software agents of the future. Read lately about that new "price-checker" technology? Imagine being the one business that doesn't serve up your product list and pricing to that agent.
An interface from XML to these "hidden" databases is only a matter of time. We're just caught right now at a moment between technologies: the authoring tools don't really exist.
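As a small illustration of why this helps an agent or an indexer, here is a sketch that pulls headlines and links out of an RSS channel with Python's standard XML parser. The feed contents and the example.com URLs are made up, and a real feed would be fetched over HTTP rather than pasted into the script.

```python
import xml.etree.ElementTree as ET

# Made-up RSS channel standing in for a site's changing content.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Hypothetical headlines</description>
    <item>
      <title>Big fire downtown</title>
      <link>http://example.com/news/1</link>
    </item>
    <item>
      <title>New motherboard released</title>
      <link>http://example.com/news/2</link>
    </item>
  </channel>
</rss>"""


def extract_items(feed_xml):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]
```

The point is that the agent never has to guess at page layout or follow query-string URLs: the site states what it currently has, in a form a program can read directly.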
----
lake effect weblog
{Network engineer in Chicago--looking for work!}