Is The Web Becoming Unsearchable?
wayne writes: "CNN is running a story on web search engines and their inability to keep up with the growth of the web. Web directories such as Yahoo! and the Open Directory Project can take months to add a site and the queue of unreviewed sites is growing. Most search engines are even further behind and are filled with off-topic and dead pages. The trend is toward pay for listing. Will the free, searchable web fade away?" The article gets beyond the "Wowie, so much content, engines can't keep up" typical blather and addresses some of the reasons search engines have a hard time keeping up.
Yowie.
----
Every year during my review, I just pray the words "slashdot.org" aren't mentioned.
The correct way to handle this situation is the way the search engines already do it - when a link is reported dead, they just make a request to the link. If it generates an HTTP 404 response code, or the site is down, it's actually marked dead.
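Here's a minimal sketch of that check in Python. The function names, and the choice to treat any connection failure as "down", are my assumptions for illustration, not how any particular engine does it. The fetcher is passed in as a parameter so the decision logic can be tried without touching the network.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the site is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, or timeout: treat the site as down.
        return None


def is_dead(url, fetch=fetch_status):
    """A reported link counts as dead if it returns 404 or cannot be reached."""
    status = fetch(url)
    return status is None or status == 404
```

Calling `is_dead(url)` on a reported link does the real request; a production checker would presumably retry over several days before delisting anything.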
I'm not convinced this is always a good idea, though - I've worked for a guy who would battle for top positioning on the search engine with a few competitors. When any of them noticed that another's site was down, they'd submit that site as a dead link. I like Google's Cached page mechanism, which allows you to view sites that are currently unreachable. Great for when you need docs from a site which happens to be down at the time.
This is actually trivial to implement, as shown in Google's toolbar page: http://www.google.com/options/toolbar.html Of course, you'd need to use this technique with a search engine that takes dead link submissions. E.g., AltaVista and its "Add or Remove a Page" link here: http://web.altavista.com/cgi-bin/query?pg=addurl
What the web REALLY needs is a directory. An honest-to-goodness, telephone/yellow pages style directory. This whole nonsense about keyword searching is providing people who just want traffic with a lot of free advertising and listings.
The phone company provides you with one free listing (unlisted is optional), and makes you pay for each extra category (like in the Yellow Pages -- and if you're not from the U.S., please see http://www.bigyellow.com/supertopics for an example) that you want something listed in. Search engines ought to be replaced with something similar.
Yes, I know Yahoo and Dmoz try, but they don't go out and actively index sites, making their use limited, and the number of sites even more limited. If Google were to create a Yahoo/Dmoz style directory, that would help. Better yet, if people were forced to provide either META tags, or some information when they acquired their domain (part of whois?)....
For example, where can I get my oil changed in Paris, France?
This is a real problem, but the fundamental reason it's a problem is one that's well-understood by library scientists: We only have addresses, not content identifiers.
To use a book analogy, the entire web is built on Dewey Decimal addresses (URLs), when what we need is those combined with ISBN numbers (URNs).
I didn't make up the idea of URNs - the concept was first described to me by Peter Deutsch, the inventor of Archie, at Interop sometime in the early 90's, shortly after the web got going. (Back when there were no search engines, and we found out about new web sites by visiting NCSA's What's New page, which for a while, anyway, actually cataloged *every* new web site that appeared, and some of us could claim to have surfed the entire web...)
The idea behind URNs is that they would be a unique identifier for the content. The same content living on different sites would have several URLs, but only a single URN. This is still needed today, but the problems that kept it from being implemented then are even more intractable today: Who hands out URNs? (IANA didn't want to touch that!) How do you handle versioning? What about dynamic content? Who are the librarians?
We still desperately need something that fills this need, but it's not likely we'll get it. One last parting thought - in discussing this with Deutsch, he pointed out that these are new problems to us, but that the library scientists had solved them quite some time ago: It is only the typical CS insistence on reinventing everything and dismissing the knowledge of those in other fields that makes the process so incredibly painful... Hubris strikes again.
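To make the URL/URN distinction concrete, here's a rough Python sketch. Using a SHA-256 digest of the bytes as the identifier is purely my assumption for illustration (it is not what the URN proposals specified, and the `urn:sha256:` prefix is made up). It also sidesteps the versioning and dynamic-content questions entirely, since any change to the bytes yields a different identifier.

```python
import hashlib
from collections import defaultdict


def content_id(body: bytes) -> str:
    """Derive a location-independent identifier from the content itself."""
    # Hypothetical naming scheme, for illustration only.
    return "urn:sha256:" + hashlib.sha256(body).hexdigest()


def group_by_content(pages: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each content identifier to every URL that serves that content."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[content_id(body)].append(url)
    return dict(groups)
```

Feed it a dict of URL to page bytes and mirrors collapse into one entry: several URLs, one identifier.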
"The future's good and the present is nothing to sneeze at." - Roblimo's last
Look up information on the "Invisible Web" - islands typically untouched by search engines, where you need another site to "hop" to these nets of information - cool stuff can abound in these disconnected areas. Here are some links to get started with:
DirectSearch - Invisible Web Search
The InvisibleWeb
WebData.com - Invisible Web Search
InfoMine - Scholarly Internet Resource Collections
AlphaSearch - Invisible Web Search
IIRC, Slashdot even ran an article about this not too long ago - I think this is it, not sure...
Worldcom - Generation Duh!
Reason is the Path to God - Anon
You mean the META tag that already exists?
The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
I don't know WHAT they are talking about -- I can find ANYTHING that I look for on Google -- even sites that I have just created a day or two ago have been found. These people just aren't using the right search engine, dammit! =)
------------
CitizenC
The article skims over the fact that search engine technology is progressing fairly rapidly, and that some companies (Google) are creating new technologies that exploit the way the web works while Yahoo! and some others are relying on older technology for some things (like filtering pages by hand for their directory!).
Google's approach is novel: make the web pages rank themselves. If more people link to your site, it's probably a better site. If few enough people link to it, it probably isn't, and besides that it'll probably never be found.
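For the curious, here's a toy version of the "pages rank themselves" idea in Python. It's a simplified PageRank-style iteration under my own assumptions (the `rank_pages` name, the 0.85 damping value, and the fixed iteration count are all illustrative), and certainly not Google's actual implementation.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by who links to them.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

With `links = {"a": ["c"], "b": ["c"], "d": ["c"], "c": ["a"]}`, page `c` comes out on top because three pages link to it, and `b` and `d`, which nobody links to, land at the bottom.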
Web site creators have to do the legwork to get their sites recognized, and going to a general search engine to do it isn't the way. If someone makes a site and tells their friends about it, and their friends like it and link to it, it'll get picked up; that's the way of the web. (At least, it'll get picked up by crawlers like Google, and even ranked highly if enough people link to it).
Search engine tech has yet to catch up to dynamic pages, but it's the fault of the content creators if they want their pages on search engines but can't code enough alt tags to make their stuff show up.
In any case, the bulk of the web does work, and good pages get recognition. I've always eventually been able to find what I'm looking for on the web, no matter what the topic. Search engines have to grow like everything else, but so far they're the best thing going and getting better.
This is how I found
Did anyone out there get hooked up to
-----
crazy dynamite monkey
It has a second, separate business re-selling articles from trade journals, professional publications, etc., for which you do pay... but less than you would pay to buy the same thing in dead-tree format from the publisher.
What confuses people is that, by default, the main engine will return hits on both the web and the special collection.
-Eldurbarn
Of course, you'd need to use this technique with a search engine that takes dead link submissions. E.g., AltaVista and its "Add or Remove a Page" link
AltaVista does not allow submissions from visually impaired users or users of text-based web browsers such as Lynx, Links, or w3m. Its submission page uses a GIF image (burn all GIFs) to display rotated text in various fonts. The user is supposed to read the text and enter it into a field below. But visually impaired users, users on text browsers, and users on browsers whose developers have been cease-and-desisted by Unisys never see the GIF and cannot contribute links to AltaVista.
Will I retire or break 10K?
Except that this isn't true. If I look up, say, Ronald Reagan, none of the top 5 hits are big commercial sites. They include the White House pages on former presidents, a fan page, the Reagan Presidential Foundation, the Reagan Library, and the Official Reagan Web Site. If I look up Linux Kernel, the #1 site is the Kernel Archives page. Maybe you're looking for data where there just aren't many interesting independent web sites out there, which is not something that can be cured with a better search engine.
There's no point in questioning authority if you aren't going to listen to the answers.
What ever happened to the peer to peer idea of searching? I remember when Napster and GNUtella started, people were talking about how this might actually alter the way searching was happening on the web. By having each server tell us what they have, we are assured that when someone searches for how to replace a broken window, they won't get what they don't want.
--------------------
`Lex - Find Me Here: Text Appeal
I have a suggestion to anyone who is thinking of implementing a better directory. First, define the categories, and allow any site to submit their site to their categories. Then, introduce moderation to the mix. Allow users of your directory to rank sites in terms of suitability to the category. Allow them to create red flags for people submitting porn to health->teens->sexuality, and so forth. Let the users do the work!
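A bare-bones sketch of what that could look like, in Python. Everything here is hypothetical (the class names, the 1 to 5 rating scale, and the rule that a listing is hidden once a set number of distinct users have flagged it); it just shows the moving parts of submit, rate, flag, and browse.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Listing:
    url: str
    ratings: dict = field(default_factory=dict)  # user -> score from 1 to 5
    flags: set = field(default_factory=set)      # users who red-flagged this listing

    def score(self):
        """Average user rating, or 0.0 if nobody has rated the listing yet."""
        return sum(self.ratings.values()) / len(self.ratings) if self.ratings else 0.0


class Directory:
    def __init__(self, flag_limit=3):
        self.flag_limit = flag_limit          # distinct flaggers needed to hide a listing
        self.categories = defaultdict(dict)   # category -> {url: Listing}

    def submit(self, category, url):
        self.categories[category].setdefault(url, Listing(url))

    def rate(self, category, url, user, score):
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        # One rating per user; rating again overwrites the old score.
        self.categories[category][url].ratings[user] = score

    def flag(self, category, url, user):
        self.categories[category][url].flags.add(user)

    def browse(self, category):
        """Return unhidden URLs in the category, best-rated first."""
        visible = [
            listing
            for listing in self.categories[category].values()
            if len(listing.flags) < self.flag_limit
        ]
        visible.sort(key=lambda listing: listing.score(), reverse=True)
        return [listing.url for listing in visible]
```

Keying ratings and flags by user means one person can't stuff the ballot just by clicking repeatedly, though it does nothing about someone registering lots of accounts.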
I think moderation works well for sites like slashdot, why not a moderated web directory?
No, Thursday's out. How about never - is never good for you?
Well, what you're describing sounds a lot like META KEYWORD tags.
Having been an Open Directory editor in the past, I don't really think the problem is finding the right pages. Actually the biggest problem is just that a lot of editors aren't active, and it's hard to know who's active, because they're listed as editors even if they haven't logged in or checked submissions for a year. This creates problems for editors who have to cooperate with other editors, and may also give outsiders the impression that Open Directory is overwhelmed in general, when really it's just that the editor they submitted to is AWOL.
Yahoo is doomed to failure because they don't have enough people working for them. Open Directory works just fine, because they have orders of magnitude more eyeballs working in parallel. No, Open Directory doesn't list every page on the web, and that's just fine with me as a user -- it's more useful because it's selective.
The Assayer - free-information book reviews
Find free books.
I never said it would be easy! :)
Having actually tried to implement a DDC based web directory once, I am familiar with the problem that many pages would possibly fall under many categories. This is a problem with any directory-based approach, especially if you list a page in one category and then the page changes enough so that the category no longer applies.
In your example, I would hope it would not be too much trouble for you to put a different class number into the pages that make up each logical section of your site. Or if the site is small enough, it would likely fall under something like "personal web pages", which may have a number of subclasses itself, and then you'd choose the one you felt appropriate.
Again, this is a common issue among all directories, where do you put stuff? Do you allow multiple listings/classes per site/page? You still end up having to include some sort of keyword or text-based search so that users are not forced to browse the directory structure, guessing at the classification they are looking for or where it lies in the hierarchy. Text searches also allow for the possibility of searching based on content rather than metadata.
Most of this is a non-issue, given that Google seems to have rather successfully implemented a non-directory type of engine-- succeeding where Altavista was simply unwieldy. At least that's my impression. I usually find what I want with Google.
I do not have a signature
This also brings up the problem of being able to use multiple pages that are essentially redirects to get around the listing limits. For instance, I make http://www.hotgrits.com/natalie1.html, .../natalie2.html, .../natalie3.html, etc which all are really mirrors of http://www.hotgrits.com/portman.html, which is the main page for my site. The only thing I change is the category for each page so that my site effectively shows up in numerous places in the directory. With a properly constructed CGI program I could be listed in every category without having to work that hard.
I do not have a signature
Yahoo and DMOZ are web directories. This is a very human labor intensive way to categorize the web. Google is actually a search engine. It spiders out and runs an indexing algorithm of some sort to help it respond to queries. These are very different approaches.
Yahoo and the like are doomed to failure until someone implements something like the Dewey Decimal System for web pages and then convinces a large number of webmasters to correctly classify their pages using it. That way a machine can do the hard work and only the person designing the page need do the actual work of making sure the page is classified correctly.
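Here's roughly what the machine side of that might look like, as a Python sketch. The META tag name `classification` and the comma-separated list of class numbers are conventions I'm inventing for the example; no such standard is implied.

```python
from html.parser import HTMLParser


class ClassificationParser(HTMLParser):
    """Collect class numbers from <meta name="classification" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.class_numbers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "classification":
            return
        content = attributes.get("content") or ""
        # Allow several class numbers per page, separated by commas.
        self.class_numbers.extend(
            part.strip() for part in content.split(",") if part.strip()
        )


def classify(html_text):
    """Return the class numbers a page declares for itself, in document order."""
    parser = ClassificationParser()
    parser.feed(html_text)
    return parser.class_numbers
```

A crawler would call `classify()` on each page it fetches and file the URL under whatever numbers come back. The numbers are only as honest as the webmaster who typed them, of course.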
Obviously this is fraught with problems similar to those of keyword spamming, but it's either that or build something like DMOZ on a decentralized basis, so that any individual maintainer builds a set of links that are tailored to his/her interests and either uploads them to a central server or provides them as an XML document for an engine to work with.
I do not have a signature
The only "problem" is that the Internet is simply too large for one engine to index. People go to Google expecting to search every web document that's online, a labor comparable to going to your local library and expecting their database to tell you about every book in existence on a particular topic or by a particular author. Even the Library of Congress isn't that comprehensive.
I disagree with the article's claim that "much of the most interesting and valuable content [on the Web] remains hard to find." I think that the most interesting and valuable content is easy to find, provided that you start looking in the right place. Which means that if I want information on the latest US school shootings, I don't go to Yahoo or Google and search for "school shootings", I go to those sites and search for major news sources (BBC, CNN, Reuters, etc.) and use their up-to-the-minute search engines.
The role of search engines isn't "shrinking" by a long shot; it's just becoming less comprehensive. Searching on the Web is now a two-step process instead of a one-step process, and you have to apply a little more intelligence than you could back in 1995. If high school students researching their latest humanities paper have a problem with that, well, they should ask us twentysomethings what it was like to have to use card catalogs and microfiche for our own high school projects.
I think that neither the people who claim that this is impossible nor the people who want to dismiss it are correct. There is undoubtedly a major problem, and it is only getting worse. The flip side of that, however, is that while we are getting farther and farther from having a complete listing of the web in search engines, the ability of end users to find what they are looking for appears to be improving, particularly with the advent of better search engines like Google.
The solution to indexing the web completely, or much more completely, has to lie in another methodology. How about a distributed solution? Google@home? distributedYahoo!.net? Honestly...there are ways to tackle the problems, and the reason why this entire system exists is because people refused to just shake their heads and say, "Nope, can't do it...sorry!"
How about a button in browsers that enables you to mark a page as a dead link? Just hit that button and a centralized system gets a reference to the URL currently in your browser. That centralized system is funded by all search engines and all search engines draw from it. Yes, I know..."What if a user falsely claims a site to be dead?" Well, what if it took 100 different IPs claiming it to be dead before it really was considered dead? If you don't get many people hitting the site from a search engine in the first place, then you probably aren't serving it up to too many people.
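The bookkeeping for that is tiny. A sketch in Python, where the class name and methods are made up for illustration and the threshold of 100 comes straight from the idea above:

```python
from collections import defaultdict


class DeadLinkRegistry:
    """Central tally of dead-link reports, counted per distinct reporting IP."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.reporters = defaultdict(set)  # url -> set of IPs that reported it

    def report(self, url, ip):
        """Record one report; return True once the URL counts as dead."""
        # A set ignores repeat reports from the same IP.
        self.reporters[url].add(ip)
        return self.is_dead(url)

    def is_dead(self, url):
        return len(self.reporters.get(url, ())) >= self.threshold
```

One competitor mashing the button a thousand times still counts as a single report, though a real system would also have to worry about proxies and shared IPs.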
How about a system for pre-indexing an entire site, such that the person who runs it can have a single document at the root of their domain with the index results? A standard could be developed that would even go so far as to map out the existing sub-sites (for AOL personal sites, for example) so that the engine could go to each one for the index documents.
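As a sketch of what such an index document could contain, here's a Python version that builds a word-to-pages index for one site and writes it out as JSON. The JSON layout, the field names, and the function names are all my invention; no such standard exists as far as this thread goes.

```python
import json
import re


def build_site_index(pages):
    """Build a word -> sorted list of paths index.

    pages maps a path on the site to that page's plain text.
    """
    index = {}
    for path, text in pages.items():
        # set() so a word repeated on one page lists that page only once.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, []).append(path)
    return {word: sorted(paths) for word, paths in sorted(index.items())}


def render_index_document(site, pages, subsites=()):
    """Serialize the index, plus pointers to sub-sites that publish their own."""
    document = {
        "site": site,
        "subsites": list(subsites),
        "index": build_site_index(pages),
    }
    return json.dumps(document, indent=2)
```

The site owner would regenerate this whenever content changes and drop it at a well-known location at the root of the domain. The engine then fetches one document per site (and one per listed sub-site) instead of crawling every page.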
I guess that what I mean to say here is that the problem is largely based around the hugeness of the web, and how brute force is no longer enough. But that's not really that big a problem...all that's needed is a bit of creativity.
For your security, this post has been encrypted with ROT-13, twice.
You said the same thing two years ago!
The Web is a victim of its own success. Now every snake-oil salesman, fanboy and their grandmother has a website.
Even Slashdot is too big. How the hell are you supposed to follow a conversation this big?
especially with the goatsex.
I'm gonna start mailing postcards.
Excelsior,
ME
evanchik.net
Actually, this is just a great opportunity for the next Great Search Engine. Look at how well Google has done just indexing a small portion of the web (1%, according to the article). So that leaves the door wide open to anyone who can crack the puzzle of how to keep up with the web. If word gets around that something is better than Google, it'll be huge. You can say "oh, no one can index the whole web accurately," but there is someone out there with the brains and courage to try it -- and succeed.