Slashdot Mirror


Search Engines Can't Keep Up

joshwa writes "The Boston Globe today reported a study by Nature saying that search engines barely index one-sixth of the pages on the net. To a certain extent it's a plug for the Northern Light search engine, which claims to be the most comprehensive (at a staggering 16 percent of the web), but it's an interesting read nonetheless."

13 of 82 comments

  1. Bogo-coverage by geophile · · Score: 2

    The article points out that one reason for low coverage is the lag (search engines are months out of date), combined with the incredibly rapid increase in the number of pages (100% growth in about a year). So even if *everything* six months old were indexed, coverage would still be only 50%.
    Anyone who does any searching quickly realizes this, so the study isn't breaking ground here, although maybe it quantifies the problem.
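
    A rough check of the lag arithmetic, as a sketch only. It assumes (the article does not say this) that the web grows exponentially and doubles once a year, and that an index holds exactly the pages that existed `lag_years` ago. The function name is made up for illustration.

```python
def coverage_after_lag(lag_years: float, doubling_time_years: float = 1.0) -> float:
    """Fraction of today's pages that already existed lag_years ago.

    Assumes steady exponential growth with the given doubling time,
    which is an assumption for illustration, not a measured fact.
    """
    return 0.5 ** (lag_years / doubling_time_years)


if __name__ == "__main__":
    for lag in (0.25, 0.5, 1.0):
        print(f"lag of {lag} years: coverage {coverage_after_lag(lag):.0%}")
```

    Under those assumptions a six-month lag leaves roughly 71% coverage, and the 50% figure corresponds to a lag of about a full year.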

    Beyond this, I don't see how the study's result could be meaningful.

    1) How did they come up with their estimate of 800 million web pages? If that number is bogus, so is the %age. They can measure the pages they found, but how do they measure the pages they couldn't find? Different techniques of estimation might provide great variance in the number of web pages.

    2) Counting pages (and computing coverage) is especially problematic given the increasing amount of content generated dynamically.

  2. The other 84% by twdorris · · Score: 3

    It's OK that only 16% of the web is summarized by search engines. The other 84% is dedicated to sex sites anyway...and we all have those bookmarked by now...

  3. Uhhh.... by ChrisGoodwin · · Score: 3

    One sixth is a staggering 16 percent.

    Reminds me of a joke, but I can't remember the specifics. Something like "Fully 33 percent of our foos are bar, but only one third of their foos are bar."

    --
    Pretend there is some witty statement here.
  4. Distributed indexing??? by Hard_Code · · Score: 2

    Hmmm...what is the limiting factor in indexing pages? Is it bandwidth? Or CPU? Or just the fact that so many go up and down so fast? If it's bandwidth or CPU, would a distributed project work??? I know you can get dumb Yahoo pager and Altavista Search and all that junk...what if they had "Download: Altavista Index Agent/Spider" or something, where people could use their spare cycles/bandwidth to index...would it work? Does that even make sense? Like SETI, the server could give them some chunk of "namespace" to index and the spider/agents could go at it.
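
    A minimal sketch of the "chunk of namespace" idea, assuming the server splits work by hashing hostnames. The function names and the choice of SHA-1 are illustrative assumptions, not anything a real search engine is known to use.

```python
import hashlib


def assign_chunk(hostname: str, num_agents: int) -> int:
    """Map a hostname to one of num_agents chunks by hashing it."""
    digest = hashlib.sha1(hostname.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_agents


def partition(hostnames, num_agents):
    """Group hostnames into per-agent work lists (hypothetical server side)."""
    chunks = {agent: [] for agent in range(num_agents)}
    for host in hostnames:
        chunks[assign_chunk(host, num_agents)].append(host)
    return chunks
```

    Hashing whole hostnames keeps every page of a site with one agent, so a single volunteer can respect that site's crawl rate. It does nothing about agents that drop out or return bad data, which a real project would have to handle.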

    --

    It's 10 PM. Do you know if you're un-American?
  5. Search engine coverage by substrate · · Score: 4

    The percentage of indexed web sites is small, but the amount of data that represents is pretty staggering. Unlike an encyclopedia or other reference book, which can cross-reference between a concept in the index and a number of appearances of the concept in the body of the text, a web search engine has a much harder job (as do people trying to use the search engine). For an encyclopedia, some person does the job of indexing things with an understanding of context, so for instance 'green' in the index would be referenced to entries on 'colours' and 'the spectrum' but not 'grass'. The web search engine blindly returns every instance of the word 'green' with no regard to context. So if the person was actually wondering how to make 'green' with his box of Crayolas (since his sister ate every shade of green in his box of 64), he'd either have to wade through each site till he found what he was after or choose a better search term.

    Machines aren't very good at being intelligent in this manner, so suppose a new search engine was created. You type in a search term and it comes back with a list of matching pages. You again wade through the list, but now you can also award a number of relevance points to the ones that matched closest. This would work well for a while but would break down in the long run: as the web continues to expand, new pages will be unranked, so they will not appear in the ranked lists of potential hits (at least for popular search terms) and so will never get ranked.

    What might work better would be a search by reduction. Type in some overgeneralized search term and the text on the page is distilled down to a brief outline. There are already packages which can create fairly decent summaries of documents. You click on a button that indicates "I like this, find me more like it" which means that there's something you like about the summary so it generates a number of new more specific search terms from the summary and comes up with a new list.

    1. Re:Search engine coverage by TWR · · Score: 2
      I was actually involved in research on this very concept, back in 1996 (seems like pre-history, eh?). You can check out Professor Jude Shavlik's research at http://www.cs.wisc.edu.

      I left the project because I don't think it works. The system would have involved THE WORLD'S LARGEST NEURAL NET, with inputs describing all of the "important" words on the page, the distance between various words, and the font size used to display the words.

      IMHO, there were a few insurmountable problems with the project. One, the neural net was way too large. There are too many words to search, and the word list would need to grow over time (in 1996, would the words "Linux" and "PalmOS" and "WinCE" have been frequent enough to merit their own input nodes? Probably not. Today, on the other hand....). How do you design a neural net which changes the number of input nodes over time, but doesn't lose its current weights? I don't know if there's any research on this, but it would be interesting.
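
      One naive way to add an input node without losing existing weights, as a sketch only: give the new input a zero weight so the outputs for old inputs are unchanged until training moves it. This is an illustration for a single linear layer, not a description of what the Wisconsin project did, and the function names are made up.

```python
def forward(weights, inputs):
    """One linear layer: weights holds one row of input weights per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def add_input_node(weights, initial_weight=0.0):
    """Return a copy of weights with one extra input column appended.

    With initial_weight=0.0 the new input (say, a newly common word)
    has no effect on the outputs until training adjusts it.
    """
    return [row + [initial_weight] for row in weights]
```

      The trade-off is that a zero start ignores the new word entirely at first, which is the same cold-start problem in a different place.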

      There are various problems with synonyms and related words as well. I also wasn't sure that the Hn tags were good indicators of importance. Web pages aren't structured like outlines anymore.

      The biggest problem is the lack of NEGATIVE feedback. You only tell the neural net search engine what you like, not what you don't like. Neural nets are initialized with random weights for various technical reasons (Prof. Shavlik has experimented with starting neural nets off with rule-based knowledge in his KBANN project). That means that some things which you DO like will most likely get negative weights at first and you'll never see them. While you might specify a list of words you do NOT want to see (which would help the inputs), you would probably not spend time examining pages to see if they do NOT interest you (which means you would never do back propagation with a negative answer). No one would want a product which says: "I think you will really hate this page. Am I right?" The problem is that this is a very necessary part of training a neural net.

      This isn't to say that the research project hasn't shown some results, but it isn't as ideal of a solution as you'd think.

      -jon

      --

      Remember Amalek.

    2. Re:Search engine coverage by Tom+Christiansen · · Score: 2
      There's been quite a bit of work done in the last couple of years for devising completely new methods of storing spider information. Scientific American very recently had a description of one of them, although there are a few others as well.

      The system in Scientific American works by analysing not merely the contents, but the relationship of the links. It then classifies sites and documents according to the pattern of links into and out of them. This helps in prioritizing "authoritative" sites, for example.

      You should check out the article and its bibliography.
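
      One well-known scheme of this general kind is hub/authority scoring (Kleinberg's HITS). Whether it matches the system in the article is an assumption on my part; the sketch below only shows how link patterns alone can surface "authoritative" pages. The `links` structure and function name are illustrative.

```python
import math


def hits(links, iterations=20):
    """links maps each page to the list of pages it links to.

    Returns (hub, authority) score dicts, each scaled to unit length.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Rescale both so the scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

      Pages that many good hubs point to end up with high authority scores, without the engine looking at page text at all.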

  6. The question: Do we want all the web pages? by Erich · · Score: 2
    When 80% of the pages on the web are ``JimBob's Personal Web Page'' or ``Click HERE FOR 31337 Pr0N!'' do we really need (or want?) all those web pages bogging down the search engines? I'd say that only about 10% of the web is useful information. If the crawlers can get that (and if it's useful, it will get linked to (in theory) from other web pages) then that's probably all we'd want...

    One problem I have with engines is sites with changing sidebars... when the sidebars mention one of my keywords because it was a recent article when the crawler went by, but the article has nothing to do with what I want...

    --

    -- Erich

    Slashdot reader since 1997

  7. Search engines as a commodity? by TrentC · · Score: 2

    To make what I'm thinking of possible, you'd need to have a standard indexing format. I'm sure Microsoft has one we can use, as long as half the links point back to them :)

    Isn't that part of what the META tag is for? Or the LINK tag?

    Looking over my copy of the HTML 4.0 specification, there's not a specified list of META attributes, but maybe the following should be considered standard for search engines:

    • "description": for an overview of your page
    • "keywords": give something for spiders to index by

    The following LINK attributes should be set also:

    • "home": Topmost level of your site
    • "copyright": Copyright info
    • "made": Author information

    That way, a search result could take the format of:

    • Page Title
    • META description
    • URL
    • Home LINK attribute
    • Author name (or webmaster of a larger site)
    • Copyright information
    • Keyword relevancy
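
    A sketch of how a spider might pull those fields out of a page, using Python's standard `html.parser`. The class name, the sample page, and the treatment of "home", "copyright", and "made" as `rel` values are assumptions for illustration, not a standard.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the title, META name/content pairs, and LINK rel/href pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "link" and attrs.get("rel") and attrs.get("href"):
            for rel in attrs["rel"].lower().split():
                self.links[rel] = attrs["href"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page used only to exercise the parser.
SAMPLE = """<html><head>
<title>Example Page</title>
<meta name="description" content="An overview of this page">
<meta name="keywords" content="search, indexing">
<link rel="home" href="/">
<link rel="copyright" href="/copyright.html">
<link rel="made" href="mailto:webmaster@example.com">
</head><body></body></html>"""


def summarize(html_text):
    parser = PageSummary()
    parser.feed(html_text)
    parser.close()
    return parser
```

    Calling `summarize(SAMPLE)` gives an object whose `title`, `meta`, and `links` hold the pieces of the result format listed above; keyword relevancy would still have to come from the search engine itself.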

    The best thing about the LINK attributes is that at least one browser, iCab, provides a set of buttons for several LINK attributes -- start, end, next, prev, home, search, help, made, etc. Too bad it's MacOS only; maybe someone could create a similar set of buttons for Mozilla?

    Anyway, Altavista, Yahoo, Infoseek, etc... could make deals with the big ISP's/web host services such as Mindspring, Netcom, Earthlink, Geocities, Tripod, etc... Those sites would then index their own sites, which would save your spider/crawler a lot of time.

    Now there's a thought! Then meta-search engines like Metacrawler could have more meaningful returns.

    Am I the only one that thinks a search engine should be a commodity? I don't care which search engine I use, so long as I get the best results. (Keeping paid advertisements out of the search results would be a benefit, too...)

    There is still the issue of other sites not located on these big ISP's, like .edu's and ibm.com's.

    Maybe someone should consider an EduSearch search engine, indexing only sites under the .edu domain? (Especially if its index can be used by a larger metasearch engine...)

    As for ibm.com and the like, large corporate web sites should have some form of search facility; an Alertbox column from UseIT.com discussing corporate intranets says that having some form of search facility should be considered essential -- I don't see why the same shouldn't be true for their Web shingle as well.

    Jay (=

  8. Searching the Web by jbgreer · · Score: 2

    Why do we need to search the whole Web?
    Are we afraid that someone in New Guinea has the answer to our life's problems?
    I don't see why searching the whole web is any more relevant an activity than reading every book that has been written. Some will see a flaw in this: they'll say, "Reading the web and searching the web aren't the same thing - I want to know my choices." Fine, I say - you don't know all your choices when it comes to books, either.

    Then there's the "quality" argument: "I don't want all of the references to 'X' - I want only the 'good' references to 'X'." On the Internet not only does no one know if you're a dog, they don't know if you're a dog with bad taste! I think this argument needs to be changed; I like the Social Sciences Index idea, personally: the number of references to an article makes it "important". That is, the greater the number of times that an article is referred to by another article, even if the reference is only to refute the original, the higher the ranking of the article. We already see this in action - they're called portals. They are the hot spots of the web...
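
    The citation-count idea is simple enough to sketch. This assumes references arrive as (citing, cited) pairs and that repeat citations and self-citations should not count; both choices and the function name are illustrative.

```python
from collections import Counter


def rank_by_citations(references):
    """references: iterable of (citing, cited) pairs.

    Returns (cited, count) pairs, most-cited first, counting each
    citing/cited pair once and ignoring self-references.
    """
    counts = Counter()
    seen = set()
    for citing, cited in references:
        if citing != cited and (citing, cited) not in seen:
            seen.add((citing, cited))
            counts[cited] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

    Note that a reference made only to refute the original counts exactly the same as an endorsement, as described above.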

    --

    --
    The Norton Anthology of English Literature, 4th Ed., Vol 2
  9. Plug, eh? by gary.flake · · Score: 2

    Just to set the record straight, the poster's assertion that the article is a Northern Light plug is completely baseless. The authors (Lawrence and Giles) work at The NEC Research Institute (where I work), which has no connection to Northern Light. In fact, they did an earlier and less comprehensive study a year ago that showed Hotbot and Altavista had the greatest coverage at that time.

  10. In the Trenches... by Asim · · Score: 2

    I run and manage a web site in my *COUGH* spare time, whose purpose is to categorize other sites with Middle Eastern dance (better known as belly dance) content.
    Having started up a couple of years back, I can say I've seen some of what this article is talking about. More and more, I see sites listed and mentioned by word of mouth that I had not found via any of the major search engines. Even with date restrictions, a search of the majors (Altavista and HotBot in my case) can eat up days, literally.
    The reviews I write tend to note this fact -- although I have a few "big" Middle Eastern Dance sites, my focus and goal is noting all the little sites that are being left behind. Most of them still come from the search engines, but it's just too much. Even with 100 workers, I'd still not get them all, could not.
    I can't say I know of a realistic way of overcoming this. What would be good is to have a strong effort to have all the major ISP's offer an easy way to register with all the search engines any pages their users create. It's easy to create a web site, but so many people get left behind in actually promoting it, and when they do, they do so very poorly. (For the moment, let's ignore those who just don't do HTML well.) Without the promotion, it's just for a few families and friends, unless the content is really interesting, and is promptly drowned out by the chaos of the web.
    Also, I think projects like Google and the push towards XML are imperative to the health of the web. We need to move away from the free-form nature of _everything_ on the WWW, and towards some more structure, more focus. People simply need to be able to find stuff, and they cannot right now. I'm going to do my part -- my site is being converted to XML for the far future, and, for the near future, the Perl scripts that build it have already been rewritten to be moved to a server with CGI, so that people can search my site, specifically.
    Just my two cents.

  11. distributed "SETI"-like initiatives called for? by jwjr · · Score: 2

    If a page is truly useful, likely someone is accessing it. A distributed program to harvest those pages could be quite useful. You could choose when to allow it to examine your browsing history, and when to pull back the curtain, as it were. Of course, you'd have to make privacy guarantees. You'd also want to make the source code visible to the world. If a page you were browsing was unknown to the system, then spidering from it would probably be quite productive, so the program could harvest your spare CPU cycles to spider from any pages that you visit that the search engine does not yet know about. Everyone would have an incentive to participate to make sure that the pages they want to see indexed are actually indexed.

    To avoid the Netscape "What's Related?" fiasco, the authors should allow the end user editorial control, and provide for some discretion over and anonymizing of the results submission.