Search Engines Can't Keep Up
joshwa writes "The Boston Globe today reported on a study published in Nature saying that search engines barely index one-sixth of the pages on the net. To a certain extent it's a plug for the Northern Light search engine, which claims to be the most comprehensive (at a staggering 16 percent of the web), but it's an interesting read nonetheless."
Anyone want to set up a cypherpunks/cypherpunks account?
The article points out that one reason for low coverage is the lag (search engines are months out of date), combined with the incredibly rapid increase in the number of pages (100% growth in about a year). So even if *everything* six months old were indexed, coverage would still be only about 70%, and with a full year's lag it drops to 50%.
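The arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes the web doubles smoothly every twelve months (the article's "100% growth in about a year") and that an engine has indexed every page that existed some number of months ago. The function name is made up for illustration.

```python
def coverage_after_lag(lag_months, doubling_months=12.0):
    """Fraction of today's pages that already existed lag_months ago,
    assuming smooth exponential growth (an assumption, not a measured fact)."""
    return 2.0 ** (-lag_months / doubling_months)


for lag in (3, 6, 12):
    print(f"{lag:2d} months behind -> {coverage_after_lag(lag):.0%} coverage at best")
```

Under that assumption a six-month lag caps coverage at roughly 71%, and it takes a full year of lag to fall to 50%.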
Anyone who does any searching quickly realizes this, so the study isn't breaking ground here, although maybe it quantifies the problem.
Beyond this, I don't see how the study's result could be meaningful.
1) How did they come up with their estimate of 800 million web pages? If that number is bogus, so is the %age. They can measure the pages they found, but how do they measure the pages they couldn't find? Different techniques of estimation might provide great variance in the number of web pages.
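On measuring the pages you couldn't find: one standard trick is capture-recapture (the Lincoln-Petersen estimator), where the overlap between two independent samples gives an estimate of the whole. The sketch below only illustrates that idea. It is not claimed to be the method the study used, and the URL sets are invented.

```python
def estimate_total(sample_a, sample_b):
    """Lincoln-Petersen estimate of population size from two samples.

    Assumes both samples are drawn independently from the same population,
    which real search-engine indexes only approximate.
    """
    set_a, set_b = set(sample_a), set(sample_b)
    overlap = len(set_a & set_b)
    if overlap == 0:
        raise ValueError("no overlap between samples; estimate is undefined")
    return len(set_a) * len(set_b) / overlap


# Invented example: two engines that index different slices of a tiny web.
engine_a = {f"http://example.com/page{i}" for i in range(0, 60)}
engine_b = {f"http://example.com/page{i}" for i in range(40, 100)}
print(estimate_total(engine_a, engine_b))
```

The estimate is only as good as the independence assumption. Engines tend to index the same popular pages, which inflates the overlap and pulls the estimate down, so different techniques really can give very different totals.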
2) Counting pages (and computing coverage) is especially problematic given the increasing amount of content generated dynamically.
Help achieve Liberty in your lifetime - join the Free State Project - http://www.freestateproject.org
Beowulf.
Methinks you have a bit of a way to go yet! I attempted my search for Linux Home Automation and it failed to bring up a site in England (Fortune City). I kept the search to just central & eastern Europe. It only brought up 1 site about Linux (there are quite a few more sites than that in Europe).
--
Linux Home Automation - Neil Cherry - ncherry@home.net
http://members.home.net/ncherry (Text only)
http://meltingpot.fortunecity.com/lightsey/52 (Graphics)
Neil Cherry - Linux Smart Homes For Dummies
http://www.research.digital.com/SRC/personal/Kris
(Don't know why a space got inserted in the link, just remove the space after you get the 404 error. Sorry!)
Also from Compaq (DEC) SRC:
Web Archeology
Mercator Web Crawler
The really tough problem is dynamic pages served out of a database. This is where the really useful information is stored, but search engines don't index it. I've been thinking about this for a few years and I think the solution is a standard way of representing this information. So what you'd do is have a single, large text file on your web server that specifies in one field the text of the database record and in another field the URI to use to access it.
This way you won't be indexing really useless stuff like postcode (zip for Yanks) databases and census results but will be indexing things like dynamic news, patents and the like.
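A rough sketch of what such a file might look like and how a crawler could read it. The tab-separated layout, the field order and the sample records are all assumptions made up for illustration; nothing in this thread points to an existing standard.

```python
import csv
import io

# Hypothetical export: one database record per line, "URI<TAB>text".
SAMPLE_INDEX = (
    "http://example.com/news?id=17\tCity council approves new library\n"
    "http://example.com/patents?id=4021\tImproved mousetrap with spring latch\n"
)


def read_site_index(stream):
    """Yield (uri, text) pairs from a tab-separated site index file."""
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) != 2:
            continue  # skip malformed lines rather than guess at them
        uri, text = row
        yield uri, text


records = list(read_site_index(io.StringIO(SAMPLE_INDEX)))
for uri, text in records:
    print(uri, "->", text)
```

The site owner decides which records go into the file, which is how the postcode tables stay out and the news and patents get in.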
Google's algorithm is simpler than the one described in Scientific American. The Clever project marks certain pages (authorities) as having _content_ and other pages (hubs) as having links to good pages. Content doesn't necessarily have links to good pages, and good pages don't necessarily have content. Google treats everything the same, so in theory it's not as good. Still, since the IBM folk don't have anything available for us to try, it's hard to compare.
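For anyone who hasn't seen the hubs-and-authorities idea, here is a minimal sketch of the iteration as it is usually described (Kleinberg's HITS). It is a toy, not IBM's Clever code; the four-page link graph and the function name are invented.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) scores for a dict of page -> pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs point at it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # A page is a good hub if it points at good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalise so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Invented graph: two link pages pointing at two content pages.
toy_links = {
    "links1": ["contentA", "contentB"],
    "links2": ["contentA"],
    "contentA": [],
    "contentB": [],
}
hub_scores, auth_scores = hits(toy_links)
print("best hub:", max(hub_scores, key=hub_scores.get))
print("best authority:", max(auth_scores, key=auth_scores.get))
```

Each page gets two scores, which is the difference from an approach that treats every page the same way and keeps a single score per page.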
heh, sorry.-
had to put my 2 cents in. google rules.
-----------------------------------------
Reveal your Source, Unleash the Power. (tm)
The main limits to creating an index of the entire web are:
1) disk space. You need to hold an extract of each page.
2) bandwidth. You need approx 100Mbits/sec IO to the outside world to keep up with your disk (10Mbytes/sec write speed to a single disk ~= 100Mbits/sec). You also want to crawl fast enough that you are hitting new pages within a few days of finding the URL.
3) memory and disk. You'll need to do a sort to create a usable index. A few gigabytes of RAM and tens of gigs of disk for the sort.
CPU is not an issue. You can create an index of the web as large as any search engine's with a moderately beefy pentium unix box.
This is all the back-end stuff. To create a search engine, you will want lots of RAM so you can keep your index in memory (this could be a single machine with a huge amount of RAM, or a cluster of machines that each maintain a portion of the index) and a fair amount of CPU to parse the queries quickly.
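To put rough numbers on the bandwidth point, here is a back-of-the-envelope calculation. The 800 million page count is the study's estimate and the 100 Mbits/sec link is the figure from point 2; the 10 KB average page size is purely an assumption for illustration.

```python
PAGES = 800_000_000          # page estimate quoted from the study
AVG_PAGE_BYTES = 10_000      # assumed average page size, not a measured value
LINK_BITS_PER_SEC = 100e6    # the ~100 Mbits/sec link from point 2


def full_crawl_days(pages, avg_bytes, bits_per_sec):
    """Days needed to fetch every page once over a fully saturated link."""
    total_bits = pages * avg_bytes * 8
    return total_bits / bits_per_sec / 86_400


days = full_crawl_days(PAGES, AVG_PAGE_BYTES, LINK_BITS_PER_SEC)
print(f"one full pass takes about {days:.1f} days")
print(f"raw page data: about {PAGES * AVG_PAGE_BYTES / 1e12:.0f} TB")
```

With those assumptions one complete pass is about a week of saturated transfer, so the link rather than the CPU sets how fresh the index can be.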
i don't know about that. yahoo seems to be leaning more towards shopping and less towards information. besides the fact that yahoo does a lousy job of cleaning links. a list of results generated from a search is a little like playing minesweeper for dead links which renders it pretty useless when the listing of results is limited.
i'm not really that excited but, hey pal, that's life in the breakdown lane.
98 Results
99 IEEE Paper
So use the text-only version of Hotbot. The solution is right in front of your face.
--
Don't like it? Respond with words, not karma.
Yes, humans do do it better! I use NetMechanic's link checker to keep my links pages up to date - the only problem is, it seems to cache the pages somehow - links that have been removed from the page physically still show up in the report.
I should also mention the Mining Co, now About.com. I had given up hope on looking for decent 3D graphics sites, and to my amazement, I found a whole section devoted to it and VRML there!
When I put up my first pages, I submitted them to the search engines, waited a few days, searched on my name and got pages and pages of junk. I did a new page, on people with the same name as me, figuring that would get me into the running. Nothing.
So where are my hits coming from? Well, go to MetaCrawler and search for scuba, pictures, women. That gets you pictures of me with various celebrities (none underwater) along with a mix of dive sites, scuba porn sites and the charming pages of www.whitesonly.org.
What I'm listening to now on Pandora...
It's OK that only 16% of the web is summarized by search engines. The other 84% is dedicated to sex sites anyway...and we all have those bookmarked by now...
It's not _quite_ as it seems. Read about moderation; this explains how scores are awarded. The moderator has no control over what word (Informative, Insightful) is used, in the literal sense anyway. Look, just read the pages about moderation; apparently these are a little out of date, though.
:)
...Student, Artist, Techie - Geek *
To become a moderator, you need to be a user. I really don't understand why regular readers aren't users - IMHO of course
Hopefully, this piece was "informative".
Mong.
* Paul Madley
*...Slacker, Artist, Techie - Geek *
Remember: Nothing is Cool.
Yes, I want all the web pages, so if I'm trying to track down my friend JimBob I can find him.
I also want them to ignore meta tags, or any text in a tag - or at least have that be one of the search options, which'll cut down on an 31337 P0rn site popping up on EVERY search.
I agree with the sidebar problem. The other problem is half the stuff they seem to have indexed has moved on by the time I search and a lot of "That member's page can't be found" or just 404 errors pop up.
How is Yahoo! (an index) even able to compete anymore? Goes to show you what good a little name-recognition can get you...I would bet they are at less-than 1% coverage now!
I think that a 100% human-entered index is still handy. If only they could somehow quadruple the number of monkeys on typewriters we might really have something: http://dmoz.org/
-AP
Interestingly I find hotbot to be the best search engine in this respect as it has a very useful advanced search option ... "Find color with word stemming located anywhere created at any time but do not include pages containing x,y,z,a,b,c" ... I do a basic search, get a few dozen thousand matches, look over what pages I'm getting that I don't want, hit back and add those keywords to my "don't match" list ... eventually I get a few hundred websites that are highly related. Then I bookmark the search results (as they're dynamic).
- Michael T. Babcock (Yes, I blog)
To make what I'm thinking of possible, you'd need to have a standard indexing format. I'm sure Microsoft has one we can use, as long as half the links point back to them :)
.edu's and ibm.com's.
Anyway, Altavista, Yahoo, Infoseek, etc... could make deals with the big ISP's/web host services such as Mindspring, Netcom, Earthlink, Geocities, Tripod, etc... Those sites would then index their own sites, which would save your spider/crawler a lot of time.
The indexes could then be merged back at Altavista, Infoseek, etc... Or the search of those sites could hit all the distributed indexes.
There is still the issue of other sites not located on these big ISP's, like
1.) Get a massive machine (we're talking massive, beefy, huge, makes any mere mortal shake with fear)
2.) Grab ahold of a gigantic dataline (OC-192 anyone?)
3.) Set up an engine to visit every IP conceivable, then check each site for every directory conceivable... And every filename conceivable..
4.) Brace for lawsuits
5.) Throw more hardware at it
6.) run SETI@home (in its spare time)
7.) a month later, all indexed, sued up the wazoo, time to start *all* over again
I disagree and hold myself in contempt, what blasphemy!
I think a distributed project would be great, assuming anonymity, and given the fact that most people's outgoing pipeline is fairly unused. Browsers could simply toss the current page and META summary to a search server, which would check whether that page is indexed and update that information. If the page isn't indexed, it can do its spidering on it ... of course, a majority of visited pages would probably be indexed, but the work completed would increase exponentially (especially personal sites).
For an excellent example that is almost there, see the Open Directory.
- Michael T. Babcock (Yes, I blog)
Once again I couldn't find a Sherlock plugin, although they do have nice comment tags for parsing the output. If anyone knows of an official one (unlike Google, which just has three unofficial ones and their recommended one looks out of date to me), let me know. Until then, I have put one up on our Sherlock page. Enjoy.
You will not drink with us, but you would taste our steel? - Walter Matthau, The Pirates
I have been a moderator at times.
An individual moderator can only nudge the score up or down by 1 point. The adjective displayed is the one selected by the last moderator to grade a message.
You don't have to read the page contents to see if it has changed; you can send a HEAD request to the server, and it will just send the headers, including the timestamp. You still have to ask for each file's headers separately.
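A minimal sketch of that check using only Python's standard library. The function names, the sample header value and the index date are invented for illustration, and a real spider would also want to handle servers that send no Last-Modified header or a malformed one.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def build_head_request(url):
    """Build a HEAD request; the server replies with headers only, no body."""
    return urllib.request.Request(url, method="HEAD")


def changed_since(last_modified_header, indexed_at):
    """True if the Last-Modified header is newer than our indexed copy.

    A missing header is treated as "changed", since we cannot tell.
    """
    if not last_modified_header:
        return True
    return parsedate_to_datetime(last_modified_header) > indexed_at


def needs_refetch(url, indexed_at, timeout=10):
    """Send the HEAD request and compare timestamps (needs network access)."""
    with urllib.request.urlopen(build_head_request(url), timeout=timeout) as resp:
        return changed_since(resp.headers.get("Last-Modified"), indexed_at)


# Offline check with an invented header value and index date.
indexed = datetime(1999, 7, 1, tzinfo=timezone.utc)
print(changed_since("Wed, 07 Jul 1999 18:18:38 GMT", indexed))
```

It is still one round trip per file, but the round trip carries a few hundred bytes of headers instead of the whole page.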
One sixth is a staggering 16 percent.
Reminds me of a joke, but I can't remember the specifics. Something like "Fully 33 percent of our foos are bar, but only one third of their foos are bar."
Pretend there is some witty statement here.
Tangentially related is this short preprint on "the diameter of the WWW". Talks about how many average hops it is from any given web page to any other, and how this might affect search engines.
This was sent in last nite with a link to the Nature site. Denied once again
Lawrence Bird
Hmmm...what is the limiting factor in indexing pages? Is it bandwidth? Or CPU? Or just the fact that so many go up and down so fast? If it's bandwidth or CPU, would a distributed project work??? I know you can get dumb Yahoo pager and Altavista Search and all that junk...what if they had "Download: Altavista Index Agent/Spider" or something, where people could use their spare cycles/bandwidth to index...would it work? Does that even make sense? Like SETI, the server could give them some chunk of "namespace" to index and the spider/agents could go at it.
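One way the server could hand out chunks of "namespace" is to hash each hostname to a volunteer. This is just a sketch of the partitioning step, with invented URLs and function names; it says nothing about trust, verification or the bandwidth of merging the results, which are the harder parts.

```python
import hashlib
from urllib.parse import urlsplit


def worker_for(url, num_workers):
    """Map a URL to a worker by hashing its hostname.

    Hashing the host (not the full URL) keeps every page of a site on one
    volunteer, so that volunteer can rate-limit requests to the site.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers):
    """Split a list of URLs into one work queue per volunteer."""
    queues = [[] for _ in range(num_workers)]
    for url in urls:
        queues[worker_for(url, num_workers)].append(url)
    return queues


# Invented URLs, four hypothetical volunteers.
todo = [
    "http://example.com/a.html",
    "http://example.com/b.html",
    "http://example.org/index.html",
    "http://example.net/faq.html",
]
for n, queue in enumerate(partition(todo, 4)):
    print(n, queue)
```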
It's 10 PM. Do you know if you're un-American?
This is purely anecdotal evidence, I know, but one of my more common searches turned up no results on their page, as opposed to several relevant pages through Altavista.
Guess I'm not going to be switching anytime soon.
Check out Orientation for international markets. Big and getting bigger by leaps and bounds. The next Yahoo...
What if someone were to design a 'neural net'-based search engine? Initially it could be much like any other 'dumb' search engine. Instead of linking directly to the target sites, have the links go back to a redirector on the search engine's server, enabling the search engine to get feedback on what pages the user actually utilizes, out of the multitude returned. For instance, when I do a search on "ADSL and linux", it would learn that I only clicked on the links that actually had relevant material, and ignored the multitude of XXX/porn sites that put large blocks of common keywords on their pages... Over time and use, the engine could learn what sort of information is really relevant to "ADSL and linux", and what sort is really not, and rank them accordingly.
Well, perhaps the computing power for something like this isn't available yet. But it'd be nice...
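The feedback loop itself doesn't need much computing power if you start with plain click counting instead of a neural net. The sketch below swaps in that simpler technique; the class, the method names and the URLs are all invented, and the redirector is assumed to call record_click before forwarding the user.

```python
from collections import defaultdict


class ClickFeedbackRanker:
    """Re-rank results by how often users clicked them for the same query."""

    def __init__(self):
        # clicks[query][url] -> number of recorded click-throughs
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, url):
        """Called by the (hypothetical) redirector before forwarding the user."""
        self.clicks[query][url] += 1

    def rerank(self, query, results):
        """Sort results by click count; ties keep the engine's original order."""
        counts = self.clicks[query]
        return sorted(results, key=lambda url: -counts[url])


ranker = ClickFeedbackRanker()
raw = [
    "http://spam.example/xxx",
    "http://example.org/adsl-howto",
    "http://example.com/adsl-faq",
]
ranker.record_click("ADSL and linux", "http://example.org/adsl-howto")
ranker.record_click("ADSL and linux", "http://example.org/adsl-howto")
ranker.record_click("ADSL and linux", "http://example.com/adsl-faq")
reranked = ranker.rerank("ADSL and linux", raw)
print(reranked)
```

Pages nobody clicks sink to the bottom for that query. The obvious weakness is that pages which never get shown never get clicked, so new pages would need some other way in.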
The percentage of indexed web sites is small, but the amount of data that represents is pretty staggering. Unlike an encyclopedia or other reference book, which can cross-reference between a concept in the index and a number of appearances of the concept in the body of the text, a web search engine has a much harder job (as do people trying to use the search engine). For an encyclopedia, some person does the job of indexing things with an understanding of context, so for instance 'green' in the index would be referenced to entries on 'colours' and 'the spectrum' but not 'grass'. The web search engine blindly returns every instance of the word 'green' with no regard to context. So if the person was actually wondering how to make 'green' with his box of crayolas (since his sister ate every shade of green in his box of 64) he'd either have to wade through each site till he found what he was after or choose a better search term.
Machines aren't very good at being intelligent in this manner, so suppose a new search engine was created. You type in a search term and it comes back with a list of matching pages. You again wade through the list, but now you can also award a number of relevance points to the ones that matched closest. This would work well for a while, but would break down in the long run: as the web continues to expand, new pages will be unranked, so they would not appear in the ranked lists of potential hits (at least for popular search terms) and so won't be ranked.
What might work better would be a search by reduction. Type in some overgeneralized search term and the text on the page is distilled down to a brief outline. There are already packages which can create fairly decent summaries of documents. You click on a button that indicates "I like this, find me more like it" which means that there's something you like about the summary so it generates a number of new more specific search terms from the summary and comes up with a new list.
The layout is pretty nice, or rather clean, but the search was slightly slow. Who wants to catalog more of the Web when it means that much more noise to wade through? I also think the 1/6 to 16% thing is hilarious.
Grass Roots Info Ronin
I read an article every few months that makes it sound like cyber-armageddon that the web is growing too fast for search engines to keep up. My response is, "so what"? I think that search engines are great for finding an entry point into a topic, not for finding EVERYTHING about a topic. If I wanted to find pages dedicated to the greatest rap artist of all time, Vanilla Ice, I could probably find one site through the search engine and then find a whole bunch of other fan pages linked through that page.
Maybe that's just me...but I really do believe that a person's time on the web is so much more productive when they actually learn to properly use a search engine.
Yahoo's directory has only a fraction of the fraction of the pages that the search engines catalog, but it's still the best way to find out about most topics. Quality is the issue - I don't want 15,000 search results on a search query - I only want the five best ones, and I prefer them to be structured hierarchically if possible.
One problem I have with engines is sites with changing sidebars... when the sidebars mention one of my keywords because it was a recent article when the crawler went by, but the article has nothing to do with what I want...
-- Erich
Slashdot reader since 1997
I was looking for a Director page...and came up with a page apologizing to people who came from Yahoo. It read something like "This link has been dead since Dec. 16, 1997, if you're wondering how long Yahoo keeps old URLs".
Yeah, when I need something in-depth, Google, Hotbot and Ask Jeeves do the job pretty good!
I've always had trouble with search engines. I've registered my pages with various services and basically it hasn't helped. Most people find my pages through my sigs. Or off other similar pages which have links to my page.
I did a search on Northern Light and it didn't find my pages but did find pages with links to my page. I also used the power search but that failed also.
I probably shouldn't be too upset as I get about 1000 hits per month. Since it is a specialized page I don't think I'll get any more hits. But it does tick me off that if I want people to find my page I have to pay for it. I thought a search engine's reputation was supposed to garner it more attention and therefore more advertisement dollars. Now to increase their reps I have to pay them.
--
Linux Home Automation - Neil Cherry - ncherry@home.net
http://members.home.net/ncherry (Text only)
http://meltingpot.fortunecity.com/lightsey/52 (Graphics)
Neil Cherry - Linux Smart Homes For Dummies
(Note to Rob: I submitted this same story to /. yesterday afternoon, with links and proper attribution to NECRI and Nature, but I guess accuracy doesn't count as much as timing.)
I can see the fnords!
To make what I'm thinking of possible, you'd need to have a standard indexing format. I'm sure Microsoft has one we can use, as long as half the links point back to them :)
Isn't that part of what the META tag is for? Or the LINK tag?
Looking over my copy of the HTML 4.0 specification, there's not a specified list of META attributes, but maybe the following should be considered standard for search engines:
The following LINK attributes should be set also:
That way, a search result could take the format of:
The best thing about the LINK attributes is that at least one browser, iCab, provides a set of buttons for several LINK attributes -- start, end, next, prev, home, search, help, made, etc. Too bad it's MacOS only; maybe someone could create a similar set of buttons for Mozilla?
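For what it's worth, pulling META and LINK information out of a page is easy for a spider. A minimal sketch using Python's standard html.parser module; the sample page and the particular names (description, keywords, next, help) are just common conventions picked for illustration, not a proposed standard.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collect META name/content pairs and LINK rel/href pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.links = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "link" and attrs.get("rel") and attrs.get("href"):
            self.links[attrs["rel"].lower()] = attrs["href"]


# Invented page; the attribute values are examples only.
PAGE = """<html><head>
<title>Example page</title>
<meta name="description" content="A short summary a search engine could show">
<meta name="keywords" content="example, search, indexing">
<link rel="next" href="page2.html">
<link rel="help" href="help.html">
</head><body></body></html>"""

parser = HeadInfoParser()
parser.feed(PAGE)
print(parser.meta)
print(parser.links)
```

The catch is the one raised elsewhere in this discussion: the engine has to decide how far to trust what the page author put in those tags.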
Anyway, Altavista, Yahoo, Infoseek, etc... could make deals with the big ISP's/web host services such as Mindspring, Netcom, Earthlink, Geocities, Tripod, etc... Those sites would then index their own sites, which would save your spider/crawler a lot of time.
Now there's a thought! Then meta-search engines like Metacrawler could have more meaningful returns.
Am I the only one that thinks a search engine should be a commodity? I don't care which search engine I use, so long as I get the best results. (Keeping paid advertisements out of the search results would be a benefit, too...)
There is still the issue of other sites not located on these big ISP's, like .edu's and ibm.com's.
Maybe someone should consider an EduSearch search engine, indexing only sites under the .edu domain? (Especially if its index can be used by a larger metasearch engine...)
As for ibm.com and the like, large corporate web sites should have some form of search facility; an Alertbox column from UseIT.com discussing corporate intranets says that having some form of search facility should be considered essential -- I don't see why the same shouldn't be true for their Web shingle as well.
Jay (=
But I agree that Yahoo! can't compete anymore. If you want your site to be indexed with it you have 2 options:
- You add it for free and it shows up in a month or six :(
- You pay (this sucks) for it and it shows up very fast
However, I still think the search engines are the biggest solution for somebody finding your site; second are banner ads.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Belgium HyperBanner
http://belgium.hyperbanner.net
Linux hosting for $2.50/mo
In another message on this topic, I commented that search engines should be a commodity. It shouldn't matter what engine you use, so long as you get the most, best results.
Could we turn web search engines into a distributed hierarchy like DNS? I don't expect my ISP's DNS server to have every IP address on the planet, but I expect it to be able to find the ones I need.
Have each of the major ISPs (especially those that give their members web space!), free web page providers, companies that do virtual domain hosting, and large corporate/education/organization sites maintain their own index of web pages.
There could be generic, "top level" engines like Yahoo and Altavista (which could choose to exclude indexes of porn sites) but also more focused engines -- educational sites, business sites, scientific and technical sites; hell, why not a porn engine?
Would this work?
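A toy version of the idea, to make the question concrete. Unlike DNS, which follows a single delegation path, this sketch simply fans the query out to every child index and merges the answers; the class names, ISPs and pages are all invented.

```python
class SiteIndex:
    """A leaf index kept by one ISP or site: word -> set of URLs."""

    def __init__(self, pages):
        self.words = {}
        for url, text in pages.items():
            for word in text.lower().split():
                self.words.setdefault(word, set()).add(url)

    def search(self, word):
        return set(self.words.get(word.lower(), ()))


class TopLevelEngine:
    """Holds no pages itself; forwards the query to each delegated index."""

    def __init__(self, children):
        self.children = children

    def search(self, word):
        hits = set()
        for child in self.children:
            hits |= child.search(word)
        return sorted(hits)


# Invented ISPs and pages.
isp_one = SiteIndex({"http://isp1.example/~bob": "scuba diving pictures"})
isp_two = SiteIndex({"http://isp2.example/~amy": "linux scuba logbook"})
engine = TopLevelEngine([isp_one, isp_two])
print(engine.search("scuba"))
```

A focused engine would be the same thing with a different list of children. Ranking results that come from many independent indexes is the part this sketch leaves out.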
Jay (=
I understand fairly well what the current conventions are for HTML indexing (meaning things that are usually included in the HTML so that your page can be intelligently indexed.) The problem is that for a significant number of personal (as well as professional) pages, they aren't indexed -- no comments. Perhaps it would encourage internal indexing to modify existing browsers (in future versions) so that the indexing information is displayed by default in the browser window. Then, when people create pages, they might decide that they don't want the receiver's index window to be empty when they view the page, and index their HTML.
If people would just delete their fscking siht outdated web pages after they have ended their (and their webpages') useful lives there would be several orders of magnitude fewer pages that one would have to search thru. The result would be faster and more relevant searches.
But no!!! People are a--holes (YOU KNOW WHO YOU ARE) and don't delete their crap. Y? Y should they. There's no incentive to!!!! It's the fscking tragedy of the commons all over again.
Why do we need to search the whole Web?
Are we afraid that someone in New Guinea has the answer to our life's problems?
I don't see why searching the whole web is any more relevant an activity than reading every book that has been written. Some will see a flaw in this: they'll say, "Reading the web and searching the web aren't the same thing - I want to know my choices." Fine, I say - you don't know all your choices when it comes to books, either.
Then there's the "quality" argument: "I don't want all of the references to 'X' - I want only the 'good' references to 'X'." On the Internet not only does no one know if you're a dog, they don't know if you're a dog with bad taste! I think this argument needs to be changed; I like the Social Sciences Index idea, personally: the number of references to an article makes it "important". That is, the greater the number of times that an article is referred to by another article, even if the reference is only to refute the original, the higher the ranking of the article. We already see this in action - they're called portals. They are the hot spots of the web...
--
The Norton Anthology of English Literature, 4th Ed., Vol 2
Just to set the record straight, the poster's assertion that the article is a Northern Light plug is completely baseless. The authors (Lawrence and Giles) work at The NEC Research Institute (where I work), which has no connection to Northern Light. In fact, they did an earlier and less comprehensive study a year ago that showed Hotbot and Altavista had the greatest coverage at that time.
The Computational Beauty of Nature
Posted by Jeff Martin:
These cached engines need to update on a daily basis if they intend to remain functional.
As websites often update, they change the pages and the names of pages to fit a new look or feel. The searches I used found pages that did not exist anymore, nor have they for a few months now.
Oh well maybe people will "back up" in the URL when they visit...
I switched to NL quite some time ago, and it's comprehensive enough for me (dunno about your "common searches"). What I like best is
1. Breaks down the search results by their type and location (mini-directory)
2. Doesn't annoy you with stupid plugs (ahem, "recommendations").
Apparently they also have a fair amount of non-web (presumably OCR scanned) material, but I've never tried purchasing it.
I run and manage a web site in my *COUGH* spare time, whose purpose is to categorize other sites with Middle Eastern dance (better known as belly dance) content.
Having started up a couple of years back, I can say I've seen some of what this article is talking about. More and more, I see sites listed and mentioned by word of mouth that I had not found via any of the major search engines. Even with date restrictions, a search of the majors (Altavista and HotBot in my case) can eat up days, literally.
The reviews I write tend to note this fact -- although I have a few "big" Middle Eastern Dance sites, my focus and goal is noting all the little sites that are being left behind. Most of them still come from the search engines, but it's just too much. Even with 100 workers, I'd still not get them all, could not.
I can't say I know of a realistic way of overcoming this. What would be good is to have a strong effort to have all the major ISP's offer an easy way to register with all the search engines any pages their users create. It's easy to create a web site, but so many people get left behind in actually promoting it, and when they do, they do so very poorly. (For the moment, let's ignore those who just don't do HTML well.) Without the promotion, it's just for a few families and friends, unless the content is really interesting, and is promptly drowned out by the chaos of the web.
Also, I think projects like Google and the push towards XML are imperative to the health of the web. We need to move away from the free-form nature of _everything_ on the WWW, and towards some more structure, more focus. People simply need to be able to find stuff, and they cannot right now. I'm going to do my part -- my site is being converted to XML for the far future, and, for the near future, the perl scripts that build it have already been rewritten to be moved to a server with CGI, so that people can search my site, specifically.
Just my two cents.
One search engine not complete enough for you? Search a bunch of them with a meta-search engine. I like SavvySearch.
If a page is truly useful, likely someone is accessing it. A distributed program to harvest those pages could be quite useful. You could choose when to allow it to examine your browsing history, and when to pull back the curtain, as it were. Of course, you'd have to make privacy guarantees. You'd also want to make the source code visible to the world. If a page you were browsing was unknown to the system, then spidering from it would probably be quite productive, so the program could harvest your spare CPU cycles to spider from any pages that you visit that the search engine does not yet know about. Everyone would have an incentive to participate to make sure that the pages they want to see indexed are actually indexed.
To avoid the Netscape "What's Related?" fiasco, the authors should allow the end user editorial control, and provide for some discretion over and anonymizing of the results submission.
Where I work "they" are cracking down because 40% of all sick days are taken on Mondays or Fridays.
Dilbert used that very item, and that's where I heard it from.
Pretend there is some witty statement here.
Outdated information != useless information
If I have a thirty-year-old piece of equipment, I still want to be able to find a thirty-year-old document to describe it in detail. Even better -- thirty years of accumulated information describing it in detail.
for Selling Fantasy Real-Estate.
If you scroll down the page, you'll find the story about Kevin Roseler, an employee at Origin Systems, who was dismissed after he abused his privileges at Ultima Online to generate castles/gold/etc. and sell it for $7000 on Ebay. What he did wasn't technically illegal, but it was an abuse of power and whatnot.
"If one is really a superior person, the fact is likely to leak out without too much assistance" -- John Andrew Holmes
The link to the Boston Globe is dead. Check out an article from Pigdog Journal about the exact same topic. It also has a link to a BBC article about it.
Web Search Engines Are Falling Down on the Job!
1999-07-07 18:18:38
Bien amicalement, Hubert Orlick
Whoops, little trouble typing there...
Anyway, just go to the Pigdog front page (www.pigdog.org). The article is right there. This slashdot message board doesn't like the long URL for the article.
Bien amicalement, Hubert Orlick
That's what google (www.google.com) does. It's pretty good when you use good search terms. It still sucks when your search turns up a bunch of irrelevant links and they also end up in the indexing process along with the relevant ones.
...
In the end, the only solution is to structure the data better than HTML allows. XML here we come
Frankly the fact that only 16% of sites are indexed is something of a relief until the search engines can get their indexing better sorted out. Google does the best job of prioritising away obviously irrelevant results, but it still gets it wrong a depressing amount of the time.
It seems to me that the only way we're ever going to get away from ever-deteriorating keyword searches and ever more corruptible and less competent cataloging sites is to switch over to better (i.e. more logical, more meaningful) forms of markup than HTML provides. XML, anyone?
Ah yes, the guys behind taz.northernlight.com, the spider that keeps coming back for pages that have been dead forever. Yum yum, 404s galore!
/sbin/ipchains -l -I input -s 208.219.77.9 -d ! 0.0.0.0 -j REJECT
Ah yeah. Much better.