Is The Web Becoming Unsearchable?
wayne writes: "CNN is running a story on web search engines and their inability to keep up with the growth of the web. Web directories such as Yahoo! and the Open Directory Project can take months to add a site and the queue of unreviewed sites is growing. Most search engines are even further behind and are filled with off-topic and dead pages. The trend is toward pay for listing. Will the free, searchable web fade away?" The article gets beyond the "Wowie, so much content, engines can't keep up" typical blather and addesses some of the reasons search engines have a hard time keeping up.
of the entire web has degraded so much that it's not the search engines that are full of useless garbage... it's the web itself and these engines simply have indexed what exists "out there". Garbage In, Garbage Out -- still holds true after all these years.
You can say that again. I submitted a site last November and it @!##@% still isn't there in its directory! What's the deal?
Yowie.
----
Every year during my review, I just pray the words "slashdot.org" aren't mentioned.
Yes.
Next question?
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
I wouldn't call all of those services useless...there's an interesting one by Verizon called Call Intercept. If the person's number is unavailable or anonymous on Caller ID, they are sent to a message asking them to identify themselves. If they don't, then they don't get through. Great for those "please stay on the line for an important message..." phone calls that telemarketers & bill collectors love :).
This is probably something like what you're looking for, though the word "Aida" can't be found at all with the others.
Apparently this is an attempt at foiling script-based "ping, and if down, submit as dead" type attacks on other people's entries.
I think a more reasonable way of handling this would be to, e.g., check the site for 2 days in 12-hour increments (to allow for, e.g., eBay's Sunday maintenance windows and such). If there is no positive response during that period, then drop the link.
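A minimal sketch of that schedule in Python, for anyone who wants to see it spelled out. The `probe(url, when)` callable is an assumption (it stands in for whatever request the engine would actually make and returns True on a positive response), and the function names are made up for illustration, not any engine's real mechanism.

```python
from datetime import datetime, timedelta
from typing import Callable, List


def probe_times(reported_at: datetime, days: int = 2, step_hours: int = 12) -> List[datetime]:
    """Times at which a reported link gets re-checked: every step_hours for `days` days."""
    count = days * 24 // step_hours
    # Include both ends of the window, so 2 days at 12 hours gives 5 probes.
    return [reported_at + timedelta(hours=step_hours * i) for i in range(count + 1)]


def should_drop(url: str, times: List[datetime], probe: Callable[[str, datetime], bool]) -> bool:
    """Drop the link only if no scheduled probe got a positive response."""
    return not any(probe(url, when) for when in times)
```

With this shape a single outage (a maintenance window, say) is not enough to get a competitor's entry removed; the site has to stay unreachable for the whole window.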
In any case, I was only using that mechanism as an example of a saner way than having 100 votes automatically mark a site as dead, and of how this mechanism could be linked to a browser button (which could work with Altavista if they used my method instead of requiring a multi-submit + enter-text-from-a-GIF reporting process). I don't personally use Altavista's search engine or condone it.
Sounds like a good title for a trivial patent, even..
Method of verifying URL availability for a database of URLs
The correct way to handle this situation is the way the search engines already do it: when a link is reported dead, they just make a request to the link. If it generates an HTTP 404 response code, or the site is down, it's marked as actually dead.
I'm not convinced this is always a good idea, though - I've worked for a guy who would battle for top positioning on the search engine with a few competitors. When either of them noticed that the other's site was down, they'd submit the other site as a dead link. I like Google's Cached page mechanism, which allows you to view sites that are currently unreachable. Great for when you need docs from a site which happens to be down at the time.
This is actually trivial to implement, as shown in Google's toolbar page: http://www.google.com/options/toolbar.html Of course, you'd need to use this technique with a search engine that takes dead link submissions. E.g., Altavista and its "Add or Remove a Page" link here: http://web.altavista.com/cgi-bin/query?pg=addurl
Thanks for the laugh...
Black holes occur when God divides by zero.
(Well, not really, but it's damn good...)
It's about the only Window$ app I use anymore.
It's kinda gone downhill after the parent company was bought out by ZDNet but it still really works pretty well.
It meta-searches about a dozen of the major search sites simultaneously.
I use it a lot to search for the meaning of obscure error messages and error codes and stuff like that.
Used to use it a lot for searching out what cryptic .dll filenames were related to...
t_t_b
--
I think not; therefore I ain't®
I'm on PJ's "enemies" list! Are you?
I've also been kicked from first to sixth on a search for "book reviews" :-(.
Danny.
I have written over 900 book reviews
Really? Google found it for me on about 2,510 pages.
--
Fuck the system? Nah, you might catch something.
The piece tries to make a good point about dynamic content that is generated by user input not being indexable, which is true.
Search engines can't type things into forms and get results in an intelligent way.
It's just a shame that they get confused in their expressions.
Nice piece generally though. 550 billion web pages is an awful lot..
I can usually find what I'm looking for using either Google or AltaVista. The heuristics used for Google are the best I've seen in any search engine. I can usually find stuff that is anywhere from several days to several years old.
Come on... these are the same people that were claiming that we would all run out of IP's by now. They don't seem to realize that everything adapts.
Umm... sure, it would be a great idea if it would work. But the whole proposal depends on the directory structure being harder to spam than keywords. I don't see any reason why it would be any harder to put "teens->education->health" in the directory structure for hardcoreteensex.com than it would be with current keyword-based schemes. I'd love to hear why you think that this would be different than what already goes on... but I'm not holding my breath about being convinced.
IAAL,BIANLY
Darn. We thought we could get that one past you...
BTW, AFAIK Google doesn't change rankings for money, it adds those little side-links for money. I do hope they stop adding gingerbread now lest the site end up as cluttered and useless as Deja did.
Got time? Spend some of it coding or testing
My friend is developing a system and tools for creating a p2p search network. This seems like one way to interconnect searches and information as it becomes more interspersed throughout the known universe. Have a look at Neurogrid
When shit hits the fan get some of these https://youtu.be/pY-GncsZ-UE
I'd like to see some specialized search engines, nothing too complicated. What I've been wanting for some time is a search engine of just .edu.
There are lots of really informative .edu sites out there, but they don't show up well on search engines, and many are buried levels deep, e.g. college.edu/~professor/fall2000/class/topic/lotsof info.html
(btw, if anyone finds a .edu engine, PLEASE let me know!)
========= Put my nick in front of the "_". I love my computer
I love my computer -- You make me feel alright (Bad Religion)
Why should Google replace the yellow pages ?
Can't you just try www.pagesjaunes.fr like any sane person would (hint you'll get 1510 answers, all right on spot).
(duh)
May contain traces of nut.
Made from the freshest electrons.
It's becoming obvious that scan-type engines are having increasing difficulty with the amount of data on the internet. The bandwidth required by search engines will increase exponentially, and at some point it *will* become unworkable.
The other alternative is to have webmasters manage the directory themselves. This is problematic because webmasters have a strong incentive to list their website in as many places as possible. Some pr0n kings would do every single listing if they could.
So you take away the incentive for the listers to list everywhere, or give them a strong enough reason to list only in the correct places. Since the pr0n kings will never get it straight, you'd be better off using "trusted" maintainers. With the wonderful world of PKI cryptography, verifying submissions could be completely automated and your staff of submitters could be *very* large.
So you make it possible for anyone to become a submitter. It can't be easy enough for the pr0n masters to get a new ID every day, but maybe once every three to six months (say).
Then if enough complaints (*authenticated* complaints) are lodged, some sort of distributed arbitration process could decide to revoke a submitter's status - and then remove all of that submitter's submissions.
The distributed arbitration process could take the form of a jury of twelve randomly selected submitters (or submitters with a special arbitration rating?). Basically, people could be polled at random, and anyone willing to be on the jury could examine the facts and make a vote. Perhaps a discussion group could be set up for deliberation.
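Roughly, the jury step could look like the Python sketch below. Everything beyond "twelve randomly selected submitters" is an assumption made for illustration: the function names are hypothetical, the accused submitter is excluded from the pool, and a strict majority of the votes cast is what revokes (the post does not say what threshold to use).

```python
import random
from typing import Dict, List, Optional

JURY_SIZE = 12  # "a jury of twelve randomly selected submitters"


def pick_jury(submitters: List[str], accused: str,
              rng: Optional[random.Random] = None) -> List[str]:
    """Pick JURY_SIZE distinct submitters at random, never the accused."""
    rng = rng or random.Random()
    pool = [s for s in submitters if s != accused]
    if len(pool) < JURY_SIZE:
        raise ValueError("not enough submitters to seat a jury")
    return rng.sample(pool, JURY_SIZE)


def revoke(votes: Dict[str, bool]) -> bool:
    """Assumed rule: revoke only on a strict majority of the votes cast."""
    yes = sum(1 for vote in votes.values() if vote)
    return yes * 2 > len(votes)


def remaining_listings(listings: Dict[str, str], revoked: str) -> Dict[str, str]:
    """Remove every listing (url -> submitter) made by a revoked submitter."""
    return {url: who for url, who in listings.items() if who != revoked}
```

The authentication of complaints and votes (the PKI part) is deliberately left out; this only shows the selection and the bookkeeping.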
Hmmm... it would be an interesting example of an online society. Would the system really run itself? If anyone has ideas, email me at tom@alterworld.net (put tom-ok in the subject line or it will bounce).
I hardly ever use search engines anymore. Most of the sites that I find are linked directly off of pages that specialize in what I am looking for, and I find that the content is usually of a higher quality anyhow.
Either that or friends will send me links.
You say you want a revolution....
> Black holes occur when God divides by zero.
I once joked to my friend Steve Pearl that God resets the universe's divide by zero errors. He wittily remarked, "So you're saying when we are thrown, God catches us?"
Why don't they distribute the link verification much the same way SETI@Home does? They could then shoot a micropayment (say a penny or so) to the user for every work unit (say 10,000 or so links) that they verified.
This would primarily be for folks with always on access as it might tend to clog a thin pipe.
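A back-of-the-envelope Python sketch of the work-unit idea, using the numbers from the post (10,000 links per unit, a penny per unit). The function names are hypothetical, and counting a short final batch as a full unit is an assumption.

```python
from typing import Iterator, List, Sequence

LINKS_PER_UNIT = 10_000  # "say 10,000 or so links"
CENTS_PER_UNIT = 1       # "say a penny or so"


def work_units(links: Sequence[str], size: int = LINKS_PER_UNIT) -> Iterator[List[str]]:
    """Split the link database into batches that can be handed out to volunteers."""
    for start in range(0, len(links), size):
        yield list(links[start:start + size])


def payout_cents(completed_units: int, rate: int = CENTS_PER_UNIT) -> int:
    """Micropayment owed for a number of verified work units."""
    return completed_units * rate
```

At these rates, verifying a billion links costs the engine 100,000 units, i.e. about $1,000 in payouts per full pass, before any duplicate checking to catch cheaters.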
I am very small, utmostly microscopic.
Actually something must have gone wrong there, because I was already an editor and they cancelled my account without a note, an email, nothing.
Besides, in all my attempts to reactivate my account or merely contact them I received no answer at all.
Now the categories I created are marked "This category needs an editor"... this is absurd!
Go check the categories I created at http://dmoz.org/Computers/Software/Databases/Relational/ and http://dmoz.org/World/Portugu%eas/Computadores/
--
Leandro Guimarães Faria Corcete Dutra
DBA, SysAdmin
Leandro Guimarães Faria Corcete DUTRA
DA, DBA, SysAdmin, Data Modeller
GNU Project, Debian GNU/Lin
I've experienced both sides of the question. Usually I can find anything I want on Google--- especially if it's technical information, but I've successfully looked up saints, theological arguments, gaming groups, etc. I occasionally supplement this with Citeseer, an excellent resource for research papers.
On the other hand, I was looking for a replacement rack mount kit for a Cisco switch that had been donated to my research group. Google and Altavista were pretty useless, as far as I could tell; I eventually just had to go to ECost and use their search facility to find the part I wanted.
So, I can see how users with different desires could easily develop widely divergent opinions about the utility of web search. Perhaps consumer sites are much less well searched? Perhaps one way that search engines can increase their utility is by making partnerships with online retailers to provide indexing of their product descriptions--- I'd be very happy if Amazon books or ECost electronics started showing up in response to my Google searches.
Well, I can't find anything when I search for addesses ..
BilldaCat
Yes, Google has never failed to return useful and active links. What a great resource!
AltaVista used to be good until they turned into another www.useless-portal-to-everything.com.
-Mike
--- witty signature
Do search engines have bots that go out and search already indexed pages to check for dead links, changed pages, etc.? I would think that the major search engines would have many of these to make sure their data was up to date... if nobody has done this yet, there's my contribution ;-)
Sig missing. Reward.
SEO Copywriter. Just Say ON
...and look how well Gnutella scales.
If you want 99.9% of Internet traffic to be nodes forwarding search requests and results back and forth, that's the way to go.
You are in a maze of twisty little passages, all alike.
ah, but if broadband starts getting into a majority of the households, would there be a need for an offline search capability? i mean, i'm usually connected all the time, so it's never a problem to pop open a browser window and do a quick search. i guess it would depend a little on if people start leaving their pc's on all the time. anybody know of "normal" people who like leaving their machines on? i know my girlfriend likes to..
this is just a placeholder till i send back my real sig from the future.
signatures pending - ansa@kos.to - (dont mail there)
not taken of me.
Seven of the first ten have nothing to do with automo repair. Two of them are iffy at best. My grading of One right and two half-right out of 10 answers is still an "F".
Try to find, using the Google Directory, pictures of Yellowstone National Park, taken by me. No fair using the search function. (However searching the directory for "pictures of yellowstone and scott purl" will result in two misses, and nothing else.)
Yes, it's "Vanity Web Surfing", but if Google indexes my site, why doesn't it automatically categorize it? (whine whine)
So, yes, Google is pretty derned good. But it's still not a directory, and the directory it does have covers, what, 1% of the web? 0.01%?
What the web REALLY needs is a directory. An honest-to-goodness, telephone/yellow pages style directory. This whole nonsense about keyword searching is providing people who just want traffic with a lot of free advertising and listings.
The phone company provides you with one free listing (unlisted is optional), and makes you pay for each extra category (like in the Yellow Pages -- and if you're not from the U.S., please see http://www.bigyellow.com/supertopics for an example) that you want something listed in. Search engines ought to be replaced with something similar.
Yes, I know Yahoo and Dmoz try, but they don't go out and actively index sites, making their use limited, and the number of sites even more limited. If Google were to create a Yahoo/Dmoz style directory, that would help. Better yet, if people were forced to provide either META tags, or some information when they acquired their domain (part of whois?)....
For example, where can I get my oil changed in Paris, France?
It was either following a link or a search result. I don't remember which, or the subject, but I bookmarked the site immediately.
What's wrong with this, then?
Google Search: fix a broken window
"a" is a very common word and was not included in your search.
Searched the web for fix a broken window. Results 1 - 10 of about 189,000. Search took 0.90 seconds.
Category: Recreation > Autos > Makes and Models > Mazda > RX-7
Learn2 Repair a Broken Window
... 2torial #0515: Learn2 Repair a Broken Window. Home Run!!! As we know, windows break ... way, ... the "rabbet" is the notch in the window sash that the glass fits into.
www.learn2.com/05/0515/0515.asp - 28k
Remodel.com Fix-It-Smart: REPLACING BROKEN WINDOW GLASS
... Fix-It-Smart, Home. REPLACING BROKEN WINDOW GLASS Broken window glass can be ... replaced by regular glass or by plastic unbreakable glass.
www.remodel.com/fixit2/REPLACING_BROKEN_WINDOW_GLASS.asp - 15k
ITworld.com - Tweak columns in Explorer and fix a broken ...
... OPINION Tweak columns in Explorer and fix a broken Java patch Plus: Tips on drag-and ... printer: ... He drags the icon from one window to another. To do this in
www.itworld.com/jita/3799Win2kFeat/0,,1_3799.html - 32k
Glass_and_Windows, Topic 108
... I have a broken window, they are old wood windows, ... can anyone help with telling me how to fix it?
www.doityourself.com/archives/Glass_and_Windows_108.htm - 9k
Repair a Broken Window Pane with the iVillage Home How-To ...
... painting. Becoming soft. Remove stubborn window putty with a heat ... Take a shard of ... STREAK-FREE GLASS CLEANSER FIX A LEAKY GUTTER CLEAN ... broken glass with you to
www.ivillage.com/home/howtoguide/repairandrenovate/articles/0,9449,167075_211955,00.html - 71k
Re: Don't fix what isn't broken
... 2000 12:48 pm. In Response To: Don't fix what isn't broken (Terri Zamore). ... the light ... of day in OS X. For instance, window management in OS 9 is at the very
www.maccentral.com/storyforum/forums/_news_0011_23.upgradeguy/?read=10 - 6k
Centre of Criminology News
... HOW MANY CRIMINOLOGISTS DOES IT TAKE TO FIX A BROKEN WINDOW? The following responses ... to this query were provided by faculty, staff and students at the Centre
www.library.utoronto.ca/libraries_crim/centre/crimnews.htm - 35k
LifeMinders Home Sample
... Unsubscribe. Fix It Projects Replace A Broken Window. ... Maintain Your Gutters Now...Or Pay Later. Gardening
www.lifeminders.com/examples/home_minder.html - 13k
Home Upkeep
... Fix a Leaky Faucet How to fix most faucets yourself and save ... money. Repair a Broken Window Fix your own broken windows.
www.frugalliving.about.com/cs/homeupkeep/ - 54k
©2001 Google
This is a real problem, but the fundamental reason it's a problem is one that's well-understood by library scientists: We only have addresses, not content identifiers.
To use a book analogy, the entire web is built on Dewey Decimal addresses (URLs), when what we need is those combined with ISBN numbers (URNs).
I didn't make up the idea of URNs - the concept was first described to me by Peter Deutsch, the inventor of Archie, at Interop sometime in the early 90's, shortly after the web got going. (Back when there were no search engines, and we found out about new web sites by visiting NCSA's What's New page, which for a while, anyway, actually cataloged *every* new web site that appeared, and some of us could claim to have surfed the entire web...)
The idea behind URNs is that they would be a unique identifier for the content. The same content living on different sites would have several URLs, but only a single URN. This is still needed today, but the problems that kept it from being implemented then are even more intractable today: Who hands out URNs? (IANA didn't want to touch that!) How do you handle versioning? What about dynamic content? Who are the librarians?
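To make the "several URLs, one identifier" point concrete, here is a toy Python sketch that uses a hash of the page bytes as a stand-in content identifier. This is a swapped-in technique, not URNs as Deutsch described them: nobody assigns the identifier, and `content_id` and `group_by_content` are hypothetical names. It also answers none of the hard questions above, since any edit or dynamically generated byte yields a different identifier.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def content_id(body: bytes) -> str:
    """Hypothetical content identifier: the SHA-256 digest of the bytes."""
    return "sha256:" + hashlib.sha256(body).hexdigest()


def group_by_content(pages: Iterable[Tuple[str, bytes]]) -> Dict[str, List[str]]:
    """Map each content identifier to every URL (address) serving that content."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url, body in pages:
        groups[content_id(body)].append(url)
    return dict(groups)
```

An index keyed this way would list a mirrored document once, with all of its addresses attached, instead of once per mirror.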
We still desperately need something that fills this need, but it's not likely we'll get it. One last parting thought - in discussing this with Deutsch, he pointed out that these are new problems to us, but that the library scientists had solved them quite some time ago: It is only the typical CS insistence on reinventing everything and dismissing the knowledge of those in other fields that makes the process so incredibly painful... Hubris strikes again.
"The future's good and the present is nothing to sneeze at." - Roblimo's last
Check out Dizz-net. It's basically an article spawned by a conversation on Slashdot over a year ago that moved to a mailing list.
:-)*
We had some cool ideas, but the infrastructure for such a thing would be huge. I have a bunch of interesting messages from the mailing list describing some pretty cool stuff, like having nodes only search for stuff that's near them, network-wise, to lessen the load at critical points. There was also some talk about moderation ("Click here if this link is not relevant to your search") and heuristics to stop common abuses (spider-bait).
It never happened, because it's pretty heavy stuff to implement properly.
I'm sure some patent-squatter has a patent on it already, with the full intention of letting someone else do the hard work.
Are libraries becoming useless?
Posted by Hemos on 03:53 PM March 27th, 2001
from the we-talk-and-talk-about-same-crap dept.
segmond writes: "CNN is running a story on libraries around the world and their inability to keep up with the growth of the number of books published. Libraries such as ones belonging to even the biggest institutions such as Harvard, Yale and MIT can take months to add a book to their collection and the queue of unreviewed books is growing. Most libraries are even further behind and are filled with off-topic and old assembly books about VAX and Z80 programming. The trend is toward pay for listing your book. Will the free, searchable library fade away?" The article gets beyond the "Wowie, so much content, libraries can't keep up" typical blather and addesses some of the reasons libraries have a hard time keeping up.
------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
What about a browser plug-in that indexes pages as you view them and submits the results to a centralized database (or decentralized if possible)? This would have the advantage of being able to index every page people go to. The database could even store more detailed information about pages that are more popular. Groups of people with special interests could set up their own private index of the pages they visit. Individuals could even have private indexes of their history and bookmarks.
the possibilities are endless.
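For what it's worth, the private-index version of this fits in a few lines of Python. This is only a sketch under stated assumptions: `VisitIndex` is a hypothetical name, the plug-in is assumed to hand over each page's URL and visible text, and "more popular" is approximated by a simple visit count.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class VisitIndex:
    """Toy inverted index built from pages as they are viewed."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)  # word -> URLs
        self._visits: Dict[str, int] = defaultdict(int)         # URL -> view count

    def record_visit(self, url: str, text: str) -> None:
        """Called once per page view with the page's visible text."""
        self._visits[url] += 1
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            self._postings[word].add(url)

    def search(self, query: str) -> List[str]:
        """URLs containing every query word, most-visited first."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return []
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        return sorted(hits, key=lambda u: (-self._visits[u], u))
```

A shared or group index would be the same structure fed by many browsers, which is where the privacy and spam questions start.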
-ishmael
In my locale, SBC (Southwestern Bell) charges extra if you don't want your phone number to appear in their published phone books; otherwise your number will be published in the proper section (white pages for persons, yellow for business, blue for government, etc.).
I can foresee a time when people pay to *not* be included in search engines.
Skip ------ See the latest from http://www.anArchyFortWorth.com
that should be
# include <math.h>
we don't want no C++ OOPs here, just plain old C ma'm
Look up information on the "Invisible Web" - islands typically untouched by search engines, where you need another site to "hop" to these nets of information - cool stuff can abound in these disconnected areas. Here are some links to get started with:
DirectSearch - Invisible Web Search
The InvisibleWeb
WebData.com - Invisible Web Search
InfoMine - Scholarly Internet Resource Collections
AlphaSearch - Invisible Web Search
IIRC, Slashdot even ran an article about this not too long ago - I think this is it, not sure...
Worldcom - Generation Duh!
Reason is the Path to God - Anon
You mean the META tag that already exists?
The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
So it's big? Do you ask every random person in the street for the best place in town to buy comics? You look for people who are likely to be clued in and see where they hang out, or where they recommend. You'll also get a very different answer if you ask a BDSM mistress for a "rack" than if you ask a geek (well, most of the time...).
You just need to make sure the web has these same cues and communities built into it.
Xix.
"Everything is adjustable, provided you have the right tools"
What do you expect? Pay for listing is the only way search engines will make money. Think about it: Would you use a search engine that charged a little, but provided much better results (i.e. no dead links, no off-topic stuff)? I think NorthernLight.com does this.
Is this really a big deal? Hasn't anyone used the yellow pages in a phone book before? People have to pay to be listed in that, and it's very useful for finding companies.
whee
--
Frankly, this article doesn't depress me as much as the quality of google results impresses me. Whether it's 1% or 100% of the available space, I can very often find exactly what I'm looking for.
Now maybe there are vast areas of the web unavailable to google searches because of language quirks or protective admins, but so what.
They have as much a right to exist uncataloged as I do to have an unlisted phone number. If sites want to be indexed, they can register with a search engine. If they don't, and are unreachable, so be it. I don't see what the problem is.
if you own stock in an indexing search engine, you should dump it now, because distributed search engines are going to replace them. if you don't believe me, just ask all of the young peer-2-peer developers out there, because distributed computing can solve this, and there is a huge hole in the efficiency of the internet that these developers can fill, and THEY KNOW they will be famous if they solve it. it's a race. run. run. run. you're going to lose if you're still wearing your penny loafers.
-- Betting on the survival of the media industry is a serious risk. I advise investing elsewhere.
I don't know WHAT they are talking about -- I can find ANYTHING that I look for on Google -- even sites that I have just created a day or two ago have been found. These people just aren't using the right search engine, dammit! =)
------------
CitizenC
The article skims over the fact that search engine technology is progressing fairly rapidly, and that some companies (Google) are creating new technologies that exploit the way the web works while Yahoo! and some others are relying on older technology for some things (like filtering pages by hand for their directory!).
Google's approach is novel; make the web pages rank themselves. If more people link to your site, it's probably a better site. If few enough people link to it, it probably isn't and besides that it'll probably never be found.
Web site creators have to do the legwork to get their sites recognized, and going to a general search engine to do it isn't the way. If someone makes a site and tells their friends about it, and their friends like it and link to it, it'll get picked up; that's the way of the web. (At least, it'll get picked up by crawlers like Google, and even ranked highly if enough people link to it).
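A deliberately crude Python illustration of "let the pages rank themselves": count how many distinct pages link to each target and sort by that. This is plain inbound-link counting chosen for illustration, not Google's actual ranking, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def inbound_counts(links: Iterable[Tuple[str, str]]) -> Counter:
    """Count distinct pages linking to each target; self-links and repeats are ignored."""
    seen = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in seen)


def rank(links: Iterable[Tuple[str, str]]) -> List[str]:
    """Pages ordered by inbound link count, ties broken alphabetically."""
    counts = inbound_counts(links)
    return sorted(counts, key=lambda page: (-counts[page], page))
```

Note that a page nobody links to never appears in the output at all, which is the "it'll probably never be found" case from the comment above.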
Search engine tech has yet to catch up to dynamic pages, but it's the fault of the content creators if they want their pages on search engines but can't code enough alt tags to make their stuff show up.
In any case, the bulk of the web does work, and good pages get recognition. I've always eventually been able to find what I'm looking for on the web, no matter what the topic. Search engines have to grow like everything else, but so far they're the best thing going and getting better.
This is how I found
Did anyone out there get hooked up to
-----
crazy dynamite monkey
I have been a PHP programmer for 2 years now and I applied to review the PHP sites. They rejected me citing an overabundance of PHP reviewers. Does this mean that they want people to review anything instead of what they know?
If you need to find more relevant documents on specific subjects, I recommend using topic-specific search engines. I maintain one for all subjects relating to Paganism and Wicca on my Omphalos website. True, the site submissions have to be manually approved and this can lead to backlogs of site submissions, but since I spider all of the websites I have included in the directory (totalling over 140,000 webpages so far) the relevancy of any search results is raised by the lack of clutter from unrelated websites.
Similarly, if you are searching for information on Space Exploration try Spaceref where I used to work. Again, the directory is manually generated, and the results are greatly improved overall.
Nothing guarantees improved relevancy (for general purposes nothing beats Google in this respect), but using specialty search sites helps immensely in many cases.
"The first time I got drunk, I got married. The second time I bought a chimpanzee, after that I stayed sober" Arian Seid
If I were to set up a search engine:
Every unique domain name found would get crawled for free. You paid for a domain name, you must care about your content.
Every geocities-style cheap personal page would require a small fee to get crawled. Too much schlock; scan only the stuff people care about. You don't wanna pay your own fee? Ask a visitor to pay the fee. PayPal or something newer/better should do the trick.
Every dynamic page like slashdot, everything2, or real estate listings, would have to have a more expensive agreement in place to get anything indexed. The buck stops at cgi. Waste no time on something that will probably be gone tomorrow.
Commit to the resources it will take to prune and groom the stale dead stuff out of the index, regularly. Dead links are bad business.
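The three tiers above could be told apart with something as simple as the Python sketch below. The host list, the tier names, and the "looks dynamic" test (a query string or a cgi path) are all assumptions made for illustration, as is checking for dynamic pages before anything else.

```python
from urllib.parse import urlsplit

# Hypothetical list of cheap shared-hosting sites; a real one would be much longer.
SHARED_HOSTS = {"geocities.com", "www.geocities.com"}


def crawl_tier(url: str) -> str:
    """Classify a URL as 'free', 'small-fee', or 'contract' under the scheme above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumed test for dynamic pages: "the buck stops at cgi".
    if parts.query or "/cgi-bin/" in parts.path or parts.path.endswith(".cgi"):
        return "contract"
    if host in SHARED_HOSTS:
        return "small-fee"
    # Anything on its own domain gets crawled for free.
    return "free"
```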
Well, I'm tired of them too, and I write pages that I submit to search engines from time to time, and I've come up with what I feel is the best way to submit links to a bunch of sites:
Direct links into the pages that have the URL submission forms on a bunch of search engines.
-
Search Engine Submission Form Index
Keep a text window open with your URL, title, description, for-public-consumption email address and the like, and use "Open Page in New Window" on all these links to manually copy and paste your information into a bunch of search engine submission forms. That's it!
I got all these search engines off the Search Engines Category at the Open Directory Project. If you know of any pages that list a bunch of other search engines (there are many smaller ones, and a lot of special purpose ones) then drop me a line at crawford@goingware.com.
In my index I provide brief notes about some of the engines, including mentioning whether they refuse to accept submissions without payment. I don't provide links to submission forms for the engines that won't list a site for free, and I'd like to ask you not to support the trend towards paid index and spider placement.
You should understand that the vast majority of visitors to your sites don't get there through search engines, they get there because other people like your page and give you a link. The main value of search engines is to "prime the pump" so a few people start finding your site and then know to create a link for it.
Create successful web sites by writing good web sites - see Some Web Application Design Basics for links to a few good pages written by experts that will start you well on the road to an appealing, successful website.
Thank you for your attention.
Mike
-- Could you use my software consulting serv
Also, I'm not impressed with ODP's handling of new applicants. I applied once last year and received NO reply, not even a rejection letter. I had applied to edit the category of "Personal Pages -- Surnames starting with U". It was to get my feet wet, learn how to be an editor, see how time consuming it might be before adding a more serious category. I mentioned that in my application.
I resubmitted it in February and successfully received . . . a rejection letter! They decided I have a personal stake in the category (note my last name) and might be biased. Oh no! We must prevent the potential for abuse of Web Pages about people named U*!
If I'm not allowed to edit for categories that I know something about and I'm interested in, then what exactly should I volunteer for, and why should I?
Does anyone know if there are any search engines 'out there' that help implement this 'two step' process?
You know, suppose I am looking for something that requires these two steps, but I don't know anything about the subject (That's why I'm searching in the first place). So I don't know what my first search should be like.
Hans Voss
---
"I have no special talents, I am just passionately curious" -- Albert Einstein
I think the problem with the current search engines/directories is that they are trying to index the entire web into one handy-dandy catch-all whiz-bang database. IMHO it's too much for a single system to deal with.
It seems to me that search engines/directories should start to specialize in specific topics. For example, science, pop culture etc.
We might have a fighting chance this way.
Thoughts?
Dave
DOS is dead, and no one cares...
If there's a Bourne Shell, I'll see you there
Yellow pages list companies. On the Web, sometimes you don't want that; you want to find non-commercial sites: fan sites, non-biased reviews and information, etc. Those are getting harder and harder to find via search engines. All the commercial sites appear first. Obviously, more and more sites are going commercial to cover costs, but there's still a lot of quality information on non-profit sites, and it's getting harder to find. At least, that's my experience, even with Google. Who knows, it might just be the harsh reality of the future of the 'net. Surfers' standards of quality have risen much higher, and non-commercial sites have a harder time keeping up with the companies that have a whole team running a site to make it look pretty. That doesn't mean the information won't be missed, though. So is it really a big deal? Yes, I think so.
If you really think about it, being able to search up on cached contents on Google is actually a GOOD THING. Now if they only would make it an option by the search-criteria, and make their spiders check the more popular links more often, it could really improve their search-results.
On another note, they should probably spider more news-sites like Slashdot and Freshmeat frequently. There they can get the new links as they arrive, good or goatse.cx.
On the last note (Yes I promise ;), specialized search-engines are probably the best option if you want some obscure student-paper or 4-year-old newsflash.
- Steeltoe
http://www.debunkingskeptics.com/
doomed to failure until someone implements something like the Dewey Decimal System for web pages
Yes, we're stuffed -- but Dewey Decimal isn't the answer (we can do a lot better than that).
There's an initiative around that's gaining considerable momentum - the Semantic Web. It starts from one bright idea by one guy, but as the guy in question is Tim B-L, then he gets listened to. There are solutions to all this. We've barely started on what we could easily achieve for indexing the web, without even trying for the really hard stuff.
Once basic semantic level indexing becomes commonplace, through tools like Dublin Core, then take a look at ontological descriptions and projects like DAML.
There's a huge amount happening in this field research-wise, it just hasn't hit the punter's web yet.
My idea is to come up with a standard set of headers that provide directory/hierarchy information for search engines. This is much more useful than keywords, et al., because they allow for top-down directories such as Yahoo! and the Open Directory project. Sites like this could be automatically created simply by crawling the web and organizing sites according to a category specified in their header.
The problem with keywords is that it's easy to spam them. If you need more hits, just add "bestiality", "Natalie Portman", and "hot sluts" to your keywords. The keywords often have nothing to do with the actual site.
It would be much harder, however, to spam a directory structure, especially if most search engines limited the number of directories a page could specify to, say, two or three.
The header would be easy to implement. It could be done very easily within the comment tags of existing HTML. The only problem is getting people to do it. It would work beautifully if Yahoo! or another large site were to give up on "hand-picked" sites and start letting people specify their own location on the structure. Then anyone who wanted their site to be locatable would specify a hierarchical subject category in their header.
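To make the proposal above concrete, here is a rough sketch of how a crawler might read such a header and file pages into a tree. The comment format `<!-- directory: Top/Sub/Leaf -->` is invented purely for illustration (no such standard exists), and the cap of three categories follows the "two or three" limit suggested above.

```python
import re

# Hypothetical header format, made up for this sketch:
#   <!-- directory: Computers/Internet/Searching -->
CATEGORY_RE = re.compile(r"<!--\s*directory:\s*(.*?)\s*-->", re.IGNORECASE)
MAX_CATEGORIES = 3  # assumed cap, to make category spamming less rewarding


def extract_categories(html, limit=MAX_CATEGORIES):
    """Return at most `limit` distinct category paths declared in HTML comments."""
    paths = []
    for raw in CATEGORY_RE.findall(html):
        parts = tuple(p.strip() for p in raw.split("/") if p.strip())
        if parts and parts not in paths:
            paths.append(parts)
    # Anything past the cap is simply ignored.
    return paths[:limit]


def new_node():
    return {"sites": [], "children": {}}


def build_directory(pages):
    """pages maps url -> html. Returns a nested directory tree."""
    root = new_node()
    for url, html in pages.items():
        for path in extract_categories(html):
            node = root
            for name in path:
                node = node["children"].setdefault(name, new_node())
            node["sites"].append(url)
    return root
```

A crawler would call `build_directory` on whatever it fetched; a page declaring more categories than the cap just has the extras dropped.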
Great idea. It'll never happen.
Got Rhinos?
if the web is becoming unsearchable:
make smarter search engines:
only search part of the net
very specialised
reorganize search engines
reorganize the web
Company behind it -- MaxBot.com -- also offer SearchMil.com ("Over 1 million military pages indexed and ranked in order of popularity."), SearchGov.com and Search eBooks.com.
I don't know WHAT they are talking about -- I can find ANYTHING that I look for on Google -- even sites that I have just created a day or two ago have been found.
You're kidding, right? Or have you just not tried it in the last year or two? I submitted my site to Google -- and everywhere else -- two months ago and have yet to see it. And I wasn't about to pay the $199 to Yahoo or Lycos to get it listed. Bastards.
Slash has nothing to do with Slashdot.
This would require a lot of human verification, for there are many possibilities for abuse. I could always report my competitors for false keywords, just to keep them out of the listings. And as soon as we get to more exotic topics, who can say if a keyword is relevant or not? And how relevant is relevant anyway - if a porn site does have many pictures of women getting out of girl-scout uniforms, is "girl-scout" a valid keyword?
There are simple ranking algorithms, that weigh uncommon keywords more, and take into consideration how many keywords the site claims to relate to. These might be more effective.
In Murphy We Turst
What about combining efforts? AltaVista already wants to own all search engines (hence the patent), so why don't they form deals with other search engines that quite frankly suck (like Excite or Lycos) and distribute the work load?
Of course, leave Google out of the mix. They already kick ass.
If META tag spamming is so much an issue then there are algorithms that might help. A simple one would be for the "value" of a particular META keyword to be reduced according to the number of keywords provided.
So, in a page with a single META keyword "sex" the keyword "sex" would count as value 1.00. In a page with META keywords "sex, drugs, rock-n-roll" the keyword "sex" would have value 0.33.
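The dilution rule is simple enough to sketch in a few lines. This is only an illustration of the idea as stated here (value = 1 / number of keywords listed); the function name and the choice to ignore repeated keywords are my own assumptions.

```python
def keyword_value(meta_keywords, keyword):
    """Value of one META keyword, diluted by how many distinct keywords the page lists.

    meta_keywords is the comma-separated content of the META keywords tag.
    """
    listed = {k.strip().lower() for k in meta_keywords.split(",") if k.strip()}
    if keyword.strip().lower() not in listed:
        return 0.0
    return 1.0 / len(listed)


print(f"{keyword_value('sex', 'sex'):.2f}")                      # 1.00
print(f"{keyword_value('sex, drugs, rock-n-roll', 'sex'):.2f}")  # 0.33
```

A page stuffing 200 keywords would end up with each one worth 0.005, which is the whole point.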
Those who do not learn from Dilbert are doomed to repeat it.
http://www.google.com/search?hl=en&lr=&safe=off&q=troll+slashdot+equinox
-- @rjamestaylor on Ello
Most searches for herbal medicines (e.g. "5-HTP") find you way more hits (especially the high ranking ones) from companies trying to sell you it than actual objective information about it.
Had you typed 5-htp information into Google, you would see 5-htp information, with Harvard as result #2.
Will I retire or break 10K?
"html" 188,000,000
But, as usual for Google, the first three results are highly relevant for at least one common sense of the search term. (The first is W3C's official HTML standards site.) I didn't realize how bad AltaVista sucked until I tried it after using Google for a year.
does anyone find anything better than "and"???
+a comes close. It seems they're blocking searches for +the.
Will I retire or break 10K?
Yep, all that content, and yet when there's a slow day at work I can still run out of interesting stuff to look at on the internet.
little gamers, penny arcade, goats (not goatse), and badtech: online comics. It'll take a while to browse the entire archive.
everything 2: nearly half a million writeups on topics from aardvarks to zzyzx.
Will I retire or break 10K?
Basically, using the peer-2-peer revolution (buzzword alert) in advertising is the next thing.
I hope you're not talking about spamming Gnutella.
Some companies are trying to combine the peer to peer aspect of traditional word of mouth and the web.
In this model, surfers are paid to recommend the sites to other surfers. Spedia is a prime example, as was AllAdvantage until it went to a "sweepstakes" scheme. Other examples can be found in the many sites that use Recommend-It.
Hatten är din, hatten är din, habeetik, habeetik.
Will I retire or break 10K?
Of course, you'd need to use this technique with a search engine that takes dead link submissions, e.g., AltaVista and its "Add or Remove a Page" link.
AltaVista does not allow submissions from visually impaired users or users of text-based web browsers such as Lynx, Links, or w3m. Its submission page uses a GIF image (burn all GIFs) to display rotated text in various fonts. The user is supposed to read the text and enter it into a field below. But visually impaired users, users on text browsers, and users on browsers whose developers have been cease-and-desisted by Unisys never see the GIF and cannot contribute links to AltaVista.
Will I retire or break 10K?
Not quite. Disney can pay zillions to be top in a search for "animation techniques", but they're actually not a reference site for learning how to do animation. Ditto a search for "electronic circuit design" - Intel could pay to be listed on there, but you're not going to find much info about designing electronics on their site. Paying for listing on those kind of things simply increases the noise, whereas Google's system looks for sites which are popular references on a subject.
But you're right in some ways, too. If you search for "children's toy company" or something (and temporarily ignoring the other 'toys' listed ;-) then pay-per-listing is more likely to show you ToySmart or whoever (are they still going? can't remember), which you actually want.
Good points and bad points about both. I think the best would be a two-tier system - a pay-per-listing one for commercial stuff (Amazon, etc) and a free one with a reference-check system for information-search purposes. Maybe the pay-per-listing could subsidise the free one?
Grab.
" Wrong answer."
:)
The web is a publicly accessible resource. If you don't restrict access to your pages, then I can bookmark a page deeply embedded in your site and go directly there anytime I want. I will dispute your "right" to tell me I can't give my list of bookmarks to anyone I want.
It is the web designer's responsibility to restrict access, if such is needed. If I can go directly to a page and skip your ads (and you don't want that) then you need to redesign your page. Besides I can filter out all the ads so you don't get any hits anyway. (I don't though)
I don't know where you got your definitions, but you need to look up hacking and IP, I do not think they mean what you think.
Last time I checked, theft meant taking something from someone. I can't take something from you if you don't have it.
Actually many of these sites are stealing from the companies that pay them for ads. At least I would consider it stealing if I paid money for an ad that merely sat on a "gateway" page, or any page that doesn't have content that will hold the browser's attention.
Nothing to say here... move along
Google's only about 2 months behind on their indexing; their cached copy of Slashdot is from January 23.
Searching for a specific site through a search engine is pretty well useless, unless you are a whiz at forming your query properly. The only way I have ever been able to find what I was looking for was to search for index pages relating to the topic I was looking for and then jump from index to index until I found what I needed.
These indexes tend to keep a small list of sites and tend to check on these sites often for dead links or being off topic.
The sad fact though is that the larger and more complex a system becomes (as the Internet is becoming) the more chaotic and disorderly it will become. Watching five butterflies fly around is a lot easier than watching a million. And search engines are not going to be able to keep up, because the environment that we ask them to keep track of is billions of lines of text that are constantly changing. Then in this environment we want it to find a specific word that has context to what we want, while at the same time cutting out the superfluous chatter. Impossible... at least impossible now. It is going to take a major breakthrough in search algorithms for this puzzle to be cracked.
News at 11...
The content on mini-portals is a million times better than Yahoo's old haphazard system. I gave up submitting non-commercial links to Yahoo because you wait months before being sure they didn't list you, then resubmit and wait months, then resubmit... etc.
When I use google to make a search on a technical topic related to my work, usually the vast majority of the links provided by google are relevant.
The problem is, there are too many of them for my little hands, my little head, and my little time allocated on earth. Google scales, I don't.
I've been somewhat familiar with web search engine technology since its inception. Over these six years the quality of the results has gone up while the percentage coverage has remained steady.
So the main premise of the article is moot.
Also their data is flawed. The article quotes a bogus 550 billion pages which includes dynamic content not meant to be indexed. If we used a realistic definition of what a web page is, the total number of pages out there would rate in the 10-30 billion, tops.
As computational linguistics improves and usage-pattern agents such as Copernicus and Firefly are refined, I expect the quality of searches to continue improving.... Just recently I came across an article demonstrating amazing automated content subject classification (coming soon to a search engine near you).
Are these "the sky is falling" printed press articles the forerunner of trolls?
What ever happened to the peer to peer idea of searching? I remember when Napster and GNUtella started, people were talking about how this might actually alter the way searching was happening on the web. By having each server tell us what they have, we are assured that when someone searches for how to replace a broken window, they won't get what they don't want.
--------------------
`Lex - Find Me Here: Text Appeal
Libraries are government-funded, so everyone has paid for them already. A government-funded search engine might not be a bad idea though.
How does the Dewey system address that, since a book can also fall into more than one category?
And how long do you think it will be before microsoft.com, mpaa.org and riaa.org disappear from all search engines?
That way we would at least get reusable info. Still doesn't address quality of content directly, but IMHE those sites which take care of their information format tend to be the ones taking care of their information content as well.
The fundamental problem right now is that the search engines don't give a damn about content quality or format, just raw hit rates.
///Peter
- New and Original
- A copy, mirror, or just links to something Original
How much of the web is truly new and original information? You got me. A lot of it, I would guess. However, if a search engine or directory structure were able to pick the best and most informative sites and link to those, it could accurately address the majority of its searches. How do you determine that something is the best site? Could an algorithm do this? Perhaps, but so could a staff of people. A web crawler brings in new sites. Then someone on the staff looks at the site and asks "Is this one of the best sources for original information on any topic?" If it is, they add it to the database and associate it with the proper nouns.
Let's say I try to search for a book titled "foo bar" by "john doe". In the search for "foo bar", I may not find anything if "foo bar" is a common phrase or common words. Same goes for the name. I would get loads of links to sites that mention people with that same name. Why not have the search engine ask for more information if the first search comes up too big. Instead of trying to find it directly, which you may have about the same odds as winning the lottery, ask the user to describe the thing that they are searching for. The user might enter something like "It's a book about widgets."
From there, the search engine might see the word "book" and pop up a link to Amazon.com or even do a search on Amazon.com and return those results. Have some built-in intelligence that can match up a noun with the best sites about that noun in its database. It could search for "widgets" and return those results. Or even apply the first search to the results of the "book" and/or "widgets" search. How do you design a search engine that can make the association between the noun and the best sites about that noun? (best being the key word in that sentence) Also, the engine would need to have an intuition about how much information is needed. Do I have enough information to give them a highly accurate link or do I need to ask them to describe it more? Do I need them to describe a description? From what perspective are they coming? (perspective does determine relevance to a degree)
It's like that old saying "you have to have money to make money." In this case, you have to have information to get information. More specifically, you have to pass information about the information that you require. In other words, let the user provide the meta data instead of the database. The database would focus on the "noun-to-best-links" matches. The search engine asks the user questions and breaks the search request down into a set of noun searches which it can reference in its database. The best answers float to the top and if they are not what the user is looking for, odds are they can go to a site and find it there or get more meta data and try again.
You know, solving the keyword problem wouldn't be too hard. I mean, if Yahoo had an option where you could submit a site that you think had off-topic keywords (like if it were a porn site and it had the keyword 'girl scouts' or something children might be searching for) and they would completely remove all occurrences of an offending site from their database, then maybe things could be well classified. People would only use on-topic keywords so that they don't get banned from Yahoo.
This would make searches SO much more accurate. It would just take someone to have the balls to say "you are abusing the keywords so now nobody will ever get to your site from our search engine."
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
It is simply not possible in the future to have a single search engine categorizing the whole web. What we need is specialized search engines. Portals are going through exactly the same thing at the moment. No single portal can satisfy all web-related needs of a single person. Therefore, more specialized engines are the answer, and when it comes down to it, I'm ready to pay for such a service. As long as I get what I need..
/JacQ "Find the metaclass of everything and find God.."
I have a suggestion to anyone who is thinking of implementing a better directory. First, define the categories, and allow any site to submit their site to their categories. Then, introduce moderation to the mix. Allow users of your directory to rank sites in terms of suitability to the category. Allow them to create red flags for people submitting porn to health->teens->sexuality, and so forth. Let the users do the work!
I think moderation works well for sites like slashdot, why not a moderated web directory?
No, Thursday's out. How about never - is never good for you?
Whilst Google is clearly the best for non-commercial searches, GoTo is apparently the best for commercial searches (if you want a service someone will make money from supplying).
It nicely gets around the problem of manual classification, by effectively using market forces to make an advertiser classify themselves correctly (or pay for referrals which make them no money).
Let's say I have a hotel in San Francisco, but bid for the general term Hotel ($1.03). Now I will presumably only get some custom if they were looking for a hotel in SF - otherwise I just paid GoTo $1.00 for a useless referral. Better I list myself as HOTEL SAN FRANCISCO; even though this costs more ($1.71), I will have a much higher conversion ratio.
Of course, if I am a US Hotel Chain or Broker, then maybe I would bid on the general Hotel keyword.
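The arithmetic behind that argument can be sketched as cost per booking = bid per click / conversion rate. The two bids are the ones quoted above; the conversion rates (1% for the general term, 5% for the specific one) are numbers I made up purely to illustrate the shape of the trade-off, not real GoTo data.

```python
def cost_per_booking(bid_per_click, conversion_rate):
    """Average spend needed to win one customer at a given conversion rate."""
    if not 0 < conversion_rate <= 1:
        raise ValueError("conversion_rate must be in (0, 1]")
    return bid_per_click / conversion_rate


# Bids from the comment above; conversion rates are hypothetical.
general = cost_per_booking(1.03, 0.01)   # "Hotel"
specific = cost_per_booking(1.71, 0.05)  # "Hotel San Francisco"

print(f"general term:  ${general:.2f} per booking")
print(f"specific term: ${specific:.2f} per booking")
```

Under those assumed rates the pricier keyword is the cheaper way to get a guest, which is the self-classification effect described above.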
End of self serving Sales Pitch :) Personally I'd like to see us create a GoTogle (TM) :-) that combines the best of both approaches.
Winton
In the article, the author talked about search engines dying from lack of use, but I don't think they will. There are always going to be people on the internet who don't know much about the internet, and they turn to search engines to find what they need. They might lose the more experienced clientele, but they will always retain the novice clientele.
"The heresy of one age becomes the orthodoxy of the next" - Helen Keller
That's really all it comes down to. You just need to learn what method of searching gets the best results. Personally, I use google and almost always find relevant pages on the first listing of results. After the second page, however, they get off-topic (what do you expect when you get 2,560,009 results?)
It helps if you just search for an exact phrase (i.e. "amount of trolling on slashdot in relation to the vernal equinox"). But then again, I didn't need to tell you that.
--
#nohup cat
I do not think that pay-for-placement in search engines must be a bad thing. If you have to pay to get into the top positions of a search engine - you are simply more likely to spend time on the site, and the other way around (you spend time - you pay). This makes sure that the top listings are truly relevant for your search, and if they aren't, just scroll past the paid ones until you find good ones. By the way - let's face it - getting high in search engines is not about having a good site. It's about knowing how to optimize your site for the search engines (of course, having a good site will help you a lot). Pay-per-listing is really just making use of money, instead of skill/time.
The interesting part is here:
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
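For anyone curious what "votes from important pages weigh more" looks like in practice, here is a simplified power-iteration sketch. The damping factor of 0.85, the iteration count, and the toy link graph are assumptions on my part; whatever Google actually runs is certainly more involved than this.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    Returns a dict of page -> score; scores sum to roughly 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over everyone.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Toy graph: A, B and D all link to C; C links only to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]})
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In the toy graph, C collects the most votes, and A outranks B and D because its single vote comes from the "important" page C.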
For you non-search engine gurus - many search engines stop at "?" in URLs. Look at the Slashdot URL in your browser window. Most search engines refuse to index past the "?" sign. Stupid. I think an easier solution would be to make the search engines stop stopping at "?". Google already does this, of course.
Why in the hells should dynamic content NOT be indexed? All the best sites ARE dynamic with databases.
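A quick sketch of why stopping at "?" hurts: two different dynamic pages collapse into the same URL once the query string is thrown away. The example.com URLs and the `sid` parameter are made up for illustration, and I'm only guessing at how such crawlers behave internally.

```python
from urllib.parse import urlsplit, urlunsplit


def truncate_at_query(url):
    """What a crawler that 'stops at ?' effectively keeps of a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def is_dynamic(url):
    """True if the URL carries a query string."""
    return bool(urlsplit(url).query)


story_one = "http://example.com/article.pl?sid=1"
story_two = "http://example.com/article.pl?sid=2"

# Both stories look like the same page to the truncating crawler.
print(truncate_at_query(story_one) == truncate_at_query(story_two))
```

A crawler that keeps the query string sees two pages; one that drops it sees one and never indexes either story.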
You're wrong. There are metaeditors, and they have quite a bit of power. They can delete entire categories, for example.
I know that many of the active editors are people pimping their own sites and ignoring submissions.
Listing your own site is OK, even encouraged -- having a site in the category is considered a sign that you know something about it. Keep in mind that just getting listed on DMOZ isn't supposed to be hard; it's not like Yahoo, which makes more of a point of being selective.
Listing your own site as a "cool" site is not OK -- if you know of such a case, you should complain to the editor of the more general category of which it's a subcat. Keep complaining up the tree until you hear back.
Actually, you hear a lot of people whine about how their applications to become editors got turned down, but the people at DMOZ who review the applications say that by far the most common reason is that the person's application makes it clear they're only interested in self-promotion. So yes, people do try to abuse it, but DMOZ tries to stop it from happening, takes complaints seriously, and is in fact getting criticized all the time for taking it too seriously.
Open Directory sucks. I work for a fairly large international B2B company and not only is our company not listed, neither are several of our competitors. I eagerly await the day that AOL stops using it so I can stop caring what Open Directory does.
Maybe you'd have more luck getting listed if you'd learn more about how Open Directory works.
The Assayer - free-information book reviews
Find free books.
Well, what you're describing sounds a lot like META KEYWORD tags.
Having been an Open Directory editor in the past, I don't really think the problem is finding the right pages. Actually the biggest problem is just that a lot of editors aren't active, and it's hard to know who's active, because they're listed as editors even if they haven't logged in or checked submissions for a year. This creates problems for editors who have to cooperate with other editors, and may also give outsiders the impression that Open Directory is overwhelmed in general, when really it's just that the editor they submitted to is AWOL.
Yahoo is doomed to failure because they don't have enough people working for them. Open Directory works just fine, because they have orders of magnitude more eyeballs working in parallel. No, Open Directory doesn't list every page on the web, and that's just fine with me as a user -- it's more useful because it's selective.
The Assayer - free-information book reviews
Find free books.
Well, what you're describing sounds a lot like META KEYWORD tags.
The problem with meta tags is that everyone has their own idea about how they should be used. I think Ichimunki had something like RDF or Dublin Core in mind when talking about a Dewey system equivalent for the Web. They define standard document properties which make searching through metadata a much easier process.
"www" makes 326,000,000
"it" 262,000,000
"html" 188,000,000
(all out of a total of 1,346,966,000 pages according to google, so more than 60% of pages include "and", having fun with stats :) )
does anyone find anything better than "and"??? (to search type "+and" otherwise it will be ignored as it is a "common word")
Keywords are not especially helpful in auto-creating directories. They are of limited value because only about 10% of web sites use them at all. Of those that do use them, there is no limit or structure to them. They are easily spammed. This is exactly why they were discarded as useful by SEs a long time ago. I have found keywords and descriptions helpful in my own efforts at classifying web pages because, once verified by a human (me), they could be used as a partial basis for text based searches (in which I also included META descriptions). If no keywords were given I frequently resorted to duplicating the description. If keywords were given, but no description, I could usually find a short excerpt from the site that could be copied and pasted.
Open Directory works rather well, IMHO, as a directory because the editors have a strong sense of ownership and are given small enough chunks to do that the work is very manageable at the individual level (and they can do it in their spare time easily). But the human element is always going to be a potential issue with any directory. A problem you just don't have with Google.
I do not have a signature
A book can be cross-listed in a card catalog, from my understanding, but since the book can only be in one place on the shelf, it's not a big concern. The librarian simply chooses the dominant topic, or uses one of the 000 general classes (for things like encyclopedias, periodicals, etc).
I do not have a signature
I never said it would be easy! :)
Having actually tried to implement a DDC based web directory once, I am familiar with the problem that many pages would possibly fall under many categories. This is a problem with any directory-based approach, especially if you list a page in one category and then the page changes enough so that the category no longer applies.
In your example, I would hope it would not be too much trouble for you to put a different class number into the pages that make up each logical section of your site. Or if the site is small enough, it would likely fall under something like "personal web pages", which may have a number of subclasses itself, and then you'd choose the one you felt appropriate.
Again, this is a common issue among all directories, where do you put stuff? Do you allow multiple listings/classes per site/page? You still end up having to include some sort of keyword or text-based search so that users are not forced to browse the directory structure, guessing at the classification they are looking for or where it lies in the hierarchy. Text searches also allow for the possibility of searching based on content rather than metadata.
Most of this is a non-issue, given that Google seems to have rather successfully implemented a non-directory type of engine-- succeeding where Altavista was simply unwieldy. At least that's my impression. I usually find what I want with Google.
I do not have a signature
Yahoo and DMOZ are web directories. This is a very human labor intensive way to categorize the web. Google is actually a search engine. It spiders out and runs an indexing algorithm of some sort to help it respond to queries. These are very different approaches.
Yahoo and the like are doomed to failure until someone implements something like the Dewey Decimal System for web pages and then convinces a large number of webmasters to correctly classify their pages using it. That way a machine can do the hard work and only the person designing the page need do the actual work of making sure the page is classified correctly.
Obviously this is fraught with problems similar to those of keyword spamming, but it's either that or build something like DMOZ on a decentralized basis, so that any individual maintainer builds a set of links that are tailored to his/her interests and either uploads them to a central server or provides them as an XML document for an engine to work with.
I do not have a signature
I ignore pay for listing sites - they are nothing but commercials for big business.
We all know that a computer does what you tell it, not what you want. If you search for "cars" you will of course get much more useless information than if you search for "cars sedan mid-sized" or whatever other modifiers you have in mind. I remember once talking to a new internet user who tried to find information about "ants" through some search engine and ended up wading through links about restaurANTS and consultANTS. Okay, so all search engines are not created equal. Isn't that why most of us like Google?
Perhaps it helps that I am researching very specific topics. Yes the Web is getting bigger, yes there are things it is hard to search for, but I don't really think it is getting worse. As the song says, "If she knew what she wants he'd be giving it to her now."
-- I Am Not A Terrorist.
This idea is worth some thought. The basic problem is that the richness of the pages we produce in response to a name search is the very thing that is making it worthwhile to have our names represented on Google. A Google-referred user immediately appreciates what our site has to offer -- data visualization of interlinks between names, with clustering, cluster-click selection, etc.
If this richness is available to the Google user who arrives at our search results page via Google, then the same richness is available to the original crawler that put up the page.
But I appreciate the suggestion, and it may well be that some balance could be achieved that would bar Google from the "richness" but keep it open, available, and apparent for everyone else. We already do something like this -- the program that does the visualization is blocked to Google, so that the links Google gets are from a program that doesn't have to generate GIFs with client-side image maps, nor Java applets with cluster-clicking.
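For readers who haven't done this kind of per-program blocking, here is a rough sketch of the idea using robots.txt rules and Python's standard `urllib.robotparser`. The paths `/cgi-bin/visualize` and `/cgi-bin/search` and the host are made up; our real script names and rules are not shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep Googlebot out of the heavy visualization
# program while leaving the plain name search open to it.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /cgi-bin/visualize
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

search_url = "http://example.org/cgi-bin/search?name=Doe,+John"
visual_url = "http://example.org/cgi-bin/visualize?name=Doe,+John"

print(parser.can_fetch("Googlebot", search_url))  # plain search stays crawlable
print(parser.can_fetch("Googlebot", visual_url))  # visualization is off limits
```

This only works for crawlers that honor robots.txt, and it does nothing about load from the pages that remain open.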
I run a site that's a cumulative name index of 700 books
and thousands of clippings. The indexing started in 1983.
For any name, you can get all the other names that share
pages with that name throughout the entire database. In
other words, each name search produces a page that contains
anywhere from several to several hundred additional names
-- all pre-linked directly to their own searches, which do
the same thing. You get the idea.
It's a bot's worst nightmare. But if you are Google, with
lots of crawlers to sic on the task, it quickly can become
my nightmare instead of Google's. Indeed, Google doesn't
seem to care much.
Last October I noticed that Google was inclined to stumble
into our cgi-bin on rare occasions, and actually do a
decent job of delivering referrals to the name data that it
got from us. I lifted the robots.txt exclusion to see what
would happen. No other bots have even delivered referrals
as consistently as Google, so I can only assume that Google
is the only bot that's even serious about going after the
dynamic web.
Either that, or their algorithms do a much better job on
our names, which are all listed as surname-first throughout
our site. If you search for a name in the news as Firstname
Lastname without quotes, Google will put our Lastname,
Firstname high on the list due to two facts: Our name is
part of the anchor description and they give link data more
points, and secondly, the two words are close to each other
and this adds to the score (even though they are backwards).
Google has come by once a month ever since I lifted
the robots.txt. Each time they spend about 10 days solid,
24/7, with from three to five crawlers, chasing all the
name searches. The rate from all the crawlers together for
those 10 days varies from about two name searches per
second to several per minute.
It's very erratic during that time; the crawlers don't talk
to each other, and there's no detectable pattern that
they're following. They don't manage to get through the
entire database of 115,000 names by any means. There is an
incredible amount of waste and duplication.
I had to install a load-sensitive thermostat so that when
our server hits a certain load threshold and it's Google
calling, it starts delivering "Server too busy" responses
instead of the search that was requested. That seems to
work pretty well, but they get all those "Server too busy"
messages stored in their cache copy for that name.
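A rough sketch of what such a load-sensitive thermostat might look like, in Python. This is not the code running on the site described above: the threshold value, the "googlebot" user-agent marker, and the function names are all assumptions made for illustration. It answers with HTTP 503 and a Retry-After header, which is the standard way to say "temporarily unavailable"; whether a given crawler then avoids caching the busy message is not something this sketch can promise.

```python
import os

# Hypothetical values: the post gives neither the threshold nor the crawler's user-agent string.
LOAD_THRESHOLD = 4.0
CRAWLER_MARKER = "googlebot"


def should_throttle(user_agent, load_1min, threshold=LOAD_THRESHOLD):
    """Return True when the request is from the crawler and the load is at or over the threshold."""
    is_crawler = CRAWLER_MARKER in user_agent.lower()
    return is_crawler and load_1min >= threshold


def respond(user_agent, run_search, load_1min=None):
    """Return (status, headers, body) for one name-search request."""
    if load_1min is None:
        load_1min = os.getloadavg()[0]  # 1-minute load average (Unix only)
    if should_throttle(user_agent, load_1min):
        # 503 plus Retry-After marks the refusal as temporary
        return 503, {"Retry-After": "3600"}, "Server too busy"
    # Everyone else, and the crawler under light load, gets the real search
    return 200, {}, run_search()
```

Ordinary visitors are never throttled here, only requests whose user agent contains the crawler marker.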
To put it bluntly, their bots are dumber than toast, and
if you don't watch them, they can turn your server into
toast.
Last November I wrote to Larry Page and offered to send him
the damn database on CD-ROMs, in discrete HTML files using
any specification he cared to define, so that his crawlers
wouldn't have to load down our servers once per month.
Mr. Page never responded. The letter was e-mailed, faxed,
and snail-mailed. Someone from google.com did a Larry Page
search shortly after I faxed it, so I'm pretty sure they
read the thing. I offered these CD-ROMs for free, and I
didn't ask for any changes in PageRank or any other
considerations. It would simply mean that I can get my
names onto Google efficiently and comprehensively, without
enduring that 10-day orgy once a month.
My point is that there is no real effort at Google to make
any sort of accommodation on a case-by-case basis with the
so-called "deep web." Until that happens, sites such as
mine have difficulty in allowing Google's crawlers to run
amuck once per month. We have other customers to consider.
Although technologies such as frames, ASP and JSP, ColdFusion, or Flash may make it harder to design a crawler-friendly web page, such pages need not be crawler-hostile. As the article points out, the issue is how the site handles requests that contain no parameters. The incompetent designer will treat such a request as an error. The more thoughtful designer will display a useful page with appropriate meta tags.
The second issue is intellectual property and the true number of pages on the web. Suppose we create a site on the history of widgets. This site contains 10 base pages backed by a database of 100,000 widgets. Is the true size of the site 10 or 1 million pages? I would say that its size is 10 pages, and that indexing them covers 0.001% of the possible pages in a complete index. The problem is how to make these 10 pages representative of the site. It may be reasonable that a search for '1145 crusade keepsake widget' might fail, but our design should allow the more general search 'history widgets' to succeed.
Anyone who has done library research in the pre-computer age knows that it takes skill and determination to find citations. The fact that we have replaced 1 million tiny cards and 1 thousand volumes of indexes with an online database does not mean that search and design skills are no longer necessary. Unfortunately, we cannot assume that users will have the proper search skills, so we, as designers, must learn better design skills.
The only "problem" is that the Internet is simply too large for one engine to index. People go to Google expecting to search every web document that's online, a labor comparable to going to your local library and expecting their database to tell you about every book in existence on a particular topic or by a particular author. Even the Library of Congress isn't that comprehensive.
I disagree with the article's claim that "much of the most interesting and valuable content [on the Web] remains hard to find." I think that the most interesting and valuable content is easy to find, provided that you start looking in the right place. Which means that if I want information on the latest US school shootings, I don't go to Yahoo or Google and search for "school shootings", I go to those sites and search for major news sources (BBC, CNN, Reuters, etc.) and use their up-to-the-minute search engines.
The role of search engines isn't "shrinking" by a long shot; it's just becoming less comprehensive. Searching on the Web is now a two-step process instead of a one-step process, and you have to apply a little more intelligence than you could back in 1995. If high school students researching their latest humanities paper have a problem with that, well, they should ask us twentysomethings what it was like to have to use card catalogs and microfiche for our own high school projects.
Are there people that actually have problems finding stuff on the web? I can't think of one single time that I haven't found at least some relevant information on something I was looking for. Sure you'll get a few dead links, but my problem is usually that there is too much info to sort thru.
Jaysyn
There is a war going on for your mind.
Easy to fix. Session cookies that control how you are allowed to move around a site. If you have "value" and wish to protect it, then do so. Let others who want to be philanthropic do so.
Zero Sum (don't amount to much). [root@localhost]
Google consistently returns good information on every search I make. A fairly superficial, PR-ish overview of their technology is here. The gist of it is that, among other things, the number of links TO a page is considered part of the criteria for ranking. (The theory is that an important or well established page will have many links to it.)
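To make the link-counting idea concrete, here is a toy sketch in Python. It only counts raw inbound links and sorts by that count; it is not Google's actual ranking, which uses link counts "among other things". The function name and the (source, target) pair format are made up for the example.

```python
from collections import Counter


def rank_by_inbound_links(links, candidates):
    """Order candidate pages by how many distinct other pages link to them.

    links: iterable of (source, target) pairs.
    candidates: pages that matched the query.
    """
    # Count each distinct (source, target) pair once and ignore self-links
    inbound = Counter(target for source, target in set(links) if source != target)
    # Most-linked first; ties broken alphabetically so the order is stable
    return sorted(candidates, key=lambda page: (-inbound[page], page))
```

A page nobody links to still appears in the results, just at the bottom of the list.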
OTOH, human-edited directories like Yahoo and dmoz are going to have a really tough time as the web continues its exponential growth. I get so many dead links from these services that it's not worth the bother.
Sounds pretty trivial and stupid, right? Think it through - people are willing to pay >$100/year to subscribe to something like stratfor, which pretty much recapitulates information you could find with a broad sampling of news feeds. Why are people willing to pay for it? Because it sorts the wheat from the chaff, and puts it into a context that makes sense.
What's that mean in a concrete sense? Anyone care to take bets on how long it'll take Yahoo to move to a subscription model (very small, sez my money), probably one not too different from the phone book or the newspaper.
Add more spidering servers to look through stuff.
One problem is this, though. It takes hours to find what you're looking for, in most cases. You search for C++ and it finds nothing, because it uses the plus sign for something else. You search for breast cancer and it gives you free XXX hot porno sex sex sex only 500 dollars.
Why don't they improve this, one might ask.
Easy: You see those little banners at the top of the screen? Every time you load a page, new ones show up, and therefore they get more advertising in when you load their pages more. Thus, if it's harder to find things, it's your loss, not theirs. Indeed, they gain, as they get more advertising done.
Aciel
aciel@speakeasy.net
The last thing I want my browser doing is reporting my whereabouts to a central registry. My visits to /. are my secret --- even when the page times out.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~ the real world is much simpler ~~
--- -- - -
Give me LIBERTY, or give me a check.
That's the deciding factor for the most part. Although I can search for "Lexus dealership New York" and still get a hit for Hanks homepage. Maybe hank has the word dealership on it somewhere and that was all it took. However for the most part if you know the correct words to search on you will get the proper results.
Yours,
Bob
All the best,
--Bob
The lightweight stuff is, justifiably, more popular and gets more links. It's just that when one wants to go beyond Introduction to Duckspeak or Duckspeak Tutorial to Deep Duckspeak Analysis the popularity weighting starts hindering things.
Working out what's reference material still needs an understanding of context and content. Librarian is probably going to be the job of the 21st century.
People pay to advertise in the yellow pages, what's wrong with being charged to list on the Internet?
I came across http://www.mujen.com a while back. It's pretty accurate. Not as fast as others, but probably more accurate.
I think so. Google is fast, but I think mujen is better. Plus I don't want to use the same search engine that my moron users use.
Maybe search engines relying on older methods are having problems, but using Google, I honestly haven't had a problem locating material quickly at all. You just have to have the right approach in searching for things...
Like I said, most of this is common sense and redundant to most people who've searched for stuff before. But you'd be amazed how many people have no idea how to find the information they need, when you can get it in less than ten seconds, including the time needed to plan the search and type in the query. I try to use this sort of list when telling people how to find info., sort of like teaching a person to fish so they can feed themselves for a lifetime.
--------
Bleah! Heh heh heh... BLEAH BLEAH!!! Ha ha ha ha...
I was always very happy with Internet searching, so I was surprised to see an article talking about some big Internet content crisis. I see their point about the 'surface' and the 'deep' web, but these are also the same terms used in BrightPlanet's whitepaper on the subject. Since it's pretty obvious that BrightPlanet invented the term, the entire article comes into question: why didn't they draw a distinction between the company whitepaper's thoughts and facts?
And in the fourth paragraph:
An unsubstantiated 550 billion pages, or about 100 pages for every living human being? I'm no expert, but that's ridiculous.
They quoted the Google people saying how hard it is to search for anything besides text, and then spruced some BrightPlanet PR. It sounds like someone's meeting the quota at Reuters, more of that fantastic deep content we should all pay for.
Offline would only work if you downloaded a massive database of web sites. There IS a lot of desktop searching already, usually though IE plugins (like Yahoo Companion).
Yes, free, independent sites ARE tough to find, even with Slashdot's favorite Google. Every time you search for ANYTHING, the first 1000 hits are always for a commercial site. The thing is, it's because the big commercial sites have most of the information that most people find most useful. Is there a good way to change this? Not that I can come up with, unless an 'alternative' search engine is created that doesn't accept large corporate sites. But realistically, that WAS Google, and even they couldn't live on 0 revenue.
You would be able to even search offline, etc. Price tag included.
Offline? What's that?
Check out this link to the Internet & Web Yellow Pages you can get on amazon.com, I've also found it on the shelves at Walden Books. Hope this helps a bit.
Microsoft is not the answer. Linux is the answer. Microsoft is the question.
I've found that the engines cover enough to make my new site work pretty well.
Not for everything, but enough to be getting along with...
Some of the search engines have this, but Google in particular does not. Having this feature would allow one to potentially cull out a lot of dead links, given the half-life of the average link
If you can't beat them, embrace and extend them.
not true at all. some of the older search engines (most of you guys havent heard of them i suspect) dont seem to throw up relevant results but thats because the search strings are not made properly. google seems to use a different technology and the results are FANTASTIC!! av and lycos are getting selective about what they index. yahoo is a directory and not a search engine really. dmoz is so -so. newer ones similar to yahoo are still popping up and do a creditable job given their resources. the way out is to specialise in some group of subjects.
Right now i just use Dogpile.com for my searches. It sends the query to about 15 other search engines (Google included) and shows the top results of each. Works pretty well...
Not so fast.
While that is true of older, cr@ppier search engines like AltaVista and Inktomi, Google can and does index dynamic pages. (Indeed, more than 60 percent of new users to one of my sites come in via dynamically generated .cfm detail pages that have been indexed on Google.)
It seems to me that if you want your content to be indexed, getting on Google (and by extension, Yahoo, since Yahoo uses Google results in addition to its directory), is pretty darn easy. I have to say, I'm not nearly as frustrated with search engines as I was in the days B.G. (Before Google)
Why is it called COMMON sense when so few people have it?
Gee, FUCKING TEENAGE SLUTS I wonnder SEX CUNT COCK why PUSSY the PUSSY net PUSSY is GAND-BANG ANAL SLUTS getting so CUM FACIALS hard GOAT SEX to search and TITIES index? They're ASS probably using the wrong search engines and PUSSY aren't "web-savy" PUSSY enough. I can find anyting I LOLITA want GOAT SEX on the THREESOME net... You DOUBLE PENETRATION just have BOOBS to GOAT SEX know ASIAN ANAL SLUTS where and how CUM DRENCHED WHORES to look for it. GOLDEN SHOWERS.
Already been done to a certain degree. Unfortunately, these guys are about to be inducted into the Fucked Company Hall of Fame.
--
--
You sure got a purty mouth...
Google gets relevant and recent additions as everyone knows. But what I didn't know until recently is that the UK and Ireland Yahoo uses Google now. A wise move. So why doesn't Yahoo.com? Anyone know? Check out the differences between searches on yahoo.co.uk and yahoo.com.
If this trend continues, would it be an idea to make a search engine design similar to DNS?
I think that neither the people who claim that this is impossible nor the people who want to dismiss it are correct. There is undoubtedly a major problem, and it is only getting worse. The flip side of that, however, is that while we are getting farther and farther from having a complete listing of the web in search engines, the ability of end users to find what they are looking for appears to be improving, particularly with the advent of better search engines like Google.
The solution to indexing the web completely, or much more completely, has to lie in another methodology. How about a distributed solution? Google@home? distributedYahoo!.net? Honestly...there are ways to tackle the problems, and the reason why this entire system exists is because people refused to just shake their heads and say, "Nope, can't do it...sorry!"
How about a button in browsers that enables you to mark a page as a dead link? Just hit that button and a centralized system gets a reference to the URL currently in your browser. That centralized system is funded by all search engines and all search engines draw from it. Yes, I know..."What if a user falsely claims a site to be dead?" Well, what if it took 100 different IPs claiming it to be dead before it really was considered dead? If you don't get many people hitting the site from a search engine in the first place, then you probably aren't serving it up to too many people.
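The "100 different IPs" rule is easy to sketch. The Python below is only an illustration of that one rule: the class name, the in-memory storage, and the default threshold are assumptions, and a real shared registry would also need to expire old reports and deal with proxies that hide many users behind one address.

```python
from collections import defaultdict


class DeadLinkRegistry:
    """Hypothetical central registry: a URL counts as dead once enough distinct IPs report it."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self._reporters = defaultdict(set)  # url -> set of reporting IPs

    def report(self, url, reporter_ip):
        """Record one report and return whether the URL is now considered dead."""
        # A set means repeat reports from the same IP do not add up
        self._reporters[url].add(reporter_ip)
        return self.is_dead(url)

    def is_dead(self, url):
        return len(self._reporters.get(url, ())) >= self.threshold
```

One user hammering the button never gets past a count of one, which is the point of keying on distinct IPs.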
How about a system for pre-indexing an entire site, such that the person who runs it can have a single document at the root of their domain with the index results? A standard could be developed that would even go so far as to map out the existing sub-sites (for AOL personal sites, for example) so that the engine could go to each one for the index documents.
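As a sketch of what such a pre-built index document could contain, here is a small Python example that maps each word to the pages containing it and serializes the result as JSON. No such standard exists in the post; the JSON layout, the function names, and the idea of passing pages in as a path-to-text dictionary are all assumptions for illustration.

```python
import json
import re


def build_site_index(pages):
    """Build a word -> sorted list of page paths mapping.

    pages: dict mapping a page path to its plain text.
    """
    index = {}
    for path in sorted(pages):
        # Lowercase and split on anything that is not a letter or digit
        words = set(re.findall(r"[a-z0-9]+", pages[path].lower()))
        for word in words:
            index.setdefault(word, []).append(path)
    return index


def dump_site_index(pages):
    """Serialize the index as the single document a site owner would publish."""
    return json.dumps(build_site_index(pages), sort_keys=True)
```

A search engine would then fetch this one document instead of crawling every page, at the cost of trusting the site owner's own index.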
I guess that what I mean to say here is that the problem is largely based around the hugeness of the web, and how brute force is no longer enough. But that's not really that big a problem...all that's needed is a bit of creativity.
For your security, this post has been encrypted with ROT-13, twice.
This is not a well developed idea, but i've been envisioning a gnutella-type sharing community where links are shared and pages mirrored. (i usually think of something like this every time a site is slash-dotted...:) a plugin in your browser could keep track of where you go, and what is to be found there. Then searches could be run against those individual db's by those in the community. Generous individuals could choose to mirror sites that they think are popular or overloaded. Kinda like google, i guess, but with a community of people doing this, it could be more scalable (possibly directly proportional to the growth of the web.) Please lemme know if y'all think this is lame, or potentially feasible.
"Ummmm..."
It may be thousands of years old, but it has stood the test of time. It has no annoying banner ads and very little porn to distract you from what you were actually searching for. What is the name of this search engine? The Holy Bible. No matter what subject you are interested in, the Holy Bible has something to say. From Geekiness, to Installing Linux, to how to get a date, to what to eat. It's all in there. I realise I will be marked as flamebait by the anti-religious slashdot zealots but if just one person is saved by my advice it will have been worth all the negative moderation in the world.
Now they haven't indexed the entire web but 1.3 billion pages is pretty impressive... I don't need more than that... And if you can't find a page that you like in there then bugger to yah...
A totally new approach could be that you don't search but interesting web resources gets recommended to you by your personal agent. We are currently working on a peer-to-peer system that doesn't exchange files but exchanges recommendations for web sites.
Nice, but no replacement for traditional web search. When I search the web, I usually search for very specific information, e.g. an XF86Config file for my laptop computer, scientific papers on 3D user interfaces, or a manual for my office telephone. Search engines like Google do a good job pointing me directly at such resources and I believe they do because of their KISS approach of indexing every page they can get hold of and ranking the search results.
When searching for specific stuff, I'm interested in exactly the stuff I search for, sometimes only a few bits of information, not sites which may contain that stuff. I think it is quite unlikely for my friends or other competent persons to recommend exactly what I'm searching for. They are more likely to recommend sites, i.e. collections of interesting information, and a few outstandingly interesting single items.
What I'd expect to get recommended with respect to my examples above would be Linux on Laptops, Citeseer, and some Siemens or telecommunications site. But compared to a traditional search engine, these recommendations would not make my life easier. Instead, they would add an unnecessary level of indirection to my search.
This does not mean your approach is useless, but it covers a different field of gathering information. I think a recommendation system is more suitable for keeping track of what's going on in the world, i.e. find out what's new and cool in one's fields of interest. Your concept is just closer to /. than to a traditional search engine, so it will be used more like /..
http://erichsieht.wordpress.com/category/english/
The trend is toward pay for listing. Will the free, searchable web fade away?"
It's not a trend, it's companies attempting to keep afloat in what's becoming a bull market. It's amazing to see how companies like Google stay in business when they show few methods of collecting any kind of revenue. E.g., the only means of Google obtaining revenue is what? Charging a company for a copy of its search engine? Why would a company pay for a search engine when the market is overflooded with them?
Ad based revenue, we all know where those click me businesses are going.
We also know most of the "web rings" never went anywhere, but for a search company to think people would pay for finding something on the net, they'd be shit out of luck, maybe corporations may do this, but I'd just make my own search engine (freely distributed) post it somewhere and let the whole "submit your site for free" revolution take place again.
Privacy Info
360 degrees of Karma
Maybe you should try Subme. They use an intelligent search engine. It's quite handy sometimes. You need to know only a little about the subject, and when you find anything you just select one of the buttons -, +/-, +. That's easy...
If the sites themselves are complaining about no one able to find their content, aren't there ways to help that? Run a query on their database site to generate a possible site list of the content and then provide that list to the search engines. The search engines could then provide a link (found based on a content search) that would put the user on the page where they enter the form (or whatever) information to generate the page needed. Not being familiar with XML, but knowing that it has some features to aid in content grouping, could this be needed to recode the sites in?
Obviously if the sites themselves don't want this deep content easily viewed except by deep clicking through their whole site, or some pay-per-view system, that is their choice. I feel that they are limiting themselves, however. If they think they have robust enough content to be useful to users, they should strive to make that content as widely available as possible.
Should proprietary websites even be considered as 'Internet-web content'? Those seem to me to be 'Intranet content' which most often should not be seen by the general public (ie: internal company policies only needed by employees of company X). For that information to be set free you should either need a very savvy person to break in from the outside or a traitor from the inside. If it's only certain products listed that the company doesn't want available to the public, well, that is too bad for them, I'll just get a quote elsewhere and pay someone else my money.
"evidence of a widening gap between the deep Web and the freely-accessible 'surface Web,' which could become a clutter of recreational and amateur-oriented content -- the online equivalent of public cable access television or self-published novels." Funny, ever since the late eighties, I've always seen the whole web like this. It's more like the big corporations tried to muscle in on the public cable channel and realized they might be better off on their own channel.
Not your normal AC.
Well, I'll say that finding anything decent on the web has gotten a lot more difficult. My favorite search engine, www.alltheweb.com (fast search), is becoming less appealing. With every search engine I have tried nowadays, I could be searching for bread recipes, and the first 1000 results to come up are either HARDCORE XXX FREE NO CREDIT CARD or something like "Yes, we have searched all bread recipes and have high quality bread recipes for the taking". Click it, and it's really some site like Best of the Web, which happens to be mysteriously behind most of those "we have high quality" type URLs popping up in searches. Web searching has moved in status down to... SUCKS!
I searched "Paris France Auto Repair" in Google and found the following address: Garage Carlos 9-11 Rue Riquet - 75019 Paris +(33) 1 46 07 03 48 You're Welcome. Next question??
My father is a blogger.
Much fuss is made about the search engines needing to "fix the problem" of not being able to index sites like microsoft.com because the pages are dynamically generated. Is this really a problem?
Microsoft (or whatever other dynamic site you wish to pick) chose to make their content unindexable. Don't try to make it someone else's problem. Let people who use the search engines find third-party information instead. If the site designers wanted their site in the search engines, it would be there. Many of the sites built with ColdFusion or ASP contain basically static information anyway, and making them dynamic just reduces your traffic.
Sites like Slashdot are dynamic. A search engine can't be expected to keep up with something that changes every 30 seconds. However, making all of the archives static HTML allows them to be searchable by the engines and takes some load off the server, to boot.
I went for a "best of both worlds" approach on my personal site by writing a perl site generator. Each time I update the site, I re-run the site generator, which takes about a minute. My server carries a lighter load, but I still have "dynamic" links to related articles and such that the site generator builds.
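For anyone curious what a site generator like that boils down to, here is a minimal sketch in Python rather than perl. It is not the poster's generator: the article dictionary layout, the tag-based notion of "related", and the function names are assumptions. It builds every page as a string keyed by filename; writing the strings to disk is left out.

```python
import html


def related(slug, articles):
    """Return slugs of other articles that share at least one tag with this one."""
    tags = set(articles[slug]["tags"])
    return sorted(other for other, meta in articles.items()
                  if other != slug and tags & set(meta["tags"]))


def render(slug, articles):
    """Render one article as static HTML with precomputed related links."""
    meta = articles[slug]
    links = "".join(
        '<li><a href="{0}.html">{1}</a></li>'.format(
            other, html.escape(articles[other]["title"]))
        for other in related(slug, articles))
    # The body is assumed to be trusted HTML written by the site owner
    return "<h1>{0}</h1>\n{1}\n<ul>{2}</ul>\n".format(
        html.escape(meta["title"]), meta["body"], links)


def build_site(articles):
    """Return a filename -> HTML mapping for the whole site."""
    return {slug + ".html": render(slug, articles) for slug in articles}
```

The "dynamic" part happens once, at build time, so the web server only ever hands out plain files that any crawler can index.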
You said the same thing two years ago!
I don't think anyone is saying that the problem is that people won't pay for it. The task of indexing the web is just too big for humans, too complex for computers, and impossible anyway because the web changes too fast and the content is increasingly stored in dynamic databases, not in (relatively) static text files. The problem is only going to get worse as the rest of the world comes online.
Send lawyers, guns, and money. Dad, get me out of this.
Works pretty well, is updated regularly (they add new search engines all the time) and even includes a number of different "categories" like e-mail, dictionaries, auctions, etc. that let you narrow your search.
Well worth a look, IMHO, and no...I don't work for them or own any stock in the company.
-Coach-
Perhaps the world's greatest tragedy is that ignorance is not impotence.
*ahem*
"The Google Web Directory, organized by topic"
http://directory.google.com/
sheesh....
one thing i can tell you is you got to be free
That's depressing, and wrong, from a technologically ethical point of view. The web as a medium requires openness. Here's a novel suggestion, if somewhat heretical. Maybe the current slashdot format does not scale as well as might be hoped. I'm not knocking the slashcode. Hey, I'm posting here, aren't I? My point is that maybe it needs to be organized differently than it currently is. Just a thought. Or maybe I don't know how to navigate around in it well enough.
evanchik.net
The Web is a victim of its own success. Now every snake-oil salesman, fanboy and their grandmother has a website.
Even Slashdot is too big. How the hell are you supposed to follow a conversation this big.
especially with the goatsex.
I'm gonna start mailing postcards.
Excelsior,
ME
evanchik.net
IMHO, the web searching applications for your desktop are going to be the next wave. You would be able to even search offline, etc. Price tag included.
That it was wrong then doesn't mean it is wrong now. That it is wrong now doesn't mean when the claim is made again a year or two down the road it will be wrong then.
--
I'd like to see web sites rated by category in a way similar to what Slashdot does. You could probably create a company based on the idea.
Actually, this just is a great opportunity for the next Great Search Engine. Look at how well Google has done just indexing a small portion of the web (1%, according to the article). So that leaves the door wide open to anyone who can crack the puzzle of how to keep up with the web. If word gets around that something is better than Google, it'll be huge. You can say "oh, no one can index the whole web accurately," but there is someone out there with the brains and courage to try it -- and succeed.
Well, I think this article is saying basically, "free search engines are doing a piss poor job....hmm, I wonder if I can find a decent one for free, or will I have to *gasp* pay for one? (which would obviously be superior!)" ah, people, I will never understand them
-Tar Ciryatan, Angry Hermit-
Well, sadly, people seem to enjoy spam here....and I thought this place was full of decent conversations and saying that, lets see how many spammers yell at me.
-Tar Ciryatan, Angry Hermit-
I'm one of the authors of Sparkseek, a remotely-hosted search service. I'm also a student at Pennsylvania State University. I want to give you an idea of what kind of problems researchers in the field of internet text retrieval have to deal with.
Larry Page, one of the co-developers of the Google search engine said in his 1997 research paper entitled "The Anatomy of a Large-Scale Hypertextual Web Search Engine" that the primary benchmark for information retrieval, the Text Retrieval Conference, uses a fairly small, well controlled collection for their benchmarks. The largest benchmark they have available is only 20GB compared to the 147GB from Google's crawl of 24 million web pages. Today, Google has over 1.4 billion web pages in their database and a reported 4,000 node linux cluster.
One of the problems I have encountered, and that I've found difficult to deal with, is the sheer amount of redundancy in web content. Anybody who has ever tried a search for any linux command has no doubt encountered hordes of duplicate MAN pages in their results.
Not only that, but I honestly don't believe that when it comes to search engines, more is better. I have noticed over the past 6 months, as google has made great increases in its index sizes, that results have consistently become worse and worse. Search engines really need to begin narrowing the focus of their index and creating multiple indexes. Educational institutions should be separated from commercial establishments.. if I'm performing research on some subject, the last thing I want is to arrive at a commercial establishment pitching some product.
Also, the method google utilizes when creating their indexes creates a huge scalability problem. Their indexes are updated less frequently than ever, and if you read their document that was published in '97, it's not hard to see why.
Michael Tanczos
A totally new approach could be that you don't search but interesting web resources gets recommended to you by your personal agent. We are currently working on a peer-to-peer system that doesn't exchange files but exchanges recommendations for web sites.
It's much like a good friend suggesting that you have to look at an interesting web site. You can see all the marketing blurb at http://www.iowl.net/. At the moment this is a seminar paper by some people (including me) at the Wuerzburg University of Applied Sciences. We have a working prototype that will hopefully be released in about a month or so.
Ironically, when searching for Newtonian's palace, Google will find sites linking to my page but not my page itself.
There are 2 kinds of people in this world: Those who write in decimal and those who don't
Splearch