Web Searches For What Lies Beneath
fat_hot writes: "The New York Times has an article [here] (registration required) about specialized search engines which try to drill into the submerged mass of the Internet iceberg to try to limit searches to particular subjects (and hopefully thereby increase coverage of the limited scope)." Considering that a Google search for friends' web sites and other good stuff usually turns up more dirt than paydirt, it's pleasant to contemplate more relevance in search engines.
I found an essay on this subject just yesterday. It can be found at http://www.lucifer.com/~sasha/articles/ACF.html and the author goes on a bit at times, so make yourself some coffee and print it out to save your eyeball(s). It's all about what he calls Automated Collaborative Filtering and Semantic Transport. Of course, I rarely have more than a little trouble finding what I'm looking for, but that may just be that I think I've found the most relevant info, not that I actually have. This paper lays some of the theoretical groundwork for revamping search technology. However, I would be hesitant to give up on the current engines. I think a "smarter" search should be regarded as an addition to the current toolset, not as a replacement. (End user moderation would help cut down on the detritus currently clogging the pipes though!)
These guys are claiming responsibility for it.
Here is a story from Wired about it.
However, I suspect that whatever the answer to the search engine problem actually turns out to be, it will have the following characteristics:
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Huh? Where'd I say that? Clearly your school isn't doing its part in teaching research & attribution.
On the contrary, I believe many schools *are* doing their part. No, not all, but many. Tragically, school libraries & school librarians have been tremendously short-changed in the past few decades, ironically often in order to fund sexy things like computer labs.
The truth is that the skills one needs to use in a library are even more critical now than they were in the past. As you correctly pointed out, card catalogues are dead; I can't think of any post-HS system that still seriously maintains one. Unfortunately the helpful Reference Librarian willing to walk a random person around and re-teach them the ropes has also been budget-cut out of existence. With the information explosion and the information economy, the ability to search, prioritize, and compile material has become even more critical (not to mention the ability to comprehend the materials).
Corporate knowledge-bases, electronic paperwork, web-based 'employee handbooks', online job searches & apartment rentals; these all require the ability to search for information in an efficient and comprehensive way. Search-engine cluelessness is simply a symptom of a wider problem.
That said, again, I believe schools are doing a reasonably good job. I know my old elementary & high schools are teaching kids how to use search engines, as is my old university library. My concern is for those out of the educational system.
Reading the directions doesn't seem onerous to me. If one is performing searches and coming up empty or with useless material then figuring out how to fine-tune one's searching doesn't seem to require any great intuitive leap. Yes it would be wonderful to live in a world as trivially comprehensible as the doorbell but lacking that most folks have learnt to READ THE DIRECTIONS.
Generally search engines do a great job of explaining how to use them. There are even search engines that try to out-think the user and parse their natural-language requests into regular search expressions. Google isn't one of these engines; it's a high-powered bare-to-the-metal engine that requires a certain amount of understanding by its users to use. On the other hand there are literally dozens of other engines that *do* walk a person through performing a decent search. The fact that folks pick the wrong tool for the job (a tool they neither know how to operate nor are willing to invest the 1-screen/2-minutes to learn) and then complain about their results seems to be just idiocy on the part of the user (or in this case an article author.)
Yes, the original article clearly set up a straw-man in order to promote these dedicated search engines, on the other hand there are legions of folks who continue to use search-engines every day with poor results and do complain about them.
The solution? I dunno - sell them more lottery tickets?
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.
That point aside I'm trying to figure out the rest of your posting. You don't like the fact that different search engines use different formats? Well pick one and just use it. You prefer a GUI interface instead of a command-line type one? There are lots of those. You'd prefer a walk-through format? There's lots of those too.
I think you've got a point somewhere but I can't find it. I suppose my only comment would be that folks should, again, pick tools suited for the job. If it's not worth it to them to learn a search syntax then they shouldn't use a search engine that relies on one (DUH!) Google requires a syntax, many others don't; use one of them.
As to search engines getting tricked into returning misleading hits, yeah, that's a problem but not a big one. So 5% or even 10% of the hits are come-ons to porn sites; there's still going to be ~30% good hits (the rest misses of varying degrees) and that's enough to be productive with.
Finally - don't tell someone not to be "smug and negative". I could insert some comments here about the apparent tone of your posting but that wouldn't be productive; let's just say I don't see those in my posting & drop it.
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.
Most of us recall being brought into the school library and shown how to use the card catalog, given a few assignments, etc. Unfortunately, for those of us out of school there's no comparable set of skills in place to help with searching.
Boolean searches, using key words, supplying partial words, phrases, etc. are all supported by most search engines but few folks understand how to use them.
What's really surprising to me is that folks who use search engines regularly, indeed even rely upon them (journalists, I mean you!) seem some of the most poorly prepared. There are lots of resources for learning how to do a good search, many from the search engines themselves and many more from third parties, yet we still get these perennial "I can't find ..." stories.
Honestly, I'm not into blaming-the-victim but how difficult is it to learn how to perform a good search? One screen of directions? Two minutes of time?
Yes, there's a place for specialized engines handling unique or limited content, but most of the larger, more general-purpose engines do nearly as well if properly used. Again, it's dependent on the user to learn how to define what they want; all of the tools in the world are no good if they're not taken advantage of.
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.
That's where specialty search engines like Moreover come in. Eventually, sites like this will let you search those bits of the Web that change often (news sources, weblogs, discussion groups, sites like Slashdot, message boards, financial news, etc.), allowing people to keep up with things as they happen.
Existing search engines are great at finding things that are archived on the Web, but poor at keeping up with what's currently happening. Looking for all the articles on the latest Shuttle mission, as well as what people are saying about it? You might find one or two things about it on Yahoo! or Google, but a search engine like Moreover will find the fluff article on CNN, the more in-depth article on Space.com, and a discussion about the mission on Slashdot. That's pretty powerful.
Or a review system. There is a way to do it, although it might be a pain in the ass. Basically, what you need is a web of trust and digital signatures.
For example, suppose I have a list of keywords. I submit my page to a reviewer, and they judge whether or not my keywords are a reasonably good match for my page. If I pass the test, they PGP-sign my page.
Then you just have a modified search engine that only returns pages that have a valid signature by someone who is on a list of authorities that the searcher trusts.
This type of thing could be used for a more general web page rating or reviewing system. It's just that perhaps some reviewers might judge pages solely on the criterion of meta tags matching the content.
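A rough sketch of how the filtering side might look, in Python. Everything here is an assumption made for illustration: the reviewer names, the example.org URLs, and the page records are made up, and HMAC with a per-reviewer secret stands in for real PGP public-key signatures purely so the sketch runs on the standard library alone (a real web of trust would verify against the reviewer's public key instead of sharing a secret).

```python
import hashlib
import hmac

def sign(reviewer_key, page_text, keywords):
    """Stand-in for a PGP signature over the page and its claimed keywords."""
    message = (page_text + "\n" + ",".join(sorted(keywords))).encode("utf-8")
    return hmac.new(reviewer_key, message, hashlib.sha256).hexdigest()

def is_valid(reviewer_key, page_text, keywords, signature):
    """True only if the page and keywords are exactly what the reviewer signed."""
    expected = sign(reviewer_key, page_text, keywords)
    return hmac.compare_digest(expected, signature)

def trusted_results(pages, reviewer_keys, trusted_reviewers, query):
    """Return URLs whose keywords match and carry a valid trusted signature."""
    results = []
    for page in pages:
        if query not in page["keywords"]:
            continue
        for reviewer, signature in page["signatures"].items():
            key = reviewer_keys.get(reviewer)
            if (reviewer in trusted_reviewers and key is not None
                    and is_valid(key, page["text"], page["keywords"], signature)):
                results.append(page["url"])
                break
    return results

# Hypothetical reviewers and pages, for illustration only.
REVIEWER_KEYS = {"alice": b"alice-secret", "mallory": b"mallory-secret"}

honest_text = "How to care for hamsters."
honest_keywords = ["hamsters", "pets"]
spam_text = "Buy cheap stuff now!"
spam_keywords = ["hamsters", "pets", "news"]

PAGES = [
    {"url": "http://example.org/hamsters", "text": honest_text,
     "keywords": honest_keywords,
     "signatures": {"alice": sign(REVIEWER_KEYS["alice"], honest_text, honest_keywords)}},
    {"url": "http://example.org/spam", "text": spam_text,
     "keywords": spam_keywords,
     "signatures": {"mallory": sign(REVIEWER_KEYS["mallory"], spam_text, spam_keywords)}},
]

# A searcher who trusts only alice never sees the page vouched for by mallory.
hits = trusted_results(PAGES, REVIEWER_KEYS, {"alice"}, "hamsters")
```

Because the signature covers the page text as well as the keywords, editing the page after review invalidates it, which is the property the scheme depends on.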
---
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
I've searched the internet many times with a search program and found nothing that I was looking for. It's very upsetting when you're trying to find something. I used to like Ask Jeeves (www.aj.com), but it still wouldn't find what I was looking for. Then I found Google (www.google.com) and was very happy to find that it did in fact find what I was looking for most of the time. Why can't all search engines look for what you type in?
I think part of the problem lies in the fact that they match words all over the website. That is, if I type in "hot green hamsters", the words Hot, Green, and Hamsters can appear anywhere on the homepage; even if I put them in quotes, the search programs don't always group them together. So a page talking about hot peppers, green peppers, and how hamsters eat the pepper gardens in Mexico would come up in a search, even though it wasn't anything about what I was looking for.
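To make the difference concrete, here's a rough Python sketch. The function names and the two sample pages are invented for illustration, and no real engine is claimed to work this way. The first check is the loose any-order match that lets the pepper page through; the second insists the words sit next to each other.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_all_words(page_text, query):
    """Loose match: every query word appears somewhere on the page."""
    words = set(tokenize(page_text))
    return all(word in words for word in tokenize(query))

def contains_phrase(page_text, query):
    """Strict match: the query words appear consecutively, in order."""
    page = tokenize(page_text)
    phrase = tokenize(query)
    n = len(phrase)
    return any(page[i:i + n] == phrase for i in range(len(page) - n + 1))

pepper_page = ("Hot peppers, green peppers, and how hamsters eat "
               "the pepper gardens in Mexico.")
hamster_page = "A fan page about hot green hamsters."

query = "hot green hamsters"
loose_hits = [p for p in (pepper_page, hamster_page) if contains_all_words(p, query)]
strict_hits = [p for p in (pepper_page, hamster_page) if contains_phrase(p, query)]
```

Here the loose match returns both pages, while the phrase match returns only the hamster page.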
...that average people are morons.
IIRC, Google uses an algorithm that, based on a combination of HTML tag size and logged click-throughs would sort the links. Neato-keen.
Well, about a year ago when Google was still young and fresh, you could type in your search strings, hit "I'm Feeling Lucky" and get EXACTLY what you wanted. Blew me away time and time again with its strange accuracy.
But, as more of that click-thru data got integrated into the sifting, I got more and more of the crap that the sheep (ie, normal mom and pop AOLer types) wanted to look up. What the hell, man. Don't get me wrong, I still use Google, but now I have to scan three pages deep before relevant pages come up.
Dirk
I keep trying to pick fights, but I can't shake this Excellent karma.
Go to Google. You know where to find it.
Punch in "Dumb Motherfucker".
Click "I'm feeling lucky".
-=Best Viewed Using [INLINE]=-
And it's been around for a while as a concept. I used to work for SpaceRef, who maintain an excellent niche search engine devoted to space exploration.
I maintain Omphalos which is a niche search engine devoted to the modern alternative religions (Paganism, Wicca, etc) and related subjects.
All it really requires is a reliable collection of websites focusing on a specific range of subjects and good search engine software to index their pages. The results are often much more relevant than those from the major search engines - although Google is generally an excellent choice IMHO.
"The first time I got drunk, I got married. The second time I bought a chimpanzee, after that I stayed sober" Arian Seid
What did they expect? Google can't read minds yet.
Bunch of mojacks.
Tony
Yes you are blaming the victim. The basic concepts of searching take less time to learn than fancy terms like "boolean". Ideals are nice, but the devil is in the details. Search engine sites perform a difficult task and some do a first rate job. For that they should be thanked, but nothing is perfect.
What confounds the user mostly are all the syntaxes used to express those concepts. They are different for every site and take some getting used to. It would be neat to see a search engine with more than one line for input. You could have a box for exact phrases, one for any-word matches, an exclusion box... It's not that command line syntax is ugly, it's that most people have better things to memorize.
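For what it's worth, the multi-box idea is not hard to sketch. The Python below is only an illustration under my own assumptions: the three parameters stand for the three boxes, and the sample pages are made up. Each box that is left empty simply imposes no condition.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(page_text, exact_phrase="", any_words="", exclude_words=""):
    """Apply the three form boxes to one page; empty boxes are ignored."""
    tokens = tokenize(page_text)
    token_set = set(tokens)

    # Box 1: the exact phrase must appear as consecutive words.
    phrase = tokenize(exact_phrase)
    if phrase:
        n = len(phrase)
        if not any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
            return False

    # Box 2: at least one of these words must appear.
    wanted = tokenize(any_words)
    if wanted and not any(word in token_set for word in wanted):
        return False

    # Box 3: none of these words may appear.
    if any(word in token_set for word in tokenize(exclude_words)):
        return False

    return True

linda_page = "Linda Chavez withdrew as labor secretary nominee."
cesar_page = "Cesar Chavez founded a farm workers union."

wanted_hit = matches(linda_page, exact_phrase="linda chavez",
                     any_words="labor union", exclude_words="cesar")
unwanted_hit = matches(cesar_page, any_words="chavez", exclude_words="cesar")
```

The user never has to learn that one site wants a plus sign and another wants AND NOT; the form carries the syntax.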
Another thing that confronts the user is the efficiency of the search itself. Very clever people constantly seek to fool search engines, and occasionally do. The result is garbage to wade through until the search engine can recover. I remember a time when all search.com would retrieve was porn sites. Even Google has been beaten a few times.
Let's not be so smug and negative. Look for the opportunities presented by user confusion. Be happy that these new search engines are coming.
Friends don't help friends install M$ junk.
If your friends have sites but not too many people link to them, they won't rank too highly in Google's eyes, will they?
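Here's a toy illustration of why, in Python. It is a simplified version of the published PageRank idea run on a made-up four-page web, not Google's actual production ranking; the page names, the damping factor of 0.85, and the iteration count are all assumptions for the sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical web: nobody links to "friend", everybody links to "portal".
toy_web = {
    "portal": ["news"],
    "news": ["portal"],
    "blog": ["news", "portal"],
    "friend": ["portal"],
}
ranks = pagerank(toy_web)
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy web the friend's page ends up with only the baseline share, because rank flows in through inbound links and it has none.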
A Google search for 'dumb motherfucker' will yield George W. Bush's website, how inaccurate could Google possibly be?
"a Google search on "chavez" led to several encyclopedia entries on Cesar Chavez" Would it have fucking killed them to type in "Linda Chavez labor secretary"? And this was very recent news, exactly how quickly do you expect Google to scan the entire internet for updates? How quickly could these 'iceberg drilling' search engines possibly scan the net? It's a deep web right now, what's invisible will bubble to the surface if it's relevant... Maybe they have a point on using the search engines to only scan specific areas, but I think websites which specialize in these areas should license the Google engine instead of Excite's... (you know what I'm talking about right? Every big site has some article you want to find, you go to look for it, you get the worst search interface possible that doesn't return any useful links...)
--
Peace,
Lord Omlette
ICQ# 77863057
[o]_O
Yeah, and it would be great if nobody stole money and gave to charity, too. It just isn't going to happen. Any system that is A) valuable and B) depends on everyone behaving honestly is doomed to failure. You're never going to get people to stop cheating the search engines as long as doing so is both possible and beneficial to the cheaters. The plain fact is that manipulating the system works, and people are going to keep doing it as long as it keeps working. The only solution is to develop a system that is not easily manipulated.
Perhaps you should try looking at Google, a search engine that actually uses these in a clever way as the key part of its ranking system. It's remarkably effective at finding relevant information and at avoiding the kinds of simple manipulation you complain about. Other ranking schemes (like GoTo.com's straight pay for placement system) are also relatively resistant to manipulation. I think that the long term solution is going to be natural selection; search engines that are easy to manipulate to give lousy results will go out of business and leave behind the ones that are actually useful.
Good luck. The latest versions of Google include over 1 billion pages. Manual sifting for poorly labeled ones just plain isn't an option if your primary goal is comprehensiveness.
There's no point in questioning authority if you aren't going to listen to the answers.
The proverbial iceberg of data on the net lives in databases not accessible to search engines as we know them today. The power and complexity of the little engine that could would be far too sophisticated for the public to be allowed access to. It'll be interesting to see how they pull off the privacy end of the whole thing...
"Helping to keep you two steps ahead of the Thought Police!"
Why does one need cheesy dotcoms to tell us what a directory is?
A directory search limited to U.S. newspapers immediately brings up, say, an explanation by Linda Chavez about her relationship with the illegal alien in question.
If one wants political news, one can go to a political news source. If one wants information on Linda Chavez, one can do a more specific search. If one wants political news about Linda Chavez, one can (this must be getting very complex for your average dotcom founder) search a news archive.
-- Stanislav Shalunov
As the article says: "People may know to come to the library, but they probably do not know which reference books to pull off the shelf. Of course, in such cases, patrons can at least consult a reference librarian."
In the example given by the article a "linda chavez" or "linda chavez labor secretary" query would be much better than the ordinary "linda".
Moreover, there exists the problem of determining the category of what is being searched. A trend is the use of AI and ontologies by the search engines, which determine what is really relevant in a page and classify it during the indexing phase based on the different categories (economy, medicine, technology, entertainment, ...) defined by the taxonomy used. In other words the idea is to search the meaning not the words (see also www.oingo.com).
What the article talks about are the knowledge based agents. A quite interesting article can be found at: http://www.cs.technion.ac.il/~cs236512/www-search-lab/ka/KnowledgeAgents.htm
Another interesting link:
- CMU World Wide Knowledge Base (Web->KB) project:
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/index.html
Examples:
Searching for "John Smith" should return my friend John Smith and no one else.
Searching for "C++ implementation of Knuth algorithms" should return exactly that, and leave out references to C++, Knuth, or algorithms.
At the very least, large search results should immediately separate the mass of results into categories - i.e. "Jessica Alba" - up at the top should be pr0n - fan sites - commercial sites - etc. Yahoo does this, but there are way too many categories. Really, the web has maybe 10-12 different broad types of sites - commercial, homepages, academic sites, pr0n, multimedia, weblog - you get the point, the list isn't that long. We should be able to filter entire broad categories out of our searches. Altavista does a fairly good job with multimedia searches - unfortunately there still is way too much manual searching - it still doesn't read our minds enough within the broad category search.
Google uses PageRank to determine the order of results, but does it track the sites its users click on after performing a search? No, but it should. Further, it should track users individually and be able to customize its results based on that person's individual preferences. The more you use a search engine, the better it should work for you.
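A bare-bones sketch of what that per-user tracking could look like, in Python. This is purely hypothetical: the class, the boost of 0.5 per click, the user names, and the example URLs are all made up, and nothing here describes what Google actually does.

```python
from collections import defaultdict

class ClickTracker:
    """Remembers which results each user clicked and boosts them later."""

    def __init__(self, boost_per_click=0.5):
        self.boost_per_click = boost_per_click
        # clicks[user][url] -> number of times that user clicked that url
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, user, url):
        self.clicks[user][url] += 1

    def rerank(self, user, results):
        """Reorder (url, base_score) pairs using this user's click history."""
        history = self.clicks.get(user, {})

        def personal_score(item):
            url, base_score = item
            return base_score + self.boost_per_click * history.get(url, 0)

        return [url for url, _ in sorted(results, key=personal_score, reverse=True)]

# Base ranking as the engine would return it for everyone.
results = [
    ("http://example.com/celebrity-gossip", 1.0),
    ("http://example.com/shuttle-news", 0.9),
    ("http://example.com/knuth-in-cpp", 0.8),
]

tracker = ClickTracker()
tracker.record_click("dirk", "http://example.com/knuth-in-cpp")
tracker.record_click("dirk", "http://example.com/knuth-in-cpp")

for_dirk = tracker.rerank("dirk", results)
for_newcomer = tracker.rerank("newcomer", results)
```

A newcomer sees the base order, while the user with a click history sees the page they keep choosing moved to the top. The privacy cost of keeping that history is a separate question.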
I can't stress this enough: A search engine needs to be able to read our minds.
No, Thursday's out. How about never - is never good for you?
I asked Jeeves "Where can I find a good search engine?" and was directed to a really good site where I can buy engine parts for my car online.
Thanks for nothing you bastard butler!
-----
"The only difference between me and a madman is that I'm not mad." - Salvador Dali (1904-1989)
I disagree. I continually find close matches using Google, much better than anything I used previously (Hotbot was good for a while).
When Yahoo started using them I rejoiced. It was the best of all possible worlds (good search engine, web of content like the calendar, and hand-picked sites when all else failed).
- I don't care if they globalize against free speech. All my best free thoughts are done in my head.
90% of everything is junk
In truth, it may be more than that.
So we come to the needle in the haystack, and how the databases that the search engines consult give priority to different terms, how they index the various sites, and how long it takes.
Of course, for the person truly expert in these things, these are trivial details. They are as obvious as a traffic jam. For the rest of us, it is more a matter of "where did all these cars come from?"
Unlike our computer, there is no central index for the full content of the web. It is a job that is done continuously at a surface level, and takes a month or two or three.
In that context, of course last night's news will not get indexed while we wait.
Just like the tradition of game installation, search engines have been designed to be used by people who have a clue.
Sometimes I swear that until we get a system designed by geniuses to be used by idiots, we will need to have some sort of internet user license or something. Otherwise it is simply a matter of designing systems that can obey the command:
"Do what I want, not what I say."
This is an interesting problem in programming, is it not?
"It is a greater offense to steal men's labor, than their clothes"
The example they used at the beginning of the article was fixed, they just typed "chavez" into a search engine, not "linda chavez". Of course they got tons of irrelevant links. You'd think the reporter could have picked up on what a bad example this was. I'm not saying that the search engines don't have flaws, but they could have picked something that demonstrated their point much better.
The problem isn't the searches, it's the people who make the webpages.
Why doesn't everyone use metatags properly? What about specifying good (descriptive) title tags?
Plus, don't you think it would be much easier if people actually didn't try to cheat search engines?
In actuality there would be some very easy ways to score pages for relevance then:
1) The number of times a particular word shows up in the keywords, and description of the page.
2) If the word actually appeared in the title of the page.
3) The number of times the word appears in the body of the text
4) The length of the supposedly searched word
5) The number of times a particular page is linked to.
6) The words used in the link
7) The number of times the linking page is linked.
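The seven criteria above can be rolled into one score. The Python below is only a sketch under stated assumptions: the weights are arbitrary numbers picked for illustration, the Page and InboundLink structures are invented, and criterion 4 is read as "longer search words are more specific, so they count a little more".

```python
import re
from dataclasses import dataclass, field

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

@dataclass
class InboundLink:
    anchor_text: str           # the words used in the link (criterion 6)
    linker_inbound_count: int  # how often the linking page is linked (criterion 7)

@dataclass
class Page:
    title: str
    keywords: str
    description: str
    body: str
    inbound: list = field(default_factory=list)

def score(page, word):
    """Combine the seven criteria with arbitrary illustrative weights."""
    word = word.lower()
    meta_hits = tokenize(page.keywords + " " + page.description).count(word)  # 1
    in_title = 1 if word in tokenize(page.title) else 0                       # 2
    body_hits = tokenize(page.body).count(word)                               # 3
    word_length = len(word)                                                   # 4
    link_count = len(page.inbound)                                            # 5
    anchor_hits = sum(1 for link in page.inbound
                      if word in tokenize(link.anchor_text))                  # 6
    linker_popularity = sum(link.linker_inbound_count
                            for link in page.inbound)                         # 7
    return (2.0 * meta_hits + 3.0 * in_title + 1.0 * body_hits
            + 0.1 * word_length + 1.0 * link_count
            + 2.0 * anchor_hits + 0.1 * linker_popularity)

# Two made-up pages: one really about hamsters, one that mentions them once.
hamster_page = Page(
    title="Hamster care",
    keywords="hamsters, hamster care",
    description="All about hamster care",
    body="A hamster needs a wheel. Feed your hamster daily.",
    inbound=[InboundLink("hamster care guide", 10)],
)
pepper_page = Page(
    title="Pepper gardens",
    keywords="peppers",
    description="Growing peppers",
    body="A hamster ate my peppers.",
)
ranked = sorted([pepper_page, hamster_page],
                key=lambda page: score(page, "hamster"), reverse=True)
```

Note that criteria 1 through 3 depend entirely on the page author being honest, which is exactly where the cheating comes in; only 5 through 7 are outside the author's direct control.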
Wouldn't the world be happier? Personally, I think that it would be great if there were an editing team that would simply delete misrepresented pages.
Anyway. That's my two cents.
"i blew a booger that i'd swear had it's own spinal cord" "OUCH" Caroline's Spine
If we're talking about specialized search engines, then don't we need some way to know which sites to search? What is the feasibility of creating a system where meta data about a site is entered in a database tied to domain registration? When I go register widgets.com I can specify that I'm commercial, serve North and Central America, manufacture widgets, etc. Meta data about my individual pages could provide more detail, but the meta data at the domain level would direct the specialized search engine to my site in the first place. It just seems to me that even if a search engine is specialized it needs some way to find appropriate sites without brute-force searching the net, or it will still have the same problems unless it has the manpower to filter the results.
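As a sketch of what I mean, in Python. No such registry exists as far as I know; the record fields, the extra example domains, and the function name are all hypothetical. The point is only that a specialized engine could ask the registry which domains to crawl instead of crawling everything.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainRecord:
    """Meta data a registrant would supply when registering a domain."""
    domain: str
    site_type: str
    regions: frozenset
    topics: frozenset

# Hypothetical registry contents.
REGISTRY = [
    DomainRecord("widgets.com", "commercial",
                 frozenset({"north america", "central america"}),
                 frozenset({"widgets", "manufacturing"})),
    DomainRecord("widget-fans.example", "homepage",
                 frozenset({"europe"}),
                 frozenset({"widgets"})),
    DomainRecord("space-news.example", "news",
                 frozenset({"worldwide"}),
                 frozenset({"space exploration"})),
]

def domains_to_crawl(registry, topic, region=None, site_type=None):
    """Pick the domains a specialized engine should index for one topic."""
    matches = []
    for record in registry:
        if topic not in record.topics:
            continue
        if region is not None and region not in record.regions:
            continue
        if site_type is not None and site_type != record.site_type:
            continue
        matches.append(record.domain)
    return matches

all_widget_sites = domains_to_crawl(REGISTRY, "widgets")
local_widget_makers = domains_to_crawl(REGISTRY, "widgets",
                                       region="north america",
                                       site_type="commercial")
```

Of course this only works if registrants describe themselves honestly, so it inherits the same cheating problem as meta tags, just at the domain level.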