Is the Internet Becoming Unsearchable?
wergild asks: "With more and more sites going to a database driven design, and most search engines not indexing anything that contains a query string in it, we're missing a lot of content. I've also heard that some search engines won't index certain extensions like php3 or phtml. Is anything being done about this? How can you use dynamic, database driven content and still get it indexed into the major search engines?" Is keyword searching obsolete? Do you think it's time to index sites by the type of content they carry rather than the content itself? Will larger indexing databases (or a series of smaller, decentralized ones) help?
But I'm not saying this necessarily justifies some of those million-dollar sales.
The one constant throughout the history of internet technology is that directory entries (Open Directory, Yahoo) with structured, categorized results are, and always will be, superior to free-text search for anything that isn't completely obscure.
I think we've actually hit another period, technologically, where we're advancing too fast for active standards on "how things should be done" to make things like searching pages/web databases/etc. an accessible, easy thing. It's probably going to take a while... it seems like every month they come out with a new way of doing things, a new "language that's going to change the world!", a new proprietary language/program for corps to use. Until that dwindles, for whatever reason, the web is going to continue to be behind in terms of searchability.
Listen to me Peter, I want this bench. You go sit on that bench over there, and if you're good I'll tell you the rest of
anyways this is very true. I have a site that all of it is database driven and it uses mysql and php and the pages.
One idea would be have a centralized authority have individual machines scan for sites ala distributed.net to expand existing databases. Would this be possible?
Help us build a better map!
...to "force" search engines to search certain pages. Currently, you can only tell a searchbot to "piss off". There is no way to tell a searchbot "hey!!!! come look at this...."
What if we just have a standard search interface that can be built in to any DB driven website....say it returns XML or WDDX or something. So now when the search engines hit a DB driven site, it goes ahead and creates an index through this interface. I guess like a DNS zone transfer.....hmmmm...
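A rough sketch of what such an interface might return, in Python. Everything here is made up for illustration: the `articles` table, the `siteindex`/`page` element names, and the `export_index` helper are assumptions, not an existing standard. The idea is just that the site walks its own database and hands the crawler one XML listing of every page it can generate, zone-transfer style.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_index(conn, base_url):
    """Return an XML string listing every article the site can generate.

    The schema (articles: id, title, body) and the element names are
    hypothetical; a real standard would have to define them.
    """
    root = ET.Element("siteindex")
    rows = conn.execute("SELECT id, title, body FROM articles ORDER BY id")
    for art_id, title, body in rows:
        # The URL is the dynamic one a crawler would otherwise refuse to follow.
        page = ET.SubElement(root, "page", url=f"{base_url}/article?id={art_id}")
        ET.SubElement(page, "title").text = title
        # Only a short summary is exported, to keep the transfer small.
        ET.SubElement(page, "summary").text = body[:200]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    conn.execute("INSERT INTO articles (title, body) VALUES (?, ?)", ("Hello", "First post"))
    print(export_index(conn, "http://example.com"))
```

A search engine would fetch this once per visit instead of trying to guess which query strings are worth following.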
I think the effort many of us put in to make sure that the relevant site info is indexed by the engines makes up for it. Many of my sites include special pages that only the search engines get to increase their "relevance" in the engines databases. What does pose a problem is where people are totally abusing the indexing methods to get their site promoted in searches that they shouldn't. I don't see that anything can be done about that (and efforts by some engines, including ignoring meta tags etc are quite annoying)
My company created lots of dynamic sites with
dynamic content - without the use of different
extensions or URLs that contain query strings.
Apache is awesome (in case you haven't heard)!
Almost all of our HTML files actually contain
embedded TCL code, so the servers are configured
to parse every *.html file - allowing us to use
the *.html extension for files that have dynamic
content. We also use things like mod_rewrite
to send data to a single file that tells the
file what data to use and how to behave. We
could have an entire range of sites served out
by a single file... even making it look like
they have their own directories, when in
reality those directories don't exist.
-- Richard Finn http://www.random-seed.com/
We've been running across problems related to this in my office (a web design/hosting/advert firm) and, while I'd like to see non-database-driven searching of the Internet continue, I have to say that most people would probably rather have the database. The many web design clients who expect that once they have a web site they won't ever have to advertise in print again are driving the whole thing toward the database method... creating the problem they so love to bitch about.
Perhaps doing away with keywords entirely, getting search engines to look at the content instead of the "false content" of meta tags... now that would be nice.
web suckz!!!!!!
Once one site is found on a certain topic, it will often be linked to many others; you can find a lot of information this way. I only use search engines in extreme cases.
We had a client once who wanted keywords inserted dynamically into the metatags on his webpage based on query results because he read once that search engines index pages based on the tags. Nothing we could say would convince him what was wrong with that picture.
Is it even possible to index dynamic pages? They don't really exist until the page is generated. Perhaps the best thing to do for sites that want to be indexed is to make sure they have a plain, vanilla index.html page that contains relevant keywords?
Dana
First hit google. Then metacrawler. Then try it as a phrase, then add "-" terms to filter useless results. After that ask jeeves, then the imdb, ubl, mp3.com, amazo^H^H^H^H^Hbarnes&noble online, the manufacturer's sites, then give up and ask someone for it on alt.binaries.whatever.
Short answer, yes. Long answer->I'll find it if given an afternoon or two.
mcrandello@my-deja.com
rschaar{at}pegasus.cc.ucf.edu if it's important.
As is, search engines just index raw HTML, with no regard for the actual content of the pages. Perhaps as XML and related stuff begins to proliferate, the indexers of the future will begin to use the extra markup to deduce things about the data that are relevant to the searchers. Certainly it needs to be rethought, because as it is, it's crappier than even searching for text in Emacs. Think of the internet as the world's largest text file, and you're trying to find things using a simple search in a myopic text editor that can only see 1/100 of the whole document anyway.
It isn't too bad if you're looking for obscure
things; for example if you get a weird error
message from a Linux utility, or song lyrics.
But try to search for a device driver and you
get those "ad bait" sites like driver-forum.com.
This is a serious problem... there is a big
opportunity for a search engine that will be more
selective about keywords and will reject sites of
dubious value like driver-forum.com
Mark
All of my sites are dynamically generated, using PHP3, MySQL, etc., and they do get indexed in AltaVista and alltheweb. I'm not sure about other search engines, but those two find my sites just fine.
#1: it's easy to make apache run cgi scripts with any extension you want, so php3 and shtml being ignored shouldn't hurt any site that really wants to be indexed
#2: technologies like XML may give a standard interface to databases, so that search engines can index databases directly.
IMO, a much bigger threat to the "searchability" of the internet is the rapidly growing amount of information -- and with it, the amount of misinformation.
One obvious possibility is to generate - using the database - a set of static pages as "targets" for the search engines. This could be done weekly or monthly, for example. Each target page would contain a prominent link to the dynamic database-driven front-end of the website, so that searchers could find the site and then quickly get directly to the main front end. Not particularly elegant, but it seems like a reasonable work-around for the time being. The real solution, in the long run, will involve more sophisticated searching and indexing paradigms.
What do people think about this approach?
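For what it's worth, the static "target page" idea could look something like the Python sketch below. The `products` table, the file naming, and the `write_target_pages` helper are all assumptions for the sake of the example; the point is only that a periodic job dumps one plain HTML file per record, each with a prominent link back to the dynamic front end.

```python
import html
import sqlite3
from pathlib import Path

# Minimal page template; {link} points at the live, database-driven page.
PAGE_TEMPLATE = (
    "<html><head><title>{title}</title></head><body>\n"
    '<p><a href="{link}">Go to the live site</a></p>\n'
    "<h1>{title}</h1>\n<p>{description}</p>\n</body></html>\n"
)


def write_target_pages(conn, out_dir, front_end_url):
    """Write one static HTML page per product row and return the paths written.

    Assumes a hypothetical table products(id, name, description).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    rows = conn.execute("SELECT id, name, description FROM products ORDER BY id")
    for prod_id, name, description in rows:
        page = PAGE_TEMPLATE.format(
            title=html.escape(name),
            description=html.escape(description),
            link=html.escape(f"{front_end_url}?id={prod_id}", quote=True),
        )
        path = out / f"product-{prod_id}.html"
        path.write_text(page, encoding="utf-8")
        written.append(path)
    return written
```

Run it from cron weekly or monthly, as suggested, and submit the output directory to the search engines.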
and webluis. Make that 2-3 days...
You can tweak Apache to parse documents ending in .html with PHP3. You could use .html for generated content and .htm for static pages.
Computers. You can't live with them, you can't live without them.
Yeah, give me a minute to back that statement up. :)
Honestly though. With something that is inherently dynamic like the internet, it is already near impossible to catalogue and make it searchable. Just to illustrate this take any given news site. Today they might have articles about Clinton, tomorrow it might be news about a big fire. Search engines can't just direct you to those sites based on queries because who knows what data they have.
Even if a search engine was able to validate the content on every site before it gave you the url it could still change by the time you actually got to see it.
So quite literally there isn't even a clue of a way to catalogue a database generated web site. Now granted I know there are plenty of sites like Slashdot where eventually the 'content' settles down and becomes static. Still, how are you going to get some stupid program to verify and validate that for *every* dynamically generated web page? I don't think you can.
The web was created to be open and dynamic and it will stay that way. I've heard people say that maybe there should be *more* interoperability between things like search engines and spiders. This in my mind would do more damage.
Besides is it so bad that spiders don't get these pages? It probably isn't even reasonable because it would add that much more complexity to the search engine to catalogue what it finds. How do you rank content?
Anyway... just my 2 cents or so...
This space for sale
Not only is multiple-site searching becoming more difficult, but single-site searches are as well.
Now most content is stored in a SQL database. While it is fairly easy to search a SQL database, returning the information in usable form is not. This is especially true once you have many types of tables containing many different types of information.
Currently, the search engine on the site I work on has its own built-in forms for information from each type of table, but this method takes a lot of maintenance.
Another possible way is to point to the page (php3, asp, .pl, .cgi, etc.) which generated the information. But this only works if arguments are not required.
It is about time someone developed some technology to do "smart searches" of SQL data and return useful information without having to write a template for each and every type of data that might be queried.
I might be off my rocker a little bit on this, but I cannot believe I am the only one experiencing these problems.
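One crude way to avoid a template per table is to ask the database for its own schema and search every text-like column. The Python sketch below does that against SQLite; the `search_all_tables` name and the "column type contains CHAR or TEXT" rule are my assumptions, and it makes no attempt at the hard part (presenting the hits usefully).

```python
import sqlite3


def search_all_tables(conn, term):
    """Search every text-like column of every table for `term`.

    Returns a list of (table, column, rowid, value) tuples. Table and column
    names come from the database's own schema, so no per-table form is needed.
    """
    hits = []
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
    ]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
        text_columns = [
            row[1]
            for row in conn.execute(f'PRAGMA table_info("{table}")')
            if "CHAR" in row[2].upper() or "TEXT" in row[2].upper()
        ]
        for column in text_columns:
            query = f'SELECT rowid, "{column}" FROM "{table}" WHERE "{column}" LIKE ?'
            for rowid, value in conn.execute(query, (f"%{term}%",)):
                hits.append((table, column, rowid, value))
    return hits
```

Each hit carries the table and column it came from, so a single generic results page could at least group matches by type instead of needing a hand-written form for each.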
-Pete
Soccer Goal Plans
Best thing to do is to create static versions of dynamic content that you want to index (like articles etc.) and use scripting to divert non-robots to a dynamic version.
You can also make those static pages keyword and meta tag heavy without affecting the user experience.
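A minimal sketch of the "divert non-robots" part, in Python. The substrings in `ROBOT_TOKENS` are guesses at what crawler User-Agent strings contain, not an authoritative list, and `pick_page` is a hypothetical helper; a real site would do this in its server config or CGI layer.

```python
# Substrings assumed to show up in crawler User-Agent headers (illustrative only).
ROBOT_TOKENS = ("bot", "crawler", "spider", "slurp", "scooter")


def is_robot(user_agent):
    """Guess whether a request comes from a search engine robot."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in ROBOT_TOKENS)


def pick_page(user_agent, static_path, dynamic_url):
    """Robots get the static copy; everyone else is sent to the dynamic page."""
    if is_robot(user_agent):
        return ("serve", static_path)
    return ("redirect", dynamic_url)


if __name__ == "__main__":
    print(pick_page("Googlebot/2.1", "/static/article-42.html", "/article.php3?id=42"))
    print(pick_page("Mozilla/4.0 (compatible; MSIE 5.0)", "/static/article-42.html", "/article.php3?id=42"))
```

User-Agent sniffing is easy to fool in both directions, so treat it as a convenience rather than anything reliable.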
"Everybody's Got Something to Hide Except for me and my monkey" - The Beatles "If you're not part of the solution, you'
I was beginning to think this as well - Yahoo, Infoseek, Hotbot and the like just don't seem to find the good stuff anymore. If there's content held in a database and a page is generated on demand by an active server page or CGI script for example then the page doesn't come into existence until the user requests the information.
Perhaps it's time for search engines to search by topic and direct to a site related to the enquiry. The individual sites could then have their own search utilities to trawl through their databases? Not sure if this is feasible or not though.
In terms of good search engines though - Google and AllTheWeb.com seem to find good content whenever I use them. The problem I guess is that you don't know what you're missing until you find it by some other means, and neither do the search engines.
I also find that self-registered index sites (like WebRings) can be useful. Maybe a search engine for WebRings (e.g. look up 'Elbereth' on the Tolkien WebRing) would be useful (I have to check whether there is one already).
Personally, I use specialized index sites (like NewHoo, Linux Life or Freshmeat) when I'm looking for something. Those sites will just have more value in the future, IMHO.
Fabien Ninoles -- Debian GNU/Linux Developer
One of my biggest gripes with the current scheme is that so many people abuse it. I once searched for "XOR gate" and pulled up porn sites. If sites could be indexed based on content type, perhaps this wouldn't be such a problem. Currently, these jokers dump the entire dictionary in a meta tag and waste everyone's time by throwing off keyword searches.
The sheer volume of websites out there makes effective searches difficult. I imagine a search engine could be tuned for better results, but will people be willing to wait while it crunches through data longer than a shoddy counterpart?
One of the reasons that I became a lawyer was to avoid ever having to hire one. -SPYvSPY
Since most dynamic sites provide their own internal search engines, it seems that a standard Search Engine to Search Engine protocol could help ease this problem.
It seems to me that if anything, the internet is MORE searchable than it used to be. I remember some statistic about how a couple of years ago the few search engines that were around only got some small percentage of the web covered anyway. These days it seems the search engines do a better job, and there are a zillion more search engines and also tools that let you search multiple search engines at once. That and the fact that there is just plain a lot more stuff on the net. Back a few years ago, if you searched for Cervantes, the author of Don Quixote, you might find a page or two on some college webpage somewhere, if you were lucky. These days there are enough pages out there that you're bound to find at least one of them that's halfway decent. Anyway, to summarize, keyword searching still seems to work for me. I think that the only way it will get considerably better is when true artificial intelligence is possible. That way, when you ask the computer to find something, it is actually smart and goes out and finds it like a real person. However, it seems to me that true artificial intelligence is a way off....
Today, there are two methods used when a site is added to a search engine database. The first relies on information submitted by the site, the second relies on information (e.g. keyword fields) found by crawlers. As more sites switch to dynamic content, the sites offer no easy way for a crawler to find information about content. This could be solved by developing some method for storage and retrieval of the data. For an example, look at how the "robots.txt" mechanism works. /Joakim Crafack
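For reference, this is roughly how a crawler consults the existing robots.txt mechanism, using Python's standard `urllib.robotparser`. The rules in `EXAMPLE_ROBOTS_TXT`, the "ExampleBot" agent name, and the `crawler_may_fetch` wrapper are invented for the illustration. Note that the mechanism only says where a robot may not go; a content-description file would need a new, similar convention.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that keeps all robots out of the dynamic parts of a site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /search.php3
"""


def crawler_may_fetch(robots_txt, agent, url):
    """Return True if `agent` is allowed to fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html", "http://example.com/cgi-bin/query"):
        print(url, crawler_may_fetch(EXAMPLE_ROBOTS_TXT, "ExampleBot", url))
```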
... Elegance is left to the implementors.
.. just too many things that you can save in databases and publish on the internet in dynamic pages .... The internet IS unsearchable because of the amount of data and not because some websites cannot be searched by search engines (including these would increase both signal and noise and not help). What is IMHO needed more is some way to differentiate real information from fluff and spam .. but that is still far away ... (hoping for some advanced AI)
Still, I see a potential threat in information becoming unmanageable and, most of all, in ways of finding information being abused (like using unrelated keywords just to get some visitors). Stanislaw Lem, the Polish SF writer, described this situation in many of his books, starting in the 60s, when no one was even starting to think about such problems. Sooner or later we'll have a large branch of computer science dealing only with searching for information on the Internet; search services are already available, but they are either incomplete or not evaluated. The latter is the key: Google is the first service I'm aware of which tries to automate evaluation (by counting links pointing to a specific page).
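To make "counting links pointing to a specific page" concrete, here is a bare-bones Python sketch. It is only a plain inbound-link tally over a toy link graph; I am not claiming this is how Google actually ranks pages, and the `count_inbound_links` name and the graph format are assumptions.

```python
from collections import Counter


def count_inbound_links(link_graph):
    """Count how many distinct pages link to each page.

    `link_graph` maps a page URL to the list of URLs it links to.
    Self-links and duplicate links from the same page are ignored.
    """
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


if __name__ == "__main__":
    graph = {
        "a.html": ["b.html", "c.html"],
        "b.html": ["c.html"],
        "c.html": ["c.html", "a.html"],
    }
    # Pages with more inbound links would be listed first in the results.
    print(count_inbound_links(graph).most_common())
```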
There has been a lot of talk about "Internet agents" a couple of years ago (I remember an article in Scientific American...) - could some good soul explain to me how is the situation now?
Regards,
January
I used to make a decent living as an Information Broker - basically, a trained database searcher for hire. Along came the net, and suddenly everyone with a modem could search for themselves. So I wrapped my shingle up, and stored it away.
These days, there is so much junk and bad indexing, that I may as well put the shingle back out. Almost any search will find mostly commercial sites, unrelated to the search, or completely useless garbage.
You almost have to be in a bizarre frame of mind to create a good search term these days.
Mark Edwards
Proof of Sanity Forged Upon Request
I've done some work on a spider and these are the types of pages I spider:
html htm asp php shtml php3
I guess I'll add phtml :)
Other extensions and urls with query strings are ignored. This is mainly for self defense. There are many, many infinite loops and blackholes on the web and they're hard to avoid. For instance, my spider once got stuck on a server that would return the contents of /index.html for any non-existent path. Also, all links on the homepage were relative (not a bad thing) and one was invalid. The call sequence is below.
GET /index.html
found foo/broken.html
GET /foo/broken.html
webserver couldn't find path, so returns /index.html
GET /foo/foo/broken.html
etc.
What was the programmer thinking?
This is just one example of the blackholes that lurk on the web. It was completely unexpected and pretty difficult to detect. What if someone wanted to write a search engine trap? I don't believe there is a simple solution to this problem.
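The self-defense rules described here (extension whitelist, no query strings) plus one cheap guard against the /foo/foo/broken.html kind of loop might be sketched in Python like this. The `MAX_DEPTH` value and the "same path segment twice in a row" check are my own assumptions, not what the spider above actually does, and they would not catch every trap.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions listed in the post above, plus phtml.
ALLOWED_EXTENSIONS = {".html", ".htm", ".asp", ".php", ".shtml", ".php3", ".phtml"}
MAX_DEPTH = 8  # assumed cut-off; deeper paths are treated as suspicious


def should_fetch(url):
    """Decide whether the spider should request `url`."""
    parts = urlsplit(url)
    if parts.query:
        return False  # query strings are skipped outright
    path = parts.path or "/"
    segments = [segment for segment in path.split("/") if segment]
    if len(segments) > MAX_DEPTH:
        return False
    # A segment repeated back to back (/foo/foo/...) suggests a relative-link loop.
    if any(a == b for a, b in zip(segments, segments[1:])):
        return False
    if path.endswith("/"):
        return True  # directory URL, presumably an index page
    return posixpath.splitext(path)[1].lower() in ALLOWED_EXTENSIONS
```

With these rules the spider would still fetch /foo/broken.html once, but it would stop at /foo/foo/broken.html instead of recursing forever. A deliberately built trap could defeat this by varying the segment names, which is where the depth limit has to take over.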
Ryan
Many companies, especially startups, are turning to using catchy domain names as the way to promote their site and products. Even many non-profits and research groups now register domain names that reflect what they do since many people just type in domain names into their browser - and ironically having a domain name actually helps in being indexed by some search engines; one may debate if this is good or bad, but it's a reality.
Until there's a standard, the search engines will continue to miss more and more of the sites out there. XML may be the answer to indexing and exchanging data. However, on the bright side, the difficulty of finding data makes censorship much more difficult for the censors - and that's a good thing.
Just stop thinking that terabytes are the limit. Get more hardware and more computers. Create petabyte databases. In fact have millions of petabyte locations world wide and create a series of multipetabyte databases that one can use.
Categories are nice but some (most) sites are personal sites and these sites change quite often in subject matter.
While the categories are nice we should have a community planned and maintained categorical system along with a plain text search. Have identifier tags that go along with every web site and then have a standalone and a web based version of this program which will allow anyone to create a hierarchical listing of anything according to certain tastes and parameters.
Slashdot social engineering at it's finest
I think we are already looking at a two-tiered structure: there are sites (that could be found through standard search engines) and then there are databases/archives inside those sites.
It is getting more and more so that to find an answer to a somewhat obscure question, I need first to find major sites on the topic, and then do a search through their databases or mailing list archives. I believe this reflects a real-life structuring of the Web and will have to be taken into account by next-generation search engines.
Kaa
Kaa's Law: In any sufficiently large group of people most are idiots.
I don't think the issue has to do with the ability of centralized search engines to index dynamic pages. I think there is a more fundamental flaw in that idiom.
The list of problems that exist for centralized search engines goes on and on: dynamic pages (of course), missing/broken/changed links, getting to new pages, and so on.
What I think could be done is to define a search protocol (perhaps through some kind of search://domain/search+terms method) that is standardized. The global search engines then search by determining the most likely sites to have information for you and querying those sites directly for information. This would fix the problem of broken/missing/changed links being reported, new pages would automatically be available (assuming sites updated their search engines quickly), and if the local search engines are integrated with dynamic page generators (which should be possible) then those pages could be searched too.
I realize that a lot of work would need to be put into this in order for it to work. A protocol would need to be developed, as well as servers for the protocol. Search engines would have to learn to efficiently decide which sites to query to complete their searches, etc.
Perhaps a combination of both approaches could yield something even better. All I know is that what is out there right now, well, fails miserably.
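To give the protocol idea some shape, here is a toy Python sketch. Nothing in it exists: the search:// URL form is taken from the suggestion above, while `build_search_url`, `federated_search`, the topic sets, and the per-site search callables standing in for network requests are all assumptions.

```python
from urllib.parse import quote


def build_search_url(domain, terms):
    """Build a URL in the hypothetical search://domain/search+terms form."""
    return "search://" + domain + "/" + "+".join(quote(term, safe="") for term in terms)


def federated_search(terms, site_directory):
    """Query only the sites whose advertised topics overlap the search terms.

    `site_directory` maps a domain to (topics, local_search), where `topics`
    is a set of lowercase keywords and `local_search(terms)` returns the
    matching paths on that site. In real life that call would be a request
    to the site's own search server.
    """
    wanted = {term.lower() for term in terms}
    results = []
    for domain, (topics, local_search) in site_directory.items():
        if wanted & topics:
            for path in local_search(terms):
                results.append(f"http://{domain}{path}")
    return results


if __name__ == "__main__":
    directory = {
        "drivers.example.com": ({"driver", "hardware"}, lambda terms: ["/find?q=" + "+".join(terms)]),
        "recipes.example.com": ({"food", "cooking"}, lambda terms: ["/search/" + terms[0]]),
    }
    print(build_search_url("drivers.example.com", ["sound", "driver"]))
    print(federated_search(["sound", "driver"], directory))
```

The hard part, as noted, is the step this sketch fakes with a keyword set: deciding which sites are worth querying at all.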
One way round the search engine missing query URLs is to write static pages for the purpose of submitting them to search engines. There are many clever ways of having truly dynamic sites without the need for long URLs; you just have to put some effort into it.
Search engines not picking up on php3 is a bit worrying though, all my sites are written purely in php3, although I never seem to have any problems with getting listed.
Gateway pages are a good way of making sure you get listed with the keywords you want, although they aren't very dynamic and unless you get really clever don't tend to reflect the contents of a regularly updated site... however it seems to me that you can only really hope for *a* listing these days, not an index of all of your site.
Even Google has a 3 month disclaimer on its submit page; that's a mighty long time if you are looking for support on a brand new motherboard.
LASE seems to be the way to go... subject-specific full text indexes which spider regularly and can index specialised data, keeping it up to date.
However you would still need a search engine to find a LASE that will get you what you want, but at least it's a bit more structured!
There are many ways round the search engine problems, and keeping on top of it is a full-time job. Submit-it doesn't come close; it hasn't changed in the past 3 years. Search engines, however, have!
IMO a combination of all of the above will get you where you want. Keywords and Meta Tags still count, and you have to be persistent.
In my opinion, part of the fault lies with the browsers, which poorly handle caching dynamic content, regardless of whether it is on a remote webserver or a local drive. I for example am forced to add a useless query string to the end of local file URLs so that all browsers will work. Browsers are notorious for ignoring no-cache pragmas and expiration dates.
The most common way though people find out about worthy dynamic content sites I think is word of mouth. We could use more forums and link referrals to share websites we have found useful. This has the very distinct advantage over search engines of providing a better filter of QUALITY of information. After reading someone's recommendation of slashdot or an article elsewhere, I won't have to hurdle 19 irrelevant hits to get there.
The Open Directory Project, managed by dmoz.org, is an open source effort to create an organized index of the internet through volunteer work. Currently there are 20,000+ volunteers working on the project. This is a way cool idea that we should all support.
I read a while back that meta data for sites would eventually move to an XML based standard which would accurately describe the content of the site?
Whatever happened to that? I don't mind all that much being taken to the front page of a site if I know that site has the information somewhere in there, I just hate having to hit seven sites to find that one.
Hotnutz.com
I have been thinking about the working of a search engine lately and this post just comes at the right time.
Some of the challenges which will be faced in searching the web in the future will be:
1. Displaying matching URLs as well as links which match the type of content. This is important. If I search for "throat infection" on a search engine, then apart from the pages which mention "throat infection", the web engine should give me a link to drkoop.com, webmd.com (AFAIK, these sites do not allow search engines to copy their content) and so on.
Search engines will have to maintain huge databases linking words to categories. And with the proliferation of the internet, the number of sites carrying content and disallowing search engines is going to increase. Search engines need an intelligent way to get around this.
2. Search engines will need to "help" users with their searches. For example, if I just search for "throat", the search engine should have a helper section where it can ask me more... whether I am searching for "throat infection" or "study of throat" and so on.
3. Search assisted by humans. This is also one of the concepts picking up these days. Basically you submit a question and there will be some person searching the web, and you will get your answer in a few hours/days. Check out www.xpertsite.com.
4. Tools for better maintenance of bookmarks. I for one usually bookmark all relevant stuff and then I spend a full weekend arranging them so that I can find the relevant stuff from the bookmarks quickly. The current bookmarking scheme is very primitive, causing a lot of users to "reinvent the wheel" (searching for URLs which are already bookmarked).
Phew!
I'll jot down more thoughts later. Gotta work now.
CP
I was just about to ASK SLASHDOT about XML. XML will solve the search problem (or at least help make it better). Working drafts of XML have been drawn up by the W3 Consortium, and XLINK, XSL, etc. are coming. There are almost no XML applications available yet though!!!!! Most of what is available is in Java. This is a field where Linux could be leading the pack, but is instead an example where I think we are lagging behind. (I hope someone can point me to a group that is bringing XML deep into the Linux OS.)
I want to know if Linux is on top of this. Microsoft has an XML notepad available and I hear that it's going to be all over Win2000 (in the registry even). XML will be the foundation of the new internet and we don't want Microsoft to have a technology edge there, do we? Perl has XML modules, as I am sure other languages do too (Python). Let's get some apps written!
What about Gnome and KDE? This could help make their projects easier. Especially KDE, with all of the object similarities between Corba and XML and Object RDBs. All config files could theoretically be stored in XML. We need to push this one, people!
-pos
The truth is more important than the facts.
-Frank Lloyd Wright
The problem with dynamic content is that you pretty much have to query the target web servers at the time the user enters the search request.
One solution that attempts to address this is Apple's Sherlock. It uses XML to pass queries to web sites and return results. There are certainly some limitations: you have to choose which web sites you want to search (although this isn't always a bad thing), these web sites have to support Sherlock queries, and it only works on the MacOS. Currently lots of big name and Apple-specific sites support it.
The dev info at Apple is pretty clear though. It wouldn't be difficult for others to create clones for Sherlock that either work over a web interface or on other OSes too. (dunno if Apple could...or would... make any claim against this)
Scott
You can't study the darkness by flooding it with light. --Edward Abbey
No technology is going to read your mind - you're limited by language, and that can be interpreted and misused in multiple ways. This includes searching (e.g. keywords in porn sites) applications. Word misuse will never stop (ask Plato or Burke) so we're just going to have to deal with it.
Eventually, the *end user* has to do the information filtering, so you might as well take what you can get FAST so you can move on if you don't see what you need. Indexing every database or dynamic page on the web would slow down engines to a crawl. Do you honestly want Altavista bringing up books from Amazon, companies from the Thomas Register, and patents from the USPTO? There's no need for this. If you want specialized information, go to a specialized source.
Some time later, it occurred to me to try and monitor the efficiency of web indexing tools using a spider trap.
The methodology is like this:
Anyone done this? I'm particularly interested in knowing how spiders handle large websites -- have been ever since I was doing a contract job on Hampshire County Council's Hantsweb site a few years ago and caught AltaVista's spider scanning through a 250,000 document web that at the time had only a 64K connection to the outside world. (Do the math! :)
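Doing the math, roughly, in Python: the post gives the document count and the line speed but not the page size, so the 10 KB average below is purely an assumption, as is reading "64K" as 64 kbit/s with no protocol overhead and nobody else using the line.

```python
def crawl_seconds(num_docs, avg_doc_bytes, link_bits_per_second):
    """Time to pull every document once over the link, ignoring overhead."""
    total_bits = num_docs * avg_doc_bytes * 8
    return total_bits / link_bits_per_second


if __name__ == "__main__":
    # 250,000 documents (from the post), 10 KB each (assumed), 64 kbit/s line.
    seconds = crawl_seconds(250_000, 10 * 1024, 64_000)
    print(seconds, "seconds, or about", round(seconds / 86_400, 1), "days")
    # 320000.0 seconds, or about 3.7 days
```

So under those assumptions a single full pass would saturate the connection for several days.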
I think that, for the most part, the databases are doing their job rather well.
Where do you find the most dynamic content? News sites. Slashdot, Freshmeat, Linuxtoday, Yahoo! News, etc. These are the sites that need dynamic content.
Ironically, these are the exact sites that search engines are pretty much not interested in indexing, anyway. Even assuming that a database can update all its sites once per day, that means that the information is a day old-- centuries, in Slashdot time! People don't go to AltaVista to search for the story over at ABCNews.com. They go to AltaVista to find information about international child custody laws (to name a random hot issue of late).
Most of your general information stuff is pretty much static. This is what the search engines look for anyway-- this is the stuff that doesn't change often, so it's good stuff to record. Why would anybody bother to make a page about Cup 'O Noodles that's generated through a Perl script? It's too tough, and can be a huge pain in the ass to change it.
Why index the pages that are constantly changing, when the stuff you're looking for (by definition) doesn't change much? Sure, there's overlap (small sites that generate the exact same content every time). But it's such a small segment that hardly anybody would miss it (yes, it may be important, but not important enough to totally revamp the indexing procedure).
Yes, for a very large category of dynamic pages, it is. For example, in an online shop, the actual number of a particular product in stock at the moment may vary from minute to minute, the price of that product in the user's preferred currency may change from week to week, but the product itself doesn't change much over months or years. It makes perfect sense to index the product page, because although some of the contained data may be transient, a great deal more is not.
Or take another example: the weather forecast for a particular area. The forecast itself may change regularly, but the page always contains a current forecast and that fact is worth indexing. The best technology available for this sort of thing is probably RDF and the Dublin Core metadata specification. Of course, the search engines still have to be persuaded to take heed of this...
I'm old enough to remember when discussions on Slashdot were well informed.
Therefore, it would be entirely feasible to have a system in which regular users saw regular pages and web crawlers saw a "static" index page, all at the same URL.
This would allow web crawlers to index according to genuinely useful keywords, rather than by how the crawler's writer decided to determine them.
An alternative approach would be to distribute the keyword database. Since all the web servers have the pages in databases of one sort or another, it should be possible to do a "live" distributed query across all of them, to see what URLs are turned up.
This would be a lot more computer-intensive, and would seriously bog down a lot of networks & web servers, but you'd never run into the "dead link" syndrome, either, where a search engine turns up references to pages which have long since ceased to be.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I think the real problem with searching isn't that the Internet is growing too large. The central reason it is too hard to find information is the unfortunately ever-changing nature of HTML. (Yes, I know there are much better solutions out there -- I work with some of them on a daily basis. However, we seem to presently be stuck with HTML and its variants.)
It's a self-feeding monster, whose typical cycle goes as follows: SearchEngineInc (a division of ConHugeCo) creates a new technology that really impresses people with its ability to find what they want more quickly. (Right now SearchEngineInc is probably Google, at least in my view.)
Once the new technology takes root, content authors (well, maybe not the authors so much as their PHBs) note that SearchEngineInc doesn't bring their business (which sells soybean derivatives) to the top of the search list (when people type ``food'' into the search engine). Said PHBs make the techies work around this ``problem'', and all of a sudden SearchEngineInc's technology isn't so great anymore because the HTML landscape it maps has changed.
A similar situation occurs when PHBs think their site doesn't ``look'' quite as good as others. (Insert my usual rant about content vs. presentation here.) Whether via a hideous HTML-abusing web authoring program, or via all sorts of hacks that God never intended to appear in anything resembling SGML, the HTML landscape is changed there as well, and SearchEngineInc's product becomes less effective.
What's the solution to this? I'm not quite sure. Obviously there are better technologies out there that are at least immune to PHBs' sense of ``aesthetics'' but I would wager few of them are immune from hackery. I'd say that search engine authors are doomed for all time to stay just one step ahead of the web wranglers. At least it assures them that their market segment won't go away any time soon. :-)
I have to say, yes. I believe that with the way the internet is growing, it's difficult to keep up with new pages and new technology. I know there have been several times I have done searches only to turn up nothing when I KNOW it's there or to turn up too much which pertains to nothing I'm looking for. Most of the more mainstream search engines have become obsolete, I'm afraid. Many of them use methods that just simply aren't practical like searching for certain words in the text of a page. When you search for things like that your searches will not be accurate and often you'll get information you don't really want or need.
So, I believe the internet is outgrowing the current search engine technology.
-- Shadowcat
kageneko@kageneko.net
"I can roleplay. I can frag. I can PK while you lag."
I think it is just the plain and simple truth that the searching algorithms all of the search engines use currently are not suitable for the task. I will perform searches on what I think are pretty obscure terms and return >10,000 hits on some of these search engines. Of course, none of them mean anything to me.
:)
I'm not saying that this problem won't be figured out at some point. It's going to take a little more technology than we have right now, but no doubt it's on its way even as we speak. (Any AI experts out there?)
Until then, indexing by hand seems to be the only 100% solution. Humans are fallible, but much less than the machines are at this present stage. Plus, directories geared towards specific topics would help narrow down your search before you even start searching.
There is no excuse for having a purely database-driven website that does not appear to be straight HTML pages. If you have ?s everywhere then you're just lazy.
Firstly, even though you might pull everything out of a database, a large percentage of all such content is not really all that dynamic, which means you're probably better off precompiling the database down into static HTML, and recompiling a page only when its content changes.
Secondly, if you have a script with a messy query string you can turn it into something that doesn't look like a script at all, e.g., /cgi-bin/script.cgi?foo=bar&this=that could be presented as /snap/foo/bar/this/that.
With Apache, you would just define and pass it off to a handler, that would pick up the parameters in the PATH_INFO environment variable. If people tried URL surgery, you could just return a 404 if the args made no sense.
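As a rough illustration of the PATH_INFO idea above, here is a minimal sketch in Python rather than an Apache handler. The function name `parse_path_info` is made up for this example, and it assumes the handler prefix (/snap) has already been stripped, so the input is just /foo/bar/this/that. Returning None stands in for "send a 404".

```python
from typing import Dict, Optional


def parse_path_info(path_info: str) -> Optional[Dict[str, str]]:
    """Turn '/foo/bar/this/that' into {'foo': 'bar', 'this': 'that'}.

    Returns None when the segments make no sense as key/value pairs,
    which the caller can translate into a 404 response.
    """
    segments = [s for s in path_info.split("/") if s]
    # An empty path or an odd number of segments cannot be paired up.
    if not segments or len(segments) % 2 != 0:
        return None
    keys = segments[0::2]
    values = segments[1::2]
    # Hypothetical extra rule: a repeated parameter name counts as nonsense.
    if len(set(keys)) != len(keys):
        return None
    return dict(zip(keys, values))
```

Someone doing URL surgery, for example chopping the path down to /foo/bar/this, gets None back instead of a half-parsed query.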
Search engines are your best (and probably only) hope of getting people in to visit your site. It's up to you to make sure your URIs are search-engine friendly. If they can't be bothered to index what looks like a CGI script, well that is your problem. There are more than enough pages elsewhere for them to crawl over and index without bothering with yours.
It disturbs me that so many pron sites have hidden in their html code (and sometimes not even hidden) huge lists of adult film stars just to get hits from search engines.
If you do a search for Cortknee or Lotta Top you'll get a bazillion hits and 90%+ of them are "Click here to see young virgins having sex for the first time on their 18th birthday!"
As we all know, but nobody likes to admit, pron is the fuel that makes the net go 'round.
Many other sites have taken hints from the pron people. I'm sure that it was a deal of some sort, but every time I do a search for anything on metacrawler, I get a link to search a certain bookstore for books on the same topic.
Commercialism and shady practices are what are making the net so hard to search.
LK
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
With mod_perl, you could create a system that analyses the requested URL and makes a database query. You could hide a database behind something like www.webzine.org/articles/section/109944.html. On your server, no actual file called 109944.html would exist, but a request for that file would tell your server to query the record for 109944.html from the database.
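To make that concrete, here is a small sketch of the same idea in Python with an in-memory SQLite database instead of mod_perl. The table layout, the `lookup_article` and `demo_connection` names, and the sample row are all assumptions for illustration, not anything from a real site.

```python
import re
import sqlite3
from typing import Optional

# Matches paths such as /articles/section/109944.html (pattern is an assumption).
ARTICLE_PATH = re.compile(r"^/articles/(?P<section>[a-z0-9_-]+)/(?P<id>\d+)\.html$")


def lookup_article(conn: sqlite3.Connection, path: str) -> Optional[str]:
    """Map a static-looking path to a database row; None means 'not found'."""
    match = ARTICLE_PATH.match(path)
    if match is None:
        return None
    row = conn.execute(
        "SELECT body FROM articles WHERE id = ? AND section = ?",
        (int(match.group("id")), match.group("section")),
    ).fetchone()
    return row[0] if row else None


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway database with one hypothetical article."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, section TEXT, body TEXT)"
    )
    conn.execute(
        "INSERT INTO articles VALUES (109944, 'section', 'Hello from the database')"
    )
    return conn
```

To a crawler the URL looks like any other .html file, so nothing about it hints that a query sits behind it.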
Okay, I just got done with my research paper for college last week, and although I can pull a paper out of some orifice of my body, researching is always a pain.
Our library has a wonderful online database where you can type in keywords and search for them, but the keywords only look as far as the Title, Author, or abstract of the book. If you wanted to look up some narrow topic, you can't expect that there's books written exactly on that topic, but there's always bound to be a few books out there that have a few pages dedicated to that subject (but isn't listed in the abstract). So, what do you do? You have to get your hands dirty.
My topic: Holy Wisdom (I won't bore you with details, but just stick with the subject). Looking in the online database, I find that there are zero books on the subject. Darn. Let's do some lookin...
After I read in a few Religion Dictionaries, I find that Holy Wisdom is also called "Sophia." I go back to the catalog, type in "Sophia," and I get one book. I skim this one book, and find that Sophia has sometimes been associated with the Holy Trinity. So, I go back to the catalog, enter "Holy Trinity," and BOOM, I get back 400 results (anyone seeing a similarity here...). Let's limit them...we'll search within the results for "History of," and I get back about 11 results. I read the abstracts, find a few books of interest, and start skimmin...
...Well, whadda know, there's a page in one book that talks about Sophia, and half a chapter in another book that talks about Sophia as well. There's a few more sources for the paper!
Now, for those of you who just don't understand what I'm trying to say here, just read from here on, cause here's my point: Computers aren't smart enough yet to "guess" at what we want, and personally, I don't think they ever will. Internet keyword searches are just like asking someone to help you who has no idea what your topic is...they can only search for what you ask them to search for.
Internet keyword searches are a hassle, and many times the first few returns won't be anything CLOSE to what you want (search for "Computer Science," you get back porn, search for "Linux," you get back porn, search for "White House,"...). But if you learn how to dig, like the people who lived fifty years ago WITHOUT Boolean Searches, you'll find what you're looking for. Sometimes, it's just like searching for a topic...you might not find anything directly, but you can't sum up an entire book in just a paragraph either!
Try some links, look around, and it'll be there!
Many of my sites are database-driven sites that run on PHP and MySQL. No problem with indexing, and no problem with the file extensions.
If you can get beyond the backend concept of a dynamic page, most pages really appear to be quite static, from an indexing perspective. An HTTP-based indexing system (as opposed to filesystem-level) can't tell that pages are dynamic, and doesn't care.
I've never had a problem with search engines failing to index pages just because they had a convoluted URL. If some engines do that, it's a bloody shame.
The point is there has to be a link there in the first place. They will not be able to index a dynamic page if it is only accessible through a "form" post.
The way you can get around this is to have a hidden (to users) page on your site with hardcoded (or database generated) links into the dynamic content that you'd like visible from search engines.
For example, if you have a whole heap of news articles on your site, with one per page, you can make a dynamic page called "newslinks" which, when requested by a crawler, queries the database and writes links to every news article on the site.
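A "newslinks" page along those lines could be sketched like this in Python. The `render_newslinks` name and the /news/ID.html URL scheme are invented for the example; in practice the (id, title) pairs would come from whatever query the site already runs.

```python
from html import escape
from typing import Iterable, Tuple


def render_newslinks(articles: Iterable[Tuple[int, str]]) -> str:
    """Build a plain HTML page with one crawlable link per news article.

    `articles` is an iterable of (article_id, title) pairs, e.g. the rows
    returned by a database query.
    """
    items = [
        f'<li><a href="/news/{int(article_id)}.html">{escape(title)}</a></li>'
        for article_id, title in articles
    ]
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"
```

Because every article is reachable through an ordinary link, a crawler never needs to submit a form to find it.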
cheers, j.
"My cat's breath smells like cat food." - The Tao of Ralph Wiggum.
IMHO we're already seeing the advent of meta search engines that do their own search and then do a simultaneous search using other engines. (Yahoo does this, I think, as does lycos/hotbot) That's a great kludge for these engines to extend their reach, but not a real solution.
I think we'll see more topic-specific search engines (I use trade rag sites exclusively for really good info on tech news, for example) linked together through the big search engines. The main engine (Google, or whatever) will check the search term to see if that term has been pre-linked by the engine managers to generate a search on a more topic-specific engine (for example a search on "market size" may cause the engine to do a lookup on the northpoint search engine) or engines, and then combine the results of its own search with that of the topic-specific engine for relevant results.
It's the whole idea of vertical portals taken to the next level. The vertical portals provide topic-specific searching capabilities over the 'Net to the behemoth engines and portals for a fee, or something.
Remember, the user will not get smarter, but will rather look for the faster and easier solution.
IMHO.
I'm going to say a naughty word: artificial intelligence. I'm hoping we soon ( 5 years) get good enough at this "indexing" stuff to create semantic models of Web content rather than purely syntactic models. (Google is a small step in the right direction.) If so, then perhaps dynamic pages can be indexed according to their location (role?) in an "ontology" rather than via the frequency of essentially meaningless character strings. That may sound farfetched, but it seems to me that the Web finally provides a real _financial_ incentive with near-term payoff for that kind of research. Hitherto, the quest has been purely academic. And where there's the lure of a real payoff, stuff often happens quickly (usually -- batteries and flat-screen technologies being notable exceptions).
I hope that after I die the one word people use to describe me is "resurrected."
Every Web/FTP server should have a standard, live query engine. Every week or so, search sites would query it and update their databases, but only down to the site level. If the end user wants a specific page, they must query within the site itself, as a second-phase search. [Buen español, bad english]
I'm from Argentina: Tango, Asado, Mate, Gaucho, Maradona, YPF
Look up in the sky.. it's a bird, it's a plane, it's web sites dumping their information from hard to index databases to easy to read XML!
Wasn't this sort of thing what XML and RDF were originally designed for?
DrLunch.com The site that tells you what's for lunch!
I think that this is going to force the search engines to focus on sites rather than pages.
:)
as a site can be described by keywords even if their subsequent pages are database driven. i like searching by site usually anyways - provided that the site has a nice search engine
The web is growing and changing at a pace that a band-aid fix like static indices just won't solve. Database-driven web sites are simply more manageable, scale better, and more easily allow the separation of content creation from site design than static ones consisting of n-thousand HTML documents.
Technologies like XML and WDDX provide access to databases through standard protocols and are not difficult to implement. A few simple, scalable solutions include:
DB-Based web content has the potential to make the web more searchable than ever before through hierarchy and content classification, but only if we do not try to rein it in. Instead, we should adapt the way we search to the emerging scalable, powerful web architecture that is the future of the web.
Bryan Klingner, MCSE, MCP+I
What I try to prevent is the problem I am going to mention next, which is that it seems like with many data driven sites, the content pages "expire" (i.e., they are aged out of the database -- thus disappearing from the site) without any notification to the search engines that the page is expired.
As an example, I use a product which performs queries against 10-12 search engines at the same time. For any given search, 10% or more of the pages will be invalid. What little research I have done into the invalid sites often shows that the page has been dead for more than a year -- even when 8 or more of the search engines advertise that they have (at least in theory) spidered the page within the last 60 days.
What we have here is a problem in search of a standards based solution (an official RFC) designed to bring order out of the chaos.
My own thought (which I acknowledge comes from someone who has been doing data driven sites for less than a year) is that there ought to be a standard way of telling an external spider to use a "site local" index file, similar to how the robots.txt file excludes some or all of a site from spidering (assuming the spider's coders obey the standards -- not all do.)
It then becomes the data-driven site's coders' responsibility to add the code which updates the robot's index file "automagically" based on the content changes to the site.
It also seems to me like browsers could access this file to see if a bookmark is still active, and with the proper format, maybe even update the local bookmark file. Something like this:
http://mygreatsite.com/old.html := http://mygreat.com/new.html
I'm interested in what more experienced coders have to say about this idea, BTW.
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
A web site is basically a network service. It seems like there should be a place for a distributed protocol that actually allows an intelligent* search. If you defined a doc/HOWTO type you could search for sites providing those services with criteria that select the particular issue you're looking for. Try that with a search engine and irrelevant juxtapositions will fill your results with noise.
*Intelligent in the sense that the search method used shares a vocabulary with the providers.
Almost all search engines will reject dynamically generated pages if they have extended characters in the URL (except for Lycos and Inktomi). This is primarily because they are worried about getting into what they call "robot traps", where there may be no end to the number of links that a script or program generates. If the URL contains a "?", "%" or other similar characters, they will probably not index your site. A workaround is to build "Pointer Pages" using regular static html with links to the target page. If you attempt to use the refresh tag within the "pointer" pages, be aware that Infoseek will try to index your targeted page, not the page that you submit. There are ways around this problem...
(From The Unfair Advantage Book on Winning The Search Engine Wars)
Google rocks. I can do a search and find all the articles I have ever posted on slashdot. (Archived, of course) The problem of slowness of distribution to search engines is a difficulty, but compared to historical ways of gaining information, what we have is incredible.
:)
We should have some sort of a standard way of indexing these pages, and if they make it compatible with all the new technologies coming out, I will be very impressed. The best search engines will use the standard indexing in addition to current technologies, I would suppose, but it would still make life much easier to have this.
Also, it would help if there were a central place to notify when you have posted or changed content, something like the way domain names are registered in central places. It's in the user's best interest to notify the central location that content has been added or changed, and then the central point propagates that information to anyone who wants it, for a small fee, of course.
Why do I post these things on public forums, anyway?
>>>>>>>>> Kvort, Lord High Peanut of Krondor
-Don't mind me, I'm personality-deficient and mentally-impaired.
Basically we need a distributed open database
standard for the web. Searching a database is
much faster than doing a blind text search and
should definitely take up less bandwidth and
resources than a text search. If each ISP and
independent node on the internet hooked up
their databases and (imported) html pages,
we'd be able to search anything, anywhere.
Of course, implementing it will be tough. The
current approach of web searching is based on
laziness. Actively participating in the creation
of a web index is not necessary. The only reason
for ISPs to participate is because they're afraid
that spiders eat up too much bandwidth.
In the mean time, we'll just have to live with
what we have. As Larry Wall is fond of
saying, "Laziness is a virtue". I hope that
enough of us are lazy enough to use plain ol'
text, HTML, SGML and XML.
Search engines have serious problems. One is that boolean strings and other forms of highly specific searching never seem to work. I search for anything, and I get maybe 20 out of 3000000 sites that have what I want. And many of these sites are on the fifth or sixth page. What needs to happen is search engines coming up with a better way of ranking sites. It's really annoying when the 100% relevant site has nothing remotely related to your search, and the 25% site is exactly what you are looking for. Search engines also have to do more to prevent people from spamming them. Content-based searching, rather than keyword searching, should be implemented. It can help, but keyword searching, if improved, is still good when searching for specific information. Search engines could focus on specific areas, like a SlashSearch.com that would be a tech search engine. The search-everything engines could add a category option to their advanced search mode. Database driven sites should use meta tags describing the content type. While no solution can be perfect in a rapidly changing environment like the Web, these ideas can be implemented and would help.
If an HREF contains a query string, sending that query string will return the content in the same way that sending an ordinary www.sample.com/page.html link will return the content.
Another message mentioned the problem of loops. A table of visited URLs does not always work because of the problem of relative links that get continuously appended to on sites that return index.html for broken links. Two alternatives are:
(1) limit the spidering depth so that you only go, say, 4 links deep into the site, or
(2) make a hash value on content returned, and use the hash value to see if you are getting the same content with a different URL. Stop spidering any time the hash value is the same as a previous hash value.
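Both alternatives fit in a few lines. The following is a minimal sketch in Python, not a real spider: `fetch` is a stand-in callable that returns the page content and its outgoing links (or None on failure), and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

# fetch(url) -> (content, links), or None if the page cannot be retrieved.
Fetch = Callable[[str], Optional[Tuple[str, List[str]]]]


def crawl(fetch: Fetch, start_url: str, max_depth: int = 4) -> List[str]:
    """Breadth-first crawl that stops on depth and on repeated content."""
    seen_urls = set()
    seen_hashes = set()
    indexed = []
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        # Alternative (1): never go more than max_depth links deep.
        if depth > max_depth or url in seen_urls:
            continue
        seen_urls.add(url)
        result = fetch(url)
        if result is None:
            continue
        content, links = result
        # Alternative (2): same content under a new URL means we are looping.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        indexed.append(url)
        frontier.extend((link, depth + 1) for link in links)
    return indexed
```

A site that serves index.html for every broken link is caught by the hash check on the first repeat, so the ever-growing relative URLs are never followed.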
I have a lot of interests, but I don't think a URL entry for every web site that holds my interest would fill up 20 gigs of data.
Also it should have the option to only search entry points into domains, such as http://slashdot.org but not http://slashdot.org/whatever/more/test.html
On a similar note I think the 'WEB' has gotten to a point where web sites need a tag that determines an overall content review, an example would be that porn sites may have this , and a personal site may have tag and so on. So I can click the no porn option on my search engine and not have 500 returns, 450 involving animals....
Now I use porn as an example, but I don't think it should be removed from the net. I just think I should be subjected to it only if I want to be.
So, yes, in terms of technology, it is easier to classify webpages into categories, then index them within each category. Check out http://www.cora.jprc.com/ It is a search engine for Computer Science research papers. It is in a format that is just like yahoo. But everything is done automatically!
So, the technology is here. It is just a matter of time before this kind of thing is necessary.
This front-end does a lot of (admittedly, crude) parsing of the rest of the URI line to determine what "document" in the DB to look up, and which subordinate page, or if it's supposed to be instead generating an image, or whatever. The main script also looks up styles for each document, builds navigation bars, etc.
Works pretty well. Not nearly as flexible as if I'd actually thought it through before writing it, but it fits our needs admirably.
Why'd we go to such a complicated approach? Because we have bunches of InfoSec engineers, who really don't want to worry about HTML, writing web pages and reports. We've got a GUI front-end with a nice wysiwyg HTML editor that hits the documents in the DB directly, and all changes happen on the "live" HTTP server immediately. It's completely scannable because we use a web-get sort of tool to create a static "snapshot" of the final report before we send it to customers.
At any rate, I think it's cool... :-)
david.
A friend of mine asked me once to explain my opinion of why the web is broken. After some thought, I came to some conclusions that are relevant here. I'll see if I can restate them effectively. All IMHO, of course.
A couple of assumptions:
1) The web is a non-hierarchical, non-linear system. The entire nature of it is actually closely related to how most people think, through a series of links. Ever found yourself explaining to someone how you got from one seemingly unrelated topic to another? The web is the same thing.
2) Mapping linear, hierarchical systems is what humans are good at. Indices, tables, flowcharts, etc. are all designed to present a certain kind of data in a randomly accessible way. When information is non-linear, we try to force it into this kind of structure, for better or for worse. This is what search engines currently try to do - provide a keyword index to every document on the web.
We cannot treat the web like something it is not. It is not a book or a collection of books. It is not even linear. It's a lot closer to the repository of information that is the human mind than most things that humans create.
This presents an information-finding nightmare. Much as it's sometimes difficult to find the piece of information you know you have stored in your head, it's becoming increasingly more difficult, even with the power of algorithmic parsing and pruning, to extract single pieces of information from the system. Search engines are, as the original post stated, becoming obsolete.
So what is the solution? In my opinion, the most intuitive 'index' type interface to the web has always been Yahoo, which for any given topic will provide a number of starting points. Not every document is indexed, not everything is represented - but if you drill down through links, you are more than likely to find what you're looking for. It takes the natural process of searching the web, which if it were a few hundred nodes could easily be done by hand, and gives it a logical starting point, much as someone can remind you of something you were searching to remember, and suddenly it all becomes clear. Indexing the entire web is as useless as trying to do an entire braindump of your mind. Indexing a set of starting points for using the web the way it was intended - as a series of links - is the only way that will probably ultimately work.
It's clear that many aspects of a webpage that could be pregenerated every time the information is updated are not being done that way. Slashdot is a prime example. Presumably, thousands of people visit Slashdot anonymously every day. Even though they see the same content, the page is regenerated with unsearchable/uncacheable content. Shouldn't it be a simple matter to have a script choose between a dynamic page for logged-in users and a constantly up to date pregenerated page for anonymous users? Saving CPU cycles for the servers, allowing indexing by search engines, and speeding up accesses for users behind a caching proxy. Sounds like only good things to me.
Obviously this won't solve all of the problems, but many websites' front pages are the same for every user. Wouldn't it make sense to pregenerate it as static content? This could be taken much further by news sites that provide the same story content to every user, but use a database frontend for simplicity anyway. This doesn't preclude use of a backend database for information storage and organization, but it does impose quite a lot of complexity in the implementation of a system to index all of the pages as they become available and make them into static, numbered pages.
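The "script that chooses" could look roughly like the Python sketch below. Everything here is hypothetical: the `FrontPage` class, the `render` callback, and the `content_changed` hook are placeholders for whatever a real site uses, and the pregenerated page is held in memory rather than written to disk.

```python
from typing import Callable, Optional


class FrontPage:
    """Serve a pregenerated page to anonymous visitors, a fresh one to users."""

    def __init__(self, render: Callable[[Optional[str]], str]) -> None:
        self._render = render  # render(user) builds the page from the database
        self._static_html: Optional[str] = None

    def content_changed(self) -> None:
        """Call whenever a story is posted or edited; drops the stale copy."""
        self._static_html = None

    def serve(self, user: Optional[str]) -> str:
        if user is not None:
            # Logged-in users get a personalised, dynamically built page.
            return self._render(user)
        if self._static_html is None:
            # Regenerate once per content change, then reuse for everyone.
            self._static_html = self._render(None)
        return self._static_html
```

Crawlers and caching proxies fall into the anonymous branch, so they only ever see the stable copy.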
I tend to fall into the category of folks who believe that site designers should be a little more aware of the outside world and make their content accessible via every possible means. I don't think it makes sense to prevent search engines from finding one's content. If you've put it up, you want people to find it. Why turn down that extra banner display simply because someone doesn't check your headlines and instead searches Google or Altavista?
I'm sure there are other issues involved and I'm glad this was brought up...I've been trying to figure out solutions to these problems myself while implementing a company web page with our web designer. Being a cache server company, we've got to make sure our own pages are completely cacheable whenever humanly possible. Not to mention that when someone does a search on any engine we want our URL to come up if we've got something to say on the subject of the search. It just makes sense to be as openly accessible as possible.
So, is this a problem that should be addressed mainly by the search engines, or should web designers be thinking ahead to such concerns when they are building a site with dynamic content?
Joe, Swell Technology
Markus
--
The answer to all this isn't going to come from making existing engines better, nor is it going to come from bigger, badder, faster database engines powered by your friendly clustering technologies!
The answer is simple: More specialized search engines. You're looking for technical stuff? Then you should be able to search a technical database. Like, if I'm looking for source code to model fluid flows - that's pretty specific already. There's no reason that I should have to wade through all the references to "bodily fluids" that I'll get on altavista for instance!
Search engine people, take note of this. Classify your URLs into categories - like Yahoo - but come up with some way to do it automatically. Or even better yet, let the users do it, a la NewHoo.
End of internet predicted. Film at 11. We've heard it before, and we'll hear it again. Just need someone with a little VC money to throw it towards an idea that supports more specialization in search engine tech.
Kudos..
..don't panic
I use a free site statistics service to keep track of hits to my web site, where I keep some software that I've written. Looking at the referrer statistics to my site, the vast majority of hits are generated from explicit, categorized links to my site (e.g. bookmark pages and surprisingly Lycos which has a categorized database), and rarely ever from general search engines like Altavista. The questioner may be right - from the perspective of a web site owner, general search engines aren't very effective at bringing visitors to my site.
http://www.navigateone.com OK, so it's only financial information, but it does update itself, work out queries on its own etc..etc... So it's not impossible. p.s. It's nothing to do with me.
People just don't know how to search.
Since I have been using the internet, I have stopped making daily trips to the library. Searching is an art. The web is pretty searchable, but it takes quite some effort: knowing the right search engines to use for what, and knowing the right keywords and combinations.
------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
IIRC, XML was designed to help alleviate this sort of thing. Unfortunately, XML has not been exploited enough to have any significant ramification on the way the internet is sorted.
Why doesn't anyone use the ScriptAlias directive? It does the same thing as query strings, but makes it look nicer, like the rest of the web. You can "say" you're looking at a directory or a .html file, but in reality you are viewing a single script. For an example go to http://store.wolfram.com/. There are no directories on the server side, it's all served off of one script. Yet, to the user, it appears as a hierarchical directory structure, complete with .html files. The only query string is your session id, which is appended to the URL in case your browser doesn't support cookies (however, these are not there if a robot views the site). Anyway, a simple directive like ScriptAlias can save everyone a lot of trouble. If anyone has questions about its usage, send me an email.
Jon
Engineering and the Ultimate
OK, maybe that's a bit of an overstatement, but not much.
Does anyone else remember searching before the web came into its own? I remember constructing carefully planned Archie searches only to often find either no results, or pages and pages of 'em. After a while (and perusing a lot of those pages of results), you learned which sites had most of the stuff you needed. Windows shareware was at Simtel-20, OS/2 at cdrom.com, Unix at sunsite, etc. Non-software usually necessitated going back to Archie and throwing searches at it until something stuck.
Fast forward, and we're still doing that with the web. The only difference is that the amount of archived, non-software information (and hence its importance) has gone up dramatically. In light of that, I'd say that the search engines are more useful than one might expect.
Unfortunately, that's not really good enough for practical purposes. Forget all of the techniques we're trying to tweak right now. Someone has to come up with a fundamentally different way of searching through indices; one which behaves the way altavista claims (but fails) to do. In other words, enter a question and have the engine _interpret_ the question before searching.
But I don't see it happening for a few years. Oh well.
"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
One solution, anyway. Simply tell the spider not to index anything more than X levels deep into a site. Where, of course, X is a relatively small number - say, 5. Alternately, for this sample, look closely at the URL. If you're looking at /foo/foo/foo/bar.html, then there "must" be something wrong with the path, so stop looking there and move back out.
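A rough sketch of both checks in Python, assuming a depth limit of 5 and treating more than two identical path segments in a row as a sign of a loop. Both thresholds and the function name are made up for illustration.

```python
from urllib.parse import urlsplit

MAX_DEPTH = 5  # the "X" above; an assumed value

def should_crawl(url, max_depth=MAX_DEPTH, max_repeats=2):
    """Return False for URLs that are too deep or look like a path loop."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return False
    # Count consecutive identical segments, e.g. /foo/foo/foo/bar.html
    run = 1
    for prev, cur in zip(segments, segments[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return False
    return True
```

With these defaults, /foo/foo/foo/bar.html is rejected because "foo" appears three times in a row, while /a/b/c.html is crawled as usual.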
Trifle not with Dragons, for you are crunchy - and go well with catsup.
Hmm. How about this for an idea:
When a webbot sees a dynamic page, it changes the query to ?Webbot and expects to get back a specially formatted page starting <H1>Webbot index</H1> and followed by a set of comma separated keywords, a break, a URL, a paragraph, then the next set. The webbots would be happy, as they don't have to waste bandwidth and cpu time spidering over the site; the server should be happy, as it doesn't have to support the webbot's spidering; and the site owners should be happy, as they can specify what keywords each result will be indexed under. Obviously, just reformatting the index to the product database could generate this page for an ecommerce site, and more static sites could just use a static statement of what their site carries...
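As a sketch of what the server side could look like, here is some Python that builds such a page from a list of entries. The exact markup (a <BR> after the keywords, the description in a <P>) is one reading of the format described above, and the function name and entry layout are assumptions.

```python
from html import escape

def build_webbot_index(entries):
    """Build the page returned for a "?Webbot" request.

    Each entry is a (keywords, url, description) tuple; the field order
    follows the post: keywords, a break, a URL, a paragraph.
    """
    parts = ["<H1>Webbot index</H1>"]
    for keywords, url, description in entries:
        parts.append(escape(", ".join(keywords)) + "<BR>")
        parts.append(escape(url))
        parts.append("<P>" + escape(description) + "</P>")
    return "\n".join(parts)
```

An ecommerce site would feed this from its product table; a mostly static site could just keep a hand-written list of entries.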
--
-=DaveHowe=-
First, the use of robots.txt is well known, but it is never used when you are in a hurry. This means: never used. Only when you get too many hits from a search engine do you put one in the directory. Even then, for lack of time, you normally exclude all pages except the front page.
Second, designers and webmasters are not pushing the technical possibilities to the max, like using Apache paths and mod_rewrite to transform queries into a virtual path, which would make a document look real, even to a search engine. Since this is not an out-of-the-box feature, there is no hope for this.
But the worst one is a problem with the engines themselves: the time between two visits of an engine. It is up to a month now. This means: better to use local search on slashdot, freshmeat, nerfpoint and others.
Indexes like yahoo, infoseek, web.de (German) and others become the only hope to find a start and then use local searching.
This is why altavista, hotbot, lycos and others got the additional "directory" feature. Compare them to google and you know how they looked two years ago. And by the time, google will ... well, maybe not. I hope.
* Smile. People will wonder what you think. *
My Suggestions for said protocol:
It's not putting Altavista or yahoo or others out of business b/c you still need those top level servers to query everyone. It's solving the dynamic problem b/c each search site can create its own DB however it wants, which also still gives Excite and Infoseek and the like a market in which to sell their search engines.
Until this protocol is ready, create static pages from your dynamic content so that the conventional search engines will have something to catalogue.
-f
http://www.peruano.org/
-f
www.blackant.net
For me the best way to search the internet is to go to a site dealing with the context of your query and search that site with its own search engine (which most major sites have).
It would be nice if a generic search engine worked in the following way:
1. User searches for say "Cisco VPN Routing"
2. The search engine identifies sites www.cisco.com and other sites which are related to the search query string.
3. Instead of trying to account these sites it calls on the search engine at the site matching the context and queries it instead.
4. Returns the results of the search at cisco.com to the user.
It's kind of like a distributedSearch, where the actual search is done by the holder of the data, all that the search engine actually does is try to find a context for the Search Query and find sites with their own search engines that match that context.
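The steps above could be sketched roughly like this in Python. Everything here is hypothetical: the registry of sites and topics, and the per-site "search" callables that stand in for a real HTTP request to each site's own search engine.

```python
def pick_sites(query, registry):
    """Steps 1-2: find sites whose declared topics overlap the query terms."""
    terms = set(query.lower().split())
    return [site for site, info in registry.items() if terms & info["topics"]]

def distributed_search(query, registry):
    """Steps 3-4: hand the query to each matching site's own search engine."""
    results = []
    for site in pick_sites(query, registry):
        # In a real system this would be a request to the site's search page.
        for hit in registry[site]["search"](query):
            results.append((site, hit))
    return results
```

The central engine only has to maintain the small site-to-topics table; the actual index stays with whoever holds the data.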
So in answer to your question: My answer is No, the Internet isn't unsearchable, we just haven't implemented a reasonable standard for searching, which can be as important as routing when it comes to a network of the size of the Internet.
I work for a .com company. We had this very issue as our entire website is dynamically generated with a single C program. It grabs various parts of pages created mostly with php3.
How to index it? Frames. Put a 1 pixel frame at the top of the page. Hardly noticeable, and the search engine sees the frames page, where you can put a whole bunch of comments and meta tags and a tag to get the straight text, depending on the search engine needs.
I like a snappy search engine response time as much as the next guy, especially when I'm looking for something fairly current or mainstream. But how can you tell a search engine to tread farther off the beaten path?
For example, a few days ago I was looking for the dip switch settings on an old 14.4k modem. Now I *knew* the info was out there on the web somewhere. I also thought it was highly unlikely to be in any of the major search engines' in-RAM indexes. I would have been quite happy to submit a boolean or reg-ex query to a search engine and then check back an hour later to get the results.
In my mind, instant gratification search engines are useful and have their place, but I see a whole segment which just doesn't seem to be addressed. Is anybody even thinking about working on this?
-matt
There was a time when I could jump straight into an Excite power search and be assured that I'd find what I'm looking for within minutes.
I don't think that PHP or high usage of CGI has affected things, tbh. But search engines like Yahoo!, who don't trawl for content, are going to get entirely more useful.
If ebay has their way, indexing data is equivalent to cracking into another's system illegally. I guess that means that we should do away with all search engines entirely...
Nose
Nose -Common Sense isn't.
Give anyone the ability to talk directly to search engines and you'll see what has been happening with those damn porn sites on a large scale - do a query for anything, and it'll come up with a totally unrelated porn site for you.
People figured out how to abuse keywords real quick, and this would just make it worse. Which is why I wonder about the continued existence of search engines. I use \. as my search engine - I use it to index my way into the web every day. I think that's the way of the future.
PS I hate the G3 keyboards. They're tiny! It's like carpal tunnel syndrome x 5!!!
lf.o
I think in some cases, it is easier on both web site maintainer and search engine for the content to be periodically generated rather than dynamically generated upon every request.
Continuing the Slashdot example, for awhile during one of Slashdot's bandwidth crunch times, Rob was running CacheDot, a static version of Slashdot that was updated periodically.
Sites that I run contain content such as product database representation, and these pages are regenerated whenever somebody adds/deletes/edits information in that database. This may become impractical (Generating a complete product catalog) for larger sites, but then it's just an issue of generating a particular category, or even locking it down to a specific set of files related to the product being changed. (In a sense, Makefiles for web sites) It's not terribly difficult to accomplish this generation work, and the result is cacheable product information, which I consider a Good Thing.
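A minimal sketch of that "Makefile for web sites" idea in Python, assuming products are plain dicts with "name" and "category" fields and that each category gets one static file named after it. The names and layout are illustrative, not what any of the sites mentioned actually run.

```python
from html import escape
from pathlib import Path

def render_category(category, products):
    """Render one static category page from the current product list."""
    rows = "\n".join(
        f"<LI>{escape(p['name'])}</LI>" for p in products if p["category"] == category
    )
    return (
        f"<HTML><HEAD><TITLE>{escape(category)}</TITLE></HEAD>\n"
        f"<BODY><UL>\n{rows}\n</UL></BODY></HTML>\n"
    )

def regenerate(changed_product, products, out_dir):
    """Rewrite only the page for the category of the product that changed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumes category names are safe to use as file names.
    target = out_dir / f"{changed_product['category']}.html"
    target.write_text(render_category(changed_product["category"], products), encoding="utf-8")
    return target
```

Calling regenerate() from whatever code handles add/delete/edit keeps the static pages in step with the database without rebuilding the whole catalog.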
Everyone who thinks about this problem a lot comes up with the same answer: searching based on content never worked in the library context and won't on the Internet either. Metadata is the right way to go, which is why Yahoo and ODP are more popular than the robot-driven content search engines. The only model that has a hope is the Open Directory, but the right answer is a cultural shift where when people post data, they post metadata at the same time.
Does anyone know how they do it? Certainly some have special deals with the sites they're searching (I think PriceWatch does this mostly), but there are so many products on these sites that it seems like they'd have to be spidering...
Are these bots very specialized, or can their techniques be used for the rest of the 'net?
Search engines consume an enormous amount of bandwidth while only indexing small parts of the web.
I think distributed indexing is the way to go. Give everyone with a website a tool which indexes her site. Create an open index format to ensure that sites with dynamic content can create an index in that open format.
Send compressed indexes to the search engines every time relevant content has changed.
The problem: the common index format (though I think the Harvest project produced such a format: SOIF). The search engine companies will never cooperate on this - the users have to do it.
But as long as the search engine results are 'good enough' [tm], nothing will change.
By neuro-- it ain't over 'til it's over
With so many sites now offering dynamic, up-to-the-minute information, merely caching the contents of a page at a particular moment in time is at best only catching glimpses of these pages, and at worst leads to misleading search results when querying the search database. It strikes me that with these sorts of site, where the content of the pages is changing so rapidly, something more objective is needed.
For example, the first clues about the state of flux of a particular page can be obtained by diff'ing the page against the previous copy held in memory. If the page is simply having extra items added in periodically, such as a FAQ, then the diffs will generally show that there is more being added than taken away, and the traditional snapshot method employed by search engines is fine. However, if the page is wildly different in the majority of its content (such as the slashdot main page), there is practically no point in making a copy of the page for indexing purposes. A much better solution is attempt to build a keyword database automatically for this page by lexical analysis of the text - even a '100-most-common-words' list (with 'the', 'and', 'is' etc filtered out) would be an improvement on the current situation. As repeated visits build up, this keyword list will refine itself and actually provide some reasonable pointers to the material likely to be found at the site.
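Here is a small Python sketch of both halves of that idea: measuring how much a page changed since the last visit, and falling back to a keyword list when it changed too much. The 0.5 threshold, the tiny stop-word list and the function names are assumptions; a real engine would tune all of them.

```python
import difflib
import re
from collections import Counter

STOP_WORDS = {"the", "and", "is", "a", "of", "to", "in", "it", "that"}

def change_ratio(old_text, new_text):
    """0.0 means identical, 1.0 means no lines in common."""
    matcher = difflib.SequenceMatcher(None, old_text.splitlines(), new_text.splitlines())
    return 1.0 - matcher.ratio()

def top_keywords(text, limit=100):
    """Most common words, with the stop words filtered out."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(limit)]

def index_strategy(old_text, new_text, threshold=0.5):
    """Pick "snapshot" for slowly growing pages, "keywords" for volatile ones."""
    return "keywords" if change_ratio(old_text, new_text) > threshold else "snapshot"
```

A FAQ that gains one entry between visits stays well under the threshold and is snapshotted as before; a front page whose lines are all different gets the keyword treatment. Merging the Counter objects from repeated visits would give the self-refining list described above.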
With any large database, particularly when you get to the stage where GBs of information are considered small fry, the need for efficient data mining and generation of useful indices becomes increasingly important. Database technologies are looking forward to a time when there will be a need for Petabyte and Exabyte storage and retrieval, and effective distillation of a web page's information, rather than a word-for-word verbatim cache, will be the only answer.
Cheers,
Toby Haynes
Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
One thing you can do is cache a portion of your pages and regenerate them every 5 or 10 minutes or whatever. If you cache the most interesting 10,000 pages, that's 10,000 pages that the search engines can add to their databases. That's what we do at rivals.com, but it's based on popularity, not interest.
Browse a few relevant papers and find some keywords to search for more of the part of the field in which you are interested:
Raw indexing of HTML leads to raw results that are often of no use whatsoever. What the indexers need is a way to query a 'site' for pages that should be indexed, how often to index them, what the general topic areas of the pages are, etc. Also needed are HTML tags that indicate the 'content area' of a page (so that navigational header and footer crap can be ignored) and a means to apply relative weight to areas of the page.
I'm not sure searching can be automated at all. That's why portals are going to play an increasing role and become increasingly specialized. It's worth remembering how young the Internet is right now and how centralized most useful resources are around very few sites. A recent BCG report indicated that 43% of online dollars are spent in the top 10 stores. That's insane considering the overall brick&mortar market they represent. Information is no different. As specialized websites continue to grow in number, specialized portals are going to develop to fill the need of finding useful information. Sites will have a much more interactive rapport with their portals than is currently the case. Search engines displaying thousands of hits per search will die off as their utility continues to diminish. And, ultimately, those who win will be the ones whose content and opinions are trustworthy (think Slashdot).
A few suggestions here (for the web designers and the people doing the searching):
1) The previously mentioned two-level web sites: enough static pages (with meta-tags etc.) to capture the search engine's interest. Backed up with dynamically generated pages for the bulk of the content.
2) A huge collection of static pages refreshed from database-hosted source material. The static pages are updated whenever a change is made to the source. I'm sure a lot of web sites use this already in some cases; it probably performs better when the number of updates isn't too high anyway.
3) Using "well known" sites for your searching: I remember attending a web-design conference where one speaker talked about search engines actually increasing the search time when compared to users clicking through links (on a properly designed site). Sites such as the IMDB, the big web bookstores, about.com, slashdot and the major news sites provide so much useful information in one place there often isn't any need to check anywhere else.
I tend to locate a site or two that excels at providing a PARTICULAR type of content and go straight there instead of a search engine. All of the companies working on these general-purpose "web portals" (ick) should give up. Locate a niche and work on providing the BEST content and comprehensive links that you can on ONE TOPIC (or at least use some common sense).
4) Smarter search engines? I've switched to using Google almost exclusively; it often displays the site I'm looking for in the top 5. However, I've clicked through 5 pages of results, given up in disgust and found the perfect site at a later date by sheer coincidence. I suspect that the perfect search algorithms are going to elude us for some time yet, and the WWW is getting too big to allow human-aided searching to make much of a dent.
It has been my opinion for a long time that database driven dynamic web pages are entirely overused. If more people used things like Website Meta Language to preprocess their web site and make them "dynamically generated but statically served" that would take us a long way toward being able to index content.
There is a tradeoff. All of your content is then not only in a database it is also in the web pages. But in my experience most sites who are dynamically generating their content via PHP, ASP, perl, mod_perl, whatever, don't really have enough content to worry about it.
...the problem is the people who have completely and totally ignored everything the W3C ever said about why and how tags and documents should be used. Okay, so it's not limited to just that, but it's the most obvious symptom.
For example... How many sites have you seen simply neglect to use the paragraph (<P>) tag? Instead they choose to make indiscriminate usage of the hard line break (<BR>) tags to separate blocks of text. This is silently wrong although the visible output is the same. Remember how WAIS engines could further qualify searches by how "close" multiple words in a search were to each other in a document? Here in HTML we have a way to group words into semantic bundles by paragraph, and people completely ignore it. (No, we're not to the point yet.)
How many times have you actually seen a web page author use the <DD> and <DT> tags properly when giving definitions? Most authors seem to simply decide that they don't like the way the text looks, and use some oddball invocation of tables and/or transparent GIF images. Of course, this means that a search spider has no idea that it's looking at the definition of something now, where if the text were marked properly, any query of "definition of widget" or "definition:widget" would immediately return that page! Why do people dislike using <DD> & <DT> for definitions? The most popular answer I get is that they didn't like the way those tags formatted their text. They're entirely missing the point again that HTML is for marking different parts of a document with extra meaning. The browser is supposed to be what decides how it is shown to the user. META tags were abused by porn vendors and the other bottom feeding denizens of the net to the point where they are nearly useless now. Even with CSS1 and CSS2 waiting in the wings to allow authors to properly control document layouts, most people seem to be too lazy to create their documents properly, so long as it's not immediately obvious that they were the ones who did something wrong. (Seems like the attitude of some large corporations--and we're still not at the point yet.)
The proper use of HTTP is also completely neglected by most web site administrators. The cache/no-cache pragmas, the last-modified times, the content-type declarations, these things were all meant to give hints to the remote client (which is not supposed to be assumed to be a browser) about what type of document they're looking at and how to deal with it. Instead we find sites who have marketing directors who insist that everything be done to inflate their hit counts as much as possible by preventing last-modified times from going out so the browsers won't cache the documents. We have entire sites which in their insecurity that someone, somewhere, might decide that the entire site sucks and needs to be done over (just the look, not the content mind you!) so they make the entire site out of dynamically generated content (like shtml- and php3-only sites), even though the parts that matter never change. (Apache now includes a number of things to get around this problem of template driven content by the way--see the 'Full' option for the X-Bit Hack for one such example.) (Almost there now.)
I'm terribly sorry to have to point it out, but far too many web page authors have completely disregarded the fact that HTML is not meant to be used to format the text. HTML is meant to mark-up the document so that the browser can format the text, and thus, upwards of 90% of the web pages online today are a folly in progress.
Tons of things to facilitate search engines were specifically included in the protocols, but go straight out the window in practice because of short sighted people who seem to think that the title WebMaster confers them automatic competence and understanding of the system itself.
Do not blame the search engine for the ignorance of the masses (because they are asses).
Style over substance is the real culprit. (Point!)
The "invisible web" issue being discussed is one that is gaining a great deal of energy as more and more users, especially new and unsophisticated web searchers, learn that many of the general search tools cannot and do not make all that the Internet offers easily, if not entirely, accessible and/or retrievable.
Searchers, after learning this fact, must become knowledgeable about "specialty tools" in the area(s) that they need information in. This is quite similar to finding the necessary specialty reference book on the library shelf.
Below find the URLs for a large and growing collection of these tools, which many visitors use as an acquisition tool to help in the selection process.
Unlike similar "Invisible Web" resources, these pages have a more academic/scholarly feel to them.
direct search-Main Page:
http://gwis2.circ.gwu.edu/~gprice/direct.htm
direct search-State (U.S.) Databases
http://gwis2.circ.gwu.edu/~gprice/state.htm
direct search-Searchable Bibliographies
http://gwis2.circ.gwu.edu/~gprice/bibs.htm
http://www.altavista.com/cgi-bin/query?pg=aq
You can find more info on the Invisible Web here:
It depends on what technology you're using to generate the pages.
Zope sites for instance, are totally dynamically generated, even those pages that would normally be static. But the entire content of the site that's stored in the ODB is traversable via 'normal' URLs. This means that search engines can easily index your entire site.
Note, however, that this only works if you've taken care to expose your content via links. If you've deliberately hidden your content behind a search interface (and you can still do this with Zope), then your site will be no more indexable than any other dynamic site.
--
The real Webmaven is user ID 27463. I don't rate an imposter, because my ID is such a lame-ass high number.
(Those who know my views on RealNames know I'm only kidding.)
Having a database visible to a search engine depends greatly on the complexity of the database itself. Something simple (like the MySQL/Perl-driven Imprinted Products Source List ) can be given a default list-everything URL that doesn't look like a script. As size and complexity increase, of course, that isn't feasible (or even desirable), but it might be adapted to display a representative SQL View of a complex database, with sufficient content to give the search engine the "meat" it needs.
No Laughing Allowed!
Mozilla's Open Directory Project can always use more volunteer editors to index web sites into yahoo-like categories. Editors are expected to have knowledge about the cats for which they are responsible, so there's human judgement involved. I know, it's not as efficient as meta-tags and spiders, et cetera but humans are creating web sites (mostly). Maybe in the long run, humans are required to sort it all out properly. ODP data is open source (I think I'm using the term correctly), and used by many web directories.
Got a beef? Plug a name into the Bizarre Rumour Generator!
It seems to me that having URLs with extensions of:
is incorrect. What is being served is not an ASP script, nor is it a PHP script, nor is it a Perl program. It is, however, an HTML file (or a GIF, or a PDF, etc.), and should be labelled as such.
If your server isn't smart enough to figure out how to generate the requested resource, and needs the generating program explicitly mentioned in the URL, then you need a smarter server. And if you aren't smart enough to figure out how to do this correctly, well...=)
Remember, kids, a URL != a file. All the /. end user cares about is getting an article with the comments formatted appropriately. They don't care[1] if it's stored as a text file, or generated by Perl, or..
[1] Well, they might care in a geek sense, but not in the way needed to read comments.
pooptruck
The web is certainly becoming significantly more difficult to search, especially for informational content. Just -try- searching for information on a musician or an author... you'll get links to the likes of music.com, amazon.com, whatever-your-topic-is.com, with a little one-or-two paragraph blurb about the person, if you're lucky. Hundreds of links like this to every little virtually-hosted e-tailer out there. Somewhere, buried in all this, will be the informational content hosted on a personal webpage or at some non-profit organization. Anyway, so, that's the problem, or an aspect of it; we already know this.
... probably 'cause it's mostly static pages and there are not so many anime fans as there are linux users. But that isn't really relevant; if linux.com is going to become the search-engine alternative for linux-resources, they need to respond quickly at all times of the day and night, otherwise 'Joe's Linux Links' is a better option. :))
Good news! The solution is coming. Maybe the solution is here. google.com has their unique approach to web-indexing. Another method that's probably going to be tried sometime soon is to look at all the natural-language-processing technology that has been researched in the past twenty years, take the most efficient heuristics, and index pages by apparent-topic instead of by keyword.
Then there are places like anipike.com - if it's a web page about Anime, it's on anipike, or it may as well not exist. I would -never- search the web for anything anime-related; I go through anipike.
I'm really, really hoping that linux.com will become that useful to the linux community, but I don't think they're quite there yet. They may never be. Anipike is generally very fast to load, especially compared to linux.com
(Apologies to any Joe out there who is proud of his links page.)
Anyway, currently I still use search engines for Linux-stuff, but as I keep getting more and more hits on rpm files cluttering up the informational content, that may change soon. (Especially since I'm a debian user! I'm looking for information when I search the web, I know where my package is.)
--Parity
'Card carrying' member of the EFF.
Things such as... www.domain.com/index.php3?q=page&dummy=yawn.html So I named my .html page a bit odd... wasn't there also an <isindex> tag for stuff like this?
Php/dynamic pages may be indexed by search engines, but do they show up well in the search results? After all, it doesn't matter if your page is indexed or not when the listing is buried thirty pages deep.
I think it a distinct possibility that php/dynamic pages may be penalized in relevance scoring. Does anybody know of a php page that is indexed in the top 10 results for any search term? (Yahoo doesn't count - that's a directory.)
I have read quite a few comments suggesting that the key to searching might lie in putting some of the load on the sites themselves. This includes ideas on how to write HTML, XML code or whatever, or including a search engine at the site itself. The problem with this is that people are lazy. If someone wants to write HTML code, they just want to write code, not worry about whether it can be searched. Basically HTML writers are like programmers: lazy people who would prefer that a function handled the error instead of having to write two lines of code to handle it. We will have a successful engine when we have one that can search any content; that is, you don't need to write your HTML code in a particular way for it to be searched. You don't have to use special tags or whatnot for it to be searched. It is true that using special tags would be nice, but only .00001% of sites will use them. Search engines should never depend on whatever they search to tell them how to search it; they need to determine the structure of the data themselves.
------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
One extension you are missing, which I personally feel is a worthwhile addition, is .cfm and .cfml. Those of you who are fellow Cold Fusion programmers will already know how wonderful and powerful a web application server it is... and how its growing popularity and recent expansion to support Linux... is not to be ignored!
put a "meta" field in the database and then use a cron job (or equivalent) to generate the core html at intervals to incorporate the meta field into the search fields? You couldn't go overboard, but if the script ignores blank fields in the database and you just drove with "significant" items, you could make this work.
Just a thought
Hank
The solution to this problem for developers who want their site indexed is anywhere from simple to completely trivial. There are several common methods:
- A generic solution that works in many cases regardless of server software and scripting language is to generate (rip) static pages from the database on a regular basis (daily) and link them hierarchically. Not only will this allow the search engines to pick them up, but it may dramatically reduce the load on your web server if users start pulling these static pages rather than require cgi/database hits for the pages.
- When using PHP, just make .html the extension for php scripts. This will cause all pages to be parsed by PHP and therefore incur some additional processing overhead, but the newer PHP parsers are quite speedy.
- If using Apache/Perl, it would not be difficult to hack the CGI module (I'm sure this has already been done...) to look at an alternate cgi encoding other than .cgi?xx or .pl?xx and then hand the page off to the proper module. I assume this is what is done on the many sites that show URLs in the form of:
These are, of course, only the simplest and most obvious solutions. There are no doubt many more. If you have a loaded site where speed is an issue, you could use .html for plain text and .htm for php scripts. I also believe that it is possible to specify what files should be processed by PHP and which ones should not. Of course, it has been over two years since I have played with PHP...
-p.
What we need is a way to reliably categorize web pages that doesn't involve roving the whole net. If the author assigns a category, it's a starting point. Then you can search just a category. These are a start:
There are more major categories of course, and there would obviously be subcategories to hone the topic down. Essentially what we need is the web equivalent of the Dewey decimal system, where a page or group of pages can be categorized and subcategorized. With categorization, self reporting is less prone to misdirection because they have to choose one. Then you create a slashdot-like moderation culture to rate and correct the categories as search results are returned.
Maybe you do still have to crawl the web to provide finer granularity than just categories, but perhaps crawling over provided links will help reduce the work load. After all, the whole internet isn't new every night. You could potentially prioritize crawling in some way as well.
Maybe I should have patented that before I revealed it...
Today is all we really have. We should all live it well: it is our stepping stone to all of our tomorrows.
All of this made a lot of sense even back in 1997, and I think that the issue is even more relevant now. Computer scientists know how to generate the content, and library science is very good at organizing and categorizing information, as well as indexing it for the easiest way to look it up. I read this issue from front to back twice, and it's a permanent part of my library.
I can no longer read Dilbert. It's too depressing, because it is too real. -- Hyperhaplo
I recommend a real language, like Perl or PHP or JSP or anything but ColdFusion. I had to do ColdFusion for 6 months and I just about went nuts.
I work for a startup company, "XYZ Find" (www.xyzfind.com). We have nothing available yet, but we are developing an XML-based "search engine" that allows parametric search of data (in any XML schema!) on the Web. This will combine the advantages of full-text search engines (broad coverage, simple interface) with database query (precise parametric search, highly relevant results).
This does require that database-driven sites expose their data as XML, but this is starting to happen already (look at RSS), and we believe (and hope!) that it is an increasing trend -- and one that will take off once XML search engines like ours are available. (We're not the only ones doing this, though our solution will of course be the best.)
This may sound odd, but I think it would work. Why don't the web developers and programmers work together on this one and create a draft standard protocol (RFC) that can handle searches? What do we know about searching the web?
These are really enough to get a handle on how this could work. By problem 3, it's obvious we can't send a keylist request to each server in the world, and get their response (though this would be the best solution for maximum search depth.) What we can do, however, is present servers the ability to contact whatever search engine is the main hub(s) and send a keyword based tree. This will allow a search engine to grab information instantly and give a list of sites with that keyword or description.
Most likely, though, the answer will have to be two-fold. What about people who just send in infinite key words? The first search the engine finishes is the "domain->keyword list" search. From that point on, either by querying each individual server on the original match to get more extensive information, or hitting a previously cached crawl, context and relevance numbers can be fleshed out.
The final result, then, would be this:
Domain->keylist->internal|external lookup
The structure would still allow results like what current engines give us, but much more up to date. The protocol could also include a "last update" kind of field, so internal data doesn't have to be updated for x days/months/years. I think if we work on it, it could happen. But it's the only real alternative. Indexing the entire internet just isn't possible.
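A rough sketch of the "domain->keyword list" stage, in Python. Nothing here is an existing protocol: the `KeylistReport` and `Hub` names, the fields, and the idea of using an integer for "last update" are all assumptions made for illustration. Each server pushes its keyword list to a hub, and the hub skips reports whose "last update" value has not changed.

```python
from dataclasses import dataclass, field


@dataclass
class KeylistReport:
    """What a site pushes to the hub (hypothetical message layout)."""
    domain: str
    keywords: list[str]
    last_update: int  # unit is an assumption, e.g. days since some epoch


@dataclass
class Hub:
    # keyword -> set of domains that reported it
    index: dict[str, set[str]] = field(default_factory=dict)
    # domain -> last_update value from its most recent accepted report
    seen: dict[str, int] = field(default_factory=dict)

    def submit(self, report: KeylistReport) -> bool:
        """Accept a report; return False if the site is unchanged."""
        if self.seen.get(report.domain) == report.last_update:
            return False
        # Drop stale entries for this domain before re-adding its keywords.
        for domains in self.index.values():
            domains.discard(report.domain)
        for keyword in report.keywords:
            self.index.setdefault(keyword.lower(), set()).add(report.domain)
        self.seen[report.domain] = report.last_update
        return True

    def lookup(self, keyword: str) -> list[str]:
        """First-stage search: which domains claim this keyword?"""
        return sorted(self.index.get(keyword.lower(), set()))
```

The second stage (querying each matching server, or a cached crawl, for context and relevance) is left out on purpose; this only covers the cheap first lookup.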
Just a thought.
-- ShaunRead: Rabbit Rue - Free serial nove
Well, this is not entirely on the topic of indexing dynamic content, but bear with me.. the increasing difficulty of getting relevant search results has long been a pet peeve of mine. There are several factors that make good results hard to find:
1. sites that abuse meta tags and include pages of keywords just to get hit more - this makes results of keyword searches less relevant.
2. the explosive growth of the web - making 'quality' sites more difficult to find amidst the deluge of junk
3. dead links, outdated information, changed dynamic content indexed by search engines - the web changes too quickly for most engines to keep up
So what can we do to get high quality, relevant results without weeding through pages of URLs? It's not easy, but I've been playing with an interesting approach. First off, different search engines use different indexing/ranking methods: keywords, meta tags, link count, traffic stats, user recommendations, human categorizing. By combining the results of several engines using different index methods, you can cross reference the results and see who appears on all the engines. This gives you at least *some* degree of assurance that the URL matches your query. These results are ordered by number of engines that reported the link.
Now that we have maybe 20 or 30 semi-relevant URLs, the next step (which I have not coded yet) is to retrieve these pages and parse them based on natural language processing techniques. This should give a good idea what kind of actual content the page holds - ie, is it an order form?, a page full of pictures?, a magazine article?, a threaded discussion?, etc.. From that, and from stored or learned user preferences, a better list of results can be shown to the user.
OK, so the easy part of this is done, and just automates some manual searching :) The drawbacks? Well, it still doesn't solve the problem of dead links very well, and it's slow (approx 30 seconds).. it has to hit 6 engines and collate and analyze results before you see anything. Adding the language parsing will make it even slower. Caching results in a database could speed up common searches, expiring them periodically and refreshing in the background..
Is this where searching is headed? Maybe.. I don't pretend to know, and only started messing with it out of frustration. In any case, it seems to work pretty well already, and could probably be expanded into a pretty decent agent/search tool (open source, of course!).. If anyone is interested in helping to develop such a beast, let me know.
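For anyone curious what the cross-referencing step might look like, here is a minimal Python sketch. It is not the poster's code: the function name and the shape of the input (a dict from engine name to the list of URLs it returned) are assumptions, and actually fetching results from each engine is left out.

```python
from collections import Counter


def cross_reference(results_by_engine: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Order URLs by how many engines reported them."""
    counts: Counter[str] = Counter()
    for urls in results_by_engine.values():
        # An engine counts at most once per URL, even if it lists it twice.
        counts.update(set(urls))
    # Most engines first; ties broken alphabetically so output is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

A URL reported by every engine floats to the top; one that only a single keyword-stuffed index returns sinks to the bottom.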
--segfault at netwinder dot org
"640k ought to be enough for anybody." -- Bill Gates ca. 1981
The solution is easy. Don't use them in your URLs.
Do not use GET args in dynamically built links, but hide your args in a longer plain ole URL. For example, a script at http://www/x/y can actually interpret http://www/x/y/z/ just fine and you can then parse off z as an argument.
First, alias a directory that runs your CGIs, PHPs, etc. Like you would cgi-bin but don't call it that!
Then, plant your cgi program(s) in there. The "arguments" further down would be in the PATH_INFO variable (which you'd have to parse out manually).
So, in the case of http://www/aa/xx/yy/zz/ the script is in the aliased /aa directory. The script is named xx and the PATH_INFO passed to it, in the above example, would be /yy/zz/
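The "parse it out manually" part is short. Below is a minimal sketch of a CGI script in Python that does it; the `path_args` helper is a made-up name, and the script assumes the web server sets the standard CGI `PATH_INFO` environment variable as described above.

```python
import os


def path_args(path_info: str) -> list[str]:
    """Split a PATH_INFO value such as '/yy/zz/' into ['yy', 'zz']."""
    return [part for part in path_info.split("/") if part]


if __name__ == "__main__":
    # Under CGI the server passes the trailing part of the URL in PATH_INFO.
    args = path_args(os.environ.get("PATH_INFO", ""))
    print("Content-Type: text/plain")
    print()
    print("arguments:", args)
```

So for http://www/aa/xx/yy/zz/ the script xx would see the arguments yy and zz, with no query string in sight.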
This works with Apache. Don't have Apache? Upgrade today at www.apache.org :-)
So http://www.somecompany.com/en/products/doors/aluminium is actually passed to a script which separates the arguments from the URI and then builds the aluminium door product page with the English template. It works rather well and everything can be configured to be text/html and .html without problem.
An example in PHP is available at http://www.phpbuilder.com/columns/tim19990117.php3.
A.
--
Adam Sherman
Freelance Geek
...because one person's "fluff and spam" is another person's "real information".
Several years ago it became clear that the net was growing too fast for search engines to be able to keep up. At that time I came up with a design to solve the problem of scaling that reflects the open-source solution: through volunteers.
You have two categories of volunteers, the Spinners and the Weavers. The Spinners each voluntarily search some small part of the web via a spider each night. The Weavers each publish to the Spinners a list of queries that they are interested in. When a Spinner's spider hits a new page that matches a query, or receives a new query that matches a previously indexed page, it sends an email to the Weaver. The Weaver can look over the web pages coming in and create web sites that provide easy access to those pages, as they apply to the particular subject of the site.
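To make the two triggers concrete, here is a small Python sketch of a Spinner. The class and method names are invented for illustration, "matching" is assumed to mean that every query word appears on the page, and sending the email is reduced to returning (weaver, url) pairs.

```python
class Spinner:
    """Holds a local index and the Weavers' standing queries (hypothetical)."""

    def __init__(self) -> None:
        self.pages: dict[str, set[str]] = {}            # url -> words on page
        self.queries: list[tuple[str, set[str]]] = []   # (weaver, query terms)

    def add_page(self, url: str, text: str) -> list[tuple[str, str]]:
        """Spider found a page: notify Weavers whose queries it matches."""
        words = set(text.lower().split())
        self.pages[url] = words
        return [(weaver, url) for weaver, terms in self.queries if terms <= words]

    def add_query(self, weaver: str, query: str) -> list[tuple[str, str]]:
        """New query arrived: notify the Weaver about pages already indexed."""
        terms = set(query.lower().split())
        self.queries.append((weaver, terms))
        return [(weaver, url) for url, words in self.pages.items() if terms <= words]
```

Both directions matter: a query registered today should still surface a page the spider indexed last week.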
I sent this suggestion to the Open Directory people, since I think it is a perfect tie-in to their concept. The editors in the Open Directory project would be the Weavers who could separate the wheat from the chaff. Unfortunately, I never heard from them and let the idea die.
I'd be willing to resurrect it if others are interested. Feel free to send me an email.
From the PHP Knowledge Base:
How can I pass variables in a form that won't scare off search engines?
Mailing List, Nathan Wallace
Jun 28th, 1999 06:04
It's easy. Just use:
/local.php3/var1/var2
Then in your PHP code parse the URL and extract the variables. This works since Apache finds the local.php3 script and ignores the rest of the URL.
For example:
http://www.server.com/page.html/wilma/betty
and then in the script:
// $_SERVER['REQUEST_URI'] replaces the old register_globals $REQUEST_URI
$res = explode("page.html/", $_SERVER['REQUEST_URI']);
// everything after the script name, split on slashes
$vars = explode("/", $res[1]);
$fred = $vars[0];    // "wilma"
$barney = $vars[1];  // "betty"
If you want to get rid of the ugly .php3 in the middle of your URL, take a look at the Apache ForceType directive:
http://www.apache.org/docs/mod/mod_mime.html#forcetype
The client shouldn't infer the type of the object based on an "extension" in the URL at all ... that is what the Content-Type header is for!
If I'm looking for news about the plane crash in Yagadoodlestan that killed 8,000 people, I'm not going to go to a search engine and type in "Yagadoodlestan plane crash", I'm going to go to the New York Times and see if they have an article, or an AP story in the margin. If I'm looking for a review of the G88-superdooper motherboard that I'm thinking about buying, I don't go to a search engine and type "G88-superdooper motherboard review", but I might type "computer hardware motherboard review" and expect to get links to a bunch of hardware review sites, any decent one of which I'd expect to have a review of the product I'm looking for.
Conversely (inversely? whatever...) no one is going to type "international news" into a search engine, even though that might be the best way to find the NYT, they're going to go there because they heard about it from someplace else.
An example from my own recent web-browsing life: I heard about some site called bluesnews from a Quake'n friend a couple years ago, so I check it out. They have links to articles at some place called /. ("/.," I say, "what the $&*# is that?") so I check it out. I bookmark it, and now I'm happy. No search engine required. Last night I see the ad for Man on the Moon and start talking with my mother about Andy Kaufman "You didn't hear Mom? He died." "When," she says, "and from what?" I dunno, so I go to google and I type "Andy Kaufman" and find the answer.
So what, exactly, do you expect these search engines to do? Sites like the New York Times, BluesNews and SlashDot serve one purpose (bringing me news about topics I care about on constantly updated dynamic pages) and sites like AndyKaufmanFansOfIdaho.com bring the occasional bit of static trivia goodness when I need it.
Works for me, what are all of you doing???
"God does not play dice with the universe." -Albert Einstein
Those who fail to understand communication protocols, are doomed to repeat them over port 80.
I saw somewhere a few months ago (don't you love it when people really back up their information like that?) that the growth of the web vs. the available indexing technology meant that only about 4% of the web was being indexed. Goodness, that's a surprisingly low number, isn't it? I've heard it mentioned, and have often mulled over the idea myself, that some sort of distributed indexing is probably the next logical step. With the apparent successes of distributed.net and SETI@home, this is at the very least intriguing.
So let's just say, for the sake of discussion, that I had some time on my hands and the motivation to see a monster search database project through (these are both very hypothetical points). I could create the central database and write some client code. My dedicated AOV (Army of Volunteers) could come in veritable droves to download the client code and join the team (which they certainly would, right? Right?!?!?) Anyway, their client would initialize and get one of the starting URLs from the root database and go to town indexing and spidering. Shouldn't be too much of a bandwidth hog, since it will be just text, but it would be constant. Maybe not a good idea to do this with an analog modem.
When the client has "eaten its fill", or once a day, or something like that, it would slam its content my way.... ah, there's a potential problem. That's a lot of content. Well, it's all text, so my client could maybe get some decent compression out of it by gzipping it up. Still, it wouldn't exactly be trivial.
Thinking on about this, why would it help me to have my AOV doing these HTTP transactions for me when my server could do them its own damn self? Surely my server would have a big enough pipe that the bandwidth wouldn't be a problem, and I could start any number of processes. What's the big difference between web indexing and projects like distributed.net and SETI@home? Ah, it's processing power. That's what's required for the "traditional" distributed application. That kind of number crunching isn't helped by bandwidth... you need such a honkin' processor to do all that chewing that it's not cost-effective to create the system... it makes a whole lot more sense to distribute the work to any number of "normal" machines, thereby simulating a "super computer". That's right... it's all coming back to me now.
So, what would be helped by distributing the web indexing process? And wouldn't the smart fellas at Google or AltaVista have thought this through by now and come out with some sort of beta? Hmmmmmmmmmmm......
What? You actually read this whole thing? Sheesh. That's impressive. Oh well, might as well moderate me up :)
RP
Why not just have an index that weights the site using a text:images ratio? If it has 120k of images and 2k of text, assume it's content-free.
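As a rough Python sketch of that heuristic: the function name and the 0.05 cutoff are assumptions picked only so that the 2k-text / 120k-images example above trips it; a real index would need to tune the threshold.

```python
def looks_content_free(text_bytes: int, image_bytes: int, min_ratio: float = 0.05) -> bool:
    """Flag a page whose text:images byte ratio falls below min_ratio."""
    if image_bytes == 0:
        # No images at all: the ratio test does not apply.
        return False
    return text_bytes / image_bytes < min_ratio
```

With these defaults, 2k of text against 120k of images gives a ratio of about 0.017 and gets flagged, while 20k of text against 40k of images does not.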
My .02
Quux26
www.crashspace.net
I remember building a similar trap by accident.
:-(
It's much simpler to do than your recipe:
I did a content management system which did not impose an inherent hierarchy upon the data (it had a net instead).
The nodes were presented to browsers like directories in web servers.
Poor search engine - it did not believe in circular links between directories.
As for the handling: after retrieving several hundred M (out of my 2M test data) it stopped and came back the next day. I could not afford that...
Where was I reading this? Maybe it was someone's thesis paper.
Basically, the idea is that you collect all links and some basic keyword scanning (could be meta tags) and then you build an index based on the fact that many pages that claim to be about keyword X all link to page Y (any URL extension, dynamic or not) which is also about X. The more links there, the more likely that content is something valuable. It sort of polices itself and results in the top search item coming back as the most linked-to page.
Very slick, but I haven't seen anyone implement it yet. --ds
I haven't found a recent Open Source indexing engine that could do 1/10th the scale of Google assuming you had the hardware to spare. If there were, then folk can run Open Source indexing engines on small parts of the net (distributed by network topology or geographically) and a meta-index can handle those. Then we have local customizations for dealing with dynamic content.
Think Google.
Google works on the idea that pages that have a lot of incoming links are authorities on what they discuss, so they should be ranked highly.
A modification of this is to not only rank a site's authoritativeness (eh?) this way, but also what kind of content it has. So if 10K geeks all have homepages that include the words "geek" and "computer" and also point to /., it is reasonable to assume that regardless of actual content today, /. typically is a good result to return for the search "geek sites".
Of course, some of those homepages will also have the words "tennis" and "knitting", that will be spuriously attributed to /., but the idea is that they will be outliers, and drown in the noise.
This basically is keyword indexing, but the keywords are dynamically determined, rather than using the broken meta tags.
The big problem with this approach is implementation; the association tables are likely to be huge.
Also, you assume a large sample size, so that the outliers will cancel.
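A toy version of the association idea in Python, to show where the outliers go. This is not how Google works internally; the function name, the input layout (url -> words on the page plus the URLs it links to), and the `min_share` cutoff are all assumptions for the sketch.

```python
from collections import Counter, defaultdict


def keywords_from_inlinks(
    pages: dict[str, tuple[set[str], set[str]]],
    min_share: float = 0.5,
) -> dict[str, set[str]]:
    """Attribute to each link target the words shared by its linking pages.

    A word is kept only if at least min_share of the pages linking to the
    target contain it, so rare words drown in the noise.
    """
    inlink_count: Counter[str] = Counter()
    word_votes: dict[str, Counter[str]] = defaultdict(Counter)
    for url, (words, links) in pages.items():
        for target in links:
            if target == url:
                continue  # ignore self-links
            inlink_count[target] += 1
            word_votes[target].update(words)
    return {
        target: {w for w, c in votes.items() if c / inlink_count[target] >= min_share}
        for target, votes in word_votes.items()
    }
```

With three homepages linking to one site, where all say "geek", two say "computer", and one each says "tennis" and "knitting", only "geek" and "computer" survive the cutoff. The association tables are exactly the `word_votes` structure, which is why they get huge at web scale.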
Johan
There seem to be two schools of thought here. The folks who do searches, and are satisfied with what they get, and the folks who KNOW how searching works, and the breadth of information that exists, and KNOW that there's technically no way that all of that information is going to be included in any search.
It's unsettling to these "second school" people, because it's like looking for something in a library, knowing that you can't even go into 90% of the rooms of books and scrolls and papers.
Computers are supposed to be our security blanket that no information is out of our reach, or ever becomes lost. Unfortunately, this is "capital-R Reality", and even with the great equalizer of the internet, you just plain can't have it all. Steven Wright said, you can't have it all, where would you put it? The computer answers, digitize it, and put it online. All you need to do is build enough disk drives.
It IS a noble goal. And perhaps even realistic. But not with our current technology, and system of management (ad hoc/capitalist survival of the fittest standard) of that technology.
I wish I had a nickel for every time someone said "Information wants to be free".
These are my friends, See how they glisten. See this one shine, how he smiles in the light.
How will a Proxy server treat PHP/PHTML/PL/etc files? Probably ignore them, or simply download them (which is worse...)
If i load
This also can't be solved by traditional methods such as telling the proxy to refresh the page it has every X minutes, since every user demands a different page (either by query or by cookie prefs.).
I have given this subject some thought and came up with an idea:
Have the proxy store the cookies and then download the pages according to them: User access site via Proxy Server -> Proxy Server loads user Cookie file -> Server checks current stored page -> Server downloads requested page to a dedicated \user\site dir, if necessary -> Server updates latest Time of Page Load -> Server sends page to the user's browser.
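The flow above boils down to a cache keyed on (user cookie, URL) instead of URL alone. Here is a minimal Python sketch under that reading; `PerUserCache`, the `fetch` callback, and the `max_age` freshness rule are all assumptions, since the post does not say how the proxy decides a stored page is stale.

```python
import time
from typing import Callable, Optional


class PerUserCache:
    """Cache pages per (cookie, url) pair, refetching when they get old."""

    def __init__(self, fetch: Callable[[str, str], str], max_age: float = 300.0) -> None:
        self.fetch = fetch      # fetch(url, cookie) -> page body (hypothetical)
        self.max_age = max_age  # seconds a stored page stays fresh
        self.store: dict[tuple[str, str], tuple[float, str]] = {}

    def get(self, user_cookie: str, url: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        key = (user_cookie, url)
        cached = self.store.get(key)
        if cached is not None and now - cached[0] < self.max_age:
            return cached[1]  # stored copy is still fresh for this user
        body = self.fetch(url, user_cookie)
        self.store[key] = (now, body)  # update "latest time of page load"
        return body
```

The catch is visible in the key: two users with different cookies never share an entry, so the saving only comes from the same user asking for the same page again.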
If this is implemented, proxies will be much more efficient and could be used to further minimize bandwidth load.
To the fool, he who speaks wisdom will sound foolish. ---Euripides
Actually I don't see a problem. Stuff which most people are looking for is getting more plentiful - so what if you can't find all of them - they're usually highly redundant - 1000 people putting out the same info. Think of these as backup copies.
For the rest of us, we usually know how to find what we are looking for... If it really is there but it can't be found it probably wasn't that good anyway- if the page isn't linked to by anyone, and the author can't figure out how to get it indexed or hasn't bothered, then either the content is not worth reading, or the author doesn't want it to be read (doh).
Anyway, if they want they can always decentralise things:
For example every site could have a standardised search engine helper which will allow a keyword-url list to be dumped out, possibly in a compressed format.
e.g.
list of URLs=
urla,urlb,urlc,urld
list of Site specific keywords
(nonstandard keywords)
keyword1,keyword2,keyword3
Site specific keyword to URLs
X1=a,b
X2=a,b,c
X3=d
Standard keywords to URL list
S1=b,c,d
S2=d,e
(don't need to list standard keywords - understood)
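One way a search service might consume such dumps, sketched in Python. The layout is an assumption: each site's dump is taken to be already parsed into a keyword -> paths table (the exact wire format above is left open), and `merge_dumps` is a made-up name.

```python
def merge_dumps(dumps: dict[str, dict[str, list[str]]]) -> dict[str, list[str]]:
    """Merge per-site keyword tables into one keyword -> URLs index.

    dumps maps a site's host name to its {keyword: [paths]} table.
    """
    merged: dict[str, set[str]] = {}
    for site, table in dumps.items():
        for keyword, paths in table.items():
            urls = merged.setdefault(keyword, set())
            urls.update(f"http://{site}{path}" for path in paths)
    return {keyword: sorted(urls) for keyword, urls in merged.items()}
```

The search engine never crawls the sites in this scheme; it only fetches and merges the dumps, so dynamic pages show up as long as the site lists them.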
A more sophisticated version would list categories which the site is about. Then when the search engine is searching for some unlisted keyword but in that category, it can actively query the site's search engine on that keyword.
e.g.
cat.scanner.medical
(not cat.pets)
You could have something like DNS, but by category and keyword.
If idiots want to be in every category fine. People could always put -dumbsite in their search.
Search services could always put "filter out this site" links in the results screen.
Cheerio,
Link.
Andy Armstrong
They've been around for a while. Ron Guilmette created wpoison a while back. There's even a Wired story about it.
;-)
Unfortunately, wpoison appears to have since disappeared, although Ron never mentioned this to me.
Interestingly, I found out all this information doing a simple Google search on "wpoison".
Brad Knowles
http://daily.daemonnews.org/ -- if you're not
trained an infinite number of monkeys to browse for what I'm looking for, and they always find it in no time (literally!)
I have a few (very pertinent) meta-tags on the information page for a mailing list that I run. The tags are designed to get hits from people looking for my list. But it seems that the meta tags don't work in some of the major search engines. Perhaps the engines have caught on to the practice of embedding superfluous tags in order to get hits on engines. I think I'll have to rework my page to make sure that the key phrases that I'm trying to get hits on actually appear in the text.
I have discovered a truly marvelous sig, unfortunately the sig limit is too small to contain i
I promise, I previewed and it was fine. Any further critiquing of the link problem in my post will be superfluous :) Thanks.
RP
Yes, but many file systems, which may be the destination of the results of the HTTP request, *do* make use of extensions to determine file type. Though, perhaps, storing MIME-type meta-information would be better, we're stuck with what we've got.
Also, I mean the URL to also be used as a user-interface. For example:
http://slashdot.org/99/12/14/1154243/comments
would generate your browser's preferred format, whereas requests to:
http://slashdot.org/99/12/14/1154243/comments.pdf
and
http://slashdot.org/99/12/14/1154243/comments.scm
would return the PDF and the Slashdot Comment Markup Language (an XML app) respectively. This could be done with content-type markers, but the interface is much poorer than simply using file extensions.
pooptruck
Anyone seriously intending to build professional web domains should consider these tools: mod_perl (http://perl.apache.org) and HTML::Mason (http://www.masonhq.com). With those I very easily build dynamic content served as .html, even creating virtual directories and files from databases.
It seems that everyone and their brother is out there trying to get every single page on their site indexed in the engines. This is the wrong approach for many reasons: it helps proliferate link rot and search engine database bloat, it increases the time it takes for a spider to "index the web", it decreases the effectiveness of the search engines, etc. The better way to resolve this is to index only the "splash", or first, page of a site. Index that page completely. That page should contain all of the keywords and such necessary for a search to find every relevant item or page on your entire site. That way the spiders only have to index a MUCH smaller portion of the web and will still return all of the relevant information in a much quicker time with much smaller databases, while at the same time allowing the restrictions on number of keywords and size of descriptions to be greatly increased. Of course this requires web designers to actually have some sort of interest in the public good so that they provide good, valid keywords and information as well as decent navigation internally to the site. It also creates problems for people who are not on their own domain, but are merely someone's ~. I think though that these things can be ironed out in a rewrite of the robots.txt file format.
It seems to me, after all the discussion, that webmasters/hosts should have to generate their own index of a site based on certain criteria (of course).
For instance, I run a little program called swish-e (which some of you may have heard of, if not, check it out) to run small search engines on several of my sites. What if every host/domain/site had to run a "swish-e" index of their space and post a "spiderindex.txt" file in their main directory. Generate the index for the spider before it even gets there.
That would open up a whole new can of worms, probably, and add another layer of complexity to creating your own site, which is a good thing IMHO. Professionally produced sites, or those produced by the web-wise at least, with "spiderindex.txt" in them get indexed better than Joe Smith's personal home page with a few meta tags.
-Mark
-- I lived through the IPO Rush of '99
So here we go, two links: one with a capital A in the closing tag and one with a lowercase a:
slashdot test | slashdot test
And now some additional text to see if the link turned off.
OK, I just previewed it and it's perfect. Now submitting...
Sorry for the test, but you gotta relieve curiosity.
RP
I'm surprised nobody has licensed the Cyc software/ontology for use in web indexing. Actually I could be out of date and someone might have already!
The key to good indexing and search lies in scanning for knowledge and not "words". Unfortunately, more and more webpages are designed to be as noisy as possible and contain little information. For example, millions of webpages contain navigation menus, but the "knowledge" of what can be navigated is stored as images, which is completely useless... the "knowledge" is completely lost and indexing is difficult.
There needs to be more use of meta-data in web pages if we want to index them for the knowledge they contain. Until we can index them we can't search them.
Possible scenario: Apache httpd gets a couple of add-ons that speak Z39.50 (protocol for distributed searching). The search engines build a database of what these Web servers say is on their site (could be multiple collections; could be dynamic content....).
An information seeker would use a search engine to determine which Web servers (aka, "collections") might be appropriate for the query. Then, the query could be delivered to the best servers (for searching on their own sites).... The main benefits are:
Drawbacks: slow speed of the net & slow remote servers; porn and other misrepresented content; need to integrate and rank results from multiple Web servers...
We already have the protocol for this type of searching (Z39.50 - remember WAIS?); the next logical phase is to integrate it into our most common tools, especially Apache.
I was just wondering how useful database pages are. I suppose some of them are (e.g. the local library) but most of them are BarnyCorp's list of useless widgets and I'm rather glad that the engines don't index that stuff...
You will not drink with us, but you would taste our steel? - Walter Matthau, The Pirates
Anyone ever heard of metadata? Instead of indexing every word in a document we should be capturing accurate, relevant metadata about it and facilitating the searching of that. As non-text content increases (like mp3s, videos, images, audio streams, animations, etc.) on the internet, the need for a new search paradigm increases as well. Of course there's the Dublin Core, but much more interesting to me is the IEEE LTSC's work. Their metadata standard, currently at version 3.8, is very close to being finalized. In addition to providing general fields, it also includes some that supposedly facilitate the instructional use of the object of the metadata.
The nifty-dandy slashdot search engine only works for *non-archived* posts --- that means past two weeks *only*! The rest are rendered as static HTML, but you'd lose all the special slashdot searching features that way (search in story, filter by topic, etc) even if a major search engine is indexing them -- which I can't find any evidence to support. So, techno-idealists, let's start *here*. How should *slashdot* be indexed?
It works stunningly well!
Another thing the search engines could do is figure out how to ignore "trolling" pages. i.e. those which are nothing but index spam, a catchy title, and a refresh tag to ship your browser off to fetch their actual main page
Right on! Nothing burns me more than to see the 'enter here' page when I go to a site. Some flash or other animation or some huge graphic that loads up, and then either sits there, forcing you to 'click to enter' or redirects you to the rest of the site, which is where I wanted to be in the first place...what purpose does that serve? Oh, sorry, it impressed the client you built the page for. (I will admit I have done this once or twice, but not until after trying to talk them out of it.)
As far as searches, I agree that we need a new standard, one that is not only intelligent and dynamic, but that can outwit those who try to trick it. I believe that's quite a way off... until then I'll keep reading /. and Search Engine Watch.
The Divine Creatrix in a Mortal Shell that stays Crunchy in Milk
The House Between - Original Sci-Fi Series
Why ask if the Internet is becoming unsearchable and then only talk about World Wide Web issues?
The Internet actually consists of many things, of which WWW is only a part, but I'm beginning to realise that more and more people, in particular those who are new to "The Net", have a hard time understanding that.
1. Have a second domain name system based on topic rather than location, the way things are organized on usenet. E.g. sci.astro.cosmology.inflation
2. Create a legal equivalent of a self-reproducing worm, which requires the cooperation of a site in order to gain access. Give each copy a reasonably short lifetime: say, 15 minutes.
3. Have a number of well-known, moderated, top-level sites for various topics, with links to other sites dealing with the same topics.
I agree that the current system needs work; I once searched for "Andromeda galaxy" and "radial velocity" and of the sites that the search turned up, one was a lesbian site and another was a neonazi site.
I work for a company whose primary product is a search engine. However, in our case we allow searching not of web pages but of student "profiles" (which are basically like super resumes). We made a conscious design decision at day one to not even allow keyword searching because of its incredibly imprecise nature. Rather, we collect a lot of meta-information about our clients and then use sophisticated sorting techniques to do high-precision searching. If anyone's interested, please feel free to email/post a comment. I'd be curious to know if anyone has suggestions about how this should/could work better.
Thanks
Eric
Can your IM do this?
To say "We've written the system in XML." is about the equivalent of saying "We've written the system in ASCII and matched the parentheses."
In the vast majority of applications, when people say "XML" they really mean something like RDF, BRML, RELML, etc.
The best use of XML currently is to simply dump existing relational databases to the web and index them with XML-oriented search engines like XSearch for things like RELML.
One of the pitfalls of "XML-oriented search engines" is that they fail to provide basic query capabilities such as numeric comparison on the indexed fields. This is really unnecessary, since all they need to do is put the XML data back into a relational database on their end and index appropriately on the numeric fields. If they don't like the schema checking, they can always use LDAP and turn off schema checking.
An example of good use of XML via RELML is at www.nmre.net. Check it out.
Beyond this simple "dump the legacy rows to web pages" approach to E-commerce searching, there are the inferential systems that are more or less the equivalent of inferential databases. In these schemes, rather than storing literal values for the database rows in XML fields, a set of rules of derivation are included along with the XML data fields, and the actual index values that are not explicitly specified are derived prior to indexing. One might think of these as method-derived attributes as opposed to stored attributes. This is the direction Guha et al. were trying to take things with RDF, but IMHO, they failed to find the "sweet spot" of simplicity and power required for a new standard.
Seastead this.
Example: Suppose you're looking for information on a Zip drive. You already have the drive but are having trouble with it (problems with zip drives? really?
Of course I don't even bother, I just go straight for the LDP, but Windows users don't have that option.
It would be interesting to be able to search only engineering sites for engineering information. I once did a search for "wheatstone bridge" and got tons of $cientology links. If the engine was able to determine if a site was, in fact, an engineering information site, that wouldn't have happened.
How about a "no pr0n" checkbox. That would be sweet.
Of course that would require a herculean effort in changing the standards and getting site owners to be honest.
But maybe not. Here are some ideas I was thinking of:
It's not perfect, but it's gotta be better than the garbage we put up with now.
The sites I maintain and append to in my day job are all "must register/login before proceeding" sites. (All of them are training/web based education sites, targeted at a specific audience.) Each of these sites has its robots.txt disallowing anything but the root page from being spidered. In the root page I have stuck the primary "Mission Statement" (sans phb speak) in the keywords meta tag. In my case, I have no intention of letting a search engine wander around the site; it would get lost in all the forms and frames. (No content flames please, I merely build what they pay me to build.)
.com domain.
I often do quite a bit of searching, often for external reference verification. I never run a singular search, but run it over several engines. What I would recommend is a way of grepping multiple sites (hotbot, altavista, yahoo...) and presenting a scored list of which sites appeared in the first $X returns. I won't/can't limit myself to searching in just one location. I don't have 'site loyalty' to any one place. (I have not extensively used Google, so I am not sure if their methodology is just this.) "Specialization breeds disaster"
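The "scored list" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `merge_results`, the `top_n` parameter (the "$X" above) and the sample result lists are made up here, and a real tool would have to fetch and parse each engine's result page itself.

```python
from collections import defaultdict

def merge_results(results_by_engine, top_n=10):
    """Score each URL by how many engines list it in their first top_n results."""
    votes = defaultdict(int)
    best_rank = {}
    for urls in results_by_engine.values():
        # dict.fromkeys drops repeats so one engine cannot vote twice for a URL.
        for rank, url in enumerate(dict.fromkeys(urls[:top_n])):
            votes[url] += 1
            best_rank[url] = min(rank, best_rank.get(url, rank))
    # Most votes first; ties broken by best rank seen, then by URL for stability.
    return sorted(votes, key=lambda u: (-votes[u], best_rank[u], u))

# Hypothetical result lists standing in for real engine output.
sample = {
    "hotbot": ["http://a.example/", "http://b.example/", "http://c.example/"],
    "altavista": ["http://b.example/", "http://d.example/", "http://a.example/"],
    "yahoo": ["http://b.example/", "http://e.example/"],
}
merged = merge_results(sample, top_n=3)
```

A URL that three engines agree on floats to the top, while one-engine hits sink, which is roughly the cross-check being done by hand when running the same query on several sites.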
'Spamdexers' (the term has been heard often enough before that I doubt it needs redefinition): in my "Righteous and Elitistic" opinion (I was called just that today, which may make this post a borderline rant), spamdexing is one of the three "Punishable by death" crimes on the internet. (Spamming and Domain Squatting being the other two.)
But then, I believe there should be a competency test before being allowed to use "The Internet". I remember gopher. I know there is more to "The Internet" than "The Web". I'm sick and tired of people who think that -just because- Geocities will give them free space, they -MUST- make a page. I'm tired of "target spam" that 'guarantees' to get me in the top 30 of a chosen search engine for $149.95, just because I have a
So at the end of a day, what do I know? Apparently nothing to my PHB's who want me to use "onUnload" to "make sure" that people don't accidentally leave the site... Next they'll ask me to alter back and forward histories and flood screens with popups for the other training sites.
Are you sure? I don't find this to be the case at all with any of the searches I've conducted including those from open directory. And, I consider myself to be a professional power searcher. My biggest complaint about open directory is that many of the links are bad. I've had more success with askjeeves.com.
I think many people are missing the point of why databases are used by sites in the first place--to keep webcrawlers/indexers out. Most sites have the robots.txt file to exclude these intruders. The webcrawler developers will have to find other means of indexing this information if possible. The direct approach isn't working, nor will you convince most sites that restrict robots to let you in. Some will even use tactics like scrambling the url of your search based on your cookie session, so if you bookmark a hit from your search, you won't be able to get that same hit from your bookmark in your next session. Devious, isn't it?
I've been wondering why none of the library classification systems have emerged on the net. Back in the good old days when I relied on the library for information, the Universal Decimal Classification system was extremely handy. Even if you didn't know the name of the book, you could browse through a certain category that interested you.
The idea is that a book can belong to a single class that is marked by a decimal schema. Top categories are:
0 Generalities. Information. Organization.
1 Philosophy. Psychology.
2 Religion. Theology.
3 Social Sciences. Economics. Law. Government. Education.
4 (vacant)
5 Mathematics and Natural Sciences.
6 Applied Sciences. Technology. Medicine.
7 The Arts. Recreation. Entertainment. Sport.
8 Language. Linguistics. Literature.
9 Geography. Biography. History.
The main categories are defined further down:
.....
61 Medical Sciences. Health.
62 Engineering and Technology Generally.
63 Agriculture, Forestry, Stockbreeding, Fisheries.
64 Domestic Science; Household Economics.
and further and further:
631 AGRICULTURE
631.1 Farm Management
631.15 Planning
The classification would be used like the KEYWORD meta tag in HTML and search engines would index it. This would enable the user to specify a word as well as the topic they are looking for information on.
To prevent the misuse of the classification, only one or two classes should be allowed per page. Like
"Marketing of agricultural products" -> 380.13:631
(38 = Trade. Commerce. Communication. Transport.)
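To make the matching concrete, here is a rough Python sketch of how an engine might test a page's classification against the topic a user picked. It assumes the page carries a value such as "380.13:631" in some meta tag (the tag itself is hypothetical), that ":" joins the classes, and that a class "falls under" another when its digits start with the other's digits. Real UDC notation has more operators than this handles.

```python
def udc_classes(content):
    """Split a meta value such as '380.13:631' into its component classes."""
    return [part.strip() for part in content.split(":") if part.strip()]

def within_limit(content, max_classes=2):
    """Enforce the 'only one or two classes per page' rule suggested above."""
    return 1 <= len(udc_classes(content)) <= max_classes

def in_class(content, wanted):
    """True if any class on the page falls under the wanted class."""
    wanted_digits = wanted.replace(".", "")
    return any(
        cls.replace(".", "").startswith(wanted_digits)
        for cls in udc_classes(content)
    )

page_value = "380.13:631"  # "Marketing of agricultural products"
```

With this, a user browsing class 63 (agriculture) or 38 (trade) would both find the page, while someone restricting the search to 61 (medical sciences) would not.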
UDC is language independent and it has already been translated into numerous languages. Also, most libraries use some kind of numerical classification, so many people are familiar with the concept.
To help page authors to classify their pages a special website could be created. It should contain at least
How about it? Is it a good idea?
One major problem in the matter is that the UDC classification is copyrighted. I couldn't find more than a skeleton listing on the web! So the first step would be to negotiate a licence for it or for the competing Dewey Decimal Classification. I don't think it would be wise to start building our own scheme without negotiations, since both UDC and DDC are in extensive use. But if everything else fails, Gnu Decimal Classification to the rescue!
For more information about classification on the internet, see: The role of classification schemes in Internet resource description and discovery
I think that would qualify for a patent. Go for it. It's a great idea.
I just made the entire site unusable by my entire company by viewing the robots.txt. How proxy server friendly. I hope nobody tries to look at the robots.txt file through an AOL connection.
ColdFusion may have some shortcomings, but lack of a switch equivalent isn't one of them.
The main problem with CF is that it is targeted at Programming for Dummies^H^H^H^H^H^H^HWeb Designers, but that's also the biggest strength.
As I see it, the only way to really make a planet full of data accessible to everyone sensibly is to take a step back, take a long deep breath, and take a closer look at our wobbling usability practices.
We've abandoned some of the most important elements of our user interface in our rush to splatter the world wide web with our content: the domain name, and the sensible URL.
Domain names -- unlike hierarchically organized things (Usenet is a good example) -- no longer really mean anything at all. They've been smashed flat into just a few hierarchies, largely so NSI could maintain fascist control of the few TLDs. It reminds me of the MS-DOS days, when everything wanted to install itself as C:\SOMETHING. Companies rush to register their word or words in several TLDs, fearful that their competitors may soon take away their opportunity to hold even a teeny slice of the narrow internet domain pie. Domains don't mean anything anymore. Not in terms of content anyway -- they simply don't help us find what we're looking for. Doesn't it seem obvious, or at least worthwhile, that our position, or "place" in the big information avalanche that is the internet should at least be related somehow to our content? Don't you wish people could find you directly that way?
URLs have a place in this big scheme, too -- as a continuation of this structure. That is, URLs should represent information structure on a site in a manner simple enough not only for a person who is browsing to know and understand where they are and what they're doing there, but also to actually use as a user interface. When did we forget this crucial and human factor?
The more we tailor our information to only be useful by machines, the more we ruin our ability as humans to traverse the internet in a way that's sensible and seems natural to us.
This thread kinda hit me... I've been thinking along the same lines for a long time. When the internet started, I used to be able to do a search and get quite good results. Often a lot of information, but within it there were always useful links. Then you started to hear, on the news and in advertisements, about people offering "better" search engines: more accurate, giving only precise information... These projects got boosted with billion-dollar support, as the internet was the new medium of the millennium, and every IT company with some "new" search type came up with a record high on the NASDAQ or whatever :-) But me, the average user, I can't for the life of me see how these business deals come about, because "I" don't get better results, I get worse results. Currently, when I do a free search, I don't get any useful information at all. Most of the links I get from a free search are links to emails and discussion forums. And going through the structured index that these newly developed search engines have reveals nothing more than advertisements from companies whose content has nothing to do with the actual search being made. Despite the money being poured into it, and the "promise" of better results, the user gets fewer results, and companies apparently get a stronger hold on monopolizing what is and is not allowed to be on the internet. A person can no longer make a "link" to someone's material... Imagine the era when people will be arrested and fined for talking about some material because someone has a copyright on it, or the time when inventors are arrested for trying to "improve" a concept or invention because they are illegally utilizing methods on copyrighted material. Since all material is copyrighted and patented, who's going to "legalize" what is being taught in schools, and who is going to ensure that this material is correct and accurate?
The students themselves will be arrested if they try to discover it on their own, since after all they might be breaking a patent or a copyright by poking into it :-) I find it amazing how far people can actually go in robbing the people. Man isn't intelligent, man is a monkey... and not even a very smart monkey.
Well, I think you should give Google some credit for its page-ranking system. A porn page is unlikely to come up at the top of your search for "3d game quake" just because it has those words (and every other one in the dictionary) in its META tag, unless a lot of other pages containing those terms link back to it, which would tend to make it a reliable site for that kind of information.
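To illustrate the point (and only the point: this is not Google's actual algorithm, just a toy stand-in), here is a Python sketch that ranks matching pages by how many other matching pages link to them. The page texts, the hostnames and the function name `rank_by_backlinks` are all invented for the example.

```python
def rank_by_backlinks(pages, links, query):
    """Rank pages that mention every query term by inbound links from
    other pages that also mention every query term."""
    terms = query.lower().split()

    def mentions(url):
        text = pages[url].lower()
        return all(term in text for term in terms)

    scores = {}
    for url in pages:
        if mentions(url):
            scores[url] = sum(
                1 for src, dst in links
                if dst == url and src != url and mentions(src)
            )
    # Highest score first; alphabetical order breaks ties.
    return sorted(scores, key=lambda u: (-scores[u], u))

pages = {
    "spam.example": "3d game quake and every other word in the dictionary",
    "quakefans.example": "news about the 3d game quake",
    "review.example": "a review of the 3d game quake",
    "mods.example": "mods for the 3d game quake",
}
links = [
    ("review.example", "quakefans.example"),
    ("mods.example", "quakefans.example"),
    ("spam.example", "spam.example"),  # linking to yourself earns nothing
]
ranked = rank_by_backlinks(pages, links, "3d game quake")
```

The keyword-stuffed page still matches the query, but with nobody relevant linking to it, it ends up at the bottom instead of the top.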
I think Google has the right idea. Because of the proliferation of Porn/Business sites that will stop at nothing to get visitors, you can't quite trust a site to represent itself correctly for searching. Your best hope may be to keyword search, and then do a "background check" on the site to see if it really does provide that content. Maybe some sort of Meta-search which knows popular sites for different categories, and asks them to search for the results? i.e. you search for "linux program mp3" and the search engine knows that freshmeat.net knows a lot about "linux program" and mp3.com knows a lot about "mp3", so it asks those sites for search results, and displays those. It would put a lot more focus on providers of content anyway.
Just my ideas,
-Ted
Would you mind hyper linking that for us couch potatoes... thanks. www.npsis.net
This is a problem I came up against a few years ago and tried to solve at the time. I never got far enough down the path to release a standard or anything, but I'll explain the thinking.
Basically the problem is that search robots can get stuck in loops on your site if it's a database-driven one. Equally if the database contains something like stock quotes or postcodes it would just succeed in filling the engines with contextless gibberish.
So instead, the plan was to get people to manually export their database into a flat text file referenced from the robots.txt file. The text file would in some way have a data field and an address field. So the data field has the content itself as plain text and the address tells the search engines where they should send people, rather than referring them to that text file.
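Since it never became a standard, the file layout below is purely an assumption: one record per line, the address first, a tab, then the content as plain text. A small Python sketch of the export step might look like this (the function name and the sample rows are made up):

```python
import io

def export_for_spiders(rows, out):
    """Write one 'address<TAB>data' line per database row."""
    for address, data in rows:
        # Collapse tabs and newlines so each record stays on a single line.
        text = " ".join(data.split())
        out.write(f"{address}\t{text}\n")

# Hypothetical rows as they might come out of the site's database.
rows = [
    ("http://www.example.com/item?id=1", "Blue widget,\n  five inches long"),
    ("http://www.example.com/item?id=2", "Red widget\twith handle"),
]
buf = io.StringIO()
export_for_spiders(rows, buf)
exported_lines = buf.getvalue().splitlines()
```

The engine would index the text in the second field but send searchers to the address in the first, so nobody ever lands on the flat file itself.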
Now there's another big problem that the author hasn't mentioned. How do you find real-time information?
Here's the scenario: You hear from a friend that a school in Dublin, Ireland has just been closed down due to sexual harassment by the principal. Since this is your field, you want to find out more. The major sources (CNN, BBC, ABC etc.) aren't covering it. You know that some local news site will cover it though. So how do you find it?
Right now, the only way would be to find the Brand in a subject index like Yahoo, then hope they cover it. Looking for Dublin Time might work. But why can't you search for sexual harassment school dublin?
The answer lies in a real-time database of news, requiring the news services to either update a file with all the news in it or perhaps in some way push information into the search engines.
One approach to this problem is Moreover who index news themselves without the benefit of metadata. These guys are very clued in about metadata though.
Instead of asking robots to parse based on URL, we should have a new attribute for the A tag to indicate that the link could/should be followed. At the simplest level, this could look like INDEX="yes", but this could be extended in various ways, e.g. telling the spider if it needs to accept/send cookies, indicating a range of hours (in GMT) that the spider should restrict its queries to, etc.
I think a bigger problem these days for search engines is the number of sites that actively try to spam the search engines (i.e. porn sites). A friend at Inktomi says this is a huge problem.
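Here is roughly what the spider side of that could look like in Python, using the standard html.parser module. Keep in mind that INDEX is the attribute proposed above, not part of any HTML standard, and the sample markup is invented.

```python
from html.parser import HTMLParser

class IndexableLinks(HTMLParser):
    """Collect only the links that opt in with the proposed INDEX="yes"."""

    def __init__(self):
        super().__init__()
        self.follow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)  # attribute names arrive lowercased
        href = attrs.get("href")
        if href and (attrs.get("index") or "").lower() == "yes":
            self.follow.append(href)

page = (
    '<A HREF="/quotes?sym=RHAT">live quote</A> '
    '<A HREF="/article?id=7" INDEX="yes">story</A>'
)
parser = IndexableLinks()
parser.feed(page)
```

The spider follows the story link even though it has a query string, and leaves the stock quote alone because the site never asked for it to be indexed.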
incredibly inefficient, somewhat annoying, but it seems to work well...
webhelp.com
-nutsaq
I wonder if anyone has considered making an XML version of the MARC system that libraries use. Most library catalogs will let you search for items by author, title, publisher, standard number (ISBN and so forth), date, type of material (book, cd, video etc.) and various other parameters, driven by the descriptive capabilities of MARC tags.
Also, has anyone gone to including URLs & search strategies in works cited for papers and such? Will this become necessary?
If someone wants to create a site that isn't searchable, then that's pretty much their problem. I have very little difficulty finding what I'm looking for using any of several search engines. But I'll admit that I may just not be searching for the "wrong" things. The real problem is spam-dexers. Those jerks that put every word in the dictionary in their metatags in order to get you to their site, which has little or nothing to do with what you are looking for, and consists of nothing but banner ads (if they don't spawn a bunch of windows for you). It's more than a little annoying having to skip the first 5-20 items in a search to get to the real meat.
Given the chaos that is the net, it is going to be tough for the search engine creators/programmers to deal with all the badly coded dynamic pages and properly index them.
Why not have a standard (something like meta tags to enhance the search hits of your page content) to identify keywords for the page? Why not have a standard to indicate which parts of the site not to search (say a robots.txt file http://info.webcrawler.com/mak/projects/robots/robots.html) to indicate which parts of your site to not index? Put these and any other hints/standards in a public place and make it widely known that if you want the traffic a search engine can generate for your site, adhere to these guidelines.
Warning, this is an informational, on-topic, product mention.
HREF Tools Corp. came out with an ISAPI filter to do something about this, yes. We called it the Coolness Layer because it makes a dynamic, even database-driven, site use "cool" URLs that omit the path, program name and ?. The filter redirects the incoming URL according to some rules and your application can keep working normally. Of course, it helps if you update your application to create "cool" URLs so that the created links maintain the illusion. We have the defaults all worked out so that this is easy for WebHub programmers (WebHub is our core technology), but it applies to anyone running IIS. And the idea applies to any web server.
More info: http://www.href.com/coolness
Version 2.1 supports multiple domains on the same machine, each with their own redirection rules.
Enjoy.
But, on the other hand, if I look for a subject using words that describe the subject (for instance "lyrics", "song", "band"), I would find the content search engine itself rather than the song, because the search engine should (and will) contain such words in its static parts.
So IMHO there are two complementing and distinct solutions to the problem presented:
The first solution is obvious and should be widely in use today. The second solution puts the load on the spider, and seems "un-nettic" (ethic).
Just my 2e-2$.
I don't know if this is an optimum solution, but here goes...
/. is right now - with users/moderators and posters (of sites). I think such a thing could work - and would allow for a more complete indexing of the web (and allow sites that wish to be anonymous to stay somewhat anonymous), while giving a high QOS (due to the moderation - so you won't see a bunch of crap adverts, wrong info, or dead links).
I don't think we need search engines, but rather "search sites" - in many ways they would work like a search engine, however, they would lack the one property of a search engine that clearly cannot keep up with the web - the spider.
The solution to the problem: Rather than having a spider go out and crawl the web to build the database, those sites wishing to be represented in the database should submit their site to the database. How would this work? Well...
1. The site would submit their URL or "root" directory (in the case of personal sites) to the database for inclusion.
2. Each site would only be allowed one URL/directory in the database. At that root level, would be the "index.html" page that should have its own search engine or links to the various parts of the site. It should have META tag info for the search site to use to generate descriptions for the page. Maybe this might be controlled with a "free" membership type thing, so that owners could change info about the submission.
3. The site owner would have to categorize the entry himself - in other words, it would be the responsibility of the site owner to properly locate the link in its proper hierarchical context in the search database.
4. Each link would be given ratings points - which users (maybe registered as members as well?) can use to "moderate" the site - so that sites that are in the proper spot and present good information get moderated up (to appear higher in the search results), while those in the wrong area, or those that have bad content (purposefully misplaced adult sites, commercial sites that have no good content) would get moderated down.
5. Those sites with a consistent moderation rating less than 0 would, after a period of 30 days, be deleted from the database (maybe with an email to the site owner, so that he is warned).
6. Searching could be done via a normal keyword interface or a hierarchical "click-n-choose®" interface (like Yahoo uses). Results and ordering with either method come from the moderation points each site has (so that top sites filter to the top).
7. Use of a natural language interface for the searching would make searching optimal, but depending on available technology, may or may not be needed.
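Steps 1 through 6 are simple enough to mock up. The Python below is a bare-bones sketch, not a design: the class and method names are invented, "consistently below 0" is taken to mean the score has stayed negative since some date, and membership, email warnings and the natural language part (step 7) are left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Entry:
    url: str
    category: str      # chosen by the site owner (step 3)
    description: str   # e.g. taken from the page's META tags (step 2)
    score: int = 0     # moderation points (step 4)
    negative_since: Optional[date] = None

class SearchSite:
    def __init__(self):
        self.entries = {}

    def submit(self, url, category, description):
        """Steps 1-2: one entry per site; resubmitting only updates the info."""
        entry = self.entries.setdefault(url, Entry(url, category, description))
        entry.category, entry.description = category, description

    def moderate(self, url, delta, today):
        """Step 4: users move a site up or down."""
        entry = self.entries[url]
        entry.score += delta
        if entry.score < 0:
            entry.negative_since = entry.negative_since or today
        else:
            entry.negative_since = None

    def purge(self, today, grace_days=30):
        """Step 5: drop sites that have stayed below 0 for grace_days."""
        limit = timedelta(days=grace_days)
        expired = [u for u, e in self.entries.items()
                   if e.negative_since and today - e.negative_since >= limit]
        for url in expired:
            del self.entries[url]
        return expired

    def search(self, keyword=None, category=None):
        """Step 6: keyword or category lookup, best-moderated sites first."""
        hits = [e for e in self.entries.values()
                if (category is None or e.category.startswith(category))
                and (keyword is None or keyword.lower() in e.description.lower())]
        return [e.url for e in sorted(hits, key=lambda e: (-e.score, e.url))]

site = SearchSite()
site.submit("http://good.example/", "hardware/storage", "Zip drive troubleshooting")
site.submit("http://ads.example/", "hardware/storage", "Zip drive banners")
site.moderate("http://good.example/", +2, date(2000, 1, 1))
site.moderate("http://ads.example/", -1, date(2000, 1, 1))
first_search = site.search(keyword="zip")
too_early = site.purge(date(2000, 1, 15))
purged = site.purge(date(2000, 2, 1))
after_purge = site.search(category="hardware")
```

Note that no spider appears anywhere: everything in the database got there because a site owner submitted it and the users kept it alive.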
I am sure I missed a few things here - please add on to the idea if you can. I think such a site could be run like
Does this sound reasonable?
Reason is the Path to God - Anon
Given the chaos that is the net, it is going to be tough for the search engine creators/programmers to deal with all the badly coded dynamic pages and properly index them.
Why not have a standard (something like using meta tags) to increase the relevance of keyword searches on your site? Also, why not have a standard (say a robots.txt file http://info.webcrawler.com/mak/projects/robots/robots.html) to indicate which parts of your site to not index? Put these and any other hints/standards in a public place and make it widely known that if you want the traffic a search engine can generate for your site, adhere to these guidelines.
I would hope that people that would take the time to build a database backed dynamic web site would do the small amount of extra work to make sure that people could actually find the information through a search engine.
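For what it's worth, the robots.txt half of this is easy to check from the spider's side. Here is a short Python sketch using the standard urllib.robotparser module; the robots.txt contents, the spider name and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt for a database-backed site: keep spiders out of
# the search script and the members area, let them read everything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /members/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def may_index(url, agent="ExampleSpider"):
    """True if the rules above let this spider fetch the URL."""
    return rp.can_fetch(agent, url)

search_allowed = may_index("http://www.example.com/cgi-bin/search?q=zip")
article_allowed = may_index("http://www.example.com/articles/zip.html")
```

The site owner does the small amount of extra work once, and any well-behaved spider knows which parts are worth its time.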
An aside about the changing nature of the web-wandering public:
Wow, I've just come back from checking out the unfiltered (i.e. allows porn-associated searches to appear) metaspy voyeur site. Folks, I think the internet public may be changing. When I first checked this out for several weeks ~1 yr ago, most of the searches were porn related. This time, out of ~100 search queries, I saw only a few sex-related ones. Are things changing? That would be nice.
I find it telling how recently this post and the one on saving /. from Natalie Portman/Trolls have come up together--both are the result of Too Much Information needing to be distilled down. /. turned to moderators. About.com is all about that, too.
I first learned the term "reintermediation" from an article by Nicholas Negroponte in WIRED magazine. In another article by him (which I can't find right now) he says that while some people believe that librarians will be out of jobs, he predicts that there will be a new form of librarian. The old librarian could help you find what book you were looking for in the library. The new librarian will help you find the content you need from the 'net.
As a personal aside, I'm shocked and dismayed that search engines don't index database queries--I *just* finished rolling out my new personal web site which is totally database-driven, and thoroughly meant to be indexed. Now you're telling me that because all my URLs look like this "content.asp?nodeid=dejavu" that search engines won't find all the delicious content I'm creating? Botheration!
I suppose *a* workaround here is to create an application which traverses the site and builds some mirrored hierarchy of it in static pages for the search engine to index, which uses JavaScript to bounce the user to the *right* page once they get there.
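A sketch of that workaround in Python, with the usual caveats: the node data is hard-coded here instead of coming from the database, node ids are assumed to be simple names that are safe in a file name and a URL, and the page template is just a guess at the minimum needed.

```python
import html
import json
import tempfile
from pathlib import Path

PAGE = """<html><head><title>{title}</title>
<script>location.replace({target});</script>
</head><body><h1>{title}</h1>
{body}
</body></html>
"""

def write_mirror(nodes, out_dir):
    """Write one static page per node; each bounces browsers to the live URL."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for node_id, (title, body) in nodes.items():
        target = f"/content.asp?nodeid={node_id}"
        page = PAGE.format(
            title=html.escape(title),
            body=html.escape(body),
            target=json.dumps(target),  # emits a quoted JavaScript string
        )
        path = out / f"{node_id}.html"
        path.write_text(page, encoding="utf-8")
        written.append(path)
    return written

# Demo run into a throwaway directory with one made-up node.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = write_mirror({"dejavu": ("Deja Vu", "Delicious content & more")}, tmp)
    demo_name = demo_paths[0].name
    demo_text = demo_paths[0].read_text(encoding="utf-8")
```

Spiders of the day don't run the script, so they index the static text; a person with a browser gets bounced to the dynamic page.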
*sigh*
Is it me, or are Altavista's results deliberately :) modified?
I just noticed some changes in behaviour recently; it seems like some content is filtered out... I can't really explain, I notice it with queries for various subjects...
From guru techtalk to pr0n..
Maybe one of the indications is the "related searches" item... Though this must be made up of frequently searched items, using those searches quite often gives few results...
I take it the most searched-for items are better hunted for by the spiders, or...?
Well, it seems like search engines in general are degrading... Where did the times go when you didn't have to scroll through a search engine page looking for that much too tiny white box (not much bigger than the average picture for its commercials), with not enough room to type in your query without scrolling?
Just my 2 eurocents...
The problem is NOT with database/PHP/Zope/etc... driven sites. They pose no problem to decent crawlers, since they behave like standard pages. Any dynamic html generation is done at the server, and is invisible to the client:
You GET a url, and back comes HTML.
The problem is with forms and javascript which happen to co-occur fairly often for obvious reasons with database driven sites. As long as url's (it doesn't matter if they end in .php or whatever) are provided as links, the crawler will have no problem in traversing them. But what about the target of a FORM 'action'?
How can a crawler deal with forms and javascript? Javascript may not always be so bad, if the crawler can execute the javascript. But how does a bot fill out a form (in other words how will the bot generate an appropriate query string)? If a form is the only entry point to a collection of information, that information is currently inaccessible to crawlers.
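To show where the crawler gets stuck, here is a small Python sketch using the standard html.parser and urllib.parse modules. The class name, the example.com URLs and the sample markup are all made up; the point is only that links give the bot a complete URL to fetch while a form gives it nothing but an action with no query string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageScanner(HTMLParser):
    """Separate what a crawler can follow (links) from what it cannot (forms)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # complete URLs, ready to GET
        self.forms = []   # form targets; the bot has no values to send

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "form":
            self.forms.append(urljoin(self.base_url, attrs.get("action") or ""))

page = (
    '<a href="story.php?id=3">story</a>'
    '<form action="/search.php" method="get"><input name="q"></form>'
)
scanner = PageScanner("http://www.example.com/index.php")
scanner.feed(page)
```

The story link is fine even though it ends in .php and has a query string. The search form is a dead end unless someone tells the bot what to type into "q".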
I can't think of a single good reason NOT to merge the domain and URL structures.
Pick a URL on this site, say /about_us. Use the new inverted-node notation.
meta/web/design/companies/antistatic/about_us/ixany
See how the URL blends into the hierarchy? That's *GOOD*. A given server somewhere should have control over a certain region of the hierarchy. (I might serve from "antistatic" down in this example. I might even delegate some of it!)
In addition, a redirect would make meta/web/design/com/ synonymous with meta/web/design/companies/ as an abbreviation. My email would be smtp://meta/web/design/companies/antistatic/ixany. My web page would be at http://meta/web/design/companies/antistatic/ixany. My resume would be at http://meta/web/design/companies/antistatic/ixany/resume/. So if you typed telnet://meta/web/design/companies/antistatic/oblique into your browser, you'd know where you're headed. I'd serve DNS for meta/web/design/companies/antistatic/ on down, and delegate everything for meta/web/design/companies/antistatic/oblique on down to oblique, so that you could mail an oblique.antistatic.com user at smtp://meta/web/design/companies/antistatic/oblique/username.
That is, I'd have control over a certain node of a big tree structure. I'd *give* control of sub-trees to actual branches and leaves that make sense in an information sort of way.
Our current URL scheme wants to specify a hierarchy inside a hierarchy. But the problem is it must make obvious the fact that the outside hierarchy takes one kind of query to provide, and the inside hierarchy takes another. But it's no longer useful to separate these -- it's just a way to organize trees of information, after all, in which one tree is rooted in another. It seems more and more that this should be transparent, and that brings to the table other issues, such as the current lack of information provided by current use of the domain name system.
This looks like a lot of typing -- but look how you can get to information directly! That's a huge win. Also, the hierarchy could be browsed from the top on down, getting closer to where you want to be in sequential steps, rather than the search engine paradigm where you're often getting a lot further away as you go.
Perhaps this is not a search engine problem at all? Sounds more like bad site design. A database is meant to store data that is changing or needs to be searched often. Anyone who designs a static site using a complete database backend with cgi, php3, whatthefsckever is acting a lot like Microsoft: adding fluff for absolutely NO reason, and only getting back more bugs and slower performance while increasing the required resources.
Pages with dynamic content, pulled from a database, should not be indexed in the first place. They are not static, they may change the INSTANT the search engine is done with the page, so how is the search engine supposed to return predictable results?
Finally, if you can't figure out how to change your extensions so that index.html is interpreted as a php3 script, well, that's your own problem; become a real admin and that won't affect you. If you're not the admin, your ISP needs to get a clue and help you.
If you expect to use content searching, you're suggesting that the page has some content metatags or the like... this is fine and dandy, except I doubt you'll agree with what everyone else considers a "content" type... for instance, you search for "Adult Art", possibly expecting back some ART (not porn), and instead you get 2.6 million entries for www.hardporn.com... that just doesn't work. That, and the fact that businesses will just put every everyday keyword they can think of in their page, so you find it in searches that are completely unrelated.
Perhaps I'm acting as an eleetist, but this IMO is what happens when you have 20,000 MCSEs that THINK they know how the internet works, go grab FrontPage and ColdFusion and write database based websites all day long, with completely static content. All because they are too lazy to index the site themselves. I'm GUILTY of this myself: my website is entirely database driven, and most of the content in the database will NEVER change. If I put a little more time into it, it would be rather easy to write out old information to static html on a regular basis, allowing those pages to be RELIABLY searched by the global search engines.
Just my $0.02
http://www.schizo.com/
I've found this technique very effective:
I use apache and mod_rewrite so I can serve URLs that look like static web pages to everyone (search bots included), but are really database backed dynamic data.
Nightly I generate a hierarchical static index of all the dynamic pages on the site that I want indexed. A link to this index is located on every page on the site by an invisible HREF (one that bots will pick up on, but people with browsers will never see). These faux-indexes contain just href's (invisible) to all the pages on the site I want indexed, but the indexes contain absolutely no text because I don't want the bot index to be indexed. (I also use the ROBOTS META tag on these bot indexes to keep them from being indexed themselves.)
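For anyone who hasn't seen this done, here is the idea sketched in Python rather than in actual mod_rewrite configuration. The URL patterns (/articles/42.html on the outside, /article.php?id=42 on the inside) are invented for the example, and the nightly job is reduced to one function that builds a single flat index page.

```python
import re

# The outside world sees this static-looking pattern (hypothetical).
STATIC_URL = re.compile(r"^/articles/(\d+)\.html$")

def to_dynamic(path):
    """What the rewrite step does: map the pretty URL to the real script."""
    match = STATIC_URL.match(path)
    return f"/article.php?id={match.group(1)}" if match else None

def to_static(article_id):
    """What the site's own links should use so the illusion holds."""
    return f"/articles/{article_id}.html"

def build_link_index(article_ids):
    """Nightly job: a text-free page of links for bots, itself kept out
    of the index with the ROBOTS META tag."""
    links = "\n".join(f'<a href="{to_static(i)}"></a>' for i in article_ids)
    return (
        '<html><head><meta name="robots" content="noindex,follow"></head>\n'
        f"<body>\n{links}\n</body></html>\n"
    )

index_page = build_link_index([7, 8])
```

The empty anchors are what make the links invisible in a browser, and "noindex,follow" tells the bot to walk the links without listing the index page itself.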
You can take care of filtering by file extension very easily: don't put the file extension in the URI. It's a very bad idea; it effectively puts the type of the object in its name, so if you ever go to change the type (as GIF to PNG) or the server handling (as .html to .shtml to .asp to .php) you have to change all references to it. And Cool URIs don't change.
Searching for device driver on Google gives many fine results.
Standard disclaimer: IANALG (I am not a Linux geek.) Rather, I'm a web design geek. So please, be nice.
From what I understand, what a lot of the OpenSource movement is about is doing it yourself if you don't like how it's being done now. Don't like commercial Unix, Linus? Make your own fscking Linux and let everyone contribute. Oh yeah: Give it away free to really piss people off.
In this discussion, there are a ton of excellent ideas for how search engines should operate. Yet no one, to my knowledge, has put forward the next logical step: Build our own search engine. Google is a good start but hey, I know you guys could build it better. Worried about hardware and bandwidth costs? Venture capital.
As I said, do it yourself. :-)
----
Am I the only one who thinks Microsoft is a misnomer? Perhaps Macrosoft would be a better fit?
This is more a concept than an answer, but my thinking on search engines and their ilk is that they are becoming (if not already are) useless. 800 million web sites? Two billion? Twelve trillion? How far will it go? Who knows? We could eat up every MIP of processing and every bit of bandwidth trying to keep current in search engine indexing... and in the end, you'll have a mess.
The answer? I don't know, but I have an idea. Berners-Lee talked in his book about a "web of trust" -- mostly talking about security and e-commerce and such -- but the concept can be expanded to apply here.
For example, I trust /. to provide me with useful, timely information, and act as a great resource for all things nerdly. If /. provided a search engine for a few specific sites that the /. content owners felt were worthy of inclusion, I'd use it quite a bit. /. becomes responsible for maintaining those connections, and monitoring the output to ensure relevancy. The outside content owners provide hooks into their data, tailored to the idiosyncrasies of the /. community (plenty of RMS, no MSG).
Censorship? Depends on your definition. If you trust /. to provide good info, you also trust (implicitly, if not overtly) their editorial judgements. It's a human-to-human connection, facilitated, not replaced, by the computer.
I liken it to a bibliography. When I do dead-trees library research, I like to find the appropriate section, pull a book down and skim it. If it looks appropriate, I'll then check the bibliography to see what other books the authors found appropriate. Hey, they've just done research for me! Neat! Go to those books, the books under those, etc. I now have a web of sources, all culled from a (basically) random book pulled from the shelves.
Expanding that to the Web, /. trusts theOnion to provide the latest in useful headlines around the world (I know I do...). The Onion provides, through a "Bibliomatic" link (TM, (c) me me me me ... are you reading this Amazon? :), hooks into their data with published calls to pull appropriate, timely keywords relating to their content, with the ability to search archived content as well.
Everything goes swimmingly, until The Onion IPOs and starts to be run by MBAs, and the content-o-meter drops to zero. Mr. Taco gets inundated with a million emails complaining about how the /. results for The Onion all return "Make $$$ Fast with 18 year old transvestites having anal sex with dogs". Taco dumps all calls to The Onion's content, fires off a letter threatening Armageddon, informs them that they are off the list, and starts using somebody else.
Some things we got right the first time around. Car doors that open forward (not up), radios with volume dials (not tiny, fiddly buttons) with a real potentiometer behind them, and bibliographies. Those search engines that don't incorporate at least some aspects of this become obsolete, or relegated to searching for obscure content.
What about mailing list archives? Those are GREAT resources -- better than the FAQ usually. Getting to that data is more problematic, but not impossible. You can index the subject lines and provide a hook to that -- if /. chooses to use that hook, great. You can index the whole mess and force content owners to search it -- which will put you on the blacklist pretty quickly when people get a jillion results that all have "Im hafing problems wif Winblows" as their title. Or you can send a link with each possibly appropriate query to your own search engine that will locally search the mess.
I think the One True Search Engine is a pipe dream. As for myself, I try Google, Yahoo, and AltaVista in that order. Most of the time, I'm looking for someplace I've already been, and can't remember the URL, so I can customize my query to bring that particular site up in the rankings. I've tried doing pure general searches, but I'm always daunted by "Your query returned 12,486 results." Yeah, right.
For more on-topic information, try Philip's book. He had the same problem as discussed here, and he solved it with a few lines of TCL code inside AOLserver. That's the short answer...
Potato chips are a by-yourself food.
Don't overlook XML-RPC, which builds on the XML spec to provide a way of serving data over the web to remote clients.
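For anyone who hasn't seen it, here is a rough sketch of what an XML-RPC call looks like on the wire, using Python's standard `xmlrpc.client` module. The method name `catalog.search` and its arguments are invented for illustration; no real service is assumed.

```python
import xmlrpc.client

# Build the XML body of a call to a hypothetical remote method.
# "catalog.search" and its two arguments are assumptions, not a real API.
request_xml = xmlrpc.client.dumps(
    ("search engines", 10), methodname="catalog.search"
)
print(request_xml)

# The receiving side parses the same XML back into parameters and method name.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

The point is that the query and its arguments travel as structured data, so a client never has to scrape an HTML results page.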
Then there's RSS, which is a way of serving up a news channel or other changing data. These applications are here and in use. Together, these XML-based technologies will someday provide the data layer for the software agents of the future. Read lately about that new "price-checker" technology? Imagine being the one business that doesn't serve up your product list and pricing to that agent.
An interface from XML to these "hidden" databases is only a matter of time. We're just caught right now at a moment between technologies: the authoring tools don't really exist.
----
lake effect weblog
{Network engineer in Chicago--looking for work!}
PHP Builder ran an article describing how you can have the Apache web server treat a certain "directory" as a script, using the Location directive. So if I had a script at www.mydomain.com/foo, then I could access www.mydomain.com/foo/param1/param2 and the foo script would run, and could use environment variables to find the "path" foo/param1/param2. I tried it, and it works quite well. This hides GET parameters as "paths" so that search engines don't think the pages are dynamic (this is how Amazon.com works).
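The script side of this can be sketched as follows, in Python instead of PHP. Under CGI the server hands the part of the URL after the script name to the script in the `PATH_INFO` environment variable. The helper `path_params` and the parameter names `section` and `item` are made up for this example.

```python
from urllib.parse import unquote


def path_params(path_info, names):
    """Map '/param1/param2' style path segments onto named parameters.

    Hypothetical helper: 'names' lists, in order, what each segment means.
    Extra segments or extra names are silently ignored by zip().
    """
    segments = [unquote(s) for s in path_info.split("/") if s]
    return dict(zip(names, segments))


# In a real CGI script this string would come from os.environ["PATH_INFO"].
params = path_params("/books/0131103628", ["section", "item"])
print(params)
```

So a request for /foo/books/0131103628 carries the same information as /foo?section=books&item=0131103628, but looks like a static path to a crawler.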
SPAM
It's already started, in some categories.
:)
Mostly media searches so far, but it will expand. An example I found the other day: Sourcebank, for code and research papers.
The next problem will be finding all the different specialized search engines. But surely someone will make a search engine for that.
On the other hand, you don't want to index database driven sites at all. First of all, that'd be impossible technically. The best search engines currently index something like 25% of the web, if not less, and are able to re-check these pages only once a month or so.
The practical solution is to have a database site (if necessary) that only uses the database for dynamic content. I.e., if it's a /. FAQ page or ABOUT page or something else you want to be searchable, make it a static page, while articles/comments should be dynamic.
-- ATTENTION: do not read this sig. It doesn't say much.
A similar situation occurs when PHBs think their site doesn't ``look'' quite as good as others. (Insert my usual rant about content vs. presentation here.) Whether via a hideous HTML-abusing web authoring program, or via all sorts of hacks that God never intended to appear in anything resembling SGML, the HTML landscape is changed there as well, and SearchEngineInc's product becomes less effective.
Oh, I wouldn't blame that solely on PHB types... Every time on slashdot that someone points out that HTML was meant for a logical or content-based tagging system, 3 people pipe up and say "But you can't get a good looking site that way!"
It's the other way around. If nobody had ever abused HTML, and the Netscape and Microsoft extensions of HTML, no PHB would ever have known that a web page could be a graphical monster.
This is a bit offtopic now, but d*mn, I'd love to be able to retrieve Slashdot comments via XML. Then I could reformat them to taste, and e.g. lose those horrible ugly colors they're now using for Your Rights Online stories...
--
Do I look like I speak for my employer?
Okay, I know a bunch about this, since it's an important part of what I'm studying... Pay attention.
First of all, the Internet -- porn or not -- is growing at an absurd rate. Any single search engine, regardless of how good its ranking algorithm is, will not be able to keep up either with new or more difficult to use technologies (such as databases, as the post mentions). Some of my research is directed towards the idea of distributed indexing. I can't get into it now, but imagine Napster except with metadata instead of MP3s. These distributed mini-engines would know how to answer very specific queries (some would know how to deal with databases, some with PHP, some with other mini-engines, and so forth). It's a pretty complicated idea, and has some problems (such as response time for searches), but is the only real scalable solution for the growing Internet.
Second, XML is becoming more prevalent on the Internet in general (see Apache XML), but unfortunately is not quite there yet. However, as a poster alluded to, RDF (an XML flavor used to describe site metadata) is usable today. The state of RDF, however, is that it's currently used more for things like Slashboxes than for web spidering.
Anyway, be sure to keep an eye out. Expect things to change dramatically in the next year or so. The Internet is still a baby, and it's just now learning to walk...
...but you made an elaborate site entirely dependent on Microsoft Active Server Pages, and you're expecting it to work with _any_ web standards, much less be indexable and spiderable as if it was proper HTML? I'm afraid that you stepped right into that one. Look on the bright side- were it not for this Ask Slashdot article, you might never have known you weren't indexable, as this seems to be a little known fact! That alone is rather shocking.
While I agree with your central point, I think that it goes a little beyond this.
Any scheme which relies upon a central authority or site to manage or index web content will fail because the web gets too big too fast. This means that sites must be responsible for identifying themselves, and in a distributed way.
One way to do this would be to have a DNS-like system of distributed "web-index" servers, which describe the sites and their content as known to that server. Then, each of the web-index servers could gain information from local web pages, and report it up some hierarchy with a known root. You would then be able to find sites by content type, by specific knowledge areas (Dewey decimal website classification?) or whatever, depending on how the standard is defined.
This has the advantage of distributing the load, increasing the likelihood of finding what you want quickly, and allowing easy site hiding (by obscurity) for non-general-use sites.
-- Two men say they're Jesus. One of them must be wrong. - Dire Straits
So maybe this means that it is very difficult for anyone ever to have some sort of control over it. Which I think is good.
"Video bona proboque; deteriora sequor." -- Ovid
Well, what you need is a search engine for news. One that is constantly crawling news sites so you can search on "Clinton" or "fire" and get current or recent results.
Shameless plug: My own site, NewsBlip.com, is just coming out of beta now. Fast searching now, more features coming. Built on Open Source (Apache, PHP, etc.). End of shameless plug.
I'm no expert, but three things come to mind:
1. An open standard that defines ways that content is indexed, and defines a standard interface to the indexed content, no matter how that data is stored (i.e. relational database, XML, text, etc...).
2. A new language for searching the indexed content (in much the same way that SQL was developed to access relational databases).
3. A distributed system which allows each site to be authoritative for the content of the site (much like DNS). Each site could be responsible for providing a "search server" which would expose the standard interface to the indexes mentioned in my first idea. There could be "root servers" that are specific to certain types of content, where each root server could refer clients to the "search servers" that expose the type of data the client is seeking. This has the advantage of distributing the processing load, which should allow it to scale well.
Am I totally out in left field or what? It just seems like we need a basic paradigm shift from the current [klunky] search methods. I see parallels between the problems that led to DNS, and the problems we face dealing with the rapidly growing quantity of available content on the Net.
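To make the third idea a bit more concrete, here is a toy in-memory sketch in Python. Everything in it is an assumption for illustration: the class names `SearchServer` and `RootServer`, the `search(term)` interface standing in for the "standard interface", and the plain substring match standing in for a real query language. No network code is involved.

```python
class SearchServer:
    """One site's own search server; it is authoritative for its own pages."""

    def __init__(self, site, documents):
        self.site = site
        self.documents = documents  # maps URL -> page text

    def search(self, term):
        # Stand-in for the "standard interface": case-insensitive substring match.
        term = term.lower()
        return sorted(
            url for url, text in self.documents.items() if term in text.lower()
        )


class RootServer:
    """Refers clients to the search servers registered for a content type."""

    def __init__(self):
        self.registry = {}  # maps content type -> list of SearchServer

    def register(self, content_type, server):
        self.registry.setdefault(content_type, []).append(server)

    def refer(self, content_type):
        return list(self.registry.get(content_type, []))


def federated_search(root, content_type, term):
    """Ask the root for referrals, then query each referred server in turn."""
    results = []
    for server in root.refer(content_type):
        results.extend(server.search(term))
    return results


# Hypothetical sites and pages.
news = SearchServer("news.example", {"http://news.example/1": "Fire downtown"})
code = SearchServer("code.example", {"http://code.example/a": "A fire simulator"})
root = RootServer()
root.register("news", news)
root.register("source-code", code)
print(federated_search(root, "news", "fire"))
```

Like DNS, the root only knows who to ask, not the answers themselves, so the indexing load stays with each site.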
Or get the info from everybody.
With agent-technology, you can provide your agent(s) with information you like. Your agent will negotiate with other agents and come back with results. According to your rating, the provider of those results is rated higher or lower, resulting in some sort of social context.
The advantage of this is that you don't rely on the contents of pages (which can easily be modified to provide a maximum of hits, while not being related), but on the opinion of others.
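As a rough illustration only (this is not how any particular agent product works), the rating loop could look something like this in Python. The class name `Agent`, the fixed step of 1.0, and the provider names are all assumptions.

```python
class Agent:
    """Toy agent that ranks results by how much it trusts each provider."""

    def __init__(self):
        self.trust = {}  # maps provider name -> accumulated score

    def rank(self, results):
        # results is a list of (provider, item) pairs; unknown providers score 0.
        # sorted() is stable, so equally trusted providers keep their order.
        return sorted(
            results, key=lambda r: self.trust.get(r[0], 0.0), reverse=True
        )

    def rate(self, provider, liked, step=1.0):
        # The user's verdict on a result moves its provider up or down.
        delta = step if liked else -step
        self.trust[provider] = self.trust.get(provider, 0.0) + delta


agent = Agent()
results = [("spammer", "page A"), ("librarian", "page B")]
agent.rate("spammer", liked=False)
agent.rate("librarian", liked=True)
print(agent.rank(results))
```

Stuffing a page with keywords does nothing here, because the page text never enters the ranking; only other people's verdicts on the provider do.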
The beta has just been released on www.tryllian.com
Much as I hate to start a new topic here, I should point out that the original poster's suggestion that we index sites by the type of content they provide, instead of the actual content, is called "yahoo".
Been there, done that, and it's occasionally useful, but usually not.
Ultimately, Web-wide searching will fail, and not just because of database-driven Web sites. There are a number of reasons why:
Attempts to index the Web can best be described as an attempt to index and catalogue the largest, most diverse and most frequently-changing collection of documents, which adhere to no common standards of self-identification or description whatsoever, by people who generally have no training or experience in cataloguing or indexing for people who generally have no training or experience in database searching, and hoping that somehow, everything will work out.
That it has worked this well up to now is a testament to the creativity and ingenuity of the developers of indexing and search engine technology. Still, when compared to a professionally-organized database like a library catalogue or bibliographic subject database, the Web's search facilities are incredibly primitive and while it's easy enough to find a known item, it's basically impossible to do an extensive subject search or even to find a few of the most relevant resources on a particular topic without already knowing what those resources are.
Eventually, we're going to have to rethink how we index the Web, and this will involve making decisions about breaking the Web into manageable pieces and deciding exactly what kinds of things we want to catalogue/index in the first place.
Personally I think the biggest problem working towards making search engines useless is junky commercial sites that offer nothing worth my while. Type in a few keywords to find something and end up with a few hundred:
(a) dead links
(b) rubbishy commercial sites that aren't related to what you're looking for
(c) home pages that look like exactly what you're looking for - at a glance - but turn out to contain less than a few scraps of useful stuff.
"A mere half-hour's perusal of the Voyeur turned up one sad goof after another:
super modles
sex weied
streaptease
sesamie street
necked asain women
wallsreet journals
and my favorite --
christian boardcasting network"
I suspect that the percentage of sex-related searches is correlated to time of day and day of week. The low percentage I witnessed probably had something to do with the fact that I was checking mid-day Tuesday. I bet it goes up a lot on Fri & Sat, especially at night.
______________________(
The programs and indexers get confused and are corrupted by junky, crappy sites. The web has in a certain respect become unsearchable because of sites that claim they have "unbiased" links when many of them are paid ads. The web was built on quality links, and it still mostly is. You can find what you want, but with paid searches you are limited to what they give you, which is horrible. Every time you are shown paid results you should be told so explicitly, or else they are deceiving you, and this is something which has been going on for years.
Traditional search engines don't work for many reasons. First of all they can't keep up with the staggering expansion of the web. Second, they seem to be more interested in ad revenue than performing a service. I for one am sick of banners and ani gifs, so I did something about it and designed a business card engine using Linux, PHP and MySQL that is user supported and cheap! Now I need the support of the Linux community to help kick it off and provide meaningful feedback/comments. www.cards411.com
Count the *last* link the heaviest.
It's where either the person got fed-up, or found the information they were looking for.
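One way to read that as a scoring rule, sketched in Python. The function name `score_trails`, the weights 3.0 and 1.0, and the sample trails are all assumptions; the only idea taken from above is that the last link a searcher clicks counts more than the ones before it.

```python
from collections import defaultdict


def score_trails(trails, last_weight=3.0, other_weight=1.0):
    """Score URLs from click trails, weighting each trail's final click most.

    Each trail is the ordered list of result links one searcher followed.
    """
    scores = defaultdict(float)
    for trail in trails:
        for position, url in enumerate(trail):
            is_last = position == len(trail) - 1
            scores[url] += last_weight if is_last else other_weight
    return dict(scores)


# Three hypothetical searchers; all of them stopped at "c".
trails = [["a", "b", "c"], ["b", "c"], ["a", "c"]]
scores = score_trails(trails)
print(scores)
```

This can't tell "found it" from "got fed up", which is the ambiguity already admitted above; a real system would need some other signal to separate the two.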
-- Ender Duke_of_URL