Is the Internet Becoming Unsearchable?
wergild asks: "With more and more sites going to a database driven design, and most search engines not indexing anything that contains a query string in it, we're missing a lot of content. I've also heard that some search engines won't index certain extensions like php3 or phtml. Is anything being done about this? How can you use dynamic, database driven content and still get it indexed into the major search engines?" Is keyword searching obsolete? Do you think it's time to index sites by the type of content they carry rather than the content itself? Will larger indexing databases (or a series of smaller, decentralized ones) help?
I think we've actually hit another period, technologically, where we're advancing too fast for active standards on "how things should be done" to make things like searching pages/web databases/etc. an accessible, easy thing. It's probably going to take a while... it seems like every month they come out with a new way of doing things, a new "language that's going to change the world!", a new proprietary language/program for corps to use. Until that dwindles, for whatever reason, the web is going to continue to be behind in terms of searchability.
Listen to me Peter, I want this bench. You go sit on that bench over there, and if you're good I'll tell you the rest of
...to "force" search engines to search certain pages. Currently, you can only tell a searchbot to "piss off". There is no way to tell a searchbot "hey!!!! come look at this...."
What if we just have a standard search interface that can be built in to any DB driven website....say it returns XML or WDDX or something. So now when the search engines hit a DB driven site, it goes ahead and creates an index through this interface. I guess like a DNS zone transfer.....hmmmm...
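A minimal sketch of that idea, assuming the site can dump its database rows as one XML document that a search engine fetches in bulk. The element names (`siteindex`, `page`, `title`, `keywords`), the record layout, and the example.com URLs are all hypothetical; no such standard is described in this thread.

```python
import xml.etree.ElementTree as ET


def build_index_xml(records, base_url):
    """Serialize database rows into one XML document a crawler could fetch in bulk."""
    root = ET.Element("siteindex", {"base": base_url})
    for rec in records:
        # One <page> element per database-driven URL (hypothetical schema).
        page = ET.SubElement(root, "page", {"url": base_url + rec["path"]})
        ET.SubElement(page, "title").text = rec["title"]
        ET.SubElement(page, "keywords").text = ", ".join(rec["keywords"])
    return ET.tostring(root, encoding="unicode")


# Stand-in for rows that would normally come out of the site's database.
records = [
    {"path": "/item?id=1", "title": "Blue widget", "keywords": ["widget", "blue"]},
    {"path": "/item?id=2", "title": "Red widget", "keywords": ["widget", "red"]},
]
index_xml = build_index_xml(records, "http://example.com")
print(index_xml)
```

The search engine would then index the listed URLs from this one response instead of guessing at query strings, which is roughly the zone-transfer analogy.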
We've been running across problems related to this in my office (a web design/hosting/advert firm) and, while I'd like to see non-database driven searching of the Internet continue, I have to say that perhaps most people would rather have the database. The many web design clients who expect that once they have a web site they won't have to advertise in print ever again are driving the whole thing toward the database method... creating the problem they so love to bitch about.
Perhaps doing away with keywords entirely, getting search engines to look at the content instead of the "false content" of meta tags... now that would be nice.
We had a client once who wanted keywords inserted dynamically into the metatags on his webpage based on query results because he read once that search engines index pages based on the tags. Nothing we could say would convince him what was wrong with that picture.
Is it even possible to index dynamic pages? They don't really exist until the page is generated. Perhaps the best thing to do for sites that want to be indexed is to make sure they have a plain, vanilla index.html page that contains relevant keywords?
Dana
You can tweak Apache to parse documents ending in .html with PHP3. You could use .html for generated content and .htm for static pages.
Computers. You can't live with them, you can't live without them.
Yeah, give me a minute to back that statement up. :)
Honestly though. With something that is inherently dynamic like the internet, it is already near impossible to catalogue and make it searchable. Just to illustrate this take any given news site. Today they might have articles about Clinton, tomorrow it might be news about a big fire. Search engines can't just direct you to those sites based on queries because who knows what data they have.
Even if a search engine was able to validate the content on every site before it gave you the url it could still change by the time you actually got to see it.
So quite literally there isn't even a clue of a way to catalogue a database generated web site. Now granted I know there are plenty of sites like Slashdot where eventually the 'content' settles down and becomes static. Still, how are you going to get some stupid program to verify and validate that for *every* dynamically generated web page? I don't think you can.
The web was created to be open and dynamic and it will stay that way. I've heard people say that maybe there should be *more* interoperability between things like search engines and spiders. This in my mind would do more damage.
Besides is it so bad that spiders don't get these pages? It probably isn't even reasonable because it would add that much more complexity to the search engine to catalogue what it finds. How do you rank content?
Anyway... just my 2 cents or so...
This space for sale
Not only is multiple-site searching becoming more difficult, but single-site searching is as well.
Now most content is stored in a SQL database. While it is fairly easy to search an SQL database, returning the information in usable form is not. This is especially true once you have many types of tables containing many different types of information.
Currently, the search engine on the site I work on has its own built-in forms for information from each type of table, but this method takes a lot of maintenance.
Another possible way is to point to the page (php3, asp, .pl, .cgi, etc.) which generated the information. But this only works if arguments are not required.
It is about time someone developed some technology to do "smart searches" of sql data and return useful information without having to write a template for each and every type of data that might be queried.
I might be off my rocker a little bit on this, but I cannot believe I am the only one experiencing these problems.
-Pete
Soccer Goal Plans
It seems to me that if anything, the internet is MORE searchable than it used to be. I remember some statistic about how a couple of years ago the few search engines that were around only got some small percentage of the web covered anyway. These days it seems the search engines do a better job, and there are a zillion more search engines and also tools that let you search multiple search engines at once. That and the fact that there is just plain a lot more stuff on the net. Back a few years ago, if you searched for Cervantes, the author of Don Quixote, you might find a page or two on some college webpage somewhere, if you were lucky. These days there are enough pages out there that you're bound to find at least one of them that's halfway decent. Anyway, to summarize, keyword searching still seems to work for me. I think that the only way it will get considerably better is when true artificial intelligence is possible. That way, when you ask the computer to find something, it is actually smart and goes out and finds it like a real person. However, it seems to me that true artificial intelligence is a way off....
Still, I see a potential threat in information becoming unmanageable, and, most of all, in ways of finding information being abused (like using unrelated keywords just to get some visitors). Stanislaw Lem, the Polish SF writer, described this situation in many of his books, starting in the 60s, when no one was even starting to think about such problems. Sooner or later we'll have a large branch of computer science dealing only with searching for information on the Internet; search services are already available, but they are either incomplete or not evaluated. The latter is the key, and Google is the first service I'm aware of which tries to automate evaluation (by counting links pointing to a specific page).
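To make "counting links pointing to a specific page" concrete, here is a rough sketch that ranks pages by how many distinct other pages link to them. It illustrates plain link counting as described above and is not a claim about how Google actually computes its ranking; the page names and the `link_graph` structure are made up.

```python
from collections import Counter


def rank_by_inbound_links(link_graph):
    """link_graph maps a page URL to the list of URLs it links to."""
    inbound = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):  # count each linking page only once
            if target != source:     # ignore pages linking to themselves
                inbound[target] += 1
    # Most-linked first; ties broken alphabetically for a stable order.
    return sorted(inbound.items(), key=lambda kv: (-kv[1], kv[0]))


link_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "c.html"],
}
ranking = rank_by_inbound_links(link_graph)
print(ranking)
```

A page stuffed with unrelated keywords gains nothing here unless other pages choose to link to it, which is why this kind of evaluation is harder to abuse than keyword matching.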
There was a lot of talk about "Internet agents" a couple of years ago (I remember an article in Scientific American...) - could some good soul explain to me what the situation is now?
Regards,
January
I used to make a decent living as an Information Broker - basically, a trained database searcher for hire. Along came the net, and suddenly everyone with a modem could search for themselves. So I wrapped my shingle up, and stored it away.
These days, there is so much junk and bad indexing, that I may as well put the shingle back out. Almost any search will find mostly commercial sites, unrelated to the search, or completely useless garbage.
You almost have to be in a bizarre frame of mind to create a good search term these days.
Mark Edwards
Proof of Sanity Forged Upon Request
I've done some work on a spider and these are the types of pages I spider: :)
html htm asp php shtml php3
I guess I'll add phtml
Other extensions and URLs with query strings are ignored. This is mainly for self-defense. There are many, many infinite loops and blackholes on the web and they're hard to avoid. For instance, my spider once got stuck on a server that would return the contents of /index.html for any non-existent path. Also, all links on the homepage were relative (not a bad thing) and one was invalid. The call sequence is below.
GET /index.html
found foo/broken.html
GET /foo/broken.html
webserver couldn't find path, so returns /index.html
GET /foo/foo/broken.html
etc.
What was the programmer thinking?
This is just one example of the blackholes that lurk on the web. It was completely unexpected and pretty difficult to detect. What if someone wanted to write a search engine trap? I don't believe there is a simple solution to this problem.
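For the specific trap described above (a server that answers every unknown path with the home page), one partial defense is to fingerprint page bodies and refuse to follow links from content already seen, plus a depth cap as a backstop. The sketch below is an assumption about how that could look, not the spider's actual code; `broken_server` is a hypothetical stand-in that reproduces the bug, and a deliberately hostile trap that varies its content would still get past the hash check.

```python
import hashlib
import posixpath
import re
from collections import deque


def crawl(fetch, start="/index.html", max_depth=5):
    """Breadth-first crawl that does not expand content it has already seen."""
    seen_urls, seen_bodies, indexed = set(), set(), []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in seen_urls or depth > max_depth:
            continue
        seen_urls.add(url)
        body = fetch(url)
        digest = hashlib.sha256(body.encode()).hexdigest()
        if digest in seen_bodies:
            continue  # same page under a new URL: likely a trap, do not expand
        seen_bodies.add(digest)
        indexed.append(url)
        for href in re.findall(r'href="([^"]+)"', body):
            # Resolve relative links against the directory of the current URL.
            target = posixpath.normpath(posixpath.join(posixpath.dirname(url), href))
            queue.append((target, depth + 1))
    return indexed


def broken_server(path):
    # Hypothetical server with the bug: every path returns the home page,
    # which contains one relative link to a page that does not exist.
    return '<a href="foo/broken.html">broken</a>'


indexed = crawl(broken_server)
print(indexed)
```

Without the hash check the crawl would walk /foo/broken.html, /foo/foo/broken.html and so on until the depth cap stopped it.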
Ryan
Just stop thinking that terabytes are the limit. Get more hardware and more computers. Create petabyte databases. In fact have millions of petabyte locations worldwide and create a series of multipetabyte databases that one can use.
Categories are nice but some (most) sites are personal sites and these sites change quite often in subject matter.
While the categories are nice we should have a community planned and maintained categorical system along with a plain text search. Have identifier tags that go along with every web site and then have a standalone and a web based version of this program which will allow anyone to create a hierarchical listing of anything according to certain tastes and parameters.
Slashdot social engineering at it's finest
I think we are already looking at a two-tiered structure: there are sites (that could be found through standard search engines) and then there are databases/archives inside those sites.
It is getting more and more so that to find an answer to a somewhat obscure question, I need first to find major sites on the topic, and then do a search through their databases or mailing list archives. I believe this reflects a real-life structuring of the Web and will have to be taken into account by next-generation search engines.
Kaa
Kaa's Law: In any sufficiently large group of people most are idiots.
One way round search engines missing query URLs is to write static pages for the purpose of submitting them to search engines. There are many clever ways of having truly dynamic sites without the need for long URLs; you just have to put some effort into it.
Search engines not picking up on php3 is a bit worrying though, all my sites are written purely in php3, although I never seem to have any problems with getting listed.
Gateway pages are a good way of making sure you get listed with the keywords you want, although they aren't very dynamic and unless you get really clever don't tend to reflect the contents of a regularly updated site... however it seems to me that you can only really hope for *a* listing these days, not an index of all of your site.
Even Google has a 3 month disclaimer on its submit page. That's a mighty long time if you are looking for support on a brand new motherboard.
LASE seems to be the way to go... subject-specific full text indexes which spider regularly and can index specialised data, keeping it up to date.
However you would still need a search engine to find a LASE that will get you what you want, but at least it's a bit more structured!
There are many ways round the search engine problems, and keeping on top of them is a full-time job. Submit-it doesn't come close: it hasn't changed in the past 3 years, but search engines have!
IMO a combination of all of the above will get you where you want. Keywords and Meta Tags still count, and you have to be persistent.
The Open Directory Project, managed by dmoz.org, is an open source effort to create an organized index of the internet through volunteer work. Currently there are 20,000+ volunteers working on the project. This is a way cool idea that we should all support.
I read a while back that meta data for sites would eventually move to an XML based standard which would accurately describe the content of the site?
Whatever happened to that? I don't mind all that much being taken to the front page of a site if I know that site has the information somewhere in there, I just hate having to hit seven sites to find that one.
Hotnutz.com
I have been thinking about the working of a search engine lately and this post just comes at the right time.
Some of the challenges that searching the web will face in the future:
1. Displaying matching URLs as well as links which match the type of content. This is important. If I search for "throat infection" on a search engine, apart from the pages which mention "throat infection", the web engine should give me a link to drkoop.com, webmd.com (AFAIK, these sites do not allow search engines to copy their content) and so on.
Search engines will have to maintain huge databases linking words to categories. And with the proliferation of the internet the number of sites carrying content and disallowing search engines is going to increase. Search engines need an intelligent way to get around this.
2. Search engines will need to "help" users with their searches. For example if I just search for "throat" the search engine should have a helper section where it can ask me more... whether I am searching for "throat infection" or "study of throat" and so on.
3. Search assisted by humans. This is also one of the concepts picking up these days. Basically you submit a question and there will be some person searching the web, and you will get your answer in a few hours/days. Check out www.xpertsite.com.
4. Tools for better maintenance of bookmarks. I for one usually bookmark all relevant stuff and then I spend a full weekend arranging them so that I can find the relevant stuff from the bookmarks quickly. The current bookmarking scheme is very primitive, causing a lot of users to "reinvent the wheel" (searching for URLs which are already bookmarked).
Phew!
I'll jot down more thoughts later. Gotta work now.
CP
I was just about to ASK SLASHDOT about XML. XML will solve the search problem (or at least help make it better). Working drafts of XML have been drawn up by the W3 Consortium, and XLINK, XSL, etc. are coming. There are almost no XML applications available yet, though! Most of what is available is in Java. This is a field where Linux could be leading the pack, but instead I think we are lagging behind. (I hope someone can point me to a group that is bringing XML deep into the Linux OS.)
I want to know if Linux is on top of this. Microsoft has an XML notepad available and I hear that it's going to be all over Win2000 (in the registry even). XML will be the foundation of the new internet and we don't want microsoft to have a technology edge there do we? Perl has XML modules, as I am sure other languages do too (python). Lets get some apps written!
What about Gnome and KDE? This could help make their projects easier. Especially KDE, with all of the object similarities between Corba and XML and Object RDBs. All config files could theoretically be stored in XML. We need to push this one, people!
-pos
The truth is more important than the facts.
-Frank Lloyd Wright
The problem with dynamic content is that you pretty much have to query the target web servers at the time the user enters the search request.
One solution that attempts to address this is Apple's Sherlock. It uses XML to pass queries to web sites and return results. There are certainly some limitations: you have to choose which web sites you want to search (although this isn't always a bad thing), these web sites have to support Sherlock queries, and it only works on the MacOS. Currently lots of big name and Apple-specific sites support it.
The dev info at Apple is pretty clear though. It wouldn't be difficult for others to create clones for Sherlock that either work over a web interface or on other OSes too. (dunno if Apple could...or would... make any claim against this)
Scott
You can't study the darkness by flooding it with light. --Edward Abbey
Some time later, it occurred to me to try and monitor the efficiency of web indexing tools using a spider trap.
The methodology is like this:
Anyone done this? I'm particularly interested in knowing how spiders handle large websites -- have been ever since I was doing a contract job on Hampshire County Council's Hantsweb site a few years ago and caught AltaVista's spider scanning through a 250,000 document web that at the time had only a 64K connection to the outside world. (Do the math! :)
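Doing the math needs one number the post does not give: the average document size. The sketch below assumes 10,000 bytes per document and reads "64K" as 64 kbit/s; both are assumptions, so treat the result as an order-of-magnitude estimate only.

```python
DOCUMENTS = 250_000             # from the post
LINK_BITS_PER_SECOND = 64_000   # assumes "64K" means 64 kbit/s
AVG_DOC_BYTES = 10_000          # assumed average page size, not from the post

total_bits = DOCUMENTS * AVG_DOC_BYTES * 8
seconds = total_bits / LINK_BITS_PER_SECOND
days = seconds / 86_400
print(f"{seconds:,.0f} seconds, about {days:.1f} days of a fully saturated link")
```

Under those assumptions a full crawl would occupy the entire outside connection for several days, ignoring protocol overhead and all other traffic.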
Therefore, it would be entirely feasible to have a system in which regular users saw regular pages and web crawlers saw a "static" index page, all at the same URL.
This would allow web crawlers to index according to genuinely useful keywords, rather than by how the crawler's writer decided to determine them.
An alternative approach would be to distribute the keyword database. Since all the web servers have the pages in databases of one sort or another, it should be possible to do a "live" distributed query across all of them, to see what URLs are turned up.
This would be a lot more computer-intensive, and would seriously bog down a lot of networks & web servers, but you'd never run into the "dead link" syndrome, either, where a search engine turns up references to pages which have long since ceased to be.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I think the real problem with searching really isn't that the Internet is growing too large. The central problem with it being too hard to find information is due to the unfortunately ever-changing nature of HTML. (Yes, I know there are much better solutions out there -- I work with some of them on a daily basis. However, we seem to presently be stuck with HTML and its variants.)
It's a self-feeding monster, whose typical cycle goes as follows: SearchEngineInc (a division of ConHugeCo) creates a new technology that really impresses people with its ability to find what they want more quickly. (Right now SearchEngineInc is probably Google, at least in my view.)
Once the new technology takes root, content authors (well, maybe not the authors so much as their PHBs) note that SearchEngineInc doesn't bring their business (which sells soybean derivatives) to the top of the search list (when people type ``food'' into the search engine). Said PHBs make the techies work around this ``problem'', and all of a sudden SearchEngineInc's technology isn't so great anymore because the HTML landscape it maps has changed.
A similar situation occurs when PHBs think their site doesn't ``look'' quite as good as others. (Insert my usual rant about content vs. presentation here.) Whether via a hideous HTML-abusing web authoring program, or via all sorts of hacks that God never intended to appear in anything resembling SGML, the HTML landscape is changed there as well, and SearchEngineInc's product becomes less effective.
What's the solution to this? I'm not quite sure. Obviously there are better technologies out there that are at least immune to PHBs' sense of ``aesthetics'' but I would wager few of them are immune from hackery. I'd say that search engine authors are doomed for all time to stay just one step ahead of the web wranglers. At least it assures them that their market segment won't go away any time soon. :-)
It disturbs me that so many pron sites have hidden in their HTML code (and sometimes not even hidden) huge lists of adult film stars just to get hits from search engines.
If you do a search for Cortknee or Lotta Top you'll get a bazillion hits and 90%+ of them are "Click here to see young virgins having sex for the first time on their 18th birthday!"
As we all know, but nobody likes to admit, pron is the fuel that makes the net go 'round.
Many other sites have taken hints from the pron people. I'm sure that it was a deal of some sort, but every time I search for anything on metacrawler I get a link to search a certain bookstore for books on the same topic.
Commercialism and shady practices are what are making the net so hard to search.
LK
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
Okay, I just got done with my research paper for college last week, and although I can pull a paper out of some orifice of my body, researching is always a pain.
Our library has a wonderful online database where you can type in keywords and search for them, but the keywords only look as far as the title, author, or abstract of the book. If you want to look up some narrow topic, you can't expect that there are books written exactly on that topic, but there are always bound to be a few books out there that have a few pages dedicated to that subject (which isn't listed in the abstract). So, what do you do? You have to get your hands dirty.
My topic: Holy Wisdom (I won't bore you with details, but just stick with the subject). Looking in the online database, I find that there are zero books on the subject. Darn. Let's do some lookin...
After I read in a few Religion Dictionaries, I find that Holy Wisdom is also called "Sophia." I go back to the catalog, type in "Sophia," and I get one book. I skim this one book, and find that Sophia has sometimes been associated with the Holy Trinity. So, I go back to the catalog, enter "Holy Trinity," and BOOM, I get back 400 results (anyone seeing a similarity here...). Let's limit them...we'll search within the results for "History of," and I get back about 11 results. I read the abstracts, find a few books of interest, and start skimmin...
...Well, whadda know, there's a page in one book that talks about Sophia, and half a chapter in another book that talks about Sophia as well. There's a few more sources for the paper!
Now, for those of you who just don't understand what I'm trying to say here, just read from here on, cause here's my point: Computers aren't smart enough yet to "guess" at what we want, and personally, I don't think they ever will. Internet keyword searches are just like asking someone to help you who has no idea what your topic is...they can only search for what you ask them to search for.
Internet keyword searches are a hassle, and many times the first few returns won't be anything CLOSE to what you want (search for "Computer Science," you get back porn, search for "Linux," you get back porn, search for "White House,"...). But if you learn how to dig, like the people who lived fifty years ago WITHOUT Boolean searches, you'll find what you're looking for. Sometimes, it's just like searching for a topic... you might not find anything directly, but you can't sum up an entire book in just a paragraph either!
Try some links, look around, and it'll be there!
Many of my sites are database-driven sites that run on PHP and MySQL. No problem with indexing, and no problem with the file extensions.
If you can get beyond the backend concept of a dynamic page, most pages really appear to be quite static from an indexing perspective. An HTTP-based indexing system (as opposed to a filesystem-level one) can't tell that pages are dynamic, and doesn't care.
I've never had a problem with search engines failing to index pages just because they had a convoluted URL. If some engines do that, it's a bloody shame.
I'm going to say a naughty word: artificial intelligence. I'm hoping we soon ( 5 years) get good enough at this "indexing" stuff to create semantic models of Web content rather than purely syntactic models. (Google is a small step in the right direction.) If so, then perhaps dynamic pages can be indexed according to their location (role?) in an "ontology" rather than via the frequency of essentially meaningless character strings. That may sound farfetched, but it seems to me that the Web finally provides a real _financial_ incentive with near-term payoff for that kind of research. Hitherto, the quest has been purely academic. And where there's the lure of a real payoff, stuff often happens quickly (usually -- batteries and flat-screen technologies being notable exceptions).
I hope that after I die the one word people use to describe me is "resurrected."
The answer to all this isn't going to come from making existing engines better, nor is it going to come from bigger, badder, faster database engines powered by your friendly clustering technologies!
The answer is simple: More specialized search engines. You're looking for technical stuff? Then you should be able to search a technical database. Like, if I'm looking for source code to model fluid flows - that's pretty specific already. There's no reason that I should have to wade through all the references to "bodily fluids" that I'll get on altavista for instance!
Search engine people, take note of this. Classify your URLs into categories - like Yahoo - but come up with some way to do it automatically. Or even better yet, let the users do it, a la NewHoo.
End of internet predicted. Film at 11. We've heard it before, and we'll hear it again. Just need someone with a little VC money to throw it towards an idea that supports more specialization in search engine tech.
Kudos..
..don't panic
I use a free site statistics service to keep track of hits to my web site, where I keep some software that I've written. Looking at the referrer statistics to my site, the vast majority of hits are generated from explicit, categorized links to my site (e.g. bookmark pages and surprisingly Lycos which has a categorized database), and rarely ever from general search engines like Altavista. The questioner may be right - from the perspective of a web site owner, general search engines aren't very effective at bringing visitors to my site.
IIRC, XML was designed to help alleviate this sort of thing. Unfortunately, XML has not been exploited enough to have any significant ramification on the way the internet is sorted.
Why doesn't anyone use the ScriptAlias directive? It does the same thing as query strings, but makes it look nicer, like the rest of the web. You can "say" you're looking at a directory or a .html file, but in reality you are viewing a single script. For an example go to http://store.wolfram.com/. There are no directories on the server side, it's all served off of one script. Yet, to the user, it appears as a hierarchical directory structure, complete with .html files. The only query string is your session id, which is appended to the URL in case your browser doesn't support cookies (however, these are not there if a robot views the site). Anyway, a simple directive like ScriptAlias can save everyone a lot of trouble. If anyone has questions about its usage, send me an email.
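On the script side, the part of the URL after the script's own name arrives in the CGI variable PATH_INFO, and the script can split it into the same parameters a query string would have carried. The sketch below shows that step only; the category/item layout and the example paths are hypothetical and are not taken from the Wolfram store.

```python
def parse_store_path(path_info):
    """Turn a directory-looking path into the parameters a query string would carry."""
    parts = [p for p in path_info.strip("/").split("/") if p]
    if not parts:
        return {"category": None, "item": None}
    last = parts[-1]
    if last.endswith(".html"):
        # A trailing "name.html" is treated as an item inside the category path.
        category = "/".join(parts[:-1]) or None
        return {"category": category, "item": last[: -len(".html")]}
    return {"category": "/".join(parts), "item": None}


print(parse_store_path("/books/mathematica/intro.html"))
print(parse_store_path("/books/"))
```

To a spider these URLs look like ordinary static files in ordinary directories, so a rule that skips query strings never applies to them.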
Jon
Engineering and the Ultimate
I hate to do an "amen to that, brother" post, but I'm going to do so.
Any reasonable search term is likely to present results like "Search returned 417,373 hits. Hits 1-10 displayed." You have to then winnow by adding include and exclude words until you get it down to a manageable 7,422 hits, then you browse them.
The truth is, I turn to wide searches quite rarely. I tend to find and "bookmark" authoritative sites I find on a given topic and return to those over and over again. It is only when a site grows noticeably stale or I have to research a new topic that I turn, reluctantly, to search engines. As for indexing database sites, I like the idea of extending the robot hack. Slightly less appealing would be to have a new HTML tag to include "bot content" in any page, including dynamic pages. An XML solution is a good idea, but I wonder how long before every extant site gets XML-aware? On top of that, XML is almost too flexible, making it likely that a hundred competing methods for indexing dynamic pages will appear and no one will know which one to cling to.
Hmm. How about this for an idea:
when a webbot sees a dynamic page, it changes the query to ?Webbot - and expects to get back a specially formatted page starting <H1>Webbot index</H1> and followed by a set of comma separated keywords, a break, a URL, a paragraph, then the next set? The webbots would be happy, as they don't have to waste bandwidth and CPU time spidering over the site; the server should be happy, as it doesn't have to support the webbot's spidering; and the site owners should be happy, as they can specify what keywords each result will be indexed under. Obviously, just reformatting the index to the product database could generate this page for an ecommerce site, and more static sites could just use a static statement of what their site carries.....
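A rough sketch of the server side of this proposal, following the format described above: a heading, then for each entry the comma separated keywords, a break, the URL, and a paragraph mark. The function names, the sample entries, and the exact tag placement are assumptions; the proposal does not pin the format down further.

```python
from html import escape


def webbot_index(entries):
    """Build the proposed index page from (keywords, url) pairs."""
    lines = ["<H1>Webbot index</H1>"]
    for keywords, url in entries:
        # keywords, a break, the URL, then a paragraph mark before the next set
        lines.append(f"{escape(', '.join(keywords))}<BR>{escape(url)}<P>")
    return "\n".join(lines)


def handle_request(query_string, entries):
    """Serve the index to a bot asking for ?Webbot, the normal page otherwise."""
    if query_string == "Webbot":
        return webbot_index(entries)
    return "<H1>Regular page</H1>"


# Stand-in for rows pulled from a product database.
entries = [
    (["widget", "blue"], "http://example.com/item?id=1"),
    (["widget", "red"], "http://example.com/item?id=2"),
]
page = handle_request("Webbot", entries)
print(page)
```

Because the site owner chooses the keywords, this carries the same abuse risk as meta tags, so an engine would probably still want to spot-check the listed URLs.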
--
-=DaveHowe=-
For me the best way to search the internet is to go to a site dealing with the context of your query and search that site with its own search engine (which most major sites have).
It would be nice if a generic search engine worked in the following way:
1. User searches for say "Cisco VPN Routing"
2. The search engine identifies sites www.cisco.com and other sites which are related to the search query string.
3. Instead of trying to index these sites itself, it calls on the search engine at each site matching the context and queries that instead.
4. Returns the results of the search at cisco.com to the user.
It's kind of like a distributed search, where the actual search is done by the holder of the data. All that the search engine actually does is try to find a context for the search query and find sites with their own search engines that match that context.
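The four steps above can be sketched as a small dispatcher: match the query against a table of site topics, then hand the query to each matching site's own search function and collect the results. Everything here is hypothetical, including the topic table, the word-overlap matching, and the lambda functions standing in for real site search engines.

```python
def pick_sites(query, site_topics):
    """Return sites whose topic words overlap the query, best match first."""
    words = set(query.lower().split())
    scored = [(len(words & topics), site) for site, topics in site_topics.items()]
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [site for score, site in ordered if score > 0]


def distributed_search(query, site_topics, site_search):
    """Forward the query to each matching site's own search and merge the hits."""
    results = []
    for site in pick_sites(query, site_topics):
        results.extend((site, hit) for hit in site_search[site](query))
    return results


site_topics = {
    "www.cisco.com": {"cisco", "routing", "vpn"},
    "www.example.org": {"gardening", "roses"},
}
# Stand-ins for each site's own search engine.
site_search = {
    "www.cisco.com": lambda q: ["/vpn/routing-guide.html"],
    "www.example.org": lambda q: ["/roses.html"],
}
results = distributed_search("Cisco VPN Routing", site_topics, site_search)
print(results)
```

The central engine only has to maintain the topic table, while freshness of the actual results becomes the responsibility of whoever holds the data.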
So in answer to your question: My answer is No, the Internet isn't unsearchable, we just haven't implemented a reasonable standard for searching, which can be as important as routing when it comes to a network of the size of the Internet.
I like a snappy search engine response time as much as the next guy, especially when I'm looking for something fairly current or mainstream. But how can you tell a search engine to tread farther off the beaten path?
For example, a few days ago I was looking for the dip switch settings on an old 14.4k modem. Now I *knew* the info was out there on the web somewhere. I also thought it was highly unlikely to be in any of the major search engines' in-RAM indexes. I would have been quite happy to submit a boolean or reg-ex query to a search engine and then check back an hour later to get the results.
In my mind, instant gratification search engines are useful and have their place, but I see a whole segment which just doesn't seem to be addressed. Is anybody even thinking about working on this?
-matt
Give anyone the ability to talk directly to search engines and you'll see what has been happening with those damn porn sites on a large scale - do a query for anything, and it'll come up with a totally unrelated porn site for you.
People figured out how to abuse keywords real quick, and this would just make it worse. Which is why I wonder about the continued existence of search engines. I use /. as my search engine - I use it to index my way into the web every day. I think that's the way of the future.
PS I hate the G3 keyboards. They're tiny! It's like carpal tunnel syndrome x 5!!!
lf.o
A site-specific search service built around the newly-opened E-speak? Damn good idea.. Not only would it provide an easy interface for searchbots, but in the future it could provide information for user-agents and other client-side searches. I'd imagine it wouldn't inflict as much server overhead as the current system.
I'll be flipping through the E-speak tutorial for the rest of the afternoon!
.sig: Now legally binding!
Browse a few relevant papers and find some keywords to search for more of the part of the field in which you are interested:
It depends on what technology you're using to generate the pages.
Zope sites for instance, are totally dynamically generated, even those pages that would normally be static. But the entire content of the site that's stored in the ODB is traversable via 'normal' URLs. This means that search engines can easily index your entire site.
Note, however, that this only works if you've taken care to expose your content via links. If you've deliberately hidden your content behind a search interface (and you can still do this with Zope), then your site will be no more indexable than any other dynamic site.
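To make the "traversable via normal URLs" point concrete, here is a rough sketch in Python. It is not Zope's actual API; the nested dict standing in for the object database and the names traverse and list_urls are made up for illustration. The idea is only that each path segment selects a child object, so a crawler that follows links can reach everything.

```python
def traverse(root, url_path):
    """Resolve a URL path against a nested dict (a stand-in for an object DB)."""
    node = root
    for name in url_path.strip("/").split("/"):
        if not name:
            continue  # ignore empty segments such as a doubled slash
        if not isinstance(node, dict) or name not in node:
            return None  # no such object: a real server would answer 404
        node = node[name]
    return node


def list_urls(node, prefix=""):
    """Yield every path a crawler could reach by walking the tree."""
    if isinstance(node, dict):
        for name, child in sorted(node.items()):
            path = prefix + "/" + name
            yield path
            yield from list_urls(child, path)


# Hypothetical site content, stored as objects rather than files.
site = {"news": {"1999": {"story": "text"}}, "about": "About us"}
```

Here traverse(site, "/news/1999/story") finds the stored text without any query string in sight, and list_urls(site) is roughly the set of links the site would have to expose for a spider to find it all.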
--
The real Webmaven is user ID 27463. I don't rate an imposter, because my ID is such a lame-ass high number.
It seems to me that having URLs with extensions of:
is incorrect. What is being served is not an ASP script, nor is it a PHP script, nor is it a Perl program. It is, however, an HTML file (or a GIF, or a PDF, etc.), and should be labelled as such.
If your server isn't smart enough to figure out how to generate the requested resource, and needs the generating program explicitly mentioned in the URL, then you need a smarter server. And if you aren't smart enough to figure out how to do this correctly, well...=)
Remember, kids, a URL != a file. All the /. end user cares about is getting an article with the comments formatted appropriately. They don't care[1] if it's stored as a text file, or generated by Perl, or..
[1] Well, they might care in a geek sense, but not in the way needed to read comments.
pooptruck
The web is certainly becoming significantly more difficult to search, especially for informational content. Just -try- searching for information on a musician or an author... you'll get links to the likes of music.com, amazon.com, whatever-your-topic-is.com, with a little one-or-two paragraph blurb about the person, if you're lucky. Hundreds of links like this to every little virtually-hosted e-tailer out there. Somewhere, buried in all this, will be the informational content hosted on a personal webpage or at some non-profit organization. Anyway, so, that's the problem, or an aspect of it, we already know this.
Good news! The solution is coming. Maybe the solution is here. google.com has their unique approach to web-indexing. Another method that's probably going to be tried sometime soon is to look at all the natural-language-processing technology that has been researched in the past twenty years, take the most efficient heuristics, and index pages by apparent-topic instead of by keyword.
Then there are places like anipike.com - if it's a web page about Anime, it's on anipike, or it may as well not exist. I would -never- search the web for anything anime-related; I go through anipike.
I'm really, really hoping that linux.com will become that useful to the linux community, but I don't think they're quite there yet. They may never be. Anipike is generally very fast to load, especially compared to linux.com... probably 'cause it's mostly static pages and there are not so many anime fans as there are linux users. But that isn't really relevant; if linux.com is going to become the search-engine alternative for linux-resources, they need to respond quickly at all times of the day and night, otherwise 'Joe's Linux Links' is a better option. (Apologies to any Joe out there who is proud of his links page. :))
Anyway, currently I still use search engines for Linux-stuff, but as I keep getting more and more hits on rpm files cluttering up the informational content, that may change soon. (Especially since I'm a debian user! I'm looking for information when I search the web, I know where my package is. :))
--Parity
'Card carrying' member of the EFF.
The solution is easy. Don't use them in your URLs.
Do not use GET args in dynamically built links, but hide your args in a longer plain ole URL. For example, a script at http://www/x/y can actually interpret http://www/x/y/z/ just fine and you can then parse off z as an argument.
First, alias a directory that runs your CGIs, PHPs, etc. Like you would cgi-bin but don't call it that!
Then, plant your cgi program(s) in there. The "arguments" further down would be in the PATH_INFO variable (which you'd have to parse out manually).
So, in the case of http://www/aa/xx/yy/zz/ the script is in the aliased /aa directory. The script is named xx and the PATH_INFO passed to it, in the above example, would be /yy/zz/
This works with Apache. Don't have Apache? Upgrade today at www.apache.org :-)
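For anyone who wants to see the parsing step, here is a minimal sketch of such a script in Python. It assumes the server runs it as an ordinary CGI program and sets PATH_INFO the way the CGI convention describes; the function names split_path_info and handle_request are invented for this example.

```python
import os
import sys


def split_path_info(path_info):
    """Split a PATH_INFO string such as '/yy/zz/' into its segments."""
    return [segment for segment in path_info.split("/") if segment]


def handle_request(environ):
    """Build a plain-text CGI reply listing the arguments hidden in the path."""
    args = split_path_info(environ.get("PATH_INFO", ""))
    body = "arguments: " + ", ".join(args) + "\n"
    return "Content-Type: text/plain\r\n\r\n" + body


if __name__ == "__main__":
    # When run by the web server, the environment carries PATH_INFO.
    sys.stdout.write(handle_request(os.environ))
```

With the script installed as xx under the aliased /aa directory, a request for /aa/xx/yy/zz/ should hand it a PATH_INFO of /yy/zz/, which splits into the two arguments yy and zz. No question mark ever appears in the URL.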
The client shouldn't infer the type of the object based on an "extension" in the URL at all ... that is what the Content-Type header is for!
Think Google.
Google works on the idea that pages that have a lot of incoming links are authorities on what they discuss, so they should be ranked highly.
A modification of this is to not only rank a site's authoritativeness (eh?) this way, but also what kind of content it has. So if 10K geeks all have homepages that include the words "geek" and "computer" and also point to /., it is reasonable to assume that regardless of actual content today, /. typically is a good result to return for the search "geek sites".
Of course, some of those homepages will also have the words "tennis" and "knitting", that will be spuriously attributed to /., but the idea is that they will be outliers, and drown in the noise.
This basically is keyword indexing, but the keywords are dynamically determined, rather than using the broken meta tags.
The big problem with this approach is implementation; the association tables are likely to be huge.
Also, you assume a large sample size, so that the outliers will cancel.
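A toy version of the association table, to show the shape of the idea. This is only a sketch and not how Google or any real engine does it; the page data is made up, and the min_share cutoff is an arbitrary assumption for deciding what counts as noise.

```python
from collections import Counter, defaultdict


def inbound_keywords(pages):
    """pages maps url -> (words on the page, urls it links to).

    Returns, per link target, how many linking pages used each word,
    plus how many pages link to that target at all.
    """
    keyword_counts = defaultdict(Counter)
    linker_counts = Counter()
    for url, (words, links) in pages.items():
        for target in set(links):
            if target == url:
                continue  # a page does not vouch for itself
            linker_counts[target] += 1
            keyword_counts[target].update(set(words))
    return keyword_counts, linker_counts


def keywords_for(target, keyword_counts, linker_counts, min_share=0.5):
    """Keep only words used by at least min_share of the linking pages."""
    total = linker_counts[target]
    if total == 0:
        return []
    return sorted(
        word
        for word, count in keyword_counts[target].items()
        if count / total >= min_share
    )


# Three hypothetical home pages, all pointing at the same site.
pages = {
    "home1": ({"geek", "computer", "tennis"}, {"slashdot.org"}),
    "home2": ({"geek", "computer"}, {"slashdot.org"}),
    "home3": ({"geek", "computer", "knitting"}, {"slashdot.org"}),
}
counts, linkers = inbound_keywords(pages)
```

In this sample, "geek" and "computer" appear on all three linking pages and survive the cutoff, while "tennis" and "knitting" each appear on one of three and get dropped. With only three pages the cutoff is doing all the work; the outliers only really cancel with a large sample, and the table grows with every distinct word on every linking page.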
Johan
I have a few (very pertinent) meta-tags on the information page for a mailing list that I run. The tags are designed to get hits from people looking for my list. But, it seems that the meta tags don't work in some of the major search engines. Perhaps the engines have caught on to the practice of embedding superfluous tags in order to get hits on engines. I think I'll have to rework my page to make sure that the key phrases that I'm trying to get hits on actually appear in the text.
I have discovered a truly marvelous sig, unfortunately the sig limit is too small to contain i
Yes, but many file systems, which may be the destination of the results of the HTTP request, *do* make use of extensions to determine file type. Though, perhaps, storing MIME-type meta-information would be better, we're stuck with what we've got.
Also, I mean the URL to also be used as a user-interface. For example:
http://slashdot.org/99/12/14/1154243/comments
would generate your browser's preferred format, whereas requests to:
http://slashdot.org/99/12/14/1154243/comments.pdf
and
http://slashdot.org/99/12/14/1154243/comments.scm
would return the PDF and the Slashdot Comment Markup Language (an XML app) respectively. This could be done with content-type markers, but the interface is much poorer than simply using file extensions.
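Roughly what I have in mind on the server side, as a Python sketch. The FORMATS table and the choose_format name are invented for the example, and the media type for the comment markup is a guess (it is an XML app, so application/xml is assumed). The default stands in for whatever the browser says it prefers.

```python
import posixpath

# Hypothetical mapping from URL extension to the Content-Type to send.
FORMATS = {
    ".pdf": "application/pdf",
    ".scm": "application/xml",  # assumed type for the XML comment markup
}


def choose_format(url_path, default="text/html"):
    """Return (resource path, content type) for a requested URL path."""
    resource, extension = posixpath.splitext(url_path)
    if extension in FORMATS:
        return resource, FORMATS[extension]
    # No recognised extension: serve the resource in the default format.
    return url_path, default
```

The same stored comments resource is found either way; the extension only selects the representation, and the Content-Type header still tells the client what it actually got.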
pooptruck
You missed the best stuff, though! You forgot to mention:
Take a look at my site, theFYI. Still a work in progress as the backend isn't done (yet). Dig through the source and see how it's built. I would have loved to use CSS for element layout but, hey, the browser support just is not there yet. Stuck with tables for a few more years. BUT take a look at the structure around each article. The header is denoted with an {h1} tag, its appearance changed with CSS-1. The paragraphs are marked with paragraph tags and, well hell, the linked URLs are surrounded with {cite} tags. That's how you code indexable HTML.
Used a lot of the same tricks on another site, http://www.ptrm.org/ and the site does well in the search engines. Specifically, check out the page on the PTRM's paleontology field tours. It does well in the engines simply because it's got 'dinosaur' in the page title and in a header tag.
(Yes, I know that curly brackets don't go around HTML tags. I just didn't want to escape the angle brackets every time I used an example of HTML)
----
Am I the only one who thinks Microsoft is a misnomer? Perhaps Macrosoft would be a better fit?
Example: Suppose you're looking for information on a Zip drive. You already have the drive but are having trouble with it (problems with zip drives? really?
Of course I don't even bother, I just go straight for the LDP, but Windows users don't have that option.
It would be interesting to be able to search only engineering sites for engineering information. I once did a search for "wheatstone bridge" and got tons of $cientology links. If the engine was able to determine if a site was, in fact, an engineering information site, that wouldn't have happened.
How about a "no pr0n" checkbox. That would be sweet.
Of course that would require a herculean effort in changing the standards and getting site owners to be honest.
But maybe not. Here are some ideas I was thinking of:
It's not perfect, but it's gotta be better than the garbage we put up with now.
I think that would qualify for a patent. Go for it. It's a great idea.
I just made the entire site unusable by my entire company by viewing the robots.txt. How proxy server friendly. I hope nobody tries to look at the robots.txt file through an AOL connection.
If ebay has their way, indexing data is equivalent to cracking into another's system illegally
I think what you meant to say was "If ebay has their way, accessing a copyrighted database and publishing information from it after being explicitly told not to is equivalent to cracking into another's system illegally."
I guess that means that we should do away with all search engines entirely...
I'm afraid you're right. We're pretty close to a time when most web pages will be served up programmatically from what amount to copyrighted databases. Indexing such sites without explicit permission from the content owners would be legally risky.
Standard disclaimer: IANALG (I am not a Linux geek.) Rather, I'm a web design geek. So please, be nice.
From what I understand, what a lot of the OpenSource movement is about is doing it yourself if you don't like how it's being done now. Don't like commercial Unix, Linus? Make your own fscking Linux and let everyone contribute. Oh yeah: Give it away free to really piss people off.
In this discussion, there are a ton of excellent ideas for how search engines should operate. Yet no one, to my knowledge, has put forward the next logical step: Build our own search engine. Google is a good start but hey, I know you guys could build it better. Worried about hardware and bandwidth costs? Venture capital.
As I said, do it yourself. :-)
----
Am I the only one who thinks Microsoft is a misnomer? Perhaps Macrosoft would be a better fit?
Don't overlook XML-RPC, which builds on the XML spec to provide a way of serving data over the web to remote clients.
Then there's RSS, which is a way of serving up a news channel or other changing data. These applications are here and in use. Together, these XML-based technologies will someday provide the data layer for the software agents of the future. Read lately about that new "price-checker" technology? Imagine being the one business that doesn't serve up your product list and pricing to that agent.
An interface from XML to these "hidden" databases is only a matter of time. We're just caught right now at a moment between technologies: the authoring tools don't really exist.
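As a small taste of what an agent on the receiving end might do, here is a sketch in Python that pulls the items out of an RSS channel using only the standard library. The feed text is made-up sample data (example.com is a placeholder, not a real store), and read_items is a name invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up RSS channel standing in for a site's product or news feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example Store</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical product listing</description>
    <item>
      <title>Widget</title>
      <link>http://www.example.com/catalog/widget</link>
    </item>
    <item>
      <title>Gadget</title>
      <link>http://www.example.com/catalog/gadget</link>
    </item>
  </channel>
</rss>"""


def read_items(feed_text):
    """Return a (title, link) pair for every item in an RSS document."""
    root = ET.fromstring(feed_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


items = read_items(SAMPLE_FEED)
```

The point is that the agent never has to scrape HTML or guess at a query string: the site hands over its "hidden" database rows in a form that is trivial to index.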
----
lake effect weblog
{Network engineer in Chicago--looking for work!}
PHP Builder ran an article describing how you can have the Apache web server treat a certain "directory" as a script, using the Location directive. So if I had a script called www.mydomain.com/foo then I could access www.mydomain.com/foo/param1/param2 and the foo script would run, and could use environment variables to find the "path" foo/param1/param2. I tried it, and it works quite well. This hides GET parameters as "paths" so that search engines don't think the pages are dynamic (this is how Amazon.com works).
SPAM
...but you made an elaborate site entirely dependent on Microsoft Active Server Pages, and you're expecting it to work with _any_ web standards, much less be indexable and spiderable as if it was proper HTML? I'm afraid that you stepped right into that one. Look on the bright side- were it not for this Ask Slashdot article, you might never have known you weren't indexable, as this seems to be a little known fact! That alone is rather shocking.