Search Engines-Does Obscurity Prevent Exploitation?
GeekLife.com asks: "Search engines refuse to release (and often change) the exact criteria that determine their ranked results, presumably both to prevent competitors from stealing their techniques and to stop (or at least make less successful) attempts at "cheating" - optimizing a site to exploit these criteria, resulting in a higher ranking than it deserves. Is this an example where keeping the specifics a secret actually improves the tool? Or would releasing all the rules result in enough feedback ('given enough eyeballs...'), honing the criteria towards unexploitable results?" Interesting thought. Can current systems be improved to give better results or have we reached an 'accuracy limit' as far as keyword-based searching is concerned?
The Google "algorithm" is explained on the Why Use page on Google. Although it doesn't give the *exact* code used, it explains (in English) the whole process pretty well.
--Xandu
Well, while there are user-submitted lists of sites (Yahoo, whatever), I think it's just about time for a moderated search engine.
Users could submit links under different subjects or categories, with keywords that get added to the harvested ones. Most important, registered users would get x moderator points a week and could vote down spam links or links that don't make much sense for the search they conducted.
Add a healthy dose of meta-moderation (maybe three levels) and some obvious anti-cheat techniques, and it should work much better than a normal search engine.
God knows that even on Google, obviously poisonous sites often come up in the search. It would be so nice to have a button to click to moderate down the page or the domain itself...
-- the cake is a lie
Well, also the fact that a huge chunk of the web isn't even indexed at all.
Other than that, though, the interfaces that most search engines use are pretty bad. There is usually no way to filter through a set of results to eliminate things that are obviously not what the searcher wants. Just being able to eliminate a set of domains from the initial results would make a huge difference for me.
Also, most people have no clue how to effectively use search engines - and they're not all that interested in doing so. I've been working in the web industry for quite a long time, and most of my colleagues seem to have no idea that changing the settings can yield better results. The setting 'phrase', for instance, makes a HUGE difference much of the time - yet I've never seen a colleague change any default settings when doing a web search. If you're not willing to so much as toggle an individual setting, you deserve the crappy results you get.
Oh, another thing - many of the links I get back are of dubious quality - even on the setting 'phrase', many results come back that don't match what I specified. If you play by the rules and the results STILL don't match, I have little faith in ANY results, even if the web site operators are trying to override accuracy. This is aside from the very common result of '404 not found' pages.
Right now, the best search engine I know of is a meta search engine called 'ProFusion' - I've had much better luck with it than with Google. Not enough control over Google... I also like that the results with ProFusion ( http://www.profusion.com ) come back with an option next to each result to open in a new browser window - now THAT's a nice idea!
Some really good points by previous posters that I want to recap:
If you open up the criteria such that *everyone* exploits the criteria, then there is no discrimination. When the criteria are closed, only those who have found the exploits can get increased exposure, making it inherently unfair.
Another issue is that what a search engine wants you to see is different than what you want the search engine to give you, in some cases.
We want the union of two criteria: the results that give the search engine the most use/reuse (usefulness of the search) and the results that give the search engine the most financial recompense (so that the search engine can grow, get better, get faster, etc.).
They may not be correlated, but they are both very important. The most useful pages may not give them the most money, and the pages that pay them the most may not generate enough repeat use for them either.
Perhaps the best search algorithm is two-step:
Rank according to links (the more links to a page, the more useful the page)
Count repeat use (the more times a search has to be refined, the less useful the pages returned)
Ranking according to links already occurs at AltaVista and Google.
I don't know that anyone does the second.
Say you do a search on Google; if you hit the next button, then the pages that were just listed get knocked down a few points. If you hit Google again a few minutes later with a variant search, then knock a few points off *all* the pages that got listed in the previous search. If a user goes back and hits 'related' pages, add points to that page and all the related pages. Repeat the above algorithm for every hit to Google.
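To make the repeat-use idea concrete, here is a rough Python sketch of how those adjustments could be tracked. The class name, method names, and point values are all made up for illustration (the post only says "a few points"), and nothing here reflects how any real search engine works.

```python
from collections import defaultdict

# Hypothetical point values; the post only says "a few points".
NEXT_PAGE_PENALTY = 1
REFINED_SEARCH_PENALTY = 2
RELATED_BONUS = 3


class FeedbackScores:
    """Tracks per-page point adjustments driven by searcher behaviour."""

    def __init__(self):
        self.points = defaultdict(int)

    def next_page(self, shown_pages):
        # The searcher paged past these results, so none of them satisfied.
        for url in shown_pages:
            self.points[url] -= NEXT_PAGE_PENALTY

    def refined_search(self, previous_results):
        # A variant query soon after means the whole previous list fell short.
        for url in previous_results:
            self.points[url] -= REFINED_SEARCH_PENALTY

    def related_clicked(self, page, related_pages):
        # Asking for related pages suggests the page was on target.
        for url in [page, *related_pages]:
            self.points[url] += RELATED_BONUS

    def rerank(self, results, link_rank):
        # Combine a link-based rank (higher is better) with feedback points.
        return sorted(
            results,
            key=lambda url: link_rank.get(url, 0) + self.points[url],
            reverse=True,
        )
```

A real system would also need some defence against people gaming the feedback itself, which is the same problem the original question is about.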
The nick is a joke! Really!
GPL Deconstructed
That is only part of the way it works.
Sites are grouped into categories known as authorities and hubs. A hub points to lots of different pages (Yahoo, for instance). An authority has lots of different places pointing to it.
Where the ranking comes into play depends on how good the hub or authority is. A hub is good (better than others) if it points to a number of good authorities. Likewise, an authority is good if it is linked to by a number of good hubs. Yes, this is a recursive process, and yes, it takes a number of passes to get the ranking to level out.
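The recursive scoring described above closely resembles Kleinberg's HITS algorithm. Here is a minimal Python sketch of that scheme, assuming the link graph is given as a dict from each page to the pages it links to. The function name and the fixed pass count are my own choices, and this is not a claim about what Google actually runs.

```python
import math


def hits(links, passes=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(passes):
        # An authority is good if good hubs link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]

        # A hub is good if it links to good authorities.
        hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}

        # Normalise so the scores settle instead of growing without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth
```

With a graph where two hubs both point at the same pair of pages and an isolated dummy page points at one spam page, the shared targets end up with much higher authority scores than the spam page after a few passes.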
If a pr0n site wanted to exploit Google to get a higher ranking, they would first need to create a LOT of dummy sites to link to it, and all those dummy sites would need to be found by Google's robot.
However, just having a large number of dummy sites linking to the pr0n site is not sufficient. Those dummy sites would also have to link to a large number of other GOOD authority sites (on pr0n or whatever).
Now, throw another wrench into the works. Google doesn't search only on keywords ON the site itself, but on the sites that refer to it, and the other way around. That's why if you search for "more evil than satan himself" you end up with Microsoft as a prominent result, even though the words evil and satan probably don't appear anywhere on Microsoft's website (although maybe they should).
This way, if you were searching for pages about a certain topic, but the pages themselves don't actually use the words you're looking for, you will still find that page as long as there are good hubs out there that refer to that page and use your search terms in close proximity to the links.
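A toy Python sketch of that idea: index each page under the words used in links pointing to it, so a page can match a query it never mentions itself. The function names and the tuple layout are invented for illustration, and it only looks at the link text itself rather than nearby words, which is a simplification of "close proximity".

```python
from collections import defaultdict


def build_anchor_index(links):
    """links is an iterable of (source_url, target_url, anchor_text) tuples."""
    index = defaultdict(set)
    for _source, target, anchor_text in links:
        for word in anchor_text.lower().split():
            # Credit the word to the page being linked to, not the linking page.
            index[word].add(target)
    return index


def search(index, query):
    # Return pages whose inbound link text contains every query word.
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

In a fuller version the matches would then be weighted by how good the linking hubs are, rather than treated as a plain set.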
Now, if a hub points to a large number of authorities on a specific topic, words relevant to that topic then become viable search terms for finding the hub, as the hub would also be a good source of information, even if it doesn't list the specific search terms. All of this affects the "ranking".
So, for a dummy hub to get a high ranking, it would need to point to a large number of high-ranking authority pr0n sites (which would be counterproductive when what you're trying to do is advertise your own site). This would raise the hub rating for certain terms (specific to pr0n sites), and therefore raise the bar on the site you're trying to promote.
Of course, trying to get a pr0n site to come up on a search for "teen" or even "sex" is not easy because, while a pr0n site is generally fly-by-night, there are many legitimate sites which have been around for several years and have built themselves into the web structure well and therefore get categorized correctly.
-Restil
Play with my webcams and lights here