Search Engines-Does Obscurity Prevent Exploitation?

Unexploitable? .... -1 flamebait by Emugamer · 2000-09-13 06:55 · Score: 1

err did someone say unexploitable? name me one thing that millions use that has never been exploited that everyone knows how it works and I will show you something not worth being exploited. P.S. If anyone releases a good way to exploit the rankings on google I will kick their ass

C;mon, open it up! by jailbrekr2 · 2000-09-13 06:56 · Score: 1

there is only a finite number of ways one can use the word 'sex' as a keyword.....

--
Feed The Need[goatse.cx]

Re:C;mon, open it up! by grahamsz · 2000-09-13 07:22 · Score: 2

dunno i can think of at least 18 different ways

ummm oh i see, as a keyword :)
Re:C;mon, open it up! by ShakespeareProj · 2000-09-13 19:35 · Score: 1

18 different ways. Yup, that seems finite to me.

Unfortunately not. by Hanzie · 2000-09-13 06:56 · Score: 3

Since there's a strong incentive to get your site listed in the search engines, the search criteria will always be exploited.

A friend of mine left the company I work for and started making porn pages for an australia based porn company.

He is supposed to make 400 pages per month, all somewhat different. He gets a bonus based on how many hits are generated, and a commission based on signups from his banner ads.

He's doing pretty well financially

--
********* sig: If you don't like the law, get filthy stinking rich, and buy a better one.

Re:Unfortunately not. by Anonymous Coward · 2000-09-13 07:05 · Score: 1

Hrmm I could do that job... *grin* where do you go to find these people? I don't see them advertising on monster.com...

Search engines can -always- be improved by vertical-limit · 2000-09-13 06:56 · Score: 3

Have we reached an 'accuracy limit'? Not for now, at least. While search engines have been improving, there's still a long way to go before they can serve up the correct page 100% of the time. Obviously, it's impossible for the search engine code to emulate the human brain; there's no way to tell exactly what the searcher wanted. Instead, search engines can only "guess", which is why you always end up with a few oddball results.

The only way to achieve true search engine accuracy is to have an actual person search for pages on request. Why no company has thought of this, I'm not sure, as this could certainly be an explosive business opportunity here. The difficulty of finding trustworthy information on the Internet is legendary, and I'm sure plenty of clueless newbies would pay a monthly fee to get better search results.

Re:Search engines can -always- be improved by Hanzie · 2000-09-13 07:07 · Score: 2

Initially, I was going to say why human searching is such a bad idea...

However, most of my family (a bunch of engineers) seem to think that the best form of internet searching is asking me to do it for them.

Two days ago, an associate of military persuasion requested I do some searching. After telling him about google, he still wanted me to do the search in my free time at work. He said I could be his 'intelligence analyst'.

My reply would be inappropriate for this forum.

--
********* sig: If you don't like the law, get filthy stinking rich, and buy a better one.
Re:Search engines can -always- be improved by rongen · 2000-09-13 07:11 · Score: 4

The only way to achieve true search engine accuracy is to have an actual person search for pages on request. Why no company has thought of this, I'm not sure, as this could certainly be an explosive business opportunity here.

Dear GOD, those people at my local library! They must be part of some top secret start-up R&D initiative! So helpful, and for FREE! I KNEW they were up to something!!! :)

--8<--

--

--8<--
Re:Search engines can -always- be improved by shion · 2000-09-13 07:19 · Score: 2

In regards to getting the *exact* page the searcher wants, that's compounded more by the fact that different people interpret things in different ways.
I'd like to see a search engine which 'profiles' an individual user in an attempt to improve search results. And that segues nicely into the next paragraph:
Ideas like this exist to some extent, but unfortunately all the ones I've seen are of the targeted advertisement variety (gross..). It seems any kind of profiling ability eventually ends up being used for ads. Are there any search engines out there that keep track of preferences like this, becoming more accurate over time, and without any pornographic vampire junk mail distributors lurking in the background?
It'd be nice to combine the personalization of intelligent agents with the vast databases of major search engines...
Re:Search engines can -always- be improved by sracer9 · 2000-09-13 08:02 · Score: 1

I agree. We've definitely got a way to go before we achieve near 100% correct results. Take the company I work at for instance. I've personally gone through our web site to be sure that there are meta tags, content, links etc... that would be searchable. But still, when performing a basic search under Google or any other engine, we're hard to find. If you search for our company name, however, you will find all kinds of pages on us. That's great if you already know about us, but not so great if you want to find us. I've often wondered what exactly it is that allows me to find some sites that don't appear to be what I'm looking for, and not find the one I am. Anyway, yes, I concur that there must be a better way.

--

No thanks. I don't smoke anymore.
Re:Search engines can -always- be improved by B'Trey · 2000-09-13 08:13 · Score: 2

And of course there's the fact that there often isn't a "correct page." The user may have only a vague idea of what he's looking for; the information needed may be quite expansive and not covered on a single page or even a single site; the user may want to see multiple viewpoints on the same subject, etc. Even an actual person would have to return a list of possibilities in most places. The person would just be better at removing the obvious dross; they'd still likely return a number of false hits for most queries.

--
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
Re:Search engines can -always- be improved by NicGCotton · 2000-09-13 08:26 · Score: 1

Actually, I remember that there WAS a company that had humans search the internet for information you wanted. I don't remember the name (of course). Their main focus was on major research that would include internet sites, but they would also do an internet-only search for some small fee.

--
"You must do the thing you think you cannot do" E.Roosevelt
Re:Search engines can -always- be improved by BradleyUffner · 2000-09-13 08:32 · Score: 1

OMG!! Profiling users! you blasphemer! the web page would be tracking your personal web habits and all slashdotters know that that is EVIL! no good can come from web pages that track user data. If you can't tell, I'm being facetious. :)
Re:Search engines can -always- be improved by TMB · 2000-09-13 09:09 · Score: 1

I'd like to see a search engine which 'profiles' an individual user in an attempt to improve search results. And that segues nicely into the next paragraph: Ideas like this exist to some extent, but unfortunately all the ones I've seen are of the targeted advertisement variety (gross..). It seems any kind of profiling ability eventually ends up being used for ads. Are there any search engines out there that keep track of preferences like this, becoming more accurate over time, and without any pornographic vampire junk mail distributors lurking in the background?

If the profiling is done well, and you're getting pornographic vampire junk mail, then you must be fond of pornographic vampires! ;-)

[TMB]
Re:Search engines can -always- be improved by pclinger · 2000-09-13 10:50 · Score: 2

I'd be willing to trade accurate search results for banners that I care about.

I mean, you get the "best" of both worlds.

First off, you get accurate results -- which is awesome.

Secondly, who wants to see an ad for viagra when you can see an ad for (Gateway | Dell | etc) which you might actually be interested in? It's like watching commercials on TV. Do you guys really want to see tampax commercials, or do you pay more attention to things like Frys/Comp USA/Best Buy/Computer stuff?

I personally would prefer to see ads tailored more to my interests than to .. say, something a pregnant woman could use
--

--
/. editors made it impossible to link to file:///c:/con/con in my sig. Please just type it in
Re:Search engines can -always- be improved by eudas · 2000-09-13 11:17 · Score: 1

better drop your high-$ .com startup job and go join 'em as a librarian quick, then.

eudas

--
Blessed is he who expects the worst, for he shall not be disappointed.
Re:Search engines can -always- be improved by sydb · 2000-09-13 14:44 · Score: 1

There's not many banner ads for maternity wear on the web as far as I can see. I hope you don't think tampax is for pregnant women...

--
Yours Sincerely, Michael.
Re:Search engines can -always- be improved by Karellen · 2000-09-13 15:34 · Score: 1

"Do you guys really want to see tampax commercials, or do you pay more attention to things like Frys/Comp USA/Best Buy/Computer stuff?"

I pay more attention to the Tampon ads. I don't need another computer right now, and certainly not from the places that advertise most. I also don't need tampons.

But - the tampon ads have much cuter women in them. Yay!

--
Why doesn't the gene pool have a life guard?
Re:Search engines can -always- be improved by streetlawyer · 2000-09-13 15:52 · Score: 1

The only way to achieve true search engine accuracy is to have an actual person search for pages on request. Why no company has thought of this, I'm not sure
As far as I can tell, this feature has been implemented under the name "Ask Slashdot".

--

-- the most controversial site on the Web
Re:Search engines can -always- be improved by Kaa · 2000-09-13 21:28 · Score: 1

I'd be willing to trade accurate search results for banners that I care about

You would? How interesting. Why don't you then go and sign yourself up with as many user profiling services as you can find? I am sure they'll find a way to send you many, many ads that you would care about.

Secondly, who wants to see an ad for viagra when you can see an ad for (Gateway | Dell | etc) which you might actually be interested in?

I do. Ads for Viagra are funnier. And why in the world would I be interested in a Dell ad?

It's like watching commercials on TV. Do you guys really want to see tampax commercials, or do you pay more attention to things like Frys/Comp USA/Best Buy/Computer stuff?

First, I rarely watch commercials. When I do it is for pure entertainment value. Some tampax commercials are funny. Most CompUSA commercials are boring and ugly.

In any case, what makes you treat commercials as a source of information?

Kaa

--

Kaa
Kaa's Law: In any sufficiently large group of people most are idiots.
Re:Search engines can -always- be improved by SkunkPussy · 2000-09-14 05:32 · Score: 1

do pregnant women use tampax? methinks you have spent a little too long focussing on the Frys/Comp USA/Best Buy/Computer stuff :-P

--
SURELY NOT!!!!!

criteria. by mtvsucks · 2000-09-13 06:57 · Score: 1

personally i'd like to see criteria opened so i can get more hits on my "bee arther naked" pr0n site.

--
1337

Re:criteria. by King+of+the+World · 2000-09-13 07:18 · Score: 1

"bee arther naked" pron site?
The chick of Golden Girls with little hexagons in her eyes naked?
That's not old age, no sir, those are pollen sacs.

--
--Giving to trolls for the benefit of us all
Re:criteria. by Mortanius · 2000-09-13 09:15 · Score: 1

If you're going to do this, at least get it right. Bea Arthur. "Bee arther naked" sounds vaguely Yoda-esque.
Re:criteria. by King+of+the+World · 2000-09-13 09:31 · Score: 1

uh, being my point entirely.
Sorry if I didn't spell it out for you.

--
--Giving to trolls for the benefit of us all

Many-eyeballs doesn't apply by jmv · 2000-09-13 06:59 · Score: 3

I think the many-eyeballs argument doesn't apply here, because it's not about finding bugs in the rules, but about preventing cheating. When a site decides to "cheat", it doesn't exploit a bug. The scoring system is not like a kernel, where you know exactly what should happen; it's a (generally) complex AI system. These systems are designed so that they work well enough 95% (or 80%, this is not the point) of the time. They're going to miss a few percent, but what else can you do? Now if you have access to the rules, you can make sure your site uses the 5% of errors to get to the top of the list. Unless someone thinks he can have 100% accuracy (how do you measure accuracy anyway?), the scoring rules shouldn't be released.

--
Opus: the Swiss army knife of audio codec

Re:Many-eyeballs doesn't apply by BlueArcus · 2000-09-13 19:35 · Score: 1

That's bang-on. Many eyeballs doesn't apply unless everyone uses the 'exploitative methods' that the many eyeballs reveal...

A good comparison is sailboat racing... the handicap systems there are designed to 'rate' how fast a boat is and adjust race finishing times accordingly. Just like rating a web-page for relevance, this is a computationally pretty hard problem, not an exact science. The actual underlying formulae are *very* closely guarded secrets in most cases, and what generally happens is that a new rating system gradually becomes obsolete as boat designers learn the weaknesses of the underlying algorithm and how to exploit (design around) them.

A tricky problem, but many-eyeballs is a non-starter I believe.

--
Think today's great? Should've been here *yesterday*.

If everyone exploited the criteria... by ripicheep · 2000-09-13 06:59 · Score: 1

...then we would get the most relevant listed first, and not the most exploitative.

--
"A witty saying proves nothing." -Voltaire

Exploit for Google? by Adam9 · 2000-09-13 07:01 · Score: 1

So far I haven't read anything that has "exploited" Google's technique, which is popularity based. Will someday some website create 1000 pages that each have a link, or links, to one site on each one? I don't think Google would be able to seek them out within the sea of pages in its billion-sized index.

Re:Exploit for Google? by YellowBook · 2000-09-13 07:10 · Score: 2

I thought someone already did this, posting enough links on /. that goatse.cx became the number one response for searches on "natalie portman grits"

--
The scalloped tatters of the King in Yellow must cover
Yhtill forever. (R. W. Chambers, the King in Yellow)

Re:Exploit for Google? by King+of+the+World · 2000-09-13 07:19 · Score: 1

Olde day pr0n sites had this exact thing. (many sites with different URLs linking to one)

--
--Giving to trolls for the benefit of us all
Re:Exploit for Google? by eudas · 2000-09-13 11:21 · Score: 1

no, the #1 search on www.google.com for 'natalie portman grits' is this:

http://netlosers.plebius.org/article/000330

(Natalie Portman Denounces Hot Grits)

hehe.

eudas

--
Blessed is he who expects the worst, for he shall not be disappointed.
Re:Exploit for Google? by hey · 2000-09-13 19:37 · Score: 1

This doesn't work because Google doesn't count all
links equally. It rates "good" links higher.
So a bunch of interlinked junk pages that
nobody else links to doesn't rate too highly.
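To make the "not all links count equally" point concrete, here is a minimal sketch of a PageRank-style iteration. It is not Google's actual code; the function name `rank_pages`, the damping value, and the toy page names are all assumptions made for illustration. Each page splits its own score among the pages it links to, so one link from a well-linked page is worth more than one link from a page nobody points at.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scoring (illustrative, not Google's code).

    links maps each page name to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split across its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Pages with no out-links spread their score evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "endorsed" and "unknown" each have exactly one
# inbound link, but only "endorsed" is linked from a well-linked page.
toy_web = {
    "fan1": ["popular"], "fan2": ["popular"], "fan3": ["popular"],
    "fan4": ["popular"], "fan5": ["popular"],
    "popular": ["endorsed"],
    "obscure": ["unknown"],
}
scores = rank_pages(toy_web)
print(scores["endorsed"] > scores["unknown"])
```

Both target pages have the same number of inbound links, yet the one endorsed by the popular page scores higher, which is the behaviour described above.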

Limits by um...+Lucas · 2000-09-13 07:01 · Score: 4

I think that we may have reached an "accuracy limit" with search engines until such time that people don't mind search engines leaving cookies on their hard drives, so they can examine a user's past queries and use those to try to present more relevant results for that user's current query. I really think that will be the only way for them to grow, because most search terms I've seen (basically, referrer logs for my site and a few other sites I've worked on) only consist of three or fewer words. It's a rarity that someone enters more than that, so that doesn't give a search engine much to work with...

However, if say google knew that I'd done searches for "albini" and "shellac" in the past, it could probably surmise that when i did a search for "big black", i'm actually looking for Steve Albini's first band, and not BIG BLACK BOOBS, et al...
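A rough sketch of that idea, assuming the engine keeps a list of a user's past query strings and has a set of descriptive terms for each candidate result. The function `rerank`, the scoring rule (one point per overlapping term), and the sample data are all hypothetical, not how any real engine does it.

```python
def rerank(results, history):
    """Re-order results using terms from the user's past queries.

    results: list of (title, base_score, terms) tuples, terms being a set.
    history: list of earlier query strings from the same user.
    """
    past_terms = {word.lower() for query in history for word in query.split()}

    def score(result):
        _title, base_score, terms = result
        # Assumed rule: +1 for every result term seen in an earlier query.
        overlap = len(past_terms & {t.lower() for t in terms})
        return base_score + overlap

    return sorted(results, key=score, reverse=True)


# Hypothetical candidates for the query "big black".
candidates = [
    ("Unrelated page", 1.5, {"adult"}),
    ("Big Black (band)", 1.0, {"albini", "band"}),
]
with_history = rerank(candidates, ["albini", "shellac"])
without_history = rerank(candidates, [])
print(with_history[0][0], "|", without_history[0][0])
```

With the history, the band page moves to the top; without it, the base scores decide.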

I can't figure how else something like that could be accomplished without a sacrifice of our hope for privacy...

Re:Limits by 2nd+Post! · 2000-09-13 07:20 · Score: 2

You don't need cookies or privacy violating initiatives to improve searches at all!

Imagine a different criterion:

If you do a search and within some time limit hit the 'next' link at Google, the search engine should knock a few criteria points off the first page of returned links, under the concept that those were not accurate enough matches.

If you do a new search with a variant/refinement of the original search, then knock off points from all the pages that were browsed as irrelevant.

If you use the 'related pages' then add points to the page that was found.

Every link that is followed should have added points, under the concept that they were good enough, on inspection, to be useful.

At no time does the server ever need to place a cookie to keep track of you, other than session ID or something like that!
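The scheme could be sketched roughly like this. The class name `FeedbackRanker`, the method names, and the point values (+1.0 for a followed link, -0.25 for a skipped page of results) are made up for illustration; adjustments are keyed by query and URL only, so nothing about the individual user is stored.

```python
from collections import defaultdict


class FeedbackRanker:
    """Adjusts per-query scores from anonymous click behaviour (a sketch)."""

    def __init__(self):
        # (query, url) -> accumulated adjustment; no user identity is kept.
        self.adjust = defaultdict(float)

    def clicked(self, query, url):
        # A followed link was good enough, on inspection, to be useful.
        self.adjust[(query, url)] += 1.0

    def skipped_page(self, query, urls):
        # The user hit 'next': knock a little off everything on that page.
        for url in urls:
            self.adjust[(query, url)] -= 0.25

    def rank(self, query, base_scores):
        # base_scores: url -> score from the engine's normal ranking.
        return sorted(
            base_scores,
            key=lambda url: base_scores[url] + self.adjust[(query, url)],
            reverse=True,
        )


ranker = FeedbackRanker()
base = {"http://a.example/": 1.0, "http://b.example/": 0.9}
ranker.skipped_page("big black", ["http://a.example/"])
ranker.skipped_page("big black", ["http://a.example/"])
ranker.clicked("big black", "http://b.example/")
order = ranker.rank("big black", base)
print(order)
```

The same hooks could cover the refinement and 'related pages' cases by calling them with different point values.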

The nick is a joke! Really!

--

GPL Deconstructed
Re:Limits by DrgnDancer · 2000-09-13 09:41 · Score: 1

What about expert agents located on the local machine? It would not improve "Search Engine" accuracy per se, but it would eliminate some of the privacy issues. The agent would gather the information about the user, but would not display the information to the internet at large. Instead, the user would fire up his agent, and the agent would fire up the search engine. The agent could then filter the search engine results locally (with sufficiently high speed access, it could go to each site and make evaluations, otherwise it could get special info blurbs on each site that matches the initial query.) Revenue stream could be a subscription model. I'd be willing to pay twenty bucks a month for a really superior search agent/search engine combo.

--
I don't need a million points of light, just two points of multi-mode fiber and a 10 Gig-E router.
Re:Limits by RedWizzard · 2000-09-13 10:00 · Score: 1

Using a cookie system is a reasonable idea, provided it can be shut off. But it's a far greater privacy invasion than anything else around at the moment because they are able to build a database of a far larger proportion of your online activity than they could with other methods (such as the web banner cookies).
Also, it's always going to be very, very difficult for a search engine to anticipate what you want. It's better to explicitly tell it. To use your example, it would be better for you to just search for "big black albini" than to try to have the search engine guess that's what you wanted based on previous searches for "albini" and a current search for "big black". Otherwise the search will be less effective when you want something new. Using your example again, if you type "big black" when you really mean "big black boobs" then you don't want the search engine to assume that your previous search on "albini" has any relevance.
People have been trying for years to get natural language interfaces and other ways of being imprecise with computers working and so far the results are generally mediocre. Why do we want to try to rehash the same ideas in the context of the search engine?
Re:Limits by RegularFry · 2000-09-13 17:14 · Score: 1

But you do need to track what each individual user is doing, because the meanings will be different for each person that does the search. Sure, you can improve relevancy rankings per click-through, but one person's relevant result is going to be another's dross.

--
Reality is the ultimate Rorschach.
Re:Limits by thenerd · 2000-09-13 18:48 · Score: 1

You could try Kenjin, from Autonomy. It has an extremely good press and is basically what you describe.

www.kenjin.com.

thenerd.

--
The camels are coming. I'm in love.
Re:Limits by MrNixon · 2000-09-13 18:52 · Score: 1

Expert Agents! I remember those... from like 1997! They were supposed to be the wave of the future, and there were tons in development; but since then, I have heard nothing at all (except an article in SciAm a couple months ago, in an article from MIT).
Anyone have info?
Re:Limits by Phibian · 2000-09-13 20:44 · Score: 1

"Why do we want to try to rehash the same ideas in the context of the search engine? " Because there is a demand for it. Reference: www.askjeeves.com.
Re:Limits by SquidBoy · 2000-09-13 21:10 · Score: 1

To use your example, it would be better for you to just search for "big black albini" than to try to have the search engine guess that's what you wanted based on previous searches for "albini" and a current search for "big black".

The problem is that searching for "big black albini" will only find sites on the band that also mention Steve Albini. This is a serious limitation of pretty much all search engines.
For another example, imagine you're trying to find out about New Wave band The Saints. Clearly, "Saints" is a totally useless term. You might get more with "Saints music", "Saints CD", etc, though you'll still get stuff relating to All Saints, "When The Saints Go Marching In", etc, etc. You could refine it with 'saints AND (music OR CD OR album OR punk OR "new wave" OR band OR group) AND (NOT "all")'; but that's still pretty horrible and won't catch everything.
One solution would be like what Northern Light claims to do: grouping the responses into categories; but this is a pretty hit-and-miss affair, and requires much more sophisticated categorisation. How feasible would it be to have a search engine try and order all its data into categories based not on search results, but on analysis of the entire database? It could build up a pretty detailed classification, especially when it comes to common sorts of search (TV show names, band names, hardware and software)
Another helpful idea would be that if you included the term "music" it searched for "CD", "album", "band", etc, automatically.
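As a minimal sketch of that automatic expansion, assuming the engine accepts AND/OR boolean syntax and has some table of related terms: the `RELATED` table and the `expand_query` function below are hypothetical.

```python
# Hypothetical table of related terms; a real engine would need a far
# larger one, probably built from its own index.
RELATED = {
    "music": ["cd", "album", "band"],
}


def expand_query(query, related=RELATED):
    """Turn a plain query into a boolean query with related terms OR-ed in."""
    parts = []
    for term in query.lower().split():
        alternatives = [term] + related.get(term, [])
        if len(alternatives) > 1:
            parts.append("(" + " OR ".join(alternatives) + ")")
        else:
            parts.append(term)
    return " AND ".join(parts)


print(expand_query("Saints music"))
# prints: saints AND (music OR cd OR album OR band)
```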
Of course, I'd also like to be able to do regexp searches on the entire WWW, but I don't know if this is realistic or even useful.

--
If you're a jock, inflict some pain / If you're a nerd then use your brain - DAPHNE AND CELESTE
Re:Limits by jawtheshark · 2000-09-13 21:55 · Score: 1

Well, you could imagine a search-engine with preferences just like slashdot has. Create a user, dictate your own preferences (and eventually allow it to store a certain amount of past queries). This should work fairly well and you give the info you want to give, so privacy is in your own hands.
Besides you could create a different "search-account" for your p0rn needs, your work-needs, your music-needs...name it :-)

--
Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
Re:Limits by lgas · 2000-09-14 04:04 · Score: 1

Our search engine at http://www.2wrongs.com/ does exactly what you suggest - we drop a cookie on the user's machine in order to track their searching habits and refine the results we provide based on the data accumulated in the user's profile.
Of course the search works even if you deny the cookie, it just doesn't work as well. If you are willing to upload your bookmarks as part of the registration process, we also analyze them and use them to reinforce the user profile we use to custom tailor the results to you.
-
John

P.S. in reference to the idea of using regexs to search the internet -- the databases are many orders of magnitude too large for that, but it might be interesting to develop a mechanism to apply a regex to a specific page or set of pages in a result set once it's been narrowed down.
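A sketch of what that second stage might look like, assuming the engine has already narrowed things down to a small set of pages whose text is in memory. The function `regex_filter` and the sample pages are hypothetical.

```python
import re


def regex_filter(pages, pattern):
    """Keep only the result pages whose text matches the regex.

    pages: dict mapping URL -> page text for an already narrowed result set.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    return [url for url, text in pages.items() if rx.search(text)]


# Hypothetical narrowed result set for the query "saints".
result_set = {
    "http://a.example/": "The Saints: band history and album reviews",
    "http://b.example/": "All Saints Day is observed in November",
}
matches = regex_filter(result_set, r"\bthe saints\b.*\balbum\b")
print(matches)
```

Running the regex over a few dozen pages is cheap; the expensive part, as noted above, would be running it over the whole index.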

New AltaVista is now a squirrel tech Google by spudboy · 2000-09-13 07:01 · Score: 1

Tweaking meta tags to ensure high placement in
search results is dead, dead, dead. AltaVista is doing a rank-by-number-of-links thing, just like
Google. ("Raging Search" and "AltaVista" are the same
information, just different layouts.)

--
-- Real free software sites don't use GIFs.

Re:New AltaVista is now a squirrel tech Google by Xandu · 2000-09-13 07:05 · Score: 5

The google "algorithm" is explained on the Why Use page on Google. Although it doesn't give the *exact* code used, it explains (in english) the whole process pretty well.

--

--Xandu

I like inaccuracy in search engines by un_eternal · 2000-09-13 07:02 · Score: 1

I've found plenty of useful, informative, and yes, trivial (but still interesting) sites because of the inaccuracy of search engines.

--
Ahh, A nice legally binding electronic signature...

goooooooogle by nutty · 2000-09-13 07:02 · Score: 1

Straight from the link given in this google article posted just recently.

"Last November, as reported in Google likes directory sites, I discovered that Google had the uncanny ability to sniff out high-quality, but little-known directory sites. As I discussed in that article, Google was able to do this because it ranks sites according to how many people make links to them, and smart people everywhere learn that directories are important, so they make many links to them."

Then one can wonder about that actual article (Google Propping Up Yahoo In Search Results?), which is not quite as good news. But it still rocks. Plain and simple.

Google is the only search engine I use. Well except when I'm lazy and I get me a WebBITCH (webhelp) . . . hehe.

/nutt

I vote for obscurity... by Kierthos · 2000-09-13 07:02 · Score: 1

It's already hard enough to get to the web-page you're looking for without having a bunch of porn site operators or script kiddies skewing the results by embedding background text. If you know the code for the search engine, then it makes it that much easier for them to do so.

The last thing I want is to always, 100% of the time get the 'wrong' web-pages no matter what criteria I use to search for them. I get enough of that as it is.

Fortunately, the search engines keep evolving to try and take care of this. Unfortunately, it also seems to be easier and easier to divert search engines because of holes in the evolution of the browsers or new options allowed in the code or script for web-pages. Hopefully, the search engines will win this fight.

Kierthos

--
Mr. Hu is not a ninja.

Re:I vote for obscurity... by dgris · 2000-09-13 09:57 · Score: 1

It's already hard enough to get to the web-page you're looking for without having a bunch of porn site operators or script kiddies skewing the results by embedding background text. If you know the code for the search engine, then it makes it that much easier for them to do so.

The point that everybody seems to be missing is that the search engines shouldn't have to rely on implementation tricks. The relevancy algorithms should be adaptive (to catch drift in word usage) and personalized (to account for differences in word usage between individuals). The best way to make the engine unexploitable is to not compute the same relevancy ranking for each user.
I expect that eventually the big search engines are going to have to provide client side software to do much of the necessary processing. The privacy issues are too dear to the hearts of most people for them to be comfortable giving away enough data to make the engines as relevant as we want them to be.
daniel

--
All I needed to know in life I learned from /usr/man.

Re:Unexploitable? .... -1 flamebait by Anonymous Coward · 2000-09-13 07:03 · Score: 1

IF I EVER MEET SOMEONE WHO HAS RELEASED A GOOD WAY TO EXPLOIT THE RANKINGS ON GOOGLE, I WILL KICK THEIR ASS.

Didn't we debate this already today? by Tairan · 2000-09-13 07:04 · Score: 1

After all, if the system was changed, and everyone knew what was going on, then how would Google.com go about making millions by raising certain listings?

Note: This is my lame attempt to be funny.. To answer the question, how about making a search engine that is free, and NOT advertising based? Maybe if the bandwidth was free..

--
/. is a commercial entity. goto slashdot.com

Re:Didn't we debate this already today? by skoda · 2000-09-13 11:59 · Score: 1

"how about making a search engine that is free, and NOT advertising based"

My impression is that Google is keeping its search site free, with little to no ads, and making the real money by licensing its engine to companies for their site-specific searches.

In effect, google's website is an ad & a loss-leader to get the big boys to pay for that magic on their own corporate site.
-----
D. Fischer

--
ShoutingMan.com

Ranked by referring pages by duckworth · 2000-09-13 07:04 · Score: 2

I like the concept of ranking pages by the number of pages referring to them. That seems to be the best method of keeping it somewhat difficult to cheat. The only problem is that newer material will take longer to reach the top, regardless of the relevance. It would be nice to have a search engine that could record people's input regarding the results that were found. How many times have you searched for something only to find the best result buried five pages down? A method to go back and vote on the ranking results yourself would seem appealing to me.

Re:Ranked by referring pages by Restil · 2000-09-13 08:29 · Score: 5

That is only part of the way it works.

Sites are grouped into categories known as Authorities and hubs. A hub points to lots of different pages (yahoo for instance). An authority has lots of different places pointing to it.

Where the ranking comes into play is dependent on how good the hub or authority is. A hub is good (better than others) if it points to a number of good authorities. Likewise, an authority is good if it is linked to by a number of good hubs. Yes, this is a recursive process, and yes, it takes a number of passes to get the ranking to level out.

If a pr0n site wanted to exploit google to get a higher ranking, they would first need to create a LOT of dummy sites to link to it, and all those dummy sites would need to be found by google's robot.

However, just having a large number of dummy sites linking to the pr0n site is not sufficient. Those dummy sites would also have to link to a large number of other GOOD authority sites (on pr0n or whatever).
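The recursive hub/authority scoring described above can be sketched as a simple repeated pass. This is only a toy illustration of the idea, not Google's implementation; the function name `hub_authority_scores`, the iteration count, and the page names are assumptions.

```python
import math


def hub_authority_scores(links, iterations=30):
    """Iterative hub/authority scoring (a sketch of the recursive idea).

    links maps each page to the list of pages it links to.
    Returns (hub_scores, authority_scores).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # An authority is as good as the hubs that point to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for t in targets:
                auth[t] += hub[source]
        # A hub is as good as the authorities it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise so the scores level out instead of growing forever.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


good_hubs = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1", "a2"]}

# Case 1: a dummy site that links only to the site being promoted.
_, auth_isolated = hub_authority_scores({**good_hubs, "dummy": ["promoted"]})

# Case 2: the dummy site also links to the good authorities.
_, auth_mixed = hub_authority_scores(
    {**good_hubs, "dummy": ["promoted", "a1", "a2"]}
)
print(auth_isolated["promoted"] < auth_mixed["promoted"])
```

In this toy web, the isolated dummy link leaves the promoted site with an authority score that shrinks toward zero, while a dummy hub that also links to good authorities gives it a real (though still smaller) score.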

Now, throw another wrench into the works. Google doesn't search only on keywords ON the site itself, but on the sites that refer to it, and the other way around. That's why if you search for "more evil than satan himself" you end up with microsoft as a prominent result, even though the words evil and satan probably don't appear anywhere on microsoft's website (although maybe they should).

This way, if you were searching for pages about a certain topic, but the pages themselves don't actually use the words you're looking for, you will still find that page as long as there are good hubs out there that refer to that page and use your search terms in close proximity to the links.
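A small sketch of indexing a page by the words other sites use when linking to it. The functions `build_anchor_index` and `search`, and the sample links, are hypothetical; a real engine would presumably combine this with on-page text and link-based ranking.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Index target pages by the words in the links that point at them.

    links: iterable of (source_url, anchor_text, target_url) tuples.
    Returns word -> {target_url: number of links using that word}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for _source, anchor_text, target in links:
        for word in anchor_text.lower().split():
            index[word][target] += 1
    return index


def search(index, query):
    """Return targets described by every query word, best-described first."""
    words = query.lower().split()
    if not words:
        return []
    hits = [index.get(word, {}) for word in words]
    candidates = set(hits[0])
    for hit in hits[1:]:
        candidates &= set(hit)
    # Score is the total count of matching link words; ties break by URL.
    return sorted(
        candidates,
        key=lambda url: (-sum(hit[url] for hit in hits), url),
    )


sample_links = [
    ("http://a.example/", "best search engine", "http://engine.example/"),
    ("http://b.example/", "fast search engine", "http://engine.example/"),
    ("http://c.example/", "search tips", "http://tips.example/"),
]
anchor_index = build_anchor_index(sample_links)
found = search(anchor_index, "search engine")
print(found)
```

The target page is found even if it never uses the words "search engine" itself, because the pages linking to it do.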

Now, if a hub points to a large number of authorities on a specific topic, words relevant to that topic will then become viable search terms to find the hub when searching, as the hub would also be a good source of information, even if it doesn't list the specific search terms. All of this affects the "ranking".

So, for a dummy hub to get a high ranking, it would need to point to a large number of high ranking authority pr0n sites (which would be counterproductive when what you're trying to do is advertise your own site). This would raise the hub rating for certain terms (specific to pr0n sites), and therefore raise the bar on the site you're trying to promote.

Of course, trying to get a pr0n site to come up on a search for "teen" or even "sex" is not easy because while a pr0n site is generally fly by night, there are many legitimate sites which have been around for several years and have built themselves into the web structure well and therefore get categorized correctly.

-Restil

--
Play with my webcams and lights here
Re:Ranked by referring pages by pod · 2000-09-13 09:05 · Score: 1

Of course 'more evil than satan himself' does not point to ms.com anymore. Instead it brings up all the pages talking about this particular search result :(

--
"Hot lesbian witches! It's fucking genius!"
Re:Ranked by referring pages by Bazzargh · 2000-09-13 15:51 · Score: 3

Excellent description. I can only top that by providing links which go over the research underlying this stuff.
The classic algorithm of this type is called HITS, by J. Kleinberg.
IBM's 'Clever' is an enhancement to 'HITS'.
Part of the success of these is that they can be mapped on to well known matrix solving problems...there's enough information in the documents above for you to work out how to write one.
One wrinkle Restil doesn't mention is that the technique is not purely based around link structure. You _seed_ the process with content-ranked pages (hoping the process 'crawls' to the best set independently of the seed), and subsequently you may select the most relevant 'communities' of pages by content ranking. So if you are already in the top 100, say, you may be able to content-mangle yourself up the list, but you need good linkages to get in first!
A further criterion used is response time (I strongly suspect Google use this, I got hooked on it when I found that its sites _responded_ rather than hanging as most AltaVista sites did at the time). Again there are publications on this stuff: the shark search algorithm is a spider with this feature.
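For anyone who wants to see the shape of the hub/authority idea, here is a minimal sketch in Python of the mutual-reinforcement iteration that HITS is built on. It is a toy, not Kleinberg's or IBM's actual code: it skips the seeding and content-ranking steps described above, and the function names and the little link graph are invented for illustration.

```python
import math


def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        new_hub = {
            page: sum(new_auth[dst] for dst in links.get(page, ()))
            for page in pages
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Two hubs that agree on the same authorities, plus one dummy hub
# that points only at the site it is trying to promote.
graph = {
    "hub1": ["a", "b"],
    "hub2": ["a", "b"],
    "dummy": ["spam"],
}
hub, auth = hits(graph)
print(sorted(auth, key=auth.get, reverse=True))
```

In this toy graph the isolated dummy hub does its target no good, which is Restil's point: to count as a hub it would have to link to the sites that the good hubs already agree on.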
Re:Ranked by referring pages by KingOfCartoons · 2000-09-13 22:51 · Score: 1

The policy at findwhat.com is pretty clear from the following page:
http://secure.findwhat.com/signup/signup.asp
Whoever pays the most shows up first on the list!

A matter of time by jjr · 2000-09-13 07:06 · Score: 3

With enough experimenting someone can find out how the system works. Either through keywords,page text,bribing .. etc whatever. People will find out how it works. Just a matter of time

Re:A matter of time by dgris · 2000-09-13 09:53 · Score: 1

With enough experimenting someone can find out how the system works. Either through keywords,page text,bribing .. etc whatever. People will find out how it works. Just a matter of time.

No need to waste tons of time experimenting; Google is documented well enough in this paper (presented by the founders of the company at the 7th WWW conference) that someone could implement a lookalike system. Of course, since most of the technology described in the paper is patented, actually implementing a system would be illegal.
daniel

--
All I needed to know in life I learned from /usr/man.

What about a moderated search engine ? by MarcoAtWork · 2000-09-13 07:08 · Score: 5

Well, while there are user-submitted lists of sites (yahoo,whatever) I think it's just about time for a moderated search engine.

The users could submit links in different subjects or categories with different keywords adding them to the harvested ones and, most important, registered users would be able to get x moderator points a week and vote down spam links or links that don't make much sense with the search one conducted.

Add a healthy dose of meta-moderation (maybe three levels) and some obvious anti-cheat techniques and it should work much better than a normal search engine.

God knows many times even on google obviously poisonous sites come up in the search, it would be so nice to have a button to click to moderate down the page or the domain itself...

--
-- the cake is a lie

Re:What about a moderated search engine ? by mrmag00 · 2000-09-13 08:26 · Score: 2

By golly, they even did this! they called it the 'open directory project'

such a remarkable idea - USERS moderate how relevant a site is to a category, not a computer.

You can even download the data and make your own directory from it - www.dmoz.org
Re:What about a moderated search engine ? by tswinzig · 2000-09-13 08:49 · Score: 2

God knows many times even on google obviously poisonous sites come up in the search, it would be so nice to have a button to click to moderate down the page or the domain itself...

Yeah, and god also knows that system would never be exploited. Jeez, it'd be worse than it is now!

Shit, do you think Microsoft matches would EVER show up after the first few weeks of use?

Need we go on...

-thomas

"Extraordinary claims require extraordinary evidence."

--

"And like that ... he's gone."
Re:What about a moderated search engine ? by Steeltoe · 2000-09-13 13:20 · Score: 1

No, "moderator gods" moderate on dmoz.org, not users, and not via XX points per category from each user's IP.

Don't misunderstand me, dmoz.org is great, but it lacks the automation of a search engine.

- Steeltoe

--
http://www.debunkingskeptics.com/
Re:What about a moderated search engine ? by MarcoAtWork · 2000-09-13 20:42 · Score: 1

That's why I was thinking of having a two or even three level meta-moderation scheme, and having only registered people be able to vote (if one were to be draconian one could also make it that you have to register with your provider's email address, rather than with yahoo, hotmail etc.)

I know that the potential for abuse is there, but hopefully with some tuning I think it could be made to work.

--
-- the cake is a lie
Re:What about a moderated search engine ? by beth_linker · 2000-09-13 21:40 · Score: 1

I'd expect a moderated search engine to run into the same problem that web filtering encounters - one person's trash is another person's useful information. You could easily have half a dozen users who search on the same keyword to find different information.

For example, if I search on the keyword "hair" to find information about hairstyling and I moderate down the links relating to the musical "Hair," that's going to make it harder for the person who wants to know about the original cast album.

Also, a moderating system could be intentionally or unintentionally tilted by users with agendas. For example, someone could register a large number of users and moderate down sites which contained opinions or information which they opposed.

I think the biggest problem with search engines is actually in the user interfaces. The web is so huge that it's often hard to find anything by just entering a word or two. Search interfaces should make it easy for users to provide effective search criteria. Also, users need to learn how to construct effective searches. Coming up with the keywords that produce the documents you want without also producing 10 pages of crap is not intuitive for most people.
Re:What about a moderated search engine ? by CaseyB · 2000-09-13 23:26 · Score: 3

Moderation is a MUCH more difficult thing to apply to search engines than it might initially appear.
You don't want the search to return the "best" sites! That's not the point. You want the search to return the most appropriate sites.
If I search for "redhat reviews", I'll get both "redhat.com" and "joe's linux distro reviews" on GeoCities.
Now, redhat is obviously going to have been historically rated higher than Joe's site, because it gets more traffic and is probably on the whole a much more useful and informative site. More people will have been happy with redhat.com as a search result than they have been with Joe's.
So should redhat.com be at the top of the list, and Joe's site at the bottom? No! Because if I happen to be looking for an objective review, I don't care what redhat has to say -- I want to know what Joe thinks about the relative merits of redhat and debian. Redhat.com is NOT an appropriate site for a "redhat reviews" search even though it matches the terms and is a highly ranked site.
So search results must be a function of both the site and the search terms, and moderation has to be based on this. This is a very nasty can of worms, because interpreting what the user wanted when he typed in the search is subjective. Doing a simple moderation on the intersection of the search terms and the desired result probably isn't feasible either, since freeform searches aren't discrete enough, there are too many possible ways to phrase the search for distro reviews. Determining what he wants and returning the best results based on previous moderation amounts to full blown natural language parsing and artificial intelligence.

It's all about the benjamins... by L-Train8 · 2000-09-13 07:08 · Score: 1

Would the search engine source code show how much money sites have paid to be listed in the top 5 results?
It would be hard to cheat at that criteria.

--

Don't forget that Friday is Hawaiian shirt day.

Closed source has it's usage... by NetDrain · 2000-09-13 07:08 · Score: 1

Though the search engines' refusal to give out how they rank a site may disagree with the whole "open source" movement, it does serve a purpose. We can guarantee that every person out there would do their best to exploit the system, and we'd all end up getting porn sites pop up when we searched for "vacuum tubes" (that's actually happened to me).

--
The world can't end today because it's tomorrow in Australia

A question within a question by m0nkeyb0y · 2000-09-13 07:09 · Score: 2

Let's presuppose that their ranking criteria are reasonably accurate. If you search for "girl's soccor" (sans the quotes, since most people don't use them, or know the option is available to them), or something pertaining to female gender in a non-sexual sense, why are most of the top results pornography related material? Is it just because there are so many adult oriented sites, or is it because they have some sort of technique that allows search spiders to place them in all sorts of unrelated categories, say using meta information?

--
-- From my Best Friend (Written to me over ICQ): "i was gonna go to a party...but i had to reinstall windows"

Re:A question within a question by dbthomas · 2000-09-13 07:50 · Score: 1

I think if you searched for "Girls soccor" not much would come up. If you searched "Girls soccer" however....

--
"These are the days that must happen to you." -Walt Whitman
Re:A question within a question by m0nkeyb0y · 2000-09-13 12:37 · Score: 1

Yeh, I can't spell. Thanks for the correction.

--
-- From my Best Friend (Written to me over ICQ): "i was gonna go to a party...but i had to reinstall windows"
Re:A question within a question by Compuser · 2000-09-13 12:38 · Score: 1

I dunno. Just ran your search (girls soccer)
thru av.com. Top 50 are porn link free.
Even search for girls soccor has clean top 10
links.
Similarly, google gives clean top 10 for both
searches.

It will never be unexploitable. by The.Tempest · 2000-09-13 07:11 · Score: 1

I believe that keeping the standards closed is best in this case. Although opening the standards and polling the public about them will help create better standards, it just means that website X will know exactly what to do to spoof the results, no matter how good the criteria are.

Like Seti@Home, whose developers chose to keep the source closed to prevent spoofing results. I'm all for open source of pretty much everything, but sometimes there's something that is better off hidden.

--
-The Tempest

Rotating Criteria by MWoody · 2000-09-13 07:11 · Score: 2

Here's an idea: rotate the criteria between several open-source methods of sorting. Design each to give a low score to attempts at exploiting the other methods, while still maintaining an adequate order for people doing searches.

So, at best, a company could be really popular on occasion, but suck the rest of the time.
---

Odd search results. by chotlhpah · 2000-09-13 07:11 · Score: 1

I once did a search, and it contained an acronym that doesn't really occur anywhere else, I was just looking for the url, and I did a search with it on altavista, and it was on about the third or fourth page! They don't have any good search engines really.

Re:Odd search results. by MonkeyHanger · 2000-09-13 07:23 · Score: 1

Not really fair on Altavista; its search engine isn't really any good if you have the name of a company and want its URL. For this just use Yahoo or other directory based engines. Altavista is more designed for keyword searches and can't really be blamed that the site didn't appear in the first few pages.

Google: The Criteria Aren't Exploitable by Saint+Aardvark · 2000-09-13 07:12 · Score: 2

Perfect example, wouldn't you say? IIRC, Google rates their sites based a good deal on how many other sites link to them. That is going to be non-trivial to hack, and almost certainly a reliable indicator of how valuable people find the resource in question.

Sure, you could pay people to link to your site -- it's done all the time. But only pr0n sites and the Big Portals are gonna have enough coin to make that a factor in a generalized site. If your site is at all focussed or specialized in nature, you're not likely to have the funds or time to pay people to stay linked to your site.

And that brings me to another point: with sophisticated enough keyword ranking algorithms, it'll become more of a pain in the ass to come up with spam that makes it through the filters than to simply put up a good site in the first place. 600 repetitions of "sex" are easy to pick out. And if HotGrammar 2.0 can pick through my dangling participles in a reasonable amount of time, then it can't be too much more difficult to point it at a website and say "Does the content match the keywords?"

At that point I think you really would have something reasonably close to a non-hackable search engine: The Google Algorithm to pick out the sites that people have blessed with links, and The Grammar Nazi Algorithm to make sure there's content to match.

--
Carousel is a lie!

Obscurity? Not here... by Millennium · 2000-09-13 07:12 · Score: 2

If your algorithm is good, it can't be exploited. Google's algorithm, which is quite well-known, seems to work quite well. For that matter, dmoz.org isn't doing badly.

What I'd suggest, though, is a compromise. The basic popularity rating is based off of number of links, like with Google. However, people would be able to "rate" the effectiveness of a given site when it comes up in the list (nothing fancy, just "relevant," "neutral," or "irrelevant"; maybe a five-point scale rather than three-point). No individual vote could do very much to the system, but as votes add up trends start to show, and relevance ratings can be modified based on this. This would require a user registration system to keep track of moderations (though definitely not searches), but so long as registration isn't required to actually use the search engine itself I fail to see the problem in it.
----------

Re:Obscurity? Not here... by SkunkPussy · 2000-09-14 05:45 · Score: 1

nothing fancy, just "relevant," "neutral," or "irrelevant"; maybe a five-point scale rather than three-point

except you're only interested in the positive side of the scale, e.g. Really relevant, relevant, quite relevant, related, or unhelpful.

--
SURELY NOT!!!!!

It seems like Google is kinda exploit-proof... by Dirtside · 2000-09-13 07:12 · Score: 2

I mean, your site is ranked based on how many OTHER sites link to it. The only way to exploit this would be to get other people to link to your page... which is the whole idea in the first place!

The biggest exploit I could think of for this, would be for a big company like Amazon to pay a lot of people to link to their site (oh wait, they already do). Thus they can use money to skew the results, without having to corrupt Google directly. Hmm.

--
"Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased

Re:It seems like Google is kinda exploit-proof... by Aciel · 2000-09-13 10:13 · Score: 1

Google may be exploit-proof, but as nerds we must all acknowledge that it's like the student council; the person selected is not the most responsible or the most worthy in most cases, but rather the most popular. Google works in just such a way, listing the most popular pages rather than the best pages. Now for websites in general this would be good, but for individual webpages on specific subjects that aren't extremely sought-after...well...give up, Seymour.

Aciel
aciel@speakeasy.net
Re:It seems like Google is kinda exploit-proof... by timboy3 · 2000-09-13 14:56 · Score: 1

Disclaimer: I'm an engineer at one of the major search engines.
Dirtside said:
I mean, your site is ranked based on how many OTHER sites link to it. The only way to exploit this would be to get other people to link to your page... which is the whole idea in the first place!
Um, nope. This might be true if each person or company were allowed only one domain name...
A lot of effort and expertise goes into search-engine spamming, and I can assure you that the spammers are hip to link-indexing and try to exploit it.

Re: PS by King+of+the+World · 2000-09-13 07:13 · Score: 1

There was some software I read about that updated X many websites with randomly structured sentences - occasionally linking to one of several urls. It was done in the days of pr0n sites having dozens of sites with near mirrored content.

I guess this would defeat the page-rank (page vote) idea that google uses. Although as no one's seen their algorithms I doubt they wouldn't have considered this.

--
--Giving to trolls for the benefit of us all

The quality of results is the fault of users & UI! by Tumbleweed · 2000-09-13 07:14 · Score: 5

Well, also the fact that a huge chunk of the web isn't even indexed at all.

Other than that, though, the interfaces that most search engines use are pretty bad. There is usually no way to filter through a set of results to eliminate things that are obviously not what the searcher wants. Just being able to eliminate a set of domains from the initial results would make a huge difference for me.

Also, most people have no clue how to effectively use search engines - and they're not all that interested in doing so. I've been working in the web industry for quite a long time, and most of my colleagues seem to have no idea that changing the settings can yield better results. The setting 'phrase' for instance, makes a HUGE difference much of the time - yet I've never seen a colleague change any default settings when doing a web search. If you're not willing to do so much as even toggle an individual setting, you deserve the crappy results you get.

Oh, another thing - many of the links I get back are of dubious quality - even on the setting 'phrase', many results don't come back that match what I specified. If you play by the rules and the results STILL don't match, I have little faith in ANY results, even if the web site operators are trying to override accuracy. This is aside from the very common result of '404 not found' pages.

Right now, the best search engine I know of is a meta search engine called 'ProFusion' - I've had much better luck with it than with Google. Not enough control over Google...I also like that the results with ProFusion ( http://www.profusion.com ) come back with an option next to each result to open in a new browser window - now THAT's a nice idea!

Another idea - 'demote' button by MWoody · 2000-09-13 07:15 · Score: 2

How about a 'vote' icon? Not for each and every click, as no one will do that, but rather a 'demote' button next to a link. You can keep on organizing results by the number of people who click on the link, but if someone clicks through and finds something other than what they're looking for they'll be heading back to the search anyway, so they can give a 'demote' button a quick click as they continue down the list. If you made such a system immediately detract 100 clicks from that site, misleading links would soon be phased to the bottom.

After all, more people like to complain about a bad link than to promote a good one. Let human nature work in our favor, for once.
---

Re:Another idea - 'demote' button by 2nd+Post! · 2000-09-13 07:56 · Score: 1

Rather than an additional step, use prior behavior to collect such information; see my
previous posts on the idea.

The nick is a joke! Really!

--

GPL Deconstructed
Re:Another idea - 'demote' button by Steve+Smithies · 2000-09-13 09:24 · Score: 1

What if an unscrupulous company then decides to knock their competitors out of the search results? All they would have to do is repeatedly do a search, then demote their competitors out of the results.
Re:Another idea - 'demote' button by eudas · 2000-09-13 11:49 · Score: 1

unless, as was mentioned earlier, the results of searching were customized per user; results stored in a cookie or something similar. then you could tweak the search engine for you personally, and some unscrupulous company attempting to manipulate the system in that manner would be wasting their time.

eudas

--
Blessed is he who expects the worst, for he shall not be disappointed.

Banner ads can teach us a lot by Gerad · 2000-09-13 07:15 · Score: 2

Two points:

Since from when I can first remember seeing banner ads, I can also remember seeing "please click here to support this page" right below them. People often end up clicking ads in order to 'support' a site and generate money, rather than being interested in the content of the ads. If people can exploit something for money, they will.

Secondly, banner ads are what give people incentive to cheat on search engines. If they can get more hits per month, that directly translates to more clicks (or impressions) per month, and more $$$. In today's society with the commercialization of the internet and 'dot-commies', I would bet money that the resulting information would definitely be used to make somebody a quick buck.

That being said, I would wager that the final result would end up being more reliable search engines, but the potential problems may or may not be worth it.

--
Be the Ultimate Ninja! Play Billy Vs. SNAKEMAN today!

Open up the criteria! by 2nd+Post! · 2000-09-13 07:15 · Score: 5

Some really good points by previous posters that I want to recap:

If you open up the criteria such that *everyone* exploits the criteria, then there is no discrimination. When the criteria is closed, only those who have found the exploits can get increased exposure, making it inherently unfair.

Another issue is that what a search engine wants you to see is different than what you want the search engine to give you, in some cases.

We want the union of two criteria; the results that give the search engine the most use/reuse(usefulness of the search) and the results that give the search engine the most financial recompense(so that the search engine can grow, get better, get faster, etc)

They may not be correlated, but they are both very important. The most useful pages may not give them the most money, and the pages that pay them the most may not generate enough repeat use for them either.

Perhaps the best search algorithm is two step:

Rank according to links (the more links to a page, the more useful the page)
Count repeat use (the more times a search has to be refined, the less useful the pages returned)

Rank according to links already occurs at Altavista and Google.

I don't know that anyone does the second.

Say you do a search on Google; if you hit the next button, then the pages that were generated get knocked a few points. If you hit Google again a few minutes later with a variant search, then knock a few points to *all* the pages that got listed in the previous search. If a user goes back, and hits 'related' pages, increase the points to that page, and all the related pages. Repeat the above algorithm for every hit to Google.

The nick is a joke! Really!

--

GPL Deconstructed

Re:Open up the criteria! by jmv · 2000-09-13 08:06 · Score: 5

If you open up the criteria such that *everyone* exploits the criteria, then there is no discrimination. When the criteria is closed, only those who have found the exploits can get increased exposure, making it inherently unfair.

You seem to forget that the idea of search engine result scoring/ranking is not about being fair to all sites, it's about returning the best result possible.

If you open the criteria, the sites that make money from ads will all use them (the result is going to be "fair" between those sites), but the problem is that the not-for-profit websites (which are much more common) won't change their page just to get more hits (they don't care). The result is that, though it ends up being fair to all the commercial sites, as a user, you're less likely to find what you're looking for... which is the point of using a search engine.

If you just want to be fair, have the search engine return a random URL. Now *that* would be fair!

--
Opus: the Swiss army knife of audio codec
Re:Open up the criteria! by jmv · 2000-09-13 09:36 · Score: 2

This was *not* meant to be funny. If it really is, then I guess I deserve a "-1 Funny". Or maybe it's just another case of a moderator on crack...

--
Opus: the Swiss army knife of audio codec
Re:Open up the criteria! by Pinball+Wizard · 2000-09-13 10:07 · Score: 2

If you just want to be fair, have the search engine return a random URL. Now *that* would be fair!
There *already* is a random search. Uroulette.
I've actually found some interesting sites with this.

--
No, Thursday's out. How about never - is never good for you?
Re:Open up the criteria! by TheNightOwl · 2000-09-13 23:09 · Score: 1

Say you do a search on Google; if you hit the next button, then the pages that were generated get knocked a few points. If you hit Google again a few minutes later with a variant search, then knock a few points to *all* the pages that got listed in the previous search. If a user goes back, and hits 'related' pages, increase the points to that page, and all the related pages. Repeat the above algorithm for every hit to Google.
This is a great idea. What you are suggesting is a form of user moderation that doesn't require any user effort. To build on that idea, perhaps every click-out could increase the relevance ranking (maybe with some way to guess at the time spent on a suggested page).

blah blah blah by dR.fuZZo · 2000-09-13 07:15 · Score: 2

Having the many-eyeballs thing could improve ranking algorithms. However, the more people know about the algorithm, the more complex the algorithm is going to have to be to defeat cheaters. It's a catch-22. Eventually the algorithms would be so complex that they'd have to render the page to determine the relevance of different elements and parse out sentences to determine if they're gibberish or not.

So what we need is to work on developing an open, complex, and nigh-uncheatable algorithm while search engines continue to use their own closed methods.

--
-- dR.fuZZo

Re:Google: The Criteria Aren't Exploitable by Junks+Jerzey · 2000-09-13 07:16 · Score: 2

Perfect example, wouldn't you say? IIRC, Google rates their sites based a good deal on how many other sites link to them. That is going to be non-trivial to hack

Nah, easy. You ever look into registering a domain name? Notice how there are all those bulk registration specials, like "only $9.95 each if you register 250 or more"? That's because there are a good many web companies out there with dozens or hundreds of domains. In the case of porn, maybe thousands. It's pretty easy to cross link everything so you look more popular than you are.

Track IPs by MWoody · 2000-09-13 07:16 · Score: 1

Just thought of a variation on my other idea: how about tracking users, via their IP address in each session, and determining whether or not they jump back to the search page immediately after following a link?

If a user is still searching after a minute or two, he or she obviously didn't find what they were looking for.
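
A rough sketch in Python of what that bookkeeping might look like. Everything here is an assumption made for illustration: the event tuples, the 60-second threshold and the example.* URLs are invented, and an IP address is only a crude stand-in for a user (proxies and shared machines will blur it).

```python
from collections import Counter


def quick_bounces(events, threshold=60.0):
    """Count, per URL, how often a visitor was back at the search page
    within `threshold` seconds of clicking through to that URL.

    events: iterable of (ip, timestamp_seconds, kind, url) in time order,
    where kind is "click" (followed a result) or "search" (loaded the
    search page again; url is ignored for these).
    """
    last_click = {}  # ip -> (timestamp, url) of the most recent click
    bounces = Counter()
    for ip, ts, kind, url in events:
        if kind == "click":
            last_click[ip] = (ts, url)
        elif kind == "search" and ip in last_click:
            clicked_at, clicked_url = last_click.pop(ip)
            if ts - clicked_at < threshold:
                bounces[clicked_url] += 1
    return bounces


log = [
    ("10.0.0.1", 0, "click", "spam.example"),
    ("10.0.0.1", 10, "search", None),   # back after 10 seconds
    ("10.0.0.1", 20, "click", "good.example"),
    ("10.0.0.1", 400, "search", None),  # back after more than six minutes
]
print(quick_bounces(log))
```

The counts could then be used to push frequently bounced links down the list for that search.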
---

Re:Track IPs by Dwonis · 2000-09-13 08:01 · Score: 1

Don't assume anything about how people conduct their searches. I do the following:

* Visit sites unrelated to the search, if I find it interesting
* Middle-click sites I want to see, so the search page is never reloaded
--------
"I already have all the latest software."

I Asked Jeeves . . . by dgale · 2000-09-13 07:17 · Score: 2

. . . "have we reached an 'accuracy limit' as far as keyword-based searching is concerned?" and he didn't really have a conclusive answer, but was able to suggest places where I could buy foundation and how to apply it.

There is a way to vote! by 2nd+Post! · 2000-09-13 07:22 · Score: 1

If you visit the link, add more relevance points to it.

If you hit 'next', decrease the relevance points for all the pages returned to the user.

If you hit 'similar pages' then increase the relevance points even higher for this hit!

If the user refines the search, reduce the rp for all pages that were previously generated.

Of course, I don't see any search engines using this criteria, yet!

The nick is a joke! Really!

--

GPL Deconstructed

Re:There is a way to vote! by eudas · 2000-09-13 11:25 · Score: 1

yeah, great. takes karma whoring and poor moderation to a whole new level... now instead of it being limited to just slashdot, we can have the entire web that way...

blah.

eudas

--
Blessed is he who expects the worst, for he shall not be disappointed.

Probably an open source issue really by tbray · 2000-09-13 07:24 · Score: 1

In 1994-96, I ran one of the first-gen search engines (the Open Text Index, r.i.p.), and made my living in a search & retrieval software company for years.

If you look inside the source code for any of the engines, you'll discover that the result ranking is an unholy stew of heuristics layered on linguistics layered on guesses. Among other things, the world isn't in English and there are lots of language-specific techniques. Furthermore, there are people who fine-tune this all the time. Furthermore, the code is shot through with special-case handling, and all sorts of boring tedious stuff to stave off word-spam and <meta>-spam and litigious organizations and so on.

There's no doubt that Google took a serious step forward when they started working the input-link count into the result ratings in a serious way. Works for me, anyhow.

I guess the upshot is that the search engine's source code is the only meaningful specification of how the rankings work. Which probably means that the info won't be in the public domain until they start going Open Source. Which would probably be a good idea, but their management and investors might not see it that way.

-Tim

Why such a complicated system? by 2nd+Post! · 2000-09-13 07:26 · Score: 2

Why not...:

Increase relevance points if a link is followed
Decrease relevance points if a link is ignored (next is hit, instead)
Decrease relevance points if a new search is defined (none of the prior search results were sufficient)
Increase relevance points if 'similar pages' is followed

It should behave something like what you propose, without additional cookies, work, voting, or otherwise. Other than normal behavior!
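
The four rules above can be sketched in a few lines of Python. This is only an illustration, not anything a real engine is known to run: the class name, the event names and the point values are all invented, and keying the points by (query, url) rather than by url alone is an extra assumption.

```python
from collections import defaultdict

# Hypothetical point values; the post does not give any numbers.
FOLLOWED = 1.0
SIMILAR = 2.0
SKIPPED = -0.25   # user hit 'next' without being satisfied
REFINED = -0.5    # user gave up and defined a new search


class RelevanceTracker:
    def __init__(self):
        # Points are stored per (query, url) pair.
        self.points = defaultdict(float)

    def record(self, query, shown, event, url=None):
        """Apply one user action to the pages `shown` for `query`."""
        if event == "follow":
            self.points[(query, url)] += FOLLOWED
        elif event == "similar":
            self.points[(query, url)] += SIMILAR
        elif event in ("next", "refine"):
            delta = SKIPPED if event == "next" else REFINED
            for page in shown:
                self.points[(query, page)] += delta
        else:
            raise ValueError("unknown event: %r" % (event,))

    def rerank(self, query, urls):
        """Order urls by accumulated points; ties keep their original order."""
        return sorted(urls, key=lambda u: -self.points.get((query, u), 0.0))


tracker = RelevanceTracker()
page_one = ["a.example", "b.example", "c.example"]
tracker.record("redhat reviews", page_one, "follow", "b.example")
tracker.record("redhat reviews", page_one, "next")
print(tracker.rerank("redhat reviews", page_one))
```

Note that in this simple version hitting 'next' also knocks a link the user did follow; a fuller version would probably exempt it.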

The nick is a joke! Really!

--

GPL Deconstructed

Re:Google: The Criteria Aren't Exploitable by Saint+Aardvark · 2000-09-13 07:26 · Score: 1

Mm, good point. I might argue that pr0n can be set aside as a special case, given that it's essentially indistinguishable for the most part. But commercial sites (as opposed to the hobby/reference sites I had in mind, where I think the law of diminishing returns really would come into play)...you have got a good point.

Any ideas on how to thwart that?

--
Carousel is a lie!

NEWSTORY: that slash forgot and is too lame by Anonymous Coward · 2000-09-13 07:29 · Score: 1

In a slap to the face for the record industry, popular alternative rock band "Smashing Pumpkins" has released their supposed last album on old school vinyl and MP3 format via Napster. There will be no CD release of the music through its label Virgin Records according to sources at CNET. Only 25 vinyls were made, and included in the records was a note that stated the ploy was to be a "final f--- you to a record label that didn't give (The Pumpkins) the support they deserved," according to Sonicnet. The new album, entitled "Machina II", can be found all over FTP and download networks.

So fuck yo all, slashdot wont report on this now that its broken here before it.

Intelligent? I think not.... by NullStream · 2000-09-13 07:30 · Score: 2

If you're trying to make a search engine that isn't easily exploitable, how about identifying how current engines are exploited and designing around that....

One way engines get exploited is by including the keyword many, many times in a comment tag. A possible solution is to grade keywords by their entropy relative to the words surrounding them, such as by testing for repeating patterns in a comment tag. Hell, use the HTML tags to help you out by not grading anything in HTML comments. Another way is to do some syntactic analysis on the content of tags, and if the words there are not unique then they are only counted once. Certainly people with bad grammar skills and languages other than the language intended will suffer, but you can add verification criteria for each language you want to index. If the trolls hunt me down and say this method is censorship (americans love to say anything is censorship), then they can just stick to the simple useless search engines we have now. Plus one can always implement this as a searching option to an existing broken wheel.

This isn't really that hard; it just requires a bit of cleverness and lots of preparatory work.
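
To make two of those ideas concrete, here is a minimal Python sketch: drop HTML comments before indexing, and let a run of the same word repeated back to back count only once. The function names and the `max_run` knob are my own assumptions, and the run check is a much cruder stand-in for the entropy test described above.

```python
import re
from collections import Counter

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)  # HTML comments
TAG_RE = re.compile(r"<[^>]+>")                    # any remaining tag
WORD_RE = re.compile(r"[a-z0-9']+")

def visible_words(html):
    """Return lowercase words with comments and tags stripped out."""
    text = COMMENT_RE.sub(" ", html)
    text = TAG_RE.sub(" ", text)
    return WORD_RE.findall(text.lower())

def capped_counts(words, max_run=1):
    """Count words, but a run of identical consecutive words counts at most max_run times."""
    counts = Counter()
    prev, run = None, 0
    for word in words:
        run = run + 1 if word == prev else 1
        prev = word
        if run <= max_run:
            counts[word] += 1
    return counts

page = "<p>mp3 mp3 mp3 mp3 player reviews</p><!-- mp3 mp3 mp3 free mp3 -->"
counts = capped_counts(visible_words(page))
print(counts)
```

The stuffed comment contributes nothing, and the four visible repeats of "mp3" are scored as one.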

Or maybe I'm just on glue. :P

--
"Survival of the fittest Max, and we've got the fucking gun!" - Pi

Re:Intelligent? I think not.... by eudas · 2000-09-13 11:56 · Score: 1

hey, i think that making people with poor grammar skills and spelling suffer is a great idea. it'd be 'the internet's spell-checker', and would force people to learn the language that they are attempting to communicate in. i for one think that would be a Good Thing(Tm). grammar nazi will back me up on this. :) (have you ever tried to read some internet sites, like some of slashdot's posters' posts or things like www.techcomedy.com? it's amazing that some people are qualified as literate.) (i just know this post will be ripped apart by people for spelling and grammar, heh.)

Flame On!

eudas

--
Blessed is he who expects the worst, for he shall not be disappointed.

An AI-complete problem? by drfireman · 2000-09-13 07:49 · Score: 1

It seems like the problem to be solved here is, in reduced form, telling the difference between pages that are actually more relevant to the search criteria than others and pages that are trying to fake being especially relevant. That sounds like an AI-complete problem (actually it sounds a whole lot like a web page Turing test). In that case obscurity might be the best model, since the best solutions will probably be exploitable. (Actually, even real intelligence is probably exploitable.)

Re:An AI-complete problem? by NullStream · 2000-09-13 07:56 · Score: 1

An interesting idea.

It would be great to have search engines which specialize in a specific topic, like medicine or economics or whatever topic they feel necessary. This way you can apply strict rules to everything that doesn't fit your topic criteria... of course this is very niche based, but the countless hours I've spent searching for explanations of algorithms only to get shareware sites is mind numbing. Try searching for "Berkley mbox lex grammer" .... and end up with nothing useful. (disclaimer: lex parses better than i do so let it do the work)

--
"Survival of the fittest Max, and we've got the fucking gun!" - Pi

Take a step back for more power by Space+Cow · 2000-09-13 07:51 · Score: 2

I was reading these posts and started rolling around ideas in my head for scripting a little meta-search engine that learns from your searches and is customizable, when I realized that such programs already exist (sort of). Search Rocket is one of them (I know there are more). It searches all the big search engines at once, lets you filter and sort results and save your work in a nice xml file for later. I use this tool when I am doing in depth research and need to come back to the links I found later (but don't want to shuffle bookmarks for an hour). Still, it would be even cooler if it did some of the smart auto-filter type stuff mentioned above.

Oh, and just to be completely on topic, if the big guys change their search results, Search Rocket needs a quick patch. So in a way, they could prevent tools like this from working by changing the results a lot, but I would think they want to keep it standard due to cross licensing etc etc

Ok, enough rambling. Time to go home.

The Bugaboo is Relevancy by Phrogman · 2000-09-13 08:03 · Score: 4

The biggest problem with search engines is relevancy. When I do a search for a word like "magic" the search engine will return results based upon its algorithm, but trying to produce relevancy from a single search word is just about impossible as a task. With a term like "magic" I could be looking for:

Magic as in Magic the Gathering - a collectible card game I used to play.
Magic as in the occult.
Magic as in sleight-of-hand.

Or any of a large number of subjects that I could have in mind at the time of my search. A search engine such as Google will rank pages higher if they contain the word magic in the page title, multiple times in the body of the page, in the META tags, in or near HREF links, or if they are linked to by many other sites, compared to pages which do not meet these criteria. The details differ from search engine to search engine.

None of these criteria for ranking take into account the nature of my query - what I had in mind when I did the search. In other words they do not directly address the relevancy of the results. If a search engine offered me the opportunity to pick from results it returned and gradually refine the search to produce better results it would be addressing this situation. Some do with a "search again in this result set" or "more like this" type option on their results pages, but it's still kinda mechanical, and not all that reliable.

I think it will take some sort of AI analysis of search requests based on user-feedback of some sort and with a learning capability to surpass the current crop of search engines. Until such time as we have some smart systems working behind the scenes on searching any improvements will no doubt be incremental rather than radical.

Now, as for keeping the specifics of how a page is ranked secret, I think it's absolutely necessary. There is a constant, quiet war going on between the search engines and the folks who want to get their websites listed at the top of the page when a result set is produced. The people who regularly submit their sites to the various search engines, with each search engine receiving a specially made page generated just for its benefit to ensure that the website gets the best ranking possible etc, are not interested in how accurate the search engine is, they simply want to come up first. The folks at the search engine generally want the most relevant pages to be returned. There is an essential difference of purpose between the two camps.

On the side of the search engines, they have control over their ranking system, and change it periodically to prevent abuse of the system. The folks who are seriously trying to get to the top of the heap in the search engine results are constantly trying new methods to get ahead.

For instance, at one point some webmasters were creating their webpages with a lot of text at the bottom of the page that was the same font color as the background, so that the search engines would spider the contents of the page but users would never see those contents. This let them list all sorts of words that scored higher in the search engines' returns, but had little or no relevancy to the page contents. The search engines got wise to this trick and now most will penalize you for using it.
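
For the curious, a check for that trick could look roughly like the Python sketch below. It only handles the simplest case (a `<font color>` that matches the `<body bgcolor>`), and the class and function names are assumptions of mine. I have no idea how the real engines detect it.

```python
from html.parser import HTMLParser

class HiddenTextFinder(HTMLParser):
    """Collect text inside <font color=X> where X equals the <body bgcolor>."""

    def __init__(self):
        super().__init__()
        self.bgcolor = None
        self.color_stack = []  # colors of the currently open <font> tags
        self.hidden = []       # text fragments judged invisible

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.bgcolor = (attrs.get("bgcolor") or "").lower() or None
        elif tag == "font":
            self.color_stack.append((attrs.get("color") or "").lower())

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if (text and self.bgcolor and self.color_stack
                and self.color_stack[-1] == self.bgcolor):
            self.hidden.append(text)

def find_hidden_text(html):
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden

page = ('<html><body bgcolor="#FFFFFF"><p>Welcome to my site.</p>'
        '<font color="#ffffff">mp3 free mp3 download</font></body></html>')
print(find_hidden_text(page))
```

A spammer can dodge this by using a color that is one shade off, which is exactly the kind of arms race described here.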

Opening up the search engines' ranking rules would only make the system easier to abuse more precisely. No matter how many eyeballs pore over the code, it will still not change the nature of the guy who will use any method at his disposal to get his porn page returned as Link #1 when you do a search for MP3 because it's the hottest term currently being searched for.

Google has altered this battle somewhat by ranking pages higher in their results based on how many other webpages contain links to that page (and also based upon the nature of the linking page. They use a distinction between pages which contain a lot of links - like a web directory such as my own Omphalos - and those which are linked to by a lot of other pages. Both get points for different reasons and in different instances. I don't remember the details), but even this is open to abuse, although with a bit more effort required. I know of a website which has over 200 different URLs registered and operational, all of which contain pages which point back to the main URL they are promoting. When a search engine such as Google goes to analyze this website, it will rank it higher because it is linked to by so many separate domains and so many separate pages on those domains. It's harder to abuse, but it can be done.
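
To show why the 200-domain trick pays off, here is a small Python sketch of link-based ranking. It is a simplified textbook-style PageRank iteration, not Google's actual code, and the toy link graphs and the damping value are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}, ranks sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # a page splits its rank evenly among the pages it links to
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # a page with no outgoing links spreads its rank over everyone
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# A tiny web: "a" and "b" link to each other, nobody links to "target".
honest = {"a": ["b"], "b": ["a"], "target": ["a"]}

# The same web plus 200 throwaway pages that all point at "target".
farm = dict(honest)
farm.update({f"farm{i}": ["target"] for i in range(200)})

honest_rank = pagerank(honest)
farm_rank = pagerank(farm)
print(honest_rank["target"], farm_rank["target"])
```

Even though the farm pages have almost no rank of their own, 200 of them together more than double the target's score in this toy graph.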

Of course, this is all basically irrelevant, since each of the search engine companies keeps their methodology and their source code highly protected. It is worth millions of dollars in revenue, and I cannot honestly see any of them deciding to release their software in this way.

If you have not noticed, practically every graduate student who devises a new and effective method of indexing and ranking search results ends up creating their own company once they have delivered their thesis and entered the real world. That is certainly how Google started, and I believe is also how Ask Jeeves got going. I am sure that most of the other main search engines have gotten going in the same or a similar manner.

All that said, if you want to play with a true search engine that is GPLed and works quite well, although not on the scale of a Google or an Altavista, try UDMSearch. It runs just fine under Linux or FreeBSD (I have installed it on both in the past) and I am using it on my site under Solaris. It is still in an intense development cycle and new versions are released regularly, but it's worth exploring if you are interested in how a search engine works, and want to get your hands dirty.

For more information on the big boys, check out Search Engine Watch, and finally, if you are simply interested in Space, Space Exploration or Space Science, check out SpaceRef.

--
"The first time I got drunk, I got married. The second time I bought a chimpanzee, after that I stayed sober" Arian Seid

Re:The Bugaboo is Relevancy by tswinzig · 2000-09-13 08:52 · Score: 3

The biggest problem with search engines is relevancy. When I do a search for a word like "magic" the search engine will return results based upon its algorithm, but trying to produce relevancy from a single search word is just about impossible as a task. With a term like "magic" I could be looking for:

Magic as in Magic the Gathering - a collectible card game I used to play.
Magic as in the occult.
Magic as in sleight-of-hand.

I know this will blow your mind, but no advanced AI is necessary.

Instead of typing "magic," you can add one or two more words to your query, and actually get the info you need! E.g. "Magic the Gathering."

Pretty neat, huh kids?

-thomas

"Extraordinary claims require extraordinary evidence."

--

"And like that ... he's gone."
Re:The Bugaboo is Relevancy by Field+Marshall+Stack · 2000-09-13 09:08 · Score: 1

The biggest problem with search engines is relevancy. When I do a search for a word like "magic" the search engine will return results based upon its algorithm, but trying to produce relevancy from a single search word is just about impossible as a task. With a term like "magic" I could be looking for:
Magic as in Magic the Gathering - a collectible card game I used to play.
Magic as in the occult.
Magic as in sleight-of-hand.

I know this will blow your mind, but no advanced AI is necessary. Instead of typing "magic," you can add one or two more words to your query, and actually get the info you need! E.g. "Magic the Gathering."
Pretty neat, huh kids?
-thomas

Yeah, really... and say you're looking for magic as in sleight-of-hand, you could search for `magic sleight -gathering -occult'! Whoa!
Really, this sort of thing isn't difficult, doesn't take up that much time, isn't the slightest bit hard to figure out, and it's much more reliable than any AI.
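
For anyone wondering how little machinery this takes, here is a minimal Python sketch of the include/exclude matching. The function names are my own, and a real engine would match against an index rather than raw page text.

```python
def parse_query(query):
    """Split a query into (include, exclude) word lists; '-word' means exclude."""
    include, exclude = [], []
    for term in query.lower().split():
        if term.startswith("-"):
            bucket, word = exclude, term[1:]
        else:
            bucket, word = include, term
        if word:
            bucket.append(word)
    return include, exclude

def matches(text, include, exclude):
    """True if text has every included word and none of the excluded ones."""
    words = set(text.lower().split())
    return (all(t in words for t in include)
            and not any(t in words for t in exclude))

include, exclude = parse_query("magic sleight -gathering -occult")
print(matches("Learn sleight of hand magic tricks", include, exclude))
print(matches("Magic the Gathering sleight deck", include, exclude))
```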

--
"HORSE."
-Flaming Carrot
Re:The Bugaboo is Relevancy by phantomlord · 2000-09-13 09:28 · Score: 1

If a search engine offered me the opportunity to pick from results it returned and gradually refine the search to produce better results it would be addressing this situation. Some do with a "search again in this result set" or "more like this" type option on their results pages, but it's still kinda mechanical, and not all that reliable.
Going back around 3 or 4 years, Altavista used to have a system to do just this( side note: I still have it bookmarked as www.altavista.digital.com though I rarely use it anymore... ). You could do your primary query and then it had a "refine" button if it returned more than some arbitrary number of results. When you hit the button, it would bring up a page which had related words on it and how many pages contained each word. You could selectively exclude or include certain words to refine your search. The functionality seemed to disappear when "portal" became a big buzzword but it was nice while it lasted.
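
A guess at how a "refine" page like that could be built, as a Python sketch: count, for every word that is not already in the query, how many of the returned pages contain it. The function name and the toy result texts are assumptions. I don't know how Altavista actually did it.

```python
from collections import Counter

def refine_terms(results, query_terms, top_n=5):
    """Return the top_n (word, page_count) pairs for words outside the query."""
    query = {t.lower() for t in query_terms}
    doc_freq = Counter()
    for text in results:
        # use a set so each page counts a word at most once
        doc_freq.update(set(text.lower().split()) - query)
    return doc_freq.most_common(top_n)

results = [
    "magic the gathering card game",
    "magic card tricks",
    "occult magic rituals",
]
print(refine_terms(results, ["magic"], top_n=1))
```

The user then ticks words to require or exclude, and the query is rerun with those terms added.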

--
Don't leave your mind so open that your brain falls out. Don't close it so much that you cut off the blood.
Re:The Bugaboo is Relevancy by Phrogman · 2000-09-13 11:03 · Score: 2

Maybe the example was bad, but my point was simply that a simple text search cannot show the intention of the user. I am quite aware that you can use boolean searches, or specify additional terms to get better relevancy. I work on three websites at the moment and all of them run search services.
Specifying additional terms does indeed narrow down the results, but I am sure if you think about it you can come up with two or three text strings that might occur together but in different contexts.

--
"The first time I got drunk, I got married. The second time I bought a chimpanzee, after that I stayed sober" Arian Seid
Re:The Bugaboo is Relevancy by mojotoad · 2000-09-13 12:18 · Score: 1

Excellent point, and also one that you would think that Google could be adapted to exploit.

The point in question -- multiple meanings of intent for words and phrases. You use the example of the word "Magic" (which a truly intelligent search engine would search for phonetically, and include such things as "Magick", but I digress).

Anyway, as others have pointed out, Google operates in terms of Authorities and Hubs. There should exist clusters of these for each meaning of the word -- with some overlap, of course, but don't tell me computers can't grok Zen diagrams.

Google should offer subsearches based on these clusters. I don't think the problem is in finding the clusters -- the problem is in properly phrasing what these clusters are actually about...something that a computer has a lot of trouble doing. Google should shy away from the Jeeves approach...it has the clusters in hand, it should offer a few contextual phrases around the tidbits of phrasing that caught its attention on the Authorities in question, each of these phrase bundles representing a cluster of "relevant density".

How about it, Google? You guys are clever, I know you've thought about it. Is it simply how to phrase the things?

Mojotoad
Re:The Bugaboo is Relevancy by mojotoad · 2000-09-13 12:21 · Score: 1

And an even more intelligent search engine would realize that I meant Venn diagram rather than Zen.

But both seem appropriate. Pick your meaning and savour.

Mojotoad
Re:The Bugaboo is Relevancy by DZign · 2000-09-13 16:14 · Score: 1

-small commercial break- The company I work for (www.DMPartners.be) also has a search engine. This one is based on linguistics and is much better than other engines when words have more than one meaning. However there must be some customisation done for the customer, but we have some base sets (e.g. for financial institutions, pharmaceutical companies, ..) so you only have to add specific terms for the company. A previous version even asked you what meaning you were looking for (it's now taken out because most clients found it annoying and with the customisation we know what they usually want, but it can always be added again). Looking for hits in foreign languages is also much more efficient and returns more hits as you are looking for correct concepts, and not just checking on some badly translated keywords.

--
Learn about pinball machines on www.flippers.be

Re:The quality of results is the fault of users & by khym · 2000-09-13 08:04 · Score: 1

Other than that, though, the interfaces that most search engines use are pretty bad. There is usually no way to filter through a set of results to eliminate things that are obviously not what the searcher wants. Just being able to eliminate a set of domains from the initial results would make a huge difference for me.

AltaVista has a pretty good set of primitives in its advanced search, including matching against title, the whole URL, the hostname, what sites it links to, and the text in anchors; you can also use "*" to say "search for anything beginning with this". So you could do a search like:

title:slashdot* AND NOT (host:portman OR host:hot-grits OR host:penis-bird)

I find that being able to search by title helps enormously, and being able to use "*" saves me from having to search on variations of the same term/prefix.

Suppose you were an idiot. And suppose that you were a member of Congress. But I repeat myself.

--
Give a man a fire, and he'll be warm for a day, but set him on fire, and he'll be warm for the rest of his life.

Re:<META `/usr/dict/words` by willis · 2000-09-13 08:05 · Score: 2

usually -fuck or -cum works for me. I mean, I may be reading something about "the male sex" or whatever, but I'm not usually reading something about the word fuck.

Also, -"credit card" and -"member"
usually filter out a lot of shit.

willis/

--

there is no thing
what else could you want?

Re:Unexploitable? .... -1 flamebait by g_mcbay · 2000-09-13 08:07 · Score: 2

Yahoo released a method for exploiting the rankings on Google. It is known as the "strategic alliance exploit"

Release their assets? by Mr.+McGibby · 2000-09-13 08:07 · Score: 2

It's important to note that most search engines are just like every other internet business: they thrive on the number of hits they get, since most of them depend heavily on advertising dollars. How does a search engine get hits? By improving the results that queries produce. Almost all search engines are so bad anyway that when I find a better one, I'm almost instantly converted. (Hence I use google all the time)

The sorting mechanism used by the search engines is their way of creating a great search engine! It's relatively easy (given enough resources) to catalog web pages. The big problem is figuring out which one of those web pages somebody is looking for.

Asking a search engine to release its sorting mechanism is like asking Big Bob to release his secret barbeque sauce recipe in the name of improving barbeque sauce around the world.

--
Mad Software: Rantings on Developing So

What about Web Position software? by antdude · 2000-09-13 08:18 · Score: 2

How does this Windows (I know, it is Windows!) software relate to this story question? Any opinions? :)

--
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).

Search result usefulness declines over time. by kd5biv · 2000-09-13 08:22 · Score: 1

I don't think opening the search criteria will help much, nor will keeping them closed. Either some people cheat, or everybody cheats, or the Web grinds to a halt under all the wasted bandwidth from people trying.

<META NAME="keyword"> tags were a good deal for a while, but that got corrupted pretty quickly. Notice how long some of the high ranked pages take to load? View source on one of those pages and see how many screenfuls of keywords appear in META tags. There are numerous other ways to crowbar the index, this is only one of the easiest -- if the search engine indexes comments for keywords as well, you can cram the same enormous list of keywords into comment lines. Neither is visible to the end user, so all they see is a long delay.

Sorry, this has been a sore spot of mine for a LONG time, and relates as much to obnoxious unsolicited email as it does to pages and pages of results totally irrelevant to my search. (Google seems to filter out a lot of this c**p out pretty well .. so far ..) Why do we have to put up with it? Is it just that people who don't care about anything but ramming their advertisements down as many throats as possible just happen to be good at figuring out how to force us to see them everywhere we go, or is it that the whole process of evolution happens fast enough that any decent search engine has a useful half life of about a year before the indexing algorithm is totally compromised?

OK, OK, done ranting, and I know I could probably force feed my own page to as many eyes as the big admongers do, but the point is I *don't*. Maybe it's because I have ethics, I don't know, but it just p***es me off when so many people don't seem to know any better. Maybe they think they're actually boosting their sales. I don't know. All I can say is they're not getting *my* business ..

--

73 de N5VB (ex-KD5BIV) AR SK

About returning the best results possible... by 2nd+Post! · 2000-09-13 08:23 · Score: 2

If the users could 'rank' sites as well, then the criteria of 'getting the best results' would be fulfilled pretty quickly, regardless of search-sorting.

If all the users get are Coca-cola sites, no matter how they search, and these are not relevant sites, then they will get devalued to the point of non-existence.

So any gross abuse gets moderated by the user population.

Thus any useless site, no matter how hard they try, will not be able to maintain a high rank.

The nick is a joke! Really!

--

GPL Deconstructed

Re:About returning the best results possible... by Samrobb · 2000-09-13 09:19 · Score: 2

User moderation is an extremely difficult problem. Look at /. - they have a fairly good moderation population, and yet, the system still gets abused. All in all, the /. moderation system isn't anything to write home to Mom about, but it works after a fashion.

One of the reasons it does is that while moderators may have an axe to grind (pro-Linux, anti-MS, whatever) - at least their particular stance is somewhere in the same ballpark as the majority of the readers. When you open up the moderation to the world at large, though, and anyone can moderate... sorry, that just doesn't work. You end up with too many people, with too many goals, and too many directions they want to pull a ranking in for it to work out.

Most moderation/rating schemes, even if they don't state so, assume that the moderator/rater is going to at least make an attempt to rate something honestly. The sophisticated ones try to account for the possibility of intentionally or unintentionally bad moderation or ratings. These types of systems succeed when the moderating population is somewhat cohesive, and at least shares the same fundamental outlook. The really sophisticated ones even try to deal with the l337 hax0rs trying to skew everything to show sheep pr0n just because they can... but none of them can deal with the idea that a web page or article or whatever can belong to an arbitrary number of groups simultaneously.

I know of (and worked for) one company that did try this, unsuccessfully; if you have a system this sophisticated, it's far too easy to throw a monkey wrench into the ratings and turn everything into dreck. I'm not saying that such a system couldn't be made to work; but I am saying that it just isn't worth the effort. By the time you had it working, you'd have developed a general purpose expert system; and if you could do that, you sure as hell wouldn't be wasting your time using it to power a search engine.

--
"Great men are not always wise: neither do the aged understand judgement." Job 32:9
Re:About returning the best results possible... by Erik+Hollensbe · 2000-09-13 10:39 · Score: 1

There are an average (I browse at 1) of 200 comments or so per article.

There are.. well the webpage count has a lot of digits in it. Google just recently reached 1 billion.

You also have to consider the fact that most people are searching for what they want to look for, and if the site provided the information they wanted, they're more apt to vote for it.

I always thought it'd be a cool idea to have a system that tracks clicks through a cookie with a short expiry, and rates the system that way. Even if you DON'T find what you wanted, you're more likely to click on what LOOKS like it might have the information rather than some old mailing list article with shitty sysadmins who don't know how to use robots.txt. (personally, if I wanted to search a mailing list i'd find the site for the list, not 400 mails in a search engine query - slashdot included)

This way, the users vote through their choices, and no extra effort is needed. For advertising (NOT MARKETING) purposes, this would also be useful. (ie, how many hits to search query "blah foo", not tracking users - hence short expiration cookies)

Erik

New MegaSearch Search Engine by TWX_the_Linux_Zealot · 2000-09-13 08:29 · Score: 2

Come visit the new MegaSearch©®(tm) Engine, the only engine to find exactly what you want, every time, GUARANTEED!

Features Include:

Drops All Non-relevant results! Tired of getting results for Debian and RedHat when you only care about Slackware? We filter based on the OS you care about, so you don't have to!
Doesn't track where you go at all! We don't see what links you try, we have a magical formula to determine our demographics, without cookies, without logging GET and POST data, and without cheating with autoredirects!
downloads instantly to your PC! We don't waste time with any graphics, for we know all you want is the search data. We don't send you tons of graphics for the advertisements we don't have!
Auto Mind-Meld Technology! Our process is based on that which Paramount Studios developed for Mr. Spock on Star Trek(tm), enabling us to automatically know exactly what you want without needing keyboard input whatsoever! You just think of what you want, and you have it! No fuss, no muss!

Never coming from a startup .com near you!

--

IBM had PL/1, with syntax worse than JOSS,
And everywhere the language went, it was a total loss...

Search quality is hard by dca · 2000-09-13 08:29 · Score: 1

Searching the web for the "right" page is an intrinsically hard problem, made tougher by users typing one-word queries and deception by website authors. Information Retrieval is an active area of research for a good reason. We all know computers are REALLY stupid, yet somehow we expect them to "do the right thing" when we ask for any concept under the sun. Counting links to a page is a useful heuristic, but it says nothing about the actual information content of the page itself. In fact, if you're looking for a really oddball datum, it may hurt. A lot.

A problem with the opening the rules by Dacta · 2000-09-13 08:47 · Score: 2

Pages I'm looking for are often not designed to exploit search engines anyway. If I'm looking for some technical documentation, it is often just stuck together by someone in a rush, or on a mailing list archive. These pages often don't have optimised META tags, etc, so some engines don't index them well.

Many people are suggesting that making the rules open is good because then everyone will exploit them equally. Perhaps, but that will just make it more difficult to find pages that don't try and get a highly rated search result.

On the other hand, I find Google works pretty well, and how it works is pretty well known.

Yep, by firstpostfirstpost · 2000-09-13 08:47 · Score: 1

I got the first post!

--
---------------- This post is the first post-first post post.

Security by obscurity? by Felinoid · 2000-09-13 08:52 · Score: 3

This isn't security as much as it is in the same argument base...
The arguments against "Security by obscurity" apply here.. so just insert those arguments [here] and I'll move on...

It works not by prevention so much as "reduced body count" and I guess that's the best a search engine can hope for.

When someone thwarts security that's it.. you're dead...
When someone tricks a search to give them top results it's just a few websites.. it CAN be overlooked.

So say... 1 person hacks AltaVista.. it's down... blah.. 100 persons hack AltaVista.. it's still down... 1 cracker vs 1,000 crackers... makes very little difference... it only takes one defect and one joker to ruin your day...

But with searches... a defect becomes known and you don't fix it in time... 1,000 jokers and you're screwed...
1 joker however isn't a problem.... you're still online and USUALLY you still give good results... just one bad result...
You get bad results by random chance and user mistakes... so big deal...

But you're expecting that the joker.. once he's discovered this little trick... won't make it public....

Right now this doesn't happen...
But it's a lot to risk...

Recommendation.... since obscurity is effective... but not perfect... give away the OLD system...
Provide a license that basically says "Any changes may be used by us at any time without notice... but only we may do this... all else is open source"

--
I don't actually exist.

Many of the listings are paid by threemile · 2000-09-13 08:52 · Score: 2

A lot of the "preferred" listings on search engines are paid for - that is, the company at the end of a link is paying a certain amount per click through. GoTo, and I think Yahoo and About, provide preferred search results to many sites such as Netscape search and AOL. So essentially, to get yourself to the top, pay more $$$. Now this is not necessarily a horrific practice if you compare it to, say, how the yellow pages work. If you are looking for a plumber, chances are you are going to be drawn to the biggest ad on the page. And how does the plumber afford the biggest ad? - He must have a lot of business - which may imply that he has lots of happy customers (ok I know, I prefer the mom and pop businesses, but we are talking about the masses here). So I suppose relevance is determined by who can afford the most for advertising, and who can afford the most advertising is doing the most business, and they are doing the most business because they are obviously the best at what they do - ahhh, capitalism ladies and gentlemen...

Re:The quality of results is the fault of users & by sillysally · 2000-09-13 08:57 · Score: 1

you don't need Altavista advanced search, which IMHO has too much syntax: plain old av.com has some pretty good operators all on its own:

+ insist on a word or phrase or meta term
- reject a word, phrase, or meta term

where the meta terms are

host:
url:
image:
link:
title:
...and many more

So, to find Signal 11 trolling on Slashdot and not on Kuro5hin:

+Signal_11 +troll* +host:slashdot -host:kuro5hin -timothy -Jon_Katz

where the _ will be treated as a non-breaking space and the * will get troll(s)(ed)(ing) as a wildcarded stem. Oh yeah, I've found the -timothy and -Jon_Katz to be useful for increasing relevance :)

Now that search might look complicated but I was trying to illustrate a lot. Just get in the habit of doing all your searches with +plus +prefixes , then add in -terms and +/- host: to clean up the results. easy.
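
If you want to see what those operators boil down to, here is a toy Python sketch of matching a document against such a query. The document layout (a dict with "host" and "text" keys), the function names, and the plain substring matching are all assumptions of mine, not how av.com really evaluates queries.

```python
def parse(query):
    """Split a query into (required, rejected) lists of (field, pattern) pairs."""
    required, rejected = [], []
    for token in query.split():
        bucket = rejected if token.startswith("-") else required
        # "_" stands for a space inside a phrase, as described above
        token = token.lstrip("+-").replace("_", " ").lower()
        field, sep, pattern = token.partition(":")
        if not sep:
            field, pattern = "text", token
        bucket.append((field, pattern))
    return required, rejected

def term_matches(doc, field, pattern):
    value = doc.get(field, "").lower()
    if pattern.endswith("*"):
        stem = pattern[:-1]
        return any(word.startswith(stem) for word in value.split())
    return pattern in value

def doc_matches(doc, query):
    required, rejected = parse(query)
    return (all(term_matches(doc, f, p) for f, p in required)
            and not any(term_matches(doc, f, p) for f, p in rejected))

query = "+Signal_11 +troll* +host:slashdot -host:kuro5hin -timothy -Jon_Katz"
doc1 = {"host": "slashdot.org", "text": "Signal 11 is trolling again"}
doc2 = {"host": "kuro5hin.org", "text": "Signal 11 trolls here too"}
doc3 = {"host": "slashdot.org", "text": "Signal 11 trolled by Jon Katz"}
print([doc_matches(d, query) for d in (doc1, doc2, doc3)])
```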

Accuracy limit by Pinball+Wizard · 2000-09-13 09:00 · Score: 2

Can current systems be improved to give better results or have we reached an 'accuracy limit' as far as keyword-based searching is concerned?

We have not reached the accuracy limit. A search engine should be able to read my mind and infer the best sites to go to.

--

No, Thursday's out. How about never - is never good for you?

Real reason? Money by geekoid · 2000-09-13 09:09 · Score: 1

They don't want people figuring out how many "hits" are a result of companies buying spots and not actually part of the search criteria.
A solid set of search criteria that is resilient to abuse can easily be implemented.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

Re:The quality of results is the fault of users & by Jester99 · 2000-09-13 09:10 · Score: 1

had you been using netscape in linux, you'd just hafta center-click to get a new window ;)

release away by onShore_Jake · 2000-09-13 09:15 · Score: 1

I do not think that releasing source means for sure that the search engine could be tricked.
For instance you all know exactly how junkbuster works but NOT my settings, so you can't trick me into loading your banner. So they could release (or have leaked) the code and not necessarily lose the ability to control things. Just have some "settings" like (yes, it's too simple here) if ($times_this_word_appears > $how_many_i_allow ) { blah... }
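The same "public code, private settings" idea as a minimal Python sketch. The function name, the SECRET_SETTINGS dict and the limit of 3 are all invented for illustration; a real engine would have far more knobs than one word count.

```python
def keyword_stuffing_penalty(text, word, max_allowed):
    """Return how many occurrences of `word` exceed the allowed limit."""
    count = text.lower().split().count(word.lower())
    return max(0, count - max_allowed)

# The function above could be published; only this value would stay private.
SECRET_SETTINGS = {"max_allowed": 3}
```

Someone reading the source learns that repeated words are penalized, but not where the line is drawn.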

Why do people cheat so much? by iabervon · 2000-09-13 09:21 · Score: 1

It seems like everyone these days is trying to get their site to appear on all searches. This mystifies me, because if the page isn't titled something that looks like what I'm looking for, I'm not going to look at the page, and I'm probably going to be annoyed when a site keeps coming up on searches for other things, and I wouldn't go there even if I *was* interested.

It makes sense to try to get your site to appear first on a search that actually fits your site. But if the only sites fighting for the top spot are informative pages on the topic I'm actually interested in, I don't think that's really a problem, and it's not really cheating.

It may be that I have a tendency to look for a particular web page (e.g., the official screen home page), and I don't want other pages on the topic (e.g., the GNU page about screen, which is non-gnu), but cheating never gets my eyes and certainly never gets me to explore a site, and definitely won't get me to advertisers' sites.

It would probably be helpful if search engines would give instructions on how to make your site come up on searches that are looking for it and not on searches which aren't. Of course, having the search engine be better at determining what the site is about without special help and making it easier to tell the engine what you're looking for would be even more helpful.

(incidentally, I've found an official screen ftp site, but no web site; it may very well not have one)

When a search engine hits a 404 error... by antdude · 2000-09-13 09:42 · Score: 2

Aren't search engines supposed to remove the bad link when a visitor clicks on a broken link? Let's say I was using Google, and it gave me a link that I want to click on. I click on it and it gives me a 404 error page/missing page message. I think search engines should remove that particular link when I find this.

Would that help a lot with these broken links? What do you think? Is this possible? :)

--
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).

Re:When a search engine hits a 404 error... by Amit+J.+Patel · 2000-09-13 09:59 · Score: 1

The search engine doesn't know that YOUR browser hit a 404. It's not visiting the page -- you are. It could find out if you let them install some sort of spy software on your system that lets them track you, but normally, the search engine doesn't know what you clicked on and whether you got a 404.

My Gripes with Search engines..... by oblisk · 2000-09-13 09:48 · Score: 1

The biggest gripe I have with search engines is the lack of an option to narrow, or search within, the results you got from your first search.

For example, when searching for something, I would like to be able to knock out any of the bulletin board 'hits' I get, as they're always too messed up. My other option would be to get search engines to put the BBS responses somewhere else where they are nested and threaded, so I can go to the base post instead of digging through 15+ RE: to find the original post.

Also, drop out anything with reference to a certain product, i.e. if I'm searching for something on 3D graphics algorithms I would like to drop out all references to 3DFX or Nvidia. I know by doing this I might let something slip through, however it will usually filter out a lot of crap.

These things could probably be implemented quite easily with a pull-down menu. The other big gripe I have is with Linux/Unix stuff; this is especially true on FTP searches. I wish there was a way to ignore any Linux/Unix/BSD mirrors while searching for something; usually it's impossible to search for something general on FTP search engines because of this.

Re:Unexploitable? .... -1 flamebait by Emugamer · 2000-09-13 09:48 · Score: 1

so he puts it in past tense and gets modded up go fig

Why not find out? by graniteMonkey · 2000-09-13 09:51 · Score: 1

Rather than wondering if search services are going to "Open Source" their search engines, why not start your own? If you're curious about whether many eyes are better, or you're out to prove the ultimate power of Open Source, then go ahead and do it!

Personally, I'm busy right now, but hey, there's gotta be enough people around to try it, right?

--

This is a manual virus. Copy it to your sig and help me spread!

Google by superlame · 2000-09-13 10:03 · Score: 1

Perhaps I'm wrong, but doesn't google already essentially tell us their criteria for finding the best page (each page is weighted by the sum of the weights of the pages that point to it)? I mean, they used to have thesis papers on their site explaining the details of how it works.
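A toy Python sketch of that "weight flows in from the pages that link to you" idea. This is a simplified textbook-style iteration, not Google's code: link_rank, the damping value of 0.85 and the fixed 50 iterations are assumptions made for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page gets a small baseline, the rest arrives via links
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # a page with no outgoing links spreads its weight evenly
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank
```

With links = {"a": ["c"], "b": ["c"], "c": ["a"]}, page c ends up on top because two pages point at it, and b ends up last because nothing points at it.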

--
-- Superlame http://catpro.dragonfire.net/joshua/

''Adversary'' view and randomness by Tom7 · 2000-09-13 10:03 · Score: 2

In algorithm design, we often think about the input to our algorithm as being designed by an "adversary" -- a really smart entity which knows our algorithm and aims to defeat it. This is good, because it gives us robust algorithms and data structures which perform well in the worst case.

The solution often involves randomness. If search engines built randomness into their weighting criteria, an adversary with the algorithm would still not be able to influence the random (or "pseudo-random") aspect of the weighting.

Other things beyond the control of the adversary are usable too (some people have mentioned excellent ideas such as external links, voting, etc.).
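One way to read the "pseudo-random aspect" is a keyed jitter on each score. The Python sketch below is only an interpretation of the post: jittered_score, the secret_key parameter and the 10% spread are assumptions, not anything a real engine is known to do.

```python
import hashlib
import hmac

def jittered_score(base_score, url, secret_key, spread=0.1):
    """Scale base_score by a factor in [1 - spread, 1 + spread).

    The factor comes from an HMAC of the URL, so it is stable for a given
    key but unpredictable to someone who only knows the algorithm.
    """
    digest = hmac.new(secret_key, url.encode("utf-8"), hashlib.sha256).digest()
    unit = int.from_bytes(digest[:8], "big") / 2**64   # value in [0, 1)
    return base_score * (1.0 - spread + 2.0 * spread * unit)
```

The adversary can read this code and still not know which way any particular page gets nudged, because that depends on the key.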

404 not found by Tom7 · 2000-09-13 10:05 · Score: 1

I still believe that search engines should check for 404s when they return results (and the site hasn't been previously checked in, say, a week) or perhaps right after. Isn't it in their best interest to keep the database clean?
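A minimal Python sketch of the "recheck if we haven't looked in a week" rule. The function names and the last_checked dict are hypothetical, and the fetch parameter is there so the policy can be tried without touching the network. Note that it does not remember dead links between rechecks, which a real index would want to do.

```python
import time
import urllib.error
import urllib.request

WEEK = 7 * 24 * 3600  # recheck interval from the post, in seconds

def http_status(url, timeout=5):
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:          # covers URLError and socket timeouts
        return None

def prune_dead_links(results, last_checked, now=None, fetch=http_status):
    """Drop results that return 404, checking each URL at most once a week."""
    now = time.time() if now is None else now
    alive = []
    for url in results:
        checked_at = last_checked.get(url)
        if checked_at is None or now - checked_at >= WEEK:
            last_checked[url] = now
            if fetch(url) == 404:
                continue
        alive.append(url)
    return alive
```

The short timeout matters: a host that no longer exists costs a full timeout per check, which is exactly the slow-service worry with doing this at query time.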

Re:404 not found by Erik+Hollensbe · 2000-09-13 10:25 · Score: 1

Not only would this be a huge bandwidth hog for the provider, it could also lead to slow service due to a ton of open TCP connections (i.e., the host simply doesn't exist anymore, so you wait for the timeout).

Erik

in theory... by inciteful · 2000-09-13 10:25 · Score: 1

It seems to me, releasing the criteria would, in the long term, result in a more stable system. Not to mention preventing artificial boosting of pages from a certain portal... In the short term, however, there would be chaos. If the technology didn't settle down quickly enough the search engine as a concept might be replaced with something new. On the other hand, with the incredible power of the open-source community, things could get under control pretty quick.

The politics of search engines by Jim+Madison · 2000-09-13 10:36 · Score: 2

These researchers at Princeton have written a cool report on the politics of search engines (unfortunately just an abstract, although I've read the entire article).

Even "good" search methods have embedded social values. For example, Google's backlinking methodology tends to reinforce traditional power structures since heavily commercial sites tend to link to each other a lot.

Search engines are in the business of controlling what you become aware of. There are lots of things that become interesting just because lots of other people are also aware of it (e.g., Survivor, Big Brother, etc.).

Search Engines don't really try to maximize relevancy; they try to be relevant enough so that you don't leave. That's why Yahoo uses google search results as a placeholder, but that's just to create more space to promote its own stuff.

Proprietary SE techniques are a bad thing(tm) from the perspective that they obscure the embedded social values in their design.

Yes, this is lots of random thoughts but I think this is an important topic.

--
Hey democracy lovers, add Quorum as a c

Profusion == No Linux by winterstorm · 2000-09-13 10:57 · Score: 2

You can't view Profusion with the Linux version of Netscape (V4.74). Well, you can VIEW it but you can't do anything after that because it messes up the browser (Linux Netscape users are used to this on sites with amateurish HTML). I'll be intrigued when they take pride in their work and hire authors that can write browser-nonspecific HTML. For now I'll just yawn and use the ever more annoying google.

I like using Oingo. Honestly the hits it turns up are not always great and it is tricky to use but it is not a "stupid" search engine. It attempts some level of ontologizing (ontologization?)... and it's about time someone tried that. There are many times when I can't get a decent hit via one of the old fashioned search engines like google or AV but Oingo will find plenty, mostly for searches that require more than just "keywords"... where I'm searching for a related concept or idea and not just a "token".

Re:Profusion == No Linux by rlk · 2000-09-13 18:34 · Score: 2

I just tried Profusion with Netscape 4.75 and had no trouble. Perhaps the combination of turning off Javascript and running a fairly strict filter helped, but there were no obvious problems at all.
Re:Profusion == No Linux by Keel · 2000-09-14 00:56 · Score: 1

Actually, the HTML on profusion.com doesn't look that bad. The only thing really strange that I can see is a few elements that are between a and a . In other words, they are not contained within a visible element. This appears to have been done on purpose, since these are . If this screws up rendering of the table in Netscape, then Netscape is broken, since the correct response for an SGML parser that encounters invalid tags is to ignore them.

--
----
"Oh, bother," said Pooh, as he hid Piglet's mangled corpse.

To answer the question that was asked, by nels_tomlinson · 2000-09-13 10:59 · Score: 2

We know that telling people what the rules are will get them to change their behavior. Game theory tells us that the set of possible equilibria may change. In short, it is not necessarily better to tell everyone what the rules are, and try to iterate to some new steady state, in which we all try to exploit, and "it all balances out".

This really is not related to the security-by-obscurity issue, I think. It involves security in the sense of "keep your passwords secret", not in the sense of "build a system with no bugs", which seems to be where security-by-obscurity fails. The issue here is that rules which people could know about and not be able to exploit might well be significantly inferior to rules which are exploitable.

I suppose that someone who wanted to find out an answer to this could try to get a grant to set up a search engine with a public rating system, and see what came of it. If we can come up with a reasonable metric for the signal-to-noise ratio which resulted from searches, we could find out what really works. By the way, I suspect that one problem with such a proposal would be that no one will bother gaming the system unless it is REALLY popular, and delivers loads of hits. That is, I don't think that this could be done on a small scale: go Google-size, or you won't get any data that applies to the big engines. But you could find out how alternative rules work, with a site that was below the radar screens of the home shopping network and the porn queens.

I have a question for you CS grad students: is there any academic literature on this issue, looking at how people react to this sort of structure, and how the structure must be designed to get them to react in a way which doesn't screw things up? This seems to relate to the information theory field of economics.

--
See what I've been reading.

3 probs with the current search engines: by M@T · 2000-09-13 11:21 · Score: 1

1) Database driven web sites make it difficult to directly index pages on a particular site. This is compounded by the fact that site specific search engines are usually ill-equipped for anything other than the most basic word searches.

2) Apathy on behalf of the user in that they never get around to actually checking out the site's 'How to use this search engine' section and rarely stray from your basic keyword search, but will bemoan the number of results returned anyway. (eg. '+host:' option on altavista)

3) Apathy on behalf of the search engine site in not providing the tips and tricks mentioned in item 2 such that a user can't break the search down any further even if they wanted to.

--
'sapientia potestas est'

can search engines be improved? by daviddlewis · 2000-09-13 11:27 · Score: 1

Absolutely! There's a number of techniques already known in the information retrieval research community (relevance feedback in particular) that aren't being exploited in current web search engines, and would make a big difference. What's holding them back is usually some combination of efficiency problems and lack of a good interface metaphor for allowing naive users to effectively use the technique. As others have pointed out, most people don't even use phrases.

I think it's a non-issue whether the criteria the engines used are publicized or not. There's enough index spammers out there that any weaknesses in the criteria get discovered, exploited, and patched fairly quickly.

Dave

speeds the process by aozilla · 2000-09-13 11:39 · Score: 1

case 1 (open source):
search engine is written (1 year)
for (n=1;n<10000000000;n++) {
someone finds a way to spam it (n days)
search engine is fixed (n minutes)
}

case 2 (closed source):
search engine is written (1 year)
for (n=1;n<10000000000;n++) {
someone finds a way to spam it (n minutes)
search engine is fixed (n minutes)
}

--
ok then your [sic] infringing on my copyright! Could you as [sic] me next time before STEALING my comments for your own?

Misapplication of 'open source' by The+Kow · 2000-09-13 12:23 · Score: 1

"honing the criteria towards unexploitable results?" I think this would be, as per the topic, a bad place to apply the open source theory. I can't get too into specifics regarding the theory itself (I only know what I do from context, in all honesty), but having been involved with MUDs, FPS's, and other such systems where the drive to cheat is generally very high, issuing out information on how things work is very rarely a completely effective solution. While methods of cheating-prevention are established, the methods used to cheat become refined, similar to strains of insects and virii, to accomodate. Sometimes it ends in success, but very rarely. The only way I can see this working is if the people behind the engine were constantly working on upgrading the grading process. This would probably require additional manhours on the budget sheet, and I doubt you could convince management that the cost-benefit was worth it. If it ain't broke, don't fix it. If it's only a little broke, don't look at it.

--
Moo

Google Algorithm by dzhei · 2000-09-13 12:30 · Score: 1

Here are links to abstracts from two papers that detail the inner workings of Google. The links below lead to abstracts; from those pages you can view cached pdf and postscript copies of the papers. The first paper is a reasonably high-level overview, although it's a bit technical. The second paper is more in-depth and discusses pagerank in more detail than most of you probably want. In any case this should give you a good idea about what goes on in a real search engine, and should clarify why it's hard to fool google.

Re:some SE's and web-pimps already do this by fence · 2000-09-13 12:38 · Score: 2

GoTo and some other sites already do this.

Check out the "cost to advertiser" link under the top 'N' results on GoTo.com...

usually it is just a couple of cents, but I've seen some sites pay more than a US dollar for your click-through.

Makes me wonder if they knew what they were doing when they submitted their bid...
hey--what's the difference between 2 and .02 anyway

--
Interested in the Colorado Lottery or Powerball games?
check out http://colotto.com

Keeping things secret... by Ascender · 2000-09-13 13:52 · Score: 1

Perhaps one place where it might be better to keep things secret is virus scanners... If those of us who write such things could read the algorithms by which scanners look for virii, they could quite easily avoid those algorithms.

Re:Unexploitable? Read the Google Paper by thenning · 2000-09-13 16:14 · Score: 1

There exists a paper on the Google search engine, since it was developed at Stanford; read it and find out how it works.

Multiple search engines- inelegant but they work by cheekymonkey_68 · 2000-09-13 16:33 · Score: 1

I use the Copernic search engine at work, which searches a range of search engines depending on what criteria you wish to search on.

It even has a search engine for programming, and of course slashdot is amongst the computer sites they search for content!

It's not particularly elegant, but it does seem to produce good results.

The more they tell us, the less they earn? by cah1 · 2000-09-13 16:51 · Score: 1

If they told us *exactly* how it all worked, then they'd lose the ability to sell spots on the front results pages to specific sites ...

--
"I do not speak for my employers, though they are controlled from my Teddy's huge pulsating brain."

Some bad things about dmoz.org by KjetilK · 2000-09-13 17:08 · Score: 2

I've been an editor at dmoz for two years, but I've given up. Dmoz has collapsed under its own weight and nowadays, it sucks.

The primary reason why it sucks is exactly that it doesn't have moderation... :-) They have a problem that in some categories lots of spammers sign up only for self promotion, and their response is to reject 90% or so of those signing up. Instead, what they should do is to make sure that no individual has too much power, including the meta editors (who are not always awfully clued).

The reason why it never gets corrected is that writing a comment like I do now is considered "disloyal" to the directory.

I have a bunch of ideas for a better web directory. In the meantime, I'm thinking about a classification for skeptical resources.

--
Employee of Inrupt, Project Release Manager and Community Manager for Solid

Some engines control your clicks by Pseudonymus+Bosch · 2000-09-13 18:02 · Score: 1

Actually some engines (Mamma?) don't give the exact results but something like http://search.tld/redirect?http://sought.site.tld/page.htm .

From this, they can gauge the usefulness of the result (or charge the advertiser). They could fetch that page as well to check for 404s, but with so many searches, I think it would be hard on their networks.

I don't like this for privacy reasons and because I can't tell whether this redirected URL is one I already visited from another engine or not.
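A small Python sketch of how such a redirect link could be built and counted. search.tld is the placeholder host from the post, and tracked_link, handle_redirect and click_counts are made-up names; no real engine's URL scheme is implied.

```python
from collections import Counter
from urllib.parse import quote, unquote

REDIRECT_PREFIX = "http://search.tld/redirect?"  # placeholder host from the post
click_counts = Counter()

def tracked_link(target_url):
    """Wrap a result URL so the click passes through the search site first."""
    return REDIRECT_PREFIX + quote(target_url, safe="")

def handle_redirect(request_url):
    """Count the click, then return the URL the browser should be sent to."""
    target = unquote(request_url[len(REDIRECT_PREFIX):])
    click_counts[target] += 1
    return target
```

This also shows where the privacy gripe comes from: every click is visible to the engine, and the wrapped URL no longer matches the one your browser marked as visited.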
__

--
__
Men with no respect for life must never be allowed to control the ultimate instruments of death.
GW Bu

Open source search engine by dnnrly · 2000-09-13 18:34 · Score: 1

Has anyone actually considered writing an open source search engine? Is there one on SourceForge etc.? I can see the problem with some of the more unscrupulous people trying to use the source to cheat on these searches, but I tend to agree with the "many eyes" theory. The method for searching does not have to be implemented in the code; it could just be input via scripts which tell the search engine how to do the searches. This means that a truly open source engine could be built and each piece of functionality or method of searching (be it checking out the links, counting words or whatever) could be subjected to many eyes, but the way it is all implemented is kept as secret as you want.

Your post, corrected by spell checker. by MaxGrant · 2000-09-13 18:40 · Score: 1

Hay, Eye think that making peephole with pore grammar skills and spelling suffer is a great eye dear. It did be "the internet spell checker" and wood force peephole two learn the language that they are attempting two communicate in. Eye fore won think that wood be a God Thing(TM). Grammar not see will back me up on this. Have you ever tried two reed sum internet sites, like sum of slashdot's poster's posts or things like www.techcomedy.com? It's amazing that sum peephole are qualified as literate. I just no this post will be ripped apart bye peephole fore spelling and grammar, heh.

Thank you.

Secret algorithms? by ficara · 2000-09-13 18:50 · Score: 1

Isn't it a fundamental precept of cryptography that any algorithm dependent on hiding the details of its operation is not secure? That a strong algorithm is strong only if it cannot be broken no matter how much you know about how it works? Why would the same not apply to search algorithms?

:) by Pseudonymus+Bosch · 2000-09-13 18:58 · Score: 1

Thank you, that was funny.
__

--
__
Men with no respect for life must never be allowed to control the ultimate instruments of death.
GW Bu

Re:The quality of results is the fault of users & by magic · 2000-09-13 19:14 · Score: 1

Well, also the fact that a huge chunk of the web isn't even indexed at all.

The increasing use of JavaScript, Flash, Acrobat, Images, and other technologies that take human readable text and put it into forms a web search engine can't parse or understand is taking many of the "fancy" sites out of search engines.

The irony is that these tend to be the sites that are trying very hard to get to the top of search engine listings.

-m

Obscurity has its place. by Refrag · 2000-09-13 19:19 · Score: 1

Obscurity has its place, some people have to realize that. This is one of those places.

If you're really that interested in making search engines better maybe you should apply to one of the search companies or make a better search engine.

Refrag

--
I have a website. It's about Macs.

Google Cache by isaac_akira · 2000-09-13 19:50 · Score: 1

Google's caching system is great for those informative web-sites that dropped off the net, are temporarily down, or just really slow. It's also nice for sites that rotate their content, because even if the current site no longer has the info you were looking for (or they do, but it's archived somewhere deep in the site) the Google cache has it, and with all your keywords highlighted.

Hmm... You know how new users are always saying "I've got the Internet IN my computer"? Well, I guess Google really does.

- Isaac =)

Oingo rocks, thanks for the pointer by Raffaello · 2000-09-13 20:38 · Score: 1

"I like using Oingo. Honestly the hits it turns up are not always great and it is tricky to use but it is not a "stupid" search engine. It attempts some level of ontologizing (ontologization?)... and its about time someone tried that."

Yes, yes, yes! I just tried Oingo (never heard of it before), and it not only turns up relevant web sites, but will return a whole ontological category, with all the major sites in that category as well.

Moreover, you can specify which of several recognized "meanings" of each term the search engine should use. For example, I tried "Dylan incremental compiler," and was able to refine the search by selecting from a pop-up menu that the meaning of "Dylan" I meant was "object oriented programming language" as opposed to a "Celtic deity."

Really, it's brilliant. I agree that ontological approaches are the way to go for intelligent people searching the web for meaningful information. Thanks for the link.

Problems with Inverted Indexing by harrisj · 2000-09-13 21:07 · Score: 1

Most of the people here have already noted some of the problems with data collection that search engines face. Mainly, pages may have surreptitious content designed to fool search engines about their real agenda. Even Google could be spoofed by a dedicated foe, although it takes a bit more work.

While this is always a pressing concern for those people writing engines, there are other issues that might affect the accuracy of search engines. Mainly, there are certain limitations on the underlying technology, and other technologies are still in early development.

I think every major search engine uses Inverted Indexes to represent the data. The idea behind this is that you can think of every document as being made up of the following tuples: {docid, termid, position, fieldid}, where each doc has a unique id, each word in the lexicon has an id, and a fieldid can be used to indicate special fields like titles or meta tags. All the engines basically take this information and produce inverted indexes for searching which contain the following tuples: {termid, docid, position, fieldid}. Throw in some mapping tables, sorting, and some compression optimizations and you have the basic idea. When a search comes in, pull up the various document lists for each term, scan through them for matches (ones for each term), and return the best results.
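To make the tuple description concrete, here is a minimal in-memory Python sketch. It keeps docid and position but leaves out fieldid, the mapping tables and compression, and the "ranking" is just a crude occurrence count; build_inverted_index and search are made-up names.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps docid -> text. Returns term -> {docid: [positions]}."""
    index = defaultdict(dict)
    for docid, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(docid, []).append(position)
    return index

def search(index, query):
    """Return docids containing every query term, best match first."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    common = set(postings[0]).intersection(*postings[1:])
    # crude ranking: total number of occurrences of the query terms
    score = lambda d: sum(len(p[d]) for p in postings)
    return sorted(common, key=lambda d: (-score(d), d))
```

A search for a word that is in no document simply returns nothing, which is the "only finds documents by words that are in them" limitation in miniature.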

This works well for large collections, but it has a few limitations. For one thing, it can only find documents by words that are in them, so relevant documents that use related words are ignored (I think this is called synonymy). Also, you can have problems with polysemy (a search for "jaguar" could be a car, a team, etc.). In addition, the lexicon scales in the worst way, causing most indexes to limit the size of their lexicon, causing rarer words to be ignored. Finally, it can perform rather poorly for words that appear often (try "market share" for example), since the term lists for these are large and require scanning through large amounts of disk. And mainly, all the search engines just tweak this model, but there might be better solutions out there.

Some research has been designed to tackle the scalability problem. k-Nearest-Neighbors works by performing a feature selection on the lexicon and pruning words that aren't really useful for searches (eg, "slashdot" is more significant than "the"). Some approaches can remove up to 98% of the lexicon without a significant loss in quality. Then documents can be represented as vectors of the remaining features and queries can be mapped into this space and the k-nearest neighbors are returned (eg, you calculate a dot-product). This scales in size nicely on disk, but you find yourself doing more vector comparisons as your collection size increases, so it's really only practical for smaller documents. It also requires that your searches contain one of the terms in the feature set, which can be a bit limiting.
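A toy Python version of the vector idea, assuming the feature set has already been chosen by hand. The feature selection step itself is not shown, the similarity used is a cosine (a normalized dot-product), and to_vector, cosine and k_nearest are made-up names.

```python
import math
from collections import Counter

def to_vector(text, features):
    """Count only the words that survived feature selection."""
    return Counter(w for w in text.lower().split() if w in features)

def cosine(a, b):
    """Dot-product of two sparse vectors, normalized by their lengths."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def k_nearest(query, docs, features, k=2):
    """Return up to k docids whose vectors are closest to the query's."""
    q = to_vector(query, features)
    scored = [(cosine(q, to_vector(text, features)), docid)
              for docid, text in docs.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda sd: (-sd[0], sd[1]))
    return [d for _, d in scored[:k]]
```

Note that the query is compared against every document, which is the "more vector comparisons as your collection grows" cost, and a query with no feature words matches nothing.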

Some research has been focused on capturing more of the "latent semantic" information in a document. Indeed, Latent Semantic Indexing (LSI) is the focus of much recent research. This technique works by feature selection and transforming documents into vectors in a semantic space that represent their semantic information (break out the linear algebra). Researchers claim that this allows you to find documents with related content, even if they don't match your terms. It also claims to solve the "jaguar" polysemy problem, but you really need to enter more than one term for it to work (and they have to be in the feature set too). While early research looks promising against test collections, its performance scales poorly and it doesn't work well against noisy collections like the web yet. But research continues.

Other research has been focused on natural language parsing in an attempt to recognize meaning of user queries and documents. This however is really tentative, and it hasn't been able to show some of the success of the less intelligent and more statistical methods like LSI or kNN.

I hope these rambling notes prove somewhat helpful. Interested slashdotters can probably find some useful primers on the web that explain this better than I can. Also, ACM members should definitely check out the SIGIR conference proceedings (the Digital Library is great).

Re:Problems with Inverted Indexing by dehora · 2000-09-14 00:45 · Score: 1

Can anyone please explain to me how this gets a 1?

--
I saw that ...to search was not always to find, and to find was not always to be informed -Sam Johnso

Re:Baner adds can teach us alot by ameoba · 2000-09-13 21:22 · Score: 1

Actually, not too long ago I remember hearing some guys in a bar going off about the latest & greatest pyramid scheme ^H^H^H^H^H^H^H^H^H^H^H^H^H^H small business.

The core of the scam was that you'd buy a web page (or server space or whatever, they were clueless ex-Amway guys) from them, and agree to link to their web pages. Then fill the pages with banner ads, and give them a cut of the profits. There was something about free internet access rolled into the deal.

When I heard the words 'bulk email' I nearly snapped and threw my stool at them...

too bad it was bolted down...

--
my sig's at the bottom of the page.

Not at the local library any more... by goliard · 2000-09-13 21:23 · Score: 2

Actually, I know a bunch of research librarians who have been snapped up by a new company out on Rt128 which sells their services for doing web hunts....

--
-*- Any technology indistinguishable from magic is insufficiently advanced -*-

How to make "ranking" really work. by jekk · 2000-09-13 21:46 · Score: 1

You raise the problem that "moderation" is a difficult problem (eg: /.) and that using it on a search site would not work well because "too many people" moderating would "pull a ranking in too many directions".

There IS a solution to this problem. It WOULD work, and it would make an AWESOME search engine (once enough people used the rating system). The only weakness is that it requires some sort of log-in to the search site -- which might actually be GOOD for the search site's bottom line.

Consider Amazon. They don't make book recommendations by finding the "most popular books in the world" and suggesting them to everyone... instead they recommend the books that are similar to what YOU like to read. The technology behind it is a technique called "collaborative filtering". Basically (to strip off a whole bunch of marketing designed to make it seem complicated), the idea is to look at the rankings YOU have already made, and then use the pool of people whose rankings are SIMILAR to yours as the pool of people from which to draw on when deciding how "popular" a book (or, for web search engines, a website) is.

So here's how it would work. You start with a basic set of search criteria... maybe start with Google's, for instance. Then, when people sign up for your search engine, you invite them to submit a list of their favorite/most-used websites. This gives a starting place, and from the very beginning, search order can be modified slightly by giving a slight + to the score of those sites which are frequently mentioned in favorites-lists for people whose favorites-lists are very similar to your own. (There should be an option to exclude sites on your own favorites-list from showing up in queries. Some would want it, some wouldn't.)

This gets things started, but favorites-lists won't provide enough data for a really good, web-wide ranking of sites, and they will get stale fairly quickly. The real trick is to develop a "clicked-on" list for each user as well, and use the clicked-on-lists, as well as the favorites-lists, for modifying the ranking of new sites. The clicked-on-lists could be gathered by making the links in the results go through the search site for a redirect, so that they can be counted along the way.
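The redirect trick might look something like this in Python. It is only a sketch: the `search.example` address, the `u` and `to` parameter names, and the in-memory `clicked_on` store are all invented for the example, and there is no actual web server here, just the two pieces of logic.

```python
from collections import defaultdict
from urllib.parse import urlencode, urlparse, parse_qs

# user_id -> list of sites clicked, in order (a real site would use a database)
clicked_on = defaultdict(list)


def tracking_link(user_id, target_url, base="https://search.example/go"):
    """Build the link shown in the results page instead of the bare target URL."""
    return base + "?" + urlencode({"u": user_id, "to": target_url})


def handle_redirect(request_url):
    """Record the click, then return the URL the browser should be sent on to."""
    params = parse_qs(urlparse(request_url).query)
    user_id = params["u"][0]
    target = params["to"][0]
    clicked_on[user_id].append(target)
    return target
```

So a results page would show `tracking_link("jekk", "http://slashdot.org/")`, and when the user follows it the search site calls `handle_redirect` on the incoming request, logs the click, and forwards them.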

So, once the system was up and running, new users would have to create an ID when they signed in. Then they could create a favorites list, which would result in immediate "customized results", or they could just use the search site for a while, and their results would gradually become more personalized as they built up a clicked-on-list. Most importantly, a large portion of the ranking of sites would be determined by what users had listed or clicked on, rather than by keyword, linked-to-rating, or other factors more easily manipulated. And it would customize itself to your own searching preferences.

Now, one final thing. I know that this is a GOOD idea. I've been thinking about it for a long time. If anyone else reads this and agrees, please let me know... because I'd be very interested in doing it. (This is a bit long for an elevator speech, but I figure the /. readers I'm interested in have longer attention spans.) In fact, if you even just READ THIS, but AREN'T interested, could you drop me an email to let me know? Thanks.

Michael Chermside
michael.chermside@destiny.com
5715 North Ridge Ave
Chicago IL 60660

Re:Google: The Criteria Aren't Exploitable by Tau+Zero · 2000-09-13 23:57 · Score: 2

If your sites only link to each other, and there are few or no links to them from the outside, then they form an "island". This shouldn't be that hard to detect. Further, if none of the authoritative sites link into the island the credibility of the cross-links isn't high. The Google method isn't quite as easy to spoof as it might sound; I've never been directed to a pr0n site from a Google search.
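For what it's worth, the island check could be sketched like this in Python. The assumptions are mine: the link graph is a dict mapping each site to the set of sites it links to, the group of suspect sites is already known, and the function names are made up. This is not Google's actual method, and finding the candidate groups in the first place is the harder part that is not shown.

```python
def inbound_from_outside(links, group):
    """Sites outside the group that link to at least one site inside it."""
    group = set(group)
    return {src for src, targets in links.items()
            if src not in group and targets & group}


def is_island(links, group):
    """True if nothing outside the group links into it."""
    return not inbound_from_outside(links, group)


def has_authority_backing(links, group, authorities):
    """True if at least one authoritative site links into the group."""
    return bool(inbound_from_outside(links, group) & set(authorities))


links = {
    "spam-a": {"spam-b"},
    "spam-b": {"spam-a"},
    "news": {"project"},
    "project": set(),
}
print(is_island(links, {"spam-a", "spam-b"}))                 # True
print(has_authority_backing(links, {"project"}, {"news"}))    # True
```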
--
Build a man a fire, and he's warm for a day.

--
Time is Nature's way of keeping everything from happening at once... the bitch.

Aeiwi by AeiwiMaster · 2000-09-14 00:07 · Score: 1

Aeiwi doesn't refuse to release
the exact criteria that determine the results ranking.

It is all on the submit page

Ontological Search Engines by winterstorm · 2000-09-14 00:32 · Score: 1

The people at Yahoo had once aspired to add some ontological smarts to their search engine. There was an article some years back in Wired that talked about how Yahoo had hired some of the people from Cyc Corp. to help them develop an ontology. I remember that one of them was the co-author of BLKBS ("Building Large Knowledge-Based Systems").

This seemed like exciting news to me at the time, but Yahoo was then, and remains now, more of a "Dewey Decimal system" than an ontology.

audio/music search. by amchugh · 2000-09-14 01:13 · Score: 1

Hey, when will I get a search engine which listens to me whistling a tune and then pulls up the relevant song(s)?

Oh, and I'm on the DON'T RELEASE CRITERIA side of the fence. The spammers and marketeers own enough of the web, thank you very much.

BTW - I'm not sure what all the fuss is about Google. It doesn't seem to return results as good as a well-formed query on AltaVista advanced search.

Wow! That was magical! by exister · 2000-09-14 07:32 · Score: 1

How did you do that?

--
The cure for 1933 is 1917.

Re:Submitcorner.com by AgentWebRanking+Free · 2000-09-14 20:53 · Score: 1

Thanks for the link to submitcorner.com. You will find other info on searchengineforums.com.

--
Freeware - Search engines ranker and analyzer - 5 stars Zdnet - http://www.aadsoft.com

Re:profusion isn't so good by Tumbleweed · 2000-09-15 00:53 · Score: 2

Guess you didn't read what I wrote, eh?

Try searching for 'babelfish translator' with PHRASE mode, and it comes up as number one. ;)

Slashdot Mirror


188 comments