Gnutella Technology Powers New Search Engine
Matrium writes: "News.com (owned by CNet) is running an article on how the makers of Gnutella have turned their decentralized model of information swapping away from music and porn, and are now looking at search engines. InfraSearch is still in beta, but it does offer an interesting look in the evolution of the Internet." InfraSearch presently paws through only a few search sites, but as a concept it really intrigues me. For one thing, it introduces the long-overdue concept of "how long to search" right into the query dialogue.
What's to stop people 'spamming the index'? When your site gets a query, you could respond with 'very strong match' in the hope of getting more hits.
Who is enforcing that sites won't just lie? Maybe some sort of collaborative moderation a la Slashdot would be needed?
-- Ed Avis ed@membled.com
I really, really hate projects that are "open source" but refuse to release the source until it's "done." Too many projects these days seem to be following that path, and it's a dangerous one to take. What if the code is never truly "finished", as no project ever really is? It's sad.
--------- Beware the dragon, for you are crunchy and good with ketchup.
Since people can define their own content, would this mean that people running the server-end could still be distributing their MP3's, pr0n etc, but through a web interface? It's not just limited to html-page searching.
This makes pira^H^H^H^H trading files even easier - people no longer need to install a client, there's a nice web-search interface, with direct dload URLs. Web searching for files with no broken links. Nice.
Go ahead and do a search for something. Within 5 minutes whatever connection you have (DSL, T3, etc.) will be saturated. Has anyone ever had a complete download? Getting 100 bytes/sec on a 5 meg file is insane. Maybe it would work if clients reported their connection speed truthfully and people set realistic download/upload limits.
Only the State obtains its revenue by coercion. - Murray Rothbard
Doesn't the model imply that every search will be processed by every available server - effectively turning a single query into n queries and responses?
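To put a number on that fan-out, here is a minimal sketch of a flooded query. It is not the actual Gnutella protocol: the `flood` function, the TTL handling and the toy ring network are all assumptions made for illustration, and each node simply forwards to every neighbour, including the one it heard the query from.

```python
from collections import deque

def flood(graph, origin, ttl):
    """Flood a query from `origin` for at most `ttl` hops.

    `graph` maps each node to a list of its neighbours (hypothetical layout).
    Returns (nodes_reached, messages_sent).
    """
    seen = {origin}
    messages = 0
    frontier = deque([(origin, ttl)])
    while frontier:
        node, hops_left = frontier.popleft()
        if hops_left == 0:
            continue  # TTL exhausted, this node does not forward
        for neighbour in graph[node]:
            messages += 1  # every forward costs a message, even a duplicate
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops_left - 1))
    return len(seen) - 1, messages

# Toy network: six nodes in a ring, each connected to its two neighbours.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
reached, sent = flood(ring, 0, 3)
print(reached, sent)
```

Even on this tiny ring the query costs twice as many messages as there are nodes reached, and that is before a single reply comes back.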
Just think - you're dialled in to an ISP and want to search for something. Eventually you start getting responses, first from hosts logically closer to you, then from those further away (we can only hope that there's no negative response in the protocol). You may have to wait for it all to come down the line before you get a useful result. And you'll still have to wade through mountains of useless junk (since responders get to define what content they have), except that now you'll have to actually visit the site to see that it's just another boring article on internet protocols instead of the "fix your credit record" guys you were looking for. Eventually, you'll learn which hosts not to accept responses from and which ones respond better to what types of queries (just like today).
Big search engines will still dominate the field by being able to get it right most of the time. I don't see any real advance.
---
The real problem is entropy.
The article mentions this, but not strongly enough. Without "legitimate" applications, these technologies will be viewed as simply tools for piracy or other illegal use. FTP, for example, could be used for those purposes, but its mainstream uses came first. We need to develop as many mainstream uses for MP3 and Gnutella as we can, so that the critics' focus is drawn away from the music/copyright questions and on to the other uses. As of now, they can claim that the other uses are simply "vaporware": sure, they're possible, but no one is actually doing anything with them. Once the applications come, the technology will gain the acceptance it deserves.
Of course I use Microsoft. Setting up a stable unix network is no challenge
Yes, but the idea of letting clients search each other and share files, the Napster or Gnutella way, is a very good one. It has many legitimate possibilities; it's just that it started out being used for piracy, and saying that its only use is piracy is a bit short-sighted. Though honestly, it can be easy to see things that way. I used to believe the only good use for CD-R technology was copying games and music. But then I became a network administrator and realized its benefits for cheap backups. Anyway, my point is that you should never abandon a new idea just because its first uses are bad (take nuclear power, for instance :)
The internet has traditionally been free, but in recent years and months we have seen an increase in attempts to control it via legislation, patents and lawsuits. The problem is that whilst the internet has seen a large influx of everyday Joes and suits, the real power behind the net is, as always, the people who write the software. Gnutella and software systems like it are part of the fight back. Previously, online systems were centralized because that was simpler and there was no reason to build them any differently. Now that we are entering a time when the freedom we used to take for granted online is under threat, new software systems that are nearly impossible to regulate are inevitable. If the various governments and organizations had paid attention to the cherished principles of the net, perhaps we could have found a way to limit the pedophiles and professional pirates that they seem so paranoid about without compromising the net's principles too much. Instead the MPAA, the RIAA and all the other control freaks decided they wanted to make a war out of it, and a war they will get.
What's to stop people from spamming the index?
I suppose they could build in a little technology to actually check the page. On the other hand, anything you do can be circumvented.
I suppose this is the classic downside to the entire Internet "thing". You can't enforce absolute control in a medium specifically designed against it. Of course, there are a few things you could do to help the situation.
With a Gnutella-style model for distributed searches, any host that is consistently returning false positives could be cut off by the adjacent node(s), right? If you have tons of traffic coming through your node from a spam site, couldn't you just stop forwarding requests to them?
Of course, this wouldn't stop all spamming on the index, but it should allow any one node to cut off a spam node "below" itself. On the other hand, since not everyone will be eternally vigilant, this much freedom could be damaging.
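A rough sketch of what "cutting off" a neighbour might look like. Everything here is hypothetical: the `NeighbourFilter` class, the thresholds, and above all the assumption that a node has some way to judge whether a returned hit was relevant (say, by spot-checking the page or from user feedback).

```python
from collections import defaultdict

class NeighbourFilter:
    """Tracks how often each neighbour's hits turn out to be junk."""

    def __init__(self, min_samples=20, max_bad_ratio=0.5):
        self.min_samples = min_samples      # don't judge before this many hits
        self.max_bad_ratio = max_bad_ratio  # tolerated share of false positives
        self.good = defaultdict(int)
        self.bad = defaultdict(int)

    def record(self, neighbour, relevant):
        """Note one checked hit that arrived via `neighbour`."""
        if relevant:
            self.good[neighbour] += 1
        else:
            self.bad[neighbour] += 1

    def should_forward(self, neighbour):
        """False once a neighbour has proven itself mostly spam."""
        total = self.good[neighbour] + self.bad[neighbour]
        if total < self.min_samples:
            return True  # not enough evidence yet, give it the benefit of the doubt
        return self.bad[neighbour] / total <= self.max_bad_ratio

# Usage with small thresholds so the effect is visible quickly.
nf = NeighbourFilter(min_samples=4, max_bad_ratio=0.5)
for _ in range(4):
    nf.record("spam.example", relevant=False)
for verdict in (True, True, True, False):
    nf.record("honest.example", relevant=verdict)
print(nf.should_forward("spam.example"), nf.should_forward("honest.example"))
```

The weak spot is the same one raised above: somebody has to do the checking, and a node that never records anything never cuts anyone off.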
You could always have something like the MAPS RBL for search nodes. Just have someone paying attention that can keep a database of hosts to ignore requests from. If anybody can create a blackhole list, it wouldn't necessarily be centralized, so it wouldn't impinge on freedom of the search. It may still have an "open relay" problem, like SMTP does now, but that doesn't necessarily make it not worthwhile.
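As a sketch of the "anybody can create a blackhole list" idea: a node could subscribe to several independently maintained lists and ignore a host only when enough of them agree. The list names, hosts and `quorum` parameter below are made up, and plain in-memory sets stand in for whatever lookup mechanism a real list would use.

```python
def is_blocked(host, blocklists, quorum=1):
    """Return True if at least `quorum` subscribed lists name `host`.

    `blocklists` maps a list name to the set of hosts it blackholes.
    """
    hits = sum(1 for listed in blocklists.values() if host in listed)
    return hits >= quorum

# Hypothetical subscriptions; each node picks whose lists it trusts.
subscriptions = {
    "list-a": {"spam.example", "junk.example"},
    "list-b": {"spam.example"},
    "list-c": set(),
}
print(is_blocked("spam.example", subscriptions, quorum=2))
print(is_blocked("junk.example", subscriptions, quorum=2))
```

Requiring a quorum of two or more means no single list maintainer can black out a host on their own, which keeps the scheme from becoming centralized by the back door.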
Example: If you get the results of this kind of broadcast search back from a bad search ("sex nude pictures jpg"), you'll trash your own internet connection and probably that of others (or the search-interface's if you use a web-interface).
Imagine a network of a million hosts (a small subset of all webservers). Each of these is running a Gnutella-based search engine. On one of the servers is an interface to search the network for some information. The query is forwarded onto the overlay network, to say 10 neighbours at each node, assuming some mechanism is in place to avoid loops. If the network is well interconnected, it will take about 5-6 hops to reach an edge of the cloud (probably a couple of times more to reach all the nodes). As soon as the first nodes get the search request, they send back results, say limited to the 5-10 most significant hits. Each reply holds a number of tuples (a URL, a description, an indication of how close the match is, a timestamp and probably some more), maybe 1-2 kB per reply. Say 10% of servers have a match; then 100,000 hosts will at some point send back results.
By my rough calculation, about 100 MB of results will arrive at the searching node within a few minutes, if it can process the data flow.
This is only one search. Both the searching nodes and the servers will have to deal with a lot of searches, if other search engines are any comparison.
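The back-of-envelope numbers above can be reproduced in a few lines. This is only a sketch of the arithmetic: the `flood_estimate` name is made up, the hop count assumes an ideal tree with no overlapping neighbours, and 1 MB is taken as 1000 kB.

```python
def flood_estimate(hosts, fanout, match_percent, reply_kb):
    """Estimate hops to cover the network and reply volume for one query.

    Returns (hops, replying_hosts, total_megabytes).
    """
    # Ideal case: every hop multiplies the number of reached nodes by `fanout`.
    hops = 0
    reached = 1
    while reached < hosts:
        reached *= fanout
        hops += 1
    replying_hosts = hosts * match_percent // 100
    total_mb = replying_hosts * reply_kb / 1000  # kB -> MB
    return hops, replying_hosts, total_mb

# A million hosts, 10 neighbours each, 10% match, 1 kB or 2 kB per reply.
print(flood_estimate(1_000_000, 10, 10, 1))
print(flood_estimate(1_000_000, 10, 10, 2))
```

With 1 kB replies that comes to 100 MB for a single query, and 200 MB at the 2 kB end of the range, all converging on one node.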
Centralised search engines are a good way to limit bandwidth usage, but they are slow to pick up changes on the web.
Idea: it would be good to have a webserver keep an index of its own document space and, when that changes, push the change to a central search engine where it can be searched. Distributing the searches is a waste of resources; IMHO you should distribute the indexing mechanism and centralise the searching.
And considering that for this thing to work you need an index-engine on each server anyway, it's a small step to do it like this, isn't it?
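A minimal sketch of that push model, under heavy assumptions: the `CentralIndex` class and its `push`/`search` methods are invented for illustration, "indexing" is just splitting text on whitespace, and the transport between webserver and central engine is left out entirely.

```python
from collections import defaultdict

class CentralIndex:
    """Central search engine that only learns what sites push to it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs containing it
        self.terms_by_url = {}            # url -> terms from its last push

    def push(self, url, text):
        """Called by a webserver whenever the document at `url` changes."""
        new_terms = set(text.lower().split())
        old_terms = self.terms_by_url.get(url, set())
        for term in old_terms - new_terms:
            self.postings[term].discard(url)  # term no longer on the page
        for term in new_terms - old_terms:
            self.postings[term].add(url)
        self.terms_by_url[url] = new_terms

    def search(self, query):
        """Return URLs containing every term in `query`; no network fan-out."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

idx = CentralIndex()
idx.push("http://a.example/x", "Gnutella search engine")
idx.push("http://b.example/y", "distributed search")
print(sorted(idx.search("search")))
idx.push("http://a.example/x", "Napster clone")  # page changed, site pushes again
print(idx.search("gnutella"))
```

The query touches one machine, and the index is as fresh as the last push. Note that this does nothing about sites lying in what they push.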
The idea is interesting, no doubt. However, there are three major problems (from my POV) with it:
(1) An obvious point: if a site itself decides which queries to respond to, there'll be a lot of spamming the index. Doesn't anybody remember the fate of the [meta] tags?
(2) This search technology essentially turns a search into an advertising stream. Since the site decides what to return, it'll return a blurb instead of a context around the match. And if the site can return graphics and not just text strings... oh, my! Advertising banners as search results! Joy.
(3) The results are going to be dependent on the location of the query. Same question asked from a machine in California is likely to return different results if asked from a machine in Germany (especially with low timeouts). This isn't horrible, but not all that good. In particular, it means that I cannot tell other people "Search for 'foo', you'll find the site I am talking about on the first page".
Out of the three, the first is so obvious, something will be done about it. I don't know what, though. It's the second that worries me most of all. Besides more advertising, there is a basic problem here -- I want to see what the site has, not necessarily what they prefer to show me. To give a trivial example, a company could have a recalls/warnings/manufacturing defects page somewhere on its site to satisfy disclosure requirements, but never return this page to any search.
All in all, I'll stick with Google for the time being, thank you very much.
Kaa
Kaa's Law: In any sufficiently large group of people most are idiots.
"Unlike Napster, however, it allows people to search for any kind of files; a random sampling of the search terms being used at any given time ranges from MP3s to blockbuster movies to pornography."
"The Department of Transportation released a shocking report this morning, in which it was discovered that the federal highway system, unlike rural routes, allows transportation of any kind of material. A random sampling of items being transported at any given time ranges from pirated music to pirated blockbuster movies, to pornography."
It's 10 PM. Do you know if you're un-American?