Mathematical Analysis of Gnutella
jrp2 sent in a paper written by one of Napster's founding engineers. It is
a mathematical evaluation of Gnutella discussing
why the network won't be able to scale up to any reasonable size. I
have been impressed with Gnutella in the past, and have wondered along
these same lines myself.
Yes, but..
It's sort of like calculating the maximum hull speed for steam ships crossing the Atlantic Ocean and saying there is a theoretical maximum speed to intercontinental travel. Then someone comes along and invents airplanes.
Gnutella will mutate and evolve, and will at some point in the future be replaced by something better when it starts to fall over.
The demand for Ms. Spears and the Backstreet Boys is just too damn strong for things to stand still.
I enjoyed that this post was next to the announcement that the new-and-not-so-improved preview of Napster was out.
~.~
I'm a peripheral visionary.
Did you use Linux as the server? There are known limitations to Linux, the main ones being:
- it sucks
- it sucks
- it sucks
Hope that this clears things up.
This is one of the biggest problems with P2P file sharing programs. Nearly everyone wants great content for free, but very few are willing to give back and supply any of it.
There is a major flaw in all P2P software, and it has nothing to do with the coding. More people tend to want to take than give. I remember seeing a line graph on LimeWire's page (I think?) that showed a monthly progression of the number of people sharing files compared to the number of people downloading files. The 'downloaders' were outweighing the 'uploaders' by a HUGE amount.
If everyone was willing to share their files, then there would be no such problem with P2P programs.
In theory, a true Peer-to-Peer file transfer network would exist in a decentralized fashion where you would never have to query a central host for routing or file availability. Napster requires you to route through one of the Napster servers for information. Even introducing Napigator still doesn't alter the Napster model because all it does is allow you to route through a different central host. It seems that all Napster did was integrate a search engine and nameserving into one element (coming from only one provider).
This isn't to knock the accomplishments of Napster, it was certainly an original idea to incorporate these areas and provide a GUI access client to boot. But it is apparent that Napster developers weren't all that revolutionary in their thinking either.
The suggestion of true P2P is revolutionary, and the perfect implementation (should it ever arrive) will also be revolutionary. But the Napster model is no different from everyone providing their MP3 list to a website that maintains a list of links on where to download MP3s. Napster simply automated this process. Napster is no more P2P than any TCP/IP connection not operated through a proxy.
Is HTTP P2P? I'm talking directly to another system, and there is no moderator/mediator. Normally, I have to find out about that system from a 3rd party (e.g. a search engine) -- just like someone obtains a list of links from Napster.
True, I'm being no better than the author of the original article; because I too am offering no solutions. I'm just holding out hope for true P2P in the future.
This isn't a case of hackers getting into people's systems, it's a case of people who don't understand their own computer's directory structures sharing a bunch of files they shouldn't, unless there's something I missed in this poorly done news story. The real security risk here is not Gnutella, it's ignorance. I know the manual for Win ** is very thin and sketchy, but directories are covered in it.
It's depressing to think that a lot of people put their computers on a network without even understanding basic concepts like this. (It's even more depressing to call tech support at an ISP and realize you understand more about the problem than they do, but now I'm rambling.)
As I pointed out last time this was posted, this article is basically 100% FUD. Yes, the amount of traffic goes up. And no, Gnutella doesn't scale very well. But the author goes out of his way to make the problem look worse than it actually is. You see, the article only computes the total amount of traffic in the entire network, a number which is both huge and meaningless. By this math, if I send a packet somewhere and it takes 10 hops, well, that's like sending 10 packets!
At the end of the paper, the author coughs up the big scary number of 63GBps of traffic in the Gnutella network when the nodes each have 8 connections and are using a TTL of 8. Wow! That's a lot of traffic. That certainly isn't scaling! Well, what the author never points out is that, by his own math, the network has 7,686,400 users at this point! When we divide up the total traffic among all of those network links, we get a different view. If you do the math you discover that this is a whopping 72Kbps! Oh no! It's the end of the world! Well, no, it's not. True, it's more than a modem can handle. But it's well within the reach of most cable modem connections. Given that your computer is being expected to handle the search requests of over 7 million other people, it's not that much traffic.
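The figures in the paragraph above can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: it assumes the paper's reach formula is the usual sum of n*(n-1)^(hop-1) over all hops (an assumption, though it does reproduce the 7,686,400 figure), that "63GBps" means 63 decimal gigabytes per second, and that the load is split evenly per user. The function names are made up for illustration.

```python
def reachable_users(n: int, ttl: int) -> int:
    """Hosts reached when every node has n connections and queries live ttl hops.

    Assumes an idealized tree with no loops: the first hop reaches n hosts,
    and each later hop fans out to (n - 1) new hosts per host.
    """
    return sum(n * (n - 1) ** (hop - 1) for hop in range(1, ttl + 1))


def per_user_bits_per_second(total_bytes_per_second: float, users: int) -> float:
    """Spread a network-wide traffic total evenly across all users."""
    return total_bytes_per_second * 8 / users


users = reachable_users(n=8, ttl=8)
share = per_user_bits_per_second(63e9, users)
print(users)                   # 7686400
print(round(share / 1000, 1))  # per-user load in kbps
```

With the rounded 63 GB/s input this lands in the mid-60s of kbps, the same ballpark as the 72Kbps quoted above; the exact value depends on the unrounded total and on whether a gigabyte is counted as 10^9 or 2^30 bytes.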
Don't get me wrong, I agree that Gnutella doesn't scale all that well. But this paper is just plain FUD. The only number that really matters to users is the total bandwidth load on their pipe. By carefully avoiding that number, which isn't very big and scary at all, the author is clearly lying by omission. Given all of the real problems networks like Gnutella encounter, it isn't interesting to read this sort of drivel. Why don't we drag out Mathematica and model how much bandwidth Napster wastes by transmitting the names of all the files being shared even though most of them will never get searched for. Hmmm. Let's assume 7,000,000 users. Let's assume that they each share 1000 files with an average filename length of 32 characters. Why, that's 224 Gigabytes of data, and we haven't even done any searches yet! Clearly, Napster doesn't scale. Ugh. This guy might know how to use Mathematica, but I still suspect he worked in the Marketing department. With the same guys who will tell you about their 200Mbps fast ethernet.
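For what it's worth, the tongue-in-cheek Napster figure above does check out. A minimal sketch, assuming one byte per filename character (the constant names are made up for illustration):

```python
USERS = 7_000_000        # figure assumed in the comment above
FILES_PER_USER = 1_000   # likewise
AVG_FILENAME_BYTES = 32  # one byte per character, an assumption

index_bytes = USERS * FILES_PER_USER * AVG_FILENAME_BYTES
print(index_bytes)        # 224000000000
print(index_bytes / 1e9)  # 224.0 (decimal gigabytes)
```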
First, if I understand what he's driving at correctly, the bandwidth numbers he gives are for the Gnotella network as a whole, not for each and every client connected to it. This is equivalent to saying "average 'HTTP' usage generates n amount of bandwidth over the Internet", or "DNS traffic will consume x number of bytes on a given network". So what? Would anyone be really shocked if 7,000,000 web browsers generated HTTP and DNS traffic in the gigabyte range? Doesn't bother me. That might be an interesting number to your ISP but as a user of Gnotella I couldn't care less about how much total bandwidth my query for 'The Grateful Dead' takes up. It sure sounds like a lot of traffic, but it's distributed over the entire Gnotella network. As long as the traffic isn't high enough to overwhelm individual clients I don't see the problem. These numbers just don't seem to be that important, or am I missing something here?
The other item the author fails to consider (and I'm going to guess that, as one of the engineers behind Napster, he probably knows better) is client-side optimizations like search caching and differentiation of the clients. The caching argument goes like this:
If client A sends out a query to client C looking for 'Grateful Dead' and client B sends out a very similar request to client C, say, 'The Grateful Dead', even basic caching would prevent client C from sending this request back out to the same hosts that responded to the first request made by client A. Again, am I missing something important here? I'm not sure that caching would reduce the traffic dramatically but I'd be willing to bet that it would improve matters significantly, especially for clients that remained 'up' for long periods of time (which is in itself another important factor that seems to be missing here). This just seems so obvious.
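The caching idea above might look something like the following. This is a minimal sketch, not anything from the Gnutella spec or an actual client: `QueryCache`, `normalize`, the stop-word list, and the five-minute expiry are all hypothetical choices made for illustration.

```python
import time

STOP_WORDS = {"the", "a", "an"}  # hypothetical list, for illustration only


def normalize(query: str) -> frozenset:
    """Reduce a query to its significant lowercase words, ignoring order."""
    return frozenset(w for w in query.lower().split() if w not in STOP_WORDS)


class QueryCache:
    """Remembers recent query results so similar queries need not be re-flooded."""

    def __init__(self, max_age_seconds: float = 300.0):
        self.max_age = max_age_seconds
        self._entries = {}  # normalized query -> (timestamp, results)

    def lookup(self, query: str, now: float = None):
        """Return cached results, or None if absent or too old."""
        now = time.time() if now is None else now
        entry = self._entries.get(normalize(query))
        if entry is None or now - entry[0] > self.max_age:
            return None
        return entry[1]

    def store(self, query: str, results: list, now: float = None) -> None:
        """Save a copy of the results under the normalized form of the query."""
        now = time.time() if now is None else now
        self._entries[normalize(query)] = (now, list(results))


cache = QueryCache()
cache.store("Grateful Dead", ["host1:ripple.mp3"], now=0.0)  # client A's query
hit = cache.lookup("The Grateful Dead", now=10.0)            # client B's query
stale = cache.lookup("The Grateful Dead", now=1000.0)        # entry has expired
```

Here client C would answer B's query from `hit` instead of forwarding it again; the expiry is what keeps a long-running client from serving results for hosts that have since left.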
There are bunches of optimizations like this that can be done with the Gnotella application to reduce the overall bandwidth. And this leads to the other half of my point, i.e. the author assumes that each and every client will be functionally the same. They aren't. The Gnotella FAQ tells you to reduce your N if you're on a slow connection. This means that not all Gnotella clients are exactly the same now anyway; some have higher N's than others. The FastTrack guys (i.e. KaZaA, Morpheus, et al.) have already shown that it makes sense from an efficiency standpoint to have some clients do more than others via 'supernodes' and the like. This seems like a fairly obvious development on the client side and I can't for the life of me understand why this isn't addressed. I mean, really, isn't the 'client-client' vs. 'client-server' approach really the underlying assumption behind why Napster will scale and Gnotella won't?
I hate to say it but it looks to me like the author is showing just a little bias here. Hey, I suppose that if I worked on a competing standard I'd trash-talk the competition too but I think his time would be better spent making the Napster approach work better. No matter how you slice it or dice it Napster is pretty much dead while the Gnotella network is still alive and kicking. Maybe it will never scale to 'billions and billions' of hosts but at least it's still around and going strong.
Mod that man up.
I have never had much luck using Gnutella. The main problem seems to be the lack of parallel download: if you have 20 users all with the same file you want, it is dismally painful to have to pick one.
FastTrack, on the other hand (Kazaa has a Linux client that is IMHO better than the bloated Windows offering), works very well in this regard. Choose a file and the client downloads it in parallel from as many clients as it can, which makes for a much quicker transfer.
NZ Electronics Enthusiasts: Check out my Trade Me Listings
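The parallel download described in the comment above boils down to handing each peer a different slice of the file. The sketch below shows only that general idea; it is not how FastTrack itself does it (nothing in this thread describes its internals), and `split_ranges` is a made-up helper.

```python
def split_ranges(file_size: int, peers: list) -> dict:
    """Assign each peer one contiguous, inclusive byte range of the file."""
    if not peers or file_size <= 0:
        return {}
    chunk = -(-file_size // len(peers))  # ceiling division
    ranges = {}
    for i, peer in enumerate(peers):
        start = i * chunk
        if start >= file_size:
            break  # more peers than bytes; the rest get nothing
        end = min(start + chunk, file_size) - 1
        ranges[peer] = (start, end)
    return ranges


plan = split_ranges(1000, ["peerA", "peerB", "peerC"])
print(plan)  # {'peerA': (0, 333), 'peerB': (334, 667), 'peerC': (668, 999)}
```

A real client would also need to reassign a range when a peer drops out, which this sketch ignores.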
Read the protocol spec, and you will understand why you can't do this. You don't reply directly to a request. You send the reply back through your connections, and the clients you are connected to only accept replies with the correct information.
If you had 8 connections and a request came in from 1 of them, only that 1 connection would accept a reply with the request's GUID. The IP information is taken directly from your connection.
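The routing rule described above can be sketched as a small lookup table keyed by GUID. This is a simplified illustration, not code from any real client: the `Servent` class and its method names are hypothetical, and TTL handling is left out.

```python
class Servent:
    """Tracks which connection each query arrived on, keyed by descriptor GUID."""

    def __init__(self, connections):
        self.connections = list(connections)
        self._query_origin = {}  # GUID -> connection the query came in on

    def on_query(self, guid: str, from_conn: str) -> list:
        """Record the origin and return the connections to forward the query to."""
        if guid in self._query_origin:
            return []  # already seen this query; drop the duplicate
        self._query_origin[guid] = from_conn
        return [c for c in self.connections if c != from_conn]

    def on_reply(self, guid: str):
        """Return the single connection a reply should go back on, or None."""
        return self._query_origin.get(guid)


node = Servent(["c1", "c2", "c3"])
forwarded = node.on_query("guid-123", from_conn="c1")  # goes out on c2 and c3
back = node.on_reply("guid-123")                       # reply returns via c1
unknown = node.on_reply("guid-999")                    # no matching query seen
```

A reply whose GUID was never seen as a query has nowhere to go, which is the reason a reply cannot simply be sent to an arbitrary host.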
If you are going to criticize a paper, do so on the basis of what they are claiming (there is no shortage of support for the claims he is making), not with conspiracy theories about the author's motivation.
So... How come after... 2? years Freenet hasn't become a standard or even well known in the file-sharing world? I'm not trolling, I'm curious. Napster has come and gone, Gnutella has come and gone, now we have FastTrack... Meanwhile, the Freenet site just chugs along...
Of course, building an indexing system that scales arbitrarily is difficult, and building an indexing system that recognizes local topologies is also critical. A typical problem universities had with Napster was that if N people at the school wanted a given tune, most of them would be likely to fetch it across the school's limited outside bandwidth instead of most people fetching it from other sites on the fast LAN after the first one or two had downloaded it across the limited part. Napster was able to reduce this problem, at least at some schools, because having a centralized indexing service means that they can enforce more locality by making it easiest for people to find nearby peers. A decentralized system *may* be able to accomplish this, but it's a lot harder.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
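The locality point in Bill Stewart's comment above is easy to picture for a central index: when answering a search, list the peers that look "nearby" first. The sketch below is only a guess at the general shape, not a description of what Napster actually did; `rank_peers` is a made-up name, and treating a shared /16 prefix as "same campus" is a crude assumption.

```python
import ipaddress


def rank_peers(requester_ip: str, peer_ips: list, prefix_len: int = 16) -> list:
    """Put peers sharing the requester's network prefix first, keeping order otherwise."""
    local_net = ipaddress.ip_network(f"{requester_ip}/{prefix_len}", strict=False)
    # False sorts before True, and sorted() is stable, so local peers lead.
    return sorted(peer_ips, key=lambda ip: ipaddress.ip_address(ip) not in local_net)


ranked = rank_peers(
    "10.1.5.20",
    ["192.0.2.7", "10.1.9.3", "198.51.100.4", "10.1.200.1"],
)
print(ranked)  # ['10.1.9.3', '10.1.200.1', '192.0.2.7', '198.51.100.4']
```

A decentralized network has no single place to apply this kind of ordering, which is part of why the comment calls it a lot harder there.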
The only solution is to structure the network by using "super clients" or "servants" or "super nodes"[...]
But won't these "super singularities" become, in the long run, bottlenecks themselves, prone to abuse, DoS etc., plus the logical target for the "other side" that wants this kind of p2p to be buried and forgotten?
One of the strengths of the p2p model is that it is hard to pursue 1000's of (arguably) minor copyright infringements, as opposed to charging one entity (Napster?) with all of them...
-- No sig today