Faster P2P By Matching Similiar Files?
Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files.
"SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""
Well, sure, if you're only looking at Nickelback songs.
similiar files ftw!
Wait...what?
One wonders if these "researchers" have ever actually used p2p...
"Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"
I'm hoping this CATCHES ON and wet ransfer a11 sorts of information like this. It'11 be 1ike getting every thing in the form of a ransom n0te.
I don't need large brains to have a good time.
So it's not me then? All new tunes DO sound the same?
Smokey, this is not 'Nam, this is bowling. There are rules.
porn.
No wait, hear me out. Most porn is going to be largely white or black skin colour (particularly with Friesian Cows if you're into that sort of thing), so the P2P can just find a chunk with a similar amount of that colour and download that!
Summation 2
It must be too early for me.
Why would I want to let someone transfer part of a file on my system that resembles part of a completely different file that they're looking for? Maybe everyone should just transfers all of their files all the time.
What if you're downloading a backup copy of a movie that is subtitled, as opposed to a version that is not? Considering how much space the video data takes up, these files would be awfully similar. Sometimes it IS the little things that make all the difference in files. Version of software with bugfix as opposed to without? Only difference between the two files is the name (1.0 vs. 1.1) and the fixed lines of code. That sort of thing. Seems to me this may not be as useful as advertised.
Sure this is going to work... really
I'll just splice that bit from that torrent, that bit from that one... it should work, I mean they are all the same TV episode and they are all mpeg4 - the file name says so...
Hmmm how about which bitrates, codecs, if it was from TV whether it was started at the same time??
That guy seriously has to be joking - the byte offsets are unlikely to ever specify a suitable join - and even if they rewrote the protocol so it split by seconds rather than fixed file widths you'd still have changing codecs and bitrates to deal with. Personally I'll stick to torrents with decent known trackers
$_="Slashdotter";$syn="OTT";s;..;;;sub _{print shift||$_};s!ash!Perl !;s=$syn=ack=i;tr+LLEd+BLAH+;_"Just Another ";_
Oh but if we only could! I'm sorry, sounds like someone is just imagining a 'what-if' scenario. This is not going to happen, unless, a new filetype is made specifically for it, and then you'd need a very specialized peer to peer program, and who knows how much extra work it's going to cost in terms of cpu/networking to determine if a file is similair or not. To me this just sounds stupid. Maybe it's just from waking up. They should work on optimising the 'pipes' or the 'tubes' to carry more data instead. It sounds like all the person thought of was hey, bit-torrent AND a stupid way to increase the number of people I can download from!
instead of sharing files, divide them into 16KB chunks and share those, to help work around files that get renamed or trivially altered (eg a website tagging their url to all the files you upload).
I knew all the music on the radio sounded the same, but I didn't think it was *that* lacking in originality.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
LOTS of overhead just to find the chunks.
.torrent files which contain the chunk breakpoints anyway)
The article talks about 16kb chunks, which for a dvd image would take more than the torrent protocol currently allows.
The client would spend more time communicating its chunk lists around than actually getting data.
(If I remember rightly, torrents can have a max of 65535 chunks and some servers prevent huge
liqbase
not particularly new. many P2Ps have been grouping identical files together. i know one of the early ones did it (was it Napster, Audiogalaxy?), but i think only if the files were 100% identical other than the filename.
there's definitely potential for problem here. what if those files really arent supposed to be the same? a swapped byte here and there could have huge effects on the end result.
Wouldn't this generate a lot more overhead traffic?
I'm sure smarter people than I have already thought this out,
but that seems to be the most immediate downside.
"SET divides a one-gigabyte file into 64,000 16-kilobyte chunks"
In other words: instead of seeking out the one master-hash for the file,
your P2P is looking up the thousands of chunk hashes.
[Fuck Beta]
o0t!
i don't think this makes sense for bittorrent p2p, but i think what they mean is
that when you search limewire you can see that there are many differently tagged versions with roughly the same filesize
when you try to download one, it isn't associated with the other files that are tagged differently even though it is the same song/bitrate everything.
so this program would recognize that i guess...
how about bringing the orig. napster back, or make a client IDENTICAL to it.
I have no idea what the overheads might be for their "handprinting" algorithms, or how effective they are. But I've wondered in the past whether something vaguely similar could be achieved by for example hashing each stream and headers separately in audio and video files, or each file within an archive. The same could apply to any container format. That would certainly deal with e.g. the same mp3 with different ID3 tags, and the overheads might be lower. Could get messy though, I guess.
Oh no... it's the future.
Thought of this a while back on a similar subject: taking bittorrent as an example you track a file (or set of files) and the torrent has presized chunks that are hashed. A simple extension may be to change that relationship from one to many to many to many. Share a set of chunks and this particular torrent uses the following chunks (identified by their hash). Obviously this would cause overhead issues but would address the issue in TFA (which, being new here I did not read).
I think everyone posting above saying "it won't work" should RTFA.
This works by breaking files down into clusters and hashing the clusters (like Bittorrent already does). Then it searches for other shares that have clusters with the same hash value, and requests them.
Assuming that the hashing scheme being used it "good" in that there are no collisions, two clusters with the same hash will contain the *exact same* information.
Should work just fine.
Of COURSE all files are "basically" the same, after all its just a set of 1s and 0s, and given that you already have lots of 1s and 0s on your machine this means that you already have the file even before you download it. It reminds me of Eric Morcambe and Andre Previn Previn: You were playing all the wrong notes Morcambe: No I was playing all the right notes just not necessarily in the right order
An Eye for an Eye will make the whole world blind - Gandhi
Sounds like he's trying to apply the concepts behind Single Instance Storage to P2P applications, which makes a certain amount of sense, but is sure to be a fair bit more difficult to track the hashes for all the file chunks of all the files available on the network (as opposed to comparing file hashes during a backup or across a LAN).
It might be a bit easier if P2P apps were more aware of the types of files they were transferring, and could make intelligent decisions about how to split up the files into data chunks.
This is just an illustration of the fact that P2P is an incredibly inefficient way of transferring files around. Most of the material is not only pirated, but a big fraction of the pirated material is the same stuff. P2P "peers" aren't necessarily nearby, either in a physical or bandwidth sense. So huge amounts of bandwidth are being spent shipping the same stuff around.
If it weren't for the piracy issue, the daily output of the RIAA, which is a few gigabytes, could be distributed efficiently by putting MP3s in a Usenet group. With Usenet's distribution mechanism, which is a flooding P2P system, nothing travels over a path more than once.
To paraphrase Morbo: "DOWNLOADS DO NOT WORK LIKE THAT!"
Now it would be GREAT if someone did manage to do that. Split the video from the audio (and from the sub-titles). And maybe create a meta-package.
And maybe if those researchers focus on that, this will be a better idea.
But that would ONLY work for material that could be split like that. If it's a song, what are you going to split? An ISO image? Same question.
So does this mean I go to download a picture of anyone, say Jennifer Aniston, I might get a picture of a monkey, just because they are 99% (or whatever) the same genetically.
I think I speak for all males when I say: Not cool man... not cool!
a new P2P app call BET promises faster downloads, utilizing the tried-and-true method of giving a file a set time limit to download then filling in random bytes after that.
I think Kazaa did something like this, which is partly to blame for the horrible malformed mp3 files that flowed through that network.
The RIAA is going to absolutely hate any research in this area that can improve P2P performance in any manner. And especially by a university, no less. Those hot beds of piracy don't deserve public money at all, when they spend it like this!!
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
"but are otherwise 99 percent similar.""
Sounds like all of Pop Music and most movies these days. Since the latest greatest movies are remakes and I, II, III, IV, V versions of the same theme.
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
It depends on how they calculate "similar". If they run checksums on the chunks and submit a query to other machines on the network that have pieces with identical chunks, then it would be valid to download. I'm pretty sure a few P2P services in the pre-bittorrent days did something like this, files with identical hashes would be grouped together.
;)
But the article makes it sound like their custom software breaks up media files into their component streams, which clients can download separately as they desire. Pull the english audio stream from Computer A, the video stream from B, etc. Once downloaded the client automagically reassembles it into a single file.
Writing a client that can disassemble and reassemble all known media types sounds absolutely horrifying though.
Anything compressed/encrypted won't work so well. Unless it is just a mislabeled peice of music. If you google around for Low Bandwidth File System (LBFS) you'll see what technique the article is really talking about.(disclaimer -- I didn't read the article either) Variable Length chunking will handle cases where new data is inserted halfway into the file, however with compression that extra data will end up changing the whole damned file.
So let say person A wants to download copyrighted material. They would connect to a tracker who would when tell person A where to get chunk 1 from. This chunk could come from copyrighted data or public domain data. What the tracker stores isn't the actual copyrighted data, but offsets, lengths, and ip addresses of where to find particular chunks.
In essence, if you were downloading a music CD, the data chunks could be coming from someone who has an Ubuntu ISO image, or some type of copyrighted material.
With this system it essential becomes impossible to tell who is uploading what.
They are talking about the chunks being the same, not the entire song. If you had a massive lookup table for md5's or something you could easily get a listing of every(body| torrent) that has the chunk you are looking for. Good idea but sounds quite difficult to maintain.
;)
Note: didn't RTFA
I thought of this a few years ago. The idea seemed an obvious one. There'll always be blocks of repeating data(e.g. FF00FF00), and exe, zip, rar, mp3, etc. headers with many similarities. If one can make the algo vary the size of the chunks it uses then you can derive your data from lots of different sources including getting data from image files that are to be applied to an .exe. The key would be to be able to recreate as much as possible from the similar data and the checksum/CRC using some of today's error correcting technology. It wouldn't be flawless of course but well worth the effort.
0x09F911029D74E35BD84156C5635688C0
Storage is cheap, bandwidth more expensive. Why not chunk up each file into many different permutations of its compressed data, with the variants recorded in the local index by fingerprint? Those fingerprints of unique chunks and the list of chunks to files can be maintained in the distributed index of many sites to each fingerprinted chunk. That would make more chances for a given content site to have a chunk that's identical to the one looked for, even if the chunk originated in a different file.
At some point, this protocol gets so far away from merely specifying a file's index in a local list to its specified remote storage server (eg. a tinyurl) for its monolithic compressed content (the best bandwidth for the least cacheable flexibility) that it's transferring more data among the distributed servers than that basic protocol. There's got to be some kind of "topological calculus" of the network connectedness and edge capacity vs node capacity that specifies the optimal distribution allocation of data to indices. Anyone?
--
make install -not war
Pretend that you're part of a swarm.
... and you're downloading a movie. Why am I (and a million others like me) going to be connecting to your box, asking your processor whether you have a file with checksum ghskldkjasa198d.a8.3ep ?
Your computer would then go through ALL YOUR FILES and advertise the md5 checksums to everyone.
Normally, you just advertise the blocks for the file that you're downloading.
So, I'm downloading a Debian iso
Normally, I would not even be talking to your box.
Suddenly, your bandwidth is gone from a million requests that have nothing to do with your download.
sheesh, it's not even in the summary, it's in TFA title, I also hate to point out the obvious, but I don't need to be a researcher at CMU to realize that if all those split files were put together it would be easier and faster to download the file... talk about pointing out the obvious
This could be a great tool for distributing updates. Say, if you already downloaded one DVD iso image for your favorite Linux distro, it could save a lot of time over downloading a whole new DVD iso. Even for smaller files or individual packages it could be really handy. I know there are already tools for generating rpm deltas and such, but if it could be transparant, it could really save a lot of hassle as well as bandwidth.
-- Knowledge shared is power lost. -- Aleister Crowley
What about splitting files in smaller chunks so that the possibility that separate files contain the same chunk of data gets more likely. The downloading is then directed by some chunk ids for example (global chunk id database). To download the full file you just need to download all the chunks and then the file is composed using these chunks and some ordering information. This might be just a stupid idea that won't work in practice but it might be interesting to see a porn movie that is composed from parts of Slashdot comments.
For single songs (mp3's or even flac) the time spent hunting down other bits doesn't seem like it would be any better than just downloading that song from one person.
...) but not in any other circumstances.
And things like md5 are useful because there is such a low probability of collisions (two different files having the same md5 checksum).
And by that same token, the likelihood that two different songs would have blocks of the exact same bits in a block is practically zero.
Their system WOULD work for movies IF they had previously incorporated the changes I mentioned (split the video from the audio from the sub-titles from the
And as I also had mentioned, this would do nothing more than suck up ALL your bandwidth as people who aren't even searching for songs you've listed as "sharable" are hitting your machine in the hopes of finding a packet with a specific md5 checksum.
So I'm downloading a Debian iso and I'm hitting your mp3 collection looking for blocks that match. Nope. Bad idea.
... that hashes collide, all the time. It probably won't collide over a large data chunk, but if you split the data chunks into $number chunks and send around MD5's (or other hashes) for that, you'll multiply the possible collisions by $number.
The only solution therefore is to create a one-to-one hash for each chunk, but then you could just as well transfer the data, because the hash size = chunk size.
Therefore, this approach won't work. Because, say you are transferring an OGG file (of your favorite indie band under the CCL of course) and you query for all people (multicast) for a chunk with that hash. A chunk of my LaTeX document could possibly have the same hash, so I send it to you. Since you have programmatically no way of checking the source of my chunk, and the hash is the same, your program will accept it and you'll get a corrupted OGG. Solution, rehash the whole file afterwards to check if it's still good, problem then is that you'll have to restart all the way, since you have no way of checking which chunk was wrong. You can off course, get sub-solutions off that to be more precise and loose less data, but the complexity and cost of such algorithms increases with such rate that it won't be worth implementing it.
Custom electronics and digital signage for your business: www.evcircuits.com
I haven't read the detailed paper, but it seems to me that there could be problems in finding similarity when there are random insertions. This is analogous to protein sequence matching (peripherally my field of research) when there are random insertions in the primary sequence. So if you have two identical 16k files, eight 2k chunks will be identical to each other. However, if you insert 512 bytes at the start of one file, eight 2k chunks will be different.
At their fundamental level, all files are essentially similar. They're encoded as 1's and 0's. So, wherever a file happens to call for a 1, you should be able to just pull that 1 from ANYWHERE. Even some random file on your local hard drive. And likewise for zeroes. All you need is a smart download algorithm to re-assemble the 1s and 0s in the correct order, and you're set.
You see? You see? Your stupid minds! Stupid! Stupid!
Part of the problem is that saturating one's download hurts one's upload. It also hurts otners as well.
"P2P is very inefficient.. but the problem is that means that are maximally efficient (e.g. proxies, usenet, etc.) are inaccessible to the masses."
Nonsense. The only thing that's making them inaccessable is the illegal nature of most of the content. One of these days humanity will understand cause and effect and that bad actions have bad consequences.
Shareaza has been doing this for years. When hashing MP3 files, it disregards what(s in the id3 tags, and just computes a hash for the audio information. This means that files with different id3 tags will still be added to the swarm, whicj is great.
Unfortunately, there are some issues with it:
-Only Shareaza supports it, other clients didn't want to play along.
-Shareaza has/d a bug where it would fail to reconstruct the id3 tag after downloading, giving you files with empty tags
-Only mp3 is supported, so no ogg, aac or wma
So this paper isn't as revolutionary (if that's what they mean).
This will only work with identical files that have metadata that is frequently changed by end-users, because there's no way you're going to be able to get a good file if you try to mix a cam with a dvdrip, or an ogg with an mp3, or an xvid file with a divx file. It just doesn't work that way.
It's called Byte Caching or Dictionary Compression. It's been used for WAN optimization for years and it works very well at taking data off the network. It hasn't been done on a P2P level, but the technology is very tried and tested.
ID3v1 tags won't pose a problem for this, they occur at the end of the file, i.e. the last chunk.
But ID3v2 files occur at the start of the file and have variable size. AVI files might have similar video streams but different language audio tracks, or be interleaved slightly differently, and so forth.
So although similar information might exist in these files, the chances of that information laying exactly on the same chunk boundaries, and thus the chunks having matching MD5s, is pretty low I bet. Even a 1ms delay in a CD rip could throw off all the MD5s _and_ encoded data.
It could be effective _if_ the server had parsers for the various file types and could separate meta data and streams withing a file, and the client could correctly re-assemble the media. Chances of a buggy parser or muxor destroying your file: high.
Let's go with a fairly vanilla scenario:
I grab an mp3 from person A. I then clean up the tag and rename it to suit me.
You want to download that same song with a different name and different tag.
You connect to person B sharing it. If you're using BitTorrent, you can also connect to any of 99 other people trying to download it from person B.
Using the new model, you could also connect to person A and myself and download the blocks that are the same.
So instead of only...
99 people in the swarm and 1 seeder
you'd have 99 people in the swarm, 1 seeder, person A and myself.
But in order to FIND person A and myself, you'd have to go through A MILLION OTHER PEOPLE to find if they have any blocks that you are looking for.
The CRITICAL PART THAT THEY LEFT OUT is the amount of bandwidth you'd be using to search A MILLION unrelated systems with unrelated files looking for those blocks.
This works in their lab because they have very few machines with very few files and they've already pre-loaded those machines with the files they want to be found.
If a client recreates a file from "similar" pieces, is it a derivative work?
This will be great for porn. Nothing worse than downloading 45 seconds of the last 3 minute video you downloaded. Now, I'll be able to tell which are just clips and which are the whole thing.
This is a good idea, and it delights me to see someone implement it. It reminds me a lot of Venti, the Plan 9 Archival Storage System (http://plan9.bell-labs.com/sys/doc/venti.html). An interesting side-effect of this is that, given a large variety of files in the system, it becomes a distributed look-up table of hash values. Any cryptologists out there need something like that as a resource? :)
Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar."
If the files differ ONLY in the artist-and-title headers, wouldn't they be "otherwise 100% identical"? I'd lump that in with "irregardless" and "I couldn't care less" as great moments in speaking(or typing) english.
I tried downloading "My Sweet Lord" by George Harrison, but I got "He's So Fine" by the Chiffons instead.
Intron: the portion of DNA which expresses nothing useful.
So, you speed up your download by grabbing identical chunks from "different" files. But, who/what decides which "different" end-piece is going to get downloaded? Are there serious implications to this? Let's say I reverse engineer an installer program and recompile it to install a trojan when SETUP.EXE is run. You go to P2P and there are two peers offering SETUP.EXE: my client and another client with the non-trojan version. How does the end-client then decide, from the two chunks that are different, which one to grab?
What the parent is saying can be summarized with a simple example:
;)
A 200MB, 30min video that was compressed at 1000kbps DiVX is not the "same file with minor changes" as a 200MB, 30min video that was compressed at 900kbps DiVX. They ARE different files and should be treated as such. You also can't deduce anything from their filenames, play length, or any other characteristic so how would you determine which ones can go together and which ones can't? I did not see codecs or compression mentioned at all in the article.
This is the fundamental problem here. You can't recombine video and audio files unless they ARE the same file. You have to account for different bitrates, compression ratios, and who knows what else (I am no expert in this area but this seems obvious...).
Lemme guess -- the mp3s mentioned in the article were ALL encoded at the same bitrate, right? If not, then please correct me because now you have my attention
I just read the paper. Not a single mention of different bitrates, codecs, or compression ratios that could be present in any of the files you guys are dealing with.
See my comment below and please explain this. I think a lot of us are missing this piece.
I like downloading sourcecode via p2p
I hope this technique will speed this up by finding equivalent sourcecodes...
(*eherm* undecidability *eherm*)
The MAFIAA is a bunch of mindless jerks who will be the first up against the wall when the revolution comes
This seems to be an intractable problem.
How do you know a file is similar? By hashing? There's no guarantee that a particular chunk of a file with an md5 hash (for example) contains the same bytes as that of another file.
There are 2^256 possible chunks of 256 bits of data. There are 2^16 possible hashes with (using a 16 bit binary key) for that same data. That means that for every hash match, the data has a 1 in 16 chance of actually matching.
You can extend the key length to reduce this ratio, but you'll end up with a key length equal to your data size before you're sure the data is not a collision.
The problem gets worse if the chunks of data aren't equal in size.
This can only work if you have a centralized database of every possible file combination on your network. It's workable for a small amount of files, but will grow exponentially in a real environment. Not to mention, the centralized database would have to handle a significant amount of traffic, reducing the speed gains possible.
Count me as skeptical.
If moderation could change anything, it would be illegal.
I recall hearing a story on NPR Music a few weeks ago about someone who plugged a CD into itunes and had it come up with the right piece but the wrong performer. Then it happened again, with the same ostensible pianist but yet another "wrong" performer. Detailed analysis showed the pianist had apparently published CDs claiming to perform pieces but actually substituting the work of others! Itunes must have used a signature over the content to index the piece by the earlier CD.
Similarity detection rules.
Mencken had it right. So glad that's old news.
So does ed2k, Bittorrent, ..., ..., .... this is hardly news. Even plain ol' FTP and HTTP can do this to a degree.
As far as the 99.9% similar "speedup"... I seriously doubt that you'll see any gains other than in lab conditions. MP3 is about the only format that might be agreeable to this, since I imagine its reasonably common for people to fix ID3's and then share the modified file. I just don't see it happening for other formats (.avi, .zip, .rar). And unless you've still got a 14.4 modem I doubt you'll notice the speedup even with MP3's, since they are so small to begin with.
this is already being done on numerous P2P networks. Ares, Limewire, etc. They all use this approach. This guy is just a moron.
This is nothing different from the way DHT bittorrent matching sections of the files, and ignoring the bad hash chunks
It's not even a new idea. You can generate super-efficient patches of files if you had local knowledge of which sections of the file are meta-data such as headers, relocation tables, etc and which is actual data. You could expand this further to generate par2-like differential patch data, and then just bittorrent all the chunks down with differential regeneration on the client-end. Sounds a lot like DHT again.
A little late for April Fool's day, isn't it? This can never work...
So how similar are Debian and Ubuntu, these days?
The higher the technology, the sharper that two-edged sword.
The Jigsaw Downloader (jigdo) can be used for this, it was developed to help with distribution of Debian ISO images. It is being used to build Debian ISO images from the packages located on the Debian mirrors. You don't have to worry about distributing the ISO images to the mirror sites or need additional disk space to store them either. The best thing is that you can update your local ISO to include new packages, etc, as things change. Have a look at http://www.debian.org/CD/jigdo-cd/ and http://atterer.net/jigdo/
Comment removed based on user account deletion