Faster P2P By Matching Similiar Files?
Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files.
"SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""
It depends on how they calculate "similar". If they run checksums on the chunks and submit a query to other machines on the network that have pieces with identical chunks, then it would be valid to download. I'm pretty sure a few P2P services in the pre-bittorrent days did something like this, files with identical hashes would be grouped together.
;)
But the article makes it sound like their custom software breaks up media files into their component streams, which clients can download separately as they desire. Pull the english audio stream from Computer A, the video stream from B, etc. Once downloaded the client automagically reassembles it into a single file.
Writing a client that can disassemble and reassemble all known media types sounds absolutely horrifying though.
Anything compressed/encrypted won't work so well. Unless it is just a mislabeled peice of music. If you google around for Low Bandwidth File System (LBFS) you'll see what technique the article is really talking about.(disclaimer -- I didn't read the article either) Variable Length chunking will handle cases where new data is inserted halfway into the file, however with compression that extra data will end up changing the whole damned file.
So let say person A wants to download copyrighted material. They would connect to a tracker who would when tell person A where to get chunk 1 from. This chunk could come from copyrighted data or public domain data. What the tracker stores isn't the actual copyrighted data, but offsets, lengths, and ip addresses of where to find particular chunks.
In essence, if you were downloading a music CD, the data chunks could be coming from someone who has an Ubuntu ISO image, or some type of copyrighted material.
With this system it essential becomes impossible to tell who is uploading what.
DHT even supports partially corrupted files, your client just discards the corrupt data.
My question is, why would I want to use SET over DHT? Does SET not need a ceneralized server, or does it have any other advantage at all?
TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.
Obligatory Soundbite Catchphrase
It's more then a few. Most people use the default settings in their audio ripper/compression program, and it's all from the same CD. Even more people never uses an audio ripper and/or compressor, and simply downloads the file from the Internet. Not that many people bother to change ID3-tags either, but every single person that do, leads to a different file.
My impression is that people are going out of their way to find specialized P2P programs that offer any advantage. And that very many people are willing to spend $100 in work-time finding a CD worth $10 for "free". The market has spoken, people are not acting the way you want them to.
If I download a song with bittorrent then change the tag, it doesn't treat it as a different file. It treats it as a corrupted file. It does indeed recognize that it is 99% similar, and it can use that file to seed the similar parts. How is this novel again?
Give me Classic Slashdot or give me death!
So if someone is sharing an older ISO, and it happens to have large portions that exactly match the one you're downloading, with other portions that are not identical, you don't want to download the identical chunks off that person?
It would be interesting if the implementing software could also look for possible matches within your existing file structure and reduce the data downloaded automatically, kind of like using diff and just downloading the patch.
-1 Uncomfortable Truth
I don't think this is just about inconsistent metadata.
I think what he's talking about may be more like the document fingerprinting algorithms used to pare search engine results, or to detect plagiarism in student papers.
In some cases you will be downloading components of a file from two sources, neither of which have the others' component. The example in TFA was downloading the video portion of a movie from a foreign language site and the audio from a site with the language you speak but less bandwidth.
I suppose another example would be that if you were downloading an anthology of stories, you could take a particular story from a server that hosted a different anthology including that story. Or maybe you are downloading the new distro; you could take some of your files from sites offering the distro version you are looking for, some from sites only offering the files you need to upgrade to that version, and some from entirely different distros or much older versions if they happened to be the same.
I guess it could be thought of as a kind of "fuzzy akamai".
It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
If I recall correctly, DHT takes the file name into account when calculating the hash. Thus identical files with different names are treated differently.
Some P2P protocols allow looking up a file by a hash which does not take filename into account, but this will not handle the case where the files differ in only one small section. The best example is the following:
Person downloads an MP3.
Person finds that the MP3 is not properly tagged (for example, has a comment field saying who ripped it/released the rip, and has no track number.)
Person changes the MP3's ID3 tag
Now, nearly all existing P2P protocols will treat the new file as a completely different file, when in reality the most important contents (the audio itself) have not changed, only the file's metadata.
Other users will go for the "full-file" match with the largest number of sources, thus causing the mistagged MP3 to propagate more than a "fixed" one.
So a P2P system that ignores the ID3 tag when hashing would have significant advantages, in which the user could download the file from many sources and then choose which source to get their metadata from.
retrorocket.o not found, launch anyway?
I haven't read the detailed paper, but it seems to me that there could be problems in finding similarity when there are random insertions. This is analogous to protein sequence matching (peripherally my field of research) when there are random insertions in the primary sequence. So if you have two identical 16k files, eight 2k chunks will be identical to each other. However, if you insert 512 bytes at the start of one file, eight 2k chunks will be different.
No, it's a different kind of identifying which blocks are interesting in a swarm download.
.torrent files are needed at all. It's more efficient on the tracker, too.)
...which is why it's a hot research topic.
Basically - and this isn't the first time this idea has been tabled, I have an unpublished paper and a reference implementation - chuck BT's idea in the bin, as it uses lists of SHA1 hashes and that's not suitable for this. Shareaza's better placed to do this technically, but you could of course adapt torrent.
What you actually want to use is a TTH - Tiger Tree Hash (THEX standard). That's a Merkle hash tree based on TIGER192 (it's commonly represented in base-32, and is for example used by DC++ and Shareaza, and is a major part of the magnet: standard). The whole thing identifies the file, but that's a hash of a tree of hashes that progressively identify smaller parts of the file. You can exchange the leaves of the tree to any convenient depth and easily verify they're correct.
(The side effect of this is that standard magnet: links or lists of magnet: links work; no strange special
If you can search for the leaves you're interested in (and a distributed Bloom filter can make that very efficient indeed), you'll get matches from not just the file you wanted, but any identical blocks found in any other files - the same bits by any other name would smell as sweet. So peers with slightly different files can provide partial seeds to the swarm too, and vice versa.
This is useful if you have a common element between download groups. There's no need for a "batch torrent" any longer, because all the grouping would be done automatically. Media files who differ only in tags would be automatically matched and swarmed together as a matter of course.
More creatively, it could be used to swarm patches to a large group of clients, as this technique can efficiently perform a binary diff-based distributed rsync to millions of clients...
It doesn't have to operate on every file on the filesystem. It's perfectly capable of identifying common blocks among files you happen to be swarming (downloading/hosting) or seeding (hosting), and that's more than useful enough.
Or you can maintain a set of managed/hosted files, exactly like if they were on an rsync server.
As for network efficiency, this isn't Gnutella, we've learned a lot in years of research since then, such as how to make searches that scale well. Very, very well. Well enough to be able to efficiently search 10-million strong distributed networks for hash-tree leaves.
We don't have to ask you if you have any interesting blocks directly; we already have an excellent structure that lets us know if either you (or your section of the distributed hash table, as we drill down) haven't, or you probably might - say hello to the Distributed Bloom Filter.
Those million requests? More like ten or so, total, in itty bitty packets that would be totally dwarfed by the data transfer you'd probably be making, spread out throughout the network and not concentrated on any one host.
Does the idea sound more reasonable now?
If a client recreates a file from "similar" pieces, is it a derivative work?
I disagree, this could be done with No Change to any protocol, client, or server: the only thing that needs created is a torrent creator located on a machine that had a full version of every "simular" torrent to be shared. the torrents would all be linked read only to the same DVD image (for exampl), only part of that DVD image would be labeled as "downloaded" and you would put the client into upload only mode on this host machine. The downloaders would have no idea this same data was shared under many different torrents.
For example, My company distributes our manuals as DVD's, different DVD for every machine Type. These go to 100's of locations. We put the same engine in 50 different machines. Also we do updated DVD's, that may just add a few pages of options (for example.)
If we distributed DVD images by bittorent, with a server at every one of our distributor. It would be as simple as patching the existing torrent image at the distributor.
IE Distibutor X may only have the DVD only for model ZZ hosted on their server. Instead of having a single torrent served from their chunk of data, Our main office that has all of the DVD's we ever produced, could just send them a list of torrent files, that tells that 1 server, what chunks of their file are appliciple on all of our DVD's. Now they may be able to host a 100 different torrents, all saying I got 10% of model ZX I got 5% of model ZF...
This seems to be an intractable problem.
How do you know a file is similar? By hashing? There's no guarantee that a particular chunk of a file with an md5 hash (for example) contains the same bytes as that of another file.
There are 2^256 possible chunks of 256 bits of data. There are 2^16 possible hashes with (using a 16 bit binary key) for that same data. That means that for every hash match, the data has a 1 in 16 chance of actually matching.
You can extend the key length to reduce this ratio, but you'll end up with a key length equal to your data size before you're sure the data is not a collision.
The problem gets worse if the chunks of data aren't equal in size.
This can only work if you have a centralized database of every possible file combination on your network. It's workable for a small amount of files, but will grow exponentially in a real environment. Not to mention, the centralized database would have to handle a significant amount of traffic, reducing the speed gains possible.
Count me as skeptical.
If moderation could change anything, it would be illegal.
I recall hearing a story on NPR Music a few weeks ago about someone who plugged a CD into itunes and had it come up with the right piece but the wrong performer. Then it happened again, with the same ostensible pianist but yet another "wrong" performer. Detailed analysis showed the pianist had apparently published CDs claiming to perform pieces but actually substituting the work of others! Itunes must have used a signature over the content to index the piece by the earlier CD.
Similarity detection rules.
Mencken had it right. So glad that's old news.
Correct (and the parent is correct as well). We didn't exhaustively characterize the reasons that the files differed, but most of the cases appeared to be modifications _after_ encoding. We have on the TODO list to test encode the same CD multiple times with the same settings to see if it produces a similar file, but we haven't run the test yet. There were tons of cases of metadata-altered MP3s, many cases of video files with the same video content but different audio or subtitle information, and some cases where the files differed in "seemingly random" (aka, we're not sure why. :) parts of the file for no reason we've figured out yet.
.tar.gz file), etc. Part of what we really like about the SET technique is that it's able to speed transfers of all of those types of files without needing to have any file-type specific logic.
I think that for audio files, having a plugin for the p2p system that separated metadata from song information would probably capture a lot of the benefit we found. It's much harder for video files and for the files that had various weird changes that we don't have a good explanation for.
As other posters have suggested, the techniques we used also work well for software distributions sometimes. We've found around 10-30% similarity between different Linux ISOs and RPMs, *uncompressed* distributions of things like gcc (but not a