Faster P2P By Matching Similiar Files?

← Back to Stories (view on slashdot.org)

Faster P2P By Matching Similiar Files?

Posted by ryuzaki0 on Wednesday April 11, 2007 @03:45AM from the something-doesn't-jive-here dept.

Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files. "SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""

18 of 222 comments (clear)

Min score:

Reason:

Sort:

Re:This could work for some files, but not for oth by Xzzy · 2007-04-11 04:11 · Score: 2, Interesting

It depends on how they calculate "similar". If they run checksums on the chunks and submit a query to other machines on the network that have pieces with identical chunks, then it would be valid to download. I'm pretty sure a few P2P services in the pre-bittorrent days did something like this, files with identical hashes would be grouped together.

But the article makes it sound like their custom software breaks up media files into their component streams, which clients can download separately as they desire. Pull the english audio stream from Computer A, the video stream from B, etc. Once downloaded the client automagically reassembles it into a single file.

Writing a client that can disassemble and reassemble all known media types sounds absolutely horrifying though. ;)
Re:This could work for some files, but not for oth by Incy · 2007-04-11 04:13 · Score: 3, Interesting

Anything compressed/encrypted won't work so well. Unless it is just a mislabeled peice of music. If you google around for Low Bandwidth File System (LBFS) you'll see what technique the article is really talking about.(disclaimer -- I didn't read the article either) Variable Length chunking will handle cases where new data is inserted halfway into the file, however with compression that extra data will end up changing the whole damned file.
Blurring the Line Between Data by dduardo · 2007-04-11 04:14 · Score: 2, Interesting

So let say person A wants to download copyrighted material. They would connect to a tracker who would when tell person A where to get chunk 1 from. This chunk could come from copyrighted data or public domain data. What the tracker stores isn't the actual copyrighted data, but offsets, lengths, and ip addresses of where to find particular chunks.

In essence, if you were downloading a music CD, the data chunks could be coming from someone who has an Ubuntu ISO image, or some type of copyrighted material.

With this system it essential becomes impossible to tell who is uploading what.
1. Re:Blurring the Line Between Data by PaintyThePirate · 2007-04-11 05:17 · Score: 2, Interesting
  
  The chances of two 16kB chunks from a Ubuntu ISO and a music CD being identical are extremely unlikely, or for that matter, the chances of any two 16kbit chunks from any two non-similar (meaning, in this case, files that are not of/from the same source [the same movie, album, program, whatever]) files being the same are incredibly slim.
  
  That being said, about 90% of the commenters missed the point completly, though it is somewhat understandable given how vague and nontechnical the linked article and summary are. The idea is that by hashing each 16kB chunk, one peer can download a single chunk from another if the chunk is identical but the entire file isn't. The word "similar" in the title does not mean files that somewhat resemble each other, it means files that have identical chunks.
  
  Of course, with such a system, one would have to deal with the overhead traffic of announcing the hashes of every 16kB chunk instead of simply the hash of the entire file. I imagine that in an actual implementation of this system, instead of a peer requesting hashes of every chunk that all of the other peers are sharing, the downloader would only request hashes of files with similar filenames or keywords.
  
  As in, if you search for an Ubuntu ISO, you won't be getting hashes of chunks from someone who is only sharing The Matrix Trilogy DVD-Rs; you would get hashes from people who are sharing ubuntu-6.10-server-amd64.iso, ubuntu-6.10-server-i386.iso, ubuntu-6.06-server-amd64.iso, etc. I would imagine that there would be many shared chunks between those files.
  
  Of course, I can't imagine such a system working on a P2P network like Bittorrent (except maybe with a modified version of DHT), but with a more centralized system, it could work.
  
  The only remaining problem I see is how to deal with byte offsets on chunks. I haven't read the actual paper on the topic yet, but I imagine something was done about it.
  
  --
  eclecti.cc
Re:Nickelback? by thepotoo · 2007-04-11 04:17 · Score: 5, Interesting

If you use bittorrent, the DHT protocol (supported by Azureus, BitComet, and uTorrent, among others) does the exact thing you're describing. It checks MD5 hashes for files (the whole file, not the pieces, I think), and connects you to peers which have the same file.
DHT even supports partially corrupted files, your client just discards the corrupt data.
My question is, why would I want to use SET over DHT? Does SET not need a ceneralized server, or does it have any other advantage at all?
TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.

--
Obligatory Soundbite Catchphrase
Re:Similar Files? by joto · 2007-04-11 04:31 · Score: 2, Interesting

Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same.

It's more then a few. Most people use the default settings in their audio ripper/compression program, and it's all from the same CD. Even more people never uses an audio ripper and/or compressor, and simply downloads the file from the Internet. Not that many people bother to change ID3-tags either, but every single person that do, leads to a different file.
And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike.

My impression is that people are going out of their way to find specialized P2P programs that offer any advantage. And that very many people are willing to spend $100 in work-time finding a CD worth $10 for "free". The market has spoken, people are not acting the way you want them to.
Re:Right.... by Hatta · 2007-04-11 04:35 · Score: 1, Interesting

If I download a song with bittorrent then change the tag, it doesn't treat it as a different file. It treats it as a corrupted file. It does indeed recognize that it is 99% similar, and it can use that file to seed the similar parts. How is this novel again?

--
Give me Classic Slashdot or give me death!
Re:TorrentSoup by ShieldW0lf · 2007-04-11 04:41 · Score: 3, Interesting

So if someone is sharing an older ISO, and it happens to have large portions that exactly match the one you're downloading, with other portions that are not identical, you don't want to download the identical chunks off that person?

It would be interesting if the implementing software could also look for possible matches within your existing file structure and reduce the data downloaded automatically, kind of like using diff and just downloading the patch.

--
-1 Uncomfortable Truth
Re:Nickelback? by hey! · 2007-04-11 04:41 · Score: 3, Interesting

I don't think this is just about inconsistent metadata.

I think what he's talking about may be more like the document fingerprinting algorithms used to pare search engine results, or to detect plagiarism in student papers.

In some cases you will be downloading components of a file from two sources, neither of which have the others' component. The example in TFA was downloading the video portion of a movie from a foreign language site and the audio from a site with the language you speak but less bandwidth.

I suppose another example would be that if you were downloading an anthology of stories, you could take a particular story from a server that hosted a different anthology including that story. Or maybe you are downloading the new distro; you could take some of your files from sites offering the distro version you are looking for, some from sites only offering the files you need to upgrade to that version, and some from entirely different distros or much older versions if they happened to be the same.

I guess it could be thought of as a kind of "fuzzy akamai".

It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Re:Nickelback? by Andy+Dodd · 2007-04-11 04:46 · Score: 4, Interesting

If I recall correctly, DHT takes the file name into account when calculating the hash. Thus identical files with different names are treated differently.

Some P2P protocols allow looking up a file by a hash which does not take filename into account, but this will not handle the case where the files differ in only one small section. The best example is the following:
Person downloads an MP3.
Person finds that the MP3 is not properly tagged (for example, has a comment field saying who ripped it/released the rip, and has no track number.)
Person changes the MP3's ID3 tag
Now, nearly all existing P2P protocols will treat the new file as a completely different file, when in reality the most important contents (the audio itself) have not changed, only the file's metadata.
Other users will go for the "full-file" match with the largest number of sources, thus causing the mistagged MP3 to propagate more than a "fixed" one.

So a P2P system that ignores the ID3 tag when hashing would have significant advantages, in which the user could download the file from many sources and then choose which source to get their metadata from.

--
retrorocket.o not found, launch anyway?
Problem with variable insertions? by digitalderbs · 2007-04-11 05:07 · Score: 2, Interesting

I haven't read the detailed paper, but it seems to me that there could be problems in finding similarity when there are random insertions. This is analogous to protein sequence matching (peripherally my field of research) when there are random insertions in the primary sequence. So if you have two identical 16k files, eight 2k chunks will be identical to each other. However, if you insert 512 bytes at the start of one file, eight 2k chunks will be different.
Re:Nickelback? by Anonymous Coward · 2007-04-11 05:13 · Score: 2, Interesting

No, it's a different kind of identifying which blocks are interesting in a swarm download.

Basically - and this isn't the first time this idea has been tabled, I have an unpublished paper and a reference implementation - chuck BT's idea in the bin, as it uses lists of SHA1 hashes and that's not suitable for this. Shareaza's better placed to do this technically, but you could of course adapt torrent.

What you actually want to use is a TTH - Tiger Tree Hash (THEX standard). That's a Merkle hash tree based on TIGER192 (it's commonly represented in base-32, and is for example used by DC++ and Shareaza, and is a major part of the magnet: standard). The whole thing identifies the file, but that's a hash of a tree of hashes that progressively identify smaller parts of the file. You can exchange the leaves of the tree to any convenient depth and easily verify they're correct.

(The side effect of this is that standard magnet: links or lists of magnet: links work; no strange special .torrent files are needed at all. It's more efficient on the tracker, too.)

If you can search for the leaves you're interested in (and a distributed Bloom filter can make that very efficient indeed), you'll get matches from not just the file you wanted, but any identical blocks found in any other files - the same bits by any other name would smell as sweet. So peers with slightly different files can provide partial seeds to the swarm too, and vice versa.

This is useful if you have a common element between download groups. There's no need for a "batch torrent" any longer, because all the grouping would be done automatically. Media files who differ only in tags would be automatically matched and swarmed together as a matter of course.

More creatively, it could be used to swarm patches to a large group of clients, as this technique can efficiently perform a binary diff-based distributed rsync to millions of clients... ...which is why it's a hot research topic.
Re:Think about how that would be accomplished. by Anonymous Coward · 2007-04-11 05:34 · Score: 2, Interesting

It doesn't have to operate on every file on the filesystem. It's perfectly capable of identifying common blocks among files you happen to be swarming (downloading/hosting) or seeding (hosting), and that's more than useful enough.

Or you can maintain a set of managed/hosted files, exactly like if they were on an rsync server.

As for network efficiency, this isn't Gnutella, we've learned a lot in years of research since then, such as how to make searches that scale well. Very, very well. Well enough to be able to efficiently search 10-million strong distributed networks for hash-tree leaves.

We don't have to ask you if you have any interesting blocks directly; we already have an excellent structure that lets us know if either you (or your section of the distributed hash table, as we drill down) haven't, or you probably might - say hello to the Distributed Bloom Filter.

Those million requests? More like ten or so, total, in itty bitty packets that would be totally dwarfed by the data transfer you'd probably be making, spread out throughout the network and not concentrated on any one host.

Does the idea sound more reasonable now?
An interesting licensing issue by DigitAl56K · 2007-04-11 05:39 · Score: 3, Interesting

If a client recreates a file from "similar" pieces, is it a derivative work?
Re:Right.... by Dare+nMc · 2007-04-11 05:49 · Score: 2, Interesting

The only way you could implement a system like this is to create an entirely new protocol, server software and client software.

I disagree, this could be done with No Change to any protocol, client, or server: the only thing that needs created is a torrent creator located on a machine that had a full version of every "simular" torrent to be shared. the torrents would all be linked read only to the same DVD image (for exampl), only part of that DVD image would be labeled as "downloaded" and you would put the client into upload only mode on this host machine. The downloaders would have no idea this same data was shared under many different torrents.

For example, My company distributes our manuals as DVD's, different DVD for every machine Type. These go to 100's of locations. We put the same engine in 50 different machines. Also we do updated DVD's, that may just add a few pages of options (for example.)

If we distributed DVD images by bittorent, with a server at every one of our distributor. It would be as simple as patching the existing torrent image at the distributor.

IE Distibutor X may only have the DVD only for model ZZ hosted on their server. Instead of having a single torrent served from their chunk of data, Our main office that has all of the DVD's we ever produced, could just send them a list of torrent files, that tells that 1 server, what chunks of their file are appliciple on all of our DVD's. Now they may be able to host a 100 different torrents, all saying I got 10% of model ZX I got 5% of model ZF...
Don't see how it could work on a large scale. by Bluesman · 2007-04-11 07:21 · Score: 2, Interesting

This seems to be an intractable problem.

How do you know a file is similar? By hashing? There's no guarantee that a particular chunk of a file with an md5 hash (for example) contains the same bytes as that of another file.

There are 2^256 possible chunks of 256 bits of data. There are 2^16 possible hashes with (using a 16 bit binary key) for that same data. That means that for every hash match, the data has a 1 in 16 chance of actually matching.

You can extend the key length to reduce this ratio, but you'll end up with a key length equal to your data size before you're sure the data is not a collision.

The problem gets worse if the chunks of data aren't equal in size.

This can only work if you have a centralized database of every possible file combination on your network. It's workable for a small amount of files, but will grow exponentially in a real environment. Not to mention, the centralized database would have to handle a significant amount of traffic, reducing the speed gains possible.

Count me as skeptical.

--
If moderation could change anything, it would be illegal.
perhaps 100% similar by Frisky070802 · 2007-04-11 07:46 · Score: 2, Interesting

I recall hearing a story on NPR Music a few weeks ago about someone who plugged a CD into itunes and had it come up with the right piece but the wrong performer. Then it happened again, with the same ostensible pianist but yet another "wrong" performer. Detailed analysis showed the pianist had apparently published CDs claiming to perform pieces but actually substituting the work of others! Itunes must have used a signature over the content to index the piece by the earlier CD.

Similarity detection rules.

--
Mencken had it right. So glad that's old news.
Re:mod this up -- angio...this is the problem by angio · 2007-04-11 15:59 · Score: 2, Interesting

Correct (and the parent is correct as well). We didn't exhaustively characterize the reasons that the files differed, but most of the cases appeared to be modifications _after_ encoding. We have on the TODO list to test encode the same CD multiple times with the same settings to see if it produces a similar file, but we haven't run the test yet. There were tons of cases of metadata-altered MP3s, many cases of video files with the same video content but different audio or subtitle information, and some cases where the files differed in "seemingly random" (aka, we're not sure why. :) parts of the file for no reason we've figured out yet.

I think that for audio files, having a plugin for the p2p system that separated metadata from song information would probably capture a lot of the benefit we found. It's much harder for video files and for the files that had various weird changes that we don't have a good explanation for.

As other posters have suggested, the techniques we used also work well for software distributions sometimes. We've found around 10-30% similarity between different Linux ISOs and RPMs, *uncompressed* distributions of things like gcc (but not a .tar.gz file), etc. Part of what we really like about the SET technique is that it's able to speed transfers of all of those types of files without needing to have any file-type specific logic.