Faster P2P By Matching Similar Files?

Posted by ryuzaki0 on Wednesday April 11, 2007 @03:45AM from the something-doesn't-jive-here dept.

Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files. "SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""

21 of 222 comments

Nickelback? by onemorehour · 2007-04-11 03:46 · Score: 5, Funny

Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.

Well, sure, if you're only looking at Nickelback songs.
1. Re:Nickelback? by beckerist · 2007-04-11 03:54 · Score: 4, Informative
  
  The idea is that the song "Girl Talk - Once Again" might be represented as "girltalk-onceagain" "girl talk - once again" "GirlTalk - Once Again" "01-once again.mp3" "OnceAgain.mp3"
  
  Being that the only difference is just the text (ID3, ID3-2) tags, the rest of the song is exactly the same, so why can't you use that as a download source too? I personally organize all of my music, and because of this P2P programs believe that it's an entirely new file, when really it was just renamed and the header information was changed (generally to be grammatically correct.)
2. Re:Nickelback? by thepotoo · 2007-04-11 04:17 · Score: 5, Interesting
  
  If you use bittorrent, the DHT protocol (supported by Azureus, BitComet, and uTorrent, among others) does the exact thing you're describing. It checks MD5 hashes for files (the whole file, not the pieces, I think), and connects you to peers which have the same file.
  DHT even supports partially corrupted files; your client just discards the corrupt data.
  My question is, why would I want to use SET over DHT? Does SET not need a centralized server, or does it have any other advantage at all?
  TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.
  
  --
  Obligatory Soundbite Catchphrase
3. Re:Nickelback? by hey! · 2007-04-11 04:41 · Score: 3, Interesting
  
  I don't think this is just about inconsistent metadata.
  
  I think what he's talking about may be more like the document fingerprinting algorithms used to pare search engine results, or to detect plagiarism in student papers.
  
  In some cases you will be downloading components of a file from two sources, neither of which has the other's component. The example in TFA was downloading the video portion of a movie from a foreign language site and the audio from a site with the language you speak but less bandwidth.
  
  I suppose another example would be that if you were downloading an anthology of stories, you could take a particular story from a server that hosted a different anthology including that story. Or maybe you are downloading the new distro; you could take some of your files from sites offering the distro version you are looking for, some from sites only offering the files you need to upgrade to that version, and some from entirely different distros or much older versions if they happened to be the same.
  
  I guess it could be thought of as a kind of "fuzzy akamai".
  
  It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models.
  
  --
  Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
4. Re:Nickelback? by Robotech_Master · 2007-04-11 04:46 · Score: 3, Informative
  
  If I understand the article right, SET looks at individual files within a particular download. DHT just looks at the whole download.
  
  For instance, if I'm uploading my "Songs I Like to Dance To" mp3 mix, and someone else is uploading an "All-Time Greatest Dance Hits" CD rip, and there are a couple of songs both uploads have in common, SET would enable someone downloading my MP3 mix to treat the CD rip as a partial seed (and vice versa), and pull down the songs held in common from either one.
  
  Whereas DHT would simply enable people to pull down my mix from other people uploading the mix, or the CD rip from other people uploading the CD rip, even if the tracker was down. (If I understand what DHT does correctly. Which it is possible I don't.)
  
  --
  Editor Emeritus and Senior Writer, TeleRead.org
5. Re:Nickelback? by Andy+Dodd · 2007-04-11 04:46 · Score: 4, Interesting
  
  If I recall correctly, DHT takes the file name into account when calculating the hash. Thus identical files with different names are treated differently.
  
  Some P2P protocols allow looking up a file by a hash which does not take filename into account, but this will not handle the case where the files differ in only one small section. The best example is the following:
  Person downloads an MP3.
  Person finds that the MP3 is not properly tagged (for example, has a comment field saying who ripped it/released the rip, and has no track number.)
  Person changes the MP3's ID3 tag.
  Now, nearly all existing P2P protocols will treat the new file as a completely different file, when in reality the most important contents (the audio itself) have not changed, only the file's metadata.
  Other users will go for the "full-file" match with the largest number of sources, thus causing the mistagged MP3 to propagate more than a "fixed" one.
  
  So a P2P system that ignores the ID3 tag when hashing would have significant advantages, in which the user could download the file from many sources and then choose which source to get their metadata from.
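
A rough sketch of what "ignoring the ID3 tag when hashing" could look like, assuming the tags sit where the ID3 formats normally put them: an ID3v2 block at the start of the file (10-byte header with a synchsafe size) and a 128-byte ID3v1 block at the end. The names `strip_id3` and `content_key` are made up for illustration; this is not what SET or any particular client does.

```python
import hashlib

def strip_id3(data: bytes) -> bytes:
    """Return the payload with a leading ID3v2 tag and trailing ID3v1 tag removed."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 synchsafe size bytes.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)  # 7 useful bits per size byte
        start = 10 + size
        if data[5] & 0x10:                   # footer flag: 10 more bytes
            start += 10
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]

def content_key(data: bytes) -> str:
    """Hypothetical lookup key: a hash of the audio bytes only."""
    return hashlib.sha1(strip_id3(data)).hexdigest()
```

Two copies of the same rip with different tags would then share one `content_key`, even though a whole-file hash treats them as unrelated.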
  
  --
  retrorocket.o not found, launch anyway?
grea tide a by underwhelm · 2007-04-11 03:51 · Score: 5, Funny

I'm hoping this CATCHES ON and wet ransfer a11 sorts of information like this. It'11 be 1ike getting every thing in the form of a ransom n0te.

--
I don't need large brains to have a good time.
The music kids listen to by Vollernurd · 2007-04-11 03:51 · Score: 3, Funny

So it's not me then? All new tunes DO sound the same?

--
Smokey, this is not 'Nam, this is bowling. There are rules.
Summary: by PhrostyMcByte · 2007-04-11 03:53 · Score: 4, Informative

Instead of sharing files, divide them into 16KB chunks and share those, to help work around files that get renamed or trivially altered (e.g. a website tagging its URL onto all the files you upload).
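
A minimal sketch of that idea, assuming fixed 16 KB chunks and SHA-1 as the chunk identifier (the hash choice and the function names are assumptions, not details from the article):

```python
import hashlib

CHUNK_SIZE = 16 * 1024  # 16 KB, the figure quoted in the summary above

def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Hash each fixed-size chunk of the file independently."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def shared_chunks(wanted: bytes, other: bytes) -> set:
    """Indices of chunks in `wanted` that a peer holding `other` could serve."""
    available = set(chunk_hashes(other))
    return {i for i, h in enumerate(chunk_hashes(wanted)) if h in available}
```

With fixed offsets this only helps when the alteration does not shift the bytes after it: a URL appended to the end leaves every earlier chunk intact, while a few bytes inserted at the front change every chunk.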
Re:Snakeoil by Anonymous Coward · 2007-04-11 03:57 · Score: 3, Funny

this statement in particular is ludicrous. You don't listen to pop music, do you?
Re:Right.... by angio · 2007-04-11 04:01 · Score: 5, Informative

Take a peek at the paper - it actually does work, and we demonstrated it. The intuition: people make small changes to files like changing the artist or title in the MP3 header, and then BitTorrent and other systems treat this as a "different" file, when in fact it's 99.9% similar.
(Yes, I'm one of the authors.)
Re:TorrentSoup by joe_cot · 2007-04-11 04:06 · Score: 5, Informative

It would still work the same way as it does now: an md5 of each specific block, and an md5 of the whole thing. If the md5 for the block doesn't match, it's not going to download, and if it's someone using collision to inject a block with the same md5, 1) it's not going to pass the md5 on the whole thing, 2) you're already vulnerable to it. The reason this will work is that there'll be lots of people sharing incomplete or corrupted versions of your FreeBSD iso; you'll get the blocks that are good, and skip the blocks that aren't, making "similar" files very useful. Not too difficult to understand, and no need for tin foil hats.
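
The two-level check described above can be sketched roughly like this. MD5 is used only because the comment names it (it is no longer considered collision-resistant), and `assemble` with its `offers` argument is a hypothetical helper, not any client's real API:

```python
import hashlib

def md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def assemble(block_hashes, file_hash, offers):
    """Build the file from (index, data) offers sent by any peer.

    Blocks whose hash does not match the expected one are skipped, so a peer
    with a corrupted or merely "similar" file still contributes its good blocks.
    Returns the file bytes, or None if it is incomplete or fails the final check.
    """
    blocks = {}
    for index, data in offers:
        if not 0 <= index < len(block_hashes) or index in blocks:
            continue
        if md5(data) == block_hashes[index]:   # per-block check
            blocks[index] = data
    if len(blocks) != len(block_hashes):
        return None
    result = b"".join(blocks[i] for i in range(len(block_hashes)))
    return result if md5(result) == file_hash else None  # whole-file check
```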
Re:This could work for some files, but not for oth by Incy · 2007-04-11 04:13 · Score: 3, Interesting

Anything compressed/encrypted won't work so well, unless it is just a mislabeled piece of music. If you google around for Low Bandwidth File System (LBFS) you'll see what technique the article is really talking about. (Disclaimer -- I didn't read the article either.) Variable-length chunking will handle cases where new data is inserted halfway into the file; however, with compression that extra data will end up changing the whole damned file.
Re:TorrentSoup by drix · 2007-04-11 04:27 · Score: 3, Insightful

Because it gets you published and, thus, increases your chance for tenure, that from which all blessings flow.

--

I think there is a world market for maybe five personal web logs.
Re:TorrentSoup by ShieldW0lf · 2007-04-11 04:41 · Score: 3, Interesting

So if someone is sharing an older ISO, and it happens to have large portions that exactly match the one you're downloading, with other portions that are not identical, you don't want to download the identical chunks off that person?

It would be interesting if the implementing software could also look for possible matches within your existing file structure and reduce the data downloaded automatically, kind of like using diff and just downloading the patch.

--
-1 Uncomfortable Truth
Re:TorrentSoup by CastrTroy · 2007-04-11 04:58 · Score: 3, Informative

I'm not sure if this would work if you changed the byte offset though. Sure, both ISOs may contain a lot of the same data, but I think it's very unlikely that the data would be at the same byte offset in the file. I don't think that you'd be able to accomplish this for different byte offsets, because for a 100 MB file, assuming 5 MB chunks, you're looking at about 2,000,000,000 chunks to calculate (20 chunks, calculated at each byte offset).

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Could really work! by Junior+J.+Junior+III · 2007-04-11 05:10 · Score: 3, Funny

At their fundamental level, all files are essentially similar. They're encoded as 1's and 0's. So, wherever a file happens to call for a 1, you should be able to just pull that 1 from ANYWHERE. Even some random file on your local hard drive. And likewise for zeroes. All you need is a smart download algorithm to re-assemble the 1s and 0s in the correct order, and you're set.

--
You see? You see? Your stupid minds! Stupid! Stupid!
Re:Snakeoil by discord5 · 2007-04-11 05:28 · Score: 3, Funny

It's not even 9AM and I have already filled my bullshit quota for the day. The concept itself is dubious, but this statement in particular is ludicrous.

May I suggest you don't open your e-mail and refrain from answering the phone for today? I usually fill up my bullshit quota with those two media alone. Slashdot is just the icing on the cake. ;)
Re:Problem with variable insertions? by angio · 2007-04-11 05:38 · Score: 4, Informative

We define chunk boundaries using Rabin fingerprinting. It's a cute trick - not one of our own invention - that is relatively insensitive to insertions and deletions. It was used in some of the other work in this area, such as the Low Bandwidth File System (LBFS). There's a family of work in this area called "shingling" that can also apply to sequence similarity.
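
For readers who have not seen the trick, here is a simplified sketch of content-defined chunking. It swaps in a plain polynomial rolling hash as a stand-in for true Rabin fingerprinting, and the constants (`WINDOW`, `BASE`, `MOD`, `DIVISOR`) are illustrative assumptions rather than the values used in SET or LBFS:

```python
import hashlib

WINDOW = 16            # bytes of context that decide whether to cut
BASE = 257
MOD = (1 << 61) - 1    # large prime modulus for the rolling hash
DIVISOR = 64           # a cut fires roughly once every 64 bytes on random data

def cut_points(data: bytes) -> list:
    """End offsets of chunks; a cut depends only on the preceding WINDOW bytes."""
    top = pow(BASE, WINDOW - 1, MOD)
    h, cuts = 0, []
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # take in the newest byte
        if i + 1 >= WINDOW and h % DIVISOR == 0:
            cuts.append(i + 1)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))                      # final partial chunk
    return cuts

def chunk_hashes(data: bytes) -> list:
    """SHA-1 of each content-defined chunk, in file order."""
    hashes, start = [], 0
    for end in cut_points(data):
        hashes.append(hashlib.sha1(data[start:end]).hexdigest())
        start = end
    return hashes
```

Because each boundary is decided by the bytes just before it, inserting data near the start of a file moves the later boundaries along with the content, so the chunks after the insertion keep their hashes. The hash is also updated in constant time per byte, so every byte offset is examined in a single pass.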
An interesting licensing issue by DigitAl56K · 2007-04-11 05:39 · Score: 3, Interesting

If a client recreates a file from "similar" pieces, is it a derivative work?
Re:Right.... by angio · 2007-04-11 05:41 · Score: 3, Informative

Similar in spirit - except rsync looks at files on your local hard drive by the same name, so there's only one possible candidate to draw from. SET looks at all of the files that everyone else is currently downloading, so we had to develop a much more efficient technique for locating useful files.