Faster P2P By Matching Similiar Files?

Nickelback? by onemorehour · 2007-04-11 03:46 · Score: 5, Funny

Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.

Well, sure, if you're only looking at Nickelback songs.

Re:Nickelback? by beckerist · 2007-04-11 03:54 · Score: 4, Informative

The idea is that the song "Girl Talk - Once Again" might be represented as "girltalk-onceagain" "girl talk - once again" "GirlTalk - Once Again" "01-once again.mp3" "OnceAgain.mp3"

Being that the only difference is just the text (ID3, ID3-2) tags, the rest of the song is exactly the same, so why can't you use that as a download source too? I personally organize all of my music, and because of this P2P programs believe that it's an entirely new file, when really it was just renamed and the header information was changed (generally to be grammatically correct.)</summary>
Re:Nickelback? by thepotoo · 2007-04-11 04:17 · Score: 5, Interesting

If you use bittorrent, the DHT protocol (supported by Azureus, BitComet, and uTorrent, among others) does the exact thing you're describing. It checks MD5 hashes for files (the whole file, not the pieces, I think), and connects you to peers which have the same file.
DHT even supports partially corrupted files, your client just discards the corrupt data.
My question is, why would I want to use SET over DHT? Does SET not need a ceneralized server, or does it have any other advantage at all?
TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.

--
Obligatory Soundbite Catchphrase
Re:Nickelback? by hey! · 2007-04-11 04:41 · Score: 3, Interesting

I don't think this is just about inconsistent metadata.

I think what he's talking about may be more like the document fingerprinting algorithms used to pare search engine results, or to detect plagiarism in student papers.

In some cases you will be downloading components of a file from two sources, neither of which have the others' component. The example in TFA was downloading the video portion of a movie from a foreign language site and the audio from a site with the language you speak but less bandwidth.

I suppose another example would be that if you were downloading an anthology of stories, you could take a particular story from a server that hosted a different anthology including that story. Or maybe you are downloading the new distro; you could take some of your files from sites offering the distro version you are looking for, some from sites only offering the files you need to upgrade to that version, and some from entirely different distros or much older versions if they happened to be the same.

I guess it could be thought of as a kind of "fuzzy akamai".

It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Re:Nickelback? by Robotech_Master · 2007-04-11 04:46 · Score: 3, Informative

If I understand the article right, SET looks at individual files within a particular download. DHT just looks at the whole download.

For instance, if I'm uploading my "Songs I Like to Dance To" mp3 mix, and someone else is uploading an "All-Time Greatest Dance Hits" CD rip, and there are a couple of songs both uploads have in common, SET would enable someone downloading my MP3 mix to treat the CD rip as a partial seed (and vice versa), and pull down the songs held in common from either one.

Whereas DHT would simply enable people to pull down my mix from other people uploading the mix, or the CD rip from other people uploading the CD rip, even if the tracker was down. (If I understand what DHT does correctly. Which it is possible I don't.)

--
Editor Emeritus and Senior Writer, TeleRead.org
Re:Nickelback? by Andy+Dodd · 2007-04-11 04:46 · Score: 4, Interesting

If I recall correctly, DHT takes the file name into account when calculating the hash. Thus identical files with different names are treated differently.

Some P2P protocols allow looking up a file by a hash which does not take filename into account, but this will not handle the case where the files differ in only one small section. The best example is the following:
Person downloads an MP3.
Person finds that the MP3 is not properly tagged (for example, has a comment field saying who ripped it/released the rip, and has no track number.)
Person changes the MP3's ID3 tag
Now, nearly all existing P2P protocols will treat the new file as a completely different file, when in reality the most important contents (the audio itself) have not changed, only the file's metadata.
Other users will go for the "full-file" match with the largest number of sources, thus causing the mistagged MP3 to propagate more than a "fixed" one.

So a P2P system that ignores the ID3 tag when hashing would have significant advantages, in which the user could download the file from many sources and then choose which source to get their metadata from.

--
retrorocket.o not found, launch anyway?
Re:Nickelback? by Anonymous Coward · 2007-04-11 05:13 · Score: 2, Interesting

No, it's a different kind of identifying which blocks are interesting in a swarm download.

Basically - and this isn't the first time this idea has been tabled, I have an unpublished paper and a reference implementation - chuck BT's idea in the bin, as it uses lists of SHA1 hashes and that's not suitable for this. Shareaza's better placed to do this technically, but you could of course adapt torrent.

What you actually want to use is a TTH - Tiger Tree Hash (THEX standard). That's a Merkle hash tree based on TIGER192 (it's commonly represented in base-32, and is for example used by DC++ and Shareaza, and is a major part of the magnet: standard). The whole thing identifies the file, but that's a hash of a tree of hashes that progressively identify smaller parts of the file. You can exchange the leaves of the tree to any convenient depth and easily verify they're correct.

(The side effect of this is that standard magnet: links or lists of magnet: links work; no strange special .torrent files are needed at all. It's more efficient on the tracker, too.)

If you can search for the leaves you're interested in (and a distributed Bloom filter can make that very efficient indeed), you'll get matches from not just the file you wanted, but any identical blocks found in any other files - the same bits by any other name would smell as sweet. So peers with slightly different files can provide partial seeds to the swarm too, and vice versa.

This is useful if you have a common element between download groups. There's no need for a "batch torrent" any longer, because all the grouping would be done automatically. Media files who differ only in tags would be automatically matched and swarmed together as a matter of course.

More creatively, it could be used to swarm patches to a large group of clients, as this technique can efficiently perform a binary diff-based distributed rsync to millions of clients... ...which is why it's a hot research topic.
Re:Nickelback? by thepotoo · 2007-04-11 05:20 · Score: 2, Informative

Actually, DHT doesn't care about file names. You can test this yourself if you have a LAN. Grab a torrent, start downloading it on one computer (save as a different filename), get a little of the file downloaded, and start the same torrent on a different computer. Use your firewall to block the second computer's access to the tracker. DHT will kick in, your computers will log in, get each other's IPs, and computer 2 will get an insanely fast download speed until it catches up with computer 1.
Tested on uTorrent 1.6.0 (old version) on my and my roomates computers. Incidentally, the process isn't any faster than downloading the file on one computer, and copying it over to the other one afterwards.
You are correct about the tags, though.

--
Obligatory Soundbite Catchphrase
Re:Nickelback? by evilviper · 2007-04-11 06:49 · Score: 2, Informative

Some P2P protocols allow looking up a file by a hash which does not take filename into account,

By "Some", you mean "Every Single Frickin' One Of Them", right?
nearly all existing P2P protocols will treat the new file as a completely different file,

No. Only the most brain-dead P2P protocols will. "tree" hashes are in use by several P2P protocols. Some are just old or primitive, and have a large number of old servants around that don't understand newer hashes.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

grea tide a by underwhelm · 2007-04-11 03:51 · Score: 5, Funny

I'm hoping this CATCHES ON and wet ransfer a11 sorts of information like this. It'11 be 1ike getting every thing in the form of a ransom n0te.

--

I don't need large brains to have a good time.

The music kids listen to by Vollernurd · 2007-04-11 03:51 · Score: 3, Funny

So it's not me then? All new tunes DO sound the same?

--
Smokey, this is not 'Nam, this is bowling. There are rules.

Summary: by PhrostyMcByte · 2007-04-11 03:53 · Score: 4, Informative

instead of sharing files, divide them into 16KB chunks and share those, to help work around files that get renamed or trivially altered (eg a website tagging their url to all the files you upload).

Tiny chunks, Large files by LiquidCoooled · 2007-04-11 03:53 · Score: 2, Insightful

LOTS of overhead just to find the chunks.
The article talks about 16kb chunks, which for a dvd image would take more than the torrent protocol currently allows.

The client would spend more time communicating its chunk lists around than actually getting data.

(If I remember rightly, torrents can have a max of 65535 chunks and some servers prevent huge .torrent files which contain the chunk breakpoints anyway)

--
liqbase :: faster than paper

Re:Right.... by Icarus1919 · 2007-04-11 03:55 · Score: 2, Informative

Ok, perhaps you're not certain how files work. But things compressed with different codecs and bitrates look VERY different when you actually look at the coding in the file as opposed to the same file named differently or with minor changes.

Re:TorrentSoup by eric76 · 2007-04-11 03:56 · Score: 2, Insightful

The only thing I use the file sharing networks for is to download new images of FreeBSD and Linux using BitTorrent.

The last thing I want is a "similar" file.

What would be a "similar" file to a FreeBSD ISO? It would either be a corrupted file or one with an introduced exploit.

Re:Snakeoil by Anonymous Coward · 2007-04-11 03:57 · Score: 3, Funny

this statement in particular is ludicrous. You don't listen to pop music, do you?

Should work just fine... by Anonymous Coward · 2007-04-11 03:58 · Score: 2, Informative

I think everyone posting above saying "it won't work" should RTFA.

This works by breaking files down into clusters and hashing the clusters (like Bittorrent already does). Then it searches for other shares that have clusters with the same hash value, and requests them.

Assuming that the hashing scheme being used it "good" in that there are no collisions, two clusters with the same hash will contain the *exact same* information.

Should work just fine.

Re:Similar Files? by Aladrin · 2007-04-11 03:59 · Score: 2, Insightful

No seriously, the coward is right. WTF?

Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same. A few. The large majority of duplicate songs are NOT the same audio data. It's been re-ripped, transcoded, or some other horrid thing done to it and is not the same data anymore.

Now, even assuming that there ARE tons of very-alike files out there, you'd have to write an intelligent comparer for each one so that it knew how to deal with the file and what information could be mixed without ruining the file.

At the end of the project, you've spent years on a project that'll never quite work right to save a bit of bandwidth for people that should have just gone and bought the song from iTunes in the first place if they wanted it that damned bad. And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike. (And probably has tons of disadvantages compared to the already-existing applications.)

--
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM

Re:Right.... by angio · 2007-04-11 04:01 · Score: 5, Informative

Take a peek at the paper - it actually does work, and we demonstrated it. The intuition: people make small changes to files like changing the artist or title in the MP3 header, and then BitTorrent and other systems treat this as a "different" file, when in fact it's 99.9% similar.
(Yes, I'm one of the authors.)

It gets worse. by khasim · 2007-04-11 04:01 · Score: 2

Taking advantage of those similarities could speed downloads considerably. If a U.S. computer user wanted to download a German-language version of a popular movie, for instance, existing systems would probably download most of the movie from sources in Germany. But if the user could download from similar files, the user could retrieve most of the video from English versions readily available from U.S. sources, and download only the audio portion of the movie from the German sources.

To paraphrase Morbo: "DOWNLOADS DO NOT WORK LIKE THAT!"

Now it would be GREAT if someone did manage to do that. Split the video from the audio (and from the sub-titles). And maybe create a meta-package.

And maybe if those researchers focus on that, this will be a better idea.

But that would ONLY work for material that could be split like that. If it's a song, what are you going to split? An ISO image? Same question.

Re:TorrentSoup by joe_cot · 2007-04-11 04:06 · Score: 5, Informative

It would still work the same way as it does now: an md5 of each specific block, and an md5 of the whole thing. If the md5 for the block doesn't match, it's not going to download, and if it's someone using collision to inject a block with the same md5, 1) it's not going to pass the md5 on the whole thing, 2) you're already vulnerable to it. The reason this will work is that they'll be lots of people sharing incomplete or corrupted versions of your FreeBSD iso; you'll get the blocks that are good, and skip the blocks that aren't, making "similar" files very useful. Not too difficult to understand, and no need for tin foil hats.

Re:This could work for some files, but not for oth by Xzzy · 2007-04-11 04:11 · Score: 2, Interesting

It depends on how they calculate "similar". If they run checksums on the chunks and submit a query to other machines on the network that have pieces with identical chunks, then it would be valid to download. I'm pretty sure a few P2P services in the pre-bittorrent days did something like this, files with identical hashes would be grouped together.

But the article makes it sound like their custom software breaks up media files into their component streams, which clients can download separately as they desire. Pull the english audio stream from Computer A, the video stream from B, etc. Once downloaded the client automagically reassembles it into a single file.

Writing a client that can disassemble and reassemble all known media types sounds absolutely horrifying though. ;)

Re:This could work for some files, but not for oth by Incy · 2007-04-11 04:13 · Score: 3, Interesting

Anything compressed/encrypted won't work so well. Unless it is just a mislabeled peice of music. If you google around for Low Bandwidth File System (LBFS) you'll see what technique the article is really talking about.(disclaimer -- I didn't read the article either) Variable Length chunking will handle cases where new data is inserted halfway into the file, however with compression that extra data will end up changing the whole damned file.

Blurring the Line Between Data by dduardo · 2007-04-11 04:14 · Score: 2, Interesting

So let say person A wants to download copyrighted material. They would connect to a tracker who would when tell person A where to get chunk 1 from. This chunk could come from copyrighted data or public domain data. What the tracker stores isn't the actual copyrighted data, but offsets, lengths, and ip addresses of where to find particular chunks.

In essence, if you were downloading a music CD, the data chunks could be coming from someone who has an Ubuntu ISO image, or some type of copyrighted material.

With this system it essential becomes impossible to tell who is uploading what.

Re:Blurring the Line Between Data by PaintyThePirate · 2007-04-11 05:17 · Score: 2, Interesting

The chances of two 16kB chunks from a Ubuntu ISO and a music CD being identical are extremely unlikely, or for that matter, the chances of any two 16kbit chunks from any two non-similar (meaning, in this case, files that are not of/from the same source [the same movie, album, program, whatever]) files being the same are incredibly slim.

That being said, about 90% of the commenters missed the point completly, though it is somewhat understandable given how vague and nontechnical the linked article and summary are. The idea is that by hashing each 16kB chunk, one peer can download a single chunk from another if the chunk is identical but the entire file isn't. The word "similar" in the title does not mean files that somewhat resemble each other, it means files that have identical chunks.

Of course, with such a system, one would have to deal with the overhead traffic of announcing the hashes of every 16kB chunk instead of simply the hash of the entire file. I imagine that in an actual implementation of this system, instead of a peer requesting hashes of every chunk that all of the other peers are sharing, the downloader would only request hashes of files with similar filenames or keywords.

As in, if you search for an Ubuntu ISO, you won't be getting hashes of chunks from someone who is only sharing The Matrix Trilogy DVD-Rs; you would get hashes from people who are sharing ubuntu-6.10-server-amd64.iso, ubuntu-6.10-server-i386.iso, ubuntu-6.06-server-amd64.iso, etc. I would imagine that there would be many shared chunks between those files.

Of course, I can't imagine such a system working on a P2P network like Bittorrent (except maybe with a modified version of DHT), but with a more centralized system, it could work.

The only remaining problem I see is how to deal with byte offsets on chunks. I haven't read the actual paper on the topic yet, but I imagine something was done about it.

--
eclecti.cc

Re:TorrentSoup by drix · 2007-04-11 04:27 · Score: 3, Insightful

Because it gets you published and, thus, increases your chance for tenure, that from which all blessings flow.

--

I think there is a world market for maybe five personal web logs.

Re:Similar Files? by joto · 2007-04-11 04:31 · Score: 2, Interesting

Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same.

It's more then a few. Most people use the default settings in their audio ripper/compression program, and it's all from the same CD. Even more people never uses an audio ripper and/or compressor, and simply downloads the file from the Internet. Not that many people bother to change ID3-tags either, but every single person that do, leads to a different file.

And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike.

My impression is that people are going out of their way to find specialized P2P programs that offer any advantage. And that very many people are willing to spend $100 in work-time finding a CD worth $10 for "free". The market has spoken, people are not acting the way you want them to.

Re:TorrentSoup by ShieldW0lf · 2007-04-11 04:41 · Score: 3, Interesting

So if someone is sharing an older ISO, and it happens to have large portions that exactly match the one you're downloading, with other portions that are not identical, you don't want to download the identical chunks off that person?

It would be interesting if the implementing software could also look for possible matches within your existing file structure and reduce the data downloaded automatically, kind of like using diff and just downloading the patch.

--
-1 Uncomfortable Truth

Re:Similar Files? by Anonymous Coward · 2007-04-11 04:43 · Score: 2, Insightful

Break down the file into small (16Kb) chunks. Hash those chunks and let the client compare those chunks to the chunks you need. Most BT clients already do this, but still only draw the file from peers using the same file listed by the tracker. With this technology it can use any file that has chunks with the exact same hash as the file being downloaded by the user. I would imagine not a great many changes would be needed to implement it. There's no need for an 'intelligent comparer' as it's pretty much already built into almost every BT app out there. There won't be 'years on the project' either, since most of the infrastructure already exists. They can just build on what is already out there.

There could be a fairly large performance increase if I've understood the paper correctly. I have a 10Mb downstream cable connection at home. If I connect to a torrent that has many more seeders than leechers I can easily top out my d/l speed at around 1.1 MB/s. Reverse that scenario and the d/l is extremely slow due to the lack of seeders able to send out chunks of the file. Now, imagine there are multiple copies of these same file on multiple trackers being shared by many, many more seeders that this one torrent. This new implementation will find those chunks, as well as the chunks you originally connected to. Next time, RTFA.

Re:TorrentSoup by CastrTroy · 2007-04-11 04:58 · Score: 3, Informative

I'm not sure if this would work if you changed the byte offset though. Sure both ISOs may contain a lot of the same data, but I think it's very unlikely that the data would be at the same byte-offset in the file. I don't think that you'd be able to accomplish this for different byte offsets, because for a 100 MB File, assuming 5 MB chunks, You're looking at about 2,000,000,000 chunks to calculate (20 chunks, calculated at each byte offset).

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.

Re:Jennifer Aniston != Monkey by maxume · 2007-04-11 04:59 · Score: 2, Funny

Think of it as Jennifer Aniston vs a perfect Jennifer Aniston clone with a paper bag with a monkey face drawn on it over her head. Take off the bag(i.e., the differing headers), and you are good to go.

--
Nerd rage is the funniest rage.

Problem with variable insertions? by digitalderbs · 2007-04-11 05:07 · Score: 2, Interesting

I haven't read the detailed paper, but it seems to me that there could be problems in finding similarity when there are random insertions. This is analogous to protein sequence matching (peripherally my field of research) when there are random insertions in the primary sequence. So if you have two identical 16k files, eight 2k chunks will be identical to each other. However, if you insert 512 bytes at the start of one file, eight 2k chunks will be different.

Re:Problem with variable insertions? by angio · 2007-04-11 05:38 · Score: 4, Informative

We define chunk boundaries using Rabin fingerprinting. It's a cute trick - not one of our own invention - that is relatively insensitive to insertions and deletions. It was used in some of the other work in this area, such as the Low Bandwidth File System (LBFS). There's a family of work in this area called "shingling" that can also apply to sequence similarity.

Could really work! by Junior+J.+Junior+III · 2007-04-11 05:10 · Score: 3, Funny

At their fundamental level, all files are essentially similar. They're encoded as 1's and 0's. So, wherever a file happens to call for a 1, you should be able to just pull that 1 from ANYWHERE. Even some random file on your local hard drive. And likewise for zeroes. All you need is a smart download algorithm to re-assemble the 1s and 0s in the correct order, and you're set.

--
You see? You see? Your stupid minds! Stupid! Stupid!

Already been done a long time ago by J0nne · 2007-04-11 05:13 · Score: 2, Informative

Shareaza has been doing this for years. When hashing MP3 files, it disregards what(s in the id3 tags, and just computes a hash for the audio information. This means that files with different id3 tags will still be added to the swarm, whicj is great.
Unfortunately, there are some issues with it:
-Only Shareaza supports it, other clients didn't want to play along.
-Shareaza has/d a bug where it would fail to reconstruct the id3 tag after downloading, giving you files with empty tags
-Only mp3 is supported, so no ogg, aac or wma

So this paper isn't as revolutionary (if that's what they mean).

This will only work with identical files that have metadata that is frequently changed by end-users, because there's no way you're going to be able to get a good file if you try to mix a cam with a dvdrip, or an ogg with an mp3, or an xvid file with a divx file. It just doesn't work that way.

Re:Snakeoil by discord5 · 2007-04-11 05:28 · Score: 3, Funny

It's not even 9AM and I have already filled my bullshit quota for the day. The concept itself is dubious, but this statement in particular is ludicrous.

May I suggest you don't open your e-mail and refrain from answering the phone for today? I usually fill up my bullshit quota with those two media alone. Slashdot is just the icing on the cake. ;)

Re:Think about how that would be accomplished. by Anonymous Coward · 2007-04-11 05:34 · Score: 2, Interesting

It doesn't have to operate on every file on the filesystem. It's perfectly capable of identifying common blocks among files you happen to be swarming (downloading/hosting) or seeding (hosting), and that's more than useful enough.

Or you can maintain a set of managed/hosted files, exactly like if they were on an rsync server.

As for network efficiency, this isn't Gnutella, we've learned a lot in years of research since then, such as how to make searches that scale well. Very, very well. Well enough to be able to efficiently search 10-million strong distributed networks for hash-tree leaves.

We don't have to ask you if you have any interesting blocks directly; we already have an excellent structure that lets us know if either you (or your section of the distributed hash table, as we drill down) haven't, or you probably might - say hello to the Distributed Bloom Filter.

Those million requests? More like ten or so, total, in itty bitty packets that would be totally dwarfed by the data transfer you'd probably be making, spread out throughout the network and not concentrated on any one host.

Does the idea sound more reasonable now?

It will still be WORSE. by khasim · 2007-04-11 05:39 · Score: 2, Informative

Let's go with a fairly vanilla scenario:

I grab an mp3 from person A. I then clean up the tag and rename it to suit me.

You want to download that same song with a different name and different tag.

You connect to person B sharing it. If you're using BitTorrent, you can also connect to any of 99 other people trying to download it from person B.

Using the new model, you could also connect to person A and myself and download the blocks that are the same.

So instead of only...
99 people in the swarm and 1 seeder
you'd have 99 people in the swarm, 1 seeder, person A and myself.

But in order to FIND person A and myself, you'd have to go through A MILLION OTHER PEOPLE to find if they have any blocks that you are looking for.

The CRITICAL PART THAT THEY LEFT OUT is the amount of bandwidth you'd be using to search A MILLION unrelated systems with unrelated files looking for those blocks.

This works in their lab because they have very few machines with very few files and they've already pre-loaded those machines with the files they want to be found.

An interesting licensing issue by DigitAl56K · 2007-04-11 05:39 · Score: 3, Interesting

If a client recreates a file from "similar" pieces, is it a derivative work?

Re:Right.... by angio · 2007-04-11 05:41 · Score: 3, Informative

Similar in spirit - except rsync looks at files on your local hard drive by the same name, so there's only one possible candidate to draw from. SET looks at all of the files that everyone else is currently downloading, so we had to develop a much more efficient technique for locating useful files.

Re:Right.... by Dare+nMc · 2007-04-11 05:49 · Score: 2, Interesting

The only way you could implement a system like this is to create an entirely new protocol, server software and client software.

I disagree, this could be done with No Change to any protocol, client, or server: the only thing that needs created is a torrent creator located on a machine that had a full version of every "simular" torrent to be shared. the torrents would all be linked read only to the same DVD image (for exampl), only part of that DVD image would be labeled as "downloaded" and you would put the client into upload only mode on this host machine. The downloaders would have no idea this same data was shared under many different torrents.

For example, My company distributes our manuals as DVD's, different DVD for every machine Type. These go to 100's of locations. We put the same engine in 50 different machines. Also we do updated DVD's, that may just add a few pages of options (for example.)

If we distributed DVD images by bittorent, with a server at every one of our distributor. It would be as simple as patching the existing torrent image at the distributor.

IE Distibutor X may only have the DVD only for model ZZ hosted on their server. Instead of having a single torrent served from their chunk of data, Our main office that has all of the DVD's we ever produced, could just send them a list of torrent files, that tells that 1 server, what chunks of their file are appliciple on all of our DVD's. Now they may be able to host a 100 different torrents, all saying I got 10% of model ZX I got 5% of model ZF...

I tried it by Intron · 2007-04-11 06:36 · Score: 2, Funny

I tried downloading "My Sweet Lord" by George Harrison, but I got "He's So Fine" by the Chiffons instead.

--
Intron: the portion of DNA which expresses nothing useful.

mod this up -- angio...this is the problem by tacokill · 2007-04-11 07:09 · Score: 2, Insightful

What the parent is saying can be summarized with a simple example:

A 200MB, 30min video that was compressed at 1000kbps DiVX is not the "same file with minor changes" as a 200MB, 30min video that was compressed at 900kbps DiVX. They ARE different files and should be treated as such. You also can't deduce anything from their filenames, play length, or any other characteristic so how would you determine which ones can go together and which ones can't? I did not see codecs or compression mentioned at all in the article.

This is the fundamental problem here. You can't recombine video and audio files unless they ARE the same file. You have to account for different bitrates, compression ratios, and who knows what else (I am no expert in this area but this seems obvious...).

Lemme guess -- the mp3s mentioned in the article were ALL encoded at the same bitrate, right? If not, then please correct me because now you have my attention ;)

Re:mod this up -- angio...this is the problem by angio · 2007-04-11 15:59 · Score: 2, Interesting

Correct (and the parent is correct as well). We didn't exhaustively characterize the reasons that the files differed, but most of the cases appeared to be modifications _after_ encoding. We have on the TODO list to test encode the same CD multiple times with the same settings to see if it produces a similar file, but we haven't run the test yet. There were tons of cases of metadata-altered MP3s, many cases of video files with the same video content but different audio or subtitle information, and some cases where the files differed in "seemingly random" (aka, we're not sure why. :) parts of the file for no reason we've figured out yet.

I think that for audio files, having a plugin for the p2p system that separated metadata from song information would probably capture a lot of the benefit we found. It's much harder for video files and for the files that had various weird changes that we don't have a good explanation for.

As other posters have suggested, the techniques we used also work well for software distributions sometimes. We've found around 10-30% similarity between different Linux ISOs and RPMs, *uncompressed* distributions of things like gcc (but not a .tar.gz file), etc. Part of what we really like about the SET technique is that it's able to speed transfers of all of those types of files without needing to have any file-type specific logic.

Don't see how it could work on a large scale. by Bluesman · 2007-04-11 07:21 · Score: 2, Interesting

This seems to be an intractable problem.

How do you know a file is similar? By hashing? There's no guarantee that a particular chunk of a file with an md5 hash (for example) contains the same bytes as that of another file.

There are 2^256 possible chunks of 256 bits of data. There are 2^16 possible hashes with (using a 16 bit binary key) for that same data. That means that for every hash match, the data has a 1 in 16 chance of actually matching.

You can extend the key length to reduce this ratio, but you'll end up with a key length equal to your data size before you're sure the data is not a collision.

The problem gets worse if the chunks of data aren't equal in size.

This can only work if you have a centralized database of every possible file combination on your network. It's workable for a small amount of files, but will grow exponentially in a real environment. Not to mention, the centralized database would have to handle a significant amount of traffic, reducing the speed gains possible.

Count me as skeptical.

--
If moderation could change anything, it would be illegal.

perhaps 100% similar by Frisky070802 · 2007-04-11 07:46 · Score: 2, Interesting

I recall hearing a story on NPR Music a few weeks ago about someone who plugged a CD into itunes and had it come up with the right piece but the wrong performer. Then it happened again, with the same ostensible pianist but yet another "wrong" performer. Detailed analysis showed the pianist had apparently published CDs claiming to perform pieces but actually substituting the work of others! Itunes must have used a signature over the content to index the piece by the earlier CD.

Similarity detection rules.

--
Mencken had it right. So glad that's old news.

Huh? by sofla · 2007-04-11 07:47 · Score: 2, Informative

SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source.

So does ed2k, Bittorrent, ..., ..., .... this is hardly news. Even plain ol' FTP and HTTP can do this to a degree.

As far as the 99.9% similar "speedup"... I seriously doubt that you'll see any gains other than in lab conditions. MP3 is about the only format that might be agreeable to this, since I imagine its reasonably common for people to fix ID3's and then share the modified file. I just don't see it happening for other formats (.avi, .zip, .rar). And unless you've still got a 14.4 modem I doubt you'll notice the speedup even with MP3's, since they are so small to begin with.

Re:Right.... by poopdeville · 2007-04-11 11:23 · Score: 2, Insightful

Which is why you would download a .torrent-like file specifying which of those you want. Then you would download the 99.9999% that agrees from any/all of them (essentially making your personal swarm temporarily bigger), and download the missing .0001% from the version you requested.

This is very straightforward. I don't see how people can misunderstand this idea.

--
After all, I am strangely colored.

Re:Ok, I read the paper by Loligo · 2007-04-11 11:29 · Score: 2, Informative

It's already been addressed: files encoded with different codecs, bitrates, compression ratios, what-have-you look completely different, have vastly different checksums, and even if named exactly the same and with the exact same file size, would never be confused for each other by any algorithm that's comparing what's actually IN the files.

-l

Slashdot Mirror

Faster P2P By Matching Similiar Files?

49 of 222 comments (clear)