Faster P2P By Matching Similiar Files?

Nickelback? by onemorehour · 2007-04-11 03:46 · Score: 5, Funny

Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.

Well, sure, if you're only looking at Nickelback songs.

Re:Nickelback? by beckerist · 2007-04-11 03:54 · Score: 4, Informative

The idea is that the song "Girl Talk - Once Again" might be represented as "girltalk-onceagain" "girl talk - once again" "GirlTalk - Once Again" "01-once again.mp3" "OnceAgain.mp3"

Being that the only difference is just the text (ID3, ID3-2) tags, the rest of the song is exactly the same, so why can't you use that as a download source too? I personally organize all of my music, and because of this P2P programs believe that it's an entirely new file, when really it was just renamed and the header information was changed (generally to be grammatically correct.)
Re:Nickelback? by thepotoo · 2007-04-11 04:17 · Score: 5, Interesting

If you use bittorrent, the DHT protocol (supported by Azureus, BitComet, and uTorrent, among others) does the exact thing you're describing. It checks MD5 hashes for files (the whole file, not the pieces, I think), and connects you to peers which have the same file.
DHT even supports partially corrupted files, your client just discards the corrupt data.
My question is, why would I want to use SET over DHT? Does SET not need a centralized server, or does it have any other advantage at all?
TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.

--
Obligatory Soundbite Catchphrase
Re:Nickelback? by Anonymous Coward · 2007-04-11 04:20 · Score: 1, Informative

Shareaza already discards ID3 tags when hashing files. (It's an option if it's off by default.)
Re:Nickelback? by beckerist · 2007-04-11 04:24 · Score: 1

I read this as being less effective for bittorrent (as your assumption is correct, DHT basically does this on its own already) and more effective for Gnutella. While the gnutella network is maturing to a level past the Bearshare/Limewire days, there are still inherent problems with download speeds simply because a mechanism like DHT (or as the article calls it SET) doesn't currently exist. Are there problems with DHT? Absolutely, I get node errors all the time in Bittyrant. I think this is just the next natural extension for Bittorrent (or distributed FTP, etc)
Re:Nickelback? by LoofWaffle · 2007-04-11 04:28 · Score: 1

Is the ID tag really the only difference? You would also need to make sure that encoding bit rates were the same, that the original data set (let's stick with the MP3 example) was digitized the same way, etc. It may be more efficient, but given the polluted nature of P2P, accuracy is a bit questionable.

--
You know, Custer had a plan.
Re:Nickelback? by beckerist · 2007-04-11 04:29 · Score: 1

errrrr...sorry. I need to proofread better. Instead of saying "I think this is just the next natural extension for Bittorrent..." I meant to say "next natural extension for Gnutella."

--beckerist
Re:Nickelback? by Anonymous Coward · 2007-04-11 04:41 · Score: 0

Or if you're looking for commercial music, period; remember how the RIAA bombed Kazaa?
Re:Nickelback? by hey! · 2007-04-11 04:41 · Score: 3, Interesting

I don't think this is just about inconsistent metadata.

I think what he's talking about may be more like the document fingerprinting algorithms used to pare search engine results, or to detect plagiarism in student papers.

In some cases you will be downloading components of a file from two sources, neither of which has the other's component. The example in TFA was downloading the video portion of a movie from a foreign language site and the audio from a site with the language you speak but less bandwidth.

I suppose another example would be that if you were downloading an anthology of stories, you could take a particular story from a server that hosted a different anthology including that story. Or maybe you are downloading the new distro; you could take some of your files from sites offering the distro version you are looking for, some from sites only offering the files you need to upgrade to that version, and some from entirely different distros or much older versions if they happened to be the same.

I guess it could be thought of as a kind of "fuzzy akamai".

It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Re:Nickelback? by Robotech_Master · 2007-04-11 04:46 · Score: 3, Informative

If I understand the article right, SET looks at individual files within a particular download. DHT just looks at the whole download.

For instance, if I'm uploading my "Songs I Like to Dance To" mp3 mix, and someone else is uploading an "All-Time Greatest Dance Hits" CD rip, and there are a couple of songs both uploads have in common, SET would enable someone downloading my MP3 mix to treat the CD rip as a partial seed (and vice versa), and pull down the songs held in common from either one.

Whereas DHT would simply enable people to pull down my mix from other people uploading the mix, or the CD rip from other people uploading the CD rip, even if the tracker was down. (If I understand what DHT does correctly. Which it is possible I don't.)

--
Editor Emeritus and Senior Writer, TeleRead.org
Re:Nickelback? by Andy+Dodd · 2007-04-11 04:46 · Score: 4, Interesting

If I recall correctly, DHT takes the file name into account when calculating the hash. Thus identical files with different names are treated differently.

Some P2P protocols allow looking up a file by a hash which does not take filename into account, but this will not handle the case where the files differ in only one small section. The best example is the following:
Person downloads an MP3.
Person finds that the MP3 is not properly tagged (for example, has a comment field saying who ripped it/released the rip, and has no track number.)
Person changes the MP3's ID3 tag
Now, nearly all existing P2P protocols will treat the new file as a completely different file, when in reality the most important contents (the audio itself) have not changed, only the file's metadata.
Other users will go for the "full-file" match with the largest number of sources, thus causing the mistagged MP3 to propagate more than a "fixed" one.

So a P2P system that ignores the ID3 tag when hashing would have significant advantages, in which the user could download the file from many sources and then choose which source to get their metadata from.
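
A minimal sketch of the "ignore the ID3 tag when hashing" idea, assuming tags appear only as a leading ID3v2 block and/or a trailing 128-byte ID3v1 block. The function names `strip_id3` and `audio_hash` are made up for illustration, and SHA-1 is just a convenient choice, not something any particular P2P client is known to use for this.

```python
import hashlib

def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe size.
    if end >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)  # 7 significant bits per byte
        start = 10 + size
        if data[5] & 0x10:  # footer flag adds another 10 bytes
            start += 10
        start = min(start, end)
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]

def audio_hash(data: bytes) -> str:
    """Hash only the audio payload, so retagged copies compare equal."""
    return hashlib.sha1(strip_id3(data)).hexdigest()
```

Two copies that differ only in tags then get the same `audio_hash` while their whole-file hashes differ. A real client would still need a whole-file hash so the user can choose whose metadata to keep.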

--
retrorocket.o not found, launch anyway?
Re:Nickelback? by beckerist · 2007-04-11 05:09 · Score: 1

The problem might be inherent to MP3's though. Header information is still stored as binary information in the file, changing the amount of bits in the file, therefore the hash (as read by any hashing system I'm aware of.) If there were a way to store the header text such that it doesn't affect the filesize, current P2P protocols/software might already be able to do this...
Re:Nickelback? by Anonymous Coward · 2007-04-11 05:13 · Score: 2, Interesting

No, it's a different kind of identifying which blocks are interesting in a swarm download.

Basically - and this isn't the first time this idea has been tabled, I have an unpublished paper and a reference implementation - chuck BT's idea in the bin, as it uses lists of SHA1 hashes and that's not suitable for this. Shareaza's better placed to do this technically, but you could of course adapt torrent.

What you actually want to use is a TTH - Tiger Tree Hash (THEX standard). That's a Merkle hash tree based on TIGER192 (it's commonly represented in base-32, and is for example used by DC++ and Shareaza, and is a major part of the magnet: standard). The whole thing identifies the file, but that's a hash of a tree of hashes that progressively identify smaller parts of the file. You can exchange the leaves of the tree to any convenient depth and easily verify they're correct.

(The side effect of this is that standard magnet: links or lists of magnet: links work; no strange special .torrent files are needed at all. It's more efficient on the tracker, too.)

If you can search for the leaves you're interested in (and a distributed Bloom filter can make that very efficient indeed), you'll get matches from not just the file you wanted, but any identical blocks found in any other files - the same bits by any other name would smell as sweet. So peers with slightly different files can provide partial seeds to the swarm too, and vice versa.

This is useful if you have a common element between download groups. There's no need for a "batch torrent" any longer, because all the grouping would be done automatically. Media files who differ only in tags would be automatically matched and swarmed together as a matter of course.

More creatively, it could be used to swarm patches to a large group of clients, as this technique can efficiently perform a binary diff-based distributed rsync to millions of clients... ...which is why it's a hot research topic.
Re:Nickelback? by thepotoo · 2007-04-11 05:20 · Score: 2, Informative

Actually, DHT doesn't care about file names. You can test this yourself if you have a LAN. Grab a torrent, start downloading it on one computer (save as a different filename), get a little of the file downloaded, and start the same torrent on a different computer. Use your firewall to block the second computer's access to the tracker. DHT will kick in, your computers will log in, get each other's IPs, and computer 2 will get an insanely fast download speed until it catches up with computer 1.
Tested on uTorrent 1.6.0 (old version) on my and my roommate's computers. Incidentally, the process isn't any faster than downloading the file on one computer, and copying it over to the other one afterwards.
You are correct about the tags, though.

--
Obligatory Soundbite Catchphrase
Re:Nickelback? by Anonymous Coward · 2007-04-11 05:28 · Score: 0

The Nickelback thing was a joke. And a funny one at that. You really don't have to explain.
Re:Nickelback? by paeanblack · 2007-04-11 05:50 · Score: 1

TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.

As TFA currently describes things, I'm really struggling to find a use for this feature that does not involve copyright infringement. The rightsholders for "legitimate" p2p traffic already have a strong incentive to act as a central authority for low-bandwidth meta-data.

I think this is a solution for a problem that lies primarily with pirated files. I wouldn't be surprised if Bittorrent stayed far away from it.
Re:Nickelback? by LeRaldo · 2007-04-11 06:03 · Score: 1

DC++ Does this already by using TTH.
Re:Nickelback? by luckystuff · 2007-04-11 06:17 · Score: 1

really? Not in my experience. Torrent 1 has files a, b, and c. Torrent 2 has files c, d, and e. Torrent one has 100% of file c complete, while Torrent 2 is still trying to download file c with 94% remaining. File c has same size, name, hash, etc in both torrents. Thus the client mis-allocates resources, no?
Re:Nickelback? by rainman_bc · 2007-04-11 06:27 · Score: 1

Well, sure, if you're only looking at Nickelback songs.

Or "theory of a deadman" or "default"

I think they should merge and call themselves Theory of a Nickelfault

=D

--
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Re:Nickelback? by marcosdumay · 2007-04-11 06:38 · Score: 1

"It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models."

Who needs commercial support for P2P?

--
Rethinking email
Re:Nickelback? by evilviper · 2007-04-11 06:49 · Score: 2, Informative

Some P2P protocols allow looking up a file by a hash which does not take filename into account,

By "Some", you mean "Every Single Frickin' One Of Them", right?

nearly all existing P2P protocols will treat the new file as a completely different file,

No. Only the most brain-dead P2P protocols will. "tree" hashes are in use by several P2P protocols. Some are just old or primitive, and have a large number of old servants around that don't understand newer hashes.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:Nickelback? by Cramer · 2007-04-11 06:53 · Score: 1

DHT works from the exact same torrent file as everything else. Bittorrent doesn't care about the name of the TORRENT FILE, but the names, sizes, and order of files within the torrent file does matter. If I create a torrent of a file called 'foo' and you create the torrent but your file is named 'bar', they will have different info_hash's and thus be unique to both a tracker and DHT despite them both covering exactly the same data. The same is true if I make a torrent of 'foo' plus 'bar' and you make one of 'bar' plus 'foo'. Torrents don't represent files; they represent a "data set".
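
To make the 'foo' versus 'bar' point concrete, here is a small sketch that builds a single-file info dictionary and hashes its bencoded form with SHA-1, which is how the info_hash is derived. The tiny `bencode` helper and the 16 KiB piece length are assumptions made for this example, not code from any client.

```python
import hashlib

def bencode(obj) -> bytes:
    """Minimal bencoder for ints, bytes, str, lists and dicts."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode("utf-8"))
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in obj.items()]
        items.sort(key=lambda kv: kv[0])  # keys must be in sorted order
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(obj))

def info_hash(name: str, data: bytes, piece_length: int = 16384) -> str:
    """SHA-1 of the bencoded info dict for a single-file torrent."""
    pieces = b"".join(hashlib.sha1(data[i:i + piece_length]).digest()
                      for i in range(0, len(data), piece_length))
    info = {"name": name, "length": len(data),
            "piece length": piece_length, "pieces": pieces}
    return hashlib.sha1(bencode(info)).hexdigest()
```

With identical `data`, `info_hash("foo", data)` and `info_hash("bar", data)` differ, because the name sits inside the hashed dictionary even though the piece hashes are the same.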

As such, torrents have two flaws... first, the file name, which even the spec calls "advisory only", should NOT be part of the info dictionary. Filenames (and paths) do not uniquely identify a file. (Even the MPAA/RIAA know this.) Second, there should be a hash per file, not just per piece of the resulting concatenated dataset. That is what is needed to identify a specific file.
Re:Nickelback? by Frenchy_2001 · 2007-04-11 06:59 · Score: 1

SET seems to be an incremental update of DHT or MD5 hashes.
Most recent p2p programs use hashes to recognize files instead of just filenames, BUT this means that music files that differ only in their headers will come out as different.
This method just distinguishes between a file's data and its meta-data, and allows the meta-data to be different.
Anyway, just my understanding of it.
Otherwise, i'm sure they could just do md5 hashes of all your 16k/32k/64k parts of all your shared files and download parts with the same md5 hash even coming from a different file.
Might be faster but much more computationally intensive.
Then again, IO is usually more scarce than computing power...
Re:Nickelback? by dreamlax · 2007-04-11 07:05 · Score: 1

I guess the idea would be to truncate the file on either end (depending on where the meta data is stored) to have the raw MP3 data. Hashing just that would mean that people such as myself who not only rename but re-tag all my MP3s (because I hate seeing things like "santana - baila mi hermana " . . . Capitalise!) would still be able to share our MP3s the same way we got it from someone else, and that next person can tag it however they please.
Re:Nickelback? by beckerist · 2007-04-11 08:36 · Score: 1

The point isn't whether the ID tag is the only potential difference; it's that even IF the ID tag WERE the only difference, the files still wouldn't be recognized as the same.
Re:Nickelback? by WhiteDragon · 2007-04-11 12:42 · Score: 1

If you can search for the leaves you're interested in (and a distributed Bloom filter can make that very efficient indeed), you'll get matches from not just the file you wanted, but any identical blocks found in any other files - the same bits by any other name would smell as sweet. So peers with slightly different files can provide partial seeds to the swarm too, and vice versa.
Would this work if you add a byte at the beginning or middle of the file? I am pretty sure that the rsync protocol does support this, so it would be cool if the tree hash supports a similar feature. Any thoughts?
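
A small experiment on the question above, with its assumptions stated up front. Hashes over fixed-offset blocks stop matching once a byte is inserted at the front, because every block boundary shifts. rsync copes by sliding a rolling checksum over every byte offset. The sketch below uses a different but related technique, content-defined chunking, where boundaries are chosen from the data itself. The window size, mask and the simple rolling sum are arbitrary illustrative choices.

```python
import hashlib

def fixed_blocks(data: bytes, size: int = 256) -> list:
    """Split at fixed offsets, as a plain block hash list would."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data: bytes, window: int = 16, mask: int = 63) -> list:
    """Cut wherever the sum of the last `window` bytes has zero low bits."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        if i + 1 >= window and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Deterministic pseudo-random sample data, then the same data shifted by one byte.
data = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(256))
shifted = b"\x00" + data
```

Comparing `fixed_blocks(data)` with `fixed_blocks(shifted)` finds no common blocks, while every chunk of `cdc_chunks(data)` after the first also appears in `cdc_chunks(shifted)`. So a tree hash built over fixed blocks would not survive an inserted byte unless the chunking were made content-defined or a rolling search were added.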

--
Did you mount a military-grade, variable-focus MASER on an unlicensed artificial intelligence?
Re:Nickelback? by DMUTPeregrine · 2007-04-11 14:37 · Score: 1

It doesn't care about the filename of the output file, but does it care about the filename of the input file (IE the remote file)?

--
Not a sentence!
Re:Nickelback? by Anonymous Coward · 2007-04-11 16:11 · Score: 0

Yeah I don't pirate songs. I make arrangements.
Re:Nickelback? by LupusCanis · 2007-04-12 01:27 · Score: 1

I don't know why this is getting an informative rating, there are plenty of P2P protocols without hashing. Apparently BitTorrent allows it (which I wasn't aware of) and the DC network definitely does, but one that does not is the Soulseek network.

Unfortunately, the SoulSeek network is probably the best one by a long way for finding obscure music, but the program and protocol show their age.

So ... one notable network without hashes. Personally, I think that a program that was somewhere between DC++ and Soulseek would be the best P2P program ever. (on the Soulseek network, with wishlists, without requirements to be in rooms and without arbitrary limits you have to stick to to be in rooms but with hashing, without user ban lists, without this stupid attitude that some SoulSeek users take that sharing music is a privilege rather than a right when they're just sitting there on a P2P network, with autosearch and autoslots ... preferably the DC++ GUI too)

Back on topic. This would be a real improvement, as most files of the same song vary only by bitrate and tags and there's no reason why you shouldn't be able to dl fragments.
Re:Nickelback? by evilviper · 2007-04-14 12:21 · Score: 1

there are plenty of P2P protocols without hashing.

Then name them. If they exist, they must be extremely unpopular. Without hashing, you can't find alternate files for "swarming" (multi-host downloads). The major/popular P2P networks all support it, all the way back to the original, Gnutella.

one that does not is the Soulseek network.

Well, assuming you're correct (I can't find any technical details for SoulSeek, and certainly haven't used it) you're still only 1 for 100.

DC, ED2K, BitTorrent, Gnutella, Kazaa, Mute, Ares, MP2P/Manolito, etc., and those are just off the top of my head.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

similiar?! by Anonymous Coward · 2007-04-11 03:47 · Score: 0

similiar files ftw!

Similar Files? by Anonymous Coward · 2007-04-11 03:48 · Score: 1, Insightful

Wait...what?

Re:Similar Files? by Aladrin · 2007-04-11 03:59 · Score: 2, Insightful

No seriously, the coward is right. WTF?

Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same. A few. The large majority of duplicate songs are NOT the same audio data. It's been re-ripped, transcoded, or some other horrid thing done to it and is not the same data anymore.

Now, even assuming that there ARE tons of very-alike files out there, you'd have to write an intelligent comparer for each one so that it knew how to deal with the file and what information could be mixed without ruining the file.

At the end of the project, you've spent years on a project that'll never quite work right to save a bit of bandwidth for people that should have just gone and bought the song from iTunes in the first place if they wanted it that damned bad. And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike. (And probably has tons of disadvantages compared to the already-existing applications.)

--
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
Re:Similar Files? by lottameez · 2007-04-11 04:13 · Score: 1

Both of these files use the same 26-character alphabet! Share and substitute at will!

--
Yeah? Well I think you're overrated too.
Re:Similar Files? by joto · 2007-04-11 04:31 · Score: 2, Interesting

Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same.

It's more than a few. Most people use the default settings in their audio ripper/compression program, and it's all from the same CD. Even more people never use an audio ripper and/or compressor, and simply download the file from the Internet. Not that many people bother to change ID3-tags either, but every single person who does produces a different file.

And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike.

My impression is that people are going out of their way to find specialized P2P programs that offer any advantage. And that very many people are willing to spend $100 in work-time finding a CD worth $10 for "free". The market has spoken, people are not acting the way you want them to.
Re:Similar Files? by Anonymous Coward · 2007-04-11 04:38 · Score: 0

having downloaded many songs from emule and kazaa back in the day, there are many mp3s that are bit identical in the audio data that are different in the metadata. just search for a song on emule and download the dozen or so most available mp3s that have the same file size and do a foobar2000 bit compare on the audio and you'll find most of them are bit identical just with different tags. of course this is just as anecdotal evidence as you have cited, but my experience seems to be somewhat different to yours...

as for your other point, many p2p apps such as emule, limewire, shareaza have been metadata aware for many different filetypes for a long time now. i doubt it would be a huge task to have these programs support two hashes for a file, the normal hash for the entire file with metadata, and another hash just for the data. of course, the program in the article is going one step further by further subdividing the non-metadata data to look for more redundancy between files. makes sense to me.
Re:Similar Files? by Anonymous Coward · 2007-04-11 04:43 · Score: 2, Insightful

Break down the file into small (16Kb) chunks. Hash those chunks and let the client compare those chunks to the chunks you need. Most BT clients already do this, but still only draw the file from peers using the same file listed by the tracker. With this technology it can use any file that has chunks with the exact same hash as the file being downloaded by the user. I would imagine not a great many changes would be needed to implement it. There's no need for an 'intelligent comparer' as it's pretty much already built into almost every BT app out there. There won't be 'years on the project' either, since most of the infrastructure already exists. They can just build on what is already out there.
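
A sketch of the chunk-matching idea in this comment, assuming 16 KiB chunks and SHA-1 chunk hashes. The helper names and the in-memory `shared` dictionary are invented for illustration. A real client would have to exchange these hashes over the network, and matches only occur when identical data sits on the same chunk boundaries.

```python
import hashlib
from collections import defaultdict

CHUNK = 16 * 1024  # chunk size assumed for this sketch

def chunk_hashes(data: bytes, size: int = CHUNK) -> list:
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

def build_index(shared: dict) -> dict:
    """Map each chunk hash to the (filename, chunk number) pairs that hold it."""
    index = defaultdict(list)
    for name, data in shared.items():
        for n, h in enumerate(chunk_hashes(data)):
            index[h].append((name, n))
    return index

def sources_for(wanted: bytes, index: dict) -> list:
    """For each chunk of the wanted file, list where else it can be found."""
    return [index.get(h, []) for h in chunk_hashes(wanted)]
```

Here `sources_for` returns one list per chunk of the wanted file, so a chunk that also lives inside some unrelated file on another tracker shows up as an extra source.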

There could be a fairly large performance increase if I've understood the paper correctly. I have a 10Mb downstream cable connection at home. If I connect to a torrent that has many more seeders than leechers I can easily top out my d/l speed at around 1.1 MB/s. Reverse that scenario and the d/l is extremely slow due to the lack of seeders able to send out chunks of the file. Now, imagine there are multiple copies of this same file on multiple trackers being shared by many, many more seeders than this one torrent. This new implementation will find those chunks, as well as the chunks you originally connected to. Next time, RTFA.
Re:Similar Files? by woolio · 2007-04-11 04:48 · Score: 1

At the end of the project, you've spent years on a project that'll never quite work right to save a bit of bandwidth for people that should have just gone and bought the song from iTunes in the first place if they wanted it that damned bad. And if they don't want it that bad, they aren't going to bother with some specialized P2P program that only has 1 advantage: It can tell some files are alike. (And probably has tons of disadvantages compared to the already-existing applications.)

After spending several years of my life in grad school, I'm beginning to think that a lot of university research is like that.
Re:Similar Files? by Anonymous Coward · 2007-04-11 05:22 · Score: 0

Empirical testing proves you wrong. You'd be AMAZED how many common actions in standard media players perform trivial modifications to the ID3 tags that change the file hash. Just playing it with WMP can do it, or with iTunes, or even with some versions of foobar if set in a certain way.

In any case, there are much more interesting uses of this technology. You're looking at what can amount to a distributed, swarmed rsync-like protocol with only a single secure* hash to identify each file and that automatically accounts for binary diffs, and you don't think it's useful?! Plonk an Atom/RSS feed on it then, and subscribe to updates, and watch them come in to a huge swarm using very little bandwidth. Sound more useful now?

* Well, no-one's proved otherwise on TIGER192 yet; it's not orthodox enough for the Xiaoyun Wang et al. attacks to work. The final, improved THEX structure is secure (you need a flag bit to identify whether this layer came from the data or from another layer of hashes, otherwise it's collidable).
Re:Similar Files? by Delkster · 2007-04-11 09:28 · Score: 1

With this technology it can use any file that has chunks with the exact same hash as the file being downloaded by the user.
That makes one wonder, though, if the downloader might occasionally get chunks that have the same hash but where the actual data is still different. It's rare, I know, but if this were in common use, would it occasionally happen?

Hashes work fine against data corruption and intentional poisoning of a certain file because it's very rare for random corruption to happen so that the hash remains the same, and with that comes the fact that it's also very difficult (or at least laborious) to intentionally find another piece of data that has the same hash but isn't the same. However, if the comparison is done to all possible chunks in a P2P network, would the large mass of all possible chunks yield at least some that just happen to have the same hash?

It may not work in intentional poisoning because you'd have to generate the masses to find out which ones have a hash that matches the original but here you have the masses given and some just might match.

However, I don't know enough about the properties of MD5 or other hash algorithms to know whether that possibility would be significant if a hash is also computed for the entire file, not just all smaller chunks. After that it doesn't intuitively sound very likely anymore.
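
A back-of-the-envelope check of this worry, using the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n random chunks and a b-bit hash. It assumes the hash behaves like a random function, so it covers accidental collisions only. Deliberately constructed MD5 collisions are a separate matter that this estimate says nothing about.

```python
import math

def collision_probability(n_chunks: int, hash_bits: int) -> float:
    """Approximate chance that any two of n random chunks share a hash."""
    exponent = -n_chunks * (n_chunks - 1) / 2 / 2 ** hash_bits
    return -math.expm1(exponent)  # 1 - exp(exponent), accurate for tiny values
```

For example, `collision_probability(10**12, 128)` (a trillion chunks under a 128-bit hash such as MD5) comes out on the order of 1e-15, so purely accidental matches look negligible even before a whole-file hash is checked on top.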
Re:Similar Files? by cubic6 · 2007-04-11 09:46 · Score: 1

It's more than a few. Most people use the default settings in their audio ripper/compression program, and it's all from the same CD. Even more people never use an audio ripper and/or compressor, and simply download the file from the Internet. Not that many people bother to change ID3-tags either, but every single person who does produces a different file.

I think the variation in files would be more due to the fact that the most common CD ripping programs (iTunes, WMP, etc) don't really care that much about getting bit-identical copies. Different models (and even individual units) of CD/DVD drives can have varying amounts of jitter, and most ripping software doesn't use jitter correction or C2 error correction. It's not really noticeable in the result, but when you run slightly different sources through a heavy compression algorithm like MP3, there's no guarantee that the resulting bitstreams will be similar enough to use a tool like the article talks about.

The different tags issue has been discussed in other comments, but it's pretty trivial to strip the tags when you compare or hash files.

--
Karma: Contrapositive

Thats the dumbest thing I've ever heard by MetalliQaZ · 2007-04-11 03:50 · Score: 0, Flamebait

One wonders if these "researchers" have ever actually used p2p...

--
"Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"

Re:Thats the dumbest thing I've ever heard by dknj · 2007-04-11 03:57 · Score: 1

Or it may even be native to the service itself (i think gnutella or edonkey.. one of the two). Anyway, the point is if you search for a certain file and the hash for specific blocks are identical, Shareaza will attempt to download that copy as well. So if you download, say, a 4mb tupac song and suddenly see a 2.5mb britney spears song in the list.. don't cancel the download. It could be that the 2.5mb britney spears song is actually the same tupac song that was renamed and incomplete.

However, there is no reason why the current generation of bittorrent clients don't already support this via DHT...
Re:Thats the dumbest thing I've ever heard by TodMinuit · 2007-04-11 04:09 · Score: 1

Not really. Let's use BitTorrent as an example. BitTorrent links chunks to a torrent. If it were to just track chunks, you could get a speed up for certain torrents.

Let's say I make a torrent, example-v1.0.torrent, containing file X (with the checksum "foobar"), and file Z (with the checksum "deadbeef"). I seed it, people download it, yippie.

Now let's say later on, file X changes, and now has the checksum "barfoo". So I create example-v1.1.torrent. Under the current BitTorrent system, both file X and file Z would have to be seeded from scratch. Whereas if you were to merely track chunks, anyone currently distributing file Z, which hasn't changed, would be used in the seeding of example-v1.1.torrent.
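
The example above reduces to a set lookup on checksums. In this toy sketch the manifests are plain dictionaries from file name to checksum string, reusing the made-up "foobar", "barfoo" and "deadbeef" values from the comment. `reusable_files` is a hypothetical helper, not part of any BitTorrent client.

```python
def reusable_files(old: dict, new: dict) -> list:
    """Names in the new torrent whose checksum is already being distributed."""
    already_seeded = set(old.values())
    return sorted(name for name, checksum in new.items()
                  if checksum in already_seeded)

v1_0 = {"X": "foobar", "Z": "deadbeef"}
v1_1 = {"X": "barfoo", "Z": "deadbeef"}
```

Calling `reusable_files(v1_0, v1_1)` gives `["Z"]`: only the changed file X would need fresh seeding if the swarm tracked content rather than torrents.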

For things like operating system ISOs, you could get a head start when seeding new versions. For compressed data, like MP3s or videos, you're screwed.

--
I wonder if I use bold in my signature, people will notice my posts.
Re:Thats the dumbest thing I've ever heard by Anonymous Coward · 2007-04-11 06:01 · Score: 0

Cool thing? This would also decrease the proportion of "dead" torrents out there, as first, anyone connecting to example-v1.0.torrent could grab file Z, and second, folks could easily seed multiple versions/torrents of the same stuff without playing around with file naming.

Downside? It gets really, really easy for the RIAA/MPAA to clear the chaff and find EVERY shared copy of their song/movie and initiate countermeasures.

Mitigating factor? That's a lot of C&D's to send out. Is that enough?

grea tide a by underwhelm · 2007-04-11 03:51 · Score: 5, Funny

I'm hoping this CATCHES ON and wet ransfer a11 sorts of information like this. It'11 be 1ike getting every thing in the form of a ransom n0te.

--

I don't need large brains to have a good time.

The music kids listen to by Vollernurd · 2007-04-11 03:51 · Score: 3, Funny

So it's not me then? All new tunes DO sound the same?

--
Smokey, this is not 'Nam, this is bowling. There are rules.

Re:The music kids listen to by penp · 2007-04-11 03:58 · Score: 0

It's not that they sound the same, it's that the files are similar. If a file is encoded in the same format as another, I'm guessing that this data that is part of these encoders' nature is what they are referring to. The similar (binary) data is data that has nothing to do with the actual output of the audio/video.
Re:The music kids listen to by ronanbear · 2007-04-11 04:21 · Score: 0

meh, if that's the case then you could just download one song, once, and then download the additional (non identical) data to turn it into a whole music collection.

--
the more they over-think the plumbing the easier it is to stop up the pipe
Re:The music kids listen to by iminplaya · 2007-04-11 04:44 · Score: 1

It's not that they sound the same, it's that the files are similar.

If you ever have the chance, go over and test drive the mail program that comes on every new macintosh, and send off a test mail. Make sure the volume is turned up.

--
What?
Re:The music kids listen to by jonadab · 2007-04-11 09:20 · Score: 1

> So it's not me then? All new tunes DO sound the same?

Pretty much. Hardly any musicians are writing any actual counterpoint anymore, and the principle of contrary motion has been largely disregarded these last two hundred and fifty years.

I think it's Bach's fault. New musicians take one look at his later works, particularly Art of Fugue, and immediately say to themselves, "I'll never be able to write anything that good in a billion years, so why try?" So then they just string some notes together into a basic melody, throw in a couple or three harmony parts, and let it go at that. It's so much easier, and at least two thirds of the population doesn't have enough musical training to know the difference anyway.

--
Cut that out, or I will ship you to Norilsk in a box.
Re:The music kids listen to by Anonymous Coward · 2007-04-11 11:08 · Score: 0

Yes, you could do that. Indeed, you could download a jpeg or avi and patches for it to turn it into any song you would like. Mind you, the patches would be pretty big and essentially contain the song.
Re:The music kids listen to by ivan+kk · 2007-04-11 14:03 · Score: 1

No, no, not at all.
Well, sorta http://www.youtube.com/watch?v=JdxkVQy7QLM

They already work like this by Noishe · 2007-04-11 03:51 · Score: 1

Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.

I've never used a p2p program that won't download from two songs, just because they are labeled differently, or have different headers.

Re:They already work like this by multipartmixed · 2007-04-11 04:46 · Score: 0, Flamebait

Still haven't learned about ID3 tags, huh?

You poor bastard.

Even Media Player messes with them now.

--

Do daemons dream of electric sleep()?
Re:They already work like this by dave420 · 2007-04-11 10:24 · Score: 1

You're confusing filenames with binary metadata stored in the file. P2P uses hashes of the complete file at its most permissive (and some protocols include the filename in the hash, also). That means if you change the artist information in your ID3 tags, the file will have a different hash. Change the filename, and for some protocols you have a different hash. P2P uses hashes to match binary-duplicate files, and then download from those sources. If they don't take the filename into account when generating the hash, then files with the same contents but different names will also be added as sources. So I would imagine that's what you've noticed.

That'll work great with by Rik+Sweeney · 2007-04-11 03:51 · Score: 1

porn.

No wait, hear me out. Most porn is going to be largely white or black skin colour (particularly with Friesian Cows if you're into that sort of thing), so the P2P can just find a chunk with a similar amount of that colour and download that!

--
Summation 2

Re:That'll work great with by FlatLine84 · 2007-04-11 03:58 · Score: 0

I'm sorry, if I'm downloading porn, I would rather have it as one file, instead of risking "mixing" genres... Although, for some people, that could be interesting I suppose.
Re:That'll work great with by Bat+Country · 2007-04-11 09:13 · Score: 1

That gives me a fabulous idea.

Imagine taking a video, figuring out its proximity to individual frames within a collection of other videos, then substituting those frames into a third.

Then you'd have something which is fooling your eye into seeing something happen which is totally other than what is actually occurring, provided the frame rate was high enough.

Obviously you'd need a massive corpus of high quality video to draw frames from, but the end result would be like an incredibly low fidelity moving photocollage.

--
The land shall stone them with the bread of his son.

TorrentSoup by snsr · 2007-04-11 03:52 · Score: 1

It must be too early for me.
Why would I want to let someone transfer part of a file on my system that resembles part of a completely different file that they're looking for? Maybe everyone should just transfer all of their files all the time.

Re:TorrentSoup by eric76 · 2007-04-11 03:56 · Score: 2, Insightful

The only thing I use the file sharing networks for is to download new images of FreeBSD and Linux using BitTorrent.

The last thing I want is a "similar" file.

What would be a "similar" file to a FreeBSD ISO? It would either be a corrupted file or one with an introduced exploit.
Re:TorrentSoup by joe_cot · 2007-04-11 04:06 · Score: 5, Informative

It would still work the same way as it does now: an md5 of each specific block, and an md5 of the whole thing. If the md5 for the block doesn't match, it's not going to download, and if it's someone using a collision to inject a block with the same md5, 1) it's not going to pass the md5 on the whole thing, 2) you're already vulnerable to it. The reason this will work is that there'll be lots of people sharing incomplete or corrupted versions of your FreeBSD iso; you'll get the blocks that are good, and skip the blocks that aren't, making "similar" files very useful. Not too difficult to understand, and no need for tin foil hats.
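
A minimal sketch of the per-block check described above, assuming MD5 over fixed-size blocks. The block size, file contents, and function names are hypothetical and chosen only for illustration; real clients use much larger blocks.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, so the example stays readable


def block_hashes(data, block_size=BLOCK_SIZE):
    """MD5 hex digest of every fixed-size block of data."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def usable_blocks(wanted_hashes, peer_data, block_size=BLOCK_SIZE):
    """Indices of the peer's blocks whose hash matches the wanted block."""
    peer = block_hashes(peer_data, block_size)
    return [i for i, (want, have) in enumerate(zip(wanted_hashes, peer))
            if want == have]


original = b"FreeBSD-6.2-RELEASE-i386-disc1"
corrupt = b"FreeBSD-6.2-RELEASX-i386-disc1"  # one damaged byte

wanted = block_hashes(original)
good = usable_blocks(wanted, corrupt)
print(good)  # every block except the one holding the damaged byte
```

Only the block containing the damaged byte is rejected; the others remain valid sources, and the whole-file hash is still checked at the end.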
Re:TorrentSoup by drix · 2007-04-11 04:27 · Score: 3, Insightful

Because it gets you published and, thus, increases your chance for tenure, that from which all blessings flow.

--

I think there is a world market for maybe five personal web logs.
Re:TorrentSoup by ShieldW0lf · 2007-04-11 04:41 · Score: 3, Interesting

So if someone is sharing an older ISO, and it happens to have large portions that exactly match the one you're downloading, with other portions that are not identical, you don't want to download the identical chunks off that person?

It would be interesting if the implementing software could also look for possible matches within your existing file structure and reduce the data downloaded automatically, kind of like using diff and just downloading the patch.

--
-1 Uncomfortable Truth
Re:TorrentSoup by CastrTroy · 2007-04-11 04:58 · Score: 3, Informative

I'm not sure if this would work if you changed the byte offset though. Sure both ISOs may contain a lot of the same data, but I think it's very unlikely that the data would be at the same byte-offset in the file. I don't think that you'd be able to accomplish this for different byte offsets, because for a 100 MB file, assuming 5 MB chunks, you're looking at about 2,000,000,000 chunks to calculate (20 chunks, calculated at each byte offset).

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:TorrentSoup by NittanyTuring · 2007-04-11 05:16 · Score: 1

It's a little more complicated than that... it has to use a rolling hash so that it can get variable-size chunks. It won't work too well with fixed-size chunks.
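
A rough sketch of what variable-size chunking with a rolling hash can look like. This is a generic content-defined chunking scheme, not necessarily what the paper does; the window size, hash parameters, and cut rule are all assumptions made for the example.

```python
import random

WINDOW = 16            # bytes covered by the rolling hash (assumed)
BASE = 257             # polynomial base (assumed)
MOD = (1 << 31) - 1    # modulus for the rolling hash (assumed)
DIVISOR = 64           # a cut roughly every 64 bytes on random data


def chunk_boundaries(data):
    """End offsets of chunks; a cut depends only on the last WINDOW bytes."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    cuts = []
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i >= WINDOW - 1 and h % DIVISOR == DIVISOR - 1:
            cuts.append(i + 1)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))
    return cuts


def chunks(data):
    """Split data at the content-defined boundaries."""
    out, start = [], 0
    for end in chunk_boundaries(data):
        if end > start:
            out.append(data[start:end])
        start = end
    return out


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4000))
modified = b"ID3 tag grew by a few bytes" + original

shared = set(chunks(original)) & set(chunks(modified))
print(len(chunks(original)), len(shared))
```

Because a cut depends only on the bytes inside the window, inserting data near the front shifts the later boundaries along with the content, so every chunk after the first is unchanged. Fixed-size chunks would all shift and stop matching.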
Re:TorrentSoup by kwark · 2007-04-11 05:17 · Score: 1

This is exactly what rsync does (IIRC, it's been a while since I last tried to read the whitepaper).
Re:TorrentSoup by Anonymous Coward · 2007-04-11 06:39 · Score: 1, Informative

Your 1) is not true. MD5 has the property that:

If H(x) = H(y), then H(x+z) = H(y+z)!
Re:TorrentSoup by CastrTroy · 2007-04-11 06:56 · Score: 1

From my understanding, RSync's job is made easier because it knows which files it's comparing. It looks at file A on location X, and compares it to file A on location Y. It looks for places where the file is the same, and doesn't transfer the data. It becomes a lot harder when you have a set of files A(1)-A(X) and you have to determine if any parts of A(n) exist in File B on the other system.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:TorrentSoup by chgros · 2007-04-11 06:59 · Score: 1

if it's someone using collision to inject a block with the same md5
Thankfully, that's not practical at this time.
You can fairly easily generate 2 chunks with the same md5.
You cannot easily generate a chunk with the same md5 as a given, pre-existing chunk.
Re:TorrentSoup by kwark · 2007-04-11 08:19 · Score: 1

rsync is a very complex protocol (to me), it does find matching chunks at different offsets:

http://samba.anu.edu.au/rsync/how-rsync-works.html
"The sender process reads the file index numbers and associated block checksum sets one at a time from the generator.

For each file id the generator sends it will store the block checksums and build a hash index of them for rapid lookup.

Then the local file is read and a checksum is generated for the block beginning with the first byte of the local file. This block checksum is looked for in the set that was sent by the generator, and if no match is found, the non-matching byte will be appended to the non-matching data and the block starting at the next byte will be compared. This is what is referred to as the "rolling checksum"

If a block checksum match is found it is considered a matching block and any accumulated non-matching data will be sent to the receiver followed by the offset and length in the receiver's file of the matching block and the block checksum generator will be advanced to the next byte after the matching block.

Matching blocks can be identified in this way even if the blocks are reordered or at different offsets. This process is the very heart of the rsync algorithm."

While you are right that rsync has a much simpler job, it is still based on exchanging a list of checksums.

A long, long time ago I was trying to download a Debian iso image. The howto to generate one was kinda like:
-download "this list" of packages from your favorite mirror
-cat *.deb > foo.iso
-rsync the remote iso image to your local foo.iso

The result was an iso generated from files that could have been downloaded from n sources.

This method is still mentioned under the "Aargh! The script fails with an error - have I downloaded all those MBs in vain?!" section on http://www.debian.org/CD/jigdo-cd/
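
The rolling-checksum matching quoted above from the rsync documentation can be sketched roughly as follows. This is a simplified illustration, not rsync's actual code: the weak checksum follows the general two-part rolling-sum idea, MD5 stands in for the strong check, and the block size and function names are hypothetical.

```python
import hashlib

M = 1 << 16


def weak(block):
    """Two-part weak checksum that can be updated in O(1) per shift."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, size):
    """Slide the window one byte: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - size * out_byte + a) % M
    return a, b


def delta(receiver_data, sender_data, size=4):
    """Describe sender_data as copies of receiver blocks plus literal bytes."""
    index = {}
    for off in range(0, len(receiver_data) - size + 1, size):
        block = receiver_data[off:off + size]
        index.setdefault(weak(block), []).append(
            (hashlib.md5(block).digest(), off))
    ops, literal, i, n, state = [], bytearray(), 0, len(sender_data), None
    while i + size <= n:
        if state is None:
            state = weak(sender_data[i:i + size])
        strong = hashlib.md5(sender_data[i:i + size]).digest()
        match = next((off for s, off in index.get(state, ()) if s == strong),
                     None)
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal.clear()
            ops.append(("copy", match, size))
            i += size
            state = None          # recompute at the next block start
        else:
            literal.append(sender_data[i])
            if i + size < n:      # only roll if another full window exists
                state = roll(*state, sender_data[i], sender_data[i + size],
                             size)
            i += 1
    literal.extend(sender_data[i:])
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def apply_delta(receiver_data, ops):
    """Rebuild the sender's file from the receiver's blocks and literals."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += receiver_data[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)


receiver = b"AAAABBBBCCCCDDDD"
sender = b"xxBBBBAAAAyDDDD"
ops = delta(receiver, sender)
print(ops)
```

Here the reordered blocks are found at their new offsets, and only the three unmatched bytes would need to travel as literal data.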
Re:TorrentSoup by Anonymous Coward · 2007-04-11 08:30 · Score: 0

What would be a "similar" file to a FreeBSD ISO? It would either be a corrupted file or one with an introduced exploit.

Linux? [ducks]

This could work for some files, but not for others by Icarus1919 · 2007-04-11 03:52 · Score: 0, Redundant

What if you're downloading a backup copy of a movie that is subtitled, as opposed to a version that is not? Considering how much space the video data takes up, these files would be awfully similar. Sometimes it IS the little things that make all the difference in files. Version of software with bugfix as opposed to without? Only difference between the two files is the name (1.0 vs. 1.1) and the fixed lines of code. That sort of thing. Seems to me this may not be as useful as advertised.

Right.... by simm1701 · 2007-04-11 03:52 · Score: 1, Troll

Sure this is going to work... really

I'll just splice that bit from that torrent, that bit from that one... it should work, I mean they are all the same TV episode and they are all mpeg4 - the file name says so...

Hmmm how about which bitrates, codecs, if it was from TV whether it was started at the same time??

That guy seriously has to be joking - the byte offsets are unlikely to ever specify a suitable join - and even if they rewrote the protocol so it split by seconds rather than fixed file widths you'd still have changing codecs and bitrates to deal with. Personally I'll stick to torrents with decent known trackers

--
$_="Slashdotter";$syn="OTT";s;..;;;sub _{print shift||$_};s!ash!Perl !;s=$syn=ack=i;tr+LLEd+BLAH+;_"Just Another ";_

Re:Right.... by Icarus1919 · 2007-04-11 03:55 · Score: 2, Informative

Ok, perhaps you're not certain how files work. But things compressed with different codecs and bitrates look VERY different when you actually look at the coding in the file as opposed to the same file named differently or with minor changes.
Re:Right.... by Volante3192 · 2007-04-11 03:56 · Score: 1

Beat me to it. This sounds like it'd just wreak havoc on checksums.

Plus, I'd also like to add that P2P doesn't download from a single source as the summary claims...but it pulls down chunks of the same file from MANY sources.

Maybe these guys are confusing P2P with FTP?...
Re:Right.... by angio · 2007-04-11 04:01 · Score: 5, Informative

Take a peek at the paper - it actually does work, and we demonstrated it. The intuition: people make small changes to files like changing the artist or title in the MP3 header, and then BitTorrent and other systems treat this as a "different" file, when in fact it's 99.9% similar.
(Yes, I'm one of the authors.)
Re:Right.... by Anonymous Coward · 2007-04-11 04:11 · Score: 0

RTFA. It only downloads identical chunks (by some hashing technique they call "handprinting"), you'll end up with the original file just the same as before. Consider the situation of the same film but with different subtitles and / or audio; if the source / encoder / settings were otherwise the same then a lot of the video chunks will match.

Likewise the "99% similar" mp3 situation mentioned is the case where you have essentially the same mp3 with different ID3 tags, though as pointed out by another comment a lot of p2p programs already ignore the headers for swarming purposes. The basic concept is sound, the question is if it can scale well enough to be practical; every incoming chunk request hash must presumably be searched against the hash of every chunk shared if the hash of the file isn't found, and since the chunk size is only 16k that's likely to be a LOT of chunks.
Re:Right.... by Reason58 · 2007-04-11 04:16 · Score: 0, Troll

The intuition: people make small changes to files like changing the artist or title in the MP3 header, and then BitTorrent and other systems treat this as a "different" file, when in fact it's 99.9% similar.

BitTorrent, as most of you know, doesn't work this way. Files are selected from a server called a "tracker", and only users with that exact file size and hash will be linked with you. The only way you could implement a system like this is to create an entirely new protocol, server software and client software. Given the widespread adoption of BitTorrent I think the performance gains would have to be very substantial for people to migrate.
Re:Right.... by Maximum+Prophet · 2007-04-11 04:22 · Score: 1

Isn't this rsync meets bitTorrent?

It sounds like what you are saying is that someone wants to download X, but there are few sources of X. There are many sources of Y, which is really X, renamed. Your tool would download the proper header info from the X source and the majority of the data from the Y sources.

--
All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
Re:Right.... by nine-times · 2007-04-11 04:25 · Score: 1

I'll admit that I'm definitely not the most educated person in these matters, but I just don't quite "get it". If you are going to download only the differences between two files, doesn't that require that some computer has access to both files and can compare the differences? If one end or the other doesn't have both files, wouldn't you need to transfer the file first to make the comparison? (meaning you'd still need to download the whole thing?)
Anyway, I've thought about this before, even though I don't have the technical background to think about it properly-- what's the legal implication on copyrights in this instance? Let's imagine I have 50 public domain movies in MPEG format that I'm sharing through bittorrent, and that, miraculously, you could take different small chunks of data from these various movies, put those chunks into a different order, and create a full copy of a copyrighted Hollywood movie released a few months ago. Now someone else introduces a bittorrent tracker that can pull those chunks from those 50 movies and put them into the correct order. In this unusual situation, is anyone violating the copyright of the Hollywood movie?
It seems like a technicality, but you'd have to figure that, for any size-limited chunk of data, there are a finite number of possibilities. Therefore, if you had a hard drive storing a different 4KB file representing every possible combination of data that could exist in 4KB, you would, in a sense, have every piece of information that it would be possible to create stored on that hard drive. Of course, in order to create a piece of information, you'd need additional information: which chunks need to be combined, in which order, and at what point to cut off the last 4KB chunk.
So it seems like a weird gray area to me. In one sense, you could consider this as a form of encoding. However, if several people are sharing several pieces of unrelated information which can be pieced together into a copyrighted work, it doesn't seem to me that anyone is necessarily guilty of copyright infringement. On the server end, the copyrighted work simply doesn't exist. At no point is the copyrighted film actually being copied. But after a given series of data chunks are copied, they can be put into a series which results in the copyrighted work.
I know, I know, I probably sound very silly to those who know better.
Re:Right.... by Incy · 2007-04-11 04:26 · Score: 1

Interesting.. The four types of "similar" files were things that were mis-labeled, in a different language, had errors, or packaged differently.... I wouldn't have expected those 4 cases to add up to anything significant. However I guess it only takes 1 mostly similar file to double*ish* the speed of your download.. so it doesn't take much to have a big impact on your download..
Re:Right.... by Anonymous Coward · 2007-04-11 04:34 · Score: 0

If that's all you're doing, then GNUnet does that already. For MP3 files with different ID3 tags, the first 99% of the file is exactly the same. But if you have something more complex, like the same movie with different subtitles, then you have to find similar pieces which are scattered around at different offsets.
Re:Right.... by Hatta · 2007-04-11 04:35 · Score: 1, Interesting

If I download a song with bittorrent then change the tag, it doesn't treat it as a different file. It treats it as a corrupted file. It does indeed recognize that it is 99% similar, and it can use that file to seed the similar parts. How is this novel again?

--
Give me Classic Slashdot or give me death!
Re:Right.... by joshv · 2007-04-11 04:42 · Score: 1

It would indeed be odd for someone to publish a paper about a novel way of speeding up downloads if in fact BitTorrent, or any other file sharing protocol, already worked that way. Thanks for pointing out that they don't.
Re:Right.... by Incy · 2007-04-11 04:43 · Score: 1

Although this comment from the article begs for further clarification: "We intentionally sampled a subset of files that were likely to be similar to each other and that we expected people to be interested in downloading."
Re:Right.... by zero1101 · 2007-04-11 04:47 · Score: 1

You can simplify your example and see why it doesn't make sense by looking at it this way: if I have a database on my hard drive that stores every possible representation of 1 bit, all I need to do to recreate a remote file is download a list that gives me the number of bits, as well as the state of each bit in the remote file in sequence. I can then rebuild the file by looking up the bits in the local state database in sequence and writing the result out to a file.
Re:Right.... by Anonymous Coward · 2007-04-11 04:48 · Score: 0

Checksums.

Take 128-bit MD5s of your 4k chunks. If your downstream bandwidth isn't saturated, look for other chunks with identical checksums and download them. When you get "the whole file", you can then use a checksum on it to verify that the chunks work. If this fails, you begin replacing the "similar" chunks with ones from your original source until the final file passes.

You're basically using your extra bandwidth to gamble on identical checksum chunks. If some of these work, you could get a faster download.
Re:Right.... by mini+me · 2007-04-11 04:52 · Score: 1

Trivial example:

I want to download a file that contains "Hello World"
Another user has a file that contains "Hello John"

I can download "World" from user A, and download "Hello " from user B to make up the file I'm looking for, even though user B does not actually have the actual file on his computer.
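
The same toy example as a sketch in code, assuming the downloader already holds a list of chunk hashes (a "recipe") for the file it wants. The chunk split, the use of SHA-1, and all names here are hypothetical.

```python
import hashlib


def digest(chunk):
    """Identify a chunk by the hash of its contents."""
    return hashlib.sha1(chunk).hexdigest()


# What the downloader knows in advance: the ordered chunk hashes of the file.
recipe = [digest(b"Hello "), digest(b"World")]

# What each peer happens to hold, already split into chunks.
peers = {
    "A": [b"Hello ", b"World"],   # has the exact file
    "B": [b"Hello ", b"John"],    # has a different but similar file
}


def assemble(recipe, peers):
    """Rebuild the file and list which peers could serve each chunk."""
    available = {}
    for name, held in peers.items():
        for chunk in held:
            available.setdefault(digest(chunk), []).append((name, chunk))
    data = bytearray()
    sources = []
    for wanted in recipe:
        offers = available[wanted]
        data += offers[0][1]
        sources.append([name for name, _ in offers])
    return bytes(data), sources


result, sources = assemble(recipe, peers)
print(result, sources)
```

User B never had "Hello World", yet still shows up as a valid source for the first chunk.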
Re:Right.... by Dare+nMc · 2007-04-11 04:54 · Score: 1

if it was from TV whether it was started at the same time??

More like creating a diff program for binary files, not just text.
That's the part they're sharing. I.e., if you find identical segments in similar files, you can grab the identical segments from either torrent. They even have the example, a trailer.
A better example would be a movie with 3 versions, i.e. an extended version, a theatrical version, and a TV version. If you cut the theater version and TV version from the extended version (after encoding, etc.), you could download the extended version after downloading the theater version in 1/8th the time, since you really just need a difference file.
Re:Right.... by maxume · 2007-04-11 04:54 · Score: 1

They are improving the ability of the software to identify files as being the same, rather than reporting files with superficial differences as different (where superficial is different tags or whatever, and the media stream is identical).

--
Nerd rage is the funniest rage.
Re:Right.... by Anonymous Coward · 2007-04-11 04:56 · Score: 0

umm...

i don't think there is any suggestion to just trust the filename as you seem to be implying. you subdivide each file (making sure to ignore metadata) into chunks and then hash each chunk and compare the hashes.

as far as torrents, haven't you ever noticed that many torrents will have mostly the same files in them but be "different" because there are different nfo/txt files in them? haven't you ever thought it would just make a lot of sense if you could combine "different" torrents together which have files which are 99% the same? the concept in the article is the same, except looking for redundancy at the chunk level, rather than the file level as in my example.
Re:Right.... by Incy · 2007-04-11 05:01 · Score: 1

RTFA
Re:Right.... by Anonymous Coward · 2007-04-11 05:19 · Score: 0

http://en.wikipedia.org/wiki/Md5sum Educate yourself then.
Re:Right.... by Reason58 · 2007-04-11 05:22 · Score: 1

It would indeed be odd for someone to publish a paper about a novel way of speeding up downloads if in fact BitTorrent, or any other file sharing protocol, already worked that way. Thanks for pointing out that they don't.

Please excuse me for pointing out the de facto standards for P2P do not currently implement something like this, nor do they allow for this modification within the current structure. I thought this was a place for discussion of the article.
Re:Right.... by hotdiggitydawg · 2007-04-11 05:29 · Score: 1

OK... so instead let's go with the same codec, same bitrate, but ripped from two different instances of the same media. You can't seriously tell me that the same analog signal will be sampled exactly the same way every single time.

Have you tried ripping the song from the same CD twice in a row and doing a binary diff? Have you then tried it on a different PC with the same codec/bitrate/software? How about on a different OS with the same codec/bitrate?
Re:Right.... by TheoMurpse · 2007-04-11 05:29 · Score: 1

It does indeed recognize that it is 99% similar
I think that may depend on if the affected metadata affects the length of the entire file. I'm not sure of BT's hashing function, but I think (let X be a constant that is configurable by the guy making the torrent file) when creating a torrent, it creates a hash of each string of X bytes. Thus, if you make the file a different size (by, say, changing the genre tag from "Jpop" to "J-pop" -- assuming there is no NULL padding in the ID3 tag -- or by removing/adding NULL padding in the tag), the hash for each successive part will be different, and thus will only recognize the previous data as valid and uncorrupted (which would be around 1% or less of the file, I'd wager).

Although I could be wrong -- I know an awful lot about ID3 tags, but not so much about BT's torrent-creation process.
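
A small sketch of the effect described above, assuming fixed-length pieces hashed with SHA-1 (which is what BitTorrent piece hashes use). The piece length, tag layout, and file contents are made up for the example.

```python
import hashlib

PIECE_LEN = 64  # tiny piece length, for illustration only


def piece_hashes(data, piece_len=PIECE_LEN):
    """SHA-1 of each fixed-length piece, in order."""
    return [hashlib.sha1(data[i:i + piece_len]).digest()
            for i in range(0, len(data), piece_len)]


def matching_pieces(a, b, piece_len=PIECE_LEN):
    """Count the positions at which both files have an identical piece."""
    return sum(1 for x, y in zip(piece_hashes(a, piece_len),
                                 piece_hashes(b, piece_len)) if x == y)


audio = bytes(range(256)) * 4                 # stand-in for the audio stream
base = b"TAG:Jpop;" + audio
same_length_edit = b"TAG:Rock;" + audio       # tag changed, size unchanged
one_byte_longer = b"TAG:J-pop;" + audio       # tag grew by one byte

print(len(piece_hashes(base)))                   # total pieces
print(matching_pieces(base, same_length_edit))   # only the first piece differs
print(matching_pieces(base, one_byte_longer))    # every later piece is shifted
```

A same-size tag edit only damages the piece that holds the tag, while a one-byte insertion misaligns every following piece, even though the audio data is byte-for-byte identical.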
Re:Right.... by nine-times · 2007-04-11 05:34 · Score: 1

Well, yes, I know that you can obviously do a checksum, but that won't tell you which parts of the file have changed. Unless, that is, you run checksums on the individual chunks. However, checksums do not uniquely identify a file. That is, it's been shown that you can manufacture a file to match a given checksum and yet have it be different from the file that the checksum was originally created from.
Part of the reason checksums work so well is that it's extremely unlikely that two given files will have the same checksums. So, for example, if a file is corrupted it's *extremely* unlikely to generate the same checksum as the original. However, if we all split all the files on our hard drives into little chunks and ran a checksum on them all, would I feel extremely confident that we could swap all the chunks with matching checksums without anything getting corrupted? I'm not sure. It would depend on how many different chunks of data you were comparing, but obviously, given enough files, you'd eventually hit two that matched.
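
For a rough sense of scale on the "given enough files" worry, here is a back-of-the-envelope sketch using the standard union (birthday) bound. It assumes an ideal hash and purely accidental collisions, so it says nothing about deliberately manufactured ones; the chunk count is an arbitrary example.

```python
from fractions import Fraction


def collision_bound(num_chunks, hash_bits):
    """Upper bound on the chance that any two distinct chunks share a hash.

    Union bound: (number of pairs) / 2**hash_bits, assuming an ideal hash.
    """
    pairs = num_chunks * (num_chunks - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# Example: a trillion distinct chunks hashed with a 128-bit function.
bound = collision_bound(10 ** 12, 128)
print(float(bound))
```

Under those assumptions the bound lands between 2^-50 and 2^-49, so accidental matches are not the practical concern; engineered collisions are a separate question.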
Re:Right.... by angio · 2007-04-11 05:41 · Score: 3, Informative

Similar in spirit - except rsync looks at files on your local hard drive by the same name, so there's only one possible candidate to draw from. SET looks at all of the files that everyone else is currently downloading, so we had to develop a much more efficient technique for locating useful files.
Re:Right.... by nine-times · 2007-04-11 05:44 · Score: 1

Yes, that's why I said "you could look at it as just encoding". The extreme case of my example, obviously, is the single bit. Therefore, the most likely candidate for copyright infringement would be the person providing the bittorrent tracker that told you which chunks to download and which order to put them in. However, it doesn't seem like the answer is so clear. If that were the case, then what about conventional bittorrent participants? Are they guilty of copyright infringement, or is it only the person offering the tracker?
Also, if you take the example in the other direction, what if someone offered two halves of a movie for download? Two complete halves. And then I told you, "Oh, well, put part one before part two, and you have the complete movie!" In that case, I wouldn't be guilty of copyright infringement, I don't think.
Personally, I think this is the inherent problem with copyright during the Internet Age. When you're talking about books or film, it's more clear what it means to "copy". However, digital media is constantly being cached and copied, pulled apart and strung back together. The result is a more abstract and ambiguous system.
Re:Right.... by Dare+nMc · 2007-04-11 05:49 · Score: 2, Interesting

The only way you could implement a system like this is to create an entirely new protocol, server software and client software.

I disagree, this could be done with no change to any protocol, client, or server: the only thing that needs to be created is a torrent creator located on a machine that has a full version of every "similar" torrent to be shared. The torrents would all be linked read-only to the same DVD image (for example); only part of that DVD image would be labeled as "downloaded", and you would put the client into upload-only mode on this host machine. The downloaders would have no idea this same data was shared under many different torrents.

For example, My company distributes our manuals as DVD's, different DVD for every machine Type. These go to 100's of locations. We put the same engine in 50 different machines. Also we do updated DVD's, that may just add a few pages of options (for example.)

If we distributed DVD images by BitTorrent, with a server at every one of our distributors, it would be as simple as patching the existing torrent image at the distributor.

I.e., distributor X may only have the DVD for model ZZ hosted on their server. Instead of having a single torrent served from their chunk of data, our main office, which has all of the DVDs we ever produced, could just send them a list of torrent files that tells that one server which chunks of their file are applicable to all of our DVDs. Now they may be able to host 100 different torrents, all saying "I've got 10% of model ZX, I've got 5% of model ZF"...
Re:Right.... by g2devi · 2007-04-11 06:29 · Score: 1

99.9% similar may be significantly different.

Compare a video clip that has Nixon saying "I am NOT a crook" with one that says "I am a crook".
Compare a copy of the US constitution which removes the "due process of law" provision with one that doesn't.

Both are 99.9% similar and both may be popular on BitTorrent as spoofs, but neither is what you'd want.
Re:Right.... by evilviper · 2007-04-11 07:00 · Score: 1

BitTorrent, as most of you know, doesn't work this way.

I, however, know that you're wrong, and that bittorrent DOES in fact work precisely that way.

Files are selected from a server called a "tracker", and only users with that exact file size and hash will be linked with you.

Then please explain how partial files are shared with bittorrent... They won't have the exact same size and hash until they're completely downloaded, yet everyone is sharing their partially downloaded file... One of the most important parts of bittorrent.

In fact, bittorrent, like many other P2P programs, uses a 'tree' hash, which allows it to identify which chunks of a file match the original, even though the partial file is a different size, and the entire file doesn't match the hash.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:Right.... by Anonymous Coward · 2007-04-11 07:09 · Score: 0

More significantly,

Debian Etch
and
Debian Etch (paper bag release that fixes a boot problem on some machines)
and
Debian Etch (with my crack to allow me to get root access)

may be 99.9999% the same but getting the wrong one could be useless to me (at best) or fatal (at worst).
Re:Right.... by Sibko · 2007-04-11 07:21 · Score: 1

Okay, cue car analogy: Your bittorrent program is trying to construct the car of your choosing. [Let's say a Prius.] It's going to grab all the different parts from all the Prius owners out there, but not just that! If all the Pinto owners happen to have the same hubcaps as your Prius, then your bittorrent program will use those hubcaps on your Prius. They're the same parts, just from a different car.

That's how this is going to work. The code itself is exactly the same, the only difference is in where it comes from. Which means that if Ubuntu shares 2 lines of the exact same code as a Nickelback song, you can download those select lines of code from someone sharing Ubuntu over P2P. It's the same parts, they just come from different places.
Re:Right.... by theJML · 2007-04-11 07:32 · Score: 1

That sorta makes sense if you were to say have a text file "A" that was the same as text file "B" except that someone inserted 3 lines of "Blahblahblah" near the beginning. But it'd have to parse the file in such a way that it would use variable block sizes to determine not only the difference between the two files, but note exactly what on whose server can fill in those pieces. Heck, it'd be cool if it could use files you already have downloaded to pull from, requiring less transfer. However, I wonder how much data would need to be transferred just to figure out that it doesn't need to transfer as much data. Seems like sort of a waste and a good way to overcomplicate things.

--
-=JML=-
Re:Right.... by TheoMurpse · 2007-04-11 07:57 · Score: 1

That sorta makes sense if you were to say have a text file "A" that was the same as text file "B" except that someone inserted 3 lines of "Blahblahblah" near the beginning.
Yeah, that's more or less what I was getting at. The tags are stored in MP3 files as a combination of text and binary data (the tag size and encoding are binary -- such as the BOM and whether the tag is unicode or not -- while the actual value of the tag is text). For example, the song title tag may be something like:
TIT2abcdefYellow Lasers
"TIT2" is an indicator for title-of-track;
"abcd" are 4 bytes indicating the length of the tag minus 10 bytes, whose value is calculated by the formula (((((a<<7)+b)<<7)+c)<<7)+d -- here, abcd should calculate out to 13, if I count correctly
"ef" are 2 bytes for certain flags
"Yellow Lasers" is the name of the track (and a great track it is)

So, as you can see, ID3 (was I calling it IDE before???) tag data is mostly plaintext.
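For anyone who wants to check the size arithmetic above, here is a minimal Python sketch of that decoding. It assumes the four size bytes are "synchsafe" (only the low 7 bits of each byte carry data), which is what the shift-by-7 formula implies. The function name and the sample bytes are made up for illustration.

```python
def decode_synchsafe(size_bytes: bytes) -> int:
    """Combine four 7-bit groups, most significant first, into one integer."""
    if len(size_bytes) != 4:
        raise ValueError("expected exactly 4 size bytes")
    value = 0
    for byte in size_bytes:
        if byte & 0x80:
            raise ValueError("high bit must be clear in a synchsafe byte")
        value = (value << 7) | byte  # same as ((value << 7) + byte) here
    return value


# Hypothetical size field for a 13-byte value such as "Yellow Lasers".
size = decode_synchsafe(bytes([0, 0, 0, 13]))
```

With a = b = c = 0 and d = 13 the formula gives 13, which matches the length of "Yellow Lasers".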
Re:Right.... by zero1101 · 2007-04-11 08:09 · Score: 1

I was focusing on the technical side of the argument, in that it's less efficient than making a bit-for-bit copy of the original file, but I realize now that your point was about copyright issues, not download rates.
Re:Right.... by Bluesman · 2007-04-11 08:30 · Score: 1

How will this work in a real world situation?

And by real-world, I don't mean using the sample similar file set on a real network like you did for the paper, I mean finding the similarity data among thousands or millions of users.

For every new file available on the network, a comparison is going to have to be done between that file and *every other file* available on the network to check for similarity. This is feasible for some small number of files, but when you have, say, a million files, this is no longer insignificant.

Your paper assumes you already have this similarity data. But in a real-world situation, you won't.

This is neglecting the fact that a hash of a chunk of data doesn't guarantee that another chunk with the same hash is a match and not a collision. You substantially reduce the risk of a collision by using similar files, but because the files aren't identical, it can't be guaranteed. Since any collision of any chunk will corrupt the entire file and necessitate downloading the entire thing over, (because you don't know which chunk collides), this could severely reduce performance. Any numbers on when hash collisions become a factor, and by how much?

Let's say, for example, hash collisions are extremely rare if you use 99% similar files. What are the performance gains for that threshold, including the outside chance that a collision corrupts the entire download?
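As a rough way to put numbers on the collision question, here is a back-of-the-envelope sketch using the standard birthday-bound approximation. It assumes an ideal hash with uniformly distributed outputs and only covers accidental collisions, not deliberately crafted ones. The chunk counts are illustrative assumptions, not figures from the paper.

```python
import math


def collision_probability(num_chunks: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -num_chunks * (num_chunks - 1) / 2.0 ** (hash_bits + 1)
    return -math.expm1(exponent)  # expm1 keeps precision for tiny probabilities


# A million files of 65,536 chunks each (1 GiB at 16 KiB per chunk).
chunks = 1_000_000 * 65_536
p_md5 = collision_probability(chunks, 128)   # 128-bit digest
p_sha1 = collision_probability(chunks, 160)  # 160-bit digest
```

Under those assumptions the chance of any two distinct chunks in the whole network sharing a 128-bit hash comes out many orders of magnitude below one in a trillion, so accidental collisions are unlikely to be the bottleneck.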

--
If moderation could change anything, it would be illegal.
Re:Right.... by Anonymous Coward · 2007-04-11 08:41 · Score: 0

Are you the ninenine.com guy? Where's the free porno?
Re:Right.... by poopdeville · 2007-04-11 11:23 · Score: 2, Insightful

Which is why you would download a .torrent-like file specifying which of those you want. Then you would download the 99.9999% that agrees from any/all of them (essentially making your personal swarm temporarily bigger), and download the missing .0001% from the version you requested.

This is very straightforward. I don't see how people can misunderstand this idea.

--
After all, I am strangely colored.
Re:Right.... by poopdeville · 2007-04-11 11:41 · Score: 1

It's very unlikely that you will find two distinct files that share MD5 hashes and have the same file size. I would even predict that it's provably impossible for small enough (but still huge) files, but I don't have a proof prepared.

(I'm thinking mathematical induction on the length of elements of Sigma**, and using the concatenation property of the MD5 function.)

--
After all, I am strangely colored.
Re:Right.... by RedBear · 2007-04-11 13:15 · Score: 1

Sure this is going to work... really

I'll just splice that bit from that torrent, that bit from that one... it should work, I mean they are all the same TV episode and they are all mpeg4 - the file name says so...

Hmmm how about which bitrates, codecs, if it was from TV whether it was started at the same time??

That guy seriously has to be joking - the byte offsets are unlikely to ever specify a suitable join - and even if they rewrote the protocol so it split by seconds rather than fixed file widths you'd still have changing codecs and bitrates to deal with. Personally I'll stick to torrents with decent known trackers
It might behoove you to actually know what you're talking about next time you post. Like maybe read up on how hash functions work, and maybe look up the definition of the word "identical" as it relates to computer data. None of the issues you raised are the slightest bit relevant to this technique. If two files have the same hash that means they are bit-for-bit identical. The same concept applies to taking a hash of smaller chunks of that file. Even if the overall file is not 100% identical, if you download two chunks that have the same hash they will be bit-for-bit identical even if they came from two different files, and even if those files both had corrupted or incomplete chunks elsewhere.

The problem they are trying to get around is that various not-very-important differences between files, like different filenames, will make the overall file hash not match, while the actual important data within the file is often identical. One of the given examples is ID3 tags on MP3 files. MP3 tags are embedded in the file, thus any change even as simple as changing a single character from uppercase to lowercase in any of those header tags will cause the overall file hash to fail. Now, I'm no file format expert but I'll bet there are a lot of formats that keep their metadata in a small, standard-length block at the beginning of the file. Thus if you were to split those files into smaller chunks and get hashes of those chunks, you would probably find that all chunks past the initial header chunk will be identical with chunks from files that previously couldn't be matched based on the overall file hash. Depending on the particular file this could vastly increase the number of possible sources that you could download from, at least for the matching chunks, which I daresay would probably be 99% or more of the total content of the file. Thus improving the efficiency of P2P applications and your ability to obtain uncommon files or files that no longer have sources available that contain your exact version of the file. You'll be able to just download the rest of the file from "similar" sources which contain "identical" pieces, where the word "similar" only relates to the overall file.

Now, just to address how your issues would even relate to this whole concept: If there are two files out there with different metadata (like bitrate or source to use your examples) and the rest of those two files consist of identical chunks of the important data (like a video or audio stream to use your example), you shouldn't care which file you're downloading those identical chunks from. The differing metadata will not affect your version of the file because it won't be downloaded. In other words, if you begin downloading a file and the metadata in your version of the file is correct, it will still be correct when you've finished downloading the file, and the overall hash of the completed file will be a perfect match with the file you initially started downloading. Where the identical chunks came from is irrelevant because bit-for-bit identical chunks are completely interchangeable. That's the whole point of the word "identical" in computer terms. What actual data is in the different files doesn't matter, because it's identical!
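Here is a small Python sketch of the chunk-hash comparison described above. It is not the SET implementation, just an illustration under one big assumption: the metadata sits in a fixed-length block at the start, so the payload lines up on the same chunk boundaries in both files. The file contents and the use of SHA-1 are made up for the example.

```python
import hashlib

CHUNK_SIZE = 16 * 1024  # 16 KB, the chunk size the article mentions


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Hash each fixed-size chunk of the data independently."""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


# Two made-up "files": same payload, different fixed-length header block.
payload = bytes(range(256)) * 1024  # 256 KB of stand-in audio data
file_a = b"TITLE=once again".ljust(CHUNK_SIZE, b"\0") + payload
file_b = b"TITLE=Once Again".ljust(CHUNK_SIZE, b"\0") + payload

hashes_a = chunk_hashes(file_a)
hashes_b = chunk_hashes(file_b)
shared = sum(1 for a, b in zip(hashes_a, hashes_b) if a == b)
whole_file_match = hashlib.sha1(file_a).digest() == hashlib.sha1(file_b).digest()
```

The whole-file hashes differ, but 16 of the 17 chunks match, so everything except the header chunk could come from either file. If the header length differs between the two files, the fixed boundaries no longer line up and this simple version stops finding matches.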

On the other hand if you started out by downloading a version of that file that has slightly different or
Re:Right.... by nine-times · 2007-04-12 04:49 · Score: 1

If that's true, does that mean it'd be possible to start with a checksum and filesize and, given enough computing power, retrieve the full file? If so, that's some impressive compression.
Re:Right.... by poopdeville · 2007-04-12 07:02 · Score: 1

Good catch. I was wrong about the maximum size a file could be before redundant MD5's must occur. MD5's range has 2^128 elements. So if it is to be bijective on a subset of the domain, that subset needs to have 2^128 elements. Meaning that uniqueness is preserved for files of a given size up to 16 bytes.

I guess the MD5 + filesize concept doesn't buy you much unless you're willing to use very small block sizes. I wrote up a post about using 4kb blocks and realized that the improvement in accuracy was marginal. Basically, for any block size n (in bits), there are 2^n distinct blocks of size n. Meaning that there are \sum_{i=1}^n 2^i = 2^(n+1) - 2 blocks of size n or less. Obviously, the last term dominates the sum.

--
After all, I am strangely colored.
Re:Right.... by nine-times · 2007-04-12 08:33 · Score: 1

Yeah, I'm no mathematician, but checksums simply weren't designed to uniquely identify files. It would be very unlikely that two arbitrary files would have the same checksum, and so it helps guard against file corruption and tampering. You can use a checksum to discover whether a file has been altered or updated, since it would be extremely difficult to alter a file without altering its checksum. However, you can not identify a file uniquely without having access to enough data to reconstruct the original file.
Or, at least, I don't see how it could be possible.

Sounds not so useful. by realcoolguy425 · 2007-04-11 03:53 · Score: 0

Oh but if we only could! I'm sorry, sounds like someone is just imagining a 'what-if' scenario. This is not going to happen unless a new filetype is made specifically for it, and then you'd need a very specialized peer to peer program, and who knows how much extra work it's going to cost in terms of cpu/networking to determine if a file is similar or not. To me this just sounds stupid. Maybe it's just from waking up. They should work on optimising the 'pipes' or the 'tubes' to carry more data instead. It sounds like all the person thought of was hey, bit-torrent AND a stupid way to increase the number of people I can download from!

Snakeoil by Reason58 · 2007-04-11 03:53 · Score: 0, Troll

Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.
It's not even 9AM and I have already filled my bullshit quota for the day. The concept itself is dubious, but this statement in particular is ludicrous.

Re:Snakeoil by Anonymous Coward · 2007-04-11 03:57 · Score: 3, Funny

this statement in particular is ludicrous.
You don't listen to pop music, do you?
Re:Snakeoil by SaturnTim · 2007-04-11 04:05 · Score: 1

The article is a bit vague, but I think this is really trying to match up identical files that just differ in meta-information. So this won't be downloading parts of completely different files, it's just not relying on file names to find matches.

--
http://www.theMediaBunker.com
Re:Snakeoil by discord5 · 2007-04-11 05:28 · Score: 3, Funny

It's not even 9AM and I have already filled my bullshit quota for the day. The concept itself is dubious, but this statement in particular is ludicrous.

May I suggest you don't open your e-mail and refrain from answering the phone for today? I usually fill up my bullshit quota with those two media alone. Slashdot is just the icing on the cake. ;)
Re:Snakeoil by EllisDees · 2007-04-11 05:35 · Score: 1

What they are saying is that if you search for 'paradise city' on limewire, you might get 20 different entries in the search results all at the same bitrate and approximate size. If you can figure out which of these are 99% the same, with only the metadata changed, you can download the similar parts from many more sources than if you need an exact match.

--
-- Give me ambiguity or give me something else!

Summary: by PhrostyMcByte · 2007-04-11 03:53 · Score: 4, Informative

instead of sharing files, divide them into 16KB chunks and share those, to help work around files that get renamed or trivially altered (eg a website tagging their url to all the files you upload).

Re:Summary: by TheoMurpse · 2007-04-11 05:33 · Score: 1

If the file length was affected, the 16KB chunks would absolutely differ starting at the point where the length was affected.
Re:Summary: by EllisDees · 2007-04-11 05:45 · Score: 1

Even better, just ignore the metadata and search on a hash of the actual content. I'm not sure where the ID3 tags are placed (and I'm too lazy to look it up right now) in an mp3, but if you strip them off and ignore the file name, you should have the raw mp3 data left over.

--
-- Give me ambiguity or give me something else!
Re:Summary: by costas · 2007-04-11 07:09 · Score: 1

Is there any technical reason why BitTorrent doesn't let you have "sub-torrents"? As in, for a set of files A within a torrent, some files may be part of another torrent, potentially available on another tracker.
Re:Summary: by intangible · 2007-04-11 10:45 · Score: 1

Since most of this variable information is in the beginning of the file in almost all formats, I suppose you could start counting chunks backwards from the end of the file.
Re:Summary: by TheoMurpse · 2007-04-11 11:35 · Score: 1

I think the newest version of the ID3 tag (v2.4) has an optional tag placed at the end of the file itself, but I haven't read up on v2.4 so much yet, so I'm not sure. An older version also puts the tags entirely at the end, and other tagging methods also go at the end.

However, your idea does work if, instead of starting at the beginning, we do some kind of conditional (IF FILETYPE IS MP3 THEN skip any metadata and get right to the binary audio data).

99% Similar? by swillden · 2007-04-11 03:53 · Score: 0, Redundant

I knew all the music on the radio sounded the same, but I didn't think it was *that* lacking in originality.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.

Re:99% Similar? by happyemoticon · 2007-04-11 04:06 · Score: 1

Well, they're all composed of 1's and 0's, so they must be basically identical. Except for those songs that contain a 2, or a -1.

Tiny chunks, Large files by LiquidCoooled · 2007-04-11 03:53 · Score: 2, Insightful

LOTS of overhead just to find the chunks.
The article talks about 16kb chunks, which for a dvd image would take more than the torrent protocol currently allows.

The client would spend more time communicating its chunk lists around than actually getting data.

(If I remember rightly, torrents can have a max of 65535 chunks and some servers prevent huge .torrent files which contain the chunk breakpoints anyway)

--
liqbase :: faster than paper

Re:Tiny chunks, Large files by burris · 2007-04-11 04:19 · Score: 1

Sorry, the limit on the number of pieces in a torrent is 2^32. The message length prefix (i.e. the max length of the have-bitfield) as well as the piece indices are 4-byte unsigned ints.

meh by brunascle · 2007-04-11 03:54 · Score: 1

not particularly new. many P2Ps have been grouping identical files together. i know one of the early ones did it (was it Napster, Audiogalaxy?), but i think only if the files were 100% identical other than the filename.

there's definitely potential for problem here. what if those files really arent supposed to be the same? a swapped byte here and there could have huge effects on the end result.

Re:meh by Anonymous Coward · 2007-04-11 04:42 · Score: 0

Hello and welcome to 1999. Napster did that almost 10 years ago.

Well put... that was my first thought also when reading the summary : old news.

But the problem was actually that the file could be significantly different and still be treated as the same, so you would have some songs that you would get, say, half from a source (at a certain bitrate/song length/etc) and half from another source and then when you listen to it there's a big jump (forward or backward) in the song, change of pitch, volume, etc... it was pretty nasty. I still have quite a few songs from back then with these problems.

Overhead by TubeSteak · 2007-04-11 03:55 · Score: 1

Wouldn't this generate a lot more overhead traffic?
I'm sure smarter people than I have already thought this out,
but that seems to be the most immediate downside.

"SET divides a one-gigabyte file into 64,000 16-kilobyte chunks"

In other words: instead of seeking out the one master-hash for the file,
your P2P is looking up the thousands of chunk hashes.
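To get a feel for how big that chunk-hash list actually is, here is a quick arithmetic sketch in Python. The 20-byte hash size is an assumption (one SHA-1 digest per chunk), and the sizes use binary units, which gives 65,536 chunks rather than the article's round 64,000.

```python
GIB = 1024 ** 3
KIB = 1024
HASH_BYTES = 20  # assumption: one SHA-1 digest per chunk


def hash_list_overhead(file_size: int, chunk_size: int) -> tuple:
    """Return (number of chunks, bytes needed to list one hash per chunk)."""
    num_chunks = -(-file_size // chunk_size)  # ceiling division
    return num_chunks, num_chunks * HASH_BYTES


chunks, overhead = hash_list_overhead(1 * GIB, 16 * KIB)
fraction = overhead / (1 * GIB)
```

For a 1 GiB file that works out to 65,536 chunks and 1,310,720 bytes (1.25 MiB) of hashes, roughly 0.12% of the file size. That only counts one copy of the list, not the traffic spent looking each hash up across the network.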

--
[Fuck Beta]
o0t!

Re:Overhead by Anonymous Coward · 2007-04-11 04:41 · Score: 0

In other words: instead of seeking out the one master-hash for the file,
your P2P is looking up the thousands of chunk hashes.
In other words, they've reinvented GNUnet?

limewire by Anonymous Coward · 2007-04-11 03:55 · Score: 0

i don't think this makes sense for bittorrent p2p, but i think what they mean is
that when you search limewire you can see that there are many differently tagged versions with roughly the same filesize
when you try to download one, it isn't associated with the other files that are tagged differently even though it is the same song/bitrate everything.
so this program would recognize that i guess...

how about bringing the orig. napster back, or make a client IDENTICAL to it.

Container-aware P2P by Tx · 2007-04-11 03:56 · Score: 1

I have no idea what the overheads might be for their "handprinting" algorithms, or how effective they are. But I've wondered in the past whether something vaguely similar could be achieved by for example hashing each stream and headers separately in audio and video files, or each file within an archive. The same could apply to any container format. That would certainly deal with e.g. the same mp3 with different ID3 tags, and the overheads might be lower. Could get messy though, I guess.

--
Oh no... it's the future.

Interesting idea by Dan+Stephans+II · 2007-04-11 03:57 · Score: 1

Thought of this a while back on a similar subject: taking bittorrent as an example you track a file (or set of files) and the torrent has presized chunks that are hashed. A simple extension may be to change that relationship from one to many to many to many. Share a set of chunks and this particular torrent uses the following chunks (identified by their hash). Obviously this would cause overhead issues but would address the issue in TFA (which, being new here I did not read).

Should work just fine... by Anonymous Coward · 2007-04-11 03:58 · Score: 2, Informative

I think everyone posting above saying "it won't work" should RTFA.

This works by breaking files down into clusters and hashing the clusters (like Bittorrent already does). Then it searches for other shares that have clusters with the same hash value, and requests them.

Assuming that the hashing scheme being used is "good" in that there are no collisions, two clusters with the same hash will contain the *exact same* information.

Should work just fine.

Re:Should work just fine... by Shemmie · 2007-04-11 06:08 · Score: 1

... which is something I've always wondered - but sadly never researched into. If I've got two files, totally different - let's say one's an mp3 track, and one's a game ISO - what are the odds that the same, heck, 100k of binary data (exactly the same) occurs in both files? Based on what Parent's saying, this paper describes it. (Not RTFA - will do later). As long as your hashing prevents collisions, you can then pull that 100k of data from whatever source you like, and plug it into your download. Interesting.

Its all just ones and zeros by MosesJones · 2007-04-11 03:59 · Score: 1

Of COURSE all files are "basically" the same, after all it's just a set of 1s and 0s, and given that you already have lots of 1s and 0s on your machine this means that you already have the file even before you download it. It reminds me of Eric Morecambe and Andre Previn:
Previn: You were playing all the wrong notes
Morecambe: No I was playing all the right notes just not necessarily in the right order

--
An Eye for an Eye will make the whole world blind - Gandhi

Sounds like "Single Instance Storage" by Anonymous Coward · 2007-04-11 04:00 · Score: 0

Sounds like he's trying to apply the concepts behind Single Instance Storage to P2P applications, which makes a certain amount of sense, but is sure to be a fair bit more difficult to track the hashes for all the file chunks of all the files available on the network (as opposed to comparing file hashes during a backup or across a LAN).

It might be a bit easier if P2P apps were more aware of the types of files they were transferring, and could make intelligent decisions about how to split up the files into data chunks.

Shows the real trouble with P2P by Animats · 2007-04-11 04:00 · Score: 1

This is just an illustration of the fact that P2P is an incredibly inefficient way of transferring files around. Most of the material is not only pirated, but a big fraction of the pirated material is the same stuff. P2P "peers" aren't necessarily nearby, either in a physical or bandwidth sense. So huge amounts of bandwidth are being spent shipping the same stuff around.

If it weren't for the piracy issue, the daily output of the RIAA, which is a few gigabytes, could be distributed efficiently by putting MP3s in a Usenet group. With Usenet's distribution mechanism, which is a flooding P2P system, nothing travels over a path more than once.

Re:Shows the real trouble with P2P by abscissa · 2007-04-11 04:06 · Score: 1

I used to disagree with what you said... but after seeing how P2P affects networks first hand, I am now inclined to agree.

P2P is very inefficient.. but the problem is that the means that are maximally efficient (e.g. proxies, usenet, etc.) are inaccessible to the masses.
Re:Shows the real trouble with P2P by Animats · 2007-04-11 04:23 · Score: 1

Yes, if we had "alt.binaries.music.riaa.top40" we could probably cut the world's P2P load in half. At least.
It might be a good move for the RIAA members to do that. They pay radio stations to play the stuff. Why not cut out the middleman and ship direct to consumers?
We may be headed for an era where top-40 music is free, but ad-supported.
Re:Shows the real trouble with P2P by maxume · 2007-04-11 05:14 · Score: 1

Reality is dangerously complete; I just pulled ~15,000 headers from alt.binaries.sounds.mp3.pop, and there are hundreds of songs by Pink, Beyonce, Hilary Duff, and of collections like 'The Very Best of the Disney Channel', 'Hit Machine 2007' and on and on(and the server has another 30,000+ headers available, so I wouldn't be surprised to find a top 40 collection in there).

--
Nerd rage is the funniest rage.
Re:Shows the real trouble with P2P by Threni · 2007-04-11 05:38 · Score: 1

> the daily output of the RIAA, which is a few gigabytes, could be distributed efficiently by putting MP3s in a Usenet group.

I think you'd need a lot of large servers. A few gigs a day adds up after a while, and that's not even counting the vast back catalogue.
Re:Shows the real trouble with P2P by Teancum · 2007-04-11 11:42 · Score: 1

I would disagree, but only from a very limited perspective. Most P2P networks are built up with the idea that all knowledge of the person supplying the data ought to be kept anonymous or nearly so. Freenet is perhaps the worst so far as it pushes packets beyond the original intended recipient and tries to "broadcast" packets to those computers who will never use them except under exceptional cases.

That is wonderful if you are trying to download some kiddy porn or sending an encrypted message to an Al-Qaeda operative, but there are other classes of users who could take advantage of some of the benefits of P2P without having to deal with all of that security overhead. Sure, I will admit that Gnutella protocol doesn't have nearly the same level of overhead as Freenet, and there are some even easier protocols that would reduce this complexity even further.

Also, most P2P protocols scale horribly. That is, the number of nodes connected to the network has a significant impact on the efficiency and throughput of the data. Attempts to duplicate the entire internet under P2P protocols are IMHO going to fail for exactly this reason.

Where the real strength of P2P can be found is in a small business situation (or even fairly large group.... about 1000 nodes more or less) or for a distributed group of volunteers that want to keep some stuff "on a server" that is consistently updated, but not have to worry about how to deal with a "central office".

There are some highly specialized applications that I also think might work very well with P2P content, such as a distributed version of Wikipedia. You would have one "trusted source" that would seed the rest of the network, and the date or version of the material wouldn't be so critical. I still think something like this could be incredibly useful, especially to help pull off some of the bandwidth needs of Wikipedia for those who are merely browsing or surfing through the pages.

The final word hasn't been written about P2P technology, but it doesn't seem like the glory road that was promised a few years ago. On that point I think you are square on the mark.

It gets worse. by khasim · 2007-04-11 04:01 · Score: 2

Taking advantage of those similarities could speed downloads considerably. If a U.S. computer user wanted to download a German-language version of a popular movie, for instance, existing systems would probably download most of the movie from sources in Germany. But if the user could download from similar files, the user could retrieve most of the video from English versions readily available from U.S. sources, and download only the audio portion of the movie from the German sources.

To paraphrase Morbo: "DOWNLOADS DO NOT WORK LIKE THAT!"

Now it would be GREAT if someone did manage to do that. Split the video from the audio (and from the sub-titles). And maybe create a meta-package.

And maybe if those researchers focus on that, this will be a better idea.

But that would ONLY work for material that could be split like that. If it's a song, what are you going to split? An ISO image? Same question.

Re:It gets worse. by Anonymous Coward · 2007-04-11 04:06 · Score: 0

I think the idea here is to identify sections of different files that are identical and use those sections as part of a typical swarm download.

Note the "think". This could just be idiots thinking they are the first to discover swarming. I didn't RTFA.
Re:It gets worse. by WoZzeR · 2007-04-11 04:16 · Score: 1

I think the original article used 'similar files' only as a rough example. How I imagine it would work would be similar blocks could be downloaded from many sources. So in the example of a movie with German language, as long as the video blocks were the same, they would be interchangeable between the 2 files. This actually would not be hard to do, if there was a standard for block sizes, each one should have a different md5 hash, so what a client would do is say "do you have md5 hash xxxxxx block, please send it to me". This way it would not matter where in the file the other block was, but the original (torrent file in this case) file would have the info on where the md5 hash would go.
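A toy version of that "ask for blocks by hash" idea, sketched in Python. The block size is tiny on purpose, the byte strings stand in for video and audio blocks, and the manifest is a hypothetical stand-in for the torrent-like file. It also assumes the blocks line up on the same boundaries in every source.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


def split_blocks(data: bytes) -> list:
    """Cut data into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def block_id(block: bytes) -> str:
    """Identify a block by its MD5 hash, as the comment above suggests."""
    return hashlib.md5(block).hexdigest()


# The torrent-like file: an ordered list of block hashes for the wanted file.
wanted = b"VID1VID2VID3GER1GER2"
manifest = [block_id(b) for b in split_blocks(wanted)]

# Blocks offered by peers, indexed by hash regardless of which file they came from.
peer_english = b"VID1VID2VID3ENG1ENG2"
peer_german = b"XXXXGER1GER2"
available = {block_id(b): b
             for data in (peer_english, peer_german)
             for b in split_blocks(data)}

# Reassemble in manifest order; the manifest alone says where each block goes.
rebuilt = b"".join(available[h] for h in manifest)
```

Here the three video blocks come from the English file and the two audio blocks from the German one, and the rebuilt file is byte-for-byte the wanted one.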
Re:It gets worse. by brunascle · 2007-04-11 04:27 · Score: 1

hmmm, that's actually interesting. this could potentially completely change how P2P works. instead of requesting files, just request blocks by md5 hash. when you get a match, compare the hashes using another algorithm (to make sure it isnt a coincidence).

would that make it easier to defend yourself against the MAFIAA? since all they know about is 1 block that matches a copyrighted file (or at least, the hash matches)?

Jennifer Aniston != Monkey by Redbaran · 2007-04-11 04:04 · Score: 1

So does this mean that if I go to download a picture of anyone, say Jennifer Aniston, I might get a picture of a monkey, just because they are 99% (or whatever) the same genetically?

I think I speak for all males when I say: Not cool man... not cool!

Re:Jennifer Aniston != Monkey by maxume · 2007-04-11 04:59 · Score: 2, Funny

Think of it as Jennifer Aniston vs a perfect Jennifer Aniston clone with a paper bag with a monkey face drawn on it over her head. Take off the bag(i.e., the differing headers), and you are good to go.

--
Nerd rage is the funniest rage.
Re:Jennifer Aniston != Monkey by Anonymous Coward · 2007-04-11 06:09 · Score: 0

Puts a whole new spin on spanking the monkey, doesn't it?

in other news by brunascle · 2007-04-11 04:05 · Score: 1

a new P2P app called BET promises faster downloads, utilizing the tried-and-true method of giving a file a set time limit to download then filling in random bytes after that.

Kazaa hiss/pop by Anonymous Coward · 2007-04-11 04:07 · Score: 0

I think Kazaa did something like this, which is partly to blame for the horrible malformed mp3 files that flowed through that network.

But think of the RIAA... by Nom+du+Keyboard · 2007-04-11 04:07 · Score: 1

The RIAA is going to absolutely hate any research in this area that can improve P2P performance in any manner. And especially by a university, no less. Those hot beds of piracy don't deserve public money at all, when they spend it like this!!

--
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."

Pop Music by Archangel+Michael · 2007-04-11 04:08 · Score: 0, Redundant

"but are otherwise 99 percent similar."

Sounds like all of Pop Music and most movies these days. Since the latest greatest movies are remakes and I, II, III, IV, V versions of the same theme.

--
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.

Re:This could work for some files, but not for oth by Xzzy · 2007-04-11 04:11 · Score: 2, Interesting

It depends on how they calculate "similar". If they run checksums on the chunks and submit a query to other machines on the network that have pieces with identical chunks, then it would be valid to download. I'm pretty sure a few P2P services in the pre-bittorrent days did something like this, files with identical hashes would be grouped together.

But the article makes it sound like their custom software breaks up media files into their component streams, which clients can download separately as they desire. Pull the english audio stream from Computer A, the video stream from B, etc. Once downloaded the client automagically reassembles it into a single file.

Writing a client that can disassemble and reassemble all known media types sounds absolutely horrifying though. ;)

Re:This could work for some files, but not for oth by Incy · 2007-04-11 04:13 · Score: 3, Interesting

Anything compressed/encrypted won't work so well. Unless it is just a mislabeled piece of music. If you google around for Low Bandwidth File System (LBFS) you'll see what technique the article is really talking about. (Disclaimer: I didn't read the article either.) Variable-length chunking will handle cases where new data is inserted halfway into the file; however, with compression that extra data will end up changing the whole damned file.
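For the curious, here is a rough Python sketch of variable-length (content-defined) chunking. It is in the spirit of LBFS but is not its actual algorithm: a simple polynomial rolling hash stands in for Rabin fingerprints, and the window, mask, and test data are arbitrary choices for illustration.

```python
import hashlib
import random

WINDOW = 16          # bytes of context that decide a boundary
MASK = (1 << 8) - 1  # cut when the low 8 bits are zero (~256-byte chunks on average)
BASE = 257
MOD = (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def content_defined_chunks(data: bytes) -> list:
    """Split data where a rolling hash of the last WINDOW bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = b"Bla" + original  # three bytes inserted at the very beginning

ids_original = {hashlib.sha1(c).digest() for c in content_defined_chunks(original)}
ids_edited = {hashlib.sha1(c).digest() for c in content_defined_chunks(edited)}
shared = ids_original & ids_edited
```

Because boundaries depend on the bytes themselves rather than on offsets, the insertion only disturbs the chunks near the front and the rest still match by hash. Run the same data through a compressor first and that property is lost, which is the caveat above.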

Blurring the Line Between Data by dduardo · 2007-04-11 04:14 · Score: 2, Interesting

So let's say person A wants to download copyrighted material. They would connect to a tracker who would then tell person A where to get chunk 1 from. This chunk could come from copyrighted data or public domain data. What the tracker stores isn't the actual copyrighted data, but offsets, lengths, and ip addresses of where to find particular chunks.

In essence, if you were downloading a music CD, the data chunks could be coming from someone who has an Ubuntu ISO image, or some type of copyrighted material.

With this system it essentially becomes impossible to tell who is uploading what.

Re:Blurring the Line Between Data by PaintyThePirate · 2007-04-11 05:17 · Score: 2, Interesting

The chances of two 16kB chunks from an Ubuntu ISO and a music CD being identical are extremely slim; for that matter, the chances of any two 16kB chunks from any two non-similar files (meaning, in this case, files that are not of/from the same source [the same movie, album, program, whatever]) being the same are incredibly slim.

That being said, about 90% of the commenters missed the point completely, though it is somewhat understandable given how vague and nontechnical the linked article and summary are. The idea is that by hashing each 16kB chunk, one peer can download a single chunk from another if the chunk is identical but the entire file isn't. The word "similar" in the title does not mean files that somewhat resemble each other, it means files that have identical chunks.

Of course, with such a system, one would have to deal with the overhead traffic of announcing the hashes of every 16kB chunk instead of simply the hash of the entire file. I imagine that in an actual implementation of this system, instead of a peer requesting hashes of every chunk that all of the other peers are sharing, the downloader would only request hashes of files with similar filenames or keywords.

As in, if you search for an Ubuntu ISO, you won't be getting hashes of chunks from someone who is only sharing The Matrix Trilogy DVD-Rs; you would get hashes from people who are sharing ubuntu-6.10-server-amd64.iso, ubuntu-6.10-server-i386.iso, ubuntu-6.06-server-amd64.iso, etc. I would imagine that there would be many shared chunks between those files.

Of course, I can't imagine such a system working on a P2P network like Bittorrent (except maybe with a modified version of DHT), but with a more centralized system, it could work.

The only remaining problem I see is how to deal with byte offsets on chunks. I haven't read the actual paper on the topic yet, but I imagine something was done about it.

--
eclecti.cc
Re:Blurring the Line Between Data by aardvarkjoe · 2007-04-11 05:55 · Score: 1

As the other reply said, the chances of two chunks, even ones that are relatively small (like 16kB), being identical are very, very small, unless the source data was similar in the first place. So you're not going to be able to download your top 40 CD from all-free sources using that method.

You could do something like reducing the chunk size to 10 bytes or something else really small, and probably be able to find chunks from non-copyrighted sources. (Disregard for the moment the fact that the overhead would be ridiculous.) But at that point, a good argument could be made that the addresses that the tracker is giving you are really just an encoding of those 10 bytes, and so they're committing the copyright violation.

A similar suggestion has been made using streams of "random" numbers, such as the digits of pi: if the digits of pi are truly random, then any stream of data can be found somewhere within its digits. Some people immediately think that they can get around copyright laws if they say that "an MP3 of song AAA starts at digit 651923471 of pi, and continues for 771564 digits." But there are a couple of problems with this: first off, the offset of the data you want will often take more bits to express than the data that you're trying to find. (So to find that 5MB MP3, you first need to download a 5MB number that tells you where to find it.) The other problem is similar to the above: what you really have is just a clever way of encoding the data. You don't get off the hook for a copyright violation if you rot13 your data, and I doubt that this one would fly either.

--

How can we continue to believe in a just universe and freedom to eat crackers if we have no ale?
Re:Blurring the Line Between Data by jez9999 · 2007-04-12 00:41 · Score: 1

Yes. Although I can see some worth in the idea, I don't know why tiny 16kB chunks were suggested; the overhead would be enormous. Take a 600MB ISO; the whole file's hash is 1 hash, but divide it up into 16kB chunks and the file consists of about 38,000 hashes.

The idea would work a lot better with maybe 5 to 10MB chunks.
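
To put rough numbers on that, here is a quick back-of-the-envelope calculation. It assumes 20-byte SHA-1 digests and that the only overhead is the list of chunk hashes itself, which ignores whatever extra lookup traffic a real protocol would need.

```python
def hash_overhead(file_size: int, chunk_size: int, digest_size: int = 20):
    """Return (number of chunks, total bytes of chunk hashes)."""
    chunks = -(-file_size // chunk_size)   # ceiling division
    return chunks, chunks * digest_size


MB = 1024 * 1024
iso = 600 * MB
for chunk in (16 * 1024, 5 * MB):
    n, hash_bytes = hash_overhead(iso, chunk)
    print(f"{chunk:>8} byte chunks: {n} hashes, {hash_bytes} bytes, "
          f"{100 * hash_bytes / iso:.4f}% of the file")
```

Under those assumptions the 16kB case comes to 38,400 hashes, or 768,000 bytes of digests, roughly 0.12% of the ISO; the 5MB case needs 120 hashes.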

--
== Jez ==
Do you miss Firefox? Try Pale Moon.

Chunks not the entire song by ET_Fleshy · 2007-04-11 04:18 · Score: 1

They are talking about the chunks being the same, not the entire song. If you had a massive lookup table for md5's or something you could easily get a listing of every(body| torrent) that has the chunk you are looking for. Good idea but sounds quite difficult to maintain.

Note: didn't RTFA ;)

wondered how long it would take.. by ehrichweiss · 2007-04-11 04:35 · Score: 1

I thought of this a few years ago. The idea seemed an obvious one. There'll always be blocks of repeating data(e.g. FF00FF00), and exe, zip, rar, mp3, etc. headers with many similarities. If one can make the algo vary the size of the chunks it uses then you can derive your data from lots of different sources including getting data from image files that are to be applied to an .exe. The key would be to be able to recreate as much as possible from the similar data and the checksum/CRC using some of today's error correcting technology. It wouldn't be flawless of course but well worth the effort.

--
0x09F911029D74E35BD84156C5635688C0

Permuted Local Chunk Versions by Doc+Ruby · 2007-04-11 04:38 · Score: 1

Storage is cheap, bandwidth more expensive. Why not chunk up each file into many different permutations of its compressed data, with the variants recorded in the local index by fingerprint? Those fingerprints of unique chunks and the list of chunks to files can be maintained in the distributed index of many sites to each fingerprinted chunk. That would make more chances for a given content site to have a chunk that's identical to the one looked for, even if the chunk originated in a different file.

At some point, this protocol gets so far away from merely specifying a file's index in a local list to its specified remote storage server (eg. a tinyurl) for its monolithic compressed content (the best bandwidth for the least cacheable flexibility) that it's transferring more data among the distributed servers than that basic protocol. There's got to be some kind of "topological calculus" of the network connectedness and edge capacity vs node capacity that specifies the optimal distribution allocation of data to indices. Anyone?

--
make install -not war

Re:Permuted Local Chunk Versions by Anonymous Coward · 2007-04-11 07:38 · Score: 0

This is very similar to what I was thinking. Imagine you had a single DVD-size file (a shared compression dictionary) which contained the most common 16K (or 8K, or whatever size ends up being optimal) chunks. Advertising a file would amount to listing the IDs of the chunks, with chunks that aren't in your dictionary being listed in full. The optimal chunk size could be determined via a statistical analysis of sample data (what is the largest chunk size that we can find that is x% likely to be in a file). And the dictionary would be populated with the most common chunks. As time goes on, you might alter the dictionary, in which case you would have to advertise the dictionary version you're using, but that doesn't add much overhead. I haven't run the numbers, but I do believe you'd end up with a pretty good compression ratio.
The distribution of chunks not in the dictionary could then be done via the original poster's method. I like this idea, might have a new project.
Re:Permuted Local Chunk Versions by Doc+Ruby · 2007-04-11 13:46 · Score: 1

Storage is even cheaper when considering all that connected network storage. If these redundant but multipurpose chunks were also encrypted, then we could keep our chunks distributed around the Net. This P2P thing is really powerful when it's used in the right proportions.

--
make install -not war

Think about how that would be accomplished. by khasim · 2007-04-11 04:39 · Score: 1

Pretend that you're part of a swarm.

Your computer would then go through ALL YOUR FILES and advertise the md5 checksums to everyone.

Normally, you just advertise the blocks for the file that you're downloading.

So, I'm downloading a Debian iso ... and you're downloading a movie. Why am I (and a million others like me) going to be connecting to your box, asking your processor whether you have a file with checksum ghskldkjasa198d.a8.3ep ?

Normally, I would not even be talking to your box.

Suddenly, your bandwidth is gone from a million requests that have nothing to do with your download.

Re:Think about how that would be accomplished. by maxume · 2007-04-11 04:50 · Score: 1

I would set it up so that my computer only went through the files I wanted to be sharing blocks of, which is slightly less hyperbolic. I imagine that networks will end up taking advantage of this, but instead of direct block searches, the search will ask 'do you have anything like...' and if your computer responds, the search will say 'are any of them exactly...', so even though it isn't super-awesome-perfect, there appears to be some room to improve over things as they are.

To me the real benefit would be that the software would be better at telling what songs are the same song and searches would return fewer, more accurate results(simply because they would be grouped better), bandwidth usually ends up coming from one or two 'good' sharers anyway.

--
Nerd rage is the funniest rage.
Re:Think about how that would be accomplished. by ichigo+2.0 · 2007-04-11 05:28 · Score: 1

Bram Cohen actually was going to implement a similar scheme in bittorrent at one point, but instead of asking everyone it asks the tracker for peers that have blocks with the correct hash. This would decrease the redundancy in trackers as many torrents have the same files, and it's silly to have separate swarms for them. I think this could probably have been extended to work in such a way that a torrent client queries multiple trackers (not just the original tracker of the torrent in question) for the pieces it needs. Such a system would result in much more traffic between the tracker and clients, but one could make it so that the client only queries individual pieces if the torrent is being very slow. I'm not sure what became of these plans, I guess he hasn't had time to improve on the protocol because of bittorrent.com.
Re:Think about how that would be accomplished. by Anonymous Coward · 2007-04-11 05:34 · Score: 2, Interesting

It doesn't have to operate on every file on the filesystem. It's perfectly capable of identifying common blocks among files you happen to be swarming (downloading/hosting) or seeding (hosting), and that's more than useful enough.

Or you can maintain a set of managed/hosted files, exactly like if they were on an rsync server.

As for network efficiency, this isn't Gnutella, we've learned a lot in years of research since then, such as how to make searches that scale well. Very, very well. Well enough to be able to efficiently search 10-million strong distributed networks for hash-tree leaves.

We don't have to ask you if you have any interesting blocks directly; we already have an excellent structure that lets us know if either you (or your section of the distributed hash table, as we drill down) haven't, or you probably might - say hello to the Distributed Bloom Filter.
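
For anyone who hasn't met one, here is a minimal single-machine Bloom filter in Python. It only shows the "haven't / probably might" behaviour; the distributed part is not shown, and the bit-array size, hash count and SHA-1 salting are arbitrary choices made for this sketch.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-1 with the hash index.
        for k in range(self.num_hashes):
            digest = hashlib.sha1(bytes([k]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# A peer summarises the chunk hashes it holds in one small bit array.
peer_summary = BloomFilter()
held = [hashlib.sha1(b"chunk-%d" % i).digest() for i in range(100)]
for h in held:
    peer_summary.add(h)

print(held[0] in peer_summary)   # True: chunks the peer holds are always reported
```

A "no" from the filter is definite, so most peers can be ruled out without being asked about individual blocks; a "yes" only means "probably", and has to be confirmed with a real request.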

Those million requests? More like ten or so, total, in itty bitty packets that would be totally dwarfed by the data transfer you'd probably be making, spread out throughout the network and not concentrated on any one host.

Does the idea sound more reasonable now?

similiar? by mephistophyles · 2007-04-11 04:42 · Score: 0, Troll

sheesh, it's not even in the summary, it's in TFA title, I also hate to point out the obvious, but I don't need to be a researcher at CMU to realize that if all those split files were put together it would be easier and faster to download the file... talk about pointing out the obvious

Updates by diakka · 2007-04-11 04:56 · Score: 1

This could be a great tool for distributing updates. Say, if you already downloaded one DVD iso image for your favorite Linux distro, it could save a lot of time over downloading a whole new DVD iso. Even for smaller files or individual packages it could be really handy. I know there are already tools for generating rpm deltas and such, but if it could be transparent, it could really save a lot of hassle as well as bandwidth.

--
-- Knowledge shared is power lost. -- Aleister Crowley

Matching Chunks? by The+Bionic+Vapour+Bo · 2007-04-11 04:57 · Score: 1

What about splitting files into smaller chunks, so that it becomes more likely that separate files contain the same chunk of data? The downloading is then directed by some chunk ids, for example (global chunk id database). To download the full file you just need to download all the chunks, and then the file is composed using these chunks and some ordering information. This might be just a stupid idea that won't work in practice but it might be interesting to see a porn movie that is composed from parts of Slashdot comments.

That would be better, but still too big. by khasim · 2007-04-11 05:04 · Score: 1

For single songs (mp3's or even flac) the time spent hunting down other bits doesn't seem like it would be any better than just downloading that song from one person.

And things like md5 are useful because there is such a low probability of collisions (two different files having the same md5 checksum).

And by that same token, the likelihood that two different songs would have blocks of the exact same bits in a block is practically zero.

Their system WOULD work for movies IF they had previously incorporated the changes I mentioned (split the video from the audio from the sub-titles from the ...) but not in any other circumstances.

And as I also had mentioned, this would do nothing more than suck up ALL your bandwidth as people who aren't even searching for songs you've listed as "sharable" are hitting your machine in the hopes of finding a packet with a specific md5 checksum.

So I'm downloading a Debian iso and I'm hitting your mp3 collection looking for blocks that match. Nope. Bad idea.

Re:That would be better, but still too big. by maxume · 2007-04-11 05:26 · Score: 1

The status quo is that you are already sending me the md5 of whatever iso you are downloading and expecting a response about whether I have it; instead of(or in addition to) sending out that search, your software asks me if 'debian' matches anything and moves on if I say no, and asks for more details if I say yes. It may well overwhelm things, but it may not(bittorrent already 'fixes' corrupt files that are close to what you want to download, so it is characterizing the data as something other than a big blob).

If there really are a lot of files that are just different enough to be treated separately by existing protocols but contain identical media streams, spurring those protocols to work harder at characterizing files(by looking inside of lossless compression and isos and so forth) is only a good thing. (but I agree completely that a naive approach to matching blocks is almost guaranteed to be a waste of resources)

--
Nerd rage is the funniest rage.

The main problem with this is... by guruevi · 2007-04-11 05:06 · Score: 1

... that hashes collide, all the time. It probably won't collide over a large data chunk, but if you split the data chunks into $number chunks and send around MD5's (or other hashes) for that, you'll multiply the possible collisions by $number.

The only solution therefore is to create a one-to-one hash for each chunk, but then you could just as well transfer the data, because the hash size = chunk size.

Therefore, this approach won't work. Say you are transferring an OGG file (of your favorite indie band under the CCL of course) and you query all people (multicast) for a chunk with that hash. A chunk of my LaTeX document could possibly have the same hash, so I send it to you. Since you have programmatically no way of checking the source of my chunk, and the hash is the same, your program will accept it and you'll get a corrupted OGG. The solution is to rehash the whole file afterwards to check if it's still good; the problem then is that you'll have to restart all the way, since you have no way of checking which chunk was wrong. You can, of course, derive sub-solutions from that to be more precise and lose less data, but the complexity and cost of such algorithms increase at such a rate that it won't be worth implementing them.

--
Custom electronics and digital signage for your business: www.evcircuits.com

Re:The main problem with this is... by ASBands · 2007-04-11 06:01 · Score: 1

I've been thinking about this for the past 5 minutes and there seem to be a few problems with this "new" idea:

1: If two hashes taken from random sources collide, there is no way to check the source, except by human interaction with the downloaded content.
2: BitTorrent is already widely adopted.
3: Incredible bandwidth use.

How can we solve this? By the creation of a dramatically different BitTorrent client. However, instead of loading one torrent file for one download group, you can load multiple torrents with one designated as "primary," the rest are used to point to other trackers for similar chunk downloads.

Here's what I mean: You want torrent A, which is a properly ID3-tagged album of your favorite Creative Commons artist which has 1 seed and 7 peers. However, some idiot uploaded torrent B first, and it has the same music at the same bitrate, except the ID3 tags are wrong and there are 20 seeds and 100 peers. Torrent A is designated as "primary," and this BitTorrent client checks to see if Torrent B contains chunks of the same MD5 hash that A contains. Those chunks may be downloaded from either source, and chunks only in A must be downloaded from A.

This solves problem 1, as there is 99.9% likelihood that the chunks are correct. It obviously solves problem 2, as it would work perfectly with existing BitTorrent trackers. Problem 3? Eh...if everybody used it... That should be similar to how the researchers did it, but I didn't RTFA, so I don't know.

--
My UID is a prime number. Yeah, I planned that.
Re:The main problem with this is... by Anonymous Coward · 2007-04-11 06:48 · Score: 0

Except that if you split the data into smaller chunks and send a hash for each chunk, you're actually decreasing the likelihood of a collision for each individual chunk. The smaller you make the chunks, the closer you get to your example of [hash size = chunk size].

The only real downside to smaller individually hashed chunks is that you're increasing the overhead of the whole transfer because you're sending more hashes.
Re:The main problem with this is... by Anonymous Coward · 2007-04-11 11:48 · Score: 0

No they don't collide. That's the whole point of cryptographically strong hash algorithms.
Re:The main problem with this is... by guruevi · 2007-04-12 04:39 · Score: 1

Yes they do collide, otherwise it would be called compression and not encryption.

--
Custom electronics and digital signage for your business: www.evcircuits.com

Problem with variable insertions? by digitalderbs · 2007-04-11 05:07 · Score: 2, Interesting

I haven't read the detailed paper, but it seems to me that there could be problems in finding similarity when there are random insertions. This is analogous to protein sequence matching (peripherally my field of research) when there are random insertions in the primary sequence. So if you have two identical 16k files, eight 2k chunks will be identical to each other. However, if you insert 512 bytes at the start of one file, eight 2k chunks will be different.
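
The shift problem is easy to reproduce. The sketch below uses the numbers from this comment (a 16k file, 2k chunks, 512 inserted bytes); the random test data and the choice of SHA-1 are assumptions made for the demonstration.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int = 2048) -> set:
    """Set of SHA-1 digests of consecutive fixed-size chunks."""
    return {hashlib.sha1(data[i:i + size]).digest()
            for i in range(0, len(data), size)}


rng = random.Random(0)                        # seeded so the run is repeatable
original = bytes(rng.getrandbits(8) for _ in range(16 * 1024))
shifted = bytes(512) + original               # 512 zero bytes inserted at the start

print(len(fixed_chunks(original)))                             # 8 chunks of 2k
print(len(fixed_chunks(original) & fixed_chunks(shifted)))     # 0 chunks in common
```

Every byte of the original is still present in the shifted copy, but none of it lines up with a chunk boundary any more.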

Re:Problem with variable insertions? by angio · 2007-04-11 05:38 · Score: 4, Informative

We define chunk boundaries using Rabin fingerprinting. It's a cute trick - not one of our own invention - that is relatively insensitive to insertions and deletions. It was used in some of the other work in this area, such as the Low Bandwidth File System (LBFS). There's a family of work in this area called "shingling" that can also apply to sequence similarity.
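
A simplified illustration of content-defined chunk boundaries follows. This is not the SET or LBFS code: it substitutes a CRC32 over the trailing 16 bytes for a real Rabin fingerprint, recomputes it at every position instead of rolling it, and the window size and mask are arbitrary values picked for the sketch.

```python
import hashlib
import random
import zlib

WINDOW = 16      # bytes of context that decide whether a boundary falls here
MASK = 0x3FF     # boundary when the low 10 bits are zero: ~1 kB average chunks


def content_chunks(data: bytes) -> list:
    """Split data wherever a hash of the trailing WINDOW bytes hits the pattern."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        # The decision depends only on the bytes just before `end`,
        # never on the absolute offset within the file.
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])       # tail after the last boundary
    return chunks


def chunk_ids(data: bytes) -> set:
    return {hashlib.sha1(c).digest() for c in content_chunks(data)}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(16 * 1024))
prefix = bytes(rng.getrandbits(8) for _ in range(512))
shifted = prefix + original               # 512 bytes inserted at the start

missing = chunk_ids(original) - chunk_ids(shifted)
print(len(chunk_ids(original)), "chunks,", len(missing), "lost to the insertion")
```

Because a boundary is decided by the bytes around it, the boundaries move along with the data, and at most the chunk touching the insertion stops matching.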

Could really work! by Junior+J.+Junior+III · 2007-04-11 05:10 · Score: 3, Funny

At their fundamental level, all files are essentially similar. They're encoded as 1's and 0's. So, wherever a file happens to call for a 1, you should be able to just pull that 1 from ANYWHERE. Even some random file on your local hard drive. And likewise for zeroes. All you need is a smart download algorithm to re-assemble the 1s and 0s in the correct order, and you're set.

--
You see? You see? Your stupid minds! Stupid! Stupid!

Shows the real trouble with TCP/IP. by Anonymous Coward · 2007-04-11 05:13 · Score: 0

Part of the problem is that saturating one's download hurts one's upload. It also hurts others as well.

"P2P is very inefficient.. but the problem is that means that are maximally efficient (e.g. proxies, usenet, etc.) are inaccessible to the masses."

Nonsense. The only thing that's making them inaccessible is the illegal nature of most of the content. One of these days humanity will understand cause and effect and that bad actions have bad consequences.

Already been done a long time ago by J0nne · 2007-04-11 05:13 · Score: 2, Informative

Shareaza has been doing this for years. When hashing MP3 files, it disregards what's in the id3 tags, and just computes a hash for the audio information. This means that files with different id3 tags will still be added to the swarm, which is great.
Unfortunately, there are some issues with it:
-Only Shareaza supports it, other clients didn't want to play along.
-Shareaza has/d a bug where it would fail to reconstruct the id3 tag after downloading, giving you files with empty tags
-Only mp3 is supported, so no ogg, aac or wma
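
A sketch of what a tag-agnostic hash could look like, for illustration. This is not Shareaza's actual code: it only skips a leading ID3v2 tag and a trailing ID3v1 tag, uses SHA-1, and ignores other tag formats (APE tags, for example) as well as corrupt headers. The sample bytes at the bottom are made up.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte syncsafe size.
    if data[:3] == b"ID3" and len(data) >= 10:
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)   # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:                       # footer flag: 10 more bytes
            start += 10
    # ID3v1: fixed 128-byte block at the end of the file, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_hash(data: bytes) -> str:
    """SHA-1 of the file contents with ID3 tags ignored."""
    return hashlib.sha1(strip_id3(data)).hexdigest()


# Made-up test data: the same fake "audio" payload with two different tags.
audio = b"\xff\xfb" + bytes(1000)
v2_tag = b"ID3\x03\x00\x00" + bytes([0, 0, 0, 20]) + b"T" * 20
v1_tag = b"TAG" + b"Once Again".ljust(125, b"\x00")
print(audio_hash(v2_tag + audio) == audio_hash(audio + v1_tag))
```

The two differently tagged copies hash the same, so they could be offered as sources for each other.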

So this paper isn't as revolutionary (if that's what they mean).

This will only work with identical files that have metadata that is frequently changed by end-users, because there's no way you're going to be able to get a good file if you try to mix a cam with a dvdrip, or an ogg with an mp3, or an xvid file with a divx file. It just doesn't work that way.

Blue Coat, Riverbed, Juniper already do this by Anonymous Coward · 2007-04-11 05:26 · Score: 0

It's called Byte Caching or Dictionary Compression. It's been used for WAN optimization for years and it works very well at taking data off the network. It hasn't been done on a P2P level, but the technology is very tried and tested.

Juuuust one problem... by DigitAl56K · 2007-04-11 05:37 · Score: 1

ID3v1 tags won't pose a problem for this, they occur at the end of the file, i.e. the last chunk.

But ID3v2 tags occur at the start of the file and have variable size. AVI files might have similar video streams but different language audio tracks, or be interleaved slightly differently, and so forth.

So although similar information might exist in these files, the chances of that information lying exactly on the same chunk boundaries, and thus the chunks having matching MD5s, are pretty low I bet. Even a 1ms delay in a CD rip could throw off all the MD5s _and_ encoded data.

It could be effective _if_ the server had parsers for the various file types and could separate metadata and streams within a file, and the client could correctly re-assemble the media. Chances of a buggy parser or muxer destroying your file: high.

It will still be WORSE. by khasim · 2007-04-11 05:39 · Score: 2, Informative

Let's go with a fairly vanilla scenario:

I grab an mp3 from person A. I then clean up the tag and rename it to suit me.

You want to download that same song with a different name and different tag.

You connect to person B sharing it. If you're using BitTorrent, you can also connect to any of 99 other people trying to download it from person B.

Using the new model, you could also connect to person A and myself and download the blocks that are the same.

So instead of only...
99 people in the swarm and 1 seeder
you'd have 99 people in the swarm, 1 seeder, person A and myself.

But in order to FIND person A and myself, you'd have to go through A MILLION OTHER PEOPLE to find if they have any blocks that you are looking for.

The CRITICAL PART THAT THEY LEFT OUT is the amount of bandwidth you'd be using to search A MILLION unrelated systems with unrelated files looking for those blocks.

This works in their lab because they have very few machines with very few files and they've already pre-loaded those machines with the files they want to be found.

An interesting licensing issue by DigitAl56K · 2007-04-11 05:39 · Score: 3, Interesting

If a client recreates a file from "similar" pieces, is it a derivative work?

great for porn by jcgf · 2007-04-11 05:58 · Score: 1

This will be great for porn. Nothing worse than downloading 45 seconds of the last 3 minute video you downloaded. Now, I'll be able to tell which are just clips and which are the whole thing.

Plan 9 - Venti by jlmale0 · 2007-04-11 05:58 · Score: 1

This is a good idea, and it delights me to see someone implement it. It reminds me a lot of Venti, the Plan 9 Archival Storage System (http://plan9.bell-labs.com/sys/doc/venti.html). An interesting side-effect of this is that, given a large variety of files in the system, it becomes a distributed look-up table of hash values. Any cryptologists out there need something like that as a resource? :)

Do humans ever pay attention to what they say? by Anonymous Coward · 2007-04-11 06:10 · Score: 0

"Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar."

If the files differ ONLY in the artist-and-title headers, wouldn't they be "otherwise 100% identical"? I'd lump that in with "irregardless" and "I couldn't care less" as great moments in speaking (or typing) English.

I tried it by Intron · 2007-04-11 06:36 · Score: 2, Funny

I tried downloading "My Sweet Lord" by George Harrison, but I got "He's So Fine" by the Chiffons instead.

--
Intron: the portion of DNA which expresses nothing useful.

Who/what decides? by Seantotheizzo · 2007-04-11 06:48 · Score: 0

So, you speed up your download by grabbing identical chunks from "different" files. But, who/what decides which "different" end-piece is going to get downloaded? Are there serious implications to this? Let's say I reverse engineer an installer program and recompile it to install a trojan when SETUP.EXE is run. You go to P2P and there are two peers offering SETUP.EXE: my client and another client with the non-trojan version. How does the end-client then decide, from the two chunks that are different, which one to grab?

Re:Who/what decides? by catprog · 2007-04-11 13:54 · Score: 1

The idea is your trojan SETUP.EXE contains much of the same code (i.e., the non-trojaned parts). So ABCDTEF is your trojan setup (T is the trojan). The actual setup.exe is ABCDEF. They will only download ABCDEF, not T, from you.
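
The ABCDEF example can be written out as a small sketch. It assumes the downloader already has a trusted list of chunk hashes for the genuine file, that SHA-1 is the hash, and that the inserted T block happens to be exactly one chunk long so the other chunks stay aligned; all the names are invented for the example.

```python
import hashlib

CHUNK = 4  # tiny chunk size so each letter block below is one chunk


def digest(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()


def split(data: bytes) -> list:
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]


def assemble(wanted_hashes: list, peers: list) -> bytes:
    """Rebuild a file from any peer chunk whose hash is on the wanted list."""
    pool = {digest(c): c for peer in peers for c in split(peer)}
    return b"".join(pool[h] for h in wanted_hashes)


genuine = b"AAAABBBBCCCCDDDDEEEEFFFF"          # ABCDEF
trojaned = b"AAAABBBBCCCCDDDDTTTTEEEEFFFF"     # ABCDTEF
wanted = [digest(c) for c in split(genuine)]   # hash list from a trusted source

rebuilt = assemble(wanted, [trojaned])
print(rebuilt == genuine)    # the T chunk is never asked for
```

The trojaned peer ends up serving only the parts it shares with the real file.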

--
My Transformation Website
Kindle Books http://www.catprog.org/rev
Interactive CYOA http://www.catprog.org/st

mod this up -- angio...this is the problem by tacokill · 2007-04-11 07:09 · Score: 2, Insightful

What the parent is saying can be summarized with a simple example:

A 200MB, 30min video that was compressed at 1000kbps DiVX is not the "same file with minor changes" as a 200MB, 30min video that was compressed at 900kbps DiVX. They ARE different files and should be treated as such. You also can't deduce anything from their filenames, play length, or any other characteristic so how would you determine which ones can go together and which ones can't? I did not see codecs or compression mentioned at all in the article.

This is the fundamental problem here. You can't recombine video and audio files unless they ARE the same file. You have to account for different bitrates, compression ratios, and who knows what else (I am no expert in this area but this seems obvious...).

Lemme guess -- the mp3s mentioned in the article were ALL encoded at the same bitrate, right? If not, then please correct me because now you have my attention ;)

Re:mod this up -- angio...this is the problem by GrievousMistake · 2007-04-11 09:49 · Score: 1

The mp3s mentioned in the article were, as I understand it, the same encoded mp3, but with different tagging (ID3, replay gain etc.) It's not too uncommon on p2p networks to see three or four versions of an mp3 with so similar size that it's almost certainly just differing tags. (And of course, it's always the file with least sources that has the best tags...)
I actually was going to suggest this on the eMule forum just yesterday, simply attaching some kind of tag agnostic hash attribute to music files, but I was too lazy to register. Guess I don't have to now.
Thank you for your attention.

--
In a fair world, refrigerators would make electricity.
Re:mod this up -- angio...this is the problem by angio · 2007-04-11 15:59 · Score: 2, Interesting

Correct (and the parent is correct as well). We didn't exhaustively characterize the reasons that the files differed, but most of the cases appeared to be modifications _after_ encoding. We have on the TODO list to test encode the same CD multiple times with the same settings to see if it produces a similar file, but we haven't run the test yet. There were tons of cases of metadata-altered MP3s, many cases of video files with the same video content but different audio or subtitle information, and some cases where the files differed in "seemingly random" (aka, we're not sure why. :) parts of the file for no reason we've figured out yet.

I think that for audio files, having a plugin for the p2p system that separated metadata from song information would probably capture a lot of the benefit we found. It's much harder for video files and for the files that had various weird changes that we don't have a good explanation for.

As other posters have suggested, the techniques we used also work well for software distributions sometimes. We've found around 10-30% similarity between different Linux ISOs and RPMs, *uncompressed* distributions of things like gcc (but not a .tar.gz file), etc. Part of what we really like about the SET technique is that it's able to speed transfers of all of those types of files without needing to have any file-type specific logic.
Re:mod this up -- angio...this is the problem by tacokill · 2007-04-14 02:56 · Score: 1

understood...thanks for the reply and clarification.

ps. its still very cool and nice leap forward :)

Ok, I read the paper by tacokill · 2007-04-11 07:14 · Score: 1

I just read the paper. Not a single mention of different bitrates, codecs, or compression ratios that could be present in any of the files you guys are dealing with.

See my comment below and please explain this. I think a lot of us are missing this piece.

Re:Ok, I read the paper by Loligo · 2007-04-11 11:29 · Score: 2, Informative

It's already been addressed: files encoded with different codecs, bitrates, compression ratios, what-have-you look completely different, have vastly different checksums, and even if named exactly the same and with the exact same file size, would never be confused for each other by any algorithm that's comparing what's actually IN the files.

-l

sourcecodes by AlgorithMan · 2007-04-11 07:15 · Score: 1

I like downloading sourcecode via p2p
I hope this technique will speed this up by finding equivalent sourcecodes...
(*eherm* undecidability *eherm*)

--
The MAFIAA is a bunch of mindless jerks who will be the first up against the wall when the revolution comes

Don't see how it could work on a large scale. by Bluesman · 2007-04-11 07:21 · Score: 2, Interesting

This seems to be an intractable problem.

How do you know a file is similar? By hashing? There's no guarantee that a particular chunk of a file with an md5 hash (for example) contains the same bytes as that of another file.

There are 2^256 possible chunks of 256 bits of data. There are 2^16 possible hashes with (using a 16 bit binary key) for that same data. That means that for every hash match, the data has a 1 in 16 chance of actually matching.

You can extend the key length to reduce this ratio, but you'll end up with a key length equal to your data size before you're sure the data is not a collision.

The problem gets worse if the chunks of data aren't equal in size.

This can only work if you have a centralized database of every possible file combination on your network. It's workable for a small amount of files, but will grow exponentially in a real environment. Not to mention, the centralized database would have to handle a significant amount of traffic, reducing the speed gains possible.

Count me as skeptical.

--
If moderation could change anything, it would be illegal.

Re:Don't see how it could work on a large scale. by Anonymous Coward · 2007-04-11 12:53 · Score: 0

"There are 2^256 possible chunks of 256 bits of data. There are 2^16 possible hashes with (using a 16 bit binary key) for that same data. That means that for every hash match, the data has a 1 in 16 chance of actually matching."

Sure, there's a problem if you generate random chunks and filter hash matches from them (though it's not 1 in 16 but 1 in 2^240: 16 bits known, 240 left to vary). But the hashes aren't 16 bits. They're going to be e.g. 128 bits (MD5). The expected number of random chunks you'd have to generate to get one MD5 collision is about sqrt(2^128) = 2^64. With SHA-1 it's even more, sqrt(2^160) = 2^80. And that's in the whole system: if you are looking for collisions with a particular chunk instead of collisions between any two chunks, then you can drop the square root. You will likely never get collisions of this type, leaving only the non-random "collisions" (i.e. the ones where the data is actually the same).
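For anyone who wants to check the arithmetic, here is a small sketch using the standard birthday approximation, assuming the hash behaves like a uniform random function. The function names and the 2^40-chunk network are made up for the example.

```python
import math


def collision_probability(num_chunks, hash_bits):
    """Birthday approximation: chance that any two of num_chunks
    random chunks end up with the same hash value."""
    pairs = num_chunks * (num_chunks - 1) / 2
    # 1 - exp(-x), computed via expm1 so tiny probabilities stay accurate
    return -math.expm1(-pairs / 2.0 ** hash_bits)


def chunks_for_even_odds(hash_bits):
    """Approximate number of random chunks needed before a collision
    somewhere in the set becomes a coin flip."""
    return math.sqrt(2 * math.log(2)) * 2.0 ** (hash_bits / 2)


for name, bits in [("16-bit toy hash", 16), ("MD5", 128), ("SHA-1", 160)]:
    exponent = math.log2(chunks_for_even_odds(bits))
    print(f"{name}: about 2^{exponent:.1f} chunks for even odds")

# Hypothetical network holding 2^40 distinct chunks, hashed with MD5.
print("collision chance:", collision_probability(2 ** 40, 128))
```

The 16-bit hash from the grandparent post really would collide after a few hundred chunks, which is why nobody uses one. This covers accidental collisions only, not someone deliberately constructing colliding chunks.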

perhaps 100% similar by Frisky070802 · 2007-04-11 07:46 · Score: 2, Interesting

I recall hearing a story on NPR Music a few weeks ago about someone who plugged a CD into iTunes and had it come up with the right piece but the wrong performer. Then it happened again, with the same ostensible pianist but yet another "wrong" performer. Detailed analysis showed the pianist had apparently published CDs claiming to perform pieces but actually substituting the work of others! iTunes must have used a signature over the content to index the piece by the earlier CD.

Similarity detection rules.

--
Mencken had it right. So glad that's old news.

Huh? by sofla · 2007-04-11 07:47 · Score: 2, Informative

SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source.

So do ed2k, BitTorrent, and plenty of others; this is hardly news. Even plain ol' FTP and HTTP can do this to a degree.
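With plain HTTP, "to a degree" means byte-range requests: each source is asked for a different slice of the file via a `Range: bytes=start-end` header. A minimal sketch of planning those slices follows; the mirror URLs and the `plan_ranges` helper are hypothetical, and it only builds the header values without fetching anything.

```python
def plan_ranges(file_size, sources, chunk_size):
    """Assign each chunk of the file to a source, round-robin.

    Returns (source, range_header_value) pairs. HTTP byte ranges are
    inclusive on both ends, hence the "- 1" on the end offset.
    """
    plan = []
    for index, start in enumerate(range(0, file_size, chunk_size)):
        end = min(start + chunk_size, file_size) - 1
        source = sources[index % len(sources)]
        plan.append((source, f"bytes={start}-{end}"))
    return plan


# Hypothetical mirrors that all hold the same file.
mirrors = ["http://mirror-a.example/file.iso",
           "http://mirror-b.example/file.iso"]

for source, byte_range in plan_ranges(10_000_000, mirrors, 4_000_000):
    print(source, "Range:", byte_range)
```

This only helps when every source has a byte-identical file, which is exactly the limitation the article claims to get around.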

As far as the 99.9% similar "speedup"... I seriously doubt that you'll see any gains other than in lab conditions. MP3 is about the only format that might be agreeable to this, since I imagine it's reasonably common for people to fix ID3 tags and then share the modified file. I just don't see it happening for other formats (.avi, .zip, .rar). And unless you've still got a 14.4 modem I doubt you'll notice the speedup even with MP3s, since they are so small to begin with.

Re:Huh? by m50d · 2007-04-11 10:42 · Score: 1

MP3 is about the only format that might be agreeable to this, since I imagine it's reasonably common for people to fix ID3 tags and then share the modified file. I just don't see it happening for other formats (.avi, .zip, .rar).
There are at least a few other places I can see it happening: quite often people will make a zip/rar out of a bunch of files with no compression, just for the sake of bunching them together. And if someone corrects the subtitles on a video file which has been "softsubbed", it would be useful if I could download the new version taking advantage of sources which were sharing the old version.

--
I am trolling

P2P networks by Anonymous Coward · 2007-04-11 08:09 · Score: 0

This is already being done on numerous P2P networks: Ares, Limewire, etc. They all use this approach. This guy is just a moron.

patching Re:Right.... by NuShrike · 2007-04-11 08:26 · Score: 1

This is no different from the way DHT BitTorrent matches sections of the files and ignores the chunks with bad hashes.

It's not even a new idea. You can generate super-efficient patches of files if you have local knowledge of which sections of the file are metadata (headers, relocation tables, etc.) and which are actual data. You could expand this further to generate par2-like differential patch data, and then just BitTorrent all the chunks down with differential regeneration on the client end. Sounds a lot like DHT again.

NFW by Anonymous Coward · 2007-04-11 09:37 · Score: 0

A little late for April Fool's day, isn't it? This can never work...

Similar, eh? by ScrewMaster · 2007-04-11 12:00 · Score: 1

So how similar are Debian and Ubuntu, these days?

--
The higher the technology, the sharper that two-edged sword.

jigdo and Debian distribution by bastianmz · 2007-04-11 13:19 · Score: 1

The Jigsaw Downloader (jigdo) can be used for this. It was developed to help with distribution of Debian ISO images, and it is being used to build Debian ISO images from the packages located on the Debian mirrors. You don't have to worry about distributing the ISO images to the mirror sites or need additional disk space to store them either. The best thing is that you can update your local ISO to include new packages, etc., as things change. Have a look at http://www.debian.org/CD/jigdo-cd/ and http://atterer.net/jigdo/

Comment removed by account_deleted · 2007-04-11 16:25 · Score: 0

Comment removed based on user account deletion
