Slashdot Mirror


Faster P2P By Matching Similar Files?

Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files. "SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""

8 of 222 comments

  1. Similar Files? by Anonymous Coward · · Score: 1, Insightful

    Wait...what?

    1. Re:Similar Files? by Aladrin · · Score: 2, Insightful

      No seriously, the coward is right. WTF?

      Okay, I'll admit that there are a few MP3s that have different ID3 tags but the same actual audio. A few. The large majority of duplicate songs are NOT the same audio data. They've been re-ripped, transcoded, or had some other horrid thing done to them, and they are not the same data anymore.

      Now, even assuming that there ARE tons of very-alike files out there, you'd have to write an intelligent comparer for each one so that it knew how to deal with the file and what information could be mixed without ruining the file.

      At the end of it, you've spent years on a project that'll never quite work right, all to save a bit of bandwidth for people who should have just gone and bought the song from iTunes in the first place if they wanted it that damned bad. And if they don't want it that bad, they aren't going to bother with some specialized P2P program that has only one advantage: it can tell that some files are alike. (And it probably has tons of disadvantages compared to the already-existing applications.)

      --
      "If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
    2. Re:Similar Files? by Anonymous Coward · · Score: 2, Insightful

      Break down the file into small (16 KB) chunks. Hash those chunks and let the client compare them to the chunks you need. Most BT clients already do this, but they still only draw the file from peers sharing the same file listed by the tracker. With this technology, a client can use any file that has chunks with exactly the same hash as chunks of the file being downloaded. I would imagine not a great many changes would be needed to implement it. There's no need for an 'intelligent comparer', as it's pretty much already built into almost every BT app out there. There won't be 'years on the project' either, since most of the infrastructure already exists. They can just build on what is already out there.

      There could be a fairly large performance increase if I've understood the paper correctly. I have a 10 Mb/s downstream cable connection at home. If I connect to a torrent that has many more seeders than leechers, I can easily top out my d/l speed at around 1.1 MB/s. Reverse that scenario and the d/l is extremely slow due to the lack of seeders able to send out chunks of the file. Now, imagine there are multiple copies of this same file on multiple trackers, being shared by many, many more seeders than this one torrent has. This new implementation will find those chunks, as well as the chunks from the torrent you originally connected to. Next time, RTFA.
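
The chunk matching described in the reply above can be sketched roughly as follows. This is a minimal illustration, not the SET implementation: it assumes fixed 16 KB chunks and SHA-1 digests, and the names `chunk_hashes` and `reusable_chunks` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 16 * 1024  # 16 KB chunks, the size mentioned in the thread


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return the SHA-1 hex digest of each."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def reusable_chunks(wanted: bytes, other: bytes) -> list[int]:
    """Indices of chunks in `wanted` whose hash also appears somewhere in `other`."""
    available = set(chunk_hashes(other))
    return [i for i, h in enumerate(chunk_hashes(wanted)) if h in available]


# Toy data: two "files" with identical bodies but different first chunks.
body = bytes(range(256)) * 64 * 3      # 48 KB, i.e. three 16 KB chunks
wanted = b"A" * CHUNK_SIZE + body
other = b"B" * CHUNK_SIZE + body

print(reusable_chunks(wanted, other))  # chunk 0 differs, chunks 1 to 3 match
```

Only chunks with byte-identical content produce the same digest, so nothing here needs to understand the file format.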

  2. Tiny chunks, Large files by LiquidCoooled · · Score: 2, Insightful

    LOTS of overhead just to find the chunks.
    The article talks about 16 KB chunks, which for a DVD image would mean more chunks than the torrent protocol currently allows.

    The client would spend more time communicating its chunk lists around than actually getting data.

    (If I remember rightly, torrents can have a max of 65535 chunks and some servers prevent huge .torrent files which contain the chunk breakpoints anyway)

    --
    liqbase :: faster than paper
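
The overhead concern above can be put into rough numbers. The sketch below assumes a 4.7 GB single-layer DVD image and 20-byte SHA-1 digests, both of which are assumptions rather than figures from the article, and it treats the 65,535 chunk limit as the commenter's recollection rather than a verified protocol constant.

```python
CHUNK_SIZE = 16 * 1024        # 16 KB chunks, as quoted in the thread
HASH_SIZE = 20                # bytes per SHA-1 digest (assumed hash function)
DVD_SIZE = 4_700_000_000      # assumed single-layer DVD image, about 4.7 GB
RECALLED_LIMIT = 65_535       # the commenter's recollection, not a verified limit


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


chunks = ceil_div(DVD_SIZE, CHUNK_SIZE)
hash_list_bytes = chunks * HASH_SIZE
min_chunk_for_limit = ceil_div(DVD_SIZE, RECALLED_LIMIT)

print(f"chunks needed:        {chunks:,}")
print(f"hash list size:       {hash_list_bytes / 1e6:.1f} MB")
print(f"overhead vs. payload: {hash_list_bytes / DVD_SIZE:.3%}")
print(f"chunk size needed to stay under {RECALLED_LIMIT:,} chunks: "
      f"{min_chunk_for_limit:,} bytes")
```

Under these assumptions the image needs about 287,000 chunks, well above the recalled limit, while the digest list itself is under 6 MB. This counts only the one-time hash list, not any per-peer messages about which chunks each peer holds.
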
  3. Re:TorrentSoup by eric76 · · Score: 2, Insightful

    The only thing I use the file sharing networks for is to download new images of FreeBSD and Linux using BitTorrent.

    The last thing I want is a "similar" file.

    What would be a "similar" file to a FreeBSD ISO? It would either be a corrupted file or one with an introduced exploit.

  4. Re:TorrentSoup by drix · · Score: 3, Insightful

    Because it gets you published and, thus, increases your chance for tenure, that from which all blessings flow.

    --
    I think there is a world market for maybe five personal web logs.
  5. mod this up -- angio...this is the problem by tacokill · · Score: 2, Insightful

    What the parent is saying can be summarized with a simple example:

    A 200 MB, 30 min video that was compressed at 1000 kbps DivX is not the "same file with minor changes" as a 200 MB, 30 min video that was compressed at 900 kbps DivX. They ARE different files and should be treated as such. You also can't deduce anything from their filenames, play length, or any other characteristic, so how would you determine which ones can go together and which ones can't? I did not see codecs or compression mentioned at all in the article.

    This is the fundamental problem here. You can't recombine video and audio files unless they ARE the same file. You have to account for different bitrates, compression ratios, and who knows what else (I am no expert in this area but this seems obvious...).

    Lemme guess -- the mp3s mentioned in the article were ALL encoded at the same bitrate, right? If not, then please correct me because now you have my attention ;)
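
The objection above can be illustrated with a small experiment on synthetic data. This is a sketch under stated assumptions: random bytes stand in for tag blocks and encoded audio, chunks are fixed at 16 KB, and `shared_fraction` is a hypothetical helper. It does not model any real codec.

```python
import hashlib
import random

CHUNK_SIZE = 16 * 1024


def chunk_set(data: bytes) -> set[bytes]:
    """SHA-1 digests of the fixed-size chunks of `data`."""
    return {hashlib.sha1(data[i:i + CHUNK_SIZE]).digest()
            for i in range(0, len(data), CHUNK_SIZE)}


def shared_fraction(wanted: bytes, other: bytes) -> float:
    """Fraction of distinct chunks of `wanted` that also occur in `other`."""
    w = chunk_set(wanted)
    return len(w & chunk_set(other)) / len(w)


def random_bytes(rng: random.Random, n: int) -> bytes:
    return rng.getrandbits(8 * n).to_bytes(n, "big")


rng = random.Random(0)
tags_a = random_bytes(rng, CHUNK_SIZE)        # stand-in for one tag block
tags_b = random_bytes(rng, CHUNK_SIZE)        # different tags, same length
audio = random_bytes(rng, 63 * CHUNK_SIZE)    # stand-in for encoded audio
audio_2 = random_bytes(rng, 63 * CHUNK_SIZE)  # stand-in for a re-encode

original = tags_a + audio
retagged = tags_b + audio              # same audio bytes, different tags
reencoded = tags_a + audio_2           # same tags, different audio bytes
shifted = tags_a + b"\x00" + audio     # same audio, moved by one byte

for name, other in [("retagged", retagged), ("reencoded", reencoded),
                    ("shifted", shifted)]:
    print(f"{name:10s} {shared_fraction(original, other):.4f}")
```

In this toy setup the retagged copy shares 63 of 64 chunks, while the re-encoded copy shares only the tag chunk. The one-byte shift also defeats fixed-size chunking even though the audio bytes are unchanged, so tags of different lengths would need chunk boundaries chosen some other way.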

  6. Re:Right.... by poopdeville · · Score: 2, Insightful

    Which is why you would download a .torrent-like file specifying which of those you want. Then you would download the 99.9999% that agrees from any/all of them (essentially making your personal swarm temporarily bigger), and download the missing .0001% from the version you requested.

    This is very straightforward. I don't see how people can misunderstand this idea.

    --
    After all, I am strangely colored.
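
The scheme in the last comment, where a manifest names the wanted version and matching chunks come from any source, might look like the sketch below. It assumes the manifest is simply an ordered list of SHA-1 chunk digests; the `assemble` function and the source names are hypothetical, and network transfer is replaced by in-memory byte strings.

```python
import hashlib

CHUNK_SIZE = 16 * 1024


def split(data: bytes) -> list[bytes]:
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def digest(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()


def assemble(manifest: list[str], sources: dict[str, bytes]) -> tuple[bytes, list[str]]:
    """Rebuild the file described by `manifest` from whatever sources hold each chunk.

    Returns the rebuilt bytes and, per chunk, the name of the source used.
    """
    # Index every chunk of every source by its hash (first source seen wins).
    index: dict[str, tuple[str, bytes]] = {}
    for name, data in sources.items():
        for chunk in split(data):
            index.setdefault(digest(chunk), (name, chunk))

    parts, used = [], []
    for wanted_hash in manifest:
        if wanted_hash not in index:
            raise LookupError(f"no source holds chunk {wanted_hash[:8]}")
        name, chunk = index[wanted_hash]
        parts.append(chunk)
        used.append(name)
    return b"".join(parts), used


# Toy example: the requested version and a similar one differ only in chunk 0.
body = b"".join(bytes([n]) * CHUNK_SIZE for n in range(1, 4))
requested = b"\x00" * CHUNK_SIZE + body
similar = b"\xff" * CHUNK_SIZE + body

manifest = [digest(c) for c in split(requested)]
rebuilt, used = assemble(manifest, {"similar": similar, "requested": requested})

print(used)                  # chunk 0 from "requested", the rest from "similar"
print(rebuilt == requested)  # the result is the requested version, byte for byte
```

Because chunks are selected by digest, a similar file can only contribute chunks that are identical to the requested ones, which bears on the earlier worry about receiving a modified ISO.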