Slashdot Mirror


Use BitTorrent To Verify, Clean Up Files

jweatherley writes "I found a new (for me at least) use for BitTorrent. I had been trying to download beta 4 of the iPhone SDK for the last few days. First I downloaded the 1.5GB file from Apple's site. The download completed, but the disk image would not verify. I tried to install it anyway, but it fell over on the gcc4.2 package. Many things are cheap in India, but bandwidth is not one of them. I can't just download files > 1GB without worrying about reaching my monthly cap, and there are Doctor Who episodes to be watched. Fortunately we have uncapped hours in the night, so I downloaded it again. md5sum confirmed that the disk image differed from the previous one, but it still wouldn't verify, and fell over on gcc4.2 once more. Damn." That's not the end of the story, though — read on for a quick description of how BitTorrent saved the day in jweatherley's case.

jweatherley continues: "I wasn't having much success with Apple, so I headed off to the resurgent Demonoid. Sure enough they had a torrent of the SDK. I was going to set it up to download during the uncapped night hours, but then I had an idea. BitTorrent would be able to identify the bad chunks in the disk image I had downloaded from Apple, so I replaced the placeholder file that Azureus had created with a corrupt SDK disk image, and then reimported the torrent file. Sure enough it checked the file and declared it 99.7% complete. A few minutes later I had a valid disk image and installed the SDK. Verification and repair of corrupt files is a new use of BitTorrent for me; I thought I would share a useful way of repairing large, corrupt, but widely available, files."
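
The check described here can be sketched in a few lines. A torrent's metainfo carries a SHA-1 digest for each fixed-size piece, so a client can tell which pieces of an existing file are bad and fetch only those. This is a minimal illustration on in-memory data, not the code of Azureus or any other client; the helper names and the toy 100-byte piece size are assumptions.

```python
import hashlib

def piece_hashes(data: bytes, piece_length: int) -> list[bytes]:
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]

def bad_pieces(local: bytes, expected: list[bytes], piece_length: int) -> list[int]:
    """Return indices of pieces whose SHA-1 does not match the expected digest."""
    actual = piece_hashes(local, piece_length)
    bad = [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
    # Pieces missing entirely from a truncated local copy also need fetching.
    bad.extend(range(len(actual), len(expected)))
    return bad

# Toy example: a 1000-byte "image" with 100-byte pieces and two flipped bytes.
good = bytes(range(250)) * 4
expected = piece_hashes(good, 100)
corrupt = bytearray(good)
corrupt[150] ^= 0xFF   # lands in piece 1
corrupt[730] ^= 0xFF   # lands in piece 7
todo = bad_pieces(bytes(corrupt), expected, 100)
print(todo)                                              # pieces to re-download
print(f"{100 * (1 - len(todo) / len(expected)):.1f}% complete")
```

Only the pieces listed in `todo` would have to come from the swarm; everything else is reused from the file already on disk.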

55 of 212 comments

  1. Nice by Goldberg's+Pants · · Score: 5, Interesting

    Awesome idea. I've done this in the past with stuff. If a corrupt version was on one tracker, I'd save the files, get a new torrent and import the old files. Saves a lot of bandwidth wasting.

    1. Re:Nice by Goldberg's+Pants · · Score: 5, Interesting

      Okay, I had some AVI's and a bunch of them had issues. All I did was copy them out to a different directory, then find a GOOD torrent (with the same rips) then make sure the filenames match exactly. Chucked them in the directory and voila. It checks them all and uses what data it can that you already have and replaces the rest.

      Done this with RAR archived stuff as well. (Multipart rars on torrents are retarded, but that's another issue entirely.)

    2. Re:Nice by ThePhilips · · Score: 4, Interesting

      I do not know what GP meant precisely, but I had a similar experience.

      Some game (a very old RPG, no longer sold) was available on Overlord and on BitTorrent. The problem was that BitTorrent had only a single seed with minuscule upload speed: in several days I had downloaded only a few megs. I then tried Overlord and in a few days I got the game almost complete, but another snag hit me: whether by mistake or intentionally, the file was poisoned and three parts couldn't be downloaded. I was ready to throw everything away, since antique games interest me little (but a friend was recommending it as a milestone RPG I had to play). Then suddenly I was enlightened: I fed the incomplete ISO of the game to BitTorrent. The BT client happily announced something like 98% of the file complete and in less than one night downloaded the rest of the file.

      --
      All hope abandon ye who enter here.
    3. Re:Nice by gomiam · · Score: 5, Informative
      It should be quite simple. Let's say torrentA leaves you with a corrupt/incomplete filesetA (one or more files, it doesn't really matter). Let's suppose torrentB contains the files in filesetA, perhaps with different names in its own filesetB.

      Ok, you load torrentB in your favorite Bittorrent client, and start it up. It will automatically create 0-sized files with the names in filesetB (at least, all clients I know do that). Stop the transfer of torrentB, and substitute the 0-sized files in filesetB with the corresponding files in filesetA (may require some renaming). As you restart torrentB, your Bittorrent client will recheck the whole filesetB, keeping the valid parts in order to avoid downloading them. Voilà! You have migrated files from one torrent to another.

      Note: You should make sure that the files you are substituting in are the same files you want to download through torrentB or, at least, keep a copy around until you see that the restart check accepts most of their contents.

    4. Re:Nice by peipas · · Score: 5, Funny

      Okay, my friend had some AVI's and a bunch of them had issues. There, fixed it for you. You're welcome.
    5. Re:Nice by empaler · · Score: 5, Insightful

      I assume you then continued seeding? :)

    6. Re:Nice by CastrTroy · · Score: 2, Insightful

      Reminds me of the "PARS" I used to get off usenet. I think it was basically a RAR split up into hundreds of pieces, with parity information in each of the files. You only needed to download a certain percentage of the files to reconstruct the original file. It was great, because often pieces of the file would go missing, or become corrupted somewhere along the way.

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    7. Re:Nice by X0563511 · · Score: 2, Informative

      And then, Par2 came along, and allowed more flexibility.

      We still use them, on usenet anyways.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    8. Re:Nice by Christophotron · · Score: 2, Informative
      no, actually that's how the old par system worked. In the newer, more advanced .par2 system, the individual .rar files are divided into "blocks" and each par file can recover a certain number of "blocks". It's much more advanced than the old par files that you are referring to, and it's similar to the hashing mechanism in bittorrent. I haven't seen any of the old-style pars in a long time.

      For example, if you are missing a total of 3 blocks (one block from 3 different files) you only need to download a very small par2 file that says "+3 blocks" and it will repair the three missing blocks. Of course, if you are missing a lot more data, even entire files, you can get several of the larger "+128" par files and it'll repair everything (assuming there is enough parity data). Often you can even request additional parity blocks, but that's only necessary if you have a *really* crappy nntp provider.
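
      To make the "+N blocks" idea concrete, here is a deliberately simplified sketch. PAR2 itself uses Reed-Solomon coding, which can rebuild several blocks; the code below swaps in plain XOR parity, which can rebuild only a single missing block, and the block contents are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# Four equal-sized data blocks plus one parity block ("+1 block" of recovery data).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Block 2 goes missing; XOR of the survivors and the parity block restores it.
survivors = data[:2] + data[3:]
recovered = xor_blocks(survivors + [parity])
print(recovered == data[2])
```

      The parity block is the same size as one data block no matter which block is lost, which is why a small recovery file can patch a hole anywhere in the set.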

    9. Re:Nice by Hes+Nikke · · Score: 2, Insightful

      "Done this with RAR archived stuff as well. (Multipart rars on torrents are retarded, but that's another issue entirely.)"

      Any idea why the multipart RAR torrents tend to have healthier swarms than single file torrents of the same content? It pisses me off!
      --
      Don't call me back. Give me a call back. Bye. So yeah. But bye our, well, but alright we are on a shirt this chill.
    10. Re:Nice by i.of.the.storm · · Score: 3, Interesting

      I think it has to do with the way the "scene" releases things, they usually do it via multipart rars or something like that. I saw something to that effect in the comments on a torrent a while ago. I think the reason is that things in the "scene" get distributed in ways that aren't bittorrent, so the breaking up into pieces makes sense there. I'm still not entirely sure what the "scene" entails, and how they differ from the people that put the torrents up, so I don't know the whole answer to that.

      --
      All your base are belong to Wii.
    11. Re:Nice by maskedbishounen · · Score: 3, Informative

      Usenet. All "files" (posts) are stored server-side, and folks generally have a fast pipe to their ISP (or other provider).

      With multipart binary posts, a single file is split up between so many posts. Between fifteen to fifty, let's say. It's common for usenet providers to not receive all the posts, so folks are sometimes left with incomplete/corrupt files. Enter the small, spanned archive formats. It's quite common to see up to 10% parity per usenet posting, especially for large files. Small split set sizes make for easy reposting, as well.

      In regards to the grandparent, this likely relates to why the said torrents are healthier. Folks who can bypass the leeching process and go straight to the seeding. The only other means of really sharing on such binary groups would be posting (or reposting) stuff for folks. Due to ever limiting server retention, a lot of the binary groups look down on heavy posting.

      I gave up on usenet years ago, though; well, not really. My ISP gave up on it, and I was too much of a bum to pay someone for decent service. I would encourage anyone to check if their ISP offers usenet access if they're into P2P and don't like the "2P" part that much.

      --
      "An infinite number of monkeys typing into GNU emacs would never make a good program."
    12. Re:Nice by Jarik_Tentsu · · Score: 2, Interesting

      Mmm, in my experience, Firefox's Download Manager occasionally leaves me with incompletely downloaded files - especially when they're big. Dunno whether this is a bad connection (Telstra, I wouldn't be surprised) or an issue with the actual Download Manager, but I don't get these issues when using Free Download Manager.

      Anyways, I've done this before for a different thing.

      There was a rare file I was trying to get my hands on, which was fairly large, but corrupted. There was a torrent which had it too, but was giving out really slow speeds (like...1-2 seeders, 3-4 leechers who must've been on dial up or Telstra broadband...). So I HTTP downloaded the corrupt file, then used the torrent to fix up the last corrupted parts. Worked perfectly. =)

      ~Jarik

    13. Re:Nice by Jurily · · Score: 3, Insightful

      Me too. But I never thought about the endless possibilities here.
      Just ship everything with a .torrent to verify.

      (Wow, all the authorities we could annoy with one minor change!)

    14. Re:Nice by meza · · Score: 4, Informative

      "I can't be bothered to do the math of how many possibilities that would be at this time of night, but it'll certainly go faster than continuing to download it from nobody."

      Not quite so. Because if you do the math (and if mine is correct at this late hour) you would see that it actually takes a pretty long time(tm). Imagine if only 1MB was missing. You would have to calculate the hash of every single possible 1MB file, so that is 2^8000000=~10^2300000 files. If you had a computer that could, quite unrealistically, calculate one hash each clock-cycle at 1GHz that would still take you 10^2299991 seconds. As a reference the universe according to Wikipedia is roughly 10^17 seconds old.

      Besides that there is the information theory problem too. If the hash is 128bit long then every 2^128th file will have the same hash. This might seem unlikely if you only compare a few files (such as all the files ever created by man) but compared to the 2^8000000 hashes we were going to calculate it is actually quite substantial.
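
      The estimate is easy to re-run with logarithms, which avoids building the enormous numbers themselves. The inputs (8,000,000 missing bits, one hash per cycle at 1 GHz, a 128-bit hash) are the ones assumed in the comment. With them the exponent comes out near 2.4 million, somewhat larger than the late-night figure quoted above, which only strengthens the conclusion.

```python
import math

missing_bits = 8_000_000          # 1 MB of missing data, as in the comment
hashes_per_second = 1e9           # assumed: one hash per cycle at 1 GHz

# log10(2**missing_bits), computed without building the huge integer
log10_candidates = missing_bits * math.log10(2)
log10_seconds = log10_candidates - math.log10(hashes_per_second)

# Expected number of candidates sharing one 128-bit hash value, as a power of two
log2_collisions = missing_bits - 128

print(f"candidates ~ 10^{log10_candidates:.0f}")
print(f"seconds    ~ 10^{log10_seconds:.0f}")
print(f"colliding candidates ~ 2^{log2_collisions}")
```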
    15. Re:Nice by de_smudger · · Score: 2, Funny

      Look, they were AVIs of someone installing linux distros, you insensitive clod ;)

    16. Re:Nice by sexconker · · Score: 2, Informative

      No, you can't.

      A checksum is not unique.
      A 32-bit checksum has only 2^32 possible values, so an astronomically large number of different 32 MB files share any given checksum.

      Using "given" data to help narrow the search is a bad idea as well - there is no guarantee that the given data is correct, unless you have individual checksums for them. Bittorrent does do checksumming on each individual chunk (I believe), so you could narrow your search space to the size of the incomplete and missing chunks only. The existing data in incomplete chunks would be almost useless, since you don't know if that data is correct. But you can start your search assuming it's correct, (it probably is, mostly) and speed things up.

      But the bottom line is that checksums are smaller than the data they verify. Much smaller. Consider a simple example of a 2 bit checksum on an 8 bit chunk of data. Our checksum simply counts the ones, and rolls over.

      00000000 : 00 (0 ones)
      00011010 : 11 (3 ones)
      10110111 : 10 (6 ones - 110 is truncated to 10)

      There are, on average, 64 ways to get any particular checksum.
      2^8(data length) / 2^2(checksum length) = 2^6 = 64. And that's with us having 25% of our data duplicated in checksums.
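
      The toy checksum is small enough to enumerate completely. The sketch below tallies all 256 possible 8-bit inputs. The average per checksum is 64, as computed above, but this particular ones-counting scheme does not spread inputs perfectly evenly across the four values.

```python
from collections import Counter

def ones_checksum(byte: int) -> int:
    """2-bit checksum: count the set bits, keep the low two bits."""
    return bin(byte).count("1") % 4

# Tally how many of the 256 possible 8-bit values map to each checksum.
counts = Counter(ones_checksum(b) for b in range(256))
for checksum in range(4):
    print(f"{checksum:02b}: {counts[checksum]} inputs")

print("average:", sum(counts.values()) / len(counts))
```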

      A checksum is a check. It is not a guarantee nor is it a blueprint from which you can reconstruct the original data. In certain cases it would be feasible - if you're downloading a thesis about Skittles, and it's corrupted, you could perform a brute force search (like you described) on the (small - it's just text) data, and then sort the matches by # of times "Skittles" is present, then by the % of data that is ASCII. You then hand verify the top 20 results or so, and you'll probably have it.

      The same could theoretically be applied to AVIs by enforcing the AVI frame structure (throw out checksum matches that don't generate valid AVI files), attempting to grab audio out of the generated files, and then doing a frequency analysis of the audio - rank the results in terms of % of audio that falls within normal listening ranges (since it's almost guaranteed that audio in an AVI will be compressed in a lossy format).

      You could do analysis of the video frames and such too. But the bottom line is that it's a HUGE undertaking - just redownload the damned thing, pay for it, or write your own paper. If it's vital data then go ahead, brute force it and waste your life away.

      PAR2 files are neat - they give you chunkettes of parity data at different offsets. This allows you to potentially patch holes in data and reconstruct the original files. WinRAR (and other programs) do give you the option to create a recovery record that's placed in the original RAR (or whatever format) files. The problem is that you then have to download the recovery data. With PAR files, you don't download them unless you need them. The downside is that availability then becomes a problem.

    17. Re:Nice by operagost · · Score: 4, Insightful

      Any modern file system will fragment if you expand an existing file. It simply has no way to guess how big the file will get when it is created unless your application chooses the proper allocation.

      To give you an extreme example, imagine a 100 GB volume which has no files. You create a 1 MB file, and your filesystem places it near the top. Now you create a second file, and your filesystem places it... well, it could place it anywhere except that first 1 MB, so let's say it places it right next to the first file. Uh oh, it turns out that you need to write 1 GB of data to that first file and extend it. Now you have two fragments.

      Ok, let's assume our file system is magical and knows that you like to extend files to huge sizes. So it places the second file at the end of the disk, instead. Oops, you fooled your file system: this time, you wanted to extend the second file by 1 GB. There is no room to append to the end of the file, so a second extent is created somewhere else and linked to the second file. You have two fragments again.

      This is why performance tuning requires that you anticipate data requirements and allocate space accordingly; for example, by setting the initial size of database files to one that should reasonably accommodate the data requirements for the foreseeable future (and not automatically shrinking the database down when records are deleted).
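
      For a download whose final size is known up front, the same advice looks roughly like the sketch below. It assumes a Unix-like system where `os.posix_fallocate` is available and falls back to merely setting the file length elsewhere; the file name and the 16 MiB size are placeholders.

```python
import os
import tempfile

def preallocate(path: str, size: int) -> None:
    """Create a file and reserve `size` bytes for it up front."""
    with open(path, "wb") as f:
        try:
            # Ask the filesystem to allocate the blocks now (Unix only).
            os.posix_fallocate(f.fileno(), 0, size)
        except (AttributeError, OSError):
            # Fallback: only sets the length; the file may be sparse.
            f.truncate(size)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "download.part")   # hypothetical file name
    preallocate(target, 16 * 1024 * 1024)
    allocated_size = os.path.getsize(target)

print(allocated_size)
```

      Whether the reserved space ends up contiguous is still the filesystem's decision; the call only tells it the final size before any data arrives.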

      --

      Gamingmuseum.com: Give your 3D accelerator a rest.
    18. Re:Nice by spasm · · Score: 2, Funny

      "... that would still take you 10^2299991 seconds. As a reference the universe according to Wikipedia is roughly 10^17 seconds old."

      It's ok, I just edited wikipedia to make the age of the universe 10^2299991.

      Oh, nearly forgot, ~~~~

  2. Good for seeding stuck torrents, too by b4dc0d3r · · Score: 5, Interesting

    If I happen to see a stuck torrent (many leechers, no seeds), sometimes I can find a good version of the file I already have - so I start the torrent, stop it, replace the single good file (sometimes you need more if the file is smaller than the part size), and upload a few Kb to finish the torrent. Then sit back and watch as everyone fills up.
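
    The remark about needing more than one file follows from how a multi-file torrent is hashed: the files are treated as one long byte stream, so a piece can straddle file boundaries. A small sketch with made-up file sizes and a 256 KiB piece length shows which pieces touch a file and which files a piece needs:

```python
from itertools import accumulate

def pieces_for_file(file_lengths: list[int], piece_length: int, index: int) -> range:
    """Pieces that overlap file `index` when files are laid end to end."""
    offsets = [0] + list(accumulate(file_lengths))
    start, end = offsets[index], offsets[index + 1]   # byte range [start, end)
    if start == end:
        return range(0)                               # empty file touches no piece
    return range(start // piece_length, (end - 1) // piece_length + 1)

def files_for_piece(file_lengths: list[int], piece_length: int, piece: int) -> list[int]:
    """Files that contribute bytes to a given piece."""
    return [
        i for i in range(len(file_lengths))
        if piece in pieces_for_file(file_lengths, piece_length, i)
    ]

# Hypothetical torrent: three files, 256 KiB pieces.
lengths = [300_000, 100_000, 700_000]
piece = 262_144
print(list(pieces_for_file(lengths, piece, 1)))   # pieces touching the small file
print(files_for_piece(lengths, piece, 1))         # files needed to hash piece 1
```

    Here the 100,000-byte file sits entirely inside one piece, and that piece cannot be verified without bytes from both of its neighbours.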

  3. Anonymous Coward by Anonymous Coward · · Score: 3, Informative

    Those of us who use BitTorrent for *ahem* illegal purposes have been doing this since the beginning. The only way to get rare and complete downloads was to take the files to other trackers and match them against another md5 to finish the download.

    It's like getting parity files over on usenet to fix that damned .r23 file which is just a bit too short for some reason :)

  4. Scheduling by FiestaFan · · Score: 4, Informative

    "Many things are cheap in India, but bandwidth is not one of them. I can't just download files > 1GB without worrying about reaching my monthly cap, and there are Doctor Who episodes to be watched. Fortunately we have uncapped hours in the night"

    I don't know about other bittorrent clients, but uTorrent lets you set download speed caps by hour (like 0 during the day and unlimited at night).
    1. Re:Scheduling by urbanriot · · Score: 2, Informative
      "I don't know about other bittorrent clients, but uTorrent lets you set download speed caps by hour (like 0 during the day and unlimited at night)."

      Azureus also has an excellent scheduling plugin written for it - http://students.cs.byu.edu/~djsmith/azureus/index.php
  5. !new by gustgr · · Score: 2, Insightful

    For heavy BT users this tactic is very common, provided the file(s) you are willing to download is fairly well available from different sources.

    1. Re:!new by SanityInAnarchy · · Score: 2, Insightful

      It's an older concept than that, even. Goes back to the strange Debian habit of using a tool called Jigdo -- it would provide essentially a recipe for building an ISO out of all the files needed, where the files were mostly available from standard Debian mirrors. ISOs were available from far fewer mirrors than standard Debian packages, you see.

      So, you'd use Jigdo, and if all went well, it'd assemble a working image. But if a few packages couldn't be downloaded, you could always take your mostly-complete Jigdo file and use rsync with an rsync-capable mirror. (Or, more recently, BitTorrent on Ubuntu -- but that's another story.)

      I don't think this tactic is very common, though, as most people seem to have no fucking clue how BitTorrent works. I've seen torrents with gigantic multipart RARs, with an SFV of those. Let's see... so, my torrent software is already checksumming everything, and RAR has a builtin checksum too, or at least, acts like it does (it says "ok" or not) -- and on top of that, there's an SFV checksum (crappy CRC32), too. Never mind that RAR saves you at most a few megabytes (video is already compressed), which, based on the size of these files, you'll spend more time unpacking the RAR than you would downloading the extra couple megs. Or that, once you unpack and throw away the RAR, you can't seed that torrent from the working video. Or that multipart anything is retarded on BitTorrent, as the torrent is splitting it into 512k-4meg chunks anyway.

      Whoops, end of rant. Oh, by the way, that wasn't about me, it was about my friend. Wink wink.

      --
      Don't thank God, thank a doctor!
    2. Re:!new by Anonymous Coward · · Score: 5, Informative

      "I don't think this tactic is very common, though, as most people seem to have no fucking clue how BitTorrent works. I've seen torrents with gigantic multipart RARs, with an SFV of those. Let's see... so, my torrent software is already checksumming everything, and RAR has a builtin checksum too, or at least, acts like it does (it says 'ok' or not) -- and on top of that, there's an SFV checksum (crappy CRC32), too. Never mind that RAR saves you at most a few megabytes (video is already compressed), which, based on the size of these files, you'll spend more time unpacking the RAR than you would downloading the extra couple megs. Or that, once you unpack and throw away the RAR, you can't seed that torrent from the working video. Or that multipart anything is retarded on BitTorrent, as the torrent is splitting it into 512k-4meg chunks anyway."

      People who aren't aware of the full situation often make this complaint. These multipart rar files are "scene releases".

      First of all, scene releases are _never_ compressed; it's always done with the -0 argument, this makes it basically equivalent to the unix split program. If a file is to be compressed, it is done with a zip archive, and the zip archive is placed inside the rar archive. This is because rar archives can be created/extracted easily with FOSS software, but cannot easily be de/compressed. This was more of an issue before Alexander Roshal released source code (note:not FOSS) to decompress rar archives.

      Second, people often have parts of, or complete, scene releases and they are unwilling to unrar them (often because it's an intermediary, like a shell account somewhere where law isn't a problem).

      Third, people follow "the scene" and try and download the exact releases that are chosen by the social customs of the scene (I am not going to detail those here), thus, "breaking up" (ie, altering) the original scene release is seen as rude.

      Fourth, the archive splitting is in precise sizes so that fitting the archives onto physical media works better; typically the archive size is some rough factor of 698, ~4698 and ~8500.

      Fifth, archives are split due to poor data integrity on some transfer protocols (though this is largely historical nowadays); redownloading a corrupted 14.3mb archive is easier than redownloading a 350mb file.

      Sixth, traffic of the size is measured in terabytes, with some releases being tens, or sometimes hundreds of gigabytes in size. Thus, there become efficiency arguments for archive splitting; effective use of connections, limited efficiency of software(sftp scales remarkably poorly, though that is beginning to change - not that sftp is used everywhere), use of multiple coordinated machines and so on. This is an incomplete list of reasons; it is almost as though every time a new challenge is presented to the scene, splitting in some way helps to solve it.
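
      The integrity argument in the fifth point can be illustrated with a short sketch: split the data into fixed-size parts, publish a CRC32 per part (the kind of value an SFV file lists), and re-fetch only the parts that fail. The part size and data below are toy values, and the function names are made up for the example.

```python
import zlib

def split_with_crc(data: bytes, part_size: int) -> list[tuple[bytes, int]]:
    """Split data into fixed-size parts, pairing each with its CRC32."""
    return [
        (data[i:i + part_size], zlib.crc32(data[i:i + part_size]))
        for i in range(0, len(data), part_size)
    ]

def parts_to_refetch(received: list[bytes], crcs: list[int]) -> list[int]:
    """Indices of received parts whose CRC32 does not match the published value."""
    return [
        i for i, (part, crc) in enumerate(zip(received, crcs))
        if zlib.crc32(part) != crc
    ]

original = bytes(range(256)) * 40            # 10,240 bytes standing in for a release
parts = split_with_crc(original, 1024)       # ten 1 KiB parts
crcs = [crc for _, crc in parts]

received = [part for part, _ in parts]
received[6] = received[6][:-1] + b"\x00"     # damage one byte of part 6 in transit
print(parts_to_refetch(received, crcs))      # only this part needs fetching again
```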

      AC because I'm not stupid enough to expose my knowledge of this either to law enforcement, or to the scene (who might just hand me over for telling you this - it has been done). Suffice to say that this is more complex than you understand, and that even this level of incomplete explanation is rare.
    3. Re:!new by dk.r*nger · · Score: 4, Informative

      "First of all, scene releases are _never_ compressed; it's always done with the -0 argument, [...] This was more of an issue before Alexander Roshal released source code (note:not FOSS) to decompress rar archives."

      So, historical, and pointless. And anyway, just an excuse if there's any point in using RAR anyway. Let's see..

      "Second, people often have parts of, or complete, scene releases and they are unwilling to unrar them (often because it's an intermediary, like a shell account somewhere where law isn't a problem)."

      So they should use BitTorrent. Run a seed on your [strike]compromised windows host[/strike] "shell account".

      "Third, [....] social customs of the scene (I am not going to detail those here), thus, 'breaking up' (ie, altering) the original scene release is seen as rude."

      Oh, I think we're at the core of the problem. Pale teenagers in their mothers' basements getting hurt feelings. I appreciate that someone will rip the Lost episodes in HD pretty much as they are being broadcast, and I actually look for some "group names" in the torrents I get - because they provide one file, not a RAR. In other words, provide what people want, and they will respect you for that. Make their life hard, and they will not care about your 1998 social customs. Like anything else in life.

      "Fourth, [...]fitting the archives onto physical media works better"

      Yawn. 1998 called, they want their infrastructure back. Harddrives are cheaper than dirt. Five years ago "the scene" at my college exchanged 250 gb harddrives.

      "Fifth, archives are split due to poor data integrity on some transfer protocols"

      SO USE BITTORRENT! It's easier and faster and better and more fun, but of course less 'leet than using [strike]compromised windows hosts[/strike] "shell accounts"

      "Sixth, [...] Thus, there become efficiency arguments for archive splitting;[...]it is almost as though every time a new challenge is presented to the scene, splitting in some way helps to solve it."

      No, BitTorrent does ALL this for you. ALL of it.

      "AC because I'm not stupid enough to expose my knowledge of this either to law enforcement, or to the scene (who might just hand me over for telling you this - it has been done)."

      Badass gangster!

      "Suffice to say that this is more complex than you understand, and that even this level of incomplete explanation is rare."

      What? Moving files around on the internet is "more complex" than we understand? It's probably the simplest fucking thing there is. Let me put it very simply for you: 1) Multi-file RARs made sense back when people got their stuff from FTPs and newsgroups. 2) It's the past. It's pure nostalgia. Get over it. If you're not using your "scene" FTP servers as Torrent seeds instead, you're wasting your resources.
    4. Re:!new by totally+bogus+dude · · Score: 2, Interesting

      I actually look for some "group names" in the torrents I get - because they provide one file, not a RAR. In other words, provide what people want, and they will respect you for that. Make their life hard, and they will not care about your 1998 social customs. Like anything else in life.

      Firstly, if you use torrents then nobody in the "Scene" gives a flying toss about whether you respect them or not. I have nothing to do with the Scene, and even I know that. They are not ripping things for us, they're ripping things for themselves. We're feeding from their scraps, if you like.

      Once you understand that, all the other arguments become moot. Yes, multi-part RARs in torrents annoys me as well, but the people making them aren't doing it for us. Most (all?) Scene members would much prefer their releases never ever made it onto BT or USENET. Telling them that you disapprove of their distribution practices is, well, hilarious. Like a bank robber telling the cops he disapproves of their regular patrols of the street with all the banks on it. Actually, it's more like a bank robber in the US complaining about a pre-school teacher in Japan because he doesn't like the colour of the crayons they use. Thanks for the input, but who asked you, anyway?

      So you're left trying to convince the people who do upload to more public services to unrar before they upload. More power to you, and I wish you luck. But I think the mob has largely spoken on this matter, and the mob says: "I don't give a crap if I have to unrar it first, so long as it's a) complete and b) a fast download". The torrents with multi-part archives tend to be seeded better than those which contain the extracted file, and therefore more people download the multi-part; which results in more seeds on it, resulting in more people downloading it...

      As for using BT in the Scene -- it's up to them, it's their resources and they can do what they want with them -- so the following is purely mental masturbation. I would think BT would make it harder to keep "safe" and maybe easier to infiltrate. Password-protecting the servers (assuming most BT clients and trackers even support such) is probably insufficient; you'd likely want a local firewall to ensure only other Scene members can connect to your client. Keeping such a list updated in a secure manner would be somewhat tricky, I think, and telling everyone else the IP address of every other member sounds like a no-go.

  6. Any MD5s on Apple's page? by CSMatt · · Score: 3, Interesting

    Are there even MD5 hashes on Apple's download pages for such large files? Judging by how the article was written and the lack of hashes on the QuickTime and iTunes download sites, it doesn't seem like they even bother.

    1. Re:Any MD5s on Apple's page? by Anonymous Coward · · Score: 3, Informative

      Yes, there are- though most of the latest ones are SHA-1 digests. They're not usually seen in the "public front page" download areas and aren't universal, but are generally present for the downloads for updates and security patches through links from the tech literature and developer sections.
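
      When a digest is published, checking a download against it takes only a few lines. The sketch below hashes a file in chunks so a multi-gigabyte image never has to fit in memory. The file name is a placeholder, and the digest used is the standard SHA-1 test vector for the three bytes "abc", not anything taken from Apple's pages.

```python
import hashlib
import os
import tempfile

def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large images need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path: str, published_hex: str) -> bool:
    """Compare against a digest copied from a download page (case-insensitive)."""
    return sha1_of_file(path) == published_hex.strip().lower()

with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "sdk.dmg")      # hypothetical file name
    with open(image, "wb") as f:
        f.write(b"abc")
    ok = matches_published(image, "A9993E364706816ABA3E25717850C26C9CD0D89D")

print(ok)
```

      A whole-file digest like this only says that something is wrong; unlike the per-piece hashes in a torrent, it cannot say where.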

  7. Re:What broken software were you using? by Dice · · Score: 5, Insightful

    I asked the same question. Wikipedia answered it.

  8. Hardware Failure is your bigger concern by Bazar · · Score: 4, Interesting

    One should be more concerned as to why your files are becoming corrupted.

    I'd say it's a safe bet that the files from apple.com are in perfect condition.

    Which means it either became corrupted in transit to, or on arrival at, your machine.

    Which leads to the question: is your memory defective? Run memtest86 to check it.
    http://www.memtest86.com/

    Check if your hard drives have SMART and are reporting anything. A disk checker would also be a good idea.

    The other idea that springs to mind is that you're behind some proxy with the above problems, although I doubt anyone would want to proxy a 1.5 GB file.

    Fact is, if files are being corrupted on your disk, it's just a matter of time before something more important is hit by corruption.

    --
    To avoid criticism; Say nothing, Do nothing, Be nothing.
    1. Re:Hardware Failure is your bigger concern by Anonymous Coward · · Score: 5, Interesting

      could also be one's routers.

      There was a problem w/ D-Link routers back in the day that hit a lot of p2p users. If you placed your machine in the DMZ, the router basically did a search and replace on all packets, replacing the bitstring representing the global address w/ the bitstring representing the local address. On large files, this didn't just hit the IP header but the data as well, corrupting it. If you didn't use the DMZ functionality, just port mapping, it worked fine. So if you were using BitTorrent, you'd get repeated hash fails on some parts that would never fix, because BitTorrent has no capability to work around that (as opposed to eMule's extensions).
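
      The failure mode is easy to reproduce in miniature. The sketch below imitates the described behaviour (a blind search and replace over the whole packet) on a made-up payload; the addresses are placeholders and the function is not taken from any router firmware.

```python
import hashlib
import ipaddress

def naive_nat_rewrite(payload: bytes, external_ip: str, internal_ip: str) -> bytes:
    """Buggy behaviour as described: replace the address bytes everywhere."""
    ext = ipaddress.IPv4Address(external_ip).packed
    internal = ipaddress.IPv4Address(internal_ip).packed
    return payload.replace(ext, internal)

# Hypothetical addresses; 203.0.113.7 is from a documentation-only range.
external, internal = "203.0.113.7", "192.168.1.2"

# File data that happens to contain the same four bytes as the external address.
piece = b"some file data " + bytes([203, 0, 113, 7]) + b" more file data"
mangled = naive_nat_rewrite(piece, external, internal)

print(hashlib.sha1(piece).digest() == hashlib.sha1(mangled).digest())  # hash check fails
print(len(piece) == len(mangled))   # same length, so nothing looks truncated
```

      Because the same bytes are rewritten on every retry, a piece containing that pattern fails its hash check every time, which matches the "never fix" symptom.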

    2. Re:Hardware Failure is your bigger concern by cheesybagel · · Score: 5, Interesting
      Maybe, maybe not.

      IIRC TCP/IP has a guaranteed maximum error rate of at least 10^-5 bits. Well, the thing is, 1.5 Gigabytes is over 10^10 bits in length. So even at such an error rate, it is not guaranteed that your file will arrive without bit errors.

    3. Re:Hardware Failure is your bigger concern by BobPaul · · Score: 3, Informative

      "As per the topic, Bittorrent fixed the problems - didn't cause them - so a failing router is not likely the problem."

      You misunderstood his comment; please read it again. In his story, BitTorrent didn't cause any problem either: it identified a problem using the same mechanism (hash checks of file parts) by which it solved the problem in the OP.

      While I agree that bad ram is most likely the issue, it's still possible bad ram in a router or even something goofy going on in a router, such as the firmware bug described, could have caused problems. The bits were mangled before they were written to the disk. They could have been mangled by anything that processed those bits as they traversed from apple's website to his HD, including Apple's website and the HD itself. That embedded devices tend to be more reliable does not mean they don't break and do weird things sometimes.
  9. Re:What broken software were you using? by kcbanner · · Score: 3, Insightful

    It's networking - shit happens. Some of his bits got thrown out of a router somewhere as heat, or maybe a packet timed out and didn't quite make it.

    --
    Obligatory blog plug: http://www.caseybanner.ca/
  10. Been using bittorrent and rsync for this for years by DiSKiLLeR · · Score: 5, Informative

    I've used bittorrent for this purpose many times in years gone by.

    Especially with our slow links, or worse yet, on dialup (if I go enough years back) in Australia.

    Before bittorrent I would use rsync. That required me to download the large file to a server in the US on a fast connection, then rsync my copy to the server's copy to fix what is corrupt in my copy.

    It works beautifully. :)

    --
    You can tell how powerful someone is by the magnitude of the crime they can commit and be able to get away with.
  11. Re:What broken software were you using? by Anonymous Coward · · Score: 5, Interesting

    Those who have never developed P2P software might never understand why they all need to use strong checksums to detect data corruption, and why bad blocks actually do appear in the wild; frequently.

    You'd be shocked - SHOCKED - at how much data gets corrupted routinely - by errant antivirus software, flaky network equipment, plain ol' line noise that the checksums don't detect (which will happen much more often than you expect, see also birthday paradox), or misbehaving routers that think that any occurrence of 0xC0A80102 obviously must be an internal IP address and needs to be changed to your external one. Even if that's in the middle of a ZIP file. Oops.

    Encryption actually aids this somewhat, as the same byte patterns don't get repeated, so if there's an errant IDS changing things for example, it tends not to fire the second time.

    I've done this before for file repairs. Works a treat, but you sort of wish that torrent used a Merkle hash tree such as the modified THEX standard Tiger Tree Hash. SHA-1's so last century.
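
    For readers unfamiliar with the idea, here is a rough sketch of a generic Merkle hash tree in Python. It is not THEX-compatible and does not use Tiger (which is not in Python's standard `hashlib`); SHA-256, the one-byte leaf and node prefixes, and the rule of promoting an unpaired node unchanged are assumptions chosen for illustration.

```python
import hashlib

def merkle_root(blocks: list[bytes]) -> bytes:
    """Return a single root hash covering every block."""
    if not blocks:
        raise ValueError("need at least one block")
    # Leaf hashes get a 0x00 prefix so they cannot be confused with inner nodes.
    level = [hashlib.sha256(b"\x00" + block).digest() for block in blocks]
    while len(level) > 1:
        next_level = []
        # Hash adjacent pairs together with a 0x01 prefix.
        for i in range(0, len(level) - 1, 2):
            pair = b"\x01" + level[i] + level[i + 1]
            next_level.append(hashlib.sha256(pair).digest())
        # An unpaired last node is carried up unchanged.
        if len(level) % 2:
            next_level.append(level[-1])
        level = next_level
    return level[0]

good = merkle_root([b"block0", b"block1", b"block2"])
bad = merkle_root([b"block0", b"blockX", b"block2"])
print(good != bad)
```

    The attraction of a tree over a flat list of piece hashes is that one small root value vouches for the whole file, while the intermediate hashes let a client narrow a mismatch down to a single block.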

  12. Good for game files too by trawg · · Score: 5, Interesting

    We have been doing this for ages for certain high-demand game files that we mirror. While offering torrents for some of our download mirrors is only mildly useful (as we're in Australia we're trying to keep bandwidth on-shore to cut down international traffic, and BT doesn't really help this), it is extremely helpful for the VAST number of users that appear to either have massively crazy Internet problems or are simply unable to drive an HTTP-based downloader and resume downloads.

    When a large number of users are having problems downloading or resuming a particular file, I simply create a torrent for them and give them some vague instructions about how to resume it and then generally I never hear from them again. They're happy because they don't have to download a 4GB game client again from scratch, they don't have to worry about resuming/corrupt downloads, and because it's a torrent it probably feels like they're getting something for free that they shouldn't be.

  13. Or synchronize with yourself... by greerga · · Score: 5, Interesting

    For even more fun, if you have two differently-corrupted copies of a file and a torrent to go with it, then you can have BitTorrent stitch them together into a valid file without involving any third parties.

    I used Azureus's internal tracker ability and two computers on a local network with the torrent modified to track on one of the machines, and one corrupted copy of the file on each.

    Obviously only works if they don't have corruption in common, but it also doesn't require the original torrent file tracker to work anymore.
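
    The stitching idea can be sketched in a few lines of Python. This is only an illustration of per-piece hash checking, not how Azureus is implemented: the tiny `PIECE_SIZE`, the sample data, and the `piece_hashes` and `stitch` helpers are all assumptions. The expected hashes play the role of the piece hashes stored in the torrent file.

```python
import hashlib

PIECE_SIZE = 4  # tiny for illustration; real torrents use far larger pieces

def piece_hashes(data: bytes, size: int = PIECE_SIZE) -> list[bytes]:
    """SHA-1 digest of each fixed-size piece, as a torrent file would list them."""
    return [hashlib.sha1(data[i:i + size]).digest()
            for i in range(0, len(data), size)]

def stitch(copies: list[bytes], expected: list[bytes],
           size: int = PIECE_SIZE) -> bytes:
    """Build a valid file by taking each piece from whichever copy verifies."""
    pieces = []
    for index, want in enumerate(expected):
        start = index * size
        for copy in copies:
            piece = copy[start:start + size]
            if hashlib.sha1(piece).digest() == want:
                pieces.append(piece)
                break
        else:
            # Corruption in common: no copy has a good version of this piece.
            raise ValueError(f"piece {index} is corrupt in every copy")
    return b"".join(pieces)

original = b"ABCDEFGHIJKLMNOP"
expected = piece_hashes(original)   # what the torrent file would carry
copy_a = b"ABCDxxGHIJKLMNOP"        # piece 1 damaged
copy_b = b"ABCDEFGHIJKLMNzz"        # piece 3 damaged
print(stitch([copy_a, copy_b], expected) == original)
```

    If both copies are damaged in the same piece, `stitch` raises an error, which matches the caveat above about corruption in common.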

  14. What a novel idea!!! by WarJolt · · Score: 2, Interesting

    Using BitTorrent for its actual legal intended use. I love it!!!

    I'm not a lawyer though. I just hope it doesn't violate Apple's NDA. Please please please follow the rules. Don't want to see you in prison or slapped with a large fine.

    BitTorrent has received a bad reputation because of pirates. There are legitimate uses though. I do believe that Doctor Who episodes aren't public domain, so shame on you for that. Might want to be careful what you admit to on /.

  15. Re:What broken software were you using? by complete+loony · · Score: 4, Informative

    The TCP checksum offloading on nForce 4 motherboards (I have one) was notorious for corrupting TCP packets and allowing them to be received by the application. That's the most likely kind of failure that would be able to reproduce this problem.

    --
    09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
  16. Re:What broken software were you using? by Anonymous Coward · · Score: 3, Informative

    It's obvious you have no clue how the Internet actually works. Shit happens, but the Internet is designed for it. Dropped packets cause retransmission, not corrupted data; the Internet drops packets *by design* and the entire system is designed around that. Flipped bits happen, but they are detected by multiple checksums which make it astronomically unlikely for corrupt data to remain undetected. Nope; if you receive corrupt data, the blame is squarely on some piece of software fiddling with your packets and changing the checksums to match. Maybe it's the crappy cheap NAT router, or the ISP's deep-packet-inspection P2P filter, or their (not so) transparent HTTP proxy. But whatever the cause, it's almost certain that software is to blame.

    I'd bet $100 that if he did the same download over HTTPS, thus preventing software meddling of the packet contents, it would come out perfect.

  17. Re:What broken software were you using? by SanityInAnarchy · · Score: 3, Informative

    It's obvious you have no clue how the Internet actually works. Shit happens, but the Internet is designed for it... Maybe it's the crappy cheap NAT router

    I'm fairly sure that's what GP meant.

    Oh, and TCP checksumming isn't perfect.
    --
    Don't thank God, thank a doctor!
  18. Re:What broken software were you using? by CastrTroy · · Score: 2, Informative

    I had the same problem. What's really terrible is that I don't think they ever fixed the problem. That drove me nuts for a few weeks trying to figure out why all my downloads were corrupted.

    --

    Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
  19. Re:the new dr. who sucks... by zippthorne · · Score: 2, Informative

    To be fair, very few British cops know how to use guns. At least, if the gun control advocates on my side of the pond can be believed.

    --
    Can you be Even More Awesome?!
  20. Re:What broken software were you using? by Skapare · · Score: 4, Informative

    Flipped bits happen, but they are detected by multiple checksums which make it astronomically unlikely for corrupt data to remain undetected.

    I actually saw this happen once ... the astronomically unlikely [1]. TCP accepted the corrupt packet. I'm sure it will never happen again. Fortunately, rsync caught it in the next run.

    One problem I ran into once with a certain Intel NIC was that a certain data pattern was always being corrupted. TCP always caught it and dropped the packet. There was no progress beyond that point because the hardware defect always corrupted that data pattern. Turns out there was a run of zeros followed by a certain data byte (I tried a different data byte and different run lengths and those never got corrupted). What the NIC did was drop 4 bytes, and put 4 bytes of garbage at the end. I suspect it was a clocking synchronization error. I got around the problem by adding the -z option to rsync (which I normally would not have done with an ISO of mostly compressed files). Another way would have been to do the rsync through ssh, either as a session agent (like rsync itself can do) or as a forwarded port (how I do it now for a lot of things).

    [1] ... approximately 1 in 2^31-1 chance that the TCP checksum will happen to match when the data is wrong (variance depending on what causes the error in the first place) ... which approaches astronomically unlikely. Take 1 Terabyte of random bits. Calculate the CRC-32 checksum for each 256 byte block. Sort all these checksums. You will find 2 (or more) data blocks with the same checksum (or a repeating pattern in your RNG). Why? Because CRC-32 has 2^32-1 possible states, and you have 2^32 random checksums.

    But whatever the cause, it's almost certain that software is to blame.

    Agreed. Since it is at least software's responsibility to detect and fix it, if the problem happens, the famous finger of fault points at the software.

    I'd bet $100 that if he did the same download over HTTPS, thus preventing software meddling of the packet contents, it would come out perfect.

    Your $100 is safe.

    --
    now we need to go OSS in diesel cars
  21. Re:simpler home-brew technique by Just+Some+Guy · · Score: 2, Insightful

    The person with the bad file runs option 1 to make the check file and sends that to the person with the good file. They run option 2 which identifies bad chunks and exports them, which they send back to the first person. Run option 3 and the exports are patched into their download and it's fixed.

    Isn't that almost exactly how rsync works?
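
    The three steps quoted above can be sketched in Python as follows. The function names, the 4-byte `CHUNK` size, and the use of SHA-256 are assumptions for illustration, and both files are assumed to have the same length. rsync itself goes further by using a rolling checksum so it can also cope with data that has shifted position.

```python
import hashlib

CHUNK = 4  # tiny for illustration

def make_check_file(data: bytes, chunk: int = CHUNK) -> list[str]:
    """Option 1: the person with the bad file hashes every chunk."""
    return [hashlib.sha256(data[i:i + chunk]).hexdigest()
            for i in range(0, len(data), chunk)]

def export_bad_chunks(good: bytes, check: list[str],
                      chunk: int = CHUNK) -> dict[int, bytes]:
    """Option 2: the person with the good file exports chunks whose hash differs."""
    patches = {}
    for index, digest in enumerate(make_check_file(good, chunk)):
        if index >= len(check) or check[index] != digest:
            patches[index] = good[index * chunk:(index + 1) * chunk]
    return patches

def apply_patches(bad: bytes, patches: dict[int, bytes],
                  chunk: int = CHUNK) -> bytes:
    """Option 3: the exported chunks are written over the bad ones."""
    fixed = bytearray(bad)
    for index, data in patches.items():
        fixed[index * chunk:index * chunk + len(data)] = data
    return bytes(fixed)

good_file = b"0123456789abcdef"
bad_file = b"0123XX6789abcdef"              # second chunk damaged
check = make_check_file(bad_file)           # sent to the good-file holder
patches = export_bad_chunks(good_file, check)
print(apply_patches(bad_file, patches) == good_file)
```

    Only the check file and the exported chunks cross the network, so the cost scales with the amount of damage rather than the size of the file.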

    --
    Dewey, what part of this looks like authorities should be involved?
  22. Re:What broken software were you using? by Tawnos · · Score: 4, Informative

    TCP has a 16-bit checksum. That means there's a 1 in 2^16 chance of an error getting by the checksum. Let's assume, for a moment, that the packets were sent 1 kB at a time (Ethernet max is greater than this, but it's an easy number). In a 1.5 GB file (assuming base 10 throughout for simplicity), this means a total of 1,500,000 packets must be transmitted. Using only the TCP checksum, if every one of those packets were damaged in transit, about 22 of them would be corrupt but allowed through. Even though there are additional checks at layer 2, the fact is that when dealing with large amounts of data, relying on TCP for data integrity is not enough.
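
    The arithmetic above is easy to reproduce. The sketch below keeps the comment's simplification that a damaged packet slips past a 16-bit checksum with probability 1 in 2^16, and adds a hypothetical `corrupt_fraction` parameter (not in the original) for the share of packets damaged in transit. The figure of 22 corresponds to the worst case where that share is 100%.

```python
def expected_undetected(file_bytes: int, packet_bytes: int,
                        corrupt_fraction: float, checksum_bits: int = 16) -> float:
    """Expected number of damaged packets that still pass the checksum."""
    packets = file_bytes / packet_bytes
    return packets * corrupt_fraction / 2 ** checksum_bits

# Worst case from the comment: every one of the 1,500,000 packets is damaged.
worst_case = expected_undetected(1_500_000_000, 1_000, corrupt_fraction=1.0)

# Assumed, milder case: one packet in a thousand is damaged.
one_in_a_thousand = expected_undetected(1_500_000_000, 1_000, corrupt_fraction=0.001)

print(round(worst_case, 1))         # 22.9
print(round(one_in_a_thousand, 4))  # 0.0229
```

    On a healthy link the expected count is therefore well below one per download, but it grows with both the file size and how flaky the path is.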

  23. Torrent Distribution Network - Results: Awesome by erexx23 · · Score: 4, Interesting

    I have been using Torrents for this very reason.

    I was being required to copy sometimes 10-20GB of Virtual Machine Image Files from Server to PC or PC to PC on up to 40 machines at one time.
    This was taking way too long and copies were not perfect.
    Restoration of VM images presented the same problem.
    Updating a VM meant redistribution of the entire file to all machines again.

    Using (Micro) Torrent and my own tracker changed all that.

    I came up with the following solution using all available resources.
    First I started by copying all images to workstations to a separate partition. (about 200GB of VM's.)
    Then I created my own internal Tracker and Web Page to host torrents.

    The results were:
    1. Extremely efficient use of all available network hard drive space.
    2. Utilizes every machine on the network to distribute the files.
    3. Works extremely well restoring or redistributing the VM's to any one machine or several machines at once. (The more the better)
    4. 100% accuracy in distribution.
    5. The ability to quickly modify any one image on any machine, recreate the torrent(hash) and then update that image across hundreds of machines very quickly.
    In other words, modifying a file only means that the machines only have to download the bits that changed not the whole image again.
    6. With Micro Torrent any machine can be used as the tracker.
    7. The Tracker is also the "master" file server, however any machine can be used to modify and upload a change.
    Just recreate and re-upload the new torrent replacing the old one. Remember that a torrent file serving network is Not a server centric file sharing system.

  24. The first rule by tux0r · · Score: 5, Informative

    The first rule of Usenet: don't talk about Usenet.

    --
    ( Redundancy is ) ^ n
    1. Re:The first rule by pikakilla · · Score: 5, Funny

      Second rule: don't mod up the person talking about the first rule.

  25. Re:What asshole tagged this '!news'? by Free+the+Cowards · · Score: 2, Insightful

    To be honest, when I saw this story I was shocked it had shown up. I thought that using BitTorrent to repair mostly-whole files was obvious for this crowd. It's like "Using Water to Nourish Your Plants" showing up on a horticulturist site. If you know anything about how BitTorrent works then you should immediately realize that it will fix up mostly-good files for you.

    The subsequent discussion has revealed that a large chunk of the slashdot population not only doesn't understand how BitTorrent works but doesn't even know about classic open source tools like rsync.

    --
    If you mod me Overrated, you are admitting that you have no penis.
  26. Re:What broken software were you using? by rdebath · · Score: 2, Interesting

    Transparent proxies also kill large downloads; especially when the browser is not IE. I hear "not IE" also included IE7!

  27. Chance of CRC clashes is much higher by tucuxi · · Score: 2, Informative

    First, as rdebath argues, you only get 16 bits of CRC on TCP headers.

    And furthermore, if you start calculating CRCs off random data, you have a better-than-even chance of a collision (two chunks of data with the same CRC) after only about 300 tries, on the order of sqrt(65536) = 256 (this is known as the "birthday paradox" in cryptography). Of course, to be really sure of a collision you need 65537 values, one more than the number of possible checksums; but you will reach a very high probability of a clash much sooner than intuition may tell you.

    See birthday attack for the math.
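
    As a rough check on the birthday argument, the Python sketch below computes the exact probability that at least two of n checksums collide, assuming each 16-bit checksum is drawn uniformly at random. The `collision_probability` helper and the sample sizes are illustrative choices, not part of the original comment.

```python
def collision_probability(samples: int, space: int = 2 ** 16) -> float:
    """Probability that at least two of `samples` uniform values collide."""
    p_all_unique = 1.0
    for i in range(samples):
        # The (i+1)-th value must avoid the i values already seen.
        p_all_unique *= (space - i) / space
    return 1.0 - p_all_unique

for n in (100, 256, 302, 1000):
    print(n, round(collision_probability(n), 3))
```

    Under that assumption the probability is somewhat below one half at 256 samples and passes one half at roughly 300. With 2^16 + 1 samples a collision is forced by the pigeonhole principle, and the function returns exactly 1.0.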