Slashdot Mirror


Use BitTorrent To Verify, Clean Up Files

jweatherley writes "I found a new (for me at least) use for BitTorrent. I had been trying to download beta 4 of the iPhone SDK for the last few days. First I downloaded the 1.5GB file from Apple's site. The download completed, but the disk image would not verify. I tried to install it anyway, but it fell over on the gcc4.2 package. Many things are cheap in India, but bandwidth is not one of them. I can't just download files > 1GB without worrying about reaching my monthly cap, and there are Doctor Who episodes to be watched. Fortunately we have uncapped hours in the night, so I downloaded it again. md5sum confirmed that the disk image differed from the previous one, but it still wouldn't verify, and fell over on gcc4.2 once more. Damn." That's not the end of the story, though — read on for a quick description of how BitTorrent saved the day in jweatherley's case.

jweatherley continues: "I wasn't having much success with Apple, so I headed off to the resurgent Demonoid. Sure enough they had a torrent of the SDK. I was going to set it up to download during the uncapped night hours, but then I had an idea. BitTorrent would be able to identify the bad chunks in the disk image I had downloaded from Apple, so I replaced the placeholder file that Azureus had created with a corrupt SDK disk image, and then reimported the torrent file. Sure enough it checked the file and declared it 99.7% complete. A few minutes later I had a valid disk image and installed the SDK. Verification and repair of corrupt files is a new use of BitTorrent for me; I thought I would share a useful way of repairing large, corrupt, but widely available, files."
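The recheck described above works because a torrent's metadata carries a SHA-1 hash for every fixed-size piece of the payload, so a client can tell exactly which pieces of an existing file are wrong. The following is a minimal sketch of that idea, not any client's actual code: the 256 KiB piece size, the function names, and the in-memory stand-in for the disk image are all assumptions for illustration. A real client reads the piece length and hash list from the .torrent file.

```python
import hashlib

PIECE_LENGTH = 256 * 1024  # assumed piece size; real torrents declare their own


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> list[bytes]:
    """Return the SHA-1 digest of each fixed-size piece of the data."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def find_bad_pieces(local: bytes, expected: list[bytes],
                    piece_length: int = PIECE_LENGTH) -> list[int]:
    """Return indexes of pieces that are corrupt or missing in the local copy."""
    actual = piece_hashes(local, piece_length)
    bad = [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
    # Pieces beyond the end of a truncated file also have to be downloaded.
    bad.extend(range(len(actual), len(expected)))
    return bad


if __name__ == "__main__":
    good = bytes(range(256)) * 4096 * 4      # 4 MiB stand-in for the real image
    expected = piece_hashes(good)            # what the torrent metadata would list
    corrupt = bytearray(good)
    corrupt[1_000_000] ^= 0xFF               # damage one byte, as a bad download might
    bad = find_bad_pieces(bytes(corrupt), expected)
    done = 100 * (len(expected) - len(bad)) / len(expected)
    print(f"bad pieces: {bad}, {done:.1f}% complete")
```

Only the pieces reported as bad need to be fetched from peers, which is why a mostly intact 1.5GB image can be repaired in minutes.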

10 of 212 comments

  1. Scheduling by FiestaFan · · Score: 4, Informative

    Many things are cheap in India, but bandwidth is not one of them. I can't just download files > 1GB without worrying about reaching my monthly cap, and there are Doctor Who episodes to be watched. Fortunately we have uncapped hours in the night

    I don't know about other BitTorrent clients, but uTorrent lets you set download speed caps by hour (like 0 during the day and unlimited at night).
  2. Been using bittorrent and rsync for this for years by DiSKiLLeR · · Score: 5, Informative

    I've used bittorrent for this purpose many times in years gone by.

    Especially with our slow links, or worse yet, on dialup (if I go enough years back) in Australia.

    Before bittorrent I would use rsync. That required me to download the large file to a server in the US on a fast connection, then rsync my copy to the server's copy to fix what is corrupt in my copy.

    It works beautifully. :)

    --
    You can tell how powerful someone is by the magnitude of the crime they can commit and be able to get away with.
  3. Re:Nice by gomiam · · Score: 5, Informative
    It should be quite simple. Let's say torrentA leaves you with a corrupt/incomplete filesetA (one or more files, it doesn't really matter). Let's suppose torrentB contains the files in filesetA, perhaps with different names in its own filesetB.

    Ok, you load torrentB in your favorite BitTorrent client, and start it up. It will automatically create 0-sized files with the names in filesetB (at least, all clients I know do that). Stop the transfer of torrentB, and substitute the 0-sized files in filesetB with the corresponding files in filesetA (may require some renaming). As you restart torrentB, your BitTorrent client will recheck the whole filesetB, keeping the valid parts in order to avoid downloading them. Voilà! You have migrated files from one torrent to another.

    Note: You should make sure that the files you are substituting in are the same files you want to download through torrentB or, at least, keep a copy around until you see that the restart check accepts most of their contents.
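One detail behind the note above: in a multi-file torrent, pieces are hashed over the files concatenated in the order the torrent lists them, so a single piece can straddle two files. If either neighbour is the wrong file, the shared piece fails the recheck. The sketch below maps a piece index to the file regions it covers; the function name and the example sizes are hypothetical, chosen only to illustrate the layout.

```python
def files_touched_by_piece(file_sizes, piece_length, piece_index):
    """Return (file_index, offset_in_file, byte_count) spans covered by one piece.

    file_sizes is the list of file lengths in torrent order; the files are
    treated as one continuous byte stream, as multi-file torrents do.
    """
    start = piece_index * piece_length
    end = start + piece_length
    spans = []
    file_start = 0
    for idx, size in enumerate(file_sizes):
        file_end = file_start + size
        lo, hi = max(start, file_start), min(end, file_end)
        if lo < hi:  # the piece overlaps this file
            spans.append((idx, lo - file_start, hi - lo))
        file_start = file_end
    return spans


if __name__ == "__main__":
    sizes = [100, 50, 200]  # three tiny example files, in bytes
    for piece in range(6):
        print(piece, files_touched_by_piece(sizes, 64, piece))
```

A piece that spans a file boundary is only kept if both files hold the right bytes, which is one reason to keep a copy of filesetA until the recheck has accepted most of it.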

  4. Re:What broken software were you using? by complete+loony · · Score: 4, Informative

    The TCP checksum offloading on nForce 4 motherboards (I have one) was notorious for corrupting TCP packets and allowing them to be received by the application. That's the most likely kind of failure that could reproduce this problem.

    --
    09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
  5. Re:What broken software were you using? by Skapare · · Score: 4, Informative

    Flipped bits happen, but they are detected by multiple checksums which make it astronomically unlikely for corrupt data to remain undetected.

    I actually saw this happen once ... the astronomically unlikely [1]. TCP accepted the corrupt packet. I'm sure it will never happen again. Fortunately, rsync caught it in the next run.
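For a sense of how corrupt data can slip past TCP, here is a sketch of the 16-bit ones' complement checksum in the style of RFC 1071. It is a simplified illustration: the real TCP checksum also covers a pseudo-header and the TCP header, which this sketch omits, and the sample byte strings are made up. Because the sum does not depend on word order, swapping two 16-bit words leaves the checksum unchanged.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over the data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"              # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:               # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    original = b"\x12\x34\xab\xcd\x00\x01"
    swapped = b"\xab\xcd\x12\x34\x00\x01"    # first two 16-bit words exchanged
    flipped = b"\x12\x35\xab\xcd\x00\x01"    # a single bit changed
    print(hex(internet_checksum(original)))
    print(internet_checksum(original) == internet_checksum(swapped))
    print(internet_checksum(original) == internet_checksum(flipped))
```

A single flipped bit is always caught, but reordered words and other errors that cancel out in the sum are not, which is why an end-to-end hash such as the one rsync or BitTorrent applies is still worth having.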

    One problem I ran into once with a certain Intel NIC was that a certain data pattern was always being corrupted. TCP always caught it and dropped the packet. There was no progress beyond that point because the hardware defect always corrupted that data pattern. It turned out there was a run of zeros followed by a certain data byte (I tried a different data byte and different run lengths, and those never got corrupted). What the NIC did was drop 4 bytes and put 4 bytes of garbage at the end. I suspect it was a clocking synchronization error. I got around the problem by adding the -z option to rsync (which I normally would not have done with an ISO of mostly compressed files). Another way would have been to do the rsync through ssh, either as a session agent (like rsync itself can do) or as a forwarded port (how I do it now for a lot of things).

    [1] ... approximately 1 in 2^31-1 chance that the TCP checksum will happen to match when the data is wrong (variance depending on what causes the error in the first place) ... which approaches astronomically unlikely. Take 1 Terabyte of random bits. Calculate the CRC-32 checksum for each 256 byte block. Sort all these checksums. You will find 2 (or more) data blocks with the same checksum (or a repeating pattern in your RNG). Why? Because CRC-32 has 2^32-1 possible states, and you have 2^32 random checksums.
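The collision argument in the footnote can be checked with the usual birthday approximation. The sketch below assumes the checksums behave like uniformly random 32-bit values, giving 2^32 possible values, and takes "1 Terabyte" as 2^40 bytes so that the block count comes out at 2^32. The function name is hypothetical.

```python
import math


def collision_probability(samples: int, space: int) -> float:
    """Approximate P(at least one collision) as 1 - exp(-n(n-1) / (2d))."""
    return -math.expm1(-samples * (samples - 1) / (2 * space))


if __name__ == "__main__":
    blocks = 2**40 // 256     # 1 TiB split into 256-byte blocks
    values = 2**32            # distinct values of a 32-bit checksum
    print(collision_probability(blocks, values))
    print(collision_probability(77_163, values))
```

Under these assumptions a collision is effectively certain at 2^32 blocks, and the odds already reach about one in two at roughly 77,000 blocks, which is about 20 MB of 256-byte blocks.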

    But whatever the cause, it's almost certain that software is to blame.

    Agreed. Since it is at least software's responsibility to detect and fix it, if the problem happens, the famous finger of fault points at the software.

    I'd bet $100 that if he did the same download over HTTPS, thus preventing software meddling of the packet contents, it would come out perfect.

    Your $100 is safe.

    --
    now we need to go OSS in diesel cars
  6. Re:!new by Anonymous Coward · · Score: 5, Informative

    I don't think this tactic is very common, though, as most people seem to have no fucking clue how BitTorrent works. I've seen torrents with gigantic multipart RARs, with an SFV of those. Let's see... so, my torrent software is already checksumming everything, and RAR has a builtin checksum too, or at least, acts like it does (it says "ok" or not) -- and on top of that, there's an SFV checksum (crappy CRC32), too. Never mind that RAR saves you at most a few megabytes (video is already compressed), which, based on the size of these files, you'll spend more time unpacking the RAR than you would downloading the extra couple megs. Or that, once you unpack and throw away the RAR, you can't seed that torrent from the working video. Or that multipart anything is retarded on BitTorrent, as the torrent is splitting it into 512k-4meg chunks anyway.
    People who aren't aware of the full situation often make this complaint. These multipart rar files are "scene releases".

    First of all, scene releases are _never_ compressed; it's always done with the -0 argument, this makes it basically equivalent to the unix split program. If a file is to be compressed, it is done with a zip archive, and the zip archive is placed inside the rar archive. This is because rar archives can be created/extracted easily with FOSS software, but cannot easily be de/compressed. This was more of an issue before Alexander Roshal released source code (note: not FOSS) to decompress rar archives.

    Second, people often have parts of, or complete, scene releases and they are unwilling to unrar them (often because it's an intermediary, like a shell account somewhere where law isn't a problem).

    Third, people follow "the scene" and try and download the exact releases that are chosen by the social customs of the scene (I am not going to detail those here), thus, "breaking up" (ie, altering) the original scene release is seen as rude.

    Fourth, the archive splitting is in precise sizes so that fitting the archives onto physical media works better; typically the archive size is some rough factor of 698, ~4698 and ~8500.

    Fifth, archives are split due to poor data integrity on some transfer protocols (though this is largely historical nowadays); redownloading a corrupted 14.3mb archive is easier than redownloading a 350mb file.

    Sixth, traffic at this scale is measured in terabytes, with some releases being tens, or sometimes hundreds of gigabytes in size. Thus, there become efficiency arguments for archive splitting; effective use of connections, limited efficiency of software (sftp scales remarkably poorly, though that is beginning to change - not that sftp is used everywhere), use of multiple coordinated machines and so on. This is an incomplete list of reasons; it is almost as though every time a new challenge is presented to the scene, splitting in some way helps to solve it.

    AC because I'm not stupid enough to expose my knowledge of this either to law enforcement, or to the scene (who might just hand me over for telling you this - it has been done). Suffice to say that this is more complex than you understand, and that even this level of incomplete explanation is rare.
  7. Re:What broken software were you using? by Tawnos · · Score: 4, Informative

    TCP has a 16-bit checksum. That means there's a 1 in 2^16 chance of an error getting by the checksum. Let's assume, for a moment, that the packets were sent 1 KB at a time (Ethernet's maximum is greater than this, but it's an easy number). In a 1.5 GB file (assuming base 10 throughout for simplicity), this means a total of 1,500,000 packets must be transmitted. Using only the TCP checksum, and taking the worst case in which every packet is damaged in transit, about 22 of these packets would be corrupt but allowed through. Even though there are additional checks at layer 2, the fact is that when dealing with large amounts of data, relying on TCP for data integrity is not enough.
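The arithmetic can be made explicit. The figure of 22 comes from 1,500,000 / 2^16, which counts the packets that would slip through if every one of them were damaged; with a lower corruption rate the expected count scales down in proportion. A small sketch, assuming a damaged packet passes a 16-bit checksum by chance with probability 2^-16 (the function name and the sample rates are illustrative only):

```python
def expected_undetected(packets: int, corruption_rate: float,
                        checksum_bits: int = 16) -> float:
    """Expected number of damaged packets that still pass the checksum."""
    return packets * corruption_rate / 2**checksum_bits


if __name__ == "__main__":
    packets = 1_500_000_000 // 1_000     # 1.5 GB sent in 1 KB packets
    for rate in (1.0, 0.01, 0.0001):
        count = expected_undetected(packets, rate)
        print(f"corruption rate {rate}: {count:.4f} undetected packets")
```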

  8. The first rule by tux0r · · Score: 5, Informative

    The first rule of Usenet: don't talk about Usenet.

    --
    ( Redundancy is ) ^ n
  9. Re:Nice by meza · · Score: 4, Informative

    I can't be bothered to do the math of how many possibilities that would be at this time of night, but it'll certainly go faster than continuing to download it from nobody.

    Not quite so. Because if you do the math (and if mine is correct at this late hour) you would see that it actually takes a pretty long time(tm). Imagine if only 1MB was missing. You would have to calculate the hash of every single possible 1MB file, so that is 2^8000000=~10^2408240 files. If you had a computer that could, quite unrealistically, calculate one hash each clock-cycle at 1GHz that would still take you 10^2408231 seconds. As a reference the universe according to Wikipedia is roughly 10^17 seconds old.

    Besides that there is the information theory problem too. If the hash is 128 bits long then every 2^128th file will have the same hash. This might seem unlikely if you only compare a few files (such as all the files ever created by man) but compared to the 2^8000000 hashes we were going to calculate it is actually quite substantial.
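These magnitudes are too large to compute directly, but they are easy to handle as logarithms. The sketch below works out the number of candidate files, the brute-force time, and how many candidates would share any one 128-bit hash. It assumes 1 MB means 10^6 bytes and one hash per cycle at 1 GHz, as in the comment; the function names are hypothetical.

```python
import math


def log10_candidates(missing_bytes: int) -> float:
    """log10 of the number of distinct byte strings of the given length."""
    return missing_bytes * 8 * math.log10(2)


def log10_seconds(missing_bytes: int, hashes_per_second: float = 1e9) -> float:
    """log10 of the seconds needed to hash every candidate once."""
    return log10_candidates(missing_bytes) - math.log10(hashes_per_second)


def log2_matching_candidates(missing_bytes: int, hash_bits: int = 128) -> int:
    """log2 of how many candidates are expected to share one hash value."""
    return missing_bytes * 8 - hash_bits


if __name__ == "__main__":
    size = 1_000_000
    print(f"candidates: about 10^{log10_candidates(size):,.0f}")
    print(f"seconds at 1 GHz: about 10^{log10_seconds(size):,.0f}")
    print(f"candidates per hash value: about 2^{log2_matching_candidates(size):,}")
```

The last figure is the information theory point: even a finished search would turn up an enormous number of wrong files with the right hash.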
  10. Re:!new by dk.r*nger · · Score: 4, Informative

    First of all, scene releases are _never_ compressed; it's always done with the -0 argument, [...] This was more of an issue before Alexander Roshal released source code (note: not FOSS) to decompress rar archives.

    So, historical, and pointless. And anyway, just an excuse if there's any point in using RAR anyway. Let's see..

    Second, people often have parts of, or complete, scene releases and they are unwilling to unrar them (often because it's an intermediary, like a shell account somewhere where law isn't a problem).

    So they should use BitTorrent. Run a seed on your [strike]compromised windows host[/strike] "shell account".

    Third, [....] social customs of the scene (I am not going to detail those here), thus, "breaking up" (ie, altering) the original scene release is seen as rude.

    Oh, I think we're at the core of the problem. Pale teenagers in their mothers' basements getting hurt feelings. I appreciate that someone will rip the Lost episodes in HD pretty much as they are being broadcast, and I actually look for some "group names" in the torrents I get - because they provide one file, not a RAR. In other words, provide what people want, and they will respect you for that. Make their life hard, and they will not care about your 1998 social customs. Like anything else in life.

    Fourth, [...]fitting the archives onto physical media works better

    Yawn. 1998 called, they want their infrastructure back. Hard drives are cheaper than dirt. Five years ago "the scene" at my college exchanged 250 GB hard drives.

    Fifth, archives are split due to poor data integrity on some transfer protocols

    SO USE BITTORRENT! It's easier and faster and better and more fun, but of course less 'leet than using [strike]compromised windows hosts[/strike] "shell accounts"

    Sixth, [...] Thus, there become efficiency arguments for archive splitting;[...]it is almost as though every time a new challenge is presented to the scene, splitting in some way helps to solve it.

    No, BitTorrent does ALL this for you. ALL of it.

    AC because I'm not stupid enough to expose my knowledge of this either to law enforcement, or to the scene (who might just hand me over for telling you this - it has been done).

    Badass gangster!

    Suffice to say that this is more complex than you understand, and that even this level of incomplete explanation is rare.

    What? Moving files around on the internet is "more complex" than we understand? It's probably the simplest fucking thing there is. Let me put it very simply for you: 1) Multi-file RARs made sense back when people got their stuff from FTPs and newsgroups. 2) It's the past. It's pure nostalgia. Get over it. If you're not using your "scene" FTP servers as Torrent seeds instead, you're wasting your resources.