Use BitTorrent To Verify, Clean Up Files
jweatherley writes "I found a new (for me at least) use for BitTorrent. I had been trying to download beta 4 of the iPhone SDK for the last few days. First I downloaded the 1.5GB file from Apple's site. The download completed, but the disk image would not verify. I tried to install it anyway, but it fell over on the gcc4.2 package. Many things are cheap in India, but bandwidth is not one of them. I can't just download files > 1GB without worrying about reaching my monthly cap, and there are Doctor Who episodes to be watched. Fortunately we have uncapped hours in the night, so I downloaded it again. md5sum confirmed that the disk image differed from the previous one, but it still wouldn't verify, and fell over on gcc4.2 once more. Damn." That's not the end of the story, though — read on for a quick description of how BitTorrent saved the day in jweatherley's case.
jweatherley continues: "I wasn't having much success with Apple, so I headed off to the resurgent Demonoid. Sure enough they had a torrent of the SDK. I was going to set it up to download during the uncapped night hours, but then I had an idea. BitTorrent would be able to identify the bad chunks in the disk image I had downloaded from Apple, so I replaced the placeholder file that Azureus had created with a corrupt SDK disk image, and then reimported the torrent file. Sure enough it checked the file and declared it 99.7% complete. A few minutes later I had a valid disk image and installed the SDK. Verification and repair of corrupt files is a new use of BitTorrent for me; I thought I would share a useful way of repairing large, corrupt, but widely available, files."
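For anyone wondering how the client knows which chunks are bad: a torrent carries a SHA-1 hash for every fixed-size piece, and a recheck simply hashes the local file piece by piece and compares. Here is a rough sketch of that idea, not Azureus's actual code; the function names, the toy data and the 100-byte piece size are made up for illustration.

```python
import hashlib

def piece_hashes(data: bytes, piece_len: int) -> list[bytes]:
    """SHA-1 digest of each fixed-size piece (the last piece may be shorter)."""
    return [hashlib.sha1(data[i:i + piece_len]).digest()
            for i in range(0, len(data), piece_len)]

def bad_pieces(data: bytes, expected: list[bytes], piece_len: int) -> list[int]:
    """Indices of pieces whose hash does not match the expected hash list."""
    actual = piece_hashes(data, piece_len)
    return [i for i, want in enumerate(expected)
            if i >= len(actual) or actual[i] != want]

# Toy stand-in for the disk image: 1000 bytes split into 100-byte pieces.
good = (bytes(range(256)) * 4)[:1000]
expected = piece_hashes(good, 100)   # what the .torrent would carry

corrupt = bytearray(good)
corrupt[345] ^= 0xFF                 # flip one byte, which lands in piece 3

print(bad_pieces(bytes(corrupt), expected, 100))  # [3]
```

Only the pieces listed need to be fetched again, which is why a nearly intact download shows up as almost complete after the recheck.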
Awesome idea. I've done this in the past with stuff. If a corrupt version was on one tracker, I'd save the files, get a new torrent and import the old files. Saves a lot of bandwidth wasting.
If I happen to see a stuck torrent (many leechers, no seeds), sometimes I can find a good version of the file I already have - so I start the torrent, stop it, replace the single good file (sometimes you need more if the file is smaller than the part size), and upload a few Kb to finish the torrent. Then sit back and watch as everyone fills up.
Fiesta Online
I asked the same question. Wikipedia answered it.
One should be more concerned as to why your files are becoming corrupted.
I'd say it's a safe bet that the files from apple.com are in perfect condition.
Which means the download either became corrupted in transit to, or on arrival at, your machine.
Which raises the question: is your memory defective?
Run memtest86 to check your memory.
http://www.memtest86.com/
Check whether your hard drives support SMART and are reporting anything. A disk checker would also be a good idea.
The other idea that springs to mind is that you're behind some proxy with the above problems, although I doubt anyone would want to proxy a 1.5GB file.
Fact is, if files are being corrupted on your disk, it's just a matter of time before something more important is hit by corruption.
To avoid criticism; Say nothing, Do nothing, Be nothing.
I've used bittorrent for this purpose many times in years gone by.
:)
Especially with our slow links, or worse yet, on dialup (if I go enough years back) in Australia.
Before bittorrent I would use rsync. That required me to download the large file to a server in the US on a fast connection, then rsync my copy to the server's copy to fix what is corrupt in my copy.
It works beautifully.
You can tell how powerful someone is by the magnitude of the crime they can commit and be able to get away with.
Those who have never developed P2P software might never understand why they all need to use strong checksums to detect data corruption, and why bad blocks actually do appear in the wild; frequently.
You'd be shocked - SHOCKED - at how much data gets corrupted routinely - by errant antivirus software, flaky network equipment, plain ol' line noise that the checksums don't detect (which will happen much more often than you expect, see also birthday paradox), or misbehaving routers that think any occurrence of 0xC0A80102 (192.168.1.2) obviously must be an internal IP address and needs to be changed to your external one. Even if that's in the middle of a ZIP file. Oops.
Encryption actually aids this somewhat, as the same byte patterns don't get repeated, so if there's an errant IDS changing things for example, it tends not to fire the second time.
I've done this before for file repairs. Works a treat, but you sort of wish that torrent used a Merkle hash tree such as the modified THEX standard Tiger Tree Hash. SHA-1's so last century.
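To show what a hash tree buys you, here is a minimal sketch of a Merkle root calculation. Big assumptions: Python's hashlib has no Tiger, so SHA-256 stands in for it, and the 0x00/0x01 prefixes and the "promote the odd node unchanged" rule follow my reading of THEX rather than any reference implementation. The function name and leaf size are made up.

```python
import hashlib

def merkle_root(data: bytes, leaf_len: int = 1024) -> bytes:
    """Root of a binary hash tree over fixed-size blocks (SHA-256 stand-in)."""
    blocks = [data[i:i + leaf_len] for i in range(0, len(data), leaf_len)] or [b""]
    # Leaf hashes are prefixed with 0x00 so a leaf can never pass for an inner node.
    level = [hashlib.sha256(b"\x00" + block).digest() for block in blocks]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                # Inner nodes hash 0x01 plus the two child hashes.
                nxt.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
            else:
                # An unpaired node moves up one level unchanged.
                nxt.append(level[i])
        level = nxt
    return level[0]

original = bytes(5000)            # 5 leaves of zeros (the last one is short)
damaged = bytearray(original)
damaged[4321] = 1                 # single-byte corruption in the last leaf

print(merkle_root(original).hex())
print(merkle_root(original) != merkle_root(bytes(damaged)))  # True
```

The appeal over a flat list of piece hashes is that one small root hash is enough to verify any block, given the sibling hashes along its path.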
We have been doing this for ages for certain high-demand game files that we mirror. While offering torrents for some of our download mirrors is only mildly useful (as we're in Australia we're trying to keep bandwidth on-shore to cut down international traffic, and BT doesn't really help this), it is extremely helpful for the VAST number of users who appear to either have massively crazy Internet problems or are simply unable to drive an HTTP-based downloader and resume downloads.
When a large number of users are having problems downloading or resuming a particular file, I simply create a torrent for them and give them some vague instructions about how to resume it, and then generally I never hear from them again. They're happy because they don't have to download a 4GB game client again from scratch, they don't have to worry about resuming/corrupt downloads, and because it's a torrent it probably feels like they're getting something for free that they shouldn't be.
For even more fun, if you have two differently-corrupted copies of a file and a torrent to go with it, then you can have BitTorrent stitch them together into a valid file without involving any third parties.
I used Azureus's internal tracker ability and two computers on a local network with the torrent modified to track on one of the machines, and one corrupted copy of the file on each.
Obviously only works if they don't have corruption in common, but it also doesn't require the original torrent file tracker to work anymore.
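The same stitching can be sketched without a tracker at all, assuming you can get at the piece hashes from the torrent. This is only an illustration of the idea, not what Azureus does internally; `stitch`, the toy data and the piece size are invented for the example.

```python
import hashlib

def stitch(copy_a: bytes, copy_b: bytes, expected: list[bytes], piece_len: int) -> bytes:
    """Rebuild a file by taking each piece from whichever copy hashes correctly."""
    out = bytearray()
    for i, want in enumerate(expected):
        start = i * piece_len
        for candidate in (copy_a[start:start + piece_len],
                          copy_b[start:start + piece_len]):
            if hashlib.sha1(candidate).digest() == want:
                out += candidate
                break
        else:
            # Neither copy has a good version of this piece.
            raise ValueError(f"piece {i} is corrupt in both copies")
    return bytes(out)

PIECE = 100
original = bytes(i % 251 for i in range(1000))
expected = [hashlib.sha1(original[i:i + PIECE]).digest()
            for i in range(0, len(original), PIECE)]

a = bytearray(original); a[50] ^= 0xFF    # copy A is damaged in piece 0
b = bytearray(original); b[950] ^= 0xFF   # copy B is damaged in piece 9

print(stitch(bytes(a), bytes(b), expected, PIECE) == original)  # True
```

If both copies are damaged in the same piece, the function raises instead of guessing, which matches the "no corruption in common" caveat.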
The TCP checksum offloading on nForce 4 motherboards (I have one) was notorious for corrupting TCP packets and allowing them to be received by the application. That's the most likely kind of failure that would be able to reproduce this problem.
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
I actually saw this happen once ... the astronomically unlikely [1]. TCP accepted the corrupt packet. I'm sure it will never happen again. Fortunately, rsync caught it in the next run.
One problem I ran into once with a certain Intel NIC was that a certain data pattern was always being corrupted. TCP always caught it and dropped the packet. There was no progress beyond that point because the hardware defect always corrupted that data pattern. Turns out there was a run of zeros followed by a certain data byte (I tried a different data byte and different run lengths, and those never got corrupted). What the NIC did was drop 4 bytes, and put 4 bytes of garbage at the end. I suspect it was a clocking synchronization error. I got around the problem by adding the -z option to rsync (which I normally would not have done with an ISO of mostly compressed files). Another way would have been to do the rsync through ssh, either as a session agent (like rsync itself can do) or as a forwarded port (how I do it now for a lot of things).
[1] ... the TCP checksum is only 16 bits wide, so there is roughly a 1 in 2^16 chance that it will happen to match when the data is wrong (variance depending on what causes the error in the first place) ... which, combined with the link-layer checks underneath it, approaches astronomically unlikely. Take 1 Terabyte of random bits. Calculate the CRC-32 checksum for each 256 byte block. Sort all these checksums. You will almost certainly find 2 (or more) data blocks with the same checksum (or a repeating pattern in your RNG). Why? Because CRC-32 has only 2^32 possible values, and you have 2^32 random checksums, so the birthday paradox makes collisions all but inevitable.
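The collision claim for the 1 Terabyte example can be sanity-checked with the usual birthday-paradox approximation. This is a back-of-the-envelope sketch that assumes the checksums behave like uniformly random, independent 32-bit values; the function names are mine.

```python
import math

def collision_probability(n_items: int, n_values: int) -> float:
    """Approximate P(at least one collision) as 1 - exp(-n(n-1) / (2m))."""
    return -math.expm1(-n_items * (n_items - 1) / (2 * n_values))

def expected_colliding_pairs(n_items: int, n_values: int) -> float:
    """Expected number of pairs of items that share a checksum value."""
    return n_items * (n_items - 1) / (2 * n_values)

crc32_values = 2 ** 32
blocks = 2 ** 40 // 256        # 1 TiB cut into 256-byte blocks = 2^32 blocks

print(collision_probability(blocks, crc32_values))     # effectively 1.0
print(expected_colliding_pairs(blocks, crc32_values))  # on the order of 2^31 pairs
print(collision_probability(77_163, crc32_values))     # close to 0.5
```

So you do not need anywhere near a terabyte: under these assumptions, around 77,000 blocks already give even odds of two of them sharing a CRC-32.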
"But whatever the cause, it's almost certain that software is to blame."
Agreed. Since it is at least software's responsibility to detect and fix it, if the problem happens, the famous finger of fault points at the software.
"I'd bet $100 that if he did the same download over HTTPS, thus preventing software meddling of the packet contents, it would come out perfect."
Your $100 is safe.
now we need to go OSS in diesel cars
First of all, scene releases are _never_ compressed; it's always done with the -0 argument, which makes it basically equivalent to the unix split program. If a file is to be compressed, it is done with a zip archive, and the zip archive is placed inside the rar archive. This is because rar archives can be created/extracted easily with FOSS software, but cannot easily be de/compressed. This was more of an issue before Alexander Roshal released source code (note: not FOSS) to decompress rar archives.
Second, people often have parts of, or complete, scene releases and they are unwilling to unrar them (often because it's an intermediary, like a shell account somewhere where law isn't a problem).
Third, people follow "the scene" and try and download the exact releases that are chosen by the social customs of the scene (I am not going to detail those here), thus, "breaking up" (ie, altering) the original scene release is seen as rude.
Fourth, the archive splitting is in precise sizes so that fitting the archives onto physical media works better; typically the archive size is some rough factor of 698, ~4698 and ~8500.
Fifth, archives are split due to poor data integrity on some transfer protocols (though this is largely historical nowadays); redownloading a corrupted 14.3MB archive is easier than redownloading a 350MB file.
Sixth, traffic of this size is measured in terabytes, with some releases being tens, or sometimes hundreds, of gigabytes in size. Thus, there are efficiency arguments for archive splitting: effective use of connections, limited efficiency of software (sftp scales remarkably poorly, though that is beginning to change - not that sftp is used everywhere), use of multiple coordinated machines and so on. This is an incomplete list of reasons; it is almost as though every time a new challenge is presented to the scene, splitting in some way helps to solve it.
AC because I'm not stupid enough to expose my knowledge of this either to law enforcement, or to the scene (who might just hand me over for telling you this - it has been done). Suffice to say that this is more complex than you understand, and that even this level of incomplete explanation is rare.
TCP has a 16-bit checksum. That means there's roughly a 1 in 2^16 chance of a corrupted packet getting past the checksum. Let's assume, for a moment, that the packets were sent 1KB at a time (Ethernet's maximum is greater than this, but it's an easy number). In a 1.5GB file (assuming base 10 throughout for simplicity), this means a total of 1,500,000 packets must be transmitted. In the worst case, where every one of them is damaged in transit, about 22 of these packets (1,500,000 / 65,536) would be corrupt but allowed through using only the TCP checksum; with a realistic error rate the count is proportionally smaller, but not zero. Even though there are additional checks at layer 2, the fact is that when dealing with large amounts of data, relying on TCP for data integrity is not enough.
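Here is the arithmetic as a small sketch, plus a toy version of the 16-bit ones' complement checksum (RFC 1071 style) to show one kind of damage it cannot see. Assumptions: a damaged packet passes with probability 1 in 2^16, the damage rate is a number I picked, and the checksum here runs over a bare payload rather than a real TCP segment with its pseudo-header.

```python
def expected_undetected(corrupted_packets: int, checksum_bits: int = 16) -> float:
    """Expected number of damaged packets that still pass an n-bit checksum."""
    return corrupted_packets / 2 ** checksum_bits

def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                   # fold the carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

total_packets = 1_500_000                # 1.5GB at 1KB per packet, base 10
print(expected_undetected(total_packets))           # every packet damaged
print(expected_undetected(total_packets // 1000))   # 1 in 1000 damaged (assumed)

# Addition does not care about order, so swapped 16-bit words go unnoticed.
a = bytes.fromhex("1234abcd")
b = bytes.fromhex("abcd1234")
print(internet_checksum(a) == internet_checksum(b))  # True
```

The second print is the more realistic one: the expected count per download is small, but it adds up over many large transfers, which is why a strong hash on top is worth having.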
I have been using Torrents for this very reason.
I was being required to copy sometimes 10-20GB of Virtual Machine Image Files from Server to PC or PC to PC on up to 40 machines at one time.
This was taking way too long and copies were not perfect.
Restoration of VM images presented the same problem.
Updating a VM meant redistribution of the entire file to all machines again.
Using (Micro) Torrent and my own tracker changed all that.
I came up with the following solution using all available resources.
First I started by copying all images to a separate partition on the workstations. (About 200GB of VMs.)
Then I created my own internal Tracker and Web Page to host torrents.
The results were:
1. Extremely efficient use of all available network hard drive space.
2. Utilizes every machine on the network to distribute the files.
3. Works extremely well restoring or redistributing the VM's to any one machine or several machines at once. (The more the better)
4. 100% accuracy in distribution.
5. The ability to quickly modify any one image on any machine, recreate the torrent (hash) and then update that image across hundreds of machines very quickly.
In other words, modifying a file only means that the machines only have to download the bits that changed not the whole image again.
6. With Micro Torrent any machine can be used as the tracker.
7. The Tracker is also the "master" file server; however, any machine can be used to modify and upload a change.
Just recreate and re-upload the new torrent replacing the old one. Remember that a torrent file serving network is Not a server centric file sharing system.
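Point 5 in the list above comes down to comparing the piece hashes of the old and new torrents: only pieces whose hash changed have to travel over the network. A minimal sketch of that comparison follows, assuming the image is edited in place; if bytes are inserted or removed, everything after that point shifts and every later piece changes. The piece size, names and toy image are all assumptions for illustration.

```python
import hashlib

PIECE_LEN = 256 * 1024   # assumed piece size; real torrents vary

def piece_hashes(data: bytes, piece_len: int = PIECE_LEN) -> list[bytes]:
    """SHA-1 digest of each fixed-size piece of the image."""
    return [hashlib.sha1(data[i:i + piece_len]).digest()
            for i in range(0, len(data), piece_len)]

def changed_pieces(old: list[bytes], new: list[bytes]) -> list[int]:
    """Piece indices in the new torrent that differ from (or extend) the old one."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]

old_image = bytes(16 * PIECE_LEN)                      # toy 4 MiB "VM image"
new_image = bytearray(old_image)
new_image[5 * PIECE_LEN + 10] = 0x42                   # small in-place edit
new_image[9 * PIECE_LEN:9 * PIECE_LEN + 4] = b"edit"   # another edit, same length

todo = changed_pieces(piece_hashes(old_image), piece_hashes(bytes(new_image)))
print(todo, f"{len(todo) / 16:.1%} of the image")      # [5, 9] 12.5% of the image
```

In practice the client does this for you when you point the new torrent at the old file and force a recheck.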
The first rule of Usenet: don't talk about Usenet.
( Redundancy is ) ^ n