RIAA Tracking Songs by MD5 Hashes
aSiTiC writes "Apparently RIAA has obtained some technical experts in their prosecution of file swappers. Currently they are tracking traded mp3 files from the Napster network by matching MD5 hashes. This seems quite interesting but I was under the assumption that identical hashes could be created with identical rips and id3v2 tagging. Now may be the time to update your illegal mp3 file MD5 hash sums."
The md5 hashing algorithm has been proven to contain flaws allowing two files to produce identical md5 sums.
The only way for two files to have the same MD5 hash is for them to both be encoded with the same encoder, from the same WAV file, with the same bitrate and all advanced options, and to have exactly the same ID3 information, the same filesize, and to be identical to the last bit.
Otherwise, the MD5 will be completely different, even for two otherwise identical songs where one has a spelling error in one field of the ID3 tag. I imagine that for any one song there are many, many different MD5 sums out there, although perhaps one or another good-quality version would exist on hundreds of different PCs...
Conversion Rate Optimisation French / English consultant
Hmm, isn't that how K-Sig, built into Kazaa Lite K++, works: by tracking MD5 hashes so ppl get exactly the file they want?
Changing MD5 hashes on songs to avoid the RIAA would also lessen the effectiveness of K-Sig. Trading hashes of known working files was one of the ways ppl on P2P avoided downloading those fake RIAA files.
Yes, because for them to know that you have the MP3s, you have to be sharing them, which is the illegal part.
-- Dr. Eldarion --
Any modification, to ANY bit of the file covered by the hash, will change the MD5 hash (that's how hashes work). If you assume the hash includes the ID3 tag info, then simply editing the info (putting something in the notes field, for example) would change the hash.
On the other hand, if I were the RIAA attempting to identify common files in this way, I might be inclined to exclude the ID3 tag from the MD5 computation since it is so easily modified.
Any changes to the actual content, though, will ripple into the MD5 computation.
Short answer: "normalizing" the file for volume, or even chopping off a few seconds of trailing silence with something like CoolEdit will certainly change the hash and make it distinct from whatever their baseline hash value is.
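For what it's worth, here is a rough sketch of what "excluding the ID3 tag from the MD5 computation" could look like. Nothing in the story says the RIAA actually does this. The function names and the `make_tagged` helper are made up for illustration. The only format facts assumed are that an ID3v2 tag starts with "ID3" plus a 10-byte header carrying a synchsafe size, and that an ID3v1 tag is a 128-byte block at the end starting with "TAG".

```python
import hashlib


def synchsafe_to_int(b: bytes) -> int:
    """Decode a 4-byte ID3v2 'synchsafe' size (7 useful bits per byte)."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def strip_id3(data: bytes) -> bytes:
    """Return the bytes without a leading ID3v2 tag or a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        start = 10 + synchsafe_to_int(data[6:10])
        if data[5] & 0x10:  # footer-present flag adds 10 more bytes
            start += 10
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_md5(data: bytes) -> str:
    """MD5 over the audio payload only (tags excluded)."""
    return hashlib.md5(strip_id3(data)).hexdigest()


def make_tagged(audio: bytes, comment: bytes) -> bytes:
    """Hypothetical helper: wrap audio in a minimal ID3v2 header and ID3v1 trailer."""
    body = comment.ljust(32, b"\0")  # stand-in for real ID3v2 frames
    size = len(body)
    header = b"ID3" + b"\x03\x00" + b"\x00" + bytes(
        [(size >> 21) & 0x7F, (size >> 14) & 0x7F, (size >> 7) & 0x7F, size & 0x7F]
    )
    trailer = b"TAG" + comment.ljust(125, b"\0")[:125]
    return header + body + audio + trailer
```

Two copies built with `make_tagged` from the same audio but different comments get different whole-file MD5s and the same `audio_md5`. Normalizing or trimming the audio itself would still change both.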
The only problem is that a lot of file sharing software uses the fact that 2 files (from different sources) have the same hash in order to swarm the download from multiple sources. If everybody goes around intentionally making their mp3s have different hashes, swarming basically won't work anymore.
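A toy illustration of why swarming depends on matching hashes, assuming a client that simply groups peers by the MD5 of the file they offer. Real clients exchange advertised hashes rather than whole files, and `group_sources_by_hash` is an invented name, so treat this as a sketch of the idea only.

```python
import hashlib
from collections import defaultdict


def group_sources_by_hash(offers):
    """offers: iterable of (peer_name, file_bytes). Returns {md5_hex: [peer_name, ...]}."""
    groups = defaultdict(list)
    for peer, data in offers:
        groups[hashlib.md5(data).hexdigest()].append(peer)
    return dict(groups)


def largest_swarm(offers):
    """Size of the biggest set of peers that could serve one identical file."""
    groups = group_sources_by_hash(offers)
    return max((len(peers) for peers in groups.values()), default=0)
```

If three peers hold the same bytes, `largest_swarm` is 3. If each of them tweaks a single byte to dodge hash matching, it drops to 1 and every download becomes single-source.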
No, I don't want a free iPod
(This wouldn't, though, be a defense for the central problem that she made all of these MP3s available for download by millions of anonymous strangers without the consent of the copyright holders. And assuming her identity is revealed and she is sued, if the "ambiguous" claim's alternative interpretation is correct, she'll be able to show the CDs to the Judge.)
You are not alone. This is not normal. None of this is normal.
Uh, it's not like the hash is in the file. It's computed from the file. You could write something in Winamp that randomly changed bits in your music, and that would change the hash, but it would also slowly corrupt your music until you had static.
If the hash includes the ID3 tags, you could change some unused field in there, but there would be a much smaller number of permutations available (although probably still enough to be useful).
Pretty much no rip is identical.
First step: the *.wav is ripped. Using libcdparanoia, which I personally prefer, I find slight variations in size depending on the machine and CD-ROM drive I rip on.
Second step: encoding on different machines, with different encoders, using different algorithms, using different levels of floating point precision, on different architectures etc... produces vastly different files.
Third step: sharing. Oftentimes an mp3 is downloaded 99.8% before the connection is broken. You keep the mp3 because mp3 is a sequential file format and you only lose a second or two of music. The rest of the file is intact.
Their md5 searching scheme could be circumvented quite easily by changing a comment in the ID3 tag, but they could get around that by cutting out the ID3 part of the file when they make their md5sum.
The downside to this is that if you are searching for music on something like Gnutella by the ***sum, the content would differ and you would not get as many results. Gnutella would not download from multiple sources because the files would not have the same signature.
Whatever the case, it is clear that some form of file obfuscation is now needed for safety online. Or we can wait for freenet to mature.
Imagine, the MD5 file as a solution, and the original file as the question. The MD5 file might contain the number '5', but you wouldn't know whether the question asked was 2+3 or 4+1. You do know however that the question wasn't 3+1 or 2+2 though.
If you download the question, you can check that the solution matches the expected solution. If so, the download is good.
Note, this is a very simplified version, using a pretty poor analogy. I'm sure there's a website that explains this better.
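The analogy above can be put in a few lines, assuming Python's standard `hashlib`. `toy_sum` is the deliberately poor "hash" from the analogy and `verify_download` is an invented name for the check being described.

```python
import hashlib


def toy_sum(question):
    """The analogy's 'hash': many different questions give the same answer."""
    return sum(question)


def verify_download(data: bytes, expected_md5: str) -> bool:
    """True if the downloaded bytes hash to the MD5 that was published for them."""
    return hashlib.md5(data).hexdigest() == expected_md5.strip().lower()


# 2+3 and 4+1 both "hash" to 5, but 3+1 and 2+2 do not.
collide = toy_sum((2, 3)) == toy_sum((4, 1)) == 5
ruled_out = toy_sum((3, 1)) != 5 and toy_sum((2, 2)) != 5

published = hashlib.md5(b"the original file").hexdigest()
good = verify_download(b"the original file", published)
bad = verify_download(b"the original fil3", published)
```

The difference from the toy sum is that with MD5 nobody can practically work backwards from the "5" to a question, and stumbling on a second matching question by accident is astronomically unlikely.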
I do believe RIAA can afford 3.2gb harddisk.
You're right that it is possible for multiple files to have the same MD5 sum, but the chance of it happening is extremely small, for two reasons.
The first reason is that MD5 has 128 bits to describe the file, meaning that there is a 1 in 2^128 chance that any given random bitstream will have the same MD5 sum (Of course, MP3s aren't all that random in portions of the file format, but the basic argument still stands).
The second reason is the very process of verification. In order to verify a file, you must already have a checksum of the original file to compare it to, and you have a file which you think could be the same file, meaning file names and file sizes are already identical. If those files differ by as much as one bit, then they will produce different checksums. If you're willing to try to match a file named "ISO of Windows XP" with a file size of 650.1MB versus a file named "ISO of Mandrake" with a file size of 643.8MB then you're already sure that they're not the same file by the filesize alone.
In short, possible, but extremely unlikely.
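To put numbers on "extremely unlikely", here is a back-of-the-envelope sketch. It assumes MD5 outputs behave like uniformly random 128-bit values and that nobody is deliberately constructing collisions. The function names are invented, and the second function is the usual birthday-style upper bound, not an exact probability.

```python
from fractions import Fraction

MD5_BITS = 128
SPACE = 2 ** MD5_BITS  # number of possible MD5 values


def chance_of_matching_one(space: int = SPACE) -> Fraction:
    """Chance that one random file matches one specific hash."""
    return Fraction(1, space)


def any_collision_upper_bound(n_files: int, space: int = SPACE) -> Fraction:
    """Upper bound on the chance that any two of n_files share a hash: n(n-1)/2 pairs."""
    return Fraction(n_files * (n_files - 1), 2 * space)


one_match = float(chance_of_matching_one())
billion_files = float(any_collision_upper_bound(10 ** 9))
```

Under those assumptions, even a billion distinct files give a bound on the order of 10^-21 that any two of them collide by accident.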
http://news.bbc.co.uk/1/hi/entertainment/music/3187695.stm
:wq
> This proof of RIAA is as good as the SCO evidences of greek language or bsd firewall code against linux
Uh, actually this is irrefutable proof. It will miss a lot of songs, but it is virtually guaranteed to not give false positives. This is much more solid proof than SCO had.
To think a month or two ago when SCO was insisting on an NDA many on /. were clamoring for some MD5 sums instead...
Obviously the RIAA's technical experts know what they are doing... it's time to alter a few ID3 tags like the story suggested.
The unofficial
You mean 128 kbps with no ID3 tag? Gee, I don't have a lot of files like that, or anything.
you have to be sharing them, which is the illegal part
Actually that's not true. They only care about the sharing because it leads to what they really care about: people listening to music that they didn't pay for. If everyone who shared mp3s had bought every CD of the songs they downloaded, no one would care because they would have already paid to listen to those songs. The problem is that most people don't own all of the CDs for the songs they download, and the RIAA doesn't like it when you try to wriggle out of their money trap. If the actual sharing was the problem, the distribution itself, then we wouldn't have radio stations playing music either, because that also lets people listen to music they didn't pay for, but it's a bit different because you don't really get a choice of what you hear. But now if you go and start recording songs you hear on the radio, so you could listen to them whenever you wanted, you're getting into that grey area. Of course the RIAA doesn't really care about that because they know that radio quality is shit, so there won't be widespread radio recording anyway.
--
Promoting critical thinking since 1994.
Revealed: How RIAA tracks downloaders
(Music industry discloses some methods used)
If that's all you want to do, much better not to use Cooledit, which has to expand and recompress the file to MP3. Use something like MP3Trim which can chop off any given number of MP3 frames, or normalise the volume, by operating on the MP3 directly. Much much faster, and no expand/recompress quality loss.
I just did some consecutive rips of an audio track and compared the md5 checksums.
I did the same song three times. The first two times, all things were equal including all settings. The MD5 checksums were the same.
I swapped out my DVD/CD player for a different model. Reripped the track on the same computer with the same exact settings and the MD5 was different.
I am using Exact Audio Copy in secure mode and LAME for the encoding. The ID tags were retrieved the first time and the same tags were used for all three attempts (EAC remembers the disc).
I'm sure I could try many things like changing the read speed, comparing the wav files and not just the resulting mp3 etc.. but I do not have the time for more analysis.
Bad boys rape our young girls but Violet gives willingly.
With all this hash talk going on, I thought I'd mention that Musicbrainz uses some sort of similarity hash in identifying songs. It compares the hashes of the files you have to an existing user submitted database. If the match is good, then you can use the database tag info, which is pretty handy.
I've compared albums I've ripped myself to the database and gotten "100%" matches (along with some matches of a much lower percentage). That leads me to think that if the RIAA kept its own database like that, they could do a whole lot of comparison with similarity or quasi-unique (a la MD5) hashes. I'd also venture that, with enough work on the comparison system, they could make court-valid assertions. They can hire plenty of geeks to handle the statistics necessary to call something 'beyond a reasonable doubt' (for criminal proof).
No, you are demonstrably wrong. The RIAA cares about sharing because it means loss of control for them. The RIAA is all about controlling distribution channels, and sharing disintermediates their existence. Make no mistake, if they could come up with a way to sell you the same song twice, they would (ever try to get a cracked 3-year-old CD replaced? They won't do it; you gotta buy a new one even though you already "own" the music).
Now here is where it gets good - the downfall of mp3.com was exactly because of sharing. They put together a system where you could buy a CD online, have it shipped to you, but also immediately have it available online as an MP3 through a password protected account that only allowed a single simultaneous user. They also provided a method to "upload" your previously purchased CDs - you stuck your CD in your cd-rom drive and ran their program that verified that the CD had the same contents as the released one (so either you had a legit copy or a perfect rip&dupe, either way you *already* had the music) and then that disc was also made available in your private mp3.com account.
The RIAA freaked and sued and won. They won on the premise that mp3.com was making copies without permission (from the RIAA) and then sharing them. Never mind that the only people who had access were those who had proven they already owned the music to begin with. They won big too, something like $25M per RIAA member company. That used up a *lot* of VC and IPO cash.
When information is power, privacy is freedom.
This is pretty common at least with iTunes. Most of the people will not change the default settings, so each cd rip will be identical, all using the same id3 tags.
What, me worry?
There are issues of offset values (with CD audio it is difficult to hit an *exact* location on the disc), plus the way the reader deals with C1 and C2 error correction, as well as how different extraction software interfaces with the hardware.
It would almost be safe to say two mp3s with the same MD5 are one file copied twice (as opposed to two individually created mp3s), but that doesn't mean they are illegal...
>> "It is also possible that, as someone else suggested, the magical mp3 fairy left those files behind on her hard drive. In fact, I would propose that the mp3 fairy theory is even more likely."
For loose definitions of "fairy", yes. eg child, friend, etc
>> "The only way that the MD5 hashes could be identical is if the two files are absolutely identical in every single bit."
Try the following: Install some CD ripping/encoding software. Leave it at the defaults. Use CDDB to generate the ID3 tags. Unless something gets corrupted, that *will* produce an identical file, down to the last bit.
When someone might yell at me, it has to be OpenBSD.
I believe what they are referring to is a system that takes a sample of a song (let's say 30 seconds) and generates a 'hash' based on that... The thing about this system is that it is a loose hash, meaning that changing one bit does NOT necessarily change the hash. It is a sonic fingerprint (Not in the digital watermark sense), so that in theory if you had a direct CD-ripped wave, and an analog rip from a cassette as a wave (for instance), you could match the two files, even though they are FAR from bit-for-bit exact.
This is what they mean when they say hash. NOT md5. Obviously MD5 could not track an mp3, since changing even one character in the ID3 tag would change the whole hash.
So they probably have an automated downloader that then generates a fingerprint from the downloaded file and compares it to a db of fingerprints to determine if the song is copyrighted. I'd bet that's all.
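To make the "loose hash" idea concrete, here is a deliberately crude toy. It is NOT the algorithm MusicBrainz or anyone else uses. It only assumes the general idea from the comment above: derive bits from coarse properties of the sound (here, whether loudness rises from one frame to the next) so that tiny sample-level differences do not change the result. All names are invented.

```python
import hashlib
import math
import struct


def toy_fingerprint(samples, frame=100):
    """One bit per frame boundary: 1 if frame energy rose versus the previous frame."""
    energies = [
        sum(s * s for s in samples[i:i + frame])
        for i in range(0, len(samples) - frame + 1, frame)
    ]
    return [1 if later > earlier else 0 for earlier, later in zip(energies, energies[1:])]


def hamming(a, b):
    """Number of positions where two equal-length fingerprints disagree."""
    return sum(x != y for x, y in zip(a, b))


def pcm_md5(samples):
    """Exact hash of the samples packed as 16-bit little-endian PCM."""
    return hashlib.md5(struct.pack(f"<{len(samples)}h", *samples)).hexdigest()


def make_tone(n=800):
    """Hypothetical test signal: a tone whose loudness steps up every 100 samples."""
    return [int(1000 * (1 + (i // 100) % 4) * math.sin(0.3 * i)) for i in range(n)]


clean = make_tone()
# Nudge every sample by one step, standing in for a slightly different rip.
noisy = [s + (1 if i % 2 else -1) for i, s in enumerate(clean)]
```

Here `pcm_md5(clean)` and `pcm_md5(noisy)` differ, while `toy_fingerprint` gives the same bits for both. That is the property a sonic fingerprint needs and an MD5 deliberately lacks.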
Well let me point you to the most likely problem:
The "offset".
If you use EAC you will see there is a tab where you can correct your drive's offset value.
Now if you do that (or at least 'sync' them) you should get the same result on both drives if the disc is good enough. (Of course all your other settings should be set properly too.) (If your disc is bad, EAC can correct those errors by re-reading a dozen times and then using the most often occurring result, but if your disc is a little too bad in a specific part, EAC won't be able to return the same result on each read.)
I know this because I have ripped discs on *three* different CD-ROM drives: one old 2x HP burner, one el cheapo 36x drive and a Toshiba laptop drive (also a burner).
Granted I compared wave files, but I guess that if you feed the same wave file to the same encoder with the same settings you should get the exact same result.
note:
Offset: When your CD-ROM drive reads a position on the disc in audio mode it often misses slightly, i.e. say you tell it to read position 0, then it will read position 4. Normally this doesn't matter since offsets are measured in milliseconds, so you won't hear a difference, but for bit-perfect rips it does matter.
You correct it by finding out what offset your particular CD drive has (each model has its own characteristic offset; few drives of the same brand and model have different offsets).
What I mean by 'syncing' is not correcting the offset but making it the same between drives.
For example, burn an offset CD in EAC (use a CD-RW if you must). This disc will carry the write offset of your CD writer.
Now 'correct' the offset in all your drives (including your burner, because burners have a different offset when writing than when reading) with this disc.
It won't be perfect, since now all your drives have the same offset, namely the write offset of your CD burner.
BUT now the rips will be identical, since they will all have the same offset.
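A toy model of the offset business, under the simplifying assumption that a drive with offset k just starts reading k samples late. The functions and numbers are invented for illustration and have nothing to do with how EAC is actually implemented.

```python
import hashlib


def read_raw(disc: bytes, start: int, length: int, drive_offset: int) -> bytes:
    """Simulate a drive that lands drive_offset samples past the requested start."""
    actual = start + drive_offset
    return disc[actual:actual + length]


def read_corrected(disc: bytes, start: int, length: int,
                   drive_offset: int, correction: int) -> bytes:
    """Ask for an earlier position so that the drive's own offset is compensated."""
    return read_raw(disc, start - correction, length, drive_offset)


def md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


disc = bytes(i * 7 % 251 for i in range(2000))  # stand-in for audio samples
START, LENGTH = 500, 1000
OFFSET_A, OFFSET_B = 6, 30   # two hypothetical drives
REFERENCE = 12               # e.g. the burner's write offset used for 'syncing'

raw_a = read_raw(disc, START, LENGTH, OFFSET_A)
raw_b = read_raw(disc, START, LENGTH, OFFSET_B)

# Fully corrected: each drive compensates for its own offset.
exact_a = read_corrected(disc, START, LENGTH, OFFSET_A, OFFSET_A)
exact_b = read_corrected(disc, START, LENGTH, OFFSET_B, OFFSET_B)

# 'Synced': each drive is corrected relative to the same reference offset.
synced_a = read_corrected(disc, START, LENGTH, OFFSET_A, OFFSET_A - REFERENCE)
synced_b = read_corrected(disc, START, LENGTH, OFFSET_B, OFFSET_B - REFERENCE)
```

The raw rips hash differently. The fully corrected rips match each other and the true data. The 'synced' rips match each other too, but both are shifted by the reference offset, which is the "won't be perfect" caveat above.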
NOTE: I think the RIAA doesn't hash the ID3 tags, only the music.
That way the same mp3 with different ID3 tags will still be identified as being the same.
That's btw what Kazaa does, if I'm not mistaken.
-fp
Most insurance policies will only pay a token amount -- a dollar or two per recording -- for losses to a collection of CDs, tapes, records, books, and the like. This is done to discourage fraud, and makes sense for the majority of people who have a large number of recordings that they really don't listen to any more. But for the few listeners who make use of their entire CD library, it is most unfair.
Right, but I figured, maybe the bit differences might disappear in the encoding, some wacky things you can only determine empirically
I wouldn't expect two different WAV's that sound exactly the same to give the same mp3. But I wouldn't have bothered to test it either.
As I think about it, your theory is interesting. Since mp3 compression is based on the perception of audio, or getting rid of everything that you don't perceive, then there is some argument that two very similar WAV bit patterns that sound identical might actually be closer after encoding to mp3 than you might think. Of course an MD5 hash of the two mp3's is not a good indicator of this, as one single bit difference in two files radically alters the MD5 hash.
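The "one single bit difference radically alters the MD5 hash" point is easy to check with Python's standard `hashlib`. The helper names here are made up, and the input is just a stand-in for an mp3's bytes.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many of the digest bits differ between two hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


original = b"stand-in for the bytes of an mp3 file" * 100
altered = flip_bit(original, 12345)

digest_a = hashlib.md5(original).hexdigest()
digest_b = hashlib.md5(altered).hexdigest()
changed = differing_bits(digest_a, digest_b)  # typically around half of the 128 bits
```

So the MD5 says nothing about how close two files are. Two mp3s that differ in one bit look exactly as unrelated as two different songs.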
The price of freedom is eternal litigation.
If that were possible, it would destroy the value of an MD5 hash immediately and everyone would quit using it faster than you could blink.
The purpose of CRC hashes is entirely different. They are designed to detect a burst of bit errors in a stream of data, the type of error that is most likely to occur in a network transmission. They are not meant for fingerprinting files.
I doubt that anyone with any degree of sophistication in cryptology would attempt to use CRC and MD5 hashes interchangeably.
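A quick sketch of why a 32-bit CRC makes a poor file fingerprint, using Python's standard `zlib.crc32` and `hashlib`. With only 2^32 possible values, a plain birthday search over pseudo-random inputs should turn up two different messages with the same CRC after roughly a hundred thousand tries. The function name and the way the test messages are generated are arbitrary choices for the demo.

```python
import hashlib
import zlib


def find_crc32_collision(limit: int = 2_000_000):
    """Birthday search: return two different messages with equal CRC-32, or None."""
    seen = {}
    for i in range(limit):
        # Pseudo-random 16-byte message derived from the counter.
        msg = hashlib.md5(str(i).encode()).digest()
        crc = zlib.crc32(msg)
        if crc in seen:
            return seen[crc], msg
        seen[crc] = msg
    return None


pair = find_crc32_collision()
```

The two messages returned share a CRC-32 but have different MD5 sums. Messages that differ only within one short burst of bits would never collide under CRC-32, which is exactly the transmission-error case it was designed for.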