Slashdot Mirror


RIAA Tracking Songs by MD5 Hashes

aSiTiC writes "Apparently the RIAA has enlisted some technical experts in its prosecution of file swappers. Currently they are tracking traded mp3 files from the Napster network by matching MD5 hashes. This seems quite interesting but I was under the assumption that identical hashes could be created with identical rips and id3v2 tagging. Now may be the time to update the MD5 hash sums of your illegal mp3 files."

34 of 779 comments

  1. MD5 Cannot stand up in court. by Organized+Konfusion · · Score: 5, Informative

    The md5 hashing algorithm has been proven to contain flaws allowing two files to produce identical md5 sums.

    1. Re:MD5 Cannot stand up in court. by Libor+Vanek · · Score: 2, Informative

      ANY hash can produce the same result for two different files, since the amount of information in the hash is smaller than the amount of information in the files.

    2. Re:MD5 Cannot stand up in court. by Urkki · · Score: 5, Informative

      A bit of clarification is in order I think.

      First of all, it's very clear that two files can give the same MD5 checksum. After all, an MD5 sum is only 16 bytes (2^128 possible values). So if you have just 17-byte files (2^136 possible files), it's clear that on average every MD5 sum matches 256 of all possible files.
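
      The counting argument above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and not from any tool mentioned in the thread.

```python
# Pigeonhole arithmetic: how many 17-byte files share each MD5 value on average?
DIGEST_BITS = 128        # an MD5 sum is 16 bytes
FILE_BITS = 17 * 8       # a 17-byte file holds 136 bits

possible_digests = 2 ** DIGEST_BITS
possible_files = 2 ** FILE_BITS

# Average number of distinct 17-byte files per digest value.
files_per_digest = possible_files // possible_digests
print(files_per_digest)  # prints 256
```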

      It's just damn unlikely to get two files with the same MD5, and if you wanted to brute-force it, you would have to try on average about 2^64 different files before any two of them shared an MD5 sum (matching one specific, given file would take on the order of 2^128 tries). And this would take a long time (actually not that terribly long, a few years at most, and it parallelizes perfectly).

      The page you link to implies that it's possible to "easily" fabricate a file that produces a given checksum, so instead of months of processing time, only days or hours would be needed to get an MD5 hash collision.

      So all P2P users / software makers need to do to circumvent this is to agree on a specific MD5 sum, then patch every file so that it produces this same MD5 sum :)

      Of course the obvious solution for the RIAA would be to use a more secure hash algorithm, with more bits. An unbroken algorithm with enough bits can't be faked, as it would take longer than the age of the universe to brute-force it.

      Though the basic problem with this RIAA method remains. If you rip digitally with the same software from an identical CD, and there are no bit errors at any point, then you should end up with an identical file, and therefore an identical hash, no matter how secure the algorithm is...

  2. MD5 Hash by fruey · · Score: 5, Informative
    This seems quite interesting but I was under the assumption that identical hashes could be created with identical rips and id3v2 tagging.

    The only way for two files to have the same MD5 hash is for them to both be encoded with the same encoder, from the same WAV file, with the same bitrate and all advanced options, and to have exactly the same ID3 information, the same filesize, and to be identical to the last bit.

    Otherwise, the MD5 will be completely different, even for two otherwise identical songs where one has a spelling error in one field of the ID3 tag. I imagine that for any one song there are many, many different MD5 sums out there, although perhaps one or another good-quality version would exist on hundreds of different PCs...
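
    To see how a single spelling error in a tag changes the sum, here is a minimal Python sketch. The audio bytes and the 128-byte trailing tag are made-up stand-ins, not a real MP3.

```python
import hashlib

# Hypothetical stand-in for identical encoded audio in both files.
audio = b"\x00" * 1024

# Two 128-byte trailing tags that differ only by a transposed pair of letters.
tag_a = b"TAG" + b"Yesterday".ljust(125, b"\x00")
tag_b = b"TAG" + b"Yesterdya".ljust(125, b"\x00")

md5_a = hashlib.md5(audio + tag_a).hexdigest()
md5_b = hashlib.md5(audio + tag_b).hexdigest()

print(md5_a)
print(md5_b)
print("same file size:", len(audio + tag_a) == len(audio + tag_b))
print("same MD5:", md5_a == md5_b)
```

    The two files have the same size and the same audio, yet the sums share nothing recognizable.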

    --
    Conversion Rate Optimisation French / English consultant
    1. Re:MD5 Hash by 3terrabyte · · Score: 3, Informative
      Many people will produce a file by ripping straight from a CD, which, given the same CD, will result in an identical source file.

      No!! That's definitely not true. Making a perfect rip is something you have to WORK at, which not many rippers do. Especially years ago. Check out ChrisMyDen's Uber Network for a detailed guide on how to make the 'perfect mp3'.

      You need to use something like EAC's secure mode. It rips the CD twice and compares the results for exactness. Only then can you be assured your wav file has no errors.

      Even if you can convince people to use the best mp3 encoding techniques (LAME 3.92 or LAME 3.90.2 -aps), I have still seen people refuse to use EAC, instead enjoying cdex, audiograbber, or (gasp) jukebox due to 'ease of use'. These rippers DO NOT make perfect rips, and will almost always produce a different wav file each time because of the way they attempt error correction. Most people will not ditch their source either, even if there are errors. And everyone has a different scratch on their CDs.

      Almost everyone encodes at 128kbps

      This isn't true anymore either. Considering most of the lazy people out there download mp3s instead of making their own, many of the rippers today do care about quality, and will rip in VBR or at 192. Release groups (where I would imagine most of the new stuff originates nowadays) will rip at 192, 224, 256, or 320.

      --

      Why are there only 19 people folding@home for slashdot?

    2. Re:MD5 Hash by dbs_flac · · Score: 2, Informative

      As far as I know, you would also have to use the same mp3 encoder, as different encoders produce different results and therefore different files and md5sums. I'd also like to throw in FLAC, as it stores a fingerprint of the audio itself, so even if the tag changes, that hash doesn't.

  3. Md5 hashes are also used for.... by shione · · Score: 5, Informative

    Hmm, isn't that how K-Sig, built into Kazaa Lite K++, works: by tracking MD5 hashes so people get exactly the file they want?

    Changing MD5 hashes on songs to avoid the RIAA would also lessen the effectiveness of K-Sig. Trading hashes of known working files was one of the ways people on P2P avoided downloading those fake RIAA files.

  4. Re:What if... by DrEldarion · · Score: 4, Informative

    Yes, because for them to know that you have the MP3s, you have to be sharing them, which is the illegal part.

    -- Dr. Eldarion --

  5. Re:What happen if by l1gunman · · Score: 5, Informative

    Any modification, to ANY bit of the file covered by the hash, will change the MD5 hash (that's how hashes work). If you assume the hash includes the ID3 tag info, then simply editing the info (putting something in the notes field, for example) would change the hash.

    On the other hand, if I were the RIAA attempting to identify common files in this way, I might be inclined to exclude the ID3 tag from the MD5 computation since it is so easily modified.

    Any changes to the actual content, though, will ripple into the MD5 computation.

    Short answer: "normalizing" the file for volume, or even chopping off a few seconds of trailing silence with something like CoolEdit will certainly change the hash and make it distinct from whatever their baseline hash value is.
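
    The idea of leaving the tag out of the computation can be sketched in Python. This is an assumption about how such a scheme might work, not a description of what the RIAA actually does; `strip_id3` and `audio_md5` are hypothetical helper names. It relies only on the published tag layouts: an ID3v2 tag starts with a 10-byte header ("ID3", two version bytes, one flag byte, four size bytes carrying 7 bits each), and an ID3v1 tag is a fixed 128-byte trailer starting with "TAG".

```python
import hashlib

def strip_id3(data: bytes) -> bytes:
    """Return the bytes between a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)

    # Leading ID3v2 tag: 10-byte header followed by `size` bytes of frames.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:          # size is stored as 4 x 7-bit values
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:               # ID3v2.4 footer flag adds 10 more bytes
            start += 10

    # Trailing ID3v1 tag: exactly 128 bytes beginning with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]

def audio_md5(data: bytes) -> str:
    """MD5 of the file contents with the ID3 tags excluded."""
    return hashlib.md5(strip_id3(data)).hexdigest()
```

    With something like this, editing the notes field would no longer change the result, while normalizing or trimming the audio still would.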

  6. Easy by sprouty76 · · Score: 5, Informative
    Just take a random id3 field that you don't use for anything, and fill it with a random number. You can probably write a script in a few seconds. Bingo, different md5.
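
    A rough Python sketch of such a script, assuming the file ends with a plain ID3v1 tag (128 bytes, comment field at offset 97). The function name `randomize_comment` is made up for illustration, and it works on bytes in memory rather than touching any file.

```python
import hashlib
import secrets

TAG_SIZE = 128        # an ID3v1 tag is a fixed 128-byte trailer
COMMENT_OFFSET = 97   # "TAG"(3) + title(30) + artist(30) + album(30) + year(4)
COMMENT_LEN = 28      # leave the last 2 comment bytes alone (ID3v1.1 track number)

def randomize_comment(data: bytes) -> bytes:
    """Return a copy of data with a random value in the ID3v1 comment field."""
    if len(data) < TAG_SIZE or data[-TAG_SIZE:-TAG_SIZE + 3] != b"TAG":
        raise ValueError("no ID3v1 tag found")
    tag = bytearray(data[-TAG_SIZE:])
    nonce = secrets.token_hex(8).encode("ascii")   # 16 random hex characters
    tag[COMMENT_OFFSET:COMMENT_OFFSET + COMMENT_LEN] = nonce.ljust(COMMENT_LEN, b"\x00")
    return data[:-TAG_SIZE] + bytes(tag)

# Hypothetical file: some audio bytes followed by an empty ID3v1 tag.
original = b"\xff\xfb" * 512 + b"TAG" + bytes(125)
changed = randomize_comment(original)
print(hashlib.md5(original).hexdigest())
print(hashlib.md5(changed).hexdigest())
```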

    The only problem is that a lot of file sharing software uses the fact that 2 files (from different sources) have the same hash in order to swarm the download from multiple sources. If everybody goes around intentionally making their mp3s have different hashes, swarming basically won't work anymore.

    --

    No, I don't want a free iPod

  7. Re:gee? by squiggleslash · · Score: 4, Informative
    To put this in context, the RIAA was responding to the impression "Jane Doe" gave that the MP3s were rips of her own CDs:
    The disclosures were included in court papers filed against a Brooklyn woman fighting efforts to identify her for allegedly sharing nearly 1,000 songs over the Internet. The recording industry disputed her defense that songs on her family's computer were from compact discs she had legally purchased.
    Of course, the wording of the latter is ambiguous - it could mean nycfashiongirl meant she had downloaded MP3s of pieces of music that were also on CDs in her possession. A lot of amateur lawyers on Slashdot (ahem) claim this is fair use, and given it's non-commercial and wouldn't have an impact on the ability of the artist to make a sale, that may well be true.

    (This wouldn't, though, be a defense for the central problem that she made all of these MP3s available for download by millions of anonymous strangers without the consent of the copyright holders. And assuming her identity is revealed and she is sued, if the "ambiguous" claim's alternative interpretation is correct, she'll be able to show the CDs to the Judge.)

    --
    You are not alone. This is not normal. None of this is normal.
  8. Re:Time for a new WinAMP Plug-in by Gaijin42 · · Score: 2, Informative

    Uh, it's not like the hash is in the file. It's computed from the file. You could write something in winamp that randomly changed bits in your music, and that would change the hash, but it would also slowly corrupt your music until you had static.

    If the hash covers the ID3 tags, you could change some unused field in there, but there would be a much smaller number of permutations available (although probably still enough to be useful).

  9. MD5 sums and different encoders by Psyborgue · · Score: 5, Informative

    Pretty much no rip is identical.

    First step: the *.wav is ripped. Using libcdparanoia, which I personally prefer, I find slight variations in size depending on the machine and CD-ROM drive I rip them on.
    Second step: encoding on different machines, with different encoders, using different algorithms, using different levels of floating point precision, on different architectures etc. produces vastly different files.
    Third step: sharing. Oftentimes an mp3 is downloaded 99.8% before the connection is broken. You keep the mp3 because mp3 is a sequential file format and you only lose a second or two of music. The rest of the file is intact.

    Their md5 searching scheme could be circumvented quite easily by changing a comment in the id3 tag, but they could get around that by cutting out the id3 part of the file when they make their md5sum.
    The downside to this is that if you are searching for music on something like gnutella by the ***sum, the content would differ and you would not get as many results. Gnutella would not download from multiple sources because the files would not have the same signature.
    Whatever the case, it is clear that some form of file obfuscation is now needed for safety online. Or we can wait for freenet to mature.

  10. Re:Excuse my ignorance by tom+taylor · · Score: 2, Informative

    Imagine the MD5 file as the solution, and the original file as the question. The MD5 file might contain the number '5', but you wouldn't know whether the question asked was 2+3 or 4+1. You do know, however, that the question wasn't 3+1 or 2+2.

    If you download the question, you can check that the solution matches the expected solution. If so, the download is good.

    Note, this is a very simplified version, using a pretty poor analogy. I'm sure there's a website that explains this better.
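
    In less analogical terms, checking a download against a published sum looks roughly like the Python sketch below. The helper names and the sample data are made up for illustration.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Compute the MD5 sum of some bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

def download_is_good(downloaded: bytes, published_md5: str) -> bool:
    """True when the downloaded bytes produce the published sum."""
    return md5_hex(downloaded) == published_md5.strip().lower()

# Hypothetical example: the uploader publishes the sum of the original file.
original = b"pretend this is the file the uploader shared"
published = md5_hex(original)

print(download_is_good(original, published))        # intact copy
print(download_is_good(original[:-1], published))   # truncated copy
```

    The sum tells you whether your copy matches, but you cannot work backwards from the sum to the file.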

  11. Re:MD5-hashes by Anonymous Coward · · Score: 0, Informative

    I do believe the RIAA can afford a 3.2 GB hard disk.

  12. Re:Excuse my ignorance by jacksonyee · · Score: 2, Informative

    You're right in that it is possible to have the same MD5 sum for multiple files, but the chances of it happening are extremely small, for two reasons.

    The first reason is that MD5 has 128 bits to describe the file, meaning that there is a 1 in 2^128 chance that any given random bitstream will have the same MD5 sum (of course, MP3s aren't all that random in portions of the file format, but the basic argument still stands).

    The second reason is the very process of verification. In order to verify a file, you must already have a checksum of the original file to compare it to, and you have a file which you think could be the same file, meaning file names and file sizes are already identical. If those files differ by as much as one bit, then they will produce different checksums. If you tried to match a file named "ISO of Windows XP" with a file size of 650.1MB against a file named "ISO of Mandrake" with a file size of 643.8MB, you would already be sure that they're not the same file from the file size alone.

    In short, possible, but extremely unlikely.
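
    To put a number on "extremely unlikely", here is a small Python sketch using the standard birthday approximation. It assumes MD5 sums behave like uniformly random 128-bit values, which ignores deliberate attacks; the function name is illustrative.

```python
import math

MD5_BITS = 128

def collision_probability(num_files: int, bits: int = MD5_BITS) -> float:
    """Approximate chance that at least two of num_files random files share a sum."""
    pairs = num_files * (num_files - 1) / 2
    # 1 - exp(-pairs / 2^bits), computed with expm1 so tiny values are not lost.
    return -math.expm1(-pairs / 2 ** bits)

p_pair = collision_probability(2)            # one specific pair of files
p_billion = collision_probability(10 ** 9)   # any pair among a billion files
print(f"{p_pair:.3e}")
print(f"{p_billion:.3e}")
```

    Even across a billion distinct files, an accidental match stays far below anything you would ever observe.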

  13. Similar story on BBC by SuperChuck69 · · Score: 3, Informative
    --
    :wq
  14. Re:MD5-hashes by Gherald · · Score: 4, Informative

    > This proof of RIAA is as good as the SCO evidences of greek language or bsd firewall code against linux

    Uh, actually this is irrefutable proof. It will miss a lot of songs, but it is virtually guaranteed to not give false positives. This is much more solid proof than SCO had.

    To think a month or two ago when SCO was insisting on an NDA many on /. were clamoring for some MD5 sums instead...

    Obviously the RIAA's technical experts know what they are doing... it's time to alter a few ID3 tags like the story suggested.

  15. Re:gee? by Anonymous Coward · · Score: 1, Informative

    You mean 128 kbps with no ID3 tag? Gee, I don't have a lot of files like that, or anything.

  16. Re:What if... by IpalindromeI · · Score: 3, Informative

    you have to be sharing them, which is the illegal part

    Actually that's not true. They only care about the sharing because it leads to what they really care about: people listening to music that they didn't pay for. If everyone who shared mp3s had bought every CD of the songs they downloaded, no one would care because they would have already paid to listen to those songs. The problem is that most people don't own all of the CDs for the songs they download, and the RIAA doesn't like it when you try to wriggle out of their money trap. If the actual sharing was the problem, the distribution itself, then we wouldn't have radio stations playing music either, because that also lets people listen to music they didn't pay for, but it's a bit different because you don't really get a choice of what you hear. But now if you go and start recording songs you hear on the radio, so you could listen to them whenever you wanted, you're getting into that grey area. Of course the RIAA doesn't really care about that because they know that radio quality is shit, so there won't be widespread radio recording anyway.

    --
    Promoting critical thinking since 1994.
  17. How RIAA tracks downloaders by $exyNerdie · · Score: 2, Informative

    Revealed: How RIAA tracks downloaders


    (Music industry discloses some methods used)

  18. Re:What happen if by 1u3hr · · Score: 5, Informative
    Short answer: "normalizing" the file for volume, or even chopping off a few seconds of trailing silence with something like CoolEdit will certainly change the hash

    If that's all you want to do, it's much better not to use CoolEdit, which has to decode the file and recompress it to MP3. Use something like MP3Trim, which can chop off any given number of MP3 frames, or normalise the volume, by operating on the MP3 directly. Much, much faster, and no decode/recompress quality loss.

  19. Re:MD5-hashes by nolife · · Score: 5, Informative

    I just did some consecutive rips of an audio track and compared the md5 checksums.

    I did the same song three times. The first two times, all things were equal including all settings. The MD5 checksums were the same.

    I swapped out my DVD/CD player for a different model. Reripped the track on the same computer with the same exact settings and the MD5 was different.

    I am using Exact Audio Copy in secure mode and LAME for the encoding. The ID tags were retrieved the first time and the same tags were used for all three attempts (EAC remembers the disc).

    I'm sure I could try many things like changing the read speed, comparing the wav files and not just the resulting mp3 etc.. but I do not have the time for more analysis.

    --
    Bad boys rape our young girls but Violet gives willingly.
  20. Music Hashing with musicbrainz by ramk13 · · Score: 2, Informative

    With all this hash talk going on, I thought I'd mention that Musicbrainz uses some sort of similarity hash in identifying songs. It compares the hashes of the files you have to an existing user submitted database. If the match is good, then you can use the database tag info, which is pretty handy.

    I've compared albums I've ripped myself to the database and gotten "100%" matches (along with some matches of a much lower percentage). That leads me to think that if the RIAA kept its own database like that, they could do a whole lot of comparison with similarity or quasi-unique (a la MD5) hashes. I'd also venture that, with enough work on the comparison system, they could make court-valid assertions. They can hire plenty of geeks to handle the statistics necessary to call something 'beyond a reasonable doubt' (for criminal proof).

  21. Re:What if... by Jah-Wren+Ryel · · Score: 2, Informative

    No, you are demonstrably wrong. The RIAA cares about sharing because it means loss of control for them. The RIAA is all about controlling distribution channels, and sharing disintermediates their existence. Make no mistake, if they could come up with a way to sell you the same song twice, they would (ever try to get a cracked 3-year-old CD replaced? They won't do it, you gotta buy a new one even though you already "own" the music).

    Now here is where it gets good - the downfall of mp3.com was exactly because of sharing. They put together a system where you could buy a CD online, have it shipped to you, but also immediately have it available online as an MP3 through a password protected account that only allowed a single simultaneous user. They also provided a method to "upload" your previously purchased CDs - you stuck your CD in your cd-rom drive and ran their program that verified that the CD had the same contents as the released one (so either you had a legit copy or a perfect rip&dupe, either way you *already* had the music) and then that disc was also made available in your private mp3.com account.

    The RIAA freaked and sued and won. They won on the premise that mp3.com was making copies without permission (from the RIAA) and then sharing them. Never mind that the only people who had access were those who had proven they already owned the music to begin with. They won big too, something like $25M per RIAA member company. That used up a *lot* of VC and IPO cash.

    --
    When information is power, privacy is freedom.
  22. Re:gee? by gozar · · Score: 4, Informative

    This is pretty common at least with iTunes. Most of the people will not change the default settings, so each cd rip will be identical, all using the same id3 tags.

    --
    What, me worry?
  23. Re:MD5-hashes by henele · · Score: 3, Informative
    If you read places like CDFreaks you'll see that extracting CD Audio is a mix of science and voodoo.

    There are issues of offset values (as with CD audio it is difficult to hit an *exact* location on the disc), plus the way the reader deals with C1 and C2 error correction, as well as how different extraction software interfaces with the hardware.

    It would almost be safe to say two mp3s with the same MD5 are one file copied twice (as opposed to two individually created mp3s), but that doesn't mean they are illegal...

  24. Re:gee? by anthonyrcalgary · · Score: 2, Informative

    >> "It is also possible that, as someone else suggested, the magical mp3 fairy left those files behind on her hard drive. In fact, I would propose that the mp3 fairy theory is even more likely."

    For loose definitions of "fairy", yes. eg child, friend, etc

    >> "The only way that the MD5 hashes could be identical is if the two files are absolutely identical in every single bit."

    Try the following: Install some CD ripping/encoding software. Leave it at the defaults. Use CDDB to generate the ID3 tags. Unless something gets corrupted, that *will* produce an identical file, down to the last bit.

    --
    When someone might yell at me, it has to be OpenBSD.
  25. Nowhere in that article do they mention MD5 by JPelzer · · Score: 2, Informative

    I believe what they are referring to is a system that takes a sample of a song (let's say 30 seconds) and generates a 'hash' based on that... The thing about this system is that it is a loose hash, meaning that changing one bit does NOT necessarily change the hash. It is a sonic fingerprint (Not in the digital watermark sense), so that in theory if you had a direct CD-ripped wave, and an analog rip from a cassette as a wave (for instance), you could match the two files, even though they are FAR from bit-for-bit exact.

    This is what they mean when they say hash. NOT md5. Obviously MD5 could not track an mp3, since changing even one character in the ID3 tag would change the whole hash.

    So they probably have an automated downloader that then generates a fingerprint from the downloaded file and compares it to a db of fingerprints to determine if the song is copyrighted. I'd bet that's all.
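
    To illustrate what a "loose hash" means, here is a deliberately crude Python toy. It is NOT the algorithm of any real fingerprinting product (that is not described in the article); it is just an assumed scheme that records whether each block of samples is louder than the one before it, so small sample-level changes leave it untouched while MD5 changes completely.

```python
import hashlib
import struct

def coarse_fingerprint(samples, block_size=100):
    """Toy loose hash: one bit per block, set when it is louder than the previous block."""
    energies = [
        sum(s * s for s in samples[i:i + block_size])
        for i in range(0, len(samples) - block_size + 1, block_size)
    ]
    return [1 if later > earlier else 0
            for earlier, later in zip(energies, energies[1:])]

def pcm_md5(samples):
    """Exact hash of the same samples packed as 16-bit little-endian PCM."""
    return hashlib.md5(struct.pack(f"<{len(samples)}h", *samples)).hexdigest()

# Hypothetical test signal: 16 blocks of a square wave with varying loudness.
original = []
for k in range(16):
    amplitude = 1000 * ((k * 7) % 16 + 1)
    original += [amplitude, -amplitude] * 50

# A slightly different copy: every sample is off by one.
noisy = [s + 1 for s in original]

print(coarse_fingerprint(original) == coarse_fingerprint(noisy))
print(pcm_md5(original) == pcm_md5(noisy))
```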

  26. Re:MD5-hashes by Anonymous Coward · · Score: 2, Informative

    Well let me point you to the most likely problem:

    The "offset".

    If you use EAC you will see there is a tab where you can correct your drive's offset value.

    Now if you do that (or at least 'sync' them) you should get the same result on both drives if the disc is good enough. (Of course, all your other settings should be set properly too.) If your disc is bad, EAC can correct those errors by re-reading a dozen times and then using the most frequently occurring result, but if your disc is a little too bad in a specific part, EAC won't be able to return the same result on each read.

    I know this because I have ripped discs on *three* different CD-ROM drives: an old 2x HP burner, an el cheapo 36x drive, and a Toshiba laptop drive (also a burner).
    Granted, I compared wave files, but I guess that if you feed the same wave file to the same encoder with the same settings you should get the exact same result.

    note:
    Offset: When your CD-ROM drive reads a position on the disc in audio mode it often misses slightly, i.e. say you tell it to read position 0, then it will read position 4. Normally this doesn't matter, since offsets amount to mere milliseconds and you won't hear a difference, but for bit-perfect rips it does matter.
    You correct it by finding out what offset your particular CD drive has (every model number has a particular offset; few drives of the same brand and model have different offsets).

    What I mean by 'syncing' is not correcting the offset but making it the same between drives.
    For example, burn an offset CD in EAC (use a CD-RW if you must). This disc will carry the write offset of your CD writer.
    Now 'correct' the offset in all your drives (including your burner, because burners have a different offset when writing than when reading) with this disc.
    It won't be perfect, since now all your drives share the same offset, namely the write offset of your CD burner.
    BUT now the rips will be identical, since they will all have the same offset.
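
    A small Python simulation of why the offset matters for hashes. It is a simplification under stated assumptions: the "disc" is just a byte string, positions are byte indexes rather than audio samples, and `drive_read` / `corrected_read` are made-up names, not anything from EAC.

```python
import hashlib

def drive_read(disc: bytes, start: int, length: int, drive_offset: int) -> bytes:
    """Simulate a drive that actually starts reading drive_offset positions late."""
    actual = start + drive_offset
    return disc[actual:actual + length]

def corrected_read(disc: bytes, start: int, length: int, drive_offset: int) -> bytes:
    """Compensate for a known offset by asking for an earlier position."""
    return drive_read(disc, start - drive_offset, length, drive_offset)

disc = bytes(range(256)) * 4   # hypothetical stand-in for the audio on a CD

# Two drives with different offsets rip "the same" region.
raw_a = drive_read(disc, 100, 500, drive_offset=4)
raw_b = drive_read(disc, 100, 500, drive_offset=10)
fixed_a = corrected_read(disc, 100, 500, drive_offset=4)
fixed_b = corrected_read(disc, 100, 500, drive_offset=10)

print(hashlib.md5(raw_a).hexdigest() == hashlib.md5(raw_b).hexdigest())
print(hashlib.md5(fixed_a).hexdigest() == hashlib.md5(fixed_b).hexdigest())
```

    Uncorrected, the two rips contain nearly the same audio shifted by a few positions, which is enough to give unrelated sums.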

    NOTE: I think the RIAA doesn't hash the ID3 tags, only the music.
    That way the same mp3 with different ID3 tags will still be identified as being the same.
    That's, by the way, what Kazaa does, if I'm not mistaken.

  27. Re:gee? by Anonymous Coward · · Score: 2, Informative
    i don't know anything about your setup, so i can only speculate, but what you've just described is EXCEEDINGLY unlikely to occur in general. take a look at the cdparanoia FAQ on this subject for an explanation. on any of the three linux boxes i've used (one brand-new compaq and two older dells with yamaha and toshiba drives), i get different MD5 hashes from successive rips of the same track on the same drive. your drive must be extraordinarily consistent compared to the vast majority of drives out there if what you describe happens regularly. as many posters on this thread have pointed out, the "bit spread" in hashes such as MD5 is designed to be very, very large -- that is, if even one bit in the source file flips, about half (64?) of the bits in the hash will flip and the result will be totally different.
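
    the "about half the bits flip" claim is easy to try for yourself. a minimal python sketch, where the sample data and the helper names are made up for illustration:

```python
import hashlib

def md5_int(data: bytes) -> int:
    """MD5 digest of data as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def flipped_digest_bits(data: bytes, bit_index: int) -> int:
    """Flip one input bit and count how many of the 128 digest bits change."""
    mutated = bytearray(data)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    return bin(md5_int(data) ^ md5_int(bytes(mutated))).count("1")

sample = bytes(range(256))   # hypothetical stand-in for a chunk of a rip
counts = [flipped_digest_bits(sample, i) for i in range(256)]
average = sum(counts) / len(counts)
print(min(counts), max(counts), average)
```

    each single-bit change should move somewhere around 64 of the 128 bits, which is why two rips that differ by one sample have sums with nothing in common.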

    -fp

  28. Re:Lost in a Fire? by Anonymous Coward · · Score: 1, Informative

    Most insurance policies will only pay a token amount -- a dollar or two per recording -- for losses to a collection of CDs, tapes, records, books, and the like. This is done to discourage fraud, and makes sense for the majority of people who have a large number of recordings that they really don't listen to any more. But for the few listeners who make use of their entire CD library, it is most unfair.

  29. Re:gee? by Anonym0us+Cow+Herd · · Score: 2, Informative

    Right, but I figured, maybe the bit differences might disappear in the encoding, some wacky things you can only determine empirically

    I wouldn't expect two different WAV's that sound exactly the same to give the same mp3. But I wouldn't have bothered to test it either.

    As I think about it, your theory is interesting. Since mp3 compression is based on the perception of audio, or getting rid of everything that you don't perceive, then there is some argument that two very similar WAV bit patterns that sound identical might actually be closer after encoding to mp3 than you might think. Of course an MD5 hash of the two mp3's is not a good indicator of this, as one single bit difference in two files radically alters the MD5 hash.

    --
    The price of freedom is eternal litigation.
  30. Re:MD5 hash "posers" by eric76 · · Score: 3, Informative
    Wonder if there is a utility for generating files with random content, but with the same hashes as another file?

    Perhaps a reverse md5 hash generator which takes a hash and generates a file.

    If that were possible, it would destroy the value of an MD5 hash immediately and everyone would quit using it faster than you could blink.

    The purpose of CRC hashes is entirely different. They are designed to detect a burst of bit errors in a stream of data, the type of error that is most likely to occur in a network transmission. They are not meant for fingerprinting files.
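
    To make the difference concrete, here is a Python sketch that finds two different inputs with the same CRC-32 by plain trial and error, something that is hopeless for a 128-bit sum by the same method. The inputs are derived from a counter via MD5 purely to get varied test data; that choice and the function name are assumptions for illustration.

```python
import hashlib
import zlib

def find_crc32_collision(limit: int = 1_000_000):
    """Search for two different byte strings that share a CRC-32 value."""
    seen = {}
    for i in range(limit):
        # Pseudo-random 16-byte test input derived from the counter.
        data = hashlib.md5(str(i).encode("ascii")).digest()
        crc = zlib.crc32(data)
        if crc in seen and seen[crc] != data:
            return seen[crc], data
        seen[crc] = data
    return None

collision = find_crc32_collision()
if collision is not None:
    first, second = collision
    print(first.hex(), second.hex())
    print("same CRC-32:", zlib.crc32(first) == zlib.crc32(second))
    print("same MD5:", hashlib.md5(first).digest() == hashlib.md5(second).digest())
```

    With only 32 bits of output, a match typically turns up after tens of thousands of tries, which is fine for catching line noise and useless for identifying files.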

    I doubt that anyone with any degree of sophistication in cryptology would attempt to use CRC and MD5 hashes interchangeably.