"Fingerprinting" of Audio Files?

Posted by CmdrTaco on Sunday August 27, 2000 @10:52PM from the tech-I'd-love-to-see dept.

Pseudonymous Coward writes: "This could be interesting: 'Tuneprint is an audio fingerprinting algorithm. It takes the unique 'fingerprint' of a sound clip, which can then be compared to a fingerprint database to get more information about the clip, like title and artist, lyrics, URLs, related music, copyright status, or almost anything else. The fingerprint doesn't change even if the sound is compressed, converted to a different file format, broadcast over the radio, and so on.'"

10 of 127 comments

Re:Sceptical - Remember watermarks on images? by Webmonger · 2000-08-27 18:36 · Score: 3

FAQ:

Fingerprints in the abstract are fundamentally more secure: a properly constructed fingerprint can't be broken without scrambling the audio file, while a sufficiently smart and well-funded adversary can always break a watermark, given enough time.

If watermarks are steganography, fingerprints are more like hashes or CRCs. If you have a perfect fingerprint, the fingerprint being separate from the song, you'd have to make the song not sound like itself in order to stop it from being recognized.

Of course, we have yet to see how good Tuneprint is, but it sounds pretty cool. And it wouldn't be hard to build up a database with a bunch of CDs and CDDB.
This is for real, friends by gschmidt · 2000-08-27 20:37 · Score: 5

Hiya. My name is Geoff and Tuneprint is my baby which some excellent and astonishing friends at MIT are helping me deliver.

I'd already been up all night when the story was posted at 7am. I'm going to try to stumble my way through a few points, get some breakfast, and try to answer people's questions as soon as I can get to it.

First of all, this is not a hoax. Wow, hair triggers :) Yeah, I was sleep deprived whilst writing most of the website. Yeah, the barcode in the logo is '31337 24816'.. get it.. eleet powers of two. eleet two-to-the-n's. eleet two-n's. eleet tunes. yeah. well. you had to be there. and jamie's to blame for the 24816 pun :) Don't hold it against us that we're not suits.

The general idea is pretty simple. We take the input audio. We condition it (adjust it to a known sampling rate and volume). We pass it through the psychoacoustic model (it's about a notch more complicated than what you'd see in an mp3 encoder, which ain't saying much. This is all stuff that was mostly hashed out decades ago.) This model effectively strips the parts of the sound you can't hear -- the desired result being that even if the audio has been compressed or manipulated subaudibly, the result is still the same.

Okay, so the net result of all of this is a vector that covers a very small segment (fraction of a second) of audio. We stack several of these vectors (possibly separated in time by a bit) side-by-side to get a big vector. Then we do completely boring and standard and well-understood statistical and pattern-matching stuff on the vector to make it smaller and more palatable for the server -- think of it as lossy compression.

Then it goes off to the server. The server is about equal in complexity to a text search engine. (I say this fully realizing that I have only a vague impression how Google works. It's certainly a lot more complicated than the obvious hash-table-of-sorted-lists stuff.) It finds the database vector that's the best match in a fairly boring but efficient way. (No, it does not involve searching through all tracks one by one, no more than Altavista searches through all web pages one by one every time you want to find some porn.) Call the result a submatch.

Back at the client, the whole process is repeated a bunch more times, generating a stream of submatches ("Radiohead offset 0.. Radiohead offset 1024 or 16384.. Slashdot's Gr34test Hits 5262324.. Radiohead offset 3072..") from the input audio stream. Then, the client looks at the submatches and tries to figure out what the input audio was and where the song boundaries are (did somebody really stick in a sample from Slashdot's Gr34test Hits, or was that just an unlucky match?)
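
The last step above, turning a stream of submatches into a decision, can be sketched in a few lines. This is only an illustration and not Tuneprint's actual client: the function name `identify`, the hop of 1024 samples between query windows (guessed from the offsets in the example stream), and the `min_votes` cutoff are all assumptions. The idea is that a genuine match keeps advancing in step with the query, so "track offset minus query position" stays constant, while an unlucky match does not line up with anything.

```python
from collections import Counter

def identify(windows, hop=1024, min_votes=3):
    """Pick the (track, start_offset) that most submatches agree on.

    windows: one entry per query window, in order; each entry is a list of
             candidate (track, offset) submatches returned for that window.
    hop:     assumed distance in samples between query windows (hypothetical).
    Returns (track, start_offset, votes), or None if nothing is convincing.
    """
    votes = Counter()
    for i, candidates in enumerate(windows):
        for track, offset in candidates:
            # Windows cut from the same song share the same alignment value.
            votes[(track, offset - i * hop)] += 1
    if not votes:
        return None
    (track, start), count = votes.most_common(1)[0]
    return (track, start, count) if count >= min_votes else None

# The example stream from the post, with the ambiguous second window
# carrying both of its candidates.
stream = [
    [("Radiohead", 0)],
    [("Radiohead", 1024), ("Radiohead", 16384)],
    [("Slashdot's Gr34test Hits", 5262324)],
    [("Radiohead", 3072)],
]
print(identify(stream))  # -> ('Radiohead', 0, 3)
```

The stray "Slashdot's Gr34test Hits" submatch and the 16384 alternative each collect a single vote and are ignored. Finding song boundaries would need the same vote taken over a sliding range of windows, which this sketch leaves out.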

See? Not magic. It's a challenging problem, but not an impossible problem. The reason that this doesn't exist right now is not that generations of scientists have tried and failed, but rather that people didn't care too much until lately and nobody's gotten off their ass and done anything about it yet. I like big but approachable problems, which is one of the reasons I'm excited about this.

FOR ALL OF YOU WHO FELL ASLEEP THROUGH THAT: YOU CANNOT ADD AN INAUDIBLE TONE TO THE MUSIC AND BREAK TUNEPRINT. THE FINGERPRINT IS BASED ON THE LARGE-SCALE PSYCHOACOUSTIC FEATURES OF THE MUSIC. IF MP3 ENCODERS CAN DO IT, SO CAN WE. Maybe not perfectly, but enough to have a fighting chance. THAT'S THE WHOLE POINT HERE.

jen is telling me to go to breakfast but I want to say one more thing, which is that y'all should also pay attention to the second of our two goals as listed in the FAQ, which is to get this tech and access to a nice, well-maintained central database out into the hands of everybody, commercial and open source, major label and independent, so that people can go do lots of cool stuff with it. I don't want this to end up controlled by a single organization that permits its use only in ways that further its private agenda.

Hint: I know that there are sekrit batcave startups that are working on the same thing, because we're starting to bump into them.

Oh yeah. Also like I say in the FAQ, it's not done. No promises. I like the current algorithm; it reflects the wisdom of throwing several other stabs away in disgust. I like the very limited performance data we have. I like the mathematical theory. We haven't scaled it very far yet, though, and it may all come toppling down. In which case we'll pick up the pieces and try again. But I'm confident we'll pull off something cool, because, well, 70% of what we want to do isn't that hard. The other 30% is a bitch and will require cleverness, work, and chutzpah, but even the 70% is going to be a damn useful tool. And this project has started to catch the eyes of some pretty f*cking brilliant technical people, in my opinion, so I think we're all over that 30%.

breakfast now. more later :)

geoff

PS: if you've emailed me in the past few days, and I haven't gotten back to you, I'm sorry -- things are pretty hectic around here. I really hope to burn through the backlog this afternoon before I get to the slashdot stuff. thanks :)

Ever wonder if you get a nice warning email before you show up on slashdot? the answer would be 'no' :p
1. Re:This is for real, friends by Chris+Johnson · 2000-08-27 23:14 · Score: 4
  
  Good job- yes, I can see how this would work. You could get 'thrown' by certain sampled music (it's Rick James' "Super Freak"! It's MC Hammer "U Can't Touch This"! It's a floor wax! It's a dessert topping!) in certain circumstances, but on the whole, you've really got something- the key concept, to me, is that it's not about embedding computer codes in the music (yech), but about finding the irreducible information minimum in a snippet of audio.
  I think I can help explain- let me put it this way. I've got a tune (obLink: see URL link above) called "Rain Dragon". There's a point toward the beginning where a 'mutating' synthesiser tone enters with a sort of warpy noise, on a beat that kicks really hard with bass drum and a splash cymbal. The total impact is quite aggressive- the synth sort of bursts in, and does so in a way that defines the range of unusual sounds that patch can produce.
  Take that as an example sound snippet to work with. Now, let's say for the sake of argument that the impact of the splash and bassdrum and synth are all perfectly synchronised (splash and bassdrum are in fact sequenced and are perfectly synchronised to within MIDI spec, synth was a lucky hit that seemed to link up extra nicely). Call the phase of the splash's initial attack A, the phase of the bassdrum B, the phase of the attack of the synth C. These may all be in phase, adding up to a big transient. Some may be out of phase- for instance, the splash may come through unaltered but the syn attack and bassdrum attack may be going opposite directions and cancel each other out.
  This is a very large level feature of the waveform- to alter it you would have to do such violence to the waveform as to render it unlistenable. Nothing you can do is going to make that syn attack and bassdrum attack be in different phase- obliterate the bass and you have a wimpy thin version of the same musical event signature, listen to it on a transistor radio and you have mostly the overtones and some distortions on the same musical event signature, record the transistor radio and it's the same deal- the LARGE SCALE waveform shapes are going to have a recognisable pattern if the music itself is still recognisable at all. In the crudest possible form you'd have to physically edit out certain drum hits or notes to alter the recognition- the crudest possible form for this type of identification is, say, MIDI. If there's a particularly interesting drum fill in something you can sequence it painstakingly in MIDI (not quantising but accurately placing each drum event in time) and get an instantly recognisable 'copy' of the original recording despite obliterating even the very sounds themselves and falling back on nothing but timing alone...
  There's a great deal of pre-existing work in other fields, such as image tracking, that defines fingerprinting as 'imposing a subtle added signal onto the media and then reading it back'. That's a far cry from what you're doing- might I suggest 'bodyprinting' instead? ;) after all, what you're doing is much closer to plunking the 'body' of a music snippet down in sand and recording the large scale attributes. It doesn't much matter what the details are. If you mixed one tune with a different tune, the 'bodyprint' of the one would gradually fade (not be instantly obliterated!) by increasing loudness of the other, and at the halfway point you'd be getting a 'bodyprint' that registered about equally for BOTH tunes (!).
  Now that you have this concept so nicely worked out, what do you intend to do with it? Are you going to give to the record industry the ability to track down unauthorised music wherever it may present itself- most notably, to identify samples used in other songs and bring lawsuits over them?
  I was trying to think of other ways the RIAA could abuse this technology, but I drew a blank- because at this time it's not necessary to _prove_ a music copy is from a particular source, to bring suit. Nobody has argued that britney spears mp3s are NOT the same tune as the original CDs because it's stupidly obvious that they're effectively the same tune. Hence, this process simply adds a level of certainty to a process of identification that's already enough to stand up in court. Is there any likelihood of this level of authentication of a copy becoming necessary in practice?
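
The mixing claim in the comment above (one tune's 'bodyprint' fading gradually as the other gets louder, with both registering equally at the halfway point) can be checked on a toy case. This is a sketch under loud assumptions: two pure tones stand in for the two tunes, and cosine similarity of the raw waveforms stands in for whatever matching score a real fingerprint would use. The names `tone`, `mix` and `cosine` are made up for the illustration.

```python
import math

def tone(freq, n=1000):
    """A stand-in 'tune': freq whole cycles of a sine over n samples."""
    return [math.sin(2 * math.pi * freq * i / n) for i in range(n)]

def mix(x, y, t):
    """Crossfade: t=0 is all x, t=1 is all y."""
    return [(1 - t) * a + t * b for a, b in zip(x, y)]

def cosine(x, y):
    """Cosine similarity, used here as a crude match score."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

tune_a, tune_b = tone(5), tone(13)
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    m = mix(tune_a, tune_b, t)
    print(t, round(cosine(m, tune_a), 3), round(cosine(m, tune_b), 3))
```

Because the two tones are equally loud and uncorrelated, the score for the first falls smoothly from 1 to 0 while the score for the second rises, and the two meet at about 0.707 when t is 0.5. Real music is neither equally loud nor uncorrelated, so the crossover point would move, but the gradual fade is the same effect.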
Please, everybody, stop confusing fingerprints by Kickasso · 2000-08-27 18:36 · Score: 4

with watermarks. They are two distinct concepts.
A fingerprint is an inherent property of a file, much like your own fingerprints are inherent properties of your fingers. Both kinds of fingerprints are used to identify things. A cryptographic hash is a kind of fingerprint. If two files have the same hash they are likely to be identical.
A watermark is a piece of information artificially added to a file. They are akin to watermarks on dollar bills. There is one difference though. Digital watermarks are designed for difficulty of removal, while watermarks on money are designed for difficulty of reproduction. Watermarks are used to certify authenticity of things. A cryptographic signature is a kind of watermark. It can certify that I, not somebody else, signed some file.
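
One caveat to the hash analogy is worth making concrete: a cryptographic hash identifies exact bytes, while an audio fingerprint has to survive small changes. The sketch below is an illustration only. `coarse_print` is a made-up toy (it records whether energy rises or falls between consecutive blocks) and has nothing to do with Tuneprint's real algorithm; the sample values are invented 8-bit audio.

```python
import hashlib

def crypto_hash(samples):
    """Exact-content fingerprint: SHA-256 over the raw 8-bit samples."""
    return hashlib.sha256(bytes(samples)).hexdigest()

def coarse_print(samples, block=4):
    """Toy perceptual print: '1' where block energy rises, '0' where it falls.

    Samples are unsigned 8-bit, so 128 is treated as silence.
    """
    energy = [sum((s - 128) ** 2 for s in samples[i:i + block])
              for i in range(0, len(samples), block)]
    return "".join("1" if b > a else "0" for a, b in zip(energy, energy[1:]))

# Quiet block, loud block, quiet block, loud block.
original = [128, 130, 126, 129, 200, 60, 190, 70,
            128, 127, 129, 128, 220, 40, 210, 50]
nudged = list(original)
nudged[5] += 1  # an inaudibly small change to one sample

print(crypto_hash(original) == crypto_hash(nudged))  # -> False
print(coarse_print(original), coarse_print(nudged))  # -> 101 101
```

The one-sample nudge changes the SHA-256 digest completely, but the large-scale loud/quiet pattern, and so the toy print, is untouched.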
--
Copying Vinyl is NOOOO problem by FreeUser · 2000-08-27 20:54 · Score: 3
the sound would be so bad, we couldn't copy it. problem solved.

Ahem.

I have converted a number of my old vinyl records to CD and MP3 format. It is rather simple, actually:
- Connect the stereo via the LINE IN port of your audio card.
- Run software to capture line in to digital file (under Linux, typically .wav format)
- Play record.
- Use a program such as xwav to trim the file, removing extraneous crap (e.g. silent hissing) from the beginning and end of the captured file.
- Use sox to convert to CDR format to burn onto a blank CD, or an encoder such as LAME (for MP3) or oggenc (for Ogg Vorbis) to convert to a compressed format.
- Repeat for as many tracks as you like.
- If burning a CD, when done use a program such as xcdroast or gcombust to burn the music CD.
- Replace record in jacket and store in a cool, dry place
... and listen to the music as often as you like without damaging the master media.
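
The trimming step in the list above can also be scripted. The following is a minimal sketch using Python's standard `wave` module, not a replacement for a proper editor: it assumes a 16-bit PCM capture, the function names `trim_bounds` and `trim_wav` are invented here, and the threshold of 500 is an arbitrary guess that would need tuning, since surface hiss on a real record may well sit above it.

```python
import array
import sys
import wave

def trim_bounds(samples, channels, threshold=500):
    """Return (start, end) frame indices spanning the non-quiet audio.

    A frame counts as quiet when every channel is within +/- threshold.
    """
    frames = len(samples) // channels
    loud = [f for f in range(frames)
            if max(abs(s) for s in samples[f * channels:(f + 1) * channels])
            > threshold]
    return (loud[0], loud[-1] + 1) if loud else (0, 0)

def trim_wav(src, dst, threshold=500):
    """Copy src to dst with leading and trailing quiet frames removed."""
    with wave.open(src, "rb") as w:
        params = w.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        samples = array.array("h", w.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    ch = params.nchannels
    start, end = trim_bounds(samples, ch, threshold)
    kept = samples[start * ch:end * ch]
    if sys.byteorder == "big":
        kept.byteswap()
    with wave.open(dst, "wb") as out:
        out.setparams(params)  # frame count in the header is fixed up on close
        out.writeframes(kept.tobytes())
```

Usage would be something like `trim_wav("side_a.wav", "side_a_trimmed.wav")`, where both file names are placeholders.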
--
The Future of Human Evolution: Autonomy
WHAT superfluous header data? by Chris+Johnson · 2000-08-27 23:46 · Score: 3

What superfluous header data? You're looking with a microscope, and you ought to be looking with a fisheye lens. Covered in vaseline. In a snowstorm ;)
Inspect the whole file all you want- you might even see interesting wiggles in the waveform which are of course exactly the sort of thing this will pick up on. You can go in and invert chunks of those waveform wiggles, and that will render that little snippet unmatchable with the original tune- at the expense of making the audio go sputter sputter sputter. Pitch-shifting the whole tune up about 2 octaves would work too :) or timestretching it to about twice its normal length- maybe only 1 1/2 times its normal length. That would work if you like slow dancing ;) most effective? Well, you know how some mp3 files ripped off CDs go BZRRRP every now and then 'cause the CD player choked? The data formerly existing during the section where it goes BZRRRP is rendered TOTALLY UNMATCHABLE by this technique ;) therefore you can completely destroy the fingerprint by simply arranging for the rip to be 100% bzrrrrp. I think I can safely say that this would be a completely effective way of eradicating fingerprintability, at least until they start fingerprinting CD failure modes :)
It's invincible! by antifuchs · 2000-08-27 17:56 · Score: 4

"The fingerprint doesn't change even if the sound is compressed, converted to a different file format, broadcast over the radio, ..."

...sung at a karaoke event, covered, remixed, hummed by any being with vocal cords, played on a bagpipe, and so on.
--
this post was brought to you by Andreas Fuchs.

echo [Address] | sed s/[-a-z]//g | tr A-Z a-z
Re:Get a clue by interiot · 2000-08-27 20:23 · Score: 3
From their FAQ:
- The fingerprint shouldn't change even if the music is made louder or softer, re-equalized slightly, passed through a mp3 encoder, speeded up or slowed down a little, and so on. Anything that doesn't change the way the music sounds shouldn't change the fingerprint, and it should be impossible for even a smart, well-funded human being to make the fingerprint change without distorting the music.
(emphasis mine)
That sounds pretty reasonable and possible to me.
Not That Far-Fetched by tqbf · 2000-08-27 20:31 · Score: 3

As someone else here said, this is conceptually not much different than calculating a message digest. In fact, I'm sure that's exactly what they are doing (and hopefully with a standard digest function, like SHA or MD5). Obviously, the question is what data they are feeding to the digest function. They obviously aren't feeding raw audio data, because it varies heavily between different codecs and sources. So, clearly, they're doing some kind of analysis. The easiest thing to develop is an algorithm for summarizing raw audio data. This addresses the concerns about encoding MP3s, Vorbis, or whatever --- you simply operate on the decoded results of these files. The goal of summarizing should be to come up with a description of audio data that is the same for two identical-sounding files.
So the question then becomes how you "summarize" raw audio data so that 10 different sources/ decodings of the same piece of audio result in the same summary information.
One pretty obvious thing to do is to select frequencies, set a threshold value (relative to the average amplitudes in the audio data for the frequencies you are analyzing) for "peak" amplitude at those frequencies, and measure time deltas between peaks. You can synchronize different audio samples to a recognizable pattern of peaks to get time synch, and you can measure time in quarter-second chunks to be "fuzzy".
The raw data that you digest would then just be a series of peak-to-peak time deltas for each frequency, which should be consistent between recordings (even if you tack dummy data to the beginning and end of the file --- the latter problem being solved by only accounting for a fixed amount of time in each audio file). Think of it as summarizing/fingerprinting the audio data based on the images displayed in your MP3 player's spectrum analyzer.
I'm not sure if what I've described is practical; it's the first thing I came up with when I was presented with the same problem a while ago. But it's evidence, I hope, of an important fact:
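
The scheme described above (pick frequencies, threshold against the average amplitude, measure peak-to-peak time deltas in quarter-second units, digest the result) is simple enough to sketch. Everything specific here is an assumption made for illustration: the 8000 Hz sample rate, the 200-sample analysis frame, the 2x threshold factor, the choice of 440 and 880 Hz, SHA-1 as the digest, and all function names. The synthetic `burst_signal` clip exists only to exercise the code.

```python
import hashlib
import math

def band_energy(frame, freq, rate):
    """Magnitude of one frequency within a frame (single-bin DFT)."""
    re = sum(s * math.cos(2 * math.pi * freq * i / rate) for i, s in enumerate(frame))
    im = sum(s * math.sin(2 * math.pi * freq * i / rate) for i, s in enumerate(frame))
    return math.hypot(re, im)

def peak_deltas(samples, rate, freq, frame=200, factor=2.0, quantum=0.25):
    """Gaps between successive peaks at freq, in units of quantum seconds."""
    mags = [band_energy(samples[i:i + frame], freq, rate)
            for i in range(0, len(samples) - frame + 1, frame)]
    if not mags:
        return []
    threshold = factor * sum(mags) / len(mags)  # relative to average level
    # A peak starts where the magnitude first crosses above the threshold.
    onsets = [i * frame / rate for i, m in enumerate(mags)
              if m > threshold and (i == 0 or mags[i - 1] <= threshold)]
    return [round((b - a) / quantum) for a, b in zip(onsets, onsets[1:])]

def fingerprint(samples, rate, freqs=(440, 880)):
    """Digest of the per-frequency delta lists."""
    summary = ";".join(
        f"{f}:" + ",".join(map(str, peak_deltas(samples, rate, f))) for f in freqs)
    return hashlib.sha1(summary.encode()).hexdigest()

def burst_signal(rate, seconds, bursts, gain=1.0):
    """Synthetic clip: bursts is a list of (freq, start_s), each 0.1 s long."""
    out = [0.0] * int(rate * seconds)
    for freq, start in bursts:
        first = int(start * rate)
        for i in range(first, first + rate // 10):
            out[i] += gain * math.sin(2 * math.pi * freq * i / rate)
    return out

RATE = 8000
BURSTS = [(440, 0.5), (440, 1.5), (440, 2.25), (880, 0.25), (880, 1.25)]
loud = burst_signal(RATE, 3, BURSTS)
quiet = burst_signal(RATE, 3, BURSTS, gain=0.5)
padded = [0.0] * 2400 + loud  # 0.3 s of dummy data tacked on the front

print(peak_deltas(loud, RATE, 440))                       # -> [4, 3]
print(fingerprint(loud, RATE) == fingerprint(quiet, RATE))   # -> True
print(fingerprint(loud, RATE) == fingerprint(padded, RATE))  # -> True
```

The clean synthetic clip keeps its digest through a volume change and front padding because the threshold is relative and only deltas are recorded. The weak point is the digest itself: with real audio a single peak landing on the other side of a quarter-second boundary changes the whole hash, so an exact-match digest is brittle where a nearest-match comparison of the delta lists would degrade gracefully.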
Anything your ears can do, a computer can do better.
Not talking about adding fingerprints/watermarks.. by johnhebert · 2000-08-27 18:34 · Score: 3

Though Tuneprint's efforts are still pretty much alpha at this point, the idea is to derive a database of fingerprints, or signatures, of music tracks using a hidden algorithm. The use of the term "fingerprint" is kinda misleading as to how it works, though I'm sure it is unintentional.

The problem of course, is that all pop music would have the same signature, since it pretty much sounds all the same anyway... :)

The signatures are not added as metadata to the songs, though I guess they could be. They are kept in a separate database that is near the analyzing portion of the solution where the results can be queried.

This is an interesting idea. I proposed something pretty similar to my co-workers a few months back when we were looking for a means of uniquely identifying recorded music, but I only received funny looks. Damn me and my laziness! :)

I think it may be pretty difficult to get this solution working well, considering that songs can contain samples/riffs from other songs, among many other factors. I think the minimum length of the analysis sample would have to be fairly long, relative to the size of the song, in order to get an accurate signature.

--
"Classic UFO's ... crafts for kids..." Interpretations from