Slashdot Mirror


Visual Analysis Of Mp3 Encoders

Chris Johnson writes: "I've just finished an interesting scientific analysis of several mp3 encoders and have my findings up on the Web. The process involves differencing a 'sonogram' image from an encoded test signal with the image of the original signal, and then producing response curves showing the disparity in direct signal volume, and over time. Umm . . . which is just to say this is probably the most rigorous analysis of any encoders anywhere on the web, and very geeky (in a good way). LAME carries the day, but BladeEnc shows that it has a completely distinctive sonic approach- and Fraunhofer proves unacceptable (in the version I tested) for audiophile use, though it's unbeatable at very low bit rates. See why." Truth in advertising -- this is a cool example of how visual information can convey more than you'd expect it to.

16 of 127 comments

  1. Re:Is not! by jmv · · Score: 3

    If you are a big fan of classical you will have an opinion on _which_ parts of the sonic information are expendable

    No, when a certain frequency component is discarded, it's not because the listener won't mind, it's because even if it's there, the listener cannot hear it. If you can't hear a sound, why encode it? Now, there are sometimes problems with classical music, but that's because it's often hard to predict exactly what you can and can't hear.

  2. Quaint, but flawed by John+Whitley · · Score: 5
    This sonogram analysis is quaint, but the author fails to grok the basics of psychoacoustic model based audio compression. The first rule is: you cannot measure the perceptual quality of the compressed audio via a raw distortion metric. Subtracting the original signal's sonogram from the compressed signal's sonogram is a distortion metric.

    That said, it is generally the case that "pre-echo is bad" and "over-ring is bad." Reducing these can be thought of as a good thing. Let's assume that for these encoders, pre-echo and over-ring are universally bad (I'll give an example where this isn't the case, below). Furthermore, this comparison actually says nothing about these encoders other than the pre-echo or over-ring. I.e. what happened to the sound that was the "same" on the sonogram? It is quite possible for an "encoder" to mangle the audio quality yet have a pristine sonogram by this test's standards.
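    A minimal sketch of that last point, in plain Python. It assumes the "sonogram" is just the per-frame DFT magnitude, which is a simplification and not necessarily what the article's tool computes. The function names and the frame-reversing "encoder" are made up for illustration. Playing every frame backwards turns each attack into a swell, yet the magnitude picture barely changes:

```python
import cmath
import math

def magnitude_sonogram(signal, frame_size=64):
    """Cut the signal into non-overlapping frames and return |DFT| for each frame."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n, x in enumerate(frame)))
            for k in range(frame_size // 2 + 1)
        ])
    return frames

def sonogram_difference(original, decoded, frame_size=64):
    """Raw distortion number: summed absolute difference of the two sonograms."""
    a = magnitude_sonogram(original, frame_size)
    b = magnitude_sonogram(decoded, frame_size)
    return sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))

def reverse_each_frame(signal, frame_size=64):
    """Hypothetical 'encoder' that plays every frame backwards."""
    out = []
    for start in range(0, len(signal), frame_size):
        out.extend(reversed(signal[start:start + frame_size]))
    return out

# Test signal: a sharp attack followed by a decay, repeated once per frame.
original = [math.exp(-(n % 64) / 8.0) for n in range(256)]
quieter = [0.9 * x for x in original]      # mild level change
mangled = reverse_each_frame(original)     # every attack becomes a swell

print(sonogram_difference(original, quieter) > 1.0)    # True
print(sonogram_difference(original, mangled) < 1e-9)   # True
```

    By this metric the slightly quieter copy scores worse than the time-reversed one, which is the opposite of what most listeners would say.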

    Just to throw a wrench in the works, more advanced encoders and/or psychoacoustic models can utilize what's called temporal masking. This is the ability of a higher-amplitude signal to mask (make inaudible) a lower-amplitude signal either before or after itself, as far as the human ear is concerned. Pre-echo is the phenomenon whereby a transient signal (i.e. a very 'sudden' attack, like a drum hit) is smeared in time. The audible effect can be most obnoxious. Yet encoders utilizing temporal masking will explicitly allow a certain amount of pre-echo through, as long as it is temporally masked. This leaves the encoder to spend those bits on other parts of the signal that would be more seriously degraded as far as our ear is concerned. In short, a sufficiently savvy encoder could exhibit more pre-echo than another worse-sounding encoder, especially if it uses temporal masking.
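    As a rough sketch of the decision such an encoder makes: the function below is hypothetical, and the window lengths and level margin are placeholder numbers chosen for illustration, not values from any real psychoacoustic model.

```python
def is_temporally_masked(artifact_db, masker_db, offset_ms,
                         pre_window_ms=5.0, post_window_ms=100.0,
                         margin_db=20.0):
    """Guess whether a quiet artifact is hidden by a nearby loud transient.

    offset_ms < 0 means the artifact comes before the transient (pre-echo),
    offset_ms > 0 means it comes after (over-ring).
    All default thresholds are illustrative assumptions.
    """
    # The artifact must be well below the masker's level to be hidden at all.
    if artifact_db > masker_db - margin_db:
        return False
    if offset_ms < 0:
        return -offset_ms <= pre_window_ms
    return offset_ms <= post_window_ms

print(is_temporally_masked(40, 80, -3))    # True: quiet pre-echo just before the hit
print(is_temporally_masked(40, 80, -30))   # False: too early to be hidden
print(is_temporally_masked(70, 80, -3))    # False: too loud to be hidden
```

    An encoder using a test like this would stop spending bits on the first case, even though a sonogram difference would still show it as pre-echo.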

    Quantitative analysis for perceptual audio coding is not easy; this has been a grail for researchers in the field for years. I strongly suggest that interested parties dig into various IEEE and AES (Audio Engineering Society) journal papers on the subject, as well as various books, etc.

  3. You can't make an objective test of mpeg encoders by geirt · · Score: 5

    The basic idea of mpeg is that the encoder removes the parts of the music which you (probably) can't hear. The encoder splits the sound into pieces, and rates each piece after how important it is for the total sound image. Then it starts with the most important sound and encodes that, and continuing with the less important parts until the available bit rate is reached (e.g 128kbit/s). The rest of the sound data is discarded.
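    The procedure described above can be sketched as a greedy loop. This is a toy, not how any real MP3 encoder is structured. The component names, importance scores, and bit costs are invented for the example.

```python
def encode_frame(components, bit_budget):
    """Keep the most important components until the bit budget runs out.

    components: list of (name, importance, cost_in_bits) tuples.
    Returns the names that were kept and the number of bits used.
    """
    kept, used = [], 0
    ranked = sorted(components, key=lambda c: c[1], reverse=True)
    for name, importance, cost in ranked:
        if used + cost > bit_budget:
            break  # budget reached: everything less important is discarded
        kept.append(name)
        used += cost
    return kept, used

# Hypothetical pieces of one frame, rated by a (made-up) psycho acoustic model.
frame = [
    ("masked hiss", 0.05, 20),
    ("bass", 0.9, 40),
    ("cymbal shimmer", 0.3, 30),
    ("vocals", 0.8, 40),
]
print(encode_frame(frame, bit_budget=100))  # (['bass', 'vocals'], 80)
```

    All of the interesting work hides in the importance column, which is where the encoders differ.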

    The tricky part is the calculation of the "importantness" of each sound, and that is what differentiates the encoders. This calculation is done with an algorithm called "a psycho acoustic model".

    To measure the quality of an mpeg encoder automatically, you need an algorithm which calculates the quality of the encoded signal. By knowing this algorithm it is trivial to create an encoder which will score maximum on this quality measurement, since the quality measurement algo is basically the same as the psychoacoustic model.

    This test is "snake oil". A real test of an mpeg encoder unfortunately involves listening to the music to evaluate the encoder's psycho acoustic model, not comparing two artificially created psycho acoustic models with each other.

    --

    RFC1925
  4. Re:What about... by Chris+Johnson · · Score: 4
    I'd be hugely interested in that. I consider it very relevant. I'm doing all this on a Mac, and have tried repeatedly to compile Vorbis in any sort of way- one of the Ogg people did this at MacHack and has not made binaries available. If he had, Vorbis would be represented at every bit rate level. I am simply not coder enough to deal with porting Vorbis, even a cheap hack, and I wish I was. I've begged for Vorbis/Mac repeatedly, and finally I had to go on without it, as there were decisions I needed to make on what mp3 encoder to use for my stuff, and the whole project was to answer for me what was most appropriate for 128K-range and what was best at arbitrarily high bit rates.

    You can add me to that list- and such a comparison (I naturally kept a logbook to be able to reproduce the process later) would indeed be meaningful to me. For instance, if Vorbis was more sophisticated in its control of over-ring and either imposed a flatter characteristic (resisting resonant peaks) or went for an intentionally tailored characteristic (say, suppressing ring around 3-5K like Fraunhofer 32K bit rate) this would have obvious and interesting application to the sound quality. Conversely, if it had big ugly peaks and artifacts, their location in the frequency response would tell a lot about the sonic signature of the encoder.

  5. Done before (again). by xenoweeno · · Score: 3

    Spectral and waveform analysis and such has all been done before, and LAME has been known to be superior for quite some time. I've been singing the praises of this site for at least six months.

  6. Re:MP3 for Audiophiles?? by Anonymous Coward · · Score: 4
    Audiophiles are interested in the most accurate reproduction of sound...

    Absolutely. CD quality (44.1 kHz 16 bit PCM) is total CRAP to true audiophiles. I won't be satisfied until they invent a format that will store the timing and strength of every single air molecule hitting my eardrum, precise to within the Heisenberg uncertainty principle. Uncompressed.

  7. The portrayal of this is inaccurate. by Fross · · Score: 3
    first off, i must say this is a very interesting article, and an original and potentially useful analysis for comparison both between mp3 formats and, to some extent, between mp3 and other audio encoding formats. however, the correlations between visual distortion and loss of audio quality are *NOT* valid or accurate, something the article doesn't place enough emphasis on. :)

    the key point here is that mp3 encoding is in fact a process of two separate transformations (both of which consist of many processes, of course), the first of these is my bone of contention as it seems less well-known than the second, which i will address first.

    the "second transformation" is the one familiar to most people, the iterative fractal encoding procedure, which simply adds information to that audio frame until it either a) hits a "quality threshold" (ie is considered good enough), or b) fills up its bitrate allocation. it's similar in many ways to making a "jpeg of sound". you can get a good view of this whole process by following this link to a graphic of the aac encoding process on fraunhofer's website. It is the stuff inside the box at the lower left that this concerns.
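    a toy sketch of that "add information until good enough or out of bits" loop. it uses a plain uniform quantizer as a stand-in, which is an assumption made for illustration and not what an mp3 encoder actually does. the function names and thresholds are hypothetical.

```python
import math

def quantize(samples, bits):
    """Uniformly quantize samples in [-1, 1] to 2**bits levels (stand-in coder)."""
    levels = 2 ** bits
    step = 2.0 / levels
    return [min(levels - 1, int((s + 1.0) / step)) * step - 1.0 + step / 2
            for s in samples]

def encode_frame(samples, max_error, bit_budget):
    """Add one bit of resolution per pass until one of the two stop rules fires."""
    bits_used = 0
    for bits in range(1, 17):
        if bits * len(samples) > bit_budget:
            return bits_used, "bit budget exhausted"
        coded = quantize(samples, bits)
        bits_used = bits
        error = max(abs(a - b) for a, b in zip(samples, coded))
        if error <= max_error:
            return bits_used, "quality threshold met"
    return bits_used, "maximum resolution reached"

frame = [math.sin(2 * math.pi * n / 32) for n in range(32)]
print(encode_frame(frame, max_error=0.01, bit_budget=512))  # (7, 'quality threshold met')
print(encode_frame(frame, max_error=0.01, bit_budget=128))  # (4, 'bit budget exhausted')
```

    the same frame stops for different reasons depending on the budget, which is the a) versus b) split described above.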

    however the first transformation here is the important one, this is the stuff outside and above the box in the graphic linked above. (i am not sure the graphic is detailed enough, there may be some missing, from what i remember) - this is a series of transformations to limit the amount of data the second transformation has to deal with (and hence get essentially better encoding for the same bitrate), according to the way the human ear works. our ears have "features" like having a dead area in frequencies near loud noises, which means these bits can be cut out, and other bits and pieces that i can't remember and don't have to hand ;) this is of course psychoacoustics, as other people have commented. there is a _very_ basic primer on this at the fraunhofer site here, but it doesn't go into any technical detail.

    as an aside, there used to be some fantastic and informative articles on these subjects at mp3.org back in the day (1997-1998?), may it rest in peace. does anyone have some links for where something as good on this subject is? i haven't been as in touch with the technical side of mpeg encoding as i used to be...

    but anyway back on subject, this first transformation actually distorts the signal *significantly*, but only in a way that makes it easier to process, while still sounding the same (or close) to the human ear. it may be an interesting exercise to isolate this first transformation, apply it and then save without any fractal encoding, and compare that to the original signal. this transformation will cause great "visual degradation", as shown in the article, but imho this is not an accurate criterion for measuring audio quality. still interesting, and a good read, though :)

    fross

  8. Re:What about... by joey · · Score: 3

    He's comparing the output of the encoders, once decoded. If he had a vorbis decoder that allowed him to get the information he needs, of course he could do a meaningful comparison. And it's the comparison I and probably many of us are most interested in.
    --
    see shy jo
  9. Re:LAME? by Rei · · Score: 3

    The author is on drugs, is all I have to say. :)

    I'm taking a course currently on audio and image compression, and his article annoys me greatly. He uses ambiguous terminology and often the wrong terminology (for example, calling things "wavelets" that aren't actually wavelets). He describes things which can't be seen clearly in the graphs and would much better be viewed with a different display format. Etc.

    I'm still wondering if some of my compression ideas will work... I plan to test them out before too long: grouping some of the generally weak high-frequency signals together since the human ear is less sensitive to high frequency pitch variation (we're sensitive to frequency on a logarithmic scale - an octave is a doubling of frequency); and, instead of doing block transforms on the music, generate a 2d image of the signal (graph: frequency vs. time), compress the frequency axis as you normally would, and instead of saving the time axis as a series of blocks of discrete frequencies, actually compress it greatly with a fft - doing this, you should be able to save space on recurring themes in songs (such as a chorus, a regular beat, etc). Voice may introduce complications, though, and I may end up having to do some kind of combination between the two (such as, compressing the difference between the original and final signal as a low quality block transform and saving it with the compressed signal). Two ideas of mine I plan to test when this incredible work load from my senior year stops bearing down on me ;)
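    A minimal sketch of the first idea only (grouping high-frequency bins more coarsely), assuming bands whose width simply doubles each time to mimic the octave relationship. The function names are hypothetical and nothing here is tested against real audio.

```python
def doubling_bands(num_bins, first_width=1):
    """Group spectrum bin indices into bands whose width doubles each time."""
    bands, start, width = [], 0, first_width
    while start < num_bins:
        end = min(start + width, num_bins)
        bands.append((start, end))
        start, width = end, width * 2
    return bands

def band_energies(magnitudes, bands):
    """Collapse each band to a single energy value (sum of squared magnitudes)."""
    return [sum(m * m for m in magnitudes[lo:hi]) for lo, hi in bands]

bands = doubling_bands(31)
print(bands)  # [(0, 1), (1, 3), (3, 7), (7, 15), (15, 31)]

flat_spectrum = [1.0] * 31
print(band_energies(flat_spectrum, bands))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

    Here 31 bins collapse to 5 numbers, with the top 16 bins sharing a single value while the lowest bin keeps its own.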

    - Rei

    --
    He's just being nice so my real father won't freeze him in carbonite and sell him for spice.
  10. cuecat by jbridge21 · · Score: 3

    from the can-a-cue-cat-read-these? dept.

    Well, after calibrating my cat on a couple of Pop-Tarts boxes, I tried several scans on the diagrams on the web page... nothing! I can therefore conclusively answer this question with a big, fat NO.
    -----

  11. Visual analysis of MP3 is nonsense by Djinh · · Score: 3

    MP3 is about selectively discarding information from the audiostream. The purpose is not to create an output waveform which is as close as possible to the input. This is what the whole business with the psycho-acoustic model is about.

  12. So what? by jmv · · Score: 4

    OK, now we see what parts of the spectrum are thrown away at very low bit rate, but why is it supposed to be "probably the most rigorous analysis of any encoders anywhere on the web"? First off, the *only* way to evaluate the quality of a perceptual encoder is to listen to it, period. Who cares what is rejected (not encoded) if you don't hear it?

    Also, while using the 32 kbps bitrate amplifies the effects of perceptual quantization and makes them easy to see, the problem is that not all the encoders were meant to work at this bitrate.

    Think about it, when standard institutes want to evaluate audio/speech codecs, they don't calculate sonograms like this, they make subjective tests. They make a bunch of listeners hear the result of many encoders on *many* audio files. That's right you need many files to evaluate a codec. Some will perform better for certain musical instruments, some will perform better with or without background noise, echo, ...

    For all these reasons, I do NOT consider this analysis rigorous at all!

    1. Re:So what? by jmv · · Score: 3

      with an oscilloscope I can get a more precise answer

      Yes, the guy's sonogram is more *precise* but it is still irrelevant. I could write an encoder that gives a much better result when evaluated with this "precise" sonogram, but yet will sound like crap.

      This is the point of perceptual encoding. The goal is not to produce the best result in terms of signal-to-noise ratio or spectral distortion, but to put the encoder "errors" where the "non-precise" ear won't hear them. And if you don't hear them, you don't care, even if your oscilloscope or spectral analyser tells you there's an error.

      The most critical part of a perceptual encoder is the "psycho-acoustic model", which tries to model as best as it can the sensitivity of the ear at a given frequency, given the rest of the spectrum. This is not an easy task, and you have to make lots of approximations. Given two encoders that produce the same quantitative result (SNR, ...), the best one will be the one with the best psycho-acoustic model and your $10k oscilloscope or spectral analyser won't see that at all.
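      A small sketch of the "same SNR" situation, with made-up signals: two hypothetical decoder outputs carry an error of identical size, one right next to the loud tone and one far away from it. The SNR figure cannot tell them apart, while a psycho-acoustic model might well treat them differently.

```python
import math

def snr_db(original, decoded):
    """Signal-to-noise ratio in dB, treating (original - decoded) as the noise."""
    signal_power = sum(x * x for x in original)
    noise_power = sum((x - y) ** 2 for x, y in zip(original, decoded))
    return 10 * math.log10(signal_power / noise_power)

n = 1024
# A loud tone at 8 cycles per block stands in for the music.
original = [math.sin(2 * math.pi * 8 * i / n) for i in range(n)]

# Equal-amplitude errors at two different frequencies (illustrative values).
err_near = [0.01 * math.sin(2 * math.pi * 9 * i / n) for i in range(n)]
err_far = [0.01 * math.sin(2 * math.pi * 200 * i / n) for i in range(n)]

decoded_a = [x + e for x, e in zip(original, err_near)]
decoded_b = [x + e for x, e in zip(original, err_far)]

print(round(snr_db(original, decoded_a), 2))  # 40.0
print(round(snr_db(original, decoded_b), 2))  # 40.0
```

      Both outputs measure 40 dB. Which one sounds better is a question for the ear, not for this number.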

  13. Re:What about Xing (AudioCatalyst)? by hymie3 · · Score: 3
    I know that Xing (AudioCatalyst) doesn't have the greatest encoder, but that's no reason to leave it out...

    Well, actually, there is a reason: the Xing encoder blows chunks. Sure, it's fast, but the sound quality sucks. If all you're encoding is Teeny Bopper of the Week music, then you're not missing out on anything. If you're encoding stuff that's a lot more complex, you're better off with something that doesn't sacrifice quality for speed.

    hymie

  14. Re:What about... by jmv · · Score: 5

    Ogg Vorbis?

    He's measuring the MP3 encoders, and Ogg Vorbis is not an MP3 encoder, but an Ogg Vorbis (duh!) encoder, it doesn't use exactly the same encoding scheme, though it is still a perceptual encoder (based on time-frequency masking).

  15. A similar, if not better comparison by Jack9 · · Score: 3

    http://users.belgacom.net/gc247244/analysis.htm#MP3ENC31 This is what I found when searching for mp3 comparison. It compares different implementations of encoding for mp3 as well as output quality. Much more useful and definitive.

    --

    Often wrong but never in doubt.
    I am Jack9.
    Everyone knows me.