Visual Analysis Of MP3 Encoders
Chris Johnson writes: "I've just finished an interesting scientific analysis of several mp3 encoders and have my findings up on the Web. The process involves differencing a 'sonogram' image from an encoded test signal with the image of the original signal, and then producing response curves showing the disparity in direct signal volume, and over time. Umm . . . which is just to say this is probably the most rigorous analysis of any encoders anywhere on the web, and very geeky (in a good way). LAME carries the day, but BladeEnc shows that it has a completely distinctive sonic approach- and Fraunhofer proves unacceptable (in the version I tested) for audiophile use, though it's unbeatable at very low bit rates. See why." Truth in advertising -- this is a cool example of how visual information can convey more than you'd expect it to.
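For anyone who wants to poke at the idea themselves, here is a minimal sketch of "differencing sonograms". It is not the author's actual procedure (he works from sonogram images); it assumes raw sample lists, a naive DFT, and a made-up "decoded" signal with some fake smear added before a tone burst. The names `stft_magnitudes` and `sonogram_difference`, the frame size, and the test signal are all invented for illustration.

```python
import cmath
import math

def stft_magnitudes(samples, frame_size=64, hop=32):
    """Return one list of bin magnitudes (bins 0..frame_size//2) per frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to limit spectral leakage between bins
        windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
                    for n, s in enumerate(frame)]
        mags = []
        for k in range(frame_size // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(windowed))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

def sonogram_difference(original, decoded, **kwargs):
    """Per-frame, per-bin absolute difference between two sonograms."""
    a = stft_magnitudes(original, **kwargs)
    b = stft_magnitudes(decoded, **kwargs)
    return [[abs(x - y) for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]

# Test signal: silence followed by a 1 kHz tone burst (8 kHz sample rate).
rate = 8000
original = [0.0] * 128 + [math.sin(2 * math.pi * 1000 * n / rate) for n in range(128)]

# Stand-in for an encode/decode round trip: NOT a real codec, just fake pre-echo.
decoded = list(original)
for n in range(96, 128):
    decoded[n] += 0.1 * math.sin(2 * math.pi * 1000 * n / rate)

diff = sonogram_difference(original, decoded)
energy_per_frame = [sum(row) for row in diff]  # one disparity number per time slice
```

Frames that don't overlap the smeared region come out with zero difference, and the ones that do light up, which is roughly the "disparity over time" curve described in the summary.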
If you are a big fan of classical you will have an opinion on _which_ parts of the sonic information are expendable.
No, when a certain frequency component is discarded, it's not because the listener won't mind, it's because even if it's there, the listener cannot hear it. If you can't hear a sound, why encode it? Now, there are sometimes problems with classical music, but that's because it's often hard to predict exactly what you can and can't hear.
Opus: the Swiss army knife of audio codec
That said, it is generally the case that "pre-echo is bad" and "over-ring is bad." Reducing these can be thought of as a good thing. Let's assume that for these encoders, pre-echo and over-ring are universally bad (I'll give an example where this isn't the case, below). Furthermore, this comparison actually says nothing about these encoders other than the pre-echo or over-ring. I.e. what happened to the sound that was the "same" on the sonogram? It is quite possible for an "encoder" to mangle the audio quality yet have a pristine sonogram by this test's standards.
Just to throw a wrench in the works, more advanced encoders and/or psychoacoustic models can utilize what's called temporal masking. This is the ability of a higher-amplitude signal to mask (make inaudible) a lower-amplitude signal either before or after itself, as far as the human ear is concerned. Pre-echo is the phenomenon whereby a transient signal (i.e. a very 'sudden' attack, like a drum hit) is smeared in time. The audible effect can be most obnoxious. Yet encoders utilizing temporal masking will explicitly allow a certain amount of pre-echo through, as long as it is temporally masked. This leaves the encoder to spend those bits on other parts of the signal that would be more seriously degraded as far as our ear is concerned. In short, a sufficiently savvy encoder could exhibit more pre-echo than another worse-sounding encoder, especially if it uses temporal masking.
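To make that last point concrete, here is a rough sketch under loudly stated assumptions: the function `pre_echo_energy`, the 2 ms pre-masking window, and both fake "encoder" outputs are invented for illustration, and a real psychoacoustic model is far more involved. It splits the error energy before a transient into the part that falls inside the assumed masking window and the part that doesn't.

```python
def pre_echo_energy(original, decoded, onset, rate, premask_ms=2.0):
    """Split pre-onset error energy into (audible, masked) parts.

    premask_ms is an assumed pre-masking window, not a measured value.
    """
    mask_len = int(rate * premask_ms / 1000)
    mask_start = max(0, onset - mask_len)
    err = [(d - o) ** 2 for o, d in zip(original[:onset], decoded[:onset])]
    audible = sum(err[:mask_start])   # pre-echo too early to be masked
    masked = sum(err[mask_start:])    # pre-echo hidden by the transient
    return audible, masked

rate = 8000
onset = 400
# Silence, then a sudden hit (think drum attack).
original = [0.0] * onset + [1.0] * 100

# Hypothetical encoder A: loud pre-echo, but only in the last 10 samples.
decoded_a = list(original)
for n in range(onset - 10, onset):
    decoded_a[n] = 0.1

# Hypothetical encoder B: quieter pre-echo, smeared over 100 samples.
decoded_b = list(original)
for n in range(onset - 100, onset):
    decoded_b[n] = 0.02

a_audible, a_masked = pre_echo_energy(original, decoded_a, onset, rate)
b_audible, b_masked = pre_echo_energy(original, decoded_b, onset, rate)
```

Under these assumptions encoder A shows more total pre-echo on paper than B, yet all of A's falls inside the masking window while most of B's does not, so a raw "how much pre-echo" number would rank them the wrong way round.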
Quantitative analysis for perceptual audio coding is not easy; this has been a grail for researchers in the field for years. I strongly suggest that interested parties dig into various IEEE and AES (Audio Engineering Society) journal papers on the subject, as well as various books, etc.
The basic idea of MPEG audio is that the encoder removes the parts of the music which you (probably) can't hear. The encoder splits the sound into pieces and rates each piece by how important it is to the total sound image. Then it starts with the most important piece and encodes that, continuing with the less important parts until the available bit rate is reached (e.g. 128 kbit/s). The rest of the sound data is discarded.
The tricky part is the calculation of the "importantness" of each sound, and that is what differentiates the encoders. This calculation is done by an algorithm called a "psychoacoustic model".
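As a toy illustration of that "most important first, until the bits run out" idea, here is a sketch. The importance scores and bit costs are made up (in a real encoder they would come from the psychoacoustic model), `allocate_bits` is a hypothetical helper, and the per-frame budget ignores header and side-information overhead.

```python
def allocate_bits(pieces, bit_budget):
    """pieces: list of (name, importance, cost_in_bits).

    Encode pieces in order of decreasing importance until the next one
    no longer fits; everything after that point is discarded.
    """
    kept = []
    remaining = bit_budget
    for name, importance, cost in sorted(pieces, key=lambda p: p[1], reverse=True):
        if cost > remaining:
            break  # budget reached: the rest of the sound data is dropped
        kept.append(name)
        remaining -= cost
    return kept, remaining

# Rough bits available for one 1152-sample frame at 128 kbit/s, 44.1 kHz.
bit_budget = int(128000 * 1152 / 44100)

# Hypothetical pieces: (name, importance from the model, cost in bits).
pieces = [
    ("hiss above 16 kHz", 0.05, 700),
    ("vocals", 0.8, 1500),
    ("cymbal shimmer", 0.3, 900),
    ("bass", 0.9, 1200),
]

kept, remaining = allocate_bits(pieces, bit_budget)
```

With these invented numbers the bass and vocals fit, the cymbal shimmer doesn't, and the hiss never gets considered. Change the importance scores and you get a different-sounding encoder from the same loop, which is the whole point.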
To measure the quality of an MPEG encoder automatically, you need an algorithm which calculates the quality of the encoded signal. Once you know this algorithm, it is trivial to create an encoder which will score the maximum on this quality measurement, since the quality measurement algorithm is basically the same thing as the psychoacoustic model.
This test is "snake oil". A real test of an MPEG encoder unfortunately involves listening to the music to evaluate the encoder's psychoacoustic model, not comparing two artificially created psychoacoustic models with each other.
RFC1925
You can add me to that list- and such a comparison (I naturally kept a logbook to be able to reproduce the process later) would indeed be meaningful to me. For instance, if Vorbis was more sophisticated in its control of over-ring and either imposed a flatter characteristic (resisting resonant peaks) or went for an intentionally tailored characteristic (say, suppressing ring around 3-5K like Fraunhofer 32K bit rate) this would have obvious and interesting application to the sound quality. Conversely, if it had big ugly peaks and artifacts, their location in the frequency response would tell a lot about the sonic signature of the encoder.
Spectral and waveform analysis and such has all been done before, and LAME has been known to be superior for quite some time. I've been singing the praises of this site for at least six months.
Absolutely. CD quality (44.1 kHz 16 bit PCM) is total CRAP to true audiophiles. I won't be satisfied until they invent a format that will store the timing and strength of every single air molecule hitting my eardrum, precise to within the Heisenberg uncertainty principle. Uncompressed.
the key point here is that mp3 encoding is in fact a process of two separate transformations (both of which consist of many processes, of course), the first of these is my bone of contention as it seems less well-known than the second, which i will address first.
the "second transformation" is the one familiar to most people, the iterative fractal encoding procedure, which simply adds information to that audio frame until it either a) hits a "quality threshold" (i.e. is considered good enough), or b) fills up its bitrate allocation. it's similar in many ways to making a "jpeg of sound". you can get a good view of this whole process by following this link to a graphic of the aac encoding process on fraunhofer's website. it is the stuff inside the box at the lower left that this concerns.
however the first transformation here is the important one, this is the stuff outside and above the box in the graphic linked above. (i am not sure the graphic is detailed enough, there may be some missing, from what i remember) - this is a series of transformations to limit the amount of data the second transformation has to deal with (and hence get essentially better encoding for the same bitrate), according to the way the human ear works. our ears have "features" like having a dead area in frequencies near loud noises, which means these bits can be cut out, and other bits and pieces that i can't remember and don't have to hand ;) this is of course psychoacoustics, as other people have commented. there is a _very_ basic primer on this at the fraunhofer site here, but it doesn't go into any technical detail.
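a very crude sketch of that "dead area near loud noises" idea, for the curious. everything here is an assumption made for illustration: the helper name `audible_components`, the one-third-octave neighbourhood and the 20 dB margin are invented, and real encoders use proper masking curves rather than a yes/no rule like this.

```python
import math

def audible_components(components, max_octaves=1 / 3, margin_db=20.0):
    """components: list of (frequency_hz, level_db).

    Drop any component that sits within max_octaves of a component that is
    at least margin_db louder. Both thresholds are illustrative guesses.
    """
    kept = []
    for freq, level in components:
        masked = any(
            other_level - level >= margin_db
            and abs(math.log2(other_freq / freq)) <= max_octaves
            for other_freq, other_level in components
        )
        if not masked:
            kept.append((freq, level))
    return kept

# Hypothetical spectrum: a loud 1 kHz tone with quieter neighbours.
tones = [(1000, 80), (1100, 50), (1200, 65), (4000, 45)]
survivors = audible_components(tones)
```

here the 1100 Hz tone is both close to and much quieter than the 1 kHz tone, so it gets cut. the 1200 Hz tone is close but not quiet enough, and the 4 kHz tone is too far away to be affected, so both stay.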
as an aside, there used to be some fantastic and informative articles on these subjects at mp3.org back in the day (1997-1998?), may it rest in peace. does anyone have some links for where something as good on this subject is? i haven't been as in touch with the technical side of mpeg encoding as i used to be...
but anyway back on subject, this first transformation actually distorts the signal *significantly*, but only in a way that makes it easier to process, while still sounding the same (or close) to the human ear. it may be an interesting exercise to isolate this first transformation, apply it and then save without any fractal encoding, and compare that to the original signal. this transformation will cause great "visual degradation", as shown in the article, but imho this is not an accurate criterion for measuring audio quality. still interesting, and a good read, though :)
fross
He's comparing the output of the encoders, once decoded. If he had a Vorbis decoder that allowed him to get the information he needs, of course he could do a meaningful comparison. And it's the comparison I and probably many of us are most interested in.
--
see shy jo
The author is on drugs, is all I have to say. :)
;)
I'm taking a course currently on audio and image compression, and his article annoys me greatly. He uses ambiguous terminology and often the wrong terminology (for example, calling things "wavelets" that aren't actually wavelets). He describes things which can't be seen clearly in the graphs and would much better be viewed with a different display format. Etc.
I'm still wondering if some of my compression ideas will work... I plan to test them out before too long. The first is grouping some of the generally weak high-frequency signals together, since the human ear is less sensitive to high-frequency pitch variation (we're sensitive to frequency on a logarithmic scale: an octave is a doubling of frequency). The second is, instead of doing block transforms on the music, to generate a 2D image of the signal (graph: frequency vs. time), compress the frequency axis as you normally would, and, instead of saving the time axis as a series of blocks of discrete frequencies, actually compress it greatly with an FFT. Doing this, you should be able to save space on recurring themes in songs (such as a chorus, a regular beat, etc.). Voice may introduce complications, though, and I may end up having to do some kind of combination of the two (such as compressing the difference between the original and final signal as a low-quality block transform and saving it with the compressed signal). Those are two ideas of mine that I plan to test when this incredible work load from my senior year stops bearing down on me.
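A quick sketch of the first idea, just the grouping step and nothing resembling a working codec. It assumes linearly spaced FFT bins and lumps them into bands whose edges double in frequency. The function name `octave_bands`, the 62.5 Hz bin spacing, and the bin count are made-up example values.

```python
import math

def octave_bands(num_bins, bin_hz, lowest_hz):
    """Group linear frequency bins into octave-wide bands.

    Returns {band_index: [bin numbers]}; band n covers frequencies from
    lowest_hz * 2**n up to (not including) lowest_hz * 2**(n + 1).
    """
    bands = {}
    for k in range(1, num_bins):  # skip bin 0 (DC)
        freq = k * bin_hz
        if freq < lowest_hz:
            continue
        band = int(math.floor(math.log2(freq / lowest_hz)))
        bands.setdefault(band, []).append(k)
    return bands

# Example values: 65 bins spaced 62.5 Hz apart (0 Hz to 4 kHz).
bands = octave_bands(num_bins=65, bin_hz=62.5, lowest_hz=62.5)
bins_per_band = {band: len(bins) for band, bins in sorted(bands.items())}
```

The bottom octave holds a single bin while the octave starting at 2 kHz holds 32, so if the ear really treats each octave as about equally "wide", most of the linear bins are sitting up where they could be lumped together.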
- Rei
He's just being nice so my real father won't freeze him in carbonite and sell him for spice.
from the can-a-cue-cat-read-these? dept.
Well, after calibrating my cat on a couple of Pop-Tarts boxes, I tried several scans on the diagrams on the web page... nothing! I can therefore conclusively answer this question with a big, fat NO.
-----
MP3 is about selectively discarding information from the audiostream. The purpose is not to create an output waveform which is as close as possible to the input. This is what the whole business with the psycho-acoustic model is about.
OK, now we see what parts of the spectrum are thrown away at very low bit rate, but why is it supposed to be "probably the most rigorous analysis of any encoders anywhere on the web"? First off, the *only* way to evaluate the quality of a perceptual encoder is to listen to it, period. Who cares what is rejected (not encoded) if you don't hear it?
...
Also, while using the 32 kbps bitrate amplifies the effects of perceptual quantization so that they're easy to see, the problem is that not all the encoders were meant to work at this bitrate.
Think about it: when standards institutes want to evaluate audio/speech codecs, they don't calculate sonograms like this, they run subjective tests. They have a bunch of listeners hear the results of many encoders on *many* audio files. That's right, you need many files to evaluate a codec. Some will perform better for certain musical instruments, some will perform better with or without background noise, echo, etc.
For all these reasons, I do NOT consider this analysis rigorous at all!
Well, actually, there is a reason: the Xing encoder blows chunks. Sure, it's fast, but the sound quality sucks. If all you're encoding is Teeny Bopper of the Week music, then you're not missing out on anything. If you're encoding stuff that's a lot more complex, you're better off with something that doesn't sacrifice quality for speed.
hymie
Ogg Vorbis?
He's measuring MP3 encoders, and Ogg Vorbis is not an MP3 encoder but an Ogg Vorbis (duh!) encoder. It doesn't use exactly the same encoding scheme, though it is still a perceptual encoder (based on time-frequency masking).
http://users.belgacom.net/gc247244/analysis.htm#MP3ENC31
This is what I found when searching for mp3 comparison. It compares different implementations of encoding for mp3 as well as output quality. Much more useful and definitive.
Often wrong but never in doubt.
I am Jack9.
Everyone knows me.