So if her recordings were so masterful, and they were identical to other recordings, then why didn't the critics recognize the similarity for so long?
This confirms my belief that music critics are mostly full of shit. If those recordings were so good, then the artists she copied from were obviously superb. However, one was apparently a very obscure Japanese pianist, so his brilliance wasn't recognized, and since no-one noticed the copy for so long, the others can't have been very prominent either.
Well, in the case of Minoru Nojima (the "very obscure Japanese pianist,") any critics would not have been wrong in recognizing that the playing was obviously superb, even if they couldn't discern who the actual pianist was. "Nojima Plays Liszt" is a wonderful CD, with a combination of both masterful playing and excellent sound quality. Too bad Nojima is as obscure as he is to the general public -- he just hasn't recorded much. But that just makes it all the more special to me that I got to see him play in a small junior college auditorium just minutes from my house!
128k Itunes is not significantly better than mp3.
To my ears (and I am nearly 40) 128k just doesn't cut it for anything but very casual listening.
Unless you actually participated in the test, and rated the 128 kbit/s mp3's low, I think you reached the wrong conclusion. Yes, lame mp3 was tied with iTunes AAC. But that doesn't mean that either mp3 or AAC sounded crappy! In fact, they were both rated very high. The correct conclusion is that lame mp3 using VBR has made great quality strides, essentially catching up with AAC for now.
ff123
While I don't doubt that atrac3plus sounds better than atrac3, I just want to point out that when it comes to perceptual codecs, subjective, blind listening tests of multiple samples by a panel of listeners are considered the gold standard for assessing sound quality.
Technical tests of a codec (such as frequency response graphs) are not nearly as important as what it actually sounds like.
ff123
The Register article noted that atrac3plus would be used, which is better-sounding than atrac3 at the same bitrate. However, Roberto's listening test compared atrac3, not atrac3plus, because a bitrate near 128 kbit/s for this codec wasn't available in Sony's software encoder, SonicStage 2.
BTW, Roberto is currently conducting a low-bitrate streaming test (32 kbit/s), and everybody is invited to participate.
ff123
There are new WMA9 codecs, though, such as WMA9 Professional, which goes up to 96 kHz 24-bit 7.1. It does have a 2-pass VBR 128 Kbps 44.1 stereo mode, and it'd be interesting to see that included in a future version of this test.
WMA9 Pro VBR at 128 kbit/s was tested in the previous multi-format test:
First 128 multiformat test
Although results aren't strictly comparable between tests, in that test WMA9Pro was essentially tied with iTunes AAC 4.2, so it probably would have ended up near iTunes and Lame in this one.
ff123
The only thing that might be suspect is that the subjects could send in false reports after the testing was done.
It is possible, but unlikely. Both the configuration file and the results file are encrypted, so the listener can't tell how he rated things until the public key is distributed after the test is completed.
ff123
That seems like a questionable methodology to me. If the user couldn't tell the difference, it seems like that should be an automatic 5. Dropping cases where there isn't a perceptible difference would tend to underrate the quality of the best encoders.
The problem is that it's difficult to develop a consistent and fair method of determining what to do with results where the reference itself was rated down. Maybe just assigning 5 to these cases is one way of dealing with it. I can see doing this for someone who marks 4.9 for one of the references, but what about somebody who scores 3.5 on multiple references (I'm exaggerating, but not too much!)? Is it wise to even keep results from such a listener? I think that as long as there are enough results, which seems to be the case here, the fairest and most consistent way to deal with these cases is to simply discard them.
ff123
Though I think that this is a great idea for a study, aren't the conclusions found by this research ultimately meaningless?
Not meaningless; just not representative of the general population. As you said, the self-selection helps to yield results that represent the population of people who actually care.
For everybody else, probably all of the codecs are "good enough," even wma9 standard and atrac3.
ff123
The AAC competitor was chosen on the basis of the winner of Roberto's AAC test, conducted a few months prior to this one. An HE-AAC contender was not available at this bitrate at the time. Moreover, I'm not sure that HE-AAC is actually competitive at this bitrate. Its goal is to improve low-bitrate encodings, and it may not produce transparency at higher bitrates.
ff123
For each of the 18 music samples, each of the 6 codecs is tested once. That's why there are 6 pairs of sliders. There are no hidden duplicates and no tricks (like having two references to compare against).
The sensitivity of Roberto's previous tests has been quite satisfying. That is, they usually identified significant differences between codecs to a high confidence, without getting too much statistical noise. I think this test will be just as sensitive.
ff123
Fisher LSD is better than I thought, but you are still going about it the wrong way. The only way the Fisher LSD is protected is if you do the ANOVA first and it shows a significant difference between the groups, then you do the post hoc tests accordingly.
That is exactly what is happening here. A blocked ANOVA (each listener is a block) is used to determine if there is a significant difference anywhere. Fisher LSD is used if the ANOVA is significant (this is what it means to be protected).
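For anyone who wants to see what "protected" means in practice, here is a minimal Python sketch of a blocked ANOVA followed by a Fisher LSD that is only computed when the ANOVA is significant. This is not the code used for the test: the function names and the data layout (one row per listener, one column per codec) are assumptions, and the critical values are supplied by the caller from F and t tables rather than computed.

```python
import math


def blocked_anova(ratings):
    """Two-way ANOVA without replication.

    ratings[i][j] is the rating that listener i (the block) gave codec j.
    Returns (codec_means, f_stat, mse, df_codec, df_error).
    """
    n = len(ratings)       # number of listeners (blocks)
    k = len(ratings[0])    # number of codecs (treatments)
    if n < 2 or k < 2:
        raise ValueError("need at least 2 listeners and 2 codecs")

    grand = sum(sum(row) for row in ratings) / (n * k)
    codec_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    listener_means = [sum(row) / k for row in ratings]

    ss_codec = n * sum((m - grand) ** 2 for m in codec_means)
    ss_listener = k * sum((m - grand) ** 2 for m in listener_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    # Removing the listener sum of squares is the "blocking" step.
    ss_error = ss_total - ss_codec - ss_listener

    df_codec = k - 1
    df_error = (n - 1) * (k - 1)
    mse = ss_error / df_error
    f_stat = (ss_codec / df_codec) / mse
    return codec_means, f_stat, mse, df_codec, df_error


def protected_fisher_lsd(ratings, f_crit, t_crit):
    """Return (codec_means, lsd); lsd is None if the ANOVA is not significant.

    f_crit and t_crit are table values chosen by the caller (assumption:
    looked up for df_codec, df_error and the desired alpha).
    """
    means, f_stat, mse, _, _ = blocked_anova(ratings)
    if f_stat <= f_crit:
        return means, None   # no significant difference anywhere: stop here
    n = len(ratings)
    return means, t_crit * math.sqrt(2.0 * mse / n)


# Made-up ratings for illustration only: 4 listeners, 3 codecs.
EXAMPLE = [
    [4.5, 4.0, 2.5],
    [4.0, 3.6, 2.0],
    [5.0, 4.4, 3.0],
    [4.3, 4.0, 2.3],
]
```

With 4 listeners and 3 codecs the degrees of freedom are 2 and 6, so the usual 5% table values are about 5.14 for F and 2.447 for two-sided t. Calling `protected_fisher_lsd(EXAMPLE, 5.14, 2.447)` returns the codec means plus a single LSD; two codecs are separated if their means differ by more than that LSD.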
As I mentioned in another message, if people really want to nitpick, they should take the raw data and run it through my bootstrap resampling program, which does not assume a normal distribution and corrects for multiple comparisons.
http://ff123.net/bootstrap/
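To give a feel for what resampling does, here is a minimal Python sketch of a percentile bootstrap for the mean rating difference between two codecs. It is not the program at the link above and it does not correct for multiple comparisons; the function name, parameters, and defaults are assumptions made for illustration. Listeners are the resampling unit, so each listener's pair of ratings stays together.

```python
import random


def bootstrap_mean_diff(ratings_a, ratings_b, n_resamples=10000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a - b).

    ratings_a[i] and ratings_b[i] are the same listener's ratings of
    codec A and codec B. No normal distribution is assumed.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")

    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    rng = random.Random(seed)   # fixed seed so runs are repeatable

    means = []
    for _ in range(n_resamples):
        # Draw n listeners with replacement and average their differences.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)

    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the interval excludes zero, the two codecs differ at roughly the chosen confidence level for this group of listeners. A real analysis of several codecs would still need a multiple-comparison adjustment on top of this.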
However, you can't get the nice graphs that Roberto showed (tabular pvalues are the best you can get), and the difference in the conclusions will be something that is more conservative than the Fisher LSD. But again, who cares? So there is a bigger chance that you'll get a type I error. Get over it. It's not a black and white thing, where if you do it one way the results are totally wrong, like you're making it out to be. Are the conclusions reasonable? Yes. Are they significant to 95% confidence? Maybe if you use ANOVA/Fisher LSD. Maybe not (90% confidence?) if you use another, more conservative method. It's not the end of the world.
BTW, Blocked ANOVA with Fisher LSD is what the book, "Sensory Evaluation Techniques," by Meilgaard, Civille, and Carr recommends for subjective tests like these.
Also BTW, there really aren't one or two results which are strikingly different from the others. There was no low anchor included in this test, which would have produced such a result.
ff123
The intervals in the rating scale are 0.1 steps, which is close enough for argument's sake.
And ANOVA is a robust method as you've commented, so it's probably reasonable to assume normality. In any case, the raw data is available for any stats weenies to play with, and there are a couple of more conservative methods besides the Fisher LSD readily available to try, if anyone has an uncontrollable urge.
I point you to:
http://ff123.net/friedman/stats.html
where you can run a non-parametric analysis of the raw data using a web form if you like. Or download the program to correct the pvals for multiple comparisons using Tukey's Honestly Significant Difference instead of the Fisher LSD (either parametric or non-parametric).
And if you want to really geek out, you can do a bootstrap resampling method with different methods of correcting the pvals for multiple comparisons:
http://ff123.net/bootstrap/
But let's not lose the forest for the trees here. The blocked ANOVA/Protected Fisher LSD used for the test provides a reasonable (if not the most conservative) summary of the results.
A couple of more important weaknesses of the test are (and they're related):
1. Only 12 samples were used. Although this is probably close to the practical limit, more samples always provide a more comprehensive picture. The results are suggestive, but not definitive.
2. The selection of the samples makes a difference. If you listen mainly to classical for example, this test may not be representative for you.
ANOVA is used in the analysis. But in separating the means, no pval correction is used. The method for separating the means is a protected Fisher LSD.
So yes, you can perform a more conservative analysis (for example, a resampling method) that does not assume a normal distribution and also corrects the pval for multiple comparisons. But hey, we're not talking about bringing a drug to market here. This is a listening test, for Christ's sake!
The statistics aren't the weakest link in interpreting the results of the test.
ff123
That link is broken, apparently because there is a space between "demo" and "center".
A couple of things that Microsoft can do to slant things their way:
1. They can cherry pick samples which their codecs do well on.
2. They can choose a cruddy mp3 encoder to compare against.
Bottom line is that I don't trust any comparison featured on their site.
ff123
Also, don't be deceived by the "confidence intervals" shown in the graph. They're all drawn to the same widths for each set! At best, this is an approximation. At worst, the author is simply using a program that draws in some uniform (and meaningless) bars. Fear graphs.
The bars are not meaningless. The exact meaning of the bars is described in the results writeup. I suggest you read that writeup.
The exact procedure used to compare ratings is a blocked ANOVA, with a protected Fisher's Least Significant Difference to separate the means if the ANOVA says there is a significant difference somewhere. The Fisher's LSD yields a constant confidence interval for every mean. To get non-constant intervals, one would have to do something a lot more complicated (such as resampling). But then a graph couldn't tell the whole story (you'd need to be able to compare confidence intervals of one sample against every other sample), and we'd be stuck with a dreary matrix.
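As a rough illustration of why the interval comes out constant, the textbook form of the LSD for a balanced, blocked design is shown below. The symbols are notation introduced here, not taken from the writeup: n is the number of listeners, k the number of codecs, and MSE the error mean square from the blocked ANOVA.

```latex
\mathrm{LSD} = t_{\alpha/2,\;(n-1)(k-1)} \, \sqrt{\frac{2\,\mathrm{MSE}}{n}}
```

Nothing on the right-hand side depends on which pair of codecs is being compared, so every pair gets the same threshold: two means are separated when they differ by more than the LSD.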
ff123
When the hell are people gonna learn? It doesn't matter how you "encode" or "enumerate" it, quantitative operations done to non-quantitative data have NO MEANING. NONE.
Sigh. When the hell are people gonna read up before they spout garbage? There is a whole field of science called psychophysics, i.e., the science of subjective testing. Pick up this book and read it!
Sensory Evaluation Techniques
ff123
Because there was no sample that Ahead HE AAC did POORLY at.
HE-AAC did rather poorly on the EnolaGay sample, ranking in the bottom half.
ff123
It looks to me like one can make the statement that Faac is worse than all the others with only about 85% confidence.
No, FAAC is clearly worse than all the others with greater than 95% confidence. At least for this group of samples and group of listeners.
And this is not even taking into account the systematic errors arising from a poorly-controlled test.
The most likely effect of having different listening conditions, etc. would be to increase the size of the error bars (increase of random error), not to create a bias (systematic error). So it is more accurate to say that this test managed to find significant differences despite the lack of some controls.
I agree, though, that it is an overstatement to say that Quicktime clearly won.
Yes, a ranking method could have been used to evaluate the codecs instead of a rating method. The rating method is typically more powerful for smaller sample sizes, though.
BTW, there is nothing wrong with using either method. Again refer to the MPEG group's own evaluation of AAC. They used the rating method.
There is a whole field of science which deals with the statistics of subjective measurement. Here's a reference to a book which you might pick up to inform yourself:
Sensory Evaluation Techniques
Subjective tests of codecs are not new or particularly controversial. See the MPEG group's own subjective test of AAC:
Report On The MPEG-2 AAC Stereo Verification Tests (PDF File)
The statistics in the hydrogenaudio test treat each listener as a "block," which takes into account the fact that different listeners will have different ideas about what constitutes a "4" or a "2," etc.
The next test will use an anchor (Blade mp3 at 128 kbit/s) to keep the ratings in perspective.
ff123
I had several obvious choices for the AAC encoder: Psytel, the Quicktime, and Liquid Audio 5 (I hadn't looked into LA6). Liquid Audio 5 is another FhG low complexity encoder, but lowpasses at a lower frequency than the Quicktime. The Psytel encoder is worse-sounding than the Quicktime at 64 kbit/s. I did try to choose the best AAC implementation available to me (I do not have access to the latest and greatest implementations).
It's possible I could have set up the experiment as a Latin Square, and randomized which codecs any individual was comparing, but my home-grown statistical tools are not up to that task. That is, I can only perform balanced analyses, where N is the same for every codec.
ff123
The statistical technique used to evaluate differences in each individual sample was a parametric method: Tukey's Honestly Significant Difference, using each listener as a "block." That is, the fact that different people use different parts of the rating scale is taken into account. The Tukey's method also takes into account the fact that multiple samples are being rated, not just two.
The statistical technique used to rank the codecs overall was a non-parametric method: a Friedman omnibus test to see if there was a difference at all anywhere in the experiment, followed by a non-parametric Fisher's Least Significant Difference (also "blocked"). A non-parametric method means that ranking (first, second, third, etc.) was used instead of rating points (4.7, 3.5, 2.6, etc.).
The ranking method was used for the overall evaluation because ratings for one sample don't necessarily mix and match with ratings for another sample.
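The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the analysis program itself: the function names and the data layout (one row per listener, one column per codec) are assumptions, and the statistic shown is the basic Friedman form without a tie correction.

```python
def rank_within_block(scores):
    """Rank one listener's scores (1 = lowest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    return ranks


def friedman_statistic(ratings):
    """Return (rank_sums, q) for ratings[listener][codec].

    q is compared against a chi-square distribution with k - 1 degrees
    of freedom, an approximation that is rough for few listeners.
    """
    n = len(ratings)       # listeners (blocks)
    k = len(ratings[0])    # codecs
    rank_sums = [0.0] * k
    for row in ratings:
        for j, r in enumerate(rank_within_block(row)):
            rank_sums[j] += r
    q = (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
    return rank_sums, q
```

Because only the order within each listener's row matters, a listener who rates everything between 4 and 5 and one who uses the whole scale contribute equally, which is the point of using ranks for the overall result.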
ff123
By definition, the original was rated 5.0, or perfect. If the listener failed to rate the original 5.0 on any codec for a particular music sample, then all of the ratings for that listener on that sample were discarded.
This is a rather drastic way to screen, and certainly I might not have done this if there had been less data and if more people had rated the original less than perfect. However, given the large number of people who participated and the level of experience they had, I had that luxury.
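The screening rule described above can be sketched in Python as follows. The data layout is hypothetical (it is not the format of the actual results files): each (listener, sample) entry maps a codec name to a pair of ratings, the first for the slider that turned out to be the original and the second for the encoded version.

```python
def screen_results(results, perfect=5.0):
    """Drop every (listener, sample) entry where any original was rated down.

    results maps (listener, sample) -> {codec: (original_rating, codec_rating)}.
    Returns (listener, sample) -> {codec: codec_rating} for the kept entries.
    """
    kept = {}
    for key, pairs in results.items():
        # One imperfect rating of the original discards the whole entry.
        if all(original == perfect for original, _ in pairs.values()):
            kept[key] = {codec: rating for codec, (_, rating) in pairs.items()}
    return kept


# Made-up example: listener "b" rated the original 4.2 in one pair.
EXAMPLE_RESULTS = {
    ("a", "sample1"): {"mp3": (5.0, 3.8), "aac": (5.0, 4.4)},
    ("b", "sample1"): {"mp3": (4.2, 5.0), "aac": (5.0, 4.0)},
}
```

Here `screen_results(EXAMPLE_RESULTS)` keeps only listener "a" on sample1, because listener "b" marked the original down in the mp3 pair.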
A couple of things I personally would have liked to include but didn't, for the sake of getting more reliable statistics: a 128 kbit/s mp3 anchor, and something like a 7 kHz lowpassed anchor. Just to kind of keep the ratings in perspective.
Oh, and to answer the criticism that the test doesn't represent the general population, that is quite true. The people who participated were not randomly selected off the street, but rather volunteered their services. I agree with the person who replied to this criticism, though, that I'm much more interested in the opinions of these motivated volunteers, who are much more likely to care about audio quality, than those of the average joe.
ff123
The decoder I use for mp3 is Fraunhofer's (in_mp3.dll from Winamp 2.76), and is as good as one can get, unless one thinks that dithering at the LSB makes a significant difference in sound quality. Winamp ditched its buggy Nitrane decoder starting from version 2.666. For a comprehensive comparison of mp3 decoders, see David Robinson's excellent site:
http://mp3decoders.org/
The optimal bitrate to use depends on the quality of the codec. To say that everything at 128 kbit/s will sound crappy doesn't allow for the chance that improved codecs may actually be acceptable at that bitrate.
Regarding the graphs at:
http://ff123.net/dogies/dogies_plots.html
I made them before I learned how to do the proper statistical analysis. Since I went to the trouble of making them, and they seem to agree with the numerical results anyway, I decided to keep them in. BTW, I will eventually change the formal analysis method from the current "Friedman with Fisher's LSD" to "bootstrap resampling," which yields more robust conclusions (i.e., less prone to error), but which, in this particular case, doesn't change the results.
I chose WMA8 over WMA7 for the reason given in my writeup: this is the latest codec from Microsoft, even though it may not be the greatest, and older versions are generally not available from them. If I had chosen WMA7, think what the skeptical response would have been had this yielded a poor showing!
Comments about my Ogg Vorbis comments: they were based on the raw listener comments and were not stated as factual, but as a prediction to be tested. In fact, Monty paid attention to all the complaining of background hiss and improved his codec subsequent to RC2. The raw listener comments from the subsequent tests (not yet publicly available) have no mention of problems with background hiss in Ogg Vorbis pre-RC3.
The Washington Post test was just about the worst test I've ever seen performed. Those listeners may have a great deal of music experience, but the person who set up the test knows next to nothing about setting up a fair test. About the only good thing I can say about it is that they actually listened, and didn't just look at spectrum analyses of tone sweeps, or some other such nonsense.
ff123