Slashdot Mirror


Can Machine Learning Replace Focus Groups?

itwbennett writes "In a blog post, Steve Hanov explains how 20 lines of code can outperform A/B testing. Using an example from one of his own sites, Hanov reports a green button outperformed orange and white buttons. Why don't people use this method? Because most don't understand or trust machine learning algorithms, mainstream tools don't support it, and maybe because bad design will sometimes win."

15 of 93 comments

  1. OK, so... by war4peace · · Score: 5, Insightful

    I have read the synopsis 4 (four) times and I didn't get shit.
    Of course, TFA sheds some light on the whole thing, but really... work on your short version, guys, because what's in here makes no sense.

    --
    ...gis sdrawkcab (usually not responding to ACs; don't bother posting as AC)
    1. Re:OK, so... by WrongSizeGlass · · Score: 4, Funny

      I have read the synopsis 4 (four) times and I didn't get shit.
      Of course, TFA sheds some light on the whole thing, but really... work on your short version, guys, because what's in here makes no sense.

      If you had just clicked the green button the machine would have understood it for you.

    2. Re:OK, so... by Tarsir · · Score: 5, Insightful
      You know, I read the summary without understanding it, and just clicked through to read the article, but only after reading your comment did I realize just how little sense the summary really made.

      In a blog post, Steve Hanov explains how 20 lines of code can outperform A/B testing.

      It starts off talking about a nobody who did something that is apparently so trivial that it can be outdone by 20 lines of code. You might think that the following sentence will answer at least one of the questions raised by this sentence: Who is Steve Hanov? What is A/B testing? What do Steve's 20 lines of code do? But you'd be wrong.

      Using an example from one of his own sites, Hanov reports a green button outperformed orange and white buttons.

      Because the next sentence jumps to a topic whose banality and seeming irrelevance to the matter at hand defy belief. Three coloured buttons, one of which 'outperformed' the others, with nary a hint as to what these buttons do, or how one can outperform the others.

      Why don't people use this method?

      The third sentence appears to pick up where the first left off. Why don't people use the A/B testing method? Or are we talking about the three coloured buttons method?

      Because most don't understand or trust machine learning algorithms, mainstream tools don't support it, and maybe because bad design will sometimes win.

      The final sentence is a tour-de-force of disjointed confusion. It skips from machine learning algorithms that haven't been discussed, to tools with unknown purpose, to the design of something which was never specified.

      It's like the summary is some kind of abstract art installation whose purpose is to be as uninformative as possible. It is literally the opposite of informative: Not only does it provide no information, it raises questions which you can't even be sure relate to the purported topic at hand, because you don't know what the topic at hand is.

      It is either a bizarrely confused summary or one of the most artful trolls ever to grace Slashdot's front page.

  2. Translation by Anonymous Coward · · Score: 5, Informative

    So that you don't have to click through the slashvertisement, I have read TFA for you.

    Here is a summary: Let's say you have several different designs for a web interface that you want to test to find out which one works the best.

    One method is to have a "testing period" in which you show each person one of the designs at random and record how well it works for that person. Then, once you've shown 1,000 people each of the designs, you figure out which one is the best on average. Now the "testing period" is over, and the best design is shown to everyone from that point forward. That is the "old" method.
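
    A minimal Python sketch of that fixed testing period, assuming each design's outcomes have already been tallied as (shows, successes). The function name and the numbers are made up for illustration:

```python
def pick_winner(results):
    """Return the design with the highest success rate after the test period.

    results maps a design name to a (shows, successes) tuple.
    """
    return max(results, key=lambda name: results[name][1] / results[name][0])


# Hypothetical tallies after showing each design to 1,000 people.
results = {
    "orange": (1000, 21),
    "green": (1000, 34),
    "white": (1000, 18),
}
winner = pick_winner(results)
print(winner)
```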

    The "new" method is to dispense with the testing period. Instead, you show the first person one design at random. If it works (e.g. they click on the ad), it gets bonus points. If it doesn't work, it gets a penalty. At any time, you show the design with the most points; if it is bad, it will lose points over time and eventually stop being shown.

    The goal of the "new" method is to hopefully avoid showing bad designs to 2000 people just to figure out which one is the best.

    If you care about the details then you should probably read the article. This summary is just an approximation for those who can't be bothered or who object to slashvertisements on principle.

    1. Re:Translation by spazdor · · Score: 3, Informative

      The "new" method has the problem of immediately favoring the first design to get a positive response.

      No it doesn't. The designs are ranked according to what percentage of responses have been positive so far, not by the total number of positive responses. The first design to get a positive response will get shown more, and thus it will get more positive responses, and more negative responses.
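
      To make the distinction concrete, here is a small Python sketch with invented tallies: design A has been shown far more often and has more positive responses in total, but B still leads because the ranking uses the rate.

```python
def success_rate(shows, positives):
    """Fraction of displays that drew a positive response."""
    return positives / shows if shows else 0.0


# Hypothetical tallies: A got an early click and was shown far more often.
a_rate = success_rate(shows=500, positives=10)  # more positives, lower rate
b_rate = success_rate(shows=50, positives=3)    # fewer positives, higher rate
leader = "A" if a_rate > b_rate else "B"
print(leader, a_rate, b_rate)
```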

      --
      DRM: Terminator crops for your mind!
    2. Re:Translation by spazdor · · Score: 2

      More people will inevitably vote it down (unless it is indeed the best option), because it's getting more exposure.

      Unless you're saying that display frequency will actually affect click-through rate. Are you suggesting that, for instance, a design which gets 100 positive responses when shown 300 times should be expected to get more than 1000 positive responses if it were shown 3000 times instead? This seems unlikely if successive tests are causally independent (and given that successive tests are most likely completely different site users, at different computers, who have never met each other, that seems a fair assumption.)

      --
      DRM: Terminator crops for your mind!
    3. Re:Translation by WrongSizeGlass · · Score: 2

      Is there any way they can apply this to summaries and stories on /.? I'd be willing to read that summary ... and maybe even that story.

    4. Re:Translation by swillden · · Score: 5, Informative

      No.... I'm suggesting that the algorithm presented above, which only ever displays the single highest scoring design, is biased against designs that haven't yet had a chance to be viewed by anybody, and thus have not had an opportunity to get a positive response, when people are already showing some favor towards others.

      What you're missing is the implied assumption that all of the options will fail most of the time, and that all options are initialized with maximum scores. The goal is to find the design that best motivates the user to take some action (e.g. click a link), and the assumption is that most of the time the user will not take that action. By starting all of the choices at a high value, they will all gradually converge downward to their true effectiveness rate, at which point the most effective will be chosen nearly all of the time. During the convergence process, the "leader" may change, but if the current leader isn't the true best, as it gets driven towards its true rate, it will eventually dip under one of the others.

      If, by chance, a more effective option has a really bad run early on and gets pushed below the true effectiveness rate of another option, it would never recover -- which is why the author includes an occasional randomly-selected choice. If there is a large difference between the effectiveness of the options this is really unlikely to happen, but in the rare event it happens the randomization will eventually fix it. The author also covers a method of handling the case where the audience preferences drift over time, by including the ability to "forget" old input via simple exponential decay.
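
      A rough Python sketch of the scheme described above, not the article's actual code: every design starts at an optimistic rate of 1.0, the leader is shown most of the time, a random design is shown with probability epsilon, and an optional decay factor forgets old input. The names (Design, choose, record, simulate), the 10% exploration rate, applying decay only to the design that was just shown, and the true click rates are all assumptions for illustration.

```python
import random


class Design:
    def __init__(self, name):
        self.name = name
        # Optimistic start: one show, one success, so the estimated rate is 1.0.
        self.shows = 1.0
        self.successes = 1.0

    def rate(self):
        return self.successes / self.shows


def choose(designs, epsilon=0.1, rng=random):
    """Mostly show the current leader; occasionally show a random design."""
    if rng.random() < epsilon:
        return rng.choice(designs)
    return max(designs, key=Design.rate)


def record(design, success, decay=1.0):
    """Update tallies; decay < 1 gradually forgets old observations."""
    design.shows = design.shows * decay + 1.0
    design.successes = design.successes * decay + (1.0 if success else 0.0)


def simulate(true_rates, visitors=20000, seed=1):
    """Run the loop against made-up true click rates and return the leader."""
    rng = random.Random(seed)
    designs = [Design(name) for name in true_rates]
    for _ in range(visitors):
        shown = choose(designs, rng=rng)
        record(shown, rng.random() < true_rates[shown.name])
    return max(designs, key=Design.rate).name


if __name__ == "__main__":
    print(simulate({"orange": 0.02, "green": 0.2, "white": 0.02}))
```

      With decay left at 1.0 nothing is forgotten; a value slightly below 1.0 weights recent visitors more heavily, which is one way to follow drifting preferences.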

      The only really bad thing about this approach is that it assumes you don't have a lot of repeat visitors. If you do, they'll be annoyed by seeing different versions, apparently at random (from their perspective).

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  3. This is not exclusively machine learning by Anonymous Coward · · Score: 5, Insightful

    This is not "machine learning" substituting for human A/B testing. It's just changing the ratio of the number of visitors exposed to the "new" feature to be tested from 50% to 10%, while keeping the rest (90%) of the visitors using the "best so far" feature. There's also a bit of randomness thrown in when choosing which new feature the 10% of visitors get to test.
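
    As a back-of-the-envelope check on those ratios, here is a small Python sketch. It assumes the exploring 10% are spread uniformly over all designs, the current leader included; that detail and the function name are assumptions, not something stated above.

```python
def exposure_shares(num_designs, explore=0.1):
    """Expected share of visitors per design under a 90/10 split.

    Returns (share for the current leader, share for each other design).
    """
    per_design = explore / num_designs
    leader_share = (1.0 - explore) + per_design
    return leader_share, per_design


leader, other = exposure_shares(3)
print(round(leader, 4), round(other, 4))
```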

    In this scheme, the human visitors are still doing the A/B testing, it's just that determination of which human is testing which feature dynamically adapts over time.

    Now, if this guy had substituted human A/B testing completely with a machine learning technology that could somehow determine which feature is better without any input from humans, then I'd be impressed. That's kind of what the summary and article imply. But that's not what he's done. He's just being a bit more sophisticated regarding which humans get to test which feature.

    He's also made a big fat claim regarding the effectiveness of his method with zero evidence to back it up. Theoretical results regarding multi-armed bandit problems are quite a far cry from real-world results regarding website feature selection. I'm looking forward to seeing some results of the proposed method on the latter.

    1. Re:This is not exclusively machine learning by tgv · · Score: 2

      Indeed, this has no relation to machine learning, whatsoever. The summary is once again ... deceptive.

      And I'm sure the proof, that the best one gets chosen, doesn't exist. I'm also sure that this way of choosing an interface has a high probability of choosing the preferred one, but there is also a big difference with A/B testing: you'll never know how big the difference between the two is. In straightforward testing with two groups (which is not really A/B, by the way: that is alternating between A and B and then asking the subject to choose the best one; it has its origins in perceptual testing, where ABX testing is preferred), you can find out the difference in scores. Here you can't.
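
      For the two-group case, "the difference in scores" can be sketched in Python as a difference in click rates with a normal-approximation 95% interval. The function name and the sample counts are hypothetical; the approximation assumes two independent, reasonably large groups.

```python
import math


def rate_difference(shows_a, clicks_a, shows_b, clicks_b, z=1.96):
    """Difference in click rates (A minus B) with an approximate interval."""
    p_a = clicks_a / shows_a
    p_b = clicks_b / shows_b
    diff = p_a - p_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / shows_a + p_b * (1 - p_b) / shows_b)
    return diff, (diff - z * se, diff + z * se)


diff, (low, high) = rate_difference(1000, 50, 1000, 30)
print(diff, low, high)
```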

  4. This Is News? by hondo77 · · Score: 2

    Throwing up banner ads with different color schemes and automatically re-weighting them based on click-through % is something I was doing well over ten years ago. This can't really be news, can it?

    --
    I live ze unknown. I love ze unknown. I am ze unknown.
  5. The article's premise is entirely wrong by RandCraw · · Score: 5, Insightful

    A/B focus testing is about observing how customers or users choose between two alternatives based on their qualitative sense of aesthetics. ML is about classifying data based on quantifying the data into defined classes or toward optimal values.

    Predicting the outcome of a focus group is a completely different problem than multi-armed slot machines. In focus groups there is no objective metric, so focus group problems are not amenable to machine learning unless your machine can define, measure, and perhaps predict aesthetic criteria.

    Now THAT I'd like to see.

    1. Re:The article's premise is entirely wrong by retchdog · · Score: 2

      i don't know what the fuck a "double-blind" focus group is, since the user is clearly not blind to the design (this is the entire point).

      and the reason why this is "like" a focus group, is that it is a focus group. all the information is coming from humans; it's just being used in a not-completely-idiotic way.

      it's such an obvious idea it's surprising that no one has done this yet. oh, wait: http://m6d.com/about/about-us/

      "Because the approach is rooted in machine learning, it continuously updates advertising decisions based on real-time signals from a marketer’s customer base. That feedback loop allows us to improve advertising performance over time."

      --
      "They were pure niggers." – Noam Chomsky
  6. Bayesian modelling and experiment design by HalfFlat · · Score: 2

    It's a 'good-enough' approximation to an optimal selection process.

    The probability of someone clicking on option A, B or C is unknown, but is expected to be constant when averaged over the population. Given the ratio of clicks versus views on any given option, the posterior distribution of that probability can be modelled as a Beta distribution. The experimental question is then: given the current estimates, which option should be presented to maximise the utility of the test?
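
    A small Python sketch of that posterior, assuming a uniform Beta(1, 1) prior (the prior and the function name are assumptions): after some views and clicks, the click probability is modelled as Beta(1 + clicks, 1 + views - clicks).

```python
def beta_posterior(views, clicks, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) parameters, mean and variance for a click rate."""
    a = prior_a + clicks
    b = prior_b + (views - clicks)
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, variance


a, b, mean, variance = beta_posterior(views=100, clicks=3)
print(a, b, mean, variance)
```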

    For simply ranking the options, the utility may be the Shannon information. In this case though, the utility also has to incorporate the expected benefit of a click-through. One could set up a utility function which is weighted between the two outcomes, possibly varying over time.

    In practice though, Beta distributions with different means tend to converge to separate peaks quite quickly, so taking a possible 10% hit on the current best estimate click-through outcome seems an entirely plausible approximation. Bayesian experimental design though could also tell you when to stop testing and stick with the winner.
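
    One simple way to act on those separating peaks, sketched in Python: estimate by Monte Carlo the probability that one option's rate exceeds another's, and stop once it is lopsided. This is a heuristic stand-in, not a full Bayesian experimental design; the Beta(1, 1) priors, the 0.95 threshold and the function names are assumptions.

```python
import random


def prob_a_beats_b(views_a, clicks_a, views_b, clicks_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_A > rate_B) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        sample_a = rng.betavariate(1 + clicks_a, 1 + views_a - clicks_a)
        sample_b = rng.betavariate(1 + clicks_b, 1 + views_b - clicks_b)
        wins += sample_a > sample_b
    return wins / draws


def should_stop(prob, threshold=0.95):
    """Stop once either option is the likely winner."""
    return prob >= threshold or prob <= 1 - threshold


p = prob_a_beats_b(5000, 250, 5000, 100)
print(p, should_stop(p))
```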

  7. Er, how about statistical significance? by blach · · Score: 2

    To be valid, the last step (of which the author makes no mention) should be to compare the three groups to see if their differences are statistically significant. With tens of thousands of clicks, it's likely that they are, but the percentages were awfully close in the 2-3% range.
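
    One way to run that comparison, sketched in Python with only the standard library: a Pearson chi-square test of whether three designs share a single click rate. The counts are hypothetical, not the article's figures. With three groups and two outcomes there are 2 degrees of freedom, for which the p-value has the closed form exp(-statistic / 2).

```python
import math


def chi_square_three_groups(groups):
    """Chi-square statistic and p-value for three (views, clicks) groups."""
    if len(groups) != 3:
        raise ValueError("the closed-form p-value assumes exactly three groups")
    total_views = sum(views for views, _ in groups)
    total_clicks = sum(clicks for _, clicks in groups)
    pooled = total_clicks / total_views  # click rate if all designs were equal
    statistic = 0.0
    for views, clicks in groups:
        expected_clicks = views * pooled
        expected_misses = views * (1 - pooled)
        statistic += (clicks - expected_clicks) ** 2 / expected_clicks
        statistic += ((views - clicks) - expected_misses) ** 2 / expected_misses
    return statistic, math.exp(-statistic / 2)


statistic, p_value = chi_square_three_groups(
    [(10000, 200), (10000, 300), (10000, 250)]
)
print(statistic, p_value)
```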