Friday, December 7, 2012

Repeated tests, product tests and A/B tests

The Atlantic just published a profile of Uri Simonsohn (noticed via Andrew Gelman and Sanjay Srivastava). Simonsohn recently made headlines as he uncovered several cases of scientific fraud, but this article also reminded me of some interesting earlier work done by Simonsohn's group on how some simple mistakes or “tricks” can severely threaten the interpretation of standard statistical tests in psychology (Simmons, Nelson & Simonsohn, 2011, ungated copy).

One of the central examples in Simmons' paper is how using repeated interim analyses to decide whether or not to halt an experiment (“peeking” at the data) can increase the error level well beyond the level set in each individual test. They frame the discussion in terms of “spurious findings” in psychology (where continuing data collection based on the result of a statistical test seems quite common) but the issue is relevant for all settings in which similar techniques are used, including user experience research and A/B testing for web applications.

Most non-statisticians might not even realize why this is bad, and one of the main contributions of Simmons's paper is to show the consequences of these interim analyses using a simple simulation. It is very easy to reproduce in R to play with the various parameters and see how they impact the results.

Here is a little piece of code I wrote some time ago to do just that:

# p-value of a t-test restricted to the first n observations
testsubset <- function(n, data) {
  t.test(val ~ iv, data[1:n, ])$p.value
}

# Simulate one experiment under the null hypothesis: start with nmin
# observations per group, test, add step observations per group and test
# again, until reaching nmax per group. Returns TRUE if any interim test
# is significant at the .05 level.
seqtest <- function(nmin, nmax, step) {
  df <- data.frame(val = rnorm(nmax * 2), iv = rep(1:2, nmax))
  any(sapply(seq(nmin * 2, nmax * 2, step * 2), testsubset, df) <= .05)
}

The second function generates some data and then runs a series of tests, starting at nmin observations per group and going up to nmax, adding step participants to each group between tests. Since all the data come from the same (random) distribution, there is no difference between the two groups and a statistical test at the standard .05 error level should only reject the null hypothesis 5% of the time.

Running 10000 simulations to find out how often this really happens in Simmons' best-case scenario (the lower-right point in the first figure in the paper) is as simple as

result <- replicate(10000, seqtest(20,50,20))
mean(result)
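
Since each replication simply records whether any of the interim tests was significant, a quick way to gauge the precision of this estimate is the usual standard error for a proportion, reusing the result vector from above:

# approximate Monte Carlo standard error of the estimated rate
p <- mean(result)
sqrt(p * (1 - p) / length(result))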

The nice thing about these simulations is that they can easily be adapted to other scenarios. For example, the sample sizes in the Bem paper I blogged about before suggest that the data were collected in slices of 50 participants, going up to at least 200 if the results were not satisfying. How likely is it to end up with a significant test result under these conditions?

result <- replicate(10000, seqtest(25,100,25))
mean(result)

The simulation shows that this alone is enough to have more than a 12% chance of obtaining a statistically significant difference under the null, well above the nominal 5% error level.

All these examples are pretty typical of social psychology experiments and not too different from what is done in many product or usability tests, but the same problem can also come up elsewhere. For example, in A/B tests comparing different versions of the same web application, it is very common to work with much larger sample sizes and it is not very difficult to automate the analysis and perform tests repeatedly as the data come in. Allen Downey's simulations for this scenario show that this only makes the problem much worse.
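
As a rough illustration (the numbers below are made up for the sake of the example, not taken from Downey's post), the same seqtest function can be pushed towards an A/B-testing-like regime with larger samples and more frequent checks:

# hypothetical scenario: test after every 100 observations per variant,
# from 100 up to 2000 observations per variant
# (fewer replications than above to keep the running time reasonable)
result.ab <- replicate(1000, seqtest(100, 2000, 100))
mean(result.ab)

With that many looks at the data, the chance of crossing the 5% threshold at least once under the null grows even further.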

In fact, all this has long been described in the statistics literature, where it has been called “sampling to reach a foregone conclusion” (Anscombe, 1954) and has also been used to support a more general critique of significance tests (e.g. Wagenmakers, 2007, ungated PDF copy). These simulations show how bad the problem can really be in practice.

Tuesday, December 4, 2012

Design & Emotion 2012

The 8th “Design & Emotion” conference took place this year in London. It has already been a couple of months since the end of the conference but I am only getting around to blogging about it now. This was the third time I attended Design & Emotion and it was a great experience as always. A few personal highlights from the program:
  • Pieter Desmet, Martijn Vastenburg, Daan Van Bel and Natalia Romero's “Pic-a-mood; Development and application of a pictorial mood-reporting instrument”, a paper about a promising tool to collect quick self-reports of people's current mood (and I am not saying this because Pieter has been my boss for many years!).

  • Xiaojuan Ma, Jodi Forlizzi, and Steven Dow's “Guidelines for depicting emotions in storyboard scenarios”. The title emphasizes one potential application but the paper also contains an interesting study of the interpretation of different types of cartoons (manga, US comics, etc.) by people from the US and India.

  • Elliott Hedman, Lucy Miller, Sarah Schoen, Darci Nielsen, Matthew Goodwin, and Rosalind Picard's “Measuring autonomic arousal during therapy”. Elliott's presentation was an excellent roundup of the challenges of psychophysiological measurement in applied settings. I talked with him about this and other topics before and I expected that there would be at least some disagreement between us but I actually found myself agreeing with most of what he said (OK, I almost wrote “with everything he said” but I have not read the paper itself and I would not want to commit myself too much!).
This short selection reflects my own interests and research topics but there were many interesting presentations and many papers that seemed well worth checking out. Unfortunately, the proceedings do not seem to be available online… Also, the location of the next edition was not known yet at the end of the conference but it has just been announced and it sounds quite exciting!

New Look!

After a little bit of fiddling with the templates, this blog just got a new look. It does not use anything fancy and I tested it in a couple of web browsers, so it should work everywhere, but do not hesitate to drop me a comment if you see a visual glitch or other problem in the new version.