Friday, December 7, 2012

Repeated tests, product tests and A/B tests

The Atlantic just published a profile of Uri Simonsohn (noticed via Andrew Gelman and Sanjay Srivastava). Simonsohn recently made headlines by uncovering several cases of scientific fraud, but the article also reminded me of some interesting earlier work by Simonsohn's group on how some simple mistakes or “tricks” can severely threaten the interpretation of standard psychological tests (Simmons, Nelson & Simonsohn, 2011, ungated copy).

One of the central examples in Simmons' paper is how using repeated interim analyses to decide whether or not to halt an experiment (“peeking” at the data) can increase the error rate well beyond the level set in each individual test. They frame the discussion in terms of “spurious findings” in psychology (where continuing data collection based on the result of a statistical test seems quite common), but the issue is relevant for all settings in which similar techniques are used, including user experience research and A/B testing for web applications.

Most non-statisticians might not even realize why this is a problem, and one of the main contributions of Simmons' paper is to show the consequences of these interim analyses using a simple simulation. It is very easy to reproduce in R, play with the various parameters, and see how they impact the results.

Here is a little piece of code I wrote some time ago to do just that:

# p-value of a t-test comparing the two groups in the first n rows of the data
testsubset <- function(n, data) {
  t.test(val ~ iv, data[1:n,])$p.value
}

# simulate one experiment under the null: test after nmin observations per group,
# then add step observations per group between tests, up to at most nmax per group;
# return TRUE if any of the interim tests is significant at the .05 level
seqtest <- function(nmin, nmax, step) {
  df <- data.frame(val = rnorm(nmax*2), iv = rep(1:2, nmax))
  any(sapply(seq(nmin*2, nmax*2, step*2), testsubset, df) <= .05)
}

The second function generates some data and then runs a series of tests, starting at nmin observations per group and going up to at most nmax, adding step participants to each group between tests. Since all the data come from the same (random) distribution, there is no real difference between the two groups, and a statistical test at the standard .05 error level should only reject the null hypothesis 5% of the time.

Running 10000 simulations to find out how often this really happens in Simmons' best-case scenario (the lower-right point in the first figure in the paper) is as simple as:

result <- replicate(10000, seqtest(20,50,20))
mean(result)
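
Since the result is itself a Monte Carlo estimate, it may also be worth fixing the random seed and attaching a rough margin of error to the estimated rate. A small sketch (the seed value and number of replications are arbitrary choices of mine):

set.seed(42)  # arbitrary seed, only so that the numbers can be reproduced
result <- replicate(10000, seqtest(20, 50, 20))
p.hat <- mean(result)
# rough 95% confidence interval based on the usual binomial standard error
p.hat + c(-1, 1) * 1.96 * sqrt(p.hat * (1 - p.hat) / length(result))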

The nice thing about these simulations is that they can easily be adapted to other scenarios. For example, the sample sizes in the Bem paper I blogged about before suggest that the data were collected in slices of 50 participants, going up to at least 200 if the results were not satisfying. How likely is it to end up with a significant test result under these conditions?

result <- replicate(10000, seqtest(25,100,25))
mean(result)

The simulation shows that this alone is enough to yield more than a 12% chance of obtaining a statistically significant difference under the null hypothesis, well above the nominal 5% error level of each individual test.
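
For comparison, if the four looks at the data (after 50, 100, 150 and 200 participants) were independent tests, the chance of at least one false positive would be about 18.5%; the simulated rate is lower because tests on overlapping samples are positively correlated, but it remains far above the nominal 5%. A quick sanity check in R:

# probability of at least one rejection in four independent tests at the 5% level
1 - 0.95^4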

All these examples are pretty typical of social psychology experiments and not too different from what is done in many product or usability tests, but the same problem can also come up elsewhere. For example, in A/B tests comparing different versions of the same web application, it is very common to work with much larger sample sizes, and it is not very difficult to automate the analysis and run tests repeatedly as the data come in. Allen Downey's simulations for this scenario show that this only makes the problem worse.
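
As a rough illustration (my own sketch, not Downey's exact setup), the seqtest function defined above can mimic this kind of continuous monitoring by testing after every new pair of observations; with that many looks at the data, the estimated false positive rate climbs well beyond the previous examples:

# test after every additional pair of observations, from 10 to 200 per group
# (this runs many t-tests, so fewer replications are used to keep it quick)
result <- replicate(1000, seqtest(10, 200, 1))
mean(result)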

In fact, all this has long been described in the statistics literature, where it has been called “sampling to reach a foregone conclusion” (Anscombe, 1954), and it has also been used to support a more general critique of significance tests (e.g. Wagenmakers, 2007, ungated PDF copy). These simulations show how bad the problem can really be in practice.
