Friday, December 7, 2012

Repeated tests, product tests and A/B tests

The Atlantic just published a profile of Uri Simonsohn (noticed via Andrew Gelman and Sanjay Srivastava). Simonsohn recently made headlines as he uncovered several cases of scientific fraud but this article also reminded me of some interesting earlier work done by Simonsohn's group on how some simple mistakes or “tricks” can severely threaten the interpretation of standard psychological tests (Simmons, Nelson & Simonsohn, 2011, ungated copy).

One of the central examples in Simmons' paper is how using repeated interim analyses to decide whether or not to halt an experiment (“peeking” at the data) can increase the error level well beyond the level set in each individual test. They frame the discussion in terms of “spurious findings” in psychology (where continuing data collection based on the result of a statistical test seems quite common) but the issue is relevant for all settings in which similar techniques are used, including user experience research and A/B testing for web applications.

Most non-statisticians might not even realize why this is bad and one of the main contributions of Simmons' paper is to show the consequences of these interim analyses using a simple simulation. It is very easy to reproduce in R to play with the various parameters and see how they impact the results.

Here is a little piece of code I wrote some time ago to do just that:

testsubset <- function(n, data) {
  # p-value of a t test comparing the two groups using only the first n rows
  t.test(val ~ iv, data[1:n,])$p.value
}

seqtest <- function(nmin, nmax, step) {
  # two groups drawn from the same normal distribution, nmax observations each
  df <- data.frame(val = rnorm(nmax*2), iv = rep(1:2, nmax))
  # TRUE if at least one of the interim tests is significant at the .05 level
  mean(sapply(seq(nmin*2, nmax*2, step*2), testsubset, df) <= .05) > 0
}

The second function generates some data and then runs a series of tests, starting at nmin observations per group and going up to nmax, adding step participants to each group between tests. Since all the data come from the same (random) distribution, there is no difference between the two groups and a statistical test at the standard .05 error level should only reject the null hypothesis 5% of the time.

Running 10000 simulations to find out how often this really happens in Simmons' best-case scenario (the lower-right point in the first figure in the paper) is as simple as

result <- replicate(10000, seqtest(20,50,20))
mean(result)

The nice thing about these simulations is that they can easily be adapted to other scenarios. For example, the sample sizes in the Bem paper I blogged about before suggest that the data were collected in slices of 50 participants and going at least up to 200 if the results were not satisfying. How likely is it to end up with a significant test result under these conditions?

result <- replicate(10000, seqtest(25,100,25))
mean(result)

The simulation shows that this alone is enough to have more than a 12% chance of obtaining a statistically significant difference under the null hypothesis, well above the nominal 5% error level.

All these examples are pretty typical of social psychology experiments and not too different from what is done in many product or usability tests, but the same problem can also come up elsewhere. For example, in A/B tests comparing different versions of the same web application, it is very common to work with much larger sample sizes and it is not very difficult to automate the analysis and perform tests repeatedly as the data come in. Allen Downey's simulations for this scenario show that it only makes the problem much worse.
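Just to get a feel for the orders of magnitude involved, the seqtest() function defined above can be reused with larger numbers (the sample sizes, peeking frequency and number of replications below are arbitrary values picked for illustration, not the ones used in Downey's simulations):

# Hypothetical scenario: peek after every 10 new participants per group,
# starting at 100 and going up to 1000 per group (illustrative values only)
result <- replicate(1000, seqtest(100, 1000, 10))
mean(result)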

In fact, all this has long been described in the statistics literature, where it has been called “sampling to reach a foregone conclusion” (Anscombe, 1954) and has also been used to support a more general critique of significance tests (e.g. Wagenmakers, 2007, ungated PDF copy). Those simulations show us how bad the problem really can be in practice.

Tuesday, December 4, 2012

Design & Emotion 2012

The 8th “Design & Emotion” conference took place this year in London. It has already been a couple of months since the end of the conference but I am only getting around to blogging about it now. This was the third time I attended Design & Emotion and it was a great experience as always. A few personal highlights from the program:
  • Pieter Desmet, Martijn Vastenburg, Daan Van Bel and Natalia Romero's “Pic-a-mood; Development and application of a pictorial mood-reporting instrument”, a paper about a promising tool to collect quick self-reports of people's current mood (and I am not saying this because Pieter has been my boss for many years!).

  • Xiaojuan Ma, Jodi Forlizzi, and Steven Dow's “Guidelines for depicting emotions in storyboard scenarios”. The title emphasizes one potential application but the paper also contains an interesting study of the interpretation of different types of cartoons (manga, US comics, etc.) by people from the US and India.

  • Elliott Hedman, Lucy Miller, Sarah Schoen, Darci Nielsen, Matthew Goodwin, and Rosalind Picard's “Measuring autonomic arousal during therapy”. Elliott's presentation was an excellent roundup of the challenges of psychophysiological measurement in applied settings. I talked with him about this and other topics before and I expected that there would be at least some disagreement between us but I actually found myself agreeing with most of what he said (OK, I almost wrote “with everything he said” but I have not read the paper itself and I would not want to commit myself too much!).

This short selection reflects my own interests and research topics but there were many interesting presentations and many papers that seemed well worth checking out. Unfortunately, the proceedings do not seem to be available online… Also, the location of the next edition was not yet known at the end of the conference but it has just been announced and it sounds quite exciting!

New Look!

After a little bit of fiddling with the templates, this blog just got a new look. It does not use anything fancy and I tested it in a couple of web browsers so it should work everywhere but do not hesitate to drop me a comment if you see some visual glitch or other problem in the new version.

Sunday, February 19, 2012

Recoding variables in R with match()

Most books on R start with a short chapter on the language itself before moving on to data analysis. Of course, statistical analysis is the raison d'être of the R ecosystem but the data manipulation and preparation functions are also quite powerful.

They can also be a little disconcerting for people used to imperative programming in languages like C, Java or Basic. Many problems that would be solved with a loop in those languages are best handled differently in R, working directly on high-level structures like vectors.

For example, the match() function can be used to look up the position of an element in a vector:
> ex1 <- c(25, 49, 54, 65)
> match(54, ex1)
[1] 3

A single call can also retrieve the position of several elements in the same vector:
> match(c(54, 65), ex1)
[1] 3 4

This functionality can be used to easily recode a variable:
> test <- data.frame(var1 = c("A","A","B","A","C"))
> convtable <- data.frame(old = c("A","B","C"), new = c("A1","A2","A3"))
> convtable
  old new
1   A  A1
2   B  A2
3   C  A3
> convtable$new[match(test$var1, convtable$old)]
[1] A1 A1 A2 A1 A3

All the As have been replaced by "A1", all the Bs by "A2", etc. The idea is to look up the position of each element of the test$var1 vector in convtable$old and to use this index to find the new values. All this can be expressed in a single line in R.

In this case, the same result could be obtained by playing with the levels of the var1 factor but this solution has several advantages: it works just as well with numeric values and text or categorical variables (factors) and the conversion table can itself be loaded from a file.
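
For comparison, here is roughly what the factor-based alternative could look like (a quick sketch that stores the result in a new column I am calling var2 and assumes var1 was created as a factor with its levels in alphabetical order, which is what the data.frame() call above produces under the old stringsAsFactors = TRUE default):

> test$var2 <- test$var1
> levels(test$var2) <- c("A1", "A2", "A3")
> test$var2
[1] A1 A1 A2 A1 A3
Levels: A1 A2 A3

The match() version keeps the mapping in an explicit table and does not depend on the order of the levels.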

Saturday, January 28, 2012

Election season and statistics

The US primary season is now in full swing and it's as good an occasion as any to play with some data. Some interesting stuff from a statistical angle:

Thursday, January 26, 2012

Research as usual

Andrew Gelman just posted a follow-up on one of the big psychological research “scandals” of 2011: Daryl Bem's “Feeling the future” paper. Bem relates a series of experiments showing evidence of paranormal “precognition” (for example predicting on which side of the screen erotic pictures are going to appear or remembering some words better before learning them). Given the nature of the paper, the editors of the journal that published it decided to explain their decision and invite a methodological critique.

The study has been heavily discussed on the web. Since then, another methodological critique and failures to replicate Bem's results have appeared (but it turns out to be more difficult to publish them in the JPSP than a paper arguing for psychic powers). One of the main points on which all commenters agree is that the original studies are actually quite banal methodologically speaking, as far as social psychology experiments go. Tal Yarkoni details all the little flaws that make it possible to find such spurious results but none of them seem very big and all are pretty common in the psychological literature.

All of this reminded me of another little scandal that unfolded last year: Satoshi Kanazawa's blog post proclaiming that “black women are less attractive” (more on the content and the methodological flaws in the analysis). It was not the first time that Kanazawa posted stupid and offensive stuff on his blog but this time it started a storm of controversy, complete with calls to sack him, an official investigation, letters of support and Psychology Today finally caving in to the pressure and removing the text. True, it was not a peer-reviewed article, but the sad thing is that his usual output is not much better and still his supporters are perfectly right when they stress that he has published many articles that were judged sound by reviewers.

The two scandals were quite different but both boil down to offensive and ludicrous findings that still meet the current methodological standards within psychology. This should perhaps tell us something about those standards…

PS: A new post from Andrew Gelman with some thoughts on how to improve the situation just came in as I was writing this entry.

Friday, January 13, 2012

Our programme will resume shortly

It's now been over two years since I last posted anything on this blog. Not that I was blogging intensively before, but I had a pretty good excuse. Now that my PhD is out (it should be available soon on the TU Delft repository), I intend to return to blogging. Let's see how it goes…