Sunday, February 19, 2012

Recoding variables in R with match()

Most books on R start with a short chapter on the language itself before moving on to data analysis. Of course, statistical analysis is the raison d'être of the R ecosystem, but its data manipulation and preparation functions are also quite powerful.

They can also be a little disconcerting for people used to imperative programming in languages like C, Java or Basic. Many problems that would be solved with a loop in those languages are best handled differently in R, by working directly on high-level structures like vectors.

For example, the match() function can be used to look up the position of an element in a vector:
> ex1 <- c(25, 49, 54, 65)
> match(54, ex1)
[1] 3

A single call can also retrieve the positions of several elements in the same vector:
> match(c(54, 65), ex1)
[1] 3 4
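
One detail worth knowing before relying on this: when a value is not found, match() returns NA rather than raising an error. The short sketch below reuses the ex1 vector from above; the value 30 and the names pos and found are only there for illustration.

```r
ex1 <- c(25, 49, 54, 65)

# 30 is not in ex1, so its position comes back as NA
pos <- match(c(54, 30, 65), ex1)
print(pos)    # [1]  3 NA  4

# %in% answers the simpler question "is each value present at all?"
found <- c(54, 30, 65) %in% ex1
print(found)  # [1]  TRUE FALSE  TRUE
```

Checking for NA in the result is a simple way to spot values that the lookup vector does not cover.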

This functionality makes it easy to recode a variable using a conversion table:
> test <- data.frame(var1 = c("A","A","B","A","C"))
> convtable
  old new
1   A  A1
2   B  A2
3   C  A3
> convtable$new[match(test$var1, convtable$old)]
[1] A1 A1 A2 A1 A3

All the As have been replaced by "A1", all the Bs by "A2", etc. The idea is to look up the position of each element of the test$var1 vector in convtable$old and to use this index to find the new values. All this can be expressed in a single line in R.
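
The two steps can also be written out separately, which makes the lookup easier to follow. This is a minimal sketch: the post only prints convtable, so the way it is built here is an assumption, as are the names idx and var1_new. stringsAsFactors = FALSE is used so that the result is a plain character vector whatever the R version.

```r
# Assumed construction of the conversion table (the post only prints it)
convtable <- data.frame(old = c("A", "B", "C"),
                        new = c("A1", "A2", "A3"),
                        stringsAsFactors = FALSE)
test <- data.frame(var1 = c("A", "A", "B", "A", "C"),
                   stringsAsFactors = FALSE)

# Step 1: position of each value of var1 in the "old" column
idx <- match(test$var1, convtable$old)
print(idx)    # [1] 1 1 2 1 3

# Step 2: use those positions to pick the corresponding "new" values
test$var1_new <- convtable$new[idx]
print(test)
```

Storing the result in a new column keeps the original values around, which is handy for checking the recoding.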

In this case, the same result could be obtained by manipulating the levels of the var1 factor, but the match() approach has several advantages: it works just as well with numeric values as with text or categorical variables (factors), and the conversion table can itself be loaded from a file.
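
As an illustration of those two points, here is a sketch that recodes numeric values with a table read from a CSV file. The file contents, the labels and the names csvfile, codes and scores are all hypothetical; the file is written to a temporary location first only so that the example runs on its own.

```r
# Hypothetical conversion table, written to a temporary CSV file
# so that the example is self-contained
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(old = c(1, 2, 3),
                     new = c("low", "medium", "high")),
          csvfile, row.names = FALSE)

# Load the conversion table back, as one would with a real file
codes <- read.csv(csvfile, stringsAsFactors = FALSE)

# Recode a numeric vector exactly as before
scores <- c(3, 1, 1, 2)
recoded <- codes$new[match(scores, codes$old)]
print(recoded)  # [1] "high"   "low"    "low"    "medium"
```

With a real data set, the conversion table could be maintained in a spreadsheet and exported to CSV, leaving the R code unchanged when the mapping is updated.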

1 comment:

  1. Hmm, Quite Interesting.
    Are you coming March 6, to brew beer?
    Best, Jeroen
