Collapse

I recently came across this resource, https://lost-stats.github.io/Data_Manipulation/data_manipulation.html, which is really helpful for anyone interested in learning how to implement basic data and estimation tasks in both Stata and R. Unfortunately, the R implementations rely on dplyr, part of the tidyverse. While I understand that many people prefer this approach, my concern is that new users of R will think dplyr is the only way to work with data in R. There is an alternative: data.table.

To give an example, I wrote a short script (“collapse”) based on the code given on lost-stats, showing how to do the same thing in data.table. The task is to take the “storms” data set (shipped with dplyr) and aggregate three variables by storm name, year, month, and day, which collapses the data from 10,010 observations to 2,777. This must be one of the most common operations most of us perform on data.
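For concreteness, here is a minimal sketch of the collapse in both flavors. The post does not say which three variables are aggregated, so I use wind, pressure, and tropicalstorm_force_diameter purely for illustration (column names in storms have changed across dplyr releases; substitute whichever three you need).

```r
library(dplyr)       # provides the storms data set and the tidyverse verbs
library(data.table)

# dplyr version: group, then take means within each group
storms_tidy <- storms %>%
  group_by(name, year, month, day) %>%
  summarize(wind     = mean(wind),
            pressure = mean(pressure),
            diameter = mean(tropicalstorm_force_diameter),  # assumed third variable
            .groups  = "drop")

# data.table version: convert once, then aggregate with the j/by syntax
storms_dt <- as.data.table(storms)
storms_collapsed <- storms_dt[, .(wind     = mean(wind),
                                  pressure = mean(pressure),
                                  diameter = mean(tropicalstorm_force_diameter)),
                              by = .(name, year, month, day)]
```

Note that the original storms object is untouched in both cases; the collapsed result lands in a new object alongside it.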

I then benchmarked it and found that data.table is about 100 times faster than dplyr on this task for this data set. I also show how to use RStata to run Stata code from within R (if you have Stata). Stata was also very fast, even run from inside R (about 50 times faster than dplyr). One thing I like about the way R handles this is that both the pre- and post-collapse versions of the data remain in memory.
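A sketch of how such a benchmark might look with the microbenchmark package (reusing storms_dt from above); the exact speedup will depend on hardware, package versions, and the variables aggregated, so treat the 100x figure as specific to this setup.

```r
library(microbenchmark)

# Time both implementations of the same collapse, 100 runs each
microbenchmark(
  dplyr = storms %>%
    group_by(name, year, month, day) %>%
    summarize(wind = mean(wind), pressure = mean(pressure), .groups = "drop"),
  data.table = storms_dt[, .(wind = mean(wind), pressure = mean(pressure)),
                         by = .(name, year, month, day)],
  times = 100
)
```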
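And a sketch of the RStata route, assuming you have a licensed Stata installation; the path and version below are placeholders you would set for your own machine.

```r
library(RStata)

# Point RStata at your local Stata executable (placeholder path; adjust as needed)
options("RStata.StataPath"    = "\"C:/Program Files/Stata17/StataMP-64\"")
options("RStata.StataVersion" = 17)

# Run Stata's collapse on the same data and return the result to R
stata("collapse (mean) wind pressure, by(name year month day)",
      data.in  = as.data.frame(storms),
      data.out = TRUE)
```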