12 common Stata commands done (in un-tidy fashion) in R

Seeing today that more and more people are considering giving up on Stata because of its horrible pricing policies, I offer this collection of very frequently used Stata commands and how to do roughly the same thing in R. But not with dplyr and pipes and all that. This how-to uses only base R, data.table, and fixest.

  1. Insheet, import a CSV file
    • R data.table::fread()
    • R advantages: faster for huge files, usually guesses what you want, can work with files from internet or zipped files
  2. Replace y if x something
    • DT[, y := x] # initialize a new variable equal to existing variable x in the data.table called DT
    • DT[x>5, y := 5]  # now censor the y version
    • DT[is.na(x), y := 0] # now code the missing x to be 0s in the y variable
  3. Rename variables:
    • setnames(DT,old=c("blah","blah_blah"), new=c("x1","x2")) renames blah to x1 and blah_blah to x2. All other variables keep their names.
    • The old variables can be just a range of columns as in old=5:12
  4. gsort x -y: setorder(DT,x,-y)
  5. Merge
    • DT1 <- merge(DT1,DT2,by=c("id1","id2"),all.x=TRUE) is like Stata’s merge 1:1 id1 id2 using DT2.dta followed by drop if _merge==2
    • if you don’t want to drop anything, use all=TRUE
    • R’s version of merge allows the id vars to have different names in DT1 and DT2, e.g. suppose your country code is “iso” in DT1 but “iso3” in DT2, and similarly “year” and “yr”. Then you write DT1 <- merge(DT1,DT2,by.x=c("iso","year"),by.y=c("iso3","yr"),all.x=TRUE)
  6. Collapse and egen: these are the two commands I use the most and I still call them by their Stata names
    • collapse (mean) x_i = x_it, by(year) would be DTc <- DT[,.(x_i =mean(x_it)),by=year]
    • egen x_i = mean(x_it), by(year) would be DT[,x_i := mean(x_it),by=year]
    • if there is more than one by variable, both commands use by=.(iso,year)
    • Note 1: the result of the collapse must be assigned (“sent”) to a data.table. If we don’t need the original DT, we could assign it back to DT with <-.
    • Note 2: The “:=” is essential in the egen-equivalent. It does something called “modification by reference”
  7. Reshape long or wide. While the syntax of reshape is often said to be hard to remember, data.table’s melt and dcast are even harder to keep straight but they are very flexible and powerful functions.
    • reshape long MFN_, i(iso_d year) j(product) string
    • DTl <- melt(DTw,id.vars=c("iso_d","year"),measure=patterns("^MFN_"),value.name="MFN",variable.factor = FALSE)
      DTl[,product := substr(variable,5,nchar(variable))] # extract code
      DTl[,variable := NULL] # we don’t need “variable” anymore
    • the reshape wide is done with dcast()
    • DTw2 <- dcast(DTl,iso_d+year~product,value.var = "MFN") # the variables in the formula before the "~" are id vars that stay as rows. the variable to the right of "~" is the column variable.
    • in contrast to “reshape wide” dcast on a single value.var will name the columns after the “product” variable
  8. Drop/keep conditionally: In data.table you drop by selectively keeping.
    • Drop if x==. | y==0
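    • DT <- DT[!is.na(x) & y!=0] # unlike the Stata line, this also drops rows where y is missing, because data.table treats an NA in the filter as FALSE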
  9. Save/use: The first thing you need to do when getting started is go to RStudio’s preferences settings and make sure you NEVER save or restore the workspace in .RData.  Furthermore, you should make a practice of restarting the R session fairly regularly.  But that’s just an aside. I save data as RDS files, which are automatically compressed.
    • use is done by DT <- readRDS("path_to_file/file.rds")
    • save is saveRDS(DT,"path_to_file/file.rds")
  10. reghdfe (AKA AKM). The pioneer was Simon Gaure’s excellent lfe::felm(). But for most purposes you should probably use Laurent Bergé’s fixest::feols(). Not only is it faster, but it will make it easier to transition to the various GLMs in item 11.
    • res.ols <- fixest::feols(log(y) ~ educ + age+ I(age^2) | worker_id +firm_id,data=DT)
    • summary(res.ols)
  11. ppmlhdfe
    • res.ppml <- fixest::feglm(y~educ+age +I(age^2) | worker_id +firm_id, combine.quick=FALSE,family="poisson",data=DT)
    • Setting combine.quick to FALSE is useful only when you want to do something later with the fixed effects and need to know which worker or firm they correspond to.
  12. esttab (LaTeX table of regression results). There are many options for LaTeX tables in the R world, with stargazer probably the best known, but I have found fixest’s built-in etable() function to be quite adequate.
    • fixest::etable(res.ols,res.ppml,sdBelow = TRUE,digits=3,fitstat=~sq.cor+pr2, tex=TRUE,file="Tables/AKM_regs.tex",signifCode = "letters",cluster="worker_id",replace = TRUE)
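
To see how the data.table items above fit together, here is a minimal sketch that runs several of them (reshape, replace, rename, gsort, egen/collapse, merge, drop, save/use) end to end on a tiny made-up table. Everything about the data is an assumption for illustration: the values, the second table DT2 with its gdp column, and the names MFN_censored and MFN_mean are not from any real data set. The melt step lists the measure columns with grep() instead of patterns(), which is just an alternative way of selecting the same columns.

```r
library(data.table)

# Toy data (hypothetical): two importers, two years, tariffs stored wide by product
DTw <- data.table(
  iso_d = c("CAN", "CAN", "FRA", "FRA"),
  year  = c(2000L, 2001L, 2000L, 2001L),
  MFN_A = c(1, 2, 3, NA),
  MFN_B = c(0, 5, 7, 9)
)

# Item 7, reshape long: melt the MFN_* columns, then extract the product code
mfn_cols <- grep("^MFN_", names(DTw), value = TRUE)
DTl <- melt(DTw, id.vars = c("iso_d", "year"), measure.vars = mfn_cols,
            value.name = "MFN", variable.factor = FALSE)
DTl[, product := substr(variable, 5, nchar(variable))]
DTl[, variable := NULL]

# Item 2, replace: copy MFN, censor values above 5, code missing values as 0
DTl[, y := MFN]
DTl[MFN > 5, y := 5]
DTl[is.na(MFN), y := 0]

# Item 3, rename; item 4, gsort iso_d -year
setnames(DTl, old = "y", new = "MFN_censored")
setorder(DTl, iso_d, -year)

# Item 6, egen (adds a column by reference) and collapse (creates a new table)
DTl[, MFN_mean := mean(MFN, na.rm = TRUE), by = .(iso_d, year)]
DTc <- DTl[, .(MFN_mean = mean(MFN, na.rm = TRUE)), by = .(iso_d, year)]

# Item 5, merge on differently named keys; all.x = TRUE keeps every row of DTl
DT2 <- data.table(iso3 = c("CAN", "FRA"), yr = c(2000L, 2000L), gdp = c(10, 20))
DTm <- merge(DTl, DT2, by.x = c("iso_d", "year"), by.y = c("iso3", "yr"),
             all.x = TRUE)

# Item 8, drop if MFN==. | MFN_censored==0
DTk <- DTm[!is.na(MFN) & MFN_censored != 0]

# Item 7, reshape wide again: one column per product
DTw2 <- dcast(DTl, iso_d + year ~ product, value.var = "MFN")

# Item 9, save and use, here with a temporary file instead of a real path
rds_path <- tempfile(fileext = ".rds")
saveRDS(DTk, rds_path)
DT_back <- readRDS(rds_path)
```

Note that DTw, DTl, DTc and the rest all stay in memory side by side, so you can inspect each step by printing the table it produced.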

10 places to get started with data.table

A drawback of R compared to Stata and, within R, of data.table compared to tidyverse, is that documentation is spread out all over the place. You end up doing a lot of googling and spending time at Stack Overflow. The problem is that many R users now think the tidyverse, and dplyr in particular, is all there is. I started with base R and it is still what I use for many things. But when I began to switch my data “wrangling” from Stata to R, I got started on dplyr, only to learn (from Julian Hinz) that data.table is much faster. After initial struggles I realized that the syntax of data.table fit much better with my intuitions of the right way to do things. I am in the minority on this topic but not alone. In an effort to convince more people to take advantage of the powerful syntax and awesome speed of data.table, I am posting this set of resources.

These items are not ordered for beginners. I started with number 8 (which, as the title suggests, was probably not optimal!), but 4 and then 3 might be ideal starting points. According to my notes, 9 (and its sequels) is also a good starting place.

  1. https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly This is not actually a good starting point at all but if you’re already a dplyr user you might look at this to decide how interested you might be in data.table
  2. https://atrebas.github.io/post/2019-03-03-datatable-dplyr/ Comprehensive correspondence between dplyr and data.table ways to do things
  3. https://github.com/chuvanan/rdatatable-cookbook. Here’s a set of chapters showing how to do common tasks in data.table.
  4. https://github.com/Rdatatable/data.table/wiki//talks/useR2019_Arun.pdf Great slides introduce data.table’s three key components.
    Also illustrates .SD with patterns, a very powerful method.
  5. Frequently Asked Questions: http://datatable.r-forge.r-project.org/datatable-faq.pdf
  6. The cheat sheet to print out and pin to the wall somewhere nearby. https://s3.amazonaws.com/assets.datacamp.com/blog_assets/datatable_Cheat_Sheet_R.pdf
  7. Many posts have shown how much faster data.table is than the alternatives. This one compares with dplyr, base R, and Python Pandas. https://github.com/szilard/benchm-dplyr-dt And here is some code showing how they did each task in data.table compared to dplyr: https://github.com/szilard/benchm-dplyr-dt/blob/master/bm.Rmd
  8. http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/ Advanced tips and tricks with data.table
  9. https://www.r-bloggers.com/data-table-by-example-part-1/
  10. https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reference-semantics.html Explains about “modifying by reference” which is one of the trickiest topics and most important to understand. Here you will see why “DT <- copy(DT)”  will be an essential line in your functions to avoid unwanted side-effects. More discussion here: https://stackoverflow.com/questions/13756178/writings-functions-procedures-for-data-table-objects
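
The side effect mentioned in item 10 is easiest to see in a small example. This is only a sketch: the two helper functions and the column x2 are hypothetical names made up for illustration. The first helper uses := directly on its argument, so the caller's table gains a column; the second starts with DT <- copy(DT) and leaves the caller's table alone.

```r
library(data.table)

# Hypothetical helper that modifies its argument by reference:
# the caller's table gets the new column as a side effect
add_double_inplace <- function(DT) {
  DT[, x2 := 2 * x]
  invisible(DT)
}

# Same helper, but working on a copy so the caller's table is untouched
add_double_copy <- function(DT) {
  DT <- copy(DT)
  DT[, x2 := 2 * x]
  DT
}

DT_a <- data.table(x = 1:3)
DT_b <- data.table(x = 1:3)

add_double_inplace(DT_a)       # side effect: DT_a now has column x2
res <- add_double_copy(DT_b)   # DT_b is unchanged; only res has x2
```

Which behavior you want depends on the function: modifying in place avoids copying a large table, while copy() is the safer default when the caller does not expect its data to change.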

Collapse

I recently found out about this resource https://lost-stats.github.io/Data_Manipulation/data_manipulation.html which is really helpful for those interested in learning how to implement basic data and estimation tasks in both Stata and R.  Unfortunately, the R implementations rely on dplyr, a part of the tidyverse. While I understand that many people prefer this approach, my concern is that new users of R will think dplyr is the only way to work with data in R. There is an alternative: data.table.

To give an example, I wrote a short script (collapse), based on the code given on lost-stats, showing how to do the same thing in data.table. The task is to take the “storms” data set (provided in dplyr) and aggregate three variables by the storm name, year, month, and day. The data set collapses from 10,010 observations to 2,777 observations. This must be one of the most common operations most of us do with data.

I then benchmarked it and found data.table is 100 times faster on this task for this data set. I also show how to use RStata to run Stata code in R (if you have Stata). Stata was also very fast, even running it from within R (about 50 times faster than dplyr). One thing I like about the way R does it is that you continue to have in memory both the pre- and post-collapse versions of the data.
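
For readers who want to see the shape of such a collapse without installing dplyr, here is a minimal sketch on a made-up stand-in for the storms data. The rows, the choice of the three variables (wind, pressure, category), and the summary functions are all assumptions for illustration and are not taken from the script or from lost-stats.

```r
library(data.table)

# Hypothetical stand-in for the storms data: several readings per storm and day
storms_dt <- data.table(
  name     = c("Amy", "Amy", "Amy", "Bob", "Bob"),
  year     = c(1975L, 1975L, 1975L, 1979L, 1979L),
  month    = c(6L, 6L, 6L, 7L, 7L),
  day      = c(27L, 27L, 28L, 9L, 9L),
  wind     = c(25, 35, 45, 30, 50),
  pressure = c(1013, 1011, 1006, 1010, 1002),
  category = c(0L, 0L, 1L, 0L, 1L)
)

# Collapse to one row per storm-day, aggregating three variables
collapsed <- storms_dt[, .(mean_wind      = mean(wind),
                           mean_pressure  = mean(pressure),
                           first_category = category[1L]),
                       by = .(name, year, month, day)]

# Both the pre- and post-collapse tables are still in memory
n_before <- nrow(storms_dt)
n_after  <- nrow(collapsed)
```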