by Dhananjay Ghei
Many blog articles have demonstrated clustered standard errors in R, either by writing a function, by manually adjusting the degrees of freedom, or both. These methods closely approximate the standard Stata results, but they do not apply the small-sample correction that Stata does.
In recent months, elegant solutions have appeared in R that extend the available functionality and yield substantial improvements in speed. I use Petersen's test dataset, which is the workhorse of this field.
In regression analysis, getting accurate standard errors is as crucial as obtaining unbiased and consistent estimates of the regression coefficients. Standard errors measure the precision of the coefficient estimates and therefore drive hypothesis tests.
The correct standard errors depend on the underlying structure of the data. Here we consider cases where the error terms of the model are independent across groups but correlated within groups. Examples include cross-sectional studies of individuals with clustering at the village, state or hospital level, and difference-in-differences regressions with clustering at the group level. Clustered standard errors allow for a general structure of the variance-covariance matrix by letting errors be correlated within clusters but not across clusters. In such cases, ignoring clustering can produce misleadingly small standard errors, narrow confidence intervals and small p-values.
Clustered standard errors can be obtained in two steps: first, estimate the regression model without any clustering; then compute the clustered errors from the residuals. The estimator is consistent as the number of clusters goes to infinity. With a finite number of clusters, however, the variance-covariance matrix is biased downward. A common correction for this bias is a degrees-of-freedom adjustment for the finite number of clusters.
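As a minimal sketch of these two steps, the base R code below builds the clustered sandwich estimator by hand. The simulated data, the variable names and the seed are illustrative assumptions and are not Petersen's dataset. The correction factor shown, G/(G-1) x (N-1)/(N-K), is the one Stata applies for reg with clustered errors; other commands may use a different factor.

```r
# Simulated data (hypothetical): 50 clusters with 10 observations each
set.seed(42)
G <- 50
n.per <- 10
cluster.id <- rep(seq_len(G), each = n.per)
x <- rnorm(G * n.per) + rep(rnorm(G), each = n.per)
u <- rnorm(G * n.per) + rep(rnorm(G), each = n.per)  # errors correlated within cluster
y <- 1 + 0.5 * x + u

# Step 1: estimate the model without any clustering
fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)
N <- nrow(X)
K <- ncol(X)

# Step 2: build the sandwich estimator from the residuals
bread <- solve(crossprod(X))                 # (X'X)^{-1}
scores <- rowsum(X * e, group = cluster.id)  # G x K matrix; row g holds X_g' e_g
meat <- crossprod(scores)                    # sum over clusters of X_g' e_g e_g' X_g
vcov.cl <- bread %*% meat %*% bread

# Degrees-of-freedom correction for a finite number of clusters
dfc <- (G / (G - 1)) * ((N - 1) / (N - K))
se.cluster <- sqrt(diag(dfc * vcov.cl))
se.ols <- sqrt(diag(vcov(fit)))

print(rbind(ols = se.ols, clustered = se.cluster))
```

Because the correction factor is always greater than one, it inflates the uncorrected clustered variance, and the inflation shrinks as the number of clusters grows.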
R and Stata codes
The code below shows how to compute clustered standard errors in R, using the plm and lmtest packages. Petersen's dataset can be loaded directly from the multiwayvcov package. Pooled OLS and fixed effect (FE) models are estimated using the plm package.
    # Loading the required libraries
    library(plm)
    library(lmtest)
    library(multiwayvcov)

    # Loading Petersen's dataset
    data(petersen)

    # Pooled OLS model
    pooled.ols <- plm(formula=y~x, data=petersen, model="pooling", index=c("firmid", "year"))

    # Fixed effects model
    fe.firm <- plm(formula=y~x, data=petersen, model="within", index=c("firmid", "year"))
Clustered standard errors can be computed in R using the vcovHC() function from the plm package. vcovHC.plm() estimates the robust covariance matrix for panel data models. Its result can be passed as an argument to functions such as coeftest(), waldtest() and other methods in the lmtest package. Clustering is controlled by the cluster argument, which allows clustering on either group or time. The type argument selects the weighting scheme used for the heteroskedasticity-consistent estimator. Recently, the plm package introduced the small-sample correction as an option of the type argument of vcovHC.plm(). It is switched on by specifying type="sss".
    # OLS with SE clustered by firm (Petersen's Table 3)
    coeftest(pooled.ols, vcov=vcovHC(pooled.ols, type="sss", cluster="group"))

    # OLS with SE clustered by time (Petersen's Table 4)
    coeftest(pooled.ols, vcov=vcovHC(pooled.ols, type="sss", cluster="time"))

    # FE regression with SE clustered by firm
    coeftest(fe.firm, vcov=vcovHC(fe.firm, type="sss", cluster="group"))

    # FE regression with SE clustered by time
    coeftest(fe.firm, vcov=vcovHC(fe.firm, type="sss", cluster="time"))
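The standard errors printed by coeftest() can also be pulled out directly from the covariance matrix, which is convenient when collecting them into a table. The sketch below assumes the plm and multiwayvcov packages are installed; the object names se.firm and se.time are chosen here for illustration.

```r
library(plm)
library(multiwayvcov)  # used here only for the petersen dataset

data(petersen)
pooled.ols <- plm(y ~ x, data = petersen, model = "pooling",
                  index = c("firmid", "year"))

# Standard errors are the square roots of the diagonal of the covariance matrix
se.firm <- sqrt(diag(vcovHC(pooled.ols, type = "sss", cluster = "group")))
se.time <- sqrt(diag(vcovHC(pooled.ols, type = "sss", cluster = "time")))

# Standard error of the slope on x under each clustering choice
print(round(c(firm = se.firm[["x"]], time = se.time[["x"]]), 5))
```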
Stata makes it easy to cluster, by adding the cluster option at the end of any routine regression command (such as reg or xtreg). The code below shows how to cluster in OLS and fixed effect models:
    webuse set http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/
    webuse test_data.dta, clear

    * OLS with SE clustered by firm (Petersen's Table 3)
    reg y x, vce(cluster firmid)

    * OLS with SE clustered by time (Petersen's Table 4)
    reg y x, vce(cluster year)

    * Declaring dataset to be a panel
    xtset firmid year

    * FE regression with SE clustered by firm
    xtreg y x, fe vce(cluster firmid)

    * FE regression with SE clustered by time
    xtreg y x, fe vce(cluster year) nonest
The table given below shows a comparison of the standard errors computed by R and Stata. The standard errors computed from R and Stata agree up to the fifth decimal place.
| Model | SE (in R) | SE (in Stata) |
|---|---|---|
| OLS with SE clustered by firm | 0.05059 | 0.05059 |
| OLS with SE clustered by time | 0.03338 | 0.03338 |
| FE regression with SE clustered by firm | 0.03014 | 0.03014 |
| FE regression with SE clustered by time | 0.02668 | 0.02668 |
I benchmark Stata MP against R for each of these models on a quad-core processor. The results show that R is faster than Stata. To parallelise, I set the number of processors used by Stata MP to 4. An example of the benchmarking code in Stata is given below:
    * Stata benchmarking program : Example
    set processors 4
    timer clear
    timer on 1
    bs, nodrop reps(1000) seed(1): reg y x
    timer off 1
    timer list
Parallelisation in R is done using standard R packages. An example of the benchmarking code in R is given below:
    # R benchmarking program : Example
    library(doParallel)
    library(rbenchmark)
    set.seed(1)
    n.cores <- detectCores()
    cl <- makeCluster(n.cores)
    ols.benchmark <- mcparallel(benchmark(lm(y~x, petersen), replications=1000))
    mccollect(ols.benchmark)
    stopCluster(cl)
The table below compares R and Stata MP for each of these models. The average time is the ratio of elapsed time to the number of replications. Relative efficiency is the ratio of the average time taken by Stata MP to the average time taken by R. R is faster in every case, by a factor between 1.33 and 3.61.
| Model | Replications | Average time (R - 4 core) | Average time (Stata MP - 4 core) | Relative efficiency |
|---|---|---|---|---|
| OLS with SE clustered by firm | 1000 | 0.0737 | 0.1635 | 2.22 |
| OLS with SE clustered by time | 1000 | 0.0557 | 0.0742 | 1.33 |
| FE regression with SE clustered by firm | 1000 | 0.0880 | 0.3176 | 3.61 |
| FE regression with SE clustered by time | 1000 | 0.0729 | 0.1118 | 1.53 |
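The relative efficiency column can be recomputed from the two average-time columns. The short sketch below does this in base R; the data frame and its column names are introduced here for illustration, and the numbers are copied from the table.

```r
# Average times copied from the benchmark table
avg.time <- data.frame(
  model = c("OLS, firm", "OLS, time", "FE, firm", "FE, time"),
  r     = c(0.0737, 0.0557, 0.0880, 0.0729),
  stata = c(0.1635, 0.0742, 0.3176, 0.1118)
)

# Relative efficiency = average time in Stata MP / average time in R
avg.time$rel.eff <- round(avg.time$stata / avg.time$r, 2)
print(avg.time)
```

A value above one means that Stata MP took longer on average than R for that model.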
Multi-level clustering in R
Two-way clustering is not a routine option in most Stata commands (the exceptions are ivreg2 and xtivreg2). Some user-written code for two-way clustering is available online (see, for example, here and here). In R, this is easily handled with the vcovDC.plm() function, which is used in the same way as vcovHC.plm().
    # OLS with SE clustered by firm and time (Petersen's Table 5)
    coeftest(pooled.ols, vcov=vcovDC(pooled.ols, type="sss"))
A more recent addition, the multiwayvcov package, is useful for clustering on multiple levels and for computing bootstrapped clustered standard errors. The package supports parallelisation, which makes it easier to work with large datasets. It exports two functions: cluster.vcov() computes clustered standard errors, whereas cluster.boot() calculates bootstrapped clustered standard errors. The code for replicating Petersen's results is available in the reference manual of the package. One limitation of cluster.vcov() is that it cannot work with plm objects. This is because the package imports estfun() from the sandwich package, which is not compatible with plm objects.
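Since cluster.vcov() does not accept plm objects, one way to use it is to fit the pooled model with lm() instead. The following is a minimal sketch along the lines of the package's reference manual, assuming multiwayvcov and lmtest are installed; the object names are chosen here for illustration.

```r
library(multiwayvcov)
library(lmtest)

data(petersen)

# Pooled OLS fitted with lm(), because cluster.vcov() does not accept plm objects
m1 <- lm(y ~ x, data = petersen)

# One-way clustering by firm
vcov.firm <- cluster.vcov(m1, petersen$firmid)
print(coeftest(m1, vcov.firm))

# Two-way clustering by firm and year
vcov.both <- cluster.vcov(m1, cbind(petersen$firmid, petersen$year))
print(coeftest(m1, vcov.both))
```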
Here's the R code to reproduce the results.
Dhananjay Ghei is a researcher at the National Institute of Public Finance and Policy. He thanks Ajay Shah, Vimal Balasubramaniam and Apoorva Gupta for valuable discussions and feedback.