Search interesting materials

Wednesday, June 15, 2016

Sophisticated clustered standard errors using recent R tools

by Dhananjay Ghei

Many blog articles have demonstrated clustered standard errors, in R, either by writing a function or manually adjusting the degrees of freedom or both (example, example, example and example). These methods give close approximations to the standard Stata results, but they do not do the small sample correction as the Stata does.

In recent months, elegant solutions have come about in R, which push the envelope on functionality, and yield substantial improvements in speed. I use the test dataset of Petersen which is the workhorse of this field.

The problem

In regression analysis, getting accurate standard errors is as crucial as obtaining unbiased and consistent estimates of the regression coefficients. Standard errors are important in determining the accuracy of the coefficients and thereby, affecting hypothesis testing procedures.

The correct nature of standard errors depends on the underlying structure of the data. For our purposes, we consider cases where the error terms of the model are independent across groups but correlated within groups. For instance, studies with cross-sectional data on individuals with clustering on village/state/hospital level. Another example could be difference in difference regressions with clustering at a group level. Clustered standard errors allow for a general structure of the variance covariance matrix by allowing errors to be correlated within clusters but not across clusters. In such cases, obtaining standard errors without clustering can lead to misleadingly small standard errors, narrow confidence intervals and small p-values.

Clustered standard errors can be obtained in two steps. Firstly, estimate the regression model without any clustering and subsequently, obtain clustered errors by using the residuals. Clustered standard errors can be estimated consistently provided the number of clusters goes to infinity. However, the variance covariance matrix is downward-biased when dealing with a finite number of clusters. One of the methods commonly used for correcting the bias, is adjusting for the degrees of freedom in finite clusters.

R and Stata codes

The code below shows how to compute clustered standard errors in R, using the plm and lmtest packages. Petersen's dataset can be loaded directly from the multiwayvcov package. Pooled OLS and fixed effect (FE) models are estimated using the plm package.

# Loading the required libraries

# Loading Petersen's dataset
# Pooled OLS model
pooled.ols <- plm(formula=y~x, data=petersen, model="pooling", index=c("firmid", "year")) 
# Fixed effects model
fe.firm <- plm(formula=y~x, data=petersen, model="within", index=c("firmid", "year")) 

Clustered standard errors can be computed in R, using the vcovHC() function from plm package. vcovHC.plm() estimates the robust covariance matrix for panel data models. The function serves as an argument to other functions such as coeftest(), waldtest() and other methods in the lmtest package. Clustering is achieved by the cluster argument, that allows clustering on either group or time. The type argument allows estimating standard errors by allowing for heteroskedasticity across groups. Recently, the plm package introduced the small sample correction as an option to the "type" argument of vcovHC.plm() function. This is switched on by specifying type="sss".

# OLS with SE clustered by firm (Petersen's Table 3)
coeftest(pooled.ols, vcov=vcovHC(pooled.ols, type="sss", cluster="group"))  

# OLS with SE clustered by time (Petersen's Table 4)
coeftest(pooled.ols, vcov=vcovHC(pooled.ols, type="sss", cluster="time")) 

# FE regression with SE clustered by firm
coeftest(fe.firm, vcov=vcovHC(fe.firm, type="sss", cluster="group")) 

# FE regression with SE clustered by time
coeftest(fe.firm, vcov=vcovHC(fe.firm, type="sss", cluster="time")) 

Stata makes it easy to cluster, by adding the cluster option at the end of any routine regression command (such as reg or xtreg). The code below shows how to cluster in OLS and fixed effect models:

webuse set
webuse test_data.dta, clear

* OLS with SE clustered by firm (Petersen's Table 3)
reg y x, vce(cluster firmid)
* OLS with SE clustered by time (Petersen's Table 4)
reg y x, vce(cluster year)

* Declaring dataset to be a panel
xtset firmid year
* FE regression with SE clustered by firm
xtreg y x, fe vce(cluster firmid)
* FE regression with SE clustered by time
xtreg y x, fe vce(cluster year) nonest

The table given below shows a comparison of the standard errors computed by R and Stata. The standard errors computed from R and Stata agree up to the fifth decimal place.

Model SE (in R) SE (in Stata)
OLS with SE clustered by firm 0.05059 0.05059
OLS with SE clustered by time 0.03338 0.03338
FE regression with SE clustered by firm      0.03014 0.03014
FE regression with SE clustered by time 0.02668 0.02668

Performance comparison

I run benchmarks for comparing the speed of Stata MP and R for each of these models on a quad-core processor. The results show that R is faster than Stata. In order to do parallelisation, I set the number of processors that Stata MP will use as 4. An example of the benchmarking code in Stata is given below:

* Stata benchmarking program : Example
set processors 4
timer clear
timer on 1
bs, nodrop reps(1000) seed(1): reg y x
timer off 1
timer list

Parallelisation in R is done using standard R packages. An example of the benchmarking code in R is given below:

# R benchmarking program : Example
c <- detectCores()
cl <- makeCluster(c)
ols.benchmark <- mcparallel(benchmark(lm(y~x, petersen), replications=1000))

The table below shows a comparison of R and Stata MP for each of these models. The average time is calculated as the ratio of elapsed time to the number of replications. Relative efficiency is defined as the ratio of the average time taken by Stata MP to the average time taken by R. It turns out that the R is faster.

Model Replications Average time (R - 4 core) Average time (Stata MP - 4 core) Relative efficiency
OLS with SE clustered by firm 1000 0.0737 0.1635 2.22
OLS with SE clustered by time 1000 0.0557 0.0742 1.33
FE regression with SE clustered by firm 1000 0.0880 0.3176 3.61
FE regression with SE clustered by time 1000 0.0729 0.1118 1.53

Multi-level clustering in R

Two way clustering does not have a routine estimation procedure with most of the Stata commands (except for ivreg2 and xtivreg2). There are a few codes available online (See for example, here and here) that do two way clustering. This is easily handled in R, using the vcovDC.plm() function. The function can be used in a similar fashion as vcovHC.plm().

# OLS with SE clustered by firm and time (Petersen's Table 5)
coeftest(pooled.ols, vcov=vcovDC(pooled.ols, type="sss"))

A more recent addition, multiwayvcov package is useful for clustering on multiple levels and, in computing bootstrapped clustered standard errors. The package supports parallelisation thereby, making it easier to work with large datasets. Two functions are exported from the package, cluster.vcov() and cluster.boot(). cluster.vcov() computes clustered standard errors, whereas, cluster.boot() calculates bootstrapped clustered standard errors. The code for replicating Petersen's results is available in the reference manual of the package. One limitation of cluster.vcov() is its inability to work with plm objects. This is because the package imports estfun() from the sandwich package, which is not compatible with plm objects.

R code

Here's the R code to reproduce the results.

Dhananjay Ghei is a researcher at the National Institute of Public Finance and Policy. He thanks Ajay Shah, Vimal Balasubramaniam and Apoorva Gupta for valuable discussions and feedback.


  1. Achim Zeileis has released

  2. I'm using the stata commands:

    xtset ndscd, year
    xtreg X Y year, fe vce(robust)

    I'm not able to translate these commands in R code. Can you helm me? Thank you

  3. Dear Dhananjay, thank you very much for you wonderful blog! If it comes to clustered standard errors, do you see any chance to cluster along other dimensions than group or time in the plm package? For example by country, although country is not a dimension of the plm-dataset?


Please note: Comments are moderated. Only civilised conversation is permitted on this blog. Criticism is perfectly okay; uncivilised language is not. We delete any comment which is spam, has personal attacks against anyone, or uses foul language. We delete any comment which does not contribute to the intellectual discussion about the blog article in question.

LaTeX mathematics works. This means that if you want to say $10 you have to say \$10.