We need some packages to simulate clustered data analysis. Let us load them here. If necessary, install them with install.packages() before loading.
library('arm')
library('mvtnorm')
library('lme4')
library('multiwayvcov')
library('clusterSEs')
library('ggplot2')
library('dplyr')
library('haven')
First, let us define a function to generate data. Here we suppose a simple regression model: \[y_i \sim \mbox{N}(\beta_0 + \beta_1 x_i, \sigma^2).\]
In the function, the intra-cluster correlation is set by rho (\(\rho\)). When \(\rho = 1\), all units within a cluster are considered to be identical, and the effective sample size is reduced to the number of clusters. When \(\rho = 0\), there is no correlation among units within a cluster, and all observations are considered to be independent of each other.
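To see what intermediate values of \(\rho\) imply, we can compute Kish's design effect, \(1 + (m - 1)\rho\), where \(m\) is the cluster size. This short illustration is an addition (not part of the original code), using the simulation defaults below (1,000 units in 50 clusters of 20):
m <- 20                    # units per cluster (1000 units / 50 clusters)
rho <- .5                  # intra-cluster correlation
deff <- 1 + (m - 1)*rho    # Kish's design effect: 10.5
1000 / deff                # effective sample size: roughly 95 of the 1000 units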
gen_cluster <- function(param = c(.1, .5), n = 1000, n_cluster = 50, rho = .5) {
# Function to generate clustered data
# Required package: mvtnorm
# individual level
Sigma_i <- matrix(c(1, 0, 0, 1 - rho), ncol = 2)
values_i <- rmvnorm(n = n, sigma = Sigma_i)
# cluster level
cluster_name <- rep(1:n_cluster, each = n / n_cluster)
Sigma_cl <- matrix(c(1, 0, 0, rho), ncol = 2)
values_cl <- rmvnorm(n = n_cluster, sigma = Sigma_cl)
# predictor var consists of individual- and cluster-level components
x <- values_i[ , 1] + rep(values_cl[ , 1], each = n / n_cluster)
# error consists of individual- and cluster-level components
error <- values_i[ , 2] + rep(values_cl[ , 2], each = n / n_cluster)
# data generating process
y <- param[1] + param[2]*x + error
df <- data.frame(x, y, cluster = cluster_name)
return(df)
}
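As a quick sanity check of gen_cluster() (an addition, not part of the original code; the exact numbers vary by seed), we can verify that roughly a fraction \(\rho\) of the error variance sits at the cluster level by fitting a random-intercept model to one generated dataset:
set.seed(123)  # arbitrary seed chosen for this check
df_check <- gen_cluster(rho = .5)
fit_check <- lmer(y ~ x + (1 | cluster), data = df_check)
vc <- as.data.frame(VarCorr(fit_check))  # cluster and residual variances
vc$vcov[1] / sum(vc$vcov)                # cluster share of error variance: should be near .5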
Next, let us define some functions to run the simulations.
# Simulate a dataset with clusters and fit OLS
# Calculate cluster-robust SE when cluster_robust = TRUE
cluster_sim <- function(param = c(.1, .5), n = 1000, n_cluster = 50,
rho = .5, cluster_robust = FALSE) {
# Required packages: mvtnorm, multiwayvcov
df <- gen_cluster(param = param, n = n, n_cluster = n_cluster, rho = rho)
fit <- lm(y ~ x, data = df)
b1 <- coef(fit)[2]
if (!cluster_robust) {
Sigma <- vcov(fit)
se <- sqrt(diag(Sigma)[2])
b1_ci95 <- confint(fit)[2, ]
} else { # cluster-robust SE
Sigma <- cluster.vcov(fit, ~ cluster)
se <- sqrt(diag(Sigma)[2])
t_critical <- qt(.025, df = n - 2, lower.tail = FALSE)
lower <- b1 - t_critical*se
upper <- b1 + t_critical*se
b1_ci95 <- c(lower, upper)
}
return(c(b1, se, b1_ci95))
}
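For example, a single call simulates one dataset and returns the slope estimate, its SE, and the 95% CI bounds (a usage sketch; the numbers vary from run to run):
cluster_sim(cluster_robust = TRUE)  # returns c(b1, se_b1, ci95_lower, ci95_upper)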
# Function to iterate the simulation. A data frame is returned.
run_cluster_sim <- function(n_sims = 1000, param = c(.1, .5), n = 1000,
n_cluster = 50, rho = .5, cluster_robust = FALSE) {
# Required packages: mvtnorm, multiwayvcov, dplyr
df <- replicate(n_sims, cluster_sim(param = param, n = n, rho = rho,
n_cluster = n_cluster,
cluster_robust = cluster_robust))
df <- as.data.frame(t(df))
names(df) <- c('b1', 'se_b1', 'ci95_lower', 'ci95_upper')
df <- df %>%
mutate(id = 1:n(),
param_caught = ci95_lower <= param[2] & ci95_upper >= param[2])
return(df)
}
First, we will run simulations of data with no clusters (\(\rho = 0\)). Here, we set \(\beta_0 = 0.4\) and \(\beta_1 = 0\). Setting the significance level at 5%, we should incorrectly reject the null \(\beta_1 = 0\) in about 5% of the simulations.
sim_params <- c(.4, 0) # beta1 = 0: no effect of x on y
sim_nocluster <- run_cluster_sim(n_sims = 10000, param = sim_params, rho = 0)
hist_nocluster <- ggplot(sim_nocluster, aes(b1)) +
geom_histogram(color = 'black') +
geom_vline(xintercept = sim_params[2], color = 'red')
print(hist_nocluster)
As the histogram above shows, we can estimate the value of \(\beta_1\) correctly on average. Let’s check the confidence intervals (CIs).
ci95_nocluster <- ggplot(sample_n(sim_nocluster, 100),
aes(x = reorder(id, b1), y = b1,
ymin = ci95_lower, ymax = ci95_upper,
color = param_caught)) +
geom_hline(yintercept = sim_params[2], linetype = 'dashed') +
geom_pointrange() +
labs(x = 'sim ID', y = 'b1', title = 'Randomly Chosen 100 95% CIs') +
scale_color_discrete(name = 'True param value', labels = c('missed', 'hit')) +
coord_flip()
print(ci95_nocluster)
sim_nocluster %>% summarize(type1_error = 1 - sum(param_caught)/n())
## type1_error
## 1 0.0503
As shown, about 95% of the 95% CIs catch the true value of \(\beta_1\), which is zero. In other words, we incorrectly reject the null hypothesis about 5% of the time. This isn’t a problem because the rate of Type I error is close to the significance level.
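For reference, the Monte Carlo error of this estimate is small: with 10,000 simulations, a true rejection rate of .05 is estimated with a standard error of about 0.002, so the observed 0.0503 is well within simulation noise.
sqrt(.05*(1 - .05)/10000)  # Monte Carlo SE of the estimated Type I error rate: ~0.0022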
Now, let us simulate data with clusters. First, we will fit OLS to the clustered data.
sim_params <- c(.4, 0) # beta1 = 0: no effect of x on y
sim_cluster_ols <- run_cluster_sim(n_sims = 10000, param = sim_params)
hist_cluster_ols <- hist_nocluster %+% sim_cluster_ols
print(hist_cluster_ols)
As this histogram shows, the point estimate is not biased; we can correctly estimate the slope on average.
How about CIs?
ci95_cluster_ols <- ci95_nocluster %+% sample_n(sim_cluster_ols, 100)
print(ci95_cluster_ols)
sim_cluster_ols %>% summarize(type1_error = 1 - sum(param_caught)/n())
## type1_error
## 1 0.4187
As shown, more than 5% of the CIs miss the true value, which means that we reject the null more than 5% of the time. In other words, if we analyze clustered data using OLS, we may incorrectly reject the null hypothesis much more often than we should. This is because we underestimate the standard errors.
To avoid over-reporting of “significant” results, we need to calculate cluster-robust standard errors. We can get the cluster-robust variance-covariance matrix with multiwayvcov::cluster.vcov(), as used in cluster_sim() above.
sim_params <- c(.4, 0) # beta1 = 0: no effect of x on y
sim_cluster_robust <- run_cluster_sim(n_sims = 10000, param = sim_params,
cluster_robust = TRUE)
hist_cluster_robust <- hist_nocluster %+% sim_cluster_robust
print(hist_cluster_robust)
When we calculate cluster-robust SEs, we still use the OLS estimates for the coefficients. So this histogram is similar (identical if the same datasets were used) to the previous histogram.
Now, let us examine the CIs.
ci95_cluster_robust <- ci95_nocluster %+% sample_n(sim_cluster_robust, 100)
print(ci95_cluster_robust)
sim_cluster_robust %>% summarize(type1_error = 1 - sum(param_caught)/n())
## type1_error
## 1 0.0685
As shown, the rate of Type I error has dropped. This is because the cluster-robust standard errors are larger than the standard errors calculated without considering the cluster structure.
Therefore, we should calculate cluster-robust standard errors when we suspect the data are clustered. Better yet, we should always calculate cluster-robust SEs unless we are theoretically convinced that there is no clustering in the data.
To get cluster-robust CIs more easily with cluster-bootstrap variance-covariance matrix estimates, we can use clusterSEs::cluster.bs.glm().
df_test <- gen_cluster(param = c(.4, 0))
fit_ols <- lm(y ~ x, data = df_test)
display(fit_ols)
## lm(formula = y ~ x, data = df_test)
## coef.est coef.se
## (Intercept) 0.36 0.03
## x 0.01 0.02
## ---
## n = 1000, k = 2
## residual sd = 0.93, R-Squared = 0.00
confint(fit_ols)
## 2.5 % 97.5 %
## (Intercept) 0.30245794 0.41808350
## x -0.03631271 0.04729855
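Before turning to the bootstrap, we can also compute analytic cluster-robust SEs for this fit with multiwayvcov::cluster.vcov(), as in cluster_sim() above (a sketch added for comparison; the values depend on the generated dataset):
Sigma_crse <- cluster.vcov(fit_ols, ~ cluster)
sqrt(diag(Sigma_crse))  # cluster-robust SEs for (Intercept) and x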
# same estimates using glm to be passed to cluster.bs.glm
fit_glm <- glm(y ~ x, data = df_test, family = gaussian(link = 'identity'))
display(fit_glm)
## glm(formula = y ~ x, family = gaussian(link = "identity"), data = df_test)
## coef.est coef.se
## (Intercept) 0.36 0.03
## x 0.01 0.02
## ---
## n = 1000, k = 2
## residual deviance = 862.1, null deviance = 862.1 (difference = 0.1)
## overdispersion parameter = 0.9
## residual sd is sqrt(overdispersion) = 0.93
confint(fit_glm)
## 2.5 % 97.5 %
## (Intercept) 0.30252806 0.41801339
## x -0.03626201 0.04724785
# cluster-bootstrap CI
fit_robust <- cluster.bs.glm(fit_glm, dat = df_test, cluster = ~ cluster,
boot.reps = 1000, prog.bar = FALSE)
##
## Cluster Bootstrap p-values:
##
## variable name cluster bootstrap p-value
## (Intercept) 0
## x 0.92
##
## Confidence Intervals (derived from bootstrapped t-statistics):
##
## variable name CI lower CI higher
## (Intercept) 0.171881122186029 0.548660323561103
## x -0.110564980321601 0.12155081905199
Again, the cluster-robust CI is much wider than the OLS CI.
Though the cluster-robust SEs correct the standard errors in linear regression, the cluster structure isn't taken into account when the coefficients are estimated. What will we get if we consider the cluster structure when estimating the coefficients?
Let us consider and simulate two other estimation methods in addition to OLS and OLS with cluster-robust SEs. We can fit a multilevel model with lme4::lmer() (see Bates et al. 2014).
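For instance, a random-intercept model for the test data generated above could be fit as follows (a usage sketch; df_test comes from the earlier example, and the output will vary):
fit_multi_test <- lmer(y ~ x + (1 | cluster), data = df_test)
display(fit_multi_test)  # fixed effects plus variance components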
To compare the different models, let us define a function to simulate a dataset and analyze it with four different models (strictly speaking, three models and four different SEs).
sim_4models <- function(param = c(.1, .5), n = 1000, rho = .5, n_cluster = 50) {
# Required packages: mvtnorm, multiwayvcov, lme4
df <- gen_cluster(param = param, n = n, n_cluster = n_cluster, rho = rho)
# Fit three different models
fit_ols <- lm(y ~ x, data = df)
fit_fe <- lm(y ~ x + factor(cluster), data = df)
fit_multi <- lmer(y ~ x + (1|cluster), data = df)
coef <- c(coef(fit_ols)[2], coef(fit_fe)[2], fixef(fit_multi)[2])
# 4 ways to get var-cov matrix
Sigma_ols <- vcov(fit_ols)
Sigma_crse <- cluster.vcov(fit_ols, ~ cluster)
Sigma_fe <- vcov(fit_fe)
Sigma_multi <- vcov(fit_multi)
# 4 different SEs
se <- sqrt(c(diag(Sigma_ols)[2], diag(Sigma_crse)[2],
diag(Sigma_fe)[2], diag(Sigma_multi)[2]))
return(c(coef[1], se[1], # OLS
coef[1], se[2], # OLS with cluster-robust SE
coef[2], se[3], # fixed-effect model
coef[3], se[4]) # multi-level model
)
}
Then, run the simulations.
n_sims <- 10000
sim_params <- c(.4, 0)
comp_4models <- replicate(n_sims, sim_4models(param = sim_params))
comp_df <- as.vector(comp_4models)
comp_df <- matrix(comp_df, ncol = 2, byrow = TRUE)
comp_df <- as.data.frame(comp_df)
names(comp_df) <- c('b1', 'se_b1')
comp_df <- comp_df %>%
mutate(model = rep(c('OLS', 'CRSE', 'FE', 'Multi'), n_sims),
id = rep(1:n_sims, each = 4))
First, let us compare the coefficient estimates.
density_b1 <- ggplot(comp_df, aes(b1, color = model)) +
geom_density()
print(density_b1)
Since the OLS and OLS with cluster-robust SEs share the coefficient estimates, the density lines for these two models completely overlap.
Compared to the OLS estimates, the estimates obtained by the fixed-effect model or the multilevel model are concentrated around the true parameter value. That is, the fixed-effect and multilevel models are more efficient than OLS. Therefore, in terms of efficiency, we should use a fixed-effect or multilevel model to analyze clustered data instead of an OLS model with cluster-robust standard errors.
Next, let us check the standard errors of these models.
density_se <- ggplot(comp_df, aes(se_b1, color = model)) +
geom_density()
print(density_se)
As this figure shows, the standard errors in the OLS, FE, and multi-level models are much smaller than the cluster-robust SEs. Does this mean that FE and multi-level models overreport “significant” results like OLS? Actually, no.
To understand why, we will create 95% confidence intervals and calculate the Type I error rate for each model.
comp_df <- comp_df %>%
mutate(lower = b1 - 1.96*se_b1,
upper = b1 + 1.96*se_b1,
param_caught = lower <= sim_params[2] & upper >= sim_params[2])
comp_df %>% group_by(model) %>%
summarize(type1_error = 1 - sum(param_caught)/n())
## Source: local data frame [4 x 2]
##
## model type1_error
## (chr) (dbl)
## 1 CRSE 0.0674
## 2 FE 0.0488
## 3 Multi 0.0495
## 4 OLS 0.4101
So only OLS overreports “significant” results. This is because the fixed-effect and multilevel models estimate the coefficient precisely enough that their narrower confidence intervals still catch the true value at the nominal rate.
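We can substantiate this by comparing each model's average reported SE with the empirical standard deviation of its b1 estimates across simulations (an added check, not in the original code); a valid standard error should track the actual sampling variability:
comp_df %>%
  group_by(model) %>%
  summarize(empirical_sd = sd(b1),    # actual sampling variability of b1
            mean_se = mean(se_b1))    # average reported standard error
In this simulation, OLS should be the only model whose average SE falls well below the empirical variability of its estimates.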
In sum, to analyze clustered data, we should use a fixed-effect or multilevel model. While cluster-robust standard errors correct the standard errors of a misspecified model (OLS), the fixed-effect and multilevel models specify the model correctly, so we don't have to correct the standard errors afterwards.
Let us replicate and expand the logistic regression example explained by UCLA's IDRE. For an explanation of the dataset and variables, see the link above. Since the dataset is in Stata format, we use haven::read_dta() to load it.
download.file(url = 'http://www.ats.ucla.edu/stat/stata/dae/binary.dta',
destfile = 'ucla_binary.dta')
myd <- read_dta('ucla_binary.dta')
Fit a logit model where the outcome is admit (binary: admitted or not) and the predictors are gre (GRE score) and gpa (undergraduate GPA).
fit_logit <- glm(admit ~ gre + gpa, data = myd,
family = binomial(link = 'logit'))
display(fit_logit)
## glm(formula = admit ~ gre + gpa, family = binomial(link = "logit"),
## data = myd)
## coef.est coef.se
## (Intercept) -4.95 1.08
## gre 0.00 0.00
## gpa 0.75 0.32
## ---
## n = 400, k = 3
## residual deviance = 480.3, null deviance = 500.0 (difference = 19.6)
confint(fit_logit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -7.1092208579 -2.886727222
## gre 0.0006373957 0.004791712
## gpa 0.1347510239 1.390131222
Using this result, calculate cluster-robust standard errors, where the cluster variable is rank (the rank [1–4] of the institution from which each observation graduated; 1 represents the highest prestige and 4 the lowest).
cluster.bs.glm(fit_logit, dat = myd, cluster = ~ rank, prog.bar = FALSE)
##
## Cluster Bootstrap p-values:
##
## variable name cluster bootstrap p-value
## (Intercept) 0.08
## gre 0.071
## gpa 0.105
##
## Confidence Intervals (derived from bootstrapped t-statistics):
##
## variable name CI lower CI higher
## (Intercept) -14.2015669885559 4.3028103231421
## gre -0.00283582815057708 0.00821719516850139
## gpa -0.299916868844567 1.80929076540112
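Note that rank has only four clusters, and cluster-robust inference (analytic or bootstrapped) can be unreliable with so few clusters; the very wide bootstrap CIs above reflect this. For comparison, here are the analytic cluster-robust SEs via multiwayvcov (an added sketch, not in the original):
Sigma_crse_logit <- cluster.vcov(fit_logit, ~ rank)
sqrt(diag(Sigma_crse_logit))  # cluster-robust SEs for (Intercept), gre, and gpa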
Next, we will take into account the rank-specific fixed effects in the model.
fit_logit_fe <- glm(admit ~ gre + gpa + as.factor(rank), data = myd,
family = binomial(link = 'logit'))
display(fit_logit_fe)
## glm(formula = admit ~ gre + gpa + as.factor(rank), family = binomial(link = "logit"),
## data = myd)
## coef.est coef.se
## (Intercept) -3.99 1.14
## gre 0.00 0.00
## gpa 0.80 0.33
## as.factor(rank)2 -0.68 0.32
## as.factor(rank)3 -1.34 0.35
## as.factor(rank)4 -1.55 0.42
## ---
## n = 400, k = 6
## residual deviance = 458.5, null deviance = 500.0 (difference = 41.5)
confint(fit_logit_fe)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.271620549 -1.792547372
## gre 0.000137592 0.004435874
## gpa 0.160296051 1.464142832
## as.factor(rank)2 -1.300888804 -0.056745723
## as.factor(rank)3 -2.027671329 -0.670372358
## as.factor(rank)4 -2.400026540 -0.753542604
Lastly, we will fit a multilevel model.
fit_logit_multi <- glmer(admit ~ gre + gpa + (1|rank), data = myd,
family = binomial(link = 'logit'))
display(fit_logit_multi)
## glmer(formula = admit ~ gre + gpa + (1 | rank), data = myd, family = binomial(link = "logit"))
## coef.est coef.se
## (Intercept) -4.92 1.14
## gre 0.00 0.00
## gpa 0.80 0.33
##
## Error terms:
## Groups Name Std.Dev.
## rank (Intercept) 0.54
## Residual 1.00
## ---
## number of obs: 400, groups: rank, 4
## AIC = 478, DIC = 448.7
## deviance = 459.3
confint(fit_logit_multi)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 0.2135712191 1.458054755
## (Intercept) -7.2077038833 -2.723901257
## gre 0.0002214138 0.004495754
## gpa 0.1636032829 1.458933252
Which model do you prefer?
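One informal way to compare these models is AIC (a rough check only, since the multilevel model's AIC is computed conditionally on the estimated random-effect structure):
c(pooled = AIC(fit_logit), fixed = AIC(fit_logit_fe), multilevel = AIC(fit_logit_multi))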