We need some packages to simulate clustered data analysis. Let us load them here. If necessary, install them with install.packages() before loading.
library('arm')
library('mvtnorm')
library('lme4')
library('multiwayvcov')
library('clusterSEs')
library('ggplot2')
library('dplyr')
library('haven')
First, let us define a function to generate data. Here we suppose a simple regression model: \[y_i \sim \mbox{N}(\beta_0 + \beta_1 x_i, \sigma^2).\]
In the function, the intra-cluster correlation is set by rho (\(\rho\)). When \(\rho = 1\), all units within a cluster are considered to be identical, and the effective sample size is reduced to the number of clusters. When \(\rho = 0\), there is no correlation among units within a cluster, and all observations are considered to be independent of each other.
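To see what intermediate values of \(\rho\) imply, we can compute Kish's design effect, \(1 + (m - 1)\rho\), where \(m\) is the cluster size. This short illustration is an addition (not part of the original code), using the simulation defaults below (1,000 units in 50 clusters of 20):
m <- 20                    # units per cluster (1000 units / 50 clusters)
rho <- .5                  # intra-cluster correlation
deff <- 1 + (m - 1)*rho    # Kish's design effect: 10.5
1000 / deff                # effective sample size: roughly 95 of the 1000 units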
gen_cluster <- function(param = c(.1, .5), n = 1000, n_cluster = 50, rho = .5) {
# Function to generate clustered data
# Required package: mvtnorm
# individual level
Sigma_i <- matrix(c(1, 0, 0, 1 - rho), ncol = 2)
values_i <- rmvnorm(n = n, sigma = Sigma_i)
# cluster level
cluster_name <- rep(1:n_cluster, each = n / n_cluster)
Sigma_cl <- matrix(c(1, 0, 0, rho), ncol = 2)
values_cl <- rmvnorm(n = n_cluster, sigma = Sigma_cl)
# predictor var consists of individual- and cluster-level components
x <- values_i[ , 1] + rep(values_cl[ , 1], each = n / n_cluster)
# error consists of individual- and cluster-level components
error <- values_i[ , 2] + rep(values_cl[ , 2], each = n / n_cluster)
# data generating process
y <- param[1] + param[2]*x + error
df <- data.frame(x, y, cluster = cluster_name)
return(df)
}
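As a quick sanity check of gen_cluster() (an addition, not part of the original code; the exact numbers vary by seed), we can verify that roughly a fraction \(\rho\) of the error variance sits at the cluster level by fitting a random-intercept model to one generated dataset:
set.seed(123)  # arbitrary seed chosen for this check
df_check <- gen_cluster(rho = .5)
fit_check <- lmer(y ~ x + (1 | cluster), data = df_check)
vc <- as.data.frame(VarCorr(fit_check))  # cluster and residual variances
vc$vcov[1] / sum(vc$vcov)                # cluster share of error variance: should be near .5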
Next, let us define some functions to run the simulations.
# Simulate a dataset with clusters and fit OLS
# Calculate cluster-robust SE when cluster_robust = TRUE
cluster_sim <- function(param = c(.1, .5), n = 1000, n_cluster = 50,
rho = .5, cluster_robust = FALSE) {
# Required packages: mvtnorm, multiwayvcov
df <- gen_cluster(param = param, n = n, n_cluster = n_cluster, rho = rho)
fit <- lm(y ~ x, data = df)
b1 <- coef(fit)[2]
if (!cluster_robust) {
Sigma <- vcov(fit)
se <- sqrt(diag(Sigma)[2])
b1_ci95 <- confint(fit)[2, ]
} else { # cluster-robust SE
Sigma <- cluster.vcov(fit, ~ cluster)
se <- sqrt(diag(Sigma)[2])
t_critical <- qt(.025, df = n - 2, lower.tail = FALSE)
lower <- b1 - t_critical*se
upper <- b1 + t_critical*se
b1_ci95 <- c(lower, upper)
}
return(c(b1, se, b1_ci95))
}
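For example, a single call simulates one dataset and returns the slope estimate, its SE, and the 95% CI bounds (a usage sketch; the numbers vary from run to run):
cluster_sim(cluster_robust = TRUE)  # returns c(b1, se_b1, ci95_lower, ci95_upper)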
# Function to iterate the simulation. A data frame is returned.
run_cluster_sim <- function(n_sims = 1000, param = c(.1, .5), n = 1000,
n_cluster = 50, rho = .5, cluster_robust = FALSE) {
# Required packages: mvtnorm, multiwayvcov, dplyr
df <- replicate(n_sims, cluster_sim(param = param, n = n, rho = rho,
n_cluster = n_cluster,
cluster_robust = cluster_robust))
df <- as.data.frame(t(df))
names(df) <- c('b1', 'se_b1', 'ci95_lower', 'ci95_upper')
df <- df %>%
mutate(id = 1:n(),
param_caught = ci95_lower <= param[2] & ci95_upper >= param[2])
return(df)
}
First, we will run simulations of data with no clusters (\(\rho = 0\)). Here, we set \(\beta_0 = 0.4\) and \(\beta_1 = 0\). Setting the significance level at 5%, we should incorrectly reject the null \(\beta_1 = 0\) in about 5% of the simulations.
sim_params <- c(.4, 0) # beta1 = 0: no effect of x on y
sim_nocluster <- run_cluster_sim(n_sims = 10000, param = sim_params, rho = 0)
hist_nocluster <- ggplot(sim_nocluster, aes(b1)) +
geom_histogram(color = 'black') +
geom_vline(xintercept = sim_params[2], color = 'red')
print(hist_nocluster)
As the histogram above shows, we can estimate the value of \(\beta_1\) correctly on average. Let’s check the confidence intervals (CIs).
ci95_nocluster <- ggplot(sample_n(sim_nocluster, 100),
aes(x = reorder(id, b1), y = b1,
ymin = ci95_lower, ymax = ci95_upper,
color = param_caught)) +
geom_hline(yintercept = sim_params[2], linetype = 'dashed') +
geom_pointrange() +
labs(x = 'sim ID', y = 'b1', title = 'Randomly Chosen 100 95% CIs') +
scale_color_discrete(name = 'True param value', labels = c('missed', 'hit')) +
coord_flip()
print(ci95_nocluster)
sim_nocluster %>% summarize(type1_error = 1 - sum(param_caught)/n())
## type1_error
## 1 0.0503
As shown, about 95% of the 95% CIs catch the true value of \(\beta_1\), which is zero. In other words, we incorrectly reject the null hypothesis about 5% of the time. This isn’t a problem because the rate of Type I error is close to the significance level.
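For reference, the Monte Carlo error of this estimate is small: with 10,000 simulations, a true rejection rate of .05 is estimated with a standard error of about 0.002, so the observed 0.0503 is well within simulation noise.
sqrt(.05*(1 - .05)/10000)  # Monte Carlo SE of the estimated Type I error rate: ~0.0022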
Now, let us simulate data with clusters. First, we will fit OLS to the clustered data.
sim_params <- c(.4, 0) # beta1 = 0: no effect of x on y
sim_cluster_ols <- run_cluster_sim(n_sims = 10000, param = sim_params)
hist_cluster_ols <- hist_nocluster %+% sim_cluster_ols
print(hist_cluster_ols)
As this histogram shows, the point estimate is not biased; we can correctly estimate the slope on average.
How about CIs?
ci95_cluster_ols <- ci95_nocluster %+% sample_n(sim_cluster_ols, 100)
print(ci95_cluster_ols)
sim_cluster_ols %>% summarize(type1_error = 1 - sum(param_caught)/n())
## type1_error
## 1 0.4187
As shown, more than 5% of the CIs miss the true value, which means that we reject the null more than 5% of the time. In other words, if we analyze clustered data using OLS, we may incorrectly reject the null hypothesis much more often than we should. This is because we underestimate the standard errors.
To avoid over-reporting of “significant” results, we need to calculate cluster-robust standard errors. We can get the cluster-robust variance-covariance matrix with multiwayvcov::cluster.vcov(), as used in cluster_sim() above.
sim_params <- c(.4, 0) # beta1 = 0: no effect of x on y
sim_cluster_robust <- run_cluster_sim(n_sims = 10000, param = sim_params,
cluster_robust = TRUE)
hist_cluster_robust <- hist_nocluster %+% sim_cluster_robust
print(hist_cluster_robust)
When we calculate cluster-robust SEs, we still use the OLS estimates for the coefficients. So this histogram is similar (identical if the same datasets were used) to the previous histogram.
Now, let us examine the CIs.
ci95_cluster_robust <- ci95_nocluster %+% sample_n(sim_cluster_robust, 100)
print(ci95_cluster_robust)
sim_cluster_robust %>% summarize(type1_error = 1 - sum(param_caught)/n())
## type1_error
## 1 0.0685
As shown, the rate of Type I error has dropped. This is because the cluster-robust standard errors are larger than the standard errors calculated without considering the cluster structure.
Therefore, we should calculate cluster-robust standard errors when we suspect the data are clustered. Better yet, we should always calculate cluster-robust SEs unless we are theoretically convinced that there is no clustering in the data.
To get cluster-robust CIs more easily with cluster-bootstrap variance-covariance matrix estimates, we can use clusterSEs::cluster.bs.glm().
df_test <- gen_cluster(param = c(.4, 0))
fit_ols <- lm(y ~ x, data = df_test)
display(fit_ols)
## lm(formula = y ~ x, data = df_test)
## coef.est coef.se
## (Intercept) 0.36 0.03
## x 0.01 0.02
## ---
## n = 1000, k = 2
## residual sd = 0.93, R-Squared = 0.00
confint(fit_ols)
## 2.5 % 97.5 %
## (Intercept) 0.30245794 0.41808350
## x -0.03631271 0.04729855
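Before turning to the bootstrap, we can also compute analytic cluster-robust SEs for this fit with multiwayvcov::cluster.vcov(), as in cluster_sim() above (a sketch added for comparison; the values depend on the generated dataset):
Sigma_crse <- cluster.vcov(fit_ols, ~ cluster)
sqrt(diag(Sigma_crse))  # cluster-robust SEs for (Intercept) and x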
# same estimates using glm to be passed to cluster.bs.glm
fit_glm <- glm(y ~ x, data = df_test, family = gaussian(link = 'identity'))
display(fit_glm)
## glm(formula = y ~ x, family = gaussian(link = "identity"), data = df_test)
## coef.est coef.se
## (Intercept) 0.36 0.03
## x 0.01 0.02
## ---
## n = 1000, k = 2
## residual deviance = 862.1, null deviance = 862.1 (difference = 0.1)
## overdispersion parameter = 0.9
## residual sd is sqrt(overdispersion) = 0.93
confint(fit_glm)
## 2.5 % 97.5 %
## (Intercept) 0.30252806 0.41801339
## x -0.03626201 0.04724785
# cluster-bootstrap CI
fit_robust <- cluster.bs.glm(fit_glm, dat = df_test, cluster = ~ cluster,
boot.reps = 1000, prog.bar = FALSE)
##
## Cluster Bootstrap p-values:
##
## variable name cluster bootstrap p-value
## (Intercept) 0
## x 0.92
##
## Confidence Intervals (derived from bootstrapped t-statistics):
##
## variable name CI lower CI higher
## (Intercept) 0.171881122186029 0.548660323561103
## x -0.110564980321601 0.12155081905199
Again, the cluster-robust CI is much wider than the OLS CI.
Though the cluster-robust SEs correct the standard errors in linear regression, the cluster structure isn't taken into account when the coefficients are estimated. What will we get if we consider the cluster structure when estimating the coefficients?
Let us consider and simulate two other estimation methods in addition to OLS and OLS with cluster-robust SEs. We can fit a multilevel model with lme4::lmer() (see Bates et al. 2014).
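For instance, a random-intercept model for the test data generated above could be fit as follows (a usage sketch; df_test comes from the earlier example, and the output will vary):
fit_multi_test <- lmer(y ~ x + (1 | cluster), data = df_test)
display(fit_multi_test)  # fixed effects plus variance components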
To compare the different models, let us define a function to simulate a dataset and analyze it with four different models (strictly speaking, three models and four different SEs).
sim_4models <- function(param = c(.1, .5), n = 1000, rho = .5, n_cluster = 50) {
# Required packages: mvtnorm, multiwayvcov, lme4
df <- gen_cluster(param = param, n = n, n_cluster = n_cluster, rho = rho)
# Fit three different models
fit_ols <- lm(y ~ x, data = df)
fit_fe <- lm(y ~ x + factor(cluster), data = df)
fit_multi <- lmer(y ~ x + (1|cluster), data = df)
coef <- c(coef(fit_ols)[2], coef(fit_fe)[2], fixef(fit_multi)[2])
# 4 ways to get var-cov matrix
Sigma_ols <- vcov(fit_ols)
Sigma_crse <- cluster.vcov(fit_ols, ~ cluster)
Sigma_fe <- vcov(fit_fe)
Sigma_multi <- vcov(fit_multi)
# 4 different SEs
se <- sqrt(c(diag(Sigma_ols)[2], diag(Sigma_crse)[2],
diag(Sigma_fe)[2], diag(Sigma_multi)[2]))
return(c(coef[1], se[1], # OLS
coef[1], se[2], # OLS with cluster-robust SE
coef[2], se[3], # fixed-effect model
coef[3], se[4]) # multi-level model
)
}
Then, run the simulations.
n_sims <- 10000
sim_params <- c(.4, 0)
comp_4models <- replicate(n_sims, sim_4models(param = sim_params))
comp_df <- as.vector(comp_4models)
comp_df <- matrix(comp_df, ncol = 2, byrow = TRUE)
comp_df <- as.data.frame(comp_df)
names(comp_df) <- c('b1', 'se_b1')
comp_df <- comp_df %>%
mutate(model = rep(c('OLS', 'CRSE', 'FE', 'Multi'), n_sims),
id = rep(1:n_sims, each = 4))
First, let us compare the coefficient estimates.
density_b1 <- ggplot(comp_df, aes(b1, color = model)) +
geom_density()
print(density_b1)
Since the OLS and OLS with cluster-robust SEs share the coefficient estimates, the density lines for these two models completely overlap.
Compared to the OLS estimates, the estimates obtained by the fixed-effect model or the multilevel model are concentrated around the true parameter value. That is, the fixed-effect and multilevel models are more efficient than OLS. Therefore, in terms of efficiency, we should use a fixed-effect or multilevel model to analyze clustered data instead of an OLS model with cluster-robust standard errors.
Next, let us check the standard errors of these models.
density_se <- ggplot(comp_df, aes(se_b1, color = model)) +
geom_density()
print(density_se)
As this figure shows, the standard errors in the OLS, FE, and multi-level models are much smaller than the cluster-robust SEs. Does this mean that FE and multi-level models overreport “significant” results like OLS? Actually, no.
To understand why, we will create 95% confidence intervals and calculate the Type I error rate for each model.
comp_df <- comp_df %>%
mutate(lower = b1 - 1.96*se_b1,
upper = b1 + 1.96*se_b1,
param_caught = lower <= sim_params[2] & upper >= sim_params[2])
comp_df %>% group_by(model) %>%
summarize(type1_error = 1 - sum(param_caught)/n())
## Source: local data frame [4 x 2]
##
## model type1_error
## (chr) (dbl)
## 1 CRSE 0.0674
## 2 FE 0.0488
## 3 Multi 0.0495
## 4 OLS 0.4101
So only OLS overreports “significant” results. This is because the fixed-effect and multilevel models estimate the coefficient precisely enough that their narrower confidence intervals still catch the true value at the nominal rate.
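We can substantiate this by comparing each model's average reported SE with the empirical standard deviation of its b1 estimates across simulations (an added check, not in the original code); a valid standard error should track the actual sampling variability:
comp_df %>%
  group_by(model) %>%
  summarize(empirical_sd = sd(b1),    # actual sampling variability of b1
            mean_se = mean(se_b1))    # average reported standard error
In this simulation, OLS should be the only model whose average SE falls well below the empirical variability of its estimates.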
In sum, to analyze clustered data, we should use a fixed-effect or multilevel model. While cluster-robust standard errors correct the standard errors of a misspecified model (OLS), the fixed-effect and multilevel models specify the model correctly, so we don't have to correct the standard errors afterwards.
Let us replicate and expand the logistic regression example explained by UCLA's IDRE. For an explanation of the dataset and variables, see the link above. Since the dataset is in Stata format, we use haven::read_dta() to load it.
download.file(url = 'http://www.ats.ucla.edu/stat/stata/dae/binary.dta',
destfile = 'ucla_binary.dta')
myd <- read_dta('ucla_binary.dta')
Fit a logit model where the outcome is admit (binary: admitted or not) and the predictors are gre (GRE score) and gpa (undergraduate GPA).
fit_logit <- glm(admit ~ gre + gpa, data = myd,
family = binomial(link = 'logit'))
display(fit_logit)
## glm(formula = admit ~ gre + gpa, family = binomial(link = "logit"),
## data = myd)
## coef.est coef.se
## (Intercept) -4.95 1.08
## gre 0.00 0.00
## gpa 0.75 0.32
## ---
## n = 400, k = 3
## residual deviance = 480.3, null deviance = 500.0 (difference = 19.6)
confint(fit_logit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -7.1092208579 -2.886727222
## gre 0.0006373957 0.004791712
## gpa 0.1347510239 1.390131222
Using this result, calculate cluster-robust standard errors, where the cluster variable is rank (the rank [1–4] of the institution from which each observation graduated; 1 represents the highest prestige and 4 the lowest).
cluster.bs.glm(fit_logit, dat = myd, cluster = ~ rank, prog.bar = FALSE)
##
## Cluster Bootstrap p-values:
##
## variable name cluster bootstrap p-value
## (Intercept) 0.08
## gre 0.071
## gpa 0.105
##
## Confidence Intervals (derived from bootstrapped t-statistics):
##
## variable name CI lower CI higher
## (Intercept) -14.2015669885559 4.3028103231421
## gre -0.00283582815057708 0.00821719516850139
## gpa -0.299916868844567 1.80929076540112
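Note that rank has only four clusters, and cluster-robust inference (analytic or bootstrapped) can be unreliable with so few clusters; the very wide bootstrap CIs above reflect this. For comparison, here are the analytic cluster-robust SEs via multiwayvcov (an added sketch, not in the original):
Sigma_crse_logit <- cluster.vcov(fit_logit, ~ rank)
sqrt(diag(Sigma_crse_logit))  # cluster-robust SEs for (Intercept), gre, and gpa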
Next, we will take into account the rank-specific fixed effects in the model.
fit_logit_fe <- glm(admit ~ gre + gpa + as.factor(rank), data = myd,
family = binomial(link = 'logit'))
display(fit_logit_fe)
## glm(formula = admit ~ gre + gpa + as.factor(rank), family = binomial(link = "logit"),
## data = myd)
## coef.est coef.se
## (Intercept) -3.99 1.14
## gre 0.00 0.00
## gpa 0.80 0.33
## as.factor(rank)2 -0.68 0.32
## as.factor(rank)3 -1.34 0.35
## as.factor(rank)4 -1.55 0.42
## ---
## n = 400, k = 6
## residual deviance = 458.5, null deviance = 500.0 (difference = 41.5)
confint(fit_logit_fe)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.271620549 -1.792547372
## gre 0.000137592 0.004435874
## gpa 0.160296051 1.464142832
## as.factor(rank)2 -1.300888804 -0.056745723
## as.factor(rank)3 -2.027671329 -0.670372358
## as.factor(rank)4 -2.400026540 -0.753542604
Lastly, we will fit a multilevel model.
fit_logit_multi <- glmer(admit ~ gre + gpa + (1|rank), data = myd,
family = binomial(link = 'logit'))
display(fit_logit_multi)
## glmer(formula = admit ~ gre + gpa + (1 | rank), data = myd, family = binomial(link = "logit"))
## coef.est coef.se
## (Intercept) -4.92 1.14
## gre 0.00 0.00
## gpa 0.80 0.33
##
## Error terms:
## Groups Name Std.Dev.
## rank (Intercept) 0.54
## Residual 1.00
## ---
## number of obs: 400, groups: rank, 4
## AIC = 478, DIC = 448.7
## deviance = 459.3
confint(fit_logit_multi)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 0.2135712191 1.458054755
## (Intercept) -7.2077038833 -2.723901257
## gre 0.0002214138 0.004495754
## gpa 0.1636032829 1.458933252
Which model do you prefer?
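One informal way to compare these models is AIC (a rough check only, since the multilevel model's AIC is computed conditionally on the estimated random-effect structure):
c(pooled = AIC(fit_logit), fixed = AIC(fit_logit_fe), multilevel = AIC(fit_logit_multi))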