Load the packages we use this week.
library("readr")
library("dplyr")
library("ggplot2")
library("mi")
library("arm")
In R, missing values are represented by NA, which is not the character string “NA” but simply NA without quotation marks. When we load a CSV dataset into R, blank cells are automatically filled with NA. For Stata datasets, read.dta() or haven::read_dta() transforms “.”, which indicates missing values in Stata, into NA. If a dataset uses a non-standard missing-value code, we have to tell R which values represent missingness. For instance, if “.” indicates missing values in a CSV dataset, we must tell R that the file uses “.” as missing instead of the standard “ ” (blank character), as follows.
data_url <- "http://www2.kobe-u.ac.jp/~yyanai/classes/rm1/contents/data/fake-missing.csv"
height <- read.csv(data_url, na.strings = ".")
dim(height)
## [1] 500 4
This dataset contains four variables for 500 observations. Let’s examine the data with summary().
summary(height)
## child sex father mother
## Min. :151.1 female:242 Min. :154.8 Min. :146.4
## 1st Qu.:164.0 male :211 1st Qu.:166.8 1st Qu.:156.7
## Median :168.5 NA's : 47 Median :170.0 Median :159.9
## Mean :168.6 Mean :170.1 Mean :160.0
## 3rd Qu.:173.6 3rd Qu.:173.4 3rd Qu.:163.0
## Max. :184.1 Max. :185.4 Max. :174.2
## NA's :54 NA's :41 NA's :42
This shows that the four variables in the dataset are child, sex, father, and mother. In addition, these variables respectively have missing values for 54, 47, 41, and 42 observations.
We use is.na() to find missing values.
is.na(height$child[1:5])
## [1] FALSE FALSE FALSE TRUE FALSE
sum(is.na(height$child))
## [1] 54
Now, let’s obtain the standard deviation of the child variable with sd().
sd(height$child)
## [1] NA
As this example shows, functions that calculate statistics return NA when the object passed to them contains a missing value. To get the statistic excluding the missing values, we have to set na.rm = TRUE.
sd(height$child, na.rm = TRUE)
## [1] 6.680064
Let’s run linear regressions with this dataset. First, we fit a model with child as the response and father and mother as the predictors.
fit_fake_1 <- lm(child ~ father + mother, data = height)
display(fit_fake_1)
## lm(formula = child ~ father + mother, data = height)
## coef.est coef.se
## (Intercept) 93.64 15.66
## father 0.02 0.07
## mother 0.44 0.06
## ---
## n = 377, k = 3
## residual sd = 6.36, R-Squared = 0.11
As this example shows, we can fit a linear regression even when the dataset has missing values. However, the result shows that the number of observations is n = 377. Didn’t we have a dataset with 500 observations?
What will happen if we add the variable sex as a predictor? Since sex has two values, “male” and “female”, let’s create and use a male dummy.
height$male <- ifelse(height$sex == "male", 1, 0)
fit_fake_2 <- lm(child ~ father + mother + male, data = height)
display(fit_fake_2)
## lm(formula = child ~ father + mother + male, data = height)
## coef.est coef.se
## (Intercept) 90.04 12.09
## father 0.03 0.05
## mother 0.43 0.05
## male 8.38 0.50
## ---
## n = 341, k = 4
## residual sd = 4.65, R-Squared = 0.51
Seemingly, we’ve succeeded in fitting the linear regression. But now n = 341 instead of 500 or 377. What’s going on here? It’s probably obvious to you why the number of observations n is smaller than 500.
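This happens because lm() applies listwise deletion: it drops every observation with a missing value in any variable used in the model. As a quick check (a sketch using base R’s complete.cases()), we can count the rows that are complete on each model’s variables.
sum(complete.cases(height[, c("child", "father", "mother")]))
sum(complete.cases(height[, c("child", "sex", "father", "mother")]))
These counts should match the n = 377 and n = 341 reported by display() above.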
Now, let’s fit the linear regressions again after “manually” omitting the observations with missing values. To exclude observations with at least one missing value from the dataset, we use na.omit().
height_comp <- na.omit(height)
fit_fake_1m <- lm(child ~ father + mother, data = height_comp)
fit_fake_2m <- lm(child ~ father + mother + male, data = height_comp)
display(fit_fake_1m)
## lm(formula = child ~ father + mother, data = height_comp)
## coef.est coef.se
## (Intercept) 89.28 16.29
## father 0.06 0.07
## mother 0.44 0.07
## ---
## n = 341, k = 3
## residual sd = 6.26, R-Squared = 0.11
display(fit_fake_2m)
## lm(formula = child ~ father + mother + male, data = height_comp)
## coef.est coef.se
## (Intercept) 90.04 12.09
## father 0.03 0.05
## mother 0.43 0.05
## male 8.38 0.50
## ---
## n = 341, k = 4
## residual sd = 4.65, R-Squared = 0.51
Now the number of observations is 341 in both models. While the result of fit_fake_2m is the same as that of fit_fake_2, fit_fake_1 yields a different result from fit_fake_1m.
To drop only the observations with missing values on a particular variable, you can, for instance, run the code below.
height_2 <- height %>% filter(!is.na(sex))
summary(height_2)
## child sex father mother
## Min. :151.1 female:242 Min. :154.8 Min. :146.4
## 1st Qu.:164.1 male :211 1st Qu.:166.8 1st Qu.:156.4
## Median :168.5 Median :170.1 Median :159.9
## Mean :168.5 Mean :170.1 Mean :159.9
## 3rd Qu.:173.6 3rd Qu.:173.5 3rd Qu.:163.0
## Max. :183.6 Max. :183.5 Max. :174.2
## NA's :48 NA's :39 NA's :38
## male
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4658
## 3rd Qu.:1.0000
## Max. :1.0000
##
In this dataset, sex is observed for all observations.
Suppose that the response y is generated from four variables x1, x2, x3, and x4 by the following relationship: \[y_i \sim \mathrm{N}(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4}, \sigma^2)\] This data generating process can be programmed as follows.
dgp <- function(beta, n, sigma = 4){
  ## Args: beta = parameter vector
  ##       n = number of observations
  ##       sigma = sd of y
  ## Return: data frame containing y and x's
  k <- length(beta)
  data_mat <- matrix(runif(n * (k - 1), -5, 5), nrow = n)
  X <- cbind(rep(1, n), data_mat)
  y <- rnorm(n, X %*% beta, sigma)
  x_header <- paste0("x", 1:(k - 1))
  df <- data.frame(cbind(y, data_mat))
  names(df) <- c("y", x_header)
  return(df)
}
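Before running the simulations, we can peek at what dgp() produces; this is just an illustrative call with arbitrary parameter values.
head(dgp(beta = c(0, 1, -2), n = 100))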
Let’s run a simulation of linear regression where the dataset has no missing values.
beta <- c(.2, 1, 3, -4, 2.5)
n <- 1000
trials <- 100
results <- matrix(NA, nrow = trials, ncol = length(beta))
for (i in 1:trials) {
  df <- dgp(beta, n)
  y <- as.matrix(df[, 1])
  X <- as.matrix(df)
  X[, 1] <- rep(1, n)  # replace the y column with a constant for the intercept
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta, beta_hat)
## [,1] [,2] [,3] [,4] [,5]
## beta 0.2000000 1.000000 3.000000 -4.000000 2.500000
## beta_hat 0.2058601 1.001192 2.999307 -3.999948 2.501431
If missing values are generated in a completely random way, what problems arise in statistical inference when we omit the observations with missing values?
r <- .05 ## missing rate
results <- matrix(NA, nrow = trials, ncol = length(beta))
for (i in 1:trials) {
  df <- dgp(beta, n)
  missing <- rbinom(n, 1, r)
  df[, 1] <- ifelse(missing == 0, df[, 1], NA)
  df <- na.omit(df)
  y <- df[, 1]
  X <- as.matrix(df)
  X[, 1] <- rep(1, nrow(df))
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta, beta_hat)
## [,1] [,2] [,3] [,4] [,5]
## beta 0.2000000 1.000000 3.00000 -4.000000 2.500000
## beta_hat 0.1866423 1.001986 2.99366 -3.995962 2.492347
If missingness is determined not completely at random but by the values of variables contained in the dataset, what problems arise in statistical inference when we omit the observations with missing values?
results <- matrix(NA, nrow = trials, ncol = length(beta))
for (i in 1:trials) {
  df <- dgp(beta, n)
  missing <- rbinom(n, 1, invlogit(-6 + 1 * df$x2 - 1 * df$x3))
  df[, 1] <- ifelse(missing == 0, df[, 1], NA)
  df <- na.omit(df)
  y <- df[, 1]
  X <- as.matrix(df)
  X[, 1] <- rep(1, nrow(df))
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta, beta_hat)
## [,1] [,2] [,3] [,4] [,5]
## beta 0.2000000 1.000000 3.00000 -4.000000 2.500000
## beta_hat 0.1641886 1.007095 2.99806 -3.994628 2.500161
When the probability of missingness and the values of the observations are both determined by variables that are not included in the dataset, what problems do we have in statistical analyses that exclude incomplete observations?
results <- matrix(NA, nrow = trials, ncol = length(beta) - 1)
for (i in 1:trials) {
  df <- dgp(beta, n)
  missing <- rbinom(n, 1, invlogit(-5 + df$x3))
  df[, 1] <- ifelse(missing == 0, df[, 1], NA)
  df <- na.omit(df)
  y <- df[, 1]
  ## assume x3 wasn't observed
  X <- as.matrix(df[, -4])
  X[, 1] <- rep(1, nrow(df))
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta[-4], beta_hat)
## [,1] [,2] [,3] [,4]
## 0.200000 1.000000 3.000000 2.500000
## beta_hat 1.236512 0.958303 3.019955 2.518119
At first glance, the estimated coefficients don’t look problematic. Can you figure out what the problems are?
results <- matrix(NA, nrow = trials, ncol = length(beta))
for (i in 1:trials) {
  df <- dgp(beta, n)
  missing <- rbinom(n, 1, invlogit(-12 + df[, 1]))
  df[, 1] <- ifelse(missing == 0, df[, 1], NA)
  df <- na.omit(df)
  y <- df[, 1]
  X <- as.matrix(df)
  X[, 1] <- rep(1, nrow(df))
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta, beta_hat)
## [,1] [,2] [,3] [,4] [,5]
## beta 0.2000000 1.0000000 3.000000 -4.000000 2.500000
## beta_hat -0.5026569 0.9558369 2.861868 -3.801306 2.398101
What problems do we have?
One way to analyze a dataset with missing values is to omit the observations with missing values and analyze only the complete observations. However, this is not recommended.
Let’s try implementing the content of Chapter 25 in Gelman and Hill (2007).
Here we use the dataset siswave3v4impute3.csv, which is available at http://www.stat.columbia.edu/~gelman/arm/examples/sis. Download the file, save it in the “data” folder within the project, and load it into R.
sis <- read.csv("data/siswave3v4impute3.csv")
As we can see in the Environment pane in RStudio, this dataset has 943 variables, most of which are unnecessary for our purpose. So we extract the variables we are going to use and save them in a new dataset.
First, let’s make a variable earnings, which represents household income, based on the respondent’s earnings and their spouse’s earnings. Before creating the new variable, let’s check the earnings variables we currently have.
summary(sis$rearn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9 0 11000 27240 35000 1500000
summary(sis$tearn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9 0 0 17740 17000 2500000
Looking at the summary statistics of the two variables, we find that the minimum of each is \(-9\). These variables represent income, so they should be nonnegative. According to the codebook, \(-9\) is not an earnings value but a code indicating a missing value. Since R doesn’t know that \(-9\) means missing, it treats it as a negative number. We should replace \(-9\) with NA.
## Function to replace a negative number with NA
neg2na <- function(x) ifelse(x < 0, NA, x)
rearn <- neg2na(sis$rearn)
tearn <- neg2na(sis$tearn)
summary(rearn)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 16000 30330 39000 1500000 153
summary(tearn)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 0 19820 21600 2500000 157
Now the values are all nonnegative. Looking at the output of summary(), rearn and tearn have 153 and 157 missing values, respectively.
We add these two variables together to create a new variable of household income.
myd <- data.frame(earnings = rearn + tearn)
summary(myd$earnings)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 5000 28000 52610 65000 3250000 251
Next, we create a dummy variable indicating Caucasian respondents using the information contained in the variable race. According to the codebook, the value 1 indicates white. We assign 0 to all other values, including missing ones.
myd$white <- ifelse(sis$race == 1, 1, 0)
myd$white[is.na(sis$race)] <- 0
Then, we’ll make the male dummy and a dummy indicating that the respondent is over 65 years old.
myd$male <- ifelse(sis$sex == 1, 1, 0)
myd$over65 <- ifelse(sis$r_age > 65, 1, 0)
Similarly, we’ll modify some other variables.
myd$immig <- sis$immig
myd$immig[is.na(myd$immig)] <- 0
myd$educ_r <- sis$educ_r
myd$educ_r[is.na(myd$educ_r)] <- 2.5
myd$earners <- sis$earners
myd$earners[is.na(myd$earners)] <- 1
myd$no_earners <- ifelse(myd$earners == 0, 1, 0)
myd$workmos <- sis$workmos
myd$earnings[myd$workmos == 0] <- 0
myd$retirement <- with(sis, neg2na(socsec) + neg2na(pension))
myd$interest <- neg2na(sis$interest)
myd$assistance <- with(sis, neg2na(unemp) + neg2na(ssi) +
                         neg2na(welfare) + neg2na(charity))
myd$other <- with(sis, neg2na(alimony) + neg2na(giftmon))
Here is a function to create a dummy indicating that a value is positive, treating NA as 0.
is_any <- function(x) {
  any_x <- ifelse(x > 0, 1, 0)
  any_x[is.na(x)] <- 0
  return(any_x)
}
myd$any_unemp <- is_any(sis$unemp)
myd$any_ssi <- is_any(sis$ssi)
myd$any_welfare <- is_any(sis$welfare)
myd$any_charity <- is_any(sis$charity)
Rescale some variables, dividing by 1,000.
myd <- myd %>%
  mutate(earnings = earnings / 1000,
         retirement = retirement / 1000,
         interest = interest / 1000,
         assistance = assistance / 1000,
         other = other / 1000)
myd_original <- myd
Top-code the right-skewed variables.
## See Gelman and Hill (2007): p.534
topcode <- function(a, top) {
  return(ifelse(a > top, top, a))
}
myd$workhrs_top <- topcode(sis$workhrs, 40)
myd$earnings_top <- topcode(myd$earnings, 100)
myd$retirement_top <- topcode(myd$retirement, 100)
myd$interest_top <- topcode(myd$interest, 100)
myd$assistance_top <- topcode(myd$assistance, 10)
myd$other_top <- topcode(myd$other, 10)
Simple random imputation.
## Gelman and Hill (2007): p.534
random_imp <- function(a) {
  missing <- is.na(a)
  n_missing <- sum(missing)
  a_obs <- a[!missing]
  imputed <- a
  ## Impute random values from the observed
  imputed[missing] <- sample(a_obs, n_missing, replace = TRUE)
  return(imputed)
}
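As a small illustration with a hypothetical input vector, random_imp() fills each NA with a value drawn at random from the observed entries (the imputed values vary from run to run).
random_imp(c(10, NA, 30, NA, 50))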
The distribution of the observed values of earnings_top.
df <- filter(myd, earnings > 0)
earn_obs <- ggplot(df, aes(earnings_top)) + geom_histogram(binwidth = 10)
earn_obs + labs(x = "earnings", title = "Observed Values (excl. 0)")
Impute the values.
myd <- myd %>%
  mutate(earnings_imp_1 = random_imp(earnings),
         earnings_imp_1_top = topcode(earnings_imp_1, 100))
The distribution of the imputed earnings_top.
df <- filter(myd, earnings_imp_1 > 0, is.na(earnings))
earn_imp_1 <- ggplot(df, aes(earnings_imp_1_top)) +
  geom_histogram(binwidth = 10)
earn_imp_1 + labs(x = "earnings", title = "Randomly Imputed Values")
Impute the missing values with the predicted values obtained by linear regression.
## Gelman and Hill (2007): p.534
lm_imp_2 <- lm(earnings ~ male + over65 + white + immig + educ_r +
                 workmos + workhrs_top + any_ssi + any_welfare + any_charity,
               data = myd, subset = (earnings > 0))
display(lm_imp_2)
## lm(formula = earnings ~ male + over65 + white + immig + educ_r +
## workmos + workhrs_top + any_ssi + any_welfare + any_charity,
## data = myd, subset = (earnings > 0))
## coef.est coef.se
## (Intercept) -87.02 29.61
## male -0.36 8.79
## over65 -39.80 39.21
## white 26.03 10.18
## immig -12.28 9.10
## educ_r 22.48 4.49
## workmos 5.33 2.06
## workhrs_top 0.79 0.62
## any_ssi -34.39 37.31
## any_welfare -12.26 24.74
## any_charity -29.36 40.49
## ---
## n = 988, k = 11
## residual sd = 132.24, R-Squared = 0.08
pred_2 <- predict(lm_imp_2, myd)
impute <- function(a, a_impute) {
  ifelse(is.na(a), a_impute, a)
}
myd <- myd %>%
  mutate(earnings_imp_2 = impute(myd$earnings, pred_2),
         earnings_imp_2_top = topcode(earnings_imp_2, 100))
The distribution of imputed values.
df <- myd %>% filter(earnings_imp_2 > 0, is.na(earnings))
earn_imp_2 <- ggplot(df, aes(earnings_imp_2_top)) +
  geom_histogram(binwidth = 10)
earn_imp_2 + labs(x = "earnings", title = "Predicted-Value Imputation")
Let’s make it better by modeling the square root of the top-coded earnings.
## Gelman and Hill (2007): p.535
lm_imp_3_sqrt <- lm(I(sqrt(earnings_top)) ~ male + over65 + white +
                      immig + educ_r + workmos + workhrs_top +
                      any_ssi + any_welfare + any_charity,
                    data = myd, subset = (earnings > 0))
display(lm_imp_3_sqrt)
## lm(formula = I(sqrt(earnings_top)) ~ male + over65 + white +
## immig + educ_r + workmos + workhrs_top + any_ssi + any_welfare +
## any_charity, data = myd, subset = (earnings > 0))
## coef.est coef.se
## (Intercept) -1.67 0.44
## male 0.32 0.13
## over65 -1.44 0.58
## white 0.96 0.15
## immig -0.62 0.14
## educ_r 0.79 0.07
## workmos 0.33 0.03
## workhrs_top 0.06 0.01
## any_ssi -0.97 0.55
## any_welfare -1.35 0.37
## any_charity -1.17 0.60
## ---
## n = 988, k = 11
## residual sd = 1.96, R-Squared = 0.44
pred_3_sqrt <- predict(lm_imp_3_sqrt, myd)
myd$pred_3 <- topcode(pred_3_sqrt^2, 100)
myd <- myd %>%
  mutate(earnings_imp_3 = impute(earnings_top, pred_3))
df <- myd %>% filter(earnings_imp_3 > 0, is.na(earnings))
earn_imp_3 <- ggplot(df, aes(earnings_imp_3)) +
  geom_histogram(binwidth = 10)
earn_imp_3 + labs(x = "earnings", title = "Predicted-Value Imputation: Better Fit")
Add uncertainty to the predicted values, and use them for imputation.
## Gelman and Hill (2007): p.536
pred_4_sqrt <- rnorm(nrow(myd), predict(lm_imp_3_sqrt, myd),
                     sigma.hat(lm_imp_3_sqrt))
myd$pred_4 <- topcode(pred_4_sqrt^2, 100)
myd <- myd %>%
  mutate(earnings_imp_4 = impute(earnings_top, pred_4))
Comparing the imputed variables: predicted-value imputation vs. random imputation.
reg_det <- ggplot(myd, aes(x = pred_3, y = earnings_imp_3,
                           color = !is.na(earnings))) +
  geom_point() +
  scale_color_discrete(name = "", labels = c("imputed", "observed"))
reg_det + labs(x = "predicted by linear regression", y = "imputed earnings",
               title = "Imputation by Linear Prediction")
reg_rnd <- ggplot(myd, aes(x = pred_3, y = earnings_imp_1_top,
                           color = !is.na(earnings))) +
  geom_point() +
  scale_color_discrete(name = "", labels = c("imputed", "observed"))
reg_rnd + labs(x = "predicted by linear regression", y = "imputed earnings",
               title = "Random Imputation")
Multiple imputation deals with missing values in steps: impute the missing values multiple times to create \(m\) completed datasets, analyze each dataset, and combine the \(m\) sets of results. The R package mi helps us go through these steps. I’ll briefly explain the basic usage here. For more information, see the mi package documentation.
Load the package if you haven’t yet done so; the code below installs it first if it’s missing.
if (!require(mi)) {
  install.packages("mi", dependencies = TRUE)
  library("mi")
}
First, convert the data frame to a missing_data.frame.
mdf <- missing_data.frame(myd_original)
Let’s check the content of mdf with show().
show(mdf)
## Object of class missing_data.frame with 1501 observations on 17 variables
##
## There are 29 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## earnings continuous 241 ppd linear
## white binary 0 <NA> <NA>
## male binary 0 <NA> <NA>
## over65 binary 0 <NA> <NA>
## immig binary 0 <NA> <NA>
## educ_r continuous 0 <NA> <NA>
## earners ordered-categorical 0 <NA> <NA>
## no_earners binary 0 <NA> <NA>
## workmos continuous 0 <NA> <NA>
## retirement continuous 123 ppd linear
## interest continuous 276 ppd linear
## assistance continuous 128 ppd linear
## other continuous 133 ppd linear
## any_unemp binary 0 <NA> <NA>
## any_ssi binary 0 <NA> <NA>
## any_welfare binary 0 <NA> <NA>
## any_charity binary 0 <NA> <NA>
##
## family link transformation
## earnings gaussian identity standardize
## white <NA> <NA> <NA>
## male <NA> <NA> <NA>
## over65 <NA> <NA> <NA>
## immig <NA> <NA> <NA>
## educ_r <NA> <NA> standardize
## earners <NA> <NA> <NA>
## no_earners <NA> <NA> <NA>
## workmos <NA> <NA> standardize
## retirement gaussian identity standardize
## interest gaussian identity standardize
## assistance gaussian identity standardize
## other gaussian identity standardize
## any_unemp <NA> <NA> <NA>
## any_ssi <NA> <NA> <NA>
## any_welfare <NA> <NA> <NA>
## any_charity <NA> <NA> <NA>
As shown, mdf contains the information required to conduct multiple imputation. Above all, the type of each variable is important. If R’s initial guess is wrong, you can change it with change(). For instance, to change the type of educ_r from continuous to ordered-categorical, run the following.
mdf <- change(mdf, y = "educ_r", what = "type", to = "ordered-categorical")
show(mdf)
## Object of class missing_data.frame with 1501 observations on 17 variables
##
## There are 29 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## earnings continuous 241 ppd linear
## white binary 0 <NA> <NA>
## male binary 0 <NA> <NA>
## over65 binary 0 <NA> <NA>
## immig binary 0 <NA> <NA>
## educ_r ordered-categorical 0 <NA> <NA>
## earners ordered-categorical 0 <NA> <NA>
## no_earners binary 0 <NA> <NA>
## workmos continuous 0 <NA> <NA>
## retirement continuous 123 ppd linear
## interest continuous 276 ppd linear
## assistance continuous 128 ppd linear
## other continuous 133 ppd linear
## any_unemp binary 0 <NA> <NA>
## any_ssi binary 0 <NA> <NA>
## any_welfare binary 0 <NA> <NA>
## any_charity binary 0 <NA> <NA>
##
## family link transformation
## earnings gaussian identity standardize
## white <NA> <NA> <NA>
## male <NA> <NA> <NA>
## over65 <NA> <NA> <NA>
## immig <NA> <NA> <NA>
## educ_r <NA> <NA> <NA>
## earners <NA> <NA> <NA>
## no_earners <NA> <NA> <NA>
## workmos <NA> <NA> standardize
## retirement gaussian identity standardize
## interest gaussian identity standardize
## assistance gaussian identity standardize
## other gaussian identity standardize
## any_unemp <NA> <NA> <NA>
## any_ssi <NA> <NA> <NA>
## any_welfare <NA> <NA> <NA>
## any_charity <NA> <NA> <NA>
Let’s visualize the pattern of missingness.
image(mdf)
hist(mdf)
Now, we’ll do the multiple imputation. Since it takes some time, multi-core computation is recommended. If your computer’s CPU has more than one core, set that number first with options(mc.cores). Then implement the multiple imputation with mi(), setting the number of iterations n.iter and the number of chains n.chains. Here, we’ll run 4 chains of 30 iterations each.
options(mc.cores = 4)
imputations <- mi(mdf, n.iter = 30, n.chains = 4, max.minutes = 20)
show(imputations)
## Object of class mi with 4 chains, each with 30 iterations.
## Each chain is the evolution of an object of missing_data.frame class with 1501 observations on 17 variables.
Then, we need to check whether the number of iterations was large enough. The mean of each variable should be about the same across all chains.
round(mipply(imputations, mean, to.matrix = TRUE), 3)
## chain:1 chain:2 chain:3 chain:4
## earnings 0.009 0.013 0.007 0.017
## white 1.318 1.318 1.318 1.318
## male 1.362 1.362 1.362 1.362
## over65 1.077 1.077 1.077 1.077
## immig 1.392 1.392 1.392 1.392
## educ_r 3.324 3.324 3.324 3.324
## earners 2.143 2.143 2.143 2.143
## no_earners 1.181 1.181 1.181 1.181
## workmos 0.000 0.000 0.000 0.000
## retirement 0.028 0.022 0.019 0.022
## interest 0.002 0.019 0.007 0.007
## assistance -0.001 -0.004 -0.011 -0.004
## other -0.002 0.000 0.002 -0.003
## any_unemp 1.055 1.055 1.055 1.055
## any_ssi 1.037 1.037 1.037 1.037
## any_welfare 1.039 1.039 1.039 1.039
## any_charity 1.010 1.010 1.010 1.010
## missing_earnings 0.161 0.161 0.161 0.161
## missing_retirement 0.082 0.082 0.082 0.082
## missing_interest 0.184 0.184 0.184 0.184
## missing_assistance 0.085 0.085 0.085 0.085
## missing_other 0.089 0.089 0.089 0.089
Next, we’ll check \(\hat{R}\), which approaches 1 as the chains converge.
Rhats(imputations)
## mean_earnings mean_retirement mean_interest mean_assistance
## 1.0023568 1.0386536 1.0220113 0.9910506
## mean_other sd_earnings sd_retirement sd_interest
## 0.9972806 1.0115627 1.0147530 0.9882039
## sd_assistance sd_other
## 0.9924869 0.9859840
To add iterations, we apply mi() to the imputed object. Let’s add 20 more iterations.
imputations <- mi(imputations, n.iter = 20)
To visually check how imputation works, we plot the imputed object.
plot(imputations)
image(imputations)
(Outputs are omitted.)
Then, check the summary.
summary(imputations)
(Output is omitted.)
Finally, we analyze the \(m\) different datasets obtained from the 4 chains of imputed values. Here, we set \(m = 5\).
fit <- pool(earnings ~ white + male, data = imputations, m = 5)
display(fit)
## bayesglm(formula = earnings ~ white + male, data = imputations,
## m = 5)
## coef.est coef.se
## (Intercept) 42.57 4.75
## white1 39.57 7.10
## male1 1.23 6.90
## n = 1498, k = 3
## residual deviance = 22843730.5, null deviance = 23357751.1 (difference = 514020.6)
## overdispersion parameter = 15249.5
## residual sd is sqrt(overdispersion) = 123.49
Let’s compare this result with that obtained from an analysis of the complete observations only.
fit_comp <- lm(earnings ~ white + male, data = myd_original)
display(fit_comp)
## lm(formula = earnings ~ white + male, data = myd_original)
## coef.est coef.se
## (Intercept) 40.22 4.86
## white 36.69 7.54
## male 1.92 7.21
## ---
## n = 1260, k = 3
## residual sd = 123.19, R-Squared = 0.02
What differences do you find?
If you want to analyze each imputed dataset, extract the datasets as follows.
dfs <- complete(imputations, m = 5)
imp_1 <- dfs[[1]]
imp_2 <- dfs[[2]]
imp_3 <- dfs[[3]]
imp_4 <- dfs[[4]]
imp_5 <- dfs[[5]]
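If you analyze the \(m\) datasets separately, remember to combine the results across imputations rather than report estimates from a single dataset. Below is a minimal sketch of pooling by hand with Rubin’s rules, assuming the same model earnings ~ white + male as above; pool() already does this for us.
## Fit the same model to each completed dataset
fits <- lapply(dfs, function(d) lm(earnings ~ white + male, data = d))
est <- sapply(fits, coef)        # one column of estimates per imputation
q_bar <- rowMeans(est)           # pooled point estimates
## Within- and between-imputation variances
u_bar <- rowMeans(sapply(fits, function(f) diag(vcov(f))))
b <- apply(est, 1, var)
m <- length(fits)
se_pooled <- sqrt(u_bar + (1 + 1 / m) * b)  # Rubin's total variance
cbind(estimate = q_bar, se = se_pooled)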
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press.