Load the packages we use this week.
library("readr")
library("dplyr")
library("ggplot2")
library("mi")
library("arm")
In R, missing values are represented by NA, which is not the character string “NA” but simply NA without quotation marks. When we load a CSV dataset into R, blank cells are automatically filled with NA. For Stata datasets, read.dta() or haven::read_dta() transforms “.”, which indicates missing values in Stata, into NA. If a dataset uses a non-standard missing-value code, we have to tell R which values represent missingness. For instance, if “.” indicates missing values in a CSV dataset, we must tell R that the file uses “.” as missing instead of the standard “ ” (blank character), as follows.
data_url <- "http://www2.kobe-u.ac.jp/~yyanai/classes/rm1/contents/data/fake-missing.csv"
height <- read.csv(data_url, na.strings = ".")
dim(height)
## [1] 500 4
This dataset contains four variables for 500 observations. Let’s examine the data with summary().
summary(height)
## child sex father mother
## Min. :151.1 female:242 Min. :154.8 Min. :146.4
## 1st Qu.:164.0 male :211 1st Qu.:166.8 1st Qu.:156.7
## Median :168.5 NA's : 47 Median :170.0 Median :159.9
## Mean :168.6 Mean :170.1 Mean :160.0
## 3rd Qu.:173.6 3rd Qu.:173.4 3rd Qu.:163.0
## Max. :184.1 Max. :185.4 Max. :174.2
## NA's :54 NA's :41 NA's :42
This shows that the four variables in the dataset are child, sex, father, and mother. In addition, these variables respectively have missing values for 54, 47, 41, and 42 observations.
We use is.na() to find missing values.
is.na(height$child[1:5])
## [1] FALSE FALSE FALSE TRUE FALSE
sum(is.na(height$child))
## [1] 54
Now, let’s obtain the standard deviation of the child variable with sd().
sd(height$child)
## [1] NA
As this example shows, functions that calculate statistics return NA when the object passed to them contains a missing value. To get the statistic excluding the missing values, we have to set na.rm = TRUE.
sd(height$child, na.rm = TRUE)
## [1] 6.680064
Let’s run linear regressions with this dataset. First, we fit a model with child as the response and father and mother as the predictors.
fit_fake_1 <- lm(child ~ father + mother, data = height)
display(fit_fake_1)
## lm(formula = child ~ father + mother, data = height)
## coef.est coef.se
## (Intercept) 93.64 15.66
## father 0.02 0.07
## mother 0.44 0.06
## ---
## n = 377, k = 3
## residual sd = 6.36, R-Squared = 0.11
As this example shows, we can fit a linear regression even when the dataset has missing values. However, the result shows that the number of observations is n = 377. Didn’t we have a dataset with 500 observations?
What will happen if we add the variable sex as a predictor? Since sex has two values, “male” and “female”, let’s create and use a male dummy.
height$male <- ifelse(height$sex == "male", 1, 0)
fit_fake_2 <- lm(child ~ father + mother + male, data = height)
display(fit_fake_2)
## lm(formula = child ~ father + mother + male, data = height)
## coef.est coef.se
## (Intercept) 90.04 12.09
## father 0.03 0.05
## mother 0.43 0.05
## male 8.38 0.50
## ---
## n = 341, k = 4
## residual sd = 4.65, R-Squared = 0.51
Seemingly, we’ve succeeded in fitting the linear regression. But now n = 341 instead of 500 or 377. What’s going on here? It’s probably obvious to you why the number of observations n is smaller than 500.
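This happens because lm() applies listwise deletion: it drops every observation with a missing value in any variable used in the model. As a quick check (a sketch using base R’s complete.cases()), we can count the rows that are complete on each model’s variables.
sum(complete.cases(height[, c("child", "father", "mother")]))
sum(complete.cases(height[, c("child", "sex", "father", "mother")]))
These counts should match the n = 377 and n = 341 reported by display() above.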
Now, let’s fit the linear regressions again after “manually” omitting the observations with missing values. To exclude observations with at least one missing value from the dataset, we use na.omit().
height_comp <- na.omit(height)
fit_fake_1m <- lm(child ~ father + mother, data = height_comp)
fit_fake_2m <- lm(child ~ father + mother + male, data = height_comp)
display(fit_fake_1m)
## lm(formula = child ~ father + mother, data = height_comp)
## coef.est coef.se
## (Intercept) 89.28 16.29
## father 0.06 0.07
## mother 0.44 0.07
## ---
## n = 341, k = 3
## residual sd = 6.26, R-Squared = 0.11
display(fit_fake_2m)
## lm(formula = child ~ father + mother + male, data = height_comp)
## coef.est coef.se
## (Intercept) 90.04 12.09
## father 0.03 0.05
## mother 0.43 0.05
## male 8.38 0.50
## ---
## n = 341, k = 4
## residual sd = 4.65, R-Squared = 0.51
Now the number of observations is 341 in both models. While the result of fit_fake_2m is the same as that of fit_fake_2, fit_fake_1 yields a different result from fit_fake_1m.
To drop only the observations with missing values on a particular variable, you can, for instance, run the code below.
height_2 <- height %>% filter(!is.na(sex))
summary(height_2)
## child sex father mother
## Min. :151.1 female:242 Min. :154.8 Min. :146.4
## 1st Qu.:164.1 male :211 1st Qu.:166.8 1st Qu.:156.4
## Median :168.5 Median :170.1 Median :159.9
## Mean :168.5 Mean :170.1 Mean :159.9
## 3rd Qu.:173.6 3rd Qu.:173.5 3rd Qu.:163.0
## Max. :183.6 Max. :183.5 Max. :174.2
## NA's :48 NA's :39 NA's :38
## male
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4658
## 3rd Qu.:1.0000
## Max. :1.0000
##
In this dataset, sex is observed for all observations.
Suppose that the response y is generated from four variables x1, x2, x3, and x4 by the following relationship: \[y_i \sim \mathrm{N}(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4}, \sigma^2)\] This data generating process can be programmed as follows.
dgp <- function(beta, n, sigma = 4){
  ## Args: beta = parameter vector
  ##       n = number of observations
  ##       sigma = sd of y
  ## Return: data frame containing y and x's
  k <- length(beta)
  data_mat <- matrix(runif(n * (k - 1), -5, 5), nrow = n)
  X <- cbind(rep(1, n), data_mat)
  y <- rnorm(n, X %*% beta, sigma)
  x_header <- paste0("x", 1:(k - 1))
  df <- data.frame(cbind(y, data_mat))
  names(df) <- c("y", x_header)
  return(df)
}
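Before running the simulations, we can peek at what dgp() produces; this is just an illustrative call with arbitrary parameter values.
head(dgp(beta = c(0, 1, -2), n = 100))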
Let’s run a simulation of linear regression where the dataset has no missing values.
beta <- c(.2, 1, 3, -4, 2.5)
n <- 1000
trials <- 100
results <- matrix(NA, nrow = trials, ncol = length(beta))
for (i in 1:trials) {
  df <- dgp(beta, n)
  y <- as.matrix(df[, 1])
  X <- as.matrix(df)
  X[, 1] <- rep(1, n)  # replace the y column with a constant for the intercept
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta, beta_hat)
## [,1] [,2] [,3] [,4] [,5]
## beta 0.2000000 1.000000 3.000000 -4.000000 2.500000
## beta_hat 0.2058601 1.001192 2.999307 -3.999948 2.501431
If missing values are generated in a completely random way, what problems arise in statistical inference when we omit the observations with missing values?
r <- .05 ## missing rate
results <- matrix(NA, nrow = trials, ncol = length(beta))
for (i in 1:trials) {
  df <- dgp(beta, n)
  missing <- rbinom(n, 1, r)
  df[, 1] <- ifelse(missing == 0, df[, 1], NA)
  df <- na.omit(df)
  y <- df[, 1]
  X <- as.matrix(df)
  X[, 1] <- rep(1, nrow(df))
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta, beta_hat)
## [,1] [,2] [,3] [,4] [,5]
## beta 0.2000000 1.000000 3.00000 -4.000000 2.500000
## beta_hat 0.1866423 1.001986 2.99366 -3.995962 2.492347
If missingness is determined not completely at random but by the values of variables contained in the dataset, what problems arise in statistical inference when we omit the observations with missing values?
results <- matrix(NA, nrow = trials, ncol = length(beta))
for (i in 1:trials) {
  df <- dgp(beta, n)
  missing <- rbinom(n, 1, invlogit(-6 + 1 * df$x2 - 1 * df$x3))
  df[, 1] <- ifelse(missing == 0, df[, 1], NA)
  df <- na.omit(df)
  y <- df[, 1]
  X <- as.matrix(df)
  X[, 1] <- rep(1, nrow(df))
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta, beta_hat)
## [,1] [,2] [,3] [,4] [,5]
## beta 0.2000000 1.000000 3.00000 -4.000000 2.500000
## beta_hat 0.1641886 1.007095 2.99806 -3.994628 2.500161
When the probability of missingness and the values of the observations are both determined by variables that are not included in the dataset, what problems do we have in statistical analyses that exclude incomplete observations?
results <- matrix(NA, nrow = trials, ncol = length(beta) - 1)
for (i in 1:trials) {
  df <- dgp(beta, n)
  missing <- rbinom(n, 1, invlogit(-5 + df$x3))
  df[, 1] <- ifelse(missing == 0, df[, 1], NA)
  df <- na.omit(df)
  y <- df[, 1]
  ## assume x3 wasn't observed
  X <- as.matrix(df[, -4])
  X[, 1] <- rep(1, nrow(df))
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta[-4], beta_hat)
## [,1] [,2] [,3] [,4]
## 0.200000 1.000000 3.000000 2.500000
## beta_hat 1.236512 0.958303 3.019955 2.518119
At first glance, the estimated coefficients don’t look problematic. Can you figure out what the problems are?
results <- matrix(NA, nrow = trials, ncol = length(beta))
for (i in 1:trials) {
  df <- dgp(beta, n)
  missing <- rbinom(n, 1, invlogit(-12 + df[, 1]))
  df[, 1] <- ifelse(missing == 0, df[, 1], NA)
  df <- na.omit(df)
  y <- df[, 1]
  X <- as.matrix(df)
  X[, 1] <- rep(1, nrow(df))
  results[i, ] <- coef(lm(y ~ X + 0))
}
beta_hat <- apply(results, 2, mean)
rbind(beta, beta_hat)
## [,1] [,2] [,3] [,4] [,5]
## beta 0.2000000 1.0000000 3.000000 -4.000000 2.500000
## beta_hat -0.5026569 0.9558369 2.861868 -3.801306 2.398101
What problems do we have?
One way to analyze a dataset with missing values is to omit the observations with missing values and analyze only the complete observations. However, this is not recommended.
Let’s try implementing the content of Chapter 25 in Gelman and Hill (2007).
Here we use the dataset siswave3v4impute3.csv, which is available at http://www.stat.columbia.edu/~gelman/arm/examples/sis. Download the file, save it in the “data” folder within the project, and load it into R.
sis <- read.csv("data/siswave3v4impute3.csv")
As we can see in the Environment pane in RStudio, this dataset has 943 variables, most of which are unnecessary for our purpose. So we extract the variables we are going to use and save them in a new dataset.
First, let’s make a variable earnings, which represents household income, based on the respondent’s earnings and their spouse’s earnings. Before creating the new variable, let’s check the earnings variables we currently have.
summary(sis$rearn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9 0 11000 27240 35000 1500000
summary(sis$tearn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9 0 0 17740 17000 2500000
Looking at the summary statistics of the two variables, we find that the minimum of each is \(-9\). These variables represent income, so they should be nonnegative. According to the codebook, \(-9\) is not an earnings value but a code indicating a missing value. Since R doesn’t know that \(-9\) means missing, it treats it as a negative number. We should replace \(-9\) with NA.
## Function to replace a negative number with NA
neg2na <- function(x) ifelse(x < 0, NA, x)
rearn <- neg2na(sis$rearn)
tearn <- neg2na(sis$tearn)
summary(rearn)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 16000 30330 39000 1500000 153
summary(tearn)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 0 19820 21600 2500000 157
Now the values are all nonnegative. Looking at the output of summary(), rearn and tearn have 153 and 157 missing values, respectively.
We add these two variables together to create a new variable of household income.
myd <- data.frame(earnings = rearn + tearn)
summary(myd$earnings)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 5000 28000 52610 65000 3250000 251
Next, we create a dummy variable indicating Caucasian respondents using the information contained in the variable race. According to the codebook, the value 1 indicates white. We assign 0 to all other values, including missing ones.
myd$white <- ifelse(sis$race == 1, 1, 0)
myd$white[is.na(sis$race)] <- 0
Then, we’ll make the male dummy and a dummy indicating that the respondent is over 65 years old.
myd$male <- ifelse(sis$sex == 1, 1, 0)
myd$over65 <- ifelse(sis$r_age > 65, 1, 0)
Similarly, we’ll modify some other variables.
myd$immig <- sis$immig
myd$immig[is.na(myd$immig)] <- 0
myd$educ_r <- sis$educ_r
myd$educ_r[is.na(myd$educ_r)] <- 2.5
myd$earners <- sis$earners
myd$earners[is.na(myd$earners)] <- 1
myd$no_earners <- ifelse(myd$earners == 0, 1, 0)
myd$workmos <- sis$workmos
myd$earnings[myd$workmos == 0] <- 0
myd$retirement <- with(sis, neg2na(socsec) + neg2na(pension))
myd$interest <- neg2na(sis$interest)
myd$assistance <- with(sis, neg2na(unemp) + neg2na(ssi) +
                         neg2na(welfare) + neg2na(charity))
myd$other <- with(sis, neg2na(alimony) + neg2na(giftmon))
Here is a function to create a dummy indicating that a value is positive, treating NA as 0.
is_any <- function(x) {
  any_x <- ifelse(x > 0, 1, 0)
  any_x[is.na(x)] <- 0
  return(any_x)
}
myd$any_unemp <- is_any(sis$unemp)
myd$any_ssi <- is_any(sis$ssi)
myd$any_welfare <- is_any(sis$welfare)
myd$any_charity <- is_any(sis$charity)
Rescale some variables, dividing by 1,000.
myd <- myd %>%
  mutate(earnings = earnings / 1000,
         retirement = retirement / 1000,
         interest = interest / 1000,
         assistance = assistance / 1000,
         other = other / 1000)
myd_original <- myd
Top-code the right-skewed variables.
## See Gelman and Hill (2007): p.534
topcode <- function(a, top) {
  return(ifelse(a > top, top, a))
}
myd$workhrs_top <- topcode(sis$workhrs, 40)
myd$earnings_top <- topcode(myd$earnings, 100)
myd$retirement_top <- topcode(myd$retirement, 100)
myd$interest_top <- topcode(myd$interest, 100)
myd$assistance_top <- topcode(myd$assistance, 10)
myd$other_top <- topcode(myd$other, 10)
Simple random imputation.
## Gelman and Hill (2007): p.534
random_imp <- function(a) {
  missing <- is.na(a)
  n_missing <- sum(missing)
  a_obs <- a[!missing]
  imputed <- a
  ## Impute random values from the observed
  imputed[missing] <- sample(a_obs, n_missing, replace = TRUE)
  return(imputed)
}
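As a small illustration with a hypothetical input vector, random_imp() fills each NA with a value drawn at random from the observed entries (the imputed values vary from run to run).
random_imp(c(10, NA, 30, NA, 50))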
The distribution of the observed values of earnings_top.
df <- filter(myd, earnings > 0)
earn_obs <- ggplot(df, aes(earnings_top)) + geom_histogram(binwidth = 10)
earn_obs + labs(x = "earnings", title = "Observed Values (excl. 0)")
Impute the values.
myd <- myd %>%
  mutate(earnings_imp_1 = random_imp(earnings),
         earnings_imp_1_top = topcode(earnings_imp_1, 100))
The distribution of the imputed earnings_top.
df <- filter(myd, earnings_imp_1 > 0, is.na(earnings))
earn_imp_1 <- ggplot(df, aes(earnings_imp_1_top)) +
  geom_histogram(binwidth = 10)
earn_imp_1 + labs(x = "earnings", title = "Randomly Imputed Values")
Impute the missing values with the predicted values obtained by linear regression.
## Gelman and Hill (2007): p.534
lm_imp_2 <- lm(earnings ~ male + over65 + white + immig + educ_r +
                 workmos + workhrs_top + any_ssi + any_welfare + any_charity,
               data = myd, subset = (earnings > 0))
display(lm_imp_2)
## lm(formula = earnings ~ male + over65 + white + immig + educ_r +
## workmos + workhrs_top + any_ssi + any_welfare + any_charity,
## data = myd, subset = (earnings > 0))
## coef.est coef.se
## (Intercept) -87.02 29.61
## male -0.36 8.79
## over65 -39.80 39.21
## white 26.03 10.18
## immig -12.28 9.10
## educ_r 22.48 4.49
## workmos 5.33 2.06
## workhrs_top 0.79 0.62
## any_ssi -34.39 37.31
## any_welfare -12.26 24.74
## any_charity -29.36 40.49
## ---
## n = 988, k = 11
## residual sd = 132.24, R-Squared = 0.08
pred_2 <- predict(lm_imp_2, myd)
impute <- function(a, a_impute) {
  ifelse(is.na(a), a_impute, a)
}
myd <- myd %>%
  mutate(earnings_imp_2 = impute(myd$earnings, pred_2),
         earnings_imp_2_top = topcode(earnings_imp_2, 100))
The distribution of imputed values.
df <- myd %>% filter(earnings_imp_2 > 0, is.na(earnings))
earn_imp_2 <- ggplot(df, aes(earnings_imp_2_top)) +
  geom_histogram(binwidth = 10)
earn_imp_2 + labs(x = "earnings", title = "Predicted-Value Imputation")
Let’s make it better by modeling the square root of the top-coded earnings.
## Gelman and Hill (2007): p.535
lm_imp_3_sqrt <- lm(I(sqrt(earnings_top)) ~ male + over65 + white +
                      immig + educ_r + workmos + workhrs_top +
                      any_ssi + any_welfare + any_charity,
                    data = myd, subset = (earnings > 0))
display(lm_imp_3_sqrt)
## lm(formula = I(sqrt(earnings_top)) ~ male + over65 + white +
## immig + educ_r + workmos + workhrs_top + any_ssi + any_welfare +
## any_charity, data = myd, subset = (earnings > 0))
## coef.est coef.se
## (Intercept) -1.67 0.44
## male 0.32 0.13
## over65 -1.44 0.58
## white 0.96 0.15
## immig -0.62 0.14
## educ_r 0.79 0.07
## workmos 0.33 0.03
## workhrs_top 0.06 0.01
## any_ssi -0.97 0.55
## any_welfare -1.35 0.37
## any_charity -1.17 0.60
## ---
## n = 988, k = 11
## residual sd = 1.96, R-Squared = 0.44
pred_3_sqrt <- predict(lm_imp_3_sqrt, myd)
myd$pred_3 <- topcode(pred_3_sqrt^2, 100)
myd <- myd %>%
  mutate(earnings_imp_3 = impute(earnings_top, pred_3))
df <- myd %>% filter(earnings_imp_3 > 0, is.na(earnings))
earn_imp_3 <- ggplot(df, aes(earnings_imp_3)) +
  geom_histogram(binwidth = 10)
earn_imp_3 + labs(x = "earnings", title = "Predicted-Value Imputation: Better Fit")
Add uncertainty to the predicted values, and use them for imputation.
## Gelman and Hill (2007): p.536
pred_4_sqrt <- rnorm(nrow(myd), predict(lm_imp_3_sqrt, myd),
                     sigma.hat(lm_imp_3_sqrt))
myd$pred_4 <- topcode(pred_4_sqrt^2, 100)
myd <- myd %>%
  mutate(earnings_imp_4 = impute(earnings_top, pred_4))
Comparing the imputed variables: predicted-value imputation vs. random imputation.
reg_det <- ggplot(myd, aes(x = pred_3, y = earnings_imp_3,
                           color = !is.na(earnings))) +
  geom_point() +
  scale_color_discrete(name = "", labels = c("imputed", "observed"))
reg_det + labs(x = "predicted by linear regression", y = "imputed earnings",
               title = "Imputation by Linear Prediction")
reg_rnd <- ggplot(myd, aes(x = pred_3, y = earnings_imp_1_top,
                           color = !is.na(earnings))) +
  geom_point() +
  scale_color_discrete(name = "", labels = c("imputed", "observed"))
reg_rnd + labs(x = "predicted by linear regression", y = "imputed earnings",
               title = "Random Imputation")
Multiple imputation deals with missing values in steps: impute the missing values multiple times to create \(m\) completed datasets, analyze each dataset, and combine the \(m\) sets of results. The R package mi helps us go through these steps. I’ll briefly explain the basic usage here. For more information, see the mi package documentation.
Load the package if you haven’t yet done so; the code below installs it first if it’s missing.
if (!require(mi)) {
  install.packages("mi", dependencies = TRUE)
  library("mi")
}
First, convert the data frame to a missing_data.frame.
mdf <- missing_data.frame(myd_original)
Let’s check the content of mdf with show().
show(mdf)
## Object of class missing_data.frame with 1501 observations on 17 variables
##
## There are 29 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## earnings continuous 241 ppd linear
## white binary 0 <NA> <NA>
## male binary 0 <NA> <NA>
## over65 binary 0 <NA> <NA>
## immig binary 0 <NA> <NA>
## educ_r continuous 0 <NA> <NA>
## earners ordered-categorical 0 <NA> <NA>
## no_earners binary 0 <NA> <NA>
## workmos continuous 0 <NA> <NA>
## retirement continuous 123 ppd linear
## interest continuous 276 ppd linear
## assistance continuous 128 ppd linear
## other continuous 133 ppd linear
## any_unemp binary 0 <NA> <NA>
## any_ssi binary 0 <NA> <NA>
## any_welfare binary 0 <NA> <NA>
## any_charity binary 0 <NA> <NA>
##
## family link transformation
## earnings gaussian identity standardize
## white <NA> <NA> <NA>
## male <NA> <NA> <NA>
## over65 <NA> <NA> <NA>
## immig <NA> <NA> <NA>
## educ_r <NA> <NA> standardize
## earners <NA> <NA> <NA>
## no_earners <NA> <NA> <NA>
## workmos <NA> <NA> standardize
## retirement gaussian identity standardize
## interest gaussian identity standardize
## assistance gaussian identity standardize
## other gaussian identity standardize
## any_unemp <NA> <NA> <NA>
## any_ssi <NA> <NA> <NA>
## any_welfare <NA> <NA> <NA>
## any_charity <NA> <NA> <NA>
As shown, mdf contains the information required to conduct multiple imputation. Above all, the type of each variable is important. If R’s initial guess is wrong, you can change it with change(). For instance, to change the type of educ_r from continuous to ordered-categorical, run the following.
mdf <- change(mdf, y = "educ_r", what = "type", to = "ordered-categorical")
show(mdf)
## Object of class missing_data.frame with 1501 observations on 17 variables
##
## There are 29 missing data patterns
##
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
##
## type missing method model
## earnings continuous 241 ppd linear
## white binary 0 <NA> <NA>
## male binary 0 <NA> <NA>
## over65 binary 0 <NA> <NA>
## immig binary 0 <NA> <NA>
## educ_r ordered-categorical 0 <NA> <NA>
## earners ordered-categorical 0 <NA> <NA>
## no_earners binary 0 <NA> <NA>
## workmos continuous 0 <NA> <NA>
## retirement continuous 123 ppd linear
## interest continuous 276 ppd linear
## assistance continuous 128 ppd linear
## other continuous 133 ppd linear
## any_unemp binary 0 <NA> <NA>
## any_ssi binary 0 <NA> <NA>
## any_welfare binary 0 <NA> <NA>
## any_charity binary 0 <NA> <NA>
##
## family link transformation
## earnings gaussian identity standardize
## white <NA> <NA> <NA>
## male <NA> <NA> <NA>
## over65 <NA> <NA> <NA>
## immig <NA> <NA> <NA>
## educ_r <NA> <NA> <NA>
## earners <NA> <NA> <NA>
## no_earners <NA> <NA> <NA>
## workmos <NA> <NA> standardize
## retirement gaussian identity standardize
## interest gaussian identity standardize
## assistance gaussian identity standardize
## other gaussian identity standardize
## any_unemp <NA> <NA> <NA>
## any_ssi <NA> <NA> <NA>
## any_welfare <NA> <NA> <NA>
## any_charity <NA> <NA> <NA>
Let’s visualize the pattern of missingness.
image(mdf)
hist(mdf)
Now, we’ll do the multiple imputation. Since it takes some time, multi-core computation is recommended. If your computer’s CPU has more than one core, set that number first with options(mc.cores). Then implement the multiple imputation with mi(), setting the number of iterations n.iter and the number of chains n.chains. Here, we’ll run 4 chains of 30 iterations each.
options(mc.cores = 4)
imputations <- mi(mdf, n.iter = 30, n.chains = 4, max.minutes = 20)
show(imputations)
## Object of class mi with 4 chains, each with 30 iterations.
## Each chain is the evolution of an object of missing_data.frame class with 1501 observations on 17 variables.
Then, we need to check whether the number of iterations was large enough. The mean of each variable should be about the same across all chains.
round(mipply(imputations, mean, to.matrix = TRUE), 3)
## chain:1 chain:2 chain:3 chain:4
## earnings 0.009 0.013 0.007 0.017
## white 1.318 1.318 1.318 1.318
## male 1.362 1.362 1.362 1.362
## over65 1.077 1.077 1.077 1.077
## immig 1.392 1.392 1.392 1.392
## educ_r 3.324 3.324 3.324 3.324
## earners 2.143 2.143 2.143 2.143
## no_earners 1.181 1.181 1.181 1.181
## workmos 0.000 0.000 0.000 0.000
## retirement 0.028 0.022 0.019 0.022
## interest 0.002 0.019 0.007 0.007
## assistance -0.001 -0.004 -0.011 -0.004
## other -0.002 0.000 0.002 -0.003
## any_unemp 1.055 1.055 1.055 1.055
## any_ssi 1.037 1.037 1.037 1.037
## any_welfare 1.039 1.039 1.039 1.039
## any_charity 1.010 1.010 1.010 1.010
## missing_earnings 0.161 0.161 0.161 0.161
## missing_retirement 0.082 0.082 0.082 0.082
## missing_interest 0.184 0.184 0.184 0.184
## missing_assistance 0.085 0.085 0.085 0.085
## missing_other 0.089 0.089 0.089 0.089
Next, we’ll check \(\hat{R}\), which approaches 1 as the chains converge.
Rhats(imputations)
## mean_earnings mean_retirement mean_interest mean_assistance
## 1.0023568 1.0386536 1.0220113 0.9910506
## mean_other sd_earnings sd_retirement sd_interest
## 0.9972806 1.0115627 1.0147530 0.9882039
## sd_assistance sd_other
## 0.9924869 0.9859840
To add iterations, we apply mi() to the imputed object. Let’s add 20 more iterations.
imputations <- mi(imputations, n.iter = 20)
To visually check how imputation works, we plot the imputed object.
plot(imputations)
image(imputations)
(Outputs are omitted.)
Then, check the summary.
summary(imputations)
(Output is omitted.)
Finally, we analyze the \(m\) different datasets obtained from the 4 chains of imputed values. Here, we set \(m = 5\).
fit <- pool(earnings ~ white + male, data = imputations, m = 5)
display(fit)
## bayesglm(formula = earnings ~ white + male, data = imputations,
## m = 5)
## coef.est coef.se
## (Intercept) 42.57 4.75
## white1 39.57 7.10
## male1 1.23 6.90
## n = 1498, k = 3
## residual deviance = 22843730.5, null deviance = 23357751.1 (difference = 514020.6)
## overdispersion parameter = 15249.5
## residual sd is sqrt(overdispersion) = 123.49
Let’s compare this result with that obtained from an analysis of the complete observations only.
fit_comp <- lm(earnings ~ white + male, data = myd_original)
display(fit_comp)
## lm(formula = earnings ~ white + male, data = myd_original)
## coef.est coef.se
## (Intercept) 40.22 4.86
## white 36.69 7.54
## male 1.92 7.21
## ---
## n = 1260, k = 3
## residual sd = 123.19, R-Squared = 0.02
What differences do you find?
If you want to analyze each imputed dataset, extract the datasets as follows.
dfs <- complete(imputations, m = 5)
imp_1 <- dfs[[1]]
imp_2 <- dfs[[2]]
imp_3 <- dfs[[3]]
imp_4 <- dfs[[4]]
imp_5 <- dfs[[5]]
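If you analyze the \(m\) datasets separately, remember to combine the results across imputations rather than report estimates from a single dataset. Below is a minimal sketch of pooling by hand with Rubin’s rules, assuming the same model earnings ~ white + male as above; pool() already does this for us.
## Fit the same model to each completed dataset
fits <- lapply(dfs, function(d) lm(earnings ~ white + male, data = d))
est <- sapply(fits, coef)        # one column of estimates per imputation
q_bar <- rowMeans(est)           # pooled point estimates
## Within- and between-imputation variances
u_bar <- rowMeans(sapply(fits, function(f) diag(vcov(f))))
b <- apply(est, 1, var)
m <- length(fits)
se_pooled <- sqrt(u_bar + (1 + 1 / m) * b)  # Rubin's total variance
cbind(estimate = q_bar, se = se_pooled)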
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press.