Preparation

Import some packages. Don’t forget to install them before loading.
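If any of these packages is not installed yet, you can install them all at once (this needs to be done only once):

install.packages(c("readr", "dplyr", "arm", "ggplot2"))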

library("readr")
library("dplyr")
library("arm")
library("ggplot2")

For illustration, we use hr96-09.csv, the dataset provided by Asano and Yanai (2013). Download the dataset and put it in the “data” folder in your project. Then, read the dataset by readr::read_csv().

HR <- read_csv("data/hr96-09.csv", na = ".")
head(HR)     ## display the first 6 rows
## Source: local data frame [6 x 16]
## 
##    year    ku   kun party              name   age status nocand    wl
##   (int) (chr) (int) (int)             (chr) (int)  (int)  (int) (int)
## 1  1996 aichi     1  1000 KAWAMURA, TAKASHI    47      2      7     1
## 2  1996 aichi     1   800     IMAEDA, NORIO    72      3      7     0
## 3  1996 aichi     1  1001     SATO, TAISUKE    53      2      7     0
## 4  1996 aichi     1   305   IWANAKA, MIHOKO    43      1      7     0
## 5  1996 aichi     1  1014       ITO, MASAKO    51      1      7     0
## 6  1996 aichi     1  1038  YAMADA, HIROSHIB    51      1      7     0
## Variables not shown: rank (int), previous (int), vote (int), voteshare
##   (dbl), eligible (int), turnout (dbl), exp (int)
tail(HR)     ## display the last 6 rows
## Source: local data frame [6 x 16]
## 
##    year        ku   kun party                name   age status nocand
##   (int)     (chr) (int) (int)               (chr) (int)  (int)  (int)
## 1  2009 yamanashi     2     1    NAGASAKI, KOTARO    41      2      4
## 2  2009 yamanashi     2   800    HORIUCHI, MITSUO    79      2      4
## 3  2009 yamanashi     2  1115 MIYAMATSU, HIROYUKI    69      1      4
## 4  2009 yamanashi     3  1001       GOTO, HITOSHI    52      2      3
## 5  2009 yamanashi     3   800           ONO, JIRO    56      2      3
## 6  2009 yamanashi     3  1115   SAKURADA, DAISUKE    47      1      3
## Variables not shown: wl (int), rank (int), previous (int), vote (int),
##   voteshare (dbl), eligible (int), turnout (dbl), exp (int)

Let’s add labels to some variables.

party_names <- c("independent", "JCP", "LDP", "CGP", "oki",
                 "tai", "saki", "NFP", "DPJ", "SDP",
                 "LF", "NJSP", "DRF", "kobe", "nii",
                 "sei", "JNFP", "bunka", "green", "LP",
                 "RC", "muk", "CP", "NCP", "ND",
                 "son", "sek", "NP", "NNP", "NPJ",
                 "NPD", "minna", "R", "H", "others")
HR <- HR %>%
    mutate(wl = factor(wl, levels = 0:2, labels = c("lost", "won", "zombie")),
           status = factor(status, levels = 1:3, 
                           labels = c("challenger", "incumbent", "ex")),
           party = factor(party, labels = party_names))

Let’s save this labelled dataset in the “data” folder for future use.

save(HR, file = "data/HR-labelled.Rda")

To load this dataset, run load("data/HR-labelled.Rda"). (Do not run HR <- load("data/HR-labelled.Rda"); load() restores the object under its original name, HR, and returns only the name of the loaded object.)
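For example:

load("data/HR-labelled.Rda")   ## the object HR reappears in the workspace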

Now, let’s create a dummy variable indicating a candidate’s prior experience as an MP and a variable measuring the electoral expenditure in million yen, extract the observations of the 2009 election, and save them as HR09.

HR09 <- HR %>%
    mutate(experience = as.numeric(status == "incumbent" | status == "ex"),
           expm = exp / 10^6) %>%
    filter(year == 2009)


Linear Regression with R

Simple Regression with a Binary Predictor (Model 1)

Let’s consider a model that explains the vote share (the response variable) by the MP experience (the explanatory variable). The experience is a binary (dummy) variable indicating whether a candidate has prior experience as an MP. The variable takes the value of 1 if a candidate is an incumbent or an ex-MP and 0 otherwise. This model can be written as: \[\mathrm{Vote~share}_i \sim \mathrm{N}(\mu_i, \sigma^2),\] \[\mu_i = \beta_1 + \beta_2 \cdot \mathrm{Experience}_i.\]

We can estimate the regression line with the lm() function in R.

fit_1 <- lm(voteshare ~ experience, data = HR09)

The object fit_1 has the estimation result (coefficient estimates, standard errors, etc.). We can look at the summary of the result using summary().

summary(fit_1)
## 
## Call:
## lm(formula = voteshare ~ experience, data = HR09)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.843 -12.134  -5.484   8.707  52.016 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  13.9840     0.6221   22.48   <2e-16
## experience   30.8589     0.9832   31.39   <2e-16
## 
## Residual standard error: 16.26 on 1137 degrees of freedom
## Multiple R-squared:  0.4642, Adjusted R-squared:  0.4637 
## F-statistic:   985 on 1 and 1137 DF,  p-value: < 2.2e-16

For better display of the result, let’s use the arm::display() function.

display(fit_1)
## lm(formula = voteshare ~ experience, data = HR09)
##             coef.est coef.se
## (Intercept) 13.98     0.62  
## experience  30.86     0.98  
## ---
## n = 1139, k = 2
## residual sd = 16.26, R-Squared = 0.46

You can find the coefficient estimates in the column named coef.est. The results show that \(\hat{\beta}_1=\) 13.98, \(\hat{\beta}_2=\) 30.86, and \(\hat{\sigma}=\) 16.26. Thus, the predicted vote share can be written as \[\widehat{\mathrm{Vote~share}} = 13.98 + 30.86 \cdot \mathrm{Experience}.\]
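We can check this equation in R: coef() extracts the estimated coefficients, and predict() returns the fitted vote shares for the two values of experience (they should match the group means computed below).

coef(fit_1)   ## beta_1-hat and beta_2-hat
predict(fit_1, newdata = data.frame(experience = c(0, 1)))   ## predictions for experience = 0 and 1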

Let’s visualize this result.

p1 <- ggplot(HR09, aes(x = experience, y = voteshare)) +
    scale_x_continuous(breaks = c(0, 1)) +
    geom_jitter(position = position_jitter(width = 0.05), size = 1) +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Experience", y = "Vote share (%)",
         title = "Vote Share vs. MP Experience")
print(p1)

To add the uncertainty of the estimation to the figure, we use geom_smooth(method = 'lm', se = TRUE) (in fact, we do not have to specify se in this case because se defaults to TRUE). By default, the 95 percent confidence interval is displayed around the line.

p1_ci95 <- p1 + geom_smooth(method = "lm", se = TRUE)
print(p1_ci95)

To change the degree of confidence, specify level. For example, to show the 99 percent confidence interval, run the following.

p1_ci99 <- p1 + geom_smooth(method = "lm", level = .99)
print(p1_ci99)

The \(y\) intercept of the line, 13.98, is the mean vote share (or predicted vote share) of the candidates without MP experience. You get the predicted value of 13.98 by plugging 0 in for experience. By plugging in 1, we get the mean (or predicted) vote share for the candidates with MP experience: \(13.98 + 30.86 \cdot 1 = 44.84\). Let’s verify these predicted values by calculating the group means of the vote share by experience.

HR09 %>%
    filter(experience == 0) %>%   ## candidates without experience
    with(mean(voteshare)) %>%
    round(digits = 2)
## [1] 13.98
HR09 %>%
    filter(experience == 1) %>%   ## candidates with experience 
    with(mean(voteshare)) %>%
    round(digits = 2)
## [1] 44.84

As this result reveals, the predicted values of a linear regression are the means of the response variable conditional on the explanatory variable(s).
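The same group means can be obtained in one step with dplyr’s group_by() and summarize():

HR09 %>%
    group_by(experience) %>%
    summarize(mean_voteshare = mean(voteshare))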


Simple Regression with a Continuous Predictor (Model 2)

Similarly, we can try to explain the vote share by the electoral expenditure (million yen) as follows.

fit_2 <- lm(voteshare ~ expm, data = HR09)
display(fit_2)
## lm(formula = voteshare ~ expm, data = HR09)
##             coef.est coef.se
## (Intercept) 7.74     0.76   
## expm        3.07     0.10   
## ---
## n = 1124, k = 2
## residual sd = 16.04, R-Squared = 0.48

Let’s fit the regression line to the data visually.

p2 <- ggplot(HR09, aes(x = expm, y = voteshare)) + 
    geom_point(size = 1) + 
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Expenditure (million yen)", y = "Vote share (%)",
         title = "Vote Share vs. Expenditure")
print(p2)

Add the 95 percent confidence interval.

p2_ci95 <- p2 + geom_smooth(method = "lm")
print(p2_ci95)


Multiple Regression (Model 3)

Next, let’s use both the experience and the expenditure to explain the vote share.

fit_3 <- lm(voteshare ~ experience + expm, data = HR09)
display(fit_3)
## lm(formula = voteshare ~ experience + expm, data = HR09)
##             coef.est coef.se
## (Intercept)  7.91     0.69  
## experience  18.10     1.23  
## expm         1.85     0.12  
## ---
## n = 1124, k = 3
## residual sd = 14.69, R-Squared = 0.56

Visualize the result.

## Make a data frame of predicted values
pred <- HR09 %>%
    with(expand.grid(expm = seq(min(expm, na.rm = TRUE), max(expm, na.rm = TRUE),
                                length = 100),
                     experience = 0:1))
## Calculate the predicted values
pred$voteshare <- predict(fit_3, newdata = pred)
## Make the scatter plot with the observed data (HR09) and draw the regression lines
##   with the predicted values
p3 <- ggplot(HR09, aes(x = expm, y = voteshare, color = as.factor(experience))) +
    geom_point(size = 1) +
    geom_line(data = pred) +
    labs(x = "Expenditure (million yen)", y = "Vote share (%)")
p3 <- p3 +
    scale_color_discrete(name = "MP Experience", labels = c("No", "Yes")) +
    guides(color = guide_legend(reverse = TRUE)) +
    ggtitle("Vote Share vs. Expenditure by Experience")
print(p3)

As the figure shows, the two lines are parallel, which means that the effect of the expenditure on the vote share, represented by the slope, is the same for candidates with and without the experience in this model. Is this actually true? Not necessarily. The two lines are parallel simply because our statistical model forced them to be.

Now let’s add 95 percent confidence intervals to these lines. To calculate the confidence intervals manually, we use the standard errors of the coefficient estimates. We know that the standardized estimates \((b - \beta)/\mathrm{se}\) follow the \(t\) distribution with \(n-k\) degrees of freedom, so we use qt() to obtain the 95 percent interval. (Without R, you might want to use the normal distribution rather than the \(t\) distribution, because the error should be approximately normal. However, with R, using the \(t\) is no more difficult than the normal, and we prefer the \(t\) because it is more accurate.)

## Obtain the predicted values and SE
err <- predict(fit_3, newdata = pred, se.fit = TRUE)
## Calculate the lower and upper bounds of the interval
pred$lower <- err$fit + qt(.025, df = err$df) * err$se.fit
pred$upper <- err$fit + qt(.975, df = err$df) * err$se.fit
## Visualize
p3_ci95 <- p3 +
    geom_smooth(data = pred, aes(ymin = lower, ymax = upper), stat = "identity")
print(p3_ci95)


Multiple Regression with Interaction (Model 4)

In the previous model, we restricted the two regression lines to be parallel. Now, let’s relax that restriction. To do so, we add the explanatory variable “Experience \(\times\) Expenditure”. In R, giving “x1 * x2” to lm() as an explanatory variable adds x1, x2, and their interaction x1:x2 as explanatory variables.
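For example, the interaction model below could equivalently be written with all three terms spelled out (fit_4b is just an illustrative name; the model we actually use, fit_4, is estimated next):

## equivalent to voteshare ~ experience * expm
fit_4b <- lm(voteshare ~ experience + expm + experience:expm, data = HR09)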

fit_4 <- lm(voteshare ~ experience * expm, data = HR09)
display(fit_4)
## lm(formula = voteshare ~ experience * expm, data = HR09)
##                 coef.est coef.se
## (Intercept)     -2.07     0.72  
## experience      45.91     1.58  
## expm             4.87     0.16  
## experience:expm -4.76     0.21  
## ---
## n = 1124, k = 4
## residual sd = 12.11, R-Squared = 0.70

Visualize the result.
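The code defining the plot p4 is not shown here; a minimal sketch, assuming we let geom_smooth() fit a separate line for each experience group (which is exactly what the interaction model allows), might look like this.

p4 <- ggplot(HR09, aes(x = expm, y = voteshare, color = as.factor(experience))) +
    geom_point(size = 1) +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "Expenditure (million yen)", y = "Vote share (%)",
         title = "Vote Share vs. Expenditure by Experience") +
    scale_color_discrete(name = "MP Experience", labels = c("No", "Yes")) +
    guides(color = guide_legend(reverse = TRUE))
print(p4)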

Add the 95 percent confidence intervals.

p4_ci95 <- p4 + geom_smooth(method = "lm")
print(p4_ci95)


Understanding Linear Regression by Simulation

Let’s run some Monte Carlo simulations to understand how linear regression analyses work. In a simulation, we set the parameter values and generate data. Then, we analyze the data with a specific method (e.g., linear regression) and examine if the method correctly finds the parameter values.

Here, we will simulate simple regression (we will leave simulations of multiple regression for the homework assignment!). The main purpose of this simulation is to understand

  1. the data generating process that linear regression assumes,
  2. the basic characteristics of linear regression (so called OLS), and
  3. the properties of confidence intervals.

The simple regression model can be written as \[y \sim \mathrm{N}(X\beta, \sigma^2 I),\] or \[y_i \sim \mathrm{N}(\beta_1 + \beta_2 x_i, \sigma^2),\] where \(i = 1, 2, \dots, n\). Thus, we have to set three parameters.

beta1 <- 10   ## the true value of the y-intercept
beta2 <- 3    ## the true value of the slope
sigma <- 6    ## the true variability of the error (underlying randomness)

Next, we will generate data following the data generation process that the linear regression model assumes.

n <- 100   ## set the sample size
## We use x ~ U(0, 10). x can be fixed values
x <- runif(n, 0, 10)   ## randomly choose x from the uniform
y <- rnorm(n, mean = beta1 + beta2 * x, sd = sigma)  ## data generation!!!

Let’s visualize the generated data and fit the regression line.

df <- data.frame(y = y, x = x)
scat <- ggplot(df, aes(x, y)) + 
    geom_point() + 
    geom_smooth(method = "lm")
print(scat)

Now we will estimate the parameters of the regression line.

eg_1 <- lm(y ~ x, data = df)
display(eg_1)
## lm(formula = y ~ x, data = df)
##             coef.est coef.se
## (Intercept) 9.14     1.28   
## x           3.10     0.22   
## ---
## n = 100, k = 2
## residual sd = 6.29, R-Squared = 0.68

The estimates of the three parameters \(\beta_1=10\), \(\beta_2=3\), and \(\sigma=6\) are \(b_1\) = 9.14, \(b_2\) = 3.1, and \(\hat{\sigma}\) = 6.29, respectively. Did we get good estimates?

The 95 percent confidence intervals can be obtained by confint().

confint(eg_1)
##                2.5 %    97.5 %
## (Intercept) 6.602367 11.686254
## x           2.672670  3.526034

The 95 percent confidence intervals for the intercept and the slope are [6.60, 11.69] and [2.67, 3.53]. Both intervals contain the true parameter value. Therefore, the probability that these confidence intervals catch the parameter values is 1 (100%).

For a Monte Carlo simulation, we will repeat the above process many times. Since it is tedious to write the same commands again and again, let’s define the function for our simulation.

simple_linear_reg <- function(beta, sigma, n = 100, trials = 10000,
                              x_rng = c(0, 10), conf_level = .95) {
    #############################################################################
    ## Function to run Monte Carlo simulations of simple linear regression
    ## Args: beta = parameter values for beta, 2-element vector
    ##       sigma = parameter value for sigma, a single value > 0
    ##       n = sample size, default to 100
    ##       trials = number of simulation trials, default to 10000
    ##       x_rng = range of the explanatory var x, default to [0, 10]
    ##       conf_level = the confidence level for the interval, default to .95
    ## Return: df = data frame containing
    ##           (1) parameter estimates,
    ##           (2) SEs of coefficient estimates,
    ##           (3) confidence intervals, and    
    ##           (4) simulation id (serial number beginning with 1)
    #############################################################################
    
    ## Create a data frame to save the results
    df <- as.data.frame(matrix(NA, nrow = trials, ncol = 10))
    colnames(df) <- c("b1", "b1_se", "b1_lower", "b1_upper",
                      "b2", "b2_se", "b2_lower", "b2_upper",
                      "sigma_hat", "id")
    
    for (i in 1:trials) {
        x <- runif(n, min = x_rng[1], max = x_rng[2])            ## draw x
        y <- rnorm(n, mean = beta[1] + beta[2] * x, sd = sigma)  ## data generation
    
        fit <- lm(y ~ x)  ## fit linear regression
    
        sigma_hat <- summary(fit)$sigma  ## residual standard error (estimate of sigma)
    
        ## coefficient estimates
        b1 <- coef(fit)[1]
        b2 <- coef(fit)[2]
    
        ## SE of coefficient estimates
        b1_se <- sqrt(summary(fit)$cov.unscaled[1, 1]) * sigma_hat
        b2_se <- sqrt(summary(fit)$cov.unscaled[2, 2]) * sigma_hat
    
        ## confidence interval
        b1_ci <- confint(fit, level = conf_level)[1, ]
        b2_ci <- confint(fit, level = conf_level)[2, ]
    
       ## write the results in the data frame
       df[i, ] <- c(b1, b1_se, b1_ci, b2, b2_se, b2_ci, sigma_hat, i)
    }
    
    return(df)
}

Let’s run a simulation with this function. We set \(\beta_1 = 10\), \(\beta_2 = 3\), and \(\sigma = 6\).

param <- c(10, 3, 6)
sim_1 <- simple_linear_reg(beta = param[1:2], sigma = param[3], trials = 1000)
head(sim_1)
##          b1    b1_se b1_lower b1_upper       b2     b2_se b2_lower
## 1 10.519381 1.333821 7.872455 13.16631 2.840636 0.2276824 2.388807
## 2  9.195160 1.558476 6.102415 12.28790 3.157845 0.2427141 2.676187
## 3  8.675866 1.365326 5.966421 11.38531 3.169810 0.2311890 2.711023
## 4  9.693470 1.422950 6.869673 12.51727 2.911867 0.2347170 2.446079
## 5  8.809144 1.111461 6.603487 11.01480 3.040280 0.2021615 2.639097
## 6  9.211938 1.086019 7.056770 11.36711 3.375313 0.1891709 2.999909
##   b2_upper sigma_hat id
## 1 3.292464  6.304427  1
## 2 3.639503  6.559746  2
## 3 3.628597  6.609825  3
## 4 3.377655  6.311905  4
## 5 3.441463  5.704105  5
## 6 3.750716  5.662732  6

The results of the simulation is saved in the object (data frame) sim_1.

Understanding Coefficient Estimates

First, we will examine b2, which is the estimate of beta[2] (=3). Does linear regression give us a good estimate of the parameter? Let’s look at the distribution of b2.

hist_b2 <- ggplot(sim_1, aes(x = b2, y = ..density..)) +
    geom_histogram(binwidth = .1, color = "black", fill = "royalblue") +
    labs(y = "Density", title = "Distribution of Simulated b2") +
    geom_vline(xintercept = param[2], color = "firebrick", size = 2)
print(hist_b2)

As this histogram shows, linear regression (OLS) gives us a good estimate of the parameter on average. That is, the mean of the simulated (estimated) values of b2 equals the true parameter value. This property is called unbiasedness. However, OLS does not always provide us with an accurate estimate: it underestimates the parameter for some datasets and overestimates it for others. We usually have only one dataset to analyze, and we never know whether the obtained estimates are the “good” ones. Therefore, we must be explicit about the uncertainty of the estimates.
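We can check this numerically: the mean of the simulated estimates should be very close to the true value of 3 (the exact number depends on the random draws).

round(mean(sim_1$b2), digits = 2)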

Understanding Standard Errors

Next, we will consider the standard errors of b2. To begin, let’s create a histogram.

hist_b2_se <- ggplot(sim_1, aes(x = b2_se, y = ..density..)) + 
    geom_histogram(binwidth = .01, color = "black", fill = "firebrick") +
    labs(x = "SE of b2", y = "density", title = "Distribution of Simulated SE of b2")
print(hist_b2_se)

As this histogram shows, the standard errors take different values for different datasets (or different samples), because standard errors are themselves estimates (\(\sigma\) is an unknown parameter).

In theory, the statistic \[\frac{b - \beta}{\mathrm{se}}\] follows the \(t\) distribution with \(n-k\) degrees of freedom (in this example, \(n - k = 100 - 2 = 98\)).

Let’s make a histogram of the statistic and superimpose the probability density curve of the \(t\) distribution with df = 98.

sim_1 <- mutate(sim_1, b2_t = (b2 - beta2) / b2_se)
## Calculate the density of t and save it in a data frame
true_t <- data.frame(x = seq(-4, 4, length = 100))
true_t <- true_t %>% mutate(density = dt(x, df = 98))
hist_t <- ggplot(sim_1, aes(x = b2_t, y = ..density..)) +
    geom_histogram(binwidth = 0.5, color = "black", fill = "royalblue") + 
    geom_line(data = true_t, aes(x = x, y = density), color = "firebrick", size = 1.5) +
    labs(x = expression(frac(b[2] - beta[2], se(b[2]))),
         y = "Density", title = "Standardized b2 and t (df=98)")
print(hist_t)

The figure shows that the statistic (standardized b2) is approximately \(t\). Let’s verify that with a Q-Q plot.

qqplot_t <- ggplot(sim_1, aes(sample = b2_t)) + 
    stat_qq(distribution = qt, dparams = list(df = 98)) +
    geom_abline(intercept = 0, slope = 1, color = "firebrick") +
    coord_fixed() +
    labs(title = "Q-Q Plot", 
         x = "t distribution with df = 98",
         y = expression(paste("Simulated  ", frac(b[2] - beta[2], se(b[2])))))
print(qqplot_t)

Most of the points are on the 45 degree line, which tells us that \((\mathrm{b2}-\beta_2)/\mathrm{se(b2)}\) is distributed as the \(t\) with df = 98. The simulated values are off the line in both tails, though.

The mean of the standard errors obtained by this simulation is

round(mean(sim_1$b2_se), digits = 2)
## [1] 0.21

The standard error is supposed to estimate the standard deviation of the sampling distribution of b2. Let’s check it.

round(sd(sim_1$b2), digits = 2)
## [1] 0.21

In fact, the standard error is (approximately) the standard deviation of the estimated values.

Understanding Confidence Intervals

Let’s consider the 95 percent confidence interval (CI) of b2. The first simulated CI is

sim_1[1, c("b2_lower", "b2_upper")]
##   b2_lower b2_upper
## 1 2.388807 3.292464

That is, [2.39, 3.29] is the 95 percent CI of b2 from the first simulated dataset. This interval contains the true parameter value \(\beta_2 = 3\). Thus, the probability that this confidence interval contains the true parameter value is 1 (100%).

Similarly, the second simulated CI is

sim_1[2, c("b2_lower", "b2_upper")]
##   b2_lower b2_upper
## 2 2.676187 3.639503

This interval also catches the true parameter value. Again, the probability that this confidence interval contains the true parameter value is 1 (100%).

However, the 39th CI is

sim_1[39, c("b2_lower", "b2_upper")]
##    b2_lower b2_upper
## 39 3.025416 3.749442

This interval does not contain the true parameter value. Thus, the probability that this confidence interval contains the true parameter value is 0.

Now let’s examine which of the 1,000 simulated CIs catch the parameter value. The CIs that contain the parameter value have a lower bound smaller than the parameter value and an upper bound greater than it.

check_ci <- sim_1$b2_lower <= param[2] & sim_1$b2_upper >= param[2]

The returned value TRUE signifies the CIs that contain the parameter value, and FALSE the CIs that do not.

table(check_ci)
## check_ci
## FALSE  TRUE 
##    49   951

The table shows that 951 out of 1,000 CIs (95.1%) contain the parameter value, and the rest (4.9%) do not. As this simulation tells us, the confidence level of a confidence interval is the proportion of CIs that catch the true parameter value when we calculate CIs for many different datasets generated by the same data generating process. It is never the confidence we have in a given confidence interval: the probability that a particular CI contains the true parameter is either 0 or 1.
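Because TRUE counts as 1 and FALSE as 0, the same proportion can be computed directly as the mean of the logical vector; with the table above, it returns 0.951.

mean(check_ci)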

Let’s visualize this fact. Since 1,000 CIs are too many for a figure, we will show the result for randomly chosen 100 CIs.

## put check_ci into the data frame
sim_1$check_ci <- check_ci
## randomly choose 100 sets of sim values for graph
sim_1_sml <- sample_n(sim_1, 100)
## show simulated CIs graphically
ctplr <- ggplot(sim_1_sml, aes(x = reorder(id, b2), y = b2,
                               ymin = b2_lower, ymax = b2_upper,
                               color = check_ci)) +
    geom_hline(yintercept = beta2, linetype = "dashed") +
    geom_pointrange() +
    labs(x = "Simulation ID", y = expression(b[2]),
         title = "Simulated 95% Confidence Intervals") +
    scale_color_discrete(name = "Parameter is", labels = c("Not caught","Caught")) +
    coord_flip()
print(ctplr)

In this figure, 95 of 100 CIs catch the parameter value. That is, 95 percent of the 95 percent confidence intervals give us the correct answer.



Back to Class Materials