Preparation

Load some packages. (If you also use coefplot, it is better to load arm before coefplot.)

library("dplyr")
library("arm")
library("ggplot2")

Let’s use the dataset we saved in the previous lecture.

## load the R format data from the web
load(url("http://www2.kobe-u.ac.jp/~yyanai/classes/rm1/contents/data/HR-labelled.Rda"))
## or load the file from a local folder, e.g. (adjust the path as needed):
## load("HR-labelled.Rda")

Let’s create four dummy variables, wlsmd (the indicator of victory in the SMD), ldp (Liberal Democratic Party), dpj (Democratic Party of Japan), and experience (prior MP experience), and a variable expm measuring the electoral expenditure in million yen.

HR <- HR %>%
    mutate(wlsmd = as.numeric(wl == "won"),
           ldp = as.numeric(party == "LDP"),
           dpj = as.numeric(party == "DPJ"),
           experience = as.numeric(status == "incumbent" | status == "ex"),
           expm = exp / 10^6)
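
As a quick (optional) sanity check, we can cross-tabulate a new dummy against the original variable to confirm the coding:

with(HR, table(wl, wlsmd, useNA = "ifany"))   ## wlsmd should be 1 only for "won"
with(HR, table(party, ldp, useNA = "ifany"))  ## ldp should be 1 only for LDP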

Let’s extract the 2009 election and, for simplicity today, delete the observations with missing values (please do not delete observations in your own research).

HR09 <- HR %>%
    filter(year == 2009) %>%
    na.omit()


Logistic Function or Inverse Logit Function

First, let’s define the logistic (inverse logit) function in R.

invlogit <- function(x) {  ## logistic function
    y <- 1 / (1 + exp(-x))
    return(y)
}
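
As a quick check, invlogit(0) should be 0.5, and the output always lies strictly between 0 and 1. (R’s built-in plogis() computes the same function, so we can compare against it.)

invlogit(0)                        ## 0.5
invlogit(c(-10, -2, 0, 2, 10))     ## all values in (0, 1)
all.equal(invlogit(2), plogis(2))  ## TRUE: plogis() is the built-in logistic CDF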

We can use this invlogit() function to draw logistic curves of continuous variables.

x <- seq(-5, 5, length = 1000)
y1 <- invlogit(-x)
y2 <- invlogit(6 + 3 * x)
df <- data.frame(x = x, y1 = y1, y2 = y2)
logistic_curve <- ggplot(df, aes(x = x)) + 
    geom_hline(yintercept = c(0, 1), linetype = "dashed")
logistic_curve + geom_line(aes(y = y1), color = "firebrick") +
    ylab(expression(paste(logit^-1, "(-x)")))

logistic_curve + geom_line(aes(y = y2), color = "darkblue") +
    ylab(expression(paste(logit^-1, "(6 + 3x)")))

Play with this function to understand the logistic function or the inverse logit function.
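
For example, here is a minimal sketch of how the coefficients reshape the curve: the slope controls the steepness (and, if negative, flips the direction), while the intercept shifts the curve left or right.

df2 <- data.frame(x = x,
                  gentle = invlogit(0.5 * x),  ## gentle slope
                  steep  = invlogit(3 * x),    ## steeper slope
                  shift  = invlogit(-2 + x))   ## shifted to the right
ggplot(df2, aes(x = x)) +
    geom_line(aes(y = gentle), color = "gray50") +
    geom_line(aes(y = steep), color = "firebrick") +
    geom_line(aes(y = shift), color = "darkblue") +
    ylab("probability")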


Logistic Regression with One Predictor

For illustration, we model the victory or defeat of candidates in SMDs in the 2009 HR election. Here we use the electoral expenditure (million yen) as the only predictor of the model: \[\mathrm{Pr}(\mathrm{win}) = \mathrm{logit}^{-1}(\alpha + \beta \cdot \mathrm{expenditure}).\]

Let’s visualize the relationship between the response and the predictor.

p <- ggplot(HR09, aes(x = expm, y = wlsmd)) + geom_point(size = 1)
p <- p + labs(x = "expenditure (million yen)", y = "SMD victory")
print(p)

Fit the linear regression line (forcibly).

print(p + geom_smooth(method = "lm"))

To run logistic regression in R, we use the glm() function. This function fits generalized linear models (GLMs) and is frequently used for models other than logistic regression as well. To use it for logistic regression, we specify family = binomial(link = "logit").

fit_1 <- glm(wlsmd ~ expm, data = HR09, family = binomial(link = "logit"))
display(fit_1)
## glm(formula = wlsmd ~ expm, family = binomial(link = "logit"), 
##     data = HR09)
##             coef.est coef.se
## (Intercept) -2.00     0.13  
## expm         0.14     0.01  
## ---
##   n = 1124, k = 2
##   residual deviance = 1184.1, null deviance = 1300.1 (difference = 116.0)

We got the estimates. Now the problem is how to interpret the estimates.
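
For instance, here is a rough sketch (using the estimates above) that converts the coefficients into probabilities with invlogit(); the last line uses the “divide by 4” rule, which gives an upper bound on the change in probability per one million yen.

b <- coef(fit_1)
invlogit(b[1])              ## predicted prob. of victory at zero expenditure
invlogit(b[1] + b[2] * 10)  ## predicted prob. at 10 million yen
b[2] / 4                    ## "divide by 4" rule: max. effect of +1 million yen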

Let’s calculate the predicted probability of SMD victory. We can use the predict() function to obtain the predicted values. However, predict() returns the predicted values on the scale of the linear predictors by default. The linear predictors of logistic regression are on the log-odds scale \((-\infty, \infty)\), so predict(fit_1) does not return probabilities (verify this). Since we would like the probability of success, we specify the argument type = "response".

HR09$pred_1 <- predict(fit_1, type = "response")
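
To verify that the default scale is the log-odds, compare the default output of predict() with its invlogit() transformation:

head(predict(fit_1))                     ## log-odds (linear predictor) scale
head(invlogit(predict(fit_1)))           ## transformed into probabilities
head(predict(fit_1, type = "response"))  ## same as the line above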

Let’s visualize this.

p <- ggplot(HR09, aes(x = expm)) +
    geom_point(aes(y = wlsmd), size = 1) +
    geom_line(aes(y = pred_1), color = "firebrick")
p <- p + labs(x = "expenditure (million yen)", y = "prob. of SMD victory")
print(p)

Alternatively, we can create the same figure using geom_smooth() of ggplot2.

p <- ggplot(HR09, aes(x = expm, y = wlsmd)) + 
    geom_point(size = 1) +
    geom_smooth(method = "glm", method.args = list(family = binomial))
p <- p + labs(x = "expenditure (million yen)", y = "prob. of SMD victory")
print(p)


Logistic Regression with Multiple Predictors

Let’s add the number of victories in the previous HR elections (previous) as the second predictor.

We run the logistic regression as follows.

fit_2 <- glm(wlsmd ~ expm + previous, data = HR09,
            family = binomial(link = "logit"))
display(fit_2)
## glm(formula = wlsmd ~ expm + previous, family = binomial(link = "logit"), 
##     data = HR09)
##             coef.est coef.se
## (Intercept) -2.04     0.13  
## expm         0.04     0.02  
## previous     0.36     0.04  
## ---
##   n = 1124, k = 3
##   residual deviance = 1071.7, null deviance = 1300.1 (difference = 228.4)

Question: How do you interpret the result?
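
As a hint, here is a minimal sketch that compares the predicted probabilities for a candidate with zero and one previous victories, holding the expenditure at its mean:

b <- coef(fit_2)
invlogit(b[1] + b[2] * mean(HR09$expm))         ## previous = 0
invlogit(b[1] + b[2] * mean(HR09$expm) + b[3])  ## previous = 1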

We would like to visualize the result, but we cannot create a scatter plot with a single curve because we have more than one predictor. In this model, the effect of the electoral expenditure is not statistically significant. So let’s fix the expenditure at the mean and visualize the relationship between the probability of winning and the number of previous victories.

We calculate the predicted probabilities when the expenditure (expm) is fixed at its mean as follows.

n <- nrow(HR09)
mean_expm <- mean(HR09$expm)
HR09$pred_2 <- predict(fit_2, type = "response",
                       newdata = data.frame(previous = HR09$previous,
                                            expm = rep(mean_expm, n)))

Let’s visualize the predicted probabilities.

p <- ggplot(HR09, aes(x = previous)) + 
    geom_point(aes(y = wlsmd), size=1) +
    geom_line(aes(y = pred_2), color="royalblue")
p <- p + labs(x = "previous victories", y = "prob. of SMD victory")
print(p)

To add a confidence interval, we need to calculate the standard errors of the predicted values.

HR09$se_2 <- predict(fit_2, type = "response",
                     newdata = data.frame(previous = HR09$previous,
                                          expm = rep(mean_expm, n)),
                     se.fit = TRUE)$se.fit

We calculate the lower and upper bounds of the interval as the predicted value \(\pm 2\mathrm{se}\), which gives an approximate 95% CI.

HR09 <- HR09 %>%
    mutate(fit2_lb = pred_2 - 2 * se_2,
           fit2_ub = pred_2 + 2 * se_2)

Let’s visualize the interval by superimposing it on the previous figure. Since we have modified the data frame HR09, do not forget to update the data in the plot with %+%.

p %+% HR09 + geom_ribbon(aes(ymin = fit2_lb, ymax = fit2_ub), alpha = 0.2)


For more information, read Chapters 1 and 2 of Kleinbaum and Klein (2009) and Chapter 14 of Asano and Yanai (2013).


