First, load the packages we will use.
library("ggplot2")
library("readr")
library("dplyr")
Define the inverse of the logit function (i.e., the logistic function).
invlogit <- function(x) {
p <- 1 / (1 + exp(-x))
return(p)
}
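As a quick sanity check of the function we just defined, invlogit(0) should be exactly 0.5, and large positive (negative) arguments should return values close to 1 (0).
invlogit(0)
## [1] 0.5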
For illustration, we use the dataset that appears in Asano and Yanai (2013): hr96-09.csv.
Using this dataset, we will try to explain the victory in single member districts (SMDs) of the Japanese House of Representatives elections by logistic regression.
First, load the dataset into R. Here, let’s use the dataset we modified (labelled) and saved before.
load(url("http://www2.kobe-u.ac.jp/~yyanai/classes/rm1/contents/data/HR-labelled.Rda"))
head(HR)
## Source: local data frame [6 x 16]
##
## year ku kun party name age status nocand
## (int) (chr) (int) (fctr) (chr) (int) (fctr) (int)
## 1 1996 aichi 1 NFP KAWAMURA, TAKASHI 47 incumbent 7
## 2 1996 aichi 1 LDP IMAEDA, NORIO 72 ex 7
## 3 1996 aichi 1 DPJ SATO, TAISUKE 53 incumbent 7
## 4 1996 aichi 1 JCP IWANAKA, MIHOKO 43 challenger 7
## 5 1996 aichi 1 bunka ITO, MASAKO 51 challenger 7
## 6 1996 aichi 1 NP YAMADA, HIROSHIB 51 challenger 7
## Variables not shown: wl (fctr), rank (int), previous (int), vote (int),
## voteshare (dbl), eligible (int), turnout (dbl), exp (int)
This dataset contains a variable recording the election result, wl, whose values are “won”, “lost”, and “zombie” (a candidate who lost the SMD but was resurrected in a PR block). Using this variable, let’s create a binary variable, wlsmd, indicating victory in an SMD.
myd <- mutate(HR, wlsmd = as.numeric(wl == "won"))
with(myd, table(wlsmd))
## wlsmd
## 0 1
## 4114 1500
In addition, let’s create three party dummy variables, ldp (Liberal Democratic Party), komei (New Komeito, or Clean Government Party), and dpj (Democratic Party of Japan), as well as a variable expm measuring electoral expenditure in millions of yen.
myd <- myd %>%
mutate(ldp = as.numeric(party == "LDP"),
komei = as.numeric(party == "CGP"),
dpj = as.numeric(party == "DPJ"),
expm = exp / 10^6)
For simplicity, we keep only the data for the 2009 election and delete the observations with missing values. (In a real data analysis, you should not simply discard observations because of missing values.)
myd_2009 <- myd %>%
filter(year == 2009) %>%
na.omit()
Now we’d like to model the victory in SMDs with two predictors. One is the number of previous wins in HR elections, previous; the other is electoral expenditure in millions of yen, expm. (In a research paper, we should explicitly explain why we use this model. In other words, we should provide theoretical reasons why these predictors explain the response. Here, however, we focus on the quantitative methods.)
This logistic regression is estimated by the following code.
fit_logit_1 <- glm(wlsmd ~ previous + expm, data = myd_2009,
family = binomial(link = "logit"))
summary(fit_logit_1)
##
## Call:
## glm(formula = wlsmd ~ previous + expm, family = binomial(link = "logit"),
## data = myd_2009)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0691 -0.5922 -0.5111 0.5219 1.9741
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.03746 0.13296 -15.323 <2e-16
## previous 0.36052 0.03826 9.422 <2e-16
## expm 0.03825 0.01841 2.077 0.0378
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1300.1 on 1123 degrees of freedom
## Residual deviance: 1071.7 on 1121 degrees of freedom
## AIC: 1077.7
##
## Number of Fisher Scoring iterations: 4
This result shows that \[\mathrm{Pr(victory)} = \frac{1}{1 + \exp[- (-2.04 + 0.36 \mathrm{previous~wins} + 0.04\mathrm{expenditure})]}.\]
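To see what this equation implies, we can plug in illustrative values. For example, for a hypothetical candidate with two previous wins and an expenditure of 10 million yen (values chosen only for illustration), the estimated coefficients above give:
## Predicted probability for previous = 2 and expm = 10 (illustrative values)
invlogit(-2.03746 + 0.36052 * 2 + 0.03825 * 10)  ## roughly 0.28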
Next, let’s add the candidate’s age to the linear predictor.
fit_logit_2 <- glm(wlsmd ~ previous + expm + age , data = myd_2009,
family = binomial(link = "logit"))
summary(fit_logit_2)
##
## Call:
## glm(formula = wlsmd ~ previous + expm + age, family = binomial(link = "logit"),
## data = myd_2009)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2093 -0.6602 -0.4414 0.4761 2.2864
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.766843 0.387483 1.979 0.0478
## previous 0.517282 0.046734 11.069 < 2e-16
## expm 0.044772 0.019158 2.337 0.0194
## age -0.064545 0.008818 -7.319 2.49e-13
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1300.1 on 1123 degrees of freedom
## Residual deviance: 1012.1 on 1120 degrees of freedom
## AIC: 1020.1
##
## Number of Fisher Scoring iterations: 5
This shows that \[\mathrm{Pr(victory)} = \frac{1}{1 + \exp[-(0.77 + 0.52 \mathrm{previous~wins} + 0.04 \mathrm{expenditure} -0.06 \mathrm{age})]}.\]
Furthermore, let’s add dummies for LDP, CGP, and DPJ.
fit_logit_3 <- glm(wlsmd ~ previous + expm + age + ldp + komei + dpj,
data = myd_2009,
family = binomial(link = "logit"))
summary(fit_logit_3)
##
## Call:
## glm(formula = wlsmd ~ previous + expm + age + ldp + komei + dpj,
## family = binomial(link = "logit"), data = myd_2009)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8819 -0.2646 -0.1906 0.2406 2.7992
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.49795 0.62996 -3.965 7.33e-05
## previous 0.48450 0.05767 8.402 < 2e-16
## expm 0.04006 0.03037 1.319 0.18706
## age -0.03316 0.01256 -2.640 0.00829
## ldp 0.45332 0.42776 1.060 0.28925
## komei -14.35608 589.26975 -0.024 0.98056
## dpj 4.18928 0.33995 12.323 < 2e-16
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1300.13 on 1123 degrees of freedom
## Residual deviance: 584.78 on 1117 degrees of freedom
## AIC: 598.78
##
## Number of Fisher Scoring iterations: 14
Write an equation representing the estimated results yourself.
Comparing the three models above, we find that the AIC is smallest for the third model. This implies that, among the three, the third model has the strongest explanatory power.
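As a side note, instead of reading the AIC values off the three summaries, we could collect them with a single call, which should reproduce the values reported above (1077.7, 1020.1, and 598.78).
## Compare the AICs of the three models in one table
AIC(fit_logit_1, fit_logit_2, fit_logit_3)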
Now, let’s calculate the “hit ratio” of each model. Here, we predict that candidates whose estimated probability of winning exceeds 0.5 win the SMD and that the others lose.
First, we’ll obtain the predicted probabilities with predict(). Setting the argument type to “response” gives us predicted probabilities; otherwise, predict() returns the linear predictor (i.e., the log odds) by default.
myd_2009 <- myd_2009 %>%
mutate(pred_1 = predict(fit_logit_1, type = "response"),
pred_2 = predict(fit_logit_2, type = "response"),
pred_3 = predict(fit_logit_3, type = "response"))
Using these predicted probabilities, we’ll calculate the “hit ratio”.
## Error rates
ert_1 <- with(myd_2009, mean((pred_1 > 0.5 & wlsmd == 0) | (pred_1 < 0.5 & wlsmd == 1)))
ert_2 <- with(myd_2009, mean((pred_2 > 0.5 & wlsmd == 0) | (pred_2 < 0.5 & wlsmd == 1)))
ert_3 <- with(myd_2009, mean((pred_3 > 0.5 & wlsmd == 0) | (pred_3 < 0.5 & wlsmd == 1)))
## Hit rate = 1 - error rate
1 - c(ert_1, ert_2, ert_3)
## [1] 0.7464413 0.7704626 0.8825623
The hit ratio is the highest in the third model.
The observed ratio of victories is
(wr <- with(myd_2009, mean(wlsmd)))
## [1] 0.2651246
Thus, the baseline of prediction is
max(c(wr, 1 - wr))
## [1] 0.7348754
That is, without a model, we can achieve a 73% hit rate by predicting that everybody loses. The predictions based on Models 1 and 2 are only slightly better than this baseline, so we consider them “bad” fits. In contrast, Model 3 clearly improves the accuracy of prediction.
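To see where Model 3’s hits and misses come from, we can also cross-tabulate the 0.5-cutoff predictions against the observed outcomes (a minimal sketch using the variables created above).
## Confusion table for Model 3 with a 0.5 cutoff
with(myd_2009, table(predicted = as.numeric(pred_3 > 0.5), observed = wlsmd))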
Next, we’ll compare the ROC curves of these three models. We define the functions to calculate the true positive rate (TPR, sensitivity) and the false positive rate (FPR, \(1 -\) specificity) and draw ROC curves with ggplot2.
## Function to calculate the true positive rate (sensitivity)
true_pos <- function(predicted, observed, cutoff = .5) {
n <- length(cutoff)
res <- rep(NA, n)
for (i in 1:n) {
cond_met <- predicted >= cutoff[i] & observed == 1
res[i] <- mean(cond_met) / mean(observed)
}
return(res)
}
## Function to calculate the false positive rate (1 - specificity)
false_pos <- function(predicted, observed, cutoff = .5) {
n <- length(cutoff)
res <- rep(NA, n)
for (i in 1:n){
cond_met <- predicted >= cutoff[i] & observed == 0
res[i] <- mean(cond_met) / (1 - mean(observed))
}
return(res)
}
## Get TPRs and FPRs by changing the cutoff from 0 to 1
cutoff_vec <- seq(0, 1, length = 1000)
tp_1 <- true_pos(myd_2009$pred_1, myd_2009$wlsmd, cutoff = cutoff_vec)
tp_2 <- true_pos(myd_2009$pred_2, myd_2009$wlsmd, cutoff = cutoff_vec)
tp_3 <- true_pos(myd_2009$pred_3, myd_2009$wlsmd, cutoff = cutoff_vec)
fp_1 <- false_pos(myd_2009$pred_1, myd_2009$wlsmd, cutoff = cutoff_vec)
fp_2 <- false_pos(myd_2009$pred_2, myd_2009$wlsmd, cutoff = cutoff_vec)
fp_3 <- false_pos(myd_2009$pred_3, myd_2009$wlsmd, cutoff = cutoff_vec)
## Create data frame to use ggplot2
df_roc <- data.frame(tpr = c(tp_1, tp_2, tp_3),
fpr = c(fp_1, fp_2, fp_3),
model = rep(1:3, rep(length(cutoff_vec), 3)))
## Draw ROC cuves
p <- ggplot(df_roc, aes(x = fpr, y = tpr, color = as.factor(model))) +
geom_line() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
scale_color_discrete(name = "", labels = paste("Model", 1:3)) +
labs(x = "1 - specificity", y = "sensitivity", title = "ROC Curves")
print(p)
This figure also shows that the third model is the best of the three.
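If we want a single number summarizing each ROC curve, the area under the curve (AUC) can be computed in base R from the rank-based (Mann–Whitney) formula. This is a sketch, not part of the original analysis, but we would expect Model 3 to have the largest AUC.
## AUC via the rank-based (Mann-Whitney) formula, using base R only
auc <- function(predicted, observed) {
  r  <- rank(predicted)     ## ranks of the predicted probabilities
  n1 <- sum(observed == 1)  ## number of winners
  n0 <- sum(observed == 0)  ## number of losers
  (sum(r[observed == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
with(myd_2009, c(auc(pred_1, wlsmd), auc(pred_2, wlsmd), auc(pred_3, wlsmd)))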
The residual is the difference between the observed value and the predicted value, \[\mathrm{residual}_i = y_i - \mathrm{logit}^{-1}({\bf x}_i^T {\bf \beta}).\] Since the response \(y_i\) takes only the two values 0 and 1, a specific \({\bf x}_i^T {\bf \beta}\) gives us only two possible residuals. For instance, for \(\mathrm{logit}^{-1}({\bf x}_i^T {\bf \beta})=0.4\), the only possible residuals are \(-0.4\) and \(0.6\).
Plotting the residuals for Model 3, we get the following.
myd_2009$res_3 <- with(myd_2009, wlsmd - pred_3)
res_p <- ggplot(myd_2009, aes(x = pred_3, y = res_3)) +
geom_point() +
geom_hline(yintercept = 0, color = "royalblue") +
labs(x = "predicted prob. of victory", y = "observed - predicted",
title = "Residual Plot of Model 3")
print(res_p)
As this figure shows, an ordinary residual plot is (almost) useless for logistic regression. Instead, we create binned residual plots. To make this type of plot, we group the observations by their predicted values and calculate the group mean of the residuals. Then, we plot the mean residual against the mean predicted value for each group.
Let’s create one for Model 3.
## We make 20 groups based on predicted values: num of groups is arbitrary
## Create a data frame with the predicted values and residuals
df <- myd_2009[, c("pred_3", "res_3")]
## Sort the observations by predicted values
df <- arrange(df, pred_3)
N <- nrow(df)
n_group <- round(N / 20) ## Observations in each group
## If N / 20 is not an integer, the number of observations differs in the last group
df$group <- rep(20, N) ## Category var for groups
df$group[1:(n_group * 20)] <- rep(1:20, rep(n_group, 20))
## Calculate the group means of predicted values and residuals
df_group <- df %>%
group_by(group) %>%
summarize(pred = mean(pred_3),
res = mean(res_3),
obs = n())
## Calculate +-2se
df_group <- df_group %>%
mutate(lb = -2 * sqrt(pred * (1 - pred) / obs),
ub = 2 * sqrt(pred * (1 - pred) / obs))
## Plot
p_br <- ggplot(df_group, aes(x = pred, y = res)) +
geom_point() +
geom_line(aes(y = lb), linetype = "dotted") +
geom_line(aes(y = ub), linetype = "dotted") +
geom_hline(yintercept = 0, color = "royalblue") +
labs(x = "predicted probability of victory", y = "mean residual",
title = "Binned Residual Plot of Model 3")
print(p_br)
This figure shows the binned residual plot with 20 groups for Model 3. The number of groups, 20, was chosen arbitrarily; we should choose it so that the number of observations in each group is not too small. Here, we have 1,124 observations in total, and each group has 56 observations except the 20th, which has 60.
Looking at the figure, we cannot find an obvious pattern in the residuals. However, we cannot be very sure, because the number of points is only 20. (Assignment: Create a similar figure with 40 groups and check whether you can find any pattern.) The curves in the figure display \(\pm2\)se\(=\pm 2\sqrt{p(1-p)/n}\), where \(p\) is the predicted probability for each group and \(n\) is the number of observations in each group. We expect about 95% of the points to fall between the two curves. In this figure, 18 out of 20 points are in that area; that is, 90% of the points lie between the curves, which is not far from the expectation.
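As an aside, a similar plot can reportedly be produced automatically with the binnedplot() function in the arm package (by Gelman and Hill); the sketch below assumes that package is installed and exposes that function.
## A one-line alternative, assuming the arm package is installed
# install.packages("arm")
library("arm")
binnedplot(myd_2009$pred_3, myd_2009$res_3)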
Using Model 3, I’ll explain how to conduct significance tests. First, let’s consider the null hypothesis that all parameters other than the intercept are zero: \(\beta_2 = \beta_3 = \cdots = \beta_k = 0\).
To decide whether to reject the null, we use the deviance. The output of summary(fit_logit_3) displays the Null deviance and the Residual deviance. The former is the deviance when all parameters (except the intercept) are set to 0, and the latter is the deviance of the estimated model. We consider a model with smaller deviance to be better. If the null is true, the difference between these two deviances follows the \(\chi^2\) distribution with \(\nu\) degrees of freedom, where \(\nu\) is the difference in degrees of freedom between the two deviances (i.e., the number of predictors in the model). Therefore, we should conduct a \(\chi^2\) test for the difference of the deviances.
The difference of the deviances and the corresponding degrees of freedom are obtained as follows.
X2_logit_3 <- with(fit_logit_3, null.deviance - deviance)
df_logit_3 <- with(fit_logit_3, df.null - df.residual)
Thus, the \(p\) value of the \(\chi^2\) test is
pchisq(X2_logit_3, df = df_logit_3, lower.tail = FALSE)
## [1] 2.97e-151
This is practically 0, so the null hypothesis \(\beta_2 = \beta_3 = \cdots = \beta_k=0\) is rejected at the 0.01 significance level.
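For reference, the same test can be obtained by fitting the intercept-only model and comparing it to Model 3 with anova(); this is simply an equivalent way of carrying out the computation above.
## Equivalent test: compare the null (intercept-only) model to Model 3
fit_null <- glm(wlsmd ~ 1, data = myd_2009, family = binomial(link = "logit"))
anova(fit_null, fit_logit_3, test = "Chisq")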
To evaluate the addition of a predictor, we check how much the deviance declines as we add the predictor. We expect the deviance to decrease by 1 on average if we add a predictor that has no systematic effect on the response. Thus, to argue that the addition is meaningful, we should show by a \(\chi^2\) test that the reduction in deviance is significantly larger than 1.
We can do that with anova() (ANalysis Of VAriance).
anova(fit_logit_3, test = "Chi")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: wlsmd
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1123 1300.13
## previous 1 224.216 1122 1075.91 < 2.2e-16
## expm 1 4.226 1121 1071.69 0.0398
## age 1 59.553 1120 1012.13 1.190e-14
## ldp 1 167.361 1119 844.77 < 2.2e-16
## komei 1 36.216 1118 808.56 1.766e-09
## dpj 1 223.775 1117 584.78 < 2.2e-16
In the output, the Deviance column shows how much the deviance is reduced by adding each variable to the variables listed above it. The deviance decreases by more than 1 for every variable examined here.
Next, let’s consider the null hypothesis that “coefficient = 0” for each variable. We can use the \(p\) values displayed in the summary to test these hypotheses. Let’s also check the results using a caterpillar plot.
df_logit_3 <- data.frame(
order = 1:7,
predictor = c("intercept", "previous wins", "expenditure", "age",
"LDP", "CGP", "DPJ"),
coef = coef(fit_logit_3),
se = summary(fit_logit_3)$coefficient[, 2])
df_logit_3 <- df_logit_3 %>%
mutate(lb = coef - qt(.975, df = df.residual(fit_logit_3)) * se,
ub = coef + qt(.975, df = df.residual(fit_logit_3)) * se)
ctplr_1 <- ggplot(filter(df_logit_3, predictor != "intercept"),
aes(x = reorder(predictor, order), y = coef,
ymin = lb, ymax = ub)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_pointrange(color = "darkblue") +
labs(x = "predictor", y = "estimate",
title = "Coefficient Estimates and 95% CI: Response is Victory in SMD") +
coord_flip()
print(ctplr_1)
This figure reveals that the effect of being CGP is not statistically significant. Let’s draw a similar figure without CGP.
ctplr_2 <- ctplr_1 %+%
filter(df_logit_3, !(predictor %in% c("intercept", "CGP")))
print(ctplr_2)
This figure shows that the predictors “previous wins” and “DPJ” have statistically significant effects, while “expenditure” and “LDP” (and “CGP”) do not. It is hard to tell from this figure whether the effect of age is statistically significant. Since the estimates for the party dummies are huge compared to the other variables, let’s make a figure without the party dummies to see the estimation uncertainty in the other variables.
ctplr_3 <- ctplr_1 %+%
filter(df_logit_3, predictor %in% c("previous wins", "expenditure", "age"))
print(ctplr_3)
Looking at these figures, we reject the null hypothesis “coefficient = 0” for the predictors “DPJ”, “previous wins”, and “age”, but not for the others.
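As a cross-check on these Wald-type intervals, confint() applied to a glm object computes profile-likelihood confidence intervals, which need not coincide exactly with the intervals plotted above.
## Profile-likelihood 95% confidence intervals (may take a moment to compute)
confint(fit_logit_3)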
Since the relationship between the response and the linear predictor is not linear in logistic regression, it is not trivial to interpret the estimated coefficients. One way to tackle the problem is to consider how much the predicted probability changes in response to a specific change in each predictor. For illustration, we use Model 2 estimated above.
Here is the result of Model 2.
summary(fit_logit_2)
##
## Call:
## glm(formula = wlsmd ~ previous + expm + age, family = binomial(link = "logit"),
## data = myd_2009)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2093 -0.6602 -0.4414 0.4761 2.2864
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.766843 0.387483 1.979 0.0478
## previous 0.517282 0.046734 11.069 < 2e-16
## expm 0.044772 0.019158 2.337 0.0194
## age -0.064545 0.008818 -7.319 2.49e-13
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1300.1 on 1123 degrees of freedom
## Residual deviance: 1012.1 on 1120 degrees of freedom
## AIC: 1020.1
##
## Number of Fisher Scoring iterations: 5
First, let’s examine the effect of previous wins on the probability of victory. Consider how much the probability changes as the number of previous wins increases from 0 to 1, other things equal (fixed). Let \(\delta\) denote the change in the probability of victory. \[\delta(\mathrm{expenditure, age}) = \mathrm{logit}^{-1}(0.77 + 0.52 \cdot 1 + 0.04 \mathrm{expenditure} - 0.06 \mathrm{age})\\ - \mathrm{logit}^{-1}(0.77 + 0.52 \cdot 0 +0.04 \mathrm{expenditure} - 0.06 \mathrm{age}).\]
Calculating this for each observation and averaging, we get the mean predicted difference (MPD): \[ \mathrm{MPD} = \frac{1}{n}\sum_{i=1}^n \delta(\mathrm{expenditure}_i, \mathrm{age}_i).\]
Let’s compute it in R, using the invlogit() function we defined above.
## Extract the coefficient estimates
b <- coef(fit_logit_2)
## Calculate the predicted difference for each observation
delta <- with(myd_2009,
invlogit(b[1] + b[2] * 1 + b[3] * expm + b[4] * age) -
invlogit(b[1] + b[2] * 0 + b[3] * expm + b[4] * age))
## Calculate the mean predicted difference
mean(delta)
## [1] 0.06078672
This shows that the probability of winning the SMD increases by 6.08 points on average as the number of previous wins increases from 0 to 1.
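The same quantity can also be obtained without writing out the linear predictor by hand: make two counterfactual copies of the data, set previous to 0 in one and 1 in the other, and average the difference of the predicted probabilities from predict(). This sketch should reproduce the 6.08-point figure above.
## Alternative: the same MPD via predict() on two counterfactual data sets
d0 <- d1 <- myd_2009
d0$previous <- 0
d1$previous <- 1
mean(predict(fit_logit_2, newdata = d1, type = "response") -
     predict(fit_logit_2, newdata = d0, type = "response"))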
Similarly, let’s calculate the change of the probability if the previous wins increases from 1 to 5.
delta <- with(myd_2009,
invlogit(b[1] + b[2] * 5 + b[3] * expm + b[4] * age) -
invlogit(b[1] + b[2] * 1 + b[3] * expm + b[4] * age))
mean(delta)
## [1] 0.4096711
Thus, if the previous wins increases from 1 to 5, the probability of winning increases by 40.97 points on average.
Next, let’s examine the effect of electoral spending. We’ll check how much the winning probability changes as the spending changes from half a standard deviation below the mean to half a standard deviation above it (i.e., a one-standard-deviation change).
lo <- with(myd_2009, mean(expm) - 0.5 * sd(expm))
hi <- with(myd_2009, mean(expm) + 0.5 * sd(expm))
delta <- with(myd_2009,
invlogit(b[1] + b[2] * previous + b[3] * hi + b[4] * age) -
invlogit(b[1] + b[2] * previous + b[3] * lo + b[4] * age))
mean(delta)
## [1] 0.03342016
This shows that a one-standard-deviation change in electoral spending (from half an sd below the mean to half an sd above it) is associated with a rise of 3.34 points in the winning probability on average.
Similarly, what happens if the spending changes from the minimum to the maximum?
delta <- with(myd_2009,
invlogit(b[1] + b[2] * previous + b[3] * max(expm) + b[4] * age) -
invlogit(b[1] + b[2] * previous + b[3] * min(expm) + b[4] * age))
mean(delta)
## [1] 0.1860051
This shows that the winning probability increases by 18.6 points on average if the expenditure changes from the minimum to the maximum.
Lastly, let’s check the effect of age on the probability of winning the SMDs. How does the probability change if a candidate’s age changes from 30 to 40?
delta <- with(myd_2009,
invlogit(b[1] + b[2] * previous + b[3] * expm + b[4] * 40) -
invlogit(b[1] + b[2] * previous + b[3] * expm + b[4] * 30))
mean(delta)
## [1] -0.1116126
This shows that the probability of electoral victory changes by \(-11.16\) points (i.e., decreases by 11.16 points) on average as the age changes from 30 to 40.
Similarly, what is the effect of age changing from 40 to 50?
delta <- with(myd_2009,
invlogit(b[1] + b[2] * previous + b[3] * expm + b[4] * 50) -
invlogit(b[1] + b[2] * previous + b[3] * expm + b[4] * 40))
mean(delta)
## [1] -0.09267792
This shows that aging from 40 to 50 is associated with a decrease of 9.27 points in the winning probability on average.
How about the change from 50 to 60?
delta <- with(myd_2009,
invlogit(b[1] + b[2] * previous + b[3] * expm + b[4] * 60) -
invlogit(b[1] + b[2] * previous + b[3] * expm + b[4] * 50))
mean(delta)
## [1] -0.07292911
This shows that aging from 50 to 60 is associated with a decrease of 7.29 points in the winning probability on average.
As these examples show, we can substantively interpret the results of logistic regression by reporting the average change in the predicted probability corresponding to specific changes in the predictors. It is even better to show these results graphically (e.g., with bar graphs).
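For example, a bar graph of the mean predicted differences could be sketched as follows; the values are simply the rounded MPDs reported in this section.
## Bar graph of the mean predicted differences reported above (rounded)
mpd <- data.frame(
  change = c("previous: 0 to 1", "previous: 1 to 5",
             "expm: -0.5 sd to +0.5 sd", "age: 30 to 40"),
  delta = c(0.061, 0.410, 0.033, -0.112))
ggplot(mpd, aes(x = change, y = delta)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "change in predictor", y = "mean predicted difference")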
Next, let’s try to explain the winning probability with a probit model.
First, we’ll consider the model explaining wlsmd by previous and expm. The probit regression model can be estimated by the following code.
fit_probit_1 <- glm(wlsmd ~ previous + expm, data = myd_2009,
family = binomial(link = "probit"))
summary(fit_probit_1)
##
## Call:
## glm(formula = wlsmd ~ previous + expm, family = binomial(link = "probit"),
## data = myd_2009)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3842 -0.5862 -0.4922 0.5449 1.9910
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.24950 0.07392 -16.903 <2e-16
## previous 0.20837 0.02178 9.566 <2e-16
## expm 0.02509 0.01078 2.327 0.02
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1300.1 on 1123 degrees of freedom
## Residual deviance: 1066.2 on 1121 degrees of freedom
## AIC: 1072.2
##
## Number of Fisher Scoring iterations: 5
The result can be written as \[\mathrm{Pr(victory)} = \Phi(-1.25 + 0.21 \mathrm{previous~wins} + 0.03\mathrm{expenditure}),\] where \(\Phi\) is the standard normal CDF.
Next, let’s add age to the linear predictor.
fit_probit_2 <- glm(wlsmd ~ previous + expm + age , data = myd_2009,
family = binomial(link = "probit"))
summary(fit_probit_2)
##
## Call:
## glm(formula = wlsmd ~ previous + expm + age, family = binomial(link = "probit"),
## data = myd_2009)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5465 -0.6569 -0.4291 0.5177 2.3166
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.351364 0.222982 1.576 0.1151
## previous 0.292056 0.025633 11.394 < 2e-16
## expm 0.027782 0.011136 2.495 0.0126
## age -0.036245 0.004929 -7.353 1.94e-13
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1300.1 on 1123 degrees of freedom
## Residual deviance: 1009.1 on 1120 degrees of freedom
## AIC: 1017.1
##
## Number of Fisher Scoring iterations: 5
This result can be written as \[\mathrm{Pr(victory)} = \Phi(0.35 + 0.29 \mathrm{previous~wins} + 0.03\mathrm{expenditure} -0.04\mathrm{age}).\]
Furthermore, let’s add dummies for LDP, CGP, and DPJ.
fit_probit_3 <- glm(wlsmd ~ previous + expm + age + ldp + komei + dpj, data = myd_2009,
family = binomial(link = "probit"))
summary(fit_probit_3)
As with logistic regression, the third probit model has the smallest AIC and is the best fit among the three.
As with logistic regression, the relationship between the response and the linear predictor is not linear, so it is not straightforward to interpret the results of probit regression. We therefore again consider how much the predicted probability changes in response to a specific change in each predictor. For illustration, we use Model 2 estimated above, and for comparison we examine the same changes we used for logistic regression.
Here’s the result of Model 2.
fit_probit_2 <- glm(wlsmd ~ previous + expm + age , data = myd_2009,
family = binomial(link = "probit"))
summary(fit_probit_2)
##
## Call:
## glm(formula = wlsmd ~ previous + expm + age, family = binomial(link = "probit"),
## data = myd_2009)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5465 -0.6569 -0.4291 0.5177 2.3166
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.351364 0.222982 1.576 0.1151
## previous 0.292056 0.025633 11.394 < 2e-16
## expm 0.027782 0.011136 2.495 0.0126
## age -0.036245 0.004929 -7.353 1.94e-13
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1300.1 on 1123 degrees of freedom
## Residual deviance: 1009.1 on 1120 degrees of freedom
## AIC: 1017.1
##
## Number of Fisher Scoring iterations: 5
First, let’s examine the effect of previous wins on the probability of victory. Consider how much the probability changes as the number of previous wins increases from 0 to 1, other things equal (fixed). Let \(\delta\) denote the change in the probability of victory. Then, \[\delta(\mathrm{expenditure, age}) = \Phi(0.35 + 0.29 \cdot 1 +0.03 \mathrm{expenditure} - 0.04 \mathrm{age})\\ - \Phi(0.35 + 0.29 \cdot 0 +0.03 \mathrm{expenditure} - 0.04 \mathrm{age}).\]
We calculate this for each observation and get the mean predicted difference: \[\mathrm{MPD} = \frac{1}{n}\sum_{i=1}^n \delta(\mathrm{expenditure}_i, \mathrm{age}_i).\]
Let’s calculate it in R. R has a built-in function, pnorm(), for the normal CDF.
## Extract the coefficient estimates
b <- coef(fit_probit_2)
## Calculate the difference in predicted prob for each observation
delta <- with(myd_2009,
pnorm(b[1] + b[2] * 1 + b[3] * expm + b[4] * age) -
pnorm(b[1] + b[2] * 0 + b[3] * expm + b[4] * age))
## Obtain the mean predicted difference
mean(delta)
## [1] 0.06134222
This shows that the probability of winning the SMD increases by 6.13 points on average as the previous wins increases from 0 to 1.
Similarly, let’s calculate the change of the probability if the previous wins increases from 1 to 5.
delta <- with(myd_2009,
pnorm(b[1] + b[2] * 5 + b[3] * expm + b[4] * age) -
pnorm(b[1] + b[2] * 1 + b[3] * expm + b[4] * age))
mean(delta)
## [1] 0.3861617
Thus, if the previous wins increases from 1 to 5, the probability of winning increases by 38.62 points on average.
Next, let’s examine the effect of electoral spending. Again, we’ll check how much the winning probability changes as the spending changes from half a standard deviation below the mean to half a standard deviation above it (a one-standard-deviation change).
## The following two lines were already run when we examined the logistic results
# lo <- with(myd_2009, mean(expm) - 0.5 * sd(expm))
# hi <- with(myd_2009, mean(expm) + 0.5 * sd(expm))
delta <- with(myd_2009,
pnorm(b[1] + b[2] * previous + b[3] * hi + b[4] * age) -
pnorm(b[1] + b[2] * previous + b[3] * lo + b[4] * age))
mean(delta)
## [1] 0.03586274
This shows that a one-standard-deviation change in electoral spending (from half an sd below the mean to half an sd above it) is associated with a rise of 3.59 points in the winning probability on average.
Similarly, what happens if the spending changes from the minimum to the maximum?
delta <- with(myd_2009,
pnorm(b[1] + b[2] * previous + b[3] * max(expm) + b[4] * age) -
pnorm(b[1] + b[2] * previous + b[3] * min(expm) + b[4] * age))
mean(delta)
## [1] 0.198816
This shows that the winning probability increases by 19.88 points on average if the expenditure changes from the minimum to the maximum.
Lastly, let’s check the effect of age on the probability of winning the SMDs. How does the probability change if a candidate’s age changes from 30 to 40?
delta <- with(myd_2009,
pnorm(b[1] + b[2] * previous + b[3] * expm + b[4] * 40) -
pnorm(b[1] + b[2] * previous + b[3] * expm + b[4] * 30))
mean(delta)
## [1] -0.1071702
This shows that the probability of electoral victory changes by \(-10.72\) points (i.e., decreases by 10.72 points) on average as the age changes from 30 to 40.
Similarly, what is the effect of age changing from 40 to 50?
delta <- with(myd_2009,
pnorm(b[1] + b[2] * previous + b[3] * expm + b[4] * 50) -
pnorm(b[1] + b[2] * previous + b[3] * expm + b[4] * 40))
mean(delta)
## [1] -0.09135181
This shows that aging from 40 to 50 is associated with a decrease of 9.14 points in the winning probability on average.
How about the change from 50 to 60?
delta <- with(myd_2009,
pnorm(b[1] + b[2] * previous + b[3] * expm + b[4] * 60) -
pnorm(b[1] + b[2] * previous + b[3] * expm + b[4] * 50))
mean(delta)
## [1] -0.07284359
This shows that aging from 50 to 60 is associated with a decrease of 7.28 points in the winning probability on average.
Let’s compare the logistic regression results with the probit regression results obtained above.
First, looking at the effects of the predictors on the winning probability, there is no real difference in substantive meaning between the two. For instance, the increase in the winning probability corresponding to a change in previous wins from 0 to 1 is 6.08 points for logit and 6.13 points for probit; rounded to one decimal place, the two are the same. The same holds for the other mean predicted differences. Thus, the choice between logit and probit rarely affects the political-science interpretation of the results (except in rare, singular cases), and in most cases you can choose either model according to your preference.
To simplify the explanation, let’s compare Model 1 of the logit and probit regressions.
b_logit <- coef(fit_logit_1)
b_probit <- coef(fit_probit_1)
cbind(b_logit, b_probit)
## b_logit b_probit
## (Intercept) -2.03745863 -1.24949910
## previous 0.36051937 0.20836892
## expm 0.03824864 0.02509093
Based on these results, we can write the logistic regression as \[\mathrm{Pr(victory)} = \frac{1}{1 + \exp[-(-2.04 + 0.36 \mathrm{previous~wins} + 0.04\mathrm{expenditure} )]}\] and the probit regression as \[\mathrm{Pr(victory)} = \Phi(-1.25 + 0.21 \mathrm{previous~wins} + 0.03\mathrm{expenditure}).\]
As explained in class, the logit model should be approximately equivalent to a probit model whose latent scale is multiplied by 1.6, so the logit coefficients should be roughly 1.6 times the probit coefficients. To verify this, let’s multiply the probit coefficient estimates by 1.6.
b_probit_1.6 <- 1.6 * b_probit
cbind(b_logit, b_probit_1.6, b_probit)
## b_logit b_probit_1.6 b_probit
## (Intercept) -2.03745863 -1.99919856 -1.24949910
## previous 0.36051937 0.33339028 0.20836892
## expm 0.03824864 0.04014549 0.02509093
As shown, the results are very close, though not exactly the same.
The notable difference between the standard normal and logistic distributions is the thickness of their tails. The logistic distribution has fatter tails than the standard normal; in other words, it assigns higher probability to values far from the center. Accordingly, we expect the difference between logit and probit to grow as the value of the linear predictor moves away from 0.
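As a quick numerical illustration of the fatter tails (a sketch, rescaling the logistic distribution to unit standard deviation so that the two are comparable), we can compare the upper-tail probabilities.
## Upper-tail probabilities: standard normal vs. logistic rescaled to sd 1
## (the standard logistic has sd pi / sqrt(3), so we set scale = sqrt(3) / pi)
x <- c(2, 3, 4, 5)
cbind(x,
      normal   = pnorm(x, lower.tail = FALSE),
      logistic = plogis(x, scale = sqrt(3) / pi, lower.tail = FALSE))
## For these values of x, the logistic tail probabilities are the larger ones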