Load some packages. It is better to load arm before coefplot: both packages define a coefplot() function, and loading them in this order lets the coefplot package's version mask arm's.
library("dplyr")
library("arm")
library("coefplot")
library("stargazer")
library("ggplot2")
Let’s use the dataset we saved in the previous lecture.
## load the R format data from the web
load(url("http://www2.kobe-u.ac.jp/~yyanai/classes/rm1/contents/data/HR-labelled.Rda"))
## Or load the file saved in a local folder: load("HR-labelled.Rda")
head(HR)
## Source: local data frame [6 x 16]
##
## year ku kun party name age status nocand
## (int) (chr) (int) (fctr) (chr) (int) (fctr) (int)
## 1 1996 aichi 1 NFP KAWAMURA, TAKASHI 47 incumbent 7
## 2 1996 aichi 1 LDP IMAEDA, NORIO 72 ex 7
## 3 1996 aichi 1 DPJ SATO, TAISUKE 53 incumbent 7
## 4 1996 aichi 1 JCP IWANAKA, MIHOKO 43 challenger 7
## 5 1996 aichi 1 bunka ITO, MASAKO 51 challenger 7
## 6 1996 aichi 1 NP YAMADA, HIROSHIB 51 challenger 7
## Variables not shown: wl (fctr), rank (int), previous (int), vote (int),
## voteshare (dbl), eligible (int), turnout (dbl), exp (int)
tail(HR)
## Source: local data frame [6 x 16]
##
## year ku kun party name age status
## (int) (chr) (int) (fctr) (chr) (int) (fctr)
## 1 2009 yamanashi 2 independent NAGASAKI, KOTARO 41 incumbent
## 2 2009 yamanashi 2 LDP HORIUCHI, MITSUO 79 incumbent
## 3 2009 yamanashi 2 H MIYAMATSU, HIROYUKI 69 challenger
## 4 2009 yamanashi 3 DPJ GOTO, HITOSHI 52 incumbent
## 5 2009 yamanashi 3 LDP ONO, JIRO 56 incumbent
## 6 2009 yamanashi 3 H SAKURADA, DAISUKE 47 challenger
## Variables not shown: nocand (int), wl (fctr), rank (int), previous (int),
## vote (int), voteshare (dbl), eligible (int), turnout (dbl), exp (int)
Let’s create three dummy variables, ldp (Liberal Democratic Party), dpj (Democratic Party of Japan), and experience (prior MP experience), and a variable expm measuring the electoral expenditure in millions of yen.
HR <- HR %>%
mutate(ldp = as.numeric(party == "LDP"),
dpj = as.numeric(party == "DPJ"),
experience = as.numeric(status == "incumbent" | status == "ex"),
expm = exp / 10^6)
Extract the 2005 and 2009 elections and save them as HR05 and HR09, respectively. The datasets have some missing values, and we should impute them with a method such as multiple imputation using mi::mi(). However, we delete the observations with missing values for simplicity today. Please do not delete any observations in your own research.
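For reference, here is a minimal (not run) sketch of what multiple imputation might look like, assuming the mi package's missing_data.frame()/mi()/complete() workflow:
## NOT RUN: a sketch of multiple imputation with mi
# library("mi")
# mdf <- missing_data.frame(as.data.frame(HR))  ## declare the missingness structure
# imp <- mi(mdf)                                ## run the multiple imputation
# HR_imp <- complete(imp, m = 1)                ## extract one completed dataset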
HR05 <- HR %>%
filter(year == 2005) %>%
na.omit()
HR09 <- HR %>%
filter(year == 2009) %>%
na.omit()
Now, let’s regress the vote share (%) on the expenditure (million yen) using the 2005 data.
fit_1 <- lm(voteshare ~ expm, data = HR05)
display(fit_1)
## lm(formula = voteshare ~ expm, data = HR05)
## coef.est coef.se
## (Intercept) 9.12 0.71
## expm 2.61 0.07
## ---
## n = 985, k = 2
## residual sd = 12.57, R-Squared = 0.57
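For example, the fitted line predicts about 9.12 + 2.61 × 5 ≈ 22.2 percent of the vote for a candidate spending 5 million yen:
## predicted vote share at an expenditure of 5 million yen
predict(fit_1, newdata = data.frame(expm = 5))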
Next, let’s check the fit with a residual plot.
sigma_hat_1 <- summary(fit_1)$sigma
HR05$res_fit_1 <- fit_1$residuals
res_plot_1 <- ggplot(HR05, aes(x = expm, y = res_fit_1)) + geom_point()
res_plot_1 <- res_plot_1 +
geom_hline(yintercept = 0, color = "royalblue") +
## dashed lines show ±1 residual sd
geom_hline(yintercept = c(sigma_hat_1, -sigma_hat_1),
color = "red", linetype = "dashed") +
labs(x = "Expenditure (million yen)", y = "Residual")
print(res_plot_1 + ggtitle("Residual plot of fit 1"))
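If the errors were roughly normal, about 68 percent of the residuals should fall within ±1 residual standard deviation of zero; a quick check:
## share of residuals within one residual sd of zero
mean(abs(HR05$res_fit_1) <= sigma_hat_1)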
We can evaluate the model by the predicted values as follows.
We expect the predicted values to be close to the observed values if the estimation is accurate, though some variability inherent in the data-generating process will always remain.
For example, let’s

1. estimate the regression line with the 2005 HR election data,
2. calculate the predicted values using the estimated coefficients and the 2009 HR election data, and
3. compare the predicted values with the observed responses in the 2009 election.

To compare the predicted with the observed values, we will create a scatter plot.
HR09$pred_1 <- predict(fit_1, newdata = HR09)
HR09 <- mutate(HR09, pred_err_1 = voteshare - pred_1)
p_pred_1 <- ggplot(HR09, aes(x = pred_1, y = voteshare)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "royalblue") +
xlim(0, 100) + ylim(0, 100) + coord_fixed() +
xlab("Predicted values based on the 2005 election") +
ylab("Observed values in the 2009 election")
print(p_pred_1 + ggtitle(""))
Let’s change the vertical axis to the gap between the predicted and the observed values.
p_pred_err_1 <- ggplot(HR09, aes(x = pred_1, y = pred_err_1)) +
geom_point() +
geom_hline(yintercept = 0, color = "royalblue") +
geom_hline(yintercept = c(sigma_hat_1, -sigma_hat_1),
color = "royalblue", linetype = "dashed") +
xlab("Predicted values based on the 2005 election") +
ylab("The difference between the predicted and observed values")
print(p_pred_err_1 + ggtitle("Vote Share: Difference b/w Prediction and Observation"))
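One number that summarizes these gaps is the root mean squared prediction error (RMSE):
## out-of-sample RMSE of fit_1 for the 2009 election
sqrt(mean(HR09$pred_err_1^2))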
As these figures show, it does not seem that our regression model captures the truth well.
Just as a trial, let’s fit a kitchen sink model to explain the vote share.
sink_1 <- lm(voteshare ~ experience * expm * age * previous * ldp * dpj, data = HR05)
HR09$pred_2 <- predict(sink_1, newdata = HR09)
HR09 <- mutate(HR09, pred_err_2 = voteshare - pred_2)
p_pred_2 <- ggplot(HR09, aes(x = pred_2, y = voteshare)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "royalblue") +
labs(x = "Prediction based on the 2005 result", y = "Observation in 2009")
print(p_pred_2 + ggtitle(""))
Next, we will visualize the difference between the predicted and observed values.
p_pred_err_2 <- ggplot(HR09, aes(x = pred_2, y = pred_err_2)) +
geom_point() +
geom_hline(yintercept = 0, color = "royalblue") +
geom_hline(yintercept = c(sigma_hat_1, -sigma_hat_1),
color = "royalblue", linetype = "dashed") +
xlab("Prediction based on the 2005 result") +
ylab("Difference between prediction and observation")
print(p_pred_err_2 + ggtitle("Vote Share: Difference b/w Prediction and Observation"))
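We can compare the two models numerically with their out-of-sample RMSEs:
## out-of-sample RMSE of the simple and the kitchen sink model
sqrt(mean(HR09$pred_err_1^2))  ## fit_1
sqrt(mean(HR09$pred_err_2^2))  ## sink_1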
As this example shows, as far as prediction is concerned, even a kitchen sink model can perform well. However, we should not include predictors whose effects on the response cannot be theoretically explained.
We will fit the simple regression model with the vote share (%) as the response and the electoral expenditure as the predictor. Using the predictor exp, which measures the expenditure in yen, we can estimate the coefficients as follows.
fit_2 <- lm(voteshare ~ exp, data = HR09)
display(fit_2, digits = 8)
## lm(formula = voteshare ~ exp, data = HR09)
## coef.est coef.se
## (Intercept) 7.73521322 0.75659813
## exp 0.00000307 0.00000010
## ---
## n = 1124, k = 2
## residual sd = 16.04179841, R-Squared = 0.48
Thus, \[\mathrm{vote\ share} = 7.74 + 0.00000307 \cdot \mathrm{expenditure\ (yen)} + \mathrm{error}.\]
Similarly, with the predictor expm, which measures the same expenditure in million yen, we get the following result.
fit_3 <- lm(voteshare ~ expm, data = HR09)
## If we had not created "expm", we would run the following instead.
# fit_3 <- lm(voteshare ~ I(exp / 10^6), data = HR09)
display(fit_3)
## lm(formula = voteshare ~ expm, data = HR09)
## coef.est coef.se
## (Intercept) 7.74 0.76
## expm 3.07 0.10
## ---
## n = 1124, k = 2
## residual sd = 16.04, R-Squared = 0.48
Thus, \[\mathrm{vote\ share} = 7.74 + 3.07 \cdot \mathrm{expenditure\ (million\ yen)} + \mathrm{error}.\]
What these two equations tell us is substantively the same. Which do you think is better, and why?
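As a quick check, the slope of fit_2 multiplied by 10^6 (yen per million yen) equals the slope of fit_3:
## the two slopes differ only by the change of units
coef(fit_2)["exp"] * 10^6
coef(fit_3)["expm"]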
Let’s fit the model with standardized variables. We standardize a random variable \(x\) to get a standardized variable \(z\) by \[z_x=\frac{x - \bar{x}}{u_x},\] where \(u_x\) is the square root of the unbiased variance of \(x\) (the sample standard deviation computed by sd() in R).
Let’s try explaining the vote share by the standardized expenditure (million yen).
HR09 <- HR09 %>%
mutate(z_expm = (expm - mean(expm, na.rm = TRUE)) / sd(expm, na.rm = TRUE))
summary(HR09$z_expm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2220 -0.8652 -0.2619 0.0000 0.5985 3.8490
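The same standardization can be done with the built-in scale(), which returns a matrix; a quick equivalence check (z_alt is a throwaway name of ours):
## scale() centers by the mean and divides by the sample sd
z_alt <- as.numeric(scale(HR09$expm))
all.equal(HR09$z_expm, z_alt)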
We have the \(z\) value of expm, z_expm. We will fit the model with this predictor.
fit_4 <- lm(voteshare ~ z_expm, data = HR09)
display(fit_4)
## lm(formula = voteshare ~ z_expm, data = HR09)
## coef.est coef.se
## (Intercept) 26.53 0.48
## z_expm 15.35 0.48
## ---
## n = 1124, k = 2
## residual sd = 16.04, R-Squared = 0.48
This result shows that the vote share increases by 15.35 points on average as the expenditure increases by one standard deviation. The coefficient value for the intercept shows the predicted (mean) value of the vote share when the expenditure is at its mean.
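In fact, the standardized slope is simply the raw slope multiplied by the standard deviation of expm; a quick check:
## raw slope times sd of expm equals the standardized slope
coef(fit_3)["expm"] * sd(HR09$expm)
coef(fit_4)["z_expm"]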
Let’s try to explain the vote share by the prior MP experience and the electoral expenditure, including their interaction. We will fit the model as follows.
fit_5 <- lm(voteshare ~ experience * expm, data = HR09)
display(fit_5)
## lm(formula = voteshare ~ experience * expm, data = HR09)
## coef.est coef.se
## (Intercept) -2.07 0.72
## experience 45.91 1.58
## expm 4.87 0.16
## experience:expm -4.76 0.21
## ---
## n = 1124, k = 4
## residual sd = 12.11, R-Squared = 0.70
What does each coefficient estimate mean?
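As a hint, the implied expenditure slopes can be read off the coefficients:
## expenditure slopes implied by fit_5
b <- coef(fit_5)
unname(b["expm"])                         ## slope for candidates without MP experience
unname(b["expm"] + b["experience:expm"])  ## slope for experienced candidates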
Now, let’s fit a similar model with centered predictors.
HR09 <- HR09 %>%
mutate(c_experience = experience - mean(experience),
c_expm = expm - mean(expm, na.rm = TRUE))
fit_5_c <- lm(voteshare ~ c_experience * c_expm, data = HR09)
display(fit_5_c)
## lm(formula = voteshare ~ c_experience * c_expm, data = HR09)
## coef.est coef.se
## (Intercept) 34.53 0.50
## c_experience 16.81 1.01
## c_expm 2.96 0.11
## c_experience:c_expm -4.76 0.21
## ---
## n = 1124, k = 4
## residual sd = 12.11, R-Squared = 0.70
First, verify that the \(R^2\) values are the same for the two models, which implies that they give the same substantive result.
Next, think about the meaning of each estimated coefficient. The intercept 34.53 is the predicted vote share when both predictors are at their means. Before centering the predictors, the intercept showed the predicted vote share when all the predictors take the value of zero, a combination that does not occur in the data. By centering the predictors, we obtain a meaningful intercept.
The coefficient for c_experience shows that experienced candidates are expected to get a 16.81-point higher vote share than inexperienced ones on average when the expenditure is at its mean. Before centering, the coefficient showed the effect of experience for those who spent nothing on the election, even though every candidate spent some money. Here, we get the effect for a representative observation.
The coefficient for c_expm tells us that the vote share is expected to increase by 2.96 points on average as the expenditure increases by one unit (one million yen) for a candidate with the average level of experience. Before centering, the coefficient showed the effect of money for inexperienced candidates only. Here, the slope for an average candidate is presented.
Lastly, the coefficient for the interaction term shows that the expenditure slope is 4.76 points smaller for experienced candidates than for inexperienced ones. Given this interpretation, it is not surprising that the interaction coefficient does not differ between the models before and after centering.
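We can also verify that the centered intercept equals the prediction at the predictor means:
## prediction at the means equals the centered model's intercept
predict(fit_5, newdata = data.frame(experience = mean(HR09$experience),
                                    expm = mean(HR09$expm)))
coef(fit_5_c)["(Intercept)"]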
To obtain a logarithmic value of a variable in R, we use log() or log10(). The base \(b\) logarithm of \(x\) can be calculated by log(x, base = b). The default base is \(e\), so the natural logarithm of x is simply log(x). We use log10() to get base-10 logarithms: log10(x) equals log(x, base = 10).
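For instance:
log(exp(1))       ## natural logarithm of e: 1
log10(100)        ## base-10 logarithm: 2
log(8, base = 2)  ## base-2 logarithm: 3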
To use a logarithm of a variable in linear regression, we either prepare a logarithmic variable in the data frame or transform it in the model. For example, to log-transform the response in the model, we run the following.
fit_6 <- lm(log(voteshare) ~ expm, data = HR09)
display(fit_6)
## lm(formula = log(voteshare) ~ expm, data = HR09)
## coef.est coef.se
## (Intercept) 1.17 0.05
## expm 0.22 0.01
## ---
## n = 1124, k = 2
## residual sd = 1.10, R-Squared = 0.49
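With a logged response, a coefficient has a multiplicative interpretation: spending one more million yen multiplies the predicted vote share by exp(0.22), roughly 1.25, i.e., about a 25 percent increase.
## multiplicative effect of one more million yen on the predicted vote share
exp(coef(fit_6)["expm"])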
To make tables of regression results, we can use the stargazer::stargazer() function. Though significance stars are annoying, we can suppress them.
stargazer(fit_2, fit_3, fit_4, ## regression models to report
digits = 2, ## show 2 digits after the decimal point
digits.extra = 0, ## 0 extra zero after the specified number of digits
align = TRUE, ## align the figures by decimal point
star.cutoffs = NA, ## show no significance stars
df = FALSE, ## do not report the degrees of freedom
keep.stat = c("n", "adj.rsq", "f"), ## statistics to show in the bottom
covariate.labels = c("expenditure (yen)", "expenditure (million yen)",
"standardized expenditure (million yen)",
"constant"),
dep.var.caption = "Response",
dep.var.labels = "vote share (%)",
title = "Estimation Results by Linear Regression",
notes = "Note: Standard errors are in parentheses.", ## add the notes
notes.align = "l", ## align the notes to the left
notes.append = FALSE, ## overwrite the default notes
notes.label = "", ## add no note label
type = "html") ## html output. Replace it with "latex" to use it for LaTeX
Estimation Results by Linear Regression

                                        Response: vote share (%)
                                        (1)        (2)        (3)
expenditure (yen)                       0.00
                                        (0.00)
expenditure (million yen)                          3.07
                                                   (0.10)
standardized expenditure (million yen)                        15.35
                                                              (0.48)
constant                                7.74       7.74       26.53
                                        (0.76)     (0.76)     (0.48)
Observations                            1,124      1,124      1,124
Adjusted R2                             0.48       0.48       0.48
F Statistic                             1,028.24   1,028.24   1,028.24

Note: Standard errors are in parentheses.
It has become common to use caterpillar plots to report regression results. We can create them with the ggplot2 or coefplot packages.
model_1 <- lm(voteshare ~ c_experience, data = HR09)
model_2 <- lm(voteshare ~ c_expm, data = HR09)
model_3 <- lm(voteshare ~ c_experience + c_expm, data = HR09)
model_4 <- lm(voteshare ~ c_experience * c_expm, data = HR09)
## make caterpillar plots with the coefplot packages
## to report a single model
coefplot(model_4, intercept = FALSE,
title = "Estimation result: response is vote share (%)",
xlab = "Coefficient estimate", ylab = "Predictor",
newNames = c(c_experience = "Experience",
c_expm = "Expenditure (million yen)",
"c_experience:c_expm" = "Experience x Expenditure"))
## to report multiple models
multiplot(model_1, model_2, model_3, model_4,
intercept = FALSE, numberAngle = 0,
title = "Estimation results: response is vote share (%)",
xlab = "Coefficient estimate", ylab = "Predictor",
newNames = c(c_experience = "Experience",
c_expm = "Expenditure (million yen)",
"c_experience:c_expm" = "Experience x Expenditure")) +
scale_color_discrete(name = "", labels = paste("Model", 1:4)) +
guides(color = guide_legend(reverse = TRUE))
If you are satisfied with these figures, you can use coefplot::coefplot() or coefplot::multiplot(). If you are not satisfied and would like to fine-tune the figures, you should create caterpillar plots with ggplot2::geom_pointrange().
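For instance, here is a minimal sketch of such a plot, assuming 95 percent confidence intervals from confint() (the names ci and est are ours):
## a sketch of a caterpillar plot built directly with ggplot2
ci <- confint(model_4)[-1, , drop = FALSE]  ## 95% CIs, intercept dropped
est <- data.frame(term = rownames(ci),
                  estimate = coef(model_4)[-1],
                  lower = ci[, 1],
                  upper = ci[, 2])
ggplot(est, aes(x = term, y = estimate, ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 0, color = "gray", linetype = "dashed") +
  geom_pointrange() +
  coord_flip() +
  labs(x = "Predictor", y = "Coefficient estimate")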