Load some packages. It is better to load arm before coefplot: both packages define a coefplot() function, and loading them in this order lets the coefplot package's version mask arm's.
library("dplyr")
library("arm")
library("coefplot")
library("stargazer")
library("ggplot2")
Let’s use the dataset we saved in the previous lecture.
## load the R format data from the web
load(url("http://www2.kobe-u.ac.jp/~yyanai/classes/rm1/contents/data/HR-labelled.Rda"))
## Or load the file saved in a local folder: load("HR-labelled.Rda")
head(HR)
## Source: local data frame [6 x 16]
##
## year ku kun party name age status nocand
## (int) (chr) (int) (fctr) (chr) (int) (fctr) (int)
## 1 1996 aichi 1 NFP KAWAMURA, TAKASHI 47 incumbent 7
## 2 1996 aichi 1 LDP IMAEDA, NORIO 72 ex 7
## 3 1996 aichi 1 DPJ SATO, TAISUKE 53 incumbent 7
## 4 1996 aichi 1 JCP IWANAKA, MIHOKO 43 challenger 7
## 5 1996 aichi 1 bunka ITO, MASAKO 51 challenger 7
## 6 1996 aichi 1 NP YAMADA, HIROSHIB 51 challenger 7
## Variables not shown: wl (fctr), rank (int), previous (int), vote (int),
## voteshare (dbl), eligible (int), turnout (dbl), exp (int)
tail(HR)
## Source: local data frame [6 x 16]
##
## year ku kun party name age status
## (int) (chr) (int) (fctr) (chr) (int) (fctr)
## 1 2009 yamanashi 2 independent NAGASAKI, KOTARO 41 incumbent
## 2 2009 yamanashi 2 LDP HORIUCHI, MITSUO 79 incumbent
## 3 2009 yamanashi 2 H MIYAMATSU, HIROYUKI 69 challenger
## 4 2009 yamanashi 3 DPJ GOTO, HITOSHI 52 incumbent
## 5 2009 yamanashi 3 LDP ONO, JIRO 56 incumbent
## 6 2009 yamanashi 3 H SAKURADA, DAISUKE 47 challenger
## Variables not shown: nocand (int), wl (fctr), rank (int), previous (int),
## vote (int), voteshare (dbl), eligible (int), turnout (dbl), exp (int)
Let’s create three dummy variables, ldp (Liberal Democratic Party), dpj (Democratic Party of Japan), and experience (prior MP experience), and a variable expm measuring the electoral expenditure in millions of yen.
HR <- HR %>%
mutate(ldp = as.numeric(party == "LDP"),
dpj = as.numeric(party == "DPJ"),
experience = as.numeric(status == "incumbent" | status == "ex"),
expm = exp / 10^6)
Extract the 2005 and 2009 elections and save them as HR05 and HR09, respectively. The datasets have some missing values, and we should impute them with a method such as multiple imputation using mi::mi(). However, we delete the observations with missing values for simplicity today. Please do not delete any observations in your own research.
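For reference, here is a minimal (not run) sketch of what multiple imputation might look like, assuming the mi package's missing_data.frame()/mi()/complete() workflow:
## NOT RUN: a sketch of multiple imputation with mi
# library("mi")
# mdf <- missing_data.frame(as.data.frame(HR))  ## declare the missingness structure
# imp <- mi(mdf)                                ## run the multiple imputation
# HR_imp <- complete(imp, m = 1)                ## extract one completed dataset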
HR05 <- HR %>%
filter(year == 2005) %>%
na.omit()
HR09 <- HR %>%
filter(year == 2009) %>%
na.omit()
Now, let’s regress the vote share (%) on the expenditure (million yen) using the 2005 data.
fit_1 <- lm(voteshare ~ expm, data = HR05)
display(fit_1)
## lm(formula = voteshare ~ expm, data = HR05)
## coef.est coef.se
## (Intercept) 9.12 0.71
## expm 2.61 0.07
## ---
## n = 985, k = 2
## residual sd = 12.57, R-Squared = 0.57
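For example, the fitted line predicts about 9.12 + 2.61 × 5 ≈ 22.2 percent of the vote for a candidate spending 5 million yen:
## predicted vote share at an expenditure of 5 million yen
predict(fit_1, newdata = data.frame(expm = 5))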
Next, let’s check the fit with a residual plot.
sigma_hat_1 <- summary(fit_1)$sigma
HR05$res_fit_1 <- fit_1$residuals
res_plot_1 <- ggplot(HR05, aes(x = expm, y = res_fit_1)) + geom_point()
res_plot_1 <- res_plot_1 +
geom_hline(yintercept = 0, color = "royalblue") +
## dashed lines show ±1 residual sd
geom_hline(yintercept = c(sigma_hat_1, -sigma_hat_1),
color = "red", linetype = "dashed") +
labs(x = "Expenditure (million yen)", y = "Residual")
print(res_plot_1 + ggtitle("Residual plot of fit 1"))
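If the errors were roughly normal, about 68 percent of the residuals should fall within ±1 residual standard deviation of zero; a quick check:
## share of residuals within one residual sd of zero
mean(abs(HR05$res_fit_1) <= sigma_hat_1)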
We can evaluate the model by the predicted values as follows.
We expect the predicted values to be close to the observed values if the estimation is accurate, though some variability inherent in the data-generating process will always remain.
For example, let’s

1. estimate the regression line with the 2005 HR election data,
2. calculate the predicted values using the estimated coefficients and the 2009 HR election data, and
3. compare the predicted values with the observed responses in the 2009 election.

To compare the predicted with the observed values, we will create a scatter plot.
HR09$pred_1 <- predict(fit_1, newdata = HR09)
HR09 <- mutate(HR09, pred_err_1 = voteshare - pred_1)
p_pred_1 <- ggplot(HR09, aes(x = pred_1, y = voteshare)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "royalblue") +
xlim(0, 100) + ylim(0, 100) + coord_fixed() +
xlab("Predicted values based on the 2005 election") +
ylab("Observed values in the 2009 election")
print(p_pred_1 + ggtitle(""))
Let’s change the vertical axis to the gap between the predicted and the observed values.
p_pred_err_1 <- ggplot(HR09, aes(x = pred_1, y = pred_err_1)) +
geom_point() +
geom_hline(yintercept = 0, color = "royalblue") +
geom_hline(yintercept = c(sigma_hat_1, -sigma_hat_1),
color = "royalblue", linetype = "dashed") +
xlab("Predicted values based on the 2005 election") +
ylab("The difference between the predicted and observed values")
print(p_pred_err_1 + ggtitle("Vote Share: Difference b/w Prediction and Observation"))
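One number that summarizes these gaps is the root mean squared prediction error (RMSE):
## out-of-sample RMSE of fit_1 for the 2009 election
sqrt(mean(HR09$pred_err_1^2))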
As these figures show, it does not seem that our regression model captures the truth well.
Just as a trial, let’s fit a kitchen sink model to explain the vote share.
sink_1 <- lm(voteshare ~ experience * expm * age * previous * ldp * dpj, data = HR05)
HR09$pred_2 <- predict(sink_1, newdata = HR09)
HR09 <- mutate(HR09, pred_err_2 = voteshare - pred_2)
p_pred_2 <- ggplot(HR09, aes(x = pred_2, y = voteshare)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "royalblue") +
labs(x = "Prediction based on the 2005 result", y = "Observation in 2009")
print(p_pred_2 + ggtitle(""))
Next, we will visualize the difference between the predicted and observed values.
p_pred_err_2 <- ggplot(HR09, aes(x = pred_2, y = pred_err_2)) +
geom_point() +
geom_hline(yintercept = 0, color = "royalblue") +
geom_hline(yintercept = c(sigma_hat_1, -sigma_hat_1),
color = "royalblue", linetype = "dashed") +
xlab("Prediction based on the 2005 result") +
ylab("Difference between prediction and observation")
print(p_pred_err_2 + ggtitle("Vote Share: Difference b/w Prediction and Observation"))
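We can compare the two models numerically with their out-of-sample RMSEs:
## out-of-sample RMSE of the simple and the kitchen sink model
sqrt(mean(HR09$pred_err_1^2))  ## fit_1
sqrt(mean(HR09$pred_err_2^2))  ## sink_1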
As this example shows, as far as prediction is concerned, even a kitchen sink model can perform well. However, we should not include predictors whose effects on the response cannot be theoretically explained.
We will fit the simple regression model with the vote share (%) as the response and the electoral expenditure as the predictor. Using the predictor exp, which measures the expenditure in yen, we can estimate the coefficients as follows.
fit_2 <- lm(voteshare ~ exp, data = HR09)
display(fit_2, digits = 8)
## lm(formula = voteshare ~ exp, data = HR09)
## coef.est coef.se
## (Intercept) 7.73521322 0.75659813
## exp 0.00000307 0.00000010
## ---
## n = 1124, k = 2
## residual sd = 16.04179841, R-Squared = 0.48
Thus, \[\mathrm{vote\ share} = 7.74 + 0.00000307 \cdot \mathrm{expenditure\ (yen)} + \mathrm{error}.\]
Similarly, with the predictor expm, which measures the same expenditure in million yen, we get the following result.
fit_3 <- lm(voteshare ~ expm, data = HR09)
## If we had not created "expm", we would run the following instead.
# fit_3 <- lm(voteshare ~ I(exp / 10^6), data = HR09)
display(fit_3)
## lm(formula = voteshare ~ expm, data = HR09)
## coef.est coef.se
## (Intercept) 7.74 0.76
## expm 3.07 0.10
## ---
## n = 1124, k = 2
## residual sd = 16.04, R-Squared = 0.48
Thus, \[\mathrm{vote\ share} = 7.74 + 3.07 \cdot \mathrm{expenditure\ (million\ yen)} + \mathrm{error}.\]
What these two equations tell us is substantively the same. Which do you think is better, and why?
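As a quick check, the slope of fit_2 multiplied by 10^6 (yen per million yen) equals the slope of fit_3:
## the two slopes differ only by the change of units
coef(fit_2)["exp"] * 10^6
coef(fit_3)["expm"]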
Let’s fit the model with standardized variables. We standardize a random variable \(x\) to get a standardized variable \(z\) by \[z_x=\frac{x - \bar{x}}{u_x},\] where \(u_x\) is the square root of the unbiased variance of \(x\) (the sample standard deviation computed by sd() in R).
Let’s try explaining the vote share by the standardized expenditure (million yen).
HR09 <- HR09 %>%
mutate(z_expm = (expm - mean(expm, na.rm = TRUE)) / sd(expm, na.rm = TRUE))
summary(HR09$z_expm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2220 -0.8652 -0.2619 0.0000 0.5985 3.8490
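The same standardization can be done with the built-in scale(), which returns a matrix; a quick equivalence check (z_alt is a throwaway name of ours):
## scale() centers by the mean and divides by the sample sd
z_alt <- as.numeric(scale(HR09$expm))
all.equal(HR09$z_expm, z_alt)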
We have the \(z\) value of expm, z_expm. We will fit the model with this predictor.
fit_4 <- lm(voteshare ~ z_expm, data = HR09)
display(fit_4)
## lm(formula = voteshare ~ z_expm, data = HR09)
## coef.est coef.se
## (Intercept) 26.53 0.48
## z_expm 15.35 0.48
## ---
## n = 1124, k = 2
## residual sd = 16.04, R-Squared = 0.48
This result shows that the vote share increases by 15.35 points on average as the expenditure increases by one standard deviation. The coefficient value for the intercept shows the predicted (mean) value of the vote share when the expenditure is at its mean.
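In fact, the standardized slope is simply the raw slope multiplied by the standard deviation of expm; a quick check:
## raw slope times sd of expm equals the standardized slope
coef(fit_3)["expm"] * sd(HR09$expm)
coef(fit_4)["z_expm"]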
Let’s try to explain the vote share by the prior MP experience and the electoral expenditure, including their interaction. We will fit the model as follows.
fit_5 <- lm(voteshare ~ experience * expm, data = HR09)
display(fit_5)
## lm(formula = voteshare ~ experience * expm, data = HR09)
## coef.est coef.se
## (Intercept) -2.07 0.72
## experience 45.91 1.58
## expm 4.87 0.16
## experience:expm -4.76 0.21
## ---
## n = 1124, k = 4
## residual sd = 12.11, R-Squared = 0.70
What does each coefficient estimate mean?
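As a hint, the implied expenditure slopes can be read off the coefficients:
## expenditure slopes implied by fit_5
b <- coef(fit_5)
unname(b["expm"])                         ## slope for candidates without MP experience
unname(b["expm"] + b["experience:expm"])  ## slope for experienced candidates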
Now, let’s fit a similar model with centered predictors.
HR09 <- HR09 %>%
mutate(c_experience = experience - mean(experience),
c_expm = expm - mean(expm, na.rm = TRUE))
fit_5_c <- lm(voteshare ~ c_experience * c_expm, data = HR09)
display(fit_5_c)
## lm(formula = voteshare ~ c_experience * c_expm, data = HR09)
## coef.est coef.se
## (Intercept) 34.53 0.50
## c_experience 16.81 1.01
## c_expm 2.96 0.11
## c_experience:c_expm -4.76 0.21
## ---
## n = 1124, k = 4
## residual sd = 12.11, R-Squared = 0.70
First, verify that the \(R^2\) values are the same for the two models, which implies that they give the same substantive result.
Next, think about the meaning of each estimated coefficient. The intercept 34.53 is the predicted vote share when both predictors are at their means. Before centering the predictors, the intercept showed the predicted vote share when all the predictors take the value of zero, a combination that does not occur in the data. By centering the predictors, we obtain a meaningful intercept.
The coefficient for c_experience shows that experienced candidates are expected to get a 16.81-point higher vote share than inexperienced ones on average when the expenditure is at its mean. Before centering, the coefficient showed the effect of experience for those who spent nothing on the election, even though every candidate spent some money. Here, we get the effect for a representative observation.
The coefficient for c_expm tells us that the vote share is expected to increase by 2.96 points on average as the expenditure increases by one unit (one million yen) for a candidate with the average level of experience. Before centering, the coefficient showed the effect of money for inexperienced candidates only. Here, the slope for an average candidate is presented.
Lastly, the coefficient for the interaction term shows that the expenditure slope is 4.76 points smaller for experienced candidates than for inexperienced ones. Given this interpretation, it is not surprising that the interaction coefficient does not differ between the models before and after centering.
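We can also verify that the centered intercept equals the prediction at the predictor means:
## prediction at the means equals the centered model's intercept
predict(fit_5, newdata = data.frame(experience = mean(HR09$experience),
                                    expm = mean(HR09$expm)))
coef(fit_5_c)["(Intercept)"]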
To obtain a logarithmic value of a variable in R, we use log() or log10(). The base \(b\) logarithm of \(x\) can be calculated by log(x, base = b). The default base is \(e\), so the natural logarithm of x is simply log(x). We use log10() to get base-10 logarithms: log10(x) equals log(x, base = 10).
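For instance:
log(exp(1))       ## natural logarithm of e: 1
log10(100)        ## base-10 logarithm: 2
log(8, base = 2)  ## base-2 logarithm: 3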
To use a logarithm of a variable in linear regression, we either prepare a logarithmic variable in the data frame or transform it in the model. For example, to log-transform the response in the model, we run the following.
fit_6 <- lm(log(voteshare) ~ expm, data = HR09)
display(fit_6)
## lm(formula = log(voteshare) ~ expm, data = HR09)
## coef.est coef.se
## (Intercept) 1.17 0.05
## expm 0.22 0.01
## ---
## n = 1124, k = 2
## residual sd = 1.10, R-Squared = 0.49
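With a logged response, a coefficient has a multiplicative interpretation: spending one more million yen multiplies the predicted vote share by exp(0.22), roughly 1.25, i.e., about a 25 percent increase.
## multiplicative effect of one more million yen on the predicted vote share
exp(coef(fit_6)["expm"])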
To make tables of regression results, we can use the stargazer::stargazer() function. Though significance stars are annoying, we can suppress them.
stargazer(fit_2, fit_3, fit_4, ## regression models to report
digits = 2, ## show 2 digits after the decimal point
digits.extra = 0, ## 0 extra zero after the specified number of digits
align = TRUE, ## align the figures by decimal point
star.cutoffs = NA, ## show no significance stars
df = FALSE, ## do not report the degrees of freedom
keep.stat = c("n", "adj.rsq", "f"), ## statistics to show in the bottom
covariate.labels = c("expenditure (yen)", "expenditure (million yen)",
"standardized expenditure (million yen)",
"constant"),
dep.var.caption = "Response",
dep.var.labels = "vote share (%)",
title = "Estimation Results by Linear Regression",
notes = "Note: Standard errors are in parentheses.", ## add the notes
notes.align = "l", ## align the notes to the left
notes.append = FALSE, ## overwrite the default notes
notes.label = "", ## add no note label
type = "html") ## html output. Replace it with "latex" to use it for LaTeX
Estimation Results by Linear Regression

                                        Response: vote share (%)
                                        (1)        (2)        (3)
expenditure (yen)                       0.00
                                        (0.00)
expenditure (million yen)                          3.07
                                                   (0.10)
standardized expenditure (million yen)                        15.35
                                                              (0.48)
constant                                7.74       7.74       26.53
                                        (0.76)     (0.76)     (0.48)
Observations                            1,124      1,124      1,124
Adjusted R2                             0.48       0.48       0.48
F Statistic                             1,028.24   1,028.24   1,028.24

Note: Standard errors are in parentheses.
It has become common to use caterpillar plots to report regression results. We can create them with the ggplot2 or coefplot packages.
model_1 <- lm(voteshare ~ c_experience, data = HR09)
model_2 <- lm(voteshare ~ c_expm, data = HR09)
model_3 <- lm(voteshare ~ c_experience + c_expm, data = HR09)
model_4 <- lm(voteshare ~ c_experience * c_expm, data = HR09)
## make caterpillar plots with the coefplot packages
## to report a single model
coefplot(model_4, intercept = FALSE,
title = "Estimation result: response is vote share (%)",
xlab = "Coefficient estimate", ylab = "Predictor",
newNames = c(c_experience = "Experience",
c_expm = "Expenditure (million yen)",
"c_experience:c_expm" = "Experience x Expenditure"))
## to report multiple models
multiplot(model_1, model_2, model_3, model_4,
intercept = FALSE, numberAngle = 0,
title = "Estimation results: response is vote share (%)",
xlab = "Coefficient estimate", ylab = "Predictor",
newNames = c(c_experience = "Experience",
c_expm = "Expenditure (million yen)",
"c_experience:c_expm" = "Experience x Expenditure")) +
scale_color_discrete(name = "", labels = paste("Model", 1:4)) +
guides(color = guide_legend(reverse = TRUE))
If you are satisfied with these figures, you can use coefplot::coefplot() or coefplot::multiplot(). If you are not satisfied and would like to fine-tune the figures, you should create caterpillar plots with ggplot2::geom_pointrange().
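For instance, here is a minimal sketch of such a plot, assuming 95 percent confidence intervals from confint() (the names ci and est are ours):
## a sketch of a caterpillar plot built directly with ggplot2
ci <- confint(model_4)[-1, , drop = FALSE]  ## 95% CIs, intercept dropped
est <- data.frame(term = rownames(ci),
                  estimate = coef(model_4)[-1],
                  lower = ci[, 1],
                  upper = ci[, 2])
ggplot(est, aes(x = term, y = estimate, ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 0, color = "gray", linetype = "dashed") +
  geom_pointrange() +
  coord_flip() +
  labs(x = "Predictor", y = "Coefficient estimate")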