First, let’s load the R packages we will use on this page. (Don’t forget to install them first.)
library("readr")
library("dplyr")
library("xtable")
library("stargazer")
library("ggplot2")
library("reshape2")
We use a data set analyzed by Asano and Yanai (2013) for the purpose of illustration. Download hr96-09.csv, which is the data set of the House of Representatives elections in Japan, and save it in the data folder within your project folder (or anywhere you want and change the path to the file accordingly).
This data set is in CSV (comma-separated values) format. R can read CSV files with the readr::read_csv() function (or the built-in function read.csv()). In R, missing values are represented by NA (without quotation marks). If your CSV file uses other codes for missing values, specify them with the na argument (na.strings in read.csv()). For instance, if you coded “Don’t know” as 9 and “No answer” as -1 and would like to treat both as missing, set na = c("9", "-1"); readr expects the codes as character strings. The data set we use here has “.” for missing values. Thus, to load the data set into R, run:
HR <- read_csv("data/hr96-09.csv", na = ".") ## dataset is in "data" folder
## Alternatively, you can use read.csv(), although we prefer readr::read_csv()
# HR <- read.csv("data/hr96-09.csv", na.strings = ".")
head(HR) ## display the first 6 rows
## Source: local data frame [6 x 16]
##
## year ku kun party name age status nocand wl rank
## 1 1996 aichi 1 1000 KAWAMURA, TAKASHI 47 2 7 1 1
## 2 1996 aichi 1 800 IMAEDA, NORIO 72 3 7 0 2
## 3 1996 aichi 1 1001 SATO, TAISUKE 53 2 7 0 3
## 4 1996 aichi 1 305 IWANAKA, MIHOKO 43 1 7 0 4
## 5 1996 aichi 1 1014 ITO, MASAKO 51 1 7 0 5
## 6 1996 aichi 1 1038 YAMADA, HIROSHIB 51 1 7 0 6
## Variables not shown: previous (int), vote (int), voteshare (dbl), eligible
## (int), turnout (dbl), exp (int)
tail(HR) ## display the last 6 rows
## Source: local data frame [6 x 16]
##
## year ku kun party name age status nocand wl rank
## 1 2009 yamanashi 2 1 NAGASAKI, KOTARO 41 2 4 0 2
## 2 2009 yamanashi 2 800 HORIUCHI, MITSUO 79 2 4 0 3
## 3 2009 yamanashi 2 1115 MIYAMATSU, HIROYUKI 69 1 4 0 4
## 4 2009 yamanashi 3 1001 GOTO, HITOSHI 52 2 3 1 1
## 5 2009 yamanashi 3 800 ONO, JIRO 56 2 3 0 2
## 6 2009 yamanashi 3 1115 SAKURADA, DAISUKE 47 1 3 0 3
## Variables not shown: previous (int), vote (int), voteshare (dbl), eligible
## (int), turnout (dbl), exp (int)
Now you have loaded the data set into R and saved it as the data frame named “HR”.
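As a small self-contained check of how missing-value codes are handled, the sketch below reads a made-up three-row CSV from a string with base R's read.csv() and its na.strings argument. The object names csv_text and toy and the data are hypothetical, chosen only for illustration; they are not part of the election data set.

```r
## Hypothetical toy CSV (not the election data): "." marks a missing value
csv_text <- "id,age,score
1,34,2.5
2,.,3.1
3,51,."

## read.csv() can read from a string via the text argument;
## na.strings lists the codes to be converted to NA
toy <- read.csv(text = csv_text, na.strings = ".")
print(toy)

## count the missing values in each column
print(colSums(is.na(toy)))
```

Because the “.” entries become NA, the age and score columns are read as numbers and not as character strings.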
This data set contains results for multiple elections. Let’s check which election years it covers. We use the table() function to tabulate data. Because the variables live inside the data frame, we wrap the call in with(), which evaluates an expression using the variables of the data frame given as its first argument.
with(HR, table(year))
## year
## 1996 2000 2003 2005 2009
## 1261 1199 1026 989 1139
## Alternatively, you can do
# table(HR$year)
It shows that we have five election years. For simplicity, let’s extract only the 2009 election. We can use the dplyr::filter() function to keep only the observations that satisfy a specified condition.
HR09 <- HR %>%
filter(year == 2009) ## extract observations with year==2009
with(HR09, table(year))
## year
## 2009
## 1139
The pipe %>% passes the result of the left-hand side to the function on the right-hand side, which uses it as its first argument. Thus, HR %>% filter(year == 2009) does the same as filter(HR, year == 2009). That is, you do not need a pipe at all in this example, but I introduce it now in a very simple setting because you will use it a lot in the future. You can chain multiple pipes to carry out complicated data transformations, and you will see more examples later in this course. The pipe comes from the magrittr package, and it becomes available automatically when you load dplyr.
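To make the equivalence concrete, here is a minimal sketch that assumes dplyr is installed. The data frame toy and its columns are hypothetical stand-ins for HR. It checks that the piped call and the ordinary call return the same result, and then chains two steps.

```r
library("dplyr")

## Hypothetical toy data standing in for HR
toy <- data.frame(year = c(1996, 2009, 2009, 2005),
                  vote = c(10, 20, 30, 40))

## the same filtering, written with and without the pipe
by_pipe <- toy %>% filter(year == 2009)
by_call <- filter(toy, year == 2009)
print(identical(by_pipe, by_call))

## pipes can be chained: filter first, then summarize the result
mean_2009 <- toy %>%
  filter(year == 2009) %>%
  summarize(mean_vote = mean(vote))
print(mean_2009)
```

Reading a chain from top to bottom follows the order in which the steps are carried out, which is the main reason to prefer pipes over nested calls.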
The data frame “HR09” contains only the 2009 election. We will use this smaller data set from now on.
Using the 2009 HR election data, let’s tabulate the candidate’s status (status) and the election result (wl).
(tbl_st_wl <- with(HR09, table(status, wl)))
## wl
## status 0 1 2
## 1 559 81 43
## 2 168 174 52
## 3 15 45 2
Since it is hard to tell what each number means, let’s add labels to variables. To add labels, we transform the class of variables from integer to factor. Here, we add labels to three variables: wl, status, and party. Let’s add labels to the original, larger data set, and filter it to make a smaller data frame again.
## names of parties in the order of coded values
party_names <- c("independent", "JCP", "LDP", "CGP", "oki",
"tai", "saki", "NFP", "DPJ", "SDP",
"LF", "NJSP", "DRF", "kobe", "nii",
"sei", "JNFP", "bunka", "green", "LP",
"RC", "muk", "CP", "NCP", "ND",
"son", "sek", "NP", "NNP", "NPJ",
"NPD", "minna", "R", "H", "others")
HR <- HR %>%
mutate(wl = factor(wl, levels = 0:2, labels = c("lost", "won", "zombie")),
status = factor(status, levels = 1:3,
labels = c("challenger", "incumbent", "ex-member")),
party = factor(party, labels = party_names))
HR09 <- HR %>%
filter(year == 2009)
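The following base R sketch shows what factor() does with the levels and labels arguments on a short, made-up vector of codes. The names wl_codes and wl_fac are hypothetical, used only for this illustration.

```r
## Hypothetical vector of result codes: 0 = lost, 1 = won, 2 = zombie
wl_codes <- c(0, 1, 2, 1, 0)

## levels gives the coded values in order; labels gives their names
wl_fac <- factor(wl_codes, levels = 0:2,
                 labels = c("lost", "won", "zombie"))
print(wl_fac)

## tables now show the labels instead of the raw codes
print(table(wl_fac))

## internally the factor stores positions 1, 2, 3, not the codes 0, 1, 2
print(as.integer(wl_fac))
```

Spelling out levels explicitly keeps the pairing of codes and labels correct even when some code does not appear in the data.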
Now, create the table again.
(tbl_st_wl <- with(HR09, table(status, wl)))
## wl
## status lost won zombie
## challenger 559 81 43
## incumbent 168 174 52
## ex-member 15 45 2
This table might be all you need to understand the data. However, you cannot use it as is in your research paper or presentation (or even a homework assignment!). You have to transform it into another format so that it looks nice. You might think that you can easily make a nice table with Apple Numbers, MS Excel, or LibreOffice Calc. However, don’t try to create a table by typing the numbers manually into MS Excel or other spreadsheet software, because (1) you might make typos and (2) it is tedious (that is, a waste of time). R provides functions to create better-looking tables.
You can produce LaTeX tables in R with the xtable::xtable() function. I will not explain how to use LaTeX on this page, but I strongly recommend that you use LaTeX for your writing. If you would like to write research papers using quantitative methods, you should say good-bye to MS Word (or other similar word processing programs) right now. It will eventually be easier to write a paper with LaTeX than with MS Word, because your papers will have a lot of figures, tables, and mathematical formulae. In addition, since a LaTeX file (with the file extension “.tex”) is just a text file, writing with LaTeX makes your research easier to reproduce. Furthermore, LaTeX is free! For more information about LaTeX, visit this website or read this document (or google!). (However, it is possible that we will write all papers with only R Markdown in the near future. Even so, we will probably have to use LaTeX-style math formulae.)
To get a LaTeX table, pass your table object to xtable(). (You can also pass a matrix or the result of a regression. Run methods(xtable) for more information.)
xtbl_st_wl <- xtable(tbl_st_wl, align = "lccc",
caption = "Candidate's Status and Election Result",
label = "tbl:status-result")
print(xtbl_st_wl)
## % latex table generated in R 3.2.2 by xtable 1.7-4 package
## % Thu Oct 22 18:46:33 2015
## \begin{table}[ht]
## \centering
## \begin{tabular}{lccc}
## \hline
## & lost & won & zombie \\
## \hline
## challenger & 559 & 81 & 43 \\
## incumbent & 168 & 174 & 52 \\
## ex-member & 15 & 45 & 2 \\
## \hline
## \end{tabular}
## \caption{Candidate's Status and Election Result}
## \label{tbl:status-result}
## \end{table}
You can copy and paste this output, but you might want to save it in a tex file.
print(xtbl_st_wl, file = "tbl-st-wl.tex")
You can edit the saved tex file if you’d like to modify the table.
You can also display the table in HTML format by setting type = "html".
print(xtbl_st_wl, type = "html")
| | lost | won | zombie |
|---|---|---|---|
| challenger | 559 | 81 | 43 |
| incumbent | 168 | 174 | 52 |
| ex-member | 15 | 45 | 2 |
Use the HTML format when you put tables in an R Markdown document, and set the chunk option results='asis' for that chunk.
Please read The xtable gallery by Jonathan Swinton for more information.
To make tables that report regression results, you might want to use the stargazer::stargazer() function (bad naming, but a useful package). As an example, let’s fit some (crappy) linear models and make a table.
fit_1 <- lm(voteshare ~ I(exp / 1000000), data = HR09)
fit_2 <- lm(voteshare ~ previous, data = HR09)
fit_3 <- lm(voteshare ~ I(exp / 1000000) + previous, data = HR09)
label_explanatory <- c("expenditure (million yen)", "previous wins", "constant")
stargazer(fit_1, fit_2, fit_3,
digits = 2, digits.extra = 0, align = TRUE,
star.cutoffs = NA, omit.table.layout = "n", ## this line is important!
keep.stat = c("n", "adj.rsq", "f"), df = FALSE,
covariate.labels = label_explanatory,
dep.var.caption = "Outcome variable",
dep.var.labels = "Vote share (%)",
title = "Results of Linear Regressions",
label = "tbl:reg-res",
out = "stargazer-reg-res.tex")
##
## % Table created by stargazer v.5.2 by Marek Hlavac, Harvard University. E-mail: hlavac at fas.harvard.edu
## % Date and time: Thu, Oct 22, 2015 - 18:46:33
## % Requires LaTeX packages: dcolumn
## \begin{table}[!htbp] \centering
## \caption{Results of Linear Regressions}
## \label{tbl:reg-res}
## \begin{tabular}{@{\extracolsep{5pt}}lD{.}{.}{-2} D{.}{.}{-2} D{.}{.}{-2} }
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## & \multicolumn{3}{c}{Outcome variable} \\
## \cline{2-4}
## \\[-1.8ex] & \multicolumn{3}{c}{Vote share (%)} \\
## \\[-1.8ex] & \multicolumn{1}{c}{(1)} & \multicolumn{1}{c}{(2)} & \multicolumn{1}{c}{(3)}\\
## \hline \\[-1.8ex]
## expenditure (million yen) & 3.07 & & 2.14 \\
## & (0.10) & & (0.11) \\
## & & & \\
## previous wins & & 5.41 & 2.85 \\
## & & (0.20) & (0.22) \\
## & & & \\
## constant & 7.74 & 16.89 & 8.42 \\
## & (0.76) & (0.61) & (0.71) \\
## & & & \\
## \hline \\[-1.8ex]
## Observations & \multicolumn{1}{c}{1,124} & \multicolumn{1}{c}{1,139} & \multicolumn{1}{c}{1,124} \\
## Adjusted R$^{2}$ & \multicolumn{1}{c}{0.48} & \multicolumn{1}{c}{0.40} & \multicolumn{1}{c}{0.55} \\
## F Statistic & \multicolumn{1}{c}{1,028.24} & \multicolumn{1}{c}{770.86} & \multicolumn{1}{c}{676.97} \\
## \hline
## \hline \\[-1.8ex]
## \end{tabular}
## \end{table}
The LaTeX table is printed on screen and also saved to the file “stargazer-reg-res.tex” because we set the out argument.
When you use stargazer(), please do not forget to set star.cutoffs = NA and omit.table.layout = "n" in order to suppress the annoying significance stars. After all, we are not stargazers!
To make an HTML table, set type = "html" (the default type is "latex"). (Don’t forget to set the chunk option results='asis' in R Markdown.)
stargazer(fit_1, fit_2, fit_3,
digits = 2, digits.extra = 0, align = TRUE,
star.cutoffs = NA, omit.table.layout = "n", ## this line is important!
keep.stat = c("n", "adj.rsq", "f"), df = FALSE,
covariate.labels = label_explanatory,
dep.var.caption = "Outcome variable",
dep.var.labels = "Vote share (%)",
title = "Results of Linear Regressions",
type = "html")
Outcome variable: Vote share (%)

| | (1) | (2) | (3) |
|---|---|---|---|
| expenditure (million yen) | 3.07 | | 2.14 |
| | (0.10) | | (0.11) |
| previous wins | | 5.41 | 2.85 |
| | | (0.20) | (0.22) |
| constant | 7.74 | 16.89 | 8.42 |
| | (0.76) | (0.61) | (0.71) |
| Observations | 1,124 | 1,139 | 1,124 |
| Adjusted R2 | 0.48 | 0.40 | 0.55 |
| F Statistic | 1,028.24 | 770.86 | 676.97 |
See A Stargazer Cheatsheet for more information.
If you do not use LaTeX (you should!), you can save the table in a CSV file, open it with MS Excel or LibreOffice Calc, and edit it in a spreadsheet. To save the table in a CSV file, use the write.csv() function.
write.csv(tbl_st_wl, file = "tbl-st-wl.csv")
Now you should have the file “tbl-st-wl.csv” in your working directory. Edit the file to change the appearance of the table.
You should create figures for your research papers and presentations with the ggplot2 package in R. It enables you to make beautiful figures. Needless to say, the content (the information delivered by the figure) is more important than its appearance. However, if the amount of information is the same, a beautiful figure is better than an ugly one. In addition, it is usually easier to read beautiful figures than to decipher ugly ones, which means that beautiful figures tend to deliver more information, or more accurate information.
ggplot2 was developed by Hadley Wickham. You can learn how to use ggplot2 commands here. For more information, see Wickham. 2009. ggplot2: Elegant Graphics for Data Analysis (Springer) (you can download the book from Springer’s website via the university’s network; please note that some commands in the book are obsolete). See also Hrishi V. Mittal. 2011. R Graphs Cookbook (Packt) and Winston Chang. 2013. R Graphics Cookbook (O’Reilly) (石井弓美子 訳. 2013.『Rグラフィックスクックブック』オライリー・ジャパン). Today, you will learn to create basic figures with ggplot2. You will learn more about ggplot2 later this semester.
The function you use to make figures with ggplot2 is ggplot2::ggplot(). (You can also use ggplot2::qplot() for some basic figures, but I will not explain it because (1) the arguments of ggplot() and qplot() differ slightly, and (2) you will eventually have to use ggplot() for more advanced graphics.)
The first argument you have to pass to ggplot() is the data frame (data set). Thus, to use ggplot2, you have to store the variables in a data frame.
The second argument is aes (short for aesthetics). You specify the variable used in the figure by aes. For example, to draw a two-dimensional figure, you might want to specify a variable for the horizontal axis \(x\) and one for the vertical \(y\). Furthermore, by setting the color and size of points, you can add more dimensionality to the figure printed on the plane.
Next, you add a graphical layer to the object created by ggplot(). You use different layers for different types of figures. For instance, to make a scatter plot, you use geom_point(). To create a histogram, you use geom_histogram(). As these examples show, the layers you add have names beginning with geom (short for geometry).
Lastly, you add some other elements such as axis labels, title of the figure, or legend.
As an illustration, let’s make a scatter plot of the election expenditure (exp) versus the number of previous wins (previous). First, we’ll create a ggplot object with ggplot(). Then, we’ll add the scatter plot layer with geom_point().
## 1. Create a ggplot object
scatter1 <- ggplot(HR09, aes(x = previous, y = exp))
## 2. Add the layer of the scatter plot
scatter1 <- scatter1 + geom_point()
## 3. Add axis labels and the title
scatter1 <- scatter1 + labs(x = "previous wins", y = "money spent (yen)",
title = "Scatter Plot")
## 4. Print the figure on screen
print(scatter1)
We’ve got a scatter plot. Steps 1 through 3 above can be done at once as follows.
scatter1 <- ggplot(HR09, aes(x = previous, y = exp)) +
geom_point() +
labs(x = "previous wins", y = "money spent (yen)",
title = "Scatter Plot")
You can also display a ggplot figure without print(): typing scatter1 at the console has the same effect as print(scatter1). However, it is safer to always call print() explicitly, because R does not print objects automatically in every context (for example, inside a for loop).
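The reason explicit print() is safer can be seen with a small base R sketch. It uses plain numbers rather than figures, and the names auto and explicit are hypothetical. Inside a for loop, a bare expression is evaluated but not displayed, whereas a print() call always produces output. The same rule applies to ggplot objects.

```r
## capture.output() collects whatever would appear on screen

## a bare expression inside a loop is evaluated but never shown
auto <- capture.output(for (i in 1:2) i)

## an explicit print() inside the loop always produces output
explicit <- capture.output(for (i in 1:2) print(i))

print(length(auto))      # number of lines shown without print()
print(length(explicit))  # number of lines shown with print()
```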
Next, let’s make a histogram of the election expenditure. Since a histogram normally displays a single variable, we give only x to the aes argument.
histogram1 <- ggplot(HR09, aes(x = exp)) +
geom_histogram() +
xlab("Money Spent (yen)") +
ggtitle("Histogram - Frequency")
print(histogram1)
The vertical axis of this histogram is the count (frequency). You can change it to the density as follows.
histogram2 <- ggplot(HR09, aes(x = exp)) +
geom_histogram(aes(y = ..density..)) +
xlab("Money Spent (yen)") +
ggtitle("Histogram - Density")
print(histogram2)
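As shown, you need to set y = ..density.. in aes. The two dots before and after “density” indicate that density is not a variable in the data frame but a quantity computed by ggplot2 itself. (Recent versions of ggplot2 write the same thing as y = after_stat(density) and deprecate the double-dot notation.)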
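To see what “density” means here, the sketch below builds a density histogram from simulated data and checks that the bar areas add up to one. It assumes ggplot2 3.3.0 or later, where after_stat() is available. The data frame toy and the choice of 20 bins are hypothetical, and layer_data() is used only to inspect the values ggplot2 computed.

```r
library("ggplot2")

## Hypothetical simulated data, not the election data
set.seed(1)
toy <- data.frame(x = rnorm(200))

## density histogram; after_stat(density) plays the role of ..density..
p <- ggplot(toy, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20)

## layer_data() returns the values computed for the first layer
bars <- layer_data(p, 1)

## area of each bar = height (density) x width; the areas sum to 1
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
print(total_area)
```

On the count scale the bar heights sum to the number of observations instead, which is why the two histograms have the same shape but different vertical axes.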
Now, let’s make box plots of the expenditure by the number of previous wins.
box1 <- ggplot(HR09, aes(x = as.factor(previous), y = exp)) + geom_boxplot()
box1 <- box1 + labs(x = "Previous Wins", y = "Money Spent (yen)", title = "Box Plot")
print(box1)
Note that we use the as.factor() function when we pass the variable previous to aes. It tells ggplot2 that previous is a categorical variable (and hence, we can group the expenditure by it).
Next, let’s draw a scatter plot of the vote share versus the expenditure and superimpose the regression line on it with geom_smooth().
scatter2 <- ggplot(HR09, aes(x = exp, y = voteshare)) + geom_point()
scatter2 <- scatter2 + labs(x = "Money Spent (yen)", y = "Vote Share")
scatter3 <- scatter2 + geom_smooth(method = "lm")
print(scatter3)
By default, geom_smooth() shows the 95% confidence interval around the fitted line. To change the confidence level, specify level. For instance, you can show the 50% confidence interval.
scatter4 <- scatter2 + geom_smooth(method = "lm", level = .5)
print(scatter4)
If you would like to suppress the confidence interval, set se = FALSE.
scatter5 <- scatter2 + geom_smooth(method = "lm", se = FALSE)
print(scatter5)
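To see how the level argument changes the band, the base R sketch below computes confidence intervals for a fitted regression line by hand with lm() and predict(). It rests on the assumption that this mirrors what geom_smooth(method = "lm") draws, which is a rough description and not a statement about its exact internals. The simulated data and all object names are hypothetical.

```r
## Hypothetical simulated data, not the election data
set.seed(123)
toy <- data.frame(x = runif(50, 0, 10))
toy$y <- 2 + 3 * toy$x + rnorm(50)

fit <- lm(y ~ x, data = toy)
new_x <- data.frame(x = c(2, 5, 8))

## confidence intervals for the fitted line at two confidence levels
ci95 <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
ci50 <- predict(fit, newdata = new_x, interval = "confidence", level = 0.50)
print(cbind(new_x, ci95))

## widths of the bands at each x: the 50% band is narrower
width95 <- ci95[, "upr"] - ci95[, "lwr"]
width50 <- ci50[, "upr"] - ci50[, "lwr"]
print(rbind(width95, width50))
```

Both intervals are centered on the same fitted values; only the width changes with the level.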
This section is only for users who need Japanese text in their figures.
Windows users might want to skip this section too.
Making Japanese text displayable in ordinary (base R) figures is not enough to use Japanese in ggplot2 (on a Mac). To see this, let’s make two simple figures:
plot(exp ~ previous, data = HR09,
main = "日本語表示のテスト: ggplot を使わない場合",
ylab = "選挙費用(円)", xlab = "当選回数")
test_jpn <- ggplot(HR09, aes(x = previous, y = exp)) + geom_point()
test_jpn <- test_jpn +
labs(x = "当選回数", y = "選挙費用(円)", title = "日本語表示のテスト: ggplot")
print(test_jpn)
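As the figures show, Japanese text is not displayed in the figure made with ggplot.
To use Japanese in ggplot2, you need to choose a font that can display Japanese in the ggplot theme. To use Hiragino Kaku Gothic on a Mac, set the theme as follows (of course, you may choose a different font if you prefer). Note, however, that this may not work on El Capitan; the settings here are for Yosemite. On Windows, you should not need to do anything special (probably).
Set the font to 12-point Hiragino Kaku Gothic ProN W3.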
theme_set(theme_gray(base_size = 12, base_family = "HiraKakuProN-W3"))
If this does not work, try the following.
quartzFonts(HiraKaku = quartzFont(rep("Hiragino Kaku Gothic Pro W3", 4)))
theme_set(theme_gray(base_size = 12, base_family = "HiraKaku"))
Let’s display the previous figure again.
print(test_jpn)
Now the Japanese text is displayed.
You can save a ggplot figure with the ggplot2::ggsave() function. By default it saves the last figure you printed on screen; you can also name the figure explicitly with the plot argument. For example,
ggsave(filename = "scatter-eg1.png", plot = scatter1, width = 6, height = 4.5)
Now you should have an image file (PNG) named “scatter-eg1.png” in your working directory. You should always specify the height and width of the figure so that it suits your purpose. A figure with width = 6 and height = 4.5 (in inches, the default unit) is about right to fill half a page.
If possible, you should save figures as PDF instead of PNG because PDF figures look better: their lines do not look jagged even when you zoom in. You can include PDF figures in LaTeX.
On a Mac, you can save the figure with the quartz() function by setting type = "pdf".
quartz(file = "scatter-eg1.pdf", family = "sans", type = "pdf",
width = 6, height = 4.5)
print(scatter1)
dev.off()
You open the graphics device with quartz(), draw the figure on the device with print(), and close the device with dev.off(). Don’t forget to close the device once you have saved the figure.
On Windows, (I believe) you can save the figure with ggsave() by giving a file name that ends with “.pdf”.
ggsave(file = "scatter-eg1a.pdf", plot = scatter1)
If this does not work, use pdf().
pdf(file = "scatter-eg1b.pdf", width = 6, height = 4.5)
print(scatter1)
dev.off()
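The open, draw, close pattern is the same for every graphics device. The base R sketch below uses a base plot() call and a temporary file so that it runs without ggplot2 or the election data. The file name device-demo.pdf and the object out_file are hypothetical.

```r
## write to a temporary folder so that nothing is left in the project
out_file <- file.path(tempdir(), "device-demo.pdf")

## 1. open the PDF device (width and height are in inches)
pdf(file = out_file, width = 6, height = 4.5)

## 2. draw the figure; with ggplot2 this would be print(your_plot)
plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared", main = "Device demo")

## 3. close the device; the file is only complete after this step
dev.off()

print(file.exists(out_file))
```

If you forget dev.off(), later figures keep going to the file and not to the screen, and the PDF may be unreadable.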
To vary the color (or size) of the points by group, pass a grouping variable to the color (or size) argument in aes. ggplot2 automatically generates the legend.
For simplicity, let’s make a dummy variable indicating that a candidate belongs to the Democratic Party of Japan (DPJ).
HR09 <- HR09 %>%
mutate(dpj = as.numeric(party == "DPJ"),
dpj = factor(dpj, labels = c("others", "DPJ")))
Using this dummy variable, we’ll make a scatter plot of the vote share versus the expenditure for DPJ candidates and others.
sp_vs_ex <- ggplot(HR09, aes(x = exp, y = voteshare)) +
labs(x = "expenditure (yen)", y = "vote share (%)")
sp_vs_ex1 <- sp_vs_ex + geom_point(aes(color = dpj))
print(sp_vs_ex1)
To modify the color legend for the categorical variable, use scale_color_discrete(). Furthermore, we can reverse the order of the legend entries with guides().
dpj_legend <- scale_color_discrete(name = "party", labels = c("other", "DPJ"))
sp_vs_ex2 <- sp_vs_ex1 +
dpj_legend +
guides(color = guide_legend(reverse = TRUE))
print(sp_vs_ex2)
Let’s vary the size of the points by the number of previous wins. Because we treat previous as a continuous variable, we can modify its legend with scale_size_continuous().
sp_vs_ex3 <- sp_vs_ex +
geom_point(aes(color = dpj, size = previous)) +
scale_size_continuous(name = "previous wins") +
dpj_legend +
guides(color = guide_legend(reverse = TRUE))
print(sp_vs_ex3)
To arrange figures by group, you use faceting. ggplot2 has two faceting functions. You use facet_grid() to specify row and column factors, and facet_wrap() to make groups by a single factor variable.
First, let me explain how to use facet_grid(). For example, let’s make histograms grouped by the DPJ dummy. The general form is facet_grid(row_factor ~ column_factor). Thus, to put the histograms for different groups in different columns, we specify the column factor only:
hist_lab <- labs(x = "expenditure (yen)", y = "count")
hist_exp_dpj <- ggplot(HR09, aes(x = exp)) + geom_histogram()
hist_exp_dpj_1 <- hist_exp_dpj +
facet_grid(. ~ dpj) +
hist_lab
print(hist_exp_dpj_1)
Alternatively, to put histograms for different groups in different rows, we specify the row factor only:
hist_exp_dpj_2 <- hist_exp_dpj +
facet_grid(dpj ~ .) +
hist_lab
print(hist_exp_dpj_2)
To change the names of the groups, we use the labeller argument in facet_grid() as follows.
dpj_labeller <- function(var, value) {
value <- as.character(value)
if (var == "dpj") {
value[value == "others"] <- "Not Dem."
value[value == "DPJ"] <- "Dem."
}
return(value)
}
hist_exp_dpj3 <- hist_exp_dpj +
facet_grid(. ~ dpj, labeller = dpj_labeller) +
hist_lab
print(hist_exp_dpj3)
As you might notice, it is a little tricky to change group labels with labeller. It is easier to change the factor labels of the variables (see the next example) before creating a figure.
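A note for readers on newer software: since ggplot2 2.0.0 the labeller interface has changed, and a hand-written function taking (var, value) may be rejected or trigger a warning. A minimal sketch of the newer style follows, assuming a recent ggplot2. It passes a named character vector to labeller() as a lookup table. The data frame toy and the vector dpj_names are hypothetical stand-ins for HR09 and its dpj variable.

```r
library("ggplot2")

## Hypothetical toy data standing in for HR09
set.seed(2009)
toy <- data.frame(
  exp = c(rnorm(50, mean = 5), rnorm(50, mean = 8)),
  dpj = factor(rep(c("others", "DPJ"), each = 50),
               levels = c("others", "DPJ"))
)

## lookup table: names are the factor levels, values are the panel labels
dpj_names <- c(others = "Not Dem.", DPJ = "Dem.")

hist_toy <- ggplot(toy, aes(x = exp)) +
  geom_histogram(bins = 15) +
  facet_grid(. ~ dpj, labeller = labeller(dpj = dpj_names)) +
  labs(x = "expenditure (hypothetical units)", y = "count")
print(hist_toy)
```

The lookup table changes only the panel labels; the factor levels in the data frame stay as they are.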
You can group the figure by two factors. For instance,
## change the labels of two variables
HR09 <- HR09 %>%
mutate(dpj = factor(dpj, labels = c("Not Dem.", "Dem.")),
status = factor(status, labels = c("new", "incumbent", "ex")))
## call ggplot() again because we modified the data frame to use
hist_exp_dpj_st <- ggplot(HR09, aes(x = exp)) +
geom_histogram() +
facet_grid(dpj ~ status) +
hist_lab
print(hist_exp_dpj_st)
To let the axis ranges vary across groups, set scales = "free" (scales = "free_x" frees only the horizontal range, and scales = "free_y" only the vertical).
hist_free_scale <- ggplot(HR09, aes(x = exp)) +
geom_histogram() +
facet_grid(dpj ~ status, scales = "free") +
hist_lab
print(hist_free_scale)
Next, I will explain how to use facet_wrap(). This function is useful when you group figures by a single variable that has more than a few groups. In that case, you want the panels laid out in a matrix even though there is only one grouping variable. For instance, let’s make histograms of the electoral expenditure grouped by the number of previous wins.
hist_wins <- ggplot(HR09, aes(x = exp)) +
geom_histogram() +
facet_wrap( ~ previous) +
hist_lab
print(hist_wins)
Let’s use a cross table of the candidate’s status and the election result again.
(tbl_st_wl2 <- with(HR09, table(wl, status)))
## status
## wl new incumbent ex
## lost 559 168 15
## won 81 174 45
## zombie 43 52 2
To make the information easier to grasp visually, let’s make a mosaic plot. First, we have to create a data frame so that ggplot2 can make the figure. Then, we reshape the data frame with the reshape2::melt() function.
## total num. of candidates by status
tbl_st <- with(HR09, table(status))
## turn the table into a matrix
tbl_st_wl2 <- as.matrix(tbl_st_wl2[1:3, 1:3])
tbl_st <- as.matrix(tbl_st[1:3])
## create a data frame
## make variables of the share of each status and
## of the share of each result for each status
df <- data.frame(status = levels(HR09$status),
status_pct = 100 * tbl_st / sum(tbl_st),
win = 100 * tbl_st_wl2[2,] / tbl_st,
zombie = 100 * tbl_st_wl2[3,] / tbl_st,
lose = 100 * tbl_st_wl2[1,] / tbl_st)
## calculate the boundaries of categories on the horizontal axis
df <- df %>%
mutate(xmax = cumsum(status_pct),
xmin = xmax - status_pct)
## remove the variable status_pct because we won't use it
df$status_pct <- NULL
## reshape the data frame
dfm <- melt(df, id = c("status", "xmin", "xmax"))
##################################################
## COMPARE dfm with df
## to understand what melt() has accomplished
###################################################
## calculate the boundaries of categories on the vertical axis
dfm_1 <- dfm %>%
group_by(status) %>%
mutate(ymax = cumsum(value),
ymin = ymax - value)
## determine the locations to put texts
dfm_1 <- dfm_1 %>%
mutate(xtext = xmin + (xmax - xmin) / 2,
ytext = ymin + (ymax - ymin) / 2)
## make a mosaic plot
mosaic <- ggplot(dfm_1, aes(ymin = ymin, ymax = ymax,
xmin = xmin, xmax = xmax,
fill = variable))
mosaic <- mosaic +
geom_rect(color = I("grey")) +
geom_text(aes(x = xtext, y = ytext, label = paste0(round(value), "%"))) +
geom_text(aes(x = xtext, y = 103, label = status)) +
labs(x = "", y = "") +
scale_fill_discrete(name = "", labels = c("won", "zombie", "lost")) +
guides(fill = guide_legend(reverse = TRUE))
print(mosaic)
This is a mosaic plot! It is surely cumbersome to make a mosaic plot, but it is superior to the corresponding table because readers can intuitively understand the information presented by the figure. The fact that it occupies a relatively large space is a drawback.
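If the reshaping step is hard to follow, the sketch below applies melt() to a tiny made-up data frame, assuming reshape2 is installed. The objects wide and long and the numbers in them are hypothetical. Each non-id column becomes a block of rows, with the old column name stored in variable and the cell content in value.

```r
library("reshape2")

## Hypothetical wide data: one row per status, one column per outcome
wide <- data.frame(status = c("new", "incumbent"),
                   win    = c(12, 44),
                   lose   = c(88, 56))
print(wide)

## melt() stacks the non-id columns into (variable, value) pairs
long <- melt(wide, id.vars = "status")
print(long)
```

The long format is what lets ggplot2 map the outcome (variable) to the fill color and its share (value) to the height of each rectangle.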
We use a simple regression model as an example.
fit_1 <- lm(voteshare ~ I(exp / (10^6)) + previous, data = HR09)
tbl_fit_1 <- xtable(fit_1)
print(tbl_fit_1, type = 'html')
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 8.4151 | 0.7071 | 11.90 | 0.0000 |
| I(exp/(10^6)) | 2.1408 | 0.1143 | 18.74 | 0.0000 |
| previous | 2.8486 | 0.2182 | 13.06 | 0.0000 |
Let’s transform this table into a caterpillar plot.
## Create a data frame:
## The plot will have the names of explanatory variables,
## the estimated coefficients, and 95% CI
fit_1_df <- data.frame(variable = c("expenditure(million yen)", "previous wins"),
mean = coef(fit_1)[-1],
lower = confint(fit_1)[-1, 1],
upper = confint(fit_1)[-1, 2])
row.names(fit_1_df) <- NULL
## making a caterpillar plot
ctplr <- ggplot(fit_1_df, aes(x = reorder(variable, lower),
y = mean,
ymin = lower,
ymax = upper)) +
geom_pointrange(size = 1.4) +
geom_hline(yintercept = 0, linetype = "dotted") +
labs(x = "Explanatory Variables", y = "Estimates") +
ggtitle("Outcome variable: vote share") +
coord_flip()
print(ctplr)
This is a caterpillar plot. We will make more complicated caterpillar plots when we study regression models.
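The data frame behind a caterpillar plot can be built the same way for any lm() fit. The base R sketch below uses simulated data, and the names toy, fit, and coef_df are hypothetical. It collects the point estimates and their 95% confidence limits, leaving out the intercept as above.

```r
## Hypothetical simulated data, not the election data
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 1 + 2 * toy$x1 - 0.5 * toy$x2 + rnorm(100)

fit <- lm(y ~ x1 + x2, data = toy)
ci <- confint(fit, level = 0.95)  # one row per coefficient: lower, upper

## drop the intercept (first row) and gather everything in one data frame
coef_df <- data.frame(variable = names(coef(fit))[-1],
                      mean  = coef(fit)[-1],
                      lower = ci[-1, 1],
                      upper = ci[-1, 2],
                      row.names = NULL)
print(coef_df)
```

A data frame of this shape can be passed to ggplot() with geom_pointrange() exactly as in the example above.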