Load the packages we will use below. Install them first if necessary.
library("ggplot2")
library("arm")
library("maxLik")
library("pROC")
library("haven")
Let’s find the MLE \(\hat{\theta}\) using the bisection method.
Suppose we have a log-likelihood function \[l(\theta) = \log L(\theta).\]
The bisection method proceeds as follows. Choose a lower bound \(\theta_L\) and an upper bound \(\theta_U\) that bracket the MLE, and take the midpoint \(\theta_M = (\theta_L + \theta_U)/2\). If the derivative \(l'(\theta_M)\) is positive, the maximum lies above \(\theta_M\), so we replace \(\theta_L\) with \(\theta_M\); if it is negative, we replace \(\theta_U\) with \(\theta_M\). We repeat this until the change in the log-likelihood becomes negligibly small. The MLE is \(\theta_M\) when we stop the procedure.
For illustration, let’s consider the likelihood function \[L(\theta) = \theta^{789} (1-\theta)^{211}.\] The log-likelihood function is \[l(\theta) = 789 \log \theta + 211 \log (1-\theta).\] We can easily obtain the MLE \(\hat{\theta} = 0.789\) by solving the score equation \(S(\theta) = l'(\theta) = 0\). Let’s verify this result using the bisection method.
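For reference, solving the score equation explicitly gives \[S(\theta) = \frac{789}{\theta} - \frac{211}{1-\theta} = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{789}{789 + 211} = 0.789.\]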
First, we define the log-likelihood function in R.
log_lik_0 <- function(theta) {
  l <- 789 * log(theta) + 211 * log(1 - theta)
  return(l)
}
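As a quick sanity check, we can evaluate the function once before using it; at \(\theta = 0.5\) the value is \(1000 \log 0.5 \approx -693.15\).
log_lik_0(0.5)  ## 789 * log(0.5) + 211 * log(0.5) = 1000 * log(0.5), about -693.15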
Next, we define the functions that implement the bisection method.
## Function to calculate the first differential coefficient
df <- function(f, x) {
  ## Args: f = function to be differentiated
  ##       x = location where f is differentiated
  ## Return: d = differential coefficient
  h <- 0.0001
  d <- (f(x + h) - f(x)) / h
  return(d)
}
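As a side note, the forward difference above is a simple approximation; a central difference is usually slightly more accurate. Here is a minimal sketch (the name df_central is illustrative and is not used below):
## Central-difference approximation of the first derivative
df_central <- function(f, x, h = 0.0001) {
  (f(x + h) - f(x - h)) / (2 * h)
}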
## Bisection Method
mle_bisection <- function(theta_l, theta_u, ll, criterion = 0.1^10) {
  ## Function to get the MLE by bisection
  ## Args: theta_l = lower bound of MLE
  ##       theta_u = upper bound of MLE
  ##       ll = log-likelihood function
  ##       criterion = convergence criterion (the smaller, the more accurate)
  ## Return: theta_hat = MLE
  counter <- 1  ## iteration counter
  while (TRUE) {
    theta_m <- (theta_l + theta_u) / 2
    l_lower <- ll(theta_l)
    l_upper <- ll(theta_u)
    l_med <- ll(theta_m)
    first <- df(ll, theta_m)
    if (first == 0) {
      l_max <- l_med
      break
    } else if (first > 0) {
      theta_l <- theta_m
      epsilon <- l_med - l_upper
    } else {
      theta_u <- theta_m
      epsilon <- l_med - l_lower
    }
    ## Display the maximum of the log-likelihood found so far
    l_max <- max(c(l_lower, l_med, l_upper))
    cat("Iteration", counter, ": log likelihood =", l_max, "\n")
    ## Break the loop if converged
    if (abs(epsilon) < criterion) break
    counter <- counter + 1
  }
  ## Find the MLE and return it
  if (l_lower == l_max) theta_hat <- theta_l
  else if (l_med == l_max) theta_hat <- theta_m
  else theta_hat <- theta_u
  return(theta_hat)
}
Let’s use this function to find the MLE. To choose reasonable starting bounds, we first draw a graph of the log-likelihood.
curve(log_lik_0, from = 0, to = 1, lwd = 1.5, col = "darkblue",
      xlab = expression(theta), ylab = expression(l(theta)),
      main = "Log-likelihood Function: log_lik_0(theta)")
Looking at the figure, we can roughly bound \(\hat{\theta}\): \(0.1 < \hat{\theta} < 0.9\). Starting with these bounds, the bisection method gives us the MLE as follows.
mle_bisection(theta_l = .1, theta_u = .9, ll = log_lik_0)
## Iteration 1 : log likelihood = -568.9749
## Iteration 2 : log likelihood = -535.4548
## Iteration 3 : log likelihood = -515.6517
## Iteration 4 : log likelihood = -515.6517
## Iteration 5 : log likelihood = -515.6517
## Iteration 6 : log likelihood = -515.2853
## Iteration 7 : log likelihood = -515.2853
## Iteration 8 : log likelihood = -515.2853
## Iteration 9 : log likelihood = -515.2786
## Iteration 10 : log likelihood = -515.2786
## Iteration 11 : log likelihood = -515.2786
## Iteration 12 : log likelihood = -515.2786
## Iteration 13 : log likelihood = -515.2786
## Iteration 14 : log likelihood = -515.2786
## Iteration 15 : log likelihood = -515.2786
## Iteration 16 : log likelihood = -515.2786
## Iteration 17 : log likelihood = -515.2786
## Iteration 18 : log likelihood = -515.2786
## Iteration 19 : log likelihood = -515.2786
## Iteration 20 : log likelihood = -515.2786
## Iteration 21 : log likelihood = -515.2786
## Iteration 22 : log likelihood = -515.2786
## Iteration 23 : log likelihood = -515.2786
## Iteration 24 : log likelihood = -515.2786
## Iteration 25 : log likelihood = -515.2786
## Iteration 26 : log likelihood = -515.2786
## Iteration 27 : log likelihood = -515.2786
## Iteration 28 : log likelihood = -515.2786
## Iteration 29 : log likelihood = -515.2786
## Iteration 30 : log likelihood = -515.2786
## Iteration 31 : log likelihood = -515.2786
## Iteration 32 : log likelihood = -515.2786
## [1] 0.78895
As shown, after 32 iterations we obtained the MLE \(\hat{\theta} = 0.789\), which confirms the result obtained analytically.
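As a further cross-check (a minimal sketch using base R’s optimize(), which is not part of the procedure above), we could maximize the log-likelihood directly over the unit interval:
## Maximize log_lik_0 over (0, 1) with base R's optimize()
optimize(log_lik_0, interval = c(0, 1), maximum = TRUE)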
What happens if we have a better guess, \(0.75 < \hat{\theta} < 0.8\)?
mle_bisection(theta_l = .75, theta_u = .8, ll = log_lik_0)
## Iteration 1 : log likelihood = -515.6517
## Iteration 2 : log likelihood = -515.2853
## Iteration 3 : log likelihood = -515.2853
## Iteration 4 : log likelihood = -515.2853
## Iteration 5 : log likelihood = -515.2786
## Iteration 6 : log likelihood = -515.2786
## Iteration 7 : log likelihood = -515.2786
## Iteration 8 : log likelihood = -515.2786
## Iteration 9 : log likelihood = -515.2786
## Iteration 10 : log likelihood = -515.2786
## Iteration 11 : log likelihood = -515.2786
## Iteration 12 : log likelihood = -515.2786
## Iteration 13 : log likelihood = -515.2786
## Iteration 14 : log likelihood = -515.2786
## Iteration 15 : log likelihood = -515.2786
## Iteration 16 : log likelihood = -515.2786
## Iteration 17 : log likelihood = -515.2786
## Iteration 18 : log likelihood = -515.2786
## Iteration 19 : log likelihood = -515.2786
## Iteration 20 : log likelihood = -515.2786
## Iteration 21 : log likelihood = -515.2786
## Iteration 22 : log likelihood = -515.2786
## Iteration 23 : log likelihood = -515.2786
## Iteration 24 : log likelihood = -515.2786
## Iteration 25 : log likelihood = -515.2786
## Iteration 26 : log likelihood = -515.2786
## Iteration 27 : log likelihood = -515.2786
## Iteration 28 : log likelihood = -515.2786
## [1] 0.78895
We got the same result with fewer iterations.
Let’s find the MLE \(\hat{\theta}\) using the Newton-Raphson method.
The Newton-Raphson algorithm can be described as follows. Starting from an initial value \(\theta_0\), we repeatedly update the estimate by \[\theta_{t+1} = \theta_t - \frac{l'(\theta_t)}{l''(\theta_t)}\] until the change \(|\theta_{t+1} - \theta_t|\) becomes negligibly small. The MLE is \(\theta\) when we stop the procedure.
For illustration, we again consider the log-likelihood function defined above. We define the function that implements the Newton-Raphson method as follows.
## Function to calculate the second differential coefficient
ddf <- function(f, x) {
  h <- -0.0001
  return((df(f, x + h) - df(f, x)) / h)
}
newton_mle <- function(ll, theta_0, delta = 0.1^12) {
  ## Function to get the MLE by the Newton-Raphson method
  ## Args: ll = log-likelihood function
  ##       theta_0 = initial value
  ##       delta = convergence criterion
  theta_bar <- theta_0
  counter <- 1
  while (TRUE) {
    first <- df(ll, theta_bar)
    second <- ddf(ll, theta_bar)
    theta_0 <- theta_bar - first / second
    ## Display the maximum found so far
    l_max <- ll(theta_0)
    cat("Iteration", counter, ": log likelihood =", l_max, "\n")
    ## Break the loop if converged
    if (abs(theta_0 - theta_bar) < delta) break
    theta_bar <- theta_0
    counter <- counter + 1
  }
  return(theta_0)
}
Let’s use this function to find the MLE with the initial guess \(\theta_0 = 0.5\).
newton_mle(log_lik_0, theta_0 = .5)
## Iteration 1 : log likelihood = -515.2786
## Iteration 2 : log likelihood = -515.2786
## Iteration 3 : log likelihood = -515.2786
## [1] 0.78895
We got the same result as above with only 3 iterations.
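For comparison, the same update can be run with the analytic score and Hessian of this log-likelihood (a minimal sketch; the names score_0 and hess_0 are illustrative):
## Newton-Raphson with the analytic first and second derivatives
score_0 <- function(theta) 789 / theta - 211 / (1 - theta)
hess_0 <- function(theta) -789 / theta^2 - 211 / (1 - theta)^2
theta <- 0.5  ## initial value
repeat {
  step <- score_0(theta) / hess_0(theta)
  theta <- theta - step
  if (abs(step) < 1e-12) break  ## stop when the update is negligible
}
theta  ## should be close to 0.789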
We can find the maximum when there is more than one parameter in a similar manner. However, writing functions that solve multi-parameter problems requires relatively advanced programming skills, so let’s use an existing function to find maxima of multi-parameter likelihood functions. To obtain the MLE, we can use the maxLik::maxLik() function. The arguments we need to specify are (1) the log-likelihood function, (2) the initial values of the estimates, and (3) the method used to find the maximum. The log-likelihood function must have the parameter vector as its first argument. As initial values, we pass a vector whose length equals the number of parameters. To use the Newton-Raphson method, we set method = "NR".
Let’s use this function.
maxLik(log_lik_0, start = .5, method = "NR")
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 2 iterations
## Return code 1: gradient close to zero
## Log-Likelihood: -515.2786 (1 free parameter(s))
## Estimate(s): 0.789
As this result shows, even if we start with a sloppy guess of \(\theta =0.5\), we can get the MLE promptly (only 2 iterations!).
For illustration, we use the fake data set used in Chapter 14 of Asano and Yanai (2013): ml_wl.dta.
Let’s load the data set.
myd <- read_dta("http://www2.kobe-u.ac.jp/~yyanai/jp/quant-methods-stata/data/ml_wl.dta")
Let’s make a scatterplot of “wlsmd” (a binary variable indicating victory in the SMD) versus “previous” (the number of wins in previous elections).
p1 <- ggplot(myd, aes(previous, wlsmd)) + geom_point()
print(p1)
Using this data set, let’s run a logistic regression with wlsmd as the response and previous as the predictor. Here, we do not use glm(). Instead, we define the likelihood function ourselves and find its maximum. First, we define the log-likelihood function (see the lecture slides). To find the maximum with maxLik() later, the parameter vector is the first argument.
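For reference, the quantity the function computes is \[l(\beta_1, \beta_2) = \sum_{i=1}^{n} \left\{ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right\}, \qquad p_i = \frac{1}{1 + \exp\{-(\beta_1 + \beta_2 x_i)\}}.\]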
log_lik_1 <- function(beta, y = myd$wlsmd, x = myd$previous) {
  prob <- 1 / (1 + exp(-beta[1] - beta[2] * x))      ## predicted probabilities
  ll_vec <- y * log(prob) + (1 - y) * log(1 - prob)  ## individual contributions
  ll <- sum(ll_vec)
  return(ll)
}
Let’s find the MLE of this log-likelihood function. Because we estimate two parameters, we pass a 2-element vector to start.
fit_01 <- maxLik(log_lik_1, start = c(0, 1), method = "NR")
summary(fit_01)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 5 iterations
## Return code 1: gradient close to zero
## Log-Likelihood: -9.404832
## 2 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## [1,] -0.9095 0.9643 -0.943 0.346
## [2,] 0.3621 0.2785 1.300 0.194
## --------------------------------------------
We got the MLE \(\hat{\beta}_1 \approx -0.91\) and \(\hat{\beta}_2 \approx 0.36\).
The AIC is \(-2\,l(\hat{\beta}) + 2k\), where \(k\) is the number of estimated parameters (here \(k = 2\)):
-2 * summary(fit_01)$loglik + 2 * 2
## [1] 22.80966
Let’s verify the result using glm().
fit_1 <- glm(wlsmd ~ previous, data = myd, family = binomial(link = "logit"))
summary(fit_1)
##
## Call:
## glm(formula = wlsmd ~ previous, family = binomial(link = "logit"),
## data = myd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5760 -1.0276 0.5995 0.9587 1.5798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9095 0.9636 -0.944 0.345
## previous 0.3621 0.2784 1.301 0.193
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20.728 on 14 degrees of freedom
## Residual deviance: 18.810 on 13 degrees of freedom
## AIC: 22.81
##
## Number of Fisher Scoring iterations: 4
This result confirms the previous result.
Next, we calculate the accuracy of our prediction. To do so, we first calculate the predicted values (predicted probabilities).
predicted_1 <- predict(fit_1, type = "response")
Using these predicted probabilities, we calculate the proportion of correct predictions, assuming that candidates with a predicted probability of 0.5 or higher win the SMD.
err_rate_1 <- mean((predicted_1 >= 0.5 & myd$wlsmd == 0) |
(predicted_1 < 0.5 & myd$wlsmd == 1))
1 - err_rate_1
## [1] 0.6666667
Thus, our prediction is correct for 67% of the observations. Since a completely random guess would be correct about 50 percent of the time (and if you simply “predict” that every candidate wins, 53% of your predictions are correct), this rate is not good enough.
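To see where the predictions go wrong, a cross-tabulation of predicted versus observed outcomes can help (a minimal sketch using base R’s table(); the cutoff of 0.5 is the same as above):
## Cross-tabulate predictions (0.5 cutoff) against observed outcomes
table(predicted = as.integer(predicted_1 >= 0.5), observed = myd$wlsmd)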
Lastly, let’s draw an ROC curve. We can draw an ROC curve with pROC::roc().
roc(myd$wlsmd ~ predicted_1, plot = TRUE)
##
## Call:
## roc.formula(formula = myd$wlsmd ~ predicted_1, plot = TRUE)
##
## Data: predicted_1 in 7 controls (myd$wlsmd 0) < 8 cases (myd$wlsmd 1).
## Area under the curve: 0.6964
The ROC curve reveals that our model does not predict the outcome well. The area under the curve (AUC) is about 0.70, not much larger than 0.5.
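If only the AUC is needed, it can also be extracted from a stored roc object (a minimal sketch; roc_1 is an illustrative name):
## Store the ROC object and extract the AUC
roc_1 <- roc(myd$wlsmd, predicted_1)
auc(roc_1)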
Let’s make a scatter plot of “wlsmd” versus “expm” (electoral expenditure in million yen).
p2 <- ggplot(myd, aes(expm, wlsmd)) + geom_point()
print(p2)
We run a logistic regression with wlsmd as the response and expm as the predictor. As in the previous example, we find the MLE without glm(). First, we define the log-likelihood function.
log_lik_2 <- function(beta, y = myd$wlsmd, x = myd$expm) {
  return(log_lik_1(beta, y, x))  ## reuse log_lik_1 with expm as the predictor
}
Let’s find the MLE of this log-likelihood function.
fit_02 <- maxLik(log_lik_2, start = c(0, .5), method = "NR")
summary(fit_02)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 6 iterations
## Return code 1: gradient close to zero
## Log-Likelihood: -6.925697
## 2 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## [1,] -3.1811 1.7185 -1.851 0.0642
## [2,] 0.7046 0.3793 1.857 0.0633
## --------------------------------------------
As a result, we got \(\hat{\beta}_1 \approx -3.18\) and \(\hat{\beta}_2 \approx 0.70\).
AIC is
-2 * summary(fit_02)$loglik + 2 * 2
## [1] 17.85139
Let’s verify the result with glm().
fit_2 <- glm(wlsmd ~ expm, data = myd, family = binomial(link = "logit"))
summary(fit_2)
##
## Call:
## glm(formula = wlsmd ~ expm, family = binomial(link = "logit"),
## data = myd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1138 -0.8983 0.2037 0.5074 1.9642
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1811 1.7219 -1.847 0.0647
## expm 0.7046 0.3800 1.854 0.0637
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20.728 on 14 degrees of freedom
## Residual deviance: 13.851 on 13 degrees of freedom
## AIC: 17.851
##
## Number of Fisher Scoring iterations: 5
The same result has been obtained.
Next, we calculate the accuracy of our prediction. The predicted probabilities are:
predicted_2 <- predict(fit_2, type = "response")
Now let’s calculate the proportion of correct predictions, assuming that candidates with a predicted winning probability \(\geq 0.5\) win the SMD.
err_rate_2 <- mean((predicted_2 >= 0.5 & myd$wlsmd == 0) |
(predicted_2 < 0.5 & myd$wlsmd == 1))
1 - err_rate_2
## [1] 0.8666667
The accuracy of the prediction is 87 percent.
Lastly, let’s draw the ROC curve.
roc(myd$wlsmd ~ predicted_2, plot = TRUE)
##
## Call:
## roc.formula(formula = myd$wlsmd ~ predicted_2, plot = TRUE)
##
## Data: predicted_2 in 7 controls (myd$wlsmd 0) < 8 cases (myd$wlsmd 1).
## Area under the curve: 0.7946
The AUC is 0.79, which implies that this model predicts the outcome better than the previous one.
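As a final side-by-side check (a minimal sketch; base R’s AIC() accepts several fitted models at once), we could also compare the two glm() fits directly:
## Compare the two fitted models by AIC (smaller is better)
AIC(fit_1, fit_2)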