Preparation

Load the packages we will use below. Install them first if necessary.

library("ggplot2")
library("arm")
library("maxLik")
library("pROC")
library("haven")


Numerical Methods: Finding the MLE by Algorithms

Bisection Method

Let’s find the MLE \(\hat{\theta}\) by the bisection method.

Suppose we have a log-likelihood function \[l(\theta) = \log L(\theta).\]

  1. Make a graph of the log-likelihood function and find \(\theta_L\) and \(\theta_U\) that satisfy \(\theta_L < \hat{\theta} < \theta_U\). That is, specify an interval within which the MLE resides by eyeballing.
  2. Find the middle point between \(\theta_L\) and \(\theta_U\): \(\theta_M = (\theta_L + \theta_U)/2\).
  3. If \(l'(\theta_M) > 0\), replace \(\theta_L\) with \(\theta_M\). If \(l'(\theta_M) < 0\), replace \(\theta_U\) with \(\theta_M\). Stop if \(l'(\theta_M) = 0\).
  4. Repeat Steps 2 and 3 until \(|l(\theta_M) - \max(l(\theta_L), l(\theta_U))| < \epsilon\).

The MLE is \(\theta_M\) when we stop the procedure.

  • Strength: Convergence is guaranteed for a continuous log-likelihood function if we set \(\theta_L\) and \(\theta_U\) appropriately in Step 1 (i.e., sure convergence).
  • Weakness: Slow convergence (quantified in the note below).
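
To see why convergence is slow, note that each pass through Steps 2 and 3 halves the interval known to contain \(\hat{\theta}\). After \(k\) iterations its width is \((\theta_U - \theta_L) / 2^k\), so shrinking it below a tolerance \(\delta\) takes roughly \[k \approx \log_2 \frac{\theta_U - \theta_L}{\delta}\] iterations; that is, the method gains only about one binary digit of accuracy per step. (The implementation below stops on a log-likelihood criterion instead, but the interval shrinks at the same rate.)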

For illustration, let’s consider the likelihood function \[L(\theta) = \theta^{789} (1-\theta)^{211}.\] The log-likelihood function is \[l(\theta) = 789 \log \theta + 211 \log (1-\theta).\] We can easily obtain the MLE \(\hat{\theta} = 0.789\) by solving the score equation \(S(\theta) = l'(\theta) = 789/\theta - 211/(1-\theta) = 0\). Let’s verify the result by the bisection method.
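
Before implementing the bisection, we can check the analytical answer numerically. The following is a minimal sketch using base R’s uniroot() on the score function, assuming the interval (0.01, 0.99) brackets the root:

## Score function S(theta) = 789 / theta - 211 / (1 - theta)
score <- function(theta) 789 / theta - 211 / (1 - theta)
## Its unique root on (0, 1) is the MLE; the result should be close to 0.789
uniroot(score, interval = c(0.01, 0.99))$root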

First, we define the log-likelihood function in R.

log_lik_0 <- function(theta) {
    l <- 789 * log(theta) + 211 * log(1 - theta)
    return(l)
}

Next, we define the functions that implement the bisection method.

## function to calculate the first differential coefficient
df <- function(f, x) {
    ## Args: f = function to be differentiated
    ##       x = location where f is differentiated
    ## Return: d = differential coefficient
    h <- 0.0001
    d <- (f(x + h) - f(x)) / h
    return(d)
}
## Bisection Method
mle_bisection <- function(theta_l, theta_u, ll, criterion = 0.1^10){
    ## Function to get MLE by bisection
    ## Args: theta_l = lower bound of MLE
    ##       theta_u = upper bound of MLE
    ##       ll = log-likelihood function
    ##       criterion = convergence criterion (the smaller, the more accurate)
    ## Return: theta_hat = MLE
    
    counter <- 1  ## iteration counter
    while (TRUE) {
        theta_m <- (theta_l + theta_u) / 2
        l_lower <- ll(theta_l)
        l_upper <- ll(theta_u)
        l_med <- ll(theta_m)
        first <- df(ll, theta_m)
        
        if (first == 0) {
            l_max <- l_med
            break   
        } else if (first > 0) {
            theta_l <- theta_m
            epsilon <- l_med - l_upper
        } else {
            theta_u <- theta_m
            epsilon <- l_med - l_lower
        }
        
        ## Display the maximum of log-likelihood found so far
        l_max <- max(c(l_lower, l_med, l_upper))
        cat("Iteration", counter, ": log likelihood =", l_max, "\n")
        
        ## break the loop if converged
        if (abs(epsilon) < criterion) break
        counter <- counter + 1
    }
    
    ## Find the MLE and return it
    if (l_lower == l_max) theta_hat <- theta_l
    else if (l_med == l_max) theta_hat <- theta_m
    else theta_hat <- theta_u
    return(theta_hat)
}
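
Before using mle_bisection(), it is worth a quick sanity check of df() (an added illustration, not part of the original procedure): the analytic score at \(\theta = 0.5\) is \(789 / 0.5 - 211 / 0.5 = 1156\), and the numerical derivative should be close to that value.

df(log_lik_0, 0.5)  ## should be approximately 1156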

Let’s use mle_bisection() to find the MLE. To guess the MLE, we first draw a graph of the log-likelihood.

curve(log_lik_0, from = 0, to = 1, lwd = 1.5, col = "darkblue",
      xlab = expression(theta), ylab = expression(l(theta)),
      main = "Log-likelihood Function: log_lik_0(theta)")

Looking at the figure, we can roughly guess \(\hat{\theta}\): \(0.1 < \hat{\theta} < 0.9\). Starting with these initial values, the bisection method gives us the MLE as follows.

mle_bisection(theta_l = .1, theta_u = .9, ll = log_lik_0)
## Iteration 1 : log likelihood = -568.9749 
## Iteration 2 : log likelihood = -535.4548 
## Iteration 3 : log likelihood = -515.6517 
## Iteration 4 : log likelihood = -515.6517 
## Iteration 5 : log likelihood = -515.6517 
## Iteration 6 : log likelihood = -515.2853 
## Iteration 7 : log likelihood = -515.2853 
## Iteration 8 : log likelihood = -515.2853 
## Iteration 9 : log likelihood = -515.2786 
## Iteration 10 : log likelihood = -515.2786 
## Iteration 11 : log likelihood = -515.2786 
## Iteration 12 : log likelihood = -515.2786 
## Iteration 13 : log likelihood = -515.2786 
## Iteration 14 : log likelihood = -515.2786 
## Iteration 15 : log likelihood = -515.2786 
## Iteration 16 : log likelihood = -515.2786 
## Iteration 17 : log likelihood = -515.2786 
## Iteration 18 : log likelihood = -515.2786 
## Iteration 19 : log likelihood = -515.2786 
## Iteration 20 : log likelihood = -515.2786 
## Iteration 21 : log likelihood = -515.2786 
## Iteration 22 : log likelihood = -515.2786 
## Iteration 23 : log likelihood = -515.2786 
## Iteration 24 : log likelihood = -515.2786 
## Iteration 25 : log likelihood = -515.2786 
## Iteration 26 : log likelihood = -515.2786 
## Iteration 27 : log likelihood = -515.2786 
## Iteration 28 : log likelihood = -515.2786 
## Iteration 29 : log likelihood = -515.2786 
## Iteration 30 : log likelihood = -515.2786 
## Iteration 31 : log likelihood = -515.2786 
## Iteration 32 : log likelihood = -515.2786
## [1] 0.78895

As shown, after 32 iterations we obtained the MLE \(\hat{\theta} \approx 0.789\), which confirms the result obtained analytically.

What happens if we have a better guess, \(0.75 < \hat{\theta} < 0.8\)?

mle_bisection(theta_l = .75, theta_u = .8, ll = log_lik_0)
## Iteration 1 : log likelihood = -515.6517 
## Iteration 2 : log likelihood = -515.2853 
## Iteration 3 : log likelihood = -515.2853 
## Iteration 4 : log likelihood = -515.2853 
## Iteration 5 : log likelihood = -515.2786 
## Iteration 6 : log likelihood = -515.2786 
## Iteration 7 : log likelihood = -515.2786 
## Iteration 8 : log likelihood = -515.2786 
## Iteration 9 : log likelihood = -515.2786 
## Iteration 10 : log likelihood = -515.2786 
## Iteration 11 : log likelihood = -515.2786 
## Iteration 12 : log likelihood = -515.2786 
## Iteration 13 : log likelihood = -515.2786 
## Iteration 14 : log likelihood = -515.2786 
## Iteration 15 : log likelihood = -515.2786 
## Iteration 16 : log likelihood = -515.2786 
## Iteration 17 : log likelihood = -515.2786 
## Iteration 18 : log likelihood = -515.2786 
## Iteration 19 : log likelihood = -515.2786 
## Iteration 20 : log likelihood = -515.2786 
## Iteration 21 : log likelihood = -515.2786 
## Iteration 22 : log likelihood = -515.2786 
## Iteration 23 : log likelihood = -515.2786 
## Iteration 24 : log likelihood = -515.2786 
## Iteration 25 : log likelihood = -515.2786 
## Iteration 26 : log likelihood = -515.2786 
## Iteration 27 : log likelihood = -515.2786 
## Iteration 28 : log likelihood = -515.2786
## [1] 0.78895

We got the same result with fewer iterations.

Newton-Raphson Method

Let’s find the MLE \(\hat{\theta}\) by the Newton-Raphson method.

The Newton-Raphson algorithm can be described as follows.

  1. Draw a graph of the log-likelihood function and guess \(\hat{\theta}\).
  2. Let \(\theta_0\) denote your initial guess of the MLE.
  3. Set \(\bar{\theta} \leftarrow \theta_0\).
  4. Update \(\theta_0\) as follows (the derivation appears after this list): \[\theta_0 \leftarrow \bar{\theta} - \frac{l'(\bar{\theta})}{l''(\bar{\theta})}\]
  5. Repeat Steps 3 and 4 until \(|\theta_0 - \bar{\theta}| < \delta\).

The MLE is \(\theta_0\) when we stop the procedure.

  • Strength: Fast convergence.
  • Weakness: It does not necessarily find the maximum.
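
The update rule in Step 4 follows from applying Newton’s method to the score equation \(l'(\theta) = 0\). A first-order Taylor expansion of \(l'\) around \(\bar{\theta}\) gives \[0 = l'(\theta) \approx l'(\bar{\theta}) + l''(\bar{\theta}) (\theta - \bar{\theta}) \quad \Rightarrow \quad \theta \approx \bar{\theta} - \frac{l'(\bar{\theta})}{l''(\bar{\theta})}.\] Because the approximation is only local, a poor starting value can lead the algorithm to a stationary point that is not the maximum, which is why convergence to the MLE is not guaranteed.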

For illustration, we again consider the log-likelihood function defined above. We define the functions that implement the Newton-Raphson method as follows.

## Function to calculate the second differential coefficient
ddf <- function(f, x) {
    ## Args: f = function to be differentiated
    ##       x = location where f is differentiated
    ## Return: second differential coefficient of f at x
    ## The negative step, combined with the forward difference in df(),
    ## yields a central second difference around x
    h <- -0.0001
    return((df(f, x + h) - df(f, x)) / h)
}
## Newton-Raphson Method
newton_mle <- function(ll, theta_0, delta = 0.1^12) {
    ## Function to get the MLE by the Newton-Raphson method
    ## Args: ll = log-likelihood function
    ##       theta_0 = initial guess of the MLE
    ##       delta = convergence criterion (the smaller, the more accurate)
    ## Return: theta_0 = MLE
    theta_bar <- theta_0
    counter <- 1
    while (TRUE) {
        first <- df(ll, theta_bar)
        second <- ddf(ll, theta_bar)
        theta_0 <- theta_bar - first / second

        ## Display the maximum found so far
        l_max <- ll(theta_0)
        cat("Iteration", counter, ": log likelihood =", l_max, "\n")

        ## break the loop if converged
        if (abs(theta_0 - theta_bar) < delta) break

        theta_bar <- theta_0
        counter <- counter + 1
    }
    return(theta_0)
}
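
As with df(), we can sanity-check ddf() (again an added illustration): the analytic second derivative is \(l''(\theta) = -789/\theta^2 - 211/(1-\theta)^2\), which equals \(-4000\) at \(\theta = 0.5\).

ddf(log_lik_0, 0.5)  ## should be approximately -4000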

Let’s use newton_mle() to find the MLE with the initial guess \(\theta_0 = 0.5\).

newton_mle(log_lik_0, theta_0 = .5)
## Iteration 1 : log likelihood = -515.2786 
## Iteration 2 : log likelihood = -515.2786 
## Iteration 3 : log likelihood = -515.2786
## [1] 0.78895

We got the same result as above with only 3 iterations.

Estimate Multiple Parameters

We can find the maximum in a similar manner when there is more than one parameter. However, writing functions that solve multi-parameter problems requires relatively advanced programming skills, so let’s use an existing function to find maxima of multi-parameter likelihood functions. To obtain the MLE, we can use the maxLik::maxLik() function. The arguments we need to specify are (1) the log-likelihood function, (2) the initial values of the estimates, and (3) the method used to find the maximum. The log-likelihood function must have the parameter vector as its first argument. For the initial values, we pass a vector whose length is equal to the number of parameters. To use the Newton-Raphson method, we set method = "NR".

Let’s use this function.

maxLik(log_lik_0, start = .5, method = "NR")
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 2 iterations
## Return code 1: gradient close to zero
## Log-Likelihood: -515.2786 (1 free parameter(s))
## Estimate(s): 0.789

As this result shows, even if we start with a sloppy guess of \(\theta =0.5\), we can get the MLE promptly (only 2 iterations!).

Logistic Regression

Example 1: Predicting Electoral Victory by Previous Winnings

For illustration, we use the fake data used in Chapter 14 of Asano and Yanai (2013): ml_wl.dta.

Let’s load the data set.

myd <- read_dta("http://www2.kobe-u.ac.jp/~yyanai/jp/quant-methods-stata/data/ml_wl.dta")

Let’s make a scatterplot of “wlsmd” (a binary variable indicating electoral victory in an SMD) versus “previous” (the number of wins in previous elections).

p1 <- ggplot(myd, aes(previous, wlsmd)) + geom_point()
print(p1)

Using these data, let’s run a logistic regression with wlsmd as the response and previous as the predictor. Here, we do not use glm(). Instead, we define the likelihood function and find its maximum ourselves. First, we define the log-likelihood function (see the lecture slide). So that we can find the maximum with maxLik() later, the parameter vector is the function’s first argument.
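
For reference (the slide itself is not reproduced here), the function coded below is the standard Bernoulli log-likelihood for logistic regression, \[l(\beta_1, \beta_2) = \sum_{i=1}^{n} \left\{ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right\}, \qquad p_i = \frac{1}{1 + \exp(-\beta_1 - \beta_2 x_i)},\] where \(y_i\) is wlsmd and \(x_i\) is previous.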

log_lik_1 <- function(beta, y = myd$wlsmd, x = myd$previous) {
    prob <- 1 / (1 + exp(-beta[1] - beta[2] * x))
    ll_vec <- y * log(prob) + (1 - y) * log(1 - prob)
    ll <- sum(ll_vec)
    return(ll)
}

Let’s find the MLE of this log-likelihood function. Because we estimate two parameters, we give a 2-element vector to start.

fit_01 <- maxLik(log_lik_1, start = c(0, 1), method = "NR")
summary(fit_01)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 5 iterations
## Return code 1: gradient close to zero
## Log-Likelihood: -9.404832 
## 2  free parameters
## Estimates:
##      Estimate Std. error t value Pr(> t)
## [1,]  -0.9095     0.9643  -0.943   0.346
## [2,]   0.3621     0.2785   1.300   0.194
## --------------------------------------------

We got the MLE \(\hat{\beta}_1 \approx -0.91\) and \(\hat{\beta}_2 \approx 0.36\).

The AIC, \(-2\, l(\hat{\beta}) + 2k\) with \(k = 2\) parameters, is

-2 * summary(fit_01)$loglik + 2 * 2
## [1] 22.80966

Let’s verify the result using glm().

fit_1 <- glm(wlsmd ~ previous, data = myd, family = binomial(link = "logit"))
summary(fit_1)
## 
## Call:
## glm(formula = wlsmd ~ previous, family = binomial(link = "logit"), 
##     data = myd)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5760  -1.0276   0.5995   0.9587   1.5798  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.9095     0.9636  -0.944    0.345
## previous      0.3621     0.2784   1.301    0.193
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20.728  on 14  degrees of freedom
## Residual deviance: 18.810  on 13  degrees of freedom
## AIC: 22.81
## 
## Number of Fisher Scoring iterations: 4

This result confirms the previous result.

Next, we calculate the accuracy of our prediction. To do so, we first calculate the predicted values (predicted probabilities).

predicted_1 <- predict(fit_1, type = "response")

Using these predicted probabilities, we calculate the proportion of correct predictions under the rule that candidates with a predicted probability of 0.5 or higher win the SMD.

err_rate_1 <- mean((predicted_1 >= 0.5 & myd$wlsmd == 0) |
                   (predicted_1 < 0.5 & myd$wlsmd == 1))
1 - err_rate_1
## [1] 0.6666667

Thus, our prediction is correct for 67% of the data. Since a completely random guess would be correct about 50 percent of the time (and if you “predicted” that every candidate wins, 53% of your predictions would be correct), this correct answer rate is not good enough.
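
That 53% baseline is simply the observed share of winners in the data, which we can check directly (an added one-liner):

mean(myd$wlsmd)  ## 8 of the 15 candidates won an SMD, about 0.53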

Lastly, let’s draw an ROC curve. We can draw an ROC curve by pROC::roc().

roc(myd$wlsmd ~ predicted_1, plot = TRUE)

## 
## Call:
## roc.formula(formula = myd$wlsmd ~ predicted_1, plot = TRUE)
## 
## Data: predicted_1 in 7 controls (myd$wlsmd 0) < 8 cases (myd$wlsmd 1).
## Area under the curve: 0.6964

The ROC curve reveals that our model does not help us predict the outcome well. The area under the curve (AUC) is about 0.70, which is not much larger than 0.5.

Example 2: Predicting Electoral Victory by Electoral Expense

Let’s make a scatter plot of “wlsmd” versus “expm” (electoral expenditure in million yen).

p2 <- ggplot(myd, aes(expm, wlsmd)) + geom_point()
print(p2)

We run a logistic regression with wlsmd as the response and expm as the predictor. As in the previous example, we find the MLE without glm(). First, we define the log-likelihood function.

log_lik_2 <- function(beta, y = myd$wlsmd, x = myd$expm) {
    return(log_lik_1(beta, y, x))
}

Let’s find the MLE of this log-likelihood function.

fit_02 <- maxLik(log_lik_2, start = c(0, .5), method = "NR")
summary(fit_02)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 6 iterations
## Return code 1: gradient close to zero
## Log-Likelihood: -6.925697 
## 2  free parameters
## Estimates:
##      Estimate Std. error t value Pr(> t)
## [1,]  -3.1811     1.7185  -1.851  0.0642
## [2,]   0.7046     0.3793   1.857  0.0633
## --------------------------------------------

As a result, we got \(\hat{\beta}_1 \approx -3.18\) and \(\hat{\beta}_2 \approx 0.70\).

The AIC is

-2 * summary(fit_02)$loglik + 2 * 2
## [1] 17.85139

Let’s verify the result with glm().

fit_2 <- glm(wlsmd ~ expm, data = myd, family = binomial(link = "logit"))
summary(fit_2)
## 
## Call:
## glm(formula = wlsmd ~ expm, family = binomial(link = "logit"), 
##     data = myd)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1138  -0.8983   0.2037   0.5074   1.9642  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -3.1811     1.7219  -1.847   0.0647
## expm          0.7046     0.3800   1.854   0.0637
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20.728  on 14  degrees of freedom
## Residual deviance: 13.851  on 13  degrees of freedom
## AIC: 17.851
## 
## Number of Fisher Scoring iterations: 5

The same result has been obtained.

Next, we calculate the accuracy of our prediction. The predicted probabilities are obtained as follows.

predicted_2 <- predict(fit_2, type = "response")

Now let’s calculate the proportion of correct predictions under the rule that candidates with a predicted winning probability \(\geq 0.5\) win the SMD.

err_rate_2 <- mean((predicted_2 >= 0.5 & myd$wlsmd == 0) |
                   (predicted_2 < 0.5 & myd$wlsmd == 1))
1 - err_rate_2
## [1] 0.8666667

The accuracy of the prediction is 87 percent.

Lastly, let’s draw the ROC curve.

roc(myd$wlsmd ~ predicted_2, plot = TRUE)

## 
## Call:
## roc.formula(formula = myd$wlsmd ~ predicted_2, plot = TRUE)
## 
## Data: predicted_2 in 7 controls (myd$wlsmd 0) < 8 cases (myd$wlsmd 1).
## Area under the curve: 0.7946

The AUC is 0.79, which implies that this model predicts the outcome better than the previous model.


