Research Methods in Political Science I

1. The Factorial

R has a built-in function factorial(), but we will build our own factorial function get_factorial(). The following codes are just samples. You do not have to write the exactly same codes.

get_factorial <- function(x) {  ## function to calculate a factorial
    ## Argument: x = vector of non-negative integers
    ## Return: res = the factorials of x
    
    res <- rep(NA, length(x))  ## vector to save the factorials
    for (i in seq_along(x)) {  ## loop for x's
        if (x[i] < 0 | x[i] != round(x[i])) { ## NaN if x[i] is negative or decimal
             res[i] <- NaN
        } else if (x[i] == 0){ ## factorial of 0 equals 1 
            res[i] <- 1
        } else { ## for a non-negative integer, calculate
            res[i] <- prod(1:x[i])
        }
    }
    return(res)
}

This function treats three conditions differently. The conditions are as follows:

A negative or decimal number is given,
0 is given, and
a non-negative integer is given.

In Case 1, the function simply gives NaN (Not a Number) because the factorial of a negative or decimal number is not defined. (The gamma function is an extension of the factorial. While the factorial is usually only used for non-negative integers, the gamma is for non-negative real numbers. Since the built-in factorial() uses gamma() for its calculation, factorial() can handle decimal numbers, which is not necessarily desirable.)

In Case 2, we know the anwer, and we do not have to calculate anything. Thus, it simply returns the answer 1. When the function reaches return(), it ends evaluating the code. The rest of the code (after return()) is not evaluated.

Then, in Case 3, we need to calculate the factorial. Finally, the function returns the the vector “res”, which contains the factorials of \(x\).

Let’s try using the function.

get_factorial(0)

## [1] 1

get_factorial(5)

## [1] 120

get_factorial(c(0, 5))

## [1]   1 120

get_factorial(c(-2, 0, 5, 6.2))

## [1] NaN   1 120 NaN

2. The Number of Possible Combinations

R has a built-in function choose(), but we will create our own function get_choose(). Since it is tedious to define the function from scratch, let’s use get_factorial() function to define get_choose().

get_choose <- function(n, k) { ## combinations for choosing k from n
    ## Only works for non-negative n
    ## Arguments: n = vector of the total number of elements
    ##            k = vector of the number of elements to choose from n
    ## Return: res = vector of possible combinations
    ## NOTE: n and k must have the same number of elements
    
    if (length(n) != length(k)) {
        stop(message = "lengths of n and k must be the same")
    }
    
    res <- rep(NA, length(n)) ## vector to save the results
    
    for (i in seq_along(n)) {
        if (n[i] != round(n[i])) { ## stops if n is decimal
            stop(message = paste0("n[", i, "] (" , n[i], ") must be integer"))
        } else if (k[i] != round(k[i])) { ## stops if k is decimal
            stop(message = paste0("k[", i, "] (" , k[i], ") must be integer"))
        } else if (n[i] < 0){ ## stops if n is negative
            stop(message = paste0("n[", i, "] (" , n[i], ") must be non-negative"))
        } else if (n[i] < k[i] | k[i] <0){ ## 0 prob. if n < k or if k < 0
            res[i] <- 0
        } else { ## calculate the combinations for the other
            res[i] <- get_factorial(n[i]) / 
                (get_factorial(k[i]) * get_factorial(n[i] - k[i]))
        }
    }
    return(res)
}

First, this function checks if the lengths of the vectors n and k are the same. If not, it prints an error message. As you can see in the codes, we use error() to handle anomalies. We can tell the user what is wrong by priting a message defined in message.

Then, it handles the following five cases separately. Please note that these five cases are not mutually exclusive. If one condition were met, the other cases would not be evaluated. The conditions are examined in the following order.

n is decimal
k is decimal
n is negative
choose more than n or choose a negative
otherwise

In Cases 1 through 3, the function prints an error message, but three cases print different one. For Cases 1 and 2, the error asks the user to use an integer. For Case 3, it asks for a non-negative number.

In Case 4, we know that there is no possible combination, and we get 0. We have to calculate the number of possible combinations only in Case 5 using our own get_factorial().

As this example shows, we can use a function(s) to define another function. Thinking bacward from this fact, we do not have to build a large function in a lump. Instead, we can build a lot of small functions and combine them to make a large one. It is usually desirable to separate large functions into small chunks (1) because it is easier to maintain or revise small functions than a large one, and (2) because a small function can be used for a lot of different functions (or different purposes).

Now, let’s try using the function we created. You should also try to get an error.

get_choose(10, 3)

## [1] 120

get_choose(3, 8)

## [1] 0

get_choose(c(10, 3), c(3, 8))

## [1] 120   0

3. PMF of the Binomial Distribution

R has a built-in function dbinom(), but we would like to define our own function pmf_binom().

pmf_binom <- function(x, n, prob) { ## PMF of binomial dist.
    ## Args: x = a random variable vector
    ##       n = num. of Bernoulli trials
    ##       prob = success rate of each Bernoulli trial, [0, 1]
    ## Return: mass = a vector of PMF f(x)
    
    mass <- rep(NA, length(x))  ## vector to save prob. mass for each x  
    
    for (i in seq_along(x)) { ## calculate f(x) for each x
        if (prob < 0 | prob > 1){  ## stops if prob isn't in [0, 1]
            stop(message = paste0("prob (", prob, ") must be in [0, 1]"))
        } else if (n <　0) {  ## stops if n is negative
            stop(message = paste0("n (", n, ") must be non-negative"))
        } else if (n != round(n)) { ## stops if n is a decimal number
            stop(message = paste0("n (", n, ") must be integer"))
        } else if (x[i] != round(x[i])) { ## decimal x[i] has 0 prob.
            mass[i] <- 0
        } else {
            mass[i] <- get_choose(n, x[i]) * prob^x[i] * (1 - prob)^(n - x[i])
        }
    }
    return(mass)
}

This function takes a vector of \(x\)’s as an argument and calculates the probability mass for each \(x\) using a for-loop, where it disinguishes four cases.

A non-probability is given to the prob argument
A negative total number
A decimal number is given to n
A decimal number is given to x
Otherwise

In Cases 1 through 3, it prints an appropriate error massage. In Case 4, it saves the value of 0 as the result because there is no possibility to get a decimal number of successes. Only in Case 5, it calculates the probability mass of \(x\).

4. Graphical Displays of the Binomial Distribution

Let’s display graphically the PMF of the binomial distribution using pmf_binom().

Scenario 1: \(p = 0.5\)

Let’s set \(p=0.5\).

p <- .5

We will make figures of a binomial distribtuion for \(n= 2, 5, 10\), and \(50\).

First, let’s create a vector of the values for \(n\).

n_vec <- c(2, 5, 10, 50)

For each \(n\), we’ll make a figure of the PMF. Since we only have to change the value of \(n\) to make different figures, a for-loop works well here.

par(mfrow = c(2,2), oma = c(0, 0, 2, 0))  ## set 2-by-2 plot area
for (i in seq_along(n_vec)){ ## a loop for n
    x_vec <- 0:n_vec[i]   ## possible values of x
    mass_vec <- pmf_binom(x = x_vec, n = n_vec[i], prob = p)
    plot(x_vec, mass_vec, ylim = c(0, max(mass_vec)),
         type = 'h', lwd = 2, col = "royalblue",
         xlab = "x", ylab = "f(x): probability",
         main = paste("n =", n_vec[i]))
}
mtext("Bin(n, p = 0.5)", outer = TRUE, cex = 1.5)

Scenario 2: \(p = 0.8\)

For \(p=0.8\), we only have to change the value of \(p\) and re-run the codes above.

p <- .8
par(mfrow = c(2,2), oma = c(0, 0, 2, 0))  ## set 2-by-2 plot area
for (i in seq_along(n_vec)){ ## a figure for each n
    x_vec <- 0:n_vec[i]   ## possible values of x
    mass_vec <- pmf_binom(x = x_vec, n = n_vec[i], prob = p)
    plot(x_vec, mass_vec, ylim = c(0, max(mass_vec)),
         type = 'h', lwd = 2, col = "tomato",
         xlab = "x", ylab = "f(x): probability",
         main = paste("n =", n_vec[i]))
}
mtext("Bin(n, p = 0.8)", outer = TRUE, cex = 1.5)

Making a Matrix of Figures for \(p \in \{0.1, 0.3, 0.5, 0.9\}\) Using a Loop

Two for-loops are used here. One is for the values of the success rate \(p\); the other is for the number of Bernoulli trials \(n\). The loop for \(p\) is nested in the loop for \(n\). We will save the probability mass in the data frame “binom_res”.

p_vec <- c(.1, .3, .5, .9)  ## use 4 different p's
## determine the dimension of the data frame
n_obs <- sum((n_vec + 1) * length(p_vec))
## make an empty data frame to save the results
binom_res <- data.frame(n = rep(NA, n_obs),
                        p = rep(NA, n_obs),
                        x = rep(NA, n_obs),
                        mass = rep(NA, n_obs))
## index to specify a row of the data frame
index <- 1
for (n in n_vec) {  ## loop for 4 n's defined above
    for (p in p_vec) { ## loop for 4 p's
        x_vec <- 0:n  ## possible values of x
        ## PMF
        mass_vec <- pmf_binom(x = x_vec, n = n, prob = p)
        ## save the results to the data frame
        slicer <- index:(index + n)  ## set locations to save the results
        binom_res$n[slicer] <- n
        binom_res$p[slicer] <- p
        binom_res$x[slicer] <- x_vec
        binom_res$mass[slicer] <- mass_vec
        ## increment the index by n + 1 (num of possible outcomes)
        index <- index + n + 1
    }
}

Let’s examine the content of the data frame.

head(binom_res)

##   n   p x mass
## 1 2 0.1 0 0.81
## 2 2 0.1 1 0.18
## 3 2 0.1 2 0.01
## 4 2 0.3 0 0.49
## 5 2 0.3 1 0.42
## 6 2 0.3 2 0.09

Now we have the probability mass for each combination of \(n\), \(x\), and \(p\).

Lastly, let’s display the result graphically.

library("ggplot2")
binom_plt <- ggplot(binom_res, aes(x = x, y = mass)) +
    geom_bar(stat = "identity") + 
    facet_grid(p ~ n, scale = "free_x", labeller = label_both) + 
    labs(y = "Probability", title = "PMF of Bin(n, p)") 
print(binom_plt)

Go Back