A Basic Model for Count Outcome: Poisson Model

Poisson Distribution

We can model the count outcome \(Y\) using binomial distribution as follows. \[Y \sim \mbox{Binomial}(n, \theta),\] where \(n\) is the number of Bernoulli trials, \(\theta\) is the probability of success for each trial, and \(Y\) is the number of successes (\(0 \leq Y \leq n\)).

Let’s sample \(Y\) with setting \(n = 10\) and \(\theta = 0.3\) in R.

n <- 10
theta <- .3
y1 <- rbinom(10000, size = n, prob = theta)
plt <- ggplot(data_frame(y1), aes(x = y1)) + 
    geom_bar(aes(y = ..count.. / 10000)) +
    scale_x_continuous(breaks = 0:n) +
    labs(x = '# of successes', y = '')
print(plt)

Theoretically, the mean and variance of \(Y\) should be \(n\theta = 3\) and \(n\theta(1-\theta) = 2.1\). Let’s see.

mean(y1)

## [1] 2.9818

var(y1)

## [1] 2.091278

Sometimes, we don’t know the number of trials. In such a case, suppose the number of trials \(n\) is large enough and the maximum value of the outcome \(Y\) is much smaller than \(n\) (or \(\theta\) is small). For instance, let’s assume thta \(n = 100000\) and \(\theta = 0.001\) and sample outcome values.

n <- 100000
theta <- 0.001
y2 <- rbinom(10000, size = n, prob = theta)
plt <- ggplot(data_frame(y2), aes(x = y2)) + 
    geom_bar(aes(y = ..count.. / 10000)) +
    labs(x = '# of successes', y = '')
print(plt)

Let’s calculate the mean and variance of y2.

mean(y2)

## [1] 99.9101

var(y2)

## [1] 98.31085

They are almost same because \(n\theta = 100\) and \(n\theta(1-\theta) = 99.9 \approx 100\). So under this situation, the mean and variance are almost equal.

We can think that this outocme follows a probability distribution that has the mean equal to the variance. The distribution is called Poisson distribution, and we can sample the values by rpois() function.

Let’s sample the outcome from the Poisson distribution with the mean \(\lambda=100\).

y3 <- rpois(10000, lambda = 100)
plt <- ggplot(data_frame(y3), aes(x = y3)) + 
    geom_bar(aes(y = ..count.. / 10000)) +
    labs(x = '# of successes', y = '')
print(plt)

The mean and variance are

mean(y3)

## [1] 100.1576

var(y3)

## [1] 97.99136

Now let’s compare y2, which is sampled from Binomial(10000, 0.01), with y3, which is sampled from Poisson(100).

df <- data_frame(y = c(y2, y3),
                 distribution = rep(c('Bin(10000, 0.001)', 'Poisson(100)'), c(10000, 10000)))
plt <- ggplot(df, aes(x = y, y = ..density.., fill = distribution)) + 
    geom_histogram(alpha = .4, position = 'identity')
print(plt)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Another visualization.

df <- data_frame(y2_Q = quantile(y2, seq(0, 1, length.out = 100)),
                 y3_Q = quantile(y3, seq(0, 1, length.out = 100)))
qqplt <- ggplot(df, aes(x = y2_Q, y = y3_Q)) +
    geom_point(color = 'darkblue') +
    geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
    coord_fixed() +
    labs(x = 'Quantile of y2', y = 'Quantile of y3')    
print(qqplt)

In fact, these two outcomes are so similar and we can guess that they are sampled from similar distributions.

Thus, when we do not know the number of trials (or when there is no upper limit of counts) and the probabiilty of event is small, we would like to model the count outcome using Poisson distribution.

PMF of Poisson Distribution

Poisson distribution can be thought as a distribution of counts. Since the count is non-negative integers, the outcome values sampled from Poisson distribution must be non-negative integers. The PMF of the Poisson distribution with the paramter (mean or variance) \(\lambda\) is \[p(y|\lambda) = \frac{\lambda^y \exp(-\lambda)}{y!} = \exp(y\log(\lambda) - \lambda - \log y!).\] In this formula, \(y\) appears as \(y\) itself inside \(\exp()\), so this is a cannonical form. And the natural parameter is \(\log(\lambda)\), so we can use this as the link function.

Let’s examine some Poisson distributions in R.

N <- 10000
means <- c(1, 5, 10, 20)
y1 <- rpois(N, lambda = means[1])
y2 <- rpois(N, lambda = means[2])
y3 <- rpois(N, lambda = means[3])
y4 <- rpois(N, lambda = means[4])
table(y1)

## y1
##    0    1    2    3    4    5    6 
## 3636 3651 1891  611  169   35    7

table(y2)

## y2
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##   69  322  846 1470 1749 1719 1493 1038  621  353  191   72   40   12    1 
##   15   17 
##    3    1

table(y3)

## y3
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##    1    5   18   64  184  376  640  928 1084 1238 1243 1180  943  763  523 
##   15   16   17   18   19   20   21   22   23   24 
##  344  201  120   69   40   22    8    3    1    2

table(y4)

## y4
##   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22 
##   1   3   3  15  28  46 108 188 244 421 531 673 774 799 887 923 787 732 
##  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  39  40 
## 659 566 470 317 271 215 119 101  51  24  15   9  12   2   4   1   1

df <- data.frame(y = c(y1, y2, y3, y4),
                 lambda = rep(means, rep(N, 4))) %>%
    mutate(lambda = factor(lambda, labels = paste0('lambda=', means)))
plt <- ggplot(df, aes(x = y)) +
    geom_bar() +
    facet_wrap(~ lambda)
print(plt)

Let’s take a closer look at one with \(\lambda = 5\)

plt <- df %>% 
    filter(lambda == 'lambda=5') %>%
    ggplot(aes(y)) + geom_bar()
print(plt)

And the means and variances.

df %>% group_by(lambda) %>%
    summarize(mean = round(mean(y), 2),
              variance = round(var(y), 2))

## # A tibble: 4 × 3
##      lambda  mean variance
##      <fctr> <dbl>    <dbl>
## 1  lambda=1  1.02     1.02
## 2  lambda=5  4.98     4.96
## 3 lambda=10 10.01     9.87
## 4 lambda=20 20.00    20.22

So, \(\mathrm{E}(Y) = \mathrm{Var}(Y) = \lambda\) for \(Y \sim \mbox{Poisson}(\lambda)\).

Linear Predictor and Link Function in Poisson Model

As we can see in the PMF formula, the natural parameter of the Poisson distribution is \(\log(\lambda)\). Therefore, we will use the log link function to connect the parameter \(\lambda\) to the linear predictor.

Here, let’s examine why.

First, generate the linear predictor \(\mu\) using an explanatory variable \(x\) and parameters \(\beta_1\) and \(\beta_2\).

N <- 10000
beta <- c(0.3, 0.9)
x <- runif(N, min = -2, max = 2)  # random values from uniform
df <- data.frame(x) %>%
    mutate(mu = beta[1] + beta[2]*x)

With a function \(g\), we write \[g(\lambda_i) = \mu_i = \beta_1 + \beta_2 x_i.\]

What kind of link function \(g\) do we need?

How about the identity link function?

df <- df %>%
    mutate(lambda_identity = mu)
plt <- ggplot(df, aes(x = mu)) +
    labs(x = 'linear predictor', y = expression(lambda))
plt + geom_line(aes(y = lambda_identity))

What is the problem?

How about the logit link?

df <- df %>%
    mutate(lambda_logit = 1 / (1 + exp(-mu)))
plt %+% df + geom_line(aes(y = lambda_logit))

What is the problem?

How about the log link? That is, \[\log(\lambda) = \mu,\] or \[\lambda = \exp(\mu).\]

df <- df %>%
    mutate(lambda_log = exp(mu))
plt %+% df + geom_line(aes(y = lambda_log))

This is what we want. Do you understand why we use the log link?

A Simulation of Poisson Model

Let’s simulate Poisson data and fit the Poisson model. Suppose there are three explanatory varialbles. Then we can simulate data as follows in R.

set.seed(2016-11-29)
N <- 500          # N of observations
x2 <- runif(N)
x3 <- runif(N)
x4 <- runif(N)
beta <- c(1, .75, -1.25, .5)    # set parameters at  (1, .75, -1.25, .5)
mu <- exp(beta[1] + beta[2]*x2 + beta[3]*x3 + beta[4]*x4)  
y <- rpois(N, lambda = mu)      # sample the outcome values

A statistical model can be written as follows. \[y_n \sim \mbox{Poisson}(\lambda_n),\] \[\lambda_n = \exp(\beta_1 + \beta_2 x_{2n} + \beta_3 x_{3n} + \beta_4 x_{4n}),\] \[\beta_k \sim \mbox{Normal}(0, 100),\] where \(k = 1, 2, 3, 4.\)

Fitting Poisson Model with RStan

We can translate the statistical model into a Stan model as follows (d4-sim-poisson0.stan)

// d4-sim-poisson0.stan
data {
    int<lower=0> N;
    real X2[N];
    real X3[N];
    real X4[N];
    int<lower=0> Y[N];
}

parameters {
    vector[4] b;
}

transformed parameters {
    real mu[N];
    for (n in 1:N)
        mu[n] = exp(b[1] + b[2]*X2[n] + b[3]*X3[n] + b[4]*X4[n]);
}

model {
    for (n in 1:N)
        Y[n] ~ poisson(mu[n]);
    b ~ normal(0, 100);
}

generated quantities {
    int<lower=0> y_pred[N];
    real log_lik[N];
    for (n in 1:N) {
        y_pred[n] = poisson_rng(mu[n]);
        log_lik[n] = poisson_lpmf(Y[n] | mu[n]);
    }
}

You can use this model, but we would like to modify the model a little as follows (d4-sim-poisson.stan).

// d4-sim-poisson.stan
data {
    int<lower=0> N;
    matrix[N, 4] X;    // prepare the matrix whose first column is all 1s
    int<lower=0> Y[N]; // cannot use vector because Y must be integers
}

parameters {
    vector[4] b;
}

transformed parameters {
    vector[N] mu;
    mu = multiply(X, b);  // matrix multiplication
}

model {
    Y ~ poisson_log(mu);
    b ~ normal(0, 100);
}

generated quantities {
    int<lower=0> y_pred[N];
    vector[N] log_lik;
    for (n in 1:N) {
        y_pred[n] = poisson_log_rng(mu[n]);
        log_lik[n] = poisson_log_lpmf(Y[n] | mu[n]);
    }
}

Let’s fit the model. First, compile the model.

sim_poisson <- stan_model('stan-models/d4-sim-poisson.stan')

Then, sample the values by passing a list of variables to rstan::sampling().

myd <- list(N = N,
            X = cbind(1, x2, x3, x4), # predictor matrix, 1st col is intercept
            Y = y)
fit_sim_poisson1 <- sampling(sim_poisson, data = myd, seed = 98765,
                             chains = 4, iter = 2000, warmup = 1000)

Let’s check the results.

print(fit_sim_poisson1, pars = 'b')

## Inference for Stan model: d4-sim-poisson.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## b[1]  0.99       0 0.09  0.81  0.93  0.99  1.04  1.17  1982    1
## b[2]  0.85       0 0.09  0.66  0.78  0.85  0.91  1.03  2260    1
## b[3] -1.23       0 0.09 -1.42 -1.30 -1.23 -1.17 -1.05  2509    1
## b[4]  0.34       0 0.09  0.17  0.28  0.34  0.40  0.52  2455    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:27:29 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

(Though I don’t create traceplots here, you should do to check the convergence.)

A visual display of the estimation result.

plot(precis(fit_sim_poisson1, pars = 'b', depth = 2))

It seems that the model correctly estiamte the paramter values.

Since we cannot know if we are just lucky enough to have gotten this result by a single trial, we would like to iterate the simulation.

First, let’s make a function to create a Poisson data.

poisson_dgp <- function(N = 500, beta = c(1, .75, -1.25, .5)) {
    x2 <- runif(N)
    x3 <- runif(N)
    x4 <- runif(N)
    mu <- exp(beta[1] + beta[2]*x2 + beta[3]*x3 + beta[4]*x4)
    y <- rpois(N, lambda = mu)
    return(list(N = N, X = cbind(1, x2, x3, x4), Y = y))
}

Then, write a function to simulate the process. We gonna use the Stan model we have already compiled.

sim_stan_poisson <- function(sims = 1000, N = 500){
    b_matrix <- matrix(rep(NA,sims*4), ncol = 4)
    for (i in 1:sims) {
        myd <- poisson_dgp(N = N)
        fit <- sampling(sim_poisson, data = myd, 
                        refresh = -1, open_progress = FALSE)
        mcs <- rstan::extract(fit, pars = 'b')
        b_matrix[i, ] <- apply(mcs$b, 2, median)
    }
    return(b_matrix)
}

We are ready to run a simulation (but please don’t run it in class. It will take about an hour to finish.).

sim_out <- sim_stan_poisson(sims = 1000, N = 500)

Let’s transform the outcome matrix into a data frame.

sim_out <- as_data_frame(sim_out)
names(sim_out) <- paste0('b', 1:4)

Let’s examine the result.

summary(sim_out)

##        b1               b2               b3                b4        
##  Min.   :0.7219   Min.   :0.4344   Min.   :-1.5196   Min.   :0.2106  
##  1st Qu.:0.9372   1st Qu.:0.6910   1st Qu.:-1.3106   1st Qu.:0.4413  
##  Median :0.9930   Median :0.7514   Median :-1.2457   Median :0.5058  
##  Mean   :0.9945   Mean   :0.7511   Mean   :-1.2474   Mean   :0.5015  
##  3rd Qu.:1.0513   3rd Qu.:0.8148   3rd Qu.:-1.1841   3rd Qu.:0.5629  
##  Max.   :1.2575   Max.   :1.1019   Max.   :-0.9725   Max.   :0.8166

On average, the estimates collectly capture the parameter values. Let’s make histograms to verify.

htgrm <- ggplot(sim_out, aes(y = ..density..))
h1 <- htgrm + geom_histogram(aes(x = b1), color = 'black') + 
    geom_vline(xintercept = beta[1], color = 'tomato')
h2 <- htgrm + geom_histogram(aes(x = b2), color = 'black') +
    geom_vline(xintercept = beta[2], color = 'tomato')
h3 <- htgrm + geom_histogram(aes(x = b3), color = 'black') +
    geom_vline(xintercept = beta[3], color = 'tomato')
h4 <- htgrm + geom_histogram(aes(x = b4), color = 'black') +
    geom_vline(xintercept = beta[4], color = 'tomato')
# display 4 plots in a single view
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
print(h1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(h2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
print(h3, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(h4, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))

Example

Let’s consider the example given by McElreath (2016: pp.313–321): Oceanic tool complexity. The data set is provided in rethinking package.

# library('rethinking')
data(Kline)
d <- Kline
glimpse(d)

## Observations: 10
## Variables: 5
## $ culture     <fctr> Malekula, Tikopia, Santa Cruz, Yap, Lau Fiji, Tro...
## $ population  <int> 1100, 1500, 3600, 4791, 7400, 8000, 9200, 13000, 1...
## $ contact     <fctr> low, low, low, high, high, high, high, low, high,...
## $ total_tools <int> 13, 22, 24, 43, 33, 19, 40, 28, 55, 71
## $ mean_TU     <dbl> 3.2, 4.7, 4.0, 5.0, 5.0, 4.0, 3.8, 6.6, 5.4, 6.6

This data set has only five variables for ten observations. So why don’t we display the whole data?

##       culture population contact total_tools mean_TU
## 1    Malekula       1100     low          13     3.2
## 2     Tikopia       1500     low          22     4.7
## 3  Santa Cruz       3600     low          24     4.0
## 4         Yap       4791    high          43     5.0
## 5    Lau Fiji       7400    high          33     5.0
## 6   Trobriand       8000    high          19     4.0
## 7       Chuuk       9200    high          40     3.8
## 8       Manus      13000     low          28     6.6
## 9       Tonga      17500    high          55     5.4
## 10     Hawaii     275000     low          71     6.6

Now we can see that each row represents a culture. With this data set, we’d like to explain the total_tools, which is the count outcome. Following McElreath (2016), let’s use population and contact as explanatory variables.

Before building a model, we’ll check bivariate relationships between the variables.

The outcome and the population.

plt <- ggplot(d, aes(y = total_tools, x = population)) + geom_point()
print(plt)

It seems that there is a non-linear relationship. Let’s log the population

plt <- ggplot(d, aes(y = total_tools, x = log(population))) + geom_point()
print(plt)

The linear relationship has appeard. This might be because the order of population affect the number of tools.

Next, outcome and contact.

with(d, table(contact, total_tools))

##        total_tools
## contact 13 19 22 24 28 33 40 43 55 71
##    high  0  1  0  0  0  1  1  1  1  0
##    low   1  0  1  1  1  0  0  0  0  1

plt <- ggplot(d, aes(x = contact, y = total_tools)) + geom_boxplot()
print(plt)

It seems that the high level of contact is associated with the larger number of tools.

Lastly, log(population) and contact.

plt <- ggplot(d, aes(x = contact, y = log(population))) + geom_boxplot()
print(plt)

Using these variables T (total_tools), P (log population), and C (contact), we propose the follwoing statistical model. \[T_n \sim \mbox{Poisson}(\lambda_n),\] \[\lambda_n = \exp(\alpha + \beta_1 C_n + \gamma_n P_n),\] \[\gamma_n = \beta_2 + \beta_3 C_n,\] \[\alpha \sim \mbox{Normal}(0, 10),\] \[\beta_k \sim \mbox{Normal}(0, 1),\] where \(k = 1, 2, 3.\) We use ~~normalizaion~~ regularlizing prios to avoid overfitting becasue we have only 10 data points. Relatively flat prior is given to \(\alpha\) becasue we don’t know where the intercet should be.

This model can be translated into the following Stan model.

// d4-oceanic-tool.stan
data {
    int<lower=0> N;
    vector[N] P;
    vector[N] C;
    int<lower=0> T[N];
}

parameters {
    real a;
    vector[3] b;
}

transformed parameters {
    vector[N] gamma;
    vector<lower=0>[N] mu;
    gamma = b[2] + b[3]*C;
    mu = a + b[1]*C + gamma .* P;
}

model {
    T ~ poisson_log(mu);
    a ~ normal(0, 10);
    b ~ normal(0, 1);
}

generated quantities {
    int<lower=0> t_pred[N];
    vector[N] log_lik;
    for (n in 1:N) {
        t_pred[n] = poisson_log_rng(mu[n]);
        log_lik[n] = poisson_log_lpmf(T[n] | mu[n]);
    }
}

Compile this model.

ocean_model <- stan_model('stan-models/d4-oceanic-tool.stan')

Then, sample.

myd <- list(N = nrow(d),
            C = ifelse(d$contact == 'high', 1, 0),
            P = log(d$population) - mean(log(d$population)),
            T = d$total_tools)
fit_ocean <- sampling(ocean_model, data = myd, seed = 12345,
                      refresh = -1, open_progress = FALSE) # you don't want these 2 options

Check the result.

print(fit_ocean, pars = c('a', 'b'))

## Inference for Stan model: d4-oceanic-tool.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##      mean se_mean   sd  2.5%   25%  50%  75% 97.5% n_eff Rhat
## a    3.31       0 0.09  3.14  3.25 3.31 3.37  3.47  1859    1
## b[1] 0.29       0 0.11  0.07  0.21 0.29 0.36  0.51  1884    1
## b[2] 0.26       0 0.03  0.20  0.24 0.26 0.29  0.33  2256    1
## b[3] 0.07       0 0.16 -0.25 -0.04 0.07 0.17  0.38  1943    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:27:50 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

plot(precis(fit_ocean, pars = c('a', 'b'), depth = 2))

Now let’s visualize the result. First, we visualize the effects of the predictors on the mean of the outcome (\(\lambda\)).

mcs <- rstan::extract(fit_ocean)
d <- d %>%
    mutate(lambda_lb = exp(apply(mcs$mu, 2, quantile, .05)),
           lambda_md = exp(apply(mcs$mu, 2, quantile, .5)),
           lambda_ub = exp(apply(mcs$mu, 2, quantile, .95)))
plt <- ggplot(d, aes(x = log(population), y = lambda_md)) +
    geom_ribbon(aes(ymin = lambda_lb, ymax = lambda_ub, fill = contact), alpha = 1/5) +
    geom_line(aes(color = contact)) +
    ylab(expression(hat(lambda)))
print(plt)

So the shaded areas for the low-contct and high-contact groups overlap at some values of log(population). Does it mean that the variable contact doesn’t matter there? Let’s check it with setting the log population at 8.5.

lpop <- 8.5 - mean(log(d$population))
lambda_high <- exp(mcs$a + mcs$b[1] + (mcs$b[2] + mcs$b[3])*lpop) 
lambda_low <- exp(mcs$a  + mcs$b[3]*lpop) 
diff <- lambda_high - lambda_low

Look at the histogram.

plt <- ggplot(data_frame(diff), aes(x = diff, y = ..density..)) + 
    geom_histogram(color = 'black') +
    geom_vline(xintercept = 0, color = 'red')
print(plt)

It seems that the contact matters even when log(population) = 8.5. In fact, the probability that the high contact has larger \(\lambda\) when log(population) = 8.5 is

sum(diff > 0) / length(diff)

## [1] 1

This should be obvious by lookin at the histogram above.

How about the predicted values of the outcome?

d <- d %>%
    mutate(p5 = apply(mcs$t_pred, 2, quantile, .05),
           p50 = apply(mcs$t_pred, 2, quantile, .5),
           p95 = apply(mcs$t_pred, 2, quantile, .95))
plt <- ggplot(d, aes(x = log(population), y = p50)) +
    geom_ribbon(aes(ymin = p5, ymax = p95, fill = contact), alpha = 1/5) +
    geom_line(aes(color = contact)) +
    ylab('Predicted outcome')
print(plt)

Predicted vs. Observed.

plt <- ggplot(d, aes(x = total_tools, y = p50, ymin = p5, ymax = p95,
                     shape = contact, color = contact)) +
    geom_pointrange() +
    coord_fixed() +
    geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
    labs(x = 'Observed outcome', y = 'Predicted outcome')
print(plt)

Next, let’s compare this model with some other models. The models we will examine are

Though I don’t write statistical models here, you should write down each model as an exercise.

First, compile the models.

ocean_model2 <- stan_model('stan-models/d4-oceanic-tool2.stan')
ocean_model3 <- stan_model('stan-models/d4-oceanic-tool3.stan')
ocean_model4 <- stan_model('stan-models/d4-oceanic-tool4.stan')
ocean_model5 <- stan_model('stan-models/d4-oceanic-tool5.stan')

Sample.

fit_ocean2 <- sampling(ocean_model2, data = myd, seed = 12345,
                       refresh = -1, open_progress = FALSE)
fit_ocean3 <- sampling(ocean_model3, data = myd, seed = 67890,
                       refresh = -1, open_progress = FALSE)
fit_ocean4 <- sampling(ocean_model4, data = myd, seed = 98765,
                       refresh = -1, open_progress = FALSE)
fit_ocean5 <- sampling(ocean_model5, data = myd, seed = 43210,
                       refresh = -1, open_progress = FALSE)

print(fit_ocean2, pars = c('a', 'b'))

## Inference for Stan model: d4-oceanic-tool2.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##      mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## a    3.31       0 0.09 3.14 3.26 3.31 3.37  3.48  1328    1
## b[1] 0.30       0 0.11 0.08 0.22 0.30 0.37  0.50  1405    1
## b[2] 0.26       0 0.03 0.20 0.24 0.27 0.29  0.33  1654    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:27:59 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

print(fit_ocean3, pars = c('a', 'b'))

## Inference for Stan model: d4-oceanic-tool3.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##   mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## a 3.48       0 0.06 3.37 3.44 3.48 3.51  3.58  2543    1
## b 0.24       0 0.03 0.18 0.22 0.24 0.26  0.30  2334    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:28:05 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

print(fit_ocean4, pars = c('a', 'b'))

## Inference for Stan model: d4-oceanic-tool4.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##   mean se_mean   sd  2.5%  25%  50%  75% 97.5% n_eff Rhat
## a 3.45       0 0.08  3.30 3.40 3.45 3.50  3.60  1630    1
## b 0.19       0 0.10 -0.01 0.12 0.19 0.25  0.39  1598    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:28:11 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

print(fit_ocean5, pars = 'a')

## Inference for Stan model: d4-oceanic-tool5.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##   mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## a 3.55       0 0.05 3.44 3.51 3.55 3.59  3.65  2917    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:28:17 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Compare the models.

(model_comp <- compare(fit_ocean, fit_ocean2, fit_ocean3, fit_ocean4, fit_ocean5))

##             WAIC pWAIC dWAIC weight    SE   dSE
## fit_ocean2  78.4   3.9   0.0   0.83 10.88    NA
## fit_ocean   82.3   6.1   3.9   0.12 11.82  3.44
## fit_ocean3  84.1   3.5   5.7   0.05  8.89  8.20
## fit_ocean5 141.6   8.4  63.2   0.00 31.71 34.54
## fit_ocean4 148.5  15.6  70.1   0.00 43.62 45.59

plot(model_comp)

This comparison shows that the second model, which doesn’t have the interaction effect is best among the five models. It implies that we might overfit with the interaction. However, non-trivial amount of weights are given to the model with interaction. Therefore, we can’t just ignore that model.

Following McElreath (2016), let’s average the top 3 models.

log_pop_seq <- seq(from = 6, to = 13, length.out = 100)
log_pop_seq_c <- matrix(log_pop_seq - mean(log(d$population)), ncol = 1)

mcs2 <- rstan::extract(fit_ocean2)
mcs3 <- rstan::extract(fit_ocean3)

coef_m1 <- rbind(mcs$a, mcs$b[,1], mcs$b[, 2], mcs$b[, 3])
coef_m2 <- rbind(mcs2$a, mcs2$b[, 1], mcs2$b[, 2])
coef_m3 <- rbind(mcs3$a, mcs3$b)

# Trend for high-contact islands 
lambda_pred_h <- .83*exp(coef_m2[1, ] + coef_m2[2, ] + 
                             log_pop_seq_c %*% coef_m2[3, ]) +
    .12*exp(coef_m1[1, ] + coef_m1[2, ] + 
                log_pop_seq_c %*% (coef_m1[3, ] + coef_m1[4, ])) +
    .05*exp(coef_m3[1, ] + log_pop_seq_c %*% coef_m3[2,])
lambda_h_med <- apply(lambda_pred_h, 1, median)
lambda_h_PI <- apply(lambda_pred_h, 1, PI, prob = .9)  # 90 percentile

# Trend for low-contact islands
lambda_pred_l <- .83*exp(coef_m2[1, ] + 
                             log_pop_seq_c %*% coef_m2[3, ]) +
    .12*exp(coef_m1[1, ] + log_pop_seq_c %*% (coef_m1[3, ])) +
    .05*exp(coef_m3[1, ] + log_pop_seq_c %*% coef_m3[2,])
lambda_l_med <- apply(lambda_pred_l, 1, median)
lambda_l_PI <- apply(lambda_pred_l, 1, PI, prob = .9)  # 90 percentile


plt <- data_frame(log_pop = rep(log_pop_seq, 2),
                  med = c(lambda_h_med, lambda_l_med), 
                  lb = c(lambda_h_PI[1, ], lambda_l_PI[1, ]),
                  ub = c(lambda_h_PI[2, ], lambda_l_PI[2, ]),
                  contact = rep(c('high', 'low'), rep(length(log_pop_seq), 2))) %>%
    ggplot(aes(x = log_pop, y = med)) +
    geom_ribbon(alpha = 1/5, aes(ymin = lb, ymax = ub, fill = contact)) +
    geom_line(aes(color = contact)) +
    geom_point(data = d, aes(x = log(population), y = total_tools, 
                             color = contact, shape = contact)) +
    labs(x = 'log-population', y = 'total tools')
print(plt)

This figure shows the predicted outcome based on our model ensemble. It is clear that the log-population is associated with the number of tools. In addition, contact increaes the number of tools too. The effect of contact (difference between two curves) increases as the log-population gets larger. However, the 90 percentiles overlap with each other in some range. How do you interpret this result?

Different Exposure: Poisson with Offset

You can think the Poisson parameter \(\lambda\) as a rate. Then, you can have different exposure (time or space) for different observations.

Suppose for instance that one of your classmates records the weekly totals of emails he receives while you do daily totals. If you get both data sets, you can anlyze the totals with Poisson model.

In this case, \(\lambda\) can be thought as the the expected number of emails, \(\mu\), per unit time \(\tau\). Then, \(\lambda = \mu / \tau\). With the number of emails \(y_n\) and an explanatory variable \(x\), we can write: \[y_n \sim \mbox{Poisson}(\lambda_n),\] \[\log(\lambda_n) = \log \frac{\mu_n}{\tau_n} = \alpha + \beta x_n.\] We can rewrite this formula as: \[\log(\lambda_n) = \log \mu_n - \log \tau_n = \alpha + \beta x_n.\] Therefore, the log expected value can be written as: \[\log(\mu_n) = \alpha + \beta x_n + \log \tau_n.\] Using this link, we can rewrite our model as: \[y_n \sim \mbox{Poisson}(\mu_n),\] \[\mu_n = \exp(\alpha + \beta x_n + \log \tau_n).\]

In this model, \(\log \tau_n\) is called offset. This reduces to the basic Poisson model when \(\tau_n = 1\) because \(\log(1) = 0\).

Let’s fit this model with a simulated data. Suppose you record the daily number of emails you received for 50 days, and the expected number of emails per day is 2. Then, you can simulate the data as follows.

days <- 50
y_u <- rpois(days, lambda = 2)

And suppose that your classmate records the weekly number of emails for 6 weeks and that the expected number per day is 1.2. Then you can simulate the data as follows.

weeks <- 6
y_cm <- rpois(weeks, lambda = 1.2*7)

Now you have 56 observed counts with two different exposures.

d <- data_frame(y = c(y_u, y_cm),
                exposure = c(rep(1, days), rep(7, weeks)),
                you = c(rep(1, days), rep(0, weeks)))

In this model, our only explanatory variable is the dummy indicating whose record the coutn is (“you” dummy).

Here is a Stan model (You should write the statistical model before writing a Stan model).

Comiple the model.

exposure_model <- stan_model('stan-models/d4-poisson-exposure.stan')

Sample.

myd <- list(N = nrow(d), U = d$you, X = d$exposure, Y = d$y)
fit_exposure <- sampling(exposure_model, data = myd, seed = 123,
                         refresh = -1, open_progress = FALSE)

Check the result.

print(fit_exposure, pars = c('a', 'b'))

## Inference for Stan model: d4-poisson-exposure.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##   mean se_mean   sd  2.5%   25%  50%  75% 97.5% n_eff Rhat
## a 0.04    0.00 0.15 -0.26 -0.06 0.04 0.14  0.32  1167    1
## b 0.70    0.01 0.18  0.37  0.58 0.70 0.81  1.04  1141    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:28:26 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Using this result, we can get the posterior distribution of \(\lambda\) of you and the classmate.

mcs <- rstan::extract(fit_exposure)
lambda_u <- exp(mcs$a + mcs$b)
lambda_cm <- exp(mcs$a)
precis(data.frame(lambda_u, lambda_cm))

##           Mean StdDev |0.89 0.89|
## lambda_u  2.10   0.20  1.79  2.43
## lambda_cm 1.05   0.15  0.81  1.30

So even when you have counts measured in different exposure (different length of time, or different size of area), you can estimate the rate of the outcome by Poisson model.

Back to Class Materials

Research Methods in Political Science II

Day 5: Poisson Model

Yuki Yanai

2016-12-03 (revised on 2016-12-16)

Preparation