library('rstan')
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
library('rethinking')
library('grid')
library('tidyverse')
We can model the count outcome \(Y\) using binomial distribution as follows. \[Y \sim \mbox{Binomial}(n, \theta),\] where \(n\) is the number of Bernoulli trials, \(\theta\) is the probability of success for each trial, and \(Y\) is the number of successes (\(0 \leq Y \leq n\)).
Let’s sample \(Y\) with setting \(n = 10\) and \(\theta = 0.3\) in R.
n <- 10
theta <- .3
y1 <- rbinom(10000, size = n, prob = theta)
plt <- ggplot(data_frame(y1), aes(x = y1)) +
geom_bar(aes(y = ..count.. / 10000)) +
scale_x_continuous(breaks = 0:n) +
labs(x = '# of successes', y = '')
print(plt)
Theoretically, the mean and variance of \(Y\) should be \(n\theta = 3\) and \(n\theta(1-\theta) = 2.1\). Let’s see.
mean(y1)
## [1] 2.9818
var(y1)
## [1] 2.091278
Sometimes, we don’t know the number of trials. In such a case, suppose the number of trials \(n\) is large enough and the maximum value of the outcome \(Y\) is much smaller than \(n\) (or \(\theta\) is small). For instance, let’s assume thta \(n = 100000\) and \(\theta = 0.001\) and sample outcome values.
n <- 100000
theta <- 0.001
y2 <- rbinom(10000, size = n, prob = theta)
plt <- ggplot(data_frame(y2), aes(x = y2)) +
geom_bar(aes(y = ..count.. / 10000)) +
labs(x = '# of successes', y = '')
print(plt)
Let’s calculate the mean and variance of y2.
mean(y2)
## [1] 99.9101
var(y2)
## [1] 98.31085
They are almost same because \(n\theta = 100\) and \(n\theta(1-\theta) = 99.9 \approx 100\). So under this situation, the mean and variance are almost equal.
We can think that this outocme follows a probability distribution that has the mean equal to the variance. The distribution is called Poisson distribution, and we can sample the values by rpois()
function.
Let’s sample the outcome from the Poisson distribution with the mean \(\lambda=100\).
y3 <- rpois(10000, lambda = 100)
plt <- ggplot(data_frame(y3), aes(x = y3)) +
geom_bar(aes(y = ..count.. / 10000)) +
labs(x = '# of successes', y = '')
print(plt)
The mean and variance are
mean(y3)
## [1] 100.1576
var(y3)
## [1] 97.99136
Now let’s compare y2, which is sampled from Binomial(10000, 0.01), with y3, which is sampled from Poisson(100).
df <- data_frame(y = c(y2, y3),
distribution = rep(c('Bin(10000, 0.001)', 'Poisson(100)'), c(10000, 10000)))
plt <- ggplot(df, aes(x = y, y = ..density.., fill = distribution)) +
geom_histogram(alpha = .4, position = 'identity')
print(plt)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Another visualization.
df <- data_frame(y2_Q = quantile(y2, seq(0, 1, length.out = 100)),
y3_Q = quantile(y3, seq(0, 1, length.out = 100)))
qqplt <- ggplot(df, aes(x = y2_Q, y = y3_Q)) +
geom_point(color = 'darkblue') +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
coord_fixed() +
labs(x = 'Quantile of y2', y = 'Quantile of y3')
print(qqplt)
In fact, these two outcomes are so similar and we can guess that they are sampled from similar distributions.
Thus, when we do not know the number of trials (or when there is no upper limit of counts) and the probabiilty of event is small, we would like to model the count outcome using Poisson distribution.
Poisson distribution can be thought as a distribution of counts. Since the count is non-negative integers, the outcome values sampled from Poisson distribution must be non-negative integers. The PMF of the Poisson distribution with the paramter (mean or variance) \(\lambda\) is \[p(y|\lambda) = \frac{\lambda^y \exp(-\lambda)}{y!} = \exp(y\log(\lambda) - \lambda - \log y!).\] In this formula, \(y\) appears as \(y\) itself inside \(\exp()\), so this is a cannonical form. And the natural parameter is \(\log(\lambda)\), so we can use this as the link function.
Let’s examine some Poisson distributions in R.
N <- 10000
means <- c(1, 5, 10, 20)
y1 <- rpois(N, lambda = means[1])
y2 <- rpois(N, lambda = means[2])
y3 <- rpois(N, lambda = means[3])
y4 <- rpois(N, lambda = means[4])
table(y1)
## y1
## 0 1 2 3 4 5 6
## 3636 3651 1891 611 169 35 7
table(y2)
## y2
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 69 322 846 1470 1749 1719 1493 1038 621 353 191 72 40 12 1
## 15 17
## 3 1
table(y3)
## y3
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1 5 18 64 184 376 640 928 1084 1238 1243 1180 943 763 523
## 15 16 17 18 19 20 21 22 23 24
## 344 201 120 69 40 22 8 3 1 2
table(y4)
## y4
## 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
## 1 3 3 15 28 46 108 188 244 421 531 673 774 799 887 923 787 732
## 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 39 40
## 659 566 470 317 271 215 119 101 51 24 15 9 12 2 4 1 1
df <- data.frame(y = c(y1, y2, y3, y4),
lambda = rep(means, rep(N, 4))) %>%
mutate(lambda = factor(lambda, labels = paste0('lambda=', means)))
plt <- ggplot(df, aes(x = y)) +
geom_bar() +
facet_wrap(~ lambda)
print(plt)
Let’s take a closer look at one with \(\lambda = 5\)
plt <- df %>%
filter(lambda == 'lambda=5') %>%
ggplot(aes(y)) + geom_bar()
print(plt)
And the means and variances.
df %>% group_by(lambda) %>%
summarize(mean = round(mean(y), 2),
variance = round(var(y), 2))
## # A tibble: 4 × 3
## lambda mean variance
## <fctr> <dbl> <dbl>
## 1 lambda=1 1.02 1.02
## 2 lambda=5 4.98 4.96
## 3 lambda=10 10.01 9.87
## 4 lambda=20 20.00 20.22
So, \(\mathrm{E}(Y) = \mathrm{Var}(Y) = \lambda\) for \(Y \sim \mbox{Poisson}(\lambda)\).
As we can see in the PMF formula, the natural parameter of the Poisson distribution is \(\log(\lambda)\). Therefore, we will use the log link function to connect the parameter \(\lambda\) to the linear predictor.
Here, let’s examine why.
First, generate the linear predictor \(\mu\) using an explanatory variable \(x\) and parameters \(\beta_1\) and \(\beta_2\).
N <- 10000
beta <- c(0.3, 0.9)
x <- runif(N, min = -2, max = 2) # random values from uniform
df <- data.frame(x) %>%
mutate(mu = beta[1] + beta[2]*x)
With a function \(g\), we write \[g(\lambda_i) = \mu_i = \beta_1 + \beta_2 x_i.\]
What kind of link function \(g\) do we need?
How about the identity link function?
df <- df %>%
mutate(lambda_identity = mu)
plt <- ggplot(df, aes(x = mu)) +
labs(x = 'linear predictor', y = expression(lambda))
plt + geom_line(aes(y = lambda_identity))
What is the problem?
How about the logit link?
df <- df %>%
mutate(lambda_logit = 1 / (1 + exp(-mu)))
plt %+% df + geom_line(aes(y = lambda_logit))
What is the problem?
How about the log link? That is, \[\log(\lambda) = \mu,\] or \[\lambda = \exp(\mu).\]
df <- df %>%
mutate(lambda_log = exp(mu))
plt %+% df + geom_line(aes(y = lambda_log))
This is what we want. Do you understand why we use the log link?
Let’s simulate Poisson data and fit the Poisson model. Suppose there are three explanatory varialbles. Then we can simulate data as follows in R.
set.seed(2016-11-29)
N <- 500 # N of observations
x2 <- runif(N)
x3 <- runif(N)
x4 <- runif(N)
beta <- c(1, .75, -1.25, .5) # set parameters at (1, .75, -1.25, .5)
mu <- exp(beta[1] + beta[2]*x2 + beta[3]*x3 + beta[4]*x4)
y <- rpois(N, lambda = mu) # sample the outcome values
A statistical model can be written as follows. \[y_n \sim \mbox{Poisson}(\lambda_n),\] \[\lambda_n = \exp(\beta_1 + \beta_2 x_{2n} + \beta_3 x_{3n} + \beta_4 x_{4n}),\] \[\beta_k \sim \mbox{Normal}(0, 100),\] where \(k = 1, 2, 3, 4.\)
We can translate the statistical model into a Stan model as follows (d4-sim-poisson0.stan)
// d4-sim-poisson0.stan
data {
int<lower=0> N;
real X2[N];
real X3[N];
real X4[N];
int<lower=0> Y[N];
}
parameters {
vector[4] b;
}
transformed parameters {
real mu[N];
for (n in 1:N)
mu[n] = exp(b[1] + b[2]*X2[n] + b[3]*X3[n] + b[4]*X4[n]);
}
model {
for (n in 1:N)
Y[n] ~ poisson(mu[n]);
b ~ normal(0, 100);
}
generated quantities {
int<lower=0> y_pred[N];
real log_lik[N];
for (n in 1:N) {
y_pred[n] = poisson_rng(mu[n]);
log_lik[n] = poisson_lpmf(Y[n] | mu[n]);
}
}
You can use this model, but we would like to modify the model a little as follows (d4-sim-poisson.stan).
// d4-sim-poisson.stan
data {
int<lower=0> N;
matrix[N, 4] X; // prepare the matrix whose first column is all 1s
int<lower=0> Y[N]; // cannot use vector because Y must be integers
}
parameters {
vector[4] b;
}
transformed parameters {
vector[N] mu;
mu = multiply(X, b); // matrix multiplication
}
model {
Y ~ poisson_log(mu);
b ~ normal(0, 100);
}
generated quantities {
int<lower=0> y_pred[N];
vector[N] log_lik;
for (n in 1:N) {
y_pred[n] = poisson_log_rng(mu[n]);
log_lik[n] = poisson_log_lpmf(Y[n] | mu[n]);
}
}
Let’s fit the model. First, compile the model.
sim_poisson <- stan_model('stan-models/d4-sim-poisson.stan')
Then, sample the values by passing a list of variables to rstan::sampling()
.
myd <- list(N = N,
X = cbind(1, x2, x3, x4), # predictor matrix, 1st col is intercept
Y = y)
fit_sim_poisson1 <- sampling(sim_poisson, data = myd, seed = 98765,
chains = 4, iter = 2000, warmup = 1000)
Let’s check the results.
print(fit_sim_poisson1, pars = 'b')
## Inference for Stan model: d4-sim-poisson.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## b[1] 0.99 0 0.09 0.81 0.93 0.99 1.04 1.17 1982 1
## b[2] 0.85 0 0.09 0.66 0.78 0.85 0.91 1.03 2260 1
## b[3] -1.23 0 0.09 -1.42 -1.30 -1.23 -1.17 -1.05 2509 1
## b[4] 0.34 0 0.09 0.17 0.28 0.34 0.40 0.52 2455 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:27:29 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
(Though I don’t create traceplots here, you should do to check the convergence.)
A visual display of the estimation result.
plot(precis(fit_sim_poisson1, pars = 'b', depth = 2))
It seems that the model correctly estiamte the paramter values.
Since we cannot know if we are just lucky enough to have gotten this result by a single trial, we would like to iterate the simulation.
First, let’s make a function to create a Poisson data.
poisson_dgp <- function(N = 500, beta = c(1, .75, -1.25, .5)) {
x2 <- runif(N)
x3 <- runif(N)
x4 <- runif(N)
mu <- exp(beta[1] + beta[2]*x2 + beta[3]*x3 + beta[4]*x4)
y <- rpois(N, lambda = mu)
return(list(N = N, X = cbind(1, x2, x3, x4), Y = y))
}
Then, write a function to simulate the process. We gonna use the Stan model we have already compiled.
sim_stan_poisson <- function(sims = 1000, N = 500){
b_matrix <- matrix(rep(NA,sims*4), ncol = 4)
for (i in 1:sims) {
myd <- poisson_dgp(N = N)
fit <- sampling(sim_poisson, data = myd,
refresh = -1, open_progress = FALSE)
mcs <- rstan::extract(fit, pars = 'b')
b_matrix[i, ] <- apply(mcs$b, 2, median)
}
return(b_matrix)
}
We are ready to run a simulation (but please don’t run it in class. It will take about an hour to finish.).
sim_out <- sim_stan_poisson(sims = 1000, N = 500)
Let’s transform the outcome matrix into a data frame.
sim_out <- as_data_frame(sim_out)
names(sim_out) <- paste0('b', 1:4)
Let’s examine the result.
summary(sim_out)
## b1 b2 b3 b4
## Min. :0.7219 Min. :0.4344 Min. :-1.5196 Min. :0.2106
## 1st Qu.:0.9372 1st Qu.:0.6910 1st Qu.:-1.3106 1st Qu.:0.4413
## Median :0.9930 Median :0.7514 Median :-1.2457 Median :0.5058
## Mean :0.9945 Mean :0.7511 Mean :-1.2474 Mean :0.5015
## 3rd Qu.:1.0513 3rd Qu.:0.8148 3rd Qu.:-1.1841 3rd Qu.:0.5629
## Max. :1.2575 Max. :1.1019 Max. :-0.9725 Max. :0.8166
On average, the estimates collectly capture the parameter values. Let’s make histograms to verify.
htgrm <- ggplot(sim_out, aes(y = ..density..))
h1 <- htgrm + geom_histogram(aes(x = b1), color = 'black') +
geom_vline(xintercept = beta[1], color = 'tomato')
h2 <- htgrm + geom_histogram(aes(x = b2), color = 'black') +
geom_vline(xintercept = beta[2], color = 'tomato')
h3 <- htgrm + geom_histogram(aes(x = b3), color = 'black') +
geom_vline(xintercept = beta[3], color = 'tomato')
h4 <- htgrm + geom_histogram(aes(x = b4), color = 'black') +
geom_vline(xintercept = beta[4], color = 'tomato')
# display 4 plots in a single view
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
print(h1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(h2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
print(h3, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(h4, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))
Let’s consider the example given by McElreath (2016: pp.313–321): Oceanic tool complexity. The data set is provided in rethinking package.
# library('rethinking')
data(Kline)
d <- Kline
glimpse(d)
## Observations: 10
## Variables: 5
## $ culture <fctr> Malekula, Tikopia, Santa Cruz, Yap, Lau Fiji, Tro...
## $ population <int> 1100, 1500, 3600, 4791, 7400, 8000, 9200, 13000, 1...
## $ contact <fctr> low, low, low, high, high, high, high, low, high,...
## $ total_tools <int> 13, 22, 24, 43, 33, 19, 40, 28, 55, 71
## $ mean_TU <dbl> 3.2, 4.7, 4.0, 5.0, 5.0, 4.0, 3.8, 6.6, 5.4, 6.6
This data set has only five variables for ten observations. So why don’t we display the whole data?
d
## culture population contact total_tools mean_TU
## 1 Malekula 1100 low 13 3.2
## 2 Tikopia 1500 low 22 4.7
## 3 Santa Cruz 3600 low 24 4.0
## 4 Yap 4791 high 43 5.0
## 5 Lau Fiji 7400 high 33 5.0
## 6 Trobriand 8000 high 19 4.0
## 7 Chuuk 9200 high 40 3.8
## 8 Manus 13000 low 28 6.6
## 9 Tonga 17500 high 55 5.4
## 10 Hawaii 275000 low 71 6.6
Now we can see that each row represents a culture. With this data set, we’d like to explain the total_tools, which is the count outcome. Following McElreath (2016), let’s use population and contact as explanatory variables.
Before building a model, we’ll check bivariate relationships between the variables.
The outcome and the population.
plt <- ggplot(d, aes(y = total_tools, x = population)) + geom_point()
print(plt)
It seems that there is a non-linear relationship. Let’s log the population
plt <- ggplot(d, aes(y = total_tools, x = log(population))) + geom_point()
print(plt)
The linear relationship has appeard. This might be because the order of population affect the number of tools.
Next, outcome and contact.
with(d, table(contact, total_tools))
## total_tools
## contact 13 19 22 24 28 33 40 43 55 71
## high 0 1 0 0 0 1 1 1 1 0
## low 1 0 1 1 1 0 0 0 0 1
plt <- ggplot(d, aes(x = contact, y = total_tools)) + geom_boxplot()
print(plt)
It seems that the high level of contact is associated with the larger number of tools.
Lastly, log(population) and contact.
plt <- ggplot(d, aes(x = contact, y = log(population))) + geom_boxplot()
print(plt)
Using these variables T (total_tools), P (log population), and C (contact), we propose the follwoing statistical model. \[T_n \sim \mbox{Poisson}(\lambda_n),\] \[\lambda_n = \exp(\alpha + \beta_1 C_n + \gamma_n P_n),\] \[\gamma_n = \beta_2 + \beta_3 C_n,\] \[\alpha \sim \mbox{Normal}(0, 10),\] \[\beta_k \sim \mbox{Normal}(0, 1),\] where \(k = 1, 2, 3.\) We use normalizaion regularlizing prios to avoid overfitting becasue we have only 10 data points. Relatively flat prior is given to \(\alpha\) becasue we don’t know where the intercet should be.
This model can be translated into the following Stan model.
// d4-oceanic-tool.stan
data {
int<lower=0> N;
vector[N] P;
vector[N] C;
int<lower=0> T[N];
}
parameters {
real a;
vector[3] b;
}
transformed parameters {
vector[N] gamma;
vector<lower=0>[N] mu;
gamma = b[2] + b[3]*C;
mu = a + b[1]*C + gamma .* P;
}
model {
T ~ poisson_log(mu);
a ~ normal(0, 10);
b ~ normal(0, 1);
}
generated quantities {
int<lower=0> t_pred[N];
vector[N] log_lik;
for (n in 1:N) {
t_pred[n] = poisson_log_rng(mu[n]);
log_lik[n] = poisson_log_lpmf(T[n] | mu[n]);
}
}
Compile this model.
ocean_model <- stan_model('stan-models/d4-oceanic-tool.stan')
Then, sample.
myd <- list(N = nrow(d),
C = ifelse(d$contact == 'high', 1, 0),
P = log(d$population) - mean(log(d$population)),
T = d$total_tools)
fit_ocean <- sampling(ocean_model, data = myd, seed = 12345,
refresh = -1, open_progress = FALSE) # you don't want these 2 options
Check the result.
print(fit_ocean, pars = c('a', 'b'))
## Inference for Stan model: d4-oceanic-tool.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## a 3.31 0 0.09 3.14 3.25 3.31 3.37 3.47 1859 1
## b[1] 0.29 0 0.11 0.07 0.21 0.29 0.36 0.51 1884 1
## b[2] 0.26 0 0.03 0.20 0.24 0.26 0.29 0.33 2256 1
## b[3] 0.07 0 0.16 -0.25 -0.04 0.07 0.17 0.38 1943 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:27:50 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
plot(precis(fit_ocean, pars = c('a', 'b'), depth = 2))
Now let’s visualize the result. First, we visualize the effects of the predictors on the mean of the outcome (\(\lambda\)).
mcs <- rstan::extract(fit_ocean)
d <- d %>%
mutate(lambda_lb = exp(apply(mcs$mu, 2, quantile, .05)),
lambda_md = exp(apply(mcs$mu, 2, quantile, .5)),
lambda_ub = exp(apply(mcs$mu, 2, quantile, .95)))
plt <- ggplot(d, aes(x = log(population), y = lambda_md)) +
geom_ribbon(aes(ymin = lambda_lb, ymax = lambda_ub, fill = contact), alpha = 1/5) +
geom_line(aes(color = contact)) +
ylab(expression(hat(lambda)))
print(plt)
So the shaded areas for the low-contct and high-contact groups overlap at some values of log(population). Does it mean that the variable contact doesn’t matter there? Let’s check it with setting the log population at 8.5.
lpop <- 8.5 - mean(log(d$population))
lambda_high <- exp(mcs$a + mcs$b[1] + (mcs$b[2] + mcs$b[3])*lpop)
lambda_low <- exp(mcs$a + mcs$b[3]*lpop)
diff <- lambda_high - lambda_low
Look at the histogram.
plt <- ggplot(data_frame(diff), aes(x = diff, y = ..density..)) +
geom_histogram(color = 'black') +
geom_vline(xintercept = 0, color = 'red')
print(plt)
It seems that the contact matters even when log(population) = 8.5. In fact, the probability that the high contact has larger \(\lambda\) when log(population) = 8.5 is
sum(diff > 0) / length(diff)
## [1] 1
This should be obvious by lookin at the histogram above.
How about the predicted values of the outcome?
d <- d %>%
mutate(p5 = apply(mcs$t_pred, 2, quantile, .05),
p50 = apply(mcs$t_pred, 2, quantile, .5),
p95 = apply(mcs$t_pred, 2, quantile, .95))
plt <- ggplot(d, aes(x = log(population), y = p50)) +
geom_ribbon(aes(ymin = p5, ymax = p95, fill = contact), alpha = 1/5) +
geom_line(aes(color = contact)) +
ylab('Predicted outcome')
print(plt)
Predicted vs. Observed.
plt <- ggplot(d, aes(x = total_tools, y = p50, ymin = p5, ymax = p95,
shape = contact, color = contact)) +
geom_pointrange() +
coord_fixed() +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
labs(x = 'Observed outcome', y = 'Predicted outcome')
print(plt)
Next, let’s compare this model with some other models. The models we will examine are
Though I don’t write statistical models here, you should write down each model as an exercise.
First, compile the models.
ocean_model2 <- stan_model('stan-models/d4-oceanic-tool2.stan')
ocean_model3 <- stan_model('stan-models/d4-oceanic-tool3.stan')
ocean_model4 <- stan_model('stan-models/d4-oceanic-tool4.stan')
ocean_model5 <- stan_model('stan-models/d4-oceanic-tool5.stan')
Sample.
fit_ocean2 <- sampling(ocean_model2, data = myd, seed = 12345,
refresh = -1, open_progress = FALSE)
fit_ocean3 <- sampling(ocean_model3, data = myd, seed = 67890,
refresh = -1, open_progress = FALSE)
fit_ocean4 <- sampling(ocean_model4, data = myd, seed = 98765,
refresh = -1, open_progress = FALSE)
fit_ocean5 <- sampling(ocean_model5, data = myd, seed = 43210,
refresh = -1, open_progress = FALSE)
print(fit_ocean2, pars = c('a', 'b'))
## Inference for Stan model: d4-oceanic-tool2.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## a 3.31 0 0.09 3.14 3.26 3.31 3.37 3.48 1328 1
## b[1] 0.30 0 0.11 0.08 0.22 0.30 0.37 0.50 1405 1
## b[2] 0.26 0 0.03 0.20 0.24 0.27 0.29 0.33 1654 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:27:59 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
print(fit_ocean3, pars = c('a', 'b'))
## Inference for Stan model: d4-oceanic-tool3.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## a 3.48 0 0.06 3.37 3.44 3.48 3.51 3.58 2543 1
## b 0.24 0 0.03 0.18 0.22 0.24 0.26 0.30 2334 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:28:05 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
print(fit_ocean4, pars = c('a', 'b'))
## Inference for Stan model: d4-oceanic-tool4.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## a 3.45 0 0.08 3.30 3.40 3.45 3.50 3.60 1630 1
## b 0.19 0 0.10 -0.01 0.12 0.19 0.25 0.39 1598 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:28:11 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
print(fit_ocean5, pars = 'a')
## Inference for Stan model: d4-oceanic-tool5.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## a 3.55 0 0.05 3.44 3.51 3.55 3.59 3.65 2917 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:28:17 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Compare the models.
(model_comp <- compare(fit_ocean, fit_ocean2, fit_ocean3, fit_ocean4, fit_ocean5))
## WAIC pWAIC dWAIC weight SE dSE
## fit_ocean2 78.4 3.9 0.0 0.83 10.88 NA
## fit_ocean 82.3 6.1 3.9 0.12 11.82 3.44
## fit_ocean3 84.1 3.5 5.7 0.05 8.89 8.20
## fit_ocean5 141.6 8.4 63.2 0.00 31.71 34.54
## fit_ocean4 148.5 15.6 70.1 0.00 43.62 45.59
plot(model_comp)
This comparison shows that the second model, which doesn’t have the interaction effect is best among the five models. It implies that we might overfit with the interaction. However, non-trivial amount of weights are given to the model with interaction. Therefore, we can’t just ignore that model.
Following McElreath (2016), let’s average the top 3 models.
log_pop_seq <- seq(from = 6, to = 13, length.out = 100)
log_pop_seq_c <- matrix(log_pop_seq - mean(log(d$population)), ncol = 1)
mcs2 <- rstan::extract(fit_ocean2)
mcs3 <- rstan::extract(fit_ocean3)
coef_m1 <- rbind(mcs$a, mcs$b[,1], mcs$b[, 2], mcs$b[, 3])
coef_m2 <- rbind(mcs2$a, mcs2$b[, 1], mcs2$b[, 2])
coef_m3 <- rbind(mcs3$a, mcs3$b)
# Trend for high-contact islands
lambda_pred_h <- .83*exp(coef_m2[1, ] + coef_m2[2, ] +
log_pop_seq_c %*% coef_m2[3, ]) +
.12*exp(coef_m1[1, ] + coef_m1[2, ] +
log_pop_seq_c %*% (coef_m1[3, ] + coef_m1[4, ])) +
.05*exp(coef_m3[1, ] + log_pop_seq_c %*% coef_m3[2,])
lambda_h_med <- apply(lambda_pred_h, 1, median)
lambda_h_PI <- apply(lambda_pred_h, 1, PI, prob = .9) # 90 percentile
# Trend for low-contact islands
lambda_pred_l <- .83*exp(coef_m2[1, ] +
log_pop_seq_c %*% coef_m2[3, ]) +
.12*exp(coef_m1[1, ] + log_pop_seq_c %*% (coef_m1[3, ])) +
.05*exp(coef_m3[1, ] + log_pop_seq_c %*% coef_m3[2,])
lambda_l_med <- apply(lambda_pred_l, 1, median)
lambda_l_PI <- apply(lambda_pred_l, 1, PI, prob = .9) # 90 percentile
plt <- data_frame(log_pop = rep(log_pop_seq, 2),
med = c(lambda_h_med, lambda_l_med),
lb = c(lambda_h_PI[1, ], lambda_l_PI[1, ]),
ub = c(lambda_h_PI[2, ], lambda_l_PI[2, ]),
contact = rep(c('high', 'low'), rep(length(log_pop_seq), 2))) %>%
ggplot(aes(x = log_pop, y = med)) +
geom_ribbon(alpha = 1/5, aes(ymin = lb, ymax = ub, fill = contact)) +
geom_line(aes(color = contact)) +
geom_point(data = d, aes(x = log(population), y = total_tools,
color = contact, shape = contact)) +
labs(x = 'log-population', y = 'total tools')
print(plt)
This figure shows the predicted outcome based on our model ensemble. It is clear that the log-population is associated with the number of tools. In addition, contact increaes the number of tools too. The effect of contact (difference between two curves) increases as the log-population gets larger. However, the 90 percentiles overlap with each other in some range. How do you interpret this result?
You can think the Poisson parameter \(\lambda\) as a rate. Then, you can have different exposure (time or space) for different observations.
Suppose for instance that one of your classmates records the weekly totals of emails he receives while you do daily totals. If you get both data sets, you can anlyze the totals with Poisson model.
In this case, \(\lambda\) can be thought as the the expected number of emails, \(\mu\), per unit time \(\tau\). Then, \(\lambda = \mu / \tau\). With the number of emails \(y_n\) and an explanatory variable \(x\), we can write: \[y_n \sim \mbox{Poisson}(\lambda_n),\] \[\log(\lambda_n) = \log \frac{\mu_n}{\tau_n} = \alpha + \beta x_n.\] We can rewrite this formula as: \[\log(\lambda_n) = \log \mu_n - \log \tau_n = \alpha + \beta x_n.\] Therefore, the log expected value can be written as: \[\log(\mu_n) = \alpha + \beta x_n + \log \tau_n.\] Using this link, we can rewrite our model as: \[y_n \sim \mbox{Poisson}(\mu_n),\] \[\mu_n = \exp(\alpha + \beta x_n + \log \tau_n).\]
In this model, \(\log \tau_n\) is called offset. This reduces to the basic Poisson model when \(\tau_n = 1\) because \(\log(1) = 0\).
Let’s fit this model with a simulated data. Suppose you record the daily number of emails you received for 50 days, and the expected number of emails per day is 2. Then, you can simulate the data as follows.
days <- 50
y_u <- rpois(days, lambda = 2)
And suppose that your classmate records the weekly number of emails for 6 weeks and that the expected number per day is 1.2. Then you can simulate the data as follows.
weeks <- 6
y_cm <- rpois(weeks, lambda = 1.2*7)
Now you have 56 observed counts with two different exposures.
d <- data_frame(y = c(y_u, y_cm),
exposure = c(rep(1, days), rep(7, weeks)),
you = c(rep(1, days), rep(0, weeks)))
In this model, our only explanatory variable is the dummy indicating whose record the coutn is (“you” dummy).
Here is a Stan model (You should write the statistical model before writing a Stan model).
Comiple the model.
exposure_model <- stan_model('stan-models/d4-poisson-exposure.stan')
Sample.
myd <- list(N = nrow(d), U = d$you, X = d$exposure, Y = d$y)
fit_exposure <- sampling(exposure_model, data = myd, seed = 123,
refresh = -1, open_progress = FALSE)
Check the result.
print(fit_exposure, pars = c('a', 'b'))
## Inference for Stan model: d4-poisson-exposure.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## a 0.04 0.00 0.15 -0.26 -0.06 0.04 0.14 0.32 1167 1
## b 0.70 0.01 0.18 0.37 0.58 0.70 0.81 1.04 1141 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:28:26 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Using this result, we can get the posterior distribution of \(\lambda\) of you and the classmate.
mcs <- rstan::extract(fit_exposure)
lambda_u <- exp(mcs$a + mcs$b)
lambda_cm <- exp(mcs$a)
precis(data.frame(lambda_u, lambda_cm))
## Mean StdDev |0.89 0.89|
## lambda_u 2.10 0.20 1.79 2.43
## lambda_cm 1.05 0.15 0.81 1.30
So even when you have counts measured in different exposure (different length of time, or different size of area), you can estimate the rate of the outcome by Poisson model.