Load some packages and set RStan options.
library('rstan')
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
library('tidyverse')
library('ggmcmc')
library('haven')
"School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include the type of program in which the student is enrolled and a standardized test in math."
This example is taken from UCLA’s IDRE website.
Let’s read the data.
d <- read_dta('http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta')
d[] <- lapply(d, unclass) # drop haven's labelled class so the columns behave as plain numeric vectors
glimpse(d)
## Observations: 314
## Variables: 5
## $ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, ...
## $ gender <dbl> 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, ...
## $ math <dbl> 63, 27, 20, 16, 2, 71, 63, 3, 51, 49, 31, 22, 73, 77, ...
## $ daysabs <dbl> 4, 4, 2, 3, 3, 13, 11, 7, 10, 9, 4, 5, 5, 6, 1, 0, 1, ...
## $ prog <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 1, 3, 2, 2, 2, ...
This data set records the attendance of 314 high school juniors from two urban high schools. The outcome variable we are analyzing is days absent, daysabs. math is the standardized math score, and prog is a trichotomous categorical variable for the type of instructional program in which the student is enrolled: 1 is the general program, 2 the academic program, and 3 the vocational program.
Let’s check the mean and variance of the outcome variable.
mean(d$daysabs)
## [1] 5.955414
var(d$daysabs)
## [1] 49.51877
The variance is much larger than the mean, so this count outcome seems to be overdispersed.
Let’s compare the distribution of this variable to the theoretical distribution of Poisson(6).
df <- data_frame(count = 0:max(d$daysabs)) %>%
mutate(density = dpois(count, lambda = 6))
plt <- ggplot(d, aes(x = daysabs, y = ..density..)) +
geom_histogram(color = 'black', binwidth = 1) +
geom_line(data = df, aes(x = count, y = density), color = 'red') +
labs(x = '# of days absent', y = 'probability')
print(plt)
Again, it seems that the outcome is overdispersed relative to the Poisson distribution.
The PMF of the negative-binomial distribution with mean \(\mu\) and dispersion parameter \(\phi\) (relative to the square of mean) can be written as follows. \[ f(y|\mu, \phi) = \binom{y + \phi - 1}{y} \left( \frac{\mu}{\mu + \phi} \right)^y \left(\frac{\phi}{\mu + \phi}\right)^\phi. \] This parameterization is sometimes called NB2. In this parameterization, the expected value and the variance are: \[\mathrm{E}(Y) = \mu,\] \[\mathrm{Var}(Y) = \mu + \frac{\mu^2}{\phi}.\]
The mean and variance of the Poisson distribution with location parameter \(\mu\) are \[\mathrm{E}(Y) = \mathrm{Var}(Y) = \mu.\] So the term \(\mu^2 / \phi\) represents the additional variance relative to the Poisson, and the inverse of \(\phi\) measures the degree of overdispersion. As \(\phi\) goes to infinity (i.e., as the inverse of \(\phi\) goes to zero), the NB2 distribution approaches the Poisson distribution.
In R, we can simulate negative-binomial distributions with rnbinom(n, mu, size). For instance, we can randomly generate 100 values from NegativeBinomial2(\(\mu = 4, \phi = 1\)) as follows.
y <- rnbinom(100, mu = 4, size = 1)
plt <- ggplot(data_frame(y), aes(x = y)) + geom_bar()
plt
Compare the following three sets of values.
y_pois <- rpois(1000, lambda = 4)
y_nb_a <- rnbinom(1000, mu = 4, size = 2)
y_nb_b <- rnbinom(1000, mu = 4, size = 10000)
df <- data_frame(y_pois, y_nb_a, y_nb_b) %>%
gather('dist', 'value', 1:3)
plt <- ggplot(df, aes(value, fill = dist)) +
geom_bar(position = 'identity', alpha = 1/3)
print(plt)
We can see that the distribution of values generated from NB2(4, 10000) is very similar to that from Poisson(4).
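We can also use these simulated values to check the mean-variance relationship stated above; this is just a quick sanity check, and the exact numbers will vary across runs.
# For NB2(mu = 4, phi = 2), the theory says Var(Y) = mu + mu^2 / phi = 4 + 16/2 = 12
mean(y_nb_a)
var(y_nb_a)
# For NB2(mu = 4, phi = 10000), the extra term mu^2 / phi is negligible, so Var(Y) is close to 4
mean(y_nb_b)
var(y_nb_b)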
Consider the count outcome \(y\), which follows a Poisson distribution: \(y\sim \mbox{Poisson}(\mu)\). Then, the PMF is \[f(y|\mu) = \frac{e^{-\mu} \mu^y}{y!}.\] Now suppose that the location parameter of the Poisson distribution follows a Gamma distribution with shape parameter \(\alpha > 0\) and rate parameter \(\beta > 0\). Then, its PDF is \[g(\mu) = \frac{\beta^\alpha}{\Gamma(\alpha)} \mu^{\alpha - 1} e^{-\beta \mu}.\] Therefore, the joint distribution of \(Y\) and \(M\) is \[\Pr(Y=y|M = \mu) g(\mu) = \frac{e^{-\mu} \mu^y}{y!} \frac{\beta^\alpha}{\Gamma(\alpha)} \mu^{\alpha - 1} e^{-\beta \mu}.\] We can get the unconditional distribution of \(Y\) by integrating out \(\mu\). \[\Pr(Y=y) = \int_0^{\infty} \Pr(Y=y|M = \mu) g(\mu) d\mu\\ = \int_0^{\infty} \frac{e^{-\mu} \mu^y}{y!} \frac{\beta^\alpha}{\Gamma(\alpha)} \mu^{\alpha - 1} e^{-\beta \mu} d\mu\\ = \int_0^{\infty}\frac{\beta^\alpha}{y!\Gamma(\alpha)} \mu^{y + \alpha -1} e^{-(\beta + 1)\mu} d\mu\\ = \frac{\beta^\alpha}{y!\Gamma(\alpha)} \frac{\Gamma(y + \alpha)}{(\beta + 1)^{y + \alpha}} \int_0^{\infty} \frac{(\beta + 1)^{y + \alpha}}{\Gamma(y + \alpha)} \mu^{y + \alpha -1} e^{-(\beta + 1)\mu} d\mu\\ = \frac{\beta^\alpha}{y!\Gamma(\alpha)} \frac{\Gamma(y + \alpha)}{(\beta + 1)^{y + \alpha}}\\ = \frac{\Gamma(y + \alpha)}{\Gamma(y + 1)\Gamma(\alpha)} \left(\frac{1}{\beta + 1} \right)^y \left( \frac{\beta}{\beta + 1}\right)^\alpha\\ = \binom{y + \alpha - 1}{y} \left(\frac{1}{\beta + 1} \right)^y \left( \frac{\beta}{\beta + 1}\right)^\alpha.\\ \] The remaining integral in the fourth line equals 1 because the integrand is the PDF of the Gamma(\(y + \alpha, \beta + 1\)) distribution. Now, let \(\alpha = \phi\) and \(\beta = \phi/\mu\). Then, we get \[\Pr(Y = y) = \binom{y + \phi - 1}{y} \left( \frac{1}{\phi/\mu + 1} \right)^y \left( \frac{\phi/\mu}{\phi/\mu + 1} \right)^\phi\\ = \binom{y + \phi - 1}{y} \left( \frac{\mu}{\mu + \phi} \right)^y \left( \frac{\phi}{\mu + \phi} \right)^\phi.\\ \] This is the PMF of NB2(\(\mu, \phi\)).
Let’s verify this relationship by simulation.
set.seed(1)
N <- 10000
mu <- 4
phi <- 1
nb2 <- rnbinom(N, mu = mu, size = phi)
pg <- rpois(N, lambda = rgamma(N, shape = phi, rate = phi/mu))
df <- data_frame(nb2_q = quantile(nb2, seq(0, 1, length.out = 100)),
pg_q = quantile(pg, seq(0, 1, length.out = 100)))
qqplt <- ggplot(df, aes(x = nb2_q, y = pg_q)) +
geom_point(color = 'darkblue') +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
coord_fixed() +
labs(x = 'Quantile of Simulated NB2',
y = 'Quantile of Simulated Poisson-Gamma')
print(qqplt)
df <- data_frame(nb2, pg) %>%
gather('dist', 'value', 1:2)
plt <- ggplot(df, aes(value, fill = dist)) +
geom_bar(position = 'identity', alpha = 1/2)
print(plt)
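Besides simulation, we can check the mixture identity numerically by integrating out \(\mu\) in R and comparing the result with dnbinom(); this is only a rough verification using numerical integration.
y_vals <- 0:5
mixture <- sapply(y_vals, function(y) {
  integrate(function(m) dpois(y, m) * dgamma(m, shape = phi, rate = phi/mu),
            lower = 0, upper = Inf)$value
})
rbind(mixture = mixture,
      nb2 = dnbinom(y_vals, mu = mu, size = phi))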
Before building a statistical model, let’s check other variables, which are potential explanatory variables.
Math score.
summary(d$math)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 28.00 48.00 48.27 70.00 99.00
plt <- ggplot(d, aes(x = math)) + geom_histogram(color = 'black')
print(plt)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Correlation between the outcome and math.
with(d, cor(daysabs, math))
## [1] -0.2358878
Visualize the bivariate relationship.
plt <- ggplot(d, aes(x = math, y = daysabs, color = as.factor(prog))) +
geom_point()
print(plt)
The variable prog, which categorizes programs.
table(d$prog)
##
## 1 2 3
## 40 167 107
Days absent by program.
d %>% group_by(prog) %>%
summarize(mean = mean(daysabs),
var = var(daysabs))
## # A tibble: 3 × 3
## prog mean var
## <dbl> <dbl> <dbl>
## 1 1 10.650000 67.25897
## 2 2 6.934132 55.44744
## 3 3 2.672897 13.93916
Math score by program.
d %>% group_by(prog) %>%
summarize(mean = mean(math),
var = var(math))
## # A tibble: 3 × 3
## prog mean var
## <dbl> <dbl> <dbl>
## 1 1 44.82500 835.9942
## 2 2 42.59281 635.9537
## 3 3 58.41121 436.8293
Now, let’s build a statistical model. We can still use the log link.
\[Y_n \sim \mbox{Neg-Bin}(\mu_n, \phi)\] \[\mu_n = \exp(\alpha_{program_n} + \beta M_n)\] \[\alpha_k \sim \mbox{Normal}(a, s)\] \[\beta \sim \mbox{Normal}(0, 10)\] \[\phi \sim \mbox{Cauchy}^{+}(0, 10)\] \[a \sim \mbox{Normal}(0, 100)\] \[s \sim \mbox{Cauchy}^{+}(0, 10)\] where \(k=1, 2, 3\).
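As a rough point of reference before fitting the Bayesian model, the same mean structure (separate program intercepts and the scaled math score) can be fit by maximum likelihood with MASS::glm.nb(); note that this has no hierarchical prior on the intercepts, so it is only a sanity check.
nb_ml <- MASS::glm.nb(daysabs ~ 0 + factor(prog) + I(math/100), data = d)
summary(nb_ml)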
Now, let’s translate this model into a Stan model. To model an NB2 outcome, we use ~ neg_binomial_2(). For more efficient computation, we can use neg_binomial_2_log(X) instead of neg_binomial_2(exp(X)).
Here is a sample model.
data {
int N;                      // number of observations
int<lower=0> Y[N];          // outcome: days absent
int<lower=0> D;             // number of programs
int<lower=1, upper=D> P[N]; // program of each student
vector[N] M;                // scaled math score
}
parameters {
vector[D] alpha;
real beta;
real<lower=0> phi;
real a;
real<lower=0> s;
}
transformed parameters {
vector[N] mu;
for (n in 1:N)
mu[n] = exp(alpha[P[n]] + beta*M[n]);
}
model {
Y ~ neg_binomial_2(mu, phi);
alpha ~ normal(a, s);
beta ~ normal(0, 10);
phi ~ cauchy(0, 10);
a ~ normal(0, 100);
s ~ cauchy(0, 10);
}
generated quantities {
int<lower=0> y_pred[N];
vector[N] log_lik;
for (n in 1:N) {
y_pred[n] = neg_binomial_2_rng(mu[n], phi);
log_lik[n] = neg_binomial_2_lpmf(Y[n] | mu[n], phi);
}
}
Compile the model.
nb_model <- stan_model('stan-models/d5-negbin.stan')
Fit the model to the data.
mydstan <- list(N = nrow(d), D = length(unique(d$prog)),
Y = d$daysabs, P = d$prog, M = d$math/100) # scale the math score
fit_nb1 <- sampling(nb_model, data = mydstan, seed = 777,
chains = 4, iter = 4000, warmup = 2000,
refresh = -1, open_progress = FALSE)
print(fit_nb1, pars = c('alpha', 'beta', 'phi', 'a', 's'))
## Inference for Stan model: d5-negbin.
## 4 chains, each with iter=4000; warmup=2000; thin=1;
## post-warmup draws per chain=2000, total post-warmup draws=8000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha[1] 2.61 0.00 0.19 2.26 2.48 2.61 2.73 2.98 2304 1
## alpha[2] 2.18 0.00 0.13 1.94 2.10 2.18 2.26 2.45 1720 1
## alpha[3] 1.36 0.00 0.18 1.00 1.24 1.36 1.48 1.74 1814 1
## beta -0.61 0.01 0.25 -1.10 -0.77 -0.60 -0.45 -0.14 1698 1
## phi 1.03 0.00 0.10 0.85 0.96 1.03 1.10 1.24 2782 1
## a 2.03 0.03 1.53 -0.98 1.60 2.06 2.49 4.89 2005 1
## s 1.91 0.08 2.49 0.37 0.72 1.16 2.13 7.96 1025 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:32:27 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Now you have to evaluate the fit and interpret the coefficients carefully. See, for example, the data analysis example here.
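For instance, a simple starting point is a posterior predictive check: compare the replicated outcomes y_pred from the generated quantities block with the observed outcome. A minimal sketch:
mcs_nb <- rstan::extract(fit_nb1)
# pooled posterior predictive draws vs. the observed outcome
summary(as.vector(mcs_nb$y_pred))
summary(d$daysabs)
# proportion of zeros, predicted vs. observed
mean(mcs_nb$y_pred == 0)
mean(d$daysabs == 0)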
Sometimes a count outcome appears to be overdispersed because it has more zeros than a Poisson distribution would predict.
"The state wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, were there children in the group and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish so there are excess zeros in the data because of the people that did not fish."
This example is also taken from UCLA’s IDRE website.
Let’s read the data.
d <- read_csv('http://www.ats.ucla.edu/stat/data/fish.csv')
## Parsed with column specification:
## cols(
## nofish = col_integer(),
## livebait = col_integer(),
## camper = col_integer(),
## persons = col_integer(),
## child = col_integer(),
## xb = col_double(),
## zg = col_double(),
## count = col_integer()
## )
glimpse(d)
## Observations: 250
## Variables: 8
## $ nofish <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1,...
## $ livebait <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,...
## $ camper <int> 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,...
## $ persons <int> 1, 1, 1, 2, 1, 4, 3, 4, 3, 1, 4, 3, 3, 3, 1, 1, 4, 3,...
## $ child <int> 0, 0, 0, 1, 0, 2, 1, 3, 2, 0, 1, 2, 0, 0, 0, 0, 1, 2,...
## $ xb <dbl> -0.89631456, -0.55834496, -0.40173101, -0.95629811, 0...
## $ zg <dbl> 3.050404787, 1.746148944, 0.279938877, -0.601525664, ...
## $ count <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0,...
We have data for 250 groups that went to a park. For each group, we have a variable counting how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper).
First, let’s check the distribution of the outcome variable count.
mean(d$count)
## [1] 3.296
var(d$count)
## [1] 135.3739
plt <- ggplot(d, aes(x = count)) + geom_bar()
print(plt)
We can see that the mean and variance are very different. Looking at the bar graph, we can see that the outcome appears overdispersed largely because there are so many 0s.
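To see how extreme the number of zeros is, we can compare the observed number of zeros with the number a Poisson distribution with the same mean would predict.
sum(d$count == 0)                           # observed number of zeros
nrow(d) * dpois(0, lambda = mean(d$count))  # expected number if counts were Poisson with the same mean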
Following the UCLA IDRE website, let’s try to explain this count outcome with the variables child, persons, and camper.
(Though I skip the examination of explanatory variables to save space, you should always understand the variables before you analyze data).
Let’s fit a Poisson model to this data set.
Let \(Y\) be the outcome and \(C\), \(P\), and \(K\) denote child, persons, and camper, respectively. The model is: \[Y_n \sim \mbox{Poisson}(\lambda_n),\] \[\lambda_n = \exp(\beta_1 + \beta_2 C_n + \beta_3 P_n + \beta_4 K_n),\]
\[\beta_k \sim \mbox{Normal}(0, 10),\] where \(k = 1, 2, 3, 4\).
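As a quick point of reference, the same specification can be fit by maximum likelihood with glm() before we turn to Stan.
pois_ml <- glm(count ~ child + persons + camper, family = poisson(link = 'log'), data = d)
summary(pois_ml)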
Translate this into a Stan model and compile it.
zip_m1 <- stan_model('stan-models/zip-poisson.stan')
mydstan <- list(N = nrow(d), Y = d$count,
C = d$child, P = d$persons, K = d$camper)
fit_zip_pois <- sampling(zip_m1, data = mydstan, seed = 2016-12-17,
chains = 4, iter = 2000, warmup = 1000,
refresh = -1, open_progress = FALSE)
Check the result.
print(fit_zip_pois, pars = 'beta')
## Inference for Stan model: zip-poisson.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] -2.00 0 0.15 -2.29 -2.09 -1.99 -1.90 -1.71 1128 1
## beta[2] -1.69 0 0.08 -1.85 -1.74 -1.69 -1.64 -1.54 2398 1
## beta[3] 1.09 0 0.04 1.02 1.07 1.09 1.12 1.17 1623 1
## beta[4] 0.93 0 0.08 0.77 0.88 0.93 0.99 1.10 2012 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:32:42 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Check the traceplot.
S <- ggs(fit_zip_pois, inc_warmup = TRUE)
ggmcmc(S, family = 'beta', plot = 'traceplot',
file = 'output/d5-zip-poisson-traceplot.pdf')
Let’s compare the predicted values to the observed values.
# Use the mode for comparison
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
mcs <- rstan::extract(fit_zip_pois)
d <- d %>%
mutate(y_mode_pois = apply(mcs$y_pred, 2, Mode))
table(d$count)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 21 22
## 142 31 20 12 6 10 4 3 2 2 1 1 1 1 2 1 2 1
## 29 30 31 32 38 65 149
## 1 1 1 2 1 1 1
table(d$y_mode_pois)
##
## 0 1 2 3 4 5 8 9 10 25 26 27 28 29
## 112 65 11 19 2 9 7 5 7 1 2 3 6 1
In terms of prediction, this model doesn’t explain the data well, because the Poisson distribution assumes that the mean and variance are equal.
Now, let’s build a model that takes into account the large portion of zeros in the outcome.
We can assume the following two-step process generated the data.
1. Each group decides whether or not to fish.
2. If the group fishes, the number of fish caught follows a Poisson distribution.
The count of zeros is inflated because of the first step. But Step 1 is not the only source of zeros; the outcome can also take the value 0 in the second step.
We can think that the outcome \(y\) is generated through these steps. So, we can model \(y\) by a mixture of a Bernoulli distribution and a Poisson distribution. This mixture is called the zero-inflated Poisson (ZIP).
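To make the two-step process concrete, we can simulate from this mixture with a small R function (the name rzip and the parameter values are just for illustration).
# Step 1: Z ~ Bernoulli(theta) decides whether the group fishes at all.
# Step 2: if Z = 1, the count follows Poisson(lambda); otherwise the count is 0.
rzip <- function(n, theta, lambda) {
  z <- rbinom(n, size = 1, prob = theta)
  z * rpois(n, lambda)
}
y_sim <- rzip(1000, theta = 0.3, lambda = 4)
mean(y_sim == 0)  # roughly (1 - 0.3) + 0.3 * dpois(0, 4)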
Let’s write down a model describing these steps. In order to do so, we have to distinguish the variables that affect the first step from those that affect the second step. In other words, we have to figure out which variables inflate the zeros and which affect the number of fish caught in this example.
For simplicity, let’s assume the number of persons affects the first step (a group is more likely to fish when it is large) and the other two explanatory variables affect the number of fish. Then, we can write the following model. \[\begin{cases} Y_n = 0 & (Z_n = 0)\\ Y_n \sim \mbox{Poisson}(\lambda_n) & (Z_n = 1) \end{cases}\] \[\lambda_n = \exp(\gamma_1 + \gamma_2 C_n + \gamma_3 K_n)\] \[Z_n \sim \mbox{Bernoulli}(\theta_n)\] \[\theta_n = \mathrm{logit}^{-1}(\alpha_1 + \alpha_2 P_n)\] \[\alpha, \gamma \sim \mbox{Normal}(0, 10).\]
We cannot translate this model directly into a Stan model, for the following reason. Because we do not observe \(Z\), we have to estimate it. Since \(Z\) is a Bernoulli outcome, it is a discrete parameter. However, Stan cannot estimate discrete parameters directly, because HMC cannot handle them. Therefore, we need to find a way around this difficulty.
In Stan, we can avoid estimating discrete parameters by marginalizing them out. For that purpose, we write the ZIP PMF as follows. \[\mbox{ZIP}(y | \theta, \lambda) =\begin{cases} \mbox{Bernoulli}(0|\theta) + \mbox{Bernoulli}(1|\theta)\cdot \mbox{Poisson}(0|\lambda) & (y=0)\\ \mbox{Bernoulli}(1|\theta)\cdot \mbox{Poisson}(y|\lambda) & (y>0) \end{cases}\]
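This marginalized PMF is easy to write as an R function, where \(\theta\) is the probability that the group fishes (i.e., that \(Z_n = 1\)); the helper name dzip is just for illustration.
dzip <- function(y, theta, lambda) {
  ifelse(y == 0,
         (1 - theta) + theta * dpois(0, lambda),  # Bernoulli(0 | theta) + Bernoulli(1 | theta) * Poisson(0 | lambda)
         theta * dpois(y, lambda))                # Bernoulli(1 | theta) * Poisson(y | lambda)
}
sum(dzip(0:1000, theta = 0.3, lambda = 4))  # should be (numerically) 1
Up to simulation error, dzip(0:8, 0.3, 4) also matches the proportions of the values simulated with rzip() above.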
Let’s rewrite our model using this PMF. \[Y_n \sim \mbox{ZIP}(\theta_n, \lambda_n)\] \[\theta_n = \mathrm{logit}^{-1}(\alpha_1 + \alpha_2 P_n)\] \[\lambda_n = \exp(\gamma_1 + \gamma_2 C_n + \gamma_3 K_n)\] \[\alpha, \gamma \sim \mbox{Normal}(0, 10).\]
Let’s translate this model into a Stan model. To define a function in Stan, we create a functions block. In that block, we define the log PMF of the ZIP distribution, ZIP_lpmf(). We can use this function to calculate the log likelihood, and because its name ends in _lpmf we can also write the sampling statement as y ~ ZIP(). However, we cannot randomly generate numbers with ZIP_rng().
The Stan model is written in this file.
To define the ZIP PMF, we use the log_sum_exp() function in Stan. This function does the following calculation: \[\mathrm{log\_sum\_exp}(a, b) = \log(\exp(a) + \exp(b)),\] where \(a\) and \(b\) are real numbers. In our model, we have the part
log_sum_exp(
bernoulli_lpmf(0 | theta),
bernoulli_lpmf(1 | theta) + poisson_lpmf(0 | lambda)
);
This part gives us \[\log \left(\mbox{Bernoulli}(0|\theta) + \mbox{Bernoulli}(1|\theta)\mbox{Poisson}(0|\lambda) \right)\] because \[\mathrm{bernoulli\_lpmf(0 | theta)} = \log\left(\mbox{Bernoulli}(0| \theta) \right),\] and \[\mathrm{bernoulli\_lpmf(1 | theta) + poisson\_lpmf(0 | lambda)} = \log \left( \mbox{Bernoulli}(1| \theta) \mbox{Poisson}(0|\lambda) \right).\]
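We can verify this numerically in R for the zero case with arbitrary values of \(\theta\) and \(\lambda\) (th and lam below are just illustrative values).
th <- 0.3; lam <- 4
# direct computation of log(Bernoulli(0 | theta) + Bernoulli(1 | theta) * Poisson(0 | lambda))
log((1 - th) + th * dpois(0, lam))
# the log_sum_exp form used in the Stan model
a <- dbinom(0, 1, th, log = TRUE)                             # bernoulli_lpmf(0 | theta)
b <- dbinom(1, 1, th, log = TRUE) + dpois(0, lam, log = TRUE) # bernoulli_lpmf(1 | theta) + poisson_lpmf(0 | lambda)
log(exp(a) + exp(b))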
Now, let’s compile the model.
zip_m2 <- stan_model('stan-models/zip-zip.stan')
Fit the model to data.
mydstan <- list(N = nrow(d), P = d$persons, C = d$child,
K = d$camper, Y = d$count)
fit_zip_zip <- sampling(zip_m2, data = mydstan, seed = 333,
chains = 4, iter = 2000, warmup = 1000,
refresh = -1, open_progress = FALSE)
Take a look at the result.
print(fit_zip_zip, pars = c('alpha', 'gamma'))
## Inference for Stan model: zip-zip.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha[1] -1.32 0.01 0.36 -2.02 -1.57 -1.31 -1.07 -0.64 1711 1
## alpha[2] 0.58 0.00 0.16 0.29 0.47 0.57 0.68 0.89 1659 1
## gamma[1] 1.59 0.00 0.08 1.43 1.54 1.59 1.65 1.75 1939 1
## gamma[2] -1.05 0.00 0.10 -1.23 -1.12 -1.05 -0.98 -0.86 2259 1
## gamma[3] 0.84 0.00 0.09 0.66 0.78 0.84 0.90 1.02 1666 1
##
## Samples were drawn using NUTS(diag_e) at Fri Dec 16 11:32:57 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Check the traceplot.
S <- ggs(fit_zip_zip, inc_warmup = TRUE)
ggmcmc(S, family = 'alpha', plot = 'traceplot',
file = 'output/d5-zip-zip-traceplot-alpha.pdf')
ggmcmc(S, family = 'gamma', plot = 'traceplot',
file = 'output/d5-zip-zip-traceplot-gamma.pdf')
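To check whether the ZIP model accounts for the excess zeros, we can simulate replicated outcomes from the posterior draws in R. This is only a sketch: it reconstructs \(\theta_n\) and \(\lambda_n\) from the alpha and gamma draws according to the model above, and the object names (mcs_zip, y_rep) are illustrative.
mcs_zip <- rstan::extract(fit_zip_zip)
# theta_n = logit^-1(alpha1 + alpha2 * P_n); lambda_n = exp(gamma1 + gamma2 * C_n + gamma3 * K_n)
theta  <- plogis(mcs_zip$alpha[, 1] + outer(mcs_zip$alpha[, 2], d$persons))
lambda <- exp(mcs_zip$gamma[, 1] + outer(mcs_zip$gamma[, 2], d$child) +
              outer(mcs_zip$gamma[, 3], d$camper))
# replicate the two-step process: Z ~ Bernoulli(theta), Y = Z * Poisson(lambda)
z <- matrix(rbinom(length(theta), 1, theta), nrow = nrow(theta))
y_rep <- z * matrix(rpois(length(lambda), lambda), nrow = nrow(lambda))
mean(y_rep == 0)    # predicted proportion of zeros
mean(d$count == 0)  # observed proportion of zeros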
Looking at the coefficient estimates alone, we do not see a substantial improvement over the Poisson model. However, the ZIP model allows us to (partially) explain what generates the large number of zeros.
For more information see the UCLA IDRE website and read: