\documentclass[nojss]{jss}

%% need no \usepackage{Sweave}
\usepackage{thumbpdf}

%% new commands
\newcommand{\class}[1]{``\code{#1}''}
\newcommand{\fct}[1]{\code{#1()}}

\author{Achim Zeileis\\Universit\"at Innsbruck \And
        Christian Kleiber\\Universit\"at Basel \And
	Simon Jackman\\Stanford University}
\Plainauthor{Achim Zeileis, Christian Kleiber, Simon Jackman}

\title{Regression Models for Count Data in \proglang{R}}
\Plaintitle{Regression Models for Count Data in R}

\Keywords{GLM, Poisson model, negative binomial model, hurdle model, zero-inflated model}

\Abstract{
  The classical Poisson, geometric and negative binomial regression models 
  for count data belong to the family of generalized linear models and
  are available at the core of the statistics toolbox in the \proglang{R}
  system for statistical computing. After reviewing the conceptual and
  computational features of these methods, a new implementation of
  hurdle and zero-inflated regression models in the functions \fct{hurdle}
  and \fct{zeroinfl} from the package \pkg{pscl} is introduced. It re-uses
  design and functionality of the basic \proglang{R} functions just
  as the underlying conceptual tools extend the classical models.
  Both hurdle and zero-inflated model, are able to incorporate over-dispersion and
  excess zeros---two problems
  that typically occur in count data sets in economics and the social
  sciences---better than their classical counterparts.
  Using cross-section data on the demand for medical care, it is illustrated
  how the classical as well as the zero-augmented models can be fitted,
  inspected and tested in practice.
}

\Address{
  Achim Zeileis\\
  Department of Statistics\\
  Universit\"at Innsbruck\\
  Universit\"atsstr.~15\\
  6020 Innsbruck, Austria\\
  E-mail: \email{Achim.Zeileis@R-project.org}\\
  URL: \url{http://statmath.wu-wien.ac.at/~zeileis/}
}

\begin{document}

\SweaveOpts{engine=R, eps=FALSE, keep.source = TRUE}
%\VignetteIndexEntry{Regression Models for Count Data in R}
%\VignetteDepends{sandwich,zoo,lmtest,MASS,car}
%\VignetteKeywords{GLM, Poisson model, negative binomial model, hurdle model, zero-inflated model}
%\VignettePackage{pscl}

<<preliminaries, echo=FALSE, results=hide>>=
library("sandwich")
library("lmtest")
library("MASS")
library("car")
library("pscl")
load("DebTrivedi.rda")

clog <- function(x) log(x + 0.5)
cfac <- function(x, breaks = NULL) {
  if(is.null(breaks)) breaks <- unique(quantile(x, 0:10/10))
  x <- cut(x, breaks, include.lowest = TRUE, right = FALSE)
  levels(x) <- paste(breaks[-length(breaks)], ifelse(diff(breaks) > 1,
    c(paste("-", breaks[-c(1, length(breaks))] - 1, sep = ""), "+"), ""), sep = "")
  return(x)
}

options(prompt = "R> ")
refit_models <- TRUE
@

\section{Introduction} \label{sec:intro}

Modeling count variables is a common task in economics and the social
sciences. The classical Poisson regression model for count
data is often of limited use in these disciplines because empirical
count data sets typically exhibit over-dispersion and/or an excess
number of zeros. The former issue can be addressed by extending 
the plain Poisson regression model in various directions: e.g.,
using sandwich covariances or estimating an additional dispersion
parameter (in a so-called quasi-Poisson model). Another more formal
way is to use a negative binomial (NB) regression. All of these models
belong to the family of generalized linear models 
\citep[GLMs, see][]{countreg:Nelder+Wedderburn:1972,countreg:McCullagh+Nelder:1989}.
However, although these models typically can capture over-dispersion
rather well, they are in many applications not sufficient for 
modeling excess zeros. Since \cite{countreg:Mullahy:1986} and
\cite{countreg:Lambert:1992} there is increased interest, both in the
econometrics and statistics literature, in zero-augmented models
that address this issue by a second model component capturing zero counts.
Hurdle models \citep{countreg:Mullahy:1986} combine a left-truncated
count component with a right-censored hurdle component. Zero-inflation
models \citep{countreg:Lambert:1992} take a somewhat different
approach: they are mixture models that combine a count component and a point mass at zero.
An overview of count data models in econometrics, including 
hurdle and zero-inflated models, is provided in
\cite{countreg:Cameron+Trivedi:1998,countreg:Cameron+Trivedi:2005}.

In \proglang{R} \citep{countreg:R:2008}, GLMs are provided
by the model fitting functions \fct{glm} \citep{countreg:Chambers+Hastie:1992}
in the \pkg{stats} package and \fct{glm.nb} in the \pkg{MASS}
package \citep{countreg:Venables+Ripley:2002} along with associated
methods for diagnostics and inference. Here, we discuss the implementation
of hurdle and zero-inflated models in the functions \fct{hurdle} and
\fct{zeroinfl} in the \pkg{pscl} package \citep{countreg:Jackman:2008},
available from the Comprehensive \proglang{R} Archive Network (CRAN)
at \url{http://CRAN.R-project.org/package=pscl}.
The design of both modeling functions as well as the methods operating
on the associated fitted model objects follows that of the base \proglang{R}
functionality so that the new software integrates easily into the
computational toolbox for modeling count data in \proglang{R}.

The remainder of this paper is organized as follows: Section~\ref{sec:software}
discusses both the classical and zero-augmented count data models and their
\proglang{R} implementations. In Section~\ref{sec:illustrations}, all
count regression models discussed are applied to a microeconomic
cross-section data set on the demand for medical care. The summary in
Section~\ref{sec:summary} concludes the main part of the paper; further
technical details are presented in the appendix.


\section{Models and software} \label{sec:software}

\begin{table}[b!]
\begin{center}
\begin{tabular}{|l|l|l|p{7.4cm}|}
\hline
Type & Distribution & Method & Description \\ \hline
GLM & Poisson & ML & Poisson regression: classical GLM, estimated by maximum likelihood (ML) \\
    &         & quasi & ``quasi-Poisson regression'':
                            same mean function, estimated by quasi-ML (QML) or equivalently
			    generalized estimating equations (GEE),
			    inference adjustment via estimated dispersion parameter \\
    &         & adjusted & ``adjusted Poisson regression'': same mean function,
                           estimated by QML/GEE,
			   inference adjustment via sandwich covariances\\
    & NB      & ML    & NB regression:
                        extended GLM, estimated by ML including additional shape parameter \\ \hline
zero-augmented & Poisson & ML & zero-inflated Poisson (ZIP), hurdle Poisson \\
               & NB      & ML & zero-inflated NB (ZINB), hurdle NB \\ \hline
\end{tabular}
\caption{\label{tab:overview} Overview of discussed count regression models. All GLMs use the
same log-linear mean function ($\log(\mu) = x^\top \beta$) but make different assumptions about
the remaining likelihood. The zero-augmented models extend the mean function by 
modifying (typically, increasing) the likelihood of zero counts.}
\end{center}
\end{table}

In this section, we briefly outline the theory and its implementation in 
\proglang{R} \citep{countreg:R:2008} for some basic count data regression
models as well as their zero-augmented extensions (see Table~\ref{tab:overview}
for an overview). The classical Poisson, geometric
and negative binomial models are described in a generalized linear model (GLM)
framework; they are implemented in \proglang{R} by the \fct{glm} function
\citep{countreg:Chambers+Hastie:1992} in the \pkg{stats} package and the
\fct{glm.nb} function in the \pkg{MASS} package \citep{countreg:Venables+Ripley:2002}.
The hurdle and zero-inflated extensions of these models are provided by the
functions \fct{hurdle} and \fct{zeroinfl} in package \pkg{pscl} \citep{countreg:Jackman:2008}.
The original implementation of \cite{countreg:Jackman:2008} was improved
by \cite{countreg:Kleiber+Zeileis:2008} for \pkg{pscl} to make the fitting functions and the
fitted model objects more similar to their \fct{glm} and \fct{glm.nb} counterparts.
The most important features of the new \fct{hurdle} and \fct{zeroinfl} functions
are discussed below while some technical aspects are deferred to the appendix.

An alternative implementation of zero-inflated count models is available in
the currently orphaned package \pkg{zicounts}
\citep{countreg:Mwalili:2007}. Another extension of zero-inflated Poisson models is available
in package \pkg{ZIGP} \citep{countreg:Erhardt:2008} which allows dispersion---in addition to
mean and zero-inflation level---to depend on regressors.
However, the interfaces of both packages are less standard with fewer (or no)
standard methods provided. Therefore, re-using generic inference tools is more
cumbersome and hence these packages are not discussed here. 

Two packages that embed zero-inflated models into more general implementations
of GLMs and GAMs (generalized additive models) are \pkg{gamlss} 
\citep{countreg:Stasinopoulos+Rigby:2007} and \pkg{VGAM} \citep{countreg:Yee:2008}.
The latter also provides hurdle models (under the name zero-altered models).
Both implementations allow specification of only one set of regressors.

In addition to zero-augmented models, there are many further extensions to
the classical Poisson model which are not discussed here. Some important model
classes include finite mixture models---implemented in \proglang{R} in package \pkg{flexmix}
\citep{countreg:Leisch:2004}---and generalized estimating equations
(GEE)---provided in \proglang{R} by package \pkg{geepack}
\citep{countreg:Halekoh+Hojsgaard+Yan:2006}---and mixed-effects models---available
in \proglang{R} in packages \pkg{lme4} and \pkg{nlme} \citep[see][]{countreg:Pinheiro+Bates:2000}.
Further information about the models and alternative \proglang{R} implementations
can be found in the respective references.

\subsection{Generalized linear models}

\subsubsection{Model frame}

The basic count data regression models can
be represented and understood using the GLM
framework that emerged in the statistical literature in the early 1970s
\citep{countreg:Nelder+Wedderburn:1972}. In the following, we briefly
sketch some important aspects relating to the unifying conceptual properties
and their implementation in \proglang{R}---for a detailed theoretical account of GLMs
see \cite{countreg:McCullagh+Nelder:1989}.

GLMs describe the dependence of a scalar variable $y_i$ ($i = 1, \dots, n$)
on a vector of regressors $x_i$. The conditional distribution of
$y_i | x_i$ is a linear exponential family with probability density function
\begin{equation} \label{eq:family}
f(y; \lambda, \phi) \quad = \quad
                              \exp \left( \frac{y \cdot \lambda - b(\lambda)}{\phi} +
                              c(y, \phi) \right),
\end{equation}
where $\lambda$ is the canonical parameter that depends on the regressors via
a linear predictor and $\phi$ is a dispersion parameter that is often known.
The functions $b(\cdot)$ and $c(\cdot)$ are known and determine which member
of the family is used, e.g., the normal, binomial or Poisson distribution.
Conditional mean and variance of $y_i$ are given by
$\E[y_i \, | \, x_i] = \mu_i = b'(\lambda_i)$ and
$\VAR[y_i \, | \, x_i] = \phi \cdot b''(\lambda_i)$. Thus, up to a scale
or dispersion parameter $\phi$, the distribution of $y_i$ is determined by
its mean. Its variance is proportional to $V(\mu) = b''(\lambda(\mu))$,
also called variance function.

The dependence of the conditional mean $\E[y_i \, | \, x_i] = \mu_i$
on the regressors $x_i$ is specified via
\begin{equation} \label{eq:mean}
g(\mu_i) \quad = \quad x_i^\top \beta,
\end{equation}
where $g(\cdot)$ is a known link function and $\beta$ is the vector of regression
coefficients which are typically estimated by maximum likelihood (ML)
using the iterative weighted least squares (IWLS) algorithm.

Instead of viewing GLMs as models for the full likelihood (as determined by
Equation~\ref{eq:family}), they can also be regarded as regression models for the
mean only (as specified in Equation~\ref{eq:mean}) where the estimating functions
used for fitting the model are derived from a particular family. As illustrated 
in the remainder of this section, the estimating function point of view is particularly
useful for relaxing the assumptions imposed by the Poisson likelihood.

\proglang{R} provides a very flexible implementation of the general GLM
framework in the function \fct{glm} \citep{countreg:Chambers+Hastie:1992}
contained in the \pkg{stats} package. Its most important arguments are
\begin{Soutput}
glm(formula, data, subset, na.action, weights, offset,
  family = gaussian, start = NULL, control = glm.control(...),
  model = TRUE, y = TRUE, x = FALSE, ...)
\end{Soutput}
where \code{formula} plus \code{data} is the now standard way of specifying
regression relationships in \proglang{R}/\proglang{S} introduced in
\cite{countreg:Chambers+Hastie:1992}. The remaining arguments in the first line
(\code{subset}, \code{na.action}, \code{weights}, and \code{offset}) are also standard 
for setting up formula-based regression models in \proglang{R}/\proglang{S}.
The arguments in the second line control aspects specific to GLMs while
the arguments in the last line specify which components are returned
in the fitted model object (of class \class{glm} which inherits from
\class{lm}). By default the model frame (\code{model}) and the vector
$(y_1, \dots, y_n)^\top$ (\code{y}) but not the model matrix (\code{x},
containing $x_1, \dots, x_n$ combined row-wise) are included. The
\code{family} argument specifies the link $g(\mu)$ and variance
function $V(\mu)$ of the model, \code{start} can be used to set 
starting values for $\beta$, and \code{control} contains control parameters for the IWLS
algorithm. For further arguments to \fct{glm} (including alternative specifications
of starting values) see \code{?glm}.
The high-level \fct{glm} interface relies on the function	
\fct{glm.fit} which carries out the actual model fitting (without
taking a formula-based input or returning classed output).

For \class{glm} objects, a set of standard methods (including \fct{print},
\fct{predict}, \fct{logLik} and many others) are provided. Inference can 
easily be performed using the \fct{summary} method for assessing the
regression coefficients via partial Wald tests or the \fct{anova} method
for comparing nested models via an analysis of deviance. These inference
functions are complemented by further generic inference functions in
contributed packages: e.g., \pkg{lmtest} \citep{countreg:Zeileis+Hothorn:2002}
provides a \fct{coeftest} function that also computes partial Wald tests
but allows for specification of alternative (robust) standard errors. Similarly,
\fct{waldtest} from \pkg{lmtest} and \fct{linearHypothesis} from \pkg{car}
\citep{countreg:Fox:2002} assess nested models via Wald tests (using
different specifications for the nested models). Finally, \fct{lrtest}
from \pkg{lmtest} compares nested models via likelihood ratio (LR) tests
based on an interface similar to \fct{waldtest} and \fct{anova}.


\subsubsection{Poisson model}

The simplest distribution used for modeling count data is the 
Poisson distribution with probability density function
\begin{equation} \label{eq:Poisson}
f(y; \mu) \quad = \quad \frac{\exp(-\mu) \cdot \mu^{y}}{y!},
\end{equation}
which is of type~(\ref{eq:family}) and thus Poisson regression
is a special case of the GLM framework. The canonical link is
$g(\mu) = \log(\mu)$ resulting in a log-linear relationship
between mean and linear predictor. The variance in the Poisson
model is identical to the mean, thus the dispersion is
fixed at $\phi = 1$ and the variance function is $V(\mu) = \mu$.

In \proglang{R}, this can easily be specified in the \fct{glm} 
call just by setting \code{family = poisson} (where the default log link
could also be changed in the \fct{poisson} call).

In practice, the Poisson model is often useful for describing the
mean $\mu_i$ but underestimates the variance in the data, rendering
all model-based tests liberal. One way of dealing with this is
to use the same estimating functions for the mean, but to base
inference on the more robust sandwich covariance matrix estimator.
In \proglang{R}, this estimator is provided by the \fct{sandwich}
function in the \pkg{sandwich} package \citep{countreg:Zeileis:2004,countreg:Zeileis:2006}.


\subsubsection{Quasi-Poisson model}

Another way of dealing with over-dispersion is to use the mean
regression function and the variance function from the Poisson GLM
but to leave the dispersion parameter $\phi$ unrestricted. Thus,
$\phi$ is not assumed to be fixed at $1$ but is estimated from
the data. This strategy leads
to the same coefficient estimates as the standard Poisson model
but inference is adjusted for over-dispersion. Consequently,
both models (quasi-Poisson and sandwich-adjusted Poisson)
adopt the estimating function view of the Poisson model and
do \emph{not} correspond to models with fully specified likelihoods.

In \proglang{R}, the quasi-Poisson model with estimated dispersion
parameter can also be fitted with the \fct{glm} function, simply
setting \code{family = quasipoisson}.

\subsubsection{Negative binomial models}

A third way of modeling over-dispersed count data is to assume
a negative binomial (NB) distribution for $y_i | x_i$ which can arise
as a gamma mixture of Poisson distributions. One parameterization of
its probability density function is
\begin{equation} \label{eq:negbin}
f(y; \mu, \theta) \quad = \quad \frac{\Gamma(y + \theta)}{\Gamma(\theta) \cdot y!} \cdot
                            \frac{\mu^{y} \cdot \theta^\theta}{(\mu + \theta)^{y + \theta}},
\end{equation}
with mean $\mu$ and shape parameter $\theta$; $\Gamma(\cdot)$ is the
gamma function. For every fixed
$\theta$, this is of type~(\ref{eq:family}) and thus is another
special case of the GLM framework. It also has $\phi = 1$
but with variance function $V(\mu) = \mu + \frac{\mu^2}{\theta}$.

Package \pkg{MASS} \citep{countreg:Venables+Ripley:2002} provides
the family function \fct{negative.binomial} that can directly
be plugged into \fct{glm} provided the argument \code{theta} is
specified. One application would be the geometric model, the
special case where $\theta = 1$, which can consequently be
fitted in \proglang{R} by setting
\code{family = negative.binomial(theta = 1)} in the \fct{glm}
call.

If $\theta$ is not known but to be estimated from the data,
the negative binomial model is not a special case of the general
GLM---however, an ML fit can easily be computed re-using GLM
methodology by iterating estimation of $\beta$ given $\theta$
and vice versa. This leads to ML estimates for both $\beta$
and $\theta$ which can be computed using the function \fct{glm.nb}
from the package \pkg{MASS}. It returns a model of class \class{negbin}
inheriting from \class{glm} for which appropriate methods
to the generic functions described above are again available.


\subsection{Hurdle models}

In addition to over-dispersion, many empirical count data sets
exhibit more zero observations than would be allowed for
by the Poisson model. One model class capable of capturing
both properties is the hurdle model, 
originally proposed by \cite{countreg:Mullahy:1986} in the econometrics literature
\citep[see][for an overview]{countreg:Cameron+Trivedi:1998,countreg:Cameron+Trivedi:2005}.
They are two-component models:
A truncated count component, such as Poisson, geometric or negative binomial,
is employed for positive counts, and
a hurdle component models zero vs.\ larger counts.
For the latter, either a binomial model or a censored count distribution can be employed.

More formally, the hurdle model combines a count data model 
$f_\mathrm{count}(y; x, \beta)$ (that is left-truncated at $y = 1$) and a 
zero hurdle model $f_\mathrm{zero}(y; z, \gamma)$ (right-censored at $y = 1$):
\begin{equation} \label{eq:hurdle}
f_\mathrm{hurdle}(y; x, z, \beta, \gamma) =
  \left\{
  \begin{array}{ll}
  f_\mathrm{zero}(0; z, \gamma) & \mbox{if } y = 0, \\
  (1 - f_\mathrm{zero}(0; z, \gamma)) \cdot
  f_\mathrm{count}(y; x, \beta)/(1 - f_\mathrm{count}(0; x, \beta)) & \mbox{if } y > 0
  \end{array}
  \right.
\end{equation}
The model parameters $\beta$, $\gamma$, and potentially one or two additional
dispersion parameters $\theta$
(if $f_\mathrm{count}$ or $f_\mathrm{zero}$ or both are negative binomial densities)
are estimated by ML, where the specification of the likelihood has the advantage that
the count and the hurdle component can be maximized separately. The corresponding
mean regression relationship is given by
\begin{equation} \label{eq:hurdle-mean}
\log(\mu_i) \quad = \quad x_i^\top \beta +
                    \log(1 - f_\mathrm{zero}(0; z_i, \gamma)) -
		    \log(1 - f_\mathrm{count}(0; x_i, \beta)),
\end{equation}
again using the canonical log link.
For interpreting the zero model as a hurdle, a binomial GLM is probably the most 
intuitive specification\footnote{Note that
binomial logit and censored geometric models as the hurdle part both lead to 
the same likelihood function and thus to the same coefficient estimates \citep{countreg:Mullahy:1986}.}.
Another useful interpretation arises if the same regressors $x_i = z_i$ are
used in the same count model in both components $f_\mathrm{count} = f_\mathrm{zero}$: 
A test of the hypothesis $\beta = \gamma$ then tests whether the hurdle is
needed or not.

In \proglang{R}, hurdle count data models can be fitted with the
\fct{hurdle} function from the \pkg{pscl} package \citep{countreg:Jackman:2008}.
Both its fitting function and the returned model objects of class \class{hurdle}
are modelled after the corresponding GLM functionality in \proglang{R}. The
arguments of \fct{hurdle} are given by

\begin{Soutput}
hurdle(formula, data, subset, na.action, weights, offset,
  dist = "poisson", zero.dist = "binomial", link = "logit",
  control = hurdle.control(...),
  model = TRUE, y = TRUE, x = FALSE, ...)
\end{Soutput}

where the first line contains the standard model-frame specifications,
the second and third lines have the arguments specific to hurdle models and
the arguments in the last line control some components of the return value.

If a \code{formula} of type \code{y ~ x1 + x2} is supplied, it not only
describes the count regression relationship of $y_i$ and $x_i$ but also implies that
the same set of regressors is used for the zero hurdle component $z_i = x_i$.
This is could be made more explicit by equivalently writing the formula as
\code{y ~ x1 + x2 | x1 + x2}. Of course, a different set of regressors
could be specified for the zero hurdle component, e.g.,
\code{y ~ x1 + x2 | z1 + z2 + z3}, giving the count data model \code{y ~ x1 + x2}
conditional on (\code{|}) the zero hurdle model \code{y ~ z1 + z2 + z3}.

The model likelihood can be specified by the \code{dist}, \code{zero.dist} and \code{link} arguments.
The count data distribution \code{dist} is \code{"poisson"} by default (it
can also be set to \code{"negbin"} or \code{"geometric"}), for which the canonical log link
is always used. The distribution for the zero hurdle model can be specified via
\code{zero.dist}. The default is a binomial model with \code{link} (defaulting
to \code{"logit"}, but all link functions of the \fct{binomial} family are also supported),
alternatively a right-censored count distribution (Poisson,
negative binomial or geometric, all with log link) could be specified.

ML estimation of all parameters employing analytical gradients is carried out
using \proglang{R}'s \fct{optim} with control options set in \fct{hurdle.control}.
Starting values can be user-supplied, otherwise they are estimated by \fct{glm.fit}
(the default). The covariance matrix estimate is derived numerically using
the Hessian matrix returned by \fct{optim}.
See Appendix~\ref{app:hurdle} for further technical details.

The returned fitted-model object of class \class{hurdle} is a list similar
to \class{glm} objects. Some of its elements---such as \code{coefficients} or
\code{terms}---are lists with a zero and count component,
respectively. For details see Appendix~\ref{app:hurdle}.

A set of standard extractor functions for fitted model objects is available for
objects of class \class{hurdle}, including the usual \fct{summary} method that
provides partial Wald tests for all coefficients. No \fct{anova} method is provided,
but the general \fct{coeftest}, \fct{waldtest} from \pkg{lmtest}, and \fct{linearHypothesis}
from \pkg{car} can be used for Wald tests and \fct{lrtest} from \pkg{lmtest}
for LR tests of nested models. The function \fct{hurdletest} is a convenience
interface to \fct{linearHypothesis} for testing for the presence of a hurdle
(which is only applicable if the same regressors and the same count distribution
are used in both components).


\subsection{Zero-inflated models}

Zero-inflated models \citep{countreg:Mullahy:1986,countreg:Lambert:1992}
are another model class capable of dealing with excess zero counts 
\citep[see][for an overview]{countreg:Cameron+Trivedi:1998,countreg:Cameron+Trivedi:2005}.
They are two-component mixture models
combining a point mass at zero with a count distribution such as
Poisson, geometric or negative binomial. Thus, there are two
sources of zeros: zeros may come from both the point mass and
from the count component. For modeling the unobserved state
(zero vs.\ count), a binary model is used: in the simplest case
only with an intercept but potentially containing regressors.

Formally, the zero-inflated density is a mixture of a point mass
at zero $I_{\{0\}}(y)$ and a count distribution $f_\mathrm{count}(y; x, \beta)$.
The probability of observing a zero count is inflated with probability
$\pi = f_\mathrm{zero}(0; z, \gamma)$:
\begin{equation} \label{eq:zeroinfl}
f_\mathrm{zeroinfl}(y; x, z, \beta, \gamma) \quad = \quad
  f_\mathrm{zero}(0; z, \gamma) \cdot I_{\{0\}}(y) \; + \;
  (1 - f_\mathrm{zero}(0; z, \gamma)) \cdot f_\mathrm{count}(y; x, \beta),
\end{equation}
where $I(\cdot)$ is the indicator function and the unobserved probability $\pi$
of belonging to the point mass component is modelled by a binomial GLM
$\pi =  g^{-1}(z^\top \gamma)$.
The corresponding regression equation for the mean is
\begin{equation} \label{eq:zeroinfl-mean}
\mu_i \quad = \quad \pi_i \cdot 0 \; + \; (1 - \pi_i) \cdot \exp(x_i^\top \beta),
\end{equation}
using the canonical log link.
The vector of regressors in the zero-inflation model $z_i$
and the regressors in the count component $x_i$ need not to be distinct---in
the simplest case, $z_i = 1$ is just
an intercept. The default link function $g(\pi)$ in binomial GLMs is the
logit link, but other links such as the probit are also available. The full
set of parameters of $\beta$, $\gamma$, and potentially the dispersion parameter $\theta$ (if 
a negative binomial count model is used) can be estimated by ML. Inference
is typically performed for $\beta$ and $\gamma$, while $\theta$ is treated
as a nuisance parameter even if a negative binomial model is used.

In \proglang{R}, zero-inflated count data models can be fitted with the
\fct{zeroinfl} function from the \pkg{pscl} package. Both 
the fitting function interface and the returned model objects of class \class{zeroinfl}
are almost identical to the corresponding \fct{hurdle} functionality and again
modelled after the corresponding GLM functionality in \proglang{R}. The
arguments of \fct{zeroinfl} are given by

\begin{Soutput}
zeroinfl(formula, data, subset, na.action, weights, offset,
  dist = "poisson", link = "logit", control = zeroinfl.control(...),
  model = TRUE, y = TRUE, x = FALSE, ...)
\end{Soutput}

where all arguments have almost the same meaning as for \fct{hurdle}. The main
difference is that there is no \code{zero.dist} argument: a binomial model is
always used for distribution in the zero-inflation component.

Again, ML estimates of all parameters are obtained from \fct{optim},
with control options set in \fct{zeroinfl.control} and employing analytical gradients.
Starting values can be user-supplied, estimated by the expectation maximization (EM)
algorithm, or by \fct{glm.fit} (the default).
The covariance matrix estimate is derived numerically using
the Hessian matrix returned by \fct{optim}. Using EM estimation for
deriving starting values is typically slower but can be numerically more stable.
It already maximizes the likelihood, but a single \fct{optim} iteration is used
for determining the covariance matrix estimate.
See Appendix~\ref{app:zeroinfl} for further technical details.

The returned fitted model object is of class \class{zeroinfl} whose structure
is virtually identical to that of \class{hurdle} models. As above,
a set of standard extractor functions for fitted model objects is available for
objects of class \class{zeroinfl}, including the usual \fct{summary} method that
provides partial Wald tests for all coefficients. Again, no \fct{anova} method is provided,
but the general functions \fct{coeftest} and \fct{waldtest} from \pkg{lmtest},
as well as \fct{linearHypothesis} from \pkg{car} can be used for Wald tests,
and \fct{lrtest} from \pkg{lmtest} for LR tests of nested models. 


\section{Application and illustrations} \label{sec:illustrations}

In the following, we illustrate all models described above by applying
them to a cross-sectional data set from health economics. Before the parametric models are fitted,
a basic exploratory analysis of the data set is carried out that addresses
some problems typically encountered when visualizing count data. At the
end of the section, all fitted models are compared highlighting that
the modelled mean function is similar but the fitted likelihood
is different and thus, the models differ with respect to explaining
over-dispersion and/or the number of zero counts.


\subsection{Demand for medical care by the elderly}

\cite{countreg:Deb+Trivedi:1997} analyze data on 4406
individuals, aged 66 and over, who are covered by Medicare, a public
insurance program. Originally obtained from the US National Medical
Expenditure Survey (NMES) for 1987/88, the data are available from the data archive of the
\textit{Journal of Applied Econometrics} at
\url{http://www.econ.queensu.ca/jae/1997-v12.3/deb-trivedi/}. It was
prepared for an \proglang{R} package accompanying
\cite{countreg:Kleiber+Zeileis:2008} and is also available as
\code{DebTrivedi.rda} in the \textit{Journal of Statistical Software}
together with \cite{countreg:Zeileis:2006}. The objective is to model
the demand for medical care---as captured by the number of physician/non-physician
office and hospital outpatient visits---by the covariates available
for the patients. Here, we adopt the number of physician office visits \code{ofp}
as the dependent variable and use the health status variables
\code{hosp} (number of hospital stays),
\code{health} (self-perceived health status),
\code{numchron} (number of chronic conditions),
as well as the socio-economic variables
\code{gender},
\code{school} (number of years of education), and
\code{privins} (private insurance indicator) as regressors. For convenience, we
select the variables used from the full data set:

<<dt>>=
dt <- DebTrivedi[, c(1, 6:8, 13, 15, 18)]
@

<<dt2, echo=FALSE>>=
dt2 <- DebTrivedi[, -(2:6)]
dt2$region <- relevel(dt2$region, "other")
@

To obtain a first overview of the dependent variable, we employ a histogram of the
observed count frequencies. In \proglang{R} various tools could be used, e.g.,
\code{hist(dt$ofp, breaks = 0:90 - 0.5)} for a histogram with rectangles or
<<ofp-plot, eval=FALSE>>=
plot(table(dt$ofp))
@
(see Figure~\ref{fig:ofp}) for a histogram with lines which brings out the extremely
large counts somewhat better. The histogram illustrates that the marginal distribution
exhibits both substantial variation and a rather large number of zeros.

\setkeys{Gin}{width=.5\textwidth} 
\begin{figure}[p]
\begin{center}
<<ofp-plot2, echo=FALSE, fig=TRUE, height=5, width=5>>=
plot(table(dt$ofp), xlab = "Number of physician office visits", ylab = "Frequency", axes = FALSE)
axis(2)
axis(1, at = 0:18 * 5, labels = FALSE)
axis(1, at = 0:9 * 10)
@
\caption{\label{fig:ofp} Frequency distribution for number of physician office visits.}
\end{center}
\end{figure}

\setkeys{Gin}{width=\textwidth} 
\begin{figure}[p]
\begin{center}
<<bad-good, fig=TRUE, height=5, width=10, echo=FALSE>>=
par(mfrow = c(1, 2))
plot(ofp ~ numchron, data = dt)
plot(clog(ofp) ~ cfac(numchron), data = dt)
@
\caption{\label{fig:bad-good} Bivariate explorative displays for number of physician office visits
  plotted against number of chronic conditions.}
\end{center}
\end{figure}

A natural second step in the exploratory analysis is to look at pairwise bivariate 
displays of the dependent variable against each of the regressors bringing out the
partial relationships. In \proglang{R}, such bivariate displays
can easily be generated with the \fct{plot} method for formulas, e.g., via \code{plot(y ~ x)}.
This chooses different types of displays depending on the combination of quantitative
and qualitative variables as dependent or regressor variable, respectively. However,
count variables are treated as all numerical variables and therefore the command

<<bad, eval=FALSE>>=
plot(ofp ~ numchron, data = dt)
@

produces a simple scatterplot as shown in the left panel of Figure~\ref{fig:bad-good}.
This is clearly not useful as both variables are count variables producing numerous
ties in the bivariate distribution and thus obscuring a large number of points in
the display. To overcome the problem, it is useful to group the number of chronic
conditions into a factor with levels `0', `1', `2', and `3 or more' and produce a
boxplot instead of a scatterplot. Furthermore, the picture is much clearer if the
dependent variable is log-transformed (just as all count regression models discussed
above also use a log link by default). As there are zero counts as well, we use a
convenience function \fct{clog} providing a continuity-corrected logarithm.

<<clog>>=
clog <- function(x) log(x + 0.5)
@

For transforming a count variable to a factor (for visualization purposes only),
we define another convenience function \fct{cfac}

<<cfac>>=
cfac <- function(x, breaks = NULL) {
  if(is.null(breaks)) breaks <- unique(quantile(x, 0:10/10))
  x <- cut(x, breaks, include.lowest = TRUE, right = FALSE)
  levels(x) <- paste(breaks[-length(breaks)], ifelse(diff(breaks) > 1,
    c(paste("-", breaks[-c(1, length(breaks))] - 1, sep = ""), "+"), ""),
    sep = "")
  return(x)
}
@

which by default tries to take an educated guess how to choose the breaks between the categories.
Clearly, the resulting exploratory display of the transformed variables produced by

<<good, eval=FALSE>>=
plot(clog(ofp) ~ cfac(numchron), data = dt)
@

(shown in the right panel of Figure~\ref{fig:bad-good}) brings out much better
how the number of doctor visits increases with the number of chronic conditions.

\setkeys{Gin}{width=\textwidth} 
\begin{figure}[p]
\begin{center}
<<ofp2-plot1, fig=TRUE, height=9, width=6.5, echo=FALSE>>=
par(mfrow = c(3, 2))
plot(clog(ofp) ~ health, data = dt, varwidth = TRUE,
  ylab = "Physician office visits (in clogs)", xlab = "Self-perceived health status", main = "health")
plot(clog(ofp) ~ cfac(numchron), data = dt,
  ylab = "Physician office visits (in clogs)", xlab = "Number of chronic conditions", main = "numchron")
plot(clog(ofp) ~ privins, data = dt, varwidth = TRUE,
  ylab = "Physician office visits (in clogs)", xlab = "Covered by private insurance", main = "privins")
plot(clog(ofp) ~ cfac(hosp, c(0:2, 8)), data = dt,
  ylab = "Physician office visits (in clogs)", xlab = "Number of hospital stays", main = "hosp")
plot(clog(ofp) ~ gender, data = dt, varwidth = TRUE,
  ylab = "Physician office visits (in clogs)", xlab = "Gender", main = "gender")
plot(cfac(ofp, c(0:2, 4, 6, 10, 100)) ~ school, data = dt, breaks = 9,
  ylab = "Physician office visits (number of visits)", xlab = "Number of years of education", main = "school")
@
\caption{\label{fig:ofp2} Number of physician office visits plotted against regressors used.}
\end{center}
\end{figure}

Analogous displays for the number of physician office visits against all regressors
can be produced via
<<ofp2, eval=FALSE>>=
plot(clog(ofp) ~ health, data = dt, varwidth = TRUE)
plot(clog(ofp) ~ cfac(numchron), data = dt)
plot(clog(ofp) ~ privins, data = dt, varwidth = TRUE)
plot(clog(ofp) ~ cfac(hosp, c(0:2, 8)), data = dt)
plot(clog(ofp) ~ gender, data = dt, varwidth = TRUE)
plot(cfac(ofp, c(0:2, 4, 6, 10, 100)) ~ school, data = dt, breaks = 9)
@
and are shown (with slightly enhanced labeling) in Figure~\ref{fig:ofp2}. 
The last plot uses a different type of display. Here, the dependent count variable
is not log-transformed but grouped into a factor and then a spinogram
is produced. This also groups the regressor (as in a histogram) and then
produces a highlighted mosaic plot. All displays show that the number of
doctor visits increases or decreases with the regressors as expected: \code{ofp}
decreases with the general health status but increases with the number of
chronic conditions or hospital stays. The median number of visits is also
slightly higher for patients with a private insurance and higher level of 
education. It is slightly lower for male compared to female patients.
The overall impression from all displays is that the changes in the mean
can only explain a modest amount of variation in the data.

<<models, echo=FALSE, results=hide>>=
if(refit_models & file.exists("countreg-models.rda")) file.remove("countreg-models.rda")
if(file.exists("countreg-models.rda")) {
  load("countreg-models.rda")
} else {
  fm_pois   <-      glm(ofp ~ ., data = dt, family = poisson)
  fm_qpois  <-      glm(ofp ~ ., data = dt, family = quasipoisson)
  fm_nbin   <-   MASS::glm.nb(ofp ~ ., data = dt)
  fm_zinb0  <- zeroinfl(ofp ~ ., data = dt, dist = "negbin")
  fm_zinb   <- zeroinfl(ofp ~ . | hosp + numchron + privins + school + gender, data = dt, dist = "negbin")
  fm_hurdle0<-   hurdle(ofp ~ ., data = dt, dist = "negbin")
  fm_hurdle <-   hurdle(ofp ~ . | hosp + numchron + privins + school + gender, data = dt, dist = "negbin")
  fm_hurdle2<-   hurdle(ofp ~ ., data = dt2, dist = "negbin")
  if(!refit_models) save(fm_pois, fm_qpois, fm_nbin, fm_zinb0, fm_zinb, fm_hurdle0, fm_hurdle, fm_hurdle2, file = "countreg-models.rda")
}
@

\subsection{Poisson regression}

As a first attempt to capture the relationship between the number of 
physician office visits and all regressors---described in \proglang{R} by the
formula \code{ofp ~ .}---in a parametric
regression model, we fit the basic Poisson regression model
<<poisson, eval=FALSE>>=
fm_pois <- glm(ofp ~ ., data = dt, family = poisson)
@
and obtain the coefficient estimates along with associated partial Wald tests
<<summary-poisson>>=
summary(fm_pois)
@
All coefficient estimates confirm the results from the exploratory analysis
in Figure~\ref{fig:ofp2}. All coefficients are highly significant with the
health variables leading to somewhat larger Wald statistics compared to the
socio-economic variables. However, the Wald test
results might be too optimistic due to a misspecification of the likelihood.
As the exploratory analysis suggested that over-dispersion is present in this data
set, we re-compute the Wald tests using sandwich standard errors via
<<coeftest-poisson>>=
coeftest(fm_pois, vcov = sandwich)
@
All regressors are still significant but the standard errors seem to be more
appropriate. This will also be confirmed by the following models that
deal with over-dispersion (and excess zeros) in a more formal way.

\subsection{Quasi-Poisson regression}

The quasi-Poisson model
<<quasipoisson, eval=FALSE>>=
fm_qpois <- glm(ofp ~ ., data = dt, family = quasipoisson)
@
leads to an estimated dispersion of
$\hat \phi = \Sexpr{round(summary(fm_qpois)$dispersion, digits = 3)}$ which
is clearly larger than $1$ confirming that over-dispersion is present in
the data.\footnote{Alternatively, over-dispersion can be confirmed by comparison
of the log-likelihoods of the Poisson and negative binomial model.}
The resulting partial Wald tests of the coefficients
are rather similar to the results obtained from the Poisson regression with
sandwich standard errors, leading to the same conclusions. As before, they
can be obtained via
<<summary-quasipoisson, eval=FALSE>>=
summary(fm_qpois)
@
The output is suppressed here and is presented in tabular form in Table~\ref{tab:summary}.


\subsection{Negative binomial regression}

A more formal way to accommodate over-dispersion in a count data regression
model is to use a negative binomial model, as in
<<nbin, eval=FALSE>>=
fm_nbin <- MASS::glm.nb(ofp ~ ., data = dt)
summary(fm_nbin)
@
As shown in Table~\ref{tab:summary},
both regression coefficients and standard errors are rather similar
to the quasi-Poisson and the sandwich-adjusted Poisson results above. Thus,
in terms of predicted means all three models give very similar results;
the associated partial Wald tests also lead to the same conclusions.

One advantage of the negative binomial model is that it is associated with a formal
likelihood so that information criteria are readily available. Furthermore, the expected
number of zeros can be computed from the fitted densities via 
$\sum_i f(0, \hat \mu_i, \hat \theta)$. 


\subsection{Hurdle regression}

The exploratory analysis conveyed the impression that there might be more zero observations
than explained by the basic count data distributions, hence a negative
binomial hurdle model is fitted via
<<hurdle0, eval=FALSE>>=
fm_hurdle0 <- hurdle(ofp ~ ., data = dt, dist = "negbin")
@
This uses the same type of count data model as in the preceding section
but it is now truncated for \code{ofp < 1} and has an additional hurdle
component modeling zero vs.\ count observations. By default, the hurdle
component is a binomial GLM with logit link which contains all regressors used in the count
model. The associated coefficient estimates and partial Wald tests for
both model components are displayed via
<<summary-hurdle0>>=
summary(fm_hurdle0)
@
The coefficients in the count component resemble those from the previous
models, but the increase in the log-likelihood (see also Table~\ref{tab:summary})
conveys that the model has improved by including the hurdle component.
However, it might be possible to omit the \code{health} variable from
the hurdle model. To test this hypothesis, the reduced model is fitted
via
<<hurdle, eval=FALSE>>=
fm_hurdle <- hurdle(ofp ~ . | hosp + numchron + privins + school + gender,
  data = dt, dist = "negbin")
@
and can then be compared to the full model in a Wald test
<<waldtest-hurdle>>=
waldtest(fm_hurdle0, fm_hurdle)
@
or an LR test
<<lrtest-hurdle, eval=FALSE>>=
lrtest(fm_hurdle0, fm_hurdle)
@
which leads to virtually identical results.

\begin{table}[p]
\begin{center}
\begin{tabular}{|l|rrrr|rr|} \hline
Type & \multicolumn{4}{|c|}{GLM} & \multicolumn{2}{|c|}{zero-augmented} \\
Distribution & \multicolumn{3}{|c}{Poisson} & \multicolumn{1}{c|}{NB} &
  \multicolumn{1}{|c}{Hurdle-NB} & \multicolumn{1}{c|}{ZINB} \\
Method & \multicolumn{1}{|c}{ML} & \multicolumn{1}{c}{adjusted} & \multicolumn{1}{c}{quasi} &
  \multicolumn{1}{c|}{ML} & \multicolumn{1}{|c}{ML} & \multicolumn{1}{c|}{ML} \\ 
Object & \code{fm_pois} & \code{fm_pois} & \code{fm_qpois} & \code{fm_nbin} & \code{fm_hurdle} & \code{fm_zinb} \\ \hline
<<summary-table, echo=FALSE, results=tex>>=
fm <- list("ML-Pois" = fm_pois, "Adj-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
  "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
fm_summary <- matrix(character(6 * 33), ncol = 6)
colnames(fm_summary) <- names(fm)
rownames(fm_summary) <- c(as.vector(rbind(names(coef(fm_hurdle, model = "count")), "")),
  as.vector(rbind(names(coef(fm_hurdle, model = "zero")), "")),
  "no.\\ parameters", "$\\log L$", "AIC", "BIC", "$\\sum_i \\hat f_i(0)$")
rownames(fm_summary)[1:28] <- ifelse(rownames(fm_summary)[1:28] == "", "", 
  paste("\\code{", rownames(fm_summary)[1:28], "}", sep = ""))
fm_summary[1:8 * 2 - 1,] <- sapply(fm,
  function(x) paste("$", format(round(coef(x)[1:8], digits = 3)), "$\\phantom{)}", sep = ""))
fm_summary[1:8 * 2,] <- sapply(
  c(list("ML-Pois" = vcov(fm_pois), "Adj-Pois" = sandwich(fm_pois)),
  lapply(fm[-(1:2)], function(x) vcov(x))), function(x)
  paste("(", format(round(sqrt(diag(x))[1:8], digits = 3)), ")", sep = ""))
fm_summary[1:6 * 2 + 15,] <- cbind(NA, NA, NA, NA, sapply(fm[5:6], function(x)
  paste("$", format(round(coef(x, model = "zero"), digits = 3)), "$\\phantom{)}", sep = "")))
fm_summary[1:6 * 2 + 16,] <- cbind(NA, NA, NA, NA, sapply(fm[5:6],
  function(x) paste("(", format(round(sqrt(diag(vcov(x)))[-(1:8)], digits = 3)),  ")", sep = "")))
fm_summary[29,] <- sapply(fm, function(x) attr(logLik(x), "df"))
fm_summary[30,] <- paste("$", format(sapply(fm, function(x) round(logLik(x), digits = 1))), "$", sep = "")
fm_summary[31,] <- format(round(sapply(fm, AIC), digits = 1))
fm_summary[32,] <- format(round(sapply(fm, AIC, k = log(nrow(dt))), digits = 1))
fm_summary[33,] <- round(c("ML-Pois" = sum(dpois(0, fitted(fm_pois))),
  "Adj-Pois" = NA,
  "Quasi-Pois" = NA,
  "NB" = sum(dnbinom(0, mu = fitted(fm_nbin), size = fm_nbin$theta)),
  "NB-Hurdle" = sum(predict(fm_hurdle, type = "prob")[,1]),
  "ZINB" = sum(predict(fm_zinb, type = "prob")[,1])))
fm_summary[30:33,2:3] <- NA
fm_summary[is.na(fm_summary)] <- " "
fm_summary <- paste(apply(cbind(rownames(fm_summary), fm_summary), 1, paste, collapse = " & "), "\\\\")
fm_summary[c(16, 28, 33)] <- paste(fm_summary[c(16, 28, 33)], "\\hline")
writeLines(fm_summary)
@
\end{tabular}
\caption{\label{tab:summary} Summary of fitted count regression models for NMES data:
 coefficient estimates from count model, zero-inflation model (both with standard errors
 in parentheses), number of estimated parameters, maximized log-likelihood, AIC, BIC
 and expected number of zeros (sum of fitted densities evaluated at zero). The observed
 number of zeros is \Sexpr{sum(dt$ofp < 1)} in \Sexpr{nrow(dt)} observations.}
\end{center}
\end{table}


\subsection{Zero-inflated regression}

A different way of augmenting the negative binomial count model \code{fm_nbin}
with additional probability weight for zero counts is a zero-inflated
negative binomial (ZINB) regression. The default model is fitted via
<<zinb0, eval=FALSE>>=
fm_zinb0 <- zeroinfl(ofp ~ ., data = dt, dist = "negbin")
@
As for the hurdle model above, all regressors from the count model are also
used in the zero-inflation model. Again, we can modify the regressors in 
the zero-inflation part, e.g., by fitting a second model
<<zinb, eval=FALSE>>=
fm_zinb <- zeroinfl(ofp ~ . | hosp + numchron + privins + school + gender,
  data = dt, dist = "negbin")
@
that has the same variables in the zero-inflation part as the hurdle
component in \code{fm_hurdle}. By omitting the \code{health} variable, the
fit does not change significantly which can again be brought out by a Wald test
<<waldtest-zinb>>=
waldtest(fm_zinb0, fm_zinb)
@
or an LR test \code{lrtest(fm_zinb0, fm_zinb)} that produces virtually identical
results. The chosen fitted model can again be inspected via
<<summary-zinb, eval=FALSE>>=
summary(fm_zinb)
@
See Table~\ref{tab:summary} for a more concise summary.

\subsection{Comparison}

Having fitted several count data regression models to the demand for medical care
in the NMES data, it is, of course, of interest to understand what these models have
in common and what their differences are. In this section, we show how to compute
the components of Table~\ref{tab:summary} and provide some further comments and
interpretations. 

As a first comparison, it is of natural
interest to inspect the estimated regression coefficients in the count data model

<<coef-count, results=hide>>=
fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
  "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
sapply(fm, function(x) coef(x)[1:8])
@

The result (see Table~\ref{tab:summary}) shows that there are some small differences,
especially between the GLMs and the
zero-augmented models. However, the zero-augmented models have to be interpreted 
slightly differently: While the GLMs all have the same mean function (\ref{eq:mean}),
the zero-augmentation also enters the mean function, see (\ref{eq:zeroinfl-mean}) and
(\ref{eq:hurdle-mean}). Nevertheless,
the overall impression is that the estimated mean functions are rather similar.
Moreover, the associated estimated standard errors
are very similar as well (see Table~\ref{tab:summary}):
<<se-count, results=hide>>=
cbind("ML-Pois" = sqrt(diag(vcov(fm_pois))),
  "Adj-Pois" = sqrt(diag(sandwich(fm_pois))),
  sapply(fm[-1], function(x) sqrt(diag(vcov(x)))[1:8]))
@
The only exception are the model-based standard errors for the Poisson model,
when treated as a fully specified model, which is obviously not appropriate for this data set.

In summary, the models are not too different with respect to their fitted 
mean functions. The differences become obvious if not only the mean but the
full likelihood is considered:
<<logLik>>=
rbind(logLik = sapply(fm, function(x) round(logLik(x), digits = 0)),
  Df = sapply(fm, function(x) attr(logLik(x), "df")))
@
The ML Poisson model is clearly inferior to all other fits. The quasi-Poisson
model and the sandwich-adjusted Poisson model are not associated with a fitted
likelihood. The negative binomial already improves the fit dramatically but
can in turn be improved by the hurdle and zero-inflated models which give
almost identical fits. This also reflects that the over-dispersion in the data
is captured better by the negative-binomial-based models than the plain
Poisson model. Additionally, it is of interest how the zero counts
are captured by the various models. Therefore, the observed zero counts are
compared to the expected number of zero counts for the likelihood-based models:
<<zero-counts>>=
round(c("Obs" = sum(dt$ofp < 1),
  "ML-Pois" = sum(dpois(0, fitted(fm_pois))),
  "NB" = sum(dnbinom(0, mu = fitted(fm_nbin), size = fm_nbin$theta)),
  "NB-Hurdle" = sum(predict(fm_hurdle, type = "prob")[,1]),
  "ZINB" = sum(predict(fm_zinb, type = "prob")[,1])))
@
Thus, the ML Poisson model is again not appropriate whereas the negative-binomial-based
models are much better in modeling the zero counts. By construction, the
expected number of zero counts in the hurdle model matches the observed
number.

In summary, the hurdle and zero-inflation models lead to the best 
results (in terms of likelihood) on this data set. Above, their mean function for the
count component was already shown to be very similar, below we take a
look at the fitted zero components:
<<coef-zero>>=
t(sapply(fm[4:5], function(x) round(x$coefficients$zero, digits = 3)))
@
This shows that the absolute values are rather different---which is not
surprising as they pertain to slightly different ways of modeling zero
counts---but the signs of the coefficients match, i.e., are just inversed.
For the hurdle model, the zero hurdle component describes the probability
of observing a positive count whereas, for the ZINB model, the zero-inflation
component predicts the probability of observing a zero count from the
point mass component. Overall, both models lead to the same qualitative
results and very similar model fits. Perhaps the hurdle model is slightly
preferable because it has the nicer interpretation: there is one process
that controls whether a patient sees a physician or not, and a second 
process that determines how many office visits are made.

\section{Summary} \label{sec:summary}

The model frame for basic count data models from the GLM framework as well
as their implementation in the \proglang{R} system for statistical computing
is reviewed. Starting from these basic tools, it is presented how
hurdle and zero-inflated models extend the classical models and how
likewise their \proglang{R} implementation in package \pkg{pscl} re-uses
design and functionality of the corresponding \proglang{R} software.
Hence, the new functions \fct{hurdle} and \fct{zeroinfl} are straightforward
to apply for model fitting. Additionally, standard methods for diagnostics 
are provided and generic inference tools from other packages can easily be
re-used.


\section*{Computational details}

The results in this paper were obtained using
\proglang{R}~\Sexpr{paste(R.Version()[6:7], collapse = ".")} with the packages
\pkg{MASS}~\Sexpr{gsub("-", "--", packageDescription("MASS")$Version)},
\pkg{pscl}~\Sexpr{gsub("-", "--", packageDescription("pscl")$Version)},
\pkg{sandwich}~\Sexpr{gsub("-", "--", packageDescription("sandwich")$Version)},
\pkg{car}~\Sexpr{gsub("-", "--", packageDescription("car")$Version)},
\pkg{lmtest}~\Sexpr{gsub("-", "--", packageDescription("lmtest")$Version)}.
\proglang{R} itself and all packages used are available from
CRAN at \url{http://CRAN.R-project.org/}.

\bibliography{countreg}

\newpage

\begin{appendix}

\section{Technical details for hurdle models} \label{app:hurdle}

The fitting of hurdle models via ML in \fct{hurdle} is controlled by
the arguments in the \fct{hurdle.control} wrapper function:

\begin{Soutput}
hurdle.control(method = "BFGS", maxit = 10000, trace = FALSE,
  separate = TRUE, start = NULL, ...)
\end{Soutput}

This modifies some default arguments passed on to the optimizer \fct{optim},
such as \code{method}, \code{maxit} and \code{trace}. The latter is also
used within \fct{hurdle} and can be set to produce more verbose output
concerning the fitting process. The argument \code{separate} controls
whether the two components of the model are optimized separately (the default)
or not. This is possible because there are no mixed sources for the zeros
in the data (unlike in zero-inflation models). The argument \code{start}
controls the choice of starting values for calling \fct{optim}, all remaining
arguments passed through \code{...} are directly passed on to \fct{optim}.

By default, starting values are estimated by calling \fct{glm.fit} for
both components of the model separately, once for the counts and once
for zero vs.\ non-zero counts. If starting values are supplied,
\code{start} needs to be set to a named list with the parameters for the
\code{$count} and \code{$zero} part of the model (and potentially a
\code{$theta} dispersion parameter if a negative binomial distribution is used).

The fitted model object of class \class{hurdle} is similar to \class{glm}
objects and contains sufficient information on all aspects of the fitting
process. In particular, the estimated parameters and associated covariances
are included as well as the result from the \fct{optim} call. Furthermore,
the call, formula, terms structure etc.\ is contained, potentially also the
model frame, dependent variable and regressor matrices.

Following \fct{glm.nb}, the $\theta$ parameter of the negative binomial
distribution is treated as a nuisance parameter. Thus, the \code{$coefficients}
component of the fitted model object just contains estimates of
$\beta$ and $\gamma$ while the estimate of $\theta$ and its standard deviation 
(on a log scale) are kept in extra list elements \code{$theta} and \code{$SE.logtheta}.


\section{Technical details for zero-inflated models} \label{app:zeroinfl}

Both the interface of the \fct{zeroinfl} function as well as its fitted
model objects are virtually identical to the corresponding \class{hurdle}
functionality. Hence, we only provide some additional information for
those aspects that differ from those discussed above. The details of the
ML optimization are again provided by a \fct{zeroinfl.control} wrapper:

\begin{Soutput}
zeroinfl.control(method = "BFGS", maxit = 10000, trace = FALSE,
  EM = FALSE, start = NULL, ...)
\end{Soutput}

The only new argument here is the argument \code{EM} which allows for EM
estimation of the starting values. Instead of calling \fct{glm.fit} only once for
both components of the model, this process can be iterated until convergence of the
parameters to the ML estimates. The optimizer is still called subsequently
(for a single iteration) to obtain the Hessian matrix from which the estimated
covariance matrix can be computed.


\section{Methods for fitted zero-inflated and hurdle models} \label{app:methods}

Users typically should not need to compute on the internal structure of
\class{hurdle} or \class{zeroinfl} objects because a set of standard extractor functions is provided,
an overview is given in Table~\ref{tab:methods}.
This includes methods to the generic functions \fct{print} and \fct{summary} which
print the estimated coefficients along with further information. The \fct{summary} in particular
supplies partial Wald tests based on the coefficients and the covariance matrix.
As usual, the \fct{summary} method returns an object of class \class{summary.hurdle}
or \class{summary.zeroinfl}, respectively, containing the relevant summary statistics
which can subsequently be printed using the associated \fct{print} method.

The methods for \fct{coef} and \fct{vcov} by default
return a single vector of coefficients and their associated covariance matrix,
respectively, i.e., all coefficients are concatenated. By setting their \code{model}
argument, the estimates for a single component can be extracted. Concatenating
the parameters by default and providing a matching covariance matrix estimate
(that does not contain the covariances of further nuisance parameters) facilitates
the application of generic inference functions such as \fct{coeftest}, \fct{waldtest},
and \fct{linearHypothesis}. All of these compute Wald tests for which coefficient
estimates and associated covariances is essentially all information required and 
can therefore be queried in an object-oriented way with the \fct{coef} and \fct{vcov}
methods.

Similarly, the \fct{terms} and \fct{model.matrix} extractors can
be used to extract the relevant information for either component of the model.
A \fct{logLik} method is provided, hence \fct{AIC}
can be called to compute information criteria and \fct{lrtest} for conducting
LR tests of nested models.

The \fct{predict} method computes predicted means (default) or probabilities (i.e.,
likelihood contributions) for observed or new data. Additionally, the means
from the count and zero component, respectively, can be predicted. For the
count component, this is the predicted count mean (without hurdle/inflation):
$\exp(x_i^\top \beta)$. For the zero component, this is the  the ratio of
probabilities $(1 - f_\mathrm{zero}(0; z_i, \gamma))/(1 - f_\mathrm{count}(0; x_i, \beta))$
of observing non-zero counts in hurdle models. In zero-inflation models, it is
the probability $f_\mathrm{zero}(0; z_i, \gamma)$ of observing a zero
from the point mass component in zero-inflated models

Predicted means for the observed data can also be obtained by the \fct{fitted} method. Deviations
between observed counts $y_i$ and predicted means $\hat \mu_i$ can be obtained
by the \fct{residuals} method returning either raw residuals $y_i - \hat \mu_i$
or the Pearson residuals (raw residuals standardized by square root of the variance function)
with the latter being the default.

\begin{table}[t!]
\begin{center}
\begin{tabular}{|l|p{8.7cm}|}
\hline
Function & Description \\ \hline
\fct{print} & simple printed display with coefficient estimates\\
\fct{summary} & standard regression output (coefficient estimates, standard errors, partial Wald tests);
                returns an object of class ``\code{summary.}\textit{class}''
                containing the relevant summary statistics (which has a \fct{print} method) \\ 	\hline
\fct{coef} & extract coefficients of model (full or components), a single vector of all coefficients by default \\
\fct{vcov} & associated covariance matrix (with matching names) \\
\fct{predict} & predictions (means or probabilities) for new data \\
\fct{fitted} & fitted means for observed data \\
\fct{residuals} & extract residuals (response or Pearson) \\ \hline
\fct{terms} & extract terms of model components \\
\fct{model.matrix} & extract model matrix of model components \\
\fct{logLik} & extract fitted log-likelihood \\ \hline
\fct{coeftest} & partial Wald tests of coefficients \\
\fct{waldtest} & Wald tests of nested models \\
\fct{linearHypothesis} & Wald tests of linear hypotheses \\
\fct{lrtest} & likelihood ratio tests of nested models \\
\fct{AIC} & compute information criteria (AIC, BIC, \dots) \\ \hline
\end{tabular}
\caption{\label{tab:methods} Functions and methods for objects of class \class{zeroinfl}
  and \class{hurdle}. The first three blocks refer to methods, the last block contains generic
  functions whose default methods work because of the information supplied by the methods above.}
\end{center}
\end{table}

\section{Replication of textbook results} \label{app:replication}

\citet[p.~204]{countreg:Cameron+Trivedi:1998} use a somewhat extended version
of the model employed above. Because not all variables in that extended
model are significant, a reduced set of variables was used throughout the main
paper. Here, however, we use the full model to show that the tools in \pkg{pscl}
reproduce the results of \cite{countreg:Cameron+Trivedi:1998}.

After omitting the responses other than \code{ofp} and setting \code{"other"}
as the reference category for \code{region} using
<<dt2a, eval=FALSE>>=
dt2 <- DebTrivedi[, -(2:6)]
dt2$region <- relevel(dt2$region, "other")
@
we fit a model that contains all explanatory variables, both in the
count model and the zero hurdle model:
<<hurdle2, eval=FALSE>>=
fm_hurdle2 <- hurdle(ofp ~ ., data = dt2, dist = "negbin")
@

The resulting coefficient estimates are virtually identical to those published
in \citet[p.~204]{countreg:Cameron+Trivedi:1998}. The associated Wald statistics
are also very similar provided that sandwich standard errors are used
\citep[which is not stated explicitely in][]{countreg:Cameron+Trivedi:1998}.
<<hurdle2-summary>>=
cfz <- coef(fm_hurdle2, model = "zero")
cfc <- coef(fm_hurdle2, model = "count")
se <- sqrt(diag(sandwich(fm_hurdle2)))
round(cbind(zero = cfz, zero_t = cfz/se[-seq(along = cfc)], 
  count = cfc, count_t = cfc/se[seq(along = cfc)]),
  digits = 3)[c(3, 2, 4, 5, 7, 6, 8, 9:17, 1),]
logLik(fm_hurdle2)
1/fm_hurdle2$theta
@
There are some small and very few larger deviations in the Wald statistics
which are probably explicable by different approximations to the gradient
of $\theta$ (or $1/\theta$ or $\log(\theta)$) and the usage of different
non-linear optimizers (and at least ten years of software development).

More replication exercises are performed in the example sections of
\pkg{AER} \citep{countreg:Zeileis+Kleiber:2008}, the software package
accompanying \cite{countreg:Kleiber+Zeileis:2008}.

\end{appendix}

\end{document}
