Understanding the Negative Binomial Distribution

January 4, 2018

If you’ve ever encountered count data, chances are you’re familiar with the Poisson distribution. The Poisson distribution models the probability with which an event occurs a certain number of times within a fixed time period. For example, count how often a book is sold on Amazon on a given day. Then the Poisson can describe the probability with which the book is sold at least two times. The book might sell 5 times on some days, but it is never sold -3 times or 0.5 times; the Poisson distribution only allocates probability to non-negative integers, that is, count data.
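
For instance, assuming a made-up average rate of two sales per day, the probability of at least two sales on a given day follows directly from the Poisson distribution function:

# P(X >= 2) for a Poisson rate of 2 (illustrative value)
1 - ppois(1, lambda = 2)
## [1] 0.5939942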

Consider again the number of times an item is sold on Amazon on a given day. Then a sample of observations over a span of a few days could look like this:
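
The underlying sample itself is not included here. As a purely illustrative stand-in with roughly the sample size, mean, and variance reported below, one could generate overdispersed counts like this; the names product_data and sales_only are simply what the later code chunks expect, and the outputs shown throughout come from the original data, so they will not be reproduced exactly:

set.seed(2018)
# hypothetical stand-in: 83 daily sales counts with mean ~4.4 and variance ~8.4
product_data <- data.frame(sales = rnbinom(83, mu = 4.4, size = 4.7))
sales_only <- product_data$sales
table(sales_only)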

One of the advantages of the Poisson distribution is that it is defined by a single parameter \(\lambda\). Furthermore, the distribution’s mean equals \(\lambda\), such that \(\lambda\) can be estimated by method-of-moments and maximum likelihood alike: Given a sample, our best estimate is the sample mean, such that \(\hat{\lambda}=\bar{x}\). Consequently, we can fit a Poisson to our example data like so (here, we take the high route with the gamlss package):

library(broom)
library(gamlss)
library(dplyr)  # used below for the pipe and data wrangling
library(tidyr)  # used below for gather()

# a complicated way of computing the sample mean:
pois_fit <- gamlss(sales_only~1, 
                   family = PO(mu.link = "identity"),
                   control = gamlss.control(trace = FALSE)) 
pois_fit_tidy <- tidy(pois_fit)
round(pois_fit_tidy[,c(3:4,6)],2)
##   estimate std.error p.value
## 1     4.36      0.24       0
# Check whether we indeed estimated the sample mean
identical(round(pois_fit_tidy$estimate,2),
          round(mean(sales_only),2))
## [1] TRUE

Figure 1: Comparison of Sample and Theoretical Poisson Distribution.

This worked flawlessly. But if we evaluate the fit graphically, it’s a little disappointing. The sample distribution has much more probability in the tails. Consequently, if we simulated new data from the fitted distribution, the simulated sample would have the correct mean, but its lower and upper quantiles would be too tight:

set.seed(1024)
summary(product_data$sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   4.364   6.000  14.000
# draw 83 new observations from the fitted Poisson distribution
pois_sample <- rpois(83, pois_fit_tidy$estimate)
summary(pois_sample)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   4.542   6.000  10.000

What we failed to take into account so far is the variance of our sample: One fundamental assumption of the Poisson distribution is that both mean and variance are equal. Thus, if the random variable \(X\) is Poisson distributed with parameter \(\lambda\), \(X \sim \text{Pois}(\lambda)\), then \(E[X] = Var[X] = \lambda\). This is an assumption that holds at best approximately in applications. In the data set we used so far, we actually have a variance that is nearly twice as large as the mean:

round(mean(product_data$sales),2)
## [1] 4.36
round(var(product_data$sales),2)
## [1] 8.44

This case, where the sample variance is larger than the sample mean, occurs far more often in practice than the opposite. So if we fit a Poisson using the sample mean, we end up with a fitted distribution whose variance is smaller than the one observed in the data. In the context of Poisson distributions we say that our sample is overdispersed: We expect a sample variance of \(\hat{\sigma}^2 = \lambda\) but get \(\hat{\sigma}^2 > \lambda\). The sample is more “dispersed” than the fitted distribution.
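
A quick way to quantify this is the dispersion index, the ratio of sample variance to sample mean; it should be close to 1 for Poisson data, but is roughly 1.9 here:

# dispersion index: close to 1 for Poisson data, clearly above 1 for our sample
var(product_data$sales) / mean(product_data$sales)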

An Alternative Distribution

A distribution for count data that takes overdispersion into account is the Negative Binomial distribution.

In contrast to the Poisson distribution, the Negative Binomial takes two parameters, and there are many different parameterizations to choose from. Wikipedia, for example, gives a parameterization in terms of \(r\) and \(p\):

Suppose there is a sequence of independent Bernoulli trials. Thus, each trial has two potential outcomes called “success” and “failure”. In each trial the probability of success is \(p\) and of failure is \((1 − p)\). We are observing this sequence until a predefined number \(r\) of failures has occurred. Then the random number of successes we have seen, \(X\), will have the negative binomial distribution:

\[ X \sim \text{NB}(r,p) \]

Then the mean is \(\mu = E[X] = \frac{pr}{1-p}\). Using this formulation, one can write the variance as \(\sigma^2 = Var(X) = \frac{pr}{(1-p)^2} = \mu + \frac{\mu^2}{r}\). This makes it obvious that the variance of a Negative Binomial is larger than that of a Poisson: if both distributions have a mean of \(\lambda\), the Negative Binomial has an additional variance of \(\lambda^2/r\) and will therefore always have longer tails. Only in the limit \(r \to \infty\), where the Negative Binomial reduces to the Poisson, are the two variances equal.
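
As a quick sanity check of these formulas (not part of the original analysis), we can simulate from base R’s rnbinom(). Note that R counts the failures before size successes occur, each with success probability prob, so the \((r, p)\) above maps to size = r and prob = 1 - p; the names r_check and p_check are used only for this illustration.

r_check <- 3
p_check <- 0.6
x_check <- rnbinom(1e6, size = r_check, prob = 1 - p_check)
c(mean(x_check), p_check * r_check / (1 - p_check))    # both close to 4.5
c(var(x_check), p_check * r_check / (1 - p_check)^2)   # both close to 11.25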

Now, can we improve upon our previous fit by using a Negative Binomial? We again use the gamlss package and compare the fitted density with our previous results.

nbin_fit <- gamlss(sales_only~1, 
                   family = NBI(mu.link = "identity",
                                sigma.link = "identity"))
nbin_fit_tidy <- tidy(nbin_fit)
nbin_fit_tidy[,c(1,3:4,6)] %>%
  knitr::kable(digits = 2)
parameter   estimate   std.error   p.value
mu              4.36        0.33      0.00
sigma           0.22        0.08      0.01

Figure 2: Comparison of sample distribution with fitted theoretical distributions. Even though the distributions are discrete, we use densities to make the comparison easier.
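
The code behind Figure 2 is not included here; a rough sketch of such a comparison (my own layout and variable names, comparing probability mass rather than smoothed densities) could use the dPO() and dNBI() densities from gamlss.dist with the parameters estimated above:

obs_range <- 0:max(sales_only)
emp_prob <- sapply(obs_range, function(k) mean(sales_only == k))
plot(obs_range, emp_prob, type = "h",
     xlab = "sales", ylab = "probability")   # sample distribution
points(obs_range, dPO(obs_range, mu = pois_fit_tidy$estimate),
       col = "red", pch = 19)                # fitted Poisson
points(obs_range, dNBI(obs_range,
                       mu = nbin_fit_tidy$estimate[1],
                       sigma = nbin_fit_tidy$estimate[2]),
       col = "blue", pch = 19)               # fitted Negative Binomial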

AIC(pois_fit)
## [1] 391.9573
AIC(nbin_fit)
## [1] 373.7867

The graph shows that using a Negative Binomial instead of a Poisson distribution improves considerably on the previous fit, and the AIC has decreased as well. There is now more probability in the tails and less around 4 to 5. We seem to have successfully accounted for the additional variance in our sample.

Why does the Negative Binomial account for the variance differently? Put differently: why does the Negative Binomial have a second parameter, while the Poisson has just one?

The Negative Binomial as a Poisson-Gamma Mixture

To answer these questions, and to better understand the Negative Binomial distribution (Roll Credits!), suppose you have not just one, but a whole bunch of products, \(i = 1, \dots, n\). As an example, consider the carparts data set published with “Forecasting with exponential smoothing: the state space approach” by Hyndman, Koehler, Ord and Snyder (Springer, 2008). The data set comes with the expsmooth package in R. It consists of 2674 time series describing monthly sales of different car parts; most series have 51 observations. We can summarize the complete series (those without missing values) quickly:

n_series   avg_part_sales   median_part_sales   avg_zero_share   series_size
    2509             0.51                0.39             0.75            51
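
The carparts_nona object used in the code below is not defined in the chunks shown; a plausible way to construct it (an assumption on my part) is to keep only the series without any missing values:

library(expsmooth)

# drop every series that contains missing values (assumed preprocessing step)
carparts_nona <- carparts[, colSums(is.na(carparts)) == 0]
ncol(carparts_nona)  # should match the 2509 complete series summarized above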

Given that the items are mostly slow moving and the observations are integer valued, we assume that each individual part’s sales can again be modeled with a Poisson likelihood:

\[x_{i,t} \sim \text{Pois}(\lambda_i) \quad \forall i = 1, \dots, n.\]

One important concern of retailers is the forecast of new products for which no sales have been observed yet. Assuming a level of similarity among products, an initial solution could be to take the average of all historical products’ means as the mean that parameterizes the new product’s Poisson distribution. Index the new product by \(j\) such that: \(x_{j,t} \sim \text{Pois}(\lambda_j)\) where we estimate \(\lambda_j\) as \(\hat{\lambda}_j = \frac{1}{nT}\sum_{i=1}^{n} \sum_{t=1}^{T} x_{i,t}\).

Using the carparts data, we get an estimate of \(\hat{\lambda}_j \approx 0.51\), the average of all products’ sales averages:
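
The computation itself is not shown; one plausible sketch, assuming the carparts_nona object from above and equally long series (so that the pooled mean equals the average of the per-series means), is:

new_product_lambda <- mean(carparts_nona)  # pooled mean over all complete series
new_product_lambda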

## [1] 0.5073188

Given the characteristic properties of the Poisson distribution, new products would be modeled by a distribution with a variance that equals the mean:

pois_fixed <- rpois(10000, new_product_lambda)
round(var(pois_fixed),2)
## [1] 0.52

The plot below shows the estimated Poisson distribution that we would use to forecast sales of a new product, given the historical sales of the entire product range.

This approach is neat and simple, but it sweeps the uncertainty in our estimate \(\hat{\lambda}_j\) under the carpet: If we used the Poisson distribution parameterized by \(\hat{\lambda}_j\) to produce forecasts, our prediction intervals would be too narrow. Given that the estimate comes from a sample of data, it is not certain that the parameter should be exactly this fixed value (0.51); it might just as well be a little smaller or larger.

To take this uncertainty in \(\hat{\lambda}_j\) into account, we can treat \(\lambda_j\) from now on as a random variable. We know that the parameter of a Poisson distribution should be non-negative, so a good candidate distribution for the random variable \(\lambda_j\) is the Gamma distribution. One way of parameterizing a Gamma distribution is by shape \(k\) and scale \(\theta\) such that

\[\lambda_j \sim \text{Gamma}(k, \theta).\]

Consequently, we have \(E[\lambda_j] = k\theta\) and \(Var[\lambda_j] = k\theta^2\).
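
As a small illustration with made-up values for \(k\) and \(\theta\), we can confirm these two moment formulas by simulation:

lambda_draws <- rgamma(1e5, shape = 2, scale = 0.25)
c(mean(lambda_draws), 2 * 0.25)      # both close to k * theta = 0.5
c(var(lambda_draws), 2 * 0.25^2)     # both close to k * theta^2 = 0.125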

The Gamma distribution models the distribution of average sales of the range of products. Thus we first compute the average historical sales for each product, and then fit the distribution to these.

parts_summary <- carparts_nona %>%
  as.data.frame %>%
  tbl_df %>%
  gather(part, sales) %>%
  group_by(part) %>%
  summarize(
    avg_sales = mean(sales),
    zero_share = mean(sales == 0),
    size = n()
  )

# top_n(2, avg_sales) returns four rows here because of ties in avg_sales
parts_summary %>% top_n(2, avg_sales)
## # A tibble: 4 x 4
##       part avg_sales zero_share  size
##      <chr>     <dbl>      <dbl> <int>
## 1 21017605  1.745098  0.3137255    51
## 2 21055552  1.745098  0.5098039    51
## 3 21311629  1.745098  0.2941176    51
## 4 21311636  1.745098  0.2941176    51

Then we again use the gamlss package to fit the Gamma distribution to the sample means, and tidy the model output using the broom package:

gamma_fit <- gamlss(avg_sales~1,
                    data = parts_summary,
                    family = GA(mu.link = "identity",
                                sigma.link = "identity"),
                    control = gamlss.control(trace = FALSE)) %>%
  tidy()

gamma_fit %>% 
  dplyr::select(parameter, estimate, std.error, p.value) %>%
  knitr::kable(digits = 2)
parameter   estimate   std.error   p.value
mu              0.51        0.01         0
sigma           0.85        0.01         0
gamma_theta <- gamma_fit$estimate[1]
gamma_k <- gamma_fit$estimate[2]

mean(rgamma(1000, scale = gamma_theta,
            shape = gamma_k))
## [1] 0.4504594

Thus, through Empirical Bayes, we estimated a Gamma distribution for the parameter of the new product with \(\hat{\theta} = 0.507\) and \(\hat{k} = 0.85\). The mean of the fitted Gamma distribution equals \(\hat{\theta} \cdot \hat{k} \approx 0.43\), roughly matching the mean of the simulated draws above.

As we did before when fitting the Poisson and the Negative Binomial distributions, we can check the fitted Gamma distribution visually. The fit is not ideal, most likely because so many products were sold only very rarely.


With the fitted Gamma distribution at hand, we are now able to again use the Poisson distribution to simulate future sales of a new product. Previously, we would have drawn a random sample from a Poisson distribution with the fixed parameter \(\hat{\lambda}_j\). Now, however, the parameter of the Poisson distribution itself comes from a distribution. Thus, to simulate a future sales number, we first draw a value \(\lambda_{new}\) from the fitted Gamma distribution. This random value is then used as the parameter of the Poisson distribution to generate a forecast simulation \(x_{new} \sim \text{Pois}(\lambda_{new})\).

That is, the simulated sales are a mix of a Poisson and a Gamma distribution—they are no longer generated by a Poisson, but by a Poisson-Gamma mixture distribution.

To highlight the difference between the Poisson distribution and the Poisson-Gamma mixture distribution, I animated the process of simulating sales from the mixture. In the plot below, the 10000 values are randomly drawn from the mixture, where first the Gamma value is drawn as indicated by the color scale. The Gamma value is plugged into the Poisson distribution and the resulting value falls on the x-axis. The result is again a discrete distribution.1

set.seed(1024)
# simulate the parameters for the Poisson distribution,
# which are then also used for the color scale
alpha_values <- rgamma(10000, 
                       scale = gamma_theta,
                       shape = gamma_k)
random_poissons <- rpois(10000, alpha_values)

round(mean(random_poissons),2)
## [1] 0.43
round(var(random_poissons),2)
## [1] 0.63

Comparing this distribution with the previous Poisson distribution, we see that the mixture is wider and has somewhat longer tails. Its mean equals the mean of the fitted Gamma distribution, 0.43. However, while a Poisson with that mean would also have a variance of 0.43, the Poisson-Gamma mixture has a larger variance of 0.63. The mixture has more variance than a Poisson with a fixed parameter.

It is now clear why the Negative Binomial distribution has a larger variance than the Poisson distribution: The Negative Binomial is identical to a Poisson-Gamma mixture.

Given a \(\text{Gamma}(k, \theta)\) distribution used to create the mixture, the resulting mixture is a \(\text{NB}(r,p)\) Negative Binomial distribution, where \(r = k\) and \(p = \frac{\theta}{\theta+1}\). We can check whether the mixture really is a Negative Binomial by comparing the theoretical mean and variance with the sample mean and sample variance:

r <- gamma_k
p <- gamma_theta/(1+gamma_theta)

# theoretical mean and variance of the Negative Binomial
mu <- p*r/(1-p)
sigma <- mu + mu^2/r

round(mu,2)
## [1] 0.43
round(sigma,2)
## [1] 0.65

These values come very close to the 0.43 and 0.63 we observed in the simulated mixture sample.
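
We can also simulate directly from the implied Negative Binomial, using base R’s mean parameterization of rnbinom(), and compare against the mixture draws; this is a quick check that is not part of the original post, and nb_direct is just an illustrative name:

# same r and p as above, expressed through the mean mu = p*r/(1-p)
nb_direct <- rnbinom(10000, size = r, mu = p * r / (1 - p))
round(mean(nb_direct), 2)   # should again be close to 0.43
round(var(nb_direct), 2)    # should again be close to 0.65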

By writing the variance of the Negative Binomial in terms of its mean, \(\sigma^2 = \mu + \frac{\mu^2}{r}\), it is obvious how much additional variance is added on top of the original Poisson variance. And given the derivation as a Poisson-Gamma mixture, it is clear that this additional variance stems from the uncertainty about the Poisson parameter. Given that, it makes sense to model the sales of a new product directly by fitting a Negative Binomial instead of a Poisson to the historical sales.2
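
For the new product, for example, \(\hat{\mu} = 0.43\) and \(r = \hat{k} = 0.85\) give an additional variance of \(0.43^2 / 0.85 \approx 0.22\), exactly the gap between the Poisson variance of 0.43 and the mixture variance of roughly 0.65 computed above.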

References

For a more mathematical derivation of the Negative Binomial as Poisson-Gamma mixture, check out the blog post by Daniel Ma.

The general idea and format (and the headline!) for this post come from David Robinson’s awesome post on Understanding Empirical Bayes Estimation (Using Baseball Statistics). Just as he continued with a series of blog posts, I hope to write more on the Negative Binomial as conjugate prior, Gamma regression, and eventually LSTMs with Negative Binomial output.


  1. The animation is meant to be reminiscent of a Galton Board.

  2. This only holds if the historical products are each successfully modeled by a Poisson distribution.