Modeling Short Time Series with Prior Knowledge

What ‘Including Prior Information’ really looks like.

Tim Radtke

2019-04-16

It is generally difficult to model a time series when there is insufficient data to estimate a (suspected) long seasonality. Here, we show how this difficulty can be overcome by learning the seasonality on a different, longer, related time series and transferring the posterior as a prior distribution to the model of the short time series. The result is a forecast that is believable and can be used for decisions in a business context. Compared to traditional methods that are not able to incorporate the long seasonality, we observe a drastic improvement in common evaluation metrics. Default models in the forecast and prophet packages fail to produce good forecasts on this example.

Find the data and code necessary to reproduce the analysis on GitHub.

Often in forecasting, a key step is knowing when something can be forecast accurately, and when forecasts are no better than tossing a coin. Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again.

Prior Information in Bayesian Models

“The statistical technique allows us to encode expert knowledge into a model by stating prior beliefs about what we think our data looks like.” - Fast Forward Labs

“It provides a natural and principled way of combining prior information with data” - SAS

When you read about the advantages of probabilistic modeling and Bayesian inference in statistics and machine learning, one argument that comes up again and again is the possibility of including expert knowledge in the model via the prior distribution. This is touted as a giant leap promising vast possibilities. On closer inspection, however, it often remains unclear how business expertise can actually be encoded into a model for your application.

Even more frustratingly, the option of consciously choosing a prior distribution may be dismissed entirely and some default uniform prior used instead. In the following, we try to provide an example that both underlines the usefulness of prior information and shows explicitly how it can be incorporated in a non-obvious situation.

Modeling a Short Time Series

Imagine the following scenario. Your company has launched a new division whose task it is to sell bikes in New York City. After observing sales for the first quarter since launch, it is time to forecast sales for the holidays and the upcoming year. The following graph shows the daily sales so far.

The data we use in the following is publicly available Citi Bike data from station 360 in New York City. You can access the raw data here.

A Naive Benchmark Model

Give someone such a time series and they tend to grab ready-made solutions such as Hyndman’s forecast or Sean J. Taylor’s and Ben Letham’s prophet package (for good reason). Let’s see how far these methods get us.

library(dplyr)

# daily sales as a time series with weekly frequency
sales <- hciti %>%
  pull(rides) %>%
  ts(frequency = 7)

# automatic ARIMA model and 180-day forecast
arima_fit <- sales %>%
  forecast::auto.arima()
arima_fc <- forecast::forecast(arima_fit, h = 180)

While it is super simple to fit an ARIMA model, it’s often not sufficient to throw the automatic procedure at the data. Here, we quickly lose the weekly seasonality of the data in our forecasts.

The first trial can be improved with a little bit of intervention.

# weekly seasonality via Fourier terms used as external regressors
xreg_fourier <- forecast::fourier(sales, K = 3)
xreg_fourier_future <- forecast::fourier(sales, K = 3, h = 180)

# non-seasonal ARIMA errors combined with the Fourier regressors
arima_fit <- forecast::auto.arima(sales, seasonal = FALSE,
                                  xreg = xreg_fourier)
arima_fc <- forecast::forecast(arima_fit, h = 180,
                               xreg = xreg_fourier_future)

While this second iteration might be a better model of the weekly seasonality in our data, it fails to incorporate most of our problem-specific information.

Let’s take a step back and actually think about what we’re trying to do.

Existing Business Knowledge

First of all, we are dealing with count data: We sell 1 bike, we sell 102 bikes, but we sell neither 8.1 bikes nor -15 bikes. As we can see in the plots above, this knowledge is not well represented by the ARIMA models with their assumption of Normal-distributed errors. The prediction intervals quickly extend into negative territory, which is not sensible and implies that we assign too little probability to positive outcomes. (One can try to account for this via a Box-Cox transformation, for example forecast::auto.arima(sales, seasonal = FALSE, xreg = xreg_fourier, lambda = 0).)

Second, the forecast does not incorporate a yearly seasonality. Even though we have not observed a full year of sales, our boss might have the strong belief that bike sales follow nature’s seasons, just like the demand for ice cream: higher in summer, lower in winter. The problem with any kind of seasonality, though, is that common advice is to model it only once you have observed at least two full periods of data. That is less of a problem for the weekly seasonality, as we have already observed several weeks. But waiting two years before we can even try to incorporate a yearly seasonality? That’s going to leave our stakeholders unsatisfied.

This is what happens when you model yearly seasonality naively with a quarter of data using maximum likelihood.

The most frustrating part about not being able to model the long seasonality is that we as data scientists as well as our colleagues know that crucial knowledge about the business is being neglected. How can we expect anyone to trust the forecast when it ignores obvious business knowledge?

Another expectation from the business could be to see growth in the demand for our bikes over time. Even if there is no growth to be seen yet, we might have to account for growth because of planned marketing, because the first spring and summer are coming up, or because we previously observed growth with other products.

Hell breaks loose when we add another free parameter for a linear trend to the yearly seasonality.

The results from Facebook’s prophet package expose similar difficulties.

Collecting Everything for a Data Generating Process

In the previous paragraphs, we loosely specified assumptions that we and our colleagues would like to see incorporated in a forecast of the bike sales. We also saw that we cannot apply a typical time series model to the little data we have and expect a good result that is in line with our knowledge. Thus, we will now build a probabilistic model that lets us include this prior knowledge. Let us formalize the assumptions stated so far.

We model random variables \(Y_t\) representing sales with a count distribution such as the Poisson or the Negative Binomial distribution. For the rest of the discussion, we continue with a Negative Binomial distribution parameterized by \(\lambda_t\) and \(\phi\), such that the mean and variance of \(Y_t\) are given by \(\text{E}[Y_t] = \lambda_t\) and \(\text{Var}[Y_t] = \lambda_t + \frac{\lambda_t^2}{\phi}\):

\[ Y_t \sim \text{NegBin}(\lambda_t, \phi), \] where \(\lambda_t, \phi > 0\).

Since we’re dealing with time series, we model the mean with a damped dynamic that combines information from the previous observation \(y_{t-1}\), the latent mean at the previous time step \(\lambda_{t-1}\), as well as the general level \(\mu_t\):

\[ \lambda_t = (1 - \delta - \gamma ) \cdot \mu_t + \delta \cdot \lambda_{t-1} + \gamma \cdot y_{t-1}, \] where \(0 \leq \delta \leq 1\) and \(0 \leq \gamma \leq 1-\delta\).
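
To make this recursion concrete, here is a small, purely illustrative simulation of the data generating process in R. The parameter values are chosen arbitrarily for illustration and are not estimates from the data; note that R’s rnbinom() with the (mu, size) parameterization matches \((\lambda_t, \phi)\), since its variance is mu + mu^2 / size.

set.seed(1)
n <- 60
delta <- 0.3                # weight on the previous latent mean
gamma <- 0.2                # weight on the previous observation
phi <- 5                    # Negative Binomial dispersion
mu <- rep(50, n)            # constant general level, for illustration only

lambda <- numeric(n)
y <- numeric(n)
lambda[1] <- mu[1]
y[1] <- rnbinom(1, mu = lambda[1], size = phi)
for (t in 2:n) {
  lambda[t] <- (1 - delta - gamma) * mu[t] + delta * lambda[t - 1] + gamma * y[t - 1]
  y[t] <- rnbinom(1, mu = lambda[t], size = phi)
}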

We model the general level \(\mu_t\) of the time series as a regression using dummy variables \(D_i\) for the days of the week, as well as Fourier terms for the long yearly seasonality (see this blog post by Rob J. Hyndman or this paper on Fourier Recurrent Units in LSTMs for more information and motivation). Given that the mean of a Negative Binomial distribution has to be positive, we use a log-link function:

Fourier series terms of different order for a seasonality of period 365.25.

\[ \log(\mu_t) = \alpha_0 + \sum_{i = 1}^6 \alpha_i D_i + \tau \frac{t}{m} + \sum_{k=1}^K \Big(\beta_k \sin \big(\frac{2\pi kt}{m} \big) + \tilde{\beta}_k \cos \big(\frac{2\pi kt}{m} \big)\Big) \]

where \(m = 365.25\). \(K\) defines how complex we let the Fourier terms and corresponding seasonal pattern be; given that we are mostly interested in a cycle corresponding to the natural seasons, \(K = 6\) should be more than sufficient. The \(\tau \frac{t}{m}\) term models a linear trend of the log-transformed mean.
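
As a quick dimensionality check: with \(K = 6\), the forecast::fourier() helper used below generates six sine and six cosine columns, i.e. twelve regressors in total.

# K = 6 sine/cosine pairs yield 2 * 6 = 12 Fourier regressors
ncol(forecast::fourier(ts(1:731, frequency = 365.25), K = 6))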

To make the data generating process complete, we add prior distributions to model the uncertainty for every parameter.

Following the Prior Choice Recommendations for the probabilistic programming language Stan, we don’t put a prior on \(\phi\) directly, but instead on the transformed \(\hat{\phi} = \frac{1}{\sqrt{\phi}}\) with

\[ \hat{\phi} \sim N(0,0.5). \]

As priors on the lagged mean and observation components, we use Gamma priors to enforce positive values with \(\delta \sim \text{Gamma}(1,10)\) and \(\gamma \sim \text{Gamma}(0.5,10)\).

For the coefficients in the regression equation, we don’t need any transformation. Given that we have quite a bit of data for each day of the week, and since these coefficients are not the focus of this article, we give them weakly informative priors:

\[ \alpha_i \sim N(0,0.5). \]

The prior for the intercept corresponds to the level of sales we roughly expect. Note that we model the intercept on the logarithmic scale: We expect around \(\exp(4) \approx 55\) sales per day.

\[ \alpha_0 \sim N(4,0.5) \]

More interesting, and the actual focus of this article, however, are the priors on the trend and the Fourier term coefficients. We could, of course, choose penalizing priors to account for the fact that we have not observed much data: we might not want to estimate long-term growth based on the first few weeks of data. Similarly for the Fourier terms, we have seen in the ARIMA examples above that these parameters add too much flexibility to be estimated reliably from our amount of data.

Instead, we very much want to inform the model about what parameter values it can expect in the real world. Weakly informative priors do not make sense here; we have a clear idea of how much growth is plausible and of what the seasonality might look like. Let’s start with the trend.

Choosing a Prior for the Trend

One advantage of the log-link parameterization of the mean is the interpretation of the coefficients: holding every other variable constant, a one-unit change in a given variable multiplies \(\mu_t\) by the exponentiated coefficient, which for small coefficients corresponds approximately to a change of 100% times the coefficient. Since \(\frac{t}{m}\) increases by one over the course of a year, \(\tau\) gives the approximate percentage change in \(\mu_t\) from one year to the next. Thus, if the business very much plans with a growth of, let’s say, 3% year-over-year, then we should choose \(\text{E}[\tau] = 0.03\).

If we additionally believe very much in slight positive growth, then a prior such as

\[ \tau \sim N(0.03, 0.02) \]

might be reasonable. Note that this does not imply that negative growth is impossible; it just means we consider it unlikely. Indeed, our prior assigns a probability of about 6.68% to a negative growth scenario.
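
Both the implied year-over-year growth and the prior probability of negative growth are easy to check in R:

exp(0.03)                        # about 1.0305, i.e. roughly +3% per year at the prior mean
pnorm(0, mean = 0.03, sd = 0.02) # about 0.0668, i.e. a 6.68% prior probability of negative growth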

Choosing a Prior for the Yearly Seasonality

Thinking about a linear combination of sine and cosine curves that together portray the right amount of up and down over a year is a little more involved than choosing a single year-over-year growth rate. It would surely be possible to come up with reasonable coefficients for the Fourier terms by thinking hard or by playing around, but we don’t really have an intuition for what reasonable priors on the individual coefficients look like. We do, however, have an idea about the function that these priors induce together. So we will try a different approach here.

Transfer of a Learned Posterior as New Prior

In many real world applications, we have a good grasp on what patterns to expect in a time series even if our own time series is short. Rough market patterns can be drawn from public market data, your company’s previous products, or Google Trends.

Let’s again consider our task of estimating the demand for bikes. We suspect that demand is heavily correlated with the weather and the four seasons: it should be high in summer and low in winter. Theoretically, one could try to use temperature as an input feature of some kind; however, we neither have enough demand observations, nor can we necessarily forecast the weather. We can, however, learn a linear combination of Fourier terms that models the temperature’s yearly seasonality and try to transfer these Fourier terms to our demand forecast model.

Modeling Seasonality in the Long Time Series

To make our life a little easier for this article, we used a data set available on Kaggle to model the yearly seasonality of temperature in New York City. There might be better data sets, but this one should suffice to get the yearly seasonality right. (Including data from the period for which we want to predict sales is not ideal, but given the regularity of temperatures, let’s not make this a concern here.)

Three years of daily maximum temperatures for New York City.

After filtering and aggregating the data set down to three years of daily maximum temperatures for New York City, we can prepare the Fourier terms for the same time period. Since the primary goal here is to derive posterior distributions for the regression coefficients of the yearly-seasonality Fourier terms (the same terms we will use in our final model of bike sales), we fit a simplified Poisson regression that includes them. Note how we create a vector all_dates of all dates we are potentially interested in so as to align the Fourier terms correctly across models.

# Fourier terms for the yearly seasonality, aligned on a common date index
all_dates <- seq(as.Date("2012-01-01"), as.Date("2020-01-01"), by = 1)
fourier_yearly <- forecast::fourier(ts(all_dates, frequency = 365.25), K = 6)
fourier_table <- data.frame(date = all_dates, fourier_yearly)
names(fourier_table)[2:13] <- paste0("fourier", 1:12)

# keep only the dates for which we have temperature observations
fourier_train <- filter(fourier_table, date %in% daily_temp$date)

Also note that we subtract the minimum observed daily temperature from all temperatures, so that the smallest temperature in our data equals 0. We do so since zero bike sales are feasible on a cold day in winter. Moreover, given that we are in a log-linear model, we are interested in percentage changes in the outcome; since a change from 1 to 10 is very different from a change from 91 to 100, the rescaled series should yield posterior coefficients that are more suitable for the sales data.

# shift temperatures so that the minimum value is 0
y_t <- daily_temp$max_temp - min(daily_temp$max_temp)
xreg <- as.matrix(fourier_train[,paste0("fourier", 1:12)])
K <- dim(xreg)[2]
N <- length(y_t)

Having prepared the data, we can perform MCMC inference using Stan for the regression model of the temperature data.

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

pois_stan <- stan_model("pois_reg.stan")
temperature_fit <- sampling(pois_stan,
                            data = list(y = y_t, x = xreg, N = N, K = K),
                            iter = 4000, algorithm = "NUTS", seed = 512)
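
The file pois_reg.stan itself is not reproduced in this post; it is part of the repository linked above. A minimal Poisson regression with log link that is compatible with the data list above and the beta[1], ..., beta[12] coefficients extracted below could look roughly like the following sketch; the priors shown here are illustrative assumptions, not necessarily those of the original model.

data {
  int<lower=1> N;
  int<lower=1> K;
  int y[N];          // shifted daily maximum temperatures
  matrix[N, K] x;    // yearly-seasonality Fourier terms
}
parameters {
  real alpha;        // intercept on the log scale
  vector[K] beta;    // Fourier term coefficients
}
model {
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  y ~ poisson_log(alpha + x * beta);
}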

Posterior linear combination of the Fourier terms for the temperature model.

We can extract the posterior samples of each coefficient of the Fourier terms and use them to plot their posterior distributions.

# posterior draws of the K = 12 Fourier coefficients
beta_samples <- as.data.frame(temperature_fit)[,paste0("beta[", 1:K, "]")]
names(beta_samples) <- paste0("fourier", 1:12)
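
One way to plot these posterior distributions is a quick sketch with tidyr and ggplot2 (the code that produced the figures in this post may differ):

library(tidyr)
library(ggplot2)

beta_samples %>%
  pivot_longer(everything(), names_to = "coefficient", values_to = "draw") %>%
  ggplot(aes(x = draw)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ coefficient, scales = "free_x")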

Furthermore, we can now look at the linear combination of Fourier terms as implied by the posterior mean of the coefficients. Similarly to the previous plot of the posterior distributions of the coefficients, this plot makes clear that the low frequency terms are strongest in the temperature data. This is hardly surprising, as we were mainly interested in the period of one year.

# posterior mean and standard deviation of each Fourier coefficient
beta_means <- as.numeric(colMeans(beta_samples))
beta_sds <- as.numeric(apply(beta_samples, 2, sd))

# linear combination of the Fourier terms implied by the posterior means
xx <- matrix(rep(beta_means, N), byrow = TRUE, ncol = 12) * xreg
fouriers <- data.frame(xx)
fouriers$date <- daily_temp$date
fouriers$fourier <- rowSums(xx)

Posterior mean linear combination of the Fourier terms for the temperature model.

Compressing the Posterior for Use as Prior

Most importantly though, we can use the inferred posterior distributions of the Fourier coefficients to inform our choice of priors for the coefficients in our model of bike sales. In the next model, the priors on the coefficients will be Normal distributions as described above. To parameterize those distributions, we need values for the mean and the standard deviation, which are easily summarized from the posterior samples. With beta_means and beta_sds from above, we get:

                     Fourier 1  Fourier 2  Fourier 3  Fourier 4  Fourier 5  Fourier 6
Mean                    -0.252     -0.545     -0.122     -0.047     -0.045      0.048
Standard Deviation       0.009      0.010      0.009      0.009      0.009      0.009
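
For reference, a summary table like the one above can be generated directly from these two vectors, for example:

# combine posterior means and standard deviations into a summary table
prior_table <- rbind(Mean = beta_means, `Standard Deviation` = beta_sds)
colnames(prior_table) <- paste0("Fourier ", 1:12)
round(prior_table[, 1:6], 3)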

Modeling the Short Time Series Using the Learned Posteriors

Now it’s finally time to forecast the sales data. For the sake of completeness, let us walk through the Stan model that we will fit, as it details how we use the posterior from the previous model as a prior in this model. It mostly corresponds to the data generating process described above.

In the data section, we define all data that we pass to the sampling() function. Besides the historical data, the regressors, and the dimensions, we also pass along the parameter values we choose for the prior of the trend and the Fourier terms via trend_loc, trend_sd, fourier_loc, and fourier_sd. We do so in order to use the same Stan model with different priors.

data {
  int<lower=1> N;          // number of observations
  int<lower=1> K;          // number of weekday dummy variables
  int<lower=1> F;          // number of Fourier terms
  int y[N];                // observed daily sales
  matrix[N, K] x;          // weekday dummies
  matrix[N, F] fouriers;   // yearly-seasonality Fourier terms
  matrix[N, 1] trend;      // linear trend t / 365.25
  real trend_loc;          // prior mean for the trend coefficient
  real trend_sd;           // prior sd for the trend coefficient
  real fourier_loc[F];     // prior means for the Fourier coefficients
  real fourier_sd[F];      // prior sds for the Fourier coefficients
}

Next, we define the unknown parameter values which we want to infer. Among others, we have the coefficients for the Fourier terms and the trend.

parameters {
  real<lower=0,upper=1> phi;         // weight on the lagged latent mean (delta in the text)
  real<lower=0,upper=1-phi> alpha;   // weight on the lagged observation (gamma in the text)
  real delta_inv;                    // 1 / sqrt(dispersion) (phi-hat in the text)
  vector[K] beta;                    // weekday coefficients
  vector[F] beta_fourier;            // yearly-seasonality Fourier coefficients
  vector[1] beta_trend;              // trend coefficient (tau in the text)
  real beta_0;                       // intercept (alpha_0 in the text)
}
transformed parameters {
  real<lower=0> delta;               // Negative Binomial dispersion (phi in the text)
  real<lower=0> mu_t[N];             // latent mean (lambda_t in the text)
  vector<lower=0>[N] seasonal;       // general level (mu_t in the text)
  delta = 1/pow(delta_inv,2);
  seasonal = exp(x * beta + fouriers * beta_fourier + 
                  trend * beta_trend + beta_0);
  mu_t[1] = y[1];
  for (n in 2:N) {
    mu_t[n] = (1 - phi - alpha) * seasonal[n] + phi * mu_t[n-1] + alpha * y[n-1];
  }
}

In the model section, we see how the prior distributions are defined. As described before, we put Normal priors on every regression coefficient. The Normal distributions of the beta_fourier coefficients are parameterized by the means and standard deviations we extracted from the posterior of the temperature model, which we pass as fourier_loc and fourier_sd to the sampling() call below.

model {
  phi ~ gamma(1,10);                                // prior on the lagged-mean weight
  alpha ~ gamma(0.5,10);                            // prior on the lagged-observation weight
  delta_inv ~ normal(0,0.5);                        // prior on 1 / sqrt(dispersion)
  beta_fourier ~ normal(fourier_loc, fourier_sd);   // posterior of the temperature model as prior
  beta_trend ~ normal(trend_loc, trend_sd);         // business growth expectation as prior
  beta ~ normal(0,0.5);
  beta_0 ~ normal(2,1);
  for (n in 2:N) {
    target += neg_binomial_2_lpmf(y[n] | mu_t[n], delta);
  }
}
generated quantities {
  int y_hat[N];
  
  for (n in 1:N)
    y_hat[n] = neg_binomial_2_rng(mu_t[n], delta);
}

As a last preparation step, we take the time series of bike sales as well as the corresponding regressors and bring them into the format defined in the Stan model’s data section. We compile the Stan model walked through above (stored here in a file assumed to be called negbin_prior.stan) and then call the sampling() function to again perform MCMC inference, using the priors learned from the temperature model and the roughly 3% year-over-year growth expected by the business.

hciti <- readRDS("hciti_full.Rds")
hciti <- filter(hciti, date < as.Date("2013-10-15"))

# align the yearly Fourier terms and the trend with the observed sales dates
fourier_hciti <- fourier_table %>%
  filter(date %in% hciti$date) %>%
  select(-date) %>%
  as.matrix()
trend_hciti <- matrix((1:length(hciti$date))/365.25, ncol = 1)
xreg_hciti <- as.matrix(hciti[,paste0("wday", 1:6)])
y_hciti <- hciti$rides
N_hciti <- length(y_hciti)
K_hciti <- dim(xreg_hciti)[2]

# compile the Stan model walked through above (file name assumed)
negbin_prior_stan <- stan_model("negbin_prior.stan")

citi_temp_fit <- sampling(negbin_prior_stan,
                     data = list(y = y_hciti,
                                 x = xreg_hciti,
                                 fouriers = fourier_hciti,
                                 trend = trend_hciti,
                                 trend_loc = 0.03,
                                 trend_sd = 0.02,
                                 fourier_loc = beta_means,
                                 fourier_sd = beta_sds,
                                 N = N_hciti, K = K_hciti,
                                 F = 12),
                     iter = 4000, algorithm = "NUTS", seed = 512)

As we did previously with the frequentist ARIMA models, we can now use our fitted model to forecast the time series. Since we live in the future, we can compare our model’s forecast against the actual sales. The result looks as follows.

Forecast of bike sales using priors learned from temperature model. The shaded areas correspond to the 50% and 90% prediction intervals.

There are a couple things left to highlight.

On the one hand, the degree to which the predicted yearly seasonality matches the actual seasonality of the sales is striking. It is impressive how much structure we were able to incorporate into the forecast by using a little prior knowledge.

On the other hand, the forecast is also too optimistic. There might be several reasons for this. First, we have not previously observed how low sales really are during winter. Also, sales appear to have decreased in summer 2014 compared to summer 2013. This, of course, is the opposite of what we told the model to expect with the 3% growth prior.

In any case, compared to the initial ARIMA model with weekly Fourier terms or a simple mean forecast, this simple trick reduces every error metric by more than 50%.

model      rmse   mae    mape
arima      39.9   33.1   2.6
meanf      42.3   35.8   3.0
nb_prior   19.1   15.3   1.0
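
For completeness, this is roughly how such metrics can be computed, assuming a vector actuals of observed sales over the forecast horizon and a vector fc of point forecasts of the same length (both names are placeholders, not objects defined above):

rmse <- sqrt(mean((actuals - fc)^2))
mae  <- mean(abs(actuals - fc))
mape <- mean(abs((actuals - fc) / actuals))  # here as a ratio rather than a percentage
c(rmse = rmse, mae = mae, mape = mape)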

Most importantly, however, we were able to produce a forecast that is believable and trustworthy by incorporating our prior knowledge into it. The forecast is by no means perfect (when is any forecast?), but it is useful, and it actually comes pretty close. One can work with it and use it to derive decisions in a business context, long before the two years of data have been observed that traditional methods would need to model the yearly seasonality.

Acknowledgement

This article was in part motivated by Rob J. Hyndman’s post on Fitting Models to Short Time Series.