R-squared as a Forecast Evaluation Metric
When Jane Street’s $120,000 “Real-Time Market Data Forecasting” Kaggle competition closes tomorrow, submitted forecasts will be evaluated using \(R^2\):
> Submissions are evaluated on a scoring function defined as the sample weighted zero-mean R-squared score (\(R^2\)) of `responder_6`. The formula is given by: \(R^2 = 1 - \frac{\sum w_i(y_i - \hat{y}_i)^2}{\sum w_i y_i^2}\) where \(y\) and \(\hat{y}\) are the ground-truth and predicted value vectors of `responder_6`, respectively; \(w\) is the sample weight vector.
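In code, that metric is only a couple of lines. The sketch below is my own, assuming NumPy arrays for the ground truth, predictions, and sample weights:

```python
import numpy as np

def weighted_zero_mean_r2(y, y_hat, w):
    """Sample-weighted zero-mean R-squared (sketch of the competition metric).

    y, y_hat and w are 1-D arrays of ground truth, predictions and sample
    weights. Note that the denominator measures deviations from zero, not
    from the sample mean of y.
    """
    return 1.0 - np.sum(w * (y - y_hat) ** 2) / np.sum(w * y ** 2)
```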
Stats 101 courses have left \(R^2\) with a bad rep as a gameable regression metric. But I’ve come to appreciate it as a forecast evaluation metric, provided it is calculated on the test set, with the total-sum-of-squares denominator based on the test period’s empirical mean, \(\bar{y}_{T,h} = \frac{1}{h}\sum_{t = T+1}^{T+h}y_t\), where \(T\) is the index of the training set’s last observation and \(h\) is the number of observations in the test set. Then:
\[ R^2 = 1 - \frac{\sum_{t = T+1}^{T+h}(y_t - \hat{y}_t)^2}{\sum_{t = T+1}^{T+h}(y_t - \bar{y}_{T,h})^2} \]
This definition gives the total sum of squares the advantage of future knowledge through \(\bar{y}_{T,h}\), knowledge that the predictions \(\hat{y}_t\) in the residual sum of squares do not have. Unlike its linear regression origin, this test-set-focused definition does not improve just because parameters are added to the forecast model. But it does keep its interpretability:
- A value of 1 indicates a perfect model
- A value of 0 indicates a model only as good as the test mean (which, as a forecast, is perfectly unbiased and exactly predicts the total sum over the forecast horizon)
- Any negative value indicates a model that isn’t even as good as the simple test mean (but then again, the test mean uses future knowledge)
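In code, this test-set \(R^2\) is equally short; the sketch below assumes NumPy arrays of test observations and forecasts, and the function name is mine:

```python
import numpy as np

def test_set_r2(y_test, y_pred):
    """R-squared over the test period, centering the total sum of squares on
    the test period's own empirical mean (which no real forecast can know)."""
    rss = np.sum((y_test - y_pred) ** 2)
    tss = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - rss / tss
```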
For example, when predicting the final twelve months of the AirPassengers data, the obviously poor prediction given by the training set mean results in an \(R^2\) of -8.24, whereas a seasonal naive prediction achieves an \(R^2\) of 0.54.
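That comparison can be reproduced roughly as follows, reusing the `test_set_r2` sketch from above and assuming the AirPassengers series is loaded via statsmodels’ `get_rdataset` helper (which fetches the R `datasets` version, so it needs internet access):

```python
import numpy as np
import statsmodels.api as sm

# Monthly airline passenger counts, 1949-1960 (144 observations);
# get_rdataset converts the R time series to a DataFrame with a "value" column.
air = sm.datasets.get_rdataset("AirPassengers").data["value"].to_numpy()

train, test = air[:-12], air[-12:]          # hold out the final twelve months

mean_forecast = np.full(12, train.mean())   # training set mean, repeated
snaive_forecast = train[-12:]               # seasonal naive: same month one year earlier

print(test_set_r2(test, mean_forecast))     # strongly negative (about -8, as quoted above)
print(test_set_r2(test, snaive_forecast))   # positive (about 0.5, as quoted above)
```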
This shouldn’t come as a surprise. Interpreting \(R^2\) as the fraction of (test) variance explained, the seasonal naive prediction can explain more than half of the variance that the test mean left unexplained. The model captures a large part of the signal hidden in the data’s variation.
Jane Street replaced the test set’s empirical mean in their definition of \(R^2\) with zero, which in their application of predicting expected financial returns is a hard-to-beat expected value. In the same spirit, one could replace the total-sum-of-squares denominator with the squared errors of a different benchmark, such as seasonal naive predictions, when the data’s seasonality means the seasonal naive forecast will consistently outperform the test set mean despite the latter’s future knowledge. Seen this way, \(R^2\) is not so different from more common relative measures such as the Relative Mean Squared Error described in section 6.1.4 of Hewamalage et al. (2022), or the Skill Scores employed in weather prediction described by Murphy (1988).
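A benchmark-relative variant takes only a small change to the earlier sketch: swap the denominator for the benchmark’s squared errors. Again, the function name is mine:

```python
import numpy as np

def benchmark_relative_r2(y_test, y_pred, y_bench):
    """R-squared-style skill score: one minus the ratio of the model's squared
    errors to those of a benchmark forecast, e.g. seasonal naive predictions.
    Positive values mean the model beats the benchmark on the test set."""
    return 1.0 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_bench) ** 2)
```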
Then again, when a time series’ variance is not dominated by its seasonal component, \(R^2\) and its interpretation as the fraction of variance explained can paint a damning picture of forecasts that are not actually all that skillful. It casts doubt by asking how much signal your model has picked up to meaningfully predict future variation. Suddenly, any value larger than zero will feel like an enormous success.