Notes

2024/04/21
Debug Forecasts with Animated Plots

Speaking of GIFs, animated visualizations of rolling forecasts are eye-opening: they reveal the impact of individual observations, the number of observations, and default settings on a model’s forecasts. In the example below, the default forecast::auto.arima() transitions between poor model specifications until it finally picks up the seasonality after 24 observations, only to generate a negative point forecast despite purely non-negative observations.

Fantastic way to understand forecast methods’ edge-case behavior.

My favorite frame? After nine observations, when a model specification with trend is picked and returns an exploding forecast based on very little evidence.

The code for this GIF is in this small GitHub repo.
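
The core of such a rolling setup could be sketched like this (a minimal sketch with made-up data, not the repo’s actual code): refit forecast::auto.arima() on a growing window and collect one forecast per frame.

library(forecast)

# Illustrative monthly series with yearly seasonality (made up for this sketch)
set.seed(1)
y <- ts(pmax(0, 50 + 20 * sinpi(1:48 / 6) + rnorm(48, sd = 5)), frequency = 12)

# Refit auto.arima() on a growing window and keep each 12-step forecast as one frame
rolling_forecasts <- lapply(6:length(y), function(t) {
  fit <- forecast::auto.arima(window(y, end = time(y)[t]))
  forecast::forecast(fit, h = 12)
})

Plotting one element of rolling_forecasts per frame (for example with gganimate) then yields the GIF.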


2024/03/24
AI Act Article 17 - Quality Management System

Article 17 of the AI Act adopted by the EU Parliament is the ideal jumping-off point into other parts of the legislation. While Article 16 lists the “Obligations of providers of high-risk AI systems”, Article 17 describes the main measure by which providers can ensure compliance: the quality management system.

That system shall be documented in a systematic and orderly manner in the form of written policies, procedures and instructions, and shall include at least the following aspects […]

Note that providers write the policies themselves. The policies have to cover aspects summarizing critical points found elsewhere in the document, such as the development, quality control, and validation of the AI system. But also…

  • a description of the applied harmonized standards or other means to comply with the requirements for high-risk AI systems as set out in Section 2,
  • procedures for data management such as analysis, labeling, storage, aggregation, and retention,
  • procedures for post-marketing monitoring as in Article 72, incident notification as in Article 73, and record-keeping as in Article 12, and
  • procedures for communication with competent authorities, notified bodies, and customers.

This doesn’t necessarily sound all that exciting and might be glossed over on first reading, but search for Article 17 in the AI Act and you’ll find the quality management system listed alongside the technical documentation as a criterion in the conformity assessment of high-risk AI systems that providers perform themselves (Annex VI) or that notified bodies perform for them (Annex VII).

Quality management systems are the means by which providers can self-assess their high-risk AI systems against official standards and comply with the regulation, and they are thus central to a reading of the AI Act.


2024/03/14
AI Act Approved by EU Parliament

The EU AI Act has finally come to pass, and pass it did with 523 of 618 votes of the EU Parliament in favor. The adopted text (available as of writing as PDF or Word document—the latter is much easier to work with!) has seen a number of changes since the original proposal by the EU Commission in 2021.

For example, the current text reduces the set of systems considered high-risk somewhat by excluding those that are “not materially influencing the outcome of decision making” (Chapter III, Section 1, Article 6, Paragraph 3) except for those already covered by EU regulation such as medical devices and elevators. It also requires providers of any AI systems to mark generated audio, image, video or text content as such in a “machine-readable format and detectable as artificially generated” while also being “interoperable” (Chapter IV, Article 50, Paragraph 2).

And then there is the “right to an explanation” (Article 86). While data scientists and machine learning engineers hear “Explainable AI” and start submitting abstracts to CHI, the provision does not appear to ask for an explanation of the AI system’s recommendation, but only for a description of the overall process in which the system was deployed:

Any affected person subject to a decision which is taken by the deployer on the basis of the output from a high-risk AI system listed in Annex III, with the exception of systems listed under point 2 thereof, and which produces legal effects or similarly significantly affects that person in a way that they consider to have an adverse impact on their health, safety or fundamental rights shall have the right to obtain from the deployer clear and meaningful explanations of the role of the AI system in the decision-making procedure and the main elements of the decision taken.

This reading1 is confirmed by the references to the “deployer” and not the “provider” of the AI system. The latter would be the one who can provide an interpretable algorithm or at least explanations.

Additionally, this only refers to high-risk systems listed in Annex III, such as those used for hiring, workers management, or credit scoring. However, it thereby also excludes high-risk systems falling under existing EU regulations listed in Annex I, such as medical devices and elevators.

The devil is in the details.


  1. Which is not based on any legal expertise whatsoever, by the way. ↩︎


2023/12/12
Stop Using Dynamic Time Warping for Business Time Series

Dynamic Time Warping (DTW) is designed to reveal inherent similarity between two time series of similar scale that was obscured because the time series were shifted in time or sampled at different speeds. This makes DTW useful for time series of natural phenomena like electrocardiogram measurements or recordings of human movements, but less so for business time series such as product sales.

To see why that is, let’s first refresh our intuition of DTW and then check why it is not the right tool for business time series.

What Dynamic Time Warping Does

Given two time series, DTW computes aligned versions of the time series to minimize the cumulative distance between the aligned observations. The alignment procedure repeats some of the observations to reduce the resulting distance. The aligned time series end up with more observations than the originals, but the aligned versions still have to keep the same start and end as the originals, no observation may be dropped, and the order of observations must be unchanged.

But the best way to understand the alignment is to visualize it.

Aligning a Shift

To start, I’ll use an example where DTW can compute a near-perfect alignment: that of a cosine curve and a shifted cosine curve—which is just a sine curve.1 For each of the curves, we observe 12 observations per period and 13 observations in total.

series_sine <- sinpi((0:12) / 6)
series_cosine <- cospi((0:12) / 6)

To compute the aligned versions, I use the DTW implementation of the dtw package (also available for Python as dtw-python) with default parameters.

library(dtw)
dtw_shift <- dtw::dtw(x = series_sine, y = series_cosine)

Besides returning the distance of the aligned series, DTW produces a mapping of original series to aligned series in the form of alignment vectors dtw_shift$index1 and dtw_shift$index2. Using those, I can visualize both the original time series and the aligned time series along with the repetitions used for alignment.

# DTW returns a vector of indices of the original observations
# where some indices are repeated to create aligned time series
dtw_shift$index2
##  [1]  1  1  1  1  2  3  4  5  6  7  8  9 10 11 12 13
# In this case, the first index is repeated thrice so that the first
# observation appears four times in the aligned time series
series_cosine[dtw_shift$index2] |> head(8) |> round(2)
## [1]  1.00  1.00  1.00  1.00  0.87  0.50  0.00 -0.50

The plot below shows the original time series in the upper half and the aligned time series in the lower half, with the sine in orange and the cosine in black. Dashed lines indicate where observations were repeated to achieve the alignment.

Given that the sine is a perfect shifted copy of the cosine, three-quarters of the observed period can be aligned perfectly. The first quarter of the sine and the last quarter of the cosine’s period, however, can’t be aligned and stand on their own. Their indices are mapped to the repeated observations from the other time series, respectively.
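
If you want to reproduce a similar figure quickly without custom plotting code, the dtw package’s built-in two-way plot is a rough stand-in; it requires re-running the alignment with keep.internals = TRUE so that the series are stored in the result:

# Built-in two-way alignment plot as a quick stand-in for the custom figure
dtw_shift_kept <- dtw::dtw(x = series_sine, y = series_cosine, keep.internals = TRUE)
plot(dtw_shift_kept, type = "twoway")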

Aligning Speed

Instead of shifting the cosine, I can sample it at a different speed (or, equivalently, observe a cosine of different frequency) to construct a different time series that can be aligned well by DTW. In that case, the required alignment is not so much a shift as it is a squeezing and stretching of the observed series.

Let’s create a version of the original cosine that is twice as “fast”: In the time that we observe one period of the original cosine, we observe two periods of the fast cosine.

series_cosine_fast <- cospi((0:12) / 3)
dtw_fast <- dtw::dtw(x = series_cosine_fast, y = series_cosine)

The resulting alignment mapping looks quite different than in the first example. Under a shift, most observations still have a one-to-one mapping after alignment. Under varying frequencies, most observations of the faster time series have to be repeated to align. Note how the first half of the fast cosine’s first period can be neatly aligned with the first half of the slow cosine’s period by repeating observations (in an ideal world exactly twice).

The kind of alignment reveals itself better when the time series are observed for more than just one or two periods. Below, for longer versions of the same series, half of the fast time series can be matched neatly with the slow cosine as we observe twice the number of periods for the fast cosine.
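
The longer versions could be constructed along these lines (a sketch; the exact series lengths behind the original figure may differ):

# Longer versions: four periods of the slow cosine, eight of the fast one
series_cosine_long <- cospi((0:48) / 6)
series_cosine_fast_long <- cospi((0:48) / 3)
dtw_fast_long <- dtw::dtw(x = series_cosine_fast_long, y = series_cosine_long)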

Aligning Different Scales

What’s perhaps unexpected, though, is that the alignment works only on the time dimension. Dynamic time warping will not scale the time series’ amplitude. But at the same time DTW is not scale independent. This can make the alignment fairly awkward when time series have varying scales as DTW exploits the time dimension to reduce the cumulative distance of observations in the value dimension.

To illustrate this, let’s take the sine and cosine from the first example but scale the sine’s amplitude by 10 and check the resulting alignment.

series_sine_scaled <- series_sine * 10
dtw_scaled <- dtw::dtw(x = series_sine_scaled, y = series_cosine)

We might expect DTW to align the two series as it did in the first example above with unscaled series. After all, the series have the same frequencies as before and the same period shift as before.

This is what the alignment would look like using the alignment vectors from the first example above based on the unscaled observations. While it’s a bit hard to see due to the different scales, the observations at peak amplitude are aligned (e.g., indices 4, 10, 16), as are those at the minimum amplitude (indices 7 and 13).
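
In code, this hypothetical alignment simply reuses the index vectors computed on the unscaled series (a sketch of the idea):

# Reuse the alignment from the first example (computed on unscaled series)
# and apply it to the scaled sine and the unscaled cosine
sine_scaled_aligned <- series_sine_scaled[dtw_shift$index1]
cosine_aligned <- series_cosine[dtw_shift$index2]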

But Dynamic Time Warping’s optimization procedure doesn’t actually try to identify characteristics of time series such as their period length to align them. It warps time series purely to minimize the cumulative distance between aligned observations. This may lead to a result in which also the periodicity is aligned as in the first and second examples above. But that’s more by accident than by design.

This is how DTW actually aligns the scaled sine and the unscaled cosine:

The change in the series’ amplitude leads to a more complicated alignment: Observations at the peak amplitude of the cosine (which has the small amplitude) are repeated many times to reduce the Euclidean distance to the already high-amplitude observations of the sine. Conversely, the minimum-amplitude observations of the sine are repeated many times to reduce the Euclidean distance to the already low-amplitude observations of the cosine.

DTW Is Good at Many Things…

Dynamic time warping is great when you’re observing physical phenomena that are naturally shifted in time or at different speeds.

Consider, for example, measurements taken in a medical context such as those of an electrocardiogram (ECG) that measures the electrical signals in a patient’s heart. In this context, it is helpful to align time series to identify similar heart rhythms across patients. The rhythms’ periods could be aligned to check whether one patient has the same issues as another. Even for the same person, DTW can be useful to align measurements taken on different days at different heart rates.

data("aami3a") # ECG data included in `dtw` package
dtw_ecg <- dtw::dtw(
  x = as.numeric(aami3a)[1:2880],
  y = as.numeric(aami3a)[40001:42880]
)

An application such as this one is nearly identical to the first example of the shifted cosine. And as the scale of the time series is naturally the same across the two series, the alignment works well, mapping peaks and valleys with minimal repetitions.

There is also a natural interpretation to the alignment, namely that we are aligning the heart rhythm across measurements. A large distance after alignment would indicate differences in rhythms.

…But Comparing Sales Ain’t One of Them

It is enticing to use Dynamic Time Warping in a business context, too. Not only does it promise a better distance metric to identify time series that are “similar” and to separate them from those that are “different”, but it also has a cool name. Who doesn’t want to warp stuff?

We can, for example, warp time series that count the number of times products are used for patients at a hospital per month. Not exactly product sales, but they share the same characteristics. The data comes with the expsmooth package.

set.seed(47772)
sampled_series <- sample(x = ncol(expsmooth::hospital), size = 2)
product_a <- as.numeric(expsmooth::hospital[,sampled_series[1]])
product_b <- as.numeric(expsmooth::hospital[,sampled_series[2]])

While product_b is used about twice as often as product_a, both products exhibit an increase in their level at around index 40, which is perhaps one characteristic that makes them similar, or at least more similar than a series that doesn’t share this increase.

However, the warped versions exhibit a lot of unwanted repetitions of observations. Given the different level of the products, this should not come as a surprise.

dtw_products_ab <- dtw::dtw(x = product_b, y = product_a)

We can mostly fix this by min-max scaling the series, forcing their observations onto the same range of values.

dtw_products_ab_standardized <- dtw::dtw(
  x = (product_b - min(product_b)) / (max(product_b) - min(product_b)),
  y = (product_a - min(product_a)) / (max(product_a) - min(product_a))
)

While we got rid of the unwanted warping artifacts in the form of extremely long repetitions of the same observation, the previous plot should raise some questions:

  • How can we interpret the modifications for “improved” alignment?
  • Are we aligning just noise?
  • What does the warping offer us that we didn’t already have before?

If you ask me, there is no useful interpretation to the alignment modifications, because the time series already were aligned. Business time series are naturally aligned by the point in time at which they are observed.2

We are also not taking measurements of a continuous physical phenomenon that goes up and down like the electric signals in a heart. While you can double the amount of ECG measurements per second, it does not make sense to “double the amount of sales measurements per month”. And while a product can “sell faster” than another, this results in an increased amplitude, not an increased frequency (or a shortened period length).

So if you apply Dynamic Time Warping to business time series, you will mostly be warping random fluctuations with the goal of reducing the cumulative distance just that tiny bit more.

This holds when you use Dynamic Time Warping to derive distances for clustering of business time series, too. You might as well calculate the Euclidean distance3 on the raw time series without warping first. At least then the distance between time series will tell you that they were similar (or different) at the same point in time.

In fact, if you want to cluster business time series (or train any kind of model on them), put your focus on aligning them well in the value dimension by thinking carefully about how you standardize them. That makes all the difference.
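
As a sketch of that alternative: standardize each series (here via min-max scaling) and compute plain Euclidean distances, no warping involved. With more than the two example series, the resulting distance matrix can feed directly into a clustering algorithm.

# Sketch: Euclidean distances between min-max scaled series, no warping
scale_minmax <- function(x) (x - min(x)) / (max(x) - min(x))
series_matrix <- rbind(
  product_a = scale_minmax(product_a),
  product_b = scale_minmax(product_b)
)
clustering <- hclust(dist(series_matrix, method = "euclidean"))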

Before using Dynamic Time Warping, ask yourself: What kind of similarity are you looking for when comparing time series? Will there be meaning to the alignment that Dynamic Time Warping induces in your time series? And what is it that you can do after warping that you weren’t able to do before?


  1. A similar example is also used by the dtw package in its documentation.↩︎

  2. Two exceptions come to mind: First, you may want to align time series of products by the first point in time at which the product was sold. But that’s straightforward without DTW. Second, you may want to align products that actually exhibit a shift in their seasonality—such as when a product is heavily affected by the seasons and sold in both the northern and southern hemisphere. DTW might be useful for this, but there might also be easier ways to accomplish it.↩︎

  3. Or Manhattan distance, or any other distance.↩︎


2023/12/03
Comes with Anomaly Detection Included

A powerful pattern in forecasting is that of model-based anomaly detection during model training. It exploits the inherently iterative nature of forecasting models and goes something like this:

  1. Train your model up to time step t based on data [1,t-1]
  2. Predict the forecast distribution at time step t
  3. Compare the observed value against the predicted distribution at step t; flag the observation as an anomaly if it is in the very tail of the distribution
  4. Don’t update the model’s state based on the anomalous observation

For another description of this idea, see, for example, Alexandrov et al. (2019).

Importantly, you use your model to detect anomalies during training and not after training, thereby ensuring its state and forecast distribution are not corrupted by the anomalies.

The beauty of this approach is that

  • the forecast distribution can properly reflect the impact of anomalies, and
  • you don’t require a separate anomaly detection method with failure modes different from your model’s.

In contrast, standalone anomaly detection approaches would first have to handle trends and seasonalities themselves before they could begin to detect any anomalous observation. So why not use the model you already trust with predicting your data to identify observations that don’t make sense to it?

This approach can be expensive if your model doesn’t train iteratively over observations in the first place. Many forecasting models1 do, however, making this a fairly negligible addition.

But enough introduction, let’s see it in action. First, I construct a simple time series y of monthly observations with yearly seasonality.

set.seed(472)
x <- 1:80
y <- pmax(0, sinpi(x / 6) * 25 + sqrt(x) * 10 + rnorm(n = 80, sd = 10))

# Insert anomalous observations that need to be detected
y[c(55, 56, 70)] <- 3 * y[c(55, 56, 70)]

To illustrate the method, I’m going to use a simple probabilistic variant of the Seasonal Naive method where the forecast distribution is assumed to be Normal with zero mean. Only the \(\sigma\) parameter needs to be fitted, which I do using the standard deviation of the forecast residuals.

The estimation of the \(\sigma\) parameter occurs in lockstep with the detection of anomalies. Let’s first define a data frame that holds the observations and will store a cleaned version of the observations, the fitted \(\sigma\) and detected anomalies…

df <- data.frame(
  y = y,
  y_cleaned = y,
  forecast = NA_real_,
  residual = NA_real_,
  sigma = NA_real_,
  is_anomaly = FALSE
)

… and then let’s loop over the observations.

At each iteration, I first predict the current observation given the past and update the forecast distribution by calculating the standard deviation over all residuals that are available so far, before calculating the latest residual.

If the latest residual is in the tails of the current forecast distribution (i.e., larger than multiples of the standard deviation), the observation is flagged as anomalous.

For time steps with anomalous observations, I update the cleaned time series with the forecasted value (which informs a later forecast at step t+12) and set the residual to missing to keep the anomalous observation from distorting the forecast distribution.

# Loop starts when first prediction from Seasonal Naive is possible
for (t in 13:nrow(df)) {
  df$forecast[t] <- df$y_cleaned[t - 12]
  df$sigma[t] <- sd(df$residual, na.rm = TRUE)
  df$residual[t] <- df$y[t] - df$forecast[t]
  
  if (t > 25) {
    # Collect 12 residuals before starting anomaly detection
    df$is_anomaly[t] <- abs(df$residual[t]) > 3 * df$sigma[t]
  }
  
  if (df$is_anomaly[t]) {
    df$y_cleaned[t] <- df$forecast[t]
    df$residual[t] <- NA_real_
  }
}

Note that I decide to start the anomaly detection only once there are 12 residuals covering one full seasonal period, as a \(\sigma\) estimate based on just a handful of observations can be flaky.

In a plot of the results, the combination of 1-step-ahead prediction and forecast distribution is used to distinguish between expected and anomalous observations, with the decision boundary indicated by the orange ribbon. At time steps where the observed value falls outside the ribbon, the orange line indicates the model prediction that is used to inform the model’s state going forward in place of the anomaly.

Note how the prediction at time t is not affected by the anomaly at time step t-12. Neither is the forecast distribution estimate.
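
A plot along these lines can be recreated roughly as follows (a sketch using ggplot2; the original figure’s styling and details will differ):

library(ggplot2)

df$t <- seq_len(nrow(df))
ggplot(df, aes(x = t)) +
  # Decision boundary: +/- 3 sigma around the 1-step-ahead prediction
  geom_ribbon(aes(ymin = forecast - 3 * sigma, ymax = forecast + 3 * sigma),
              fill = "orange", alpha = 0.3) +
  geom_line(aes(y = y)) +
  # Predictions that replace anomalous observations in the model's state
  geom_point(data = subset(df, is_anomaly), aes(y = forecast), color = "orange") +
  theme_minimal()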

This would look very different when one gets the update behavior slightly wrong. For example, the following implementation of the loop detects the first anomaly in the same way, but uses it to update the model’s state, leading to subsequently poor predictions and false positives, and fails to detect later anomalies.

# Loop starts when first prediction from Seasonal Naive is possible
for (t in 13:nrow(df)) {
  df$forecast[t] <- df$y[t - 12]
  df$sigma[t] <- sd(df$residual, na.rm = TRUE)
  df$residual[t] <- df$y[t] - df$forecast[t]
  
  if (t > 25) {
    # Collect 12 residuals before starting anomaly detection
    df$is_anomaly[t] <- abs(df$residual[t]) > 3 * df$sigma[t]
  }
  
  if (df$is_anomaly[t]) {
    df$y_cleaned[t] <- df$forecast[t]
  }
}

What about structural breaks?

While anomaly detection during training can work well, it may fail spectacularly if an anomaly is not an anomaly but the start of a structural break. Since structural breaks make the time series look different than it did before, chances are the first observation after the structural break will be flagged as anomaly. Then so will be the second. And then the third. And so on, until all observations after the structural break are treated as anomalies because the model never starts to adapt to the new state.

This is particularly frustrating because the Seasonal Naive method is robust against certain structural breaks that occur in the training period. Adding anomaly detection makes it vulnerable.

What values to use for the final forecast distribution?

Let’s get philosophical for a second. What are anomalies?

Ideally, they reflect a weird measurement that will never occur again. Or if it does, it’s another wrong measurement—but not the true state of the measured phenomenon. In that case, let’s drop the anomalies and ignore them in the forecast distribution.

But what if the anomalies are weird states that the measured phenomenon can end up in? For example, demand for subway rides after a popular concert. While perhaps an irregular and rare event, a similar event may occur again in the future. Do we want to exclude that possibility from our predictions about the future? What if the mention of a book on TikTok lets sales spike? Drop the observation and assume it will not repeat? Perhaps unrealistic.

It depends on your context. In a business context, where measurement errors might be less of an issue, anomalies might need to be modeled, not excluded.


  1. Notably models from the state-space model family.↩︎


2023/11/12
Code Responsibly

There exists this comparison of software before and software after machine learning.

Before machine learning, code was deterministic: Software engineers wrote code, the code included conditions with fixed thresholds, and at least in theory the program was entirely understandable.

After machine learning, code is no longer deterministic. Instead of software engineers instantiating it, the program’s logic is determined by a model and its parameters. Those parameters are not artisanally chosen by a software engineer but learned from data. The program becomes a function of data, and in some cases incomprehensible to the engineer due to the sheer number of parameters.

Given the current urge to regulate AI and make its use responsible and trustworthy, humans appear to expect machine learning models to introduce an obscene number of bugs into software. Perhaps humans underestimate the ways in which human programmers can mess up.

For example, when I hear Regulate AI, all I can think is Have you seen this stuff? By Pierluigi Bizzini for Algorithm Watch1 (emphasis mine):

The algorithm evaluates teachers' CVs and cross-references their preferences for location and class with schools' vacancies. If there is a match, a provisional assignment is triggered, but the algorithm continues to assess other candidates. If it finds another matching candidate with a higher score, that second candidate moves into the lead. The process continues until the algorithm has assessed all potential matches and selected the best possible candidate for the role.

[…] [E]rrors have caused much confusion, leaving many teachers unemployed and therefore without a salary. Why did such errors occur?

When the algorithm finds an ideal candidate for a position, it does not reset the list of remaining candidates before commencing the search to fill the next vacancy. Thus, those candidates who missed out on the first role that matched their preferences are definitively discarded from the pool of available teachers, with no possibility of employment. The algorithm classes those discarded teachers as “drop-outs”, ignoring the possibility of matching them with new vacancies.

This is not AI gone rogue. This is just a flawed human-written algorithm. At least Algorithm Watch is aptly named Algorithm Watch.

Algorithms existed before AI.2 But there was no outcry, no regulation of algorithms before AI, no “Proposal for a Regulation laying down harmonised rules on artificial intelligence”. Except there actually is regulation in aviation and medical devices and such.3 Perhaps because of the extent to which these fields are entangled with hardware, posing unmediated danger to human lives.

Machine learning and an increased prowess in data processing have not introduced more bugs into software compared to software written by humans previously. What they have done is to enable applications for software that were previously infeasible by way of processing and generating data.

Some of these applications are… not great. They should never be done. Regulate those applications. By all means, prohibit them. If a use case poses unacceptable risks, when we can’t tolerate any bugs but bugs are always a possibility, then let’s just not do it.

Other applications are high-risk, high-reward. Given large amounts of testing imposed by regulation, we probably want software to enable these applications. The aforementioned aviation and medical devices come to mind.4 Living with regulated software is nothing new!

Then there is the rest that doesn’t really harm anyone where people can do whatever.

Regulating software in the context of specific use cases is feasible and has precedents.

Regulating AI is awkward. Where does the if-else end and AI start? Instead, consider it part of software as a whole and ask in which cases software might have flaws we are unwilling to accept.

We need responsible software, not just responsible AI.


  1. I tried to find sources for the description of the bug provided in the article, but couldn’t find any. I don’t know where the author takes them from, so take this example with the necessary grain of salt. ↩︎

  2. Also, much of what were algorithms or statistics a few years ago are now labeled as AI. And large parts of AI systems are just if statements and for loops and databases and networking. ↩︎

  3. Regulation that the EU regulation of artificial intelligence very much builds upon. See “Demystifying the Draft EU Artificial Intelligence Act” (https://arxiv.org/abs/2107.03721) for more on the connection to the “New Legislative Framework”. ↩︎

  4. Perhaps this is the category for autonomous driving? ↩︎


2023/05/20
The 2-by-2 of Forecasting

False Positives and False Negatives are traditionally a topic in classification problems only. Which makes sense: There is no such thing as a binary target in forecasting, only a continuous range. There is no true and false, only a continuous scale of wrong. But there lives an MBA student in me who really likes 2-by-2 charts, so let’s come up with one for forecasting.

The {True,False}x{Positive,Negative} confusion matrix is the one opportunity for university professors to discuss the stakeholders of machine learning systems: A stakeholder might care more about reducing the number of False Positives and thus accept a higher rate of False Negatives. Certain errors are more critical than others. That’s just as much the case in forecasting.

To construct the 2-by-2 of forecasting, the obvious place to start is the sense of “big errors are worse”. Let’s put that on the y-axis.

This gives us the False and “True” equivalents of forecasting. The “True” is in quotes because any deviation from the observed value is some error. But for the sake of the 2-by-2, let’s call small errors “True”.

Next, we need the Positive and Negative equivalents. When talking about stakeholder priorities, Positive and Negative differentiate the errors that are Critical from those that are Acceptable. Let’s put that on the x-axis.

While there might be other ways to define criticality1, human perception of time series forecastability comes up as soon as users of your product inspect your forecasts. The human eye will detect apparent trends and seasonal patterns and project them into the future. If your model does not (and wrongly so), it causes confusion instead. Thus, forecasts of series predictable by humans will be criticized, while forecasts of series with huge swings and more noise than signal are easily acceptable.

To utilize this notion, we require a model of human perception of time series forecastability. In a business context, where seasonality may be predominant, the seasonal naive method captures much of what is modelled by the human eye. It is also inherently conservative, as it does not overfit to recent fluctuations or potential trends. It assumes business will repeat as it did before.

Critical, then, are forecasts of series that the seasonal naive method, or any other appropriate benchmark, predicts with small error, while Acceptable are any forecasts of series that the seasonal naive method predicts poorly. This completes the x-axis.

With both axes defined, the quadrants of the 2-by-2 emerge. Small forecast model errors are naturally Acceptable True when the benchmark model fails, and large forecast model errors are Acceptable False when the benchmark model also fails. Cases of series that feel predictable and are predicted well are Critical True. Lastly, series that are predicted well by a benchmark but not by the forecast model are Critical False.
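
In code, the quadrant assignment boils down to comparing two errors per series against a cut-off (a sketch; the error metric and the single shared threshold are placeholders, and in practice the cut-offs for model and benchmark could differ):

# Sketch: assign each series to a quadrant of the 2-by-2, based on the
# forecast model's error and the seasonal naive benchmark's error
quadrant <- function(model_error, benchmark_error, threshold) {
  criticality <- ifelse(benchmark_error <= threshold, "Critical", "Acceptable")
  error_size <- ifelse(model_error <= threshold, "True", "False")
  paste(criticality, error_size)
}

quadrant(model_error = c(0.1, 0.9), benchmark_error = c(0.2, 0.3), threshold = 0.5)
## [1] "Critical True"  "Critical False"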

The Critical False group contains the series for which users expect a good forecast because they themselves can project the series into the future—but your model fails to deliver that forecast and does something weird instead. It’s the group of forecasts that look silly in your tool, the ones that cause you discomfort when users point them out.

Keep that group small.


  1. For example, the importance of a product to the business as measured by revenue. ↩︎


2021/09/01
Forecasting Uncertainty Is Never Too Large

Rob J. Hyndman gave a presentation titled “Uncertain futures: what can we forecast and when should we give up?” as part of the ACEMS public lecture series with recording available on Youtube.

He makes an often underappreciated point around minute 50 of the talk:

When the forecast uncertainty is too large to assist decision making? I don’t think that’s ever the case. Forecasting uncertainty being too large does assist decision making by telling the decision makers that the future is very uncertain and they should be planning for lots of different possible outcomes and not assuming just one outcome or another. And one of the problems we have in providing forecasts to decision makers is getting them to not focus in on the most likely outcome but to actually take into account the range of possibilities and to understand that futures are uncertain, that they need to plan for that uncertainty.


2021/05/02
Everything is an AI Technique

Along with their proposal for regulation of artificial intelligence, the EU published a definition of AI techniques. It includes everything, and that’s great!

From the proposal’s Annex I:

ARTIFICIAL INTELLIGENCE TECHNIQUES AND APPROACHES referred to in Article 3, point 1

  • (a) Machine learning approaches, including supervised, unsupervised and reinforcement learning, using a wide variety of methods including deep learning;
  • (b) Logic- and knowledge-based approaches, including knowledge representation, inductive (logic) programming, knowledge bases, inference and deductive engines, (symbolic) reasoning and expert systems;
  • (c) Statistical approaches, Bayesian estimation, search and optimization methods.

Unsurprisingly, this definition and the rest of the proposal made the rounds: Bob Carpenter quipped about the fact that, according to this definition, he has been doing AI for 30 years now (and that the EU feels the need to differentiate between statistics and Bayesian inference). In his newsletter, Thomas Vladeck takes the proposal apart to point out potential ramifications for applications. And Yoav Goldberg has been tweeting about it ever since a draft of the document leaked.

From a data scientist’s point of view, this definition is fantastic: First, it highlights that AI is a marketing term used to sell whatever method does the job. Not including optimization as an AI technique would have given everyone who called their optimizer “AI” a way to wiggle out of the regulation. This implicit acknowledgement is welcome.

Second, and more importantly, as a practitioner it’s practical to have this “official” set of AI techniques in your back pocket for when someone asks what exactly AI is. The fact that one doesn’t have to use deep learning to wear the AI bumper sticker means that we can be comfortable in choosing the right tool for the job. At this point, AI refers less to a set of techniques or artificial intelligence, and more to a family of problems that are solved by one of the tools listed above.


2021/01/01
Resilience, Chaos Engineering and Anti-Fragile Machine Learning

In his interview with The Observer Effect, Tobi Lütke, CEO of Shopify, describes how Shopify benefits from resilient systems:

Most interesting things come from non-deterministic behaviors. People have a love for the predictable, but there is value in being able to build systems that can absorb whatever is being thrown at them and still have good outcomes.

So, I love Antifragile, and I make everyone read it. It finally put a name to an important concept that we practiced. Before this, I would just log in and shut down various servers to teach the team what’s now called chaos engineering.

But we’ve done this for a long, long time. We’ve designed Shopify very well because resilience and uptime are so important for building trust. These lessons were there in the building of our architecture. And then I had to take over as CEO.

It sticks out that Lütke uses “resilient” and “antifragile” interchangeably even though Taleb would point out that they are not the same thing: Whereas a resilient system doesn’t fail due to randomly turned off servers, an antifragile system benefits. (Are Shopify’s systems robust or have they become somehow better beyond robust due to their exposure to “chaos”?)

But this doesn’t diminish Lütke’s notion of resilience and uptime being “so important for building trust” (with users, presumably): Users’ trust in applications is fragile. Earning users’ trust in a tool that augments or automates decisions is difficult, and the trust is withdrawn quickly when the tool makes a stupid mistake. Making your tool robust against failure modes is how you make it trustworthy—and used.

Which makes it interesting to reason about what the equivalent of shutting off random servers would be for machine learning applications (beyond shutting off the server running the model). Label noise? Shuffling features? Adding Covid-19-style disruptions to your time series? The latter might be more related to the idea of experimenting with a software system in production.

And—to return to the topic of discerning anti-fragile and robust—what would it mean for machine learning algorithms “to gain from disorder”? Dropout comes to mind. What about causal inference through natural experiments?


2018/11/11
Videos from PROBPROG 2018 Conference

Videos of the talks given at the International Conference on Probabilistic Programming (PROBPROG 2018) back in October were published a few days ago and are now available on Youtube. I have not watched all presentations yet, but a lot of big names attended the conference so there should be something for everyone. In particular the talks by Brooks Paige (“Semi-Interpretable Probabilistic Models”) and Michael Tingley (“Probabilistic Programming at Facebook”) made me curious to explore their topics more.


2018/09/30
Videos from Exploration in RL Workshop at ICML

One of the many fantastic workshops at ICML this year was the Exploration in Reinforcement Learning workshop. All talks were recorded and are now available on Youtube. Highlights include presentations by Ian Osband, Emma Brunskill, and Csaba Szepesvari, among others. You can find the workshop’s homepage here with more information and the accepted papers.


2016/12/14
Multi-Armed Bandits at Tinder

In a post on Tinder’s tech blog, Mike Hall presents a new application for multi-armed bandits. At Tinder, they started to use multi-armed bandits to optimize which of a user’s photos is shown first: While a user can have multiple photos in their profile, only one of them is shown first when another user swipes through the deck of user profiles. By employing an adapted epsilon-greedy algorithm, Tinder optimizes this photo for the “Swipe-Right-Rate”. Mike Hall about the project:

It seems to fit our problem perfectly. Let’s discover which profile photo results in the most right swipes, without wasting views on the low performing ones. …

We were off to a solid start with just a little tweaking and tuning. Now, we are able to leverage Tinder’s massive volume of swipes in order to get very good results in a relatively small amount of time, and we are convinced that Smart Photos will give our users a significant upswing in the number of right swipes they are receiving with more complex and fine-tuned algorithms as we move forward.
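
For intuition, a plain (non-adapted) epsilon-greedy selection over a user’s photos could be sketched like this; Tinder’s actual Smart Photos algorithm is an adaptation and not public, so treat this purely as an illustration of the idea:

# Sketch of epsilon-greedy photo selection: mostly show the photo with the
# best observed right-swipe rate, but explore a random photo with probability epsilon
choose_photo <- function(right_swipes, views, epsilon = 0.1) {
  if (runif(1) < epsilon) {
    sample(seq_along(views), size = 1)
  } else {
    which.max(right_swipes / pmax(views, 1))
  }
}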