Minimize Regret is a blog about forecasting, optimization, and building tools humans use to make decisions under uncertainty. Tim is a data scientist in Berlin. Enjoy.

Read this first:

Problem Representations and Model-Based Machine Learning

Back in 2003, Paul Graham, of Viaweb and Y Combinator fame, published an article entitled “Better Bayesian Filtering”. I was scrolling chronologically through his essays archive the other day when… Continue Reading ➝

How One Bad CrowdStrike Update Crashed the World’s Computers

Days like today serve as a reminder that software doesn’t have to be AI to bring high-risk infrastructure to a halt. From Code Responsibly:

Regulating AI is awkward. Where does the if-else end and AI start? Instead, consider it part of software as a whole and ask in which cases software might have flaws we are unwilling to accept. We need responsible software, not just responsible AI.

Thanks to everyone who has to spend the weekend cleaning up.

How The Economist’s Presidential Forecast Works

The Economist is back with a forecast for the 2024 US presidential election in collaboration with Andrew Gelman and others. One detail in the write-up of their approach stood out to me:

The ultimate result is a list of 10,001 hypothetical paths that the election could take.

Not 10,000, but 10,000 and one MCMC samples. I can’t remember seeing any reference for this choice before (packages love an even number as default), but I have been adding a single additional sample as tie breaker for a long time: If nothing else, it comes in handy to have a dedicated path represent the median to prevent an awkward median estimate of 269.5 electoral votes.

The extra sample is especially helpful when the main outcome of interest is a sum of more granular outcomes. In the case of the presidential election, the main outcome is the sum of electoral votes provided by the states. One can first identify the median of the main outcome (currently 237 Democratic electoral votes). Given the extra sample, there will be one MCMC sample that results in the median. From here, one can work backwards and identify this sample index and the corresponding value for every state, for example. The value might not be a state’s most “representative” outcome and it is unlikely to be the state’s median number of electoral votes. But the sum across states will be the median of the main outcome. Great for a visualization depicting what scenario would lead to this projected constellation of the electoral college.

In contrast, summing up the median outcome of each state, there would be only 226 Democratic electoral votes as of today.1

  1. CA, DC, HI, MA, MD, ME 1, NY, VT, CO, CT, DE, IL, NJ, OR, RI, WA, NM, ME, NE 2, VA, NH, MN.↩︎

Helmut Schmidt Future Prize Winner’s Speech

Meredith Whittaker, president of the Signal Foundation, received the Helmut Schmidt Future Prize in May. In her prize winner’s speech, she highlights the proliferation of artificial intelligence applications from ad targeting to military applications:

We are all familiar with being shown ads in our feeds for yoga pants (even though you don’t do yoga) or a scooter (even if you just bought one), or whatever else. We see these because the surveillance company running the ad market or social platform has determined that these are things “people like us” are assumed to want or be attracted to, based on a model of behavior built using surveillance data. Since other people with data patterns that look like yours bought a scooter, the logic goes, you will likely buy a scooter (or at least click on an ad for one). And so you’re shown an ad. We know how inaccurate and whimsical such targeting is. And when it’s an ad it’s not a crisis when it’s mistargeted. But when it’s more serious, it’s a different story.

It’s all fun and games as long as were talking about which ad is served next. Code responsibly.

(Deep) Non-Parametric Time Series Forecaster

If you read The History of Amazon’s Forecasting Algorithm, you’ll hear about fantastic models such as Quantile Random Forests, and the MQTransformer. In GluonTS you’ll find DeepAR and DeepVARHierarchical. But the real hero is the simple model that does the work when all else fails. Tim Januschowski on Linkedin:

One of the baselines that we’ve developed over the years is the non-parametric forecaster or NPTS for short. Jan Gasthaus invented it probably a decade ago and Valentin Flunkert made it seasonality aware and to the best of my knowledge it’s been re-written a number of times and still runs for #amazon retail (when other surrounding systems were switched off long ago).

Januschowski mentions this to celebrate the Arxiv paper describing NPTS and its DeepNPTS variant with additional “bells and whistles”. Which I celebrate as I no longer have to refer people to section 4.3 of the GluonTS paper.

Debug Forecasts with Animated Plots

Speaking of GIFs, animated visualizations of rolling forecasts are eye-opening to the impact of individual observations, the number of observations, and default settings on a model’s forecasts. In the example below, the default forecast::auto.arima() transitions between poor model specifications until it can finally pick up the seasonality after 24 observations, only to generate a negative point forecast despite purely non-negative observations.

Fantastic way to understand forecast methods’ edge-case behavior.

My favorite frame? After nine observations, when a model specification with trend is picked and returns an exploding forecast based on very little evidence.

The code for this GIF in this small Github repo.

Reliably Forecasting Time-Series in Real Time

Straight from my YouTube recommendations, a PyData London 2018 (!) presentation by Charles Masson of Datadog. To predict whether server metrics cross a threshold, he builds a method model that focuses on being robust to all the usual issues of anomalies and structural breaks. He keeps it simple, interpretable, and–for the sake of real-time forecasting–fast. Good stuff all around. The GIFs are the cherry on top.

Chronos: Learning the Language of Time Series

Ansari et al. (2024) introduce their Chronos model family on Github:

Chronos is a family of pretrained time series forecasting models based on language model architectures. A time series is transformed into a sequence of tokens via scaling and quantization, and a language model is trained on these tokens using the cross-entropy loss. Once trained, probabilistic forecasts are obtained by sampling multiple future trajectories given the historical context.

The whole thing is very neat. The repository can be pip-installed as package wrapping the pre-trained models on HuggingFace so that the installation and “Hello, world” example just work, and the paper is extensive at 40 pages overall. I commend the authors for using that space to include section 5.7 “Qualitative Analysis and Limitations” discussing and visualizing plenty of examples. The limitation arising from the quantization approach (Figure 16) would not have been as clear otherwise.

Speaking of quantization, the approach used to tokenize time series onto a fixed vocabulary reminds me of the 2020 paper “The Effectiveness of Discretization in Forecasting” by Rabanser et al., a related group of (former) Amazon researchers.

The large set of authors of Chronos also points to the NeurIPS 2023 paper “Large Language Models Are Zero-Shot Time Series Forecasters”, though the approach of letting GPT-3 or LLaMA-2 predict a sequence of numeric values directly is very different.

Average Temperatures by Month Instead of Year

This tweet is a prime example for why it’s hard to analyze one signal in a time series (here, its trend) without simultaneously adjusting for its other signal components (here, its seasonality).

If the tweet gets taken down, perhaps this screenshot on Mastodon remains.

AI Act Article 17 - Quality Management System

Article 17 of the AI Act adopted by the EU Parliament is the ideal jump-off point into other parts of the legislation. While article 16 lists the “Obligations of providers of high-risk AI systems”, Article 17 describes the main measure by which providers can ensure compliance: the quality management system.

That system shall be documented in a systematic and orderly manner in the form of written policies, procedures and instructions, and shall include at least the following aspects […]

Note that providers write the policies themselves. The policies have to cover aspects summarizing critical points found elsewhere in the document, as for example the development, quality control, and validation of the AI system. But also…

  • a description of the applied harmonized standards or other means to comply with the requirements for high-risk AI systems as set out in Section 2,
  • procedures for data management such as analysis, labeling, storage, aggregation, and retention,
  • procedures for post-marketing monitoring as in Article 72, incident notification as in Article 73, and record-keeping as in Article 12, and
  • procedures for communication with competent authorities, notified bodies, and customers.

This doesn’t necessarily sound all that exciting and might be glossed over on first reading, but search for Article 17 in the AI Act and you’ll find the quality management system listed as the criterion alongside technical documentation in the conformity assessment of high-risk AI systems that providers perform themselves (Annex VI) or that notified bodies perform for them (Annex VII).

Quality management systems are the means by which providers can self-assess their high-risk AI systems against official standards and comply with the regulation and thus central to the reading of the AI Act.

Demystifying the Draft EU AI Act

Speaking of AI Act details, the paper “Demystifying the Draft EU AI Act” (Veale and Borgesius, 2021) has been a real eye-opener and fundamental to my understanding of the regulation.1

Different than most coverage of the regulation, the two law researchers highlight the path by which EU law eventually impacts practice: Via standards and company-internal self-assessments. This explains why you will be left wondering what human oversight and technical robustness mean after reading the AI Act. The AI Act purposely does not provide specifications practitioners could follow to stay within the law when developing AI systems. Instead, specifics are outsourced to the private European standardization agencies CEN and CENELEC. The EU Commission will task them with definition of standards (think ISO or DIN) that companies can then follow during implementation of their systems and subsequently self-assess. This is nothing unusual in EU law making (for example, it’s used for medical devices and kids' chemistry sets). But, as the authors argue, it implies that “standardisation is arguably where the real rule-making in the Draft AI Act will occur”.

Chapter III, section 4 “Conformity Assessment and Presumption” for high-risk AI systems, as well as chapters V and VI provide context not found anywhere else, leading up to strong concluding remarks:

The high-risk regime looks impressive at first glance. But scratching the surface finds arcane electrical standardisation bodies with no fundamental rights experience expected to write the real rules, which providers will quietly self-assess against.

  1. As the paper’s title suggests, it has been written in 2021 as a dissection of the EU Commission’s initial proposal of the AI Act. Not all descriptions might apply to the current version adopted by the EU Parliament on Tuesday. Consequently the new regulation of foundation models, for example, is not covered. ↩︎

AI Act Approved by EU Parliament

The EU AI Act has finally come to pass, and pass it did with 523 of 618 votes of the EU Parliament in favor. The adopted text (available as of writing as PDF or Word document—the latter is much easier to work with!) has seen a number of changes since the original proposal by the EU Commission in 2021.

For example, the current text reduces the set of systems considered high-risk somewhat by excluding those that are “not materially influencing the outcome of decision making” (Chapter III, Section 1, Article 6, Paragraph 3) except for those already covered by EU regulation such as medical devices and elevators. It also requires providers of any AI systems to mark generated audio, image, video or text content as such in a “machine-readable format and detectable as artificially generated” while also being “interoperable” (Chapter IV, Article 50, Paragraph 2).

And then there is the “right to an explanation” (Article 86). While data scientists and machine learning engineers hear “Explainable AI” and start submitting abstracts to CHI, the provision does not appear to ask for an explanation of the AI system’s recommendation, but only for a description of the overall process in which the system was deployed:

Any affected person subject to a decision which is taken by the deployer on the basis of the output from a high-risk AI system listed in Annex III, with the exception of systems listed under point 2 thereof, and which produces legal effects or similarly significantly affects that person in a way that they consider to have an adverse impact on their health, safety or fundamental rights shall have the right to obtain from the deployer clear and meaningful explanations of the role of the AI system in the decision-making procedure and the main elements of the decision taken.

This reading1 is confirmed by the references to the “deployer” and not the “provider” of the AI system. The latter would be the one who can provide an interpretable algorithm or at least explanations.

Additionally, this only refers to high-risk systems listed in Annex III such as those used for, for example, hiring and workers management or credit scoring. Therefore, however, it also excludes high-risk systems falling under existing EU regulations listed in Annex I, such as medical devices and elevators.

The devil is in the details.

  1. Which is not based on any legal expertise whatsoever, by the way. ↩︎

Stop Using Dynamic Time Warping for Business Time Series

Dynamic Time Warping (DTW) is designed to reveal inherent similarity between two time series of similar scale that was obscured because the time series were shifted in time or sampled at different speeds. This makes DTW useful for time series of natural phenomena like electrocardiogram measurements or recordings of human movements, but less so for business time series such as product sales.

To see why that is, let’s first refresh our intuition of DTW, to then check why DTW is not the right tool for business time series.

What Dynamic Time Warping Does

Given two time series, DTW computes aligned versions of the time series to minimize the cumulative distance between the aligned observations. The alignment procedure repeats some of the observations to reduce the resulting distance. The aligned time series end up with more observations than in the original versions, but the aligned versions still have to have the same start and end, no observation may be dropped, and the order of observations must be unchanged.

But the best way to understand the alignment is to visualize it.

Aligning a Shift

To start, I’ll use an example where DTW can compute a near perfect alignment: That of a cosine curve and a shifted cosine curve—which is just a sine curve.1 For each of the curves, we observe 12 observations per period, and 13 observations in total.

series_sine <- sinpi((0:12) / 6)
series_cosine <- cospi((0:12) / 6)

To compute the aligned versions, I use the DTW implementation of the dtw package (also available for Python as dtw-python) with default parameters.

dtw_shift <- dtw::dtw(x = series_sine, y = series_cosine)

Besides returning the distance of the aligned series, DTW produces a mapping of original series to aligned series in the form of alignment vectors dtw_shift$index1 and dtw_shift$index2. Using those, I can visualize both the original time series and the aligned time series along with the repetitions used for alignment.

# DTW returns a vector of indices of the original observations
# where some indices are repeated to create aligned time series
##  [1]  1  1  1  1  2  3  4  5  6  7  8  9 10 11 12 13
# In this case, the first index is repeated thrice so that the first
# observation appears four times in the aligned time series
series_cosine[dtw_shift$index2] |> head(8) |> round(2)
## [1]  1.00  1.00  1.00  1.00  0.87  0.50  0.00 -0.50

The plot below shows the original time series in the upper half and the aligned time series in the lower half, with the sine in orange and the cosine in black. Dashed lines indicate where observations were repeated to achieve the alignment.

Given that the sine is a perfect shifted copy of the cosine, three-quarters of the observed period can be aligned perfectly. The first quarter of the sine and the last quarter of the cosine’s period, however, can’t be aligned and stand on their own. Their indices are mapped to the repeated observations from the other time series, respectively.

Aligning Speed

Instead of shifting the cosine, I can sample it at a different speed (or, equivalently, observe a cosine of different frequency) to construct a different time series that can be aligned well by DTW. In that case, the required alignment is not so much a shift as it is a squeezing and stretching of the observed series.

Let’s create a version of the original cosine that is twice as “fast”: In the time that we observe one period of the original cosine, we observe two periods of the fast cosine.

series_cosine_fast <- cospi((0:12) / 3)
dtw_fast <- dtw::dtw(x = series_cosine_fast, y = series_cosine)

The resulting alignment mapping looks quite different than in the first example. Under a shift, most observations still have a one-to-one mapping after alignment. Under varying frequencies, most observations of the faster time series have to be repeated to align. Note how the first half of the fast cosine’s first period can be neatly aligned with the first half of the slow cosine’s period by repeating observations (in an ideal world exactly twice).

The kind of alignment reveals itself better when the time series are observed for more than just one or two periods. Below, for longer versions of the same series, half of the fast time series can be matched neatly with the slow cosine as we observe twice the number of periods for the fast cosine.

Aligning Different Scales

What’s perhaps unexpected, though, is that the alignment works only on the time dimension. Dynamic time warping will not scale the time series’ amplitude. But at the same time DTW is not scale independent. This can make the alignment fairly awkward when time series have varying scales as DTW exploits the time dimension to reduce the cumulative distance of observations in the value dimension.

To illustrate this, let’s take the sine and cosine from the first example but scale the sine’s amplitude by 5 and check the resulting alignment.

series_sine_scaled <- series_sine * 10
dtw_scaled <- dtw::dtw(x = series_sine_scaled, y = series_cosine)

We might expect DTW to align the two series as it did in the first example above with unscaled series. After all, the series have the same frequencies as before and the same period shift as before.

This is what the alignment would look like using the alignment vectors from the first example above based on un-scaled observations. While it’s a bit hard to see due to the different scales, the observations at peak amplitude are aligned (e.g., indices 4, 10, 16) as are those at the minimum amplitude (indices 7 and 13).

But Dynamic Time Warping’s optimization procedure doesn’t actually try to identify characteristics of time series such as their period length to align them. It warps time series purely to minimize the cumulative distance between aligned observations. This may lead to a result in which also the periodicity is aligned as in the first and second examples above. But that’s more by accident than by design.

This is how DTW actually aligns the scaled sine and the unscaled cosine:

The change in the series’ amplitude leads to a more complicated alignment: Observations at the peak amplitude of the cosine (which has the small amplitude) are repeated many times to reduce the Euclidean distance to the already high amplitude observations of the sine. Reversely, the minimum-amplitude of the sine is repeated many times to reduce the Euclidean distance to the already low amplitude observations of the cosine.

DTW Is Good at Many Things…

Dynamic time warping is great when you’re observing physical phenomena that are naturally shifted in time or at different speeds.

Consider, for example, measurements taken in a medical context such as those of an electrocardiogram (ECG) that measures the electrical signals in a patient’s heart. In this context, it is helpful to align time series to identify similar heart rhythms across patients. The rhythms’ periods could be aligned to check whether one patient has the same issues as another. Even for the same person a DTW can be useful to align measurements taken on different days at different heart rates.

data("aami3a") # ECG data included in `dtw` package
dtw_ecg <- dtw::dtw(
  x = as.numeric(aami3a)[1:2880],
  y = as.numeric(aami3a)[40001:42880]

An application such as this one is nearly identical to the first example of the shifted cosine. And as the scale of the time series is naturally the same across the two series, the alignment works well, mapping peaks and valleys with minimal repetitions.

There is also a natural interpretation to the alignment, namely that we are aligning the heart rhythm across measurements. A large distance after alignment would indicate differences in rhythms.

…But Comparing Sales Ain’t One of Them

It is enticing to use Dynamic Time Warping in a business context, too. Not only does it promise a better distance metric to identify time series that are “similar” and to separate them from those that are “different”, but it also has a cool name. Who doesn’t want to warp stuff?

We can, for example, warp time series that count the number of times products are used for patients at a hospital per month. Not exactly product sales but sharing the same characteristics. The data comes with the expsmooth package.

sampled_series <- sample(x = ncol(expsmooth::hospital), size = 2)
product_a <- as.numeric(expsmooth::hospital[,sampled_series[1]])
product_b <- as.numeric(expsmooth::hospital[,sampled_series[2]])

While product_b is used about twice as often as product_a, both products exhibit an increase in their level at around index 40 which is perhaps one characteristic that makes them similar or at least more similar compared to a series that doesn’t share this increase.

However, the warped versions exhibit a lot of unwanted repetitions of observations. Given the different level of the products, this should not come as a surprise.

dtw_products_ab <- dtw::dtw(x = product_b, y = product_a)

We can mostly fix this by min-max scaling the series, forcing their observations onto the same range of values.

dtw_products_ab_standardized <- dtw::dtw(
  x = (product_b - min(product_b)) / (max(product_b) - min(product_b)),
  y = (product_a - min(product_a)) / (max(product_a) - min(product_a))

While we got rid of the unwanted warping artifacts in the form of extremely long repetitions of the same observation, the previous plot should raise some questions:

  • How can we interpret the modifications for “improved” alignment?
  • Are we aligning just noise?
  • What does the warping offer us that we didn’t already have before?

If you ask me, there is no useful interpretation to the alignment modifications, because the time series already were aligned. Business time series are naturally aligned by the point in time at which they are observed.2

We also are not taking measurement of a continuous physical phenomenon that goes up and down like the electric signals in a heart. While you can double the amount of ECG measurements per second, it does not make sense to “double the amount of sales measurements per month”. And while a product can “sell faster” than another, this results in an increased amplitude not an increased frequency (or shortened period lengths).

So if you apply Dynamic Time Warping to business time series, you will mostly be warping random fluctuations with the goal of reducing the cumulative distance just that tiny bit more.

This holds when you use Dynamic Time Warping to derive distances for clustering of business time series, too. You might as well calculate the Euclidean distance3 on the raw time series without warping first. At least then the distance between time series will tell you that they were similar (or different) at the same point in time.

In fact, if you want to cluster business time series (or train any kind of model on them), put your focus on aligning them well in the value dimension by thinking carefully about how you standardize them. That makes all the difference.

Before using Dynamic Time Warping, ask yourself: What kind of similarity are you looking for when comparing time series? Will there be meaning to the alignment that Dynamic Time Warping induces in your time series? And what is it that you can do after warping that you weren’t able to do before?

  1. A similar example is also used by the dtw package in its documentation.↩︎

  2. Two exceptions come to mind: First, you may want to align time series of products by the first point in time at which the product was sold. But that’s straightforward without DTW. Second, you may want to align products that actually exhibit a shift in their seasonality—such as when a product is heavily affected by the seasons and sold in both the northern and southern hemisphere. DTW might be useful for this, but there might also be easier ways to accomplish it.↩︎

  3. Or Manhattan distance, or any other distance.↩︎

Comes with Anomaly Detection Included

A powerful pattern in forecasting is that of model-based anomaly detection during model training. It exploits the inherently iterative nature of forecasting models and goes something like this:

  1. Train your model up to time step t based on data [1,t-1]
  2. Predict the forecast distibution at time step t
  3. Compare the observed value against the predicted distribution at step t; flag the observation as anomaly if it is in the very tail of the distribution
  4. Don’t update the model’s state based on the anomalous observation

For another description of this idea, see, for example, Alexandrov et al. (2019).

Importantly, you use your model to detect anomalies during training and not after training, thereby ensuring its state and forecast distribution are not corrupted by the anomalies.

The beauty of this approach is that

  • the forecast distribution can properly reflect the impact of anomalies, and
  • you don’t require a separate anomaly detection method with failure modes different from your model’s.

In contrast, standalone anomaly detection approaches would first have to solve the handling of trends and seasonalities themselves, too, before they could begin to detect any anomalous observation. So why not use the model you already trust with predicting your data to identify observations that don’t make sense to it?

This approach can be expensive if your model doesn’t train iteratively over observations in the first place. Many forecasting models1 do, however, making this a fairly negligiable addition.

But enough introduction, let’s see it in action. First, I construct a simple time series y of monthly observations with yearly seasonality.

x <- 1:80
y <- pmax(0, sinpi(x / 6) * 25 + sqrt(x) * 10 + rnorm(n = 80, sd = 10))

# Insert anomalous observations that need to be detected
y[c(55, 56, 70)] <- 3 * y[c(55, 56, 70)]

To illustrate the method, I’m going to use a simple probabilistic variant of the Seasonal Naive method where the forecast distribution is assumed to be Normal with zero mean. Only the \(\sigma\) parameter needs to be fitted, which I do using the standard deviation of the forecast residuals.

The estimation of the \(\sigma\) parameter occurs in lockstep with the detection of anomalies. Let’s first define a data frame that holds the observations and will store a cleaned version of the observations, the fitted \(\sigma\) and detected anomalies…

df <- data.frame(
  y = y,
  y_cleaned = y,
  forecast = NA_real_,
  residual = NA_real_,
  sigma = NA_real_,
  is_anomaly = FALSE

… and then let’s loop over the observations.

At each iteration, I first predict the current observation given the past and update the forecast distribution by calculating the standard deviation over all residuals that are available so far, before calculating the latest residual.

If the latest residual is in the tails of the current forecast distribution (i.e., larger than multiples of the standard deviation), the observation is flagged as anomalous.

For time steps with anomalous observations, I update the cleaned time series with the forecasted value (which informs a later forecast at step t+12) and set the residual to missing to keep the anomalous observation from distorting the forecast distribution.

# Loop starts when first prediction from Seasonal Naive is possible
for (t in 13:nrow(df)) {
  df$forecast[t] <- df$y_cleaned[t - 12]
  df$sigma[t] <- sd(df$residual, na.rm = TRUE)
  df$residual[t] <- df$y[t] - df$forecast[t]
  if (t > 25) {
    # Collect 12 residuals before starting anomaly detection
    df$is_anomaly[t] <- abs(df$residual[t]) > 3 * df$sigma[t]
  if (df$is_anomaly[t]) {
    df$y_cleaned[t] <- df$forecast[t]
    df$residual[t] <- NA_real_

Note that I decide to start the anomaly detection not before there are 12 residuals for one full seasonal period, as the \(\sigma\) estimate based on less than a handful of observations can be flaky.

In a plot of the results, the combination of 1-step-ahead prediction and forecast distribution is used to distinguish between expected and anomalous observations, with the decision boundary indicated by the orange ribbon. At time steps where the observed value falls outside the ribbon, the orange line indicates the model prediction that is used to inform the model’s state going forward in place of the anomaly.

Note how the prediction at time t is not affected by the anomaly at time step t-12. Neither is the forecast distribution estimate.

This would look very different when one gets the update behavior slightly wrong. For example, the following implementation of the loop detects the first anomaly in the same way, but uses it to update the model’s state, leading to subsequently poor predictions and false positives, and fails to detect later anomalies.

# Loop starts when first prediction from Seasonal Naive is possible
for (t in 13:nrow(df)) {
  df$forecast[t] <- df$y[t - 12]
  df$sigma[t] <- sd(df$residual, na.rm = TRUE)
  df$residual[t] <- df$y[t] - df$forecast[t]
  if (t > 25) {
    # Collect 12 residuals before starting anomaly detection
    df$is_anomaly[t] <- abs(df$residual[t]) > 3 * df$sigma[t]
  if (df$is_anomaly[t]) {
    df$y_cleaned[t] <- df$forecast[t]

What about structural breaks?

While anomaly detection during training can work well, it may fail spectacularly if an anomaly is not an anomaly but the start of a structural break. Since structural breaks make the time series look different than it did before, chances are the first observation after the structural break will be flagged as anomaly. Then so will be the second. And then the third. And so on, until all observations after the structural break are treated as anomalies because the model never starts to adapt to the new state.

This is particularly frustrating because the Seasonal Naive method is robust against certain structural breaks that occur in the training period. Adding anomaly detection makes it vulnerable.

What values to use for the final forecast distribution?

Let’s get philosophical for a second. What are anomalies?

Ideally, they reflect a weird measurement that will never occur again. Or if it does, it’s another wrong measurement—but not the true state of the measured phenomenon. In that case, let’s drop the anomalies and ignore them in the forecast distribution.

But what if the anomalies are weird states that the measured phenomenon can end up in? For example, demand for subway rides after a popular concert. While perhaps an irregular and rare event, a similar event may occur again in the future. Do we want to exclude that possibility from our predictions about the future? What if the mention of a book on TikTok let’s sales spike? Drop the observation and assume it will not repeat? Perhaps unrealistic.

It depends on your context. In a business context, where measurement errors might be less of an issue, anomalies might need to be modeled, not excluded.

  1. Notably models from the state-space model family.↩︎

Code Responsibly

There exists this comparison of software before and software after machine learning.

Before machine learning, code was deterministic: Software engineers wrote code, the code included conditions with fixed thresholds, and at least in theory the program was entirely understandable.

After machine learning, code is no longer deterministic. Instead of software engineers instantiating it, the program’s logic is determined by a model and its parameters. Those parameters are not artisinally chosen by a software engineer but learned from data. The program becomes a function of data, and in some cases incomprehensible to the engineer due to the sheer number of parameters.

Given the current urge to regulate AI and making its use responsible and trustworthy, humans appear to expect machine learning models to introduce an obscene number of bugs into software. Perhaps humans underestimate the ways in which human programmers can mess up.

For example, when I hear Regulate AI, all I can think is Have you seen this stuff? By Pierluigi Bizzini for Algorithm Watch1 (emphasis mine):

The algorithm evaluates teachers' CVs and cross-references their preferences for location and class with schools' vacancies. If there is a match, a provisional assignment is triggered, but the algorithm continues to assess other candidates. If it finds another matching candidate with a higher score, that second candidate moves into the lead. The process continues until the algorithm has assessed all potential matches and selected the best possible candidate for the role.

[…] [E]rrors have caused much confusion, leaving many teachers unemployed and therefore without a salary. Why did such errors occur?

When the algorithm finds an ideal candidate for a position, it does not reset the list of remaining candidates before commencing the search to fill the next vacancy. Thus, those candidates who missed out on the first role that matched their preferences are definitively discarded from the pool of available teachers, with no possibility of employment. The algorithm classes those discarded teachers as “drop-outs”, ignoring the possibility of matching them with new vacancies.

This is not AI gone rogue. This is just a flawed human-written algorithm. At least Algorithm Watch is aptly named Algorithm Watch.

Algorithms existed before AI.2 But there was no outcry, no regulation of algorithms before AI, no “Proposal for a Regulation laying down harmonised rules on artificial intelligence”. Except there actually is regulation in aviation and medical devices and such.3 Perhaps because of the extent to which these fields are entangled with hardware, posing unmediated danger to human lives.

Machine learning and an increased prowess in data processing have not introduced more bugs into software compared to software written by humans previously. What they have done is to enable applications for software that were previously infeasible by way of processing and generating data.

Some of these applications are… not great. They should never be done. Regulate those applications. By all means, prohibit them. If a use case poses unacceptable risks, when we can’t tolerate any bugs but bugs are always a possibility, then let’s just not do it.

Other applications are high-risk, high-reward. Given large amounts of testing imposed by regulation, we probably want software to enable these applications. The aforementioned aviation and medical devices come to mind.4 Living with regulated software is nothing new!

Then there is the rest that doesn’t really harm anyone where people can do whatever.

Regulating software in the context of specific use cases is feasible and has precedents.

Regulating AI is awkward. Where does the if-else end and AI start? Instead, consider it part of software as a whole and ask in which cases software might have flaws we are unwilling to accept.

We need responsible software, not just responsible AI.

  1. I tried to find sources for the description of the bug provided in the article, but couldn’t find any. I don’t know where the author takes them from, so take this example with the necessary grain of salt. ↩︎

  2. Also, much of what were algorithms or statistics a few years ago are now labeled as AI. And large parts of AI systems are just if statements and for loops and databases and networking. ↩︎

  3. Regulation that the EU regulation of artificial intelligence very much builds upon. See “Demystifying the Draft EU Artificial Intelligence Act” ( for more on the connection to the “New Legislative Framework”. ↩︎

  4. Perhaps this is the category for autonomous driving? ↩︎

A Flexible Model Family of Simple Forecast Methods

Introducing a flexible model family that interpolates between simple forecast methods to produce interpretable probabilistic forecasts of seasonal data by weighting past observations. In business forecasting applications for operational decisions, simple approaches are hard-to-beat and provide robust expectations that can be relied upon for short- to medium-term decisions. They’re often better at recovering from structural breaks or modeling seasonal peaks than more complicated models, and they don’t overfit unrealistic trends.

Continue reading?

Video: Tim Januschowski, ISF 2023 Practitioner Speaker

We don’t have enough presentations of industry practitioners discussing the detailed business problems they’re addressing and what solutions and trade-offs they were able to implement. Tim Januschoswki did just that, though, in his presentation at the International Symposium on Forecasting 2023. He discusses demand forecasting for optimal pricing at Zalando.

Presentations such as this one are rare opportunites to peak at the design of real world solutions. My favorite quote:

What we’re not using that might be also interesting is weather data. My business counterparts, they always, you know, make me aware of that fact. But we’re not using it.

‘ECB Must Accept Forecasting Limitations to Restore Trust’

Christine Lagarde, president of the European Central Bank, declared her intent to communicate the shortcomings of the ECB’s forecasts better—and in doing so, provides applied data science lessons for the rest of us. As quoted by the Financial Times:

“Even if these [forecast] errors were to deplete trust, we can mitigate this if we talk about forecasts in a way that is both more contingent and more accessible, and if we provide better explanations for those errors,” Lagarde said.

In Search of Verifiability: Explanations Rarely Enable Complementary Performance in AI-Advised Decision Making

Raymond Fok and Daniel S. Weld in a recent Arxiv preprint:

We argue explanations are only useful to the extent that they allow a human decision maker to verify the correctness of an AI’s prediction, in contrast to other desiderata, e.g., interpretability or spelling out the AI’s reasoning process.

This does ring true to me: Put yourself into the position of an employee of Big Company Inc. whose task it is to allocate marketing budgets, to purchase product inventory, or to perform any other monetary decision as part of a business process. Her dashboard, powered by a data pipeline and machine learning model, suggests to increase TV ad spend in channel XYZ, or to order thousands of units of a seasonal product to cover the summer.

In her shoes, if you had to sign the check, what let’s you sleep at night: Knowing the model’s feature importances, or having verified the prediction’s correctness?

I’d prefer the latter, and the former only so much as it helps in the pursuit of verification. Feature importance alone, it is argued however, can’t determine correctness:

Here, we refer to verification of an answer as the process of determining its correctness. It follows that many AI explanations fundamentally cannot satisfy this desideratum […] While feature importance explanations may provide some indication of how much each feature influenced the AI’s decision, they typically do not allow a decision maker to verify the AI’s recommendation.

We want verifiability, but we cannot have it for most relevant supervised learning problems. The number of viewers of the TV ad are inherently unknown at prediction time, as is the demand for the seasonal product. These applications are in stark contrast to the maze example the authors provide, in which the explanation method draws the proposed path through the maze.

If verifiability is needed to complement human decision making, then this might be why one can get the impression of explanation washing of machine learning systems: While current explanation methods are the best we can do, they fall short of what is really needed to trust a system’s recommendation.

What can we do instead? We could start by showing the actual data alongside the recommendation. Making the data explorable. The observation in question can be put into the context of observations from the training data for which labels exist, essentially providing case-based explanations.

Ideally, any context provided to the model’s recommendation is not based on another model that adds another layer to be verified, but on hard actuals.

In the case of forecasting, simply visualizing the forecast alongside the historical observations can be extremely effective at establishing trust. When the time series is stable and shows clear patterns, a human actually can verify the the forecast’s correctness up to a point. And a human easily spots likely incorrect forecasts given historical data.

The need for verifiability makes me believe in building data products, not just a model.

Explainability Washing

Upol Ehsan ponders on Mastodon:

Explainable AI suffers from an epidemic. I call it Explainability Washing. Think of it as window dressing–techniques, tools, or processes created to provide the illusion of explainability but not delivering it.

Ah yes, slapping feature importance values onto a prediction and asking your users “Are you not entertained?”.

This thread pairs well with Rick Saporta’s presentation. Both urge you to focus solely on your user’s decision when deciding what to build.

A Framework for Data Product Management for Increasing Adoption & User Love

You might have heard this one before: To build successful data products, focus on the decisions your customers make. But when was the last time you considered “how your work get[s] converted into action”?

At Data Council 2023, Rick Saporta lays out a framework of what data products to build and how to make them successful with customers. He goes beyond the platitudes, his advice sounds hard-earned.

Slides are good, talk is great.

The 2-by-2 of Forecasting

False Positives and False Negatives are traditionally a topic in classification problems only. Which makes sense: There is no such thing as a binary target in forecasting, only a continuous range. There is no true and false, only a continuous scale of wrong. But there lives an MBA student in me who really likes 2-by-2 charts, so let’s come up with one for forecasting.

The {True,False}x{Positive,Negative} confusion matrix is the one opportunity for university professors to discuss the stakeholders of machine learning systems. The fact that a stakeholder might care more about reducing the number of False Positives and thus accepting a higher rate of False Negatives. Certain errors are more critical than others. That’s just as much the case in forecasting.

To construct the 2-by-2 of forecasting, the obvious place to start is the sense of “big errors are worse”. Let’s put that on the y-axis.

This gives us the False and “True” equivalents of forecasting. The “True” is in quotes because any deviation from the observed value is some error. But for the sake of the 2-by-2, let’s call small errors “True”.

Next, we need the Positive and Negative equivalents. When talking about stakeholder priorities, Positive and Negative differentiate the errors that are Critical from those that are Acceptable. Let’s put that on the x-axis.

While there might be other ways to define criticality1, human perception of time series forecastability comes up as soon as users of your product inspect your forecasts. The human eye will detect apparent trends and seasonal patterns and project them into the future. If your model does not, and wrongly so, it raises confusion instead. Thus, forecasts of series predictable by humans will be critized, while forecasts of series with huge swings and more noise than signal are easily acceptable.

To utilize this notion, we require a model of human perception of time series forecastability. In a business context, where seasonality may be predominant, the seasonal naive method captures much of what is modelled by the human eye. It is also inherently conservative, as it does not overfit to recent fluctuations or potential trends. It assumes business will repeat as it did before.

Critical, then, are forecasts of series that the seasonal naive method, or any other appropriate benchmark, predicts with small error, while Acceptable are any forecasts of series that the seasonal naive method predicts poorly. This completes the x-axis.

With both axes defined, the quadrants of the 2-by-2 emerge. Small forecast model erros are naturally Acceptable True when the benchmark model fails, and large forecast model errors are Acceptable False when the benchmark model also fails. Cases of series that feel predictable and are predicted well are Critical True. Lastly, series that are predicted well by a benchmark but not by the forecast model are Critical False.

The Critical False group contains the series for which users expect a good forecast because they themselves can project the series into the future—but your model fails to deliver that forecast and does something weird instead. It’s the group of forecasts that look silly in your tool, the ones that cause you discomfort when users point them out.

Keep that group small.

  1. For example, the importance of a product to the business as measured by revenue. ↩︎

Bayesian Intermittent Demand Forecasting at NeurIPS 2016

Oldie but a goodie: A recording of Matthias Seeger’s presentation of “Bayesian Intermittent Demand Forecasting for Large Inventories” at NeurIPS 2016. The corresponding paper is a favorite of mine, but I only now stumbled over the presentation. It sparked an entire catalogue of work on time series forecasting by Amazon, and like few others called out the usefulness of sample paths.

On the Factory Floor

What works at Google-scale is not the pattern most data scientists need to employ at their work. But the paper “On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models” is the kind of paper that we need more of: Thrilling reports of what works in practice.

Also, the authors do provide abstract lessons anyone can use, such as considering the constraints of your problem rather than using whatever is state-of-the-art:

A major design choice is how to represent an ad-query pair x. The semantic information in the language of the query and the ad headlines is the most critical component. Usage of attention layers on top of raw text tokens may generate the most useful language embeddings in current literature [64], but we find better accuracy and efficiency trade-offs by combining variations of fully-connected DNNs with simple feature generation such as bi-grams and n-grams on sub-word units. The short nature of user queries and ad headlines is a contributing factor. Data is highly sparse for these features, with typically only a tiny fraction of non-zero feature values per example.

SAP Design Guidelines for Intelligent Systems

From SAP’s Design Guidelines for Intelligent Systems:

High–stakes decisions are more common in a professional software environment than in everyday consumer apps, where the consequences of an action are usually easy to anticipate and revert. While the implications of recommending unsuitable educational content to an employee are likely to be minimal, recommendations around critical business decisions can potentially cause irreversible damage (for example, recommending an unreliable supplier or business partner, leading to the failure or premature termination of a project or contract). It’s therefore vital to enable users to take an informed decision.

While sometimes overlooked, this guide presents software in internal business processes as rich opportunity to augment human capabilities, deserving just as much love and attention as “everyday consumer apps”.

The chapters on intelligent systems are not too tuned to SAP systems, but they do have the specific context of business applications in mind which differentiates them from other (great!) guides on user interfaces for machine learning systems.

Based on that context, the guide dives deep on Ranking, Recommendations, and Matching, proving that it’s based on a much more hands-on view than any text discussing Supervised, Unsupervised, and Reinforcement Learning.

When aiming to build systems that “augment human capabilities”, the importance of “gain[ing] the user’s trust and foster[ing] successful adoption” can’t be overstated, making it worthwhile to deeply consider how we present the output of our systems.

Related: Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI by Alon Jacovi, Ana Marasović, Tim Miller, and Yoav Goldberg.

Skillful Image Fast-Forwarding

Russel Jacobs for Slate on the discontinued Dark Sky weather app, via Daring Fireball:

Indeed, Dark Sky’s big innovation wasn’t simply that its map was gorgeous and user-friendly: The radar map was the forecast. Instead of pulling information about air pressure and humidity and temperature and calculating all of the messy variables that contribute to the weather–a multi-hundred-billion-dollars-a-year international enterprise of satellites, weather stations, balloons, buoys, and an army of scientists working in tandem around the world (see Blum’s book)–Dark Sky simply monitored changes to the shape, size, speed, and direction of shapes on a radar map and fast-forwarded those images. “It wasn’t meteorology,” Blum said. “It was just graphics practice.”

Reminds me of DeepMind’s “Skilful precipitation nowcasting using deep generative models of radar”.

ChatGPT and ML Product Management

Huh, look at that, OpenAI’s ChatGPT portrays absolute confidence while giving plain wrong answers. But ChatGPT also does provide helpful responses a large number of times. So one kind of does want to use it. Sounds an awful lot like every other machine learning model deployed in 2022. But really, how do we turn fallible machine learning models into products to be used by humans? Not by injecting its answers straight into StackOverflow.

Continue reading?

GluonTS Workshop at Amazon Berlin on September 29

The workshop will revolve around tools that automatically transform your data, in particular time series, into high-quality predictions based on AutoML and deep learning models. The event will be hosted by the team at AWS that develops AutoGluon, Syne Tune and GluonTS, and consist of a mix of tutorial-style presentation on the tools, discussion, and contributions from external partners on their applications.

Unique opportunity to hear from industry practitioners and GluonTS developers in person or by joining online.

Design a System, not an “AI”

Ryxcommar on Twitter:

I think one of the bigger mistakes people make when designing AI powered systems is seeing them as an AI first and foremost, and not as a system first and foremost.

Once you have your API contracts in place, the AI parts can be seen as function calls inside the system. Maybe your first version of these functions just return an unconditional expected value. But the system is the bulk of the work, the algorithm is a small piece.

To me, this is why regulation of AI (in contrast to regulation of software generally) can feel misguided: Any kind of function within a system has the potential to be problematic. It doesn’t have to use matrix multiplication for that to be the case.

More interestingly though, this is why it’s so effective to start with a simple model. It provides the function call around which you can build the system users care about.

Some free advice for data scientists– every time I have seen people treat their systems primarily as AI and not as systems, both the AI and the system suffered for it. Don’t make that mistake; design a system, not an “AI.”

Berlin Bayesians Meetup on September 27

The Berlin Bayesians meetup is happening again in-person. Juan Orduz is going to present Buy ‘Til You Die models implemented in PyMC:

In this talk, we introduce a certain type of customer lifetime models for the non-contractual setting commonly known as BTYD (Buy Till You Die) models. We focus on two sub-model components: the frequency BG/NBD model and the monetary gamma-gamma model. We begin by introducing the model assumptions and parameterizations. Then we walk through the maximum-likelihood parameter estimation and describe how the models are used in practice. Next, we describe some limitations and how the Bayesian framework can help us to overcome some of them, plus allowing more flexibility. Finally, we describe some ongoing efforts in the open source community to bring these new ideas and models to the public.

Buy ‘Til You Die models for the estimation of customer lifetime value were one of the first applications I worked on in my data science career, I’m glad to see they’re still around and kicking. Now implemented in the shiny new version of PyMC!

Edit: The event was rescheduled to September 27.

Legible Forecasts, and Design for Contestability

Some models are inherently interpretable because one can read their decision boundary right off them. In fact, you could call them interpreted as there is nothing left for you to interpret: The entire model is written out for you to read. For example, assume it’s July and we need to predict how many scoops of ice cream we’ll sell next month. The Seasonal Naive method tells us: AS 12 MONTHS AGO IN August, PREDICT sales = 3021.

Continue reading?

Where Is the Seasonal Naive Benchmark?

Yesterday morning, I retweeted this tweet by sklearn_inria that promotes a scikit-learn tutorial notebook on time-related feature engineering. It’s a neat notebook that shows off some fantastic ways of creating features to predict time series within a scikit-learn pipeline. There are, however, two things that irk me: All features of the dataset including the hourly weather are passed to the model. I don’t know the details of this dataset, but skimming what I believe to be its description on the OpenML repository, I suspect this might introduce data leakage as in reality we can’t know the exact hourly humidity and temperature days in advance.

Continue reading?

When Quantiles Do Not Suffice, Use Sample Paths Instead

I don’t need to convince you that you should absolutely, to one hundred percent, quantify your forecast uncertainty—right? We agree about the advantages of using probabilistic measures to answer questions and to automate decision making—correct? Great. Then let’s dive a bit deeper. So you’re forecasting not just to fill some numbers in a spreadsheet, you are trying to solve a problem, possibly aiming to make optimal decisions in a process concerned with the future.

Continue reading?

Be Skeptical of the t-SNE Bunny

Matt Henderson on Twitter (click through for the animation):

Be skeptical of the clusters shown in t-SNE plots! Here we run t-SNE on a 3d shape - it quickly invents some odd clusters and structures that aren’t really present in the original bunny.

What would happen if every machine learning method would come with a built-in visualization of the spurious results that it found?

Never mind the the answer to that question. I think that this dimensionality reduction of a 3D bunny into two dimensions isn’t even all that bad—the ears are still pretty cute. And it’s not like the original data had a lot more global and local structure once you consider that the bunny is not much more than noise in the shape of a rectangle with two ears that human eyes ascribe meaning to.

I’m the first to admit that t-SNE, UMAP, and all kinds of other methods will produce clusters from whatever data you provide. But so will k-means always return k clusters. One shouldn’t trust any model without some kind of evaluation of its results.

If you don’t take them at face value, UMAP and Co. can be powerful tools to explore data quickly and interactively. Look no further than the cool workflows Vincent Warmerdam is building for annotating text.

Failure Modes of State Space Models

State space models are great, but they will fail in predictable ways. Well, claiming that they “fail” is a bit unfair. They actually behave exactly as they should given the input data. But if the input data fails to adhere to the Normal assumption or lacks stationarity, then this will affect the prediction derived from the state space models in perhaps unexpected yet deterministic ways. This article ensures that none of us is surprised by these “failure modes”.

Continue reading?

Approach to Estimate Uncertainty Distributions of Walmart Sales

We present our solution for the M5 Forecasting - Uncertainty competition. Our solution ranked 6th out of 909 submissions across all hierarchical levels and ranked first for prediction at the finest level of granularity (product-store sales, i.e. SKUs). The model combines a multi-stage state-space model and Monte Carlo simulations to generate the forecasting scenarios (trajectories). Observed sales are modelled with negative binomial distributions to represent discrete over-dispersed sales. Seasonal factors are hand-crafted and modelled with linear coefficients that are calculated at the store-department level.

The approach chosen by this team of prior Lokad employees hits all the sweet spots. It’s simple, yet comes 6th in a Kaggle challenge, and produces multi-horizon sample paths.

Having the write-up of a well-performing result available in this detail is great—they share some nuggets:

Considering the small search space, this optimisation is done via grid search.

Easy to do for a two-parameter model and a neat trick to get computational issues under control. Generally neat to also enforce additional prior knowledge via arbitrary constraints on the search space.

According to the M5 survey by Makridakis et al. [3], our solution had the best result at the finest level of granularity (level 12 in the competition), commonly referred to as product-store level or SKU level (Stock Keeping Unit). For store replenishment and numerous other problems, the SKU level is the most relevant level.

Good on them to point this out. Congrats!

On Google Maps Directions

Google Maps and its Directions feature are the kind of data science product everyone wished they’d be building. It augments the user, enabling decision-making while driving. Directions exemplifies the difference between prediction and prescription. Google Maps doesn’t just expose data, and it doesn’t provide a raw analysis by-product like SHAP values. It processes historical and live data to predict the future and to optimize my route based on it, returning only the refined recommendations.

Continue reading?

Forecasting Uncertainty Is Never Too Large

Rob J. Hyndman gave a presentation titled “Uncertain futures: what can we forecast and when should we give up?” as part of the ACEMS public lecture series with recording available on Youtube.

He makes an often underappreciated point around minute 50 of the talk:

When the forecast uncertainty is too large to assist decision making? I don’t think that’s ever the case. Forecasting uncertainty being too large does assist decision making by telling the decision makers that the future is very uncertain and they should be planning for lots of different possible outcomes and not assuming just one outcome or another. And one of the problems we have in providing forecasts to decision makers is getting them to not focus in on the most likely outcome but to actually take into account the range of possibilities and to understand that futures are uncertain, that they need to plan for that uncertainty.

What Needs to Prove True for This to Work?

Data science projects are a tricky bunch. They entice you with challenging problems and promise a huge return if successful. In contrast to more traditional software engineering projects, however, data science projects entail more upfront uncertainty: You’ll not know until you tried whether the technology is good enough to solve the problem. Consequently, a data science endeavor fails more often, or doesn’t turn out to be the smash hit you and your stakeholders expected it to be.

Continue reading?

Everything is an AI Technique

Along with their proposal for regulation of artificial intelligence, the EU published a definition of AI techniques. It includes everything, and that’s great!

From the proposal’s Annex I:


  • (a) Machine learning approaches, including supervised, unsupervised and reinforcement learning, using a wide variety of methods including deep learning;
  • (b) Logic- and knowledge-based approaches, including knowledge representation, inductive (logic) programming, knowledge bases, inference and deductive engines, (symbolic) reasoning and expert systems;
  • (c) Statistical approaches, Bayesian estimation, search and optimization methods.

Unsurprisingly, this definition and the rest of the proposal made the rounds: Bob Carpenter quipped about the fact that according to this definition, he has been doing AI for 30 years now (and that the EU feels the need to differentiate between statistics and Bayesian inference). In his newsletter, Thomas Vladeck takes the proposal apart to point out potential ramifications for applications. And Yoav Goldberg was tweeting about it ever since a draft of the document leaked.

From a data scientist’s point of view, this definition is fantastic: First, it highlights that AI is a marketing term used to sell whatever method does the job. Not including optimization as AI technique would have given everyone who called their optimizer “AI” a way to wiggle out of the regulation otherwise. This implicit acknowledgement is welcome.

Second, and more importantly, as practitioner it’s practical to have this “official” set of AI techniques in your backpocket for when someone asks what exactly AI is. The fact that one doesn’t have to use deep learning to wear the AI bumper sticker means that we can be comfortable in choosing the right tool for the job. At this point, AI refers less to a set of techniques or artificial intelligence, and more to a family of problems that are solved by one of the tools listed above.

Resilience, Chaos Engineering and Anti-Fragile Machine Learning

In his interview with The Observer Effect, Tobi Lütke, CEO of Shopify, describes how Shopify benefits from resilient systems:

Most interesting things come from non-deterministic behaviors. People have a love for the predictable, but there is value in being able to build systems that can absorb whatever is being thrown at them and still have good outcomes.

So, I love Antifragile, and I make everyone read it. It finally put a name to an important concept that we practiced. Before this, I would just log in and shut down various servers to teach the team what’s now called chaos engineering.

But we’ve done this for a long, long time. We’ve designed Shopify very well because resilience and uptime are so important for building trust. These lessons were there in the building of our architecture. And then I had to take over as CEO.

It sticks out that Lütke uses “resilient” and “antifragile” interchangeably even though Taleb would point out that they are not the same thing: Whereas a resilient system doesn’t fail due to randomly turned off servers, an antifragile system benefits. (Are Shopify’s systems robust or have they become somehow better beyond robust due to their exposure to “chaos”?)

But this doesn’t diminish Lütke’s notion of resilience and uptime being “so important for building trust” (with users, presumably): Users’ trust in applications is fragile. Earning users’ trust in a tool that augments or automates decisions is difficult, and the trust is withdrawn quickly when the tool makes a stupid mistake. Making your tool robust against failure modes is how you make it trustworthy—and used.

Which makes it interesting to reason about what an equivalent to shutting off random servers is to machine learning applications (beyond shutting off the server running the model). Label noise? Shuffling features? Adding Covid-19-style disruptions to your time series? The latter might be more related to the idea of experimenting with a software system in production.

And—to return to the topic of discerning anti-fragile and robust—what would it mean for machine learning algorithms “to gain from disorder”? Dropout comes to mind. What about causal inference through natural experiments?

Embedding Many Time Series via Recurrence Plots

We demonstrate how recurrence plots can be used to embed a large set of time series via UMAP and HDBSCAN to quickly identify groups of series with unique characteristics such as seasonality or outliers. The approach supports exploratory analysis of time series via visualization that scales poorly when combined with large sets of related time series. We show how it works using a Walmart dataset of sales and a Citi Bike dataset of bike rides.

Continue reading?

Rediscovering Bayesian Structural Time Series

This article derives the Local-Linear Trend specification of the Bayesian Structural Time Series model family from scratch, implements it in Stan and visualizes its components via tidybayes. To provide context, links to GAMs and the prophet package are highlighted. The code is available here. I tried to come up with a simple way to detect “outliers” in time series. Nothing special, no anomaly detection via variational auto-encoders, just finding values of low probability in a univariate time series.

Continue reading?

Are You Sure This Embedding Is Good Enough?

Suppose you are given a data set of five images to train on, and then have to classify new images with your trained model. Five training samples are in general not sufficient to train a state-of-the-art image classification model, thus this problem is hard and earned it’s own name: few-shot image classification. A lot has been written on few-shot image classification and complex approaches have been suggested.1 Tian et al.

Continue reading?

The Causal Effect of New Year’s Resolutions

We treat the turn of the year as an intervention to infer the causal effect of New Year’s resolutions on McFit’s Google Trend index. By comparing the observed values from the treatment period against predicted values from a counterfactual model, we are able to derive the overall lift induced by the intervention. Throughout the year, people’s interest in a McFit gym membership appears quite stable.1 The following graph shows the Google Trend for the search term “McFit” in Germany for April 2017 to until the week of December 17, 2017.

Continue reading?

satRday Berlin Presentation

My satRday Berlin slides on “Modeling Short Time Series” are available here. This saturday, June 15, Berlin had its first satRday conference. I eagerly followed the hashtags of satRday Amsterdam last year and satRday Capetown the year before that on Twitter. Thanks to Noa Tamir, Jakob Graff, Steve Cunningham, and many others, we got a conference in Berlin as well. When I saw the call for papers, I jumped at the opportunity to present, trying what it feels like to be on the other side of the microphone; being in the hashtag instead of following it.

Continue reading?

Modeling Short Time Series with Prior Knowledge

I just published a longer case study, Modeling Short Time Series with Prior Knowledge: What ‘Including Prior Information’ really looks like. It is generally difficult to model time series when there is insuffient data to model a (suspected) long seasonality. We show how this difficulty can be overcome by learning a seasonality on a different, long related time series and transferring the posterior as a prior distribution to the model of the short time series.

Continue reading?

The Probabilistic Programming Workflow

Last week, I gave a presentation about the concept of and intuition behind probabilistic programming and model-based machine learning in front of a general audience. You can read my extended notes here. Drawing on ideas from Winn and Bishop’s “Model-Based Machine Learning” and van de Meent et al.’s “An Introduction to Probabilistic Programming”, I try to show why the combination of a data-generating process with an abstracted inference is a powerful concept by walking through the example of a simple survival model.

Continue reading?

Problem Representations and Model-Based Machine Learning

Back in 2003, Paul Graham, of Viaweb and Y Combinator fame, published an article entitled “Better Bayesian Filtering”. I was scrolling chronologically through his essays archive the other day when this article stuck out to me (well, the “Bayesian” keyword). After reading the first few paragraphs, I was a little disappointed to realize the topic was Naive Bayes rather than Bayesian methods. But it turned out to be a tale of implementing a machine learning solution for a real world application before anyone dared to mention AI in the same sentence.

Continue reading?

Videos from PROBPROG 2018 Conference

Videos of the talks given at the International Conference on Probabilistic Programming (PROBPROG 2018) back in October were published a few days ago and are now available on Youtube. I have not watched all presentations yet, but a lot of big names attended the conference so there should be something for everyone. In particular the talks by Brooks Paige (“Semi-Interpretable Probabilistic Models”) and Michael Tingley (“Probabilistic Programming at Facebook”) made me curious to explore their topics more.

Videos from Exploration in RL Workshop at ICML

One of the many fantastic workshops at ICML this year was the Exploration in Reinforcement Learning workshop. All talks were recorded and are now available on Youtube. Highlights include presentations by Ian Osband, Emma Brunskill, and Csaba Szepesvari, among others. You can find the workshop’s homepage here with more information and the accepted papers.

SVD for a Low-Dimensional Embedding of Instacart Products

Building on the Instacart product recommendations based on Pointwise Mutual Information (PMI) in the previous article, we use Singular Value Decomposition to factorize the PMI matrix into a matrix of lower dimension (“embedding”). This allows us to identify groups of related products easily. We finished the previous article with a long table where every row measured how surprisingly often two products were bought together according to the Instacart Online Grocery Shopping dataset.

Continue reading?

Pointwise Mutual Information for Instacart Product Recommendations

Using pointwise mutual information, we create highly efficient “customers who bought this item also bought” style product recommendations for more than 8000 Instacart products. The method can be implemented in a few lines of SQL yet produces high quality product suggestions. Check them out in this Shiny app. Back in school, I was a big fan of the Detective Conan anime. For whatever reason, one of the episodes stuck with me.

Continue reading?

Pokémon Recommendation Engine

Using t-SNE, I wrote a Shiny app that recommends similar Pokémon. Try it out here. Needless to say, I was and still am a big fan of the Pokémon games. So I was very excited to see that a lot of the meta data used in Pokémon games is available on Github due to the Pokémon API project. Data on Pokémon’s names, types, moves, special abilities, strengths and weaknesses is all cleanly organized in a few dozen csv files.

Continue reading?

Look At All These Links

By now, some time has passed since NIPS 2016. Consequently, several recaps can be found on blogs. One of them is this one by Eric Jang. If you want to make your first steps in putting some of the theory presented at NIPS into practice, why not take a look at this slide deck about reinforcement learning in R? The RStudio Conference also took place, and apparently has been a blast.

Continue reading?

Multi-Armed Bandits at Tinder

In a post on Tinder’s tech blog, Mike Hall presents a new application for multi-armed bandits. At Tinder, they started to use multi-armed bandits to optimize the photo of users that is shown first: While a user can have multiple photos in his profile, only one of them is shown first when another user swipes through the deck of user profiles. By employing an adapted epsilon-greedy algorithm, Tinder optimizes this photo for the “Swipe-Right-Rate”. Mike Hall about the project:

It seems to fit our problem perfectly. Let’s discover which profile photo results in the most right swipes, without wasting views on the low performing ones. …

We were off to a solid start with just a little tweaking and tuning. Now, we are able to leverage Tinder’s massive volume of swipes in order to get very good results in a relatively small amount of time, and we are convinced that Smart Photos will give our users a significant upswing in the number of right swipes they are receiving with more complex and fine-tuned algorithms as we move forward.

Look At All These Links

At Airbnb, the data science team has written their own R packages to scale with the company’s growth. The most basic achievement of the packages is the standardization of the work (ggplot and RMarkdown templates) and reduction of duplicate effort (importing data). New employees are introduced to the infrastructure with extensive workshops. This reminded me of a presentation by Hilary Parker in April at the New York R Conference on Scaling Analysis Responsibly.

Continue reading?

Three Types of Cluster Reproducibility

Christian Hennig provides a function called clusterboot() in his R package fpc which I mentioned before when talking about assessing the quality of a clustering. The function runs the same cluster algorithm on several bootstrapped samples of the data to make sure that clusters are reproduced in different samples; it validates the cluster stability. In a similar vein, the reproducibility of clusterings with subsequent use for marketing segmentation is discussed in this paper by Dolnicar and Leisch.

Continue reading?

Assessing the Quality of a Clustering Solution

During one of the talks at PyData Berlin, a presenter quickly mentioned a k-means clustering used to group similar clothing brands. She commented that it wasn’t perfect, but good enough and the result you would expect from a k-means clustering. There remains the question, however, how one can assess whether a clustering is “good enough”. In above case, the number of brands is rather small, and simply by looking at the groups one is able to assess whether the combination of Tommy Hilfiger and Marc O’Polo is sensible.

Continue reading?

Taxi Pulse of New York City

I don’t know about you, but I think taxi data is fascinating. There is a lot you can do with the data sets as they usually contain observations on geolocation as well as time stamps besides other information, which makes them unique. Geolocation and timestamps alone, as well as the large number of observations in cities like New York enable you to create stunning visualizations that aren’t possible with any other set of data.

Continue reading?

Analyzing Taxi Data to Create a Map of New York City

Yet another day was spent working on the taxi data provided by the NYC Taxi and Limousine Commission (TLC). My goal in working with the data was to create a plot that maps the streets of New York using the geolocation data that is provided for the taxis’ pickup and dropoff locations as longitude and latitude values. So far, I had only used the dataset for January of 2015 to plot the locations; also, I hadn’t used the more than 12 million observations in January alone but a smaller sample (100000 to 500000 observations).

Continue reading?