2025/05/10
‘Data Engineers Should Be Held to the Same Standards as Bakers’
∞
Over at Hermit Tech, Jordan Andersen laments the embarrassingly low project-delivery expectations in the software and data industry compared to other professions, such as that of a baker. His concern, however, is not only with the project’s customers, but also with the developers:
The latter consequence of low expectations is the more serious one. It’s something that some people may never recover from. […] The trajectory of a data or software engineer seems to be a bit different: they burn out from the mental stress of working in dishonest and fraudulent cultures that produce unsatisfying work, but the engineer stays in the profession. Only they’re a shell of a human.
2025/05/10
Patrick Collison Interviews Jony Ive at Stripe Sessions
∞
Having browsed my thesaurus, I’m settling on thoughtful as the one word I’d choose to describe Jony Ive, as he is both deliberate in his words and stories and attentive to those who use his tools. Tools, not products, because Ive describes himself as a toolmaker.
Ive’s description of his profession and work could come across as pretentious were he less self-aware and, at times, self-deprecating. Instead, presented in his thoughtful manner, Ive’s attitude is the kind of stuff that gets you out of bed in the morning. Impart meaning to your work by caring to the extent that humans who interact with your work sense that someone cared.
This conversation is like a Happy Meal, full of little golden nuggets.
2025/02/23
‘Performance of Zero-Shot Time Series Foundation Models on Cloud Data’
∞
Toner et al. compare time series foundation models such as Chronos, Mamba4Cast, and TimesFM on data from Huawei data centers (the data is available on GitHub!):
To examine the behaviour of (zero-shot) FMs on cloud time series, we perform experiments on function demand data drawn from real-world usage. Our results show that FMs perform poorly, being outperformed by the simple baselines of a linear model and a naive seasonal forecaster.
If you think about it, the seasonal naive method is the original zero-shot foundation model.
For example, the naive seasonal forecaster performs better than all the FMs across all datasets and forecast horizons. Moreover, the performance difference is often large; for example, the naive seasonal forecaster incurs a MASE typically half that of TimesFM.
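For reference, a minimal sketch of that baseline and metric—a made-up hourly series with daily seasonality, not the paper’s setup:

```python
import numpy as np

def seasonal_naive(history: np.ndarray, horizon: int, season: int = 24) -> np.ndarray:
    """Forecast by repeating the last observed season."""
    last_season = history[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]

def mase(actuals: np.ndarray, forecast: np.ndarray, history: np.ndarray, season: int = 24) -> float:
    """Mean absolute error scaled by the in-sample seasonal naive error."""
    scale = np.mean(np.abs(history[season:] - history[:-season]))
    return np.mean(np.abs(actuals - forecast)) / scale

# toy example: two weeks of hourly data with a noisy daily pattern
rng = np.random.default_rng(0)
t = np.arange(14 * 24)
series = 10 + 5 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, len(t))
history, actuals = series[:-24], series[-24:]

forecast = seasonal_naive(history, horizon=24)
print(round(mase(actuals, forecast, history), 2))
```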
Alas, the authors make little attempt to explain why the foundation models are outperformed by the seasonal naive method beyond the data’s spikiness. But perhaps that’s just it. Chronos, for example, is known to fail on spiky data due to its mean scaling and quantization (see figure 16 in the corresponding paper).
We also present evidence of pathological behaviours of FMs, demonstrating they can produce chaotic (as in Figure 1) or illogical (as in Figure 2) forecasts.
Check out the mentioned plots in figures 1, 2, 4 and 5. They’re why I don’t trust any paper that doesn’t visualize its method’s predictions.
2024/10/06
Explosion Back to Its Roots
∞
In 2016, Matthew Honnibal presented at the first PyData Berlin meetup that I attended. He had already started spaCy and was training models on Reddit comments, whereas I wasn’t really into NLP and heard about spaCy for the first time that evening. And while I’m still not really into NLP, I never stopped keeping tabs on what he and Ines Montani were building at Explosion.
Honnibal recounts Explosion’s history in view of recent changes:
For most of Explosion’s life we’ve been a very small company, running off revenues. In 2021 that changed, and we became a slightly less small company running off venture capital. We’ve been unable to make that configuration work, so we’re back to running Explosion as an independent-minded self-sufficient company. We’re going to stay small and not look for any more venture capital. spaCy and Prodigy will continue.
With a focus on “industrial-strength”, Explosion has built opinionated data science tooling with beautiful documentation. spaCy is beloved open-source software (with the community coming together for spaCy IRL—a real treat of a conference) that convinces data scientists to spend company budget on Prodigy. This combination of spaCy and Prodigy is the ingredient to Explosion’s unique success as a small, self-sufficient company in a venture-funded AI environment. Already familiar with spaCy, data scientists are comfortable purchasing Prodigy licenses to ease the annotation workflows common to NLP. And being technical expert users, they are also capable of hosting the software themselves. Explosion doesn’t have to handle customers' data.
License revenues, no hosting, no data: Enablers of a profitable business run by a small team. I hope they continue to thrive!
In his post, Honnibal shares realities of maintaining software that companies and developers rarely admit to, yet which are determinants of a team’s success:
Engineering for spaCy and our other projects was also very challenging to hand over. spaCy is implemented in Cython, and big chunks of the project are essentially C code with funny syntax. We pass around pointers to arrays of structs, and if you access them out of bounds, well, hopefully it crashes. You have to just not do that. And then in addition to this memory-managed code, there’s all the GPU-specific considerations, all the numpy minutiae, and maintaining compatibility with a big matrix of Python versions, operating systems and hardware. It’s a lot.
The infrastructure required for machine learning doesn’t make it any easier:
I’ve been finding the transition back to the way things were quite difficult. I still know our codebases well, but the associated infrastructure isn’t easy to wrangle. Overall I haven’t been very productive over the last few months, but it’s getting better now.
On top come unexpected team dynamics as the previous architect shifts his focus:
As I became less involved in the hands-on work, I struggled to be effective as a decision-maker. A lot of the bigger questions got deferred, and we had an increasing bias towards whichever approach was least committal.
On a different note, I am fascinated that Hugging Face has the funds to provide a quarter-million grant for open-source developers. How many grants like this do they hand out?1
We considered selling the company, but we weren’t able to find a good fit. Instead, we’re back at the same sort of size we had before the investment. We’re very grateful to Hugging Face for a $250,000 grant to support our open-source work as our funding ran out, and we’ve applied successfully for a German R&D reimbursement grant that will give us up to €1.5m in unconditional funding.
To me, Explosion is one of the coolest exports that Berlin and Germany have to offer. Great to see them receive such a grant.
-
At roughly 0.1% of their Series D funding round, there might be a few. ↩︎
2024/10/06
The New Internet
∞
Avery Pennarun, CEO of Tailscale, in a company all-hands meeting:
In modern computing, we tolerate long builds, and then docker builds, and uploading to container stores, and multi-minute deploy times before the program runs, and even longer times before the log output gets uploaded to somewhere you can see it, all because we’ve been tricked into this idea that everything has to scale. People get excited about deploying to the latest upstart container hosting service because it only takes tens of seconds to roll out, instead of minutes. But on my slow computer in the 1990s, I could run a perl or python program that started in milliseconds and served way more than 0.2 requests per second, and printed logs to stderr right away so I could edit-run-debug over and over again, multiple times per minute.
2024/07/27
Trained Random Forests Completely Reveal Your Dataset
∞
The paper’s title carries a small dose of clickbait: as of yet, not all but only some trained random forests completely reveal your dataset. Still, using constraint programming, the authors completely reconstruct the data used to train random forests for binary classification, fitted without bagging on exclusively binary features.
I imagine a lot of random forests have been trained on sensitive data in the past, their model files handled more loosely than the data itself. What private information could the model possibly reveal? Yeah.
Watch Julien Ferry present his paper in a video recorded for ICML 2024.
2024/07/26
No, Hashing Still Doesn’t Make Your Data Anonymous
∞
Just the Federal Trade Commission (FTC) reminding all of us that you can’t anonymize private data by hashing unique identifiers.
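A minimal sketch of why, with made-up addresses: a hash is a stable pseudonym, and anyone holding a list of candidate identifiers can recompute the hashes and match them.

```python
import hashlib

def sha256(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

# what a company might share, believing it to be "anonymous"
shared_hash = sha256("jane.doe@example.com")

# what a recipient can do with identifiers it already knows
known_emails = ["john@example.com", "jane.doe@example.com", "alice@example.org"]
lookup = {sha256(email): email for email in known_emails}

print(lookup.get(shared_hash))  # -> jane.doe@example.com
```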
And the stories the FTC has to tell:
In 2022 the FTC brought a case against an online counseling service BetterHelp, alleging they had shared consumers’ sensitive health data—including hashed email addresses—with Facebook. The complaint laid out that BetterHelp knew that Facebook would “undo the hashing and reveal the email addresses of those Visitors and Users.” Though BetterHelp sent hashes to Facebook, rather than email addresses, the outcome was the same: Facebook allegedly learned who was seeking counselling for mental health and used that sensitive information to target ads to them.
What will be the equivalent to hashing when it comes to regulation of AI? When reviewing a company’s practices, hashing is straightforward to find and offers a black-and-white case. But when reviewing “an appropriate level of accuracy” of a system or the “appropriate measures to detect, prevent and mitigate possible biases”, what will clearly be not good enough?
2024/07/19
How One Bad CrowdStrike Update Crashed the World’s Computers
∞
Days like today serve as a reminder that software doesn’t have to be AI to bring high-risk infrastructure to a halt. From Code Responsibly:
Regulating AI is awkward. Where does the if-else end and AI start? Instead, consider it part of software as a whole and ask in which cases software might have flaws we are unwilling to accept. We need responsible software, not just responsible AI.
Thanks to everyone who has to spend the weekend cleaning up.
2024/06/16
How The Economist’s Presidential Forecast Works
∞
The Economist is back with a forecast for the 2024 US presidential election in collaboration with Andrew Gelman and others. One detail in the write-up of their approach stood out to me:
The ultimate result is a list of 10,001 hypothetical paths that the election could take.
Not 10,000, but 10,000 and one MCMC samples. I can’t remember seeing any reference for this choice before (packages love an even number as default), but I have been adding a single additional sample as tie breaker for a long time: If nothing else, it comes in handy to have a dedicated path represent the median to prevent an awkward median estimate of 269.5 electoral votes.
The extra sample is especially helpful when the main outcome of interest is a sum of more granular outcomes. In the case of the presidential election, the main outcome is the sum of electoral votes provided by the states. One can first identify the median of the main outcome (currently 237 Democratic electoral votes). Given the extra sample, there will be one MCMC sample that results in the median. From here, one can work backwards and identify this sample index and the corresponding value for every state, for example. The value might not be a state’s most “representative” outcome and it is unlikely to be the state’s median number of electoral votes. But the sum across states will be the median of the main outcome. Great for a visualization depicting what scenario would lead to this projected constellation of the electoral college.
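A minimal sketch of the trick with made-up state-level samples: with an odd number of draws, exactly one draw realizes the median total, and its index can then be used to look up every state’s value.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 10_001  # odd on purpose

# hypothetical samples of electoral votes won per state (three states for brevity)
states = {
    "A": rng.binomial(1, 0.6, n_samples) * 10,
    "B": rng.binomial(1, 0.5, n_samples) * 20,
    "C": rng.binomial(1, 0.4, n_samples) * 25,
}

total = sum(states.values())                    # per-sample sum across states
median_idx = np.argsort(total)[n_samples // 2]  # index of the draw realizing the median

print("median total:", total[median_idx])
print({name: int(votes[median_idx]) for name, votes in states.items()})
```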
In contrast, summing up the median outcome of each state, there would be only 226 Democratic electoral votes as of today.1
CA, DC, HI, MA, MD, ME 1, NY, VT, CO, CT, DE, IL, NJ, OR, RI, WA, NM, ME, NE 2, VA, NH, MN.↩︎
2024/06/16
Helmut Schmidt Future Prize Winner’s Speech
∞
Meredith Whittaker, president of the Signal Foundation, received the Helmut Schmidt Future Prize in May. In her prize winner’s speech, she highlights the proliferation of artificial intelligence applications from ad targeting to military applications:
We are all familiar with being shown ads in our feeds for yoga pants (even though you don’t do yoga) or a scooter (even if you just bought one), or whatever else. We see these because the surveillance company running the ad market or social platform has determined that these are things “people like us” are assumed to want or be attracted to, based on a model of behavior built using surveillance data. Since other people with data patterns that look like yours bought a scooter, the logic goes, you will likely buy a scooter (or at least click on an ad for one). And so you’re shown an ad. We know how inaccurate and whimsical such targeting is. And when it’s an ad it’s not a crisis when it’s mistargeted. But when it’s more serious, it’s a different story.
It’s all fun and games as long as we’re talking about which ad is served next. Code responsibly.
2024/05/06
(Deep) Non-Parametric Time Series Forecaster
∞
If you read The History of Amazon’s Forecasting Algorithm, you’ll hear about fantastic models such as Quantile Random Forests and the MQTransformer. In GluonTS you’ll find DeepAR and DeepVARHierarchical. But the real hero is the simple model that does the work when all else fails. Tim Januschowski on LinkedIn:
One of the baselines that we’ve developed over the years is the non-parametric forecaster or NPTS for short. Jan Gasthaus invented it probably a decade ago and Valentin Flunkert made it seasonality aware and to the best of my knowledge it’s been re-written a number of times and still runs for #amazon retail (when other surrounding systems were switched off long ago).
Januschowski mentions this to celebrate the arXiv paper describing NPTS and its DeepNPTS variant with additional “bells and whistles”. Which I celebrate as I no longer have to refer people to section 4.3 of the GluonTS paper.
2024/04/15
Reliably Forecasting Time-Series in Real Time
∞
Straight from my YouTube recommendations, a PyData London 2018 (!) presentation by Charles Masson of Datadog. To predict whether server metrics cross a threshold, he builds a model that focuses on being robust to all the usual issues of anomalies and structural breaks. He keeps it simple, interpretable, and–for the sake of real-time forecasting–fast. Good stuff all around. The GIFs are the cherry on top.
2024/03/27
Chronos: Learning the Language of Time Series
∞
Ansari et al. (2024) introduce their Chronos model family on Github:
Chronos is a family of pretrained time series forecasting models based on language model architectures. A time series is transformed into a sequence of tokens via scaling and quantization, and a language model is trained on these tokens using the cross-entropy loss. Once trained, probabilistic forecasts are obtained by sampling multiple future trajectories given the historical context.
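A minimal sketch of that scaling-and-quantization idea as I read it—not the actual Chronos implementation; the bin range and vocabulary size are made up:

```python
import numpy as np

def tokenize(context: np.ndarray, n_tokens: int = 512, limit: float = 10.0):
    """Mean-scale a series, then map each value to one of n_tokens bins."""
    scale = float(np.mean(np.abs(context))) or 1.0
    scaled = context / scale
    edges = np.linspace(-limit, limit, n_tokens - 1)  # fixed bin edges
    return np.digitize(scaled, edges), scale          # token ids in [0, n_tokens - 1]

def detokenize(tokens: np.ndarray, scale: float, n_tokens: int = 512, limit: float = 10.0):
    """Map token ids back to approximate values via bin centers."""
    centers = np.linspace(-limit, limit, n_tokens)
    return centers[tokens] * scale

series = np.array([120.0, 130.0, 125.0, 5000.0])  # a spiky series
tokens, scale = tokenize(series)
print(tokens)
print(detokenize(tokens, scale))
# the spike inflates the scale, so the ordinary values get squeezed
# into a few neighbouring bins and lose resolution
```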
The whole thing is very neat. The repository can be pip-installed as a package wrapping the pre-trained models on Hugging Face so that the installation and “Hello, world” example just work, and the paper is extensive at 40 pages overall. I commend the authors for using that space to include section 5.7 “Qualitative Analysis and Limitations”, discussing and visualizing plenty of examples. The limitation arising from the quantization approach (Figure 16) would not have been as clear otherwise.
Speaking of quantization, the approach used to tokenize time series onto a fixed vocabulary reminds me of the 2020 paper “The Effectiveness of Discretization in Forecasting” by Rabanser et al., a related group of (former) Amazon researchers.
The large set of authors of Chronos also points to the NeurIPS 2023 paper “Large Language Models Are Zero-Shot Time Series Forecasters”, though the approach of letting GPT-3 or LLaMA-2 predict a sequence of numeric values directly is very different.
2024/03/25
Average Temperatures by Month Instead of Year
∞
This tweet is a prime example of why it’s hard to analyze one signal in a time series (here, its trend) without simultaneously adjusting for its other signal components (here, its seasonality).
If the tweet gets taken down, perhaps this screenshot on Mastodon remains.
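A minimal, made-up illustration of the underlying point: read a trend off raw monthly temperatures and the seasonal cycle masquerades as dramatic cooling; compare like months with like and the actual trend reappears.

```python
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(10 * 12)                       # ten years of monthly data
temps = (
    0.02 * months                                 # slow warming trend
    + 10 * np.sin(2 * np.pi * (months - 3) / 12)  # seasonal cycle peaking in summer
    + rng.normal(0, 0.3, len(months))
)

# a trend line through the last six raw months (July through December)
# mistakes the seasonal cycle for dramatic cooling
print(np.polyfit(np.arange(6), temps[-6:], 1)[0])

# comparing the same calendar months five years apart removes the seasonal
# component and recovers the warming trend (about 0.02 per month)
print(np.mean(temps[-6:] - temps[-66:-60]) / 60)
```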
2024/03/16
Demystifying the Draft EU AI Act
∞
Speaking of AI Act details, the paper “Demystifying the Draft EU AI Act” (Veale and Borgesius, 2021) has been a real eye-opener and fundamental to my understanding of the regulation.1
Unlike most coverage of the regulation, the two law researchers highlight the path by which EU law eventually impacts practice: via standards and company-internal self-assessments. This explains why you will be left wondering what human oversight and technical robustness mean after reading the AI Act. The AI Act purposely does not provide specifications practitioners could follow to stay within the law when developing AI systems. Instead, specifics are outsourced to the private European standardization agencies CEN and CENELEC. The EU Commission will task them with defining standards (think ISO or DIN) that companies can then follow when implementing their systems and subsequently self-assess against. This is nothing unusual in EU lawmaking (for example, it’s used for medical devices and kids' chemistry sets). But, as the authors argue, it implies that “standardisation is arguably where the real rule-making in the Draft AI Act will occur”.
Chapter III, section 4 “Conformity Assessment and Presumption” for high-risk AI systems, as well as chapters V and VI provide context not found anywhere else, leading up to strong concluding remarks:
The high-risk regime looks impressive at first glance. But scratching the surface finds arcane electrical standardisation bodies with no fundamental rights experience expected to write the real rules, which providers will quietly self-assess against.
-
As the paper’s title suggests, it was written in 2021 as a dissection of the EU Commission’s initial proposal of the AI Act. Not all descriptions might apply to the current version adopted by the EU Parliament on Tuesday. Consequently, the new regulation of foundation models, for example, is not covered. ↩︎
2023/10/05
Video: Tim Januschowski, ISF 2023 Practitioner Speaker
∞
We don’t have enough presentations of industry practitioners discussing the detailed business problems they’re addressing and what solutions and trade-offs they were able to implement. Tim Januschowski did just that, though, in his presentation at the International Symposium on Forecasting 2023. He discusses demand forecasting for optimal pricing at Zalando.
Presentations such as this one are rare opportunities to peek at the design of real-world solutions. My favorite quote:
What we’re not using that might be also interesting is weather data. My business counterparts, they always, you know, make me aware of that fact. But we’re not using it.
2023/09/11
‘ECB Must Accept Forecasting Limitations to Restore Trust’
∞
Christine Lagarde, president of the European Central Bank, declared her intent to communicate the shortcomings of the ECB’s forecasts better—and in doing so, provided applied data science lessons for the rest of us. As quoted by the Financial Times:
“Even if these [forecast] errors were to deplete trust, we can mitigate this if we talk about forecasts in a way that is both more contingent and more accessible, and if we provide better explanations for those errors,” Lagarde said.
2023/05/31
In Search of Verifiability: Explanations Rarely Enable Complementary Performance in AI-Advised Decision Making
∞
Raymond Fok and Daniel S. Weld in a recent arXiv preprint:
We argue explanations are only useful to the extent that they allow a human decision maker to verify the correctness of an AI’s prediction, in contrast to other desiderata, e.g., interpretability or spelling out the AI’s reasoning process.
This does ring true to me: Put yourself in the position of an employee of Big Company Inc. whose task it is to allocate marketing budgets, purchase product inventory, or make any other monetary decision as part of a business process. Her dashboard, powered by a data pipeline and a machine learning model, suggests increasing TV ad spend in channel XYZ, or ordering thousands of units of a seasonal product to cover the summer.
In her shoes, if you had to sign the check, what lets you sleep at night: Knowing the model’s feature importances, or having verified the prediction’s correctness?
I’d prefer the latter, and the former only insofar as it helps in the pursuit of verification. Feature importance alone, however, the authors argue, can’t determine correctness:
Here, we refer to verification of an answer as the process of determining its correctness. It follows that many AI explanations fundamentally cannot satisfy this desideratum […] While feature importance explanations may provide some indication of how much each feature influenced the AI’s decision, they typically do not allow a decision maker to verify the AI’s recommendation.
We want verifiability, but we cannot have it for most relevant supervised learning problems. The number of viewers of the TV ad is inherently unknown at prediction time, as is the demand for the seasonal product. These applications stand in stark contrast to the maze example the authors provide, in which the explanation method draws the proposed path through the maze.
If verifiability is needed to complement human decision making, then this might be why one can get the impression of explanation washing of machine learning systems: While current explanation methods are the best we can do, they fall short of what is really needed to trust a system’s recommendation.
What can we do instead? We could start by showing the actual data alongside the recommendation. Making the data explorable. The observation in question can be put into the context of observations from the training data for which labels exist, essentially providing case-based explanations.
Ideally, any context provided to the model’s recommendation is not based on another model that adds another layer to be verified, but on hard actuals.
In the case of forecasting, simply visualizing the forecast alongside the historical observations can be extremely effective at establishing trust. When the time series is stable and shows clear patterns, a human actually can verify the forecast’s correctness up to a point. And a human easily spots likely incorrect forecasts given historical data.
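A minimal sketch of what I mean, with a made-up series and a stand-in for the model’s output: show the forecast as a continuation of the actuals so a human can judge whether it looks plausible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
t = np.arange(36)
history = 100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, len(t))

h = np.arange(36, 48)
forecast = 100 + 2 * h + 10 * np.sin(2 * np.pi * h / 12)  # stand-in model output

plt.plot(t, history, label="actuals")
plt.plot(h, forecast, linestyle="--", label="forecast")
plt.axvline(t[-1], color="grey", linewidth=0.5)  # mark the forecast origin
plt.legend()
plt.show()
```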
The need for verifiability makes me believe in building data products, not just a model.
2023/05/29
Explainability Washing
∞
Upol Ehsan ponders on Mastodon:
Explainable AI suffers from an epidemic. I call it Explainability Washing. Think of it as window dressing–techniques, tools, or processes created to provide the illusion of explainability but not delivering it.
Ah yes, slapping feature importance values onto a prediction and asking your users “Are you not entertained?”.
This thread pairs well with Rick Saporta’s presentation. Both urge you to focus solely on your user’s decision when deciding what to build.
2023/05/29
A Framework for Data Product Management for Increasing Adoption & User Love
∞
You might have heard this one before: To build successful data products, focus on the decisions your customers make. But when was the last time you considered “how your work get[s] converted into action”?
At Data Council 2023, Rick Saporta lays out a framework of what data products to build and how to make them successful with customers. He goes beyond the platitudes; his advice sounds hard-earned.
Slides are good, talk is great.
2023/03/25
Bayesian Intermittent Demand Forecasting at NeurIPS 2016
∞
Oldie but a goodie: A recording of Matthias Seeger’s presentation of “Bayesian Intermittent Demand Forecasting for Large Inventories” at NeurIPS 2016. The corresponding paper is a favorite of mine, but I only now stumbled over the presentation. It sparked an entire catalogue of work on time series forecasting at Amazon and, like few others, called out the usefulness of sample paths.
2023/02/26
On the Factory Floor
∞
What works at Google-scale is not the pattern most data scientists need to employ at their work. But the paper “On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models” is the kind of paper that we need more of: Thrilling reports of what works in practice.
Also, the authors do provide abstract lessons anyone can use, such as considering the constraints of your problem rather than using whatever is state-of-the-art:
A major design choice is how to represent an ad-query pair x. The semantic information in the language of the query and the ad headlines is the most critical component. Usage of attention layers on top of raw text tokens may generate the most useful language embeddings in current literature [64], but we find better accuracy and efficiency trade-offs by combining variations of fully-connected DNNs with simple feature generation such as bi-grams and n-grams on sub-word units. The short nature of user queries and ad headlines is a contributing factor. Data is highly sparse for these features, with typically only a tiny fraction of non-zero feature values per example.
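As a toy illustration of the kind of feature generation mentioned there—character n-grams over sub-word-ish units, hashed into a sparse vector—and not Google’s actual setup (the hashing dimension and tokenization are made up):

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Character n-grams per word, a simple stand-in for sub-word units."""
    grams = []
    for word in text.lower().split():
        padded = f"#{word}#"
        grams += [padded[i : i + n] for i in range(len(padded) - n + 1)]
    return grams

def hashed_features(grams: list[str], dim: int = 2**20) -> dict[int, int]:
    """Sparse bag of hashed n-grams: only a tiny fraction of indices is non-zero."""
    feats: dict[int, int] = {}
    for gram in grams:
        idx = hash(gram) % dim  # a production system would use a stable hash
        feats[idx] = feats.get(idx, 0) + 1
    return feats

query = "running shoes"
ad_headline = "lightweight running shoes on sale"
print(len(hashed_features(char_ngrams(query + " " + ad_headline))))
```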
2023/01/08
SAP Design Guidelines for Intelligent Systems
∞
From SAP’s Design Guidelines for Intelligent Systems:
High–stakes decisions are more common in a professional software environment than in everyday consumer apps, where the consequences of an action are usually easy to anticipate and revert. While the implications of recommending unsuitable educational content to an employee are likely to be minimal, recommendations around critical business decisions can potentially cause irreversible damage (for example, recommending an unreliable supplier or business partner, leading to the failure or premature termination of a project or contract). It’s therefore vital to enable users to take an informed decision.
While sometimes overlooked, this guide presents software in internal business processes as a rich opportunity to augment human capabilities, deserving just as much love and attention as “everyday consumer apps”.
The chapters on intelligent systems are not too tuned to SAP systems, but they do have the specific context of business applications in mind, which differentiates them from other (great!) guides on user interfaces for machine learning systems.
Based on that context, the guide dives deep on Ranking, Recommendations, and Matching, proving that it’s based on a much more hands-on view than any text discussing Supervised, Unsupervised, and Reinforcement Learning.
When aiming to build systems that “augment human capabilities”, the importance of “gain[ing] the user’s trust and foster[ing] successful adoption” can’t be overstated, making it worthwhile to deeply consider how we present the output of our systems.
Related: Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI by Alon Jacovi, Ana Marasović, Tim Miller, and Yoav Goldberg.
2023/01/02
Skillful Image Fast-Forwarding
∞
Russel Jacobs for Slate on the discontinued Dark Sky weather app, via Daring Fireball:
Indeed, Dark Sky’s big innovation wasn’t simply that its map was gorgeous and user-friendly: The radar map was the forecast. Instead of pulling information about air pressure and humidity and temperature and calculating all of the messy variables that contribute to the weather–a multi-hundred-billion-dollars-a-year international enterprise of satellites, weather stations, balloons, buoys, and an army of scientists working in tandem around the world (see Blum’s book)–Dark Sky simply monitored changes to the shape, size, speed, and direction of shapes on a radar map and fast-forwarded those images. “It wasn’t meteorology,” Blum said. “It was just graphics practice.”
Reminds me of DeepMind’s “Skilful precipitation nowcasting using deep generative models of radar”.
2022/09/28
GluonTS Workshop at Amazon Berlin on September 29
∞
The workshop will revolve around tools that automatically transform your data, in particular time series, into high-quality predictions based on AutoML and deep learning models. The event will be hosted by the team at AWS that develops AutoGluon, Syne Tune and GluonTS, and consist of a mix of tutorial-style presentation on the tools, discussion, and contributions from external partners on their applications.
Unique opportunity to hear from industry practitioners and GluonTS developers in person or by joining online.
2022/09/14
Design a System, not an “AI”
∞
Ryxcommar on Twitter:
I think one of the bigger mistakes people make when designing AI powered systems is seeing them as an AI first and foremost, and not as a system first and foremost.
Once you have your API contracts in place, the AI parts can be seen as function calls inside the system. Maybe your first version of these functions just return an unconditional expected value. But the system is the bulk of the work, the algorithm is a small piece.
To me, this is why regulation of AI (in contrast to regulation of software generally) can feel misguided: Any kind of function within a system has the potential to be problematic. It doesn’t have to use matrix multiplication for that to be the case.
More interestingly though, this is why it’s so effective to start with a simple model. It provides the function call around which you can build the system users care about.
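A minimal, hypothetical sketch of that idea: the contract stays fixed while the “AI part” starts out as a function returning an unconditional expected value.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Forecast:
    sku: str
    expected_demand: float

# version 1: no model, just the unconditional expected value from history
def predict_demand(sku: str, history: list[float]) -> Forecast:
    return Forecast(sku=sku, expected_demand=mean(history))

# the rest of the system (ordering, alerts, UI) depends only on this contract,
# so a fancier model can later be swapped in behind the same signature
print(predict_demand("SKU-42", [3, 5, 4, 6, 2]))
```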
Some free advice for data scientists– every time I have seen people treat their systems primarily as AI and not as systems, both the AI and the system suffered for it. Don’t make that mistake; design a system, not an “AI.”
2022/09/06
Berlin Bayesians Meetup on September 27
∞
The Berlin Bayesians meetup is happening again in person. Juan Orduz is going to present Buy ‘Til You Die models implemented in PyMC:
In this talk, we introduce a certain type of customer lifetime models for the non-contractual setting commonly known as BTYD (Buy Till You Die) models. We focus on two sub-model components: the frequency BG/NBD model and the monetary gamma-gamma model. We begin by introducing the model assumptions and parameterizations. Then we walk through the maximum-likelihood parameter estimation and describe how the models are used in practice. Next, we describe some limitations and how the Bayesian framework can help us to overcome some of them, plus allowing more flexibility. Finally, we describe some ongoing efforts in the open source community to bring these new ideas and models to the public.
Buy ‘Til You Die models for the estimation of customer lifetime value were one of the first applications I worked on in my data science career; I’m glad to see they’re still around and kicking. Now implemented in the shiny new version of PyMC!
Edit: The event was rescheduled to September 27.
2022/07/11
Be Skeptical of the t-SNE Bunny
∞
Matt Henderson on Twitter (click through for the animation):
Be skeptical of the clusters shown in t-SNE plots! Here we run t-SNE on a 3d shape - it quickly invents some odd clusters and structures that aren’t really present in the original bunny.
What would happen if every machine learning method would come with a built-in visualization of the spurious results that it found?
Never mind the answer to that question. I think that this dimensionality reduction of a 3D bunny into two dimensions isn’t even all that bad—the ears are still pretty cute. And it’s not like the original data had a lot more global and local structure once you consider that the bunny is not much more than noise in the shape of a rectangle with two ears that human eyes ascribe meaning to.
I’m the first to admit that t-SNE, UMAP, and all kinds of other methods will produce clusters from whatever data you provide. But so will k-means always return k clusters. One shouldn’t trust any model without some kind of evaluation of its results.
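A minimal sketch of that point, assuming scikit-learn: k-means dutifully returns k clusters on pure noise.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
noise = rng.uniform(size=(1000, 3))  # no cluster structure at all

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(noise)
print(np.bincount(labels))  # five "clusters", as requested
```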
If you don’t take them at face value, UMAP and Co. can be powerful tools to explore data quickly and interactively. Look no further than the cool workflows Vincent Warmerdam is building for annotating text.
2021/12/29
Approach to Estimate Uncertainty Distributions of Walmart Sales
∞
We present our solution for the M5 Forecasting - Uncertainty competition. Our solution ranked 6th out of 909 submissions across all hierarchical levels and ranked first for prediction at the finest level of granularity (product-store sales, i.e. SKUs). The model combines a multi-stage state-space model and Monte Carlo simulations to generate the forecasting scenarios (trajectories). Observed sales are modelled with negative binomial distributions to represent discrete over-dispersed sales. Seasonal factors are hand-crafted and modelled with linear coefficients that are calculated at the store-department level.
The approach chosen by this team of former Lokad employees hits all the sweet spots. It’s simple, yet comes 6th in a Kaggle challenge, and produces multi-horizon sample paths.
Having the write-up of a well-performing result available in this detail is great—they share some nuggets:
Considering the small search space, this optimisation is done via grid search.
Easy to do for a two-parameter model, and a neat trick to keep computational issues under control. Constraints on the search space are also a simple way to enforce additional prior knowledge.
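A minimal sketch of such a grid search—not the authors’ actual model: fit a negative binomial’s two parameters by maximum likelihood over a small, constrained grid (grid bounds and data are made up).

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(0)
sales = rng.negative_binomial(n=2, p=0.3, size=200)  # toy over-dispersed demand

# a constrained search space encodes prior knowledge (e.g. plausible dispersion)
n_grid = np.linspace(0.5, 10, 40)
p_grid = np.linspace(0.05, 0.95, 40)

best = max(
    ((n, p) for n in n_grid for p in p_grid),
    key=lambda params: nbinom.logpmf(sales, *params).sum(),  # log-likelihood
)
print(best)
```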
According to the M5 survey by Makridakis et al. [3], our solution had the best result at the finest level of granularity (level 12 in the competition), commonly referred to as product-store level or SKU level (Stock Keeping Unit). For store replenishment and numerous other problems, the SKU level is the most relevant level.
Good on them to point this out. Congrats!