Statistics

2020/01/18
The Causal Effect of New Year’s Resolutions

We treat the turn of the year as an intervention to infer the causal effect of New Year’s resolutions on McFit’s Google Trend index. By comparing the observed values from the treatment period against predicted values from a counterfactual model, we are able to derive the overall lift induced by the intervention. Throughout the year, people’s interest in a McFit gym membership appears quite stable.1 The following graph shows the Google Trend for the search term “McFit” in Germany for April 2017 to until the week of December 17, 2017.

Continue reading?


2016/06/14
Three Types of Cluster Reproducibility

Christian Hennig provides a function called clusterboot() in his R package fpc which I mentioned before when talking about assessing the quality of a clustering. The function runs the same cluster algorithm on several bootstrapped samples of the data to make sure that clusters are reproduced in different samples; it validates the cluster stability. In a similar vein, the reproducibility of clusterings with subsequent use for marketing segmentation is discussed in this paper by Dolnicar and Leisch.

Continue reading?


2016/05/30
Assessing the Quality of a Clustering Solution

During one of the talks at PyData Berlin, a presenter quickly mentioned a k-means clustering used to group similar clothing brands. She commented that it wasn’t perfect, but good enough and the result you would expect from a k-means clustering. There remains the question, however, how one can assess whether a clustering is “good enough”. In above case, the number of brands is rather small, and simply by looking at the groups one is able to assess whether the combination of Tommy Hilfiger and Marc O’Polo is sensible.

Continue reading?