Last week, I gave a presentation about the concept of and intuition behind probabilistic programming and model-based machine learning in front of a general audience. The following are my extended notes.
The posterior probability distribution represents the unique and complete solution to the problem. There is no need to invent ‘estimators’; nor do we need to invent criteria for comparing alternative estimators with each other. Whereas orthodox statisticians offer twenty ways of solving a problem, and another twenty different criteria for deciding which of these solutions is the best, Bayesian statistics only offers one answer to a well-posed problem. (adapted from D. J.C. MacKay (2003), Information Theory, Inference, and Learning Algorithms)
What I would like to do in the following is to show you, using a very simple example, what the typical workflow in probabilistic programming and model-based machine learning looks like. (By "model-based machine learning" I am referring to the definition introduced in Winn and Bishop’s book, Model-Based Machine Learning, which provides a new view on probabilistic programming and Bayesian methods within the broader machine learning field.)
I will guide you through describing our problem by defining a model, and through combining that model with data to draw conclusions about the real world.
We can use the following graphic as the guiding thread throughout the example. (The graphic is very much inspired by Figure 1.2 in Jan-Willem van de Meent, Brooks Paige, Hongseok Yang and Frank Wood (2018), An Introduction to Probabilistic Programming.) But before we discuss the graphic, let’s start with the first step in any problem.
The very first step is to define the problem you would like to solve. While this is very easy in the example I will present, it can be much more complicated. Yet, it is worth spending time on, because if you get this step right, it will make the following ones much easier.
Suppose you’re a YouTuber, and you’re interested in the viewing behavior of your audience. Maybe you’re an educational YouTuber and upload lectures that are quite long. In particular, the videos are so long that you notice that no viewer makes it to the end of the video. (Yes, yes, I’m only assuming this because I don’t want to talk about censoring and proper survival analysis. You wouldn’t want to skip this when you define an actual problem.) You decide to produce shorter videos in the future. “What’s a good length for my future videos?”, you ask yourself and open YouTube’s backend. (This is a fictional example; I don’t know what data YouTube actually offers its creators.)
For every viewer \(i\), you find the time \(t_i\) until they quit the video:
| Viewer   | View time |
|----------|-----------|
| Viewer 1 | 10.5 min  |
| Viewer 2 | 14.2 min  |
| Viewer 3 | 38.0 min  |
| ...      | ...       |
| Viewer N | 2.8 min   |
We see that we have positive and continuous data, and one observation for every viewer. (Let’s skip the possibility of returning viewers, different videos, etc. for this example.) Knowing that we have this data available, we would like to answer questions such as: How long should I expect a new viewer to watch? How likely is it that a viewer watches one of my videos for longer than 10 minutes?
Having defined the problem setting and the available data, we are able to start the actual workflow.
We start off by following the column in the downward direction, which represents the so-called data generating process: knowing what our data will look like, we can think about what process might have led to the data. Specifically, what could our model and the model’s parameters look like?
Let me go into more detail on how this works.
From our problem setting we know that our viewers’ view times are positive and continuous, and we assume that they all quit the video before it ends. To model this data, we need a probability distribution that can generate such data. One distribution that fulfills these assumptions is the Exponential distribution.
The Exponential distribution is defined by a single parameter, \(\mu\). \(\mu\) defines what data are more likely or less likely to be generated. A larger value of \(\mu\) tends to generate longer view times \(t_i\), while a smaller value of \(\mu\) implies shorter view times \(t_i\).
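To make this concrete, here is a minimal sketch (not taken from the original post; the values of \(\mu\) below are made up) of how view times could be simulated from an Exponential distribution. A larger \(\mu\) yields longer view times on average:

```python
# Simulate hypothetical view times from an Exponential distribution with mean mu.
import numpy as np

rng = np.random.default_rng(seed=0)

for mu in [3.0, 7.0]:  # mu = expected view time in minutes (made-up values)
    view_times = rng.exponential(scale=mu, size=10_000)  # scale equals the mean mu
    print(f"mu = {mu} min: average simulated view time = {view_times.mean():.2f} min")
```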
What we need to do in this step is to provide an initial idea of what a reasonable value for \(\mu\) might be. This in turn defines what data \(t_i\) we expect to be reasonable.
Importantly, this step happens before we look at our data.
The following plot shows so-called survival curves for different values of \(\mu\). The survival curve is one way of looking at the Exponential distribution and the data that it generates. For every view time \(t\), it displays the probability that a viewer watches the video for longer than this time \(t\). For example, this particular survival curve implies that less than 10% of the viewers stay for more than 10 minutes. The Exponential distribution’s parameter \(\mu\) can be interpreted as the expected view time.
So if we think that 7 minutes is a reasonable view time to expect from our viewers, then we could for example choose \(\mu = 7\). This would fix the Exponential distribution and corresponding survival curve to the following one:
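For reference, the survival curve of an Exponential distribution with expected view time \(\mu\) has a simple closed form (this is standard for the Exponential distribution, not something specific to the original plots):

\[ S(t) = P(T > t) = e^{-t/\mu}. \]

With \(\mu = 7\), for example, the probability of watching for longer than 10 minutes would be \(e^{-10/7} \approx 0.24\).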
At this point, we have defined that this single value of \(\mu\) is reasonable.
But we really think that other values of \(\mu\) are also possible, not just this single one. Otherwise we wouldn’t need to look at our data.
Thus, we add another probability distribution to our model.
While the first probability distribution describes what data we expect, the second probability distribution describes what values for the parameter we think are possible. Here, we choose a Gamma distribution (as it is the conjugate prior, but don’t you worry about this now) to model it.
The impact this has is that we now have uncertainty over the possible values of the \(\mu\) parameter and the corresponding survival curves. This is indicated by the shaded area in the plot below: given the Gamma distribution, it displays the 90% most likely survival functions. Additionally, we show realizations of survival curves according to how likely they are given the Gamma distribution.
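As a rough illustration in code, the following sketch draws plausible values of \(\mu\) from a Gamma prior and looks at the survival probabilities they imply at the 10-minute mark (the hyperparameters are hypothetical, not the ones behind the plots):

```python
# Prior uncertainty over mu translates into uncertainty over survival curves.
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical Gamma prior over the expected view time mu, centred around 7 minutes.
alpha, beta = 2.0, 2.0 / 7.0                       # shape and rate; prior mean = alpha / beta = 7
mu_draws = rng.gamma(shape=alpha, scale=1.0 / beta, size=1_000)

# Each draw of mu corresponds to one survival curve S(t) = exp(-t / mu).
surv_at_10 = np.exp(-10.0 / mu_draws)
lo, hi = np.percentile(surv_at_10, [5, 95])
print(f"90% prior interval for P(view time > 10 min): [{lo:.2f}, {hi:.2f}]")
```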
At this point, we have finished the first step: We have defined the initial models that might have generated our data—before we have seen our data.
Consider again the graphic we started with in the beginning.
So far, we moved down from parameters to observations and thereby defined the data generating process. Now, we want to draw conclusions about the real world. To do so, we move back up: from observations to parameters, which we call inference.
We now observe actual data (i.e., look at the numbers in our hypothetical YouTube dashboard). We combine the model with the data and let the model and the model parameters adjust to the observations. After this adjustment, the parameters of the model can tell us something about the state of the real world, and what data we can expect to observe in the future.
The fancy word inference simply means to draw conclusions from data.
So what exactly happens to our model when we observe data?
Our model at zero observations and after one observation.
In the beginning, when we have zero observations, we have the same plot as before—simply our initial view of the world.
As soon as we start to add data, we see an adjustment of the model towards the data. In this case, since the observation is large (larger than the initial \(\mu\)), the inferred parameter and survival functions move towards longer view times. The observation is indicated by the small line at the bottom of the plot.
Note that, because of our initial ideas and state of the model, the model does not jump exactly on top of the data; instead, it moves towards the data, but stops somewhere between the observation and our initial guess.
As we add more and more observations, the model moves towards shorter view times since the observations are mostly smaller than our initial idea. Consequently, our model adjusts downward to the data. Note how it moves closer and closer to the data and “forgets” the initial configuration as we add more observations.
Note also how the model becomes more and more certain about what a reasonable survival curve and parameter value can be. As we add more observations, the shaded area shrinks since fewer and fewer survival curves are consistent with the data. We draw conclusions about which parameters and survival curves are still feasible after observing this particular set of data.
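For readers curious what this adjustment can look like under the hood, here is a small sketch of one standard way to compute it (an assumption on my part: the conjugate Gamma prior is placed on the rate \(1/\mu\), the textbook setup for an Exponential likelihood, and both the data and the hyperparameters are simulated rather than taken from the post). The 90% interval for the expected view time shrinks as observations are added:

```python
# Conjugate updating: Gamma prior on the rate 1/mu, Exponential likelihood.
import numpy as np

rng = np.random.default_rng(seed=2)
data = rng.exponential(scale=5.0, size=100)   # hypothetical view times, true mean 5 min

alpha0, beta0 = 3.0, 14.0   # hypothetical prior; implied prior mean for mu = beta0 / (alpha0 - 1) = 7

for n in [0, 1, 10, 100]:
    alpha_n = alpha0 + n                 # posterior shape after n observations
    beta_n = beta0 + data[:n].sum()      # posterior rate after n observations
    rate_draws = rng.gamma(shape=alpha_n, scale=1.0 / beta_n, size=10_000)
    mu_draws = 1.0 / rate_draws          # implied expected view times
    lo, hi = np.percentile(mu_draws, [5, 95])
    print(f"n = {n:3d}: 90% interval for the expected view time: [{lo:.1f}, {hi:.1f}] min")
```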
After 100 observations, we are quite certain that reasonable values for the expected view time (the adjusted \(\mu\) parameter) are mostly around 5.
This means that our model would now–after seeing the data–expect on average a view time of 5 minutes for a new viewer. To answer our initial question, we can read from the expected survival curve that a viewer watches our videos longer than 10 minutes with a probability of about 12%.
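Continuing the same hypothetical conjugate setup, the two numbers above could be read off the posterior roughly like this (with simulated data, so the printed values will only be in the same ballpark as the post’s):

```python
# Read summary quantities off the posterior of the hypothetical conjugate model.
import numpy as np

rng = np.random.default_rng(seed=3)
data = rng.exponential(scale=5.0, size=100)   # hypothetical view times, true mean 5 min

alpha_n = 3.0 + len(data)       # posterior shape (same made-up prior as in the sketch above)
beta_n = 14.0 + data.sum()      # posterior rate

# Posterior mean of the expected view time mu = 1/rate (mean of an Inverse-Gamma).
print(f"posterior expected view time: {beta_n / (alpha_n - 1):.1f} min")

# Posterior predictive probability that a new viewer watches for longer than 10 minutes.
print(f"P(view time > 10 min): {(beta_n / (beta_n + 10.0)) ** alpha_n:.2f}")
```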
With this example, I tried to convey why probabilistic programming and model-based machine learning provide solutions to common challenges in machine learning. Some of you value knowing how a machine learning system arrives at its decisions. Here, thanks to defining the data generating process in Step 1, we know exactly how the model sees the world and makes decisions. You might also value knowing how confident a machine learning system is in its decisions. As we’ve seen throughout Step 2, our model quantifies at every point in time how likely certain parameter ranges and corresponding predictions are.
Did you notice that I didn’t explain how the model adjusts to the data during inference? This abstraction is at the heart of probabilistic programming and model-based machine learning: we define our problem, the data generating process, and the corresponding model; afterwards, a generic, automatic inference method can be applied to adjust the model to the data, without the user having to worry too much about the inference method. (See J. Winn and C.M. Bishop (2018), Model-Based Machine Learning.)
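To give a flavor of what this abstraction looks like in practice, here is a minimal sketch of the same model written in PyMC3, one of the probabilistic programming libraries listed below (the data and prior hyperparameters are hypothetical, and the exact API may vary between versions). The model definition mirrors the data generating process; a single call to a generic sampler performs the inference:

```python
import numpy as np
import pymc3 as pm

# Hypothetical view times in minutes, standing in for the YouTube dashboard data.
view_times = np.array([10.5, 14.2, 38.0, 2.8])

with pm.Model():
    # Prior over the expected view time mu (made-up hyperparameters, prior mean 7 min).
    mu = pm.Gamma("mu", alpha=2.0, beta=2.0 / 7.0)
    # Likelihood: Exponential view times with mean mu (PyMC3 parameterizes by the rate).
    pm.Exponential("t", lam=1.0 / mu, observed=view_times)
    # Generic, automatic inference: we never specify how the model adjusts to the data.
    trace = pm.sample(2000, tune=1000)

print("posterior mean of the expected view time:", trace["mu"].mean())
```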
I hope that this small example was able to illustrate the basics of the probabilistic programming workflow and makes you eager to learn more about it.
Any questions or comments? Find me on Twitter.
I would like to thank Eren M. Elçi for suggesting the initial idea, his valuable feedback, and for providing the crucial links to Van de Meent et al.’s summary of probabilistic programming and Winn and Bishop’s description of model-based machine learning.
M. Betancourt (2018). Towards A Principled Bayesian Workflow (RStan).
B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li and A. Riddell (2017). Stan: A Probabilistic Programming Language. Journal of Statistical Software, Articles 76 1–32.
C. Davidson-Pilon (2015). Probabilistic Programming & Bayesian Methods for Hackers.
J. K. Kruschke (2012). Graphical model diagrams in Doing Bayesian Data Analysis versus traditional convention.
D. J.C. MacKay (2003). Information Theory, Inference, and Learning Algorithms.
R. McElreath (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
J.-W. van de Meent, B. Paige, H. Yang and F. Wood (2018). An Introduction to Probabilistic Programming.
E. Meijer (2017). Making Money Using Math. Modern applications are increasingly using probabilistic machine-learned models. ACM Queue Vol. 15.
T. Minka, J. Winn, J. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill (2018). Infer.NET 0.3. Microsoft Research Cambridge.
J. Salvatier, T.V. Wiecki and C. Fonnesbeck (2016). Probabilistic Programming in Python using PyMC3. PeerJ Computer Science 2:e55.
J. Winn and C.M. Bishop (2018). Model-Based Machine Learning.