The Probabilistic Programming Workflow

Intuition and Essentials of Model-Based Machine Learning

Tim Radtke

2019-03-21

Last week, I gave a presentation about the concept of and intuition behind probabilistic programming and model-based machine learning in front of a general audience. The following are my extended notes.

The posterior probability distribution represents the unique and complete solution to the problem. There is no need to invent ‘estimators’; nor do we need to invent criteria for comparing alternative estimators with each other. Whereas orthodox statisticians offer twenty ways of solving a problem, and another twenty different criteria for deciding which of these solutions is the best, Bayesian statistics only offers one answer to a well-posed problem. (David J.C. MacKay, Information Theory, Inference, and Learning Algorithms)

What I would like to do in the following is to show you, using a very simple example, what the typical workflow in probabilistic programming and model-based machine learning looks like. (By model-based machine learning I am referring to the definition introduced in Winn and Bishop’s book, Model-Based Machine Learning, which provides a new view on probabilistic programming and Bayesian methods within the broader machine learning field.)

I will guide you through describing our problem by defining a model, and through combining that model with data to draw conclusions about the real world.

We can use the following graphic as the guiding thread throughout the example. (The graphic is very much inspired by Figure 1.2 in Jan-Willem van de Meent, Brooks Paige, Hongseok Yang and Frank Wood (2018), An Introduction to Probabilistic Programming.) But before we discuss the graphic, let’s start with the first step in any problem.

Step 0: Defining the Problem

The very first step is to define the problem you would like to solve. While this is very easy in the example I will present, it can be much more complicated. Yet, it is worth spending time on, because if you get this step right, it will make the following ones much easier.

Consider you’re a YouTuber, and you’re interested in the viewing behavior of your audience. Maybe you’re an educational YouTuber and upload lectures that are quite long. In particular, the videos are so long that you notice that no viewer makes it to the end of a video. (Yes, yes, I’m only doing this because I don’t want to talk about censoring and proper survival analysis. You wouldn’t want to skip this when you define an actual problem.) You decide to produce shorter videos in the future. “What’s a good length for my future videos?”, you ask yourself and open YouTube’s backend. (This is a fictional example. I don’t know what data YouTube offers its creators.)

For every viewer \(i\), you find the view time \(t_i\) at which they quit the video:

    Viewer   | Viewtime 
    --------------------
    Viewer 1 | 10.5 min
    Viewer 2 | 14.2 min
    Viewer 3 | 38.0 min
      ...    |  ...
    Viewer N |  2.8 min

We see that we have positive and continuous data, and one observation for every viewer. (Let’s skip the possibility of returning viewers, different videos, etc. for this example.) Knowing that we have this data available, we would like to answer questions such as: What view time can we expect from a viewer? And how likely is it that a viewer watches for longer than, say, 10 minutes?

Interlude

Having defined the problem setting and the available data, we are able to start the actual workflow.

We start off by following the column in the downward direction, which represents the so-called data generating process: knowing what our data will look like, we can think about what process might have led to the data. Specifically, what could our model and the model’s parameters look like?

Let me go into more detail on how this works.

Step 1: The Data Generating Process

Adding the First Distribution

From our problem setting we know that our viewers’ view times are positive and continuous, and we assume that they all quit the video before it ends. To model this data, we need a probability distribution that can generate such data. One distribution that fulfills these assumptions is the Exponential distribution.

The Exponential distribution is defined by a single parameter, \(\mu\), which determines what data are more or less likely to be generated: a larger value of \(\mu\) tends to generate longer view times \(t_i\), while a smaller value of \(\mu\) implies shorter view times \(t_i\).
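
To make this a little more tangible, here is a minimal sketch in Python of what this assumption implies (the values of \(\mu\) are purely hypothetical; numpy’s scale argument is the mean of the Exponential distribution):

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical expected view times (in minutes) to compare.
    for mu in [2.0, 7.0, 15.0]:
        # Exponential distribution with mean mu; numpy's `scale` is the mean.
        simulated = rng.exponential(scale=mu, size=5)
        print(f"mu = {mu:4.1f} min -> simulated view times: {np.round(simulated, 1)}")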

What we need to do in this step is to provide an initial idea of what a reasonable value for \(\mu\) might be. This in turn defines what data \(t_i\) we expect to be reasonable.

Importantly, this step happens before we look at our data.

The following plot shows a so-called survival curve for different values of \(\mu\). The survival curve is one way of looking at the Exponential distribution and the data that it generates. For every view time \(t\), it displays the probability that a viewer watches the video for longer than this time \(t\). For example, this particular survival curve implies that less than 10% of the viewers stay for more than 10 minutes. The Exponential distribution’s parameter \(\mu\) can be interpreted as the expected view time.
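
For an Exponential distribution with mean \(\mu\), the survival curve has the simple closed form \(S(t) = P(T > t) = \exp(-t/\mu)\). A small sketch of how such probabilities can be evaluated (the values of \(\mu\) below are examples, not the ones behind the plot):

    import numpy as np

    def survival(t, mu):
        """P(view time > t) under an Exponential distribution with mean mu (in minutes)."""
        return np.exp(-t / mu)

    # Probability of watching for longer than 10 minutes, for a few example values of mu.
    for mu in [4.0, 7.0, 10.0]:
        print(f"mu = {mu:4.1f} min -> P(T > 10 min) = {survival(10.0, mu):.2f}")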

So if we think that 7 minutes is a reasonable view time to expect from our viewers, then we could for example choose \(\mu = 7\). This would fix the Exponential distribution and corresponding survival curve to the following one:

The survival curve we choose as reasonable for our data generating process. The expected view time is equal to seven minutes.

At this point, we have defined that this single value of \(\mu\) is reasonable.

But we really think that other values of \(\mu\) are also possible, not just this single one. Otherwise we wouldn’t need to look at our data.

Thus, we add another probability distribution to our model.

Adding the Second Distribution

While the first probability distribution describes what data we expect, the second probability distribution describes what values of the parameter we think are possible. Here, we choose a Gamma distribution to model it (it is the conjugate prior, but don’t worry about this for now).

The effect is that we now have uncertainty over the possible values of \(\mu\) and the corresponding survival curves. This is indicated by the shaded area in the plot below: given the Gamma distribution, it displays the 90% most likely survival functions. Additionally, we show individual survival curves drawn according to how likely they are under the Gamma distribution.
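
As a minimal sketch of this second step, assuming a hypothetical Gamma prior over \(\mu\) with a mean of seven minutes (the shape and scale values are illustrative, not the ones used for the plots):

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical Gamma prior over the expected view time mu (mean = shape * scale = 7 minutes).
    mu_draws = rng.gamma(shape=3.5, scale=2.0, size=10_000)

    # Each prior draw of mu implies its own survival curve; evaluate them at t = 10 minutes.
    p_longer_than_10 = np.exp(-10.0 / mu_draws)

    # The 90% most likely values at this point of the survival curve (the shaded band in the plot).
    lower, upper = np.quantile(p_longer_than_10, [0.05, 0.95])
    print(f"Prior 90% band for P(T > 10 min): [{lower:.2f}, {upper:.2f}]")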

At this point, we have finished the first step: We have defined the initial models that might have generated our data—before we have seen our data.

Interlude

Consider again the graphic we started with in the beginning.

So far, we moved down from parameters to observations and thereby defined the data generating process. Now, we want to draw conclusions about the real world. To do so, we move back up, from observations to parameters; this step is called inference.

Step 2: Observing Data and Inference

We now observe actual data (i.e., look at the numbers in our hypothetical YouTube dashboard). We combine the model with the data and let the model and the model parameters adjust to the observations. After this adjustment, the parameters of the model can tell us something about the state of the real world, and what data we can expect to observe in the future.

The fancy word inference simply means to draw conclusions from data.
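
To make this concrete, here is a minimal sketch of how such a model could be written down and combined with data in PyMC3, one of the probabilistic programming libraries listed in the references. The data array and the prior parameters are hypothetical, and PyMC3 parameterizes the Exponential distribution by its rate, hence lam = 1/mu:

    import numpy as np
    import pymc3 as pm

    # Hypothetical view times in minutes, standing in for the dashboard data.
    view_times = np.array([10.5, 14.2, 38.0, 2.8, 6.1, 4.4, 3.9, 7.2])

    with pm.Model():
        # Step 1: the data generating process.
        # Gamma prior over the expected view time mu (prior mean alpha / beta = 7 minutes).
        mu = pm.Gamma("mu", alpha=3.5, beta=0.5)
        # Exponential view times, written via the rate 1 / mu.
        pm.Exponential("t", lam=1.0 / mu, observed=view_times)

        # Step 2: generic, automatic inference over mu given the observations.
        trace = pm.sample(2000, tune=1000)

    print("Posterior mean of mu:", trace["mu"].mean())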

So what exactly happens to our model when we observe data?

Our model at zero observations and after one observation.

In the beginning, when we have zero observations, we have the same plot as before—simply our initial view of the world.

As soon as we start to add data, we see an adjustment of the model towards the data. In this case, since the observation is large (larger than our initial \(\mu\)), the inferred parameter and survival functions move towards longer view times. The observation is indicated by the small line at the bottom of the plot.

Note that, because of our initial ideas and state of the model, the model does not jump exactly on top of the data; instead, it moves towards the data, but stops somewhere between the observation and our initial guess.

As we add more and more observations, the model moves towards shorter view times since the observations are mostly smaller than our initial idea. Consequently, our model adjusts downward to the data. Note how it moves closer and closer to the data and “forgets” the initial configuration as we add more observations.

Note also how the model becomes more and more certain about what a reasonable survival curve and parameter value can be. As we add more observations, the shaded area shrinks, since fewer and fewer survival curves are consistent with the data. We draw conclusions about which parameters and survival curves are still plausible after observing this particular set of data.
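
This shrinking can also be traced numerically. For this particular combination of distributions, the adjustment happens to be available in closed form (that is what makes the Gamma prior conjugate); the following is a minimal sketch, assuming the formulation in which the Gamma prior is placed on the rate \(1/\mu\), with hypothetical prior values and simulated data:

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical data: 100 view times drawn from an Exponential with a true mean of 5 minutes.
    data = rng.exponential(scale=5.0, size=100)

    # Hypothetical Gamma(alpha0, beta0) prior on the rate 1/mu;
    # the prior mean of the rate is alpha0 / beta0 = 1/7.
    alpha0, beta0 = 3.5, 24.5

    for n in [0, 1, 10, 100]:
        # Conjugate update after observing the first n view times.
        alpha_n = alpha0 + n
        beta_n = beta0 + data[:n].sum()
        # Sample rates from the updated Gamma and translate them into expected view times mu = 1/rate.
        mu_samples = 1.0 / rng.gamma(shape=alpha_n, scale=1.0 / beta_n, size=10_000)
        low, high = np.quantile(mu_samples, [0.05, 0.95])
        print(f"n = {n:3d}: 90% interval for mu = [{low:4.1f}, {high:4.1f}] minutes")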

After 100 observations, we are quite certain that reasonable values for the expected view time (the adjusted \(\mu\) parameter) lie mostly around 5 minutes.

This means that our model would now, after seeing the data, expect a view time of about 5 minutes on average for a new viewer. To answer our initial question, we can read from the expected survival curve that a viewer watches our videos for longer than 10 minutes with a probability of about 12%.
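
As a quick plug-in check of that last number, using only the rounded posterior mean of five minutes (the exact figure depends on the full posterior, which is not reproduced here):

    import math

    mu_hat = 5.0  # posterior expected view time in minutes, as read off above
    t = 10.0      # the view time the initial question asks about

    # Survival probability of the Exponential distribution evaluated at the posterior mean.
    print(f"P(view time > {t:.0f} min) ~ {math.exp(-t / mu_hat):.2f}")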

Our model after its adjustment to 100 observations. The range of reasonable survival curves has shrunk towards one with an expected view time of about five minutes.

Wrapping Up

With this example, I tried to convey why probabilistic programming and model-based machine learning provide solutions to common challenges in machine learning. Some of you value knowing how a machine learning system arrives at its decisions. Here, thanks to defining the data generating process in Step 1, we know exactly how the model sees the world and makes decisions. You might also value knowing how confident a machine learning system is in its decisions. As we’ve seen throughout Step 2, our model quantifies at every point in time how likely certain parameter ranges and corresponding predictions are.

Did you notice that I didn’t explain how the model adjusts to the data during inference? This abstraction is at the heart of probabilistic programming and model-based machine learning: we define our problem, the data generating process and the corresponding model; afterwards, a generic, automatic inference method can be applied to adjust the model to the data. But the user shouldn’t have to worry too much about the inference method (J. Winn and C.M. Bishop, 2018, Model-Based Machine Learning).

I hope that this small example was able to illustrate the basics of the probabilistic programming workflow and makes you eager to learn more about it.

Any questions or comments? Find me on Twitter.

Acknowledgements

I would like to thank Eren M. Elçi for suggesting the initial idea, his valuable feedback, and for providing the crucial links to Van de Meent et al.’s summary of probabilistic programming and Winn and Bishop’s description of model-based machine learning.

References and Continued Reading

M. Betancourt (2018). Towards A Principled Bayesian Workflow (RStan).

B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li and A. Riddell (2017). Stan: A Probabilistic Programming Language. Journal of Statistical Software, Articles 76 1–32.

C. Davidson-Pilon (2015). Probabilistic Programming & Bayesian Methods for Hackers.

J. K. Kruschke (2012). Graphical model diagrams in Doing Bayesian Data Analysis versus traditional convention.

D.J.C. MacKay (2003). Information Theory, Inference, and Learning Algorithms.

R. McElreath (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.

J.-W. van de Meent, B. Paige, H. Yang and F. Wood (2018). An Introduction to Probabilistic Programming.

E. Meijer (2017). Making Money Using Math. Modern applications are increasingly using probabilistic machine-learned models. ACM Queue Vol. 15.

T. Minka, J. Winn, J. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill (2018). Infer.NET 0.3. Microsoft Research Cambridge.

J. Salvatier, T.V. Wiecki and C. Fonnesbeck (2016). Probabilistic Programming in Python using PyMC3. PeerJ Computer Science 2:e55.

TensorFlow Probability.

J. Winn and C.M. Bishop (2018). Model-Based Machine Learning.