The Probabilistic Programming Workflow

Intuition and Essentials of Model-Based Machine Learning

Tim Radtke

2019-03-21

Last week, I gave a presentation about the concept of and intuition behind probabilistic programming and model-based machine learning in front of a general audience. The following are my extended notes.

The posterior probability distribution represents the unique and complete solution to the problem. There is no need to invent ‘estimators’; nor do we need to invent criteria for comparing alternative estimators with each other. Whereas orthodox statisticians offer twenty ways of solving a problem, and another twenty different criteria for deciding which of these solutions is the best, Bayesian statistics only offers one answer to a well-posed problem.

What I would like to do in the following is to show you, using a very simple example, what the typical workflow in probabilistic programming and model-based machine learning looks like. (By model-based machine learning, I am referring to the definition introduced in Winn and Bishop’s book, Model-Based Machine Learning, which provides a new view on probabilistic programming and Bayesian methods within the broader machine learning field.)

I will guide you through describing our problem by defining a model, and show how we can combine that model with data to draw conclusions about the real world.

We can use the following graphic as the guiding thread throughout the example. (The graphic is very much inspired by Figure 1.2 in Jan-Willem van de Meent, Brooks Paige, Hongseok Yang, and Frank Wood (2018), An Introduction to Probabilistic Programming.) But before we discuss the graphic, let’s start with the first step in any problem.

Step 0: Defining the Problem

The very first step is to define the problem you would like to solve. While this is very easy in the example I will present, it can be much more complicated. Yet, it is worth spending time on, because if you get this step right, it will make the following ones much easier.

Suppose you’re a YouTuber, and you’re interested in the viewing behavior of your audience. Maybe you’re an educational YouTuber and upload lectures that are quite long. In particular, the videos are so long that you notice that no viewer makes it to the end of the video. (Yes, yes, I’m only doing this because I don’t want to talk about censoring and proper survival analysis. You wouldn’t want to skip this when you define an actual problem.) You decide to produce shorter videos in the future. “What’s a good length for my future videos?”, you ask yourself and open YouTube’s backend. (This is a fictional example; I don’t know what data YouTube offers its creators.)

For every viewer \(i\), you find the time \(t_i\) at which they quit the video:

    Viewer   | Viewtime 
    --------------------
    Viewer 1 | 10.5 min
    Viewer 2 | 14.2 min
    Viewer 3 | 38.0 min
      ...    |  ...
    Viewer N |  2.8 min

We see that we have positive and continuous data, and one observation for every viewer. (Let’s skip the possibility of returning viewers, different videos, etc. for this example.) Knowing that we have this data available, we would like to answer questions such as the one above: what would be a good length for future videos?

Interlude

Having defined the problem setting and the available data, we are able to start the actual workflow.

We start off by following the column in the downward direction, which represents the so-called data generating process: knowing what our data will look like, we can think about what process might have led to the data. Specifically, what could our model and the model’s parameters look like?

Let me go into more detail on how this works.

Step 1: The Data Generating Process

Adding the First Distribution

From our problem setting we know that our viewers’ view times are positive and continuous, and we assume that every viewer quits the video before it ends. To model this data, we need a probability distribution that can generate such data. One distribution that fulfills these assumptions is the Exponential distribution.

The Exponential distribution is defined by a single parameter, \(\mu\), which determines what data are more or less likely to be generated. A larger value of \(\mu\) tends to generate longer view times \(t_i\), while a smaller value of \(\mu\) implies shorter view times \(t_i\).

What we need to do in this step is to provide an initial idea of what a reasonable value for \(\mu\) might be. This in turn defines what data \(t_i\) we expect to be reasonable.
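To get a feel for whether a candidate value of \(\mu\) is reasonable, we can simulate view times from the Exponential distribution and check whether they look plausible. Here is a minimal sketch in Python; the use of NumPy, the seed, and the candidate value of seven minutes are my own illustrative choices, not part of the original example.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Candidate expected view time in minutes (our initial guess).
    mu = 7.0

    # NumPy's exponential takes the mean ("scale"), which matches the
    # interpretation of mu as the expected view time.
    simulated_view_times = rng.exponential(scale=mu, size=1000)

    print(simulated_view_times[:5])            # a few simulated viewers
    print(simulated_view_times.mean())         # should be close to mu
    print((simulated_view_times > 10).mean())  # share watching longer than 10 minutes

If the simulated view times look absurd, that is a sign that the candidate \(\mu\) is not a reasonable initial idea.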

Importantly, this step happens before we look at our data.

The following plot shows a so-called survival curve for different values of \(\mu\). The survival curve is one way of looking at the Exponential distribution and the data that it generates. For every view time \(t\), it displays the probability that a viewer watches the video for longer than this time \(t\). For example, this particular survival curve implies that less than 10% of the viewers stay for more than 10 minutes. The Exponential distribution’s parameter \(\mu\) can be interpreted as the expected view time.
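For reference, the survival curve of the Exponential distribution can be written down explicitly. Assuming the parameterization in which \(\mu\) is the expected view time (consistent with the interpretation above), the probability of watching longer than \(t\) minutes is

\[
S(t) = P(T > t) = \exp(-t / \mu).
\]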

So if we think that 7 minutes is a reasonable view time to expect from our viewers, then we could for example choose \(\mu = 7\). This would fix the Exponential distribution and corresponding survival curve to the following one:

The survival curve we choose as reasonable for our data generating process. The expected view time is equal to seven minutes.

At this point, we have defined that this single value of \(\mu\) is reasonable.
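Plugging \(\mu = 7\) into the survival function from above gives a feel for what this choice implies (these numbers are my own back-of-the-envelope calculation):

\[
P(T > 10) = e^{-10/7} \approx 0.24, \quad P(T > 20) = e^{-20/7} \approx 0.06, \quad P(T > 30) = e^{-30/7} \approx 0.01.
\]

Under this guess, roughly a quarter of viewers would watch for longer than ten minutes, and almost nobody for longer than half an hour.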

But we really think that other values of \(\mu\) are also possible, not just this single one. Otherwise we wouldn’t need to look at our data.

Thus, we add another probability distribution to our model.

Adding the Second Distribution

While the first probability distribution describes what data we expect, the second probability distribution describes what values for the parameter we think are possible. Here, we choose a Gamma distribution (as it is the conjugate prior, but don’t you worry about this now) to model it.
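Written down as a model, one common way to set this up is to place the Gamma prior on the rate \(\lambda = 1/\mu\), which is the parameterization in which the conjugacy mentioned above holds (whether the original plots use this exact parameterization is my assumption):

\[
t_i \mid \lambda \sim \text{Exponential}(\lambda), \qquad \lambda \sim \text{Gamma}(a, b),
\]

where the hyperparameters \(a\) and \(b\) encode which expected view times we consider plausible before seeing any data.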

The impact this has is that we now have uncertainty over the possible values of \(\mu\) and over the corresponding survival curves. This is indicated by the shaded area in the plot below: given the Gamma distribution, it displays the 90% most likely survival functions. Additionally, we show realizations of survival curves according to how likely they are under the Gamma distribution.
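A band like this can be reproduced by sampling parameter values from the Gamma prior and computing the implied survival curves; pointwise quantiles then give a simple version of the shaded area. A minimal sketch, assuming the rate parameterization from above and illustrative hyperparameters \(a = 2\), \(b = 7\) (chosen so that the prior expected view time is around seven minutes):

    import numpy as np

    rng = np.random.default_rng(seed=2)

    # Illustrative hyperparameters of the Gamma prior on the rate lambda = 1/mu.
    a, b = 2.0, 7.0

    # Draw many plausible rates; NumPy's gamma takes shape and scale = 1/rate.
    lambdas = rng.gamma(shape=a, scale=1.0 / b, size=5000)

    # Implied survival curves S(t) = exp(-lambda * t) on a grid of view times.
    t_grid = np.linspace(0, 40, 200)
    survival = np.exp(-np.outer(lambdas, t_grid))  # one row per sampled curve

    # Pointwise 90% interval of the survival curves (the shaded band).
    lower, upper = np.quantile(survival, [0.05, 0.95], axis=0)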

At this point, we have finished the first step: We have defined the initial models that might have generated our data—before we have seen our data.

Interlude

Consider again the graphic we started with in the beginning.

So far, we moved down from parameters to observations and thereby defined the data generating process. Now, we want to draw conclusions about the real world. To do so, we move back up: from observations to parameters, which we call inference.

Step 2: Observing Data and Inference

We now observe actual data (i.e., look at the numbers in our hypothetical YouTube dashboard). We combine the model with the data and let the model and the model parameters adjust to the observations. After this adjustment, the parameters of the model can tell us something about the state of the real world, and what data we can expect to observe in the future.
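In this small example, the adjustment has a closed form thanks to the conjugate Gamma prior. Assuming the rate parameterization from before, observing view times \(t_1, \dots, t_n\) turns the Gamma prior into a Gamma posterior:

\[
p(\lambda \mid t_{1:n}) \propto p(t_{1:n} \mid \lambda)\, p(\lambda), \qquad \lambda \mid t_{1:n} \sim \text{Gamma}\Big(a + n,\; b + \sum_{i=1}^{n} t_i\Big),
\]

where \(a\) and \(b\) are the prior’s hyperparameters and \(n\) is the number of observed viewers. For more complicated models no such closed form exists, which is where probabilistic programming tools and their approximate inference algorithms come in.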

The fancy word inference simply means to draw conclusions from data.

So what exactly happens to our model when we observe data?

Our model at zero observations and after one observation.

In the beginning, when we have zero observations, we have the same plot as before—simply our initial view of the world.

As soon as we start to add data, we see an adjustment of the model towards the data. In this case, since the observation is large (larger than the initially expected \(\mu\)), the inferred parameter and survival functions move towards longer view times. The observation is indicated by the small line at the bottom of the plot.

Note that, because of our initial ideas and state of the model, the model does not jump exactly on top of the data; instead, it moves towards the data, but stops somewhere between the observation and our initial guess.
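To make this concrete, here is a small sketch of the conjugate update after a single large observation. The prior hyperparameters and the observation of 38.0 minutes (viewer 3 in the table above) are illustrative choices; the exact numbers behind the original plots may differ.

    # Illustrative Gamma prior on the rate lambda = 1/mu, implying a prior
    # expected view time of b / (a - 1) = 7 minutes.
    a, b = 2.0, 7.0

    # A single, fairly large observation (viewer 3 in the table above).
    t_obs = 38.0

    # Conjugate update: Gamma(a + n, b + sum of observations) with n = 1.
    a_post, b_post = a + 1.0, b + t_obs

    prior_mean_viewtime = b / (a - 1.0)                 # 7.0 minutes
    posterior_mean_viewtime = b_post / (a_post - 1.0)   # 22.5 minutes

    print(prior_mean_viewtime, posterior_mean_viewtime)

The implied expected view time moves from 7 minutes towards the observation of 38 minutes, but stops at 22.5 minutes, between the initial guess and the data, which is exactly the behavior described above.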