Pointwise Mutual Information for Instacart Product Recommendations

June 17, 2018

Using pointwise mutual information, we create highly efficient “customers who bought this item also bought” style product recommendations for more than 8000 Instacart products. The method can be implemented in a few lines of SQL yet produces high quality product suggestions. Check them out in this Shiny app.

Back in school, I was a big fan of the Detective Conan anime. For whatever reason, one of the episodes stuck with me. In that episode, the protagonists “pick up receipts in a convenience store to guess what the people are buying for dinner.” While this leads them inadvertently to a crime they need to solve, we will rather stick with the idea of finding out which products appear together in customers’ baskets.

Based on the Instacart Online Grocery Shopping dataset released a year ago, we analyze about 3 million orders of about 200,000 Instacart users. Similarly to how the detective boys used bought-together patterns to identify what customers were going to cook that evening, we’re going to find products that are bought together in order to create an effective, yet simple recommendation algorithm. So simple in fact, that the entire analysis could be productionized in plain SQL.

Instacart Data Set

From Wikipedia:

Instacart is an American company that operates as a same-day grocery delivery service. Customers select groceries through a web application from various retailers and delivered by a personal shopper.

The Instacart Online Grocery Shopping Dataset 2017 was made public by Instacart and can be downloaded here and offers a unique ability to try out recommendation algorithms on customer basket data. Then Instacart’s VP Data Science, Jeremy Kun introduced the data set in a Medium post. The dataset contains information on the products contained in about 3 million orders made by 200,000 Instacart customers. It thus lends itself as a testbed for machine learning methods that one would tend to apply at ecommerce companies–in particular those with a large variety of products, large basket sizes and returning customers.

Expected Result

In his blog post, Jeremy Kun highlights how Instacart uses the data for example to sort their Buy It Again listings, or to model the Frequently Bought With recommendations. Here I will restrict myself to recommendations in the style of the latter. The results of the algorithm should be able to run under a “Frequently Bought Together” or “Customers Who Bought This Item Also Bought” headline. Much like Amazon’s famous recommendations, or the ones that Instacart employs itself.

Take for example this Whole Foods pita bread offered on Instacart. Its page features recommendations for hummus and baba ghannouj under an “Often Bought With” headline, offering them as common complements. This style of recommendation is the goal, where we find items that go well together based on past purchases made by all customers. Those recommendations serve as a simple way for customers to fill their baskets with items that increase the value of the items they have already added to their baskets. It also ensures that customers don’t forget to buy items they really ought to buy.

This stands in contrast to “Similar Items” or “Related Items” recommendations that are often found on the same product detail pages. These recommendations usually aim at direct substitutes to the product on the current detail page. On Instacart’s page for Whole Foods Market Organic Whole Wheat Pita Bread, I got served recommendations for a couple of other pita varieties, for example.

Methodology

So how exactly are we going to find the products that are often bought with pita bread? How do we know what customers who bought this item also bought?

The naive approach would be to count the pure item co-occurrence in orders: For every item, count how often it has been in an order with the pita bread, then recommend the item with the highest count. While this might surface a good recommendation from time to time, it will mostly surface bananas and toilet paper. Bananas and toilet paper are examples of a few very common items which appear in a large share of orders without being related to any product in particular. They would dominate any raw co-occurrence count just by their own purchase probability.

To account for this difficulty, we will make use of a simple trick from natural language processing: Pointwise Mutual Information.

Pointwise Mutual Information

Pointwise Mutual Information is a measure of association from information theory and has found a popular application in natural language processing. There, it measures the association between a word and the word’s context, e.g. close words in a sentence (bi-grams, n-grams, etc.). It does so by comparing how often the word and the context appear together against how often they would appear together were they independent events.

Following Wikipedia, we have for the outcomes \(x\) and \(c\) of two discrete random variables \(X\) and \(C\):

\[ pmi(x;c) = \log \frac{p(x,c)}{p(x)p(c)} \]

Here, the numerator describes the joint probability, while the denominator describes the joint probability under independence. Thus, were the two events independent, we would have \(pmi(x;c) = \log(1) = 0\). Consequently, positive PMI values imply positive association between the events (e.g., the word and its context, or between two products).

Similarly, negative PMI values should indicate negative relationships—but it’s generally not as easy to think in terms of words that do not appear together. While it’s easy to come up with co-occuring words (Google & Facebook, Scrum & Agile, Obama & Merkel), I failed to quickly come up with examples for the opposite. It’s not how we’re tuned to think. Also, the necessary corpus to correctly measure the PMI of words that do not appear together is very, very large—because they don’t appear together (see this book chapter by Daniel Jurafsky and James H. Martin).

Implementation

Data Preparation

We’ll now prepare the Instacart data and apply the PMI measure on the observed orders to find products that have been bought together.

To follow along, download the csv files from Instacart. First, we load some libraries and read the csv files. Note that these are the only packages we’ll need. This goes to show that we can do the same analysis in SQL, even though what follows is written in R.

library(dplyr)
library(readr)

# this is on order level
orders <- read_csv("orders.csv")
# this is on product level
products <- read_csv("products.csv")
# this is on order-product level
order_products <- read_csv("order_products__prior.csv")

Note that we only read one of the available order_products tables, since we will not perform an evaluation based on a test set. The orders table, however, contains information on more orders than those contained in order_products, so we slim it down:

# number of all orders
length(unique(orders$order_id))

## [1] 3421083

# number of orders in our subset
length(unique(order_products$order_id))

## [1] 3214874

# we focus on the "prior" evaluation set for now
orders <- orders %>%
  filter(eval_set == "prior") %>%
  select(-eval_set)

In the next few steps, we trim down the set of considered customers and products to only include those for which we have enough observations.

# first, get for every user his number of orders
users <- orders %>%
  group_by(user_id) %>%
  summarize(orders = n())
summary(users$orders)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    9.00   15.59   19.00   99.00

# drop all users who had a single order or a very large number of orders
# (customers who only made one order might have bought "trial baskets")
good_users <- users %>%
  filter(orders <= 50, orders >= 1) %>%
  pull(user_id)
# filter for the corresponding orders
good_orders <- orders %>%
  filter(user_id %in% good_users)

# count for every user the number of different items he bought
product_by_customer_count <- order_products %>%
  inner_join(select(good_orders, order_id, user_id)) %>%
  distinct(user_id, product_id) %>%
  count(product_id)

# A considered product should have been bought by 
# at least 0.1% of the customers
product_threshold <- length(unique(good_orders$user_id)) * 0.001

good_products <- product_by_customer_count %>% 
  filter(n >= product_threshold) %>%
  pull(product_id)

op <- order_products %>%
  select(order_id, product_id) %>%
  inner_join(select(good_orders, order_id, user_id)) %>%
  filter(product_id %in% good_products)

# as last step, exclude all orders with a basket size of 1
op_size_one <- op %>% 
  group_by(order_id) %>%
  summarize(basket_size = n()) %>%
  ungroup %>% 
  filter(basket_size == 1) %>%
  pull(order_id)

op <- op %>% 
  filter(!(order_id %in% op_size_one))

After this initial data cleaning, let’s see how many orders, users, and products we are dealing here:

length(unique(op$order_id))

## [1] 2315386

length(unique(op$user_id))

## [1] 194760

length(unique(op$product_id))

## [1] 8979

So after dropping some customers and orders, we are left with about 200k users who bought about 9000 different products across 2.3 million orders. That should more than suffice to compute some PMI values.

Pointwise Mutual Information for Instacart Products

To compute the PMI value for every product, we first of all need to count how often products appear together. For the dataset at hand, the following expansion of the order_products table works fine (you should have quite some RAM though…); for every order, we join every product against every product in the order. This makes our table much longer, so depending on the average basket size and number of orders in another dataset, it might be a prohibitively expensive computation. We then immediately count how often products appear together:

op_pp <- inner_join(op, op,
                    by = c(order_id = "order_id")) %>%
  count(product_id.x, product_id.y)

dim(op_pp)

## [1] 31708163        3

Next, we need to count how often every product appears in orders, generally. This is used to compute the probabilities \(p(x)\) and \(p(c)\). We add these counts to the co-occurrence counts, and add the number of total orders in the total_n column. At this point we have all ingredients to compute the empirical probabilities and the PMIs.

product_count_train <- op %>%
  count(product_id)

pp_common_count <- op_pp %>%
  inner_join(product_count_train, by = c(product_id.x = "product_id")) %>%
  inner_join(product_count_train, by = c(product_id.y = "product_id")) %>%
  rename(common_n = n.x, x_n = n.y, y_n = n) %>%
  # total number of orders considered
  mutate(total_n = length(unique(op$order_id)))

Computing the PMI is now as simple as dividing columns and taking the logarithm. We add the corresponding product names to analyze the results afterwards.

pp_pmi <- pp_common_count %>%
  mutate(common_freq = log(common_n / total_n),
         x_freq = log(x_n / total_n),
         y_freq = log((y_n / total_n)),
         pmi = common_freq - x_freq - y_freq)

pp_rec <- pp_pmi %>%
  select(product_id.x, product_id.y, total_n, common_n, x_n, y_n, pmi) %>%
  left_join(select(products, product_id, product_name), 
            by = c(product_id.x = "product_id")) %>%
  left_join(select(products, product_id, product_name), 
            by = c(product_id.y = "product_id"))

Detailed Look at Recommendations

Given that we have not exactly fitted a model here, it’s not clear how to evaluate the results. We’re not explicitly optimizing for anything, so the following evaluation will be restricted to looking at some recommendations and judging whether the recommendations “make sense”.¹

Given the large sortiment, I had to pick some products at random to evaluate the recommendations. Also, I had to pick products that I actually know–I’m not living in the U.S., so what is Glacier Freeze Frost?

For a start, let’s act as if we are about to add Spicy Avocado Hummus to the cart. What could I buy with hummus? Apparently a lot of other hummus, yogurt, as well as crackers or chips:

pp_rec %>% 
  filter(product_id.x == 5973, product_id.x != product_id.y) %>%
  arrange(-pmi) %>%
  top_n(10, pmi) %>%
  select(product_name.x, product_name.y, pmi, common_n, x_n, y_n) %>%
  knitr::kable(digits = 2)

product_name.x	product_name.y	pmi	common_n	x_n	y_n
Spicy Avocado Hummus	Organic Jalapeno Cilantro Hummus	4.70	118	2541	982
Spicy Avocado Hummus	Organic Kale Pesto Hummus	4.47	97	2541	1015
Spicy Avocado Hummus	Organic Thai Coconut Curry Hummus	4.32	78	2541	945
Spicy Avocado Hummus	Organic Sriracha Hummus	4.31	144	2541	1762
Spicy Avocado Hummus	Hummus, Hope, Original Recipe	3.71	206	2541	4572
Spicy Avocado Hummus	Total 2% Lowfat Greek Yogurt with Honey	3.12	12	2541	481
Spicy Avocado Hummus	Organic Jalapeno Crackers	3.11	12	2541	489
Spicy Avocado Hummus	Chomperz Original Crunchy Seaweed Chips	3.09	17	2541	707
Spicy Avocado Hummus	Teriyaki Turkey Jerky	3.06	11	2541	472
Spicy Avocado Hummus	Soft Toothbrush	3.03	9	2541	397

Observe how we don’t have the Hummus, Hope, Original Recipe as the top recommended product even though the avocado hummus was bought most often with it. That is because the PMI takes into account how often the two products appear in orders independently. We see that the Hummus, Hope, Original Recipe is quite popular, which is why the 206 common orders are not as impactful as the 118 orders together with Organic Jalapeno Cilantro Hummus for the PMI. And so we want to rank the jalapeno hummus higher.

Notice also how some recommendations are based on just 9, 11, 12, or 17 common orders. If we think about how many customers we have, 12 orders can be noise. The toothbrush, for example, does not look like a good recommendation. We will address this with a smoothing method in a minute.

If we pick a different hummus, Garlic Hummus, we get very different results. There is no other hummus recommended, and instead the recommendations focus on pita bread. But notice again how the PMI favors products with a small number of common orders.

product_name.x	product_name.y	pmi	common_n	x_n	y_n
Garlic Hummus	Whole Wheat Pita	2.76	23	6893	489
Garlic Hummus	White Pita	2.75	71	6893	1518
Garlic Hummus	Peanut Butter Dark Chocolate Fruit & Nut Protein Bars	2.62	15	6893	368
Garlic Hummus	Gluten Free Black Bean and Quinoa Burrito	2.53	18	6893	480
Garlic Hummus	Organic Spinach & Potatoes 2, 6 Months+	2.50	18	6893	495
Garlic Hummus	Turkey Meatball Bites	2.50	13	6893	358
Garlic Hummus	100% Whole Wheat Hot Dog Buns	2.46	23	6893	659
Garlic Hummus	Organic White Pita Bread	2.42	37	6893	1102
Garlic Hummus	Lotus Forbidden Rice Ramen	2.38	10	6893	312
Garlic Hummus	Gochujang Fermented Garlic Chile Paste	2.20	7	6893	260

Similarly, here are recommendations for products that go well with Granny Smith Apples.

product_name.x	product_name.y	pmi	common_n	x_n	y_n
Granny Smith Apples	Royal Gala Apples	2.24	238	27712	2127
Granny Smith Apples	Bag of Red Delicious Apples	2.00	74	27712	835
Granny Smith Apples	Seedless Grapes Green	1.98	34	27712	392
Granny Smith Apples	Dark Chocolate Chili Almond Nuts & Spices	1.96	34	27712	399
Granny Smith Apples	Outshine Lime Fruit Bars	1.95	55	27712	651
Granny Smith Apples	Golden Delicious Apple	1.95	359	27712	4258
Granny Smith Apples	Braeburn Apple	1.93	126	27712	1523
Granny Smith Apples	Garlic Parmesan Deli Style Pretzel Crisps	1.91	49	27712	606
Granny Smith Apples	Bag of Oranges	1.89	131	27712	1657
Granny Smith Apples	Whole Frozen Strawberries	1.88	35	27712	445

When you work yourself through a couple of examples, it might stand out to you that the PMI tends to favor products with a small probability, that is, rare products tend to be recommended more. This is not necessarily desired, in particular not from the standpoint of a business.

Context Distribution Smoothing

As explained in Jurafsky and Martin (2017) citing Levy at al. (2015), a simple way to address this bias is context distribution smoothing, where the context probability is raised to the power of \(\alpha\), where \(\alpha \in (0,1)\). Since, for example, \(0.1^{0.75} \approx 0.1778\), doing so increases the probability of the context, and consequently decreases the PMI.

While there is also an impact on events with larger probability, the effect on events with small probability can be more extreme as for example here, leading to a larger absolute discount of their PMI values:

\[ \log(0.25) - \log(0.5) - \log(0.3) \approx 0.511 \] \[ \log(0.01) - \log(0.5) - \log(0.01) \approx 0.693 \] \[ \log(0.25) - \log(0.5) - \log(0.3^{0.75}) \approx 0.210 \] \[ \log(0.01) - \log(0.5) - \log(0.01^{0.75}) \approx -0.458 \]

It also implies that everything that would have been perfectly independent previously does now become negatively associated:

\[ \log(0.25) - \log(0.5) - \log(0.5) = 0 \] \[ \log(0.25) - \log(0.5) - \log(0.5^{0.75}) \approx -0.173 \]

Setting this aside, the context distribution smoothing can help in many cases to make the top ranks more sensible by returning more mainstream results. We can add the exponent (here 0.75) and compare the results:

context_exponent <- 0.75
pp_pmi_smooth <- pp_common_count %>%
  # smooth using the prior
  mutate(common_freq = log(common_n / total_n),
         x_freq = log(x_n / total_n),
         y_freq = log((y_n / total_n)^context_exponent),
         pmi = common_freq - x_freq - y_freq)

pp_rec_smooth <- pp_pmi_smooth %>%
  select(product_id.x, product_id.y, total_n, common_n, x_n, y_n, pmi) %>%
  left_join(select(products, product_id, product_name), 
            by = c(product_id.x = "product_id")) %>%
  left_join(select(products, product_id, product_name), 
            by = c(product_id.y = "product_id"))

For the apples we can observe that the seedless grapes, the Dark Chocolate Chili Almond Nuts & Spices, as well as Outshine Lime Fruit Bars have all been replaced by more apples, and the most frequently purchased item of them all: bananas.

pp_rec_smooth %>%
  filter(product_id.x == 9387, product_id.x != product_id.y) %>%
  arrange(-pmi) %>%
  select(product_name.x, product_name.y, pmi, common_n, x_n, y_n) %>%
  head(10) %>%
  knitr::kable(digits = 2)

product_name.x	product_name.y	pmi	common_n	x_n	y_n
Granny Smith Apples	Royal Gala Apples	0.49	238	27712	2127
Granny Smith Apples	Golden Delicious Apple	0.38	359	27712	4258
Granny Smith Apples	Gala Apples	0.37	1061	27712	18335
Granny Smith Apples	Banana	0.25	8919	27712	365728
Granny Smith Apples	Red Delicious Apple	0.21	236	27712	3062
Granny Smith Apples	Mandarins Bag	0.20	295	27712	4175
Granny Smith Apples	Bosc Pear	0.15	301	27712	4578
Granny Smith Apples	Braeburn Apple	0.10	126	27712	1523
Granny Smith Apples	Bag of Oranges	0.08	131	27712	1657
Granny Smith Apples	Organic Fuji Apple	0.08	2156	27712	69495

We see a similar effect for the garlic hummus. Compared to the previous recommendations, we also observe that more of the recommended items now have larger common_n values, i.e., by introducing the smoothing, we have implicitly ensured that the ranking relies more on common purchases.

product_name.x	product_name.y	pmi	common_n	x_n	y_n
Garlic Hummus	White Pita	0.92	71	6893	1518
Garlic Hummus	Sea Salt Pita Chips	0.90	351	6893	13135
Garlic Hummus	Jalapeno Hummus	0.65	157	6893	6310
Garlic Hummus	Whole Wheat Pita	0.64	23	6893	489
Garlic Hummus	Lemon Hummus	0.59	250	6893	12625
Garlic Hummus	Organic White Pita Bread	0.51	37	6893	1102
Garlic Hummus	Pita Chips Simply Naked	0.48	175	6893	9110
Garlic Hummus	Organic Peeled Whole Baby Carrots	0.44	532	6893	42519
Garlic Hummus	Peanut Butter Dark Chocolate Fruit & Nut Protein Bars	0.43	15	6893	368
Garlic Hummus	100% Whole Wheat Hot Dog Buns	0.42	23	6893	659

Why PMI and not Common Order Count?

To quickly show the impact of using pointwise mutual information to rank the recommendations instead of the raw count of common orders, consider the following example. If we use the pointwise mutual information to get products that are bought together with Birthday Candles, we will get the following items as the top recommendations. The lighter is a natural complement, and everything else is to prepare the cake on which the candles are placed:

product_name.x	product_name.y	pmi	common_n	x_n	y_n
Birthday Candles	Classic Lighters	2.69	4	208	331
Birthday Candles	Super Moist Chocolate Fudge Cake Mix	2.63	4	208	360
Birthday Candles	Creamy Classic Vanilla Frosting	2.55	3	208	273
Birthday Candles	Rich and Creamy Milk Chocolate Frosting	2.51	3	208	285
Birthday Candles	Funfetti Premium Cake Mix With Candy Bits	2.48	5	208	587

If we instead rank by the absolute count of common purchases, the recommended products would be the generally frequently purchased bananas, strawberries, etc., just as I alluded to in the beginning. This just goes to show that the raw count is not an alternative to come up with product recommendations.

product_name.x	product_name.y	pmi	common_n	x_n	y_n
Birthday Candles	Banana	-0.78	24	208	365728
Birthday Candles	Organic Strawberries	-0.69	16	208	189410
Birthday Candles	Large Lemon	-0.74	11	208	122928
Birthday Candles	Bag of Organic Bananas	-1.66	8	208	274515
Birthday Candles	Strawberries	-1.00	8	208	113606

Closing Thoughts

Recommendations based on pointwise mutual information alone are of course not perfect. It’s easy to find cases in which seemingly random products are recommended based on a few common orders. It’s difficult to filter these cases out by setting some threshold on the common orders; three common orders can produce good recommendations depending on the product (just consider the Birthday Candles example above).

More, since we’re not training a model and optimizing a metric, there is no scalable way of evaluating the result. Without picking a few example products and comparing recommendations, it’s difficult to, for example, pick the optimal smoothing exponent.

But the PMI ranking serves as excellent baseline solution. Given that only four columns have to be counted, the above recommendations can be written in a couple of lines of SQL. It doesn’t take more than a morning to go from no recommendations to a good solution. The PMI gives a lot of bang for the buck.

Not only that, but the PMI is also a natural starting point for word embedding models. As indicated in the references below, one could for example extend the ranking here to a full product-product PMI matrix. This high dimensional matrix could then be reduced into a lower dimensional embedding using something as simple as singular value decomposition (see Chris Moody’s “Stop Using word2vec” post on the Stitchfix blog). A word embedding makes it easy to train other models, as for example a clustering to find groups of related products.

In any case, take a look at the recommendations from the PMI ranking. I have published an interactive Shiny app which lets you select different products to simulate what could be presented on product display pages. The context smoothing parameter is adjustable as well. Try it out here. And next time your company needs product recommendations, try this as a cheap and good baseline.

References

The Instacart Online Grocery Shopping Dataset 2017. Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on May 2, 2018.

Daniel Jurafsky and James H. Martin. Vector Semantics. Book chapter in Speech and Language Processing. Draft of August 7, 2017.

Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems 27 (NIPS 2014).

Omer Levy, Yoav Goldberg and Ido Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.

Chris Moody. Stop Using word2vec. Blog post in Stitchfix’ MultiThreaded blog.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality.

An first alternative would be to compare recommendations we derive from this “training” against the product combinations that appear in a test set.↩