SVD for a Low-Dimensional Embedding of Instacart Products

July 25, 2018

Building on the Instacart product recommendations based on Pointwise Mutual Information (PMI) in the previous article, we use Singular Value Decomposition to factorize the PMI matrix into a matrix of lower dimension (“embedding”). This allows us to identify groups of related products easily.

We finished the previous article with a long table where every row measured how surprisingly often two products were bought together according to the Instacart Online Grocery Shopping dataset. The main point being that recommendations based on pointwise mutual information (PMI) are a strong first solution despite their simplicity. You can explore the results in a Shiny app.

While the previous result could be implemented entirely in SQL, we will move beyond what is feasible in SQL very soon.

Goal: A word2vec-style Representation of Products

Using word2vec is all the rage to derive numeric vector representations of words in natural language processing. The result is be a \(N\times K\) matrix of \(N\) words represented by \(K\) vectors. Since \(K \ll N\), such a low-dimensional representation is extremely useful as input for subsequent tasks such as classification, or to perform word2vec’s hallmark word vector operations such as zuckerberg - facebook + microsoft ~ nadella.

In our case, we deal with \(N=8979\) products, which is manageable. Still, by using a small \(K = 64\) we are able to go from a \(N \times N\) pointwise mutual information matrix to a small-ish set of vectors summarizing the information. And since we will be using singular value decomposition to do so, (hopefully most) vectors will even highlight different groups of related products! That’s a nice side effect when trying to derive higher level abstractions and patterns in the purchase data—compared to individual product-to-product relations—and comparable to topics derived from topic models like Latent Dirichlet Allocation.

This blog post is inspired by Chris Moody’s Stop Using word2vec on StichFix’ MultiThreaded blog, in which Chris applied SVD in combination with PMI on the more typical NLP task of embedding words from the Hacker News corpus. In the following, we will see that one can use the exact same method to derive a representation of items sold on Instacart.

From PMI Matrix to SVD

The previous blog post introduced the Pointwise Mutual Information and how it can be applied to derive product recommendations for products that have been purchased in the same order. We finished with a long table called pp_rec_smooth in which every row represented an observed combination of two products that have been purchased together a couple of times, as well as their corresponding pointwise mutual information.

product_name.x product_name.y pmi
Birthday Candles Classic Lighters 2.69
Birthday Candles Super Moist Chocolate Fudge Cake Mix 2.63
Birthday Candles Creamy Classic Vanilla Frosting 2.55
Birthday Candles Rich and Creamy Milk Chocolate Frosting 2.51
Birthday Candles Funfetti Premium Cake Mix With Candy Bits 2.48

Consequently, the table is quite long with 31708163 rows. This is still smaller than the \(8979\cdot 8979 = 80622441\) row table for all combinations of products. The latter, however, is exactly what we need for the matrix factorization through singular value decomposition: A \(N \times N\) item-by-item pointwise mutual information matrix. We create this matrix by widening the long table; it’s not beautiful, but it does the job.

pmi_matrix <- pp_rec_smooth %>%
  select(product_id.x, product_id.y, pmi) %>%
  tidyr::spread(product_id.y, pmi)
pmi_matrix_product_ids <- pmi_matrix[,1]
pmi_matrix <- pmi_matrix[,-1]

pmi_matrix[1:5,1:5]
## # A tibble: 5 x 5
##     `1`    `10`   `23`    `25`  `28`
##   <dbl>   <dbl>  <dbl>   <dbl> <dbl>
## 1  5.59  NA      NA     NA     NA   
## 2 NA      5.29  - 1.71 - 0.548 NA   
## 3 NA    - 1.51    5.89 - 1.36  NA   
## 4 NA    - 0.498 - 1.51   5.44  NA   
## 5 NA     NA      NA     NA      6.14

All NAs in this table represent product combinations that have not been observed in the data and for which we consequently do not have a PMI value.

The problem with the NAs is that we cannot perform singular value decomposition on a matrix with missing entries. But assigning those entries any other value would give them meaning: A very negative value would convey a negative relationship, while setting all of them to zero implies no relationship. Neither is something we would like to simply assume in the following.

A good alternative described in Jurafsky and Martin is to set not only the NAs to zero, but also the negative PMI values. This can furthermore be justified by the observation that it is generally more difficult to observe a negative relationship using PMI—as described in the previous blog post and by Jurafsky and Martin. Thus we proceed by replacing both NAs and negative values by zero.

pmi_matrix[is.na(pmi_matrix)] <- 0
pmi_matrix[pmi_matrix < 0] <- 0
pmi_matrix[1:5,1:5]
## # A tibble: 5 x 5
##     `1`  `10`  `23`  `25`  `28`
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  5.59  0     0     0     0   
## 2  0     5.29  0     0     0   
## 3  0     0     5.89  0     0   
## 4  0     0     0     5.44  0   
## 5  0     0     0     0     6.14
colSums(pmi_matrix)[1:5] # not only the diagonal is positive
##        1       10       23       25       28 
## 281.6510 126.0473 162.0798 265.7752 323.0297

The \(N \times N\) matrix is now prepared for singular value decomposition. Though, it’s not actually a matrix yet. So let’s change that, and let’s use the fact that the matrix has become really sparse by converting the object into a sparse matrix.

pmi_matrix <- as.matrix(pmi_matrix)
row.names(pmi_matrix) <- pmi_matrix_product_ids$product_id.x
colnames(pmi_matrix) <- pmi_matrix_product_ids$product_id.x

library(Matrix)
pmi_sparse_matrix <- Matrix(pmi_matrix, sparse = TRUE)
pmi_sparse_matrix[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##           1       10       23       25       28
## 1  5.587646 .        .        .        .       
## 10 .        5.291013 .        .        .       
## 23 .        .        5.891188 .        .       
## 25 .        .        .        5.442685 .       
## 28 .        .        .        .        6.135897

To make use of the sparsity when performing singular value decomposition, we can for example use the sparsesvd package. Here, we choose to find a representation of rank 64. This means in particular that we will have 64 vectors representing what was previously represented by 8979 vectors. Using the singular value decomposition, we arrange the available information in a more compact way; we lose some information, but hopefully retain most of it. Feel free to experiment with different values for the rank \(K\). Problematic is that there is no good quantitative way of finding a good \(K\). One could examine the eigenvalues, or check whether the vectors corresponding to the smallest eigenvalue for a given value of \(K\) make some sense, and, if not, choose a smaller \(K\) instead.

library(sparsesvd)
set.seed(80495)
pmi_svd <- sparsesvd(pmi_sparse_matrix, 
                     rank = 64)

The result looks as follows. We have the \(N \times K\) matrix, pmi_svd$u, as well as the corresponding \(K\) singular values. If we plot them, we can see how much information is contained in just the first couple of dimensions.

dim(pmi_svd$u)
## [1] 8979   64

Working with the SVD

Before we start using the result of the singular value decomposition, we add the product names to the rows of the left-singular vectors.

products <- readr::read_csv("products.csv")
departments <- readr::read_csv("departments.csv")

id_name_dict <- data.frame(product_id = as.numeric(row.names(pmi_sparse_matrix))) %>%
  inner_join(select(products, product_id, 
                    product_name, aisle_id, department_id)) %>%
  left_join(departments)

product_vectors <- pmi_svd$u
row.names(product_vectors) <- id_name_dict$product_name

With the row names added, we can easily search for specific products as well as the products that are closest in the \(K\)-dimensional space as measured by a measure as simple as the dot product. That is, we’re starting to apply the typical techniques from word embeddings on our product matrix!

# first a function to find the actual product names
search_product <- function(products, name) {
  products[grepl(name, products, fixed = TRUE)]
}

# then the function that finds similar products by dot product
search_synonyms <- function(word_vectors, selected_vector) {
  # https://juliasilge.com/blog/tidy-word-vectors/
  require(broom)
  
  similarities <- word_vectors %*% selected_vector %>%
    tidy() %>%
    as_tibble() %>%
    rename(product = .rownames,
           similarity = unrowname.x.)
  
  similarities %>%
    arrange(-similarity)    
}

We search for products containing the word Ketchup and then look for similar products, where the similarity is based on the dot product on a singular value decomposition of commonly bought together items.

search_product(row.names(product_vectors), "Ketchup")
##  [1] "Reduced Sugar Tomato Ketchup" "Tomato Ketchup"              
##  [3] "Organic Ketchup Gluten Free"  "Organic Ketchup"             
##  [5] "Simply Heinz Tomato Ketchup"  "Organic Unsweetened Ketchup" 
##  [7] "Classic Ketchup"              "Squeeze Tomato Ketchup"      
##  [9] "Ketchup"                      "Organic Tomato Ketchup"
search_synonyms(product_vectors, 
                product_vectors["Tomato Ketchup",]) %>% 
  head(10) %>%
  knitr::kable(digits = 4)
product similarity
Classic Hamburger Buns 0.0025
Dill Relish 0.0025
Bun Length Skinless Beef Franks 0.0022
Sweet Relish 0.0021
Celery Salt 0.0019
Jumbo Beef Franks 0.0018
Hot Dog Buns 0.0017
Classic Yellow Mustard 0.0017
Thousand Island Dressing 0.0016
Golden Crinkles French Fried Potatoes 0.0016
search_product(row.names(product_vectors), "Smoothie") %>% tail()
## [1] "Blueberry B Fruit Smoothie"                              
## [2] "Organic Original Acai Berry + Guarana Smoothie Packs"    
## [3] "Protein Smoothie Choc-a-lot Flavor"                      
## [4] "Power C Machine All Natural Fruit + Boost Juice Smoothie"
## [5] "Danimals Strawberry Explosion Flavored Smoothie"         
## [6] "Green Machine Juice Smoothie"
search_synonyms(product_vectors, 
                product_vectors["Green Machine Juice Smoothie",]) %>%
  head(10) %>%
  knitr::kable(digits = 4)
product similarity
Berry Veggie Juice Smoothie 0.0289
Power C Machine All Natural Fruit + Boost Juice Smoothie 0.0271
All Natural Berry Blast 100% Juice Smoothie 0.0266
Red Machine Juice Smoothie 0.0263
Blue Machine 100% Juice Smoothie 0.0263
Orange Mango Juice 0.0260
Protein Zone Protein Juice Smoothie 0.0259
Protein Zone Double Berry Juice Smoothie 0.0258
Strawberry Banana Juice 0.0235
Kale Blazer Juice 0.0228

These results look quite good! But also similar to what we have been able to do based on the PMI alone.

Exploring Groups of Products via Dimensions

Something that the PMI values did not provide us with were groups of similar/frequently purchased together products. Similarly to how we would derive topics when applying Latent Dirichlet Allocation on news articles, it would be super nice to build entire groups of bought-together products. In the case of Instacart, we might even deduct entire recipes. While no one gave singular value decomposition a metaphor as good as the topic model one, the individual singular vectors still allow us to find groups of related products. This is the case because related products should have similar values in certain dimensions. In the extreme, SVD compresses the information about a group of related products onto a single dimension in which these products have particularly high values.

Thus we can find “important” groups of products by exploring for a given vector the products with the largest loading on that vector.

The graph below contrasts singular vectors 7 and 8 from the product_vectors matrix. We see that most observations are very close to 0 on both vectors, while comparatively few products are loaded in absolute terms on just one of the two vectors. This already indicates how the vectors split the work of representing the information that was previously spread across all 8979 vectors: By focusing on a few products per vector, we can more easily store the overall patterns in the data. If a single vector can contain information on about 140 products, then \(64 \cdot 140 = 8979\) products can be represented in 64 dimensions.

If the model works well, related products will load similarly on a given vector. Let’s check for example what kind of information vector 8 stores about products. To get an idea, let’s just look at the 20 most negatively loaded products on vector 8. You see the list of the products below. Given the many “vegan”, “beefless”, “dairy free”, “fishless” mentions in product names, it should be safe to say that vegan products tend to load negatively on vector 8. To hype it up a notch: Our model has learned the concept of a vegan lifestyle from products that were bought together!

product_name department X8
Home Style Beefless Tips frozen -0.0939
Golden Fishless Filet frozen -0.0893
Mini Crispy Crabless Cakes frozen -0.0892
Vegan Egg dairy eggs -0.0876
Alfredo Cheezy Mac frozen -0.0854
Vegan White Cheddar Mac & Cheese dry goods pasta -0.0847
Chao, Vegan, Tomato Cayenne, Slices dairy eggs -0.0833
Vegan Cheddar Mac & Cheese dry goods pasta -0.0824
Lightly Seasoned Chick’n Scallopini frozen -0.0817
Dairy Free New York Style Cheezecake frozen -0.0806
Vegan Cheddar Flavor Squares snacks -0.0802
Vegan Chao Creamy Original Cheese Slices dairy eggs -0.0791
Veggie Loaf Mashed Potatoes and Vegetables frozen -0.0791
Vegan Hickory & Sage Benevolent Bacon deli -0.0784
Cheese Alternative, American Style, Slices dairy eggs -0.0782
Milk Free Better Than Sour Cream dairy eggs -0.0770
The Ultimate Beefless Burger frozen -0.0764
Vegan Shells & Creamy Sauce Mac & Cheese Macaroni & Cheese Organic dry goods pasta -0.0763
Dairy Free Grated Parmesan Style Topping dairy eggs -0.0762
Vegan Margherita Pizza With Daiya Cheeze frozen -0.0759

We can also look at the other end of vector 8, by considering products that loaded positively. Here we find a selection of snacks. But not just any snacks; we have a selection of high protein snacks ranging from nut bars to turkey jerkey.

product_name department X8
Organic Almond Butter With Carob Macrobar snacks 0.0294
Blueberry Pecan Plus Fiber Fruit & Nut Bar snacks 0.0299
Macrobar Prolonged Powder Banana + Almond Butter snacks 0.0303
Potato Chips Sea Salt & Vinegar snacks 0.0304
Cranberry Almond Plus Fruit & Nut Bars snacks 0.0306
Fruit & Nut Bar Blueberry Vanilla & Cashew snacks 0.0310
Lightly Salted Chips snacks 0.0318
Original Turkey Jerky snacks 0.0319
Organic Coriander Seeds pantry 0.0320
Dark Chocolate Cinnamon Pecan Bar snacks 0.0321
Sweet Rejuvenation Cashew Butter Macrobar snacks 0.0333
Green Dragon with Passion Fruit Tea beverages 0.0335
Healthy Grains Granola Bar, Vanilla Blueberry snacks 0.0338
Apple Cinnamon & Pecan Fruit & Nut Bar snacks 0.0345
Organic Ground Cloves pantry 0.0359
Dark Chocolate Chili Almond Nuts & Spices snacks 0.0360
Turkey Bar, Almond + Cranberry snacks 0.0363
Teriyaki Turkey Jerky snacks 0.0379
Almond Walnut Macadamia Plus Bar snacks 0.0381
Fruit & Nut Delight Bar snacks 0.0391

Parallel Coordinates Plot

To take the previous plot to the next level—or rather to the next dimension—one can try to display several dimensions in a parallel coordinates plot like the one below. This plot is not ideal on a number of dimensions; it’s obviously prone to overplotting. And not being interactive, it’s difficult to find out whether there are items that load on two neighboring dimensions.

But by adding a color for every department, one gets a sense of the fact that some dimensions load only on products from one or very few departments (e.g., dimensions 2, 3, or 6), while other dimensions (e.g., 7) bring together products from several departments. In the case of dimensions 2 and 3, we even have the lucky case that the same products load on neighboring dimensions. This should generally not happen too often, since it implies that similar information is stored twice. It does happen when the products in question carry a lot of information about the dataset, i.e., explain a lot of the variance. In this case, it creates a very explicit delineation of one category from all others: the babies department.

# show some of the products in the baby cluster
set.seed(5463)
product_vectors[,c(2,3)] %>%
  tidy() %>%
  bind_cols(id_name_dict) %>%
  filter(X1 > 0.05, X2 > 0.03) %>%
  select(product_name, department, X1, X2) %>%
  rename(X3 = X2, X2 = X1) %>%
  sample_n(5) %>%
  knitr::kable(digits = 2)
product_name department X2 X3
Apple and Carrot Stage 2 Baby Food babies 0.06 0.04
Berry & Barley Organic Baby Food babies 0.07 0.05
Baby Food Pouch - Butternut Squash, Carrot & Chickpea babies 0.06 0.04
Squash & Sweet Peas Stage 2 babies 0.09 0.06
Stage 2 Spinach, Apple & Kale babies 0.08 0.05

While this effect is particularly drastic for the baby food, it is also observable in later dimensions for different product groups. For example, we could consider dimensions 13 and 14. They are not able to separate products quite as clearly, but some clusters are still visible. The products indicated in blue, for example, are a cluster of frozen and foremost Indian meals. The orange cluster, on the other hand, groups different items used for baking. Note how the rug lines of dimension 13 show that this dimension alone is not able to separate blue from orange observations. It takes the second dimension to successfully perform a separation.

Frozen meals, Indian dishes for the most part:

product_name department X13 X14
Harvest Casserole Bowls frozen -0.07 0.06
Roasted Vegetable Tamale frozen -0.07 0.06
Gluten Free & Dairy Free Thai Red Curry With Organic Jasmin Rice & Vegetables frozen -0.07 0.06
Indian Vegetable Korma Entrée frozen -0.07 0.06
Light & Lean Sweet & Sour Asian Noodle Pasta frozen -0.07 0.06
Light and Lean Meatless Swedish Meatballs With Broccoli And Pasta frozen -0.07 0.06
Thai Green Curry frozen -0.07 0.06
Indian Mattar Tofu frozen -0.07 0.06
Teriyaki Meal Bowls frozen -0.07 0.06
Black Bean Tamale Verde frozen -0.07 0.06

Advanced baking ingredients:

product_name department X13 X14
Angel Flake Sweetened Coconut pantry -0.06 0.01
Nestle Toll House Premier White Chocolate Morsels pantry -0.07 0.02
Pure Granulated Sugar pantry -0.06 -0.01
Light Corn Syrup with Real Vanilla pantry -0.08 0.03
Powdered Confectioners Sugar pantry -0.08 0.01

Since this post would become even longer if I walked you through every interesting dimension, I instead put the results in a small Shiny app. You can use it to explore more customer behavior patterns that this simple model was able to learn. For example, find a dimension for all-things burrito, as well as dairy free yogurts and milk in dimension 42.

Closing Thoughts

In this post we have derived a low dimensional representation of products sold through Instacart. Even though a singular value decomposition is a simple model, the derived vector representation seems useful upon inspection. In this post, we have focused our attention on exploring the resulting vectors and showing how they compress and represent information about certain groups of products that are purchased together by customers.

We have seen that the singular vectors can be interpreted similarly to LDA’s topics given that a given group of products loads on a one or a few vectors. What is still to be explored in future blog posts is the ability to perform vector operations, i.e., operations such as Hamburger Buns - Beef + Wiener ~ Hot Dog Buns. We have quickly shown how one can similar products to a given product through the dot product, but we have not used the cosine similarity yet. A future blog post will use the two dimensional graphs above to show how the dot product and cosine similarity can be used to explore the space of products.

Until then, check out the product dimensions in the Shiny app to find more than just the group of vegan products.

References

The Instacart Online Grocery Shopping Dataset 2017. Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on May 2, 2018.

Chris Moody. Stop Using word2vec. Blog post on Stitchfix’ MultiThreaded blog, October 18, 2017.

Julia Silge. Word Vectors with Tidy Data Principles. Blog post, October 30, 2017.

Daniel Jurafsky and James H. Martin. Vector Semantics. Book chapter in Speech and Language Processing. Draft of August 7, 2017.

Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems 27 (NIPS 2014).

Omer Levy, Yoav Goldberg and Ido Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality.