SVD for a Low-Dimensional Embedding of Instacart Products
Building on the Instacart product recommendations based on pointwise mutual information (PMI) in the previous article, we use singular value decomposition (SVD) to factorize the PMI matrix into a lower-dimensional representation of the products (an "embedding"). This allows us to easily identify groups of related products.
We finished the previous article with a long table in which every row measured how surprisingly often two products were bought together, according to the Instacart Online Grocery Shopping dataset. The main takeaway was that recommendations based on pointwise mutual information (PMI) are a strong first solution despite their simplicity. You can explore the results in a Shiny app.
While the previous result could be implemented entirely in SQL, we will move beyond what is feasible in SQL very soon.
Goal: A word2vec-style Representation of Products
Using word2vec is all the rage for deriving numeric vector representations of words in natural language processing. The result is an \(N\times K\) matrix in which each of \(N\) words is represented by a \(K\)-dimensional vector. Since \(K \ll N\), such a low-dimensional representation is extremely useful as input for subsequent tasks such as classification, or to perform word2vec’s hallmark word vector operations such as zuckerberg - facebook + microsoft ~ nadella.
In our case, we deal with \(N=8979\) products, which is manageable. Still, by using a small \(K = 64\) we are able to go from an \(N \times N\) pointwise mutual information matrix to a small-ish set of vectors summarizing the information. And since we will be using singular value decomposition to do so, hopefully most vectors will even highlight different groups of related products! That’s a nice side effect when trying to derive higher-level abstractions and patterns in the purchase data—compared to individual product-to-product relations—and comparable to topics derived from topic models like Latent Dirichlet Allocation.
This blog post is inspired by Chris Moody’s Stop Using word2vec on Stitch Fix’s MultiThreaded blog, in which Chris applied SVD in combination with PMI to the more typical NLP task of embedding words from the Hacker News corpus. In the following, we will see that one can use the exact same method to derive a representation of items sold on Instacart.
From PMI Matrix to SVD
The previous blog post introduced pointwise mutual information (PMI) and how it can be applied to derive recommendations for products that have been purchased in the same order. We finished with a long table called pp_rec_smooth in which every row represented an observed combination of two products that have been purchased together a couple of times, along with their corresponding pointwise mutual information.
product_name.x | product_name.y | pmi |
---|---|---|
Birthday Candles | Classic Lighters | 2.69 |
Birthday Candles | Super Moist Chocolate Fudge Cake Mix | 2.63 |
Birthday Candles | Creamy Classic Vanilla Frosting | 2.55 |
Birthday Candles | Rich and Creamy Milk Chocolate Frosting | 2.51 |
Birthday Candles | Funfetti Premium Cake Mix With Candy Bits | 2.48 |
Consequently, the table is quite long, with 31,708,163 rows. This is still smaller than the \(8979\cdot 8979 = 80622441\) rows needed for all combinations of products. The latter, however, is exactly what we need for the matrix factorization through singular value decomposition: an \(N \times N\) item-by-item pointwise mutual information matrix. We create this matrix by widening the long table; it’s not beautiful, but it does the job.
pmi_matrix <- pp_rec_smooth %>%
  select(product_id.x, product_id.y, pmi) %>%
  tidyr::spread(product_id.y, pmi)

pmi_matrix_product_ids <- pmi_matrix[,1]
pmi_matrix <- pmi_matrix[,-1]
pmi_matrix[1:5,1:5]
## # A tibble: 5 x 5
## `1` `10` `23` `25` `28`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.59 NA NA NA NA
## 2 NA 5.29 - 1.71 - 0.548 NA
## 3 NA - 1.51 5.89 - 1.36 NA
## 4 NA - 0.498 - 1.51 5.44 NA
## 5 NA NA NA NA 6.14
All NAs in this table represent product combinations that have not been observed in the data and for which we consequently do not have a PMI value.
The problem with the NAs is that we cannot perform a singular value decomposition on a matrix with missing entries. But assigning those entries any other value would give them meaning: a very negative value would convey a negative relationship, while setting all of them to zero implies no relationship. Neither is something we would like to simply assume in the following.
A good alternative, described in Jurafsky and Martin, is to set not only the NAs to zero, but also the negative PMI values; the result is known as the positive PMI (PPMI) matrix. This can furthermore be justified by the observation that it is generally more difficult to observe a negative relationship using PMI, as described in the previous blog post and by Jurafsky and Martin. Thus we proceed by replacing both NAs and negative values with zero.
pmi_matrix[is.na(pmi_matrix)] <- 0
pmi_matrix[pmi_matrix < 0] <- 0
pmi_matrix[1:5,1:5]
## # A tibble: 5 x 5
## `1` `10` `23` `25` `28`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.59 0 0 0 0
## 2 0 5.29 0 0 0
## 3 0 0 5.89 0 0
## 4 0 0 0 5.44 0
## 5 0 0 0 0 6.14
colSums(pmi_matrix)[1:5] # not only the diagonal is positive
## 1 10 23 25 28
## 281.6510 126.0473 162.0798 265.7752 323.0297
The \(N \times N\) matrix is now prepared for the singular value decomposition. It’s not actually a matrix object yet, though. So let’s change that, and let’s exploit the fact that the matrix has become really sparse by converting the object into a sparse matrix.
pmi_matrix <- as.matrix(pmi_matrix)
row.names(pmi_matrix) <- pmi_matrix_product_ids$product_id.x
colnames(pmi_matrix) <- pmi_matrix_product_ids$product_id.x
library(Matrix)
pmi_sparse_matrix <- Matrix(pmi_matrix, sparse = TRUE)
pmi_sparse_matrix[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## 1 10 23 25 28
## 1 5.587646 . . . .
## 10 . 5.291013 . . .
## 23 . . 5.891188 . .
## 25 . . . 5.442685 .
## 28 . . . . 6.135897
To make use of the sparsity when performing the singular value decomposition, we can for example use the sparsesvd package. Here, we choose to find a representation of rank 64. This means in particular that we will have 64 vectors representing what was previously represented by 8979 vectors. Using the singular value decomposition, we arrange the available information in a more compact way; we lose some information, but hopefully retain most of it. Feel free to experiment with different values for the rank \(K\). One difficulty is that there is no good quantitative way of finding the best \(K\). One could examine the singular values, or check whether the vectors corresponding to the smallest singular values for a given \(K\) still make sense and, if not, choose a smaller \(K\) instead.
library(sparsesvd)

set.seed(80495)
pmi_svd <- sparsesvd(pmi_sparse_matrix,
                     rank = 64)
The result looks as follows. We have the \(N \times K\) matrix of left-singular vectors, pmi_svd$u, as well as the corresponding \(K\) singular values. If we plot the singular values, we can see how much information is contained in just the first couple of dimensions.
dim(pmi_svd$u)
## [1] 8979 64
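A minimal sketch of such a plot, using the singular values that sparsesvd returns in pmi_svd$d:

# scree-style plot of the 64 singular values; the first few dimensions
# should carry most of the information
plot(pmi_svd$d, type = "b", xlab = "dimension", ylab = "singular value")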
Working with the SVD
Before we start using the result of the singular value decomposition, we add the product names to the rows of the left-singular vectors.
products <- readr::read_csv("products.csv")
departments <- readr::read_csv("departments.csv")

id_name_dict <- data.frame(product_id = as.numeric(row.names(pmi_sparse_matrix))) %>%
  inner_join(select(products, product_id,
                    product_name, aisle_id, department_id)) %>%
  left_join(departments)

product_vectors <- pmi_svd$u
row.names(product_vectors) <- id_name_dict$product_name
With the row names added, we can easily search for specific products as well as for the products that are closest in the \(K\)-dimensional space, as measured by something as simple as the dot product. That is, we’re starting to apply the typical techniques from word embeddings to our product matrix!
# first a function to find the actual product names
search_product <- function(products, name) {
  products[grepl(name, products, fixed = TRUE)]
}

# then the function that finds similar products by dot product
search_synonyms <- function(word_vectors, selected_vector) {
  # https://juliasilge.com/blog/tidy-word-vectors/
  require(broom)
  similarities <- word_vectors %*% selected_vector %>%
    tidy() %>%
    as_tibble() %>%
    rename(product = .rownames,
           similarity = unrowname.x.)

  similarities %>%
    arrange(-similarity)
}
We search for products containing the word Ketchup and then look for similar products, where the similarity is based on the dot product of the product vectors derived from the singular value decomposition of commonly bought-together items.
search_product(row.names(product_vectors), "Ketchup")
## [1] "Reduced Sugar Tomato Ketchup" "Tomato Ketchup"
## [3] "Organic Ketchup Gluten Free" "Organic Ketchup"
## [5] "Simply Heinz Tomato Ketchup" "Organic Unsweetened Ketchup"
## [7] "Classic Ketchup" "Squeeze Tomato Ketchup"
## [9] "Ketchup" "Organic Tomato Ketchup"
search_synonyms(product_vectors,
                product_vectors["Tomato Ketchup",]) %>%
  head(10) %>%
  knitr::kable(digits = 4)
product | similarity |
---|---|
Classic Hamburger Buns | 0.0025 |
Dill Relish | 0.0025 |
Bun Length Skinless Beef Franks | 0.0022 |
Sweet Relish | 0.0021 |
Celery Salt | 0.0019 |
Jumbo Beef Franks | 0.0018 |
Hot Dog Buns | 0.0017 |
Classic Yellow Mustard | 0.0017 |
Thousand Island Dressing | 0.0016 |
Golden Crinkles French Fried Potatoes | 0.0016 |
search_product(row.names(product_vectors), "Smoothie") %>% tail()
## [1] "Blueberry B Fruit Smoothie"
## [2] "Organic Original Acai Berry + Guarana Smoothie Packs"
## [3] "Protein Smoothie Choc-a-lot Flavor"
## [4] "Power C Machine All Natural Fruit + Boost Juice Smoothie"
## [5] "Danimals Strawberry Explosion Flavored Smoothie"
## [6] "Green Machine Juice Smoothie"
search_synonyms(product_vectors,
                product_vectors["Green Machine Juice Smoothie",]) %>%
  head(10) %>%
  knitr::kable(digits = 4)
product | similarity |
---|---|
Berry Veggie Juice Smoothie | 0.0289 |
Power C Machine All Natural Fruit + Boost Juice Smoothie | 0.0271 |
All Natural Berry Blast 100% Juice Smoothie | 0.0266 |
Red Machine Juice Smoothie | 0.0263 |
Blue Machine 100% Juice Smoothie | 0.0263 |
Orange Mango Juice | 0.0260 |
Protein Zone Protein Juice Smoothie | 0.0259 |
Protein Zone Double Berry Juice Smoothie | 0.0258 |
Strawberry Banana Juice | 0.0235 |
Kale Blazer Juice | 0.0228 |
These results look quite good! But they are also similar to what we were already able to do based on the PMI values alone.
Exploring Groups of Products via Dimensions
Something that the PMI values did not provide us with was groups of similar or frequently bought-together products. Similarly to how we would derive topics when applying Latent Dirichlet Allocation to news articles, it would be super nice to build entire groups of bought-together products. In the case of Instacart, we might even deduce entire recipes. While no one has given singular value decomposition a metaphor as good as the topic model one, the individual singular vectors still allow us to find groups of related products. This is the case because related products should have similar values in certain dimensions. In the extreme, SVD compresses the information about a group of related products onto a single dimension in which these products have particularly high values.
Thus we can find “important” groups of products by looking, for a given vector, at the products with the largest loadings on that vector.
The graph below contrasts singular vectors 7 and 8 from the product_vectors matrix. We see that most observations are very close to 0 on both vectors, while comparatively few products load, in absolute terms, on just one of the two vectors. This already indicates how the vectors split the work of representing the information that was previously spread across all 8979 vectors: by focusing on a few products per vector, we can more easily store the overall patterns in the data. If a single vector can contain information on about 140 products, then \(64 \cdot 140 \approx 8979\) products can be represented in 64 dimensions.
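A rough sketch of how such a scatter plot can be drawn with ggplot2 (assuming the product_vectors matrix created above; the plotting details are guesses, not the original code):

# contrast dimensions 7 and 8; most products sit near zero on both axes
library(ggplot2)

data.frame(X7 = product_vectors[, 7],
           X8 = product_vectors[, 8]) %>%
  ggplot(aes(x = X7, y = X8)) +
  geom_point(alpha = 0.3) +
  labs(x = "Dimension 7", y = "Dimension 8")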
If the model works well, related products will load similarly on a given vector. Let’s check for example what kind of information vector 8 stores about products. To get an idea, let’s just look at the 20 most negatively loaded products on vector 8. You see the list of the products below. Given the many “vegan”, “beefless”, “dairy free”, “fishless” mentions in product names, it should be safe to say that vegan products tend to load negatively on vector 8. To hype it up a notch: Our model has learned the concept of a vegan lifestyle from products that were bought together!
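A table like the one below can be pulled out of the decomposition along the following lines (a sketch based on the product_vectors and id_name_dict objects defined above):

# the 20 products with the most negative loading on vector 8
data.frame(product_name = id_name_dict$product_name,
           department = id_name_dict$department,
           X8 = product_vectors[, 8]) %>%
  arrange(X8) %>%
  head(20) %>%
  knitr::kable(digits = 4)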
product_name | department | X8 |
---|---|---|
Home Style Beefless Tips | frozen | -0.0939 |
Golden Fishless Filet | frozen | -0.0893 |
Mini Crispy Crabless Cakes | frozen | -0.0892 |
Vegan Egg | dairy eggs | -0.0876 |
Alfredo Cheezy Mac | frozen | -0.0854 |
Vegan White Cheddar Mac & Cheese | dry goods pasta | -0.0847 |
Chao, Vegan, Tomato Cayenne, Slices | dairy eggs | -0.0833 |
Vegan Cheddar Mac & Cheese | dry goods pasta | -0.0824 |
Lightly Seasoned Chick’n Scallopini | frozen | -0.0817 |
Dairy Free New York Style Cheezecake | frozen | -0.0806 |
Vegan Cheddar Flavor Squares | snacks | -0.0802 |
Vegan Chao Creamy Original Cheese Slices | dairy eggs | -0.0791 |
Veggie Loaf Mashed Potatoes and Vegetables | frozen | -0.0791 |
Vegan Hickory & Sage Benevolent Bacon | deli | -0.0784 |
Cheese Alternative, American Style, Slices | dairy eggs | -0.0782 |
Milk Free Better Than Sour Cream | dairy eggs | -0.0770 |
The Ultimate Beefless Burger | frozen | -0.0764 |
Vegan Shells & Creamy Sauce Mac & Cheese Macaroni & Cheese Organic | dry goods pasta | -0.0763 |
Dairy Free Grated Parmesan Style Topping | dairy eggs | -0.0762 |
Vegan Margherita Pizza With Daiya Cheeze | frozen | -0.0759 |
We can also look at the other end of vector 8 by considering products that load positively. Here we find a selection of snacks. But not just any snacks: we have a selection of high-protein snacks ranging from nut bars to turkey jerky.
product_name | department | X8 |
---|---|---|
Organic Almond Butter With Carob Macrobar | snacks | 0.0294 |
Blueberry Pecan Plus Fiber Fruit & Nut Bar | snacks | 0.0299 |
Macrobar Prolonged Powder Banana + Almond Butter | snacks | 0.0303 |
Potato Chips Sea Salt & Vinegar | snacks | 0.0304 |
Cranberry Almond Plus Fruit & Nut Bars | snacks | 0.0306 |
Fruit & Nut Bar Blueberry Vanilla & Cashew | snacks | 0.0310 |
Lightly Salted Chips | snacks | 0.0318 |
Original Turkey Jerky | snacks | 0.0319 |
Organic Coriander Seeds | pantry | 0.0320 |
Dark Chocolate Cinnamon Pecan Bar | snacks | 0.0321 |
Sweet Rejuvenation Cashew Butter Macrobar | snacks | 0.0333 |
Green Dragon with Passion Fruit Tea | beverages | 0.0335 |
Healthy Grains Granola Bar, Vanilla Blueberry | snacks | 0.0338 |
Apple Cinnamon & Pecan Fruit & Nut Bar | snacks | 0.0345 |
Organic Ground Cloves | pantry | 0.0359 |
Dark Chocolate Chili Almond Nuts & Spices | snacks | 0.0360 |
Turkey Bar, Almond + Cranberry | snacks | 0.0363 |
Teriyaki Turkey Jerky | snacks | 0.0379 |
Almond Walnut Macadamia Plus Bar | snacks | 0.0381 |
Fruit & Nut Delight Bar | snacks | 0.0391 |
Parallel Coordinates Plot
To take the previous plot to the next level—or rather to the next dimension—one can try to display several dimensions in a parallel coordinates plot like the one below. This kind of plot is not ideal in several respects; it’s obviously prone to overplotting, and since it is not interactive, it’s difficult to find out whether there are items that load on two neighboring dimensions.
But by adding a color for every department, one gets a sense that some dimensions are loaded on almost exclusively by products from one or very few departments (e.g., dimensions 2, 3, or 6), while other dimensions (e.g., 7) bring together products from several departments. In the case of dimensions 2 and 3, we even have the lucky case that the same products load on two neighboring dimensions. This should generally not happen too often, since it implies that similar information is stored twice. It does happen when the products in question carry a lot of information about the dataset, i.e., explain a lot of the variance. In this case, it creates a very explicit delineation of one category from all others: the babies department.
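One way to draw such a parallel coordinates plot is via GGally::ggparcoord; the following is only a sketch, and the number of dimensions shown as well as the transparency are arbitrary choices:

# parallel coordinates over the first 8 dimensions, colored by department
library(GGally)

data.frame(product_vectors[, 1:8], row.names = NULL) %>%
  bind_cols(select(id_name_dict, department)) %>%
  ggparcoord(columns = 1:8, groupColumn = "department",
             scale = "globalminmax", alphaLines = 0.1) +
  labs(x = "dimension", y = "loading")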
# show some of the products in the baby cluster
set.seed(5463)
product_vectors[,c(2,3)] %>%
  tidy() %>%
  bind_cols(id_name_dict) %>%
  filter(X1 > 0.05, X2 > 0.03) %>%
  select(product_name, department, X1, X2) %>%
  rename(X3 = X2, X2 = X1) %>%
  sample_n(5) %>%
  knitr::kable(digits = 2)
product_name | department | X2 | X3 |
---|---|---|---|
Apple and Carrot Stage 2 Baby Food | babies | 0.06 | 0.04 |
Berry & Barley Organic Baby Food | babies | 0.07 | 0.05 |
Baby Food Pouch - Butternut Squash, Carrot & Chickpea | babies | 0.06 | 0.04 |
Squash & Sweet Peas Stage 2 | babies | 0.09 | 0.06 |
Stage 2 Spinach, Apple & Kale | babies | 0.08 | 0.05 |
While this effect is particularly drastic for the baby food, it is also observable in later dimensions for different product groups. For example, we can consider dimensions 13 and 14. They are not able to separate products quite as clearly, but some clusters are still visible. The products indicated in blue, for example, form a cluster of frozen meals, most of them Indian dishes. The orange cluster, on the other hand, groups different items used for baking. Note how the rug lines of dimension 13 show that this dimension alone is not able to separate blue from orange observations. It takes the second dimension to successfully perform the separation.
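Again, only a sketch of how such a plot could be produced; the cut-off values used to highlight the two clusters are made up for illustration and not taken from the original analysis:

# scatter plot of dimensions 13 and 14 with two highlighted clusters
# (hypothetical thresholds; assumes dplyr and ggplot2 are loaded as above)
data.frame(X13 = product_vectors[, 13],
           X14 = product_vectors[, 14]) %>%
  mutate(cluster = case_when(
    X13 < -0.05 & X14 >  0.04 ~ "frozen meals (blue)",
    X13 < -0.05 & X14 <= 0.04 ~ "baking (orange)",
    TRUE                      ~ "other"
  )) %>%
  ggplot(aes(x = X13, y = X14, color = cluster)) +
  geom_point(alpha = 0.3) +
  geom_rug(alpha = 0.05) +
  labs(x = "Dimension 13", y = "Dimension 14")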
Frozen meals, Indian dishes for the most part:
product_name | department | X13 | X14 |
---|---|---|---|
Harvest Casserole Bowls | frozen | -0.07 | 0.06 |
Roasted Vegetable Tamale | frozen | -0.07 | 0.06 |
Gluten Free & Dairy Free Thai Red Curry With Organic Jasmin Rice & Vegetables | frozen | -0.07 | 0.06 |
Indian Vegetable Korma Entrée | frozen | -0.07 | 0.06 |
Light & Lean Sweet & Sour Asian Noodle Pasta | frozen | -0.07 | 0.06 |
Light and Lean Meatless Swedish Meatballs With Broccoli And Pasta | frozen | -0.07 | 0.06 |
Thai Green Curry | frozen | -0.07 | 0.06 |
Indian Mattar Tofu | frozen | -0.07 | 0.06 |
Teriyaki Meal Bowls | frozen | -0.07 | 0.06 |
Black Bean Tamale Verde | frozen | -0.07 | 0.06 |
Advanced baking ingredients:
product_name | department | X13 | X14 |
---|---|---|---|
Angel Flake Sweetened Coconut | pantry | -0.06 | 0.01 |
Nestle Toll House Premier White Chocolate Morsels | pantry | -0.07 | 0.02 |
Pure Granulated Sugar | pantry | -0.06 | -0.01 |
Light Corn Syrup with Real Vanilla | pantry | -0.08 | 0.03 |
Powdered Confectioners Sugar | pantry | -0.08 | 0.01 |
Since this post would become even longer if I walked you through every interesting dimension, I have instead put the results into a small Shiny app. You can use it to explore more of the customer behavior patterns that this simple model was able to learn. For example, find a dimension for all things burrito, as well as dairy-free yogurts and milk in dimension 42.
Closing Thoughts
In this post we have derived a low-dimensional representation of the products sold through Instacart. Even though singular value decomposition is a simple model, the derived vector representation seems useful upon inspection. We have focused our attention on exploring the resulting vectors and showing how they compress and represent information about certain groups of products that are purchased together by customers.
We have seen that the singular vectors can be interpreted similarly to LDA’s topics, given that a group of related products loads on one or a few vectors. What is still to be explored in future blog posts is the ability to perform vector operations, i.e., operations such as Hamburger Buns - Beef + Wiener ~ Hot Dog Buns. We have quickly shown how one can find products similar to a given product through the dot product, but we have not used the cosine similarity yet. A future blog post will use the two-dimensional graphs above to show how the dot product and cosine similarity can be used to explore the space of products.
Until then, check out the product dimensions in the Shiny app to find more than just the group of vegan products.
References
The Instacart Online Grocery Shopping Dataset 2017. Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on May 2, 2018.
Chris Moody. Stop Using word2vec. Blog post on Stitch Fix’s MultiThreaded blog, October 18, 2017.
Julia Silge. Word Vectors with Tidy Data Principles. Blog post, October 30, 2017.
Daniel Jurafsky and James H. Martin. Vector Semantics. Book chapter in Speech and Language Processing. Draft of August 7, 2017.
Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems 27 (NIPS 2014).
Omer Levy, Yoav Goldberg and Ido Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013).