{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Similarity-based recommendation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first recommender system we'll implement is a simple simaliry-based recommender that makes recommendations based on the Jaccard similarity between items." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll start with several standard imports:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import gzip\n", "from collections import defaultdict\n", "import scipy\n", "import scipy.optimize\n", "import numpy\n", "import random\n", "import math" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And load the data much as before, converting integer-valued fields as we go. Note that here we use a larger \"Musical Instruments\" dataset for the sake of demonstrating a more scalable system, but you could repeat this exercise with any dataset (in particular, you could use a smaller one if these exercises run slowly on your machine). Again, this dataset was collected from https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt and should be saved locally." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "path = \"/home/jmcauley/PML book data/amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "f = gzip.open(path, 'rt', encoding=\"utf8\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "header = f.readline()\n", "header = header.strip().split('\\t')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "dataset = []" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "for line in f:\n", " fields = line.strip().split('\\t')\n", " d = dict(zip(header, fields))\n", " d['star_rating'] = int(d['star_rating'])\n", " d['helpful_votes'] = int(d['helpful_votes'])\n", " d['total_votes'] = int(d['total_votes'])\n", " dataset.append(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's examine one of the entries in this dataset:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'customer_id': '45610553',\n", " 'helpful_votes': 0,\n", " 'marketplace': 'US',\n", " 'product_category': 'Musical Instruments',\n", " 'product_id': 'B00HH62VB6',\n", " 'product_parent': '618218723',\n", " 'product_title': 'AGPtek® 10 Isolated Output 9V 12V 18V Guitar Pedal Board Power Supply Effect Pedals with Isolated Short Cricuit / Overcurrent Protection',\n", " 'review_body': 'Works very good, but induces ALOT of noise.',\n", " 'review_date': '2015-08-31',\n", " 'review_headline': 'Three Stars',\n", " 'review_id': 'RMDCHWD0Y5OZ9',\n", " 'star_rating': 3,\n", " 'total_votes': 1,\n", " 'verified_purchase': 'N',\n", " 'vine': 'N'}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we'll build a 
few useful data structures, in this case just to maintain a collection of the items reviewed by each user, and the collection of users who have reviewed each item." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "usersPerItem = defaultdict(set)\n", "itemsPerUser = defaultdict(set)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "itemNames = {}\n", "ratingDict = {}" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "for d in dataset:\n", " user,item = d['customer_id'], d['product_id']\n", " usersPerItem[item].add(user)\n", " itemsPerUser[user].add(item)\n", " ratingDict[(user,item)] = d['star_rating']\n", " itemNames[item] = d['product_title']" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "userAverages = {}\n", "itemAverages = {}" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "for u in itemsPerUser:\n", " rs = [ratingDict[(u,i)] for i in itemsPerUser[u]]\n", " userAverages[u] = sum(rs) / len(rs)\n", " \n", "for i in usersPerItem:\n", " rs = [ratingDict[(u,i)] for u in usersPerItem[i]]\n", " itemAverages[i] = sum(rs) / len(rs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a generic implementation of the Jaccard similarity between two sets, which we'll use to build our recommender:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def Jaccard(s1, s2):\n", " numer = len(s1.intersection(s2))\n", " denom = len(s1.union(s2))\n", " return numer / denom" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our implementation of the recommender system just finds the items (i2) most similar to the query item (i), under a chosen similarity measure between their sets of users. In addition to Jaccard, the cells below implement Cosine (on binary interactions) and Pearson similarities; as written, mostSimilar ranks candidates using the Pearson similarity, with the Jaccard alternative left commented out."
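As a quick sanity check on the Jaccard function above (toy user IDs, not drawn from the dataset), the similarity is the size of the intersection divided by the size of the union:

```python
def Jaccard(s1, s2):
    # |intersection| / |union|
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom

# Two items with 2 shared purchasers out of 4 distinct purchasers overall
print(Jaccard({'u1', 'u2', 'u3'}, {'u2', 'u3', 'u4'}))  # 0.5
```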
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def Cosine(s1, s2):\n", " # Not a proper implementation, operates on sets so correct for interactions only\n", " numer = len(s1.intersection(s2))\n", " denom = math.sqrt(len(s1)) * math.sqrt(len(s2))\n", " return numer / denom" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def Pearson(i1, i2):\n", " # Between two items\n", " iBar1 = itemAverages[i1]\n", " iBar2 = itemAverages[i2]\n", " inter = usersPerItem[i1].intersection(usersPerItem[i2])\n", " numer = 0\n", " denom1 = 0\n", " denom2 = 0\n", " for u in inter:\n", " numer += (ratingDict[(u,i1)] - iBar1)*(ratingDict[(u,i2)] - iBar2)\n", " for u in inter: #usersPerItem[i1]:\n", " denom1 += (ratingDict[(u,i1)] - iBar1)**2\n", " #for u in usersPerItem[i2]:\n", " denom2 += (ratingDict[(u,i2)] - iBar2)**2\n", " denom = math.sqrt(denom1) * math.sqrt(denom2)\n", " if denom == 0: return 0\n", " return numer / denom" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def mostSimilar(i):\n", " similarities = []\n", " users = usersPerItem[i]\n", " for i2 in usersPerItem:\n", " if i2 == i: continue\n", " #sim = Jaccard(users, usersPerItem[i2])\n", " sim = Pearson(i, i2)\n", " similarities.append((sim,i2))\n", " similarities.sort(reverse=True)\n", " return similarities[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating a recommendation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's select some example item from the dataset to use as a query to generate similar recommendations:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'customer_id': '35297198',\n", " 'helpful_votes': 0,\n", " 'marketplace': 'US',\n", " 'product_category': 'Musical Instruments',\n", " 'product_id': 'B006MIU7U2',\n", " 'product_parent': '125871705',\n", " 
'product_title': 'Halco 80000 - MR16/3WW/FL/LED2 MR16 Flood LED Light Bulb',\n", " 'review_body': 'Had to replace a new light after a lightning storm. Works fine. Hope it lasts through the warranty period this time!',\n", " 'review_date': '2015-08-31',\n", " 'review_headline': 'Works fine. Hope it lasts through the warranty period this ...',\n", " 'review_id': 'R30SHYQXGG5EYC',\n", " 'star_rating': 5,\n", " 'total_votes': 0,\n", " 'verified_purchase': 'Y',\n", " 'vine': 'N'}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[8]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "query = dataset[8]['product_id']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll examine the most similar items compared to this query:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "ms = mostSimilar(query)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, 'B0716RFGH8'),\n", " (0, 'B01IBI8HKC'),\n", " (0, 'B01IBI8F86'),\n", " (0, 'B014K2S348'),\n", " (0, 'B014GZ52IO'),\n", " (0, 'B0145DOBL6'),\n", " (0, 'B0143TQ7ZK'),\n", " (0, 'B0143H0W3U'),\n", " (0, 'B0142OY3ZM'),\n", " (0, 'B0141UGNBE')]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ms" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Halco 80000 - MR16/3WW/FL/LED2 MR16 Flood LED Light Bulb'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "itemNames[query]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Ibanez RG Prestige Series RG655M Electric Guitar',\n", " 'Planet Waves NS Artist Capo, Metallic Bronze',\n", " 'Planet Waves NS Artist Capo, Metallic Bronze',\n", " 'AKG 2015 NEWEST MODEL 
M220 PRO STYLIST PROFESSIONAL LARGE DIAPHRAGM DJ SEMI-OPEN HIGH DEFINITION OVER-EAR STUDIO HEADPHONES',\n", " 'Amdirect Drum Soft Padded Seat Stool Stand Drummers Drumming Adjustable Chair',\n", " 'Waveburg Custom Acoustic Pro PA Public Address Ceiling Speaker IC-8RTU',\n", " 'Wallet Capo',\n", " 'KLIQ TinyTune Mini Chromatic Tuner Pedal for Guitar & Bass',\n", " 'WOWTOU RGB Stage light Seriers',\n", " 'Sonder Instruments Professional Guitar Capo with Pick Holder ★ Durable ★ Tightly Holds ALL Strings ★ Evenly Pressured Spring ★ Hold Your Picks in STYLE While You Play!']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[itemNames[x[1]] for x in ms]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Efficient similarity-based recommendation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above recommender generated reasonably effective recommendations, but was inefficient, as it required iterating over all items. We can make this implementation more efficient by considering a smaller candidate set of items to be iterated over: namely, we can restrict the search to those items that could possibly have non-zero Jaccard similarity."
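To see why this restriction is safe: any item sharing no users with the query has Jaccard similarity exactly zero, so skipping it cannot change the ranking of the non-zero candidates. A minimal sketch on toy interactions (hypothetical user/item IDs) comparing the full scan against the restricted one:

```python
from collections import defaultdict

# Toy interaction data (hypothetical IDs)
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)
for user, item in [('u1', 'i1'), ('u1', 'i2'), ('u2', 'i1'), ('u2', 'i3'), ('u3', 'i4')]:
    usersPerItem[item].add(user)
    itemsPerUser[user].add(item)

def Jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

query = 'i1'
users = usersPerItem[query]

# Full scan over every item in the catalog
full = {i2: Jaccard(users, usersPerItem[i2]) for i2 in usersPerItem if i2 != query}

# Restricted scan: only items co-purchased with the query by at least one user
candidates = set()
for u in users:
    candidates |= itemsPerUser[u]
fast = {i2: Jaccard(users, usersPerItem[i2]) for i2 in candidates if i2 != query}

# Every item the restricted scan skips has similarity exactly zero
assert all(sim == 0 for i2, sim in full.items() if i2 not in fast)
assert all(full[i2] == sim for i2, sim in fast.items())
```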
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specifically, this search is based on the _set of items that were purchased by any user who purchased the query item i._" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def mostSimilarFast(i):\n", " similarities = []\n", " users = usersPerItem[i]\n", " candidateItems = set()\n", " for u in users:\n", " candidateItems = candidateItems.union(itemsPerUser[u])\n", " for i2 in candidateItems:\n", " if i2 == i: continue\n", " sim = Jaccard(users, usersPerItem[i2])\n", " similarities.append((sim,i2))\n", " similarities.sort(reverse=True)\n", " return similarities[:10]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mostSimilarFast(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Collaborative-filtering-based rating estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use the similarity-based recommender we developed above to make predictions about user's ratings. Although this is not an example of machine learning, it is a simple heuristic that can be used to estimate a user's future ratings based on their ratings in the past." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specifically, a user's rating for an item is assumed to be a weighted sum of their previous ratings, weighted by how similar the query item is to each of their previous purchases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by building a few more utility data structures to keep track of all of the reviews by each user and for each item." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "reviewsPerUser = defaultdict(list)\n", "reviewsPerItem = defaultdict(list)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "for d in dataset:\n", " user,item = d['customer_id'], d['product_id']\n", " reviewsPerUser[user].append(d)\n", " reviewsPerItem[item].append(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we compute the rating mean. This will be used as a simple baseline, but will also be used as a \"default\" prediction in the event that the user has rated no previous items with a Jaccard similarity greater than zero (compared to the query)." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "ratingMean = sum([d['star_rating'] for d in dataset]) / len(dataset)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.251102772543146" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratingMean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our prediction function computes (a) a list of the user's previous ratings (ignoring the query item); and (b) a list of the similarities of these previous items, compared to the query. These weights are used to constructed a weighted average of the ratings from the first set." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def predictRating(user,item):\n", " ratings = []\n", " similarities = []\n", " for d in reviewsPerUser[user]:\n", " i2 = d['product_id']\n", " if i2 == item: continue\n", " ratings.append(d['star_rating'] - itemAverages[i2])\n", " similarities.append(Jaccard(usersPerItem[item],usersPerItem[i2]))\n", " if (sum(similarities) > 0):\n", " weightedRatings = [(x*y) for x,y in zip(ratings,similarities)]\n", " return itemAverages[item] + sum(weightedRatings) / sum(similarities)\n", " else:\n", " # User hasn't rated any similar items\n", " return ratingMean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try a simple example:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'customer_id': '14640079',\n", " 'helpful_votes': 0,\n", " 'marketplace': 'US',\n", " 'product_category': 'Musical Instruments',\n", " 'product_id': 'B003LRN53I',\n", " 'product_parent': '986692292',\n", " 'product_title': 'Sennheiser HD203 Closed-Back DJ Headphones',\n", " 'review_body': 'Nice headphones at a reasonable price.',\n", " 'review_date': '2015-08-31',\n", " 'review_headline': 'Five Stars',\n", " 'review_id': 'RZSL0BALIYUNU',\n", " 'star_rating': 5,\n", " 'total_votes': 0,\n", " 'verified_purchase': 'Y',\n", " 'vine': 'N'}" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[1]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "u,i = dataset[1]['customer_id'], dataset[1]['product_id']" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.509357030989021" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictRating(u, i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we evaluate the performace of our 
model:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def MSE(predictions, labels):\n", " differences = [(x-y)**2 for x,y in zip(predictions,labels)]\n", " return sum(differences) / len(differences)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "alwaysPredictMean = [ratingMean for d in dataset]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "cfPredictions = [predictRating(d['customer_id'], d['product_id']) for d in dataset]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "labels = [d['star_rating'] for d in dataset]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.4796142779564334" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MSE(alwaysPredictMean, labels)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.44672577948388" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MSE(cfPredictions, labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the accuracy of our rating prediction model was actually _worse_ (in terms of the MSE) than just predicting the mean rating. However note again that this is just a heuristic, and could be modified to improve its predictions (e.g. by using a different similarity function other than the Jaccard similarity)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Simple (bias only) latent factor-based recommender" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we'll use gradient descent to implement a machine-learning-based recommender (a latent-factor model)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a fairly difficult exercise, but brings together many of the ideas we've seen previously in this class, especially regarding gradient descent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we build some utility data structures to store the variables of our model (alpha, userBiases, and itemBiases)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "N = len(dataset)\n", "nUsers = len(reviewsPerUser)\n", "nItems = len(reviewsPerItem)\n", "users = list(reviewsPerUser.keys())\n", "items = list(reviewsPerItem.keys())" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "alpha = ratingMean" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "userBiases = defaultdict(float)\n", "itemBiases = defaultdict(float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The actual prediction function of our model is simple: Just predict using a global offset (alpha), a user offset (beta_u in the slides), and an item offset (beta_i)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "def prediction(user, item):\n", " return alpha + userBiases[user] + itemBiases[item]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use another library in this example to perform gradient descent. This library requires that we pass it a \"flat\" parameter vector (theta) containing all of our parameters. This utility function just converts between a flat feature vector, and our model parameters, i.e., it \"unpacks\" theta into our offset and bias parameters." 
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "def unpack(theta):\n", " global alpha\n", " global userBiases\n", " global itemBiases\n", " alpha = theta[0]\n", " userBiases = dict(zip(users, theta[1:nUsers+1]))\n", " itemBiases = dict(zip(items, theta[1+nUsers:]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"cost\" function is the function we are trying to optimize. Again this is a requirement of the gradient descent library we'll use. In this case, we're just computing the (regularized) MSE of a particular solution (theta), and returning the cost." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "def cost(theta, labels, lamb):\n", " unpack(theta)\n", " predictions = [prediction(d['customer_id'], d['product_id']) for d in dataset]\n", " cost = MSE(predictions, labels)\n", " print(\"MSE = \" + str(cost))\n", " for u in userBiases:\n", " cost += lamb*userBiases[u]**2\n", " for i in itemBiases:\n", " cost += lamb*itemBiases[i]**2\n", " return cost" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The derivative function is the most difficult to implement, but follows the definitions of the derivatives for this model as given in the lectures. This step could be avoided if using a gradient descent implementation based on (e.g.) Tensorflow." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "def derivative(theta, labels, lamb):\n", " unpack(theta)\n", " N = len(dataset)\n", " dalpha = 0\n", " dUserBiases = defaultdict(float)\n", " dItemBiases = defaultdict(float)\n", " for d in dataset:\n", " u,i = d['customer_id'], d['product_id']\n", " pred = prediction(u, i)\n", " diff = pred - d['star_rating']\n", " dalpha += 2/N*diff\n", " dUserBiases[u] += 2/N*diff\n", " dItemBiases[i] += 2/N*diff\n", " for u in userBiases:\n", " dUserBiases[u] += 2*lamb*userBiases[u]\n", " for i in itemBiases:\n", " dItemBiases[i] += 2*lamb*itemBiases[i]\n", " dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]\n", " return numpy.array(dtheta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the MSE of a trivial baseline (always predicting the mean) for comparison:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.4796142779564334" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MSE(alwaysPredictMean, labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can run gradient descent. This particular gradient descent library takes as inputs (1) Our cost function (implemented above); (2) Initial parameter values; (3) Our derivative function; and (4) Any additional arguments to be passed to the cost function (in this case the labels and the regularization strength)." 
] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE = 1.4796142779564334\n", "MSE = 1.468686355953835\n", "MSE = 2.696168718199957\n", "MSE = 1.4681419018494115\n", "MSE = 1.4523523347391183\n", "MSE = 1.4513575397272935\n", "MSE = 1.4476987674765165\n", "MSE = 1.4421925605950805\n", "MSE = 1.4415262672088112\n", "MSE = 1.4413460037417385\n", "MSE = 1.441397612244052\n", "MSE = 1.4414066017099247\n" ] }, { "data": { "text/plain": [ "(array([ 4.24278450e+00, 6.58050335e-04, -2.68137205e-04, ...,\n", " -5.81392897e-03, 8.31819064e-04, 1.39862043e-03]),\n", " 1.4574364057349312,\n", " {'funcalls': 12,\n", " 'grad': array([-4.52665781e-07, -4.29225921e-09, 1.58622111e-09, ...,\n", " 1.97006613e-08, -5.16161175e-09, -9.15540124e-09]),\n", " 'nit': 9,\n", " 'task': b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL',\n", " 'warnflag': 0})" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + [0.0]*(nUsers+nItems),\n", " derivative, args = (labels, 0.001))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Complete latent factor model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code extends the example above to implement a complete latent factor model (i.e., including low-dimensional user and item terms)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first initialize our parameters as before:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "alpha = ratingMean" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "userBiases = defaultdict(float)\n", "itemBiases = defaultdict(float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each user and item we now have a low dimensional descriptor (representing that user's preferences, and that item's properties), of dimension K." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "userGamma = {}\n", "itemGamma = {}" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "K = 2" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "for u in reviewsPerUser:\n", " userGamma[u] = [random.random() * 0.1 - 0.05 for k in range(K)]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "for i in reviewsPerItem:\n", " itemGamma[i] = [random.random() * 0.1 - 0.05 for k in range(K)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again we must implement an \"unpack\" function. This is the same as before, though has some additional terms." 
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "def unpack(theta):\n", " global alpha\n", " global userBiases\n", " global itemBiases\n", " global userGamma\n", " global itemGamma\n", " index = 0\n", " alpha = theta[index]\n", " index += 1\n", " userBiases = dict(zip(users, theta[index:index+nUsers]))\n", " index += nUsers\n", " itemBiases = dict(zip(items, theta[index:index+nItems]))\n", " index += nItems\n", " for u in users:\n", " userGamma[u] = theta[index:index+K]\n", " index += K\n", " for i in items:\n", " itemGamma[i] = theta[index:index+K]\n", " index += K" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, our cost and derivative functions serve the same role as before, though their implementations are somewhat more complicated." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "def inner(x, y):\n", " return sum([a*b for a,b in zip(x,y)])" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "def prediction(user, item):\n", " return alpha + userBiases[user] + itemBiases[item] + inner(userGamma[user], itemGamma[item])" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "def cost(theta, labels, lamb):\n", " unpack(theta)\n", " predictions = [prediction(d['customer_id'], d['product_id']) for d in dataset]\n", " cost = MSE(predictions, labels)\n", " print(\"MSE = \" + str(cost))\n", " for u in users:\n", " cost += lamb*userBiases[u]**2\n", " for k in range(K):\n", " cost += lamb*userGamma[u][k]**2\n", " for i in items:\n", " cost += lamb*itemBiases[i]**2\n", " for k in range(K):\n", " cost += lamb*itemGamma[i][k]**2\n", " return cost" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "def derivative(theta, labels, lamb):\n", " unpack(theta)\n", " N = len(dataset)\n", " dalpha = 0\n", " dUserBiases = defaultdict(float)\n", 
" dItemBiases = defaultdict(float)\n", " dUserGamma = {}\n", " dItemGamma = {}\n", " for u in reviewsPerUser:\n", " dUserGamma[u] = [0.0 for k in range(K)]\n", " for i in reviewsPerItem:\n", " dItemGamma[i] = [0.0 for k in range(K)]\n", " for d in dataset:\n", " u,i = d['customer_id'], d['product_id']\n", " pred = prediction(u, i)\n", " diff = pred - d['star_rating']\n", " dalpha += 2/N*diff\n", " dUserBiases[u] += 2/N*diff\n", " dItemBiases[i] += 2/N*diff\n", " for k in range(K):\n", " dUserGamma[u][k] += 2/N*itemGamma[i][k]*diff\n", " dItemGamma[i][k] += 2/N*userGamma[u][k]*diff\n", " for u in userBiases:\n", " dUserBiases[u] += 2*lamb*userBiases[u]\n", " for k in range(K):\n", " dUserGamma[u][k] += 2*lamb*userGamma[u][k]\n", " for i in itemBiases:\n", " dItemBiases[i] += 2*lamb*itemBiases[i]\n", " for k in range(K):\n", " dItemGamma[i][k] += 2*lamb*itemGamma[i][k]\n", " dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]\n", " for u in users:\n", " dtheta += dUserGamma[u]\n", " for i in items:\n", " dtheta += dItemGamma[i]\n", " return numpy.array(dtheta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again we optimize using our gradient descent library, and compare to a simple baseline." 
] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.4796142779564334" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MSE(alwaysPredictMean, labels)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE = 1.4796187542512018\n", "MSE = 1.4775956560775485\n", "MSE = 1.4700880535730312\n", "MSE = 224.03085248619777\n", "MSE = 1.475994500700181\n", "MSE = 1.4428354361657811\n", "MSE = 1.4370243020059277\n", "MSE = 1.4388863162867522\n", "MSE = 1.4401974063184106\n", "MSE = 1.441302615984079\n", "MSE = 1.4413924433995964\n", "MSE = 1.4413770291838046\n", "MSE = 1.4413879601662825\n" ] }, { "data": { "text/plain": [ "(array([ 4.24278455e+00, 8.34803069e-04, -2.68754387e-04, ...,\n", " 1.46572165e-07, 2.04229088e-06, -5.12669050e-06]),\n", " 1.4574363611735737,\n", " {'funcalls': 13,\n", " 'grad': array([-5.02197680e-07, 1.75296214e-10, -5.17488426e-11, ...,\n", " 2.93099595e-10, 4.07065191e-09, -1.02519340e-08]),\n", " 'nit': 10,\n", " 'task': b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL',\n", " 'warnflag': 0})" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + # Initialize alpha\n", " [0.0]*(nUsers+nItems) + # Initialize beta\n", " [random.random() * 0.1 - 0.05 for k in range(K*(nUsers+nItems))], # Gamma\n", " derivative, args = (labels, 0.001))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note finally that in the above exercise we only computed the _training_ error of our model, i.e., we never confirmed that it works well on held-out (validation/testing) data!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "import json\n", "import matplotlib.pyplot as plt\n", "import matplotlib as mpl\n", "import numpy\n", "import math\n", "import dateutil.parser\n", "from collections import defaultdict\n", "from mpl_toolkits.mplot3d import Axes3D\n", "from matplotlib.patches import FancyArrowPatch\n", "from mpl_toolkits.mplot3d import proj3d\n", "from math import sin, cos\n", "from mpl_toolkits.mplot3d.art3d import Poly3DCollection" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "plt.switch_backend('PS') \n", "mpl.use('PS')\n", "mpl.rcParams['text.usetex']=True\n", "mpl.rcParams['text.latex.unicode']=True" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "user0 = userGamma[list(userGamma.keys())[1]]" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([5.13196468e-07, 5.63456655e-07])" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "user0" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "items = list(itemGamma.keys())[:500]" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "def dist(X,Y):\n", " d = 0\n", " for x,y in zip(X,Y):\n", " d += (x-y)**2\n", " return math.sqrt(d)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "innerProducts = [(inner(itemGamma[i],user0), i) for i in 
items]\n", "distances = [(-dist(itemGamma[i],user0), i) for i in items]" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "innerProducts.sort()\n", "distances.sort()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "max5 = [i[1] for i in innerProducts[-5:]]\n", "max5d = [i[1] for i in distances[-5:]]" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "xs = [itemGamma[i][0] for i in items]\n", "ys = [itemGamma[i][1] for i in items]" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "xs5 = [itemGamma[i][0] for i in max5]\n", "ys5 = [itemGamma[i][1] for i in max5]" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "xs5d = [itemGamma[i][0] for i in max5d]\n", "ys5d = [itemGamma[i][1] for i in max5d]" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(2.5,2.5))\n", "ax = plt.subplot(111)\n", "ax.set_position([0.1,0.1,0.85,0.8])\n", "plt.xlim(-0.000015,0.000035)\n", "plt.ylim(-0.000015,0.000035)\n", "plt.plot([0,0],[-0.0001,0.0001],color='k',lw=0.25,zorder=1)\n", "plt.plot([-0.0001,0.0001],[0,0],color='k',lw=0.25,zorder=2)\n", "plt.arrow(0,0,user0[0],user0[1],zorder=4,color='k',width=0.00000005)\n", "plt.scatter(xs,ys,lw=0,color='grey',zorder=3,s=3)\n", "plt.scatter(xs5,ys5,lw=0,color='k',zorder=5,s=8)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r\"$\\gamma_u[0]$\")\n", "plt.ylabel(r\"$\\gamma_u[1]$\")\n", "plt.title(\"Maximum inner product\")\n", "plt.savefig(\"recPointsInner.pdf\")" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(2.5,2.5))\n", "ax = plt.subplot(111)\n", "ax.set_position([0.1,0.1,0.85,0.8])\n", "plt.xlim(-0.000002,0.000002)\n", "plt.ylim(-0.000002,0.000002)\n", 
"plt.plot([0,0],[-0.0001,0.0001],color='k',lw=0.25,zorder=1)\n", "plt.plot([-0.0001,0.0001],[0,0],color='k',lw=0.25,zorder=2)\n", "plt.plot([0,user0[0]],[0,user0[1]],zorder=4,color='k')\n", "plt.scatter(xs,ys,lw=0,color='grey',zorder=3,s=3)\n", "plt.scatter(xs5d,ys5d,lw=0,color='k',zorder=5,s=8)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r\"$\\gamma_u[0]$\")\n", "plt.ylabel(r\"$\\gamma_u[1]$\")\n", "plt.title(\"Nearest neighbors\")\n", "plt.savefig(\"recPointsNN.pdf\")" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [], "source": [ "from matplotlib.patches import Circle, Wedge, Polygon\n", "from matplotlib.collections import PatchCollection" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "mpl.rc('image', cmap='gray')" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [], "source": [ "user0 = userGamma[list(userGamma.keys())[1]]\n", "#user1 = userGamma[list(userGamma.keys())[1]]" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [], "source": [ "# Angle of the user vector in degrees (arctan2 is robust in all quadrants)\n", "theta = numpy.arctan2(user0[1], user0[0]) * (180/math.pi)" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "47.672734088519526" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/lib/python3/dist-packages/matplotlib/pyplot.py:516: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. 
(To control this warning, see the rcParam `figure.max_open_warning`).\n", " max_open_warning, RuntimeWarning)\n" ] } ], "source": [ "plt.figure(figsize=(2.5,2.5))\n", "ax = plt.subplot(111)\n", "ax.set_position([0.1,0.1,0.85,0.8])\n", "plt.set_cmap('gray')\n", "plt.xlim(-0.000002,0.000002)\n", "plt.ylim(-0.000002,0.000002)\n", "plt.plot([0,0],[-0.0001,0.0001],color='k',lw=0.25,zorder=1)\n", "plt.plot([-0.0001,0.0001],[0,0],color='k',lw=0.25,zorder=2)\n", "plt.plot([0,user0[0]],[0,user0[1]],zorder=4,color='k')\n", "#plt.plot([0,user1[0]],[0,user1[1]],zorder=4,color='k')\n", "plt.scatter(xs,ys,lw=0,color='grey',zorder=3,s=3)\n", "#plt.scatter(xs5d,ys5d,lw=0,color='k',zorder=5,s=8)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r\"$\\gamma_u[0]$ (e.g.~action)\")\n", "plt.ylabel(r\"$\\gamma_u[1]$ (e.g.~comedy)\")\n", "plt.title(\"Maximum inner product\")\n", "patches = []\n", "patches.append(Wedge((0, 0), .1, theta-20, theta+20, color='k'))\n", "patches.append(Wedge((0, 0), .1, theta-15, theta+15, color='k'))\n", "patches.append(Wedge((0, 0), .1, theta-10, theta+10, color='k'))\n", "patches.append(Wedge((0, 0), .1, theta-5, theta+5, color='k'))\n", "colors = [1]\n", "p = PatchCollection(patches, alpha=0.1, lw=0)\n", "p.set_array(numpy.array(colors))\n", "ax.add_collection(p)\n", "plt.savefig(\"IPSim.pdf\")" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/lib/python3/dist-packages/matplotlib/pyplot.py:516: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. 
(To control this warning, see the rcParam `figure.max_open_warning`).\n", " max_open_warning, RuntimeWarning)\n" ] } ], "source": [ "plt.figure(figsize=(2.5,2.5))\n", "ax = plt.subplot(111)\n", "ax.set_position([0.1,0.1,0.85,0.8])\n", "plt.set_cmap('gray')\n", "plt.xlim(-0.000002,0.000002)\n", "plt.ylim(-0.000002,0.000002)\n", "plt.plot([0,0],[-0.0001,0.0001],color='k',lw=0.25,zorder=1)\n", "plt.plot([-0.0001,0.0001],[0,0],color='k',lw=0.25,zorder=2)\n", "plt.plot([0,user0[0]],[0,user0[1]],zorder=4,color='k')\n", "#plt.plot([0,user1[0]],[0,user1[1]],zorder=4,color='k')\n", "plt.scatter(xs,ys,lw=0,color='grey',zorder=3,s=3)\n", "#plt.scatter(xs5d,ys5d,lw=0,color='k',zorder=5,s=8)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r\"$\\gamma_u[0]$ (e.g.~action)\")\n", "plt.ylabel(r\"$\\gamma_u[1]$ (e.g.~comedy)\")\n", "plt.title(\"Nearest neighbors\")\n", "patches = []\n", "patches.append(Circle((user0[0], user0[1]), 0.0000002, color='k'))\n", "patches.append(Circle((user0[0], user0[1]), 0.0000004, color='k'))\n", "patches.append(Circle((user0[0], user0[1]), 0.0000006, color='k'))\n", "patches.append(Circle((user0[0], user0[1]), 0.0000008, color='k'))\n", "colors = [1]\n", "p = PatchCollection(patches, alpha=0.1, lw=0)\n", "p.set_array(numpy.array(colors))\n", "ax.add_collection(p)\n", "plt.savefig(\"NNSim.pdf\")" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(2.5,2.5))\n", "ax = plt.subplot(111)\n", "ax.set_position([0.1,0.1,0.85,0.8])\n", "plt.xlim(-0.0000005,0.0000005)\n", "plt.ylim(-0.0000005,0.0000005)\n", "plt.plot([0,0],[-0.0001,0.0001],color='k',lw=0.25,zorder=1)\n", "plt.plot([-0.0001,0.0001],[0,0],color='k',lw=0.25,zorder=2)\n", "# plt.plot([0,user0[0]],[0,user0[1]],zorder=4,color='k')\n", "plt.scatter(xs,ys,lw=0,color='grey',zorder=3,s=3)\n", 
"#plt.scatter(xs5d,ys5d,lw=0,color='k',zorder=5,s=8)\n", "plt.arrow(xs5d[0],ys5d[0],xs5d[3]-xs5d[0],ys5d[3]-ys5d[0],zorder=4,color='k',width=0.0000000015)\n", "ax.annotate(r\"$\\gamma_i^{(\\mathrm{start})}$\",\n", " xy=(xs5d[0] + 0.00000002,ys5d[0]), textcoords='data'\n", " )\n", "ax.annotate(r\"$\\gamma_i^{(\\mathrm{end})}$\",\n", " xy=(xs5d[3] + 0.00000002,ys5d[3]), textcoords='data'\n", " )\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r\"$\\gamma_i[0]$\")\n", "plt.ylabel(r\"$\\gamma_i[1]$\")\n", "plt.title(\"Translation in Latent Space\")\n", "plt.savefig(\"translationIdea.pdf\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }