import csv
import dateutil.parser # for parsing plain-text dates
import demjson # json (loose formatting)
import json
import math
import matplotlib.pyplot as plt
import numpy
import sklearn
from collections import defaultdict
from sklearn import linear_model
Data is available at http://cseweb.ucsd.edu/~jmcauley/pml/data/. Download and save to your own directory.
dataDir = "/home/jmcauley/pml_data/"
Read in a small dataset of reviews of fantasy novels (from Goodreads)
path = dataDir + "fantasy_100.json"
f = open(path)
data = []
for l in f:
    d = json.loads(l)
    data.append(d)
f.close()
What does the data look like? (json format)
data[0]
To predict ratings from review length, let's start by assembling vectors of both
ratings = [d['rating'] for d in data]
lengths = [len(d['review_text']) for d in data]
And generating a quick scatter plot of the data to see overall trends...
plt.scatter(lengths, ratings, color='grey')
plt.xlim(0, 5500)
plt.ylim(0.5, 5.5)
plt.xlabel("Review length (characters)")
plt.ylabel("Rating")
plt.title("Rating vs. review length")
plt.show()
To perform regression, convert to features (X) and labels (y)
X = numpy.matrix([[1,l] for l in lengths]) # Note the inclusion of the constant term
y = numpy.matrix(ratings).T
Fit a linear regression model to the data (sklearn)
model = sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)
Extract the model coefficients (theta)
theta = model.coef_
theta
Same thing using numpy (lstsq)...
theta,residuals,rank,s = numpy.linalg.lstsq(X, y, rcond=None)
theta
Same thing by computing the pseudoinverse directly
numpy.linalg.inv(X.T*X)*X.T*y
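As a sanity check (not part of the original notebook), all three approaches — sklearn, lstsq, and the pseudoinverse — should recover the same coefficients. A minimal sketch on synthetic data:

```python
import numpy
from sklearn import linear_model

# Synthetic data: y = 2 + 3x plus a little noise
rng = numpy.random.default_rng(0)
xs = rng.uniform(0, 10, size=50)
ys = 2 + 3 * xs + rng.normal(0, 0.1, size=50)

X = numpy.column_stack([numpy.ones_like(xs), xs])  # constant term + feature

# (1) sklearn
model = linear_model.LinearRegression(fit_intercept=False)
model.fit(X, ys)
theta_sklearn = model.coef_

# (2) least squares via numpy
theta_lstsq, _, _, _ = numpy.linalg.lstsq(X, ys, rcond=None)

# (3) the pseudoinverse (X^T X)^{-1} X^T y, computed directly
theta_pinv = numpy.linalg.inv(X.T @ X) @ X.T @ ys

# All three agree to numerical precision
assert numpy.allclose(theta_sklearn, theta_lstsq)
assert numpy.allclose(theta_sklearn, theta_pinv)
```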
Plot the line of best fit...
xplot = numpy.arange(0,5501,10)
yplot = [(theta[0] + theta[1]*x).item() for x in xplot]
plt.scatter(lengths, ratings, color='grey')
plt.plot(numpy.array(xplot), yplot, color = 'k', linestyle = '--',\
label = r"$3.983 + 1.193 \times 10^{-4} \mathit{length}$")
plt.xlim(0, 5500)
plt.ylim(0.5, 5.5)
plt.xlabel("Review length (characters)")
plt.ylabel("Rating")
plt.title("Rating vs. review length")
plt.legend(loc='lower right')
plt.show()
Fit a slightly more complex model with two features (review length and number of comments)
def feature(d):
    feat = [1] # Constant feature
    feat.append(len(d['review_text'])) # Length of review
    feat.append(d['n_comments']) # Number of comments
    return feat
X = numpy.matrix([feature(d) for d in data])
Fit the model and extract the coefficients
model = sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)
theta = model.coef_
theta
Extract model predictions
y_pred = model.predict(X)
Sum of squared errors (SSE)
sse = sum([x**2 for x in (y - y_pred)])
Mean squared error (MSE)
mse = sse / len(y)
mse
Fraction of variance unexplained (FVU), from which we get the R^2 coefficient
fvu = mse / numpy.var(y)
r2 = 1 - fvu
r2
Can also get the R^2 coefficient directly from the library...
model.score(X, y)
Polynomial (quadratic) function
def feature(d):
    feat = [1]
    feat.append(len(d['review_text']))
    feat.append((len(d['review_text']))**2) # Quadratic term
    return feat
X = numpy.matrix([feature(d) for d in data])
model = sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)
theta = model.coef_
theta
Compute the R^2 coefficient (using the library)
model.score(X, y)
Cubic function
def feature(d):
    feat = [1]
    feat.append(len(d['review_text'])/1000)
    feat.append((len(d['review_text'])/1000)**2) # Quadratic term
    feat.append((len(d['review_text'])/1000)**3) # Cubic term
    return feat
X = numpy.matrix([feature(d) for d in data])
model = sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)
theta = model.coef_
theta
model.score(X, y)
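Aside (not from the original notebook): the cubic model divides lengths by 1000. One plausible motivation is numerical conditioning — raw lengths cubed reach ~10^11, so the columns of X differ by many orders of magnitude and X^T X becomes nearly singular. A small sketch with hypothetical review lengths:

```python
import numpy

# Hypothetical review lengths (characters)
lengths = numpy.array([120.0, 850.0, 2300.0, 4100.0, 5400.0])

# Unscaled cubic features: columns range from 1 up to ~1.6e11
X_raw = numpy.column_stack([numpy.ones_like(lengths), lengths, lengths**2, lengths**3])

# Rescaled by 1000, as in the cubic model above: columns stay within a few orders of magnitude
ls = lengths / 1000
X_scaled = numpy.column_stack([numpy.ones_like(ls), ls, ls**2, ls**3])

cond_raw = numpy.linalg.cond(X_raw)
cond_scaled = numpy.linalg.cond(X_scaled)
assert cond_scaled < cond_raw  # rescaling makes X far better conditioned
```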
Use a (slightly larger) dataset of fantasy reviews
path = dataDir + "fantasy_10000.json"
f = open(path)
data = []
for l in f:
    d = json.loads(l)
    data.append(d)
f.close()
Extract average ratings for each day of the week
weekAverages = defaultdict(list)
for d in data:
    timeString = d['date_added']
    t = dateutil.parser.parse(timeString)
    weekAverages[t.weekday()].append(d['rating'])
for k in weekAverages:
    weekAverages[k] = sum(weekAverages[k]) / len(weekAverages[k])
weekAverages
Plot the averages for each day
xplot = list(weekAverages.keys())
xplot.sort()
yplot = [weekAverages[k] for k in xplot]
plt.bar(xplot,yplot,color='grey',lw=0)
plt.xticks(numpy.arange(7), ["M", "T", "W", "T", "F", "S", "S"]) # weekday() returns 0 for Monday
plt.xlim(-0.7, 6.7)
plt.ylim(3,4)
plt.ylabel("Rating")
plt.xlabel("Day of week")
plt.title(r"Rating vs. day of week")
plt.show()
Read a small dataset of beer reviews with gender attributes
path = dataDir + "beer_500.json"
f = open(path)
data = []
for l in f:
    d = demjson.decode(l) # Handles the looser json format in this file (fairly slow)
    data.append(d)
f.close()
What does the data look like?
data[0]
Filter the dataset to include only those users who specified gender
data = [d for d in data if 'user/gender' in d]
How many users have specified gender?
len(data)
Binary representation of gender attribute
def feat(d):
    return [1, 1.0 * (d['user/gender'] == 'Female')]
X = [feat(d) for d in data]
y = [len(d['review/text'].split()) for d in data]
Fit a model to the binary feature
model = sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)
theta = model.coef_
theta
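With a single binary feature plus a constant, the coefficients have a direct interpretation: theta[0] is the mean label for the zero group, and theta[0] + theta[1] is the mean for the one group. A quick check on hypothetical review lengths (not data from the notebook):

```python
import numpy
from sklearn import linear_model

# Hypothetical review lengths (words) for two groups
male_lengths = [80, 120, 100]
female_lengths = [150, 130]

X = [[1, 0]] * len(male_lengths) + [[1, 1]] * len(female_lengths)
y = male_lengths + female_lengths

model = linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)
theta = model.coef_

# theta[0] recovers the male group mean; theta[0] + theta[1] the female group mean
assert numpy.isclose(theta[0], numpy.mean(male_lengths))
assert numpy.isclose(theta[0] + theta[1], numpy.mean(female_lengths))
```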
The model indicates that females tend to leave slightly longer reviews than males. Plot data from the two groups.
xplot = numpy.arange(-1,3,1)
yplot = [theta[0] + theta[1]*x for x in xplot]
plt.plot(xplot, yplot, color='k')
plt.scatter([x[1] for x in X], y, color='k', s=2)
plt.xlim(-0.3, 1.3)
plt.ylim(0, 550)
plt.xticks([0,1])
plt.xlabel("is female?")
plt.ylabel(r"review length (words)")
plt.title(r"gender vs. length")
plt.show()
Dataset of reddit submissions (and resubmissions)
f = open(dataDir + "redditSubmissions.csv")
cs = csv.reader(f)
header = next(cs)
header
Compute popularity as a function of submission number (i.e., how many times has this identical image been submitted)
pops = defaultdict(list)
imId = None
count = 0
for l in cs:
    d = dict(zip(header, l))
    if d['#image_id'] != imId:
        count = 0
        imId = d['#image_id']
    count += 1
    try:
        pops[count].append(int(d['number_of_upvotes']))
    except Exception:
        continue
for x in pops:
    pops[x] = sum(pops[x]) / len(pops[x])
Extract a single feature which is just the submission number (1 for the first submission, 2 for the first resubmission, and so on)
x = list(pops.keys())
x.sort()
X = [[1,a] for a in x]
The label to predict is the number of upvotes as a function of submission number
y = [pops[a] for a in x]
ylog = [math.log2(a) for a in y]
Fit two models: one with the original output variable, one with a log-transformed output variable
mod = linear_model.LinearRegression(fit_intercept = False)
modlog = linear_model.LinearRegression(fit_intercept = False)
mod.fit(X,y)
modlog.fit(X,ylog)
theta = mod.coef_
thetalog = modlog.coef_
Plot data and fitted values for the two models
plt.plot(x,y,color='k')
plt.xlim(1,30)
plt.ylim(0,3000)
x1 = [0,30]
y1 = [theta[0] + theta[1]*a for a in x1]
plt.plot(x1,y1,color='grey')
plt.xlabel("submission number")
plt.ylabel("upvotes")
plt.yticks([0,1000,2000,3000])
plt.title("submission no. vs. upvotes")
plt.show()
plt.plot(x,ylog, color='k')
plt.xlim(1,30)
x1 = [0,30]
y1 = [thetalog[0] + thetalog[1]*a for a in x1]
plt.plot(x1,y1, color='grey')
plt.xlabel("submission number")
plt.ylabel("log upvotes")
plt.yticks([2,4,6,8,10])
plt.title("submission no. vs. upvotes")
plt.show()
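One detail worth noting (an aside, not from the original notebook): the log model's predictions live in log2 space, so to compare the two models in terms of upvotes the transform has to be inverted with 2**prediction. A sketch on hypothetical per-submission averages:

```python
import math
from sklearn import linear_model

# Hypothetical average upvotes per submission number, decaying roughly geometrically
x = [1, 2, 3, 4, 5]
y = [2000, 950, 510, 240, 130]
X = [[1, a] for a in x]
ylog = [math.log2(a) for a in y]

modlog = linear_model.LinearRegression(fit_intercept=False)
modlog.fit(X, ylog)

# Predictions are in log2 space; invert the transform to recover upvotes
pred_log = modlog.predict(X)
pred = [2 ** p for p in pred_log]

# The inverted predictions are positive and decay with submission number
assert all(p > 0 for p in pred)
assert pred[0] > pred[-1]
```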
Simple regression question, same form as the examples above
path = dataDir + "fantasy_100.json"
f = open(path)
data = []
for l in f:
    d = json.loads(l)
    data.append(d)
f.close()
ratings = [d['rating'] for d in data]
lengths = [len(d['review_text']) for d in data]
X = numpy.matrix([[1,l] for l in lengths]) # Note the inclusion of the constant term
y = numpy.matrix(ratings).T
model = sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)
Extract theta
theta = model.coef_
theta
y_pred = model.predict(X)
sse = sum([r**2 for r in y - y_pred])
MSE
mse = sse / len(y)
mse
Fit a model with an additional variable
def feat(d):
    f = [1]
    f.append(len(d['review_text']))
    f.append(d['n_comments'])
    return f
X = [feat(d) for d in data]
model = sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)
Extract theta
theta = model.coef_
theta
Compute the MSE
y_pred = model.predict(X)
sse = sum([r**2 for r in y - y_pred])
mse = sse / len(y)
mse
(explanation: the coefficient of length becomes smaller, as the variability in ratings is already largely explained by the number of comments)
Sketch proof:
$\frac{\partial}{\partial \theta_0} \sum_i (\theta_0 - y_i)^2 = 2\sum_i (\theta_0 - y_i)$
equals zero when $\sum_i \theta_0 = \sum_i y_i$, i.e., $\theta_0 = \frac{1}{N} \sum_i y_i$
Sketch:
$\frac{\partial}{\partial \theta_0} \sum_i | \theta_0 - y_i | = \sum_i \delta(y_i > \theta_0) - \sum_i \delta(y_i < \theta_0)$
zero when $\theta_0$ is the median value of $y$ (meaning that the two terms balance out)
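The two sketches above can be checked numerically (an illustration, not part of the original): scanning a grid of candidate constants, the squared-error objective bottoms out at the mean and the absolute-error objective at the median.

```python
import numpy

y = numpy.array([1.0, 2.0, 2.0, 3.0, 10.0])  # hypothetical labels, with an outlier

# Evaluate both objectives over a grid of candidate constant predictors
thetas = numpy.linspace(0, 12, 1201)
sq_err = [numpy.sum((t - y) ** 2) for t in thetas]
abs_err = [numpy.sum(numpy.abs(t - y)) for t in thetas]

best_sq = thetas[numpy.argmin(sq_err)]
best_abs = thetas[numpy.argmin(abs_err)]

assert numpy.isclose(best_sq, numpy.mean(y), atol=0.01)   # mean minimizes squared error
assert abs(best_abs - numpy.median(y)) < 0.01             # median minimizes absolute error
```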
Sketch:
$\max_\theta \prod_i \frac{1}{2b}\exp(-\frac{|x_i\cdot \theta - y_i|}{b})$
$= \max_\theta \sum_i -| x_i \cdot \theta - y_i |$
$= \min_\theta \sum_i | x_i \cdot \theta - y_i|$
Sketch:
rewrite $\sum_i (x_i \cdot \theta - y_i)^2$ as $(X\theta - y)^T(X\theta - y)$
$= (\theta^TX^T - y^T)(X\theta - y)$
$= \theta^TX^TX\theta -2y^TX\theta +y^Ty$ (all terms are scalar)
$\frac{\partial}{\partial \theta} (\theta^TX^TX\theta -2y^TX\theta -y^Ty) = 2(\theta^T X^TX - y^TX)$
$= 0$ when $\theta^TX^TX = y^TX$; or when $X^T X \theta = X^Ty$
i.e., when $\theta = (X^TX)^{-1}X^Ty$
Similar to 2.3: when solving for $\theta$ by computing $\frac{\partial}{\partial \theta_0} \sum_i (x_i \cdot \theta - y_i)^2 = 2\sum_i (x_i \cdot \theta - y_i)$ (since the constant feature $x_{i,0} = 1$), the expression will be minimized when $\sum_i (x_i \cdot \theta - y_i) = 0$
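The first-order condition $X^TX\theta = X^Ty$ says the residuals are orthogonal to every feature column; in particular, with a constant feature the residuals sum to zero. A quick numerical check on synthetic data (an illustration, not from the original):

```python
import numpy

rng = numpy.random.default_rng(1)
X = numpy.column_stack([numpy.ones(30), rng.normal(size=30)])
y = rng.normal(size=30)

theta, _, _, _ = numpy.linalg.lstsq(X, y, rcond=None)
residuals = X @ theta - y

# X^T (X theta - y) = 0 at the optimum: residuals are orthogonal to every feature,
# and (via the constant column) sum to zero
assert numpy.allclose(X.T @ residuals, 0, atol=1e-8)
assert abs(residuals.sum()) < 1e-8
```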