Recommender Systems Datasets
Julian McAuley, UCSD
Description
This page contains a collection of recommender systems datasets that have been used for research in my lab. Datasets contain the following features:
- user/item interactions
- star ratings
- timestamps
- product reviews
- social networks
- item-to-item relationships (e.g. copurchases)
- product images
- price, brand, and category information
- GPS data
- other metadata
Please cite the appropriate reference if you use any of the datasets below.
Datasets are in (loose) JSON format unless specified otherwise, meaning they can be treated as Python dictionary objects. A simple script to read JSON-formatted data is as follows:
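A minimal sketch of such a reader, assuming each file is gzip-compressed with one record per line (an assumption; check the individual dataset pages). The function name `parse` is hypothetical. Because "loose" JSON may contain Python-style literals (single quotes, `True`/`False`), the sketch tries strict JSON first and falls back to `ast.literal_eval`, which evaluates literals only and does not execute arbitrary code:

```python
import ast
import gzip
import json


def parse(path):
    """Yield one dict per non-empty line of a gzip-compressed file.

    Assumes one record per line (not confirmed for every dataset).
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                # Strict JSON records
                yield json.loads(line)
            except json.JSONDecodeError:
                # "Loose" records written as Python dict literals
                yield ast.literal_eval(line)
```

Typical use would be `for record in parse("some_dataset.json.gz"): ...`, where the file name is a placeholder.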
Rough Directory
The datasets below can be roughly organized in terms of the types of metadata they contain:
Review text: see Amazon, BeerAdvocate, RateBeer, Google Local
Item-to-item relationships: Amazon
Q/A data: Amazon Q/A
Geographical data: Google Local
Bundle data: Steam
Peer-to-peer trades: Tradesy, Ratebeer, Gameswap
Social connections: Librarything, Epinions
Fit feedback: Modcloth, Renttherunway
Multiple aspects: BeerAdvocate, RateBeer
Amazon Product Reviews
Description
This dataset is a large crawl of product reviews from Amazon, containing 82.83 million unique reviews from roughly 21 million users.
Basic statistics
| Ratings: | 82.83 million |
| Users: | 20.98 million |
| Items: | 9.35 million |
| Timespan: | May 1996 - July 2014 |
Metadata
- reviews and ratings
- item-to-item relationships (e.g. "people who bought X also bought Y")
- timestamps
- helpfulness votes
- product image (and CNN features)
- price
- category
- salesRank
Example
Download link
See the Amazon Dataset Page for download information.
Citation
Please cite the following if you use the data:
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
pdf
Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
pdf
Amazon Question and Answer Data
Description
These datasets contain questions and answers about products from the Amazon dataset above.
Basic statistics
| Questions: | 1.48 million |
| Answers: | 4,019,744 |
| Labeled yes/no questions: | 309,419 |
| Number of unique products with questions: | 191,185 |
Metadata
- question and answer text
- is the question binary (yes/no), and if so does it have a yes/no answer?
- timestamps
- product ID (to reference the review dataset)
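Since each question carries a product ID that references the review dataset, the two can be joined on that ID. The following is a rough sketch, assuming both datasets have already been loaded as lists of dictionaries and that the product ID is stored under the key `asin` (an assumption; pass a different `key` if the files use another name). The function names are hypothetical.

```python
from collections import defaultdict


def group_by_product(records, key="asin"):
    """Index records by product ID; `key` is an assumed field name."""
    grouped = defaultdict(list)
    for record in records:
        if key in record:
            grouped[record[key]].append(record)
    return grouped


def join_questions_with_reviews(questions, reviews, key="asin"):
    """Yield (question, matching_reviews) pairs sharing a product ID."""
    reviews_by_product = group_by_product(reviews, key)
    for question in questions:
        yield question, reviews_by_product.get(question.get(key), [])
```

Questions whose product has no reviews are still yielded, paired with an empty list, so the caller can decide whether to drop them.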
Example
Download link
See the Amazon Q/A Page for download information.
Citation
Please cite the following if you use the data:
Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems
Mengting Wan, Julian McAuley
International Conference on Data Mining (ICDM), 2016
pdf
Addressing complex and subjective product-related queries with customer reviews
Julian McAuley, Alex Yang
World Wide Web (WWW), 2016
pdf
Google Local Reviews
Description
These datasets contain reviews about businesses from Google Local (Google Maps). Data includes geographic information for each business as well as reviews.
Basic statistics
| Reviews: | 11,453,845 |
| Users: | 4,567,431 |
| Businesses: | 3,116,785 |
Metadata
- reviews and ratings
- GPS coordinates and address
- user information (places lived, jobs)
- timestamps
- business category, opening hours, etc.
Example (review)
Example (business)
Download links
Places Data (276mb)
User Data (178mb)
Review Data (1.4gb)
Citation
Please cite the following if you use the data:
Translation-based factorization machines for sequential recommendation
Rajiv Pasricha, Julian McAuley
RecSys, 2018
pdf
Translation-based recommendation
Ruining He, Wang-Cheng Kang, Julian McAuley
RecSys, 2017
pdf
Steam Video Game and Bundle Data
Description
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.
Basic statistics
| Reviews: | 59,305 |
| Purchases: | 5,153,209 |
| Users: | 88,310 |
| Items: | 10,978 |
| Bundles: | 615 |
Metadata
- reviews
- purchases, plays, recommends ("likes")
- product bundles
- pricing information
Example (bundle)
Download links
Review Data (6.7mb)
User and Item Data (71mb)
Bundle Data (92kb)
Citation
Please cite the following if you use the data:
Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
pdf
Generating and personalizing bundle recommendations on Steam
Apurva Pathak, Kshitiz Gupta, Julian McAuley
SIGIR, 2017
pdf
Goodreads Book Reviews
Coming soon (Mengting RecSys 2018)
Clothing Fit Data
Description
These datasets contain measurements of clothing fit from ModCloth and RentTheRunway.
Basic statistics
| | Modcloth | Renttherunway |
| Number of users: | 47,958 | 105,508 |
| Number of items: | 1,378 | 5,850 |
| Number of transactions: | 82,790 | 192,544 |
Metadata
- ratings and reviews
- fit feedback (small/fit/large etc.)
- user/item measurements
- category information
Example (RentTheRunway)
Download links
Modcloth (8.5mb)
Renttherunway (31mb)
Citation
Please cite the following if you use the data:
Decomposing fit semantics for product size recommendation in metric spaces
Rishabh Misra, Mengting Wan, Julian McAuley
RecSys, 2018
pdf
Product Exchange/Bartering Data
Description
These datasets contain peer-to-peer trades from various recommendation platforms.
Basic statistics
| | Tradesy | Ratebeer | Gameswap |
| Number of users: | 128,152 | 2,215 | 9,888 |
| Number of transactions: | 68,543 | 125,665 | 3,470 |
Metadata
- peer-to-peer trades
- "have" and "want" lists
- image data (tradesy)
Example (tradesy)
Download links
Tradesy (3.8mb)
See the project page for ratebeer, gameswap (and other) datasets
Citation
Please cite the following if you use the data:
Bartering books to beers: A recommender system for exchange platforms
Jérémie Rappaz, Maria-Luiza Vladarean, Julian McAuley, Michele Catasta
WSDM, 2017
pdf
VBPR: Visual bayesian personalized ranking from implicit feedback
Ruining He, Julian McAuley
AAAI, 2016
pdf
Behance Community Art Data
Description
Likes and image data from the community art website Behance. This is a small, anonymized version of a larger proprietary dataset.
Basic statistics
| Users: | 63,497 |
| Items: | 178,788 |
| Appreciates ("likes"): | 1,000,000 |
Metadata
- appreciates (likes)
- timestamps
- extracted image features
Example ("appreciate" data)
Each entry is a user, item, timestamp triple:
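A minimal sketch for loading such triples, assuming one whitespace-separated triple per line with an integer timestamp (the delimiter and field types are assumptions; the documentation in the download folder is authoritative). The names `Appreciate` and `read_appreciates` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Appreciate(NamedTuple):
    user: str
    item: str
    timestamp: int


def read_appreciates(lines: Iterable[str]) -> Iterator[Appreciate]:
    """Parse (user, item, timestamp) triples from an iterable of text lines."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user, item, timestamp = parts
        yield Appreciate(user, item, int(timestamp))
```

The function accepts any iterable of strings, so an open file handle can be passed directly.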
Code to read image features
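A minimal sketch of a reader for a binary feature file, assuming each record is a fixed-width item ID followed by a fixed number of little-endian 32-bit floats. The defaults used here (8-byte IDs, 4096 dimensions) and the name `read_image_features` are assumptions, not confirmed by this page; verify the layout against the documentation in the download folder.

```python
import struct


def read_image_features(path, id_bytes=8, dim=4096):
    """Yield (item_id, features) pairs from a binary feature file.

    Assumed layout per record: `id_bytes` bytes of ID, then `dim`
    little-endian float32 values.
    """
    record = struct.Struct(f"<{dim}f")
    with open(path, "rb") as f:
        while True:
            item_id = f.read(id_bytes)
            if len(item_id) < id_bytes:
                break  # end of file
            payload = f.read(record.size)
            if len(payload) < record.size:
                break  # truncated final record
            yield item_id.decode("ascii", errors="replace"), record.unpack(payload)
```

Reading record by record keeps memory use constant, which matters when the feature file is much larger than the interaction data.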
Download links
See our Google Drive folder containing all Behance files. The folder also contains additional documentation.
Citation
Please cite the following if you use the data:
Vista: A visually, socially, and temporally-aware model for artistic recommendation
Ruining He, Chen Fang, Zhaowen Wang, Julian McAuley
RecSys, 2016
pdf
Social Recommendation Data
Description
These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and Epinions (general consumer reviews).
Basic statistics
| | Librarything | Epinions |
| Number of users: | 73,882 | 41,554 |
| Number of items: | 337,561 | 112,991 |
| Number of ratings/feedback: | 979,053 | 181,394 |
| Number of social relations: | 120,536 | 181,304 |
Metadata
- reviews
- price paid (epinions)
- helpfulness votes (librarything)
- flags (librarything)
Example
Download links
LibraryThing (594mb)
epinions (66mb)
Citation
Please cite the following if you use the data:
SPMC: Socially-aware personalized Markov chains for sparse sequential recommendation
Chenwei Cai, Ruining He, Julian McAuley
IJCAI, 2017
pdf
Improving latent factor models via personalized feature projection for one-class recommendation
Tong Zhao, Julian McAuley, Irwin King
Conference on Information and Knowledge Management (CIKM), 2015
pdf
Older and Non-Recommender-Systems Datasets
Description
Below are older datasets, as well as datasets collected by my lab that are not related to recommender systems specifically. Formats of these datasets vary, so their respective project pages should be consulted for further details.
Video Game Data
Description
Step charts from the video game Dance Dance Revolution, and audio files from the NES platform.
Basic statistics
| Num songs (DDR): | 223 (7 hours) |
| Num charts (DDR): | 1,102 |
| Num games (NES): | 397 |
| Num songs (NES): | 5,278 (46 hours) |
| Num notes (NES): | 2,325,636 |
Download links
See the project pages for Dance Dance Convolution and NES MDB for further details and links to the data
Multi-aspect Reviews
Description
These datasets include reviews with multiple rated dimensions. The most comprehensive of these are beer review datasets from RateBeer and BeerAdvocate, which include sensory aspects such as taste, look, feel, and smell.
Basic statistics
| | Ratebeer | BeerAdvocate |
| Number of users: | 40,213 | 33,387 |
| Number of items: | 110,419 | 66,051 |
| Number of ratings/reviews: | 2,924,127 | 1,586,259 |
| Timespan: | Apr 2000 - Nov 2011 | Jan 1998 - Nov 2011 |
Metadata
- reviews
- aspect-specific ratings (taste, look, feel, smell, overall impression)
- product category
- ABV
Example (ratebeer)
Download links
See SNAP beeradvocate and ratebeer dataset pages
Citation
Please cite the following if you use the data:
Learning attitudes and attributes from multi-aspect reviews
Julian McAuley, Jure Leskovec, Dan Jurafsky
International Conference on Data Mining (ICDM), 2012
pdf
From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews
Julian McAuley, Jure Leskovec
WWW, 2013
pdf
Social Circles
Description
These datasets contain social connections and "circles" from Facebook, Twitter, and Google Plus.
Basic statistics
| | Facebook | Google Plus | Twitter |
| Number of networks: | 10 | 133 | 1,000 |
| Number of nodes: | 4,039 | 106,674 | 192,075 |
| Number of circles: | 193 | 479 | 5,541 |
Metadata
- social connections
- circles (sets of friends sharing a common property)
- user metadata
Example (Kaggle egonet data)
Download links
See SNAP facebook, twitter, and Google Plus data, as well as the Kaggle competition based on the same data.
Citation
Please cite the following if you use the data:
Learning to Discover Social Circles in Ego Networks
Julian McAuley, Jure Leskovec
Neural Information Processing Systems (NIPS), 2012
pdf
Reddit Submissions
Description
Submissions of reddit posts (and in particular resubmissions of the same content) along with metadata.
Basic statistics
| Num of submissions (images): | 132,308 |
| Num of unique images: | 16,736 |
| Timespan: | July 2008 - January 2013 |
Metadata
- timestamps
- upvotes/downvotes
- post title, subreddit, etc.
Example
Download links
resubmissions data (7.3mb)
raw html of resubmissions (1.8gb)
See also the SNAP project page.
Citation
Please cite the following if you use the data:
Understanding the interplay between titles, content, and communities in social media
Himabindu Lakkaraju, Julian McAuley, Jure Leskovec
ICWSM, 2013
pdf
Questions and comments to Julian McAuley