Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.
Stable benchmark dataset. 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. Includes tag genome data with 15 million relevance scores across 1,129 tags. Released 12/2019.
This dataset contains a set of movie ratings from the MovieLens website, a movie recommendation service. This dataset was collected and maintained by GroupLens, a research group at the University of Minnesota. There are 5 versions included: "25m", "latest-small", "100k", "1m", "20m". In all datasets, the movies data and ratings data are joined on "movieId". The 25m dataset, latest-small dataset, and 20m dataset contain only movie data and rating data. The 1m dataset and 100k dataset contain demographic data in addition to movie and rating data.
For each version, users can view either only the movies data by adding the "-movies" suffix (e.g. "25m-movies") or the ratings data joined with the movies data (and users data in the 1m and 100k datasets) by adding the "-ratings" suffix (e.g. "25m-ratings").
The features below are included in all versions with the "-ratings" suffix.
The "100k-ratings" and "1m-ratings" versions in addition include the following demographic features.
In addition, the "100k-ratings" dataset has a feature "raw_user_age", which is the exact age of the user who made the rating.
Datasets with the "-movies" suffix contain only "movie_id", "movie_title", and "movie_genres" features.
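The naming scheme described above can be sketched as a small helper. This is only an illustration under assumptions: `config_name` and `has_demographics` are hypothetical functions, not part of tensorflow_datasets, and the `movielens/<version>-<suffix>` form is assumed from the usual TFDS `dataset/config` naming convention.

```python
# Hypothetical helpers (not part of tensorflow_datasets) that build a config
# name from the versions and suffixes described above.
VERSIONS = ("25m", "latest-small", "100k", "1m", "20m")
SUFFIXES = ("movies", "ratings")
# Per the description, only these versions carry demographic features.
DEMOGRAPHIC_VERSIONS = ("100k", "1m")


def config_name(version: str, suffix: str) -> str:
    """Return a name such as 'movielens/25m-ratings' (assumed naming form)."""
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version!r}")
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown suffix: {suffix!r}")
    return f"movielens/{version}-{suffix}"


def has_demographics(version: str, suffix: str) -> bool:
    """True when the config is expected to include user demographic features."""
    return suffix == "ratings" and version in DEMOGRAPHIC_VERSIONS


name = config_name("100k", "ratings")
print(name, has_demographics("100k", "ratings"))
# Assumed usage (downloads data, so it is left commented out):
# ds = tfds.load(name, split="train")
```

Validating the version and suffix up front keeps a typo from turning into a long download attempt that fails late.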
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('movielens', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
This dataset describes ratings and free-text tagging activities from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.
Users were selected at random for inclusion. All selected users had rated at least 20 movies.
No demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in six files.
tag.csv that contains tags applied to movies by users:
userId
movieId
tag
timestamp
rating.csv that contains ratings of movies by users:
userId
movieId
rating
timestamp
movie.csv that contains movie information:
movieId
title
genres
link.csv that contains identifiers that can be used to link to other sources:
movieId
imdbId
tmdbId
genome_scores.csv that contains movie-tag relevance data:
movieId
tagId
relevance
genome_tags.csv that contains tag descriptions:
tagId
tag
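As a rough sketch of how the files listed above fit together, the snippet below joins ratings to movie information on `movieId` with pandas. The rows are made up for illustration and stand in for rating.csv and movie.csv; only the column names come from the listing above.

```python
import io

import pandas as pd

# Tiny made-up samples standing in for rating.csv and movie.csv.
rating_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,1112486027\n"
    "1,20,3.5,1112484676\n"
    "2,10,5.0,974820889\n"
)
movie_csv = io.StringIO(
    "movieId,title,genres\n"
    "10,Example Movie A (1995),Comedy\n"
    "20,Example Movie B (1999),Drama\n"
)

ratings = pd.read_csv(rating_csv)
movies = pd.read_csv(movie_csv)

# Attach title and genres to every rating through the shared movieId key.
joined = ratings.merge(movies, on="movieId", how="left")

# Mean rating and number of ratings per movie.
summary = (
    joined.groupby("title")["rating"]
    .agg(mean_rating="mean", num_ratings="count")
    .sort_values("mean_rating", ascending=False)
)
print(summary)
```

A left join keeps every rating even if a movie row were missing, which makes gaps in the movie file visible as empty titles instead of silently dropping ratings.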
The original datasets can be found here. To acknowledge use of the dataset in publications, please cite the following paper:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
Some ideas worth exploring:
Which genres receive the highest ratings? How does this change over time?
Determine the temporal trends in the genres/tagging activity of the movies released
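One possible starting point for the genre question is sketched below. It assumes the `genres` column holds several genres separated by `|`, which the description above does not state, and it uses made-up rows in place of the joined rating and movie data.

```python
import pandas as pd

# Made-up joined rows; the pipe-separated genres format is an assumption.
joined = pd.DataFrame(
    {
        "movieId": [10, 10, 20, 30],
        "rating": [4.0, 5.0, 3.0, 2.0],
        "genres": ["Comedy|Romance", "Comedy|Romance", "Drama", "Comedy"],
    }
)

# Split the genres string and produce one row per (rating, genre) pair.
per_genre = joined.assign(genre=joined["genres"].str.split("|")).explode("genre")

# Average rating per genre, highest first.
genre_means = per_genre.groupby("genre")["rating"].mean().sort_values(ascending=False)
print(genre_means)
```

A rating of a multi-genre movie counts once for each of its genres here; weighting such ratings differently is an equally defensible choice.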
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The standardized Hudup dataset is derived from raw data and is composed of ten units: “hdp_config”, “hdp_account”, “hdp_attribute_map”, “hdp_nominal”, “hdp_user”, “hdp_item”, “hdp_rating”, “hdp_context_template”, “hdp_context”, and “hdp_sample”. Each unit has a particular function, which is described in the data description section. The Hudup dataset is metadata that models any raw data at an abstract level. The default raw data source for the Hudup dataset here is the MovieLens 100K dataset (GroupLens, 1998), which has 100,000 ratings from 943 users on 1682 movies (items) and is available at https://files.grouplens.org/datasets/movielens/ml-100k.zip.
Stable benchmark dataset. 32 million ratings and two million tag applications applied to 87,585 movies by 200,948 users. Collected 10/2023. Released 05/2024.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
I uploaded GroupLens' Book Genome dataset on Kaggle. It doesn't seem like they're active here any more and I want to use this here for some exploratory learning work I did.
Official link here: https://grouplens.org/datasets/book-genome/
Tag Genome is a data structure containing scores indicating the degree to which tags apply to items, such as movies or books. This dataset contains a Tag Genome generated for a set of books along with the data used for its generation (raw data). Raw data consists of a subset of the Goodreads dataset [Wan and McAuley, 2018, Wan et al., 2019] and book-tag ratings. The Goodreads subset includes information on popular books, such as titles, authors, release years, user ratings, reviews and shelves. Shelves are lists that users use to organize books in Goodreads (https://www.goodreads.com/). In these instructions, we refer to adding books to shelves as attaching tags (shelf names) to books. To collect book-tag ratings, we conducted a survey on Amazon Mechanical Turk, where we asked users to indicate the degree to which tags apply to books from this subset. To generate book-tag scores, we used two state-of-the-art algorithms: Glmer [Vig et al., 2012] and TagDL [Kotkov et al., 2021]. The code is available in the following GitHub repository: https://github.com/Bionic1251/Revisiting-the-Tag-Relevance-Prediction-Problem
Context
These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.
This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.
Content
This dataset consists of the following files:
movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.
links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed here
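Fields described above as stringified JSON objects need to be parsed before use. The sketch below tries JSON first and falls back to Python literal syntax; the fallback, the `parse_stringified` helper name, and the `id`/`name` keys in the sample cells are all assumptions made for illustration, not confirmed details of keywords.csv or credits.csv.

```python
import ast
import json


def parse_stringified(value: str):
    """Parse a stringified list or object: JSON first, then Python literals."""
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        # Assumption: some exports use single-quoted, Python-style literals.
        # ast.literal_eval only evaluates literals, so it is safe on untrusted text.
        return ast.literal_eval(value)


# Made-up cell values; the "id"/"name" keys are illustrative assumptions.
json_cell = '[{"id": 1, "name": "example keyword"}]'
python_cell = "[{'id': 2, 'name': 'another keyword'}]"

names = [
    item["name"]
    for cell in (json_cell, python_cell)
    for item in parse_stringified(cell)
]
print(names)
```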
Acknowledgements
This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.
The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available here
Inspiration
This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.
Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems
Some of the things you can do with this dataset:
Predicting movie revenue and/or movie success based on a certain metric.
What movies tend to get higher vote counts and vote averages on TMDB?
Building Content Based and Collaborative Filtering Based Recommendation Engines.
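About the dataset
This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in the following files: links.csv, movies.csv, ratings.csv and tags.csv.
This and other GroupLens datasets are publicly available for download at http://grouplens.org/datasets/.
License: This dataset comes from the GroupLens research group at the University of Minnesota. It is intended for non-commercial research and educational purposes only. License details can be found under the usage license at https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html
Important: This dataset is provided "as is", without any warranty. For commercial use, please contact grouplens-info@umn.edu.
Citation: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872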
Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users.
This dataset has three files: ratings.csv, movies.csv and tags.csv.
ratings.csv: the movies have been rated by 138493 users on a scale of 1 to 5. This file contains the columns 'userId', 'movieId', 'rating' and 'timestamp'.
tags.csv: this file contains the columns 'userId', 'movieId' and 'tag'.
I got this data from MovieLens for a mini project. The original dataset is available at http://grouplens.org/datasets/movielens/20m/
You have a ton of data. You can use it to make fun decisions, such as which is the best movie series of all time, or to create a completely new story out of the data that you have.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of several data sets used to illustrate fitting (generalized) linear mixed-effects models. Individual data sets are in Feather format (https://github.com/wesm/feather). They include Dyestuff, Dyestuff2, Penicillin, Pastes, InstEval, sleepstudy, cbpp, Contraception, grouseticks and VerbAgg from the lme4 package for R. The kb07 data is from github.com/dalejbarr/kronmueller-barr-2007 and ml1m is from https://grouplens.org/datasets/movielens/1m/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Four multimedia recommender systems datasets to study popularity bias and fairness:
Last.fm (lfm.zip), based on the LFM-1b dataset of JKU Linz (http://www.cp.jku.at/datasets/LFM-1b/)
MovieLens (ml.zip), based on MovieLens-1M dataset (https://grouplens.org/datasets/movielens/1m/)
BookCrossing (book.zip), based on the BookCrossing dataset of Uni Freiburg (http://www2.informatik.uni-freiburg.de/~cziegler/BX/)
MyAnimeList (anime.zip), based on the MyAnimeList dataset of Kaggle (https://www.kaggle.com/CooperUnion/anime-recommendations-database)
Each dataset consists of user interactions (user_events.txt) and three user groups that differ in their inclination to popular/mainstream items: LowPop (low_main_users.txt), MedPop (med_main_users.txt), and HighPop (high_main_users.txt).
The format of the three user files is "user,mainstreaminess".
The format of the user-events files is "user,item,preference".
Example Python code for analyzing the datasets, as well as more information on the user groups, can be found on GitHub (https://github.com/domkowald/FairRecSys) and on arXiv (https://arxiv.org/abs/2203.00376).
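The two file formats above can be read with a few lines of standard-library Python. The sketch below is illustrative only: the rows are made up, and it assumes comma-separated text without a header row, which should be checked against the real files.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the stated formats; assumes no header row.
user_events_txt = "u1,i1,5\nu1,i2,3\nu2,i1,4\n"
low_main_users_txt = "u1,0.12\n"
high_main_users_txt = "u2,0.87\n"


def read_users(text: str) -> dict:
    """Map user -> mainstreaminess from 'user,mainstreaminess' rows."""
    return {user: float(score) for user, score in csv.reader(io.StringIO(text))}


def read_events(text: str) -> list:
    """Parse 'user,item,preference' rows into (user, item, float) tuples."""
    return [(u, i, float(p)) for u, i, p in csv.reader(io.StringIO(text))]


groups = {
    "LowPop": read_users(low_main_users_txt),
    "HighPop": read_users(high_main_users_txt),
}
events = read_events(user_events_txt)

# Count interactions contributed by each user group.
counts = defaultdict(int)
for user, _item, _pref in events:
    for group, members in groups.items():
        if user in members:
            counts[group] += 1
print(dict(counts))
```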
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic trace generated using techniques from DLRM from the data distributions of the Taobao Ad Display/Click Dataset and the Movielens 20M Dataset. Intended for testing of FEDORA-OramSim simulator.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GroupLens is a research lab in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities specializing in recommender systems, online communities, mobile and ubiquitous technologies, digital libraries, and local geographic information systems.
This dataset is a subset of the MovieLens 100k data, which were collected by the GroupLens Research Project at the University of Minnesota. You can find the full dataset here.
This dataset consists of 6 columns:
movie_id -- unique id for each movie
title -- title of the movie
year -- year in which the movie was released
directors -- director of the movie
actors -- actors of the movie
genres -- genres of the movie (e.g. comedy, action, horror)
Thanks to GroupLens for providing this data.
This dataset was created by Max Harper
Released under Other (specified in description)
It contains the following files:
Background of Problem Statement
The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Members of the GroupLens Research Project are involved in many research projects related to the fields of information filtering, collaborative filtering, and recommender systems. The project is led by professors John Riedl and Joseph Konstan. The project began to explore automated collaborative filtering in 1992 but is best known for its worldwide trial of an automated collaborative filtering system for Usenet news in 1996. Since then the project has expanded its scope to research overall information filtering solutions, integrating content-based methods as well as improving current collaborative filtering technology.
Problem Objective:
Here, we ask you to perform the analysis using the Exploratory Data Analysis technique. You need to find features affecting the ratings of any particular movie and build a model to predict the movie ratings.
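Before building a richer model, a simple baseline gives later results something to be compared against. The sketch below is one possible baseline, not a prescribed solution: it predicts the global mean rating adjusted by each movie's average deviation from that mean. The training triples are made up, and `predict` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up (user_id, movie_id, rating) triples for illustration.
train = [(1, 10, 4.0), (2, 10, 5.0), (1, 20, 2.0), (3, 20, 3.0)]

# Average of all ratings in the training data.
global_mean = sum(rating for _, _, rating in train) / len(train)

# Per-movie average deviation from the global mean.
deviations = defaultdict(list)
for _user, movie, rating in train:
    deviations[movie].append(rating - global_mean)
movie_bias = {movie: sum(d) / len(d) for movie, d in deviations.items()}


def predict(movie_id: int) -> float:
    """Global mean plus the movie's bias; unseen movies get the global mean."""
    return global_mean + movie_bias.get(movie_id, 0.0)


print(predict(10), predict(20), predict(99))
```

Any model that uses features found during the exploratory analysis should at least beat this baseline on held-out ratings.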