62 datasets found

MovieLens 1M
grouplens.org
meilu1.jpshuntong.com
Updated Mar 19, 2016
Cite
(2016). MovieLens 1M [Dataset]. https://grouplens.org/datasets/movielens/1m/
Explore at:
Dataset updated
Mar 19, 2016
Description
Stable benchmark dataset. 1 million ratings from 6000 users on 4000 movies. Released 2/2003.
MovieLens 20M Dataset
academictorrents.com
grouplens.org
bittorrent
Updated Dec 16, 2016
Cite
None (2016). MovieLens 20M Dataset [Dataset]. https://academictorrents.com/details/296054417b4d8eeeb4c7b1c842570bf792ee4d14
Explore at:
Available download formats: bittorrent (198702078)
Dataset updated
Dec 16, 2016
Authors
None
License
https://academictorrents.com/nolicensespecified
Description
Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.

Summary

This dataset (ml-20m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in six files, genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all thes
MovieLens 10M Dataset
kaggle.com
zip
Updated Mar 26, 2021
Cite
Smriti (2021). MovieLens 10M Dataset [Dataset]. https://www.kaggle.com/smritisingh1997/movielens-10m-dataset
Explore at:
Available download formats: zip (67393676 bytes)
Dataset updated
Mar 26, 2021
Authors
Smriti
Description
Build an RBM using this dataset to predict whether a particular user will like a movie or not. This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service. Users were selected at random for inclusion. All users selected had rated at least 20 movies. Unlike previous MovieLens data sets, no demographic information is included. Each user is represented by an id, and no other information is provided. The data are contained in three files, movies.dat, ratings.dat and tags.dat. Also included are scripts for generating subsets of the data to support five-fold cross-validation of rating predictions.

User Ids

MovieLens users were selected at random for inclusion. Their ids have been anonymized.

Users were selected separately for inclusion in the ratings and tags data sets, which implies that user ids may appear in one set but not the other.

The anonymized values are consistent between the ratings and tags data files. That is, user id n, if it appears in both files, refers to the same real MovieLens user.

Ratings Data File Structure

All ratings are contained in the file ratings.dat. Each line of this file represents one rating of one movie by one user, and has the following format:

UserID::MovieID::Rating::Timestamp

The lines within this file are ordered first by UserID, then, within user, by MovieID.

Ratings are made on a 5-star scale, with half-star increments.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
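
The line format and timestamp convention described above can be sketched in a few lines of Python. This is a minimal illustration, not code shipped with the dataset: the `Rating` type, the `parse_rating_line` helper, and the sample line are all hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    rated_at: datetime


def parse_rating_line(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(
        user_id=int(user_id),
        movie_id=int(movie_id),
        rating=float(rating),  # half-star increments, so keep it a float
        # Timestamps are seconds since 1970-01-01 00:00 UTC.
        rated_at=datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    )


# Made-up sample line in the documented format (not taken from ratings.dat).
sample = parse_rating_line("1::122::3.5::838985046")
print(sample)
```

Converting with an explicit UTC timezone avoids silently shifting timestamps into the local timezone of the machine running the script.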

Tags Data File Structure

All tags are contained in the file tags.dat. Each line of this file represents one tag applied to one movie by one user, and has the following format:

UserID::MovieID::Tag::Timestamp

The lines within this file are ordered first by UserID, then, within user, by MovieID.

Tags are user generated metadata about movies. Each tag is typically a single word, or short phrase. The meaning, value and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

Movies Data File Structure

Movie information is contained in the file movies.dat. Each line of this file represents one movie, and has the following format:

MovieID::Title::Genres

MovieID is the real MovieLens id.

Movie titles, by policy, should be entered identically to those found in IMDB, including year of release. However, they are entered manually, so errors and inconsistencies may exist.

Genres are a pipe-separated list, and are selected from the following:

Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western
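
As a rough sketch of how a `MovieID::Title::Genres` line could be split, the snippet below parses one line, separates the pipe-delimited genres, and pulls the release year out of the title when it is present. The `Movie` type, the `parse_movie_line` helper, and the sample line are illustrative assumptions; because titles are entered manually, the year is treated as optional.

```python
import re
from typing import List, NamedTuple, Optional


class Movie(NamedTuple):
    movie_id: int
    title: str
    year: Optional[int]
    genres: List[str]


# Assumes the release year, when present, is a trailing "(YYYY)" in the title.
_YEAR_SUFFIX = re.compile(r"\((\d{4})\)\s*$")


def parse_movie_line(line: str) -> Movie:
    """Parse one 'MovieID::Title::Genres' line."""
    movie_id, rest = line.rstrip("\n").split("::", 1)
    title, genres = rest.rsplit("::", 1)
    match = _YEAR_SUFFIX.search(title)
    year = int(match.group(1)) if match else None
    return Movie(int(movie_id), title, year, genres.split("|"))


# Illustrative line in the documented format.
movie = parse_movie_line("1::Toy Story (1995)::Adventure|Animation|Children's|Comedy")
print(movie.year, movie.genres)
```

Splitting the id from the left and the genres from the right keeps the parse intact even if a title happens to contain `::`.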
MovieLens 100K
grouplens.org
Updated Oct 12, 2015
Cite
(2015). MovieLens 100K [Dataset]. https://grouplens.org/datasets/movielens/100k/
Explore at:
Dataset updated
Oct 12, 2015
Description
Stable benchmark dataset. 100,000 ratings from 1000 users on 1700 movies. Released 4/1998.
National box office statistics
data.gov.tw
csv, json
Updated Jun 26, 2024
Cite
Ministry of Culture (2024). National box office statistics [Dataset]. https://data.gov.tw/en/datasets/94224
Explore at:
Available download formats: json, csv
Dataset updated
Jun 26, 2024
Dataset authored and provided by
Ministry of Culture
License
https://data.gov.tw/license
Description
This dataset provides national theater box office statistics for films distributed by the Administrative Institution National Film and Audiovisual Culture Center. The data is up to the last Sunday before the announcement date and does not include films that have been screened for fewer than 7 calendar days. The earliest CSV format data in this dataset begins on July 30, 2018, and the earliest JSON format data begins on March 1, 2020. JSON format queries require entering the start and end dates (in the format of year, month, and day), and can provide data for a maximum of 90 days at a time.
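
Because a JSON query covers at most 90 days, a longer period has to be requested in several pieces. The sketch below only shows the date arithmetic; it does not call any endpoint. The `query_windows` helper is hypothetical, and it assumes the 90-day limit counts both the start and end date.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple

MAX_DAYS_PER_QUERY = 90  # limit stated in the dataset description


def query_windows(
    start: date, end: date, max_days: int = MAX_DAYS_PER_QUERY
) -> Iterator[Tuple[date, date]]:
    """Yield consecutive (start, end) pairs covering at most max_days days each.

    Assumption: both endpoints of a window count toward the limit.
    """
    if end < start:
        raise ValueError("end must not be before start")
    window_start = start
    while window_start <= end:
        window_end = min(window_start + timedelta(days=max_days - 1), end)
        yield window_start, window_end
        window_start = window_end + timedelta(days=1)


# Cover the JSON data from its earliest date to the end of 2020.
windows = list(query_windows(date(2020, 3, 1), date(2020, 12, 31)))
for window_start, window_end in windows:
    print(window_start.isoformat(), window_end.isoformat())
```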
TMDB top 10K movies data
kaggle.com
Updated Jan 3, 2025
Cite
Tanish Jangir (2025). TMDB top 10K movies data [Dataset]. https://www.kaggle.com/datasets/tanishjangir/tmdb-top-10k-movies-data
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Jan 3, 2025
Dataset provided by
Kaggle
Authors
Tanish Jangir
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
This dataset contains information about 10,000 movies, including their titles, release dates, popularity metrics, and voting statistics, sourced from The Movie Database (TMDB). It can be used for data analysis, visualization, and machine learning tasks related to the film industry. The dataset includes detailed movie descriptions and metadata for analysis.

Column Descriptors

adult (bool): Indicates if the movie is adult content.
id (int): Unique identifier for the movie in the TMDB database.
title (string): The movie's primary title.
overview (string): A brief description or summary of the movie.
popularity (float): The movie's popularity score on TMDB.
release_date (string): The official release date of the movie in YYYY-MM-DD format.
vote_count (int): The total number of votes received by the movie.
original_title (string): The movie's title in its original language.
Movie Subtitle Durations
kaggle.com
Updated Oct 9, 2023
Cite
Nevo Itzhak (2023). Movie Subtitle Durations [Dataset]. https://www.kaggle.com/datasets/nevoit/movie-subtitle-durations
Explore at:
Croissant
Dataset updated
Oct 9, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nevo Itzhak
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.

Dataset statistics:

Average duration between subtitles

Average duration between subtitles with a duration greater than 10, 30, 60, 120, and 300 seconds

Maximum duration between subtitles

Percentage of duration between subtitles from the runtime
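
To make the statistics listed above concrete, here is a small sketch that derives them from a list of subtitle cues. It is an illustration under stated assumptions, not the code used to build the dataset: the `gap_statistics` helper, the `(start, end)` cue representation in seconds, and the sample cues are hypothetical, and the share of runtime is computed as total gap time divided by runtime.

```python
from typing import Dict, List, Tuple

THRESHOLDS = (10, 30, 60, 120, 300)  # seconds, as listed in the description


def gap_statistics(
    subtitles: List[Tuple[float, float]], runtime_seconds: float
) -> Dict[str, float]:
    """Summarize the gaps between consecutive (start, end) subtitle cues."""
    ordered = sorted(subtitles)
    # Gap = start of the next cue minus end of the previous one (never negative).
    gaps = [
        max(0.0, next_start - prev_end)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    if not gaps:
        raise ValueError("need at least two subtitle cues")
    stats = {
        "mean_gap": sum(gaps) / len(gaps),
        "max_gap": max(gaps),
        "gap_share_of_runtime": sum(gaps) / runtime_seconds,
    }
    for threshold in THRESHOLDS:
        long_gaps = [gap for gap in gaps if gap > threshold]
        stats[f"mean_gap_over_{threshold}s"] = (
            sum(long_gaps) / len(long_gaps) if long_gaps else 0.0
        )
    return stats


# Made-up cues for a 10-minute clip.
cues = [(0.0, 2.0), (5.0, 8.0), (50.0, 52.0), (400.0, 405.0)]
stats = gap_statistics(cues, runtime_seconds=600.0)
print(stats)
```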

Dataset use cases:

Understanding how dialogue is used in movies, such as the average duration of a dialogue scene and how the duration of dialogue varies between different genres

Developing tools to improve the watching experience by adjusting the playback speed of dialogue scenes

Evaluating the effectiveness of tools like the VLC extension mentioned below

Data Analysis:

The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F5c78e4866f203dfe5f7a7f55e41f69d0%2Ffig%201.png?generation=1696861842737260&alt=media

Figure 1: Histogram of the runtime in minutes

The next histogram shows the distribution of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime. The mean percentage of gaps is 0.187, the maximum percentage of gaps is 0.033, and the median percentage of gaps is 327.586.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F235453706269472da11082f080b1f41d%2Ffig%202.png?generation=1696862163125288&alt=media

Figure 2: Histogram of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime

The next histogram shows the distribution of the total movie's subtitle duration (seconds) between two consecutive subtitles. The mean subtitle duration is 4,837.089 seconds and the median subtitle duration is 2,906.435 seconds.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F234d31e3abaf6c4d174f494bf5cb86fa%2Ffig%203.png?generation=1696862309880510&alt=media

Figure 3: Histogram of the total movie's subtitle duration (seconds) between two consecutive subtitles

Example use case:

The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose the appropriate settings for the extension, as increasing the playback speed can impact the overall tone and impact of the film.

The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.

Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.

Conclusion

This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.
iFlix movie streaming dataset
kaggle.com
Updated Jan 8, 2020
Cite
Aung Pyae (2020). iFlix movie streaming dataset [Dataset]. https://www.kaggle.com/aungpyaeap/movie-streaming-datasets-iflix/discussion
Explore at:
Croissant
Dataset updated
Jan 8, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aung Pyae
Description
users.csv
User_id: Unique identifier of user
Country_code: Country code where the user registered

assets.csv
Show_type: Type of content, whether the asset is a movie or an episode of a TV series
Genre: Genre of content
Running_miutes: Runtime of content (playable number of minutes)
Source_language: Production language of content
Asset_id: Unique identifier of video content at the most granular level (a movie or an episode of a TV series)
Season_id: Unique identifier of content at season level. This is only applicable to TV series
Series_id: Unique identifier of content at series level. This is only applicable to TV series
Studio_id: Unique identifier of production studio for the content

plays.csv
Platform: Platform of consumption
Minutes_viewed: Total number of minutes viewed, rounded to the nearest integer (0 means less than 30 seconds)

Demographics.csv, Psychographics.csv
The dataset identifies psychographic and demographic tags about some iflix users. Each user-tag pair has an associated confidence score (1 is the highest, and 0 is the lowest confidence). Each trait can have up to 3 levels, depending on its granularity. Some traits can be identified by considering only the first two levels. Others make more sense when all three levels are considered, e.g., 'iflix Viewing Behaviour' is a level 2 psychographic trait that only makes sense when it is looked at in combination with the level 3 traits corresponding to it ('casual', 'player' and 'addict'). These traits represent different levels of viewing behavior of iflix users. Casual users have fewer than five viewing days in a month, player users have 5 to 12 viewing days in a month, and addict users have more than 12 viewing days in a month. Traits are available for a user_id in the dataset only if there is sufficient confidence that the user belongs to the trait.

Column and Description
Level_1: Identifies the first level of the trait (psychographic or demographic)
Level_2: Identifies the second level of the trait (e.g., Music Lovers, Movies Lovers)
Level_3: Identifies the third level of the trait, if available/relevant (e.g., Malay Movies Lovers, Indonesian TV Fans)
Confidence_score: Confidence in associating the said trait (level_1, level_2, level_3) with the user
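
The viewing-behaviour thresholds given in the description translate directly into a small rule. The function below is a hypothetical helper written for illustration; only the cut-offs (fewer than 5, 5 to 12, more than 12 viewing days in a month) come from the text.

```python
def viewing_behaviour(viewing_days_in_month: int) -> str:
    """Map a monthly count of viewing days to the level 3 trait name."""
    if viewing_days_in_month < 0:
        raise ValueError("viewing days cannot be negative")
    if viewing_days_in_month < 5:
        return "casual"
    if viewing_days_in_month <= 12:
        return "player"
    return "addict"


for days in (0, 4, 5, 12, 13):
    print(days, viewing_behaviour(days))
```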
Replication Data for 'Gender (im)balance in the Russian cinema: on the...
search.dataone.org
dataverse.harvard.edu
Updated Sep 24, 2024
Cite
Leontyeva, Xenia (2024). Replication Data for 'Gender (im)balance in the Russian cinema: on the screen and behind the camera' [Dataset]. http://doi.org/10.7910/DVN/ISVTB4
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/ISVTB4
Dataset updated
Sep 24, 2024
Dataset provided by
Harvard Dataverse
Authors
Leontyeva, Xenia
Description
This publication contains two CSV datasets, first used in Xenia Leontyeva's master's thesis in sociology at HSE University Saint Petersburg, titled "Popularity Factors of Domestic Films: Gender Characteristics and State Support Measures" (2022), and later in the article by Leontyeva, Xenia, Olessia Koltsova, and Deb Verhoeven, titled "Gender (Im)Balance in Russian Cinema: On the Screen and behind the Camera" (accepted in January 2024 by The Journal of Cultural Analytics).

The first dataset (N=1285) includes all Russian films produced between 2008 and 2019 and theatrically released between December 1, 2008, and December 31, 2019. Distribution statistics cover the territory of the CIS, of which the Russian Federation is the biggest market. Budget information is available for 644 films. The second dataset contains markup for the Bechdel-Wallace test, as modified by Leontyeva, for 243 films, 193 of which have budget information. There is also a supplement with a detailed description of all variables and the R code that produces the tables, plots, and models for the article.

The database was collected by Xenia Leontyeva while working at Nevafilm Research (until 2018) and later. The distribution data is based on sources such as the open database Russian Cinema Fund Analytics (RCFA, since 2015), the closed database comScore/Rentrak ("International Box Office Essential") serving major Hollywood studios (used since 2008 to fill gaps in open databases), Bookers' Bulletin (since 2011), and Russian Film Business Today magazines (since 2004), as well as data collected by Nevafilm Research employees from film distributors and producers; the rights to use and continue this dataset have been received from the Nevafilm company. The production data was taken from the State register of film distribution certificates, Kinopoisk.ru, and the films' credits.
MovieLens 10M
grouplens.org
Updated Mar 22, 2016
Cite
(2016). MovieLens 10M [Dataset]. https://grouplens.org/datasets/movielens/10m/
Explore at:
Dataset updated
Mar 22, 2016
Description
Stable benchmark dataset. 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users. Released 1/2009.
rotten_tomatoes
huggingface.co
Updated Aug 14, 2023
Cite
cornell-movie-review-data (2023). rotten_tomatoes [Dataset]. https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes
Explore at:
Croissant
Dataset updated
Aug 14, 2023
Dataset authored and provided by
cornell-movie-review-data
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for "rotten_tomatoes"

Dataset Summary

Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005.

Supported Tasks and Leaderboards

More Information Needed

Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
ENTERTAINMENT
kaggle.com
Updated Mar 24, 2025
Cite
Rallapalli Shahul (2025). ENTERTAINMENT [Dataset]. https://www.kaggle.com/datasets/rallapallishahul/entertainment
Explore at:
Croissant
Dataset updated
Mar 24, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rallapalli Shahul
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The dataset includes demographic questions such as age, gender, and location, along with preferences related to entertainment and media consumption, and may be used for research purposes. The key topics covered in the survey are:
• Age, Gender, and Location: Respondents' demographic details.
• Movie Preferences: Favorite types of movies and preferred cinema industries (Bollywood, Tollywood, Hollywood, etc.).
• Streaming Platforms: Commonly used streaming services like Hotstar, Netflix, YouTube, Ibomma, etc.
• Social Media Usage: Preferred social media platforms such as Instagram, WhatsApp, Facebook, and Snapchat.
• Leisure Activities: Interests such as watching movies, playing video games, reading books, or listening to music.
• Video Games: Favorite games like Free Fire, BGMI, Candy Crush, and Asphalt.
• Music Genres: Preferences for different genres, including rock, pop, hip-hop, and classic.
• Sports and IPL Preferences: Favorite sportspersons (Sachin Tendulkar, Virat Kohli, Dhoni, etc.) and IPL teams (CSK, SRH, RCB, MI).
• Favorite Directors: Preferences for movie directors like SS Rajamouli, Sukumar, Prasanth Neel, and Trivikram.
Moviegalaxies – Social Networks in Movies
marketplace.sshopencloud.eu
dataverse.harvard.edu
Updated Feb 11, 2022
Cite
(2022). Moviegalaxies – Social Networks in Movies [Dataset]. http://doi.org/10.7910/DVN/T4HBA3
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/T4HBA3
Dataset updated
Feb 11, 2022
Description
This repository contains network graphs and network metadata from Moviegalaxies, a website providing network graph data from about 773 films (1915–2012). The data includes individual network graph data in Graph Exchange XML Format and descriptive statistics on measures such as clustering coefficient, degree, density, diameter, modularity, average path length, the total number of edges, and the total number of nodes.
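
Several of the descriptive statistics named above follow directly from an edge list. As a minimal sketch, assuming an undirected character network given as pairs of names, the snippet below computes node count, edge count, density, and average degree. The `summarize_network` helper and the sample edges are hypothetical, and reading the Graph Exchange XML files themselves is out of scope here.

```python
from typing import Dict, Iterable, Set, Tuple


def summarize_network(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute basic statistics for an undirected graph without self-loops."""
    neighbours: Dict[str, Set[str]] = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    m = sum(len(adjacent) for adjacent in neighbours.values()) // 2
    return {
        "nodes": n,
        "edges": m,
        # Undirected density: realized edges over the n * (n - 1) / 2 possible ones.
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "average_degree": 2 * m / n if n else 0.0,
    }


# Made-up character co-occurrence edges.
summary = summarize_network([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(summary)
```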
MR Dataset
paperswithcode.com
Updated Apr 28, 2021
Cite
(2021). MR Dataset [Dataset]. https://paperswithcode.com/dataset/mr
Explore at:
Dataset updated
Apr 28, 2021
Description
MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.
Data from: imdb
huggingface.co
Updated May 10, 2025
Cite
scikit-learn (2025). imdb [Dataset]. https://huggingface.co/datasets/scikit-learn/imdb
Explore at:
Croissant
Dataset updated
May 10, 2025
Dataset authored and provided by
scikit-learn
License
https://choosealicense.com/licenses/other/
Description
This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.
Film and video distribution, summary statistics
ouvert.canada.ca
www150.statcan.gc.ca
csv, html, xml
Updated Oct 3, 2024
Cite
Statistics Canada (2024). Film and video distribution, summary statistics [Dataset]. https://ouvert.canada.ca/data/dataset/030fdbcc-0f41-4958-804b-63f4c0429b7c
Explore at:
Available download formats: xml, html, csv
Dataset updated
Oct 3, 2024
Dataset provided by
Statistics Canada (https://statcan.gc.ca/en)
License
Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
License information was derived automatically
Description
Summary statistics by North American Industry Classification System (NAICS) for motion picture and video distribution (NAICS 512120), annual, for five years of data. The statistics include operating revenue (dollars x 1,000,000), operating expenses (dollars x 1,000,000), salaries, wages and benefits (dollars x 1,000,000), and operating profit margin (percent).
MoViFex_Dataset
huggingface.co
Updated May 11, 2024
Cite
Ali Tourani (2024). MoViFex_Dataset [Dataset]. https://huggingface.co/datasets/alitourani/MoViFex_Dataset
Explore at:
Dataset updated
May 11, 2024
Authors
Ali Tourani
License
https://choosealicense.com/licenses/gpl-3.0/
Description
🎬 MoViFex Dataset

The Movies Visual Features Extracted (MoViFex) dataset contains visual features obtained from a wide range of full-length movies, their shots, and free trailers. It contains frame-level extracted visual features and aggregated versions of them. MoViFex can be used in recommendation, information retrieval, classification, and similar tasks.

📃 Table of Content

How to Use Dataset Stats Files Structure

🚀 How to Use? The Dataset Web-Page… See the full description on the dataset page: https://huggingface.co/datasets/alitourani/MoViFex_Dataset.
Arizona State University Flixster Data Set
academictorrents.com
bittorrent
Updated Dec 23, 2013
Cite
Flixter (2013). Arizona State University Flixster Data Set [Dataset]. https://academictorrents.com/details/4960373ea6dec89153639b0975ea92f9e3d3c914
Explore at:
Available download formats: bittorrent (36140875)
Dataset updated
Dec 23, 2013
Dataset provided by
Flixster.com (https://www.facebook.com/FlixsterMovies)
Authors
Flixter
License
https://academictorrents.com/nolicensespecified
Area covered
Arizona
Description
Flixster is a social movie site allowing users to share movie ratings, discover new movies and meet others with similar movie taste.

Number of Nodes: 2523386
Number of Edges: 9197338
Missing Values? no
Source: N/A

Data Set Information: 2 files are included:
1. nodes.csv: the file of all the users. It works as a dictionary of all the users in this data set, is useful for fast reference, and contains all the node ids used in the dataset.
2. edges.csv: the friendship network among the users. Friendships are represented as edges. For example, the line 1,2 means that the user with id "1" is a friend of the user with id "2".

Attribute Information: This contains the friendship network crawled in December 2010 by Javier Parra (Javier.Parra@asu.edu). For easier understanding, all the contents are organized in CSV file form
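
A friendship file in the `1,2` form described above can be turned into a lookup of each user's friends. The sketch below is illustrative only: `load_friendships` is a hypothetical helper, the sample rows are made up, and it assumes the file has no header row and that friendships are symmetric.

```python
import csv
import io
from collections import defaultdict
from typing import DefaultDict, Iterable, Set


def load_friendships(lines: Iterable[str]) -> DefaultDict[int, Set[int]]:
    """Build a user id -> set of friend ids mapping from 'a,b' edge rows."""
    friends: DefaultDict[int, Set[int]] = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        a, b = int(row[0]), int(row[1])
        # Assumption: an edge means the friendship holds in both directions.
        friends[a].add(b)
        friends[b].add(a)
    return friends


# In-memory stand-in for edges.csv; pass an open file object for the real data.
sample_edges = io.StringIO("1,2\n1,3\n2,3\n4,1\n")
friends = load_friendships(sample_edges)
print(sorted(friends[1]))
```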

See the Movie, Hear the Song, Read the Book: Extending MovieLens-1M, Last.fm...

zenodo.org

bin, tsv, zip

Updated May 16, 2025

Cite

Giuseppe Spillo; Elio Musacchio; Cataldo Musto; Marco de Gemmis; Pasquale Lops; Giovanni Semeraro (2025). See the Movie, Hear the Song, Read the Book: Extending MovieLens-1M, Last.fm 2K, and DBBook with multimodal Data [Dataset]. http://doi.org/10.5281/zenodo.15403972

Explore at:

Available download formats: zip, tsv, bin

Unique identifier

https://doi.org/10.5281/zenodo.15403972

Dataset updated

May 16, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Giuseppe Spillo; Elio Musacchio; Cataldo Musto; Marco de Gemmis; Pasquale Lops; Giovanni Semeraro

License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Datasets Structure

This folder contains the multimodal features of the three state-of-the-art datasets we have extended (`MovieLens-1M`, `DBbook`, `Last.FM-2K`).

For each folder, we provide both the interaction data in the original format (in the folder `interaction_data`) and the multimodal features in several formats, based on the needs (in the `multimodal_data` folder).

In the following, we provide all the information needed to work with such data. Note that, although some dataset-specific details might change, the general structure is common to all three datasets.

Dataset statistics

| CF data | ML1M | DBbook | LFM2K |
|---|---|---|---|
| Users | 6040 | 6181 | 1892 |
| Items | 3706 | 7672 | 17642 |
| Interactions | 1000209 | 140360 | 92834 |

Interaction data

The `interaction_data` folder contains the interaction data provided in the original version of each dataset. We prefer sharing the original version so that everyone can pre-process it as they prefer (e.g., apply a certain k-core filtering, or adapt the task to sequential recommendation by exploiting temporal information, when available).
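
As one example of such pre-processing, k-core filtering repeatedly drops users and items with fewer than k interactions until every remaining user and item has at least k. The following is a minimal pure-Python sketch of that idea, not the pre-processing used by the dataset authors; the `k_core` helper and the toy interactions are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Interaction = Tuple[Hashable, Hashable]  # (user, item)


def k_core(interactions: Sequence[Interaction], k: int) -> List[Interaction]:
    """Keep only interactions whose user and item both occur at least k times."""
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):
            return kept  # nothing was removed, so the filter has converged
        current = kept


toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]
core = k_core(toy, k=2)
print(core)
```

The loop is needed because removing a rare item can push a user below the threshold, and vice versa; a single pass would leave such cases behind.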

ML1M

In `MovieLens-1M`, interaction data includes user information (`users.dat`), movie information (`movies.dat`), and user ratings (`ratings.dat`). To work with this data, we suggest reading those files with the `pandas` Python library using the `ISO-8859-1` encoding (reading with another encoding, such as `utf-8`, raises an error); the default separator is the character sequence `::`. For example, to read ratings and movie information, one can use:

import pandas as pd

# A multi-character separator such as '::' needs pandas' Python parsing engine.
ratings = pd.read_csv('interaction_data/ratings.dat', sep='::', engine='python', names=['user', 'item', 'rating', 'timestamp'])
movies = pd.read_csv('interaction_data/movies.dat', sep='::', engine='python', names=['id', 'name', 'genres'], encoding='ISO-8859-1')

DBbook

In `DBbook`, interaction data includes training and testing data (already split, as in the original version). Unfortunately, that version can no longer be downloaded, as the original web page is no longer accessible. Using tools like the Wayback Machine, it is possible to access that page and download some files, but only the training data is available in the backups that have been made; the test data is not obtainable.
For these reasons, we considered the version of the dataset that has been used in the other works listed below, which is reachable at the public repository of our SWAP Research Group:
- https://dl.acm.org/doi/abs/10.1145/3523227.3551484
- https://dl.acm.org/doi/abs/10.1145/3565472.3592965
- https://dl.acm.org/doi/abs/10.1145/3627043.3659548
- https://link.springer.com/article/10.1007/s11257-024-09417-x

This way, we have been able to reconstruct the full version of this dataset.
Similarly to `MovieLens-1M`, interaction data contains user ratings in the `train.tsv` and `test.tsv` files, and book information in the `DBbook_Items_DBpedia_mapping.tsv` file.

We suggest loading such data using `pandas` as follows:

train = pd.read_csv('interaction_data/train.tsv', sep='\t', names=['userID', 'itemID', 'rating'])
test = pd.read_csv('interaction_data/test.tsv', sep='\t', names=['userID', 'itemID', 'rating'])
books = pd.read_csv('interaction_data/DBbook_Items_DBpedia_mapping.tsv', sep='\t')

Last.FM-2K

In `LFM2K`, interaction data is encoded in the `user_artists.dat` file; this file encodes the listening counts for each available (user, item) pair (from this information, it is possible to derive the user ratings). The file `artist_info` encodes information associated with the artists, including the name of the artist, the URL of the associated Last.FM resource, and the link to the image (not available anymore). The file `tags.dat` contains the set of all the possible tags users attributed to artists, while the tags attributed to specific artists are encoded in the `user_taggedartists.dat` file (the `user_taggedartists-timestamps` file contains, in addition, the timestamp of the attribution).

To read the data, we suggest using `pandas` as follows:

interactions = pd.read_csv('original_data/user_artists.dat', sep='\t')
artist_info = pd.read_csv('original_data/artists.dat', sep='\t')
usertag = pd.read_csv('original_data/user_taggedartists-timestamps.dat', sep='\t')
tags = pd.read_csv('original_data/tags.dat', sep='\t', encoding='latin-1')

Multimodal data

Each dataset is also provided with multimodal data, in the `multimodal_features` folder. This folder includes the source data we considered (plain text and links to image/audio/video files), together with the pre-trained multimodal features.

Here is the coverage of multimodal information w.r.t. the datasets considered:

Multimodal item coverage	ML1M	DBbook	LFM2K
Text	3667 (Plots)	4197 (Abstracts)	2813 (Tags)
Image	3197 (Movie posters)	7588 (Book covers)	2820 (Top-5 Album Covers)
Audio	3104 (Trailer audio)	-	2742 (Top-5 album songs)
Video	3105 (Trailer video)	-	-

As shown in the table, for `ML1M` we have gathered movie plots (text), movie posters (images), and movie trailers (for audio and video); in the `movielens_1m/multimodal_features` folder, we provide an extended mapping named `ml1m_full_extended_mapping`, which reports the links to download `covers` and `trailers`, while `text` is available in the `text_ml1m.tsv` file.
For `DBbook`, we have gathered book abstracts (text) and book covers (images); in the `dbbook/multimodal_features` folder, we provide an extended mapping named `full_extended_dbbook_img_links.tsv`, which reports the links to download the `book covers`, while `text` is available in the `dbbook_text.tsv` file.
For `LFM2K`, we have gathered artist tags (text), the top-5 most popular album covers (images), and the top-5 most popular songs (audio); in the `lfm2k/multimodal_features` folder, we provide two extended mappings, named `lfm2k_song_extended_mapping.tsv` and `lfm2k_covers_extended_mapping.tsv`, that encode the top-5 most popular `songs` and `album covers` for each artist, respectively; the `lfm2k_text.tsv` file encodes the `text` we considered, obtained from the user tags.

With this information, anyone can download the raw features and use them in their own recommendation scenario; to carry out our experiments, we considered the following state-of-the-art multimodal encoders:

Text: we considered `MiniLM` and `MPNET` (for `ML1M`, `DBbook`, and `LFM2K`)
Image: we considered `ResNet152`, `VGG`, `ViT_AVG`, `ViT_CLS` (for `ML1M`, `DBbook`, and `LFM2K`)
Audio: we considered `VGGish` and `Whisper` (for `ML1M` and `LFM2K`)
Video: we considered `I3D` and `R(2+1)D` (for `ML1M`)

The resulting features have been dumped as a `dict` (`item_id` -> `np.float32` embedding) in a pickle `.pkl` file, which can be found in the `multimodal_features/dict` folders (one for each dataset). To avoid any error in reading such files, we have also saved the embeddings in `.json` files, in the `multimodal_features/json` folders (one for each dataset). Finally, to reproduce our experiments, we provide the same data as `.npy` files (as required by `MMRec`), which can be found in the `multimodal_features/npy` folders (one for each dataset).
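The three formats described above can be read with standard tooling. The sketch below is self-contained: it writes a toy feature dictionary to a temporary folder and reads it back, so the file names (`toy.pkl`, `toy.json`, `toy.npy`), the helpers `load_features` and `to_matrix`, and the exact JSON layout (item id -> list of floats) are all assumptions rather than the actual names and layout used in the release.

```python
import json
import pickle
import tempfile
from pathlib import Path

import numpy as np


def load_features(path):
    """Load an item_id -> embedding mapping from a .pkl or .json file."""
    path = Path(path)
    if path.suffix == ".pkl":
        with path.open("rb") as fh:
            raw = pickle.load(fh)
    elif path.suffix == ".json":
        with path.open("r", encoding="utf-8") as fh:
            raw = json.load(fh)
    else:
        raise ValueError(f"unsupported feature file: {path.name}")
    # JSON keys are always strings, so normalize ids to str in both cases.
    return {str(k): np.asarray(v, dtype=np.float32) for k, v in raw.items()}


def to_matrix(features, item_ids):
    """Stack embeddings into a 2-D array, one row per id in item_ids."""
    return np.stack([features[str(i)] for i in item_ids])


# Toy embeddings standing in for the real pre-trained features.
toy = {
    1: np.array([0.5, 0.25], dtype=np.float32),
    2: np.array([1.0, -2.0], dtype=np.float32),
}

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    with (tmp / "toy.pkl").open("wb") as fh:
        pickle.dump(toy, fh)
    with (tmp / "toy.json").open("w", encoding="utf-8") as fh:
        json.dump({str(k): v.tolist() for k, v in toy.items()}, fh)

    from_pkl = load_features(tmp / "toy.pkl")
    from_json = load_features(tmp / "toy.json")

    # Dense matrix form, saved and reloaded as .npy.
    np.save(tmp / "toy.npy", to_matrix(from_pkl, [1, 2]))
    matrix = np.load(tmp / "toy.npy")

print(matrix.shape, matrix.dtype)
```

Pickle files can execute arbitrary code when loaded, so the `.json` files are the safer choice when the origin of a download cannot be verified.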

Encode multimodal features

To learn the multimodal features using the encoders considered in our experimental analysis, please refer to the GitHub repository.

IMDB Spoiler Dataset
kaggle.com
Updated May 22, 2019
Rishabh Misra (2019). IMDB Spoiler Dataset [Dataset]. https://www.kaggle.com/datasets/rmisra/imdb-spoiler-dataset/data
Dataset updated
May 22, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rishabh Misra
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

User-generated reviews are often our first point of contact when we consider watching a movie or a TV show. However, beyond telling us the qualitative aspects of the item we want to consume, reviews may inevitably contain undesired revelatory information (i.e. 'spoilers') such as the surprising fate of a character in a movie, or identity of a murderer in a crime-suspense movie etc. For users who are interested in consuming the item but are unaware of the critical plot twists, spoilers may decrease the excitement regarding the pleasurable uncertainty and curiosity of media consumption. Therefore, a natural question is how to identify these spoilers in entertainment reviews, so that users can more effectively navigate review platforms.

Content

This dataset is collected from IMDB. It contains meta-data about items as well as user reviews with information regarding whether a review contains a spoiler or not. For more details on the attributes, please check file descriptions. Following stats provide a good sense of the scale of the dataset:

# records = 573913

# users = 263407

# movies = 1572

# spoiler reviews = 150924

# users with at least one spoiler review = 79039

# items with at least one spoiler review = 1570

Citation

If you use the dataset for your work, please cite the following:

Citation in text format:

Misra, Rishabh. "IMDB Spoiler Dataset." DOI: 10.13140/RG.2.2.11584.15362 (2019).

Citation in BibTeX format:

@dataset{misra2019imdb,
  author = {Misra, Rishabh},
  year = {2019},
  month = {05},
  pages = {},
  title = {IMDB Spoiler Dataset},
  doi = {10.13140/RG.2.2.11584.15362}
}

Please link to rishabhmisra.github.io/publications as the source of this dataset.

Acknowledgement

This dataset is collected from IMDB.

Inspiration

Can you utilize the metadata to identify reviews which contain spoilers?

Additionally, can you uncover signals that make a review spoiler-y?

Apart from spoiler detection, the available metadata can also be used for other tasks, such as rating prediction.

Want to contribute your own datasets?

If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.

Other datasets

Please also check out the following datasets collected by me:

News Headlines Dataset For Sarcasm Detection

News Category Dataset

Clothing Fit Dataset for Size Recommendation

Politifact Fact Check Dataset
