https://academictorrents.com/nolicensespecified
Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.

### Summary

This dataset (ml-20m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in six files, genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all thes
Build an RBM using this dataset to predict whether a particular user will like a movie or not. This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service. Users were selected at random for inclusion. All users selected had rated at least 20 movies. Unlike previous MovieLens data sets, no demographic information is included. Each user is represented by an id, and no other information is provided. The data are contained in three files, movies.dat, ratings.dat and tags.dat. Also included are scripts for generating subsets of the data to support five-fold cross-validation of rating predictions.
User Ids
MovieLens users were selected at random for inclusion. Their ids have been anonymized.
Users were selected separately for inclusion in the ratings and tags data sets, which implies that user ids may appear in one set but not the other.
The anonymized values are consistent between the ratings and tags data files. That is, user id n, if it appears in both files, refers to the same real MovieLens user.
Ratings Data File Structure
All ratings are contained in the file ratings.dat. Each line of this file represents one rating of one movie by one user, and has the following format:
UserID::MovieID::Rating::Timestamp
The lines within this file are ordered first by UserID, then, within user, by MovieID.
Ratings are made on a 5-star scale, with half-star increments.
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
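The ratings format above can be parsed with the standard library alone. The following is a minimal sketch, not code shipped with the dataset: the `Rating` type, the `parse_rating_line` helper, and the sample line are all hypothetical names and values chosen for illustration.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float        # 5-star scale with half-star increments
    rated_at: datetime   # timezone-aware UTC datetime


def parse_rating_line(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(
        int(user_id),
        int(movie_id),
        float(rating),
        # The timestamp is seconds since 1970-01-01 00:00:00 UTC.
        datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    )


# Made-up sample line in the documented format (not taken from the dataset).
sample = parse_rating_line("1::122::4.5::0")
print(sample.rated_at.isoformat())
```

Passing `tz=timezone.utc` keeps the conversion independent of the local time zone of the machine running the code.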
Tags Data File Structure
All tags are contained in the file tags.dat. Each line of this file represents one tag applied to one movie by one user, and has the following format:
UserID::MovieID::Tag::Timestamp
The lines within this file are ordered first by UserID, then, within user, by MovieID.
Tags are user generated metadata about movies. Each tag is typically a single word, or short phrase. The meaning, value and purpose of a particular tag is determined by each user.
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
Movies Data File Structure
Movie information is contained in the file movies.dat. Each line of this file represents one movie, and has the following format:
MovieID::Title::Genres
MovieID is the real MovieLens id.
Movie titles, by policy, should be entered identically to those found in IMDB, including year of release. However, they are entered manually, so errors and inconsistencies may exist.
Genres are a pipe-separated list, and are selected from the following:
Action Adventure Animation Children's Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
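As a rough illustration of the movies format, the sketch below splits one line into its fields, separates the release year from the title, and expands the pipe-separated genres. The `Movie` type, the `parse_movie_line` helper, and the sample line are hypothetical; the year pattern assumes the title ends in a four-digit year in parentheses, which manual entry errors may violate.

```python
import re
from typing import List, NamedTuple, Optional


class Movie(NamedTuple):
    movie_id: int
    title: str
    year: Optional[int]   # None when the title carries no "(YYYY)" suffix
    genres: List[str]


# Assumption: the release year appears as a trailing "(YYYY)" in the title.
_YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")


def parse_movie_line(line: str) -> Movie:
    """Parse one 'MovieID::Title::Genres' line."""
    movie_id, rest = line.rstrip("\n").split("::", 1)
    title, genres = rest.rsplit("::", 1)
    match = _YEAR_SUFFIX.search(title)
    year = int(match.group(1)) if match else None
    if match:
        title = title[: match.start()]
    return Movie(int(movie_id), title, year, genres.split("|"))


# Made-up sample line in the documented format (not taken from the dataset).
movie = parse_movie_line("42::Example Movie (1999)::Action|Sci-Fi")
print(movie)
```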
https://data.gov.tw/license
This dataset provides national theater box office statistics for films distributed by the Administrative Institution National Film and Audiovisual Culture Center. The data is up to the last Sunday before the announcement date and does not include films that have been screened for fewer than 7 calendar days. The earliest CSV format data in this dataset begins on July 30, 2018, and the earliest JSON format data begins on March 1, 2020. JSON format queries require entering the start and end dates (in the format of year, month, and day), and can provide data for a maximum of 90 days at a time.
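Because a JSON query covers at most 90 days, a longer period has to be split into several queries. The sketch below only computes the date windows and makes no network calls; the `date_windows` helper is a hypothetical name, and it assumes that the 90-day limit counts both the start and the end date.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple

MAX_DAYS_PER_QUERY = 90  # limit stated in the dataset description


def date_windows(
    start: date, end: date, max_days: int = MAX_DAYS_PER_QUERY
) -> Iterator[Tuple[date, date]]:
    """Yield consecutive (window_start, window_end) pairs, both inclusive,
    each covering at most max_days calendar days."""
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=max_days - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


# Cover March 1, 2020 (earliest JSON data) through the end of that year.
windows = list(date_windows(date(2020, 3, 1), date(2020, 12, 31)))
for window_start, window_end in windows:
    print(window_start.isoformat(), window_end.isoformat())
```

Each window can then be formatted into whatever start and end date parameters the query interface expects.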
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains information about 10,000 movies, including their titles, release dates, popularity metrics, and voting statistics, sourced from The Movie Database (TMDB). It can be used for data analysis, visualization, and machine learning tasks related to the film industry. The dataset includes detailed movie descriptions and metadata for analysis.

Column Descriptors
adult (bool): Indicates if the movie is adult content.
id (int): Unique identifier for the movie in the TMDB database.
title (string): The movie's primary title.
overview (string): A brief description or summary of the movie.
popularity (float): The movie's popularity score on TMDB.
release_date (string): The official release date of the movie in YYYY-MM-DD format.
vote_count (int): The total number of votes received by the movie.
original_title (string): The movie's title in its original language.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.
Dataset statistics:
Dataset use cases:
Data Analysis:
The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.
Figure 1: Histogram of the runtime in minutes
The next histogram shows the distribution of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime. The mean percentage of gaps is 0.187, the maximum percentage of gaps is 0.033, and the median percentage of gaps is 327.586.
Figure 2: Histogram of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime
The next histogram shows the distribution of the total movie's subtitle duration (seconds) between two consecutive subtitles. The mean subtitle duration is 4,837.089 seconds and the median subtitle duration is 2,906.435 seconds.
Figure 3: Histogram of the total movie's subtitle duration (seconds) between two consecutive subtitles
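To make the notion of a gap concrete, the sketch below computes the durations between consecutive subtitles and their share of the runtime. It is an illustration only and not necessarily how the dataset was produced: the `(start, end)` representation in seconds, the helper names, and the sample values are assumptions.

```python
from typing import List, Tuple

Subtitle = Tuple[float, float]  # (start_seconds, end_seconds); assumed representation


def gap_durations(subtitles: List[Subtitle]) -> List[float]:
    """Return the silence between each pair of consecutive subtitles."""
    ordered = sorted(subtitles)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Overlapping subtitles produce no gap rather than a negative one.
        gaps.append(max(0.0, next_start - prev_end))
    return gaps


def gap_fraction(subtitles: List[Subtitle], runtime_seconds: float) -> float:
    """Return the total gap time as a fraction of the whole runtime."""
    return sum(gap_durations(subtitles)) / runtime_seconds


# Made-up subtitles for a 10-second clip.
subs = [(0.0, 2.0), (5.0, 7.0), (7.5, 9.0)]
gaps = gap_durations(subs)
fraction = gap_fraction(subs, runtime_seconds=10.0)
print(gaps, fraction)
```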
Example use case:
The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose the appropriate settings for the extension, as increasing the playback speed can affect the overall tone and impact of the film.
The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.
Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.
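One way to reason about such settings is to estimate how much time a given threshold would save. The sketch below is a back-of-the-envelope model and not the DAPS implementation: it assumes that every gap at least as long as the minimum duration is played entirely at a fixed speed-up factor, and the function name, parameters, and sample gaps are hypothetical.

```python
from typing import Iterable


def time_saved(gaps: Iterable[float], min_gap_seconds: float, speed: float) -> float:
    """Estimate seconds saved when gaps of at least min_gap_seconds
    are played at `speed` times normal speed (assumed model)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    saved = 0.0
    for gap in gaps:
        if gap >= min_gap_seconds:
            # A gap of `gap` seconds takes gap / speed seconds when sped up.
            saved += gap - gap / speed
    return saved


# Made-up gap durations in seconds.
example_gaps = [1.0, 4.0, 10.0, 30.0]
savings = {t: time_saved(example_gaps, t, speed=2.0) for t in (2.0, 5.0, 20.0)}
print(savings)
```

A higher minimum duration leaves more short pauses untouched, so the estimated saving shrinks as the threshold grows.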
Conclusion
This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.
users.csv
User_id: Unique identifier of user
Country_code: Country code where the user registered

assets.csv
Show_type: Type of content, whether the asset is a movie or an episode of a TV series
Genre: Genre of content
Running_miutes: Runtime of content (playable number of minutes)
Source_language: Production language of content
Asset_id: Unique identifier of video content at the most granular level (a movie or an episode of a TV series)
Season_id: Unique identifier of content at season level. This is only applicable to TV series
Series_id: Unique identifier of content at series level. This is only applicable to TV series
Studio_id: Unique identifier of production studio for the content

plays.csv
Platform: Platform of consumption
Minutes_viewed: Total number of minutes viewed, rounded to the nearest integer (0 means less than 30 seconds)

Demographics.csv and Psychographics.csv
The dataset identifies psychographic and demographic tags about some iflix users. Each user-tag pair has an associated confidence score (1 is the highest, and 0 is the lowest confidence). Each trait can have up to 3 levels, depending on its granularity. Some traits can be identified by considering only the first two levels. Others make sense only when all three levels are considered, e.g., ‘iflix Viewing Behaviour’ is a level 2 psychographic trait that only makes sense in combination with its corresponding level 3 traits (‘casual,’ ‘player’ and ‘addict’). These traits represent different levels of viewing behavior of iflix users. Casual users have fewer than five viewing days in a month, player users have 5 to 12 viewing days in a month, and addict users have more than 12 viewing days in a month. A trait is listed for a user_id in the dataset only if there is sufficient confidence that the user belongs to the trait.

Column and Description
Level_1: Identifies the first level of the trait (psychographic or demographic)
Level_2: Identifies the second level of the trait (e.g., Music Lovers, Movies Lovers)
Level_3: Identifies the third level of the trait, if available/relevant (e.g., Malay Movies Lovers, Indonesian TV Fans)
Confidence_score: Confidence in associating the said trait (level_1, level_2, level_3) with the user
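The viewing-behaviour thresholds described above translate directly into a small classification rule. The sketch below is illustrative only; the `viewing_behaviour` function is a hypothetical name, and treating a month with zero viewing days as "casual" is an assumption the description does not settle.

```python
def viewing_behaviour(viewing_days_in_month: int) -> str:
    """Map a monthly count of viewing days to a level 3 trait label."""
    if viewing_days_in_month < 0:
        raise ValueError("viewing days cannot be negative")
    if viewing_days_in_month < 5:
        return "casual"   # fewer than five viewing days
    if viewing_days_in_month <= 12:
        return "player"   # 5 to 12 viewing days
    return "addict"       # more than 12 viewing days


labels = [viewing_behaviour(days) for days in (4, 5, 12, 13)]
print(labels)
```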
There are two CSV datasets in this publication used initially in the master thesis in sociology of Xenia Leontyeva at HSE University Saint Petersburg, titled "Popularity Factors of Domestic Films: Gender Characteristics and State Support Measures" (2022), and later for the article by Leontyeva, Xenia, Olessia Koltsova, and Deb Verhoeven, titled "Gender (Im)Balance in Russian Cinema: On the Screen and behind the Camera" (Accepted in January 2024 in The Journal of Cultural Analytics). The first dataset (N=1285) includes all Russian films produced between 2008 and 2019 and theatrically released between December 1, 2008, and December 31, 2019. Distribution statistics cover the territory of the CIS, of which the Russian Federation is the biggest market. Budget information is available for 644 films. The second dataset contains markup for the Bechdel-Wallace test, as modified by Leontyeva, for 243 films, 193 of which have budget information. There is also a supplement with a detailed description of all variables and R-code producing tables, plots, and models for the article. The database was collected by Xenia Leontyeva while working at Nevafilm Research (until 2018) and later. In terms of distribution data, it is based on sources such as the open base Russian Cinema Fund Analytics – RCFA (since 2015), the closed base comScore/Rentrak ("International Box Office Essential") serving major Hollywood studios (data from it has been used since 2008 to fill gaps in open databases), Bookers' Bulletin (since 2011), and Russian Film Business Today magazines (since 2004), as well as data self-collected by Nevafilm Research employees from film distributors and producers; the rights to use and continue this dataset have been received from Nevafilm company. In terms of production data, the information was taken from the State register of film distribution certificates, Kinopoisk.ru, and from the films' credits.
Stable benchmark dataset. 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users. Released 1/2009.
https://choosealicense.com/licenses/unknown/
Dataset Card for "rotten_tomatoes"
Dataset Summary
Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.", Proceedings of the ACL, 2005.
Supported Tasks and Leaderboards
More Information Needed
Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
https://creativecommons.org/publicdomain/zero/1.0/
The dataset includes demographic questions such as age, gender, and location, along with preferences related to entertainment and media consumption, and may be used for research purposes. The key topics covered in the survey are:
• Age, Gender, and Location: Respondents' demographic details.
• Movie Preferences: Favorite types of movies and preferred cinema industries (Bollywood, Tollywood, Hollywood, etc.).
• Streaming Platforms: Commonly used streaming services like Hotstar, Netflix, YouTube, Ibomma, etc.
• Social Media Usage: Preferred social media platforms such as Instagram, WhatsApp, Facebook, and Snapchat.
• Leisure Activities: Interests such as watching movies, playing video games, reading books, or listening to music.
• Video Games: Favorite games like Free Fire, BGMI, Candy Crush, and Asphalt.
• Music Genres: Preferences for different genres, including rock, pop, hip-hop, and classic.
• Sports and IPL Preferences: Favorite sportspersons (Sachin Tendulkar, Virat Kohli, Dhoni, etc.) and IPL teams (CSK, SRH, RCB, MI).
• Favorite Directors: Preferences for movie directors like SS Rajamouli, Sukumar, Prasanth Neel, and Trivikram.
This repository contains network graphs and network metadata from Moviegalaxies, a website providing network graph data from about 773 films (1915–2012). The data includes individual network graph data in Graph Exchange XML Format and descriptive statistics on measures such as clustering coefficient, degree, density, diameter, modularity, average path length, the total number of edges, and the total number of nodes.
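Some of the listed measures follow directly from an edge list. The sketch below computes node count, edge count, density, and average degree for a small graph; it assumes the graphs are undirected and simple, uses the standard density formula 2E / (N(N-1)), and only counts nodes that appear in at least one edge. The `graph_stats` helper and the character names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def graph_stats(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute basic statistics for an undirected simple graph."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    # Each undirected edge is stored twice, once per endpoint.
    m = sum(len(adjacent) for adjacent in neighbours.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    average_degree = 2 * m / n if n else 0.0
    return {"nodes": n, "edges": m, "density": density, "average_degree": average_degree}


# Made-up character co-occurrence edges.
stats = graph_stats([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(stats)
```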
MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.
https://choosealicense.com/licenses/other/
This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The summary statistics by North American Industry Classification System (NAICS) which include: operating revenue (dollars x 1,000,000), operating expenses (dollars x 1,000,000), salaries wages and benefits (dollars x 1,000,000), and operating profit margin (by percent), of motion picture and video distribution (NAICS 512120), annual, for five years of data.
https://choosealicense.com/licenses/gpl-3.0/
🎬 MoViFex Dataset
The Movies Visual Features Extracted (MoViFex) dataset contains visual features obtained from a wide range of full-length movies, their shots, and free trailers. It contains frame-level extracted visual features and aggregated versions of them. MoViFex can be used in recommendation, information retrieval, classification, and similar tasks.
📃 Table of Content
How to Use Dataset Stats Files Structure
🚀 How to Use?
The Dataset Web-Page… See the full description on the dataset page: https://huggingface.co/datasets/alitourani/MoViFex_Dataset.
https://academictorrents.com/nolicensespecified
Flixster is a social movie site allowing users to share movie ratings, discover new movies and meet others with similar movie taste.

Number of Nodes: 2523386
Number of Edges: 9197338
Missing Values? no
Source: N/A

Data Set Information: 2 files are included:
1. nodes.csv: the file of all the users. This file works as a dictionary of all the users in this data set. It's useful for fast reference. It contains all the node ids used in the dataset.
2. edges.csv: the friendship network among the users. The friends are represented using edges. Here is an example: 1,2. This means the user with id "1" is a friend of the user with id "2".

Attribute Information: This contains the friendship network crawled in December 2010 by Javier Parra (Javier.Parra@asu.edu). For easier understanding, all the contents are organized in CSV file form
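As an illustration of the edge format, the sketch below builds a friendship lookup from lines such as `1,2`. It assumes that edges.csv has no header row and that friendship is symmetric, and it reads from an in-memory sample rather than the real file; the `load_friendships` helper is a hypothetical name.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_friendships(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Build a user id -> set of friend ids mapping from 'a,b' rows."""
    friends = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        a, b = row[0].strip(), row[1].strip()
        # Assumption: an edge "a,b" means a and b are friends of each other.
        friends[a].add(b)
        friends[b].add(a)
    return friends


# Made-up sample in the documented format; pass an open file for real data.
sample_edges = io.StringIO("1,2\n1,3\n2,3\n3,4\n")
friends = load_friendships(sample_edges)
print(sorted(friends["3"]))
```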
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This folder contains the multimodal features of the three state-of-the-art datasets we have extended (`MovieLens-1M`, `DBbook`, `Last.FM-2K`).
For each folder, we provide both the interaction data in the original format (in the folder `interaction_data`) and the multimodal features in several formats, based on the needs (in the `multimodal_data` folder).
In the following, we provide all the information needed to work with such data. Note that, although some dataset-specific details might change, the general structure is common to all three datasets.
| CF data | ML1M | DBbook | LFM2K |
| --- | --- | --- | --- |
| Users | 6040 | 6181 | 1892 |
| Items | 3706 | 7672 | 17642 |
| Interactions | 1000209 | 140360 | 92834 |
The `interaction_data` folder contains the interaction data provided in the original version of each dataset. We prefer sharing the original version so that everyone can pre-process it in the way they prefer (e.g., apply a certain k-core filtering, or adapt the task to sequential recommendation by exploiting temporal information, when available).
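As an example of such pre-processing, the sketch below applies an iterative k-core filter to a list of (user, item) pairs: it repeatedly drops interactions whose user or item has fewer than `k` interactions until nothing changes. This is one common reading of k-core filtering and not necessarily the variant used in our experiments; the `k_core` helper and the toy data are hypothetical, and the input is assumed to contain no duplicate pairs.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Interaction = Tuple[Hashable, Hashable]  # (user, item)


def k_core(interactions: Sequence[Interaction], k: int) -> List[Interaction]:
    """Keep only interactions whose user and item both retain
    at least k interactions after filtering converges."""
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):
            return kept  # nothing was removed, so the filter has converged
        current = kept


toy = [
    ("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"),
    ("u3", "i1"), ("u3", "i3"), ("u4", "i3"),
]
core = k_core(toy, k=2)
print(core)
```

In the toy data, removing `u4` leaves `i3` with a single interaction, which in turn leaves `u3` with a single interaction, so the filter needs several passes to converge.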
In `MovieLens-1M`, interaction data includes user information (`users.dat`), movie information (`movies.dat`), and user ratings (`ratings.dat`). To work with this data, we suggest reading those files with the `pandas` Python library, using the `ISO-8859-1` encoding (with another encoding, such as `utf-8`, reading will raise an error); the default separator sequence is `::`. For example, in order to read ratings and movie information, one should use:
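import pandas as pd
# '::' is longer than one character, so pandas parses it with the Python engine; naming the engine explicitly avoids a ParserWarning
ratings = pd.read_csv('interaction_data/ratings.dat', sep='::', engine='python', names=['user', 'item', 'rating', 'timestamp'])
movies = pd.read_csv('interaction_data/movies.dat', sep='::', engine='python', names=['id', 'name', 'genres'], encoding='ISO-8859-1')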
In `DBbook`, interaction data includes training and testing data (already split, as in the original version). Unfortunately, that version cannot be downloaded anymore, as the original web page is no longer accessible. Using tools like the Wayback Machine, it is possible to access that page and download some files, but only the training data is available in the backups that have been made, while the test data is not obtainable.
For these reasons, we considered the version of the dataset that has been used in the other works listed below and that is reachable at the public repository of our SWAP Research Group:
- https://dl.acm.org/doi/abs/10.1145/3523227.3551484
- https://dl.acm.org/doi/abs/10.1145/3565472.3592965
- https://dl.acm.org/doi/abs/10.1145/3627043.3659548
- https://link.springer.com/article/10.1007/s11257-024-09417-x
This way, we have been able to reconstruct the full version of this dataset.
Similarly to `MovieLens-1M`, interaction data contains user ratings in the `train.tsv` and `test.tsv` files, and book information in the `DBbook_Items_DBpedia_mapping.tsv` file.
We suggest loading such data using `pandas` as follows:
train = pd.read_csv('interaction_data/train.tsv', sep='\t', names=['userID', 'itemID', 'rating'])
test = pd.read_csv('interaction_data/test.tsv', sep='\t', names=['userID', 'itemID', 'rating'])
books = pd.read_csv('interaction_data/DBbook_Items_DBpedia_mapping.tsv', sep='\t')
In `LFM2K`, interaction data is encoded in the `user_artists.dat` file, which records the listening count for each available (user, item) pair (from this information, it is possible to derive the user ratings). The file `artist_info` encodes information associated with the artists, including the name of the artist, the URL of the associated Last.FM resource, and the link to the image (not available anymore). The file `tags.dat` contains the set of all possible tags users attributed to artists, while the tags attributed to specific artists are encoded in the `user_taggedartists.dat` file (the `user_taggedartists-timestamps` file contains, in addition, the timestamp of the attribution).
In order to read the data, we suggest using `pandas` as follows:
interactions = pd.read_csv('original_data/user_artists.dat', sep='\t')
artist_info = pd.read_csv('original_data/artists.dat', sep='\t')
usertag = pd.read_csv('original_data/user_taggedartists-timestamps.dat', sep='\t')
tags = pd.read_csv('original_data/tags.dat', sep='\t', encoding='latin-1')
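The text above notes that user ratings can be derived from listening counts without fixing a method. The sketch below shows one possible scheme, chosen purely for illustration and not the one used in our experiments: an artist counts as a positive interaction (1) when the user's listening count for it is at least that user's median count, and as a negative one (0) otherwise. The `binarize_by_user_median` helper and the sample triples are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Hashable, List, Tuple

Listen = Tuple[Hashable, Hashable, int]  # (user, artist, listening count)


def binarize_by_user_median(listens: List[Listen]) -> List[Listen]:
    """Turn listening counts into 0/1 ratings relative to each user's median."""
    counts_by_user = defaultdict(list)
    for user, _, count in listens:
        counts_by_user[user].append(count)
    thresholds = {user: median(counts) for user, counts in counts_by_user.items()}
    return [
        (user, artist, 1 if count >= thresholds[user] else 0)
        for user, artist, count in listens
    ]


# Made-up (user, artist, count) triples.
listens = [(1, "a", 100), (1, "b", 10), (1, "c", 50), (2, "a", 3), (2, "d", 7)]
derived = binarize_by_user_median(listens)
print(derived)
```

Normalising per user keeps heavy and light listeners comparable, which a single global threshold would not.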
Each dataset is also provided with multimodal data, in the `multimodal_features` folder. In this folder, we include the source data we considered (plain text and links to image/audio/video files), together with the pre-trained multimodal features.
Here is the coverage of multimodal information w.r.t. the datasets considered:
| Multimodal item coverage | ML1M | DBbook | LFM2K |
| --- | --- | --- | --- |
| Text | 3667 (Plots) | 4197 (Abstracts) | 2813 (Tags) |
| Image | 3197 (Movie posters) | 7588 (Book covers) | 2820 (Top-5 Album Covers) |
| Audio | 3104 (Trailer audio) | - | 2742 (Top-5 album songs) |
| Video | 3105 (Trailer video) | - | - |
With this information, anyone can download the raw features and use them in their own recommendation scenario; in our case, to carry out our experiments, we considered the following state-of-the-art multimodal encoders:
The resulting features have been dumped as `dict` (`item_id` -> `np.float32` embedding) in a pickle `.pkl` file, that can be found in the `multimodal_features/dict` folders (one for each dataset); moreover, to avoid any error in reading such files, we have also saved the embeddings in `.json` files, in the `multimodal_features/json` folders (one for each dataset); finally, to reproduce our experiments, we report the same data as `.npy` files (as required by `MMRec`), that can be found in the `multimodal_features/npy` folders (one for each dataset).
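One practical difference between the pickle and JSON dumps is worth keeping in mind: JSON object keys are always strings, so numeric item ids need to be converted back after loading. The sketch below demonstrates this with a toy dictionary written to a temporary directory; the file names are hypothetical, the embeddings are plain Python lists rather than `np.float32` arrays, and integer item ids are an assumption.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for an item_id -> embedding mapping (real files hold np.float32 vectors).
embeddings = {1: [0.1, 0.2], 2: [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "toy_features.pkl"    # hypothetical file name
    json_path = Path(tmp) / "toy_features.json"  # hypothetical file name
    pkl_path.write_bytes(pickle.dumps(embeddings))
    json_path.write_text(json.dumps(embeddings))

    # Only unpickle files that come from a source you trust.
    from_pickle = pickle.loads(pkl_path.read_bytes())
    from_json_raw = json.loads(json_path.read_text())

# JSON turned the integer keys into strings; restore them explicitly.
from_json = {int(item_id): vector for item_id, vector in from_json_raw.items()}
print(list(from_json_raw), list(from_json))
```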
To learn the multimodal features using the encoders we considered in our experimental analysis, please refer to the GitHub repository
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
User-generated reviews are often our first point of contact when we consider watching a movie or a TV show. However, beyond telling us the qualitative aspects of the item we want to consume, reviews may inevitably contain undesired revelatory information (i.e. 'spoilers') such as the surprising fate of a character in a movie, or identity of a murderer in a crime-suspense movie etc. For users who are interested in consuming the item but are unaware of the critical plot twists, spoilers may decrease the excitement regarding the pleasurable uncertainty and curiosity of media consumption. Therefore, a natural question is how to identify these spoilers in entertainment reviews, so that users can more effectively navigate review platforms.
This dataset is collected from IMDB. It contains meta-data about items as well as user reviews with information regarding whether a review contains a spoiler or not. For more details on the attributes, please check file descriptions. Following stats provide a good sense of the scale of the dataset:
- # records = 573913
- # users = 263407
- # movies = 1572
- # spoiler reviews = 150924
- # users with at least one spoiler review = 79039
- # items with at least one spoiler review = 1570
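For a quick sense of class balance, the counts above can be turned into proportions. The sketch below simply divides the published figures; the dictionary keys are hypothetical labels and not column names from the dataset.

```python
# Counts copied from the dataset statistics above.
stats = {
    "records": 573913,
    "users": 263407,
    "movies": 1572,
    "spoiler_reviews": 150924,
    "users_with_spoiler": 79039,
    "items_with_spoiler": 1570,
}

spoiler_review_share = stats["spoiler_reviews"] / stats["records"]
spoiler_user_share = stats["users_with_spoiler"] / stats["users"]
spoiler_item_share = stats["items_with_spoiler"] / stats["movies"]
reviews_per_movie = stats["records"] / stats["movies"]

print(f"spoiler reviews: {spoiler_review_share:.1%}")
print(f"users with a spoiler review: {spoiler_user_share:.1%}")
print(f"items with a spoiler review: {spoiler_item_share:.1%}")
print(f"average reviews per movie: {reviews_per_movie:.0f}")
```

Roughly a quarter of the reviews are spoilers, so a classifier trained on this data faces a moderately imbalanced binary task.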
If you use the dataset for your work, please cite the following:
Citation in text format
Misra, Rishabh. "IMDB Spoiler Dataset." DOI: 10.13140/RG.2.2.11584.15362 (2019).
Citation in BibTex format
@dataset{misra2019imdb,
author = {Misra, Rishabh},
year = {2019},
month = {05},
pages = {},
title = {IMDB Spoiler Dataset},
doi = {10.13140/RG.2.2.11584.15362}
}
Please link to rishabhmisra.github.io/publications as the source of this dataset.
If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.
Please also check out the following datasets collected by me: