62 datasets found
  1. MovieLens 1M

    • grouplens.org
    • meilu1.jpshuntong.com
    • +1more
    Updated Mar 19, 2016
    Cite
    (2016). MovieLens 1M [Dataset]. https://grouplens.org/datasets/movielens/1m/
    Description

    Stable benchmark dataset. 1 million ratings from 6000 users on 4000 movies. Released 2/2003.

  2. MovieLens 20M Dataset

    • academictorrents.com
    • grouplens.org
    bittorrent
    Updated Dec 16, 2016
    Cite
    None (2016). MovieLens 20M Dataset [Dataset]. https://academictorrents.com/details/296054417b4d8eeeb4c7b1c842570bf792ee4d14
    Available download formats: bittorrent (198702078)
    Authors
    None
    License

    https://academictorrents.com/nolicensespecified

    Description

    Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.

    Summary

    This dataset (ml-20m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

    Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

    The data are contained in six files: genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all thes

  3. MovieLens 10M Dataset

    • kaggle.com
    zip
    Updated Mar 26, 2021
    Cite
    Smriti (2021). MovieLens 10M Dataset [Dataset]. https://www.kaggle.com/smritisingh1997/movielens-10m-dataset
    Available download formats: zip (67393676 bytes)
    Authors
    Smriti
    Description

    Build an RBM using this dataset to predict whether a particular user will like a movie or not.

    This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens. Users were selected at random for inclusion. All users selected had rated at least 20 movies. Unlike previous MovieLens data sets, no demographic information is included. Each user is represented by an id, and no other information is provided.

    The data are contained in three files: movies.dat, ratings.dat and tags.dat. Also included are scripts for generating subsets of the data to support five-fold cross-validation of rating predictions.

    User Ids

    MovieLens users were selected at random for inclusion. Their ids have been anonymized.

    Users were selected separately for inclusion in the ratings and tags data sets, which implies that user ids may appear in one set but not the other.

    The anonymized values are consistent between the ratings and tags data files. That is, user id n, if it appears in both files, refers to the same real MovieLens user.

    Ratings Data File Structure

    All ratings are contained in the file ratings.dat. Each line of this file represents one rating of one movie by one user, and has the following format:

    UserID::MovieID::Rating::Timestamp

    The lines within this file are ordered first by UserID, then, within user, by MovieID.

    Ratings are made on a 5-star scale, with half-star increments.

    Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
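
    The ratings line format above can be parsed with the standard library alone. The following is a minimal sketch, not part of the dataset's own tooling: the `Rating` type, the `parse_rating_line` helper, and the sample line are all made up for illustration.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float        # 5-star scale with half-star increments
    rated_at: datetime   # converted from seconds since 1970-01-01 UTC


def parse_rating_line(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line (hypothetical helper)."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(
        int(user_id),
        int(movie_id),
        float(rating),
        datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    )


# Made-up sample line, not copied from ratings.dat.
sample = parse_rating_line("1::122::4.5::838985046")
print(sample)
```

    Keeping the timestamp timezone-aware avoids silently interpreting the epoch seconds in the local timezone of the machine running the script.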

    Tags Data File Structure

    All tags are contained in the file tags.dat. Each line of this file represents one tag applied to one movie by one user, and has the following format:

    UserID::MovieID::Tag::Timestamp

    The lines within this file are ordered first by UserID, then, within user, by MovieID.

    Tags are user-generated metadata about movies. Each tag is typically a single word or a short phrase. The meaning, value and purpose of a particular tag are determined by each user.

    Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

    Movies Data File Structure

    Movie information is contained in the file movies.dat. Each line of this file represents one movie, and has the following format:

    MovieID::Title::Genres

    MovieID is the real MovieLens id.

    Movie titles, by policy, should be entered identically to those found in IMDB, including year of release. However, they are entered manually, so errors and inconsistencies may exist.

    Genres are a pipe-separated list, and are selected from the following:

    Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western
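
    As an illustration of the movies.dat layout, the sketch below splits one line into its three fields and turns the pipe-separated genres into a list. The `Movie` type, the `parse_movie_line` helper, and the sample line are hypothetical; splitting the title from both ends is a defensive assumption in case a manually entered title contains `::`.

```python
from typing import List, NamedTuple


class Movie(NamedTuple):
    movie_id: int
    title: str
    genres: List[str]


def parse_movie_line(line: str) -> Movie:
    """Parse one 'MovieID::Title::Genres' line (hypothetical helper)."""
    # Take the id from the left and the genres from the right, so a title
    # that itself contains '::' is kept in one piece.
    movie_id, rest = line.strip().split("::", 1)
    title, genres = rest.rsplit("::", 1)
    return Movie(int(movie_id), title, genres.split("|"))


# Made-up sample line, not copied from movies.dat.
example = parse_movie_line("999999::Example Movie (2000)::Action|Sci-Fi")
print(example)
```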

  4. MovieLens 100K

    • grouplens.org
    Updated Oct 12, 2015
    Cite
    (2015). MovieLens 100K [Dataset]. https://grouplens.org/datasets/movielens/100k/
    Description

    Stable benchmark dataset. 100,000 ratings from 1000 users on 1700 movies. Released 4/1998.

  5. National box office statistics

    • data.gov.tw
    csv, json
    Updated Jun 26, 2024
    Cite
    Ministry of Culture (2024). National box office statistics [Dataset]. https://data.gov.tw/en/datasets/94224
    Available download formats: json, csv
    Dataset authored and provided by
    Ministry of Culture
    License

    https://data.gov.tw/license

    Description

    This dataset provides national theater box office statistics for films distributed by the Administrative Institution National Film and Audiovisual Culture Center. The data run through the last Sunday before the announcement date and do not include films that have been screened for fewer than 7 calendar days. The earliest CSV format data in this dataset begins on July 30, 2018, and the earliest JSON format data begins on March 1, 2020. JSON format queries require entering the start and end dates (in the format of year, month, and day), and can provide data for a maximum of 90 days at a time.
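
    Because a JSON query covers at most 90 days, a longer period has to be split into several requests. The sketch below only computes the date windows; it does not call the service, and the `query_windows` helper is hypothetical. It assumes the 90-day limit counts both the start and end dates, which the description does not state explicitly.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple

MAX_DAYS_PER_QUERY = 90  # limit stated in the dataset description


def query_windows(start: date, end: date,
                  max_days: int = MAX_DAYS_PER_QUERY) -> Iterator[Tuple[date, date]]:
    """Yield consecutive (start, end) pairs, each covering at most max_days days inclusive."""
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)


# Example: cover March 1, 2020 through December 31, 2020.
windows = list(query_windows(date(2020, 3, 1), date(2020, 12, 31)))
for window_start, window_end in windows:
    print(window_start.isoformat(), window_end.isoformat())
```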

  6. TMDB top 10K movies data

    • kaggle.com
    Updated Jan 3, 2025
    Cite
    Tanish Jangir (2025). TMDB top 10K movies data [Dataset]. https://www.kaggle.com/datasets/tanishjangir/tmdb-top-10k-movies-data
    Dataset provided by
    Kaggle
    Authors
    Tanish Jangir
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset contains information about 10,000 movies, including their titles, release dates, popularity metrics, and voting statistics, sourced from The Movie Database (TMDB). It can be used for data analysis, visualization, and machine learning tasks related to the film industry. The dataset includes detailed movie descriptions and metadata for analysis.

    Column Descriptors

    • adult (bool): Indicates if the movie is adult content.
    • id (int): Unique identifier for the movie in the TMDB database.
    • title (string): The movie's primary title.
    • overview (string): A brief description or summary of the movie.
    • popularity (float): The movie's popularity score on TMDB.
    • release_date (string): The official release date of the movie in YYYY-MM-DD format.
    • vote_count (int): The total number of votes received by the movie.
    • original_title (string): The movie's title in its original language.

  7. Movie Subtitle Durations

    • kaggle.com
    Updated Oct 9, 2023
    Cite
    Nevo Itzhak (2023). Movie Subtitle Durations [Dataset]. https://www.kaggle.com/datasets/nevoit/movie-subtitle-durations
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nevo Itzhak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.

    Dataset statistics:

    • Average duration between subtitles
    • Average duration between subtitles with a duration greater than 10, 30, 60, 120, and 300 seconds
    • Maximum duration between subtitles
    • Percentage of duration between subtitles from the runtime
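
    The statistics listed above can be illustrated with a short sketch. This is one possible reading of those definitions, not the code used to build the dataset: the `gap_statistics` helper, its output keys, and the sample cues are hypothetical, and subtitle cues are assumed to be (start, end) pairs in seconds sorted by start time.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def gap_statistics(cues: List[Tuple[float, float]], runtime_s: float,
                   thresholds: Sequence[int] = (10, 30, 60, 120, 300)
                   ) -> Dict[str, Optional[float]]:
    """Summarise the gaps between consecutive subtitle cues (hypothetical helper)."""
    # Gap = start of the next cue minus end of the previous one; overlaps count as 0.
    gaps = [max(0.0, next_start - prev_end)
            for (_, prev_end), (next_start, _) in zip(cues, cues[1:])]
    if not gaps:
        raise ValueError("need at least two cues")
    stats: Dict[str, Optional[float]] = {
        "mean_gap": sum(gaps) / len(gaps),
        "max_gap": max(gaps),
        "gap_share_of_runtime": sum(gaps) / runtime_s,
    }
    for t in thresholds:
        long_gaps = [g for g in gaps if g > t]
        # Mean over gaps longer than t seconds; None when no gap qualifies.
        stats[f"mean_gap_over_{t}s"] = sum(long_gaps) / len(long_gaps) if long_gaps else None
    return stats


# Made-up cues for a 100-second clip.
example = gap_statistics([(0, 5), (10, 20), (60, 70)], runtime_s=100)
print(example)
```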

    Dataset use cases:

    • Understanding how dialogue is used in movies, such as the average duration of a dialogue scene and how the duration of dialogue varies between different genres
    • Developing tools to improve the watching experience by adjusting the playback speed of dialogue scenes
    • Evaluating the effectiveness of tools like the VLC extension mentioned below

    Data Analysis:

    The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.

    Figure 1: Histogram of the runtime in minutes

    The next histogram shows the distribution of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime. The mean percentage of gaps is 0.187, the maximum percentage of gaps is 0.033, and the median percentage of gaps is 327.586.

    Figure 2: Histogram of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime

    The next histogram shows the distribution of the total movie's subtitle duration (seconds) between two consecutive subtitles. The mean subtitle duration is 4,837.089 seconds and the median subtitle duration is 2,906.435 seconds.

    Figure 3: Histogram of the total movie's subtitle duration (seconds) between two consecutive subtitles

    Example use case:

    The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose appropriate settings for the extension, as increasing the playback speed can affect the overall tone and impact of the film.

    The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.

    Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.

    Conclusion

    This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.

  8. iFlix movie streaming dataset

    • kaggle.com
    Updated Jan 8, 2020
    Cite
    Aung Pyae (2020). iFlix movie streaming dataset [Dataset]. https://www.kaggle.com/aungpyaeap/movie-streaming-datasets-iflix/discussion
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aung Pyae
    Description

    users.csv

    • User_id: Unique identifier of user
    • Country_code: Country code where the user registered

    assets.csv

    • Show_type: Type of content, whether the asset is a movie or an episode of a TV series
    • Genre: Genre of content
    • Running_miutes: Runtime of content (playable number of minutes)
    • Source_language: Production language of content
    • Asset_id: Unique identifier of video content at the most granular level (a movie or an episode of a TV series)
    • Season_id: Unique identifier of content at season level. This is only applicable to TV series
    • Series_id: Unique identifier of content at series level. This is only applicable to TV series
    • Studio_id: Unique identifier of production studio for the content

    plays.csv

    • Platform: Platform of consumption
    • Minutes_viewed: Total number of minutes viewed, rounded to the nearest integer (0 means less than 30 seconds)

    Demographics.csv, Psychographics.csv

    The dataset identifies psychographic and demographic tags about some iflix users. Each user-tag pair has an associated confidence score (1 is the highest, and 0 is the lowest confidence). Each trait can have up to 3 levels, depending on its granularity. Some traits can be identified by considering only the first two levels. Others make more sense when all three levels are considered, e.g., 'iflix Viewing Behaviour' is a level 2 psychographic trait that only makes sense in combination with its level 3 traits ('casual', 'player' and 'addict'). These traits represent different levels of viewing behavior of iflix users. Casual users have fewer than five viewing days in a month, player users have 5 to 12 viewing days in a month, and addict users have more than 12 viewing days in a month. A trait is listed for a user_id only if there is sufficient confidence that the user belongs to the trait.

    Column and Description

    • Level_1: Identifies the first level of the trait (psychographic or demographic)
    • Level_2: Identifies the second level of the trait (e.g., Music Lovers, Movies Lovers)
    • Level_3: Identifies the third level of the trait, if available/relevant (e.g., Malay Movies Lovers, Indonesian TV Fans)
    • Confidence_score: Confidence in associating the said trait (level_1, level_2, level_3) with the user
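
    The thresholds for the 'iflix Viewing Behaviour' level 3 traits can be written as a small function. This is only a sketch of the rule as described above, not iflix's own tagging logic; the `viewing_behaviour` helper is hypothetical and takes the number of viewing days in one month as its input.

```python
def viewing_behaviour(viewing_days_in_month: int) -> str:
    """Map monthly viewing days to the level 3 trait named in the description."""
    if viewing_days_in_month < 0:
        raise ValueError("viewing days cannot be negative")
    if viewing_days_in_month < 5:
        return "casual"   # fewer than five viewing days
    if viewing_days_in_month <= 12:
        return "player"   # 5 to 12 viewing days
    return "addict"       # more than 12 viewing days


for days in (4, 5, 12, 13):
    print(days, viewing_behaviour(days))
```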

  9. Replication Data for 'Gender (im)balance in the Russian cinema: on the...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Sep 24, 2024
    Cite
    Leontyeva, Xenia (2024). Replication Data for 'Gender (im)balance in the Russian cinema: on the screen and behind the camera' [Dataset]. http://doi.org/10.7910/DVN/ISVTB4
    Dataset provided by
    Harvard Dataverse
    Authors
    Leontyeva, Xenia
    Description

    There are two CSV datasets in this publication. They were used initially in the master's thesis in sociology of Xenia Leontyeva at HSE University Saint Petersburg, titled "Popularity Factors of Domestic Films: Gender Characteristics and State Support Measures" (2022), and later for the article by Leontyeva, Xenia, Olessia Koltsova, and Deb Verhoeven, titled "Gender (Im)Balance in Russian Cinema: On the Screen and behind the Camera" (accepted in January 2024 in The Journal of Cultural Analytics).

    The first dataset (N=1285) includes all Russian films produced between 2008 and 2019 and theatrically released between December 1, 2008, and December 31, 2019. Distribution statistics cover the territory of the CIS, of which the Russian Federation is the biggest market. Budget information is available for 644 films. The second dataset contains markup for the Bechdel-Wallace test, as modified by Leontyeva, for 243 films, 193 of which have budget information. There is also a supplement with a detailed description of all variables and R code producing tables, plots, and models for the article.

    The database was collected by Xenia Leontyeva while working at Nevafilm Research (until 2018) and later. In terms of distribution data, it is based on sources such as the open base Russian Cinema Fund Analytics – RCFA (since 2015), the closed base comScore/Rentrak ("International Box Office Essential") serving major Hollywood studios (data from it has been used since 2008 to fill gaps in open databases), Bookers' Bulletin (since 2011), and Russian Film Business Today magazines (since 2004), as well as data collected by Nevafilm Research employees from film distributors and producers; the rights to use and continue this dataset have been received from the Nevafilm company. In terms of production data, the information was taken from the State register of film distribution certificates, Kinopoisk.ru, and the films' credits.

  10. MovieLens 10M

    • grouplens.org
    Updated Mar 22, 2016
    Cite
    (2016). MovieLens 10M [Dataset]. https://grouplens.org/datasets/movielens/10m/
    Description

    Stable benchmark dataset. 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users. Released 1/2009.

  11. rotten_tomatoes

    • huggingface.co
    Updated Aug 14, 2023
    Cite
    cornell-movie-review-data (2023). rotten_tomatoes [Dataset]. https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes
    Dataset authored and provided by
    cornell-movie-review-data
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for "rotten_tomatoes"

    Dataset Summary

    Movie Review Dataset. This dataset contains 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005.

    Supported Tasks and Leaderboards

    More Information Needed

    Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
    
  12. ENTERTAINMENT

    • kaggle.com
    Updated Mar 24, 2025
    Cite
    Rallapalli Shahul (2025). ENTERTAINMENT [Dataset]. https://www.kaggle.com/datasets/rallapallishahul/entertainment
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rallapalli Shahul
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset includes demographic questions such as age, gender, and location, along with preferences related to entertainment and media consumption. It may be used for research purposes. The key topics covered in the survey are:

    • Age, Gender, and Location: Respondents' demographic details.
    • Movie Preferences: Favorite types of movies and preferred cinema industries (Bollywood, Tollywood, Hollywood, etc.).
    • Streaming Platforms: Commonly used streaming services like Hotstar, Netflix, YouTube, Ibomma, etc.
    • Social Media Usage: Preferred social media platforms such as Instagram, WhatsApp, Facebook, and Snapchat.
    • Leisure Activities: Interests such as watching movies, playing video games, reading books, or listening to music.
    • Video Games: Favorite games like Free Fire, BGMI, Candy Crush, and Asphalt.
    • Music Genres: Preferences for different genres, including rock, pop, hip-hop, and classic.
    • Sports and IPL Preferences: Favorite sportspersons (Sachin Tendulkar, Virat Kohli, Dhoni, etc.) and IPL teams (CSK, SRH, RCB, MI).
    • Favorite Directors: Preferences for movie directors like SS Rajamouli, Sukumar, Prasanth Neel, and Trivikram.

  13. Moviegalaxies – Social Networks in Movies

    • marketplace.sshopencloud.eu
    • dataverse.harvard.edu
    • +1more
    Updated Feb 11, 2022
    Cite
    (2022). Moviegalaxies – Social Networks in Movies [Dataset]. http://doi.org/10.7910/DVN/T4HBA3
    Description

    This repository contains network graphs and network metadata from Moviegalaxies, a website providing network graph data from about 773 films (1915–2012). The data includes individual network graph data in Graph Exchange XML Format and descriptive statistics on measures such as clustering coefficient, degree, density, diameter, modularity, average path length, the total number of edges, and the total number of nodes.

  14. MR Dataset

    • paperswithcode.com
    Updated Apr 28, 2021
    Cite
    (2021). MR Dataset [Dataset]. https://paperswithcode.com/dataset/mr
    Description

    MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.

  15. Data from: imdb

    • huggingface.co
    Updated May 10, 2025
    Cite
    scikit-learn (2025). imdb [Dataset]. https://huggingface.co/datasets/scikit-learn/imdb
    Dataset authored and provided by
    scikit-learn
    License

    https://choosealicense.com/licenses/other/

    Description

    This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.

  16. Film and video distribution, summary statistics

    • ouvert.canada.ca
    • www150.statcan.gc.ca
    • +2more
    csv, html, xml
    Updated Oct 3, 2024
    Cite
    Statistics Canada (2024). Film and video distribution, summary statistics [Dataset]. https://ouvert.canada.ca/data/dataset/030fdbcc-0f41-4958-804b-63f4c0429b7c
    Available download formats: xml, html, csv
    Dataset provided by
    Statistics Canada (https://statcan.gc.ca/en)
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Description

    Summary statistics by North American Industry Classification System (NAICS) for motion picture and video distribution (NAICS 512120): operating revenue (dollars x 1,000,000), operating expenses (dollars x 1,000,000), salaries, wages and benefits (dollars x 1,000,000), and operating profit margin (percent). Annual data, for five years.

  17. MoViFex_Dataset

    • huggingface.co
    Updated May 11, 2024
    Cite
    Ali Tourani (2024). MoViFex_Dataset [Dataset]. https://huggingface.co/datasets/alitourani/MoViFex_Dataset
    Authors
    Ali Tourani
    License

    https://choosealicense.com/licenses/gpl-3.0/

    Description

    🎬 MoViFex Dataset

    The Movies Visual Features Extracted (MoViFex) dataset contains visual features obtained from a wide range of full-length movies, their shots, and free trailers. It contains frame-level extracted visual features and aggregated versions of them. MoViFex can be used in recommendation, information retrieval, classification, and similar tasks.

    📃 Table of Content

    • How to Use
    • Dataset Stats
    • Files Structure

    🚀 How to Use?

    The Dataset Web-Page… See the full description on the dataset page: https://huggingface.co/datasets/alitourani/MoViFex_Dataset.
    
  18. Arizona State University Flixster Data Set

    • academictorrents.com
    bittorrent
    Updated Dec 23, 2013
    Cite
    Flixter (2013). Arizona State University Flixster Data Set [Dataset]. https://academictorrents.com/details/4960373ea6dec89153639b0975ea92f9e3d3c914
    Available download formats: bittorrent (36140875)
    Dataset provided by
    Flixster.com (https://www.facebook.com/FlixsterMovies)
    Authors
    Flixter
    License

    https://academictorrents.com/nolicensespecified

    Area covered
    Arizona
    Description

    Flixster is a social movie site allowing users to share movie ratings, discover new movies and meet others with similar movie taste.

    • Number of Nodes: 2523386
    • Number of Edges: 9197338
    • Missing Values? no
    • Source: N/A

    Data Set Information: 2 files are included:

    1. nodes.csv: the file of all the users. This file works as a dictionary of all the users in this data set. It's useful for fast reference. It contains all the node ids used in the dataset.
    2. edges.csv: the friendship network among the users. The friends are represented using edges. For example, the line 1,2 means that the user with id "1" is friends with the user with id "2".

    Attribute Information: This contains the friendship network crawled in December 2010 by Javier Parra (Javier.Parra@asu.edu). For easier understanding, all the contents are organized in CSV file form
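
    The edge-list format described for edges.csv can be turned into an adjacency map with the standard library. The sketch below is illustrative only: the `load_friendships` helper and the sample rows are made up, and it assumes that the file has no header row and that friendship is mutual, so each edge is stored in both directions.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_friendships(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Build a user id -> set of friend ids map from 'a,b' rows (hypothetical helper)."""
    friends: Dict[str, Set[str]] = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        a, b = row
        friends[a].add(b)
        friends[b].add(a)
    return friends


# Made-up rows in the same shape as the example in the description.
sample_edges = io.StringIO("1,2\n1,3\n2,3\n4,1\n")
friends = load_friendships(sample_edges)
print(sorted(friends["1"]))
```

    With the real file, an open file handle can be passed in place of the `io.StringIO` object; given the stated 9197338 edges, streaming rows this way avoids loading the raw text twice.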

  19. See the Movie, Hear the Song, Read the Book: Extending MovieLens-1M, Last.fm...

    • zenodo.org
    bin, tsv, zip
    Updated May 16, 2025
    Cite
    Giuseppe Spillo; Elio Musacchio; Cataldo Musto; Marco de Gemmis; Pasquale Lops; Giovanni Semeraro (2025). See the Movie, Hear the Song, Read the Book: Extending MovieLens-1M, Last.fm 2K, and DBBook with multimodal Data [Dataset]. http://doi.org/10.5281/zenodo.15403972
    Available download formats: zip, tsv, bin
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Giuseppe Spillo; Elio Musacchio; Cataldo Musto; Marco de Gemmis; Pasquale Lops; Giovanni Semeraro
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Datasets Structure

    This folder contains the multimodal features of the three state-of-the-art datasets we have extended (`MovieLens-1M`, `DBbook`, `Last.FM-2K`).

    For each folder, we provide both the interaction data in the original format (in the folder `interaction_data`) and the multimodal features in several formats, based on the needs (in the `multimodal_data` folder).

    In the following, we provide all the information needed to work with such data. Note that, although some dataset-specific details might change, the general structure is common to all three datasets.

    Dataset statistics

    CF data        ML1M      DBbook    LFM2k
    Users          6040      6181      1892
    Items          3706      7672      17642
    Interactions   1000209   140360    92834

    Interaction data

    The `interaction_data` folder contains the interaction data provided in the original version of each dataset. We prefer sharing the original version so that everyone can pre-process it in the way they prefer (e.g., apply a certain k-core filtering, or adapt the task to sequential recommendation by exploiting temporal information, when available).

    ML1M

    In `MovieLens-1M`, interaction data includes user information (`users.dat`), movie information (`movies.dat`), and user ratings (`ratings.dat`). To work with this data, we suggest reading those files with the `pandas` Python library, using the `ISO-8859-1` encoding (with another encoding, such as `utf-8`, reading will raise an error); the default separator sequence is `::`. For example, to read ratings and movie information, one can use:

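    import pandas as pd
    # A multi-character separator such as '::' is handled by the Python parsing engine; naming it explicitly avoids a ParserWarning
    ratings = pd.read_csv('interaction_data/ratings.dat', sep='::', names=['user', 'item', 'rating', 'timestamp'], engine='python')
    movies = pd.read_csv('interaction_data/movies.dat', sep='::', names=['id', 'name', 'genres'], encoding='ISO-8859-1', engine='python')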

    DBbook

    In `DBbook`, interaction data includes training and testing data (already split, as in the original version). Unfortunately, the original version can no longer be downloaded because the original web page is no longer accessible. Using tools such as the Wayback Machine, it is possible to access that page and download some files, but only the training data is available in the archived backups, while the test data is not obtainable.
    For these reasons, we considered the version of the dataset that has been used in the other works listed below and that is reachable at the public repository of our SWAP Research Group:
    - https://dl.acm.org/doi/abs/10.1145/3523227.3551484
    - https://dl.acm.org/doi/abs/10.1145/3565472.3592965
    - https://dl.acm.org/doi/abs/10.1145/3627043.3659548
    - https://link.springer.com/article/10.1007/s11257-024-09417-x

    This way, we have been able to reconstruct the full version of this dataset.
    Similarly to `MovieLens-1M`, interaction data contains user ratings in the `train.tsv` and `test.tsv` files, and book information in the `DBbook_Items_DBpedia_mapping.tsv` file.

    We suggest loading such data using `pandas` as follows:

    train = pd.read_csv('interaction_data/train.tsv', sep='\t', names=['userID', 'itemID', 'rating'])
    test = pd.read_csv('interaction_data/test.tsv', sep='\t', names=['userID', 'itemID', 'rating'])
    books = pd.read_csv('interaction_data/DBbook_Items_DBpedia_mapping.tsv', sep='\t')


    Last.FM-2K

    In `LFM2K`, interaction data is encoded in the `user_artists.dat` file, which stores the listening count for each available (user, item) pair; user ratings can be derived from this information. The `artists.dat` file encodes information associated with the artists, including the artist name, the URL of the associated Last.FM resource, and the link to the image (no longer available). The `tags.dat` file contains the set of all tags that users attributed to artists, while the tags attributed to specific artists are encoded in the `user_taggedartists.dat` file (the `user_taggedartists-timestamps.dat` file additionally contains the timestamp of each attribution).

    To read the data, we suggest using `pandas` as follows:


    interactions = pd.read_csv('original_data/user_artists.dat', sep='\t')
    artist_info = pd.read_csv('original_data/artists.dat', sep='\t')
    usertag = pd.read_csv('original_data/user_taggedartists-timestamps.dat', sep='\t')
    tags = pd.read_csv('original_data/tags.dat', sep='\t', encoding='latin-1')
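
    The description notes that user ratings can be derived from the listening counts but does not say how. One possible approach, shown here only as a sketch, is to rank each user's counts and map the rank to a 1-5 scale. This is not necessarily the procedure used by the dataset authors; the function name `counts_to_ratings`, the 5-level scale, and the column names `userID`, `artistID`, and `weight` are assumptions of this example.

```python
import numpy as np
import pandas as pd


def counts_to_ratings(df, user_col="userID", count_col="weight", n_levels=5):
    """Map each user's listening counts to 1..n_levels using within-user ranks."""
    grouped = df.groupby(user_col)[count_col]
    # Rank of each count among the same user's counts (ties share the highest rank).
    rank = grouped.rank(method="max")
    # Number of interactions of the user on each row.
    size = grouped.transform("size")
    ratings = np.ceil(rank * n_levels / size).astype(int)
    return df.assign(rating=ratings)


# Toy listening counts (made-up values, not taken from the dataset).
toy = pd.DataFrame({
    "userID": [1, 1, 1, 1, 1, 2],
    "artistID": [10, 11, 12, 13, 14, 10],
    "weight": [10, 20, 30, 40, 50, 7],
})
rated = counts_to_ratings(toy)
```

    Ranking within each user, rather than thresholding raw counts, keeps heavy and light listeners on a comparable scale; note that a user with a single interaction always receives the top level under this scheme.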

    Multimodal data

    Each dataset is also provided with multimodal data, in the `multimodal_features` folder. In this folder, we include the source data we considered (plain text and links to image/audio/video files), together with the pre-trained multimodal features.

    Here is the coverage of multimodal information w.r.t. the datasets considered:

    | Multimodal item coverage | ML1M                 | DBbook             | LFM2K                     |
    | Text                     | 3667 (Plots)         | 4197 (Abstracts)   | 2813 (Tags)               |
    | Image                    | 3197 (Movie posters) | 7588 (Book covers) | 2820 (Top-5 Album Covers) |
    | Audio                    | 3104 (Trailer audio) | -                  | 2742 (Top-5 album songs)  |
    | Video                    | 3105 (Trailer video) | -                  | -                         |

    • As shown in the table, for `ML1M` we have gathered movie plots (text), movie posters (images), and movie trailers (for audio and video); in the `movielens_1m/multimodal_features` folder, we provide an extended mapping named `ml1m_full_extended_mapping`, which lists the links for downloading `covers` and `trailers`, while `text` is available in the `text_ml1m.tsv` file.
    • For `DBbook`, we have gathered book abstracts (text) and book covers (images); in the `dbbook/multimodal_features` folder, we provide an extended mapping named `full_extended_dbbook_img_links.tsv`, which lists the links for downloading the `book covers`, while `text` is available in the `dbbook_text.tsv` file.
    • For `LFM2K`, we have gathered artist tags (text), the top-5 most popular album covers (images), and the top-5 most popular songs (audio); in the `lfm2k/multimodal_features.tsv` folder, we provide extended mappings, named `lfm2k_song_extended_mapping.tsv` and `lfm2k_covers_extended_mapping.tsv`, that encode the top-5 most popular `songs` and `album covers` for each artist, respectively; the `lfm2k_text.tsv` file encodes the `text` we considered, obtained from the user tags.

    With this information, anyone can download the raw features and use them in their own recommendation scenario; in our case, to carry out our experiments, we considered the following state-of-the-art multimodal encoders:

    • Text: we considered `MiniLM` and `MPNET` (for `ML1M`, `DBbook`, and `LFM2K`)
    • Image: we considered `ResNet152`, `VGG`, `ViT_AVG`, `ViT_CLS` (for `ML1M`, `DBbook`, and `LFM2K`)
    • Audio: we considered `VGGish` and `Whisper` (for `ML1M` and `LFM2K`)
    • Video: we considered `I3D` and `R(2+1)D` (for `ML1M`)

    The resulting features have been dumped as `dict` (`item_id` -> `np.float32` embedding) in a pickle `.pkl` file, that can be found in the `multimodal_features/dict` folders (one for each dataset); moreover, to avoid any error in reading such files, we have also saved the embeddings in `.json` files, in the `multimodal_features/json` folders (one for each dataset); finally, to reproduce our experiments, we report the same data as `.npy` files (as required by `MMRec`), that can be found in the `multimodal_features/npy` folders (one for each dataset).
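
    The three storage formats described above can be read back as sketched below. The example first writes a tiny stand-in file in each format so that it runs on its own; the file names (`toy.pkl`, `toy.json`, `toy.npy`), the 4-dimensional embeddings, the JSON layout (item id mapped to a list of floats), and the row order of the `.npy` matrix are all assumptions of this example and should be checked against the actual files.

```python
import json
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Tiny stand-in for one feature file (hypothetical item ids and values;
# the released files hold the encoder outputs instead).
toy = {1: np.ones(4, dtype=np.float32), 2: np.zeros(4, dtype=np.float32)}

tmp = Path(tempfile.mkdtemp())
with open(tmp / "toy.pkl", "wb") as f:
    pickle.dump(toy, f)
with open(tmp / "toy.json", "w") as f:
    json.dump({str(k): v.tolist() for k, v in toy.items()}, f)
np.save(tmp / "toy.npy", np.stack([toy[k] for k in sorted(toy)]))

# Pickle: a dict mapping item_id -> np.float32 embedding.
with open(tmp / "toy.pkl", "rb") as f:
    from_pkl = pickle.load(f)

# JSON: keys come back as strings and values as lists, so convert both.
with open(tmp / "toy.json") as f:
    from_json = {
        int(k): np.asarray(v, dtype=np.float32)
        for k, v in json.load(f).items()
    }

# NumPy: a 2-D matrix with one row per item.
from_npy = np.load(tmp / "toy.npy")
```

    Since unpickling can execute arbitrary code, the `.json` files are the safer choice when the origin of a file is not fully trusted.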

    Encode multimodal features

    In order to learn the multimodal features by exploiting the encoders we considered in our experimental analysis, please refer to the GitHub repository

  20. IMDB Spoiler Dataset

    • kaggle.com
    Updated May 22, 2019
    Rishabh Misra (2019). IMDB Spoiler Dataset [Dataset]. https://www.kaggle.com/datasets/rmisra/imdb-spoiler-dataset/data
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rishabh Misra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    User-generated reviews are often our first point of contact when we consider watching a movie or a TV show. However, beyond telling us the qualitative aspects of the item we want to consume, reviews may inevitably contain undesired revelatory information (i.e. 'spoilers') such as the surprising fate of a character in a movie, or identity of a murderer in a crime-suspense movie etc. For users who are interested in consuming the item but are unaware of the critical plot twists, spoilers may decrease the excitement regarding the pleasurable uncertainty and curiosity of media consumption. Therefore, a natural question is how to identify these spoilers in entertainment reviews, so that users can more effectively navigate review platforms.

    Content

    This dataset is collected from IMDB. It contains metadata about items as well as user reviews, with information on whether each review contains a spoiler. For more details on the attributes, please check the file descriptions. The following statistics give a good sense of the scale of the dataset:

    # records = 573913

    # users = 263407

    # movies = 1572

    # spoiler reviews = 150924

    # users with at least one spoiler review = 79039

    # items with at least one spoiler review = 1570

    Citation

    If you use the dataset for your work, please cite the following:

    Citation in text format:
    Misra, Rishabh. "IMDB Spoiler Dataset." DOI: 10.13140/RG.2.2.11584.15362 (2019).

    Citation in BibTeX format:
    @dataset{misra2019imdb,
      author = {Misra, Rishabh},
      year = {2019},
      month = {05},
      pages = {},
      title = {IMDB Spoiler Dataset},
      doi = {10.13140/RG.2.2.11584.15362}
    }

    Please link to rishabhmisra.github.io/publications as the source of this dataset.

    Acknowledgement

    This dataset is collected from IMDB.

    Inspiration

    • Can you utilize the metadata to identify reviews that contain spoilers?
    • Additionally, can you uncover signals that make a review spoiler-y?
    • Apart from spoiler detection, the available metadata can also be used for other tasks, such as rating prediction.

    Want to contribute your own datasets?

    If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.

    Other datasets

    Please also check out the following datasets collected by me:
