56 datasets found
  1. 🎥 Movie Plot Database

    • kaggle.com
    Updated Aug 7, 2024
    Cite
    mexwell (2024). 🎥 Movie Plot Database [Dataset]. https://www.kaggle.com/datasets/mexwell/movie-plot-database/data
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    mexwell
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset of movie plot summaries and associated metadata. This data was collected by David Bamman, Brendan O'Connor, and Noah Smith at the Language Technologies Institute and Machine Learning Department at Carnegie Mellon University.

    Data

    plot_summaries.csv

    Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

    movie_metadata.csv

    Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

    • Wikipedia movie ID
    • Freebase movie ID
    • Movie name
    • Movie release date
    • Movie box office revenue
    • Movie runtime
    • Movie languages (Freebase ID:name tuples)
    • Movie countries (Freebase ID:name tuples)
    • Movie genres (Freebase ID:name tuples)
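
    A minimal sketch of reading this tab-separated file with Python's standard csv module. It assumes the file has no header row and that the columns appear in the order listed above; the snake_case column names and the sample row are made up for illustration and do not come from the dataset.

```python
import csv
import io

# Column order as listed in the dataset description (names are illustrative).
MOVIE_COLUMNS = [
    "wikipedia_movie_id",
    "freebase_movie_id",
    "name",
    "release_date",
    "box_office_revenue",
    "runtime",
    "languages",
    "countries",
    "genres",
]


def read_movie_metadata(handle):
    """Yield one dict per movie from a tab-separated, header-less stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(MOVIE_COLUMNS):
            continue  # skip malformed lines rather than guess at their layout
        yield dict(zip(MOVIE_COLUMNS, row))


# Tiny in-memory sample (made-up values) standing in for the real file.
sample = io.StringIO("1\t/m/abc\tExample Film\t2001-01-01\t1000\t90.0\t{}\t{}\t{}\n")
movies = list(read_movie_metadata(sample))
```

    For the real file, pass an open file handle instead of the in-memory sample, for example open("movie_metadata.csv", encoding="utf-8", newline=""). The three tuple columns are left as raw strings here because the description does not say how they are encoded.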

    character_metadata.csv

    Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

    • Wikipedia movie ID
    • Freebase movie ID
    • Movie release date
    • Character name
    • Actor date of birth
    • Actor gender
    • Actor height (in meters)
    • Actor ethnicity (Freebase ID)
    • Actor name
    • Actor age at movie release
    • Freebase character/actor map ID
    • Freebase character ID
    • Freebase actor ID

    tvtropes.clusters.txt

    72 character types drawn from tvtropes.com, along with 501 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.

    name.clusters.txt

    970 unique character names used in at least two different movies, along with 2,666 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.

    Acknowledgments

    This research was supported in part by U.S. National Science Foundation grant IIS-0915187.

    All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).

    Photo by Jakob Owens on Unsplash

  2. wiki-movie-plots-with-summaries

    • huggingface.co
    Updated Oct 7, 2023
    Cite
    Vishnu Priya VR (2023). wiki-movie-plots-with-summaries [Dataset]. https://huggingface.co/datasets/vishnupriyavr/wiki-movie-plots-with-summaries
    Authors
    Vishnu Priya VR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Wikipedia Movie Plots with AI Plot Summaries

      Dataset Summary

      Context

    Wikipedia Movies Plots dataset by JustinR ( https://www.kaggle.com/jrobischon/wikipedia-movie-plots )

      Content
    

    Everything is the same as in https://www.kaggle.com/jrobischon/wikipedia-movie-plots

      Acknowledgements
    

    Please, go upvote https://www.kaggle.com/jrobischon/wikipedia-movie-plots dataset, since this is 100% based on that.

      Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/vishnupriyavr/wiki-movie-plots-with-summaries.
    
  3. Wikipedia Movie Plot Collection

    • opendatabay.com
    Updated Jul 8, 2025
    Cite
    Datasimple (2025). Wikipedia Movie Plot Collection [Dataset]. https://www.opendatabay.com/data/ai-ml/624e3736-74ea-4f5c-9ee5-fda14c16c770
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    This dataset contains movie plots extracted from Wikipedia, along with other key metadata. It is specifically curated for movies released between 1950 and 2023 that have accumulated over 1000 ratings on IMDb. The primary purpose of this dataset is to facilitate development in Large Language Models (LLMs) for applications such as movie searching or recommendation systems. The plot summaries have been meticulously cleaned to remove irrelevant elements like links and references, ensuring a pure text value. Where Wikipedia plots were unavailable, IMDb synopses were used as a fallback. The dataset includes 89% of movies with detailed plot information, while 100% include a short summary untouched from Wikipedia, which is useful for matching metadata in retriever applications. Columns like 'stars', 'directors', and 'genres' are provided as lists of values, making them suitable for direct loading into vector databases.

    Columns

    • title: The title of the film, presented in lowercase.
    • stars: The names of the actors featured in the film, also in lowercase.
    • directors: The names of the film's directors, in lowercase.
    • year: The year when the movie was released.
    • genre: The genres associated with the film, listed in lowercase.
    • runtime: The duration of the film, measured in minutes.
    • ratingCount: An indication of the film's popularity, showing the number of people who have rated it on IMDb.
    • plot: Detailed storyline of the film.
    • summary: A short overview and additional details about the film.
    • imdb_rating: The film's rating on IMDb, on a scale of 1 to 10.

    Distribution

    The data file is typically in CSV format. The dataset spans movies released from 1950 up to 2023. There are 20,617 unique movie titles, 21,596 unique star names, and 9,863 unique director names. The genres column contains 21,675 unique values. Movie runtimes range from -1 to 776 minutes, with a significant majority (17,433 entries) falling between 76.70 and 115.55 minutes. The number of ratings (ratingCount) varies widely, starting from 1,001 and going up to 2.73 million. IMDb ratings range from 1.2 to 9.3. While specific total row/record counts are not available, the distribution data for year, runtime, ratingCount, and imdb_rating show various value counts within different ranges.
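
    A hedged sketch of two clean-up steps this description suggests: turning list-valued cells into Python lists and handling the runtime minimum of -1. Both rest on assumptions not confirmed by the source: that list cells are stored as Python-style list literals, and that a non-positive runtime marks a missing value. The helper names and the sample row are hypothetical.

```python
import ast


def parse_list_cell(cell):
    """Turn a cell such as "['drama', 'crime']" into a Python list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a literal: treat the cell as a single plain value (or empty).
        return [cell] if cell else []
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]


def clean_runtime(raw):
    """Return the runtime in minutes, or None for non-positive values."""
    minutes = float(raw)
    return minutes if minutes > 0 else None


# Made-up row for illustration only.
row = {"title": "example film", "genres": "['drama', 'crime']", "runtime": "-1"}
genres = parse_list_cell(row["genres"])
runtime = clean_runtime(row["runtime"])
```

    Returning None instead of -1 keeps placeholder runtimes out of averages and range filters.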

    Usage

    This dataset is ideal for:

    • Developing demonstration projects leveraging Large Language Models (LLMs).
    • Creating movie search applications, such as the example of a movie searching app like cinemattr.ca.
    • Building retriever applications where the 'summary' column can be used for metadata matching.
    • Populating vector databases with structured information from 'stars', 'directors', and 'genres' for advanced querying and analysis.

    Coverage

    The dataset's geographic scope is global. It includes movies released within the time frame of 1950 to 2023. The data availability specifies that 89% of the movies have detailed plot information, and all movies (100%) include a short summary. The dataset focuses on films with more than 1000 ratings on IMDb.

    License

    CC0

    Who Can Use It

    This dataset is suitable for:

    • AI and machine learning developers who are building models based on natural language processing.
    • Data scientists and researchers interested in film data and entertainment analytics.
    • Software engineers developing applications that require movie plot summaries or metadata, such as recommendation engines.
    • Students and enthusiasts looking for high-quality, pre-processed text data for LLM projects.

    Dataset Name Suggestions

    • IMDb Verified Movie Plots
    • Historical Film Summaries (1950-2023)
    • Wikipedia Movie Plot Collection
    • LLM-Ready Movie Dataset
    • Global Cinema Plot Archive

    Attributes

    Original Data Source: Movie Plots from Wikipedia

  4. Latest 10000 Movies Dataset from TMDB

    • kaggle.com
    Updated Aug 17, 2023
    Cite
    Nagraj Desai (2023). Latest 10000 Movies Dataset from TMDB [Dataset]. https://www.kaggle.com/datasets/nagrajdesai/latest-10000-movies-dataset-from-tmdb/data
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nagraj Desai
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This movie dataset can be used for a variety of purposes, depending on your goals and the insights you want to derive from the data. Here are some potential use cases for the dataset.

    Movie Analysis

    Recommendation Systems

    Popularity Measurement

    Audience Engagement

    Comparative Analysis

    The dataset consists of various attributes related to movies. These attributes provide information about each entry in the dataset:

    1. Index: Index for each row.

    2. Title: The name of the movie.

    3. Original Language: The language in which the movie was originally produced. It could offer insights into the target audience and geographical scope of the content.

    4. Release Date: When the movie was officially released for public viewing. The release date can impact factors like marketing strategies, competition with other releases, and audience anticipation.

    5. Popularity: Likely a measure of how well-known or talked-about a particular movie is within a given context. It could be based on factors such as online discussions, social media mentions, and viewer interest.

    6. Vote Average: Likely the average rating or score given to the movie by viewers who have voted. A higher average could imply that the content is generally well-received.

    7. Vote Count: The number of votes or ratings that the movie has received from viewers. A higher vote count might suggest a larger viewer base or more engaging content.

    8. Overview: A concise summary or description of the movie's plot, themes, and overall content.

  5. Korean Movie Database

    • data.go.kr
    json+xml
    Updated Jan 7, 2022
    Cite
    (2022). Korean Movie Database [Dataset]. https://www.data.go.kr/en/data/3035985/openapi.do
    License

    https://data.go.kr/ugs/selectPortalPolicyView.do

    Description

    Information on Korean and foreign films that have been imported into or released in Korea, compiled and published by the Korea Film Archive. It contains information such as the movie title, director, production company, production year, release date, participating actors and staff, genre, and plot.

  6. rotten_tomatoes

    • huggingface.co
    Cite
    cornell-movie-review-data, rotten_tomatoes [Dataset]. https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes
    Dataset authored and provided by
    cornell-movie-review-data
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for "rotten_tomatoes"

      Dataset Summary
    

    Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
    
  7. TMDb Top 10,000 Popular Movies Dataset

    • kaggle.com
    Updated Apr 7, 2020
    Cite
    Balaka Biswas (2020). TMDb Top 10,000 Popular Movies Dataset [Dataset]. https://www.kaggle.com/balaka18/tmdb-top-10000-popular-movies-dataset/discussion
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Balaka Biswas
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Introduction

    This is a dataset of the 10,000 most popular movies across the world, irrespective of language and recency. These have been extracted using the TMDb API.

    About the Dataset

    What is TMDB's API? The closed-source API service is for those people interested in using their movies, TV shows or actor images and/or data in their application. TMDb's API is a system that they provide for developers and their team to programmatically fetch and use TMDb's data and/or images. Their API is free to use as long as you attribute TMDb as the source of the data and/or images. Also, they update their API from time to time.

    This dataset lists the 10,000 most popular movies across the globe. Information held inside the dataset:

    A. Dataset 1: Movies dataset

    1. title - Title of the movie in English.
    2. overview - A small summary of the plot.
    3. original_lang - Original language it was shot in.
    4. rel_date - Date of release.
    5. popularity - Popularity.
    6. vote_count - Votes received.
    7. vote_average - Average of all votes received.

    B. Dataset 2: Genres dataset

    1. id
    2. Movie ID
    3. Genre

  8. Replication Data for: Movie Scripts Corpus

    • dataverse.harvard.edu
    Updated May 6, 2024
    Cite
    Lance Drouet (2024). Replication Data for: Movie Scripts Corpus [Dataset]. http://doi.org/10.7910/DVN/PZTL2L
    Dataset provided by
    Harvard Dataverse
    Authors
    Lance Drouet
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data Source: https://www.kaggle.com/datasets/gufukuro/movie-scripts-corpus

    Data Description: Movie Scripts Corpus

    This corpus was collected for screenplay analysis with machine learning methods. It includes movie scripts crawled from different sources, their annotations by script structural elements, and movie metadata.

    Corpus description

    Screenplay data consists of:

    • Movie script TXT documents with raw full text (2858 docs)
    • Movie script TXT documents with full text lemmas (2858 docs)
    • Manual annotation TXT documents for some movie scripts (33 docs, more than 6000 annotated rows)
    • Movie script annotation TXT documents obtained by BERT
    • Movie script annotation JSON documents obtained by the rule-based annotator ScreenPy

    Movie metadata consists of:

    • Cut versions of movie reviews and scores from Metacritic (number of reviews: 21025; number of movies with reviews: 2038)
    • Metadata for movies, including: title, akas, launch year, score from Metacritic, IMDb user rating and number of votes from imdb.com, movie awards, opening weekend, producers, budget, script department, production companies, writers, directors, cast info, countries involved in production, age restriction, plot (with outline), keywords, genres, taglines, critics' synopsis
    • Screenplay awards information: Academy Awards adapted screenplay, Academy Awards original screenplay, BAFTA, Golden Globe Award for Best Screenplay, Writers Guild Awards Winners & Nominees 2020-2013; nominations information for 462 movies in total

    Movie characters data consists of:

    • Script text fragments with dialogs and scene descriptions for characters, gathered with annotators: 2153 movies and text fragments for 32114 characters in total
    • Gender labels for 4792 characters

  9. Indonesian Film Database (IMDb)

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Indonesian Film Database (IMDb) [Dataset]. https://www.opendatabay.com/data/dataset/e6c24dd2-f5c7-4abf-83f4-ac3deb784967
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    This dataset contains details for 1262 Indonesian movies, compiled to offer insights into the country's film industry. It was assembled using an IMDb-Scraper and then converted and cleaned into a CSV file, providing a structured collection of movie information [1]. The data was collected from IMDb.com [1].

    Columns

    • title: The primary title of the movie [2].
    • year: The release year of the movie, with values ranging from 1926 to 2020 [2].
    • description: A textual summary or plot outline for the movie [2].
    • genre: Categories that describe the movie's style or content, such as Drama or Comedy [2, 3].
    • rating: The age rating certification applied to the movie, for example, '13+' [2, 3].
    • users_rating: The average rating given by IMDb users, typically ranging from 1.2 to 9.4 [2, 3].
    • votes: The total count of votes received from IMDb users, with values varying from 5 to 187,000 [2, 4].
    • languages: The language(s) in which the movie is primarily presented, notably Indonesian and English [2, 4].
    • directors: The individual(s) credited with directing the movie, including names like Nayato Fio Nuala [2, 4].
    • actors: The main cast members or performers featured in the movie [2].
    • runtime: The duration of the movie [1].

    Distribution

    The dataset is provided in a CSV file format [1]. It includes 1262 unique movie records or rows [1, 2].

    Usage

    This dataset is ideal for:

    • Exploratory data analysis of Indonesian cinema trends [1].
    • Natural Language Processing (NLP) tasks on movie descriptions [1].
    • Analysing movie characteristics such as genre distribution, rating trends, and language prevalence.
    • Studying the impact of directors and actors within the Indonesian film landscape.

    Coverage

    The dataset specifically covers Indonesian movies [1, 2]. The time range for these movies spans from 1926 to 2020 [2].

    License

    CC0

    Who Can Use It

    • Data Analysts and Scientists: For statistical analysis, trend identification, and data visualisations related to movies.
    • Researchers: Studying film history, cultural impact of cinema, or market analysis within the Indonesian context.
    • Natural Language Processing Specialists: For training models on movie descriptions, sentiment analysis, or content categorisation.
    • Film Enthusiasts and Critics: To explore movie characteristics, ratings, and directorial styles.

    Dataset Name Suggestions

    • IMDb Indonesian Movies Data
    • Indonesian Film Database (IMDb)
    • IMDb Indonesian Cinema
    • Indonesian Movie Catalogue (IMDb)

    Attributes

    Original Data Source: IMDb Indonesian Movies

  10. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
    
  11. "9,565 Top-Rated Movies Dataset"

    • kaggle.com
    Updated Aug 19, 2024
    Cite
    Harshit@85 (2024). "9,565 Top-Rated Movies Dataset" [Dataset]. https://www.kaggle.com/datasets/harshit85/9565-top-rated-movies-dataset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Harshit@85
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About the Dataset

    Title: 9,565 Top-Rated Movies Dataset

    Description:
    This dataset offers a comprehensive collection of 9,565 of the highest-rated movies according to audience ratings on the Movie Database (TMDb). The dataset includes detailed information about each movie, such as its title, overview, release date, popularity score, average vote, and vote count. It is designed to be a valuable resource for anyone interested in exploring trends in popular cinema, analyzing factors that contribute to a movie’s success, or building recommendation engines.

    Key Features:

    • Title: The official title of each movie.
    • Overview: A brief synopsis or description of the movie's plot.
    • Release Date: The release date of the movie, formatted as YYYY-MM-DD.
    • Popularity: A score indicating the current popularity of the movie on TMDb, which can be used to gauge current interest.
    • Vote Average: The average rating of the movie, based on user votes.
    • Vote Count: The total number of votes the movie has received.

    Data Source: The data was sourced from the TMDb API, a well-regarded platform for movie information, using the /movie/top_rated endpoint. The dataset represents a snapshot of the highest-rated movies as of the time of data collection.

    Data Collection Process:

    • API Access: Data was retrieved programmatically using TMDb’s API.
    • Pagination Handling: Multiple API requests were made to cover all pages of top-rated movies, ensuring the dataset’s comprehensiveness.
    • Data Aggregation: Collected data was aggregated into a single, unified dataset using the pandas library.
    • Cleaning: Basic data cleaning was performed to remove duplicates and handle missing or malformed data entries.
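
    The collection steps described above (paginated requests, aggregation, removal of duplicates) could look roughly like the sketch below. This is not the dataset author's code: it uses only the standard library rather than pandas, the function names are hypothetical, and the use of an api_key query parameter and the "results" and "total_pages" response fields reflects the TMDb v3 API as commonly documented rather than anything stated in this listing. A fake fetcher stands in for the network so the example runs offline.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.themoviedb.org/3/movie/top_rated"


def fetch_page(page, api_key):
    """Fetch one page of results from the TMDb API (network access required)."""
    query = urllib.parse.urlencode({"api_key": api_key, "page": page})
    with urllib.request.urlopen(f"{API_URL}?{query}") as response:
        return json.load(response)


def collect_movies(fetch, max_pages=None):
    """Walk every page, keeping each movie once, keyed by its "id" field."""
    movies, seen, page = [], set(), 1
    while True:
        payload = fetch(page)
        for movie in payload.get("results", []):
            if movie["id"] not in seen:
                seen.add(movie["id"])
                movies.append(movie)
        last_page = payload.get("total_pages", page)
        if max_pages is not None:
            last_page = min(last_page, max_pages)
        if page >= last_page:
            return movies
        page += 1


# Offline stand-in for the API (made-up data) so the sketch runs without a key.
fake_pages = {
    1: {"results": [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}], "total_pages": 2},
    2: {"results": [{"id": 2, "title": "B"}, {"id": 3, "title": "C"}], "total_pages": 2},
}
movies = collect_movies(lambda page: fake_pages[page])
```

    Against the live API, the same loop would be driven with collect_movies(lambda page: fetch_page(page, api_key)), where api_key is your own TMDb key.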

    Potential Uses:

    • Trend Analysis: Analyze trends in movie ratings over time or compare ratings across different genres.
    • Recommendation Systems: Build and train models to recommend movies based on user preferences.
    • Sentiment Analysis: Perform text analysis on movie overviews to understand common themes and sentiments.
    • Statistical Analysis: Explore the relationship between popularity, vote count, and average ratings.

    Data Format: The dataset is provided in a structured tabular format (e.g., CSV), making it easy to load into data analysis tools like Python, R, or Excel.

    Usage License: The dataset is shared under [appropriate license], ensuring that it can be used for educational, research, or commercial purposes, with proper attribution to the data source (TMDb).


  12. the_movies_dataset

    • kaggle.com
    zip
    Updated Jun 19, 2021
    Cite
    sezgin ildes (2021). the_movies_dataset [Dataset]. https://www.kaggle.com/sezginildes/the-movies-dataset
    Available download formats: zip (15456686 bytes)
    Authors
    sezgin ildes
    Description

    Context

    These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

    This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

    Content

    This dataset consists of the following files:

    movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

    keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

    credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

    links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

    links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

    ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
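
    The "stringified JSON Object" cells mentioned for keywords.csv and credits.csv need decoding before use. The sketch below tries strict JSON first and falls back to Python literal syntax; the fallback is an assumption, added in case the cells were written with single quotes, and the helper name and sample value are made up.

```python
import ast
import json


def parse_stringified(cell):
    """Decode a stringified list/dict cell, trying strict JSON first."""
    try:
        return json.loads(cell)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (single-quoted keys and strings).
        return ast.literal_eval(cell)


# Made-up cell value for illustration.
keywords = parse_stringified("[{'id': 931, 'name': 'jealousy'}]")
names = [item["name"] for item in keywords]
```

    ast.literal_eval only evaluates literals, so it is a safer choice than eval for cells read from an untrusted file.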

    The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed here

    Acknowledgements

    This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.

    The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available here

    Inspiration

    This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.

    Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems

    Some of the things you can do with this dataset:

    • Predicting movie revenue and/or movie success based on a certain metric.
    • What movies tend to get higher vote counts and vote averages on TMDB?
    • Building Content Based and Collaborative Filtering Based Recommendation Engines.

  13. Global Movie Popularity Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Global Movie Popularity Dataset [Dataset]. https://www.opendatabay.com/data/dataset/c9597b23-d205-46ff-abb3-674815373730
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    This dataset provides details on the 10,000 most popular films globally, sourced from The Movie Database (TMDb) via its read API. TMDb is a crowd-sourced movie information database widely used by various film-related platforms and applications. The dataset is ideal for film-related analysis, building recommender systems, and natural language processing tasks, even for those new to data analysis, as it contains some missing values.

    Columns

    • index: An identifier for each record.
    • title: The name of the movie.
    • overview: A concise summary or synopsis of the movie.
    • original_language: The primary language in which the movie was filmed.
    • vote_count: The number of votes received for the movie, also indicated as the date of publish in some contexts.
    • vote_average: The average rating given to the movie by voters.
    • popularity: A metric indicating the popularity score of the movie.

    Distribution

    The dataset is provided in a CSV file format. It comprises approximately 10,000 individual movie records. While exact row and record counts are not specified, the dataset is structured as tabular data, with each row representing a unique movie entry and columns detailing various attributes.

    Usage

    This dataset is well-suited for a variety of applications, including:

    • Developing and enhancing film-related consoles, websites, and mobile applications.
    • Creating movie recommender systems.
    • Performing data visualisations related to film trends and popularity.
    • Conducting natural language processing (NLP) tasks on movie overviews.
    • Data analysis and exploration, particularly for those looking to practise handling missing data.

    Coverage

    The dataset covers movies from across the world, offering a global scope. While a specific time range for the movies is not explicitly stated, the data is fetched from TMDb, which updates its API periodically. It's noted that the dataset includes some null values where information was missing from the original TMDb database.

    License

    CC0

    Who Can Use It

    This dataset is intended for a broad audience including:

    • Young analysts: To practise data cleaning and analysis with datasets containing missing values.
    • Developers: For integrating movie information into media managers, mobile apps, and social sites.
    • Researchers: For studies on movie popularity, audience reception, and content analysis.
    • Data scientists: For building and testing machine learning models such as recommender systems and NLP models.

    Dataset Name Suggestions

    • TMDb Popular Movies
    • Global Movie Popularity Dataset
    • Top Movies from TMDb API
    • Movie Data for Film Analysis
    • TMDb Film Insights

    Attributes

    Original Data Source: Popular Movies of IMDb

  14. IMDB Selection Database

    • zenodo.org
    Updated Nov 21, 2022
    Cite
    Cristian Campo Pérez; Nieves Fernández Ochoa (2022). IMDB Selection Database [Dataset]. http://doi.org/10.5281/zenodo.7339445
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cristian Campo Pérez; Nieves Fernández Ochoa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Selection of the top 1000 entries of each genre in IMDB.

    Contains information of:

    • title -> title of the entry
    • genres -> list genres of the entry
    • score -> mean rating from the viewers
    • people_votin -> number of votes
    • normal_number_of_reviews -> number of reviews from normal users
    • prof_number_of_reviews -> number of reviews from professionals
    • type_filmed -> type of content ( e.g. TV Series / original )
    • year -> release year
    • year_certification -> Age restriction certification
    • runtime -> length of chapter / movie
    • country -> Country where it was produced
    • creators -> List of names of the directors
    • cast -> List of names of the actors
    • plot -> brief summary of the plot
    • JPEG_link -> link to JPEG promotional image

    This is a simulated dataset.

  15. 10000 Most Popular English Movies (2023)

    • kaggle.com
    Updated Jul 30, 2023
    Cite
    Dnyanesh Yeole (2023). 10000 Most Popular English Movies (2023) [Dataset]. https://www.kaggle.com/datasets/dnyaneshyeole/10000-most-popular-english-movies-2023
    Dataset updated
    Jul 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dnyanesh Yeole
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🎬 Welcome to the Popular English Movies Dataset (2023) 🎬! This dataset features information on a diverse collection of popular English movies.

    Contents

    The dataset provides a comprehensive set of features for each movie entry:

    • Title: The name of the movie, identifying it uniquely in the dataset.
    • Overview: A summary or synopsis of the movie, giving users an idea of its plot and theme.
    • Release_Date: The date when the movie was officially released.
    • Genre: The categories or genres to which the movie can be classified.
    • Popularity: A metric calculated by TMDB developers.
    • Vote_Average: The average rating of the movie, ranging from 0 to 10.
    • Vote_Count: The total number of votes received.

    Usage

    The Popular English Movies Dataset (2023) offers a wealth of opportunities for exploration and innovation in the realms of Data Science and Machine Learning. Here are some exciting ways to utilize and contribute to the dataset:

    1. Genre Prediction Model: Leveraging the 'overview' and 'title' features, data enthusiasts can build powerful Natural Language Processing (NLP) models to predict movie genres. By analyzing the movie summaries and titles, learners can gain insights into the relationships between textual data and movie genres, enabling more accurate genre predictions.
    2. Movie Recommender System: The dataset serves as a fantastic foundation for constructing a movie recommender system. By applying collaborative filtering or content-based filtering techniques, learners can develop personalized recommendations for users based on their preferences, leading to enhanced movie discovery experiences.
    3. Popularity Analysis: Utilizing the 'vote_count' and 'vote_average' features, learners can delve into the factors influencing a movie's popularity. Through data exploration and visualization, one can uncover trends and patterns that contribute to a movie's overall appeal among viewers.
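
As a rough illustration of the content-based filtering idea in item 2, the sketch below ranks movies by cosine similarity between bag-of-words vectors built from their overviews. It uses only the standard library. The three sample movies are invented, and the function names are illustrative rather than part of any dataset tooling. A real project would more likely use TF-IDF weighting from a dedicated library.

```python
import math
import re
from collections import Counter

# Invented sample data; the real dataset supplies Title and Overview columns.
MOVIES = {
    "Space Run": "A pilot races across space to save her crew",
    "Star Rescue": "A crew lost in space waits for a daring pilot",
    "Bake Off": "Two rival bakers compete in a small town contest",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(title, movies, top_n=2):
    """Return the top_n other titles most similar to `title` by overview."""
    vectors = {t: Counter(tokenize(o)) for t, o in movies.items()}
    target = vectors[title]
    scores = [(cosine(target, v), t) for t, v in vectors.items() if t != title]
    return [t for _, t in sorted(scores, reverse=True)[:top_n]]

print(recommend("Space Run", MOVIES))
```

Raw word counts let common words such as "a" influence the score, which is one reason TF-IDF or stop-word removal is usually preferred in practice.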

    Source

    The data was sourced through TMDB's API and can be explored in full at https://www.themoviedb.org/movie, which hosts an extensive collection of movie data.

    Lights, Camera, Upvote! Dive into 10,000 Popular English Movies from 2023! 🎬👍

  16. TMDB Top Movies Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). TMDB Top Movies Dataset [Dataset]. https://www.opendatabay.com/data/dataset/a663f3c0-8065-4aff-807a-a50f31b6034c
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    📽️ Movie Descriptions Dataset

    This dataset contains a curated list of classic and contemporary films along with their titles, genres, and detailed plot descriptions. It includes globally acclaimed movies across genres such as drama, crime, romance, animation, fantasy, action, and more. From cinematic masterpieces like The Shawshank Redemption and Schindler's List to iconic anime like Your Name and A Silent Voice, this dataset offers a diverse mix of storytelling across cultures and decades.

    Each entry features:

    🎬 Movie Name
    🎭 Genre(s)
    📝 Brief Description / Plot Summary

    This dataset can be used for:

    🎞️ Movie recommendation systems
    🧠 NLP tasks like sentiment analysis, genre prediction, and text classification
    🎥 Data visualization and storytelling
    🗣️ Text summarization or chatbot training on movie-related queries

    Ideal for data science, machine learning, and natural language processing enthusiasts who want to experiment with real-world descriptive text data.

    Original Data Source: TMDB Top Movies Dataset

  17. Global Movie Popularity Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Global Movie Popularity Dataset [Dataset]. https://www.opendatabay.com/data/consumer/af505531-100e-4731-b7e9-f817fa91f16d
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Entertainment & Media Consumption
    Description

    This dataset contains details for 10,000 top-rated movies from TMDB, updated as of 26th July 2022. Its primary purpose is to facilitate text preprocessing and cleansing for Natural Language Processing (NLP) tasks related to movie data. It is also highly suitable for developing content-based and collaborative filtering recommendation engines. This resource offers a rich context for understanding movie popularity, genres, and audience reception.

    Columns

    • id: The unique identification number for the movie on the website.
    • title: The name of the movie.
    • genre: The categorisation of the movie, such as crime, adventure, or drama.
    • original_language: The initial language in which the movie was released.
    • overview: A brief summary or synopsis of the movie.
    • popularity: A metric indicating the movie's popularity.
    • release_date: The date when the movie was first released.
    • vote_average: The average rating given to the movie by voters.
    • vote_count: The total number of votes received by the movie.

    Distribution

    This dataset comprises approximately 10,000 records, typically provided in a CSV file format. Specific row counts for a sample file are updated separately. The dataset includes unique values for movie IDs, with original_language predominantly being English (around 78%) and French (7%). Movie genres include Comedy (7%) and Drama (6%), with a wide array of other genres. Release dates span a broad period from 1902 to 2022, with the majority of entries from 1998 onwards. Popularity scores range from 0.6 to over 10,000, and vote averages are generally between 4.6 and 8.7, with vote counts reaching up to 31,900.
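
Figures such as the language shares quoted above can be recomputed from the CSV. The following sketch assumes a column named original_language, as listed under Columns. The sample values are invented and do not reproduce the real percentages, and the helper name is illustrative.

```python
from collections import Counter

def share_by_value(values):
    """Return {value: percentage of rows}, rounded to one decimal place."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: round(100 * n / total, 1) for value, n in counts.most_common()}

# Invented sample standing in for the original_language column.
languages = ["en", "en", "en", "fr", "ja", "en", "fr", "en"]
print(share_by_value(languages))
```

The same helper works for any categorical column, for example genre.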

    Usage

    This dataset is ideal for:

    • Performing extensive text preprocessing and cleansing for NLP applications on movie descriptions and titles.
    • Building various movie recommendation systems, including content-based recommenders and collaborative filtering engines.
    • Analysing trends in movie popularity, audience ratings, and language distribution.
    • Developing data science projects focused on entertainment and media consumption.

    Coverage

    The dataset's geographic scope is global. It covers movies released between 17th April 1902 and 13th July 2022, with the dataset itself assembled with data up to 26th July 2022. There are no specific demographic notes available, but it broadly covers top-rated films from the TMDB database.

    License

    CC0

    Who Can Use It

    This dataset is suitable for:

    • Data Scientists and Machine Learning Engineers working on recommendation systems or NLP projects.
    • Researchers studying film industry trends, audience engagement, or language processing.
    • Developers looking to integrate movie data into applications.
    • Anyone interested in exploratory data analysis within the entertainment sector.

    Dataset Name Suggestions

    • TMDB Top Movies Dataset
    • Movie Data for NLP & Recommendations
    • Global Movie Popularity Dataset
    • Film Data Hub

    Attributes

    Original Data Source: TMDB Movies Dataset

  18. ‘IMDB Horror Movie Dataset [2012 Onwards]’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 2, 2017
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2017). ‘IMDB Horror Movie Dataset [2012 Onwards]’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-imdb-horror-movie-dataset-2012-onwards-ca86/3437da9d/?iid=004-265&v=presentation
    Dataset updated
    Nov 2, 2017
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘IMDB Horror Movie Dataset [2012 Onwards]’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/PromptCloudHQ/imdb-horror-movie-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    On the occasion of Halloween, we thought of sharing a spooky dataset for the community to crunch on the data!

    Remember - "This Halloween could get a lot more spookier, but treats are guaranteed".

    Content

    The dataset goes back to 2012 and contains the following data fields:

    • Title
    • Genres
    • Release Date
    • Release Country
    • Movie Rating
    • Review Rating
    • Movie Run Time
    • Plot
    • Cast
    • Language
    • Filming Locations
    • Budget

    Acknowledgements

    The data was extracted by PromptCloud's in-house data extraction solution.

    Inspiration

    Some of the things that can be explored are the following:

    • Number of horror movies released over the years
    • Number of movies released in terms of country
    • Rating and run time distribution
    • Spooky regions by considering the shooting location
    • Text mining on the description text

    --- Original source retains full ownership of the source dataset ---

  19. Plot Data: Analysis of thin liquid films driven by SAW

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 25, 2021
    Cite
    Mitas, Kevin David Joachim (2021). Plot Data: Analysis of thin liquid films driven by SAW [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5069876
    Dataset updated
    Oct 25, 2021
    Dataset authored and provided by
    Mitas, Kevin David Joachim
    Description

    Data of the relevant plots of the thesis

  20. Movie Reviews Dataset

    • paperswithcode.com
    Updated Apr 2, 2024
    Cite
    (2024). Movie Reviews Dataset [Dataset]. https://paperswithcode.com/dataset/movie-reviews
    Dataset updated
    Apr 2, 2024
    Description

    This dataset is based on the movie review polarity dataset (v2.0) collected and maintained by Bo Pang and Lillian Lee. Their dataset (we'll call it PL2.0) consists of 1000 positive and 1000 negative movie reviews obtained from the Internet Movie Database (IMDb) review archive.

    The main contribution of this release is the enrichment of the documents with "annotator rationales," a concept we describe in our NAACL HLT 2007 paper.

    Basically, "rationales" are segments of the text that support an annotator's classification. Let's say we have a movie review that is labeled as positive (i.e. the writer has a favorable opinion of the movie). Then the rationales would be segments of the text that support the claim (by an annotator) that the review is, indeed, positive.

    Here are some examples of positive rationales (the segments enclosed by double square brackets):

    • [[you will enjoy the hell out of]] American Pie.
    • fortunately, they [[managed to do it in an interesting and funny way]].
    • he is [[one of the most exciting martial artists on the big screen]], continuing to perform his own stunts and [[dazzling audiences]] with his flashy kicks and punches.
    • the romance was [[enchanting]].

    And here are some examples of negative rationales:

    • A woman in peril. A confrontation. An explosion. The end. [[Yawn. Yawn. Yawn.]]
    • when a film makes watching Eddie Murphy [[a tedious experience, you know something is terribly wrong]].
    • the movie is [[so badly put together]] that even the most casual viewer may notice the [[miserable pacing and stray plot threads]].
    • [[don't go see]] this movie
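
The double-square-bracket markup shown in these examples is straightforward to parse. As a minimal sketch (not the tooling used by the dataset authors, and with illustrative function names), a regular expression can pull out the rationale segments and recover the plain review text:

```python
import re

# Non-greedy match for text enclosed in [[ ... ]], across line breaks if needed.
RATIONALE = re.compile(r"\[\[(.+?)\]\]", re.DOTALL)

def extract_rationales(text):
    """Return the list of rationale segments marked with [[...]]."""
    return RATIONALE.findall(text)

def strip_markup(text):
    """Return the text with the [[ ]] markers removed but the words kept."""
    return RATIONALE.sub(r"\1", text)

example = (
    "the movie is [[so badly put together]] that even the most casual "
    "viewer may notice the [[miserable pacing and stray plot threads]]."
)
print(extract_rationales(example))
print(strip_markup(example))
```

Keeping both views is useful: the stripped text feeds a classifier, while the extracted segments can serve as supervision for which words justify the label.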

Cite
mexwell (2024). 🎥 Movie Plot Database [Dataset]. https://www.kaggle.com/datasets/mexwell/movie-plot-database/data

🎥 Movie Plot Database

42k movie plot summaries with information about movie and actors

Dataset updated
Aug 7, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
mexwell
License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Description

Dataset of movie plot summaries and associated metadata. This data was collected by David Bamman, Brendan O'Connor, and Noah Smith at the Language Technologies Institute and Machine Learning Department at Carnegie Mellon University.

Data

plot_summaries.csv

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID followed by the summary. The ID indexes into movie_metadata.csv, which this description elsewhere calls movie.metadata.tsv.

movie_metadata.csv

Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

  • Wikipedia movie ID
  • Freebase movie ID
  • Movie name
  • Movie release date
  • Movie box office revenue
  • Movie runtime
  • Movie languages (Freebase ID:name tuples)
  • Movie countries (Freebase ID:name tuples)
  • Movie genres (Freebase ID:name tuples)
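
A minimal sketch of reading one row in this layout follows. Several things here are assumptions rather than facts from the description: the snake_case column names are illustrative, the sample row and its Freebase IDs are invented, and the "ID:name tuples" are assumed to be serialized as JSON objects mapping Freebase ID to name. Check a few real rows before relying on it.

```python
import csv
import io
import json

# Column order as listed above; the snake_case names are illustrative.
COLUMNS = [
    "wikipedia_movie_id", "freebase_movie_id", "name", "release_date",
    "box_office_revenue", "runtime", "languages", "countries", "genres",
]

# One invented row. ASSUMPTION: tuple columns are JSON objects of Freebase ID -> name.
SAMPLE = "\t".join([
    "12345", "/m/0abc1", "Example Film", "2001-08-24", "", "98.0",
    '{"/m/0lang": "English Language"}',
    '{"/m/0ctry": "United States of America"}',
    '{"/m/0gen1": "Thriller", "/m/0gen2": "Drama"}',
]) + "\n"

def read_movies(handle):
    """Yield one dict per movie, decoding the ID:name columns into dicts."""
    # QUOTE_NONE keeps the double quotes inside the JSON cells untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        movie = dict(zip(COLUMNS, fields))
        for key in ("languages", "countries", "genres"):
            movie[key] = json.loads(movie[key]) if movie[key] else {}
        # An empty cell is treated as a missing runtime.
        movie["runtime"] = float(movie["runtime"]) if movie["runtime"] else None
        yield movie

movies = list(read_movies(io.StringIO(SAMPLE)))
print(sorted(movies[0]["genres"].values()))
```

With a real file, the same function can be called on `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.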

character_metadata.csv

Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

  • Wikipedia movie ID
  • Freebase movie ID
  • Movie release date
  • Character name
  • Actor date of birth
  • Actor gender
  • Actor height (in meters)
  • Actor ethnicity (Freebase ID)
  • Actor name
  • Actor age at movie release
  • Freebase character/actor map ID
  • Freebase character ID
  • Freebase actor ID
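
Because every file carries the Wikipedia movie ID, character rows can be attached to plot summaries with a simple group-and-join. The sketch below uses invented in-memory rows and illustrative field names; it shows only the join logic, not the file parsing.

```python
from collections import defaultdict

# Invented rows; only the fields needed for the join are shown.
plot_summaries = {
    "12345": "A detective returns to her home town.",
}
characters = [
    {"wikipedia_movie_id": "12345", "character_name": "Det. Ruiz", "actor_name": "A. Actor"},
    {"wikipedia_movie_id": "12345", "character_name": "Mayor", "actor_name": "B. Actor"},
    {"wikipedia_movie_id": "99999", "character_name": "Pilot", "actor_name": "C. Actor"},
]

def group_cast_by_movie(rows):
    """Map Wikipedia movie ID -> list of (character, actor) pairs."""
    cast = defaultdict(list)
    for row in rows:
        cast[row["wikipedia_movie_id"]].append((row["character_name"], row["actor_name"]))
    return cast

cast = group_cast_by_movie(characters)
# Keep only movies that have a plot summary, since the metadata covers more
# movies (81,741) than there are summaries (42,306).
joined = {
    movie_id: {"summary": summary, "cast": cast.get(movie_id, [])}
    for movie_id, summary in plot_summaries.items()
}
print(joined["12345"]["cast"])
```

Driving the join from the summaries keeps the result limited to movies that have text to analyse.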

tvtropes.clusters.txt

72 character types drawn from tvtropes.com, along with 501 instances of those types. The ID field indexes into the Freebase character/actor map ID in character_metadata.csv (called character.metadata.tsv in the original wording of this description).

name.clusters.txt

970 unique character names used in at least two different movies, along with 2,666 instances of those types. The ID field indexes into the Freebase character/actor map ID in character_metadata.csv (called character.metadata.tsv in the original wording of this description).

Acknowledgments

This research was supported in part by U.S. National Science Foundation grant IIS-0915187.

All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).

Photo by Jakob Owens on Unsplash
