56 datasets found

🎥 Movie Plot Database
kaggle.com
Updated Aug 7, 2024
Cite
mexwell (2024). 🎥 Movie Plot Database [Dataset]. https://www.kaggle.com/datasets/mexwell/movie-plot-database/data
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 7, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
mexwell
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Dataset of movie plot summaries and associated metadata. This data was collected by David Bamman, Brendan O'Connor, and Noah Smith at the Language Technologies Institute and Machine Learning Department at Carnegie Mellon University.

Data

plot_summaries.csv

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

movie_metadata.csv

Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

Wikipedia movie ID

Freebase movie ID

Movie name

Movie release date

Movie box office revenue

Movie runtime

Movie languages (Freebase ID:name tuples)

Movie countries (Freebase ID:name tuples)

Movie genres (Freebase ID:name tuples)
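
The language, country, and genre columns are described as Freebase ID:name tuples. A minimal sketch of unpacking one such cell follows. It assumes the cell is serialized as a JSON object mapping Freebase IDs to names, which the description does not state; the function name and the sample value are illustrative only.

```python
import json

def parse_freebase_pairs(cell):
    """Return the names from a cell of Freebase ID:name pairs.

    Assumes the cell is a JSON object such as {"/m/xyz": "Drama"};
    this encoding is an assumption, so check it against the real file.
    """
    if not cell or not cell.strip():
        return []
    try:
        pairs = json.loads(cell)
    except json.JSONDecodeError:
        return []  # leave malformed cells empty rather than guessing
    if not isinstance(pairs, dict):
        return []
    return list(pairs.values())

# Hypothetical sample cell, for illustration only.
sample = '{"/m/07s9rl0": "Drama", "/m/01jfsb": "Thriller"}'
print(parse_freebase_pairs(sample))
```

Returning an empty list for malformed cells keeps a bulk load from failing on a single bad row; stricter handling may be preferable when data quality matters.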

character_metadata.csv

Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

Wikipedia movie ID

Freebase movie ID

Movie release date

Character name

Actor date of birth

Actor gender

Actor height (in meters)

Actor ethnicity (Freebase ID)

Actor name

Actor age at movie release

Freebase character/actor map ID

Freebase character ID

Freebase actor ID

tvtropes.clusters.txt

72 character types drawn from tvtropes.com, along with 501 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.

name.clusters.txt

970 unique character names used in at least two different movies, along with 2,666 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.

Acknowledgments

This research was supported in part by U.S. National Science Foundation grant IIS-0915187.

All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).

Photo by Jakob Owens on Unsplash
wiki-movie-plots-with-summaries
huggingface.co
Updated Oct 7, 2023
Cite
Vishnu Priya VR (2023). wiki-movie-plots-with-summaries [Dataset]. https://huggingface.co/datasets/vishnupriyavr/wiki-movie-plots-with-summaries
Dataset updated
Oct 7, 2023
Authors
Vishnu Priya VR
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset Card for Wikipedia Movie Plots with AI Plot Summaries

Dataset Summary Context

Wikipedia Movies Plots dataset by JustinR ( https://www.kaggle.com/jrobischon/wikipedia-movie-plots )

Content

Everything is the same as in https://www.kaggle.com/jrobischon/wikipedia-movie-plots

Acknowledgements

Please, go upvote https://www.kaggle.com/jrobischon/wikipedia-movie-plots dataset, since this is 100% based on that.

Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/vishnupriyavr/wiki-movie-plots-with-summaries.
Wikipedia Movie Plot Collection
opendatabay.com
Updated Jul 8, 2025
Cite
Datasimple (2025). Wikipedia Movie Plot Collection [Dataset]. https://www.opendatabay.com/data/ai-ml/624e3736-74ea-4f5c-9ee5-fda14c16c770
Dataset updated
Jul 8, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Entertainment & Media Consumption
Description
This dataset contains movie plots extracted from Wikipedia, along with other key metadata. It is specifically curated for movies released between 1950 and 2023 that have accumulated over 1000 ratings on IMDb. The primary purpose of this dataset is to facilitate development in Large Language Models (LLMs) for applications such as movie searching or recommendation systems. The plot summaries have been meticulously cleaned to remove irrelevant elements like links and references, ensuring a pure text value. Where Wikipedia plots were unavailable, IMDb synopses were used as a fallback. The dataset includes 89% of movies with detailed plot information, while 100% include a short summary untouched from Wikipedia, which is useful for matching metadata in retriever applications. Columns like 'stars', 'directors', and 'genres' are provided as lists of values, making them suitable for direct loading into vector databases.

Columns

title: The title of the film, presented in lowercase.

stars: The names of the actors featured in the film, also in lowercase.

directors: The names of the film's directors, in lowercase.

year: The year when the movie was released.

genre: The genres associated with the film, listed in lowercase.

runtime: The duration of the film, measured in minutes.

ratingCount: An indication of the film's popularity, showing the number of people who have rated it on IMDb.

plot: Detailed storyline of the film.

summary: A short overview and additional details about the film.

imdb_rating: The film's rating on IMDb, on a scale of 1 to 10.
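
The description says that 'stars', 'directors', and 'genres' are provided as lists of values. Read from a CSV file, such cells arrive as plain strings. The sketch below shows one way to turn them back into lists, assuming they are serialized as Python-style list literals (an assumption, since the exact encoding is not given here); `parse_list_cell` is a hypothetical helper name.

```python
import ast

def parse_list_cell(cell):
    """Turn a CSV cell such as "['drama', 'crime']" into a Python list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a literal: fall back to treating the cell as one plain value.
        return [cell.strip()] if cell and cell.strip() else []
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]

print(parse_list_cell("['drama', 'crime']"))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for untrusted files.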

Distribution

The data file is typically in CSV format. The dataset spans movies released from 1950 up to 2023. There are 20,617 unique movie titles, 21,596 unique star names, and 9,863 unique director names. The genres column contains 21,675 unique values. Movie runtimes range from -1 to 776 minutes, with a significant majority (17,433 entries) falling between 76.70 and 115.55 minutes. The number of ratings (ratingCount) varies widely, starting from 1,001 and going up to 2.73 million. IMDb ratings range from 1.2 to 9.3. While specific total row/record counts are not available, the distribution data for year, runtime, ratingCount, and imdb_rating show various value counts within different ranges.
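
A minimum runtime of -1 suggests a placeholder rather than a real duration. The sketch below treats non-positive runtimes as missing before any analysis. That reading of -1 is an assumption, not something the description confirms, and `clean_runtime` is a hypothetical helper.

```python
def clean_runtime(raw):
    """Return the runtime in minutes, or None when it is missing or invalid."""
    try:
        minutes = float(raw)
    except (TypeError, ValueError):
        return None
    # Assumption: non-positive values such as -1 mark an unknown runtime.
    return minutes if minutes > 0 else None

# Illustrative values only, not taken from the dataset.
runtimes = ["-1", "95", "", "776"]
cleaned = [clean_runtime(r) for r in runtimes]
print(cleaned)
```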

Usage

This dataset is ideal for:
* Developing demonstration projects leveraging Large Language Models (LLMs).
* Creating movie search applications, such as the example of a movie searching app like cinemattr.ca.
* Building retriever applications where the 'summary' column can be used for metadata matching.
* Populating vector databases with structured information from 'stars', 'directors', and 'genres' for advanced querying and analysis.

Coverage

The dataset's geographic scope is global. It includes movies released within the time frame of 1950 to 2023. The data availability specifies that 89% of the movies have detailed plot information, and all movies (100%) include a short summary. The dataset focuses on films with more than 1000 ratings on IMDb.

License

CC0

Who Can Use It

This dataset is suitable for:
* AI and machine learning developers who are building models based on natural language processing.
* Data scientists and researchers interested in film data and entertainment analytics.
* Software engineers developing applications that require movie plot summaries or metadata, such as recommendation engines.
* Students and enthusiasts looking for high-quality, pre-processed text data for LLM projects.

Dataset Name Suggestions

IMDb Verified Movie Plots

Historical Film Summaries (1950-2023)

Wikipedia Movie Plot Collection

LLM-Ready Movie Dataset

Global Cinema Plot Archive

Attributes

Original Data Source: Movie Plots from Wikipedia
Latest 10000 Movies Dataset from TMDB
kaggle.com
Updated Aug 17, 2023
Cite
Nagraj Desai (2023). Latest 10000 Movies Dataset from TMDB [Dataset]. https://www.kaggle.com/datasets/nagrajdesai/latest-10000-movies-dataset-from-tmdb/data
Dataset updated
Aug 17, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nagraj Desai
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This movies dataset can be used for a variety of purposes, depending on your goals and the insights you want to derive from the data. Here are some potential use cases for the dataset.

Movie Analysis

Recommendation Systems

Popularity Measurement

Audience Engagement

Comparative Analysis

The dataset consists of various attributes related to movies. These attributes provide information about each entry in the dataset:

1. Index: Index for each row.

2. Title: The title attribute represents the name of the movie.

3. Original Language: This attribute signifies the language in which the movie was originally produced. It could offer insights into the target audience and geographical scope of the content.

4. Release Date: This attribute indicates when the movie was officially released for public viewing. The release date can impact factors like marketing strategies, competition with other releases, and audience anticipation.

5. Popularity: This attribute likely represents the measure of how well-known or talked-about a particular movie is within a given context. It could be based on factors such as online discussions, social media mentions, and viewer interest.

6. Vote Average: This attribute likely represents the average rating or score given to the movie by viewers who have voted. A higher average could imply that the content is generally well-received.

7. Vote Count: This attribute indicates the number of votes or ratings that the movie has received from viewers. A higher vote count might suggest a larger viewer base or more engaging content.

8. Overview: This attribute provides a concise summary or description of the movie plot, themes, and overall content. It offers a glimpse into what the content is about.
Korean Movie Database
data.go.kr
json+xml
Updated Jan 7, 2022
Cite
(2022). Korean Movie Database [Dataset]. https://www.data.go.kr/en/data/3035985/openapi.do
Available download formats: json+xml
Dataset updated
Jan 7, 2022
License
https://data.go.kr/ugs/selectPortalPolicyView.do
Description
Information on Korean and foreign films imported and released in Korea, compiled and published by the Korea Film Archive. It contains information such as the movie title, director, production company, production year, release date, participating actors and staff, genre, and plot.
rotten_tomatoes
huggingface.co
Cite
cornell-movie-review-data, rotten_tomatoes [Dataset]. https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes
Dataset authored and provided by
cornell-movie-review-data
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for "rotten_tomatoes"

Dataset Summary

Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005.

Supported Tasks and Leaderboards

More Information Needed

Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
TMDb Top 10,000 Popular Movies Dataset
kaggle.com
Updated Apr 7, 2020
Cite
Balaka Biswas (2020). TMDb Top 10,000 Popular Movies Dataset [Dataset]. https://www.kaggle.com/balaka18/tmdb-top-10000-popular-movies-dataset/discussion
Dataset updated
Apr 7, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Balaka Biswas
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Introduction

This is a dataset of the 10,000 most popular movies across the world, irrespective of language and recency. These have been extracted using the TMDb API.

About the Dataset

What is TMDB's API? The closed-source API service is for those people interested in using their movies, TV shows or actor images and/or data in their application. TMDb's API is a system that they provide for developers and their team to programmatically fetch and use TMDb's data and/or images. Their API is free to use as long as you attribute TMDb as the source of the data and/or images. Also, they update their API from time to time.

This dataset lists the 10,000 most popular movies across the globe. Information held inside the dataset:

A. Dataset 1: Movies dataset

1. title - Title of the movie in English.
2. overview - A small summary of the plot.
3. original_lang - Original language it was shot in.
4. rel_date - Date of release.
5. popularity - Popularity.
6. vote_count - Votes received.
7. vote_average - Average of all votes received.

B. Dataset 2: Genres dataset

1. id
2. Movie ID
3. Genre
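
Because genres sit in a separate file keyed by Movie ID, using them usually means attaching them back to the movie rows. The sketch below does this with in-memory stand-ins for the two files. The column names `id`, `movie_id`, and `genre`, and the presence of an `id` column in the movies file, are assumptions that the column lists above do not confirm.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for the two files; column names are assumed.
movies_csv = "id,title\n1,Movie A\n2,Movie B\n"
genres_csv = "id,movie_id,genre\n10,1,Drama\n11,1,Crime\n12,2,Comedy\n"

def attach_genres(movies_file, genres_file):
    """Return movie rows with a 'genres' list collected from the genres file."""
    genres_by_movie = defaultdict(list)
    for row in csv.DictReader(genres_file):
        genres_by_movie[row["movie_id"]].append(row["genre"])
    movies = []
    for row in csv.DictReader(movies_file):
        # Movies without any genre row get an empty list.
        row["genres"] = genres_by_movie.get(row["id"], [])
        movies.append(row)
    return movies

movies = attach_genres(io.StringIO(movies_csv), io.StringIO(genres_csv))
for movie in movies:
    print(movie["title"], movie["genres"])
```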
Replication Data for: Movie Scripts Corpus
dataverse.harvard.edu
Updated May 6, 2024
Cite
Lance Drouet (2024). Replication Data for: Movie Scripts Corpus [Dataset]. http://doi.org/10.7910/DVN/PZTL2L
Unique identifier
https://doi.org/10.7910/DVN/PZTL2L
Dataset updated
May 6, 2024
Dataset provided by
Harvard Dataverse
Authors
Lance Drouet
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Data Source: https://www.kaggle.com/datasets/gufukuro/movie-scripts-corpus

Data Description: Movie Scripts Corpus

This corpus was collected to use for screenplay analysis with machine learning methods. Corpus includes movie scripts, crawled from different sources, their annotations by script structural elements and movies metadata.

Corpus description

Screenplay data consists of:

Movie scripts TXT-documents with raw full text (2858 docs)

Movie scripts TXT-documents with full text lemmas (2858 docs)

Manual annotation TXT-documents for some movie scripts (33 docs, more than 6000 annotated rows)

Movie scripts annotations TXT-documents obtained by BERT

Movie scripts annotations json-documents obtained by rule-based annotator ScreenPy

Movies metadata consists of:

Cut versions of movie reviews and scores from metacritic: Number of reviews: 21025; Number of movies with reviews: 2038

Metadata for movies, including: title, akas, launch year, score from metacritic, imdb user rating and number of votes from imdb.com, movie awards, opening weekend, producers, budget, script department, production companies, writers, directors, cast info, countries involved in production, age restrict, plot (with outline), keywords, genres, taglines, critics' synopsis

Screenplay awards information: Academy Awards adapted screenplay, Academy Awards original screenplay, BAFTA, Golden Globe Award for Best Screenplay, Writers Guild Awards Winners & Nominees 2020-2013 nominations information for 462 movies in total.

Movie characters data consists of:

Script text fragments with dialogs and scene descriptions for characters, gathered with annotators: 2153 movies and text fragments for 32114 characters in total

Gender labels for 4792 characters
Indonesian Film Database (IMDb)
opendatabay.com
Updated Jul 3, 2025
Cite
Datasimple (2025). Indonesian Film Database (IMDb) [Dataset]. https://www.opendatabay.com/data/dataset/e6c24dd2-f5c7-4abf-83f4-ac3deb784967
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Entertainment & Media Consumption
Description
This dataset contains details for 1262 Indonesian movies, compiled to offer insights into the country's film industry. It was assembled using an IMDb-Scraper and then converted and cleaned into a CSV file, providing a structured collection of movie information [1]. The data was collected from IMDb.com [1].

Columns

title: The primary title of the movie [2].

year: The release year of the movie, with values ranging from 1926 to 2020 [2].

description: A textual summary or plot outline for the movie [2].

genre: Categories that describe the movie's style or content, such as Drama or Comedy [2, 3].

rating: The age rating certification applied to the movie, for example, '13+' [2, 3].

users_rating: The average rating given by IMDb users, typically ranging from 1.2 to 9.4 [2, 3].

votes: The total count of votes received from IMDb users, with values varying from 5 to 187,000 [2, 4].

languages: The language(s) in which the movie is primarily presented, notably Indonesian and English [2, 4].

directors: The individual(s) credited with directing the movie, including names like Nayato Fio Nuala [2, 4].

actors: The main cast members or performers featured in the movie [2].

runtime: The duration of the movie [1].

Distribution

The dataset is provided in a CSV file format [1]. It includes 1262 unique movie records or rows [1, 2].

Usage

This dataset is ideal for:
* Exploratory data analysis of Indonesian cinema trends [1].
* Natural Language Processing (NLP) tasks on movie descriptions [1].
* Analysing movie characteristics such as genre distribution, rating trends, and language prevalence.
* Studying the impact of directors and actors within the Indonesian film landscape.

Coverage

The dataset specifically covers Indonesian movies [1, 2]. The time range for these movies spans from 1926 to 2020 [2].

License

CC0

Who Can Use It

Data Analysts and Scientists: For statistical analysis, trend identification, and data visualisations related to movies.

Researchers: Studying film history, cultural impact of cinema, or market analysis within the Indonesian context.

Natural Language Processing Specialists: For training models on movie descriptions, sentiment analysis, or content categorisation.

Film Enthusiasts and Critics: To explore movie characteristics, ratings, and directorial styles.

Dataset Name Suggestions

IMDb Indonesian Movies Data

Indonesian Film Database (IMDb)

IMDb Indonesian Cinema

Indonesian Movie Catalogue (IMDb)

Attributes

Original Data Source: IMDb Indonesian Movies
Data from: imdb
huggingface.co
Updated Aug 3, 2003
Cite
Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
Dataset updated
Aug 3, 2003
Dataset authored and provided by
Stanford NLP
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for "imdb"

Dataset Summary

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
"9,565 Top-Rated Movies Dataset"
kaggle.com
Updated Aug 19, 2024
Cite
Harshit@85 (2024). "9,565 Top-Rated Movies Dataset" [Dataset]. https://www.kaggle.com/datasets/harshit85/9565-top-rated-movies-dataset
Dataset updated
Aug 19, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Harshit@85
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
About the Dataset

Title: 9,565 Top-Rated Movies Dataset

Description:
This dataset offers a comprehensive collection of 9,565 of the highest-rated movies according to audience ratings on the Movie Database (TMDb). The dataset includes detailed information about each movie, such as its title, overview, release date, popularity score, average vote, and vote count. It is designed to be a valuable resource for anyone interested in exploring trends in popular cinema, analyzing factors that contribute to a movie’s success, or building recommendation engines.

Key Features:
- Title: The official title of each movie.
- Overview: A brief synopsis or description of the movie's plot.
- Release Date: The release date of the movie, formatted as YYYY-MM-DD.
- Popularity: A score indicating the current popularity of the movie on TMDb, which can be used to gauge current interest.
- Vote Average: The average rating of the movie, based on user votes.
- Vote Count: The total number of votes the movie has received.

Data Source: The data was sourced from the TMDb API, a well-regarded platform for movie information, using the /movie/top_rated endpoint. The dataset represents a snapshot of the highest-rated movies as of the time of data collection.

Data Collection Process:
- API Access: Data was retrieved programmatically using TMDb’s API.
- Pagination Handling: Multiple API requests were made to cover all pages of top-rated movies, ensuring the dataset’s comprehensiveness.
- Data Aggregation: Collected data was aggregated into a single, unified dataset using the pandas library.
- Cleaning: Basic data cleaning was performed to remove duplicates and handle missing or malformed data entries.
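
The pagination, aggregation, and de-duplication steps can be sketched roughly as follows. This is not the author's script. It uses plain lists instead of pandas, it takes a caller-supplied `fetch_page` function (hypothetical) in place of a real HTTP request, and it assumes each response is a dict carrying `results` and `total_pages` fields.

```python
def collect_all_pages(fetch_page, max_pages=1000):
    """Collect results across pages, dropping duplicate movie IDs.

    fetch_page(page) is a caller-supplied function (hypothetical here) that
    returns one decoded response as a dict with "results" and "total_pages".
    """
    seen_ids = set()
    movies = []
    page = 1
    total_pages = 1
    while page <= min(total_pages, max_pages):
        payload = fetch_page(page)
        total_pages = payload.get("total_pages", page)
        for movie in payload.get("results", []):
            if movie.get("id") in seen_ids:
                continue  # skip entries already collected from another page
            seen_ids.add(movie.get("id"))
            movies.append(movie)
        page += 1
    return movies

# Stand-in for a real HTTP call, so the sketch runs offline.
fake_pages = {
    1: {"total_pages": 2, "results": [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]},
    2: {"total_pages": 2, "results": [{"id": 2, "title": "B"}, {"id": 3, "title": "C"}]},
}
movies = collect_all_pages(lambda page: fake_pages[page])
print([m["title"] for m in movies])
```

Passing the fetch function in as an argument keeps the loop testable without network access or an API key.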

Potential Uses:
- Trend Analysis: Analyze trends in movie ratings over time or compare ratings across different genres.
- Recommendation Systems: Build and train models to recommend movies based on user preferences.
- Sentiment Analysis: Perform text analysis on movie overviews to understand common themes and sentiments.
- Statistical Analysis: Explore the relationship between popularity, vote count, and average ratings.

Data Format: The dataset is provided in a structured tabular format (e.g., CSV), making it easy to load into data analysis tools like Python, R, or Excel.

Usage License: The dataset is shared under [appropriate license], ensuring that it can be used for educational, research, or commercial purposes, with proper attribution to the data source (TMDb).

the_movies_dataset
kaggle.com
zip
Updated Jun 19, 2021
Cite
sezgin ildes (2021). the_movies_dataset [Dataset]. https://www.kaggle.com/sezginildes/the-movies-dataset
Explore at:
Available download formats: zip (15456686 bytes)
Dataset updated
Jun 19, 2021
Authors
sezgin ildes
Description
Context

These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

Content

This dataset consists of the following files:

movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
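
The keywords and credits files are described as holding stringified JSON objects. The sketch below decodes such a cell, trying strict JSON first and then falling back to a Python-literal parse for single-quoted text. Whether the real files need that fallback is an assumption, and the sample value is invented for illustration.

```python
import ast
import json

def parse_stringified(cell):
    """Decode a stringified JSON cell into Python objects, or None on failure."""
    try:
        return json.loads(cell)
    except (TypeError, json.JSONDecodeError):
        pass
    try:
        # Fallback for Python-literal style text (single-quoted keys).
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None

# Hypothetical cell contents, for illustration only.
sample = "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]"
names = [item["name"] for item in parse_stringified(sample)]
print(names)
```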

The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed here

Acknowledgements

This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.

The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available here

Inspiration

This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.

Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems

Some of the things you can do with this dataset:

Predicting movie revenue and/or movie success based on a certain metric.

What movies tend to get higher vote counts and vote averages on TMDB?

Building Content Based and Collaborative Filtering Based Recommendation Engines.
Global Movie Popularity Dataset
opendatabay.com
Updated Jul 3, 2025
Cite
Datasimple (2025). Global Movie Popularity Dataset [Dataset]. https://www.opendatabay.com/data/dataset/c9597b23-d205-46ff-abb3-674815373730
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Entertainment & Media Consumption
Description
This dataset provides details on the 10,000 most popular films globally, sourced from The Movie Database (TMDb) via its read API. TMDb is a crowd-sourced movie information database widely used by various film-related platforms and applications. The dataset is ideal for film-related analysis, building recommender systems, and natural language processing tasks, even for those new to data analysis, as it contains some missing values.

Columns

index: An identifier for each record.

title: The name of the movie.

overview: A concise summary or synopsis of the movie.

original_language: The primary language in which the movie was filmed.

vote_count: The number of votes received for the movie, also indicated as the date of publish in some contexts.

vote_average: The average rating given to the movie by voters.

popularity: A metric indicating the popularity score of the movie.

Distribution

The dataset is provided in a CSV file format. It comprises approximately 10,000 individual movie records. While exact row and record counts are not specified, the dataset is structured as tabular data, with each row representing a unique movie entry and columns detailing various attributes.

Usage

This dataset is well-suited for a variety of applications, including:
* Developing and enhancing film-related consoles, websites, and mobile applications.
* Creating movie recommender systems.
* Performing data visualisations related to film trends and popularity.
* Conducting natural language processing (NLP) tasks on movie overviews.
* Data analysis and exploration, particularly for those looking to practise handling missing data.

Coverage

The dataset covers movies from across the world, offering a global scope. While a specific time range for the movies is not explicitly stated, the data is fetched from TMDb, which updates its API periodically. It's noted that the dataset includes some null values where information was missing from the original TMDb database.

License

CC0

Who Can Use It

This dataset is intended for a broad audience including:
* Young analysts: To practise data cleaning and analysis with datasets containing missing values.
* Developers: For integrating movie information into media managers, mobile apps, and social sites.
* Researchers: For studies on movie popularity, audience reception, and content analysis.
* Data scientists: For building and testing machine learning models such as recommender systems and NLP models.

Dataset Name Suggestions

TMDb Popular Movies

Global Movie Popularity Dataset

Top Movies from TMDb API

Movie Data for Film Analysis

TMDb Film Insights

Attributes

Original Data Source: Popular Movies of IMDb
IMDB Selection Database
zenodo.org
Updated Nov 21, 2022
Cite
Cristian Campo Pérez; Nieves Fernández Ochoa (2022). IMDB Selection Database [Dataset]. http://doi.org/10.5281/zenodo.7339445
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7339445
Dataset updated
Nov 21, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Cristian Campo Pérez; Nieves Fernández Ochoa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Selection of the top 1000 entries of each genre in IMDB.

Contains information of:

title -> title of the entry

genres -> list genres of the entry

score -> mean rating from the viewers

people_votin -> number of votes

normal_number_of_reviews -> number of reviews from normal users

prof_number_of_reviews -> number of reviews from professionals

type_filmed -> type of content ( e.g. TV Series / original )

year -> release year

year_certification -> Age restriction certification

runtime -> length of chapter / movie

country -> Country where it was produced

creators -> List of name of the directors

cast -> List of names of the actors

plot -> brief summary of the plot

JPEG_link -> link to JPEG promotional image

This is a simulated dataset.
10000 Most Popular English Movies (2023)
kaggle.com
Updated Jul 30, 2023
Cite
Dnyanesh Yeole (2023). 10000 Most Popular English Movies (2023) [Dataset]. https://www.kaggle.com/datasets/dnyaneshyeole/10000-most-popular-english-movies-2023
Dataset updated
Jul 30, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dnyanesh Yeole
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
🎬 Welcome to the Popular English Movies Dataset (2023) 🎬! This dataset features information on a diverse collection of popular English movies.

Contents

The dataset provides a comprehensive set of features for each movie entry:

Title: The name of the movie, identifying it uniquely in the dataset.

Overview: A summary or synopsis of the movie, giving users an idea of its plot and theme.

Release_Date: The date when the movie was officially released.

Genre: The categories or genres to which the movie can be classified.

Popularity: A popularity metric calculated by TMDB developers.

Vote_Average: The average rating of the movie, ranging from 0 to 10

Vote_Count: The total number of votes received.

Usage

The Popular English Movies Dataset (2023) offers a wealth of opportunities for exploration and innovation in the realms of Data Science and Machine Learning. Here are some exciting ways to utilize and contribute to the dataset:

Genre Prediction Model: Leveraging the 'overview' and 'title' features, data enthusiasts can build powerful Natural Language Processing (NLP) models to predict movie genres. By analyzing the movie summaries and titles, learners can gain insights into the relationships between textual data and movie genres, enabling more accurate genre predictions.

Movie Recommender System: The dataset serves as a fantastic foundation for constructing a movie recommender system. By applying collaborative filtering or content-based filtering techniques, learners can develop personalized recommendations for users based on their preferences, leading to enhanced movie discovery experiences.

Popularity Analysis: Utilizing the 'vote_count' and 'vote_average' features, learners can delve into the factors influencing a movie's popularity. Through data exploration and visualization, one can uncover trends and patterns that contribute to a movie's overall appeal among viewers.
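The content-based filtering idea mentioned above can be sketched with nothing more than the overview text. The following is a minimal, illustrative example and not the dataset author's method: the three sample movies, the function names, and the smoothed TF-IDF weighting are all assumptions chosen for the sketch. Real input would come from the Title and Overview columns.

```python
import math
import re
from collections import Counter

# Hypothetical sample rows; real data would use the Title and Overview columns.
MOVIES = {
    "Space Rescue": "An astronaut crew launches a daring rescue mission in deep space.",
    "Lost Orbit": "A stranded astronaut fights to survive alone in deep space.",
    "Baking Hearts": "A pastry chef finds love while saving her family bakery.",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using smoothed TF-IDF."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many overviews each term appears.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        if not tokens:  # an empty overview yields an empty vector
            vectors[name] = {}
            continue
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(title, docs, top_n=1):
    """Rank the other movies by overview similarity to the given title."""
    vectors = tfidf_vectors(docs)
    target = vectors[title]
    scores = [
        (other, cosine(target, vec)) for other, vec in vectors.items() if other != title
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scores[:top_n]]
```

Calling `recommend("Space Rescue", MOVIES)` ranks the other two sample movies by how much overview vocabulary they share with the query. Collaborative filtering would additionally need per-user ratings, which the columns listed above do not include.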

Source

The data was sourced from TMDB's API and can be explored in full at https://www.themoviedb.org/movie. This platform showcases an extensive collection of movie data.

Lights, Camera, Upvote! Dive into 10,000 Popular English Movies from 2023! 🎬👍
TMDB Top Movies Dataset
opendatabay.com
Updated Jul 3, 2025
Datasimple (2025). TMDB Top Movies Dataset [Dataset]. https://www.opendatabay.com/data/dataset/a663f3c0-8065-4aff-807a-a50f31b6034c
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Entertainment & Media Consumption
Description
📽️ Movie Descriptions Dataset

This dataset contains a curated list of classic and contemporary films along with their titles, genres, and detailed plot descriptions. It includes globally acclaimed movies across genres such as drama, crime, romance, animation, fantasy, action, and more. From cinematic masterpieces like The Shawshank Redemption and Schindler’s List to iconic anime like Your Name and A Silent Voice, this dataset offers a diverse mix of storytelling across cultures and decades.

Each entry features:

🎬 Movie Name

🎭 Genre(s)

📝 Brief Description / Plot Summary

This dataset can be used for:

🎞️ Movie recommendation systems

🧠 NLP tasks like sentiment analysis, genre prediction, and text classification

🎥 Data visualization and storytelling

🗣️ Text summarization or chatbot training on movie-related queries

Ideal for data science, machine learning, and natural language processing enthusiasts who want to experiment with real-world descriptive text data.

Original Data Source: TMDB Top Movies Dataset
Global Movie Popularity Dataset
opendatabay.com
Updated Jul 3, 2025
Datasimple (2025). Global Movie Popularity Dataset [Dataset]. https://www.opendatabay.com/data/consumer/af505531-100e-4731-b7e9-f817fa91f16d
Dataset updated
Jul 3, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Entertainment & Media Consumption
Description
This dataset contains details for 10,000 top-rated movies from TMDB, updated as of 26th July 2022. Its primary purpose is to facilitate text preprocessing and cleansing for Natural Language Processing (NLP) tasks related to movie data. It is also highly suitable for developing content-based and collaborative filtering recommendation engines. This resource offers a rich context for understanding movie popularity, genres, and audience reception.

Columns

id: The unique identification number for the movie on the website.

title: The name of the movie.

genre: The categorisation of the movie, such as crime, adventure, or drama.

original_language: The initial language in which the movie was released.

overview: A brief summary or synopsis of the movie.

popularity: A metric indicating the movie's popularity.

release_date: The date when the movie was first released.

vote_average: The average rating given to the movie by voters.

vote_count: The total number of votes received by the movie.

Distribution

This dataset comprises approximately 10,000 records, typically provided in a CSV file format. Specific row counts for a sample file are updated separately. The dataset includes unique values for movie IDs, with original_language predominantly being English (around 78%) and French (7%). Movie genres include Comedy (7%) and Drama (6%), with a wide array of other genres. Release dates span a broad period from 1902 to 2022, with the majority of entries from 1998 onwards. Popularity scores range from 0.6 to over 10,000, and vote averages are generally between 4.6 and 8.7, with vote counts reaching up to 31,900.
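Summary figures like the ones above can be recomputed from the CSV itself. The sketch below is illustrative only: the four sample rows are made up, the language codes (`en`, `fr`) are an assumption about how `original_language` is stored, and the helper names are hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up sample rows using a subset of the columns listed above.
SAMPLE_CSV = """id,title,original_language,vote_average,vote_count
1,Example A,en,7.5,1200
2,Example B,fr,6.8,300
3,Example C,en,8.1,5400
4,Example D,en,5.9,80
"""


def language_shares(rows):
    """Return the fraction of rows for each original_language value."""
    counts = Counter(row["original_language"] for row in rows)
    total = sum(counts.values())
    return {lang: count / total for lang, count in counts.items()}


def vote_average_range(rows):
    """Return (min, max) of vote_average, skipping empty cells."""
    values = [float(row["vote_average"]) for row in rows if row["vote_average"]]
    return min(values), max(values)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
shares = language_shares(rows)
low, high = vote_average_range(rows)
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle; the file name is not given in the description, so it is left out here.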

Usage

This dataset is ideal for:

Performing extensive text preprocessing and cleansing for NLP applications on movie descriptions and titles.

Building various movie recommendation systems, including content-based recommenders and collaborative filtering engines.

Analysing trends in movie popularity, audience ratings, and language distribution.

Developing data science projects focused on entertainment and media consumption.

Coverage

The dataset's geographic scope is global. It covers movies released between 17th April 1902 and 13th July 2022, with the dataset itself assembled with data up to 26th July 2022. There are no specific demographic notes available, but it broadly covers top-rated films from the TMDB database.

License

CC0

Who Can Use It

This dataset is suitable for:

Data Scientists and Machine Learning Engineers working on recommendation systems or NLP projects.

Researchers studying film industry trends, audience engagement, or language processing.

Developers looking to integrate movie data into applications.

Anyone interested in exploratory data analysis within the entertainment sector.

Dataset Name Suggestions

TMDB Top Movies Dataset

Movie Data for NLP & Recommendations

Global Movie Popularity Dataset

Film Data Hub

Attributes

Original Data Source: TMDB Movies Dataset
‘IMDB Horror Movie Dataset [2012 Onwards]’ analyzed by Analyst-2
analyst-2.ai
Updated Nov 2, 2017
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2017). ‘IMDB Horror Movie Dataset [2012 Onwards]’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-imdb-horror-movie-dataset-2012-onwards-ca86/3437da9d/?iid=004-265&v=presentation
Dataset updated
Nov 2, 2017
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘IMDB Horror Movie Dataset [2012 Onwards]’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/PromptCloudHQ/imdb-horror-movie-dataset on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

On the occasion of Halloween, we thought of sharing a spooky dataset for the community to crunch on the data!

Remember - "This Halloween could get a lot more spookier, but treats are guaranteed".

Content

The dataset goes back to 2012 and contains the following data fields:

Title

Genres

Release Date

Release Country

Movie Rating

Review Rating

Movie Run Time

Plot

Cast

Language

Filming Locations

Budget

Acknowledgements

The data was extracted by PromptCloud's in-house data extraction solution.

Inspiration

Some of the things that can be explored are the following:

Number of horror movies released over the years

Number of movies released in terms of country

Rating and run time distribution

Spooky regions by considering the shooting location

Text mining on the description text

--- Original source retains full ownership of the source dataset ---
Plot Data: Analysis of thin liquid films driven by SAW
data.niaid.nih.gov
zenodo.org
Updated Oct 25, 2021
Mitas, Kevin David Joachim (2021). Plot Data: Analysis of thin liquid films driven by SAW [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5069876
Dataset updated
Oct 25, 2021
Dataset authored and provided by
Mitas, Kevin David Joachim
Description
Data of the relevant plots of the thesis
Movie Reviews Dataset
paperswithcode.com
Updated Apr 2, 2024
(2024). Movie Reviews Dataset [Dataset]. https://paperswithcode.com/dataset/movie-reviews
Dataset updated
Apr 2, 2024
Description
This dataset is based on the movie review polarity dataset (v2.0) collected and maintained by Bo Pang and Lillian Lee. Their dataset (we'll call it PL2.0) consists of 1000 positive and 1000 negative movie reviews obtained from the Internet Movie Database (IMDb) review archive.

The main contribution of this release is the enrichment of the documents with "annotator rationales," a concept we describe in our NAACL HLT 2007 paper.

Basically, "rationales" are segments of the text that support an annotator's classification. Let's say we have a movie review that is labeled as positive (i.e. the writer has a favorable opinion of the movie). Then the rationales would be segments of the text that support the claim (by an annotator) that the review is, indeed, positive.

Here are some examples of positive rationales (the segments enclosed by double square brackets):

[[you will enjoy the hell out of]] American Pie.

fortunately, they [[managed to do it in an interesting and funny way]].

he is [[one of the most exciting martial artists on the big screen]], continuing to perform his own stunts and [[dazzling audiences]] with his flashy kicks and punches.

the romance was [[enchanting]].

And here are some examples of negative rationales:

A woman in peril. A confrontation. An explosion. The end. [[Yawn. Yawn. Yawn.]]

when a film makes watching Eddie Murphy [[a tedious experience, you know something is terribly wrong]].

the movie is [[so badly put together]] that even the most casual viewer may notice the [[miserable pacing and stray plot threads]].

[[don't go see]] this movie
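Because rationales are marked with double square brackets, they can be pulled out of a review with a short regular expression. This is a minimal sketch that assumes the brackets are never nested and always balanced; the function names are hypothetical and not part of the dataset's tooling.

```python
import re

# Non-greedy match of everything between "[[" and the next "]]".
RATIONALE = re.compile(r"\[\[(.+?)\]\]", re.DOTALL)


def extract_rationales(text):
    """Return the rationale segments found in a review, in order."""
    return [segment.strip() for segment in RATIONALE.findall(text)]


def strip_markup(text):
    """Return the review text with the bracket markers removed."""
    return RATIONALE.sub(r"\1", text)


review = (
    "the movie is [[so badly put together]] that even the most casual "
    "viewer may notice the [[miserable pacing and stray plot threads]]."
)
rationales = extract_rationales(review)
plain = strip_markup(review)
```

Here `rationales` holds the two bracketed segments and `plain` is the same sentence without the markers, which is the form a classifier trained on the unannotated reviews would see.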


mexwell (2024). 🎥 Movie Plot Database [Dataset]. https://www.kaggle.com/datasets/mexwell/movie-plot-database/data

🎥 Movie Plot Database

42k movie plot summaries with information about movie and actors


Dataset updated

Aug 7, 2024

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

mexwell

License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Description

Dataset of movie plot summaries and associated metadata. This data was collected by David Bamman, Brendan O'Connor, and Noah Smith at the Language Technologies Institute and Machine Learning Department at Carnegie Mellon University.

Data

plot_summaries.csv

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

movie_metadata.csv

Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

Wikipedia movie ID
Freebase movie ID
Movie name
Movie release date
Movie box office revenue
Movie runtime
Movie languages (Freebase ID:name tuples)
Movie countries (Freebase ID:name tuples)
Movie genres (Freebase ID:name tuples)
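Since the Wikipedia movie ID links the plot summaries to the movie metadata, the two files can be joined on that field. The sketch below is a rough illustration under stated assumptions: the sample rows are invented, the plot file is assumed to be tab-separated like the metadata file, neither file is assumed to have a header row, and the snake_case column names are labels chosen here for readability.

```python
import csv
import io

# Invented sample data; the plot file is ASSUMED to be "ID<TAB>summary".
PLOTS_TSV = "101\tA detective hunts a thief.\n102\tTwo friends cross a desert.\n"
# Invented metadata rows with fields in the column order described above.
METADATA_TSV = (
    "101\t/m/aaa\tExample One\t2001-05-04\t1000000\t95.0\t{}\t{}\t{}\n"
    "103\t/m/ccc\tExample Three\t1999\t\t88.0\t{}\t{}\t{}\n"
)

PLOT_COLUMNS = ["wikipedia_movie_id", "summary"]
METADATA_COLUMNS = [
    "wikipedia_movie_id",
    "freebase_movie_id",
    "movie_name",
    "release_date",
    "box_office_revenue",
    "runtime",
    "languages",
    "countries",
    "genres",
]


def read_tsv(text, columns):
    """Parse headerless tab-separated text into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader]


def join_plots(plots, metadata):
    """Inner join: keep plots whose Wikipedia movie ID has metadata."""
    by_id = {row["wikipedia_movie_id"]: row for row in metadata}
    joined = []
    for plot in plots:
        meta = by_id.get(plot["wikipedia_movie_id"])
        if meta is not None:
            joined.append({**meta, "summary": plot["summary"]})
    return joined


joined = join_plots(
    read_tsv(PLOTS_TSV, PLOT_COLUMNS),
    read_tsv(METADATA_TSV, METADATA_COLUMNS),
)
```

An inner join is used because the two files cover different numbers of movies (42,306 summaries versus 81,741 metadata rows), so some IDs will appear in only one of them.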

character_metadata.csv

Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

Wikipedia movie ID
Freebase movie ID
Movie release date
Character name
Actor date of birth
Actor gender
Actor height (in meters)
Actor ethnicity (Freebase ID)
Actor name
Actor age at movie release
Freebase character/actor map ID
Freebase character ID
Freebase actor ID

tvtropes.clusters.txt

72 character types drawn from tvtropes.com, along with 501 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.

name.clusters.txt

970 unique character names used in at least two different movies, along with 2,666 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.

Acknowledgments

This research was supported in part by U.S. National Science Foundation grant IIS-0915187.

All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).

Photo by Jakob Owens on Unsplash
