Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset of movie plot summaries and associated metadata. This data was collected by David Bamman, Brendan O'Connor, and Noah Smith at the Language Technologies Institute and Machine Learning Department at Carnegie Mellon University.
Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.
Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:
- Wikipedia movie ID
- Freebase movie ID
- Movie name
- Movie release date
- Movie box office revenue
- Movie runtime
- Movie languages (Freebase ID:name tuples)
- Movie countries (Freebase ID:name tuples)
- Movie genres (Freebase ID:name tuples)
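A minimal sketch of reading one row of the movie metadata file, assuming the nine columns appear in the order listed above and that the language, country, and genre fields are serialized as JSON objects mapping Freebase ID to name. The description only calls them "ID:name tuples", so that encoding is an assumption. `MovieRecord`, `parse_metadata_line`, and the sample row are hypothetical and made up for illustration.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MovieRecord:
    wikipedia_id: str
    freebase_id: str
    name: str
    release_date: Optional[str]
    revenue: Optional[float]
    runtime: Optional[float]
    languages: Dict[str, str]
    countries: Dict[str, str]
    genres: Dict[str, str]


def _optional_float(text: str) -> Optional[float]:
    """Empty cells become None instead of raising."""
    return float(text) if text.strip() else None


def _id_name_map(text: str) -> Dict[str, str]:
    # Assumption: the cell is a JSON object such as {"<Freebase ID>": "<name>"}.
    return json.loads(text) if text.strip() else {}


def parse_metadata_line(line: str) -> MovieRecord:
    """Split one tab-separated row into the nine documented columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    wiki_id, fb_id, name, date, revenue, runtime, langs, countries, genres = fields
    return MovieRecord(
        wikipedia_id=wiki_id,
        freebase_id=fb_id,
        name=name,
        release_date=date or None,
        revenue=_optional_float(revenue),
        runtime=_optional_float(runtime),
        languages=_id_name_map(langs),
        countries=_id_name_map(countries),
        genres=_id_name_map(genres),
    )


# Invented sample row (not taken from the dataset); revenue is left empty.
SAMPLE = (
    "1\t/m/example\tExample Movie\t2001-01-01\t\t98.0\t"
    '{"/m/lang": "English Language"}\t{}\t{"/m/genre": "Drama"}\n'
)
record = parse_metadata_line(SAMPLE)
```

Keeping empty cells as `None` rather than zero avoids confusing a missing box office figure with a real value of 0.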
Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:
72 character types drawn from tvtropes.com, along with 501 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.
970 unique character names used in at least two different movies, along with 2,666 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.
This research was supported in part by U.S. National Science Foundation grant IIS-0915187.
All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).
Photo by Jakob Owens on Unsplash
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Wikipedia Movie Plots with AI Plot Summaries
Dataset Summary
Context
Wikipedia Movies Plots dataset by JustinR ( https://www.kaggle.com/jrobischon/wikipedia-movie-plots )
Content
Everything is the same as in https://www.kaggle.com/jrobischon/wikipedia-movie-plots
Acknowledgements
Please, go upvote https://www.kaggle.com/jrobischon/wikipedia-movie-plots dataset, since this is 100% based on that.
Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/vishnupriyavr/wiki-movie-plots-with-summaries.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains movie plots extracted from Wikipedia, along with other key metadata. It is specifically curated for movies released between 1950 and 2023 that have accumulated over 1000 ratings on IMDb. The primary purpose of this dataset is to facilitate development in Large Language Models (LLMs) for applications such as movie searching or recommendation systems. The plot summaries have been meticulously cleaned to remove irrelevant elements like links and references, ensuring a pure text value. Where Wikipedia plots were unavailable, IMDb synopses were used as a fallback. The dataset includes 89% of movies with detailed plot information, while 100% include a short summary untouched from Wikipedia, which is useful for matching metadata in retriever applications. Columns like 'stars', 'directors', and 'genres' are provided as lists of values, making them suitable for direct loading into vector databases.
The data file is typically in CSV format. The dataset spans movies released from 1950 up to 2023. There are 20,617 unique movie titles, 21,596 unique star names, and 9,863 unique director names. The genres column contains 21,675 unique values. Movie runtimes range from -1 to 776 minutes, with a significant majority (17,433 entries) falling between 76.70 and 115.55 minutes. The number of ratings (ratingCount) varies widely, starting from 1,001 and going up to 2.73 million. IMDb ratings range from 1.2 to 9.3. While specific total row/record counts are not available, the distribution data for year, runtime, ratingCount, and imdb_rating show various value counts within different ranges.
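A reported minimum runtime of -1 cannot be a real duration, so it may be a placeholder for a missing value; the description does not say so. The sketch below works under that assumption and drops non-positive runtimes before computing a statistic. The rows are invented and `clean_runtime` is a hypothetical helper.

```python
from statistics import mean
from typing import Optional


def clean_runtime(value: float) -> Optional[float]:
    """Treat non-positive runtimes as missing (assumption: -1 is a sentinel)."""
    return value if value > 0 else None


# Invented rows using the 'runtime' column name from the description.
rows = [
    {"title": "A", "runtime": 95},
    {"title": "B", "runtime": -1},
    {"title": "C", "runtime": 120},
]

valid = [
    runtime
    for runtime in (clean_runtime(row["runtime"]) for row in rows)
    if runtime is not None
]
average_runtime = mean(valid)
```

Without this step, the sentinel would pull averages and minimums down and distort any runtime histogram.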
This dataset is ideal for: * Developing demonstration projects leveraging Large Language Models (LLMs). * Creating movie search applications, such as the example of a movie searching app like cinemattr.ca. * Building retriever applications where the 'summary' column can be used for metadata matching. * Populating vector databases with structured information from 'stars', 'directors', and 'genres' for advanced querying and analysis.
The dataset's geographic scope is global. It includes movies released within the time frame of 1950 to 2023. The data availability specifies that 89% of the movies have detailed plot information, and all movies (100%) include a short summary. The dataset focuses on films with more than 1000 ratings on IMDb.
CC0
This dataset is suitable for: * AI and machine learning developers who are building models based on natural language processing. * Data scientists and researchers interested in film data and entertainment analytics. * Software engineers developing applications that require movie plot summaries or metadata, such as recommendation engines. * Students and enthusiasts looking for high-quality, pre-processed text data for LLM projects.
Original Data Source: Movie Plots from Wikipedia
https://creativecommons.org/publicdomain/zero/1.0/
This movies dataset can be used for a variety of purposes, depending on your goals and the insights you want to derive from the data. Here are some potential use cases for the dataset.
Movie Analysis
Recommendation Systems
Popularity Measurement
Audience Engagement
Comparative Analysis
The dataset consists of various attributes related to movies. These attributes provide information about each entry in the dataset:
1. Index: - Index for each row
2. Title: - The title attribute represents the name of the movie.
3. Original Language: - This attribute signifies the language in which the movie was originally produced. It could offer insights into the target audience and geographical scope of the content.
4. Release Date: - This attribute indicates when the movie was officially released for public viewing. The release date can impact factors like marketing strategies, competition with other releases, and audience anticipation.
5. Popularity: - This attribute likely represents the measure of how well-known or talked-about a particular movie is within a given context. It could be based on factors such as online discussions, social media mentions, and viewer interest.
6. Vote Average: - This attribute likely represents the average rating or score given to the movie by viewers who have voted. A higher average could imply that the content is generally well-received.
7. Vote Count: - This attribute indicates the number of votes or ratings that the movie has received from viewers. A higher vote count might suggest a larger viewer base or more engaging content.
8. Overview: - This attribute provides a concise summary or description of the movie plot, themes, and overall content. It offers a glimpse into what the content is about.
https://data.go.kr/ugs/selectPortalPolicyView.do
Information on Korean and foreign films that have been imported or released in Korea, compiled and published by the Korea Film Archive. It contains information such as the movie title, director, production company, production year, release date, participating actors and staff, genre, and plot.
https://choosealicense.com/licenses/unknown/
Dataset Card for "rotten_tomatoes"
Dataset Summary
Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005.
Supported Tasks and Leaderboards
More Information Needed
Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This is a dataset of the 10,000 most popular movies across the world, irrespective of language and recency. These have been extracted using the TMDb API.
What is TMDB's API? The closed-source API service is for those people interested in using their movies, TV shows or actor images and/or data in their application. TMDb's API is a system that they provide for developers and their team to programmatically fetch and use TMDb's data and/or images. Their API is free to use as long as you attribute TMDb as the source of the data and/or images. Also, they update their API from time to time.
This dataset lists the 10,000 most popular movies across the globe. Information held inside the dataset:
A. Dataset 1: Movies dataset
1. title - Title of the movie in English.
2. overview - A small summary of the plot.
3. original_lang - Original language it was shot in.
4. rel_date - Date of release.
5. popularity - Popularity.
6. vote_count - Votes received.
7. vote_average - Average of all votes received.
B. Dataset 2: Genres dataset
1. id
2. Movie ID
3. Genre
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data Source: https://www.kaggle.com/datasets/gufukuro/movie-scripts-corpus
Data Description: Movie Scripts Corpus
This corpus was collected for screenplay analysis with machine learning methods. It includes movie scripts crawled from different sources, their annotations by script structural elements, and movie metadata.
Corpus description
Screenplay data consists of:
- Movie scripts: TXT documents with raw full text (2858 docs)
- Movie scripts: TXT documents with full-text lemmas (2858 docs)
- Manual annotation TXT documents for some movie scripts (33 docs, more than 6000 annotated rows)
- Movie script annotations: TXT documents obtained by BERT
- Movie script annotations: JSON documents obtained by the rule-based annotator ScreenPy
Movies metadata consists of:
- Cut versions of movie reviews and scores from Metacritic (number of reviews: 21025; number of movies with reviews: 2038)
- Metadata for movies, including: title, akas, launch year, score from Metacritic, IMDb user rating and number of votes from imdb.com, movie awards, opening weekend, producers, budget, script department, production companies, writers, directors, cast info, countries involved in production, age restrict, plot (with outline), keywords, genres, taglines, critics' synopsis
- Screenplay awards information: Academy Awards adapted screenplay, Academy Awards original screenplay, BAFTA, Golden Globe Award for Best Screenplay, Writers Guild Awards Winners & Nominees 2020-2013; nominations information for 462 movies in total
Movie characters data consists of:
- Script text fragments with dialogs and scene descriptions for characters, gathered with annotators: 2153 movies and text fragments for 32114 characters in total
- Gender labels for 4792 characters
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains details for 1262 Indonesian movies, compiled to offer insights into the country's film industry. It was assembled using an IMDb-Scraper and then converted and cleaned into a CSV file, providing a structured collection of movie information [1]. The data was collected from IMDb.com [1].
The dataset is provided in a CSV file format [1]. It includes 1262 unique movie records or rows [1, 2].
This dataset is ideal for: * Exploratory data analysis of Indonesian cinema trends [1]. * Natural Language Processing (NLP) tasks on movie descriptions [1]. * Analysing movie characteristics such as genre distribution, rating trends, and language prevalence. * Studying the impact of directors and actors within the Indonesian film landscape.
The dataset specifically covers Indonesian movies [1, 2]. The time range for these movies spans from 1926 to 2020 [2].
CC0
Original Data Source: IMDb Indonesian Movies
https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
https://creativecommons.org/publicdomain/zero/1.0/
Title: 9,565 Top-Rated Movies Dataset
Description:
This dataset offers a comprehensive collection of 9,565 of the highest-rated movies according to audience ratings on the Movie Database (TMDb). The dataset includes detailed information about each movie, such as its title, overview, release date, popularity score, average vote, and vote count. It is designed to be a valuable resource for anyone interested in exploring trends in popular cinema, analyzing factors that contribute to a movie's success, or building recommendation engines.
Key Features:
- Title: The official title of each movie.
- Overview: A brief synopsis or description of the movie's plot.
- Release Date: The release date of the movie, formatted as YYYY-MM-DD.
- Popularity: A score indicating the current popularity of the movie on TMDb, which can be used to gauge current interest.
- Vote Average: The average rating of the movie, based on user votes.
- Vote Count: The total number of votes the movie has received.
Data Source:
The data was sourced from the TMDb API, a well-regarded platform for movie information, using the /movie/top_rated endpoint. The dataset represents a snapshot of the highest-rated movies as of the time of data collection.
Data Collection Process:
- API Access: Data was retrieved programmatically using TMDb's API.
- Pagination Handling: Multiple API requests were made to cover all pages of top-rated movies, ensuring the dataset's comprehensiveness.
- Data Aggregation: Collected data was aggregated into a single, unified dataset using the pandas library.
- Cleaning: Basic data cleaning was performed to remove duplicates and handle missing or malformed data entries.
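The collection steps above can be sketched roughly as follows. This is not the dataset author's script: it uses plain Python lists instead of pandas, and it replaces the real HTTP call with a stub (`fake_fetch`) so it runs offline. The `results` and `total_pages` response fields are assumed to match the shape of TMDb's paginated list responses; `iter_all_results` and `dedupe_by_id` are hypothetical helper names.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def iter_all_results(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every record across all pages, starting at page 1."""
    page = 1
    while True:
        payload = fetch_page(page)
        yield from payload.get("results", [])
        # Stop once the last page reported by the response has been read.
        if page >= payload.get("total_pages", 1):
            break
        page += 1


def dedupe_by_id(records: Iterable[dict]) -> List[dict]:
    """Keep the first record per 'id'; skip records with no id at all."""
    seen = set()
    unique = []
    for record in records:
        movie_id = record.get("id")
        if movie_id is None or movie_id in seen:
            continue
        seen.add(movie_id)
        unique.append(record)
    return unique


# Stub standing in for a real request to the top-rated endpoint (invented data).
_FAKE_PAGES: Dict[int, dict] = {
    1: {"page": 1, "total_pages": 2,
        "results": [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]},
    2: {"page": 2, "total_pages": 2,
        "results": [{"id": 2, "title": "B"}, {"id": 3, "title": "C"}]},
}


def fake_fetch(page: int) -> dict:
    return _FAKE_PAGES[page]


movies = dedupe_by_id(iter_all_results(fake_fetch))
```

Passing the fetch function in as a parameter keeps the pagination logic testable without network access or an API key.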
Potential Uses:
- Trend Analysis: Analyze trends in movie ratings over time or compare ratings across different genres.
- Recommendation Systems: Build and train models to recommend movies based on user preferences.
- Sentiment Analysis: Perform text analysis on movie overviews to understand common themes and sentiments.
- Statistical Analysis: Explore the relationship between popularity, vote count, and average ratings.
Data Format: The dataset is provided in a structured tabular format (e.g., CSV), making it easy to load into data analysis tools like Python, R, or Excel.
Usage License: The dataset is shared under [appropriate license], ensuring that it can be used for educational, research, or commercial purposes, with proper attribution to the data source (TMDb).
Context
These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.
This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.
Content
This dataset consists of the following files:
movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.
links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
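The keywords and credits cells are described as stringified JSON objects. Whether they are strict JSON or Python-literal style (single-quoted strings) is not stated, so the sketch below tries `json.loads` first and falls back to `ast.literal_eval`. The sample cell is invented and `parse_stringified` is a hypothetical helper.

```python
import ast
import json
from typing import Any


def parse_stringified(cell: str) -> Any:
    """Decode a cell holding a serialized list or dict."""
    try:
        return json.loads(cell)
    except json.JSONDecodeError:
        # Fallback for Python-literal syntax such as single-quoted strings.
        # literal_eval only evaluates literals, so it does not run code.
        return ast.literal_eval(cell)


# Invented cell in the single-quoted style that strict JSON rejects.
keywords_cell = "[{'id': 1, 'name': 'example keyword'}]"
keyword_names = [entry["name"] for entry in parse_stringified(keywords_cell)]
```

`ast.literal_eval` is preferred over `eval` here because the cell contents come from an external file.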
The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed here
Acknowledgements
This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.
The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available here
Inspiration
This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.
Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems
Some of the things you can do with this dataset: Predicting movie revenue and/or movie success based on a certain metric. What movies tend to get higher vote counts and vote averages on TMDB? Building Content Based and Collaborative Filtering Based Recommendation Engines.
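As one illustration of the content-based idea, here is a minimal sketch that ranks movies by cosine similarity between bag-of-words vectors built from plot text. It is a toy under stated assumptions, not the notebook author's method: the titles and overviews are invented, there is no stop-word removal or TF-IDF weighting, and `most_similar` is a hypothetical helper.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(
    title: str, overviews: Dict[str, str], top_n: int = 1
) -> List[Tuple[str, float]]:
    """Rank every other movie by similarity of its overview to the query's."""
    vectors = {name: Counter(tokenize(text)) for name, text in overviews.items()}
    query = vectors[title]
    scored = [
        (other, cosine(query, vector))
        for other, vector in vectors.items()
        if other != title
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented overviews for illustration only.
OVERVIEWS = {
    "Film A": "a detective hunts a killer in the city",
    "Film B": "a detective chases a killer across the city",
    "Film C": "two friends open a bakery in a small town",
}
best = most_similar("Film A", OVERVIEWS)
```

A collaborative filtering engine would instead work from the user ratings, which this sketch does not touch.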
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides details on the 10,000 most popular films globally, sourced from The Movie Database (TMDb) via its read API. TMDb is a crowd-sourced movie information database widely used by various film-related platforms and applications. The dataset is ideal for film-related analysis, building recommender systems, and natural language processing tasks, even for those new to data analysis, as it contains some missing values.
The dataset is provided in a CSV file format. It comprises approximately 10,000 individual movie records. While exact row and record counts are not specified, the dataset is structured as tabular data, with each row representing a unique movie entry and columns detailing various attributes.
This dataset is well-suited for a variety of applications, including: * Developing and enhancing film-related consoles, websites, and mobile applications. * Creating movie recommender systems. * Performing data visualisations related to film trends and popularity. * Conducting natural language processing (NLP) tasks on movie overviews. * Data analysis and exploration, particularly for those looking to practise handling missing data.
The dataset covers movies from across the world, offering a global scope. While a specific time range for the movies is not explicitly stated, the data is fetched from TMDb, which updates its API periodically. It's noted that the dataset includes some null values where information was missing from the original TMDb database.
CC0
This dataset is intended for a broad audience including: * Young analysts: To practise data cleaning and analysis with datasets containing missing values. * Developers: For integrating movie information into media managers, mobile apps, and social sites. * Researchers: For studies on movie popularity, audience reception, and content analysis. * Data scientists: For building and testing machine learning models such as recommender systems and NLP models.
Original Data Source: Popular Movies of IMDb
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Selection of top 1000 entries of each gender in IMDB.
Contains information of:
This is a simulated dataset.
https://creativecommons.org/publicdomain/zero/1.0/
Welcome to the Popular English Movies Dataset (2023)! This dataset features information on a diverse collection of popular English movies.
The dataset provides a comprehensive set of features for each movie entry:
The Popular English Movies Dataset (2023) offers a wealth of opportunities for exploration and innovation in the realms of Data Science and Machine Learning. Here are some exciting ways to utilize and contribute to the dataset:
The data was sourced by leveraging the power of TMDB's API, and it can be explored in its entirety at https://www.themoviedb.org/movie. This platform showcases an extensive collection of movie data.
Lights, Camera, Upvote! Dive into 10,000 Popular English Movies from 2023!
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Movie Descriptions Dataset
This dataset contains a curated list of classic and contemporary films along with their titles, genres, and detailed plot descriptions. It includes globally acclaimed movies across genres such as drama, crime, romance, animation, fantasy, action, and more. From cinematic masterpieces like The Shawshank Redemption and Schindler's List to iconic anime like Your Name and A Silent Voice, this dataset offers a diverse mix of storytelling across cultures and decades.
Each entry features:
- Movie Name
- Genre(s)
- Brief Description / Plot Summary
This dataset can be used for:
- Movie recommendation systems
- NLP tasks like sentiment analysis, genre prediction, and text classification
- Data visualization and storytelling
- Text summarization or chatbot training on movie-related queries
Ideal for data science, machine learning, and natural language processing enthusiasts who want to experiment with real-world descriptive text data.
Original Data Source: TMDB Top Movies Dataset
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains details for 10,000 top-rated movies from TMDB, updated as of 26th July 2022. Its primary purpose is to facilitate text preprocessing and cleansing for Natural Language Processing (NLP) tasks related to movie data. It is also highly suitable for developing content-based and collaborative filtering recommendation engines. This resource offers a rich context for understanding movie popularity, genres, and audience reception.
This dataset comprises approximately 10,000 records, typically provided in a CSV file format. Specific row counts for a sample file are updated separately. The dataset includes unique values for movie IDs, with original_language predominantly being English (around 78%) and French (7%). Movie genres include Comedy (7%) and Drama (6%), with a wide array of other genres. Release dates span a broad period from 1902 to 2022, with the majority of entries from 1998 onwards. Popularity scores range from 0.6 to over 10,000, and vote averages are generally between 4.6 and 8.7, with vote counts reaching up to 31,900.
This dataset is ideal for: * Performing extensive text preprocessing and cleansing for NLP applications on movie descriptions and titles. * Building various movie recommendation systems, including content-based recommenders and collaborative filtering engines. * Analysing trends in movie popularity, audience ratings, and language distribution. * Developing data science projects focused on entertainment and media consumption.
The dataset's geographic scope is global. It covers movies released between 17th April 1902 and 13th July 2022, with the dataset itself assembled with data up to 26th July 2022. There are no specific demographic notes available, but it broadly covers top-rated films from the TMDB database.
CC0
This dataset is suitable for: * Data Scientists and Machine Learning Engineers working on recommendation systems or NLP projects. * Researchers studying film industry trends, audience engagement, or language processing. * Developers looking to integrate movie data into applications. * Anyone interested in exploratory data analysis within the entertainment sector.
Original Data Source: TMDB Movies Dataset
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of "IMDB Horror Movie Dataset [2012 Onwards]" provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/PromptCloudHQ/imdb-horror-movie-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
On the occasion of Halloween, we thought of sharing a spooky dataset for the community to crunch on the data!
Remember - "This Halloween could get a lot more spookier, but treats are guaranteed".
The dataset goes back to 2012 and contains the following data fields:
The data was extracted by PromptCloud's in-house data extraction solution.
Some of the things that can be explored are the following:
--- Original source retains full ownership of the source dataset ---
Data of the relevant plots of the thesis
This dataset is based on the movie review polarity dataset (v2.0) collected and maintained by Bo Pang and Lillian Lee. Their dataset (we'll call it PL2.0) consists of 1000 positive and 1000 negative movie reviews obtained from the Internet Movie Database (IMDb) review archive.
The main contribution of this release is the enrichment of the documents with "annotator rationales," a concept we describe in our NAACL HLT 2007 paper.
Basically, "rationales" are segments of the text that support an annotator's classification. Let's say we have a movie review that is labeled as positive (i.e. the writer has a favorable opinion of the movie). Then the rationales would be segments of the text that support the claim (by an annotator) that the review is, indeed, positive.
Here are some examples of positive rationales (the segments enclosed by double square brackets):
[[you will enjoy the hell out of]] American Pie. fortunately, they [[managed to do it in an interesting and funny way]]. he is [[one of the most exciting martial artists on the big screen]], continuing to perform his own stunts and [[dazzling audiences]] with his flashy kicks and punches. the romance was [[enchanting]].
And here are some examples of negative rationales:
A woman in peril. A confrontation. An explosion. The end. [[Yawn. Yawn. Yawn.]] when a film makes watching Eddie Murphy [[a tedious experience, you know something is terribly wrong]]. the movie is [[so badly put together]] that even the most casual viewer may notice the [[miserable pacing and stray plot threads]]. [[don't go see]] this movie
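Since rationales are marked inline with double square brackets, they can be pulled out with a regular expression. The sketch below assumes the markers are never nested and always balanced, which the examples above suggest but the description does not guarantee. `extract_rationales` and `strip_markup` are hypothetical helper names.

```python
import re
from typing import List

# Non-greedy match so that two rationales on one line are captured separately.
RATIONALE = re.compile(r"\[\[(.+?)\]\]", re.DOTALL)


def extract_rationales(text: str) -> List[str]:
    """Return the text of every [[...]] segment, in order of appearance."""
    return [segment.strip() for segment in RATIONALE.findall(text)]


def strip_markup(text: str) -> str:
    """Remove the brackets but keep the enclosed words."""
    return RATIONALE.sub(r"\1", text)


negative = (
    "the movie is [[so badly put together]] that even the most casual "
    "viewer may notice the [[miserable pacing and stray plot threads]]."
)
segments = extract_rationales(negative)
plain = strip_markup("the romance was [[enchanting]].")
```

`strip_markup` recovers the unannotated review text, which is useful when the same document has to be fed to a classifier that does not use rationales.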