Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imdb_reviews', split='train')
# Print the first four training examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
TMDb (The Movie Database) is a comprehensive movie database that provides information about movies, including titles, ratings, release dates, revenue, genres, and more.
This dataset contains a collection of 1,000,000 movies from the TMDb database.
The dataset is updated daily.
By Himanshu Sekhar Paul
This inspiring IMDB Movie Dataset is a comprehensive database of movie ratings, featuring director_name, duration, actor_2_name, genres, actor_1_name, movie title and more. Whether you're a fan of dramatic thrillers or nostalgic '90s classics from our childhoods; here you'll find information about the most voted movies from users across the world. Delve into num_voted_users trends and discover the language each movie was released in to craft your very own personal film library of country-specific titles released in any given year. With this dataset at your disposal comparing imdb scores will never be easier! Who will come out top when the votes have been tallied? Dive into data for a journey unparalleled!
This dataset offers a comprehensive overview of the movie ratings from IMDB. It includes data about director name, duration, actors, genres, movie title, number of votes, language, country of origin, year released and IMDB score.
To use this dataset to get a deeper understanding of how movies are rated on IMDB you can take the following steps:
- Look through each column of the data to get an overall understanding. This will help you identify any specific trends or correlations in the data that you can then analyze further in later steps.
- Take some time to explore relationships between different columns such as 'Number Voted Users' and 'IMDB Score'; looking at how these numbers relate to each other can help you better understand rating trends on IMDB.
- Analyze how particular sub-groups perform within categories such as genre or country; this could provide insight into preferences for certain types of movies, or reveal countries with higher associated scores than others.
- Through your analysis, try to answer questions about specific demographic groups on IMDB: are there distinct preferences among age groups when it comes to what they watch? Are there any clear correlations between rating and genre within certain countries?
By keeping the questions above in mind and taking an initial 'big picture' view before diving into more detailed analysis, users should be able to find value in this dataset by uncovering useful insights about movie ratings on IMDB.
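The second step above, relating vote counts to scores, can be sketched with a plain Pearson correlation. This is a minimal illustration, not the dataset author's method: the sample rows are hypothetical, and the column name `imdb_score` is an assumption (only `num_voted_users` appears in the column list of this listing).

```python
import math

# Hypothetical sample rows; real values would be read from movie_data.csv.
# The column name "imdb_score" is an assumption, not confirmed by the listing.
rows = [
    {"num_voted_users": 1000, "imdb_score": 6.0},
    {"num_voted_users": 5000, "imdb_score": 6.5},
    {"num_voted_users": 20000, "imdb_score": 7.0},
    {"num_voted_users": 100000, "imdb_score": 8.0},
]


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


votes = [row["num_voted_users"] for row in rows]
scores = [row["imdb_score"] for row in rows]
corr = pearson(votes, scores)
print(f"correlation between votes and score: {corr:.3f}")
```

A value near 1 or -1 would suggest a strong linear relationship; vote counts are usually heavily skewed, so correlating against the logarithm of the vote count may be more informative.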
- Movie Recommendation System: The dataset can be used to build a movie recommendation system using machine learning algorithms like k-nearest neighbors or collaborative filtering. Based on the user's past ratings, the system can suggest relevant movies with similar genres, actors and directors.
- Movie Popularity Index: Using the data, a metric could be designed that provides an overall popularity index for movies released over the years. This index could be constructed from factors such as IMDb score and the number of votes and reviews collected.
- Genre-based Over/Under Performance Analysis: Based on the genres of the movies released each year, this dataset can provide insight into which genres are performing well and which are not. This kind of analysis could inform decisions about allocating production budgets or marketing campaigns for upcoming films in different genres across different regions or markets.
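One possible shape for the popularity index mentioned above is a vote-weighted rating that pulls movies with few votes toward the global mean score. The sketch below is only an assumption about how such an index could be built: the formula, the `MIN_VOTES` threshold, the sample movies, and the `imdb_score` column name are all illustrative choices, not part of the dataset.

```python
def weighted_rating(score, votes, mean_score, min_votes):
    """Shrink a movie's score toward the global mean when it has few votes."""
    return (votes * score + min_votes * mean_score) / (votes + min_votes)


# Hypothetical sample movies; "imdb_score" is an assumed column name.
movies = [
    {"movie_title": "A", "imdb_score": 9.0, "num_voted_users": 50},
    {"movie_title": "B", "imdb_score": 8.0, "num_voted_users": 50000},
    {"movie_title": "C", "imdb_score": 6.0, "num_voted_users": 20000},
]

# Assumed tuning knob: how many votes a movie needs before its own score dominates.
MIN_VOTES = 1000
mean_score = sum(m["imdb_score"] for m in movies) / len(movies)

ranked = sorted(
    movies,
    key=lambda m: weighted_rating(
        m["imdb_score"], m["num_voted_users"], mean_score, MIN_VOTES
    ),
    reverse=True,
)
for m in ranked:
    print(m["movie_title"])
```

With these sample values, movie A has the highest raw score but only 50 votes, so it ranks below movie B, which has a slightly lower score backed by far more votes.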
If you use this dataset in your research, please credit the original authors.
See the dataset description for more information.
File: movie_data.csv

| Column name | Description |
|:-------------------------|:---------------------------------------------------|
| director_name | Name of the director of the movie. (String) |
| duration | Length of the movie in minutes. (Integer) |
| actor_2_name | Name of the second actor in the movie. (String) |
| genres | Genre of the movie. (String) |
| actor_1_name | Name of the first actor in the movie. (String) |
| movie_title | Title of the movie. (String) |
| num_voted_users | Number of users who voted for the movie. (Integer) |
| actor_3_name | Name of the third actor in the movie. (String) |
| movie_imdb_link | Link to the movie's IMDB page. (String) |
| num_user_for_reviews |...
This dataset contains a set of movie ratings from the MovieLens website, a movie recommendation service. This dataset was collected and maintained by GroupLens, a research group at the University of Minnesota. There are 5 versions included: "25m", "latest-small", "100k", "1m", "20m". In all datasets, the movies data and ratings data are joined on "movieId". The 25m dataset, latest-small dataset, and 20m dataset contain only movie data and rating data. The 1m dataset and 100k dataset contain demographic data in addition to movie and rating data.
For each version, users can view either only the movies data by adding the "-movies" suffix (e.g. "25m-movies") or the ratings data joined with the movies data (and users data in the 1m and 100k datasets) by adding the "-ratings" suffix (e.g. "25m-ratings").
The features below are included in all versions with the "-ratings" suffix.
The "100k-ratings" and "1m-ratings" versions in addition include the following demographic features.
In addition, the "100k-ratings" dataset also has a feature "raw_user_age", which is the exact age of the user who made the rating.
Datasets with the "-movies" suffix contain only "movie_id", "movie_title", and "movie_genres" features.
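The version and suffix naming scheme described above can be sketched as a small helper. The function names `config_name` and `has_demographics` are hypothetical, introduced only for illustration; the version and suffix strings come from the description. In TFDS a config is normally selected with the `name/config` form, so the returned string would be appended after `movielens/`.

```python
# Versions and suffixes as listed in the dataset description.
VERSIONS = ("25m", "latest-small", "100k", "1m", "20m")
SUFFIXES = ("movies", "ratings")
# Versions whose "-ratings" config also carries demographic features.
DEMOGRAPHIC_VERSIONS = ("100k", "1m")


def config_name(version, table):
    """Build a config name such as '25m-ratings' (hypothetical helper)."""
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version!r}")
    if table not in SUFFIXES:
        raise ValueError(f"unknown suffix: {table!r}")
    return f"{version}-{table}"


def has_demographics(config):
    """Return True if the config includes user demographic features."""
    # rpartition splits on the last hyphen, so 'latest-small-ratings' works too.
    version, _, table = config.rpartition("-")
    return table == "ratings" and version in DEMOGRAPHIC_VERSIONS


name = config_name("100k", "ratings")
print(name, has_demographics(name))
# The string could then be used as, for example, tfds.load(f"movielens/{name}").
```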
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('movielens', split='train')
# Print the first four training examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains detailed information about movies listed on IMDb, including titles, genres, release dates, and ratings. It also includes user reviews and ratings, making it an excellent resource for sentiment analysis and trend analysis in the movie industry. This dataset can be used to gain insights into movie trends, audience preferences, and the correlation between movie attributes and ratings. The second file has an additional feature called poster_src, which is a link to the movie's poster image. The second file is larger than the first and covers a wider range of movies.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The "Real Movies Dataset" offers a comprehensive repository of diverse movie information, facilitating in-depth analysis and meaningful comparisons across various cinematic attributes. With its wealth of key details, this dataset serves as an invaluable resource for researchers, enthusiasts, and industry professionals alike.
Each entry in the dataset includes the following attributes:
* Movie Name: The title of the movie.
* Year of Release: The year in which the movie was officially released to the public.
* Watch Time: The duration of the movie in terms of hours and minutes, indicating the length of time required to watch the entire film.
* Movie Rating: This refers to the rating assigned to the movie based on various criteria such as content, suitability for different age groups, and overall quality. Ratings could be numerical (e.g., out of 10).
* Meatscore of Movie: This is a unique metric that represents the "meatiness" or substance of the movie. It might be a score assigned based on the complexity of the plot, character development, thematic depth, or other qualitative aspects.
* Votes: The number of votes or ratings received by the movie from viewers or critics. This metric provides an indication of the movie's popularity or reception.
* Gross: The total box office gross earnings generated by the movie, typically measured in a specific currency (e.g., USD). This metric reflects the commercial success of the film.
* Description: The dataset includes a brief description field providing a summary or overview of the movie's plot, genre, themes, or notable aspects. This description offers context and insight into the content and style of each film, aiding in understanding and analysis.
Overall, the "Real Movies Dataset" serves as a valuable resource for researchers, analysts, and enthusiasts interested in exploring and studying the dynamics of the film industry, including trends in movie production, audience preferences, and financial performance.
https://crawlfeeds.com/privacy_policy
Unlock one of the most comprehensive movie datasets available—4.5 million structured IMDb movie records, extracted and enriched for data science, machine learning, and entertainment research.
This dataset includes a vast collection of global movie metadata, including details on title, release year, genre, country, language, runtime, cast, directors, IMDb ratings, reviews, and synopsis. Whether you're building a recommendation engine, benchmarking trends, or training AI models, this dataset is designed to give you deep and wide access to cinematic data across decades and continents.
Perfect for use in film analytics, OTT platforms, review sentiment analysis, knowledge graphs, and LLM fine-tuning, the dataset is cleaned, normalized, and exportable in multiple formats.
Genres: Drama, Comedy, Horror, Action, Sci-Fi, Documentary, and more
Train LLMs or chatbots on cinematic language and metadata
Build or enrich movie recommendation engines
Run cross-lingual or multi-region film analytics
Benchmark genre popularity across time periods
Power academic studies or entertainment dashboards
Feed into knowledge graphs, search engines, or NLP pipelines
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as per their details listed on the official Amazon website. The data was scraped in January 2023 from the official Amazon website.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
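Because the rows in data_rt.csv are ordered by class, a reproducible shuffle is a sensible first step. The following is a minimal sketch under stated assumptions: the rows are a small hypothetical stand-in for the real file, and the `shuffled` helper and its `seed` parameter are illustrative names.

```python
import random

# Hypothetical stand-in for data_rt.csv: (review, label) pairs, negatives first,
# mirroring the class-ordered layout described above.
rows = [(f"negative review {i}", 0) for i in range(5)] + [
    (f"positive review {i}", 1) for i in range(5)
]


def shuffled(rows, seed=0):
    """Return a shuffled copy of rows; a fixed seed keeps the order reproducible."""
    rng = random.Random(seed)
    out = list(rows)  # copy so the caller's list is left untouched
    rng.shuffle(out)
    return out


mixed = shuffled(rows)
print([label for _, label in mixed])
```

Shuffling before any train/test split avoids ending up with a split that contains only one class.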
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
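The mapping from a polarity score to the categorical division column could look like the sketch below. TextBlob polarity scores range from -1.0 to 1.0, but the thresholds and the label names used here are assumptions; the source does not state how the division labels were actually derived.

```python
def division_from_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a categorical label.

    The cut-offs and label names are assumptions for illustration only.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError(f"polarity out of range: {polarity}")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


for score in (0.8, 0.0, -0.35):
    print(score, division_from_polarity(score))
```

Widening `neutral_band` (for example to 0.1) treats weakly polar reviews as neutral, which may be preferable when the scores are noisy.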
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, in Unix time), reviewTime (time of the review, raw) and division (manually added categorical label generated using the overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
https://ai.stanford.edu/~amaas/data/sentiment
The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning.
The dataset contains an equal number of positive and negative reviews. Only highly polarizing reviews are considered: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. See the README file contained in the release for more details.
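The selection rules above can be expressed as a short filter. This is a sketch, not the dataset authors' code: the `(movie_id, score, text)` tuple layout and the function names are hypothetical, and keeping the first 30 polar reviews per movie is an assumption, since the description does not say how the 30 were chosen.

```python
from collections import defaultdict

MAX_REVIEWS_PER_MOVIE = 30


def label_for_score(score):
    """Return 'negative' for scores <= 4, 'positive' for scores >= 7, else None."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None  # scores of 5 and 6 are not polar enough to be included


def select_reviews(reviews):
    """Keep polar reviews, at most MAX_REVIEWS_PER_MOVIE per movie.

    `reviews` is an iterable of (movie_id, score, text) tuples (assumed layout).
    """
    kept = []
    per_movie = defaultdict(int)
    for movie_id, score, text in reviews:
        label = label_for_score(score)
        if label is None or per_movie[movie_id] >= MAX_REVIEWS_PER_MOVIE:
            continue
        per_movie[movie_id] += 1
        kept.append((text, label))
    return kept


sample = [("m1", 10, "loved it"), ("m1", 5, "it was fine"), ("m2", 2, "awful")]
print(select_reviews(sample))
```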
The data is split into a train (25k reviews) and test (25k reviews) set. A preview file cannot be provided - please download the data directly from the data provider's website.
When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
The "IMDB Dataset of 50K Movie Reviews" dataset is a tabular dataset with listings for 50k reviews from IMDB. There are two fields: "review", containing the review text, and "sentiment", containing either the value "positive" or the value "negative".
Using HQ Data Profiler, data quality issues in the original dataset were identified and fixed and this CLEANED version prepared.
HQ Data Profiler's comprehensive profile report showed that the original dataset contained 418 duplicated "review" values. All rows with duplicated review values were removed. The dataset was then balanced by randomly removing rows in the more populated sentiment category. Result: 24698 "positive" and 24698 "negative" reviews, with no duplicates.
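The deduplicate-then-balance procedure described above can be approximated in a few lines. This sketch does not reproduce HQ Data Profiler; the function name, the `keep_first` flag, and the sample rows are hypothetical. Whether one copy of each duplicated review was kept is not stated, so both readings are supported and the default is an assumption.

```python
import random
from collections import Counter


def clean_and_balance(rows, seed=0, keep_first=True):
    """Drop duplicated reviews, then downsample the larger sentiment class.

    keep_first=True keeps one copy of each duplicated review (an assumption);
    keep_first=False drops every row whose review text occurs more than once.
    """
    counts = Counter(row["review"] for row in rows)
    seen = set()
    unique = []
    for row in rows:
        text = row["review"]
        if counts[text] > 1 and (not keep_first or text in seen):
            continue
        seen.add(text)
        unique.append(row)

    by_label = {"positive": [], "negative": []}
    for row in unique:
        by_label[row["sentiment"]].append(row)

    # Randomly remove rows from the more populated class until both match.
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical sample rows with the two fields named in the description.
rows = [
    {"review": "great", "sentiment": "positive"},
    {"review": "great", "sentiment": "positive"},
    {"review": "loved it", "sentiment": "positive"},
    {"review": "fine", "sentiment": "positive"},
    {"review": "awful", "sentiment": "negative"},
    {"review": "boring", "sentiment": "negative"},
]
result = clean_and_balance(rows)
print(Counter(row["sentiment"] for row in result))
```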
Original dataset link (uncleaned): https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
Dataset citation ( https://ai.stanford.edu/~amaas/data/sentiment/ ): @InProceedings{maas-EtAl:2011:ACL-HLT2011, author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, title = {Learning Word Vectors for Sentiment Analysis}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {142--150}, url = {http://www.aclweb.org/anthology/P11-1015} }
Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
For details about the scraping process, explore the complete code repository on GitHub.
This dataset provides annual data for the most popular 500–600 movies per year from 1920 to 2025, extracted from IMDb. It includes over 60,000 movies, spanning more than 100 years of cinematic history. Each year’s data is divided into three CSV files for flexibility and ease of use:
- imdb_movies_[year].csv: Basic movie details.
- advanced_movies_details_[year].csv: Comprehensive metadata and financial details.
- merged_movies_data_[year].csv: A unified dataset combining both files.
imdb_movies_[year].csv: Essential movie information, including:
- Title: Movie title.
- Description: Movie Description.
- meta_score: IMDB's meta score.
- Movie Link: IMDb URL for the movie.
- Year: Year of release.
- Duration: Runtime (in minutes).
- MPA: Motion Picture Association rating (e.g., PG, R).
- Rating: IMDb rating (scale of 1–10).
- Votes: Total user votes on IMDb.
advanced_movies_details_[year].csv: Detailed movie metadata:
- Link: IMDb URL (for linking with other data).
- budget: Production budget (in USD).
- grossWorldWide: Global box office revenue.
- gross_US_Canada: North American box office earnings.
- opening_weekend_Gross: Opening weekend revenue.
- directors: List of directors.
- writers: List of writers.
- stars: Main cast members.
- genres: Movie genres.
- countries_origin: Countries of production.
- filming_locations: Primary filming locations.
- production_companies: Associated production companies.
- Languages: Languages spoken in the movie.
- Award_information: Information about awards, nominations and wins.
- release_date: Official release date.
merged_movies_data_[year].csv: A unified dataset combining all columns from the previous two files:
- Basic Details: Title, Year, Rating, Votes.
- Advanced Features: budget, grossWorldWide, directors, genres, and awards.
Template Columns:
- imdb_movies_[year].csv:
Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link
- advanced_movies_details_[year].csv:
link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages
- merged_movies_data_[year].csv:
Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages
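The merged file can be thought of as a join of the two per-year files on the movie's IMDb URL. The sketch below illustrates that idea with plain dictionaries; it assumes the join key is `Movie Link` in the basic file and `link` in the advanced file, and that unmatched rows are dropped. Neither detail is confirmed by the description, and the sample rows are hypothetical.

```python
def merge_on_link(basic_rows, advanced_rows):
    """Inner-join basic and advanced rows on the IMDb URL (assumed join key)."""
    advanced_by_link = {row["link"]: row for row in advanced_rows}
    merged = []
    for row in basic_rows:
        extra = advanced_by_link.get(row["Movie Link"])
        if extra is None:
            continue  # assumption: rows without advanced details are dropped
        combined = dict(row)
        # Copy every advanced column except the duplicate key column.
        combined.update({k: v for k, v in extra.items() if k != "link"})
        merged.append(combined)
    return merged


# Hypothetical sample rows using a subset of the template columns.
basic = [
    {"Title": "Example A", "Year": 2001, "Movie Link": "https://example.org/a"},
    {"Title": "Example B", "Year": 2002, "Movie Link": "https://example.org/b"},
]
advanced = [
    {"link": "https://example.org/a", "budget": 1000000, "genres": "Drama"},
]
for row in merge_on_link(basic, advanced):
    print(row)
```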
The dataset is updated annually in December to include the latest data.
This dataset is ideal for:
- Trend Analysis: Explore changes in the movie industry over more than a century.
- Predictive Modeling: Build models to forecast box office revenue, ratings, or awards.
- Recommendation Systems: Use attributes like genres, cast, and ratings for personalized recommendations.
- Comparative Analysis: Study differences across eras, genres, or regions.
https://crawlfeeds.com/privacy_policy
Amazon Prime TV Shows and Movies Dataset offered by Crawl Feeds is an extensive resource containing over 92,000 records in JSON format. This dataset encompasses a wide array of data points, including links, titles, descriptions, release dates, genres, posters, streaming platforms, countries, number of seasons, content ratings, IMDb ratings, cast and crew details, unique identifiers, and scraping timestamps. Such comprehensive information is invaluable for researchers, data analysts, and developers aiming to conduct in-depth analyses, develop recommendation systems, or explore trends within Amazon Prime's content library.
For those interested in broader media datasets, Crawl Feeds also offers the Movies and TV Shows Dataset, which includes 118,000 records, and the IMDb Movie Details Dataset, comprising 250,000 records. These datasets provide extensive information across various platforms, facilitating comparative studies and cross-platform analyses.
Integrating these datasets into your projects can significantly enhance the depth and quality of your analyses, providing a robust foundation for exploring various facets of the entertainment industry. Whether you're developing a new application, conducting market research, or performing academic studies, these datasets serve as a valuable resource for gaining insights into the dynamic world of streaming media.
Explore the Amazon Prime TV Shows and Movies Dataset and other related datasets on Crawl Feeds to elevate your data-driven projects.
This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in the files links.csv, movies.csv, ratings.csv, and tags.csv.
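As an illustration of how ratings.csv and movies.csv fit together, the sketch below computes a mean rating per movie. It uses small in-memory strings in place of the real files; the header names follow the ml-latest-small layout, while the data rows and titles are hypothetical samples.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-ins for ratings.csv and movies.csv; the rows are hypothetical.
RATINGS_CSV = """userId,movieId,rating,timestamp
1,1,4.0,964982703
2,1,5.0,964982224
1,2,3.0,964981247
"""
MOVIES_CSV = """movieId,title,genres
1,Example Movie A (2000),Comedy|Drama
2,Example Movie B (2001),Action
"""


def mean_rating_by_title(ratings_file, movies_file):
    """Join ratings to movies on movieId and average the ratings per title."""
    titles = {row["movieId"]: row["title"] for row in csv.DictReader(movies_file)}
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(ratings_file):
        totals[row["movieId"]] += float(row["rating"])
        counts[row["movieId"]] += 1
    return {titles[mid]: totals[mid] / counts[mid] for mid in totals}


means = mean_rating_by_title(io.StringIO(RATINGS_CSV), io.StringIO(MOVIES_CSV))
for title, mean in means.items():
    print(f"{title}: {mean:.2f}")
```

With the real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="", encoding="utf-8")`.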
This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.
License: This dataset is sourced from the GroupLens Research Group at the University of Minnesota. It is provided for non-commercial research and educational purposes only. License details can be found here under Usage License - https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html
Important:
Citation: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872
This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:
More reviews:
New reviews:
Metadata: - We have added transaction metadata for each review shown on the review page.
If you publish articles based on this dataset, please cite the following paper:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Explore the Netflix Titles dataset, featuring detailed insights on over 8,800 movies and TV shows. Ideal for content analysis, recommendation systems, and market research, covering genre trends, directors, cast, production countries, release years, and ratings.
https://choosealicense.com/licenses/unknown/
Dataset Summary
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.
https://cdla.io/sharing-1-0/
This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens.
Users were selected at random for inclusion. All users selected had rated at least 20 movies. Unlike previous MovieLens data sets, no demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in three files, movies.dat, ratings.dat, and tags.dat. Also included are scripts for generating subsets of the data to support the five-fold cross-validation of rating predictions. More details about the contents and use of all these files follow.
This and other GroupLens data sets are publicly available for download at GroupLens Data Sets.