50 datasets found
  1. imdb_reviews

    • tensorflow.org
    • kaggle.com
    Updated Sep 20, 2024
    Cite
    (2024). imdb_reviews [Dataset]. https://www.tensorflow.org/datasets/catalog/imdb_reviews
    Dataset updated
    Sep 20, 2024
    Description

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split, then print the first four examples.
    ds = tfds.load('imdb_reviews', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  2. IMDB Dataset of 50K Movie Reviews - CLEANED

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    HQ Data Profiler (2025). IMDB Dataset of 50K Movie Reviews - CLEANED [Dataset]. https://www.kaggle.com/datasets/hqdataprofiler/imdb-dataset-of-50k-movie-reviews-cleaned
    Available download formats: zip (26469422 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    HQ Data Profiler
    Description

    The "IMDB Dataset of 50K Movie Reviews" dataset is a tabular dataset with listings for 50k reviews from IMDB. There are two fields: "review", containing the review text, and "sentiment", containing either the value "positive" or the value "negative".

    Using HQ Data Profiler, data quality issues in the original dataset were identified and fixed, and this CLEANED version was prepared. HQ Data Profiler's comprehensive profile report showed that the original dataset contained 418 duplicated "review" values. All rows with duplicated review values were removed. The dataset was then balanced by randomly removing rows in the more populated sentiment category. Result: 24698 "positive" and 24698 "negative" reviews, with no duplicates.
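
    The cleaning steps described above (drop duplicated reviews, then randomly downsample the larger sentiment class) can be sketched in plain Python. This is only an illustration, not HQ Data Profiler's actual procedure: the function name `dedupe_and_balance` is hypothetical, and keeping the first occurrence of each duplicated review is an assumption, since the description does not say which copy, if any, was kept.

```python
import random
from collections import defaultdict


def dedupe_and_balance(rows, seed=0):
    """Drop repeated reviews, then downsample the larger sentiment class."""
    seen = set()
    unique_rows = []
    for row in rows:
        # Assumption: keep the first occurrence of each review text.
        if row["review"] not in seen:
            seen.add(row["review"])
            unique_rows.append(row)

    # Group the remaining rows by their sentiment label.
    by_label = defaultdict(list)
    for row in unique_rows:
        by_label[row["sentiment"]].append(row)

    # Randomly trim every class down to the size of the smallest one.
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Tiny made-up sample using the two fields named in the description.
sample = [
    {"review": "Loved it", "sentiment": "positive"},
    {"review": "Loved it", "sentiment": "positive"},
    {"review": "A delight", "sentiment": "positive"},
    {"review": "Dull", "sentiment": "negative"},
]
cleaned = dedupe_and_balance(sample)
```

    Fixing the seed makes the random downsampling repeatable, which helps when the balanced subset has to be regenerated later.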

    Original dataset link (uncleaned): https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

    Dataset citation ( https://ai.stanford.edu/~amaas/data/sentiment/ ): @InProceedings{maas-EtAl:2011:ACL-HLT2011, author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, title = {Learning Word Vectors for Sentiment Analysis}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {142--150}, url = {http://www.aclweb.org/anthology/P11-1015} }

  3. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (repository creator) (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Julie R. Campos Arias (repository creator)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
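
    Because the rows are stored with all negatives first and all positives last, a seeded shuffle is a simple way to interleave the classes before splitting or training. The sketch below uses a small made-up list in place of data_rt.csv; the helper name `shuffle_rows` and the seed value are illustrative assumptions.

```python
import random


def shuffle_rows(rows, seed=42):
    """Return a shuffled copy so the class-ordered rows are interleaved."""
    shuffled = list(rows)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy stand-in for data_rt.csv: negatives (0) first, then positives (1).
ordered = [("bad film %d" % i, 0) for i in range(5)]
ordered += [("good film %d" % i, 1) for i in range(5)]
mixed = shuffle_rows(ordered)
```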

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
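
    The description does not say how polarity scores were turned into the categorical division label. As a rough sketch, one common convention is to split on the sign of the score. The thresholds, the label names, and the `neutral_band` parameter below are all assumptions, not the dataset author's documented rule.

```python
def polarity_to_division(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a categorical label.

    Assumption: scores above the band are positive, scores below the
    negated band are negative, and anything in between is neutral.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Made-up polarity scores for illustration.
divisions = [polarity_to_division(p) for p in (0.6, -0.25, 0.0)]
```

    Widening `neutral_band` (for example to 0.1) sends weakly scored reviews to the neutral class instead of forcing them into positive or negative.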

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews, with star ratings, for the 10 latest (as of mid-2019) Bluetooth earphone devices. It is intended for learning how to train a machine learning model for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  4. Movie Reviews Word2Vec Embeddings Dataset

    • kaggle.com
    zip
    Updated Jan 17, 2023
    Cite
    The Devastator (2023). Movie Reviews Word2Vec Embeddings Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/movie-reviews-word2vec-embeddings-dataset
    Available download formats: zip (23254182 bytes)
    Dataset updated
    Jan 17, 2023
    Authors
    The Devastator
    Description

    Movie Reviews Word2Vec Embeddings Dataset

    Capturing Semantics in Textual Reviews

    By Jared Fernandez [source]

    About this dataset

    This dataset contains a collection of Word2Vec embeddings for nearly 12,000 movie reviews. These embeddings represent the reviews as numeric vectors, providing insight into the topics and trends present in the reviews. Using this data, researchers can gain a better understanding of language patterns that appear across various types of movie reviews. Additionally, the embeddings can be used to build or improve models for sentiment analysis and other natural language processing tasks. Each row includes the reviewer's unique ID along with their review text and the related word2vec embedding representing the textual relationships found in it.

    How to use the dataset

    • Download the dataset ‘Movie Reviews Word2Vec Embeddings’ from Kaggle.
    • The embeddings are of the word2vec type. Word2vec is a family of shallow neural-network models that learn dense vector representations of words from their context in a training corpus.
    • Before using these embeddings, it is important to understand what they represent and how you can match them with other datasets for analysis. The word2vec embeddings contain two columns: word (the specific word) and vec (the vector representation associated with that word).
    • To use the text corpus effectively, first extract meaningful information from it, such as sentiment ratings or the topics that appear most frequently in movie reviews. Sorting through a large number of reviews requires automated processing, either with machine learning algorithms or with natural language processing that determines sentiment polarity and extracts relevant keywords or topics for each review.
    • You can also combine the pre-processed word vectors (embeddings) with supervised or unsupervised approaches, such as logistic regression or BERT models, to create features for sentiment scoring or topic modelling (classifying texts into distinct categories). These features may be useful for predictive analysis, such as predicting movie ratings from user reviews.
    • Once you have used the pre-processed data, you can improve your model further by examining how words relate to one another through their vectors, for example with a cosine similarity measurement. Cosine similarity indicates how related two words are, which gives additional insight into relationships among text fragments or paragraphs and helps your model capture contextual relationships when analysing movie review corpora.
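
    Cosine similarity between two word vectors is their dot product divided by the product of their lengths. The sketch below computes it with the standard library only. The three-dimensional vectors are invented for illustration; real word2vec vectors from the vec column would be much longer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors standing in for (word, vec) rows.
vectors = {
    "great": [0.9, 0.1, 0.3],
    "excellent": [0.8, 0.2, 0.4],
    "boring": [-0.7, 0.5, 0.1],
}
close = cosine_similarity(vectors["great"], vectors["excellent"])
far = cosine_similarity(vectors["great"], vectors["boring"])
```

    Values near 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.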

    Research Ideas

    • Automatically clustering movies with similar sentiment and themes.
    • Automatically generating movie plot summaries based on sentiment analysis of reviews.
    • Developing a movie recommendation system based on users’ preference in different genres or topics related to the movie in question

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Jared Fernandez.

    License

    License: Dataset copyright by authors

    You are free to:
    • Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt - remix, transform, and build upon the material for any purpose, even commercially.

    You must:
    • Give appropriate credit - Provide a link to the license, and indicate if changes were made.
    • ShareAlike - You must distribute your contributions under the same license as the original.
    • Keep intact - all notices that refer to this license, including copyright notices.


  5. IMDb Movie Reviews Dataset

    • berd-platform.de
    bin
    Updated Jul 31, 2025
    Cite
    Andrew L. Maas; Raymond E. Daly; Peter T. Pham; Dan Huang; Andrew Y. Ng; Christopher Potts (2025). IMDb Movie Reviews Dataset [Dataset]. http://doi.org/10.82939/z8gxk-w3567
    Available download formats: bin
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Stanford University
    Authors
    Andrew L. Maas; Raymond E. Daly; Peter T. Pham; Dan Huang; Andrew Y. Ng; Christopher Potts
    License

    https://ai.stanford.edu/~amaas/data/sentiment

    Description

    The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning.

    The dataset contains an equal number of positive and negative reviews. Only highly polarizing reviews are considered: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. See the README file contained in the release for more details.
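
    The selection rules above (score ≤ 4 is negative, score ≥ 7 is positive, at most 30 reviews per movie) can be expressed as a short filter. This is a sketch only: the function names and the (movie_id, score, text) tuple layout are hypothetical, and keeping the first 30 qualifying reviews per movie is an assumption, since the description does not say how the 30 were chosen.

```python
from collections import Counter


def label_from_score(score):
    """Return the sentiment label for a 1-10 score, or None if not polar."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None  # scores of 5 and 6 are excluded


def select_reviews(reviews, max_per_movie=30):
    """Keep polar reviews, capped per movie (assumption: first ones win)."""
    kept = []
    per_movie = Counter()
    for movie_id, score, text in reviews:
        label = label_from_score(score)
        if label is None or per_movie[movie_id] >= max_per_movie:
            continue
        per_movie[movie_id] += 1
        kept.append((movie_id, text, label))
    return kept


# Made-up input: 35 positive reviews of one movie plus one mid-range review.
raw = [("m1", 9, "review %d" % i) for i in range(35)] + [("m2", 5, "meh")]
selected = select_reviews(raw)
```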

    The data is split into a train (25k reviews) and test (25k reviews) set. A preview file cannot be provided - please download the data directly from the data provider's website.

    When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

  6. IMDB 5000 Movie Dataset

    • kaggle.com
    zip
    Updated Dec 16, 2017
    Cite
    Yueming (2017). IMDB 5000 Movie Dataset [Dataset]. https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset
    Available download formats: zip (567524 bytes)
    Dataset updated
    Dec 16, 2017
    Authors
    Yueming
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Dataset

    This dataset was created by Yueming

    Released under Database: Open Database, Contents: Database Contents

  7. IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage)

    • crawlfeeds.com
    csv, zip
    Updated Nov 9, 2025
    Cite
    Crawl Feeds (2025). IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage) [Dataset]. https://crawlfeeds.com/datasets/imdb-movies-metadata-dataset-4-5m-records-global-coverage
    Available download formats: csv, zip
    Dataset updated
    Nov 9, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Unlock one of the most comprehensive movie datasets available—4.5 million structured IMDb movie records, extracted and enriched for data science, machine learning, and entertainment research.

    This dataset includes a vast collection of global movie metadata, including details on title, release year, genre, country, language, runtime, cast, directors, IMDb ratings, reviews, and synopsis. Whether you're building a recommendation engine, benchmarking trends, or training AI models, this dataset is designed to give you deep and wide access to cinematic data across decades and continents.

    Perfect for use in film analytics, OTT platforms, review sentiment analysis, knowledge graphs, and LLM fine-tuning, the dataset is cleaned, normalized, and exportable in multiple formats.

    What’s Included:

    • Genres: Drama, Comedy, Horror, Action, Sci-Fi, Documentary, and more

    • Delivery: Direct download

    Use Cases:

    • Train LLMs or chatbots on cinematic language and metadata

    • Build or enrich movie recommendation engines

    • Run cross-lingual or multi-region film analytics

    • Benchmark genre popularity across time periods

    • Power academic studies or entertainment dashboards

    • Feed into knowledge graphs, search engines, or NLP pipelines

  8. Amazon review data 2018

    • cseweb.ucsd.edu
    • nijianmo.github.io
    Cite
    UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Context

    This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

    • More reviews:

      • The total number of reviews is 233.1 million (142.8 million in 2014).
    • New reviews:

      • Current data includes reviews in the range May 1996 - Oct 2018.
    • Metadata:

      • We have added transaction metadata for each review shown on the review page.
      • Added more detailed metadata of the product landing page.

    Acknowledgements

    If you publish articles based on this dataset, please cite the following paper:

    • Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.
  9. IMDb Dataset (2024) updated

    • kaggle.com
    zip
    Updated Jul 6, 2024
    Cite
    Parth (2024). IMDb Dataset (2024) updated [Dataset]. https://www.kaggle.com/datasets/parthdande/imdb-dataset-2024-updated
    Available download formats: zip (335942 bytes)
    Dataset updated
    Jul 6, 2024
    Authors
    Parth
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains detailed information about movies listed on IMDb, including titles, genres, release dates, and ratings. It also includes user reviews and ratings, making it a useful resource for sentiment analysis and trend analysis in the movie industry. The dataset can be used to gain insights into movie trends, audience preferences, and the correlation between movie attributes and ratings. The second file has an additional feature called poster_src, which is a link to the movie's poster image. The second file is larger than the first and covers a wider range of movies.

  10. Full TMDB Movies Dataset 2024 (1M Movies)

    • kaggle.com
    zip
    Updated Nov 11, 2025
    Cite
    asaniczka (2025). Full TMDB Movies Dataset 2024 (1M Movies) [Dataset]. https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies
    Available download formats: zip (239404730 bytes)
    Dataset updated
    Nov 11, 2025
    Authors
    asaniczka
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    The TMDb (The Movie Database) is a comprehensive movie database that provides information about movies, including details like titles, ratings, release dates, revenue, genres, and much more.

    This dataset contains a collection of 1,000,000 movies from the TMDB database.

    The dataset is updated daily.

    Interesting Task Ideas:

    1. Predict movie ratings based on features such as revenue, popularity, genre, and runtime.
    2. Identify trends in movie release dates and analyze their impact on revenue.
    3. Analyze the relationship between budget, revenue, and popularity to determine factors that contribute to a movie's success.
    4. Build a recommendation system that suggests similar movies based on genres, production companies, and language.
    5. Perform sentiment analysis on movie reviews to understand audience reactions.
    6. Explore the impact of movie genres on popularity and revenue.
    7. Investigate the correlation between runtime and audience engagement.
    8. Identify successful production companies and analyze their strategies.
    9. Utilize natural language processing techniques to extract meaningful insights from movie overviews.
    10. Visualize movie popularity over time and identify popular genres in different periods.

  11. IMDb Reviews Dataset of Spider-Man: No Way Home Film

    • dataverse.telkomuniversity.ac.id
    tsv
    Updated Apr 13, 2022
    Cite
    Telkom University Dataverse (2022). IMDb Reviews Dataset of Spider-Man: No Way Home Film [Dataset]. http://doi.org/10.34820/FK2/BUS4WO
    Available download formats: tsv (342144)
    Dataset updated
    Apr 13, 2022
    Dataset provided by
    Telkom University Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset is used in the paper entitled "The Sentiment Analysis of Spider-Man: No Way Home Film Based on IMDb Reviews". Download full paper at http://jurnal.iaii.or.id/index.php/RESTI/article/view/3851.

  12. IMDB Movie Ratings Dataset

    • kaggle.com
    zip
    Updated Jan 17, 2023
    Cite
    The Devastator (2023). IMDB Movie Ratings Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/imdb-movie-ratings-dataset
    Available download formats: zip (319960 bytes)
    Dataset updated
    Jan 17, 2023
    Authors
    The Devastator
    Description

    IMDB Movie Ratings Dataset

    Evaluating Directors, Actors, Genres, and Movie Titles

    By Himanshu Sekhar Paul [source]

    About this dataset

    This inspiring IMDB Movie Dataset is a comprehensive database of movie ratings, featuring director_name, duration, actor_2_name, genres, actor_1_name, movie title and more. Whether you're a fan of dramatic thrillers or nostalgic '90s classics from our childhoods; here you'll find information about the most voted movies from users across the world. Delve into num_voted_users trends and discover the language each movie was released in to craft your very own personal film library of country-specific titles released in any given year. With this dataset at your disposal comparing imdb scores will never be easier! Who will come out top when the votes have been tallied? Dive into data for a journey unparalleled!

    How to use the dataset

    This dataset offers a comprehensive overview of the movie ratings from IMDB. It includes data about director name, duration, actors, genres, movie title, number of votes, language, country of origin, year released and IMDB score.

    To use this dataset to get a deeper understanding of how movies are rated on IMDB you can take the following steps:

    • Look through each column of the data to get an overall understanding. This will help you identify trends or correlations that you can analyze further in later steps.
    • Explore relationships between columns such as 'Number Voted Users' and 'IMDB Score'. Looking at how these numbers relate to each other can help you better understand rating trends on IMDB.
    • Analyze how particular sub-groups perform within categories such as genre or country. This could provide insight into preferences for certain types of movies, or reveal countries with higher associated scores than others.
    • Use your analysis to answer questions about specific groups on IMDB. Are there distinct preferences among age groups when it comes to what they watch? Are there clear correlations between rating and genre within certain countries?

    By working through the steps above and taking an initial 'big picture' view before diving into more detailed analysis, users should be able to find value in this dataset by uncovering useful insights about movie ratings on IMDB.
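
    One way to explore the relationship between the number of voting users and the IMDB score, as suggested in the steps above, is a Pearson correlation coefficient. The sketch below implements it with the standard library. The vote counts and scores are invented for illustration and are not taken from movie_data.csv.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented values standing in for the num_voted_users and score columns.
votes = [1200, 54000, 310000, 8700, 95000]
scores = [5.1, 6.8, 8.4, 6.0, 7.3]
r = pearson(votes, scores)
```

    A coefficient close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear association. Vote counts tend to be heavily skewed, so a rank-based measure may be worth comparing as well.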

    Research Ideas

    • Movie Recommendation System: The dataset can be used to build a movie recommendation system using machine learning algorithms like k-nearest neighbors or collaborative filtering. Based on the user's past ratings, the system can suggest relevant movies with similar genres, actors and directors.
    • Movie Popularity Index: Using the data, a metric could be designed that provides an overall popularity index for movies released over the years. This index could be constructed from factors such as IMDb score, number of votes, and number of reviews collected.
    • Genre-based Over/Under Performance Analysis: Based on the genres of the movies released each year, this dataset can provide insight into which genres are performing well and which are not. This kind of analysis could inform decisions about allocating resources to production budgets or marketing campaigns for upcoming films in different genres across different regions or markets.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    See the dataset description for more information.

    Columns

    File: movie_data.csv

    | Column name          | Description                                        |
    |:---------------------|:---------------------------------------------------|
    | director_name        | Name of the director of the movie. (String)        |
    | duration             | Length of the movie in minutes. (Integer)          |
    | actor_2_name         | Name of the second actor in the movie. (String)    |
    | genres               | Genre of the movie. (String)                       |
    | actor_1_name         | Name of the first actor in the movie. (String)     |
    | movie_title          | Title of the movie. (String)                       |
    | num_voted_users      | Number of users who voted for the movie. (Integer) |
    | actor_3_name         | Name of the third actor in the movie. (String)     |
    | movie_imdb_link      | Link to the movie's IMDB page. (String)            |
    | num_user_for_reviews |...

  13. Real Movies Dataset

    • kaggle.com
    zip
    Updated Feb 9, 2024
    Cite
    Harshit Sharma (2024). Real Movies Dataset [Dataset]. https://www.kaggle.com/datasets/harshitstark/real-movies-dataset
    Available download formats: zip (104062 bytes)
    Dataset updated
    Feb 9, 2024
    Authors
    Harshit Sharma
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The "Real Movies Dataset" offers a comprehensive repository of diverse movie information, facilitating in-depth analysis and meaningful comparisons across various cinematic attributes. With its wealth of key details, this dataset serves as an invaluable resource for researchers, enthusiasts, and industry professionals alike.

    Each entry in the dataset includes the following attributes:

    • Movie Name: The title of the movie.
    • Year of Release: The year in which the movie was officially released to the public.
    • Watch Time: The duration of the movie in terms of hours and minutes, indicating the length of time required to watch the entire film.
    • Movie Rating: This refers to the rating assigned to the movie based on various criteria such as content, suitability for different age groups, and overall quality. Ratings could be numerical (e.g., out of 10).
    • Meatscore of Movie: This is a unique metric that represents the "meatiness" or substance of the movie. It might be a score assigned based on the complexity of the plot, character development, thematic depth, or other qualitative aspects.
    • Votes: The number of votes or ratings received by the movie from viewers or critics. This metric provides an indication of the movie's popularity or reception.
    • Gross: The total box office gross earnings generated by the movie, typically measured in a specific currency (e.g., USD). This metric reflects the commercial success of the film.
    • Description: The dataset includes a brief description field providing a summary or overview of the movie's plot, genre, themes, or notable aspects. This description offers context and insight into the content and style of each film, aiding in understanding and analysis.

    Overall, the "Real Movies Dataset" serves as a valuable resource for researchers, analysts, and enthusiasts interested in exploring and studying the dynamics of the film industry, including trends in movie production, audience preferences, and financial performance.

  14. IMDB Movies From 1920 to 2025

    • kaggle.com
    zip
    Updated Mar 27, 2025
    Cite
    Raed Addala (2025). IMDB Movies From 1920 to 2025 [Dataset]. https://www.kaggle.com/datasets/raedaddala/imdb-movies-from-1960-to-2023
    Explore at:
    zip(46688739 bytes)Available download formats
    Dataset updated
    Mar 27, 2025
    Authors
    Raed Addala
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Over 60,000 Movies, 100+ Years of Data, and Rich Metadata!

    Links:

    For details about the scraping process, explore the complete code repository on GitHub.

    About the Dataset

    This dataset provides annual data for the most popular 500–600 movies per year from 1920 to 2025, extracted from IMDb. It includes over 60,000 movies, spanning more than 100 years of cinematic history. Each year’s data is divided into three CSV files for flexibility and ease of use:
    - imdb_movies_[year].csv: Basic movie details.
    - advanced_movies_details_[year].csv: Comprehensive metadata and financial details.
    - merged_movies_data_[year].csv: A unified dataset combining both files.

    File Descriptions

    1. imdb_movies_[year].csv

    Essential movie information, including:
    - Title: Movie title.
    - Description: Movie description.
    - meta_score: IMDb's meta score.
    - Movie Link: IMDb URL for the movie.
    - Year: Year of release.
    - Duration: Runtime (in minutes).
    - MPA: Motion Picture Association rating (e.g., PG, R).
    - Rating: IMDb rating (scale of 1–10).
    - Votes: Total user votes on IMDb.

    2. advanced_movies_details_[year].csv

    Detailed movie metadata:
    - Link: IMDb URL (for linking with other data).
    - budget: Production budget (in USD).
    - grossWorldWide: Global box office revenue.
    - gross_US_Canada: North American box office earnings.
    - opening_weekend_Gross: Opening weekend revenue.
    - directors: List of directors.
    - writers: List of writers.
    - stars: Main cast members.
    - genres: Movie genres.
    - countries_origin: Countries of production.
    - filming_locations: Primary filming locations.
    - production_companies: Associated production companies.
    - Languages: Languages spoken in the movie.
    - Award_information: Information about awards, nominations and wins.
    - release_date: Official release date.

    3. merged_movies_data_[year].csv

    A unified dataset combining all columns from the previous two files:
    - Basic Details: Title, Year, Rating, Votes.
    - Advanced Features: budget, grossWorldWide, directors, genres, and awards.

    Data Structure

    Template Columns:
    - imdb_movies_[year].csv:
    Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link

    • advanced_movies_details_[year].csv:
      link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages

    • merged_movies_data_[year].csv:
      Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages
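The relationship between the three per-year files can be sketched in Python as a join on the IMDb URL. This is only an illustration, not the author's scraping or merging code: it assumes that the `Movie Link` column of the basic file and the `link` column of the advanced file hold identical URL strings, and the sample rows and `example.org` URLs below are hypothetical stand-ins for the real CSV contents.

```python
import csv
import io


def merge_on_link(basic_rows, advanced_rows):
    """Left-join basic rows to advanced rows on the IMDb URL.

    Assumption: 'Movie Link' (basic file) and 'link' (advanced file)
    contain the same URL string for the same movie.
    """
    advanced_by_link = {row["link"]: row for row in advanced_rows}
    merged = []
    for row in basic_rows:
        extra = advanced_by_link.get(row["Movie Link"], {})
        combined = dict(row)
        # Drop the duplicate join key; the merged template keeps only 'Movie Link'.
        combined.update({k: v for k, v in extra.items() if k != "link"})
        merged.append(combined)
    return merged


# Hypothetical in-memory stand-ins for imdb_movies_[year].csv and
# advanced_movies_details_[year].csv (only a few columns shown).
basic_csv = io.StringIO(
    "Title,Year,Rating,Movie Link\n"
    "Example Film,1999,7.5,https://example.org/title/1\n"
    "Another Film,1999,6.1,https://example.org/title/2\n"
)
advanced_csv = io.StringIO(
    "link,budget,genres\n"
    "https://example.org/title/1,1000000,Drama\n"
)

basic_rows = list(csv.DictReader(basic_csv))
advanced_rows = list(csv.DictReader(advanced_csv))
merged_rows = merge_on_link(basic_rows, advanced_rows)

for row in merged_rows:
    print(row)
```

A left join keeps every movie from the basic file even when no advanced details were found for it; whether the published merged files follow the same rule is not stated in the description.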

    Updates

    The dataset is updated annually in December to include the latest data.

    Applications

    This dataset is ideal for:
    - Trend Analysis: Explore changes in the movie industry over more than a century.
    - Predictive Modeling: Build models to forecast box office revenue, ratings, or awards.
    - Recommendation Systems: Use attributes like genres, cast, and ratings for personalized recommendations.
    - Comparative Analysis: Study differences across eras, genres, or regions.

    Dataset Features

    • Over 60,000 Movies: Detailed data from 1920 to 2025.
    • Rich Metadata: Financial, creative, and recognition-related attributes.
    • User-friendly: Modular files for tailored use or comprehensive merged files.
    • Consistency: Uniform structure enables seamless analysis.

    Notes

    • For issues, suggestions, or feature requests, please feel free to contact me: send me an email or open an issue on GitHub. Your input is highly appreciated.

  15. IMDB Movies Analysis - SQL

    • kaggle.com
    zip
    Updated Feb 21, 2023
    Cite
    Gaurav B R (2023). IMDB Movies Analysis - SQL [Dataset]. https://www.kaggle.com/datasets/gauravbr/imdb-movies-data-erd
    Explore at:
    zip(3818401 bytes)Available download formats
    Dataset updated
    Feb 21, 2023
    Authors
    Gaurav B R
    Description

    SQL IMDB Movies Analysis for RSVP (Film Production Company)

    RSVP Movies is an Indian film production company which has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.

    The production company wants to plan every move analytically, based on data. We took the last three years of IMDb movie data and analysed it using SQL, drawing meaningful insights that could help them start their new project.

    For our convenience, the entire analytics process has been divided into four segments, where each segment leads to significant insights from different combinations of tables. The questions in each segment with business objectives are written in the script given below. We have written the solution code below every question.

  16. Movie Reviews Dataset

    • kaggle.com
    zip
    Updated Jan 30, 2023
    Cite
    czyzi0 (2023). Movie Reviews Dataset [Dataset]. https://www.kaggle.com/datasets/czyzi0/movie-reviews-dataset
    Explore at:
    zip(161459498 bytes)Available download formats
    Dataset updated
    Jan 30, 2023
    Authors
    czyzi0
    Description

    Description: This dataset contains movie reviews and their sentiment labels. All texts were scraped from various websites on the Internet in 2020. Reviews are available in a few languages: cs, de, es, fr, pl, sk. A split into training and testing data is provided. There are three sentiment labels:
    - pos - for positive sentiment,
    - neg - for negative sentiment,
    - n\a - not assigned, can be used for some unsupervised learning.
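As a rough illustration of how the three labels might be used, the Python sketch below separates labelled reviews (for supervised training) from unassigned ones (for unsupervised learning). The `(text, label)` tuple layout and the sample reviews are hypothetical; the description does not specify the file format or field names.

```python
# Label values as listed in the dataset description.
LABELED = ("pos", "neg")
UNASSIGNED = "n\\a"  # backslash escaped so Python does not read "\a" as a bell character


def split_labeled(examples):
    """Split (text, label) pairs into labelled pairs and unlabelled texts."""
    labeled = [(text, label) for text, label in examples if label in LABELED]
    unlabeled = [text for text, label in examples if label == UNASSIGNED]
    return labeled, unlabeled


# Hypothetical sample data; real reviews are in cs, de, es, fr, pl, or sk.
examples = [
    ("Skvělý film.", "pos"),
    ("Langweilig und zu lang.", "neg"),
    ("Una película sobre un viaje.", "n\\a"),
]

labeled, unlabeled = split_labeled(examples)
print(len(labeled), "labelled,", len(unlabeled), "unlabelled")
```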

    Distribution of training data: [figure not reproduced]

    Distribution of testing data: [figure not reproduced]

    License and copyright: The Movie Reviews Dataset is distributed under the CC BY-NC 4.0. The copyright remains with the original owners of the texts.

    Notice and take down policy: Should you consider that the data contains material that is owned by you and should therefore not be reproduced here, please:
    - Identify yourself, with contact data such as an email address at which you can be contacted.
    - Identify the copyrighted work claimed to be infringed.
    - Identify the material that is claimed to be infringing and information reasonably sufficient to allow me to locate the material.
    - Send the request to me.

    I will comply with legitimate requests by removing the affected sources from the corpus.

    I've collected these reviews for scientific purposes. It has been more than 2 years since the publication date of any of these reviews, which is why I've decided to share this collection. This way other people will also be able to use it for educational purposes.

  17. R

    Football Game Film Angle Dataset

    • universe.roboflow.com
    zip
    Updated Jul 17, 2024
    Cite
    Football Analysis (2024). Football Game Film Angle Dataset [Dataset]. https://universe.roboflow.com/football-analysis-fm44i/football-game-film-angle
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 17, 2024
    Dataset authored and provided by
    Football Analysis
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Film Angles
    Description

    Football Game Film Angle

    ## Overview
    
    Football Game Film Angle is a dataset for classification tasks - it contains Film Angles annotations for 595 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  18. Game of Thrones - A naturalistic viewing dataset

    • openneuro.org
    Updated Nov 16, 2023
    Cite
    Kira Noad; David Watson; Timothy Andrews (2023). Game of Thrones - A naturalistic viewing dataset [Dataset]. http://doi.org/10.18112/openneuro.ds004848.v1.0.0
    Explore at:
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Kira Noad; David Watson; Timothy Andrews
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Game of Thrones - A naturalistic viewing dataset

    Overview

    This dataset contains fMRI movie-watching and category localiser data in 28 developmental prosopagnosics and 45 neurologically healthy controls. Participants are additionally grouped by their familiarity with the Game of Thrones television series.

    In movie-watching scans, participants passively viewed a series of short audiovisual clips (ranging from 50 to 117 s duration; total duration = 12 min 58 s) taken from the Game of Thrones television series.

    In category localiser scans, participants viewed images of faces, scenes, and phase scrambled versions of the face images. These can be used to define face and scene selective regions of interest.

    Please refer to the following paper when using this dataset:

    Noad, K., Watson, D.M., Andrews, T.J. (In review). Natural viewing reveals an extended network of regions for familiar faces that is disrupted in developmental prosopagnosia.

    Data Contents

    • participants.tsv - List of subject IDs in control and developmental prosopagnosic groups, along with whether they were familiar or unfamiliar with Game of Thrones.

    • slice_timings.tsv, fsl_slice_timings.txt - Slice timings for functional scans. The TSV file gives the times in milliseconds, and the text file gives the times in normalised units of the TR suitable for entering into FEAT.

      Scans were acquired with the HCP/CMRR multiband sequence. More information on slice timings can be found at: https://wiki.humanconnectome.org/download/attachments/40534057/CMRR_MB_Slice_Order.pdf

    • behavioural_measures.tsv - Scores on PI20, CFMT, and Game of Thrones quiz tasks (see below for more details). PI20 scores are out of 100. CFMT scores are given as percentage accuracies. Quiz scores are given as percentage accuracies over all questions as well as broken down by face, scene, and narrative questions.

    • Subject Directories - MRI data directories for each subject:

      • anat - T1 anatomical images
      • fmap - Magnitude and phase difference fieldmap images
      • func - Movie-watching (Game of Thrones) and category localiser data

    Behavioural Measures

    We provide two measures of face processing ability (PI20 and CFMT) and a quiz assessing familiarity with the Game of Thrones TV series. All participants completed the Game of Thrones quiz, and all developmental prosopagnosics completed the PI20 and CFMT assessments. Approximately half of the control subjects also completed the CFMT.

    • PI20 - 20-item prosopagnosia index, used as initial screening for developmental prosopagnosia. All developmental prosopagnosic participants completed this.

      Reference: Shah et al. (2015), Royal Society Open Science, 2(150305), 1-6.

    • CFMT - Cambridge Face Memory Test, used as secondary screening for developmental prosopagnosia. All developmental prosopagnosic participants and approximately half of the control participants completed this.

      Reference: Duchaine & Nakayama (2006), Neuropsychologia, 44(4), 576-585.

    • Game of Thrones Quiz - We developed this quiz to assess familiarity with the Game of Thrones television series. All participants completed this quiz. The quiz comprised 3 types of questions:

      • Face questions presented participants with a picture of a character, and participants had to provide a name or some defining biographical information for that character (e.g., "Jon Snow").
      • Scene questions similarly presented participants with a picture of a scene from the show, and participants had to name or provide some details of the location (e.g., "King's Landing").
      • Narrative questions were 4-option multiple choice questions about key elements of the Game of Thrones story. For example, "Which character was Lord of Winterfell and was beheaded at the end of Season 1 - A) Daenerys Targaryen, B) Jon Snow, C) Ned Stark, or D) Tyrion Lannister?"

    Notes

    • sub-DP15 is missing category localiser and fieldmap scans due to time constraints during the scanning session.

  19. 🎬📽️The MOTHER OF ALL MOVIE REVIEW DATASETS

    • kaggle.com
    zip
    Updated Jul 17, 2024
    Cite
    BwandoWando (2024). 🎬📽️The MOTHER OF ALL MOVIE REVIEW DATASETS [Dataset]. https://www.kaggle.com/datasets/bwandowando/rotten-tomatoes-9800-movie-critic-and-user-reviews/code
    Explore at:
    zip(4253772949 bytes)Available download formats
    Dataset updated
    Jul 17, 2024
    Authors
    BwandoWando
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Context

    The MOTHER OF ALL MOVIE REVIEW DATASETS for all your NLP, research, and learning needs!

    Contents

    • 10500 movies
    • 56M+ user reviews!
    • 1M+ critic reviews!
    • Movies from the early 1900s to 2024 can be found here!
    • English, French, Japanese, Hindi, and many more movies!
    • varying movies from very bad to blockbusters!
    • (and many more!)

    Possible Usages

    • NLP
    • Sentiment Analysis
    • Topic Modelling
    • Research
    • Studying
    • Visualizations
    • (and many more!)

    Collection Methodology

    I wrote my own scripts to get data from Rotten Tomatoes

    Image

    Generated with Bing Image Generator

    Note

    I'm looking forward to the community creating and generating analyses, content, and insights from this MOTHER OF ALL MOVIE REVIEW DATASETS! @bwandowando

  20. s

    SEHI (Secondary Electron Hyperspectral Imaging) dataset of Metal alloy and...

    • orda.shef.ac.uk
    zip
    Updated Aug 19, 2025
    Cite
    Jingqiong Zhang; James Nohl; Nicholas Farr; Cornelia Rodenburg; Kerry Abrams; Kate Black; Lyudmila Mihaylova (2025). SEHI (Secondary Electron Hyperspectral Imaging) dataset of Metal alloy and Carbon film (Palladium Silver Carbon complex film) [Dataset]. http://doi.org/10.15131/shef.data.22202923.v2
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 19, 2025
    Dataset provided by
    The University of Sheffield
    Authors
    Jingqiong Zhang; James Nohl; Nicholas Farr; Cornelia Rodenburg; Kerry Abrams; Kate Black; Lyudmila Mihaylova
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    This data repository can be used as benchmark data for material characterization, particularly for investigating nanostructures and chemical properties in materials using SEHI (Secondary Electron Hyperspectral Imaging), as well as for research in Scanning Electron Microscopy, Secondary Electron (SE) spectroscopy, and advanced image processing and data analysis (computer vision and machine learning) techniques.

    This work is supported by the UK EPSRC grant EP/V012126/1, "SEE MORE, MAKE MORE: Secondary Electron Energy Measurement Optimisation for Reliable Manufacturing of Key Materials". Contact: SM3 (SEE MORE MAKE MORE) project PI, Professor Cornelia Rodenburg, c.rodenburg@shefield.ac.uk. We also acknowledge the support from the Insigneo Institute for In Silico Medicine in Sheffield.

    The complex metal alloy (palladium silver, abbreviated as Pd-Ag) and carbon films were printed by the University of Liverpool, and a Helios Nanolab G3 UC microscope was used to acquire the raw image stacks [1]. More information on the sample preparation and experimental conditions can be found in [1]. This dataset contains four processed SEHI stacks (cropped and aligned) collected from different regions of interest, and the associated metadata.

    [1] Abrams, K.J., Dapor, M., Stehling, N., Azzolini, M., Kyle, S.J., Schäfer, J., Quade, A., Mika, F., Kratky, S., Pokorna, Z., et al., 2019. Making sense of complex carbon and metal/carbon systems by secondary electron hyperspectral imaging. Advanced Science 6, 1900719.
