76 datasets found
  1. imdb_reviews

    • tensorflow.org
    • kaggle.com
    Updated Sep 20, 2024
    Cite
    (2024). imdb_reviews [Dataset]. https://www.tensorflow.org/datasets/catalog/imdb_reviews
    Dataset updated
    Sep 20, 2024
    Description

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imdb_reviews', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  2. Full TMDB Movies Dataset 2024 (1M Movies)

    • kaggle.com
    zip
    Updated Nov 11, 2025
    Cite
    asaniczka (2025). Full TMDB Movies Dataset 2024 (1M Movies) [Dataset]. https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies
    Available download formats: zip (239404730 bytes)
    Dataset updated
    Nov 11, 2025
    Authors
    asaniczka
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    The TMDb (The Movie Database) is a comprehensive movie database that provides information about movies, including details like titles, ratings, release dates, revenue, genres, and much more.

    This dataset contains a collection of 1,000,000 movies from the TMDB database.

    Dataset is updated daily. If you find this dataset valuable, don't forget to hit the upvote button! 😊💝

    Interesting Task Ideas:

    1. Predict movie ratings based on features such as revenue, popularity, genre, and runtime.
    2. Identify trends in movie release dates and analyze their impact on revenue.
    3. Analyze the relationship between budget, revenue, and popularity to determine factors that contribute to a movie's success.
    4. Build a recommendation system that suggests similar movies based on genres, production companies, and language.
    5. Perform sentiment analysis on movie reviews to understand audience reactions.
    6. Explore the impact of movie genres on popularity and revenue.
    7. Investigate the correlation between runtime and audience engagement.
    8. Identify successful production companies and analyze their strategies.
    9. Utilize natural language processing techniques to extract meaningful insights from movie overviews.
    10. Visualize movie popularity over time and identify popular genres in different periods.
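
    As a rough illustration of task idea 3, the sketch below computes a Pearson correlation between budget and revenue. The column names "budget" and "revenue", the toy rows, and the choice to treat zero values as unreported are all assumptions made for the example, not facts taken from the dataset.

```python
import csv
import io
import math

# Toy rows for illustration only; the column names "budget" and "revenue"
# are assumptions and are not confirmed by the listing.
SAMPLE_CSV = """title,budget,revenue
A,10,30
B,20,50
C,30,70
D,0,0
"""


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def budget_revenue_correlation(csv_text):
    """Correlate budget with revenue, skipping rows with zero values."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        budget = float(row["budget"])
        revenue = float(row["revenue"])
        # Assumption: a zero means "not reported", so the row is skipped.
        if budget > 0 and revenue > 0:
            pairs.append((budget, revenue))
    budgets, revenues = zip(*pairs)
    return pearson(budgets, revenues)


r = budget_revenue_correlation(SAMPLE_CSV)
```

    On the real file, the same function could be fed the CSV text after checking the actual header names.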

    Check out my other datasets:

    Clash of Clans Clans Dataset 2023 (3.5M Clans)

    Black-White Wage Gap in the USA Dataset

    130K Kindle Books

    USA Unemployment Rates by Demographics & Race

    150K TMDb TV Shows


  3. IMDB Movie Ratings Dataset

    • kaggle.com
    zip
    Updated Jan 17, 2023
    Cite
    The Devastator (2023). IMDB Movie Ratings Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/imdb-movie-ratings-dataset
    Available download formats: zip (319960 bytes)
    Dataset updated
    Jan 17, 2023
    Authors
    The Devastator
    Description

    IMDB Movie Ratings Dataset

    Evaluating Directors, Actors, Genres, and Movie Titles

    By Himanshu Sekhar Paul [source]

    About this dataset

    This IMDB movie dataset is a comprehensive database of movie ratings, featuring director_name, duration, actor_2_name, genres, actor_1_name, movie title, and more. From dramatic thrillers to '90s classics, it holds information about the movies most voted on by users across the world. Explore num_voted_users trends, find the language each movie was released in, and build a list of country-specific titles released in any given year. The dataset also makes it straightforward to compare IMDB scores across movies.

    How to use the dataset

    This dataset offers a comprehensive overview of the movie ratings from IMDB. It includes data about director name, duration, actors, genres, movie title, number of votes, language, country of origin, year released and IMDB score.

    To use this dataset to get a deeper understanding of how movies are rated on IMDB you can take the following steps:

    • Look through each column of the data to get an overall understanding. This will help you identify any specific trends or correlations in the data that you can then analyze further in later steps.
    • Take some time to explore relationships between different columns, such as 'Number Voted Users' and 'IMDB Score'; looking at how these numbers relate to each other can help you better understand rating trends on IMDB.
    • Analyze how particular sub-groups perform within categories such as genre or country; this could provide insight into preferences for certain types of movies, or reveal countries whose movies have higher scores than others.
    • Through your analysis, try to answer questions about specific demographic groups on IMDB. Are there distinct preferences among age groups when it comes to what they watch? Are there any clear correlations between rating and genre within certain countries?

    By starting with the questions above and taking a 'big picture' view before diving into more detailed analysis, users should be able to find value in this dataset by uncovering useful insights about movie ratings on IMDB.
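
    The sub-group step above can be sketched as a mean score per genre. The "genres" and "num_voted_users" column names come from the listing; the "imdb_score" column name, the "|" genre separator, and the toy rows are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Toy rows for illustration only. "imdb_score" and the "|" separator
# are assumptions; check the real header and values before relying on them.
SAMPLE = """movie_title,genres,num_voted_users,imdb_score
Film A,Action|Drama,1000,8.0
Film B,Drama,500,6.0
Film C,Action,200,7.0
"""


def mean_score_by_genre(csv_text, sep="|"):
    """Average the score over every movie that lists a given genre."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for genre in row["genres"].split(sep):
            totals[genre] += float(row["imdb_score"])
            counts[genre] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


means = mean_score_by_genre(SAMPLE)
```

    A movie with several genres counts once toward each of them, which is one possible design choice; weighting by num_voted_users would be another.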

    Research Ideas

    • Movie Recommendation System: The dataset can be used to build a movie recommendation system using machine learning algorithms like k-nearest neighbors or collaborative filtering. Based on the user's past ratings, the system can suggest relevant movies with similar genres, actors and directors.
    • Movie Popularity Index: Using the data, a metric could be designed that provides an overall popularity index for movies released over the years. This index could be constructed by considering factors such as IMDb score, number of votes, and number of reviews collected.
    • Genre-based Over/Under Performance Analysis: Based on the genres of each year's movies, this dataset can provide insight into which genres are performing well and which are not. This kind of analysis could inform decisions about allocating resources to production budgets or marketing campaigns for upcoming films in different genres across different regions or markets.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    See the dataset description for more information.

    Columns

    File: movie_data.csv

    | Column name          | Description                                        |
    |:---------------------|:---------------------------------------------------|
    | director_name        | Name of the director of the movie. (String)        |
    | duration             | Length of the movie in minutes. (Integer)          |
    | actor_2_name         | Name of the second actor in the movie. (String)    |
    | genres               | Genre of the movie. (String)                       |
    | actor_1_name         | Name of the first actor in the movie. (String)     |
    | movie_title          | Title of the movie. (String)                       |
    | num_voted_users      | Number of users who voted for the movie. (Integer) |
    | actor_3_name         | Name of the third actor in the movie. (String)     |
    | movie_imdb_link      | Link to the movie's IMDB page. (String)            |
    | num_user_for_reviews | ...                                                |

  4. movielens

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Jul 8, 2020
    Cite
    (2020). movielens [Dataset]. https://www.tensorflow.org/datasets/catalog/movielens
    Dataset updated
    Jul 8, 2020
    Description

    This dataset contains a set of movie ratings from the MovieLens website, a movie recommendation service. This dataset was collected and maintained by GroupLens, a research group at the University of Minnesota. There are 5 versions included: "25m", "latest-small", "100k", "1m", "20m". In all datasets, the movies data and ratings data are joined on "movieId". The 25m dataset, latest-small dataset, and 20m dataset contain only movie data and rating data. The 1m dataset and 100k dataset contain demographic data in addition to movie and rating data.

    • "25m": This is the latest stable version of the MovieLens dataset. It is recommended for research purposes.
    • "latest-small": This is a small subset of the latest version of the MovieLens dataset. It is changed and updated over time by GroupLens.
    • "100k": This is the oldest version of the MovieLens datasets. It is a small dataset with demographic data.
    • "1m": This is the largest MovieLens dataset that contains demographic data.
    • "20m": This is one of the most used MovieLens datasets in academic papers along with the 1m dataset.

    For each version, users can view either only the movies data by adding the "-movies" suffix (e.g. "25m-movies") or the ratings data joined with the movies data (and users data in the 1m and 100k datasets) by adding the "-ratings" suffix (e.g. "25m-ratings").
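
    The naming scheme above can be sketched as a small helper that builds a config name from a version and a suffix. The helper names are hypothetical; the commented-out call assumes the usual "dataset/config" form accepted by tfds.load and needs tensorflow_datasets plus a download to run.

```python
# Versions and suffixes as listed in the description above.
VERSIONS = ("25m", "latest-small", "100k", "1m", "20m")
TABLES = ("movies", "ratings")


def movielens_config(version: str, table: str) -> str:
    """Build a config name such as '25m-ratings' (hypothetical helper)."""
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version!r}")
    if table not in TABLES:
        raise ValueError(f"unknown table: {table!r}")
    return f"{version}-{table}"


def has_demographics(version: str) -> bool:
    """Per the description, only the 1m and 100k versions carry user data."""
    return version in ("100k", "1m")


config = movielens_config("100k", "ratings")

# Assumed usage, not executed here:
# import tensorflow_datasets as tfds
# ds = tfds.load(f"movielens/{config}", split="train")
```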

    The features below are included in all versions with the "-ratings" suffix.

    • "movie_id": a unique identifier of the rated movie
    • "movie_title": the title of the rated movie with the release year in parentheses
    • "movie_genres": a sequence of genres to which the rated movie belongs
    • "user_id": a unique identifier of the user who made the rating
    • "user_rating": the score of the rating on a five-star scale
    • "timestamp": the timestamp of the ratings, represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
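
    Since "timestamp" is given in seconds since the Unix epoch, it can be turned into a timezone-aware datetime with the standard library. This is a minimal sketch; the function name is hypothetical.

```python
from datetime import datetime, timezone


def rating_time(timestamp_seconds: int) -> datetime:
    """Convert seconds since 1970-01-01 00:00 UTC to an aware datetime."""
    return datetime.fromtimestamp(timestamp_seconds, tz=timezone.utc)


epoch = rating_time(0)
```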

    The "100k-ratings" and "1m-ratings" versions in addition include the following demographic features.

    • "user_gender": gender of the user who made the rating; a true value corresponds to male
    • "bucketized_user_age": bucketized age values of the user who made the rating, the values and the corresponding ranges are:
      • 1: "Under 18"
      • 18: "18-24"
      • 25: "25-34"
      • 35: "35-44"
      • 45: "45-49"
      • 50: "50-55"
      • 56: "56+"
    • "user_occupation_label": the occupation of the user who made the rating represented by an integer-encoded label; labels are preprocessed to be consistent across different versions
    • "user_occupation_text": the occupation of the user who made the rating in the original string; different versions can have different set of raw text labels
    • "user_zip_code": the zip code of the user who made the rating

    In addition, the "100k-ratings" dataset also has a feature "raw_user_age", which is the exact age of the user who made the rating.
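
    To show how the bucket codes relate to exact ages, the sketch below maps a raw age to the code of the range that contains it. The ranges are the ones listed above; whether the dataset derives "bucketized_user_age" from "raw_user_age" in exactly this way is an assumption.

```python
import bisect

# Lower bound of each bucket doubles as its code, per the list above.
BUCKET_STARTS = [1, 18, 25, 35, 45, 50, 56]
BUCKET_LABELS = {
    1: "Under 18",
    18: "18-24",
    25: "25-34",
    35: "35-44",
    45: "45-49",
    50: "50-55",
    56: "56+",
}


def bucketize_age(raw_age: int) -> int:
    """Return the bucket code whose range contains raw_age."""
    if raw_age < 0:
        raise ValueError("age must be non-negative")
    # Index of the last bucket start that is <= raw_age; ages below 1
    # are clamped into the "Under 18" bucket.
    idx = max(bisect.bisect_right(BUCKET_STARTS, raw_age) - 1, 0)
    return BUCKET_STARTS[idx]


code = bucketize_age(30)
label = BUCKET_LABELS[code]
```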

    Datasets with the "-movies" suffix contain only "movie_id", "movie_title", and "movie_genres" features.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('movielens', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  5. IMDb Dataset (2024) updated

    • kaggle.com
    zip
    Updated Jul 6, 2024
    Cite
    Parth (2024). IMDb Dataset (2024) updated [Dataset]. https://www.kaggle.com/datasets/parthdande/imdb-dataset-2024-updated
    Available download formats: zip (335942 bytes)
    Dataset updated
    Jul 6, 2024
    Authors
    Parth
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains detailed information about movies listed on IMDb, including titles, genres, release dates, and ratings. It also includes user reviews and ratings, making it an excellent resource for sentiment analysis and trend analysis in the movie industry. This dataset can be used to gain insights into movie trends, audience preferences, and the correlation between movie attributes and ratings. The second file has an additional feature called poster_src, which is a link to the movie's poster image. The second file is bigger than the first and covers a wider range of movies.

  6. Real Movies Dataset

    • kaggle.com
    zip
    Updated Feb 9, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Harshit Sharma (2024). Real Movies Dataset [Dataset]. https://www.kaggle.com/datasets/harshitstark/real-movies-dataset
    Available download formats: zip (104062 bytes)
    Dataset updated
    Feb 9, 2024
    Authors
    Harshit Sharma
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The "Real Movies Dataset" offers a comprehensive repository of diverse movie information, facilitating in-depth analysis and meaningful comparisons across various cinematic attributes. With its wealth of key details, this dataset serves as an invaluable resource for researchers, enthusiasts, and industry professionals alike.

    Each entry in the dataset includes the following attributes:

    • Movie Name: The title of the movie.
    • Year of Release: The year in which the movie was officially released to the public.
    • Watch Time: The duration of the movie in terms of hours and minutes, indicating the length of time required to watch the entire film.
    • Movie Rating: This refers to the rating assigned to the movie based on various criteria such as content, suitability for different age groups, and overall quality. Ratings could be numerical (e.g., out of 10).
    • Meatscore of Movie: This is a unique metric that represents the "meatiness" or substance of the movie. It might be a score assigned based on the complexity of the plot, character development, thematic depth, or other qualitative aspects.
    • Votes: The number of votes or ratings received by the movie from viewers or critics. This metric provides an indication of the movie's popularity or reception.
    • Gross: The total box office gross earnings generated by the movie, typically measured in a specific currency (e.g., USD). This metric reflects the commercial success of the film.
    • Description: The dataset includes a brief description field providing a summary or overview of the movie's plot, genre, themes, or notable aspects. This description offers context and insight into the content and style of each film, aiding in understanding and analysis.

    Overall, the "Real Movies Dataset" serves as a valuable resource for researchers, analysts, and enthusiasts interested in exploring and studying the dynamics of the film industry, including trends in movie production, audience preferences, and financial performance.

  7. MovieLens 100K

    • grouplens.org
    • kaggle.com
    Updated Oct 12, 2015
    Cite
    (2015). MovieLens 100K [Dataset]. https://grouplens.org/datasets/movielens/100k/
    Dataset updated
    Oct 12, 2015
    Description

    Stable benchmark dataset. 100,000 ratings from 1000 users on 1700 movies. Released 4/1998.

  8. IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage)

    • crawlfeeds.com
    csv, zip
    Updated Nov 9, 2025
    Cite
    Crawl Feeds (2025). IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage) [Dataset]. https://crawlfeeds.com/datasets/imdb-movies-metadata-dataset-4-5m-records-global-coverage
    Available download formats: csv, zip
    Dataset updated
    Nov 9, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Unlock one of the most comprehensive movie datasets available—4.5 million structured IMDb movie records, extracted and enriched for data science, machine learning, and entertainment research.

    This dataset includes a vast collection of global movie metadata, including details on title, release year, genre, country, language, runtime, cast, directors, IMDb ratings, reviews, and synopsis. Whether you're building a recommendation engine, benchmarking trends, or training AI models, this dataset is designed to give you deep and wide access to cinematic data across decades and continents.

    Perfect for use in film analytics, OTT platforms, review sentiment analysis, knowledge graphs, and LLM fine-tuning, the dataset is cleaned, normalized, and exportable in multiple formats.

    What’s Included:

    • Genres: Drama, Comedy, Horror, Action, Sci-Fi, Documentary, and more

    • Delivery: Direct download

    Use Cases:

    • Train LLMs or chatbots on cinematic language and metadata

    • Build or enrich movie recommendation engines

    • Run cross-lingual or multi-region film analytics

    • Benchmark genre popularity across time periods

    • Power academic studies or entertainment dashboards

    • Feed into knowledge graphs, search engines, or NLP pipelines

  9. MovieLens 1M

    • grouplens.org
    • kaggle.com
    Updated Mar 19, 2016
    Cite
    (2016). MovieLens 1M [Dataset]. https://grouplens.org/datasets/movielens/1m/
    Dataset updated
    Mar 19, 2016
    Description

    Stable benchmark dataset. 1 million ratings from 6000 users on 4000 movies. Released 2/2003.

  10. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Julie R. Repository creator - Campos Arias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contain only negative samples and the last 5331 rows contain only positive samples, so the data should be shuffled before use.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
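
    Because the rows are ordered by label, a seeded shuffle before any train/test split is a reasonable first step. The sketch below uses a few toy rows in place of data_rt.csv; the function name and seed are arbitrary choices for the example.

```python
import random


def shuffled_rows(rows, seed=0):
    """Return a shuffled copy of rows; the input list is left untouched."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows


# Toy stand-in for data_rt.csv: negatives (0) first, then positives (1).
rows = [("bad film", 0)] * 3 + [("good film", 1)] * 3
mixed = shuffled_rows(rows)
```

    Fixing the seed keeps the shuffle reproducible across runs, which helps when comparing experiments.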

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews, with star ratings, for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review in Unix time), reviewTime (raw time of the review), and division (manually added; categorical label generated using the overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  11. IMDb Movie Reviews Dataset

    • berd-platform.de
    bin
    Updated Jul 31, 2025
    Cite
    Andrew L. Maas; Raymond E. Daly; Peter T. Pham; Dan Huang; Andrew Y. Ng; Christopher Potts (2025). IMDb Movie Reviews Dataset [Dataset]. http://doi.org/10.82939/z8gxk-w3567
    Available download formats: bin
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Stanford University
    Authors
    Andrew L. Maas; Raymond E. Daly; Peter T. Pham; Dan Huang; Andrew Y. Ng; Christopher Potts
    License

    https://ai.stanford.edu/~amaas/data/sentiment

    Description

    The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning.

    The dataset contains equal numbers of positive and negative reviews. Only highly polarized reviews are considered: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. See the README file contained in the release for more details.
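
    The labeling rule above can be written as a small function. This is a sketch of the stated thresholds only; the function name is hypothetical, and the 1 to 10 range check is an assumption about valid scores.

```python
from typing import Optional


def label_from_score(score: int) -> Optional[str]:
    """Map a review score out of 10 to a sentiment label, or None if neutral."""
    # Assumption: valid scores run from 1 to 10.
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    # Scores of 5 and 6 are not polar enough to be included.
    return None


labels = [label_from_score(s) for s in (2, 5, 9)]
```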

    The data is split into a train (25k reviews) and test (25k reviews) set. A preview file cannot be provided - please download the data directly from the data provider's website.

    When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

  12. IMDB Dataset of 50K Movie Reviews - CLEANED

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    HQ Data Profiler (2025). IMDB Dataset of 50K Movie Reviews - CLEANED [Dataset]. https://www.kaggle.com/datasets/hqdataprofiler/imdb-dataset-of-50k-movie-reviews-cleaned
    Available download formats: zip (26469422 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    HQ Data Profiler
    Description

    The "IMDB Dataset of 50K Movie Reviews" dataset is a tabular dataset with listings for 50k reviews from IMDB. There are two fields: "review", containing the review text, and "sentiment", containing either the value "positive" or the value "negative".

    Using HQ Data Profiler, data quality issues in the original dataset were identified and fixed, and this CLEANED version was prepared. HQ Data Profiler's comprehensive profile report showed that the original dataset contained 418 duplicated "review" values. All rows with duplicated review values were removed. The dataset was then balanced by randomly removing rows in the more populated sentiment category. Result: 24698 "positive" and 24698 "negative" reviews, with no duplicates.
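
    The two cleaning steps described above can be sketched in plain Python. This is not the HQ Data Profiler implementation; it reads "all rows with duplicated review values were removed" literally (every copy is dropped, whereas keeping one copy would be an alternative reading), and the function name, seed, and toy rows are assumptions.

```python
import random
from collections import Counter


def clean_and_balance(rows, seed=0):
    """rows: list of (review, sentiment) pairs. Returns a balanced subset."""
    counts = Counter(review for review, _ in rows)
    # Step 1: drop every row whose review text occurs more than once.
    unique = [(review, label) for review, label in rows if counts[review] == 1]

    # Step 2: group by label and down-sample each group to the smallest size.
    by_label = {}
    for review, label in unique:
        by_label.setdefault(label, []).append((review, label))
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Toy rows for illustration only.
toy = [
    ("a", "positive"),
    ("a", "positive"),
    ("b", "positive"),
    ("c", "positive"),
    ("d", "negative"),
]
cleaned = clean_and_balance(toy)
```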

    Original dataset link (uncleaned): https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

    Dataset citation ( https://ai.stanford.edu/~amaas/data/sentiment/ ):

    @InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
      title     = {Learning Word Vectors for Sentiment Analysis},
      booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
      month     = {June},
      year      = {2011},
      address   = {Portland, Oregon, USA},
      publisher = {Association for Computational Linguistics},
      pages     = {142--150},
      url       = {http://www.aclweb.org/anthology/P11-1015}
    }

  13. MovieLens 20M

    • grouplens.org
    • academictorrents.com
    Updated Mar 19, 2016
    Cite
    (2016). MovieLens 20M [Dataset]. https://grouplens.org/datasets/movielens/20m/
    Dataset updated
    Mar 19, 2016
    Description

    Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.

  14. IMDB Movies From 1920 to 2025

    • kaggle.com
    zip
    Updated Mar 27, 2025
    Cite
    Raed Addala (2025). IMDB Movies From 1920 to 2025 [Dataset]. https://www.kaggle.com/datasets/raedaddala/imdb-movies-from-1960-to-2023
    Available download formats: zip (46688739 bytes)
    Dataset updated
    Mar 27, 2025
    Authors
    Raed Addala
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Over 60,000 Movies, 100+ Years of Data, and Rich Metadata!

    For details about the scraping process, explore the complete code repository on GitHub.

    About the Dataset

    This dataset provides annual data for the most popular 500–600 movies per year from 1920 to 2025, extracted from IMDb. It includes over 60,000 movies, spanning more than 100 years of cinematic history. Each year’s data is divided into three CSV files for flexibility and ease of use:
    - imdb_movies_[year].csv: Basic movie details.
    - advanced_movies_details_[year].csv: Comprehensive metadata and financial details.
    - merged_movies_data_[year].csv: A unified dataset combining both files.

    File Descriptions

    1. imdb_movies_[year].csv

    Essential movie information, including:
    - Title: Movie title.
    - Description: Movie description.
    - meta_score: IMDB's meta score.
    - Movie Link: IMDb URL for the movie.
    - Year: Year of release.
    - Duration: Runtime (in minutes).
    - MPA: Motion Picture Association rating (e.g., PG, R).
    - Rating: IMDb rating (scale of 1–10).
    - Votes: Total user votes on IMDb.

    2. advanced_movies_details_[year].csv

    Detailed movie metadata:
    - Link: IMDb URL (for linking with other data).
    - budget: Production budget (in USD).
    - grossWorldWide: Global box office revenue.
    - gross_US_Canada: North American box office earnings.
    - opening_weekend_Gross: Opening weekend revenue.
    - directors: List of directors.
    - writers: List of writers.
    - stars: Main cast members.
    - genres: Movie genres.
    - countries_origin: Countries of production.
    - filming_locations: Primary filming locations.
    - production_companies: Associated production companies.
    - Languages: Languages spoken in the movie.
    - Award_information: Information about awards, nominations and wins.
    - release_date: Official release date.

    3. merged_movies_data_[year].csv

    A unified dataset combining all columns from the previous two files:
    - Basic Details: Title, Year, Rating, Votes.
    - Advanced Features: budget, grossWorldWide, directors, genres, and awards.

    Data Structure

    Template Columns:
    • imdb_movies_[year].csv:
      Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link

    • advanced_movies_details_[year].csv:
      link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages

    • merged_movies_data_[year].csv:
      Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages
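Since the basic file carries a Movie Link column and the advanced file a link column, the two can be joined on the IMDb URL. The following is a minimal sketch of such a join using only the standard library. The sample rows and URLs are invented, the helper name `merge_on_link` is hypothetical, and the left-join behaviour is an assumption, not necessarily how the published merged files were built.

```python
import csv
import io

# Tiny in-memory stand-ins for imdb_movies_[year].csv and
# advanced_movies_details_[year].csv; the rows are made up for illustration.
BASIC_CSV = """Title,Year,Rating,Movie Link
Example Movie A,1999,7.5,https://example.com/title/a
Example Movie B,1999,6.1,https://example.com/title/b
"""
ADVANCED_CSV = """link,budget,genres
https://example.com/title/a,1000000,Drama
"""


def merge_on_link(basic_file, advanced_file):
    """Left-join basic rows with advanced rows, matching 'Movie Link' to 'link'."""
    advanced_by_link = {row["link"]: row for row in csv.DictReader(advanced_file)}
    merged = []
    for row in csv.DictReader(basic_file):
        extra = advanced_by_link.get(row["Movie Link"], {})
        combined = dict(row)
        # Copy every advanced column except the join key itself.
        combined.update({k: v for k, v in extra.items() if k != "link"})
        merged.append(combined)
    return merged


merged_rows = merge_on_link(io.StringIO(BASIC_CSV), io.StringIO(ADVANCED_CSV))
for row in merged_rows:
    print(row["Title"], row.get("budget", "n/a"))
```

To run this against the real files, replace the `io.StringIO` objects with `open(path, newline="", encoding="utf-8")` handles. Movies without a matching advanced row keep only their basic columns.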

    Updates

    The dataset is updated annually in December to include the latest data.

    Applications

    This dataset is ideal for:
    - Trend Analysis: Explore changes in the movie industry over more than a century.
    - Predictive Modeling: Build models to forecast box office revenue, ratings, or awards.
    - Recommendation Systems: Use attributes like genres, cast, and ratings for personalized recommendations.
    - Comparative Analysis: Study differences across eras, genres, or regions.

    Dataset Features

    • Over 60,000 Movies: Detailed data from 1920 to 2025.
    • Rich Metadata: Financial, creative, and recognition-related attributes.
    • User-friendly: Modular files for tailored use or comprehensive merged files.
    • Consistency: Uniform structure enables seamless analysis.

    Notes

    • For issues, suggestions, or feature requests, please feel free to contact me: send me an email or open an issue on GitHub. Your input is highly appreciated.
  15. Amazon prime tv shows and movies dataset

    • crawlfeeds.com
    csv, zip
    Updated Jul 4, 2025
    Cite
    Crawl Feeds (2025). Amazon prime tv shows and movies dataset [Dataset]. https://crawlfeeds.com/datasets/amazon-prime-tv-shows-and-movies-dataset
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Jul 4, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Amazon Prime TV Shows and Movies Dataset offered by Crawl Feeds is an extensive resource containing over 92,000 records in JSON format. This dataset encompasses a wide array of data points, including links, titles, descriptions, release dates, genres, posters, streaming platforms, countries, number of seasons, content ratings, IMDb ratings, cast and crew details, unique identifiers, and scraping timestamps. Such comprehensive information is invaluable for researchers, data analysts, and developers aiming to conduct in-depth analyses, develop recommendation systems, or explore trends within Amazon Prime's content library.

    For those interested in broader media datasets, Crawl Feeds also offers the Movies and TV Shows Dataset, which includes 118,000 records, and the IMDb Movie Details Dataset, comprising 250,000 records. These datasets provide extensive information across various platforms, facilitating comparative studies and cross-platform analyses.

    Integrating these datasets into your projects can significantly enhance the depth and quality of your analyses, providing a robust foundation for exploring various facets of the entertainment industry. Whether you're developing a new application, conducting market research, or performing academic studies, these datasets serve as a valuable resource for gaining insights into the dynamic world of streaming media.

    Explore the Amazon Prime TV Shows and Movies Dataset and other related datasets on Crawl Feeds to elevate your data-driven projects.

  16. MovieLens Dataset - 100K Ratings

    • kaggle.com
    zip
    Updated Feb 28, 2025
    + more versions
    Cite
    Sriharsha B S Prasad (2025). MovieLens Dataset - 100K Ratings [Dataset]. https://www.kaggle.com/datasets/sriharshabsprasad/movielens-dataset-100k-ratings
    Explore at:
    Available download formats: zip (994099 bytes)
    Dataset updated
    Feb 28, 2025
    Authors
    Sriharsha B S Prasad
    Description

    This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

    Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

    The data are contained in the files links.csv, movies.csv, ratings.csv, and tags.csv.
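The claim that every selected user rated at least 20 movies can be checked by counting rows per user in ratings.csv. The sketch below assumes the file has a header row with the columns userId, movieId, rating, and timestamp, which the description above does not state. It runs on a made-up three-row sample so that it is self-contained; the helper name `ratings_per_user` is hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up sample in the assumed ratings.csv layout (not real MovieLens rows).
SAMPLE_RATINGS_CSV = """userId,movieId,rating,timestamp
1,10,4.0,964982703
1,20,3.5,964981247
2,10,5.0,1445714835
"""


def ratings_per_user(ratings_file):
    """Count how many rating rows each userId contributes."""
    reader = csv.DictReader(ratings_file)
    return Counter(row["userId"] for row in reader)


counts = ratings_per_user(io.StringIO(SAMPLE_RATINGS_CSV))
min_count = min(counts.values())
print(f"users: {len(counts)}, fewest ratings by one user: {min_count}")
```

On the real file, pass `open("ratings.csv", newline="")` instead of the sample. If the description holds, the smallest count should be 20 or more.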

    This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.

    License: This dataset is sourced from the GroupLens Research Group at the University of Minnesota. It is provided for non-commercial research and educational purposes only. License details can be found here under Usage License - https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html

    Important:

    • This dataset is provided "as is" without warranty.
    • For commercial use, please contact grouplens-info@umn.edu.

    Citation: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

  17. Amazon review data 2018

    • nijianmo.github.io
    • cseweb.ucsd.edu
    • +1more
    Cite
    UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://nijianmo.github.io/amazon/
    Explore at:
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Context

    This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

    • More reviews:

      • The total number of reviews is 233.1 million (142.8 million in 2014).
    • New reviews:

      • Current data includes reviews in the range May 1996 - Oct 2018.
    • Metadata:

      • We have added transaction metadata for each review shown on the review page.
      • Added more detailed metadata of the product landing page.

    Acknowledgements

    If you publish articles based on this dataset, please cite the following paper:

    • Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.
  18. Netflix Movies and TV Shows

    • gts.ai
    csv/json
    Updated Jan 25, 2025
    Cite
    GTS (2025). Netflix Movies and TV Shows [Dataset]. https://gts.ai/dataset-download/page/82/
    Explore at:
    Available download formats: csv/json
    Dataset updated
    Jan 25, 2025
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Explore the Netflix Titles dataset, featuring detailed insights on over 8,800 movies and TV shows. Ideal for content analysis, recommendation systems, and market research, covering genre trends, directors, cast, production countries, release years, and ratings.

  19. sst2

    • huggingface.co
    Updated Mar 26, 2024
    Cite
    Stanford NLP (2024). sst2 [Dataset]. https://huggingface.co/datasets/stanfordnlp/sst2
    Explore at:
    Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 26, 2024
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary
    

    The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/sst2.

  20. MovieLens 10M Dataset (Latest Version)

    • kaggle.com
    zip
    Updated Feb 9, 2023
    Cite
    Amir Motefaker (2023). MovieLens 10M Dataset (Latest Version) [Dataset]. https://www.kaggle.com/datasets/amirmotefaker/movielens-10m-dataset-latest-version
    Explore at:
    Available download formats: zip (67393808 bytes)
    Dataset updated
    Feb 9, 2023
    Authors
    Amir Motefaker
    License

    https://cdla.io/sharing-1-0/

    Description

    This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens.

    Users were selected at random for inclusion. All users selected had rated at least 20 movies. Unlike previous MovieLens data sets, no demographic information is included. Each user is represented by an id, and no other information is provided.

    The data are contained in three files, movies.dat, ratings.dat, and tags.dat. Also included are scripts for generating subsets of the data to support the five-fold cross-validation of rating predictions. More details about the contents and use of all these files follow.
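As a rough illustration of reading ratings.dat, the sketch below parses one rating per line. It assumes the fields are separated by a double colon in the order user id, movie id, rating, timestamp; the description above does not spell out this layout, so check it against the dataset README before relying on it. The sample lines are invented, and `Rating` and `parse_rating_line` are hypothetical names.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int


def parse_rating_line(line: str) -> Rating:
    """Split one '::'-delimited line into typed fields (assumed field order)."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Invented sample lines in the assumed layout, not real MovieLens rows.
sample_lines = ["1::122::5::838985046\n", "1::185::4.5::838983525\n"]
parsed = [parse_rating_line(line) for line in sample_lines]

for r in parsed:
    print(r.user_id, r.movie_id, r.rating)
```

For the full file, iterating over `open("ratings.dat", encoding="utf-8")` line by line keeps memory use flat, since each line is parsed independently.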

    This and other GroupLens data sets are publicly available for download at GroupLens Data Sets.

(2024). imdb_reviews [Dataset]. https://www.tensorflow.org/datasets/catalog/imdb_reviews

imdb_reviews

Explore at:
35 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Sep 20, 2024
Description

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

To use this dataset:

import tensorflow_datasets as tfds

# Load the training split (the data is downloaded on first use).
ds = tfds.load('imdb_reviews', split='train')
# Print the first four examples.
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.
