80 datasets found
  1. IMDB Dataset of 50K Movie Reviews - CLEANED

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    HQ Data Profiler (2025). IMDB Dataset of 50K Movie Reviews - CLEANED [Dataset]. https://www.kaggle.com/datasets/hqdataprofiler/imdb-dataset-of-50k-movie-reviews-cleaned
    Explore at:
    zip (26469422 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    HQ Data Profiler
    Description

    The "IMDB Dataset of 50K Movie Reviews" dataset is a tabular dataset with listings for 50k reviews from IMDB. There are two fields: "review", containing the review text, and "sentiment", containing either the value "positive" or the value "negative".

    Using HQ Data Profiler, data quality issues in the original dataset were identified and fixed, and this CLEANED version was prepared. HQ Data Profiler's comprehensive profile report showed that the original dataset contained 418 duplicated "review" values. All rows with duplicated review values were removed. The dataset was then balanced by randomly removing rows from the more populated sentiment category. Result: 24698 "positive" and 24698 "negative" reviews, with no duplicates.
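
The cleaning steps described above (drop duplicated reviews, then balance the two sentiment classes by random removal) could be sketched roughly as follows. This is not HQ Data Profiler's actual procedure: the function name `dedupe_and_balance`, the fixed seed, and the choice to keep the first copy of each duplicated review are assumptions made for illustration. Rows are assumed to be dicts with the "review" and "sentiment" fields named in the description.

```python
import random
from collections import defaultdict


def dedupe_and_balance(rows, seed=0):
    """Drop repeated reviews, then downsample so every sentiment has equal count.

    `rows` is an iterable of dicts with "review" and "sentiment" keys.
    """
    seen = set()
    unique = []
    for row in rows:
        # Assumption: the first copy of a duplicated review is kept.
        if row["review"] not in seen:
            seen.add(row["review"])
            unique.append(row)

    # Group the de-duplicated rows by sentiment label.
    by_label = defaultdict(list)
    for row in unique:
        by_label[row["sentiment"]].append(row)
    if not by_label:
        return []

    # Randomly shrink each group to the size of the smallest one.
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced
```

In practice the rows might come from `csv.DictReader` over the original CSV file. The seed only makes the random removal repeatable.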

    Original dataset link (uncleaned): https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

    Dataset citation ( https://ai.stanford.edu/~amaas/data/sentiment/ ): @InProceedings{maas-EtAl:2011:ACL-HLT2011, author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, title = {Learning Word Vectors for Sentiment Analysis}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {142--150}, url = {http://www.aclweb.org/anthology/P11-1015} }

  2. IMDB-Dataset-of-50K-Movie-Reviews-Backup

    • huggingface.co
    Cite
    Q-b1t, IMDB-Dataset-of-50K-Movie-Reviews-Backup [Dataset]. https://huggingface.co/datasets/Q-b1t/IMDB-Dataset-of-50K-Movie-Reviews-Backup
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Q-b1t
    License

    https://choosealicense.com/licenses/other/

    Description

    Q-b1t/IMDB-Dataset-of-50K-Movie-Reviews-Backup dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. IMDb movie review dataset

    • kaggle.com
    zip
    Updated Nov 14, 2024
    Cite
    Md Saidul Islam (2024). IMDb movie review dataset [Dataset]. https://www.kaggle.com/datasets/mdsaidulislam43/imdb-movie-review-dataset
    Explore at:
    zip (26249652 bytes)
    Dataset updated
    Nov 14, 2024
    Authors
    Md Saidul Islam
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is a curated collection of IMDB movie reviews, specifically designed for text classification tasks, including sentiment analysis and other natural language processing (NLP) applications. Each review is labeled to help identify sentiment polarity or other key features in text data.

  4. IMDb Movie Reviews Genres Description and Emotions

    • kaggle.com
    zip
    Updated Mar 27, 2024
    Cite
    Fahad Rehman (2024). IMDb Movie Reviews Genres Description and Emotions [Dataset]. https://www.kaggle.com/datasets/fahadrehman07/movie-reviews-and-emotion-dataset
    Explore at:
    zip (32966193 bytes)
    Dataset updated
    Mar 27, 2024
    Authors
    Fahad Rehman
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🟡Please upvote the dataset if you like it.🍒

    The "IMDB Dataset of Movies Reviews and Translation" dataset has been expanded significantly and is now available on Kaggle in a modified version. Three new columns have been added to the dataset: genres, descriptions, and emotions. The original dataset only had four columns: ratings, reviews, movies, and resenhas. This extension adds to the dataset's richness and offers insightful information about movie genres, in-depth synopses, and the sentimentality of the reviews.

    The addition of the Genres column provides an extensive movie classification that enables scholars and film aficionados to explore particular genres and their traits in greater detail. By examining patterns, trends, and preferences across various genres, analysts can use this data to create more specialized research and moviegoer suggestions.

    The newly added Descriptions column is a valuable addition as it provides textual summaries or synopses of each movie. These descriptions offer a concise overview of the plot, characters, and themes, making it easier for users to understand and evaluate movies of interest. Researchers can leverage this information to conduct sentiment analysis, topic modeling, or recommendation systems based on movie summaries.

    Finally, the Emotions column adds an intriguing dimension to the dataset. By capturing the emotional tone expressed within each description, this column allows for a deeper understanding of sentiments toward the movies. Sentiment analysis techniques can be applied to this data, enabling researchers to gain insights into emotions such as joy, anger, and sadness associated with different movies. This information can be particularly valuable for filmmakers, production companies, and marketers looking to gauge audience reactions and tailor their strategies accordingly, and especially for moviegoers who like to choose movies based on emotions.

    Overall, the expanded version of the "50k Movie Reviews" dataset offers a wealth of new information that fosters detailed analysis and exploration of movie genres, descriptions, and emotional responses. This dataset presents a valuable resource for researchers, data scientists, and movie enthusiasts alike, enabling a deeper understanding of the movie landscape and facilitating the development of innovative tools and applications in the field of movie analysis and recommendation systems.

  5. IMDB Movies Reviews Dataset

    • kaggle.com
    zip
    Updated Jul 11, 2025
    Cite
    Vince (2025). IMDB Movies Reviews Dataset [Dataset]. https://www.kaggle.com/datasets/shivvm/popular-movies-imdb-reviews-dataset
    Explore at:
    zip (3495548 bytes)
    Dataset updated
    Jul 11, 2025
    Authors
    Vince
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    This dataset contains featured user reviews from IMDb for the top 25 movies of each year from 2015 to 2024. The data was scraped using Selenium and bs4; all content is from the IMDb website. The 25 most popular movies from each year were chosen, so reviews for a total of 250 movies are present in the dataset. The rankings are as of July 2025 and might change later. The data was extracted for a personal sentiment analysis project on movie reviews. 👍

    File Information

    • imdb_list.csv : contains all 250 movies
    • imbd_reviews.csv : contains all featured reviews, title (of review), review rating (given by reviewer) and the imdb id of the movie

  6. IMDB Large Movie Reviews Sentiment Dataset

    • kaggle.com
    zip
    Updated Nov 18, 2019
    Cite
    Jan Christian Blaise Cruz (2019). IMDB Large Movie Reviews Sentiment Dataset [Dataset]. https://www.kaggle.com/jcblaise/imdb-sentiments
    Explore at:
    zip (38677807 bytes)
    Dataset updated
    Nov 18, 2019
    Authors
    Jan Christian Blaise Cruz
    Description

    IMDB Movie Reviews Sentiment Dataset

    This dataset contains CSV versions of the Large Movie Review dataset by Maas, et al. (2011) from its original Stanford AI Repository. It contains 50k highly polar movie reviews, evenly split to 25k positives and 25k negatives. Each sample is labeled with a 0 (positive) or 1 (negative). The additional ~11k unlabeled review data has also been included in CSV format for your convenience.

    Citations

    Works using this dataset must use the appropriate citations via this bibtex entry:

    @InProceedings{maas-EtAl:2011:ACL-HLT2011,
     author  = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
     title   = {Learning Word Vectors for Sentiment Analysis},
     booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
     month   = {June},
     year   = {2011},
     address  = {Portland, Oregon, USA},
     publisher = {Association for Computational Linguistics},
     pages   = {142--150},
     url    = {http://www.aclweb.org/anthology/P11-1015}
    }
    
  7. Data from: imdb

    • huggingface.co
    Updated May 10, 2025
    Cite
    scikit-learn (2025). imdb [Dataset]. https://huggingface.co/datasets/scikit-learn/imdb
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 10, 2025
    Dataset authored and provided by
    scikit-learn
    License

    https://choosealicense.com/licenses/other/

    Description

    This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.

  8. IMDb Movie Review Sentiment

    • kaggle.com
    zip
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). IMDb Movie Review Sentiment [Dataset]. https://www.kaggle.com/datasets/thedevastator/imdb-movie-review-sentiment-dataset
    Explore at:
    zip (52028315 bytes)
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    IMDb Movie Review Sentiment

    Movie Review Sentiment

    By imdb (From Huggingface) [source]

    About this dataset

    The IMDb Large Movie Review Dataset is a comprehensive collection of movie reviews used for sentiment classification. The dataset includes a wide range of movie reviews along with their corresponding sentiment labels, which indicate whether the review is positive or negative in nature. This invaluable dataset is aimed at facilitating sentiment analysis and classification tasks in the field of natural language processing.

    The main purpose of the train.csv file within this dataset is to provide a curated collection of movie reviews, each accompanied by its respective sentiment label. This file proves particularly useful for training machine learning models to accurately predict sentiment and classify reviews based on their emotional tone.

    Similarly, the test.csv file contains another set of movie reviews along with corresponding sentiment labels. Meant for testing and validating the performance of trained models, this dataset enables researchers and developers to evaluate their models' effectiveness in real-world scenarios.

    Additionally, the unsupervised.csv file offers an alternative subset within the dataset. Unlike train.csv and test.csv, unsupervised.csv does not include any associated sentiment labels for individual movie reviews. This specific subset serves as a valuable resource for exploring unsupervised learning techniques within the domain of sentiment classification.

    By utilizing this meticulously compiled IMDb Large Movie Review Dataset, researchers and data scientists can delve into various aspects related to analyzing sentiments in textual data. With its carefully labeled data points covering both positive and negative sentiments expressed in diverse film critiques, this dataset empowers users to develop sophisticated machine learning algorithms that accurately assess subjective opinions from text data

    How to use the dataset

    Introduction:

    Dataset Overview:

    • Train.csv: This file contains a set of movie reviews along with their sentiment labels. It is intended for training your sentiment analysis models.
    • Test.csv: This file provides another set of movie reviews along with their corresponding sentiment labels. You can use this file to evaluate the performance of your trained models.
    • Unsupervised.csv: This file includes movie reviews without any associated sentiment labels. It can be used for unsupervised sentiment classification tasks.

    Columns in the Dataset:

    • text: The main column containing the text of each movie review.
    • label: The sentiment label assigned to each review, indicating whether it is positive or negative.

    Guidelines for Using the Dataset:

    • Training Your Model:

      • Begin by loading and preprocessing the data from train.csv
      • Treat 'text' as your input feature and 'label' as your target variable
      • Explore different machine learning or deep learning algorithms suitable for text classification
      • Train your model using various techniques, such as bag-of-words, word embeddings, or transformers
      • Evaluate and fine-tune your model's performance using test.csv
    • Evaluating Your Model:

      • Load test.csv and preprocess the data similar to what you did with train.csv
      • Use this preprocessed test data to evaluate the accuracy, precision, recall, F1 score or other relevant metrics of your trained model on unseen data
      • Analyze these metrics to understand how well your model is performing in predicting sentiments
    • Advancing Your Model (Unsupervised Classification):

      • Utilize unsupervised.csv for unsupervised sentiment classification tasks
      • Preprocess the movie reviews in this file and explore techniques like clustering, topic modeling, or self-supervised learning
      • Extract patterns, themes, or sentiments from the reviews without any guidance from labeled data
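
As a rough illustration of the training and evaluation guidelines above, the sketch below fits a small bag-of-words multinomial naive Bayes classifier and computes accuracy, precision, recall, and F1. It is one simple choice among the techniques the guidelines mention, not a method prescribed by the dataset. All function names are hypothetical, and the texts and labels are assumed to come from the `text` and `label` columns of train.csv and test.csv.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(texts, labels):
    """Collect per-label word counts for a multinomial naive Bayes model."""
    word_counts = {}              # label -> Counter of word frequencies
    doc_counts = Counter(labels)  # label -> number of training reviews
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}
```

A typical flow would be to call `train_naive_bayes` on the train.csv rows, run `predict` on each test.csv review, and pass the true and predicted labels to `binary_metrics`.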

    Conclusion:

    Research Ideas

    • Sentiment Analysis: This dataset can be used to train models for sentiment analysis, where the goal is to predict whether a movie review is positive or negative based on its text.
    • NLP Research: The dataset can be used for various natural language processing (NLP) tasks such as text classification, information extraction, or named entity recognition. Researchers and practitioners can leverage this dataset to develop and evaluate new algorithms and techniques in the field of NLP.
    • Recommendation Systems: The sentiment labels in this dataset can be used as a source of feedback or user preferences for recommendation systems. By analyzing the sentiments expressed in reviews,...
  9. IMDB dataset (Sentiment analysis) in CSV format

    • kaggle.com
    zip
    Updated Nov 28, 2019
    Cite
    Ziqi Yuan (2019). IMDB dataset (Sentiment analysis) in CSV format [Dataset]. https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format
    Explore at:
    zip (26928397 bytes)
    Dataset updated
    Nov 28, 2019
    Authors
    Ziqi Yuan
    License

    https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets

    Description

    Context

    A movie review dataset for NLP sentiment analysis tasks.

    Note: all the movie reviews are long texts (most of them are longer than 200 words).

    Content

    Two columns are used: text (the review of the movie) and label (the sentiment label of the movie review).

  10. Reviews of IMDB Movies

    • kaggle.com
    Updated Jan 17, 2023
    Cite
    The Devastator (2023). Reviews of IMDB Movies [Dataset]. https://www.kaggle.com/datasets/thedevastator/reviews-of-imdb-movies
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 17, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reviews of IMDB Movies

    Exploring Ratings, Genres, and Spoilers

    By [source]

    About this dataset

    This dataset houses user reviews and ratings of movies from the popular Internet Movie Database (IMDB). Our IMDB movie reviews data contains detailed sentiment analysis from users on thousands of films. With the help of this dataset, we can explore the opinions and attitudes of viewers about a wide range of titles. The columns include information such as a user's username, date posted, helpfulness ratings, spoiler alert level, genre classification and review title. By utilizing this data to its fullest potential we can learn more about why people are drawn to certain types of films and which movies may have been overlooked by general audiences. In doing so we can gain a better understanding for how preferences for different genres have changed over time as well as discover hidden gems that should not be missed!

    How to use the dataset

    This dataset contains user reviews for movies in the IMDB database. It includes several columns of data such as username, date, review title, rating, text, helpfulYes (the number of users who found the review helpful), helpfulTotal (the total number of users who voted on the review), isSpoiler (whether or not the review contains spoilers), and genre.

    To use this dataset effectively to gain insights into IMDB movie ratings and reviews from actual viewers:

    • Start by exploring user ratings over time to identify any noticeable trends that could be used to develop marketing strategies or inform programming decisions. For example, see which genres consistently receive higher or lower ratings over time in order to better target audiences.
    • Analyze how specific words within reviews are rated differently across genres or languages; word frequency can be seen as a measure of reviewer sentiment toward each film's content. Look for patterns in the number of positive and negative words used across different language versions, genres, etc.
    • Utilize helpfulness scores by looking at how many people engage with each individual review: see where other reviewers find value in a given user's commentary and identify which reviews stand out from the others. Finally, analyze the spoiler flags on reviews to determine whether viewers find warning labels effective at deterring them.
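
One way to act on the helpfulness and spoiler suggestions above is to compute, for each review, the share of voters who found it helpful, and then compare the average for spoiler and non-spoiler reviews. The sketch below is an assumption-laden illustration: the function names are hypothetical, rows are assumed to be dicts read from reviews.csv with the `helpfulYes`, `helpfulTotal`, and `isSpoiler` columns, and `isSpoiler` is assumed to be serialized as the text "True" or "False".

```python
def helpfulness_ratio(row):
    """Share of voters who marked the review helpful, or None if nobody voted."""
    total = int(row["helpfulTotal"])
    if total == 0:
        return None
    return int(row["helpfulYes"]) / total


def mean_ratio_by_spoiler(rows):
    """Average helpfulness ratio, keyed by whether the review is a spoiler."""
    sums, counts = {}, {}
    for row in rows:
        ratio = helpfulness_ratio(row)
        if ratio is None:
            continue  # reviews without votes carry no helpfulness signal
        # Assumption: the boolean column is stored as the text "True"/"False".
        is_spoiler = str(row["isSpoiler"]).strip().lower() == "true"
        sums[is_spoiler] = sums.get(is_spoiler, 0.0) + ratio
        counts[is_spoiler] = counts.get(is_spoiler, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```

Reviews with zero votes are skipped rather than counted as unhelpful, which keeps the averages from being pulled toward zero by unseen reviews.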

    Research Ideas

    • Identifying correlation between specific review characteristics (e.g., length, rating, use of keywords) and helpfulness ratings to find patterns in user reviews and optimize the usefulness of future reviews
    • Analyzing user preferences for certain genres or ratings to help marketers modify movies based on user desires
    • Utilizing Natural Language Processing on the text data to identify what users generally like/dislike about movies in order to better personalize movie recommendations for viewers

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: reviews.csv

    | Column name  | Description |
    |:-------------|:------------|
    | username     | The username of the reviewer. (String) |
    | date         | The date the review was posted. (Date) |
    | review_title | The title of the review. (String) |
    | rating       | The rating given to the movie by the reviewer. (Integer) |
    | text         | The text of the review. (String) |
    | helpfulYes   | The number of users who found the review helpful. (Integer) |
    | helpfulTotal | The total number of users who voted on the helpfulness of the review. (Integer) |
    | isSpoiler    | Whether or not the review contains spoilers. (Boolean) |

    File: genres.csv | Column name | Description | |:--------------|...

  11. IMDb Dataset (2024) updated

    • kaggle.com
    zip
    Updated Jul 6, 2024
    Cite
    Parth (2024). IMDb Dataset (2024) updated [Dataset]. https://www.kaggle.com/datasets/parthdande/imdb-dataset-2024-updated
    Explore at:
    zip (335942 bytes)
    Dataset updated
    Jul 6, 2024
    Authors
    Parth
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains detailed information about movies listed on IMDb, including titles, genres, release dates, and ratings. It also includes user reviews and ratings, making it an excellent resource for sentiment analysis and trend analysis in the movie industry. This dataset can be used to gain insights into movie trends, audience preferences, and the correlation between movie attributes and ratings. The second file has an additional feature called poster_src, which is a link to the movie's poster image. The second file is bigger than the first and covers a wider range of movies.

  12. IMDB Movie Ratings Dataset

    • kaggle.com
    zip
    Updated Jan 17, 2023
    Cite
    The Devastator (2023). IMDB Movie Ratings Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/imdb-movie-ratings-dataset
    Explore at:
    zip (319960 bytes)
    Dataset updated
    Jan 17, 2023
    Authors
    The Devastator
    Description

    IMDB Movie Ratings Dataset

    Evaluating Directors, Actors, Genres, and Movie Titles

    By Himanshu Sekhar Paul [source]

    About this dataset

    This inspiring IMDB Movie Dataset is a comprehensive database of movie ratings, featuring director_name, duration, actor_2_name, genres, actor_1_name, movie title and more. Whether you're a fan of dramatic thrillers or nostalgic '90s classics from our childhoods; here you'll find information about the most voted movies from users across the world. Delve into num_voted_users trends and discover the language each movie was released in to craft your very own personal film library of country-specific titles released in any given year. With this dataset at your disposal comparing imdb scores will never be easier! Who will come out top when the votes have been tallied? Dive into data for a journey unparalleled!

    How to use the dataset

    This dataset offers a comprehensive overview of the movie ratings from IMDB. It includes data about director name, duration, actors, genres, movie title, number of votes, language, country of origin, year released and IMDB score.

    To use this dataset to get a deeper understanding of how movies are rated on IMDB you can take the following steps:

    • Look through each column of the data to get an overall understanding. This will help you identify any specific trends or correlations in the data that you can then analyze further in later steps.
    • Take some time to explore relationships between different columns such as 'Number Voted Users' and 'IMDB Score'. It could be interesting to look at how these numbers relate to each other in order to better understand rating trends on IMDB.
    • Analyze how particular sub-groups perform within various categories such as genre or country; this could provide insight into preferences towards certain types of movies or countries with higher associated scores than others.
    • Through your analysis, try to answer questions related to specific demographic groups on IMDB. Are there distinct preferences among age groups when it comes to what they watch? Are there any clear correlations between rating and genre within certain countries?
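
The suggestion above to relate the number of voting users to the IMDB score could start with a plain Pearson correlation, sketched below. The helper name `pearson` is hypothetical. The column `num_voted_users` appears in the dataset's column list, while `imdb_score` is an assumed column name, since the listing's column table is cut off.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length number lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def votes_vs_score(rows):
    """Correlate vote counts with scores for rows from movie_data.csv.

    Assumption: each row is a dict with "num_voted_users" and "imdb_score".
    """
    votes = [float(row["num_voted_users"]) for row in rows]
    scores = [float(row["imdb_score"]) for row in rows]
    return pearson(votes, scores)
```

A value near 1 or -1 would indicate a strong linear relationship; a value near 0 says only that there is no linear one, so a scatter plot remains a useful follow-up.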

    By utilizing the questions above and taking an initial 'big picture' view before diving into more detailed analysis, users should be able to find value in this dataset by uncovering useful insights about movie ratings on IMDB!

    Research Ideas

    • Movie Recommendation System: The dataset can be used to build a movie recommendation system using machine learning algorithms like k-nearest neighbors or collaborative filtering. Based on the user's past ratings, the system can suggest relevant movies with similar genres, actors and directors.
    • Movie Popularity Index: Using the data, a metric could be designed that provides an overall popularity index for movies released over the years. This index could be constructed by considering factors such as IMDb score, number of votes, and number of reviews collected.
    • Genre-based Over/Under Performance Analysis: Based on genre selections in each movie year, this dataset can provide insight into which genres are performing well and which are not. This kind of analysis could inform decisions about allocating resources to production budgets or marketing campaigns for upcoming films in different genres across different regions or markets.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    File: movie_data.csv

    | Column name          | Description |
    |:---------------------|:------------|
    | director_name        | Name of the director of the movie. (String) |
    | duration             | Length of the movie in minutes. (Integer) |
    | actor_2_name         | Name of the second actor in the movie. (String) |
    | genres               | Genre of the movie. (String) |
    | actor_1_name         | Name of the first actor in the movie. (String) |
    | movie_title          | Title of the movie. (String) |
    | num_voted_users      | Number of users who voted for the movie. (Integer) |
    | actor_3_name         | Name of the third actor in the movie. (String) |
    | movie_imdb_link      | Link to the movie's IMDB page. (String) |
    | num_user_for_reviews |...

  13. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Skadi Loist; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
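
The relationship between the long and wide tables described above (one row per film, keeping the first sample festival where it appeared) can be illustrated with the sketch below. It is not the authors' procedure, which used R scripts. The function name and the keys `film_id` and `date` are hypothetical stand-ins for the real variable names documented in the codebook; only `fest` is named in the description. Dates are assumed to be ISO-style strings so that text comparison matches chronological order.

```python
def first_festival_per_film(long_rows):
    """Collapse long-format rows to one row per film, keeping the earliest festival.

    Assumption: each row is a dict with hypothetical keys "film_id" and "date"
    (an ISO-style string such as "2013-02") plus the festival name in "fest".
    """
    wide = {}
    for row in long_rows:
        film_id = row["film_id"]
        # Keep the row with the earliest date seen so far for this film.
        if film_id not in wide or row["date"] < wide[film_id]["date"]:
            wide[film_id] = row
    return wide
```

With the example from the description, a film shown at Berlinale in February and at Frameline in June of the same year would end up with "Berlinale" as its `fest` value.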


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the web-scraping scripts. They were written using R version 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching used data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
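    To make the two title measures concrete, here is a minimal Python sketch of OSA distance and cosine similarity on character q-grams. It is not the authors' R code: the q-gram size of 2 and the function names are assumptions chosen for illustration, and the actual scripts may use different settings and thresholds.

```python
from collections import Counter
from math import sqrt


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and swaps of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def cosine_similarity(a: str, b: str, q: int = 2) -> float:
    """Cosine similarity between character q-gram count vectors (q is assumed)."""
    pa = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    pb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(sum(v * v for v in pb.values()))
    return dot / norm if norm else 0.0


print(osa_distance("the matrix", "the matirx"))
print(cosine_similarity("the matrix", "matrix, the"))
```

    The first call illustrates why OSA suits typos: a swapped pair of adjacent letters counts as a single edit. The second illustrates why a q-gram cosine measure tolerates reordered words, since most q-grams are shared regardless of position.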

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was assigned to one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check whether everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  14. IMDb Movie and Crew Data

    • kaggle.com
    Updated Jan 16, 2023
    Cite
    The Devastator (2023). IMDb Movie and Crew Data [Dataset]. https://www.kaggle.com/datasets/thedevastator/imdb-movie-and-crew-data
    Explore at:
    Croissant
    Dataset updated
    Jan 16, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    IMDb Movie and Crew Data

    Insights into Movie Performance and Crew Performance

    By mahesh [source]

    About this dataset

    This IMDb Movies dataset contains information about some of the most beloved and critically praised films of all time. It includes a variety of features, such as the movie's title, original title, year published, date released, genre, duration in minutes, country of origin, language spoken in the movie, director and writer credits, and the production company responsible for its creation and distribution. Additionally, we've included field descriptions for each actor involved, as well as crew members who had a role in its makeup or promotion. Along with these fields, we can also see detailed reviews from users and critics alike, providing a comprehensive basis for evaluating how different generations have rated the film over the years. Our selection even offers a description field giving viewers a peek into the plot line before watching, if desired! Finally, you can discover what kind of budget was appropriated to make the movie possible, along with its gross income both domestically and worldwide! So grab your popcorn and search within this dataset today to find out more about some classic cinematic favorites!

    How to use the dataset

    In order to use this dataset properly, it is important to become familiar with the columns that make up the data set. The columns include: title, original_title, year, date_published, genre, duration, country, language, director, writer, production_company, actors, description, avg_vote, votes, budget, usa_gross_income, metascore, reviews_from_users, reviews_from_critics.

    By studying the various columns in this dataset, you can discover trends in movies over time, such as genres gaining in popularity or budgets increasing or decreasing annually. Additionally, you can compare production companies or directors over time to see how their output has changed or whether they produce consistently well-regarded content. Finally, by looking at actors over time, you can track whether particular actors have experienced ups and downs in their careers, and see which actors have remained popular for extended periods thanks to larger bodies of work.

    With so many data points available, it is easy to come up with dozens of questions that this dataset could help answer about movies past, present, and future! Have fun exploring!

    Research Ideas

    • Identifying movie trends in different countries, such as genre preference and budget size.
    • Studying how aspects of the movie, such as actors, writers and crew, influence ratings and gross income.
    • Analysing reviews from critics and users to understand correlations between reviews and metascores or vote values

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    File: IMDb names.csv

    | Column name    | Description                                                                     |
    |:---------------|:--------------------------------------------------------------------------------|
    | title          | The title of the movie. (String)                                                |
    | original_title | The original title of the movie (in case it was changed in other languages)... |

  15. IMDB Movie Dataset Till Dec-2023

    • kaggle.com
    zip
    Updated Jan 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    kianindeed (2024). IMDB Movie Dataset Till Dec-2023 [Dataset]. https://www.kaggle.com/datasets/kianindeed/imdb-movie-dataset-dec-2023
    Explore at:
    Available download formats: zip (108933 bytes)
    Dataset updated
    Jan 8, 2024
    Authors
    kianindeed
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This CSV file contains 11 columns, namely: Moive Name, Rating, Votes, Meta Score, Genre, PG Rating, Year, Duration, Cast, Director. The data contains 1950 rows. It was extracted as part of the personal project MovieRecommendationSystem. You can fork the repository and create an even larger dataset. This file contains top IMDB movies, updated through 15 Dec 2023. Go through the readme.md to learn more about creating the data.

    GitHub

  16. IMDB Dataset of 50K Movie Reviews

    • kaggle.com
    zip
    Updated Jun 11, 2025
    + more versions
    Cite
    Ricson Ramos (2025). IMDB Dataset of 50K Movie Reviews [Dataset]. https://www.kaggle.com/datasets/ricsonramos/imdb-dataset-of-50k-movie-reviews
    Explore at:
    Available download formats: zip (26962657 bytes)
    Dataset updated
    Jun 11, 2025
    Authors
    Ricson Ramos
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Ricson Ramos

    Released under CC0: Public Domain

    Contents

  17. IMDB 5000 Movie Dataset

    • kaggle.com
    zip
    Updated Dec 16, 2017
    + more versions
    Cite
    Yueming (2017). IMDB 5000 Movie Dataset [Dataset]. https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset
    Explore at:
    Available download formats: zip (567524 bytes)
    Dataset updated
    Dec 16, 2017
    Authors
    Yueming
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Dataset

    This dataset was created by Yueming

    Released under Database: Open Database, Contents: Database Contents

    Contents

  18. Oscar nominated movies

    • data.niaid.nih.gov
    Updated Apr 22, 2023
    + more versions
    Cite
    Gloria María Manresa Santamaría (2023). Oscar nominated movies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7853440
    Explore at:
    Dataset updated
    Apr 22, 2023
    Dataset provided by
    Bárbara Valentina García Deus
    Authors
    Gloria María Manresa Santamaría
    License

    Attribution 2.0 (CC BY 2.0) (https://creativecommons.org/licenses/by/2.0/)
    License information was derived automatically

    Description

    The dataset shows the films nominated for the Oscar for Best Picture since 1929. The dataset format is a CSV file that collects information from each of the nominated films.

    For each film, information is extracted about the film itself, such as the title, year, length, director(s), description, genre, scriptwriters, and main cast. IMDb ratings have also been included in the dataset, such as the movie's score (“rating”) and number of reviews (“reviews”). Finally, the cover and the trailer of the film have been included, as well as whether or not the film won the Oscar for Best Picture.

  19. IMDB Dataset of 50k Movie Reviews

    • kaggle.com
    zip
    Updated May 19, 2025
    Cite
    Sam (2025). IMDB Dataset of 50k Movie Reviews [Dataset]. https://www.kaggle.com/datasets/lestiessam/imdb-dataset-of-50k-movie-reviews
    Explore at:
    Available download formats: zip (26962657 bytes)
    Dataset updated
    May 19, 2025
    Authors
    Sam
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Sam

    Released under Apache 2.0

    Contents

  20. IMDB Movies From 1920 to 2025

    • kaggle.com
    zip
    Updated Mar 27, 2025
    Cite
    Raed Addala (2025). IMDB Movies From 1920 to 2025 [Dataset]. https://www.kaggle.com/datasets/raedaddala/imdb-movies-from-1960-to-2023
    Explore at:
    Available download formats: zip (46688739 bytes)
    Dataset updated
    Mar 27, 2025
    Authors
    Raed Addala
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Over 60,000 Movies, 100+ Years of Data, and Rich Metadata!

    Links:

    For details about the scraping process, explore the complete code repository on GitHub.

    About the Dataset

    This dataset provides annual data for the most popular 500–600 movies per year from 1920 to 2025, extracted from IMDb. It includes over 60,000 movies, spanning more than 100 years of cinematic history. Each year’s data is divided into three CSV files for flexibility and ease of use:
    - imdb_movies_[year].csv: Basic movie details.
    - advanced_movies_details_[year].csv: Comprehensive metadata and financial details.
    - merged_movies_data_[year].csv: A unified dataset combining both files.

    File Descriptions

    1. imdb_movies_[year].csv

    Essential movie information, including:
    - Title: Movie title.
    - Description: Movie description.
    - meta_score: IMDB's meta score.
    - Movie Link: IMDb URL for the movie.
    - Year: Year of release.
    - Duration: Runtime (in minutes).
    - MPA: Motion Picture Association rating (e.g., PG, R).
    - Rating: IMDb rating (scale of 1–10).
    - Votes: Total user votes on IMDb.

    2. advanced_movies_details_[year].csv

    Detailed movie metadata:
    - Link: IMDb URL (for linking with other data).
    - budget: Production budget (in USD).
    - grossWorldWide: Global box office revenue.
    - gross_US_Canada: North American box office earnings.
    - opening_weekend_Gross: Opening weekend revenue.
    - directors: List of directors.
    - writers: List of writers.
    - stars: Main cast members.
    - genres: Movie genres.
    - countries_origin: Countries of production.
    - filming_locations: Primary filming locations.
    - production_companies: Associated production companies.
    - Languages: Languages spoken in the movie.
    - Award_information: Information about awards, nominations and wins.
    - release_date: Official release date.

    3. merged_movies_data_[year].csv

    A unified dataset combining all columns from the previous two files:
    - Basic Details: Title, Year, Rating, Votes.
    - Advanced Features: budget, grossWorldWide, directors, genres, and awards.

    Data Structure

    Template Columns:
    - imdb_movies_[year].csv:
      Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link

    - advanced_movies_details_[year].csv:
      link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages

    - merged_movies_data_[year].csv:
      Title, Year, Duration, MPA, Rating, Votes, meta_score, description, Movie Link, writers, directors, stars, budget, opening_weekend_Gross, grossWorldWide, gross_US_Canada, release_date, countries_origin, filming_locations, production_company, awards_content, genres, Languages
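
    As a rough illustration of how the merged file relates to the other two, the sketch below joins basic and advanced rows on the IMDb URL. It assumes the key columns are `Movie Link` and `link`, as in the template columns above; the sample rows, the left-join behaviour, and the function name are hypothetical, and the dataset author's actual merge code may differ.

```python
# Hypothetical rows standing in for imdb_movies_[year].csv and
# advanced_movies_details_[year].csv; URLs and values are placeholders.
basic_rows = [
    {"Title": "Example A", "Year": "1999", "Movie Link": "https://example.org/a"},
    {"Title": "Example B", "Year": "1999", "Movie Link": "https://example.org/b"},
]
advanced_rows = [
    {"link": "https://example.org/a", "budget": "1000", "genres": "Drama"},
]


def merge_on_link(basic, advanced):
    """Left join: keep every basic row and attach advanced columns when
    a row with the same IMDb URL exists."""
    by_link = {row["link"]: row for row in advanced}
    merged = []
    for row in basic:
        combined = dict(row)
        extra = by_link.get(row["Movie Link"], {})
        # Drop the duplicate key column; "Movie Link" already holds the URL.
        combined.update({k: v for k, v in extra.items() if k != "link"})
        merged.append(combined)
    return merged


merged_rows = merge_on_link(basic_rows, advanced_rows)
for row in merged_rows:
    print(row)
```

    Rows without a matching advanced record keep only their basic columns, which is one reason to check for missing values before modelling on the merged files.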

    Updates

    The dataset is updated annually in December to include the latest data.

    Applications

    This dataset is ideal for:
    - Trend Analysis: Explore changes in the movie industry over more than 100 years.
    - Predictive Modeling: Build models to forecast box office revenue, ratings, or awards.
    - Recommendation Systems: Use attributes like genres, cast, and ratings for personalized recommendations.
    - Comparative Analysis: Study differences across eras, genres, or regions.

    Dataset Features

    • Over 60,000 Movies: Detailed data from 1920 to 2025.
    • Rich Metadata: Financial, creative, and recognition-related attributes.
    • User-friendly: Modular files for tailored use or comprehensive merged files.
    • Consistency: Uniform structure enables seamless analysis.

    Notes

    • For issues, suggestions, or feature requests, please feel free to contact me: send me an email or open an issue on GitHub. Your input is highly appreciated.
HQ Data Profiler (2025). IMDB Dataset of 50K Movie Reviews - CLEANED [Dataset]. https://www.kaggle.com/datasets/hqdataprofiler/imdb-dataset-of-50k-movie-reviews-cleaned

IMDB Dataset of 50K Movie Reviews - CLEANED

Cleaned version of "IMDB Dataset of 50K Movie Reviews" dataset (CSV format)

Explore at:
Available download formats: zip (26469422 bytes)
Dataset updated
Nov 4, 2025
Authors
HQ Data Profiler
Description

The "IMDB Dataset of 50K Movie Reviews" dataset is a tabular dataset with listings for 50k reviews from IMDB. There are two fields: "review", containing the review text, and "sentiment", containing either the value "positive" or the value "negative".

Using HQ Data Profiler, data quality issues in the original dataset were identified and fixed, and this CLEANED version was prepared. HQ Data Profiler's comprehensive profile report showed that the original dataset contained 418 duplicated "review" values. All rows with duplicated review values were removed. The dataset was then balanced by randomly removing rows in the more populated sentiment category. Result: 24698 "positive" and 24698 "negative" reviews, with no duplicates.
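
The two cleaning steps can be sketched in plain Python as follows. This is an illustration, not HQ Data Profiler's implementation: it assumes that the first occurrence of each duplicated review is kept (the description does not say whether all copies were dropped), and the function name, seed, and sample rows are hypothetical.

```python
import random


def dedupe_and_balance(rows, seed=0):
    """Drop repeated "review" values, then downsample the larger
    sentiment class at random so both classes have equal size."""
    seen = set()
    unique = []
    for row in rows:
        # Assumption: keep the first occurrence of each review text.
        if row["review"] not in seen:
            seen.add(row["review"])
            unique.append(row)

    by_label = {"positive": [], "negative": []}
    for row in unique:
        by_label[row["sentiment"]].append(row)

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced


rows = [
    {"review": "great", "sentiment": "positive"},
    {"review": "great", "sentiment": "positive"},
    {"review": "fine", "sentiment": "positive"},
    {"review": "fun", "sentiment": "positive"},
    {"review": "dull", "sentiment": "negative"},
]
balanced = dedupe_and_balance(rows)
print(len(balanced))
```

Which rows survive the downsampling depends on the random seed, so a fixed seed is worth recording whenever the balanced set is used for model comparison.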

Original dataset link (uncleaned): https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset citation ( https://ai.stanford.edu/~amaas/data/sentiment/ ): @InProceedings{maas-EtAl:2011:ACL-HLT2011, author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, title = {Learning Word Vectors for Sentiment Analysis}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {142--150}, url = {http://www.aclweb.org/anthology/P11-1015} }
