28 datasets found
  1. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Dataset updated
    Aug 3, 2003
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
    
  2. Data from: IMDB Dataset

    • kaggle.com
    Updated Dec 26, 2022
    Cite
    Heemali Chaudhari (2022). IMDB Dataset [Dataset]. https://www.kaggle.com/datasets/heemalichaudhari/imdb-dataset/code
    Dataset updated
    Dec 26, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Heemali Chaudhari
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    An IMDB dataset of 50K movie reviews for natural language processing or text analytics. IMDb (Internet Movie Database) is a website that provides information about movies, television shows, and other audiovisual works. It is one of the most comprehensive and widely used film and television resources available online. IMDb was launched in 1990 and is now owned by Amazon.com.

  3. Data from: imdb

    • huggingface.co
    Updated May 10, 2025
    Cite
    scikit-learn (2025). imdb [Dataset]. https://huggingface.co/datasets/scikit-learn/imdb
    Dataset updated
    May 10, 2025
    Dataset authored and provided by
    scikit-learn
    License

    https://choosealicense.com/licenses/other/

    Description

    This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.

  4. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
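The relationship between the long and wide tables described above can be sketched in a few lines. This is only an illustration, not the authors' code: the tuple layout (film ID, festival, screening date) and the sample rows are hypothetical, and the real column names are defined in the codebook.

```python
from datetime import date

# Hypothetical long-format rows: (film_id, festival, screening_date).
# The same film may appear in several rows, one per sample festival.
long_rows = [
    ("F001", "Frameline", date(2013, 6, 20)),
    ("F001", "Berlinale", date(2013, 2, 8)),
    ("F002", "Frameline", date(2013, 6, 21)),
]


def to_wide(rows):
    """Keep one entry per film: the first sample festival it appeared in."""
    first = {}
    for film_id, festival, shown in rows:
        if film_id not in first or shown < first[film_id][1]:
            first[film_id] = (festival, shown)
    # Drop the date and keep only the festival name for each unique film.
    return {film_id: fest for film_id, (fest, _) in first.items()}


wide = to_wide(long_rows)
print(wide)
```

A film shown first at Berlinale and later at Frameline ends up with "Berlinale" as its sample festival, matching the rule stated for the wide file.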


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching used data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
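The two title-matching measures named above can be illustrated with a small, self-contained sketch. It is written in Python rather than R and is not the authors' script: the character-bigram profile used for cosine similarity, the thresholds `min_cosine` and `max_osa`, and the `is_candidate` helper are all assumptions made for illustration.

```python
import math
from collections import Counter


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: edit distance that also counts a swap of
    two adjacent characters as a single edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def cosine_similarity(a: str, b: str, q: int = 2) -> float:
    """Cosine similarity between character q-gram count profiles (assumed q=2)."""
    pa = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    pb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm = math.sqrt(sum(c * c for c in pa.values())) * math.sqrt(
        sum(c * c for c in pb.values())
    )
    return dot / norm if norm else 0.0


def is_candidate(core, suggestion, min_cosine=0.8, max_osa=2):
    """Hypothetical filter: year within +/- one, then either title measure."""
    if abs(core["year"] - suggestion["year"]) > 1:
        return False
    t1, t2 = core["title"].lower(), suggestion["title"].lower()
    return cosine_similarity(t1, t2) >= min_cosine or osa_distance(t1, t2) <= max_osa
```

Under these assumptions, a title with two adjacent letters swapped has an OSA distance of 1 and would still be proposed as a match, while a film whose year differs by more than one is rejected before any title comparison.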

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  5. All US Movies IMDB from 1972 to 2016

    • kaggle.com
    Updated Dec 8, 2019
    Cite
    Data Cat (2019). All US Movies IMDB from 1972 to 2016 [Dataset]. https://www.kaggle.com/datacat0/all-us-movies-imdb-from-1972-to-2016/discussion
    Dataset updated
    Dec 8, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Cat
    Description

    Content

    The dataset contains 9722 US movies listed on IMDB from 1972 to 2016. Movies are described by features such as year, runtime, genre, rating, director, certificate, cast and total gross. The data was obtained by scraping the official IMDB website.

    https://www.imdb.com/list/ls057823854/?sort=alpha,asc&st_dt=&mode=detail&page=1

    Considerations

    1. Values in the "genre" and "cast" columns are separated by "|".
    2. Movie titles are in Portuguese; the rest of the data is in English.
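
    A minimal sketch of reading such a file, assuming a comma-separated layout with "title", "genre" and "cast" columns. The sample row is invented for illustration and is not taken from the dataset.

```python
import csv
import io

# Invented sample standing in for the real CSV file.
sample = io.StringIO(
    "title,genre,cast\n"
    "Tubarão,Adventure|Thriller,Roy Scheider|Robert Shaw\n"
)

rows = []
for row in csv.DictReader(sample):
    # Turn the pipe-delimited strings into Python lists.
    for column in ("genre", "cast"):
        row[column] = row[column].split("|") if row[column] else []
    rows.append(row)

print(rows[0]["genre"])
```

    Empty cells become empty lists rather than a list holding one empty string, which keeps later counting by genre or cast member straightforward.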
  6. Movie Rating Sites Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 10, 2025
    Cite
    Market Report Analytics (2025). Movie Rating Sites Report [Dataset]. https://www.marketreportanalytics.com/reports/movie-rating-sites-75765
    Available download formats: doc, ppt, pdf
    Dataset updated
    Apr 10, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global movie rating sites market is experiencing robust growth, driven by the increasing consumption of online streaming services and the rising demand for credible film reviews before purchasing tickets or subscribing. The market's expansion is fueled by several factors, including the proliferation of smartphones and internet access, making it easier for users to access rating platforms. Furthermore, the integration of social media features on many platforms fosters engagement and user-generated content, creating a dynamic and interactive ecosystem. The market is segmented by application (movie promotion, movie research, audience choice, and others) and by rating type (user-based, professional-based, and others). While precise market sizing data is unavailable, given the significant presence of established players like Rotten Tomatoes and IMDb, and considering the considerable global viewership of movies, we can estimate the 2025 market size to be approximately $2 billion. This estimation accounts for advertising revenue, premium subscriptions (where applicable), and potential data licensing to film studios and distributors. The projected CAGR suggests continued substantial growth throughout the forecast period (2025-2033), likely driven by technological advancements and the ever-growing global movie-watching audience. However, potential restraints include the risk of biased reviews and the increasing competition from new platforms and emerging technologies like AI-powered recommendation systems. The North American market currently holds a significant share due to the established presence of major players and a large movie-going audience. However, rapid growth is anticipated in the Asia-Pacific region, particularly in countries like India and China, fueled by the expansion of streaming platforms and increasing internet penetration. Europe, with its diverse film culture and established digital infrastructure, also represents a substantial market segment. 
Competitive pressures are intensifying, with existing players continually innovating to enhance user experiences, introduce new features, and attract and retain users in a crowded market. The market's future trajectory will be shaped by the strategic moves of key players, technological disruptions, and evolving consumer preferences regarding how they discover and choose movies to watch. Strategic partnerships and acquisitions could also play a significant role in shaping the market landscape in the coming years.

  7. Data on regional, ethnicity, and minorities representation in movies

    • data.mendeley.com
    Updated Feb 20, 2025
    Cite
    FERNANDO TAMBERLINI ALVES (2025). Data on regional, ethnicity, and minorities representation in movies [Dataset]. http://doi.org/10.17632/kzv2m4hsvw.1
    Dataset updated
    Feb 20, 2025
    Authors
    FERNANDO TAMBERLINI ALVES
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data come primarily from three public databases: MovieLens, IMDb, and the Brazilian National Cinema Agency. We also collected movie data and subtitle files using web scraping and public APIs from six public websites: imdb.com, letterboxd.com, metacritic.com, rottentomatoes.com, subdl.com, and subscene.co.in. In addition, we used an LLM tool (Claude.ai by Anthropic) to collect the region and ethnicity of each movie’s director, screenwriter and main character.

  8. IMDb Movie Reviews Dataset

    • berd-platform.de
    bin
    Updated Jul 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andrew L. Maas; Raymond E. Daly; Peter T. Pham; Dan Huang; Andrew Y. Ng; Christopher Potts (2025). IMDb Movie Reviews Dataset [Dataset]. http://doi.org/10.82939/z8gxk-w3567
    Available download formats: bin
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Stanford University
    Authors
    Andrew L. Maas; Raymond E. Daly; Peter T. Pham; Dan Huang; Andrew Y. Ng; Christopher Potts
    License

    https://ai.stanford.edu/~amaas/data/sentiment

    Description

    The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning.

    The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset also contains an additional 50,000 unlabeled documents for unsupervised learning. See the README file contained in the release for more details.
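
    The selection rules in this paragraph can be expressed as a short sketch. It is an illustration of the stated thresholds only, not the dataset authors' pipeline; the `(movie_id, score, text)` tuple layout and both function names are hypothetical.

```python
from collections import defaultdict


def label_for(score):
    """Map a 1-10 rating to a sentiment label; mid-range scores get no label."""
    if score <= 4:
        return "neg"
    if score >= 7:
        return "pos"
    return None


def select_reviews(reviews, max_per_movie=30):
    """Keep only polar reviews, with at most `max_per_movie` per movie."""
    kept = []
    per_movie = defaultdict(int)
    for movie_id, score, text in reviews:
        label = label_for(score)
        if label is None or per_movie[movie_id] >= max_per_movie:
            continue
        per_movie[movie_id] += 1
        kept.append((text, label))
    return kept
```

    Reviews scored 5 or 6 are dropped entirely, which is what makes the remaining labels "highly polar".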

    The data is split into a train (25k reviews) and test (25k reviews) set. A preview file cannot be provided - please download the data directly from the data provider's website.

    When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

  9. IMDB 100,000+ Movies/TvShows

    • kaggle.com
    Updated Apr 2, 2023
    Cite
    Kurt Nakasato (2023). IMDB 100,000+ Movies/TvShows [Dataset]. https://www.kaggle.com/datasets/kurtnakasato/imdb-100000-moviestvshows/data
    Dataset updated
    Apr 2, 2023
    Dataset provided by
    Kaggle
    Authors
    Kurt Nakasato
    Description

    Around 100k+ movies/tvshows scraped from the IMDB website. Contains 3 files. contentDataGenre contains the primary key and genres of the movie data in contendDataPrime. contentDataRegion contains the primary key and regions of the movie data in contendDataPrime. All region, genre, and contentdataprime data is taken directly from the IMDB website.

    -1 represents a missing number value and null represents a missing string or date value.
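
    A small sketch of normalizing these sentinel values after reading a row as strings. The field names in the usage example are hypothetical; check the actual file headers before relying on them.

```python
def normalize(record, numeric_fields):
    """Replace the dataset's missing-value markers with None.

    -1 marks a missing number; the literal string "null" marks a
    missing string or date.
    """
    clean = {}
    for key, raw in record.items():
        if key in numeric_fields:
            number = float(raw)
            clean[key] = None if number == -1 else number
        else:
            clean[key] = None if raw == "null" else raw
    return clean


# Hypothetical row with one missing string and one missing number.
row = normalize(
    {"title": "null", "rating": "-1", "votes": "120"},
    numeric_fields={"rating", "votes"},
)
print(row)
```

    Converting the markers early avoids treating -1 as a real rating when computing averages.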

  10. Movie Rating Sites Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 10, 2025
    Cite
    Market Report Analytics (2025). Movie Rating Sites Report [Dataset]. https://www.marketreportanalytics.com/reports/movie-rating-sites-75768
    Available download formats: ppt, pdf, doc
    Dataset updated
    Apr 10, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global movie rating sites market is experiencing robust growth, driven by the increasing popularity of streaming services, a surge in online movie consumption, and the growing reliance on user reviews and professional ratings to inform viewing decisions. The market, estimated at $2 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the continuous evolution of user interfaces and functionalities on these platforms enhances user experience, fostering engagement and loyalty. Secondly, strategic partnerships between rating sites and streaming platforms provide cross-promotional opportunities, expanding reach and user base. Thirdly, the rising demand for data-driven insights in the film industry is driving the adoption of professional rating services within the movie research and production segments. Competition among established players like Rotten Tomatoes and IMDb, alongside the emergence of niche platforms catering to specific film genres or demographics, is shaping the market landscape. However, the market faces certain restraints. Data security and privacy concerns regarding user information are a major challenge. Maintaining the accuracy and integrity of ratings to avoid manipulation or biased reviews is also crucial for sustaining user trust. Furthermore, the market's growth is susceptible to fluctuations in the film industry itself, including production delays, changes in consumer preferences, and the impact of external economic factors. The market is segmented by application (movie promotion, movie research, audience choice, others) and type (user ratings, professional ratings, others), providing opportunities for specialized platforms to emerge and cater to specific niche needs. Geographic expansion, especially in rapidly developing markets in Asia Pacific, presents significant potential for future growth. 
The North American market currently holds a substantial share due to the established presence of key players and high online movie consumption.
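
    For readers who want to see what the quoted figures imply, compound annual growth can be worked out as below. This is plain arithmetic on the report's stated inputs ($2 billion in 2025, 15% CAGR through 2033); the function name is an illustrative choice, and the result is not a figure taken from the report.

```python
def project(value, rate, years):
    """Compound a starting value at a fixed annual growth rate."""
    return value * (1 + rate) ** years


# Inputs quoted in the description: $2 billion in 2025, 15% CAGR to 2033.
size_2033 = project(2.0, 0.15, 2033 - 2025)
print(f"{size_2033:.2f} billion USD")
```

    Eight years of 15% growth roughly triples the starting value, giving a little over $6 billion under these inputs.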

  11. Movie Box Office Revenue Prediction

    • data.mendeley.com
    Updated Oct 7, 2020
    Cite
    Canaan Madongo (2020). Movie Box Office Revenue Prediction [Dataset]. http://doi.org/10.17632/xv9wtc9gdk.2
    Dataset updated
    Oct 7, 2020
    Authors
    Canaan Madongo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We collected a movie dataset from the Internet Movie Database (IMDB) website for our experiments, using an IMDbPy script to extract all the movie metadata. We obtained the box office revenues from The Movies Dataset, Box-office Mojo and The Movie Database (TMDB). These databases predominantly consisted of movies from 2006 to 2020 in various countries, and we also collected movie posters. We also used the Open Images dataset V6 for object detection of movie posters.

  12. Movies IMDb Oscar nominated dataset

    • kaggle.com
    Updated Aug 23, 2023
    Cite
    The citation is currently not available for this dataset.
    Dataset updated
    Aug 23, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ritik Chauhan
    Description

    The dataset contains information about all Oscar-nominated English movies from 1999 to 2023 from the IMDb site. The data was collected through web scraping with Python from the 'https://www.imdb.com/search/title/' webpage.

    Dataset Details:
    - Source: IMDb site
    - Collection Period: 2023-24
    - Data Format: CSV
    - Data Size: 40kv

    The dataset consists of the following columns:

    - movie: name of the movie
    - year: year in which the movie was released
    - Genre: genre of the movie
    - earning: how much the movie earned
    - metascore: the Metascore the movie received

    Data Usage: This dataset can be used for various purposes:
    - Analyzing which genres of movies were nominated for Oscars
    - Their earnings

  13. 2021 -What movies to watch today?

    • kaggle.com
    zip
    Updated Sep 28, 2021
    Cite
    Gayathri Nagarajan (2021). 2021 -What movies to watch today? [Dataset]. https://www.kaggle.com/gayathrirprog/2021-what-movies-to-watch-today
    Available download formats: zip (147839676 bytes)
    Dataset updated
    Sep 28, 2021
    Authors
    Gayathri Nagarajan
    Description

    Context

    With Covid in place, when we sit down to pick a movie to watch as a family, we end up browsing for 40+ minutes, pondering what to watch as a family with kids. Then I realised: why not access the database of movies and use my knowledge of R to bring out something useful, so that folks can use this link to pick their favourite movies to watch by genre?

    Content

    I downloaded this data set from "https://www.imdb.com/interfaces/". The source is linked here, and the data set will be updated/refreshed weekly.

    Acknowledgements

    Thanks to imdb website folks for making this data public

    Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.

    Inspiration

    With this data, I want to bring out answers to common questions

    1) What movies can I watch as a family under science fiction, horror, doggy movies or Christmas movies?
    2) As I analyse the data, I ultimately want to make a Shiny app page showcasing this for folks to use and benefit from.

  14. IMDb Dataset - From 1888 to 2023

    • kaggle.com
    Updated Apr 5, 2021
    Cite
    The citation is currently not available for this dataset.
    Dataset updated
    Apr 5, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Komal Khetlani
    Description

    Context

    For all the movie buffs out there, IMDb is the place to go for all the movie-related data.

    Content

    The dataset contains information of movie titles and their details from 1888 to 2023.

    Acknowledgements

    The dataset is from the IMDb website and is uploaded here for learning purposes only.

    Inspiration

    The inspiration is to get deep insights from the data.

  15. FilmTV movies dataset

    • berd-platform.de
    Updated Jul 31, 2025
    Cite
    Stefano Leone (2025). FilmTV movies dataset [Dataset]. http://doi.org/10.82939/3688y-24031
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Stefano Leone
    Description

    Movie data are available on websites such as IMDb, with average votes, vote numbers, reviews and descriptions. While IMDb is the most trustworthy source for data, other websites such as FilmTV.it can show how users from different countries rate movies compared to each other. The dataset is 0.11 GB in size.

    Each row represents a movie available on FilmTV.it, with the original title, year, genre, duration, country, director, actors, average vote and votes.
    The file in the English version contains 37,711 movies and 19 attributes, while the Italian version contains one extra-attribute for the local title used when the movie was published in Italy.

    The data set includes movies from: 1897 – 2023. Data has been scraped from the publicly available website https://www.filmtv.it as of 2023-10-21.

  16. Data from: Movie Ratings Dataset

    • kaggle.com
    Updated Feb 16, 2019
    Cite
    Devesh Kumar Rai (2019). Movie Ratings Dataset [Dataset]. https://www.kaggle.com/datasets/raidevesh05/movie-ratings-dataset/code
    Explore at:
    Dataset updated
    Feb 16, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Devesh Kumar Rai
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Description

    Context

    I scraped the IMDb site to prepare this dataset.

  17. the_movies_dataset

    • kaggle.com
    zip
    Updated Jun 19, 2021
    + more versions
    Cite
    sezgin ildes (2021). the_movies_dataset [Dataset]. https://www.kaggle.com/sezginildes/the-movies-dataset
    Explore at:
    Available download formats: zip (15456686 bytes)
    Dataset updated
    Jun 19, 2021
    Authors
    sezgin ildes
    Description

    Context

    These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

    This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

    Content

    This dataset consists of the following files:

    movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

    keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

    credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

    links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

    links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

    ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
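
    The keywords.csv and credits.csv entries above describe columns that hold stringified objects, so each cell has to be parsed before use. The following is a minimal sketch, assuming a cell resembles the made-up sample string below; the helper name `parse_cell` is hypothetical and not part of the dataset. It tries strict JSON first and falls back to Python-literal syntax, on the assumption that some exports may use single quotes.

```python
import ast
import json


def parse_cell(cell):
    """Parse one stringified cell into Python objects (hypothetical helper)."""
    if not cell:
        return []
    try:
        # Strict JSON: double-quoted keys and strings.
        return json.loads(cell)
    except json.JSONDecodeError:
        # Fallback for Python-literal style text (single quotes).
        return ast.literal_eval(cell)


# Made-up sample value; real cells in the dataset may look different.
sample = '[{"id": 1, "name": "example keyword"}]'
names = [item["name"] for item in parse_cell(sample)]
```

    `ast.literal_eval` only evaluates literals, so it is a safer fallback than `eval` for text read from a file.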

    The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed here

    Acknowledgements

    This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.

    The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available here

    Inspiration

    This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.

    Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems

    Some of the things you can do with this dataset:

    - Predicting movie revenue and/or movie success based on a certain metric.
    - What movies tend to get higher vote counts and vote averages on TMDB?
    - Building content-based and collaborative-filtering-based recommendation engines.
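
    As an illustration of the collaborative-filtering idea mentioned above, here is a small item-to-item sketch using cosine similarity. The ratings dictionary is made-up toy data on the 1-5 scale, and the function names are hypothetical; this is not the dataset author's notebook code, only one simple way such a recommender might start.

```python
from math import sqrt

# Toy ratings on the 1-5 scale: user -> {movie: rating} (made-up data).
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5},
    "u3": {"A": 1, "C": 5},
}


def item_vector(movie):
    """Collect every user's rating for one movie."""
    return {user: r[movie] for user, r in ratings.items() if movie in r}


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def most_similar(movie):
    """Return the other movie whose rating pattern is closest."""
    others = {m for r in ratings.values() for m in r} - {movie}
    target = item_vector(movie)
    return max(sorted(others), key=lambda m: cosine(target, item_vector(m)))
```

    With the toy data, movies A and B are rated alike by the same users, so `most_similar("A")` picks B over C.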

  18. IMDB Featured Film

    • kaggle.com
    Updated Dec 9, 2020
    Cite
    Nicholas N. Sulaksana (2020). IMDB Featured Film [Dataset]. https://kaggle.com/nicholasnanda/imdb-featured-film
    Explore at:
    Dataset updated
    Dec 9, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nicholas N. Sulaksana
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    The dataset contains 9,750 featured films from IMDb, with 10 attributes:

    - Film title
    - Release year
    - Motion Picture Association rating
    - Film duration in minutes
    - IMDb user rating
    - Metascore rating
    - Main, secondary, and third genre
    - Net gross of the film in millions of dollars

    Acknowledgements

    Data has been scraped from the publicly available website https://www.imdb.com. For reference, the data was scraped on December 9, 2020. Some film information is missing on IMDb and is recorded as NA.
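
    Because missing values are stored as NA, numeric columns need that marker handled before any arithmetic. A minimal sketch follows; the column names and the two sample rows are invented for illustration, since the actual file header is not shown here.

```python
import csv
import io

# Made-up sample; the real file's column names and values may differ.
sample_csv = """title,year,metascore,gross_million
Film One,1999,73,171.5
Film Two,2005,NA,NA
"""


def to_number(value):
    """Return a float, or None when the field holds the NA marker."""
    return None if value == "NA" else float(value)


rows = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    row["metascore"] = to_number(row["metascore"])
    row["gross_million"] = to_number(row["gross_million"])
    rows.append(row)

# Average only over films that actually have a Metascore.
scores = [r["metascore"] for r in rows if r["metascore"] is not None]
mean_metascore = sum(scores) / len(scores)
```

    Keeping missing values as `None` rather than 0 avoids silently dragging averages down.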

  19. Latest IMDB

    • kaggle.com
    Updated Aug 11, 2017
    Cite
    Aditya Soni (2017). Latest IMDB [Dataset]. https://www.kaggle.com/adityaecdrid/latest-imdb/metadata
    Explore at:
    Dataset updated
    Aug 11, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aditya Soni
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Movies have always interested me to a great extent, so I decided to create my own movie dataset.

    Content

    The dataset has the following columns:

    - movie: names
    - year: release year
    - imdb: IMDb ratings
    - metascore: Metascores
    - votes: public votes

    I collected the data by web scraping.

    Acknowledgements

    I used Python to scrape the website, so thanks to the developers for making that so easy.

    Inspiration

    Build an exciting movie recommendation engine for all.

  20. Movie Review Dataset

    • kaggle.com
    Updated Nov 22, 2019
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Dataset updated
    Nov 22, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vipul Gandhi
    Description

    The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing. The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as v2.0. The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the polarity dataset.

    "Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset." - A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

    The data has been cleaned up somewhat, for example:

    - The dataset is comprised of only English reviews.
    - All text has been converted to lowercase.
    - There is white space around punctuation like periods, commas, and brackets.
    - Text has been split into one sentence per line.
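
    One consequence of the cleanup described above is that tokenization needs little more than whitespace splitting. A minimal sketch, assuming a review arrives as a single string with one sentence per line; the sample text and the function name `tokenize` are made up for illustration.

```python
def tokenize(review_text):
    """Split a cleaned review into sentences, each a list of tokens."""
    # One sentence per line; punctuation is already space-separated,
    # so str.split() yields words and punctuation as separate tokens.
    return [line.split() for line in review_text.splitlines() if line.strip()]


# Made-up sample in the cleaned format (lowercase, spaced punctuation).
sample = "the plot is thin , but the cast is great .\ni would watch it again ."
sentences = tokenize(sample)
vocabulary = sorted({tok for sent in sentences for tok in sent})
```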

    The data has been used for a few related natural language processing tasks. For classification, the performance of classical models (such as Support Vector Machines) on the data is in the high 70s to low 80s in percent (e.g., 78% to 82%). More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation. This gives a ballpark of the low-to-mid 80s for experiments with modern methods on this dataset.
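
    For readers unfamiliar with the 10-fold cross-validation mentioned above, the splitting step can be sketched as follows. This is a plain, unstratified split written for illustration; the function name and seed are assumptions, and the cited paper's exact fold assignment is not reproduced here.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, test_idx


# 2,000 reviews in total (1,000 positive and 1,000 negative).
splits = list(k_fold_indices(2000, k=10))
```

    Each review lands in exactly one test fold, so averaging the ten test scores uses every review once for evaluation.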

    "... depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%)" - A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

    After unzipping the file, you will have a directory called txt_sentoken with two subdirectories, neg and pos, containing the text of the negative and positive reviews. Reviews are stored one per file, with a naming convention from cv000 to cv999 within each of neg and pos. Next, let's look at loading the text data.
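
    A minimal loading sketch for the directory layout just described. It assumes the caller passes the path of the unzipped directory and that review file names start with "cv"; the function name and the 0/1 label encoding are choices made for this example only.

```python
from pathlib import Path


def load_polarity_reviews(root):
    """Read reviews from root/neg and root/pos; label 0 = neg, 1 = pos."""
    texts, labels = [], []
    for label, subdir in enumerate(("neg", "pos")):
        # Sort so the order is stable across runs and platforms.
        for path in sorted(Path(root, subdir).glob("cv*")):
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(label)
    return texts, labels
```

    Calling `load_polarity_reviews` on the unzipped directory should return two parallel lists, one entry per review file.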

Cite
Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb

Data from: imdb

IMDB

stanfordnlp/imdb

Related Article
Explore at:
22 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Aug 3, 2003
Dataset authored and provided by
Stanford NLP
License

https://choosealicense.com/licenses/other/

Description

Dataset Card for "imdb"

  Dataset Summary

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

  Supported Tasks and Leaderboards

More Information Needed

  Languages

More Information Needed

  Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.