https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
IMDB dataset having 50K movie reviews for natural language processing or Text analytics. IMDb (Internet Movie Database) is a website that provides information about movies, television shows, and other audiovisual works. It is one of the most comprehensive and widely used film and television resources available online. IMDb was launched in 1990 and is now owned by Amazon.com.
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in the “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, a fuzzy matching approach with two methods: “cosine” and “osa.” where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (original film from the core dataset and the suggested match from the IMDb website was categorized in the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match). The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
The dataset contains 9722 US Movies from 1972 to 2016 in IMDB. Movies are classified by different features like year, runtime, genre, rating, director, certificate, cast and total gross of the movie. The data was obtained scraping the official IMDB website.
https://www.imdb.com/list/ls057823854/?sort=alpha,asc&st_dt=&mode=detail&page=1
https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy
The global movie rating sites market is experiencing robust growth, driven by the increasing consumption of online streaming services and the rising demand for credible film reviews before purchasing tickets or subscribing. The market's expansion is fueled by several factors, including the proliferation of smartphones and internet access, making it easier for users to access rating platforms. Furthermore, the integration of social media features on many platforms fosters engagement and user-generated content, creating a dynamic and interactive ecosystem. The market is segmented by application (movie promotion, movie research, audience choice, and others) and by rating type (user-based, professional-based, and others). While precise market sizing data is unavailable, given the significant presence of established players like Rotten Tomatoes and IMDb, and considering the considerable global viewership of movies, we can estimate the 2025 market size to be approximately $2 billion. This estimation accounts for advertising revenue, premium subscriptions (where applicable), and potential data licensing to film studios and distributors. The projected CAGR suggests continued substantial growth throughout the forecast period (2025-2033), likely driven by technological advancements and the ever-growing global movie-watching audience. However, potential restraints include the risk of biased reviews and the increasing competition from new platforms and emerging technologies like AI-powered recommendation systems. The North American market currently holds a significant share due to the established presence of major players and a large movie-going audience. However, rapid growth is anticipated in the Asia-Pacific region, particularly in countries like India and China, fueled by the expansion of streaming platforms and increasing internet penetration. Europe, with its diverse film culture and established digital infrastructure, also represents a substantial market segment. Competitive pressures are intensifying, with existing players continually innovating to enhance user experiences, introduce new features, and attract and retain users in a crowded market. The market's future trajectory will be shaped by the strategic moves of key players, technological disruptions, and evolving consumer preferences regarding how they discover and choose movies to watch. Strategic partnerships and acquisitions could also play a significant role in shaping the market landscape in the coming years.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data sources are primary from three public databases: MovieLens, IMDb, and Brazilian National Cinema Agency. We also collected movie data and subtitles files using web scrapping and public API from six internet public sites: imdb.com, letterboxd.com, metacritic.com, rottentomatoes.com, subdl.com, and subscene.co.in. In addition, we used LLM Tool (Claude.Ai by Anthropic) to collect regional and ethnicity from movie’s director, screenwriter and main character.
https://ai.stanford.edu/~amaas/data/sentimenthttps://ai.stanford.edu/~amaas/data/sentiment
The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning.
The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset also contains an additional 50,000 unlabeled documents for unsupervised learning. See the README file contained in the release for more details.
The data is split into a train (25k reviews) and test (25k reviews) set. A preview file cannot be provided - please download the data directly from the data provider's website.
When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
Around 100k+ movies/tvshows scraped from the IMDB website. Contains 3 files. contentDataGenre contains the primary key and genres of the movie data in contendDataPrime. contentDataRegion contains the primary key and regions of the movie data in contendDataPrime. All region, genre, and contentdataprime data is taken directly from the IMDB website.
-1 represents a missing number value and null represents a missing string or date value.
https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy
The global movie rating sites market is experiencing robust growth, driven by the increasing popularity of streaming services, a surge in online movie consumption, and the growing reliance on user reviews and professional ratings to inform viewing decisions. The market, estimated at $2 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the continuous evolution of user interfaces and functionalities on these platforms enhances user experience, fostering engagement and loyalty. Secondly, strategic partnerships between rating sites and streaming platforms provide cross-promotional opportunities, expanding reach and user base. Thirdly, the rising demand for data-driven insights in the film industry is driving the adoption of professional rating services within the movie research and production segments. Competition among established players like Rotten Tomatoes and IMDb, alongside the emergence of niche platforms catering to specific film genres or demographics, is shaping the market landscape. However, the market faces certain restraints. Data security and privacy concerns regarding user information are a major challenge. Maintaining the accuracy and integrity of ratings to avoid manipulation or biased reviews is also crucial for sustaining user trust. Furthermore, the market's growth is susceptible to fluctuations in the film industry itself, including production delays, changes in consumer preferences, and the impact of external economic factors. The market is segmented by application (movie promotion, movie research, audience choice, others) and type (user ratings, professional ratings, others), providing opportunities for specialized platforms to emerge and cater to specific niche needs. Geographic expansion, especially in rapidly developing markets in Asia Pacific, presents significant potential for future growth. The North American market currently holds a substantial share due to the established presence of key players and high online movie consumption.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected movie dataset from Internet Movie Database (IMDB) website for our experiments using an IMDbPy script to extract all the movie metadata. We obtained the box office revenues from The Movies Dataset, Box-office Mojo and The Movie Database (TMDB).These databases predominantly consisted of movies from 2006 to 2020 in various countries, and we also collected movie posters. We also used the Open Images dataset V6 for object detection of movie posters.
The dataset contains information about all oscar nominated english movies from 1999 to 2023 from the IMDb site. The data has been collected through web scraping with python from the 'https://www.imdb.com/search/title/' webpage.
Dataset Details: -Source: IMDb site -Collection Period: 2023-24 -Data Format: CSV -Data Size: 40kv
The dataset consists of the following columns:
movie: Name of movies year: In which year movie realeased Genre: Which type of genre earning: how much movie earns metascore: how many metascore that movie get
Data Usage: This dataset can be used for various purposes: Analyzing which genre movie nominted for oscar . -Their earning
With Covid in place,when we sit to pick movies to watch as a family, we end up browsing for about 40 + minutes pondering through what movies to watch as a family with kids. THen I realised why not access the database of movies, use my knowledge in R to bring out something useful for folks so that they can use this link to pick their favourite movies to watch per the genre.
I have downloaded this data set from "https://www.imdb.com/interfaces/" -This link is linked here and will be updated/refreshed weekly.
Thanks to imdb website folks for making this data public
Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.
With this data, I want to bring out answers to common questions
1) WHat movies can I watch as a family under science fiction , horror , doggy movies or christmas movies ? 2) AS I analyse the data, I would want to ultimately make a shiny App page showcasing this for folks to use and benefit.
For all the movie buffs out there, IMDb is the place to go for all the movie-related data.
The dataset contains information of movie titles and their details from 1888 to 2023.
The dataset is from IMDb website and is uploaded here for learning purposes itself.
The inspiration is to get deep insights from the data.
Movies data are available on websites such as IMDb with average votes, vote numbers, reviews and descriptions. While IMDb is the most trustworthy source for data, other websites as FilmTV.it can provide the information on how users from different countries rate the movies compared to each other. The dataset is 0.11 GB large.
Each row represents a movie available on FilmTV.it, with the original title, year, genre, duration, country, director, actors, average vote and votes.
The file in the English version contains 37,711 movies and 19 attributes, while the Italian version contains one extra-attribute for the local title used when the movie was published in Italy.
The data set includes movies from: 1897 – 2023. Data has been scraped from the publicly available website https://www.filmtv.it as of 2023-10-21.
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
I did web scrapping from imdb site to prepare this dataset.
What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.
We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.
Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Context These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.
This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.
Content This dataset consists of the following files:
movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.
links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed here
Acknowledgements This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.
The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available here
Inspiration This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.
Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems
Some of the things you can do with this dataset: Predicting movie revenue and/or movie success based on a certain metric. What movies tend to get higher vote counts and vote averages on TMDB? Building Content Based and Collaborative Filtering Based Recommendation Engines.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains 9750 Featured film from IMDB. with 10 attributes : - Film Title - Year Release - Motion Picture Association Rating - Film duration in minutes - IMDB users rating - Metascore rating - List of Main, secondary, and third genre - Net Gross of the film in million-dollar
Data has been scraped from the publicly available website https://www.imdb.com. For further use, this data scraped on December 9, 2020. Some information about the film is missing on the IMDB so it describes being NA.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Movies always interested me to a great extend.. So i decided to create my own movies Dataset...
Well the dataset kind of looks like movie -> Names year -> Release Year imdb -> imdb_ratings metascore -> metascores votes -> Public Votes
I have collected this one using web scrapping...
Used python to scrape the website.. So thanks to the developers to make it so easy to do so...
Build an Exciting Movie Recommendation Engine For all .....
The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing. The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as v2.0. The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the polarity dataset.
Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset. - A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.
The data has been cleaned up somewhat, for example: - The dataset is comprised of only English reviews. - All text has been converted to lowercase. - There is white space around punctuation like periods, commas, and brackets. - Text has been split into one sentence per line.
The data has been used for a few related natural language processing tasks. For classification, the performance of classical models (such as Support Vector Machines) on the data is in the range of high 70% to low 80% (e.g. 78%-to-82%). More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation. This gives us a ballpark of low-to-mid 80s if we were looking to use this dataset in experiments on modern methods.
... depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%) - A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.
After unzipping the file, you will have a directory called txt sentoken with two sub- directories containing the text neg and pos for negative and positive reviews. Reviews are stored one per file with a naming convention from cv000 to cv999 for each of neg and pos. Next, let’s look at loading the text data.
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.