28 datasets found

Data from: imdb
huggingface.co
Updated Aug 3, 2003
+ more versions
Cite
Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 3, 2003
Dataset authored and provided by
Stanford NLP
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for "imdb"

Dataset Summary

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
Data from: IMDB Dataset
kaggle.com
Updated Dec 26, 2022
Cite
Heemali Chaudhari (2022). IMDB Dataset [Dataset]. https://www.kaggle.com/datasets/heemalichaudhari/imdb-dataset/code
Dataset updated
Dec 26, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Heemali Chaudhari
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
IMDB dataset of 50K movie reviews for natural language processing or text analytics. IMDb (Internet Movie Database) is a website that provides information about movies, television shows, and other audiovisual works. It is one of the most comprehensive and widely used film and television resources available online. IMDb was launched in 1990 and is now owned by Amazon.com.
Data from: imdb
huggingface.co
Updated May 10, 2025
+ more versions
Cite
scikit-learn (2025). imdb [Dataset]. https://huggingface.co/datasets/scikit-learn/imdb
Dataset updated
May 10, 2025
Dataset authored and provided by
scikit-learn
License
https://choosealicense.com/licenses/other/
Description
This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.
Film Circulation dataset
zenodo.org
data.niaid.nih.gov
bin, csv, png
Updated Jul 12, 2024
Cite
Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
Available download formats: csv, png, bin
Unique identifier
https://doi.org/10.5281/zenodo.7887672
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Skadi Loist; Evgenia (Zhenya) Samoilova
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
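
The reduction from the long table to the wide table described above can be sketched as follows. This is a minimal Python illustration, not the authors' code (their scripts are written in R); the field names `film_id`, `year`, and `month` and the sample rows are assumptions made for the example, and only `fest` is a variable name taken from the description.

```python
# Hypothetical long-format rows: one row per (film, sample festival) appearance.
long_rows = [
    {"film_id": 1, "fest": "Frameline", "year": 2013, "month": 6},
    {"film_id": 1, "fest": "Berlinale", "year": 2013, "month": 2},
    {"film_id": 2, "fest": "Frameline", "year": 2013, "month": 6},
]


def to_wide(rows):
    """Keep one row per film: its earliest sample festival appearance."""
    first = {}
    for row in rows:
        key = (row["year"], row["month"])
        best = first.get(row["film_id"])
        if best is None or key < (best["year"], best["month"]):
            first[row["film_id"]] = row
    # Return the films in a stable order, sorted by their unique ID.
    return [first[film_id] for film_id in sorted(first)]


wide_rows = to_wide(long_rows)
for row in wide_rows:
    print(row["film_id"], row["fest"])
```

In this sketch, film 1 is listed under "Berlinale" because its February appearance precedes its June appearance at Frameline, mirroring the example in the description.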

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL on the IMDb website for each film from the core dataset. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
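
To make the two title-matching measures concrete, here is a minimal Python sketch. It is not the authors' R code: the function names, the choice of character bigrams for the cosine measure, and the lowercasing step are assumptions made for illustration. Only the two methods themselves and the one-year tolerance on the production year come from the description.

```python
from collections import Counter
from math import sqrt


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def cosine_similarity(a: str, b: str, q: int = 2) -> float:
    """Cosine similarity between character q-gram count vectors (q is assumed)."""
    def grams(s: str) -> Counter:
        return Counter(s[i:i + q] for i in range(len(s) - q + 1))
    ga, gb = grams(a), grams(b)
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = sqrt(sum(v * v for v in ga.values()) * sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0


def year_compatible(core_year: int, imdb_year: int) -> bool:
    """Production years may differ by at most one year."""
    return abs(core_year - imdb_year) <= 1


# A title with two adjacent letters swapped is one OSA edit away.
print(osa_distance("amelie", "ameile"))
print(cosine_similarity("amelie".lower(), "Amelie".lower()))
print(year_compatible(2001, 2002))
```

The two measures complement each other: the q-gram cosine score stays high when titles share most of their character sequences, while the OSA distance stays small when a title differs only by a typo such as swapped letters.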

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
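
The second-pass idea behind this script can be sketched in Python as follows. This is only an illustration under stated assumptions, not a port of the R script: `retry_skipped`, `fake_scrape`, the placeholder IDs, and the use of `None` to mark unscraped films are all hypothetical.

```python
def retry_skipped(film_ids, scrape, results):
    """Call `scrape` once more for each film without data.

    Returns the IDs that are still missing after the second attempt.
    """
    still_missing = []
    for film_id in film_ids:
        if results.get(film_id) is not None:
            continue  # already scraped successfully in the first pass
        try:
            data = scrape(film_id)
        except OSError:  # covers ConnectionError and similar I/O failures
            data = None
        if data is None:
            still_missing.append(film_id)
        else:
            results[film_id] = data
    return still_missing


def fake_scrape(film_id):
    """Hypothetical stand-in for a real scraping function."""
    if film_id == "tt0000003":
        raise ConnectionError("simulated network failure")
    return {"id": film_id, "title": "placeholder"}


results = {
    "tt0000001": {"id": "tt0000001", "title": "placeholder"},
    "tt0000002": None,  # skipped in the first pass
    "tt0000003": None,  # skipped in the first pass
}
still_missing = retry_skipped(list(results), fake_scrape, results)
print(still_missing)
```

Films that fail again are returned rather than silently dropped, so they can be logged and checked by hand.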

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
All US Movies IMDB from 1972 to 2016
kaggle.com
Updated Dec 8, 2019
Cite
Data Cat (2019). All US Movies IMDB from 1972 to 2016 [Dataset]. https://www.kaggle.com/datacat0/all-us-movies-imdb-from-1972-to-2016/discussion
Dataset updated
Dec 8, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Data Cat
Description
Content

The dataset contains 9722 US movies from 1972 to 2016 on IMDB. Movies are described by features such as year, runtime, genre, rating, director, certificate, cast and total gross of the movie. The data was obtained by scraping the official IMDB website.

https://www.imdb.com/list/ls057823854/?sort=alpha,asc&st_dt=&mode=detail&page=1

Considerations

Columns "genre" and "cast" are split by "|".

The title of the movie is in Portuguese; the rest is in English.
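
A short Python sketch of reading such a file and splitting the pipe-delimited columns. The column names "genre" and "cast" come from the description above; the "title" column, the sample row, and the `read_movies` helper are assumptions made for illustration.

```python
import csv
import io

# Hypothetical two-line sample standing in for the real CSV file.
sample = io.StringIO(
    "title,genre,cast\n"
    "Filme Exemplo,Drama|Romance,Ator Um|Atriz Dois\n"
)


def read_movies(handle):
    """Read rows and turn the '|'-separated columns into Python lists."""
    movies = []
    for row in csv.DictReader(handle):
        for column in ("genre", "cast"):
            # An empty cell becomes an empty list rather than [""].
            row[column] = row[column].split("|") if row[column] else []
        movies.append(row)
    return movies


movies = read_movies(sample)
print(movies[0]["genre"], movies[0]["cast"])
```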
Movie Rating Sites Report
marketreportanalytics.com
doc, pdf, ppt
Updated Apr 10, 2025
+ more versions
Cite
Market Report Analytics (2025). Movie Rating Sites Report [Dataset]. https://www.marketreportanalytics.com/reports/movie-rating-sites-75765
Available download formats: doc, ppt, pdf
Dataset updated
Apr 10, 2025
Dataset authored and provided by
Market Report Analytics
License
https://www.marketreportanalytics.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global movie rating sites market is experiencing robust growth, driven by the increasing consumption of online streaming services and the rising demand for credible film reviews before purchasing tickets or subscribing. The market's expansion is fueled by several factors, including the proliferation of smartphones and internet access, making it easier for users to access rating platforms. Furthermore, the integration of social media features on many platforms fosters engagement and user-generated content, creating a dynamic and interactive ecosystem. The market is segmented by application (movie promotion, movie research, audience choice, and others) and by rating type (user-based, professional-based, and others). While precise market sizing data is unavailable, given the significant presence of established players like Rotten Tomatoes and IMDb, and considering the considerable global viewership of movies, we can estimate the 2025 market size to be approximately $2 billion. This estimation accounts for advertising revenue, premium subscriptions (where applicable), and potential data licensing to film studios and distributors. The projected CAGR suggests continued substantial growth throughout the forecast period (2025-2033), likely driven by technological advancements and the ever-growing global movie-watching audience. However, potential restraints include the risk of biased reviews and the increasing competition from new platforms and emerging technologies like AI-powered recommendation systems. The North American market currently holds a significant share due to the established presence of major players and a large movie-going audience. However, rapid growth is anticipated in the Asia-Pacific region, particularly in countries like India and China, fueled by the expansion of streaming platforms and increasing internet penetration. Europe, with its diverse film culture and established digital infrastructure, also represents a substantial market segment. 
Competitive pressures are intensifying, with existing players continually innovating to enhance user experiences, introduce new features, and attract and retain users in a crowded market. The market's future trajectory will be shaped by the strategic moves of key players, technological disruptions, and evolving consumer preferences regarding how they discover and choose movies to watch. Strategic partnerships and acquisitions could also play a significant role in shaping the market landscape in the coming years.
Data on regional, ethnicity, and minorities representation in movies
data.mendeley.com
Updated Feb 20, 2025
Cite
FERNANDO TAMBERLINI ALVES (2025). Data on regional, ethnicity, and minorities representation in movies [Dataset]. http://doi.org/10.17632/kzv2m4hsvw.1
Unique identifier
https://doi.org/10.17632/kzv2m4hsvw.1
Dataset updated
Feb 20, 2025
Authors
FERNANDO TAMBERLINI ALVES
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data come primarily from three public databases: MovieLens, IMDb, and the Brazilian National Cinema Agency. We also collected movie data and subtitle files using web scraping and public APIs from six public websites: imdb.com, letterboxd.com, metacritic.com, rottentomatoes.com, subdl.com, and subscene.co.in. In addition, we used an LLM tool (Claude.ai by Anthropic) to collect region and ethnicity information for each movie’s director, screenwriter and main character.
IMDb Movie Reviews Dataset
berd-platform.de
bin
Updated Jul 31, 2025
Cite
Andrew L. Maas; Raymond E. Daly; Peter T. Pham; Dan Huang; Andrew Y. Ng; Christopher Potts (2025). IMDb Movie Reviews Dataset [Dataset]. http://doi.org/10.82939/z8gxk-w3567
Available download formats: bin
Unique identifier
https://doi.org/10.82939/z8gxk-w3567
Dataset updated
Jul 31, 2025
Dataset provided by
Stanford University
Authors
Andrew L. Maas; Raymond E. Daly; Peter T. Pham; Dan Huang; Andrew Y. Ng; Christopher Potts
License
https://ai.stanford.edu/~amaas/data/sentiment
Description
The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning.
The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. See the README file contained in the release for more details.
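The labeling thresholds and the per-movie cap stated above can be expressed as a small Python sketch. This is an illustration of the stated rules only, not the providers' actual selection procedure: the function names and the input tuple layout are assumptions, and the description does not say which reviews are kept when a movie has more than 30, so this sketch simply keeps the first ones encountered.

```python
from collections import defaultdict


def label(score):
    """Map a 1-10 score to a sentiment label; neutral scores get None."""
    if score <= 4:
        return "neg"
    if score >= 7:
        return "pos"
    return None  # scores of 5 and 6 are excluded as not polar enough


def select_reviews(reviews, max_per_movie=30):
    """Keep polar reviews only, with at most `max_per_movie` per movie.

    `reviews` is an iterable of (movie_id, score, text) tuples.
    """
    kept = []
    per_movie = defaultdict(int)
    for movie_id, score, text in reviews:
        sentiment = label(score)
        if sentiment is None or per_movie[movie_id] >= max_per_movie:
            continue
        per_movie[movie_id] += 1
        kept.append((movie_id, sentiment, text))
    return kept


demo = [("m1", 9, "great"), ("m1", 5, "meh"), ("m1", 2, "awful"), ("m2", 7, "good")]
print(select_reviews(demo))
```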
The data is split into a train (25k reviews) and test (25k reviews) set. A preview file cannot be provided - please download the data directly from the data provider's website.
When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
IMDB 100,000+ Movies/TvShows
kaggle.com
Updated Apr 2, 2023
Cite
Kurt Nakasato (2023). IMDB 100,000+ Movies/TvShows [Dataset]. https://www.kaggle.com/datasets/kurtnakasato/imdb-100000-moviestvshows/data
Dataset updated
Apr 2, 2023
Dataset provided by
Kaggle
Authors
Kurt Nakasato
Description
Around 100k+ movies/tvshows scraped from the IMDB website. Contains 3 files. contentDataGenre contains the primary key and genres of the movie data in contendDataPrime. contentDataRegion contains the primary key and regions of the movie data in contendDataPrime. All region, genre, and contentdataprime data is taken directly from the IMDB website.

-1 represents a missing number value and null represents a missing string or date value.
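
When loading these files, the two sentinels can be converted to a proper missing value. The Python sketch below assumes that the missing marker for strings and dates appears as the literal text "null" in the file, which the description does not confirm; the helper names are hypothetical.

```python
def clean_number(value: str):
    """Return None for the -1 sentinel, otherwise the parsed number."""
    number = float(value)
    return None if number == -1 else number


def clean_text(value: str):
    """Return None for the (assumed) literal 'null' marker, otherwise the text."""
    return None if value.strip().lower() == "null" else value


print(clean_number("-1"), clean_number("7.5"))
print(clean_text("null"), clean_text("Drama"))
```

Converting the sentinels early avoids treating -1 as a real value in averages or other numeric summaries.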
Movie Rating Sites Report
marketreportanalytics.com
doc, pdf, ppt
Updated Apr 10, 2025
+ more versions
Cite
Market Report Analytics (2025). Movie Rating Sites Report [Dataset]. https://www.marketreportanalytics.com/reports/movie-rating-sites-75768
Available download formats: ppt, pdf, doc
Dataset updated
Apr 10, 2025
Dataset authored and provided by
Market Report Analytics
License
https://www.marketreportanalytics.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global movie rating sites market is experiencing robust growth, driven by the increasing popularity of streaming services, a surge in online movie consumption, and the growing reliance on user reviews and professional ratings to inform viewing decisions. The market, estimated at $2 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the continuous evolution of user interfaces and functionalities on these platforms enhances user experience, fostering engagement and loyalty. Secondly, strategic partnerships between rating sites and streaming platforms provide cross-promotional opportunities, expanding reach and user base. Thirdly, the rising demand for data-driven insights in the film industry is driving the adoption of professional rating services within the movie research and production segments. Competition among established players like Rotten Tomatoes and IMDb, alongside the emergence of niche platforms catering to specific film genres or demographics, is shaping the market landscape. However, the market faces certain restraints. Data security and privacy concerns regarding user information are a major challenge. Maintaining the accuracy and integrity of ratings to avoid manipulation or biased reviews is also crucial for sustaining user trust. Furthermore, the market's growth is susceptible to fluctuations in the film industry itself, including production delays, changes in consumer preferences, and the impact of external economic factors. The market is segmented by application (movie promotion, movie research, audience choice, others) and type (user ratings, professional ratings, others), providing opportunities for specialized platforms to emerge and cater to specific niche needs. Geographic expansion, especially in rapidly developing markets in Asia Pacific, presents significant potential for future growth. 
The North American market currently holds a substantial share due to the established presence of key players and high online movie consumption.
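As a worked illustration of what the stated figures imply, the compound growth formula value × (1 + rate)^years can be applied to the $2 billion estimate for 2025 and the 15% CAGR. The Python sketch below is plain arithmetic on those two numbers; the resulting 2033 value is derived here and is not a figure quoted by the report, and the function name is an assumption.

```python
def project(value: float, rate: float, years: int) -> float:
    """Compound `value` at annual growth `rate` for `years` years."""
    return value * (1 + rate) ** years


# $2 billion in 2025, growing at 15% per year until 2033 (8 years).
size_2033 = project(2.0, 0.15, 2033 - 2025)
print(f"Implied 2033 market size: ${size_2033:.2f} billion")
```

Under these assumptions the implied 2033 market size is roughly $6.1 billion, about three times the 2025 estimate.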
Movie Box Office Revenue Prediction
data.mendeley.com
Updated Oct 7, 2020
+ more versions
Cite
Canaan Madongo (2020). Movie Box Office Revenue Prediction [Dataset]. http://doi.org/10.17632/xv9wtc9gdk.2
Unique identifier
https://doi.org/10.17632/xv9wtc9gdk.2
Dataset updated
Oct 7, 2020
Authors
Canaan Madongo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We collected a movie dataset from the Internet Movie Database (IMDB) website for our experiments, using an IMDbPy script to extract all the movie metadata. We obtained the box office revenues from The Movies Dataset, Box Office Mojo and The Movie Database (TMDB). These databases predominantly consisted of movies from 2006 to 2020 in various countries, and we also collected movie posters. We also used the Open Images dataset V6 for object detection of movie posters.
Movies IMDb Oscar nominated dataset
kaggle.com
Updated Aug 23, 2023
Cite
The citation is currently not available for this dataset.
Dataset updated
Aug 23, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ritik Chauhan
Description
The dataset contains information about all Oscar-nominated English movies from 1999 to 2023 from the IMDb site. The data was collected through web scraping with Python from the 'https://www.imdb.com/search/title/' webpage.

Dataset Details:
-Source: IMDb site
-Collection Period: 2023-24
-Data Format: CSV
-Data Size: 40kv

The dataset consists of the following columns:

movie: Name of the movie
year: Year in which the movie was released
Genre: Genre of the movie
earning: How much the movie earned
metascore: Metascore the movie received

Data Usage: This dataset can be used for various purposes:
-Analyzing which genres of movies are nominated for Oscars
-Their earnings
2021 -What movies to watch today?
kaggle.com
zip
Updated Sep 28, 2021
Cite
Gayathri Nagarajan (2021). 2021 -What movies to watch today? [Dataset]. https://www.kaggle.com/gayathrirprog/2021-what-movies-to-watch-today
Available download formats: zip (147839676 bytes)
Dataset updated
Sep 28, 2021
Authors
Gayathri Nagarajan
Description
Context

With Covid in place, when we sit down to pick a movie to watch as a family, we end up browsing for 40+ minutes, pondering what to watch as a family with kids. Then I realised: why not access the database of movies and use my knowledge of R to produce something useful, so that folks can use this link to pick their favourite movies to watch by genre?

Content

I have downloaded this data set from "https://www.imdb.com/interfaces/" -This link is linked here and will be updated/refreshed weekly.

Acknowledgements

Thanks to imdb website folks for making this data public

Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.

Inspiration

With this data, I want to bring out answers to common questions

1) What movies can I watch as a family under science fiction, horror, doggy movies or Christmas movies?
2) As I analyse the data, I would ultimately want to make a Shiny app page showcasing this for folks to use and benefit from.
IMDb Dataset - From 1888 to 2023
kaggle.com
Updated Apr 5, 2021
Cite
The citation is currently not available for this dataset.
Dataset updated
Apr 5, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Komal Khetlani
Description
Context

For all the movie buffs out there, IMDb is the place to go for all the movie-related data.

Content

The dataset contains information of movie titles and their details from 1888 to 2023.

Acknowledgements

The dataset is from the IMDb website and is uploaded here for learning purposes only.

Inspiration

The inspiration is to get deep insights from the data.
FilmTV movies dataset
berd-platform.de
Updated Jul 31, 2025
Cite
Stefano Leone (2025). FilmTV movies dataset [Dataset]. http://doi.org/10.82939/3688y-24031
Unique identifier
https://doi.org/10.82939/3688y-24031
Dataset updated
Jul 31, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Stefano Leone
Description
Movie data are available on websites such as IMDb, with average votes, vote numbers, reviews and descriptions. While IMDb is the most trustworthy source for data, other websites such as FilmTV.it can provide information on how users from different countries rate the movies compared to each other. The dataset is 0.11 GB large.
Each row represents a movie available on FilmTV.it, with the original title, year, genre, duration, country, director, actors, average vote and votes.
The file in the English version contains 37,711 movies and 19 attributes, while the Italian version contains one extra-attribute for the local title used when the movie was published in Italy.
The data set includes movies from: 1897 – 2023. Data has been scraped from the publicly available website https://www.filmtv.it as of 2023-10-21.
Data from: Movie Ratings Dataset
kaggle.com
Updated Feb 16, 2019
Cite
Devesh Kumar Rai (2019). Movie Ratings Dataset [Dataset]. https://www.kaggle.com/datasets/raidevesh05/movie-ratings-dataset/code
Dataset updated
Feb 16, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Devesh Kumar Rai
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Context

I scraped the IMDb site to prepare this dataset.

Content

What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

Acknowledgements

We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration

Your data will be in front of the world's largest data science community. What questions do you want to see answered?
the_movies_dataset
kaggle.com
zip
Updated Jun 19, 2021
+ more versions
Cite
sezgin ildes (2021). the_movies_dataset [Dataset]. https://www.kaggle.com/sezginildes/the-movies-dataset
Available download formats: zip (15456686 bytes)
Dataset updated
Jun 19, 2021
Authors
sezgin ildes
Description
Context

These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

Content

This dataset consists of the following files:

movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
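
Since keywords.csv and credits.csv store their values as stringified objects, each cell has to be decoded before use. The sketch below is one way to do that with the standard library only. The column names `id` and `keywords`, the sample rows, and the helper `parse_stringified` are assumptions made for illustration, not details taken from the dataset. The fallback to `ast.literal_eval` is there in case the cells use Python-style single quotes rather than strict JSON.

```python
import ast
import csv
import io
import json


def parse_stringified(value):
    """Decode a stringified list/dict cell; return [] if it cannot be parsed."""
    if not value:
        return []
    try:
        return json.loads(value)        # strict JSON (double-quoted strings)
    except json.JSONDecodeError:
        pass
    try:
        return ast.literal_eval(value)  # Python-literal style (single quotes)
    except (ValueError, SyntaxError):
        return []


# Hypothetical two-row sample standing in for keywords.csv.
# The column names "id" and "keywords" are assumptions.
SAMPLE = io.StringIO(
    'id,keywords\n'
    '1,"[{""id"": 10, ""name"": ""jealousy""}, {""id"": 11, ""name"": ""toy""}]"\n'
    "2,\"[{'id': 12, 'name': 'board game'}]\"\n"
)

# Map each movie id to the plain list of its keyword names.
keywords_by_movie = {
    row["id"]: [kw["name"] for kw in parse_stringified(row["keywords"])]
    for row in csv.DictReader(SAMPLE)
}
print(keywords_by_movie)
```

With the real file, `SAMPLE` would be replaced by an open file handle; unparseable cells come back as empty lists rather than raising.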

The Full MovieLens Dataset consisting of 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset can be accessed here

Acknowledgements

This dataset is an ensemble of data collected from TMDB and GroupLens. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.

The Movie Links and Ratings have been obtained from the Official GroupLens website. The files are a part of the dataset available here

Inspiration

This dataset was assembled as part of my second Capstone Project for Springboard's Data Science Career Track. I wanted to perform an extensive EDA on Movie Data to narrate the history and the story of Cinema and use this metadata in combination with MovieLens ratings to build various types of Recommender Systems.

Both my notebooks are available as kernels with this dataset: The Story of Film and Movie Recommender Systems

Some of the things you can do with this dataset:
- Predicting movie revenue and/or movie success based on a certain metric.
- What movies tend to get higher vote counts and vote averages on TMDB?
- Building Content Based and Collaborative Filtering Based Recommendation Engines.
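
As a rough illustration of the content-based idea mentioned above, the sketch below ranks movies by the overlap of their plot-keyword sets using Jaccard similarity. This is a stand-in technique chosen for brevity, not the method used in the author's notebooks. The movie identifiers and keyword sets are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(target, keyword_sets, top_n=2):
    """Rank every other movie by keyword overlap with `target`, best first."""
    scores = [
        (other, jaccard(keyword_sets[target], kws))
        for other, kws in keyword_sets.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up keyword sets for three hypothetical movies.
KEYWORDS = {
    "movie_a": {"toy", "friendship", "rivalry"},
    "movie_b": {"toy", "friendship", "rescue"},
    "movie_c": {"heist", "rivalry"},
}
print(most_similar("movie_a", KEYWORDS))
```

A collaborative-filtering variant would compare users' rating vectors instead of keyword sets, but the ranking step has the same shape.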
IMDB Featured Film
kaggle.com
Updated Dec 9, 2020
Cite
Nicholas N. Sulaksana (2020). IMDB Featured Film [Dataset]. https://kaggle.com/nicholasnanda/imdb-featured-film
Dataset updated
Dec 9, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nicholas N. Sulaksana
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Content

The dataset contains 9,750 featured films from IMDb, with 10 attributes:
- Film title
- Release year
- Motion Picture Association rating
- Film duration in minutes
- IMDb user rating
- Metascore rating
- Main, secondary, and third genre
- Net gross of the film, in millions of dollars

Acknowledgements

The data was scraped from the publicly available website https://www.imdb.com on December 9, 2020. Some information about the films is missing on IMDb, so those values are recorded as NA.
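
Because missing values are recorded as NA, numeric columns need a conversion step that tolerates that marker. The sketch below shows one way to handle it with the standard library. The column names (`title`, `rating`, `metascore`, `gross`) and the sample rows are assumptions for illustration only; the real file's headers may differ.

```python
import csv
import io


def to_float(cell):
    """Convert a CSV cell to float, mapping the literal 'NA' (or empty) to None."""
    cell = cell.strip()
    return None if cell in ("", "NA") else float(cell)


# Hypothetical sample standing in for the real CSV; headers are assumed.
SAMPLE = io.StringIO(
    "title,rating,metascore,gross\n"
    "film_1,7.5,80,12.3\n"
    "film_2,6.1,NA,NA\n"
)

NUMERIC = ("rating", "metascore", "gross")
rows = [
    {"title": r["title"], **{col: to_float(r[col]) for col in NUMERIC}}
    for r in csv.DictReader(SAMPLE)
]

# Keep only the films for which a gross figure is available.
with_gross = [r for r in rows if r["gross"] is not None]
print(len(rows), len(with_gross))
```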
Latest IMDB
kaggle.com
Updated Aug 11, 2017
Cite
Aditya Soni (2017). Latest IMDB [Dataset]. https://www.kaggle.com/adityaecdrid/latest-imdb/metadata
Dataset updated
Aug 11, 2017
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aditya Soni
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Movies have always interested me a great deal, so I decided to create my own movie dataset.

Content

The dataset has the following columns:
- movie: movie name
- year: release year
- imdb: IMDb rating
- metascore: Metascore
- votes: public votes

I collected this data by web scraping.

Acknowledgements

I used Python to scrape the website, so thanks to the developers for making that so easy.

Inspiration

Build an exciting movie recommendation engine for everyone.
Movie Review Dataset
kaggle.com
Updated Nov 22, 2019
Cite
The citation is currently not available for this dataset.
Dataset updated
Nov 22, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Vipul Gandhi
Description
The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing. The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as v2.0. The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the polarity dataset.

Our data contains 1000 positive and 1000 negative reviews all written before 2002, with a cap of 20 reviews per author (312 authors total) per category. We refer to this corpus as the polarity dataset. - A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

The data has been cleaned up somewhat, for example:
- The dataset is comprised of only English reviews.
- All text has been converted to lowercase.
- There is white space around punctuation like periods, commas, and brackets.
- Text has been split into one sentence per line.
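
Given that cleanup, a plain whitespace split is enough to tokenize a review, and each line can be treated as one sentence. The sketch below assumes exactly that; the sample text is made up and only mimics the described format.

```python
def tokenize(review_text):
    """Split a pre-cleaned review into sentences (lines) of whitespace-separated tokens."""
    return [line.split() for line in review_text.splitlines() if line.strip()]


# Made-up text in the described format: lowercase, spaced punctuation,
# one sentence per line.
SAMPLE_REVIEW = (
    "this film is fun , if a little long .\n"
    "the ending ( no spoilers ) works .\n"
)

sentences = tokenize(SAMPLE_REVIEW)
vocabulary = sorted({tok for sent in sentences for tok in sent})
print(len(sentences), len(vocabulary))
```

Punctuation marks come out as separate tokens, so a later step can keep or drop them as needed.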

The data has been used for a few related natural language processing tasks. For classification, the performance of classical models (such as Support Vector Machines) on the data is in the range of high 70% to low 80% (e.g. 78%-to-82%). More sophisticated data preparation may see results as high as 86% with 10-fold cross-validation. This gives us a ballpark of low-to-mid 80s if we were looking to use this dataset in experiments on modern methods.
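
For readers unfamiliar with the 10-fold cross-validation mentioned above, the sketch below shows how the 2,000 reviews could be partitioned into ten disjoint test folds of 200, each paired with the remaining 1,800 as training data. The function name `k_fold_indices` and the fixed seed are illustrative choices, and the split is not stratified by class; it is not the protocol of any cited paper.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; the k test folds are disjoint and cover all items."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(2000, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

The reported accuracy would then be the mean of the ten per-fold test accuracies.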

... depending on choice of downstream polarity classifier, we can achieve highly statistically significant improvement (from 82.8% to 86.4%) - A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004.

After unzipping the file, you will have a directory called txt_sentoken with two subdirectories, neg and pos, containing the text of the negative and positive reviews. Reviews are stored one per file with a naming convention from cv000 to cv999 for each of neg and pos. Next, let’s look at loading the text data.
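
A minimal loading sketch for the directory layout described above might look like the following. The function name `load_polarity`, the `cv*` glob pattern, the label encoding (0 for neg, 1 for pos), and the UTF-8 decoding with replacement are all assumptions. The demo builds a tiny throwaway directory so the block runs without the real archive.

```python
import tempfile
from pathlib import Path


def load_polarity(root):
    """Return (texts, labels) read from root/neg and root/pos; 0 = neg, 1 = pos."""
    texts, labels = [], []
    for label, name in enumerate(("neg", "pos")):
        # Files are assumed to start with "cv", per the cv000..cv999 naming.
        for path in sorted(Path(root, name).glob("cv*")):
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(label)
    return texts, labels


# Demo on a temporary stand-in for txt_sentoken with one made-up file per class.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in (("neg", "dull ."), ("pos", "great .")):
        folder = Path(tmp, "txt_sentoken", name)
        folder.mkdir(parents=True)
        (folder / "cv000.txt").write_text(text, encoding="utf-8")
    texts, labels = load_polarity(Path(tmp, "txt_sentoken"))

print(len(texts), labels)
```

With the real data, `load_polarity("txt_sentoken")` would be expected to return 2,000 texts if the archive matches the description.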

Data from: imdb (stanfordnlp/imdb): 22 scholarly articles cite this dataset.
